Python is the most widely used language for Data Analysts because of its simple syntax, powerful libraries (Pandas, NumPy, Matplotlib, Scikit-learn), and flexibility across ETL, reporting, visualization, and modeling.
This guide consolidates basic Python interview questions, explains key data structures (list, tuple, set, dictionary, array), introduces Pandas essentials (joins, groupby, pivot), and ends with FAQs and cheat sheets to clear up confusion during interviews.
1. What is Python, and what are its key features?
Python is a high-level, interpreted, object-oriented language designed for readability and productivity.
Key Features
- ✅ Simple Syntax – Easy for beginners, efficient for experts.
- ✅ Interpreted – Executes code line by line, easier debugging.
- ✅ Dynamic Typing – No need to declare variable types explicitly.
- ✅ Extensive Libraries – Pandas, NumPy, Scikit-learn, Django.
- ✅ Cross-Platform – Works on Windows, macOS, Linux.
Data Analyst Connection
- Pandas → data cleaning and wrangling.
- NumPy → numerical computations.
- Matplotlib/Seaborn → data visualization.
- SQLAlchemy → database connectivity.
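To see how these pieces fit together, here is a minimal, illustrative sketch of a typical analyst workflow (the data values and column names are made up for the example):
```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sales data standing in for a real CSV or database load.
df = pd.DataFrame({"Region": ["North", "South", "North"],
                   "Sales": [500, 300, 450]})

# Pandas: aggregate sales by region.
totals = df.groupby("Region")["Sales"].sum()

# Matplotlib: visualize the result.
totals.plot(kind="bar", title="Sales by Region")
plt.show()
```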
2. Core Data Structures in Python
📌 Comparison Table
Feature | List | Tuple | Set | Dictionary | Array (NumPy) |
---|---|---|---|---|---|
Definition | Ordered, mutable collection | Ordered, immutable collection | Unordered, unique elements | Key-value mapping | Homogeneous, numeric arrays |
Syntax Example | `a = [1, 2, 3, 3]` | `t = (1, 2, 3, 3)` | `s = {1, 2, 3}` | `d = {"a": 1, "b": 2}` | `arr = np.array([1, 2, 3])` |
Order | ✅ Preserves order | ✅ Preserves order | ❌ Unordered | ✅ Preserves order (3.7+) | ✅ Preserves order |
Mutable | ✅ Yes: `a[0] = 10` | ❌ No (immutable) | ✅ Add/remove: `s.add(4)` | ✅ Yes: `d["c"] = 3` | ✅ Yes: `arr[0] = 10` |
Duplicates | ✅ Allowed: `[1, 1, 2]` | ✅ Allowed: `(1, 1, 2)` | ❌ Not allowed: `{1, 2}` | ❌ Keys unique, values can repeat: `{"a": 1, "b": 1}` | ✅ Allowed: `[1, 1, 2]` |
Indexing | ✅ By index: `a[1]` → 2 | ✅ By index: `t[1]` → 2 | ❌ Not available | ✅ By key: `d["a"]` → 1 | ✅ By index: `arr[1]` → 2 |
Methods | `.append(4)`, `.sort()` | `.count(2)`, `.index(3)` | `.union()`, `.intersection()` | `.keys()`, `.values()` | `.mean()`, `.reshape()` |
Performance | Moderate | Faster (immutable) | Very fast membership test | Fast lookups (hash table) | Extremely fast (vectorized) |
Best For | General-purpose collections | Fixed schema, constants | Unique IDs, categories | Mappings, JSON-style data | Numeric-heavy analytics |
Analyst Example | `[1200, 800, 450]` → raw sales | `(2023, "Q1", 5000)` → report key | `{1001, 1002, 1003}` → unique IDs | `{101: "Electronics"}` → product map | `[[100, 200], [300, 400]]` → sales matrix |
📌 Dimensionality
- List → 1D: `[1, 2, 3]`; 2D via nesting: `[[1, 2], [3, 4]]` (inefficient for math).
- Tuple → Similar to a list, but immutable.
- Set → Always 1D (no index).
- Dictionary → 1D mapping; nested dicts simulate 2D.
- NumPy Array → True 1D, 2D, 3D, nD (efficient for matrix ops).
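To see why nested lists are "inefficient for math," here is a short sketch contrasting a list-of-lists loop with the vectorized NumPy equivalent (values are illustrative):
```python
import numpy as np

# Nested lists need explicit loops for elementwise math...
matrix = [[1, 2], [3, 4]]
doubled = [[x * 2 for x in row] for row in matrix]   # [[2, 4], [6, 8]]

# ...while a NumPy array vectorizes the same operation in one step.
arr = np.array(matrix)
doubled_arr = arr * 2        # array([[2, 4], [6, 8]])
col_sums = arr.sum(axis=0)   # array([4, 6]) → per-column totals
```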
📌 Methods Cheat Sheet with 1D & 2D Examples
🔹 List
```python
# 1D List
a = [3, 1, 2]
a.append(4)   # [3, 1, 2, 4]
a.sort()      # [1, 2, 3, 4]

# 2D List (List of Lists)
matrix = [[1, 2], [3, 4], [5, 6]]
matrix[1][0]                 # 3
[row[1] for row in matrix]   # [2, 4, 6] → column extraction
```
✅ Use Case: Representing tabular data before loading into Pandas.
🔹 Tuple
```python
# 1D Tuple
t = (1, 2, 3, 2)
t.count(2)   # 2
t.index(3)   # 2

# Nested Tuple (Tuple of Tuples)
coords = ((10, 20), (30, 40))
coords[1][0]   # 30
```
✅ Use Case: Immutable coordinates, fixed reference mappings.
🔹 Set
```python
# 1D Set
s = {1, 2, 2, 3}          # duplicates collapse → {1, 2, 3}
s.add(4)                  # {1, 2, 3, 4}
s.intersection({2, 5})    # {2}

# Set of frozensets (sets are unhashable, so nest frozensets instead)
fs = {frozenset([1, 2]), frozenset([3, 4])}
```
✅ Use Case: Deduplication, unique groups, fast membership tests.
🔹 Dictionary
```python
# 1D Dictionary
d = {"Alice": 100, "Bob": 200}
d["Charlie"] = 300
list(d.keys())   # ['Alice', 'Bob', 'Charlie']

# Nested Dictionary
students = {
    "Alice": {"Math": 90, "Eng": 85},
    "Bob": {"Math": 78, "Eng": 92},
}
students["Alice"]["Math"]   # 90
```
✅ Use Case: Key-value storage, hierarchies like customer → orders.
🔹 NumPy Array
```python
import numpy as np

# 1D Array
arr = np.array([1, 2, 3, 4])
arr.mean()   # 2.5

# 2D Array (Matrix)
mat = np.array([[1, 2], [3, 4], [5, 6]])
mat.shape    # (3, 2)
mat[:, 1]    # array([2, 4, 6]) → column slice

# 3D Array
cube = np.arange(8).reshape(2, 2, 2)
```
✅ Use Case: Efficient math on 2D/3D data (images, stats, regression input).
🚀 Summary of “Why & When”
Structure | Best For | Pros | Cons | Analyst Approach |
---|---|---|---|---|
List | Sequential, tabular-like scratch data | Flexible, indexable | Duplicates, slower on large sets | Start raw data wrangling |
Tuple | Fixed records | Immutable, safe | Can’t update | Use for stable lookups (coordinates, IDs) |
Set | Unique values | Deduplication, fast membership | Unordered, no indexing | Find distinct customers/products |
Dict | Key → value | Fast, readable | Memory heavy | Hierarchies (customer → orders) |
NumPy | Math, ML input | Vectorized, memory-efficient | Rigid types | Statistical ops, transformations |
📌 Data Analyst Use Cases
Collection | Example Scenario |
---|---|
List | Store raw sales records before cleaning → `[1200, 800, 450]`. |
Tuple | Fixed `(Year, Quarter, Revenue)` values for reporting. |
Set | Deduplicate customer IDs quickly: `set(customer_list)`. |
Dictionary | Product → Category mapping → `{101: "Electronics"}`. |
Array | Perform matrix operations, e.g., sales by Region × Month. |
3. The `__init__()` Method in Python
The `__init__()` method is a constructor in OOP used to initialize objects.
```python
class SalesRecord:
    def __init__(self, date, amount, region):
        self.date = date
        self.amount = amount
        self.region = region

record = SalesRecord("2024-01-01", 500, "North")
print(record.region)   # North
```
Analyst Use Case – Create custom objects such as `Customer` or `Transaction` classes to structure datasets in ETL pipelines.
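As an illustrative sketch of that use case (reusing the `SalesRecord` class above), such objects can feed straight into a DataFrame:
```python
import pandas as pd

records = [SalesRecord("2024-01-01", 500, "North"),
           SalesRecord("2024-01-02", 300, "South")]

# vars(obj) returns the instance's attribute dict, giving one row per record.
df = pd.DataFrame([vars(r) for r in records])   # columns: date, amount, region
```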
4. Mutable vs Immutable
Type | Examples | Change Allowed? |
---|---|---|
Mutable | List, Dict, Set | ✅ Yes |
Immutable | Int, Float, String, Tuple | ❌ No |
Example
- Store transaction logs in lists/dicts (mutable).
- Store `(Year, Month, ProductID)` keys in tuples (immutable, fixed schema).
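A quick demonstration of the difference (values are illustrative):
```python
log = [("2024-01", 500)]
log.append(("2024-02", 300))   # ✅ mutable: the list grows in place

key = (2024, 1, "P-101")
try:
    key[0] = 2025              # ❌ immutable: tuples reject item assignment
except TypeError as err:
    print(err)                 # 'tuple' object does not support item assignment
```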
5. Comprehensions
List Comprehension
```python
squares = [i**2 for i in range(5)]   # [0, 1, 4, 9, 16]
```
Dictionary Comprehension
```python
sales = {region: revenue for region, revenue in [("A", 1000), ("B", 2000)]}
```
Generator Expression (there is no true "tuple comprehension"; parentheses create a generator)
```python
gen = (i for i in range(5))
list(gen)   # [0, 1, 2, 3, 4]
```
Analyst Example – Filter out negative sales:
```python
# assuming sales here is a list of numbers, e.g. [1200, -50, 800]
clean_sales = [x for x in sales if x > 0]
```
6. Global Interpreter Lock (GIL)
- Allows only one thread to execute Python bytecode at a time within a single process.
- ❌ Limitation → CPU-bound tasks (slow for heavy math).
- ✅ Works fine → I/O-bound tasks (API calls, file operations).
Analyst Example
- Web scraping multiple APIs → threading is effective.
- Heavy analytics models → prefer multiprocessing or NumPy vectorization.
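A minimal sketch of that threading-vs-multiprocessing distinction, using only the standard library (the URL and workload sizes are placeholders):
```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import urllib.request

def fetch(url):
    # I/O-bound: the GIL is released while waiting on the network.
    return urllib.request.urlopen(url).read()

def crunch(n):
    # CPU-bound: pure-Python math holds the GIL, so threads won't speed it up.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    urls = ["https://example.com"] * 4
    with ThreadPoolExecutor() as pool:        # threads work well for I/O
        pages = list(pool.map(fetch, urls))
    with ProcessPoolExecutor() as pool:       # separate processes sidestep the GIL
        results = list(pool.map(crunch, [10**6] * 4))
```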
7. Conversions Between Formats
Conversion | Example Code | Resulting Output | Analyst Use Case |
---|---|---|---|
List → Set | `ids = [101, 101, 102, 103]; set(ids)` | `{101, 102, 103}` | Remove duplicate customer IDs from survey responses or transaction logs. |
Dict → DataFrame | `d = {"A": 100, "B": 200}; pd.DataFrame(d.items())` | Two-column table of keys and values | Convert a mapping into a tabular report for analysis or export to Excel/CSV. |
List → NumPy Array | `sales = [100, 200, 300]; np.array(sales)` | `array([100, 200, 300])` | Perform vectorized operations and statistical analysis on sales figures. |
Array → List | `arr = np.array([1, 2, 3]); arr.tolist()` | `[1, 2, 3]` | Convert numeric arrays back to Python lists for JSON/CSV export or API responses. |
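Put together as one runnable sketch (using the same example values as the table):
```python
import numpy as np
import pandas as pd

ids = [101, 101, 102, 103]
unique_ids = set(ids)                 # {101, 102, 103}

d = {"A": 100, "B": 200}
report = pd.DataFrame(list(d.items()), columns=["Key", "Value"])

sales = [100, 200, 300]
arr = np.array(sales)                 # vectorized math now possible

back_to_list = arr.tolist()           # [100, 200, 300] → JSON/CSV-ready
```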
8. Pandas Essentials
Series
```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])   # 20
```
DataFrame
```python
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Sales": [100, 200]})
```
Core Operations
Method | Example | Purpose |
---|---|---|
`df.head()` | `df.head(5)` | Preview first 5 rows |
`df.info()` | `df.info()` | Schema + memory usage |
`df.describe()` | `df.describe()` | Summary statistics |
`df.isnull().sum()` | `df.isnull().sum()` | Check missing values |
`df.dropna()` | `df.dropna()` | Remove null rows |
`df.fillna()` | `df.fillna(0)` | Replace nulls with 0 |
`df.groupby()` | `df.groupby("Region")["Sales"].sum()` | Aggregate sales by region |
`df.sort_values()` | `df.sort_values("Sales")` | Sort by Sales column |
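A short sketch chaining several of these operations on hypothetical sales data:
```python
import pandas as pd

df = pd.DataFrame({"Region": ["North", "South", "North", "South"],
                   "Sales": [500, None, 450, 300]})

df.isnull().sum()             # Sales: 1 missing value
clean = df.fillna(0)          # replace the null with 0
totals = (clean.groupby("Region")["Sales"]
               .sum()
               .sort_values(ascending=False))
```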
Joins in Pandas
Join Type | Pandas Code | Example (df1 vs df2) | Analyst Use Case |
---|---|---|---|
Inner Join | `pd.merge(df1, df2, on="id", how="inner")` | df1 IDs: [1, 2, 3], df2 IDs: [2, 3, 4] → Result: [2, 3] | Customers with transactions |
Left Join | `pd.merge(df1, df2, on="id", how="left")` | df1 IDs: [1, 2, 3] → Keeps all [1, 2, 3] | All customers + purchases |
Right Join | `pd.merge(df1, df2, on="id", how="right")` | df2 IDs: [2, 3, 4] → Keeps all [2, 3, 4] | All sales records |
Outer Join | `pd.merge(df1, df2, on="id", how="outer")` | df1 [1, 2, 3], df2 [2, 3, 4] → Result: [1, 2, 3, 4] | Union of both |
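As a runnable sketch with the IDs from the table (customer names and amounts are invented):
```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "customer": ["Ann", "Ben", "Cal"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "amount": [250, 400, 150]})

inner = pd.merge(df1, df2, on="id", how="inner")   # ids 2, 3
left = pd.merge(df1, df2, on="id", how="left")     # ids 1, 2, 3 (amount NaN for 1)
outer = pd.merge(df1, df2, on="id", how="outer")   # ids 1, 2, 3, 4
```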
GroupBy and Pivot
```python
# Group sales by region
df.groupby("Region")["Sales"].sum()

# Pivot: Regions vs Products (each Region/Product pair must be unique)
df.pivot(index="Region", columns="Product", values="Sales")
```
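Note that `df.pivot()` raises an error if a Region/Product pair appears more than once; when duplicates are possible, `pivot_table()` aggregates them instead:
```python
# pivot_table aggregates duplicate Region/Product pairs instead of raising
df.pivot_table(index="Region", columns="Product",
               values="Sales", aggfunc="sum")
```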
9. FAQs (Common Confusions)
- Are lists 1D or 2D? → Lists are 1D; nested lists simulate 2D; use NumPy for real 2D.
- Why tuples if lists exist? → Tuples are immutable (faster, safer).
- Why sets if lists can remove duplicates? → Sets offer O(1) membership lookup.
- Are dictionaries 2D? → Single dict = 1D; nested dicts = 2D like a table.
- Which structure is closest to SQL tables? → List of dicts or Pandas DataFrame.
- When to use arrays over lists? → Use arrays for numeric-heavy computations.
- Difference between merge and concat? → Merge = SQL join logic; concat = stack frames vertically/horizontally (see the sketch below).
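To make the merge-vs-concat distinction concrete, a short sketch (frames are illustrative):
```python
import pandas as pd

jan = pd.DataFrame({"id": [1, 2], "sales": [100, 200]})
feb = pd.DataFrame({"id": [3, 4], "sales": [150, 250]})

stacked = pd.concat([jan, feb], ignore_index=True)   # concat: stack rows → 4 rows

names = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["A", "B", "C", "D"]})
joined = pd.merge(stacked, names, on="id")           # merge: SQL-style key join
```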
10. Analyst Interview Cheat Sheet
Question | Short Answer |
---|---|
Why use a set vs list? | Deduplication and faster membership test. |
Why use a tuple? | Fixed schema like `(year, month, product_id)`. |
Why use a dictionary? | Fast lookups, mapping codes to names. |
Why use arrays? | Efficient vectorized computations. |
List comprehension use? | Quick filtering/cleaning in ETL. |
GIL impact? | Good for I/O, bad for CPU-bound tasks. |
Pandas groupby use? | Aggregations (e.g., sales by region). |
Pandas pivot use? | Reshape data like Excel pivot tables. |
Final Takeaway
- List → staging raw data.
- Tuple → fixed, unchangeable schema.
- Set → uniqueness and fast membership checks.
- Dictionary → mapping keys to values.
- Array (NumPy) → numeric efficiency at scale.
- Pandas → indispensable for tabular wrangling, joins, and aggregations.
Mastering these fundamentals, along with their analyst-focused use cases, ensures you can confidently answer both technical and applied interview questions.