Python Interview Preparation Guide for Data Analysts – list, tuple, set, dictionary comparison

Python is the most widely used language for Data Analysts because of its simple syntax, powerful libraries (Pandas, NumPy, Matplotlib, Scikit-learn), and flexibility across ETL, reporting, visualization, and modeling.

This guide consolidates basic Python interview questions, explains key data structures (list, tuple, set, dictionary, array), introduces Pandas essentials (joins, groupby, pivot), and closes with FAQs and cheat sheets that address common points of confusion in interviews.

1. What is Python, and what are its key features?

Python is a high-level, interpreted, object-oriented language designed for readability and productivity.

Key Features

  • Simple Syntax – Easy for beginners, efficient for experts.
  • Interpreted – Executes code line by line, easier debugging.
  • Dynamic Typing – No need to declare variable types explicitly.
  • Extensive Libraries – Pandas, NumPy, Scikit-learn, Django.
  • Cross-Platform – Works on Windows, macOS, Linux.

Data Analyst Connection

  • Pandas → data cleaning and wrangling.
  • NumPy → numerical computations.
  • Matplotlib/Seaborn → data visualization.
  • SQLAlchemy → database connectivity.
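
A minimal sketch of how these libraries typically combine in one analyst workflow (the file sales.csv and the Revenue column are illustrative, not from a real dataset):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Pandas: load and clean (hypothetical sales.csv with a Revenue column)
df = pd.read_csv("sales.csv").dropna(subset=["Revenue"])

# NumPy: fast numerical work on the cleaned column
log_revenue = np.log(df["Revenue"].to_numpy())

# Matplotlib: quick visualization
plt.hist(log_revenue, bins=30)
plt.show()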

2. Core Data Structures in Python

📌 Comparison Table

| Feature | List | Tuple | Set | Dictionary | Array (NumPy) |
|---|---|---|---|---|---|
| Definition | Ordered, mutable collection | Ordered, immutable collection | Unordered, unique elements | Key-value mapping | Homogeneous numeric array |
| Syntax example | a = [1, 2, 3, 3] | t = (1, 2, 3, 3) | s = {1, 2, 3} | d = {"a": 1, "b": 2} | arr = np.array([1, 2, 3]) |
| Order | ✅ Preserves order | ✅ Preserves order | ❌ Unordered | ✅ Preserves insertion order (3.7+) | ✅ Preserves order |
| Mutable | ✅ Yes: a[0] = 10 | ❌ No (immutable) | ✅ Add/remove: s.add(4) | ✅ Yes: d["c"] = 3 | ✅ Yes: arr[0] = 10 |
| Duplicates | ✅ Allowed: [1, 1, 2] | ✅ Allowed: (1, 1, 2) | ❌ Not allowed: {1, 2, 2} → {1, 2} | ❌ Keys unique; values can repeat: {"a": 1, "b": 1} | ✅ Allowed: [1, 1, 2] |
| Indexing | ✅ By index: a[1] → 2 | ✅ By index: t[1] → 2 | ❌ Not available | ✅ By key: d["a"] → 1 | ✅ By index: arr[1] → 2 |
| Key methods | .append(4), .sort() | .count(2), .index(3) | .union(), .intersection() | .keys(), .values() | .mean(), .reshape() |
| Performance | Moderate | Faster (immutable) | Very fast membership tests | Fast lookups (hash table) | Extremely fast (vectorized) |
| Best for | General-purpose collections | Fixed schema, constants | Unique IDs, categories | Mappings, JSON-style data | Numeric-heavy analytics |
| Analyst example | [1200, 800, 450] → raw sales | (2023, "Q1", 5000) → report key | {1001, 1002, 1003} → unique IDs | {101: "Electronics"} → product map | [[100, 200], [300, 400]] → sales matrix |

📌 Dimensionality

  • List → 1D: [1,2,3]; 2D: [[1,2],[3,4]] (inefficient for math).
  • Tuple → Similar to list, but immutable.
  • Set → Always 1D (no index).
  • Dictionary → 1D mapping; nested dicts simulate 2D.
  • NumPy Array → True 1D, 2D, 3D, nD (efficient for matrix ops).

📌 Methods Cheat Sheet with 1D & 2D Examples

🔹 List

# 1D List
a = [3, 1, 2]
a.append(4)          # [3, 1, 2, 4]
a.sort()             # [1, 2, 3, 4]

# 2D List (List of Lists)
matrix = [[1, 2], [3, 4], [5, 6]]
matrix[1][0]         # 3
[row[1] for row in matrix]   # [2, 4, 6] → column extraction

Use Case: Representing tabular data before loading into Pandas.
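
For instance, a nested list drops straight into a DataFrame (the column names here are illustrative):

import pandas as pd

matrix = [[1, 2], [3, 4], [5, 6]]
df = pd.DataFrame(matrix, columns=["Price", "Quantity"])   # one row per inner list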

🔹 Tuple

# 1D Tuple
t = (1, 2, 3, 2)
t.count(2)           # 2
t.index(3)           # 2

# Nested Tuple (Tuple of Tuples)
coords = ((10, 20), (30, 40))
coords[1][0]         # 30

Use Case: Immutable coordinates, fixed reference mappings.

🔹 Set

# 1D Set
s = {1, 2, 2, 3}
s.add(4)             # {1, 2, 3, 4}
s.intersection({2,5}) # {2}

# Set of Frozensets (to allow nested sets)
fs = {frozenset([1,2]), frozenset([3,4])}

Use Case: Deduplication, unique groups, fast membership tests.
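
A quick deduplication sketch: set() when order doesn't matter, dict.fromkeys() when it does (customer_ids is an illustrative list):

customer_ids = [103, 101, 103, 102, 101]
unique_ids = set(customer_ids)                        # {101, 102, 103}, order not guaranteed
ordered_unique = list(dict.fromkeys(customer_ids))    # [103, 101, 102], first-seen order kept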

🔹 Dictionary

# 1D Dictionary
d = {"Alice":100, "Bob":200}
d["Charlie"] = 300
list(d.keys())       # ['Alice','Bob','Charlie']

# Nested Dictionary
students = {
    "Alice": {"Math": 90, "Eng": 85},
    "Bob":   {"Math": 78, "Eng": 92}
}
students["Alice"]["Math"]   # 90

Use Case: Key-value storage, hierarchies like customer → orders.
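
Nested dictionaries also map cleanly onto tables; the students dict above becomes a DataFrame with one row per student:

import pandas as pd

students = {
    "Alice": {"Math": 90, "Eng": 85},
    "Bob":   {"Math": 78, "Eng": 92}
}
df = pd.DataFrame(students).T   # transpose so rows = students, columns = subjects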

🔹 NumPy Array

import numpy as np

# 1D Array
arr = np.array([1, 2, 3, 4])
arr.mean()            # 2.5

# 2D Array (Matrix)
mat = np.array([[1, 2], [3, 4], [5, 6]])
mat.shape             # (3,2)
mat[:, 1]             # [2,4,6] → column slice

# 3D Array
cube = np.arange(8).reshape(2,2,2)

Use Case: Efficient math on 2D/3D data (images, stats, regression input).

🚀 Summary of “Why & When”

| Structure | Best For | Pros | Cons | Analyst Approach |
|---|---|---|---|---|
| List | Sequential, tabular-like scratch data | Flexible, indexable | Duplicates allowed, slower on large data | Start raw data wrangling |
| Tuple | Fixed records | Immutable, safe | Can't update | Stable lookups (coordinates, IDs) |
| Set | Unique values | Deduplication, fast membership | Unordered, no indexing | Find distinct customers/products |
| Dict | Key → value | Fast, readable | Memory-heavy | Hierarchies (customer → orders) |
| NumPy | Math, ML input | Vectorized, memory-efficient | Rigid types | Statistical ops, transformations |

📌 Data Analyst Use Cases

| Collection | Example Scenario |
|---|---|
| List | Store raw sales records before cleaning → [1200, 800, 450]. |
| Tuple | Fixed (Year, Quarter, Revenue) values for reporting. |
| Set | Deduplicate customer IDs quickly: set(customer_list). |
| Dictionary | Product → Category mapping → {101: "Electronics"}. |
| Array | Perform matrix operations, e.g., sales by Region × Month. |

3. __init__() Method in Python

The __init__() method is a constructor in OOP used to initialize objects.

class SalesRecord:
    def __init__(self, date, amount, region):
        self.date = date
        self.amount = amount
        self.region = region

record = SalesRecord("2024-01-01", 500, "North")
print(record.region)   # North

Analyst Use Case – Create custom objects such as Customer or Transaction classes to structure datasets in ETL pipelines.
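
A minimal sketch of that idea: collect SalesRecord objects and flatten them into a DataFrame (vars() returns an object's attributes as a dict):

import pandas as pd

records = [
    SalesRecord("2024-01-01", 500, "North"),
    SalesRecord("2024-01-02", 300, "South")
]
df = pd.DataFrame([vars(r) for r in records])   # columns: date, amount, region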

4. Mutable vs Immutable

| Type | Examples | Change Allowed? |
|---|---|---|
| Mutable | List, Dict, Set | ✅ Yes |
| Immutable | Int, Float, String, Tuple | ❌ No |

Example

  • Store transaction logs in lists/dicts (mutable).
  • Store (Year, Month, ProductID) in tuples (immutable, fixed schema).
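
A short demonstration of the difference:

log = [("2024-01-01", 500)]
log.append(("2024-01-02", 300))   # fine: lists are mutable

key = (2024, 1, 101)
# key[0] = 2025                   # would raise TypeError: tuples are immutable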

5. Comprehensions

List Comprehension

squares = [i**2 for i in range(5)]  # [0,1,4,9,16]

Dictionary Comprehension

sales = {region: revenue for region, revenue in [("A",1000),("B",2000)]}

Generator Expression (not a "tuple comprehension")

There is no tuple comprehension in Python; parentheses create a lazy generator. Wrap it in tuple() if a tuple is needed.

gen = (i for i in range(5))
list(gen)   # [0, 1, 2, 3, 4]

Analyst Example – Filter negative values out of a list of sales figures:

clean_sales = [x for x in sales if x > 0]

6. Global Interpreter Lock (GIL)

  • Allows only one thread at a time to execute Python bytecode within a process.
  • ❌ Limitation → CPU-bound tasks (heavy math gains nothing from threads).
  • ✅ Works fine → I/O-bound tasks (API calls, file operations), since waiting threads release the GIL.

Analyst Example

  • Web scraping multiple APIs → threading is effective.
  • Heavy analytics models → prefer multiprocessing or NumPy vectorization.
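
A hedged sketch of both patterns; fetch() and score() below are illustrative stand-ins for real scraping and modeling work:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import urllib.request

def fetch(url):
    # I/O-bound: threads help because the GIL is released while waiting on the network
    return urllib.request.urlopen(url).read()

def score(n):
    # CPU-bound: processes help because each worker gets its own interpreter and GIL
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    urls = ["https://example.com"] * 4
    with ThreadPoolExecutor() as pool:
        pages = list(pool.map(fetch, urls))

    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(score, [10**6] * 4))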

7. Conversions Between Formats

| Conversion | Example Code | Resulting Output | Analyst Use Case |
|---|---|---|---|
| List → Set | ids = [101, 101, 102, 103]; set(ids) | {101, 102, 103} | Remove duplicate customer IDs from survey responses or transaction logs. |
| Dict → DataFrame | d = {"A": 100, "B": 200}; pd.DataFrame(d.items()) | Two columns: A 100 / B 200 | Convert a mapping into a tabular report for analysis or export to Excel/CSV. |
| List → NumPy Array | sales = [100, 200, 300]; np.array(sales) | array([100, 200, 300]) | Perform vectorized operations and statistical analysis on sales figures. |
| Array → List | arr = np.array([1, 2, 3]); arr.tolist() | [1, 2, 3] | Convert numeric arrays back to Python lists for JSON/CSV export or API responses. |
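
For the Dict → DataFrame case, passing explicit column names avoids the default 0/1 headers (Region and Revenue are illustrative names):

import pandas as pd

d = {"A": 100, "B": 200}
df = pd.DataFrame(list(d.items()), columns=["Region", "Revenue"])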

8. Pandas Essentials

Series

import pandas as pd
s = pd.Series([10,20,30], index=["a","b","c"])
print(s["b"])   # 20

DataFrame

df = pd.DataFrame({"Name":["Alice","Bob"], "Sales":[100,200]})

Core Operations

| Method | Example | Purpose |
|---|---|---|
| df.head() | df.head(5) | Preview first 5 rows |
| df.info() | df.info() | Schema + memory usage |
| df.describe() | df.describe() | Summary statistics |
| df.isnull().sum() | df.isnull().sum() | Count missing values per column |
| df.dropna() | df.dropna() | Remove rows with nulls |
| df.fillna() | df.fillna(0) | Replace nulls with 0 |
| df.groupby() | df.groupby("Region")["Sales"].sum() | Aggregate sales by region |
| df.sort_values() | df.sort_values("Sales") | Sort by the Sales column |

Joins in Pandas

| Join Type | Pandas Code | Example (df1 vs df2) | Analyst Use Case |
|---|---|---|---|
| Inner | pd.merge(df1, df2, on="id", how="inner") | df1 IDs [1,2,3], df2 IDs [2,3,4] → [2,3] | Customers with transactions |
| Left | pd.merge(df1, df2, on="id", how="left") | Keeps all df1 IDs [1,2,3] | All customers + purchases |
| Right | pd.merge(df1, df2, on="id", how="right") | Keeps all df2 IDs [2,3,4] | All sales records |
| Outer | pd.merge(df1, df2, on="id", how="outer") | df1 [1,2,3], df2 [2,3,4] → [1,2,3,4] | Union of both |
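
A runnable version of the table's scenario, using two toy frames:

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "customer": ["Ann", "Ben", "Cal"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "amount": [250, 400, 100]})

inner = pd.merge(df1, df2, on="id", how="inner")   # ids 2, 3
left  = pd.merge(df1, df2, on="id", how="left")    # ids 1, 2, 3 (amount is NaN for 1)
outer = pd.merge(df1, df2, on="id", how="outer")   # ids 1, 2, 3, 4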

GroupBy and Pivot

# Group sales by region
df.groupby("Region")["Sales"].sum()

# Pivot table: Regions vs Products
df.pivot(index="Region", columns="Product", values="Sales")
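
One caveat: df.pivot raises an error when an (index, columns) pair repeats, which is common in raw sales data. pivot_table aggregates the duplicates instead (column names below are illustrative):

import pandas as pd

df = pd.DataFrame({
    "Region":  ["North", "North", "South", "South"],
    "Product": ["A", "A", "A", "B"],
    "Sales":   [100, 150, 200, 300]
})

# ("North", "A") appears twice, so df.pivot would fail; pivot_table sums the duplicates
pivot = df.pivot_table(index="Region", columns="Product", values="Sales", aggfunc="sum")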

9. FAQs (Common Confusions)

  • Are lists 1D or 2D? → Lists are 1D; nested lists simulate 2D; use NumPy for real 2D.
  • Why tuples if lists exist? → Tuples are immutable (faster, safer).
  • Why sets if lists can remove duplicates? → Sets offer O(1) membership lookup.
  • Are dictionaries 2D? → Single dict = 1D; nested dicts = 2D like a table.
  • Which structure is closest to SQL tables? → List of dicts or Pandas DataFrame.
  • When to use arrays over lists? → Use arrays for numeric-heavy computations.
  • Difference between merge and concat? → Merge = SQL join logic, concat = stack vertically/horizontally; see the sketch below.
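
A minimal merge-vs-concat illustration, using hypothetical monthly frames:

import pandas as pd

jan = pd.DataFrame({"id": [1, 2], "sales": [100, 200]})
feb = pd.DataFrame({"id": [3, 4], "sales": [150, 250]})

stacked = pd.concat([jan, feb], ignore_index=True)      # 4 rows stacked vertically
joined  = pd.merge(jan, feb, on="id", how="outer",
                   suffixes=("_jan", "_feb"))           # SQL-style join on id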

10. Analyst Interview Cheat Sheet

| Question | Short Answer |
|---|---|
| Why use a set vs list? | Deduplication and faster membership tests. |
| Why use a tuple? | Fixed schema like (year, month, product_id). |
| Why use a dictionary? | Fast lookups, mapping codes to names. |
| Why use arrays? | Efficient vectorized computations. |
| List comprehension use? | Quick filtering/cleaning in ETL. |
| GIL impact? | Good for I/O-bound work, bad for CPU-bound work. |
| Pandas groupby use? | Aggregations (e.g., sales by region). |
| Pandas pivot use? | Reshape data like Excel pivot tables. |

Final Takeaway

  • List → staging raw data.
  • Tuple → fixed, unchangeable schema.
  • Set → uniqueness and fast membership checks.
  • Dictionary → mapping keys to values.
  • Array (NumPy) → numeric efficiency at scale.
  • Pandas → indispensable for tabular wrangling, joins, and aggregations.

Mastering these fundamentals, along with their analyst-focused use cases, ensures you can confidently answer both technical and applied interview questions.

