Before diving deep into Machine Learning, I would like to share some tiny, beginner-friendly, code-based exercises on NumPy, Pandas and Data Preprocessing - small, focused and ML-oriented.
NumPy Mini Exercises (Level: Very Easy)
Make sure you import NumPy first:
import numpy as np
Exercise 1 – Create a NumPy array
Create a NumPy array containing these numbers:
[2, 4, 6, 8]
Solution:
a = np.array([2, 4, 6, 8])
Exercise 2 – Create a 2D array
Create this 2×2 matrix:
1 2
3 4
Solution:
m = np.array([[1, 2],
              [3, 4]])
Exercise 3 – Array shape
Find the shape of this array:
a = np.array([[10, 20, 30], [40, 50, 60]])
Solution:
a = np.array([[10, 20, 30], [40, 50, 60]])
a.shape
Output:
(2, 3)
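Two related attributes worth checking alongside shape (a small extra, not part of the original exercise):
a.ndim   # 2 (number of dimensions)
a.size   # 6 (total number of elements)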
Exercise 4 – Element-wise operations
Given:
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
Compute:
- a + b
- a * b
Solution:
I. Addition
a + b
Output:
array([11, 22, 33])
II. Multiplication
a * b
Output:
array([10, 40, 90])
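A small extra beyond the exercise: the same element-wise rules apply when one operand is a plain number - NumPy broadcasts it across the whole array.
a + 10   # array([11, 12, 13]) - the scalar is added to every element
a * 2    # array([2, 4, 6])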
Exercise 5 – Slicing
Given:
a = np.array([5, 10, 15, 20, 25])
Extract the middle three values:
[10, 15, 20]
Solution:
a = np.array([5, 10, 15, 20, 25])
middle = a[1:4]
Output:
array([10, 15, 20])
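If you want to experiment a little further with the same array, slicing also accepts negative indices and a step (an extra sketch, not part of the original exercise):
a[-2:]   # last two values: array([20, 25])
a[::2]   # every second value: array([ 5, 15, 25])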
Exercise 6 – Zeros and Ones arrays
Create:
- A 3×3 matrix of zeros
- A 2×4 matrix of ones
Solution:
I. 3×3 matrix of zeros
np.zeros((3, 3))
II. 2×4 matrix of ones
np.ones((2, 4))
Exercise 7 – Random numbers
Generate a NumPy array of five random numbers between 0 and 1.
Solution:
r = np.random.rand(5)
Output (your values will differ on each run):
array([0.23, 0.91, 0.49, 0.11, 0.76])
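Because these numbers change on every run, it helps to seed the generator when you want reproducible results. A minimal extra sketch using NumPy's Generator API (the seed value 42 is arbitrary):
rng = np.random.default_rng(42)   # seeded generator: same numbers on every run
r = rng.random(5)                 # five values between 0 and 1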
Exercise 8 – Matrix multiplication
Given:
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
Compute:
A @ B
(or np.dot(A, B))
Solution:
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
A @ B # or np.dot(A, B)
Output:
array([[19, 22],
[43, 50]])
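A common beginner mix-up worth testing yourself on: A * B multiplies element by element, while A @ B is the true matrix product.
A * B   # element-wise: array([[ 5, 12], [21, 32]])
A @ B   # matrix product: array([[19, 22], [43, 50]])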
Exercise 9 – Mean of an array
Compute the mean of:
x = np.array([4, 8, 12, 16])
Solution:
x = np.array([4, 8, 12, 16])
np.mean(x)
Output:
10.0
Exercise 10 – Reshape
Given:
x = np.array([1, 2, 3, 4, 5, 6])
Reshape it into a 2Γ3 matrix.
Solution:
x = np.array([1, 2, 3, 4, 5, 6])
x.reshape(2, 3)
Output:
array([[1, 2, 3],
[4, 5, 6]])
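One handy extra: pass -1 and NumPy works out that dimension for you.
x.reshape(3, -1)   # second dimension inferred: shape (3, 2)
x.reshape(-1)      # flatten back to 1D: array([1, 2, 3, 4, 5, 6])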
Pandas Mini Exercises (Level: Very Easy)
Start with:
import pandas as pd
Exercise 1 – Create a DataFrame
Create a DataFrame from this dictionary:
data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}
Solution:
data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}
df = pd.DataFrame(data)
df
Explanation
- Dictionary keys → column names
- Lists → column values
Output:
Age Salary
0 25 50000
1 30 60000
2 35 70000
Exercise 2 – View data
Using the DataFrame from Exercise 1:
- Display the first 2 rows
- Display the column names
- Display the shape of the DataFrame
Solution:
df.head(2)
df.columns
df.shape
Explanation
- head(2) → first 2 rows
- columns → column names
- shape → (rows, columns)
Output:
#first 2 rows
Age Salary
0 25 50000
1 30 60000
#column names
Index(['Age', 'Salary'], dtype='object')
#3 rows, 2 columns
(3, 2)
Exercise 3 – Select a column
Select only the Salary column.
Solution:
df["Salary"]
Explanation
- Single brackets → returns a Series
Output:
#This is a Series, not a DataFrame
0 50000
1 60000
2 70000
Name: Salary, dtype: int64
Exercise 4 – Select multiple columns
Select Age and Salary together.
Solution:
df[["Age", "Salary"]]
Explanation
- Double brackets → returns a DataFrame
Output:
# Double brackets → DataFrame
Age Salary
0 25 50000
1 30 60000
2 35 70000
Exercise 5 – Filter rows
From the DataFrame, select rows where:
Age > 28
Solution:
df[df["Age"] > 28]
Explanation
- Boolean conditions filter rows; this is a core Pandas skill
- Very common in data cleaning
Output:
Age Salary
1 30 60000
2 35 70000
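To practise one step further (not part of the original exercise), conditions can be combined with & (and) or | (or), each wrapped in parentheses:
df[(df["Age"] > 28) & (df["Salary"] < 70000)]   # only the row with Age 30, Salary 60000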
Exercise 6 – Add a new column
Add a column called Tax which is 10% of Salary.
Solution:
df["Tax"] = 0.10 * df["Salary"]
df
Explanation
- Pandas supports vectorized operations
- Applied to entire column at once
Output:
# Operations apply row-wise automatically
Age Salary Tax
0 25 50000 5000.0
1 30 60000 6000.0
2 35 70000 7000.0
Exercise 7 – Basic statistics
Compute:
- Mean Age
- Maximum Salary
Solution:
df["Age"].mean()
df["Salary"].max()
Explanation
- Pandas has built-in descriptive stats
- Used heavily during EDA
Output:
# Mean Age
30.0
#Maximum Salary
70000
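As an extra shortcut, describe() summarises all numeric columns at once:
df.describe()   # count, mean, std, min, quartiles and max for every numeric column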
Exercise 8 – Handle missing values
Given:
data = {
    "Age": [25, None, 35],
    "Salary": [50000, 60000, None]
}
df = pd.DataFrame(data)
- Detect missing values
- Fill missing values with the column mean
Solution:
df.isnull()
df_filled = df.fillna(df.mean())
df
Explanation
- isnull() → detects missing values
- fillna(df.mean()) → fills numeric NaNs with the column mean
Output of df (the original data, still containing NaNs):
Age Salary
0 25.0 50000.0
1 NaN 60000.0
2 35.0 NaN
Breaking this down here:
Detect missing values:
df.isnull()
Output:
Age Salary
0 False False
1 True False
2 False True
Fill missing values with mean:
df_filled = df.fillna(df.mean())
df_filled
Output:
Age Salary
0 25.0 50000.0
1 30.0 60000.0
2 35.0 55000.0
Means used:
- Age mean = 30
- Salary mean = 55,000
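Filling is one strategy; the other common one is simply dropping incomplete rows. A quick extra sketch on the same df:
df.dropna()                 # keep only complete rows - here, just row 0
df.dropna(subset=["Age"])   # drop rows only where Age is missing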
Exercise 9 – Sort values
Sort the DataFrame by Salary (descending order).
Solution (using df_filled from Exercise 8, so there are no NaNs):
df_filled.sort_values(by="Salary", ascending=False)
Explanation
- Sorting helps identify top/bottom values
- Common during analysis
Output:
Age Salary
1 30.0 60000.0
2 35.0 55000.0
0 25.0 50000.0
Exercise 10 – Convert to NumPy (ML step)
Convert:
- Features → Age, Salary
- Target → Tax
into NumPy arrays.
Solution (using df_filled from Exercise 8; the Tax column is recreated exactly as in Exercise 6):
df_filled["Tax"] = 0.10 * df_filled["Salary"]
X = df_filled[["Age", "Salary"]].values
y = df_filled["Tax"].values
X
Explanation
- .values converts Pandas → NumPy
- scikit-learn expects NumPy arrays
Output:
array([[2.5e+01, 5.0e+04],
[3.0e+01, 6.0e+04],
[3.5e+01, 5.5e+04]])
Target:
y
Output:
array([5000., 6000., 5500.])
This is exactly the format ML models expect.
Data Preprocessing: Code-Based Mini Exercises
Start with:
import pandas as pd
import numpy as np
Exercise 1 – Train/Test Split (Ratio practice)
Given:
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])
Split the data into 80% training and 20% testing.
Use:
from sklearn.model_selection import train_test_split
Solution:
from sklearn.model_selection import train_test_split
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Explanation
- test_size=0.2 → 20% test, 80% train
- random_state ensures reproducibility
- The model learns from X_train and is evaluated on X_test
Output (one possible split with random_state=42):
X_train = [[4], [2], [5], [3]]
X_test = [[1]]
y_train = [40, 20, 50, 30]
y_test = [10]
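A quick sanity check after any split (an extra step, not required by the exercise) is to print the shapes:
print(X_train.shape, X_test.shape)   # (4, 1) (1, 1) for this 5-sample example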
Exercise 2 – Detect missing values
Given:
data = {
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 70000, 80000]
}
df = pd.DataFrame(data)
Write code to:
- Detect missing values
- Count missing values per column
Solution:
df.isnull()
df.isnull().sum()
Explanation
- isnull() → True/False for each cell
- sum() counts missing values per column
Output of df.isnull():
Age Salary
0 False False
1 False True
2 True False
3 False False
Output of df.isnull().sum():
Age 1
Salary 1
dtype: int64
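On larger datasets it is often more useful to look at the share of missing values per column - a small extra sketch:
df.isnull().mean() * 100   # percentage missing per column (25.0 for Age and Salary here)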
Exercise 3 – Fill missing values (Mean)
Using the same DataFrame above:
Fill missing values using the column mean.
Solution:
df_filled = df.fillna(df.mean())
Explanation
- Replaces NaN with column mean
- Common for numerical ML features
- Keeps dataset size intact
Output:
Age Salary
0 25.0 50000.0
1 30.0 66666.7
2 31.7 70000.0
3 40.0 80000.0
(Means used: Age ≈ 31.7, Salary ≈ 66,666.7)
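Mean imputation is sensitive to outliers; a slightly more robust variant you will often see is the median (an extra sketch on the same df):
df_filled = df.fillna(df.median())   # replace NaN with the column median instead of the mean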
Exercise 4 – One-Hot Encoding (Categorical Data)
Given:
df = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai"]
})
Convert City into numerical columns using one-hot encoding.
Solution:
encoded_df = pd.get_dummies(df["City"])
OR keep original structure:
encoded_df = pd.get_dummies(df, columns=["City"])
Explanation
- Converts text categories into binary columns
- Avoids false numeric ordering
- Required before ML models
Input:
City
Delhi
Mumbai
Delhi
Chennai
Output (recent pandas versions show True/False here by default; pass dtype=int to get 0/1):
City_Chennai City_Delhi City_Mumbai
0 0 1 0
1 0 0 1
2 0 1 0
3 1 0 0
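An extra option worth knowing: for some linear models one dummy column is dropped to avoid redundancy.
pd.get_dummies(df, columns=["City"], drop_first=True)   # drops City_Chennai, keeps City_Delhi and City_Mumbai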
Exercise 5 – Feature Scaling (Standardization)
Given:
X = np.array([
    [20, 30000],
    [30, 50000],
    [40, 70000]
])
Apply Standard Scaling to X.
Use:
from sklearn.preprocessing import StandardScaler
Solution:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Explanation
- Centers data around mean = 0
- Std deviation = 1
- Essential for distance-based models
Output (approx.):
[[-1.2247, -1.2247],
[ 0.0000, 0.0000],
[ 1.2247, 1.2247]]
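The other scaler you will meet early on is min-max scaling, which squeezes each feature into the 0-1 range - a minimal extra sketch on the same X:
from sklearn.preprocessing import MinMaxScaler
X_minmax = MinMaxScaler().fit_transform(X)   # here: [[0, 0], [0.5, 0.5], [1, 1]]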
Exercise 6 – Feature Selection
Given:
df = pd.DataFrame({
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
    "EmployeeID": [101, 102, 103]
})
Remove the EmployeeID column.
Solution:
df_selected = df.drop("EmployeeID", axis=1)
Explanation
- IDs carry no predictive value
- Removing noise improves model learning
Input columns:
['Age', 'Salary', 'EmployeeID']
Output columns:
['Age', 'Salary']
Exercise 7 – Outlier Detection (Simple logic)
Given:
ages = np.array([22, 23, 24, 25, 120])
Write code to remove values greater than 100.
Solution:
filtered_ages = ages[ages <= 100]
Explanation
- Simple rule-based filtering
- Useful for obvious data errors
- Always inspect before removing
Output:
array([22, 23, 24, 25])
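The hard-coded threshold works here, but a slightly more general rule of thumb is the IQR method - a minimal extra sketch on the same ages array:
q1, q3 = np.percentile(ages, [25, 75])                      # first and third quartiles
iqr = q3 - q1                                               # interquartile range
mask = (ages >= q1 - 1.5 * iqr) & (ages <= q3 + 1.5 * iqr)  # keep values inside the IQR fences
filtered_ages = ages[mask]                                  # 120 falls outside and is removed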
Exercise 8 – Data Leakage Check (Thinking + Code)
Given:
from sklearn.preprocessing import StandardScaler
Write the correct order of code to:
- Split data
- Fit scaler on training data
- Transform both training and test data
(No need to run it – just write the correct sequence.)
Solution:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 2. Fit scaler ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# 3. Apply same scaler to test data
X_test_scaled = scaler.transform(X_test)
Explanation
- Training data defines statistics
- Test data must remain unseen
- Prevents unrealistically high accuracy
Output (conceptual sequence):
1. Split data
2. Fit scaler on training data
3. Transform training data
4. Transform test data
If you can do these exercises comfortably, you're ML-ready at a foundational level.
If you are new to Python, you can install Python 3.x and play around with these exercises in your IDE; I use Jupyter Notebook.
Recommendation (if you're a beginner):
Do NOT learn scikit-learn models yet.
First learn how a model learns.
Then use scikit-learn as a tool, not a teacher.
Will be exploring more on this in subsequent posts.
You can refer to these posts for understanding NumPy, Pandas and Data Preprocessing:
Understanding NumPy in the context of Python for Machine Learning
The next basic concept of Machine Learning after NumPy: Pandas
Understanding Data Preprocessing