Before diving deep into Machine Learning, I would like to share some tiny, beginner-friendly, code-based exercises on NumPy, Pandas and Data Preprocessing - small, focused and ML-oriented.
NumPy Mini Exercises (Level: Very Easy)
Make sure you import NumPy first:
import numpy as np
Exercise 1 – Create a NumPy array
Create a NumPy array containing these numbers:
[2, 4, 6, 8]
Solution:
a = np.array([2, 4, 6, 8])
Exercise 2 – Create a 2D array
Create this 2×2 matrix:
1 2
3 4
Solution:
m = np.array([[1, 2],
              [3, 4]])
Exercise 3 – Array shape
Find the shape of this array:
a = np.array([[10, 20, 30], [40, 50, 60]])
Solution:
a = np.array([[10, 20, 30], [40, 50, 60]])
a.shape
Output:
(2, 3)
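Two related attributes worth checking alongside shape (a small extra, not part of the original exercise):
a.ndim   # 2 (number of dimensions)
a.size   # 6 (total number of elements)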
Exercise 4 – Element-wise operations
Given:
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
Compute:
- a + b
- a * b
Solution:
I. Addition
a + b
Output:
array([11, 22, 33])
II. Multiplication
a * b
Output:
array([10, 40, 90])
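A small extra beyond the exercise: the same element-wise rules apply when one operand is a plain number - NumPy broadcasts it across the whole array.
a + 10   # array([11, 12, 13]) - the scalar is added to every element
a * 2    # array([2, 4, 6])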
Exercise 5 – Slicing
Given:
a = np.array([5, 10, 15, 20, 25])
Extract the middle three values:
[10, 15, 20]
Solution:
a = np.array([5, 10, 15, 20, 25])
middle = a[1:4]
Output:
array([10, 15, 20])
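If you want to experiment a little further with the same array, slicing also accepts negative indices and a step (an extra sketch, not part of the original exercise):
a[-2:]   # last two values: array([20, 25])
a[::2]   # every second value: array([ 5, 15, 25])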
Exercise 6 – Zeros and Ones arrays
Create:
- A 3×3 matrix of zeros
- A 2×4 matrix of ones
Solution:
I. 3×3 matrix of zeros
np.zeros((3, 3))
II. 2×4 matrix of ones
np.ones((2, 4))
Exercise 7 – Random numbers
Generate a NumPy array of five random numbers between 0 and 1.
Solution:
r = np.random.rand(5)
Output (your values will differ on each run):
array([0.23, 0.91, 0.49, 0.11, 0.76])
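Because these numbers change on every run, it helps to seed the generator when you want reproducible results. A minimal extra sketch using NumPy's Generator API (the seed value 42 is arbitrary):
rng = np.random.default_rng(42)   # seeded generator: same numbers on every run
r = rng.random(5)                 # five values between 0 and 1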
Exercise 8 – Matrix multiplication
Given:
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
Compute:
A @ B
(or np.dot(A, B))
Solution:
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
A @ B # or np.dot(A, B)
Output:
array([[19, 22],
[43, 50]])
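A common beginner mix-up worth testing yourself on: A * B multiplies element by element, while A @ B is the true matrix product.
A * B   # element-wise: array([[ 5, 12], [21, 32]])
A @ B   # matrix product: array([[19, 22], [43, 50]])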
Exercise 9 – Mean of an array
Compute the mean of:
x = np.array([4, 8, 12, 16])
Solution:
x = np.array([4, 8, 12, 16])
np.mean(x)
Output:
10.0
Exercise 10 – Reshape
Given:
x = np.array([1, 2, 3, 4, 5, 6])
Reshape it into a 2Γ3 matrix.
Solution:
x = np.array([1, 2, 3, 4, 5, 6])
x.reshape(2, 3)
Output:
array([[1, 2, 3],
[4, 5, 6]])
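One handy extra: pass -1 and NumPy works out that dimension for you.
x.reshape(3, -1)   # second dimension inferred: shape (3, 2)
x.reshape(-1)      # flatten back to 1D: array([1, 2, 3, 4, 5, 6])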
Pandas Mini Exercises (Level: Very Easy)
Start with:
import pandas as pd
Exercise 1 – Create a DataFrame
Create a DataFrame from this dictionary:
data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}
Solution:
data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}
df = pd.DataFrame(data)
df
Explanation
- Dictionary keys → column names
- Lists → column values
Output:
Age Salary
0 25 50000
1 30 60000
2 35 70000
Exercise 2 – View data
Using the DataFrame from Exercise 1:
- Display the first 2 rows
- Display the column names
- Display the shape of the DataFrame
Solution:
df.head(2)
df.columns
df.shape
Explanation
- head(2) → first 2 rows
- columns → column names
- shape → (rows, columns)
Output:
#first 2 rows
Age Salary
0 25 50000
1 30 60000
#column names
Index(['Age', 'Salary'], dtype='object')
#3 rows, 2 columns
(3, 2)
Exercise 3 – Select a column
Select only the Salary column.
Solution:
df["Salary"]
Explanation
- Single brackets → returns a Series
Output:
#This is a Series, not a DataFrame
0 50000
1 60000
2 70000
Name: Salary, dtype: int64
Exercise 4 – Select multiple columns
Select Age and Salary together.
Solution:
df[["Age", "Salary"]]
Explanation
- Double brackets → returns a DataFrame
Output:
# Double brackets → DataFrame
Age Salary
0 25 50000
1 30 60000
2 35 70000
Exercise 5 – Filter rows
From the DataFrame, select rows where:
Age > 28
Solution:
df[df["Age"] > 28]
Explanation
- Boolean conditions filter rows; this is a core Pandas skill
- Very common in data cleaning
Output:
Age Salary
1 30 60000
2 35 70000
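To practise one step further (not part of the original exercise), conditions can be combined with & (and) or | (or), each wrapped in parentheses:
df[(df["Age"] > 28) & (df["Salary"] < 70000)]   # only the row with Age 30, Salary 60000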
Exercise 6 – Add a new column
Add a column called Tax which is 10% of Salary.
Solution:
df["Tax"] = 0.10 * df["Salary"]
df
Explanation
- Pandas supports vectorized operations
- Applied to entire column at once
Output:
# Operations apply row-wise automatically
Age Salary Tax
0 25 50000 5000.0
1 30 60000 6000.0
2 35 70000 7000.0
Exercise 7 – Basic statistics
Compute:
- Mean Age
- Maximum Salary
Solution:
df["Age"].mean()
df["Salary"].max()
Explanation
- Pandas has built-in descriptive stats
- Used heavily during EDA
Output:
# Mean Age
30.0
#Maximum Salary
70000
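As an extra shortcut, describe() summarises all numeric columns at once:
df.describe()   # count, mean, std, min, quartiles and max for every numeric column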
Exercise 8 – Handle missing values
Given:
data = {
    "Age": [25, None, 35],
    "Salary": [50000, 60000, None]
}
df = pd.DataFrame(data)
- Detect missing values
- Fill missing values with the column mean
Solution:
df.isnull()
df_filled = df.fillna(df.mean())
df
Explanation
- isnull() → detects missing values
- fillna(df.mean()) → fills numeric NaNs with the column mean
Output of df (the original data, still containing NaNs):
Age Salary
0 25.0 50000.0
1 NaN 60000.0
2 35.0 NaN
Breaking this down here:
Detect missing values:
df.isnull()
Output:
Age Salary
0 False False
1 True False
2 False True
Fill missing values with mean:
df_filled = df.fillna(df.mean())
df_filled
Output:
Age Salary
0 25.0 50000.0
1 30.0 60000.0
2 35.0 55000.0
Means used:
- Age mean = 30
- Salary mean = 55,000
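Filling is one strategy; the other common one is simply dropping incomplete rows. A quick extra sketch on the same df:
df.dropna()                 # keep only complete rows - here, just row 0
df.dropna(subset=["Age"])   # drop rows only where Age is missing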
Exercise 9 – Sort values
Sort the DataFrame by Salary (descending order).
Solution (using df_filled from Exercise 8, so there are no NaNs):
df_filled.sort_values(by="Salary", ascending=False)
Explanation
- Sorting helps identify top/bottom values
- Common during analysis
Output:
Age Salary
1 30.0 60000.0
2 35.0 55000.0
0 25.0 50000.0
Exercise 10 – Convert to NumPy (ML step)
Convert:
- Features → Age, Salary
- Target → Tax
into NumPy arrays.
Solution (using df_filled from Exercise 8; the Tax column is recreated exactly as in Exercise 6):
df_filled["Tax"] = 0.10 * df_filled["Salary"]
X = df_filled[["Age", "Salary"]].values
y = df_filled["Tax"].values
X
Explanation
- .values converts Pandas → NumPy
- scikit-learn expects NumPy arrays
Output:
array([[2.5e+01, 5.0e+04],
[3.0e+01, 6.0e+04],
[3.5e+01, 5.5e+04]])
Target:
y
Output:
array([5000., 6000., 5500.])
This is exactly the format ML models expect.
Data Preprocessing: Code-Based Mini Exercises
Start with:
import pandas as pd
import numpy as np
Exercise 1 – Train/Test Split (Ratio practice)
Given:
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])
Split the data into 80% training and 20% testing.
Use:
from sklearn.model_selection import train_test_split
Solution:
from sklearn.model_selection import train_test_split
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Explanation
- test_size=0.2 → 20% test, 80% train
- random_state ensures reproducibility
- The model learns from X_train and is evaluated on X_test
Output (one possible split with random_state=42):
X_train = [[4], [2], [5], [3]]
X_test = [[1]]
y_train = [40, 20, 50, 30]
y_test = [10]
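A quick sanity check after any split (an extra step, not required by the exercise) is to print the shapes:
print(X_train.shape, X_test.shape)   # (4, 1) (1, 1) for this 5-sample example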
Exercise 2 – Detect missing values
Given:
data = {
    "Age": [25, 30, None, 40],
    "Salary": [50000, None, 70000, 80000]
}
df = pd.DataFrame(data)
Write code to:
- Detect missing values
- Count missing values per column
Solution:
df.isnull()
df.isnull().sum()
Explanation
- isnull() → True/False for each cell
- sum() counts missing values per column
Output of df.isnull():
Age Salary
0 False False
1 False True
2 True False
3 False False
Output of df.isnull().sum():
Age 1
Salary 1
dtype: int64
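On larger datasets it is often more useful to look at the share of missing values per column - a small extra sketch:
df.isnull().mean() * 100   # percentage missing per column (25.0 for Age and Salary here)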
Exercise 3 – Fill missing values (Mean)
Using the same DataFrame above:
Fill missing values using the column mean.
Solution:
df_filled = df.fillna(df.mean())
Explanation
- Replaces NaN with column mean
- Common for numerical ML features
- Keeps dataset size intact
Output:
Age Salary
0 25.0 50000.0
1 30.0 66666.7
2 31.7 70000.0
3 40.0 80000.0
(Means used: Age ≈ 31.7, Salary ≈ 66,666.7)
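Mean imputation is sensitive to outliers; a slightly more robust variant you will often see is the median (an extra sketch on the same df):
df_filled = df.fillna(df.median())   # replace NaN with the column median instead of the mean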
Exercise 4 – One-Hot Encoding (Categorical Data)
Given:
df = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai"]
})
Convert City into numerical columns using one-hot encoding.
Solution:
encoded_df = pd.get_dummies(df["City"])
OR keep original structure:
encoded_df = pd.get_dummies(df, columns=["City"])
Explanation
- Converts text categories into binary columns
- Avoids false numeric ordering
- Required before ML models
Input:
City
Delhi
Mumbai
Delhi
Chennai
Output (recent pandas versions show True/False here by default; pass dtype=int to get 0/1):
City_Chennai City_Delhi City_Mumbai
0 0 1 0
1 0 0 1
2 0 1 0
3 1 0 0
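An extra option worth knowing: for some linear models one dummy column is dropped to avoid redundancy.
pd.get_dummies(df, columns=["City"], drop_first=True)   # drops City_Chennai, keeps City_Delhi and City_Mumbai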
Exercise 5 – Feature Scaling (Standardization)
Given:
X = np.array([
    [20, 30000],
    [30, 50000],
    [40, 70000]
])
Apply Standard Scaling to X.
Use:
from sklearn.preprocessing import StandardScaler
Solution:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Explanation
- Centers data around mean = 0
- Std deviation = 1
- Essential for distance-based models
Output (approx.):
[[-1.2247, -1.2247],
[ 0.0000, 0.0000],
[ 1.2247, 1.2247]]
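The other scaler you will meet early on is min-max scaling, which squeezes each feature into the 0-1 range - a minimal extra sketch on the same X:
from sklearn.preprocessing import MinMaxScaler
X_minmax = MinMaxScaler().fit_transform(X)   # here: [[0, 0], [0.5, 0.5], [1, 1]]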
Exercise 6 – Feature Selection
Given:
df = pd.DataFrame({
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
    "EmployeeID": [101, 102, 103]
})
Remove the EmployeeID column.
Solution:
df_selected = df.drop("EmployeeID", axis=1)
Explanation
- IDs carry no predictive value
- Removing noise improves model learning
Input columns:
['Age', 'Salary', 'EmployeeID']
Output columns:
['Age', 'Salary']
Exercise 7 – Outlier Detection (Simple logic)
Given:
ages = np.array([22, 23, 24, 25, 120])
Write code to remove values greater than 100.
Solution:
filtered_ages = ages[ages <= 100]
Explanation
- Simple rule-based filtering
- Useful for obvious data errors
- Always inspect before removing
Output:
array([22, 23, 24, 25])
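The hard-coded threshold works here, but a slightly more general rule of thumb is the IQR method - a minimal extra sketch on the same ages array:
q1, q3 = np.percentile(ages, [25, 75])                      # first and third quartiles
iqr = q3 - q1                                               # interquartile range
mask = (ages >= q1 - 1.5 * iqr) & (ages <= q3 + 1.5 * iqr)  # keep values inside the IQR fences
filtered_ages = ages[mask]                                  # 120 falls outside and is removed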
Exercise 8 – Data Leakage Check (Thinking + Code)
Given:
from sklearn.preprocessing import StandardScaler
Write the correct order of code to:
- Split data
- Fit scaler on training data
- Transform both training and test data
(No need to run it – just write the correct sequence.)
Solution:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 2. Fit scaler ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# 3. Apply same scaler to test data
X_test_scaled = scaler.transform(X_test)
Explanation
- Training data defines statistics
- Test data must remain unseen
- Prevents unrealistically high accuracy
Output (conceptual sequence):
1. Split data
2. Fit scaler on training data
3. Transform training data
4. Transform test data
If you can do these exercises comfortably, you're ML-ready at a foundational level.
If you are new to Python, you can install Python 3.x and play around with these exercises in your IDE; I use Jupyter Notebook.
Recommendation (if you're a beginner):
Do NOT learn scikit-learn models yet.
First learn how a model learns.
Then use scikit-learn as a tool, not a teacher.
Will be exploring more on this in subsequent posts.
You can refer to these posts for understanding NumPy, Pandas and Data Preprocessing:
Understanding NumPy in the context of Python for Machine Learning
The next basic concept of Machine Learning after NumPy: Pandas
Understanding Data Preprocessing