Nuria

Data Analysis with Python: Spotify Songs Dataset

In data science, loading data and exploratory data analysis are two of the tasks you can perform on a dataset. Depending on the information you need to obtain, you'll have to carry out additional tasks.

Before starting a data analysis, you need to know the steps to follow. The following list shows them in order:

  1. Loading data (dataset).

  2. Exploratory data analysis.

  3. Data preparation and preprocessing.

  4. Data visualization.

  5. Machine learning model generation.

  6. Machine learning model training.

  7. Predictive model definition.

  8. Evaluation of the trained model with reserved data.

In the exercise I explain below, I only want to obtain information about Spotify songs. This is a brief analysis written in Python; if you want to see the complete exercise, you can download it from the repository on GitHub.


⚠️ Before Starting

Before starting a data analysis, it's very important to define the information you need to obtain, because without a clear objective, you won't have a starting point.


Loading the Dataset

The dataset (MostStreamedSpotifySongs2024.csv) consists of several columns that refer to the main music streaming platforms. In this case, I only want to explore Spotify data. The information I want to obtain is the following:

  • Songs by year
  • Song percentage: Explicit vs Non-Explicit
  • Most listened to songs by year with and without explicit content
  • Song with the most streams

Importing the Libraries

The pandas, NumPy, Matplotlib, and seaborn libraries make the work much easier thanks to the large number of methods they offer.

# Data manipulation with DataFrames.
import pandas as pd

# Numerical operations and array handling.
import numpy as np

# Chart creation.
import matplotlib.pyplot as plt

# Advanced statistical visualization.
import seaborn as sns

# Display charts in Jupyter notebook.
%matplotlib inline


Reading the File

In this exercise, there is a single CSV file with ISO-8859-1 encoding. By default, read_csv() assumes UTF-8, so it's important to specify the encoding to avoid reading errors when a file contains special characters stored in a different encoding.

# Read the file; its encoding is ISO-8859-1

file_path = 'MostStreamedSpotifySongs2024.csv'
data = pd.read_csv(file_path, encoding='ISO-8859-1')
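To illustrate why the encoding matters, here is a minimal sketch that writes a small, made-up CSV in ISO-8859-1 and then reads it back twice. The sample rows and the temporary file name are hypothetical and are not part of the real dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample; accented letters are single bytes in ISO-8859-1.
sample_text = "Track,Artist\nDéjà Vu,Example Artist\nMañana,Another Artist\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "wb") as f:
        f.write(sample_text.encode("ISO-8859-1"))

    # The default UTF-8 decoder rejects these bytes.
    try:
        pd.read_csv(sample_path)
        default_read_ok = True
    except UnicodeDecodeError:
        default_read_ok = False

    # Passing the actual encoding reads the file correctly.
    sample = pd.read_csv(sample_path, encoding="ISO-8859-1")

print(f"Default read succeeded: {default_read_ok}")
print(sample)
```

If you don't know the encoding of a file in advance, the error message from the failed read is usually the first hint that it isn't UTF-8.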

Visualizing the Data Table

Once the data is loaded, you need to look at the information it contains. By default, the head() method displays the first five rows of the DataFrame.

# View the first five rows of the table

data.head()

Dataset Dimensions

Knowing the dimensions of the dataset helps you understand how much data you'll be working with. The shape attribute returns a tuple with the number of rows and columns.

# Dataset dimensions

print(f'Dataset size: {data.shape}')

DataFrame Observation

Before starting data cleaning, you need to check whether any data is missing. The info() method lists each column with its data type and its number of non-null values.

# Column names, data types, and non-null counts

data.info()
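If you want the numerical and non-numerical columns as separate lists, select_dtypes() is one way to get them. The sketch below uses a small hypothetical DataFrame; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample with one text column and two numeric columns.
sample = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C"],
    "Spotify Streams": [100.0, None, 300.0],
    "Explicit Track": [0, 1, 1],
})

# Split the column names by data type.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
other_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(f"Numerical columns: {numeric_columns}")
print(f"Non-numerical columns: {other_columns}")
```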

Null Data and Duplicate Data

After observing that some columns have missing data, the next step is to count the null values and duplicate rows. To get the totals, chain the sum() method to isnull() and duplicated().

# Number of null values per column

data.isnull().sum()

# Number of duplicate rows

data.duplicated().sum()
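Absolute counts are easier to judge next to the size of the dataset, so it can help to express them as percentages. This is a small sketch on hypothetical data, not a step from the original exercise.

```python
import pandas as pd

# Hypothetical sample with gaps in two columns.
sample = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C", "Song D"],
    "Artist": ["Artist 1", None, "Artist 3", "Artist 4"],
    "Spotify Streams": [100.0, None, None, 400.0],
})

# isnull() returns booleans, so the column mean is the share of missing values.
missing_percent = sample.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_percent)
```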

Data Cleaning

The following cleaning steps are needed to obtain a consistent dataset.

Duplicate Rows

The drop_duplicates() method removes duplicate rows, keeping the first occurrence of each.

# Find all duplicate records

duplicated_rows = data[data.duplicated()]

# Display duplicate records

print(duplicated_rows)

# Remove duplicate rows

print(f'Dataset size before removing duplicate rows: {data.shape}')
data.drop_duplicates(inplace=True) 
print(f'Dataset size after removing duplicate rows: {data.shape}')

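By default, duplicated() and drop_duplicates() only flag rows that match in every column. As a sketch of a stricter check, the subset parameter compares only the columns you choose. The sample data is hypothetical, and whether two rows with the same track and artist count as duplicates is an assumption you'd need to verify for your own data.

```python
import pandas as pd

# Hypothetical sample: the same song appears twice with different stream counts.
sample = pd.DataFrame({
    "Track": ["Song A", "Song A", "Song B"],
    "Artist": ["Artist 1", "Artist 1", "Artist 2"],
    "Spotify Streams": [100, 105, 300],
})

# No row is an exact copy of another, so the default check finds nothing.
exact_duplicates = sample.duplicated().sum()

# Comparing only Track and Artist flags the second 'Song A' row.
same_song = sample.duplicated(subset=["Track", "Artist"]).sum()

# keep='first' (the default) retains the first occurrence of each song.
deduplicated = sample.drop_duplicates(subset=["Track", "Artist"], keep="first")

print(exact_duplicates, same_song, len(deduplicated))
```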

Null Rows

The first step is to filter the rows where Artist is null and remove them.

# Filter rows where 'Artist' is null

null_artists = data[data['Artist'].isnull()]

# Display the indices of rows with null values in 'Artist'

print("\nIndices of artists that are null:")
print(null_artists.index.tolist())

# Remove null artists

print(f"Number of null artists before removing them: {data['Artist'].isnull().sum()}")
data.dropna(subset=['Artist'], inplace=True)
print(f"Number of null artists after removing them: {data['Artist'].isnull().sum()}")
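The cleaning steps so far modify data in place. As an alternative sketch, the same two steps can be chained so that the original DataFrame stays untouched and the row index is renumbered afterwards. The names raw and cleaned and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample with one duplicate row and one missing artist.
raw = pd.DataFrame({
    "Track": ["Song A", "Song A", "Song B", "Song C"],
    "Artist": ["Artist 1", "Artist 1", None, "Artist 3"],
})

# Each method returns a new DataFrame, so 'raw' is left unchanged.
cleaned = (
    raw.drop_duplicates()
       .dropna(subset=["Artist"])
       .reset_index(drop=True)  # renumber rows 0..n-1 after the removals
)

print(f"Rows before: {len(raw)}, rows after: {len(cleaned)}")
```

Keeping the raw data around makes it easy to compare before and after, at the cost of holding two copies in memory.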

Transforming the Data

Since the objective of the analysis is to explore only Spotify data, the columns corresponding to other music platforms are removed.

# Remove columns that are not considered for the main objective
# Define the list of columns to remove

columns_to_drop = [
    'YouTube Views', 'YouTube Likes', 'TikTok Posts', 'TikTok Likes', 'TikTok Views', 
    'YouTube Playlist Reach', 'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins', 
    'Deezer Playlist Count', 'Deezer Playlist Reach', 'Amazon Playlist Count', 'Pandora Streams', 
    'Pandora Track Stations', 'Soundcloud Streams', 'Shazam Counts', 'TIDAL Popularity'
]

# Remove the columns

data.drop(columns=columns_to_drop, inplace=True)

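In a notebook, running the cell that drops the columns a second time raises a KeyError, because the columns are already gone. A minimal sketch of one way around that, using hypothetical sample data, is the errors='ignore' option of drop():

```python
import pandas as pd

# Hypothetical sample with a single non-Spotify column.
sample = pd.DataFrame({
    "Track": ["Song A"],
    "Spotify Streams": [100],
    "YouTube Views": [50],
})

# 'TikTok Posts' is not present in this sample.
columns_to_drop = ["YouTube Views", "TikTok Posts"]

# errors='ignore' skips labels that are not found instead of raising KeyError.
trimmed = sample.drop(columns=columns_to_drop, errors="ignore")

# Running the same call on the result is harmless.
trimmed_again = trimmed.drop(columns=columns_to_drop, errors="ignore")
print(trimmed_again.columns.tolist())
```

The trade-off is that a typo in a column name also goes unnoticed, so the stricter default is safer when you run the cell only once.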

Data Visualization

After performing the data loading, cleaning, and transformation processes, the next step is to visualize the information requested by the exercise.
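The charts in this section group by a Year column, and the last step takes the maximum of Spotify Streams, so both need to be numeric. This article doesn't show how they are prepared. The sketch below is one possible way, and it rests on two assumptions that the article doesn't confirm: that the raw file has a 'Release Date' column with month/day/year text, and that stream counts are stored as text with thousands separators. All sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample; 'Release Date' and the comma-formatted counts are
# assumptions about the raw CSV, not something the article confirms.
sample = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C"],
    "Release Date": ["4/26/2024", "11/29/2019", "not a date"],
    "Spotify Streams": ["390,470,936", "4,281,468,720", None],
})

# Parse the dates; values that cannot be parsed become NaT instead of raising.
release_date = pd.to_datetime(
    sample["Release Date"], format="%m/%d/%Y", errors="coerce"
)

# The nullable integer dtype keeps years as integers even with missing values.
sample["Year"] = release_date.dt.year.astype("Int64")

# Strip the thousands separators, then convert to numbers (NaN where missing).
sample["Spotify Streams"] = pd.to_numeric(
    sample["Spotify Streams"].str.replace(",", "", regex=False), errors="coerce"
)

print(sample[["Track", "Year", "Spotify Streams"]])
```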

Songs by Year

# Count the number of songs by year

songs_by_year = data['Year'].value_counts().sort_index()

# Create the chart

plt.figure(figsize=(10, 6))
songs_by_year.plot(kind='bar', color='skyblue')
plt.title('Number of Songs by Year')
plt.xlabel('Year')
plt.ylabel('Number of Songs')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='-', alpha=0.7)

# Display the chart

plt.tight_layout()
plt.show()

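The sort_index() call in the code above matters because value_counts() orders its result by frequency, not by year. A small sketch with hypothetical values shows the difference:

```python
import pandas as pd

# Hypothetical release years.
years = pd.Series([2023, 2024, 2024, 2022, 2024, 2023], name="Year")

by_frequency = years.value_counts()          # most common year first
by_year = years.value_counts().sort_index()  # chronological order

print(by_frequency.index.tolist())
print(by_year.index.tolist())
```

Without the sort, the bars would appear in order of height rather than along a timeline.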

Song Percentage: Explicit vs Non-Explicit

# Total songs with explicit lyrics
# Count the number of occurrences of 0 and 1

value_counts = data['Explicit Track'].value_counts()

# Map binary values to explicit labels

labels = ['Explicit', 'Non-Explicit']
sizes = [value_counts.get(1, 0), value_counts.get(0, 0)]

# Create the pie chart

plt.figure(figsize=(4, 8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=['skyblue', 'salmon'])
plt.title('Song Distribution: Explicit vs Non-Explicit')

# Display the chart

plt.show()

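To check the percentages in the pie chart as numbers, value_counts(normalize=True) returns proportions directly. This is a minimal sketch on hypothetical flags, assuming the column holds 0 and 1 as in the code above.

```python
import pandas as pd

# Hypothetical flags: 1 = explicit, 0 = non-explicit.
explicit_flags = pd.Series([0, 1, 0, 0, 1, 0, 0, 0], name="Explicit Track")

# normalize=True returns proportions instead of counts.
percentages = (
    explicit_flags.value_counts(normalize=True)
    .mul(100)
    .round(1)
    .rename(index={0: "Non-Explicit", 1: "Explicit"})
)
print(percentages)
```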

Most Listened to Songs by Year with and without Explicit Content

# Filter explicit and non-explicit songs

explicit_data = data[data['Explicit Track'] == 1]
no_explicit_data = data[data['Explicit Track'] == 0]

# Group by year with explicit content and without explicit content

explicit_track = explicit_data.groupby('Year')['Track'].count().reset_index()
no_explicit_track = no_explicit_data.groupby('Year')['Track'].count().reset_index()

# Rename columns to unify the DataFrame

explicit_track.rename(columns={'Track': 'Count'}, inplace=True)
explicit_track['Explicit'] = 'Yes'
no_explicit_track.rename(columns={'Track': 'Count'}, inplace=True)
no_explicit_track['Explicit'] = 'No'

# Merge the two DataFrames

data_combined = pd.concat([explicit_track, no_explicit_track])

# Create the chart using Seaborn

plt.figure(figsize=(12, 6))
sns.set_style("whitegrid")

# Create bar chart

sns.barplot(data=data_combined, x='Year', y='Count', hue='Explicit')

# Add title and labels

plt.title('Songs by Year According to Their Content')
plt.xlabel('Year')
plt.ylabel('Number of Songs')

# Display the chart

plt.show()

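The filter, group, rename, and concat steps above can also be condensed. As an alternative sketch (not the approach used in the exercise), pd.crosstab() builds the same year-by-content counts in one call. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample: two songs from 2023 and three from 2024.
sample = pd.DataFrame({
    "Year": [2023, 2023, 2024, 2024, 2024],
    "Explicit Track": [0, 1, 1, 1, 0],
})

# Rows are years, columns are the explicit flag, cells are song counts.
counts = pd.crosstab(sample["Year"], sample["Explicit Track"])
counts = counts.rename(columns={0: "No", 1: "Yes"})
print(counts)
```

Calling counts.plot(kind='bar') on the result draws one group of bars per year, which is a quick way to compare it with the seaborn chart.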

Song with the Most Streams

# Identify the row with the most listened to song

most_listened_song = data.loc[data['Spotify Streams'].idxmax()]
print(f"The song with the most streams is '{most_listened_song['Track']}' by {most_listened_song['Artist']} with {most_listened_song['Spotify Streams']} streams.")

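If you want more than the single top song, nlargest() returns the top rows in one call. The sketch below uses hypothetical data and assumes that Spotify Streams is already a numeric column.

```python
import pandas as pd

# Hypothetical sample with one missing stream count.
sample = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C", "Song D"],
    "Artist": ["Artist 1", "Artist 2", "Artist 3", "Artist 4"],
    "Spotify Streams": [300, 900, None, 500],
})

# Take the three rows with the highest stream counts, in descending order.
top_three = sample.nlargest(3, "Spotify Streams")[
    ["Track", "Artist", "Spotify Streams"]
]
print(top_three)
```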

Conclusions

After exploring and visualizing the data of the most listened to songs on Spotify in 2024, I've drawn the following insights.

In the chart of songs by year according to their content, you can see an increase in the number of songs with explicit content from 2015 onwards. This increase may be due to the following factors:

  • Increase in new artists who use more explicit language.
  • Emergence or fusion of new musical styles.
  • Song lyrics that reflect society for advocacy purposes.
  • Other reasons.

Another result I found curious is that the song with the most streams, which is one of my favorites, is not the one with the highest score. This raises the question: what is the key to a song's success?
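A comparison like this one can be checked with two idxmax() calls. The sketch below uses hypothetical data, and the column name 'Track Score' is an assumption about how the score is stored in the dataset.

```python
import pandas as pd

# Hypothetical sample; 'Track Score' is an assumed column name.
sample = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C"],
    "Spotify Streams": [4_200, 3_100, 1_500],
    "Track Score": [310.5, 725.4, 290.0],
})

# Look up the track name at the row where each column reaches its maximum.
most_streamed = sample.loc[sample["Spotify Streams"].idxmax(), "Track"]
highest_score = sample.loc[sample["Track Score"].idxmax(), "Track"]

print(f"Most streamed: {most_streamed}, highest score: {highest_score}")
print(f"Same song: {most_streamed == highest_score}")
```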

🚀 Want to explore the project further?

I hope this article has been useful to you.
