Netflix Data Analysis with Python: Beginner-Friendly Project with Code & Insights

KANGKAN KALITA

🗂️ Project Overview:

In this Netflix data analysis project, we’ll explore a real-world Netflix dataset using Python. You’ll learn how to clean data, extract insights, and visualize trends using libraries like Pandas, Matplotlib, and Seaborn. It’s a good fit for Python beginners who want to build data analysis skills with a practical, fun project.

📥 Dataset Link:

We’ll use the Netflix Movies and TV Shows dataset from Kaggle:
🔗 Netflix Titles Dataset on Kaggle

Download the dataset (netflix_titles.csv) directly from Kaggle after logging in.

🧰 Tools Required:

  • Python 3.x
  • Jupyter Notebook or Google Colab
  • Libraries: pandas, matplotlib, seaborn

Install libraries (if needed):

pip install pandas matplotlib seaborn

🧪 Step 1: Import Libraries and Load the Dataset

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("netflix_titles.csv")

# Display first 5 rows
df.head()

🧼 Step 2: Explore and Clean the Dataset

# Check data shape and info
print("Dataset shape:", df.shape)
df.info()

# Check for null values
df.isnull().sum()
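Raw null counts are easier to judge as a share of each column. Here is a minimal sketch on a tiny made-up DataFrame (the rows are hypothetical, not taken from the Kaggle file) that converts missing values to percentages:

```python
import pandas as pd

# Tiny stand-in for netflix_titles.csv (hypothetical rows, for illustration only)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Jane Doe", None, None, "John Roe"],
    "country": ["India", "United States", None, "Japan"],
})

# isnull() gives True/False per cell; the mean of booleans is the missing share
missing_pct = sample.isnull().mean().mul(100).round(1)

# Columns with the most gaps come first
print(missing_pct.sort_values(ascending=False))
```

On the real data, swap `sample` for `df`. Columns with a high percentage are usually better filled than dropped.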

Handle missing data:

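# Fill missing 'director' and 'cast' with 'Unknown'
# (assign back instead of using inplace=True on a column, which newer pandas versions warn about)
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')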

# Drop rows with missing 'date_added' or 'country'
df.dropna(subset=['date_added', 'country'], inplace=True)

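# Convert 'date_added' to datetime
# (strip whitespace first; a stray leading space in some rows can break date parsing in pandas 2.x)
df['date_added'] = pd.to_datetime(df['date_added'].str.strip())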
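If date parsing raises an error on your copy of the file, untidy strings are a likely cause. The sketch below uses a few made-up values and assumes the dates follow a "Month day, Year" pattern. It strips whitespace, parses with an explicit format, and turns anything unparseable into `NaT` instead of failing:

```python
import pandas as pd

# Hypothetical raw values: one clean, one with a leading space, one missing
raw_dates = pd.Series(["September 25, 2021", " August 4, 2017", None])

# Strip whitespace, then parse; errors="coerce" yields NaT for bad values
parsed = pd.to_datetime(
    raw_dates.str.strip(),
    format="%B %d, %Y",   # assumed pattern, e.g. "August 4, 2017"
    errors="coerce",
)

print(parsed)
print("Unparsed rows:", parsed.isna().sum())
```

Checking the count of `NaT` values afterwards tells you how many rows were lost to parsing, which is worth knowing before you plot trends over time.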

📊 Step 3: Data Questions and Visualizations

Let’s explore insights visually.


🔎 Q1: What type of content is most common on Netflix?

# Count of Movies vs TV Shows
df['type'].value_counts().plot(kind='bar', color=['red', 'blue'])
plt.title('Content Type Distribution')
plt.xlabel('Type')
plt.ylabel('Count')
plt.show()
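A bar chart shows which type is larger, but not by how much. As a small sketch on made-up values, `value_counts(normalize=True)` gives the split as percentages:

```python
import pandas as pd

# Hypothetical 'type' values standing in for df['type']
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns fractions instead of counts
share = types.value_counts(normalize=True).mul(100).round(1)
print(share)
```

With the real dataset, replace `types` with `df['type']`.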

📆 Q2: How has content changed over the years?

# Extract year from date_added
df['year_added'] = df['date_added'].dt.year

# Group by year
content_per_year = df['year_added'].value_counts().sort_index()

# Plot trend over time
content_per_year.plot(kind='line', marker='o')
plt.title('Content Added Over the Years')
plt.xlabel('Year')
plt.ylabel('Number of Titles')
plt.grid(True)
plt.show()
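The single line above mixes Movies and TV Shows together. One possible way to separate them, sketched here on a few made-up rows, is a cross-tabulation of year against type:

```python
import pandas as pd

# Hypothetical rows standing in for df[['type', 'year_added']]
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "year_added": [2018, 2018, 2019, 2019, 2019],
})

# Rows are years, columns are content types, cells are title counts
per_year_by_type = pd.crosstab(sample["year_added"], sample["type"])
print(per_year_by_type)
```

Calling `per_year_by_type.plot(kind='line', marker='o')` on the result draws one line per content type.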

🌎 Q3: Which countries produce the most Netflix content?

top_countries = df['country'].value_counts().head(10)

sns.barplot(x=top_countries.values, y=top_countries.index, palette='viridis')
plt.title('Top 10 Countries by Number of Titles')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.show()
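One caveat: a title can list several countries in a single comma-separated string, and `value_counts()` treats each distinct string as its own category. A minimal sketch, using made-up values, that splits those strings so every country is counted on its own:

```python
import pandas as pd

# Hypothetical 'country' values, including co-productions and a missing entry
countries = pd.Series(
    ["United States", "United States, India", "India, United Kingdom", None],
    name="country",
)

# Split on commas, give each country its own row, and tidy up whitespace
split_countries = countries.dropna().str.split(",").explode().str.strip()

# Drop empty strings that stray commas can leave behind
split_countries = split_countries[split_countries != ""]

country_counts = split_countries.value_counts()
print(country_counts)
```

On the real data, `country_counts.head(10)` can be passed to the same bar plot in place of `top_countries`.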


⏱️ Q4: What is the distribution of movie durations?

Netflix includes short films, full-length movies, and miniseries. Let’s focus on movies and analyze their duration.

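# Filter only Movies (copy so the new column below does not trigger SettingWithCopyWarning)
movies_df = df[df['type'] == 'Movie'].copy()

# Extract numeric duration (e.g., "90 min" → 90); the raw string keeps \d a valid regex escape
movies_df['duration_int'] = movies_df['duration'].str.extract(r'(\d+)', expand=False).astype(float)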

# Plot distribution
plt.figure(figsize=(10,6))
sns.histplot(movies_df['duration_int'], bins=30, kde=True, color='coral')
plt.title('Distribution of Movie Durations')
plt.xlabel('Duration (minutes)')
plt.ylabel('Number of Movies')
plt.show()

🧪 Insight:

Most Netflix movies are around 90–100 minutes, with a sharp drop-off after 120 minutes.
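It helps to back up a reading of the histogram with numbers. The sketch below runs on a handful of made-up durations (not real dataset values) and computes summary statistics plus the share of movies under 120 minutes:

```python
import pandas as pd

# Hypothetical 'duration' strings standing in for movies_df['duration']
durations = pd.Series(["90 min", "95 min", "120 min", "45 min", None], name="duration")

# Pull out the number of minutes; missing values stay NaN
minutes = durations.str.extract(r"(\d+)", expand=False).astype(float)

# count, mean, quartiles, min and max in one call
summary = minutes.describe()
print(summary)

# Percentage of movies shorter than 120 minutes (ignoring missing values)
share_under_120 = (minutes.dropna() < 120).mean() * 100
print(f"Under 120 minutes: {share_under_120:.1f}%")
```

On the real data, use `movies_df['duration_int']` in place of `minutes` and compare the median with what the histogram suggests.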


📺 Q5: What’s the distribution of Netflix content ratings?

Let’s visualize how Netflix categorizes its shows and movies by audience rating.

plt.figure(figsize=(12,6))
sns.countplot(data=df, x='rating', order=df['rating'].value_counts().index[:10], palette='Set2')
plt.title('Top Content Ratings on Netflix')
plt.xlabel('Rating')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.show()

🧪 Insight:

TV-MA and TV-14 are the most common ratings, indicating mature and teen content dominates.
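To test the "mature and teen" reading more directly, ratings can be grouped into audience bands. The mapping below is an assumption made for illustration, not an official Netflix classification, and the sample values are made up:

```python
import pandas as pd

# Assumed grouping of ratings into audience bands (illustrative, not official)
AUDIENCE_BY_RATING = {
    "TV-MA": "Adults", "R": "Adults", "NC-17": "Adults",
    "TV-14": "Teens", "PG-13": "Teens",
    "TV-PG": "Older Kids", "PG": "Older Kids",
    "TV-Y": "Kids", "TV-Y7": "Kids", "TV-G": "Kids", "G": "Kids",
}

# Hypothetical 'rating' values standing in for df['rating']
ratings = pd.Series(["TV-MA", "TV-14", "TV-MA", "PG-13", "TV-Y", "NR"], name="rating")

# Ratings missing from the mapping (e.g. "NR") fall into "Other"
audience = ratings.map(AUDIENCE_BY_RATING).fillna("Other")
audience_counts = audience.value_counts()
print(audience_counts)
```

Adjust the dictionary to suit your own definition of each band before drawing conclusions from the totals.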


🧾 Summary of Findings

Let’s recap the key insights from this Netflix data analysis:

  • Movies dominate Netflix’s catalog over TV Shows.
  • Content additions peaked around 2018–2019, with a slowdown in recent years.
  • The U.S., India, and the U.K. lead in content production.
  • Most movies are under 120 minutes, clustered around the 90-minute mark.
  • TV-MA and TV-14 are the most common ratings, suggesting a focus on mature and teen audiences.

💡 Project Extension Ideas

If you want to take this project further, here are a few ideas:

  • Analyze the most frequent directors or actors.
  • Track genre popularity over time.
  • Cluster content by language or region.
  • Create an interactive dashboard using Plotly or Streamlit.
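As a starting point for the genre idea, here is a rough sketch on made-up rows. It assumes genres sit in a comma-separated `listed_in` column and that a `year_added` column already exists:

```python
import pandas as pd

# Hypothetical rows; 'listed_in' is assumed to hold comma-separated genres
sample = pd.DataFrame({
    "year_added": [2018, 2019, 2019],
    "listed_in": ["Dramas, Comedies", "Dramas", "Comedies, Documentaries"],
})

# One row per (title, genre) pair
genres = sample.assign(genre=sample["listed_in"].str.split(",")).explode("genre")
genres["genre"] = genres["genre"].str.strip()

# Rows are years, columns are genres, cells are title counts
genre_by_year = pd.crosstab(genres["year_added"], genres["genre"])
print(genre_by_year)
```

The same split-and-explode pattern should also work for directors or cast members, since those fields can hold several names in one string.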

📁 Bonus: Save Notebook for Download

If you want to export the project as a .ipynb notebook:

# In Jupyter, go to File > Download as > Notebook (.ipynb)
# Or in Google Colab: File > Download > Download .ipynb
    
