Exploratory Data Analysis on COVID-19 Data using Python

By KANGKAN KALITA, Data Scientist at LeadTech Group

Exploratory Data Analysis (EDA) is one of the most essential steps in the data science pipeline. It involves understanding the dataset, summarizing its main characteristics, and visualizing patterns using statistical graphics and other data visualization methods. EDA is critical because it helps uncover insights, detect anomalies, and lay the groundwork for feature selection and predictive modeling.

In this Python data science tutorial, we will perform an in-depth COVID-19 analysis using EDA techniques. The dataset contains daily records of confirmed cases, deaths, and recoveries from different countries and regions. Visualizing it lets us track the pandemic’s spread and impact across time and geography.

We will use the following Python libraries for our analysis:

Pandas: for data manipulation and analysis
NumPy: for numerical operations
Matplotlib: for data visualization
Seaborn: for advanced plots
Plotly (optional): for interactive visualizations

By the end of this tutorial, students and beginners in data science will gain a practical understanding of how to perform data analysis with Python.

2. Loading the Dataset

The first step in our COVID-19 Analysis is loading the dataset into our Python environment.

# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Optional for interactive visualization
import plotly.express as px

# Reading the COVID-19 dataset
data = pd.read_csv('covid_19_data.csv')

# Displaying the first few rows of the dataset
data.head()

The dataset typically contains the following key columns:

ObservationDate: the date of the record
Country/Region: the country or region reported
Confirmed: confirmed cases
Deaths: deaths
Recovered: recoveries
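Before going further, it can help to confirm that the file really has these columns. The sketch below is one way to do that; the helper name `missing_columns` and the tiny in-memory sample are illustrative assumptions, not part of the original dataset or tutorial code.

```python
import io
import pandas as pd

# Columns the code in this tutorial relies on (taken from the snippets in this post)
REQUIRED_COLUMNS = ['ObservationDate', 'Country/Region', 'Confirmed', 'Deaths', 'Recovered']

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Tiny made-up sample standing in for covid_19_data.csv
sample_csv = io.StringIO(
    "ObservationDate,Country/Region,Confirmed,Deaths,Recovered\n"
    "01/22/2020,Testland,1,0,0\n"
    "01/23/2020,Testland,3,0,1\n"
)
sample = pd.read_csv(sample_csv)

print(missing_columns(sample))                              # nothing missing
print(missing_columns(sample.drop(columns=['Recovered'])))  # one column missing
```

On the real file, the same call would be `missing_columns(data)`; an empty list means the rest of the tutorial's column names should resolve.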

3. Initial Data Exploration

Before diving deep into analysis, it is important to understand the basic structure and content of the dataset.

# Checking the shape of the dataset
print("Dataset Shape:", data.shape)

# Data types and non-null values
data.info()

# Summary statistics
data.describe()

# Checking for missing values
data.isnull().sum()

Initial exploration helps identify:

Missing values
Irregular data types
Columns with low variance
Outliers or anomalies
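The checks above can be gathered into a single per-column overview. This is a minimal sketch on made-up data; `quality_report` is a hypothetical helper name, and treating "one distinct value" as the low-variance test is a simplifying assumption.

```python
import numpy as np
import pandas as pd

def quality_report(df):
    """Summarise dtype, missing values, and distinct values for each column."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isnull().sum(),
        'missing_pct': (df.isnull().mean() * 100).round(1),
        'n_unique': df.nunique(),
    })

# Made-up sample with one missing value and one constant column
sample = pd.DataFrame({
    'Country/Region': ['A', 'A', 'B', 'B'],
    'Confirmed': [1.0, 5.0, np.nan, 2.0],
    'Deaths': [0, 0, 0, 0],
})
report = quality_report(sample)
print(report)

# Columns with a single distinct value carry no information for EDA
constant_cols = report.index[report['n_unique'] <= 1].tolist()
print(constant_cols)
```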

4. Data Cleaning

Clean data is essential for reliable analysis. Here are the steps involved in cleaning the COVID-19 dataset:

Handling Missing Values:

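# Filling missing case counts with 0 (text columns are left untouched)
count_cols = ['Confirmed', 'Deaths', 'Recovered']
data[count_cols] = data[count_cols].fillna(0)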

Removing Duplicates:

# Removing duplicates
data.drop_duplicates(inplace=True)

Converting Date Column:

# Converting 'ObservationDate' to datetime format
data['ObservationDate'] = pd.to_datetime(data['ObservationDate'])

Creating Active Cases Column:

# Creating 'ActiveCases' column
data['ActiveCases'] = data['Confirmed'] - data['Deaths'] - data['Recovered']

These steps ensure that the dataset is ready for meaningful analysis.
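The cleaning steps can also be wrapped in one function that returns a new frame instead of modifying the original in place. The following is a sketch under that assumption; `clean_covid` is a hypothetical name and the sample rows are made up, including one deliberately inconsistent row.

```python
import numpy as np
import pandas as pd

def clean_covid(df):
    """Return a cleaned copy; the input frame is left unchanged."""
    out = df.copy()
    count_cols = ['Confirmed', 'Deaths', 'Recovered']
    out[count_cols] = out[count_cols].fillna(0)
    out = out.drop_duplicates()
    out['ObservationDate'] = pd.to_datetime(out['ObservationDate'])
    out['ActiveCases'] = out['Confirmed'] - out['Deaths'] - out['Recovered']
    return out.reset_index(drop=True)

raw = pd.DataFrame({
    'ObservationDate': ['01/22/2020', '01/22/2020', '01/23/2020'],
    'Country/Region': ['Testland', 'Testland', 'Testland'],
    'Confirmed': [10, 10, 15],
    'Deaths': [1, 1, 2],
    'Recovered': [np.nan, np.nan, 20],   # 20 > 15 is a deliberate inconsistency
})
cleaned = clean_covid(raw)
print(cleaned)

# Negative active counts point to inconsistent source rows worth inspecting
suspect = cleaned[cleaned['ActiveCases'] < 0]
print(len(suspect))
```

Checking for negative `ActiveCases` is a cheap sanity test: the subtraction can only go below zero when the reported deaths and recoveries exceed the confirmed count.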

5. Univariate Analysis

Univariate analysis focuses on examining the distribution of individual variables.

Total Confirmed Cases Distribution:

plt.figure(figsize=(10,5))
sns.histplot(data['Confirmed'], bins=50, kde=True)
plt.title('Distribution of Confirmed COVID-19 Cases')
plt.xlabel('Confirmed Cases')
plt.ylabel('Frequency')
plt.show()

Top Countries by Total Cases:

top_countries = data.groupby('Country/Region')['Confirmed'].sum().sort_values(ascending=False).head(10)
top_countries.plot(kind='bar', color='skyblue')
plt.title('Top 10 Countries by Confirmed COVID-19 Cases')
plt.xlabel('Country')
plt.ylabel('Total Confirmed Cases')
plt.xticks(rotation=45)
plt.show()

This step helps identify the most affected regions and understand the data distribution.
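One caveat when ranking countries: if the counts in your file are cumulative running totals (an assumption to verify for your copy of the data), summing every date counts earlier days again and again. A sketch of the alternative, using made-up records, is to take only the most recent date and then add up the rows within each country.

```python
import pandas as pd

# Made-up cumulative records: two provinces for country A, one row for country B
records = pd.DataFrame({
    'ObservationDate': pd.to_datetime(
        ['2020-03-01', '2020-03-01', '2020-03-01',
         '2020-03-02', '2020-03-02', '2020-03-02']),
    'Country/Region': ['A', 'A', 'B', 'A', 'A', 'B'],
    'Confirmed': [10, 5, 7, 12, 8, 9],
})

# Keep only the most recent date, then add up provinces within each country
latest = records[records['ObservationDate'] == records['ObservationDate'].max()]
latest_totals = (latest.groupby('Country/Region')['Confirmed']
                       .sum()
                       .sort_values(ascending=False))
print(latest_totals)

# Summing every date instead counts the earlier days again
naive_totals = records.groupby('Country/Region')['Confirmed'].sum()
print(naive_totals)
```

If the counts are instead daily new cases, the plain sum over all dates is the right total and this step is unnecessary.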

6. Bivariate and Multivariate Analysis

While univariate analysis helps us understand individual variables, bivariate and multivariate analysis allow us to explore relationships between variables, uncovering patterns and correlations in the data. This is an important step in any COVID-19 analysis project using Python.

Analyzing Correlation Between Variables

One of the most basic techniques for multivariate analysis is computing the correlation matrix to identify how strongly features like confirmed cases, deaths, and recoveries are related to each other.

import seaborn as sns
import matplotlib.pyplot as plt

# Compute correlation matrix
correlation_matrix = data[['Confirmed', 'Recovered', 'Deaths', 'ActiveCases']].corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='Blues', linewidths=0.5)
plt.title('Correlation Matrix of COVID-19 Cases')
plt.show()

Insights:

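A strong positive correlation between confirmed cases and deaths indicates that records with more confirmed cases also tend to report more deaths; correlation alone does not show that one causes the other.
Active cases also show a high correlation with confirmed cases, which is expected because active cases are computed from confirmed cases.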
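Running totals that only ever increase tend to correlate strongly simply because they all grow over time. As a rough check, the correlation can also be computed on day-over-day changes. The numbers below are made up purely to show that the two views can disagree; real data may behave differently.

```python
import pandas as pd

# Made-up cumulative global totals for six consecutive days
totals = pd.DataFrame({
    'Confirmed': [10, 30, 40, 80, 90, 150],
    'Deaths':    [1, 2, 6, 7, 12, 13],
}, index=pd.date_range('2020-03-01', periods=6, freq='D'))

# Correlation of the cumulative series (both only ever increase)
cumulative_corr = totals['Confirmed'].corr(totals['Deaths'])

# Correlation of the day-over-day changes
daily = totals.diff().dropna()
daily_corr = daily['Confirmed'].corr(daily['Deaths'])

print(round(cumulative_corr, 2), round(daily_corr, 2))
```

In this toy example the cumulative series are strongly positively correlated while the daily changes are negatively correlated, which is why it is worth looking at both before drawing conclusions.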

Line Plot: Confirmed vs Deaths Over Time

Line plots are excellent for showing trends over time.

# Grouping the data by date
time_series = data.groupby('ObservationDate')[['Confirmed', 'Deaths']].sum().reset_index()

# Plotting
plt.figure(figsize=(12, 6))
plt.plot(time_series['ObservationDate'], time_series['Confirmed'], label='Confirmed Cases', color='blue')
plt.plot(time_series['ObservationDate'], time_series['Deaths'], label='Deaths', color='red')
plt.xlabel('Date')
plt.ylabel('Count')
plt.title('COVID-19: Confirmed vs Deaths Over Time')
plt.legend()
plt.grid(True)
plt.show()

This plot shows how the death toll has changed relative to confirmed cases over time.
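Because deaths are usually far smaller than confirmed cases, the deaths line can look almost flat on a shared linear axis. One option is a logarithmic axis (`plt.yscale('log')`); another, sketched below on made-up records, is to track deaths as a percentage of confirmed cases. The column name `DeathsPct` is an illustrative choice.

```python
import pandas as pd

# Made-up records for two countries on two dates
records = pd.DataFrame({
    'ObservationDate': pd.to_datetime(
        ['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-02']),
    'Country/Region': ['A', 'B', 'A', 'B'],
    'Confirmed': [100, 100, 300, 100],
    'Deaths': [2, 0, 6, 2],
})

time_series = (records.groupby('ObservationDate')[['Confirmed', 'Deaths']]
                      .sum()
                      .reset_index())

# Deaths as a percentage of confirmed cases on each date
time_series['DeathsPct'] = 100 * time_series['Deaths'] / time_series['Confirmed']
print(time_series)
```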

7. Time Series Analysis

In any Python data science tutorial, analyzing time-series data is a key learning goal. COVID-19 datasets are time-indexed, making them ideal for this type of analysis.

Grouping by Week or Month

You can group the dataset by weeks or months to reduce noise and observe meaningful trends.

# Index the numeric columns by date (ObservationDate was converted to datetime during cleaning)
dated = data.set_index('ObservationDate')[['Confirmed', 'Recovered', 'Deaths']]

# Resample to monthly totals ('M' = month end; pandas 2.2+ prefers the alias 'ME')
monthly_data = dated.resample('M').sum()

# Plotting
monthly_data[['Confirmed', 'Recovered', 'Deaths']].plot(figsize=(12,6))
plt.title('Monthly Trends of COVID-19 Cases')
plt.ylabel('Total Cases')
plt.grid(True)
plt.show()

This smoothed-out view helps in spotting spikes, dips, or surges in COVID-19 cases more clearly.
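If the counts are cumulative (an assumption to check for your file), monthly trends are easier to interpret after converting the running total into daily new cases. The sketch below does this on a made-up series, then aggregates by calendar month and applies a short rolling mean.

```python
import pandas as pd

# Made-up cumulative global totals spanning a month boundary
cumulative = pd.Series(
    [100, 150, 210, 300, 450],
    index=pd.to_datetime(['2020-03-29', '2020-03-30', '2020-03-31',
                          '2020-04-01', '2020-04-02']),
    name='Confirmed',
)

# Day-over-day increase; the first day has no previous value, so it becomes 0
daily_new = cumulative.diff().fillna(0)

# New cases per calendar month
monthly_new = daily_new.groupby(daily_new.index.to_period('M')).sum()
print(monthly_new)

# 3-day rolling mean as a simple smoother (7 days is a common choice on real data)
smoothed = daily_new.rolling(window=3).mean()
print(smoothed)
```

Grouping by `to_period('M')` is used here instead of `resample` only to keep the example independent of the offset alias, which differs between pandas versions.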

8. Country-wise or Region-wise Analysis

Let’s analyze the data at a country level to find out which countries were most impacted.

Top 10 Countries by Confirmed Cases

# Group by Country
country_data = data.groupby('Country/Region')[['Confirmed', 'Deaths', 'Recovered']].sum().sort_values(by='Confirmed', ascending=False)

# Select Top 10
top_10 = country_data.head(10)

# Plotting
top_10['Confirmed'].plot(kind='barh', figsize=(10, 6), color='orange')
plt.title('Top 10 Countries by Confirmed COVID-19 Cases')
plt.xlabel('Confirmed Cases')
plt.gca().invert_yaxis()  # Highest on top
plt.grid(True)
plt.show()

Compare Death and Recovery Counts

top_10[['Deaths', 'Recovered']].plot(kind='bar', figsize=(12,6))
plt.title('Deaths vs Recoveries in Top 10 Affected Countries')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

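These visualizations show which countries reported the most cases and how deaths and recoveries compare among them. Raw counts do not account for population size or for differences in testing and reporting, so they should be read as a starting point rather than a measure of how well each country managed the pandemic.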
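To compare countries of different sizes, deaths and recoveries can be expressed as a percentage of confirmed cases. This is a minimal sketch on made-up totals; the column names `DeathRatePct` and `RecoveryRatePct` are illustrative, and the zero-case guard is an added assumption.

```python
import pandas as pd

# Made-up latest totals per country
country_totals = pd.DataFrame(
    {'Confirmed': [1000, 500, 0],
     'Deaths': [20, 25, 0],
     'Recovered': [800, 300, 0]},
    index=pd.Index(['A', 'B', 'C'], name='Country/Region'),
)

# Replace zero confirmed counts with NaN to avoid division by zero
confirmed = country_totals['Confirmed'].where(country_totals['Confirmed'] > 0)

rates = pd.DataFrame({
    'DeathRatePct': 100 * country_totals['Deaths'] / confirmed,
    'RecoveryRatePct': 100 * country_totals['Recovered'] / confirmed,
}).round(2)
print(rates)
```

The resulting frame can be plotted the same way as the counts, for example with `rates.plot(kind='bar')`.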

9. Advanced Visualizations (Optional)

Interactive Visualizations with Plotly

For interactive, hover-enabled plots, Plotly is an excellent tool:

import plotly.express as px

# Reset index for Plotly
country_data = country_data.reset_index()

# Top 20 countries
fig = px.bar(country_data.head(20), 
             x='Country/Region', y='Confirmed',
             color='Confirmed',
             title='Top 20 Countries with Most Confirmed COVID-19 Cases')

fig.show()

Choropleth Map of Confirmed Cases

fig = px.choropleth(country_data, 
                    locations='Country/Region',
                    locationmode='country names',
                    color='Confirmed',
                    hover_name='Country/Region',
                    color_continuous_scale='Reds',
                    title='Global Spread of COVID-19: Confirmed Cases')
fig.show()

These advanced visualizations can significantly enhance your COVID-19 analysis presentations and dashboards.
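A choropleth that matches on country names only colors the countries whose labels it recognizes, so dataset labels that differ from standard names may leave gaps on the map. The sketch below adds a harmonized name column; the mapping is hypothetical and should be adjusted to the labels actually present in your file.

```python
import pandas as pd

# Hypothetical mapping from dataset labels to more standard country names;
# check the actual labels in your file before relying on it
NAME_FIXES = {
    'US': 'United States',
    'UK': 'United Kingdom',
    'Mainland China': 'China',
}

# Made-up country totals
country_data = pd.DataFrame({
    'Country/Region': ['US', 'Mainland China', 'Brazil'],
    'Confirmed': [300, 200, 100],
})

# Labels without an entry in the mapping are kept as they are
country_data['MapName'] = country_data['Country/Region'].replace(NAME_FIXES)
print(country_data)
```

The new column can then be passed as `locations='MapName'` in the `px.choropleth` call, leaving the original labels intact for other plots.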

10. Insights and Summary

After conducting a complete Exploratory Data Analysis (EDA) on COVID-19 data, here are some key takeaways:

Confirmed cases and deaths are positively correlated.
The number of cases spiked dramatically in certain months (e.g., April 2020, Jan 2021).
The top impacted countries included the USA, India, Brazil, and Russia.
Some countries had higher recovery rates compared to others.
Advanced plots such as choropleth maps reveal geographical spread effectively.

This data analysis with Python forms the foundation for future steps like forecasting, machine learning modeling, or dashboard creation.

11. Conclusion

EDA is an essential step in every data science project. In this Python data science tutorial, we explored how to perform a full-fledged COVID-19 analysis using real-world data and libraries like Pandas, Matplotlib, Seaborn, and Plotly.

This project helped us:

Load, clean, and process real data
Perform univariate and multivariate analysis
Explore trends over time and geography
Derive meaningful, actionable insights

🧰 Tools Used:

Python
Pandas
Matplotlib
Seaborn
Plotly

📂 Dataset Used:

COVID-19 Dataset from Our World in Data or Kaggle: COVID-19 Dataset

📘 Next Steps:

Try conducting EDA on other COVID-related datasets like vaccinations or testing rates. You can also build dashboards using Streamlit or begin predictive modeling using machine learning algorithms.

Thank you for following along!
For more tutorials, visit our website and follow us for weekly Python and data science insights.
