Exploratory Data Analysis on COVID-19 Data using Python

1. Introduction
Exploratory Data Analysis (EDA) is one of the most essential steps in the data science pipeline. It involves understanding the dataset, summarizing its main characteristics, and visualizing patterns using statistical graphics and other data visualization methods. EDA is critical because it helps uncover insights, detect anomalies, and lay the groundwork for feature selection and predictive modeling.
In this tutorial, we will perform an in-depth COVID-19 analysis using EDA techniques in Python. The dataset contains daily records of confirmed cases, deaths, and recoveries from different countries and regions. Visualizing it lets us track the pandemic’s spread and impact across time and geography.
We will use the following Python libraries for our analysis:
- Pandas: for data manipulation and analysis
- NumPy: for numerical operations
- Matplotlib: for data visualization
- Seaborn: for advanced plots
- Plotly (optional): for interactive visualizations
By the end of this tutorial, students and beginners in data science will gain a practical understanding of how to perform data analysis with Python.
2. Loading the Dataset
The first step in our COVID-19 analysis is loading the dataset into our Python environment.

    # Importing necessary libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Optional for interactive visualization
    import plotly.express as px

    # Reading the COVID-19 dataset
    data = pd.read_csv('covid_19_data.csv')

    # Displaying the first few rows of the dataset
    data.head()

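The dataset typically contains the following key columns, named here as they are used in the code:
- ObservationDate: date of the record
- Country/Region: country or region name
- Confirmed: confirmed cases
- Deaths: deaths
- Recovered: recoveries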
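Because the rest of the tutorial relies on these column names, it can help to check them right after loading. The sketch below is one way to do that. It reads a tiny made-up sample from memory instead of covid_19_data.csv, and the `expected` list is an assumption based on the columns used in this tutorial.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for covid_19_data.csv (illustrative values only).
sample_csv = io.StringIO(
    "ObservationDate,Country/Region,Confirmed,Deaths,Recovered\n"
    "01/22/2020,CountryA,10,0,1\n"
    "01/23/2020,CountryA,15,1,2\n"
)

# Parse the date column while reading instead of converting it later.
sample = pd.read_csv(sample_csv, parse_dates=["ObservationDate"])

# Columns this tutorial uses later on (an assumption about your file).
expected = ["ObservationDate", "Country/Region", "Confirmed", "Deaths", "Recovered"]
missing = [col for col in expected if col not in sample.columns]

# Fail early with a clear message if a column is absent or named differently.
if missing:
    raise KeyError(f"Missing expected columns: {missing}")

print(sample.dtypes)
```

If your file uses different names, renaming the columns once at this point keeps the later code unchanged.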
3. Initial Data Exploration
Before diving deep into analysis, it is important to understand the basic structure and content of the dataset.

    # Checking the shape of the dataset
    print("Dataset Shape:", data.shape)

    # Data types and non-null values
    data.info()

    # Summary statistics (wrapped in print so they also show when run as a script)
    print(data.describe())

    # Checking for missing values
    print(data.isnull().sum())

Initial exploration helps identify:
- Missing values
- Irregular data types
- Columns with low variance
- Outliers or anomalies
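The four checks above can be sketched in a few lines of pandas. This example runs on made-up numbers rather than the real dataset. It treats a constant column as the simplest case of low variance, and it uses the common 1.5 × IQR rule for outliers, which is only one of several possible conventions.

```python
import numpy as np
import pandas as pd

# Made-up illustrative data, not real COVID-19 figures.
df = pd.DataFrame({
    "Confirmed": [10, 12, 11, 13, 500, np.nan],
    "Deaths": [0, 1, 0, 1, 1, 0],
    "Source": ["x", "x", "x", "x", "x", "x"],  # constant column
})

# 1. Missing values per column
missing_counts = df.isnull().sum()

# 2. Data types, to spot numbers stored as text
column_types = df.dtypes

# 3. Columns with a single distinct value carry no information
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

# 4. Outliers in 'Confirmed' using the 1.5 * IQR rule
q1 = df["Confirmed"].quantile(0.25)
q3 = df["Confirmed"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["Confirmed"] < q1 - 1.5 * iqr) | (df["Confirmed"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(missing_counts, column_types, constant_cols, outliers, sep="\n\n")
```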
4. Data Cleaning
Clean data is essential for reliable analysis. Here are the steps involved in cleaning the COVID-19 dataset:
Handling Missing Values:
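
    # Filling missing counts with 0 (count columns only, so text columns are not overwritten with 0)
    count_cols = ['Confirmed', 'Deaths', 'Recovered']
    data[count_cols] = data[count_cols].fillna(0)
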
Removing Duplicates:

    # Removing duplicates
    data.drop_duplicates(inplace=True)

Converting Date Column:

    # Converting 'ObservationDate' to datetime format
    data['ObservationDate'] = pd.to_datetime(data['ObservationDate'])

Creating Active Cases Column:

    # Creating 'ActiveCases' column
    data['ActiveCases'] = data['Confirmed'] - data['Deaths'] - data['Recovered']

These steps ensure that the dataset is ready for meaningful analysis.
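A quick sanity check after cleaning can catch rows where the derived column makes no sense. The sketch below uses made-up rows, one of them deliberately inconsistent, and flags any negative `ActiveCases` value. Whether such rows should be corrected, dropped, or kept is a judgment call that depends on your data source.

```python
import pandas as pd

# Made-up rows; the last one has more recoveries than confirmed cases on purpose.
df = pd.DataFrame({
    "ObservationDate": ["01/22/2020", "01/23/2020", "01/24/2020"],
    "Confirmed": [10, 15, 20],
    "Deaths": [0, 1, 1],
    "Recovered": [1, 2, 25],
})

# Same cleaning steps as in this section
df["ObservationDate"] = pd.to_datetime(df["ObservationDate"])
df["ActiveCases"] = df["Confirmed"] - df["Deaths"] - df["Recovered"]

# Active cases should never be negative; list the rows that violate this
suspect = df[df["ActiveCases"] < 0]
print(suspect)
```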
5. Univariate Analysis
Univariate analysis focuses on examining the distribution of individual variables.
Total Confirmed Cases Distribution:

    plt.figure(figsize=(10, 5))
    sns.histplot(data['Confirmed'], bins=50, kde=True)
    plt.title('Distribution of Confirmed COVID-19 Cases')
    plt.xlabel('Confirmed Cases')
    plt.ylabel('Frequency')
    plt.show()

Top Countries by Total Cases:

    top_countries = (
        data.groupby('Country/Region')['Confirmed']
        .sum()
        .sort_values(ascending=False)
        .head(10)
    )
    top_countries.plot(kind='bar', color='skyblue')
    plt.title('Top 10 Countries by Confirmed COVID-19 Cases')
    plt.xlabel('Country')
    plt.ylabel('Total Confirmed Cases')
    plt.xticks(rotation=45)
    plt.show()

This step helps identify the most affected regions and understand the data distribution.
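One caveat about country totals: many public COVID-19 files report cumulative counts, so each date's figure already includes all earlier cases. If that is true of your file (an assumption worth checking), summing over all dates counts the same cases repeatedly. The sketch below uses made-up numbers to show an alternative that keeps only each country's most recent date before totalling.

```python
import pandas as pd

# Made-up cumulative counts for two hypothetical countries.
df = pd.DataFrame({
    "ObservationDate": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02"]
    ),
    "Country/Region": ["CountryA", "CountryA", "CountryB", "CountryB"],
    "Confirmed": [100, 150, 40, 90],
})

# Summing every row adds the earlier cases again on each later date.
summed = df.groupby("Country/Region")["Confirmed"].sum()

# Keep only the rows from each country's latest date, then total them
# (the sum still matters when a country is split into several regions).
latest_date = df.groupby("Country/Region")["ObservationDate"].transform("max")
latest_rows = df[df["ObservationDate"] == latest_date]
latest_totals = (
    latest_rows.groupby("Country/Region")["Confirmed"]
    .sum()
    .sort_values(ascending=False)
)

print(summed)
print(latest_totals)
```

If your counts are daily new cases rather than cumulative totals, the plain sum is the right choice.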
6. Bivariate and Multivariate Analysis
While univariate analysis helps us understand individual variables, bivariate and multivariate analysis let us explore relationships between variables and uncover patterns and correlations in the data.
Analyzing Correlation Between Variables
One of the most basic techniques for multivariate analysis is computing the correlation matrix to identify how strongly features like confirmed cases, deaths, and recoveries are related to each other.

    # Compute correlation matrix (uses the 'data' DataFrame and the 'ActiveCases' column created earlier)
    correlation_matrix = data[['Confirmed', 'Recovered', 'Deaths', 'ActiveCases']].corr()

    # Plot heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(correlation_matrix, annot=True, cmap='Blues', linewidths=0.5)
    plt.title('Correlation Matrix of COVID-19 Cases')
    plt.show()

Insights:
- A strong positive correlation between confirmed cases and deaths means that records with more confirmed cases also tend to report more deaths. Correlation alone does not establish cause.
- Active cases also show a high correlation with confirmed cases, which is expected.
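By default, `corr()` computes the Pearson coefficient, which measures linear association and is sensitive to extreme values. As a rough illustration on made-up numbers, the sketch below compares it with the rank-based Spearman coefficient, which only asks whether the two columns rise together.

```python
import pandas as pd

# Made-up values: deaths rise steadily, confirmed cases have one extreme value.
df = pd.DataFrame({
    "Confirmed": [10, 20, 30, 40, 1000],
    "Deaths": [1, 2, 3, 4, 5],
})

# Pearson (the default) measures linear association
pearson = df.corr(method="pearson").loc["Confirmed", "Deaths"]

# Spearman works on ranks, so one extreme value has less influence
spearman = df.corr(method="spearman").loc["Confirmed", "Deaths"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Here the relationship is perfectly monotonic, so Spearman reaches 1 while Pearson is pulled down by the single large value. Comparing the two on the real data is a cheap way to see how much a few large countries drive the heatmap.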
Line Plot: Confirmed vs Deaths Over Time
Line plots are excellent for showing trends over time.

    # Grouping the data by date (the date column in this dataset is 'ObservationDate')
    time_series = data.groupby('ObservationDate')[['Confirmed', 'Deaths']].sum().reset_index()

    # Plotting
    plt.figure(figsize=(12, 6))
    plt.plot(time_series['ObservationDate'], time_series['Confirmed'], label='Confirmed Cases', color='blue')
    plt.plot(time_series['ObservationDate'], time_series['Deaths'], label='Deaths', color='red')
    plt.xlabel('Date')
    plt.ylabel('Count')
    plt.title('COVID-19: Confirmed vs Deaths Over Time')
    plt.legend()
    plt.grid(True)
    plt.show()

This plot shows how the death toll changed over time relative to confirmed cases.
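On a shared axis, the deaths line can look nearly flat next to confirmed cases because the two counts differ so much in scale. One option is to plot their ratio instead. The sketch below computes deaths per 100 confirmed cases for each date on made-up numbers. The column name `DeathsPer100Confirmed` is a hypothetical name chosen for this example.

```python
import pandas as pd

# Made-up daily totals, already grouped by date.
ts = pd.DataFrame({
    "ObservationDate": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "Confirmed": [100, 200, 400],
    "Deaths": [2, 5, 12],
})

# Deaths per 100 confirmed cases on each date
ts["DeathsPer100Confirmed"] = 100 * ts["Deaths"] / ts["Confirmed"]

print(ts)
```

The resulting column can be plotted with `plt.plot` in the same way as the raw counts.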
7. Time Series Analysis
Analyzing time-series data is a core data science skill. COVID-19 datasets are indexed by date, which makes them well suited to this type of analysis.
Grouping by Week or Month
You can group the dataset by weeks or months to reduce noise and observe meaningful trends.

    # Convert to datetime again if needed
    data['ObservationDate'] = pd.to_datetime(data['ObservationDate'])

    # Index by date and keep only the numeric count columns,
    # so text columns are not summed and 'data' itself stays unchanged
    counts_by_date = data.set_index('ObservationDate')[['Confirmed', 'Recovered', 'Deaths']]

    # Resample to monthly bins ('MS' labels each bin by the first day of the month)
    monthly_data = counts_by_date.resample('MS').sum()

    # Plotting
    monthly_data.plot(figsize=(12, 6))
    plt.title('Monthly Trends of COVID-19 Cases')
    plt.ylabel('Total Cases')
    plt.grid(True)
    plt.show()

This smoothed-out view helps in spotting spikes, dips, or surges in COVID-19 cases more clearly.
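The same idea works at weekly resolution. The sketch below assumes the counts are cumulative, which you should verify for your file, so it first converts them to daily new cases with `diff()` and then totals those per week. The numbers are made up.

```python
import pandas as pd

# Made-up cumulative confirmed counts for six consecutive days.
dates = pd.date_range("2020-03-01", periods=6, freq="D")
cumulative = pd.Series([10, 15, 25, 40, 45, 60], index=dates, name="Confirmed")

# Daily new cases: difference from the previous day.
# The first day has no previous value, so it keeps its own count.
daily_new = cumulative.diff().fillna(cumulative.iloc[0])

# Weekly totals of new cases ('W' bins end on Sunday by default)
weekly_new = daily_new.resample("W").sum()

print(daily_new)
print(weekly_new)
```

Weekly totals of new cases add up to the final cumulative count, which is a handy check that no cases were double counted.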
8. Country-wise or Region-wise Analysis
Let’s analyze the data at a country level to find out which countries were most impacted.
Top 10 Countries by Confirmed Cases

    # Group by country (uses the 'data' DataFrame loaded earlier)
    country_data = (
        data.groupby('Country/Region')[['Confirmed', 'Deaths', 'Recovered']]
        .sum()
        .sort_values(by='Confirmed', ascending=False)
    )

    # Select Top 10
    top_10 = country_data.head(10)

    # Plotting
    top_10['Confirmed'].plot(kind='barh', figsize=(10, 6), color='orange')
    plt.title('Top 10 Countries by Confirmed COVID-19 Cases')
    plt.xlabel('Confirmed Cases')
    plt.gca().invert_yaxis()  # Highest on top
    plt.grid(True)
    plt.show()

Compare Deaths and Recoveries

    top_10[['Deaths', 'Recovered']].plot(kind='bar', figsize=(12, 6))
    plt.title('Deaths vs Recoveries in Top 10 Affected Countries')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.grid(True)
    plt.show()

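These visualizations show which countries reported the most cases and how deaths and recoveries compare among them. Raw counts alone say little about how effectively a country managed the pandemic, so treat them as a starting point for further analysis.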
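Raw counts make large countries dominate the chart. To compare countries on a more even footing, deaths and recoveries can be expressed as a percentage of confirmed cases. The sketch below does this on made-up totals. The column names `DeathRatePct` and `RecoveryRatePct` are hypothetical names chosen for this example.

```python
import pandas as pd

# Made-up country totals, shaped like the 'top_10' table above.
totals = pd.DataFrame(
    {"Confirmed": [1000, 500], "Deaths": [20, 25], "Recovered": [800, 300]},
    index=pd.Index(["CountryA", "CountryB"], name="Country/Region"),
)

# Express deaths and recoveries as a percentage of confirmed cases
rates = pd.DataFrame({
    "DeathRatePct": 100 * totals["Deaths"] / totals["Confirmed"],
    "RecoveryRatePct": 100 * totals["Recovered"] / totals["Confirmed"],
}).round(2)

print(rates)
```

In this toy example CountryB has fewer deaths in absolute terms but a higher death rate, which the count-based bar chart would not show.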
9. Advanced Visualizations (Optional)
Interactive Visualizations with Plotly
For interactive, hover-enabled plots, Plotly is an excellent tool:

    import plotly.express as px

    # Reset index for Plotly
    country_data = country_data.reset_index()

    # Top 20 countries
    fig = px.bar(
        country_data.head(20),
        x='Country/Region',
        y='Confirmed',
        color='Confirmed',
        title='Top 20 Countries with Most Confirmed COVID-19 Cases',
    )
    fig.show()

Choropleth Map of Confirmed Cases

    fig = px.choropleth(
        country_data,
        locations='Country/Region',
        locationmode='country names',
        color='Confirmed',
        hover_name='Country/Region',
        color_continuous_scale='Reds',
        title='Global Spread of COVID-19: Confirmed Cases',
    )
    fig.show()

These advanced visualizations can significantly enhance your COVID-19 analysis presentations and dashboards.
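With `locationmode='country names'`, countries whose names Plotly does not recognize are left blank on the map. If your dataset uses abbreviated or alternative names, a small mapping applied before plotting may help. The sketch below uses made-up totals, and the entries in `name_map` are hypothetical examples, so inspect your own 'Country/Region' values to decide which ones need mapping.

```python
import pandas as pd

# Made-up totals with some non-standard country labels.
country_totals = pd.DataFrame({
    "Country/Region": ["US", "Mainland China", "China", "Brazil"],
    "Confirmed": [100, 50, 5, 70],
})

# Hypothetical mapping from dataset labels to standard country names
name_map = {
    "US": "United States",
    "Mainland China": "China",
    "UK": "United Kingdom",
}

# Rename, then re-aggregate because two labels may now refer to the same country
country_totals["Country/Region"] = country_totals["Country/Region"].replace(name_map)
country_totals = country_totals.groupby("Country/Region", as_index=False)["Confirmed"].sum()

print(country_totals)
```

The cleaned table can then be passed to `px.choropleth` in place of `country_data`.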
10. Insights and Summary
After conducting a complete Exploratory Data Analysis (EDA) on COVID-19 data, here are some key takeaways:
- Confirmed cases and deaths are positively correlated.
- The number of cases spiked dramatically in certain months (e.g., April 2020, January 2021).
- The top impacted countries included the USA, India, Brazil, and Russia.
- Some countries had higher recovery rates compared to others.
- Advanced plots such as choropleth maps reveal geographical spread effectively.
This data analysis with Python forms the foundation for future steps like forecasting, machine learning modeling, or dashboard creation.
11. Conclusion
EDA is an essential step in every data science project. In this tutorial, we explored how to perform a full COVID-19 analysis using real-world data and libraries like Pandas, Matplotlib, Seaborn, and Plotly.
This project helped us:
- Load, clean, and process real data
- Perform univariate and multivariate analysis
- Explore trends over time and geography
- Derive meaningful, actionable insights
🧰 Tools Used:
- Python
- Pandas
- Matplotlib
- Seaborn
- Plotly
📂 Dataset Used:
COVID-19 dataset from Our World in Data or Kaggle
📘 Next Steps:
Try conducting EDA on other COVID-related datasets like vaccinations or testing rates. You can also build dashboards using Streamlit or begin predictive modeling using machine learning algorithms.
Thank you for following along!