Predicting Air Quality Index Using Python

Introduction
Air pollution is a growing concern worldwide, with significant impacts on human health and the environment. As urbanization and industrialization accelerate, monitoring air quality has become more critical than ever. The Air Quality Index (AQI) is the standard metric for assessing pollution levels at a location: it summarizes the concentrations of pollutants such as PM2.5, PM10, CO, NO₂, SO₂, and O₃ into a single number. Predicting AQI with machine learning can help authorities take preventive measures to improve air quality and reduce health risks. In this project, we will build such a model using environmental and atmospheric data from various Indian cities.
In this data science project, we will analyze and predict AQI using Python. Leveraging libraries like Pandas, NumPy, Matplotlib, and Scikit-Learn, we will explore real-world air quality datasets, clean and preprocess data, perform exploratory data analysis (EDA), and build machine learning models to forecast AQI. This project will provide a hands-on approach to understanding how data science can be applied to environmental monitoring.
Why This Project Matters
Understanding AQI trends helps policymakers, researchers, and citizens make informed decisions about outdoor activities, pollution control measures, and health precautions. By using Python for AQI analysis, we can visualize air pollution patterns, identify high-risk areas, and even predict future trends based on historical data. This project is ideal for data science enthusiasts looking to apply machine learning to real-world environmental challenges.
Objective
- To develop a machine learning model that predicts AQI based on pollutant levels and environmental factors.
- To explore and analyze the relationship between pollutants and AQI.
- To evaluate and compare multiple machine learning models for better prediction accuracy.
Dataset Overview
We will use the Air Quality Data in India dataset from Kaggle, which includes the following files:
✅ city_day.csv – Daily AQI data at the city level.
✅ city_hour.csv – Hourly AQI data at the city level.
✅ station_day.csv – Daily AQI data at the station level.
✅ station_hour.csv – Hourly AQI data at the station level.
✅ stations.csv – Metadata about monitoring stations.
Key Features:
- City/Station – Location of data collection
- Datetime – Date and time of data collection
- PM2.5, PM10 – Particulate matter concentration
- NO2, SO2, CO, O3 – Pollutant levels
- AQI – Air Quality Index (Target variable)
Tools and Libraries Used
- Python – Programming language
- Pandas – Data manipulation
- NumPy – Numerical operations
- Matplotlib, Seaborn – Data visualization
- Scikit-learn – Machine learning library
- XGBoost, Random Forest – Ensemble models for stronger predictions
- TensorFlow – Deep learning models (optional)
Step 1: Importing Libraries
Let’s start by importing the essential libraries.
```python
# Import libraries for data handling and manipulation
import pandas as pd
import numpy as np

# Import libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import libraries for preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Import machine learning models
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Import evaluation metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
```
Explanation:
✅ Pandas – For reading, cleaning, and manipulating data.
✅ NumPy – For mathematical computations.
✅ Matplotlib/Seaborn – For visualizing data to identify patterns and trends.
✅ Scikit-learn – For preprocessing, model building, and evaluation.
✅ XGBoost – An advanced machine learning algorithm known for handling structured data effectively.
✅ TensorFlow – Optional for building deep learning models.
Let’s proceed to Step 2: Loading the Dataset.
Step 2: Loading the Dataset
We will work with the city_day.csv file, which contains daily air quality data at the city level. Let’s load the data and take a quick look at its structure.
```python
# Load the dataset
file_path = 'path_to_dataset/city_day.csv'  # Replace with actual path
df = pd.read_csv(file_path)

# Display the first five rows of the dataset
df.head()
```
Explanation:
✅ pd.read_csv() – Reads the CSV file into a DataFrame.
✅ df.head() – Displays the first five rows of the dataset, giving a quick overview of the data structure.
Step 3: Data Exploration
Before we start cleaning and analyzing the data, let’s explore the dataset to understand its structure and contents.
Basic Information About the Dataset
```python
# Display basic information about the dataset
df.info()
```
Explanation:
✅ df.info() – Provides an overview of the dataset, including column names, data types, and the number of non-null values. This helps us identify missing values and understand the types of data we are working with.
Check for Missing Values
```python
# Check for missing values in the dataset
missing_values = df.isnull().sum().sort_values(ascending=False)
missing_values
```
Explanation:
✅ df.isnull().sum() – Counts the number of missing values in each column.
✅ .sort_values() – Sorts the results in descending order to easily spot the columns with the most missing values.
Summary Statistics
```python
# Get summary statistics of the dataset
df.describe()
```
Explanation:
✅ df.describe() – Provides statistical details such as mean, standard deviation, minimum, and maximum values for numerical columns. This helps in understanding data distribution and identifying outliers.
Unique Cities and Pollutants
```python
# Check the unique cities present in the dataset
print("Number of unique cities:", df['City'].nunique())
print("Unique cities:", df['City'].unique())

# Check the pollutants being measured
pollutants = [col for col in df.columns if 'PM' in col or 'NO' in col or 'CO' in col]
print("Pollutants measured:", pollutants)
```
Explanation:
✅ nunique() – Counts the number of unique cities.
✅ unique() – Lists the names of the unique cities.
✅ Pollutant selection using a list comprehension – Filters the columns related to air pollutants for analysis.
Step 4: Data Cleaning
Cleaning the data is crucial to improve model accuracy and avoid biased predictions. Let’s address missing values, duplicates, and data inconsistencies.
1. Handle Missing Values
First, let’s check the percentage of missing values in each column.
```python
# Calculate percentage of missing values
missing_percentage = (df.isnull().sum() / len(df)) * 100
missing_percentage = missing_percentage[missing_percentage > 0].sort_values(ascending=False)
missing_percentage
```
Explanation:
✅ (df.isnull().sum() / len(df)) * 100 – Calculates the percentage of missing values for each column.
✅ sort_values(ascending=False) – Sorts the columns by the highest percentage of missing values for better visibility.
2. Remove Columns with High Missing Values
Columns with over 40% missing data can be dropped to improve model performance.
```python
# Drop columns with more than 40% missing data
df = df.dropna(thresh=len(df) * 0.6, axis=1)
df.info()
```
Explanation:
✅ dropna(thresh=len(df) * 0.6, axis=1) – Keeps only columns with at least 60% non-null values, i.e., drops columns with over 40% missing values.
✅ axis=1 – Ensures it drops columns, not rows.
3. Fill Missing Values with Mean/Median
For numeric columns with moderate missing values, fill them with the mean or median.
```python
# Fill numeric missing values with the column mean
# (assignment instead of inplace fillna avoids chained-assignment issues
# in newer pandas versions)
for col in df.columns:
    if df[col].dtype in ['float64', 'int64']:
        df[col] = df[col].fillna(df[col].mean())
```
Explanation:
✅ fillna(df[col].mean()) – Fills missing numeric values with the column mean.
✅ This keeps every row usable, though heavy mean imputation can dampen a column's natural variance, so it works best when missingness is moderate.
4. Fill Categorical Missing Values with Mode
Categorical values can be filled with the most frequent value (mode).
```python
# Fill categorical missing values with the mode
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna(df[col].mode()[0])
```
Explanation:
✅ mode()[0] – Selects the most frequently occurring value.
✅ This avoids random guessing and retains consistency.
5. Remove Duplicates
```python
# Remove duplicate rows
df.drop_duplicates(inplace=True)
```
Explanation:
✅ drop_duplicates() – Removes duplicate rows to avoid data redundancy.
✅ inplace=True – Applies the change directly to the dataset.
6. Convert Date Column to Datetime Format
Ensure the date column is in a consistent format for time-based analysis.
```python
# Convert the Date column to datetime format
df['Date'] = pd.to_datetime(df['Date'])
df.info()
```
Explanation:
✅ pd.to_datetime() – Converts the date column to a datetime format for time-based grouping and analysis.
7. Remove Negative and Unreasonable Values
Ensure no pollutant values are negative, as they are not physically meaningful.
```python
# Remove rows containing negative pollutant values
for col in df.columns:
    if df[col].dtype in ['float64', 'int64']:
        df = df[df[col] >= 0]
```
Explanation:
✅ Removes records where pollutant values are negative, as they represent incorrect measurements or errors.
Step 5: Exploratory Data Analysis (EDA)
EDA helps us understand the dataset’s structure, patterns, and potential relationships between variables. Let’s explore the data in detail.
1. Overview of Numerical Data
Let’s generate descriptive statistics to understand the distribution of numerical features.
```python
# Descriptive statistics for numerical columns
df.describe()
```
Explanation:
✅ describe() – Summarizes numerical columns, including:
- count – Number of non-null values
- mean – Average value
- std – Standard deviation (measure of spread)
- min, max – Minimum and maximum values
- 25%, 50%, 75% – Quartiles (important for detecting outliers)
2. Distribution of AQI (Air Quality Index)
Let’s plot the AQI distribution to observe its skewness and central tendency.
```python
# Distribution of AQI
plt.figure(figsize=(8, 5))
sns.histplot(df['AQI'], kde=True, color='blue', bins=30)
plt.title('Distribution of AQI')
plt.xlabel('Air Quality Index')
plt.ylabel('Frequency')
plt.show()
```
Explanation:
✅ histplot() – Plots the distribution of AQI values.
✅ kde=True – Adds a kernel density estimate to show the distribution shape.
✅ Skewness or multimodal patterns could suggest the need for transformation.
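If the distribution does turn out heavily right-skewed, a log transform is a common remedy; a minimal sketch (the AQI_log column name is purely illustrative):

```python
# log1p computes log(1 + x), so zero values are handled safely
df['AQI_log'] = np.log1p(df['AQI'])

sns.histplot(df['AQI_log'], kde=True, bins=30)
plt.title('Distribution of Log-Transformed AQI')
plt.show()
```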
3. Box Plot to Detect Outliers
Box plots help visualize the presence of outliers in AQI data.
```python
# Box plot for AQI
plt.figure(figsize=(6, 4))
sns.boxplot(x=df['AQI'], color='lightblue')
plt.title('Box Plot of AQI')
plt.show()
```
Explanation:
✅ boxplot() – Shows the distribution, quartiles, and outliers.
✅ Outliers beyond the whiskers (1.5 × IQR) need investigation.
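To flag those outliers numerically rather than only visually, you can compute the whisker bounds directly; a small sketch:

```python
# Interquartile range (IQR) bounds for AQI
q1 = df['AQI'].quantile(0.25)
q3 = df['AQI'].quantile(0.75)
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['AQI'] < lower) | (df['AQI'] > upper)]
print(f"Outliers outside [{lower:.1f}, {upper:.1f}]: {len(outliers)} rows")
```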
4. AQI Trends Over Time
Analyze how AQI changes over time by plotting a line graph.
```python
# Trend of AQI over time
plt.figure(figsize=(12, 6))
sns.lineplot(x='Date', y='AQI', data=df)
plt.title('AQI Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Air Quality Index')
plt.show()
```
Explanation:
✅ lineplot() – Plots AQI values over time.
✅ Helps identify seasonal or long-term patterns in air quality.
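A daily line across many cities can look noisy; resampling to monthly means often makes the seasonal pattern easier to see. A sketch, assuming Date has already been converted to datetime:

```python
# Average AQI per month across all cities
monthly_aqi = df.set_index('Date')['AQI'].resample('M').mean()

plt.figure(figsize=(12, 6))
monthly_aqi.plot()
plt.title('Monthly Average AQI')
plt.xlabel('Date')
plt.ylabel('Average AQI')
plt.show()
```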
5. Correlation Between Pollutants and AQI
Let’s calculate and visualize the correlation matrix.
```python
# Correlation matrix (numeric columns only, since City and Date
# are still non-numeric at this stage)
corr_matrix = df.corr(numeric_only=True)

# Plot correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
```
Explanation:
✅ corr() – Calculates the Pearson correlation between numerical features.
✅ heatmap() – Visualizes the correlation matrix.
✅ High correlation between AQI and specific pollutants (e.g., PM2.5) indicates strong predictive potential.
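To rank predictors at a glance, you can pull just the AQI column out of the matrix computed above; a small sketch:

```python
# Correlation of each numeric feature with AQI, strongest first
aqi_corr = corr_matrix['AQI'].drop('AQI').sort_values(ascending=False)
print(aqi_corr)
```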
6. Pair Plot to Explore Relationships
Pair plots show pairwise relationships between numerical variables.
```python
# Pair plot of pollutants and AQI
sns.pairplot(df[['AQI', 'PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']])
plt.show()
```
Explanation:
✅ pairplot() – Plots scatter plots for all combinations of the selected features.
✅ Identifies linear or non-linear relationships.
7. AQI by City
Let’s visualize the variation in AQI across different cities.
```python
# Average AQI by city (top 10)
city_aqi = df.groupby('City')['AQI'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
city_aqi.plot(kind='bar', color='skyblue')
plt.title('Top 10 Cities by Average AQI')
plt.xlabel('City')
plt.ylabel('Average AQI')
plt.show()
```
Explanation:
✅ groupby() – Groups the data by city and calculates the mean AQI.
✅ plot(kind='bar') – Plots a bar chart for easy comparison.
✅ Highlights the cities with the poorest air quality.
Step 6: Feature Engineering
Feature engineering involves creating, modifying, or transforming features to improve model performance. It helps uncover hidden patterns and makes the model more effective.
1. Handling Missing Values
Let’s handle missing values systematically:
- Fill missing AQI values using the average AQI of the same city and month.
- Drop rows with excessive missing data in pollutant values.
```python
# Fill missing AQI values with the mean AQI for the same city and month
df['AQI'] = df.groupby(['City', df['Date'].dt.month])['AQI'].transform(
    lambda x: x.fillna(x.mean())
)

# Drop rows with more than 50% missing values
df.dropna(thresh=int(df.shape[1] * 0.5), inplace=True)

# Fill remaining missing values with the median (numeric columns only)
df.fillna(df.median(numeric_only=True), inplace=True)
```
Explanation:
✅ groupby() – Groups the data by city and month.
✅ transform() – Applies the fill based on each group's average.
✅ dropna() – Removes rows with excessive missing data.
✅ fillna() – Fills the remaining missing values with the median to avoid skewness.
2. Creating New Features
We can create new features to improve model performance:
- Month and Day of the week from the date column.
- Pollutant ratio to measure the contribution of different pollutants.
```python
# Extract month and day of the week
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.dayofweek

# Create pollutant ratios; replacing 0 with NaN in the denominator
# avoids division by zero producing inf values
df['PM2.5/PM10'] = df['PM2.5'] / df['PM10'].replace(0, np.nan)
df['NO2/SO2'] = df['NO2'] / df['SO2'].replace(0, np.nan)
df['CO/O3'] = df['CO'] / df['O3'].replace(0, np.nan)

# Fill any ratios left undefined by a zero denominator
ratio_cols = ['PM2.5/PM10', 'NO2/SO2', 'CO/O3']
df[ratio_cols] = df[ratio_cols].fillna(0)
```
Explanation:
✅ dt.month and dt.dayofweek – Extract time-based features from the date column.
✅ Creating pollutant ratios helps capture interactions between pollutants.
3. Encoding Categorical Variables
Convert categorical variables like City into numerical format using One-Hot Encoding.
```python
# One-Hot Encoding for City
df = pd.get_dummies(df, columns=['City'], drop_first=True)
```
Explanation:
✅ get_dummies() – Creates binary columns for each city.
✅ drop_first=True – Prevents multicollinearity.
4. Feature Scaling
Scale numerical features to bring them to a similar range using StandardScaler.
```python
from sklearn.preprocessing import StandardScaler

# Scale pollutant and AQI features
scaler = StandardScaler()
scaled_cols = ['AQI', 'PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']
df[scaled_cols] = scaler.fit_transform(df[scaled_cols])
```
Explanation:
✅ StandardScaler – Scales data to zero mean and unit variance.
✅ Ensures that distance-based models (e.g., KNN, SVM) perform better.
✅ Note: because the target (AQI) is scaled here too, predictions come back in scaled units; keep the fitted scaler so you can invert the transform when reporting real AQI values.
Step 7: Splitting the Data
To evaluate the model’s performance effectively, we need to split the dataset into training and test sets. A typical split is 80% for training and 20% for testing. This helps the model learn from the training set and evaluate on unseen data.
Code:
```python
from sklearn.model_selection import train_test_split

# Define features and target variable
X = df.drop(['AQI', 'Date'], axis=1)  # Drop target and date column
y = df['AQI']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
Explanation:
✅ drop() – Removes the target (AQI) and the non-numerical Date column.
✅ train_test_split() – Splits the data into training and testing sets.
✅ test_size=0.2 – Sets aside 20% of the data for testing.
✅ random_state=42 – Ensures reproducibility of results.
Step 8: Model Building
We will now build multiple models to predict the Air Quality Index (AQI). We’ll start with simple models like Linear Regression and gradually move to more complex models like Random Forest and Gradient Boosting. This allows us to compare model performance and understand which model works best.
8.1 Linear Regression
Linear Regression is a basic model that assumes a linear relationship between features and the target variable.
```python
from sklearn.linear_model import LinearRegression

# Initialize and train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict on test data
y_pred_lr = lr_model.predict(X_test)
```
✅ Linear Regression Assumption: It assumes that the relationship between the features and target is linear.
✅ fit() trains the model using training data.
✅ predict() generates predictions on test data.
8.2 Decision Tree Regressor
Decision Tree models split data into branches based on feature values and make predictions accordingly.
```python
from sklearn.tree import DecisionTreeRegressor

# Initialize and train the Decision Tree model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Predict on test data
y_pred_dt = dt_model.predict(X_test)
```
✅ Decision Tree Strength: Captures complex patterns in data.
✅ random_state ensures reproducibility of the results.
✅ Decision trees are prone to overfitting if not properly tuned.
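One common way to curb that overfitting is to cap the tree's growth; a hedged sketch (the specific limits are illustrative, not tuned):

```python
# A shallower, regularized tree usually generalizes better
# than an unconstrained one
dt_tuned = DecisionTreeRegressor(
    max_depth=10,         # limit how deep the tree can grow
    min_samples_leaf=5,   # require at least 5 samples per leaf
    random_state=42,
)
dt_tuned.fit(X_train, y_train)
print("Tuned Decision Tree R²:", dt_tuned.score(X_test, y_test))
```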
8.3 Random Forest Regressor
Random Forest is an ensemble learning model that builds multiple decision trees and averages their outputs to improve accuracy and reduce overfitting.
```python
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on test data
y_pred_rf = rf_model.predict(X_test)
```
✅ n_estimators: Number of decision trees used.
✅ Random forests reduce overfitting and improve accuracy.
✅ Averaging multiple trees increases the model’s robustness.
8.4 Gradient Boosting Regressor
Gradient Boosting builds models sequentially, where each new model tries to correct the errors of the previous one.
```python
from sklearn.ensemble import GradientBoostingRegressor

# Initialize and train the Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)

# Predict on test data
y_pred_gb = gb_model.predict(X_test)
```
✅ Boosting focuses on correcting errors in previous models.
✅ Gradient boosting typically achieves high accuracy but is prone to overfitting.
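The usual levers against that overfitting are a smaller learning rate and row subsampling; a sketch with illustrative values:

```python
# Smaller steps per round plus row subsampling tend to overfit less
gb_tuned = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,   # shrink each tree's contribution
    subsample=0.8,        # fit each tree on 80% of the rows
    random_state=42,
)
gb_tuned.fit(X_train, y_train)
print("Tuned Gradient Boosting R²:", gb_tuned.score(X_test, y_test))
```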
8.5 XGBoost Regressor
XGBoost is an optimized version of Gradient Boosting that improves performance and computation time.
```python
from xgboost import XGBRegressor

# Initialize and train the XGBoost model
xgb_model = XGBRegressor(n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)

# Predict on test data
y_pred_xgb = xgb_model.predict(X_test)
```
✅ XGBoost handles missing values automatically.
✅ It’s efficient and often achieves better performance than other models.
✅ n_estimators=100 specifies the number of boosting rounds.
Step 9: Model Evaluation
After training multiple models, the next step is to evaluate their performance using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared (R²). Evaluating different models helps identify the best-performing one for AQI prediction.
9.1 Evaluate Linear Regression
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Evaluate Linear Regression
print("Linear Regression MAE:", mean_absolute_error(y_test, y_pred_lr))
print("Linear Regression MSE:", mean_squared_error(y_test, y_pred_lr))
print("Linear Regression R²:", r2_score(y_test, y_pred_lr))
```
✅ MAE: Measures average absolute error between predicted and actual values.
✅ MSE: Measures average squared difference between predicted and actual values.
✅ R²: Measures how well the model explains the variance in the target variable (closer to 1 is better).
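Because MSE is in squared units, its square root (RMSE) is often easier to read, since it is on the same scale as the target; a small sketch:

```python
import numpy as np

# RMSE reads like a typical prediction error in AQI units
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
print("Linear Regression RMSE:", rmse_lr)
```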
9.2 Evaluate Decision Tree
```python
# Evaluate Decision Tree
print("Decision Tree MAE:", mean_absolute_error(y_test, y_pred_dt))
print("Decision Tree MSE:", mean_squared_error(y_test, y_pred_dt))
print("Decision Tree R²:", r2_score(y_test, y_pred_dt))
```
✅ Decision trees might overfit on small datasets, so lower R² could indicate overfitting.
9.3 Evaluate Random Forest
```python
# Evaluate Random Forest
print("Random Forest MAE:", mean_absolute_error(y_test, y_pred_rf))
print("Random Forest MSE:", mean_squared_error(y_test, y_pred_rf))
print("Random Forest R²:", r2_score(y_test, y_pred_rf))
```
✅ Random Forest tends to generalize better than Decision Trees, so higher R² is expected.
9.4 Evaluate Gradient Boosting
```python
# Evaluate Gradient Boosting
print("Gradient Boosting MAE:", mean_absolute_error(y_test, y_pred_gb))
print("Gradient Boosting MSE:", mean_squared_error(y_test, y_pred_gb))
print("Gradient Boosting R²:", r2_score(y_test, y_pred_gb))
```
✅ Gradient Boosting might improve accuracy at the cost of longer training time.
9.5 Evaluate XGBoost
```python
# Evaluate XGBoost
print("XGBoost MAE:", mean_absolute_error(y_test, y_pred_xgb))
print("XGBoost MSE:", mean_squared_error(y_test, y_pred_xgb))
print("XGBoost R²:", r2_score(y_test, y_pred_xgb))
```
✅ XGBoost is expected to have high R² due to its ability to handle complex data patterns.
9.6 Compare Models’ Performance
We can now compare the performance of all models using a table:
```python
# Create a DataFrame to compare model performance
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest',
              'Gradient Boosting', 'XGBoost'],
    'MAE': [mean_absolute_error(y_test, y_pred_lr),
            mean_absolute_error(y_test, y_pred_dt),
            mean_absolute_error(y_test, y_pred_rf),
            mean_absolute_error(y_test, y_pred_gb),
            mean_absolute_error(y_test, y_pred_xgb)],
    'MSE': [mean_squared_error(y_test, y_pred_lr),
            mean_squared_error(y_test, y_pred_dt),
            mean_squared_error(y_test, y_pred_rf),
            mean_squared_error(y_test, y_pred_gb),
            mean_squared_error(y_test, y_pred_xgb)],
    'R²': [r2_score(y_test, y_pred_lr),
           r2_score(y_test, y_pred_dt),
           r2_score(y_test, y_pred_rf),
           r2_score(y_test, y_pred_gb),
           r2_score(y_test, y_pred_xgb)]
})

# Sort by R² (descending)
results = results.sort_values(by='R²', ascending=False)
print(results)
```
✅ This table allows us to identify which model performed best based on R² and error metrics.
✅ A higher R² and lower MAE/MSE indicate better model performance.
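A single train/test split can favor one model by chance. Cross-validation averages performance over several splits for a steadier comparison; a minimal sketch, assuming the models from Step 8 are already defined (it refits each model, so it can take a while):

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R² for two of the candidate models
for name, model in {'Random Forest': rf_model, 'XGBoost': xgb_model}.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"{name}: mean R² = {scores.mean():.3f} (+/- {scores.std():.3f})")
```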
Step 10: Building the Front-End with Flask
Now that we have a working model, the next step is to create a simple Flask-based web app to allow users to input AQI data and get predictions in real-time.
10.1 Install Flask
If Flask is not installed, you can install it using:
```bash
pip install flask
```
10.2 Create the Flask App
Create a new file called app.py and write the following code:
```python
from flask import Flask, request, render_template
import pickle
import numpy as np

# Load the trained model
model = pickle.load(open('best_model.pkl', 'rb'))

app = Flask(__name__)

# Create a route for the homepage
@app.route('/')
def home():
    return render_template('index.html')

# Create a route for prediction
@app.route('/predict', methods=['POST'])
def predict():
    # Read the form values in order and convert them to floats
    features = [float(x) for x in request.form.values()]
    features_array = np.array(features).reshape(1, -1)
    prediction = model.predict(features_array)[0]
    return render_template('index.html',
                           prediction_text=f'Predicted AQI: {prediction:.2f}')

if __name__ == "__main__":
    app.run(debug=True)
```
10.3 Create the HTML Template
Create a folder named templates and, inside it, a file called index.html. Note that the six input fields below are illustrative; in practice, the form must supply values for exactly the features (in the same order) that the model was trained on, which in our pipeline includes the engineered and one-hot-encoded columns.
```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>AQI Prediction</title>
  <style>
    body { font-family: Arial, sans-serif; background-color: #f4f4f9; padding: 20px; }
    h2 { color: #333; }
    form { margin-top: 20px; }
    input { padding: 10px; margin: 5px; width: 80%; }
    button { padding: 10px 20px; background-color: #5cb85c; color: white; border: none; cursor: pointer; }
    .result { margin-top: 20px; font-size: 1.2rem; color: #2c3e50; }
  </style>
</head>
<body>
  <h2>AQI Prediction Model</h2>
  <form action="/predict" method="post">
    <input type="text" name="pm2_5" placeholder="Enter PM2.5" required="required" /><br />
    <input type="text" name="pm10" placeholder="Enter PM10" required="required" /><br />
    <input type="text" name="no" placeholder="Enter NO" required="required" /><br />
    <input type="text" name="no2" placeholder="Enter NO2" required="required" /><br />
    <input type="text" name="nox" placeholder="Enter NOx" required="required" /><br />
    <input type="text" name="co" placeholder="Enter CO" required="required" /><br />
    <button type="submit">Predict</button>
  </form>
  {% if prediction_text %}
  <div class="result">{{ prediction_text }}</div>
  {% endif %}
</body>
</html>
```
10.4 Save and Load the Model
To save the best model, use:
```python
import pickle

# best_model should be whichever model scored best in Step 9
# (here we assume the Random Forest; swap in your own winner)
best_model = rf_model

# Save the model to a file
pickle.dump(best_model, open('best_model.pkl', 'wb'))
```
10.5 Run the Flask App
To run the Flask app, use the following command:
```bash
python app.py
```
✅ Open your browser and visit http://127.0.0.1:5000 to access the app.
✅ You can enter test data and see the predicted AQI directly in the web interface.
Step 11: Deployment
Now that we have a working Flask-based application, the next step is to deploy it so that users can access it from anywhere. We’ll use Render (or Heroku as an alternative) for deployment.
11.1 Create a GitHub Repository
- Create a new repository on GitHub (e.g., aqi-prediction).
- Push your code files to the repository:
```bash
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin <repository-url>
git push -u origin main
```
11.2 Create a requirements.txt File
Create a requirements.txt file listing all dependencies:
```
Flask
numpy
pandas
scikit-learn
gunicorn
```
Note that pickle ships with Python's standard library, so it does not belong in requirements.txt; gunicorn is listed because the Procfile below uses it, and you should add xgboost if the model you saved is an XGBoost one.
11.3 Create a Procfile
Create a Procfile to define how the app should run:
```
web: gunicorn app:app
```
11.4 Deploy on Render
- Go to Render and create a new web service.
- Link your GitHub repository.
- Set the build command to `pip install -r requirements.txt`.
- Set the start command to `gunicorn app:app`.
- Deploy the app.
✅ Access the Deployed App
Once deployed, you will get a live URL like:
https://your-app-name.onrender.com
Users can now access the app, input their data, and get AQI predictions in real-time.
Conclusion
In this project, we successfully built a machine learning model to predict Air Quality Index (AQI) using historical air quality data from major cities in India. We followed a structured, step-by-step approach that included:
✅ Data collection and exploration
✅ Data cleaning and preprocessing
✅ Feature engineering and model building
✅ Training and evaluation of multiple models
✅ Building a user-friendly Flask-based front end
✅ Deployment on Render
The project demonstrated how to handle complex environmental data and create an accurate AQI prediction model. This solution can be further enhanced by adding more data sources, fine-tuning models, and experimenting with deep learning techniques for improved accuracy.
Next Steps
- Try More Models: Experiment with more complex models like Gradient Boosting, XGBoost, and LightGBM to improve accuracy.
- Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV to optimize model parameters (see the sketch after this list).
- Feature Engineering: Create additional features, such as seasonal patterns or regional AQI trends.
- Deep Learning: Test neural networks like LSTM (Long Short-Term Memory) for better handling of time-series data.
- Real-Time Prediction: Set up real-time data ingestion and prediction using an API.
- Model Monitoring: Implement monitoring tools to track model performance over time and adjust for data drift.
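As a starting point for the hyperparameter tuning step above, here is a minimal GridSearchCV sketch for the Random Forest; the grid values are illustrative, not a recommendation:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Small illustrative grid; widen it once you know the promising ranges
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,  # use all CPU cores
)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Best CV R²:", grid.best_score_)
```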