Predicting Air Quality Index Using Python

Introduction
Air pollution is a growing concern worldwide, with significant impacts on human health and the environment. As urbanization and industrialization accelerate, monitoring air quality has become more critical than ever. The Air Quality Index (AQI) is the standard metric for assessing pollution levels at a location: it summarizes the concentrations of pollutants such as PM2.5, PM10, CO, NO₂, SO₂, and O₃ into a single number. Predicting AQI with machine learning can help authorities take preventive measures to improve air quality and reduce health risks. In this project, we will build such a model using environmental and atmospheric data from various Indian cities.
In this data science project, we will analyze and predict AQI using Python. Leveraging libraries like Pandas, NumPy, Matplotlib, and Scikit-Learn, we will explore real-world air quality datasets, clean and preprocess data, perform exploratory data analysis (EDA), and build machine learning models to forecast AQI. This project will provide a hands-on approach to understanding how data science can be applied to environmental monitoring.
Why This Project Matters
Understanding AQI trends helps policymakers, researchers, and citizens make informed decisions about outdoor activities, pollution control measures, and health precautions. By using Python for AQI analysis, we can visualize air pollution patterns, identify high-risk areas, and even predict future trends based on historical data. This project is ideal for data science enthusiasts looking to apply machine learning to real-world environmental challenges.
Objective
- To develop a machine learning model that predicts AQI based on pollutant levels and environmental factors.
- To explore and analyze the relationship between pollutants and AQI.
- To evaluate and compare multiple machine learning models for better prediction accuracy.
Dataset Overview
We will use the Air Quality Data in India dataset from Kaggle, which includes the following files:
✅ city_day.csv – Daily AQI data at the city level.
✅ city_hour.csv – Hourly AQI data at the city level.
✅ station_day.csv – Daily AQI data at the station level.
✅ station_hour.csv – Hourly AQI data at the station level.
✅ stations.csv – Metadata about monitoring stations.
Key Features:
- City/Station – Location of data collection
- Datetime – Date and time of data collection
- PM2.5, PM10 – Particulate matter concentration
- NO2, SO2, CO, O3 – Pollutant levels
- AQI – Air Quality Index (Target variable)
Tools and Libraries Used
- Python – Programming language
- Pandas – Data manipulation
- NumPy – Numerical operations
- Matplotlib, Seaborn – Data visualization
- Scikit-learn – Machine learning library
- XGBoost, Random Forest – Ensemble models for stronger predictions
- TensorFlow – Deep learning models (optional)
Step 1: Importing Libraries
Let’s start by importing the essential libraries.
```python
# Import libraries for data handling and manipulation
import pandas as pd
import numpy as np

# Import libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import libraries for preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Import machine learning models
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Import evaluation metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
```
Explanation:
✅ Pandas – For reading, cleaning, and manipulating data.
✅ NumPy – For mathematical computations.
✅ Matplotlib/Seaborn – For visualizing data to identify patterns and trends.
✅ Scikit-learn – For preprocessing, model building, and evaluation.
✅ XGBoost – An advanced machine learning algorithm known for handling structured data effectively.
✅ TensorFlow – Optional for building deep learning models.
Let’s proceed to Step 2: Loading the Dataset.
Step 2: Loading the Dataset
We will work with the city_day.csv file, which contains daily air quality data at the city level. Let’s load the data and take a quick look at its structure.
```python
# Load the dataset
file_path = 'path_to_dataset/city_day.csv'  # Replace with actual path
df = pd.read_csv(file_path)

# Display the first five rows of the dataset
df.head()
```
Explanation:
✅ pd.read_csv() – Reads the CSV file into a DataFrame.
✅ df.head() – Displays the first five rows of the dataset, giving a quick overview of the data structure.
Step 3: Data Exploration
Before we start cleaning and analyzing the data, let’s explore the dataset to understand its structure and contents.
Basic Information About the Dataset
```python
# Display basic information about the dataset
df.info()
```
Explanation:
✅ df.info() – Provides an overview of the dataset, including column names, data types, and the number of non-null values. This helps us identify missing values and understand the types of data we are working with.
Check for Missing Values
```python
# Check for missing values in the dataset
missing_values = df.isnull().sum().sort_values(ascending=False)
missing_values
```
Explanation:
✅ df.isnull().sum() – Counts the number of missing values in each column.
✅ .sort_values() – Sorts the results in descending order to easily spot the columns with the most missing values.
Summary Statistics
```python
# Get summary statistics of the dataset
df.describe()
```
Explanation:
✅ df.describe() – Provides statistical details such as mean, standard deviation, minimum, and maximum values for numerical columns. This helps in understanding data distribution and identifying outliers.
Unique Cities and Pollutants
```python
# Check the unique cities present in the dataset
print("Number of unique cities:", df['City'].nunique())
print("Unique cities:", df['City'].unique())

# Check the pollutants being measured
pollutants = [col for col in df.columns if 'PM' in col or 'NO' in col or 'CO' in col]
print("Pollutants measured:", pollutants)
```
Explanation:
✅ nunique() – Counts the number of unique cities.
✅ unique() – Lists the names of the unique cities.
✅ Pollutant selection using a list comprehension – Filters the columns related to air pollutants for analysis.
Step 4: Data Cleaning
Cleaning the data is crucial to improve model accuracy and avoid biased predictions. Let’s address missing values, duplicates, and data inconsistencies.
1. Handle Missing Values
First, let’s check the percentage of missing values in each column.
```python
# Calculate percentage of missing values
missing_percentage = (df.isnull().sum() / len(df)) * 100
missing_percentage = missing_percentage[missing_percentage > 0].sort_values(ascending=False)
missing_percentage
```
Explanation:
✅ (df.isnull().sum() / len(df)) * 100 – Calculates the percentage of missing values for each column.
✅ sort_values(ascending=False) – Sorts the columns by the highest percentage of missing values for better visibility.
2. Remove Columns with High Missing Values
Columns with over 40% missing data can be dropped to improve model performance.
```python
# Drop columns with more than 40% missing data
df = df.dropna(thresh=len(df) * 0.6, axis=1)
df.info()
```
Explanation:
✅ dropna(thresh=len(df) * 0.6, axis=1) – Keeps only columns with at least 60% non-null values, i.e., drops columns with over 40% missing values.
✅ axis=1 – Ensures it drops columns, not rows.
3. Fill Missing Values with Mean/Median
For numeric columns with moderate missing values, fill them with the mean or median.
```python
# Fill numeric missing values with the column mean
# (assignment instead of inplace fillna avoids chained-assignment issues
# in newer pandas versions)
for col in df.columns:
    if df[col].dtype in ['float64', 'int64']:
        df[col] = df[col].fillna(df[col].mean())
```
Explanation:
✅ fillna(df[col].mean()) – Fills missing numeric values with the column mean.
✅ This keeps every row usable, though heavy mean imputation can dampen a column's natural variance, so it works best when missingness is moderate.
4. Fill Categorical Missing Values with Mode
Categorical values can be filled with the most frequent value (mode).
```python
# Fill categorical missing values with the mode
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna(df[col].mode()[0])
```
Explanation:
✅ mode()[0] – Selects the most frequently occurring value.
✅ This avoids random guessing and retains consistency.
5. Remove Duplicates
```python
# Remove duplicate rows
df.drop_duplicates(inplace=True)
```
Explanation:
✅ drop_duplicates() – Removes duplicate rows to avoid data redundancy.
✅ inplace=True – Applies the change directly to the dataset.
6. Convert Date Column to Datetime Format
Ensure the date column is in a consistent format for time-based analysis.
```python
# Convert the Date column to datetime format
df['Date'] = pd.to_datetime(df['Date'])
df.info()
```
Explanation:
✅ pd.to_datetime() – Converts the date column to a datetime format for time-based grouping and analysis.
7. Remove Negative and Unreasonable Values
Ensure no pollutant values are negative, as they are not physically meaningful.
```python
# Remove rows containing negative pollutant values
for col in df.columns:
    if df[col].dtype in ['float64', 'int64']:
        df = df[df[col] >= 0]
```
Explanation:
✅ Removes records where pollutant values are negative, as they represent incorrect measurements or errors.
Step 5: Exploratory Data Analysis (EDA)
EDA helps us understand the dataset’s structure, patterns, and potential relationships between variables. Let’s explore the data in detail.
1. Overview of Numerical Data
Let’s generate descriptive statistics to understand the distribution of numerical features.
```python
# Descriptive statistics for numerical columns
df.describe()
```
Explanation:
✅ describe() – Summarizes numerical columns, including:
- count – Number of non-null values
- mean – Average value
- std – Standard deviation (measure of spread)
- min, max – Minimum and maximum values
- 25%, 50%, 75% – Quartiles (important for detecting outliers)
2. Distribution of AQI (Air Quality Index)
Let’s plot the AQI distribution to observe its skewness and central tendency.
```python
# Distribution of AQI
plt.figure(figsize=(8, 5))
sns.histplot(df['AQI'], kde=True, color='blue', bins=30)
plt.title('Distribution of AQI')
plt.xlabel('Air Quality Index')
plt.ylabel('Frequency')
plt.show()
```
Explanation:
✅ histplot() – Plots the distribution of AQI values.
✅ kde=True – Adds a kernel density estimate to show the distribution shape.
✅ Skewness or multimodal patterns could suggest the need for transformation.
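If the distribution does turn out heavily right-skewed, a log transform is a common remedy; a minimal sketch (the AQI_log column name is purely illustrative):

```python
# log1p computes log(1 + x), so zero values are handled safely
df['AQI_log'] = np.log1p(df['AQI'])

sns.histplot(df['AQI_log'], kde=True, bins=30)
plt.title('Distribution of Log-Transformed AQI')
plt.show()
```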
3. Box Plot to Detect Outliers
Box plots help visualize the presence of outliers in AQI data.
```python
# Box plot for AQI
plt.figure(figsize=(6, 4))
sns.boxplot(x=df['AQI'], color='lightblue')
plt.title('Box Plot of AQI')
plt.show()
```
Explanation:
✅ boxplot() – Shows the distribution, quartiles, and outliers.
✅ Outliers beyond the whiskers (1.5 × IQR) need investigation.
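To flag those outliers numerically rather than only visually, you can compute the whisker bounds directly; a small sketch:

```python
# Interquartile range (IQR) bounds for AQI
q1 = df['AQI'].quantile(0.25)
q3 = df['AQI'].quantile(0.75)
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['AQI'] < lower) | (df['AQI'] > upper)]
print(f"Outliers outside [{lower:.1f}, {upper:.1f}]: {len(outliers)} rows")
```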
4. AQI Trends Over Time
Analyze how AQI changes over time by plotting a line graph.
```python
# Trend of AQI over time
plt.figure(figsize=(12, 6))
sns.lineplot(x='Date', y='AQI', data=df)
plt.title('AQI Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Air Quality Index')
plt.show()
```
Explanation:
✅ lineplot() – Plots AQI values over time.
✅ Helps identify seasonal or long-term patterns in air quality.
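A daily line across many cities can look noisy; resampling to monthly means often makes the seasonal pattern easier to see. A sketch, assuming Date has already been converted to datetime:

```python
# Average AQI per month across all cities
monthly_aqi = df.set_index('Date')['AQI'].resample('M').mean()

plt.figure(figsize=(12, 6))
monthly_aqi.plot()
plt.title('Monthly Average AQI')
plt.xlabel('Date')
plt.ylabel('Average AQI')
plt.show()
```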
5. Correlation Between Pollutants and AQI
Let’s calculate and visualize the correlation matrix.
```python
# Correlation matrix (numeric columns only, since City and Date
# are still non-numeric at this stage)
corr_matrix = df.corr(numeric_only=True)

# Plot correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
```
Explanation:
✅ corr() – Calculates the Pearson correlation between numerical features.
✅ heatmap() – Visualizes the correlation matrix.
✅ High correlation between AQI and specific pollutants (e.g., PM2.5) indicates strong predictive potential.
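To rank predictors at a glance, you can pull just the AQI column out of the matrix computed above; a small sketch:

```python
# Correlation of each numeric feature with AQI, strongest first
aqi_corr = corr_matrix['AQI'].drop('AQI').sort_values(ascending=False)
print(aqi_corr)
```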
6. Pair Plot to Explore Relationships
Pair plots show pairwise relationships between numerical variables.
```python
# Pair plot of pollutants and AQI
sns.pairplot(df[['AQI', 'PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']])
plt.show()
```
Explanation:
✅ pairplot() – Plots scatter plots for all combinations of the selected features.
✅ Identifies linear or non-linear relationships.
7. AQI by City
Let’s visualize the variation in AQI across different cities.
```python
# Average AQI by city (top 10)
city_aqi = df.groupby('City')['AQI'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
city_aqi.plot(kind='bar', color='skyblue')
plt.title('Top 10 Cities by Average AQI')
plt.xlabel('City')
plt.ylabel('Average AQI')
plt.show()
```
Explanation:
✅ groupby() – Groups the data by city and calculates the mean AQI.
✅ plot(kind='bar') – Plots a bar chart for easy comparison.
✅ Highlights the cities with the poorest air quality.
Step 6: Feature Engineering
Feature engineering involves creating, modifying, or transforming features to improve model performance. It helps uncover hidden patterns and makes the model more effective.
1. Handling Missing Values
Let’s handle missing values systematically:
- Fill missing AQI values using the average AQI of the same city and month.
- Drop rows with excessive missing data in pollutant values.
```python
# Fill missing AQI values with the mean AQI for the same city and month
df['AQI'] = df.groupby(['City', df['Date'].dt.month])['AQI'].transform(
    lambda x: x.fillna(x.mean())
)

# Drop rows with more than 50% missing values
df.dropna(thresh=int(df.shape[1] * 0.5), inplace=True)

# Fill remaining missing values with the median (numeric columns only)
df.fillna(df.median(numeric_only=True), inplace=True)
```
Explanation:
✅ groupby() – Groups the data by city and month.
✅ transform() – Applies the fill based on each group's average.
✅ dropna() – Removes rows with excessive missing data.
✅ fillna() – Fills the remaining missing values with the median to avoid skewness.
2. Creating New Features
We can create new features to improve model performance:
- Month and Day of the week from the date column.
- Pollutant ratio to measure the contribution of different pollutants.
```python
# Extract month and day of the week
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.dayofweek

# Create pollutant ratios; replacing 0 with NaN in the denominator
# avoids division by zero producing inf values
df['PM2.5/PM10'] = df['PM2.5'] / df['PM10'].replace(0, np.nan)
df['NO2/SO2'] = df['NO2'] / df['SO2'].replace(0, np.nan)
df['CO/O3'] = df['CO'] / df['O3'].replace(0, np.nan)

# Fill any ratios left undefined by a zero denominator
ratio_cols = ['PM2.5/PM10', 'NO2/SO2', 'CO/O3']
df[ratio_cols] = df[ratio_cols].fillna(0)
```
Explanation:
✅ dt.month and dt.dayofweek – Extract time-based features from the date column.
✅ Creating pollutant ratios helps capture interactions between pollutants.
3. Encoding Categorical Variables
Convert categorical variables like City into numerical format using One-Hot Encoding.
```python
# One-Hot Encoding for City
df = pd.get_dummies(df, columns=['City'], drop_first=True)
```
Explanation:
✅ get_dummies() – Creates binary columns for each city.
✅ drop_first=True – Prevents multicollinearity.
4. Feature Scaling
Scale numerical features to bring them to a similar range using StandardScaler.
```python
from sklearn.preprocessing import StandardScaler

# Scale pollutant and AQI features
scaler = StandardScaler()
scaled_cols = ['AQI', 'PM2.5', 'PM10', 'NO2', 'SO2', 'CO', 'O3']
df[scaled_cols] = scaler.fit_transform(df[scaled_cols])
```
Explanation:
✅ StandardScaler – Scales data to zero mean and unit variance.
✅ Ensures that distance-based models (e.g., KNN, SVM) perform better.
✅ Note: because the target (AQI) is scaled here too, predictions come back in scaled units; keep the fitted scaler so you can invert the transform when reporting real AQI values.
Step 7: Splitting the Data
To evaluate the model’s performance effectively, we need to split the dataset into training and test sets. A typical split is 80% for training and 20% for testing. This helps the model learn from the training set and evaluate on unseen data.
Code:
```python
from sklearn.model_selection import train_test_split

# Define features and target variable
X = df.drop(['AQI', 'Date'], axis=1)  # Drop target and date column
y = df['AQI']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
Explanation:
✅ drop() – Removes the target (AQI) and the non-numerical Date column.
✅ train_test_split() – Splits the data into training and testing sets.
✅ test_size=0.2 – Sets aside 20% of the data for testing.
✅ random_state=42 – Ensures reproducibility of results.
Step 8: Model Building
We will now build multiple models to predict the Air Quality Index (AQI). We’ll start with simple models like Linear Regression and gradually move to more complex models like Random Forest and Gradient Boosting. This allows us to compare model performance and understand which model works best.
8.1 Linear Regression
Linear Regression is a basic model that assumes a linear relationship between features and the target variable.
```python
from sklearn.linear_model import LinearRegression

# Initialize and train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict on test data
y_pred_lr = lr_model.predict(X_test)
```
✅ Linear Regression Assumption: It assumes that the relationship between the features and target is linear.
✅ fit() trains the model using training data.
✅ predict() generates predictions on test data.
8.2 Decision Tree Regressor
Decision Tree models split data into branches based on feature values and make predictions accordingly.
```python
from sklearn.tree import DecisionTreeRegressor

# Initialize and train the Decision Tree model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Predict on test data
y_pred_dt = dt_model.predict(X_test)
```
✅ Decision Tree Strength: Captures complex patterns in data.
✅ random_state ensures reproducibility of the results.
✅ Decision trees are prone to overfitting if not properly tuned.
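One common way to curb that overfitting is to cap the tree's growth; a hedged sketch (the specific limits are illustrative, not tuned):

```python
# A shallower, regularized tree usually generalizes better
# than an unconstrained one
dt_tuned = DecisionTreeRegressor(
    max_depth=10,         # limit how deep the tree can grow
    min_samples_leaf=5,   # require at least 5 samples per leaf
    random_state=42,
)
dt_tuned.fit(X_train, y_train)
print("Tuned Decision Tree R²:", dt_tuned.score(X_test, y_test))
```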
8.3 Random Forest Regressor
Random Forest is an ensemble learning model that builds multiple decision trees and averages their outputs to improve accuracy and reduce overfitting.
```python
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on test data
y_pred_rf = rf_model.predict(X_test)
```
✅ n_estimators: Number of decision trees used.
✅ Random forests reduce overfitting and improve accuracy.
✅ Averaging multiple trees increases the model’s robustness.
8.4 Gradient Boosting Regressor
Gradient Boosting builds models sequentially, where each new model tries to correct the errors of the previous one.
```python
from sklearn.ensemble import GradientBoostingRegressor

# Initialize and train the Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)

# Predict on test data
y_pred_gb = gb_model.predict(X_test)
```
✅ Boosting focuses on correcting errors in previous models.
✅ Gradient boosting typically achieves high accuracy but is prone to overfitting.
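The usual levers against that overfitting are a smaller learning rate and row subsampling; a sketch with illustrative values:

```python
# Smaller steps per round plus row subsampling tend to overfit less
gb_tuned = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,   # shrink each tree's contribution
    subsample=0.8,        # fit each tree on 80% of the rows
    random_state=42,
)
gb_tuned.fit(X_train, y_train)
print("Tuned Gradient Boosting R²:", gb_tuned.score(X_test, y_test))
```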
8.5 XGBoost Regressor
XGBoost is an optimized version of Gradient Boosting that improves performance and computation time.
```python
from xgboost import XGBRegressor

# Initialize and train the XGBoost model
xgb_model = XGBRegressor(n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)

# Predict on test data
y_pred_xgb = xgb_model.predict(X_test)
```
✅ XGBoost handles missing values automatically.
✅ It’s efficient and often achieves better performance than other models.
✅ n_estimators=100 specifies the number of boosting rounds.
Step 9: Model Evaluation
After training multiple models, the next step is to evaluate their performance using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared (R²). Evaluating different models helps identify the best-performing one for AQI prediction.
9.1 Evaluate Linear Regression
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Evaluate Linear Regression
print("Linear Regression MAE:", mean_absolute_error(y_test, y_pred_lr))
print("Linear Regression MSE:", mean_squared_error(y_test, y_pred_lr))
print("Linear Regression R²:", r2_score(y_test, y_pred_lr))
```
✅ MAE: Measures average absolute error between predicted and actual values.
✅ MSE: Measures average squared difference between predicted and actual values.
✅ R²: Measures how well the model explains the variance in the target variable (closer to 1 is better).
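Because MSE is in squared units, its square root (RMSE) is often easier to read, since it is on the same scale as the target; a small sketch:

```python
import numpy as np

# RMSE reads like a typical prediction error in AQI units
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
print("Linear Regression RMSE:", rmse_lr)
```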
9.2 Evaluate Decision Tree
```python
# Evaluate Decision Tree
print("Decision Tree MAE:", mean_absolute_error(y_test, y_pred_dt))
print("Decision Tree MSE:", mean_squared_error(y_test, y_pred_dt))
print("Decision Tree R²:", r2_score(y_test, y_pred_dt))
```
✅ Decision trees might overfit on small datasets, so lower R² could indicate overfitting.
9.3 Evaluate Random Forest
```python
# Evaluate Random Forest
print("Random Forest MAE:", mean_absolute_error(y_test, y_pred_rf))
print("Random Forest MSE:", mean_squared_error(y_test, y_pred_rf))
print("Random Forest R²:", r2_score(y_test, y_pred_rf))
```
✅ Random Forest tends to generalize better than Decision Trees, so higher R² is expected.
9.4 Evaluate Gradient Boosting
```python
# Evaluate Gradient Boosting
print("Gradient Boosting MAE:", mean_absolute_error(y_test, y_pred_gb))
print("Gradient Boosting MSE:", mean_squared_error(y_test, y_pred_gb))
print("Gradient Boosting R²:", r2_score(y_test, y_pred_gb))
```
✅ Gradient Boosting might improve accuracy at the cost of longer training time.
9.5 Evaluate XGBoost
```python
# Evaluate XGBoost
print("XGBoost MAE:", mean_absolute_error(y_test, y_pred_xgb))
print("XGBoost MSE:", mean_squared_error(y_test, y_pred_xgb))
print("XGBoost R²:", r2_score(y_test, y_pred_xgb))
```
✅ XGBoost is expected to have high R² due to its ability to handle complex data patterns.
9.6 Compare Models’ Performance
We can now compare the performance of all models using a table:
```python
# Create a DataFrame to compare model performance
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest',
              'Gradient Boosting', 'XGBoost'],
    'MAE': [mean_absolute_error(y_test, y_pred_lr),
            mean_absolute_error(y_test, y_pred_dt),
            mean_absolute_error(y_test, y_pred_rf),
            mean_absolute_error(y_test, y_pred_gb),
            mean_absolute_error(y_test, y_pred_xgb)],
    'MSE': [mean_squared_error(y_test, y_pred_lr),
            mean_squared_error(y_test, y_pred_dt),
            mean_squared_error(y_test, y_pred_rf),
            mean_squared_error(y_test, y_pred_gb),
            mean_squared_error(y_test, y_pred_xgb)],
    'R²': [r2_score(y_test, y_pred_lr),
           r2_score(y_test, y_pred_dt),
           r2_score(y_test, y_pred_rf),
           r2_score(y_test, y_pred_gb),
           r2_score(y_test, y_pred_xgb)]
})

# Sort by R² (descending)
results = results.sort_values(by='R²', ascending=False)
print(results)
```
✅ This table allows us to identify which model performed best based on R² and error metrics.
✅ A higher R² and lower MAE/MSE indicate better model performance.
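A single train/test split can favor one model by chance. Cross-validation averages performance over several splits for a steadier comparison; a minimal sketch, assuming the models from Step 8 are already defined (it refits each model, so it can take a while):

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R² for two of the candidate models
for name, model in {'Random Forest': rf_model, 'XGBoost': xgb_model}.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"{name}: mean R² = {scores.mean():.3f} (+/- {scores.std():.3f})")
```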
Step 10: Building the Front-End with Flask
Now that we have a working model, the next step is to create a simple Flask-based web app to allow users to input AQI data and get predictions in real-time.
10.1 Install Flask
If Flask is not installed, you can install it using:
```bash
pip install flask
```
10.2 Create the Flask App
Create a new file called app.py and write the following code:
```python
from flask import Flask, request, render_template
import pickle
import numpy as np

# Load the trained model
model = pickle.load(open('best_model.pkl', 'rb'))

app = Flask(__name__)

# Create a route for the homepage
@app.route('/')
def home():
    return render_template('index.html')

# Create a route for prediction
@app.route('/predict', methods=['POST'])
def predict():
    # Read the form values in order and convert them to floats
    features = [float(x) for x in request.form.values()]
    features_array = np.array(features).reshape(1, -1)
    prediction = model.predict(features_array)[0]
    return render_template('index.html',
                           prediction_text=f'Predicted AQI: {prediction:.2f}')

if __name__ == "__main__":
    app.run(debug=True)
```
10.3 Create the HTML Template
Create a folder named templates and, inside it, a file called index.html. Note that the six input fields below are illustrative; in practice, the form must supply values for exactly the features (in the same order) that the model was trained on, which in our pipeline includes the engineered and one-hot-encoded columns.
```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>AQI Prediction</title>
  <style>
    body { font-family: Arial, sans-serif; background-color: #f4f4f9; padding: 20px; }
    h2 { color: #333; }
    form { margin-top: 20px; }
    input { padding: 10px; margin: 5px; width: 80%; }
    button { padding: 10px 20px; background-color: #5cb85c; color: white; border: none; cursor: pointer; }
    .result { margin-top: 20px; font-size: 1.2rem; color: #2c3e50; }
  </style>
</head>
<body>
  <h2>AQI Prediction Model</h2>
  <form action="/predict" method="post">
    <input type="text" name="pm2_5" placeholder="Enter PM2.5" required="required" /><br />
    <input type="text" name="pm10" placeholder="Enter PM10" required="required" /><br />
    <input type="text" name="no" placeholder="Enter NO" required="required" /><br />
    <input type="text" name="no2" placeholder="Enter NO2" required="required" /><br />
    <input type="text" name="nox" placeholder="Enter NOx" required="required" /><br />
    <input type="text" name="co" placeholder="Enter CO" required="required" /><br />
    <button type="submit">Predict</button>
  </form>
  {% if prediction_text %}
  <div class="result">{{ prediction_text }}</div>
  {% endif %}
</body>
</html>
```
10.4 Save and Load the Model
To save the best model, use:
```python
import pickle

# best_model should be whichever model scored best in Step 9
# (here we assume the Random Forest; swap in your own winner)
best_model = rf_model

# Save the model to a file
pickle.dump(best_model, open('best_model.pkl', 'wb'))
```
10.5 Run the Flask App
To run the Flask app, use the following command:
```bash
python app.py
```
✅ Open your browser and visit http://127.0.0.1:5000 to access the app.
✅ You can enter test data and see the predicted AQI directly in the web interface.
Step 11: Deployment
Now that we have a working Flask-based application, the next step is to deploy it so that users can access it from anywhere. We’ll use Render (or Heroku as an alternative) for deployment.
11.1 Create a GitHub Repository
- Create a new repository on GitHub (e.g., aqi-prediction).
- Push your code files to the repository:
```bash
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin <repository-url>
git push -u origin main
```
11.2 Create a requirements.txt File
Create a requirements.txt file listing all dependencies:
```
Flask
numpy
pandas
scikit-learn
gunicorn
```
Note that pickle ships with Python's standard library, so it does not belong in requirements.txt; gunicorn is listed because the Procfile below uses it, and you should add xgboost if the model you saved is an XGBoost one.
11.3 Create a Procfile
Create a Procfile to define how the app should run:
```
web: gunicorn app:app
```
11.4 Deploy on Render
- Go to Render and create a new web service.
- Link your GitHub repository.
- Set the build command to `pip install -r requirements.txt`.
- Set the start command to `gunicorn app:app`.
- Deploy the app.
✅ Access the Deployed App
Once deployed, you will get a live URL like:
https://your-app-name.onrender.com
Users can now access the app, input their data, and get AQI predictions in real-time.
Conclusion
In this project, we successfully built a machine learning model to predict Air Quality Index (AQI) using historical air quality data from major cities in India. We followed a structured, step-by-step approach that included:
✅ Data collection and exploration
✅ Data cleaning and preprocessing
✅ Feature engineering and model building
✅ Training and evaluation of multiple models
✅ Building a user-friendly Flask-based front end
✅ Deployment on Render
The project demonstrated how to handle complex environmental data and create an accurate AQI prediction model. This solution can be further enhanced by adding more data sources, fine-tuning models, and experimenting with deep learning techniques for improved accuracy.
Next Steps
- Try More Models: Experiment with more complex models like Gradient Boosting, XGBoost, and LightGBM to improve accuracy.
- Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV to optimize model parameters (see the sketch after this list).
- Feature Engineering: Create additional features, such as seasonal patterns or regional AQI trends.
- Deep Learning: Test neural networks like LSTM (Long Short-Term Memory) for better handling of time-series data.
- Real-Time Prediction: Set up real-time data ingestion and prediction using an API.
- Model Monitoring: Implement monitoring tools to track model performance over time and adjust for data drift.
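As a starting point for the hyperparameter tuning step above, here is a minimal GridSearchCV sketch for the Random Forest; the grid values are illustrative, not a recommendation:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Small illustrative grid; widen it once you know the promising ranges
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,  # use all CPU cores
)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Best CV R²:", grid.best_score_)
```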