Boston Housing Price Project Report with Source Code in Python

KANGKAN KALITA

The Boston Housing Price Dataset is a classic dataset used in regression analysis. It describes housing in Boston suburbs through features such as crime rate, property tax rate, and average number of rooms, alongside the median home value of each area. This project predicts house prices from these features using Python. The report covers data collection, cleaning, exploratory data analysis (EDA), visualization, handling outliers, feature engineering, model building, and evaluation.

Performing EDA on the Boston Housing dataset provides insights into feature relationships and their influence on house prices. This project will also cover building and evaluating regression models to predict prices.

Objective:

  • Conduct Exploratory Data Analysis (EDA) on the Boston Housing Price Dataset.
  • Visualize relationships between features and housing prices.
  • Develop and implement machine learning models to predict housing prices.

Dataset:

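The Boston Housing Dataset shipped with scikit-learn versions before 1.2. The load_boston() function was deprecated in version 1.0 and removed in version 1.2, so the import below works only on older versions.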

from sklearn.datasets import load_boston
boston = load_boston()

Alternatively, it can be downloaded from online sources or Kaggle.
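Because load_boston() is missing from scikit-learn 1.2 and later, a loader with a fallback can be handy. The sketch below is one possible approach and not part of the original project: the helper name load_boston_frame is hypothetical, and the fallback assumes that the OpenML copy of the dataset (name "boston", version 1) is reachable over the network and names its target column MEDV.

```python
import pandas as pd


def load_boston_frame():
    """Return the Boston data as a DataFrame with a PRICE column (hypothetical helper)."""
    try:
        # Available only in scikit-learn < 1.2; newer versions raise ImportError here.
        from sklearn.datasets import load_boston

        boston = load_boston()
        df = pd.DataFrame(boston.data, columns=boston.feature_names)
        df["PRICE"] = boston.target
    except ImportError:
        # Assumption: the OpenML dataset "boston" (version 1) is reachable over the
        # network and its target column is called MEDV.
        from sklearn.datasets import fetch_openml

        bunch = fetch_openml(name="boston", version=1, as_frame=True)
        df = bunch.frame.rename(columns={"MEDV": "PRICE"})
    return df
```

Calling df = load_boston_frame() then gives a frame that the later steps can use. Some columns from the OpenML copy may arrive as categorical and may need converting to numeric before modelling.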

Tools & Libraries:

  • Python
  • Pandas – Data manipulation and analysis.
  • NumPy – Numerical operations.
  • Matplotlib – Visualization.
  • Seaborn – Advanced visualization.
  • Scikit-learn – Machine learning models.
  • Jupyter Notebook or Google Colab.

Implementation Steps:

1. Data Collection & Setup

In this step, we will load the dataset and import necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston  # requires scikit-learn < 1.2

# Load the dataset
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['PRICE'] = boston.target  # Add target column (median home value in $1000s)
df.head()
  • Explanation:
    • load_boston() loads the copy of the dataset bundled with scikit-learn (versions before 1.2).
    • A DataFrame is created with feature names as columns.
    • The target variable (house price) is appended as PRICE.

If load_boston() is not available in your scikit-learn version, download the dataset as a CSV file and load it as follows.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = "your_dataset_path.csv"  # placeholder: replace with the path to your downloaded CSV
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
print(df.head())
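CSV copies of the dataset do not always use the same headers. For example, the target column is often called MEDV rather than PRICE, and names may be lower-case. A minimal sketch of normalizing the headers follows; the inline three-row CSV only stands in for a downloaded file, and its header layout is an assumption.

```python
import io

import pandas as pd

# Illustrative stand-in for a downloaded file; the header names are an assumption
# about how a CSV copy might be laid out.
csv_text = """crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
"""

df = pd.read_csv(io.StringIO(csv_text))

# Use upper-case column names, matching the names used in the rest of the report.
df.columns = df.columns.str.upper()

# Rename the target so later steps can refer to it as PRICE.
df = df.rename(columns={"MEDV": "PRICE"})

print(df.columns.tolist())
```

With a real file, replace io.StringIO(csv_text) with the file path; the renaming steps stay the same.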

2. Data Exploration

Now, let’s explore the dataset to understand its structure and summary statistics.

df.info()  # Overview of data types and null values
df.describe()  # Summary statistics for numerical features
  • Explanation:
    • info() shows the number of rows, the column names, their data types, and the non-null count per column.
    • describe() provides statistics like mean, standard deviation, and quartiles.
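Since describe() returns a DataFrame, individual statistics can also be picked out in code. A small sketch on made-up values (not the real dataset):

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({"RM": [5.0, 6.0, 7.0, 8.0], "PRICE": [15.0, 20.0, 30.0, 35.0]})

summary = df.describe()

# Rows of the summary are statistics, columns are features.
mean_rooms = summary.loc["mean", "RM"]
median_price = summary.loc["50%", "PRICE"]

print(df.shape)  # (rows, columns)
print(mean_rooms, median_price)
```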

3. Data Cleaning

We will handle missing values and ensure the dataset is ready for analysis.

# Check for missing values
df.isnull().sum()

# Fill missing values with each column's median if any are present (example)
df = df.fillna(df.median(numeric_only=True))
  • Explanation:
    • This step ensures there are no missing values that could disrupt analysis.
    • If necessary, missing values can be filled using median or mean values.
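To see what median imputation does, here is a minimal sketch on a tiny made-up frame with one gap per column. The values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap per column, for illustration only.
df = pd.DataFrame({
    "RM": [6.0, np.nan, 7.0, 5.0],
    "LSTAT": [4.0, 9.0, np.nan, 12.0],
})

missing_before = df.isnull().sum()

# Replace each missing value with the median of its own column.
df = df.fillna(df.median(numeric_only=True))

missing_after = df.isnull().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for skewed columns.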

4. Visualization & Insights

Correlation Heatmap

plt.figure(figsize=(10,8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap for Boston Housing Dataset')
plt.show()
  • Explanation:
    • The heatmap shows the correlation between features and the target variable (PRICE).
  • Insight:
    • RM (average number of rooms) correlates strongly and positively with price, while LSTAT (percentage of lower-status population) correlates strongly and negatively.
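A heatmap can be hard to read with many features, so it may help to rank features by their correlation with PRICE. The sketch below uses synthetic data built so that PRICE rises with RM and falls with LSTAT; the column NOISE and all numeric values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: PRICE rises with RM and falls with LSTAT.
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.0, 0.7, n)
lstat = rng.normal(12.0, 5.0, n)
noise_col = rng.normal(0.0, 1.0, n)  # feature unrelated to PRICE
price = 5.0 * rm - 0.8 * lstat + rng.normal(0.0, 1.0, n)

df = pd.DataFrame({"RM": rm, "LSTAT": lstat, "NOISE": noise_col, "PRICE": price})

# Correlation of every feature with PRICE, ordered by absolute strength.
corr_with_price = df.corr()["PRICE"].drop("PRICE")
ranked = corr_with_price.sort_values(key=abs, ascending=False)
print(ranked)
```

On the real DataFrame, the last two statements can be applied unchanged once PRICE has been added.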

Scatter Plot – Price vs Rooms

plt.figure(figsize=(8,6))
sns.scatterplot(x='RM', y='PRICE', data=df)
plt.title('Price vs Average Number of Rooms')
plt.show()
  • Explanation:
    • Scatter plots show how two variables vary together, which makes linear trends between a feature and the target easy to spot.
  • Insight:
    • Houses with more rooms (RM) generally have higher prices.
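The trend seen in a scatter plot can be quantified with a straight-line fit. The following sketch uses np.polyfit on synthetic data; the slope of 9 and the other constants are assumptions chosen for illustration, not values from the dataset.

```python
import numpy as np

# Synthetic data for illustration: price grows by about 9 units per extra room.
rng = np.random.default_rng(1)
rooms = rng.uniform(4.0, 8.0, 100)
price = 9.0 * rooms - 34.0 + rng.normal(0.0, 2.0, 100)

# Degree-1 least-squares fit: price is approximated by slope * rooms + intercept.
slope, intercept = np.polyfit(rooms, price, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On the real data, np.polyfit(df['RM'], df['PRICE'], deg=1) would give the estimated change in PRICE per additional room.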

Histogram – House Price Distribution

plt.figure(figsize=(8,6))
sns.histplot(df['PRICE'], bins=30, kde=True)
plt.title('Distribution of House Prices')
plt.show()
  • Insight:
    • PRICE is recorded in thousands of dollars, so most values fall between 10 and 40, that is, $10,000 to $40,000.

5. Handling Outliers

plt.figure(figsize=(8,6))
sns.boxplot(x=df['PRICE'])
plt.title('Box Plot for House Prices')
plt.show()
  • Explanation:
    • Box plots highlight potential outliers in the data.

Next, we remove rows whose PRICE lies more than 1.5 × IQR outside the quartiles.
Q1 = df['PRICE'].quantile(0.25)
Q3 = df['PRICE'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df = df[(df['PRICE'] >= lower) & (df['PRICE'] <= upper)]
  • Insight:
    • Rows whose PRICE lies outside the 1.5 × IQR fences are removed to reduce the influence of extreme values on the model.
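The same IQR rule can be wrapped in a reusable function. This is a sketch: the function name remove_iqr_outliers and the sample prices are made up for illustration.

```python
import pandas as pd


def remove_iqr_outliers(frame, column, k=1.5):
    """Drop rows whose `column` value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]


# Made-up prices for illustration; 95.0 is an obvious outlier.
df = pd.DataFrame({"PRICE": [18.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 95.0]})
cleaned = remove_iqr_outliers(df, "PRICE")
print(len(df), "->", len(cleaned))
```

Comparing the row counts before and after is a quick way to check how much data the rule discards.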

6. Feature Engineering

We will create new features to improve model accuracy.

df['TAX_RM'] = df['TAX'] / df['RM']  # Tax per room
# AGE quartile as an integer code (1 = lowest, 4 = highest), so the model receives a numeric column
df['AGE_CAT'] = pd.qcut(df['AGE'], q=4, labels=[1,2,3,4]).astype(int)
  • Explanation:
    • Feature engineering enhances model performance by introducing new variables derived from existing features.
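To make the two transformations concrete, here is a small sketch on made-up values. It shows a ratio feature and quantile binning with pd.qcut, which places roughly the same number of rows in each bin.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "TAX": [300.0, 250.0, 400.0, 660.0, 310.0, 280.0, 430.0, 220.0],
    "RM": [6.0, 5.0, 8.0, 5.5, 6.2, 7.0, 5.0, 4.0],
    "AGE": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
})

# Ratio feature: property tax rate per room.
df["TAX_RM"] = df["TAX"] / df["RM"]

# Quantile binning: each of the 4 bins receives roughly the same number of rows.
df["AGE_CAT"] = pd.qcut(df["AGE"], q=4, labels=[1, 2, 3, 4]).astype(int)

print(df[["AGE", "AGE_CAT", "TAX_RM"]])
```

Whether such features actually help should be checked by comparing model errors with and without them.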

7. Model Building

Now, let’s build a regression model to predict housing prices.

from sklearn.model_selection import train_test_split  
from sklearn.linear_model import LinearRegression  
from sklearn.metrics import mean_absolute_error, mean_squared_error  

X = df.drop('PRICE', axis=1)  
y = df['PRICE']  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  

model = LinearRegression()  
model.fit(X_train, y_train)  

y_pred = model.predict(X_test)  
  • Explanation:
    • A Linear Regression model is fitted on 80% of the data and then used to predict prices for the remaining 20%.
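After fitting, the coefficients show how the model weighs each feature. The sketch below trains on synthetic data with known coefficients (5 for RM, -0.8 for LSTAT, both assumptions for illustration) so that the fitted values can be compared with the truth.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: PRICE = 5*RM - 0.8*LSTAT + 10 + noise (illustrative assumption).
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({"RM": rng.normal(6.0, 0.7, n), "LSTAT": rng.normal(12.0, 5.0, n)})
y = 5.0 * X["RM"] - 0.8 * X["LSTAT"] + 10.0 + rng.normal(0.0, 1.0, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Pair each fitted coefficient with its feature name.
coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients)
print("Intercept:", model.intercept_)
```

On the real data, the same pd.Series(model.coef_, index=X.columns) call lists the coefficient of every feature; note that coefficients of features on different scales are not directly comparable.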

8. Model Evaluation

print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))  
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))  
print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test, y_pred)))

# Display the predictions and actual values
predictions = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(predictions.head(10))  # Display the first 10 predictions for better readability
  • Insight:
    • MAE and RMSE are in the same units as PRICE (thousands of dollars), while MSE is in squared units. Lower values mean the predictions are closer to the actual prices.
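To show what these metrics measure, the sketch below computes them on four made-up price pairs and repeats MAE and MSE by hand from the residuals. The R² score is an addition that is not part of the original evaluation; it is a common companion metric that reports the share of variance explained.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices (in thousands of dollars), for illustration.
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_hat = np.array([22.0, 24.0, 27.0, 36.0])

mae = mean_absolute_error(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_hat)

# The same MAE and MSE computed by hand from the residuals.
residuals = y_true - y_hat
manual_mae = np.mean(np.abs(residuals))
manual_mse = np.mean(residuals ** 2)

print(mae, mse, rmse, r2)
```

Here the MAE of 1.75 means the predictions are off by $1,750 on average.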

Conclusion:

In this project, we conducted EDA, handled missing data and outliers, visualized feature relationships, engineered new features, and built and evaluated a regression model to predict housing prices.

This project demonstrates key data science skills like data cleaning, visualization, and model evaluation.
