Boston Housing Price Project Report with Source Code in Python
The Boston Housing Price Dataset is a classic dataset used in regression analysis. It contains information about housing prices in Boston suburbs, including factors like crime rate, property tax, number of rooms, and more. This project aims to predict house prices from multiple features using Python. The report covers data collection, exploration, cleaning, visualization, outlier handling, feature engineering, model building, and evaluation.
Performing EDA on the Boston Housing dataset provides insights into feature relationships and their influence on house prices. This project will also cover building and evaluating regression models to predict prices.
Objective:
- Conduct Exploratory Data Analysis (EDA) on the Boston Housing Price Dataset.
- Visualize relationships between features and housing prices.
- Develop and implement machine learning models to predict housing prices.
Dataset:
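The Boston Housing Dataset was bundled with the sklearn library up to version 1.1 (load_boston was deprecated in 1.0 and removed in 1.2). On those versions it can be loaded as follows:
from sklearn.datasets import load_boston
boston = load_boston()
Alternatively, it can be downloaded as a CSV file from online sources or Kaggle.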
Tools & Libraries:
- Python
- Pandas – Data manipulation and analysis.
- NumPy – Numerical operations.
- Matplotlib – Visualization.
- Seaborn – Advanced visualization.
- Scikit-learn – Machine learning models.
- Jupyter Notebook or Google Colab.
Implementation Steps:
1. Data Collection & Setup
In this step, we will load the dataset and import necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston  # available only in scikit-learn < 1.2

# Load the dataset
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['PRICE'] = boston.target  # Add target column
df.head()

- Explanation:
- load_boston() fetches the dataset directly from sklearn.
- A DataFrame is created with feature names as columns.
- The target variable (house price) is appended as PRICE.
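If you cannot load the Boston dataset directly from sklearn (load_boston was removed in scikit-learn 1.2), download the dataset as a CSV file and follow this step instead.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = "your dataset path"  # replace with the path to your CSV file
df = pd.read_csv(file_path)

# Assumption: some CSV copies name the target MEDV; rename it so later steps can use PRICE
df = df.rename(columns={'MEDV': 'PRICE'})

# Display the first few rows of the dataset
print(df.head())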

2. Data Exploration
Now, let’s explore the dataset to understand its structure and summary statistics.
df.info()      # Overview of data types and null values
df.describe()  # Summary statistics for numerical features

- Explanation:
- info() shows the dataset’s shape, columns, and missing data.
- describe() provides statistics like mean, standard deviation, and quartiles.
3. Data Cleaning
We will handle missing values and ensure the dataset is ready for analysis.
# Check for missing values
print(df.isnull().sum())

# Fill or drop missing values if necessary (Example)
df = df.fillna(df.median(numeric_only=True))
- Explanation:
- This step ensures there are no missing values that could disrupt analysis.
- If necessary, missing values can be filled using median or mean values.
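To make the median-versus-mean choice concrete, here is a minimal sketch on a made-up column (the values and the toy DataFrame are illustrative, not taken from the Boston data). It shows how a single extreme value pulls the mean fill far from the typical observation, while the median fill stays close to it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one missing value and one extreme value
toy = pd.DataFrame({"CRIM": [0.1, 0.2, 0.3, np.nan, 80.0]})

mean_value = toy["CRIM"].mean()      # pulled upward by the extreme value 80.0
median_value = toy["CRIM"].median()  # middle of the observed values

# Fill the missing entry both ways to compare the results
filled_mean = toy["CRIM"].fillna(mean_value)
filled_median = toy["CRIM"].fillna(median_value)

print("Value used by mean fill:  ", filled_mean[3])
print("Value used by median fill:", filled_median[3])
```

For skewed columns the median is usually the safer default, which is consistent with the fillna example in this step.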
4. Visualization & Insights
Correlation Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap for Boston Housing Dataset')
plt.show()

- Explanation:
- The heatmap shows the correlation between features and the target variable (PRICE).
- Insight:
- Features like RM (average number of rooms) and LSTAT (lower status population) have strong correlations with housing prices: RM is positively correlated with price, LSTAT negatively.
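A heatmap can be hard to read when there are many features, so it may help to rank features by their correlation with the target. The sketch below uses synthetic data (the RM, LSTAT and NOISE columns and their coefficients are made up for illustration); with the real dataset you would apply the same two lines to df.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (all values are illustrative assumptions)
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.0, 0.7, n)       # rooms: should correlate positively with price
lstat = rng.normal(12.0, 5.0, n)   # lower-status share: should correlate negatively
noise_col = rng.normal(0.0, 1.0, n)  # unrelated feature
price = 5.0 * rm - 0.8 * lstat + rng.normal(0.0, 1.0, n)

toy = pd.DataFrame({"RM": rm, "LSTAT": lstat, "NOISE": noise_col, "PRICE": price})

# Correlation of every feature with the target, excluding the target itself
corr_with_price = toy.corr()["PRICE"].drop("PRICE")

# Order by absolute strength while keeping the sign visible
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships, such as LSTAT, near the top instead of at the bottom of the list.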
Scatter Plot – Price vs Rooms
plt.figure(figsize=(8, 6))
sns.scatterplot(x='RM', y='PRICE', data=df)
plt.title('Price vs Average Number of Rooms')
plt.show()

- Explanation:
- Scatter plots visualize linear relationships between features and the target variable.
- Insight:
- Houses with more rooms (RM) generally have higher prices.
Histogram – House Price Distribution
plt.figure(figsize=(8, 6))
sns.histplot(df['PRICE'], bins=30, kde=True)
plt.title('Distribution of House Prices')
plt.show()

- Insight:
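- Most houses are priced between $10,000 and $40,000 (the target is the median home value, recorded in thousands of dollars, so these appear as 10 to 40 on the axis).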
5. Handling Outliers
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['PRICE'])
plt.title('Box Plot for House Prices')
plt.show()

- Explanation:
- Box plots highlight potential outliers in the data.
Q1 = df['PRICE'].quantile(0.25)
Q3 = df['PRICE'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df = df[(df['PRICE'] >= lower) & (df['PRICE'] <= upper)]
- Insight:
- Houses with extremely high or low prices are removed to improve model performance.
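The IQR rule is easier to follow with numbers small enough to check by hand. The following sketch applies the same bounds to a made-up price series (the values are illustrative only) and separates the rows that would be kept from those that would be removed.

```python
import pandas as pd

# Hypothetical prices in thousands of dollars; 50 is the suspicious value
prices = pd.Series([10, 12, 14, 15, 16, 18, 50], name="PRICE")

q1 = prices.quantile(0.25)  # first quartile
q3 = prices.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

# Anything more than 1.5 * IQR beyond the quartiles is treated as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = prices[(prices >= lower) & (prices <= upper)]
removed = prices[(prices < lower) | (prices > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=[{lower}, {upper}]")
print("Removed values:", removed.tolist())
```

Printing the removed rows before dropping them is a cheap sanity check that the filter is not discarding more data than intended.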
6. Feature Engineering
We will create new features to improve model accuracy.
df['TAX_RM'] = df['TAX'] / df['RM']  # Tax per room
# Categorize AGE into quartiles; cast to int so the column is numeric for the model
df['AGE_CAT'] = pd.qcut(df['AGE'], q=4, labels=[1, 2, 3, 4]).astype(int)
- Explanation:
- Feature engineering can improve model performance by introducing new variables derived from existing features; here TAX_RM is the tax per room and AGE_CAT is the quartile of AGE.
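As a small illustration of what the two derived features look like, the sketch below builds them on a made-up eight-row table (the TAX, RM and AGE values are assumptions chosen so the result is easy to verify). pd.qcut splits AGE at its quartiles, so each of the four categories receives the same number of rows.

```python
import pandas as pd

# Hypothetical rows; values are illustrative, not from the Boston data
toy = pd.DataFrame({
    "TAX": [300.0, 250.0, 400.0, 350.0, 320.0, 280.0, 450.0, 500.0],
    "RM":  [6.0, 5.0, 8.0, 7.0, 6.4, 5.6, 7.5, 8.0],
    "AGE": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
})

# Ratio feature: property tax per room
toy["TAX_RM"] = toy["TAX"] / toy["RM"]

# Quartile-based category for AGE, stored as plain integers 1..4
toy["AGE_CAT"] = pd.qcut(toy["AGE"], q=4, labels=[1, 2, 3, 4]).astype(int)

print(toy[["TAX_RM", "AGE_CAT"]])
print(toy["AGE_CAT"].value_counts().sort_index())
```

Storing the category as an integer keeps the column usable by a linear model, though it also implies an ordering between the bins; one-hot encoding is an alternative when that ordering is not wanted.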
7. Model Building
Now, let’s build a regression model to predict housing prices.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

X = df.drop('PRICE', axis=1)
y = df['PRICE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
- Explanation:
- A Linear Regression model is trained and tested on the dataset.
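Once a linear model is fitted, its coefficients show how each feature moves the prediction. The sketch below is a self-contained illustration on synthetic data where the true relationship is known (the feature names mirror the dataset, but the values and the coefficients 4.0, -0.5 and 10.0 are assumptions made for the example), so you can see the model recover it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship (illustrative assumption)
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "RM": rng.normal(6.0, 0.7, n),
    "LSTAT": rng.normal(12.0, 5.0, n),
})
toy["PRICE"] = 4.0 * toy["RM"] - 0.5 * toy["LSTAT"] + 10.0 + rng.normal(0.0, 0.5, n)

X = toy.drop("PRICE", axis=1)
y = toy["PRICE"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Pair each coefficient with its feature name for readability
coefs = pd.Series(model.coef_, index=X.columns)
test_r2 = model.score(X_test, y_test)  # R^2 on held-out data

print(coefs)
print("Intercept:", model.intercept_)
print("Test R^2:", test_r2)
```

On the real dataset the same coefs line works unchanged, although coefficients are only directly comparable across features after the features have been put on a common scale.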
8. Model Evaluation
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test, y_pred)))

# Display the predictions and actual values
predictions = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(predictions.head(10))  # Display the first 10 predictions for better readability

- Insight:
- The error metrics provide an understanding of how well the model performs.
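To see what these metrics actually measure, it can help to compute them by hand once. The sketch below does so with NumPy on four made-up actual and predicted prices (illustrative values, in thousands of dollars) and also adds R², which the report does not print but which is a common companion metric.

```python
import numpy as np

# Hypothetical actual and predicted prices, in thousands of dollars
y_true = np.array([20.0, 22.0, 30.0, 34.0])
y_hat = np.array([21.0, 21.0, 32.0, 32.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))  # average size of the error, in price units
mse = np.mean(errors ** 2)     # penalizes large errors more heavily
rmse = np.sqrt(mse)            # back in price units, comparable to MAE

# R^2: share of the variance in y_true explained by the predictions
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE: ", mae)
print("MSE: ", mse)
print("RMSE:", rmse)
print("R^2: ", r2)
```

Because the target is in thousands of dollars, an MAE of 1.5 here would mean predictions are off by about $1,500 on average; RMSE is always at least as large as MAE, and a wide gap between them points to a few large errors.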
Conclusion:
Through this Boston Housing Price Project Report with Source Code in Python, we successfully conducted EDA, handled missing data, visualized feature relationships, and built a regression model to predict housing prices.
This project demonstrates key data science skills like data cleaning, visualization, and model evaluation.