Predicting House Prices using Machine Learning

By KANGKAN KALITA, Data Scientist at LeadTech Group

Buying or selling a house is one of the most significant financial decisions individuals make. Accurately predicting house prices helps real estate agents, developers, and buyers make data-driven decisions. With the rise of Machine Learning, we can now build predictive models that analyze features like square footage, number of bedrooms, bathrooms, and location to estimate property prices.

In this tutorial, we’ll build a House Price Prediction project using regression models in Python. We’ll use a real-world dataset from Kaggle, the King County House Sales Dataset, which contains information on homes sold in King County, USA, including Seattle.

1. Dataset Overview

The dataset contains 21,613 records with 21 columns, including:

price: Target variable (house price)
sqft_living: Square footage of the living space
bedrooms: Number of bedrooms
bathrooms: Number of bathrooms
zipcode: Location identifier

2. Data Loading and Preprocessing

2.1 Import Libraries and Load Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('kc_house_data.csv')  # Replace with the path to your local copy of the CSV
df.head()

2.2 Check for Missing Values

df.info()
df.isnull().sum()  # No missing values in this dataset
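
If you reuse this workflow on a dataset that does have gaps, a small helper can make the check easier to read. The sketch below is an illustration only: the function name missing_report and the toy values are assumptions, not part of the Kaggle data.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(frame),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame with hypothetical values (not taken from the Kaggle data)
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, np.nan, 604000.0],
    "bedrooms": [3, 3, 2, 4],
    "sqft_living": [1180.0, np.nan, 770.0, np.nan],
})

report = missing_report(toy)
print(report)
```

Columns without gaps (here, bedrooms) are left out of the report, so an empty result means there is nothing to clean.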

2.3 Drop Irrelevant Features

We’ll drop id, date, lat, and long for simplicity. The zipcode column stays in the data as our location signal.

df.drop(['id', 'date', 'lat', 'long'], axis=1, inplace=True)

3. Exploratory Data Analysis (EDA)

3.1 Price Distribution

sns.histplot(df['price'], kde=True)
plt.title("House Price Distribution")
plt.xlabel("Price")
plt.show()

3.2 Correlation Heatmap

plt.figure(figsize=(14,10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()
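
A heatmap with this many columns can be hard to read, so it may help to also rank features by their correlation with price. Here is a minimal sketch on a toy frame; the values are made up for illustration, and on the real data you would call the same methods on df.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the Kaggle dataset
toy = pd.DataFrame({
    "price": [200000, 300000, 400000, 500000, 600000],
    "sqft_living": [900, 1400, 1800, 2500, 2900],
    "bedrooms": [2, 3, 3, 4, 4],
    "yr_built": [1990, 1955, 2001, 1978, 1965],
})

# Pearson correlation of every feature with the target, strongest positive first
corr_with_price = (
    toy.corr()["price"]
    .drop("price")
    .sort_values(ascending=False)
)
print(corr_with_price)
```

Keep in mind that Pearson correlation only measures linear association, so a low value does not prove a feature is useless.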

3.3 Scatter Plot: Square Footage vs Price

sns.scatterplot(x='sqft_living', y='price', data=df)
plt.title("Price vs. Living Area")
plt.xlabel("Square Feet (Living)")
plt.ylabel("Price")
plt.show()

4. Feature Selection and Splitting the Data

4.1 Selecting Important Features

features = ['sqft_living', 'bedrooms', 'bathrooms', 'floors', 'zipcode']
X = df[features]
y = df['price']

4.2 One-Hot Encode Zipcode

X = pd.get_dummies(X, columns=['zipcode'], drop_first=True)
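
To see what this step does, here is a small sketch on a toy frame (the rows are hypothetical). With drop_first=True, pandas drops the indicator column for the first zipcode in sorted order, which avoids perfectly collinear columns in Linear Regression.

```python
import pandas as pd

# Toy frame with hypothetical rows
toy = pd.DataFrame({
    "sqft_living": [1180, 2570, 770, 1960],
    "zipcode": [98178, 98125, 98028, 98125],
})

encoded = pd.get_dummies(toy, columns=["zipcode"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

The row with zipcode 98028 has no indicator set in any remaining column; that zipcode becomes the baseline the other zipcodes are compared against.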

4.3 Train-Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
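
As a quick sanity check, the sketch below runs the same call on synthetic arrays (X_demo and y_demo are placeholder names, not the housing data). It shows that test_size=0.2 holds out 20% of the rows and that a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```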

5. Model Training and Evaluation

5.1 Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Linear Regression R2 Score:", r2_score(y_test, y_pred_lr))
print("MSE:", mean_squared_error(y_test, y_pred_lr))
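
MSE is expressed in squared price units, which makes it hard to interpret. A common complement is to report RMSE and MAE, which are in the same units as the price. The sketch below uses made-up numbers; in the tutorial you would pass y_test and y_pred_lr instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted prices, for illustration only
y_true = np.array([300000.0, 450000.0, 520000.0, 700000.0])
y_hat = np.array([310000.0, 430000.0, 540000.0, 650000.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # Root of MSE, back in price units
mae = mean_absolute_error(y_true, y_hat)  # Average absolute error in price units

print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)
```

RMSE penalizes large misses more heavily than MAE, so a big gap between the two hints at a few badly mispredicted houses.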

Residual Plot

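# Residuals are the differences between actual and predicted prices
residuals = y_test - y_pred_lr

sns.scatterplot(x=y_pred_lr, y=residuals, alpha=0.4)
plt.axhline(0, color='red', linestyle='--')  # Zero-error reference line
plt.xlabel('Predicted Price')
plt.ylabel('Residuals (Actual - Predicted)')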
plt.title('Linear Regression Residual Plot')
plt.show()

5.2 Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest R2 Score:", r2_score(y_test, y_pred_rf))
print("MSE:", mean_squared_error(y_test, y_pred_rf))

Feature Importance Plot

importances = rf.feature_importances_
indices = np.argsort(importances)[-10:]  # Top 10 features
features_top = X.columns[indices]

plt.figure(figsize=(10,6))
plt.title("Feature Importance (Random Forest)")
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), features_top)
plt.xlabel("Relative Importance")
plt.show()
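
Because zipcode was one-hot encoded, its importance is spread across many indicator columns, and a top-10 plot can understate it. One option is to sum the importances of all zipcode_ columns into a single score. The sketch below does this on synthetic data; the prices, zipcodes, and premiums are invented for illustration and are not results from the Kaggle dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical values, not the Kaggle dataset)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "sqft_living": rng.integers(500, 4000, size=n),
    "zipcode": rng.choice([98001, 98002, 98003], size=n),
})
zip_premium = toy["zipcode"].map({98001: 0, 98002: 50000, 98003: 150000})
price = 200 * toy["sqft_living"] + zip_premium + rng.normal(0, 10000, size=n)

X_toy = pd.get_dummies(toy, columns=["zipcode"], drop_first=True)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_toy, price)

# Map every zipcode_* indicator back to a single "zipcode" group and sum
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
grouped = (
    importances
    .groupby(lambda name: "zipcode" if name.startswith("zipcode_") else name)
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```

In the tutorial itself, the same grouping can be applied to rf.feature_importances_ with X.columns as the index.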

5.3 Model Comparison

models = ['Linear Regression', 'Random Forest']
r2_scores = [r2_score(y_test, y_pred_lr), r2_score(y_test, y_pred_rf)]

sns.barplot(x=models, y=r2_scores)
plt.title("Model Comparison (R2 Score)")
plt.ylabel("R2 Score")
plt.show()

6. Conclusion

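We built two regression models to predict house prices from features such as square footage, bedrooms, bathrooms, floors, and zipcode. Here’s what we learned:

Linear Regression is simple and fast to train, but it may underfit non-linear relationships between features and price.
Random Forest can capture non-linear relationships and often achieves a higher R2 score; it also provides feature importance scores, although the model itself is harder to interpret than a linear one.
In the feature importance plot, expect sqft_living and the zipcode indicator columns to rank among the most influential features.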

Future Directions

Want to take this project to the next level? Try these:

Deploy the Model:
- Use Flask or Streamlit to create a web app for real-time predictions.
Use Deep Learning:
- Implement a Neural Network Regressor using TensorFlow or PyTorch.
Add More Features:
- Include columns like view, condition, and grade for better accuracy.
Hyperparameter Tuning:
- Use GridSearchCV or RandomizedSearchCV to optimize models.
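
As a starting point for the hyperparameter tuning idea, here is a minimal GridSearchCV sketch. It runs on synthetic data, and the grid values are arbitrary assumptions chosen to keep the search fast; for the real project you would fit on X_train and y_train and pick a grid that suits your compute budget.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical, not the housing dataset)
rng = np.random.default_rng(42)
X_demo = rng.uniform(500, 4000, size=(200, 2))
y_demo = 150 * X_demo[:, 0] + 50 * X_demo[:, 1] + rng.normal(0, 20000, size=200)

# Small illustrative grid: 2 x 2 = 4 candidate settings
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,          # 3-fold cross-validation on the training data
    scoring="r2",  # Same metric used in the model comparison
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated R2:", search.best_score_)
```

After fitting, search.best_estimator_ holds a model refit on all the data passed to fit, which you can then evaluate once on the held-out test set.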

SEO-Optimized Summary

Learn how to build a House Price Prediction model using Python and Machine Learning with this hands-on tutorial. Use real-world data from Kaggle, perform EDA, apply Linear Regression and Random Forest, and visualize results with residual plots and feature importance charts. Ideal for beginners in real estate data science, this guide walks you through every step to accurately estimate home values.

Keywords: House Price Prediction, Machine Learning, Regression Models, House Sales, Real Estate Data
