Predicting House Prices using Machine Learning

Buying or selling a house is one of the most significant financial decisions individuals make. Accurately predicting house prices helps real estate agents, developers, and buyers make data-driven decisions. With the rise of Machine Learning, we can now build predictive models that analyze features like square footage, number of bedrooms, bathrooms, and location to estimate property prices.
In this tutorial, we’ll build a House Price Prediction project using regression models in Python. We’ll use a real-world dataset from Kaggle:
King County House Sales Dataset
This dataset contains information on homes sold in King County, USA, including Seattle.
1. Dataset Overview
The dataset contains 21,613 records with 21 features, including:
- price: Target variable (house price)
- sqft_living: Square footage of the living space
- bedrooms: Number of bedrooms
- bathrooms: Number of bathrooms
- zipcode: Location identifier
2. Data Loading and Preprocessing
2.1 Import Libraries and Load Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('kc_house_data.csv')  # Replace with the correct path to your local copy
df.head()
2.2 Check for Missing Values
df.info()
df.isnull().sum()  # No missing values in this dataset
2.3 Drop Irrelevant Features
We’ll drop the id, date, lat, and long columns for simplicity.
df.drop(['id', 'date', 'lat', 'long'], axis=1, inplace=True)
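With those columns removed, a quick sanity check confirms that the features listed in section 1 look reasonable. A minimal sketch, assuming the standard column names above:

# Confirm the remaining columns and summarize the key numeric features
print(df.shape)  # 4 of the 21 original columns have been dropped
print(df[['price', 'sqft_living', 'bedrooms', 'bathrooms']].describe())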
3. Exploratory Data Analysis (EDA)
3.1 Price Distribution
sns.histplot(df['price'], kde=True)
plt.title("House Price Distribution")
plt.xlabel("Price")
plt.show()
3.2 Correlation Heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()
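With this many features, the heatmap can be hard to scan. An optional complement, sketched below, is to rank features by the strength of their correlation with the target:

# Rank features by absolute correlation with price
price_corr = df.corr()['price'].drop('price').sort_values(key=abs, ascending=False)
print(price_corr.head(10))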
3.3 Scatter Plot: Square Footage vs Price
sns.scatterplot(x='sqft_living', y='price', data=df)
plt.title("Price vs. Living Area")
plt.xlabel("Square Feet (Living)")
plt.ylabel("Price")
plt.show()
4. Feature Selection and Splitting the Data
4.1 Selecting Important Features
features = ['sqft_living', 'bedrooms', 'bathrooms', 'floors', 'zipcode']
X = df[features]
y = df['price']
4.2 One-Hot Encode Zipcode
X = pd.get_dummies(X, columns=['zipcode'], drop_first=True)
4.3 Train-Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
5. Model Training and Evaluation
5.1 Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Linear Regression R2 Score:", r2_score(y_test, y_pred_lr))
print("MSE:", mean_squared_error(y_test, y_pred_lr))
Residual Plot
# Note: lowess=True requires the statsmodels package to be installed
sns.residplot(x=y_test, y=y_pred_lr, lowess=True, color='red')
plt.xlabel('Actual Price')
plt.ylabel('Residuals')
plt.title('Linear Regression Residual Plot')
plt.show()
5.2 Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest R2 Score:", r2_score(y_test, y_pred_rf))
print("MSE:", mean_squared_error(y_test, y_pred_rf))
Feature Importance Plot
importances = rf.feature_importances_
indices = np.argsort(importances)[-10:]  # Top 10 features
features_top = X.columns[indices]

plt.figure(figsize=(10, 6))
plt.title("Feature Importance (Random Forest)")
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), features_top)
plt.xlabel("Relative Importance")
plt.show()
5.3 Model Comparison
models = ['Linear Regression', 'Random Forest']
r2_scores = [r2_score(y_test, y_pred_lr), r2_score(y_test, y_pred_rf)]

sns.barplot(x=models, y=r2_scores)
plt.title("Model Accuracy Comparison (R2 Score)")
plt.ylabel("R2 Score")
plt.show()
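Raw MSE values are in squared dollars and hard to interpret. An optional follow-up, sketched below using the predictions from above, is to report the root mean squared error in dollars:

# RMSE expresses the typical prediction error in dollars
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print(f"Linear Regression RMSE: ${rmse_lr:,.0f}")
print(f"Random Forest RMSE: ${rmse_rf:,.0f}")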
6. Conclusion
We successfully built a machine learning model to predict house prices using features like square footage, bedrooms, and bathrooms. Here’s what we learned:
- Linear Regression is simple but may underfit complex relationships.
- Random Forest offers better accuracy and interpretable feature importance.
- Zipcode and sqft_living were among the most influential features.
Future Directions
Want to take this project to the next level? Try these:
- Deploy the Model: Use Flask or Streamlit to create a web app for real-time predictions.
- Use Deep Learning: Implement a neural network regressor using TensorFlow or PyTorch.
- Add More Features: Include columns like view, condition, and grade for better accuracy.
- Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV to optimize models (a starter sketch follows this list).
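As a starting point for the hyperparameter-tuning idea, here is a minimal sketch using scikit-learn's GridSearchCV with the Random Forest from section 5.2. The parameter grid values are illustrative assumptions, not tuned recommendations:

from sklearn.model_selection import GridSearchCV

# Illustrative search space; adjust to your time budget
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,           # 3-fold cross-validation on the training set
    scoring='r2',
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated R2 score:", grid.best_score_)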
Summary
Learn how to build a House Price Prediction model using Python and Machine Learning with this hands-on tutorial. Use real-world data from Kaggle, perform EDA, apply Linear Regression and Random Forest, and visualize results with residual plots and feature importance charts. Ideal for beginners in real estate data science, this guide walks you through every step to accurately estimate home values.