Predicting House Prices using Machine Learning

KANGKAN KALITA


Buying or selling a house is one of the most significant financial decisions individuals make. Accurately predicting house prices helps real estate agents, developers, and buyers make data-driven decisions. With the rise of Machine Learning, we can now build predictive models that analyze features like square footage, number of bedrooms, bathrooms, and location to estimate property prices.

In this tutorial, we’ll build a House Price Prediction project using regression models in Python. We’ll use a real-world dataset from Kaggle, the King County House Sales Dataset, which contains information on homes sold in King County, Washington (USA), including Seattle.

1. Dataset Overview

The dataset contains 21,613 records and 21 columns (the target plus identifiers and predictors), including:

  • price: Target variable (house price)
  • sqft_living: Square footage of the living space
  • bedrooms: Number of bedrooms
  • bathrooms: Number of bathrooms
  • zipcode: Location identifier

2. Data Loading and Preprocessing

2.1 Import Libraries and Load Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('kc_house_data.csv')  # Adjust the path to wherever you saved the CSV
df.head()

2.2 Check for Missing Values

df.info()
df.isnull().sum()  # No missing values in this dataset

2.3 Drop Irrelevant Features

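We’ll drop id (a record identifier with no predictive value), date (stored as text), and lat/long (raw coordinates) for simplicity. Location is still represented by zipcode.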

df.drop(['id', 'date', 'lat', 'long'], axis=1, inplace=True)

3. Exploratory Data Analysis (EDA)

3.1 Price Distribution

sns.histplot(df['price'], kde=True)
plt.title("House Price Distribution")
plt.xlabel("Price")
plt.show()
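
House prices are often right-skewed, with a long tail of expensive homes. If the histogram above shows that shape, a log transform is one common way to make the distribution more symmetric. The sketch below uses synthetic lognormal prices as a stand-in for df['price']; the numbers are illustrative assumptions, not values from the Kaggle data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices standing in for df['price'] (assumption)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000), name="price")

# log1p compresses the long right tail
log_prices = np.log1p(prices)

skew_raw = prices.skew()
skew_log = log_prices.skew()
print(f"Skewness of raw prices:  {skew_raw:.2f}")
print(f"Skewness after log1p:    {skew_log:.2f}")
```

If you train on the log of price, remember to convert predictions back with np.expm1 before reporting them in dollars.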

3.2 Correlation Heatmap

plt.figure(figsize=(14,10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()

3.3 Scatter Plot: Square Footage vs Price

sns.scatterplot(x='sqft_living', y='price', data=df)
plt.title("Price vs. Living Area")
plt.xlabel("Square Feet (Living)")
plt.ylabel("Price")
plt.show()

4. Feature Selection and Splitting the Data

4.1 Selecting Important Features

features = ['sqft_living', 'bedrooms', 'bathrooms', 'floors', 'zipcode']
X = df[features]
y = df['price']

4.2 One-Hot Encode Zipcode

X = pd.get_dummies(X, columns=['zipcode'], drop_first=True)
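
To see what this encoding does, here is a minimal sketch on a four-row toy frame. The values are made up for illustration. Each distinct zipcode becomes its own indicator column, and drop_first=True removes the first category so that it acts as the baseline.

```python
import pandas as pd

# Toy frame with illustrative values (not taken from the real dataset)
toy = pd.DataFrame({
    "sqft_living": [1180, 2570, 770, 1960],
    "zipcode": [98178, 98125, 98028, 98125],
})

# One indicator column per zipcode, minus the first (sorted) category
encoded = pd.get_dummies(toy, columns=["zipcode"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

The row with the dropped zipcode has zeros in every indicator column, which is how the model identifies the baseline category.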

4.3 Train-Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
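
As a quick sanity check on the split, the sketch below applies the same call to a placeholder array with 21,613 rows (the record count from the dataset overview) and prints the resulting sizes. The placeholder arrays are an assumption made so the example runs without the CSV.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (assumption)
X_demo = np.zeros((21613, 1))
y_demo = np.arange(21613)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print("Train rows:", len(X_tr))
print("Test rows: ", len(X_te))

# A fixed random_state makes the split reproducible
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print("Same test rows on a second call:", np.array_equal(y_te, y_te_again))
```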

5. Model Training and Evaluation

5.1 Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Linear Regression R2 Score:", r2_score(y_test, y_pred_lr))
print("MSE:", mean_squared_error(y_test, y_pred_lr))
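
MSE is in squared dollars, which is hard to interpret. Its square root (RMSE) is in the same units as price. The sketch below computes MSE, RMSE, and R2 by hand on four made-up prices and checks the R2 value against scikit-learn. The numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices (illustrative only)
y_true = np.array([300000.0, 450000.0, 520000.0, 610000.0])
y_hat = np.array([320000.0, 430000.0, 500000.0, 650000.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # same units as price

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R2 (manual):  {r2_manual:.4f}")
print(f"R2 (sklearn): {r2_score(y_true, y_hat):.4f}")
```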

Residual Plot

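# Residuals are actual minus predicted; plot them against the predictions
residuals = y_test - y_pred_lr
sns.scatterplot(x=y_pred_lr, y=residuals, alpha=0.3, color='red')
plt.axhline(0, color='black', linestyle='--')
plt.xlabel('Predicted Price')
plt.ylabel('Residuals (Actual - Predicted)')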
plt.title('Linear Regression Residual Plot')
plt.show()

5.2 Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest R2 Score:", r2_score(y_test, y_pred_rf))
print("MSE:", mean_squared_error(y_test, y_pred_rf))

Feature Importance Plot

importances = rf.feature_importances_
indices = np.argsort(importances)[-10:]  # Top 10 features
features_top = X.columns[indices]

plt.figure(figsize=(10,6))
plt.title("Feature Importance (Random Forest)")
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), features_top)
plt.xlabel("Relative Importance")
plt.show()
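
Because zipcode is one-hot encoded, its importance is spread across many indicator columns, so a top-10 chart can understate it. One way to get a single number per original feature is to sum the importances of all zipcode_* columns. The sketch below does this on synthetic data. The frame, the price formula, and the names demo and rf_demo are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: price depends on living area and a per-zipcode premium
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "sqft_living": rng.integers(500, 4000, size=n),
    "bedrooms": rng.integers(1, 6, size=n),
    "zipcode": rng.choice([98001, 98004, 98039], size=n),
})
premium = demo["zipcode"].map({98001: 0, 98004: 300000, 98039: 600000})
price = 200 * demo["sqft_living"] + premium + rng.normal(0, 20000, size=n)

X_demo = pd.get_dummies(demo, columns=["zipcode"], drop_first=True)
rf_demo = RandomForestRegressor(n_estimators=50, random_state=42)
rf_demo.fit(X_demo, price)

# Sum the importances of all zipcode_* columns into one "zipcode" entry
imp = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
grouped = imp.groupby(
    lambda col: "zipcode" if col.startswith("zipcode_") else col
).sum().sort_values(ascending=False)
print(grouped)
```

The same grouping can be applied to rf.feature_importances_ and X.columns from the tutorial's model.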

5.3 Model Comparison

models = ['Linear Regression', 'Random Forest']
r2_scores = [r2_score(y_test, y_pred_lr), r2_score(y_test, y_pred_rf)]

sns.barplot(x=models, y=r2_scores)
plt.title("Model Comparison (R2 Score)")
plt.ylabel("R2 Score")
plt.show()
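
A single train-test split can favor one model by chance. Cross-validation gives a more stable comparison by averaging R2 over several folds. The sketch below compares the two model types with 5-fold cross-validation on synthetic data. The data-generating formula is an assumption chosen only so the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a mildly non-linear relationship (assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 4000, size=(300, 1))
y_demo = 150 * X_demo[:, 0] + 0.05 * X_demo[:, 0] ** 2 + rng.normal(0, 30000, 300)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

cv_results = {}
for name, model in candidates.items():
    # 5-fold cross-validated R2 scores
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

On the real data, pass X and y from section 4 instead of the synthetic arrays.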

6. Conclusion

We built two regression models to predict house prices using features like square footage, bedrooms, bathrooms, and zipcode. Here’s what we learned:

  • Linear Regression is simple and fast but may underfit non-linear relationships between the features and price.
  • Random Forest can capture non-linear effects and feature interactions, and its feature importances help show which inputs drive the predictions.
  • Zipcode and sqft_living were among the most influential features.

Future Directions

Want to take this project to the next level? Try these:

  1. Deploy the Model:
    • Use Flask or Streamlit to create a web app for real-time predictions.
  2. Use Deep Learning:
    • Implement a Neural Network Regressor using TensorFlow or PyTorch.
  3. Add More Features:
    • Include columns like view, condition, and grade for better accuracy.
  4. Hyperparameter Tuning:
    • Use GridSearchCV or RandomizedSearchCV to optimize models.
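
As a rough sketch of the hyperparameter tuning idea in item 4, the example below runs GridSearchCV over a small Random Forest grid. The synthetic data and the grid values are assumptions chosen to keep the example quick. A real search on the King County data would use X_train and y_train and likely a wider grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (assumption, for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 3 * X_demo[:, 0] + np.sin(6 * X_demo[:, 1]) + rng.normal(0, 0.1, 200)

# Illustrative grid; these values are not tuned recommendations
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R2: {search.best_score_:.3f}")
```

After fitting, search.best_estimator_ holds a model refit on all the data passed to fit, ready for a final check on the held-out test set.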

