Heart Disease Prediction Project Using Machine Learning
Introduction
Heart disease is a leading cause of death worldwide, and early detection can save lives. In this Heart Disease Prediction Project Using Machine Learning, we will build a predictive model to identify individuals at risk of heart disease. This end-to-end project includes data collection, preprocessing, exploratory data analysis, feature engineering, model building, evaluation, and deployment.

Objective
- Collect and preprocess heart disease dataset.
- Perform Exploratory Data Analysis (EDA).
- Engineer relevant features.
- Train multiple machine learning models to predict heart disease.
- Evaluate model accuracy using different metrics.
- Deploy the best-performing model for real-world use.
Tools & Libraries
We will use the following Python libraries:
- pandas and numpy for data manipulation
- matplotlib and seaborn for visualization
- scikit-learn for machine learning models
- Flask for model deployment
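All of these can be installed in one step with pip (a typical setup; versions are not pinned here):

```shell
pip install pandas numpy matplotlib seaborn scikit-learn flask
```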
1. Data Collection
We will use the Heart Disease UCI dataset, available at: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Alternatively, it can be downloaded from Kaggle.
Loading the Dataset
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc

# Load dataset
df = pd.read_csv("heart.csv")  # Ensure the file is in your working directory
print(df.head())
print(df.shape)
```
2. Data Exploration
Checking for Missing Values
```python
df.info()  # info() prints directly, so no print() wrapper is needed
print(df.isnull().sum())
```
Summary Statistics
```python
print(df.describe())
```
Checking Unique Values in Categorical Columns
```python
for col in df.columns:
    if df[col].dtype == 'object':
        print(f"{col}: {df[col].unique()}")
```
3. Data Cleaning
Handling Missing Values
```python
# numeric_only=True avoids an error when the DataFrame contains text columns
df.fillna(df.mean(numeric_only=True), inplace=True)  # Replace missing numeric values with the column mean
```
Removing Duplicates
```python
df.drop_duplicates(inplace=True)
print(f"Data shape after removing duplicates: {df.shape}")
```
Checking for Outliers
```python
for col in df.select_dtypes(include=['int64', 'float64']).columns:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot of {col}")
    plt.show()
```
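The boxplots only reveal outliers; if you also want to handle them, one common approach is capping with the interquartile range (IQR) rule. A minimal sketch on toy data (the Cholesterol values here are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy data with one extreme Cholesterol value
df_toy = pd.DataFrame({"Cholesterol": [180, 200, 220, 600, 190]})

q1, q3 = df_toy["Cholesterol"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip values outside the IQR fences instead of dropping rows
df_toy["Cholesterol"] = df_toy["Cholesterol"].clip(lower, upper)
print(df_toy["Cholesterol"].tolist())  # the 600 outlier is capped at the upper fence
```

Clipping keeps the row (and its label) while limiting the influence of the extreme value, which matters on a small medical dataset.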
4. Exploratory Data Analysis (EDA)
Visualizing Data Distributions
```python
plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], bins=30, kde=True)
plt.title('Age Distribution')
plt.show()
```
Checking the Correlation Between Features
```python
plt.figure(figsize=(12, 8))
# numeric_only=True skips any text columns that are not yet encoded
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Feature Correlation')
plt.show()
```
Boxplot for Outlier Detection
```python
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, orient='h', palette='Set2')
plt.title('Boxplot of Features')
plt.show()
```
Pairplot to Visualize Relationships Between Variables
```python
sns.pairplot(df, hue='Heart Disease', diag_kind='kde')
plt.show()
```
Countplot for Categorical Variables
```python
categorical_columns = ['Sex', 'Chest pain type', 'FBS over 120',
                       'Exercise angina', 'Slope of ST', 'Thallium']
for col in categorical_columns:
    plt.figure(figsize=(6, 4))
    sns.countplot(x=df[col], hue=df['Heart Disease'], palette='Set1')
    plt.title(f'Countplot of {col}')
    plt.show()
```
5. Feature Engineering
Encoding Categorical Variables
```python
encoder = LabelEncoder()
for col in ['Sex', 'Chest pain type', 'FBS over 120', 'EKG results',
            'Exercise angina', 'Slope of ST', 'Number of vessels fluro', 'Thallium']:
    df[col] = encoder.fit_transform(df[col])
```
Creating Interaction Features
```python
df['Cholesterol_Age_Ratio'] = df['Cholesterol'] / df['Age']
df['MaxHR_Age_Ratio'] = df['Max HR'] / df['Age']
```
Scaling Features
```python
scaler = StandardScaler()
features = df.drop('Heart Disease', axis=1)
scaled_features = scaler.fit_transform(features)
# Use the feature columns explicitly: after adding the interaction features,
# 'Heart Disease' is no longer the last column, so df.columns[:-1] would be wrong
df_scaled = pd.DataFrame(scaled_features, columns=features.columns)
# .values avoids index misalignment after drop_duplicates removed rows
df_scaled['Heart Disease'] = df['Heart Disease'].values
```
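One caveat: fitting the scaler on the full dataset lets test-set statistics leak into training. A stricter variant, sketched here on random stand-in data, splits first and fits the scaler on the training portion only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(42).rand(100, 5)  # stand-in for the feature matrix

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)    # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics
```

The same fitted scaler is what you would later pickle for deployment, so it only ever sees training data.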
6. Model Building
Splitting Data
```python
X = df_scaled.drop('Heart Disease', axis=1)
y = df_scaled['Heart Disease']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Training Multiple Models
Logistic Regression
```python
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)
```
Random Forest Classifier
```python
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
```
Support Vector Machine (SVM)
```python
svm_model = SVC(probability=True)
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)
```
7. Model Evaluation
Evaluating Accuracy
```python
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
```
Confusion Matrix
```python
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Blues')
plt.title("Random Forest Confusion Matrix")
plt.show()
```
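Since roc_curve and auc are already imported, the ROC curve is a natural additional metric. A self-contained sketch on synthetic data from make_classification (for the actual project, swap in rf_model, X_test, and y_test from the sections above):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for saving to file; drop this line to display the plot
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# ROC needs class probabilities, not hard 0/1 predictions
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"Random Forest (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.savefig("roc_curve.png")
```

Unlike plain accuracy, the AUC summarizes performance across all classification thresholds, which is useful when the two classes are imbalanced.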
8. Model Deployment
We will deploy the model using Flask.
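The app below loads heart_disease_model.pkl and scaler.pkl, so the trained model and fitted scaler must be serialized first. A minimal sketch, shown here with toy stand-ins (in the project itself you would dump the rf_model and scaler from the earlier sections):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the trained rf_model and fitted scaler
X, y = make_classification(n_samples=200, random_state=42)
scaler = StandardScaler().fit(X)
rf_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(scaler.transform(X), y)

# Serialize both objects so app.py can load them at startup
with open("heart_disease_model.pkl", "wb") as f:
    pickle.dump(rf_model, f)
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
```

Pickling the scaler alongside the model matters: requests at prediction time must be transformed with the same training statistics.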
Creating app.py
```python
from flask import Flask, request, jsonify
import pickle
import numpy as np

# Load trained model and scaler
model = pickle.load(open("heart_disease_model.pkl", "rb"))
scaler = pickle.load(open("scaler.pkl", "rb"))

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    input_features = np.array(data['features']).reshape(1, -1)
    scaled_input = scaler.transform(input_features)
    prediction = model.predict(scaled_input)
    return jsonify({'prediction': int(prediction[0])})

if __name__ == '__main__':
    app.run(debug=True)
```
Conclusion
This Heart Disease Prediction Project Using Machine Learning covers data preprocessing, EDA, feature engineering, model training, evaluation, and deployment. You can extend it with deep learning models such as artificial neural networks for potentially better accuracy.
Want to extend this project? Let us know in the comments!
Latest Posts:
- Top 10 Data Analysis Techniques for Beginners [2025 Guide to Get Started Fast]
- How to Build a Powerful Data Scientist Portfolio as a Beginner [Step-by-Step 2025 Guide]
- Hypothesis Testing in Machine Learning Using Python: A Complete Beginner’s Guide [2025]
- Netflix Data Analysis with Python: Beginner-Friendly Project with Code & Insights
- 15 Best Machine Learning Projects for Your Resume That Will Impress Recruiters [2025 Guide]