Heart Disease Prediction Project Using Machine Learning

KANGKAN KALITA


Introduction

Heart disease is a leading cause of death worldwide, and early detection can save lives. In this Heart Disease Prediction Project Using Machine Learning, we will build a predictive model to identify individuals at risk of heart disease. This end-to-end project includes data collection, preprocessing, exploratory data analysis, feature engineering, model building, evaluation, and deployment.


Objective

  • Collect and preprocess the heart disease dataset.
  • Perform Exploratory Data Analysis (EDA).
  • Engineer relevant features.
  • Train multiple machine learning models to predict heart disease.
  • Evaluate model accuracy using different metrics.
  • Deploy the best-performing model for real-world use.

Tools & Libraries

We will use the following Python libraries:

  • pandas and numpy for data manipulation.
  • matplotlib and seaborn for visualization.
  • scikit-learn for machine learning models.
  • Flask for model deployment.

1. Data Collection

We will use the Heart Disease UCI dataset, available at: https://archive.ics.uci.edu/ml/datasets/Heart+Disease. A ready-to-use CSV version of the same dataset is also available on Kaggle.

Loading the Dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc

# Load dataset
df = pd.read_csv("heart.csv")  # Ensure the file is in your working directory
print(df.head())
print(df.shape)
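Before going further, it is worth checking how balanced the target classes are, since class imbalance affects which evaluation metrics are meaningful later. A minimal sketch, assuming the target column is named 'Heart Disease' (the tiny DataFrame below is a hypothetical stand-in so the snippet runs on its own; in the project, use the df loaded above):

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset; in the project, use
# df = pd.read_csv("heart.csv") from the step above.
df = pd.DataFrame({"Heart Disease": ["Presence", "Absence", "Absence",
                                     "Presence", "Absence"]})

counts = df["Heart Disease"].value_counts()
print(counts)
print("Positive share:", round(counts.get("Presence", 0) / len(df), 2))
```

If one class dominates, accuracy alone will be misleading and metrics like precision, recall, and ROC AUC (used in the evaluation section) matter more.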

2. Data Exploration

Checking for Missing Values

print(df.info())
print(df.isnull().sum())

Summary Statistics

print(df.describe())

Checking Unique Values in Categorical Columns

for col in df.columns:
    if df[col].dtype == 'object':
        print(f"{col}: {df[col].unique()}")

3. Data Cleaning

Handling Missing Values

df.fillna(df.mean(numeric_only=True), inplace=True)  # Replace missing numeric values with the column mean

Removing Duplicates

df.drop_duplicates(inplace=True)
print(f"Data shape after removing duplicates: {df.shape}")

Checking for Outliers

for col in df.select_dtypes(include=['int64', 'float64']).columns:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot of {col}")
    plt.show()
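The boxplots show outliers visually; to quantify them, a common rule of thumb flags points more than 1.5 × IQR beyond the quartiles. A sketch of that check, on a small hypothetical numeric sample (in the project, run it over df's numeric columns):

```python
import pandas as pd

# Hypothetical numeric sample; in the project, use the df loaded from heart.csv.
df = pd.DataFrame({"Cholesterol": [180, 200, 210, 220, 230, 240, 600]})

def iqr_outlier_count(s):
    # Count values falling outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lo) | (s > hi)).sum())

for col in df.select_dtypes("number").columns:
    print(col, iqr_outlier_count(df[col]))
```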

4. Exploratory Data Analysis (EDA)

Visualizing Data Distributions

plt.figure(figsize=(10,6))
sns.histplot(df['Age'], bins=30, kde=True)
plt.title('Age Distribution')
plt.show()

Checking the Correlation Between Features

plt.figure(figsize=(12,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')  # numeric_only avoids errors on string columns
plt.title('Feature Correlation')
plt.show()
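Heatmaps get hard to read with many features; ranking feature pairs by absolute correlation surfaces the strongest relationships directly. A sketch on a small hypothetical numeric slice (in the project, use df's numeric columns):

```python
import pandas as pd
import numpy as np

# Hypothetical numeric slice; in the project, use df.select_dtypes("number").
df = pd.DataFrame({
    "Age": [40, 50, 60, 70],
    "Max HR": [190, 175, 160, 145],
    "Cholesterol": [200, 230, 210, 260],
})

corr = df.corr(numeric_only=True)
# Keep only the upper triangle (each pair once), then sort by strength
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
print(pairs.abs().sort_values(ascending=False))
```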

Boxplot for Outlier Detection

plt.figure(figsize=(12,6))
sns.boxplot(data=df, orient='h', palette='Set2')
plt.title('Boxplot of Features')
plt.show()

Pairplot to Visualize Relationships Between Variables

sns.pairplot(df, hue='Heart Disease', diag_kind='kde')
plt.show()

Countplot for Categorical Variables

categorical_columns = ['Sex', 'Chest pain type', 'FBS over 120', 'Exercise angina', 'Slope of ST', 'Thallium']
for col in categorical_columns:
    plt.figure(figsize=(6, 4))
    sns.countplot(x=df[col], hue=df['Heart Disease'], palette='Set1')
    plt.title(f'Countplot of {col}')
    plt.show()

5. Feature Engineering

Encoding Categorical Variables

encoder = LabelEncoder()
for col in ['Sex', 'Chest pain type', 'FBS over 120', 'EKG results', 'Exercise angina', 'Slope of ST', 'Number of vessels fluro', 'Thallium']:
    df[col] = encoder.fit_transform(df[col])
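One caveat: LabelEncoder assigns arbitrary integer codes, which implies an ordering that nominal features like chest pain type do not have. Tree models tolerate this, but for linear models and SVMs one-hot encoding is often safer. A hedged side note (the tutorial continues with label encoding; the column values below are hypothetical):

```python
import pandas as pd

# Hypothetical sample of one nominal column from the dataset
df = pd.DataFrame({"Chest pain type": ["typical", "atypical",
                                       "non-anginal", "typical"]})

# One-hot encoding creates a binary column per category,
# avoiding any implied order between them
encoded = pd.get_dummies(df, columns=["Chest pain type"], prefix="cp")
print(encoded.columns.tolist())
```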

Creating Interaction Features

df['Cholesterol_Age_Ratio'] = df['Cholesterol'] / df['Age']
df['MaxHR_Age_Ratio'] = df['Max HR'] / df['Age']

Scaling Features

scaler = StandardScaler()
# Select feature columns explicitly: the engineered ratio columns were appended
# after 'Heart Disease', so df.columns[:-1] would mislabel them
feature_cols = df.drop('Heart Disease', axis=1).columns
scaled_features = scaler.fit_transform(df[feature_cols])
# Reuse df's index so the target aligns correctly after drop_duplicates
df_scaled = pd.DataFrame(scaled_features, columns=feature_cols, index=df.index)
df_scaled['Heart Disease'] = df['Heart Disease']

6. Model Building

Splitting Data

X = df_scaled.drop('Heart Disease', axis=1)
y = df_scaled['Heart Disease']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)  # stratify keeps class proportions in both splits

Training Multiple Models

Logistic Regression

log_model = LogisticRegression(max_iter=1000)  # raise the iteration cap to ensure convergence
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)

Random Forest Classifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

Support Vector Machine (SVM)

svm_model = SVC(probability=True)
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

7. Model Evaluation

Evaluating Accuracy

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
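The imports at the top already bring in classification_report, roc_curve, and auc, so a fuller evaluation than plain accuracy is easy to add. A sketch on synthetic data so the snippet runs standalone (in the project, substitute the fitted rf_model, X_test, and y_test):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_curve, auc

# Synthetic stand-in for the heart disease data
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te)))

# ROC AUC needs probability scores, not hard class labels
probs = rf.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, probs)
print("AUC:", round(auc(fpr, tpr), 3))
```

Precision and recall matter here more than raw accuracy: in a medical screening setting, missing a true positive (low recall) is usually the costlier error.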

Confusion Matrix

sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Blues')
plt.title("Random Forest Confusion Matrix")
plt.show()
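The deployment step below expects heart_disease_model.pkl and scaler.pkl on disk, so the chosen model and the fitted scaler need to be pickled first. A sketch, using synthetic stand-ins for the rf_model and scaler fitted earlier so it runs on its own:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins; in the project, pickle the rf_model and scaler from above
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
scaler = StandardScaler().fit(X)
model = RandomForestClassifier(random_state=42).fit(scaler.transform(X), y)

with open("heart_disease_model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

# Reload to confirm the round-trip works
with open("heart_disease_model.pkl", "rb") as f:
    reloaded = pickle.load(f)
print(type(reloaded).__name__)
```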

8. Model Deployment

We will deploy the model using Flask.

Creating app.py

from flask import Flask, request, jsonify
import pickle
import numpy as np

# Load the trained model and scaler saved after training
with open("heart_disease_model.pkl", "rb") as f:
    model = pickle.load(f)
with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    input_features = np.array(data['features']).reshape(1, -1)
    scaled_input = scaler.transform(input_features)
    prediction = model.predict(scaled_input)
    return jsonify({'prediction': int(prediction[0])})

if __name__ == '__main__':
    app.run(debug=True)
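Once the app is running, the /predict endpoint can be exercised with a JSON payload. The feature values below are purely illustrative placeholders; the list must match the training columns in order, including the two engineered ratio features:

```python
import json

# Illustrative placeholder values only -- one entry per training column,
# including the two engineered ratio features added during feature engineering
payload = {"features": [70, 1, 4, 130, 322, 0, 2, 109, 0, 2.4, 2, 3, 3,
                        4.6, 1.56]}

# Send it to the running Flask app, e.g. with curl:
#   curl -X POST http://127.0.0.1:5000/predict \
#        -H "Content-Type: application/json" \
#        -d '{"features": [70, 1, 4, ...]}'
body = json.dumps(payload)
print(body)
```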

Conclusion

This Heart Disease Prediction Project Using Machine Learning covers data preprocessing, EDA, feature engineering, model training, evaluation, and deployment. You can extend it with hyperparameter tuning or deep learning models such as feed-forward neural networks for potentially better accuracy.

Want to extend this project? Let us know in the comments!
