Heart Disease Prediction Using Machine Learning: End-to-End Project

KANGKAN KALITA

Heart disease is one of the leading causes of death worldwide. Early prediction of heart disease can save lives by enabling timely medical intervention. In this project, we build a machine learning model that classifies whether a patient is at risk of heart disease based on medical attributes. This end-to-end guide covers data collection, preprocessing, exploratory data analysis (EDA), visualization, feature engineering, model building, and evaluation.

Objective:

  • Predict the risk of heart disease using machine learning classification models.
  • Perform EDA to understand the data and identify patterns.
  • Implement feature engineering and preprocess the data for better model performance.
  • Evaluate multiple machine learning models and select the best-performing one.

Dataset:
The dataset can be downloaded from Kaggle. It contains medical information such as age, cholesterol levels, blood pressure, and other relevant features. The target variable indicates whether a patient has heart disease.

Tools & Libraries:

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Scikit-learn
  • XGBoost
  • Jupyter Notebook or Google Colab (Recommended)

1. Data Collection & Setup

Import Libraries and Load Dataset

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
data = pd.read_csv('/path/to/heart_disease.csv')  # Replace with the actual file path

# Preview the dataset
data.head()

Explanation:

  1. The dataset is loaded using pandas for easy data manipulation.
  2. Previewing the dataset helps understand its structure, including column names and sample values.
  3. Understanding the dataset layout ensures we know the input features and target variable.
  4. Medical data, such as age, cholesterol, and blood pressure, are included as features.
  5. The head() function displays the first few rows to confirm the dataset is loaded correctly.
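
Beyond eyeballing head(), a few programmatic checks can confirm that the expected columns are present and that the target is binary. The sketch below uses a tiny hand-made DataFrame as a stand-in for the Kaggle file. The column names (Age, Sex, Cholesterol, HeartDisease) are assumptions borrowed from later sections of this guide, and the rows are invented purely for illustration.

```python
import pandas as pd

# Tiny stand-in for the real CSV; the values are made up for illustration only.
sample = pd.DataFrame({
    "Age": [54, 61, 45, 39],
    "Sex": ["M", "F", "M", "F"],
    "Cholesterol": [230, 289, 204, 180],
    "HeartDisease": [1, 1, 0, 0],
})

# Column names we expect to find (assumed; adjust to the actual dataset).
expected_cols = {"Age", "Sex", "Cholesterol", "HeartDisease"}
missing = expected_cols - set(sample.columns)

# A binary target should contain exactly the values 0 and 1.
target_values = set(sample["HeartDisease"].unique())

print("Shape:", sample.shape)
print("Missing columns:", missing or "none")
print("Target values:", target_values)
```

With the real file, replace `sample` with the DataFrame returned by pd.read_csv and adjust `expected_cols` to match the actual header.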

2. Exploratory Data Analysis (EDA)

Dataset Overview

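# Display dataset information
data.info()                  # column types and non-null counts
print(data.describe())       # summary statistics for numeric columns
print(data.isnull().sum())   # missing values per column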

Target Variable Distribution

# Visualize the distribution of the target variable
target = 'HeartDisease'  # Replace with the actual target column
sns.countplot(x=target, data=data, palette='coolwarm')
plt.title('Heart Disease Distribution')
plt.xlabel('Heart Disease (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

Feature Correlation Analysis

# Correlation heatmap
plt.figure(figsize=(12, 8))
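# numeric_only=True skips text columns, which have not been encoded yet at this stage
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')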
plt.title('Feature Correlation Heatmap')
plt.show()

Explanation:

  1. EDA identifies missing values, outliers, and distributions of features.
  2. info() and describe() provide an overview of data types, non-null values, and statistical summaries.
  3. The target variable distribution is visualized to check class imbalance.
  4. A correlation heatmap reveals relationships between features, helping us understand which attributes are highly correlated with the target.
  5. Insights gained from EDA guide data cleaning and feature selection.
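
The class-imbalance and correlation checks described above can also be read off as numbers rather than plots. The following is a minimal sketch on invented rows; the column names are assumptions, and the real dataset will give different figures.

```python
import pandas as pd

# Invented rows standing in for the real dataset.
df = pd.DataFrame({
    "Age": [40, 45, 50, 55, 60, 65, 70, 75],
    "Cholesterol": [180, 240, 200, 260, 210, 250, 220, 270],
    "Sex": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "HeartDisease": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Share of each class; a heavily skewed split would suggest
# stratified sampling or class weights later on.
class_share = df["HeartDisease"].value_counts(normalize=True)

# Correlation of each numeric feature with the target, strongest first.
target_corr = (
    df.corr(numeric_only=True)["HeartDisease"]
    .drop("HeartDisease")
    .sort_values(key=abs, ascending=False)
)

print(class_share)
print(target_corr)
```

Correlation only captures linear relationships, so a low value here does not rule a feature out.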

3. Data Preprocessing & Feature Engineering

Handle Missing Values

# Fill missing values if any
columns_with_na = ['col1', 'col2']  # Replace with actual column names
data[columns_with_na] = data[columns_with_na].fillna(data[columns_with_na].median())

Encoding Categorical Variables

# Encode categorical variables
# One encoder per column, so each mapping can be inspected or inverted later
categorical_cols = ['Sex', 'ChestPainType', 'RestingECG']  # Replace with actual column names
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])

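Train-Test Split

# Split dataset into features and target
X = data.drop(columns=[target])
y = data[target]

# Train-test split (done before scaling so the test set stays unseen)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Feature Scaling

# Scale numerical features: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
numerical_cols = ['Age', 'Cholesterol', 'RestingBP']  # Replace with actual columns
X_train, X_test = X_train.copy(), X_test.copy()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])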

Explanation:

  1. Missing values are filled using the median, ensuring no empty entries disrupt the analysis.
  2. Categorical variables are converted into numerical formats for machine learning models.
  3. Numerical features are scaled to standardize their ranges, which helps scale-sensitive models such as Logistic Regression; tree-based models are largely unaffected.
  4. The dataset is split into training and testing sets for model evaluation.
  5. Preprocessing ensures data quality and consistency for accurate predictions.
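
The same preprocessing steps can be bundled into a single scikit-learn Pipeline, so that imputation, encoding, and scaling are learned from the training split only. The sketch below is one possible arrangement, not the method used above: it swaps in one-hot encoding for label encoding, runs on invented rows, and assumes the column names shown.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows standing in for the real dataset (one missing cholesterol value).
df = pd.DataFrame({
    "Age": [40, 45, 50, 55, 60, 65, 70, 75, 42, 58, 63, 48],
    "Cholesterol": [180, 240, None, 260, 210, 250, 220, 270, 190, 245, 255, 205],
    "Sex": ["M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F"],
    "HeartDisease": [0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0],
})

X = df.drop(columns=["HeartDisease"])
y = df["HeartDisease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Numeric columns: median imputation followed by standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column group to its own transformer.
preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Cholesterol"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns medians, scaling statistics, and categories from X_train only.
model.fit(X_train, y_train)
test_pred = model.predict(X_test)
print("Held-out accuracy:", model.score(X_test, y_test))
```

Because every step lives inside the pipeline, the same object can be passed to cross-validation or a grid search without leaking test-set statistics into training.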

4. Data Visualization

Heart Disease vs. Age

sns.boxplot(x='HeartDisease', y='Age', data=data, palette='Set2')
plt.title('Heart Disease Distribution by Age')
plt.xlabel('Heart Disease (0 = No, 1 = Yes)')
plt.ylabel('Age')
plt.show()

Cholesterol Levels Distribution

sns.histplot(data['Cholesterol'], kde=True, color='green')
plt.title('Cholesterol Levels Distribution')
plt.xlabel('Cholesterol')
plt.ylabel('Count')  # histplot shows counts by default; the KDE curve is scaled to match
plt.show()

Explanation:

  1. Boxplots reveal how age varies between patients with and without heart disease.
  2. The cholesterol histogram shows the spread and skew of the values and makes unusual or implausible entries easy to spot.
  3. Comparing a feature's distribution across the two classes hints at how useful it may be for prediction.
  4. These visual trends complement, but do not replace, model-based feature importance.
  5. Clear visuals simplify the communication of findings to stakeholders.
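
The comparison a boxplot draws can also be expressed as a small table of quartiles per class. This sketch uses invented ages and assumes the Age and HeartDisease column names.

```python
import pandas as pd

# Invented rows standing in for the real dataset.
df = pd.DataFrame({
    "Age": [40, 45, 50, 55, 60, 65, 70, 75],
    "HeartDisease": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Quartiles of age within each class: the same numbers a boxplot draws.
age_summary = df.groupby("HeartDisease")["Age"].describe()[["25%", "50%", "75%"]]
print(age_summary)
```

A table like this is handy when the figures need to be quoted in a report alongside the plot.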

5. Model Building & Evaluation

Logistic Regression

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)  # higher iteration cap to avoid convergence warnings
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

XGBoost Classifier

from xgboost import XGBClassifier

xgb_model = XGBClassifier(random_state=42)  # fixed seed for reproducible results
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))

Explanation:

  1. Logistic Regression serves as a baseline model for comparison.
  2. Random Forest is implemented to leverage ensemble learning for better accuracy.
  3. XGBoost, a boosting algorithm, is used to optimize predictions further.
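  4. Accuracy and the classification report (precision, recall, and F1-score per class) summarize model performance; for a screening task, recall on the positive class deserves particular attention because a missed case is costly.
  5. Comparing several models on the same test set shows whether the added complexity of ensembles actually pays off over the baseline.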
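
A single train-test split can make one model look better than another by chance. One way to get a steadier comparison is k-fold cross-validation, sketched below on synthetic data from make_classification. The data, the fold count, and the two candidate models are illustrative assumptions; XGBoost is left out only to keep the example dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with the real features and target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean and standard deviation of accuracy across the folds for each model.
cv_results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between two models is smaller than the spread across folds, the ranking should be treated as inconclusive.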

6. Visualization of Model Performance

Compare Model Accuracies

model_accuracies = {
    "Logistic Regression": accuracy_score(y_test, y_pred_logreg),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
    "XGBoost": accuracy_score(y_test, y_pred_xgb),
}

plt.figure(figsize=(10, 6))
plt.bar(list(model_accuracies.keys()), list(model_accuracies.values()), color='skyblue')
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.show()

Explanation:

  1. A bar chart compares the accuracies of all implemented models.
  2. Visual performance comparisons simplify model selection.
  3. Charts help communicate results to non-technical audiences.
  4. Accuracy is only one criterion; the model chosen for real-world use should also be checked on recall and precision.
  5. The comparison points to where further tuning is most likely to pay off.
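
Accuracy bars alone hide how a model errs. As a complement, the sketch below computes a confusion matrix, recall, and ROC AUC for one model on synthetic data; the data and the choice of Logistic Regression are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the real train/test split.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]  # predicted probability of class 1

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, pred)
tn, fp, fn, tp = cm.ravel()

recall = recall_score(y_te, pred)   # share of true positives that were caught
auc = roc_auc_score(y_te, proba)    # ranking quality across all thresholds

print(cm)
print(f"Recall: {recall:.3f}  ROC AUC: {auc:.3f}  Missed cases (FN): {fn}")
```

In a screening setting, the false-negative count is often the number to watch, since each one is a patient the model failed to flag.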

7. Conclusion

This project demonstrates how to preprocess data, explore features, and build classification models for heart disease prediction. Among the models tested, the XGBoost classifier achieved the highest accuracy. Healthcare providers and businesses can use a model like this to identify at-risk patients and take preventive action.

Download the dataset and experiment with hyperparameter tuning or advanced deep learning models to further improve predictions. Deploy this project using Streamlit or Flask for real-world applications.
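
As a starting point for the hyperparameter tuning mentioned above, here is a minimal GridSearchCV sketch. It runs on synthetic data, and the parameter grid is a deliberately small, illustrative assumption rather than a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with the real training set.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=42
)

# Small illustrative grid: 2 x 2 = 4 parameter combinations.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

After the search, `search.best_estimator_` holds a model refit on all of the supplied data with the winning settings, ready to be evaluated on a held-out test set.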

Keywords: Heart disease prediction using machine learning, Classification models, Data visualization, Medical data analysis, Python project.
