Heart Disease Prediction Using Machine Learning: End-to-End Project

Heart disease is one of the leading causes of death worldwide. Early prediction of heart disease can save lives by enabling timely medical intervention. In this project, we build a machine learning model that classifies whether a patient is at risk of heart disease based on medical attributes. This end-to-end guide covers data collection, preprocessing, exploratory data analysis (EDA), visualization, feature engineering, model building, and evaluation.
Objective:
- Predict the risk of heart disease using machine learning classification models.
- Perform EDA to understand the data and identify patterns.
- Implement feature engineering and preprocess the data for better model performance.
- Evaluate multiple machine learning models and select the best-performing one.
Dataset:
The dataset can be downloaded from Kaggle. It contains medical information such as age, cholesterol levels, blood pressure, and other relevant features. The target variable indicates whether a patient has heart disease.
Tools & Libraries:
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
- XGBoost
- Jupyter Notebook or Google Colab (Recommended)
1. Data Collection & Setup
Import Libraries and Load Dataset
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
data = pd.read_csv('/path/to/heart_disease.csv')  # Replace with the actual file path

# Preview the dataset
data.head()
Explanation:
- The dataset is loaded using pandas for easy data manipulation.
- Previewing the dataset helps understand its structure, including column names and data types.
- Understanding the dataset layout ensures we know the input features and target variable.
- Medical data, such as age, cholesterol, and blood pressure, are included as features.
- The head() function displays the first few rows to confirm the dataset is loaded correctly.
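Beyond eyeballing the first rows, a quick programmatic check can confirm that the expected columns are present. The following is a minimal sketch, not part of the original project: the in-memory sample, the column names, and the check_columns helper are all hypothetical stand-ins for the real Kaggle file.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "Age,Sex,Cholesterol,RestingBP,HeartDisease\n"
    "54,M,239,140,1\n"
    "41,F,204,130,0\n"
    "63,M,,150,1\n"
)
data = pd.read_csv(sample_csv)


def check_columns(df, required):
    """Return the required column names that are missing from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Column names here are assumptions; replace them with those of your dataset.
missing = check_columns(data, ["Age", "Cholesterol", "HeartDisease"])
print("Missing columns:", missing)
print("Shape:", data.shape)
print(data.dtypes)
```

If the list of missing columns is not empty, the column names used in later steps need to be adjusted before continuing.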
2. Exploratory Data Analysis (EDA)
Dataset Overview
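# Display dataset information
data.info()

# print() ensures every summary is shown, not only the last expression in a notebook cell
print(data.describe())
print(data.isnull().sum())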
Target Variable Distribution
# Visualize the distribution of the target variable
target = 'HeartDisease'  # Replace with the actual target column
sns.countplot(x=target, data=data, palette='coolwarm')
plt.title('Heart Disease Distribution')
plt.xlabel('Heart Disease (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()
Feature Correlation Analysis
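# Correlation heatmap
plt.figure(figsize=(12, 8))
# numeric_only=True skips text columns, which would otherwise raise an error in recent pandas versions
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()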
Explanation:
- EDA identifies missing values, outliers, and distributions of features.
- info() and describe() provide an overview of data types, non-null values, and statistical summaries.
- The target variable distribution is visualized to check class imbalance.
- A correlation heatmap reveals relationships between features, helping us understand which attributes are highly correlated with the target.
- Insights gained from EDA guide data cleaning and feature selection.
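The class balance and the correlations with the target can also be read off as numbers rather than from the plots. The sketch below uses synthetic data generated on the spot; the column names and the relationship between Age and the target are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (all values are made up).
rng = np.random.default_rng(42)
n = 200
age = rng.integers(30, 80, size=n)
chol = rng.normal(240, 40, size=n)
# Assumption for illustration only: the target is loosely tied to age.
target_vals = (age + rng.normal(0, 10, size=n) > 55).astype(int)
data = pd.DataFrame(
    {"Age": age, "Cholesterol": chol, "HeartDisease": target_vals}
)

# Share of each class: a strongly skewed split signals class imbalance.
class_share = data["HeartDisease"].value_counts(normalize=True)
print(class_share)

# Correlation of each numeric feature with the target, strongest first.
corr_with_target = (
    data.corr(numeric_only=True)["HeartDisease"]
    .drop("HeartDisease")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strongly negative correlations near the top, where they would otherwise be easy to miss.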
3. Data Preprocessing & Feature Engineering
Handle Missing Values
# Fill missing values if any
columns_with_na = ['col1', 'col2']  # Replace with actual column names
data[columns_with_na] = data[columns_with_na].fillna(data[columns_with_na].median())
Encoding Categorical Variables
# Encode categorical variables
encoder = LabelEncoder()
categorical_cols = ['Sex', 'ChestPainType', 'RestingECG']  # Replace with actual column names
for col in categorical_cols:
    data[col] = encoder.fit_transform(data[col])
Feature Scaling
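# Scale numerical features
# Note: fitting the scaler on the full dataset lets test-set statistics influence training;
# for a stricter evaluation, fit it on the training split only.
scaler = StandardScaler()
numerical_cols = ['Age', 'Cholesterol', 'RestingBP']  # Replace with actual columns
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])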
Train-Test Split
# Split dataset into features and target
X = data.drop(columns=[target])
y = data[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Explanation:
- Missing values are filled using the median, ensuring no empty entries disrupt the analysis.
- Categorical variables are converted into numerical formats for machine learning models.
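- Numerical features are standardized to zero mean and unit variance, which helps scale-sensitive models such as Logistic Regression; tree-based models like Random Forest and XGBoost are largely unaffected.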
- The dataset is split into training and testing sets for model evaluation.
- Preprocessing ensures data quality and consistency for accurate predictions.
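The preprocessing steps above can also be bundled into a scikit-learn Pipeline so that imputation and scaling are learned from the training split only. This is a sketch under stated assumptions, not the original project's code: the data is synthetic, the column names are placeholders, and OneHotEncoder is swapped in for LabelEncoder because it does not impose an ordering on the categories.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real dataset (all values are made up).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Age": rng.integers(30, 80, size=n).astype(float),
    "Cholesterol": rng.normal(240, 40, size=n),
    "Sex": rng.choice(["M", "F"], size=n),
})
df["HeartDisease"] = (df["Age"] + rng.normal(0, 8, size=n) > 55).astype(int)
df.loc[::25, "Cholesterol"] = np.nan  # inject a few missing values

X = df.drop(columns=["HeartDisease"])
y = df["HeartDisease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Numeric columns: median imputation followed by standardization.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["Age", "Cholesterol"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns medians, scaling statistics, and categories from X_train only.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", round(test_accuracy, 3))
```

Because the whole pipeline is a single estimator, the same object can later be passed to cross-validation or a grid search without repeating the preprocessing by hand.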
4. Data Visualization
Heart Disease vs. Age
# Note: if Age was scaled in Step 3, the y-axis shows standardized values rather than years
sns.boxplot(x='HeartDisease', y='Age', data=data, palette='Set2')
plt.title('Heart Disease Distribution by Age')
plt.xlabel('Heart Disease (0 = No, 1 = Yes)')
plt.ylabel('Age')
plt.show()
Cholesterol Levels Distribution
sns.histplot(data['Cholesterol'], kde=True, color='green')
plt.title('Cholesterol Levels Distribution')
plt.xlabel('Cholesterol')
plt.ylabel('Count')  # histplot plots counts by default, not densities
plt.show()
Explanation:
- Boxplots reveal how age varies between patients with and without heart disease.
- The cholesterol histogram shows the shape of the distribution and makes skew or unusual values easy to spot.
- Visualizations provide insights into which features are most critical for prediction.
- Analyzing visual trends aids in feature importance ranking.
- Clear visuals simplify the communication of findings to stakeholders.
5. Model Building & Evaluation
Logistic Regression
from sklearn.linear_model import LogisticRegression

# A higher max_iter helps the solver converge when some features are left unscaled
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# random_state makes the results reproducible between runs
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
XGBoost Classifier
from xgboost import XGBClassifier

xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))
Explanation:
- Logistic Regression serves as a baseline model for comparison.
- Random Forest is implemented to leverage ensemble learning for better accuracy.
- XGBoost, a boosting algorithm, is used to optimize predictions further.
- Evaluation metrics like accuracy and classification reports highlight model performance.
- Comparing several models on the same test split shows whether the added complexity of the ensemble methods actually pays off.
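For a medical screening task, the number of missed positive cases often matters more than overall accuracy. The sketch below shows how a confusion matrix, recall, and ROC AUC could be computed alongside the classification report. It runs on synthetic data from make_classification, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data standing in for the heart disease dataset.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
recall = recall_score(y_test, y_pred)  # share of true positives that were found
auc = roc_auc_score(y_test, y_prob)

print("Confusion matrix:\n", cm)
print("False negatives (missed cases):", fn)
print("Recall:", round(recall, 3))
print("ROC AUC:", round(auc, 3))
```

The same calls work unchanged with y_pred_rf or y_pred_xgb from the models above, provided the matching predict_proba output is used for the AUC.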
6. Visualization of Model Performance
Compare Model Accuracies
model_accuracies = {
    "Logistic Regression": accuracy_score(y_test, y_pred_logreg),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
    "XGBoost": accuracy_score(y_test, y_pred_xgb),
}

plt.figure(figsize=(10, 6))
# Convert the dict views to lists before passing them to matplotlib
plt.bar(list(model_accuracies.keys()), list(model_accuracies.values()), color='skyblue')
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.show()
Explanation:
- A bar chart compares the accuracies of all implemented models.
- Visual performance comparisons simplify model selection.
- Charts help communicate results to non-technical audiences.
- Accuracy alone can mislead on imbalanced data, so check precision and recall in the classification reports before choosing a model for real-world use.
- Continuous model improvement is guided by performance insights.
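A single train-test split can favor one model by chance, so a cross-validated comparison is a common complement to the bar chart. The following sketch compares two scikit-learn models with 5-fold stratified cross-validation. The data is synthetic and the model settings are illustrative assumptions, not the project's tuned configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data standing in for the preprocessed features and target.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

When the difference in mean accuracy between two models is smaller than the spread across folds, the ranking should be treated as inconclusive.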
7. Conclusion
This project demonstrates how to preprocess data, explore features, and build effective classification models for heart disease prediction. Among the models tested, the XGBoost Classifier achieved the highest accuracy in our run, though results depend on the dataset and the train-test split. Businesses and healthcare providers can use this model to identify at-risk patients and take preventive actions.
Download the dataset and experiment with hyperparameter tuning or advanced deep learning models to further improve predictions. Deploy this project using Streamlit or Flask for real-world applications.
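As a starting point for hyperparameter tuning, a small grid search could look like the sketch below. The parameter grid is deliberately tiny and chosen only for illustration, and the data is again synthetic; a realistic search would cover more values and run on the actual training split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train and y_train.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=42
)

# Illustrative grid: two forest sizes and two depth limits (None = unlimited).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds a model refit on all the supplied data with the best parameters, ready to be evaluated on the held-out test set.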
Keywords: Heart disease prediction using machine learning, Classification models, Data visualization, Medical data analysis, Python project.