Health Insurance Cost Prediction Using Machine Learning
Introduction
Health Insurance Cost Prediction using Machine Learning is a crucial application in the healthcare and insurance industry. Insurance companies need accurate cost estimations to determine premium amounts, assess risks, and manage financial planning. Traditional cost prediction methods rely on generalized assumptions, which may not be accurate for individuals with unique health profiles.

Machine learning provides a data-driven approach to predicting insurance charges by analyzing multiple factors, including age, BMI, smoking habits, and region. By leveraging predictive models, we can gain deeper insights into how these factors influence medical costs and build a system that provides accurate cost estimations. Let's begin our project: Health Insurance Cost Prediction Using Machine Learning.
Step 1: Problem Understanding & Dataset Overview
Objective of the Project
Health insurance companies determine insurance premiums based on various factors like age, BMI, smoking habits, and region. Predicting insurance costs accurately helps insurers set fair pricing and allows individuals to estimate their expenses.
In this project, we will:
✔ Analyze the dataset using exploratory data analysis (EDA)
✔ Perform data preprocessing (handling missing values, encoding categorical features, feature scaling)
✔ Train multiple machine learning models (Linear Regression, Decision Tree, Random Forest, and XGBoost)
✔ Compare model performance using evaluation metrics
✔ Deploy the model for real-world applications (optional)
This end-to-end Health Insurance Cost Prediction using Machine Learning project provides a practical example of how AI can enhance decision-making in the healthcare industry.
Step 2: Dataset Overview
Dataset Source
We will use the Medical Cost Personal Dataset from Kaggle, which contains health information for individuals and their corresponding insurance charges.
Dataset Features
| Feature | Description |
|---|---|
| age | Age of the person (numeric) |
| sex | Gender (`male`, `female`) |
| bmi | Body Mass Index (numeric) |
| children | Number of children covered by insurance (integer) |
| smoker | Smoking status (`yes`, `no`) |
| region | Residential area (`northeast`, `northwest`, `southeast`, `southwest`) |
| charges | Insurance cost (target variable, numeric) |
Understanding the Target Variable
The target variable, `charges`, represents the medical insurance cost for an individual. Our goal is to build a machine learning model that can accurately predict `charges` based on the other attributes.
Step 3: Importing Libraries & Loading the Data
Before we start working with the data, let’s import the required Python libraries.
```python
# Data Handling and Analysis
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
```
Explanation
- `pandas` and `numpy` → for handling and processing data
- `matplotlib` and `seaborn` → for visualizing trends in the dataset
- `sklearn.model_selection` → to split data into training and test sets
- `LabelEncoder` → for encoding categorical variables
- `StandardScaler` → to standardize numerical features
- `LinearRegression` and `RandomForestRegressor` → machine learning models
- `mean_absolute_error`, `mean_squared_error`, `r2_score` → evaluation metrics
Loading the Dataset
```python
# Load the dataset
df = pd.read_csv("insurance.csv")

# Display first 5 rows
df.head()
```
Explanation
pd.read_csv("insurance.csv")
→ Loads the datasetdf.head()
→ Displays the first five rows
Checking Basic Information
```python
# Checking dataset structure
df.info()
df.describe()
```
Step 4: Data Cleaning
Data cleaning is essential to ensure our dataset is accurate and free from inconsistencies. In this step, we will:
✔ Check for missing values
✔ Handle duplicates
✔ Detect and remove outliers
✔ Standardize categorical variables
4.1 Checking for Missing Values
Why Check for Missing Values?
Missing values can negatively impact our model’s performance. Some machine learning models cannot handle missing values, so we need to address them appropriately.
Checking Missing Values
```python
# Check for missing values in the dataset
missing_values = df.isnull().sum()

# Display columns with missing values (if any)
missing_values[missing_values > 0]
```
Explanation
- `df.isnull().sum()` → counts missing values in each column
- `missing_values[missing_values > 0]` → filters and displays only the columns with missing values
4.2 Checking for Duplicate Rows
Why Remove Duplicates?
Duplicate rows can lead to biased model training and incorrect predictions.
Checking for Duplicates
```python
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
```
Explanation
- `df.duplicated().sum()` → counts the number of duplicate rows in the dataset
If any duplicates exist, we remove them.
Code: Removing Duplicates
```python
# Remove duplicate rows
df = df.drop_duplicates()

# Verify the change
print(f"Number of rows after removing duplicates: {df.shape[0]}")
```
4.3 Handling Outliers
What Are Outliers?
Outliers are data points that are significantly different from other observations. They can affect model performance by skewing results.
Detecting Outliers Using Boxplot
```python
# Boxplot for numerical columns
plt.figure(figsize=(12, 6))

# Creating subplots for multiple variables
plt.subplot(1, 2, 1)
sns.boxplot(y=df["bmi"])
plt.title("Boxplot of BMI")

plt.subplot(1, 2, 2)
sns.boxplot(y=df["charges"])
plt.title("Boxplot of Insurance Charges")

plt.show()
```
Explanation
sns.boxplot(y=df["bmi"])
→ Displays BMI outlierssns.boxplot(y=df["charges"])
→ Displays insurance charge outliers
Handling Outliers Using the IQR Method
One common method to handle outliers is the Interquartile Range (IQR) Method.
Removing Outliers
```python
# Function to remove outliers using the IQR method
def remove_outliers(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]

# Removing outliers from BMI and charges
df = remove_outliers(df, "bmi")
df = remove_outliers(df, "charges")

# Display updated dataset size
print(f"Number of rows after removing outliers: {df.shape[0]}")
```
Explanation
- Step 1: Compute the Interquartile Range (IQR)
- Step 2: Calculate the upper and lower limits
- Step 3: Remove rows where values fall outside this range
4.4 Standardizing Categorical Data
Why Convert Categorical Data?
Machine learning models require numerical inputs, so we need to convert categorical variables into numerical representations.
Encoding the `sex`, `smoker`, and `region` Columns
```python
# Label Encoding categorical variables
le = LabelEncoder()

df["sex"] = le.fit_transform(df["sex"])       # male → 1, female → 0
df["smoker"] = le.fit_transform(df["smoker"]) # yes → 1, no → 0
df["region"] = le.fit_transform(df["region"]) # encodes region values

# Display dataset after encoding
df.head()
```
Explanation
- `LabelEncoder().fit_transform()` → converts categorical values to numbers
- Example:
  - Sex: `male → 1`, `female → 0`
  - Smoker: `yes → 1`, `no → 0`
  - Region: encodes `northeast`, `northwest`, `southeast`, `southwest` into numbers
Final Cleaned Data Summary
```python
df.info()
```
Now, our dataset is:
✅ Free from missing values
✅ No duplicate rows
✅ Outliers removed
✅ Categorical data encoded
Step 5: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) helps us understand patterns, detect outliers, and identify relationships between variables. This step includes visualizations and statistical insights.
5.1: Distribution of the Target Variable (`charges`)
Let’s check how the target variable (insurance charges) is distributed.
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Histogram of charges
plt.figure(figsize=(8, 5))
sns.histplot(df["charges"], bins=30, kde=True)
plt.title("Distribution of Insurance Charges")
plt.xlabel("Charges")
plt.ylabel("Frequency")
plt.show()
```
Explanation:
- We use `sns.histplot()` to plot the distribution of insurance charges.
- The KDE (Kernel Density Estimation) curve helps visualize the probability density of the target variable.
- A right-skewed distribution suggests that most people have lower insurance charges, but some individuals have very high charges.
5.2: Checking for Outliers in Charges
Outliers can impact the performance of machine learning models. Let’s detect them using a boxplot.
```python
# Boxplot for outliers
plt.figure(figsize=(8, 5))
sns.boxplot(y=df["charges"])
plt.title("Boxplot of Insurance Charges")
plt.ylabel("Charges")
plt.show()
```
Explanation:
- The boxplot helps visualize the spread of the target variable.
- Outliers are typically seen as points beyond the “whiskers” of the boxplot.
- If necessary, we can handle outliers by transformation or removal; a small log-transform sketch follows this list.
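If a transformation is preferred, one hedged option (not part of the original walkthrough) is a log transform of the target, which compresses the long right tail. The sketch below keeps the transformed values as a separate Series so the main pipeline stays unchanged.

```python
# Optional sketch: a log transform tames the right skew of charges.
# If you train on the log scale, invert predictions later with np.expm1().
log_charges = np.log1p(df["charges"])

print(f"Skewness of charges:      {df['charges'].skew():.2f}")
print(f"Skewness of log(charges): {log_charges.skew():.2f}")
```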
5.3: Relationship Between Numerical Features
A pairplot allows us to examine relationships between numerical variables.
```python
# Pairplot for relationships between numerical variables
sns.pairplot(df, diag_kind="kde")
plt.show()
```
Explanation:
- `sns.pairplot()` creates scatter plots between all pairs of numerical variables.
- The diagonal plots represent the KDE distributions of individual features.
- Helps in identifying correlations and patterns.
5.4: Correlation Analysis (Heatmap)
A correlation heatmap helps identify relationships between numerical features.
```python
# Correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()
```
Explanation:
- `df.corr()` calculates correlation coefficients between numerical variables.
- `sns.heatmap()` visualizes the correlations with colors.
- Strong correlations (closer to +1 or -1) indicate strong relationships between features.
- Helps us decide which features are most important for predictions.
5.5: Impact of Categorical Variables on Charges
5.5.1: Smoker vs. Insurance Charges
Smoking status is a major factor affecting insurance charges. Let’s analyze it.
```python
# Boxplot of smoker vs charges
plt.figure(figsize=(8, 5))
sns.boxplot(x=df["smoker"], y=df["charges"])
plt.title("Impact of Smoking on Insurance Charges")
plt.xlabel("Smoker (0 = No, 1 = Yes)")
plt.ylabel("Charges")
plt.show()
```
Explanation:
- `sns.boxplot()` shows the distribution of insurance charges for smokers vs. non-smokers.
- Expectation: smokers tend to have significantly higher medical expenses.
5.5.2: Region vs. Insurance Charges
Different regions might have different healthcare costs.
```python
# Boxplot for region vs charges
plt.figure(figsize=(8, 5))
sns.boxplot(x=df["region"], y=df["charges"])
plt.title("Impact of Region on Insurance Charges")
plt.xlabel("Region")
plt.ylabel("Charges")
plt.show()
```
Explanation:
- `sns.boxplot()` helps compare insurance charges across the different regions.
- If there is a significant difference, region could be an important factor in predicting costs.
5.6: Summary of EDA Findings
- Insurance charges are right-skewed, meaning most people have lower medical costs, but some individuals have very high expenses.
- Outliers exist in the charges column, mostly due to smokers and individuals with high BMI.
- Smokers tend to have significantly higher medical expenses than non-smokers.
- There is a correlation between BMI, age, and charges, meaning older individuals or those with higher BMI often have higher costs.
- Region seems to have a minor impact, but we will confirm this with feature importance later.
Step 6: Data Preprocessing
Before building machine learning models, we need to preprocess the dataset by:
✅ Encoding categorical variables (converting text to numbers)
✅ Scaling numerical features
✅ Splitting data into training and testing sets
6.1: Encoding Categorical Variables
The dataset has three categorical variables:
- Sex (`male` / `female`)
- Smoker (`yes` / `no`)
- Region (`northeast`, `northwest`, `southeast`, `southwest`)
We use `LabelEncoder` to convert categorical values into numerical representations. (If you already encoded these columns in Step 4.4, re-running this cell is harmless but redundant.)
```python
from sklearn.preprocessing import LabelEncoder

# Creating a LabelEncoder instance
le = LabelEncoder()

# Encoding categorical columns
df["sex"] = le.fit_transform(df["sex"])       # male = 1, female = 0
df["smoker"] = le.fit_transform(df["smoker"]) # smoker = 1, non-smoker = 0
df["region"] = le.fit_transform(df["region"]) # assigns 0, 1, 2, 3 to the regions

# Checking the transformed dataset
df.head()
```
Explanation:
- `LabelEncoder()` converts text values into numerical ones.
- `sex`: `male` → 1, `female` → 0
- `smoker`: `yes` → 1, `no` → 0
- `region`: categorical regions are assigned unique numerical values (0, 1, 2, 3).
- This step ensures our dataset is machine-learning compatible.
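Note that label encoding `region` implies an artificial ordering among the four areas. A hedged alternative, not used in this walkthrough, is one-hot encoding with `pd.get_dummies`; the sketch below works whether `region` still holds strings or has already been label-encoded.

```python
# Alternative sketch: one-hot encode region so no ordering is implied among the four areas.
df_onehot = pd.get_dummies(df, columns=["region"], drop_first=True)

# The region column is replaced by binary indicator columns, e.g. region_northwest,
# region_southeast, region_southwest (or region_1, region_2, region_3 if already label-encoded).
df_onehot.head()
```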
6.2: Feature Scaling (Standardization)
Some features (like `age`, `bmi`, and `children`) have different numerical ranges, which can affect model performance.
We use `StandardScaler` to scale them.
```python
from sklearn.preprocessing import StandardScaler

# Creating a StandardScaler instance
scaler = StandardScaler()

# Selecting numerical columns for scaling
num_features = ["age", "bmi", "children"]

# Applying scaling
df[num_features] = scaler.fit_transform(df[num_features])

# Checking transformed values
df.head()
```
Explanation:
- `StandardScaler()` standardizes numerical values to have mean = 0 and standard deviation = 1.
- Helps models like Linear Regression and Neural Networks perform better.
- Scaling prevents large-valued features (e.g., age vs. BMI) from dominating others.
- Note: strictly speaking, the scaler should be fit on the training set only and then applied to the test set to avoid data leakage; we keep the simpler approach here for readability.
6.3: Splitting Data into Training and Testing Sets
To train and evaluate our machine learning models, we split the dataset:
- Training Set (80%): Used to train models.
- Testing Set (20%): Used to evaluate model performance.
```python
from sklearn.model_selection import train_test_split

# Defining input features (X) and target variable (y)
X = df.drop("charges", axis=1)  # features
y = df["charges"]               # target variable

# Splitting into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Checking shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```
Explanation:
- `X` contains all features except the target variable (`charges`).
- `y` contains only the target variable (`charges`).
- `train_test_split()` randomly splits the data into training (80%) and testing (20%) sets.
- The random seed (`random_state=42`) ensures reproducibility.
Step 7: Building Machine Learning Models
We will start with Linear Regression, a simple but powerful model for predicting continuous values like insurance charges.
Later, we will compare it with more complex models like Random Forest and XGBoost.
7.1: Implementing Linear Regression
We train the model using `X_train` and `y_train`, then evaluate it using `X_test` and `y_test`.
```python
from sklearn.linear_model import LinearRegression

# Initialize the model
lr_model = LinearRegression()

# Train (fit) the model on training data
lr_model.fit(X_train, y_train)

# Make predictions on test data
y_pred_lr = lr_model.predict(X_test)
```
Explanation:
- `LinearRegression()` initializes the model.
- `.fit(X_train, y_train)` trains the model by learning the best coefficients.
- `.predict(X_test)` makes predictions on unseen data.
7.2: Evaluating Linear Regression Performance
We evaluate the model using three key metrics:
- Mean Absolute Error (MAE): Measures average error in predictions.
- Mean Squared Error (MSE): Penalizes larger errors more than smaller ones.
- R² Score: Measures how well the model explains variance in the target variable.
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate metrics
mae_lr = mean_absolute_error(y_test, y_pred_lr)
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

# Print results
print("Linear Regression Performance:")
print(f"Mean Absolute Error (MAE): {mae_lr:.2f}")
print(f"Mean Squared Error (MSE): {mse_lr:.2f}")
print(f"R² Score: {r2_lr:.4f}")
```
Interpretation of Metrics:
- Lower MAE/MSE = better model
- R² Score closer to 1 = better model fit
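Since the conclusion also refers to RMSE, here is a small supplementary sketch (not part of the original metric code): RMSE is simply the square root of MSE and is expressed in the same units as the charges, which makes it easier to interpret.

```python
# RMSE is the square root of MSE, so it is on the same scale as the charges themselves.
rmse_lr = np.sqrt(mse_lr)
print(f"Root Mean Squared Error (RMSE): {rmse_lr:.2f}")
```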
💡 Linear Regression may not always be the best choice if relationships in data are non-linear or involve complex interactions.
7.3: Checking Feature Importance (Linear Regression Coefficients)
We analyze which features contribute the most to predicting insurance costs.
```python
# Extracting model coefficients
coefficients = pd.DataFrame(lr_model.coef_, X.columns, columns=["Coefficient"])

# Sorting by importance
coefficients = coefficients.sort_values(by="Coefficient", ascending=False)

# Displaying results
coefficients
```
Explanation:
- The higher the absolute coefficient value, the more influence that feature has on predictions.
- Positive coefficients increase charges, while negative coefficients reduce charges.
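As an optional, hedged addition (not in the original code), sorting the coefficients by absolute value and plotting them makes the comparison easier to read:

```python
# Optional sketch: plot coefficients ordered by absolute magnitude.
order = coefficients["Coefficient"].abs().sort_values(ascending=False).index
coef_sorted = coefficients.loc[order]

plt.figure(figsize=(8, 4))
sns.barplot(x=coef_sorted["Coefficient"], y=coef_sorted.index)
plt.title("Linear Regression Coefficients (sorted by absolute value)")
plt.xlabel("Coefficient")
plt.ylabel("Feature")
plt.show()
```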
Step 8: Implementing Random Forest Regressor 🌲
Since Linear Regression assumes a simple relationship between features and target, we now introduce Random Forest Regressor, a more powerful model that can handle non-linear relationships and interactions between features.
8.1: Understanding Random Forest
Random Forest is an ensemble learning method that builds multiple decision trees and takes their average prediction to reduce overfitting and improve accuracy.
Advantages of Random Forest:
✅ Captures complex relationships in data.
✅ Reduces overfitting compared to individual decision trees.
✅ Works well with both categorical and numerical features.
8.2: Implementing Random Forest Regressor
We train a Random Forest model with 100 decision trees.
```python
from sklearn.ensemble import RandomForestRegressor

# Initialize the model with 100 trees
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)
```
Explanation:
- `RandomForestRegressor(n_estimators=100)` creates a forest with 100 decision trees.
- `.fit(X_train, y_train)` trains the model on the training data.
- `.predict(X_test)` generates predictions for the test data.
8.3: Evaluating Random Forest Performance
We use the same evaluation metrics (MAE, MSE, R² Score) to compare with Linear Regression.
```python
# Calculate metrics
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Print results
print("Random Forest Performance:")
print(f"Mean Absolute Error (MAE): {mae_rf:.2f}")
print(f"Mean Squared Error (MSE): {mse_rf:.2f}")
print(f"R² Score: {r2_rf:.4f}")
```
Comparison with Linear Regression:
- Lower MAE & MSE = Better predictions.
- Higher R² Score = Better fit to data.
8.4: Feature Importance in Random Forest
Unlike Linear Regression (which uses coefficients), Random Forest determines feature importance by measuring how much each feature improves decision trees.
```python
# Extract feature importance
feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': rf_model.feature_importances_})

# Sort in descending order
feature_importance = feature_importance.sort_values(by="Importance", ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 5))
sns.barplot(x=feature_importance["Importance"], y=feature_importance["Feature"], palette="viridis")
plt.title("Feature Importance in Random Forest")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.show()
```
Explanation:
- Higher importance = Feature contributes more to predictions.
- Helps us understand which factors influence insurance cost predictions the most.
Step 9: Implementing XGBoost Regressor 🚀
Now that we have seen Random Forest, let’s explore XGBoost (Extreme Gradient Boosting)—a powerful algorithm that often outperforms traditional machine learning models.
9.1: Understanding XGBoost
XGBoost is an ensemble learning technique based on gradient boosting, which means:
✅ It builds multiple weak learners (decision trees) sequentially.
✅ Each tree focuses on correcting errors made by the previous tree.
✅ It uses a gradient descent optimization approach to minimize errors.
Advantages of XGBoost:
✔️ Faster & more efficient than Random Forest.
✔️ Handles missing values and feature importance automatically.
✔️ Less overfitting due to built-in regularization.
9.2: Installing & Importing XGBoost
First, make sure XGBoost is installed. If not, install it using:
```bash
pip install xgboost
```
Now, import the library and initialize the model:
```python
from xgboost import XGBRegressor

# Initialize XGBoost Regressor
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)
```
Explanation:
- `XGBRegressor(n_estimators=100, learning_rate=0.1)` trains 100 gradient-boosted trees.
- `learning_rate=0.1` controls how much each tree contributes to the final prediction.
- `.fit(X_train, y_train)` trains the model.
- `.predict(X_test)` generates predictions for the test data.
9.3: Evaluating XGBoost Performance
Let’s compare XGBoost with Random Forest and Linear Regression.
```python
# Calculate metrics
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

# Print results
print("XGBoost Performance:")
print(f"Mean Absolute Error (MAE): {mae_xgb:.2f}")
print(f"Mean Squared Error (MSE): {mse_xgb:.2f}")
print(f"R² Score: {r2_xgb:.4f}")
```
✅ Lower MAE & MSE = More accurate predictions.
✅ Higher R² Score = Better fit to the data.
9.4: Feature Importance in XGBoost
XGBoost provides built-in feature importance, which helps us understand the most influential factors.
```python
# Extract feature importance
feature_importance_xgb = pd.DataFrame({'Feature': X.columns, 'Importance': xgb_model.feature_importances_})

# Sort in descending order
feature_importance_xgb = feature_importance_xgb.sort_values(by="Importance", ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 5))
sns.barplot(x=feature_importance_xgb["Importance"], y=feature_importance_xgb["Feature"], palette="coolwarm")
plt.title("Feature Importance in XGBoost")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.show()
```
Key Insights:
- The most important features will have the highest scores.
- Helps in feature selection: we can remove low-importance features (see the sketch after this list).
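As a hedged illustration of that idea, the sketch below drops features whose importance falls under an arbitrary example threshold (the 0.01 cutoff is an assumption, not a value from the original article) and retrains the default XGBoost model to check that performance holds up.

```python
# Sketch: drop features below an example importance threshold, retrain, and compare R².
low_importance = feature_importance_xgb.loc[feature_importance_xgb["Importance"] < 0.01, "Feature"]

X_train_sel = X_train.drop(columns=list(low_importance))
X_test_sel = X_test.drop(columns=list(low_importance))

xgb_selected = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_selected.fit(X_train_sel, y_train)

print("Dropped features:", list(low_importance))
print(f"R² with selected features: {r2_score(y_test, xgb_selected.predict(X_test_sel)):.4f}")
```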
9.5: Comparing All Models
Let’s create a comparison table to summarize model performance.
```python
# Creating a dataframe for model comparison
model_comparison = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "XGBoost"],
    "MAE": [mae_lr, mae_rf, mae_xgb],
    "MSE": [mse_lr, mse_rf, mse_xgb],
    "R² Score": [r2_lr, r2_rf, r2_xgb]
})

# Display the comparison table
print(model_comparison)
```
🚀 XGBoost often performs the best, but let’s analyze the table to decide!
Step 10: Hyperparameter Tuning of XGBoost
10.2: Hyperparameter Tuning Using Grid Search
We’ll use GridSearchCV to test multiple combinations and find the best parameters.
🔧 Step 1: Import Required Libraries
```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
```
🔧 Step 2: Define Parameter Grid
```python
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0]
}
```
✅ We define several candidate values for each parameter; GridSearchCV will test every possible combination (3 × 3 × 3 × 3 × 3 = 243 settings, each evaluated with 5-fold cross-validation, i.e. 1,215 model fits).
🔧 Step 3: Perform Grid Search
```python
# Initialize XGBoost Regressor
xgb_model = XGBRegressor(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='r2',   # optimize for the best R² score
    cv=5,           # 5-fold cross-validation
    n_jobs=-1,      # use all CPU cores for faster execution
    verbose=2       # display progress
)

# Fit GridSearchCV on training data
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best Parameters:", grid_search.best_params_)
```
✅ Cross-validation (`cv=5`) splits the training data into 5 folds and evaluates each hyperparameter combination on every fold.
✅ `n_jobs=-1` allows parallel processing, speeding up tuning.
✅ The best parameter set is stored in `grid_search.best_params_`.
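Because an exhaustive grid of this size is expensive, a hedged alternative (not used in the original walkthrough) is `RandomizedSearchCV`, which samples a fixed number of combinations from the same grid; `n_iter=30` below is just an example value.

```python
from sklearn.model_selection import RandomizedSearchCV

# Sketch: evaluate 30 randomly sampled combinations instead of all 243.
random_search = RandomizedSearchCV(
    estimator=XGBRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=30,        # number of sampled combinations (example value)
    scoring="r2",
    cv=5,
    n_jobs=-1,
    random_state=42,
    verbose=1
)
random_search.fit(X_train, y_train)
print("Best Parameters (random search):", random_search.best_params_)
```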
10.3: Training XGBoost with Best Parameters
Now, we train our final XGBoost model using the best parameters.
```python
# Extract best parameters
best_params = grid_search.best_params_

# Train the optimized model
xgb_optimized = XGBRegressor(**best_params, random_state=42)
xgb_optimized.fit(X_train, y_train)

# Make predictions
y_pred_xgb_opt = xgb_optimized.predict(X_test)
```
✅ The model now uses optimized hyperparameters for better performance.
10.4: Evaluating the Optimized XGBoost Model
Let’s check the performance improvement after tuning.
```python
# Compute performance metrics
mae_xgb_opt = mean_absolute_error(y_test, y_pred_xgb_opt)
mse_xgb_opt = mean_squared_error(y_test, y_pred_xgb_opt)
r2_xgb_opt = r2_score(y_test, y_pred_xgb_opt)

# Print results
print("Optimized XGBoost Performance:")
print(f"MAE: {mae_xgb_opt:.2f}")
print(f"MSE: {mse_xgb_opt:.2f}")
print(f"R² Score: {r2_xgb_opt:.4f}")
```
✅ Lower MAE & MSE and higher R² indicate the model has improved! 🚀
10.5: Comparing All Models (Final Performance Table)
Now, let’s compare Linear Regression, Random Forest, XGBoost (default), and XGBoost (optimized).
```python
# Create comparison table
final_comparison = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "XGBoost (default)", "XGBoost (Optimized)"],
    "MAE": [mae_lr, mae_rf, mae_xgb, mae_xgb_opt],
    "MSE": [mse_lr, mse_rf, mse_xgb, mse_xgb_opt],
    "R² Score": [r2_lr, r2_rf, r2_xgb, r2_xgb_opt]
})

# Display results
print(final_comparison)
```
10.6: Key Takeaways
📌 Did XGBoost improve after tuning? Check the table!
📌 Compare with other models to see the best choice.
📌 XGBoost is often the best, but it depends on dataset complexity.
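If you intend to reuse or deploy the winning model, one hedged option (not covered in the steps above) is to persist it with `joblib`; the filenames below are assumptions, not part of the original project.

```python
import joblib

# Sketch: persist the tuned model and the fitted scaler so they can be reloaded for deployment.
# The filenames are example values.
joblib.dump(xgb_optimized, "xgb_insurance_model.pkl")
joblib.dump(scaler, "feature_scaler.pkl")

# Later, reload them with:
# model = joblib.load("xgb_insurance_model.pkl")
# scaler = joblib.load("feature_scaler.pkl")
```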
Conclusion
In this Health Insurance Cost Prediction using Machine Learning project, we have built a complete end-to-end solution to predict medical insurance charges based on various factors such as age, BMI, smoking status, and region.
Key Takeaways:
- Data Collection & Preprocessing:
- We started by loading the Medical Cost Personal Dataset and explored its structure.
- Cleaned the dataset by handling missing values and encoding categorical variables.
- Scaled numerical features for better model performance.
- Exploratory Data Analysis (EDA):
- We performed descriptive statistical analysis to understand feature distributions.
- Visualized relationships between independent variables and insurance charges using histograms, boxplots, pair plots, and heatmaps.
- Identified the impact of smoking, BMI, and age on insurance costs.
- Model Building & Evaluation:
- Implemented Linear Regression, Decision Tree, Random Forest, and XGBoost models to compare performance.
- Used MAE, MSE, RMSE, and R² scores to evaluate the models.
- XGBoost performed the best, indicating that insurance charges involve non-linear relationships that simpler models cannot fully capture.
- Web Application Development:
- Created an interactive front-end using Streamlit, allowing users to input their details and get a predicted insurance cost.
- Integrated the trained machine learning model with a user-friendly UI.
Next Steps:
- Feature Engineering: Introduce new features like health conditions, lifestyle habits, and family history for better accuracy.
- Hyperparameter Tuning: Optimize model parameters using GridSearchCV or Bayesian Optimization.
- Deep Learning Approach: Implement Neural Networks to improve predictions.
- Deployment: Host the model using Flask, FastAPI, or Streamlit Cloud for real-world usage (a minimal Streamlit sketch follows this list).
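As a rough illustration of the deployment idea, here is a minimal Streamlit sketch. It assumes the tuned model and scaler were saved with `joblib` as in the earlier persistence example; the filenames, widget layout, and label encodings are assumptions and must match whatever was used during training.

```python
# app.py - minimal Streamlit sketch (assumed filenames; encodings must match training)
import joblib
import pandas as pd
import streamlit as st

model = joblib.load("xgb_insurance_model.pkl")
scaler = joblib.load("feature_scaler.pkl")

st.title("Health Insurance Cost Prediction")

age = st.number_input("Age", min_value=18, max_value=100, value=30)
sex = st.selectbox("Sex", ["female", "male"])
bmi = st.number_input("BMI", min_value=10.0, max_value=60.0, value=25.0)
children = st.number_input("Children", min_value=0, max_value=10, value=0)
smoker = st.selectbox("Smoker", ["no", "yes"])
region = st.selectbox("Region", ["northeast", "northwest", "southeast", "southwest"])

if st.button("Predict charges"):
    # Encode inputs the same way LabelEncoder did during training (alphabetical order).
    row = pd.DataFrame([{
        "age": age,
        "sex": ["female", "male"].index(sex),
        "bmi": bmi,
        "children": children,
        "smoker": ["no", "yes"].index(smoker),
        "region": ["northeast", "northwest", "southeast", "southwest"].index(region),
    }])
    # Scale the numerical columns with the same scaler used during training.
    row[["age", "bmi", "children"]] = scaler.transform(row[["age", "bmi", "children"]])
    prediction = model.predict(row)[0]
    st.success(f"Estimated insurance charges: ${prediction:,.2f}")
```

Run it with `streamlit run app.py`.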
This project (Health Insurance Cost Prediction using Machine Learning) demonstrates the power of machine learning in healthcare analytics, showcasing how predictive modeling can assist insurance companies in risk assessment and pricing strategies.