Health Insurance Cost Prediction Using Machine Learning
Introduction
Health Insurance Cost Prediction using Machine Learning is a crucial application in the healthcare and insurance industry. Insurance companies need accurate cost estimations to determine premium amounts, assess risks, and manage financial planning. Traditional cost prediction methods rely on generalized assumptions, which may not be accurate for individuals with unique health profiles.

Machine learning provides a data-driven approach to predicting insurance charges by analyzing multiple factors, including age, BMI, smoking habits, and region. By leveraging predictive models, we can gain deeper insights into how these factors influence medical costs and build a system that provides accurate cost estimations. Let's begin our project: Health Insurance Cost Prediction Using Machine Learning.
Step 1: Problem Understanding & Dataset Overview
Objective of the Project
Health insurance companies determine insurance premiums based on various factors like age, BMI, smoking habits, and region. Predicting insurance costs accurately helps insurers set fair pricing and allows individuals to estimate their expenses.
In this project, we will:
✔ Analyze the dataset using exploratory data analysis (EDA)
✔ Perform data preprocessing (handling missing values, encoding categorical features, feature scaling)
✔ Train multiple machine learning models (Linear Regression, Decision Tree, Random Forest, and XGBoost)
✔ Compare model performance using evaluation metrics
✔ Deploy the model for real-world applications (optional)
This end-to-end Health Insurance Cost Prediction using Machine Learning project provides a practical example of how AI can enhance decision-making in the healthcare industry.
Step 2: Dataset Overview
Dataset Source
We will use the Medical Cost Personal Dataset from Kaggle, which contains health information for individuals and their corresponding insurance charges.
Dataset Features
| Feature | Description |
|---|---|
| age | Age of the person (numeric) |
| sex | Gender (`male`, `female`) |
| bmi | Body Mass Index (numeric) |
| children | Number of children covered by insurance (integer) |
| smoker | Smoking status (`yes`, `no`) |
| region | Residential area (`northeast`, `northwest`, `southeast`, `southwest`) |
| charges | Insurance cost (target variable, numeric) |
Understanding the Target Variable
The target variable, `charges`, represents the medical insurance cost for an individual. Our goal is to build a machine learning model that can accurately predict `charges` based on the other attributes.
Step 3: Importing Libraries & Loading the Data
Before we start working with the data, let’s import the required Python libraries.
```python
# Data Handling and Analysis
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
```
Explanation
- `pandas` and `numpy` → for handling and processing data
- `matplotlib` and `seaborn` → for visualizing trends in the dataset
- `sklearn.model_selection` → to split data into training and test sets
- `LabelEncoder` → for encoding categorical variables
- `StandardScaler` → to standardize numerical features
- `LinearRegression` and `RandomForestRegressor` → machine learning models
- `mean_absolute_error`, `mean_squared_error`, `r2_score` → evaluation metrics
Loading the Dataset
```python
# Load the dataset
df = pd.read_csv("insurance.csv")

# Display first 5 rows
df.head()
```
Explanation
pd.read_csv("insurance.csv")
→ Loads the datasetdf.head()
→ Displays the first five rows
Checking Basic Information
```python
# Checking dataset structure
df.info()
df.describe()
```
Step 4: Data Cleaning
Data cleaning is essential to ensure our dataset is accurate and free from inconsistencies. In this step, we will:
✔ Check for missing values
✔ Handle duplicates
✔ Detect and remove outliers
✔ Standardize categorical variables
4.1 Checking for Missing Values
Why Check for Missing Values?
Missing values can negatively impact our model’s performance. Some machine learning models cannot handle missing values, so we need to address them appropriately.
Checking Missing Values
```python
# Check for missing values in the dataset
missing_values = df.isnull().sum()

# Display columns with missing values (if any)
missing_values[missing_values > 0]
```
Explanation
- `df.isnull().sum()` → counts missing values in each column
- `missing_values[missing_values > 0]` → filters and displays only the columns with missing values
4.2 Checking for Duplicate Rows
Why Remove Duplicates?
Duplicate rows can lead to biased model training and incorrect predictions.
Checking for Duplicates
```python
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
```
Explanation
- `df.duplicated().sum()` → counts the number of duplicate rows in the dataset
If any duplicates exist, we remove them.
Code: Removing Duplicates
```python
# Remove duplicate rows
df = df.drop_duplicates()

# Verify the change
print(f"Number of rows after removing duplicates: {df.shape[0]}")
```
4.3 Handling Outliers
What Are Outliers?
Outliers are data points that are significantly different from other observations. They can affect model performance by skewing results.
Detecting Outliers Using Boxplot
```python
# Boxplot for numerical columns
plt.figure(figsize=(12, 6))

# Creating subplots for multiple variables
plt.subplot(1, 2, 1)
sns.boxplot(y=df["bmi"])
plt.title("Boxplot of BMI")

plt.subplot(1, 2, 2)
sns.boxplot(y=df["charges"])
plt.title("Boxplot of Insurance Charges")

plt.show()
```
Explanation
sns.boxplot(y=df["bmi"])
→ Displays BMI outlierssns.boxplot(y=df["charges"])
→ Displays insurance charge outliers
Handling Outliers Using the IQR Method
One common method to handle outliers is the Interquartile Range (IQR) Method.
Removing Outliers
```python
# Function to remove outliers using the IQR method
def remove_outliers(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]

# Removing outliers from BMI and charges
df = remove_outliers(df, "bmi")
df = remove_outliers(df, "charges")

# Display updated dataset size
print(f"Number of rows after removing outliers: {df.shape[0]}")
```
Explanation
- Step 1: Compute the Interquartile Range (IQR)
- Step 2: Calculate the upper and lower limits
- Step 3: Remove rows where values fall outside this range
4.4 Standardizing Categorical Data
Why Convert Categorical Data?
Machine learning models require numerical inputs, so we need to convert categorical variables into numerical representations.
Encoding the `sex`, `smoker`, and `region` Columns
```python
# Label Encoding categorical variables
le = LabelEncoder()

df["sex"] = le.fit_transform(df["sex"])       # male → 1, female → 0
df["smoker"] = le.fit_transform(df["smoker"]) # yes → 1, no → 0
df["region"] = le.fit_transform(df["region"]) # encodes region values

# Display dataset after encoding
df.head()
```
Explanation
- `LabelEncoder().fit_transform()` → converts categorical values to numbers
- Example:
  - Sex: `male → 1`, `female → 0`
  - Smoker: `yes → 1`, `no → 0`
  - Region: encodes `northeast`, `northwest`, `southeast`, `southwest` into numbers
Final Cleaned Data Summary
```python
df.info()
```
Now, our dataset is:
✅ Free from missing values
✅ No duplicate rows
✅ Outliers removed
✅ Categorical data encoded
Step 5: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) helps us understand patterns, detect outliers, and identify relationships between variables. This step includes visualizations and statistical insights.
5.1: Distribution of the Target Variable (`charges`)
Let’s check how the target variable (insurance charges) is distributed.
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Histogram of charges
plt.figure(figsize=(8, 5))
sns.histplot(df["charges"], bins=30, kde=True)
plt.title("Distribution of Insurance Charges")
plt.xlabel("Charges")
plt.ylabel("Frequency")
plt.show()
```
Explanation:
- We use `sns.histplot()` to plot the distribution of insurance charges.
- The KDE (Kernel Density Estimation) curve helps visualize the probability density of the target variable.
- A right-skewed distribution suggests that most people have lower insurance charges, but some individuals have very high charges.
5.2: Checking for Outliers in Charges
Outliers can impact the performance of machine learning models. Let’s detect them using a boxplot.
```python
# Boxplot for outliers
plt.figure(figsize=(8, 5))
sns.boxplot(y=df["charges"])
plt.title("Boxplot of Insurance Charges")
plt.ylabel("Charges")
plt.show()
```
Explanation:
- The boxplot helps visualize the spread of the target variable.
- Outliers are typically seen as points beyond the “whiskers” of the boxplot.
- If necessary, we can handle outliers by transformation or removal; a small log-transform sketch follows this list.
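If a transformation is preferred, one hedged option (not part of the original walkthrough) is a log transform of the target, which compresses the long right tail. The sketch below keeps the transformed values as a separate Series so the main pipeline stays unchanged.

```python
# Optional sketch: a log transform tames the right skew of charges.
# If you train on the log scale, invert predictions later with np.expm1().
log_charges = np.log1p(df["charges"])

print(f"Skewness of charges:      {df['charges'].skew():.2f}")
print(f"Skewness of log(charges): {log_charges.skew():.2f}")
```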
5.3: Relationship Between Numerical Features
A pairplot allows us to examine relationships between numerical variables.
```python
# Pairplot for relationships between numerical variables
sns.pairplot(df, diag_kind="kde")
plt.show()
```
Explanation:
- `sns.pairplot()` creates scatter plots between all pairs of numerical variables.
- The diagonal plots represent the KDE distributions of individual features.
- Helps in identifying correlations and patterns.
5.4: Correlation Analysis (Heatmap)
A correlation heatmap helps identify relationships between numerical features.
```python
# Correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()
```
Explanation:
- `df.corr()` calculates correlation coefficients between numerical variables.
- `sns.heatmap()` visualizes the correlations with colors.
- Strong correlations (closer to +1 or -1) indicate strong relationships between features.
- Helps us decide which features are most important for predictions.
5.5: Impact of Categorical Variables on Charges
5.5.1: Smoker vs. Insurance Charges
Smoking status is a major factor affecting insurance charges. Let’s analyze it.
```python
# Boxplot of smoker vs charges
plt.figure(figsize=(8, 5))
sns.boxplot(x=df["smoker"], y=df["charges"])
plt.title("Impact of Smoking on Insurance Charges")
plt.xlabel("Smoker (0 = No, 1 = Yes)")
plt.ylabel("Charges")
plt.show()
```
Explanation:
- `sns.boxplot()` shows the distribution of insurance charges for smokers vs. non-smokers.
- Expectation: smokers tend to have significantly higher medical expenses.
5.5.2: Region vs. Insurance Charges
Different regions might have different healthcare costs.
```python
# Boxplot for region vs charges
plt.figure(figsize=(8, 5))
sns.boxplot(x=df["region"], y=df["charges"])
plt.title("Impact of Region on Insurance Charges")
plt.xlabel("Region")
plt.ylabel("Charges")
plt.show()
```
Explanation:
- `sns.boxplot()` helps compare insurance charges across the different regions.
- If there is a significant difference, region could be an important factor in predicting costs.
5.6: Summary of EDA Findings
- Insurance charges are right-skewed, meaning most people have lower medical costs, but some individuals have very high expenses.
- Outliers exist in the charges column, mostly due to smokers and individuals with high BMI.
- Smokers tend to have significantly higher medical expenses than non-smokers.
- There is a correlation between BMI, age, and charges, meaning older individuals or those with higher BMI often have higher costs.
- Region seems to have a minor impact, but we will confirm this with feature importance later.
Step 6: Data Preprocessing
Before building machine learning models, we need to preprocess the dataset by:
✅ Encoding categorical variables (converting text to numbers)
✅ Scaling numerical features
✅ Splitting data into training and testing sets
6.1: Encoding Categorical Variables
The dataset has three categorical variables:
- Sex (`male` / `female`)
- Smoker (`yes` / `no`)
- Region (`northeast`, `northwest`, `southeast`, `southwest`)
We use `LabelEncoder` to convert categorical values into numerical representations. (If you already encoded these columns in Step 4.4, re-running this cell is harmless but redundant.)
```python
from sklearn.preprocessing import LabelEncoder

# Creating a LabelEncoder instance
le = LabelEncoder()

# Encoding categorical columns
df["sex"] = le.fit_transform(df["sex"])       # male = 1, female = 0
df["smoker"] = le.fit_transform(df["smoker"]) # smoker = 1, non-smoker = 0
df["region"] = le.fit_transform(df["region"]) # assigns 0, 1, 2, 3 to the regions

# Checking the transformed dataset
df.head()
```
Explanation:
- `LabelEncoder()` converts text values into numerical ones.
- `sex`: `male` → 1, `female` → 0
- `smoker`: `yes` → 1, `no` → 0
- `region`: categorical regions are assigned unique numerical values (0, 1, 2, 3).
- This step ensures our dataset is machine-learning compatible.
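Note that label encoding `region` implies an artificial ordering among the four areas. A hedged alternative, not used in this walkthrough, is one-hot encoding with `pd.get_dummies`; the sketch below works whether `region` still holds strings or has already been label-encoded.

```python
# Alternative sketch: one-hot encode region so no ordering is implied among the four areas.
df_onehot = pd.get_dummies(df, columns=["region"], drop_first=True)

# The region column is replaced by binary indicator columns, e.g. region_northwest,
# region_southeast, region_southwest (or region_1, region_2, region_3 if already label-encoded).
df_onehot.head()
```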
6.2: Feature Scaling (Standardization)
Some features (like `age`, `bmi`, and `children`) have different numerical ranges, which can affect model performance.
We use `StandardScaler` to scale them.
```python
from sklearn.preprocessing import StandardScaler

# Creating a StandardScaler instance
scaler = StandardScaler()

# Selecting numerical columns for scaling
num_features = ["age", "bmi", "children"]

# Applying scaling
df[num_features] = scaler.fit_transform(df[num_features])

# Checking transformed values
df.head()
```
Explanation:
- `StandardScaler()` standardizes numerical values to have mean = 0 and standard deviation = 1.
- Helps models like Linear Regression and Neural Networks perform better.
- Scaling prevents large-valued features (e.g., age vs. BMI) from dominating others.
- Note: strictly speaking, the scaler should be fit on the training set only and then applied to the test set to avoid data leakage; we keep the simpler approach here for readability.
6.3: Splitting Data into Training and Testing Sets
To train and evaluate our machine learning models, we split the dataset:
- Training Set (80%): Used to train models.
- Testing Set (20%): Used to evaluate model performance.
```python
from sklearn.model_selection import train_test_split

# Defining input features (X) and target variable (y)
X = df.drop("charges", axis=1)  # features
y = df["charges"]               # target variable

# Splitting into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Checking shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```
Explanation:
- `X` contains all features except the target variable (`charges`).
- `y` contains only the target variable (`charges`).
- `train_test_split()` randomly splits the data into training (80%) and testing (20%) sets.
- The random seed (`random_state=42`) ensures reproducibility.
Step 7: Building Machine Learning Models
We will start with Linear Regression, a simple but powerful model for predicting continuous values like insurance charges.
Later, we will compare it with more complex models like Random Forest and XGBoost.
7.1: Implementing Linear Regression
We train the model using `X_train` and `y_train`, then evaluate it using `X_test` and `y_test`.
```python
from sklearn.linear_model import LinearRegression

# Initialize the model
lr_model = LinearRegression()

# Train (fit) the model on training data
lr_model.fit(X_train, y_train)

# Make predictions on test data
y_pred_lr = lr_model.predict(X_test)
```
Explanation:
- `LinearRegression()` initializes the model.
- `.fit(X_train, y_train)` trains the model by learning the best coefficients.
- `.predict(X_test)` makes predictions on unseen data.
7.2: Evaluating Linear Regression Performance
We evaluate the model using three key metrics:
- Mean Absolute Error (MAE): Measures average error in predictions.
- Mean Squared Error (MSE): Penalizes larger errors more than smaller ones.
- R² Score: Measures how well the model explains variance in the target variable.
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate metrics
mae_lr = mean_absolute_error(y_test, y_pred_lr)
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

# Print results
print("Linear Regression Performance:")
print(f"Mean Absolute Error (MAE): {mae_lr:.2f}")
print(f"Mean Squared Error (MSE): {mse_lr:.2f}")
print(f"R² Score: {r2_lr:.4f}")
```
Interpretation of Metrics:
- Lower MAE/MSE = better model
- R² Score closer to 1 = better model fit
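Since the conclusion also refers to RMSE, here is a small supplementary sketch (not part of the original metric code): RMSE is simply the square root of MSE and is expressed in the same units as the charges, which makes it easier to interpret.

```python
# RMSE is the square root of MSE, so it is on the same scale as the charges themselves.
rmse_lr = np.sqrt(mse_lr)
print(f"Root Mean Squared Error (RMSE): {rmse_lr:.2f}")
```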
💡 Linear Regression may not always be the best choice if relationships in data are non-linear or involve complex interactions.
7.3: Checking Feature Importance (Linear Regression Coefficients)
We analyze which features contribute the most to predicting insurance costs.
```python
# Extracting model coefficients
coefficients = pd.DataFrame(lr_model.coef_, X.columns, columns=["Coefficient"])

# Sorting by importance
coefficients = coefficients.sort_values(by="Coefficient", ascending=False)

# Displaying results
coefficients
```
Explanation:
- The higher the absolute coefficient value, the more influence that feature has on predictions.
- Positive coefficients increase charges, while negative coefficients reduce charges.
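As an optional, hedged addition (not in the original code), sorting the coefficients by absolute value and plotting them makes the comparison easier to read:

```python
# Optional sketch: plot coefficients ordered by absolute magnitude.
order = coefficients["Coefficient"].abs().sort_values(ascending=False).index
coef_sorted = coefficients.loc[order]

plt.figure(figsize=(8, 4))
sns.barplot(x=coef_sorted["Coefficient"], y=coef_sorted.index)
plt.title("Linear Regression Coefficients (sorted by absolute value)")
plt.xlabel("Coefficient")
plt.ylabel("Feature")
plt.show()
```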
Step 8: Implementing Random Forest Regressor 🌲
Since Linear Regression assumes a simple relationship between features and target, we now introduce Random Forest Regressor, a more powerful model that can handle non-linear relationships and interactions between features.
8.1: Understanding Random Forest
Random Forest is an ensemble learning method that builds multiple decision trees and takes their average prediction to reduce overfitting and improve accuracy.
Advantages of Random Forest:
✅ Captures complex relationships in data.
✅ Reduces overfitting compared to individual decision trees.
✅ Works well with both categorical and numerical features.
8.2: Implementing Random Forest Regressor
We train a Random Forest model with 100 decision trees.
```python
from sklearn.ensemble import RandomForestRegressor

# Initialize the model with 100 trees
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)
```
Explanation:
- `RandomForestRegressor(n_estimators=100)` creates a forest with 100 decision trees.
- `.fit(X_train, y_train)` trains the model on the training data.
- `.predict(X_test)` generates predictions for the test data.
8.3: Evaluating Random Forest Performance
We use the same evaluation metrics (MAE, MSE, R² Score) to compare with Linear Regression.
```python
# Calculate metrics
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Print results
print("Random Forest Performance:")
print(f"Mean Absolute Error (MAE): {mae_rf:.2f}")
print(f"Mean Squared Error (MSE): {mse_rf:.2f}")
print(f"R² Score: {r2_rf:.4f}")
```
Comparison with Linear Regression:
- Lower MAE & MSE = Better predictions.
- Higher R² Score = Better fit to data.
8.4: Feature Importance in Random Forest
Unlike Linear Regression (which uses coefficients), Random Forest determines feature importance by measuring how much each feature improves decision trees.
```python
# Extract feature importance
feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': rf_model.feature_importances_})

# Sort in descending order
feature_importance = feature_importance.sort_values(by="Importance", ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 5))
sns.barplot(x=feature_importance["Importance"], y=feature_importance["Feature"], palette="viridis")
plt.title("Feature Importance in Random Forest")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.show()
```
Explanation:
- Higher importance = Feature contributes more to predictions.
- Helps us understand which factors influence insurance cost predictions the most.
Step 9: Implementing XGBoost Regressor 🚀
Now that we have seen Random Forest, let’s explore XGBoost (Extreme Gradient Boosting)—a powerful algorithm that often outperforms traditional machine learning models.
9.1: Understanding XGBoost
XGBoost is an ensemble learning technique based on gradient boosting, which means:
✅ It builds multiple weak learners (decision trees) sequentially.
✅ Each tree focuses on correcting errors made by the previous tree.
✅ It uses a gradient descent optimization approach to minimize errors.
Advantages of XGBoost:
✔️ Faster & more efficient than Random Forest.
✔️ Handles missing values and feature importance automatically.
✔️ Less overfitting due to built-in regularization.
9.2: Installing & Importing XGBoost
First, make sure XGBoost is installed. If not, install it using:
```bash
pip install xgboost
```
Now, import the library and initialize the model:
```python
from xgboost import XGBRegressor

# Initialize XGBoost Regressor
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)
```
Explanation:
- `XGBRegressor(n_estimators=100, learning_rate=0.1)` trains 100 gradient-boosted trees.
- `learning_rate=0.1` controls how much each tree contributes to the final prediction.
- `.fit(X_train, y_train)` trains the model.
- `.predict(X_test)` generates predictions for the test data.
9.3: Evaluating XGBoost Performance
Let’s compare XGBoost with Random Forest and Linear Regression.
```python
# Calculate metrics
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

# Print results
print("XGBoost Performance:")
print(f"Mean Absolute Error (MAE): {mae_xgb:.2f}")
print(f"Mean Squared Error (MSE): {mse_xgb:.2f}")
print(f"R² Score: {r2_xgb:.4f}")
```
✅ Lower MAE & MSE = More accurate predictions.
✅ Higher R² Score = Better fit to the data.
9.4: Feature Importance in XGBoost
XGBoost provides built-in feature importance, which helps us understand the most influential factors.
```python
# Extract feature importance
feature_importance_xgb = pd.DataFrame({'Feature': X.columns, 'Importance': xgb_model.feature_importances_})

# Sort in descending order
feature_importance_xgb = feature_importance_xgb.sort_values(by="Importance", ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 5))
sns.barplot(x=feature_importance_xgb["Importance"], y=feature_importance_xgb["Feature"], palette="coolwarm")
plt.title("Feature Importance in XGBoost")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.show()
```
Key Insights:
- The most important features will have the highest scores.
- Helps in feature selection: we can remove low-importance features (see the sketch after this list).
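As a hedged illustration of that idea, the sketch below drops features whose importance falls under an arbitrary example threshold (the 0.01 cutoff is an assumption, not a value from the original article) and retrains the default XGBoost model to check that performance holds up.

```python
# Sketch: drop features below an example importance threshold, retrain, and compare R².
low_importance = feature_importance_xgb.loc[feature_importance_xgb["Importance"] < 0.01, "Feature"]

X_train_sel = X_train.drop(columns=list(low_importance))
X_test_sel = X_test.drop(columns=list(low_importance))

xgb_selected = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_selected.fit(X_train_sel, y_train)

print("Dropped features:", list(low_importance))
print(f"R² with selected features: {r2_score(y_test, xgb_selected.predict(X_test_sel)):.4f}")
```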
9.5: Comparing All Models
Let’s create a comparison table to summarize model performance.
```python
# Creating a dataframe for model comparison
model_comparison = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "XGBoost"],
    "MAE": [mae_lr, mae_rf, mae_xgb],
    "MSE": [mse_lr, mse_rf, mse_xgb],
    "R² Score": [r2_lr, r2_rf, r2_xgb]
})

# Display the comparison table
print(model_comparison)
```
🚀 XGBoost often performs the best, but let’s analyze the table to decide!
Step 10: Hyperparameter Tuning of XGBoost
10.2: Hyperparameter Tuning Using Grid Search
We’ll use GridSearchCV to test multiple combinations and find the best parameters.
🔧 Step 1: Import Required Libraries
```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
```
🔧 Step 2: Define Parameter Grid
```python
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0]
}
```
✅ We define several candidate values for each parameter; GridSearchCV will test every possible combination (3 × 3 × 3 × 3 × 3 = 243 settings, each evaluated with 5-fold cross-validation, i.e. 1,215 model fits).
🔧 Step 3: Perform Grid Search
```python
# Initialize XGBoost Regressor
xgb_model = XGBRegressor(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='r2',   # optimize for the best R² score
    cv=5,           # 5-fold cross-validation
    n_jobs=-1,      # use all CPU cores for faster execution
    verbose=2       # display progress
)

# Fit GridSearchCV on training data
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best Parameters:", grid_search.best_params_)
```
✅ Cross-validation (`cv=5`) splits the training data into 5 folds and evaluates each hyperparameter combination on every fold.
✅ `n_jobs=-1` allows parallel processing, speeding up tuning.
✅ The best parameter set is stored in `grid_search.best_params_`.
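Because an exhaustive grid of this size is expensive, a hedged alternative (not used in the original walkthrough) is `RandomizedSearchCV`, which samples a fixed number of combinations from the same grid; `n_iter=30` below is just an example value.

```python
from sklearn.model_selection import RandomizedSearchCV

# Sketch: evaluate 30 randomly sampled combinations instead of all 243.
random_search = RandomizedSearchCV(
    estimator=XGBRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=30,        # number of sampled combinations (example value)
    scoring="r2",
    cv=5,
    n_jobs=-1,
    random_state=42,
    verbose=1
)
random_search.fit(X_train, y_train)
print("Best Parameters (random search):", random_search.best_params_)
```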
10.3: Training XGBoost with Best Parameters
Now, we train our final XGBoost model using the best parameters.
```python
# Extract best parameters
best_params = grid_search.best_params_

# Train the optimized model
xgb_optimized = XGBRegressor(**best_params, random_state=42)
xgb_optimized.fit(X_train, y_train)

# Make predictions
y_pred_xgb_opt = xgb_optimized.predict(X_test)
```
✅ The model now uses optimized hyperparameters for better performance.
10.4: Evaluating the Optimized XGBoost Model
Let’s check the performance improvement after tuning.
```python
# Compute performance metrics
mae_xgb_opt = mean_absolute_error(y_test, y_pred_xgb_opt)
mse_xgb_opt = mean_squared_error(y_test, y_pred_xgb_opt)
r2_xgb_opt = r2_score(y_test, y_pred_xgb_opt)

# Print results
print("Optimized XGBoost Performance:")
print(f"MAE: {mae_xgb_opt:.2f}")
print(f"MSE: {mse_xgb_opt:.2f}")
print(f"R² Score: {r2_xgb_opt:.4f}")
```
✅ Lower MAE & MSE and higher R² indicate the model has improved! 🚀
10.5: Comparing All Models (Final Performance Table)
Now, let’s compare Linear Regression, Random Forest, XGBoost (default), and XGBoost (optimized).
```python
# Create comparison table
final_comparison = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "XGBoost (default)", "XGBoost (Optimized)"],
    "MAE": [mae_lr, mae_rf, mae_xgb, mae_xgb_opt],
    "MSE": [mse_lr, mse_rf, mse_xgb, mse_xgb_opt],
    "R² Score": [r2_lr, r2_rf, r2_xgb, r2_xgb_opt]
})

# Display results
print(final_comparison)
```
10.6: Key Takeaways
📌 Did XGBoost improve after tuning? Check the table!
📌 Compare with other models to see the best choice.
📌 XGBoost is often the best, but it depends on dataset complexity.
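If you intend to reuse or deploy the winning model, one hedged option (not covered in the steps above) is to persist it with `joblib`; the filenames below are assumptions, not part of the original project.

```python
import joblib

# Sketch: persist the tuned model and the fitted scaler so they can be reloaded for deployment.
# The filenames are example values.
joblib.dump(xgb_optimized, "xgb_insurance_model.pkl")
joblib.dump(scaler, "feature_scaler.pkl")

# Later, reload them with:
# model = joblib.load("xgb_insurance_model.pkl")
# scaler = joblib.load("feature_scaler.pkl")
```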
Conclusion
In this Health Insurance Cost Prediction using Machine Learning project, we have built a complete end-to-end solution to predict medical insurance charges based on various factors such as age, BMI, smoking status, and region.
Key Takeaways:
- Data Collection & Preprocessing:
- We started by loading the Medical Cost Personal Dataset and explored its structure.
- Cleaned the dataset by handling missing values and encoding categorical variables.
- Scaled numerical features for better model performance.
- Exploratory Data Analysis (EDA):
- We performed descriptive statistical analysis to understand feature distributions.
- Visualized relationships between independent variables and insurance charges using histograms, boxplots, pair plots, and heatmaps.
- Identified the impact of smoking, BMI, and age on insurance costs.
- Model Building & Evaluation:
- Implemented Linear Regression, Decision Tree, Random Forest, and XGBoost models to compare performance.
- Used MAE, MSE, RMSE, and R² scores to evaluate the models.
- XGBoost performed the best, indicating that insurance charges involve non-linear relationships that simpler models cannot fully capture.
- Web Application Development:
- Created an interactive front-end using Streamlit, allowing users to input their details and get a predicted insurance cost.
- Integrated the trained machine learning model with a user-friendly UI.
Next Steps:
- Feature Engineering: Introduce new features like health conditions, lifestyle habits, and family history for better accuracy.
- Hyperparameter Tuning: Optimize model parameters using GridSearchCV or Bayesian Optimization.
- Deep Learning Approach: Implement Neural Networks to improve predictions.
- Deployment: Host the model using Flask, FastAPI, or Streamlit Cloud for real-world usage (a minimal Streamlit sketch follows this list).
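As a rough illustration of the deployment idea, here is a minimal Streamlit sketch. It assumes the tuned model and scaler were saved with `joblib` as in the earlier persistence example; the filenames, widget layout, and label encodings are assumptions and must match whatever was used during training.

```python
# app.py - minimal Streamlit sketch (assumed filenames; encodings must match training)
import joblib
import pandas as pd
import streamlit as st

model = joblib.load("xgb_insurance_model.pkl")
scaler = joblib.load("feature_scaler.pkl")

st.title("Health Insurance Cost Prediction")

age = st.number_input("Age", min_value=18, max_value=100, value=30)
sex = st.selectbox("Sex", ["female", "male"])
bmi = st.number_input("BMI", min_value=10.0, max_value=60.0, value=25.0)
children = st.number_input("Children", min_value=0, max_value=10, value=0)
smoker = st.selectbox("Smoker", ["no", "yes"])
region = st.selectbox("Region", ["northeast", "northwest", "southeast", "southwest"])

if st.button("Predict charges"):
    # Encode inputs the same way LabelEncoder did during training (alphabetical order).
    row = pd.DataFrame([{
        "age": age,
        "sex": ["female", "male"].index(sex),
        "bmi": bmi,
        "children": children,
        "smoker": ["no", "yes"].index(smoker),
        "region": ["northeast", "northwest", "southeast", "southwest"].index(region),
    }])
    # Scale the numerical columns with the same scaler used during training.
    row[["age", "bmi", "children"]] = scaler.transform(row[["age", "bmi", "children"]])
    prediction = model.predict(row)[0]
    st.success(f"Estimated insurance charges: ${prediction:,.2f}")
```

Run it with `streamlit run app.py`.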
This project (Health Insurance Cost Prediction using Machine Learning) demonstrates the power of machine learning in healthcare analytics, showcasing how predictive modeling can assist insurance companies in risk assessment and pricing strategies.