Customer Churn Prediction Using Machine Learning

KANGKAN KALITA

Customer churn prediction is a vital aspect of business analytics. Churn refers to customers leaving a service or cancelling a subscription. By predicting churn, businesses can take proactive measures to retain customers and improve profitability. This project walks through customer churn prediction with machine learning, covering data collection, preprocessing, visualization, feature engineering, and model building. Each step includes explanations and code snippets to help you follow the process.


Objective:

  • Predict customer churn using machine learning models.
  • Perform data preprocessing and feature engineering.
  • Visualize customer trends and behavior.
  • Build and evaluate classification models to predict churn.

Dataset:
The dataset can be downloaded from Kaggle. It contains customer information such as demographic details, subscription plans, and churn status.

Tools & Libraries:

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Scikit-learn
  • XGBoost
  • Jupyter Notebook or Google Colab (Recommended)
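
If any of these libraries are missing from your environment, they can typically be installed with pip (shown here as a terminal command; prefix it with ! when running in a Colab or Jupyter cell):

# Install the required libraries (run once)
pip install pandas numpy matplotlib seaborn scikit-learn xgboost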

1. Data Collection & Setup

Import Libraries and Load Dataset

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
data = pd.read_csv('/path/to/customer_churn.csv')  # Replace with actual file path

# Preview the dataset
data.head()

Explanation:

  • We use pandas to load and preview the dataset. This allows us to understand the structure and contents of the data. The dataset contains customer details and whether they churned or not.
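
As an additional quick check (a minimal sketch; the exact columns depend on the file you download from Kaggle), printing the shape and column names helps confirm the dataset loaded as expected:

# Check the number of rows/columns and list the available columns
print(data.shape)
print(data.columns.tolist())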

2. Exploratory Data Analysis (EDA)

Dataset Overview

# Display dataset information
data.info()

# Summary statistics and missing-value counts (wrapped in print so both display outside a notebook)
print(data.describe())
print(data.isnull().sum())

Target Variable Distribution

# Visualize churn distribution ('Churn' is used as the churn/status column here;
# adjust the name if your file labels it differently, e.g. 'Customer Status')
sns.countplot(x='Churn', data=data, palette='viridis')
plt.title('Churn Distribution')
plt.xlabel('Churn (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()
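
To put a number on the class balance shown in the plot (using the same churn/status column), the class shares can be printed directly:

# Share of customers in each churn class
print(data['Churn'].value_counts(normalize=True).round(3))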

Numerical Feature Correlation

# Note: calling data.corr() on the full DataFrame fails while text columns are
# still present, so the heatmap below is built from the numeric columns only.
# Select only the numeric columns
numeric_columns = data.select_dtypes(include=['float64', 'int64']).columns
numeric_data = data[numeric_columns]

# Create the correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(numeric_data.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

Explanation:

  • This step provides an overview of data types, missing values, and the distribution of the target variable. The correlation heatmap highlights relationships between the numeric features; text columns are excluded because correlations are only defined for numbers.

3. Data Preprocessing & Feature Engineering

Handle Missing Values

# Fill missing values in numeric columns with the median
columns_with_na = ['column1', 'column2']  # Replace with your dataset's numeric columns
data[columns_with_na] = data[columns_with_na].fillna(data[columns_with_na].median())

# OR, handle numeric and categorical columns separately, listing every column that has missing values:

# Fill missing values
columns_with_na = [
    'Offer', 'Avg Monthly Long Distance Charges', 'Multiple Lines', 
    'Internet Type', 'Avg Monthly GB Download', 'Online Security', 
    'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 
    'Streaming TV', 'Streaming Movies', 'Streaming Music', 'Unlimited Data', 
    'Churn Category', 'Churn Reason'
]

# Fill missing values with the median for numeric columns and the mode for categorical columns
# (direct assignment is used instead of inplace=True to avoid pandas chained-assignment warnings)
for column in columns_with_na:
    if data[column].dtype in ['float64', 'int64']:
        data[column] = data[column].fillna(data[column].median())
    else:
        data[column] = data[column].fillna(data[column].mode()[0])

Encoding Categorical Variables

# Encode categorical variables
encoder = LabelEncoder()
categorical_cols = ['Gender', 'Internet Service', 'Contract']  # Replace with actual columns
for col in categorical_cols:
    data[col] = encoder.fit_transform(data[col])

Feature Scaling

# Scale numerical features
scaler = StandardScaler()
numerical_cols = ['Tenure in Months', 'Monthly Charges', 'Total Charges']  # Replace with actual columns
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])

Train-Test Split

# Split data into features and target ('Customer Status' is the target/churn column here;
# adjust the name to match your dataset)
# Columns such as 'Churn Category' and 'Churn Reason' describe the outcome itself,
# so they are dropped from the features to avoid data leakage.
X = data.drop(columns=['Customer Status', 'Churn Category', 'Churn Reason'], errors='ignore')
y = data['Customer Status']

# Train-test split (stratify keeps the churn ratio consistent across both sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

Explanation:

  • Missing values are filled with appropriate measures (e.g., median). Categorical variables are encoded into numerical formats. Features are scaled to standardize data for better model performance.
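
Before moving on, a quick sanity check helps confirm that the preprocessing worked: the models in step 5 (especially XGBoost) require purely numeric inputs. This is a minimal sketch assuming the steps above have already run; if either check reveals text data, add those columns (or the target) to the encoding step and re-run the split:

# All feature columns should be numeric and the target should not be of object dtype
print(X_train.select_dtypes(include='object').columns.tolist())  # expect an empty list
print(y_train.dtype)  # expect an integer (or at least non-object) dtype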

4. Data Visualization

Churn by Contract Type

# Map the numeric codes back to the original contract names for plotting
# (LabelEncoder assigns codes alphabetically; check encoder.classes_ if your labels differ)
contract_mapping = {0: 'Month-to-Month', 1: 'One Year', 2: 'Two Year'}
data['Contract'] = data['Contract'].map(contract_mapping)

# Bar plot for churn by contract type
sns.countplot(x='Contract', hue='Churn', data=data, palette='coolwarm')
plt.title('Churn by Contract Type')
plt.xlabel('Contract Type')
plt.ylabel('Count')
plt.legend(title='Churn')
plt.show()

Monthly Charges Distribution

# Distribution plot for monthly charges (use the same column name as in the scaling step,
# e.g. 'Monthly Charges'; values were standardized above, so the x-axis is in scaled units)
sns.histplot(data['Monthly Charges'], kde=True, color='green')
plt.title('Distribution of Monthly Charges')
plt.xlabel('Monthly Charges')
plt.ylabel('Density')
plt.show()

Explanation:

  • Visualizations provide insights into the relationships between churn and other features like contract type and monthly charges. This helps identify patterns in customer behavior.
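
A complementary view is the churn rate within each contract type rather than raw counts (a sketch assuming the same 'Contract' and 'Churn' column names used above):

# Churn rate within each contract type (row-normalized crosstab)
churn_rate = pd.crosstab(data['Contract'], data['Churn'], normalize='index')
churn_rate.plot(kind='bar', stacked=True, colormap='coolwarm')
plt.title('Churn Rate by Contract Type')
plt.xlabel('Contract Type')
plt.ylabel('Proportion of Customers')
plt.show()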

5. Model Building & Evaluation

Logistic Regression

from sklearn.linear_model import LogisticRegression

# Train Logistic Regression model (max_iter raised to avoid convergence warnings)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Predict on test data
y_pred_logreg = logreg.predict(X_test)

# Evaluate the model
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

# Train Random Forest model (fixed random_state for reproducible results)
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Predict on test data
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

XGBoost Classifier

from xgboost import XGBClassifier

# Train XGBoost model (fixed random_state for reproducible results)
xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)

# Predict on test data
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))

Explanation:

  • Logistic Regression serves as a baseline model. Random Forest and XGBoost are used to improve accuracy and performance.
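
Because churn datasets are usually imbalanced, accuracy alone can be misleading. A confusion matrix shows where each model goes wrong; the sketch below uses the Random Forest predictions, but the same call works for any of the three models:

from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix for the Random Forest predictions
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_rf, cmap='Blues')
plt.title('Random Forest Confusion Matrix')
plt.show()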

6. Visualization of Model Performance

Compare Model Accuracies

# Model accuracy comparison
model_accuracies = {
    "Logistic Regression": accuracy_score(y_test, y_pred_logreg),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
    "XGBoost": accuracy_score(y_test, y_pred_xgb),
}

plt.figure(figsize=(10, 6))
plt.bar(model_accuracies.keys(), model_accuracies.values(), color='skyblue')
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.show()

Explanation:

  • A bar plot compares model accuracies, helping to identify the best-performing model for churn prediction.
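
Beyond raw accuracy, the tree-based models expose feature importances, which hint at which attributes drive churn. A short sketch using the Random Forest trained above:

# Top 10 most important features according to the Random Forest
importances = pd.Series(rf_model.feature_importances_, index=X.columns)
importances.nlargest(10).sort_values().plot(kind='barh', color='skyblue')
plt.title('Top 10 Features by Random Forest Importance')
plt.xlabel('Importance')
plt.show()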

7. Conclusion

This Customer Churn Prediction Using Machine Learning project demonstrates how to preprocess data, visualize trends, and build effective classification models. Compare the accuracies in the bar plot above to identify the best-performing model on your split; that model is the natural candidate for further tuning and deployment. Businesses can use this analysis to retain customers and enhance decision-making.


Download the dataset and try experimenting with hyperparameter tuning or other advanced machine learning models to further improve predictions. Extend this project by deploying it using Flask or Streamlit!
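
As a starting point for that experimentation, here is a minimal hyperparameter-tuning sketch using GridSearchCV with the Random Forest from step 5 (the parameter grid is illustrative, not exhaustive):

from sklearn.model_selection import GridSearchCV

# Small, illustrative parameter grid for the Random Forest
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validated accuracy:", grid_search.best_score_)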

Keywords: Customer churn prediction using machine learning, Churn analysis project, Data visualization, Classification models, Python project.
