Customer Churn Prediction Using Machine Learning
Customer churn prediction is a vital part of business analytics. Churn refers to customers leaving a service or cancelling a subscription. By predicting churn, businesses can take proactive measures to retain customers and protect profitability. In this project on Customer Churn Prediction Using Machine Learning, we'll walk through data collection, preprocessing, visualization, feature engineering, and model building. Each step includes detailed explanations and code snippets to help you follow the process.

Objective:
- Predict customer churn using machine learning models.
- Perform data preprocessing and feature engineering.
- Visualize customer trends and behavior.
- Build and evaluate classification models to predict churn.
Dataset:
The dataset can be downloaded from Kaggle. It contains customer information such as demographic details, subscription plans, and churn status.
Tools & Libraries:
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
- XGBoost
- Jupyter Notebook or Google Colab (Recommended)
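If you are starting from a fresh environment, the libraries above can be installed in a single cell. A minimal setup sketch, assuming a Jupyter or Colab notebook (use plain pip in a terminal otherwise):

# Install the required libraries (run once per environment)
!pip install pandas numpy matplotlib seaborn scikit-learn xgboost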
1. Data Collection & Setup
Import Libraries and Load Dataset
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
data = pd.read_csv('/path/to/customer_churn.csv')  # Replace with actual file path

# Preview the dataset
data.head()
Explanation:
- We use pandas to load and preview the dataset. This allows us to understand the structure and contents of the data. The dataset contains customer details and whether they churned or not.
2. Exploratory Data Analysis (EDA)
Dataset Overview
# Display dataset information
data.info()
data.describe()
data.isnull().sum()
Target Variable Distribution
# Visualize churn distribution
sns.countplot(x='Churn', data=data, palette='viridis')
plt.title('Churn Distribution')
plt.xlabel('Churn (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()
Numerical Feature Correlation
# Select only the numeric columns for correlation analysis
numeric_columns = data.select_dtypes(include=['float64', 'int64']).columns
numeric_data = data[numeric_columns]

# Create the correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(numeric_data.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()
Explanation:
- This step provides an overview of data types, missing values, and the distribution of the target variable. Correlation analysis helps identify relationships between features.
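It is also worth quantifying how imbalanced the target is, because a churn rate of, say, 25% means a model that always predicts "no churn" already reaches 75% accuracy. A minimal sketch, assuming the target column is named 'Churn' as in the count plot above:

# Share of churned vs. retained customers
churn_rate = data['Churn'].value_counts(normalize=True)
print(churn_rate)

# Accuracy of a naive model that always predicts the majority class
print("Majority-class baseline accuracy:", churn_rate.max())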
3. Data Preprocessing & Feature Engineering
Handle Missing Values
# Option 1: fill missing values in specific columns with the median
columns_with_na = ['column1', 'column2']  # Replace with actual columns
data[columns_with_na] = data[columns_with_na].fillna(data[columns_with_na].median())

# Option 2: fill numeric columns with the median and categorical columns with the mode
columns_with_na = [
    'Offer', 'Avg Monthly Long Distance Charges', 'Multiple Lines', 'Internet Type',
    'Avg Monthly GB Download', 'Online Security', 'Online Backup', 'Device Protection Plan',
    'Premium Tech Support', 'Streaming TV', 'Streaming Movies', 'Streaming Music',
    'Unlimited Data', 'Churn Category', 'Churn Reason'
]

for column in columns_with_na:
    if data[column].dtype in ['float64', 'int64']:
        data[column] = data[column].fillna(data[column].median())
    else:
        data[column] = data[column].fillna(data[column].mode()[0])
Encoding Categorical Variables
# Encode categorical variables
encoder = LabelEncoder()
categorical_cols = ['Gender', 'Internet Service', 'Contract']  # Replace with actual columns
for col in categorical_cols:
    data[col] = encoder.fit_transform(data[col])
Feature Scaling
# Scale numerical features
scaler = StandardScaler()
numerical_cols = ['Tenure in Months', 'Monthly Charges', 'Total Charges']  # Replace with actual columns
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
Train-Test Split
# Split data into features and target
X = data.drop(columns=['Customer Status'])  # Replace 'Customer Status' with your actual target column
y = data['Customer Status']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Explanation:
- Missing values are filled with appropriate measures (e.g., median). Categorical variables are encoded into numerical formats. Features are scaled to standardize data for better model performance.
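One caveat: fitting the imputer and scaler on the full dataset before splitting leaks test-set information into training. A cleaner pattern is to wrap preprocessing and the model in a scikit-learn Pipeline, so everything is fitted on the training split only. The sketch below assumes X and y are built from the raw, unprocessed dataset and that the column names match yours:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Illustrative column lists -- adjust to your dataset
num_cols = ['Tenure in Months', 'Monthly Charges', 'Total Charges']
cat_cols = ['Gender', 'Internet Type', 'Contract']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), num_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), cat_cols),
])

# Preprocessing is fitted only on the training data inside the pipeline
pipe = Pipeline([('prep', preprocess), ('clf', LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
print("Pipeline accuracy:", pipe.score(X_test, y_test))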
4. Data Visualization
Churn by Contract Type
# Map the numeric values back to their original contract types
contract_mapping = {0: 'Month-to-Month', 1: 'One Year', 2: 'Two Year'}
data['Contract'] = data['Contract'].map(contract_mapping)

# Bar plot for churn by contract type
sns.countplot(x='Contract', hue='Churn', data=data, palette='coolwarm')
plt.title('Churn by Contract Type')
plt.xlabel('Contract Type')
plt.ylabel('Count')
plt.legend(title='Churn')
plt.show()
Monthly Charges Distribution
# Distribution plot for monthly charges
# Column name should match your dataset; 'Monthly Charges' matches the scaling step above
sns.histplot(data['Monthly Charges'], kde=True, color='green')
plt.title('Distribution of Monthly Charges')
plt.xlabel('Monthly Charges')
plt.ylabel('Density')
plt.show()
Explanation:
- Visualizations provide insights into the relationships between churn and other features like contract type and monthly charges. This helps identify patterns in customer behavior.
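To connect charges to churn directly, a box plot split by churn status is a quick follow-up. A short sketch, assuming the 'Churn' and 'Monthly Charges' columns used above (values appear as z-scores if the scaling step has already run):

# Compare monthly charges for churned vs. retained customers
sns.boxplot(x='Churn', y='Monthly Charges', data=data, palette='coolwarm')
plt.title('Monthly Charges by Churn Status')
plt.xlabel('Churn')
plt.ylabel('Monthly Charges')
plt.show()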
5. Model Building & Evaluation
Logistic Regression
from sklearn.linear_model import LogisticRegression

# Train Logistic Regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predict on test data
y_pred_logreg = logreg.predict(X_test)

# Evaluate the model
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Predict on test data
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
XGBoost Classifier
from xgboost import XGBClassifier

# Train XGBoost model
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)

# Predict on test data
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))
Explanation:
- Logistic Regression serves as a baseline model. Random Forest and XGBoost are used to improve accuracy and performance.
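Because churn data is often imbalanced, accuracy alone can be misleading, so it also helps to inspect the confusion matrix. A minimal sketch using the Random Forest predictions from above:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Confusion matrix for the Random Forest predictions
cm = confusion_matrix(y_test, y_pred_rf)
ConfusionMatrixDisplay(confusion_matrix=cm).plot(cmap='Blues')
plt.title('Random Forest Confusion Matrix')
plt.show()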
6. Visualization of Model Performance
Compare Model Accuracies
# Model accuracy comparison
model_accuracies = {
    "Logistic Regression": accuracy_score(y_test, y_pred_logreg),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
    "XGBoost": accuracy_score(y_test, y_pred_xgb),
}

plt.figure(figsize=(10, 6))
plt.bar(model_accuracies.keys(), model_accuracies.values(), color='skyblue')
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.show()
Explanation:
- A bar plot compares model accuracies, helping to identify the best-performing model for churn prediction.
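Tree-based models also expose feature importances, which hint at which customer attributes drive churn. A short sketch using the fitted Random Forest from step 5:

# Top 10 most important features according to the Random Forest
importances = pd.Series(rf_model.feature_importances_, index=X.columns)
importances.nlargest(10).sort_values().plot(kind='barh', color='skyblue')
plt.title('Top 10 Feature Importances (Random Forest)')
plt.xlabel('Importance')
plt.show()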
7. Conclusion
This Customer Churn Prediction Using Machine Learning project demonstrates how to preprocess data, visualize trends, and build effective classification models. Among the models tested, [insert best-performing model here] achieved the highest accuracy. Businesses can use this analysis to retain customers and enhance decision-making processes.
Download the dataset and try experimenting with hyperparameter tuning or other advanced machine learning models to further improve predictions. Extend this project by deploying it using Flask or Streamlit!
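As a starting point for hyperparameter tuning, here is a hedged sketch of a small grid search over the Random Forest (the parameter grid is illustrative, not a recommendation):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Small, illustrative parameter grid
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                    cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))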
Keywords: Customer churn prediction using machine learning, Churn analysis project, Data visualization, Classification models, Python project.