Customer Segmentation Using Clustering

Introduction
Customer segmentation is a crucial technique in marketing and business strategy that helps businesses group customers based on similar characteristics. By applying clustering algorithms, businesses can tailor marketing strategies, enhance customer experiences, and optimize product recommendations.
In this project, we will use unsupervised learning techniques to segment customers based on their shopping behavior, using K-Means and DBSCAN. Hierarchical Clustering is left as a follow-up comparison.
Project Objectives
- Perform Exploratory Data Analysis (EDA) to understand customer distribution.
- Use feature selection and data transformation to prepare data for clustering.
- Apply clustering algorithms (K-Means and DBSCAN) for segmentation, with Hierarchical Clustering as a follow-up comparison.
- Use the Elbow method and silhouette score to determine the optimal number of clusters.
- Provide business insights and recommendations based on clustering results.
Dataset Overview
- Dataset Name: Mall Customer Segmentation Data
- Source: Kaggle
- Features:
  - CustomerID: Unique ID for each customer
  - Gender: Male or Female
  - Age: Age of the customer
  - Annual Income (k$): Customer’s yearly income in thousands of dollars
  - Spending Score (1-100): Score based on purchasing behavior
Step 2: Data Collection and Importing Libraries
In this step, we will import the necessary libraries and load the dataset for further analysis.
2.1 Install and Import Required Libraries
Let’s begin by installing and importing essential Python libraries for data analysis, visualization, and clustering.
# Install required libraries if not already installed
# (notebook syntax; drop the leading "!" when running in a terminal)
!pip install seaborn scikit-learn

# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
2.2 Load the Dataset
Now, let’s load the dataset into a Pandas DataFrame and display the first few rows.
# Load dataset
df = pd.read_csv("Mall_Customers.csv")

# Display first few rows
df.head()
2.3 Check the Dataset Information
Before proceeding with data preprocessing, let’s check the structure of the dataset.
# Check basic dataset info
df.info()

# Check for missing values
df.isnull().sum()
Step 3: Exploratory Data Analysis (EDA)
Before applying clustering algorithms, we must analyze and understand the dataset using various EDA techniques. This step helps us identify patterns, relationships, and potential data issues.
3.1 Understanding the Dataset
Let’s examine the statistical properties of the dataset.
# Summary statistics
df.describe()
This returns the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum of each numerical feature.
3.2 Checking for Missing Values and Duplicates
# Check for missing values
print(df.isnull().sum())

# Check for duplicate entries
print("Number of duplicate rows:", df.duplicated().sum())
If missing values or duplicates exist, we will handle them in Step 4 (Data Preprocessing).
3.3 Data Distribution and Visualization
Now, let’s visualize the distribution of customer features.
Visualizing Age Distribution:
plt.figure(figsize=(8, 5))
sns.histplot(df['Age'], bins=20, kde=True, color='blue')
plt.title('Age Distribution of Customers')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
Visualizing Annual Income Distribution:
plt.figure(figsize=(8, 5))
sns.histplot(df['Annual Income (k$)'], bins=20, kde=True, color='green')
plt.title('Annual Income Distribution')
plt.xlabel('Annual Income (in $1000s)')
plt.ylabel('Count')
plt.show()
Visualizing Spending Score Distribution:
plt.figure(figsize=(8, 5))
sns.histplot(df['Spending Score (1-100)'], bins=20, kde=True, color='red')
plt.title('Spending Score Distribution')
plt.xlabel('Spending Score')
plt.ylabel('Count')
plt.show()
3.4 Relationship Between Features
Let’s analyze how different features relate to each other using scatter plots.
Annual Income vs. Spending Score:
plt.figure(figsize=(8, 5))
sns.scatterplot(x=df['Annual Income (k$)'], y=df['Spending Score (1-100)'], hue=df['Age'], palette='coolwarm')
plt.title('Annual Income vs. Spending Score')
plt.xlabel('Annual Income (in $1000s)')
plt.ylabel('Spending Score')
plt.show()
This scatter plot will help us see if there are distinct clusters in the data.
Step 4: Data Preprocessing
Before applying clustering algorithms, we need to clean and transform our dataset to ensure optimal model performance.
4.1 Handling Missing Values
First, let’s check for missing values again. If any are present, we will decide how to handle them.
# Check for missing values
print(df.isnull().sum())
If missing values are found, we can handle them using one of the following strategies:
- Drop Rows with Missing Values:
df = df.dropna()
- Fill with Mean/Median/Mode (for numerical values):
# Replace missing numeric values with each column's mean
# (numeric_only=True skips text columns such as Gender)
df = df.fillna(df.mean(numeric_only=True))
- Fill with Most Frequent Value (for categorical values):
# Assign the result back; calling fillna(inplace=True) on a selected column may not update df
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
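To make the fill strategies concrete, here is a minimal sketch on a tiny invented frame. The `toy` data is hypothetical and not taken from the Mall Customers dataset; it uses the median for the numeric column and the mode for the categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing numeric value and one missing category.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Female"],
    "Age": [20.0, 30.0, 40.0, np.nan],
})

# Numeric column: the median is less sensitive to outliers than the mean.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fall back to the most frequent value.
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])

print(toy)
```

Whichever strategy is chosen, it is worth re-running `isnull().sum()` afterwards to confirm that nothing is left unfilled.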
4.2 Handling Duplicates
Duplicate data can negatively impact clustering, so we must remove duplicate rows.
# Remove duplicate rows
df = df.drop_duplicates()
print("Number of duplicates after removal:", df.duplicated().sum())
4.3 Encoding Categorical Variables
Since clustering algorithms work with numerical data, we need to convert categorical variables like ‘Gender’ into numerical values.
# Encoding Gender: Convert Male to 1 and Female to 0
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
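One caveat with `map()` is that any label missing from the dictionary silently becomes NaN. The sketch below uses an invented `gender` series with inconsistent spellings (an assumption for illustration, not a known property of this dataset) to show a defensive version of the encoding.

```python
import pandas as pd

# Hypothetical Gender column with inconsistent case and stray whitespace.
gender = pd.Series(["Male", "Female", "male", " Female "])
mapping = {"Male": 1, "Female": 0}

# Mapping the raw values leaves the unrecognized spellings as NaN.
raw_unmapped = int(gender.map(mapping).isnull().sum())

# Normalize whitespace and case first, then map.
cleaned = gender.str.strip().str.capitalize()
encoded = cleaned.map(mapping)
unmapped = int(encoded.isnull().sum())

print("unmapped before cleaning:", raw_unmapped)
print("encoded:", encoded.tolist(), "unmapped after cleaning:", unmapped)
```

Checking the unmapped count right after encoding catches such problems before they reach the scaler.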
4.4 Feature Selection for Clustering
Not all columns may be relevant for clustering. We will select key features such as Annual Income, Spending Score, and Age.
# Selecting relevant columns
features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
df_selected = df[features]
4.5 Feature Scaling
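Distance-based clustering algorithms such as K-Means are sensitive to differences in scale, so we rescale the features first. StandardScaler standardizes each feature to zero mean and unit variance, while MinMaxScaler rescales it to a fixed range such as [0, 1]. Here we use StandardScaler.
from sklearn.preprocessing import StandardScaler

# Standardizing the features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_selected)  # returns a NumPy array, not a DataFrame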
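For intuition, the standardization can be reproduced by hand. This is a minimal NumPy sketch on made-up numbers (the array `X` is hypothetical): each column has its mean subtracted and is divided by its population standard deviation, which is what StandardScaler computes by default.

```python
import numpy as np

# Hypothetical values for two features on very different scales:
# column 0 is age in years, column 1 is annual income in k$.
X = np.array([[20.0, 15.0],
              [30.0, 60.0],
              [40.0, 135.0]])

# Z-score: subtract each column's mean, divide by its standard deviation.
# ddof=0 gives the population standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=0)
X_scaled = (X - mean) / std

print(X_scaled.round(3))
print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds:", X_scaled.std(axis=0).round(6))
```

After scaling, both columns contribute comparably to Euclidean distances, so income no longer dominates simply because its raw numbers are larger.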
Step 5: Choosing the Right Clustering Algorithm and Finding Optimal Clusters
Now that our data is preprocessed, we need to determine the optimal number of clusters and select an appropriate clustering algorithm.
5.1 Using the Elbow Method to Determine K (for K-Means Clustering)
The Elbow Method helps determine the optimal number of clusters by plotting the Within-Cluster Sum of Squares (WCSS) against different values of K.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Finding the optimal number of clusters using the Elbow Method
wcss = []
for k in range(1, 11):  # Trying different K values from 1 to 10
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(df_scaled)
    wcss.append(kmeans.inertia_)  # Inertia is the WCSS value

# Plotting the Elbow Curve
plt.figure(figsize=(8,5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--', color='b')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal K')
plt.show()
🔹 Interpretation: WCSS always decreases as K grows, so look for the point where the curve starts to flatten (the ‘elbow’). Adding clusters beyond it reduces WCSS only marginally, which makes that K a good candidate.
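Reading the elbow off a plot is subjective. As one possible heuristic (an assumption here, not the only or definitive rule), the bend can be located as the K with the largest discrete second difference of the WCSS curve. The function name and the WCSS values below are hypothetical.

```python
def elbow_by_second_difference(wcss, k_start=1):
    """Return the K at which the WCSS curve bends most sharply.

    The bend at position i is wcss[i-1] - 2*wcss[i] + wcss[i+1];
    it is largest where the drop in WCSS slows down the most.
    """
    if len(wcss) < 3:
        raise ValueError("need at least three WCSS values")
    best_index, best_bend = None, float("-inf")
    for i in range(1, len(wcss) - 1):
        bend = wcss[i - 1] - 2 * wcss[i] + wcss[i + 1]
        if bend > best_bend:
            best_index, best_bend = i, bend
    return k_start + best_index


# Made-up WCSS values for K = 1..7 (not measured on the real dataset).
example_wcss = [600.0, 450.0, 300.0, 150.0, 130.0, 115.0, 105.0]
print("Suggested K:", elbow_by_second_difference(example_wcss))
```

Treat the result as a suggestion to check against the plot and the silhouette analysis rather than as a final answer.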
5.2 Using the Silhouette Score for Cluster Quality
The Silhouette Score evaluates how well data points fit within their assigned clusters. It ranges from -1 to 1: values near 1 mean a point is close to its own cluster and far from the others, while values near 0 or below indicate overlapping or misassigned points.
from sklearn.metrics import silhouette_score

silhouette_scores = []
for k in range(2, 11):  # Silhouette score requires at least 2 clusters
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(df_scaled)
    score = silhouette_score(df_scaled, kmeans.labels_)
    silhouette_scores.append(score)

# Plotting Silhouette Scores
plt.figure(figsize=(8,5))
plt.plot(range(2, 11), silhouette_scores, marker='o', linestyle='--', color='r')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score Analysis')
plt.show()
🔹 Interpretation: The K value with the highest silhouette score is usually the strongest candidate; cross-check it against the elbow plot before settling on it.
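To show what the score measures, here is a from-scratch sketch of the per-point silhouette, s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the lowest mean distance to any other cluster. The helper name and the toy points are hypothetical; in the project, `silhouette_score` does this work.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette with Euclidean distance (needs >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: silhouette is defined as 0
        a = dist[i, same].mean()  # mean distance within own cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Made-up data: two tight, well-separated groups.
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_values(X, [0, 0, 1, 1]).mean()
bad = silhouette_values(X, [0, 1, 0, 1]).mean()
print("sensible labels:", round(good, 3), "scrambled labels:", round(bad, 3))
```

The sensible labelling scores close to 1, while the scrambled one goes negative, which is the behaviour the plot above relies on.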
5.3 Choosing Between K-Means and DBSCAN
- If clusters are compact and well-separated (clear elbow, high silhouette score): Use K-Means
- If clusters have irregular shapes, varying density, or noise points that should stay unassigned: Use DBSCAN
Using DBSCAN
import numpy as np
from sklearn.cluster import DBSCAN

# Applying DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(df_scaled)

# Checking how many clusters DBSCAN found (-1 represents noise points)
print("Unique clusters found by DBSCAN:", np.unique(dbscan_labels))
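The values `eps=0.5` and `min_samples=5` are starting points, and DBSCAN results depend heavily on them. A common heuristic, sketched below with hypothetical helper names and made-up points, is to sort each point's distance to its k-th nearest neighbor and pick `eps` near the point where that curve jumps. The second helper summarizes DBSCAN labels into cluster and noise counts.

```python
import numpy as np

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k is the k-th nearest other point.
    return np.sort(np.sort(dist, axis=1)[:, k])

def summarize_labels(labels):
    """Count clusters and noise points in DBSCAN output (-1 marks noise)."""
    labels = np.asarray(labels)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise

# Made-up points: a tight group of four plus one far-away outlier.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [8, 8]]
kd = k_distances(X, k=2)
print("k-distances:", kd.round(2))
print("clusters, noise:", summarize_labels([0, 0, 0, 0, -1]))
```

In this toy case the k-distances stay flat for the grouped points and jump for the outlier, so an `eps` just above the flat part would keep the group together and leave the outlier as noise.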
Step 6: Implementing K-Means Clustering and Visualizing Results
Now that we have determined the optimal number of clusters using the Elbow Method and Silhouette Score, let’s proceed with implementing K-Means clustering and visualizing the results.
6.1 Applying K-Means Clustering
We will train the K-Means algorithm using the optimal K value determined in the previous step.
# Applying K-Means with the optimal K value (assuming K=4 based on analysis)
optimal_k = 4  # Replace this with the actual optimal K from the Elbow/Silhouette analysis
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', random_state=42)

# StandardScaler returned a NumPy array, so wrap it in a DataFrame before adding a column
df_scaled = pd.DataFrame(df_scaled, columns=features, index=df_selected.index)
df_scaled['Cluster'] = kmeans.fit_predict(df_scaled)

# Displaying cluster distribution
df_scaled['Cluster'].value_counts()
🔹 Explanation:
- We fit the K-Means model on the standardized dataset.
- The new column ‘Cluster’ is added to indicate which cluster each customer belongs to.
- The value_counts() method shows how many customers fall into each cluster.
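To make the ‘Cluster’ column less of a black box: once K-Means has converged, each customer's label is the index of the nearest centroid, and the inertia is the sum of squared distances to those centroids. The sketch below uses made-up points and centroids and hypothetical helper names; in the project the centroids would come from `kmeans.cluster_centers_`.

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Return the index of the closest centroid for each point (Euclidean)."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Distance from every point to every centroid, shape (n_points, n_centroids).
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def within_cluster_ss(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    return float(((points - centroids[labels]) ** 2).sum())

# Made-up data: two points near the first centroid, one on the second.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
centroids = [[0.0, 0.5], [10.0, 10.0]]
labels = assign_to_nearest(points, centroids)
print("labels:", labels.tolist())
print("WCSS:", within_cluster_ss(points, centroids, labels))
```

The same nearest-centroid rule is what lets a fitted model place new, previously unseen customers into an existing segment.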
6.2 Visualizing Clusters Using PCA (2D Plot)
Since our dataset is multidimensional, we will use Principal Component Analysis (PCA) to reduce dimensions to 2D for visualization.
from sklearn.decomposition import PCA
import seaborn as sns

# Reducing dimensions to 2D using PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df_scaled.drop(columns=['Cluster']))

# Creating a DataFrame for plotting
df_pca = pd.DataFrame(pca_result, columns=['PC1', 'PC2'])
# .values avoids index misalignment if rows were dropped during preprocessing
df_pca['Cluster'] = df_scaled['Cluster'].values

# Plotting the clusters
plt.figure(figsize=(10,6))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', palette='viridis', data=df_pca, s=100, alpha=0.7)
plt.title('Customer Segmentation Using K-Means')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
🔹 Explanation:
- We project the three scaled features onto two principal components.
- We then scatter plot the clusters with different colors.
- This visualization helps us understand the grouping of customer segments.
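A 2D projection is only trustworthy if the two components retain most of the variance. The sketch below computes that share by hand with NumPy on synthetic data (the array `X` and the helper name are hypothetical); with scikit-learn the same numbers are available as `pca.explained_variance_ratio_` after fitting.

```python
import numpy as np

def explained_variance_ratio(X, n_components=2):
    """Share of total variance captured by the leading principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # PCA operates on centered data
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2  # proportional to component variances
    return variances[:n_components] / variances.sum()

# Synthetic 3-feature data where the third column nearly duplicates the first.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])

ratio = explained_variance_ratio(X)
print("per component:", ratio.round(3), "total:", round(float(ratio.sum()), 4))
```

If the total for the real data is well below 1, clusters that look overlapping in the plot may still be separated along the discarded component.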
6.3 Analyzing Customer Segments
Let’s analyze the characteristics of each cluster.
# Calculating mean values of features for each cluster
# Use the unscaled values in df so the means are in the original units
cluster_summary = df[features].groupby(df_scaled['Cluster']).mean()
cluster_summary
🔹 Interpretation:
- This table helps understand the purchasing behavior of each segment.
- We can rename clusters based on common traits (e.g., “High Spenders”, “Budget Shoppers”, “Occasional Buyers”, etc.).
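Naming the clusters can be done programmatically once the summary table exists. The sketch below is one possible approach under stated assumptions: the `summary` values, the cut-offs of 60 k$ and a score of 50, and the `describe_segment` helper are all hypothetical and should be replaced with figures from the real `cluster_summary`.

```python
import pandas as pd

# Made-up per-cluster means, shaped like the cluster_summary table.
summary = pd.DataFrame(
    {"Annual Income (k$)": [88.0, 26.0, 87.0],
     "Spending Score (1-100)": [82.0, 21.0, 17.0]},
    index=pd.Index([0, 1, 2], name="Cluster"),
)

def describe_segment(row, income_cut, score_cut):
    """Label a cluster by comparing its means with chosen cut-offs."""
    income = "high income" if row["Annual Income (k$)"] >= income_cut else "low income"
    spend = "high spending" if row["Spending Score (1-100)"] >= score_cut else "low spending"
    return f"{income}, {spend}"

# Assumed cut-offs for illustration only; choose thresholds that suit the business.
segment_names = summary.apply(describe_segment, axis=1, args=(60.0, 50.0))
print(segment_names)
```

Deriving names from the numbers keeps the labels tied to what the data actually shows instead of to expectations about the segments.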
Step 7: Business Insights and Recommendations
Now that we have successfully performed customer segmentation using K-Means clustering, it’s time to interpret the results and derive business insights. This step will help stakeholders make data-driven decisions based on customer groups.
7.1 Interpreting Customer Segments
Let’s analyze the characteristics of each cluster using descriptive statistics.
# Grouping data by clusters and calculating mean values of each feature
# Use the unscaled values in df so the means are in the original units
cluster_summary = df[features].groupby(df_scaled['Cluster']).mean()
cluster_summary
🔹 Explanation:
- This table provides average values of features for each customer segment.
- It helps in identifying unique patterns among different clusters.
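The table below is an illustrative interpretation. The dataset contains only age, income, and spending score, so traits such as loyalty, seasonality, or engagement are hypotheses to confirm against cluster_summary and, where available, transaction history: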
Cluster | Characteristics |
---|---|
Cluster 0 | High spenders, frequent transactions, loyal customers |
Cluster 1 | Budget-conscious customers with low spending patterns |
Cluster 2 | Seasonal buyers, occasional big purchases |
Cluster 3 | New or inactive customers with low engagement |
7.2 Business Recommendations
Based on the customer segmentation, businesses can take strategic actions:
- Personalized Marketing Campaigns
- Offer loyalty rewards to Cluster 0 (high spenders) to retain them.
- Provide discounts and promotions to Cluster 1 (budget shoppers) to encourage spending.
- Target seasonal buyers (Cluster 2) with time-sensitive deals.
- Improving Customer Retention
- Send personalized emails and offers to low-engagement customers (Cluster 3) to re-engage them.
- Introduce customer feedback surveys to understand their needs.
- Product Recommendations
- Use cluster-based analysis to provide customized product recommendations.
- High spenders may prefer premium products, while budget shoppers may seek value-for-money items.
- Inventory and Pricing Strategy
- Maintain higher stock levels of products preferred by frequent shoppers.
- Offer tiered pricing strategies based on customer preferences.
7.3 Limitations and Future Enhancements
✅ Limitations
- K-Means assumes clusters are roughly spherical and similar in size; for irregularly shaped or unevenly dense groups, algorithms like DBSCAN may perform better.
- Data preprocessing choices (e.g., scaling, feature selection) impact results.
🚀 Future Enhancements
- Compare the K-Means segments against Hierarchical Clustering and a tuned DBSCAN.
- Integrate customer feedback data for better segmentation.
- Apply predictive analytics to forecast customer lifetime value.
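As a starting point for the hierarchical comparison, here is a minimal sketch using scikit-learn's `AgglomerativeClustering` with Ward linkage. The synthetic blobs stand in for the scaled feature matrix and the choice of three clusters is an assumption for the example, not a result from this dataset.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled features: three well-separated blobs.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.normal(size=(30, 2)) for c in centers])

# Ward linkage merges the pair of clusters that least increases
# the total within-cluster variance.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
hier_labels = model.fit_predict(X)

hier_silhouette = silhouette_score(X, hier_labels)
print("Cluster sizes:", np.bincount(hier_labels))
print("Silhouette:", round(float(hier_silhouette), 3))
```

Running the same silhouette calculation on the K-Means labels gives a like-for-like number for comparing the two methods.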
Conclusion
✅ Project Summary:
- We cleaned and preprocessed customer data.
- We performed EDA and feature selection to improve model performance.
- We implemented K-Means clustering and identified customer segments.
- We provided business recommendations for targeted marketing, pricing, and retention strategies.
📌 Final Thought:
Customer segmentation is a powerful tool for businesses to enhance marketing strategies and optimize operations. Unsupervised learning helps discover hidden patterns, ultimately leading to better customer satisfaction and increased revenue.
Next Steps for Implementation
- Deploy the model in a dashboard for real-time customer insights.
- Integrate segmentation results into a CRM system.
- Use A/B testing to measure the effectiveness of targeted marketing campaigns.
🚀 This concludes our Customer Segmentation project! Would you like to implement another technique or explore a different dataset next? 😊 Please let me know in the comment section.