Customer Segmentation Using Clustering

KANGKAN KALITA

Introduction

Customer segmentation is a crucial technique in marketing and business strategy that helps businesses group customers based on similar characteristics. By applying clustering algorithms, businesses can tailor marketing strategies, enhance customer experiences, and optimize product recommendations.

In this project, we will use unsupervised learning to segment customers based on their shopping behavior. The walkthrough focuses on K-Means, shows DBSCAN as an alternative, and leaves Hierarchical Clustering as a follow-up experiment.

Project Objectives

  • Perform Exploratory Data Analysis (EDA) to understand customer distribution.
  • Use feature selection and data transformation to prepare data for clustering.
  • Apply clustering algorithms (K-Means and DBSCAN) for segmentation.
  • Use the Elbow method and silhouette score to determine the optimal number of clusters.
  • Provide business insights and recommendations based on clustering results.

Dataset Overview

  • Dataset Name: Mall Customer Segmentation Data
  • Source: Kaggle
  • Features:
    • CustomerID: Unique ID for each customer
    • Gender: Male or Female
    • Age: Age of the customer
    • Annual Income (k$): Customer’s yearly income in thousands of dollars
    • Spending Score (1-100): Score based on purchasing behavior

Step 2: Data Collection and Importing Libraries

In this step, we will import the necessary libraries and load the dataset for further analysis.

2.1 Install and Import Required Libraries

Let’s begin by installing and importing essential Python libraries for data analysis, visualization, and clustering.

# Install required libraries if not already installed
!pip install seaborn scikit-learn

# Importing necessary libraries
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
import seaborn as sns  
from sklearn.cluster import KMeans, DBSCAN  
from sklearn.preprocessing import StandardScaler  
from sklearn.decomposition import PCA  
from sklearn.metrics import silhouette_score  

# Ignore warnings  
import warnings  
warnings.filterwarnings("ignore")

2.2 Load the Dataset

Now, let’s load the dataset into a Pandas DataFrame and display the first few rows.

# Load dataset
df = pd.read_csv("Mall_Customers.csv")

# Display first few rows
df.head()
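
Before going further, it can help to confirm that the file really has the columns listed in the Dataset Overview. Here is a minimal sketch, assuming those column names; the helper load_customers and the three-row sample are hypothetical and only stand in for the real Mall_Customers.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as Mall_Customers.csv
# (the values are made up purely for illustration).
sample_csv = io.StringIO(
    "CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)\n"
    "1,Male,25,40,60\n"
    "2,Female,31,75,20\n"
    "3,Female,47,55,85\n"
)

# Column names assumed from the feature list in the Dataset Overview.
EXPECTED_COLUMNS = [
    "CustomerID", "Gender", "Age",
    "Annual Income (k$)", "Spending Score (1-100)",
]

def load_customers(source):
    """Read the CSV and fail early if an expected column is missing."""
    frame = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame

sample_df = load_customers(sample_csv)
print(sample_df.shape)
```

With the real file you would call load_customers("Mall_Customers.csv") instead. If your copy names a column differently, adjust EXPECTED_COLUMNS or rename the column right after loading.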

2.3 Check the Dataset Information

Before proceeding with data preprocessing, let’s check the structure of the dataset.

# Check basic dataset info
df.info()

# Check for missing values
df.isnull().sum()

Step 3: Exploratory Data Analysis (EDA)

Before applying clustering algorithms, we must analyze and understand the dataset using various EDA techniques. This step helps us identify patterns, relationships, and potential data issues.

3.1 Understanding the Dataset

Let’s examine the statistical properties of the dataset.

# Summary statistics
df.describe()

This reports the count, mean, standard deviation, minimum, maximum, and quartiles for each numerical feature (the 50% row is the median).

3.2 Checking for Missing Values and Duplicates

# Check for missing values
print(df.isnull().sum())

# Check for duplicate entries
print("Number of duplicate rows:", df.duplicated().sum())

If missing values or duplicates exist, we will handle them accordingly in the next step.

3.3 Data Distribution and Visualization

Now, let’s visualize the distribution of customer features.

Visualizing Age Distribution:

plt.figure(figsize=(8, 5))
sns.histplot(df['Age'], bins=20, kde=True, color='blue')
plt.title('Age Distribution of Customers')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

Visualizing Annual Income Distribution:

plt.figure(figsize=(8, 5))
sns.histplot(df['Annual Income (k$)'], bins=20, kde=True, color='green')
plt.title('Annual Income Distribution')
plt.xlabel('Annual Income (in $1000s)')
plt.ylabel('Count')
plt.show()

Visualizing Spending Score Distribution:

plt.figure(figsize=(8, 5))
sns.histplot(df['Spending Score (1-100)'], bins=20, kde=True, color='red')
plt.title('Spending Score Distribution')
plt.xlabel('Spending Score')
plt.ylabel('Count')
plt.show()

3.4 Relationship Between Features

Let’s analyze how different features relate to each other using scatter plots.

Annual Income vs. Spending Score:

plt.figure(figsize=(8, 5))
sns.scatterplot(x=df['Annual Income (k$)'], y=df['Spending Score (1-100)'], hue=df['Age'], palette='coolwarm')
plt.title('Annual Income vs. Spending Score')
plt.xlabel('Annual Income (in $1000s)')
plt.ylabel('Spending Score')
plt.show()

This scatter plot will help us see if there are distinct clusters in the data.

Step 4: Data Preprocessing

Before applying clustering algorithms, we need to clean and transform our dataset to ensure optimal model performance.

4.1 Handling Missing Values

First, let’s check for missing values again. If any are present, we will decide how to handle them.

# Check for missing values
print(df.isnull().sum())

If missing values are found, we can handle them using one of the following strategies:

  • Drop Rows with Missing Values: df = df.dropna()
  • Fill with Mean/Median (for numerical values): df = df.fillna(df.mean(numeric_only=True)) # Replace missing numeric values with the column mean
  • Fill with Most Frequent Value (for categorical values): df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
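
To see the fill strategies side by side, here is a small sketch on a toy DataFrame. The values are invented for illustration, and the median is used for the numeric columns as one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are illustrative, not from the real dataset).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Female"],
    "Age": [20.0, np.nan, 40.0, 60.0],
    "Annual Income (k$)": [30.0, 50.0, np.nan, 70.0],
})

# Numeric columns: fill with the column median,
# which is less sensitive to outliers than the mean.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Categorical column: fill with the most frequent value.
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])

print(toy)
```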

4.2 Handling Duplicates

Duplicate rows give the same customer extra weight when cluster centers are computed, so we remove them.

# Remove duplicate rows
df = df.drop_duplicates()
print("Number of duplicates after removal:", df.duplicated().sum())

4.3 Encoding Categorical Variables

Since clustering algorithms work with numerical data, we need to convert categorical variables like ‘Gender’ into numerical values.

# Encoding Gender: Convert Male to 1 and Female to 0
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
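
One thing to watch with map(): any label that is not in the mapping silently becomes NaN. The sketch below, on a made-up column with inconsistent labels, normalizes the text first and then counts unmapped values. The cleaning steps are an assumption about what messy input might look like.

```python
import pandas as pd

# Toy column with inconsistent labels (illustrative values only).
gender = pd.Series(["Male", "female", "Female ", "Male"])

# Normalize whitespace and case before mapping so variants are not lost.
cleaned = gender.str.strip().str.title()
encoded = cleaned.map({"Male": 1, "Female": 0})

# Any label outside the mapping would show up as NaN here.
unmapped = int(encoded.isna().sum())
print(encoded.tolist(), unmapped)
```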

4.4 Feature Selection for Clustering

Not all columns are relevant for clustering: CustomerID is only an identifier, and the encoded Gender column is left out here. We select Age, Annual Income, and Spending Score.

# Selecting relevant columns
features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
df_selected = df[features]

4.5 Feature Scaling

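Clustering algorithms such as K-Means rely on distances and are sensitive to differences in scale, so we rescale the features. MinMaxScaler and StandardScaler are both common choices; here we use StandardScaler, which gives each feature zero mean and unit variance.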

from sklearn.preprocessing import StandardScaler

# Standardizing the features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_selected)
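
To make the transformation concrete, the following sketch compares StandardScaler with the formula z = (x - mean) / std written out by hand. The toy matrix is invented; its columns only stand in for Age, Annual Income, and Spending Score.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: columns stand in for Age, Annual Income (k$), Spending Score.
X = np.array([[20.0, 15.0, 40.0],
              [35.0, 60.0, 80.0],
              [50.0, 120.0, 10.0],
              [65.0, 45.0, 55.0]])

scaled = StandardScaler().fit_transform(X)

# The same transformation written out: z = (x - mean) / std,
# using the population standard deviation (ddof=0), as StandardScaler does.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances that K-Means uses.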

Step 5: Choosing the Right Clustering Algorithm and Finding Optimal Clusters

Now that our data is preprocessed, we need to determine the optimal number of clusters and select an appropriate clustering algorithm.


5.1 Using the Elbow Method to Determine K (for K-Means Clustering)

The Elbow Method helps determine the optimal number of clusters by plotting the Within-Cluster Sum of Squares (WCSS) against different values of K.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Finding the optimal number of clusters using the Elbow Method
wcss = []
for k in range(1, 11):  # Trying different K values from 1 to 10
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(df_scaled)
    wcss.append(kmeans.inertia_)  # Inertia is the WCSS value

# Plotting the Elbow Curve
plt.figure(figsize=(8,5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--', color='b')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal K')
plt.show()

🔹 Interpretation: The point where the WCSS curve starts to flatten (the ‘elbow’) is a good candidate for K, since adding more clusters beyond it reduces WCSS only slightly.
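
Reading the elbow by eye can be subjective. One simple heuristic, sketched here, picks the point that lies furthest below the straight line joining the first and last points of the curve. The WCSS values are hypothetical and this is only one of several possible rules, so treat the result as a suggestion.

```python
import numpy as np

# Hypothetical WCSS values for K = 1..10 (made up for illustration;
# substitute the wcss list computed in the elbow loop).
ks = np.arange(1, 11)
wcss_demo = np.array([600, 400, 250, 150, 130, 115, 105, 97, 90, 85], dtype=float)

# Rescale both axes to [0, 1] so neither dominates the comparison.
x = (ks - ks.min()) / (ks.max() - ks.min())
y = (wcss_demo - wcss_demo.min()) / (wcss_demo.max() - wcss_demo.min())

# After rescaling, the chord from the first to the last point is y = 1 - x.
# The elbow is taken as the point lying furthest below that chord.
gap = 1.0 - x - y
elbow_k = int(ks[np.argmax(gap)])
print(elbow_k)
```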


5.2 Using the Silhouette Score for Cluster Quality

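The Silhouette Score measures how well each data point fits its assigned cluster compared with the nearest other cluster. It ranges from -1 to 1, and higher values indicate better-separated clusters.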

from sklearn.metrics import silhouette_score

silhouette_scores = []
for k in range(2, 11):  # Silhouette score requires at least 2 clusters
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(df_scaled)
    score = silhouette_score(df_scaled, kmeans.labels_)
    silhouette_scores.append(score)

# Plotting Silhouette Scores
plt.figure(figsize=(8,5))
plt.plot(range(2, 11), silhouette_scores, marker='o', linestyle='--', color='r')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score Analysis')
plt.show()

🔹 Interpretation: The K value with the highest silhouette score is usually the best candidate. Compare it with the elbow before settling on K.
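
To show what the score actually computes, here is a sketch that calculates it by hand on four toy points and compares the result with scikit-learn. For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to another cluster; the point's score is (b - a) / max(a, b). The helper manual_silhouette is hypothetical and assumes every cluster has at least two members.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Four points on a line forming two obvious groups (toy data).
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def manual_silhouette(X, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points.

    a: mean distance to the other members of the point's own cluster.
    b: smallest mean distance to the members of any other cluster.
    Assumes every cluster has at least two members.
    """
    # Pairwise Euclidean distance matrix.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i, own in enumerate(labels):
        same = (labels == own)
        same[i] = False  # exclude the point itself
        a = dists[i, same].mean()
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

print(round(manual_silhouette(points, labels), 4))
print(round(float(silhouette_score(points, labels)), 4))
```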


5.3 Choosing Between K-Means and DBSCAN

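  • If the clusters are compact and well separated (a clear elbow and a high silhouette score): use K-Means.
  • If the clusters have irregular shapes and are better described as dense regions separated by sparse ones, or if outliers should be flagged as noise: use DBSCAN.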

Using DBSCAN

from sklearn.cluster import DBSCAN

# Applying DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(df_scaled)

# Checking how many clusters DBSCAN found (-1 represents noise points)
import numpy as np
print("Unique clusters found by DBSCAN:", np.unique(dbscan_labels))
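
The eps and min_samples values above are starting points, not tuned settings. A common heuristic for eps is to sort each point's distance to its min_samples-th nearest neighbour and look for a sharp bend. The sketch below does this on synthetic data (two tight blobs plus one outlier, standing in for df_scaled) and then counts clusters and noise points.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for df_scaled: two tight blobs plus one far outlier.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(30, 2))
blob_b = rng.normal(loc=[3.0, 3.0], scale=0.1, size=(30, 2))
outlier = np.array([[10.0, -10.0]])
X_demo = np.vstack([blob_a, blob_b, outlier])

min_samples = 5
# Distance from each point to its min_samples-th nearest neighbour
# (the query point itself counts as the first neighbour here).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
k_distances = np.sort(nn.kneighbors(X_demo)[0][:, -1])
# Plotting k_distances and looking for the sharp bend suggests an eps value.

labels_demo = DBSCAN(eps=0.5, min_samples=min_samples).fit_predict(X_demo)
n_clusters = len(set(labels_demo)) - (1 if -1 in labels_demo else 0)
n_noise = int(np.sum(labels_demo == -1))
print(n_clusters, n_noise)
```

If DBSCAN labels most of the real data as noise, eps is probably too small. If it returns a single large cluster, eps is probably too large.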

Step 6: Implementing K-Means Clustering and Visualizing Results

Now that we have determined the optimal number of clusters using the Elbow Method and Silhouette Score, let’s proceed with implementing K-Means clustering and visualizing the results.


6.1 Applying K-Means Clustering

We will train the K-Means algorithm using the optimal K value determined in the previous step.

# Applying K-Means with the optimal K value (assuming K=4 based on analysis)
optimal_k = 4  # Replace this with the actual optimal K from the Elbow/Silhouette analysis

# StandardScaler returns a NumPy array, so wrap it in a DataFrame before adding a column
df_scaled = pd.DataFrame(df_scaled, columns=features, index=df_selected.index)

kmeans = KMeans(n_clusters=optimal_k, init='k-means++', random_state=42)
df_scaled['Cluster'] = kmeans.fit_predict(df_scaled[features])

# Displaying cluster distribution
df_scaled['Cluster'].value_counts()

🔹 Explanation:

  • We fit the K-Means model on the standardized dataset.
  • The new column ‘Cluster’ is added to indicate which cluster each customer belongs to.
  • The value_counts() method shows how many customers fall into each cluster.
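
The cluster centers that K-Means learns are expressed in standardized units, which are hard to read. The sketch below maps them back to the original units with the scaler's inverse_transform. It runs on synthetic customers drawn around two made-up profiles, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

feature_names = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

# Synthetic customers drawn around two made-up profiles (not real data).
rng = np.random.default_rng(0)
young_spenders = rng.normal([25, 30, 80], [2, 3, 4], size=(40, 3))
older_savers = rng.normal([55, 90, 20], [2, 3, 4], size=(40, 3))
demo = pd.DataFrame(np.vstack([young_spenders, older_savers]), columns=feature_names)

demo_scaler = StandardScaler()
demo_scaled = demo_scaler.fit_transform(demo)

demo_kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42)
demo_labels = demo_kmeans.fit_predict(demo_scaled)

# Centroids live in standardized space; inverse_transform maps them back
# to years, k$ and score points so they can be read directly.
centers = pd.DataFrame(
    demo_scaler.inverse_transform(demo_kmeans.cluster_centers_),
    columns=feature_names,
).round(1)
print(centers)
```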

6.2 Visualizing Clusters Using PCA (2D Plot)

Since our dataset is multidimensional, we will use Principal Component Analysis (PCA) to reduce dimensions to 2D for visualization.

from sklearn.decomposition import PCA
import seaborn as sns

# Reducing dimensions to 2D using PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df_scaled.drop(columns=['Cluster']))

# Creating a DataFrame for plotting
df_pca = pd.DataFrame(pca_result, columns=['PC1', 'PC2'])
df_pca['Cluster'] = df_scaled['Cluster'].values  # .values avoids index misalignment if rows were dropped earlier

# Plotting the clusters
plt.figure(figsize=(10,6))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', palette='viridis', data=df_pca, s=100, alpha=0.7)
plt.title('Customer Segmentation Using K-Means')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

🔹 Explanation:

  • We project the three standardized features onto two principal components.
  • We then scatter plot the clusters with different colors.
  • This visualization helps us understand the grouping of customer segments.
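
A 2D PCA plot is only as trustworthy as the share of variance the two components keep. The sketch below checks this with explained_variance_ratio_ on a synthetic three-feature matrix in which one feature is nearly redundant. The data is invented, so the ratio for the real customer data will differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic three-feature matrix standing in for the scaled customer data.
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))
third = 0.9 * base[:, 0] + 0.1 * rng.normal(size=200)  # nearly redundant feature
X_pca_demo = StandardScaler().fit_transform(np.column_stack([base, third]))

pca_demo = PCA(n_components=2).fit(X_pca_demo)
retained = float(pca_demo.explained_variance_ratio_.sum())

# A low value here would mean the 2D plot hides much of the structure.
print(pca_demo.explained_variance_ratio_.round(3), round(retained, 3))
```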

6.3 Analyzing Customer Segments

Let’s analyze the characteristics of each cluster.

# Calculating mean values of features for each cluster
cluster_summary = df_selected.groupby(df_scaled['Cluster']).mean()  # means in original units (years, k$, score)
cluster_summary

🔹 Interpretation:

  • This table helps understand the purchasing behavior of each segment.
  • We can rename clusters based on common traits (e.g., “High Spenders”, “Budget Shoppers”, “Occasional Buyers”, etc.).
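
Naming can also be done with a small rule instead of by hand. Here is a sketch that labels each cluster from its mean income and spending score. The summary table, the helper label_segment, and both cut-offs are hypothetical placeholders, not results from the dataset.

```python
import pandas as pd

# Hypothetical per-cluster means (illustrative numbers, not real results).
summary = pd.DataFrame(
    {"Annual Income (k$)": [88.0, 26.0, 87.0, 55.0],
     "Spending Score (1-100)": [82.0, 20.0, 17.0, 50.0]},
    index=pd.Index([0, 1, 2, 3], name="Cluster"),
)

def label_segment(row, income_cut=60.0, spend_cut=50.0):
    """Name a segment from its mean income and spending score.

    The cut-offs are arbitrary placeholders; choose them from your own data.
    """
    income = "High income" if row["Annual Income (k$)"] >= income_cut else "Lower income"
    spend = "high spending" if row["Spending Score (1-100)"] >= spend_cut else "low spending"
    return f"{income}, {spend}"

summary["Segment"] = summary.apply(label_segment, axis=1)
print(summary["Segment"].to_dict())
```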

Step 7: Business Insights and Recommendations

Now that we have successfully performed customer segmentation using K-Means clustering, it’s time to interpret the results and derive business insights. This step will help stakeholders make data-driven decisions based on customer groups.


7.1 Interpreting Customer Segments

Let’s analyze the characteristics of each cluster using descriptive statistics.

# Grouping data by clusters and calculating mean values of each feature
cluster_summary = df_selected.groupby(df_scaled['Cluster']).mean()  # means in original units (years, k$, score)
cluster_summary

🔹 Explanation:

  • This table provides average values of features for each customer segment.
  • It helps in identifying unique patterns among different clusters.

Let’s look at what each cluster might represent. The labels below are illustrative; the actual profile of each cluster depends on the values in cluster_summary.

Cluster   | Characteristics
Cluster 0 | High spenders, frequent transactions, loyal customers
Cluster 1 | Budget-conscious customers with low spending patterns
Cluster 2 | Seasonal buyers, occasional big purchases
Cluster 3 | New or inactive customers with low engagement

7.2 Business Recommendations

Based on the customer segmentation, businesses can take strategic actions:

  1. Personalized Marketing Campaigns
    • Offer loyalty rewards to Cluster 0 (high spenders) to retain them.
    • Provide discounts and promotions to Cluster 1 (budget shoppers) to encourage spending.
    • Target seasonal buyers (Cluster 2) with time-sensitive deals.
  2. Improving Customer Retention
    • Send personalized emails and offers to low-engagement customers (Cluster 3) to re-engage them.
    • Introduce customer feedback surveys to understand their needs.
  3. Product Recommendations
    • Use cluster-based analysis to provide customized product recommendations.
    • High spenders may prefer premium products, while budget shoppers may seek value-for-money items.
  4. Inventory and Pricing Strategy
    • Maintain higher stock levels of products preferred by frequent shoppers.
    • Offer tiered pricing strategies based on customer preferences.

7.3 Limitations and Future Enhancements

Limitations

  • K-Means assumes clusters are roughly spherical and similar in size; density-based algorithms like DBSCAN may perform better on irregularly shaped clusters.
  • Data preprocessing choices (e.g., scaling, feature selection) impact results.

🚀 Future Enhancements

  • Experiment with Hierarchical Clustering or DBSCAN for comparison.
  • Integrate customer feedback data for better segmentation.
  • Apply predictive analytics to forecast customer lifetime value.
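
As a starting point for the first enhancement, here is a sketch that runs Hierarchical (agglomerative) Clustering next to K-Means and measures how much the two partitions agree. The three synthetic blobs stand in for the scaled features, and Ward linkage is one reasonable choice among several.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for the scaled features: three separated blobs.
rng = np.random.default_rng(7)
centers_demo = np.array([[0.0, 0.0, 0.0], [4.0, 4.0, 0.0], [0.0, 4.0, 4.0]])
X_blobs = np.vstack([rng.normal(c, 0.3, size=(50, 3)) for c in centers_demo])

# Ward linkage merges the pair of clusters that least increases variance.
agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_blobs)
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_blobs)

# An adjusted Rand index of 1.0 means both methods produced the same partition.
agreement = adjusted_rand_score(kmeans_labels, agglo_labels)
print(round(float(agreement), 3), round(float(silhouette_score(X_blobs, agglo_labels)), 3))
```

On real customer data the two methods will usually disagree on some points. Looking at which customers move between clusters can show where the segment boundaries are weakest.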

Conclusion

Project Summary:

  • We cleaned and preprocessed customer data.
  • We performed EDA and feature selection to improve model performance.
  • We implemented K-Means clustering and identified customer segments.
  • We provided business recommendations for targeted marketing, pricing, and retention strategies.

📌 Final Thought:
Customer segmentation is a powerful tool for businesses to enhance marketing strategies and optimize operations. Unsupervised learning helps discover hidden patterns, ultimately leading to better customer satisfaction and increased revenue.


Next Steps for Implementation

  • Deploy the model in a dashboard for real-time customer insights.
  • Integrate segmentation results into a CRM system.
  • Use A/B testing to measure the effectiveness of targeted marketing campaigns.
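
For the dashboard and CRM steps, new customers need to be assigned to an existing segment with exactly the same scaling as the training data. One way to do this, sketched below under that assumption, is to bundle the scaler and the model in a scikit-learn Pipeline. The training data and the two incoming records are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data: columns are Age, Annual Income (k$), Spending Score.
rng = np.random.default_rng(3)
group_low = rng.normal([30, 25, 20], [3, 3, 3], size=(50, 3))
group_high = rng.normal([40, 90, 85], [3, 3, 3], size=(50, 3))
X_train = np.vstack([group_low, group_high])

# Bundling scaler and model keeps new records on the same scale as training.
segmenter = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
segmenter.fit(X_train)

# Two hypothetical incoming CRM records.
new_customers = np.array([[29, 27, 18], [41, 88, 90]])
assigned = segmenter.predict(new_customers)
print(assigned)
```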

🚀 This concludes our Customer Segmentation project! Would you like to implement another technique or explore a different dataset next? 😊 Please let me know in the comment section.
