Chocolate Sales Analysis and Prediction Using ML

KANGKAN KALITA

Understanding chocolate sales trends is crucial for businesses to maximize profits and optimize marketing strategies. In this chocolate sales analysis and prediction project, we leverage Exploratory Data Analysis (EDA) and Machine Learning (ML) to uncover key patterns and forecast future sales. By analyzing factors like product type, price, and regional demand, we provide data-driven insights to enhance decision-making. Using Python, Pandas, Scikit-learn, and visualization tools, we develop predictive models to improve sales strategies and business performance.

Analyzing chocolate sales is essential for understanding market trends, customer behavior, and seasonal demand. This project focuses on using Exploratory Data Analysis (EDA) and Machine Learning to uncover insights and predict future sales trends. By analyzing sales data, we aim to identify patterns, key drivers of sales, and improve business decisions.

Objective

  • Perform EDA to understand sales trends and patterns.
  • Create visualizations to explore relationships between different factors like product type, price, and region.
  • Develop machine learning models to predict future sales.

Dataset Overview

We will use the Chocolate Sales Dataset from Kaggle, which includes information on:

  • Date: Date of the sale.
  • Product: Type of chocolate product sold.
  • Price: Sale price of the product.
  • Quantity: Quantity of the product sold.
  • Region: Sales region.

Tools and Libraries

  • Python – Programming language.
  • Pandas – Data manipulation and analysis.
  • NumPy – Numerical computing.
  • Matplotlib and Seaborn – Data visualization.
  • Scikit-learn – Machine learning.

Step 1: Importing Libraries and Loading the Dataset

First, we need to import the necessary Python libraries and load the dataset.

# Import necessary libraries  
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns  
from sklearn.model_selection import train_test_split  
from sklearn.linear_model import LinearRegression  
from sklearn.ensemble import RandomForestRegressor  
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score  

# Load the dataset
file_path = "/path/to/chocolate_sales.csv"  # Replace with the actual file path
data = pd.read_csv(file_path)

# Display the first few rows
data.head()

Explanation:

  • Pandas and NumPy – For data handling and numerical computations.
  • Matplotlib and Seaborn – For data visualization.
  • Scikit-learn – For splitting the dataset and building machine learning models.
  • pd.read_csv() – Loads the dataset into a DataFrame (named data, which the rest of the code refers to).
  • data.head() – Displays the first five rows of the dataset to check its structure and content.

Step 2: Data Cleaning and Exploration

1. Check for Missing Values

# Check for missing values
data.isnull().sum()

Explanation:

  • isnull().sum() – Returns the number of missing values in each column.
  • This helps identify if any data imputation or removal is necessary.

2. Handle Missing Values

# Fill missing values with the median (for numerical columns)
data.fillna(data.median(numeric_only=True), inplace=True)

# Fill missing values with the mode (for categorical columns)
for col in data.select_dtypes(include='object').columns:
    data[col] = data[col].fillna(data[col].mode()[0])

Explanation:

  • fillna(data.median(numeric_only=True)) – Fills missing numerical values with the column median.
  • mode()[0] – Fills missing categorical values with the most frequent value.

3. Check for Duplicates

# Drop duplicate rows if any
data.drop_duplicates(inplace=True)

Explanation:

  • drop_duplicates() – Removes duplicate rows to avoid skewed results.

4. Check Data Types

# Check data types
data.dtypes

Explanation:

  • Ensures that numerical columns are in the correct format for analysis and model building.

5. Convert Data Types (if needed)

# Example: Converting date column to datetime
data['date'] = pd.to_datetime(data['date'])

Explanation:

  • pd.to_datetime() – Converts date columns into a datetime format for time-based analysis.
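
Note that the dataset overview above lists Date, Product, Price, Quantity, and Region, while the analysis below refers to lowercase column names and a sales column. Below is a minimal preparation sketch, assuming the headers only need lowercasing and that sales can be derived as price × quantity; adjust it to the actual headers in your file.

# Normalize column names to lowercase so they match the snippets below
# (an assumption; adjust to the actual headers in your file)
data.columns = data.columns.str.lower()

# If price was read as text (e.g. "$5,320.00"), strip currency symbols first
if data['price'].dtype == 'object':
    data['price'] = pd.to_numeric(data['price'].str.replace('[$,]', '', regex=True))

# Derive a sales (revenue) column if the file does not already provide one
if 'sales' not in data.columns:
    data['sales'] = data['price'] * data['quantity']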

Step 3: Exploratory Data Analysis (EDA)

1. Overview of the Dataset

# Display basic information
data.info()

Explanation:

  • info() – Displays the number of non-null values, data types, and memory usage.
  • Helps understand the structure of the dataset.

2. Statistical Summary

# Display summary statistics
data.describe()

Explanation:

  • describe() – Provides summary statistics like mean, median, standard deviation, min, and max for numerical columns.
  • Useful for identifying data distribution and outliers.

3. Distribution of Sales

# Distribution of sales
sns.histplot(data['sales'], kde=True, color='blue', bins=30)
plt.title('Sales Distribution')
plt.show()

Explanation:

  • histplot() – Plots a histogram of sales values to check for normality or skewness.
  • kde=True – Adds a Kernel Density Estimate (smooth curve) over the histogram.

4. Correlation Heatmap

# Correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

Explanation:

  • corr(numeric_only=True) – Calculates pairwise correlations between the numerical columns.
  • heatmap() – Visualizes the correlation, helping identify multicollinearity.

5. Sales Trend Over Time

# Plot sales over time
plt.figure(figsize=(12, 6))
sns.lineplot(x='date', y='sales', data=data)
plt.title('Sales Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()

Explanation:

  • lineplot() – Displays changes in sales over time, helping spot seasonal trends or patterns.
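
If the daily line looks noisy, aggregating to monthly totals can make the seasonal pattern easier to read. A small sketch, assuming the date column has already been converted to datetime in Step 2:

# Aggregate sales by month for a smoother trend line
monthly_sales = data.set_index('date')['sales'].resample('M').sum()

plt.figure(figsize=(12, 6))
monthly_sales.plot()
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.show()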

6. Sales by Region

# Sales by region
plt.figure(figsize=(10, 5))
sns.barplot(x='region', y='sales', data=data, palette='viridis')
plt.title('Sales by Region')
plt.xticks(rotation=45)
plt.show()

Explanation:

  • barplot() – Displays mean sales values by region, helping identify which regions have higher sales.

7. Product Category Contribution

# Sales by product category
plt.figure(figsize=(10, 5))
sns.boxplot(x='product_category', y='sales', data=data)
plt.title('Sales by Product Category')
plt.xticks(rotation=45)
plt.show()

Explanation:

  • boxplot() – Shows the distribution of sales for different product categories.
  • Helps detect outliers and category-based differences.

Step 4: Feature Engineering

1. Handling Missing Values

# Check for missing values
data.isnull().sum()

Explanation:

  • isnull().sum() – Identifies columns with missing values.
  • Handling options:
    • Drop rows – If missing values are few and data loss is minimal.
    • Impute values – Use mean, median, or mode to fill gaps.

Example:

# Imputing missing values with mean
data['sales'].fillna(data['sales'].mean(), inplace=True)
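
Alternatively, when only a handful of rows are affected, dropping them is often simpler. A one-line sketch, assuming sales is the target column:

# Drop rows where the target is missing instead of imputing
data.dropna(subset=['sales'], inplace=True)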

2. Handling Categorical Variables
Convert categorical features into numerical format using One-Hot Encoding or Label Encoding.

Example:

# One-Hot Encoding for 'region'
data = pd.get_dummies(data, columns=['region'], drop_first=True)

Explanation:

  • get_dummies() – Converts categorical values into binary columns.
  • drop_first=True – Prevents multicollinearity by removing the first binary column.
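
For comparison, Label Encoding assigns an integer code to each category. A minimal sketch using scikit-learn's LabelEncoder on the product_category column used in the EDA boxplot above (best suited to tree-based models, since the integer codes imply an order):

from sklearn.preprocessing import LabelEncoder

# Encode each product category as an integer
le = LabelEncoder()
data['product_category_encoded'] = le.fit_transform(data['product_category'])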

3. Creating New Features
Example: Create new features based on date components:

# Extracting date features (the 'date' column was already converted to datetime in Step 2)
data['month'] = data['date'].dt.month
data['day_of_week'] = data['date'].dt.dayofweek
data['year'] = data['date'].dt.year

Explanation:

  • dt.month, dt.dayofweek, and dt.year – Create new features from date for better pattern recognition.
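
Before modeling, the feature matrix passed to scikit-learn must be fully numeric. A hedged clean-up sketch, assuming the raw date column is no longer needed once the month/day/year features exist and that any leftover text columns (such as product) should be one-hot encoded:

# Drop the raw datetime column now that month/day_of_week/year are extracted
data = data.drop('date', axis=1)

# One-hot encode any remaining categorical (object) columns
cat_cols = data.select_dtypes(include='object').columns
data = pd.get_dummies(data, columns=list(cat_cols), drop_first=True)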

4. Scaling Numerical Features
Scale numerical features to bring them to a similar scale using StandardScaler or MinMaxScaler.

Example:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[['sales']] = scaler.fit_transform(data[['sales']])

Explanation:

  • StandardScaler() – Transforms data to have zero mean and unit variance.
  • Helps models like SVM and KNN work more efficiently.
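
MinMaxScaler is the other common choice; it rescales values to the [0, 1] range. A sketch scaling the price and quantity feature columns (an assumption about column names; scaling the target itself is optional and mainly matters for distance-based models):

from sklearn.preprocessing import MinMaxScaler

# Rescale numeric feature columns to the [0, 1] range
minmax = MinMaxScaler()
data[['price', 'quantity']] = minmax.fit_transform(data[['price', 'quantity']])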

Step 5: Model Building

1. Splitting the Data
We split the dataset into training and testing sets to evaluate model performance.

from sklearn.model_selection import train_test_split

# Define independent variables (X) and target variable (y)
X = data.drop('sales', axis=1)
y = data['sales']

# Split into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Explanation:

  • train_test_split() – Splits the data into training and testing sets.
  • test_size=0.2 – Allocates 20% of data for testing.
  • random_state=42 – Ensures reproducibility.

2. Building a Linear Regression Model
Linear Regression is a simple yet effective baseline model for sales prediction.

from sklearn.linear_model import LinearRegression

# Initialize the model
lr = LinearRegression()

# Train the model
lr.fit(X_train, y_train)

# Predict on the test set
y_pred_lr = lr.predict(X_test)

Explanation:

  • LinearRegression() – Fits a linear function to the data.
  • fit() – Trains the model on training data.
  • predict() – Predicts on unseen data.

3. Evaluating the Linear Regression Model
Evaluate the model using common regression metrics.

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("Linear Regression Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred_lr))
print("MSE:", mean_squared_error(y_test, y_pred_lr))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lr)))
print("R2 Score:", r2_score(y_test, y_pred_lr))

Explanation:

  • mean_absolute_error – Average absolute error.
  • mean_squared_error – Average squared error.
  • np.sqrt() – Square root of MSE gives RMSE (Root Mean Squared Error).
  • r2_score – Measures how well the model fits the data (value close to 1 is ideal).
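
Since the same four metrics are reported for every model below, it can help to wrap them in a small helper. A sketch (the evaluate_model name is just a suggestion, not part of the original code):

def evaluate_model(name, y_true, y_pred):
    """Print MAE, MSE, RMSE, and R2 for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    print(f"{name} Performance:")
    print("MAE:", mean_absolute_error(y_true, y_pred))
    print("MSE:", mse)
    print("RMSE:", np.sqrt(mse))
    print("R2 Score:", r2_score(y_true, y_pred))

# Example usage
evaluate_model("Linear Regression", y_test, y_pred_lr)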

4. Building a Decision Tree Regressor
Next, let’s try a Decision Tree model for better flexibility with complex patterns.

from sklearn.tree import DecisionTreeRegressor

# Initialize and train the model
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)

# Predict on the test set
y_pred_dt = dt.predict(X_test)

Explanation:

  • DecisionTreeRegressor() – Builds a tree structure for regression.
  • fit() – Trains the tree model.
  • predict() – Predicts on the test data.

5. Evaluating the Decision Tree Model
Evaluate Decision Tree performance using the same metrics.

print("Decision Tree Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred_dt))
print("MSE:", mean_squared_error(y_test, y_pred_dt))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_dt)))
print("R2 Score:", r2_score(y_test, y_pred_dt))

6. Building a Random Forest Regressor
Random Forest is an ensemble model that often improves accuracy by reducing variance.

from sklearn.ensemble import RandomForestRegressor

# Initialize and train the model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf.predict(X_test)

Explanation:

  • RandomForestRegressor() – Builds multiple decision trees and averages predictions.
  • n_estimators=100 – Builds 100 trees for improved accuracy.
  • fit() – Trains the model.
  • predict() – Predicts on new data.

7. Evaluating the Random Forest Model

print("Random Forest Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred_rf))
print("MSE:", mean_squared_error(y_test, y_pred_rf))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))
print("R2 Score:", r2_score(y_test, y_pred_rf))

Step 6: Model Comparison and Hyperparameter Tuning

1. Comparing Model Performance
Let’s compare the performance of Linear Regression, Decision Tree, and Random Forest using a performance table:

comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest'],
    'MAE': [mean_absolute_error(y_test, y_pred_lr), 
            mean_absolute_error(y_test, y_pred_dt), 
            mean_absolute_error(y_test, y_pred_rf)],
    'MSE': [mean_squared_error(y_test, y_pred_lr), 
            mean_squared_error(y_test, y_pred_dt), 
            mean_squared_error(y_test, y_pred_rf)],
    'RMSE': [np.sqrt(mean_squared_error(y_test, y_pred_lr)),
             np.sqrt(mean_squared_error(y_test, y_pred_dt)),
             np.sqrt(mean_squared_error(y_test, y_pred_rf))],
    'R2 Score': [r2_score(y_test, y_pred_lr), 
                 r2_score(y_test, y_pred_dt), 
                 r2_score(y_test, y_pred_rf)]
})
comparison

Explanation:

  • pd.DataFrame() – Creates a structured table with model performance metrics.
  • This table helps visualize which model performs best across different metrics.
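
As a quick follow-up, sorting the table makes the ranking obvious at a glance:

# Rank models by RMSE (lower is better)
comparison.sort_values('RMSE')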

2. Hyperparameter Tuning for Random Forest
Let’s improve the Random Forest model using Grid Search to find the best combination of hyperparameters:

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=42),
                           param_grid=param_grid,
                           cv=5,
                           scoring='neg_mean_squared_error',
                           n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

Explanation:

  • GridSearchCV() – Performs exhaustive search over specified hyperparameter values.
  • cv=5 – 5-fold cross-validation ensures stability of results.
  • scoring='neg_mean_squared_error' – Scores with negative MSE so that higher scores correspond to lower error.
  • n_jobs=-1 – Uses all available CPU cores for faster computation.

3. Training Random Forest with Best Parameters
After finding the best parameters, let’s retrain the Random Forest model:

best_rf = RandomForestRegressor(n_estimators=grid_search.best_params_['n_estimators'],
                                max_depth=grid_search.best_params_['max_depth'],
                                min_samples_split=grid_search.best_params_['min_samples_split'],
                                min_samples_leaf=grid_search.best_params_['min_samples_leaf'],
                                random_state=42)

best_rf.fit(X_train, y_train)
y_pred_best_rf = best_rf.predict(X_test)

Explanation:

  • best_params_[] – Uses the best hyperparameter values from Grid Search.
  • fit() – Trains the optimized model.
  • predict() – Makes predictions using the tuned model.
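
As a shortcut, GridSearchCV keeps a copy of the best model that has already been refit on the full training set (refit=True is the default), so the retraining step above can also be replaced with best_estimator_:

# Equivalent shortcut: the best model, already refit on the training data
best_rf = grid_search.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)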

4. Evaluating the Optimized Model
Evaluate the performance of the optimized model:

print("Optimized Random Forest Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred_best_rf))
print("MSE:", mean_squared_error(y_test, y_pred_best_rf))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_best_rf)))
print("R2 Score:", r2_score(y_test, y_pred_best_rf))

Next Steps

Now that we have built and optimized the model, here are some potential next steps to improve the project further:

  • Try more advanced models – Explore Gradient Boosting, XGBoost, or LightGBM for potentially higher accuracy (a short Gradient Boosting sketch follows after this list).
  • Feature Engineering – Create new meaningful features based on domain knowledge and insights from data.
  • Model Stacking – Combine multiple models to create an ensemble for better generalization.
  • Deploy the Model – Create a Flask or Streamlit web app to allow users to interact with the model and make predictions in real-time.
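
As a starting point for the first bullet, here is a minimal sketch using scikit-learn's GradientBoostingRegressor on the same train/test split; the hyperparameters are illustrative defaults, not tuned values:

from sklearn.ensemble import GradientBoostingRegressor

# Train a gradient boosting model on the same split used earlier
gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)

print("Gradient Boosting Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred_gb))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_gb)))
print("R2 Score:", r2_score(y_test, y_pred_gb))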

Summary

This project demonstrated how to perform end-to-end chocolate sales prediction using machine learning. We started with data collection, explored and visualized the dataset, performed feature engineering, and built and optimized multiple machine learning models. By comparing models and tuning hyperparameters, we identified the best-performing model. This project provides a solid foundation for handling sales prediction problems using Python and machine learning. Keep experimenting with new models and features to enhance predictive performance! 🚀

This blog explores chocolate sales analysis and prediction using EDA and machine learning. We analyze sales data, identify patterns, and develop predictive models to forecast future sales trends. By leveraging Python, Pandas, and Scikit-learn, we uncover insights into factors influencing chocolate sales, such as pricing, seasonality, and regional demand. Whether you’re a business owner or a data science enthusiast, this project helps in making data-driven decisions to optimize sales performance.

If you’re ready to start another project or explore deeper insights, let me know! 😎
