Chocolate Sales Analysis and Prediction Using ML
Chocolate Sales Analysis and Prediction
Understanding chocolate sales trends is crucial for businesses to maximize profits and optimize marketing strategies. In this chocolate sales analysis and prediction project, we leverage Exploratory Data Analysis (EDA) and Machine Learning (ML) to uncover key patterns and forecast future sales. By analyzing factors like product type, price, and regional demand, we provide data-driven insights to enhance decision-making. Using Python, Pandas, Scikit-learn, and visualization tools, we develop predictive models to improve sales strategies and business performance.

Analyzing chocolate sales is essential for understanding market trends, customer behavior, and seasonal demand. This project focuses on using Exploratory Data Analysis (EDA) and Machine Learning to uncover insights and predict future sales trends. By analyzing sales data, we aim to identify patterns, key drivers of sales, and improve business decisions.
Objective
- Perform EDA to understand sales trends and patterns.
- Create visualizations to explore relationships between different factors like product type, price, and region.
- Develop machine learning models to predict future sales.
Dataset Overview
We will use the Chocolate Sales Dataset from Kaggle, which includes information on:
- Date: Date of the sale.
- Product: Type of chocolate product sold.
- Price: Sale price of the product.
- Quantity: Quantity of the product sold.
- Region: Sales region.
Tools and Libraries
- Python – Programming language.
- Pandas – Data manipulation and analysis.
- NumPy – Numerical computing.
- Matplotlib and Seaborn – Data visualization.
- Scikit-learn – Machine learning.
Step 1: Importing Libraries and Loading the Dataset
First, we need to import the necessary Python libraries and load the dataset.
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the dataset
file_path = "/path/to/chocolate_sales.csv"  # Replace with the actual file path
data = pd.read_csv(file_path)

# Display the first few rows
data.head()
Explanation:
- Pandas and NumPy – For data handling and numerical computations.
- Matplotlib and Seaborn – For data visualization.
- Scikit-learn – For splitting the dataset and building machine learning models.
- pd.read_csv() – Loads the dataset into a DataFrame.
- data.head() – Displays the first five rows of the dataset to check its structure and content.
Step 2: Data Cleaning and Exploration
1. Check for Missing Values
# Check for missing values
data.isnull().sum()
Explanation:
- isnull().sum() – Returns the number of missing values in each column.
- This helps identify if any data imputation or removal is necessary.
2. Handle Missing Values
# Fill missing values with the median (for numerical columns)
data.fillna(data.median(numeric_only=True), inplace=True)

# Fill missing values with the mode (for categorical columns)
for col in data.select_dtypes(include='object').columns:
    data[col] = data[col].fillna(data[col].mode()[0])
Explanation:
- fillna() with data.median(numeric_only=True) – Fills missing numerical values with the median.
- mode()[0] – Fills missing categorical values with the most frequent value.
3. Check for Duplicates
# Drop duplicate rows if any
data.drop_duplicates(inplace=True)
Explanation:
- drop_duplicates() – Removes duplicate rows to avoid skewed results.
4. Check Data Types
# Check data types
data.dtypes
Explanation:
- Ensures that numerical columns are in the correct format for analysis and model building.
5. Convert Data Types (if needed)
# Example: Converting date column to datetime
data['date'] = pd.to_datetime(data['date'])
Explanation:
- pd.to_datetime() – Converts date columns into a datetime format for time-based analysis.
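Note: the EDA and modeling code below assumes a numeric sales column. If your copy of the dataset only provides price and quantity (as in the dataset overview), here is a minimal sketch for deriving it; the lowercase column names 'price' and 'quantity' are assumptions and may differ in the actual file:
# Derive total sales per row from price and quantity (assumed column names)
data['sales'] = data['price'] * data['quantity']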
Step 3: Exploratory Data Analysis (EDA)
1. Overview of the Dataset
# Display basic information
data.info()
Explanation:
- info() – Displays the number of non-null values, data types, and memory usage.
- Helps understand the structure of the dataset.
2. Statistical Summary
# Display summary statistics
data.describe()
Explanation:
- describe() – Provides summary statistics like mean, median, standard deviation, min, and max for numerical columns.
- Useful for identifying data distribution and outliers.
3. Distribution of Sales
# Distribution of sales
sns.histplot(data['sales'], kde=True, color='blue', bins=30)
plt.title('Sales Distribution')
plt.show()
Explanation:
- histplot() – Plots a histogram of sales values to check for normality or skewness.
- kde=True – Adds a Kernel Density Estimate (smooth curve) over the histogram.
4. Correlation Heatmap
# Correlation matrix (numeric columns only)
plt.figure(figsize=(8, 6))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
Explanation:
- corr() – Calculates correlation between numerical variables.
- heatmap() – Visualizes the correlation, helping identify multicollinearity.
5. Sales Trend Over Time
# Plot sales over time
plt.figure(figsize=(12, 6))
sns.lineplot(x='date', y='sales', data=data)
plt.title('Sales Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()
Explanation:
- lineplot() – Displays changes in sales over time, helping spot seasonal trends or patterns.
6. Sales by Region
# Sales by region
plt.figure(figsize=(10, 5))
sns.barplot(x='region', y='sales', data=data, palette='viridis')
plt.title('Sales by Region')
plt.xticks(rotation=45)
plt.show()
Explanation:
- barplot() – Displays mean sales values by region, helping identify which regions have higher sales.
7. Product Category Contribution
# Sales by product category
plt.figure(figsize=(10, 5))
sns.boxplot(x='product_category', y='sales', data=data)
plt.title('Sales by Product Category')
plt.xticks(rotation=45)
plt.show()
Explanation:
- boxplot() – Shows the distribution of sales for different product categories.
- Helps detect outliers and category-based differences.
Step 4: Feature Engineering
1. Handling Missing Values
# Check for missing values
data.isnull().sum()
Explanation:
- isnull().sum() – Identifies columns with missing values.
- Handling options:
- Drop rows – If missing values are few and data loss is minimal.
- Impute values – Use mean, median, or mode to fill gaps.
Example:
# Imputing missing values with the mean
data['sales'] = data['sales'].fillna(data['sales'].mean())
2. Handling Categorical Variables
Convert categorical features into numerical format using One-Hot Encoding or Label Encoding.
Example:
# One-Hot Encoding for 'region'
data = pd.get_dummies(data, columns=['region'], drop_first=True)
Explanation:
- get_dummies() – Converts categorical values into binary columns.
- drop_first=True – Prevents multicollinearity by removing the first binary column.
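The example above covers One-Hot Encoding. For the Label Encoding option mentioned earlier, here is a minimal sketch (the 'product' column name is an assumption; label codes imply an ordering, so they generally suit tree-based models better than linear ones):
from sklearn.preprocessing import LabelEncoder

# Label Encoding for the product column (assumed name), stored as a new numeric column
le = LabelEncoder()
data['product_encoded'] = le.fit_transform(data['product'])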
3. Creating New Features
Example: Create new features based on date components:
# Extracting date features
data['month'] = pd.to_datetime(data['date']).dt.month
data['day_of_week'] = pd.to_datetime(data['date']).dt.dayofweek
data['year'] = pd.to_datetime(data['date']).dt.year
Explanation:
- dt.month, dt.dayofweek, and dt.year – Create new features from the date for better pattern recognition.
4. Scaling Numerical Features
Scale numerical features to bring them to a similar scale using StandardScaler or MinMaxScaler.
Example:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[['sales']] = scaler.fit_transform(data[['sales']])
Explanation:
- StandardScaler() – Transforms data to have zero mean and unit variance.
- Helps models like SVM and KNN work more efficiently.
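MinMaxScaler is the other option mentioned above. Here is a minimal sketch that rescales a numeric column to the [0, 1] range (the 'price' column name is an assumption, and scaling is usually applied to feature columns rather than the target):
from sklearn.preprocessing import MinMaxScaler

# Rescale a numeric feature to the [0, 1] range
minmax = MinMaxScaler()
data[['price']] = minmax.fit_transform(data[['price']])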
Step 5: Model Building
1. Splitting the Data
We split the dataset into training and testing sets to evaluate model performance.
from sklearn.model_selection import train_test_split

# Define independent variables (X) and target variable (y)
# Keep only numeric and dummy columns as features so the models below can train on them
X = data.drop('sales', axis=1).select_dtypes(include=['number', 'bool'])
y = data['sales']

# Split into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Explanation:
- train_test_split() – Splits the data into training and testing sets.
- test_size=0.2 – Allocates 20% of the data for testing.
- random_state=42 – Ensures reproducibility.
2. Building a Linear Regression Model
Linear Regression is a simple yet effective baseline model for sales prediction.
from sklearn.linear_model import LinearRegression

# Initialize the model
lr = LinearRegression()

# Train the model
lr.fit(X_train, y_train)

# Predict on the test set
y_pred_lr = lr.predict(X_test)
Explanation:
- LinearRegression() – Fits a linear function to the data.
- fit() – Trains the model on the training data.
- predict() – Predicts on unseen data.
3. Evaluating the Linear Regression Model
Evaluate the model using common regression metrics.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("Linear Regression Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred_lr))
print("MSE:", mean_squared_error(y_test, y_pred_lr))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lr)))
print("R2 Score:", r2_score(y_test, y_pred_lr))
Explanation:
- mean_absolute_error – Average absolute error.
- mean_squared_error – Average squared error.
- np.sqrt() – Square root of MSE gives RMSE (Root Mean Squared Error).
- r2_score – Measures how well the model fits the data (a value close to 1 is ideal).
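To make these definitions concrete, here is a small sketch that recomputes the same metrics by hand with NumPy for the Linear Regression predictions (the values should match the sklearn output above):
# Manual computation of the regression metrics
errors = y_test - y_pred_lr
print("MAE :", np.mean(np.abs(errors)))            # average absolute error
print("MSE :", np.mean(errors ** 2))               # average squared error
print("RMSE:", np.sqrt(np.mean(errors ** 2)))      # square root of MSE
print("R2  :", 1 - np.sum(errors ** 2) / np.sum((y_test - y_test.mean()) ** 2))  # 1 - residual SS / total SS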
4. Building a Decision Tree Regressor
Next, let’s try a Decision Tree model for better flexibility with complex patterns.
from sklearn.tree import DecisionTreeRegressor

# Initialize and train the model
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)

# Predict on the test set
y_pred_dt = dt.predict(X_test)
Explanation:
- DecisionTreeRegressor() – Builds a tree structure for regression.
- fit() – Trains the tree model.
- predict() – Predicts on the test data.
5. Evaluating the Decision Tree Model
Evaluate Decision Tree performance using the same metrics.
print("Decision Tree Performance:") print("MAE:", mean_absolute_error(y_test, y_pred_dt)) print("MSE:", mean_squared_error(y_test, y_pred_dt)) print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_dt))) print("R2 Score:", r2_score(y_test, y_pred_dt))
6. Building a Random Forest Regressor
Random Forest is an ensemble model that often improves accuracy by reducing variance.
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf.predict(X_test)
Explanation:
- RandomForestRegressor() – Builds multiple decision trees and averages their predictions.
- n_estimators=100 – Builds 100 trees for improved accuracy.
- fit() – Trains the model.
- predict() – Predicts on new data.
7. Evaluating the Random Forest Model
print("Random Forest Performance:") print("MAE:", mean_absolute_error(y_test, y_pred_rf)) print("MSE:", mean_squared_error(y_test, y_pred_rf)) print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf))) print("R2 Score:", r2_score(y_test, y_pred_rf))
Step 6: Model Comparison and Hyperparameter Tuning
1. Comparing Model Performance
Let’s compare the performance of Linear Regression, Decision Tree, and Random Forest using a performance table:
comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest'],
    'MAE': [mean_absolute_error(y_test, y_pred_lr),
            mean_absolute_error(y_test, y_pred_dt),
            mean_absolute_error(y_test, y_pred_rf)],
    'MSE': [mean_squared_error(y_test, y_pred_lr),
            mean_squared_error(y_test, y_pred_dt),
            mean_squared_error(y_test, y_pred_rf)],
    'RMSE': [np.sqrt(mean_squared_error(y_test, y_pred_lr)),
             np.sqrt(mean_squared_error(y_test, y_pred_dt)),
             np.sqrt(mean_squared_error(y_test, y_pred_rf))],
    'R2 Score': [r2_score(y_test, y_pred_lr),
                 r2_score(y_test, y_pred_dt),
                 r2_score(y_test, y_pred_rf)]
})
comparison
Explanation:
- pd.DataFrame() – Creates a structured table with model performance metrics.
- This table helps visualize which model performs best across different metrics.
2. Hyperparameter Tuning for Random Forest
Let’s improve the Random Forest model using Grid Search to find the best combination of hyperparameters:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=42),
                           param_grid=param_grid,
                           cv=5,
                           scoring='neg_mean_squared_error',
                           n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)
Explanation:
- GridSearchCV() – Performs an exhaustive search over the specified hyperparameter values.
- cv=5 – 5-fold cross-validation ensures stability of results.
- scoring='neg_mean_squared_error' – Maximizes negative MSE to minimize error.
- n_jobs=-1 – Uses all available CPU cores for faster computation.
3. Training Random Forest with Best Parameters
After finding the best parameters, let’s retrain the Random Forest model:
best_rf = RandomForestRegressor(n_estimators=grid_search.best_params_['n_estimators'],
                                max_depth=grid_search.best_params_['max_depth'],
                                min_samples_split=grid_search.best_params_['min_samples_split'],
                                min_samples_leaf=grid_search.best_params_['min_samples_leaf'],
                                random_state=42)
best_rf.fit(X_train, y_train)
y_pred_best_rf = best_rf.predict(X_test)
Explanation:
- best_params_ – Uses the best hyperparameter values from Grid Search.
- fit() – Trains the optimized model.
- predict() – Makes predictions using the tuned model.
4. Evaluating the Optimized Model
Evaluate the performance of the optimized model:
print("Optimized Random Forest Performance:") print("MAE:", mean_absolute_error(y_test, y_pred_best_rf)) print("MSE:", mean_squared_error(y_test, y_pred_best_rf)) print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_best_rf))) print("R2 Score:", r2_score(y_test, y_pred_best_rf))
Next Steps
Now that we have built and optimized the model, here are some potential next steps to improve the project further:
- Try more advanced models – Explore Gradient Boosting, XGBoost, or LightGBM for potentially higher accuracy (see the short sketch after this list).
- Feature Engineering – Create new meaningful features based on domain knowledge and insights from data.
- Model Stacking – Combine multiple models to create an ensemble for better generalization.
- Deploy the Model – Create a Flask or Streamlit web app to allow users to interact with the model and make predictions in real-time.
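As a starting point for the first item, here is a minimal sketch using scikit-learn's built-in GradientBoostingRegressor on the train/test split defined earlier (the hyperparameters shown are illustrative, not tuned values):
from sklearn.ensemble import GradientBoostingRegressor

# Train a gradient boosting model on the existing split
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)
gbr.fit(X_train, y_train)
y_pred_gbr = gbr.predict(X_test)

print("Gradient Boosting Performance:")
print("MAE:", mean_absolute_error(y_test, y_pred_gbr))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_gbr)))
print("R2 Score:", r2_score(y_test, y_pred_gbr))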
Summary
This project demonstrated how to perform end-to-end chocolate sales prediction using machine learning. We started with data collection, explored and visualized the dataset, performed feature engineering, and built and optimized multiple machine learning models. By comparing models and tuning hyperparameters, we identified the best-performing model. This project provides a solid foundation for handling sales prediction problems using Python and machine learning. Keep experimenting with new models and features to enhance predictive performance! 🚀
This blog explores chocolate sales analysis and prediction using EDA and machine learning. We analyze sales data, identify patterns, and develop predictive models to forecast future sales trends. By leveraging Python, Pandas, and Scikit-learn, we uncover insights into factors influencing chocolate sales, such as pricing, seasonality, and regional demand. Whether you’re a business owner or a data science enthusiast, this project helps in making data-driven decisions to optimize sales performance.