Easy Data Science Project on Recommendation Systems: Santander Product Recommendation System

Recommendation systems are at the core of modern data-driven services, helping businesses improve the customer experience by predicting what each customer is likely to want. In this project, we develop a Santander Product Recommendation System that predicts which products Santander Bank customers are likely to use in the upcoming month, based on their past behavior and that of similar customers. We treat this as a classification problem and compare several machine learning models on it.
Objective:
- Perform EDA and feature engineering on the Santander dataset.
- Visualize insights using Python libraries like Seaborn.
- Build and evaluate multiple classification models, including Logistic Regression, XGBoost, Gradient Boosting, Random Forest, Extra Tree Classifier, and Neural Networks.
Dataset:
The Santander dataset can be downloaded from Kaggle. It contains customer details, product usage history, and other relevant information required for this project.
Tools & Libraries:
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
- XGBoost
- TensorFlow/Keras
- Jupyter Notebook or Google Colab (Recommended)
Instructions:
- Use Jupyter Notebook or Google Colab for step-by-step implementation.
- The dataset contains customer-product interactions. Clean and preprocess it before applying machine learning models.
Let's start our Easy Data Science Project on Recommendation Systems.
1. Data Collection & Setup
Import Libraries and Load Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

# Load dataset
data = pd.read_csv('/path/to/santander_dataset.csv')  # Replace with actual file path
data.head()
Explanation:
- Essential libraries for data manipulation, visualization, and modeling are imported.
- The dataset is loaded and previewed to understand its structure.
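The Kaggle file is large, so a first look at its structure can be taken from a small sample instead of the full load. The sketch below illustrates the idea with the `nrows` argument of `pd.read_csv`. The in-memory CSV and its column names (`customer_id`, `age`, `segment`, `product_column`) are invented stand-ins, not the real Santander schema.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file; these column names are invented.
csv_text = """customer_id,age,segment,product_column
1,34,A,0
2,51,B,1
3,,A,0
4,29,C,1
5,42,B,0
"""

# nrows limits how many rows are parsed, which keeps a first look cheap.
sample = pd.read_csv(io.StringIO(csv_text), nrows=3)

print(sample.shape)   # rows and columns in the sample
print(sample.dtypes)  # inferred type of each column
print(sample.head())
```

With the real dataset, the same call would take the file path instead of the `io.StringIO` object.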
2. Exploratory Data Analysis (EDA)
Dataset Overview:
# Dataset info and statistics
data.info()
# print() is needed because a notebook cell only displays its last expression
print(data.describe())
print(data.isnull().sum())
Target Variable Distribution:
# Visualize the target variable
target = 'product_column'  # Replace with actual target column
sns.countplot(x=target, data=data, palette='viridis')
plt.title('Target Variable Distribution')
plt.show()
Correlations:
# Correlation heatmap (numeric columns only; text columns cannot be correlated)
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()
Explanation:
- Missing values, data types, and distributions are analyzed.
- Heatmaps identify correlations between features and the target variable.
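When there are many columns, a heatmap can be hard to read, so the two checks described above can also be summarized as ranked tables. The following is a minimal sketch on synthetic data. The columns `age`, `income`, and `segment` are hypothetical, and `product_column` is the placeholder target name used in this tutorial.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; all column names are assumptions for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),
    "income": rng.normal(50_000, 15_000, size=n),
    "segment": rng.choice(["A", "B", "C"], size=n),
})
df["product_column"] = (df["income"] > 50_000).astype(int)
df.loc[df.index[:20], "age"] = np.nan  # inject some missing values

# Share of missing values per column, worst first.
missing_share = df.isnull().mean().sort_values(ascending=False)

# Correlation of each numeric feature with the target, strongest first.
corr_with_target = (
    df.corr(numeric_only=True)["product_column"]
    .drop("product_column")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(missing_share)
print(corr_with_target)
```

Pearson correlation against a binary target is only a rough screening signal, so treat the ranking as a starting point for feature selection, not a final answer.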
3. Data Preprocessing & Feature Engineering
Handle Missing Values:
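# Fill missing numeric values with each column's median
data.fillna(data.median(numeric_only=True), inplace=True)
# Fill the remaining (categorical) gaps with each column's mode
data.fillna(data.mode().iloc[0], inplace=True)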
Feature Transformation:
# Encode categorical variables
encoder = LabelEncoder()
categorical_cols = ['category1', 'category2']  # Replace with actual categorical columns
for col in categorical_cols:
    data[col] = encoder.fit_transform(data[col])

# Scale numerical features
scaler = StandardScaler()
numerical_cols = ['num_col1', 'num_col2']  # Replace with actual numerical columns
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
Train-Test Split:
# Split dataset into features and target
X = data.drop(columns=[target])
y = data[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Explanation:
- Missing values are handled, categorical features are encoded, and numerical features are scaled for consistency.
- The dataset is split into training and testing sets.
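The steps above fit the imputation and scaling on the full dataset before splitting, which lets information from the test rows leak into training. One possible alternative, sketched below on synthetic data, is to split first and wrap the preprocessing in a scikit-learn `ColumnTransformer` that is fitted on the training set only. Note that this sketch swaps `LabelEncoder` for `OneHotEncoder`, which is the encoder scikit-learn intends for input features. The column names reuse the placeholders from this tutorial and the data is randomly generated.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data using the tutorial's placeholder column names.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "num_col1": rng.normal(size=n),
    "num_col2": rng.normal(loc=10, scale=3, size=n),
    "category1": rng.choice(["x", "y", "z"], size=n),
    "product_column": rng.integers(0, 2, size=n),
})
df.loc[df.index[:15], "num_col1"] = np.nan  # inject missing values

X = df.drop(columns=["product_column"])
y = df["product_column"]

# Split first, so preprocessing statistics come from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["num_col1", "num_col2"]),
        ("cat", categorical, ["category1"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_prepared = preprocess.fit_transform(X_train)  # fit on train only
X_test_prepared = preprocess.transform(X_test)        # reuse train statistics

print(X_train_prepared.shape, X_test_prepared.shape)
```

The prepared arrays can be passed to any of the classifiers in the next section in place of `X_train` and `X_test`.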
4. Model Building & Evaluation
Logistic Regression
from sklearn.linear_model import LogisticRegression

# A higher max_iter than the default (100) helps the solver converge
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))
print(classification_report(y_test, y_pred_gb))
XGBoost Classifier
from xgboost import XGBClassifier

xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))
Neural Network (MLPClassifier)
from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=300)
mlp_model.fit(X_train, y_train)
y_pred_mlp = mlp_model.predict(X_test)
print("Neural Network Accuracy:", accuracy_score(y_test, y_pred_mlp))
print(classification_report(y_test, y_pred_mlp))
Explanation:
- Multiple models are trained and evaluated using accuracy and classification reports.
- Performance comparison helps select the best model.
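The fit, predict, and score pattern repeated above can also be written as a single loop, which makes it easy to add further models such as the Extra Trees classifier listed in the objectives. The sketch below is one way to do that. It runs on a synthetic, imbalanced dataset from `make_classification` (an assumption, not the Santander data) and also reports macro-averaged F1, which is less forgiving than accuracy when one class dominates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the prepared Santander features.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

# Train each model and collect the same metrics for all of them.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, "
          f"macro_f1={scores['macro_f1']:.3f}")
```

Other estimators that follow the scikit-learn interface can be added to the `models` dictionary without changing the loop.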
5. Visualization of Model Performance
Compare Model Accuracies:
model_accuracies = {
    "Logistic Regression": accuracy_score(y_test, y_pred_logreg),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
    "Gradient Boosting": accuracy_score(y_test, y_pred_gb),
    "XGBoost": accuracy_score(y_test, y_pred_xgb),
    "Neural Network": accuracy_score(y_test, y_pred_mlp),
}

plt.figure(figsize=(10, 6))
plt.bar(list(model_accuracies.keys()), list(model_accuracies.values()), color='skyblue')
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.show()
Explanation:
- A bar plot compares the accuracy of different models, making it easier to select the best-performing one.
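Accuracy values are easier to interpret next to a reference point. If most customers do not hold a given product, a model that always predicts the majority class already scores high. The sketch below computes that floor with scikit-learn's `DummyClassifier`, assuming a synthetic dataset with roughly a 90/10 class split instead of the real data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with a dominant class, standing in for the real target.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Always predicts the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Adding the baseline as one more bar in the comparison chart shows how much each model actually improves on guessing the majority class.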
6. Conclusion
This Santander Product Recommendation System Project demonstrates how to preprocess data, perform feature engineering, and build robust machine learning models. Among the models tested, [insert best-performing model here] performed the best, achieving an accuracy of [insert accuracy]. This project highlights the importance of EDA, feature transformation, and model selection in developing recommendation systems.
Download the Santander dataset and experiment with other machine learning models or hyperparameter tuning to further improve performance. Extend this project by deploying the model using Flask or FastAPI!
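As a starting point for the hyperparameter tuning mentioned above, here is a minimal sketch using scikit-learn's `GridSearchCV` with a Random Forest. The parameter grid is a small illustrative choice and the data is synthetic; a useful grid for the Santander data would need to be chosen by experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative grid only; these values are assumptions, not recommendations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

# Evaluate every combination with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds the model refitted on all the supplied data with the best parameters, ready to be evaluated on the held-out test set.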