Amazon Product Review Sentiment Analysis Using Machine Learning

Sentiment analysis is an application of Natural Language Processing (NLP) that helps businesses understand customer opinions and feedback. This project applies machine learning to Amazon product reviews and classifies each review as positive, negative, or neutral. We will cover the entire process: data collection, preprocessing, visualization, feature engineering, model building, and evaluation.
Objective:
- Perform sentiment analysis on Amazon product reviews using machine learning models.
- Preprocess and clean the text data for effective analysis.
- Visualize trends in customer reviews and derive actionable insights.
- Build and evaluate classification models for sentiment prediction.
Dataset:
The dataset can be downloaded from Kaggle. It contains Amazon product reviews, including text reviews, ratings, and other metadata.
Tools & Libraries:
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- NLTK
- Scikit-learn
- XGBoost
- WordCloud
- TensorFlow/Keras (optional for deep learning models)
- Jupyter Notebook or Google Colab (Recommended)
1. Data Collection & Setup
Import Libraries and Load Dataset
    # Import necessary libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

    # Download the NLTK resources needed for stopword removal and tokenization
    # (newer NLTK releases may also ask for 'punkt_tab')
    nltk.download('stopwords')
    nltk.download('punkt')

    # Load the dataset
    data = pd.read_csv('/path/to/amazon_reviews.csv')  # Replace with actual file path

    # Preview the dataset
    data.head()
Explanation:
- The dataset is loaded using pandas and previewed to understand its structure. The columns of interest typically include Review, Rating, and other metadata.
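Because column names differ between Kaggle exports, it can help to check for the expected columns right after loading. The following is a minimal sketch under stated assumptions: `load_reviews`, `REQUIRED_COLUMNS`, and the inline sample CSV are hypothetical names and data used only for illustration, not part of the original dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV, so the sketch runs without a download
sample_csv = io.StringIO(
    "Review,Rating,ProductId\n"
    "\"Great product, works well\",5,B001\n"
    "Stopped working after a week,1,B002\n"
    ",3,B003\n"
)

# Columns the rest of the tutorial relies on
REQUIRED_COLUMNS = ['Review', 'Rating']


def load_reviews(source, required=REQUIRED_COLUMNS):
    """Read a reviews CSV and fail early if an expected column is missing."""
    frame = pd.read_csv(source)
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame


reviews = load_reviews(sample_csv)
print(reviews.shape)                     # rows and columns that were loaded
print(reviews['Review'].isnull().sum())  # number of reviews with no text
```

If your file uses different names (for example a text column with another label), renaming the columns to Review and Rating once at load time keeps the later steps unchanged.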
2. Exploratory Data Analysis (EDA)
Dataset Overview
    # Display dataset information
    data.info()

    # print() is needed because a notebook cell only displays its last expression
    print(data.describe())
    print(data.isnull().sum())
Distribution of Ratings
    # Visualize the distribution of ratings
    sns.countplot(x='Rating', data=data, palette='viridis')
    plt.title('Distribution of Ratings')
    plt.xlabel('Rating')
    plt.ylabel('Count')
    plt.show()
Review Length Analysis
    # Add a column for review length (in words).
    # Missing reviews are treated as empty strings so that NaN values do not raise an error.
    data['Review_Length'] = data['Review'].fillna('').astype(str).str.split().str.len()

    # Plot review length distribution
    sns.histplot(data['Review_Length'], kde=True, color='blue')
    plt.title('Distribution of Review Lengths')
    plt.xlabel('Number of Words')
    plt.ylabel('Frequency')
    plt.show()
Explanation:
- Missing values, data types, and target distribution are analyzed to prepare the data for further processing. Review length analysis provides insights into how detailed customer reviews are.
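The same checks can be summarised numerically, which is often easier to compare than plots. This is a small sketch on a hypothetical toy DataFrame (the rows in `toy` are invented for illustration) showing the share of each rating and the median review length per rating.

```python
import pandas as pd

# Hypothetical toy data standing in for the real reviews
toy = pd.DataFrame({
    'Review': ['Great product, works well', 'Terrible', None,
               'Okay for the price', 'Love it'],
    'Rating': [5, 1, 4, 3, 5],
})

# Word count per review; missing reviews count as zero words
toy['Review_Length'] = toy['Review'].fillna('').str.split().str.len()

# Fraction of reviews at each rating, ordered by rating
rating_share = toy['Rating'].value_counts(normalize=True).sort_index()

# Median number of words per rating
length_by_rating = toy.groupby('Rating')['Review_Length'].median()

print(rating_share)
print(length_by_rating)
```

A heavily skewed `rating_share` is worth noting at this stage, because it carries over directly to the sentiment labels built in the next section.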
3. Data Preprocessing & Feature Engineering
Handle Missing Values
    # Fill missing reviews with a placeholder
    data['Review'] = data['Review'].fillna('No review')
Sentiment Labels
    # Create a sentiment label based on ratings
    def assign_sentiment(rating):
        if rating >= 4:
            return 'Positive'
        elif rating == 3:
            return 'Neutral'
        else:
            return 'Negative'

    data['Sentiment'] = data['Rating'].apply(assign_sentiment)
Text Cleaning
    import re

    stop_words = set(stopwords.words('english'))

    # Function to clean text
    def clean_text(text):
        text = re.sub(r'[^a-zA-Z]', ' ', text)  # Remove non-alphabet characters
        text = text.lower()                     # Convert to lowercase
        tokens = word_tokenize(text)            # Tokenize
        tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
        return ' '.join(tokens)

    data['Cleaned_Review'] = data['Review'].apply(clean_text)
    data.head()
Feature Extraction with TF-IDF
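    # Vectorize text using TF-IDF.
    # The result is kept as a sparse matrix; calling .toarray() on a large
    # review set can exhaust memory, and the models below accept sparse input.
    tfidf = TfidfVectorizer(max_features=5000)
    X = tfidf.fit_transform(data['Cleaned_Review'])

    # Encode sentiment labels with an explicit mapping so that the codes
    # match the Negative / Neutral / Positive order used in the confusion matrices.
    # (factorize() would number the labels by order of first appearance instead.)
    label_map = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
    y = data['Sentiment'].map(label_map).to_numpy()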
Train-Test Split
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
Explanation:
- Reviews are cleaned by removing non-alphabetic characters and stopwords, and by converting text to lowercase. TF-IDF converts the cleaned text into numerical features for model training. Sentiment labels are derived from ratings: 4 and above is Positive, 3 is Neutral, and anything lower is Negative.
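One point worth knowing about the steps above: fitting TF-IDF on the full dataset before splitting lets vocabulary and IDF statistics from the test reviews influence training. A common alternative, sketched here as an assumption rather than the original author's method, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vectorizer is fitted on training data only. The toy `texts` and `labels` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical cleaned reviews and labels, four per class
texts = [
    'great product works well', 'love quality fast delivery',
    'excellent value recommend', 'perfect fit happy purchase',
    'okay price nothing special', 'average quality works fine',
    'decent product expected', 'fine small issues',
    'terrible broke one week', 'waste money poor quality',
    'stopped working returned', 'bad experience never buy',
]
labels = ['Positive'] * 4 + ['Neutral'] * 4 + ['Negative'] * 4

# Split the raw text first; stratify keeps the class proportions in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipeline.fit(train_texts, train_labels)

# At prediction time the already-fitted vectorizer is only applied, never refitted
predictions = pipeline.predict(test_texts)
print(list(zip(test_texts, predictions)))
```

The fitted pipeline can also be applied directly to new review text with `pipeline.predict([...])`, since it carries its own vectorizer.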
4. Model Building & Evaluation
Logistic Regression
    from sklearn.linear_model import LogisticRegression

    # Train logistic regression model
    # (max_iter is raised from the default of 100 to reduce convergence warnings)
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X_train, y_train)

    # Make predictions
    y_pred_logreg = logreg.predict(X_test)

    # Evaluate the model
    print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
    print(classification_report(y_test, y_pred_logreg))

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred_logreg)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Negative', 'Neutral', 'Positive'],
                yticklabels=['Negative', 'Neutral', 'Positive'])
    plt.title('Confusion Matrix - Logistic Regression')
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()
Random Forest Classifier
    from sklearn.ensemble import RandomForestClassifier

    # Train Random Forest model
    rf_model = RandomForestClassifier()
    rf_model.fit(X_train, y_train)

    # Make predictions
    y_pred_rf = rf_model.predict(X_test)

    # Evaluate the model
    print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
    print(classification_report(y_test, y_pred_rf))

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred_rf)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Negative', 'Neutral', 'Positive'],
                yticklabels=['Negative', 'Neutral', 'Positive'])
    plt.title('Confusion Matrix - Random Forest')
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()
XGBoost Classifier
    from xgboost import XGBClassifier

    # Train XGBoost model
    xgb_model = XGBClassifier()
    xgb_model.fit(X_train, y_train)

    # Make predictions
    y_pred_xgb = xgb_model.predict(X_test)

    # Evaluate the model
    print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
    print(classification_report(y_test, y_pred_xgb))

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred_xgb)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Negative', 'Neutral', 'Positive'],
                yticklabels=['Negative', 'Neutral', 'Positive'])
    plt.title('Confusion Matrix - XGBoost')
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()
Explanation:
- Multiple classification models are trained and evaluated to find the best-performing one. Logistic Regression provides a baseline, while Random Forest and XGBoost aim to improve performance. Confusion matrices are visualized for better insight into model performance.
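When comparing the models, accuracy alone can be misleading if one sentiment class dominates. The sketch below uses hypothetical labels (encoded here as 0 = Negative, 1 = Neutral, 2 = Positive, an assumption for the example) to show how a model that always predicts the majority class can score well on accuracy yet poorly on macro-averaged F1, which weights every class equally.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical test labels: eight Positive, one Neutral, one Negative
y_true = [2, 2, 2, 2, 2, 2, 2, 2, 1, 0]

# A degenerate "model" that always predicts the majority class (Positive)
y_pred_majority = [2] * len(y_true)

accuracy = accuracy_score(y_true, y_pred_majority)

# Macro F1 averages the per-class F1 scores; zero_division=0 scores
# classes that are never predicted as 0 instead of raising a warning
macro_f1 = f1_score(y_true, y_pred_majority, average='macro', zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"Macro F1: {macro_f1:.2f}")
```

Reporting macro F1 next to accuracy for Logistic Regression, Random Forest, and XGBoost gives a fairer picture of how each handles the Neutral and Negative classes.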
5. Data Visualization
Sentiment Distribution
    # Plot sentiment distribution
    sns.countplot(x='Sentiment', data=data, palette='coolwarm')
    plt.title('Sentiment Distribution')
    plt.xlabel('Sentiment')
    plt.ylabel('Count')
    plt.show()
Most Common Words in Positive Reviews
    from wordcloud import WordCloud

    # Generate word cloud for positive reviews
    positive_reviews = ' '.join(data[data['Sentiment'] == 'Positive']['Cleaned_Review'])
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(positive_reviews)

    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title('Most Common Words in Positive Reviews')
    plt.axis('off')
    plt.show()
Explanation:
- Sentiment distribution provides an overview of customer sentiment. Word clouds visualize common words in positive reviews, giving insights into customer satisfaction.
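A word cloud shows relative prominence but not exact counts. As a complementary sketch, the standard library's `collections.Counter` can list the most frequent words directly. The `cleaned_positive_reviews` list is hypothetical and stands in for the Cleaned_Review values of positive reviews.

```python
from collections import Counter

# Hypothetical cleaned positive reviews (lowercased, stopwords removed)
cleaned_positive_reviews = [
    'great product works well',
    'great value fast delivery',
    'works great love product',
]

# Count every whitespace-separated word across all reviews
word_counts = Counter(
    word
    for review in cleaned_positive_reviews
    for word in review.split()
)

# The three most frequent words with their counts
top_words = word_counts.most_common(3)
print(top_words)
```

Running the same count on negative reviews and comparing the two lists is a quick way to see which words distinguish the classes.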
6. Conclusion
This project demonstrates the end-to-end process of performing sentiment analysis on Amazon product reviews. From preprocessing reviews to building classification models, we covered the key steps needed to extract insights from review text. To identify the best-performing model on your dataset, compare the accuracy and the per-class precision, recall, and F1 scores reported for Logistic Regression, Random Forest, and XGBoost. Businesses can use this analysis to better understand customer feedback and improve services.
Download the dataset and try experimenting with deep learning models like LSTMs or transformers to further improve sentiment classification. Share your findings and insights!