Amazon Product Review Sentiment Analysis Using Machine Learning

KANGKAN KALITA

Sentiment analysis is a valuable application of Natural Language Processing (NLP) that helps businesses understand customer opinions and feedback. This project focuses on Amazon Product Review Sentiment Analysis Using Machine Learning to classify reviews as positive, negative, or neutral. We will cover the entire process, including data collection, preprocessing, visualization, feature engineering, model building, and evaluation.

Objective:

  • Perform sentiment analysis on Amazon product reviews using machine learning models.
  • Preprocess and clean the text data for effective analysis.
  • Visualize trends in customer reviews and derive actionable insights.
  • Build and evaluate classification models for sentiment prediction.

Dataset:
The dataset can be downloaded from Kaggle. It contains Amazon product reviews, including text reviews, ratings, and other metadata.

Tools & Libraries:

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • NLTK
  • Scikit-learn
  • XGBoost
  • WordCloud
  • TensorFlow/Keras (optional for deep learning models)
  • Jupyter Notebook or Google Colab (Recommended)

1. Data Collection & Setup

Import Libraries and Load Dataset

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
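import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Download the NLTK resources used later (stopword list and tokenizer models)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases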

# Load the dataset
data = pd.read_csv('/path/to/amazon_reviews.csv')  # Replace with actual file path

# Preview the dataset
data.head()

Explanation:

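  • The dataset is loaded using pandas and previewed to understand its structure. The code in this walkthrough assumes the review text is in a column named Review and the star rating in a column named Rating; rename your columns if the file uses different names.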
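
Column names vary between dataset exports, so it can help to confirm that the expected columns are present before going further. The sketch below uses only the standard library and an in-memory sample; the helper name missing_columns and the sample rows are illustrative assumptions, not part of the dataset.

```python
import csv
import io

REQUIRED_COLUMNS = ("Review", "Rating")  # column names assumed by this walkthrough


def missing_columns(header, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from a CSV header."""
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical two-row sample standing in for the real CSV file
sample_csv = io.StringIO(
    "Review,Rating,Product\n"
    "Great sound quality,5,Headphones\n"
    "Stopped working after a week,1,Charger\n"
)
header = next(csv.reader(sample_csv))
print(missing_columns(header))
print(missing_columns(["reviewText", "overall"]))
```

If the returned list is not empty, rename the columns (for example with data.rename(columns={...})) before running the later steps.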

2. Exploratory Data Analysis (EDA)

Dataset Overview

# Display dataset information
data.info()
print(data.describe())

# Count missing values per column
print(data.isnull().sum())

Distribution of Ratings

# Visualize the distribution of ratings
sns.countplot(x='Rating', data=data, palette='viridis')
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

Review Length Analysis

# Add a column for review length (missing reviews count as zero words)
data['Review_Length'] = data['Review'].fillna('').astype(str).apply(lambda x: len(x.split()))

# Plot review length distribution
sns.histplot(data['Review_Length'], kde=True, color='blue')
plt.title('Distribution of Review Lengths')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()

Explanation:

  • Missing values, data types, and target distribution are analyzed to prepare the data for further processing. Review length analysis provides insights into how detailed customer reviews are.
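
To make the review length idea concrete, here is a minimal standard-library sketch that counts words per review and summarizes them. The sample reviews are made up for illustration, and word_count is a hypothetical helper that treats missing values as zero-length reviews.

```python
from statistics import mean, median


def word_count(review):
    """Count whitespace-separated words; missing or non-text reviews count as 0."""
    if not isinstance(review, str):
        return 0
    return len(review.split())


# Hypothetical reviews, including a missing one (None)
reviews = [
    "Works as described",
    "Battery died after two days and support never replied",
    None,
    "Great",
]
lengths = [word_count(r) for r in reviews]
print(lengths)
print("mean:", mean(lengths), "median:", median(lengths))
```

A large gap between the mean and the median suggests that a few very long reviews dominate the distribution, which is also what a long right tail in the histogram would show.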

3. Data Preprocessing & Feature Engineering

Handle Missing Values

# Fill missing reviews with a placeholder
data['Review'] = data['Review'].fillna('No review')

Sentiment Labels

# Create a sentiment label based on ratings
def assign_sentiment(rating):
    if rating >= 4:
        return 'Positive'
    elif rating == 3:
        return 'Neutral'
    else:
        return 'Negative'

data['Sentiment'] = data['Rating'].apply(assign_sentiment)
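
As a quick check of the rule above, the following sketch applies the same function to a handful of made-up ratings and counts the resulting labels. The ratings list is hypothetical and only illustrates the mapping.

```python
from collections import Counter


def assign_sentiment(rating):
    """Same rule as above: 4-5 stars positive, 3 neutral, 1-2 negative."""
    if rating >= 4:
        return 'Positive'
    elif rating == 3:
        return 'Neutral'
    else:
        return 'Negative'


# Hypothetical star ratings, not taken from the dataset
ratings = [5, 4, 3, 1, 5, 2, 4]
labels = [assign_sentiment(r) for r in ratings]
counts = Counter(labels)
print(labels)
print(counts)
```

Counting the labels this way is a simple way to spot class imbalance before training.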

Text Cleaning

import re
stop_words = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Remove non-alphabet characters
    text = text.lower()  # Convert to lowercase
    text = word_tokenize(text)  # Tokenize
    text = [word for word in text if word not in stop_words]  # Remove stopwords
    return ' '.join(text)

data['Cleaned_Review'] = data['Review'].apply(clean_text)
data.head()
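
To see what the cleaning step does to a single review, here is a simplified sketch that runs without NLTK. It assumes a tiny hand-picked stopword set (SAMPLE_STOPWORDS) and splits on whitespace instead of calling word_tokenize, so its output may differ slightly from the function above.

```python
import re

# A tiny illustrative stopword set; NLTK's English list is much longer
SAMPLE_STOPWORDS = {"the", "is", "a", "and", "it", "this", "i", "was"}


def clean_text_simple(text, stop_words=SAMPLE_STOPWORDS):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    tokens = text.lower().split()
    return ' '.join(t for t in tokens if t not in stop_words)


raw = "This charger is GREAT!!! Bought 2, and it was worth the $$$."
print(clean_text_simple(raw))
print(clean_text_simple("don't"))
```

Two side effects are worth keeping in mind: digits are removed along with punctuation, and contractions such as "don't" are split into "don" and "t". NLTK's English stopword list also contains negations such as "not", so it may be worth checking whether removing them weakens the sentiment signal.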

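Train-Test Split

# Encode sentiment labels with a fixed mapping so that 0/1/2 always mean
# Negative/Neutral/Positive (factorize() numbers labels in order of appearance)
label_map = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
y = data['Sentiment'].map(label_map)

# Split the cleaned text into training and testing sets, keeping class proportions
text_train, text_test, y_train, y_test = train_test_split(
    data['Cleaned_Review'], y, test_size=0.3, random_state=42, stratify=y)

Feature Extraction with TF-IDF

# Fit TF-IDF on the training text only, then apply it to the test text.
# The result stays a sparse matrix, which all three models below accept.
tfidf = TfidfVectorizer(max_features=5000)
X_train = tfidf.fit_transform(text_train)
X_test = tfidf.transform(text_test)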

Explanation:

  • Reviews are cleaned by replacing non-alphabetic characters (punctuation and digits) with spaces, converting text to lowercase, and removing stopwords. TF-IDF converts the cleaned text into numerical features for model training. Sentiment labels are derived from ratings.
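
To show how TF-IDF turns text into numbers, here is a small hand-rolled sketch. It uses the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 and L2 normalization, which are the documented defaults of scikit-learn's TfidfVectorizer; the function name tfidf_vectors and the three sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute L2-normalized TF-IDF vectors with smoothed IDF for tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors


# Hypothetical cleaned reviews, already tokenized
docs = [
    ["great", "battery"],
    ["bad", "battery"],
    ["great", "great", "sound"],
]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
for vec in vectors:
    print([round(v, 3) for v in vec])
```

In the second document, "bad" receives a higher weight than "battery" because it appears in fewer documents. Rare words are treated as more informative than common ones.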

4. Model Building & Evaluation

Logistic Regression

from sklearn.linear_model import LogisticRegression

# Train logistic regression model (a higher max_iter helps the solver converge on TF-IDF features)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Make predictions
y_pred_logreg = logreg.predict(X_test)

# Evaluate the model
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_logreg)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Neutral', 'Positive'], yticklabels=['Negative', 'Neutral', 'Positive'])
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

# Train Random Forest model (fixed random_state for reproducible results)
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Neutral', 'Positive'], yticklabels=['Negative', 'Neutral', 'Positive'])
plt.title('Confusion Matrix - Random Forest')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

XGBoost Classifier

from xgboost import XGBClassifier

# Train XGBoost model
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_xgb)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Neutral', 'Positive'], yticklabels=['Negative', 'Neutral', 'Positive'])
plt.title('Confusion Matrix - XGBoost')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

Explanation:

  • Multiple classification models are trained and evaluated to find the best-performing one. Logistic Regression provides a baseline, while Random Forest and XGBoost aim to improve performance. Confusion matrices are visualized for better insight into model performance.
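
To read a confusion matrix beyond the overall accuracy, the sketch below derives per-class precision and recall from the raw counts. The matrix values are made up for illustration and are not results from this project; per_class_metrics and accuracy are hypothetical helper names.

```python
def per_class_metrics(cm, labels):
    """Derive precision and recall per class from a confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j,
    the same layout that sklearn.metrics.confusion_matrix returns.
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = cm[k][k]
        predicted_k = sum(row[k] for row in cm)  # column sum
        actual_k = sum(cm[k])                    # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        metrics[label] = (precision, recall)
    return metrics


def accuracy(cm):
    """Fraction of samples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


# Made-up counts for illustration only, not results from this project
cm = [
    [8, 1, 1],   # true Negative
    [2, 3, 5],   # true Neutral
    [0, 2, 18],  # true Positive
]
labels = ['Negative', 'Neutral', 'Positive']
print("accuracy:", accuracy(cm))
for label, (p, r) in per_class_metrics(cm, labels).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

In this made-up example the overall accuracy looks reasonable, yet most Neutral reviews are misclassified. This is why the classification report and confusion matrix are worth checking alongside the accuracy score.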

5. Data Visualization

Sentiment Distribution

# Plot sentiment distribution
sns.countplot(x='Sentiment', data=data, palette='coolwarm')
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

Most Common Words in Positive Reviews

from wordcloud import WordCloud

# Generate word cloud for positive reviews
positive_reviews = ' '.join(data[data['Sentiment'] == 'Positive']['Cleaned_Review'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(positive_reviews)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Most Common Words in Positive Reviews')
plt.axis('off')
plt.show()

Explanation:

  • Sentiment distribution provides an overview of customer sentiment. Word clouds visualize common words in positive reviews, giving insights into customer satisfaction.
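
A word cloud is a picture of word frequencies, so the same information can be listed as plain counts. Here is a minimal sketch using collections.Counter; the sample reviews are hypothetical and top_words is an illustrative helper name.

```python
from collections import Counter


def top_words(cleaned_reviews, n=3):
    """Return the n most frequent words across already-cleaned review strings."""
    counts = Counter()
    for review in cleaned_reviews:
        counts.update(review.split())
    return counts.most_common(n)


# Hypothetical cleaned positive reviews
positive_samples = [
    "great battery life",
    "great sound great price",
    "battery lasts long",
]
print(top_words(positive_samples, n=2))
```

In the notebook, the same helper could be applied to the real data, for example top_words(data[data['Sentiment'] == 'Positive']['Cleaned_Review'], n=20).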

6. Conclusion

This Amazon Product Review Sentiment Analysis Using Machine Learning project demonstrates the end-to-end process of performing sentiment analysis. From preprocessing reviews to building classification models, we covered the key steps for extracting insights from review text. To identify the best-performing model, compare the accuracy scores and classification reports printed for Logistic Regression, Random Forest, and XGBoost on the test set. Businesses can use this analysis to better understand customer feedback and improve services.


Download the dataset and try experimenting with deep learning models like LSTMs or transformers to further improve sentiment classification. Share your findings and insights!

Keywords: Amazon product review sentiment analysis, Sentiment analysis using machine learning, Customer feedback analysis, Text classification, Python project.
