Spam Email Detection Using Machine Learning

KANGKAN KALITA

In today’s digital world, spam emails pose a significant challenge, flooding inboxes with unwanted and potentially harmful messages. Traditional spam filters rely on manually defined rules, but they often fail to adapt to new spam patterns. This is where Spam Email Detection Using Machine Learning comes into play.

By leveraging machine learning techniques, we can develop an intelligent spam classification model that automatically identifies and filters out spam emails with high accuracy. In this project, we build an end-to-end spam detection system, covering data preprocessing, exploratory data analysis (EDA), feature extraction, model training, evaluation, and deployment. We explore multiple machine learning algorithms, including Logistic Regression, Naïve Bayes, SVM, Random Forest, and XGBoost, to determine the most effective model for spam classification.

Through this project, we aim to enhance email security and user experience by building an efficient spam detection system that minimizes false positives and ensures only genuine emails reach users’ inboxes.

Spam emails are unwanted and often fraudulent messages that clutter inboxes and pose security threats. Detecting spam manually is inefficient, so machine learning algorithms help automate spam detection by analyzing email content and identifying spam patterns. In this project, we will build a spam detection model using machine learning techniques, covering data preprocessing, exploratory data analysis, feature engineering, model training, evaluation, and a front-end for user interaction.

Step 1: Dataset Overview

We will use the Spam SMS Dataset, which consists of labeled messages categorized as:

  • ham (legitimate emails)
  • spam (unwanted messages)

The dataset contains two main columns:

  • text: The content of the email/SMS.
  • label: Classification as ‘spam’ or ‘ham’.

Let’s proceed with importing the necessary libraries and loading the dataset.

Step 2: Importing Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Explanation

  • pandas, numpy: For handling and processing data.
  • seaborn, matplotlib: For data visualization.
  • re: Regular expressions for text cleaning.
  • nltk: NLP library for text preprocessing.
  • TfidfVectorizer: Converts text data into numerical format.
  • MultinomialNB: A Naïve Bayes classifier for text classification.
  • train_test_split: Splitting data into training and testing sets.
  • accuracy_score, classification_report, confusion_matrix: Evaluation metrics.

Step 3: Loading the Dataset

Let’s load the dataset and take a look at its structure.

# Load dataset
df = pd.read_csv("spam.csv", encoding="latin-1")

# Display the first five rows
df.head()

Explanation:

  • pd.read_csv("spam.csv", encoding="latin-1"): Loads the dataset into a Pandas DataFrame.
  • df.head(): Displays the first five rows to understand the dataset structure.

Step 3.1: Checking Dataset Information

Before proceeding with cleaning, let’s check the dataset’s information and see if it contains unnecessary columns.

# Check dataset information
df.info()

# Check for missing values
df.isnull().sum()

Explanation:

  • df.info(): Provides details about the dataset, including data types and non-null values.
  • df.isnull().sum(): Checks for missing values in the dataset.

Step 3.2: Dropping Unnecessary Columns

Some spam email datasets may contain extra columns that are not needed. Let’s drop any unnecessary ones.

# Keeping only the relevant columns
df = df[['v1', 'v2']]

# Renaming the columns for better understanding
df.columns = ['label', 'text']

# Checking the first few rows after modification
df.head()

Explanation:

  • We keep only the two relevant columns:
    • label: Spam or ham classification.
    • text: The email/SMS message content.
  • Renaming columns for clarity.

Step 3.3: Checking Class Distribution

Since this is a classification problem, it’s important to check the balance between spam and ham emails.

# Count the number of spam and ham emails
sns.countplot(x=df['label'], palette="coolwarm")
plt.title("Distribution of Spam and Ham Emails")
plt.show()

Explanation:

  • sns.countplot(x=df['label']): Creates a count plot to visualize the distribution of spam vs. ham emails.
  • This helps us understand if the dataset is balanced or imbalanced.
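The count plot gives a visual impression; as a quick numeric complement, the exact class proportions can also be printed (a small sketch using value_counts):

# Numeric check of the class balance: counts and percentages
print(df['label'].value_counts())
print((df['label'].value_counts(normalize=True) * 100).round(2))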

Step 4: Data Cleaning and Preprocessing

Before we train our machine learning models, we need to clean and preprocess the dataset. This includes:

  • Handling categorical labels (converting text labels to numerical values).
  • Checking for duplicates and removing them.
  • Text preprocessing (removing special characters, stopwords, and applying stemming).

Step 4.1: Encoding the Labels

Since the labels (ham and spam) are categorical, we need to convert them into numerical values.

# Encoding the target variable: ham → 0, spam → 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Checking the first few rows after encoding
df.head()

Explanation:

  • We replace "ham" with 0 and "spam" with 1 so that the machine learning model can process the labels.
  • df.head() is used to verify that the encoding is done correctly.

Step 4.2: Checking for Duplicates

Duplicate messages may exist in the dataset, which can negatively impact model training.

# Checking for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Removing duplicate rows if any
df = df.drop_duplicates()

# Checking the shape after removing duplicates
df.shape

Explanation:

  • df.duplicated().sum(): Counts the number of duplicate rows.
  • df.drop_duplicates(): Removes duplicate entries.
  • df.shape: Displays the dataset dimensions after removing duplicates.

Step 4.3: Text Preprocessing

Text data needs to be cleaned before using it in machine learning models. This includes:

  • Converting text to lowercase
  • Removing special characters, punctuation, and numbers
  • Removing stopwords (common words like “the”, “and”, “is”)
  • Stemming (reducing words to their root form)

Install NLTK Library if Not Installed

If you haven’t installed NLTK before, you need to install it using:

pip install nltk

Preprocessing the Text

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

# Download stopwords dataset
nltk.download('stopwords')

# Initialize the stemmer and the stopword set (precomputed once for speed)
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Function to clean the text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    words = text.split()  # Tokenization
    words = [stemmer.stem(word) for word in words if word not in stop_words]  # Remove stopwords and apply stemming
    return " ".join(words)

# Apply text cleaning function to the dataset
df["clean_text"] = df["text"].apply(clean_text)

# Display the first few cleaned messages
df.head()

Explanation:

  • Converts text to lowercase.
  • Removes punctuation using str.translate().
  • Splits text into words (tokenization).
  • Removes stopwords (words that don’t add much meaning).
  • Applies stemming (e.g., “running” → “run”, “playing” → “play”).
  • Stores the cleaned text in a new column clean_text.
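To see what stemming actually does, here is a tiny illustrative check on a few made-up words (the outputs shown in the comment are approximate):

# Quick look at Porter stemming on a few example words
for word in ["running", "winner", "congratulations", "freely"]:
    print(word, "->", stemmer.stem(word))
# e.g. running -> run, congratulations -> congratul, freely -> freeli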

Step 5: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) helps us understand the dataset by uncovering patterns, relationships, and potential anomalies. In this step, we will:

  • Check the distribution of spam vs. ham messages.
  • Analyze message lengths.
  • Identify frequently occurring words in spam and ham messages using word clouds.
  • Examine correlations between word frequency and spam likelihood.

Step 5.1: Checking Label Distribution

Let’s first examine the number of spam and ham messages in our dataset.

import matplotlib.pyplot as plt
import seaborn as sns

# Count of spam and ham messages
plt.figure(figsize=(6, 4))
sns.countplot(x=df["label"], palette="viridis")
plt.xticks(ticks=[0, 1], labels=["Ham", "Spam"])
plt.title("Distribution of Ham and Spam Messages")
plt.xlabel("Message Type")
plt.ylabel("Count")
plt.show()

Explanation:

  • Uses Seaborn’s countplot to visualize the count of ham (non-spam) and spam messages.
  • Helps us determine if the dataset is imbalanced (i.e., significantly more ham than spam messages).

Step 5.2: Checking Message Lengths

Spam messages are often longer due to promotional content. Let’s analyze the average message length.

# Create a new column with message lengths
df["message_length"] = df["message"].apply(len)

# Plot the distribution of message lengths
plt.figure(figsize=(8, 5))
sns.histplot(df[df["label"] == 0]["message_length"], bins=30, kde=True, color="blue", label="Ham")
sns.histplot(df[df["label"] == 1]["message_length"], bins=30, kde=True, color="red", label="Spam")
plt.legend()
plt.title("Distribution of Message Lengths")
plt.xlabel("Message Length")
plt.ylabel("Frequency")
plt.show()

Explanation:

  • Creates a new column (message_length) that stores the length of each message.
  • Uses histograms to compare message lengths between spam and ham messages.
  • Helps us identify if spam messages tend to be longer than ham messages.
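As a numeric complement to the histogram, we can also summarize message lengths per class using the message_length column created above:

# Summary statistics of message length per class (0 = ham, 1 = spam)
print(df.groupby("label")["message_length"].describe())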

Step 5.3: Word Frequency Analysis Using Word Clouds

A word cloud helps visualize the most frequently occurring words in spam and ham messages.

from wordcloud import WordCloud

# Combine all ham and spam messages into single strings
ham_words = " ".join(df[df["label"] == 0]["text"])
spam_words = " ".join(df[df["label"] == 1]["text"])

plt.figure(figsize=(12, 6))

# Ham WordCloud
plt.subplot(1, 2, 1)
wc_ham = WordCloud(width=400, height=300, background_color="white").generate(ham_words)
plt.imshow(wc_ham, interpolation="bilinear")
plt.axis("off")
plt.title("Most Common Words in Ham Messages")

# Spam WordCloud
plt.subplot(1, 2, 2)
wc_spam = WordCloud(width=400, height=300, background_color="black", colormap="Reds").generate(spam_words)
plt.imshow(wc_spam, interpolation="bilinear")
plt.axis("off")
plt.title("Most Common Words in Spam Messages")

plt.show()

Explanation:

  • Combines all words in ham and spam messages separately.
  • Uses the WordCloud library to visualize frequent words.
  • Helps us identify spam-related words (e.g., “win”, “free”, “urgent”).
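If the wordcloud package is not installed, a plain frequency count gives similar insight; a minimal sketch using collections.Counter on the raw text (stopwords are not removed here, so common filler words will dominate the ham list):

from collections import Counter

# Word frequency counts as an alternative to word clouds
ham_counts = Counter(" ".join(df[df["label"] == 0]["text"]).lower().split())
spam_counts = Counter(" ".join(df[df["label"] == 1]["text"]).lower().split())

print("Top ham words:", ham_counts.most_common(10))
print("Top spam words:", spam_counts.most_common(10))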

Step 5.4: Checking Correlations Between Words and Spam Likelihood

We will identify the top words that strongly indicate whether a message is spam.

from sklearn.feature_extraction.text import CountVectorizer

# Convert text into word count vectors (20 most frequent words)
vectorizer = CountVectorizer(stop_words="english", max_features=20)
X_counts = vectorizer.fit_transform(df["text"])
word_freq_df = pd.DataFrame(X_counts.toarray(), columns=vectorizer.get_feature_names_out())

# Compute correlation with the spam label
# (.values keeps row order aligned, since drop_duplicates left gaps in df's index)
word_freq_df["label"] = df["label"].values
correlations = word_freq_df.corr()["label"].drop("label").sort_values(ascending=False)

# Plot top correlated words
plt.figure(figsize=(10, 5))
sns.barplot(x=correlations.index, y=correlations.values, palette="coolwarm")
plt.xticks(rotation=45)
plt.title("Top Words Correlated with Spam Messages")
plt.xlabel("Word")
plt.ylabel("Correlation with Spam")
plt.show()

Explanation:

  • Converts messages into a word count matrix using CountVectorizer.
  • Computes the correlation between word frequency and spam likelihood.
  • Plots the most correlated words, helping us identify key spam indicators.

Summary of EDA

  • Spam messages tend to be longer than ham messages.
  • Common spam words include “win”, “free”, “urgent”, while ham messages have general conversational words.
  • Certain words have a strong correlation with spam messages, which can help in feature selection.

Step 6: Data Preprocessing

Before training a machine learning model, we need to clean, tokenize, and vectorize the text data. This step includes:

  • Removing unnecessary characters, stopwords, and punctuations.
  • Converting text into lowercase.
  • Tokenizing and stemming words.
  • Converting text data into numerical format using TF-IDF Vectorization.

Step 6.1: Text Cleaning (Removing Punctuation, Stopwords, and Lowercasing)

We will first clean the text data by removing unnecessary characters and converting everything into lowercase.

import re
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove numbers
    text = re.sub(r"\d+", "", text)
    # Remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words]
    return " ".join(words)

# Apply cleaning function
df["cleaned_message"] = df["text"].apply(clean_text)

# Display first 5 cleaned messages
df[["text", "cleaned_message"]].head()

Explanation:

  • Lowercasing ensures consistency in text processing.
  • Punctuation and numbers are removed as they do not contribute to spam detection.
  • Stopwords (e.g., “the”, “is”, “and”) are removed to reduce noise.

Step 6.2: Tokenization and Stemming

Tokenization breaks sentences into words, and stemming reduces words to their root form.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    words = text.split()
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)

df["stemmed_message"] = df["cleaned_message"].apply(stem_words)

# Display first 5 stemmed messages
df[["cleaned_message", "stemmed_message"]].head()

Explanation:

  • Tokenization breaks text into individual words.
  • Stemming reduces words to their base form (e.g., “winning” → “win”).

Step 6.3: Converting Text Data to Numerical Format Using TF-IDF

To train a machine learning model, we need to convert text data into numerical vectors. We use TF-IDF (Term Frequency – Inverse Document Frequency) to weigh words based on importance.

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=3000)

# Transform the messages
X = tfidf.fit_transform(df["stemmed_message"]).toarray()

# Store feature names
features = tfidf.get_feature_names_out()

# Convert to DataFrame for better visualization
tfidf_df = pd.DataFrame(X, columns=features)

# Display first 5 rows
tfidf_df.head()

Explanation:

  • TF-IDF assigns importance to words based on their frequency across messages.
  • Converts text into numerical representation to be used by machine learning models.
  • Limits features to 3000 words to avoid memory issues.
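To build intuition for how TF-IDF weights words, here is a small illustrative example on a toy corpus (the sentences are made up for demonstration only):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Toy corpus: words shared across sentences get a lower IDF weight
toy_corpus = [
    "win a free prize now",
    "free prize waiting for you",
    "are we meeting for lunch today",
]
toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_corpus)
print(pd.DataFrame(toy_matrix.toarray().round(2), columns=toy_tfidf.get_feature_names_out()))
# Within each row, words shared across sentences (e.g. "free", "prize") get lower
# weights than words unique to that sentence (e.g. "win", "waiting").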

Summary of Data Preprocessing

  • Text is cleaned (stopwords, punctuation, and numbers removed).
  • Words are tokenized and stemmed to their root forms.
  • TF-IDF transforms text into numerical format for model training.

Step 7: Model Selection and Training

Now that our text data is preprocessed and transformed into numerical format using TF-IDF, we can proceed with training machine learning models for spam detection.

We will use multiple machine learning models and compare their performance:

  1. Logistic Regression
  2. Naïve Bayes (MultinomialNB)
  3. Support Vector Machine (SVM – Linear Kernel)
  4. Random Forest Classifier

We will train, test, and evaluate each model to find the best-performing one.

Step 7.1: Splitting Data into Training and Testing Sets

We divide the dataset into 80% training and 20% testing to evaluate model performance.

from sklearn.model_selection import train_test_split

# Target variable (already encoded in Step 4.1: ham = 0, spam = 1)
y = df["label"]

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training Set Size: {X_train.shape[0]} samples")
print(f"Testing Set Size: {X_test.shape[0]} samples")

Explanation:

  • train_test_split() splits the dataset into 80% training and 20% testing.
  • stratify=y ensures class balance in training and testing sets.
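We can verify that stratification preserved the spam/ham ratio in both splits:

# Check class proportions in the training and testing sets
print("Train class ratio:\n", y_train.value_counts(normalize=True).round(3))
print("Test class ratio:\n", y_test.value_counts(normalize=True).round(3))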

Step 7.2: Training Logistic Regression Model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train model
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test)

# Model evaluation
print("Logistic Regression Performance:")
print(classification_report(y_test, y_pred_lr))

Logistic Regression is fast and interpretable.
✔ Suitable for binary classification tasks like spam detection.

Step 7.3: Training Naïve Bayes Model (MultinomialNB)

from sklearn.naive_bayes import MultinomialNB

# Initialize and train model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predictions
y_pred_nb = nb_model.predict(X_test)

# Model evaluation
print("Naïve Bayes Performance:")
print(classification_report(y_test, y_pred_nb))

Naïve Bayes is effective for text classification.
✔ Works well with TF-IDF-transformed data.

Step 7.4: Training Support Vector Machine (SVM) Model

from sklearn.svm import SVC

# Initialize and train model
svm_model = SVC(kernel="linear")
svm_model.fit(X_train, y_train)

# Predictions
y_pred_svm = svm_model.predict(X_test)

# Model evaluation
print("SVM Performance:")
print(classification_report(y_test, y_pred_svm))

SVM (Linear Kernel) works well with high-dimensional data like text.
✔ Often provides better accuracy than Naïve Bayes.

Step 7.5: Training Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

# Initialize and train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)

# Model evaluation
print("Random Forest Performance:")
print(classification_report(y_test, y_pred_rf))

Random Forest is a strong ensemble method.
✔ It captures complex relationships in data.

Step 7.6: Comparing Model Performance

from sklearn.metrics import accuracy_score

# Accuracy comparison
models = {
    "Logistic Regression": accuracy_score(y_test, y_pred_lr),
    "Naïve Bayes": accuracy_score(y_test, y_pred_nb),
    "SVM": accuracy_score(y_test, y_pred_svm),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
}

# Display accuracy scores
for model, accuracy in models.items():
    print(f"{model}: {accuracy:.4f}")

Explanation:

  • We compare the accuracy of all models to find the best one.
  • Higher accuracy means better classification performance.
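As an optional visual complement to the printed scores, the accuracies stored in the models dictionary above can be plotted as a bar chart:

# Visualize the accuracy comparison
plt.figure(figsize=(8, 4))
sns.barplot(x=list(models.keys()), y=list(models.values()))
plt.title("Model Accuracy Comparison")
plt.ylabel("Accuracy")
plt.xticks(rotation=15)
plt.show()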

Summary of Model Training:

  • Logistic Regression: Fast and interpretable.
  • Naïve Bayes: Simple and effective for text classification.
  • SVM: Often provides higher accuracy.
  • Random Forest: Strong ensemble method for complex data.

Step 8: Data Preprocessing with Lemmatization

Next, we refine the preprocessing pipeline with a lemmatization-based alternative to stemming and rebuild the numerical features. This involves:

  1. Removing Stopwords & Punctuation
  2. Tokenization & Lemmatization
  3. Vectorization using TF-IDF

8.1 Removing Stopwords & Punctuation

Spam emails often contain unnecessary words. We will remove them along with punctuations to clean the text.

Code Implementation

import nltk
from nltk.corpus import stopwords
import string

# Download stopwords dataset
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """ Function to remove punctuation and stopwords from text """
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    words = text.lower().split()  # Convert to lowercase and split
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    return ' '.join(words)

# Apply text cleaning function
df['clean_text'] = df['text'].apply(clean_text)

# Display cleaned text
df[['text', 'clean_text']].head()

Explanation

  • stopwords.words('english') loads common words like “the”, “is”, “and” which do not add much meaning.
  • string.punctuation removes punctuation marks like “.”, “!”, “?”.
  • lower() and split() convert the text to lowercase and split it into words.
  • List comprehension is used to remove stopwords efficiently.
  • Finally, we apply this function to all email texts.

8.2 Tokenization & Lemmatization

Tokenization splits text into words, and lemmatization converts words to their base form (e.g., “running” → “run”).

Code Implementation

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    """ Function to lemmatize words in text """
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

# Apply lemmatization
df['lemmatized_text'] = df['clean_text'].apply(lemmatize_text)

# Display lemmatized text
df[['clean_text', 'lemmatized_text']].head()

Explanation

  • WordNet Lemmatizer converts words to their root form.
  • Each word in the cleaned text is lemmatized and joined back into a sentence.
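To see how lemmatization differs from the stemming used earlier, here is a small side-by-side comparison on a few illustrative words:

from nltk.stem import PorterStemmer, WordNetLemmatizer

# Compare stemming and lemmatization (lemmas are dictionary words, stems may not be)
ps, wnl = PorterStemmer(), WordNetLemmatizer()
for word in ["studies", "running", "messages"]:
    print(word, "| stem:", ps.stem(word), "| lemma:", wnl.lemmatize(word))
# e.g. "studies" stems to "studi" but lemmatizes to "study"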

8.3 Converting Text to Vectors using TF-IDF

Since machine learning models work with numbers, we will convert text into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency).

Code Implementation

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000)  # Limit vocabulary size
X_tfidf = vectorizer.fit_transform(df['lemmatized_text'])

# Convert sparse matrix to DataFrame
X_tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

# Target variable
y = df['label']

# Display shape of processed data
print("Shape of feature matrix:", X_tfidf_df.shape)
print("Shape of target variable:", y.shape)

Explanation

  • TF-IDF converts words into a weighted numerical representation.
  • max_features=5000 limits vocabulary size to 5000 most important words.
  • fit_transform() applies vectorization to our dataset.
  • The sparse matrix is converted into a DataFrame for better readability.
  • The target variable (y) remains unchanged.

Step 9: Splitting the Dataset for Training and Testing

Before training our machine learning models, we need to split the dataset into two parts:

  • Training Set (80%) – Used to train the model
  • Testing Set (20%) – Used to evaluate model performance

9.1 Splitting the Data

We’ll use train_test_split from scikit-learn to divide our dataset.

Code Implementation

from sklearn.model_selection import train_test_split

# Splitting data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42, stratify=y)

# Print dataset sizes
print("Training Set Size:", X_train.shape)
print("Testing Set Size:", X_test.shape)

Explanation

  • train_test_split randomly splits the dataset into training (80%) and testing (20%).
  • random_state=42 ensures reproducibility.
  • stratify=y maintains the proportion of spam and non-spam emails in both sets.

Step 10: Model Training and Evaluation

We will train multiple machine learning models and compare their performances. The models we will use are:

  • Logistic Regression
  • Naïve Bayes (MultinomialNB)
  • Support Vector Machine (SVM)
  • Random Forest Classifier
  • XGBoost Classifier

10.1 Training Logistic Regression

Code Implementation

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train the model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predictions
y_pred_log = log_reg.predict(X_test)

# Model Evaluation
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print("Classification Report:\n", classification_report(y_test, y_pred_log))

10.2 Training Naïve Bayes Classifier

Code Implementation

from sklearn.naive_bayes import MultinomialNB

# Initialize and train the model
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Predictions
y_pred_nb = nb.predict(X_test)

# Model Evaluation
print("Naïve Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Classification Report:\n", classification_report(y_test, y_pred_nb))

10.3 Training Support Vector Machine (SVM)

Code Implementation

from sklearn.svm import SVC

# Initialize and train the model
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

# Predictions
y_pred_svm = svm.predict(X_test)

# Model Evaluation
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test, y_pred_svm))

10.4 Training Random Forest Classifier

Code Implementation

from sklearn.ensemble import RandomForestClassifier

# Initialize and train the model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_pred_rf = rf.predict(X_test)

# Model Evaluation
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

10.5 Training XGBoost Classifier

Code Implementation

from xgboost import XGBClassifier

# Initialize and train the model (use_label_encoder is no longer needed in recent XGBoost versions)
xgb = XGBClassifier(eval_metric='logloss')
xgb.fit(X_train, y_train)

# Predictions
y_pred_xgb = xgb.predict(X_test)

# Model Evaluation
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("Classification Report:\n", classification_report(y_test, y_pred_xgb))

Step 11: Comparing Model Performance

Now, let’s compare the accuracy of all models.

Code Implementation

# Store accuracy scores
model_scores = {
    "Logistic Regression": accuracy_score(y_test, y_pred_log),
    "Naïve Bayes": accuracy_score(y_test, y_pred_nb),
    "SVM": accuracy_score(y_test, y_pred_svm),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
    "XGBoost": accuracy_score(y_test, y_pred_xgb)
}

# Print model comparison
for model, score in model_scores.items():
    print(f"{model}: {score:.4f}")

Next Steps

Now that we have trained multiple models and compared their performances, the next step is to deploy the best-performing model using a Flask-based web app for user interaction.
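One simple way to pick the deployment model programmatically from the model_scores dictionary is sketched below; it selects by accuracy only (in practice, precision and recall on the spam class matter too) and defines the best_model variable that is saved in the next step:

# Select the best-performing model by accuracy
trained_models = {
    "Logistic Regression": log_reg,
    "Naïve Bayes": nb,
    "SVM": svm,
    "Random Forest": rf,
    "XGBoost": xgb,
}
best_name = max(model_scores, key=model_scores.get)
best_model = trained_models[best_name]
print(f"Best model: {best_name} ({model_scores[best_name]:.4f})")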

The Spam Email Detection Using Machine Learning project demonstrates the power of AI in automating email classification and enhancing cybersecurity. We started by preprocessing the dataset, applying TF-IDF vectorization, and training multiple machine learning models. Our evaluation showed that models like Naïve Bayes and SVM perform exceptionally well in identifying spam emails with high accuracy.

To make this project practical, we also built a user-friendly front-end using Flask, allowing users to input email text and get instant spam classification results. This real-time spam detection system can be integrated into email platforms to prevent phishing, scams, and malware threats.

By implementing machine learning-based spam filters, organizations can significantly reduce email-based threats, improve email security, and provide users with a spam-free communication experience. This project highlights the importance of AI-driven automation in today’s cybersecurity landscape, making email communication more efficient and secure.


Step 12: Building the Front-End with Flask for Spam Email Detection

Now that we have trained and evaluated multiple machine learning models, it’s time to build a Flask-based web application that allows users to input an email and get real-time spam detection results.

Flask is a lightweight web framework for Python that enables us to create a user-friendly interface for our spam detection model.

12.1 Installing Dependencies

Before we start coding, install Flask and other required libraries:

pip install flask flask-wtf wtforms pandas joblib

12.2 Creating the Flask App Structure

Create a project folder and organize it as follows:

spam_email_detection/
│── static/                 # CSS, JS, images
│── templates/              # HTML files
│── model/                  # Saved ML model
│── app.py                  # Flask main application
│── spam_model.pkl          # Serialized trained model
│── vectorizer.pkl          # TF-IDF vectorizer
│── requirements.txt        # List of dependencies

12.3 Saving the Best Model & Vectorizer

Before deploying, we save the best-performing model and vectorizer using joblib.

import joblib

# Save the best-performing model (best_model is the classifier selected after the comparison in Step 11)
joblib.dump(best_model, "spam_model.pkl")

# Save the TF-IDF vectorizer
joblib.dump(vectorizer, "vectorizer.pkl")

12.4 Creating the Flask App (app.py)

Now, let’s create the main Flask app (app.py), which will load the saved model, process user inputs, and return predictions.

from flask import Flask, render_template, request
import joblib

# Initialize Flask app
app = Flask(__name__)

# Load the trained model and vectorizer
model = joblib.load("spam_model.pkl")
vectorizer = joblib.load("vectorizer.pkl")

@app.route("/", methods=["GET", "POST"])
def index():
    prediction = None
    
    if request.method == "POST":
        email_text = request.form["email_text"]
        
        # Transform the input email text
        email_vector = vectorizer.transform([email_text])
        
        # Predict spam or ham
        prediction = model.predict(email_vector)[0]
        
        if prediction == 1:
            prediction = "Spam"
        else:
            prediction = "Not Spam"
    
    return render_template("index.html", prediction=prediction)

if __name__ == "__main__":
    app.run(debug=True)

12.5 Creating the Front-End (index.html)

Inside the templates/ folder, create an index.html file for the web interface.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Spam Email Detector</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            text-align: center;
            margin-top: 50px;
        }
        form {
            margin-top: 20px;
        }
        textarea {
            width: 50%;
            height: 100px;
        }
        .result {
            font-size: 20px;
            font-weight: bold;
            margin-top: 20px;
        }
    </style>
</head>
<body>
    <h1>Spam Email Detection Using Machine Learning</h1>
    
    <form method="POST">
        <label for="email_text">Enter Email Content:</label><br>
        <textarea name="email_text" required></textarea><br><br>
        <button type="submit">Check</button>
    </form>

    {% if prediction %}
        <div class="result">
            <p>Prediction: {{ prediction }}</p>
        </div>
    {% endif %}
</body>
</html>

12.6 Running the Flask App

To start the web app, run the following command:

python app.py

The application will start at http://127.0.0.1:5000/. Open it in your browser and enter an email text to check if it’s spam or not.
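Besides using the browser form, the endpoint can be smoke-tested from Python; a minimal sketch assuming the requests package is installed and the app is running locally:

import requests

# Post a sample email to the running app and inspect the rendered prediction
resp = requests.post(
    "http://127.0.0.1:5000/",
    data={"email_text": "Congratulations, you have won a free prize! Claim now."},
)
print(resp.status_code)
print("Predicted: Spam" if "Prediction: Spam" in resp.text else "Predicted: Not Spam (or inspect resp.text)")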


Conclusion

We have successfully built and deployed a Spam Email Detection web application using Flask. This app enables users to enter email text and receive real-time spam classification results. By integrating machine learning and Flask, we have created a practical solution for identifying spam emails.

Next, we can deploy this app to platforms like Heroku, Render, or AWS for public access. 🚀
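For Heroku-style platforms, a minimal deployment sketch (details vary by platform) is to install a production WSGI server with pip install gunicorn, add it to requirements.txt, and create a one-line Procfile at the project root:

web: gunicorn app:app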

Let me know in the comments if you want to proceed with Step 13: Deploying the Flask App to Heroku!
