Spam Email Detection Using Machine Learning

In today’s digital world, spam emails pose a significant challenge, flooding inboxes with unwanted and potentially harmful messages. Traditional spam filters rely on manually defined rules, but they often fail to adapt to new spam patterns. This is where Spam Email Detection Using Machine Learning comes into play.
By leveraging machine learning techniques, we can develop an intelligent spam classification model that automatically identifies and filters out spam emails with high accuracy. In this project, we build an end-to-end spam detection system, covering data preprocessing, exploratory data analysis (EDA), feature extraction, model training, evaluation, and deployment. We explore multiple machine learning algorithms, including Logistic Regression, Naïve Bayes, SVM, Random Forest, and XGBoost, to determine the most effective model for spam classification.
Through this project, we aim to enhance email security and user experience by building an efficient spam detection system that minimizes false positives and ensures only genuine emails reach users’ inboxes.
Spam emails are unwanted and often fraudulent messages that clutter inboxes and pose security threats. Detecting spam manually is inefficient, so machine learning algorithms help automate spam detection by analyzing email content and identifying spam patterns. In this project, we will build a spam detection model using machine learning techniques, covering data preprocessing, exploratory data analysis, feature engineering, model training, evaluation, and a front-end for user interaction.
Step 1: Dataset Overview
We will use the Spam SMS Dataset, which consists of labeled messages categorized as:
- ham (legitimate emails)
- spam (unwanted messages)
The dataset contains two main columns:
- text: The content of the email/SMS.
- label: Classification as ‘spam’ or ‘ham’.
Let’s proceed with importing the necessary libraries and loading the dataset.
Step 2: Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Explanation
- pandas, numpy: For handling and processing data.
- seaborn, matplotlib: For data visualization.
- re: Regular expressions for text cleaning.
- nltk: NLP library for text preprocessing.
- TfidfVectorizer: Converts text data into numerical format.
- MultinomialNB: A Naïve Bayes classifier for text classification.
- train_test_split: Splits data into training and testing sets.
- accuracy_score, classification_report, confusion_matrix: Evaluation metrics.
Step 3: Loading the Dataset
Let’s load the dataset and take a look at its structure.
# Load dataset
df = pd.read_csv("spam.csv", encoding="latin-1")

# Display the first five rows
df.head()
Explanation:
- pd.read_csv("spam.csv", encoding="latin-1"): Loads the dataset into a Pandas DataFrame.
- df.head(): Displays the first five rows to understand the dataset structure.
Step 3.1: Checking Dataset Information
Before proceeding with cleaning, let’s check the dataset’s information and see if it contains unnecessary columns.
# Check dataset information
df.info()

# Check for missing values
df.isnull().sum()
Explanation:
- df.info(): Provides details about the dataset, including data types and non-null values.
- df.isnull().sum(): Checks for missing values in the dataset.
Step 3.2: Dropping Unnecessary Columns
Some spam email datasets may contain extra columns that are not needed. Let’s drop any unnecessary ones.
# Keeping only the relevant columns
df = df[['v1', 'v2']]

# Renaming the columns for better understanding
df.columns = ['label', 'text']

# Checking the first few rows after modification
df.head()
Explanation:
- We keep only the two relevant columns:
  - label: Spam or ham classification.
  - text: The email/SMS message content.
- We rename the columns for clarity.
Step 3.3: Checking Class Distribution
Since this is a classification problem, it’s important to check the balance between spam and ham emails.
# Count the number of spam and ham emails
sns.countplot(x=df['label'], palette="coolwarm")
plt.title("Distribution of Spam and Ham Emails")
plt.show()
Explanation:
- sns.countplot(x=df['label']): Creates a count plot to visualize the distribution of spam vs. ham emails.
- This helps us understand if the dataset is balanced or imbalanced.
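For a quick numeric view of the same class balance, a minimal sketch using the loaded DataFrame prints the raw counts and proportions:

# Numeric view of the class balance
print(df['label'].value_counts())
print(df['label'].value_counts(normalize=True).round(3))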
Step 4: Data Cleaning and Preprocessing
Before we train our machine learning models, we need to clean and preprocess the dataset. This includes:
- Handling categorical labels (converting text labels to numerical values).
- Checking for duplicates and removing them.
- Text preprocessing (removing special characters, stopwords, and applying stemming).
Step 4.1: Encoding the Labels
Since the labels (ham and spam) are categorical, we need to convert them into numerical values.
# Encoding the target variable: ham → 0, spam → 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Checking the first few rows after encoding
df.head()
Explanation:
- We replace "ham" with 0 and "spam" with 1 so that the machine learning model can process the labels.
- df.head() is used to verify that the encoding is done correctly.
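If the label column ever contains values other than 'ham' or 'spam' (for example, stray whitespace or different capitalization), map() silently produces NaN, so a quick sanity check is worthwhile. A minimal sketch:

# Sanity check: any labels that failed to map become NaN
print("Unmapped labels:", df['label'].isnull().sum())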
Step 4.2: Checking for Duplicates
Duplicate messages may exist in the dataset, which can negatively impact model training.
# Checking for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Removing duplicate rows if any
df = df.drop_duplicates()

# Checking the shape after removing duplicates
df.shape
Explanation:
- df.duplicated().sum(): Counts the number of duplicate rows.
- df.drop_duplicates(): Removes duplicate entries.
- df.shape: Displays the dataset dimensions after removing duplicates.
Step 4.3: Text Preprocessing
Text data needs to be cleaned before using it in machine learning models. This includes:
- Converting text to lowercase
- Removing special characters, punctuation, and numbers
- Removing stopwords (common words like “the”, “and”, “is”)
- Stemming (reducing words to their root form)
Install NLTK Library if Not Installed
If you haven’t installed NLTK before, you need to install it using:
pip install nltk
Preprocessing the Text
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

# Download stopwords dataset
nltk.download('stopwords')

# Initialize the stemmer
stemmer = PorterStemmer()

# Function to clean the text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    words = text.split()  # Tokenization
    # Remove stopwords and apply stemming
    words = [stemmer.stem(word) for word in words if word not in stopwords.words('english')]
    return " ".join(words)

# Apply text cleaning function to the dataset
df["clean_text"] = df["text"].apply(clean_text)

# Display the first few cleaned messages
df.head()
Explanation:
- Converts text to lowercase.
- Removes punctuation using str.translate().
- Splits text into words (tokenization).
- Removes stopwords (words that don’t add much meaning).
- Applies stemming (e.g., “running” → “run”, “playing” → “play”).
- Stores the cleaned text in a new column, clean_text.
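To see the effect of the cleaning function, you can call it directly on a single message; the sample sentence below is made up purely for illustration:

# Example: applying clean_text to a sample message
sample = "Congratulations! You have WON a free ticket, call now!!!"
print(clean_text(sample))
# Stopwords are dropped and the remaining words are stemmed, e.g. "congratulations" -> "congratul"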
Step 5: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) helps us understand the dataset by uncovering patterns, relationships, and potential anomalies. In this step, we will:
- Check the distribution of spam vs. ham messages.
- Analyze message lengths.
- Identify frequently occurring words in spam and ham messages using word clouds.
- Examine correlations between word frequency and spam likelihood.
Step 5.1: Checking Label Distribution
Let’s first examine the number of spam and ham messages in our dataset.
import matplotlib.pyplot as plt
import seaborn as sns

# Count of spam and ham messages
plt.figure(figsize=(6, 4))
sns.countplot(x=df["label"], palette="viridis")
plt.xticks(ticks=[0, 1], labels=["Ham", "Spam"])
plt.title("Distribution of Ham and Spam Messages")
plt.xlabel("Message Type")
plt.ylabel("Count")
plt.show()
Explanation:
- Uses Seaborn’s countplot to visualize the count of ham (non-spam) and spam messages.
- Helps us determine if the dataset is imbalanced (i.e., significantly more ham than spam messages).
Step 5.2: Checking Message Lengths
Spam messages are often longer due to promotional content. Let’s analyze the average message length.
# Create a new column with message lengths (the text column was renamed in Step 3.2)
df["message_length"] = df["text"].apply(len)

# Plot the distribution of message lengths
plt.figure(figsize=(8, 5))
sns.histplot(df[df["label"] == 0]["message_length"], bins=30, kde=True, color="blue", label="Ham")
sns.histplot(df[df["label"] == 1]["message_length"], bins=30, kde=True, color="red", label="Spam")
plt.legend()
plt.title("Distribution of Message Lengths")
plt.xlabel("Message Length")
plt.ylabel("Frequency")
plt.show()
Explanation:
- Creates a new column (message_length) that stores the length of each message.
- Uses histograms to compare message lengths between spam and ham messages.
- Helps us identify if spam messages tend to be longer than ham messages.
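A complementary numeric summary, using the same message_length column, is the average length per class:

# Average message length per class (0 = ham, 1 = spam)
print(df.groupby("label")["message_length"].mean())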
Step 5.3: Word Frequency Analysis Using Word Clouds
A word cloud helps visualize the most frequently occurring words in spam and ham messages.
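The wordcloud package is not installed by default, so add it first if needed:

pip install wordcloud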
from wordcloud import WordCloud

# Combine the ham and spam messages into two text blobs
ham_words = " ".join(df[df["label"] == 0]["text"])
spam_words = " ".join(df[df["label"] == 1]["text"])

plt.figure(figsize=(12, 6))

# Ham WordCloud
plt.subplot(1, 2, 1)
wc_ham = WordCloud(width=400, height=300, background_color="white").generate(ham_words)
plt.imshow(wc_ham, interpolation="bilinear")
plt.axis("off")
plt.title("Most Common Words in Ham Messages")

# Spam WordCloud
plt.subplot(1, 2, 2)
wc_spam = WordCloud(width=400, height=300, background_color="black", colormap="Reds").generate(spam_words)
plt.imshow(wc_spam, interpolation="bilinear")
plt.axis("off")
plt.title("Most Common Words in Spam Messages")

plt.show()
Explanation:
- Combines all words in ham and spam messages separately.
- Uses the WordCloud library to visualize frequent words.
- Helps us identify spam-related words (e.g., “win”, “free”, “urgent”).
Step 5.4: Checking Correlations Between Words and Spam Likelihood
We will identify the top words that strongly indicate whether a message is spam.
from sklearn.feature_extraction.text import CountVectorizer

# Convert text into word count vectors
vectorizer = CountVectorizer(stop_words="english", max_features=20)
X_counts = vectorizer.fit_transform(df["text"])
word_freq_df = pd.DataFrame(X_counts.toarray(), columns=vectorizer.get_feature_names_out())

# Compute correlation with the spam label (.values avoids index misalignment after dropping duplicates)
word_freq_df["label"] = df["label"].values
correlations = word_freq_df.corr()["label"].drop("label").sort_values(ascending=False)

# Plot top correlated words
plt.figure(figsize=(10, 5))
sns.barplot(x=correlations.index, y=correlations.values, palette="coolwarm")
plt.xticks(rotation=45)
plt.title("Top Words Correlated with Spam Messages")
plt.xlabel("Word")
plt.ylabel("Correlation with Spam")
plt.show()
Explanation:
- Converts messages into a word count matrix using CountVectorizer.
- Computes the correlation between word frequency and spam likelihood.
- Plots the most correlated words, helping us identify key spam indicators.
Summary of EDA
- Spam messages tend to be longer than ham messages.
- Common spam words include “win”, “free”, “urgent”, while ham messages have general conversational words.
- Certain words have a strong correlation with spam messages, which can help in feature selection.
Step 6: Data Preprocessing
Before training a machine learning model, we need to clean, tokenize, and vectorize the text data. This step includes:
- Removing unnecessary characters, stopwords, and punctuations.
- Converting text into lowercase.
- Tokenizing and stemming words.
- Converting text data into numerical format using TF-IDF Vectorization.
Step 6.1: Text Cleaning (Removing Punctuation, Stopwords, and Lowercasing)
We will first clean the text data by removing unnecessary characters and converting everything into lowercase.
import re
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove numbers
    text = re.sub(r"\d+", "", text)
    # Remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words]
    return " ".join(words)

# Apply cleaning function (the message content lives in the "text" column)
df["cleaned_message"] = df["text"].apply(clean_text)

# Display first 5 cleaned messages
df[["text", "cleaned_message"]].head()
Explanation:
- Lowercasing ensures consistency in text processing.
- Punctuation and numbers are removed as they do not contribute to spam detection.
- Stopwords (e.g., “the”, “is”, “and”) are removed to reduce noise.
Step 6.2: Tokenization and Stemming
Tokenization breaks sentences into words, and stemming reduces words to their root form.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    words = text.split()
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)

df["stemmed_message"] = df["cleaned_message"].apply(stem_words)

# Display first 5 stemmed messages
df[["cleaned_message", "stemmed_message"]].head()
Explanation:
- Tokenization breaks text into individual words.
- Stemming reduces words to their base form (e.g., “winning” → “win”).
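A couple of one-off calls show what the Porter stemmer does to individual words; this small sketch simply re-creates the same stemmer used above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("winning"))   # -> win
print(stemmer.stem("playing"))   # -> play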
Step 6.3: Converting Text Data to Numerical Format Using TF-IDF
To train a machine learning model, we need to convert text data into numerical vectors. We use TF-IDF (Term Frequency – Inverse Document Frequency) to weigh words based on importance.
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=3000)

# Transform the messages
X = tfidf.fit_transform(df["stemmed_message"]).toarray()

# Store feature names
features = tfidf.get_feature_names_out()

# Convert to DataFrame for better visualization
tfidf_df = pd.DataFrame(X, columns=features)

# Display first 5 rows
tfidf_df.head()
Explanation:
- TF-IDF assigns importance to words based on their frequency across messages.
- Converts text into numerical representation to be used by machine learning models.
- Limits features to 3000 words to avoid memory issues.
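To build intuition for the TF-IDF weighting described above, here is a tiny self-contained example on a three-message toy corpus (the messages are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Toy corpus: "free" and "prize" appear in two messages, "meeting" in only one
toy_corpus = ["win free prize now", "free entry claim prize", "see you at the meeting"]
toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_corpus)
print(pd.DataFrame(toy_matrix.toarray(), columns=toy_tfidf.get_feature_names_out()).round(2))
# Within a message, words shared across documents (e.g., "free", "prize") receive lower weights than rarer ones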
Summary of Data Preprocessing
✔ Text is cleaned (stopwords, punctuation, and numbers removed).
✔ Words are tokenized and stemmed to their root forms.
✔ TF-IDF transforms text into numerical format for model training.
Step 7: Model Selection and Training
Now that our text data is preprocessed and transformed into numerical format using TF-IDF, we can proceed with training machine learning models for spam detection.
We will use multiple machine learning models and compare their performance:
- Logistic Regression
- Naïve Bayes (MultinomialNB)
- Support Vector Machine (SVM – Linear Kernel)
- Random Forest Classifier
We will train, test, and evaluate each model to find the best-performing one.
Step 7.1: Splitting Data into Training and Testing Sets
We divide the dataset into 80% training and 20% testing to evaluate model performance.
from sklearn.model_selection import train_test_split

# Target variable (Spam: 1, Ham: 0) — already encoded in Step 4.1
y = df["label"]

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training Set Size: {X_train.shape[0]} samples")
print(f"Testing Set Size: {X_test.shape[0]} samples")
Explanation:
- train_test_split() splits the dataset into 80% training and 20% testing sets.
- stratify=y ensures class balance in the training and testing sets.
Step 7.2: Training Logistic Regression Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train model
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test)

# Model evaluation
print("Logistic Regression Performance:")
print(classification_report(y_test, y_pred_lr))
✔ Logistic Regression is fast and interpretable.
✔ Suitable for binary classification tasks like spam detection.
Step 7.3: Training Naïve Bayes Model (MultinomialNB)
from sklearn.naive_bayes import MultinomialNB

# Initialize and train model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predictions
y_pred_nb = nb_model.predict(X_test)

# Model evaluation
print("Naïve Bayes Performance:")
print(classification_report(y_test, y_pred_nb))
✔ Naïve Bayes is effective for text classification.
✔ Works well with TF-IDF-transformed data.
Step 7.4: Training Support Vector Machine (SVM) Model
from sklearn.svm import SVC

# Initialize and train model
svm_model = SVC(kernel="linear")
svm_model.fit(X_train, y_train)

# Predictions
y_pred_svm = svm_model.predict(X_test)

# Model evaluation
print("SVM Performance:")
print(classification_report(y_test, y_pred_svm))
✔ SVM (Linear Kernel) works well with high-dimensional data like text.
✔ Often provides better accuracy than Naïve Bayes.
Step 7.5: Training Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# Initialize and train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)

# Model evaluation
print("Random Forest Performance:")
print(classification_report(y_test, y_pred_rf))
✔ Random Forest is a strong ensemble method.
✔ It captures complex relationships in data.
Step 7.6: Comparing Model Performance
from sklearn.metrics import accuracy_score

# Accuracy comparison
models = {
    "Logistic Regression": accuracy_score(y_test, y_pred_lr),
    "Naïve Bayes": accuracy_score(y_test, y_pred_nb),
    "SVM": accuracy_score(y_test, y_pred_svm),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
}

# Display accuracy scores
for model, accuracy in models.items():
    print(f"{model}: {accuracy:.4f}")
Explanation:
- We compare the accuracy of all models to find the best one.
- Higher accuracy means better classification performance.
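Accuracy alone can be misleading on an imbalanced dataset like this one (far more ham than spam), so it is also worth inspecting the confusion matrix; a minimal sketch, assuming the Naïve Bayes predictions from above:

from sklearn.metrics import confusion_matrix

# Rows = actual class, columns = predicted class (0 = ham, 1 = spam)
cm = confusion_matrix(y_test, y_pred_nb)
print(cm)

# False positives (ham flagged as spam) are especially costly in a spam filter
print("Ham wrongly flagged as spam:", cm[0, 1])
print("Spam that slipped through:", cm[1, 0])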
Summary of Model Training:
✔ Logistic Regression: Fast and interpretable.
✔ Naïve Bayes: Simple and effective for text classification.
✔ SVM: Often provides higher accuracy.
✔ Random Forest: Strong ensemble method for complex data.
Step 8: Data Preprocessing
Before training a machine learning model, we need to preprocess the data by converting text into numerical representations. This involves:
- Removing Stopwords & Punctuation
- Tokenization & Lemmatization
- Vectorization using TF-IDF
8.1 Removing Stopwords & Punctuation
Spam emails often contain unnecessary words. We will remove them, along with punctuation, to clean the text.
Code Implementation
import nltk
from nltk.corpus import stopwords
import string

# Download stopwords dataset
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """Remove punctuation and stopwords from text."""
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    words = text.lower().split()  # Convert to lowercase and split
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    return ' '.join(words)

# Apply text cleaning function
df['clean_text'] = df['text'].apply(clean_text)

# Display cleaned text
df[['text', 'clean_text']].head()
Explanation
- stopwords.words('english') loads common words like “the”, “is”, and “and”, which do not add much meaning.
- string.punctuation provides punctuation marks like “.”, “!”, and “?” to remove.
- lower() and split() convert the text to lowercase and split it into words.
- List comprehension is used to remove stopwords efficiently.
- Finally, we apply this function to all email texts.
8.2 Tokenization & Lemmatization
Tokenization splits text into words, and lemmatization converts words to their base form (e.g., “running” → “run”).
Code Implementation
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    """Lemmatize each word in the text."""
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

# Apply lemmatization
df['lemmatized_text'] = df['clean_text'].apply(lemmatize_text)

# Display lemmatized text
df[['clean_text', 'lemmatized_text']].head()
Explanation
- WordNet Lemmatizer converts words to their root form.
- Each word in the cleaned text is lemmatized and joined back into a sentence.
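Note that WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is supplied, so verbs such as “running” are only reduced to their base form when pos='v' is passed. A small illustration:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))           # -> running (treated as a noun)
print(lemmatizer.lemmatize("running", pos="v"))  # -> run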
8.3 Converting Text to Vectors using TF-IDF
Since machine learning models work with numbers, we will convert text into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency).
Code Implementation
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000)  # Limit vocabulary size
X_tfidf = vectorizer.fit_transform(df['lemmatized_text'])

# Convert sparse matrix to DataFrame
X_tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

# Target variable
y = df['label']

# Display shape of processed data
print("Shape of feature matrix:", X_tfidf_df.shape)
print("Shape of target variable:", y.shape)
Explanation
- TF-IDF converts words into a weighted numerical representation.
- max_features=5000 limits the vocabulary to the 5000 most important words.
- fit_transform() applies vectorization to our dataset.
- The sparse matrix is converted into a DataFrame for better readability.
- The target variable (y) remains unchanged.
Step 9: Splitting the Dataset for Training and Testing
Before training our machine learning models, we need to split the dataset into two parts:
- Training Set (80%) – Used to train the model
- Testing Set (20%) – Used to evaluate model performance
9.1 Splitting the Data
We’ll use train_test_split from scikit-learn to divide our dataset.
Code Implementation
from sklearn.model_selection import train_test_split

# Splitting data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42, stratify=y)

# Print dataset sizes
print("Training Set Size:", X_train.shape)
print("Testing Set Size:", X_test.shape)
Explanation
- train_test_split randomly splits the dataset into training (80%) and testing (20%) sets.
- random_state=42 ensures reproducibility.
- stratify=y maintains the proportion of spam and non-spam emails in both sets.
Step 9: Model Training and Evaluation
We will train multiple machine learning models and compare their performances. The models we will use are:
- Logistic Regression
- Naïve Bayes (MultinomialNB)
- Support Vector Machine (SVM)
- Random Forest Classifier
- XGBoost Classifier
9.1 Training Logistic Regression
Code Implementation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train the model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predictions
y_pred_log = log_reg.predict(X_test)

# Model Evaluation
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print("Classification Report:\n", classification_report(y_test, y_pred_log))
9.2 Training Naïve Bayes Classifier
Code Implementation
from sklearn.naive_bayes import MultinomialNB

# Initialize and train the model
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Predictions
y_pred_nb = nb.predict(X_test)

# Model Evaluation
print("Naïve Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Classification Report:\n", classification_report(y_test, y_pred_nb))
9.3 Training Support Vector Machine (SVM)
Code Implementation
from sklearn.svm import SVC

# Initialize and train the model
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

# Predictions
y_pred_svm = svm.predict(X_test)

# Model Evaluation
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test, y_pred_svm))
9.4 Training Random Forest Classifier
Code Implementation
from sklearn.ensemble import RandomForestClassifier

# Initialize and train the model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_pred_rf = rf.predict(X_test)

# Model Evaluation
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))
9.5 Training XGBoost Classifier
Code Implementation
from xgboost import XGBClassifier

# Initialize and train the model
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb.fit(X_train, y_train)

# Predictions
y_pred_xgb = xgb.predict(X_test)

# Model Evaluation
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("Classification Report:\n", classification_report(y_test, y_pred_xgb))
Step 10: Comparing Model Performance
Now, let’s compare the accuracy of all models.
Code Implementation
# Store accuracy scores
model_scores = {
    "Logistic Regression": accuracy_score(y_test, y_pred_log),
    "Naïve Bayes": accuracy_score(y_test, y_pred_nb),
    "SVM": accuracy_score(y_test, y_pred_svm),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
    "XGBoost": accuracy_score(y_test, y_pred_xgb)
}

# Print model comparison
for model, score in model_scores.items():
    print(f"{model}: {score:.4f}")
Next Steps
Now that we have trained multiple models and compared their performances, the next step is to deploy the best-performing model using a Flask-based web app for user interaction.
The Spam Email Detection Using Machine Learning project demonstrates the power of AI in automating email classification and enhancing cybersecurity. We started by preprocessing the dataset, applying TF-IDF vectorization, and training multiple machine learning models. Our evaluation showed that models like Naïve Bayes and SVM perform exceptionally well in identifying spam emails with high accuracy.
To make this project practical, we also built a user-friendly front-end using Flask, allowing users to input email text and get instant spam classification results. This real-time spam detection system can be integrated into email platforms to prevent phishing, scams, and malware threats.
By implementing machine learning-based spam filters, organizations can significantly reduce email-based threats, improve email security, and provide users with a spam-free communication experience. This project highlights the importance of AI-driven automation in today’s cybersecurity landscape, making email communication more efficient and secure.
Next Steps:
Step 11: Building the Front-End with Flask for Spam Email Detection
Now that we have trained and evaluated multiple machine learning models, it’s time to build a Flask-based web application that allows users to input an email and get real-time spam detection results.
Flask is a lightweight web framework for Python that enables us to create a user-friendly interface for our spam detection model.
11.1 Installing Dependencies
Before we start coding, install Flask and other required libraries:
pip install flask flask-wtf wtforms pandas joblib
11.2 Creating the Flask App Structure
Create a project folder and organize it as follows:
spam_email_detection/
│── static/              # CSS, JS, images
│── templates/           # HTML files
│── model/               # Saved ML model
│── app.py               # Flask main application
│── spam_model.pkl       # Serialized trained model
│── vectorizer.pkl       # TF-IDF vectorizer
│── requirements.txt     # List of dependencies
11.3 Saving the Best Model & Vectorizer
Before deploying, we save the best-performing model and vectorizer using joblib.
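The variable best_model below stands for whichever classifier performed best in Step 10. One way to pick it, assuming the model_scores dictionary and the fitted classifiers from the previous steps, is sketched here (the trained_models mapping is introduced purely for illustration):

# Map model names to the fitted estimators trained earlier
trained_models = {
    "Logistic Regression": log_reg,
    "Naïve Bayes": nb,
    "SVM": svm,
    "Random Forest": rf,
    "XGBoost": xgb,
}

# Pick the model with the highest accuracy from model_scores
best_name = max(model_scores, key=model_scores.get)
best_model = trained_models[best_name]
print("Best model:", best_name)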
import joblib

# Save the best model
joblib.dump(best_model, "spam_model.pkl")

# Save the TF-IDF vectorizer
joblib.dump(vectorizer, "vectorizer.pkl")
11.4 Creating the Flask App (app.py)
Now, let’s create the main Flask app (app.py), which will load the saved model, process user inputs, and return predictions.
from flask import Flask, render_template, request
import joblib

# Initialize Flask app
app = Flask(__name__)

# Load the trained model and vectorizer
model = joblib.load("spam_model.pkl")
vectorizer = joblib.load("vectorizer.pkl")

@app.route("/", methods=["GET", "POST"])
def index():
    prediction = None
    if request.method == "POST":
        email_text = request.form["email_text"]

        # Transform the input email text
        email_vector = vectorizer.transform([email_text])

        # Predict spam or ham
        prediction = model.predict(email_vector)[0]
        if prediction == 1:
            prediction = "Spam"
        else:
            prediction = "Not Spam"

    return render_template("index.html", prediction=prediction)

if __name__ == "__main__":
    app.run(debug=True)
11.5 Creating the Front-End (index.html)
Inside the templates/ folder, create an index.html file for the web interface.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Spam Email Detector</title>
    <style>
        body { font-family: Arial, sans-serif; text-align: center; margin-top: 50px; }
        form { margin-top: 20px; }
        textarea { width: 50%; height: 100px; }
        .result { font-size: 20px; font-weight: bold; margin-top: 20px; }
    </style>
</head>
<body>
    <h1>Spam Email Detection Using Machine Learning</h1>
    <form method="POST">
        <label for="email_text">Enter Email Content:</label><br>
        <textarea name="email_text" required></textarea><br><br>
        <button type="submit">Check</button>
    </form>
    {% if prediction %}
    <div class="result">
        <p>Prediction: {{ prediction }}</p>
    </div>
    {% endif %}
</body>
</html>
11.6 Running the Flask App
To start the web app, run the following command:
python app.py
The application will start at http://127.0.0.1:5000/. Open it in your browser and enter an email text to check if it’s spam or not.
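You can also exercise the endpoint without a browser; a minimal sketch using the requests library (install it separately if needed) posts a sample email to the locally running app:

import requests

# Post a sample email to the running Flask app
response = requests.post(
    "http://127.0.0.1:5000/",
    data={"email_text": "Congratulations, you won a free prize! Click here."},
)
print(response.status_code)       # 200 if the app responded
print("Spam" in response.text)    # The rendered HTML contains the prediction text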
Conclusion
We have successfully built and deployed a Spam Email Detection web application using Flask. This app enables users to enter email text and receive real-time spam classification results. By integrating machine learning and Flask, we have created a practical solution for identifying spam emails.
Next, we can deploy this app to platforms like Heroku, Render, or AWS for public access. 🚀
Let me know in comments if you want to proceed with Step 12: Deploying the Flask App to Heroku!