Fake News Detection Using Machine Learning

KANGKAN KALITA

In today’s digital age, fake news spreads rapidly, influencing public opinion and shaping societal narratives. With the rise of social media and online platforms, distinguishing real from misleading information has become increasingly difficult. Machine learning offers a practical way to flag likely fake news by learning patterns in headlines and source data. In this blog post, we explore how machine learning techniques can support fake news detection, the key algorithms involved, and the impact of AI-driven fact-checking on digital literacy.

Objective of the Project

The goal of this project is to build a machine learning model that can detect fake news based on textual and source-based features. Using the provided dataset, we will analyze patterns in real and fake news, preprocess text data, extract meaningful features, and train multiple machine learning models to classify news articles as real or fake. Finally, we will deploy the model using a Flask-based web application for user interaction. Let’s get started.

Project Details

This project will be completed in the following structured steps:

  1. Data Collection & Understanding – Overview of the dataset and its features.
  2. Exploratory Data Analysis (EDA) – Analyzing data distribution, patterns, and visualizations.
  3. Data Preprocessing – Cleaning, handling missing values, and transforming data.
  4. Feature Engineering – Converting text data into numerical features (TF-IDF, Count Vectorizer, etc.).
  5. Model Selection & Training – Implementing and comparing multiple ML models.
  6. Model Evaluation – Assessing performance using metrics like accuracy, precision, recall, and F1-score.
  7. Hyperparameter Tuning – Optimizing the best model for improved accuracy.
  8. Deploying the Model – Creating a Flask-based web application for real-time fake news detection.
  9. Conclusion & Future Enhancements – Summarizing the project and discussing possible improvements.

Dataset Overview

The dataset consists of the following columns:

  • title – The headline of the news article.
  • news_url – The URL of the news article.
  • source_domain – The domain from which the news is sourced.
  • tweet_num – The number of times the article was tweeted.
  • real – The target column (1 → Real News, 0 → Fake News).

Since the dataset contains textual and numerical features, we will apply NLP techniques and feature engineering methods to extract meaningful insights.

Dataset Link: Fake News detection Dataset or Click Here

Tools & Libraries Used

We will use the following libraries for different steps in the project:

  • Data Handling & Analysis: Pandas, NumPy
  • Exploratory Data Analysis (EDA): Matplotlib, Seaborn
  • Text Preprocessing: NLTK, re (Regex), Scikit-learn
  • Feature Extraction: CountVectorizer, TF-IDF Vectorizer
  • Machine Learning Models: Logistic Regression, Naïve Bayes, Random Forest, Support Vector Machine (SVM)
  • Model Evaluation: Accuracy Score, Precision, Recall, F1-score, Confusion Matrix
  • Web Deployment: Flask, HTML, CSS

Step 1: Importing the Required Libraries

Let’s start by importing the necessary libraries for data handling and basic exploration.

# Importing Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Download NLTK stopwords
nltk.download('stopwords')

Step 2: Loading and Exploring the Dataset

Now that we have imported the required libraries, let’s load the dataset, check its structure, and explore the first few rows.

2.1 Loading the Dataset

The dataset has the columns title, news_url, source_domain, tweet_num, and real, and we assume it is stored in CSV format. Update the file path below to match where you saved the file.

# Load the dataset
df = pd.read_csv("fake_news_dataset.csv")  # Update with the correct file path

# Display the first five rows
df.head()

2.2 Checking the Dataset Structure

Let’s examine the dataset’s structure, including column names, data types, and any missing values.

# Check basic information about the dataset
df.info()

This will output details like:

  • Number of non-null entries
  • Data types of each column
  • Memory usage

2.3 Checking for Missing Values

Missing values can affect our model performance, so we need to check if any columns contain null values.

# Check for missing values
df.isnull().sum()

If there are missing values, we will decide whether to drop or impute them in later steps.
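As a minimal sketch of the two options, the example below drops or imputes missing values on a toy DataFrame. It reuses the tutorial's column names, but the rows are invented for illustration, and the "unknown" placeholder is just one reasonable choice.

```python
import pandas as pd

# Toy frame with the tutorial's column names; the rows are invented for illustration.
toy = pd.DataFrame({
    "title": ["Budget passes senate", None, "Aliens endorse candidate"],
    "source_domain": ["example.com", "example.org", None],
    "tweet_num": [12, 3, 40],
    "real": [1, 1, 0],
})

# Option 1: drop rows whose title is missing (the title is our main feature).
dropped = toy.dropna(subset=["title"])

# Option 2: keep every row and impute a placeholder instead.
imputed = toy.fillna({"title": "unknown", "source_domain": "unknown"})

print("Rows after dropping:", len(dropped))
print("Missing values after imputing:", int(imputed.isnull().sum().sum()))
```

Dropping is the simpler choice when only a handful of rows are affected. Imputing keeps the rows, at the cost of adding an artificial token to the text.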

2.4 Exploring Target Variable (Real vs. Fake News Distribution)

We need to check the distribution of real and fake news to understand class imbalance.

# Plot the distribution of real vs. fake news
plt.figure(figsize=(6,4))
sns.countplot(x=df['real'], palette='viridis')
plt.xlabel("News Type (1 = Real, 0 = Fake)")
plt.ylabel("Count")
plt.title("Distribution of Real vs. Fake News")
plt.show()

This will help us see whether our dataset is balanced or biased toward one category.

2.5 Checking for Duplicate Entries

Duplicate rows can lead to biased results, so let’s identify and remove them if needed.

# Count duplicate rows
duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

# Remove duplicate rows if necessary
df = df.drop_duplicates()

Next Step: Exploratory Data Analysis (EDA)

We have now loaded and explored the dataset structure. Next, we will perform Exploratory Data Analysis (EDA) to gain deeper insights into patterns and relationships within the data.

Step 3: Exploratory Data Analysis (EDA)

Now that we have loaded and explored the dataset structure, we will perform Exploratory Data Analysis (EDA) to gain deeper insights. EDA helps in understanding patterns, relationships, and potential data issues before applying machine learning models.

3.1 Statistical Summary of the Dataset

We start by checking the statistical properties of numerical columns.

# Get summary statistics of numerical columns
df.describe()

🔹 What this tells us:

  • tweet_num: We can check the distribution of tweet counts associated with each news article.
  • Other numerical insights that might help in feature selection.

3.2 Visualizing Fake vs. Real News Distribution

We previously checked the count of fake vs. real news. Let’s visualize it using a pie chart.

# Plot pie chart of fake vs real news distribution
plt.figure(figsize=(6, 6))
# Sort by label so the slices are always ordered 0 (fake), 1 (real) and match the labels below
df['real'].value_counts().sort_index().plot.pie(autopct='%1.1f%%', labels=["Fake News", "Real News"], colors=['red', 'green'], startangle=90, explode=[0.05, 0.05])
plt.title("Fake vs. Real News Distribution")
plt.ylabel("")
plt.show()

🔹 Insights: If the dataset is highly imbalanced (e.g., 80% real news and 20% fake news), we may need to apply data balancing techniques such as oversampling or undersampling.
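Random oversampling can be sketched in a few lines of pandas. The helper below, `oversample_minority`, is a hypothetical function written for this illustration and the toy rows are invented. In practice you would apply it to the training split only, so duplicated rows never leak into the test set.

```python
import pandas as pd

def oversample_minority(frame: pd.DataFrame, target: str, seed: int = 42) -> pd.DataFrame:
    """Randomly duplicate minority-class rows until every class matches the largest one."""
    counts = frame[target].value_counts()
    largest = counts.max()
    parts = []
    for label, count in counts.items():
        rows = frame[frame[target] == label]
        if count < largest:
            # Sample with replacement to fill the gap to the largest class.
            extra = rows.sample(n=largest - count, replace=True, random_state=seed)
            rows = pd.concat([rows, extra])
        parts.append(rows)
    # Shuffle so the duplicated rows are not grouped together.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Invented example: 8 real (1) and 2 fake (0) articles.
toy = pd.DataFrame({"title": list("abcdefghij"), "real": [1] * 8 + [0] * 2})
balanced = oversample_minority(toy, "real")
print(balanced["real"].value_counts().to_dict())
```

Undersampling is the mirror image: sample the majority class down to the size of the minority class, which discards data but avoids duplicates.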

3.3 Word Cloud for Fake vs. Real News Titles

A word cloud is a great way to visualize frequently appearing words in news titles.

from wordcloud import WordCloud

# Generate word clouds for Fake and Real news
fake_titles = ' '.join(df[df['real'] == 0]['title'].dropna())
real_titles = ' '.join(df[df['real'] == 1]['title'].dropna())

plt.figure(figsize=(14, 6))

# Fake News Word Cloud
plt.subplot(1, 2, 1)
wordcloud_fake = WordCloud(width=500, height=300, background_color='black').generate(fake_titles)
plt.imshow(wordcloud_fake, interpolation="bilinear")
plt.axis('off')
plt.title("Fake News Word Cloud")

# Real News Word Cloud
plt.subplot(1, 2, 2)
wordcloud_real = WordCloud(width=500, height=300, background_color='black').generate(real_titles)
plt.imshow(wordcloud_real, interpolation="bilinear")
plt.axis('off')
plt.title("Real News Word Cloud")

plt.show()

🔹 Insights:

  • Common words in fake news vs. real news can be identified.
  • If fake news contains more sensational words, this might help feature engineering.

3.4 Analyzing Most Common Words in Fake and Real News

Instead of just visualizing, let’s extract the most common words.

from collections import Counter
import string

# Function to preprocess text: remove punctuation and lowercase
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

# Fill missing titles with an empty string first, so no NaN values reach the join below
df['clean_title'] = df['title'].fillna('').apply(preprocess_text)

# Tokenize words for Fake and Real news
fake_words = ' '.join(df[df['real'] == 0]['clean_title']).split()
real_words = ' '.join(df[df['real'] == 1]['clean_title']).split()

# Get most common words
fake_common = Counter(fake_words).most_common(20)
real_common = Counter(real_words).most_common(20)

# Convert to DataFrame
fake_df = pd.DataFrame(fake_common, columns=['Word', 'Frequency'])
real_df = pd.DataFrame(real_common, columns=['Word', 'Frequency'])

# Plot
plt.figure(figsize=(14, 6))

# Fake News Common Words
plt.subplot(1, 2, 1)
sns.barplot(y=fake_df['Word'], x=fake_df['Frequency'], palette='Reds_r')
plt.title("Most Common Words in Fake News")
plt.xlabel("Frequency")
plt.ylabel("Word")

# Real News Common Words
plt.subplot(1, 2, 2)
sns.barplot(y=real_df['Word'], x=real_df['Frequency'], palette='Greens_r')
plt.title("Most Common Words in Real News")
plt.xlabel("Frequency")
plt.ylabel("Word")

plt.show()

🔹 Insights:

  • This helps identify patterns in fake vs. real news.
  • Sensational words (e.g., “shocking”, “clickbait”) might be more common in fake news.

3.5 Analyzing Tweet Count Distribution

The tweet_num column represents how often a news article is shared on Twitter. Let’s analyze its distribution.

# Distribution of tweet counts
plt.figure(figsize=(8, 5))
sns.histplot(df['tweet_num'], bins=50, kde=True, color="purple")
plt.xlabel("Number of Tweets")
plt.ylabel("Frequency")
plt.title("Distribution of Tweet Counts")
plt.show()

🔹 Insights:

  • If fake news articles are shared more frequently, this may help in classification.
  • We can apply feature scaling if needed.
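Share counts are typically right-skewed, so one common option is a log transform followed by min-max scaling. The sketch below uses invented counts and NumPy only. Whether tweet_num in this dataset is actually skewed is something the histogram above should confirm first.

```python
import numpy as np

# Invented tweet counts with a long right tail.
tweet_num = np.array([0, 1, 2, 5, 40, 1200])

# log1p(x) = log(1 + x) compresses large counts and is defined at zero.
tweet_log = np.log1p(tweet_num)

# Min-max scale to [0, 1] so the feature is on a similar scale to TF-IDF weights.
tweet_scaled = (tweet_log - tweet_log.min()) / (tweet_log.max() - tweet_log.min())
print(np.round(tweet_scaled, 3))
```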

Next Steps

EDA has helped us understand text-based and numerical patterns in the dataset. The next step is Text Preprocessing and Feature Engineering, where we will:
✅ Convert text into a numerical format using TF-IDF or Word Embeddings
✅ Handle missing values and clean data
✅ Apply vectorization techniques

Step 4: Text Preprocessing and Feature Engineering

Now that we have performed Exploratory Data Analysis (EDA), we will preprocess the text data and engineer useful features for machine learning. Since our dataset contains news titles, we must convert this textual data into a numerical format before training our models.

4.1 Handling Missing Values

Before proceeding, let’s check if there are any missing values in our dataset.

# Check for missing values
df.isnull().sum()

🔹 Why is this important?

  • If we find missing values in title, we can either drop them or replace them with placeholder text.

Let’s handle missing values now.

# Fill missing titles with a placeholder
df['title'] = df['title'].fillna("unknown")

4.2 Converting Text to Lowercase

Text data should be converted to lowercase to maintain consistency.

# Convert text to lowercase
df['title'] = df['title'].str.lower()

4.3 Removing Punctuation and Special Characters

Punctuation marks and special characters do not contribute much to meaning, so we remove them.

import string

# Function to remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df['title'] = df['title'].apply(remove_punctuation)

4.4 Removing Stopwords

Stopwords (like “the”, “is”, “in”) are commonly occurring words that do not add much value.

from nltk.corpus import stopwords
import nltk

# Download stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stop_words])

df['title'] = df['title'].apply(remove_stopwords)

4.5 Lemmatization

Lemmatization reduces words to their base form, improving generalization.

from nltk.stem import WordNetLemmatizer

# Download WordNet
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Function to lemmatize text
def lemmatize_text(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df['title'] = df['title'].apply(lemmatize_text)

4.6 Converting Text into Numerical Features (TF-IDF Vectorization)

Since machine learning models work with numbers, we must convert text data into numerical representations. TF-IDF (Term Frequency – Inverse Document Frequency) is a popular method.

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Limit to 5000 words for efficiency

# Transform text into numerical format
X = tfidf_vectorizer.fit_transform(df['title']).toarray()

# Convert to DataFrame
feature_names = tfidf_vectorizer.get_feature_names_out()
X_df = pd.DataFrame(X, columns=feature_names)

# Show sample transformed data
X_df.head()

🔹 What this does:

  • Converts the title column into a numerical matrix.
  • Each column represents a unique word, and the value shows its importance.
  • This is now ready for model training! 🎯
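To make the "importance" value concrete, here is a small hand-written TF-IDF in plain Python. It follows the formula scikit-learn documents for its default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row). The `tfidf` function and the three example titles are illustrative only, and tokenisation is simplified to whitespace splitting.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows, one dict of weights per document."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenised for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}
    rows = []
    for tokens in tokenised:
        # Term frequency (raw count) times inverse document frequency.
        weights = {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        rows.append({w: v / norm for w, v in weights.items()})
    return rows

docs = ["shocking secret revealed", "senate passes budget", "shocking budget vote"]
for row in tfidf(docs):
    print({w: round(v, 3) for w, v in row.items()})
```

In the first title, "secret" and "revealed" appear in only one document, so they get a higher weight than "shocking", which appears in two.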

4.7 Preparing Final Feature Set

Now that our text is processed, let’s prepare the final feature set.

# Our features (X) are the transformed text data
X = X_df

# Our target variable (y) is the "real" column
y = df['real']

Next Steps

We have successfully prepared our dataset for machine learning! The next step is to split the data and train multiple machine learning models for classification.

Step 5: Training Machine Learning Models

Now that we have preprocessed the dataset and converted the text into numerical features using TF-IDF vectorization, we can proceed with training multiple machine learning models to classify news as real or fake.

5.1 Splitting the Dataset into Training and Testing Sets

Before training our models, we must split the dataset into training and testing sets. The training set is used to train the models, while the testing set evaluates their performance.

from sklearn.model_selection import train_test_split

# Split data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Display the shapes of the datasets
print("Training Set Shape:", X_train.shape)
print("Testing Set Shape:", X_test.shape)

🔹 Why 80-20 Split?

  • 80% of the data is used to train the model so that it learns patterns effectively.
  • 20% is reserved for testing to evaluate performance.
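The `stratify=y` argument in the split above keeps the class ratio the same in both sets. A quick sketch with invented labels (80 real, 20 fake) shows the effect:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented labels: 80 real (1) and 20 fake (0).
labels = [1] * 80 + [0] * 20
items = list(range(len(labels)))

_, _, y_train, y_test = train_test_split(
    items, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratify, both splits keep the original 80/20 class ratio.
print("Train:", Counter(y_train))
print("Test:", Counter(y_test))
```

Without stratification, a small test set can end up with very few fake examples by chance, which makes the test metrics noisy.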

5.2 Training Multiple Machine Learning Models

We will train multiple models and compare their performances to select the best one.

5.2.1 Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train Logistic Regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Make predictions
y_pred_logreg = logreg.predict(X_test)

# Evaluate model
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print("Classification Report:\n", classification_report(y_test, y_pred_logreg))

🔹 Why Logistic Regression?

  • Simple and efficient for binary classification problems.
  • Performs well on linearly separable data.

5.2.2 Support Vector Machine (SVM)

from sklearn.svm import SVC

# Initialize and train SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Make predictions
y_pred_svm = svm_model.predict(X_test)

# Evaluate model
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test, y_pred_svm))

🔹 Why SVM?

  • Works well for text classification tasks.
  • Good at handling high-dimensional data.

5.2.3 Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

# Initialize and train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate model
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

🔹 Why Random Forest?

  • Handles non-linearity better than Logistic Regression.
  • Works well with large feature sets.

5.2.4 Naïve Bayes Classifier

from sklearn.naive_bayes import MultinomialNB

# Initialize and train Naïve Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Make predictions
y_pred_nb = nb_model.predict(X_test)

# Evaluate model
print("Naïve Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Classification Report:\n", classification_report(y_test, y_pred_nb))

🔹 Why Naïve Bayes?

  • Works exceptionally well for text classification problems.
  • Assumes word independence, which simplifies computations.

5.3 Comparing Model Performance

To determine the best model, let’s compare their accuracy scores.

# Print accuracy scores for all models
model_accuracies = {
    "Logistic Regression": accuracy_score(y_test, y_pred_logreg),
    "SVM": accuracy_score(y_test, y_pred_svm),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
    "Naïve Bayes": accuracy_score(y_test, y_pred_nb)
}

# Display model performance
for model, accuracy in model_accuracies.items():
    print(f"{model}: {accuracy:.4f}")
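One caveat about the comparison: in Step 4 the TF-IDF vectorizer was fitted on the whole dataset before the split, so vocabulary and document frequencies from the test titles influence the training features. A common way to avoid this is to put the vectorizer and the classifier in one scikit-learn pipeline and fit it on raw training text. The sketch below uses a few invented titles, so the scores themselves are meaningless. It only shows the pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented titles and labels (1 = real, 0 = fake), for illustration only.
train_titles = ["senate passes budget bill", "shocking secret cure revealed",
                "city council approves new park", "you will not believe this trick"]
train_labels = [1, 0, 1, 0]
test_titles = ["council passes park budget", "shocking trick revealed"]
test_labels = [1, 0]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
}

scores = {}
for name, clf in models.items():
    # The vectorizer lives inside the pipeline, so it is fitted on training text only.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(train_titles, train_labels)
    scores[name] = pipe.score(test_titles, test_labels)

for name, acc in scores.items():
    print(f"{name}: {acc:.4f}")
```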

Next Steps

Now that we have trained multiple models, the next step is to select the best-performing model and fine-tune it using hyperparameter optimization.

Step 6: Hyperparameter Tuning for Best Model

Now that we have trained multiple models and evaluated their performance, we will fine-tune the best-performing model using hyperparameter optimization.

6.1 Selecting the Best Model

Based on the accuracy scores obtained in Step 5, let’s assume Logistic Regression and Random Forest performed well. We will fine-tune these models using Grid Search CV and Randomized Search CV.

6.2 Hyperparameter Tuning for Logistic Regression

We will tune parameters like:

  • C (Regularization strength): Controls the amount of regularization applied.
  • Solver: Determines the optimization algorithm used.

from sklearn.model_selection import GridSearchCV

# Define parameter grid for Logistic Regression
param_grid_logreg = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs']
}

# Initialize Logistic Regression
logreg = LogisticRegression()

# Perform Grid Search CV
grid_search_logreg = GridSearchCV(logreg, param_grid_logreg, cv=5, scoring='accuracy')
grid_search_logreg.fit(X_train, y_train)

# Print best parameters and accuracy
print("Best Parameters for Logistic Regression:", grid_search_logreg.best_params_)
print("Best Accuracy:", grid_search_logreg.best_score_)

6.3 Hyperparameter Tuning for Random Forest

For Random Forest, we will tune parameters like:

  • n_estimators: Number of decision trees.
  • max_depth: Maximum depth of each tree.
  • min_samples_split: Minimum samples required to split a node.

from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define parameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize Random Forest
rf_model = RandomForestClassifier(random_state=42)

# Perform Randomized Search CV
random_search_rf = RandomizedSearchCV(rf_model, param_distributions=param_grid_rf, n_iter=10, cv=5, scoring='accuracy', random_state=42, n_jobs=-1)
random_search_rf.fit(X_train, y_train)

# Print best parameters and accuracy
print("Best Parameters for Random Forest:", random_search_rf.best_params_)
print("Best Accuracy:", random_search_rf.best_score_)
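To see why a randomized search is used for Random Forest, it helps to count how many model fits each search performs. The sketch below reuses the two parameter grids from this step and counts the combinations with scikit-learn's `ParameterGrid`. The variable names are just for this illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid_logreg = {"C": [0.01, 0.1, 1, 10, 100], "solver": ["liblinear", "lbfgs"]}
param_grid_rf = {
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [None, 10, 20, 30, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5

# Grid search tries every combination, once per fold.
logreg_candidates = len(ParameterGrid(param_grid_logreg))
print("Grid search fits:", logreg_candidates * cv_folds)

# Randomized search samples n_iter combinations from the full grid.
rf_candidates = len(ParameterGrid(param_grid_rf))
n_iter = 10
print("Randomized search fits:", n_iter * cv_folds, "of a possible", rf_candidates * cv_folds)
```

Both searches also refit the best configuration once on the full training set by default, which is not included in these counts.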

6.4 Retraining the Best Model

After obtaining the best hyperparameters, we retrain the best model.

# Retrain Logistic Regression with best parameters
best_logreg = LogisticRegression(C=grid_search_logreg.best_params_['C'], solver=grid_search_logreg.best_params_['solver'])
best_logreg.fit(X_train, y_train)

# Evaluate the optimized model
y_pred_best_logreg = best_logreg.predict(X_test)
print("Optimized Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_best_logreg))
print("Classification Report:\n", classification_report(y_test, y_pred_best_logreg))

OR (if Random Forest performs better)

# Retrain Random Forest with best parameters
best_rf = RandomForestClassifier(**random_search_rf.best_params_, random_state=42)
best_rf.fit(X_train, y_train)

# Evaluate the optimized model
y_pred_best_rf = best_rf.predict(X_test)
print("Optimized Random Forest Accuracy:", accuracy_score(y_test, y_pred_best_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_best_rf))

Next Steps

Now that we have fine-tuned our best model, the next step is to evaluate it further using cross-validation and confusion matrix before proceeding to deployment.

Step 7: Model Evaluation

Now that we have fine-tuned our best model, we will evaluate it further using cross-validation, confusion matrix, classification report, and ROC-AUC curve to measure its effectiveness in detecting fake news.

7.1 Cross-Validation for Model Reliability

Cross-validation helps us test the model’s stability by dividing the dataset into multiple subsets and training the model on different portions.

from sklearn.model_selection import cross_val_score

# Perform cross-validation on the best model
cv_scores = cross_val_score(best_rf, X_train, y_train, cv=5, scoring='accuracy')

# Print cross-validation results
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Accuracy:", np.mean(cv_scores))

🔹 Why Cross-Validation?

  • Helps detect overfitting by checking that performance holds up across several held-out folds.
  • Evaluates performance on multiple data splits instead of just one.
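The idea behind k-fold splitting can be sketched without any library. `k_fold_indices` below is a hypothetical helper that cuts the row indices into k contiguous folds and uses each fold once for validation. For classifiers, `cross_val_score` with an integer `cv` uses stratified folds, so its actual splits differ from this simplified version.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"fold {fold}: validate on {val_idx}, train on {len(train_idx)} rows")
```

Every row is used for validation exactly once and for training in the other k - 1 folds.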

7.2 Confusion Matrix and Classification Report

The confusion matrix helps visualize true positives, false positives, true negatives, and false negatives. The classification report provides detailed metrics such as precision, recall, and F1-score.

from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Get predictions on test set
y_pred_best_rf = best_rf.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred_best_rf)

# Visualize the confusion matrix
plt.figure(figsize=(6,4))
# confusion_matrix orders classes by label value: 0 (Fake) first, then 1 (Real)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Fake', 'Real'], yticklabels=['Fake', 'Real'])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred_best_rf))

🔹 Key Metrics Explained:

  • Precision: Out of all articles predicted as fake, how many were actually fake? In the report, read this from the row for class 0.
  • Recall: Out of all actual fake articles, how many did we correctly identify? This is also in the row for class 0.
  • F1-score: The harmonic mean of precision and recall; it is high only when both are high.
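These three metrics follow directly from the confusion-matrix counts. The sketch below computes them for one class from invented counts. `precision_recall_f1` is a hypothetical helper, not part of scikit-learn.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented counts, treating "fake" as the class of interest:
# 40 fake articles flagged correctly, 10 real articles flagged by mistake,
# and 20 fake articles missed.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here the model is right about 80% of the articles it flags, but it only catches two thirds of the fake ones, and the F1-score lands between the two.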

7.3 ROC-AUC Curve for Model Performance

The Receiver Operating Characteristic (ROC) curve shows the trade-off between true positive rate (TPR) and false positive rate (FPR) at different thresholds.

from sklearn.metrics import roc_curve, auc

# Get probability scores for ROC curve
y_probs = best_rf.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, color='blue', label='AUC = %0.2f' % roc_auc)
plt.plot([0,1], [0,1], linestyle='--', color='grey')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC-AUC Curve")
plt.legend()
plt.show()

print("AUC Score:", roc_auc)

🔹 Interpreting AUC Score:

  • Closer to 1.0 → Excellent model
  • Around 0.5 → Random guessing
  • Below 0.5 → Worse than random guessing (often a sign that the labels or scores are inverted)
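Another way to read the AUC: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way on invented scores. `auc_by_ranking` is a hypothetical helper for illustration, and `roc_curve` plus `auc` remain the practical choice for real data.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0]
perfect = auc_by_ranking(y_true, [0.9, 0.8, 0.7, 0.2, 0.1])    # every positive outranks every negative
one_swap = auc_by_ranking(y_true, [0.9, 0.3, 0.6, 0.5, 0.1])   # one of six pairs is out of order
inverted = auc_by_ranking(y_true, [0.1, 0.2, 0.3, 0.8, 0.9])   # every pair is out of order
print(perfect, round(one_swap, 3), inverted)
```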

Next Steps

Now that we have thoroughly evaluated our model, the next step is Step 8: Saving and Deploying the Model with Flask.

Step 8: Saving and Deploying the Model with Flask

Now that we have built and evaluated our Fake News Detection model, the next step is to save the trained model and deploy it using Flask, so users can interact with it through a web interface.

8.1 Saving the Model

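To make the model accessible for deployment, we save it using joblib (pickle also works). We also save the fitted TF-IDF vectorizer, because the Flask app loads it to transform new titles the same way as the training data.

import joblib

# Save the trained model
joblib.dump(best_rf, 'fake_news_model.pkl')

# Save the fitted TF-IDF vectorizer from Step 4 (app.py loads it as tfidf_vectorizer.pkl)
joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.pkl')

print("Model and vectorizer saved successfully!")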

🔹 Why Save the Model?

  • Saves time by avoiding re-training.
  • Enables real-world deployment in applications.
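An alternative worth considering is to save the vectorizer and classifier together as one scikit-learn pipeline, so a single file holds everything needed to predict from raw text. This is a sketch under assumptions: the training titles are invented, the file name `fake_news_pipeline.pkl` is hypothetical, and Logistic Regression stands in for whichever model you selected.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training data (1 = real, 0 = fake), for illustration only.
titles = ["senate passes budget bill", "shocking secret cure revealed",
          "city council approves new park", "you will not believe this trick"]
labels = [1, 0, 1, 0]

# Bundle the vectorizer with the classifier and fit them together.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(titles, labels)

# Round-trip through disk in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fake_news_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

sample = ["council approves budget"]
same = bool((pipeline.predict(sample) == restored.predict(sample)).all())
print("Predictions match after reload:", same)
```

Load saved models with the same scikit-learn version that created them, since pickled estimators are not guaranteed to work across versions.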

8.2 Creating a Flask Web App for Deployment

Flask will help us create a simple web-based interface where users can input a news title, and the model will predict whether it’s real or fake.

📌 Steps to Deploy Using Flask:

  1. Create a Flask application.
  2. Load the trained model.
  3. Build a front-end form for user input.
  4. Make predictions based on user input.
  5. Display the results.

Let’s start by creating an app.py file with the following Flask code:

8.3 Writing Flask Backend (app.py)

from flask import Flask, request, render_template
import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the saved model
model = joblib.load("fake_news_model.pkl")

# Initialize Flask app
app = Flask(__name__)

# Load the TF-IDF vectorizer used for training
vectorizer = joblib.load("tfidf_vectorizer.pkl")

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        # Get user input
        news_title = request.form['news_title']

        # Transform the input text using the TF-IDF vectorizer
        news_vector = vectorizer.transform([news_title])

        # Make prediction
        prediction = model.predict(news_vector)[0]

        # Display result (the dataset encodes 1 = real, 0 = fake)
        result = "Real News" if prediction == 1 else "Fake News"

        return render_template('index.html', prediction_text=f"The news article is: {result}")

if __name__ == '__main__':
    app.run(debug=True)
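The training titles in Step 4 were lowercased, stripped of punctuation and stopwords, and lemmatized, so text typed into the form should ideally go through the same cleaning before `vectorizer.transform`. Below is a minimal sketch of such a step. `clean_title` and `label_for` are hypothetical helpers, and the small hard-coded stopword set stands in for NLTK's English list so the example runs without downloads. Lemmatization is omitted for brevity.

```python
import string

# Small illustrative stopword set; the tutorial itself uses NLTK's English list.
STOP_WORDS = {"the", "is", "in", "a", "an", "of", "to", "and"}

def clean_title(text):
    """Lowercase, strip punctuation, and drop stopwords, mirroring the training-time cleaning."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

def label_for(prediction):
    """Map the model output to a label using the dataset's encoding (1 = real, 0 = fake)."""
    return "Real News" if prediction == 1 else "Fake News"

print(clean_title("SHOCKING: The Secret Is Out!"))
print(label_for(0))
```

Inside `predict()`, this would mean calling `vectorizer.transform([clean_title(news_title)])` instead of passing the raw form input.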

8.4 Designing a Simple HTML Front-End (templates/index.html)

This HTML form allows users to input a news title and get a prediction.

<!DOCTYPE html>
<html>
<head>
    <title>Fake News Detection</title>
    <style>
        body { font-family: Arial, sans-serif; text-align: center; }
        form { margin-top: 20px; }
        input { width: 60%; padding: 10px; }
        button { padding: 10px; background: blue; color: white; border: none; }
    </style>
</head>
<body>
    <h1>Fake News Detection</h1>
    <form action="/predict" method="post">
        <input type="text" name="news_title" placeholder="Enter News Title" required>
        <button type="submit">Check</button>
    </form>

    {% if prediction_text %}
        <h2>{{ prediction_text }}</h2>
    {% endif %}
</body>
</html>

8.5 Running the Flask App

Once the files are created, run the following command in your terminal:

python app.py

Then, open http://127.0.0.1:5000/ in your browser to test the fake news detection model.

Next Steps

✅ Now that we have successfully built and deployed the Fake News Detection Web App, the final step is to write a project summary and conclusion.

Step 9: Conclusion

Project Summary

In this project, we built an end-to-end Fake News Detection system using Machine Learning, following a structured approach from data preprocessing to model deployment. We used TF-IDF vectorization for feature extraction and trained multiple classification models, selecting Random Forest as the best-performing model. The trained model was then deployed using Flask, allowing users to input a news title and receive real-time predictions on whether the news is real or fake.

Key Takeaways

Data Preprocessing: Cleaned and processed text data, removed stopwords, and converted text to numerical format using TF-IDF.

Model Training & Evaluation: Trained and compared multiple models, then tuned and deployed Random Forest. If another model scores higher on your data, deploy that one instead.

Web Deployment: Built a Flask web app with a simple HTML front-end for user interaction.

Real-world Application: This project can be extended to analyze full articles, incorporate more features, and improve accuracy with deep learning models (LSTMs, BERT, etc.).

Future Improvements

🔹 Expand Feature Set: Utilize metadata such as publication date, author credibility, and external sources.
🔹 Improve Model Performance: Experiment with deep learning models like LSTMs, Transformers, or BERT for better accuracy.
🔹 Deploy Online: Host the model on a cloud platform (AWS, Google Cloud, or Heroku) for public access.
🔹 Build a Fake News Database: Collect and store flagged fake news for further analysis and model improvement.

This marks the successful completion of our Fake News Detection Project! 🎯 🚀

The battle against fake news is ongoing, but machine learning has proven to be a game-changer in detecting and mitigating misinformation. By leveraging advanced algorithms, NLP techniques, and AI-driven fact-checking, we can enhance the accuracy of news verification and reduce the spread of false information. While challenges such as bias in training data and evolving deception tactics remain, continuous improvements in machine learning models and collaboration with fact-checking organizations can strengthen the fight against disinformation. As technology evolves, the integration of AI in fake news detection using machine learning will play a crucial role in promoting digital literacy and ensuring a more informed society.
