Fake News Detection Using Machine Learning
In today’s digital age, fake news spreads rapidly, influencing public opinion and shaping societal narratives. With the rise of social media and online platforms, distinguishing between real and misleading information has become increasingly challenging. Machine learning, however, offers a powerful way to combat fake news by analyzing patterns, detecting anomalies, and identifying unreliable sources. In this blog post, we explore how machine learning techniques can enhance fake news detection, the key algorithms involved, and the impact of AI-driven fact-checking on digital literacy.

Objective of the Project
The goal of this project is to build a machine learning model that can detect fake news based on textual and source-based features. Using the provided dataset, we will analyze patterns in real and fake news, preprocess text data, extract meaningful features, and train multiple machine learning models to classify news articles as real or fake. Finally, we will deploy the model using a Flask-based web application for user interaction. Let’s get started with our Fake News Detection Using Machine Learning project.
Project Details
This project will be completed in the following structured steps:
- Data Collection & Understanding – Overview of the dataset and its features.
- Exploratory Data Analysis (EDA) – Analyzing data distribution, patterns, and visualizations.
- Data Preprocessing – Cleaning, handling missing values, and transforming data.
- Feature Engineering – Converting text data into numerical features (TF-IDF, Count Vectorizer, etc.).
- Model Selection & Training – Implementing and comparing multiple ML models.
- Model Evaluation – Assessing performance using metrics like accuracy, precision, recall, and F1-score.
- Hyperparameter Tuning – Optimizing the best model for improved accuracy.
- Deploying the Model – Creating a Flask-based web application for real-time fake news detection.
- Conclusion & Future Enhancements – Summarizing the project and discussing possible improvements.
Dataset Overview
The dataset consists of the following columns:
- title – The headline of the news article.
- news_url – The URL of the news article.
- source_domain – The domain from which the news is sourced.
- tweet_num – The number of times the article was tweeted.
- real – The target column (1 → Real News, 0 → Fake News).
Since the dataset contains textual and numerical features, we will apply NLP techniques and feature engineering methods to extract meaningful insights.
Dataset Link: Fake News Detection Dataset
Tools & Libraries Used
We will use the following libraries for different steps in the project:
- Data Handling & Analysis: Pandas, NumPy
- Exploratory Data Analysis (EDA): Matplotlib, Seaborn
- Text Preprocessing: NLTK, re (Regex), Scikit-learn
- Feature Extraction: CountVectorizer, TF-IDF Vectorizer
- Machine Learning Models: Logistic Regression, Naïve Bayes, Random Forest, Support Vector Machine (SVM)
- Model Evaluation: Accuracy Score, Precision, Recall, F1-score, Confusion Matrix
- Web Deployment: Flask, HTML, CSS
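We will also use wordcloud and joblib in later steps. If you are setting up a fresh environment, everything can be installed in one go — a typical command, assuming pip (exact package versions are up to you):

```bash
pip install pandas numpy matplotlib seaborn nltk scikit-learn wordcloud flask joblib
```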
Step 1: Importing the Required Libraries
Let’s start by importing the necessary libraries for data handling and basic exploration.
```python
# Importing Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Download NLTK stopwords
nltk.download('stopwords')
```
Step 2: Loading and Exploring the Dataset
Now that we have imported the required libraries, let’s load the dataset, check its structure, and explore the first few rows.
2.1 Loading the Dataset
Given the column names (title, news_url, source_domain, tweet_num, real), we assume the dataset is stored as a CSV file. Update the file path in the code below to point to your copy of the dataset.
```python
# Load the dataset
df = pd.read_csv("fake_news_dataset.csv")  # Update with the correct file path

# Display the first five rows
df.head()
```
2.2 Checking the Dataset Structure
Let’s examine the dataset’s structure, including column names, data types, and any missing values.
```python
# Check basic information about the dataset
df.info()
```
This will output details like:
- Number of non-null entries
- Data types of each column
- Memory usage
2.3 Checking for Missing Values
Missing values can affect our model performance, so we need to check if any columns contain null values.
```python
# Check for missing values
df.isnull().sum()
```
If there are missing values, we will decide whether to drop or impute them in later steps.
2.4 Exploring Target Variable (Real vs. Fake News Distribution)
We need to check the distribution of real and fake news to understand class imbalance.
```python
# Plot the distribution of real vs. fake news
plt.figure(figsize=(6, 4))
sns.countplot(x=df['real'], palette='viridis')
plt.xlabel("News Type (1 = Real, 0 = Fake)")
plt.ylabel("Count")
plt.title("Distribution of Real vs. Fake News")
plt.show()
```
This will help us see whether our dataset is balanced or biased toward one category.
2.5 Checking for Duplicate Entries
Duplicate rows can lead to biased results, so let’s identify and remove them if needed.
```python
# Count duplicate rows
duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

# Remove duplicate rows if necessary
df = df.drop_duplicates()
```
Next Step: Exploratory Data Analysis (EDA)
We have now loaded and explored the dataset structure. Next, we will perform Exploratory Data Analysis (EDA) to gain deeper insights into patterns and relationships within the data.
Step 3: Exploratory Data Analysis (EDA)
Now that we have loaded and explored the dataset structure, we will perform Exploratory Data Analysis (EDA) to gain deeper insights. EDA helps in understanding patterns, relationships, and potential data issues before applying machine learning models.
3.1 Statistical Summary of the Dataset
We start by checking the statistical properties of numerical columns.
```python
# Get summary statistics of numerical columns
df.describe()
```
🔹 What this tells us:
- tweet_num: We can check the distribution of tweet counts associated with each news article.
- Other numerical insights that might help in feature selection.
3.2 Visualizing Fake vs. Real News Distribution
We previously checked the count of fake vs. real news. Let’s visualize it using a pie chart.
```python
# Plot pie chart of fake vs. real news distribution
# sort_index() puts class 0 (Fake) first so the labels line up with the slices
plt.figure(figsize=(6, 6))
df['real'].value_counts().sort_index().plot.pie(autopct='%1.1f%%',
                                                labels=["Fake News", "Real News"],
                                                colors=['red', 'green'],
                                                startangle=90,
                                                explode=[0.05, 0.05])
plt.title("Fake vs. Real News Distribution")
plt.ylabel("")
plt.show()
```
🔹 Insights: If the dataset is highly imbalanced (e.g., 80% real news and 20% fake news), we may need to apply data balancing techniques such as oversampling or undersampling.
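If balancing does turn out to be necessary, here is a minimal upsampling sketch using scikit-learn's resample utility. This is an illustration rather than a required step of the tutorial, and in practice it should be applied only to the training split so duplicated rows never leak into the test set:

```python
import pandas as pd
from sklearn.utils import resample

# Separate the two classes (df and the 'real' column come from the steps above)
df_real = df[df['real'] == 1]
df_fake = df[df['real'] == 0]

# Identify minority and majority classes by size
if len(df_fake) < len(df_real):
    minority, majority = df_fake, df_real
else:
    minority, majority = df_real, df_fake

# Upsample the minority class with replacement to match the majority count
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced['real'].value_counts())
```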
3.3 Word Cloud for Fake vs. Real News Titles
A word cloud is a great way to visualize frequently appearing words in news titles.
```python
from wordcloud import WordCloud

# Generate word clouds for Fake and Real news
fake_titles = ' '.join(df[df['real'] == 0]['title'].dropna())
real_titles = ' '.join(df[df['real'] == 1]['title'].dropna())

plt.figure(figsize=(14, 6))

# Fake News Word Cloud
plt.subplot(1, 2, 1)
wordcloud_fake = WordCloud(width=500, height=300, background_color='black').generate(fake_titles)
plt.imshow(wordcloud_fake, interpolation="bilinear")
plt.axis('off')
plt.title("Fake News Word Cloud")

# Real News Word Cloud
plt.subplot(1, 2, 2)
wordcloud_real = WordCloud(width=500, height=300, background_color='black').generate(real_titles)
plt.imshow(wordcloud_real, interpolation="bilinear")
plt.axis('off')
plt.title("Real News Word Cloud")

plt.show()
```
🔹 Insights:
- Common words in fake news vs. real news can be identified.
- If fake news contains more sensational words, this might help feature engineering.
3.4 Analyzing Most Common Words in Fake and Real News
Instead of just visualizing, let’s extract the most common words.
```python
from collections import Counter
import string

# Function to preprocess text: remove punctuation and lowercase
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

# Fill missing titles with an empty string so apply/join never hit NaN
df['clean_title'] = df['title'].fillna('').apply(preprocess_text)

# Tokenize words for Fake and Real news
fake_words = ' '.join(df[df['real'] == 0]['clean_title']).split()
real_words = ' '.join(df[df['real'] == 1]['clean_title']).split()

# Get most common words
fake_common = Counter(fake_words).most_common(20)
real_common = Counter(real_words).most_common(20)

# Convert to DataFrames
fake_df = pd.DataFrame(fake_common, columns=['Word', 'Frequency'])
real_df = pd.DataFrame(real_common, columns=['Word', 'Frequency'])

# Plot
plt.figure(figsize=(14, 6))

# Fake News Common Words
plt.subplot(1, 2, 1)
sns.barplot(y=fake_df['Word'], x=fake_df['Frequency'], palette='Reds_r')
plt.title("Most Common Words in Fake News")
plt.xlabel("Frequency")
plt.ylabel("Word")

# Real News Common Words
plt.subplot(1, 2, 2)
sns.barplot(y=real_df['Word'], x=real_df['Frequency'], palette='Greens_r')
plt.title("Most Common Words in Real News")
plt.xlabel("Frequency")
plt.ylabel("Word")

plt.show()
```
🔹 Insights:
- This helps identify patterns in fake vs. real news.
- Sensational words (e.g., “shocking”, “clickbait”) might be more common in fake news.
3.5 Analyzing Tweet Count Distribution
The tweet_num column represents how often a news article is shared on Twitter. Let’s analyze its distribution.
```python
# Distribution of tweet counts
plt.figure(figsize=(8, 5))
sns.histplot(df['tweet_num'], bins=50, kde=True, color="purple")
plt.xlabel("Number of Tweets")
plt.ylabel("Frequency")
plt.title("Distribution of Tweet Counts")
plt.show()
```
🔹 Insights:
- If fake news articles are shared more frequently, this may help in classification.
- We can apply feature scaling if needed (a minimal sketch follows below).
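As an example of what that scaling could look like — a minimal sketch, hypothetical in the sense that it only matters if tweet_num is later fed to a model alongside the text features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tweet counts are heavy-tailed, so log-transform first (log1p handles zeros),
# then standardize to zero mean and unit variance
df['tweet_num_log'] = np.log1p(df['tweet_num'])
df['tweet_num_scaled'] = StandardScaler().fit_transform(df[['tweet_num_log']])
```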
Next Steps
EDA has helped us understand text-based and numerical patterns in the dataset. The next step is Text Preprocessing and Feature Engineering, where we will:
✅ Convert text into a numerical format using TF-IDF or Word Embeddings
✅ Handle missing values and clean data
✅ Apply vectorization techniques
Step 4: Text Preprocessing and Feature Engineering
Now that we have performed Exploratory Data Analysis (EDA), we will preprocess the text data and engineer useful features for machine learning. Since our dataset contains news titles, we must convert this textual data into a numerical format before training our models.
4.1 Handling Missing Values
Before proceeding, let’s check if there are any missing values in our dataset.
```python
# Check for missing values
df.isnull().sum()
```
🔹 Why is this important?
- If we find missing values in title, we can either drop them or replace them with placeholder text.
Let’s handle missing values now.
```python
# Fill missing titles with a placeholder
df['title'] = df['title'].fillna("unknown")
```
4.2 Converting Text to Lowercase
Text data should be converted to lowercase to maintain consistency.
```python
# Convert text to lowercase
df['title'] = df['title'].str.lower()
```
4.3 Removing Punctuation and Special Characters
Punctuation marks and special characters do not contribute much to meaning, so we remove them.
```python
import string

# Function to remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df['title'] = df['title'].apply(remove_punctuation)
```
4.4 Removing Stopwords
Stopwords (like “the”, “is”, “in”) are commonly occurring words that do not add much value.
```python
from nltk.corpus import stopwords
import nltk

# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stop_words])

df['title'] = df['title'].apply(remove_stopwords)
```
4.5 Lemmatization
Lemmatization reduces words to their base form, improving generalization.
```python
from nltk.stem import WordNetLemmatizer

# Download WordNet
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Function to lemmatize text
def lemmatize_text(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df['title'] = df['title'].apply(lemmatize_text)
```
4.6 Converting Text into Numerical Features (TF-IDF Vectorization)
Since machine learning models work with numbers, we must convert text data into numerical representations. TF-IDF (Term Frequency – Inverse Document Frequency) is a popular method.
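Conceptually, TF-IDF gives each word t in a title d a weight that grows with how often the word appears in that title and shrinks with how many titles contain the word overall. In its classic form (scikit-learn uses a smoothed, L2-normalized variant of the same idea):

$$\text{tfidf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where N is the total number of documents and df(t) is the number of documents containing t.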
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer (limit to 5000 words for efficiency)
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Transform text into numerical format
X = tfidf_vectorizer.fit_transform(df['title']).toarray()

# Convert to DataFrame
feature_names = tfidf_vectorizer.get_feature_names_out()
X_df = pd.DataFrame(X, columns=feature_names)

# Show sample transformed data
X_df.head()
```
🔹 What this does:
- Converts the title column into a numerical matrix.
- Each column represents a unique word, and the value shows its importance.
- This is now ready for model training! 🎯
4.7 Preparing Final Feature Set
Now that our text is processed, let’s prepare the final feature set.
```python
# Our features (X) are the transformed text data
X = X_df

# Our target variable (y) is the "real" column
y = df['real']
```
Next Steps
We have successfully prepared our dataset for machine learning! The next step is to split the data and train multiple machine learning models for classification.
Step 5: Training Machine Learning Models
Now that we have preprocessed the dataset and converted the text into numerical features using TF-IDF vectorization, we can proceed with training multiple machine learning models to classify news as real or fake.
5.1 Splitting the Dataset into Training and Testing Sets
Before training our models, we must split the dataset into training and testing sets. The training set is used to train the models, while the testing set evaluates their performance.
```python
from sklearn.model_selection import train_test_split

# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)

# Display the shapes of the datasets
print("Training Set Shape:", X_train.shape)
print("Testing Set Shape:", X_test.shape)
```
🔹 Why 80-20 Split?
- 80% of the data is used to train the model so that it learns patterns effectively.
- 20% is reserved for testing to evaluate performance.
5.2 Training Multiple Machine Learning Models
We will train multiple models and compare their performances to select the best one.
5.2.1 Logistic Regression
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train Logistic Regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Make predictions
y_pred_logreg = logreg.predict(X_test)

# Evaluate model
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print("Classification Report:\n", classification_report(y_test, y_pred_logreg))
```
🔹 Why Logistic Regression?
- Simple and efficient for binary classification problems.
- Performs well on linearly separable data.
5.2.2 Support Vector Machine (SVM)
```python
from sklearn.svm import SVC

# Initialize and train SVM model with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Make predictions
y_pred_svm = svm_model.predict(X_test)

# Evaluate model
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test, y_pred_svm))
```
🔹 Why SVM?
- Works well for text classification tasks.
- Good at handling high-dimensional data.
5.2.3 Random Forest Classifier
```python
from sklearn.ensemble import RandomForestClassifier

# Initialize and train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate model
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))
```
🔹 Why Random Forest?
- Handles non-linearity better than Logistic Regression.
- Works well with large feature sets.
5.2.4 Naïve Bayes Classifier
```python
from sklearn.naive_bayes import MultinomialNB

# Initialize and train Naïve Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Make predictions
y_pred_nb = nb_model.predict(X_test)

# Evaluate model
print("Naïve Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Classification Report:\n", classification_report(y_test, y_pred_nb))
```
🔹 Why Naïve Bayes?
- Works exceptionally well for text classification problems.
- Assumes word independence, which simplifies computations.
5.3 Comparing Model Performance
To determine the best model, let’s compare their accuracy scores.
```python
# Collect accuracy scores for all models
model_accuracies = {
    "Logistic Regression": accuracy_score(y_test, y_pred_logreg),
    "SVM": accuracy_score(y_test, y_pred_svm),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
    "Naïve Bayes": accuracy_score(y_test, y_pred_nb)
}

# Display model performance
for model, accuracy in model_accuracies.items():
    print(f"{model}: {accuracy:.4f}")
```
Next Steps
Now that we have trained multiple models, the next step is to select the best-performing model and fine-tune it using hyperparameter optimization.
Step 6: Hyperparameter Tuning for Best Model
Now that we have trained multiple models and evaluated their performance, we will fine-tune the best-performing model using hyperparameter optimization.
6.1 Selecting the Best Model
Based on the accuracy scores obtained in Step 5, let’s assume Logistic Regression and Random Forest performed well. We will fine-tune these models using Grid Search CV and Randomized Search CV.
6.2 Hyperparameter Tuning for Logistic Regression
We will tune parameters like:
- C (Regularization strength): Controls the amount of regularization applied.
- Solver: Determines the optimization algorithm used.
```python
from sklearn.model_selection import GridSearchCV

# Define parameter grid for Logistic Regression
param_grid_logreg = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs']
}

# Initialize Logistic Regression
logreg = LogisticRegression()

# Perform Grid Search CV
grid_search_logreg = GridSearchCV(logreg, param_grid_logreg, cv=5, scoring='accuracy')
grid_search_logreg.fit(X_train, y_train)

# Print best parameters and accuracy
print("Best Parameters for Logistic Regression:", grid_search_logreg.best_params_)
print("Best Accuracy:", grid_search_logreg.best_score_)
```
6.3 Hyperparameter Tuning for Random Forest
For Random Forest, we will tune parameters like:
- n_estimators: Number of decision trees.
- max_depth: Maximum depth of each tree.
- min_samples_split: Minimum samples required to split a node.
```python
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define parameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize Random Forest
rf_model = RandomForestClassifier(random_state=42)

# Perform Randomized Search CV
random_search_rf = RandomizedSearchCV(rf_model, param_distributions=param_grid_rf,
                                      n_iter=10, cv=5, scoring='accuracy',
                                      random_state=42, n_jobs=-1)
random_search_rf.fit(X_train, y_train)

# Print best parameters and accuracy
print("Best Parameters for Random Forest:", random_search_rf.best_params_)
print("Best Accuracy:", random_search_rf.best_score_)
```
6.4 Retraining the Best Model
After obtaining the best hyperparameters, we retrain the best model.
```python
# Retrain Logistic Regression with best parameters
best_logreg = LogisticRegression(C=grid_search_logreg.best_params_['C'],
                                 solver=grid_search_logreg.best_params_['solver'])
best_logreg.fit(X_train, y_train)

# Evaluate the optimized model
y_pred_best_logreg = best_logreg.predict(X_test)
print("Optimized Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_best_logreg))
print("Classification Report:\n", classification_report(y_test, y_pred_best_logreg))
```
OR (if Random Forest performs better)
```python
# Retrain Random Forest with best parameters
best_rf = RandomForestClassifier(**random_search_rf.best_params_, random_state=42)
best_rf.fit(X_train, y_train)

# Evaluate the optimized model
y_pred_best_rf = best_rf.predict(X_test)
print("Optimized Random Forest Accuracy:", accuracy_score(y_test, y_pred_best_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_best_rf))
```
Next Steps
Now that we have fine-tuned our best model, the next step is to evaluate it further using cross-validation and confusion matrix before proceeding to deployment.
Step 7: Model Evaluation
Now that we have fine-tuned our best model, we will evaluate it further using cross-validation, confusion matrix, classification report, and ROC-AUC curve to measure its effectiveness in detecting fake news.
7.1 Cross-Validation for Model Reliability
Cross-validation helps us test the model’s stability by dividing the dataset into multiple subsets and training the model on different portions.
```python
from sklearn.model_selection import cross_val_score

# Perform cross-validation on the best model
cv_scores = cross_val_score(best_rf, X_train, y_train, cv=5, scoring='accuracy')

# Print cross-validation results
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Accuracy:", np.mean(cv_scores))
```
🔹 Why Cross-Validation?
- Ensures that the model is not overfitting.
- Evaluates performance on multiple data splits instead of just one.
7.2 Confusion Matrix and Classification Report
The confusion matrix helps visualize true positives, false positives, true negatives, and false negatives. The classification report provides detailed metrics such as precision, recall, and F1-score.
```python
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Get predictions on the test set
y_pred_best_rf = best_rf.predict(X_test)

# Compute confusion matrix (rows/columns follow label order: 0 = Fake, 1 = Real)
cm = confusion_matrix(y_test, y_pred_best_rf)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Fake', 'Real'], yticklabels=['Fake', 'Real'])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred_best_rf))
```
🔹 Key Metrics Explained:
- Precision: Out of all predicted fake news, how many were actually fake?
- Recall: Out of all actual fake news, how many did we correctly identify?
- F1-score: A balance between precision and recall (high values indicate a good model); see the formula below.
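For reference, the F1-score is the harmonic mean of precision and recall:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$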
7.3 ROC-AUC Curve for Model Performance
The Receiver Operating Characteristic (ROC) curve shows the trade-off between true positive rate (TPR) and false positive rate (FPR) at different thresholds.
```python
from sklearn.metrics import roc_curve, auc

# Get probability scores for the positive class (1 = Real)
y_probs = best_rf.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='blue', label='AUC = %0.2f' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC-AUC Curve")
plt.legend()
plt.show()

print("AUC Score:", roc_auc)
```
🔹 Interpreting AUC Score:
- Closer to 1.0 → Excellent Model
- Around 0.5 → Random guessing
- Below 0.5 → Poor model
Next Steps
Now that we have thoroughly evaluated our model, the next step is Step 8: Saving and Deploying the Model with Flask.
Step 8: Saving and Deploying the Model with Flask
Now that we have built and evaluated our Fake News Detection model, the next step is to save the trained model and deploy it using Flask, so users can interact with it through a web interface.
8.1 Saving the Model
To make the model accessible for deployment, we save it using joblib or pickle.
```python
import joblib

# Save the trained model
joblib.dump(best_rf, 'fake_news_model.pkl')

# Save the fitted TF-IDF vectorizer as well — the Flask app needs it to
# transform user input exactly as the training data was transformed
joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.pkl')

print("Model and vectorizer saved successfully!")
```
🔹 Why Save the Model?
- Saves time by avoiding re-training.
- Enables real-world deployment in applications.
8.2 Creating a Flask Web App for Deployment
Flask will help us create a simple web-based interface where users can input a news title, and the model will predict whether it’s real or fake.
📌 Steps to Deploy Using Flask:
- Create a Flask application.
- Load the trained model.
- Build a front-end form for user input.
- Make predictions based on user input.
- Display the results.
Let’s start by creating an app.py file with the following Flask code:
8.3 Writing the Flask Backend (app.py)
```python
from flask import Flask, request, render_template
import joblib

# Load the saved model
model = joblib.load("fake_news_model.pkl")

# Load the TF-IDF vectorizer used during training
vectorizer = joblib.load("tfidf_vectorizer.pkl")

# Initialize Flask app
app = Flask(__name__)

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Get user input
    news_title = request.form['news_title']

    # Transform the input text using the TF-IDF vectorizer
    news_vector = vectorizer.transform([news_title])

    # Make prediction (1 = Real News, 0 = Fake News, matching the dataset labels)
    prediction = model.predict(news_vector)[0]
    result = "Real News" if prediction == 1 else "Fake News"

    return render_template('index.html', prediction_text=f"The news article is: {result}")

if __name__ == '__main__':
    app.run(debug=True)
```
8.4 Designing a Simple HTML Front-End (templates/index.html)
This HTML form allows users to input a news title and get a prediction.
```html
<!DOCTYPE html>
<html>
<head>
    <title>Fake News Detection</title>
    <style>
        body { font-family: Arial, sans-serif; text-align: center; }
        form { margin-top: 20px; }
        input { width: 60%; padding: 10px; }
        button { padding: 10px; background: blue; color: white; border: none; }
    </style>
</head>
<body>
    <h1>Fake News Detection</h1>
    <form action="/predict" method="post">
        <input type="text" name="news_title" placeholder="Enter News Title" required>
        <button type="submit">Check</button>
    </form>
    {% if prediction_text %}
        <h2>{{ prediction_text }}</h2>
    {% endif %}
</body>
</html>
```
8.5 Running the Flask App
Once the files are created, run the following command in your terminal:
```bash
python app.py
```
Then, open http://127.0.0.1:5000/ in your browser to test the fake news detection model.
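You can also exercise the endpoint programmatically — a quick sanity-check sketch, assuming the server above is running locally and the requests package is installed:

```python
import requests

# Post a sample headline to the /predict endpoint (the app responds with rendered HTML)
resp = requests.post("http://127.0.0.1:5000/predict",
                     data={"news_title": "Breaking: scientists make a shocking discovery"})
print(resp.status_code)  # 200 means the form was processed
print("Real News" in resp.text or "Fake News" in resp.text)  # True once a prediction is returned
```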
Next Steps
✅ Now that we have successfully built and deployed the Fake News Detection Web App, the final step is to write a project summary and conclusion.
Step 9: Conclusion
Project Summary
In this project, we built an end-to-end Fake News Detection system using Machine Learning, following a structured approach from data preprocessing to model deployment. We used TF-IDF vectorization for feature extraction and trained multiple classification models, selecting Random Forest as the best-performing model. The trained model was then deployed using Flask, allowing users to input a news title and receive real-time predictions on whether the news is real or fake.
Key Takeaways
✔ Data Preprocessing: Cleaned and processed text data, removed stopwords, and converted text to numerical format using TF-IDF.
✔ Model Training & Evaluation: Trained and compared multiple models, choosing Random Forest for deployment due to its superior accuracy.
✔ Web Deployment: Built a Flask web app with a simple HTML front-end for user interaction.
✔ Real-world Application: This project can be extended to analyze full articles, incorporate more features, and improve accuracy with deep learning models (LSTMs, BERT, etc.).
Future Improvements
🔹 Expand Feature Set: Utilize metadata such as publication date, author credibility, and external sources.
🔹 Improve Model Performance: Experiment with deep learning models like LSTMs, Transformers, or BERT for better accuracy (see the sketch after this list).
🔹 Deploy Online: Host the model on a cloud platform (AWS, Google Cloud, or Heroku) for public access.
🔹 Build a Fake News Database: Collect and store flagged fake news for further analysis and model improvement.
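As a taste of the transformer route, here is a minimal inference sketch using the Hugging Face transformers library. The checkpoint shown is a generic sentiment model used purely to illustrate the API shape — an actual upgrade would fine-tune a BERT-style checkpoint on this dataset’s title and real columns:

```python
from transformers import pipeline

# Illustrative only: this checkpoint classifies sentiment, not fake news
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("Shocking! You won't believe what this celebrity did"))
```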
This marks the successful completion of our Fake News Detection Project! 🎯 🚀
The battle against fake news is ongoing, but machine learning has proven to be a game-changer in detecting and mitigating misinformation. By leveraging advanced algorithms, NLP techniques, and AI-driven fact-checking, we can enhance the accuracy of news verification and reduce the spread of false information. While challenges such as bias in training data and evolving deception tactics remain, continuous improvements in machine learning models and collaboration with fact-checking organizations can strengthen the fight against disinformation. As technology evolves, the integration of AI in fake news detection using machine learning will play a crucial role in promoting digital literacy and ensuring a more informed society.