Natural Language Processing with Disaster Tweets End to End Project

KANGKAN KALITA

Analyzing tweets to identify whether they describe real disasters is a critical application of Natural Language Processing (NLP). This project focuses on building a classification model that predicts whether a tweet refers to a real disaster. Using NLP techniques, we preprocess the data, extract meaningful features, and train machine learning models for prediction.

Objective:

  • Perform NLP on tweets to classify them as disaster-related or not.
  • Preprocess, clean, and analyze the dataset.
  • Build and evaluate machine learning models for classification.

Dataset Description:
We will use the Natural Language Processing with Disaster Tweets dataset, which contains:

  • train.csv: Training data with text, keyword, location, and target columns.
  • test.csv: Test data with text, keyword, and location columns.
  • sample_submission.csv: A sample submission file in the correct format.

Columns:

  • id: Unique identifier for each tweet.
  • text: The tweet’s text.
  • location: The location the tweet was sent from (may be blank).
  • keyword: A keyword from the tweet (may be blank).
  • target: 1 if the tweet is about a real disaster, 0 otherwise (only in train.csv).

Tools & Libraries:

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Scikit-learn
  • NLTK
  • TensorFlow/Keras (optional)

Explore More Such Data Science Projects from Here: https://thesmartcoder.com/blog/


1. Data Collection & Setup

Import Libraries and Load Dataset

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Load the datasets
train_data = pd.read_csv('/path/to/train.csv')  # Replace with actual file path
test_data = pd.read_csv('/path/to/test.csv')

# Preview the datasets
train_data.head()

Explanation:

  • The training data contains tweets and their target labels (target), while the test data contains tweets without target labels. These datasets will be used for model training and evaluation.
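
A quick sanity check, using the files loaded above, is to compare the sizes of the two datasets:

# Compare dataset sizes (train includes the extra target column)
print("Train shape:", train_data.shape)
print("Test shape:", test_data.shape)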

2. Exploratory Data Analysis (EDA)

Dataset Overview

# Display dataset information
train_data.info()
train_data.describe()
train_data.isnull().sum()

Target Distribution

# Plot the distribution of the target variable
sns.countplot(x='target', data=train_data, palette='viridis')
plt.title('Distribution of Target Variable')
plt.xlabel('Target (0 = Not Disaster, 1 = Disaster)')
plt.ylabel('Count')
plt.show()

Keyword Analysis

# Most frequent keywords
top_keywords = train_data['keyword'].value_counts().head(10)
top_keywords.plot(kind='bar', color='blue')
plt.title('Top 10 Keywords')
plt.xlabel('Keyword')
plt.ylabel('Count')
plt.show()

Tweet Length Analysis

# Length of tweets
train_data['tweet_length'] = train_data['text'].apply(len)
sns.histplot(train_data['tweet_length'], kde=True, color='green')
plt.title('Tweet Length Distribution')
plt.xlabel('Tweet Length')
plt.ylabel('Frequency')
plt.show()

Explanation:

  • EDA helps us understand the dataset structure, missing values, and important features like keyword, location, and text.
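
A further check that can sharpen this picture is how strongly individual keywords are associated with the target. The following sketch (using the train_data frame from above) shows the share of disaster tweets for the most frequent keywords:

# Share of disaster tweets among the 10 most frequent keywords
keyword_rate = (
    train_data.groupby('keyword')['target']
    .agg(['mean', 'count'])
    .sort_values('count', ascending=False)
    .head(10)
)
print(keyword_rate)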

3. Data Cleaning

Handle Missing Values

# Fill missing keywords and locations
train_data['keyword'] = train_data['keyword'].fillna('no_keyword')
train_data['location'] = train_data['location'].fillna('unknown')

# Drop rows with missing text
train_data = train_data.dropna(subset=['text'])

Text Cleaning

# Import NLTK packages
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Build the stopword set and lemmatizer once so they are not recreated for every tweet
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Text preprocessing function
def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove special characters
    text = text.lower()  # Convert to lowercase
    tokens = word_tokenize(text)  # Tokenize
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    tokens = [lemmatizer.lemmatize(word) for word in tokens]  # Lemmatize
    return ' '.join(tokens)

# Apply text cleaning
train_data['clean_text'] = train_data['text'].apply(clean_text)

Explanation:

  • Missing values in keyword and location are handled. Text preprocessing includes removing URLs, special characters, stopwords, and lemmatizing tokens for cleaner text.
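
To verify the cleaning step, it helps to compare one tweet before and after preprocessing. A minimal example using the clean_text function defined above:

# Compare a raw tweet with its cleaned version
sample_tweet = train_data['text'].iloc[0]
print("Raw:    ", sample_tweet)
print("Cleaned:", clean_text(sample_tweet))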

4. Feature Engineering

TF-IDF Vectorization

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=5000)

# Transform text data
X = tfidf.fit_transform(train_data['clean_text']).toarray()
y = train_data['target']

Train-Test Split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Explanation:

  • TF-IDF converts text into numerical features by measuring the importance of words in the dataset. The dataset is split for training and testing.
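
It is also worth peeking at what the vectorizer actually learned. A small sketch using the fitted tfidf object above (older scikit-learn versions expose get_feature_names() instead of get_feature_names_out()):

# Inspect the TF-IDF feature space
print("Feature matrix shape:", X.shape)
print("Sample features:", tfidf.get_feature_names_out()[:10])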

5. Model Building & Evaluation

Logistic Regression

# Train logistic regression model (max_iter raised so the solver converges on the TF-IDF features)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Make predictions
y_pred_logreg = logreg.predict(X_test)

# Evaluate the model
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))

Random Forest Classifier

# Train Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

Explanation:

  • Both Logistic Regression and Random Forest models are trained and evaluated. The classification report provides precision, recall, and F1 scores.
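
Since confusion_matrix is already imported, a heatmap of the Random Forest predictions gives a quick view of where the model mixes up the two classes. A small sketch using the variables defined above:

# Visualize the confusion matrix for the Random Forest predictions
cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Not Disaster', 'Disaster'],
            yticklabels=['Not Disaster', 'Disaster'])
plt.title('Random Forest Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()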

6. Submission

Generate Predictions for Test Data

# Preprocess test data
test_data['clean_text'] = test_data['text'].apply(clean_text)
X_test_final = tfidf.transform(test_data['clean_text']).toarray()

# Predict using the best model
test_data['target'] = rf_model.predict(X_test_final)

# Save the submission file
test_data[['id', 'target']].to_csv('submission.csv', index=False)

Explanation:

  • The test dataset is preprocessed and predictions are generated using the best-performing model. A submission file is created in the required format.
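
Before uploading, a quick look at the saved file confirms it has the expected two columns and a sensible class balance (assuming submission.csv was written as above):

# Preview the saved submission file
submission = pd.read_csv('submission.csv')
print(submission.head())
print(submission['target'].value_counts())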

7. Conclusion

This Natural Language Processing with Disaster Tweets project demonstrates how to preprocess text, engineer features, and build machine learning models to classify disaster-related tweets. Among the models tested, Random Forest Classifier achieved the highest accuracy.


Try experimenting with deep learning models like LSTM or BERT for better predictions. Share your insights and improve disaster response strategies!
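
As a starting point for the deep learning route, here is a minimal LSTM sketch with Keras (listed as optional in the tools above). The vocabulary size, sequence length, layer sizes, and epoch count are illustrative assumptions, not tuned values:

# Minimal LSTM sketch (illustrative settings, not tuned)
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Turn the cleaned tweets into padded integer sequences
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(train_data['clean_text'])
sequences = tokenizer.texts_to_sequences(train_data['clean_text'])
padded = pad_sequences(sequences, maxlen=50, padding='post')

# Embedding -> LSTM -> sigmoid output for binary classification
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(padded, train_data['target'], validation_split=0.2, epochs=3, batch_size=32)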

