Understanding and Implementing Naïve Bayes Algorithm for Email Spam Detection

Hussnain Akbar
May 16, 2024

Concept of Naïve Bayes Theorem

Naïve Bayes is a supervised machine learning algorithm for classification, based on Bayes' theorem. It calculates the probability of an event (here, the class of an email) from its related features. The term “naïve” refers to the assumption that the feature variables are independent of each other given the class. In other words, the model treats the features as having no relationship with one another.
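
To make the independence assumption concrete, here is a minimal sketch that applies Bayes' theorem by hand to a two-word email. All probabilities are made-up illustrative numbers, not estimates from any dataset, and the function name `naive_bayes_posterior` is hypothetical.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods, words):
    """Return P(class | words) under the naive independence assumption."""
    # Unnormalised score per class: P(class) * product of P(word | class)
    scores = {
        label: priors[label] * prod(likelihoods[label][w] for w in words)
        for label in priors
    }
    # Divide by the total so the class probabilities sum to 1
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Illustrative (assumed) numbers, chosen only for this example
priors = {"spam": 0.3, "not spam": 0.7}
likelihoods = {
    "spam": {"free": 0.6, "win": 0.5},
    "not spam": {"free": 0.1, "win": 0.05},
}

posterior = naive_bayes_posterior(priors, likelihoods, ["free", "win"])
print(posterior)
```

Because the word probabilities are simply multiplied together, each word contributes to the result on its own, which is exactly what the “naïve” assumption allows.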

In this article, I will explain how to implement the algorithm in Python in a simple manner. The Scikit-Learn library provides five types of Naïve Bayes models: Gaussian, Multinomial, Bernoulli, Complement, and Categorical. These will be explained further in other articles. For now, I will consider a simple model and generate a sample data set to build a firm grasp of how it works.
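
As a quick orientation, the sketch below imports the five variants from `sklearn.naive_bayes` and prints the kind of feature each one is usually paired with. The one-line descriptions are my own rough summaries, not official definitions.

```python
from sklearn.naive_bayes import (
    BernoulliNB,
    CategoricalNB,
    ComplementNB,
    GaussianNB,
    MultinomialNB,
)

# Rough, informal summaries of typical use (assumptions for orientation only)
variants = {
    GaussianNB: "continuous features assumed to be normally distributed",
    MultinomialNB: "counts or frequencies, such as word counts",
    BernoulliNB: "binary features, such as word present / absent",
    ComplementNB: "a multinomial variant aimed at imbalanced classes",
    CategoricalNB: "discrete categorical features",
}

for cls, typical_use in variants.items():
    print(f"{cls.__name__}: {typical_use}")
```

The rest of this article uses `MultinomialNB`, since the emails are turned into word-frequency features.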

# Essential libraries required for this model
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Create a random dataset
np.random.seed(42) # For reproducibility

# Generate random words for features (words in emails)
word_list = ['discount', 'offer', 'sale', 'free', 'click', 'buy', 'win', 'money', 'gift', 'limited']

# Generate random emails
num_emails = 1000
emails = []
labels = []
for _ in range(num_emails):
    email = ' '.join(np.random.choice(word_list, size=np.random.randint(5, 15)))
    emails.append(email)
    # Assign labels (spam or not spam)
    labels.append(np.random.choice(['spam', 'not spam'], p=[0.3, 0.7]))

# Create a DataFrame
data = pd.DataFrame({'email': emails, 'label': labels})

# Display the first few rows of the dataset
print(data.head())

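This code creates a DataFrame of randomly generated emails and their corresponding labels (spam or not spam). Each email is a random selection of 5 to 14 words drawn from word_list, and about 30% of the emails are labelled spam. Note that the labels are assigned at random, independently of the words in each email. The final line prints the first five rows of the dataset.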

Let’s proceed with building and training a Naive Bayes classifier on the dataset we created. This involves four major steps: preprocessing the data, splitting it, training the model, and evaluating it.

  1. Preprocessing the Data: Convert text data into numerical features using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).
  2. Splitting the Data: Divide the dataset into training and testing sets.
  3. Training the Naive Bayes Model: Fit the Naive Bayes classifier on the training data.
  4. Evaluating the Model: Test the model on the testing data and evaluate its performance.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Preprocessing: Convert text data to numerical features
tfidf_vectorizer = TfidfVectorizer(max_features=1000) # Limit features to 1000 for simplicity
X = tfidf_vectorizer.fit_transform(data['email'])
y = data['label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Naive Bayes classifier
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = naive_bayes.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print classification report for detailed evaluation
print(classification_report(y_test, y_pred))

The output can be analysed as follows. The accuracy of the Naive Bayes classifier on the test data is 66%, which means it correctly classified about two-thirds of the emails. However, the classification report shows that the precision, recall, and F1 score for the “spam” class are very low, so the model is not identifying spam emails well. This is expected here: the labels were assigned at random, independently of the words in each email, so there is no pattern for the classifier to learn and it mostly predicts the more common “not spam” class. I will not go further into this, as this article is meant to explain the basic concept. I will try to use real data in a later article to address these aspects.
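
To check whether an accuracy of this size is better than guessing, it can help to compare the model with a majority-class baseline. The sketch below is an addition of mine, not part of the original workflow: it rebuilds a similar random dataset with NumPy's `default_rng` and uses scikit-learn's `DummyClassifier`, so the exact numbers will differ from the 66% reported above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(42)
word_list = ['discount', 'offer', 'sale', 'free', 'click',
             'buy', 'win', 'money', 'gift', 'limited']

# Rebuild a dataset like the article's: labels are drawn independently of the words
emails = [' '.join(rng.choice(word_list, size=rng.integers(5, 15)))
          for _ in range(1000)]
labels = rng.choice(['spam', 'not spam'], size=1000, p=[0.3, 0.7])

X = TfidfVectorizer().fit_transform(emails)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

# Baseline that always predicts the most frequent training label
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
model = MultinomialNB().fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Baseline accuracy:    {baseline_acc:.2f}")
print(f"Naive Bayes accuracy: {model_acc:.2f}")
```

If the two accuracies come out close to each other, the classifier has learned little beyond the class proportions, which is what one would expect from randomly assigned labels.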

The last step is to predict the class of new data. The following code uses the trained Naïve Bayes model to predict whether a new email is spam or not, and its output is shown after the code.

# Example of a new email to be predicted
new_email = "Limited time offer! Click here to win a free gift."

# Preprocess the new email using the TF-IDF vectorizer from the training
new_email_features = tfidf_vectorizer.transform([new_email])

# Make prediction using the trained Naive Bayes classifier
predicted_label = naive_bayes.predict(new_email_features)

# Print the predicted label
print(f"Predicted Label: {predicted_label[0]}")

Predicted Label: not spam
Output of prediction
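
Besides the hard label, the classifier can also report a probability for each class through `predict_proba`. The sketch below is self-contained so it can be run on its own: it trains on a tiny hand-written corpus that I made up for illustration (it is not the dataset generated above), so its result should not be compared with the prediction shown earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny, hand-labelled toy corpus (hypothetical, for illustration only)
train_emails = [
    "win a free gift now",
    "limited offer click to win money",
    "free discount click here",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch tomorrow with the team",
]
train_labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

vectorizer = TfidfVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_emails), train_labels)

new_email = "Limited time offer! Click here to win a free gift."
new_features = vectorizer.transform([new_email])

# predict_proba returns one row per email, with columns ordered as clf.classes_
probs = clf.predict_proba(new_features)[0]
for label, p in zip(clf.classes_, probs):
    print(f"P({label}) = {p:.2f}")
```

Here the labels actually depend on the words, so the probabilities reflect which class the new email's vocabulary resembles more closely.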

Feel free to discuss.

Good Luck!

Hussnain Akbar

Quantitative Analyst with experience of Python, Machine Learning and R.