Building a Spam Detection Model using Scikit-Learn (2022)

Spam is unsolicited messaging sent in bulk to a large number of people. The messages may be for advertising, fraud, or spreading malware. Spam can take the form of comments left on websites or emails sent in bulk.

Spam detection models identify and filter out these unwanted messages and comments, so that an individual receives only the messages or notifications that matter to them. When building the spam detection model, we will provide it with a dataset that consists of spam and non-spam comments. The model will learn from this dataset and find patterns that help it distinguish between spam and non-spam comments.

This tutorial will demonstrate how to build a machine learning model that classifies YouTube comments as spam or non-spam. We will train the model on a dataset of comments from popular YouTube channels, using the Naive Bayes algorithm.

Table of contents

  • Prerequisites
  • Dataset preparation
  • Extracting important columns
  • Feature extraction from text
  • Model building
  • Accuracy score of our model
  • Model evaluation
  • Making a single prediction
  • Making another prediction
  • Conclusion
  • References

Prerequisites

A reader should know the following to understand this tutorial clearly:

  • Be well equipped with Python programming skills.
  • Understand the concepts of machine learning.
  • Have some knowledge about natural language processing.
  • Know how to work with some of the Scikit-learn algorithms.
  • Know how to build a machine learning model using Google Colab notebook.

Dataset preparation

The dataset contains comments collected from five popular YouTube channels. We need to prepare it before use. Data preparation involves formatting the dataset so that the model can use it during training.

First, we need to load these datasets into our machine. Let’s import the packages that will load our dataset.

import pandas as pd
import numpy as np

We will use Pandas to read the datasets and Numpy to perform mathematical operations on these datasets. We will have five datasets since we have collected the dataset from five Youtube channels.

To download the five datasets in a ZIP file, click here. After downloading the ZIP file, extract the individual datasets, which we will load onto our machine.

To load the five datasets, use the following code:

df1 = pd.read_csv("Youtube01-Channel1.csv")
df2 = pd.read_csv("Youtube02-Channel2.csv")
df3 = pd.read_csv("Youtube03-Channel3.csv")
df4 = pd.read_csv("Youtube04-Channel4.csv")
df5 = pd.read_csv("Youtube05-Channel5.csv")

Now that we have five datasets, we need to concatenate them. We will join the five datasets together to have a single data frame.

Datasets concatenation

We put the five data frames in a list and then apply the concat method to join them into a single data frame.

frames = [df1, df2, df3, df4, df5]
df_merged = pd.concat(frames)
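One detail worth knowing about concat is how it treats row labels. The sketch below uses two tiny stand-in frames (hypothetical data, not the tutorial's CSV files) to show that, by default, each frame keeps its own row labels, and that the ignore_index option renumbers the rows instead.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical data, not the tutorial's CSV files).
df_a = pd.DataFrame({"CONTENT": ["nice video", "subscribe to my channel"], "CLASS": [0, 1]})
df_b = pd.DataFrame({"CONTENT": ["love this song"], "CLASS": [0]})

# Default behaviour: each frame keeps its own row labels, so labels can repeat.
merged = pd.concat([df_a, df_b])
print(merged.index.tolist())        # [0, 1, 0]

# ignore_index=True renumbers the rows from 0 to n-1.
merged_reset = pd.concat([df_a, df_b], ignore_index=True)
print(merged_reset.index.tolist())  # [0, 1, 2]
```

Repeated labels are harmless for this tutorial, since the rows are never looked up by label, but renumbering can avoid surprises if you later select rows with loc.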

To view our merged datasets, use the following code:

df_merged

The output is shown below:

[Image: the merged data frame]

From the image above, our dataset has five columns: COMMENT_ID, AUTHOR, DATE, CONTENT, and CLASS. The columns that we are most interested in are CONTENT and CLASS columns.

The CONTENT column contains the actual YouTube comments. The CLASS column holds a label of 0 or 1: 0 marks a non-spam comment, and 1 marks a spam comment.

The merged dataset contains five datasets. We need to assign keys to our merged dataset to distinguish each dataset.

Assigning keys

Assigning keys lets us record which YouTube channel each row came from. We will use five keys to represent the five datasets, as shown below.

keys = ["Channel1","Channel2","Channel3","Channel4","Channel5"]

After defining the five keys, we pass them to concat through its keys parameter using the following code:

df_with_keys = pd.concat(frames,keys=keys)

The code above adds the keys as an outer index level, so the rows are grouped by YouTube channel. This makes it easy to inspect or select the comments from a single channel. Note that the keys live in the index, not in a feature column, so they are not part of the input the model is trained on.

To view this dataset with the added keys, use this code:

df_with_keys

The output is shown below:

[Image: the merged data frame with channel keys]
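To illustrate what the keys make possible, here is a minimal sketch on two hypothetical stand-in frames (the data and the variable names ch1, ch2, and combined are assumptions for illustration). It shows that keys adds a second index level and that this level can be used to select or summarise rows per channel.

```python
import pandas as pd

# Hypothetical stand-in frames for two channels.
ch1 = pd.DataFrame({"CONTENT": ["nice video", "free gift cards here"], "CLASS": [0, 1]})
ch2 = pd.DataFrame({"CONTENT": ["love this song"], "CLASS": [0]})

# keys= adds an outer index level naming the source of each row.
combined = pd.concat([ch1, ch2], keys=["Channel1", "Channel2"])
print(combined.index.nlevels)   # 2: the key plus the original row label

# Select only the rows that came from one channel.
only_ch1 = combined.loc["Channel1"]
print(len(only_ch1))            # 2

# Count spam comments per channel using the outer index level.
spam_per_channel = combined.groupby(level=0)["CLASS"].sum()
print(spam_per_channel.to_dict())
```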

We can save the dataset into a new variable, df.

df = df_with_keys

To check the size of the dataset, run the following code:

df.size

The output is shown below:

9780

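Note that df.size returns the total number of cells (rows multiplied by columns), not the number of rows. With five columns, 9780 cells correspond to 1956 YouTube comments. To get the row count directly, use len(df) or df.shape.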
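The difference between counting cells and counting rows is easy to check on a small example. The sketch below builds a hypothetical three-row frame with the same five column names as the dataset (the values are made up) and compares size, shape, and len.

```python
import pandas as pd

# Hypothetical three-row frame with the dataset's five column names.
toy = pd.DataFrame({
    "COMMENT_ID": ["a1", "a2", "a3"],
    "AUTHOR": ["x", "y", "z"],
    "DATE": ["2015-01-01", "2015-01-02", "2015-01-03"],
    "CONTENT": ["hi", "click my link", "great"],
    "CLASS": [0, 1, 0],
})

print(toy.size)    # 15: rows * columns
print(toy.shape)   # (3, 5): (rows, columns)
print(len(toy))    # 3: rows only
```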

Let’s check for any missing values in our dataset.

Checking for missing values

To check for missing values, use the following code:

df.isnull().sum()

The output is shown below:

COMMENT_ID    0
AUTHOR        0
DATE          0
CONTENT       0
CLASS         0
dtype: int64

From the output above, there are no missing values. Therefore, our dataset is ready for use.
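Since the real dataset has nothing missing, the check is easier to understand on data that does. The following sketch uses a hypothetical frame with two deliberately missing values. It also shows why isnull should be called only once: its result is a frame of True/False values, which itself never contains missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing comment and one missing label.
toy = pd.DataFrame({
    "CONTENT": ["nice video", None, "check my channel"],
    "CLASS": [0, 1, np.nan],
})

# Missing values per column.
missing = toy.isnull().sum()
print(missing.to_dict())      # {'CONTENT': 1, 'CLASS': 1}

# Calling isnull twice checks a boolean frame, so it always reports zero.
double = toy.isnull().isnull().sum()
print(double.to_dict())       # {'CONTENT': 0, 'CLASS': 0}

# One simple way to handle missing rows is to drop them.
cleaned = toy.dropna()
print(len(cleaned))           # 1
```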

Extracting important columns

We need to extract the important columns from our dataset. As mentioned earlier, we are interested in only two columns, CONTENT and CLASS.

The CONTENT column contains the actual Youtube comments. This column will be used as an input for the model. The CLASS column contains 0 and 1 labels. This column will be used as an output or target for the model.

To extract these two columns, use this code:

df_data = df[["CONTENT","CLASS"]]

We now need to specify which column will be used as an input and which one will be used as an output. This is done using the following code:

df_x = df_data['CONTENT']
df_y = df_data['CLASS']

From the code above, df_x is the input variable and df_y is the output or target variable. After specifying our input and output variables, let’s perform feature extraction.

Feature extraction from text

Feature extraction is the process of deriving informative numeric characteristics from raw text. Machine learning models cannot operate on raw text directly, which is why we have to perform feature extraction. The extracted features are then used as inputs for the model.

We have to convert the raw text into a vector of numeric values during feature extraction. The vectors of numeric values represent the original raw text. Machine learning models easily understand numeric values and can use them directly.

This process of converting raw text to vectors of numeric values will be done using the CountVectorizer class. CountVectorizer is a tool from the Scikit-learn library that converts a collection of text documents into a matrix of word counts.

Let’s import CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

We will then use CountVectorizer to perform feature extraction on our input variable, df_x.

corpus = df_x
cv = CountVectorizer()
X = cv.fit_transform(corpus)

In the code above, we save the input variable into a new variable, corpus. The fit_transform method does two things: it learns the vocabulary (the set of distinct words) from the corpus, and it converts every comment into a vector that counts how often each vocabulary word appears in it. The result, X, is a sparse matrix with one row per comment and one column per word.

To view these vectors of numeric values, use the code below. The toarray() method converts the sparse matrix into a dense array of numbers.

X.toarray()

The output is shown below:

[Image: the array of word counts]
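On the full dataset the array is too large to read, so a two-sentence corpus can make the idea concrete. The sketch below is a minimal illustration with hypothetical comments and hypothetical variable names (toy_corpus, toy_cv, toy_X). It shows the learned vocabulary, the resulting count matrix, and what happens to a word that was not seen during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["free gift free", "great song"]   # hypothetical comments
toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus)

# vocabulary_ maps each word to its column index; columns are in alphabetical order.
columns = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(columns)            # ['free', 'gift', 'great', 'song']

# One row per comment, one column per word, each cell a count.
print(toy_X.toarray())    # [[2 1 0 0]
                          #  [0 0 1 1]]

# transform() reuses the learned vocabulary; the unseen word "new" is ignored.
new_vec = toy_cv.transform(["free new song"]).toarray()
print(new_vec)            # [[1 0 0 1]]
```

This is also why new comments must later be passed through transform rather than fit_transform: the columns have to match the ones the model was trained on.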

We can now use this vector of numbers to build the model.

Model building

To build our machine learning model, we need to import the packages that will be useful during this process.

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

MultinomialNB

This is the Naive Bayes classifier for discrete features such as word counts. Scikit-learn provides other Naive Bayes variants, such as GaussianNB, but MultinomialNB is well suited here because our features are word counts extracted from text.

We will use the MultinomialNB method to build our spam detection model.

For a detailed understanding of the different Naive Bayes algorithm methods, click here.

train_test_split

We will use this function to split our dataset into two sets. The model will use the first set for training, and the second set for testing.

We will start by splitting the dataset.

Dataset splitting

To split the dataset, use the following code:

X_train, X_test, y_train, y_test = train_test_split(X, df_y, test_size=0.30, random_state=42)

From the code above, we have a test_size=0.30. This means the algorithm uses 70% of data for training the model, and 30% will be used to test the model.
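The effect of test_size can be checked on a small synthetic example. The sketch below uses ten hypothetical samples with made-up features and labels. It also passes stratify, an optional parameter the tutorial does not use, which keeps the proportion of each class similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)   # 10 hypothetical samples, 2 features each
labels = np.array([0, 1] * 5)             # balanced toy labels

# stratify=labels is an optional addition, not part of the tutorial's call.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.30, random_state=42, stratify=labels
)

print(f_train.shape)   # (7, 2): 70% of the samples
print(f_test.shape)    # (3, 2): 30% of the samples
```

Fixing random_state makes the split reproducible, so repeated runs train and test on the same rows.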

Let’s now build the model using the MultinomialNB method. First, we initialize the MultinomialNB method as follows:

clf = MultinomialNB()

After initializing the classifier, we fit it to the training set. This enables the model to learn patterns from the data.

Model fitting

clf.fit(X_train,y_train)

Accuracy score of our model

To calculate the accuracy score of this trained model, use this code:

print("Accuracy of Model",clf.score(X_test,y_test)*100,"%")

The accuracy score is shown below:

Accuracy of Model 91.95046439628483 %

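An accuracy of about 92% means the model labels roughly 92 out of every 100 comments in the testing dataset correctly. Note that clf.score is already computed on the testing dataset. Next, we look at the individual predictions the model makes on that dataset.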

Model evaluation

We will use this model to classify the Youtube comments in the testing dataset as either spam or non-spam.

clf.predict(X_test)

We use the predict method to classify all the Youtube comments in the testing dataset. The output is shown below:

[Image: the array of predicted labels]

From the image above, we can see the model assigned labels to our testing dataset. The labels are either 0 or 1.
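Accuracy alone does not show which kind of mistake the model makes: flagging a genuine comment as spam, or letting spam through. One common way to look at this, not used in the tutorial itself, is a confusion matrix together with precision and recall. The sketch below uses hypothetical true and predicted labels rather than the tutorial's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = spam, 0 = non-spam).
true_labels = [0, 0, 0, 1, 1, 1, 1, 0]
pred_labels = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(true_labels, pred_labels)
print(cm)   # [[3 1]
            #  [1 3]]

accuracy = accuracy_score(true_labels, pred_labels)
precision = precision_score(true_labels, pred_labels)  # of comments flagged as spam, share that is spam
recall = recall_score(true_labels, pred_labels)        # of actual spam, share that was flagged
print(accuracy, precision, recall)   # 0.75 0.75 0.75
```

In the tutorial's setting, the same functions could be called with y_test and clf.predict(X_test).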

We can use this model to make a single prediction.

Making a single prediction

We will use input text to predict, as shown below.

comment = ["Check this out I will be giving 50% offer on your first purchase"]
vect = cv.transform(comment).toarray()

The input text is “Check this out I will be giving 50% offer on your first purchase”. We will use the model to classify the text as either spam (1) or non-spam (0). We first need to convert the input text into a vector of numeric values using the cv.transform method, which reuses the vocabulary learned earlier. Finally, the result is converted into an array of numbers using the toarray() method.

To make this prediction, run this code:

clf.predict(vect)

The prediction result is shown below:

array([1], dtype=int64)

The prediction result is 1, which shows that the YouTube comment above is spam. We can use this model to make another prediction.

Making another prediction

We will follow the same steps as above to make a second prediction.

comment1 = ["Great song Friend, it has really touched my heart"]
vect = cv.transform(comment1).toarray()
clf.predict(vect)

The prediction result is shown below:

array([0], dtype=int64)

The prediction result is 0, which shows that the comment is non-spam. These two predictions suggest that our model can distinguish between spam and non-spam comments.
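Both predictions repeat the same two steps: vectorize the text, then call predict. As an alternative to the approach used in this tutorial, Scikit-learn's make_pipeline can bundle the vectorizer and the classifier so that raw strings can be passed in directly. The sketch below is a minimal illustration trained on six hypothetical comments; the data and the name spam_model are assumptions, and a model this small is only meant to show the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set (1 = spam, 0 = non-spam).
texts = [
    "check out my channel",
    "free gift click here",
    "subscribe to my channel for free",
    "great song",
    "love this video",
    "this song touched my heart",
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes the text and then applies Naive Bayes.
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(texts, labels)

new_comments = ["click here for a free gift", "love this song"]
predictions = spam_model.predict(new_comments)
print(predictions)   # [1 0]

# predict_proba gives the estimated probability of each class per comment.
probabilities = spam_model.predict_proba(new_comments)
print(probabilities.round(3))
```

The pipeline also works on the sparse matrix internally, so the toarray() conversion is not needed before predicting.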

Conclusion

In this tutorial, we have learned how to build a spam detection model. We started by loading five datasets collected from popular YouTube channels, merging them, and checking them for missing values. We then extracted numeric features from the comments and used them to train a Naive Bayes classifier. On the testing dataset, the model was able to distinguish between spam and non-spam comments with an accuracy of about 92%, which was the tutorial’s goal.

To get this spam detection model in Google Colab, click here.

References

Peer Review Contributions by: Willies Ogola

Article information

Author: Jonah Leffler

Last Updated: 11/09/2022
