In this step-by-step guide, we'll cover the fundamentals of text mining in Python. Text mining is all about extracting useful information from unstructured text using NLP and AI techniques. Whether you're analyzing customer feedback, research papers, or social media posts, these techniques will help you turn raw text into structured data you can work with.
Prerequisites for text mining in Python
- Basic Python knowledge
- Python 3.8 or higher installed
Setting up the virtual environment
Virtual environments give each project its own library installation, avoiding conflicts between the package versions different projects require. There are two ways of setting up a virtual environment in Python.
Conda
If you have Anaconda installed, you can either set up the virtual environment using the Anaconda Navigator’s GUI or from the command line as:
conda create --name NLP
If you'd like to specify the Python version explicitly, add it to the command:
conda create --name NLP python=3.12
Once created, we can activate it and then install the respective libraries. Note that re is part of Python's standard library and VADER is available through NLTK, so neither needs a separate install.
conda activate NLP
conda install nltk pandas matplotlib spacy wordcloud
Default Python
If you don’t want to use Conda, you can use Python’s built-in venv module instead. The equivalent steps are:
python -m venv NLP_Default
source NLP_Default/bin/activate
pip install nltk pandas matplotlib spacy wordcloud
Now that our virtual environment is set up, we're ready to begin text mining.
1. Data collection
Before we can start mining text, we need data to analyze. For this tutorial, we'll use Twitter (X.com) data since tweets provide an excellent example of text that requires various preprocessing steps. We'll focus on collecting X posts (tweets) that contain scientific announcements and discoveries.
To collect a set of tweets, you can use the Twitter API, either by calling it directly or through Python wrapper libraries like tweepy. However, the Twitter API has significant rate limits. Therefore, in this post, we’ll use the Twitter Profile Scraper Actor to collect tweets.
Setting up Apify API access
First, we need to install the apify-client library and set up the APIFY_API_TOKEN:
!pip install apify_client
To collect tweets from a particular profile, we’ll use Twitter Profile Scraper:
import pandas as pd
from apify_client import ApifyClient

client = ApifyClient(token="YOUR-APIFY-API-TOKEN")
run_input = {
    "maxTweetsPerUser": 100,
    "proxy": {"useApifyProxy": True},
    "startUrls": ["https://x.com/NaturePhysics"]
}
Collecting scientific tweets
Let's call the Actor to collect tweets about scientific discoveries that typically contain the elements we'll need to preprocess:
print("Running actor ... it might take a while")
run = client.actor("epctex/twitter-profile-scraper").call(run_input=run_input)
dataset_id = run["defaultDatasetId"]
dataset = client.dataset(dataset_id).list_items()
all_tweets = pd.DataFrame(dataset.items)
# Remove duplicates and reset index
all_tweets = all_tweets.drop_duplicates(subset=["full_text"]).reset_index(drop=True)
Understanding our dataset
Let's examine what we've collected:
# Basic dataset information
print(f"Total tweets collected: {len(all_tweets)}")
print("\nSample of tweets that need preprocessing:")
for i, tweet in all_tweets.head(5).iterrows():
    print(f"\nTweet {i + 1}: {tweet['full_text']}")
Preparing for preprocessing
Before we move to data preprocessing shown in the next section, let’s look at the data statistics:
print("\Tweet statistics:")
print(f"Total favorite_count: {all_tweets['favorite_count'].sum()}")
print(f"Total quote_count: {all_tweets['quote_count'].sum()}")
print(f"Total reply_count: {all_tweets['reply_count'].sum()}")
print(f"Total retweet_count: {all_tweets['retweet_count'].sum()}")
In the next section, we'll learn how to clean and preprocess this text data, starting with removing URLs and handling special characters using regular expressions.
2. Text preprocessing
Raw text data often needs cleaning and standardization before analysis. Here are the key preprocessing steps that prepare text for mining.
Cleaning text data
Consider a tweet:
BREAKING: #LIGO confirms #gravitationalwaves detected for 1st time. https://t.co/2lMvheiDcW https://t.co/Njoa0Y8mBe
Now, it's obvious that these URLs are unnecessary here. We can remove them (and any other unwanted content like punctuation marks) using regular expressions.
Regular expressions
While regular expressions deserve a dedicated post of their own, we'll go through them briefly here. Regular expressions are very useful for pattern matching in strings. For example, to check whether an email address is valid, we can use a regular expression like this:
import re
valid_email_regex = r'^[a-z0-9._]{3,}@[a-z0-9-]+\.[a-z.]{2,}$'
In Python, the re module provides regular expression (regex) support.
An email address consists of a local part, an @ sign, a domain name, and a TLD. Breaking the pattern down:
- [a-z0-9._]{3,} matches the local part: a string of at least 3 characters ({3,}) made up of lowercase English letters, digits, periods, or underscores.
- It is followed by an @ sign and a domain name, [a-z0-9-]+, which allows English letters, digits, and hyphens. The + sign means the domain name must have at least one character (like the x in x.com).
- \. escapes the period, so it matches a literal dot after the domain name, followed by a TLD of at least two characters, [a-z.]{2,}. A period is allowed inside this part because of composite endings like co.uk.
We can verify this regex using re.match(). We'll pass some valid email addresses (and one counter-example):
sampleEmailsList = ['forever2024@yahoo.com', '312122112@qq.com', 'myname@xyzaaa.co.uk', 'www.mywebsite.com']
for email in sampleEmailsList:
    print(bool(re.match(valid_email_regex, email)))
"""Output
True
True
True
False"
Now, let’s remove the URLs from the tweets. The URLs are always in the form https://t.co/XXXXXXXXXX, so we should use a regex pattern like this:
r'(https://t.co/)+[0-9a-zA-Z]{10}'
A little observation reveals that these URL paths have a fixed length of 10 characters and can contain any alphanumeric character. We'll use re.sub() here (this function takes a regex pattern and replaces every match with the given replacement string, an empty string in our case) to weed out the URLs.
import re
text = "BREAKING: #LIGO confirms #gravitationalwaves detected for 1st time. \n https://t.co/2lMvheiDcW https://t.co/Njoa0Y8mBe"
url_free_text = re.sub(r'(https://t.co/)+[0-9a-zA-Z]{10}', '', text)
"""
'BREAKING: #LIGO confirms #gravitationalwaves detected for 1st time. \n '
"""
It works well, but we still need to remove the unnecessary line break.
url_free_text = re.sub(r'[\n]+', '', url_free_text)
url_free_text.strip()
#'BREAKING: #LIGO confirms #gravitationalwaves detected for 1st time.'
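If you want to apply the same cleanup to every collected tweet, you can wrap both substitutions in a small helper and map it over the full_text column. This is a minimal sketch assuming the all_tweets DataFrame from the data collection step:
import re

def clean_tweet(text):
    # Remove t.co short links and line breaks, then trim whitespace.
    text = re.sub(r'(https://t.co/)+[0-9a-zA-Z]{10}', '', text)
    text = re.sub(r'[\n]+', '', text)
    return text.strip()

all_tweets['clean_text'] = all_tweets['full_text'].apply(clean_tweet)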
This introduction covers regex basics. For a deeper understanding, you can check out Python's regex documentation. Let's move on to tokenization.
Tokenization
Before it can be processed, text needs to be broken into smaller units. This conversion is known as tokenization. While tokenization can refer to sentence, word, or character tokenization, it usually means word tokenization, where we split a text into individual words. NLTK has a built-in function, word_tokenize(), for this purpose.
import nltk
nltk.download('punkt_tab')
from nltk import word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
"""
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
"""
As you can see, punctuation marks (a period in this case) are also part of the tokens. Whether to keep the punctuation marks is your call. In some cases, punctuation is trivial, but in others it carries helpful contextual information (commas, colons, etc.). If you want to remove it, you can use the string.punctuation constant to filter it out.
import string
tokens_without_punctuation = [word for word in tokens if word not in string.punctuation]
"""
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
"""
Removing stop words
Stop words are common words that carry little contextual information on their own, such as articles and prepositions. To remove them (or use them for any other purpose), we download NLTK's stop word corpus and select the English list.
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
Similar to filtering the punctuation, we can fetch tokens without stop words too.
tokens_without_stopwords = [word for word in tokens if word not in stop_words]
# ['The', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.']
It’s an open debate whether stop words are necessary for contextual information, so again it's up to you whether to remove them. The reason to be cautious is that stop words can carry valuable contextual information, as the following example shows.
tokens = word_tokenize("To be or not to be")
tokens_without_stopwords = [word for word in tokens if word not in stop_words]
# ['To']
Stop words carry grammatical meaning but often add noise to text analysis. While removing them can lose some context, it typically improves results for most text mining tasks.
Stemming and lemmatization
Stemming is a linguistic technique that reduces a word to its stem (root). It applies simple heuristic rules to cut a word down to its origin: "playing" is a verb, so we just drop its "ing" and get "play."
We will use PorterStemmer here. Named after its author, it is one of the earliest stemming algorithms.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stem = stemmer.stem("playing")
# 'play'
But it comes with its limitations. For example, trying “happiness” will result in “happi” which is not a valid word.
stem = stemmer.stem("happiness")
# 'happi'
NLTK features some other stemmers, so let's try an alternative one.
from nltk.stem import LancasterStemmer
stem = LancasterStemmer().stem("happiness")
#'happy'
Lemmatization
Lemma also means the root/basic form of a word, though it should be part of a dictionary too (i.e. realized as a standalone word). Here’s how the Cambridge English Dictionary defines a lemma:
Lemma is a form of a word that appears as an entry in a dictionary and is used to represent all the other possible forms. For example, the lemma “build” represents “builds”, “building”, “built”, etc.
--Cambridge English Dictionary
For lemmatization, we look the word up in a dictionary. Every NLP library, including NLTK, ships with dictionaries for this purpose. As a result, lemmatization is more accurate, as the happiness example below shows.
For lemmatization, we need the wordnet package and the corresponding lemmatizer.
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()
lemma = lemmatizer.lemmatize("happiness", pos=wordnet.NOUN)
#'happiness'
Difference
Some people assume the two terms are interchangeable, but they aren't. Both serve the same purpose; the difference is in how they get there. Stemming uses fixed heuristic rules and is therefore faster, while lemmatization does a dictionary lookup and is consequently slower but more accurate.
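To see the trade-off in practice, here is a small side-by-side comparison (note that the WordNet lemmatizer needs a part-of-speech hint to do its job well; expected outputs are shown as comments):
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word, pos in [("playing", wordnet.VERB), ("happiness", wordnet.NOUN), ("studies", wordnet.NOUN)]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos=pos))
"""
playing -> play | play
happiness -> happi | happiness
studies -> studi | study
"""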
3. Text representation
Text features can be represented in several ways. We will quickly review some of the common text representation models.
Bag of Words model
The Bag of Words (BoW) model treats text as an unordered collection of words and represents each document by how often each word appears in it. To get BoW features, we can use CountVectorizer from Scikit-learn.
!pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
After instantiating the CountVectorizer, we can use its fit_transform() method. This method builds the vocabulary of the given text samples and converts each sample's word frequencies into the corresponding (BoW) feature vector.
import pandas as pd
text_samples = [
"Manila is the capital of the Philippines.",
"Capital investment model, despite its flaws, is highly successful.",
"23 September is equinox."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text_samples)
X = X.toarray()
df= pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
BoW vector values
 | 23 | capital | despite | equinox | flaws | highly | investment | is | its | manila | model | of | philippines | september | successful | the |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
TF-IDF model
The BoW approach has a serious limitation: it ignores contextual information and grammar. A better approach, Term Frequency-Inverse Document Frequency (TF-IDF), considers not only a term's frequency within a document but also its inverse document frequency, penalizing words that appear across all documents.
Scikit-learn provides a vectorizer for TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_samples)
X = X.toarray()
df= pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
TF-IDF vector values
Word | Document 0 | Document 1 | Document 2 |
---|---|---|---|
23 | 0.000000 | 0.000000 | 0.546454 |
capital | 0.342620 | 0.270118 | 0.000000 |
despite | 0.000000 | 0.355173 | 0.000000 |
equinox | 0.000000 | 0.000000 | 0.546454 |
flaws | 0.000000 | 0.355173 | 0.000000 |
highly | 0.000000 | 0.355173 | 0.000000 |
investment | 0.000000 | 0.355173 | 0.000000 |
is | 0.266075 | 0.209771 | 0.322745 |
its | 0.000000 | 0.355173 | 0.000000 |
manila | 0.450504 | 0.000000 | 0.000000 |
model | 0.000000 | 0.355173 | 0.000000 |
of | 0.450504 | 0.000000 | 0.000000 |
philippines | 0.450504 | 0.000000 | 0.000000 |
september | 0.000000 | 0.000000 | 0.546454 |
successful | 0.000000 | 0.355173 | 0.000000 |
the | 0.450504 | 0.000000 | 0.000000 |
Word embeddings
TF-IDF is a bit better than BoW at giving relative weights to words, but it still lacks information about how similar or dissimilar two words are. Word embedding models project words into a continuous vector space, so semantically similar words end up with embeddings that are close together.
While BoW and TF-IDF representations work well only on smaller datasets, word embeddings scale to massive datasets. Models like the Generative Pre-trained Transformer (GPT) family, Bidirectional Encoder Representations from Transformers (BERT), or any other transformer model usually give better embeddings (we can also train smaller Recurrent Neural Network (RNN)-based models). These models differ mainly in how they calculate the embeddings and in the output embedding size.
Given the growing interest in vector databases, these embeddings become even more important. Here, we will train static Word2Vec embeddings using Gensim.
!pip install --upgrade gensim #We need to install gensim first
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
tokenized_samples = [word_tokenize(text) for text in text_samples]
model = Word2Vec(sentences=tokenized_samples, vector_size=100, window=5, min_count=1, workers=4)
model.wv.most_similar('capital', topn=5)
# [('successful', 0.16373926401138306), (',', 0.14594906568527222), ('Manila', 0.07480262219905853), ('model', 0.05047113448381424), ('equinox', 0.04157735034823418)]
Although the model was given very little data, it has already made an interesting inference by ranking Manila as similar to capital. More data would produce far more meaningful results.
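You can also inspect the raw vectors or compute the similarity between a specific pair of words from the model trained above; for example:
vector = model.wv['capital'] # a 100-dimensional vector (vector_size=100 above)
print(vector.shape)          # (100,)
print(model.wv.similarity('capital', 'Manila')) # cosine similarity between the two words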
4. Text mining techniques
Several algorithms can be used for text mining. Some require a labeled dataset (supervised learning), while others can work with unlabeled data.
Text classification
Text classification is a loose term that can refer to any sort of (supervised) classification for text samples. For example:
- Sentiment analysis: We can classify the documents to see their sentiment. This can be useful when analyzing customer reviews, social media trends from comments, etc. Usually, sentiment analysis results in three classes: positive, negative, and neutral.
- Grammatical correctness: Any grammar-checking software may determine if the given text is (grammatically) correct or not.
Text classification can be done using different classification algorithms like Support Vector Machines (SVM), neural networks, or advanced architectures like transformers. Here is a little example using logistic regression (we will continue using the same text_samples from earlier; feel free to replace them with your own dataset).
from sklearn.linear_model import LogisticRegression
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_samples)
model = LogisticRegression().fit(X, [1, 0, 1]) # toy labels, just for demonstration
model.predict(vectorizer.transform(["22 Dec is Winter Solstice."]))
Obviously, it makes little sense to train a classifier on just three text samples. This model will give better results with more data, but the code and the rest of the procedure will be the same.
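For reference, on a realistically sized labeled dataset the same procedure usually looks something like the sketch below. The tiny texts and labels lists here are hypothetical placeholders; swap in your own data. A Pipeline simply chains the vectorizer and the classifier so they can be fit and evaluated together.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled data - replace with your own dataset.
texts = ["great product", "terrible service", "absolutely love it", "awful experience", "works fine", "would not recommend"]
labels = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)
clf = Pipeline([("tfidf", TfidfVectorizer()), ("logreg", LogisticRegression(max_iter=1000))])
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test)) # accuracy on the held-out split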
Text clustering
Clustering means dividing data into different categories. It’s different from classification as we don’t provide any labeling. Hence, clusters are learned in an unsupervised way.
One of the most famous and easy-to-understand clustering algorithms is K-means. It asks the user to provide the number of clusters and assigns samples to them, usually based on Euclidean distance. Hence, text samples with closer embeddings end up in the same cluster.
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=2).fit(X)
for label, sample in zip(clusters.labels_, text_samples):
    print(f"Cluster {label}: {sample}")
"""
Cluster 1: Manila is the capital of the Philippines.
Cluster 1: Capital investment model, despite its flaws, is highly successful.
Cluster 0: 23 September is equinox.
"""
Other clustering algorithms include SOM (Self-Organizing Maps) or GMM (Gaussian Mixture Models). A related application is topic modeling.
Topic modeling
Topic modeling also uses unsupervised learning to produce a set of terms that define the text/documents. It checks the most frequently occurring terms in a document and also how these terms correlate with each other.
One of the best-known topic modeling algorithms is Latent Semantic Analysis (LSA). LSA works somewhat similarly to TF-IDF by checking a term's occurrence in a document and the common terms across documents.
While there is some math involving eigenvectors and Singular Value Decomposition (SVD) behind the scenes, the bottom line is that LSA performs dimensionality reduction (using SVD) and then uses cosine similarity to check how closely two documents match.
LSA can be performed simply in a couple of lines: apply SVD and then fit it on the data (documents).
from sklearn.decomposition import TruncatedSVD

lsa = TruncatedSVD(n_components=2)
X_lsa = lsa.fit_transform(X)
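To see which terms define each component (topic), you can look at lsa.components_ against the vectorizer's vocabulary. A minimal sketch, assuming the TF-IDF vectorizer and X from the text representation section:
import numpy as np

terms = vectorizer.get_feature_names_out()
for i, component in enumerate(lsa.components_):
    top_terms = [terms[idx] for idx in np.argsort(component)[::-1][:5]] # highest-weighted terms
    print(f"Topic {i}: {top_terms}")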
5. Named entity recognition (NER)
Named Entity Recognition (NER) is an excellent application of text mining. NER lets us find names of people, places, organizations, etc. (i.e., proper nouns), as well as other specific information in the text. For NER, we will use the spaCy library (it gives better NER results than NLTK).
First, download the English model from the command line:
python -m spacy download en_core_web_sm
Then load it and process our sample text:
import spacy
english_model = spacy.load("en_core_web_sm")
text = "Mrs. Linton saw Isabella tear herself free, and run into the garden; and a minute after, Heathcliff opened the door."
processed_text = english_model(text)
Passing text through the spaCy model processes it and stores the named entities in the ents attribute, which we can easily iterate over to fetch them.
for entity in processed_text.ents:
    print(entity.text, entity.label_)
"""
Linton PERSON
Isabella PERSON
a minute TIME
Heathcliff ORG
"""
Heathcliff is clearly a person, not an organization. At first, I thought it might be a one-off mistake, but when I tried some other examples, this ORG issue kept recurring. Let's use a transformer model to see if it does better.
!pip install transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)
text = "Mrs. Linton saw Isabella tear herself free, and run into the garden; and a minute after, Heathcliff opened the door."
ner_results = ner_pipeline(text)
print(ner_results)
"""
[{'entity': 'B-PER',
'score': 0.5869316,
'index': 3,
'word': 'Lin',
'start': 5,
'end': 8},
{'entity': 'B-PER',
'score': 0.99575436,
'index': 6,
'word': 'Isabella',
'start': 16,
'end': 24},
{'entity': 'B-PER',
'score': 0.9910059,
'index': 22,
'word': 'Heath',
'start': 89,
'end': 94},
{'entity': 'B-PER',
'score': 0.5074362,
'index': 23,
'word': '##cliff',
'start': 94,
'end': 99}]
"""
Now, this fixed the issue of Heathcliff being recognized as an organization, but it led to another one: Linton was no longer recognized as a person's name in full, and only the subword Lin was tagged instead. These models have their pros and cons, and I admit these examples are too few to judge them by.
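The subword fragments in the raw output can be merged by asking the pipeline to aggregate tokens into entity groups. This won't necessarily fix every boundary issue (like Lin vs. Linton), but it does join adjacent pieces such as Heath and ##cliff. A quick sketch using the aggregation_strategy parameter:
# Group subword tokens into whole entities instead of raw B-PER pieces.
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
for entity in ner_pipeline(text):
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))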
6. Sentiment analysis
Sentiment analysis is another useful application of text processing. We often want to see the tone of a message. Sentiment classification is usually based on three categories: positive, negative, and neutral.
With a large amount of unlabeled data, VADER (Valence Aware Dictionary and sEntiment Reasoner) is a useful tool for labeling it automatically. VADER is free and works in a few lines of code, but it has its limitations, as its accuracy is only around 80%.
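For completeness, here is roughly what VADER looks like through NLTK (you need to download the vader_lexicon resource first). A compound score above roughly +0.05 is usually read as positive, below -0.05 as negative, and neutral in between:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("LIGO's detection of gravitational waves is an incredible breakthrough!"))
# {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}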
Since we have already covered sentiment analysis in detail here, I will round this off with just a little example of Hugging Face.
from transformers import pipeline, AutoTokenizer

sentiment_pipeline = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment", tokenizer=AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")) # Feel free to use a model of your own choice
df = pd.read_csv("consolidatedTweets.csv")
df['label'] = df['full_text'].apply(lambda x: sentiment_pipeline(x)[0]['label'])
Here, I am using the same tweets file that I used in the sentiment analysis article. To improve the labeling quality (over VADER), we are using transformer models (feel free to specify the model of your choice) to label it.
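As a quick sanity check, you can look at how the predicted labels are distributed. Note that this particular model outputs raw labels such as LABEL_0, LABEL_1, and LABEL_2, which typically correspond to negative, neutral, and positive respectively (check the model card to confirm):
# Distribution of predicted sentiment labels across the tweets.
print(df['label'].value_counts())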
7. Text visualization
As the saying goes, a picture is worth a thousand words. We love to visualize things, but text doesn't lend itself to visualization as naturally as images do. However, we can use a couple of tools to visualize the word distribution in a text.
Word clouds
Word clouds provide a way of visualizing text datasets by showing the most commonly occurring terms (a term's size is proportional to its frequency, as we will see).
To use word clouds, we need to install the library first (along with the datasets library used to fetch the example dataset):
pip install wordcloud datasets
I will use the AG News dataset here, a news classification dataset drawn from the AG corpus of more than one million news articles.
from datasets import load_dataset
from wordcloud import WordCloud
dataset = load_dataset("ag_news") #Getting dataset from Hugging Face
dataset_plain_text = " ".join(dataset['train']['text']) #COnverting it into plain text for word clouds
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords='noisy_words').generate(dataset_plain_text)
The dataset, like other Hugging Face datasets, is already split into train and test sets. I will use the training split (the bigger one). Now that the word cloud is generated, all we have to do is display it using Matplotlib:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
As we can see, some common terms are garbage/stop words, which can be weeded out. To uproot them, we will make a list of these garbage terms and specify them as stop words.
noisy_words = ["AP", "quot", "U", "S", "gt", "lt", "b"]
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=noisy_words).generate(dataset_plain_text)
As you can see, we now have a better picture of the most frequent terms. I am surprised to see that weekday terms like "on Thursday" occur more often than weekend ones. We can refine the cloud further by removing trivial terms (like prepositions and articles) too. The complete code example is included here for reference.
from datasets import load_dataset
from wordcloud import WordCloud
import matplotlib.pyplot as plt
dataset = load_dataset("ag_news")
dataset_plain_text = " ".join(dataset['train']['text'])
noisy_words = ["AP", "quot", "U", "S", "gt", "lt", "b"] #Feel free to change or mute them
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=noisy_words).generate(dataset_plain_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Bar plots and histograms
While these word clouds look good, they don't provide the statistical precision and insights that charts do. To chart which terms are most frequent, we can use the Counter class.
from collections import Counter
words = dataset_plain_text.lower().split()
word_counts = Counter(words)
df = pd.DataFrame(word_counts.items(), columns=['Word', 'Frequency']).sort_values(by='Frequency', ascending=False).head(20)
plt.figure(figsize=(12, 6))
plt.bar(df['Word'],df['Frequency'], color=['green', 'yellow'])
plt.show()
If you try to make this chart without any filtering, it will show the correct bar for each term, but too many terms will end up crowding each other. Hence, it's better to check only the top 10 or 20 terms.
As we can see, the chart is dominated by trivial terms like prepositions and articles (and no linguist would disagree there). To make it more meaningful, we can remove these trivial terms, just as we did for the word clouds.
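Here is one way to do that filtering before counting, reusing NLTK's English stop word list from earlier (a sketch; adjust the filters to taste):
from collections import Counter
from nltk.corpus import stopwords
import string

stop_words = set(stopwords.words('english'))
# Strip attached punctuation, then drop stop words and non-alphabetic tokens.
words = [w.strip(string.punctuation) for w in dataset_plain_text.lower().split()]
filtered_words = [w for w in words if w and w.isalpha() and w not in stop_words]

word_counts = Counter(filtered_words)
df = pd.DataFrame(word_counts.most_common(20), columns=['Word', 'Frequency'])
plt.figure(figsize=(12, 6))
plt.bar(df['Word'], df['Frequency'], color='green')
plt.xticks(rotation=45)
plt.show()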
Get started with text analysis
Text analysis offers an accessible path into data science, requiring minimal computing power compared to image or video processing. With Python's rich ecosystem of libraries, you can quickly move from basic word counting to sophisticated natural language processing. The abundance of text data from social media, books, and websites lets you practice with real-world examples, starting with basic pattern recognition before exploring advanced techniques and larger datasets as your skills grow.