In this step-by-step guide, we'll cover the fundamentals of text mining in Python. Text mining is all about extracting useful information from unstructured text using NLP and AI techniques. Whether you're analyzing customer feedback, research papers, or social media posts, these techniques will help you turn raw text into structured data you can work with.
Prerequisites for text mining in Python
- Basic Python knowledge
- Python 3.8 or higher installed
Setting up the virtual environment
Virtual environments give each project its own library installation, avoiding conflicts between the package versions different projects require. There are two ways of setting up a virtual environment in Python.
Conda
If you have Anaconda installed, you can either set up the virtual environment using the Anaconda Navigator’s GUI or from the command line as:
conda create --name NLP
If you'd like to specify the Python version explicitly, add it to the command:
conda create --name NLP python=3.12
Once created, we can activate it and then install the respective libraries. Note that re is part of Python's standard library and VADER is available through NLTK, so neither needs a separate install.
conda activate NLP
conda install nltk pandas matplotlib spacy wordcloud
Default Python
If you don’t want to use Conda, you can use Python’s built-in venv module instead. The equivalent steps are:
python -m venv NLP_Default
source NLP_Default/bin/activate
pip install nltk pandas matplotlib spacy wordcloud
Now that our virtual environment is set up, we're ready to begin text mining.
1. Data collection
Before we can start mining text, we need data to analyze. For this tutorial, we'll use Twitter (X.com) data since tweets provide an excellent example of text that requires various preprocessing steps. We'll focus on collecting X posts (tweets) that contain scientific announcements and discoveries.
To collect a set of tweets, you can use the Twitter API, either by calling it directly or through Python wrapper libraries like tweepy. However, the Twitter API has significant rate limits. Therefore, in this post, we’ll use the Twitter Profile Scraper Actor to collect tweets.
Setting up Apify API access
First, we need to install the apify-client library and set up the APIFY_API_TOKEN:
!pip install apify_client
To collect tweets from a particular profile, we’ll use Twitter Profile Scraper:
import pandas as pd
from apify_client import ApifyClient

client = ApifyClient(token="YOUR-APIFY-API-TOKEN")
run_input = {
    "maxTweetsPerUser": 100,
    "proxy": {"useApifyProxy": True},
    "startUrls": ["https://x.com/NaturePhysics"]
}
Collecting scientific tweets
Let's call the Actor to collect tweets about scientific discoveries that typically contain the elements we'll need to preprocess:
print("Running actor ... it might take a while")
run = client.actor("epctex/twitter-profile-scraper").call(run_input=run_input)
dataset_id = run["defaultDatasetId"]
dataset = client.dataset(dataset_id).list_items()
all_tweets = pd.DataFrame(dataset.items)
# Remove duplicates and reset index
all_tweets = all_tweets.drop_duplicates(subset=["full_text"]).reset_index(drop=True)
Understanding our dataset
Let's examine what we've collected:
# Basic dataset information
print(f"Total tweets collected: {len(all_tweets)}")
print("\nSample of tweets that need preprocessing:")
for i, tweet in all_tweets.head(5).iterrows():
    print(f"\nTweet {i + 1}: {tweet['full_text']}")
Preparing for preprocessing
Before we move to data preprocessing shown in the next section, let’s look at the data statistics:
print("\Tweet statistics:")
print(f"Total favorite_count: {all_tweets['favorite_count'].sum()}")
print(f"Total quote_count: {all_tweets['quote_count'].sum()}")
print(f"Total reply_count: {all_tweets['reply_count'].sum()}")
print(f"Total retweet_count: {all_tweets['retweet_count'].sum()}")
In the next section, we'll learn how to clean and preprocess this text data, starting with removing URLs and handling special characters using regular expressions.
2. Text preprocessing
Raw text data often needs cleaning and standardization before analysis. Here are the key preprocessing steps that prepare text for mining.
Cleaning text data
Consider a tweet:
BREAKING: #LIGO confirms #gravitationalwaves detected for 1st time. https://t.co/2lMvheiDcW https://t.co/Njoa0Y8mBe
Now, it's obvious that these URLs are unnecessary here. We can remove them (and any other unwanted content like punctuation marks) using regular expressions.
Regular expressions
While regular expressions deserve a dedicated post of their own, we'll go through them briefly here. Regular expressions are very useful for pattern matching in strings. For example, to check whether an email address is valid, we can use a regular expression like this:
import re
valid_email_regex = r'^[a-z0-9._]{3,}@[a-z0-9-]+\.[a-z.]{2,}$'
In Python, the re module provides regular expression (regex) support.
An email address consists of a local part, an @ sign, a domain name, and a TLD. Breaking the pattern down:
- [a-z0-9._]{3,} matches the local part: a string of at least 3 characters ({3,}) made up of lowercase English letters, digits, periods, or underscores.
- It is followed by an @ sign and a domain name, [a-z0-9-]+, which allows English letters, digits, and hyphens. The + sign means the domain name must have at least one character (like the x in x.com).
- \. escapes the period, so it matches a literal dot after the domain name, followed by a TLD of at least two characters, [a-z.]{2,}. A period is allowed inside this part because of composite endings like co.uk.
We can verify this regex using re.match(). We'll pass some valid email addresses (and one counter-example):
sampleEmailsList = ['forever2024@yahoo.com', '312122112@qq.com', 'myname@xyzaaa.co.uk', 'www.mywebsite.com']
for email in sampleEmailsList:
    print(bool(re.match(valid_email_regex, email)))
"""Output
True
True
True
False"
Now, let’s remove the URLs from the tweets. The URLs are always in the form https://t.co/XXXXXXXXXX, so we should use a regex pattern like this:
r'(https://t.co/)+[0-9a-zA-Z]{10}'
A little observation reveals that these URL paths have a fixed length of 10 characters and can contain any alphanumeric character. We'll use re.sub() here (this function takes a regex pattern and replaces every match with the given replacement string, an empty string in our case) to weed out the URLs.
import re
text = "BREAKING: #LIGO confirms #gravitationalwaves detected for 1st time. \n https://t.co/2lMvheiDcW https://t.co/Njoa0Y8mBe"
url_free_text = re.sub(r'(https://t.co/)+[0-9a-zA-Z]{10}', '', text)
"""
'BREAKING: #LIGO confirms #gravitationalwaves detected for 1st time. \n '
"""
It works well, but we still need to remove the unnecessary line break.
url_free_text = re.sub(r'[\n]+', '', url_free_text)
url_free_text.strip()
#'BREAKING: #LIGO confirms #gravitationalwaves detected for 1st time.'
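If you want to apply the same cleanup to every collected tweet, you can wrap both substitutions in a small helper and map it over the full_text column. This is a minimal sketch assuming the all_tweets DataFrame from the data collection step:
import re

def clean_tweet(text):
    # Remove t.co short links and line breaks, then trim whitespace.
    text = re.sub(r'(https://t.co/)+[0-9a-zA-Z]{10}', '', text)
    text = re.sub(r'[\n]+', '', text)
    return text.strip()

all_tweets['clean_text'] = all_tweets['full_text'].apply(clean_tweet)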
This introduction covers regex basics. For a deeper understanding, you can check out Python's regex documentation. Let's move on to tokenization.
Tokenization
Before it can be processed, text needs to be broken into smaller units. This conversion is known as tokenization. While tokenization can refer to sentence, word, or character tokenization, it usually means word tokenization, where we split a text into individual words. NLTK has a built-in function, word_tokenize(), for this purpose.
import nltk
nltk.download('punkt_tab')
from nltk import word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
"""
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
"""
As you can see, punctuation marks (a period in this case) are also part of the tokens. Whether to keep the punctuation marks is your call. In some cases, punctuation is trivial, but in others it carries helpful contextual information (commas, colons, etc.). If you want to remove it, you can use the string.punctuation constant to filter it out.
import string
tokens_without_punctuation = [word for word in tokens if word not in string.punctuation]
"""
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
"""
Removing stop words
Stop words are common words that carry little contextual information on their own, such as articles and prepositions. To remove them (or use them for any other purpose), we download NLTK's stop word corpus and select the English list.
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
Similar to filtering the punctuation, we can fetch tokens without stop words too.
tokens_without_stopwords = [word for word in tokens if word not in stop_words]
# ['The', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.']
It’s an open debate whether stop words are necessary for contextual information, so again it's up to you whether to remove them. The reason to be cautious is that stop words can carry valuable contextual information, as the following example shows.
tokens = word_tokenize("To be or not to be")
tokens_without_stopwords = [word for word in tokens if word not in stop_words]
# ['To']
Stop words carry grammatical meaning but often add noise to text analysis. While removing them can lose some context, it typically improves results for most text mining tasks.
Stemming and lemmatization
Stemming is a linguistic technique that reduces a word to its stem (root). It applies simple heuristic rules to cut a word down to its origin: "playing" is a verb, so we just drop its "ing" and get "play."
We will use PorterStemmer here. Named after its author, it is one of the earliest stemming algorithms.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stem = stemmer.stem("playing")
# 'play'
But it comes with its limitations. For example, trying “happiness” will result in “happi” which is not a valid word.
stem = stemmer.stem("happiness")
# 'happi'
NLTK features some other stemmers, so let's try an alternative one.
from nltk.stem import LancasterStemmer
stem = LancasterStemmer().stem("happiness")
#'happy'
Lemmatization
Lemma also means the root/basic form of a word, though it should be part of a dictionary too (i.e. realized as a standalone word). Here’s how the Cambridge English Dictionary defines a lemma:
Lemma is a form of a word that appears as an entry in a dictionary and is used to represent all the other possible forms. For example, the lemma “build” represents “builds”, “building”, “built”, etc.
--Cambridge English Dictionary
For lemmatization, we look the word up in a dictionary. Every NLP library, including NLTK, ships with dictionaries for this purpose. As a result, lemmatization is more accurate, as the happiness example below shows.
For lemmatization, we need the wordnet package and the corresponding lemmatizer.
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()
lemma = lemmatizer.lemmatize("happiness", pos=wordnet.NOUN)
#'happiness'
Difference
Some people assume the two terms are interchangeable, but they aren't. Both serve the same purpose; the difference is in how they get there. Stemming uses fixed heuristic rules and is therefore faster, while lemmatization does a dictionary lookup and is consequently slower but more accurate.
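To see the trade-off in practice, here is a small side-by-side comparison (note that the WordNet lemmatizer needs a part-of-speech hint to do its job well; expected outputs are shown as comments):
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word, pos in [("playing", wordnet.VERB), ("happiness", wordnet.NOUN), ("studies", wordnet.NOUN)]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos=pos))
"""
playing -> play | play
happiness -> happi | happiness
studies -> studi | study
"""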
3. Text representation
Text features can be represented in several ways. We will quickly review some of the common text representation models.
Bag of Words model
The Bag of Words (BoW) model treats text as an unordered collection of words and represents each document by how often each word appears in it. To get BoW features, we can use CountVectorizer from Scikit-learn.
!pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
After instantiating the CountVectorizer, we can use its fit_transform() method. This method builds the vocabulary of the given text samples and converts each sample's word frequencies into the corresponding (BoW) feature vector.
import pandas as pd
text_samples = [
"Manila is the capital of the Philippines.",
"Capital investment model, despite its flaws, is highly successful.",
"23 September is equinox."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text_samples)
X = X.toarray()
df= pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
BoW vector values
 | 23 | capital | despite | equinox | flaws | highly | investment | is | its | manila | model | of | philippines | september | successful | the |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
TF-IDF model
The BoW approach has a serious limitation: it ignores contextual information and grammar. A better approach, Term Frequency-Inverse Document Frequency (TF-IDF), considers not only a term's frequency within a document but also its inverse document frequency, penalizing words that appear across all documents.
Scikit-learn provides a vectorizer for TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_samples)
X = X.toarray()
df= pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
TF-IDF vector values
Word | Document 0 | Document 1 | Document 2 |
---|---|---|---|
23 | 0.000000 | 0.000000 | 0.546454 |
capital | 0.342620 | 0.270118 | 0.000000 |
despite | 0.000000 | 0.355173 | 0.000000 |
equinox | 0.000000 | 0.000000 | 0.546454 |
flaws | 0.000000 | 0.355173 | 0.000000 |
highly | 0.000000 | 0.355173 | 0.000000 |
investment | 0.000000 | 0.355173 | 0.000000 |
is | 0.266075 | 0.209771 | 0.322745 |
its | 0.000000 | 0.355173 | 0.000000 |
manila | 0.450504 | 0.000000 | 0.000000 |
model | 0.000000 | 0.355173 | 0.000000 |
of | 0.450504 | 0.000000 | 0.000000 |
philippines | 0.450504 | 0.000000 | 0.000000 |
september | 0.000000 | 0.000000 | 0.546454 |
successful | 0.000000 | 0.355173 | 0.000000 |
the | 0.450504 | 0.000000 | 0.000000 |
Word embeddings
TF-IDF is a bit better than BoW at giving relative weights to words, but it still lacks information about how similar or dissimilar two words are. Word embedding models project words into a continuous vector space, so semantically similar words end up with embeddings that are close together.
While BoW and TF-IDF representations work well only on smaller datasets, word embeddings scale to massive datasets. Models like the Generative Pre-trained Transformer (GPT) family, Bidirectional Encoder Representations from Transformers (BERT), or any other transformer model usually give better embeddings (we can also train smaller Recurrent Neural Network (RNN)-based models). These models differ mainly in how they calculate the embeddings and in the output embedding size.
Given the growing interest in vector databases, these embeddings become even more important. Here, we will train static Word2Vec embeddings using Gensim.
!pip install --upgrade gensim #We need to install gensim first
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
tokenized_samples = [word_tokenize(text) for text in text_samples]
model = Word2Vec(sentences=tokenized_samples, vector_size=100, window=5, min_count=1, workers=4)
model.wv.most_similar('capital', topn=5)
# [('successful', 0.16373926401138306), (',', 0.14594906568527222), ('Manila', 0.07480262219905853), ('model', 0.05047113448381424), ('equinox', 0.04157735034823418)]
Although the model was given very little data, it has already made an interesting inference by ranking Manila as similar to capital. More data would produce far more meaningful results.
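You can also inspect the raw vectors or compute the similarity between a specific pair of words from the model trained above; for example:
vector = model.wv['capital'] # a 100-dimensional vector (vector_size=100 above)
print(vector.shape)          # (100,)
print(model.wv.similarity('capital', 'Manila')) # cosine similarity between the two words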
4. Text mining techniques
Several algorithms can be used for text mining. Some require a labeled dataset (supervised learning), while others can work with unlabeled data.
Text classification
Text classification is a loose term that can refer to any sort of (supervised) classification for text samples. For example:
- Sentiment analysis: We can classify the documents to see their sentiment. This can be useful when analyzing customer reviews, social media trends from comments, etc. Usually, sentiment analysis results in three classes: positive, negative, and neutral.
- Grammatical correctness: Any grammar-checking software may determine if the given text is (grammatically) correct or not.
Text classification can be done using different classification algorithms like Support Vector Machines (SVM), neural networks, or advanced architectures like transformers. Here is a little example using logistic regression (we will continue using the same text_samples from earlier; feel free to replace them with your own dataset).
from sklearn.linear_model import LogisticRegression
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_samples)
model = LogisticRegression().fit(X, [1, 0, 1]) # toy labels, just for demonstration
model.predict(vectorizer.transform(["22 Dec is Winter Solstice."]))
Obviously, it makes little sense to train a classifier on just three text samples. This model will give better results with more data, but the code and the rest of the procedure will be the same.
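For reference, on a realistically sized labeled dataset the same procedure usually looks something like the sketch below. The tiny texts and labels lists here are hypothetical placeholders; swap in your own data. A Pipeline simply chains the vectorizer and the classifier so they can be fit and evaluated together.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled data - replace with your own dataset.
texts = ["great product", "terrible service", "absolutely love it", "awful experience", "works fine", "would not recommend"]
labels = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)
clf = Pipeline([("tfidf", TfidfVectorizer()), ("logreg", LogisticRegression(max_iter=1000))])
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test)) # accuracy on the held-out split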
Text clustering
Clustering means dividing data into different categories. It’s different from classification as we don’t provide any labeling. Hence, clusters are learned in an unsupervised way.
One of the most famous and easy-to-understand clustering algorithms is K-means. It asks the user to provide the number of clusters and assigns samples to them, usually based on Euclidean distance. Hence, text samples with closer embeddings end up in the same cluster.
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=2).fit(X)
for label, sample in zip(clusters.labels_, text_samples):
    print(f"Cluster {label}: {sample}")
"""
Cluster 1: Manila is the capital of the Philippines.
Cluster 1: Capital investment model, despite its flaws, is highly successful.
Cluster 0: 23 September is equinox.
"""
Other clustering algorithms include SOM (Self-Organizing Maps) or GMM (Gaussian Mixture Models). A related application is topic modeling.
Topic modeling
Topic modeling also uses unsupervised learning to produce a set of terms that define the text/documents. It checks the most frequently occurring terms in a document and also how these terms correlate with each other.
One of the best-known topic modeling algorithms is Latent Semantic Analysis (LSA). LSA works somewhat similarly to TF-IDF by checking a term's occurrence in a document and the common terms across documents.
While there is some math involving eigenvectors and Singular Value Decomposition (SVD) behind the scenes, the bottom line is that LSA performs dimensionality reduction (using SVD) and then uses cosine similarity to check how closely two documents match.
LSA can be performed simply in a couple of lines: apply SVD and then fit it on the data (documents).
from sklearn.decomposition import TruncatedSVD

lsa = TruncatedSVD(n_components=2)
X_lsa = lsa.fit_transform(X)
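To see which terms define each component (topic), you can look at lsa.components_ against the vectorizer's vocabulary. A minimal sketch, assuming the TF-IDF vectorizer and X from the text representation section:
import numpy as np

terms = vectorizer.get_feature_names_out()
for i, component in enumerate(lsa.components_):
    top_terms = [terms[idx] for idx in np.argsort(component)[::-1][:5]] # highest-weighted terms
    print(f"Topic {i}: {top_terms}")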
5. Named entity recognition (NER)
Named Entity Recognition (NER) is an excellent application of text mining. NER lets us find names of people, places, organizations, etc. (i.e., proper nouns), as well as other specific information in the text. For NER, we will use the spaCy library (it gives better NER results than NLTK).
First, download the English model from the command line:
python -m spacy download en_core_web_sm
Then load it and process our sample text:
import spacy
english_model = spacy.load("en_core_web_sm")
text = "Mrs. Linton saw Isabella tear herself free, and run into the garden; and a minute after, Heathcliff opened the door."
processed_text = english_model(text)
Passing text through the spaCy model processes it and stores the named entities in the ents attribute, which we can easily iterate over to fetch them.
for entity in processed_text.ents:
    print(entity.text, entity.label_)
"""
Linton PERSON
Isabella PERSON
a minute TIME
Heathcliff ORG
"""
Heathcliff is clearly a person, not an organization. At first, I thought it might be a one-off mistake, but when I tried some other examples, this ORG issue kept recurring. Let's use a transformer model to see if it does better.
!pip install transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)
text = "Mrs. Linton saw Isabella tear herself free, and run into the garden; and a minute after, Heathcliff opened the door."
ner_results = ner_pipeline(text)
print(ner_results)
"""
[{'entity': 'B-PER',
'score': 0.5869316,
'index': 3,
'word': 'Lin',
'start': 5,
'end': 8},
{'entity': 'B-PER',
'score': 0.99575436,
'index': 6,
'word': 'Isabella',
'start': 16,
'end': 24},
{'entity': 'B-PER',
'score': 0.9910059,
'index': 22,
'word': 'Heath',
'start': 89,
'end': 94},
{'entity': 'B-PER',
'score': 0.5074362,
'index': 23,
'word': '##cliff',
'start': 94,
'end': 99}]
"""
Now, this fixed the issue of Heathcliff being recognized as an organization, but it led to another one: Linton was no longer recognized as a person's name in full, and only the subword Lin was tagged instead. These models have their pros and cons, and I admit these examples are too few to judge them by.
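The subword fragments in the raw output can be merged by asking the pipeline to aggregate tokens into entity groups. This won't necessarily fix every boundary issue (like Lin vs. Linton), but it does join adjacent pieces such as Heath and ##cliff. A quick sketch using the aggregation_strategy parameter:
# Group subword tokens into whole entities instead of raw B-PER pieces.
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
for entity in ner_pipeline(text):
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))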
6. Sentiment analysis
Sentiment analysis is another useful application of text processing. We often want to see the tone of a message. Sentiment classification is usually based on three categories: positive, negative, and neutral.
With a large amount of unlabeled data, VADER (Valence Aware Dictionary and sEntiment Reasoner) is a useful tool for labeling it automatically. VADER is free and works in a few lines of code, but it has its limitations, as its accuracy is only around 80%.
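For completeness, here is roughly what VADER looks like through NLTK (you need to download the vader_lexicon resource first). A compound score above roughly +0.05 is usually read as positive, below -0.05 as negative, and neutral in between:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("LIGO's detection of gravitational waves is an incredible breakthrough!"))
# {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}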
Since we have already covered sentiment analysis in detail here, I will round this off with just a little example of Hugging Face.
from transformers import pipeline, AutoTokenizer

sentiment_pipeline = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment", tokenizer=AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")) # Feel free to use a model of your own choice
df = pd.read_csv("consolidatedTweets.csv")
df['label'] = df['full_text'].apply(lambda x: sentiment_pipeline(x)[0]['label'])
Here, I am using the same tweets file that I used in the sentiment analysis article. To improve the labeling quality (over VADER), we are using transformer models (feel free to specify the model of your choice) to label it.
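As a quick sanity check, you can look at how the predicted labels are distributed. Note that this particular model outputs raw labels such as LABEL_0, LABEL_1, and LABEL_2, which typically correspond to negative, neutral, and positive respectively (check the model card to confirm):
# Distribution of predicted sentiment labels across the tweets.
print(df['label'].value_counts())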
7. Text visualization
As the saying goes, a picture is worth a thousand words. We love to visualize things, but text doesn't lend itself to visualization as naturally as images do. However, we can use a couple of tools to visualize the word distribution in a text.
Word clouds
Word clouds provide a way of visualizing text datasets by showing the most commonly occurring terms (a term's size is proportional to its frequency, as we will see).
To use word clouds, we need to install the library first (along with the datasets library used to fetch the example dataset):
pip install wordcloud datasets
I will use the AG News dataset here, a news classification dataset drawn from the AG corpus of more than one million news articles.
from datasets import load_dataset
from wordcloud import WordCloud
dataset = load_dataset("ag_news") #Getting dataset from Hugging Face
dataset_plain_text = " ".join(dataset['train']['text']) #COnverting it into plain text for word clouds
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords='noisy_words').generate(dataset_plain_text)
The dataset, like other Hugging Face datasets, is already split into train and test sets. I will use the training split (the bigger one). Now that the word cloud is generated, all we have to do is display it using Matplotlib:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
As we can see, some common terms are garbage/stop words, which can be weeded out. To uproot them, we will make a list of these garbage terms and specify them as stop words.
noisy_words = ["AP", "quot", "U", "S", "gt", "lt", "b"]
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=noisy_words).generate(dataset_plain_text)
As you can see, we now have a better picture of the most frequent terms. I am surprised to see that weekday terms like "on Thursday" occur more often than weekend ones. We can refine the cloud further by removing trivial terms (like prepositions and articles) too. The complete code example is included here for reference.
from datasets import load_dataset
from wordcloud import WordCloud
import matplotlib.pyplot as plt
dataset = load_dataset("ag_news")
dataset_plain_text = " ".join(dataset['train']['text'])
noisy_words = ["AP", "quot", "U", "S", "gt", "lt", "b"] #Feel free to change or mute them
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=noisy_words).generate(dataset_plain_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Bar plots and histograms
While these word clouds look good, they don't provide the statistical precision and insights that charts do. To chart which terms are most frequent, we can use the Counter class.
from collections import Counter
words = dataset_plain_text.lower().split()
word_counts = Counter(words)
df = pd.DataFrame(word_counts.items(), columns=['Word', 'Frequency']).sort_values(by='Frequency', ascending=False).head(20)
plt.figure(figsize=(12, 6))
plt.bar(df['Word'],df['Frequency'], color=['green', 'yellow'])
plt.show()
If you try to make this chart without any filtering, it will show the correct bar for each term, but too many terms will end up crowding each other. Hence, it's better to check only the top 10 or 20 terms.
As we can see, the chart is dominated by trivial terms like prepositions and articles (and no linguist would disagree there). To make it more meaningful, we can remove these trivial terms, just as we did for the word clouds.
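Here is one way to do that filtering before counting, reusing NLTK's English stop word list from earlier (a sketch; adjust the filters to taste):
from collections import Counter
from nltk.corpus import stopwords
import string

stop_words = set(stopwords.words('english'))
# Strip attached punctuation, then drop stop words and non-alphabetic tokens.
words = [w.strip(string.punctuation) for w in dataset_plain_text.lower().split()]
filtered_words = [w for w in words if w and w.isalpha() and w not in stop_words]

word_counts = Counter(filtered_words)
df = pd.DataFrame(word_counts.most_common(20), columns=['Word', 'Frequency'])
plt.figure(figsize=(12, 6))
plt.bar(df['Word'], df['Frequency'], color='green')
plt.xticks(rotation=45)
plt.show()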
Get started with text analysis
Text analysis offers an accessible path into data science, requiring minimal computing power compared to image or video processing. With Python's rich ecosystem of libraries, you can quickly move from basic word counting to sophisticated natural language processing. The abundance of text data from social media, books, and websites lets you practice with real-world examples, starting with basic pattern recognition before exploring advanced techniques and larger datasets as your skills grow.