Text and token classification in NLP

How to use Hugging Face for natural language processing.


Following our introduction to this Hugging Face series, we'll focus on Natural Language Processing (NLP) tasks in this blog post.

Before going further, let’s install the required libraries. We'll need transformers for every task, and sentencepiece is a tokenization library that some models' tokenizers depend on.

!pip install transformers
!pip install sentencepiece

Text classification

A text can be classified according to a number of criteria. For example, a reader may judge whether a passage sounds positive or negative, or whether it's grammatically correct.

To use a text classifier, we import the pipeline function and create a pipeline for the respective task (text-classification).

from transformers import pipeline

textClassifier = pipeline("text-classification")

By default, it uses a DistilBERT model fine-tuned for sentiment classification on the SST-2 dataset. It's a binary classifier: it labels the input as either POSITIVE or NEGATIVE, and it has no neutral class.

textClassifier("He was over the moon to hear the good news.")


[{'label': 'POSITIVE', 'score': 0.9995835423469543}]

It's pretty straightforward, so no prizes for guessing it's a positive text. But the confidence score (>99%) is quite impressive here. Let’s try some other examples.
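For some intuition about where that score comes from: classifiers like this one typically produce one raw score (a logit) per label and convert the logits into probabilities with a softmax, and the pipeline reports the top label with its probability. Here's a minimal sketch in plain Python. The logit values are made up for illustration and aren't taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the labels (NEGATIVE, POSITIVE); illustrative values only.
labels = ["NEGATIVE", "POSITIVE"]
logits = [-3.5, 4.0]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

With these made-up numbers, a gap of 7.5 between the two logits already pushes the score above 0.999. A high score tells us how far apart the logits are, not necessarily that the prediction is right.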

textClassifier("As you are aware 2023 has been a challenging year.  The continuous stream of unprecedented global challenges, surging energy prices and volatile market conditions have had a significant impact on the logistics industry. The situation is particularly more concerning for Pakistan whereby economic uncertainty combined with rising inflation, elevated operational cost, and fluctuations in currency exchange rates have affected businesses across various sectors, including DHL Express.")


[{'label': 'POSITIVE', 'score': 0.9881656169891357}]

Wow! That’s pretty optimistic of the model to declare it as positive, and with such high confidence.

textClassifier("‘Don’t you think you would attract attention?’ said the Medical Man. ‘Our ancestors had no great tolerance for anachronisms.’")


[{'label': 'NEGATIVE', 'score': 0.9981589913368225}]

Now, I'll try a fairly neutral sentence and see how the model copes.

textClassifier("There were others coming, and presently a little group of perhaps eight or ten of these exquisite creatures were about me. One of them addressed me")


[{'label': 'POSITIVE', 'score': 0.9976465106010437}]

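The model seems to swing confidently to one extreme or the other, even for a fairly neutral sentence: every prediction so far is POSITIVE or NEGATIVE with a score above 98%. By the way, creating the pipeline with “sentiment-analysis” instead of “text-classification” loads the same model.

Now, we'll try another model to check whether a sentence is grammatically correct.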

Grammatical correctness

For grammatical correctness, we have a model trained on CoLA (I'll talk about it in a while). Let’s test this.

It returns LABEL_0 for unacceptable and LABEL_1 for acceptable.

grammaticalClassifier = pipeline("text-classification", model="textattack/distilbert-base-uncased-CoLA")

Let’s try it out a bit.

Test 1:

grammaticalClassifier("It surprises me a lot when I sees the images of airlines flying the empty flights.")

# Output: [{'label': 'LABEL_1', 'score': 0.9570088982582092}]

Test 2:

grammaticalClassifier("you doesn't deserve this after what have you went through")

# Output: [{'label': 'LABEL_0', 'score': 0.5380246639251709}]

Test 3:

grammaticalClassifier("I doesn't understand why schools is need to be closed in the summers.")

# Output: [{'label': 'LABEL_0', 'score': 0.8870238065719604}]
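Generic names like LABEL_0 are easy to mix up, so it can help to translate them before reading the results. The sketch below does that in plain Python for the three outputs above. COLA_LABELS and readable are hypothetical helper names, and the mapping is the one stated earlier (LABEL_0 for unacceptable, LABEL_1 for acceptable).

```python
# Mapping stated in the text: LABEL_0 = unacceptable, LABEL_1 = acceptable.
COLA_LABELS = {"LABEL_0": "unacceptable", "LABEL_1": "acceptable"}

def readable(results):
    """Replace generic label ids with readable names and round the score."""
    return [
        {"label": COLA_LABELS.get(r["label"], r["label"]), "score": round(r["score"], 3)}
        for r in results
    ]

# Outputs copied from the three tests above.
test_outputs = [
    [{"label": "LABEL_1", "score": 0.9570088982582092}],
    [{"label": "LABEL_0", "score": 0.5380246639251709}],
    [{"label": "LABEL_0", "score": 0.8870238065719604}],
]

for number, output in enumerate(test_outputs, start=1):
    print(f"Test {number}:", readable(output))
```

Labels that aren't in the mapping are passed through unchanged, so the helper doesn't hide anything unexpected.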

The results are mixed: going by the label mapping above, the model accepts the first sentence despite “I sees” and rejects the second with a score of only about 54%.

Coming back to the Corpus of Linguistic Acceptability (CoLA): it's a standard dataset for training and evaluating models that judge whether a sentence is grammatically acceptable.

I'm not completely sure, but my intuition is that grammar checkers (like Grammarly) have used this corpus, fine-tuning further on the heaps of data they acquire from users.

Natural Language Inference (NLI)

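In NLI, we check how two statements relate: whether the second (the hypothesis) follows from the first (the premise), contradicts it, or neither. The model used here reports these cases as ENTAILMENT, CONTRADICTION, and NEUTRAL.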

By the way, before loading a new model, it's a good idea to delete the models we no longer need so their memory can be released. For NLI, a popular choice is a RoBERTa model fine-tuned on the MNLI dataset.

del textClassifier
del grammaticalClassifier
nliClassifier = pipeline("text-classification", model="roberta-large-mnli")
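A side note on the del statements above: del only removes the name, and the memory is reclaimed once nothing else references the object. The sketch below illustrates this with a small stand-in object. DummyModel is hypothetical and not a transformers class, and gc.collect() is an optional extra step that also clears reference cycles.

```python
import gc
import weakref

class DummyModel:
    """Stand-in for a loaded pipeline (hypothetical, not a transformers class)."""
    def __init__(self):
        self.weights = [0.0] * 1000

model = DummyModel()
alias = model             # a second reference to the same object
ref = weakref.ref(model)  # lets us observe the object without keeping it alive

del model
still_alive = ref() is not None  # the alias still keeps the object alive

del alias
gc.collect()
freed = ref() is None  # no references left, so the object is gone

print(still_alive, freed)
```

The same applies in a notebook: if another variable still points to the pipeline, deleting one name won't free the model.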

Test 1:

nliClassifier("Yemen has five sites on the list of World Heritage Sites. The first site from Yemen on the list, the Old Walled City of Shibam was designated in 1982.")

# Output: [{'label': 'NEUTRAL', 'score': 0.9883605241775513}]

Test 2:

nliClassifier("South Africa is one of the best cricket teams in the world. South Africa hasn't won any world cup yet.")

# Output: [{'label': 'CONTRADICTION', 'score': 0.912165105342865}]
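Both tests above pass the two sentences as a single string. NLI models are trained on (premise, hypothesis) pairs, and the text-classification pipeline also accepts a dictionary with text and text_pair keys so that the two sentences are encoded as a pair. Here's a minimal sketch of that call shape. classify_pair is a hypothetical helper name, and stub_classifier only stands in for nliClassifier so the snippet runs without downloading the model. The labels and scores may differ from those shown above when the sentences are passed this way.

```python
def classify_pair(classifier, premise, hypothesis):
    """Send a premise/hypothesis pair to a text-classification pipeline."""
    result = classifier({"text": premise, "text_pair": hypothesis})
    # Depending on the transformers version, the result may be a dict or a one-item list.
    return result[0] if isinstance(result, list) else result

def stub_classifier(inputs):
    """Stand-in for nliClassifier (hypothetical); returns a placeholder, not a real prediction."""
    assert set(inputs) == {"text", "text_pair"}
    return [{"label": "PLACEHOLDER", "score": 0.0}]

premise = "South Africa is one of the best cricket teams in the world."
hypothesis = "South Africa hasn't won any world cup yet."

print(classify_pair(stub_classifier, premise, hypothesis))
# With the real model: classify_pair(nliClassifier, premise, hypothesis)
```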

Token classification

There are situations where we need information about individual words in a text rather than the text as a whole. Determining the parts of speech in a sentence, for example, requires classifying each word separately. This is where token classification comes in handy.

Hugging Face hosts a number of pretrained token classification models, each paired with its own tokenizer, and we can pick whichever suits the task.

PoS tagging

Token classification is useful for part-of-speech (PoS) tagging, which assigns a grammatical category (noun, verb, determiner, etc.) to each token.

tokenClassifier = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")
tokenClassifier("A cat is sitting on the table.")


[{'entity': 'DET',
  'score': 0.9995196,
  'index': 1,
  'word': 'a',
  'start': 0,
  'end': 1},
 {'entity': 'NOUN',
  'score': 0.99896586,
  'index': 2,
  'word': 'cat',
  'start': 2,
  'end': 5},
 {'entity': 'AUX',
  'score': 0.9972844,
  'index': 3,
  'word': 'is',
  'start': 6,
  'end': 8},
 {'entity': 'VERB',
  'score': 0.99938405,
  'index': 4,
  'word': 'sitting',
  'start': 9,
  'end': 16},
 {'entity': 'ADP',
  'score': 0.99917114,
  'index': 5,
  'word': 'on',
  'start': 17,
  'end': 19},
 {'entity': 'DET',
  'score': 0.9995147,
  'index': 6,
  'word': 'the',
  'start': 20,
  'end': 23},
 {'entity': 'NOUN',
  'score': 0.9988354,
  'index': 7,
  'word': 'table',
  'start': 24,
  'end': 29},
 {'entity': 'PUNCT',
  'score': 0.9996613,
  'index': 8,
  'word': '.',
  'start': 29,
  'end': 30}]
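The raw output is verbose, so it's often handy to reduce it to (word, tag) pairs. The sketch below does this in plain Python for the result above, trimmed to the two fields it needs. to_pairs is a hypothetical helper name. One caveat: longer words can be split into sub-word pieces by the tokenizer, and this sketch doesn't merge them.

```python
# Output from the example above, trimmed to the fields used here.
tagged = [
    {"entity": "DET", "word": "a"},
    {"entity": "NOUN", "word": "cat"},
    {"entity": "AUX", "word": "is"},
    {"entity": "VERB", "word": "sitting"},
    {"entity": "ADP", "word": "on"},
    {"entity": "DET", "word": "the"},
    {"entity": "NOUN", "word": "table"},
    {"entity": "PUNCT", "word": "."},
]

def to_pairs(tokens):
    """Reduce token-classification output to (word, tag) pairs."""
    return [(tok["word"], tok["entity"]) for tok in tokens]

pairs = to_pairs(tagged)
print(" ".join(f"{word}/{tag}" for word, tag in pairs))

# Filtering by tag is then a one-liner, e.g. collecting the nouns.
nouns = [word for word, tag in pairs if tag == "NOUN"]
print(nouns)
```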

Named Entity Recognition (NER)

We can also use token classification for named entity recognition, where we identify proper nouns (names of persons, cities, countries, etc.). It can be used either through the token-classification pipeline or through its own pipeline name (ner).

namedEntityRecognizer = pipeline("ner")
namedEntityRecognizer("Mount Everest lies on the border of Nepal and China. It perplexes me to know that it is closer to equator than Lahore.")


[{'entity': 'I-LOC',
  'score': 0.6424599,
  'index': 1,
  'word': 'Mount',
  'start': 0,
  'end': 5},
 {'entity': 'I-LOC',
  'score': 0.8563523,
  'index': 2,
  'word': 'Everest',
  'start': 6,
  'end': 13},
 {'entity': 'I-LOC',
  'score': 0.9997261,
  'index': 8,
  'word': 'Nepal',
  'start': 36,
  'end': 41},
 {'entity': 'I-LOC',
  'score': 0.9998282,
  'index': 10,
  'word': 'China',
  'start': 46,
  'end': 51},
 {'entity': 'I-LOC',
  'score': 0.99936837,
  'index': 28,
  'word': 'Lahore',
  'start': 111,
  'end': 117}]
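Notice that "Mount" and "Everest" come back as two separate tokens, both tagged I-LOC (a location). To get whole entities, adjacent tokens with the same tag need to be merged. The sketch below does this by hand in plain Python, using the start and end offsets from the output above. group_entities is a hypothetical helper name, and this is only one simple way of grouping.

```python
text = "Mount Everest lies on the border of Nepal and China. It perplexes me to know that it is closer to equator than Lahore."

# Output from the example above, trimmed to the fields used here.
tokens = [
    {"entity": "I-LOC", "index": 1, "start": 0, "end": 5},
    {"entity": "I-LOC", "index": 2, "start": 6, "end": 13},
    {"entity": "I-LOC", "index": 8, "start": 36, "end": 41},
    {"entity": "I-LOC", "index": 10, "start": 46, "end": 51},
    {"entity": "I-LOC", "index": 28, "start": 111, "end": 117},
]

def group_entities(tokens, text):
    """Merge tokens that share a tag and sit at consecutive token positions."""
    spans = []
    for tok in tokens:
        last = spans[-1] if spans else None
        if last and last["entity"] == tok["entity"] and tok["index"] == last["last_index"] + 1:
            # Extend the current span to cover this token as well.
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            spans.append({"entity": tok["entity"], "start": tok["start"],
                          "end": tok["end"], "last_index": tok["index"]})
    # Slice the original text so spacing and casing are preserved.
    return [(text[s["start"]:s["end"]], s["entity"]) for s in spans]

entities = group_entities(tokens, text)
print(entities)
```

The pipeline can also do this kind of grouping itself through its aggregation_strategy argument (for example, "simple"), which is usually the more convenient route.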

As you can see, there are a number of variants in both text and token classification. Each of them is useful in its own way. That's it from me for now. We'll catch up soon in the next installment of this series.
