An introduction to transformers
It's no secret that transformer models (like GPT-3, LLaMA, and ChatGPT) have revolutionized AI. They're used not only for natural language processing but also for computer vision, speech processing, and other tasks. Hugging Face is a Python deep learning library centered around the power of transformers. So before we get into the details of how the library works, we're going to explore what transformers are and why they enable such powerful models.
Recurrent Neural Networks (RNNs)
Before moving forward, a quick recap of sequence models will be useful. Feedforward neural networks work well for ordinary, fixed-size data, but for sequence data (like text, speech, or some video data), we also need contextual information, because the meaning of each element depends on what surrounds it.
For example, in the sentence, “We walked to town, my sister leading the way in a very large beaver bonnet, and carrying a basket….” we cannot determine who's carrying the basket, who was accompanying the narrator, or what she was wearing until we model the connections between all these words. This is done using RNNs, which maintain the context/history of the input seen so far. That input could be a word or a movie frame, for example, but I'll use the term data to keep it more general. We can even take upcoming data into account by making the neural network bidirectional.
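If a concrete picture helps, here's a minimal sketch (using PyTorch, with arbitrary dimensions of my own choosing) of a bidirectional RNN turning a sequence of word vectors into one context-aware vector per word:
import torch
import torch.nn as nn

# 8-dimensional word vectors in, 16-dimensional hidden state, reading the sequence in both directions
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

sentence = torch.randn(1, 20, 8)   # a batch of 1 "sentence" with 20 tokens, each an 8-dim vector
outputs, hidden = rnn(sentence)

print(outputs.shape)               # torch.Size([1, 20, 32]): one context-aware vector per token (forward + backward)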
Vanishing gradients
One of the major issues with traditional RNNs is vanishing gradients. When applying backpropagation, we use the chain rule to connect any layer (or time step) with the output layer in order to calculate partial derivatives. As the number of layers or time steps grows, those gradients often become extremely small.
Since the chain rule multiplies many derivatives together, a long chain of terms smaller than one quickly shrinks toward zero. Even if the calculus isn't familiar, the bottom line is that the earliest parts of the network learn/train extremely slowly, so we have to keep the number of layers (and the amount of context) fairly small. As a result, traditional (often called vanilla) RNNs cannot carry much contextual information.
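To see how quickly those products shrink, here's a toy illustration (the 0.5 is a made-up local derivative, not taken from any real network):
# Toy example: multiplying many small local derivatives together
derivative_per_step = 0.5
for steps in (10, 50, 100):
    print(steps, derivative_per_step ** steps)
# 10 0.0009765625
# 50 8.881784197001252e-16
# 100 7.888609052210118e-31
By 100 steps, the gradient reaching the earliest layers is effectively zero, which is exactly why long-range context is so hard for vanilla RNNs.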
Sepp Hochreiter identified this problem and, together with Jürgen Schmidhuber, proposed a better model known as Long Short-Term Memory (LSTM). LSTMs can hold more contextual information because they suffer far less from the vanishing gradient problem, and they have shown remarkable improvements over traditional RNNs.
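In code, trying an LSTM is usually just a matter of swapping the layer; sticking with the earlier PyTorch sketch (same arbitrary dimensions):
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
sentence = torch.randn(1, 20, 8)
outputs, (hidden, cell) = lstm(sentence)   # the extra cell state is what lets gradients flow further back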
Attention models
LSTMs were introduced in the late '90s and were remarkable for their time. Later on, a simpler and faster variant, the GRU, was introduced. However, given the pressing demand for much bigger models, architectures with far more capacity for contextual information were needed. In 2014/15, the attention mechanism was introduced to address the existing models' limitations.
Attention models are quite simple and intuitive: they assign different weights to different words, so the model can focus on the parts of the input that are most relevant to what it's currently producing.
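As a rough sketch of the idea (with made-up numbers, and ignoring the learned projections a real model would use), the weights typically come from scoring every word against a query and normalizing the scores with a softmax:
import torch
import torch.nn.functional as F

words = torch.randn(6, 4)     # made-up 4-dim representations for the 6 words of a short sentence
query = torch.randn(4)        # what the model is currently "looking for"

scores = words @ query        # how relevant each word is to the query
weights = F.softmax(scores, dim=0)

print(weights)                # 6 attention weights that sum to 1
A word with a higher weight contributes more to the output, which is all "paying attention" really means here.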
Why transformers?
Building on this attention mechanism, Google researchers published an iconic paper titled “Attention Is All You Need.” They used attention, without any recurrence, to build a new architecture known as the transformer. While it's not possible to cover the detailed architecture here, we can acknowledge its power. Transformers have a number of advantages, like:
- Parallelization
- Long-range contextual capabilities
- Scalability
What is Hugging Face? 🤗
As we mentioned at the start, Hugging Face is a deep learning library centered around transformers. The emoji and the fact that the library is named after it might be confusing, and that's understandable. My first impression was, “Is it an ML library or a comic book?” But the power of this library is not to be underestimated. It's continuously growing and successfully raised $235M in its latest Series D funding round. As per the 2023 Stack Overflow survey, Hugging Face is the second most admired technology in the “others” category.
So what makes this library so exciting? The rest of this article is dedicated to answering that question by providing a 25,000-foot view (though at times we'll fly closer to the ground).
The transformers library
Hugging Face centers around the transformers library. The core philosophy behind the transformers library is this:
- Be as easy and fast to use as possible
- Provide state-of-the-art models with performances as close as possible to the original models
We'll soon see how the transformers library is truly living up to its promise. But you'll have to install the library first.
!pip install transformers  # skip this if you've already installed it
Collecting transformers
Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
  7.6/7.6 MB 37.8 MB/s eta 0:00:00
...
Installing collected packages: tokenizers, safetensors, huggingface-hub, transformers
Successfully installed huggingface-hub-0.17.2 safetensors-0.3.3 tokenizers-0.13.3 transformers-4.33.2
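Once the install finishes, a quick sanity check is to print the version you ended up with:
import transformers
print(transformers.__version__)   # e.g. 4.33.2, matching the install log above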
Hugging Face pipelines
We call a pipeline to perform inference on a given task. These tasks aren't restricted to NLP; computer vision, reinforcement learning, and other domains are also supported.
Let's start with the translation pipeline. First, we'll import the pipeline function.
from transformers import pipeline
In order to use a particular pipeline, say English-to-French translation, we'll need to specify the task.
frenchTranslator = pipeline("translation_en_to_fr")
No model was supplied, defaulted to t5-base and revision 686f1db (<https://huggingface.co/t5-base>).
Using a pipeline without specifying a model name and revision in production is not recommended.
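That warning is worth taking seriously: for anything beyond experimentation, pin both the model and the revision so your results stay reproducible. The pipeline function accepts a revision argument for exactly this; here's a sketch reusing the checkpoint and commit hash from the warning above (frenchTranslatorPinned is just an illustrative name):
# Pin the model checkpoint and the exact revision (commit hash) reported in the warning
frenchTranslatorPinned = pipeline("translation_en_to_fr", model="t5-base", revision="686f1db")
For this walkthrough, though, the default is fine.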
Now we can use it to translate a given English sentence to French. For example:
nietzscheQuote = "Our vanity is hardest to wound precisely when our pride has just been wounded."
frenchTranslator(nietzscheQuote)
[{'translation_text': "Notre vanité est la plus difficile à blesser précisément lorsque notre fierté vient d'être blessée."}]
Constructing the pipeline downloads the model (most of them are around a gigabyte in size), and since we didn't specify one, it defaulted to T5 (t5-base, as the warning above shows). If we want to specify a particular model, we can do so as follows:
<model> = pipeline(<pipeline name>, model=<model name>)
frenchTranslatorSmall = pipeline("translation_en_to_fr", model="t5-small")
This pipeline is used in exactly the same way. For example:
frenchTranslatorSmall("The two most important days in your life are the day you are born and the day you find out why")
[{'translation_text': 'Les deux jours les plus importants de votre vie sont le jour où vous êtes née et le jour où vous découvrez pourquoi'}]
We can try some other tasks too. For example, let's try the Question Answering pipeline.
questionAnsweringSystem = pipeline("question-answering")
No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (<https://huggingface.co/distilbert-base-cased-distilled-squad>).
Using a pipeline without specifying a model name and revision in production is not recommended.
As you're probably aware, a question-answering system takes a question along with some context and extracts the required information from that context to answer the query. For example, let's take the iconic conclusion of the famous DNA double-helix paper:
context = "It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material. Full details of the structure, including the conditions assumed in building it, together with a set of co-ordinates for the atoms, will be published elsewhere."
query1 = "What does specific pairing suggest?"
questionAnsweringSystem(query1,context)
{'score': 0.40637195110321045,
'start': 96,
'end': 149,
'answer': 'a possible copying mechanism for the genetic material'}
Along with the answer, it returns a confidence score and the start/end character positions of the answer within the context string. Let's try another query:
query2 = "Where are the full details?"
questionAnsweringSystem(query2,context)
{'score': 0.5380856990814209,
'start': 290,
'end': 309,
'answer': 'published elsewhere'}
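If you'd like more than one candidate answer, the question-answering pipeline also accepts a top_k argument; for example (reusing our earlier query and context), the following should return a list of the three highest-scoring spans instead of a single dictionary:
questionAnsweringSystem(question=query2, context=context, top_k=3)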
This is just the tip of the iceberg. As of September 2023, Hugging Face supports more than 30 pipeline tasks, and this list will keep growing thanks to community contributions. We'll cover the majority of these tasks in the follow-up articles, starting with text and token classification.
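And if you're curious which tasks your installed version supports, recent releases of the library include a small helper for listing them (the exact list depends on your version):
from transformers.pipelines import get_supported_tasks
print(get_supported_tasks())   # e.g. ['audio-classification', 'automatic-speech-recognition', ...]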