NLP and the rise of the Transformers
We’re Apify, and our mission is to make the web more programmable. Part of that is getting data for AI. Check us out.
In 2017, a ground-breaking research paper called Attention Is All You Need was published by researchers at Google. It introduced the Transformer, the deep learning architecture that has been used for natural language processing ever since.
That paper was responsible for the widespread adoption of transformer models for NLP, and this is what led to the sharp rise to fame of large language models such as ChatGPT (the T in GPT stands for Transformer).
Large language models, or LLMs, are a form of generative AI. They are transformer models that use deep learning methods to understand and generate text in a human-like fashion.
Since 2017, transformer models like the ones behind ChatGPT have gotten better at generating text that can pass for human writing. That's largely because their training datasets have grown in scope and size thanks to web scraping: an automated data collection method used to build training corpora and to keep improving and customizing LLMs with relevant, up-to-date information.
However, there's one major drawback to training a huge, state-of-the-art transformer model for natural language processing: the price tag. For a project like that, you need millions of dollars.
Small businesses and startups wishing to launch their own NLP and LLM projects couldn't do much to compete with giants like Google, Facebook, and Microsoft, which were once the only companies developing and using NLP models. But then Hugging Face hit the scene…
Machine learning (ML) focuses on teaching machines to perform specific tasks accurately by identifying patterns in data. ML algorithms learn from that data and make informed decisions based on what they have learned.
Deep learning is a subfield of ML that structures algorithms in layers to create an artificial neural network that can learn in a self-supervised fashion.
Large language models fall into the category of deep learning.
What is Hugging Face?
Hugging Face is one of the fastest-growing open-source projects around, yet ironically, it's a commercial company, and its repository is not itself an open-source platform. Then again, neither is GitHub (it's owned by Microsoft). What counts is that the files hosted on both platforms are open source.
Hugging Face is changing how companies use NLP models by making them accessible to everyone. It builds open-source libraries to support AI and machine learning projects and helps people and organizations overcome the vast costs of building Transformers.
Hugging Face began in 2016 and went on to build its library of attention-based transformer models, with the grand ambition of becoming “the GitHub of machine learning”. Today, it's one of the leading platforms for natural language processing and provides open-source NLP technologies, thousands of pre-trained models to perform tasks related to text, image, and audio, and a large number of datasets and tokenizers, not to mention multiple courses.
In addition to its transformer library, Hugging Face is famous for its Hub, with over 120 thousand models, 30 thousand datasets, and 50 thousand demo apps called Spaces, all of which are open source and publicly available.
A Transformer is a deep learning architecture that relies on an 'attention mechanism', which lets the model weigh the most relevant parts of an input sequence when producing each part of its output. Transformers were adopted for large language models because they process sequences in parallel and therefore need far less training time than earlier recurrent architectures.
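To make the attention idea concrete, here's a minimal sketch of scaled dot-product attention (the core operation from Attention Is All You Need) written in PyTorch; the tensor shapes are just illustrative:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # how strongly each token attends to every other token
    return weights @ value

# Toy example: one sequence of 4 tokens with 8-dimensional embeddings
x = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 4, 8])
```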
How Hugging Face helps with NLP and LLMs
1. Model accessibility
Prior to Hugging Face, working with LLMs required substantial computational resources and expertise. Hugging Face simplifies this process by providing pre-trained models that can be readily fine-tuned and used for specific downstream tasks. The process involves three key steps (a short code sketch follows the list):
- Model selection
Hugging Face's model hub hosts a vast collection of pre-trained models. Users can choose from a variety of architectures and sizes depending on their requirements.
- Fine-tuning
Hugging Face also provides fine-tuning scripts and examples for common NLP tasks. Users can fine-tune pre-trained models on their specific datasets (more about that later) by leveraging transfer learning, achieving state-of-the-art performance with less data and computation.
- Inference and deployment
Once the model is fine-tuned, it can be used for inference on new data. Hugging Face provides convenient APIs for deploying models in various environments, including web applications and cloud platforms.
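To make those three steps concrete, here's a minimal sketch using the Transformers and Datasets libraries. The checkpoint (distilbert-base-uncased), the IMDB slice, and the training settings are illustrative choices rather than recommendations, and a real fine-tuning run would use far more data:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments, pipeline)

# 1. Model selection: pick a pre-trained checkpoint from the Hub
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# 2. Fine-tuning: adapt the pre-trained weights to a small labelled dataset
dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(1000))
dataset = dataset.train_test_split(test_size=0.2)
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="imdb-distilbert", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()

# 3. Inference and deployment: wrap the fine-tuned model in a pipeline for new text
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("A surprisingly heartfelt film with a clumsy ending."))
```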
2. Model interpretability
Interpreting the decisions made by LLMs is crucial for understanding and mitigating biases, ensuring ethical use, and building trust. Hugging Face integrates tools like the Transformers-Interpret library, which enables users to perform model interpretability tasks such as feature importance analysis, saliency mapping, and attention visualization. These tools help users gain insight into how a model makes its predictions.
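As a rough sketch of what that looks like, the transformers-interpret package wraps a model and tokenizer in an explainer class. The checkpoint and input sentence below are arbitrary examples, and the HTML visualization is best viewed from a notebook:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

# A sentiment model fine-tuned on SST-2, used here purely as an example
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

explainer = SequenceClassificationExplainer(model, tokenizer)
word_attributions = explainer("Hugging Face makes model interpretability far less painful.")
print(word_attributions)         # per-token attribution scores for the predicted class
explainer.visualize("viz.html")  # saves an HTML attribution view (designed for notebooks)
```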
3. Integration with other tools
Hugging Face seamlessly integrates with other popular NLP tools, further expanding its capabilities and usability. The transformers library supports PyTorch, TensorFlow, and JAX, enabling users to work with their preferred deep-learning framework.
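In practice, that means the same Hub checkpoint can be loaded into whichever framework you prefer, provided that framework is installed. A minimal sketch:

```python
from transformers import AutoModel, TFAutoModel, FlaxAutoModel

checkpoint = "bert-base-uncased"
pt_model = AutoModel.from_pretrained(checkpoint)        # PyTorch
tf_model = TFAutoModel.from_pretrained(checkpoint)      # TensorFlow (requires tensorflow)
flax_model = FlaxAutoModel.from_pretrained(checkpoint)  # JAX/Flax (requires flax)
```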
4. Datasets for training LLMs
Creating datasets for training LLMs is a time-consuming and challenging process. If the data isn't accurate, up-to-date, and relevant to the purpose for which the LLM is being trained, it will hallucinate fake answers. That's why scraping data for generative AI is the best solution to customizing and improving large language models with relevant and current data.
Hugging Face also provides a brilliant solution here, as it hosts over 30,000 prepared datasets that you can use to train LLMs. These datasets pair examples with labels, and the labels tell the model how to interpret the examples. From there, the model can begin to identify patterns in the frequency of words, letters, and sentence structures. Train the LLM for long enough, and you can feed it a prompt that isn't included in the dataset; the model will then produce an output based on the experience it built up during training.
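Here's a quick sketch of loading and inspecting one of those datasets with the datasets library, using the AG News topic-classification set purely as an example:

```python
from datasets import load_dataset

# AG News: a ready-made labelled topic-classification dataset from the Hub
dataset = load_dataset("ag_news")
print(dataset)                                   # available splits and their sizes
print(dataset["train"][0])                       # {'text': '...', 'label': 2}
print(dataset["train"].features["label"].names)  # label id -> label name mapping
```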
5. Tokenizers
Tokenization is a fundamental step in NLP: it converts text data into numerical tokens that LLMs can process. Hugging Face's Tokenizers library offers efficient tokenization algorithms for a wide range of languages, ensures compatibility with the Transformers library, and helps users handle text preprocessing effectively. You can use these tokenizers to train new vocabularies and do all the pre-processing that NLP demands.
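As a rough sketch, here's how you might train a small byte-pair-encoding vocabulary from scratch with the Tokenizers library; corpus.txt is a placeholder for your own text files:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build and train a byte-pair-encoding tokenizer on your own text files
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder path

encoding = tokenizer.encode("Transformers are eating the world of NLP.")
print(encoding.tokens)  # the text split into learned subword units
print(encoding.ids)     # the numerical token ids the model actually sees
```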
6. Hugging Face Pipelines
Hugging Face Pipelines provide a streamlined interface for common NLP tasks, such as text classification, named entity recognition, and text generation. They abstract away the complexities of model usage, allowing users to perform inference with just a few lines of code.
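For example, a few common tasks each take only a couple of lines; the models used here are the pipeline defaults or illustrative picks like gpt2:

```python
from transformers import pipeline

# Each pipeline bundles a default model, tokenizer, and post-processing for one task
classifier = pipeline("sentiment-analysis")
ner = pipeline("ner", aggregation_strategy="simple")
generator = pipeline("text-generation", model="gpt2")

print(classifier("Hugging Face makes NLP feel easy."))
print(ner("Hugging Face was founded in New York City."))
print(generator("Transformers are", max_new_tokens=20, num_return_sequences=1))
```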
So, is Hugging Face worth the hype?
Absolutely! Given how much Hugging Face has done to make Transformers affordable and to empower small companies and startups to train large language models, it certainly deserves its reputation as a champion of open source, and it's undoubtedly living up to its ambition to be the GitHub of machine learning.
Further reading 📚
If you want to learn how to use Hugging Face or if you want to find out more about LLMs and generative AI in general, I suggest you peruse the content below.
Hugging Face series
- How to use Hugging Face transformers and pipelines
- Text and token classification in NLP
- Machine translation with Hugging Face
- Question answering and conversational models
- How to use Hugging Face for computer vision
Other content on AI, machine learning, and NLP
- Traditional NLP techniques and the rise of LLMs
- What is Haystack? An introduction to the NLP framework
- LlamaIndex vs. LangChain
- Web scraping for machine learning
- Edge AI vs. Cloud AI
- What is data ingestion for large language models?
- How to do question answering from a PDF
- Web scraping for AI: how to collect data for LLMs
- What is a vector database?
- What is generative AI?