Text generation with Hugging Face

An exploration of the Hugging Face text-to-text generation pipeline.


We're Apify, a full-stack web scraping and browser automation platform, serving as a bridge between the web's vast data resources and sophisticated AI tools like Hugging Face.

Following our article on machine translation, our Hugging Face series continues here.

The power of the pen is undisputed. Be it celebrities, politicians, or miserly billionaires, at the end of the day, everyone resorts to writing, or to the services of some ghostwriter, to present a better picture of their views. With the advent of AI, we have gradually developed a range of models for various NLP tasks, including text generation.

Text generation models are quite helpful, as they let us generate text to our requirements. I remember the very first time I felt the need for such a system was back in early 2019, when I found some characters from classical literature pretty interesting (like Gavroche, the careless street urchin we all love) and wanted to emulate their way of talking in some other text. In other words, I wanted to apply style transfer to text rather than images.

Fast forward to 2021: when GPT-3's API first became publicly available, I was finally able to fulfill that wish and really play around with a number of literary characters.

Let’s begin with the text generation.

import sentencepiece  # needed later by the T5 tokenizer
from transformers import pipeline

Since we have a number of text generation tasks, respective pipelines are also diverse. Let’s begin with a simple generation.

generator = pipeline("text-generation")

By default, it uses the GPT-2 model. GPT-2, introduced by OpenAI in 2019, was the second of the Generative Pre-trained Transformers (GPT) family of large language models.

Although trained on far less data and with far fewer parameters than its successors, GPT-2 is still a pretty powerful model. More importantly, unlike GPT-3 and GPT-4, it is open-source and hence used in a number of NLP tasks.

In simple text generation, we provide an “ice-breaker” prompt, and the model continues the text accordingly. We can customize the length by specifying the max_length argument (note that it counts tokens, including those of the prompt).

generator("It's still unbelievable what Hannibal achieved in the BC times with the elephants. Arguably", max_length=240)

Output:

[{'generated_text': "It's still unbelievable what Hannibal achieved in the BC times with the elephants. Arguably one of the most amazing things.\n\nNow that we're really on his heels, I'd love to hear from you. In the same vein that we have with Marnie, are you up to having your stories told as a fan or just looking to bring back the memory of the night you went to see Hannibal play live?\n\nI'm looking forward to the challenge of doing that with your audience, and with all of you. It's always exciting.\n\nIn the wake of the death of a beloved and beloved cast member, the next big thing we wanted to do is bring something special to the show, one story that you could never have predicted. We love how he kept coming back.\n\nI believe the show has a long way to go after his death. He's a pretty sweet guy, but with his life, his career, his family, his character it's time that we really took a look at that.\n\nDo you have any specific questions to share about your work with new fans?"}]

The same opening can lead to a diverse set of outputs. You may well get a different output than I got here, and that's expected: generation samples from a probability distribution rather than looking answers up in a fixed database. Let’s try another:
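Where does that variability come from? At each step, the model turns its scores (logits) into a probability distribution and draws the next token at random. Here is a toy sketch of that idea in plain Python; it is not GPT-2's actual code, and the three-word vocabulary and logits are made up purely for illustration:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token index from temperature-scaled softmax probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()                     # uniform draw in [0, 1)
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Made-up vocabulary and logits, just to show the mechanics
vocab = ["elephant", "general", "mountain"]
logits = [2.0, 1.0, 0.1]

rng = random.Random(42)  # fixing the seed makes the "stochastic" run repeatable
samples = [vocab[sample_next_token(logits, rng=rng)] for _ in range(8)]
print(samples)
```

A higher temperature flattens the distribution so rarer tokens appear more often; a very low temperature approaches greedy decoding. The Hugging Face pipeline exposes the same knobs through the do_sample, temperature, and top_k arguments, and transformers.set_seed() pins the randomness if you want reproducible outputs.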

generator("I look out the window at the forest. It is starting to blow. I can see the shape of the wind on the water.", max_length=360)

Output:

[{'generated_text': 'I look out the window at the forest. It is starting to blow. I can see the shape of the wind on the water. Something is going to happen. But, for now it is too late."\n\nHe looked at her and then at herself. "I didn\'t die. It is your fault I went through what I did." He sat back down again, then took a good deep breath and said, "If I could say I lost another leg, or it might not have occurred to them that I would have died when I was still a child, what is that, you being sad?"\n\nShe could tell he was struggling with her. Not a single hint of sorrow, either.\n\n"Well, you have got to give up," he said, "so it won\'t happen to you again. We will try to do this once or twice before we\'re all gone. I would like all people to know that, don\'t get in your way."\n\nThe two of them laughed and laughed again. At the sound of his voice. She had never been through something like it before. "Do your best, you have to go on."\n\nHe took out his phone and started at her feet and looked at her. "Let\'s be real, you are still a human being," he said, "and not just a human being." He went to the phone again and took it again. At only moments, the phone stopped ringing.\n\nShe heard a small gasp. "What did you do?"\n\nHe tried to cover his mouth but couldn\'t. "Yeah, I didn\'t get there." He got an angry look in his eyes. "I didn\'t get there until two o\'clock."\n\nShe looked away'}]

Whoa! This is poetic. I am sure Per Petterson would be proud of it. Let’s see how GPT-2 tries to write like Tolstoy:

generator("But when Volkónski said, with a frown, that it was in the Emperor’s name that he asked his opinion, Pfuel rose and, suddenly growing animated, began to speak: ", max_length=300)

Output:

[{'generated_text': 'But when Volkónski said, with a frown, that it was in the Emperor’s name that he asked his opinion, Pfuel rose and, suddenly growing animated, began to speak: ー ーー〜「Yes! You\'re just saying so with him. Now, let me ask the other two questions, about what is the Emperor\'s view?」\n\n"That is true, sir, right?"\n\nPfuel understood that all of Hircine\'s reasons and reasons were good reasons. He would ask about matters of power. But he wished he would give more clarity, then.\n\nSo on that subject to say the least, but the Emperor did not ask about what Hircine was thinking.\n\n◇ ◇\n\n「The second question is, do you wish to see the Empire return to the state of absolute independence, or do you wish to see the Emperor reintegrate with the people of Hircine?」\n\n…….\n\n◇ ◇\n\n「We will have to move on before we can talk again. 」\n\nPfuel looked at Hircine and said that.\n\n◇ ◇\n\n「Now then―――…that was a well thought out question, please follow and ask when?\n\nNo need to ask. The only thing to ask is that it cannot be said that he has made a decision on what to do to achieve the'}]

It’s quite interesting, and a bit funny, how it mixes in special characters (especially \n, along with the Japanese brackets).

Text-to-text generators

As mentioned in the rather long prologue, we can perform a number of NLP tasks beyond simple text generation using the text-to-text generation pipeline.

So, before starting the practical part, I would like to emphasize that text generation is not just about “Hey ChatGPT, write me an email for a potential supervisor.” Instead, it is something wonderful with many applications. It's as much a test of our skill as it is of the language model. For example:

  • Replying to queries - nothing special here, as Q/A systems also do this, as we saw earlier.
  • Style transfer - asking ChatGPT to write haikus or mimic Shakespeare's style for fun is pretty common, but it can be useful in real scenarios, too, like making the tone of a report less casual or explaining a complex concept to high school students. This is an area where ChatGPT still stands out compared to other models.
  • Checking the grammar - grammar-checking software also uses text generators to suggest grammatical changes, rephrasing, etc. Autocorrect and suggestion features in various software rely on these generators as well.
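
To make the grammar-checking point concrete, here is a sketch using the same text2text-generation pipeline with a community grammar-correction model. The model choice (vennify/t5-base-grammar-correction, a T5 fine-tuned for this task) and its "grammar:" input prefix are assumptions for illustration, not something this article's default setup requires:

```python
from transformers import pipeline

# Assumed community model: a T5 fine-tuned for grammar correction.
# It expects inputs prefixed with "grammar: ".
grammarFixer = pipeline("text2text-generation",
                        model="vennify/t5-base-grammar-correction")
result = grammarFixer("grammar: He go to school every days.")
print(result[0]["generated_text"])
```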

In 2020, a multi-purpose NLP model was introduced: the Text-to-Text Transfer Transformer, usually abbreviated as T5. It is based on the simple principle of embedding the prompt within the text itself: every task is cast as text-to-text, so our input is a text, and the output is also a text.
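The "prompt embedded in the text" idea can be illustrated without loading any model at all: each task is just a prefix glued onto the input string, and T5 learned to treat that prefix as the instruction. The helper below is hypothetical, purely for illustration; the prefixes are the kind used in the T5 paper:

```python
def t5_input(task_prefix, text):
    """Hypothetical helper: a T5 'prompt' is just a task prefix plus the text."""
    return f"{task_prefix}: {text}"

# The same model, the same interface - only the prefix changes the task
print(t5_input("translate English to German", "My name is Ali."))
print(t5_input("summarize", "Seismologists are finally gaining traction..."))
```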

Let me explain it with an example:

text2textGenerator = pipeline("text2text-generation")

Unsurprisingly, the default Hugging Face model for text-to-text generation is T5-base. As a matter of fact, it’s immensely popular (more than 2.4 million downloads in August 2023).

Question answering

Let's use this ubiquitous generator to do some question-answering.

text2textGenerator("context: Seismologists are finally gaining traction on one of their most tantalizing but challenging goals: using machine learning to improve earthquake forecasts. Three new papers describe deep-learning models that perform better than a conventional state-of-the-art model for forecasting earthquakes.  question: What's exciting about this research?")

# Output:
# [{'generated_text': 'using machine learning to improve earthquake forecasts'}]

Translation

We saw translation in the previous article. We can also do machine translation with these text-to-text generators by prompting them with a task prefix.

text2textGenerator("translate into German: My name is Ali and I am your new neighbour.")

# Output:
# [{'generated_text': 'Mein Name ist Ali und ich bin Ihr neuer Nachbar.'}]
text2textGenerator("translate into French: My name is Ali and I am your new neighbour.")

# Output:
# [{'generated_text': "Je m'appelle Ali et je suis votre nouveau voisin."}]

You might notice it works fine for German and French, but no matter which other language you ask for, it returns the same German output.

text2textGenerator("translate into Arabic: My name is Ali and I am your new neighbour.")

# Output:
# [{'generated_text': 'Mein Name ist Ali und ich bin Ihr neuer Nachbar.'}]
💡
I checked the default model's (T5) research paper and found it was fine-tuned on German, French, and Romanian only. So, if you want it to perform other translation tasks, you will have to fine-tune it accordingly.

By the way, I confirmed it by prompting it with a Romanian translation, and it worked fine.

text2textGenerator("translate into Romanian: My name is Ali and I am your new neighbour.")

# Output:
# [{'generated_text': 'Mă numesc Ali şi sunt noul dumneavoastră vecin.'}]
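
If you need a language T5 was not fine-tuned on, a practical alternative to fine-tuning is a dedicated translation model, such as one of the Helsinki-NLP OPUS-MT checkpoints from our machine translation article. A sketch, assuming the English-to-Arabic checkpoint (Helsinki-NLP/opus-mt-en-ar) can be downloaded:

```python
from transformers import pipeline

# Assumed checkpoint: OPUS-MT English->Arabic from the Helsinki-NLP family
ar_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")
result = ar_translator("My name is Ali and I am your new neighbour.")
print(result[0]["translation_text"])
```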

Beyond T5

The couple of tasks (or sub-tasks) we saw above are ones we also covered in a previous post on text and token classification in NLP.

There are a number of other interesting tasks as well, and other models to match. For example, there is a model for converting formal text into informal text (English only).

informalT5Generator = pipeline("text2text-generation", model="s-nlp/t5-informal")
informalT5Generator("It is my great pleasure to announce that I am coming to your city next week.")

# Output:
# [{'generated_text': "I'm coming to your city next week."}]

Why all the fuss? I like this simple/informal approach. Let me try this once more:

informalT5Generator("We regret to inform you that admissions have concluded.")

# Output:
# [{'generated_text': 'Sorry to tell you that admissions have been done.'}]

Actually, this style transfer task still has a lot of room for contributions. ChatGPT handles it to some extent, but it's more of a “jack of all trades”.

Text generation is a pretty vast field and can have a number of applications, such as conversational systems. But that will be the topic of our next article in this series.

Talha Irfan
I love reading Russian Classics, History, Cricket, Nature, Philosophy, and Science (especially Physics)— a lifelong learner. My goal is to learn and facilitate the learners.
