Multimodal AI: what can it do, and why is it a game-changer?

Current uses of multimodal AI and implications for the future.


Hi, we're Apify, a full-stack web scraping and browser automation platform. This article about multimodal AI was inspired by our work on getting better data for LLMs and other machine learning models. Check us out.

What is multimodal AI?

I’m not one for hype, but honestly, I don’t think people have yet realized the implications of multimodal AI.

For those not familiar with it, multimodal AI is an AI system that can understand multiple data types (image, audio, text, and numerical data) and use them together to interpret both content and context.

If you haven’t heard, OpenAI’s ChatGPT now includes GPT-4V(ision), which can analyze graphics and photos. That means you can now get a language model to combine images with text (and audio) prompts. (More about audio later).

FYI, to use GPT-4V, you need a ChatGPT Plus subscription (currently priced at $20 per month), and even then it’s still being rolled out in phases.
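To make the idea concrete, here’s a minimal sketch of how a combined text-and-image prompt is composed for a vision-capable model. The message structure follows OpenAI’s Chat Completions format; actually sending the request requires the `openai` package and an API key, so this sketch only builds the payload (the final call is shown as a comment):

```python
# Sketch: composing a text + image prompt for a vision-capable model.
# Only the message payload is built here; sending it requires an API key.

def build_vision_messages(question: str, image_url: str) -> list[dict]:
    """Combine a text question with an image reference in one user message."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

messages = build_vision_messages(
    "What does this chart show?",
    "https://example.com/chart.png",  # hypothetical image URL
)
# The payload would then go to the chat completions endpoint, e.g.:
# response = client.chat.completions.create(model="gpt-4-vision-preview",
#                                           messages=messages)
```

The key point is that text and image arrive in a single user message, so the model can reason over both together.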

Bing has also introduced multimodality to its chat mode, and Google is planning to launch its own multimodal competitor, Gemini, any time now.

I won’t say, “The possibilities are endless!” because they’re not – yet. But already, GPT-4V has been used to do front-end web development, describe visual content, review product images, troubleshoot problems based on a photo, get advice on design, and decipher illegible text in historical manuscripts.

It took about six months for OpenAI to make good on its promise that GPT-4 would be multimodal, but it’s finally happening, and already we’ve seen examples of what it’s capable of.

Multimodal AI can understand and explain multimedia content

Uses of vision in multimodal AI

👨‍💻 Front-end development

A multimodal model can recreate a website dashboard from screenshots or sketches. While its execution isn’t yet flawless, it shows promising signs of reducing the time needed to go from design to prototype.

Another front-end development possibility that multimodality has opened up is the ability to improve code. By using the result of one run as the prompt for the next run, the model can keep refining the code independently.
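That refine-by-feedback idea can be sketched as a simple loop in which each run’s output becomes the next prompt. Here `generate` is a hypothetical stand-in for a model call (it takes a prompt and returns improved code); a trivial stub makes the loop runnable:

```python
# Sketch of the self-refinement loop: feed each run's output back in
# as the next prompt. `generate` stands in for a call to the model.
from typing import Callable

def refine_code(code: str, generate: Callable[[str], str], rounds: int = 3) -> str:
    """Repeatedly ask the model to improve its own previous output."""
    for _ in range(rounds):
        prompt = f"Improve this code:\n{code}"
        code = generate(prompt)  # model returns a revised version
    return code

# Stub "model": tags the last line of the prompt so each pass is visible.
result = refine_code(
    "print('hi')",
    generate=lambda prompt: prompt.splitlines()[-1] + "  # refined",
)
```

In a real pipeline, `generate` would also feed back error messages or rendered screenshots, which is what makes the multimodal version of this loop interesting.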

📸 Explaining visual and multimedia content

ChatGPT can now describe images in detail. It can provide captions, explain the humor in a meme or editorial cartoon, break down complex infographics into simple text explanations, and describe the difference between one product photo and another.

This has opened up a range of possibilities: creating text for a comic or graphic novel, getting advice on interior design, identifying poisonous food, and comparing and matching products.

📜 Optical character recognition for ancient manuscripts

I have a background in theology and late antiquity, so this one excites me more than it does most people. But it’s a big deal for anyone deeply involved in the humanities.

While OCR is only a small part of GPT-4 Vision, it has immense significance for historians and scholars of ancient languages and literature. GPT-4V is capable of deciphering, translating, and analyzing historical manuscripts. Turns out that all those years I spent learning Ancient Greek and Latin were a waste of time.

Curious about AI's ability to do web scraping?

Check out AI web scraping tools: do they really work?

Uses of voice recognition in multimodal AI

🔎 Search engines

Bing Chat has already made searching for online content easier. You no longer need to type your search query; you can simply speak it, and Bing will respond just as it does to typed input.

It will be interesting to see how this impacts keyword research in the near future. People are unlikely to be as brief in speaking as they are when they type search queries. This might make it easier to understand the intent behind keyword searches.

🤖 Intelligent assistants

The same ease applies to the other things we use ChatGPT for, but with the spoken word instead of the written one. Instead of typing prompts into an LLM, you can simply state your intent, and the model will respond.

How is this different from Siri or Alexa? With multimodal AI or a speech-to-text system like OpenAI’s Whisper, you don’t need to dictate every word; you only need to say what action you want performed, and the model will handle the rest for you.
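The pipeline behind such an assistant is short: transcribe the audio, then treat the transcript as a prompt. In the sketch below, `transcribe` and `respond` are hypothetical injected callables (in practice, the first might be Whisper and the second a chat model); stubs stand in for both so the sketch runs without credentials:

```python
# Sketch of a voice-assistant pipeline: speech in, model answer out.
# `transcribe` and `respond` are stand-ins for real services.
from typing import Callable

def voice_assistant(audio: bytes,
                    transcribe: Callable[[bytes], str],
                    respond: Callable[[str], str]) -> str:
    """Transcribe spoken audio, then pass the text to a model as a prompt."""
    intent = transcribe(audio)   # e.g. Whisper speech-to-text
    return respond(intent)       # e.g. a chat model acting on the intent

# Stubs make the pipeline runnable without any API:
answer = voice_assistant(
    b"<audio bytes>",
    transcribe=lambda audio: "find my latest invoice",
    respond=lambda intent: f"Searching files for: {intent}",
)
```

The point of the design is that the assistant receives an intent, not a dictation: whatever you say becomes an instruction for the model to act on.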

The implications of multimodal AI

Now imagine combining these multimodal AI capabilities with a tool like LangChain, which enables you to link AI models to outside sources, such as your Google Docs or Notion.

Soon, AI models will be able to function as intelligent assistants like the computer from Star Trek’s Enterprise. You'll be able to ask an LLM to retrieve a document or image from your files just by stating your request, and it will fetch it for you.

In other words, multimodal AI is on its way to taking automation to another level. We may never have to type in a search box or look through files ever again. And that’s just the tip of the iceberg.

As machine learning continues to develop and AI models become more performant and capable of a broader range of tasks, it’s not implausible that we’ll all have personal AI assistants built into our devices in the near future, not just the user interface of ChatGPT.


Theo Vasilis
Writer, Python dabbler, and crafter of web scraping tutorials. Loves to inform, inspire, and illuminate. Interested in human and machine learning alike.
