We're Apify, a full-stack web scraping and browser automation platform. It serves as a bridge between the web's vast data resources and sophisticated AI tools like Hugging Face.
With this article on computer vision and image classification, we bring our Hugging Face series to an end.
Transformers for vision tasks
So far, we have seen Hugging Face applied to NLP tasks only. In 2021/22, when I was new to Hugging Face and transformers, I had the misconception that they were limited to NLP. So, naturally, it was surprising to find computer vision tasks on Hugging Face as well. Luckily, we can use the Hugging Face API for vision tasks too, and this article is dedicated to covering them.
It would be unfair to jump straight into the subject without saying a few words about the transformer model behind all these vision tasks. Personally, it took me some time to realize that transformers could be used for computer vision at all.
Please feel free to skip to the image classification section below if you're a beginner or want to avoid the theory.
In one of the iconic papers of the current decade (or perhaps the whole history of AI), the Google Research team presented Vision Transformers in 2021. Vision transformers (often abbreviated as ViTs) are a revelation: without using any convolution mechanism, they're able to match (and often surpass) the performance of traditional CNNs.
To put it another way: a CNN's convolutions slide over the image, so spatial locality is baked into the architecture itself. Vision transformers instead split the image into patches (16 × 16 pixels in the original model) and feed them to the transformer as tokens, relying on learned positional embeddings rather than convolution to recover the spatial structure. Hence, they defy both conventional and convolutional computer vision.
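To make the patch arithmetic concrete, here's a quick sketch (not from the original paper's code) of how many tokens the original ViT-Base configuration produces per image:

```python
# Patch count for the original ViT-Base configuration
image_size = 224   # input images are 224 x 224 pixels
patch_size = 16    # each patch is 16 x 16 pixels

patches_per_side = image_size // patch_size   # 14 patches along each side
num_patches = patches_per_side ** 2           # total tokens fed to the transformer

print(num_patches)  # 196
```

So each image becomes a sequence of 196 patch tokens, which the transformer processes much like a 196-word sentence.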
For the sake of brevity, I'll wrap up the discussion of vision transformers here and move on to the meat of the topic: image classification.
Image classification is a pretty straightforward task.
We have an image X and an ML model, f(X). After processing the image, the model (already trained on some dataset(s)) returns y, the class of the image. It has a number of useful applications: from visual search on search engines to our smartphones tagging gallery items, image classification is everywhere.
So enough preamble. Let’s try it out.
Note: Since this is a prediction and can be off, technically it would be y′ (denoting the prediction) instead of y. But I'll use y throughout to keep things simple.
!pip install transformers
from transformers import pipeline

# Create an image-classification pipeline with the default model
imageClassifier = pipeline("image-classification")
By default, it uses the original ViT model (the one used in the paper presenting ViTs) by Google Research: vit-base-patch16-224, trained on images of size 224 × 224 split into patches of size 16 × 16.
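If you'd rather not rely on the pipeline default (which can change between library versions), you can pin the checkpoint explicitly. A minimal sketch, classifying a synthetic image just to show the output format, which is a list of label/score dicts:

```python
from transformers import pipeline
from PIL import Image

# Pin the checkpoint instead of relying on the pipeline default
imageClassifier = pipeline(
    "image-classification",
    model="google/vit-base-patch16-224",
)

# The pipeline accepts a URL, a local file path, or a PIL.Image.
# Here we pass a synthetic image just to demonstrate the output shape.
image = Image.new("RGB", (224, 224), color="steelblue")
results = imageClassifier(image)

for result in results:  # each result is a {'label': ..., 'score': ...} dict
    print(f"{result['label']}: {result['score']:.3f}")
```

The scores are softmax probabilities over the model's ImageNet classes, sorted from most to least likely.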
To give you an idea of how popular this model is, I'd like to share a little stat: in September 2023 alone, this model was used by Hugging Face users more than 1M times.
Since we have to fetch the images, we need to import the respective libraries:
from PIL import Image
from io import BytesIO
I'm assuming that the Pillow library is already installed. Now, I'll take an image from WikiArt and try image classification on it.
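The usual pattern is to download the image bytes and wrap them in BytesIO so Pillow can decode them in memory. The WikiArt URL from the article isn't reproduced here, so this sketch substitutes synthetic JPEG bytes where `requests.get(url).content` would normally go:

```python
from io import BytesIO
from PIL import Image

# In practice, the bytes come from the network, e.g.:
#   import requests
#   image_bytes = requests.get(url).content  # url: the WikiArt image (not shown here)
# We substitute synthetic JPEG bytes so this sketch runs stand-alone.
buffer = BytesIO()
Image.new("RGB", (224, 224), color="white").save(buffer, format="JPEG")
image_bytes = buffer.getvalue()

# Wrap the raw bytes so PIL can decode them without touching the disk
image = Image.open(BytesIO(image_bytes))
print(image.size)  # (224, 224)

# The resulting PIL image can be passed straight to the pipeline:
# imageClassifier(image)
```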
As we can see, it got it horribly wrong. It's understandable that this image can be tough to comprehend, but I feel that misclassifying such an iconic image (it's been used widely in computer vision too, for example in the classic style-transfer papers of the mid-2010s) is something Google Research won't be too proud of. Obviously, one or a few examples can't evaluate any model. Let's try another:
The old boy was in his 70s, so he should consider himself lucky to be judged a bit younger. My hypothesis is that his firm face (fewer wrinkles than a typical person of that age) explains it.
I've thoroughly enjoyed these models, and I'll continue to explore further, as I have heaps of tabs open in my browser. I'm sure you'll keep enjoying these models too. Ciao!