Web scraping social media for OSINT

OSINT researchers can get some data from social media platforms by using APIs. But web scraping is faster, more complete, and easier. Find out why.

There's a constant stream of new user-generated content on social media platforms. That's a real challenge for open-source intelligence (OSINT) researchers. Good OSINT requires accurate and up-to-date information at all times. All those tweets, posts, likes, and comments represent huge amounts of data that need to be gathered, cataloged, and archived.

Let's explore the main challenges in OSINT research, possible solutions, and why and how web scraping can help.

What is OSINT?

OSINT stands for open-source intelligence: the practice of collecting and analyzing publicly available data to produce research and actionable intelligence. Researchers in the OSINT industry rely heavily on data from online sources, including social media.

Depending on its scale, OSINT research typically involves some combination of:

  • Collecting data from various social media platforms (Twitter (X), Facebook, Instagram, TikTok, Telegram, and more)
  • Data labeling
  • Data visualization
  • Image and video analysis

All of these are united by one thing: data. More specifically, data extracted from social media, and lots of it.

You'd think that OSINT researchers, journalists, think tanks, and NGOs would be at the forefront of data extraction, using cutting-edge technologies to build vast databases of reports, evidence, and more every other day. But the reality is a far cry from that. Arguably, OSINT researchers face more roadblocks to accessing social media data than regular internet users.

OSINT pain points and challenges

Old datasets

Most OSINT researchers start by looking for an existing OSINT dataset: something open source, hosted on a trusted resource. However, ready-made datasets are often outdated, in the wrong format, or in need of restructuring. The alternative is to build their own datasets, but that's no easy task for OSINT researchers, and it's where the first real roadblock appears.

A lot of OSINT work is manual

You'd be surprised how much of OSINT research is manual. Due to security concerns, OSINT tools are often little more than offline Word and Excel. So, when it comes down to it, a lot of investigative journalism is, at its core, copying and pasting: tweets from specific accounts on Twitter, posts from private Facebook groups, comments on TikTok videos, and views on Instagram Reels all have to be meticulously logged.

Lack of know-how

Even if researchers want to automate data collection, they often can't. Web automation requires knowledge of social media APIs: how to find them, how to make requests, and how to keep using them reliably over time. Due to this lack of technical expertise, NGOs often can't take the steps needed to automate data collection from the web.
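To give a sense of the know-how involved, here's a minimal sketch of continuous, paginated collection from a hypothetical social media API. The endpoint, token, and response fields are illustrative placeholders, not any real platform's API:

```python
import time

import requests

# Hypothetical endpoint and token: real platforms differ, but the
# pattern (authentication, pagination, rate limiting) is the same.
API_URL = "https://api.social-example.com/v1/posts"
TOKEN = "YOUR_ACCESS_TOKEN"

def fetch_all_posts(account_id: str) -> list[dict]:
    """Page through every post of an account via a cursor-based API."""
    posts, cursor = [], None
    while True:
        params = {"account_id": account_id, "limit": 100}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(
            API_URL,
            params=params,
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        posts.extend(data["items"])
        cursor = data.get("next_cursor")
        if not cursor:  # no more pages
            return posts
        time.sleep(1)  # stay under the rate limit
```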

False hope in official APIs

But let's say an NGO has both the budget and the know-how. Researchers should be able to request the necessary information from social media platforms, right? While most social media platforms have closed their APIs to the public, they often still provide non-commercial access (or the proverbial foot in the door) for academics, NGOs, OSINT researchers, and the like. All they need to do is request access and fill out a form.

Well, not so fast.

As Mark Scott, Politico's chief technology correspondent, said in his 2024 survey on access to social media data:

“Getting independent researchers access to social media data is the broccoli of digital regulation. Everyone says it’s important. But when pushed, few people other than researchers understand why.”

So what's the issue?

Official social media APIs are not enough

What you see is not what you get

First of all, the official APIs may not provide information that's representative of what you can actually see on social media. Yes, you read that right: the owners of the social media platforms take it upon themselves to decide which data to show and which portions to redact. This is often done under the guise of protecting the privacy of their users, but it's more often than not about guarding corporate insights.

This principle applies even in times of crisis, as one doctor noted during the COVID-19 pandemic:

“granular data access … ran contrary to [social media] platforms’ privacy policies, leaving a divide between what information public health authorities needed to do their work and what was legally available to them”.

No data map available

Even authorities with the power to demand access to data from social media companies can't be completely sure of its completeness. An API may hand over only a portion of the data, without the other parts needed to make full sense of it. And without an index of what a platform holds, understanding what data is even out there to extract can be challenging and disorienting.

Selective data to avoid accountability

Among the more peculiar reasons why data received from the official APIs might be incomplete is accountability avoidance. Platforms sometimes want to protect themselves from research conclusions that imply they sway people's opinions. So if your OSINT research goes along the lines of “Disinformation in the age of social media”, good luck getting any of that data through the official API of the platform in question.

Not all OSINT hubs are created equal

Independent journalists without direct connections to social media platforms find it harder to get access to data from those platforms. Those with direct connections, especially researchers at prestigious American universities, get preferential access and can therefore produce better research.

With all this in mind, while there are serious calls for a one-stop shop for social media data (a unified API infrastructure, spanning various platforms, that vetted organizations can access), we're still far from that reality. In this tug-of-war for access to web data, web scraping wins in our book.

"APIs offer easily-accessible tools for data collection, but scraping provides a higher degree of independence in how data is collected”

Why web scraping is the answer to better OSINT research

Not all researchers are in favor of web scraping, mostly because it “is associated with potential privacy concerns that have led some social media companies to threaten legal action if outside researchers pursue such tactics”.

Just to reiterate: scraping publicly available web data is legal.

In the real world, more often than not, NGOs and OSINT researchers use some form of web scraping or automated data extraction in their work. Otherwise, they risk complicating their research tenfold. Given how limiting and confusing official APIs often are, it's no wonder that web scraping remains the most sensible solution for ad-hoc and long-term OSINT research. Here's why:

Completeness of data

With web scraping tools, what you see is what you get. There are no limits on the number of calls or the amount of data you can extract. You don't need to qualify for access, and you have full independence in how you handle the extracted data.

Ease of use

The best data extraction tools can be tailored to the website they're scraping and, ideally for OSINT researchers, easy enough to get the hang of. Apify's ready-made social media scrapers, which many OSINT academics use in their projects, fit the bill.
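For instance, running one of those ready-made scrapers from Python takes only a few lines with the apify-client package. A minimal sketch, assuming you have an Apify API token; the Actor ID and input fields are based on Apify's Instagram Scraper, but each Actor documents its own input schema:

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start a ready-made scraper (an Apify Actor) and wait for it to finish.
# The input fields are illustrative; check the Actor's input schema.
run = client.actor("apify/instagram-scraper").call(
    run_input={
        "directUrls": ["https://www.instagram.com/some_public_account/"],
        "resultsLimit": 100,
    }
)

# Scraped items land in a dataset attached to the run.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```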

Reliability

It's not enough to simply build a scraper these days. Extracting data from social media platforms is notoriously difficult: these sites deploy strong anti-scraping measures that can block you after just a couple of careless runs from the same IP address. That's why scrapers need to be supported by a whole array of elements that ensure reliable data extraction: proxies, APIs, scheduling, and storage.
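As a simple illustration of one of those elements, here's a sketch of a scraper that rotates through a pool of residential proxies and retries on failure. It assumes you already have proxy URLs from a provider:

```python
import itertools

import requests

# Assumed: proxy URLs obtained from a residential proxy provider.
PROXIES = itertools.cycle([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
])

def fetch(url: str, retries: int = 3) -> str:
    """Fetch a page, switching to a fresh proxy IP on every attempt."""
    for _ in range(retries):
        proxy = next(PROXIES)
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            continue  # blocked or timed out, so try the next proxy
    raise RuntimeError(f"All {retries} attempts failed for {url}")
```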

These elements improve the success rates of scraping projects, as shown by the example of Mnemonic, a Berlin-based NGO specializing in archiving evidence of war crimes from social media. By their account, using residential IP proxies significantly improved their success rates to 65% almost immediately. Mnemonic continues to use our proxies and social media scrapers for various archival projects, including those related to the Ukrainian initiative.

Autonomy and flexibility

Researchers don't have to rely solely on our scrapers. If you already have your favorite OSINT scraping tools, the Apify platform can host and support them with all its useful extra features. You can build your own web scrapers from scratch or migrate your in-house scraping tools to the platform. And with both internal and Apify tools at your disposal, you can switch between them if one solution runs into issues and maintain continuity in your research efforts.
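As a rough sketch of what migrating an in-house tool can look like, here's a minimal Actor built with Apify's Python SDK. The scrape() function is a stand-in for your own scraping logic, and the startUrls input field is an assumed schema:

```python
import asyncio

from apify import Actor

async def scrape(url: str) -> list[dict]:
    # Stand-in for your existing in-house scraping logic.
    return [{"url": url, "status": "scraped"}]

async def main() -> None:
    async with Actor:  # handles platform init and teardown
        actor_input = await Actor.get_input() or {}
        for url in actor_input.get("startUrls", []):
            items = await scrape(url)
            await Actor.push_data(items)  # stored in the run's dataset

if __name__ == "__main__":
    asyncio.run(main())
```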

Complex solutions

Often, the core message is spread across multiple sources. Collecting and analyzing social media data from multiple sources is the goal of many OSINT research papers, work that can take months of merging and summarizing. Being able to do that in just a few minutes is certainly a superpower any journalist would want to have.

This is what we're trying to do with our newest, more complex scraping solutions. As we move toward uniting multiple scrapers under a single umbrella and enhancing them with text analysis, these tools could become an invaluable part of the OSINT research toolbox.
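In practice, combining sources can be as simple as running several scrapers and normalizing their outputs into a single timeline. A hypothetical sketch reusing the apify-client pattern from above; the Actor IDs, inputs, and field names are placeholders:

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Placeholder Actor IDs and inputs: substitute the scrapers you actually use.
SOURCES = {
    "twitter": ("example/twitter-scraper", {"searchTerms": ["#example"]}),
    "tiktok": ("example/tiktok-scraper", {"hashtags": ["example"]}),
}

merged = []
for platform, (actor_id, run_input) in SOURCES.items():
    run = client.actor(actor_id).call(run_input=run_input)
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        # Normalize each platform's schema into one shared shape.
        merged.append({
            "platform": platform,
            "text": item.get("text") or item.get("caption", ""),
            "timestamp": item.get("createdAt") or item.get("createTime"),
        })

# One cross-platform timeline, ready for analysis or archiving.
merged.sort(key=lambda row: str(row["timestamp"]))
```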

More Apify OSINT and web scraping case studies

Don't take our word for it: ask the NGOs that have worked with our tools. You'd be surprised how much web scraping has transformed their social media analysis. From combating child trafficking with the Spotlight investigation tool to helping find missing children with the Missing Children (Atfal Mafkoda) initiative, web scraping has proven to be a technology that can be used for social good. We hope web scraping and advanced data extraction technologies will continue to help OSINT journalists, academics, and NGOs tackle important societal issues and create positive change.
