How data access will define the next era of AI agents

Stop hand-feeding your agents information, and start giving them the ability to find it themselves.

Why aren't we all managing fleets of autonomous agents yet?

It's not the models. Frontier models are really, really good. The gap is information access. Agents need precise and secure access to any and all information their human counterparts would look up, and we're still building the systems to make that possible.

I work on Ref, a documentation tool for coding agents, so I think about this problem constantly. Here's why I believe data access is the fundamental bottleneck and what we can do about it.

Data access = Autonomous agents
Data access = autonomous agents

What is full autonomy for an agent?

An autonomous agent can complete tasks end-to-end. You guide the agent, view the results of its work, provide feedback, and operate in this loop until the task is complete.

The type of feedback you’re providing is important. For an autonomous agent, you should be providing feedback to refine or clarify the goal. You start the task with a general idea of what you want, and as you see the first few attempts, you understand the problem better and can express your desire more clearly. This is an expression of creativity and taste where the hands-on work is being done by the agent.

Unfortunately, we often end up providing a different kind of feedback.

Why agents need information access to be successful (examples)

In today’s agent paradigm, the feedback you provide is simply missing facts about the system the agent is operating. These are facts that exist somewhere on your computer or on the internet, and the agent should simply be able to look them up.

Let’s go through a few examples of missing context:

Agent example 1: BDR agent searching for potential leads

You dispatch a BDR agent to find possible customers to contact. A common failure mode is returning a ton of false-positive leads that a human operator needs to sift through before reaching out. For the ideal autonomous agent, you fix this by refining your prompt to communicate your ideal customer profile more clearly. However, if the agent doesn’t have the ability to properly research these leads to an adequate depth and compare them to your guidance, it will never be able to distinguish high-quality from low-quality leads.

Agent example 2: Customer support agent

The customer support agent is the first line of defense for your ticket queue. Whenever a user asks a question, the agent does its best to answer the question, and if it fails, the question is escalated to a human support specialist.

The support agent may be able to perfectly understand the product’s documentation and accurately answer many questions. However, some questions will pertain to a user’s account, and without access to that information, there is no possible way for the agent to succeed, and human intervention will be required.

Agent example 3: Coding agents

You give your coding agent a task involving a library or platform. You spend your time writing a detailed spec for the feature so that the agent has a perfect understanding. Unfortunately, when it comes time for code review, it’s a mess. The agent doesn’t understand the library and has guessed at functions and API definitions such that the entire thing needs to be redone.

These function definitions and APIs are simply facts that exist in the documentation. If the agent had been able to search and read that documentation, it would never have made those mistakes. As the developer, you lose autonomy when you need to go gather those pieces of documentation yourself.

Higher complexity tasks require even more context

The examples we’ve covered so far are fairly direct tasks with immediate instructions. Now imagine complex agents with long-standing instructions:

  • The BDR agent goes beyond lead generation to doing outreach and managing early relationships with prospects.
  • The support agent is guiding a user or team through the entire onboarding journey.
  • The engineering agent is monitoring PRs, incidents, backlog, pulling tasks and prioritizing its own work.

As we stretch out the agents' work to longer tasks with less human intervention, small errors will start to compound and be harder for the human minder to resolve. That means it’s even more imperative that we eliminate errors due to missing the right data.

What does it mean to have access to the right data?

Once we accept that autonomy requires data access, we can start to consider what that means in practice. It turns out that data needs to have a couple of properties.

1) Data must be up-to-date

Stale data can be just as detrimental as missing data. Imagine a support agent referencing a feature that’s been sunset or a coding agent trying to use the wrong version of an API or library. Stale data creates a false sense of security for you and the agent, leading you to believe that the agent has the necessary information when, in fact, it does not.

2) The data should make efficient use of the context window

The most obvious solution to not having the right data is to throw everything into the context window, but this is not ideal for several reasons. First, as we're all painfully aware at this point, inference costs money, so when you dump everything in the context window, you're paying for that unnecessary data.

Second, you open yourself up to issues with context rot. Context rot is the phenomenon where model performance degrades as context windows fill with irrelevant or tangentially related information. For example, if you dump an entire API documentation site into context when the agent only needs one specific endpoint, the model may confuse similar-looking function signatures or miss the exact method it needs because it's buried among hundreds of others. The context window will naturally fill as an agent works on complex tasks, so we should limit what we provide to the minimum necessary for success.

Apify logo
Your agents need current web access
Apify makes it happen instantly
Get web data

How can your agent get access to the data it needs?

This data access problem is precisely why tools like Ref and Apify exist. They bridge the gap between what agents need to know and what they can reliably access. But the real breakthrough is the infrastructure that makes these solutions work at scale.

The Model Context Protocol (MCP) is emerging as the standard that enables agents to access external data sources in a uniform way. Rather than every agent system implementing its own custom integrations, MCP provides a common language for agents to discover and use data sources. It's analogous to how USB-C standardized device connections: instead of every manufacturer building proprietary cables, we now have one protocol that works everywhere.

With MCP, the hard problems of scraping, indexing, and retrieval get solved once by domain experts, then packaged as "servers" that any agent can use. Consider the coding agent example from earlier: instead of hallucinating library APIs, an agent using Ref's MCP server can search documentation in real-time and retrieve only the relevant function signatures it needs. Similarly, a BDR agent with access to Apify's Actor marketplace can research leads using battle-tested scrapers for LinkedIn, company websites, or any other data source with no custom scraping code required.

The key insight is that data access shouldn't be something every agent developer reinvents. MCP lets specialists build robust solutions for specific data domains while agent developers focus on making their agents smarter at using that data. The infrastructure is maturing rapidly, and the quality bar for what counts as "good enough" data access is rising accordingly.

Closing questions

At this point, you should be convinced of the importance of providing up-to-date and accurate information to your agents, so I'll leave you with a couple of thought experiments.

  1. Where are you providing your agent with facts that it should be able to look up for itself?
  2. What becomes possible when your agent has precise, complete access to the info it needs?

The path to true autonomy is clear: stop hand-feeding your agents information, and start giving them the ability to find it themselves.

On this page

Build the scraper you want

No credit card required

Start building