Agents Are Only as Good as the Data They Can JOIN

Divya RanganathanApr 15, 20265 min read

Now that everyone is building agents on diverse data sources and agents have better reasoning tools, how do you quickly assemble data from many different sources, in the right shape, so the model can actually reason well with it? A version of this problem has always existed and DBAs have wrestled with this for a long time. It’s the join problem.

“An agent’s reasoning is only as rich as the context you can assemble for it. More sources means more opportunity and more complexity in stitching them together.”

This post is about that assembly challenge that agents need to work, specifically, joins. Not just the SQL JOIN keyword (though we’ll get there), but the broader problem of reconciling data spread across your CRM, your warehouse, your SaaS APIs, and your vector stores. We’ll walk through how teams are solving this today and where each approach runs into limits.

Why Joins Matter

Lets walk through an example: when a user asks an agent “Which enterprise accounts are at risk this quarter?”, the agent needs renewal dates from Salesforce, product usage from the warehouse, support ticket volume from Zendesk, and payment history from Stripe, then it needs to reconcile all of that on a shared key and handle missing values gracefully.

In a traditional single-database environment, this is done in SQL query. In an agent environment spanning multiple live systems every source has its own latency profile and error handling. A join across three SaaS APIs and a warehouse is essentially a small distributed systems problem. And unlike a query you run once and fine tune, agents run these queries in an ad hoc manner, at runtime, against whatever the user asks.

How Teams Are Solving It

Here are the five main patterns teams use to handle joins in agent systems. Lets go over them.

PATTERN 1 — WAREHOUSE TABLES

Generate SQL, run it in Snowflake/BigQuery/Databricks. Joins happen inside the database, results come back clean.

This is the default approach for any question that’s fundamentally analytical. If your agent can express the question as SQL and the data is already in the warehouse, let the database do the join. It will be faster, cheaper, and more accurate than anything you build in application code.

The problem comes in expectations. Pushdown assumes your data is already stored in the warehouse. The moment the answer requires a live SaaS API call, you’re out of luck. You’ll need to build pipelines to transform your data sources into the correct formats.

PATTERN 2 — PRE-JOINED VIEWS

Materialize wide tables via dbt or Fivetran. Agent queries a single denormalized source. Fast, predictable, simple.

Next, prejoined views, also known as materialized views. This has taken off in the past decade by technologies like dbt. The good news is that you get predictable schemas, relatively fast queries, and you dramatically reduce the surface area for the LLM to make SQL mistakes. But the trade-off is freshness and flexibility: you’ll end up maintaining a long-tail of increasingly specific wide tables as your agent’s query patterns expand.

This pattern is when you are interacting with your most common, high-value queries, and static list of trusted datasets. That is both it’s biggest strength and limitation

PATTERN 3 — RAG-STYLE RETRIEVAL

Fetch embeddings from multiple stores, merge in-context. Works for unstructured reasoning, breaks on aggregations.

This is the pattern most agent builders reach for first, and it’s frequently misapplied. Semantic retrieval is excellent when the question is about meaning and relevance — “find support tickets similar to this complaint” but it’s dangerous when the question requires precise aggregation or exact matching.

“Which customers had more than 5 support tickets last quarter?” is not necessarily a retrieval problem, it’s mostly a counting problem. So treating it like a retrieval problem will give you plausible-sounding answers, but when you dig a little further you’ll have incorrect answers. So the biggest takeaway is to make sure you understand that boundary before you ship.

PATTERN 4 — FEDERATED QUERY LAYER

A query engine (Trino, Presto, or custom) joins across SaaS + warehouse in real time, transparent to the agent.

This is where we think agent infrastructure is heading. Federated queries became popular in the early 2010s with Trino/Presto, but the power of scalability of data warehouses meant they quickly fell out of favor.

A federated query engine sits between your agent and your data sources, accepting queries and handling the cross-source join logic internally and pushing down filters where it can and merging results on-demand.

It interacts with the data as a source be it an API or warehouse, and queries it like any other table. But there are some serious engineering challenges: query planning, predicate pushdown, caching, and identity resolution across systems all have to be solved.

For agents answering cross-system questions in production, this is showing much more promise.

PATTERN 5 — APPLICATION-LAYER JOINS

Agent fetches data step-by-step, merges in Python. Simple to write, but latency compounds, schemas drift, and it breaks at scale.

Application-layer joins are surprisingly popular, and we see teams using this approach for quick prototypes, but then find themselves locked into an architecture that does not scale. Fetching data from three APIs sequentially in Python, then merging it in a dictionary comprehension, will work fine in a demo and fall apart in production for three reasons:

Latency compounds. Three 200ms API calls in series is 600ms before the model even starts reasoning.
Schema drift breaks silently. When the CRM adds a new account ID format, your join condition returns empty and no one notices.
It doesn’t scale. What works for 50 records in a test doesn’t work for 50,000 records in a real query.

Use it to prototype. Replace it before you ship to production.

As you build or scale your agent infrastructure, it’s worth auditing where your joins actually live. How many cross-source questions can your agent answer reliably today? Where does it silently degrade — returning partial results, slow responses, or stale data because the assembly layer isn’t keeping up with the reasoning layer?

Originally published on Medium.