The Boundary is the Product

When operating agents, it’s not immediately clear what should stay mechanical and what benefits from actual non-deterministic reasoning. The interface between the two (what crosses, in what shape, with what guarantees) is the actual engineering work, not just a detail of the architecture.

This matters for fairly obvious reasons: compliance, auditability, reproducibility of outcomes, benchmarking, and deciding how much human-in-the-loop (HITL) review can be afforded. Ultimately, if HITL escalations turn into a coin flip, the agent and its underlying infrastructure may not earn their keep. The hype pendulum has swung so hard that we’re throwing cognition left and right.

An agent that both retrieves and augments is a bit of a slot machine: you’re trusting it to figure out which tools to invoke to get the right data, assemble a context, and give you something useful based on its interpretation of whatever reaches that context. This is not an issue for low-stakes decisions such as a product recommendation on Zalando. It becomes an irritation if the agent touches customer-facing interactions, and a source of avoidable cost if the agent deals with money. My point is simply that the only thing that varies is the cost of an undesirable outcome.

We’ve been hearing a lot in the past few years about capability overhang: the idea that we should build companies that assume not what AI can do today but what it could conceivably do in a few years. Otherwise, you risk obsolescence before having a chance to grow the business. It’s a valid stance, but the technical side effects have a bill coming due. I believe the trajectory is real, but the timing and destination are debatable.

Whether it stems from this belief or from optimism about regulatory paralysis, the assumption that seems to be spreading is that agents will allow a “large-enough” context to “just capture” the organizational data and “reason over it reliably”. That expectation opens a can of worms if you’re doing anything more than issuing refunds for a sauce that was missing from your Uber Eats order.

Let me make the case for regulatory overhang. As regulation catches up and learns to be more precise about what it expects from companies running agentic systems, the technical constraints those companies are subject to will evolve faster and become more specific. This will be key in regulated environments, where even “it works 99% of the time” will be a real blocker for impactful systems. It could even be that the 1% it gets wrong are the most dangerous cases. I don’t actually have to prove this; the possibility alone is enough to shut down the use case in a regulated environment.

We don’t really have interpretability. We are researching it heavily, but until we make meaningful progress in model meta-cognition, we can’t open the black box or separate retrieval errors from reasoning errors. Most aggravating of all, we can’t reliably define boundaries for the space of decisions the system has been validated to handle. Even a human-in-the-loop escalation is something the agent must be able to reason its way to. For now, I believe this is an engineering problem and a science problem in equal parts.

Back to the refund. Yes, again.

A refund over $25,000 looks policy-compliant in the finance system, but approval actually depends on contract carveouts in CRM, an active fraud review in Zendesk, a regional exception the ops lead granted in Slack, and a temporary controller delegation recorded in email.

The Neural Sandwich

The pipeline is non-deterministic/neural at the edges, and deterministic/mechanical in the middle.

Let’s not worry about system boundaries. If there’s anything that’s more chaotic than the organizational structure, it’s the software landscape in an organization.

We can’t make everything mechanical. That’s a check I can’t cash. But we can design the system so that the steps are separated into a funnel that becomes increasingly precise, and where we can deploy specialized reasoning models that can be benchmarked in isolation. We can also make sure that each non-deterministic step has a mechanical substrate that makes it diagnosable. The point is not to eliminate reasoning, but to force it to operate over a substrate that can be inspected when it fails.

First, the agent that needs to make the call translates a free-form ask into a typed request. This is schema-driven, irreducible LLM work, but it’s narrow, auditable, benchmarkable, and can be isolated as a failure mode.
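To make "typed request" concrete, here is a minimal sketch of the mechanical side of that first step. Everything in it is my own illustration: the `RefundRequest` fields, the `Action` enum, and `parse_request` are hypothetical names, and the LLM call that produces the raw dictionary is deliberately left out. The only point is that whatever the model emits either fits the schema or fails loudly.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from enum import Enum


class Action(Enum):
    # Hypothetical: the only action this sketch knows about.
    APPROVE_REFUND = "approve_refund"


@dataclass(frozen=True)
class RefundRequest:
    """Typed request consumed by the mechanical middle of the pipeline."""
    action: Action
    customer_id: str
    amount: Decimal
    currency: str


def parse_request(raw: dict) -> RefundRequest:
    """Validate the model's structured output; raise instead of guessing."""
    required = {"action", "customer_id", "amount", "currency"}
    missing = required - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    try:
        amount = Decimal(str(raw["amount"]))
    except InvalidOperation:
        raise ValueError("amount is not a number") from None
    if amount <= 0:
        raise ValueError("amount must be positive")
    return RefundRequest(
        action=Action(raw["action"]),  # ValueError on an unknown action
        customer_id=str(raw["customer_id"]),
        amount=amount,
        currency=str(raw["currency"]).upper(),
    )
```

Because the parser is deterministic, a failure here can be attributed to the translation step alone, which is what makes that step benchmarkable in isolation.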

Certain inputs can be rejected outright. In our example, take the Slack message “go ahead on the thing”. Most LLMs would infer that “the thing” is the refund. The natural inclination is to put a confidence threshold on that inference. I think that’s a trap. Thresholds are generally fine for low-stakes work. For high-stakes workflows, no threshold is good enough. The real question is whether the input format (a free-form Slack message) is the right one. The correct architectural response is to simply not accept it.
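A sketch of what "not accepting it" could look like, under my own assumptions: the `Channel` values and the allow-list are hypothetical, and a real system would have its own taxonomy. What matters is that admission is decided from the channel alone, so there is no confidence score to tune and the message body is never interpreted.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    APPROVAL_FORM = "approval_form"  # structured, typed fields
    SLACK_MESSAGE = "slack_message"  # free-form text
    EMAIL = "email"                  # free-form text


# Hypothetical policy: only structured channels may carry high-stakes approvals.
HIGH_STAKES_CHANNELS = frozenset({Channel.APPROVAL_FORM})


@dataclass(frozen=True)
class Inbound:
    channel: Channel
    body: str


def admit(inbound: Inbound, high_stakes: bool) -> bool:
    """Decide admission from the channel alone; the body is never read."""
    if high_stakes:
        return inbound.channel in HIGH_STAKES_CHANNELS
    return True


msg = Inbound(Channel.SLACK_MESSAGE, "go ahead on the thing")
print(admit(msg, high_stakes=True))   # False
print(admit(msg, high_stakes=False))  # True
```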

Next, the system assembles the context for the decision deterministically. Roughly, this is graph traversal plus invariants: entity resolution, temporal lookups, and authority chains. It is all mechanical work, but with serious caveats that a verification step has to address.
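As an illustration of one such traversal, here is a minimal sketch of an authority-chain check with a temporal lookup, in the spirit of the temporary controller delegation from the refund example. The `Delegation` record and `has_authority` function are hypothetical; I am assuming delegations are stored as dated edges between people.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Delegation:
    grantor: str
    grantee: str
    valid_from: date
    valid_to: date  # inclusive


def has_authority(person: str, root: str,
                  delegations: list[Delegation], on: date) -> bool:
    """Follow delegations active on `on`, starting at `root`.

    Returns True if `person` is `root` or reachable through active edges.
    """
    seen = {root}
    frontier = [root]
    while frontier:
        current = frontier.pop()
        if current == person:
            return True
        for d in delegations:
            active = d.valid_from <= on <= d.valid_to
            if active and d.grantor == current and d.grantee not in seen:
                seen.add(d.grantee)
                frontier.append(d.grantee)
    return False


chain = [Delegation("controller", "deputy", date(2024, 3, 1), date(2024, 3, 15))]
print(has_authority("deputy", "controller", chain, date(2024, 3, 10)))  # True
print(has_authority("deputy", "controller", chain, date(2024, 4, 1)))   # False
```

The same question asked on two different dates gives two different answers, and both are reproducible, which is the property the non-deterministic steps lack.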

The context graph must carry provenance about itself and have its own notion of freshness (e.g. when it last synced), and it must be able to determine whether its information is incomplete. I don’t want to paper over this, but the how is worthy of its own post. It’s a feature rather than a curiosity: “who watches the watchmen?”
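Without pretending to cover the how, a toy sketch of the shape of the check may help. It assumes each fact carries its source and last sync time; the `Fact` record, the key names, and the one-day freshness budget are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    source: str          # provenance: the system the fact was read from
    synced_at: datetime  # when that source was last synced


def check_context(facts: list[Fact], required_keys: set[str],
                  max_age: timedelta, now: datetime) -> list[str]:
    """Return a list of problems; an empty list means the context is usable."""
    problems = []
    present = {f.key for f in facts}
    for key in sorted(required_keys - present):
        problems.append(f"missing: {key}")
    for f in facts:
        if now - f.synced_at > max_age:
            problems.append(f"stale: {f.key} from {f.source}")
    return problems


now = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)
facts = [
    Fact("fraud_review", "open", "zendesk", now - timedelta(hours=1)),
    Fact("contract_carveout", "none", "crm", now - timedelta(days=3)),
]
required = {"fraud_review", "contract_carveout", "delegation"}
print(check_context(facts, required, timedelta(days=1), now))
# ['missing: delegation', 'stale: contract_carveout from crm']
```

A non-empty result is a reason to stop before any reasoning happens, and it names the source at fault.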

Finally, the assembled context is evaluated by the AI. This part of the system carries a non-trivial amount of risk, since there’s no obvious upper bound on the complexity of the context or on its legibility. In the absence of a formal language that captures what a well-formed decision context looks like, one that a machine can act on, this is where a significant amount of blind trust sits as of now. A careful reader may think of Open Policy Agent, and they’d be right, albeit with broader semantics.

As always, everything sits on a gradient. Such a system may be overkill compared with a pragmatic RAG-based solution for low-stakes use cases. By scratching the surface of what an architecture could look like, I’m trying to make a deeper meta-argument: there is no language that describes how things work. One-shot examples, skills.md, BPMN-like artifacts, and if-else-disguised-as-prompts all point to the same symptom: the reality of human activity is barely legible to a machine.

This is, in part, growing pains, because we are facing this question for the first time, and, in part, a hint that we’ll soon have to define knowledge work differently.