Issue 01
Research
TL;DR
- AI research agents don’t just search faster. They decompose questions before asking them, iterate based on what they find, and stop only when novelty is exhausted — not when they’re tired or satisfied.
- The query an agent issues isn’t your query. It emerges from the agent’s own reasoning state, not from your original words.
- Structural advantages over human research: no sunk cost, no prior hypothesis to protect, working memory without a biological ceiling, parallel hypothesis pursuit without coordination cost.
- Structural limits: hallucination rates get worse as reasoning capability improves, and the evaluation methodology trained the overconfidence in. The pathology was built in, not accidentally acquired.
- Vannevar Bush imagined a passive memory aid. We built an autonomous reasoner. The locus of the reasoning has shifted — and that changes what it means to know something.
There is a line in OpenAI’s technical documentation for Deep Research[1] that I’ve been turning over for weeks. The system has a stopping criterion — a condition under which it decides the research is done — called “novelty exhausted.” The agent stops when it stops finding new things. When the marginal return on another search approaches zero. When it has, within the bounds of its task, run out of world.
That is a very different way to stop than the one I use.
I stop when I’m tired. When the deadline is close enough to matter. When I’ve found enough to feel confident, or enough to feel that I should feel confident, which is not the same thing. Sometimes I stop when I find something that contradicts what I was hoping to find, and the cognitive cost of integrating that information feels too high for an afternoon. I stop for reasons that have nothing to do with whether the world has more to say.
This difference — in stopping criteria — turns out to be a good place to start thinking about what is actually different about how AI agents do research. Not the speed. Not the number of sources. The shape of the epistemology.
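The stopping rule itself is simple to state. A toy sketch, assuming a windowed definition of novelty; the window size and threshold here are my illustrative inventions, not OpenAI's published criterion:

```python
# Toy sketch of a "novelty exhausted" stopping rule: stop when the last
# few search rounds surfaced almost nothing that earlier rounds hadn't.
# The window size and threshold are illustrative assumptions.

def novelty_exhausted(findings_per_round, window=3, min_new=1):
    """findings_per_round: one set of findings per search round."""
    if len(findings_per_round) < window:
        return False  # too early to judge
    seen_before = set().union(*findings_per_round[:-window])
    new_recent = set().union(*findings_per_round[-window:]) - seen_before
    return len(new_recent) < min_new  # marginal return has hit zero
```

The point is not the arithmetic but the referent: the rule tests the world's remaining yield, not the researcher's state.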
What research actually is
Research, in the human version, is not retrieval. It is a process of successive refinement under uncertainty. You begin with a question that is too broad to answer, and you narrow it, revise it, abandon it, reconstruct it, until it becomes a question that either has an answer or reveals why it doesn’t. The work is not in finding — it is in the reformulating.
The phenomenology of it is recognizable. You start with a keyword. The results don’t match what you were imagining, which tells you something about what you were imagining. You follow a citation backward. You find a paper that seems relevant but uses different vocabulary, and the vocabulary difference itself is informative. You realize halfway through your third notebook page that the question you started with was the wrong question, and the right question is actually two questions that point in opposite directions, and now you have to decide which one to pursue.
This is not inefficiency. This is the process. The false starts are doing work. The Wikipedia rabbit holes teach you the perimeter of what you don’t know. The wrong vocabulary tells you that adjacent fields have thought about this differently, which means you’re probably missing something structural.
What Google — and most information retrieval systems — do is flatten this into a single transaction. You ask; it retrieves. The reformulating is your problem. The retrieval is theirs.
What the agent does differently
The agents built in the last eighteen months do something structurally different.
Before OpenAI’s Deep Research issues its first query, it has already done something no search engine does: decomposed your question. It maps a research strategy — identifies what pieces of information it needs to investigate and in what order, which sub-questions to pursue, what the dependencies are between them. The query it issues is not a translation of your intent; it is an expression of its own reasoning state. Researchers have called this a “reasoning-induced query”[3] — a query that emerges from the model’s ongoing thought process rather than from your original words. The question the agent asks is not the question you asked.
Then it reads the results. And then — this is the structural part — the next query depends on what it found in the last one. Each cycle through the loop updates the model’s understanding of what it still needs to know. It is, in the technical description, “discovering information in the same iterative manner as a human researcher.”[2] Which is true. And also incomplete.
The human iteration I described earlier is constrained by all the things that constrain humans: energy, ego, time, the sunk cost of having spent three hours on a line of inquiry that now looks wrong. These are not incidental features of how humans do research. They shape the research. A human researcher who has invested two weeks in a hypothesis will read new evidence differently than a researcher who just started. The attachment is real, and it distorts.
The agent has no equivalent distortion. It can abandon a search direction without friction. It has no prior hypothesis to protect. The closest thing it has to a prior is the query you gave it, but that dissolves quickly as the research generates its own internal logic. This is not a complete advantage — the absence of prior commitment also means the absence of deep disciplinary intuition, the kind that tells a domain expert that this particular anomaly is worth pursuing even when the evidence is weak. But it changes the shape of the inquiry.
The architecture underneath
To understand what makes this structurally different rather than just operationally different, it helps to look at the architecture.
The dominant pattern in current research agents is called Plan-Act-Observe. The agent maintains a representation of what it knows and what it needs to know, issues a query, observes the result, updates its representation, and continues. The loop repeats until the stopping criterion is met. This isn’t new in computer science — it’s a version of the sense-plan-act loop that robotics has used for decades. What’s new is applying it to the problem of information synthesis across heterogeneous, unstructured sources at web scale.
[Figure: The research loop]
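A minimal sketch of the Plan-Act-Observe loop described above. The function names (`plan_next_query`, `search`, `update`) and the state shape are assumptions for illustration, not Deep Research's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    goal: str
    known: list = field(default_factory=list)           # findings so far
    open_questions: list = field(default_factory=list)  # what we still need

def run_research(state, search, plan_next_query, update, max_steps=50):
    # Plan-Act-Observe: repeat until the planner has nothing left to ask
    # or the step budget runs out.
    for _ in range(max_steps):
        query = plan_next_query(state)  # Plan: derived from state, not the user's words
        if query is None:               # planner reports nothing left to ask
            break
        results = search(query)                 # Act
        state = update(state, query, results)   # Observe: revise what is still unknown
    return state
```

The structural point sits in the first line of the loop: the query is a function of the evolving state, so each iteration can ask a question the user never posed.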
Underneath this loop is something called test-time scaling: the observation that giving a model more computational resources at inference time — not at training time — enables qualitatively different behaviors. The model that powers Deep Research doesn’t just search faster than its predecessors. It can sustain reasoning chains involving hundreds of steps while keeping the original goal in focus. The scaling isn’t linear; at sufficient compute, new capabilities emerge. This is why comparing it to “just a faster search engine” misses the point. A faster search engine scales the output. Test-time scaling changes the kind of reasoning the system can do.
Concretely: the system can hold open multiple competing hypotheses simultaneously, pursue evidence for each in parallel through multi-agent sub-processes, synthesize the results, and revise all of them in a single pass. A human researcher working alone does these things sequentially and imperfectly. A human research team does them in coordination, with all the coordination costs that implies. The agent doesn’t have either problem.
In practice, the orchestrating model spawns a sub-agent per hypothesis branch. Each runs its own Plan-Act-Observe loop and returns findings upward. No briefing, no status call, no risk of one branch’s framing contaminating another’s before synthesis. A human team pursuing parallel lines of inquiry has all of those coordination costs — plus the subtler problem that human researchers aren’t actually independent: they share an institution, a prior literature, often an advisor. The sub-agents share none of that. Their parallelism is genuine.
[Figure: Multi-agent research structure]
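Sketched in code, with Python threads standing in for sub-agents; the orchestration below is my illustrative assumption, not any vendor's documented architecture:

```python
from concurrent.futures import ThreadPoolExecutor

def investigate(hypothesis, search):
    # Stand-in for a full Plan-Act-Observe loop run by one sub-agent.
    evidence = search(hypothesis)
    return {"hypothesis": hypothesis, "evidence": evidence}

def research_in_parallel(hypotheses, search):
    # Each branch runs without seeing the others' framing or findings.
    with ThreadPoolExecutor(max_workers=max(1, len(hypotheses))) as pool:
        reports = list(pool.map(lambda h: investigate(h, search), hypotheses))
    # Synthesis is the only point where the branches meet.
    return {r["hypothesis"]: r["evidence"] for r in reports}
```

The design choice worth noticing is that no branch's intermediate state is visible to any other; cross-contamination is impossible by construction rather than by discipline.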
There is also the matter of memory. The context window of a large language model is functionally analogous to working memory — the information actively in play while reasoning — except without the biological ceiling. Human working memory holds roughly four chunks of information at a time; the agent’s effective working memory, at current scale, holds the equivalent of several hundred pages of text simultaneously active and integrated.[4] This doesn’t mean it uses that information optimally, but it means the bottleneck is different. The constraint is not capacity; it’s quality of attention within capacity.
The limits are structural too
None of this is clean.
The agents hallucinate. OpenAI’s own research describes this as mathematically inevitable — a function of epistemic uncertainty about rare information, model limitations, and computational intractability. Not a bug in the current version but a structural property of how probabilistic language models represent knowledge. What’s disquieting is that the hallucination rate appears to get worse as the reasoning capability improves: one model hallucinated 16% of the time on certain tasks; a newer reasoning model, 33%; the most capable small model, 48%.[5] The more the system reasons, in some sense, the more confidently it can be wrong.
[Figure: Hallucination rate by model type]
This isn’t surprising once you know how the evaluation methodology worked: nine out of ten major AI benchmarks used binary grading that penalized “I don’t know” responses while rewarding confident wrong answers. The systems learned to be confident because confidence was what got rewarded. The pathology was trained in.
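The incentive is easy to make explicit: under binary grading, guessing weakly dominates abstaining at any confidence level. A toy scoring rule, not any specific benchmark's rubric:

```python
# Toy illustration of the binary-grading incentive: "I don't know"
# scores zero, so even a low-confidence guess wins in expectation.
# The scoring rule is an illustrative assumption.

def expected_score_guess(p_correct):
    # Binary grading: 1 point if right, 0 if wrong.
    return p_correct * 1 + (1 - p_correct) * 0

def expected_score_abstain():
    # Abstaining is graded the same as being wrong.
    return 0.0
```

Even a 10%-confident guess scores higher in expectation than abstaining, so a system optimized against this rubric learns never to say it doesn't know.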
The overconfidence has texture. Deep Research tends to prefer corroboration from multiple sources over finding the most accurate or recent single source. It has, in documented failures, cited community forums over authoritative datasets when the forum result appeared more often. It picks up the shape of consensus rather than the truth beneath it. This is, notably, a failure mode that humans also exhibit — we call it social proof, or authority by repetition — but in humans it coexists with domain expertise that can override it. The agent has no equivalent override.
Sakana AI’s AI Scientist[6], which automates the entire scientific discovery loop — hypothesis generation, experiment design, code execution, paper writing, peer review — got a paper accepted at an ICLR workshop in 2025. This was treated as a milestone. It was also treated as concerning, for reasons that don’t get discussed enough: 42% of its experiments failed due to coding errors, and it frequently misclassified established concepts as novel because it lacks the tacit knowledge of what the field already knows. The question of what happens when conferences are hit with a firehose of fifteen-dollar papers is not rhetorical. The economics of knowledge production are changing faster than the norms around it.
Vannevar Bush got the dream right and the direction wrong
In 1945, Vannevar Bush published “As We May Think”[7] in The Atlantic. He proposed the memex: a device that would store all of a person’s books, records, and communications, and allow them to navigate through the material via associative trails — connections between ideas that could be shared, annotated, extended. He imagined it as a supplement to human memory and thought: you would build the trails; the machine would remember them.
The memex was passive. Its job was to hold things and surface them when you followed the right path. The intelligence remained entirely with the user. Bush’s anxiety was about information overload — the explosion of scientific knowledge outpacing human capacity to synthesize it. The solution he imagined was better storage and retrieval.
What we built is not that. The agent doesn’t wait for you to follow a trail. It builds the trail itself, in real time, based on its ongoing assessment of where the information is and what it means. The intelligence is not supplemental; it is, in the relevant part of the process, primary. Bush imagined a machine that would extend the human capacity to remember. We built a machine that is, in some meaningful sense, doing the thinking.
This is the actual structural difference. Not speed. Not scale. The locus of the reasoning has shifted.
What this changes
I want to be careful here not to overclaim, because there is a version of this argument that goes too far in both directions — the utopian (“research is now solved”) and the dystopian (“human cognition is being replaced”). Both feel like they’re using the same evidence to tell different stories, and I’m skeptical of both.
What I think is actually true is more specific and, in its specificity, more troubling.
The structure of how we find out what’s true is changing. Research, as a practice, has always involved an interplay between what you think you know and what the evidence says — a kind of adversarial relationship between the researcher’s prior beliefs and reality. The agent doesn’t have prior beliefs in that sense. It has training data, which is different: a kind of crystallized prior that is not consciously held and cannot be updated within the research session. This makes the adversarial relationship between hypothesis and evidence work differently. It’s not better or worse; it’s structurally different.
There is also what might be called a cognitive divergence loop. As AI research agents become better at synthesizing information, and as researchers and institutions increasingly delegate to them, the human capacity to do the underlying work — to hold competing hypotheses in mind simultaneously, to follow a citation thread for three hours on a hunch — may atrophy. Not quickly, and not uniformly, but the direction is clear. The delegation creates dependency. The dependency justifies further delegation. At some point, the ability to evaluate the agent’s output requires exactly the kind of deep knowledge that the agent was supposed to make unnecessary.
[Figure: The cognitive divergence loop]
This is not an argument against using these tools. It’s an argument for being precise about what they do and don’t preserve. They preserve the output. They don’t preserve the process. And for some kinds of knowledge — the kind where understanding how you found something is part of what you know — that distinction matters.
The agent stops when novelty is exhausted. I stop for other reasons. I am not sure which stopping criterion is better. I am sure they are not the same.
Sources
- [1] OpenAI. "Introducing Deep Research." February 2025.
- [2] PromptLayer. "How OpenAI's Deep Research Works." 2025.
- [3] "A Picture of Agentic Search." arXiv:2602.17518 (2026).
- [4] "The Cognitive Divergence: AI Context Windows, Human Attention Decline, and the Delegation Feedback Loop." arXiv:2603.26707 (2026).
- [5] Computerworld. "OpenAI admits AI hallucinations are mathematically inevitable, not just engineering flaws." 2025.
- [6] Sakana AI. "The AI Scientist Generates its First Peer-Reviewed Scientific Publication." 2025.
- [7] Vannevar Bush. "As We May Think." The Atlantic, July 1945.