Issue 01
Research
TL;DR
- AI research agents don’t just search faster. They decompose questions before asking them, iterate based on what they find, and stop only when novelty is exhausted — not when they’re tired or satisfied.
- The query an agent issues isn’t your query. It emerges from the agent’s own reasoning state, not from your original words.
- Structural advantages over human research: no sunk cost, no prior hypothesis to protect, working memory without a biological ceiling, parallel hypothesis pursuit without coordination cost.
- Structural limits: hallucination rates get worse as reasoning capability improves, and the evaluation methodology trained the overconfidence in. The pathology was built in, not accidentally acquired.
- Vannevar Bush imagined a passive memory aid. We built an autonomous reasoner. The locus of the reasoning has shifted — and that changes what it means to know something.
There is a line in OpenAI’s technical documentation for Deep Research[1] that I’ve been turning over for weeks. The system has a stopping criterion — a condition under which it decides the research is done — called “novelty exhausted.” The agent stops when it stops finding new things. When the marginal return on another search approaches zero. When it has, within the bounds of its task, run out of world.
That is a very different way to stop than the one I use.
I stop when I’m tired. When the deadline is close enough to matter. When I’ve found enough to feel confident, or enough to feel that I should feel confident, which is not the same thing. Sometimes I stop when I find something that contradicts what I was hoping to find, and the cognitive cost of integrating that information feels too high for an afternoon. I stop for reasons that have nothing to do with whether the world has more to say.
This difference — in stopping criteria — turns out to be a good place to start thinking about what is actually different about how AI agents do research. Not the speed. Not the number of sources. The shape of the epistemology.
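The stopping rule itself is simple to state. A toy sketch, assuming a windowed definition of novelty; the window size and threshold here are my illustrative inventions, not OpenAI's published criterion:

```python
# Toy sketch of a "novelty exhausted" stopping rule: stop when the last
# few search rounds surfaced almost nothing that earlier rounds hadn't.
# The window size and threshold are illustrative assumptions.

def novelty_exhausted(findings_per_round, window=3, min_new=1):
    """findings_per_round: one set of findings per search round."""
    if len(findings_per_round) < window:
        return False  # too early to judge
    seen_before = set().union(*findings_per_round[:-window])
    new_recent = set().union(*findings_per_round[-window:]) - seen_before
    return len(new_recent) < min_new  # marginal return has hit zero
```

The point is not the arithmetic but the referent: the rule tests the world's remaining yield, not the researcher's state.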
What research actually is
Research, in the human version, is not retrieval. It is a process of successive refinement under uncertainty. You begin with a question that is too broad to answer, and you narrow it, revise it, abandon it, reconstruct it, until it becomes a question that either has an answer or reveals why it doesn’t. The work is not in finding — it is in the reformulating.
The phenomenology of it is recognizable. You start with a keyword. The results don’t match what you were imagining, which tells you something about what you were imagining. You follow a citation backward. You find a paper that seems relevant but uses different vocabulary, and the vocabulary difference itself is informative. You realize halfway through your third notebook page that the question you started with was the wrong question, and the right question is actually two questions that point in opposite directions, and now you have to decide which one to pursue.
This is not inefficiency. This is the process. The false starts are doing work. The Wikipedia rabbit holes teach you the perimeter of what you don’t know. The wrong vocabulary tells you that adjacent fields have thought about this differently, which means you’re probably missing something structural.
What Google — and most information retrieval systems — do is flatten this into a single transaction. You ask; it retrieves. The reformulating is your problem. The retrieval is theirs.
What the agent does differently
The agents built in the last eighteen months do something structurally different.
Before OpenAI’s Deep Research issues its first query, it has already done something no search engine does: decomposed your question. It maps a research strategy — identifies what pieces of information it needs to investigate and in what order, which sub-questions to pursue, what the dependencies are between them. The query it issues is not a translation of your intent; it is an expression of its own reasoning state. Researchers have called this a “reasoning-induced query”[3] — a query that emerges from the model’s ongoing thought process rather than from your original words. The question the agent asks is not the question you asked.
Then it reads the results. And then — this is the structural part — the next query depends on what it found in the last one. Each cycle through the loop updates the model’s understanding of what it still needs to know. It is, in the technical description, “discovering information in the same iterative manner as a human researcher.”[2] Which is true. And also incomplete.
The human iteration I described earlier is constrained by all the things that constrain humans: energy, ego, time, the sunk cost of having spent three hours on a line of inquiry that now looks wrong. These are not incidental features of how humans do research. They shape the research. A human researcher who has invested two weeks in a hypothesis will read new evidence differently than a researcher who just started. The attachment is real, and it distorts.
The agent has no equivalent distortion. It can abandon a search direction without friction. It has no prior hypothesis to protect. The closest thing it has to a prior is the query you gave it, but that dissolves quickly as the research generates its own internal logic. This is not a complete advantage — the absence of prior commitment also means the absence of deep disciplinary intuition, the kind that tells a domain expert that this particular anomaly is worth pursuing even when the evidence is weak. But it changes the shape of the inquiry.
The architecture underneath
To understand what makes this structurally different rather than just operationally different, it helps to look at the architecture.
The dominant pattern in current research agents is called Plan-Act-Observe. The agent maintains a representation of what it knows and what it needs to know, issues a query, observes the result, updates its representation, and continues. The loop repeats until the stopping criterion is met. This isn’t new in computer science — it’s a version of the sense-plan-act loop that robotics has used for decades. What’s new is applying it to the problem of information synthesis across heterogeneous, unstructured sources at web scale.
[Figure: The research loop]
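A minimal sketch of the Plan-Act-Observe loop described above. The function names (`plan_next_query`, `search`, `update`) and the state shape are assumptions for illustration, not Deep Research's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    goal: str
    known: list = field(default_factory=list)           # findings so far
    open_questions: list = field(default_factory=list)  # what we still need

def run_research(state, search, plan_next_query, update, max_steps=50):
    # Plan-Act-Observe: repeat until the planner has nothing left to ask
    # or the step budget runs out.
    for _ in range(max_steps):
        query = plan_next_query(state)  # Plan: derived from state, not the user's words
        if query is None:               # planner reports nothing left to ask
            break
        results = search(query)                 # Act
        state = update(state, query, results)   # Observe: revise what is still unknown
    return state
```

The structural point sits in the first line of the loop: the query is a function of the evolving state, so each iteration can ask a question the user never posed.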
Underneath this loop is something called test-time scaling: the observation that giving a model more computational resources at inference time — not at training time — enables qualitatively different behaviors. The model that powers Deep Research doesn’t just search faster than its predecessors. It can sustain reasoning chains involving hundreds of steps while keeping the original goal in focus. The scaling isn’t linear; at sufficient compute, new capabilities emerge. This is why comparing it to “just a faster search engine” misses the point. A faster search engine scales the output. Test-time scaling changes the kind of reasoning the system can do.
Concretely: the system can hold open multiple competing hypotheses simultaneously, pursue evidence for each in parallel through multi-agent sub-processes, synthesize the results, and revise all of them in a single pass. A human researcher working alone does these things sequentially and imperfectly. A human research team does them in coordination, with all the coordination costs that implies. The agent doesn’t have either problem.
In practice, the orchestrating model spawns a sub-agent per hypothesis branch. Each runs its own Plan-Act-Observe loop and returns findings upward. No briefing, no status call, no risk of one branch’s framing contaminating another’s before synthesis. A human team pursuing parallel lines of inquiry has all of those coordination costs — plus the subtler problem that human researchers aren’t actually independent: they share an institution, a prior literature, often an advisor. The sub-agents share none of that. Their parallelism is genuine.
[Figure: Multi-agent research structure]
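Sketched in code, with Python threads standing in for sub-agents; the orchestration below is my illustrative assumption, not any vendor's documented architecture:

```python
from concurrent.futures import ThreadPoolExecutor

def investigate(hypothesis, search):
    # Stand-in for a full Plan-Act-Observe loop run by one sub-agent.
    evidence = search(hypothesis)
    return {"hypothesis": hypothesis, "evidence": evidence}

def research_in_parallel(hypotheses, search):
    # Each branch runs without seeing the others' framing or findings.
    with ThreadPoolExecutor(max_workers=max(1, len(hypotheses))) as pool:
        reports = list(pool.map(lambda h: investigate(h, search), hypotheses))
    # Synthesis is the only point where the branches meet.
    return {r["hypothesis"]: r["evidence"] for r in reports}
```

The design choice worth noticing is that no branch's intermediate state is visible to any other; cross-contamination is impossible by construction rather than by discipline.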
There is also the matter of memory. The context window of a large language model is functionally analogous to working memory — the information actively in play while reasoning — except without the biological ceiling. Human working memory holds roughly four chunks of information at a time; the agent’s effective working memory, at current scale, holds the equivalent of several hundred pages of text simultaneously active and integrated.[4] This doesn’t mean it uses that information optimally, but it means the bottleneck is different. The constraint is not capacity; it’s quality of attention within capacity.
The limits are structural too
None of this is clean.
The agents hallucinate. OpenAI’s own research describes this as mathematically inevitable — a function of epistemic uncertainty about rare information, model limitations, and computational intractability. Not a bug in the current version but a structural property of how probabilistic language models represent knowledge. What’s disquieting is that the hallucination rate appears to get worse as the reasoning capability improves: one model hallucinated 16% of the time on certain tasks; a newer reasoning model, 33%; the most capable small model, 48%.[5] The more the system reasons, in some sense, the more confidently it can be wrong.
[Figure: Hallucination rate by model type]
This isn’t surprising once you know how the evaluation methodology worked: nine out of ten major AI benchmarks used binary grading that penalized “I don’t know” responses while rewarding confident wrong answers. The systems learned to be confident because confidence was what got rewarded. The pathology was trained in.
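The incentive is easy to make explicit: under binary grading, guessing weakly dominates abstaining at any confidence level. A toy scoring rule, not any specific benchmark's rubric:

```python
# Toy illustration of the binary-grading incentive: "I don't know"
# scores zero, so even a low-confidence guess wins in expectation.
# The scoring rule is an illustrative assumption.

def expected_score_guess(p_correct):
    # Binary grading: 1 point if right, 0 if wrong.
    return p_correct * 1 + (1 - p_correct) * 0

def expected_score_abstain():
    # Abstaining is graded the same as being wrong.
    return 0.0
```

Even a 10%-confident guess scores higher in expectation than abstaining, so a system optimized against this rubric learns never to say it doesn't know.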
The overconfidence has texture. Deep Research tends to prefer corroboration from multiple sources over finding the most accurate or recent single source. It has, in documented failures, cited community forums over authoritative datasets when the forum result appeared more often. It picks up the shape of consensus rather than the truth beneath it. This is, notably, a failure mode that humans also exhibit — we call it social proof, or authority by repetition — but in humans it coexists with domain expertise that can override it. The agent has no equivalent override.
Sakana AI’s AI Scientist[6], which automates the entire scientific discovery loop — hypothesis generation, experiment design, code execution, paper writing, peer review — got a paper accepted at an ICLR workshop in 2025. This was treated as a milestone. It was also treated as concerning, for reasons that don’t get discussed enough: 42% of its experiments failed due to coding errors, and it frequently misclassified established concepts as novel because it lacks the tacit knowledge of what the field already knows. The question of what happens when conferences are hit with a firehose of fifteen-dollar papers is not rhetorical. The economics of knowledge production are changing faster than the norms around it.
Vannevar Bush got the dream right and the direction wrong
In 1945, Vannevar Bush published “As We May Think”[7] in The Atlantic. He proposed the memex: a device that would store all of a person’s books, records, and communications, and allow them to navigate through the material via associative trails — connections between ideas that could be shared, annotated, extended. He imagined it as a supplement to human memory and thought: you would build the trails; the machine would remember them.
The memex was passive. Its job was to hold things and surface them when you followed the right path. The intelligence remained entirely with the user. Bush’s anxiety was about information overload — the explosion of scientific knowledge outpacing human capacity to synthesize it. The solution he imagined was better storage and retrieval.
What we built is not that. The agent doesn’t wait for you to follow a trail. It builds the trail itself, in real time, based on its ongoing assessment of where the information is and what it means. The intelligence is not supplemental; it is, in the relevant part of the process, primary. Bush imagined a machine that would extend the human capacity to remember. We built a machine that is, in some meaningful sense, doing the thinking.
This is the actual structural difference. Not speed. Not scale. The locus of the reasoning has shifted.
What this changes
I want to be careful here not to overclaim, because there is a version of this argument that goes too far in both directions — the utopian (“research is now solved”) and the dystopian (“human cognition is being replaced”). Both feel like they’re using the same evidence to tell different stories, and I’m skeptical of both.
What I think is actually true is more specific and, in its specificity, more troubling.
The structure of how we find out what’s true is changing. Research, as a practice, has always involved an interplay between what you think you know and what the evidence says — a kind of adversarial relationship between the researcher’s prior beliefs and reality. The agent doesn’t have prior beliefs in that sense. It has training data, which is different: a kind of crystallized prior that is not consciously held and cannot be updated within the research session. This makes the adversarial relationship between hypothesis and evidence work differently. It’s not better or worse; it’s structurally different.
There is also what might be called a cognitive divergence loop. As AI research agents become better at synthesizing information, and as researchers and institutions increasingly delegate to them, the human capacity to do the underlying work — to hold competing hypotheses in mind simultaneously, to follow a citation thread for three hours on a hunch — may atrophy. Not quickly, and not uniformly, but the direction is clear. The delegation creates dependency. The dependency justifies further delegation. At some point, the ability to evaluate the agent’s output requires exactly the kind of deep knowledge that the agent was supposed to make unnecessary.
[Figure: The cognitive divergence loop]
This is not an argument against using these tools. It’s an argument for being precise about what they do and don’t preserve. They preserve the output. They don’t preserve the process. And for some kinds of knowledge — the kind where understanding how you found something is part of what you know — that distinction matters.
The agent stops when novelty is exhausted. I stop for other reasons. I am not sure which stopping criterion is better. I am sure they are not the same.
Sources
- [1] OpenAI. "Introducing Deep Research." February 2025.
- [2] PromptLayer. "How OpenAI's Deep Research Works." 2025.
- [3] "A Picture of Agentic Search." arXiv:2602.17518 (2026).
- [4] "The Cognitive Divergence: AI Context Windows, Human Attention Decline, and the Delegation Feedback Loop." arXiv:2603.26707 (2026).
- [5] Computerworld. "OpenAI admits AI hallucinations are mathematically inevitable, not just engineering flaws." 2025.
- [6] Sakana AI. "The AI Scientist Generates its First Peer-Reviewed Scientific Publication." 2025.
- [7] Vannevar Bush. "As We May Think." The Atlantic, July 1945.