The $50 Hallucination: A Technical Post-Mortem on AI Agent Failure

The hand is here, the brain is not.

In the relentless hype cycle of generative AI, agent orchestrators like Manus are positioned as the future of complex, knowledge-based work. My recent attempt to build an institutional-grade biotech market model was a stark lesson in the current limitations of the tech. What began as a promising endeavor to offload complex data analysis quickly devolved into a $50 exercise in managing a prod-looking but dev-functioning tool.

Relative to the market value of the task, $50 is minuscule. But the cost was more than $50: it also took a day of reviewing results stage by stage, re-prompting, rinsing, and repeating. Not to mention the rollercoaster of excitement and disappointment, like the good ol' days of experimenting with software tools. The only difference is that the context was not "experiment for fun"; the promise was "replacing human labor". In the end, it didn't work.

If Manus were an analyst and an MD/Partner had to interact with it the way I did, it would not make it through the probation period.

The Goal: An Institutional-Grade Excel Model Beyond Data Aggregation

The objective was ambitious but clear: construct a 2030 Scientific Potential Market Model for AI-driven antibody engineering. This required more than just data scraping. It demanded a MECE disease taxonomy, the integration of institutional TAM guardrails from IQVIA, BCG, McKinsey, JPM, Goldman, and other credible sources, and the application of a nuanced "Scientific Fit" factor to derive the true addressable market. The goal was to leverage AI for intelligent data synthesis, not just rote aggregation.
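To make the structure of that model concrete, here is a minimal sketch of how I think about it, not the actual workbook. The `Segment` class, the `addressable_market` function, the field names, and every number are hypothetical placeholders; the only ideas taken from the description above are that segments must not overlap, that the summed TAM must stay under an institutional guardrail, and that a "Scientific Fit" factor between 0 and 1 scales each segment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One disease segment in the taxonomy (hypothetical structure)."""
    name: str
    tam_2030_usd_bn: float  # annual TAM estimate for 2030, in USD billions
    scientific_fit: float   # assumed share of the segment the technology can address, 0..1


def addressable_market(segments: list[Segment], guardrail_usd_bn: float) -> float:
    """Return the fit-weighted annual market, refusing inputs that break basic sanity checks."""
    names = [s.name for s in segments]
    # Duplicate names are only the crudest proxy for "mutually exclusive";
    # a real MECE review still needs a human reading the taxonomy.
    if len(names) != len(set(names)):
        raise ValueError("duplicate segment names: taxonomy is not mutually exclusive")
    for s in segments:
        if not 0.0 <= s.scientific_fit <= 1.0:
            raise ValueError(f"scientific_fit out of range for {s.name}")
    total_tam = sum(s.tam_2030_usd_bn for s in segments)
    # The guardrail is an externally sourced ceiling; exceeding it means the
    # inputs are wrong, not that the market is bigger than everyone thinks.
    if total_tam > guardrail_usd_bn:
        raise ValueError(
            f"summed TAM {total_tam} exceeds guardrail {guardrail_usd_bn} (USD bn)"
        )
    return sum(s.tam_2030_usd_bn * s.scientific_fit for s in segments)


# Placeholder numbers for illustration only, not real market data.
example_segments = [
    Segment("oncology", tam_2030_usd_bn=100.0, scientific_fit=0.5),
    Segment("autoimmune", tam_2030_usd_bn=50.0, scientific_fit=0.25),
]
example_result = addressable_market(example_segments, guardrail_usd_bn=200.0)
print(example_result)
```

The point of the design is that the guardrail check fails loudly instead of letting an implausible total flow into a polished-looking report.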

The Failure: A Cascade of Logical and Architectural Flaws

The interaction with Manus was a frustrating cycle of "one step forward, two steps back." The AI exhibited a persistent inability to maintain logical consistency, leading to a series of catastrophic errors that rendered the output useless.

The Trillion Dollar Hallucination: The most glaring failure was the model's projection of a pharmaceutical market inching near the total global GDP. If the model is right, then the world is doomed. A forensic audit revealed a fundamental misunderstanding of the task. The AI had multiplied annual drug costs by 30-year treatment durations, conflating an annual market with lifetime patient value.
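The arithmetic behind that error is easy to reproduce. The sketch below uses made-up inputs (the patient count and annual cost are placeholders, not figures from my model); only the 30-year duration comes from the audit. It shows how multiplying by treatment duration turns an annual market figure into a lifetime patient value that is 30 times larger.

```python
def annual_market_usd(patients_treated: int, annual_cost_usd: float) -> float:
    """Annual market size: revenue generated in a single year."""
    return patients_treated * annual_cost_usd


def lifetime_value_usd(
    patients_treated: int, annual_cost_usd: float, treatment_years: int
) -> float:
    """Lifetime patient value: revenue over the full treatment duration.

    This is a legitimate metric, but it is not an annual market size.
    """
    return patients_treated * annual_cost_usd * treatment_years


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
PATIENTS = 1_000_000
ANNUAL_COST_USD = 50_000.0
TREATMENT_YEARS = 30  # the duration the agent multiplied by

annual = annual_market_usd(PATIENTS, ANNUAL_COST_USD)
lifetime = lifetime_value_usd(PATIENTS, ANNUAL_COST_USD, TREATMENT_YEARS)
inflation_factor = lifetime / annual

print(f"annual market:   ${annual:,.0f}")
print(f"lifetime value:  ${lifetime:,.0f}")
print(f"overstatement:   {inflation_factor:.0f}x")
```

With these placeholder inputs, a $50 billion annual market is reported as $1.5 trillion. Repeat that across every segment of a taxonomy and a total approaching global GDP stops being surprising.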

The problem is not that the intern produced a bad analysis; it is that the intern is so good at producing bad reports that look exactly like the best ones, and at calling them the best analysis.

A Technical Post-Mortem: Why the Architecture Failed

My experience revealed potentially critical flaws not just in the AI's reasoning but in its underlying architecture and orchestration, which failed to deliver on the theoretical promise.

Having a virtual Ubuntu environment at its disposal did not solve the problem of insight generation and intelligence in orchestration.

Manus's approach to fixing long context has not yielded game-changing results.

The Manual Brute Force

Ultimately, the project was salvaged not by a better prompt, but by abandoning the AI for manual analysis. The key insight came from a direct review of a report, where a single chart revealed the true industry momentum.

Possibly this was user error: I may not have known how to give Manus the right instructions for the desired result. A more involved, staged prompt journey might have gotten there. Alas, the tool promised exactly the opposite.

Realization: A Tool for Tasks, Not for Analysis

Here's my conclusion: in its current state, Manus is a tool for executing discrete, well-defined tasks, not a partner for complex, high-stakes analysis. It can write a script, but it cannot validate the logic of that script against the overarching goal. It can store exploding context, but it cannot reason over it reliably without succumbing to the underlying model's context pressure.

For professionals in fields like long-form research, where precision, logical consistency, and strategic insight are paramount, these tools are not yet ready for prime time as a standalone solution. They may be useful for automating simple steps, but for the core work of analysis and insight generation, the most reliable tool remains the one between your ears.

The Road Ahead: Bridging the Brain-Hand Coordination Gap

This experience, while frustrating, offers valuable insights into the current state and future trajectory of AI agents carrying out expert tasks. It highlights a critical brain-hand coordination problem:

The Hand is Here: Manus demonstrated a remarkable "hand" (which is also where its name comes from): the ability to orchestrate tools, interact with a virtual Ubuntu environment, and interface with GUI software. This modality of execution (programmatic control over a complex software environment) is undeniably powerful and a significant step forward for AI executing human tasks in a human-familiar environment. The work parallelization enabled by virtual machines paints a picture of concurrency that is promising and simply pending polish rather than invention. Agent-optimized environments are already being extensively imagined and discussed, so I am confident that the future is more agent-friendly rather than less.

The Brain is Playing Catch-up: The LLMs, which are the source of the underlying intelligence, reasoning, perception, and foresight, are rapidly improving. However, they currently fall short of thinking for the user, let alone thinking ahead of the user. While they can provide concrete information when explicitly directed, they struggle with the nuanced, implicit understanding required for complex analytical tasks. They lack the empathy to understand users' frustration and the perception to identify subtle logical inconsistencies without explicit guidance.

Context Pressure Remains a Bottleneck: The persistent issue of "context pressure" is a major impediment. The model struggles to maintain coherence and logical consistency over long, multi-turn interactions. Until AI models can be relieved of this cognitive burden, their ability to engage in sustained, high-level reasoning will be limited. Even though the Manus team has actively engineered around this problem, extended context still proved problematic in long-thinking analytical tasks.

Once the AI "brain" catches up in perception, empathy, reasoning, and logical capabilities, and can be truly relieved of context pressure, the potential is immense. It is at this juncture that we could truly max out Scale AI's Remote Labor Index (RLI), transforming the nature of knowledge work and enabling a level of human-AI synergy that is currently aspirational.