Agentic Orchestration, Harness Hype, and the Return of Human Code Review
The landscape of software engineering this week reads like a fascinating collision of breakthroughs and hangovers: agentic development is maturing, AI code review is being backstopped by human accountability, new harnesses change everything and nothing, and yet somehow we’re lauding the rediscovery of “read every line before you commit.” AI tools are reshaping not only how we write code, but how we think about interfaces, testing, infrastructure, and the accidental complexity we generate. Despite the roaring pace of innovation, one theme recurs: no matter how clever the machine, humans are ultimately the ones on the blame line, sometimes with better dashboards and occasionally with much worse headaches.
No Gods, Only Accountability: Human-in-the-Loop or Human-Led?
Maxi C’s HackerNoon tip, “Review Every Line Before You Commit,” is something of a throwback sermon for the era of AI-fueled productivity. The advice sounds quaint until you realize that an AI’s gleaming code is just as likely to hide security flaws, subtle bugs, and hardcoded secrets as a sleep-deprived junior dev’s. The distinction, of course, is that the AI will never sit in a postmortem to explain itself. Accountability for every commit transfers back to the human, who (according to Maxi) had better not skip the manual review, lest future-you have to clean up the “workslop.”
It’s clear: AI code generation accelerates output but opens a palpable trust gap. Teams that treat AI-generated code as production-ready invite technical debt and erode collective trust. Humans must claim ownership and ensure the code is understood, tested, and explainable. Or, put another way: “You are not disposable. Review everything.”
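Part of that discipline can even be automated without handing it to another model. As a minimal sketch, assuming a plain Git workflow: a pre-commit hook that scans the staged diff for secret-shaped strings. The patterns below are illustrative only; a real setup would lean on a dedicated scanner such as gitleaks.

```python
#!/usr/bin/env python3
"""Minimal pre-commit guard: refuse to commit a staged diff that looks like
it contains hardcoded secrets. Illustrative only -- not a real scanner."""
import re
import subprocess
import sys

# Toy patterns; real setups should use a dedicated tool such as gitleaks.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
    re.compile(r"-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{16,}"),
]

def staged_diff() -> str:
    # Inspect only what is actually about to be committed.
    return subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout

def main() -> int:
    hits = []
    for line in staged_diff().splitlines():
        if not line.startswith("+") or line.startswith("+++"):
            continue  # added lines only; skip diff file headers
        if any(p.search(line) for p in SECRET_PATTERNS):
            hits.append(line[:80])
    if hits:
        print("Refusing to commit; possible secrets in staged lines:")
        print("\n".join(f"  {h}" for h in hits))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Dropped into `.git/hooks/pre-commit` and made executable, it blocks the commit before future-you ever sees the leak. The line-by-line human read still has to happen on top.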
VS Code: The Universal Agent Playground
Meanwhile, Microsoft’s VS Code continues to morph into a hub where agents, human and artificial, collide. The latest update turns the world’s editor of choice into a “multi-agent command center”: developers can now orchestrate Claude, Codex, and Copilot side by side, delegating work based on each agent’s strengths. There’s no longer a ‘winning’ model; instead, VS Code becomes the substrate that keeps users inside Microsoft’s walled (but very open-feeling) garden.
Parallel subagents, dashboard-rich MCP Apps, and session unification all point to the same trend: as agentic development matures, integrating and managing many specialized AIs is becoming the craft, rather than consuming their futuristic output as an end in itself. It echoes the old Unix “do one thing well” philosophy, except now that “thing” is a turbocharged AI that demands orchestration.
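In code, that orchestration is mundane: fan the work out, await the results. A toy asyncio sketch, where the agent roster and the `call_agent` stub are hypothetical stand-ins for real API or agent-session calls:

```python
import asyncio

# Hypothetical roster mapping task types to the agent best suited for them.
AGENT_FOR = {"refactor": "claude", "tests": "codex", "docs": "copilot"}

async def call_agent(agent: str, task: str) -> str:
    # Stand-in for a real API or agent-session call.
    await asyncio.sleep(0.1)
    return f"{agent} finished: {task}"

async def orchestrate(tasks: dict[str, str]) -> list[str]:
    # Fan each task out to its specialist and run them all in parallel.
    return await asyncio.gather(
        *(call_agent(AGENT_FOR[kind], desc) for kind, desc in tasks.items())
    )

results = asyncio.run(orchestrate({
    "refactor": "extract the retry logic into a helper",
    "tests": "add regression tests for the helper",
    "docs": "update the README usage section",
}))
print("\n".join(results))
```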
Benchmarks Are Dead, Long Live the Harness
Can Bölük’s “I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed.” cuts through the “which LLM is best” debate with a reminder: whoever controls the harness shapes reality. Changing the edit-tool protocol (hello, hashline) produced accuracy swings greater than entire model releases; some weaker models saw tenfold improvements. In practice, the harness is the bridge to reliable tooling, while the model is just the moat that companies dig around it.
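The underlying idea, as the post describes it, is to stop asking models to emit fragile search-and-replace diffs and instead tag every line with a stable identifier the model can address edits to. A toy reconstruction of such a protocol (details simplified; this is not Bölük’s exact format):

```python
import hashlib

def _tag(lineno: int, line: str) -> str:
    # Short, content-derived identifier for one line.
    return hashlib.sha1(f"{lineno}:{line}".encode()).hexdigest()[:4]

def render_hashlines(text: str) -> str:
    """The view the model sees: every line prefixed with a stable anchor."""
    lines = text.splitlines()
    return "\n".join(f"{i}#{_tag(i, l)}| {l}" for i, l in enumerate(lines, 1))

def apply_edit(text: str, anchor: str, replacement: list[str]) -> str:
    """Replace the line addressed by 'lineno#hash'; a stale anchor (the file
    changed underneath) fails loudly instead of being mis-applied."""
    lines = text.splitlines()
    lineno, want = anchor.split("#")
    idx = int(lineno) - 1
    if _tag(int(lineno), lines[idx]) != want:
        raise ValueError(f"stale anchor {anchor}: file has changed")
    return "\n".join(lines[:idx] + replacement + lines[idx + 1:])

src = "def add(a, b):\n    return a - b"
view = render_hashlines(src)
print(view)
# The model copies the anchor for line 2 straight from the rendered view:
anchor = view.splitlines()[1].split("| ")[0]
print(apply_edit(src, anchor, ["    return a + b"]))
```

The win is that the model never has to reproduce the original text exactly; a stale anchor fails loudly instead of silently mangling the file.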
There’s a sour aftertaste here too, as vendors like Anthropic and Google lock out “rogue” harnesses, even when those harnesses yield better outcomes than the corporate ones. There’s power, and danger, in treating the interface as mere plumbing. For anyone chasing robust automation, harness-level innovation often matters more than model upgrades, and open-source harnesses demonstrate the community’s ability to shape results for everyone, not just the “big model” owners.
Testing is Dead, Long Live Testing
If code generation and integration are mutating, so too is testing. As Mark Harman discusses on Meta’s engineering blog, the rise of “agentic development” has killed off traditional static test suites, at least for teams pushing the bleeding edge. The new hope is Just-in-Time Tests (JiTTests): on-the-fly, LLM-generated regression checks tailored to each code change. These ephemeral tests skip the maintenance drudgery and focus only on real bugs that matter, catching silent failures at the bottleneck instead of filling the codebase with noise.
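In outline the loop is simple, even if Meta’s production version is anything but. A hedged sketch: generate candidate tests for a diff, then keep only those that pass on the base revision and fail on the candidate, since that disagreement marks a behavioral change worth a human look. Here `llm_generate_tests` is a placeholder, not a real API:

```python
import subprocess

def run_pytest(test_file: str, revision: str) -> bool:
    """Check out a revision and run one test file against it; True == passed."""
    subprocess.run(["git", "checkout", "--quiet", revision], check=True)
    return subprocess.run(["pytest", "--quiet", test_file]).returncode == 0

def llm_generate_tests(diff: str) -> list[str]:
    """Placeholder for an LLM call that writes candidate regression tests
    targeting the changed behavior; returns paths to generated test files."""
    raise NotImplementedError("wire up your LLM of choice here")

def jit_regression_check(base: str, candidate: str) -> list[str]:
    diff = subprocess.run(["git", "diff", base, candidate],
                          capture_output=True, text=True, check=True).stdout
    suspicious = []
    for test in llm_generate_tests(diff):
        # A test that passes on base but fails on the candidate revision
        # has caught a behavioral change -- possibly a silent regression.
        if run_pytest(test, base) and not run_pytest(test, candidate):
            suspicious.append(test)
    return suspicious  # ephemeral: flag for review, then discard the tests
```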
The winner in this arms race isn’t flawless code, but a workflow that respects context, adapts to shifting intent, yet still puts a real human in charge when it matters.
Synthetic Data: Foundation, Accelerator, or Treadmill?
Fabiana Clemente’s appearance on O’Reilly’s podcast is a reminder that synthetic data underpins the new AI-training paradigm, especially for multi-agent scenarios. Far from being a simple fix, synthetic data imposes its own governance challenges and “good enough” plateaus. Used wisely, it can protect privacy, accelerate training, and power scenarios (like simulation in robotics) where real data is forever just out of reach.
Yet when you train models on their own synthetic output, you risk model collapse: the AI equivalent of talking in a self-referential echo chamber. Synthetic data is neither panacea nor poison, just another lever in the hands of engineers who must remain skeptical, empirical, and aware of its limits.
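The dynamic is easy to demonstrate in miniature: fit a distribution, sample “synthetic” data from the fit, refit on the samples, repeat. Each generation anchors to the previous generation’s output instead of reality, so the estimate drifts, and in expectation the variance decays, which is the collapse:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100)  # the "real" data, std = 1.0

for generation in range(15):
    mu, sigma = data.mean(), data.std()
    print(f"gen {generation:2d}: estimated sigma = {sigma:.3f}")
    # The next generation trains only on the current model's synthetic output,
    # so estimation error compounds instead of washing out against real data.
    data = rng.normal(loc=mu, scale=sigma, size=100)
```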
Infrastructure Matters: PostgreSQL at Hyperscale
Amidst this AI tumult, OpenAI’s feat of scaling a single-primary PostgreSQL deployment to millions of queries per second for ChatGPT shows that plumbing is still king. The optimizations range from lazy writes and connection pooling to cascading replication and offloading write-heavy workloads to sharded Cosmos DB. Modern AI workloads stress infrastructure not just with tokens, but with the need to scale out without losing consistency or adding latency. In 2026, it’s the marriage of boring old reliability and bleeding-edge adaptation that rules.
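Most of those techniques live below the application, but one is visible from app code: steering reads to replicas while keeping writes on the single primary. A minimal sketch using psycopg’s connection pooling, with invented hostnames, pool sizes, and schema:

```python
from psycopg_pool import ConnectionPool

# Hypothetical endpoints: one writable primary, replicas behind a load balancer.
primary = ConnectionPool("host=pg-primary dbname=app", min_size=2, max_size=10)
replicas = ConnectionPool("host=pg-replicas dbname=app", min_size=4, max_size=40)

def fetch_user(user_id: int):
    # Reads go to replicas; callers tolerate slight replication lag.
    with replicas.connection() as conn:
        return conn.execute(
            "SELECT id, name FROM users WHERE id = %s", (user_id,)
        ).fetchone()

def rename_user(user_id: int, name: str) -> None:
    # Writes must hit the single primary.
    with primary.connection() as conn:
        conn.execute("UPDATE users SET name = %s WHERE id = %s", (name, user_id))
```

The hard part OpenAI describes is everything this sketch hides: keeping the replica fleet consistent enough that the read path can actually be trusted.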
Speed vs. Smart: Codex Spark and the Future of Model Choice
The introduction of Codex Spark, a model optimized for latency-sensitive, real-time collaboration, signals another bifurcation in AI tooling: smart isn’t always fast, and fast isn’t always smart. For everyday developer workflows, rapid, interruptible models will often suffice, reserving the “Einstein-class” heavyweights for marathon tasks. Model selection itself becomes another lever for teams.
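In practice that selection amounts to a routing policy. A toy sketch, where every threshold and model name is illustrative rather than real:

```python
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    interactive: bool   # is a human waiting on the other end?
    est_minutes: float  # rough size of the job

def pick_model(task: Task) -> str:
    """Illustrative routing policy: fast models for interactive loops,
    heavyweight reasoning models for long-running background work."""
    if task.interactive and task.est_minutes < 5:
        return "fast-model"      # a Spark-class, latency-optimized model
    if task.est_minutes > 60:
        return "frontier-model"  # "Einstein-class": slow, deep reasoning
    return "balanced-model"

print(pick_model(Task("rename a variable", interactive=True, est_minutes=0.5)))
print(pick_model(Task("migrate the ORM layer", interactive=False, est_minutes=180)))
```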
Conclusion: Orchestrating the Chaos
This week’s crop of posts shows software engineering at a tipping point: AI is everywhere, but the bottleneck and the risk have shifted. Integration harnesses, ephemeral tests, infrastructural resilience, and conscious model orchestration matter as much as the core intelligence behind the tooling. The future belongs to engineers who own the interfaces, understand the interplay, and are unafraid to review every line, AI or not, before they commit.
References
- AI Coding Tip 006 - Review Every Line Before You Commit
- VS Code becomes multi-agent command center for developers
- The Death of Traditional Testing: Agentic Development Broke a 50-Year-Old Field, JiTTesting Can Revive It
- OpenAI's new Codex Spark model is built for speed
- OpenAI Scales Single-Primary PostgreSQL to Millions of Queries per Second for ChatGPT
- Generative AI in the Real World: Fabiana Clemente on Synthetic Data for AI and Agentic Systems
- I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed.
- Gas Town, Beads, and the Rise of Agentic Development with Steve Yegge