What’s new today…

VentureBeat: Transformative tech coverage that matters

  • LangChain’s CEO argues that better models alone won’t get your AI agent to production
    by taryn.plumb@venturebeat.com (Taryn Plumb) on March 7, 2026 at 10:00 pm

    As models get smarter and more capable, the “harnesses” around them must also evolve. This “harness engineering” is an extension of context engineering, says LangChain co-founder and CEO Harrison Chase in a new VentureBeat Beyond the Pilot podcast episode. Whereas traditional AI harnesses have tended to constrain models from running in loops and calling tools, harnesses specifically built for AI agents allow them to interact more independently and effectively perform long-running tasks.

    Chase also weighed in on OpenAI’s acquisition of OpenClaw, arguing that its viral success came down to a willingness to “let it rip” in ways that no major lab would — and questioning whether the acquisition actually gets OpenAI closer to a safe enterprise version of the product.

    “The trend in harnesses is to actually give the large language model (LLM) itself more control over context engineering, letting it decide what it sees and what it doesn’t see,” Chase says. “Now, this idea of a long-running, more autonomous assistant is viable.”

    Tracking progress and maintaining coherence

    While the concept of allowing LLMs to run in a loop and call tools seems relatively simple, it’s difficult to pull off reliably, Chase noted. For a while, models were “below the threshold of usefulness” and simply couldn’t run in a loop, so devs used graphs and wrote chains to get around that. Chase pointed to AutoGPT — once the fastest-growing GitHub project ever — as a cautionary example: same architecture as today’s top agents, but the models weren’t good enough yet to run reliably in a loop, so it faded fast.

    But as LLMs keep improving, teams can construct environments where models can run in loops and plan over longer horizons, and they can continually improve these harnesses. Previously, “you couldn’t really make improvements to the harness because you couldn’t actually run the model in a harness,” Chase said. LangChain’s answer to this is Deep Agents, a customizable general-purpose harness.
    Built on LangChain and LangGraph, it has planning capabilities, a virtual filesystem, context and token management, code execution, and skills and memory functions. Further, it can delegate tasks to subagents, which are specialized with different tools and configurations and can work in parallel. Context is also isolated, meaning subagent work doesn’t clutter the main agent’s context, and large subtask context is compressed into a single result for token efficiency.

    All of these agents have access to file systems, Chase explained, and can essentially create to-do lists that they can execute on and track over time. “When it goes on to the next step, and it goes on to step two or step three or step four out of a 200-step process, it has a way to track its progress and keep that coherence,” Chase said. “It comes down to letting the LLM write its thoughts down as it goes along, essentially.”

    He emphasized that harnesses should be designed so that models can maintain coherence over longer tasks, and be “amenable” to models deciding when to compact context at points they determine are “advantageous.” Also, giving agents access to code interpreters and Bash tools increases flexibility. And providing agents with skills, as opposed to just tools loaded up front, allows them to load information when they need it. “So rather than hard code everything into one big system prompt,” Chase explained, “you could have a smaller system prompt: ‘This is the core foundation, but if I need to do X, let me read the skill for X. If I need to do Y, let me read the skill for Y.'”

    Essentially, context engineering is a “really fancy” way of saying: What is the LLM seeing? Because that’s different from what developers see, he noted. When human devs can analyze agent traces, they can put themselves in the AI’s “mindset” and answer questions like: What is the system prompt? How is it created? Is it static or is it populated? What tools does the agent have?
    When it makes a tool call and gets a response back, how is that presented?

    “When agents mess up, they mess up because they don’t have the right context; when they succeed, they succeed because they have the right context,” Chase said. “I think of context engineering as bringing the right information in the right format to the LLM at the right time.”

    Listen to the podcast to hear more about:

    • How LangChain built its stack: LangGraph as the core pillar, LangChain at the center, Deep Agents on top.
    • Why code sandboxes will be the next big thing.
    • How a different type of UX will evolve as agents run at longer intervals (or continuously).
    • Why traces and observability are core to building an agent that actually works.

    You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.
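    The skills-over-one-big-system-prompt idea Chase describes can be sketched in a few lines of Python. This is an illustrative toy, not LangChain’s actual Deep Agents API: the names SKILLS, CORE_PROMPT, load_skill and build_context are hypothetical stand-ins for skill files on an agent’s virtual filesystem.

```python
# Sketch: keep the system prompt small and load task-specific "skills" on demand.
# SKILLS stands in for skill files on the agent's virtual filesystem.
SKILLS = {
    "summarize": "To summarize: extract key claims, keep quotes verbatim, cite sources.",
    "refund":    "To process a refund: verify order ID, check eligibility, draft response.",
}

CORE_PROMPT = "You are a helpful agent. Load a skill before attempting its task."

def load_skill(name: str) -> str:
    # The agent reads a skill only when it needs it, instead of
    # hard-coding everything into one big system prompt.
    return SKILLS.get(name, "")

def build_context(task: str) -> str:
    # Small core prompt, plus only the skill relevant to the current task.
    skill = load_skill(task)
    return CORE_PROMPT + ("\n\n" + skill if skill else "")

print(build_context("refund"))
```

    The point of the design is token economy: the core prompt stays constant and small, while task-specific instructions enter the context only when invoked.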

  • Karpathy’s March of Nines shows why 90% AI reliability isn’t even close to enough
    on March 7, 2026 at 5:00 am

    “When you get a demo and something works 90% of the time, that’s just the first nine.” — Andrej Karpathy

    The “March of Nines” frames a common production reality: you can reach the first 90% reliability with a strong demo, and each additional nine often requires comparable engineering effort. For enterprise teams, the distance between “usually works” and “operates like dependable software” determines adoption.

    The compounding math behind the March of Nines

    “Every single nine is the same amount of work.” — Andrej Karpathy

    Agentic workflows compound failure. A typical enterprise flow might include: intent parsing, context retrieval, planning, one or more tool calls, validation, formatting, and audit logging. If a workflow has n steps and each step succeeds with probability p, end-to-end success is approximately p^n. In a 10-step workflow, even small per-step failure rates compound into a large end-to-end failure rate, and correlated outages (auth, rate limits, connectors) will dominate unless you harden shared dependencies.

    Per-step success (p) | 10-step success (p^10) | Workflow failure rate | At 10 workflows/day    | What this means in practice
    90.00%               | 34.87%                 | 65.13%                | ~6.5 interruptions/day | Prototype territory; most workflows get interrupted.
    99.00%               | 90.44%                 | 9.56%                 | ~1 every 1.0 days      | Fine for a demo, but interruptions are still frequent in real use.
    99.90%               | 99.00%                 | 1.00%                 | ~1 every 10.0 days     | Still feels unreliable because misses remain common.
    99.99%               | 99.90%                 | 0.10%                 | ~1 every 3.3 months    | Starts to feel like dependable enterprise-grade software.

    Define reliability as measurable SLOs

    “It makes a lot more sense to spend a bit more time to be more concrete in your prompts.” — Andrej Karpathy

    Teams achieve higher nines by turning reliability into measurable objectives, then investing in controls that reduce variance.
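    The table’s figures follow directly from p^n; a short script (illustrative, not from the article) reproduces them:

```python
# Reproduce the compounding-reliability table: end-to-end success of an
# n-step workflow where each step independently succeeds with probability p.
def workflow_success(p: float, n: int = 10) -> float:
    return p ** n

for p in (0.90, 0.99, 0.999, 0.9999):
    ok = workflow_success(p)
    fail = 1.0 - ok
    # At 10 workflows/day, expected interruptions/day = 10 * failure rate.
    print(f"p={p:.2%}  p^10={ok:.2%}  failure={fail:.2%}  ~{10 * fail:.2f} interruptions/day")
```

    The independence assumption is the optimistic case; correlated failures across steps (a shared auth outage, for instance) make the real numbers worse.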
    Start with a small set of SLIs that describe both model behavior and the surrounding system:

    • Workflow completion rate (success or explicit escalation).
    • Tool-call success rate within timeouts, with strict schema validation on inputs and outputs.
    • Schema-valid output rate for every structured response (JSON/arguments).
    • Policy compliance rate (PII, secrets, and security constraints).
    • p95 end-to-end latency and cost per workflow.
    • Fallback rate (safer model, cached data, or human review).

    Set SLO targets per workflow tier (low/medium/high impact) and manage an error budget so experiments stay controlled.

    Nine levers that reliably add nines

    1) Constrain autonomy with an explicit workflow graph

    Reliability rises when the system has bounded states and deterministic handling for retries, timeouts, and terminal outcomes.

    • Model calls sit inside a state machine or a DAG, where each node defines allowed tools, max attempts, and a success predicate.
    • Persist state with idempotent keys so retries are safe and debuggable.

    2) Enforce contracts at every boundary

    Most production failures start as interface drift: malformed JSON, missing fields, wrong units, or invented identifiers.

    • Use JSON Schema/protobuf for every structured output and validate server-side before any tool executes.
    • Use enums, canonical IDs, and normalize time (ISO-8601 + timezone) and units (SI).

    3) Layer validators: syntax, semantics, business rules

    Schema validation catches formatting. Semantic and business-rule checks prevent plausible answers that break systems.

    • Semantic checks: referential integrity, numeric bounds, permission checks, and deterministic joins by ID when available.
    • Business rules: approvals for write actions, data residency constraints, and customer-tier constraints.

    4) Route by risk using uncertainty signals

    High-impact actions deserve higher assurance.
    Risk-based routing turns uncertainty into a product feature.

    • Use confidence signals (classifiers, consistency checks, or a second-model verifier) to decide routing.
    • Gate risky steps behind stronger models, additional verification, or human approval.

    5) Engineer tool calls like distributed systems

    Connectors and dependencies often dominate failure rates in agentic systems.

    • Apply per-tool timeouts, backoff with jitter, circuit breakers, and concurrency limits.
    • Version tool schemas and validate tool responses to prevent silent breakage when APIs change.

    6) Make retrieval predictable and observable

    Retrieval quality determines how grounded your application will be. Treat it like a versioned data product with coverage metrics.

    • Track empty-retrieval rate, document freshness, and hit rate on labeled queries.
    • Ship index changes with canaries, so you know if something will fail before it fails.
    • Apply least-privilege access and redaction at the retrieval layer to reduce leakage risk.

    7) Build a production evaluation pipeline

    The later nines depend on finding rare failures quickly and preventing regressions.

    • Maintain an incident-driven golden set from production traffic and run it on every change.
    • Run shadow mode and A/B canaries with automatic rollback on SLI regressions.

    8) Invest in observability and operational response

    Once failures become rare, the speed of diagnosis and remediation becomes the limiting factor.

    • Emit traces/spans per step, store redacted prompts and tool I/O with strong access controls, and classify every failure into a taxonomy.
    • Use runbooks and “safe mode” toggles (disable risky tools, switch models, require human approval) for fast mitigation.

    9) Ship an autonomy slider with deterministic fallbacks

    Fallible systems need supervision, and production software needs a safe way to dial autonomy up over time.
    Treat autonomy as a knob, not a switch, and make the safe path the default.

    • Default to read-only or reversible actions; require explicit confirmation (or approval workflows) for writes and irreversible operations.
    • Build deterministic fallbacks: retrieval-only answers, cached responses, rules-based handlers, or escalation to human review when confidence is low.
    • Expose per-tenant safe modes: disable risky tools/connectors, force a stronger model, lower temperature, and tighten timeouts during incidents.
    • Design resumable handoffs: persist state, show the plan/diff, and let a reviewer approve and resume from the exact step with an idempotency key.

    Implementation sketch: a bounded step wrapper

    A small wrapper around each model/tool step converts unpredictability into policy-driven control: strict validation, bounded retries, timeouts, telemetry, and explicit fallbacks.

    def run_step(name, attempt_fn, validate_fn, *, max_attempts=3, timeout_s=15):
        # trace all retries under one span
        span = start_span(name)
        for attempt in range(1, max_attempts + 1):
            try:
                # bound latency so one step can't stall the workflow
                with deadline(timeout_s):
                    out = attempt_fn()
                # gate: schema + semantic + business invariants
                validate_fn(out)
                # success path
                metric("step_success", name, attempt=attempt)
                return out
            except (TimeoutError, UpstreamError) as e:
                # transient: retry with jitter to avoid retry storms
                span.log({"attempt": attempt, "err": str(e)})
                sleep(jittered_backoff(attempt))
            except ValidationError as e:
                # bad output: retry once in "safer" mode (lower temp / stricter prompt)
                span.log({"attempt": attempt, "err": str(e)})
                try:
                    out = attempt_fn(mode="safer")
                    validate_fn(out)
                    return out
                except ValidationError:
                    pass  # fall through; fallback fires once attempts are exhausted
        # fallback: keep system safe when retries are exhausted
        metric("step_fallback", name)
        return EscalateToHuman(reason=f"{name} failed")

    Why enterprises insist on the later nines

    Reliability gaps translate into business risk. McKinsey’s 2025 global survey reports that 51% of organizations using AI experienced at least one negative consequence, and nearly one-third reported consequences tied to AI inaccuracy. These outcomes drive demand for stronger measurement, guardrails, and operational controls.

    Closing checklist

    • Pick a top workflow, define its completion SLO, and instrument terminal status codes.
    • Add contracts + validators around every model output and tool input/output.
    • Treat connectors and retrieval as first-class reliability work (timeouts, circuit breakers, canaries).
    • Route high-impact actions through higher assurance paths (verification or approval).
    • Turn every incident into a regression test in your golden set.

    The nines arrive through disciplined engineering: bounded workflows, strict interfaces, resilient dependencies, and fast operational learning loops.

    Nikhil Mungel has been building distributed systems and AI teams at SaaS companies for more than 15 years.
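    The bounded-retry pattern in the implementation sketch can also be exercised end to end with a toy, self-contained version. All names below (jittered_backoff, flaky, validate, the string status codes) are simplified stand-ins for illustration, not production helpers.

```python
import random
import time

class ValidationError(Exception):
    pass

class UpstreamError(Exception):
    pass

def jittered_backoff(attempt: int, base: float = 0.01) -> float:
    # Exponential backoff with jitter to avoid synchronized retry storms.
    return base * (2 ** attempt) * random.random()

def run_step(name, attempt_fn, validate_fn, *, max_attempts=3):
    """Bounded retries around one model/tool step, with an explicit fallback."""
    for attempt in range(1, max_attempts + 1):
        try:
            out = attempt_fn()
            validate_fn(out)                 # gate: schema/semantic checks
            return ("ok", out, attempt)      # success path
        except (UpstreamError, ValidationError):
            time.sleep(jittered_backoff(attempt))
    return ("escalate", None, max_attempts)  # deterministic fallback

# A flaky step: fails once with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise UpstreamError("rate limited")
    return {"status": "done"}

def validate(out):
    if out.get("status") != "done":
        raise ValidationError("bad status")

print(run_step("demo", flaky, validate))  # -> ('ok', {'status': 'done'}, 2)
```

    The shape of the return value is the important part: every step terminates in a bounded, classifiable state (“ok” or “escalate”), which is what makes workflow-level SLOs measurable.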

  • Anthropic launches Claude Marketplace, giving enterprises access to Claude-powered tools from Replit, GitLab, Harvey and more
    on March 7, 2026 at 12:25 am

    San Francisco startup Anthropic continues to ship new AI products and services at a blistering pace, despite a messy ongoing dispute with the U.S. Department of War.

    Today, the company announced Claude Marketplace, a new offering that lets enterprises with an existing Anthropic spend commitment apply part of it toward tools and applications powered by Anthropic’s Claude models but made and offered by external partners including GitLab, Harvey, Lovable, Replit, Rogo and Snowflake.

    According to Anthropic’s Claude Marketplace FAQ, the program is designed to simplify procurement and consolidate AI spend. Anthropic says the Marketplace is now in limited preview and that enterprises interested in using it should reach out to their Anthropic account team to get started.

    For customers interested in the Marketplace, Anthropic says purchases made through it “count against a portion of your existing Anthropic commitment,” and that the company will manage invoicing for partner spend — meaning enterprises can use part of their existing Anthropic commitment to buy Claude-powered partner solutions without separately handling partner invoicing. In effect, Anthropic is positioning Claude Marketplace as a more centralized way for enterprises to procure certain Claude-powered partner tools.

    Yet for many users, the whole point of Anthropic’s Claude Code and Claude Cowork applications was that they could shift enterprise spend and time away from current third-party software-as-a-service (SaaS) apps and instead “vibe code” new solutions or bespoke, AI-powered workflows. This idea is so pervasive that prior Claude integrations have on several recent occasions caused a major selloff in SaaS stocks after investors thought Claude could threaten the underlying companies and applications.
    Claude Marketplace seems to push against that idea, suggesting current SaaS apps are still valuable and perhaps even more useful and appealing to enterprises with Claude integrated into them. The launch raises a broader question about how enterprises will choose to use Claude: directly through Anthropic’s own products and APIs, or through third-party applications that embed Claude for more specialized workflows.

    Tool integration

    Model and chat platforms have always sought to offer integrations, aiming to cut the time users spend building their own versions of apps. OpenAI added third-party apps into ChatGPT and launched a new App Directory in December 2025. This brought in offerings from companies such as Canva, Expedia and Figma that users can invoke with “@” mentions while prompting the chatbot. However, three months in, it’s unclear exactly how many people use ChatGPT Apps, particularly in enterprises — will Claude’s Marketplace be able to achieve more success here, given rising enterprise adoption of Claude and Anthropic products? ChatGPT’s integrated apps focused on retail and individual consumer tasks rather than the enterprise more broadly, but the company has also tried to appeal to that market with new plugins for ChatGPT released alongside its new GPT-5.4 this week.

    Other AI tool marketplaces have also cropped up. Lightning AI launched an AI Hub last year following similar moves from AWS and Hugging Face. Many AI marketplaces, such as Salesforce’s, focus on surfacing AI agents that may already have the capabilities customers need. How does Anthropic’s solution stand out from these? Asked for comment, a spokesperson responded:

    “Claude is a model — it reasons, writes, analyzes, and codes. But Harvey isn’t just Claude with a legal prompt. It’s a purpose-built platform built for how legal teams actually work — with the domain expertise, workflow integrations, compliance infrastructure, and institutional knowledge that enterprises require.
    Same with Rogo for finance, Snowflake for enterprise data, or GitLab for software development. These partners have spent years building the product layer on top of Claude that makes it useful for specific industries and workflows. That’s actually the point. Thousands of businesses use Claude to power their products — and the best ones have built something Claude alone can’t replicate. Claude Marketplace isn’t Anthropic trying to replace those products. It’s Anthropic investing in them — making it easier for enterprises to access the best Claude-powered tools without managing a separate procurement process for each one. Claude is the intelligence layer. Our partners are the product.”

    Native vs app

    Enterprise users have adapted their Claude or ChatGPT platforms to recognize preferences, connect to their data sources and retain context. Much of how people use enterprise AI these days centers on customizability, on making the system work for their needs. Platforms like OpenClaw have also let people set up autonomous agents with full access to their computers to complete tasks and execute workflows. In other words, Claude and other platforms can already do much of the work that these new third-party Marketplace tools enable — provided they have the right context and data.

    However, third-party tools and integrations let enterprise users avoid doing that work themselves and instead invoke an existing tool to handle it. For those whose businesses are built around specific, tool-based workflows, the Marketplace may be exactly the right AI integration. There’s also a good chance that enterprises already paying for Claude will use the new Marketplace to explore third-party tools and services they wouldn’t have otherwise.
    While it’s still unclear what Claude Marketplace will look like in action, it’s possible that, with these tools, enterprises could use Claude as an orchestrator, with the platform acting as a command center that taps the right tool and accesses the right context without constant prompting. Observers noted that Claude Marketplace offers enterprises a way to “pre-approve” apps, bypassing the often long and cautious approval process. Some noted that Anthropic’s move tracks with how many businesses want to work directly with the platforms without requiring users to move to separate offerings.

    Anthropic’s biggest challenge with Claude Marketplace, however, is adoption. Many of its launch partners already have enterprise customers who deploy their tools through an API or connect via MCP or other protocols for context. Some users may have already vibe-coded apps that tap into these integrations. It’s now a matter of enterprise users showing they want to use these new tools within their Claude workflows.

AWS News Blog Announcements, Updates, and Launches

    Feed has no items.