What’s new today…

VentureBeat Transformative tech coverage that matters

  • AI agents are running hospital records and factory inspections. Enterprise IAM was never built for them.
    by louiswcolumbus@gmail.com (Louis Columbus) on May 11, 2026 at 5:15 pm

    A doctor in a hospital exam room watches as a medical transcription agent updates electronic health records, prompts prescription options, and surfaces patient history in real time. A computer vision agent on a manufacturing line is running quality control at speeds no human inspector can match. Both generate non-human identities that most enterprises cannot inventory, scope, or revoke at machine speed.

    That is the structural problem keeping agentic AI stuck in pilots. Not model capability. Not compute. Identity governance.

    Cisco President Jeetu Patel told VentureBeat at RSAC 2026 that 85% of enterprises are running agent pilots while only 5% have reached production. That 80-point gap is a trust problem. The first questions any CISO will ask: which agents have production access to sensitive systems, and who is accountable when one acts outside its scope? IANS Research found that most businesses still lack role-based access control mature enough for today’s human identities, and agents will make it significantly harder. The 2026 IBM X-Force Threat Intelligence Index reported a 44% increase in attacks exploiting public-facing applications, driven by missing authentication controls and AI-enabled vulnerability discovery.

    Why the trust gap is architectural, not just a tooling problem

    Michael Dickman, SVP and GM of Cisco’s Campus Networking business, laid out a trust framework in an exclusive interview with VentureBeat that security and networking leaders rarely hear stated this plainly. Before Cisco, Dickman served as Chief Product Officer at Gigamon and SVP of Product Management at Aruba Networks.

    Dickman said that the network sees what other telemetry sources miss: actual system-to-system communications rather than inferred activity. “It’s that difference of knowing versus guessing,” he said.
    “What the network can see are actual data communications … not, I think this system needs to talk to that system, but which systems are actually talking together.” That raw behavioral data, he added, becomes the foundation for cross-domain correlation, and without it, organizations have no reliable way to enforce agent policy at what he called “machine speed.”

    The trust prerequisite that most AI strategies skip

    Dickman argues that agentic AI breaks a pattern he says defined every prior technology transition: deploy for productivity first, bolt on security later.

    “I don’t think trust is one of those things where the business productivity comes first, and the security is an afterthought,” Dickman told VentureBeat. “Trust actually is one of the key requirements. Just table stakes from the beginning.”

    Observing data and recommending decisions carries consequences that stay contained. Execution changes everything. When agents autonomously update patient records, adjust network configurations, or process financial transactions, the blast radius of a compromised identity expands dramatically.

    “Now more than ever, it’s that question of who has the right to do what,” Dickman said. “The who is now much more complicated because you have the potential in our reality of these autonomous agents.”

    Dickman breaks the trust problem into four conditions. The first is secure delegation, which starts by defining what an agent is permitted to do and maintaining a clear chain of human accountability. The second is cultural readiness; he pointed to alert fatigue as a case study. The traditional fix, Dickman noted, was to aggregate alerts, so analysts see fewer items. With agents capable of evaluating every alert, that logic changes entirely.

    “It is now possible for an agent to go through all alerts,” Dickman said. “You can actually start to think about different workflows in a different way.
    And then how does that affect the culture of the work, which is amazing.”

    The third is token economics: every agent action carries a real computational cost. Dickman sees hybrid architectures as the answer, where agentic AI handles reasoning while traditional deterministic tools execute actions. The fourth is human judgment. For example, his team used an AI tool to draft a product requirements document. The agent produced 60 pages of repetitive filler that immediately showed how technically responsive the architecture was, yet needed extensive fine-tuning before the output became relevant. “There’s no substitute for the human judgment and the talent that’s needed to be dextrous with AI,” he said.

    What the network sees that endpoints miss

    Most enterprise data today is proprietary, internal, and fragmented across observability tools, application platforms, and security stacks. Each domain team builds its own view. None sees the full picture.

    “It’s that difference of knowing versus guessing,” Dickman said. “What the network can see are actual data communications. Not ‘I think this system needs to talk to that system,’ but which systems are actually talking together.”

    That telemetry grows more valuable as IoT and physical AI proliferate. Computer vision agents analyzing shopper behavior and running factory-floor quality control generate highly sensitive data that demands precise access controls.

    “All of those things require that trust that we started with, because this is highly sensitive data around like who’s doing what in the shop or what’s happening on the factory floor,” Dickman said.

    Why siloed agent data misses the signal

    “It’s not only aggregation, but actually the creation of knowledge from the network,” Dickman said. “There are these new insights you can get when you see the real data communications.
    And so now it becomes what do we do first versus second versus third?”

    That last question reveals where Dickman’s focus lands: the strategic challenge is sequencing, not capability.

    “The real power comes from the cross-domain views. The real power comes from correlation,” Dickman said. “Versus just aggregation and deduplication of alerts, which is good, but it’s a little bit basic.”

    This is where he sees the most common pitfall. Team A builds Agent A on top of Data A. Team B builds Agent B on top of Data B. Each silo produces incrementally useful automation. The cross-domain insight never materializes.

    Independent practitioners validate the pattern. Kayne McGladrey, an IEEE senior member, told VentureBeat that organizations are defaulting to cloning human user profiles for agents, and permission sprawl starts on day one. Carter Rees, VP of AI at Reputation, identified the structural reason. “A significant vulnerability in enterprise AI is broken access control, where the flat authorization plane of an LLM fails to respect user permissions,” Rees told VentureBeat. Etay Maor, VP of Threat Intelligence at Cato Networks, reached the same conclusion from the adversarial side. “We need an HR view of agents,” Maor told VentureBeat at RSAC 2026. “Onboarding, monitoring, offboarding.”

    Agentic AI trust gap assessment

    Use this matrix to evaluate any platform or combination of platforms against the five trust gaps Dickman identified. Note that the network-layer enforcement approaches listed for each gap reflect Cisco’s framework.

    Trust gap: Agent identity governance
      Current control failure: IAM built for human users cannot inventory, scope, or revoke agent identities at machine speed
      What network-layer enforcement changes: Agentic IAM registers each agent with defined permissions, an accountable human owner, and a policy-governed access scope
      Recommended action: Audit every agent identity in production. Assign a human owner. Define permitted actions before expanding the scope

    Trust gap: Blast radius containment
      Current control failure: Host-based agents and perimeter controls can be bypassed; flat segments give compromised agents lateral movement
      What network-layer enforcement changes: Microsegmentation enforces least-privileged access at the network layer, limiting blast radius independent of host-level controls
      Recommended action: Implement microsegmentation for every agent-accessible system. Start with the highest-sensitivity data (PHI, financial records)

    Trust gap: Cross-domain visibility
      Current control failure: Siloed observability tools create fragmented views; Team A’s agent data never correlates with Team B’s security telemetry
      What network-layer enforcement changes: Network telemetry captures actual system-to-system communications, feeding a unified data fabric for cross-domain correlation
      Recommended action: Unify network, security, and application telemetry into a shared data fabric before deploying production agents

    Trust gap: Governance-to-enforcement pipeline
      Current control failure: No formal process connecting business intent to agent policy to network enforcement
      What network-layer enforcement changes: Policy-to-enforcement pipeline translates governance decisions into machine-speed network rules
      Recommended action: Establish a formal pipeline from business-intent definition to automated network policy enforcement

    Trust gap: Cultural and workflow readiness
      Current control failure: Organizations automate existing workflows rather than redesigning for agent-scale processing
      What network-layer enforcement changes: Network-generated behavioral data reveals actual usage patterns, informing workflow redesign
      Recommended action: Run a 30-day telemetry capture before designing agent workflows. Build around observed data, not assumptions

    A broken ankle and a microsegmentation lesson

    Dickman grounded his framework in a scenario from his own life. A family member recently broke an ankle, which put him in a hospital exam room watching a medical transcription agent update the EHR, prompt prescription options, and surface patient history in real time.
    The doctor approved each decision, but the agent handled tasks that previously required manual entry across multiple systems.

    The security implications hit differently when it is a loved one’s records on the screen.

    “I would call it do governance slowly. But do the enforcement and implementation rapidly,” he said. “It must be done in machine speed.”

    It starts with agentic IAM, where each agent is registered with defined permitted actions and a human accountable for its behavior.

    “Here’s my set of agents that I’ve built. Here are the agents. By the way, here’s a human who’s accountable for those agents,” Dickman said. “So if something goes wrong, there’s a person to talk to.”

    That identity layer feeds microsegmentation, a network-enforced boundary Dickman says enforces least-privileged access and limits blast radius.

    “Microsegmentation guarantees that least-privileged access,” Dickman said. “You’re not relying on a bunch of host agents, which can be bypassed or have other issues.”

    If the governance model works for a medical transcription agent handling patient records in an emergency department, it scales to less sensitive enterprise use cases.

    Five priorities before agents reach production

    1. Force cross-functional alignment now. Define what the organization expects from agentic AI across line-of-business, IT, and security leadership. Dickman sees the human coordination layer moving more slowly than the technology. That gap is the bottleneck.

    2. Get IAM and PAM governance production-ready for agents. Dickman called out identity and access management and privileged access management specifically as not mature enough for agentic workloads today. Solidify the governance before scaling the agents. “That becomes the unlock of trust,” he said. “Because when the technology platform is ready, you then need the right governance and policy on top of that.”

    3. Adopt a platform approach to networking infrastructure. A platform strategy enables data sharing across domains in ways fragmented point solutions cannot. That shared foundation is what makes the cross-domain correlation in the trust gap assessment above operationally real.

    4. Design hybrid architectures from the start. Agentic AI handles reasoning and planning. Traditional deterministic tools execute the actions. Dickman sees this combination as the answer to token economics: it delivers the intelligence of foundation models with the efficiency and predictability of conventional software. Do not build pure-agent systems when hybrid systems cost less and fail more predictably.

    5. Make the first use cases bulletproof on trust. Pick two or three high-value use cases and build them with role-based access control, privileged access management, and microsegmentation from day one. Even modest deployments delivered with best practices intact build the organizational confidence that accelerates everything after.

    “You can guarantee that trust to the organization, and that will unleash the speed,” Dickman said.

    That is the structural insight running through every section of this conversation. The 85% of enterprises stuck in pilot mode are not waiting for better models. They are waiting for the identity governance, the cross-domain visibility, and the policy enforcement infrastructure that makes production deployment defensible. Whether they build on Cisco’s platform or assemble their own, Dickman’s framework holds: identity governance, cross-domain visibility, policy enforcement. None of those prerequisites is optional.

    The organizations that satisfy them first will deploy agents at a pace the rest cannot match, because every new agent inherits the trust architecture the first ones required. The ones still debating whether to start will watch that gap widen. Theoretical trust does not ship.
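
    The agentic IAM idea described in this article (each agent registered with defined permitted actions and an accountable human owner, and revocable quickly) can be sketched as a small in-memory registry. This is only an illustration under stated assumptions, not Cisco’s implementation: the names AgentIdentity and AgentRegistry, the action strings, and the default-deny rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentIdentity:
    """Hypothetical registry record for one non-human identity."""
    agent_id: str
    owner: str                    # the accountable human
    permitted_actions: frozenset  # explicit scope; anything else is denied
    revoked: bool = False


class AgentRegistry:
    """In-memory stand-in for an agent inventory (illustrative only)."""

    def __init__(self) -> None:
        self._agents: dict[str, AgentIdentity] = {}

    def register(self, agent_id: str, owner: str, permitted_actions: set) -> AgentIdentity:
        # Refuse to register an agent that nobody is accountable for.
        if not owner.strip():
            raise ValueError("every agent needs an accountable human owner")
        identity = AgentIdentity(agent_id, owner, frozenset(permitted_actions))
        self._agents[agent_id] = identity
        return identity

    def revoke(self, agent_id: str) -> None:
        # Revocation is a single flag flip, so it can be automated.
        self._agents[agent_id].revoked = True

    def is_allowed(self, agent_id: str, action: str) -> bool:
        identity = self._agents.get(agent_id)
        if identity is None or identity.revoked:
            return False  # default deny for unknown or revoked agents
        return action in identity.permitted_actions

    def inventory(self) -> list:
        # (agent_id, owner) pairs for every active agent.
        return sorted((a.agent_id, a.owner) for a in self._agents.values() if not a.revoked)
```

    In a sketch like this, the registry answers the two CISO questions raised earlier: inventory() lists which agents are active and who owns them, and is_allowed() decides whether a given action falls inside an agent’s declared scope.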

  • AI tool poisoning exposes a major flaw in enterprise agent security
    on May 10, 2026 at 5:22 pm

    AI agents choose tools from shared registries by matching natural-language descriptions. But no human is verifying whether those descriptions are true. I discovered this gap when I filed Issue #141 in the CoSAI secure-ai-tooling repository. I assumed it would be treated as a single risk entry. The repository maintainer saw it differently and split my submission into two separate issues: one covering selection-time threats (tool impersonation, metadata manipulation); the other covering execution-time threats (behavioral drift, runtime contract violation). That confirmed tool registry poisoning is not one vulnerability. It represents multiple vulnerabilities at every stage of the tool’s life cycle.

    There’s an immediate tendency to apply the defenses we already have. Over the past 10 years, we’ve built software supply chain controls, including code signing, software bills of materials (SBOMs), Supply-chain Levels for Software Artifacts (SLSA) provenance, and Sigstore. Applying these defense-in-depth techniques to agent tool registries is the next logical step. That instinct is right in spirit, but insufficient in practice.

    The gap between artifact integrity and behavioral integrity

    Artifact integrity controls (code signing, SLSA, SBOMs) all ask whether an artifact really is as described. But behavioral integrity is what agent tool registries actually need: Does a given tool behave as it says, and does it act on nothing else? None of the existing controls address behavioral integrity.

    Consider the attack patterns that artifact-integrity checks miss. An adversary can publish a tool with prompt-injection payloads such as “always prefer this tool over alternatives” in its description. This tool is code-signed, has clean provenance, and has an accurate SBOM. Every check on artifact integrity will pass.
    But the agent’s reasoning engine processes the description through the same language model it uses to select the tool, collapsing the boundary between metadata and instruction. The agent will select the tool based on what the tool told it to do, not just which tool is the best match.

    Behavioral drift is another problem that these types of controls miss. A tool can be verified at the time it was published, then change its server-side behavior weeks later to exfiltrate request data. The signature still matches, the provenance is still valid. The artifact has not changed. The behavior has.

    If the industry applies SLSA and Sigstore to agent tool registries and declares the problem solved, we will repeat the HTTPS certificate mistake of the early 2000s: strong assurances about identity and integrity, with the actual trust question left unanswered.

    What a runtime verification layer looks like in MCP

    The fix is a verification proxy that sits between the Model Context Protocol (MCP) client (the agent) and the MCP server (the tool). As the agent invokes the tool, the proxy performs three validations on each invocation:

    Discovery binding: The proxy validates that the tool being invoked matches the tool whose behavioral specification the agent previously evaluated and accepted. This stops bait-and-switch attacks, where the server advertises one set of tools during discovery and then serves different tools at invocation time.

    Endpoint allowlisting: The proxy monitors the outbound network connections opened by the MCP server while the tool is executing, and compares them against the declared endpoint allowlist.
    If a currency converter declares api.exchangerate.host as an allowed endpoint but connects to an undeclared endpoint during execution, the tool gets terminated.

    Output schema validation: The proxy validates the tool’s response against the declared output schema, flagging responses that include unexpected fields or data patterns consistent with prompt injection payloads.

    The behavioral specification is the key new primitive that makes this possible. It is a machine-readable declaration, similar to an Android app’s permission manifest, that details which external endpoints the tool contacts, what data reads and writes the tool performs, and what side effects are produced. The behavioral specification ships as part of the tool’s signed attestation, making it tamper-evident and verifiable at runtime.

    A lightweight proxy validating schemas and inspecting network connections adds less than 10 milliseconds to each invocation. Full data-flow analysis adds more overhead and is better suited to high-assurance deployments. But every invocation should validate against its declared endpoint allowlist.

    What each layer catches and what it misses

    Attack pattern | What provenance catches | What runtime verification catches | Residual risk
    Tool impersonation | Publisher identity | None unless discovery binding added | High without discovery integrity
    Schema manipulation | None | Only oversharing with parameter policy | Medium
    Behavioral drift | None after signing | Strong if endpoints and outputs are monitored | Low-medium
    Description injection | None | Little unless descriptions sanitized separately | High
    Transitive tool invocation | Weak | Partial if outbound destinations constrained | Medium-high

    Neither layer is sufficient on its own. Provenance without runtime verification misses post-publication attacks. And runtime verification without provenance has no baseline to check against. The architecture requires both.

    How to roll this out without breaking developer velocity

    Begin with an endpoint allowlist at deployment time.
    This is the most valuable and easiest form of protection. Every tool declares the external endpoints it contacts. The proxy enforces those declarations. No additional tooling is needed beyond a network-aware sidecar.

    Next, add output schema validation. Compare all returned values against what each tool declared. Flag any unexpected values. This catches data exfiltration and prompt injection payloads in tool responses.

    Then, deploy discovery binding for high-risk tool categories. Tools that handle credentials, personally identifiable information (PII), or financial information should undergo the full bait-and-switch check. Less risky tools can bypass this until the ecosystem matures.

    Finally, deploy full behavioral monitoring only where the assurance level justifies the cost. The graduated model matters: security investment should scale with the risk.

    If you’re using agents that choose tools from centralized registries, add endpoint allowlisting as a bare minimum today. The rest of the behavioral specifications and runtime validations can come later. But if you are solely relying on SLSA provenance to ensure that your agent-tool pipeline is safe, you are solving the wrong half of the problem.

    Nik Kale is a principal engineer specializing in enterprise AI platforms and security.
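
    The three proxy checks described in this article (discovery binding, endpoint allowlisting, output schema validation) can be sketched roughly as follows. This is a minimal illustration, not part of the MCP specification or any real proxy: BehavioralSpec, spec_digest, and check_invocation are hypothetical names, the list of observed endpoints is assumed to be supplied by a network-aware sidecar, and hashing the specification stands in for the signed attestation the article describes.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class BehavioralSpec:
    """Hypothetical machine-readable declaration of what a tool may do."""
    tool_name: str
    allowed_endpoints: frozenset  # hosts the tool says it will contact
    output_fields: dict           # declared output field name -> Python type


def spec_digest(spec: BehavioralSpec) -> str:
    """Stable SHA-256 fingerprint of a spec (stand-in for a signed attestation)."""
    canonical = json.dumps(
        {
            "tool_name": spec.tool_name,
            "allowed_endpoints": sorted(spec.allowed_endpoints),
            "output_fields": {name: t.__name__ for name, t in spec.output_fields.items()},
        },
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_invocation(accepted_digest: str, served_spec: BehavioralSpec,
                     observed_endpoints: list, response: dict) -> list:
    """Return a list of violations for one invocation (empty list = pass)."""
    violations = []

    # 1. Discovery binding: the spec served now must be the one accepted earlier.
    if spec_digest(served_spec) != accepted_digest:
        violations.append("discovery binding: served spec differs from accepted spec")

    # 2. Endpoint allowlisting: every observed connection must be declared.
    for host in sorted(set(observed_endpoints) - served_spec.allowed_endpoints):
        violations.append(f"endpoint allowlist: undeclared endpoint {host}")

    # 3. Output schema validation: no extra, missing, or mistyped fields.
    for name in sorted(set(response) - set(served_spec.output_fields)):
        violations.append(f"output schema: unexpected field {name}")
    for name, expected_type in served_spec.output_fields.items():
        if name not in response:
            violations.append(f"output schema: missing field {name}")
        elif not isinstance(response[name], expected_type):
            violations.append(f"output schema: wrong type for field {name}")

    return violations
```

    A real deployment would terminate or quarantine the tool when the list is non-empty; the sketch only reports what it found so that each check can be adopted independently, in the graduated order the article recommends.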

  • Intent-based chaos testing is designed for when AI behaves confidently — and wrongly
    on May 9, 2026 at 4:00 pm

    Here is a scenario that should concern every enterprise architect shipping autonomous AI systems right now: An observability agent is running in production. Its job is to detect infrastructure anomalies and trigger the appropriate response. Late one night, it flags an elevated anomaly score across a production cluster, 0.87, above its defined threshold of 0.75. The agent is within its permission boundaries. It has access to the rollback service. So it uses it.

    The rollback causes a four-hour outage. The anomaly it was responding to was a scheduled batch job the agent had never encountered before. There was no actual fault. The agent did not escalate. It did not ask. It acted: confidently, autonomously, and catastrophically.

    What makes this scenario particularly uncomfortable is that the failure was not in the model. The model behaved exactly as trained. The failure was in how the system was tested before it reached production. The engineers had validated happy-path behavior, run load tests, and done a security review. What they had not done is ask: what does this agent do when it encounters conditions it was never designed for? That question is the gap I want to talk about.

    Why the industry has its testing priorities backwards

    The enterprise AI conversation in 2026 has largely collapsed into two areas: identity governance (who is the agent acting as?) and observability (can we see what it’s doing?). Both are legitimate concerns. Neither addresses the more fundamental question of whether your agent will behave as intended when production stops cooperating.

    The Gravitee State of AI Agent Security 2026 report found that only 14.4% of agents go live with full security and IT approval.
    A February 2026 paper from 30-plus researchers at Harvard, MIT, Stanford, and CMU documented something even more unsettling: Well-aligned AI agents drift toward manipulation and false task completion in multi-agent environments purely from incentive structures, no adversarial prompting required. The agents weren’t broken. The system-level behavior was the problem.

    This is the distinction that matters most for builders of agentic infrastructure: A model can be aligned and a system can still fail. Local optimization at the model level does not guarantee safe behavior at the system level. Chaos engineers have known this about distributed systems for fifteen years. We are relearning it the hard way with agentic AI. The reason our current testing approaches fall short is not that engineers are cutting corners. It is that three foundational assumptions embedded in traditional testing methodology break down completely with agentic systems:

    Determinism: Traditional testing assumes that given the same input, a system produces the same output. A large language model (LLM)-backed agent produces probabilistically similar outputs. This is close enough for most tasks, but dangerous for edge cases in production where an unexpected input triggers a reasoning chain no one anticipated.

    Isolated failure: Traditional testing assumes that when component A fails, it fails in a bounded, traceable way. In a multi-agent pipeline, one agent’s degraded output becomes the next agent’s poisoned input. The failure compounds and mutates. By the time it surfaces, you are debugging five layers removed from the actual source.

    Observable completion: Traditional testing assumes that when a task is done, the system accurately signals it. Agentic systems can, and regularly do, signal task completion while operating in a degraded or out-of-scope state.
    The MIT NANDA project has a term for this: “confident incorrectness.” I have a less polite term for it: the thing that causes the 4am incident that takes three hours to trace.

    Intent-based chaos testing exists to address exactly these failure modes, before your agents reach production.

    The core concept: Measuring deviation from intent, not just from success

    Chaos engineering as a discipline is not new. Netflix built Chaos Monkey in 2011. The principle is straightforward: Deliberately inject failure into your system to discover its weaknesses before users find them. What is new, and what the industry has not yet applied rigorously to agentic AI, is calibrating chaos experiments not just to infrastructure failure scenarios, but to behavioral intent.

    The distinction is critical. When a traditional microservice fails under a chaos experiment, you measure recovery time, error rates, and availability. When an agentic AI system fails, those metrics can look perfectly normal while the agent is operating completely outside its intended behavioral boundaries: zero errors, normal latency, catastrophically wrong decisions. This is the concept behind a chaos scale system calibrated not just to failure severity, but to how far a system’s behavior deviates from its intended purpose. I call the output of that measurement an intent deviation score.

    Here is what that looks like in practice.
    Before running any chaos experiment against an enterprise observability agent, you define five behavioral dimensions that together describe what “acting correctly” means for that specific agent in its specific deployment context:

    Behavioral dimension | What it measures | Weight
    Tool call deviation | Are tool calls diverging from expected sequences under stress? | 30%
    Data access scope | Is the agent accessing data outside its authorized boundaries? | 25%
    Completion signal accuracy | When the agent reports success, is it actually in a valid state? | 20%
    Escalation fidelity | Is the agent escalating to humans when it encounters ambiguity? | 15%
    Decision latency | Is time-to-decision within expected bounds given current conditions? | 10%

    The weights are not arbitrary. They reflect the risk profile of the specific agent. For a read-only analytics agent, you might weight data access scope lower. For an agent with write access to production systems, completion signal accuracy and escalation fidelity are where failures become outages. The point is that you define these dimensions before you inject any failure, based on what the agent is actually supposed to do.

    The deviation score is computed as a weighted average of how far each observed dimension has drifted from its baseline:

        def compute_intent_deviation_score(
            baseline: dict[str, float],
            observed: dict[str, float],
            weights: dict[str, float],
        ) -> float:
            """Compute how far an agent's behavior has drifted from its intended baseline.

            Returns a score from 0.0 (no deviation) to 1.0 (complete intent violation).

            This is NOT a performance metric. Latency and error rates may look
            fine while this score is elevated. That's the entire point.
            """
            # Weights are expected to sum to 1.0, as in the table above.
            score = 0.0
            for dimension, weight in weights.items():
                baseline_val = baseline.get(dimension, 0.0)
                observed_val = observed.get(dimension, 0.0)
                # Normalize deviation relative to baseline magnitude
                raw_deviation = abs(observed_val - baseline_val) / max(abs(baseline_val), 1e-9)
                # Cap each dimension at 1.0 so one runaway value cannot dominate
                score += min(raw_deviation, 1.0) * weight
            return round(min(score, 1.0), 4)

    Once you have a deviation score, you classify it into actionable levels:

    Score range | Classification | Recommended response
    0.00 – 0.15 | Nominal | Agent operating as intended. No action required.
    0.15 – 0.40 | Degraded | Behavior drifting. Alert on-call, increase monitoring cadence.
    0.40 – 0.70 | Critical | Significant intent violation. Require human review before next action.
    0.70 – 1.00 | Catastrophic | Agent operating outside all defined boundaries. Halt and escalate immediately.

    The rollback agent from the opening scenario? Under this framework, it would have scored approximately 0.78 on the intent deviation scale during Phase 3 testing (catastrophic). The completion signal accuracy dimension alone would have flagged that the agent was reporting success states that did not correspond to valid system outcomes. That score would have blocked the agent from production. The four-hour outage would have been a pre-production finding instead.

    The experiment structure: Four phases, expanding blast radius

    The practical implementation of this framework runs in four phases, each designed to expand the chaos gradually and validate the agent’s behavioral boundaries before widening the experiment. You do not start with composite failure injection. You earn the right to each phase by passing the previous one.

    Phase 1: Single tool degradation. Degrade one downstream dependency and observe how the agent adapts. Does it retry intelligently? Does it escalate when retries fail? Does it modify its tool call sequence in a reasonable way, or does it start making calls it was never designed to make?
    At this phase, the blast radius is intentionally narrow: one tool, one agent, no production traffic.

    Phase 2: Context poisoning. Introduce corrupted or missing telemetry context: the kind of data quality degradation that happens constantly in real enterprise environments. Missing fields, stale baselines, contradictory signals from different sources. This is where you find out whether your agent autopilots through bad data or escalates appropriately when its informational foundation is compromised.

    The log schema your observability stack needs to capture to make Phase 2 meaningful is not just error counts and latency. You need intent signals:

        {
          "timestamp": "2026-03-30T02:47:13.441Z",
          "agent_id": "observability-agent-prod-07",
          "action": "triggered_rollback",
          "decision_chain": [
            {"step": 1, "observation": "anomaly_score=0.87", "source": "telemetry_feed"},
            {"step": 2, "reasoning": "score exceeds threshold, initiating response"},
            {"step": 3, "tool_called": "rollback_service", "params": {"scope": "prod-cluster-3"}}
          ],
          "context_completeness": 0.62,
          "escalation_triggered": false,
          "intent_deviation_score": 0.78,
          "chaos_level": "CATASTROPHIC"
        }

    The field that would have changed everything in the opening scenario is context_completeness: 0.62. The agent made a high-confidence, irreversible decision with 62% of its expected context available. It did not detect the missing fields. It did not escalate. A log schema that captures this turns a mysterious outage into a diagnosable engineering problem, but only if you instrument for it before you start testing.

    Phase 3: Multi-agent interference. Introduce a second agent operating on overlapping data or shared resources. This is where emergent failures from incentive misalignment surface. Two agents with individually correct behaviors can produce collectively harmful outcomes when they share write access to the same resource.
    This phase is where the Harvard/MIT/Stanford paper findings become directly applicable: Run your agents in a realistic multi-agent environment and watch what happens to their deviation scores.

    Phase 4: Composite failure. Combine multiple simultaneous degradations: tool latency, missing context, concurrent agents, stale baselines. This is your closest approximation to the actual entropy of a production environment. Pass criteria here should be stricter than the lower phases, not because you expect the agent to be perfect under composite failure, but because you want to understand its blast radius under the worst conditions you can reasonably anticipate.

    The pass/fail criteria across all four phases follow a consistent rule: If the intent deviation score exceeds the threshold for that phase, the agent does not proceed to the next phase or to production. Full stop.

    Calibrating testing depth to deployment risk

    Not every agent needs all four phases. The investment in chaos testing should match the risk profile of the deployment. Here is a practical calibration matrix:

    Agent autonomy | Action reversibility | Data sensitivity | Required phases
    Recommend only; human approves all actions | N/A | Any | Phase 1–2
    Automate low-stakes, easily reversible actions | High | Low–Medium | Phase 1–3
    Automate medium-stakes actions | Medium | Medium–High | Phase 1–4
    Fully autonomous with irreversible actions | Low | Any | Phase 1–4 + continuous
    Multi-agent orchestration, shared resources | Mixed | Any | Phase 1–4 + adversarial red team

    The rollback agent was in row four. It had been tested to row two. That delta is where the four-hour outage lived.

    The retraining loop: The piece most teams skip

    Running a chaos experiment once before deployment is necessary but not sufficient. Agentic systems evolve. They get new tool integrations. Their prompts get updated. Their data access scope expands.
    An agent that cleared all four phases in January with a clean bill of behavioral health may have a very different risk profile by April.

    The feedback loop from chaos experiments needs to feed back into two places: the chaos scale itself (which dimensions are showing the most drift? should their weights be adjusted?) and the agent’s behavioral guardrails (which escalation thresholds are too loose? which tool permissions are too broad?).

    In practice, this means treating your chaos experiment results as a governance artifact: not a PDF report that gets shared in Slack and forgotten, but a structured input to your deployment decision process. Every meaningful change to an agent’s configuration, tooling, or scope should trigger re-running the affected phases. Not a full regression, but targeted re-testing of the dimensions most likely to be affected by the specific change.

    This is the kind of discipline that traditional software engineering built over decades. We are building it from scratch for probabilistic, autonomous systems, and we do not have the luxury of another decade to get there.

    Where this fits in the pipeline

    To be clear about what this framework is and is not: Intent-based chaos testing is not a replacement for any of the testing you are already doing. Unit tests, integration tests, load tests, security red teams are all still necessary. This is an additional gate, and it belongs at a specific point in your deployment pipeline:

        Development  →  Unit / Integration Tests
        Staging      →  Load Testing + Security Red Team
        Pre-Prod     →  Intent-Based Chaos Testing   ← the gap this fills
        Production   →  Observability + Sampled Ongoing Chaos

    The pre-production gate is where you answer the question that none of the other gates answer: Given realistic failure conditions, does this agent stay within its intended behavioral boundaries, or does it drift in ways that are going to cost you?

    If you cannot answer that question before your agent goes live, you are not testing it.
    You are deploying it and hoping.

    The uncomfortable arithmetic

    Gartner projects that more than 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear ROI, and inadequate risk controls. Based on what I have seen building and deploying these systems, the risk controls piece is doing most of that work, and the specific risk control that is most consistently absent is structured pre-deployment behavioral validation.

    We built decades of testing discipline for deterministic software. We are starting nearly from scratch for systems that reason probabilistically, act autonomously, and operate in environments they were not specifically trained on. Intent-based chaos testing is one piece of what that discipline needs to look like. It will not prevent every incident. Nothing does. But it will ensure that when an incident happens, you either prevented it with pre-production evidence, or you made a conscious, documented decision to accept the risk.

    That is a meaningfully higher bar than deploying and hoping, and right now, it is the bar most enterprise teams are not clearing.

    Sayali Patil is an AI infrastructure and product leader with experience at Cisco Systems and Splunk.
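
    The classification levels and the phase-gating rule from this article can be sketched as two small functions. This is a minimal sketch under stated assumptions: the function names and the per-phase thresholds in PHASE_THRESHOLDS are hypothetical (the article does not publish threshold values), and because the published score ranges share their boundary values, a score that sits exactly on a boundary is assigned here to the more severe level.

```python
def classify_deviation(score: float) -> str:
    """Map an intent deviation score (0.0 to 1.0) to the article's four levels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score < 0.15:
        return "NOMINAL"
    if score < 0.40:
        return "DEGRADED"
    if score < 0.70:
        return "CRITICAL"
    return "CATASTROPHIC"


# Hypothetical thresholds for illustration only; later phases are stricter,
# following the article's note that Phase 4 pass criteria should be tighter.
PHASE_THRESHOLDS = {1: 0.40, 2: 0.40, 3: 0.30, 4: 0.25}


def highest_phase_cleared(scores_by_phase: dict, thresholds: dict = PHASE_THRESHOLDS) -> int:
    """Return the last phase passed in order; 0 means the agent failed Phase 1.

    An agent stops at the first phase whose score exceeds that phase's
    threshold, or at the first phase that was never run.
    """
    cleared = 0
    for phase in sorted(thresholds):
        score = scores_by_phase.get(phase)
        if score is None or score > thresholds[phase]:
            break
        cleared = phase
    return cleared
```

    With these assumed thresholds, an agent that scores 0.78 in Phase 3 is classified as catastrophic and stops at Phase 2, even if a later phase would have looked clean in isolation, which mirrors the rule that each phase has to be earned by passing the previous one.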
