Engineer security into every phase of the agent lifecycle.
Securing agentic AI requires proactive measures across system autonomy, interconnected components, and evolving capabilities. The joint guidance prescribes practices for four lifecycle phases. Skip a phase and you load the next phase's controls beyond their design point.
Designing secure agents
Security begins at the design stage. Anticipate risks and integrate mitigations into system architecture before development and deployment.
Controlled context
Inserting tool output and memory into an LLM context massively expands the attack surface — particularly for prompt-injection. Treat every context fragment as a trust-tagged input.
- Structure the prompt with a clear instruction hierarchy so behaviour aligns with intended priorities and constraints.
- Ground generations using retrieval-augmented generation (RAG) and prompt engineering to reduce hallucinations.
- Tag every context source with a trust tier; gate decisions on tier (e.g. never let untrusted web content trigger writes).
Oversight mechanisms
Agents take action without explicit human approval. Bake monitoring and human-in-the-loop into the design — not as an aftermarket bolt-on.
- Mechanisms that prevent low-risk-approved agents from progressing into higher-risk activities autonomously.
- Human control points throughout the workflow: live monitoring, interruption during execution, mandatory approval at decision steps, audit and reversibility after execution.
- Explicit control flows that bound autonomous planning so agents cannot deviate beyond authorised objectives or actions.
Identity management
Each agent should be a distinct, cryptographically anchored principal — not a service account shared between processes.
- Embed strong identity using managed identity services, decentralised identifiers, or PKI.
- Authenticate every inter-agent and agent-to-service API call with mutual TLS (mTLS) for non-repudiation.
- Maintain a trusted registry binding identities to roles; reconcile against the live agent set on a schedule.
- Deny access for any agent or key not in the registry.
- Apply role-based identity management; minimum scope for approved tasks.
- Enforce identity-based boundaries: agents can only invoke actions explicitly authorised for their identity.
Defence in depth
No single mechanism should be load-bearing. Failures in one layer must be caught by the next.
- Multiple, overlapping layers of security controls; no reliance on a single mechanism.
- Controls at every information boundary: user inputs, tool calls, data pre-processing, model inference, outputs.
- Separate agents per function; strict boundaries and operational controls on agent-to-agent handoffs.
Source: Careful adoption of agentic AI services, co-authored by ASD's ACSC, CISA, NSA, Canadian Cyber Centre, NCSC-NZ, NCSC-UK (pp. 14–16).
Developing secure agents
Standard LLM training is necessary but insufficient. Mitigating agent-specific risks requires specialised techniques to harden behaviour against adversarial conditions.
Comprehensive testing
Expose the model to security abuse during supervised training so it learns to recognise and respond to undesirable behaviours.
- Reward modelling and adversarial testing for specification gaming; security constraints alongside performance goals.
- Train in simulated, controlled environments to learn action consequences without real harm.
- Synthetic adversarial training data reflecting real-world deployment scenarios.
- Active learning on high-uncertainty inputs to discover unexpected behaviours efficiently.
Appropriate evaluation
Agentic systems require evaluation that goes beyond LLM benchmarks.
- Threat-model-driven evaluation scenarios, including edge cases beyond typical training conditions.
- Best-of-N sampling, multistep reasoning prompts, inference-time scaling to draw out the full range of agent behaviour.
- Evaluate at varying levels of autonomy and resource access (tools, models, web search, code execution).
- Vary contextual conditions: presence of other agents, evaluation timing.
- Capability evaluations continuously across the development lifecycle.
Input management
Strong input controls partially mitigate many common LLM-app risks; for agents they're table-stakes.
- Robust validation and sanitisation of all agent inputs.
- Prompt injection filters and semantic analysis to detect malicious instructions.
- Context validation: confirm the system correctly interprets intent before execution.
Red teaming
Adversarial assessment is the only way to estimate how an agent fails when adversaries help it fail.
- Sandbox environments to test agent behaviour before production.
- Red-team exercises that target loopholes and unintended behaviours.
- Capability elicitation to probe for emergent abilities, especially resource-risk capabilities.
- Multi-agent red teaming and chaos testing in agent simulations.
Resilience
Plan for graceful degradation. The blast radius of unexpected behaviour must be containable by design.
- Fail-safe defaults and containment that limit blast radius of unexpected behaviours.
- Data-loss-prevention controls tuned to AI agent behaviours.
- Versioning and rollback to safely revert to known-good agent behaviour.
Accountability
The system must produce comprehensive artefacts that document actions and decisions.
- Comprehensive artefact logging by default.
- Unified audit logs of all inter-agent interactions.
- Interpretability tools to expose the reasoning behind decisions.
- Information referencing in every response: cite which retrieval, tool, version produced each claim.
Manage third-party components
Third parties are how flexibility — and supply-chain risk — enters the system.
- Verify all external components originate from trusted sources and are up to date.
- Maintain a trusted registry of third-party components.
- Reference CISA's A Shared Vision of SBOM for Cybersecurity and 2025 Minimum Elements for SBOM when procuring agentic systems.
- Restrict tool use to an approved allow-list of tools and versions, regularly verified.
- Verify that agent tool-usage behaviour aligns with documented security policies.
- Log tool usage in human-readable format.
- Trigger-action protocols that automatically restrict agent permissions on unexpected behaviour.
- Codify separation of duties: roles like Orchestrator, Reader, Actuator with clear boundaries, consensus mechanisms, and delegation expiry.
- Consensus controls: multi-agent approval for moderate-stakes actions; HITL plus multi-agent consensus for high-stakes.
- Prohibit agents from modifying their own privileges or initiating unapproved delegation without explicit expiry timers and recorded grant chains.
- Standardise tool descriptions in a consistent format that avoids persuasive language.
Source: Careful adoption of agentic AI services, co-authored by ASD's ACSC, CISA, NSA, Canadian Cyber Centre, NCSC-NZ, NCSC-UK (pp. 16–18).
Deploying agents securely
Adding an agent to an existing system materially changes its threat picture. High-impact deployment controls reduce vulnerabilities before they become incidents.
Threat modelling
Up-to-date risk taxonomies make threat modelling actionable — stale ones make it theatre.
- Realistic threat modelling using OWASP GenAI Security Project and MITRE ATLAS™.
- Controls that address emerging and evolving agent capabilities.
- Harmonise with Zero Trust principles and NIST SP 800-207 Zero Trust Architecture.
- Develop and test incident response procedures for agent compromise.
- Regular third-party reviews of privileged architectures; share actionable intelligence with trusted partners.
Governance
Autonomous action requires governance that authorises every action, not just the agent.
- Maintain governance policies for autonomous agents.
- Define legal accountability and risk ownership for agentic systems.
- Build organisational AI literacy.
- Reference CISA's Principles for the Secure Integration of AI in Operational Technology for OT environments.
Progressive deployment
Start narrow. Earn autonomy by demonstrating safety. Roll back when evaluation slips.
- Phased deployment with progressively increasing access and autonomy: restricted APIs, sandboxing.
- Graduated autonomy: incrementally increase agent independence while maintaining human oversight.
- Continuous evaluation determines when to expand scope or roll back autonomy.
Secure by default
Defaults are the most consequential decision in any system. Pick them so degradation is graceful.
- Fail-safe by default: agents stop and escalate on uncertainty.
- Error handling and failover to reduce impact of failures.
- Graceful degradation: maintain partial functionality when components fail.
Guardrails and constraints
Constraints reduce exposure to common AI security risks at the cost of an acceptable amount of generality.
- Specify clear, constrained objectives with explicit do-not-do rules.
- Hard constraints: deny lists, API-level safety policies.
- Declarative safety contracts agents cannot override.
- Layered guardrails: anomaly detection, rule-based filtering, ML-based prohibited-behaviour detection.
- Prioritise human review of high-risk incidents (guardrail triggers, denied actions).
- Secondary agent that validates new tasks against policy before execution.
Isolation
Where you can't prevent a failure, you contain it.
- Isolation and segmentation to limit blast radius.
- Separate high-risk agents into distinct domains.
- Isolate agents into enclaves with no write access to logs.
Source: Careful adoption of agentic AI services, co-authored by ASD's ACSC, CISA, NSA, Canadian Cyber Centre, NCSC-NZ, NCSC-UK (pp. 18–20).
Operating agents securely
Operations is where most failures actually happen. Continuous monitoring, output validation, human checkpoints, and tight credential discipline are the difference between a useful agent and a quiet incident.
Monitoring and auditing
Monitor internal processes — not just inputs and outputs. Cross-validate with multiple independent monitors.
- Tools that enhance human oversight of agent operations.
- Monitor all agent operations including internal reasoning steps.
- Monitor and log identity and privilege changes; audit for drift, impersonation, misconfiguration.
- Monitor outputs and behaviour for bias, data drift, and other anomalies — including prompts, tool calls, memory interactions, internal reasoning, decisions, actions.
- Real-time monitoring with comprehensive logs of live behaviour and decision-making.
- Runtime anomaly detection using rules or behavioural baselines; alert and pause on anomaly.
- Anomaly detectors that flag discrepancies between stated intentions and observed behaviour.
- Multiple independent monitoring systems that cross-validate agent reports and system logs.
- Goal-drift monitoring: compare active objectives against approved baselines before execution.
- Source checks integrated with logs: which tools were used, which information retrieved.
- Auditing combining human review with automated log analysis.
- Adaptive defences using monitoring data to drive rapid responses (patching from log signals).
- Storage-efficient logging that preserves critical information.
- Regular security assessments including pen-testing and red-team exercises targeting agentic behaviours.
Validate outputs
Outputs are one of the few concrete signals available — make them carry real verification weight.
- Validate agent outputs against multiple sources for critical aspects.
- Run redundant validator agents that cross-check each other's outputs.
- Validate tool responses to prevent malicious or unsafe instructions; standardise tool descriptions to avoid persuasive language.
Human-in-the-loop
HITL is not bureaucracy — it's the brake on irreversible mistakes. The agent should not decide when to use it.
- Human-approval requirements set by designers/operators, never delegated to the agent.
- Block agents from autonomously executing high-impact actions without prior human approval.
- HITL checkpoints for high-cost-of-error actions: system resets, network egress, deletion of critical records.
- Quarantine requests to delete logs or audit records until reviewed and approved.
- Clearly assign responsibility and accountability for adverse outcomes.
- Risk-classify agent actions by impact, likelihood, reversibility — apply safeguards accordingly.
Performance monitoring
Degraded or unusual performance can indicate compromise. Treat performance signals as security signals.
- Assess agents' ability to evade security measures, especially in sensitive systems.
- Regular assessments of an agent's ability to bypass safeguards (communication barriers, guardrails, monitors, HITL, input filters).
- Use evaluation results to validate existing controls and design stronger measures.
- Limit resource usage with rate-limits to interrupt long-running tasks and disrupt malicious workflows.
Privileges and authentication
Strict ongoing privilege management is what keeps a buggy agent from becoming a catastrophic agent.
- Limit privileges to the minimum required for the task.
- Restrict scope of privileges to the narrowest possible level for fine-grained control.
- Implement agent reputation and trust scoring; reduce trust on anomalous behaviour.
- Just-in-time credentials for high-impact or privileged actions.
- Verify API caller identity against user/agent groups.
- Authenticate agents with fresh cryptographic proofs before every privileged call.
- Cryptographic signing for authorised commands and instructions.
- Cryptographic integrity checks for task definitions and constraints.
- Cryptographic attestation: agents must prove they are running expected, unmodified code.
- Continuously verify identity and authorisation at runtime via a centralised PDP for each request.
Source: Careful adoption of agentic AI services, co-authored by ASD's ACSC, CISA, NSA, Canadian Cyber Centre, NCSC-NZ, NCSC-UK (pp. 21–23).