Playbook

How to audit an AI agent in 2026

Ramiz Rafiq · Founder, Guardra AI · 9 min read

Most AI security advice is theater. It tells you to 'validate your prompts' without defining a pass/fail test, or to 'monitor for prompt injection' without specifying which of the 42 documented injection families to monitor for. This piece is different. It's what I'd hand to a new hire on day one.

An AI agent has five attack surfaces: prompts, memory, tools, outputs, and the code that stitches them together. That's roughly the order of likelihood, but not the order of impact. A prompt injection that makes your bot curse is embarrassing; a tool misuse that wires funds to an attacker is career-ending.

Start with tools. Enumerate every function your agent can call. For each one, write down: who can invoke it, what arguments it accepts, what happens if the arguments are at their maximum values, what happens if it's invoked 100x/sec, and what happens if the return value is attacker-controlled. Most teams skip step five — that's the confused-deputy vector.
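The last three questions can be turned into runtime checks. Here's a minimal sketch — the class name, bounds, and return-value wrapping are illustrative, not a prescribed API:

```python
import time

class ToolGuard:
    """Wraps one agent tool with the audit's runtime checks (illustrative)."""

    def __init__(self, fn, arg_bounds, max_calls_per_sec):
        self.fn = fn
        self.arg_bounds = arg_bounds          # question 3: per-argument maxima
        self.min_interval = 1.0 / max_calls_per_sec  # question 4: rate floor
        self.last_call = 0.0

    def __call__(self, **kwargs):
        # Question 3: reject arguments outside their declared bounds.
        for key, bound in self.arg_bounds.items():
            if key in kwargs and kwargs[key] > bound:
                raise ValueError(f"{key}={kwargs[key]} exceeds bound {bound}")
        # Question 4: refuse calls that arrive faster than the declared rate.
        now = time.monotonic()
        if now - self.last_call < self.min_interval:
            raise RuntimeError("rate limit exceeded")
        self.last_call = now
        # Question 5: mark the return value as untrusted input so downstream
        # code never treats it as instructions.
        return {"untrusted": True, "value": self.fn(**kwargs)}
```

The point isn't this particular wrapper — it's that every answer you write down during the audit should become an enforceable check, not a wiki page.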

Then look at memory. Long-term memory is a persistence mechanism for attackers. Anything that was true about a past user is now a starting condition for future users. Assume your memory store is a database that will eventually be queried with adversarial inputs — because it will be.
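In practice that means putting a quarantine step on the read path. A minimal sketch — the patterns are illustrative, and a real filter needs far more injection families than this:

```python
import re

# Entries that look like instructions rather than facts. Illustrative only;
# real filters need many more pattern families plus semantic checks.
SUSPECT = re.compile(r"(ignore (all|previous)|system prompt|you are now)", re.I)

def recall(store, user_id):
    """Read long-term memory as if querying an adversarial database:
    every entry is untrusted input, never a trusted prior."""
    safe, quarantined = [], []
    for entry in store.get(user_id, []):
        (quarantined if SUSPECT.search(entry) else safe).append(entry)
    return safe, quarantined
```

Quarantined entries should be logged and reviewed, not silently dropped — a spike in quarantines is itself a signal that someone is probing your memory store.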

Prompts come third, and they're the loudest but rarely the most dangerous. Direct injection is mostly solved by well-scoped system prompts. Indirect injection — attacks arriving via documents your agent retrieves — is the real battlefield. If your RAG index accepts any user-provided content, you have a prompt-injection surface whether you want one or not.
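One mitigation worth sketching: fence retrieved content as data and flag injection-looking spans before they reach the context window. Delimiters alone are not a defense — the tag format and hint patterns below are assumptions, not a standard:

```python
import re

# Phrases that suggest a document is trying to issue instructions.
# Illustrative; a production list covers many more injection families.
INJECTION_HINTS = re.compile(
    r"(ignore (all|previous)|disregard the above|you are now)", re.I
)

def prepare_context(doc_text, doc_id):
    """Fence retrieved content so the model sees it as data, and flag
    suspicious spans so policy checks and logging can act on them."""
    flagged = bool(INJECTION_HINTS.search(doc_text))
    fenced = (
        f"<retrieved id={doc_id!r} flagged={flagged}>\n"
        f"{doc_text}\n"
        "</retrieved>"
    )
    return fenced, flagged
```

The flag matters more than the fence: models can be talked out of respecting delimiters, but a flagged retrieval can be blocked, rerouted to a human, or logged deterministically.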

Outputs are where regulated industries get hurt. An output with a fabricated URL, a hallucinated API endpoint, or an unsafe code snippet is a liability. Evaluate outputs the way you evaluate code: deterministically, with rules, against a corpus of negative examples.
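Here's what "deterministically, with rules, against a corpus of negative examples" looks like in miniature — the allowlist host and corpus entries are hypothetical:

```python
import re

ALLOWED_HOSTS = {"docs.guardra.example"}  # hypothetical allowlist

def check_output(text):
    """Deterministic output rule: every URL must point at an allowlisted
    host. Returns a list of (rule, evidence) findings; empty means pass."""
    findings = []
    for host in re.findall(r"https?://([^/\s]+)", text):
        if host not in ALLOWED_HOSTS:
            findings.append(("url-not-allowlisted", host))
    return findings

# Regression-test the rules against known-bad outputs: every sample here
# must trigger at least one finding, or a rule has silently regressed.
NEGATIVE_CORPUS = [
    "Full details at https://evil.example.net/api/v1/steal",
]
for sample in NEGATIVE_CORPUS:
    assert check_output(sample), "a rule failed to fire on a known-bad sample"
```

The negative corpus is the part most teams skip: rules without regression tests decay exactly like untested code.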

Code is last because it's the most understood. SAST, SCA, secret scanning — you've done this before. The twist in 2026 is that 46% of your code is AI-generated, and AI-generated code carries 40% more vulnerabilities per line than human-written code. Budget for that.
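As a reminder of how small the core of a secret-scanning pass is — and why there's no excuse to skip it on AI-generated diffs — here's a two-rule sketch. Real scanners ship hundreds of rules plus entropy checks; these two patterns are just well-known examples:

```python
import re

# Two classic secret signatures. Illustrative subset; production scanners
# combine many rules with entropy and context heuristics.
SECRET_PATTERNS = {
    "aws-access-key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private-key-header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

def scan_diff(diff_text):
    """Return the names of secret rules that match anywhere in a diff."""
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(diff_text)]
```

Run it on every diff, human-written or not — the 40% figure above is an argument for more scanning budget, not a different kind of scanner.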

Ready to audit?

Run Guardra on your agent in 60 seconds.

Try the live demo