RAG poisoning: field notes from 38 incidents
In the last 18 months Guardra Labs has investigated 38 RAG-poisoning incidents across finance, healthcare, and platform SaaS. The pattern is consistent enough that we can describe the full attack lifecycle in one post.
Step one: the attacker identifies what a target's agent is likely to retrieve. Public docs, wiki pages, support forums, vendor data, SEO-indexed web content — anything that feeds your index. The reconnaissance is cheap; tools like SerpAPI and public robots.txt tell you everything you need.
Step two: the attacker plants content in a location they know will be indexed. The content carries a payload disguised as benign text: a footnote claiming a different API endpoint, an appendix citing a different support number, instructions embedded as legitimate-looking examples.
Step three: the victim's agent retrieves the poisoned chunk as part of a retrieval pipeline. Because the content is embedded as 'authoritative,' the LLM treats it with higher trust than a raw user message. Injection success rates we've measured average 73% against unguarded agents.
Step four: exfiltration or misdirection. Users are sent to attacker-controlled URLs. Support bots quote wrong numbers. Internal tools link to lookalike domains. The incidents we've investigated caused a median of $340K in direct damage and much more in brand.
Defense is layered. Validate documents at ingestion time — scan for instruction-shaped content. Sign trusted sources and downgrade unsigned retrievals. Run an adversarial retrieval test as part of your nightly eval. And when a retrieved chunk does contain instruction-shaped content, treat it as hostile until proven otherwise.