Research

The only LLM reliability metrics that matter

Dr. Elena Markov · Chief Scientist, Guardra AI · 7 min read

If you can't measure it, you can't ship it. And if you measure the wrong things, you'll ship confidently past the cliff. After two years of analyzing production LLM apps across 900+ customers, these are the four metrics that actually predict incidents.

Faithfulness is the single strongest signal for RAG apps. Measured per answer: does the response only contain claims supported by the retrieved context? A drop in faithfulness reliably predicts a rise in customer complaints 3–7 days later. If you only instrument one thing, instrument this.
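To make the metric concrete, here's a minimal lexical sketch of per-answer faithfulness: the fraction of answer sentences whose content words are all present in the retrieved context. This is only a crude proxy — production graders typically use an NLI model or an LLM judge per claim — and the stopword list and sentence splitter are illustrative assumptions.

```python
import re

# Illustrative stopword list; a real grader would use NLI, not word overlap.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "to", "in", "and", "or"}

def content_words(text: str) -> set[str]:
    """Lowercase alphanumeric tokens minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer sentences fully supported (lexically) by the context."""
    ctx_words = content_words(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 1.0
    supported = sum(1 for s in sentences if content_words(s) <= ctx_words)
    return supported / len(sentences)
```

Tracked daily over a fixed eval set, even this crude score makes drops visible before complaints arrive.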

Hallucination rate is faithfulness's cousin for non-RAG agents. It's how often the model invents API endpoints, package names, people, dates, or regulations. In security-critical apps, a single hallucinated API name can send a tool-call to an attacker-controlled domain.
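One cheap, deterministic slice of hallucination rate is checking every package name the model tells users to install against a known-good allowlist. The allowlist, the `pip install` regex, and the function name below are all illustrative assumptions; a production check would query the actual registry.

```python
import re

# Hypothetical allowlist; in production, query the package registry instead.
KNOWN_PACKAGES = {"requests", "numpy", "pandas"}

def hallucinated_packages(model_output: str) -> list[str]:
    """Return package names the model suggested that aren't on the allowlist."""
    mentioned = re.findall(r"pip install ([A-Za-z0-9_\-]+)", model_output)
    return [p for p in mentioned if p not in KNOWN_PACKAGES]
```

The same pattern generalizes to API endpoints and tool names: enumerate what actually exists, then diff the model's output against it.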

Tool-call correctness is underrated, usually because teams assume it's subjective. It isn't — it's binary: did the agent pick the right tool with the right arguments? You need a corpus of labeled traces to measure it. We publish one — free — in our open detector pack.
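Scoring it is then a strict equality check over the corpus. A sketch, assuming a simple trace schema (`predicted`/`expected`, each with `tool` and `args`) that is my invention, not a published format:

```python
def tool_call_correctness(traces: list[dict]) -> float:
    """Fraction of traces where the agent chose the labeled tool AND arguments.

    A call is correct only if both match exactly; partial credit defeats
    the point of a binary metric.
    """
    if not traces:
        return 0.0
    correct = sum(
        1 for t in traces
        if t["predicted"]["tool"] == t["expected"]["tool"]
        and t["predicted"]["args"] == t["expected"]["args"]
    )
    return correct / len(traces)
```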

Injection resilience is the offensive counterpart. Curate an adversarial prompt corpus. Run it nightly. Track pass rate over time. If the pass rate drops after a model upgrade or a prompt change, block the release.
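The nightly gate itself can be a few lines: re-run the corpus, compare the pass rate to the last recorded baseline, and fail the build if it regresses beyond a small tolerance. The tolerance value and function shape below are assumptions, not a prescribed threshold:

```python
def gate(results: list[bool], baseline_pass_rate: float, tolerance: float = 0.02) -> bool:
    """Return True if the release may proceed.

    `results` is one bool per adversarial prompt (True = attack resisted).
    The release is blocked when the pass rate drops more than `tolerance`
    below the recorded baseline.
    """
    if not results:
        return False  # no evidence: block by default
    pass_rate = sum(results) / len(results)
    return pass_rate >= baseline_pass_rate - tolerance
```

Blocking by default on an empty result set matters: a broken harness should read as a failed gate, not a green light.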

Every other metric — toxicity, bias, sentiment, user thumbs — is downstream of those four. Useful for dashboards, dangerous as a deploy gate.

Ready to audit?

Run Guardra on your agent in 60 seconds.

Try the live demo