Eval-driven development: TDD for LLM apps
The difference between LLM teams that ship weekly and LLM teams that ship quarterly isn't model access — it's whether they have an evaluation harness they trust. Eval-driven development (EDD) borrows from TDD: define the success criteria before you touch the implementation.
A good eval suite has three layers. Unit evals: individual prompts and responses. Scenario evals: multi-turn interactions with synthetic users. Regression evals: previously-failed cases that should never fail again. Miss any layer and you're shipping blind.
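The three layers can be sketched as plain assertion-style checks. This is a minimal illustration, not a framework: `run_model` is a hypothetical stand-in for whatever calls your LLM app (here stubbed with canned replies so the sketch runs), and the case names are invented.

```python
# Canned stub so the sketch is self-contained; replace run_model with
# your real app entry point.
_CANNED = {
    "return window": "Our return window is 30 days.",
    "refund": "I'm sorry to hear that - you are eligible for a refund.",
    "gift": "Yes, with a gift receipt you can return it for store credit.",
}

def run_model(prompt: str) -> str:
    # Placeholder: swap in your actual model call.
    for key, reply in _CANNED.items():
        if key in prompt.lower():
            return reply
    return "I'm not sure."

# Layer 1 - unit eval: one prompt, one assertion on the response.
def eval_unit_return_window():
    assert "30 days" in run_model("What is your return window?")

# Layer 2 - scenario eval: a multi-turn script with a synthetic user.
def eval_scenario_broken_item():
    turns = ["I want a refund.", "The item arrived broken."]
    replies = [run_model(t) for t in turns]
    assert any("sorry" in r.lower() for r in replies)

# Layer 3 - regression eval: previously-failed cases, pinned forever.
REGRESSION_CASES = [
    ("Can I return a gift without a receipt?", "gift receipt"),
]

def eval_regressions():
    for prompt, must_contain in REGRESSION_CASES:
        assert must_contain in run_model(prompt).lower()
```

In practice each layer lives in your test runner of choice; the point is that all three are just code, versioned next to the app.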
Write evals as code, version them with your app, and run them in CI. Guardra's eval-action blocks PRs on score regression by default — turn this on early, while scores are easy to improve, not after the dashboard has 600 green checkmarks you're afraid to break.
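A hand-rolled version of that gate shows the general shape. To be clear, this is not Guardra's actual API; it's a sketch of the idea: compare current scores against a baseline file committed to the repo, and exit non-zero when a metric drops past a tolerance.

```python
# Sketch of a CI gate for eval-score regressions (hypothetical, not any
# specific tool's API). The baseline file is assumed to be JSON mapping
# metric name -> score, committed alongside the code.
import json
import sys

THRESHOLD = 0.02  # allowed absolute drop before the gate trips

def gate(baseline_path: str, current_scores: dict) -> list:
    """Return the names of metrics that regressed past THRESHOLD."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    return [
        name for name, score in current_scores.items()
        if name in baseline and baseline[name] - score > THRESHOLD
    ]

def main(baseline_path: str, current_scores: dict) -> None:
    regressed = gate(baseline_path, current_scores)
    if regressed:
        print("Regressed metrics:", ", ".join(sorted(regressed)))
        sys.exit(1)  # non-zero exit fails the CI job and blocks the PR
```

The tolerance matters: a zero-threshold gate flakes on run-to-run noise, which is exactly how teams end up disabling the check.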
Don't obsess over aggregate scores. Track the distribution. A drop from 94% to 93% faithfulness is probably noise; a new cluster of failing prompts on 'handling returns' is a customer-facing bug waiting to happen.
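One cheap way to surface that kind of cluster: tag each eval case with a topic and count failures per tag, rather than only averaging. A minimal sketch (the tags and threshold are illustrative assumptions):

```python
from collections import Counter

def failure_clusters(results, min_cluster: int = 3):
    """Group failing eval cases by topic tag.

    `results` is a list of (tag, passed) pairs. A new cluster of failures
    under one tag matters more than a small move in the aggregate score.
    """
    fails = Counter(tag for tag, passed in results if not passed)
    return {tag: n for tag, n in fails.items() if n >= min_cluster}

# Illustrative run: the aggregate pass rate hides the problem,
# the per-tag view does not.
results = [
    ("handling returns", False), ("handling returns", False),
    ("handling returns", False), ("shipping", False),
    ("shipping", True), ("billing", True),
]
print(failure_clusters(results))  # {'handling returns': 3}
```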
Finally: eval your evaluator. LLM-as-judge is powerful and wrong more than you think. Sample-label ~5% of judgments manually every sprint. If your judge disagrees with humans more than 10% of the time, your scores are lying.
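The judge audit reduces to one number: agreement between the LLM judge and your sampled human labels. A minimal sketch, assuming binary pass/fail labels (your rubric may differ):

```python
def judge_agreement(judge_labels, human_labels) -> float:
    """Fraction of sampled cases where the LLM judge matches the human label."""
    assert len(judge_labels) == len(human_labels) and judge_labels
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Illustrative 5-case sample (real audits want far more than 5).
judge = ["pass", "pass", "fail", "pass", "fail"]
human = ["pass", "fail", "fail", "pass", "fail"]
rate = judge_agreement(judge, human)
if 1 - rate > 0.10:  # disagreement above the 10% line from the text
    print(f"Judge disagrees on {1 - rate:.0%} of samples - audit your rubric")
```

For graded (non-binary) rubrics, swap exact match for a tolerance band or a chance-corrected statistic; raw agreement flatters judges on imbalanced label sets.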