Automated Benchmarks
Testing verifiable outputs programmatically.
Example: arithmetic questions like 1 + 1 = 2.
Pros
- Verifiable.
- Scales at low cost.
- Consistent: can be run many times.
- Understandable.
- Datasets are available and can be iterated on to improve quality.
Cons
- Many general questions cannot be put into a verifiable format.
- If the dataset is public, it can be contaminated.
Format Tips
- For some LLMs, the output may need to append the system prompt. Need to identify which ones.
- For some LLMs, following the user-and-assistant chat template may be helpful.