Automated Benchmarks

Testing verifiable outputs programmatically.

Example: arithmetic questions like 1 + 1 = 2.

Pros

  • Verifiable.
  • Scales at low cost.
  • Consistent: can be run many times.
  • Understandable.
  • Datasets are available and can be iterated on to improve quality.

Cons

  • Many general questions cannot be put into a verifiable format.
  • If the dataset is public, it can be contaminated.

Format Tips

  • For some LLMs, the output may need to append the system prompt. Need to identify which ones.
  • For some LLMs, following the user-and-assistant chat template may be helpful.