Wahrwelt Notes

❯

❯

❯

Automated benchmarks

Automated benchmarks

May 13, 20261 min read

Automated Benchmarks

Testing verifiable outputs programmatically.

Example: arithmetic questions like 1 + 1 = 2.

Pros

Verifiable.
Scales at low cost.
Consistent: can be run many times.
Understandable.
Datasets are available and can be iterated on to improve quality.

Cons

Many general questions cannot be put into a verifiable format.
If the dataset is public, it can be contaminated.

Format Tips

For some LLMs, the output may need to append the system prompt. Need to identify which ones.
For some LLMs, following the user-and-assistant chat template may be helpful.

Graph View

Automated Benchmarks
Pros
Cons
Format Tips

Created with Quartz v0.1.0 © 2026

Main
RSS