Consistency Metrics (pass@k)

Measure not only capability (whether a task can be solved at least once) but also reliability (whether it is solved consistently across repeated runs).
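
As a minimal sketch of how the two notions can be separated, the snippet below assumes each task was run n times and c of those runs succeeded. It computes the commonly used unbiased pass@k estimate (probability that at least one of k sampled runs succeeds) alongside a stricter all-k variant (probability that all k sampled runs succeed, sometimes written pass^k). The function names `pass_at_k` and `pass_all_k` are hypothetical and chosen for illustration; they are not taken from this document or from any specific library.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Assumed input contract: n runs in total, c successes, k runs sampled.
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Capability: chance that at least one of k runs, drawn without
    replacement from the n recorded runs, succeeded."""
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k runs failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_all_k(n: int, c: int, k: int) -> float:
    """Reliability: chance that all k runs, drawn without replacement
    from the n recorded runs, succeeded."""
    _check(n, c, k)
    # comb(c, k) is 0 when fewer than k runs succeeded.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c = 10, 5  # illustrative numbers, not measured results
    for k in (1, 2, 5):
        print(k, round(pass_at_k(n, c, k), 4), round(pass_all_k(n, c, k), 4))
```

With these illustrative numbers (10 runs, 5 successes), both estimates equal 0.5 at k = 1, but at k = 2 the at-least-one estimate rises to about 0.78 while the all-k estimate falls to about 0.22. The widening gap as k grows is what distinguishes a system that can succeed from one that succeeds dependably.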