Arena-Hard-Auto-v0.1 (See Paper) is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena.
We prompt GPT-4-Turbo as a judge to compare each model's responses against those of a baseline model (default: GPT-4-0314).
Notably, Arena-Hard-Auto has the highest correlation with Chatbot Arena and the highest separability among popular open-ended LLM benchmarks (See Paper).
If you are curious to see how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.
Although Arena-Hard-Auto and Chatbot Arena Category Hard (See Blog) both use a similar pipeline to select hard prompts, Arena-Hard-Auto uses an automatic judge as a cheaper and faster approximation of human preference.
Check out the BenchBuilder folder for code and resources on how we curate Arena-Hard-Auto.
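To make the idea of scoring against a baseline more concrete, here is a minimal sketch of how pairwise judge verdicts could be turned into a percentage score with a bootstrapped 95% confidence interval. It is a simplified illustration, not the scoring code used by Arena-Hard-Auto: the verdict labels (`win`, `tie`, `loss`), the function names, and the plain win-rate aggregation are all assumptions made for this example.

```python
import random

# Hypothetical verdict labels for this sketch; the real judge output may differ.
VERDICT_POINTS = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rate(verdicts):
    """Return a model's score against the baseline as a percentage (ties count half)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return 100.0 * sum(VERDICT_POINTS[v] for v in verdicts) / len(verdicts)


def bootstrap_ci_offsets(verdicts, rounds=1000, seed=0):
    """Return the 95% CI as (lower, upper) offsets from the point score.

    Resamples the verdicts with replacement and takes the 2.5th and 97.5th
    percentiles of the resampled scores.
    """
    rng = random.Random(seed)
    score = win_rate(verdicts)
    samples = sorted(
        win_rate(rng.choices(verdicts, k=len(verdicts))) for _ in range(rounds)
    )
    lower = samples[int(0.025 * (rounds - 1))]
    upper = samples[int(0.975 * (rounds - 1))]
    return round(lower - score, 1), round(upper - score, 1)
```

Under these assumptions, a baseline judged against itself (every verdict a tie) scores 50.0 with an interval of (0.0, 0.0), which is the shape of the gpt-4-0314 row in the leaderboard.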
Content
- Style Control Leaderboard
- Leaderboard
- Install
- Evaluation
- Style Control Guide
Style Control Leaderboard
Following the newly introduced Style Control on Chatbot Arena, we release Style Control on Arena-Hard-Auto! We employ the same Style Control methods as proposed in the blog post.
Please refer to the blog post for methodology and technical background.
(Updated: 10/14)
claude-3-5-sonnet-20240620 | score: 82.0 | 95% CI: (-1.6, 2.2) | average #tokens: 567
o1-preview-2024-09-12 | score: 81.6 | 95% CI: (-2.4, 2.2) | average #tokens: 1193
o1-mini-2024-09-12 | score: 79.2 | 95% CI: (-2.6, 2.4) | average #tokens: 1399
gpt-4-turbo-2024-04-09 | score: 74.4 | 95% CI: (-2.5, 2.1) | average #tokens: 662
gpt-4-0125-preview | score: 73.5 | 95% CI: (-2.4, 1.8) | average #tokens: 619
gpt-4o-2024-08-06 | score: 71.0 | 95% CI: (-2.5, 2.8) | average #tokens: 594
llama-3.1-nemotron-70b-instruct | score: 70.9 | 95% CI: (-3.3, 3.3) | average #tokens: 869
gpt-4o-2024-05-13 | score: 69.9 | 95% CI: (-2.5, 2.3) | average #tokens: 696
athene-70b | score: 67.7 | 95% CI: (-3.2, 2.2) | average #tokens: 685
yi-lightning | score: 67.1 | 95% CI: (-2.3, 2.8) | average #tokens: 875
llama-3.1-405b-instruct | score: 66.8 | 95% CI: (-2.6, 1.9) | average #tokens: 658
claude-3-opus-20240229 | score: 65.5 | 95% CI: (-2.3, 2.5) | average #tokens: 541
yi-large-preview | score: 65.0 | 95% CI: (-2.4, 2.0) | average #tokens: 720
gpt-4o-mini-2024-07-18 | score: 64.2 | 95% CI: (-2.7, 2.9) | average #tokens: 668
qwen2.5-72b-instruct | score: 63.4 | 95% CI: (-2.5, 2.7) | average #tokens: 821
mistral-large-2407 | score: 63.1 | 95% CI: (-2.6, 3.1) | average #tokens: 623
gemini-1.5-pro-api-0514 | score: 62.4 | 95% CI: (-2.7, 2.1) | average #tokens: 676
glm-4-0520 | score: 61.3 | 95% CI: (-3.3, 3.0) | average #tokens: 636
yi-large | score: 59.3 | 95% CI: (-3.1, 2.2) | average #tokens: 626
deepseek-coder-v2 | score: 58.2 | 95% CI: (-2.6, 2.8) | average #tokens: 578
glm-4-0116 | score: 54.1 | 95% CI: (-2.5, 2.5) | average #tokens: 622
llama-3.1-70b-instruct | score: 51.6 | 95% CI: (-2.5, 2.7) | average #tokens: 628
glm-4-air | score: 50.4 | 95% CI: (-1.8, 2.5) | average #tokens: 619
gpt-4-0314 | score: 50.0 | 95% CI: (0.0, 0.0) | average #tokens: 423
claude-3-sonnet-20240229 | score: 49.7 | 95% CI: (-2.0, 2.6) | average #tokens: 552
gpt-4-0613 | score: 49.6 | 95% CI: (-2.5, 2.7) | average #tokens: 354
qwen2-72b-instruct | score: 49.5 | 95% CI: (-2.4, 2.4) | average #tokens: 515
gemma-2-27b-it | score: 47.4 | 95% CI: (-2.8, 2.8) | average #tokens: 577
gemini-1.5-pro-api-0409-preview | score: 46.8 | 95% CI: (-2.8, 2.7) | average #tokens: 478
mistral-large-2402 | score: 45.5 | 95% CI: (-2.5, 2.1) | average #tokens: 400
claude-3-haiku-20240307 | score: 45.3 | 95% CI: (-2.3, 3.1) | average #tokens: 505
llama-3-70b-instruct | score: 44.3 | 95% CI: (-2.2, 3.5) | average #tokens: 591
mixtral-8x22b-instruct-v0.1 | score: 44.0 | 95% CI: (-2.9, 2.9) | average #tokens: 430
qwen1.5-72b-chat | score: 39.7 | 95% CI: (-2.1, 2.2) | average #tokens: 474
gemini-1.5-flash-api-0514 | score: 39.7 | 95% CI: (-2.5, 2.4) | average #tokens: 642
mistral-next | score: 39.6 | 95% CI: (-2.2, 2.5) | average #tokens: 297
mistral-medium | score: 39.0 | 95% CI: (-2.4, 3.3) | average #tokens: 485
phi-3-medium-4k-instruct | score: 38.7 | 95% CI: (-2.1, 2.6) | average #tokens: 517
command-r-plus | score: 37.3 | 95% CI: (-2.3, 1.6) | average #tokens: 541
claude-2.0 | score: 36.7 | 95% CI: (-2.2, 2.6) | average #tokens: 295
claude-2.1 | score: 35.1 | 95% CI: (-2.9, 2.5) | average #tokens: 290
gpt-3.5-turbo-0613 | score: 34.9 | 95% CI: (-2.2, 3.0) | average #tokens: 401
gpt-3.5-turbo-0125 | score: 34.7 | 95% CI: (-2.3, 2.7) | average #tokens: 329
phi-3-small-8k-instruct | score: 33.6 | 95% CI: (-2.6, 2.3) | average #tokens: 568
gemma-2-9b-it | score: 33.3 | 95% CI: (-2.7, 2.8) | average #tokens: 541
gpt-3.5-turbo-1106 | score: 33.0 | 95% CI: (-2.4, 2.9) | average #tokens: 285
dbrx-instruct-preview | score: 32.0 | 95% CI: (-2.5, 2.4) | average #tokens: 415
internlm2-20b-5-chat | score: 30.2 | 95% CI: (-2.2, 2.5) | average #tokens: 576
mixtral-8x7b-instruct-v0.1 | score: 29.8 | 95% CI: (-2.0, 2.1) | average #tokens: 457
gpt-3.5-turbo-0314 | score: 29.4 | 95% CI: (-2.8, 2.1) | average #tokens: 334
starling-lm-7b-beta | score: 26.0 | 95% CI: (-2.4, 2.2) | average #tokens: 530
snowflake-arctic-instruct | score: 25.9 | 95% CI: (-2.6, 1.8) | average #tokens: 365
gemini-1.0-pro | score: 24.9 | 95% CI: (-2.1, 2.4) | average #tokens: 322
command-r | score: 23.4 | 95% CI: (-1.9, 1.8) | average #tokens: 432
snorkel-mistral-pairrm-dpo | score: 21.8 | 95% CI: (-2.2, 1.9) | average #tokens: 564
yi-34b-chat | score: 21.8 | 95% CI: (-2.2, 2.0) | average #tokens: 611
internlm2-20b-chat | score: 21.1 | 95% CI: (-1.9, 1.3) | average #tokens: 667
llama-3-8b-instruct | score: 19.7 | 95% CI: (-1.6, 1.8) | average #tokens: 585
llama-3.1-8b-instruct | score: 18.2 | 95% CI: (-1.8, 2.0) | average #tokens: 861
tulu-2-dpo-70b | score: 18.0 | 95% CI: (-1.7, 1.8) | average #tokens: 550
starling-lm-7b-alpha | score: 16.4 | 95% CI: (-1.5, 1.5) | average #tokens: 483
phi-3-mini-128k-instruct | score: 16.1 | 95% CI: (-1.5, 1.9) | average #tokens: 609
mistral-7b-instruct | score: 15.2 | 95% CI: (-2.0, 1.5) | average #tokens: 541
llama-2-70b-chat | score: 13.4 | 95% CI: (-1.5, 1.7) | average #tokens: 595
vicuna-33b | score: 11.7 | 95% CI: (-1.9, 1.7) | average #tokens: 451
gemma-1.1-7b-it | score: 11.6 | 95% CI: (-1.4, 1.2) | average #tokens: 341
gemma-7b-it | score: 7.0 | 95% CI: (-1.1, 1.0) | average #tokens: 378
gemma-1.1-2b-it | score: 3.5 | 95% CI: (-0.6, 0.7) | average #tokens: 316
gemma-2b-it | score: 2.9 | 95% CI: (-0.5, 0.6) | average #tokens: 369
For more information click here.
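If you want to work with the leaderboard rows programmatically, the sketch below parses one row in the format shown above into a typed record. The `Row` type and `parse_row` function are hypothetical names for this example and are not part of the repository. It also assumes that the 95% CI values are offsets relative to the score (which the baseline row, 50.0 with (0.0, 0.0), suggests) and converts them to absolute bounds.

```python
import re
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float
    ci_low: float   # absolute lower bound, assuming the CI is an offset from score
    ci_high: float  # absolute upper bound, same assumption
    avg_tokens: int


# Matches: "<model> | score: <x> | 95% CI: (<lo>, <hi>) | average #tokens: <n>"
ROW_RE = re.compile(
    r"^\s*(?P<model>[^\s|]+)\s*\|\s*score:\s*(?P<score>[\d.]+)\s*\|\s*"
    r"95% CI:\s*\(\s*(?P<lo>-?[\d.]+)\s*,\s*(?P<hi>-?[\d.]+)\s*\)\s*\|\s*"
    r"average #tokens:\s*(?P<tokens>\d+)\s*$"
)


def parse_row(line: str) -> Row:
    """Parse one leaderboard line; raise ValueError if it does not match."""
    match = ROW_RE.match(line)
    if match is None:
        raise ValueError(f"not a leaderboard row: {line!r}")
    score = float(match["score"])
    return Row(
        model=match["model"],
        score=score,
        ci_low=round(score + float(match["lo"]), 1),
        ci_high=round(score + float(match["hi"]), 1),
        avg_tokens=int(match["tokens"]),
    )


if __name__ == "__main__":
    example = "claude-3-5-sonnet-20240620 | score: 82.0 | 95% CI: (-1.6, 2.2) | average #tokens: 567"
    print(parse_row(example))
```

Keeping the absolute bounds alongside the score makes it easy to check whether two models' intervals overlap before reading too much into a small score gap.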