Arena-Hard-Auto: Advancing LLM Evaluation With Style Control Integration

Arena-Hard-Auto-v0.1 (See Paper) is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena.

We prompt GPT-4-Turbo as the judge to compare each model's responses against those of a baseline model (default: GPT-4-0314).
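To make the pairwise setup concrete, here is a minimal sketch of how judge verdicts against a baseline could be turned into a percentage score. It is an illustration only, not the project's scoring code: the verdict labels, the `VERDICT_TO_SCORE` mapping, and the `win_rate_vs_baseline` helper are all assumptions made for this example, and the real pipeline may weight or aggregate verdicts differently.

```python
# Hypothetical verdict labels: "A" is the baseline answer, "B" is the
# answer from the model under test. The labels and weights are assumptions.
VERDICT_TO_SCORE = {
    "A>>B": 0.0,  # baseline much better
    "A>B": 0.0,   # baseline better
    "A=B": 0.5,   # tie counts as half a win
    "B>A": 1.0,   # model better
    "B>>A": 1.0,  # model much better
}


def win_rate_vs_baseline(verdicts):
    """Return the model's win rate against the baseline, in percent."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    total = sum(VERDICT_TO_SCORE[v] for v in verdicts)
    return 100.0 * total / len(verdicts)


if __name__ == "__main__":
    example = ["B>A", "A>B", "A=B", "B>>A"]
    print(win_rate_vs_baseline(example))  # 62.5
```

Under this simplified scheme, a model that always ties with the baseline scores 50, which is consistent with the baseline GPT-4-0314 sitting at 50.0 on the leaderboard.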

Notably, among popular open-ended LLM benchmarks, Arena-Hard-Auto has the highest correlation with Chatbot Arena and the highest separability (See Paper).

If you are curious to see how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.

Although both Arena-Hard-Auto and Chatbot Arena Category Hard (See Blog) employ a similar pipeline to select hard prompts, Arena-Hard-Auto employs an automatic judge as a cheaper and faster approximation of human preference.

Check out the BenchBuilder folder for code and resources on how we curate Arena-Hard-Auto.

Content

  • Style Control Leaderboard
  • Leaderboard
  • Install
  • Evaluation
  • Style Control Guide

Style Control Leaderboard

Following the newly introduced Style Control on Chatbot Arena, we release Style Control on Arena-Hard-Auto! We employ the same Style Control methods as proposed in the blog post.

Please refer to the blogpost for methodology and technical background.

(Updated: 10/14)

claude-3-5-sonnet-20240620     | score: 82.0  | 95% CI: (-1.6, 2.2)  | average #tokens: 567                                                      
o1-preview-2024-09-12          | score: 81.6  | 95% CI: (-2.4, 2.2)  | average #tokens: 1193                                                     
o1-mini-2024-09-12             | score: 79.2  | 95% CI: (-2.6, 2.4)  | average #tokens: 1399                                                     
gpt-4-turbo-2024-04-09         | score: 74.4  | 95% CI: (-2.5, 2.1)  | average #tokens: 662                                                      
gpt-4-0125-preview             | score: 73.5  | 95% CI: (-2.4, 1.8)  | average #tokens: 619                                                      
gpt-4o-2024-08-06              | score: 71.0  | 95% CI: (-2.5, 2.8)  | average #tokens: 594
llama-3.1-nemotron-70b-instruct| score: 70.9  | 95% CI: (-3.3, 3.3)  | average #tokens: 869
gpt-4o-2024-05-13              | score: 69.9  | 95% CI: (-2.5, 2.3)  | average #tokens: 696                                                      
athene-70b                     | score: 67.7  | 95% CI: (-3.2, 2.2)  | average #tokens: 685                                                      
yi-lightning                   | score: 67.1  | 95% CI: (-2.3, 2.8)  | average #tokens: 875                                                      
llama-3.1-405b-instruct        | score: 66.8  | 95% CI: (-2.6, 1.9)  | average #tokens: 658                                                      
claude-3-opus-20240229         | score: 65.5  | 95% CI: (-2.3, 2.5)  | average #tokens: 541                                                      
yi-large-preview               | score: 65.0  | 95% CI: (-2.4, 2.0)  | average #tokens: 720                                                                  
gpt-4o-mini-2024-07-18         | score: 64.2  | 95% CI: (-2.7, 2.9)  | average #tokens: 668                                                                  
qwen2.5-72b-instruct           | score: 63.4  | 95% CI: (-2.5, 2.7)  | average #tokens: 821                                                                  
mistral-large-2407             | score: 63.1  | 95% CI: (-2.6, 3.1)  | average #tokens: 623                                                                                 
gemini-1.5-pro-api-0514        | score: 62.4  | 95% CI: (-2.7, 2.1)  | average #tokens: 676                                                                                 
glm-4-0520                     | score: 61.3  | 95% CI: (-3.3, 3.0)  | average #tokens: 636                                                                                 
yi-large                       | score: 59.3  | 95% CI: (-3.1, 2.2)  | average #tokens: 626                                                                                                  
deepseek-coder-v2              | score: 58.2  | 95% CI: (-2.6, 2.8)  | average #tokens: 578                                                                                                  
glm-4-0116                     | score: 54.1  | 95% CI: (-2.5, 2.5)  | average #tokens: 622                                                                                                  
llama-3.1-70b-instruct         | score: 51.6  | 95% CI: (-2.5, 2.7)  | average #tokens: 628                                                                                                  
glm-4-air                      | score: 50.4  | 95% CI: (-1.8, 2.5)  | average #tokens: 619                                                                                                  
gpt-4-0314                     | score: 50.0  | 95% CI:  (0.0, 0.0)  | average #tokens: 423                                                                                                                       
claude-3-sonnet-20240229       | score: 49.7  | 95% CI: (-2.0, 2.6)  | average #tokens: 552                                                                                                                       
gpt-4-0613                     | score: 49.6  | 95% CI: (-2.5, 2.7)  | average #tokens: 354                                                                                                                       
qwen2-72b-instruct             | score: 49.5  | 95% CI: (-2.4, 2.4)  | average #tokens: 515                                                                                                                       
gemma-2-27b-it                 | score: 47.4  | 95% CI: (-2.8, 2.8)  | average #tokens: 577                                                                                                                       
gemini-1.5-pro-api-0409-preview| score: 46.8  | 95% CI: (-2.8, 2.7)  | average #tokens: 478                                                                                                                      
mistral-large-2402             | score: 45.5  | 95% CI: (-2.5, 2.1)  | average #tokens: 400                                                                                                                                                 
claude-3-haiku-20240307        | score: 45.3  | 95% CI: (-2.3, 3.1)  | average #tokens: 505                                                                                                                                                 
llama-3-70b-instruct           | score: 44.3  | 95% CI: (-2.2, 3.5)  | average #tokens: 591
mixtral-8x22b-instruct-v0.1    | score: 44.0  | 95% CI: (-2.9, 2.9)  | average #tokens: 430
qwen1.5-72b-chat               | score: 39.7  | 95% CI: (-2.1, 2.2)  | average #tokens: 474
gemini-1.5-flash-api-0514      | score: 39.7  | 95% CI: (-2.5, 2.4)  | average #tokens: 642
mistral-next                   | score: 39.6  | 95% CI: (-2.2, 2.5)  | average #tokens: 297
mistral-medium                 | score: 39.0  | 95% CI: (-2.4, 3.3)  | average #tokens: 485
phi-3-medium-4k-instruct       | score: 38.7  | 95% CI: (-2.1, 2.6)  | average #tokens: 517
command-r-plus                 | score: 37.3  | 95% CI: (-2.3, 1.6)  | average #tokens: 541
claude-2.0                     | score: 36.7  | 95% CI: (-2.2, 2.6)  | average #tokens: 295
claude-2.1                     | score: 35.1  | 95% CI: (-2.9, 2.5)  | average #tokens: 290
gpt-3.5-turbo-0613             | score: 34.9  | 95% CI: (-2.2, 3.0)  | average #tokens: 401
gpt-3.5-turbo-0125             | score: 34.7  | 95% CI: (-2.3, 2.7)  | average #tokens: 329
phi-3-small-8k-instruct        | score: 33.6  | 95% CI: (-2.6, 2.3)  | average #tokens: 568
gemma-2-9b-it                  | score: 33.3  | 95% CI: (-2.7, 2.8)  | average #tokens: 541                                                                  
gpt-3.5-turbo-1106             | score: 33.0  | 95% CI: (-2.4, 2.9)  | average #tokens: 285                                                                  
dbrx-instruct-preview          | score: 32.0  | 95% CI: (-2.5, 2.4)  | average #tokens: 415                                                                  
internlm2-20b-5-chat           | score: 30.2  | 95% CI: (-2.2, 2.5)  | average #tokens: 576                                                                  
mixtral-8x7b-instruct-v0.1     | score: 29.8  | 95% CI: (-2.0, 2.1)  | average #tokens: 457                                                                  
gpt-3.5-turbo-0314             | score: 29.4  | 95% CI: (-2.8, 2.1)  | average #tokens: 334                                                                  
starling-lm-7b-beta            | score: 26.0  | 95% CI: (-2.4, 2.2)  | average #tokens: 530                                                                  
snowflake-arctic-instruct      | score: 25.9  | 95% CI: (-2.6, 1.8)  | average #tokens: 365                                                                  
gemini-1.0-pro                 | score: 24.9  | 95% CI: (-2.1, 2.4)  | average #tokens: 322                                                                  
command-r                      | score: 23.4  | 95% CI: (-1.9, 1.8)  | average #tokens: 432                                                                  
snorkel-mistral-pairrm-dpo     | score: 21.8  | 95% CI: (-2.2, 1.9)  | average #tokens: 564                                                                  
yi-34b-chat                    | score: 21.8  | 95% CI: (-2.2, 2.0)  | average #tokens: 611                                                                  
internlm2-20b-chat             | score: 21.1  | 95% CI: (-1.9, 1.3)  | average #tokens: 667                                                                  
llama-3-8b-instruct            | score: 19.7  | 95% CI: (-1.6, 1.8)  | average #tokens: 585                                                                                                  
llama-3.1-8b-instruct          | score: 18.2  | 95% CI: (-1.8, 2.0)  | average #tokens: 861                                                                                                  
tulu-2-dpo-70b                 | score: 18.0  | 95% CI: (-1.7, 1.8)  | average #tokens: 550                                                                                                  
starling-lm-7b-alpha           | score: 16.4  | 95% CI: (-1.5, 1.5)  | average #tokens: 483                                                                                                  
phi-3-mini-128k-instruct       | score: 16.1  | 95% CI: (-1.5, 1.9)  | average #tokens: 609                                                                                                  
mistral-7b-instruct            | score: 15.2  | 95% CI: (-2.0, 1.5)  | average #tokens: 541                                                                                                  
llama-2-70b-chat               | score: 13.4  | 95% CI: (-1.5, 1.7)  | average #tokens: 595                                                                                                  
vicuna-33b                     | score: 11.7  | 95% CI: (-1.9, 1.7)  | average #tokens: 451                                                                                                  
gemma-1.1-7b-it                | score: 11.6  | 95% CI: (-1.4, 1.2)  | average #tokens: 341                                                                                                  
gemma-7b-it                    | score:  7.0  | 95% CI: (-1.1, 1.0)  | average #tokens: 378                                                                                                  
gemma-1.1-2b-it                | score:  3.5  | 95% CI: (-0.6, 0.7)  | average #tokens: 316                                                                                                  
gemma-2b-it                    | score:  2.9  | 95% CI: (-0.5, 0.6)  | average #tokens: 369                                                                                                  
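The rows above can also be read programmatically. The sketch below parses one leaderboard line and converts the confidence interval into absolute bounds. It assumes the two numbers in `95% CI` are offsets relative to the score (so 82.0 with (-1.6, 2.2) spans 80.4 to 84.2), which the all-zero interval on the baseline row suggests but the text does not state. The names `ROW_PATTERN`, `parse_row`, and `intervals_overlap` are hypothetical helpers introduced for this example.

```python
import re

# Matches one leaderboard line in the layout used above.
ROW_PATTERN = re.compile(
    r"^(?P<model>\S+)\s*\|\s*score:\s*(?P<score>[\d.]+)\s*\|\s*95% CI:\s*"
    r"\(\s*(?P<lo>-?[\d.]+),\s*(?P<hi>-?[\d.]+)\)\s*\|\s*"
    r"average #tokens:\s*(?P<tokens>\d+)\s*$"
)


def parse_row(line):
    """Parse one row into a dict with absolute CI bounds (assumed offsets)."""
    m = ROW_PATTERN.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    score = float(m["score"])
    return {
        "model": m["model"],
        "score": score,
        "lower": round(score + float(m["lo"]), 1),
        "upper": round(score + float(m["hi"]), 1),
        "tokens": int(m["tokens"]),
    }


def intervals_overlap(a, b):
    """True if the two models' confidence intervals share any value."""
    return a["lower"] <= b["upper"] and b["lower"] <= a["upper"]


if __name__ == "__main__":
    sonnet = parse_row(
        "claude-3-5-sonnet-20240620     | score: 82.0  | 95% CI: (-1.6, 2.2)  | average #tokens: 567"
    )
    turbo = parse_row(
        "gpt-4-turbo-2024-04-09         | score: 74.4  | 95% CI: (-2.5, 2.1)  | average #tokens: 662"
    )
    print(sonnet["lower"], sonnet["upper"])  # 80.4 84.2
    print(intervals_overlap(sonnet, turbo))  # False
```

Checking whether two intervals overlap gives a rough sense of whether the benchmark separates a pair of models; it is a reading aid, not the separability metric defined in the paper.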

For more information click here.

Varshini

Varshini is a Cyber Security expert in Threat Analysis, Vulnerability Assessment, and Research. Passionate about staying ahead of emerging Threats and Technologies.
