Arena-Hard-Auto: Advancing LLM Evaluation With Style Control Integration

Arena-Hard-Auto-v0.1 (See Paper) is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena.

We prompt GPT-4-Turbo as a judge to compare each model's responses against those of a baseline model (default: GPT-4-0314).
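To make the comparison against a baseline concrete, here is a minimal sketch of turning per-prompt judge verdicts into a percentage score. It assumes each verdict has already been reduced to "win", "tie", or "loss" from the evaluated model's point of view, and that a tie counts as half a win. The function name `win_rate`, the labels, and the example data are hypothetical. The project's actual scoring may differ, for example by fitting a statistical model rather than counting wins.

```python
from collections import Counter


def win_rate(verdicts):
    """Return a model's percentage score against the baseline.

    `verdicts` holds one "win", "tie", or "loss" string per prompt, from the
    point of view of the model being evaluated. Counting a tie as half a win
    is an assumption of this sketch, not a documented rule.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)


# Hypothetical verdicts for illustration only, not real judge output.
example = ["win", "win", "tie", "loss"]
print(win_rate(example))  # 62.5
```

Under this reading, a model that ties the baseline on every prompt scores 50.0, which matches the baseline row (gpt-4-0314) in the leaderboard.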

Notably, among popular open-ended LLM benchmarks, Arena-Hard-Auto has the highest correlation with Chatbot Arena and the highest separability (See Paper).

If you are curious to see how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.

Although both Arena-Hard-Auto and Chatbot Arena Category Hard (See Blog) employ a similar pipeline to select hard prompts, Arena-Hard-Auto uses an automatic judge as a cheaper and faster approximation of human preference.

Check out the BenchBuilder folder for code and resources on how we curate Arena-Hard-Auto.

Content

  • Style Control Leaderboard
  • Leaderboard
  • Install
  • Evaluation
  • Style Control Guide

Style Control Leaderboard

Following the newly introduced Style Control on Chatbot Arena, we are releasing Style Control for Arena-Hard-Auto. We use the same Style Control methods proposed in the blog post.

Please refer to the blog post for methodology and technical background.

(Updated: 10/14)

claude-3-5-sonnet-20240620     | score: 82.0  | 95% CI: (-1.6, 2.2)  | average #tokens: 567                                                      
o1-preview-2024-09-12          | score: 81.6  | 95% CI: (-2.4, 2.2)  | average #tokens: 1193                                                     
o1-mini-2024-09-12             | score: 79.2  | 95% CI: (-2.6, 2.4)  | average #tokens: 1399                                                     
gpt-4-turbo-2024-04-09         | score: 74.4  | 95% CI: (-2.5, 2.1)  | average #tokens: 662                                                      
gpt-4-0125-preview             | score: 73.5  | 95% CI: (-2.4, 1.8)  | average #tokens: 619                                                      
gpt-4o-2024-08-06              | score: 71.0  | 95% CI: (-2.5, 2.8)  | average #tokens: 594
llama-3.1-nemotron-70b-instruct| score: 70.9  | 95% CI: (-3.3, 3.3)  | average #tokens: 869
gpt-4o-2024-05-13              | score: 69.9  | 95% CI: (-2.5, 2.3)  | average #tokens: 696                                                      
athene-70b                     | score: 67.7  | 95% CI: (-3.2, 2.2)  | average #tokens: 685                                                      
yi-lightning                   | score: 67.1  | 95% CI: (-2.3, 2.8)  | average #tokens: 875                                                      
llama-3.1-405b-instruct        | score: 66.8  | 95% CI: (-2.6, 1.9)  | average #tokens: 658                                                      
claude-3-opus-20240229         | score: 65.5  | 95% CI: (-2.3, 2.5)  | average #tokens: 541                                                      
yi-large-preview               | score: 65.0  | 95% CI: (-2.4, 2.0)  | average #tokens: 720                                                                  
gpt-4o-mini-2024-07-18         | score: 64.2  | 95% CI: (-2.7, 2.9)  | average #tokens: 668                                                                  
qwen2.5-72b-instruct           | score: 63.4  | 95% CI: (-2.5, 2.7)  | average #tokens: 821                                                                  
mistral-large-2407             | score: 63.1  | 95% CI: (-2.6, 3.1)  | average #tokens: 623                                                                                 
gemini-1.5-pro-api-0514        | score: 62.4  | 95% CI: (-2.7, 2.1)  | average #tokens: 676                                                                                 
glm-4-0520                     | score: 61.3  | 95% CI: (-3.3, 3.0)  | average #tokens: 636                                                                                 
yi-large                       | score: 59.3  | 95% CI: (-3.1, 2.2)  | average #tokens: 626                                                                                                  
deepseek-coder-v2              | score: 58.2  | 95% CI: (-2.6, 2.8)  | average #tokens: 578                                                                                                  
glm-4-0116                     | score: 54.1  | 95% CI: (-2.5, 2.5)  | average #tokens: 622                                                                                                  
llama-3.1-70b-instruct         | score: 51.6  | 95% CI: (-2.5, 2.7)  | average #tokens: 628                                                                                                  
glm-4-air                      | score: 50.4  | 95% CI: (-1.8, 2.5)  | average #tokens: 619                                                                                                  
gpt-4-0314                     | score: 50.0  | 95% CI:  (0.0, 0.0)  | average #tokens: 423                                                                                                                       
claude-3-sonnet-20240229       | score: 49.7  | 95% CI: (-2.0, 2.6)  | average #tokens: 552                                                                                                                       
gpt-4-0613                     | score: 49.6  | 95% CI: (-2.5, 2.7)  | average #tokens: 354                                                                                                                       
qwen2-72b-instruct             | score: 49.5  | 95% CI: (-2.4, 2.4)  | average #tokens: 515                                                                                                                       
gemma-2-27b-it                 | score: 47.4  | 95% CI: (-2.8, 2.8)  | average #tokens: 577                                                                                                                       
gemini-1.5-pro-api-0409-preview| score: 46.8  | 95% CI: (-2.8, 2.7)  | average #tokens: 478                                                                                                                      
mistral-large-2402             | score: 45.5  | 95% CI: (-2.5, 2.1)  | average #tokens: 400                                                                                                                                                 
claude-3-haiku-20240307        | score: 45.3  | 95% CI: (-2.3, 3.1)  | average #tokens: 505                                                                                                                                                 
llama-3-70b-instruct           | score: 44.3  | 95% CI: (-2.2, 3.5)  | average #tokens: 591
mixtral-8x22b-instruct-v0.1    | score: 44.0  | 95% CI: (-2.9, 2.9)  | average #tokens: 430
qwen1.5-72b-chat               | score: 39.7  | 95% CI: (-2.1, 2.2)  | average #tokens: 474
gemini-1.5-flash-api-0514      | score: 39.7  | 95% CI: (-2.5, 2.4)  | average #tokens: 642
mistral-next                   | score: 39.6  | 95% CI: (-2.2, 2.5)  | average #tokens: 297
mistral-medium                 | score: 39.0  | 95% CI: (-2.4, 3.3)  | average #tokens: 485
phi-3-medium-4k-instruct       | score: 38.7  | 95% CI: (-2.1, 2.6)  | average #tokens: 517
command-r-plus                 | score: 37.3  | 95% CI: (-2.3, 1.6)  | average #tokens: 541
claude-2.0                     | score: 36.7  | 95% CI: (-2.2, 2.6)  | average #tokens: 295
claude-2.1                     | score: 35.1  | 95% CI: (-2.9, 2.5)  | average #tokens: 290
gpt-3.5-turbo-0613             | score: 34.9  | 95% CI: (-2.2, 3.0)  | average #tokens: 401
gpt-3.5-turbo-0125             | score: 34.7  | 95% CI: (-2.3, 2.7)  | average #tokens: 329
phi-3-small-8k-instruct        | score: 33.6  | 95% CI: (-2.6, 2.3)  | average #tokens: 568
gemma-2-9b-it                  | score: 33.3  | 95% CI: (-2.7, 2.8)  | average #tokens: 541                                                                  
gpt-3.5-turbo-1106             | score: 33.0  | 95% CI: (-2.4, 2.9)  | average #tokens: 285                                                                  
dbrx-instruct-preview          | score: 32.0  | 95% CI: (-2.5, 2.4)  | average #tokens: 415                                                                  
internlm2-20b-5-chat           | score: 30.2  | 95% CI: (-2.2, 2.5)  | average #tokens: 576                                                                  
mixtral-8x7b-instruct-v0.1     | score: 29.8  | 95% CI: (-2.0, 2.1)  | average #tokens: 457                                                                  
gpt-3.5-turbo-0314             | score: 29.4  | 95% CI: (-2.8, 2.1)  | average #tokens: 334                                                                  
starling-lm-7b-beta            | score: 26.0  | 95% CI: (-2.4, 2.2)  | average #tokens: 530                                                                  
snowflake-arctic-instruct      | score: 25.9  | 95% CI: (-2.6, 1.8)  | average #tokens: 365                                                                  
gemini-1.0-pro                 | score: 24.9  | 95% CI: (-2.1, 2.4)  | average #tokens: 322                                                                  
command-r                      | score: 23.4  | 95% CI: (-1.9, 1.8)  | average #tokens: 432                                                                  
snorkel-mistral-pairrm-dpo     | score: 21.8  | 95% CI: (-2.2, 1.9)  | average #tokens: 564                                                                  
yi-34b-chat                    | score: 21.8  | 95% CI: (-2.2, 2.0)  | average #tokens: 611                                                                  
internlm2-20b-chat             | score: 21.1  | 95% CI: (-1.9, 1.3)  | average #tokens: 667                                                                  
llama-3-8b-instruct            | score: 19.7  | 95% CI: (-1.6, 1.8)  | average #tokens: 585                                                                                                  
llama-3.1-8b-instruct          | score: 18.2  | 95% CI: (-1.8, 2.0)  | average #tokens: 861                                                                                                  
tulu-2-dpo-70b                 | score: 18.0  | 95% CI: (-1.7, 1.8)  | average #tokens: 550                                                                                                  
starling-lm-7b-alpha           | score: 16.4  | 95% CI: (-1.5, 1.5)  | average #tokens: 483                                                                                                  
phi-3-mini-128k-instruct       | score: 16.1  | 95% CI: (-1.5, 1.9)  | average #tokens: 609                                                                                                  
mistral-7b-instruct            | score: 15.2  | 95% CI: (-2.0, 1.5)  | average #tokens: 541                                                                                                  
llama-2-70b-chat               | score: 13.4  | 95% CI: (-1.5, 1.7)  | average #tokens: 595                                                                                                  
vicuna-33b                     | score: 11.7  | 95% CI: (-1.9, 1.7)  | average #tokens: 451                                                                                                  
gemma-1.1-7b-it                | score: 11.6  | 95% CI: (-1.4, 1.2)  | average #tokens: 341                                                                                                  
gemma-7b-it                    | score:  7.0  | 95% CI: (-1.1, 1.0)  | average #tokens: 378                                                                                                  
gemma-1.1-2b-it                | score:  3.5  | 95% CI: (-0.6, 0.7)  | average #tokens: 316                                                                                                  
gemma-2b-it                    | score:  2.9  | 95% CI: (-0.5, 0.6)  | average #tokens: 369                                                                                                  
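The rows above follow a regular layout, so they are easy to process programmatically. The sketch below parses a row and checks whether two models' 95% confidence intervals overlap, which is one simple way to read "separability" between a pair of models. It assumes the CI values are offsets relative to the score (so the interval is score + lower offset to score + upper offset). The names `Row`, `parse_row`, and `separable` are hypothetical helpers, not part of the Arena-Hard-Auto code.

```python
import re
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float
    ci_low: float    # absolute lower bound: score + (negative) offset
    ci_high: float   # absolute upper bound: score + (positive) offset
    avg_tokens: int


# Matches lines such as:
# "gpt-4-0613 | score: 49.6  | 95% CI: (-2.5, 2.7)  | average #tokens: 354"
ROW_RE = re.compile(
    r"^\s*(?P<model>\S+)\s*\|\s*score:\s*(?P<score>[\d.]+)\s*"
    r"\|\s*95% CI:\s*\(\s*(?P<lo>-?[\d.]+),\s*(?P<hi>-?[\d.]+)\)\s*"
    r"\|\s*average #tokens:\s*(?P<tokens>\d+)\s*$"
)


def parse_row(line: str) -> Row:
    """Parse one leaderboard line into a Row, or raise ValueError."""
    m = ROW_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized leaderboard row: {line!r}")
    score = float(m["score"])
    # Assumption: the CI is printed as offsets around the score.
    return Row(
        model=m["model"],
        score=score,
        ci_low=score + float(m["lo"]),
        ci_high=score + float(m["hi"]),
        avg_tokens=int(m["tokens"]),
    )


def separable(a: Row, b: Row) -> bool:
    """True if the two confidence intervals do not overlap."""
    return a.ci_high < b.ci_low or b.ci_high < a.ci_low


sonnet = parse_row("claude-3-5-sonnet-20240620     | score: 82.0  | 95% CI: (-1.6, 2.2)  | average #tokens: 567")
o1 = parse_row("o1-preview-2024-09-12          | score: 81.6  | 95% CI: (-2.4, 2.2)  | average #tokens: 1193")
turbo = parse_row("gpt-4-turbo-2024-04-09         | score: 74.4  | 95% CI: (-2.5, 2.1)  | average #tokens: 662")

print(separable(sonnet, o1))     # False: the intervals overlap
print(separable(sonnet, turbo))  # True: the intervals are disjoint
```

Under the offset assumption, the top two entries are not distinguishable at the 95% level, while claude-3-5-sonnet-20240620 and gpt-4-turbo-2024-04-09 are.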

For more information click here.

Varshini

Varshini is a Cyber Security expert in Threat Analysis, Vulnerability Assessment, and Research. Passionate about staying ahead of emerging Threats and Technologies.
