
Arena-Hard-Auto : Advancing LLM Evaluation With Style Control Integration

Arena-Hard-Auto-v0.1 (See Paper) is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena.

We prompt GPT-4-Turbo as the judge to compare each model's responses against those of a baseline model (default: GPT-4-0314).
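To make the pairwise comparison concrete, here is a minimal, illustrative sketch of a single judgment call using the OpenAI Python SDK. The judge prompt, verdict labels, and the `judge_pair` helper are simplified assumptions for illustration, not the repository's exact code; the actual pipeline runs judging through its own config-driven evaluation scripts.

```python
# Assumed sketch: ask a GPT-4-Turbo judge to compare a candidate answer
# against the gpt-4-0314 baseline answer for a single query.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_SYSTEM = (
    "You are an impartial judge. Compare Assistant A and Assistant B on the "
    "user question below and answer with one verdict: "
    "A>>B, A>B, A=B, B>A, or B>>A."
)

def judge_pair(question: str, baseline_answer: str, candidate_answer: str) -> str:
    """Return the judge's verdict for one baseline-vs-candidate pair."""
    user_prompt = (
        f"[User Question]\n{question}\n\n"
        f"[Assistant A (baseline)]\n{baseline_answer}\n\n"
        f"[Assistant B (candidate)]\n{candidate_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```

In the full pipeline, verdicts from many such comparisons are aggregated into the win-rate scores shown in the leaderboards below.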

Notably, among popular open-ended LLM benchmarks, Arena-Hard-Auto shows the highest correlation with Chatbot Arena rankings and the best separability between models (See Paper).

If you are curious to see how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.

Although both Arena-Hard-Auto and Chatbot Arena Category Hard (See Blog) employ a similar pipeline to select hard prompts, Arena-Hard-Auto uses an automatic judge as a cheaper and faster approximation of human preference.

Check out the BenchBuilder folder for code and resources on how we curate Arena-Hard-Auto.

Contents

  • Style Control Leaderboard
  • Leaderboard
  • Install
  • Evaluation
  • Style Control Guide

Style Control Leaderboard

Following the newly introduced Style Control on Chatbot Arena, we are releasing Style Control on Arena-Hard-Auto. We employ the same Style Control method as proposed in the blog post.

Please refer to the blog post for the methodology and technical background.
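As a rough intuition for how style control works, the sketch below fits a Bradley-Terry-style logistic regression in which each battle row carries model indicator columns plus a style covariate, so the style effect is absorbed by its own coefficient rather than inflating the model scores. This is an assumption-laden illustration, not the published implementation: the `fit_style_controlled_bt` helper and the single length feature are hypothetical simplifications, and the blog post's method also controls for markdown elements such as headers, lists, and bold text.

```python
# Assumed sketch of style-controlled Bradley-Terry fitting: model indicators
# plus a response-length covariate enter one logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_style_controlled_bt(battles, n_models):
    """battles: iterable of (model_a, model_b, len_a, len_b, a_wins) tuples,
    where model_a/model_b are integer model indices and a_wins is 0 or 1."""
    X, y = [], []
    for a, b, len_a, len_b, a_wins in battles:
        row = np.zeros(n_models + 1)
        row[a], row[b] = 1.0, -1.0                    # model indicator columns
        row[-1] = (len_a - len_b) / (len_a + len_b)   # length-difference style covariate
        X.append(row)
        y.append(a_wins)
    clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
    coefs = clf.coef_[0]
    # style-controlled model coefficients, followed by the length-effect coefficient
    return coefs[:n_models], coefs[n_models]
```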

(Updated: 10/14)

claude-3-5-sonnet-20240620     | score: 82.0  | 95% CI: (-1.6, 2.2)  | average #tokens: 567                                                      
o1-preview-2024-09-12          | score: 81.6  | 95% CI: (-2.4, 2.2)  | average #tokens: 1193                                                     
o1-mini-2024-09-12             | score: 79.2  | 95% CI: (-2.6, 2.4)  | average #tokens: 1399                                                     
gpt-4-turbo-2024-04-09         | score: 74.4  | 95% CI: (-2.5, 2.1)  | average #tokens: 662                                                      
gpt-4-0125-preview             | score: 73.5  | 95% CI: (-2.4, 1.8)  | average #tokens: 619                                                      
gpt-4o-2024-08-06              | score: 71.0  | 95% CI: (-2.5, 2.8)  | average #tokens: 594
llama-3.1-nemotron-70b-instruct| score: 70.9  | 95% CI: (-3.3, 3.3)  | average #tokens: 869
gpt-4o-2024-05-13              | score: 69.9  | 95% CI: (-2.5, 2.3)  | average #tokens: 696                                                      
athene-70b                     | score: 67.7  | 95% CI: (-3.2, 2.2)  | average #tokens: 685                                                      
yi-lightning                   | score: 67.1  | 95% CI: (-2.3, 2.8)  | average #tokens: 875                                                      
llama-3.1-405b-instruct        | score: 66.8  | 95% CI: (-2.6, 1.9)  | average #tokens: 658                                                      
claude-3-opus-20240229         | score: 65.5  | 95% CI: (-2.3, 2.5)  | average #tokens: 541                                                      
yi-large-preview               | score: 65.0  | 95% CI: (-2.4, 2.0)  | average #tokens: 720                                                                  
gpt-4o-mini-2024-07-18         | score: 64.2  | 95% CI: (-2.7, 2.9)  | average #tokens: 668                                                                  
qwen2.5-72b-instruct           | score: 63.4  | 95% CI: (-2.5, 2.7)  | average #tokens: 821                                                                  
mistral-large-2407             | score: 63.1  | 95% CI: (-2.6, 3.1)  | average #tokens: 623                                                                                 
gemini-1.5-pro-api-0514        | score: 62.4  | 95% CI: (-2.7, 2.1)  | average #tokens: 676                                                                                 
glm-4-0520                     | score: 61.3  | 95% CI: (-3.3, 3.0)  | average #tokens: 636                                                                                 
yi-large                       | score: 59.3  | 95% CI: (-3.1, 2.2)  | average #tokens: 626                                                                                                  
deepseek-coder-v2              | score: 58.2  | 95% CI: (-2.6, 2.8)  | average #tokens: 578                                                                                                  
glm-4-0116                     | score: 54.1  | 95% CI: (-2.5, 2.5)  | average #tokens: 622                                                                                                  
llama-3.1-70b-instruct         | score: 51.6  | 95% CI: (-2.5, 2.7)  | average #tokens: 628                                                                                                  
glm-4-air                      | score: 50.4  | 95% CI: (-1.8, 2.5)  | average #tokens: 619                                                                                                  
gpt-4-0314                     | score: 50.0  | 95% CI:  (0.0, 0.0)  | average #tokens: 423                                                                                                                       
claude-3-sonnet-20240229       | score: 49.7  | 95% CI: (-2.0, 2.6)  | average #tokens: 552                                                                                                                       
gpt-4-0613                     | score: 49.6  | 95% CI: (-2.5, 2.7)  | average #tokens: 354                                                                                                                       
qwen2-72b-instruct             | score: 49.5  | 95% CI: (-2.4, 2.4)  | average #tokens: 515                                                                                                                       
gemma-2-27b-it                 | score: 47.4  | 95% CI: (-2.8, 2.8)  | average #tokens: 577                                                                                                                       
gemini-1.5-pro-api-0409-preview| score: 46.8  | 95% CI: (-2.8, 2.7)  | average #tokens: 478                                                                                                                      
mistral-large-2402             | score: 45.5  | 95% CI: (-2.5, 2.1)  | average #tokens: 400                                                                                                                                                 
claude-3-haiku-20240307        | score: 45.3  | 95% CI: (-2.3, 3.1)  | average #tokens: 505                                                                                                                                                 
llama-3-70b-instruct           | score: 44.3  | 95% CI: (-2.2, 3.5)  | average #tokens: 591
mixtral-8x22b-instruct-v0.1    | score: 44.0  | 95% CI: (-2.9, 2.9)  | average #tokens: 430
qwen1.5-72b-chat               | score: 39.7  | 95% CI: (-2.1, 2.2)  | average #tokens: 474
gemini-1.5-flash-api-0514      | score: 39.7  | 95% CI: (-2.5, 2.4)  | average #tokens: 642
mistral-next                   | score: 39.6  | 95% CI: (-2.2, 2.5)  | average #tokens: 297
mistral-medium                 | score: 39.0  | 95% CI: (-2.4, 3.3)  | average #tokens: 485
phi-3-medium-4k-instruct       | score: 38.7  | 95% CI: (-2.1, 2.6)  | average #tokens: 517
command-r-plus                 | score: 37.3  | 95% CI: (-2.3, 1.6)  | average #tokens: 541
claude-2.0                     | score: 36.7  | 95% CI: (-2.2, 2.6)  | average #tokens: 295
claude-2.1                     | score: 35.1  | 95% CI: (-2.9, 2.5)  | average #tokens: 290
gpt-3.5-turbo-0613             | score: 34.9  | 95% CI: (-2.2, 3.0)  | average #tokens: 401
gpt-3.5-turbo-0125             | score: 34.7  | 95% CI: (-2.3, 2.7)  | average #tokens: 329
phi-3-small-8k-instruct        | score: 33.6  | 95% CI: (-2.6, 2.3)  | average #tokens: 568
gemma-2-9b-it                  | score: 33.3  | 95% CI: (-2.7, 2.8)  | average #tokens: 541                                                                  
gpt-3.5-turbo-1106             | score: 33.0  | 95% CI: (-2.4, 2.9)  | average #tokens: 285                                                                  
dbrx-instruct-preview          | score: 32.0  | 95% CI: (-2.5, 2.4)  | average #tokens: 415                                                                  
internlm2-20b-5-chat           | score: 30.2  | 95% CI: (-2.2, 2.5)  | average #tokens: 576                                                                  
mixtral-8x7b-instruct-v0.1     | score: 29.8  | 95% CI: (-2.0, 2.1)  | average #tokens: 457                                                                  
gpt-3.5-turbo-0314             | score: 29.4  | 95% CI: (-2.8, 2.1)  | average #tokens: 334                                                                  
starling-lm-7b-beta            | score: 26.0  | 95% CI: (-2.4, 2.2)  | average #tokens: 530                                                                  
snowflake-arctic-instruct      | score: 25.9  | 95% CI: (-2.6, 1.8)  | average #tokens: 365                                                                  
gemini-1.0-pro                 | score: 24.9  | 95% CI: (-2.1, 2.4)  | average #tokens: 322                                                                  
command-r                      | score: 23.4  | 95% CI: (-1.9, 1.8)  | average #tokens: 432                                                                  
snorkel-mistral-pairrm-dpo     | score: 21.8  | 95% CI: (-2.2, 1.9)  | average #tokens: 564                                                                  
yi-34b-chat                    | score: 21.8  | 95% CI: (-2.2, 2.0)  | average #tokens: 611                                                                  
internlm2-20b-chat             | score: 21.1  | 95% CI: (-1.9, 1.3)  | average #tokens: 667                                                                  
llama-3-8b-instruct            | score: 19.7  | 95% CI: (-1.6, 1.8)  | average #tokens: 585                                                                                                  
llama-3.1-8b-instruct          | score: 18.2  | 95% CI: (-1.8, 2.0)  | average #tokens: 861                                                                                                  
tulu-2-dpo-70b                 | score: 18.0  | 95% CI: (-1.7, 1.8)  | average #tokens: 550                                                                                                  
starling-lm-7b-alpha           | score: 16.4  | 95% CI: (-1.5, 1.5)  | average #tokens: 483                                                                                                  
phi-3-mini-128k-instruct       | score: 16.1  | 95% CI: (-1.5, 1.9)  | average #tokens: 609                                                                                                  
mistral-7b-instruct            | score: 15.2  | 95% CI: (-2.0, 1.5)  | average #tokens: 541                                                                                                  
llama-2-70b-chat               | score: 13.4  | 95% CI: (-1.5, 1.7)  | average #tokens: 595                                                                                                  
vicuna-33b                     | score: 11.7  | 95% CI: (-1.9, 1.7)  | average #tokens: 451                                                                                                  
gemma-1.1-7b-it                | score: 11.6  | 95% CI: (-1.4, 1.2)  | average #tokens: 341                                                                                                  
gemma-7b-it                    | score:  7.0  | 95% CI: (-1.1, 1.0)  | average #tokens: 378                                                                                                  
gemma-1.1-2b-it                | score:  3.5  | 95% CI: (-0.6, 0.7)  | average #tokens: 316                                                                                                  
gemma-2b-it                    | score:  2.9  | 95% CI: (-0.5, 0.6)  | average #tokens: 369                                                                                                  


