Arena-Hard-Auto: Advancing LLM Evaluation With Style Control Integration

Arena-Hard-Auto-v0.1 (See Paper) is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena.

We prompt GPT-4-Turbo as a judge to compare each model’s responses against those of a baseline model (default: GPT-4-0314).
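
To make the comparison against a baseline concrete, here is a minimal sketch of turning per-query judge verdicts into a single score. The verdict labels ("win", "tie", "loss") and the function name are hypothetical, and counting a tie as half a win is an assumption for illustration; the actual tool may use finer-grained verdicts and a different aggregation.

```python
from collections import Counter

# Hypothetical verdict labels, from the evaluated model's point of view.
VALID_VERDICTS = {"win", "tie", "loss"}


def win_rate_vs_baseline(verdicts):
    """Return a score in [0, 100] for one model against the baseline.

    `verdicts` holds one label per query. A tie counts as half a win
    (an assumption made for this sketch).
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    wins = counts["win"] + 0.5 * counts["tie"]
    return 100.0 * wins / len(verdicts)
```

Under this scheme a model that ties the baseline on every query scores 50, which is consistent with the baseline model sitting at 50.0 on the leaderboard.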

Notably, among popular open-ended LLM benchmarks, Arena-Hard-Auto has the highest correlation with Chatbot Arena and the highest separability (See Paper).

If you are curious to see how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.

Although both Arena-Hard-Auto and Chatbot Arena Category Hard (See Blog) employ a similar pipeline to select hard prompts, Arena-Hard-Auto uses an automatic judge as a cheaper and faster approximation of human preference.

Check out the BenchBuilder folder for code and resources on how we curate Arena-Hard-Auto.

Content

  • Style Control Leaderboard
  • Leaderboard
  • Install
  • Evaluation
  • Style Control Guide

Style Control Leaderboard

Following the newly introduced Style Control on Chatbot Arena, we are releasing Style Control on Arena-Hard-Auto! We employ the same Style Control methods as proposed in the blogpost.

Please refer to the blogpost for methodology and technical background.

(Updated: 10/14)

claude-3-5-sonnet-20240620     | score: 82.0  | 95% CI: (-1.6, 2.2)  | average #tokens: 567                                                      
o1-preview-2024-09-12          | score: 81.6  | 95% CI: (-2.4, 2.2)  | average #tokens: 1193                                                     
o1-mini-2024-09-12             | score: 79.2  | 95% CI: (-2.6, 2.4)  | average #tokens: 1399                                                     
gpt-4-turbo-2024-04-09         | score: 74.4  | 95% CI: (-2.5, 2.1)  | average #tokens: 662                                                      
gpt-4-0125-preview             | score: 73.5  | 95% CI: (-2.4, 1.8)  | average #tokens: 619                                                      
gpt-4o-2024-08-06              | score: 71.0  | 95% CI: (-2.5, 2.8)  | average #tokens: 594
llama-3.1-nemotron-70b-instruct| score: 70.9  | 95% CI: (-3.3, 3.3)  | average #tokens: 869
gpt-4o-2024-05-13              | score: 69.9  | 95% CI: (-2.5, 2.3)  | average #tokens: 696                                                      
athene-70b                     | score: 67.7  | 95% CI: (-3.2, 2.2)  | average #tokens: 685                                                      
yi-lightning                   | score: 67.1  | 95% CI: (-2.3, 2.8)  | average #tokens: 875                                                      
llama-3.1-405b-instruct        | score: 66.8  | 95% CI: (-2.6, 1.9)  | average #tokens: 658                                                      
claude-3-opus-20240229         | score: 65.5  | 95% CI: (-2.3, 2.5)  | average #tokens: 541                                                      
yi-large-preview               | score: 65.0  | 95% CI: (-2.4, 2.0)  | average #tokens: 720                                                                  
gpt-4o-mini-2024-07-18         | score: 64.2  | 95% CI: (-2.7, 2.9)  | average #tokens: 668                                                                  
qwen2.5-72b-instruct           | score: 63.4  | 95% CI: (-2.5, 2.7)  | average #tokens: 821                                                                  
mistral-large-2407             | score: 63.1  | 95% CI: (-2.6, 3.1)  | average #tokens: 623                                                                                 
gemini-1.5-pro-api-0514        | score: 62.4  | 95% CI: (-2.7, 2.1)  | average #tokens: 676                                                                                 
glm-4-0520                     | score: 61.3  | 95% CI: (-3.3, 3.0)  | average #tokens: 636                                                                                 
yi-large                       | score: 59.3  | 95% CI: (-3.1, 2.2)  | average #tokens: 626                                                                                                  
deepseek-coder-v2              | score: 58.2  | 95% CI: (-2.6, 2.8)  | average #tokens: 578                                                                                                  
glm-4-0116                     | score: 54.1  | 95% CI: (-2.5, 2.5)  | average #tokens: 622                                                                                                  
llama-3.1-70b-instruct         | score: 51.6  | 95% CI: (-2.5, 2.7)  | average #tokens: 628                                                                                                  
glm-4-air                      | score: 50.4  | 95% CI: (-1.8, 2.5)  | average #tokens: 619                                                                                                  
gpt-4-0314                     | score: 50.0  | 95% CI:  (0.0, 0.0)  | average #tokens: 423                                                                                                                       
claude-3-sonnet-20240229       | score: 49.7  | 95% CI: (-2.0, 2.6)  | average #tokens: 552                                                                                                                       
gpt-4-0613                     | score: 49.6  | 95% CI: (-2.5, 2.7)  | average #tokens: 354                                                                                                                       
qwen2-72b-instruct             | score: 49.5  | 95% CI: (-2.4, 2.4)  | average #tokens: 515                                                                                                                       
gemma-2-27b-it                 | score: 47.4  | 95% CI: (-2.8, 2.8)  | average #tokens: 577                                                                                                                       
gemini-1.5-pro-api-0409-preview| score: 46.8  | 95% CI: (-2.8, 2.7)  | average #tokens: 478                                                                                                                      
mistral-large-2402             | score: 45.5  | 95% CI: (-2.5, 2.1)  | average #tokens: 400                                                                                                                                                 
claude-3-haiku-20240307        | score: 45.3  | 95% CI: (-2.3, 3.1)  | average #tokens: 505                                                                                                                                                 
llama-3-70b-instruct           | score: 44.3  | 95% CI: (-2.2, 3.5)  | average #tokens: 591
mixtral-8x22b-instruct-v0.1    | score: 44.0  | 95% CI: (-2.9, 2.9)  | average #tokens: 430
qwen1.5-72b-chat               | score: 39.7  | 95% CI: (-2.1, 2.2)  | average #tokens: 474
gemini-1.5-flash-api-0514      | score: 39.7  | 95% CI: (-2.5, 2.4)  | average #tokens: 642
mistral-next                   | score: 39.6  | 95% CI: (-2.2, 2.5)  | average #tokens: 297
mistral-medium                 | score: 39.0  | 95% CI: (-2.4, 3.3)  | average #tokens: 485
phi-3-medium-4k-instruct       | score: 38.7  | 95% CI: (-2.1, 2.6)  | average #tokens: 517
command-r-plus                 | score: 37.3  | 95% CI: (-2.3, 1.6)  | average #tokens: 541
claude-2.0                     | score: 36.7  | 95% CI: (-2.2, 2.6)  | average #tokens: 295
claude-2.1                     | score: 35.1  | 95% CI: (-2.9, 2.5)  | average #tokens: 290
gpt-3.5-turbo-0613             | score: 34.9  | 95% CI: (-2.2, 3.0)  | average #tokens: 401
gpt-3.5-turbo-0125             | score: 34.7  | 95% CI: (-2.3, 2.7)  | average #tokens: 329
phi-3-small-8k-instruct        | score: 33.6  | 95% CI: (-2.6, 2.3)  | average #tokens: 568
gemma-2-9b-it                  | score: 33.3  | 95% CI: (-2.7, 2.8)  | average #tokens: 541                                                                  
gpt-3.5-turbo-1106             | score: 33.0  | 95% CI: (-2.4, 2.9)  | average #tokens: 285                                                                  
dbrx-instruct-preview          | score: 32.0  | 95% CI: (-2.5, 2.4)  | average #tokens: 415                                                                  
internlm2-20b-5-chat           | score: 30.2  | 95% CI: (-2.2, 2.5)  | average #tokens: 576                                                                  
mixtral-8x7b-instruct-v0.1     | score: 29.8  | 95% CI: (-2.0, 2.1)  | average #tokens: 457                                                                  
gpt-3.5-turbo-0314             | score: 29.4  | 95% CI: (-2.8, 2.1)  | average #tokens: 334                                                                  
starling-lm-7b-beta            | score: 26.0  | 95% CI: (-2.4, 2.2)  | average #tokens: 530                                                                  
snowflake-arctic-instruct      | score: 25.9  | 95% CI: (-2.6, 1.8)  | average #tokens: 365                                                                  
gemini-1.0-pro                 | score: 24.9  | 95% CI: (-2.1, 2.4)  | average #tokens: 322                                                                  
command-r                      | score: 23.4  | 95% CI: (-1.9, 1.8)  | average #tokens: 432                                                                  
snorkel-mistral-pairrm-dpo     | score: 21.8  | 95% CI: (-2.2, 1.9)  | average #tokens: 564                                                                  
yi-34b-chat                    | score: 21.8  | 95% CI: (-2.2, 2.0)  | average #tokens: 611                                                                  
internlm2-20b-chat             | score: 21.1  | 95% CI: (-1.9, 1.3)  | average #tokens: 667                                                                  
llama-3-8b-instruct            | score: 19.7  | 95% CI: (-1.6, 1.8)  | average #tokens: 585                                                                                                  
llama-3.1-8b-instruct          | score: 18.2  | 95% CI: (-1.8, 2.0)  | average #tokens: 861                                                                                                  
tulu-2-dpo-70b                 | score: 18.0  | 95% CI: (-1.7, 1.8)  | average #tokens: 550                                                                                                  
starling-lm-7b-alpha           | score: 16.4  | 95% CI: (-1.5, 1.5)  | average #tokens: 483                                                                                                  
phi-3-mini-128k-instruct       | score: 16.1  | 95% CI: (-1.5, 1.9)  | average #tokens: 609                                                                                                  
mistral-7b-instruct            | score: 15.2  | 95% CI: (-2.0, 1.5)  | average #tokens: 541                                                                                                  
llama-2-70b-chat               | score: 13.4  | 95% CI: (-1.5, 1.7)  | average #tokens: 595                                                                                                  
vicuna-33b                     | score: 11.7  | 95% CI: (-1.9, 1.7)  | average #tokens: 451                                                                                                  
gemma-1.1-7b-it                | score: 11.6  | 95% CI: (-1.4, 1.2)  | average #tokens: 341                                                                                                  
gemma-7b-it                    | score:  7.0  | 95% CI: (-1.1, 1.0)  | average #tokens: 378                                                                                                  
gemma-1.1-2b-it                | score:  3.5  | 95% CI: (-0.6, 0.7)  | average #tokens: 316                                                                                                  
gemma-2b-it                    | score:  2.9  | 95% CI: (-0.5, 0.6)  | average #tokens: 369                                                                                                  
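
For readers who want to work with the rows above programmatically, the following is a rough sketch of parsing one leaderboard line and checking whether two models are separated by their intervals. It assumes the 95% CI column holds offsets from the score (the signed values and the baseline's (0.0, 0.0) suggest this), and it treats two models as separable when their intervals do not overlap. The function names and the regular expression are illustrative and not part of Arena-Hard-Auto.

```python
import re

# Matches rows such as:
# "gpt-4-0314 | score: 50.0 | 95% CI: (0.0, 0.0) | average #tokens: 423"
ROW = re.compile(
    r"^\s*(?P<model>[^\s|]+)\s*\|\s*score:\s*(?P<score>\d+(?:\.\d+)?)\s*"
    r"\|\s*95% CI:\s*\(\s*(?P<lo>-?\d+(?:\.\d+)?),\s*(?P<hi>-?\d+(?:\.\d+)?)\s*\)\s*"
    r"\|\s*average #tokens:\s*(?P<tokens>\d+)\s*$"
)


def parse_row(line):
    """Parse one leaderboard row into a dict with absolute CI bounds."""
    match = ROW.match(line)
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    score = float(match["score"])
    return {
        "model": match["model"],
        "score": score,
        # Assumption: the CI column is an offset from the score.
        "lower": round(score + float(match["lo"]), 1),
        "upper": round(score + float(match["hi"]), 1),
        "avg_tokens": int(match["tokens"]),
    }


def separable(a, b):
    """True when the two models' intervals do not overlap."""
    return a["lower"] > b["upper"] or b["lower"] > a["upper"]


if __name__ == "__main__":
    top = parse_row("claude-3-5-sonnet-20240620     | score: 82.0  | 95% CI: (-1.6, 2.2)  | average #tokens: 567")
    second = parse_row("o1-preview-2024-09-12          | score: 81.6  | 95% CI: (-2.4, 2.2)  | average #tokens: 1193")
    print(top["model"], top["lower"], top["upper"])
    print("separable from", second["model"], ":", separable(top, second))
```

Read this way, the top two rows have overlapping intervals, so their ordering should not be treated as decisive, while the gap between the first and fourth rows is larger than both intervals.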

Tamil S

Tamil has a great interest in the fields of Cyber Security, OSINT, and CTF projects. Currently, he is deeply involved in researching and publishing various security tools with Kali Linux Tutorials, which is quite fascinating.
