C²LEVA: Toward Comprehensive and Contamination-Free Language Model EVAluation

Paper | 🔎 Explore Benchmark | 📊 Leaderboard | ⬇️ Download Data | ✉️ Email

🎯 Overview

TL;DR: We release C²LEVA, a comprehensive bilingual benchmark with systematic contamination prevention. Large-scale evaluation on 15 large language models demonstrates the effectiveness of C²LEVA.

Recent advances in large language models (LLMs) have shown significant promise, yet their evaluation raises concerns, particularly regarding data contamination due to the lack of access to proprietary training data.

To address this issue, we present C²LEVA, which offers (1) a holistic bilingual (Chinese and English) benchmark encompassing 22 tasks, each targeting a specific application or ability of LLMs; and (2) a trustworthy assessment built on contamination-free tasks, ensured by a systematic contamination prevention strategy that fully automates test data renewal and enforces data protection during benchmark data release.

Our large-scale evaluation of 15 open-source and proprietary models shows that C²LEVA achieves a 94.8% correlation with Chatbot Arena's overall rankings, while being fully transparent and reproducible.
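As an illustration of how a benchmark ranking can be compared against an external ranking, here is a minimal sketch of a Spearman rank correlation. It is an assumption that a rank correlation of this kind is the relevant measure; the text above does not say which coefficient was used. The model scores and arena ratings below are made-up placeholders, not results from C²LEVA or Chatbot Arena.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks (largest value gets rank 1); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder numbers for five unnamed models.
benchmark_scores = [0.71, 0.64, 0.58, 0.49, 0.33]
arena_ratings = [1250, 1260, 1190, 1150, 1100]
print(round(spearman(benchmark_scores, arena_ratings), 3))  # 0.9
```

In this toy case only the top two models swap places between the two rankings, which is why the coefficient stays close to 1.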

💡 Highlights

1. Systematic Contamination Prevention.

C²LEVA systematically prevents data contamination from both passive and active perspectives. On the passive side, it addresses repurposing attacks through contamination detection and data scarcity through data augmentation. On the active side, it applies data protection techniques during benchmark release, which prolongs the effectiveness of the passive solution.

2. A Comprehensive and Contamination-Free Task Taxonomy.

With contamination prevention techniques applied, C²LEVA contains 22 tasks, in English and Simplified Chinese, for application assessment and ability evaluation.

3. Large-Scale Evaluation Experiments.

C²LEVA evaluates 15 open-source and proprietary LLMs. The leaderboard will be continuously maintained and updated.

🛠️ Framework

Our system consists of two stages, data collection and contamination prevention:

In the data collection stage, we use a set of crawlers and simulators to gather or synthesize high-quality data and store it in a database.
In the contamination prevention stage, we generate a test set by applying a series of passive prevention techniques to raw data sampled from the database, followed by the active prevention method.
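The ordering of the two stages can be sketched roughly as follows. This is not the C²LEVA implementation: the n-gram overlap filter stands in for contamination detection, the canary tag stands in for a data protection technique, and data augmentation is omitted. All function names, the record layout, and the threshold are hypothetical.

```python
import random


def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Whitespace-token n-grams of a string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(text: str, corpus_ngrams: set[tuple[str, ...]]) -> float:
    """Fraction of the text's n-grams that also occur in the reference corpus."""
    grams = ngrams(text)
    if not grams:
        return 0.0
    return len(grams & corpus_ngrams) / len(grams)


def passive_filter(examples: list[dict], public_corpus: list[str],
                   threshold: float = 0.5) -> list[dict]:
    """Assumed passive step: drop examples that overlap heavily with known public text."""
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in public_corpus:
        corpus_ngrams |= ngrams(doc)
    return [ex for ex in examples
            if overlap_ratio(ex["input"], corpus_ngrams) < threshold]


def active_protect(examples: list[dict], canary: str) -> list[dict]:
    """Assumed active step: tag each released example with a canary string."""
    return [{**ex, "canary": canary} for ex in examples]


def build_test_set(database: list[dict], public_corpus: list[str],
                   size: int, seed: int, canary: str) -> list[dict]:
    """Sample raw data, apply the passive step, then the active step."""
    rng = random.Random(seed)
    sampled = rng.sample(database, k=min(size, len(database)))
    return active_protect(passive_filter(sampled, public_corpus), canary)


# Toy database and public corpus, invented for this sketch.
database = [
    {"input": "the quick brown fox jumps over the lazy dog", "reference": "a"},
    {"input": "a freshly written sentence nobody has published before", "reference": "b"},
]
public_corpus = ["the quick brown fox jumps over the lazy dog"]
test_set = build_test_set(database, public_corpus, size=2, seed=0,
                          canary="HYPOTHETICAL-CANARY")
print([ex["input"] for ex in test_set])
```

Keeping each step as a plain function over a list of records makes it easy to rerun the whole chain on a fresh sample whenever the test set is renewed.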

🔎 Task Taxonomy & Examples

Typo-Fixing

Transliteration@IPA-to-English

Transliteration@English-to-IPA

Transliteration@Chinese-to-Pinyin

Transliteration@Pinyin-to-Chinese

Fact Completion

Reasoning Primitive@Pattern Induction

Reasoning Primitive@Pattern Matching

Reasoning Primitive@Variable Substitution

Reasoning Primitive@Dyck Language

Realistic Reasoning@Bool Logic

Realistic Reasoning@Abductive Logic

Realistic Reasoning@Deductive Logic

Realistic Reasoning@Linear Equation

Realistic Reasoning@Arithmetic

Realistic Reasoning@Reachability

Realistic Reasoning@Max Sum Path

Disinformation@Narrative Reiteration

Summarization

Sentiment Analysis

Text Classification

Example (Typo-Fixing, English)

Prompt: Please output this exact text, with no changes at all except for fixing the misspellings. Please leave all other stylistic decisions like commas and US vs British spellings as in the original text.
Input: by the impulses put into the line bh the picture at
Reference: by the impulses put into the line by the picture at
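A Typo-Fixing instance like the one above could be scored with a simple exact-match check. The sketch below assumes exact match as the metric, which the page does not state; the helper names are hypothetical.

```python
def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the prediction equals the reference after trimming whitespace."""
    return float(prediction.strip() == reference.strip())


def differing_tokens(source: str, reference: str) -> list[tuple[str, str]]:
    """List (source_token, reference_token) pairs that differ, position by position."""
    return [(s, r) for s, r in zip(source.split(), reference.split()) if s != r]


source = "by the impulses put into the line bh the picture at"
reference = "by the impulses put into the line by the picture at"

print(differing_tokens(source, reference))  # [('bh', 'by')]
print(exact_match(source, reference))       # 0.0, the uncorrected input fails
print(exact_match(reference, reference))    # 1.0, a fully corrected output passes
```

Exact match is strict by design here: the prompt asks for no changes other than spelling fixes, so any extra edit by the model would also count as a miss.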

📊 Leaderboard

English Results

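Task groups: Language (English-to-IPA, IPA-to-English, Typo-Fixing); Knowledge (Fact Completion); Reasoning (Dyck Language through Max Sum Path); Harms (Copyright, Narrative Reiteration); Applications (Text Classification, Sentiment Analysis, Summarization).

Models	English-to-IPA	IPA-to-English	Typo-Fixing	Fact Completion	Dyck Language	Pattern Induction	Pattern Matching	Variable Substitution	Arithmetic	Linear Equation	Bool Logic	Deductive Logic	Abductive Logic	Reachability	Max Sum Path	Copyright	Narrative Reiteration	Text Classification	Sentiment Analysis	Summarization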
Claude-3.5-Sonnet-20240620	0.98	0.671	0.888	0.339	0.667	0.525	0.911	1	0.811	0.806	1	0.999	0.622	0.984	0.445	0.002	0.164	0.605	0.891	0.284
Gemini-1.5-Pro-001	0.909	0.246	0.771	0.277	0.028	0.407	0.767	1	0.719	0.511	1	0.999	0.674	0.856	0.362	0	0.16	0.553	0.831	0.27
GPT-4o-2024-05-13	0.968	0.64	0.876	0.258	0.209	0.314	0.841	1	0.553	0.595	0.998	0.994	0.272	0.853	0.277	0.003	0.104	0.613	0.887	0.27
Deepseek-v2-API-0628	0.917	0.534	0.76	0.261	0.289	0.524	0.813	0.999	0.711	0.193	1	0.996	0.598	0.739	0.386	0.006	0.117	0.474	0.906	0.249
Yi-Large	0.95	0.496	0.856	0.289	0.394	0.424	0.649	0.996	0.552	0.125	1	0.921	0.553	0.757	0.118	0.004	0.089	0.533	0.89	0.307
Llama-3-70B-Instruct	0.927	0.444	0.825	0.333	0.018	0.398	0.708	0.983	0.509	0.025	1	0.782	0.503	0.541	0.013	0.003	0.149	0.544	0.907	0.32
Qwen-Max-0428	0.948	0.498	0.762	0.329	0.217	0.394	0.624	0.977	0.503	0.212	0.999	0.817	0.216	0.811	0.205	0.003	0.091	0.42	0.874	0.239
GLM-4-0520	0.883	0.373	0.735	0.227	0.013	0.244	0.771	0.975	0.418	0.126	1	0.737	0.301	0.704	0.162	0.003	0.118	0.478	0.89	0.28
Llama-3-8B-Instruct	0.776	0.308	0.64	0.218	0.127	0.5	0.471	0.862	0.311	0.005	0.998	0.935	0.447	0.463	0.009	0.002	0.162	0.459	0.872	0.315
InternLM2-Chat-20B	0.772	0.34	0.575	0.244	0.039	0.582	0.511	0.889	0.26	0	0.995	0.619	0.429	0.399	0	0.004	0.093	0.491	0.887	0.294
GLM-4-9B-Chat	0.82	0.312	0.571	0.156	0.042	0.28	0.388	0.865	0.419	0.122	0.99	0.835	0.116	0.523	0.009	0.004	0.163	0.355	0.891	0.28
Yi-1.5-9B-Chat	0.608	0.213	0.625	0.188	0.037	0.331	0.502	0.914	0.493	0.053	1	0.554	0.311	0.505	0	0.006	0.098	0.324	0.876	0.275
Qwen2-7B-Instruct	0.752	0.204	0.501	0.2	0.004	0.336	0.484	0.655	0.101	0.007	0.939	0.71	0.379	0.44	0	0.003	0.103	0.259	0.853	0.249
Vicuna-13B-v1.5	0.749	0.312	0.647	0.229	0.057	0.076	0.371	0.473	0.09	0	0.856	0.619	0.387	0.076	0	0.006	0.139	0.331	0.826	0.312
Baichuan2-13B-Chat	0.6	0.289	0.584	0.206	0.084	0.171	0.344	0.525	0.018	0	0.511	0.54	0.26	0.089	0	0.004	0.212	0.027	0.892	0.281

Data Version: 2024-07-04
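To read the table by task group, one option is to average a model's scores within each group. The sketch below is only a reading aid: it is not the leaderboard's aggregation method, the group sizes are taken from the column layout of the table, and the direction of the Harms metrics is not stated here, so those means should not be read as higher-is-better.

```python
# Number of columns per task group, in table order (derived from the header).
GROUP_SIZES = [
    ("Language", 3),
    ("Knowledge", 1),
    ("Reasoning", 11),
    ("Harms", 2),
    ("Applications", 3),
]


def group_means(row: str) -> tuple[str, dict[str, float]]:
    """Split one whitespace- or tab-separated table row into per-group mean scores."""
    model, *cells = row.split()
    scores = [float(c) for c in cells]
    expected = sum(size for _, size in GROUP_SIZES)
    if len(scores) != expected:
        raise ValueError(f"expected {expected} scores, got {len(scores)}")
    means: dict[str, float] = {}
    start = 0
    for name, size in GROUP_SIZES:
        chunk = scores[start:start + size]
        means[name] = sum(chunk) / size
        start += size
    return model, means


# First data row of the English results table.
row = ("Claude-3.5-Sonnet-20240620 0.98 0.671 0.888 0.339 0.667 0.525 0.911 1 "
       "0.811 0.806 1 0.999 0.622 0.984 0.445 0.002 0.164 0.605 0.891 0.284")
model, means = group_means(row)
for name, value in means.items():
    print(f"{model}\t{name}\t{value:.3f}")
```

An unweighted mean treats every task in a group equally; a different weighting would change the group-level picture.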

🖊️ Data & Citation

Below are the links to request past versions of the C²LEVA test sets.

2024-07-04