C2LEVA: Toward Comprehensive and Contamination-Free Language Model EVAluation

🎯 Overview

TL;DR: We release C2LEVA, a comprehensive bilingual benchmark with systematic contamination prevention. A large-scale evaluation of 15 large language models demonstrates the effectiveness of C2LEVA.

Recent advances in large language models (LLMs) have shown significant promise, yet their evaluation raises concerns, particularly regarding data contamination due to the lack of access to proprietary training data.

To address this issue, we present C2LEVA, which offers (1) a holistic bilingual (Chinese and English) benchmark encompassing 22 tasks, each targeting a specific application or ability of LLMs; and (2) a trustworthy assessment built on contamination-free tasks, ensured by a systematic contamination prevention strategy that fully automates test data renewal and enforces data protection during benchmark data release.

Our large-scale evaluation of 15 open-source and proprietary models shows that C2LEVA achieves a 94.8% correlation with Chatbot Arena's overall rankings, while being fully transparent and reproducible.
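For intuition, agreement between two leaderboards can be checked with a standard rank correlation. The snippet below is a minimal sketch using hypothetical scores and Spearman's rho purely for illustration; the exact correlation measure and the 94.8% figure above come from the paper, not from this code.

```python
# Illustrative only: comparing two leaderboards with a rank correlation.
# The model names and scores below are hypothetical placeholders.
from scipy.stats import spearmanr

c2leva_scores = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.62}  # hypothetical
arena_elo = {"model_a": 1250, "model_b": 1210, "model_c": 1150}      # hypothetical

models = sorted(c2leva_scores)
rho, p_value = spearmanr(
    [c2leva_scores[m] for m in models],
    [arena_elo[m] for m in models],
)
print(f"Spearman correlation: {rho:.3f} (p={p_value:.3f})")
```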

💡 Highlights

1. Systematic Contamination Prevention.

C2LEVA systematically prevents data contamination from both the passive and the active perspective: on the passive side, it addresses repurposing attacks through contamination detection and data scarcity through data augmentation; on the active side, it applies data protection techniques during benchmark release, prolonging the effectiveness of the passive measures.
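As a rough illustration of the passive side, the sketch below flags test examples whose word n-grams overlap heavily with a training corpus. This is a generic contamination-detection heuristic with assumed parameters (13-gram windows, a 10% overlap threshold), not necessarily the exact detection method C2LEVA uses.

```python
# Illustrative only: n-gram overlap is one common passive contamination check;
# C2LEVA's actual detection and augmentation pipeline may differ.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_example: str, corpus_docs: list[str],
                    n: int = 13, threshold: float = 0.1) -> bool:
    """Flag a test example whose n-grams overlap too much with any corpus document."""
    example_grams = ngrams(test_example, n)
    if not example_grams:
        return False
    for doc in corpus_docs:
        overlap = len(example_grams & ngrams(doc, n)) / len(example_grams)
        if overlap >= threshold:
            return True
    return False
```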

2. A Comprehensive and Contamination-Free Task Taxonomy.

With contamination prevention techniques applied, C2LEVA contains 22 tasks, in English and Simplified Chinese, for application assessment and ability evaluation.

3. Large-Scale Evaluation Experiments.

C2LEVA evaluates 15 open-source and proprietary LLMs. The leaderboard will be continuously maintained and updated.

🛠️ Framework

Our system consists of two stages, data collection and contamination prevention:

  • In the data collection stage, we use a set of crawlers and simulators to gather or synthesize high-quality data and store it in a database.
  • To generate a test set, we apply a series of passive prevention techniques to raw data sampled from the database, followed by the active prevention method (see the sketch after this list).
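
A minimal sketch of this two-stage flow, assuming a simple sampling step and hypothetical record fields ("contaminated", "protected"); the real pipeline and its interfaces are more involved than this:

```python
# Illustrative only: high-level sketch of "sample raw data -> passive prevention
# -> active prevention". Field names here are hypothetical, not C2LEVA's API.
import random

def build_test_set(database: list[dict], sample_size: int = 1000, seed: int = 0) -> list[dict]:
    """Sample raw data from the database, then apply contamination prevention."""
    rng = random.Random(seed)
    raw = rng.sample(database, min(sample_size, len(database)))

    # Passive prevention: e.g. drop items flagged by contamination detection
    # (data augmentation for scarce tasks would also happen here).
    cleaned = [ex for ex in raw if not ex.get("contaminated", False)]

    # Active prevention: mark/protect the released data so that leaked copies
    # can be traced or excluded in later rounds.
    return [{**ex, "protected": True} for ex in cleaned]
```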

🔎 Task Taxonomy & Examples

Example (Typo-Fixing):

Instruction: Please output this exact text, with no changes at all except for fixing the misspellings. Please leave all other stylistic decisions like commas and US vs British spellings as in the original text.

Input: by the impulses put into the line bh the picture at

Output: by the impulses put into the line by the picture at
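
As a rough illustration, an instance like the one above could be evaluated by prompting a model with the instruction plus the source text and comparing its output to the reference with exact match. The prompt template and metric below are assumptions for the sake of the example, not necessarily C2LEVA's exact configuration.

```python
# Illustrative only: scoring a Typo-Fixing instance with exact match.
INSTRUCTION = ("Please output this exact text, with no changes at all except for "
               "fixing the misspellings. Please leave all other stylistic decisions "
               "like commas and US vs British spellings as in the original text.")

source = "by the impulses put into the line bh the picture at"
reference = "by the impulses put into the line by the picture at"

def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the model output matches the reference exactly, else 0.0."""
    return float(prediction.strip() == reference.strip())

prompt = f"{INSTRUCTION}\n\n{source}"  # this prompt would be sent to the model under test
model_output = "by the impulses put into the line by the picture at"  # hypothetical response
print(exact_match(model_output, reference))  # -> 1.0
```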

📊 Leaderboard

English Results

Task columns are grouped by category: Language (English-to-IPA, IPA-to-English, Typo-Fixing), Knowledge (Fact Completion), Reasoning (Dyck Language through Max Sum Path), Harms (Copyright, Narrative Reiteration), and Applications (Text Classification, Sentiment Analysis, Summarization).

| Models | English-to-IPA | IPA-to-English | Typo-Fixing | Fact Completion | Dyck Language | Pattern Induction | Pattern Matching | Variable Substitution | Arithmetic | Linear Equation | Bool Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path | Copyright | Narrative Reiteration | Text Classification | Sentiment Analysis | Summarization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude-3.5-Sonnet-20240620 | 0.98 | 0.671 | 0.888 | 0.339 | 0.667 | 0.525 | 0.911 | 1 | 0.811 | 0.806 | 1 | 0.999 | 0.622 | 0.984 | 0.445 | 0.002 | 0.164 | 0.605 | 0.891 | 0.284 |
| Gemini-1.5-Pro-001 | 0.909 | 0.246 | 0.771 | 0.277 | 0.028 | 0.407 | 0.767 | 1 | 0.719 | 0.511 | 1 | 0.999 | 0.674 | 0.856 | 0.362 | 0 | 0.16 | 0.553 | 0.831 | 0.27 |
| GPT-4o-2024-05-13 | 0.968 | 0.64 | 0.876 | 0.258 | 0.209 | 0.314 | 0.841 | 1 | 0.553 | 0.595 | 0.998 | 0.994 | 0.272 | 0.853 | 0.277 | 0.003 | 0.104 | 0.613 | 0.887 | 0.27 |
| Deepseek-v2-API-0628 | 0.917 | 0.534 | 0.76 | 0.261 | 0.289 | 0.524 | 0.813 | 0.999 | 0.711 | 0.193 | 1 | 0.996 | 0.598 | 0.739 | 0.386 | 0.006 | 0.117 | 0.474 | 0.906 | 0.249 |
| Yi-Large | 0.95 | 0.496 | 0.856 | 0.289 | 0.394 | 0.424 | 0.649 | 0.996 | 0.552 | 0.125 | 1 | 0.921 | 0.553 | 0.757 | 0.118 | 0.004 | 0.089 | 0.533 | 0.89 | 0.307 |
| Llama-3-70B-Instruct | 0.927 | 0.444 | 0.825 | 0.333 | 0.018 | 0.398 | 0.708 | 0.983 | 0.509 | 0.025 | 1 | 0.782 | 0.503 | 0.541 | 0.013 | 0.003 | 0.149 | 0.544 | 0.907 | 0.32 |
| Qwen-Max-0428 | 0.948 | 0.498 | 0.762 | 0.329 | 0.217 | 0.394 | 0.624 | 0.977 | 0.503 | 0.212 | 0.999 | 0.817 | 0.216 | 0.811 | 0.205 | 0.003 | 0.091 | 0.42 | 0.874 | 0.239 |
| GLM-4-0520 | 0.883 | 0.373 | 0.735 | 0.227 | 0.013 | 0.244 | 0.771 | 0.975 | 0.418 | 0.126 | 1 | 0.737 | 0.301 | 0.704 | 0.162 | 0.003 | 0.118 | 0.478 | 0.89 | 0.28 |
| Llama-3-8B-Instruct | 0.776 | 0.308 | 0.64 | 0.218 | 0.127 | 0.5 | 0.471 | 0.862 | 0.311 | 0.005 | 0.998 | 0.935 | 0.447 | 0.463 | 0.009 | 0.002 | 0.162 | 0.459 | 0.872 | 0.315 |
| InternLM2-Chat-20B | 0.772 | 0.34 | 0.575 | 0.244 | 0.039 | 0.582 | 0.511 | 0.889 | 0.26 | 0 | 0.995 | 0.619 | 0.429 | 0.399 | 0 | 0.004 | 0.093 | 0.491 | 0.887 | 0.294 |
| GLM-4-9B-Chat | 0.82 | 0.312 | 0.571 | 0.156 | 0.042 | 0.28 | 0.388 | 0.865 | 0.419 | 0.122 | 0.99 | 0.835 | 0.116 | 0.523 | 0.009 | 0.004 | 0.163 | 0.355 | 0.891 | 0.28 |
| Yi-1.5-9B-Chat | 0.608 | 0.213 | 0.625 | 0.188 | 0.037 | 0.331 | 0.502 | 0.914 | 0.493 | 0.053 | 1 | 0.554 | 0.311 | 0.505 | 0 | 0.006 | 0.098 | 0.324 | 0.876 | 0.275 |
| Qwen2-7B-Instruct | 0.752 | 0.204 | 0.501 | 0.2 | 0.004 | 0.336 | 0.484 | 0.655 | 0.101 | 0.007 | 0.939 | 0.71 | 0.379 | 0.44 | 0 | 0.003 | 0.103 | 0.259 | 0.853 | 0.249 |
| Vicuna-13B-v1.5 | 0.749 | 0.312 | 0.647 | 0.229 | 0.057 | 0.076 | 0.371 | 0.473 | 0.09 | 0 | 0.856 | 0.619 | 0.387 | 0.076 | 0 | 0.006 | 0.139 | 0.331 | 0.826 | 0.312 |
| Baichuan2-13B-Chat | 0.6 | 0.289 | 0.584 | 0.206 | 0.084 | 0.171 | 0.344 | 0.525 | 0.018 | 0 | 0.511 | 0.54 | 0.26 | 0.089 | 0 | 0.004 | 0.212 | 0.027 | 0.892 | 0.281 |

Data Version: 2024-07-04
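
To read the table at the category level, per-task scores can be collapsed into category means. The snippet below uses a plain unweighted mean over a few of the categories above; C2LEVA's official aggregation may weight or combine tasks differently.

```python
# Illustrative only: collapsing per-task scores into category means.
from statistics import mean

CATEGORIES = {
    "Language": ["English-to-IPA", "IPA-to-English", "Typo-Fixing"],
    "Knowledge": ["Fact Completion"],
    "Harms": ["Copyright", "Narrative Reiteration"],
}

def category_scores(task_scores: dict[str, float]) -> dict[str, float]:
    """Return the unweighted mean of the available task scores in each category."""
    out = {}
    for cat, tasks in CATEGORIES.items():
        present = [task_scores[t] for t in tasks if t in task_scores]
        if present:
            out[cat] = mean(present)
    return out

# Example with Claude-3.5-Sonnet-20240620's Language columns from the table above:
print(category_scores({"English-to-IPA": 0.98, "IPA-to-English": 0.671, "Typo-Fixing": 0.888}))
# -> {'Language': 0.8463333333333333}
```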

🖊️ Data & Citation

Below are the links to request the past versions of C2LEVA test sets.

Please cite our paper if you find our work helpful:

@misc{li2023cleva,
      title={CLEVA: Chinese Language Models EVAluation Platform}, 
      author={Yanyang Li and Jianqiao Zhao and Duo Zheng and Zi-Yuan Hu and Zhi Chen and Xiaohui Su and Yongfeng Huang and Shijia Huang and Dahua Lin and Michael R. Lyu and Liwei Wang},
      year={2023},
      eprint={2308.04813},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2308.04813}, 
}
    
@misc{li2024c2leva,
      title={C$^2$LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation}, 
      author={Yanyang Li and Tin Long Wong and Cheung To Hung and Jianqiao Zhao and Duo Zheng and Ka Wai Liu and Michael R. Lyu and Liwei Wang},
      year={2024},
      eprint={2412.04947},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.04947}, 
}