Tag

#llm-evaluation

Every item tagged llm-evaluation, newest first.

2 items

Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning

Researchers frame LLM evaluation under selective human supervision as a positive-unlabeled learning problem. They propose a method to quantify and audit biases in LLM-as-a-Judge systems and find systematic issues such as verbosity bias. The approach gives builders a way to assess how reliable an LLM judge is in real-world settings where only some outputs receive human supervision.

Key takeaways

LLM-as-a-Judge systems show systematic biases like verbosity bias.
Positive-unlabeled learning can quantify LLM evaluation biases.
The method helps assess LLM reliability in real-world scenarios.

arXiv #llm-evaluation #positive-unlabeled-learning #bias-auditing

research Dec 4

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

The AraGen benchmark and leaderboard evaluate LLMs on Arabic text generation tasks using the 3C3H measure. 3C3H scores each response on six dimensions: correctness, completeness, conciseness, helpfulness, honesty, and harmlessness. You can use AraGen to compare model performance on Arabic language tasks. The AraGen leaderboard ranks models like Llama-3, Mixtral, and Gemma.

Key takeaways

AraGen evaluates LLMs on Arabic text generation using the 3C3H measure.
3C3H scores correctness, completeness, conciseness, helpfulness, honesty, and harmlessness.
Leaderboard compares models like Llama-3, Mixtral, and Gemma.

Hugging Face Blog #llm-evaluation #arabic-language #benchmarks