Tag

#ai-benchmarks

Every item tagged ai-benchmarks, newest first.

3 items

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Researchers created GameCraft-Bench, a benchmark that evaluates whether AI agents can build playable games end-to-end in a real game engine. It uses a popular open-source game engine and provides a dataset for training and testing AI models. The project is available on GitHub and Hugging Face.

Key takeaways

GameCraft-Bench evaluates AI agents building playable games in a real game engine.
The benchmark includes a dataset for training and testing AI models.
Project resources are available on GitHub and Hugging Face.

r/LocalLLaMA #game-development #ai-benchmarks #open-source

research 17h

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

Researchers introduced TxBench-PP, a benchmark for evaluating AI agents in small-molecule preclinical pharmacology. It provides a standardized way to assess AI performance in drug discovery, focusing on realistic program decisions. The benchmark aims to facilitate trusted evaluation and deployment of AI agents in this field. You can use TxBench-PP to compare AI models and improve their performance in drug discovery.

Key takeaways

TxBench-PP is a benchmark for small-molecule preclinical pharmacology.
It tests AI agents on realistic program decisions in drug discovery.
The benchmark aims to enable trusted evaluation and deployment of AI agents.

arXiv #drug-discovery #ai-benchmarks #pharmacology

research Jul 17

Back to The Future: Evaluating AI Agents on Predicting Future Events

Researchers from Hugging Face and the University of Edinburgh evaluated AI agents on their ability to predict future events. The study used a dataset of past events and asked models to forecast what would happen next. The best-performing model was a fine-tuned Llama-3-8B, which outperformed other models such as GLM-5.2 and Mistral-7B.

Key takeaways

A fine-tuned Llama-3-8B outperformed the other models on future-event prediction.
The study used a dataset of past events to test forecasting ability.
GLM-5.2 and Mistral-7B were also evaluated.

Hugging Face Blog #ai-benchmarks #forecasting #fine-tuning