Benchmarks

Every story we’ve tagged Benchmarks.

Safety & Policy

UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

The UK's AI Security Institute found that standard benchmarks underestimate AI agents' capabilities when computing budgets are limited. The study showed that increasing the token budget can improve success rates by up to 25%.

#AI Regulation #AI Safety #Benchmarks #Responsible AI

The Decoder · 8 min read · 12h ago

Research

Long Context vs. Short Context Model: When Does a Long Context Model Win?

Anthropic's Claude 3.5 Sonnet outperforms GPT-4o in long-context tasks, with strengths in summarization and code analysis.

#Benchmarks #Long Context #Model Release #Reasoning Models

Towards Data Science · 51 min read · 14h ago

Launches

Meta Watermelon 🍉, Anthropic Samsung chips 🤝, autoresearch in practice 📈

Meta's new model, Watermelon, matches GPT-5.5 on benchmarks. It is still in training and uses significant compute.

#AI Search #Benchmarks #Model Release

TLDR AI · 6 min read · 1d ago
