Huawei's new Claw-Anything benchmark shows how poorly current AI agents handle long-running tasks. It simulates persistent digital environments and asks agents to navigate complex, realistic scenarios over extended periods.
The results are sobering. OpenAI's GPT-4.5, currently the most advanced model available, achieved only a 34.5% success rate on the benchmark. The low score exposes a fundamental limitation in today's frontier models: they struggle with sustained reasoning, long-horizon planning, and adapting to dynamic environments in tasks that take months of simulated time to complete.
Huawei designed Claw-Anything to test AI agents beyond single-task performance. The benchmark creates persistent digital worlds where agents must maintain context, make sequential decisions, and handle failure states over extended periods. This mirrors real-world complexity far better than traditional benchmarks that measure isolated, short-duration tasks.
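To make the idea of a persistent, multi-step evaluation concrete, here is a minimal Python sketch of a long-horizon evaluation loop. It is not the Claw-Anything harness, whose internals are not described here. The `Episode` class, the `run_episode` and `success_rate` functions, the toy number-line world, and the `greedy_policy` agent are all hypothetical names invented for illustration. The sketch only shows the general shape: an agent acts step by step in a world that keeps its state and history, and an episode counts as a success only if the goal is reached within a fixed step budget.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Episode:
    """One long-horizon task in a toy persistent world (hypothetical)."""
    goal: int        # target position the agent must reach
    max_steps: int   # step budget, standing in for simulated time
    state: int = 0   # current position; persists across steps
    history: List[int] = field(default_factory=list)  # all past actions


# A policy maps (current state, goal, action history) to the next action.
Policy = Callable[[int, int, List[int]], int]


def run_episode(ep: Episode, policy: Policy) -> bool:
    """Step the world until the goal is reached or the budget runs out."""
    for _ in range(ep.max_steps):
        action = policy(ep.state, ep.goal, ep.history)
        ep.history.append(action)
        ep.state += action
        if ep.state == ep.goal:
            return True
    return False  # budget exhausted: the episode counts as a failure


def success_rate(goals: List[int], max_steps: int, policy: Policy) -> float:
    """Fraction of episodes in which the agent reached its goal."""
    results = [run_episode(Episode(goal=g, max_steps=max_steps), policy)
               for g in goals]
    return sum(results) / len(results)


def greedy_policy(state: int, goal: int, history: List[int]) -> int:
    """Toy agent: always move one unit toward the goal."""
    return 1 if state < goal else -1


# Two goals are reachable within 10 steps; the third (50) is not.
rate = success_rate([3, 5, 50], max_steps=10, policy=greedy_policy)
print(f"success rate: {rate:.1%}")
```

In this toy setup the agent fails only when the task outlasts its step budget. A real long-horizon benchmark would add the harder parts the article describes, such as changing environments and failure states that must be recovered from.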
The benchmark's structure points to why state-of-the-art models falter. GPT-4.5's sub-35% score suggests that current architectures cannot effectively manage context that spans weeks or months of simulated interaction. Agents lose track of objectives, repeat mistakes, and fail to build on what they learned earlier in a task.
This matters for autonomous systems development. Self-driving vehicles, robotic process automation, and autonomous trading algorithms all require the kind of sustained reasoning Claw-Anything tests. A 34.5% success rate suggests that systems relying on such agents remain far from full autonomy in complex, changing environments.
The benchmark positions Huawei as a serious player in AI evaluation methodology. While OpenAI and Anthropic focus on scaling models, Huawei's work highlights gaps in how the industry measures actual competence. Other labs will likely adopt or build competing benchmarks that stress-test long-horizon reasoning.
GPT-4.5's failure on Claw-Anything doesn't diminish its capabilities on conventional tasks. Rather, it clarifies that current AI agents lack the temporal coherence and adaptive planning needed for fully autonomous operation. Future model improvements will likely target the specific weaknesses the benchmark revealed.
