The humanoid benchmark landscape.
We mapped every public humanoid and embodied-AI benchmark we could find: 41 of them, across simulation, datasets, standards, and competitions. Most live in a simulator. Almost none score a real humanoid doing real work. Physical Turing is the humanoid testing company closing that gap: we're running the real-world trials now and publishing them as the Physical Turing Index.
Almost everything is measured in a simulator.
Benchmarking humanoids is booming, but the work clusters in simulation and narrow lab rigs. Datasets, standards, and competitions each cover a slice; very few put a whole humanoid through real-world conditions and publish a comparable score. Here is the public field.
- 41 public benchmarks mapped (humanoid + embodied-AI)
- 23 run in simulation (incl. sim-to-real transfer)
- 18 real-world, not sim (standards · datasets · contests)
- 16 built for humanoids (the rest are adapted)
- 2024 · Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation
  27 whole-body humanoid tasks (locomotion + bimanual manipulation) on a simulated Unitree H1 with dexterous hands, built to stress hierarchical control and long-horizon reasoning.
  Domains: Locomotion, Dexterity, Autonomy · UC Berkeley (Sferrazza et al.)
- 2023
  An imitation-learning benchmark spanning humanoid, quadruped, and musculoskeletal embodiments with real motion-capture references, used to evaluate whole-body locomotion in MuJoCo.
  Domains: Locomotion · TU Darmstadt (Al-Hafez et al.)
- 2024
  A mobile bi-manual humanoid benchmark of 40 demonstration-driven household tasks requiring coordinated whole-body manipulation and locomotion.
  Domains: Dexterity, Locomotion, Autonomy · Imperial College London (Chernyadev et al.)
- 2024
  A sim-to-real RL framework for humanoid locomotion with an Isaac Gym → MuJoCo verification stage, demonstrated zero-shot on physical RobotEra and Unitree hardware.
  Domains: Locomotion · Shanghai Qi Zhi Institute (Gu et al.)
- 2025 · Aligning Simulation and Real Physics for Agile Humanoid Whole-Body Skills
  Learns a residual delta-action model from real rollouts to close the dynamics mismatch, enabling agile whole-body skills on a real Unitree G1.
  Domains: Locomotion · CMU + NVIDIA (He et al.)
- 2024 · Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation
  A whole-body teleoperation and learning system for a Unitree H1 with dexterous hands, deploying sim-trained policies on real hardware across manipulation and locomotion.
  Domains: Locomotion, Dexterity, Interaction · CMU (He et al.)
- 2025 · Versatile Neural Whole-Body Controller for Humanoid Robots
  A single neural whole-body controller distilled from a motion-imitation policy that unifies multiple control modes and transfers to a real Unitree H1.
  Domains: Locomotion, Dexterity · NVIDIA + CMU (He et al.)
- 2025 · Learning Getting-Up Policies for Real-World Humanoid Robots
  A curriculum that learns fall-recovery / getting-up policies in simulation and deploys them on a real Unitree G1; one of the few works targeting recovery, not nominal gait.
  Domains: Locomotion, Safety · UIUC + Simon Fraser University (He et al.)
- 2025
  A multi-simulator (Isaac Gym / Isaac Sim / Genesis) RL framework for humanoid whole-body control: standardized training pipelines rather than a fixed task set or scoreboard.
  Domains: Locomotion, Dexterity · CMU LeCAR-Lab
- 2025
  An open-source, low-cost (~1 m, ~16 kg) humanoid with sim-trained locomotion deployed on accessible hardware, lowering the barrier to real-world humanoid experimentation.
  Domains: Locomotion · UC Berkeley
- RoboCasa · Simulation · 2024
  A large-scale simulated kitchen suite (100 tasks across 120 scenes) for generalist manipulation and imitation learning, built on robosuite.
  Domains: Dexterity, Autonomy · UT Austin + NVIDIA (Nasiriany et al.)
- Meta-World · Simulation · 2019
  A 50-task robotic-arm manipulation benchmark for multi-task and meta-reinforcement learning, a long-standing standard for sim manipulation generalization.
  Domains: Dexterity, Autonomy · Stanford / UC Berkeley (Yu et al.)
- RLBench · Simulation · 2019
  A 100+ task vision-guided manipulation benchmark on a Franka arm in CoppeliaSim, widely used for language-conditioned and few-shot manipulation.
  Domains: Dexterity, Autonomy · Imperial College London (James et al.)
- CALVIN · Simulation · 2022
  A benchmark for long-horizon, language-conditioned manipulation that scores chained instruction-following over sequences of tabletop tasks.
  Domains: Dexterity, Autonomy · University of Freiburg (Mees et al.)
- ManiSkill · Simulation · 2021
  A GPU-accelerated (SAPIEN) benchmark and environment suite for generalizable manipulation, now in its third generation with large-scale parallel simulation.
  Domains: Dexterity, Autonomy · UC San Diego (Hao Su Lab)
- robosuite · Simulation · 2020
  A modular MuJoCo-based manipulation framework and standardized task suite that underpins many downstream benchmarks (including RoboCasa).
  Domains: Dexterity · ARISE Initiative (Stanford / UT Austin)
- LIBERO · Simulation · 2023
  A lifelong-learning manipulation benchmark of 130 language-conditioned tasks designed to study knowledge transfer in robot policies.
  Domains: Dexterity, Autonomy · UT Austin (Liu et al.)
- RoboTwin 2.0 · Simulation · 2025
  A scalable dual-arm manipulation benchmark with strong domain randomization and automated data generation; the current iteration of the RoboTwin line.
  Domains: Dexterity, Autonomy · HKU + AgiBot
- SIMPLER · Simulation · 2024 · Simulated Manipulation Policy Evaluation for Real Robots
  A simulator built to predict real-robot manipulation performance: a cheaper, reproducible proxy that correlates with physical evaluations of generalist policies.
  Domains: Dexterity, Autonomy · Google DeepMind + UC San Diego + Stanford
- Habitat · Simulation · 2021 · Habitat 2.0 / 3.0
  A high-performance embodied-AI simulator for navigation and rearrangement; Habitat 3.0 adds simulated humanoids for human-robot collaboration research.
  Domains: Autonomy, Interaction · Meta AI (FAIR)
- AI2-THOR / RoboTHOR · Simulation · 2017
  An interactive 3D embodied-AI simulator and sim-to-real navigation benchmark for object navigation and task completion in household scenes.
  Domains: Autonomy, Interaction · Allen Institute for AI
- BEHAVIOR-1K · Simulation · 2022
  1,000 everyday household activities in OmniGibson, targeting long-horizon, full-house embodied tasks aligned with human time-use surveys.
  Domains: Dexterity, Autonomy, Interaction · Stanford Vision and Learning Lab
- VLN-CE / R2R · Simulation · 2020 · Vision-and-Language Navigation in Continuous Environments
  The standard benchmark family for instruction-following navigation, evaluating agents that follow natural-language route directions in photorealistic 3D scenes.
  Domains: Autonomy, Interaction · Oregon State / Georgia Tech + community
- FMB · Real-world · 2024 · Functional Manipulation Benchmark for Generalizable Robotic Learning
  A physical manipulation benchmark with standardized 3D-printed objects and assembly tasks plus a large real-robot dataset, designed for reproducible hardware evaluation.
  Domains: Dexterity · UC Berkeley (Luo et al.)
- YCB Object & Model Set · Standard · 2015
  The de facto standard physical object set and protocols for benchmarking grasping and manipulation on real hardware across labs.
  Domains: Dexterity · Yale–CMU–Berkeley
- GraspNet-1Billion · Real-world · 2020
  A large real-world grasping dataset and evaluation with a billion grasp-pose annotations across cluttered scenes, a reference benchmark for grasp detection.
  Domains: Dexterity · Shanghai Jiao Tong University (Fang et al.)
- 2018
  Standardized physical task boards (peg-in-hole, connectors, fasteners) with timed scoring for repeatable, cross-lab assessment of real manipulation dexterity.
  Domains: Dexterity · NIST (with IEEE RAS challenges)
- Robotic Grasping & Manipulation Competition · Competition · 2016
  A recurring real-hardware competition at ICRA/IROS scoring grasping and manipulation on standardized tasks; one of the few sustained real-world manipulation contests.
  Domains: Dexterity · IEEE RAS / NIST
- RoboArena & AutoEval · Real-world · 2025 · Distributed real-robot policy evaluation
  An emerging class of distributed real-robot evaluation efforts (multi-lab comparison, autonomous scoring) seeking reproducible physical policy evaluation beyond simulation.
  Domains: Dexterity, Autonomy · UC Berkeley + Stanford (community)
- Open X-Embodiment / RT-X · Standard · 2023
  A consolidated cross-embodiment dataset (1M+ real-robot trajectories from 22 embodiments) that underpins generalist-policy training rather than a single scoreboard.
  Domains: Dexterity, Autonomy · Google DeepMind + 30+ institutions
- 2024
  An industry-oriented effort to define repeatable real-world test protocols for humanoids in manufacturing; one of few initiatives aimed squarely at physical humanoid evaluation.
  Domains: Dexterity, Locomotion, Endurance, Safety · Fraunhofer IPA
- 2015
  The landmark real-world humanoid disaster-response competition (driving, doors, valves, debris, stairs) that exposed how brittle physical humanoid autonomy was in the field.
  Domains: Locomotion, Dexterity, Autonomy, Safety · DARPA
- ANA Avatar XPRIZE · Competition · 2022
  A real-world teleoperated-avatar competition scoring remote manipulation, mobility, and human interaction through an embodied robot, judged on task and experience.
  Domains: Dexterity, Interaction, Autonomy · XPRIZE Foundation
- 2025
  A large-scale real-world humanoid competition spanning athletics, combat, and task events: high-profile, but with limited standardized, reproducible scoring.
  Domains: Locomotion, Dexterity, Endurance, Autonomy · Beijing (government / industry consortium)
- 2025
  A 21 km real-world humanoid running event: a rare public endurance stress test, though battery swaps and low finish rates make it more spectacle than standardized benchmark.
  Domains: Locomotion, Endurance · Beijing E-Town organizers
- 1997
  Long-running real-world robot competitions: the Humanoid League scores bipedal soccer and the @Home League scores domestic service tasks and human interaction under standardized rules.
  Domains: Locomotion, Autonomy, Interaction, Dexterity · RoboCup Federation
- ISO 10218 / ISO/TS 15066 · Standard · 2011
  The core industrial-robot safety standards; ISO/TS 15066 defines collaborative force/pressure limits for human-robot contact, the basis for any physical safety evaluation.
  Domains: Safety · ISO
- ISO 13482 · Standard · 2014
  The safety standard for personal-care robots (mobile servant, physical assistant, person carrier); the closest existing standard to free-roaming humanoids around people.
  Domains: Safety, Interaction · ISO
- 2024 · Safety for industrial mobile + humanoid robots (in development)
  An in-development standard explicitly scoping safety for industrial mobile robots including legged/humanoid form factors; the first standards-track work aimed at humanoids.
  Domains: Safety · ISO TC299
- ASTM F45 / E54.09 · Standard · 2013
  Standardized test methods for mobile-robot performance (navigation, docking, obstacle handling) and emergency-response robots: repeatable physical test artifacts and scoring.
  Domains: Locomotion, Autonomy, Dexterity · ASTM International
- IEC 61508 / 61496 / 62998 · Standard · 2010
  Functional-safety and safety-sensor standards (SIL levels, protective equipment, smart sensing) governing the safety subsystems any deployed humanoid must satisfy.
  Domains: Safety · IEC
Where the field falls short — and what we're testing.
Mapping the public benchmarks onto our six capability domains shows the pattern: simulation coverage runs deep, real-world coverage is thin, and some domains are barely measured at all. Physical Turing is running real-world trials to fill each row.
- Safety & Compliance · 25% of Index · Simulation: Sparse · Real-world: Partial · In trials
  Real-world coverage is only pass/fail conformance: contact-force limits and functional safety. No public benchmark scores how a full humanoid behaves around people, recovers from a fall, or fails gracefully under real disturbance.
  Public work: ISO/TS 15066, ISO 13482, ISO/AWI 25785-1, IEC 61508
- Dexterity & Manipulation · 20% of Index · Simulation: Mature · Real-world: Partial · In trials
  Simulation is saturated, but real-world manipulation benchmarks are mostly fixed-arm grasping and assembly. Reproducible scoring of dexterous bimanual humanoid manipulation under clutter barely exists outside competitions.
  Public work: Meta-World, ManiSkill, RoboCasa, FMB, YCB, NIST boards
- Locomotion & Balance · 20% of Index · Simulation: Mature · Real-world: Sparse · In trials
  Whole-body locomotion is heavily benchmarked in simulation, but real-world results are demos and one-off events (marathons, getting-up clips). There is no standardized real-world test of gait robustness over terrain, perturbation, and time.
  Public work: HumanoidBench, LocoMuJoCo, ASAP, HumanUP, ASTM F45
- Autonomy & Perception · 20% of Index · Simulation: Mature · Real-world: Sparse · Coming soon
  Long-horizon task autonomy is almost entirely evaluated in simulators. Real-world autonomy rests on an aging competition (DRC) and nascent distributed-eval efforts, with no sustained public leaderboard for physical humanoid task completion.
  Public work: Habitat, BEHAVIOR-1K, VLN-CE, RoboArena, DARPA DRC
- Human Interaction · 10% of Index · Simulation: Sparse · Real-world: Sparse · Coming soon
  The least-benchmarked domain. Simulators add scripted human avatars and competitions judge it subjectively, but there is no public, repeatable measurement of how a physical humanoid behaves safely and legibly with real people.
  Public work: Habitat 3.0, RoboCup @Home, ANA Avatar XPRIZE, ISO 13482
- Endurance & Reliability · 5% of Index · Simulation: None · Real-world: Sparse · Coming soon
  Essentially unmeasured publicly. Simulation ignores battery, heat, and wear, and real-world evidence is limited to spectacle events. There is no standardized duty-cycle or mean-time-between-failures benchmark for humanoids.
  Public work: Beijing Half-Marathon, World Humanoid Robot Games, Fraunhofer IPA
The real-world leaderboard the field is missing.
One 0–100 score per platform — a weighted composite across all six domains, measured on real hardware in the conditions a humanoid will actually work in. The board fills in column by column as each platform clears a domain's battery. Most cells are still in trials or queued; the format below shows where the numbers land.
| Platform | Class | Safety | Dexterity | Locomotion | Autonomy | Interaction | Endurance | PT Index |
|---|---|---|---|---|---|---|---|---|
| Platform A | Full-size | 94 | Trials | 88 | Trials | Soon | Soon | Pending |
| Platform B | Compact | Trials | 91 | Trials | Soon | Soon | Soon | Pending |
| Platform C | Full-size | Trials | Trials | 85 | Soon | Soon | Soon | Pending |
| Platform D | Lightweight | Soon | Soon | Soon | Soon | Soon | Soon | Pending |
| Platform E | Compact | Soon | Soon | Soon | Soon | Soon | Soon | Pending |
Your platform here. Enroll a humanoid and we publish its results on the board.
The numeric scores above are illustrative sample data shown to demonstrate the format; platforms are anonymized and these are not real evaluations. The “in trials” cells are domains we are actively measuring on real hardware right now; “coming soon” domains are queued. Our first published results land as those trials complete.
How the Index works.
The PT Index is a weighted 0–100 composite. We run the same instrumented protocols on every platform, log everything, and publish a result your engineers can reproduce.
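As a rough illustration of what a weighted 0–100 composite means, the sketch below combines six per-domain scores using the "% of Index" weights listed in the gap analysis above (25/20/20/20/10/5). The function name `pt_index`, the short domain keys, and the rule that the composite stays pending until every domain has a score are assumptions made for this example, not the published scoring method; the sample scores are made up.

```python
from typing import Mapping, Optional

# Domain weights in percent, copied from the "% of Index" figures on this page.
WEIGHTS = {
    "Safety": 25,
    "Dexterity": 20,
    "Locomotion": 20,
    "Autonomy": 20,
    "Interaction": 10,
    "Endurance": 5,
}


def pt_index(scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Weighted 0-100 composite, or None while any domain is unscored.

    Assumption: a platform shows "Pending" until all six domains have a
    numeric score, mirroring the sample leaderboard.
    """
    total = 0.0
    for domain, weight in WEIGHTS.items():
        score = scores.get(domain)
        if score is None:
            return None  # still in trials or queued
        if not 0 <= score <= 100:
            raise ValueError(f"{domain} score {score} is outside 0-100")
        total += weight * score
    return total / sum(WEIGHTS.values())


# Hypothetical platform with all six domains scored.
complete = {
    "Safety": 90, "Dexterity": 80, "Locomotion": 70,
    "Autonomy": 60, "Interaction": 50, "Endurance": 40,
}
print(pt_index(complete))  # 71.5

# Hypothetical platform with only two domains scored so far.
partial = {"Safety": 94, "Locomotion": 88}
print(pt_index(partial))  # None
```

Because Safety carries a quarter of the weight, a ten-point change there moves the composite by 2.5 points, while the same change in Endurance moves it by 0.5.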
Want your platform on the board?
Physical Turing runs the trials so you can ship what's actually ready. Tell us about your policy or humanoid and we'll scope an evaluation.
Questions first? Email support@physicalturing.ai.