Humanoid Benchmarks

The humanoid benchmark landscape.

We mapped every public humanoid and embodied-AI benchmark we could find: 41 of them, across simulation, datasets, standards, and competitions. Most live in a simulator. Almost none score a real humanoid doing real work. Physical Turing is the humanoid testing company closing that gap: we're running the real-world trials now and publishing them as the Physical Turing Index.

41 benchmarks surveyed · Independent · Real-world conditions · Reproducible protocols
The field today

Almost everything is measured in a simulator.

Benchmarking humanoids is booming, but the work clusters in simulation and narrow lab rigs. Datasets, standards, and competitions each cover a slice; very few put a whole humanoid through real-world conditions and publish a comparable score. Here is the public field.

41 public benchmarks mapped (humanoid + embodied-AI)
23 run in simulation (incl. sim-to-real transfer)
18 real-world, not sim (standards · datasets · contests)
16 built for humanoids (the rest are adapted)
41 benchmarks · Public · external work
  • HumanoidBench · Simulation · Humanoid
    2024
    Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation

    27 whole-body humanoid tasks (locomotion + bimanual manipulation) on a simulated Unitree H1 with dexterous hands, built to stress hierarchical control and long-horizon reasoning.

    Locomotion · Dexterity · Autonomy
    UC Berkeley (Sferrazza et al.)
  • LocoMuJoCo · Simulation · Humanoid
    2023

    An imitation-learning benchmark spanning humanoid, quadruped, and musculoskeletal embodiments with real motion-capture references, used to evaluate whole-body locomotion in MuJoCo.

    Locomotion
    TU Darmstadt (Al-Hafez et al.)
  • BiGym · Simulation · Humanoid
    2024

    A mobile bi-manual humanoid benchmark of 40 demonstration-driven household tasks requiring coordinated whole-body manipulation and locomotion.

    Dexterity · Locomotion · Autonomy
    Imperial College London (Chernyadev et al.)
  • Humanoid-Gym · Simulation · Humanoid
    2024

    A sim-to-real RL framework for humanoid locomotion with an Isaac Gym → MuJoCo verification stage, demonstrated zero-shot on physical RobotEra and Unitree hardware.

    Locomotion
    Shanghai Qi Zhi Institute (Gu et al.)
  • ASAP · Sim → real · Humanoid
    2025
    Aligning Simulation and Real Physics for Agile Humanoid Whole-Body Skills

    Learns a residual delta-action model from real rollouts to close the dynamics mismatch, enabling agile whole-body skills on a real Unitree G1.

    Locomotion
    CMU + NVIDIA (He et al.)
  • OmniH2O · Sim → real · Humanoid
    2024
    Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation

    A whole-body teleoperation and learning system for a Unitree H1 with dexterous hands, deploying sim-trained policies on real hardware across manipulation and locomotion.

    Locomotion · Dexterity · Interaction
    CMU (He et al.)
  • HOVER · Sim → real · Humanoid
    2025
    Versatile Neural Whole-Body Controller for Humanoid Robots

    A single neural whole-body controller distilled from a motion-imitation policy that unifies multiple control modes and transfers to a real Unitree H1.

    Locomotion · Dexterity
    NVIDIA + CMU (He et al.)
  • HumanUP · Sim → real · Humanoid
    2025
    Learning Getting-Up Policies for Real-World Humanoid Robots

    A curriculum that learns fall-recovery / getting-up policies in simulation and deploys them on a real Unitree G1 — one of the few works targeting recovery, not nominal gait.

    Locomotion · Safety
    UIUC + Simon Fraser University (He et al.)
  • HumanoidVerse · Simulation · Humanoid
    2025

    A multi-simulator (Isaac Gym / Isaac Sim / Genesis) RL framework for humanoid whole-body control — standardized training pipelines rather than a fixed task set or scoreboard.

    Locomotion · Dexterity
    CMU LeCAR-Lab
  • Berkeley Humanoid Lite · Sim → real · Humanoid
    2025

    An open-source, low-cost (~1 m, ~16 kg) humanoid with sim-trained locomotion deployed on accessible hardware, lowering the barrier to real-world humanoid experimentation.

    Locomotion
    UC Berkeley
  • RoboCasa · Simulation
    2024

    A large-scale simulated kitchen suite (100 tasks across 120 scenes) for generalist manipulation and imitation learning, built on robosuite.

    Dexterity · Autonomy
    UT Austin + NVIDIA (Nasiriany et al.)
  • Meta-World · Simulation
    2019

    A 50-task robotic-arm manipulation benchmark for multi-task and meta-reinforcement learning — a long-standing standard for sim manipulation generalization.

    Dexterity · Autonomy
    Stanford / UC Berkeley (Yu et al.)
  • RLBench · Simulation
    2019

    A 100+ task vision-guided manipulation benchmark on a Franka arm in CoppeliaSim, widely used for language-conditioned and few-shot manipulation.

    Dexterity · Autonomy
    Imperial College London (James et al.)
  • CALVIN · Simulation
    2022

    A benchmark for long-horizon, language-conditioned manipulation that scores chained instruction-following over sequences of tabletop tasks.

    Dexterity · Autonomy
    University of Freiburg (Mees et al.)
  • ManiSkill · Simulation
    2021

    A GPU-accelerated (SAPIEN) benchmark and environment suite for generalizable manipulation, now in its third generation with large-scale parallel simulation.

    Dexterity · Autonomy
    UC San Diego (Hao Su Lab)
  • robosuite · Simulation
    2020

    A modular MuJoCo-based manipulation framework and standardized task suite that underpins many downstream benchmarks (including RoboCasa).

    Dexterity
    ARISE Initiative (Stanford / UT Austin)
  • LIBERO · Simulation
    2023

    A lifelong-learning manipulation benchmark of 130 language-conditioned tasks designed to study knowledge transfer in robot policies.

    Dexterity · Autonomy
    UT Austin (Liu et al.)
  • RoboTwin 2.0 · Simulation
    2025

    A scalable dual-arm manipulation benchmark with strong domain randomization and automated data generation — the current iteration of the RoboTwin line.

    Dexterity · Autonomy
    HKU + AgiBot
  • SIMPLER · Simulation
    2024
    Simulated Manipulation Policy Evaluation for Real Robots

    A simulator built to predict real-robot manipulation performance — a cheaper, reproducible proxy that correlates with physical evaluations of generalist policies.

    Dexterity · Autonomy
    Google DeepMind + UC San Diego + Stanford
  • Habitat · Simulation
    2021
    Habitat 2.0 / 3.0

    A high-performance embodied-AI simulator for navigation and rearrangement; Habitat 3.0 adds simulated humanoids for human-robot collaboration research.

    Autonomy · Interaction
    Meta AI (FAIR)
  • 2017

    An interactive 3D embodied-AI simulator and sim-to-real navigation benchmark for object navigation and task completion in household scenes.

    Autonomy · Interaction
    Allen Institute for AI
  • BEHAVIOR-1K · Simulation
    2022

    1,000 everyday household activities in OmniGibson, targeting long-horizon, full-house embodied tasks aligned with human time-use surveys.

    Dexterity · Autonomy · Interaction
    Stanford Vision and Learning Lab
  • VLN-CE / R2R · Simulation
    2020
    Vision-and-Language Navigation in Continuous Environments

    The standard benchmark family for instruction-following navigation, evaluating agents that follow natural-language route directions in photorealistic 3D scenes.

    Autonomy · Interaction
    Oregon State / Georgia Tech + community
  • FMB · Real-world
    2024
    Functional Manipulation Benchmark for Generalizable Robotic Learning

    A physical manipulation benchmark with standardized 3D-printed objects and assembly tasks plus a large real-robot dataset, designed for reproducible hardware evaluation.

    Dexterity
    UC Berkeley (Luo et al.)
  • YCB

    The de facto standard physical object set and protocols for benchmarking grasping and manipulation on real hardware across labs.

    Dexterity
    Yale–CMU–Berkeley
  • 2020

    A large real-world grasping dataset and evaluation with a billion grasp-pose annotations across cluttered scenes — a reference benchmark for grasp detection.

    Dexterity
    Shanghai Jiao Tong University (Fang et al.)
  • NIST boards

    Standardized physical task boards (peg-in-hole, connectors, fasteners) with timed scoring for repeatable, cross-lab assessment of real manipulation dexterity.

    Dexterity
    NIST (with IEEE RAS challenges)
  • A recurring real-hardware competition at ICRA/IROS scoring grasping and manipulation on standardized tasks — one of the few sustained real-world manipulation contests.

    Dexterity
    IEEE RAS / NIST
  • RoboArena
    2025
    Distributed real-robot policy evaluation

    An emerging class of distributed real-robot evaluation efforts (multi-lab comparison, autonomous scoring) seeking reproducible physical policy evaluation beyond simulation.

    Dexterity · Autonomy
    UC Berkeley + Stanford (community)
  • A consolidated cross-embodiment dataset (1M+ real-robot trajectories from 22 embodiments) that underpins generalist-policy training rather than a single scoreboard.

    Dexterity · Autonomy
    Google DeepMind + 30+ institutions
  • 2024

    An industry-oriented effort to define repeatable real-world test protocols for humanoids in manufacturing — one of few initiatives aimed squarely at physical humanoid evaluation.

    Dexterity · Locomotion · Endurance · Safety
    Fraunhofer IPA
  • DARPA Robotics Challenge · Competition · Humanoid
    2015

    The landmark real-world humanoid disaster-response competition (driving, doors, valves, debris, stairs) that exposed how brittle physical humanoid autonomy was in the field.

    Locomotion · Dexterity · Autonomy · Safety
    DARPA
  • ANA Avatar XPRIZE
    2022

    A real-world teleoperated-avatar competition scoring remote manipulation, mobility, and human interaction through an embodied robot, judged on task and experience.

    Dexterity · Interaction · Autonomy
    XPRIZE Foundation
  • World Humanoid Robot Games · Competition · Humanoid
    2025

    A large-scale real-world humanoid competition spanning athletics, combat, and task events — high-profile, but with limited standardized, reproducible scoring.

    Locomotion · Dexterity · Endurance · Autonomy
    Beijing (government / industry consortium)
  • Beijing Half-Marathon
    2025

    A 21 km real-world humanoid running event — a rare public endurance stress test, though battery swaps and low finish rates make it more spectacle than standardized benchmark.

    Locomotion · Endurance
    Beijing E-Town organizers
  • RoboCup (Humanoid & @Home) · Competition · Humanoid
    1997

    Long-running real-world robot competitions: the Humanoid League scores bipedal soccer and the @Home League scores domestic service tasks and human interaction under standardized rules.

    Locomotion · Autonomy · Interaction · Dexterity
    RoboCup Federation
  • The core industrial-robot safety standards; ISO/TS 15066 defines collaborative force/pressure limits for human-robot contact — the basis for any physical safety evaluation.

    Safety
    ISO
  • ISO 13482 · Standard
    2014

    The safety standard for personal-care robots (mobile servant, physical assistant, person carrier) — the closest existing standard to free-roaming humanoids around people.

    Safety · Interaction
    ISO
  • ISO/AWI 25785-1 · Standard · Humanoid
    2024
    Safety for industrial mobile + humanoid robots (in development)

    An in-development standard explicitly scoping safety for industrial mobile robots including legged/humanoid form factors — the first standards-track work aimed at humanoids.

    Safety
    ISO TC299
  • 2013

    Standardized test methods for mobile-robot performance (navigation, docking, obstacle handling) and emergency-response robots — repeatable physical test artifacts and scoring.

    Locomotion · Autonomy · Dexterity
    ASTM International
  • Functional-safety and safety-sensor standards (SIL levels, protective equipment, smart sensing) governing the safety subsystems any deployed humanoid must satisfy.

    Safety
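Each entry in the list above has the same shape: a name, an evaluation setting, a year, one or more capability tags, and an optional "Humanoid" label. A minimal sketch of how such a catalogue could be stored and filtered follows. The `Benchmark` class, its field names, and the two helper functions are hypothetical illustrations, not part of any published Physical Turing tooling, and only five of the 41 entries are copied in.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """One catalogue entry (hypothetical schema, mirroring the list above)."""
    name: str
    setting: str              # e.g. "Simulation", "Sim → real", "Real-world"
    year: int
    domains: tuple[str, ...]  # capability tags as printed on the entry
    humanoid: bool = False    # True when the entry carries the "Humanoid" label


# Five entries copied from the list above; the full catalogue has 41.
CATALOGUE = [
    Benchmark("HumanoidBench", "Simulation", 2024,
              ("Locomotion", "Dexterity", "Autonomy"), humanoid=True),
    Benchmark("ASAP", "Sim → real", 2025, ("Locomotion",), humanoid=True),
    Benchmark("FMB", "Real-world", 2024, ("Dexterity",)),
    Benchmark("DARPA Robotics Challenge", "Competition", 2015,
              ("Locomotion", "Dexterity", "Autonomy", "Safety"), humanoid=True),
    Benchmark("ISO 13482", "Standard", 2014, ("Safety", "Interaction")),
]


def filter_benchmarks(entries, *, domain=None, setting=None, humanoid_only=False):
    """Return the entries that match every filter that was supplied."""
    result = []
    for entry in entries:
        if domain is not None and domain not in entry.domains:
            continue
        if setting is not None and entry.setting != setting:
            continue
        if humanoid_only and not entry.humanoid:
            continue
        result.append(entry)
    return result


def domain_coverage(entries):
    """Count how many entries carry each capability tag."""
    return Counter(tag for entry in entries for tag in entry.domains)


if __name__ == "__main__":
    for entry in filter_benchmarks(CATALOGUE, domain="Safety"):
        print(entry.name, entry.setting, entry.year)
    print(domain_coverage(CATALOGUE).most_common())
```

Counting tags per domain, as `domain_coverage` does, is one simple way to see which capability domains the public work concentrates on.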
    IEC
The gap

Where the field falls short — and what we're testing.

Mapping the public benchmarks onto our six capability domains shows the pattern: simulation coverage runs deep, real-world coverage is thin, and some domains are barely measured at all. Physical Turing is running real-world trials to fill each row.

  • Safety & Compliance
    25% of Index
    Simulation: Sparse · Real-world: Partial · Physical Turing: In trials

    Real-world coverage is only pass/fail conformance — contact-force limits and functional safety. No public benchmark scores how a full humanoid behaves around people, recovers from a fall, or fails gracefully under real disturbance.

    Public work: ISO/TS 15066 · ISO 13482 · ISO/AWI 25785-1 · IEC 61508
  • Dexterity & Manipulation
    20% of Index
    Simulation: Mature · Real-world: Partial · Physical Turing: In trials

    Simulation is saturated, but real-world manipulation benchmarks are mostly fixed-arm grasping and assembly. Reproducible scoring of dexterous bimanual humanoid manipulation under clutter barely exists outside competitions.

    Public work: Meta-World · ManiSkill · RoboCasa · FMB · YCB · NIST boards
  • Locomotion & Balance
    20% of Index
    Simulation: Mature · Real-world: Sparse · Physical Turing: In trials

    Whole-body locomotion is heavily benchmarked in simulation, but real-world results are demos and one-off events (marathons, getting-up clips). There is no standardized real-world test of gait robustness over terrain, perturbation, and time.

    Public work: HumanoidBench · LocoMuJoCo · ASAP · HumanUP · ASTM F45
  • Autonomy & Perception
    20% of Index
    Simulation: Mature · Real-world: Sparse · Physical Turing: Coming soon

    Long-horizon task autonomy is almost entirely evaluated in simulators. Real-world autonomy rests on an aging competition (DRC) and nascent distributed-eval efforts — no sustained public leaderboard for physical humanoid task completion.

    Public work: Habitat · BEHAVIOR-1K · VLN-CE · RoboArena · DARPA DRC
  • Human Interaction
    10% of Index
    Simulation: Sparse · Real-world: Sparse · Physical Turing: Coming soon

    The least-benchmarked domain. Simulators add scripted human avatars and competitions judge it subjectively, but there is no public, repeatable measurement of how a physical humanoid behaves safely and legibly with real people.

    Public work: Habitat 3.0 · RoboCup @Home · ANA Avatar XPRIZE · ISO 13482
  • Endurance & Reliability
    5% of Index
    Simulation: None · Real-world: Sparse · Physical Turing: Coming soon

    Essentially unmeasured publicly. Simulation ignores battery, heat, and wear, and real-world evidence is limited to spectacle events. There is no standardized duty-cycle or mean-time-between-failures benchmark for humanoids.

    Public work: Beijing Half-Marathon · World Humanoid Robot Games · Fraunhofer IPA
The Physical Turing Index

The real-world leaderboard the field is missing.

One 0–100 score per platform — a weighted composite across all six domains, measured on real hardware in the conditions a humanoid will actually work in. The board fills in column by column as each platform clears a domain's battery. Most cells are still in trials or queued; the format below shows where the numbers land.

Platform    Class        Safety  Dexterity  Locomotion  Autonomy  Interaction  Endurance  PT Index
Platform A  Full-size    94      Trials     88          Trials    Soon         Soon       Pending
Platform B  Compact      Trials  91         Trials      Soon      Soon         Soon       Pending
Platform C  Full-size    Trials  Trials     85          Soon      Soon         Soon       Pending
Platform D  Lightweight  Soon    Soon       Soon        Soon      Soon         Soon       Pending
Platform E  Compact      Soon    Soon       Soon        Soon      Soon         Soon       Pending

Legend: 88 = illustrative sample score · Trials = being measured now · Soon = queued

Your platform here. Enroll a humanoid and we publish its results on the board.

Enroll a platform

The numeric scores above are illustrative sample data shown to demonstrate the format — platforms are anonymized and these are not real evaluations. The “in trials” cells are domains we are actively measuring on real hardware right now; “coming soon” domains are queued. Our first published results land as those trials complete.

Methodology

How the Index works.

The PT Index is a weighted 0–100 composite. We run the same instrumented protocols on every platform, log everything, and publish a result your engineers can reproduce.

Scoring weights
Safety & Compliance: 25%
Dexterity & Manipulation: 20%
Locomotion & Balance: 20%
Autonomy & Perception: 20%
Human Interaction: 10%
Endurance & Reliability: 5%
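As a rough illustration of how the weights above could combine into a single 0–100 composite, here is a minimal sketch. The function name `pt_index` and the rule that a platform stays pending (returns `None`) until all six domains have a score are assumptions made for this example. The published method may aggregate differently.

```python
from typing import Mapping, Optional

# Domain weights as listed under "Scoring weights"; they sum to 100%.
WEIGHTS = {
    "Safety & Compliance": 0.25,
    "Dexterity & Manipulation": 0.20,
    "Locomotion & Balance": 0.20,
    "Autonomy & Perception": 0.20,
    "Human Interaction": 0.10,
    "Endurance & Reliability": 0.05,
}


def pt_index(domain_scores: Mapping[str, float]) -> Optional[float]:
    """Weighted 0-100 composite, or None while any domain is unscored.

    Assumption: a platform with a missing domain stays "Pending" rather
    than being scored on a partial set of domains.
    """
    if any(domain not in domain_scores for domain in WEIGHTS):
        return None
    for domain in WEIGHTS:
        score = domain_scores[domain]
        if not 0.0 <= score <= 100.0:
            raise ValueError(f"{domain} score {score} is outside 0-100")
    return sum(WEIGHTS[domain] * domain_scores[domain] for domain in WEIGHTS)


if __name__ == "__main__":
    # Made-up scores for illustration only; not real evaluation results.
    partial = {"Safety & Compliance": 94.0, "Locomotion & Balance": 88.0}
    print(pt_index(partial))  # None: four domains are still unscored

    complete = {domain: 80.0 for domain in WEIGHTS}
    print(round(pt_index(complete), 1))
```

Because Safety & Compliance carries the largest weight, a given point change there moves the composite five times as much as the same change in Endurance & Reliability.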
What makes a result trustworthy
  • Independent by design. We don't build robots. We have no incentive but the truth about how yours performs.
  • Logged and reproducible. Every result is instrumented, logged, and repeatable. You receive a detailed report along with the raw test logs and visualizations.
  • Messy but controlled. We test in the messy, variable conditions your robot will actually operate in, while keeping everyone safe.

Every capability that clears its threshold earns the Physical Turing Mark — proof the platform passed an independent, real-world bar, not a demo-floor one.

Want your platform on the board?

Physical Turing runs the trials so you can ship what's actually ready. Tell us about your policy or humanoid and we'll scope an evaluation.

Questions first? Email support@physicalturing.ai.