In brief
- Alibaba unveiled the Qwen-Robotic Suite, a trio of AI models designed to handle robot navigation, manipulation, and physics-based world simulation through a unified software stack.
- The company says its models top several robotics benchmarks, using millions of training samples and tens of thousands of hours of open-source robot data.
- Real-world robot deployment remains years away.
Alibaba’s Qwen team dropped the Qwen-Robotic Suite on Tuesday: three foundation models forming what they call a “full stack for embodied intelligence.” Qwen-RobotNav handles mobility. Qwen-RobotManip handles manipulation. Qwen-RobotWorld simulates the physics that make both possible. Each works independently. Together, they’re the Android moment for robotics: the operating system, not the hardware.
📣 Introducing the Qwen-Robotic Suite: Qwen-RobotNav, Qwen-RobotManip, Qwen-RobotWorld, three foundation models, a full stack for embodied intelligence.
🧭 Qwen-RobotNav, the gateway to mobility.
• Unifies 5 navigation tasks in a single model: instruction following, point-goal,… pic.twitter.com/noumjTtTeS
Qwen (@Alibaba_Qwen), June 16, 2026
Alibaba is right now the only company in China spanning chips, cloud, models, serving platforms, and applications. For the company, robotics is the most physical expression of that bet, what is known as embodied AI.
AI agents today rely on LLMs to power their decisions. Robots usually run on machine-learning models which, although advanced, lack the adaptability of generative AI. Physical agents face a different, harder class of failure modes: physics, not prompts.
For these use cases, Alibaba released this new AI suite with three components:
Qwen-RobotNav unifies five navigation tasks (instruction following, point-goal navigation, object search, target tracking, and autonomous driving), each demanding a different visual memory strategy. Most models hardcode one strategy. Qwen-RobotNav exposes a parameterized interface: token budget, temporal decay, and per-camera weights that a planner can reconfigure mid-episode.
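To make the idea of a parameterized observation interface concrete, here is a minimal sketch of how a token budget, a temporal decay factor, and per-camera weights could jointly decide how many visual tokens each past frame receives. Everything here is an assumption for illustration: `ObservationConfig`, `allocate_tokens`, the camera names, and the geometric decay rule are hypothetical and are not taken from Alibaba's release.

```python
from dataclasses import dataclass, field


@dataclass
class ObservationConfig:
    """Hypothetical knobs mirroring the three parameters named in the article."""
    token_budget: int = 512        # total visual tokens available per step
    temporal_decay: float = 0.5    # share multiplier for each step back in time
    camera_weights: dict[str, float] = field(
        default_factory=lambda: {"front": 1.0}
    )


def allocate_tokens(cfg: ObservationConfig, history_len: int) -> dict[tuple[str, int], int]:
    """Split the token budget across (camera, age) pairs.

    Age 0 is the newest frame. Each older frame gets `temporal_decay` times
    the share of the next newer one, scaled by its camera's weight
    (an assumed rule, chosen only to illustrate the idea).
    """
    raw = {
        (cam, age): weight * cfg.temporal_decay ** age
        for cam, weight in cfg.camera_weights.items()
        for age in range(history_len)
    }
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one camera needs a positive weight")
    # Floor each share so the allocation never exceeds the budget.
    return {key: int(cfg.token_budget * share / total) for key, share in raw.items()}


# A planner could swap configurations mid-episode, for example when moving
# from open-ended exploration to tracking a target seen by the front camera.
explore = ObservationConfig(
    token_budget=512, temporal_decay=0.9,
    camera_weights={"front": 1.0, "left": 1.0, "right": 1.0},
)
track = ObservationConfig(
    token_budget=512, temporal_decay=0.3,
    camera_weights={"front": 1.0, "left": 0.2, "right": 0.2},
)

explore_alloc = allocate_tokens(explore, history_len=4)
track_alloc = allocate_tokens(track, history_len=4)
```

Under these assumed settings, the tracking configuration concentrates most of the budget on the newest front-camera frame, while the exploration configuration spreads it more evenly across cameras and time.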
Trained on 15.6 million samples with randomization across all parameters, it achieves 76.5% success on VLN-CE RxR, a benchmark for vision-and-language navigation in real-world environments, and 90% tracking on EVT-Bench, which evaluates an agent’s ability to consistently follow moving targets.

Qwen-RobotManip tackles one of the biggest challenges in robot manipulation: different robots represent actions in fundamentally different ways. A Franka arm (a type of robot with seven axes of motion) operates through joint angles, while an ALOHA robot (a low-cost bimanual robot platform widely used in robotics research) represents actions by the position and orientation of its grippers (end-effector poses). Humanoids add another layer of complexity, using whole-body coordinates.
To bridge these incompatible action spaces, Alibaba synthesized roughly 38,100 hours of training data from open-source robot datasets and human videos, without relying on proprietary data collection. The model ranks first on RoboChallenge Table30-v1, outperforming previous approaches by 20%.
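The article does not describe how Qwen-RobotManip reconciles these action spaces. As a rough illustration of the problem, the sketch below uses a common padding-and-masking convention: each robot's action vector is zero-padded into one fixed-width vector, with a mask marking which slots are real. The names `UnifiedAction`, `to_unified`, `from_unified`, the width `UNIFIED_DIM`, and the 7-value pose layout are all hypothetical.

```python
from dataclasses import dataclass

UNIFIED_DIM = 16  # assumed width, large enough for the toy robots below


@dataclass
class UnifiedAction:
    values: list[float]  # zero-padded action vector of length UNIFIED_DIM
    mask: list[bool]     # True where a slot carries real data
    space: str           # label such as "joint" or "ee_pose"


def to_unified(raw: list[float], space: str) -> UnifiedAction:
    """Pad a robot-specific action into the shared fixed-width layout."""
    if len(raw) > UNIFIED_DIM:
        raise ValueError(f"action has {len(raw)} dims, limit is {UNIFIED_DIM}")
    pad = UNIFIED_DIM - len(raw)
    return UnifiedAction(
        values=list(raw) + [0.0] * pad,
        mask=[True] * len(raw) + [False] * pad,
        space=space,
    )


def from_unified(action: UnifiedAction) -> list[float]:
    """Recover the robot-specific action by dropping the padded slots."""
    return [v for v, keep in zip(action.values, action.mask) if keep]


# Seven joint angles (radians) for a 7-axis arm.
joint_cmd = [0.0, -0.4, 0.1, -1.8, 0.0, 1.5, 0.7]

# Two grippers, each as x, y, z, roll, pitch, yaw, opening (assumed layout).
left_pose = [0.30, 0.15, 0.20, 0.0, 1.57, 0.0, 1.0]
right_pose = [0.30, -0.15, 0.20, 0.0, 1.57, 0.0, 0.0]

arm_action = to_unified(joint_cmd, space="joint")
bimanual_action = to_unified(left_pose + right_pose, space="ee_pose")
```

Padding only gives the two robots a common tensor shape. It does not make slot 3 mean the same thing on both, which is why the `space` label travels with the vector and why a learned alignment across embodiments is still needed.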

Qwen-RobotWorld is the most ambitious: a language-conditioned video world model treating natural language as a universal action interface. “Pick up the purple cup and pour water on the flower” works whether the actor is a gripper, an autonomous vehicle, or a mobile navigation agent.
The Embodied World Data corpus spans 8.6 million video-text pairs, or 200 million frames, across manipulation (5.9 million samples, 1,300+ skills, 20+ morphologies), autonomous driving (Waymo, NVIDIA PhysicalAI-AD, Bench2Drive), indoor navigation (VLNVerse), and human-to-robot transfer across 14 robot arms.
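For a sense of scale, a quick back-of-the-envelope calculation on the corpus figures quoted above gives the average clip length and the share of the corpus devoted to manipulation. It assumes the 5.9 million manipulation samples are counted among the 8.6 million video-text pairs, which the article implies but does not state outright.

```python
# Figures as quoted in the article.
video_text_pairs = 8_600_000
total_frames = 200_000_000
manipulation_samples = 5_900_000

# Average number of frames per video-text pair.
frames_per_pair = total_frames / video_text_pairs

# Fraction of pairs that are manipulation samples (assumes they are a subset).
manipulation_share = manipulation_samples / video_text_pairs

print(f"{frames_per_pair:.1f} frames per pair")
print(f"{manipulation_share:.1%} manipulation")
```

On those numbers, the average pair is only about 23 frames long and roughly two-thirds of the corpus is manipulation data.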
It ranks first on EWMBench and DreamGen Bench, two benchmarks that evaluate whether world models predict and generate realistic physical environments. It also beats all open-source models on WorldModelBench and PBench, and scores perfectly on physics adherence: Newton’s laws, mass conservation, fluid dynamics, gravity.

The ChatGPT of robots?
While Western labs (Google DeepMind, Nvidia, Figure, Physical Intelligence) pursue similar goals, most focus on navigation or manipulation, not a unified, composable suite. Alibaba’s vertical integration from chips through applications means it controls the entire stack. The open-source foundation sets it apart from competitors relying on private robot data.
A few misconceptions are worth clearing up. These aren’t robots but software models: brains, not bodies. They run on hardware from AgileX, Franka, Universal Robots, Unitree, and others.
Also, despite being generative AI models for robots, these aren’t LLMs like your typical ChatGPT. A language model predicts tokens. These models must understand physics, spatial relationships, and the consequences of physical actions. A language model tells you a glass breaks if dropped. Qwen-RobotWorld predicts how it breaks: shatter pattern, fluid dynamics, secondary collisions. Qwen-RobotManip plans a grasp that prevents the drop entirely.
Don’t expect to have your own housemaid robot anytime soon. The gap between a controlled demo of a robot placing fruit in a basket and a robot reliably working in your home is huge. RoboCasa365, LIBERO-Plus, and RoboTwin-Clean2Rand are simulation benchmarks. Real-world deployment introduces sensor noise, actuator drift, and the long tail of edge cases that have humbled every robotics effort in history, and Alibaba acknowledges this.
The technical achievements are real, though. RobotManip’s alignment-first approach solves a genuine bottleneck in cross-embodiment training. RobotNav’s parameterized observation interface is a clever solution to the context-strategy problem. RobotWorld’s language-as-universal-action-interface is the right abstraction for cross-domain world modeling.
Alibaba hasn’t disclosed pricing, timelines, or which customers get access beyond pilot programs.