After dropping OpenAI, Figure released its in-house model Helix at lightning speed, with unprecedented capabilities and a string of firsts


Source: Heart of Machine

After abruptly ending its partnership with OpenAI in February, well-known robotics startup Figure AI revealed the reason on Thursday night: it has built its own general-purpose embodied AI model, Helix.

Helix is a generalist vision-language-action (VLA) model that unifies perception, language understanding, and learned control to overcome multiple long-standing challenges in robotics.

Helix has created several firsts:

Full upper-body control: it is the first VLA model in history to output high-rate continuous control of a humanoid robot's entire upper body, covering the wrists, torso, head, and individual fingers;

Multi-robot collaboration: two robots can run the same model simultaneously and cooperate to complete shared tasks they have never seen before;

Pick up anything: the robots can pick up virtually any small object, including thousands of items they have never encountered before, simply by following natural-language instructions;

Single neural network: Helix uses one set of neural-network weights to learn all behaviors (picking and placing items, using drawers and refrigerators, and interacting across robots) without any task-specific fine-tuning;

Runs locally: Helix is the first robot VLA model to run entirely on embedded, low-power onboard GPUs, making it ready for commercial deployment.

In the field of intelligent driving, car manufacturers are pushing the large-scale deployment of end-to-end technology this year; now, VLA-powered robots have entered the countdown to commercialization. Helix looks like a major breakthrough in embodied intelligence.

A single set of Helix neural-network weights runs on two robots simultaneously as they work together to put away groceries they have never seen before.

A new way to scale humanoid robotics

Figure says the home environment is the biggest challenge facing robotics. Unlike controlled industrial settings, homes are filled with countless irregular objects, such as fragile glassware, crumpled clothing, and scattered toys, each with unpredictable shape, size, color, and texture. For robots to be useful in the home, they must be able to generate intelligent new behaviors on demand.

Current robotics cannot scale to home environments: today, teaching a robot even a single new behavior requires substantial human effort, either hours of manual programming by PhD-level experts or thousands of demonstrations, both prohibitively expensive.

Figure 1: Scaling curves for different ways of acquiring new robot skills. With traditional heuristic manipulation, skill growth depends on expert hand-scripting. With traditional robot imitation learning, skill growth depends on collected data. With Helix, new skills can be specified instantly through language.

Other fields of artificial intelligence have already mastered this kind of instant generalization. If the rich semantic knowledge captured in vision-language models (VLMs) could simply be translated directly into robot actions, it might produce a technological breakthrough.

This new capability would fundamentally change the scaling trajectory of robotics (Figure 1). The key question thus becomes: how can all this commonsense knowledge be extracted from VLMs and translated into generalizable robot control? Figure built Helix to bridge this gap.

Helix: the first "System 1 + System 2" VLA model for robots

Helix is the first "System 1 + System 2" VLA model in the field of robotics, controlling the entire upper body of a humanoid robot with high speed and dexterity.

Figure says previous approaches face a fundamental trade-off: VLM backbones are general but not fast, while robotic visuomotor policies are fast but not general. Helix resolves this trade-off with two complementary systems, trained end-to-end to communicate:

System 1 (S1): a fast, reactive visuomotor policy that converts the latent semantic representations produced by S2 into precise, continuous robot actions at 200 Hz;

System 2 (S2): an onboard, Internet-pretrained VLM running at 7-9 Hz for scene understanding and language comprehension, enabling broad generalization across objects and contexts.

This decoupled architecture lets each system operate on its optimal timescale: S2 can "think slow" about high-level goals, while S1 "thinks fast" to execute and adjust the robot's actions in real time. For example, in collaborative behavior (see the figure below), S1 can rapidly adapt to a partner robot's changing movements while maintaining S2's semantic goals.
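To make the division of labor concrete, here is a minimal sketch of the dual-rate structure in Python. The `s2_model`, `s1_policy`, and `robot` interfaces are hypothetical stand-ins, not Figure's actual API, and the deployed system runs the two models asynchronously (described under "Optimized streaming inference" below) rather than in this simple nested loop.

```python
S2_HZ = 8    # slow loop: VLM scene and language understanding (7-9 Hz)
S1_HZ = 200  # fast loop: continuous visuomotor control

def control_loop(s2_model, s1_policy, robot, instruction):
    # Roughly 25 fast control ticks elapse for every slow latent update.
    steps_per_update = S1_HZ // S2_HZ
    while True:
        # S2 "thinks slow": refresh the semantic latent a few times per second.
        latent = s2_model(robot.observe(), instruction)
        for _ in range(steps_per_update):
            # S1 "thinks fast": react to fresh observations at 200 Hz while
            # the high-level goal (the latent) stays fixed.
            action = s1_policy(robot.observe(), latent)
            robot.execute(action)
```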

Helix enables fast, fine-grained motion adjustments, which are necessary for reacting to a partner while carrying out new semantic goals.

Helix's design has the following key advantages over existing methods:

Speed and generalization: Helix matches the speed of behavioral-cloning policies specialized for single tasks, while generalizing zero-shot to thousands of novel test objects;

Scalability: Helix directly outputs continuous control over a high-dimensional action space, avoiding the complex action-tokenization schemes used in previous VLA methods. Those schemes achieved some success in low-dimensional control settings such as binary parallel grippers, but face scaling challenges in high-dimensional humanoid control;

Architectural simplicity: Helix uses standard architectures: an open-source, open-weight VLM for System 2 and a simple Transformer-based visuomotor policy for System 1;

Separation of concerns: decoupling S1 from S2 allows each system to be iterated on independently, without being constrained to a single unified observation space or action representation.

Figure shared some model and training details. The team collected a high-quality, multi-robot, multi-operator dataset of diverse teleoperated behaviors totaling approximately 500 hours. To generate natural-language-conditioned training pairs, engineers used an auto-labeling vision-language model (VLM) to generate instructions post hoc.

This VLM processes segmented video clips from the robots' onboard cameras with the prompt: "What instruction would you give the robot to make it perform the action seen in this video?" All items handled during training are excluded from evaluation to prevent data contamination.
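As a rough illustration, the auto-labeling step might look like the following sketch; `vlm.generate` is a hypothetical captioning interface rather than a documented API, and only the prompt wording is taken from the article.

```python
PROMPT = ("What instruction would you give the robot to make it "
          "perform the action seen in this video?")

def label_demonstrations(vlm, clips):
    """Turn unlabeled teleoperation clips into language-conditioned pairs."""
    pairs = []
    for clip in clips:  # segmented clips from the robots' onboard cameras
        instruction = vlm.generate(video=clip, prompt=PROMPT)  # hypothetical call
        pairs.append((instruction, clip))  # (text command, demonstration) pair
    return pairs
```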

Model Architecture

The Helix system consists of two main components: S2, a VLM backbone network, and S1, a latent-conditioned visuomotor Transformer.

S2 is built on an open-source, open-weight VLM with 7 billion parameters, pretrained on Internet-scale data. It processes monocular robot images and robot state information (including wrist pose and finger positions) and projects them into the vision-language embedding space. Combined with natural-language instructions specifying the desired behavior, S2 distills all semantic, task-relevant information into a single continuous latent vector, which is passed to S1 to condition its low-level actions.

S1 is an 80-million-parameter cross-attention encoder-decoder Transformer responsible for low-level control. It relies on a fully convolutional multi-scale visual backbone for visual processing, initialized from pretraining done entirely in simulation. While S1 receives the same image and state inputs as S2, it processes them at a higher frequency for more responsive closed-loop control. The latent vector from S2 is projected into S1's token space and concatenated along the sequence dimension with the visual features extracted by S1's visual backbone, providing task conditioning.

At run time, S1 outputs full upper-body humanoid control at 200 Hz, including desired wrist poses, finger flexion and abduction controls, and torso and head orientation targets. Figure also appends a synthetic "task completion percentage" action to the action space, enabling Helix to predict its own termination condition and making it easier to sequence multiple learned behaviors.
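To visualize how S2's latent conditions S1, here is a schematic PyTorch sketch under stated assumptions: the dimensions and layer counts are illustrative guesses, and the real S1 is a cross-attention encoder-decoder, which this simplifies to a plain encoder receiving the projected latent as an extra sequence token (matching the sequence-dimension concatenation described above).

```python
import torch
import torch.nn as nn

NUM_DOF = 35  # upper-body degrees of freedom reported in the article

class S1Policy(nn.Module):
    """Schematic latent-conditioned visuomotor Transformer (System 1)."""

    def __init__(self, d_model=512, latent_dim=4096):
        super().__init__()
        # Project the S2 latent vector into S1's token space.
        self.latent_proj = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # 35 DoF targets plus the synthetic "task completion %" output.
        self.action_head = nn.Linear(d_model, NUM_DOF + 1)

    def forward(self, visual_tokens, state_tokens, s2_latent):
        # (B, latent_dim) -> (B, 1, d_model): one conditioning token.
        latent_token = self.latent_proj(s2_latent).unsqueeze(1)
        # Concatenate along the sequence dimension, as the article describes.
        tokens = torch.cat([latent_token, visual_tokens, state_tokens], dim=1)
        hidden = self.encoder(tokens)
        # Read the continuous actions for this control tick off the latent token.
        return self.action_head(hidden[:, 0])
```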

Training

Helix's training is fully end-to-end: it maps raw pixels and text commands to continuous actions using a standard regression loss.

Gradients flow from S1 back into S2 through the latent communication vector used to condition S1's behavior, allowing joint optimization of the two components.

Helix requires no task-specific adaptation: a single training stage and a single set of neural-network weights suffice, with no separate action heads or per-task fine-tuning stages.

During training, they also add a temporal offset between the S1 and S2 inputs. The offset is calibrated to match the gap between the two systems' deployed inference latencies, ensuring that the real-time control requirements of deployment are accurately reflected in training.
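A minimal sketch of one such training step follows, assuming batches carry aligned frame sequences; the field names, the offset handling, and the use of mean-squared error as the "standard regression loss" are all assumptions for illustration, not Figure's published recipe.

```python
import torch.nn.functional as F

def train_step(s2, s1, batch, optimizer, offset):
    """One end-to-end update; gradients reach S2 through the latent."""
    # S2 consumes observations `offset` frames earlier than S1, mirroring
    # the deployed latency gap between the two systems.
    latent = s2(batch.images[:, 0], batch.instructions)
    pred = s1(batch.images[:, offset], batch.states[:, offset], latent)
    loss = F.mse_loss(pred, batch.actions[:, offset])  # regression loss
    optimizer.zero_grad()
    loss.backward()  # backprop flows from S1 into S2 via the latent vector
    optimizer.step()
    return loss.item()
```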

Optimized streaming inference

Helix's training design enables efficient model-parallel deployment on Figure's robots, each equipped with dual low-power embedded GPUs. The inference pipeline is split between the S2 (high-level latent planning) and S1 (low-level control) models, each running on a dedicated GPU.

S2 runs as an asynchronous background process, consuming the latest observations (onboard camera images and robot state) and natural-language commands. It continuously updates a shared-memory latent vector encoding the high-level behavioral intent.

S1 executes as a separate real-time process whose goal is to maintain the critical 200 Hz control loop required for smooth whole-upper-body motion. Its inputs are the latest observations and the most recent S2 latent vector. Because of the inherent speed difference between S2 and S1 inference, S1 naturally runs at a higher temporal resolution over robot observations, creating a tighter feedback loop for reactive control.

This deployment strategy deliberately mirrors the temporal offset introduced in training, minimizing the train-inference distribution gap. The asynchronous execution model allows both processes to run at their optimal frequencies, letting Helix run as fast as the fastest single-task imitation-learning policies.
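The asynchronous split might be organized like the sketch below, shown with Python threads and a lock-guarded latent for brevity; Figure actually runs each model as a separate process on its own embedded GPU, so treat this purely as a shape-of-the-solution sketch with hypothetical `s1`, `s2`, and `robot` interfaces.

```python
import threading
import time

class SharedLatent:
    """Lock-guarded stand-in for the shared-memory latent vector."""
    def __init__(self, value):
        self._lock = threading.Lock()
        self._value = value
    def write(self, value):
        with self._lock:
            self._value = value
    def read(self):
        with self._lock:
            return self._value

def s2_worker(s2, robot, instruction, shared):
    # Background process (~7-9 Hz): keep the behavioral intent fresh.
    while True:
        shared.write(s2(robot.observe(), instruction))

def s1_loop(s1, robot, shared, hz=200):
    # Real-time process: hold the 200 Hz control loop.
    period = 1.0 / hz
    while True:
        start = time.monotonic()
        action = s1(robot.observe(), shared.read())  # newest obs + latest latent
        robot.execute(action)
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```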

Interestingly, after Figure released Helix, Tsinghua University PhD student Yanjiang Guo noted that its technical approach is quite similar to one of their CoRL 2024 papers, which interested readers can also consult.

Paper: https://arxiv.org/abs/2410.05273

Results

Fine-grained full upper-body VLA control

Helix coordinates a 35-degree-of-freedom action space at 200 Hz, controlling everything from individual finger movements to end-effector trajectories, head gaze, and torso posture.

Head and torso control pose unique challenges: as the head and torso move, they change both what the robot can see and what it can reach, creating feedback loops that have historically caused instability.

Video 3 demonstrates this coordination in practice: the robot smoothly tracks its hands with its head while adjusting its torso for optimal reach, all while maintaining precise finger control for grasping. Previously, achieving this level of precision in such a high-dimensional action space was difficult even for a single, known task. Figure says no prior VLA system has demonstrated this degree of real-time coordination while retaining the ability to generalize across tasks and objects.

Helix's VLA can control the entire upper body of a humanoid robot, the first model in the field of robot learning to do so.

Zero-shot multi-robot collaboration

Figure said they pushed Helix to the limit in a difficult multi-agent manipulation scenario: two Figure robots collaborating on zero-shot grocery storage.

Video 1 shows two fundamental advances: the two robots successfully manipulate entirely novel groceries (items never encountered during training), demonstrating robust generalization across a wide variety of shapes, sizes, and materials.

In addition, both robots run the same Helix model weights, with no robot-specific training or explicit role assignment. Their collaboration is achieved through natural-language prompts such as "hand the bag of cookies to the robot on your right" or "take the bag of cookies from the robot on your left and place it in the open drawer" (see Video 4). This is the first demonstration of flexible, extended multi-robot collaboration using a VLA, an achievement made all the more significant by the fact that the robots successfully handled entirely novel objects.

Helix enables precise multi-robot collaboration

The ability to "pick up anything" emerges

Just a "Pick up [X]" command, the Figure robot equipped with Helix can basically pick up any small household items. In systematic testing, without any prior demonstration or custom programming, the robot successfully handled thousands of new items that were placed in a mess – from glassware and toys to tools and clothes.

Particularly noteworthy is how Helix connects Internet-scale language understanding to precise robot control. For example, when prompted to "pick up the desert item," Helix not only determines that the toy cactus matches this abstract concept, but also selects the nearest hand and executes precise motor commands to grasp it securely. "This general language-to-action grasping capability opens exciting new possibilities for deploying humanoid robots in unstructured environments," said Figure.

Helix can translate high-level instructions such as "Pick up [X]" into low-level actions.

Discussion

Helix is highly training-efficient

Helix achieves powerful object generalization with very few resources. "We trained Helix on a total of about 500 hours of high-quality supervised data, only a small fraction (<5%) of previously collected VLA datasets, and it does not rely on multi-robot embodiment data collection or multiple training stages," they noted, adding that this collection scale is closer to that of modern single-task imitation-learning datasets. Despite the comparatively modest data requirements, Helix scales to the far more challenging action space of full upper-body humanoid control, with high-rate, high-dimensional outputs.

Single weight set

Existing VLA systems typically require specialized fine-tuning or dedicated action heads to optimize performance on different high-level behaviors. Remarkably, Helix uses just one set of neural-network weights (7B for System 2, 80M for System 1) to pick and place items in various containers, operate drawers and refrigerators, perform coordinated, dexterous multi-robot handovers, and manipulate thousands of novel objects.

"Pick up Helix" (Helix means spiral)

Summary

Helix is the first "vision-language-action" model to directly control the entire upper body of a humanoid robot through natural language. Unlike earlier robotic systems, Helix can generate long-horizon, collaborative, and dexterous manipulation on the fly, without any task-specific demonstrations or extensive manual programming.

Helix demonstrates strong object generalization: it can pick up thousands of novel household items with varied shapes, sizes, colors, and material properties, none encountered during training, using only natural-language commands. "This represents a transformative step in Figure's effort to scale humanoid robot behaviors, a step we believe will be crucial as our robots increasingly assist in everyday home environments," the company said.

While these early results are certainly exciting, what we have seen so far is a proof of concept that shows what is possible. The real change will come when Helix can be deployed at scale. Here's hoping that day arrives soon!

Lastly, it is worth noting that Figure's release may be just one small step among this year's many breakthroughs in embodied intelligence. Early this morning, robotics company 1X also officially announced that it will launch a new product soon.
