Learning aggressive animal locomotion skills for quadrupedal robots solely from monocular videos



Released on: Sep 19, 2025
Read: 7 min
Category: Research

2D pose extraction and tracking analysis

This section evaluates the performance of 2D pose estimation and tracking from monocular videos. The first step of 2D skeleton extraction is annotating a subset of the data to fine-tune the DeepLabCut model [21]. As depicted in Fig. S2, we visualize the 2D pose estimation results, coloring the right-leg joints red to distinguish them from the left legs. During visualization, we connect the keypoints according to the dog's real skeletal structure, forming the 2D skeleton graph.
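As an illustration, the keypoint-to-skeleton step can be sketched as follows. The joint names and edge list below are assumptions for a generic quadruped, not the exact keypoint set used to fine-tune DeepLabCut:

```python
import numpy as np

# Hypothetical quadruped keypoints (FL/FR/RL/RR = front/rear, left/right legs).
JOINTS = ["nose", "neck", "spine", "tail_base",
          "FL_hip", "FL_knee", "FL_paw",
          "FR_hip", "FR_knee", "FR_paw",
          "RL_hip", "RL_knee", "RL_paw",
          "RR_hip", "RR_knee", "RR_paw"]
IDX = {name: i for i, name in enumerate(JOINTS)}

# Edges follow the real skeletal structure: a head-spine chain plus one
# hip-knee-paw chain per leg.
EDGES = [("nose", "neck"), ("neck", "spine"), ("spine", "tail_base"),
         ("neck", "FL_hip"), ("FL_hip", "FL_knee"), ("FL_knee", "FL_paw"),
         ("neck", "FR_hip"), ("FR_hip", "FR_knee"), ("FR_knee", "FR_paw"),
         ("tail_base", "RL_hip"), ("RL_hip", "RL_knee"), ("RL_knee", "RL_paw"),
         ("tail_base", "RR_hip"), ("RR_hip", "RR_knee"), ("RR_knee", "RR_paw")]

def skeleton_segments(pose2d):
    """Turn a (J, 2) array of 2D keypoints into line segments for drawing."""
    return [(pose2d[IDX[a]], pose2d[IDX[b]]) for a, b in EDGES]
```

Each returned segment connects two estimated keypoints, so drawing all of them reproduces the 2D skeleton graph overlaid on a frame.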


3D pose estimation module analysis

In the 3D motion estimation module, our proposed skeleton graph convolutional network reconstructs the 3D skeleton of quadrupeds from monocular videos. It requires open-source motion-capture data only to warm up; at deployment it avoids motion-capture devices altogether and can obtain varied, flexible quadruped motion data. First, we qualitatively analyze the 3D skeleton reconstruction results. Keyframes from six videos are displayed at the top, with their corresponding 3D estimation results shown in the middle. The bottom section presents the retargeted result on the robot in PyBullet, where the reference 3D pose of the robot motion is depicted as points. Evidently, our algorithm achieves accurate 3D motion estimation for backflips and bipedal actions without corresponding 3D ground truth in the supervision data, even when joints overlap in the 2D skeleton.
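A minimal sketch of one spatial graph-convolution step over the skeleton, the basic building block such a network stacks. The degree normalization shown here is one common choice and is an assumption; the actual STGNet layers also convolve over time:

```python
import numpy as np

def graph_conv(X, A, W):
    """One spatial graph-convolution step on a skeleton.

    Aggregates each joint's neighbours (plus itself) with a
    degree-normalized adjacency, then applies a learned linear map.
    X: (J, C_in) joint features, A: (J, J) binary adjacency, W: (C_in, C_out).
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))  # inverse degree matrix
    return D_inv @ A_hat @ X @ W              # (J, C_out)
```

Stacking such layers lets information propagate along the skeletal edges, which is what allows the network to resolve 3D depth even when 2D joints overlap.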


Ablation experiment of the STG module

To verify the effectiveness of the designed spatio-temporal graph convolution (STG) module, we adjust the number and size of the STG modules and study their optimal stacking quantity. To achieve optimal performance of STGNet, we first investigate the impact of the number of stacked STG modules. According to Table S2, STGNet's 3D pose MPJPE is lowest when four STG modules are stacked, giving a receptive field of 243 frames. Additionally, we vary the network width of the STG modules, dividing them into large, medium, and small variants. With four stacked modules, as shown in Table S3, the medium STG module achieves the best 3D pose estimation results.
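MPJPE, the metric reported in Tables S2 and S3, can be computed as follows. This is the standard definition of the metric, not code from the paper:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error.

    Average Euclidean distance between predicted and ground-truth joints.
    pred, gt: (T, J, 3) arrays of 3D joint positions over T frames.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()
```

Lower MPJPE means the reconstructed 3D skeleton sits closer, joint by joint, to the reference poses.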

Experiment of STGNet efficiency

To validate the effectiveness of our designed 3D pose estimation algorithm, we compare it with the VideoPose3D [29], LiftPose3D [30], SemGCN [31], and GLA-GCN [32] algorithms to evaluate model accuracy in 3D skeleton estimation. Our model achieves a lower 3D pose loss during training and a lower MPJPE on the validation dataset, demonstrating that STGNet extracts features from GST more effectively, leading to more accurate 3D pose predictions.

Gallop

We included the gallop as one of the real-world experiments to achieve agile, rapid quadrupedal motion. The robot's galloping performance is showcased in Movie S2, where we compare the galloping motion of a real dog in a video with the gallop imitated by AlienGo. During the yellow-highlighted interval, the robot's front feet leave the ground, leaving only the rear feet in contact; the calf joints of the two rear legs exert force, enabling the robot to gain height and move in the direction of travel.

ReOps Intellitech Pvt. Ltd. © 2025 All Rights Reserved