OpenPose vs. AlphaPose: Which One Is Better?

Pose estimation is the task of estimating the configuration of the human body from an image or a video.

For my project, I had to extract pose sequences from videos of a single person dancing. There are several pose estimation open source libraries available, but I was not sure which one performs the best and is the most suitable for my project. I tried two of the most renowned libraries - OpenPose and AlphaPose - and compared the results.

Short conclusion, in my case, was that AlphaPose was better than OpenPose. Please note that this experiment is case specific, and the result can vary depending on your data, dynamics of your videos, the number of people to estimate poses, etc.

What is OpenPose / AlphaPose?

OpenPose is a multi-person 2D pose estimation system that detects body, hand, facial, and foot keypoints. It takes a bottom-up approach using Part Affinity Fields (PAFs), which encode information about limb locations and orientations in the image. Because OpenPose operates on single frames, it shows good results on clear images but disregards context across consecutive frames, and can therefore be unreliable on frames with problems such as blurring from fast motion or changing lighting. BODY_25 and COCO are the two available OpenPose models, with BODY_25 as the default. Since COCO is documented to be less accurate, BODY_25 was used for this experiment.
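
When OpenPose is run with its `--write_json` flag, it writes one JSON file per frame containing a `people` list, each entry carrying a flat `pose_keypoints_2d` list of (x, y, confidence) triples. A minimal sketch of loading a BODY_25 frame from that output (function name and error handling are my own) might look like this:

```python
import json

import numpy as np


def load_body25_frame(json_path):
    """Load one OpenPose --write_json output file and return the first
    person's keypoints as a (25, 3) array of (x, y, confidence)."""
    with open(json_path) as f:
        frame = json.load(f)
    people = frame.get("people", [])
    if not people:
        return None  # missing frame: no person detected
    flat = people[0]["pose_keypoints_2d"]  # 25 * 3 flat list
    return np.asarray(flat, dtype=np.float32).reshape(25, 3)
```

Returning `None` for empty frames makes the missing-frame metric below easy to compute.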

AlphaPose (Fang et al., 2017), in contrast, takes a top-down approach, sometimes called detect-and-track, because it first detects a bounding box and then estimates the pose within it. Like OpenPose, it is single-frame-based. Among the available models and keypoint counts, the Fast Pose model with 26 keypoints was used because its output format is the closest to OpenPose's BODY_25. In addition, Fast Pose trained with ResNet152, which had the highest average precision (AP) among the 17-keypoint models, was included in the comparison because most previous dance generation studies used only 17 or 18 joints.
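
AlphaPose writes its detections to a single results JSON, where each entry has an `image_id`, a flat `keypoints` list of (x, y, score) triples, and an overall person `score`. Because a top-down detector can return several boxes even for a solo dancer, one plausible way to keep a single pose per frame (a sketch under that assumption, not AlphaPose's own API) is to keep the highest-scoring detection:

```python
import json

import numpy as np


def best_pose_per_frame(results_path, n_joints=17):
    """Group AlphaPose results entries by frame and keep only the
    highest-scoring detection, since the videos contain a single dancer."""
    with open(results_path) as f:
        detections = json.load(f)
    best = {}
    for det in detections:
        frame = det["image_id"]
        if frame not in best or det["score"] > best[frame]["score"]:
            best[frame] = det
    # Reshape each flat keypoint list into (n_joints, 3): x, y, score.
    return {
        frame: np.asarray(det["keypoints"], dtype=np.float32).reshape(n_joints, 3)
        for frame, det in best.items()
    }
```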

Below are sample poses extracted by AlphaPose with 17 joints, AlphaPose with 26 joints, and OpenPose with 25 joints (from left to right).


Dataset for Pose Estimation

The dataset I used consists of videos of open-style choreography performed by a solo dancer. The videos are mostly based on street dance or hip hop and meet the following criteria: 
    1) the dance movements illustrate musical styles and beats well;
    2) the dance movements are dynamic, with a variety of footwork, levels, and use of space;
    3) the dance movements are big and clear enough that the skeleton's movements can be recognized as real dance movements;
    4) camera movement is zero or as little as possible, so that the dancer stays in the frame most of the time.

50 dance videos were used for this experiment. The videos were resized to 640 x 320 and converted to a frame rate of 30 frames per second (fps).
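
As a rough illustration of this normalization step (assuming ffmpeg is on the PATH; the exact flags I used may have differed), each clip can be re-encoded like this:

```python
import subprocess


def ffmpeg_command(src, dst, width=640, height=320, fps=30):
    """Build the ffmpeg invocation that rescales a clip and fixes its
    frame rate, so every video matches before pose estimation."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale={width}:{height}",
        "-r", str(fps),
        dst,
    ]


def normalize_video(src, dst, **kwargs):
    """Run ffmpeg on one video, raising if the re-encode fails."""
    subprocess.run(ffmpeg_command(src, dst, **kwargs), check=True)
```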


Evaluation Metrics

The metrics I used to compare the results are:
  • number of joints: The number of joints was used to check whether more joints reflect the dance better and whether the joint count affects accuracy.
  • confidence scores: The per-keypoint confidence scores that each pose estimation system outputs along with the keypoints.
  • missing frames: The number of frames in which the system detected no pose at all, counted from each pose estimation output.
  • incorrect detections: A function was defined to determine whether a frame contains a misdetection.

    (Correcting these misdetections to get clean data is covered in the next post.)

For each metric except for the number of joints, both mean and median were taken into account. 
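
To make the confidence and missing-frame metrics concrete, here is a small sketch that computes them from a per-frame list of (J, 3) keypoint arrays, with None marking a missing frame (the exact bookkeeping I used may have differed):

```python
import numpy as np


def summarize_confidences(frames):
    """Count missing frames and compute the mean and median of all
    per-keypoint confidence scores across the detected frames."""
    missing = sum(1 for f in frames if f is None)
    # Column 2 of each (J, 3) array holds the keypoint confidences.
    scores = np.concatenate([f[:, 2] for f in frames if f is not None])
    return {
        "missing_frames": missing,
        "mean_confidence": float(np.mean(scores)),
        "median_confidence": float(np.median(scores)),
    }
```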

Results


The table summarizes a simple evaluation of AlphaPose and OpenPose. OpenPose had the fewest missing frames. AlphaPose with 17 joints had the fewest incorrect detections, while OpenPose had 4.5 times as many. AlphaPose with 26 joints had the highest average confidence score, while OpenPose had the lowest. Each metric points in a different direction, but the number of incorrect detections was considered the most significant for determining the best method, because it most directly influences the quality of the pose data.

Although the number of missing frames is also important, the gap between OpenPose and AlphaPose was not noticeable. In addition, even though OpenPose had the fewest missing frames, it had many frames missing part of the keypoints, while AlphaPose always returned all keypoints.
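
This partial-keypoint check can be sketched as follows: OpenPose writes undetected joints as (0, 0, 0), so a zero confidence flags a missing keypoint (a simplified illustration, not the exact script I used):

```python
import numpy as np


def count_partial_frames(frames):
    """Count frames where a person was detected but some joints are
    absent, i.e. at least one keypoint has zero confidence."""
    partial = 0
    for kp in frames:  # each entry: (J, 3) array, or None if missing
        if kp is not None and np.any(kp[:, 2] == 0):
            partial += 1
    return partial
```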

The confidence score was considered the least important because it is produced by the pose estimation system itself and is therefore model-dependent. In terms of the number of joints, 25 and 26 joints are more expressive than 17 because they include foot keypoints. However, the feet add little here: most of the dance movements in the dataset focus on the arms, body, and legs rather than the feet, and the number of incorrect detections was significantly higher than with 17 joints. 17 joints are enough to represent the dance movements in the dataset, and they also leave less room to generate unrealistic poses during training later. Therefore, AlphaPose with 17 joints was selected to create the final dataset for my project.

If you'd like to see how I cleaned the messy output of AlphaPose, check my next post: Cleaning messy pose estimation.
