Machine Learning Tools for Gesture Recognition and Hand Tracking: A Comparison with Google MediaPipe

--

Hardware-agnostic hand tracking for VR on an HTC Vive

Hand tracking and gesture recognition technology represent a revolution in the way people interact with technology: virtual interactions with digital and holographic objects, touchless controls with smart displays, and remote interactions with autonomous devices are now possible.

These new ways of interacting pave the way to a wide variety of applications in industries such as entertainment, manufacturing, robotics, automotive, and healthcare.

How to provide ready-to-run, accurate and battery-efficient hand tracking models

Clay AIR has been developing gesture recognition and hand tracking solutions since 2015, backed by ten years of R&D. For every player aiming to provide a hardware-agnostic, performant, and intuitive solution, the challenge remains the same:

How do you provide ready-to-run, accurate hand tracking models on any device while keeping CPU consumption low?

Clay AIR introduces new proprietary tools, USG & KANT, designed to improve the accuracy, performance, and training time of hand tracking models

Clay AIR’s hand tracking and gesture recognition technology combines our models, the proprietary tools we designed to train them, and our technical capabilities in other interaction technologies (e.g., 6DoF, SLAM, planar detection, body and face recognition).

In this publication, we will refer to Google AI’s paper on state-of-the-art real-time hand tracking, “On-Device, Real-Time Hand Tracking with MediaPipe,” to introduce Clay AIR’s two latest proprietary tools, USG & KANT, designed to improve model training, resulting in higher gesture recognition accuracy, increased performance, and shorter model readiness and training time. You can find more information about our patents and scientific papers here.

Clay AIR hand tracking technology using Nreal cameras

Similarities and differences to Google’s Machine Learning Pipeline for real-time hand tracking

Google’s approach provides high-fidelity hand and finger tracking from an HD RGB camera by employing machine learning (ML) to infer 21 key points of the hand from a single frame.

The architecture of Clay AIR’s machine learning pipeline for gesture recognition and hand tracking differs in the methods and tools used to train our models, which results in higher performance, shorter implementation time, and higher accuracy.

Hand landmark model differences

Google’s hand landmark model performs precise key point localization of 21 3D hand-knuckle coordinates inside the detected hand regions via regression (direct coordinate prediction). Google feeds its real-time hand tracking model with cropped real-world photos and rendered synthetic images to predict the 21 key points.
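
For reference, those 21 landmarks are exposed directly by MediaPipe’s Python package; a minimal sketch of reading them from a single frame (the image path is a placeholder):

```python
import cv2
import mediapipe as mp

# Load one frame; MediaPipe expects RGB while OpenCV decodes to BGR.
frame = cv2.cvtColor(cv2.imread("hand.jpg"), cv2.COLOR_BGR2RGB)

with mp.solutions.hands.Hands(static_image_mode=True,
                              max_num_hands=1,
                              min_detection_confidence=0.5) as hands:
    results = hands.process(frame)

if results.multi_hand_landmarks:
    # 21 landmarks, each with normalized x/y and a relative depth z.
    for lm in results.multi_hand_landmarks[0].landmark:
        print(lm.x, lm.y, lm.z)
```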

Clay AIR’s hand landmark models predict 22 (now 23) 3D key point coordinates, trained on a 1.4M-sample database of cropped real-world and synthetic images.

However, the input (monochrome), the resolution (96x96p, 112x112p, or 128x128p, with a corresponding maximum distance of 5.6 feet), the blending distribution (synthetic/manual), the bounding box (adaptive/rectangular), the model itself (direct 3D), and the training method differ from Google’s.

Input differences

At Clay AIR, we take our input from the monochrome camera already used for 6DoF tracking, whereas Google uses a 256x256p input from an RGB camera.

Monochrome sensors are typically already used for room-scale tracking, and running our software through the same camera allows us to avoid opening additional cameras such as the RGB sensor, which is well known for its heat and high CPU consumption.

Monochrome inputs are more challenging to process, as the images are lower resolution, in black and white, and more distorted. Even so, we are able to run machine-learning based tracking and gesture recognition through monochrome cameras, in addition to RGB, IR and ToF cameras.
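
Our preprocessing pipeline is proprietary, but the kind of low-cost preparation a distorted, low-resolution monochrome frame typically needs before inference can be sketched as follows; the function name, calibration inputs, and the 96x96 target size are illustrative assumptions, not Clay AIR internals:

```python
import cv2
import numpy as np

def prepare_monochrome_frame(frame_gray, camera_matrix, dist_coeffs, size=96):
    """Undistort a monochrome frame and normalize it for inference.

    camera_matrix / dist_coeffs are assumed to come from a prior lens
    calibration; 96x96 matches the smallest resolution quoted above.
    """
    undistorted = cv2.undistort(frame_gray, camera_matrix, dist_coeffs)
    resized = cv2.resize(undistorted, (size, size), interpolation=cv2.INTER_AREA)
    # Single-channel float input in [0, 1] keeps the tensor (and the compute) small.
    return resized.astype(np.float32) / 255.0
```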

Google uses models without SSD, which results in slower and less accurate object detection.

Annotation, training and mixing method differences

As Google states, 30K samples were used, partly manually annotated and partly synthetic. Manual annotation usually costs 0.5 cents per sample and takes 3 to 4 weeks.

Because part of the process is manual, the positioning is uncertain and confidence in the key points is consequently reduced, so jitter is likely to occur. On the other hand, synthetic data can carry biases, such as image grain, that can result in fewer hands being recognized.

Clay AIR developed two proprietary tools to accelerate the annotation and training process of in-house hand tracking models

KANT (Knowledge Automated Notation Tool)

KANT is a generic annotation tool that enables us to generate 90K samples per hour. Any object can be generated, but we use it to produce balanced and representative hand poses and positions.

It includes luminance and background matching to adapt seamlessly to new devices, ISPs, and FOVs. The resulting samples feed our 2D or 3D hand models.
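
KANT itself is not public, but the idea of luminance matching, aligning a generated sample’s brightness and contrast with a reference frame captured on the target device, can be illustrated with a simple mean/standard-deviation transfer; the helper below is a rough sketch of the principle, not the tool’s actual method:

```python
import numpy as np

def match_luminance(generated_gray, device_reference_gray):
    """Shift a generated sample's brightness/contrast toward a reference
    frame from the target device (mean/std matching on 8-bit grayscale).

    Illustrative only: the real tool also handles background, ISP, and FOV.
    """
    g_mean, g_std = generated_gray.mean(), generated_gray.std() + 1e-6
    d_mean, d_std = device_reference_gray.mean(), device_reference_gray.std() + 1e-6
    matched = (generated_gray - g_mean) * (d_std / g_std) + d_mean
    return np.clip(matched, 0, 255).astype(np.uint8)
```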

USG (Unity Sample Generator)

USG is a semi-assisted tool designed to support and accelerate the manual annotation process (real-world data). In particular, combining the simultaneous IR, ToF, monochrome, and RGB camera streams on the same calibrated device enables us to recover 3D coordinates from 2D monochrome images.

In addition, a grid that correlates ten monochrome, six RGB, and eleven ToF cameras calibrated together makes it possible to multiply each annotated image by the number of cameras, substantially increasing the number of annotated images.
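
The geometric idea behind USG can be sketched with standard pinhole projection: once a keypoint’s 3D position is known in the rig’s common frame (for example from the ToF stream), it can be reprojected into every other calibrated camera, so a single annotation yields one label per camera. The intrinsics, extrinsics, and keypoint values below are placeholders, not our rig’s calibration:

```python
import numpy as np
import cv2

def project_keypoints(points_3d, rvec, tvec, camera_matrix, dist_coeffs):
    """Project Nx3 keypoints (rig frame) into one calibrated camera's image plane."""
    pts_2d, _ = cv2.projectPoints(points_3d.astype(np.float64),
                                  rvec, tvec, camera_matrix, dist_coeffs)
    return pts_2d.reshape(-1, 2)

# One annotated 3D hand pose (e.g., 22 keypoints recovered with ToF depth)...
keypoints_3d = np.tile([0.0, 0.0, 0.5], (22, 1))  # placeholder: 22 points 0.5 m ahead

# ...becomes a 2D annotation in every camera of the calibrated rig.
rig = [
    {"rvec": np.zeros(3), "tvec": np.zeros(3),                           # placeholder extrinsics
     "K": np.array([[400., 0., 48.], [0., 400., 48.], [0., 0., 1.]]),    # placeholder intrinsics
     "dist": np.zeros(5)},
    # ...one entry per monochrome / RGB / ToF camera in the rig
]
annotations_2d = [project_keypoints(keypoints_3d, c["rvec"], c["tvec"], c["K"], c["dist"])
                  for c in rig]
```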

Our mixing method is different too: we feed our three-model 2D or 3D architecture with controlled proportions of positive and negative samples, and with rendered samples whose number and nature are adjusted to modify the learning rate and perturbation data in real time.

Ultimately, the two tools enable Clay AIR to use only one camera feed, since triangulation is no longer needed to predict the Z coordinate, cutting the DSP load in half.

Google’s annotation process compared to Clay AIR proprietary tools

Google vs Clay AIR hand tracking: comparison of annotation, training method, lens, input data processing, hand landmark model, and computation.

Computing time comparison on different devices: Google vs Clay AIR

Implications for our partners and end-users using Clay AIR hand tracking models

A shorter implementation time

KANT and USG drastically improve training velocity: from 10,000 2D images per month with the previous tools to 90,000 3D samples per hour, resulting in a shorter time to implementation for our partners.
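
To put that velocity figure in perspective, a rough back-of-the-envelope comparison (the 30-day, round-the-clock generation assumption is ours, not a stated figure):

```python
old_samples_per_month = 10_000        # 2D images, previous tools
new_samples_per_hour = 90_000         # 3D samples with KANT/USG
hours_per_month = 30 * 24             # assumption: continuous generation

new_samples_per_month = new_samples_per_hour * hours_per_month
print(new_samples_per_month)                            # 64,800,000
print(new_samples_per_month / old_samples_per_month)    # ~6,480x more samples per month
```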

Increased accuracy

With semi-automated annotation processes and an increased diversity of samples, Clay AIR is able to reduce jitter and the inaccuracy of manually annotated data. The hand tracking models are therefore more accurate, which increases the sense of immersion for users.

A power-efficient solution

Because the two tools let us rely on a single camera feed, no triangulation is needed to predict the Z coordinate, which cuts the DSP load in half.
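
The saving is geometric: a stereo setup needs a second stream plus a triangulation step per keypoint, whereas a direct-3D model regresses Z from the single feed. A rough contrast of the two approaches (projection matrices and points are placeholders):

```python
import numpy as np
import cv2

# Stereo approach: two camera feeds plus triangulation for every keypoint.
P_left = np.hstack([np.eye(3), np.zeros((3, 1))])                    # placeholder 3x4 projection matrices
P_right = np.hstack([np.eye(3), np.array([[-0.06], [0.0], [0.0]])])  # 6 cm baseline, assumed
pts_left = np.random.rand(2, 22)   # the same 22 keypoints seen by each camera (placeholder values)
pts_right = np.random.rand(2, 22)
xyzw = cv2.triangulatePoints(P_left, P_right, pts_left, pts_right)   # 4x22 homogeneous points
keypoints_3d_stereo = (xyzw[:3] / xyzw[3]).T                         # 22x3

# Single-feed approach: the landmark model already outputs (x, y, z) per keypoint,
# so the second stream and the triangulation step above are no longer needed.
```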

This is particularly valuable for partners looking to implement intuitive hand interactions on lightweight devices with lower computing power, and in systems where CPU load must be spared for essential features, such as driver monitoring systems in cars and trucks.

Using Clay AIR proprietary tools to annotate and train data sets: faster process and increased performance

About Clay AIR

Clay AIR is the only hardware-agnostic software solution for hand tracking and gesture recognition with leading-edge performance.

Clay AIR is a proprietary software solution that enables realistic interaction with the digital world for a variety of mobile, XR, and other devices using computer vision and artificial intelligence.

Recently, Clay AIR collaborated with Lenovo to bring native gesture recognition to the ThinkReality A6 augmented reality (AR) headset.

Clay AIR also partnered with Renault-Nissan-Mitsubishi to create their prototype in-car air gesture controls to increase safety and improve driving experiences.

The company is also working with Nreal to add hand tracking and gesture recognition to its mixed reality headsets, and with Qualcomm to implement Clay AIR’s technology at the chipset level to simplify integrations and bring hand tracking and gesture controls to more AR and VR devices.

If you would like more information about implementing our solutions, feel free to reach out to us here.

Thomas has a background in Science, Arts & Leadership. His passion for immersive experiences led him to co-found Clay AIR in 2015, acquired by Qualcomm in 2021.