Hand Gesture Recognition ... on a Mobile?!?!?

Last month, Google announced the release of ‘On-Device, Real-Time Hand Tracking with MediaPipe’ on its AI blog, but what is it and how is it relevant for music technologists?

Essentially, this technology allows a computer to recognise hand poses. Not only does it work in real time, it can also run on a mobile phone and is now completely open source.

Let’s break down the title of the blog post.

  • On-Device: the programme can run on the small, low-powered processors found in mobile phones and tablets.

  • Real-Time: the programme will operate live, in the moment, without significant lag or buffering.

  • Hand Tracking: using the device’s onboard camera, the programme can recognise a hand and its associated pose, even while the hand is moving.

  • MediaPipe: Google’s framework for building ‘multimodal’ (e.g. video, audio, or any time-series data) machine learning pipelines (applications).



If you take a look at the blog post or the GitHub resource, you will see a number of gifs that show the hand tracking algorithm at work.


There are many interesting things about the above gif. Firstly, consider how difficult it is for a computer to recognise a hand. The hand exists in 3D space, but the picture from the camera is in 2D. Fingers overlap, move, block each other, and change shape (i.e. bend). Notice how the computer manages to track the hand as it moves around in space: from left to right, up and down, towards and away from the camera, and twisting on various axes. We get to see the front and back of the hand, the wrist changes position relative to the hand, and the demonstrator’s head even comes into view briefly. The programme stays locked onto the hand the entire time.

Secondly, take a look at the background of the gif, which is quite complicated. It shows ceiling beams and struts, computer screens, walls, and desks. If you don’t believe me that this is a hard task for a computer to achieve, think back to the last time you tried to use the Magic Wand tool in Photoshop to remove the background from a picture.

This is even more impressive in the following gif, where the demonstrator is outside and the background is constantly changing as the camera position moves.

Hand Tracking for Music Making

The open sourcing of this project allows anyone to download the code and create their own mobile applications. The code is written, for the most part, in a mixture of C++ and Python. There are lots of online tutorials to get you coding in Python, and W3Schools is a good place to start. C++ is harder to code in, but the beauty of this repo is that the code is already written; you just have to integrate it into your own app. I won’t go into the nitty-gritty of how to set up and build an app here, but you can find out more by visiting the GitHub repo.

Once implemented, it would not be difficult to create a music-making app for your phone that responds to hand poses and gestures. For example, you could create a simple theremin-like instrument, using the different poses to control the timbre of a synthesizer. Perhaps an even easier idea might be to create an app that sends OSC messages to your laptop so that it can control parameters in Max, Ableton, or Logic. What if the audience at your next gig all had the app and their gestures and poses were used to control the music? Say you are improvising on stage: the audience could vote with ‘thumbs up’ or ‘thumbs down’ on the quality of your performance, with the results changing parameters that control what they see and hear.
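To give a flavour of the OSC idea, here is a minimal Python sketch of the sending side. It is not part of MediaPipe and makes several assumptions: that your tracking code gives you the hand's vertical position as a number between 0.0 (top of the frame) and 1.0 (bottom), that the receiver listens for UDP on port 8000, and that it expects an address called /hand/pitch. All of those names and numbers are made up for illustration. The message is packed by hand using only the standard library, so nothing extra needs installing.

```python
import socket
import struct


def osc_string(text: str) -> bytes:
    """Encode an OSC string: ASCII, null-terminated, padded to a multiple of 4 bytes."""
    data = text.encode("ascii") + b"\x00"
    padding = (-len(data)) % 4
    return data + b"\x00" * padding


def osc_message(address: str, *values: float) -> bytes:
    """Build an OSC message whose arguments are all 32-bit big-endian floats."""
    type_tags = "," + "f" * len(values)
    payload = b"".join(struct.pack(">f", value) for value in values)
    return osc_string(address) + osc_string(type_tags) + payload


def height_to_frequency(y: float, low_hz: float = 220.0, high_hz: float = 880.0) -> float:
    """Map a normalised hand height (0.0 = top, 1.0 = bottom) to a frequency in Hz.

    The mapping is exponential, so equal hand movements give equal musical
    intervals, a little like a theremin. The 220-880 Hz range is arbitrary.
    """
    y = min(max(y, 0.0), 1.0)  # clamp in case the tracker strays out of frame
    return low_hz * (high_hz / low_hz) ** (1.0 - y)


if __name__ == "__main__":
    # Hypothetical receiver: a laptop on the same machine/network listening on UDP 8000.
    host, port = "127.0.0.1", 8000
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Stand-in for values that would really come from the hand tracker.
    for hand_y in (1.0, 0.75, 0.5, 0.25, 0.0):
        frequency = height_to_frequency(hand_y)
        sock.sendto(osc_message("/hand/pitch", frequency), (host, port))

    sock.close()
```

On the laptop, any OSC-capable patch listening on the same port (in Max, for instance, a udpreceive object) should be able to pick up the /hand/pitch values and route them to a synth parameter. Swapping the hand height for a pose label or a thumbs-up count would follow the same pattern with a different address.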

The possibilities are endless and, once you have mastered the basics of making an app, the only thing holding you back will be your imagination.

Happy coding!