19 January, 2024
Marcel Heisler

Making an Android Robot Head Sing



During android Andrea’s recent visit to the Mercedes-Benz Museum, visitors sometimes asked it to sing (13 times, to be precise). I took my last day before the winter holidays as an occasion to get one step closer to fulfilling this wish by teaching one of our android heads some Christmas songs. This blog post describes how it works and also provides some examples of what did not work out so well.

Edit: We had to exchange some of the Christmas songs.

Song Credit: Please Use This Song (Jon Lajoie)

Base case

The robot heads (as well as Andrea) are already capable of automatically lip-syncing audio files containing speech signals. This is done using a pre-trained machine learning (ML) model that animates a 3D mesh of a head according to a speech signal. The output mesh is then mapped to control the robot’s actuators, as described in our paper Making an Android Robot Head Talk.

To make the robots appear a bit more lifelike, we also manually defined some animations, such as blinking and slight random eye movements, that are executed repeatedly at random time intervals (see our paper An Android Robot Head as Embodied Conversational Agent for some details).

Mouth movements

Using music as input to the head animation ML model does not work, since the model was not trained on music and the musical instruments confuse it. Thus, we need a cappella songs as input to generate mouth movements. Two ways to get a cappella songs were tested:

1. Text-to-audio ML models

In contrast to plain text-to-speech (TTS) models, some recently released text-to-audio models are able to produce sounds other than speech as well. For example, Bark can generate singing when the lyrics in the prompt are wrapped in music notes, like ♪ In the jungle, the mighty jungle, the lion barks tonight ♪.
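As a rough illustration of the prompt format, here is a minimal Python sketch. The helper names singing_prompt and synthesize_singing are made up for this example and are not taken from our code; the Bark calls (preload_models, generate_audio, SAMPLE_RATE) follow the usage shown in Bark's README, and SciPy is assumed to be installed for writing the WAV file.

```python
def singing_prompt(lyrics: str) -> str:
    """Wrap lyrics in music notes, the cue that nudges Bark towards singing."""
    return f"♪ {lyrics.strip()} ♪"


def synthesize_singing(lyrics: str, out_path: str) -> None:
    """Generate audio for the lyrics with Bark and save it as a WAV file.

    Hypothetical helper: the imports are done lazily because Bark is a heavy,
    optional dependency that downloads its models on first use.
    """
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()
    audio = generate_audio(singing_prompt(lyrics))
    write_wav(out_path, SAMPLE_RATE, audio)
```

Calling synthesize_singing("In the jungle, the mighty jungle", "jungle.wav") would then produce a file that can be fed to the lip-sync model. Whether the result actually sounds like singing varies from run to run.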

Apart from Bark, Meta’s Audiobox was tried out using its feature to describe a speaking voice. Some prompts were tested, like “a male/female voice singing”. The results are quite entertaining, though not really satisfying. While some more prompt engineering might improve on this, the other source of a cappella songs turned out to be more promising.

2. Music source separation

The task of music source separation is to decompose music into its constituent components. I used Demucs to split a song into separate audio files for bass, drums, other and, most importantly, vocals. Generating mouth movements for the extracted vocals works quite well and results in the head singing a cappella. Playing the whole original song along with the generated mouth movements results in the video at the top of the page.
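The separation step can be sketched in Python roughly as follows. This is an assumption-laden sketch, not our exact script: it calls the demucs command-line tool, assumes the htdemucs model, and assumes the output layout used by recent Demucs versions (output folder / model name / track name / stem.wav). The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def demucs_command(song, out_dir="separated", model="htdemucs"):
    """Build the demucs CLI call that splits a song into its four stems."""
    return ["demucs", "-n", model, "-o", str(out_dir), str(song)]


def vocals_path(song, out_dir="separated", model="htdemucs"):
    """Where the vocals stem is expected to land (assumed Demucs v4 layout)."""
    return Path(out_dir) / model / Path(song).stem / "vocals.wav"


def separate_vocals(song, out_dir="separated", model="htdemucs"):
    """Run Demucs (must be installed) and return the path of the vocals file."""
    subprocess.run(demucs_command(song, out_dir, model), check=True)
    return vocals_path(song, out_dir, model)
```

The file returned by separate_vocals("song.mp3") is what goes into the head animation model, while the original song is played back as audio.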

Song Credit: Please Use This Song (Jon Lajoie)

Neck movement

Some rhythmic head movements were added, using librosa’s beat_track method to detect beats. On every second detected beat, the neck actuator is set to move to a different side. However, this might need some more fine-tuning for different songs.
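A minimal Python sketch of this beat-based scheduling is shown below. The function names and the side labels "left" and "right" are placeholders chosen for illustration; how the sides map to actual neck actuator values is not covered here. Only librosa.load, librosa.beat.beat_track and librosa.frames_to_time are real librosa calls.

```python
def neck_schedule(beat_times, every=2, sides=("left", "right")):
    """Return (time, side) pairs, switching sides on every `every`-th beat."""
    schedule = []
    # Keep beats number `every`, 2 * `every`, ... (1-based counting).
    for i, t in enumerate(beat_times[every - 1::every]):
        schedule.append((float(t), sides[i % len(sides)]))
    return schedule


def detect_beat_times(audio_path):
    """Detect beats with librosa and return their positions in seconds."""
    import librosa  # imported lazily so neck_schedule works without it

    y, sr = librosa.load(audio_path)
    _tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    return librosa.frames_to_time(beat_frames, sr=sr)
```

With beats at 0.5, 1.0, 1.5 and 2.0 seconds, neck_schedule yields a move to one side at 1.0 s and to the other at 2.0 s. The every parameter is one obvious knob for the per-song fine-tuning mentioned above.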

Finally, we detect silence in the vocals audio file using librosa's trim, so that other animations can control the mouth actuators while there is no singing, that is, before the singing starts or after it ends. In these cases, the head puts on a slight smile.
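The hand-over between lip-sync and the smile animation could look roughly like this in Python. It is a sketch under assumptions: the labels "lip_sync" and "smile" and all function names are invented for the example, and the top_db threshold of 60 is simply librosa's default, which may need adjusting for separated vocals that still contain faint residue of the instruments.

```python
def singing_interval(trim_index, sr):
    """Convert the sample interval returned by trim into seconds."""
    start, end = trim_index
    return start / sr, end / sr


def mouth_controller(t, interval):
    """Decide which animation drives the mouth actuators at time t (seconds)."""
    start, end = interval
    return "lip_sync" if start <= t < end else "smile"


def detect_singing_interval(vocals_file, top_db=60):
    """Find where the singing starts and ends in the vocals file."""
    import librosa  # imported lazily so the helpers above work without it

    y, sr = librosa.load(vocals_file, sr=None)
    # trim returns the trimmed signal and the [start, end] sample indices
    # of the non-silent region.
    _trimmed, index = librosa.effects.trim(y, top_db=top_db)
    return singing_interval(index, sr)
```

Note that trim only removes leading and trailing silence, which matches the "not yet or not anymore" case; pauses in the middle of a song would need a different approach.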

Song Credit: NEFFEX - This Is Not A Christmas Song (with Ryan Oakes)