Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa
Ruhr-Universität Bochum
Personal assistants such as Alexa, Siri, or Cortana are widely deployed these days. The underlying Automatic Speech Recognition (ASR) systems can recognize and even translate spoken language and provide a written transcript of it. Recent advances in the fields of deep learning and big data analysis have enabled significant progress, and ASR systems have become almost as good at this task as human listeners.
In this work, we demonstrate how an adversary can attack speech recognition systems by generating an audio file that is recognized as a specific audio content by a human listener, but as a certain, possibly totally different, text by an ASR system.
More audio examples can be found at the end of the article.
These so-called adversarial examples exploit the fact that ASR systems based on machine learning leave some wiggle room for an attacker: by adding carefully crafted distortions, we can mislead the algorithm into recognizing a different sentence. The main contribution of our work is to demonstrate that we can generate this noise in such a way that it is almost completely hidden from human listeners. For this purpose, we take advantage of a psychoacoustic model of human hearing, which describes the masking effects of human perception and is also used for MP3 compression.
Automatic Speech Recognition (ASR) systems have recently become almost as good at recognizing speech as human listeners. We find ASR systems in a variety of modern devices, for example, home automation systems or personal assistants in smartphones. Using different methodologies and technologies from the field of computational linguistics, ASR systems can recognize and even translate spoken language. Advances in the fields of deep learning and big data have driven this recent progress, but they have also created new vulnerabilities: speech recognition systems can be attacked by triggering their actions through imitated voice commands. In this work, we implement an attack that activates ASR systems without being noticed by humans.
We use Kaldi, a state-of-the-art ASR system, which calculates transcriptions of raw audio in three steps:

1. preprocessing of the raw audio data into feature vectors,
2. a deep neural network (DNN) that calculates a matrix of phone probabilities (pseudo-posteriors) from these features, and
3. a decoding step that combines the pseudo-posteriors with a language model to find the most likely transcription.
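The following Python sketch illustrates this three-step structure with stand-in components. The feature dimensions, the tiny network, and the greedy "decoder" are our own placeholder assumptions for illustration, not Kaldi's actual implementation:

```python
# Minimal sketch of a hybrid ASR pipeline (not Kaldi itself):
# feature extraction -> DNN pseudo-posteriors -> decoding.
import torch

def extract_features(audio, n_fft=512, hop=160):
    # Step 1: preprocessing -- log-magnitude spectrogram as feature vectors.
    window = torch.hann_window(n_fft)
    spec = torch.stft(audio, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return torch.log(spec.abs() + 1e-6).T          # (frames, freq_bins)

class AcousticModel(torch.nn.Module):
    # Step 2: a DNN mapping each feature frame to pseudo-posteriors
    # over phone states (41 states chosen arbitrarily here).
    def __init__(self, n_feats=257, n_states=41):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_feats, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, n_states))
    def forward(self, feats):
        return torch.log_softmax(self.net(feats), dim=-1)

def decode(log_posteriors):
    # Step 3: decoding -- a real system searches a graph with a language
    # model; here we simply take the best state per frame as a stand-in.
    return log_posteriors.argmax(dim=-1)

audio = torch.randn(16000)                         # 1 s of dummy audio @ 16 kHz
feats = extract_features(audio)
model = AcousticModel()
print(decode(model(feats))[:10])
```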
Psychoacoustic hearing thresholds describe masking effects in human acoustic perception. Probably the best-known example of this is MP3 compression, where the compression algorithm uses a set of computed hearing thresholds to find out which parts of the input signal are imperceptible to human listeners. By removing those parts, the audio signal can be transformed into a smaller but lossy representation, which requires less disk space and less bandwidth to store or transmit.
MP3 compression depends on an empirical set of hearing thresholds that define how dependencies between certain frequencies can mask, i.e., make imperceptible, other parts of an audio signal. We utilize this psychoacoustic model for our manipulations, i.e., all changes are hidden in the imperceptible parts of the audio signal.
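As a rough illustration of how such thresholds can constrain a manipulation, the following sketch clips a perturbation's spectrum so that no time-frequency bin exceeds a given masking threshold. The threshold values and array shapes are placeholder assumptions, not the output of the actual psychoacoustic model:

```python
# Minimal sketch: keep an adversarial perturbation below per-bin "hearing
# thresholds". The thresholds here are a placeholder array; the real attack
# derives them from the psychoacoustic model used in MP3 compression.
import numpy as np

def limit_perturbation(perturbation_spec, thresholds_db):
    """Shrink the magnitude of a complex STFT perturbation so that, in every
    time-frequency bin, it stays below the masking threshold (given in dB)."""
    mag = np.abs(perturbation_spec)
    limit = 10.0 ** (thresholds_db / 20.0)           # dB -> linear magnitude
    scale = np.minimum(1.0, limit / (mag + 1e-12))   # only shrink, never grow
    return perturbation_spec * scale

# Placeholder example: 100 frames x 257 frequency bins.
rng = np.random.default_rng(0)
pert = rng.normal(size=(100, 257)) + 1j * rng.normal(size=(100, 257))
thresholds = np.full((100, 257), -20.0)              # assumed thresholds in dB
limited = limit_perturbation(pert, thresholds)
print(np.abs(limited).max())                         # <= 10**(-20/20) == 0.1
```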
For the attack, we use, in principle, the same algorithm as for the training of neural networks. The algorithm is based on gradient descent, but instead of updating the parameters of the neural network as during training, we modify the audio signal. We use the hearing thresholds to avoid changes in easily perceptible parts of the audio signal.
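The following sketch shows the core idea of descending on the input signal rather than on the network weights. The model, target sequence, and threshold mask are placeholder assumptions and not the actual Kaldi setup:

```python
# Minimal sketch of gradient descent on the input instead of the model
# parameters, with the perturbation kept inside placeholder limits.
import torch

model = torch.nn.Sequential(torch.nn.Linear(257, 41))     # stand-in acoustic model
model.eval()

features = torch.randn(100, 257)                          # stand-in feature frames
delta = torch.zeros_like(features, requires_grad=True)    # the perturbation we optimize
target = torch.randint(0, 41, (100,))                     # desired (hidden) state sequence
mask = torch.full_like(features, 0.1)                     # stand-in per-bin hearing limits

optimizer = torch.optim.Adam([delta], lr=0.01)
for step in range(200):
    optimizer.zero_grad()
    logits = model(features + delta)
    loss = torch.nn.functional.cross_entropy(logits, target)
    loss.backward()                                        # gradient w.r.t. the input, not the weights
    optimizer.step()
    with torch.no_grad():                                  # keep the change below the thresholds
        delta.copy_(torch.maximum(torch.minimum(delta, mask), -mask))

print(float(loss))
```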
In general, it is possible to hide any target transcription in any audio file, with a success rate of nearly 100 %.
As examples, here are some audio clips that were modified with the described algorithm:
| | Original | Modified | Noise |
|---|---|---|---|
| Speech | | | |
| Music | | | |
| Birds | | | |
| Speech | | | |
@INPROCEEDINGS{Schoenherr2019,
  author    = {Lea Sch\"{o}nherr and Katharina Kohls and Steffen Zeiler and Thorsten Holz and Dorothea Kolossa},
  title     = {Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding},
  booktitle = {Network and Distributed System Security Symposium (NDSS)},
  year      = {2019}
}