SpecPatch: Human-in-the-Loop Adversarial Audio Spectrogram Patch Attack on Speech Recognition

We propose SpecPatch, a human-in-the-loop adversarial audio attack on automated speech recognition (ASR) systems. Existing audio adversarial attacks assume that users cannot notice the adversarial audio, and hence allow the crafted adversarial examples or perturbations to be delivered successfully. However, in a practical attack scenario, the users of intelligent voice-controlled systems (e.g., smartwatches, smart speakers, smartphones) remain vigilant for suspicious voices, especially while delivering their own voice commands. Once alerted by suspicious audio, a user attempts to correct the falsely recognized command by interrupting the adversarial audio and issuing a more forceful voice command to overshadow the malicious one. This renders existing attacks ineffective in the typical scenario where the user’s interaction and the delivery of the adversarial audio coincide. To truly enable an imperceptible and robust adversarial attack and handle possible user interruption, we design SpecPatch, a practical voice attack that uses a sub-second audio patch signal to deliver the attack command and periodic noise to break down the communication between the user and the ASR system. We analyze the forward and backward passes of the CTC (Connectionist Temporal Classification) loss and exploit its weaknesses to achieve our attack goal. Compared with existing attacks, we extend the attack impact length (i.e., the length of the attack target command) by 287%. Furthermore, we show that our attack achieves a 100% success rate in both over-the-line and over-the-air scenarios amid user intervention.
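For readers unfamiliar with the loss that SpecPatch analyzes, the sketch below is a minimal, self-contained implementation of the standard CTC forward (alpha) recursion, which computes the log-probability of a target transcription given per-frame label distributions. This is only an illustration of the mechanism the paper exploits, not the authors' attack code; the function name and toy inputs are our own.

```python
import math

def ctc_forward_log_prob(log_probs, target, blank=0):
    """Log P(target | log_probs) via the CTC forward (alpha) recursion.

    log_probs: T x V list of per-frame log label probabilities.
    target:    list of label indices (no blanks).
    """
    # Extended label sequence: blanks interleaved around every label.
    ext = [blank]
    for lab in target:
        ext += [lab, blank]
    S, T = len(ext), len(log_probs)
    NEG_INF = float("-inf")

    def logadd(a, b):
        # Numerically stable log(exp(a) + exp(b)).
        if a == NEG_INF:
            return b
        if b == NEG_INF:
            return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    alpha = [[NEG_INF] * S for _ in range(T)]
    # Paths may start with the leading blank or the first label.
    alpha[0][0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                      # stay on the same state
            if s > 0:
                a = logadd(a, alpha[t - 1][s - 1])   # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[t - 1][s - 2])   # skip a blank
            alpha[t][s] = a + log_probs[t][ext[s]]

    # Valid paths end in the final blank or the final label.
    total = alpha[T - 1][S - 1]
    if S > 1:
        total = logadd(total, alpha[T - 1][S - 2])
    return total
```

For example, with two frames, a vocabulary of {blank, "a"}, and uniform per-frame probabilities of 0.5, three of the four alignments collapse to "a" (a·blank, blank·a, a·a), so the CTC probability of target "a" is 0.75. Because many alignments collapse to one transcription, short targets can dominate long frame sequences, which is the slack a patch-style attack can exploit.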


Read the Paper, Cite

@inproceedings{guo2022specpatch,
  title={SpecPatch: Human-In-The-Loop Adversarial Audio Spectrogram Patch Attack on Speech Recognition},
  author={Guo, Hanqing and Wang, Yuanda and Ivanov, Nikolay and Yan, Qiben and Xiao, Li},
  booktitle={Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (CCS)},
  year={2022}
}

Failure Cases of Existing Adversarial Audio Attacks

User Interference: When the adversary and the user pronounce commands concurrently (e.g., the AE says “call 911” while the user speaks “set an alarm at 6 am” at stage ①), the ASR system tends to accept the user’s command rather than the AE’s; in this case, it will respond with “Alarm has been set”.

User Perception: Although the adversary may craft imperceptible AEs by encoding the adversarial command into songs or other speech, the user can still locate the source of the suspicious sound because of the long duration and repeated playback of conventional audio adversarial attacks.

User Interaction: The adversary launches the attack by playing the “read message” adversarial audio at stage ①, followed by the ASR system’s successful response reading a message containing a personal verification code at stage ②. However, when the user is present, he/she notices the abnormal behavior of the ASR device and interacts with the ASR system by issuing a halting command (such as “stop reading”) at stage ③ to regain control. Consequently, the ASR system follows the user’s benign command and terminates the reading process.

Features of SpecPatch Attack

SpecPatch Sample Audios

Benign

Adversarial

"Turn on the lights" [Patch + Mute]: "Open the door----------------"
"Close the window and curtains" [Patch]: "Open the door window and curtains"
"Turn on the lights" [Mute]:"-------------------"