It’s not just Spotify that’s developing creepy voice recognition technology. A growing number of companies aren’t just listening to what you say (e.g. “Hey Spotify, play my favourite song!”); they’re also making inferences about your identity — your age, gender, emotional state, or even your mental well-being — in order to profile you and target you with products and services.
The voice recognition tech industry is booming, and governments are taking interest in its products. The threat this technology poses to our rights needs to be addressed now — before our voices become yet another piece of biometric data to be used against us. In this post, we’ll look at what voice recognition technology is, how it threatens our rights, why regulators must take action to keep us safe, and how things can be done better.
What is voice recognition tech?
It’s a form of biometric recognition technology, which means that it uses our biometric data — data that are unique to our bodies and behaviours, such as our facial features, fingerprints, the way we walk, and yes, our voices — to identify us, track us, and even make inferences about sensitive, intimate aspects of our lives.
Facial recognition technology in particular has been getting a lot of attention lately. There’s growing, global momentum behind banning the use of this technology and other biometric surveillance technologies in public spaces. Access Now joined over 200 global civil society organisations from over 55 countries in calling for just such a ban. We have also called out Spotify’s creepy voice recognition technology patent, urging the company to make a public commitment never to use, license, sell, or monetise the technology.
But there’s more ground to cover here. As regulators in Europe and around the world wake up to the harm of biometric surveillance technology, they need to pay attention to voice recognition and its unique threats.
► What was that you said? How voice recognition works
The term voice recognition covers a number of distinct applications where “artificial intelligence” (i.e. machine learning) systems use data about our voices to analyse our speech and respond to the words we say (speech recognition), to identify us (speaker authentication/identification), or to make complex and often contentious inferences about us, such as guessing at our personalities, or even our mental health status, from the way we speak (voice categorisation). Let’s look at some specific examples to see how this works.
From bad to worse
► Voice assistants (speech recognition)
The first thing most people think of when they hear “voice recognition” is a voice assistant for “smart” devices, such as Amazon’s Alexa, Apple’s Siri, or Google Assistant. These devices analyse your voice to figure out what you’re saying (i.e. the content of your speech) in order to act on your voice command. For example, if you say, “Hey Alexa/Siri/Google, what time is it?”, the listening system recognises the “wake word” (i.e. Alexa, Siri, etc.), fully activates, interprets your question or command, and then takes the appropriate action.
This series of steps is the essence of speech recognition, where systems try to analyse the content of what we say to take an action. This kind of technology is also used to transcribe speech, or generate automatic subtitles on videos.
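To make that flow concrete, here is a minimal sketch in Python of the wake-word pattern described above. The audio is simulated with text labels, and both functions are stand-ins rather than any vendor’s actual system; what matters is the structure: the device inspects every chunk of sound it hears, but only fully activates and acts once it detects the wake word.

```python
# Simulated sketch of a wake-word-gated voice assistant.
# Audio chunks are stood in for by strings; the structure, not the
# signal processing, is the point.

WAKE_WORDS = {"alexa", "siri", "hey google"}

def detect_wake_word(chunk: str) -> bool:
    # On a real device this is a small, always-on acoustic model.
    return any(word in chunk.lower() for word in WAKE_WORDS)

def transcribe(chunk: str) -> str:
    # On a real device this is a much larger speech-to-text model,
    # often running in the cloud rather than on the device itself.
    return chunk

def assistant_loop(audio_stream):
    """Inspect every chunk (constant listening), but only act on what
    follows a wake word."""
    activated = False
    for chunk in audio_stream:
        if not activated:
            activated = detect_wake_word(chunk)   # cheap, local check
        else:
            command = transcribe(chunk)           # the content of your speech
            print(f"Acting on command: {command!r}")
            activated = False

# The private conversation before the wake word is still "heard" by the
# loop; it simply doesn't trigger a response.
assistant_loop(["...private conversation...", "hey google", "what time is it?"])
```

Even in this toy version, every chunk passes through the loop: the constant listening is baked into the design, which is exactly what makes accidental activations and the human review described below so troubling.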
Speech recognition tools may seem relatively innocuous. But they raise some human rights concerns. First of all, although they only fully activate once they pick up the wake word, these devices are constantly listening to the audio signals around them. This makes sense, because they can’t pick up the wake word if they’re not already listening.
Obviously it’s already an issue if these devices turn on when they shouldn’t, but the problem goes even deeper, because recordings and transcriptions of conversations are sometimes sent to human reviewers to check transcriptions against the actual audio snippet. Worse still, this work is often subcontracted to third parties: employees of these other companies have listened to snippets of captured speech while working to improve Google Assistant, Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana, and Facebook Messenger, raising data protection and security issues. In some cases, journalists reported that these human reviewers ended up eavesdropping on sensitive situations. One reviewer even heard “a female voice in distress” and said he felt that “physical violence had been involved”.
The issues don’t stop there, unfortunately. There are serious concerns about how the “feminised” gendering of Alexa and Siri reinforces problematic gender stereotypes; how these systems are optimised for certain accents and dialects in a way that excludes people with other accents and dialects, such as Black Americans; how the poor quality of automatically generated captions creates problems for people who rely on them (see the hashtag #NoMoreCRAPtions); and, in the case of Amazon’s Alexa in particular, how “stealing” a common human name has led to kids called Alexa being bullied at school.
► Identity authenticators (speaker identification/authentication)
Some governments and companies are also using our voices to authenticate identity, both as “security measures” and to profile customers.
For instance, the biometrics industry touts the capacity to use people’s voices to confirm their identity in sensitive situations, such as during phone calls with banks or government agencies. The aim is to prevent fraudsters from pretending they are company or government officials to carry out crimes, or from impersonating customers to gain access to their sensitive information. However, biometric data is vulnerable to hacking just like other authentication methods, and voice is no exception. Unlike a password, though, biometric indicators cannot simply be reset as needed. This poses a higher security risk, since it becomes increasingly difficult to “make good” leaks or hacks of biometric data.
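For readers curious what voice authentication looks like under the hood, here is a rough sketch under stated assumptions: a generic speaker-embedding model (the `embed` function below is a placeholder simulated with a seeded random projection, not any real vendor’s product) turns a recording into a fixed-length “voiceprint”, and verification is simply a similarity check against the voiceprint stored at enrolment.

```python
import numpy as np

def embed(audio: np.ndarray) -> np.ndarray:
    # Placeholder for a speaker-embedding model that maps a recording to a
    # fixed-length "voiceprint" vector. Simulated here with a seeded random
    # projection so the sketch runs end to end.
    rng = np.random.default_rng(abs(hash(audio.tobytes())) % (2**32))
    return rng.standard_normal(192)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def enroll(audio: np.ndarray) -> np.ndarray:
    """Store this once, e.g. when a customer first registers with their bank."""
    return embed(audio)

def verify(reference: np.ndarray, audio: np.ndarray, threshold: float = 0.7) -> bool:
    """Accept the caller if their fresh voiceprint is close enough to the
    enrolled one. The threshold is tunable; the voice behind the stored
    reference vector is not, which is the security problem."""
    return cosine_similarity(reference, embed(audio)) >= threshold
```

The security caveat described above lives in that stored reference vector: if a database of voiceprints leaks, there is no equivalent of forcing a password reset.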
Other businesses, like McDonald’s, say they are using voice recognition to “improve the customer experience” — while identifying and profiling people without their permission. Customers in the US state of Illinois have filed a class action lawsuit after the company tested the use of voice recognition technology to recognise and profile people using its drive-thrus.
The plaintiffs claim McDonald’s violated the Illinois Biometric Information Privacy Act because the system not only identifies a customer without their consent, but also “extracts the customer’s voiceprint biometrics to determine such unique features of the customer’s voice as pitch, volume, duration, as well as to obtain identifying information such as the customer’s age, gender, accent, nationality, and national origin”.
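The low-level measurements named in the complaint (pitch, volume, duration) are trivial to compute from raw audio; the contentious step is the leap from those numbers to guesses about age, gender, accent, or national origin. Here is a minimal sketch using only numpy, where the sample rate and the synthetic test tone are illustrative assumptions:

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate in Hz

def basic_voice_features(signal: np.ndarray) -> dict:
    """Compute the low-level measurements named in the complaint: duration,
    volume (RMS energy), and a crude pitch estimate via autocorrelation.
    None of these numbers by themselves reveal age, gender, accent, or
    nationality; those are statistical guesses layered on top."""
    duration_s = len(signal) / SAMPLE_RATE
    rms_volume = float(np.sqrt(np.mean(signal ** 2)))

    # Crude pitch estimate: find the lag (between roughly 50 Hz and 400 Hz)
    # at which the signal best correlates with a shifted copy of itself.
    autocorr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    min_lag, max_lag = SAMPLE_RATE // 400, SAMPLE_RATE // 50
    best_lag = min_lag + int(np.argmax(autocorr[min_lag:max_lag]))
    pitch_hz = SAMPLE_RATE / best_lag

    return {"duration_s": duration_s, "rms_volume": rms_volume, "pitch_hz": pitch_hz}

# Example with a synthetic 220 Hz tone standing in for a voice recording.
t = np.linspace(0, 1.0, SAMPLE_RATE, endpoint=False)
print(basic_voice_features(0.1 * np.sin(2 * np.pi * 220 * t)))
```

Extracting such features is easy and cheap; attaching identity and demographic labels to them, without consent, is where the harm begins.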
All of this is creepy, disproportionate, and, in certain jurisdictions, illegal. However, we’ve only begun to explore the use of voice recognition technology for making inferences that are deeply problematic — many of which rest on shoddy scientific premises and reinforce dangerous, harmful, and reactionary stereotypes about people in order to profit from and surveil them.
► Analysing your voice to guess gender, mood, and a lot more (voice categorisation)
The list of technologies that use AI to make dodgy inferences about us seems to grow by the day. Automated gender recognition technology — where a system guesses a person’s gender from their face, voice, or even name — is a basic “feature” in the facial recognition software offered by most major suppliers, including Microsoft’s Face API. Yet it is a deeply problematic, unfixable technology, as scholars such as Os Keyes, Sasha Costanza-Chock, and Joy Buolamwini persuasively argue. That is why we have called for a prohibition of this technology in our campaign with All Out.
Both the technology patented by Spotify and the system tested by McDonald’s make inferences about people’s gender. Inferring gender from face, voice, or any other kind of biometric data undermines people’s capacity to express their real gender identity. Studies have shown that Black women and trans people are particularly at risk of misgendering, and the technology forces a male-female binary on gender non-conforming people, essentially erasing their identity.
We are also seeing businesses use voice data for “emotion recognition”, despite the fact that prominent scholars have voiced deep scepticism about the use of biometrics for this purpose, concluding that “the science of emotion is ill-equipped to support any of these [facial recognition] initiatives”. This points to a huge problem with “AI systems”: companies are rushing to use AI (a.k.a. machine learning) to solve all sorts of problems and make all sorts of inferences about us, but there are serious limitations to what this technology can do, and in the worst cases it can make inferences that fundamentally violate our rights and reinforce regressive stereotypes.
One especially troubling example of voice-based emotion recognition technology is Amazon’s Halo wristband, which among other features assesses the tone of your voice, a.k.a. AI-powered tone policing on your wrist. It’s supposed to improve your interactions with other people by helping you to “be more friendly” or “polite”, but there is huge potential for misuse with discriminatory results. It could misjudge accents, dialects, and idiosyncratic ways of communicating, or be used as a way to monitor or control how people speak. Imagine an employer that mandates wearing a device like this and then uses its inferences to feed into employee performance evaluations.
Worse, we could see voice recognition used as an “AI polygraph” in a police interrogation, or deployed among vulnerable populations such as asylum seekers. We’ve already seen face-based AI polygraphs deployed against asylum seekers at the European Union’s borders, and in use in China.
Beyond just telling if someone is lying, there is also a burgeoning industry of “computational psychiatry” which claims to analyse voice data to make inferences about people’s mental health. As Beth Semel points out in her essay, The Body Audible: From Vocal Biomarkers to a Phrenology of the Throat, the idea that we can objectively infer complex human attributes from vocal data is deeply flawed.
Bias and ideology can creep into the design of these systems at multiple levels, and she warns that the “makers and stewards of even the most benign-seeming voice analysis technologies run the risk of legitimising a phrenology of the throat: the reproduction of scientific racism and other modes of domination through the materiality of the voice”.
The big picture: what this means for regulators
Voice recognition is not yet everywhere, but with voice assistants like Amazon’s Echo sold at suspiciously cheap prices, and Google smart home devices given away for free, it might not be long before devices capable of surveilling and profiling us in problematic and invasive ways are everywhere.
As Joseph Turow, a professor at the Annenberg School for Communication at the University of Pennsylvania and author of The Voice Catchers, warns, when “tech companies have further developed voice analysis software – and people have become increasingly reliant on voice devices”, they will be in a position to start the wholesale profiling and exploitation of voice data. People may find themselves locked into their smart home systems, just as today they find themselves “stuck” on social media platforms they hate.
All is not doom and gloom, however. Many researchers and companies are working on human rights-respecting alternatives to this creepy, surveillance-based model of voice recognition. Mozilla, for example, launched a project called Common Voice that is crowd-sourcing a free, publicly available dataset “to help teach machines how real people speak” with the goal of “mak[ing] sure the data represents the diversity of real people”. Similarly, Cami Rincón, Os Keyes, and Corinne Cath are working to ensure voice recognition can empower — not discriminate against — marginalised communities, such as trans and/or non-binary people, by developing design recommendations tailored to these communities.
With these alternatives as inspiration, we have to avoid stumbling into a situation in which our smart watches, voice assistants, and everything else around us with a microphone are listening to and profiling us, making detrimental inferences about our mental health, job prospects, and romantic interests. Voice recognition can be done in a privacy-preserving, inclusive way, but we need regulators to wake up to the potential harms it enables, and start implementing safeguards today to make sure that people are protected from a voice-based surveillance nightmare. Stopping the harms of voice recognition is the surest way to ensure that its real benefits can be realised for everyone.