In September, Wired examined potential flaws in Apple’s controversial implementation of “differential privacy” in its new operating systems. Just last week, Apple published a whitepaper expanding on its use of this approach.
We’re trying to put Apple’s use (or misuse) of differential privacy into context, as part of our efforts to defend and extend the digital rights of users at risk, who depend on technological approaches like this to speak freely and stay safe online.
In the first of this three-part series, we explained what differential privacy is and how it works. Now, in part two, we’ll address some common misunderstandings, and explore the issues that make it complicated, in Apple’s case and beyond. Finally, in part three, we’ll look at the ways differential privacy is used in practice today and how companies can implement it responsibly.
Uses and misuses
Differential privacy is most useful when one party — a company, researcher, or government — has access to lots of data. In the global setting this means physical access to raw data, and in the local setting it means the ability to collect private data points from a large group of people (for a refresher on local and global differential privacy, read the first post here). In both settings, the more that people contribute to a dataset, the more useful analysis of that data can be.
For that reason, companies with millions of users and petabytes of data have been the first to adopt differential privacy. These companies have the means to design complex systems as well as access to the massive datasets necessary for useful analysis. For example, Apple has incorporated local differential privacy into its operating systems in order to figure out which emojis users substitute for words and which Spotlight links they click on, and to gather basic statistics from its Health app. Uber has developed a globally private system to allow its engineers to study large-scale ride patterns without touching raw user data. And Google has incorporated local privacy into Chrome to report a bevy of statistics, including which websites cause the browser to crash most often. Because millions of people use Chrome each day, Google can process the millions of private reports its users submit to get useful information about how the Web works. The same system would be a lot less feasible for a start-up with just a few hundred users and a tight budget.
Even when lots of users are involved, differential privacy isn’t a silver bullet. It only really works to answer simple questions, with answers that can be expressed as a single number or as counts of a few possibilities. Think of political polls: pollsters ask yes/no or multiple-choice questions and use them to get approximate results which are expressed in terms of uncertainty (e.g. 48% ± 2). Differentially private results have similar margins of error, determined by epsilon. For more complex data, such methods usually add so much noise that the results are pretty much useless. A differentially private photo would be a meaningless slab of randomly colored pixels, and a private audio file would be indistinguishable from radio static. Differential privacy is not, and never will be, a replacement for good, old-fashioned strong encryption.
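To make the poll analogy concrete, here's a minimal sketch of how a single count can be answered privately in the global setting, using the standard Laplace mechanism. The function name and the numbers are our own illustration, not anything Apple or Google actually runs:

```python
import numpy as np

def private_count(true_count, epsilon):
    """Answer a counting query with differential privacy by adding
    Laplace noise scaled to 1/epsilon (adding or removing one person
    changes a count by at most 1)."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Like a poll, the answer carries a margin of error set by epsilon:
print(private_count(4821, epsilon=0.1))   # very noisy, strong privacy
print(private_count(4821, epsilon=10.0))  # nearly exact, weak privacy
```

Try the same trick on every pixel of a photo and the noise swamps the signal entirely, which is why differential privacy stays in the world of simple aggregate statistics.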
The man who learned too much
Differentially private systems are tricky to get right, even for the right problem. Dishonest or sloppy engineering can cause a system that seems “private” to instead leak sensitive information.
In our last post, we gave two examples of systems with differential privacy. As we described, answering one question (in software parlance, a query) with differential privacy is easy, and can be as simple as flipping a coin. For example, Bob might query Alice for her preference in the upcoming election; to answer with differential privacy, Alice can flip a coin and let that determine whether she will answer truthfully or at random.
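Here's what that coin-flip protocol, usually called randomized response, might look like in code. This is a toy sketch of the idea from part one, not Apple's implementation, and the function name is ours:

```python
import random

def randomized_response(truthful_answer):
    """Answer a yes/no question with plausible deniability: flip a coin;
    on heads, tell the truth; on tails, flip again and report that
    second coin instead. (This gives roughly epsilon = ln 3.)"""
    if random.random() < 0.5:        # first flip: heads, so be honest
        return truthful_answer
    return random.random() < 0.5     # tails: answer at random

# Any single "True" Alice reports could just be the random second flip,
# so Bob can't be sure of her real preference.
print(randomized_response(True))
```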
However, when answering hundreds or thousands of queries — as is required by software systems in the real world — maintaining privacy becomes much more complicated. Here, we’ll touch on three issues that engineers need to think about when designing systems to protect user privacy: accumulated knowledge, collusion, and correlated data. These issues all relate back to privacy budgets. Remember, a privacy budget caps the total privacy loss that any individual or group of queriers is allowed to accrue, in order to keep the underlying data private. It’s not enough to build a system that’s capable of giving differentially private answers and set it loose: the system has to track who asks for what data and make sure nobody goes “over budget,” ever.
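In code, a privacy budget is essentially a running tally with a hard cutoff. The sketch below is a simplified illustration of that bookkeeping, with hypothetical names; real systems also have to decide what the total budget should be and how to authenticate queriers:

```python
class PrivacyBudget:
    """Toy tracker: record cumulative privacy loss per querier and refuse
    any query that would push them over the limit."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = {}

    def charge(self, querier, epsilon):
        already_spent = self.spent.get(querier, 0.0)
        if already_spent + epsilon > self.total:
            raise RuntimeError(f"{querier} is over budget; query refused")
        self.spent[querier] = already_spent + epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge("bob", 0.4)
budget.charge("bob", 0.4)
# budget.charge("bob", 0.4)  # would raise: Bob only has 0.2 left, forever
```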
Accumulated knowledge
The first issue may seem obvious: in order to be meaningful, privacy budgets have to be permanent. Once an untrusted querier uses up their budget, they need to be barred from asking questions about the same data ever again. Unfortunately, evidence shows that some companies either don’t understand or don’t respect this requirement. For example, Apple built local privacy into macOS, which means the OS should protect users’ data from Apple itself. However, according to researchers, the macOS implementation actually grants Apple a fresh privacy budget every single day. That practice allows Apple — the untrusted party — to accumulate more information about each user on each subsequent day. With every set of responses, the company can become more certain about the true nature of each user’s data. Data that are protected at first become more exposed the longer the system is left running.
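A short simulation shows why a budget that resets daily is so dangerous. The numbers below are made up, but the effect is general: each day's noisy report is a fresh clue, and averaging them steadily washes the noise out:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value, epsilon = 42.0, 1.0   # hypothetical user statistic and daily budget

for days in (1, 10, 100, 1000):
    # One fresh noisy report per day, as if the budget renewed each morning
    reports = true_value + rng.laplace(0.0, 1.0 / epsilon, size=days)
    print(f"after {days:4} days, best guess: {reports.mean():.2f}")
```

After enough days, the collector's estimate converges on the true value, and the formal guarantee is worth very little.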
Collusion
Second is the issue of collusion. Suppose Mrs. Alice, a teacher, has a private set of student grades, and grants Bob and Betty privacy budgets of ϵ = 10 each to query it (remember, ϵ measures privacy loss: the higher the value, the less noise and the weaker the privacy). Both of them can make the same set of queries independently, using up their own privacy budgets. However, if the two collude, and Bob shares his answers with Betty or vice versa, the total privacy loss in Mrs. Alice’s system can jump to ϵ = 20, which is less private. If Bob and Betty both ask for Andy’s test score, they can average their two answers together to get a better estimate than either one could on their own. And if more people conspire — say, the rest of the students in Andy’s class — they can combine their answers to figure out what he got to within a fraction of a point.
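A quick simulation makes the arithmetic of collusion visible. We've used a smaller ϵ than in the story so the noise is easy to see, and Andy's score is invented:

```python
import numpy as np

rng = np.random.default_rng(1)
andys_score, epsilon, trials = 87.0, 0.5, 10_000   # illustrative values

# Bob and Betty each get their own independently noised answer
bob = andys_score + rng.laplace(0.0, 1.0 / epsilon, trials)
betty = andys_score + rng.laplace(0.0, 1.0 / epsilon, trials)

print("Bob's typical error alone:    ", round(np.abs(bob - andys_score).mean(), 2))
print("Typical error after colluding:", round(np.abs((bob + betty) / 2 - andys_score).mean(), 2))
```

The colluders' estimate is consistently tighter than either answer alone, and it keeps tightening as more queriers pool their answers.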
Collusion is more of an issue for global privacy than it is for local. In a typical global system, a single party answers queries for lots of different people, each of whom has their own privacy budget. In a local system, like Apple’s, there is usually only one party (e.g. Apple) who is interested in the data, so one privacy budget suffices.
Collusion is tough to deal with because it’s hard to control with technical means alone. In general, system designers probably need to find a non-technical solution, potentially leveraging policies and procedures, to make sure would-be Bobs and Bettys don’t learn too much.
Correlation
Finally, privacy budgeting is further complicated by correlated data. We’ve already established that differentially private systems have to limit the number of times anyone can ask for the same data, but that’s not enough. Systems have to extend similar protections to related (mathematically, correlated) data as well. Under strict interpretations of differential privacy, any two data points that are correlated with each other have to be treated, for privacy purposes, as the same data point.
What does that mean? Let’s look at an example. If Bob asks once for Alice’s favorite brand of peanut butter and once for the age of her cat, he can use two separate privacy budgets; since, presumably, the data are unrelated, the number of times he asks about her cat doesn’t affect how much he can ask about her taste in processed legumes. But if Bob asks for two related attributes — Alice’s income and her monthly rent, for example — privacy loss adds up. Alice’s income doesn’t completely determine how much she spends on her apartment, but the two values are correlated. Knowing the answer to one question will help Bob predict the other. To preserve strictly defined differential privacy, Alice must treat Bob’s questions as if they were two ways of asking the same thing.
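One way to honor the strict interpretation is to make correlated attributes share a single budget line. The grouping below is a hypothetical illustration of the bookkeeping only; deciding which attributes actually belong together is the genuinely hard part:

```python
# Attributes that are correlated draw from the same budget "group".
CORRELATION_GROUPS = {
    "income": "finances",
    "monthly_rent": "finances",     # correlated with income: same group
    "peanut_butter_brand": "groceries",
    "cat_age": "pets",
}

spent = {}

def charge(attribute, epsilon, limit=1.0):
    """Charge a query against the budget of the attribute's group."""
    group = CORRELATION_GROUPS[attribute]
    if spent.get(group, 0.0) + epsilon > limit:
        raise RuntimeError(f"budget for '{group}' is exhausted")
    spent[group] = spent.get(group, 0.0) + epsilon

charge("income", 0.6)
charge("cat_age", 0.6)         # fine: unrelated data, separate budget
# charge("monthly_rent", 0.6)  # would raise: rent shares the finances budget
```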
Unlike the first two issues, the question of how best to deal with correlated data remains unsettled, even within the research community. Some privacy literature ignores correlation entirely and doesn’t extend extra protections to related data points; other papers endorse a strict approach like the one described above. Since correlation is everywhere, adopting a strict approach makes analysis harder. A few researchers are addressing the problem directly, looking for ways to protect correlated data without adding too much noise, but no single approach has won a consensus yet. In the meantime, and unsurprisingly, companies like Google have chosen less strict approaches that favor data utility over airtight privacy.
This all underlines the fact that differential privacy is a young, lively, and complex field of study. There are no easy answers. For privacy-conscious tech companies, building a system with differential privacy is not a way to avoid tough choices about user privacy — it means confronting them head on.
Perils and promise
Early adopters need to understand differential privacy’s pitfalls and respect its strengths and weaknesses. Otherwise, their systems are destined to fail. If companies try to force probabilistic privacy into places where it doesn’t fit, they’ll have to choose between systems that don’t provide good privacy and ones that don’t give useful results. Going forward, there’s a risk that companies will choose utility over privacy but brand their systems as “differentially private” anyway, luring consumers in under false pretenses and defeating the purpose of the technique. Regulators and consumer advocates need to be prepared to defend users and take dishonest companies to task.
Despite its complications, differential privacy is ascendant. In the past decade, it’s been one of the hottest topics in data science, and it’s now making the leap from theory to practice. When used correctly, it elegantly addresses the need for privacy-preserving analysis of big datasets. As long as the organizations that adopt differential privacy are transparent and honest about how their systems work, the technique should benefit data collectors and vulnerable users alike. In part three of this series, we’ll expand on how differential privacy is used today, where it could go from here, and how companies and lawmakers can adopt it responsibly.