Differential privacy, part 3: Extraordinary claims require extraordinary scrutiny

In September, Wired wrote about Apple’s controversial implementation of “differential privacy” in its new operating systems. We’re trying to put Apple’s use (or misuse) of differential privacy into context, to ensure that this promising new technological approach to keeping people’s data private will be used responsibly. That’s especially important for safeguarding the privacy of those most at risk, such as activists, journalists, and human rights defenders.

In our first post in this three-part series, we explained what differential privacy is and how it works, and in part two, we explored the issues that make it complicated. Now, in part three, we examine the ways differential privacy is used in practice today and discuss how companies can adopt it responsibly.

Differential privacy is new, exciting, and not widely understood. To people familiar with encryption, it sounds like a new form of technical magic: a complicated mathematical guarantee that your data will be safe. But while the application of encryption is black and white — either your data are encrypted, or not — differential privacy can only be understood in shades of gray. In the first post of this series, we explained that epsilon (ϵ), the “magic number” that describes how private a system is, can be tough for even experts to interpret. And in the second post, we touched on some of the limits of differentially private systems and the many ways they can fail. Early adopters of differential privacy will have to make trade-offs between utility and privacy, and they are bound to make mistakes. They need to be open and honest about the ways their systems work so that people can make educated choices about their data.

Now we’ll take a closer look at the details of differential privacy in action, focusing on its use by two major companies: Google and Apple.

Google

In 2014, to little fanfare, Google introduced differential privacy to Google Chrome, the most popular browser in the world. Using a custom system called RAPPOR, Google collects probabilistic information about Chrome users’ browsing patterns. With the system, the browser converts strings of characters (like URLs) to short strings of bits using a hash function, then adds probabilistic noise and reports the results to Google. Once Google has collected hundreds of thousands of these private hashed strings, they can learn which strings are most common without knowing any one person’s real string. As described in our first post, this is a locally private system, and Google acts as an untrusted aggregator. This lets them estimate things like the click-through rates for dialogs and the most common home pages worldwide.
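
To make that concrete, here is a minimal Python sketch of the general idea behind RAPPOR-style local randomization: hash a string into a short bit array, then flip each bit with some probability before reporting. The parameters, helper names, and flip probability below are illustrative assumptions, not Chrome’s actual configuration or code.

```python
import hashlib
import random

# Illustrative parameters (not Chrome's real configuration).
NUM_BITS = 16     # size of the bit array each value is hashed into
NUM_HASHES = 2    # number of hash functions
P_FLIP = 0.25     # probability of flipping each bit; this controls epsilon

def bloom_bits(value: str) -> list[int]:
    """Hash a string (e.g. a URL) into a short array of bits."""
    bits = [0] * NUM_BITS
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        bits[digest[0] % NUM_BITS] = 1
    return bits

def randomize(bits: list[int]) -> list[int]:
    """Flip each bit with probability P_FLIP (randomized response).

    A single noisy report reveals little about the true value, but
    aggregated over many users the most common values can still be
    estimated by correcting for the known noise rate.
    """
    return [b ^ (1 if random.random() < P_FLIP else 0) for b in bits]

# The browser would send the noisy bits instead of the URL itself.
report = randomize(bloom_bits("https://example.com"))
```

Because every report is noisy, the aggregator can only recover population-level statistics, which is the property that lets Google act as an untrusted aggregator.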

Google built RAPPOR into Chromium, the open-source core of the browser, so anyone can dig into the code and see how it works. That is exactly what we did. In total, we found 135 metrics that Chrome can report privately. Each is assigned to one of two tiers of privacy settings, detailed in the table below: 107 metrics use the lower-epsilon, higher-privacy setting, and 28 use the lower-privacy setting so that Google can get better estimates. Metrics are reported for those of us who don’t opt out of collection at installation.

Type                   Single-report epsilon   Total privacy budget   Total metrics
Normal reports         2.04                    4.39                   107
Low-privacy reports    3.15                    7.78                   28

Table 1: Summary of private metrics reported to Google by Chrome. Higher values of epsilon guarantee less privacy.

The system protects privacy over time in a few ways. Each metric is reported at most once every half hour. Repeated reports of the same value for the same metric draw from a shared, total privacy budget, so Google can’t collect reports about the same value over and over. Those budgets are permanent: once spent, they never replenish.
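
As a rough illustration of how such a permanent, per-value budget might be tracked (this bookkeeping is our own sketch, not Chrome’s implementation), the epsilon values from Table 1 leave room for only a couple of reports of any given value:

```python
# Illustrative budget accounting for a locally private reporter.
# The epsilon values mirror Table 1; the bookkeeping itself is our own sketch,
# not Chrome's implementation.
SINGLE_REPORT_EPSILON = 2.04
TOTAL_BUDGET = 4.39

# Budgets are keyed by (metric, value) and are never replenished ("permanent").
spent: dict[tuple[str, str], float] = {}

def try_report(metric: str, value: str) -> bool:
    """Return True if a private report may still be sent for this (metric, value)."""
    key = (metric, value)
    if spent.get(key, 0.0) + SINGLE_REPORT_EPSILON > TOTAL_BUDGET:
        return False  # budget exhausted: this value is never reported again
    spent[key] = spent.get(key, 0.0) + SINGLE_REPORT_EPSILON
    return True

# With these numbers, only two reports of the same value fit in the budget:
# 2.04 + 2.04 = 4.08 <= 4.39, but a third report would exceed it.
```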

However, there are a couple of issues with calling the system in Chrome truly “differentially private.” First, it only protects users from repeatedly reporting the same exact value. Privacy loss due to similar (but distinct) values for the same metric is not accounted for. Second, it doesn’t handle correlated metrics well. Many of the 135 metrics are extremely similar or concretely related, yet each uses its own budget. In the last post, we explained that any two related data points need to be protected as if they were two versions of the same thing.

Let’s look at an example from the Chrome for Android source code. When a site triggers an “app banner,” Chrome reports the site’s URL with two separate metrics: one is reported when the banner is displayed, and one when it’s dismissed. Obviously, in order to be dismissed, the banner has to be displayed first; therefore, whenever a user dismisses a banner, two separate metrics report that Site X opened one. This means that information like “User Y visited Site X” is more vulnerable to disclosure than it needs to be.
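
The arithmetic follows from the composition property of differential privacy: privacy losses about the same underlying fact add up. Assuming both banner metrics use the normal per-report epsilon from Table 1 (an assumption on our part), a single dismissal can cost roughly twice what one report would:

```latex
% Sequential composition: reports about the same underlying event add their epsilons.
% (Assumes both banner metrics use the "normal" per-report epsilon of 2.04 from Table 1.)
\[
  \epsilon_{\text{effective}}
    \;\le\; \epsilon_{\text{shown}} + \epsilon_{\text{dismissed}}
    \;=\; 2.04 + 2.04
    \;=\; 4.08
\]
```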

From one perspective, this is a problem. Google has published highly cited research about protecting user privacy, but when it comes to its own product, the company hasn’t implemented the technique as well as it could have. That sets a bad example. If Google can’t or won’t respect the rules of differential privacy, who will?

Still, from a practical standpoint, the system is pretty good for privacy. None of the metrics report information that most people would consider very sensitive, and linking any value to a single user would be exceedingly difficult. Without RAPPOR, all of these metrics would likely be reported in the clear. In addition, apart from a single post on its security blog, Google has not advertised that Chrome uses differential privacy.

Apple

Apple has been the highest-profile adopter of differential privacy so far. The company has shown a commitment to user privacy in other areas, so despite a lack of details about the system at launch, consumers may have been inclined to trust Apple. The latest versions of iOS and macOS do indeed include some impressive protections for user data. However, Apple’s implementation takes some liberties that undermine the company’s central claim that data are collected “using differential privacy.”

Like Google’s, Apple’s system is locally private, but it differs in a couple of key respects. For one, while Google assigns one budget to each value for each metric, Apple assigns privacy budgets to whole categories of related metrics. That means that all health metrics are protected by the same “health” budget, and every emoji your computer reports draws from the same “keyboard.Emoji” budget. This is a good thing: it means Apple has considered, and addressed, the problem of correlated metrics.
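
In budget-accounting terms, the difference is simply the key used to track spending. The sketch below is a hypothetical illustration; the metric names (other than keyboard.Emoji), the budget value, and the mapping are assumptions, not Apple’s code or published parameters. Every report is charged against its category’s shared budget rather than against an individual metric.

```python
# Hypothetical per-category budget accounting. The metric names (other than
# keyboard.Emoji), the budget value, and the mapping are illustrative
# assumptions, not Apple's code or published parameters.
CATEGORY_BUDGET = 4.0

# Every metric in a category shares one budget entry.
CATEGORY_OF = {
    "health.heartRate": "health",
    "health.stepCount": "health",
    "keyboard.Emoji":   "keyboard.Emoji",
}

spent_by_category: dict[str, float] = {}

def charge_report(metric: str, epsilon: float) -> bool:
    """Charge a report against the shared budget of the metric's category."""
    category = CATEGORY_OF.get(metric, metric)
    if spent_by_category.get(category, 0.0) + epsilon > CATEGORY_BUDGET:
        return False  # the whole category stops reporting, not just one metric
    spent_by_category[category] = spent_by_category.get(category, 0.0) + epsilon
    return True
```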

The other major difference is decidedly less good. According to a study of macOS, budgets are not permanent, and the “balance” of each privacy budget increases periodically. Once an app has used up its whole budget for reporting data, it just has to wait a few hours before it can report more. Apple has stated that they maintain user data for up to three months, which means weeks’ worth of data from a single user may end up in the same analysis. This kind of behavior is unprecedented in the academic literature, and it defeats much of the purpose of having a privacy budget in the first place.
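
The consequence shows up directly in the cumulative privacy loss. If a budget of ε_max can be spent and then replenished every period T (both symbols are ours, for illustration, not published Apple parameters), total loss grows without bound over time, whereas a permanent budget caps it:

```latex
% Cumulative privacy loss with a replenishing budget versus a permanent one.
% \epsilon_{\max} and T are illustrative symbols, not published Apple parameters.
\[
  \epsilon_{\text{total}}(t) \;\approx\; \Big\lfloor \tfrac{t}{T} \Big\rfloor \, \epsilon_{\max}
    \;\longrightarrow\; \infty \quad (t \to \infty),
  \qquad\text{whereas a permanent budget gives}\qquad
  \epsilon_{\text{total}}(t) \;\le\; \epsilon_{\max} \ \text{for all } t.
\]
```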

Since we published part one, Apple has released a white paper laying out their differential privacy implementation in more detail than before. According to the report, Apple does not store any kind of identifying data (including IP addresses) along with private reports, and does not attempt to link private reports to users in any way. These are good protections for users to have, and they indicate a comprehensive commitment to privacy on Apple’s part. However, the document doesn’t refute what the researchers discovered: as measured by epsilon, privacy loss is unbounded in the long run.

In practice, Apple’s system offers strong protections for user privacy. Reports are randomized, stripped of identifiers, and only analyzed in aggregate; even if Apple employees were malicious, it would be nearly impossible for them to learn much about a single user that they didn’t already know. That does not change the fact that, analyzed under the rubric of differential privacy, their system is lacking. That matters because Apple has made the decision to advertise differential privacy as a major part of their data protection strategy. As the first big company to sell — and market — products using the technology, Apple is setting a precedent. If they play fast and loose with mathematical proofs, other companies might feel licensed to do the same without including the extra protections Apple has put in place. Apple has demonstrated a responsible approach to privacy, but they also have a responsibility to be accurate about what their systems do — and what they do not.

Going forward

Differential privacy appears to be on the brink of widespread adoption. In September, the bipartisan U.S. Commission on Evidence-Based Policymaking released its final report, outlining a plan for increasing the role of data analysis in policymaking. The report devotes significant time to privacy-preserving data sharing, and it mentions differential privacy by name as a way for government-sponsored researchers to analyze sensitive data.

This means that, in the next few years, this approach may be applied to data that are much more sensitive than emojis. We should encourage companies, as well as governments, to embrace differential privacy where it makes sense. We also must insist that they be honest and upfront with those whose privacy is at stake, and that they treat the theory with respect.

That starts with the companies leading the charge, namely Apple and Google. The systems that both companies have built undoubtedly protect user data to some extent. Both companies should be applauded for voluntarily adding privacy protections to their products, and for being willing to experiment with cutting-edge technological approaches. But differential privacy is a measurement of a particular kind of privacy, and by that measure, both systems come up short.

While their software systems are comparable, the two companies have approached transparency very differently. Google published a research paper outlining the details and drawbacks of their approach before deploying it in their products, and they made their implementation open source from the beginning. In contrast, Apple touted their system’s use of differential privacy in marketing literature, but did not release critical details about it until over a year later — after researchers had already dug into it on their own. The lack of transparency may have misled users about the guarantees Apple’s products actually provide.

Finally, it’s important to reiterate that differential privacy is a specific, extraordinarily strict mathematical standard. It should never become the single metric by which privacy is assessed. In most cases, it’s above and beyond what’s necessary to keep data “private” in the conventional sense, and for many tasks it’s impossible to build differentially private systems that do anything useful. Companies should try to embrace differential privacy for the right problems, but when they make extraordinary claims about their systems, they must expect to be held to extraordinary standards.