(Re | De) – Identification

Much concern has been raised about re-identifying people from de-identified data. For those who may not know the term, de-identification is the process of removing identifying information from data sets. Think of health records where you remove the person’s name, date of birth, address, etc. and leave the raw health data.

The risks of re-identification have been both exaggerated and underplayed. I’d like to put forth a small framework for thinking about re-identification risk. There are two variables I’d like you to consider. The first is whether the information can be re-identified with public data, private data or theoretical data. For instance, in the infamous Netflix case, individuals’ movie-viewing habits were re-identified through publicly available data. Specifically, Netflix released de-identified data containing only the names of rentals, dates of rentals and a unique identifier to correlate renters. The purpose was to allow teams to develop a better suggestion engine. However, by correlating this data with public IMDb posts, researchers were able to identify a subset of the renters who had rented certain movies on certain days and then posted reviews of those films on IMDb.
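To make the mechanics of this kind of linkage concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the records, the names, the `link` function and the `min_matches` threshold are all assumptions, and matching on exact title and date is a simplification rather than a description of what the researchers actually did.

```python
from collections import defaultdict

# Hypothetical de-identified release: (pseudonymous id, title, date).
released = [
    ("u1", "Movie A", "2005-03-01"),
    ("u1", "Movie B", "2005-03-09"),
    ("u1", "Movie C", "2005-04-02"),
    ("u2", "Movie A", "2005-03-01"),
    ("u2", "Movie D", "2005-05-20"),
]

# Hypothetical public posts: (public name, title, date).
public_reviews = [
    ("alice", "Movie A", "2005-03-01"),
    ("alice", "Movie B", "2005-03-09"),
    ("alice", "Movie C", "2005-04-02"),
    ("bob", "Movie D", "2005-06-30"),
]


def link(released, public_reviews, min_matches=2):
    """Map a pseudonymous id to a public name when exactly one name
    shares at least `min_matches` (title, date) pairs with that id."""
    # Index the public posts by (title, date).
    by_event = defaultdict(set)
    for name, title, date in public_reviews:
        by_event[(title, date)].add(name)

    # Count, per pseudonymous id, how many events each public name shares.
    counts = defaultdict(lambda: defaultdict(int))
    for uid, title, date in released:
        for name in by_event.get((title, date), ()):
            counts[uid][name] += 1

    # Keep only unambiguous matches.
    matches = {}
    for uid, per_name in counts.items():
        candidates = [n for n, c in per_name.items() if c >= min_matches]
        if len(candidates) == 1:
            matches[uid] = candidates[0]
    return matches


reidentified = link(released, public_reviews)
```

In this toy data only `u1` is linked to a public name. `u2` shares a single event with a public poster, which falls below the threshold, so that renter stays anonymous. That is the "subset" effect: only people who left enough of a public trail can be matched.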

Contrast that scenario with what happened in the re-identification of Paula Broadwell. Paula Broadwell was CIA Director Petraeus’ biographer and was carrying on an illicit affair with him. Suffice it to say the FBI was able to identify her by correlating two privately held pieces of information. She logged into a joint Google account shared by her and Director Petraeus. Google had the IP addresses but not her name. The FBI traced those IP addresses to the hotels where they originated, but that still didn’t reveal her name. Mrs. Broadwell was uncovered because only she fit the unique pattern of having stayed at those hotels at the times the Gmail account was accessed from those IP addresses. Even the anonymous information about IP addresses and login times for a Gmail account, when combined with the hotels’ private information, was enough to re-identify Mrs. Broadwell.
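The "only she fit the pattern" step is, at its core, a set intersection. The sketch below illustrates the idea with entirely made-up data: the IP addresses, dates, hotel names, guest labels and the `candidates` function are hypothetical and are not drawn from the actual investigation.

```python
# Hypothetical login events held by the email provider: (ip address, date).
access_log = [
    ("203.0.113.5", "2000-01-10"),
    ("198.51.100.7", "2000-02-14"),
    ("192.0.2.9", "2000-03-21"),
]

# Hypothetical mapping from IP address to the hotel it belongs to.
ip_to_hotel = {
    "203.0.113.5": "Hotel A",
    "198.51.100.7": "Hotel B",
    "192.0.2.9": "Hotel C",
}

# Hypothetical private hotel registers: guests present per (hotel, date).
guest_register = {
    ("Hotel A", "2000-01-10"): {"guest_1", "guest_2", "guest_3"},
    ("Hotel B", "2000-02-14"): {"guest_2", "guest_3", "guest_4"},
    ("Hotel C", "2000-03-21"): {"guest_3", "guest_5"},
}


def candidates(access_log, ip_to_hotel, guest_register):
    """Return the guests who were present at every traceable login event."""
    remaining = None
    for ip, date in access_log:
        hotel = ip_to_hotel.get(ip)
        if hotel is None:
            continue  # this event cannot be traced to a hotel
        guests = guest_register.get((hotel, date), set())
        # Narrow the candidate pool with each additional event.
        remaining = set(guests) if remaining is None else remaining & guests
    return remaining or set()


suspects = candidates(access_log, ip_to_hotel, guest_register)
```

Each login on its own leaves several candidates, but in this toy data the intersection across all three events leaves exactly one guest. Neither data set names anyone by itself; the identification only emerges from the combination.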

The third classification would be the combination of data with theoretical data. I don’t have a good example of this, but consider again some health data about the patients at a clinic. If someone had been sitting in the parking lot monitoring the cars, they could theoretically combine the record of which cars arrived on which days with which treatments were performed on those days. With enough days of monitoring, a good portion of the clinic’s population might be re-identified, or at the least correlated with the car they arrived in.

So the first question one needs to ask is:

1) Is the data needed for re-identification publicly, privately or theoretically available?

The second question to consider for risk analysis is one of scale.

2) When combined with other available data sets, will re-identification be possible for one, many or all of the anonymized population?

For example, in the Paula Broadwell case, combining the data identified only one person. Identifying every other Gmail account holder would have taken a similar effort for each of them. This is an economic barrier to privacy, one of the three privacy vectors I often talk about. In the Netflix case, combining the data allowed for re-identification of a subset of the renters, though probably not a large subset.

Clearly, public data sets pose a greater re-identification risk than private data sets, which in turn pose more risk than theoretical data sets.
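For those who like to see a framework written down, the two questions can be captured as a pair of ordered categories. This is only a rough sketch under my own assumptions: the class names, the numeric ordering of scale (all above many above one) and the choice to rank by availability first and scale second are illustrative, not a calibrated risk score.

```python
from dataclasses import dataclass
from enum import IntEnum


class Availability(IntEnum):
    """Question 1: where does the linking data live? Higher means riskier."""
    THEORETICAL = 1
    PRIVATE = 2
    PUBLIC = 3


class Scale(IntEnum):
    """Question 2: how much of the population can be re-identified?"""
    ONE = 1
    MANY = 2
    ALL = 3


@dataclass(frozen=True)
class Assessment:
    dataset: str
    availability: Availability
    scale: Scale

    def sort_key(self):
        # Assumption: availability dominates, scale breaks ties.
        return (self.availability, self.scale)


# The three scenarios discussed in this post, classified by hand.
assessments = [
    Assessment("clinic parking lot", Availability.THEORETICAL, Scale.MANY),
    Assessment("hotel and login records", Availability.PRIVATE, Scale.ONE),
    Assessment("movie rentals", Availability.PUBLIC, Scale.MANY),
]

# Riskiest first.
ranked = sorted(assessments, key=Assessment.sort_key, reverse=True)
```

Sorting this way puts the movie rental release first, the hotel and login records second and the parking lot scenario last. A real assessment would also need to weigh the harms to the people involved, which no two-column ranking captures.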

Once you’ve identified the means of re-identification, you then have to take into consideration the particularized risks that will befall re-identified persons. Consider not just compliance risks and objective harms like identity theft, but also subjective harms and the loss of trust that may befall the organization releasing the information. There are policy arguments to weigh when considering releasing data for public or private review, and by no means does the privacy argument rule the day. But before you can make the determination, you must have all the available information in front of you.