Breaking down “Personal Data”

I’ve rallied for years against the use of PII (or Personally Identifiable Data) as unhelpful in the privacy sphere. This term is used is some US legislation and has unfortunately made its way into the vernacular of the cyber-security industry and privacy professionals. Use of the term PII is necessarily limiting and does allow organizations to see the breadth of privacy issues that may accompany non-identifying personal data. This post is meant to shed light on the nuances in different types of data. While I’ll reference definitions found in the GDPR, this post is not meant to be legislation specific.

Personal Data versus Non-personal Data

The GDPR defines Personal Data as “means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.” The key here term in the definition is the phrase “relating to.” This broad refers to any data or information that has anything to do with a particular person, regardless of whether that data helps identify the person or that person is known. This contrast with non-personal data which has no relationship to an individual.

Personal Data: “John Smith’s eyes are blue.”

In this phrase, there are three pieces of personal data. The first is the name John which is a first name related to an individual, John Smith. The second is his last name. Finally, the third is blue eyes, which also relates to John Smith.

Anonymous Data: “People’s eyes are blue.”

No personal data is indicated in the above sentence as the data doesn’t relate to an individual, identified or identifiable. It relates to people in general.

Identified Data versus Pseudonymous Data

Much consternation has been exhibited over the concept of pseudonymized data. The GDPR provides a definition of pseudonymized: “means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.” The key phrase in this definition is that data can no longer be attributed to an individual without additional data. Let me break this down.

Identified Data: “John Smith’s eyes are blue.”
The same phrase we used in our example for Personal Data is identified because the individual, John Smith, is clearly identified in the statement.

Pseudonymous (Identifiable) Data: “User X’s eyes are blue.”
Here we have processed the individual’s name and replaced it with User X. In other words, its been pseudonymized. However, it is still identifiable. From the definition above, Personal Data is data relating to an identified or identifiable individual. Blue eyes are still related to an identifiable individual, User X (aka John Smith). We just don’t know who he is at the moment. Potentially we can combine information that links User X to John Smith. Where some people struggle is understanding there must be some form of separation between the use of the User X pseudonym and User X’s underlying identity. Store both in one table without any access controls and you’ve essentially pierced the veil of pseudonymity. WARNING: Here is where it can get tricky. Blue eyes are potentially identifying. If John Smith is the only user with blue eyes, it makes it much easier to identify User X as John Smith. This is huge pitfall as most attributable data is potentially re-identifying when combined with some other data.

Identifying Data versus Attributable Data

In looking at the phrase “John Smith’s eyes are blue” we can distinguish between identifying data and attributable data.

Identifying Data: “John Smith”
Without going into the debate of number of John Smiths in the world, we can consider a person’s name as fairly identifying. While John Smith isn’t necessarily uniquely identifying, a type of data, a name, can be uniquely identifying.

Attributable Data: “blue eyes”
Blue eyes is an attribution. It can be attributable to a person, in the case of our phrase “John Smith’s eyes are blue.” It can be attributable to a pseudonym: “User X’s eyes are blue.” As we’ll see below, it can also be attributed anonymously.

Anonymous Data versus Anonymized Data

GDPR doesn’t define anonymous data but in Recital 26 it says “anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” In the first example, I distinguished Personal Data with Anonymous Data, which didn’t relate to specific individual. Now we need to consider the scenario where we have clearly Personal Data which we anonymize (or render anonymous in such a manner that the data subject is not or no longer identifiable).

Anonymous Data: “People’s eyes are blue.”
For this statement, we were never talking about a specific individual, we’re making a generalized statement about people and an attributed shared by people.

Anonymized Data: “User’s eyes are blue.”
For this statement, we took Identified Data (“John Smith’s eyes are blue”) and processed in a way that is potentially anonymous. We’ve now returned to the conundrum presented with Pseudonymous data. Specifically, if John Smith is the only user with blue eyes, then this is NOT anonymous. Even if John Smith is a one of a handful of users with blue eyes, the degree of anonymity is fairly low. This is the concept of k-anonymity, whereby a particular individual is indistinguishable from k-1 other individuals in the data set. However, even this may not given sufficient anonymization guarantees. Consider a medical dataset of names, ethnicities and heart condition. A hospital releases an anonymized list of heart conditions (3 people with heart failure, 2 without). Someone with outside knowledge (that those of Japanese descent rarely have heart failure and the names of patients) could make a fairly accurate guess as to which patients had heart failure and which did not. This revelation brought about the concept of l-diversity in anonymized data. The point here is that unlike Anonymous Data which never related to a specific individual, Anonymized Data (and Pseudonymized Data) should be carefully examined for potential re-identification. Anonymizing data is a potential minefield.

If you need help navigating this minefield, please feel free to reach out to me at Enterprivacy Consulting Group

Bots, privacy and sucide

I had the pleasure of serving last week on a panel at the Privacy and Security Forum with privacy consultant extraordinaire Elena Elkina and renowned privacy lawyer Mike Hintze. The topic of the panel was Good Bots and Bad Bots: Privacy and Security in the Age of AI and Machine Learning. Serendipitously, on the plane to D.C. earlier that morning someone had left a copy of the October issue of Wired Magazine, the cover of which displayed a dark and grim image of Ryan Gosling, Harrison Ford, Denis Villeneuve, and Ridley Scott from the new dystopian film, Blade Runner 2049. Not only was this a great intro to the idea of bot (in the movie’s case human like androids) but the magazine contained two pertinent articles to our panel discussion: “Q: In. Say. A customer service chat window, what’s the polite way to ask whether I’m talking to a human or a robot?” and “Stop the chitchat: Bots don’t need to sound like us.” Our panel dove into the ethics and legality of deception, in say a customer service bot pretending to be human.

White the idea was fresh in my mind, I wanted to take a moment to replay some of the concepts we touch upon for a wider audience and talk about the case study we used in more detail than the forum allowed. First off, what did we mean by bots? I don’t claim this is a definitive definition but we took the term, in this context, to mean two things:

  • Some form of human like interface. This doesn’t mean they have to the realism of Replicants in Blade Runner, but some mannerisms in which a person might mistake the bot for another person. This goes back to the days, as Elena pointed out, of Alan Turing and his Turing test, years before any computer could even think about passing. (“I see what you did there.”). The human like interface potentially has an interesting property, are people more likely to let their guard down and share sensitive information if they think they are talking to another person? I don’t know the answer to that and their may be some academic research on that point. If their isn’t I submit that it would make for some interesting research.
  • The second is the ability to learn and be situationally aware. Again, this doesn’t require the super sophistication of IBM’s Watson but any ability to adapt to changing inputs from the person with whom it is interacting. This is key, like the above, to giving the illusion a person is interacting with another person. By counter example, Tinder is littered with “bots” that recite scripts with limited, if any, ability to respond to interaction.

Taxonomy of Risk

Now that we have a definition, what are some of the heightened risks associated with these unique characteristics of a bot that, say, a website doesn’t have? I use Dan Solove’s Taxonomy of Privacy as my goto risk framework. Under the taxonomy I see 5 heightened risks:

  1. Interrogation (questioning or probing of personal information): In order to be situationally aware, to “learn” more, a bot may ask questions of someone. Those questions could go too far. While humans have developed social filters, which allows us to withhold inappropriate questions, a bot lacking a moral or social compass could ask questions which make the person uncomfortable or is invasive. My classic example of interrogation is an interview where the interviewer asks the candidate if they are pregnant or planning to become pregnant. Totally inappropriate in a job interview. One could imagine a front like recruitment bot smart enough to know that pregnancy may impact immediate job attendance of a new hire but not smart enough to know that it’s inappropriate to ask that question (and certainly illegal in the U.S. to use pregnancy as a discriminatory criteria in hiring).
  2. Aggregation (combining of various piece of personal information): Just as not all questions are interrogations, not all aggregation of data creates a privacy issue. It is when data is combined in new and unexpected ways, resulting in information disclosure than the individual didn’t want to disclose. Anyone could reasonably assume Target is aggregating sales data to stock merchandise and make broad decisions about marketing, but the ability to discern pregnancy of a teenager from non-baby related purchased was unexpected, and uninvited. For a pizza ordering bot, consider the difference between knowing my last order was a vegetable pizza and discerning that I’m a vegetarian (something I didn’t disclose) because when I order for one its always vegetable but if I order for more than one, it includes meat dishes.
  3. Identification (linking of information to a particular individual): There may be perfectly legitimate reasons a bot would need to identify a person (to access that person’s bank account for instance) but identification as an issue comes into play when its the perception of the individual that they would remain anonymous or at the very least pseudonymous. If I’m interacting with a bot as StarLord1999 and all the sudden it calls me by the name Jason, I’m going to be quite perturbed.
  4. Exclusion (failing to let an individual know about the information that others have about her and participate in its handling or use): As with aggregation, a situationally aware bot, pulling information from various sources may alter its interaction in a way that excludes the individual from some service without the individual understanding why and based on data the individual doesn’t know it has. For instance, imagine a mortgage loan bot, that pulls demographic information based on a user’s current address, and steers them towards less favorable loan products. That practice sounds a lot like red-lining and if it has discriminatory effects, could be illegal in the U.S.
  5. Decisional Interference (intruding into an individual’s decision making regarding her privacy affairs): The classic example I use for decisional interference is China’s historic one-child policy which interferes with a family’s decision making on their family make-up, namely how many children to have. So you ask, how can a bot have the same effect? Note the law is only influential, albeit in a very strong way. A family can still physically have multiple children, hide those children or take other steps to disobey the law, but the law is still going to have a manipulatory effect on the decision making. A bot, because if it’s human interface, and advanced learning and situational knowledge, can be used to psychologically manipulate people. If the bot knows someone is psychologically prone to a particular type of argument style (say appealing to emotion) it can use that and information at it’s disposal to subtly persuade you towards a certain decision. This is a form of decisional interference.

Architecture and Policy

I’m not going to go into a detailed analysis of how to mitigate these issues, but I’ll touch on two thoughts: first, architectural design and second, public policy analysis. Privacy friendly architecture can be analyzed along two axes, identifiability and centralization. The more identified and more centralized the design, the less privacy friendly it is. It should be obvious that reducing identifiability reduces the risk of identification and aggregation (because you can’t aggregate external personal data from unidentified individuals) so I’ll focus here on centralization. Most people would mistakenly think of bots as being run by a centralized server, but this is far from the case. The Replicants in Blade Runner or “autonomous” cars are both prominent examples of bots which are decentralized. In fact, it should be glaringly apparent that a self-driving car being operated by a server in some warehouse introduces unnecessary safety risks. The latency of the communication, potential for command injections at the server or network layer, and potential for service interruption are unacceptable. The car must be able to make decisions immediately, without delay or risk of failure. Now decentralization doesn’t help with many of the bot specific issues outlined above, but it does help with other more generic privacy issues, such as insecurity, secondary use and others.

Public policy analysis is something I wanted to introduce with my case study during the interactive portion of the session at the Privacy and Security Forum. The case study I present was as follows:

Kik is a popular platform for developing Bots. https://bots.kik.com/#/ Kik is a mobile chat application used by 300 million people worldwide and an estimated 40% of US teens at one time or another have used the application. The National Suicide Prevention Hotline, recognizing that most teens don’t use telephones wants to interact with them in services they use. The Hotline wants to create a bot to interact with those teens and suggest helpful resources. Where the bot recognizes a significant risk of suicide rather than just casual inquiries or people trolling the service, the interactions will first be monitored by a human who can then intervene in place of the bot, if necessary.

I’ll highlight one issue, decisional interference, to show why it’s not a black and white analysis. Here, one of the objectives of the service and the bot, is to prevent suicide. As a matter of public policy, we’ve decided that suicide is a bad outcome and we want to help people who are depressed and potentially suicidal get the help they need. We want to interfere with this decision. Our bot must be carefully designed to promote this outcome. We don’t want the bot to develop in a way that doesn’t reflect this. You could imagine a sophisticated enough bot going awry and actually encouraging callers to commit suicide. The point is, we’ve done that public policy analysis and determined what the socially acceptable outcome is. Many times organizations have not thought through what decisions might be manipulated by the software they create and what the public policy is that should guide they way the influence those decisions. Technology is not neutral. Whether it’s is decisional interference or exclusion or any of the other numerous privacy issues, thoughtful analysis must precede design decisions.