Privacy Engineering 10 years on

[In July 2023, Kim Wuyts and Isabel Barbera invited me to present the keynote talk to the International Workshop on Privacy Engineering in Delft, Netherlands. Subsequent to that, and because we felt there wouldn’t be an overlapping audience, Nandita Narla and Nikita Samarin invited me to give the same talk to another group of privacy engineers at the PEP23 workshop ahead of SOUPS in Anaheim, CA. For those who couldn’t be at either event, I decided to write this blog post to summarize my talk.]

In September of 2013, I authored a blog post for the International Association of Privacy Professionals entitled “Is 2013 the year of the privacy engineer?” The post came after Stuart Shapiro and I gave two early workshops on privacy engineering to crowds at IAPP conferences that year, and just months before our seminal paper on privacy engineering, co-authored with Ann Cavoukian, at the time the Information and Privacy Commissioner of Ontario, Canada. The IAPP blog post laid out a basic argument that addressing privacy issues needed to move out of legal departments and into engineering. Clearly, my prognostication was premature as to actual industry action, which still prefers legal solutions over technical ones. What about 2023, though? Is it finally time for the ascendance of the privacy engineer?

Before we can answer this question, we need to explore a bit about the concept of privacy engineering. Just a few weeks ago the IAPP published an infographic “Defining Privacy Engineering.”

The publication was the culmination of my push, while a member of the IAPP’s Privacy Engineering Advisory Board, to advocate for rigorous construction of the field. My concern was the term being diluted and applied to anything “technical” in the field of privacy. Anecdotal evidence suggested many lawyers and privacy analysts were deeming any type of coding or technical implementation as “engineering.” This led to bloggers and journalists also applying the engineering moniker to those doing the technical aspects of privacy based on what they were learning from the privacy professionals of the day.

A feedback loop ensued! Privacy professionals reading those blogs and articles similarly referred to technical privacy folks as engineers. Even those doing technical work started referring to themselves as privacy engineers, a form of self-selection bias, which only added fuel to the cycle. Of course, compounding this are job descriptions, which, despite being labeled as “privacy engineer,” could just as easily fall under the title of “analyst,” “technologist” or some other non-engineering role. During my exploration of this topic, I found that HR departments at large not-to-be-named tech companies often require the engineering department to hire an “engineer,” thus leading to an explosion of engineering roles with no engineering duties (“public relations engineer,” “finance engineer,” etc.).

This debate (“who is the true engineer”) that I’m putting forward is reminiscent of one from my childhood, the question of who is a punk and who is a poseur (aka, a suburbanite who spikes their hair on the weekends but doesn’t live the lifestyle of punk). Though I don’t recall the intensity of the debate at that time, in 1985 MTV (back when they played music and not 24 hours of Ridiculousness) put out a documentary on the LA punk scene entitled “Punks & Poseurs” which suggested the question was one of massive scrutiny within the subculture.

Lest you think the punk/privacy connection is fleeting, I present exhibit two into evidence: the color similarities of the album cover for the Exploited’s Punks Not Dead and CMU’s privacy engineering program logo “Privacy is Not Dead.”

While I don’t think any privacy engineers are going around beating up faux privacy engineers, the debate is worth having, in my opinion, and potentially more important to those relying on the services of such engineers. So let’s stage dive into the discussion:

What is Engineering?

There are a few authoritative definitions of engineering, but I like this one from the Accreditation Board for Engineering and Technology: “The profession in which a knowledge of the mathematical or physical sciences gained by study, experience and practice is applied with judgement to develop ways to utilize, economically, the materials and forces of nature for the benefit of mankind.” Applying this to privacy, it seems fairly logical that privacy takes the role of the “benefit of mankind.” I think the primary gist of what distinguishes an engineer from say an analyst or other professional working in the area is the application of knowledge of mathematical or physical sciences.

I’d like to analogize this to bridge building. You can design and build a bridge without “engineering” that bridge. However, you may end up with a bridge that collapses, as the Tacoma Narrows Bridge did in 1940. In response, a builder could naively suggest a stronger bridge, better materials, or more pillars, but it takes the analysis of an engineer, applying the mathematical and physical sciences, to determine the problem of aeroelastic flutter and engineer a solution.

Another analogy I like comes from the realm of software engineering. Programming students are often given the task of sorting lists. Invariably, they produce a sorting program that compares every element of the list to every other element. Those who have studied the proper design of sorting algorithms know that this is inefficient, requiring, at worst, on the order of n times n comparisons (where n is the number of elements in the list). More efficient sorting algorithms exist, such as “merge sort.” Other sorting algorithms can take advantage of qualities of the elements being sorted to be even more efficient. These are often described using big O notation, where O means “on the order of.” Inefficient algorithms, as just described, are usually O(n²) and more efficient ones are O(n log n). Hence, the study of big O notation and algorithms distinguishes the engineering of software from the programming of software. Both an engineer and a programmer may write software that sorts a list, but the engineer has developed a more efficient program by applying the study of mathematics and science to optimize the resources of the computer.
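To make the difference concrete, here is a minimal sketch in Python that counts the comparisons made by a naive compare-everything sort and by merge sort. The function names and the list size are illustrative choices, not anything from the talk, and the exact merge sort count depends on the input.

```python
import random


def naive_sort(items):
    """O(n^2) sort: compare each position against every later element."""
    data = list(items)
    comparisons = 0
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            comparisons += 1
            if data[j] < data[i]:
                data[i], data[j] = data[j], data[i]
    return data, comparisons


def merge_sort(items):
    """O(n log n) sort: split in half, sort each half, merge the halves."""
    data = list(items)
    if len(data) <= 1:
        return data, 0
    mid = len(data) // 2
    left, left_count = merge_sort(data[:mid])
    right, right_count = merge_sort(data[mid:])
    merged = []
    comparisons = left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One side is exhausted; the remainder is already in order.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


if __name__ == "__main__":
    random.seed(0)
    values = [random.randint(0, 10_000) for _ in range(1_000)]
    _, naive_count = naive_sort(values)
    _, merge_count = merge_sort(values)
    print(f"naive sort comparisons: {naive_count}")
    print(f"merge sort comparisons: {merge_count}")
```

For a list of n elements the naive version always makes n(n-1)/2 comparisons, while merge sort stays within roughly n log n, so the gap widens quickly as the list grows.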

Unlike bridge engineering or software engineering, privacy engineering actually encompasses several different domains. In this way, it’s more similar to, say, safety engineering, where the quality being achieved (safety or privacy) can be desired in many distinct applications.

We can think about safety as a desired quality in the aforementioned bridge engineering, but also in industrial machine design, tool safety, food handling equipment, medical equipment, and many other fields. Someone designing safety into a bridge will not use the same skills and resources as someone building safety in food handling, though they will still have the generic application of math and science to solving the problem. Similarly, privacy engineering is an umbrella covering different disciplines. As part of the aforementioned attempt by the IAPP to define privacy engineering, the advisory board identified a non-exhaustive set of domains where you might apply different engineering techniques to build in privacy.

Software Engineering: This is what most people think about when thinking of privacy engineering. Software engineering covers the design and development of software and building in privacy during the collection, processing and sharing of data. The engineering comes into play during considerations of the tradeoffs of efficiency, utility and risks to individuals.

IT Architecture: When you combine multiple IT components (which may be software engineered), you may get emergent properties or exacerbate risks that were negligible in their individual components. IT Architecture includes considerations of protocols and interfaces between components, again optimizing between efficiency, utility and risks.

Data Science: Privacy in data science concerns the risks from data sets, combinations of data, inferences and such without consideration of the specifics of software collecting, processing or sharing that data. Data science for privacy often involves investigating identifiability of data or linkability between data sets.

HCI/UI-UX: It’s not all about the data. Many privacy harms occur during the interactions between individuals and computers. Spam, pop-up ads and other intrusions are a form of invasion into one’s personal space. Manipulative designs may interfere with individuals’ autonomy and decision making. The study of Human Computer Interaction can make use of behavioral economics, psychology and other scientific analysis to minimize these privacy harms.

Business Process Engineering: Business process engineering traditionally covers the design of efficient business processes. Think industrial manufacturing, though it can be applied to more white-collar business processes like HR and Marketing. While efficiency of resources is the primary goal of business processes, they can be designed to improve privacy as well. This is a nascent area of development, but one worth watching.

Physical Engineering: While there are few, if any, self-identified privacy engineers working in physical spaces, privacy has been a design consideration in the physical world for as long as there have been eaves to stand under. Private spaces in homes are the obvious example, but hospitals are also a frequent target for privacy design. But what about engineering? There is increasing opportunity for physical engineering for privacy in regard to IoT devices, both in their design and in the design of spaces to prevent ubiquitous surveillance by those devices.

Systems Engineering: This is my primary area of work. Systems engineering considers the interconnectedness of all of the above: cyber-physical-social systems and the interactions and interfaces between them and individuals. Systems engineering looks at narrow risks, risk from emergent properties, cascading risks between system elements and tradeoffs of privacy with other quality and functional system requirements.

Next Steps for Privacy Engineering

As you can see, privacy engineering covers a wide swath of different domains. Besides the generic desire for more privacy, what cross-discipline language can we seek to define similarities? I think we can draw upon the analogous safety engineering for guidance. What are safety engineers trying to do? They are attempting to reduce risk, to make products or services safer by reducing the likelihood of harmful events and the level of harm when those events occur. Similarly, privacy engineers need to focus on reducing privacy risk, the likelihood of privacy events and the impacts of those events. [Note, if you’re thinking data breaches, you and I need to have a talk. Privacy is not security, but I won’t derail this blog post with a privacy versus security risks deep dive.]

Privacy risk research is still in its early development. Significant research still needs to be done into quantifying privacy risk, creating feedback loops to incorporate real-world data back into risk analysis, measuring the effectiveness of controls on risk and producing ways of determining reasonable levels of risk tolerance. In addition, there needs to be more work to tie controls to risk. Privacy professionals have worked for years at developing various technical and organizational controls meant to improve privacy. However, scant work has been done to tie those controls to actual risk reduction. It’s as if bridge engineers said using high-tensile steel improves bridge safety and had tons of data and science backing up the resilience and breakpoints of the steel. That statement seems logical but lacks measurable risk reduction. While you may measure the tensile strength of the steel, that’s different from measuring the effect on risk of using that steel in a particular design. Similarly, there are many measures, metrics and KPIs for privacy, but what does the ε in differential privacy translate into as to whether a real threat actor is likely to attempt to re-identify the patient in a doctor’s office’s prescription for 325 mg of acetaminophen?

There also needs to be a causal connection between controls and risk reduction. Increasing the tensile strength of the steel in a bridge isn’t going to reduce risk if the weak point is the connection to the earth. Bridging this gap is what the Institute of Operational Privacy Design (IOPD) is working on in our newest standard. The IOPD is attempting to use structured assurance cases, something used in safety engineering, in privacy. The idea is that you state a claim, such as “This privacy risk has been mitigated” then make an argument as to why it should be considered mitigated. Finally, you provide evidence to support this argument. Lawyers should like this, because it’s similar to a legal argument, showing why some evidence supports some legal claim.
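The claim, argument, evidence structure described above can be sketched as a small data model. This is only an illustration of the general idea of a structured assurance case, not the IOPD standard’s format; the class names, fields and the example claim are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """A piece of supporting material, e.g. a test report (hypothetical field names)."""
    description: str
    verified: bool = False


@dataclass
class Claim:
    """A claim, the argument for it, and the evidence or sub-claims backing it."""
    statement: str
    argument: str
    evidence: List[Evidence] = field(default_factory=list)
    subclaims: List["Claim"] = field(default_factory=list)

    def is_supported(self) -> bool:
        # A claim with no argument, or nothing underneath it, is just an assertion.
        if not self.argument.strip():
            return False
        if not self.evidence and not self.subclaims:
            return False
        # Everything beneath the claim has to check out.
        return (all(item.verified for item in self.evidence)
                and all(sub.is_supported() for sub in self.subclaims))


# Illustrative content only; the risk and evidence are made up.
case = Claim(
    statement="This privacy risk has been mitigated",
    argument="The control removes the condition that makes the harmful event likely",
    evidence=[Evidence("Test results showing the control operating as designed", verified=True)],
)
print(case.is_supported())
```

The design choice worth noting is that the argument is a required part of the structure: evidence alone does not support a claim unless something explains why that evidence bears on the risk.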

Moneyball and the lesson for Privacy

If you haven’t read the book Moneyball: The Art of Winning an Unfair Game, or seen the movie starring Brad Pitt and Jonah Hill, I highly encourage you to seek them out. I tend to be a “privacy” subject matter expert, but one of the most important aspects of that expertise is the ability to consume diverse content from other experts and apply it to the world of privacy, whether it’s behavioral economics (from sources like the Freakonomics and Hidden Brain podcasts), how people learn (from the YouTube channel Veritasium), or the use of statistics and probability in baseball (i.e., Moneyball).

I see a huge parallel between Baseball management in the pre-2000 era and privacy management today. For a hundred plus years, Baseball team management was governed by intuition. Managers and scouts thought they knew what made up a good team. Baseball statistics were published for decades, but weren’t used in a way that really optimized team performance. The concept of Sabermetrics (the empirical analysis of in-game activity) began in the 1960s, grew in the 1970s, but then transformed the business of Baseball in the 2000s (as chronicled in the book and movie; also see this podcast). Sabermetrics took the “intuition” out of Baseball and turned it into a science, one based on statistics and probability. Science can’t tell you a particular batter will hit a home run against a particular pitcher, but probability can show that if you make the same choices, game after game, you’re going to end up with results in a certain range.

Similar transformations have hit other industries. I believe the privacy profession is in line for this sort of refactoring. Attempts have been made for decades to create KPIs (Key Performance Indicators) for privacy. One of my earliest introductions was a talk by Tracy Ann Kosa at an IAPP conference about a decade ago. But most privacy program KPIs, in my opinion, still follow the Baseball analogy. Scouts were looking at player stats, but ultimately they were looking at the wrong statistics to determine game outcome. As Jonah Hill says in the movie, “Your goal is not to buy players, it’s to buy wins.” The privacy profession has, for its short life, relied mostly on heuristics, shortcuts that we intuitively sense improve desired outcomes. Those shortcuts (such as GAPP, FIPPs, Principles, etc.) have become goals themselves, and the metrics are tied to those intermediary goals. But, like Baseball, the goal is not the player, the goal is the game. Analogously, a goal in privacy is not transparency, but rather whether a person has the ability to make decisions about things that affect them (with full knowledge and without being overwhelmed). Transparency is a sideshow. Transparency is worthless if it overwhelms or doesn’t support meaningful decision making.

The point of this post is that metrics are important, rigorously important. Privacy needs to step out of the dark ages of intuition, superstition, and old wives’ tales.

The author is principal at Enterprivacy Consulting Group, a boutique consulting firm focused on privacy engineering, privacy by design and the NIST Privacy Framework.

Diagramming Data Transfers

International data transfer is probably one of my least favorite privacy exercises. Why? My main dislike is that it’s not really about privacy, but often more about protectionism. That being said, data transfers are a hot topic these days, both in Europe and in countries like China and Brazil. It wasn’t until recently that I realized that if you look at GDPR from far away, you see there are really four key chapters:

  • Chapter II – Principles
  • Chapter III – Rights of the Data Subject
  • Chapter IV – Controllers and Processors
  • Chapter V – Transfers

Fundamentally, most people, myself included, equate GDPR to principles of data processing, rights afforded the data subject and obligations of controllers and processors, but right up there with those key concepts is a whole chapter on international data transfers. At least in the minds of the GDPR authors, transfers are clearly not an afterthought but one of the four major components of the regulation.

With data transfers clearly an important element of GDPR, it’s important that the analysis of transfers be done with some care. I, for one, am a very visual person. I can analogize concepts visually much more easily than verbally. Astute readers may have noticed the diagrams accompanying my previous post on data transfer regarding the recent draft guidelines put out by the European Data Protection Board. In working through data transfer scenarios, I’ve found it extremely helpful to illustrate or diagram them.

Disambiguating “transfers” and “transmission” of data

Before diving into how to diagram data transfers, it’s important to distinguish the terms transfer and transmit. Transmission of data occurs when data goes from one place to another. If I send you an email with an attachment, I am transmitting data to you. That data is transmitted over multiple servers, through different service providers and maybe through different geographies. A transfer, however, under the GDPR is a legal chain by which a controller or processor transfers data to another controller or processor. In other words, the transmission ≠ the transfer.

Perhaps an example would help. I use DreamHost to host my website. On my website, I host a file (say a pretty infographic). You go to my website and download the file. The file is transmitted from DreamHost (wherever their servers are) to you (wherever you are). However, there is not a legal transfer from DreamHost to you. The transfer was from me to you. I may never have even had the file in my possession. Let’s illustrate that.

Simple transfer/transmission diagram

As you can see, I’ve used dotted lines to illustrate the transmission of data, the 0’s and 1’s flying over the internet from DreamHost to You. But the transfer, the metaphysical or conceptual transfer, from me to you, is illustrated with a solid line. Let’s look at an EU example.

European Union data transfers

EU data transfer diagram

Here, ChairFans, GmbH, a German company, sends a file to Star Analytics, Inc., in the US, to analyze. In this case the transmission is parallel to the transfer. ChairFans is transmitting data to Star Analytics and they are also transferring that data. Remember, the transmission refers to the physical bits flying across the Atlantic Ocean and the transfer refers to the act of one entity giving the other entity the data. If you’re still struggling, you can think of it this way. If a hacker broke into ChairFans and stole the data, the data would still be transmitted over the internet across the Atlantic Ocean, but ChairFans didn’t “transfer” data to the hacker. It was not a deliberate, intentional act of making the data available to the hacker.
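For readers who think in data structures rather than pictures, the same distinction can be sketched in Python. The class and field names below are hypothetical, chosen only to mirror the dotted (transmission) and solid (transfer) lines in the diagrams; the three flows are the website download, the ChairFans export and the hacker scenario described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Transmission:
    """Physical movement of bits from one place to another (the dotted line)."""
    source: str
    destination: str


@dataclass(frozen=True)
class Transfer:
    """Deliberate act of one entity making data available to another (the solid line)."""
    discloser: str
    recipient: str


@dataclass
class DataFlow:
    label: str
    transmissions: List[Transmission]
    transfer: Optional[Transfer]  # None when nobody intentionally disclosed the data

    def is_parallel(self) -> bool:
        """True when the bits travel between the same two parties as the transfer."""
        if self.transfer is None or not self.transmissions:
            return False
        return (self.transmissions[0].source == self.transfer.discloser
                and self.transmissions[-1].destination == self.transfer.recipient)


# The file travels from DreamHost to you, but the transfer is from me to you.
website_download = DataFlow(
    "website download",
    [Transmission("DreamHost", "You")],
    Transfer("Me", "You"),
)

# ChairFans both transmits and transfers the data to Star Analytics.
eu_export = DataFlow(
    "ChairFans export",
    [Transmission("ChairFans", "Star Analytics")],
    Transfer("ChairFans", "Star Analytics"),
)

# The stolen data is transmitted, but ChairFans never transferred it.
hacker_theft = DataFlow(
    "hacker theft",
    [Transmission("ChairFans", "Hacker")],
    None,
)

for flow in (website_download, eu_export, hacker_theft):
    print(flow.label, "| transfer:", flow.transfer is not None,
          "| parallel:", flow.is_parallel())
```

Keeping the transfer as a separate, optional field is the point of the sketch: a flow can have transmissions without any transfer at all, and the two can involve different parties.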

Do we need a GDPR Chapter V transfer tool for this transfer? If there was

Diagramming

To diagram data transfers, I’m using diagrams.net, a free (and privacy friendly) tool to create diagrams. I’ll provide the file for all these diagrams at the end of this blog. I’ve also included a template for the shapes I’m using, which you can use to create your own data transfer diagrams. For the following examples, I’m first going to illustrate the Use Cases for supplementary measures in the EDPB Recommendations.

EDPB Recommendation

Use Case 1: Data storage for backup and other purposes that do not require access to data in the clear

A data exporter uses a hosting service provider in a third country to store personal data, e.g., for backup purposes. Notice I’ve now added the labels Exporter and Importer to the entities.

Illustration 3

Use Case 2: Transfer of pseudonymised Data

A data exporter first pseudonymises data it holds, and then transfers it to a third country for analysis, e.g., for purposes of research. This really isn’t distinguished from the previous example. I’ve added a gear icon to indicate the pseudonymization.

Illustration 4

Use Case 3: Encrypted data merely transiting third countries

A data exporter wishes to transfer data to a destination recognised as offering adequate protection in accordance with Article 45 GDPR. The data is routed via a third country.

Illustration 5

Use Case 4: Protected recipient

A data exporter transfers personal data to a data importer in a third country specifically protected by that country’s law, e.g., for the purpose to jointly provide medical treatment for a patient, or legal services to a client. This is no different than Illustration 3.

Use Case 5: Split or multi-party processing

The data exporter wishes personal data to be processed jointly by two or more independent processors located in different jurisdictions without disclosing the content of the data to them. Prior to transmission, it splits the data in such a way that no part an individual processor receives suffices to reconstruct the personal data in whole or in part. The data exporter receives the result of the processing from each of the processors independently, and merges the pieces received to arrive at the final result which may constitute personal or aggregated data.

Illustration 6

Use Case 6: Transfer to cloud service providers or other processors which require access to data in the clear

A data exporter uses a cloud service provider or other processor to have personal data processed according to its instructions in a third country.

Illustration 7

Use Case 7: Remote access to data for business purposes

A data exporter makes personal data available to entities in a third country to be used for shared business purposes. A typical constellation may consist of a controller or processor established on the territory of a Member State transferring personal data to a controller or processor in a third country belonging to the same group of undertakings, or group of enterprises engaged in a joint economic activity. The data importer may, for example, use the data it receives to provide personnel services for the data exporter for which it needs human resources data, or to communicate with customers of the data exporter who live in the European Union by phone or email. Here I’ve added an IT system to indicate that the Common Enterprise has remote access to that IT system.

Illustration 8

EDPB Guideline on Article 3 and Chapter V

Next up, I’ll tackle the examples from the draft EDPB Guidelines on the Interplay between Article 3 and Chapter V.

Example 1

Maria, living in Italy, inserts her personal data by filling a form on an online clothing website in order to complete her order and receive the dress she bought online at her residence in Rome. The online clothing website is operated by a company established in Singapore with no presence in the EU. In this case, the data subject (Maria) passes her personal data to the Singaporean company, but this does not constitute a transfer of personal data since the data are not passed by an exporter (controller or processor), since they are passed directly and on her own initiative by the data subject herself. Thus, Chapter V does not apply to this case. Nevertheless, the Singaporean company will need to check whether its processing operations are subject to the GDPR pursuant to Article 3(2).

Illustration 10

Example 2

Company X established in Austria, acting as controller, provides personal data of its employees or customers to a company Z established in Chile, which processes these data as processor on behalf of X. In this case, data are provided from a controller which, as regards the processing in question, is subject to the GDPR, to a processor in a third country. Hence, the provision of data will be considered as a transfer of personal data to a third country and therefore Chapter V of the GDPR applies. Note, I’ve added the labels C and P to indicate Controller and Processor.

Illustration 9

Example 3: Processor in the EU sends data back to its controller in a third country

XYZ Inc., a controller without an EU establishment, sends personal data of its employees/customers, all of them non-EU residents, to the processor ABC Ltd. for processing in the EU, on behalf of XYZ. ABC re-transmits the data to XYZ. The processing performed by ABC, the processor, is covered by the GDPR for processor specific obligations pursuant to Article 3(1), since ABC is established in the EU. Since XYZ is a controller in a third country, the disclosure of data from ABC to XYZ is regarded as a transfer of personal data and therefore Chapter V applies.

Illustration 10

Example 4: Processor in the EU sends data to a sub-processor in a third country

Company A established in Germany, acting as controller, has engaged B, a French company, as a processor on its behalf. B wishes to further delegate a part of the processing activities that it is carrying out on behalf of A to sub-processor C, a company established in India, and hence to send the data for this purpose to C. The processing performed by both A and its processor B is carried out in the context of their establishments in the EU and is therefore subject to the GDPR pursuant to its Article 3(1), while the processing by C is carried out in a third country. Hence, the passing of data from processor B to sub-processor C is a transfer to a third country, and Chapter V of the GDPR applies.

Illustration 11

Example 5: Employee of a controller in the EU travels to a third country on a business trip

George, employee of A, a company based in Poland, travels to India for a meeting. During his stay in India, George turns on his computer and accesses remotely personal data on his company’s databases to finish a memo. This remote access of personal data from a third country, does not qualify as a transfer of personal data, since George is not another controller, but an employee, and thus an integral part of the controller (company A). Therefore, the disclosure is carried out within the same controller (A). The processing, including the remote access and the processing activities carried out by George after the access, are performed by the Polish company, i.e. a controller established in the Union subject to Article 3(1) of the GDPR.

Illustration 12

Example 6: A subsidiary (controller) in the EU shares data with its parent company (processor) in a third country

The Irish Company A, which is a subsidiary of the U.S. parent Company B, discloses personal data of its employees to Company B to be stored in a centralized HR database by the parent company in the U.S. In this case the Irish Company A processes (and discloses) the data in its capacity of employer and hence as a controller, while the parent company is a processor. Company A is subject to the GDPR pursuant to Article 3(1) for this processing and Company B is situated in a third country. The disclosure therefore qualifies as a transfer to a third country within the meaning of Chapter V of the GDPR.

Illustration 13

Example 7: Processor in the EU sends data back to its controller in a third country

Company A, a controller without an EU establishment, offers goods and services to the EU market. The French company B, is processing personal data on behalf of company A. B re-transmits the data to A. The processing performed by the processor B is covered by the GDPR for processor specific obligations pursuant to Article 3(1), since it takes place in the context of the activities of its establishment in the EU. The processing performed by A is also covered by the GDPR, since Article 3(2) applies to A. However, since A is in a third country, the disclosure of data from B to A is regarded as a transfer to a third country and therefore Chapter V applies.

Illustration 14
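The reasoning that runs through these examples can be read as three checks applied together: is the discloser a controller or processor subject to the GDPR for the processing, is the data made available to a different controller or processor, and is that recipient in a third country? The Python sketch below encodes that reading against four of the examples. It is a simplification for illustration, not a restatement of the guidelines: the Party fields and the is_chapter_v_transfer function are hypothetical names, and international organisations as recipients are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Party:
    """A participant in a data flow (hypothetical, simplified attributes)."""
    name: str
    is_controller_or_processor: bool
    subject_to_gdpr: bool
    in_third_country: bool


def is_chapter_v_transfer(discloser: Party, recipient: Party) -> bool:
    """Simplified reading of the examples; assumes all three checks must hold."""
    return (
        # 1. The discloser is a controller or processor subject to the GDPR.
        discloser.is_controller_or_processor
        and discloser.subject_to_gdpr
        # 2. The data is made available to another controller or processor.
        and recipient.is_controller_or_processor
        and recipient != discloser
        # 3. That recipient is in a third country.
        and recipient.in_third_country
    )


# Example 1: a data subject passes her own data to a Singaporean company.
maria = Party("Maria", False, False, False)
singapore_co = Party("Singaporean company", True, True, True)

# Example 2: Austrian controller X provides data to Chilean processor Z.
company_x = Party("Company X (Austria)", True, True, False)
company_z = Party("Company Z (Chile)", True, False, True)

# Example 3: EU processor ABC sends data back to third-country controller XYZ.
abc_ltd = Party("ABC Ltd. (EU)", True, True, False)
xyz_inc = Party("XYZ Inc.", True, False, True)

# Example 5: George is part of Polish controller A, so the "recipient" is A itself.
company_a = Party("Company A (Poland)", True, True, False)

print(is_chapter_v_transfer(maria, singapore_co))    # Example 1
print(is_chapter_v_transfer(company_x, company_z))   # Example 2
print(is_chapter_v_transfer(abc_ltd, xyz_inc))       # Example 3
print(is_chapter_v_transfer(company_a, company_a))   # Example 5
```

Note that the recipient’s own GDPR status does not appear in the function: in Example 7 the third-country controller is itself subject to the GDPR under Article 3(2), and the disclosure is still treated as a transfer.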

My comments to the EDPB

Subsequent to the draft guidelines above, I made a comment to the EDPB on two scenarios they should cover. Those scenarios are detailed below.

A data subject contracts with X, GmbH (in Germany) which is a European Union based subsidiary of X, Inc. (in the United States). However, the data subject never actually supplies personal data to X, GmbH as the data subject directly transmits data to X, Inc. in the United States. This is a Chapter V transfer of data requiring a transfer tool. X, GmbH and X, Inc. have standard contractual clauses in place governing the transfer of data. X, GmbH is the exporter and X, Inc. is the importer.

Illustration 15

ABC, GmbH (in Germany) instructs employees to use a service provided by X, Inc., in the United States. Employees’ behavior is tracked via the service provided by X, Inc., thus X, Inc. is subject to GDPR for the data under Article 3.2(b). Because ABC, GmbH is “mak[ing] personal data, subject to this processing, available to…” X, Inc. via instructions to its employees, there is a transfer of data under Chapter V. ABC, GmbH and X, Inc. execute the standard contractual clauses with ABC, GmbH as the exporter and X, Inc. as the importer.

Illustration 16

I posed this scenario on LinkedIn:

Company X, GmbH (DE) hosts data on an Australian data server. They contract with Company Y, Inc. in the United States to process data. Company Y’s employee, working remotely in Australia, accesses the data on the data server. X, GmbH and Y, Inc. execute Standard Contractual Clauses to govern the transfer. There is a legal transfer of data because X, who has putative control over the data in Australia, gave access to Y, who has putative control over its employee in Australia. This is despite the fact that the data never left Australia.

Illustration 17

If you want to explore these scenarios and make some of your own, download this file (be sure to right click and save the file to your desktop). Then go to https://diagrams.net and open the file from there.

Comment on Guidelines 05/2021 on the Interplay between the application of Article 3 and the provisions on international transfer as per Chapter V

Below are my comments on recent EDPB Guidelines which I’m submitting as part of their public consultation.

The guidelines need to provide an example which is a very common scenario:

A consumer data subject (located in the EU) registers with an online service. The service is being offered by a company in a third country, thus placing the company under the territorial scope of GDPR via Art. 3.2(a). However, when registering for the service, the data subject enters into an agreement with a subsidiary of the company located in the EU. For avoidance of doubt, the subsidiary in the EU never possesses personal data of the data subject. It appears, in this scenario, that there is a legal “transfer” from the subsidiary in the EU to the parent company in the third country.

Example I

Illustration of data transmission and a legal data transfer

In the example illustrated above, a data subject contracts with X, GmbH (in Germany) which is a European Union based subsidiary of X, Inc. (in the United States). However, the data subject never actually supplies personal data to X, GmbH as the data subject directly transmits data to X, Inc. in the United States. This is a Chapter V transfer of data requiring a transfer tool. X, GmbH and X, Inc. have standard contractual clauses in place governing the transfer of data. X, GmbH is the exporter and X, Inc. is the importer.

A similar scenario exists when a business in the EU directs its employees to use an app (such as for Human Resource purposes) which is provided by a vendor in a third country and which monitors the behavior of the employees (such as job time tracking), thus subjecting the vendor to GDPR under Art. 3.2(b). Even though the employer never holds the data, this still appears to be a transfer under the guidance (“otherwise makes personal data available”). A clarifying example in the guidelines would be helpful.

Example II

Illustration of a data transmission and legal data transfer

ABC, GmbH (in Germany) instructs employees to use a service provided by X, Inc., in the United States. Employees’ behavior is tracked via the service provided by X, Inc.; thus X, Inc. is subject to GDPR for the data under Article 3.2(b). Because ABC, GmbH is “mak[ing] personal data, subject to this processing, available to…” X, Inc. via instructions to its employees, there is a transfer of data under Chapter V. ABC, GmbH and X, Inc. execute the standard contractual clauses with ABC, GmbH as the exporter and X, Inc. as the importer.

Respectfully submitted,

R. Jason Cronk

Challenge: 90 Trails in 100 Days

Sunrise over Myakka River State Park

In the fall of 2020, I went on a wonderful, adventurous hike down from the rim of the Grand Canyon to Phantom Ranch on the river. The trip was arduous and the climb out was tough given my hailing from the vertically challenged state of Florida. Prior to this undertaking, I had begun training in Florida, including on perhaps Florida’s most vertical trail, the Torreya Challenge in Torreya State Park. After this adventure and the two months of training leading up to it, I became a bit of a couch potato in December and January. About the middle of January, I decided to do something about that. Seeing all the wonderful trails in my area, I set myself a challenge: to hike or bike 90 different trails in 100 days. I gave myself a hundred days to account for weather, work or other impediments to doing a trail each day.

Now you may be asking, what this has to do with privacy. The short answer is not much, but there is always an angle. In using AllTrails, the trail mapping application, I discovered a nifty way to stalk people. See my previous blog post for more.

In February, thanks to Publix Supermarkets, I procured a large amount of trail mix. By the end, despite adding some more in March and April, I was down to one container.


My typical kit consisted of a day pack with water reservoir, bear bell, bear mace, Chapstick, headphones to listen to privacy podcasts, snacks (not pictured: trail mix), and trail maps. Also not pictured are optional sunscreen and insect repellent.

Off I set on January 30th. The task seemed simple but, as with many things, implementation was more fraught than at first imagined. I tried to do longer trails or those farther away when I had more time (like weekends) or during nice weather. One early trail, Pond Loop at Okeeheepkee Prairie County Park, which I completed on a rainy afternoon, was only 0.5 miles. I decided after that to only include trails over 1 mile long. This led me a few times to combine short trails into one “trail” or to stretch a trail to its extreme (exploring every nook and cranny) to try to get that mile in. Trying to define a “trail” also led to some creative interpretations. Not all trails are simply laid out in the platonic idealized state. Some are based on forest service roads; some intersect and loop and figure 8. Some are out and back, retracing your steps. I had to break some long trails, like the 30+ miles of the St. Marks trail, into more manageable chunks of about 16 miles (8 out and 8 back). Some of my trips weren’t trails at all, but I counted them, like when I walked 6 miles home after dropping off a truck at the rental car company. I learned about new trails which weren’t easily found, like the Capital to Coast trail still under construction, which when fully complete will be 120 miles of biking or shared use paths from Tallahassee to the Emerald Coast.

It was perhaps the best time to be out hiking and biking in Florida. Spring weather meant it wasn’t too hot, the parks were green and flowers were in full bloom. I saw so many animals, many that you rarely see as a weekend warrior. In addition to the usual squirrels and lizards, I saw a bobcat, a family of boar, water moccasins and other snakes, a mole, a red pileated woodpecker, a gopher tortoise, giant mosquitoes, ticks, spiders, and many more. I did not, however, see a bear. Not to say they weren’t there, but ever since I encountered a bear last year, I’ve been hiking with a bear bell and bear mace, so they’ve thankfully kept their distance.

What lurks beneath? Creature from the black lagoon? Manatee? Large alligator? Something was moving fast and leaving a wake under the Crooked River in Tate’s Hell State Forest.

As a capstone to my challenge, I returned to Torreya State Park to take on the Torreya Challenge. It was a wondrous, exhausting 4 hours which left me with a terrible headache, but I made it. 90 different trails. 98 days in the making. 478 miles covered. Challenge complete. Level UP!

# Location Trail Miles Type Links
1 Lake Talquin State Forest – Lines Track West Loop 3.8 Hike
2 Elinor Klapp-Phipps Park West Loop 2.9 Hike
3 J.R. Alford Greenway Bluebird Loop 3.9 Hike
4 San Luis Mission Park San Luis Park Loop 1.98 Hike
5 Apalachicola National Forest GF&A Trail 5 Hike

6 Maclay Gardens State Park Shared Trail Loop 5.5 Bike https://www.floridastateparks.org/maclaygardens
7 Lake Talquin State Forest – Lines Track Talquin Loop (Blue) 6 Bike
8 Okeeheepkee Prairie County Park Pond Loop 0.5 Hike  
9 Apalachicola National Forest Munson Hills 8.4 Bike
10 Governors Park Fern Trail 3.4 Hike  
11 Three Rivers State Park Eagle Trail 3 Bike
12 St Marks Trail North Trail 16 Bike
13 Ochlockonee River WMA Old Cemetary Rd 3.9 Hike  
14 Lafayette Heritage Trail Park Lafayette Heritage Trail 6.9 Hike  
15 Wakulla Springs State Park Wakulla Springs Park Trail 10.1 Hike
16 Orchard Pond Orchard Pond Trail 6.9 Bike
17 Silver Lake Recreation Area Silver Lake Habitat Trail 1.4 Hike
18 Timberlane Ravine Nature Preserve Timberlane Ravine Nature Trail 1.5 Hike
19 San Felasco Hammock Preserve State Park Moonshine Creek Trail 1.6 Hike
20 Lakeland Highlands Scrub  Lakeland Highlands Scrub Trail 3.1 Hike  
21 Catfish Creek Preserve State Park Campsite 2 White Trail 5.5 Hike
22 Black Creek Preserve Red Trail 4.9 Hike  
23 Tom Brown Park Magnolia MTB Trail 3 Bike  
24 Central Park Central Park Lake Loop  1.9 Hike  
25 Apalachicola National Forest Camel Lake Loop 8.2 Hike
26 J. R. Alford Greenway Yellow Loop 5.3 Bike  
27 Miccosukee Canopy Road Greenway Miccosukee Greenways Trail 15.5 Bike  
28 Bald Point State Park Loop 3.1 Hike
29 A.J. Henry Park A.J. Henry Park Trails 1.9 Hike
30 Alfred B. Maclay Gardens State Park Bike Loops 5.7 Bike
31 Wakulla State Forest Nemours Trail 1.6 Hike
32 Ochlockonee River WMA Cut Through Lewis Loop 2.5 Hike
33 Kolomoki Mounds State Park Spruce Pine Trail 3.1 Hike
34 Wilson Hospice House Wilson Hospice House Trail 1.3 Hike  
35 Marjorie Turnbull Park Trail 1.6 Hike  
36 Tallahassee Morning Hike from Budget 5.4 Hike  
37 Gil Waters Preserve at Lake Munson Trail 1 Hike
38 Bald Point State Park Sandy Trails 4.2 Bike
39 Elinor Klapp-Phipps Park Redbug Trails 4.6 Bike
40 Tate’s Hell State Forest High Bluff Coastal Loop Trail 9 Bike
41 Ochlockonee State Park Pine Bluff Trail 1.2 Hike
42 St. Joseph Island Loggerhead Trail and Maritime Hammock Nature Trail 21.5 Both
43 Tom Brown Park West Cadillac Trail 3.1 Bike
44 J.R. Alford Greenway Long Leaf Trail 4 Hike  
45 Alfred B. Maclay Gardens State Park Ravine Trail 1.9 Bike
46 Cascades Park Cascades Park Loop 2.5 Hike
47 Apalachicola National Forest Oak Park Bridge Trail 4.8 Hike
48 St. George Island State Park Gap Point Trail 5.5 Hike
49 St. George Island State Park Sugar Hill Beach Old Road 7.3 Both
50 St. Marks Trail Wakulla Springs to St. Marks 16 Bike
51 Myakka River State Park Fox’s Low to Mossy Hammock 3 Hike
52 Myakka River State Park Mossy Hammock to Fox’s Low 9.5 Hike
53 Plant City Dean’s Ride 8.5 Bike  
54 River Rise Preserve State Park River Rise Preserve Trail 2.9 Hike
55 St. Marks Trail North Trail 9.8 Bike  
56 Tom Brown Park Subaru to Tom Brown 3.5 Hike  
57 Apalachicola National Forest Wright Lake White Trail 5.5 Hike
58 St. Marks WMA Plum Orchard to St Marks Via Port Leon 8.2 Hike
59 Capital Circle SE Capital Circle SE Shared Use Path 13 Bike
60 Econfina River State Park Blue and Orange Trail 12.5 Bike
61 Anita Davis Preserve at Lake Henrietta Park Lake Henrietta Trail 1.8 Hike

62 Goose Pond Trail Goose Pond Trail 2.7 Hike  
63 Wakulla State Forest Double Springs Trail with Petrik Spur 4.7 Bike
64 Fred George Greenway and Park Fred George Loop 1.5 Hike  
65 Lake Talquin State Forest Long Leaf Loop 3.9 Hike

66 Guyte P. McCord Park Sculpture Trail 1.2 Hike
67 Ochlockonee Bay Trail Ochlockonee Bay Trail 31.2 Bike  
68 Optimist Park Indian Head Trail 2 Hike
69 Chase Street Park Monticello “Ike Anderson” Bike Trail 4.5 Bike  
70 Apalachicola National Forest – Bradley Bay Wilderness Monkey Creek Trail Head on Florida Scenic Trail 4.9 Hike
71 FSU Bike Path Ocala Rd to Stadium Drive 2.3 Bike  
72 Seminole State Park Gopher Tortoise Nature Trail 1.8 Hike
73 J.R. Alford Greenway Wiregrass and Beggarwood Loop 3.3 Bike  
74 Lake Talquin State Park Lake Talquin State Park Trail 1.6 Hike
75 Capital to Coast Trail St. Marks to 319 24.3 Bike  
76 Constitution Park Dolls Head Trail 2.4 Hike  
77 East Roswell Park Park Loop Trail 2.1 Bike  
78 Chattahoochee-Oconee National Forest Andrews Cove Trail 3.8 Hike
79 Unicoi State Park Bike Trail 3.5 Bike
80 Unicoi State Park Anna Ruby Falls 1 Hike
81 Brinkley Glen Park Brinkley Glen Trail 1 Hike
82 Letchworth-Love Mounds Archaeological State Park Letchworth Mounds Loop 2.7 Hike
83 St. Marks WMA Florida Scenic Trail 5.5 Hike  
84 Econfina State Park River Loop (Red Trail) 3.3 Hike
85 Torreya State Park Torreya Trail 6.6 Hike
86 Lafayette Heritage Trail Park East Cadillac Loop 2.6 Bike
87 Governors Park Fern Trail, Kohl’s Trail and Blairstone Multi-Use Trail 4.4 Hike  
88 Apalachee Regional Park Cross Country Loop 3.3 Hike
89 St. Andrews State Park Road and Pine Flatwoods Trail 5.5 Both
90 Torreya State Park Torreya Challenge 9.3 Hike
      478    

 

Google Calendar Privacy Vulnerability

It’s interesting how events can lead one to find privacy and security vulnerabilities. I’m reminded of the old Connections show, where James Burke would connect seemingly unrelated events in human history and show how one led to another. During my Winter 2021 Strategic Privacy by Design course, the United States did a time shift known as Daylight Saving Time, an anachronism from the days of agriculture where the government thought changing the time twice a year to adjust to changing sunlight would help farmers use time more effectively. As a result of this shift, some students in Europe showed up at the end of a lecture because I had adjusted my clock, but they, obviously being in Europe, had not.

As a result of this timing error, I thought it might be good to create calendar items in Moodle (the LMS I use) for the Spring 2021 Strategic Privacy by Design course. The plan was to export the iCal file and send it to students so they would each be able to insert the important course events in their own calendar. I imported it into my own calendar as well, which, unfortunately, is in Google.

My eagle-eyed assistant instructor, Maria, noticed when she was checking my schedule to send me an invite to a meeting that she could see these items, even though I had set up my calendar to only share Free/Busy information (see below).

After digging around, I finally figured out what was going on. Visibility on each calendar item has the options of private, public or default visibility (meaning to default to the overall calendar’s visibility).

However, these calendar items had a class in the iCal file of public, which overrode my calendar’s default of Free/Busy only.

Those events were imported. I wanted to check, so I had my security intern invite me to three events: one she set to private, one she set to public and one she set to default visibility. As expected, despite my calendar being set to Free/Busy only, the “public” event showed as public.

Your reaction may be, well, this event is public, but two problems persist. 1) It still shows MY interest or possible attendance in this public event, not just whether I’m busy or free; and 2) a sender whose calendar defaults to public may not realize that when they send you an invite to talk, so that meeting is exposed too. I would suggest that my calendar settings should override the imported event’s settings, just to be on the safe side.
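Until the calendar does that for you, one workaround is to rewrite the visibility in the file before importing it. The sketch below assumes the exported file is a standard iCalendar (.ics) document in which each event's visibility is carried by a CLASS line, as the imported events described above suggest. The function name force_private is made up for illustration; it is not part of Moodle or Google Calendar.

```python
import re

def force_private(ics_text: str) -> str:
    """Rewrite every CLASS line in an iCalendar document to CLASS:PRIVATE.

    Only existing CLASS lines are touched; events that carry no CLASS line
    are left as they are. Line endings (\r\n or \n) are preserved.
    """
    return re.sub(r"(?im)^CLASS:[^\r\n]*", "CLASS:PRIVATE", ics_text)

# Hypothetical event exported from an LMS with public visibility.
sample = (
    "BEGIN:VEVENT\r\n"
    "SUMMARY:Lecture 1\r\n"
    "CLASS:PUBLIC\r\n"
    "END:VEVENT\r\n"
)
cleaned = force_private(sample)
```

Whether a given calendar then treats PRIVATE the way you expect is worth checking with a test event, much like the three-event experiment above.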

By the way, if anyone has a suggestion for a privacy friendly online calendar (so I can share my free/busy schedule), I’d appreciate hearing from you. I haven’t found a good alternative yet.

One Way Tables

One of the potential downsides of relational data is the ease with which data can be related in both directions. A simple search of the table below will reveal not only if a known customer has COVID but a list of all COVID positive customers.

Suppose instead you want to be able to determine if a customer has COVID but not easily obtain a list of all customers who have COVID. How might you do that? This can be accomplished using one-way functions (hashes). By hashing the customer’s name (shown as the H(name) function) and replacing the customer field with that hash, we can no longer get a list of customers with COVID. A simple SQL query will get the customer’s COVID status from their name: SELECT Covid FROM TABLE WHERE Customer = hash(customer_name). The inverse, however, won’t get me the list of customers with COVID but a list of hashes: SELECT Customer FROM TABLE WHERE Covid = 'Positive'.
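Here is a minimal Python sketch of the lookup described above, using an in-memory dictionary in place of the SQL table. SHA-256 is my assumption for H (the post does not name a hash function), and the customer names are made up.

```python
import hashlib

def h(name: str) -> str:
    """One-way function H(name): SHA-256 hex digest of the UTF-8 name."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()

# Hypothetical table: the Customer column stores H(name), never the name.
covid_table = {
    h("Alice"): "Positive",
    h("Bob"): "Negative",
    h("Carol"): "Positive",
}

def covid_status(customer_name: str):
    """Forward lookup, like: SELECT Covid ... WHERE Customer = H(name)."""
    return covid_table.get(h(customer_name))

def positive_customers():
    """Inverse lookup: returns hashes only, not names."""
    return [cust for cust, status in covid_table.items() if status == "Positive"]
```

Calling covid_status("Alice") returns the status, while positive_customers() hands back only opaque 64-character digests.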

Now hashes are one-way functions, meaning it’s very difficult to determine x given H(x). However, because a customer’s name can take relatively few values, it’s easy to compute what’s called a rainbow table, that is, a table of all possible hash values. We can then look up the hash in the rainbow table and obtain the customer’s name for those customers that have COVID. The table below assumes we know all the names of customers. An attacker may not, but names are fairly common, and even a rainbow table built from six thousand different common first names is fairly trivial to compute.
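To see how little work that attack takes, here is a sketch of the precomputed lookup the post calls a rainbow table (strictly speaking, a plain hash-to-name dictionary). It again assumes SHA-256 for H, and the candidate name list is invented.

```python
import hashlib

def h(name: str) -> str:
    """Unsalted one-way function H(name), assumed here to be SHA-256."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()

def build_lookup(candidate_names):
    """Precompute hash -> name for every name the attacker can guess."""
    return {h(name): name for name in candidate_names}

def reverse_hashes(leaked_hashes, lookup):
    """Map each leaked hash back to a name, or None if it was not guessed."""
    return [lookup.get(value) for value in leaked_hashes]

# The attacker guesses common first names; "Zed" is not on the list.
lookup = build_lookup(["Alice", "Bob", "Carol"])
recovered = reverse_hashes([h("Alice"), h("Zed")], lookup)
```

With a list of six thousand common names, the dictionary is built in well under a second on ordinary hardware.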

We address this with a concept called salting, which is adding a random value to the Name before hashing it, i.e. Customer = H(Name+Salt). One problem, though: we can’t use the same salt for every customer, or an attacker could precompute a rainbow table by just adding the salt to each name before calculating the hash. We also need the salt BEFORE we do the hash. Now customers can’t be expected to remember the salt before looking it up, but we can give them an account number, which we can relate to the salt. A customer wanting to look up their COVID status enters their account number and name. The system retrieves the salt using the account number and hashes the name with it to determine the customer and look up their COVID status. [Note, salts should be incremented with each access to prevent replay attacks but that’s beyond the scope of this blog].
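The account-number scheme above can be sketched as two tables, here two Python dictionaries. The names enroll and lookup_status are my own, SHA-256 stands in for H, and this sketch gives each customer a unique account number rather than allowing collisions.

```python
import hashlib
import secrets

accounts = {}     # account number -> salt
covid_table = {}  # H(name + salt) -> COVID status

def h(name: str, salt: str) -> str:
    """Salted one-way function H(name + salt), assumed to be SHA-256."""
    return hashlib.sha256((name + salt).encode("utf-8")).hexdigest()

def enroll(name: str, status: str) -> str:
    """Store a status under H(name + salt) and return the new account number."""
    account = secrets.token_hex(4)
    while account in accounts:          # keep account numbers unique in this sketch
        account = secrets.token_hex(4)
    salt = secrets.token_hex(16)        # fresh random salt per account
    accounts[account] = salt
    covid_table[h(name, salt)] = status
    return account

def lookup_status(account: str, name: str):
    """Retrieve the salt by account number, then look up H(name + salt)."""
    salt = accounts.get(account)
    if salt is None:
        return None
    return covid_table.get(h(name, salt))
```

Nothing in either table links an account row to a COVID row unless you also know the customer's name.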

Light arrows show the link between rows, only discoverable with knowledge of the customer’s name.

Now an attacker wanting to identify all known COVID cases would need to hash all the possible customer names against all the possible salts. Padding the salt table with hundreds of thousands of random account numbers could increase the number of computations necessary for the attacker. In fact, you could prefill the account table with a million rows and a million salts and randomly assign account numbers to customers. Even if two customers got the same account number, they aren’t going to get the same Customer value unless they have the same name, H(name+salt). This also benefits us because if we have two people with the same name, they could be given different account numbers and distinct COVID statuses.

In order to create a rainbow table now, an attacker must hash six thousand names with a million account numbers, on the order of 10⁹ rows (6 × 10⁹, to be exact). Not impossible, but starting to slow them down.

I’m going to make it even a little bit harder. Right now, we can still use the COVID table to identify how many people are COVID positive or negative. This is especially problematic if each and every customer has the same status. We can do a little encryption and steganography to hide the information. First, for each record we can generate a random 16 byte (128 bit) value (say 787683b51cf3b49886fce82dc34a51a2 in hexadecimal). If we need to encode a negative result, we ensure that value has a least significant bit of 0. If we need to encode a positive result, we ensure that value has a least significant bit of 1 (i.e. 787683b51cf3b49886fce82dc34a51a3). Note the change in the last digit from 2 to 3, changing the least significant bit and the COVID status. That value is then encrypted using a symmetric cipher (such as AES) without padding or authentication to verify the accuracy of the data. We can use part of the salt as the encryption key (bolded in the table below).

Now, an attacker can’t search the COVID table to identify how many customers have COVID. Additionally, because we used an encryption algorithm without any validation, any of the values in the COVID column will decrypt with any of the encryption keys in the account table, just with erroneous least significant bits, thus not disclosing anyone’s COVID status at all. If we used a decryption algorithm with padding or authentication, it would provide a quicker backdoor for connecting records in the account table with records in the COVID table.
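The least-significant-bit encoding can be sketched on its own. The AES step is left out because Python's standard library has no AES implementation, so this only shows how a status hides in an otherwise random value. The 16-byte size matches the 32-hex-digit example value, and the function names are mine.

```python
import secrets

def encode_status(positive: bool, size: int = 16) -> bytes:
    """Return `size` random bytes whose least significant bit encodes the status."""
    raw = bytearray(secrets.token_bytes(size))
    raw[-1] = (raw[-1] & 0xFE) | int(positive)  # clear the last bit, then set it
    return bytes(raw)

def decode_status(value: bytes) -> bool:
    """Read the status back from the least significant bit."""
    return bool(value[-1] & 1)

# The two example values from the text differ only in their final bit.
negative_example = bytes.fromhex("787683b51cf3b49886fce82dc34a51a2")
positive_example = bytes.fromhex("787683b51cf3b49886fce82dc34a51a3")
```

In the full scheme the encoded value would be encrypted before storage, so the bit only becomes readable after decrypting with the right key.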

One more step to slow a potential attacker down is to iterate the hashing, in other words compute H( H( H( H( …. H(name+salt))))), say 100,000 times. While slowing down lookups slightly, it significantly increases the computation of a rainbow table for an attacker, who now must compute 100,000 × 10⁹ hashes. Increasing the number of values in the account table, increasing the potential values for name (such as using given name and surname), and increasing the number of iterations will all serve to slow an attacker down.
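A sketch of the iteration and of the attacker's bill, again assuming SHA-256 for H. The function names are illustrative. In practice one would more likely reach for a purpose-built key derivation function such as hashlib.pbkdf2_hmac, which does this kind of stretching, rather than a hand-rolled loop.

```python
import hashlib

def iterated_hash(name: str, salt: str, iterations: int = 100_000) -> str:
    """Compute H(H(...H(name + salt)...)), applying SHA-256 `iterations` times."""
    digest = (name + salt).encode("utf-8")
    for _ in range(iterations):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

def attacker_hashes(names: int, accounts: int, iterations: int) -> int:
    """Hash computations needed to try every name against every salt."""
    return names * accounts * iterations

# Six thousand names, a million salts, 100,000 iterations each.
total = attacker_hashes(6_000, 1_000_000, 100_000)
```

A single legitimate lookup costs 100,000 hashes, a fraction of a second, while the exhaustive table costs 6 × 10¹⁴.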

The concept described above is sometimes called a translucent database, such that someone who has full access to the data (a database engineer) still can’t interpret it, but with a little extra knowledge (account number and name), a customer can still retrieve their COVID status.

Cookie pollution

Just moments ago I visited lightpollutionmap.info to look for somewhere to go camping this weekend where there was as little light pollution as possible given a reasonable drive from where I’m at. I was immediately presented with a pretty comprehensive cookie banner. The FIRST thing I noticed was that I couldn’t unselect unnecessary cookies, which included over 75 “statistics” cookies.

Most of those “statistics” cookies, of course, are really third party marketing trackers. The only options presented were “Allow selection” and “Allow all cookies,” which were functionally identical because you couldn’t unselect any of the types of cookies. I noticed the button for only selecting necessary cookies was still there but invisible. I clicked it, only to be presented with a blank screen. See the video below showing the cookie selection options disappearing. I also noticed the ad settings option to accept or reject vendors was disabled.

VIDEO of options being hidden

Breaking down “Personal Data”

I’ve railed for years against the use of PII (Personally Identifiable Information) as unhelpful in the privacy sphere. This term is used in some US legislation and has unfortunately made its way into the vernacular of the cyber-security industry and privacy professionals. Use of the term PII is necessarily limiting and does not allow organizations to see the breadth of privacy issues that may accompany non-identifying personal data. This post is meant to shed light on the nuances in different types of data. While I’ll reference definitions found in the GDPR, this post is not meant to be legislation specific.

Personal Data versus Non-personal Data

The GDPR defines Personal Data as “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.” The key term in this definition is the phrase “relating to.” This broadly refers to any data or information that has anything to do with a particular person, regardless of whether that data helps identify the person or that person is known. This contrasts with non-personal data, which has no relationship to an individual.

Personal Data: “John Smith’s eyes are blue.”

In this phrase, there are three pieces of personal data. The first is the name John which is a first name related to an individual, John Smith. The second is his last name. Finally, the third is blue eyes, which also relates to John Smith.

Anonymous Data: “People’s eyes are blue.”

No personal data is indicated in the above sentence as the data doesn’t relate to an individual, identified or identifiable. It relates to people in general.

Identified Data versus Pseudonymous Data

Much consternation has been exhibited over the concept of pseudonymized data. The GDPR provides a definition of pseudonymized: “means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.” The key phrase in this definition is that data can no longer be attributed to an individual without additional data. Let me break this down.

Identified Data: “John Smith’s eyes are blue.”
The same phrase we used in our example for Personal Data is identified because the individual, John Smith, is clearly identified in the statement.

Pseudonymous (Identifiable) Data: “User X’s eyes are blue.”
Here we have processed the individual’s name and replaced it with User X. In other words, it’s been pseudonymized. However, it is still identifiable. From the definition above, Personal Data is data relating to an identified or identifiable individual. Blue eyes are still related to an identifiable individual, User X (aka John Smith). We just don’t know who he is at the moment. Potentially we can combine information that links User X to John Smith. Where some people struggle is understanding there must be some form of separation between the use of the User X pseudonym and User X’s underlying identity. Store both in one table without any access controls and you’ve essentially pierced the veil of pseudonymity. WARNING: Here is where it can get tricky. Blue eyes are potentially identifying. If John Smith is the only user with blue eyes, it makes it much easier to identify User X as John Smith. This is a huge pitfall, as most attributable data is potentially re-identifying when combined with some other data.
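The separation described above can be sketched as a function that returns two things: the pseudonymized records, and a key table that must be stored elsewhere under its own access controls. The record layout, the "user-" prefix and the function name are all illustrative assumptions.

```python
import secrets

def pseudonymize(records, id_field="name"):
    """Replace the identifying field with a random pseudonym.

    Returns (pseudonymized_records, key_table). The key table maps each
    pseudonym back to the real identity and is the "additional information"
    that has to be kept separately.
    """
    key_table = {}      # pseudonym -> identity (store apart from the data)
    assigned = {}       # identity -> pseudonym (so repeats map consistently)
    output = []
    for record in records:
        identity = record[id_field]
        if identity not in assigned:
            pseudonym = "user-" + secrets.token_hex(8)
            assigned[identity] = pseudonym
            key_table[pseudonym] = identity
        replaced = dict(record)
        replaced[id_field] = assigned[identity]
        output.append(replaced)
    return output, key_table

data, keys = pseudonymize([{"name": "John Smith", "eyes": "blue"}])
```

Note that the attribute "blue" survives untouched in data, which is exactly the re-identification pitfall flagged in the warning.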

Identifying Data versus Attributable Data

In looking at the phrase “John Smith’s eyes are blue” we can distinguish between identifying data and attributable data.

Identifying Data: “John Smith”
Without going into the debate over the number of John Smiths in the world, we can consider a person’s name as fairly identifying. While John Smith isn’t necessarily uniquely identifying, a type of data, a name, can be uniquely identifying.

Attributable Data: “blue eyes”
Blue eyes is an attribution. It can be attributable to a person, in the case of our phrase “John Smith’s eyes are blue.” It can be attributable to a pseudonym: “User X’s eyes are blue.” As we’ll see below, it can also be attributed anonymously.

Anonymous Data versus Anonymized Data

GDPR doesn’t define anonymous data but in Recital 26 it says “anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” In the first example, I distinguished Personal Data from Anonymous Data, which didn’t relate to a specific individual. Now we need to consider the scenario where we have clearly Personal Data which we anonymize (or render anonymous in such a manner that the data subject is not or no longer identifiable).

Anonymous Data: “People’s eyes are blue.”
For this statement, we were never talking about a specific individual; we’re making a generalized statement about people and an attribute shared by people.

Anonymized Data: “User’s eyes are blue.”
For this statement, we took Identified Data (“John Smith’s eyes are blue”) and processed it in a way that is potentially anonymous. We’ve now returned to the conundrum presented with Pseudonymous data. Specifically, if John Smith is the only user with blue eyes, then this is NOT anonymous. Even if John Smith is one of a handful of users with blue eyes, the degree of anonymity is fairly low. This is the concept of k-anonymity, whereby a particular individual is indistinguishable from k-1 other individuals in the data set. However, even this may not give sufficient anonymization guarantees. Consider a medical dataset of names, ethnicities and heart condition. A hospital releases an anonymized list of heart conditions (3 people with heart failure, 2 without). Someone with outside knowledge (that those of Japanese descent rarely have heart failure and the names of patients) could make a fairly accurate guess as to which patients had heart failure and which did not. This revelation brought about the concept of l-diversity in anonymized data. The point here is that unlike Anonymous Data, which never related to a specific individual, Anonymized Data (and Pseudonymized Data) should be carefully examined for potential re-identification. Anonymizing data is a potential minefield.
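Both measures mentioned above are easy to compute for a small dataset. In this rough sketch, k is the size of the smallest group of rows sharing the same quasi-identifier values, and l is the smallest number of distinct sensitive values found in any such group (the simplest "distinct" form of l-diversity). The column names and rows are invented for illustration.

```python
from collections import Counter, defaultdict

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifiers."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values(), default=0)

def l_diversity(rows, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values seen within any quasi-identifier group."""
    values = defaultdict(set)
    for row in rows:
        values[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return min((len(v) for v in values.values()), default=0)

# Hypothetical released rows: eye color is the quasi-identifier.
rows = [
    {"eyes": "blue", "condition": "heart failure"},
    {"eyes": "blue", "condition": "none"},
    {"eyes": "brown", "condition": "heart failure"},
    {"eyes": "brown", "condition": "heart failure"},
]
k = k_anonymity(rows, ["eyes"])
l = l_diversity(rows, ["eyes"], "condition")
```

Here k is 2 but l is 1: anyone known to have brown eyes is revealed to have heart failure even though they hide among two rows.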

If you need help navigating this minefield, please feel free to reach out to me at Enterprivacy Consulting Group.

Bots, privacy and suicide

I had the pleasure of serving last week on a panel at the Privacy and Security Forum with privacy consultant extraordinaire Elena Elkina and renowned privacy lawyer Mike Hintze. The topic of the panel was Good Bots and Bad Bots: Privacy and Security in the Age of AI and Machine Learning. Serendipitously, on the plane to D.C. earlier that morning someone had left a copy of the October issue of Wired Magazine, the cover of which displayed a dark and grim image of Ryan Gosling, Harrison Ford, Denis Villeneuve, and Ridley Scott from the new dystopian film, Blade Runner 2049. Not only was this a great intro to the idea of bots (in the movie’s case, human-like androids) but the magazine contained two articles pertinent to our panel discussion: “Q: In, say, a customer service chat window, what’s the polite way to ask whether I’m talking to a human or a robot?” and “Stop the chitchat: Bots don’t need to sound like us.” Our panel dove into the ethics and legality of deception, in, say, a customer service bot pretending to be human.

While the idea is fresh in my mind, I want to take a moment to replay some of the concepts we touched upon for a wider audience and talk about the case study we used in more detail than the forum allowed. First off, what did we mean by bots? I don’t claim this is a definitive definition but we took the term, in this context, to mean two things:

  • Some form of human-like interface. This doesn’t mean they have to have the realism of Replicants in Blade Runner, but some mannerisms by which a person might mistake the bot for another person. This goes back to the days, as Elena pointed out, of Alan Turing and his Turing test, years before any computer could even think about passing. (“I see what you did there.”) The human-like interface potentially has an interesting property: are people more likely to let their guard down and share sensitive information if they think they are talking to another person? I don’t know the answer to that, and there may be some academic research on that point. If there isn’t, I submit that it would make for some interesting research.
  • The second is the ability to learn and be situationally aware. Again, this doesn’t require the super sophistication of IBM’s Watson but any ability to adapt to changing inputs from the person with whom it is interacting. This is key, like the above, to giving the illusion a person is interacting with another person. By counter example, Tinder is littered with “bots” that recite scripts with limited, if any, ability to respond to interaction.

Taxonomy of Risk

Now that we have a definition, what are some of the heightened risks associated with these unique characteristics of a bot that, say, a website doesn’t have? I use Dan Solove’s Taxonomy of Privacy as my go-to risk framework. Under the taxonomy I see 5 heightened risks:

  1. Interrogation (questioning or probing of personal information): In order to be situationally aware, to “learn” more, a bot may ask questions of someone. Those questions could go too far. While humans have developed social filters, which allow us to withhold inappropriate questions, a bot lacking a moral or social compass could ask questions which make the person uncomfortable or are invasive. My classic example of interrogation is an interview where the interviewer asks the candidate if they are pregnant or planning to become pregnant. Totally inappropriate in a job interview. One could imagine a front-line recruitment bot smart enough to know that pregnancy may impact immediate job attendance of a new hire but not smart enough to know that it’s inappropriate to ask that question (and certainly illegal in the U.S. to use pregnancy as a discriminatory criterion in hiring).
  2. Aggregation (combining of various pieces of personal information): Just as not all questions are interrogations, not all aggregation of data creates a privacy issue. The issue arises when data is combined in new and unexpected ways, resulting in information disclosure that the individual didn’t want to make. Anyone could reasonably assume Target is aggregating sales data to stock merchandise and make broad decisions about marketing, but the ability to discern the pregnancy of a teenager from non-baby related purchases was unexpected, and uninvited. For a pizza ordering bot, consider the difference between knowing my last order was a vegetable pizza and discerning that I’m a vegetarian (something I didn’t disclose) because when I order for one it’s always vegetable but if I order for more than one, it includes meat dishes.
  3. Identification (linking of information to a particular individual): There may be perfectly legitimate reasons a bot would need to identify a person (to access that person’s bank account for instance) but identification as an issue comes into play when it’s the perception of the individual that they would remain anonymous or at the very least pseudonymous. If I’m interacting with a bot as StarLord1999 and all of a sudden it calls me by the name Jason, I’m going to be quite perturbed.
  4. Exclusion (failing to let an individual know about the information that others have about her and participate in its handling or use): As with aggregation, a situationally aware bot pulling information from various sources may alter its interaction in a way that excludes the individual from some service, without the individual understanding why and based on data the individual doesn’t know the bot has. For instance, imagine a mortgage loan bot that pulls demographic information based on a user’s current address and steers them towards less favorable loan products. That practice sounds a lot like redlining and, if it has discriminatory effects, could be illegal in the U.S.
  5. Decisional Interference (intruding into an individual’s decision making regarding her private affairs): The classic example I use for decisional interference is China’s historic one-child policy, which interfered with a family’s decision making about its own make-up, namely how many children to have. So you ask, how can a bot have the same effect? Note the law is only influential, albeit in a very strong way. A family can still physically have multiple children, hide those children or take other steps to disobey the law, but the law is still going to have a manipulative effect on the decision making. A bot, because of its human interface, advanced learning and situational knowledge, can be used to psychologically manipulate people. If the bot knows someone is psychologically prone to a particular type of argument style (say, appealing to emotion), it can use that and the information at its disposal to subtly persuade them towards a certain decision. This is a form of decisional interference.
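
To make the aggregation item concrete, here is a minimal Python sketch of the pizza bot example. Everything in it is hypothetical and for illustration only: the Order type, the MEAT_ITEMS labels, and the min_solo_orders threshold are my assumptions, not part of any real bot platform. It contrasts simply recalling the last order with combining the whole order history to infer something the customer never disclosed.

```python
from dataclasses import dataclass

# Hypothetical menu labels, assumed for illustration only.
MEAT_ITEMS = {"pepperoni", "sausage", "ham"}


@dataclass
class Order:
    party_size: int   # how many people the order was for
    items: list       # item names as plain strings


def contains_meat(order):
    """True if any item in the order is on the assumed meat list."""
    return any(item in MEAT_ITEMS for item in order.items)


def last_order(orders):
    """Expected use: recall what the customer ordered most recently."""
    return orders[-1]


def infer_vegetarian(orders, min_solo_orders=3):
    """Unexpected use: aggregate history to guess an undisclosed trait.

    The guess is 'vegetarian' when every order placed for one person is
    meat free, given at least min_solo_orders such orders (an arbitrary
    threshold chosen for this sketch).
    """
    solo = [o for o in orders if o.party_size == 1]
    if len(solo) < min_solo_orders:
        return False
    return not any(contains_meat(o) for o in solo)


history = [
    Order(1, ["vegetable"]),
    Order(4, ["pepperoni", "vegetable"]),
    Order(1, ["vegetable"]),
    Order(1, ["vegetable"]),
]
```

With this history, last_order only tells the bot that the most recent pizza was vegetable, which I would expect it to know. infer_vegetarian draws a conclusion about me from the pattern across orders, which is the kind of aggregation I did not invite.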

Architecture and Policy

I’m not going to go into a detailed analysis of how to mitigate these issues, but I’ll touch on two thoughts: first, architectural design and second, public policy analysis. Privacy-friendly architecture can be analyzed along two axes, identifiability and centralization. The more identified and more centralized the design, the less privacy friendly it is. It should be obvious that reducing identifiability reduces the risk of identification and aggregation (because you can’t aggregate external personal data from unidentified individuals), so I’ll focus here on centralization. Most people would mistakenly think of bots as necessarily being run by a centralized server, but this is far from the case. The Replicants in Blade Runner and “autonomous” cars are both prominent examples of bots which are decentralized. In fact, it should be glaringly apparent that a self-driving car being operated by a server in some warehouse introduces unnecessary safety risks. The latency of the communication, the potential for command injections at the server or network layer, and the potential for service interruption are unacceptable. The car must be able to make decisions immediately, without delay or risk of failure. Now, decentralization doesn’t help with many of the bot-specific issues outlined above, but it does help with other more generic privacy issues, such as insecurity, secondary use and others.
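
As a rough illustration of the two axes, the following Python sketch places a bot design on an identifiability scale and a centralization scale and compares designs. The level names, the three-step scales, and the additive privacy_risk score are all assumptions I am making for the sake of the example. They are not a standard metric, only one simple way to express "more identified and more centralized means less privacy friendly."

```python
from dataclasses import dataclass
from enum import IntEnum


class Identifiability(IntEnum):
    # Assumed three-step scale; higher means more identified.
    ANONYMOUS = 0
    PSEUDONYMOUS = 1
    IDENTIFIED = 2


class Centralization(IntEnum):
    # Assumed three-step scale; higher means more centralized.
    ON_DEVICE = 0
    FEDERATED = 1
    CENTRAL_SERVER = 2


@dataclass(frozen=True)
class BotDesign:
    name: str
    identifiability: Identifiability
    centralization: Centralization


def privacy_risk(design):
    """Illustrative score only: the sum of the two axis positions."""
    return int(design.identifiability) + int(design.centralization)


# Hypothetical designs for comparison.
car = BotDesign("self-driving car", Identifiability.PSEUDONYMOUS,
                Centralization.ON_DEVICE)
chat = BotDesign("hosted chat bot", Identifiability.IDENTIFIED,
                 Centralization.CENTRAL_SERVER)

# Rank the designs from most to least privacy friendly.
ranked = sorted([chat, car], key=privacy_risk)
```

A real analysis would weigh the axes against the specific privacy issues at stake rather than adding them, but even this crude ordering makes the design trade-off visible early.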

Public policy analysis is something I wanted to introduce with my case study during the interactive portion of the session at the Privacy and Security Forum. The case study I presented was as follows:

Kik is a popular platform for developing bots (https://bots.kik.com/#/). Kik is a mobile chat application used by 300 million people worldwide, and an estimated 40% of US teens have used the application at one time or another. The National Suicide Prevention Hotline, recognizing that most teens don’t use telephones, wants to interact with them in the services they use. The Hotline wants to create a bot to interact with those teens and suggest helpful resources. Where the bot recognizes a significant risk of suicide rather than just casual inquiries or people trolling the service, the interactions will first be monitored by a human who can then intervene in place of the bot, if necessary.

I’ll highlight one issue, decisional interference, to show why it’s not a black and white analysis. Here, one of the objectives of the service and the bot is to prevent suicide. As a matter of public policy, we’ve decided that suicide is a bad outcome and we want to help people who are depressed and potentially suicidal get the help they need. We want to interfere with this decision. Our bot must be carefully designed to promote this outcome. We don’t want the bot to develop in a way that doesn’t reflect this. You could imagine a sophisticated enough bot going awry and actually encouraging callers to commit suicide. The point is, we’ve done that public policy analysis and determined what the socially acceptable outcome is. Many times organizations have not thought through what decisions might be manipulated by the software they create and what public policy should guide the way they influence those decisions. Technology is not neutral. Whether it’s decisional interference or exclusion or any of the other numerous privacy issues, thoughtful analysis must precede design decisions.