Challenge: 90 Trails in 100 Days

Sunrise over Myakka River State Park

In the fall of 2020, I went on a wonderful, adventurous hike down from the rim of the Grand Canyon to Phantom Ranch on the river. The trip was arduous and the climb out was tough, given that I hail from the vertically challenged state of Florida. Prior to this undertaking, I had begun training in Florida, including on perhaps Florida’s most vertical trail, the Torreya Challenge in Torreya State Park. After this adventure and the two months of training leading up to it, I became a bit of a couch potato in December and January. About the middle of January, I decided to do something about that. Seeing all the wonderful trails in my area, I set myself a challenge: to hike or bike 90 different trails in 100 days. I gave myself a hundred days to account for weather, work or other impediments to doing a trail each day.

Now you may be asking what this has to do with privacy. The short answer is not much, but there is always an angle. In using AllTrails, the trail mapping application, I discovered a nifty way to stalk people. See my previous blog post for more.

In February, thanks to Publix Supermarkets, I procured a large amount of trail mix. By the end, despite adding some more in March and April, I was down to one container.


My typical kit consisted of a day pack with water reservoir, bear bell, bear mace, Chapstick, headphones to listen to privacy podcasts, snacks (not pictured: trail mix), and trail maps. Also not pictured are optional sunscreen and insect repellent.

Off I set on January 30th. The task seemed simple but, as with many things, implementation was more fraught than at first imagined. I tried to do longer trails or those farther away when I had more time (like weekends) or during nice weather. One early trail, Pond Loop at Okeeheepkee Prairie County Park, which I completed on a rainy afternoon, was only 0.5 miles. I decided after that to only include trails over 1 mile long. This led me a few times to combine short trails into one “trail” or stretch a trail to its extreme (exploring every nook and cranny) to try to get that mile in. Trying to define a “trail” also led to some creative interpretations. Not all trails are simply laid out in the platonic idealized state. Some are based on forest service roads; some intersect, loop and figure-eight. Some are out and back, retracing your steps. I had to break some long trails, like the 30+ miles of the St. Marks Trail, into more manageable chunks of about 16 miles (8 out and 8 back). Some of my trips weren’t trails at all, but I counted them, like when I walked 6 miles home after dropping off a truck at the rental car company. I learned about new trails that weren’t easily found, like the Capital to Coast trail, still under construction, which when fully complete will be 120 miles of biking or shared use paths from Tallahassee to the Emerald Coast.

It was perhaps the best time to be out hiking and biking in Florida. Spring weather meant it wasn’t too hot, the parks were green and flowers were in full bloom. I saw so many animals, many that you rarely see as a weekend warrior. In addition to the usual squirrels and lizards, I saw a bobcat, a family of boar, water moccasins and other snakes, a mole, a red pileated woodpecker, a gopher tortoise, giant mosquitoes, ticks, spiders, and many more. I did not, however, see a bear. Not to say they weren’t there, but ever since I encountered a bear last year, I’ve been hiking with a bear bell and bear mace, so they’ve thankfully kept their distance.

What lurks beneath? Creature from the black lagoon? Manatee? Large alligator? Something was moving fast and leaving a wake under the Crooked River in Tate’s Hell State Forest.

As a capstone to my challenge, I returned to Torreya State Park to take on the Torreya Challenge. It was a wondrous, exhausting 4 hours that left me with a terrible headache, but I made it. 90 different trails. 98 days in the making. 478 miles covered. Challenge complete. Level UP!

# | Location | Trail | Miles | Type | Links
1 | Lake Talquin State Forest – Lines Track | West Loop | 3.8 | Hike
2 | Elinor Klapp-Phipps Park | West Loop | 2.9 | Hike
3 | J.R. Alford Greenway | Bluebird Loop | 3.9 | Hike
4 | San Luis Mission Park | San Luis Park Loop | 1.98 | Hike
5 | Apalachicola National Forest | GF&A Trail | 5 | Hike
6 | Maclay Gardens State Park | Shared Trail Loop | 5.5 | Bike | https://www.floridastateparks.org/maclaygardens
7 | Lake Talquin State Forest – Lines Track | Talquin Loop (Blue) | 6 | Bike
8 | Okeeheepkee Prairie County Park | Pond Loop | 0.5 | Hike
9 | Apalachicola National Forest | Munson Hills | 8.4 | Bike
10 | Governors Park | Fern Trail | 3.4 | Hike
11 | Three Rivers State Park | Eagle Trail | 3 | Bike
12 | St Marks Trail | North Trail | 16 | Bike
13 | Ochlockonee River WMA | Old Cemetery Rd | 3.9 | Hike
14 | Lafayette Heritage Trail Park | Lafayette Heritage Trail | 6.9 | Hike
15 | Wakulla Springs State Park | Wakulla Springs Park Trail | 10.1 | Hike
16 | Orchard Pond | Orchard Pond Trail | 6.9 | Bike
17 | Silver Lake Recreation Area | Silver Lake Habitat Trail | 1.4 | Hike
18 | Timberlane Ravine Nature Preserve | Timberlane Ravine Nature Trail | 1.5 | Hike
19 | San Felasco Hammock Preserve State Park | Moonshine Creek Trail | 1.6 | Hike
20 | Lakeland Highlands Scrub | Lakeland Highlands Scrub Trail | 3.1 | Hike
21 | Catfish Creek Preserve State Park | Campsite 2 White Trail | 5.5 | Hike
22 | Black Creek Preserve | Red Trail | 4.9 | Hike
23 | Tom Brown Park | Magnolia MTB Trail | 3 | Bike
24 | Central Park | Central Park Lake Loop | 1.9 | Hike
25 | Apalachicola National Forest | Camel Lake Loop | 8.2 | Hike
26 | J.R. Alford Greenway | Yellow Loop | 5.3 | Bike
27 | Miccosukee Canopy Road Greenway | Miccosukee Greenways Trail | 15.5 | Bike
28 | Bald Point State Park | Loop | 3.1 | Hike
29 | A.J. Henry Park | A.J. Henry Park Trails | 1.9 | Hike
30 | Alfred B. Maclay Gardens State Park | Bike Loops | 5.7 | Bike
31 | Wakulla State Forest | Nemours Trail | 1.6 | Hike
32 | Ochlockonee River WMA | Cut Through Lewis Loop | 2.5 | Hike
33 | Kolomoki Mounds State Park | Spruce Pine Trail | 3.1 | Hike
34 | Wilson Hospice House | Wilson Hospice House Trail | 1.3 | Hike
35 | Marjorie Turnbull Park | Trail | 1.6 | Hike
36 | Tallahassee | Morning Hike from Budget | 5.4 | Hike
37 | Gil Waters Preserve at Lake Munson | Trail | 1 | Hike
38 | Bald Point State Park | Sandy Trails | 4.2 | Bike
39 | Elinor Klapp-Phipps Park | Redbug Trails | 4.6 | Bike
40 | Tate’s Hell State Forest | High Bluff Coastal Loop Trail | 9 | Bike
41 | Ochlockonee State Park | Pine Bluff Trail | 1.2 | Hike
42 | St. Joseph Island | Loggerhead Trail and Maritime Hammock Nature Trail | 21.5 | Both
43 | Tom Brown Park | West Cadillac Trail | 3.1 | Bike
44 | J.R. Alford Greenway | Long Leaf Trail | 4 | Hike
45 | Alfred B. Maclay Gardens State Park | Ravine Trail | 1.9 | Bike
46 | Cascades Park | Cascades Park Loop | 2.5 | Hike
47 | Apalachicola National Forest | Oak Park Bridge Trail | 4.8 | Hike
48 | St. George Island State Park | Gap Point Trail | 5.5 | Hike
49 | St. George Island State Park | Sugar Hill Beach Old Road | 7.3 | Both
50 | St. Marks Trail | Wakulla Springs to St. Marks | 16 | Bike
51 | Myakka River State Park | Fox’s Low to Mossy Hammock | 3 | Hike
52 | Myakka River State Park | Mossy Hammock to Fox’s Low | 9.5 | Hike
53 | Plant City | Dean’s Ride | 8.5 | Bike
54 | River Rise Preserve State Park | River Rise Preserve Trail | 2.9 | Hike
55 | St. Marks Trail | North Trail | 9.8 | Bike
56 | Tom Brown Park | Subaru to Tom Brown | 3.5 | Hike
57 | Apalachicola National Forest | Wright Lake White Trail | 5.5 | Hike
58 | St. Marks WMA | Plum Orchard to St Marks Via Port Leon | 8.2 | Hike
59 | Capital Circle SE | Capital Circle SE Shared Use Path | 13 | Bike
60 | Econfina River State Park | Blue and Orange Trail | 12.5 | Bike
61 | Anita Davis Preserve at Lake Henrietta Park | Lake Henrietta Trail | 1.8 | Hike
62 | Goose Pond Trail | Goose Pond Trail | 2.7 | Hike
63 | Wakulla State Forest | Double Springs Trail with Petrik Spur | 4.7 | Bike
64 | Fred George Greenway and Park | Fred George Loop | 1.5 | Hike
65 | Lake Talquin State Forest | Long Leaf Loop | 3.9 | Hike
66 | Guyte P. McCord Park | Sculpture Trail | 1.2 | Hike
67 | Ochlockonee Bay Trail | Ochlockonee Bay Trail | 31.2 | Bike
68 | Optimist Park | Indian Head Trail | 2 | Hike
69 | Chase Street Park | Monticello “Ike Anderson” Bike Trail | 4.5 | Bike
70 | Apalachicola National Forest – Bradley Bay Wilderness | Monkey Creek Trail Head on Florida Scenic Trail | 4.9 | Hike
71 | FSU Bike Path | Ocala Rd to Stadium Drive | 2.3 | Bike
72 | Seminole State Park | Gopher Tortoise Nature Trail | 1.8 | Hike
73 | J.R. Alford Greenway | Wiregrass and Beggarwood Loop | 3.3 | Bike
74 | Lake Talquin State Park | Lake Talquin State Park Trail | 1.6 | Hike
75 | Capital to Coast Trail | St. Marks to 319 | 24.3 | Bike
76 | Constitution Park | Dolls Head Trail | 2.4 | Hike
77 | East Roswell Park | Park Loop Trail | 2.1 | Bike
78 | Chattahoochee-Oconee National Forest | Andrews Cove Trail | 3.8 | Hike
79 | Unicoi State Park | Bike Trail | 3.5 | Bike
80 | Unicoi State Park | Anna Ruby Falls | 1 | Hike
81 | Brinkley Glen Park | Brinkley Glen Trail | 1 | Hike
82 | Letchworth-Love Mounds Archaeological State Park | Letchworth Mounds Loop | 2.7 | Hike
83 | St. Marks WMA | Florida Scenic Trail | 5.5 | Hike
84 | Econfina State Park | River Loop (Red Trail) | 3.3 | Hike
85 | Torreya State Park | Torreya Trail | 6.6 | Hike
86 | Lafayette Heritage Trail Park | East Cadillac Loop | 2.6 | Bike
87 | Governors Park | Fern Trail, Kohl’s Trail and Blairstone Multi-Use Trail | 4.4 | Hike
88 | Apalachee Regional Park | Cross Country Loop | 3.3 | Hike
89 | St. Andrews State Park | Road and Pine Flatwoods Trail | 5.5 | Both
90 | Torreya State Park | Torreya Challenge | 9.3 | Hike
Total | | | 478

 

Google Calendar Privacy Vulnerability

It’s interesting how events can lead one to find privacy and security vulnerabilities. I’m reminded of the old Connections show, where James Burke would connect seemingly unrelated events in human history and show how one led to another. During my Winter 2021 Strategic Privacy by Design course, the United States shifted its clocks for Daylight Saving Time, an anachronism from the days of agriculture where the government thought changing the time twice a year to adjust to changing sunlight would help farmers use time more effectively. As a result of this shift, some students in Europe showed up at the end of a lecture because I had adjusted my clock, but they, obviously being in Europe, had not.

As a result of this timing error, I thought it might be good to create calendar items in Moodle (the LMS I use) for the Spring 2021 Strategic Privacy by Design course. The plan was to export the iCal file and send it to students so they would each be able to insert the important course events in their own calendar. I imported the file into my own calendar as well, which, unfortunately, is in Google.

My eagle-eyed assistant instructor, Maria, noticed when she was checking my schedule to send me an invite to a meeting that she could see these items, even though I had set my calendar up to share only Free/Busy information (see below).

After digging around, I finally figured out what was going on. Visibility on each calendar item has three options: private, public or default visibility (meaning it defaults to the overall calendar’s visibility).

However, these calendar items carried a class of public in the iCal file (the CLASS property), which overrode my calendar’s default of Free/Busy only.
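One way to guard against this before importing is to rewrite the class on every event in the exported file. The following is a minimal sketch, not a full iCalendar parser: it assumes the CLASS line is not folded across lines, the function name force_private is made up for illustration, and whether a given calendar service honors the rewritten value on import is something you would need to verify yourself.

```python
def force_private(ics_text: str) -> str:
    """Return a copy of the iCal text in which every VEVENT is CLASS:PRIVATE."""
    out = []
    in_event = False
    for line in ics_text.splitlines():
        key = line.upper()
        if key.startswith("BEGIN:VEVENT"):
            in_event = True
            out.append(line)
            out.append("CLASS:PRIVATE")  # our own class, added once per event
            continue
        if key.startswith("END:VEVENT"):
            in_event = False
        # Drop whatever class the exporter wrote; ours is already in place.
        if in_event and key.startswith(("CLASS:", "CLASS;")):
            continue
        out.append(line)
    return "\r\n".join(out) + "\r\n"


# Made-up export with one public event and one event without a class.
sample = "\r\n".join([
    "BEGIN:VCALENDAR",
    "VERSION:2.0",
    "BEGIN:VEVENT",
    "SUMMARY:Lecture 1",
    "CLASS:PUBLIC",
    "END:VEVENT",
    "BEGIN:VEVENT",
    "SUMMARY:Lecture 2",
    "END:VEVENT",
    "END:VCALENDAR",
])
cleaned = force_private(sample)
```

Running the export through a filter like this before import means the file no longer asks for public visibility, whatever the exporting system chose.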

Those events were imported. I wanted to check, so I had my security intern invite me to three events: one she set to private, one she set to public and one she set to default visibility. As expected, despite my calendar being set to Free/Busy only, the “public” event showed as public.

Your reaction may be, well, this event is public, but two problems persist. 1) It still shows MY interest or possible attendance in this public event, not just whether I’m busy or free; and 2) a sender may have their calendar default set to public without realizing it, and then send you an invite to talk. I would suggest that my calendar settings should override the imported event’s settings, just to be on the safe side.

By the way, if anyone has a suggestion for a privacy friendly online calendar (so I can share my free/busy schedule), I’d appreciate hearing from you. I haven’t found a good alternative yet.

How to stalk someone via a trail mapping app.

For those that don’t know, I’m an avid hiker and biker. In fact, I’m currently undertaking a challenge that I created for myself to do 90 different trails in 100 days. Currently, I’m two-thirds of the way through that challenge with ~30 days to go. One of the key tools I use for finding and following trails is a mobile app called AllTrails.

I’ve used it for years but now I’m using it daily. While I’ve known that trail apps have potential privacy problems (I even included building a privacy friendly trail app as an example in my book, see illustration), my recent use has pinpointed just how problematic they are.

 

Screenshot of AllTrails

In the app on a phone, when you pull up the map to explore an area looking for trails, you’re presented with the pinpoints of a bunch of curated trails, as shown at left. You can click on a trail and get a description, trail map, reviews, popular activities and features. There is a slight problem in that reviews, I think, are public by default, but it appears that when your profile is private or individual recordings are private, your reviews aren’t shown.

In my search for trails, though, I’ve found lots of unlabeled trails. In other words, trails at parks, greenways and forests that haven’t been curated and cataloged. You can submit new trails for consideration, and I’ve done that with a few. I’ve also recorded via the app some hikes and walks that aren’t official trails, like when I dropped a rental truck off and walked home 5 miles because I needed to get a hike in that day or when my car was getting an oil change and I hiked to a park to kill time. Because of the challenge, I wanted to document these “hikes” to record my mileage. Now even though my recordings are private and my profile is private, uploading these recordings seemed even less problematic because they weren’t linked to an official trail and thus unfindable by the public. At least I assumed so. [Yes, privacy professionals, I know, AllTrails could be monetizing me by selling geolocation information to advertisers. I assume so, at least, with any app I use.]

It turns out that my statement about recordings unlinked to trails is not quite accurate. In the App it appears to be true, but on the AllTrails website, you can look at curated trails OR community content. 

This community content contains all sorts of hikes people take, including official but uncurated trails, trips to visit grandmother in Ohio (I saw one where someone recorded their road trip) or walks around their neighborhood. I’ve yellowed out the map above to reduce the chance of someone finding this particular hiker’s location based on the road topography. Clicking on the recording in the list of community content leads to the details (shown below). As you can see, this hiker left their house (black point), walked around their neighborhood and turned off the recording as they approached their house at the end of the cul-de-sac (green point). Mousing over the endpoints yields the latitude and longitude to 5 decimal places, which is precise to roughly a meter. I’ve attempted to obscure as much information as possible, like street names, exact lat/long and other houses, but I’m sure someone with enough resources could identify this from the unique street outline. However, I’m not going to make it easy.
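For a sense of what those decimal places mean on the ground, here is a rough back-of-the-envelope sketch. It assumes a spherical Earth with about 111,320 meters per degree of latitude, and the function name resolution_m is made up for illustration; it is an approximation, not a geodesy library.

```python
import math

# Approximate length of one degree of latitude, assuming a spherical Earth.
METERS_PER_DEGREE_LAT = 111_320


def resolution_m(decimals: int, latitude_deg: float = 0.0) -> tuple:
    """Ground distance covered by the last reported decimal place.

    Returns (north_south_m, east_west_m). East-west distance shrinks with
    the cosine of the latitude because meridians converge toward the poles.
    """
    step_deg = 10 ** -decimals
    north_south = step_deg * METERS_PER_DEGREE_LAT
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# Five decimals at about 30 degrees north (roughly north Florida).
five_ns, five_ew = resolution_m(5, 30.0)
# Three decimals at the same latitude, for comparison.
three_ns, three_ew = resolution_m(3, 30.0)
```

Under these assumptions, five decimal places works out to about 1.1 meters north-south and a little under a meter east-west, which is enough to pick out a single driveway. Three decimal places is on the order of a hundred meters, closer to "somewhere on this block."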

You may be thinking, well this isn’t bad because I don’t know who lives at some random house (i.e. I don’t know their name, though it might be part of the public records on home ownership). In other words, it’s an attribute disclosure about this person (their walk details) but not an identity disclosure. I won’t debate the problems of attribute disclosures in this blog, but that’s not what’s happening here. Clicking the profile icon will take you to their profile. Note, this person did at least not upload a picture of themselves, so the profile icon (under the words Morning Hike on the left) is generic. Unfortunately, they DID include their full name (changed to a gender neutral generic name below).

On my recent hiking challenge, I generally listen to podcasts, mostly privacy related. One I’ve become very fond of is Michael Bazzell’s “Privacy, Security and OSINT” podcast. It’s fairly frequent (I’m listening to podcasts daily now) and provides both tips on how to protect your privacy and OSINT (Open Source Intelligence) techniques, with which people need to be familiar in order to protect their privacy.

Of course, being a privacy by design specialist, my take is people shouldn’t have to go to extremes to protect their privacy. The onus is on organizations to build better products and services. AllTrails, I like your app, really, I do. But it needs so many improvements from a privacy perspective. So many, in fact, I’d be happy to offer you some free consulting. Just contact me at rjc at enterprivacy.com. I don’t mean to single AllTrails out. I’m sure this is a problem with many or most of the trail apps. AllTrails just happens to be the one I use.

For others who don’t want their organizations to be on the cover of the NY Times, sign up for some privacy by design training or contact me about a consulting engagement. Become a privacy hero with your customers.

One Way Tables

One of the potential downsides of relational data is the ease with which data can be related in both directions. A simple search of the table below will reveal not only whether a known customer has COVID but also a list of all COVID positive customers.

Suppose instead you want to be able to determine if a customer has COVID but not easily obtain a list of all customers who have COVID. How might you do that? This can be accomplished using one-way functions (hashes). By hashing the customer’s name (shown as the H(name) function) and storing that in the customer field, we can no longer get a list of customers with COVID. A simple SQL query will get the customer’s COVID status from their name: SELECT Covid FROM TABLE WHERE Customer = hash(customer_name). The inverse, however, won’t get me the list of customers with COVID but a list of hashes: SELECT Customer FROM TABLE WHERE Covid = 'Positive'.
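The idea can be sketched in a few lines of Python, with a dictionary standing in for the database table. The customer names and statuses are made up, SHA-256 is an assumed choice of hash (the post does not name one), and the helper names are hypothetical.

```python
import hashlib


def h(name: str) -> str:
    """One-way function H(name); SHA-256 is an assumed choice."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


# Made-up source data that the stored table is built from.
customers = {"Alice": "Positive", "Bob": "Negative", "Carol": "Positive"}

# The stored table holds H(name) in the Customer column, never the name.
table = {h(name): status for name, status in customers.items()}


def covid_status(name: str):
    """Forward lookup: works for anyone who already knows the name."""
    return table.get(h(name))


def positive_rows():
    """Inverse query: returns hashes, not a list of names."""
    return [cust for cust, status in table.items() if status == "Positive"]
```

Calling covid_status("Alice") finds her row, while positive_rows() only yields 64-character hex strings that do not reveal who the customers are by themselves.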

Now hashes are one-way functions, meaning it’s very difficult to determine x, given H(x). However, because a customer’s name can take relatively few values, it’s easy to compute what’s called a rainbow table, that is, a precomputed table of the hash of every possible value. We can then look up the hash in the rainbow table and obtain the customer’s name for those customers that have COVID. The table below assumes we know all the names of customers. An attacker may not, but names are fairly common, and even if we created our rainbow table with six thousand different common first names, that’s fairly trivial to compute.
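A minimal sketch of that attack follows. Strictly speaking it builds a plain precomputed lookup table rather than a chain-based rainbow table, but the effect on an unsalted table is the same. The names are made up, the short list stands in for a list of thousands of common names, and SHA-256 is an assumed hash.

```python
import hashlib


def h(name: str) -> str:
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


# What the attacker sees: the unsalted table of H(name) -> status.
leaked = {h("Alice"): "Positive", h("Bob"): "Negative"}

# The attacker's guess list; in practice, thousands of common names.
common_names = ["Alice", "Bob", "Carol", "Dave"]

# Precompute hash -> name once, then reverse every row by lookup.
lookup = {h(name): name for name in common_names}
recovered = {lookup[cust]: status
             for cust, status in leaked.items() if cust in lookup}
```

Every row whose name appears in the guess list is reversed with a single dictionary lookup, which is why the unsalted scheme offers little protection.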

We address this with a concept called salting, which is adding a random value to the name before hashing it, i.e. Customer = H(Name+Salt). One problem, though: we can’t use the same salt for every customer, or an attacker could precompute a rainbow table by just adding the salt to each name before calculating the hash. We also need the salt BEFORE we do the hash. Now customers can’t be expected to remember the salt before looking up their status, but we can give them an account number, which we can relate to the salt. A customer wanting to look up their COVID status enters their account number and name. The system retrieves the salt using the account number and hashes the name with it to determine the customer and look up their COVID status. [Note, salts should be incremented with each access to prevent replay attacks but that’s beyond the scope of this blog].
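The salted lookup described above might look something like the following sketch. Two dictionaries stand in for the account table and the COVID table; the account numbers, names, 16-byte salt length and SHA-256 are all assumptions made for illustration.

```python
import hashlib
import secrets

accounts = {}  # account table: account number -> salt
covid = {}     # COVID table: H(name + salt) -> status


def h(name: str, salt: bytes) -> str:
    """Customer = H(Name + Salt), with SHA-256 as an assumed hash."""
    return hashlib.sha256(name.encode("utf-8") + salt).hexdigest()


def enroll(account_no: int, name: str, status: str) -> None:
    salt = secrets.token_bytes(16)  # fresh random salt per account
    accounts[account_no] = salt
    covid[h(name, salt)] = status


def lookup(account_no: int, name: str):
    """Needs both the account number (to find the salt) and the name."""
    salt = accounts.get(account_no)
    if salt is None:
        return None
    return covid.get(h(name, salt))


enroll(1001, "Alice", "Positive")
enroll(1002, "Bob", "Negative")
```

Nothing stored in either table links an account row to a COVID row; the link only appears once someone supplies the right name for the right account.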

Light arrows show the link between rows, only discoverable with knowledge of the customer’s name.

Now an attacker wanting to identify all known COVID cases would need to hash all the possible customer names against all the possible salts. Padding the salt table with hundreds of thousands of random account numbers could increase the number of computations necessary for the attacker. In fact, you could prefill the account table with a million rows and a million salts and randomly assign account numbers to customers. Even if two customers got the same account number, they wouldn’t get the same Customer value, H(name+salt), unless they also had the same name. This also benefits us because if we have two people with the same name, they could be given different account numbers and distinct COVID statuses.

In order to create a rainbow table now, an attacker must hash six thousand names with a million account numbers, or on the order of 10^9 rows (6 × 10^9, to be exact). Not impossible but starting to slow them down.

I’m going to make it even a little bit harder. Right now, we can still use the COVID table to identify how many people are COVID positive or negative. This is especially problematic if each and every customer has the same status. We can do a little encryption and steganography to hide the information. First, for each record we generate a random 16-byte (128-bit) value (say 787683b51cf3b49886fce82dc34a51a2 in hexadecimal). If we need to encode a negative result, we ensure that value has a least significant bit of 0. If we need to encode a positive result, we ensure that value has a least significant bit of 1 (i.e. 787683b51cf3b49886fce82dc34a51a3). Note the change in the last digit from 2 to 3, which flips the least significant bit and the COVID status. That value is then encrypted using a symmetric cipher (such as AES) without padding and without authentication that would verify the accuracy of the data. We can use part of the salt as the encryption key (bolded in the table below).

Now, an attacker can’t search the COVID table to identify how many customers have COVID. Additionally, because we used an encryption algorithm without any validation, any of the values in the COVID column will decrypt with any of the encryption keys in the account table, just with erroneous least significant bits, thus not disclosing anyone’s COVID status at all. If we used an encryption scheme with padding or authentication, a wrong key would fail to decrypt cleanly, which would provide a quicker backdoor for connecting records in the account table with records in the COVID table.
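The bit-hiding step can be sketched as follows. This shows only how the status is tucked into the least significant bit of a random value; the encryption step is deliberately left out because the Python standard library has no AES. The function names are made up, and 16-byte values are assumed to match the length of the hexadecimal example above.

```python
import secrets


def encode_status(positive: bool) -> bytes:
    """Random 16-byte value whose least significant bit carries the status."""
    value = bytearray(secrets.token_bytes(16))
    # Clear the lowest bit of the last byte, then set it for a positive.
    value[-1] = (value[-1] & 0xFE) | int(positive)
    return bytes(value)


def decode_status(value: bytes) -> bool:
    """Read the status back from the least significant bit."""
    return bool(value[-1] & 1)


# The two example values from the text differ only in that last bit.
negative_example = bytes.fromhex("787683b51cf3b49886fce82dc34a51a2")
positive_example = bytes.fromhex("787683b51cf3b49886fce82dc34a51a3")
```

Because the other 127 bits are random, the stored values look alike whether the status is positive or negative, and only the final bit of the correctly decrypted value means anything.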

One more step to slow a potential attacker down is to iterate the hashing, in other words compute H( H( H( H( …. H(name+salt))))), say 100,000 times. While slowing down lookups slightly, it significantly increases the cost of computing a rainbow table for an attacker, who now must compute 100,000 × 10^9 (on the order of 10^14) hashes. Increasing the number of values in the account table, increasing the potential values for name (such as using given name and surname), and increasing the number of iterations will all serve to slow an attacker down.
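A minimal sketch of the iterated hash and the attacker's workload is shown below, assuming SHA-256 and made-up function names. In practice a purpose-built key-stretching function such as hashlib.pbkdf2_hmac in the Python standard library does the same job; the plain loop here is only meant to mirror the nested H(H(...)) notation above.

```python
import hashlib


def iterated_hash(name: str, salt: bytes, rounds: int = 100_000) -> str:
    """Compute H(H(...H(name + salt))) with the given number of rounds."""
    digest = name.encode("utf-8") + salt
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def attacker_hashes(names: int, accounts: int, rounds: int) -> int:
    """Hash computations needed to try every name against every salt."""
    return names * accounts * rounds


# Six thousand names, a million salts, one round versus 100,000 rounds.
single_round_cost = attacker_hashes(6_000, 1_000_000, 1)
stretched_cost = attacker_hashes(6_000, 1_000_000, 100_000)
```

A legitimate lookup pays for 100,000 hashes once, while the attacker pays that price for every name and salt combination.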

The concept described above is sometimes called a translucent database, such that someone who has full access to the data (a database engineer) still can’t interpret it, but with a little extra knowledge (account number and name), a customer can still retrieve their COVID status.

Cookie pollution

Just moments ago I visited lightpollutionmap.info to look for somewhere to go camping this weekend with as little light pollution as possible within a reasonable drive from where I am. I was immediately presented with a pretty comprehensive cookie banner. The FIRST thing I noticed was that I couldn’t unselect unnecessary cookies, which included over 75 “statistics” cookies.

Most of those “statistics” cookies, of course, are really third party marketing trackers. The only options presented were “Allow selection” and “Allow all cookies,” which were functionally identical because you couldn’t unselect any of the types of cookies. I noticed the button for selecting only necessary cookies was still there but invisible. I clicked it, only to be presented with a blank screen. See the video below showing the cookie selection options disappearing. I also noticed the ad settings option to accept or reject vendors was disabled.

VIDEO of options being hidden

Artificial Motivations

Recently I had the pleasure of working with two start-ups in the AI space to help them consider privacy in their designs. Mr Young, a Canadian start-up, wants to use an intelligent agent to allow individuals to find resources available to them to improve their mental well-being. Eventus AI, a US-based company, hopes to use AI to optimize the sales funnel from leads collected at events. Both recognized the potential privacy implications of their services and wanted not only to ensure compliance with legal obligations, but also to showcase privacy as an important aspect of their brand.

At the outset of my engagements, I had to think about how the intelligent agents driving them could threaten individuals’ privacy. In my previous work, threat actors were persons, organizations or governments, each with distinctive motives. People can be curious, seek revenge, try to make money, or exert control. Organizations are generally driven by making money or creating competitive advantage. Governments invade privacy for law enforcement or espionage purposes. Less angelic governments may invade privacy out of desires for control or repression of their citizens.

Figure 1 From the book Strategic Privacy by Design chapter on Actors

Typically, when I think of software, it isn’t a “threat actor” in my privacy model. Software doesn’t have independent motives; it is a tool made by developers. The question arises, though: does AI represent a different beast? Does AI have “motives” independent of its creator? Clearly, we haven’t reached a stage where HAL 9000 refuses Dave’s command or Skynet determines humanity is a threat to its existence, but could something slightly less sentient manifest motive?

Still, I would argue that AI is not similar to other software. It can, in a sense, present a privacy threat beyond the intent of its creator. While not completely autonomous, AI does exhibit an ends-justify-the-means approach to achieving its objective. The difference between AI and, say, a human employee is that the human can put their business objective in the context of other social norms, whereas AI lacks this contextual understanding. I liken it to the dystopian analogy of robots being programmed to prevent humans from harming one another and determining the best way is to exterminate all humans. Problem solved! No more humans harming other humans. Like the genie granting a wish, they do exactly what they are told, sometimes with unintended and far-reaching consequences.

The motivation that I would ascribe to AI then is “programmatic goal-seeking.” It is not that AI seeks to invade privacy for independent purposes; rather it seeks whatever it’s been programmed to seek (such as ‘increasing engagements’). Privacy is the beautiful pasture bulldozed on AI’s straight-line path to its destination.

The question now becomes, from the perspective of a developer trying to build an AI into a system, how do you prevent privacy being a casualty of that relentless pursuit? I make no claims that my suggestion below in any way supplants all of the efforts to consider ethics in AI development (failed or successful); rather, the approach I take complements others. I think it gets us far along in a pragmatic and systematic way.

Before looking at tactics in the AI context (or anywhere really) there is a fundamental construct the reader must understand: the difference between data and information. Consider a photo of a person. The data is the photo – the bits, bytes, interpretations of how color should be rendered, etc. But a photo is rich with much information. It probably displays the gender of the individual, their hair color, their age, their ethnic background, perhaps their economic or social status. Even without geotagging, if the photograph has a distinctive background it could reveal the person’s location. The subject’s hairstyle and dress and the quality and makeup of the photo might suggest the decade it was recorded. Giving over that photo to someone not only gives them the bits and the bytes but also gives them all of that rich information.

In general, for privacy by design, I use Jaap-Henk Hoepman’s strategies and tactics to reduce privacy risks. Just as they can be applied to other threat actors, I think they are equally applicable here. To see how Hoepman’s strategies can be used against AI, consider the following example:

Your company has been tasked with designing an AI based solution to sort through thousands of applicants to find the one best suited for a job. You’re concerned the solution might adversely discriminate against candidates from ethnic minority populations. If you’re questioning whether this is even a “privacy” issue, I’d point you to the concept of Exclusion under the Solove Taxonomy. We (well, the AI) are potentially using information, namely ethnicity, without the knowledge and participation of the individuals: an Exclusion violation.

How then can we seek to prevent this potential privacy violation?

Two immediate tactics come to mind. These are by no means the only tactics that could or should be employed, but they are illustrative. The first is stripping, which falls under the Minimize strategy and my ARCHITECT supra-strategy. Stripping refers to removing unnecessary attributes. Here, the attribute we need to remove is ethnicity. This isn’t as simple as removing ethnicity as a data point given to the AI. Rather, returning to the distinction between data and information, we need to examine any instance where ethnicity could be inferred from data, such as a name or cultural distinctions in the way candidates may respond to certain questions. This also includes ensuring that training data doesn’t contain hidden biases in its collection.

The second tactic is auditing which falls under the Demonstrate strategy and my SUPERVISE supra-strategy. AI already employs validation data to ensure that the AI is properly goal seeking (i.e. achieving its primary purpose). Review of this validation process should be used to also continue ensuring that the AI isn’t inferring ethnicity somehow (that we failed to strip out) and using that information inappropriately as part of its goal seeking objective. If it turns out it is, then, similar to a human employee, the AI might need retraining with new, further sanitized, data.
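To make the two tactics a little more concrete, here is a rough sketch under stated assumptions. The field names, the blocked list and the audit format are all hypothetical, and the audit assumes group labels are held separately from the model purely for review. It illustrates the idea and is not a complete fairness methodology.

```python
from collections import defaultdict

# Hypothetical fields that state ethnicity or could let it be inferred.
BLOCKED_FIELDS = {"ethnicity", "name", "photo"}


def strip_attributes(candidate: dict, blocked=BLOCKED_FIELDS) -> dict:
    """Stripping: drop blocked fields before the record reaches the AI."""
    return {k: v for k, v in candidate.items() if k not in blocked}


def selection_rates(audit_rows) -> dict:
    """Auditing: share of candidates the AI selected, per group.

    audit_rows is an iterable of (group, selected) pairs, where the group
    label is kept out of the model's inputs and used only for this review.
    """
    totals = defaultdict(int)
    picked = defaultdict(int)
    for group, selected in audit_rows:
        totals[group] += 1
        picked[group] += int(selected)
    return {group: picked[group] / totals[group] for group in totals}


# Made-up candidate record and audit sample.
stripped = strip_attributes({"name": "A. Candidate", "skills": ["sql"]})
rates = selection_rates([("A", True), ("A", False), ("B", False), ("B", False)])
```

A large gap between groups in the audit output would be a cue to look for proxies that the stripping step missed and, if needed, retrain on further sanitized data.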

While AI represents a new and potentially scary future, with proper design considerations and strategic, systematic approaches, we can reduce the potential privacy risks it would otherwise create.

Data Fetishism, FIPPs and the Intel Privacy Proposal

fetish – “An excessive and irrational devotion or commitment to a particular thing.”

Basing their legislative proposal in the Fair Information Practice Principles (FIPPs), Intel looks to the past, not the future, of privacy. The FIPPs date to the 1970s and were codified by the OECD in its 1980 guidelines to help harmonize international regulation on the protection of personal data. Though they have evolved and morphed, those basic principles have served as the basis for privacy frameworks, regulations and legislation world-wide. Intel’s proposal borrows heavily from the FIPPs: collection limitation, purpose specification, data quality, security, transparency, participation and accountability. But the FIPPs’ age is showing. In crafting a new law for the United States, we need to address the privacy issues for the next 50 years, not the last.

When I started working several years ago for NCR Corporation I was a bit miffed at my title of “Data Privacy Manager.” Why must I be relegated to data privacy? There is much more to privacy than data, and often controls around data are merely a proxy for combating underlying privacy issues. If the true goal is to protect “privacy” (not data) then shouldn’t I be addressing those privacy issues directly? The EU’s General Data Protection Regulation similarly evidences this tension between goals and mechanism. What the regulators and enactors sought to rein in with the GDPR were abusive practices by organizations that affected people’s fundamental human rights, but they constrained themselves to the language of “data protection” as the means to do this, leading to often contorted results. The recitals to the regulation mention “rights and freedoms” no less than 35 times. Article 1 Paragraph 2 even states “This Regulation protects fundamental rights and freedoms of natural persons and in particular their right to the protection of personal data.” Clearly the goal is not to protect data for its own benefit; the goal is to protect people.

Now many people whose career revolves around data focused privacy issues may question why data protection fails at the task. Privacy concerns existed way before “data” and our information economy. Amassing data just exacerbated power imbalances that are often the root cause of privacy invasions. For those still unpersuaded whether “data protection” is indeed insufficient, I provide four quick examples where a data driven regulatory regime fails to address known privacy issues. They come from either end of Prof. Dan Solove’s taxonomy of privacy.

Surveillance – Though Solove classifies surveillance and interrogation under the category of Information Collection, the concern around surveillance isn’t about the information collected. The issue with surveillance is it invites behavioral changes and causes anxiety in the subject being watched. It’s not the use of the information collected that’s concerning, though that may give rise to separate privacy issues, but rather the act and method of collection. Just the awareness and non-consent of surveillance (the unwanted perception of observation in Ryan Calo’s model) triggers the violation. No information need be collected. Consider store surveillance by security personnel where no “data” is stored. Inappropriate surveillance, such as targeting ethnic minorities, causes consequences (unease, anxiety, fear, avoiding certain normal actions that might merely invite suspicion) in the surveilled population.

Interrogation – Far from the stereotypical suspect in a darkened room with one light glaring, interrogation is about any contextually inappropriate questioning or probing for personal information. Take my favorite example of a hiring manager interviewing a female candidate and asking if she was pregnant. Inappropriate given the context of a job interview; that’s interrogation. It’s not about the answer (the “data”) or the use of the answer. The candidate needn’t answer to feel “violated” in the mere asking of the question, raising consequences of anxiety, trepidation, embarrassment or more. Again, we find the act and method of questioning is the invasion, irrespective of any data.

Intrusion – When Pokémon Go came out, fears about what information was collected about players abounded, but one privacy issue hardly on anyone’s radar was the use of churches by the game as places for the individual to train their characters. It turned out some of the churches on the list had been converted to people’s residences, thus inviting players to intrude upon those residents’ tranquility and peaceful enjoyment of their homes. I defy any privacy professional to say that asking any developer about the personal data they are processing, even under the most liberal definition of personal data, would have uncovered this privacy invasion.

Decisional Interference – Interfering with private decisions strikes at the heart of personal autonomy. The classic examples are laws that affect family decisions, such as China’s one child policy or contraception in the United States. But there are many ways to interfere with individuals’ decisions. Take the recent example of Cambridge Analytica. Yes, the researcher who collected the initial information shared people’s information with Cambridge Analytica and that was bad. Yes, Cambridge Analytica developed psychographic profiles and that was problematic. But what really got the press, the politicians and others so upset was Cambridge Analytica’s manipulation of individuals. It was their attempt, successful or otherwise, to alter people’s perceptions and manipulate their decision to vote and for whom.

None of the above examples of privacy issues are properly covered by a FIPPs based data protection regime, without enormous contortion. They deal with interactions between persons and organizations or among persons, not personal data. Some may claim that, while true, any of these invasions, at scale, must involve data, not one-off security guards. I invite readers to do a little Gedanken experiment. Imagine a web interface with a series of questions, each reliant on the previous answers. Are you a vegetarian? No? What is your favorite meat, chicken, fish or beef? Etc. I may not store your answer (no “data” collection) but ultimately the questioning leads you to one specific page where I offer you a product or service based on your specific selection, perhaps discriminatory pricing based on the selection. Here user interface design essentially captures and profiles users but without that pesky data collecting that would invite scrutiny from the privacy office. I’m not saying some companies might not be advanced enough in their thinking to catch this, but in my years of practice most privacy assessments begin with “what personal data are you collecting?”
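The thought experiment above can be sketched in a few lines. This is only an illustration under stated assumptions: the page names, answers and prices are invented, and the `answers` list simply stands in for a visitor's clicks. The point it tries to show is that each page is chosen from the current page and the answer alone, so the final page encodes the profile even though nothing is ever written to storage.

```python
# Hypothetical question flow: every page name, answer and price here is
# invented for illustration and does not describe any real site.
QUESTIONS = {
    "start": {
        "ask": "Are you a vegetarian?",
        "next": {"yes": "offer-vegetarian", "no": "meat"},
    },
    "meat": {
        "ask": "What is your favorite meat: chicken, fish or beef?",
        "next": {
            "chicken": "offer-chicken",
            "fish": "offer-fish",
            "beef": "offer-beef",
        },
    },
}

# Hypothetical prices attached to each landing page.
OFFERS = {
    "offer-vegetarian": 9.0,
    "offer-chicken": 10.0,
    "offer-fish": 12.0,
    "offer-beef": 14.0,
}


def next_page(page, answer):
    """Choose the next page from the current page and the answer alone."""
    return QUESTIONS[page]["next"][answer]


def final_offer(answers):
    """Follow a visitor's clicks to a landing page without saving any answer."""
    page = "start"
    for answer in answers:
        page = next_page(page, answer)
    return page, OFFERS[page]


print(final_offer(["no", "beef"]))
```

An assessment that only asks what personal data is stored would find nothing here, yet the visitor who clicked "no" and then "beef" sees a different price than the one who clicked "yes".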

Now, I’ll admit I haven’t spent the time to develop a regulatory proposal but I’d at least suggest looking at Woody Hartzog’s Privacy’s Blueprint for one possible path to follow. Hartzog’s notions of obscurity, trust and autonomy as guiding privacy goals encapsulate more than a data centric world. But Hartzog doesn’t just leave these goals sitting out there with no way to accomplish them. He presents two controls that would help: signaling and increasing transaction costs. Hartzog’s proposal for signaling is that in determining the relationship between individuals and organizations and the potential for unfairness and asymmetries (in information and power), judges should look not to the legalese of the privacy notice, terms and conditions or contracts but the entirety of the interaction and interfaces. This would do more to determine whether a reasonable user fully understood the context of their interactions.

Hartzog’s other control, transaction costs, involves making it more expensive for organizations to commit privacy violations. One prominent example of legislation that increases transaction costs is the US TCPA, which bans robocalls. Robocalling technology significantly decreases the cost of calling thousands or millions of households. The TCPA doesn’t ban solicitation, but it significantly increases the costs to solicitors by requiring a paid human caller to make the call. In this way, it reduces the incidents of intrusion. Similarly, the GDPR’s ban on automated decision making increases the transaction costs by requiring human intervention. This significantly reduces the scale and speed at which a company can commit privacy violations and the size of the population affected. Many would counter, as many commenters on any legislative proposal do, that they are concerned about the effect on innovation and small companies. True, increasing transaction costs in the way that the TCPA does will increase costs for small firms. That is, after all, the purpose of increasing transaction costs, but the counter-argument is: do you want a two-person firm in a garage somewhere adversely affecting the privacy of millions of individuals? Would you want a small firm without any engineers thinking about safety building a bridge over which thousands of commuters traveled daily? One could argue the same for Facebook: they’ve made it so efficient to connect billions of individuals that they simply don’t have the resources to deal with the scale of the problems they’ve created.

The one area where I agree with the Intel proposal is about FTC enforcement. As our de facto privacy enforcer it already has institutional knowledge to build on, but its enforcement needs real teeth, not ineffectual consent decrees. When companies analyze compliance risk, if the cost of non-compliance is comparable to the cost of compliance, then they are incentivized to reduce the likelihood of getting caught, not actually get in compliance with the regulation. The fine (impact) multiplied by the likelihood of getting fined must exceed the cost of compliance to drive compliance. This is what, at least in theory, the GDPR 4% seeks to accomplish. Criminal sanctions on individual actors, if enforced, may have similar results.
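The arithmetic behind that argument is simple enough to write down. The following is a minimal sketch, assuming a purely cost-driven company; the function name and every dollar figure and probability are hypothetical and chosen only to make the comparison visible.

```python
def fine_drives_compliance(fine, likelihood_of_fine, cost_of_compliance):
    """Return True when the expected fine exceeds the cost of complying.

    fine: size of the penalty if caught (the impact)
    likelihood_of_fine: probability between 0 and 1 of actually being fined
    cost_of_compliance: what it costs to comply in the first place
    """
    expected_fine = fine * likelihood_of_fine
    return expected_fine > cost_of_compliance


# Hypothetical numbers: a small fine that is rarely imposed does not
# outweigh a 40,000 compliance bill, while a much larger one does.
print(fine_drives_compliance(100_000, 0.05, 40_000))
print(fine_drives_compliance(1_000_000, 0.05, 40_000))
```

Seen this way, a regulator has two levers: raise the fine or raise the likelihood of being fined. A company has a third: lower the likelihood of getting caught, which is exactly the perverse incentive weak enforcement creates.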

There are other problems with the FIPPs. They mandate controls without grounding in the ultimate effectiveness of those controls. I can easily technically comply with the FIPPs without manifestly improving privacy. Mandating transparency (openness in the Intel proposal) without judicial ability to consider the entirety of the user experience and expectation only yields lengthy privacy notices. Even shortened notices provide less information about what’s going on than users’ reliance on their interactions with the company.

In high school, I participated in a mock constitution exercise where we were supposed to develop a new constitution for a new society. Unfortunately, we failed and lost the competition. Our new constitution was merely the US Constitution with a few extra amendments. As others have said we don’t need GDPR-light, we need something unique to the US. I don’t claim Hartzog’s model is the total solution, but rather than looking at the FIPPs, Intel and others proposing legislation should be looking forward for solutions for the future, not the past.

 

 

Kafka’s “The Trial” finds new life on Amazon

In the early years of the aught decade, Dan Solove, then assistant professor of law at Seton Hall Law School, presaged that the new paradigm for privacy in the Internet age wasn’t George Orwell’s 1984 but rather Franz Kafka’s The Trial. In the book, the protagonist is subject to a secret trial in which he knows not the charges against him or the evidence used. He is not allowed to participate in the proceedings or dispute the evidence. In the end, the character is executed having never learned what the charges were. A few years later when Professor Solove developed his taxonomy of privacy, which categorized cognizable privacy violations, this form of privacy violation became known as Exclusion: the use of information about an individual without the individual’s knowledge or ability to participate. Solove’s taxonomy encompasses 17 distinct privacy violations spanning four broad categories: information collection, information processing, information dissemination and the non-information privacy issues in invasion. Exclusion is a particularly pernicious violation because it combines two fundamental forces that underlie many issues around privacy, namely imbalance in information and imbalance in power between the organization and the individual. In The Trial, the government is the perpetrator of this type of violation. Historically, government actors are where one mostly finds this because government’s power monopoly prevents individual choice and the impact, such as loss of liberty, can be devastating. The unfairness of exclusion was crucial in criticisms of the US government’s no-fly list, which contained names of individuals who didn’t know they were on the list and who, even once they knew, were unable to learn or dispute the information used against them. 
Concern over exclusion also underpins the original formation of the Fair Information Practices (which were primarily aimed at government databases in the 1970s) to prevent secret databases in which individuals couldn’t dispute inaccuracies. In the commercial realm, these concepts first took hold in the credit reporting industry and the Fair Credit Reporting Act’s requirement for transparency and a dispute mechanism.

I’m a heavy user of Professor Solove’s taxonomy in my privacy by design training because it helps participants categorize different types of privacy violations. It highlights, for most participants, that insecurity of data (which seems to be almost a fetishistic focus for many) is really only the tip of the privacy iceberg. I like to joke in my training that I’ve had every single privacy violation foisted on me in some fashion or another. Using anecdotes helps my students relate to the violations in a way that defining and explaining doesn’t. Personal stories convey even more than sensational news stories can. In one instance, I talk about how a cashier at a fast food restaurant identified my then girlfriend to send her a Facebook friend request. I show students an Asian dating website where my picture was appropriated in advertising. Exclusion really hits home when I ask students if they’ve ever been put on hold and told their call will be answered by the next available operator. I discuss how some companies use secret profiles of their “problem” customers to constantly kick them to the back of the queue without the customer being able to know or dispute this status.

My most recent example comes from Amazon and is detailed below.

Last month I was traveling in Europe to attend a few privacy related events and conduct some training for Deloitte Romania. In Bucharest, I taught a CIPT course as well as my Privacy by Design class, relabeled Data Protection by Design for the EU audience. While in Romania I attempted to place an order on Amazon for an electronic gift card. Apparently this raised red flags within Amazon’s systems (an electronic gift card ordered by an American consumer but from Romania, which they probably inferred from my IP address). The order was canceled. Amazon also blocked access to my account for 5 hours.

After 5 hours, I followed the instructions, which included resetting my password. I was back in and, figuring it was safe since I had re-authenticated my access to my email, re-placed my order. Of course, it wasn’t. Amazon repeated its effort but this time required that I call to reinstate my account.

I called Amazon, spoke with a customer service agent and after answering quite a few questions (like how long I had had my Amazon account, whether I had any linked devices, etc.), my account access was reinstated and I was able to reset my password and log in. I decided NOT to try to re-place my order until I returned to the US the following week, which I did. At this point my order went through and I thought everything was behind me. I even subsequently ordered Woody Hartzog’s new book Privacy’s Blueprint, without issue. [Side note, excellent book, highly recommended, just don’t order it from Amazon ;-)]

About two weeks later I was in D.C. attending the Privacy Law Scholars Conference and ordered a present for a friend off their wish-list. BAM! My account was again shut down. I didn’t realize it, since I was off doing actual work, but when I tried to check the status of the order I realized that they had shut my account down again. I called customer service but the best they could offer was to submit an email to the “Account Specialist” team. At 1:30 AM later that morning I received the following email:

Thank you for letting us know about the unauthorized activity on your Amazon.com account. For your security, the credit card information stored on your Amazon.com account cannot be accessed via our website. Your full credit card number is also not displayed in your account.

Due to this activity, your Amazon.com account has been closed and all open orders have been canceled. To continue shopping with Amazon.com, we ask that you open a new Amazon account. Your order history and other features such as Wishlists cannot be transferred to your new account.

Are you really sorry for the inconvenience this has caused, Amazon?

Note the email starts out “Thank you for letting us know about the unauthorized activity.” I quickly responded that there was no unauthorized activity (and I certainly didn’t let them know). Oddly enough there was another email at 9:30 AM that morning (AFTER the account closure) suggesting I needed to call to reset my password again. I did call but the customer service agent was only able to offer to send another email to the Account Specialist team.

Frustrated at this point, and not wanting my friend’s present further delayed, I dutifully followed the instructions about using another Amazon account; after all, Amazon specifically told me that “To continue shopping with Amazon.com, we ask that you open a new Amazon account.” I went into my business account which I had previously used to host AWS instances for various business projects. However, my attempt to complete my order was similarly thwarted. I initially received an email to confirm my billing address, to which I quickly replied. I was then presented with this sternly worded email:

“Any new accounts you open will be closed.”

So, despite having told me previously to open a new account to continue shopping, Amazon has decided they will close any account I open in the future, with no discussion of why this was occurring, no information as to what led to this decision and no opportunity to dispute or appeal it. Some faceless, nameless, bureaucratic Account Specialist had declared me persona non grata. Assuming the Account Specialist is even a person and not a bot.

My virtual death as an Amazon customer.

Quite a few of the most recent complaints on BBB for Amazon are about inexplicable account closures. Luckily, my dependence on Amazon is negligible. Though I’ve been a customer 10 years, I wasn’t using AWS actively and canceled Prime just a few months ago because it wasn’t worth the costs. However, I could only imagine how this could have affected me. At dinner the next day, a friend in the privacy/security industry was shocked and concerned because he orders virtually everything off Amazon, as I know many do. What if I used Kindle, would I lose all my purchases? I’d be more concerned about those whose livelihood depended on the Amazon service. What if I was an associate and depended on Amazon referrals for this blog’s revenue? Or a reseller on Amazon? What if I had had extensive use of AWS for my business? Perhaps if I was in one of these categories they would have been more circumspect in their closing my account. Perhaps I would have had another avenue to escalate, but as a regular consumer I certainly didn’t.

As for my particular circumstances, first and foremost, I’m a bit miffed that I can’t get my order history and wish-lists. I have frequently referred to my order history when doing my year end taxes as some of my purchases are business related. Also, I’ve spent years adding things to my wish-list, mostly books that I’ll probably never get around to ordering, though I have had relatives order them occasionally for birthdays in the past. There is no way for me to reconstruct this information. Account closure seems particularly vindictive. Of course, I don’t know why they closed my account, so it’s hard to say. I can only guess that they suspected some sort of fraudulent activity despite all the charges to my account having always gone through and never been disputed. If potential fraud is the reason, they could have suspended ordering privileges, but allowed the account to still access historical data. One problem with behemoths such as Amazon is that by providing multiple services on which people may become dependent, a problem in one area can have an out-sized impact on the individual. Segmentation of services is important in reducing risks. A problem with my personal account shouldn’t affect my business account. A problem with ordering physical products shouldn’t affect my Prime video membership or Kindle e-books. A problem with my use of the affiliate program shouldn’t affect my personal use as a shopper. Etc.

Maybe my tweet complaining that they were recording my call without informing me up front was the reason they closed my account.

Were I in the EU, I might have some additional recourse under the newly passed GDPR. Namely:

  • Under Article 15 I could receive the personal information they have on me, including not only my wish-list and order history, but potentially whatever information was the basis for the account closure.
  • Under Article 16, I could potentially rectify any inaccurate data about me which may have led to the decision to close my account.
  • Under Article 20, the right to data portability. While this probably doesn’t apply to my order history, since it’s arguable whether I “supplied” the information, it would apply to my wish-list, since that subjective preference data is something I supplied when I clicked “Add to my wish-list”
  • Under Article 22, the right not to be subject to automated decision making. More specifically:
    the data controller shall implement suitable measures to safeguard the data subject’s rights and freedoms and legitimate interests, at least the right to obtain human intervention on the part of the controller, to express his or her point of view and to contest the decision.


    Of course, I don’t know if the account deletion determination was based on an automated decision, but I suspect in part, they have some AI making a determination at some point to suspend or restrict use of my account.

Once it’s available, I’ve been considering applying for Estonia’s digital nomad visa. If I still care at that point, when I’m in the EU, I might subject Amazon to some of these rights which I’m not afforded as someone in the US. It’s highly likely I won’t care at that point. As I stated, my use of Amazon, unlike many associates of mine, has been limited. Maybe my account closure is a good thing as it pushes me closer to removing my reliance on the big tech firms. I’ve closed my Facebook account and haven’t used Facebook in almost 6 months (though to be transparent, I still use Instagram). I moved several years ago to a Librem 15 by Purism running PureOS (a Debian-based Linux distribution) and away from Windows, though again in full transparency, I purchased a Microsoft Surface Pro for presentations (on newegg, btw, where I’ll obviously be shopping more from now on). At least Microsoft is more in the pro-privacy camp. I’m still trying to extricate myself from Google. I purchased an iPhone (again on newegg) for the first time having been an Android user for many years. I’m still, unfortunately, on Google Fi because I like that it works when I travel internationally. I still use Gmail for my personal email, though not for business anymore, having my own domain. One reason I still use Gmail is when I have to give my email address to someone orally, I know they aren’t going to misspell gmail.com like they might privacymaverick.com.

Breaking down “Personal Data”

I’ve railed for years against the use of PII (or Personally Identifiable Information) as unhelpful in the privacy sphere. This term is used in some US legislation and has unfortunately made its way into the vernacular of the cyber-security industry and privacy professionals. Use of the term PII is necessarily limiting and does not allow organizations to see the breadth of privacy issues that may accompany non-identifying personal data. This post is meant to shed light on the nuances in different types of data. While I’ll reference definitions found in the GDPR, this post is not meant to be legislation specific.

Personal Data versus Non-personal Data

The GDPR defines Personal Data as “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.” The key term here in the definition is the phrase “relating to.” This broadly refers to any data or information that has anything to do with a particular person, regardless of whether that data helps identify the person or that person is known. This contrasts with non-personal data, which has no relationship to an individual.

Personal Data: “John Smith’s eyes are blue.”

In this phrase, there are three pieces of personal data. The first is the name John which is a first name related to an individual, John Smith. The second is his last name. Finally, the third is blue eyes, which also relates to John Smith.

Anonymous Data: “People’s eyes are blue.”

No personal data is indicated in the above sentence as the data doesn’t relate to an individual, identified or identifiable. It relates to people in general.

Identified Data versus Pseudonymous Data

Much consternation has been exhibited over the concept of pseudonymized data. The GDPR provides a definition of pseudonymisation: “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.” The key phrase in this definition is that data can no longer be attributed to an individual without additional data. Let me break this down.

Identified Data: “John Smith’s eyes are blue.”
The same phrase we used in our example for Personal Data is identified because the individual, John Smith, is clearly identified in the statement.

Pseudonymous (Identifiable) Data: “User X’s eyes are blue.”
Here we have processed the individual’s name and replaced it with User X. In other words, it’s been pseudonymized. However, it is still identifiable. From the definition above, Personal Data is data relating to an identified or identifiable individual. Blue eyes are still related to an identifiable individual, User X (aka John Smith). We just don’t know who he is at the moment. Potentially we can combine information that links User X to John Smith. Where some people struggle is understanding there must be some form of separation between the use of the User X pseudonym and User X’s underlying identity. Store both in one table without any access controls and you’ve essentially pierced the veil of pseudonymity. WARNING: Here is where it can get tricky. Blue eyes are potentially identifying. If John Smith is the only user with blue eyes, it makes it much easier to identify User X as John Smith. This is a huge pitfall as most attributable data is potentially re-identifying when combined with some other data.
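To make the separation concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names, the "User-" prefix and the sample records are hypothetical, and a real system would keep the key table in a separately secured store rather than in an in-memory dictionary. It also shows the pitfall above: a value held by only one person singles out that pseudonym.

```python
import secrets
from collections import Counter


def pseudonymize(records, name_field="name"):
    """Split records into pseudonymized rows and a separate key table.

    The key table (pseudonym -> real name) is the "additional information"
    that has to be stored apart from the rows and access controlled.
    """
    key_table = {}
    pseudonym_for = {}
    rows = []
    for record in records:
        name = record[name_field]
        if name not in pseudonym_for:
            pseudonym = "User-" + secrets.token_hex(4)
            while pseudonym in key_table:  # avoid an unlikely collision
                pseudonym = "User-" + secrets.token_hex(4)
            pseudonym_for[name] = pseudonym
            key_table[pseudonym] = name
        row = {k: v for k, v in record.items() if k != name_field}
        row["pseudonym"] = pseudonym_for[name]
        rows.append(row)
    return rows, key_table


def singled_out(rows, field):
    """Return pseudonyms whose value for `field` is unique in the data set."""
    counts = Counter(row[field] for row in rows)
    return [row["pseudonym"] for row in rows if counts[row[field]] == 1]


# Hypothetical sample data.
records = [
    {"name": "John Smith", "eye_color": "blue"},
    {"name": "Jane Doe", "eye_color": "brown"},
    {"name": "Ann Lee", "eye_color": "brown"},
]
rows, key_table = pseudonymize(records)
print(rows)
print(singled_out(rows, "eye_color"))
```

Anyone holding only `rows` sees no names, but the one pseudonym returned by `singled_out` belongs to the only blue-eyed user, so outside knowledge of John Smith's eye color is enough to re-identify him without ever touching `key_table`.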

Identifying Data versus Attributable Data

In looking at the phrase “John Smith’s eyes are blue” we can distinguish between identifying data and attributable data.

Identifying Data: “John Smith”
Without going into the debate over the number of John Smiths in the world, we can consider a person’s name as fairly identifying. While John Smith isn’t necessarily uniquely identifying, a type of data, a name, can be uniquely identifying.

Attributable Data: “blue eyes”
Blue eyes is an attribution. It can be attributable to a person, in the case of our phrase “John Smith’s eyes are blue.” It can be attributable to a pseudonym: “User X’s eyes are blue.” As we’ll see below, it can also be attributed anonymously.

Anonymous Data versus Anonymized Data

GDPR doesn’t define anonymous data but in Recital 26 it says “anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” In the first example, I distinguished Personal Data from Anonymous Data, which didn’t relate to a specific individual. Now we need to consider the scenario where we have clearly Personal Data which we anonymize (or render anonymous in such a manner that the data subject is not or no longer identifiable).

Anonymous Data: “People’s eyes are blue.”
For this statement, we were never talking about a specific individual; we’re making a generalized statement about people and an attribute shared by people.

Anonymized Data: “User’s eyes are blue.”
For this statement, we took Identified Data (“John Smith’s eyes are blue”) and processed it in a way that is potentially anonymous. We’ve now returned to the conundrum presented with Pseudonymous data. Specifically, if John Smith is the only user with blue eyes, then this is NOT anonymous. Even if John Smith is one of a handful of users with blue eyes, the degree of anonymity is fairly low. This is the concept of k-anonymity, whereby a particular individual is indistinguishable from k-1 other individuals in the data set. However, even this may not give sufficient anonymization guarantees. Consider a medical dataset of names, ethnicities and heart condition. A hospital releases an anonymized list of heart conditions (3 people with heart failure, 2 without). Someone with outside knowledge (that those of Japanese descent rarely have heart failure and the names of patients) could make a fairly accurate guess as to which patients had heart failure and which did not. This revelation brought about the concept of l-diversity in anonymized data. The point here is that unlike Anonymous Data which never related to a specific individual, Anonymized Data (and Pseudonymized Data) should be carefully examined for potential re-identification. Anonymizing data is a potential minefield.
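Both measures can be checked mechanically. The sketch below is a simplified illustration, not a complete anonymization test: the function names, the group labels and the five sample rows are hypothetical, the data set is assumed to be non-empty, and l-diversity is computed in its simplest "distinct values" form.

```python
from collections import defaultdict


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifiers."""
    group_sizes = defaultdict(int)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        group_sizes[key] += 1
    return min(group_sizes.values())


def l_diversity(rows, quasi_identifiers, sensitive_field):
    """Fewest distinct sensitive values found within any one group."""
    group_values = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        group_values[key].add(row[sensitive_field])
    return min(len(values) for values in group_values.values())


# Hypothetical release: names removed, five patients, 3 with heart failure
# and 2 without.
released = [
    {"group": "A", "heart_failure": True},
    {"group": "A", "heart_failure": True},
    {"group": "A", "heart_failure": True},
    {"group": "B", "heart_failure": False},
    {"group": "B", "heart_failure": False},
]

print(k_anonymity(released, ["group"]))
print(l_diversity(released, ["group"], "heart_failure"))
```

In this sample every patient hides among at least one other (k of 2), yet each group holds only one value for the sensitive field (l of 1). Knowing which group a patient belongs to therefore reveals their heart condition, which is the gap l-diversity was introduced to expose.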

If you need help navigating this minefield, please feel free to reach out to me at Enterprivacy Consulting Group.

Bots, privacy and suicide

I had the pleasure of serving last week on a panel at the Privacy and Security Forum with privacy consultant extraordinaire Elena Elkina and renowned privacy lawyer Mike Hintze. The topic of the panel was Good Bots and Bad Bots: Privacy and Security in the Age of AI and Machine Learning. Serendipitously, on the plane to D.C. earlier that morning someone had left a copy of the October issue of Wired Magazine, the cover of which displayed a dark and grim image of Ryan Gosling, Harrison Ford, Denis Villeneuve, and Ridley Scott from the new dystopian film, Blade Runner 2049. Not only was this a great intro to the idea of bots (in the movie’s case human-like androids) but the magazine contained two articles pertinent to our panel discussion: “Q: In, say, a customer service chat window, what’s the polite way to ask whether I’m talking to a human or a robot?” and “Stop the chitchat: Bots don’t need to sound like us.” Our panel dove into the ethics and legality of deception, in say a customer service bot pretending to be human.

While the idea is fresh in my mind, I wanted to take a moment to replay some of the concepts we touched upon for a wider audience and talk about the case study we used in more detail than the forum allowed. First off, what did we mean by bots? I don’t claim this is a definitive definition but we took the term, in this context, to mean two things:

  • Some form of human-like interface. This doesn’t mean they have to have the realism of Replicants in Blade Runner, but some mannerisms by which a person might mistake the bot for another person. This goes back to the days, as Elena pointed out, of Alan Turing and his Turing test, years before any computer could even think about passing. (“I see what you did there.”). The human-like interface potentially has an interesting property: are people more likely to let their guard down and share sensitive information if they think they are talking to another person? I don’t know the answer to that and there may be some academic research on that point. If there isn’t, I submit that it would make for some interesting research.
  • The second is the ability to learn and be situationally aware. Again, this doesn’t require the super sophistication of IBM’s Watson but any ability to adapt to changing inputs from the person with whom it is interacting. This is key, like the above, to giving the illusion a person is interacting with another person. By counter example, Tinder is littered with “bots” that recite scripts with limited, if any, ability to respond to interaction.

Taxonomy of Risk

Now that we have a definition, what are some of the heightened risks associated with these unique characteristics of a bot that, say, a website doesn’t have? I use Dan Solove’s Taxonomy of Privacy as my go-to risk framework. Under the taxonomy I see 5 heightened risks:

  1. Interrogation (questioning or probing of personal information): In order to be situationally aware, to “learn” more, a bot may ask questions of someone. Those questions could go too far. While humans have developed social filters, which allow us to withhold inappropriate questions, a bot lacking a moral or social compass could ask questions which make the person uncomfortable or are invasive. My classic example of interrogation is an interview where the interviewer asks the candidate if they are pregnant or planning to become pregnant. Totally inappropriate in a job interview. One could imagine a front-line recruitment bot smart enough to know that pregnancy may impact immediate job attendance of a new hire but not smart enough to know that it’s inappropriate to ask that question (and certainly illegal in the U.S. to use pregnancy as a discriminatory criterion in hiring).
  2. Aggregation (combining of various pieces of personal information): Just as not all questions are interrogations, not all aggregation of data creates a privacy issue. The issue arises when data is combined in new and unexpected ways, resulting in the disclosure of information that the individual didn’t want to disclose. Anyone could reasonably assume Target is aggregating sales data to stock merchandise and make broad decisions about marketing, but the ability to discern the pregnancy of a teenager from non-baby-related purchases was unexpected, and uninvited. For a pizza ordering bot, consider the difference between knowing my last order was a vegetable pizza and discerning that I’m a vegetarian (something I didn’t disclose) because when I order for one it’s always vegetable, but if I order for more than one, it includes meat dishes.
  3. Identification (linking of information to a particular individual): There may be perfectly legitimate reasons a bot would need to identify a person (to access that person’s bank account, for instance), but identification becomes an issue when the individual expects to remain anonymous or, at the very least, pseudonymous. If I’m interacting with a bot as StarLord1999 and all of a sudden it calls me by the name Jason, I’m going to be quite perturbed.
  4. Exclusion (failing to let an individual know about the information that others have about her and participate in its handling or use): As with aggregation, a situationally aware bot pulling information from various sources may alter its interaction in a way that excludes the individual from some service, without the individual understanding why and based on data the individual doesn’t know it has. For instance, imagine a mortgage loan bot that pulls demographic information based on a user’s current address and steers them towards less favorable loan products. That practice sounds a lot like redlining and, if it has discriminatory effects, could be illegal in the U.S.
  5. Decisional Interference (intruding into an individual’s decision making regarding her private affairs): The classic example I use for decisional interference is China’s historic one-child policy, which interfered with a family’s decision making on their family make-up, namely how many children to have. So you ask, how can a bot have the same effect? Note the law is only influential, albeit in a very strong way. A family can still physically have multiple children, hide those children or take other steps to disobey the law, but the law is still going to have a manipulative effect on the decision making. A bot, because of its human interface, advanced learning and situational knowledge, can be used to psychologically manipulate people. If the bot knows someone is psychologically prone to a particular type of argument style (say, appealing to emotion), it can use that and the information at its disposal to subtly persuade that person towards a certain decision. This is a form of decisional interference.
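
To make the aggregation risk concrete, here is a minimal sketch of the pizza ordering example. Everything in it is hypothetical: the Order record, the infer_vegetarian function and the min_solo_orders threshold are names I made up for illustration, not part of any real ordering bot. The point is only that each order is innocuous on its own, while the combination yields something the customer never disclosed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """One pizza order as a hypothetical bot might log it."""
    party_size: int
    contains_meat: bool


def infer_vegetarian(orders, min_solo_orders=3):
    """Guess whether the customer is vegetarian from order history alone.

    Assumption: orders for one person reflect the customer's own diet.
    Returns None when there are too few solo orders to say anything.
    """
    solo = [o for o in orders if o.party_size == 1]
    if len(solo) < min_solo_orders:
        return None
    # Vegetarian is inferred only if every solo order was meat-free.
    return all(not o.contains_meat for o in solo)


history = [
    Order(party_size=1, contains_meat=False),
    Order(party_size=4, contains_meat=True),   # group order, ignored
    Order(party_size=1, contains_meat=False),
    Order(party_size=1, contains_meat=False),
]

print(infer_vegetarian(history))
```

A privacy friendly design might simply never compute this inference, or might ask the customer before storing it.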

Architecture and Policy

I’m not going to go into a detailed analysis of how to mitigate these issues, but I’ll touch on two thoughts: first, architectural design and second, public policy analysis. Privacy friendly architecture can be analyzed along two axes, identifiability and centralization. The more identified and more centralized the design, the less privacy friendly it is. It should be obvious that reducing identifiability reduces the risk of identification and aggregation (because you can’t aggregate external personal data from unidentified individuals), so I’ll focus here on centralization. Most people would assume bots are run by a centralized server, but that need not be the case. The Replicants in Blade Runner and “autonomous” cars are both prominent examples of bots which are decentralized. In fact, it should be glaringly apparent that a self-driving car being operated by a server in some warehouse introduces unnecessary safety risks. The latency of the communication, the potential for command injection at the server or network layer, and the potential for service interruption are unacceptable. The car must be able to make decisions immediately, without delay or risk of failure. Now, decentralization doesn’t help with many of the bot-specific issues outlined above, but it does help with other more generic privacy issues, such as insecurity, secondary use and others.
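
As a rough illustration of the two axes, the sketch below places a few hypothetical bot designs on an identifiability scale and a centralization scale and ranks them. The three levels on each axis, the example designs and the equal weighting of the two axes are all my own assumptions for the sake of the example; a real assessment would be qualitative, not a single number.

```python
from dataclasses import dataclass
from enum import IntEnum


class Identifiability(IntEnum):
    ANONYMOUS = 0
    PSEUDONYMOUS = 1
    IDENTIFIED = 2


class Centralization(IntEnum):
    ON_DEVICE = 0
    HYBRID = 1
    CENTRAL_SERVER = 2


@dataclass(frozen=True)
class BotDesign:
    name: str
    identifiability: Identifiability
    centralization: Centralization

    def privacy_exposure(self) -> int:
        # Illustrative only: a plain sum weights both axes equally.
        return int(self.identifiability) + int(self.centralization)


# Hypothetical designs, not descriptions of real products.
designs = [
    BotDesign("cloud banking bot", Identifiability.IDENTIFIED,
              Centralization.CENTRAL_SERVER),
    BotDesign("on-device assistant", Identifiability.PSEUDONYMOUS,
              Centralization.ON_DEVICE),
    BotDesign("anonymous FAQ bot", Identifiability.ANONYMOUS,
              Centralization.CENTRAL_SERVER),
]

# Lower exposure means more privacy friendly under this toy scoring.
ranked = sorted(designs, key=BotDesign.privacy_exposure)
for design in ranked:
    print(design.privacy_exposure(), design.name)
```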

Public policy analysis is something I wanted to introduce with my case study during the interactive portion of the session at the Privacy and Security Forum. The case study I presented was as follows:

Kik is a popular platform for developing bots (https://bots.kik.com/#/). Kik is a mobile chat application used by 300 million people worldwide, and an estimated 40% of US teens have used the application at one time or another. The National Suicide Prevention Hotline, recognizing that most teens don’t use telephones, wants to interact with them in services they use. The Hotline wants to create a bot to interact with those teens and suggest helpful resources. Where the bot recognizes a significant risk of suicide, rather than just casual inquiries or people trolling the service, the interactions will first be monitored by a human who can then intervene in place of the bot, if necessary.
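
The escalation path in the case study can be sketched as a small routing function. This is an assumption-laden illustration, not the Hotline’s design: the case study doesn’t say how risk would be measured, so risk_score, monitor_threshold and the Handling states are hypothetical names, and the 0.5 default is a placeholder rather than any kind of clinical recommendation.

```python
from enum import Enum


class Handling(Enum):
    BOT_ONLY = "bot_only"
    HUMAN_MONITORED = "human_monitored"
    HUMAN_TAKEOVER = "human_takeover"


def route(risk_score: float, human_intervenes: bool = False,
          monitor_threshold: float = 0.5) -> Handling:
    """Decide who handles a conversation.

    risk_score is a hypothetical 0-to-1 estimate produced elsewhere.
    Below the threshold the bot answers alone; at or above it a human
    monitors first and may then take over in place of the bot.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    if risk_score < monitor_threshold:
        return Handling.BOT_ONLY
    if human_intervenes:
        return Handling.HUMAN_TAKEOVER
    return Handling.HUMAN_MONITORED


print(route(0.1))
print(route(0.8))
print(route(0.8, human_intervenes=True))
```

Keeping the human decision as an explicit input reflects the case study’s wording: the bot flags, but a person decides whether to step in.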

I’ll highlight one issue, decisional interference, to show why it’s not a black-and-white analysis. Here, one of the objectives of the service and the bot is to prevent suicide. As a matter of public policy, we’ve decided that suicide is a bad outcome and we want to help people who are depressed and potentially suicidal get the help they need. We want to interfere with this decision. Our bot must be carefully designed to promote this outcome. We don’t want the bot to develop in a way that doesn’t reflect this. You could imagine a sophisticated enough bot going awry and actually encouraging callers to commit suicide. The point is, we’ve done that public policy analysis and determined what the socially acceptable outcome is. Many times organizations have not thought through what decisions might be manipulated by the software they create and what public policy should guide the way they influence those decisions. Technology is not neutral. Whether it’s decisional interference or exclusion or any of the other numerous privacy issues, thoughtful analysis must precede design decisions.