August 2011 – Privacy Maverick

Google still doesn’t get it.

Upon further reflection after my interview with Google for their Product Manager, Privacy position, I’d have to say that I still think they don’t “get privacy.” I was trying to explain to the interviewer that privacy by design was the way to go for Google: creating necessary functionality while preserving user privacy. While the interviewer, upon my questioning, affirmed Google’s commitment to privacy, most of his substantive comments about my suggestions and the issues I raised were dismissive or patronizing of the privacy concerns. I must admit my own great liking for the benefits of Google products... but maybe I’m giving up too much.

The Business Case for Privacy By Design

Privacy by design represents a business case that is very difficult to understand. Typically privacy falls under the rubric of compliance (i.e. we have to do it to comply with the law). Rarely do companies willingly engage in privacy practices. Why? The business case. What is the bottom-line benefit? How do you quantify privacy in dollars or euros? Businesses aren’t willing to spend the extra money up front on extra engineering if they can’t see a tangible return later down the line. Over the 15 years that I’ve been following privacy issues, it has become clear to me and many others that very few consumers will pay more for privacy. So why invest in it unless you have to?

As this InfoLawGroup blog post points out, the benefit is in brand differentiation. Consumers may not pay for privacy, but when given the choice between equivalent products or services, one that preserves privacy and one that doesn’t, they will choose the privacy-protective option. In addition to instituting privacy by design, a company must make its privacy-protective ways obvious to consumers. It does this by following the sixth principle of PbD, visibility and transparency. This entails not just putting its convoluted privacy statement (that no one reads) front and center but giving users information as close to the point of data collection as possible. The study on which the InfoLawGroup post was based took the rarely used P3P policies and integrated them with search results to put privacy information up front, before consumers invested their time and energy in a site. The study suggests that companies wanting to use privacy as a brand differentiator should be blatant about their policies and not bury them.

Returning to the business case, the benefit is going to be hard to quantify, though companies with the resources could emulate the study or run focus groups to identify how pushing privacy could increase customer confidence and make customers choose their brand over a competitor’s.

To quote from Ernst & Young’s top 11 privacy trends, “Organizations that ignore the importance of protecting personal information from outside — or inside — will suffer more than financial penalties. They may also see their reputation damaged and their brand negatively impacted.”

Privacy by design lost

This happened to a friend of mine recently. He went to a nightclub he used to frequent, and before he could say anything they scanned his identification. This is what I would call the opposite of privacy by design. Bars and nightclubs are traditionally very privacy friendly. They follow a credentialing model: validate people before entry and then don’t track them. Once you’ve established that you meet the criteria for entry (i.e. being over 21), the bars don’t care who you are. This is harder to do online, though it can be done with privacy-enhancing technology; offline, in places like bars, it comes naturally. But now some bars have decided to integrate technology into their operations for tracking and marketing purposes.

FAA and flight routes

Recently, over the objections of many privacy advocates and airplane owners, the FAA moved to make more flight route information open and publicly available. Specifically, the FAA operates a program called BARR (Block Aircraft Registration Request), which allows certain aircraft to be exempt from the public records of flight routes. The FAA collects these flight routes for every flight in and out of the US in order to deal with traffic control issues, congestion, etc. However, this could clearly be sensitive information for some aircraft: those with at-risk passengers or cargo, those doing surprise inspections on facilities, etc.

This is very typical of the government’s modus operandi: collect potentially sensitive information and then attempt to secure it (or its analog, making it illegal to look at information that is clear for all to see). In this case, though, the government has made the criteria for exclusion much more stringent, barring those who may have a legitimate interest in securing their flights but don’t meet the government threshold. This very much reminds me of the law surrounding home addresses (or other PII) of government employees in Florida. Florida has very broad public records laws and, generally, one can get the home addresses of government employees. There are numerous exceptions for judges, law enforcement, child protective service employees, etc.: those deemed by the state legislature to be high risk. This solution is no solution, though, because first it deprives the citizenry of public records and adds to the long list of exemptions to the public records laws, and second it deprives individuals of control over their personal information. A better solution would be not to collect home address information from employees, or at least to give them the option of not supplying that information in the first place.

Returning to the FAA issue, there are other options. Assuming they do need it for legitimate purposes, the FAA could collect the information in such a way that it doesn’t have the information directly but only in the aggregate (to assess congestion, etc). If it needs the information for legitimate law enforcement purposes, it could use some key escrow or blinding method to store the information but only have it available with a valid court order. Without knowing all the functional requirements of the system, it’s hard to design a privacy protective method, but my purpose is to say that it could be done…… if someone cared enough to do it.

Financial Cryptography 2012

I’m trying to organize a legal panel for the conference on Financial Cryptography and Data Security 2012. The 16th annual conference will be held at the Divi Flamingo Beach Resort in Bonaire. The subject of the legal panel will be

Privacy: Technology versus the Law.

The widespread adoption of PETs (privacy enhancing technologies) faces many obstacles, including misunderstanding by corporate leaders, a lack of technical skills within organizations, and business case justification. Are laws and regulations another impediment? Regulators seem to favor one of two approaches: outright bans on the collection of information, or requiring information to be collected and then attempting to layer security on top. What laws and regulations do our panelists consider problematic, and what solutions do we have for getting government to support rather than hinder the adoption of PETs?

Aggregation without revelation

I was given the following problem:

You have a web based mail system that you want to provide auto-correction on spell checking. The spell checking system needs to not only correct for common words, but needs to account for proper names of places and obscure languages. Most importantly, privacy considerations need to be taken into account so as not to store or leak one individual’s spelling errors. How do you do it?

Anonymization of data may be insufficient; see AOL’s 2006 release of supposedly anonymized search logs.

Most intelligent spell checkers work by recording a word and then the changes made to that word. If lots of people make the same change, chances are it’s because of a typographical or spelling error. You can then suggest that change as a correction to others who type in the original, potentially misspelled, word. How do you do that without storing the misspelled word or the correction unless you have it in its aggregate form?

Here is my suggested solution:

Take the user’s email address, treat it as a number, and take it modulo 256 (1 byte). That essentially puts everybody into one of 256 buckets. The reason will be clear later.

On the client side (via JavaScript), also take each word as it’s typed and reduce that word, again by modulo, to 3 bytes. You’ll have some collisions, but not more than a few hundred per bucket in the worst case. (Some statistical analysis should be done, but this is my back-of-the-envelope calculation.)
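The two bucketing steps can be sketched as follows. This is only an illustration: the post says “modulo” without saying how a string becomes a number, so using SHA-256 for that step is an assumption, as are the function names.

```python
import hashlib


def user_bucket(email: str) -> int:
    """Map an email address to one of 256 buckets (one byte)."""
    # Assumption: normalise the address and hash it to get a number.
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 256


def word_bucket(word: str) -> int:
    """Compress a typed word into a 3-byte value (0 .. 2**24 - 1)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << 24)


print(user_bucket("someone@example.com"), word_bucket("hapyy"))
```

Because many different words share each 3-byte value, the server only ever learns which bucket a word falls in, not the word itself.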

If the user does not change the word, do nothing.

If the user makes a correction, take the original and the correction and use M-of-N secret splitting to split that information into N parts. For this particular exercise, you’ll split the information into 256 parts. You’ll then take the z-th part (z being the value of the user’s email modulo 256, computed above) and send that back to the server. In and of itself, that one part reveals nothing.
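Here is a rough sketch of the splitting step using Shamir’s secret sharing, which is one common M-of-N scheme; the post does not name a specific one. A key assumption, not stated in the post: the polynomial coefficients are derived deterministically from the (original, correction) pair, so that parts computed independently by different users belong to the same polynomial and can be combined. The field prime, the encoding, and all function names are illustrative. Requires Python 3.8+ for the modular inverse via `pow`.

```python
import hashlib

PRIME = 2**521 - 1  # Mersenne prime; room for pairs up to about 65 bytes


def encode(original: str, correction: str) -> int:
    data = original.encode("utf-8") + b"\x00" + correction.encode("utf-8")
    secret = int.from_bytes(data, "big")
    if secret >= PRIME:
        raise ValueError("pair too long for this field")
    return secret


def coefficients(secret: int, m: int) -> list:
    # Assumption: coefficients are hashed from the secret itself, so every
    # user who makes the same correction builds the same polynomial.
    coeffs = [secret]
    for i in range(1, m):
        seed = secret.to_bytes(66, "big") + i.to_bytes(2, "big")
        coeffs.append(int.from_bytes(hashlib.sha512(seed).digest(), "big") % PRIME)
    return coeffs


def share(original: str, correction: str, z: int, m: int) -> tuple:
    """Return the z-th part (z in 0..255) of an m-of-256 split."""
    x, y = z + 1, 0  # x must be non-zero
    for c in reversed(coefficients(encode(original, correction), m)):
        y = (y * x + c) % PRIME  # Horner evaluation of the polynomial
    return x, y


def reconstruct(shares: list) -> tuple:
    """Lagrange interpolation at x = 0 over at least m distinct parts."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for k, (xk, _) in enumerate(shares):
            if k != j:
                num = num * (-xk) % PRIME
                den = den * (xj - xk) % PRIME
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    data = secret.to_bytes((secret.bit_length() + 7) // 8, "big")
    original, correction = data.split(b"\x00", 1)
    return original.decode("utf-8"), correction.decode("utf-8")


# Three users in different buckets each send one part of a 3-of-256 split.
parts = [share("hapyy", "happy", z, 3) for z in (5, 17, 200)]
print(reconstruct(parts))
```

With fewer than M distinct parts the interpolation yields an unrelated value, which is what keeps a single user’s correction from being read on its own.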

However, once it has M parts, the server can reconstruct the original word and the correction and make them available for others to autocorrect. When will this happen? Well, the first person provides the first part, but the next person to send in a correction won’t necessarily provide a second part: they have a 1/256 chance of providing the same part as the first person. So how many people do you need to get M parts? It comes down to probability. You obviously need at least M people, but M people don’t guarantee M different parts. By adjusting M, the administrator could set a threshold of how many people you would need to have a certain percentage chance of collecting the parts. For example, M might be set so that 30 people give you a 90% chance, 50 people a 95% chance, and 100 people a 99% chance. Once the information is revealed, you’re confident it’s based on aggregate errors, not one person’s.
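The probability in question can be computed exactly rather than guessed. The sketch below assumes every contributor lands in one of the 256 buckets uniformly and independently (real email addresses may not be that well behaved), and tracks the distribution of the number of distinct parts seen so far. The function name is illustrative.

```python
def prob_threshold_reached(people: int, m: int, n: int = 256) -> float:
    """Chance that `people` contributors supply at least m distinct parts."""
    # dist[d] = probability that exactly d distinct parts have been seen
    dist = [0.0] * (n + 1)
    dist[0] = 1.0
    for _ in range(people):
        nxt = [0.0] * (n + 1)
        for d, p in enumerate(dist):
            if p == 0.0:
                continue
            nxt[d] += p * d / n  # contributor repeats a part already seen
            if d < n:
                nxt[d + 1] += p * (n - d) / n  # contributor adds a new part
        dist = nxt
    return sum(dist[m:])


# Compare candidate thresholds M for a few crowd sizes.
for m in (20, 25, 30):
    print(m, [round(prob_threshold_reached(k, m), 3) for k in (30, 50, 100)])
```

An administrator could run this for candidate values of M and pick the one whose curve matches the crowd size they are comfortable with.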

Now, when a user types in an email, after each word the system can calculate the 3-byte compression and send that out to the server. If the server finds any matches, it returns the list of corrections (which may be only a few items long or a hundred). The client side then compares the word against the corrections and determines which are likely and which are not (i.e. “hapyy” to “corrugated” would not be a reasonable match but “hapyy” to “happy” would be, even though, say, “hapyy” and “corugated” both fall in the same 3-byte bucket).
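The post doesn’t say how the client decides which corrections are “likely.” One plausible way, assumed here, is to keep only candidates within a small edit (Levenshtein) distance of the typed word. The cutoff of 2 and the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def plausible(word: str, candidates: list, max_distance: int = 2) -> list:
    """Keep only the corrections close enough to the typed word."""
    return [c for c in candidates if edit_distance(word, c) <= max_distance]


# The server returned every correction stored in this word's 3-byte bucket.
print(plausible("hapyy", ["corrugated", "happy"]))
```

The filtering happens entirely on the client, so the server never learns which of the returned corrections was relevant.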

So here you have a solution which only reveals information in the aggregate and gives some level of security so the server doesn’t know what the person is typing.

Google’s strategy

This post is related to privacy insofar as it’s hard these days to talk about Google and not mention privacy in the same breath (at least for anyone involved in the privacy profession).

I’ve been thinking about Google a lot lately: strategically, competitively, and in terms of their product mix. I’ve come to a rather radical conclusion that I haven’t seen elsewhere, so I thought I’d share. Ostensibly, Google’s mission is to “organize the world’s information and make it universally accessible and useful.” However, strategically they seem to be on a mission to become the universal middle man.

Consider the number of Google products and services that aim to position them between consumers and producers:

Chrome

Chrome Frame

Chrome OS

Android

Android Market

goo.gl (URL shortener)

Offers, AdWords and other advertising products

TV

Wallet

Page Speed Service

Public DNS

Google can’t physically come between consumers and producers (as, say, an ISP could, though they may be going this route too with Google Fiber), but they can interject themselves virtually. Because of the need to be adopted on either the consumer or producer side of the equation, they add value by offering services that are free (or that provide a benefit over costs, like AdWords). The goal of these services is, again, to get Google in the middle of the equation.

Where the lines between producer and consumer are blurred, such as in interpersonal communications, Google also offers a suite of products to be the conduit between the parties: Google+, Voice, Gmail, Groups, YouTube.

What’s interesting is that when you start viewing the competition in this light, there really is no competition. Apple is doing the same thing with its operating systems, iTunes store, and app market, but there are two obvious distinctions: 1) Apple is only positioning itself between consumers and producers of information, not products; it isn’t positioning itself between consumers and producers of consumer goods, nor in the consumer-as-producer (i.e. social) market, and 2) Apple is attempting to rent seek, taking advantage of its position to maximize its income.

The next competitor is Facebook, which can take advantage of its social networking site. To a degree it’s trying to leverage that to get between producers and consumers with its advertising platform and its application space, but this strategy is necessarily limited.

Microsoft probably has the broadest competitive suite of products and services. However, its strategy doesn’t seem as cohesive.

There are a host of other companies that compete with Google in niche markets, but nobody (except maybe Microsoft) has the breadth of services strategy aimed at wedging themselves into every transaction.

How to store all the world’s phone calls.

I have a potential interview coming up with a certain technology company, who shall not be named though let’s just say their motto should be Go Big or Go Home. In preparing for the interview, the recruiting contact made a number of suggestions including be prepared to answer the following question:

Expect questions designed to test your analytical and technical ability (example: how would you store all the phone calls in the world?).

I’ve been tossing and turning all night thinking about this question, even though I’m sure it won’t be on the interview. How would I answer it? I think the question is more, how would I not answer it. That question begs a dozen more:

What is the purpose of storing all the phone calls?

Who are you storing it for? The originators? The world?

When you say store all the world’s phone calls, do you mean the metadata (i.e. who called whom and when) or do you mean actual recordings of the conversations?

If we’re talking recordings there are many legal and societal implications before we even address the engineering.

How will this information be indexed? retrieved? accessed?

Do we need it searchable or just filed away?

How long do you need to store this information?

The security and privacy implications of such a system are huge. Consider a phone app that listens for sensitive data much like a keylogger. See http://blogs.computerworld.com/17785/sensory_malware_android_app_listens_then_steals_credit_card_data

In terms of pure data, you’re not really looking at much (compared to music audio or video). Voice-quality audio is about 8 kbit/s, or 1 kbyte/s. Assuming the average phone user spends one hour per day on the phone, that’s 1 kbyte/s * 60 seconds * 60 minutes = 3,600 kbytes = 3.6 Mbytes per user per day. Not much by today’s standards. Assuming 1 billion phones worldwide, that’s 3.6 billion megabytes per day (or 3.6 million gigabytes). Now this isn’t insubstantial. We’re talking 3,600 terabytes PER DAY. Google’s search index uses around 1,000 terabytes; YouTube about 45 terabytes.
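The back-of-the-envelope numbers are easy to check in a few lines. The inputs are the post’s own assumptions (8 kbit/s voice, one hour per user per day, one billion phones), with decimal units throughout; the variable names are just for illustration.

```python
BITS_PER_SECOND = 8_000          # voice-quality audio
SECONDS_ON_PHONE_PER_DAY = 60 * 60
PHONES = 1_000_000_000

# Bytes recorded per user per day, then for all phones together.
bytes_per_user_per_day = BITS_PER_SECOND // 8 * SECONDS_ON_PHONE_PER_DAY
total_bytes_per_day = bytes_per_user_per_day * PHONES

print(bytes_per_user_per_day / 1e6, "MB per user per day")
print(total_bytes_per_day / 1e12, "TB per day worldwide")
```

Changing any one input (say, a better codec or more phones) scales the total linearly, so the estimate is easy to redo under different assumptions.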

I would have to say that one approach would be to use distributed storage. Let each user store their own phone calls locally on their phone. Now we’re back at 3.6 MB per phone user per day, which could easily be stored on a 1 GB SD card. In fact, you could store about nine months’ worth of phone calls (roughly 277 days) on each phone. Of course, we’re back to the original questions I asked: why are we storing this? Does storing it in a distributed fashion like this meet our functional requirements? Even if we’re doing this for searching purposes, we could index the phone calls in a centralized location and allow them to be searched, but pull up the actual phone calls from the phones. Obviously each phone call would be stored on at least two phones, so you have some redundancy, but you could have even more by having each phone user agree to store (encrypted, of course) others’ phone calls. Using a 10 to 1 factor would still allow the storage of nearly a month’s worth of phone calls on each phone equipped with a 1 GB card.
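The retention figures follow from simple division. In this sketch the “10 to 1 factor” is read as each phone holding ten times its own daily volume (its own calls plus encrypted copies of others’), which is one interpretation of the post; the function and parameter names are illustrative.

```python
def days_of_storage(card_bytes: int, bytes_per_day: int, copies: int = 1) -> float:
    """Days of calls a card can hold when each day's data is stored `copies` times over."""
    return card_bytes / (bytes_per_day * copies)


CARD_BYTES = 10**9          # 1 GB card, decimal units
PER_USER_DAILY = 3_600_000  # 3.6 MB per user per day, from the estimate above

print(days_of_storage(CARD_BYTES, PER_USER_DAILY))             # own calls only
print(days_of_storage(CARD_BYTES, PER_USER_DAILY, copies=10))  # 10 to 1 factor
```

Larger cards or a lower redundancy factor push the retention window out proportionally.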

Now that I’ve thought it through, I hope they ask this question…