By analogy

Humans often learn by analogy. You take something you don’t understand and try to analogize it to something you do understand. This is one of the reasons wave-particle duality is so difficult to understand: light is analogous to two incongruous models we already know, waves and particles.

Cryptography is also very difficult to understand, because people have a hard time bridging the gap between the things they know and are familiar with and the things they don’t. Some cryptographic techniques simply don’t have good real-world analogies. Like quantum physics, they are unfathomable to most people.

M of N secret splitting is just such a creature: something with no real-world analogy that is hard for people to grasp. Even harder to grasp, it seems, is the multitude of applications such a neat technique offers us.

Say you have a sentence, “Now the right to life has come to mean the right to enjoy life, — the right to be let alone.”, which you want to split up and store in N locations. The most obvious option, storing the entire sentence in each location, isn’t very secure, since any one of those locations could be compromised and give up your secret. You could instead split the sentence into N parts and store one part in each location; then it would take someone compromising every location to reconstruct your sentence. That’s better, but now we suffer from another problem: what happens if we lose one of our locations (to corruption or destruction)? Now we can’t reconstruct the sentence, because we’ve lost data. What’s a happy medium?

M of N secret splitting allows us to split the sentence into N parts such that any M of them will reconstruct the entire sentence. For example, we could split it into 10 parts where any 5 reconstruct the sentence. This gives us the security of an attacker needing to compromise 5 locations AND the resilience of being able to lose 50% of the locations and still recover the secret. Without cryptography it’s difficult to see how this could be done. What’s even more interesting is what else can be accomplished with this technique, beyond the simple idea of secret sharing. I gave one example last month, and I’m going to give another this month, one I thought about extensively during the Privacy Academy in Dallas.
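For the mathematically curious, the standard construction is Shamir’s scheme: make the secret the constant term of a random polynomial of degree M-1 over a prime field, hand out points on the curve as the N pieces, and note that any M points determine the polynomial while M-1 reveal nothing. Here is a minimal sketch in Python (the function names and choice of prime are mine, not from any particular library):

    import random
    import secrets

    PRIME = 2 ** 1279 - 1  # a Mersenne prime wide enough to hold the sentence as one number

    def _eval_poly(coeffs, x):
        # Horner's rule: coeffs[0] + coeffs[1]*x + ... (mod PRIME)
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % PRIME
        return acc

    def split_secret(secret: bytes, m: int, n: int):
        # The secret is the constant term of a random degree-(m-1) polynomial.
        coeffs = [int.from_bytes(secret, "big")]
        coeffs += [secrets.randbelow(PRIME) for _ in range(m - 1)]
        return [(x, _eval_poly(coeffs, x)) for x in range(1, n + 1)]

    def recover_secret(shares) -> bytes:
        # Lagrange interpolation at x = 0 recovers the constant term.
        total = 0
        for xi, yi in shares:
            num, den = 1, 1
            for xj, _ in shares:
                if xj != xi:
                    num = num * -xj % PRIME
                    den = den * (xi - xj) % PRIME
            total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
        return total.to_bytes((total.bit_length() + 7) // 8, "big")

    sentence = b"Now the right to life has come to mean the right to enjoy life, -- the right to be let alone."
    shares = split_secret(sentence, m=5, n=10)
    assert recover_secret(random.sample(shares, 5)) == sentence  # any 5 of 10 suffice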

Assume your customers have your mobile phone application and you want to be able to alert them to crowded conditions so they can avoid them (like traffic) or seek them out (like hot nightclubs), but you don’t want to worry about tracking customers’ location information. You could have some sort of polling system that just ticks off where customers are without storing the information, but the customers’ phones still have to tell you where they are, right? This leads to problems of hacks, leaks, or employee malfeasance. Wouldn’t it be better to have a system that gave you the information you needed without customers needing to tell you where they were? Enter M of N secret splitting.

Let each customer be identified by a unique customer name (email address or username). Take the user identifier and compress it down to one of 100 slots; in other words, take the numeric equivalent of the identifier and modulo it by 100 (100 is an arbitrary number here). That way each customer falls into an essentially random one of 100 buckets.
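A quick sketch of that step (the post says to use the numeric equivalent of the identifier directly; hashing it first, as below, is my own assumption, to keep the buckets uniform even for very similar addresses):

    import hashlib

    def bucket_for(identifier: str, buckets: int = 100) -> int:
        # Hash the identifier, then reduce modulo the bucket count.
        digest = hashlib.sha256(identifier.lower().encode()).digest()
        return int.from_bytes(digest, "big") % buckets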

Now describe the location, say “El Gaucho Inca Restaurant” (where I recently had a delicious Peruvian meal). Perform M of N secret splitting on the location, where N is 100 and M is 5. This way, you have 100 pieces and any 5 will give you the location. Now pick the piece whose number matches the slot computed for this customer and upload that piece.
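Here is a sketch of the client side, reusing the helpers above, with one detail the description leaves implicit: every phone reporting the same location must produce pieces of the same polynomial, or the pieces will never combine on the server. One way to arrange that (my assumption, not something stated in the post) is to derive the polynomial’s coefficients deterministically from the location string itself:

    def deterministic_split(secret: bytes, m: int, n: int):
        # Coefficients derived from the secret rather than drawn fresh,
        # so every client splitting the same location gets identical pieces.
        coeffs = [int.from_bytes(secret, "big")]
        for i in range(1, m):
            seed = hashlib.shake_256(secret + bytes([i])).digest(160)
            coeffs.append(int.from_bytes(seed, "big") % PRIME)
        return [(x, _eval_poly(coeffs, x)) for x in range(1, n + 1)]

    def piece_to_upload(location: str, identifier: str):
        pieces = deterministic_split(location.encode(), m=5, n=100)
        return pieces[bucket_for(identifier)]  # buckets 0-99 map to pieces 1-100

The trade-off is that anyone who can guess the location can recompute the pieces, which is inherent whenever the secret itself has low entropy.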

The company now has one piece of data it can’t do anything with. If it collects 4 others, it can reconstruct the location, but not before. Because the split is into 100 pieces, there is a 99/100 chance that the next piece it receives won’t come from the same bucket. We have met the criterion of absolutely not knowing where each person is, yet we can tell, in the aggregate, whether at least 5 people are at a location.
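A sketch of the server side. How the server groups pieces belonging to the same place is another unstated detail; here I assume clients also send a short tag, say a truncated hash of the location, much like the 3-byte word fingerprint in the spell-checking example later in this post:

    from collections import defaultdict

    pending = defaultdict(dict)  # tag -> {piece_index: piece_value}

    def receive(tag: bytes, piece):
        x, y = piece
        pending[tag][x] = y              # keyed by index, so duplicates collapse
        if len(pending[tag]) >= 5:       # five DISTINCT pieces collected
            return recover_secret(sorted(pending[tag].items())[:5]).decode()
        return None                      # fewer than five people: nothing to learn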

While this may not seem like a major revelation, many system designers and business people have difficulty grasping that a system can be built to meet its needs without storing or collecting data that would seem critical to it. I’ll be talking more about this in future posts.

Cloud Computing contracts

As many others have pointed out, cloud computing is really nothing new. Before it was called cloud computing, application service providers (ASPs) provided software not as a downloadable product but as an online service. What has really changed is the acceleration of software (or infrastructure, data, or platforms) as a much more modular and turnkey service. Service providers have minimized the transaction costs of software (or hardware). Whereas before, purchasing new or additional services took time and effort (i.e. transaction costs) on the part of both seller and buyer, now it can be requisitioned and provisioned with a few clicks of a mouse; this is the so-called utility model, where one just increases demand by adding more consuming devices and the utility provides.

However, shrinking transaction costs for efficiency means there is no longer room for substantial negotiation between provider and consumer. This leads to a gap between the consumer’s need for certain protections (e-discovery, retention, security, privacy, etc.) and the provider’s desire to limit liability and offer a one-size-fits-all service. Bigger clients, which may command attention and have some bargaining power, make it harder for service providers to offer a simple, cheap service because of the need for negotiation. The end result, I suspect, is a stratification of service providers by industry (or geography) in order to limit the need for negotiation with clients who have differing needs.

Privacy Academy 2011

I attended Privacy Academy 2011 in Dallas last week and it was quite interesting. I met a lot of people and have been contacting them furiously this week (while still trying to catch up on 2 weeks of missed work). While the seminars and lectures were thought-inspiring (especially the one on the law of obscurity), the gap between the legal privacy types and the mathematical/computer science community remains problematic. I was inspired, though, seeing Marc Rotenberg of EPIC give the headlining speech at the Friday luncheon. He mentioned many people I admire, such as Phil Zimmermann (PGP), David Chaum (DigiCash) and others. He spoke about the need for PETs and Privacy by Design, which, as I’ve mentioned, is sorely needed in the privacy professional community.

I did submit a proposal to give a speech on privacy engineering for non-engineers at the Global Privacy Summit next year in Washington. Crossing my fingers that it’s accepted.

Google still doesn’t get it.

Upon further reflection after my interview with Google for their product manager, privacy position, I’d have to say that I still think they don’t “get privacy.” I was trying to explain to the interviewer that privacy by design was the way to go for Google: creating necessary functionality while preserving user privacy. While the interviewer, upon my questioning, affirmed Google’s commitment to privacy, most of his substantive comments about my suggestions and the issues I raised were dismissive or patronizing of the privacy concerns. I must admit my own great liking for the benefits of Google products… but maybe I’m giving up too much.

The Business Case for Privacy By Design

Privacy by design presents a very difficult business case. Typically privacy falls under the rubric of compliance (i.e. we have to do it to comply with the law). Rarely do companies willingly engage in privacy practices. Why? The business case. What is the bottom-line benefit? How do you quantify privacy in dollars or euros? Businesses aren’t willing to spend extra money up front on extra engineering if they can’t see a tangible return down the line. Over the 15 years that I’ve been following privacy issues, it’s become clear to me, and many others, that very few consumers will pay more for privacy. So why invest in it unless you have to?

As this InfoLawGroup blog post points out, the benefit is in brand differentiation. Consumers may not pay for privacy, but when given the choice between equivalent products or services, one that preserves privacy and one that doesn’t, they will choose the privacy-protective option. So in addition to instituting privacy by design, a company must make its privacy-protective ways obvious to consumers. It can do this by following the fourth principle of PbD, visibility and transparency. This entails not just putting a convoluted privacy statement (that no one reads) front and center, but giving users information as close to the point of data collection as possible. The study on which the InfoLawGroup post was based used the rarely deployed P3P policies, integrating them with search results to put privacy information up front before consumers invested their time and energy in a site. The study suggests that companies wanting to use privacy as a brand differentiator should be blatant about their policies, not bury them.

Returning to the business case: it’s going to be hard to quantify, though companies with the resources could emulate the study, or run focus groups, to identify how pushing privacy could increase customer confidence and make customers choose their brand over a competitor’s.

To quote from Ernst & Young’s top 11 privacy trends, “Organizations that ignore the importance of protecting personal information from outside — or inside — will suffer more than financial penalties. They may also see their reputation damaged and their brand negatively impacted.”

Privacy by design lost

This happened to a friend of mine recently.  He went to a nightclub he used to frequent and, before he could say anything, they scanned his identification.  This is what I would call the opposite of privacy by design.  Bars and nightclubs are traditionally very privacy friendly.  They follow a credentialing model: validate people before entry and then don’t track them.  Once you’ve established that you meet the criteria for entry (i.e. being over 21), the bars don’t care who you are.  This is harder to do online, but it can be done with privacy-enhancing technology.  Offline it comes naturally, as in bars.  Yet now some bars have decided to integrate technology into their operations for tracking and marketing purposes.

FAA and flight routes

Recently, over the objections of many privacy advocates and airplane owners, the FAA moved to make more flight route information open and publicly available. Specifically, the FAA operates a program called BARR (Blocked Aircraft Registration Request), which allows certain aircraft to be exempt from the public records of flight routes. The FAA collects these flight routes for every flight in and out of the US in order to deal with traffic control, congestion, etc. However, this is clearly sensitive information for some aircraft: those with at-risk passengers or cargo, those doing surprise inspections of facilities, and so on.

This is very typical of the government modus operandi: collect potentially sensitive information and then attempt to secure it (or its analogue: make it illegal to look at information that is in plain view). In this case the government has made the criteria for exclusion much more stringent, barring those who may have a legitimate interest in securing their flights but don’t meet the government’s threshold. This very much reminds me of the law surrounding home addresses (and other PII) of government employees in Florida. Florida has very broad public records laws, and generally one can get the home addresses of government employees. There are numerous exceptions, for judges, law enforcement, child protective services employees, etc.; those deemed by the state legislature to be at high risk. This solution is no solution, though: first, it deprives the citizenry of public records and adds to the long list of exemptions, and second, it deprives individuals of control of their personal information. A better solution would be not to collect home address information from employees, or at least to give them the option of not supplying that information in the first place.

Returning to the FAA issue, there are other options. Assuming it does need the data for legitimate purposes, the FAA could collect the information in such a way that it doesn’t hold the information directly but only in the aggregate (to assess congestion, etc.). If it needs the information for legitimate law enforcement purposes, it could use some key escrow or blinding method to store the information so that it is only available with a valid court order. Without knowing all the functional requirements of the system, it’s hard to design a privacy-protective method, but my point is that it could be done… if someone cared enough to do it.
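To make the key-escrow idea concrete, here is a minimal sketch, assuming a court-supervised escrow agent who holds the only decryption key (the agent, the per-sector counters, and the record format are all my inventions for illustration). The FAA keeps aggregate counts it can use for congestion work, while route details are stored only as ciphertext it cannot read:

    from collections import defaultdict
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # The escrow agent generates the keypair and keeps the private half;
    # the FAA is handed only the public key.
    escrow_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    escrow_public = escrow_private.public_key()

    OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)

    sector_counts = defaultdict(int)  # aggregate congestion data stays usable
    vault = []                        # sealed route records, unreadable to the FAA

    def record_flight(sector: str, route_record: bytes):
        sector_counts[sector] += 1
        # A real system would wrap a symmetric key (hybrid encryption);
        # bare RSA-OAEP keeps the sketch short but caps records at ~190 bytes.
        vault.append(escrow_public.encrypt(route_record, OAEP))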

Financial Cryptography 2012

I’m trying to organize a legal panel for the conference on Financial Cryptography and Data Security 2012. The 16th annual conference will be held at the Divi Flamingo Beach Resort in Bonaire. The subject of the legal panel will be:

Privacy: Technology versus the Law.
The widespread adoption of PETs (privacy-enhancing technologies) faces many obstacles, including misunderstanding by corporate leaders, a lack of technical skills within organizations, and business case justification. Are laws and regulations another impediment? Regulators seem to favor one of two approaches: outright bans on the collection of information, or requiring information to be collected and then attempting to layer security on top. What laws and regulations do our panelists consider problematic, and what solutions do we have for getting government to support rather than hinder the adoption of PETs?

Aggregation without revelation

I was given the following problem:

You have a web-based mail system in which you want to provide auto-correcting spell checking. The spell-checking system needs to correct not only common words, but also proper names of places and words from obscure languages. Most importantly, privacy considerations need to be taken into account so as not to store or leak any one individual’s spelling errors. How do you do it?

Anonymization of the data may be insufficient; see the AOL search-log release.

Most intelligent spell checkers work by recording a word and then the changes made to that word. If lots of people make the same change, chances are it corrects a typographical or spelling error. You can then suggest that change to others who type the original, potentially misspelled word. How do you do that without storing the misspelled word or the correction until you have it in its aggregate form?

Here is my suggested solution:

Take the user’s email address and modulo it by 256 (1 byte). That essentially puts everybody into one of 256 buckets. The reason will become clear later.

On the client side (via Javascript), also take each word as it’s typed and hash it down to 3 bytes. You’ll have some collisions, but not more than a few hundred in the worst case. (Some statistical analysis should be done, but this is my back-of-the-envelope calculation.)
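In sketch form, continuing in Python for consistency with the earlier examples (in production this piece would run as JavaScript in the browser; hashing rather than a literal modulo is again my assumption, for uniformity):

    import hashlib

    def user_slot(email: str) -> int:
        # One byte of a hash: everyone lands in one of 256 buckets.
        return hashlib.sha256(email.lower().encode()).digest()[0]

    def word_fingerprint(word: str) -> bytes:
        # Three bytes give ~16.7 million values, so collisions are rare.
        return hashlib.sha256(word.lower().encode()).digest()[:3]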

If the user does not change the word, do nothing.

If the user makes a correction, take the original and the correction and use M-of-N secret splitting to split that information into N parts. For this particular exercise, you’ll split the information into 256 parts. You’ll then take the z-th part (z being the value of the user’s email modulo 256, from above) and send it back to the server. In and of itself, that piece of information reveals nothing.
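A sketch of that step, reusing deterministic_split from the location example (it matters here too: everyone making this exact correction has to contribute pieces of the same polynomial). The tab-separated payload format is my own assumption; m is the tunable threshold discussed next:

    def correction_share(original: str, corrected: str, email: str, m: int = 5):
        # Pair the misspelling with its fix, split into 256 pieces, and
        # return only the piece matching this user's slot, plus a routing tag.
        payload = f"{original}\t{corrected}".encode()
        pieces = deterministic_split(payload, m=m, n=256)
        return word_fingerprint(original), pieces[user_slot(email)]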

However, once the server has M parts, it can reconstruct the original word and the correction and make the correction available to others for autocorrection. When will this happen? The first person provides the first part, but the next person to send in a correction won’t necessarily provide a second, distinct part: they have a 1/256 chance of providing the same part as the first person. So how many people do you need to collect M parts? It comes down to probability. You obviously need at least M people, but M people don’t guarantee M different parts. By adjusting M, the administrator can set a threshold for how many people are needed to have a certain percentage chance of collecting enough parts: say, 30 people for a 90% chance, 50 people for 95%, or 100 people for 99%. Once the information is revealed, you’re confident it’s based on aggregate errors, not one person’s.
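Those percentages read as illustrative rather than computed; an administrator could pick M against a real target with a quick simulation like this one (my sketch):

    import random

    def chance_of_m_pieces(m: int, people: int, trials: int = 10_000) -> float:
        # Monte Carlo: odds that `people` independent slot choices
        # out of 256 cover at least m distinct slots.
        wins = sum(len({random.randrange(256) for _ in range(people)}) >= m
                   for _ in range(trials))
        return wins / trials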

Now, when a user types an email, after each word the system can calculate the 3-byte fingerprint and send it to the server. If the server finds any matches, it returns the list of corrections (which may be a few items long or a hundred). The client side then compares the word against the corrections and determines which are likely and which are not (i.e. “hapyy” to “corrugated” would not be a reasonable match, but “hapyy” to “happy” would be, even though, say, “hapyy” and “corugated” both fall into the same 3-byte bucket).
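A sketch of that client-side filtering (the 0.7 similarity cutoff and the shape of the server’s response are assumptions of mine):

    import difflib

    def suggestions(word: str, revealed: dict) -> list:
        # `revealed` maps a 3-byte fingerprint to the (original, correction)
        # pairs the server has already reconstructed in the aggregate.
        candidates = revealed.get(word_fingerprint(word), [])
        # Discard fingerprint collisions: keep corrections whose original
        # actually resembles what the user typed ("hapyy" ~ "happy").
        return [fix for orig, fix in candidates
                if difflib.SequenceMatcher(None, word, orig).ratio() > 0.7]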

So here you have a solution that reveals information only in the aggregate and gives some level of assurance that the server doesn’t know what any one person is typing.