I was given the following problem:
You have a web based mail system that you want to provide auto-correction on spell checking. The spell checking system needs to not only correct for common words, but needs to account for proper names of places and obscure languages. Most importantly, privacy considerations need to be taken into account so as not to store or leak one individual’s spelling errors. How do you do it?
Anonymization of data may be insufficient. See AOL
Most intelligent spell checkers do so by recording a word and then changes to that word. If lots of people make the same changes, chances are it’s because of a typographical error or spelling error. You can then suggest that as a correction to others who type in the original potentially misspelled word. How do you do that without storing the misspelled word or the correction unless you have it in it’s aggregate form?
Here is my suggested solution:
Take the user’s email and modulo it with 256 (1 byte). That essentially puts everybody into one of 256 buckets. The reason will be clear later.
On the client side (via Javascript), also take each word as it’s typed and modulo that word into 3 bytes. So, you’ll have some collisions but not more than a few hundred worst case scenario. (Some statistical analysis should be done, but this is my back of the envelope calculation).
If the user does not change the word, do nothing.
If the user makes a correction, take the original and the correction and use M-of-N secret splitting to split that information into N parts. For this particular exercise, you’ll split the information into 256 parts. You’ll then take the z-th part (z being determined by the value of the user’s email moduloed 256 above) and send that back to the server. In and of itself, that information does nothing.
However, now once you have M parts, the server can reconstruct the word and the original and make it available for others to autocorrect. When will this happen? Well the first person provides the first part, the next person to send in a correct won’t necessarily provide the second part. They have a 1/256 chance of providing the same part at the first person. So how many people do you need to get M parts? It comes down to probability. You obviously need at least M people but M people doesn’t guarantee you have M different parts. By adjusting M, the administrator could set a threshold of how many people you would need to have a certain percentage chance of collecting the parts. For example, 30 people gave you a 90% chance and 50 people gave you a 95% chance and 100 people gave you a 99% chance. Once the information is revealed, you’re confident its based on aggregate errors, not one person’s.
Now, when a user types in an email, after each word, the system can calculate the 3 byte compression and send that out to the server. If the server finds any matches, it returns the list of corrections (which may be only a few times long or a hundred items). The client side then compares the word against the corrections and determines those that are likely and those that are not (i.e. “hapyy” to “corrugated” would not be a reasonable match but it would be to “happy” even though say hapyy and corugated both fall in the same 3 byte bucket).
So here you have a solution which only reveals information in the aggregate and gives some level of security so the server doesn’t know what the person is typing.