Privacy Breaches in Privacy-Preserving Data Mining

Johannes Gehrke
Cornell University

The exponential growth in the amount of digital data has resulted in the creation of databases of unprecedented scale. At the same time, concerns about the privacy of personal information have emerged globally. Data mining, with its promise to efficiently discover valuable, non-obvious information from large databases, is particularly vulnerable to misuse. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? This talk will survey recent results on privacy-preserving data mining, concentrating on the class of solutions where each party randomizes their data before sending it to a central server for building the model. We show that simple randomization methods can be exploited to find privacy breaches, and we analyze the nature of these privacy breaches. We then propose a class of randomization operators with strong privacy guarantees and introduce a general property for any randomization operator that limits privacy breaches. This is joint work with Rakesh Agrawal, Alexandre Evfimievski, and Ramakrishnan Srikant.
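As a rough illustration of the setting described above, the sketch below uses classic randomized response on a single bit. It is not one of the operators proposed in the talk, and every name in it (`randomize_bit`, `estimate_fraction`, `max_likelihood_ratio`, `p_keep`) is hypothetical. Each party flips its bit with a fixed probability before reporting it, and the server recovers the aggregate fraction without seeing any true value. The last function computes the worst-case ratio between the probabilities of producing the same report from two different inputs. Bounding that ratio is one plausible reading of a general property of a randomization operator that limits breaches, and is assumed here rather than taken from the abstract.

```python
import random


def randomize_bit(bit, p_keep, rng):
    """Report the true bit with probability p_keep, otherwise report its flip."""
    return bit if rng.random() < p_keep else 1 - bit


def estimate_fraction(reports, p_keep):
    """Estimate the fraction of true 1-bits from the randomized reports.

    E[observed] = p_keep * f + (1 - p_keep) * (1 - f), solved here for f.
    """
    if p_keep == 0.5:
        raise ValueError("p_keep = 0.5 makes reports independent of the data")
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_keep)) / (2 * p_keep - 1)


def max_likelihood_ratio(p_keep):
    """Largest ratio P(report | x1) / P(report | x2) over all inputs x1, x2.

    Assumption: a small ratio is used as a stand-in for "limited breach";
    a ratio of 1 means the report reveals nothing about the input.
    """
    low, high = min(p_keep, 1 - p_keep), max(p_keep, 1 - p_keep)
    return float("inf") if low == 0 else high / low


if __name__ == "__main__":
    rng = random.Random(0)                    # fixed seed for repeatability
    true_bits = [1] * 3000 + [0] * 7000       # hypothetical data, true fraction 0.3
    reports = [randomize_bit(b, 0.8, rng) for b in true_bits]
    print("estimated fraction:", estimate_fraction(reports, 0.8))
    print("worst-case ratio:", max_likelihood_ratio(0.8))
```

Moving `p_keep` toward 0.5 lowers the worst-case ratio, so each report reveals less, but it also raises the variance of the aggregate estimate. That is the trade-off between privacy and model accuracy that the abstract raises.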