Introduction
AI/ML can be applied to many domains and power very different use cases. Trask (2020) observes that the most important use cases also tend to have the most privacy concerns attached to them. If we were to plot this, we would see digit classification on one side of the spectrum. Data collection for this use case is hardly problematic, and large datasets exist that can be used to train a well-performing model. A useful use case indeed, but not on the same scale as early cancer detection. Such models can make a big difference; however, data collection is much harder due to the large privacy concerns involved.
By addressing privacy concerns, we can unlock many more impactful use cases.
Anonymised data is flawed
A reflex often expressed in this context is that personal data can be anonymised, thereby eliminating any privacy concerns. There are a couple of problems with this approach that are best illustrated with an example.
Schwarz et al. (2019) demonstrated that they could reconstruct a 3D model of a face from an MRI scan. By feeding the 3D model into face-recognition software, they could then find a matching identity. It is a proof of concept, but it shows that a dataset containing seemingly unidentifiable information can still be very problematic.
Another attack vector is re-identification. Re-identification works by combining several datasets and piecing together a new dataset with all the personal data included. There are already quite a few (illegal) databases available containing personal information. An attacker can combine such databases with the anonymised dataset and retrace identities.
This is not a theoretical concern. I refer the interested reader to the cases of Sweeney (2000), Narayanan & Shmatikov (2008), and Culnane et al. (2019), which impressively demonstrate how publicly released datasets could be de-anonymised with rather simple means. It is widely accepted among researchers that anonymisation is a flawed and dangerous approach.
Defence: Differential Privacy
The main benefit of differential privacy is that you can learn from the data of many people without having traceable access to any individual's data.
This works by effectively adding noise to the data. It follows the conceptual idea that if you can show that the outcome of a statistical analysis hardly changes when you remove or change an individual data point, you protect the underlying information. It injects plausible deniability for every individual in the dataset.
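To make that intuition concrete, here is a minimal sketch (my own toy example, not part of the experiment) of the classic Laplace mechanism applied to a simple count query; the `epsilon` value is an arbitrary illustrative choice:

```python
import numpy as np

def private_count(values, threshold, epsilon=0.5):
    """Count how many values exceed a threshold, with Laplace noise added.

    Adding or removing one person changes the count by at most 1 (sensitivity 1),
    so noise with scale 1/epsilon is enough to hide that individual's contribution.
    """
    true_count = sum(v > threshold for v in values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

salaries = [32_000, 45_000, 51_000, 78_000, 120_000]
print(private_count(salaries, threshold=50_000))  # a noisy answer near the true value of 3
```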
Apple, for instance, uses local differential privacy to create a model that learns which emoji fits a given piece of text. To create such a model, Apple needs to collect data on what users type on their keyboards together with the emoji they use. Local differential privacy is applied on the device before the data is sent to Apple's servers. The system does a few other things as well, but the key concept is that data can be shared whilst preserving privacy.
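The local variant can be illustrated with the classic randomised-response trick. This is a simplified sketch, not Apple's actual protocol: each device perturbs its own answer before sending it, yet the aggregate frequency can still be estimated.

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true answer with probability p, otherwise flip a coin.

    Each individual report is deniable, but the population frequency can
    still be estimated from many noisy reports.
    """
    if random.random() < p:
        return truth
    return random.random() < 0.5

# Example: 10,000 users, 30% of whom actually used a given emoji.
reports = [randomized_response(random.random() < 0.3) for _ in range(10_000)]
observed = sum(reports) / len(reports)
# Invert the noise: observed = p * true + (1 - p) * 0.5
estimated = (observed - (1 - 0.75) * 0.5) / 0.75
print(round(estimated, 3))  # close to 0.3
```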
Privacy concerns for models
So far, the argument has been made that distributing anonymised data is not safe. But what if you trained a model on private data and distributed the model instead? For instance, you might deploy the model to cloud infrastructure, giving the host access to it as well. Or you might deploy it and give your users access to an API for making predictions.
Researchers have identified several ways to extract personal data from access to the model alone (inference attacks). This works because models tend to "summarise" the original training data and build internal representations of it. The original data can be recreated or approximated just by querying an API that only returns probabilities. Models thus leak the information they were trained with. In fact, it has been argued that a model should itself be regarded as personal data and thereby fall under GDPR rules, which would, for instance, prevent companies from moving it to servers outside the EU (Veale et al., 2018).
Defence: minimise information leakage
The important thing to remember is: models can be used to recreate, approximate, or at least gain insight into the original training data. As discussed in the previous section, even if the data is anonymised, this can lead to a severe loss of privacy. I have to note, though, that these attacks can be hard and computationally intensive to perform. Nevertheless, you are wise to treat the model with the same security measures as you would the personal data itself. To protect against API attacks, you can limit the number of requests and/or add some noise to the returned probabilities (Fredrikson et al., 2015). Homomorphic encryption can also help (discussed next).
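As a rough illustration of the second defence, the sketch below (my own example, not the method from Fredrikson et al.) adds a little Laplace noise to the probabilities before the API returns them, blurring the exact confidence values that inversion attacks exploit:

```python
import torch

def noisy_prediction(model: torch.nn.Module, x: torch.Tensor, scale: float = 0.05):
    """Return class probabilities with Laplace noise added before release.

    The predicted class usually stays the same, but the precise confidences,
    which inversion attacks rely on, are blurred.
    """
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=-1)
    noise = torch.distributions.Laplace(0.0, scale).sample(probs.shape)
    noisy = (probs + noise).clamp(min=0)
    return noisy / noisy.sum(dim=-1, keepdim=True)  # re-normalise to sum to 1
```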
Privacy concerns at inference time
Imagine you are a doctor and you have made an MRI scan of a brain to check for a cancerous tumour. A big tech company has created a revolutionary model that can diagnose the tumour, assess its severity, and identify the best treatment method. Should the doctor upload the image?
You first have to consider that uploading the file gives other people access to it. Even worse, the model outputs very sensitive knowledge, to which the tech company also has access. By uploading the MRI scan, you thus release very sensitive personal data.
Of course, a system can operate on trust and agreements. In reality, however, this is hardly a good privacy protection method and has failed society on numerous occasions. A better scenario is this: the doctor uploads an encrypted MRI scan that only the doctor can view. The tech company runs the encrypted file through the model and returns an encrypted result that only the person with access to the input file (the doctor) can see.
Defence: Homomorphic encryption
Encrypted communication is a normal standard even for simple chat messengers. It offers protection against outsiders viewing the information. This removes the need for trust and makes the solution truly private.
In the context of making inferences, this seems impossible. It would mean a model can make a prediction on data whose content it cannot read, and then output a result it cannot decipher itself. Yet this is exactly what homomorphic encryption does.
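A tiny sketch using the CrypTen library (the library I use later in the experiment; under the hood it relies on secure multi-party computation rather than pure homomorphic encryption, but the programming model is the same) shows the core idea of computing on encrypted values:

```python
import torch
import crypten

crypten.init()

# Encrypt two tensors; their plaintext values are no longer directly visible.
x_enc = crypten.cryptensor(torch.tensor([1.0, 2.0, 3.0]))
y_enc = crypten.cryptensor(torch.tensor([4.0, 5.0, 6.0]))

# Arithmetic happens directly on the encrypted values.
z_enc = x_enc + y_enc
w_enc = x_enc * y_enc

# Only decryption reveals the results.
print(z_enc.get_plain_text())  # approximately tensor([5., 7., 9.])
print(w_enc.get_plain_text())  # approximately tensor([4., 10., 18.])
```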
In the light of AI/ML, homomorphic encryption is an exciting new technology that brings security and privacy to the lifecycle from model deployment to making inferences.
The experiment
I discussed Differential Privacy and Homomorphic Encryption as promising technologies to mitigate privacy concerns.
Academic research explains how these methods work mathematically. My focus of interest, though, is developing an intuition for how they can be used in an application.
In my experiment, I trained a model on the MNIST dataset and applied Differential Privacy and Homomorphic Encryption. I was especially curious how they would affect model creation, accuracy, and inference. In addition, I performed a simple benchmark to understand the performance impact.
The code is available in a Colab notebook. I used the Opacus and CrypTen libraries from the PyTorch team for differential privacy and homomorphic encryption, respectively. Other libraries that should at least be mentioned are OpenMined and Microsoft SEAL. I have to say, I had to follow quite a few tutorials to get the code running. Again, I just wanted to get some first-hand experience in applying the technology.
Applying Differential Privacy
Differential privacy can be implemented in different ways. In my experiment, I applied global differential privacy. This just means that the dataset itself stays "un-private", and a privacy step applied as part of training makes the resulting model "private". Applying differential privacy is surprisingly easy, although I have to note that there are quite a few hyperparameters to choose from.
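The sketch below shows roughly how this looks with Opacus; the toy architecture, the random data, and the hyperparameters (`noise_multiplier`, `max_grad_norm`) are illustrative placeholders, not the exact values from my Colab.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in for the MNIST setup: a small linear classifier on random data.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

data = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))
train_loader = DataLoader(data, batch_size=64)

# Attach Opacus' PrivacyEngine: per-sample gradients are clipped and noised (DP-SGD).
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,   # how much noise is added to the aggregated gradients
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

# Training loop looks exactly like normal PyTorch training.
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

# Track the privacy budget spent so far.
print("epsilon:", privacy_engine.get_epsilon(delta=1e-5))
```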
Applying homomorphic encryption
The encryption is added after the model has been trained normally. To encrypt the model, the library needs the weights and biases but also the architecture itself. I have to note that not all architectures are supported; for instance, the typical ReLU activation function is not supported.
The encryption step is pretty quick (a few seconds) and is only applied once.
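A rough sketch of that encryption step with CrypTen; the single linear layer is a stand-in for the actual MNIST architecture used in the Colab:

```python
import torch
import crypten

crypten.init()

# A model trained normally in PyTorch (toy stand-in for the MNIST model).
plain_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))

# CrypTen needs the architecture as well as the weights, so it traces the
# model with a dummy input of the right shape and then encrypts it.
dummy_input = torch.empty(1, 1, 28, 28)
encrypted_model = crypten.nn.from_pytorch(plain_model, dummy_input)
encrypted_model.encrypt()
```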
The next step is making inferences securely. To do so, the input needs to be encrypted, fed to the encrypted model, and the output decrypted again. There is a fair amount of juggling of keys involved (which I do not fully understand), but the effect is that only the person with the key to the input can decrypt the output.
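Continuing the sketch above (it reuses `encrypted_model` and the CrypTen setup), secure inference then looks roughly like this:

```python
# Encrypt the input (e.g. one image), run it through the encrypted model,
# and decrypt only the final output.
x = torch.randn(1, 1, 28, 28)            # stand-in for a real input image
x_enc = crypten.cryptensor(x)

encrypted_model.eval()
output_enc = encrypted_model(x_enc)      # forward pass on encrypted data

prediction = output_enc.get_plain_text().argmax(dim=1)
print(prediction)
```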
The important thing I noticed is that this inference step takes significant compute.
The benchmark
Two equal architectures, two methods of adding privacy protection. A couple of interesting observations can be drawn from the benchmark results:
- Training a model using differential privacy takes additional compute (+31%). However, inference time is almost identical.
- In my experiment, the accuracy of the model with differential privacy is similar to the normal model. Typically though, the use of differential privacy leads to a dip in accuracy.
- Homomorphic encryption does not require training a new model. However, the inference time increases significantly in relative terms. In absolute terms it is still acceptable (4.14 seconds), but it did prevent me from computing the accuracy score. It is reported, though, that the use of this encryption technique does not impact accuracy much.
There is a lively discussion on the further pros and cons of each approach. I refer the interested reader to Mireshghallah et al. (2020) and to the courses from OpenMined.
Final thoughts
Privacy and security of ML/AI are critical requirements for trustworthy AI. Anonymising data is not a valid approach. Privacy and security concerns occur in practically all steps of AI/ML:
- While collecting private data
- While deploying a model
- While making inferences using the model
Multiple technologies have to be combined to create a safe and privacy-respecting system throughout the full lifecycle. I only discussed two promising (and already usable!) methods as countermeasures. More defences are possible. A big one I did not discuss in this article is federated learning. This technology is particularly interesting if multiple parties want to collaborate on creating an ML model without building a central repository of private data.
Even if there are simple-to-use libraries, implementing privacy and security requires teams to invest resources. In my experience, this is not always easy and requires specialised knowledge.
However, a good starting point is to realise that there are privacy and security concerns throughout the whole AI/ML lifecycle. Have a look at your options and see what you can do to create safe, secure, and privacy-respecting AI.
References
Culnane, C., Rubinstein, B. I. P., & Teague, V. (2019). Stop the Open Data Bus, We Want to Get Off. arXiv.
Fredrikson, M., Jha, S., & Ristenpart, T. (2015). Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security.
Mireshghallah, F., Taram, M., Vepakomma, P., Singh, A., Raskar, R., & Esmaeilzadeh, H. (2020). Privacy in deep learning: A survey. arXiv.
Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. Proceedings of the IEEE Symposium on Security and Privacy, 111–125. https://doi.org/10.1109/SP.2008.33
Schwarz, C. G., Kremers, W. K., Therneau, T. M., Sharp, R. R., Gunter, J. L., Vemuri, P., Arani, A., Spychalla, A. J., Kantarci, K., Knopman, D. S., Petersen, R. C., & Jack, C. R. (2019). Identification of Anonymous MRI Research Participants with Face-Recognition Software. New England Journal of Medicine, 381(17), 1684–1686. https://doi.org/10.1056/nejmc1908881
Sweeney, L. (2000). Simple Demographics Often Identify People Uniquely. Carnegie Mellon University, Data Privacy Working Paper 3.
Trask, A. (2020, February 24). Building AI with Security and Privacy in mind (video 4). https://nips.cc/Conferences/2020/Schedule?showEvent=20240
Veale, M., Binns, R., & Edwards, L. (2018). Algorithms that Remember: Model Inversion Attacks and Data Protection Law. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 376(2133). https://doi.org/10.1098/rsta.2018.0083