With current technological advancements in AI, new risks naturally emerge. However, we have long been familiar with terms such as “algorithm,” “automated decision-making” and “machine learning” (ML) through real-life applications like email spam filters and automated bank loan decisions. Thus, some of the challenges with AI are not entirely new but are amplified as AI spreads to new areas of life. The good news is that since AI did not develop overnight, some of its associated risks have been acknowledged for quite some time, and ways to mitigate privacy-related risks already exist.
In this text, I aim to share some thoughts and available online resources on protecting personal data in the context of AI. The text is divided into three main questions: whether personal data is lawfully collected for the development and use of AI, how personal data is protected, and how our rights are considered when AI makes predictions or decisions about us.
Can we trust that our personal data is lawfully collected by companies using AI solutions?
Both large language models (LLMs) and image-generation technologies have been trained on massive amounts of content available online, stirring debates over copyright law. When the data processed is personal, our privacy can also be violated. One of the most well-known cases involves the French authority Commission nationale de l’informatique et des libertés (CNIL), which issued fines to Clearview AI, a US-based facial recognition technology company, for scraping photos of individuals online, including from social media platforms, and then selling access to this database, which identifies a person when provided with a photo. Although the content scraped was made public by the data subjects themselves, people did not expect their photos to be used in this way. The French authority determined that there was no legal basis for Clearview AI’s actions and required the deletion of photos of any residents of France. When no evidence of compliance was provided by Clearview AI, a fine of over €5 million was issued. As data scraping becomes more of an issue, the Dutch Data Protection Authority, Autoriteit Persoonsgegevens, has recently issued guidelines for scraping personal data online (available only in Dutch).
It is possible to request data subject consent to obtain personal data for the specific purpose of developing an AI model. When data subject consent has not been requested, the GDPR allows secondary processing (i.e., processing personal data for purposes other than those for which it was originally collected) for research, development and innovation, when permitted by member state law. When neither consent nor national law is in place, secondary processing is allowed if the secondary purpose is compatible with the original purpose of processing and certain conditions are met. The British Information Commissioner’s Office (ICO) has issued guidance on the most common privacy issues around AI and data protection, including defining the lawful basis according to the UK GDPR.
But wouldn’t it be easier to use synthetic data to train the model? Unlike purely fabricated data, synthetic data is generated from a real dataset and retains its statistical qualities, so it can be used to train and test an AI model without compromising real personal data. Gartner estimated that by 2024, 60% of all data used to train and test AI models would be synthetic. However, recent studies indicate that when an AI model is trained on large amounts of synthetic data, it may suffer model collapse. Given these findings, it will be interesting to see whether synthetic data will be as widely used as previously estimated.
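To make the idea concrete, below is a minimal sketch of one simple way synthetic tabular data can be produced: fitting a multivariate Gaussian to the numeric columns of a real dataset and sampling new rows from it. The dataset and column names ("age", "income", "credit_score") are purely illustrative, and real projects typically rely on dedicated synthetic-data libraries and add privacy testing, since a naive statistical copy can still leak information about individuals if it mirrors the original data too closely.

```python
# Illustrative sketch: generate synthetic rows that mimic the statistical
# structure of a real dataset without reusing any real individual's record.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Stand-in for a real dataset containing personal data (hypothetical columns).
real = pd.DataFrame({
    "age": rng.integers(18, 80, size=1000),
    "income": rng.normal(40_000, 12_000, size=1000),
    "credit_score": rng.normal(650, 80, size=1000),
})

# Estimate the joint distribution of the real data (mean vector + covariance).
mean = real.mean().to_numpy()
cov = np.cov(real.to_numpy(), rowvar=False)

# Sample synthetic rows from the fitted distribution.
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=1000),
    columns=real.columns,
)

# The synthetic data preserves aggregate statistics, not individual records.
print(synthetic.describe())
```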
In addition to the legal basis for collecting and processing personal data, other general personal data processing principles, such as data accuracy, data minimization, purpose limitation, and data integrity and confidentiality, apply to AI technology as well. The ICO has also issued a separate AI and Data Protection Toolkit with privacy controls for the entire AI lifecycle, addressing these topics in detail.
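As a small illustration of what one of these principles can look like in practice, the sketch below applies data minimization before model training by dropping direct identifiers and keeping only the features needed for the stated purpose. The column names and the list of identifiers are hypothetical, and a real project would define them from its own records of processing and purpose documentation.

```python
# Illustrative sketch of data minimization before training:
# keep only the features required for the documented purpose,
# and never pass direct identifiers to the training pipeline.
import pandas as pd

DIRECT_IDENTIFIERS = ["name", "email", "national_id"]  # hypothetical list

def minimize_for_training(df: pd.DataFrame, needed_features: list[str]) -> pd.DataFrame:
    """Return a copy containing only the non-identifying columns needed for training."""
    kept = [c for c in needed_features if c in df.columns and c not in DIRECT_IDENTIFIERS]
    return df[kept].copy()

customers = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "email": ["a@example.com", "b@example.com"],
    "age": [34, 51],
    "income": [42_000, 55_000],
})

training_data = minimize_for_training(customers, needed_features=["age", "income"])
print(training_data)
```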