The Data Protection and Digital Information (No. 2) Bill (DPDIB) was introduced to Parliament on 8 March 2023. It is not a complete revamp of the first Data Protection and Digital Information Bill, but this time the intention is firm: the Department for Science, Innovation and Technology (DSIT) is determined to have the DPDIB adopted this year.
As the Bill is currently in its second reading in the House of Commons, it is useful to shed light on the first section of the Bill, as it will have implications for how artificial intelligence (AI)-related practices are governed.
To recap, data protection law could kick in at at least three different stages: when a machine learning model is trained, when data is input into the model, and when an output is produced.
Up until now, most of the attention has been focused on Article 22.
Assuming the DPDIB were adopted as introduced, Article 22 would lose some of its teeth [which would reduce the ability to assess the lawfulness and fairness of processing activities associated with the production of, and reliance upon, model outputs], as the restriction placed on automated decision-making would only concern significant decisions based entirely or partly on special categories of personal data referred to in Article 9(1), i.e.,
“personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation.”
The emerging case law on automated decision-making shows, however, the variety of practices that would require scrutiny: the processing of special category data is only a subset of the whole.
There is a more subtle change the DPDIB would bring to the data protection framework.
Assuming the DPDIB were to be adopted, the revised version of GDPR Article 4(1)(1) would read as follows:
"‘personal data’ means any information relating to an identified or identifiable living individual (‘data subject’); an identifiable living individual is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of the individual (and see paragraph 2);
(1A) an individual is identifiable from information "directly" if the individual can be identified without the use of additional information;
(1B) an individual is identifiable from information "indirectly" if the individual can be identified only with the use of additional information;"
Article 4(1)(1) would then need to be read together with Article 4(1)(5), which contains an amendment to the definition of pseudonymisation:
‘pseudonymisation’ means the processing of personal data in such a manner that it becomes information relating to a living individual who is only indirectly identifiable; but personal data is only pseudonymised if the additional information needed to identify the individual is kept separately and is subject to technical and organisational measures to ensure that the personal data is not information relating to an identified or directly identifiable living individual;
Article 4(1) is problematic, as it could have the effect of reducing the material scope of the GDPR, i.e., the definition of personal data.
Let’s unpack Article 4(1). For this purpose, we need to distinguish between three types of information:
- The data source that is being processed by the controller or the processor, for instance a table with a set number of individual records and associated attributes organised in columns, such as demographic or behavioural attributes. Let’s further assume that direct identifiers have been masked, as a security measure, to make re-identification more difficult. A variety of masking techniques, such as encryption, hashing, or salted hashing, could be used for this purpose. [The choice of masking technique makes no difference here.]
- The additional information that could be used to recover the direct identifiers masked within the data source and thereby identify individuals. The additional information could be the encryption key or the salt used to mask direct identifiers, but also a clear-text version of the data source, if one persists somewhere.
- Publicly available information, which could comprise both direct and indirect identifiers and which, if combined with the data source, would enable individual records within the data source to be linked to direct identifiers, i.e., the re-identification of individuals. [This is not a mere textbook scenario.]
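To make the first two types of information concrete, here is a minimal Python sketch of masking a direct identifier with a salted hash. Everything in it — the email address, the salt value, the attributes — is invented for illustration; the point is only that the salt plays the role of the "additional information": whoever holds it can re-compute the hashes and link masked records back to known identifiers.

```python
import hashlib

def mask_identifier(identifier: str, salt: str) -> str:
    """Mask a direct identifier with a salted SHA-256 hash."""
    return hashlib.sha256((salt + identifier).encode()).hexdigest()

# The salt is the "additional information", to be kept separately
# under technical and organisational measures.
salt = "kept-separately-under-TOMs"

# Hypothetical record: the direct identifier is masked,
# indirect attributes remain in the clear.
record = {
    "id": mask_identifier("jane.doe@example.com", salt),
    "postcode": "SW1A 1AA",
    "birth_year": 1985,
    "occupation": "barrister",
}

# Whoever holds the salt can re-identify: hash a candidate
# identifier and compare it with the masked value.
assert record["id"] == mask_identifier("jane.doe@example.com", salt)
```

Destroying the salt removes this particular route back to the identifier, but, as discussed below, it does nothing about the indirect attributes left in the clear.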
Yet even if the additional information is destroyed or never existed, the individual records may still comprise a long list of indirect and/or sensitive attributes, some of which are publicly available or attainable by an attacker, or simply too sensitive to form the basis of individual decision-making.
Article 4(1) seems to suggest that when the additional information is destroyed or was never collected in the first place, an individual ceases to be, or more simply is not, identifiable… [unless the presence of publicly available information makes the individual directly identifiable; but that would mean pseudonymisation is never possible in the presence of publicly available information, which is quite a demanding reading of Article 4(1)(4) and does not seem to be the position of the ICO].
Making a Vidal-Hall-type argument suddenly becomes harder. In Vidal-Hall, the claim that browser-generated information constituted personal data was found to be clearly arguable:
"If section 1 of the DPA is appropriately defined in line with the provisions and aims of the Directive, identification for the purposes of data protection is about data that ‘individuates’ the individual, in the sense that they are singled out and distinguished from all others."
The richer the list of attributes associated with singled-out individuals, i.e., individuals distinguished within the crowd, the more likely identification and individual impact become.
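The mechanics of singling out can be sketched in a few lines of Python. In this invented example (the masked IDs, attributes, and the public record are all hypothetical), no salt or key is needed at all: each record's combination of indirect attributes is unique, so matching a public record on those attributes alone re-identifies the individual.

```python
from collections import Counter

# Hypothetical pseudonymised records: direct identifiers masked,
# quasi-identifiers (postcode, birth year, occupation) in the clear.
records = [
    {"id": "a1f3...", "postcode": "SW1A 1AA", "birth_year": 1985, "occupation": "barrister"},
    {"id": "9c2e...", "postcode": "SW1A 1AA", "birth_year": 1990, "occupation": "teacher"},
    {"id": "77b0...", "postcode": "EC1A 1BB", "birth_year": 1985, "occupation": "barrister"},
]

def quasi(r):
    """The combination of indirect attributes for a record."""
    return (r["postcode"], r["birth_year"], r["occupation"])

counts = Counter(quasi(r) for r in records)

# Every combination here occurs once: each record is singled out.
singled_out = [r for r in records if counts[quasi(r)] == 1]

# A public record sharing the same quasi-identifiers then links a
# masked record to a name, without any additional information.
public = {"name": "Jane Doe", "postcode": "SW1A 1AA",
          "birth_year": 1985, "occupation": "barrister"}
matches = [r for r in records if quasi(r) == quasi(public)]
```

Here `matches` contains exactly one record, so the masked ID is now attached to a named individual — which is why destroying the "additional information" alone does not settle the question of identifiability.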
What is more, if such a narrow reading of Article 4(1) were to prevail, we would lose a basis for assessing the lawfulness and fairness of a great number of processing activities [in particular AI training] and their individual and collective impact on fundamental rights.
Does this mean that the ICO’s functional approach to anonymisation is also problematic? Not necessarily, but processing purposes should play a more prominent role, and anonymisation is certainly not a processing purpose; it is just a data transformation process.