Anonymisation and Personal Data

What is personal data?

According to the definition given in the General Data Protection Regulation (GDPR), 'personal data' means any information relating to an identified or identifiable natural person. A natural person is considered identifiable if they can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person. (EU General Data Protection Regulation Article 4, Paragraph 1). By this definition, when it comes to research data, personal data are not limited to information relating to research participants. Research data may also contain identifiers relating to research subjects' family and friends or other third parties. Identifying information relating to these persons also constitutes personal data.

There are no limitations regarding the nature and character of personal data. Any information related to a natural person may be personal data. This includes statements, opinions, attitudes and value judgments. Personal data may be objective or subjective. Whether the information is true or verifiable or not is of no consequence here. The information may refer to an individual's private or family life, health, physical characteristics, professional activities, and economic or social behaviour.

What kind of information constitutes identifiable data?

Personal data are any kind of data that may be used to identify a natural person or a cluster of persons, such as individuals in the same household. Identification can occur on the basis of one or more factors specific to the physical, psychological, mental, economic, cultural or social identity of an individual or individuals. Data that are not directly about people can also be personal if they contain identifiers. An example of secondary personal data could be fire department information on the occurrences of fires, which may include addresses. (Elliot et al. 2016.)

Information that is sufficient on its own to identify an individual includes a person's full name, social security number, email address containing the personal name, and biometric identifiers (fingerprints, facial image, voice patterns, iris scan, hand geometry or manual signature). These types of data are called direct identifiers.

Other information that may be used to identify an individual fairly easily includes a postal address, phone number, vehicle registration number, bibliographic citation of a publication by the individual, email address not in the form of the personal name, web address to a web page containing personal data, unusual job title, very rare disease, or position held by only one person at a time (e.g. chairperson in an organisation). A rare event can also reveal the identity of an individual. The Finnish Social Science Data Archive (FSD) calls these types of information strong indirect identifiers.

At FSD, strong indirect identifiers also include the types of codes that can be used to unequivocally identify an individual from among a group of individuals. These include, for instance, a student ID number, insurance or bank account number, IP address of a computer etc.

Indirect identifiers (or quasi-identifiers) are the kind of information that on their own are not enough to identify someone but, when linked with other available information, could be used to deduce the identity of a person. Background variables and indirect identifiers include, for instance, age, gender, education, status in employment, economic activity and occupational status, socio-economic status, household composition, income, marital status, mother tongue, ethnic background, place of work or study and regional variables. Indirect identifiers relating to region of residence include, for example, post code, neighbourhood, municipality, and major region.

Date can also be an indirect identifier. Date of birth is the most common example, but dates of death and dates of newsworthy events may also be indirect identifiers in research data when combined with other information. In health and medical research, treatment and sampling dates may also occasionally be indirect identifiers when linked to other information.

Pseudonymous data are also taken to be personal data. These include data from longitudinal studies where participants have a case ID instead of a personal identification number, but the research team has a key that can be used to connect the data to research participants.

Processing research data containing identifiers

Identifiable data may be used for scientific research when the use is appropriate, planned and justified, and when there is a legal basis for processing the data (e.g. consent of participants or research carried out in the public interest).

From the point of view of research participants, processing personal data carries the risk that confidential information relating to them is revealed to outsiders (for instance, people close to them, employers or authorities). Therefore, personal data processing must be planned thoroughly and executed carefully. Data protection must not be jeopardised, for example, by careless preservation or insecure digital transfers. You can adapt the various guarantees presented in these Data Management Guidelines, including data minimisation, pseudonymisation and anonymisation, for your purposes when processing personal data. Anonymisation is one way of making the data available for sharing and reuse. If necessary, the data can be further protected by administrative and technical data security solutions.

» More information on data security

Terms to understand

Anonymous data: An individual data unit (person) cannot be re-identified with reasonable effort based on the data provided or by combining the data with additional data points. Completely anonymous data do not exist, but with well-executed procedures one can achieve a result where individual persons cannot be identified with reasonable effort. Anonymisation refers to the various techniques and tools used to achieve anonymity.

Pseudonymous data: An individual data unit cannot be re-identified based on the pseudonymised data without additional, separate information. Pseudonymisation refers to the removal or replacement of identifiers with pseudonyms or codes, which are kept separately and protected by technical and organisational measures. The data remain pseudonymous as long as the additional identifying information exists.

De-identification: Removal or editing of identifying information in a dataset to prevent identification of specific cases. De-identification often refers to the process of removing or obscuring direct identifiers (Elliot et al. 2016).

De-anonymisation: Re-identification of data that are classified as anonymous by combining the data with information from other sources. If anonymous data are de-anonymised, the data were not truly anonymous to begin with, technology has advanced, or more information on the individuals has become available elsewhere. This is why it is good practice to re-assess the robustness of the anonymisation periodically (the so-called residual risk assessment).

Minimisation: Only the minimum amount of personal data necessary to accomplish a task (e.g. research) should be collected. Personal data must not be collected just in case they might be useful in the future. There has to be a clear, specified need for collecting the personal data.

Storage limitation: Personal data that are no longer needed to conduct the research should be erased as soon as possible. For example, names, addresses and other similar identifiers should be removed immediately after they are no longer necessary to carry out the research. If personal identification numbers were used to link data, they should also be deleted when they are no longer needed. Storage limitation reduces risks related to personal data processing.

When are data anonymous and when pseudonymous?

Data are anonymous if characteristic attributes (e.g. combinations of certain indirect identifiers) pertain to more than one person and a data subject cannot be identified with reasonable effort.

The EU data protection regulation (GDPR) defines anonymous data in a functional manner, as part of an activity:

To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.

Source: EU GDPR Recital 26

When data are anonymous, individual data subjects cannot be identified from indirect identifiers or by combining the data with information available elsewhere. New data on the same research subjects cannot be added to an anonymous dataset. For the data to count as anonymous, anonymisation must be irreversible.

Pseudonymous data do not allow the identification of a data subject without the use of separately stored additional information. When data are pseudonymised, unique records are replaced by consistent values so that specific data subjects are no longer identifiable. In addition, information on the original values and techniques used to create the pseudonyms should be kept organisationally and technically separate from the pseudonymised data. Organisational measures refer to the protection of physical environment and documented access control. Technical measures include, for example, secure data storage and encryption. (Tarhonen 2016.)
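The separation of pseudonymised data from the identification key can be sketched as follows. This is only an illustrative sketch in Python: the records, field names and pseudonym format are invented, and in practice the key must additionally be stored in a separate, protected location.

```python
import secrets

# Hypothetical survey records; names and fields are invented for illustration.
records = [
    {"name": "Anna Example", "age": 34, "answer": 5},
    {"name": "Pekka Example", "age": 52, "answer": 2},
]

def pseudonymise(records, id_field):
    """Replace the direct identifier with a random case ID and return
    the pseudonymised data together with the key. The key must be kept
    separately, protected by technical and organisational measures."""
    key = {}
    pseudonymised = []
    for rec in records:
        case_id = secrets.token_hex(4)  # random pseudonym, not derived from the identifier
        key[case_id] = rec[id_field]
        new_rec = {k: v for k, v in rec.items() if k != id_field}
        new_rec["case_id"] = case_id
        pseudonymised.append(new_rec)
    return pseudonymised, key

data, key = pseudonymise(records, "name")
```

Because the pseudonyms are random rather than derived from the identifiers, the data cannot be linked back to individuals without the key; destroying the key is then what the next paragraphs describe as a route towards anonymity.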

Data are not pseudonymous if a specific data subject is identifiable from the data alone, without additional information (ibid.). This could happen when indirect identifiers and exceptional records enable identification, even if personal identification numbers and other direct identifiers are stored separately and securely.

Pseudonymous data become anonymous when separately kept identifying information (decryption key, personal data and information on the techniques used to pseudonymise the data) are destroyed. If you cannot dispose of the separately kept personal data, you can make pseudonymous data anonymous by destroying the decryption key and information on the pseudonymisation processes, and by re-arranging the data, for example, according to new, randomised case IDs. The data are anonymous if they cannot be linked to the original personal data with reasonable effort.

For instance, the research data of a longitudinal study remain identifiable for as long as the research group has the decryption key to the personal data of the research subjects. The data will not become anonymous even if the decryption key is coded twice (double coding). However, coding and double coding as well as pseudonymisation in general are useful guarantees to prevent the use of identifiers in analyses. Coding and double coding are often used in medical sciences.

Bases of anonymisation

There is no single anonymisation technique suitable for all types of data. Anonymisation should always be planned case by case, taking into consideration the data features, environment and utility.

Data features refer to the age and sensitivity of the data, number of data subjects as well as how specific the data contents are (Elliot et al. 2016). Data environment refers to the context in which the data are used: who uses the data, when and where? What external data sources are available? Data environment also includes the physical storage of the data. In assessing data utility one must consider how to balance data utility and anonymisation so that the data remain as usable as possible after anonymisation. You should plan anonymisation carefully and document all relevant anonymisation techniques and processes along with the rationale for them.

When planning anonymisation, you should first consider whether the population and the sampling method may reveal exceptional or unique information on the research subjects. One thing to consider is how random the selection of people into the sample and the study is. For example, when a sample includes a complete population (population data), e.g. all Finnish parents of prematurely born children under the age of one, it is clear from the start whose answers may be found in the data. In a random sample, the chance of being included in the data is smaller because only some of the entities being studied, e.g. every 50th individual, are included in the sample. Regardless of the population or the sampling method, it is important to examine what kinds of indirect and direct identifiers the data contain and to see if there are any exceptional or unique observations.

You should also pay attention to the response rate because a high response rate increases the likelihood of an entity being included in the data. This is particularly important in assessing the anonymity of population data.

Information on how the data were collected, i.e. sampling and selection, should not reveal the identity of the research subjects. Disclosure risk is particularly notable if a researcher selects the study participants from his or her social circle using snowball sampling or from an area with few inhabitants.

In addition, the age and time span of the data affect the need for anonymisation. The older the data are, the more difficult it is to identify individuals in them because the information changes over time. The time span is relevant, for example, in longitudinal studies collected from the same individuals at intervals. Detailed information on the life events of data subjects often makes the observations unique and identifiable. (Elliot et al. 2016.)

The first step of anonymisation is usually to remove direct and strong indirect identifiers from the data (see the Identifier type table). Note that direct and strong indirect identifiers may appear in any part of the data. This is why it is important to examine the complete data instead of focusing only on the most obvious identifiable attributes or on the background information of participants at the beginning of qualitative interviews. In addition to variables charting identifiable information, quantitative data may also contain direct and strong indirect identifiers in responses to open-ended questions. In qualitative data, these identifiers may occur just about anywhere.

However, removing direct and strong indirect identifiers is rarely sufficient to make the data anonymous. After removing the direct identifiers, you should examine the indirect identifiers in the data and assess whether individuals can be identified based on them. The number of indirect identifiers and their level of detail affect the anonymisation choices. The greater the number and the more exact they are, the more careful consideration of anonymisation is called for.

Background variables should always be considered together. If you wish to leave municipality of residence information in the data, you should take sufficient measures to coarsen other background information relating to the persons in question (for instance, occupation, workplace, education, age) to prevent identification. On the other hand, if it is important for the research to have information on the participants' occupation and age, geographical information relating to the participants should be categorised (major region or municipality type instead of municipality of residence). The need to categorise any other background information must be carefully reviewed as well.

For successful anonymisation, information included in the data should be considered together with information available from other data sources. Data must be processed in a way that no individual can be identified, even when using information from other sources. When assessing disclosure risk, you should also take into account what kind of indirect identifiers can be found in information available online (public registers, websites of organisations etc.) As open access to all kinds of information increases rapidly, it is important to check regularly whether previously anonymised data actually remain anonymous (residual risk assessment).

Linking only a few pieces of background information may be sufficient to identify an individual. Latanya Sweeney (2000) found out in her study, based on voter registration lists, that 87% of Americans are likely to be uniquely identified based on their date of birth, gender and a 5-digit ZIP code. The voter lists contain personal and regional information on people who have voted. Similarly, over half of the population in the United States (53%) are likely to be uniquely identified by gender, birth date and place, i.e. city, town or municipality of residence (ibid.). The results show that information from external data sources is of considerable significance in anonymisation.

The following questions can be used to make sense of the anonymisation process of both quantitative and qualitative data:

  1. What kinds of direct or indirect identifiers do the data contain?
  2. Are there any unique or rare observations in the data?
  3. Which information in the data can be linked to identify an individual?
  4. Can information from other sources be linked to the data, making identification possible?
  5. Which features of the data do you want to keep (if possible), and which can be “sacrificed” in the anonymisation process? Think about how other researchers would most likely use the data.

Anonymisation of quantitative data

In the international literature of the field, anonymisation is a broad concept that involves various methods and techniques, such as the functional and statistical approaches (Elliot et al. 2016). However, in these guidelines we focus on concrete anonymisation techniques for research data.

In anonymising quantitative data, we want to eliminate exceptional observations that may increase disclosure risk. This is why it is recommended to examine the relationship between rare or unique records and indirect identifiers. Usually a researcher should inspect all variables containing indirect identifiers or, ideally, all variables in the data. (Cabrera 2017.) You can search for rare or unique records, for example, by examining the categories and frequency distributions of variables with indirect identifiers. Cross tabulating variables may also be useful in finding exceptional cases and records. If there are continuous variables in the data, it is a good idea to recode them into categorical variables for disclosure risk assessment (ibid.). Continuous variables include, for instance, age or income when they can take on any real value on a continuum.
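The search for rare or unique records can be sketched in a few lines by counting the frequency of each combination of indirect identifiers. The records and variables below are invented for illustration only; real data would of course be read from a file and inspected variable by variable as well.

```python
from collections import Counter

# Invented observations: (gender, age group, municipality)
rows = [
    ("male", "25-34", "Tampere"),
    ("male", "25-34", "Tampere"),
    ("female", "25-34", "Tampere"),
    ("female", "25-34", "Tampere"),
    ("male", "65-74", "Utsjoki"),
]

# Frequency of each combination of indirect identifiers
counts = Counter(rows)

# Combinations occurring only once are candidates for further editing
rare = [combo for combo, n in counts.items() if n == 1]
```

Here the single male aged 65–74 from Utsjoki stands out as a unique combination, which is exactly the kind of record that warrants coarsening or removal.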

When cross tabulating variables, it is worth keeping in mind that categories with few observations do not necessarily always constitute identifying information. For example, if a survey is conducted in five schools with roughly the same number of pupils and only four pupils from one of the schools respond, these four observations are not automatically identifying information simply because of the small frequency count. This is because the potential number of respondents was as large as in the other schools. The situation would be different if this school had significantly fewer pupils than the others.

Anonymisation techniques for quantitative data can be divided into two categories: generalisation and randomisation. When data are generalised, information is irreversibly removed or attributes of data subjects are diluted by (re-)categorising or coarsening values, i.e. modifying their scale or order of magnitude. Randomisation techniques are used to add “noise” to the data to increase uncertainty of observations. (Cabrera 2017; EU's article 29 working group: Opinion 05/2014.) Successful anonymisation usually requires the use of several anonymisation techniques as well as assessment of the balance between data anonymity and data utility.

All anonymisation techniques have their advantages and limitations, which is why you should familiarise yourself with their effects on data quality and utility. Categorising variables enables retaining information in the data and utilising it with certain research methods. Categorisation lessens data utility, but only a little (Purdam & Elliot 2007). In terms of anonymity, however, it is problematic that an entity can still be linked to a specific category after recoding (EU's article 29 working group: Opinion 05/2014). Moreover, categorising all values of a variable may make it difficult to determine relationships between variables and prevent the use of certain data analysis techniques designed for continuous variables (Anguli, Blitzstein & Waldo 2015).

Randomisation may be useful when there are relatively few rare observations in the data (under one percent). However, when using randomisation techniques, you should carefully assess the impact of the technique on the quality of the data. Randomisation techniques may have a significant effect on, for instance, the frequency distributions of variables and analyses of correlation and causation. These, in turn, affect research results. Although some researchers consider different randomisation techniques distortion of data, they are often useful in anonymisation.

In the following sections, we present the most common generalisation and randomisation techniques. Generalisation techniques include excluding, categorising and coarsening information, using samples instead of the whole data, and k-anonymisation and l-diversity. Randomisation techniques obscure the exact values of variables through noise addition and permutation. Finally, we list ways to assess which anonymisation technique is suitable for you.

Removing variables, values and units of observation

Removing a variable containing direct or strong indirect identifiers is the easiest and most obvious way to decrease the risk of identification. Variables containing indirect identifiers can also be removed when necessary. For instance, if in a survey on self-reported crime the young participants are asked which school they attend, the variable may present a disclosure risk when linked with other background variables. In this case, the school variable should be removed.

Sometimes it is also necessary to remove open-ended variables to prevent disclosure. This is often done when information in an open-ended variable is available in the data in another, categorised variable. For instance, if there is a categorised variable on the type of educational institution, the open-ended variable charting the names of the participants' educational institutions could be removed.

If exact information in an open-ended variable is crucial for research, one possible option is to detach the variable from the data into a separate file. You can then coarsen the background variables you need for analysis and include them in the file. If linking the contents of the open-ended variable with the original data constitutes a disclosure risk, you must edit and organise the separate file in a manner that does not allow linking.

Removing individual values from records containing indirect identifiers may be justified if a value constitutes a disclosure risk, i.e. is exceptional or rare. Such a value could be, for example, exceptionally high income or a rare occupation like minister (member of the government). When removing individual values, it should be noted that anonymisation will not be successful if the removed values can be inferred with reasonable effort.

A whole data unit (individual, respondent) may be removed if it is not possible to otherwise remove identifying information on the individual. In some situations, this is a better option than using restricting techniques on the whole data only to de-identify one data unit.

When removing information, you should consider whether the removed information can be inferred by potential attackers. For example, the data of a population study collected from workplace X contains the job titles of all employees and one title is only held by two people. Recoding this value as missing data would not be a good anonymisation solution, as it is relatively easy to find out the original job title. Instead of removing the value, a better solution would be to coarsen the job titles or combine some of the categories.

Editing responses in open-ended variables

Open-ended questions, which the respondents can answer in their own words, occasionally contain identifiers. These identifiers may relate to the respondents themselves or to third persons. The information in open-ended responses does not suffer decisively if identifiers (names, phone numbers, email addresses etc.) are removed. When it comes to other potentially identifying information in open-ended variables, disclosure risk should be assessed on a case-by-case basis taking into consideration the topic of the study and available background variables.

You can mark the anonymised names, words and excerpts in the data with square brackets. Original terms may be substituted by coarser, more general terms within square brackets or they may be simply marked as [identifier removed]. For example, in a survey collected from all teachers in Anytown Elementary, one teacher says she works in the only special unit of the school with few employees. Because there are only three teachers in the unit, the information is identifying and should be deleted, e.g. like this: [identifier removed]. Simply anonymising the special unit as follows is not sufficient: [special unit Y of school X removed]. This is because the unique special unit is easily inferred. All additional information in open-ended responses revealing that the teacher works in the special unit should also be removed. See instructions on how to anonymise qualitative data for further advice on anonymising open-ended variables.
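Obvious machine-readable identifiers such as email addresses and phone numbers can be redacted automatically before manual review, using the square-bracket marker described above. The patterns below are illustrative only and will not catch every format, let alone contextual identifiers like the special-unit example, so they complement rather than replace case-by-case assessment.

```python
import re

# Illustrative patterns; real data always need manual review as well.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\b\+?\d[\d\s-]{6,}\d\b")

def redact(text):
    """Replace obvious direct identifiers with the marker used in the text."""
    text = EMAIL.sub("[identifier removed]", text)
    text = PHONE.sub("[identifier removed]", text)
    return text
```

For example, `redact("Contact me at anna@example.com or 040 123 4567.")` removes both the email address and the phone number while leaving the surrounding response intact.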

Using a sample instead of the full data

One technique to prevent re-identification often used by Statistics Finland is to provide a sample of the data instead of the full population data. Different sampling methods are used to create a random sample of the data.

Data archived at FSD mostly contain samples instead of complete populations.

Recoding variable values

Recoding the values of a variable is a better solution than simply removing the variable. For instance, instead of including the names of schools, you can recode the school variable into broader categories such as 'lower secondary school', 'upper secondary school', 'vocational school', etc. You can also categorise identifiers like the exact age, municipality of residence and occupation. For instance, record the year of birth rather than the day, month and year, or recode it into categories that contain 3–5-year age groups.

Variables containing detailed geographical information, such as postal codes, can be aggregated from five-digit variables into two- or three-digit ones. The variable identifying the respondent's municipality of residence can be aggregated into two different variables: region/province and municipality type (urban, semi-urban, rural, etc.). This is a way to minimise identification risk without losing background information relevant for research.

» Regional classifications of Statistics Finland
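The recoding operations described above are mechanical and easy to script. The following sketch shows two of them: collapsing exact age into five-year bands and truncating a five-digit postal code to its first two digits. The function names and band widths are illustrative choices, not a fixed convention.

```python
def age_band(age, width=5):
    """Recode an exact age into a width-year band, e.g. 34 -> '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def coarsen_postcode(code, keep=2):
    """Keep only the first `keep` digits of a postal code, e.g. '33100' -> '33xxx'."""
    return code[:keep] + "x" * (len(code) - keep)
```

Applied to every record, these recodings replace exact values with categories that many individuals share, which is the core idea of generalisation.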

An occupation variable can be classified into occupational groups (e.g. Managers; Professionals; Technicians and associate professionals; Clerical support workers; Service and sales workers; Skilled agricultural, forestry and fishery workers; Craft and related trades workers; Plant and machine operators and assemblers; Elementary occupations; Armed forces). Another option is to use employment status grouping (e.g. Employees; Employers; Own-account workers; Contributing family workers etc).

» Social classifications of Statistics Finland

One way to reduce disclosure risk is to restrict the upper and lower ranges of a continuous variable to exclude outliers. This anonymisation technique is typically used for income variables. Highest incomes may be top-coded, that is, coded into a new category (e.g. "income over xxxxx euros") while other income responses are preserved as actual quantities (= the actual income in euros). In the same way, the smallest observed values can be bottom-coded.
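Top- and bottom-coding can be sketched as follows. The thresholds and labels are purely illustrative; in practice they would be chosen based on the actual distribution of the income variable.

```python
def top_bottom_code(values, lower, upper):
    """Replace values above `upper` (top-coding) or below `lower`
    (bottom-coding) with category labels; keep the rest as actual quantities."""
    coded = []
    for v in values:
        if v > upper:
            coded.append(f"over {upper} euros")
        elif v < lower:
            coded.append(f"under {lower} euros")
        else:
            coded.append(v)
    return coded
```

For instance, with a lower bound of 1000 and an upper bound of 10000, an exceptional income of 250000 euros becomes the category "over 10000 euros" while typical incomes remain exact.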

Another way to remove identifiers is to categorise open-ended responses. This technique functions well for open-ended questions collecting background information such as place of residence, education, educational institutions, place of work etc. For instance, a survey of physicians might contain an open-ended question on specialisation. Linked to other background variables, this variable might lead to an identification of physicians who are specialised in more than one area. One solution is to categorise the open-ended variable into broader categories, such as 'one area of specialisation', 'two or more areas of specialisation', etc.

It is also possible to change open-ended responses into a dichotomous variable (responded - did not respond) if the textual responses could lead to disclosure risk when linked to other background variables. This may be convenient for mainly quantitative variables where most response options have been classified and a separate open-ended 'Other, please specify' option has been created for responses that do not belong to any classes mentioned. For example, such a question may be used to ask what the participant's mother tongue is, with response options '1) Finnish, 2) Swedish, 3) Other, please specify' or to ask about religious denomination (Evangelical-Lutheran; Orthodox; Other, please specify). The open-ended responses given to the last alternative may constitute an identification risk when linked to other background variables. A good solution is to remove the open-ended responses from the data and only leave information on whether the respondent chose this option or not.
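Dichotomising an open-ended 'Other, please specify' variable is a one-line transformation; the sketch below is illustrative, and treating whitespace-only answers as non-responses is an assumption on our part.

```python
def dichotomise(response):
    """Replace a free-text 'Other, please specify' answer with a simple
    responded / did not respond indicator."""
    return "responded" if response and response.strip() else "did not respond"
```

The potentially identifying text is discarded, but the data still record that the respondent chose the 'Other' option.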

Discretionary categorisation of variable values

Categorising or coarsening variables may significantly diminish the possibility to draw statistical conclusions. A good option for balancing between data utility and disclosure risk is to recode only some values of a variable into broader categories. If a variable's values range from 1 to 20 and most cases fall within values 1–12, it may be a good idea to leave the values up to 12 as they are and combine the higher values into broader categories like 13–15 and 16–20. However, you should pay attention to the impact of this technique on the mean of the variable as well as on correlations between different variables.
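A discretionary recoding of this kind can be expressed as a simple mapping; the thresholds below follow the example above and are illustrative only.

```python
def discretionary_recode(value):
    """Keep the frequent low values (up to 12) as they are and combine
    the sparse upper range into broader categories; thresholds are
    illustrative, chosen from the distribution of the variable."""
    if value <= 12:
        return str(value)
    if value <= 15:
        return "13-15"
    return "16-20"
```

Comparing the mean and correlations of the recoded variable against the original is then a quick way to judge how much utility the recoding has cost.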

K-anonymisation and l-diversity

There are statistical anonymisation methods for assessing disclosure risk that help a researcher gain perspective on the anonymity of their data and justify the decisions made. One of the best-known of these methods is k-anonymisation, which is an attempt to combine the best features of statistical approaches (Elliot et al. 2016). K-anonymisation and l-diversity can be used, for example, when data are collected from a complete population and there are attributes that enable indirect identification of individuals or clusters of individuals. Such data include patient data, among others. K-anonymisation and l-diversity can also be used to ensure successful anonymisation after other anonymisation techniques have been used. There are free anonymisation tools available online, such as ARX and µ-ARGUS (ibid.).

K-anonymisation aims to prevent the identification of a data unit by forming a group of at least k records with the same attributes (El Emam & Dankar 2008). In other words, each combination of indirectly identifying values should be shared by at least k records. For example, in a situation where a dataset contains only one male aged over a hundred years from Tampere, this individual should be grouped among others so that he is not the only person with these attributes. If the data contain other males over the age of 90 from Tampere, the hundred-year-old could be grouped among them. There is not an exact value for k and it should be decided on a case-by-case basis. Sometimes, a k of two data units may be sufficient (Cabrera 2017), but at least three is preferable. Some scholars have claimed that k should contain 5–10 data units (Anguli et al. 2015; Machanavajjhala et al. 2007).
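A basic k-anonymity check counts how many records share each combination of quasi-identifier values. The sketch below is a simplified illustration (dedicated tools such as ARX implement this and much more); the records mirror the invented Tampere example above.

```python
from collections import Counter

def k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records."""
    combos = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return all(n >= k for n in combos.values())

# Invented records: after coarsening, three men over 90 from Tampere
# share the same attribute combination.
records = [
    {"gender": "male", "age": "90+", "town": "Tampere"},
    {"gender": "male", "age": "90+", "town": "Tampere"},
    {"gender": "male", "age": "90+", "town": "Tampere"},
]
```

With these records, `k_anonymous(records, ["gender", "age", "town"], 3)` holds, i.e. the data are 3-anonymous with respect to these three attributes.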

The problem with k-anonymisation is that it does not prevent an attacker from inferring what kind of sensitive attribute is in question if all individuals of a k-anonymised group share the same value of the attribute. That is, k-anonymisation prevents identity disclosure but it does not prevent attribute disclosure. This is where l-diversity becomes useful. L-diversity ensures that in a group of data units with identical attributes there are at least l values for a sensitive attribute. In other words, there should be enough variability between the values so that an attacker cannot infer what kind of sensitive information the value contains. (EU's article 29 working group: Opinion 05/2014.) It should be noted that l-diversity is not a de-identification technique per se but it prevents uncovering what kind of sensitive information pertains to an individual if the individual is re-identified (Cabrera 2017).

An example of l-diversity: data collected from all inpatients of an eating disorder clinic contain sensitive information on whether the respondent has tried to commit suicide in the past two years (yes/no). The respondents are k-anonymised into groups of at least three individuals in terms of certain indirectly identifying attributes (age group, gender, town of residence). This technique is sometimes called 3-anonymity (Cabrera 2017). When examining the sensitive information on suicide attempts, it becomes apparent that all male respondents aged 25–34 from Tampere have tried to commit suicide in the past two years. Therefore, if an attacker knows the identity of any male aged 25–34 from Tampere who had been an inpatient at the clinic during the survey, it is immediately obvious that this individual has tried to commit suicide. In order to achieve l-diversity (e.g. l=2), the group should contain both those who had tried to commit suicide and those who had not. The term 2-diversity is sometimes used in a situation like the one described above, where the sensitive attribute has two distinct values (ibid.). Because l-diversity is not achieved in the example, one option would be to coarsen background variables (e.g. municipality of residence into region of residence).
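
The l-diversity check in the example can be sketched as follows: for each group of records with identical quasi-identifiers, count the distinct values of the sensitive attribute. The data are invented to mirror the example above.

```python
from collections import defaultdict

# Hypothetical records: (age group, gender, town) plus the sensitive
# yes/no attribute (suicide attempt in the past two years).
records = [
    (("25-34", "male", "Tampere"), "yes"),
    (("25-34", "male", "Tampere"), "yes"),
    (("25-34", "male", "Tampere"), "yes"),
    (("25-34", "female", "Tampere"), "yes"),
    (("25-34", "female", "Tampere"), "no"),
    (("25-34", "female", "Tampere"), "no"),
]

# Collect the distinct sensitive values within each equivalence class.
groups = defaultdict(set)
for quasi_ids, sensitive in records:
    groups[quasi_ids].add(sensitive)

# The data are l-diverse for the smallest number of distinct values.
l = min(len(values) for values in groups.values())
print(l)  # 1: the male group shares one value, so attribute disclosure is possible
```

Here the male group fails 2-diversity even though the data are 3-anonymous, which is exactly the weakness of k-anonymisation described above.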

T-closeness can be used if it is important to keep the data as close as possible to the original. T-closeness is achieved when there are at least l different values within each equivalence class and each value is represented as many times as necessary to mirror the initial distribution of each attribute. For more details on t-closeness, see, for instance, EU's article 29 working group: Opinion 05/2014.

Noise addition

Adding noise refers to modifying attributes in the data to make them less accurate, in order to increase uncertainty over the exact values of the observations. Noise can be added in various ways. For example, values of the age attribute could be expressed with an accuracy of ±2 years. An observer of the data will assume the values are accurate, although they are only so to a certain degree. (EU's article 29 working group: Opinion 05/2014.)

Noise can also be added by multiplying the original values by a random number or by transforming categorised values into other values based on predetermined probabilities. An example of the latter would be transforming 15% of North Karelians into inhabitants of the Kainuu region. In addition, identifiable values of continuous variables may be aggregated into group means. (Cabrera 2017.) For instance, the exact drug costs of patients with sensitive illnesses could be replaced by the average drug costs of patients with these illnesses.
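
Both techniques mentioned above can be sketched briefly. The ages, regions and the 15% probability are the illustrative figures from the text; the random seed is an assumption added only to make the sketch reproducible.

```python
import random

random.seed(0)  # fixed seed so the sketch gives the same result on each run

# 1. Perturb ages by a random offset within +/- 2 years.
ages = [23, 47, 61, 35]
noisy_ages = [age + random.randint(-2, 2) for age in ages]

# 2. Reassign 15% of North Karelians to the Kainuu region at random.
regions = ["North Karelia"] * 100
noisy_regions = [
    "Kainuu" if region == "North Karelia" and random.random() < 0.15 else region
    for region in regions
]
print(noisy_regions.count("Kainuu"))  # roughly 15 of the 100 records
```

Note that both operations are irreversible only if the original values are destroyed; otherwise the result is merely a perturbed copy, not an anonymised one.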

Permutation

Permutation refers to altering the values of attributes by swapping them from one record to another. Because the values are merely swapped between data units, the variance and distribution of a variable do not change, but the correlation between values and individuals is lost. Permutation does not provide strong guarantees if two or more attributes have a logical relationship and are permuted independently, because an attacker might identify the permuted attributes and reverse the permutation. This is why it is advisable to use permutation only for attributes that are not strongly correlated. For example, in a situation where two attributes, such as income and occupational status, have a strong logical relationship and one of them needs to be anonymised, consider using another anonymisation technique instead of or in addition to permutation. (EU's article 29 working group: Opinion 05/2014.)
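
A minimal sketch of permutation, using invented incomes: the values are shuffled across records, so the distribution of the attribute is preserved while the link between a record and its value is broken. The seed is an added assumption for reproducibility.

```python
import random

random.seed(42)

names = ["A", "B", "C", "D"]
incomes = [28000, 35000, 51000, 64000]

# Swap the income values randomly between records.
permuted = incomes[:]
random.shuffle(permuted)

# The distribution (and hence mean and variance) is unchanged...
print(sorted(permuted) == sorted(incomes))  # True
# ...but the values no longer align with the original records.
```

If occupational status were kept intact alongside permuted incomes, an attacker could use the logical relationship between the two to reverse the swap, which is why correlated attributes should not be permuted independently.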

Assessing the robustness of anonymisation

As Elliot et al. (2016) point out, "[a]nonymisation is not an exact science," so determining the sufficient level of anonymisation may sometimes prove problematic. However, you can use the following questions to assess your choice of anonymisation technique and the robustness of the outcome (adapted from EU's article 29 working group: Opinion 05/2014). If you answer the first two questions in the negative and there is a very small chance of inference, the anonymity of the data is in good order.

  1. Singling out an individual: Can you still single out any individual in the data after anonymisation?
  2. Linkability: Can you link records relating to an individual to another dataset or information from external sources and thus identify the individual?
  3. Inference: Can you infer that certain information concerns a specific individual? Can you infer the original values of altered or removed values?

Anonymising qualitative data

The anonymisation measures presented below can be used both for anonymising qualitative data and for anonymising extracts published from research data. The guidelines presented here apply to textual data only; we do not provide instructions for anonymising sound or video recordings.

The starting point in making a textual dataset anonymous is to erase background material containing identifiers, such as the contact details of participants and background information forms.

When you remove or edit identifiers, mark all changes to the data clearly. You can mark the changes with single or double square brackets: [changed text] or [[changed text]].

Replacing personal names with pseudonyms

Changing proper nouns into pseudonyms is the most popular anonymisation technique used for qualitative data. However, pseudonymisation does not render data anonymous until the original identifiers are completely disposed of. Research teams must be consistent in the selection and use of pseudonyms throughout a research project. A spreadsheet file available to all team members can be used to maintain a list of names and their pseudonyms. The same pseudonyms should be used in both the data and the published excerpts.
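
Applying a shared mapping consistently can be sketched as below. The names, pseudonyms and the helper function are hypothetical; in practice the mapping would be maintained separately (for example in the shared spreadsheet mentioned above) and destroyed once anonymisation is complete. The square brackets follow the marking convention described earlier.

```python
# Hypothetical name-to-pseudonym mapping shared by the research team.
pseudonyms = {
    "Matti Virtanen": "Juha Korhonen",
    "Matti": "Juha",
    "Liisa": "Anna",
}

def pseudonymise(text, mapping):
    # Replace longer names first so "Matti Virtanen" is not caught by "Matti".
    for name in sorted(mapping, key=len, reverse=True):
        text = text.replace(name, f"[{mapping[name]}]")
    return text

transcript = "Matti Virtanen said that Liisa had already left."
print(pseudonymise(transcript, pseudonyms))
# [Juha Korhonen] said that [Anna] had already left.
```

A naive string replacement like this still needs manual review: it cannot tell a person called Virtanen apart from, say, a street of the same name, and it misses inflected forms and nicknames.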

When anonymising proper names, it is always a better option to use pseudonyms rather than simply delete the names altogether or replace them by a letter or a character string, such as [x] or [---].

Replacing proper names with pseudonyms enables the researcher to retain the internal coherence of the data. In cases where several individuals are frequently referred to, data may become unintelligible if the proper names are simply removed.

Using a pseudonym for both the first name and the surname may be justified to make the transcription resemble natural speech or to keep a large number of participants separate from one another. The usual procedure, however, is to replace the first names with pseudonyms and remove the surnames. If a person is referred to by his/her surname only, the pseudonym should also be a surname.

A dataset may contain references to persons who are publicly known on account of their activities in politics, business life or other work-related spheres. Their names are not changed to pseudonyms. However, a pseudonym or categorisation (e.g. [local politician]) should be used if the reference is related to the person's private affairs.

Categorising proper nouns

There is no need to create pseudonyms for persons who are only mentioned once or twice in the data and who have no essential importance for understanding the content. Instead, their names can be replaced by a category or role (e.g. [woman], [man], [sister], [father], [colleague, female], [neighbour, male]). Using pseudonyms is not always necessary for other proper nouns either. If a unit of data (personal interview, group interview, biography, letter, etc.) contains only one school or place of residence, its name can be replaced by a category, for instance, [lower secondary school], [home town] or [residential area].

If workplaces are mentioned or there is other information about businesses and workplaces that can constitute an indirect identifier in the data, researchers can use Statistics Finland's Industrial Classification in replacing the names with categories. Another possibility is to simply generalise Peters & Peters into [law firm], Tottenham Hotspur into [football club], Pizza Hut into [restaurant], etc.

» Industrial Classification of Statistics Finland

If need be, place names mentioned in the data can be replaced by more general expressions like [population centre], [district], [village], etc. If you are not certain whether a place name denotes a municipality or a suburb, various place name lists and municipality catalogues may be of help.

If it has been decided that the participants' municipality of residence will not be revealed, researchers should remember to remove identifying geographical information relating to participants' place of residence both from background information and the textual data content. For instance, if the participant mentions that he or she often goes to a particular restaurant that is located a short walking distance away from his or her home, it is best to replace the name of the restaurant with a generic expression [restaurant].

Changing or removing sensitive information

Identifying sensitive information should be removed, categorised or classified. For example, 'AIDS' could be changed to [severe long-term illness] and thereafter referred to as [illness], provided that the reader is able to deduce from the context that [illness] refers to the 'severe long-term illness' mentioned in the beginning.

Removing or generalising sensitive data is justified if a) the respondent mentioned it only incidentally, b) the information is not relevant to the subject matter, and c) the sensitive information constitutes a disclosure risk. For example, if a study focuses on the lives of persons with a severe illness, disclosure risk is best reduced by using anonymisation methods other than altering crucial information.

Categorising background information

Background characteristics of participants, such as gender, age, occupation, workplace, school, or place of residence, are often essential for comprehending the data. Such characteristics may constitute important contextual information for secondary analysis. Detailed background information can be edited into categories similarly to indirect identifiers in quantitative data. Various existing classifications, such as those used by national statistical institutes, are helpful in the process. If researchers create their own classifications, the classifications should be documented in detail in the data description.

Categorisation is often a better solution than deleting background information. An example: an interview of a man whose actual background information reads "44-year-old system specialist working in the Computer Center at the University of Tampere, married with two children aged 9 and 11, lives in Tampere". To reduce the risk of identification, his background information could be categorised in the following manner:

  • Gender: Male
  • Age: 41–45
  • Workplace: University
  • Occupation: Information and communications technology (ICT) professional
  • Household composition: Wife and two school-age children
  • Place of residence: Town in Western Finland

In the example above, the workplace (university) does not need to be generalised into [public sector employer], since the remaining background data do not allow even a partial identification. There are three universities and some separate units of other universities in the region of Western Finland.
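
The coarsening of the example's background information can be sketched as a simple mapping. The 5-year age bands and the category labels are illustrative choices taken from the example, not a fixed classification.

```python
def categorise_age(age):
    """Coarsen an exact age into 5-year bands: 41-45, 46-50, ..."""
    lower = (age - 1) // 5 * 5 + 1
    return f"{lower}-{lower + 4}"

# Hypothetical raw background information of the interviewed man.
participant = {"age": 44, "occupation": "system specialist",
               "employer": "University of Tampere", "residence": "Tampere"}

# Categorised version, following the example above.
anonymised = {
    "age": categorise_age(participant["age"]),
    "occupation": "ICT professional",
    "employer": "University",
    "residence": "Town in Western Finland",
}
print(anonymised["age"])  # 41-45
```

When a project defines its own bands like this, the scheme should be documented in the data description, as noted above.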

When considering the need to categorise background information, researchers should take into account the other anonymisation techniques explained above as well as the subject matter and content of the data.

» Social Classifications of Statistics Finland
» Regional Classifications of Statistics Finland
» Industrial Classification of Statistics Finland

Changing values of identifiers

Sometimes it is possible to anonymise qualitative data by distorting information, just like values of identifying attributes can be swapped between records in quantitative data. For instance, an exact date of birth – which as an identifier should normally be removed – may sometimes be crucial for understanding the content. A hypothetical example:

The interviewee was born on 31st December 1958. On New Year's Eve in 2005 she sat by the hospital bed of her dying child. In the interview, she describes in detail her conflicting emotions evoked by the fact that New Year celebrations, the death of her child, and her own birthday are all mingled together in her mind.

In a case like this, deleting New Year's Eve from the data would prevent us from understanding the content. The date (New Year's Eve) can be retained in the data if the interviewee's year of birth is changed to one or two years earlier or later.

Removing hidden metadata from files

During anonymisation, it is important to check whether archival files contain any hidden technical metadata that could enable the identification of research participants. Such hidden metadata consist of, for example, location information and information about the owner of a device or a user profile. Technical metadata may be saved when files are created but also when they are edited.

Research data in the form of text or images may consist of files created by the research participants themselves. Risk of identification from the metadata is particularly high in these cases. As textual data often comprise text files created and directly submitted by the research participants, the hidden metadata of these files refer to the participants explicitly. EXIF data of digital images may also contain very precise information, such as the exact coordinates of where the picture was taken and even the photographer's name.

Technical metadata can be removed by using common text or picture editors (e.g. MS Office, Windows File Explorer, Photoshop, GIMP, Irfanview). There are also programs specifically designed for removing EXIF data (e.g. Easy Exif Delete), which make removing the hidden metadata easy. Specific instructions on how to remove technical metadata depend on the software and its version. See the instructions on the website of the program you are using.

Practical tips for anonymisation

  • Make a written anonymisation plan for your research data.
  • Take into account any background material not included in the actual data files, as it may contain identifiers that need to be anonymised or removed (contact details of participants, paper questionnaires, etc.).
  • Remember to check that no identifying information related to third persons is retained in the data.
  • Carry out anonymisation of quantitative data by using the syntax of a statistical package (more information on using syntax).
  • When anonymising textual data, use the Find and Replace commands to make one change at a time.
  • Check whether different names (first name, nickname, surname) have been used to refer to one and the same person (e.g. whether Elizabeth is also referred to as Eliza, Beth, Bess, Libby, etc.).
  • Make it easier to plan anonymisation of textual files by consistently putting a symbol or special character (e.g. #, ¤) in front of each proper noun at the transcription stage. It will save a lot of time later.
  • Planning and being thorough pays off in the end.

Identifier type table

Different types of identifiers are listed in the table below. Information that is deemed sensitive is marked with an asterisk (*). Each identifier is characterised as either direct identifier, strong indirect identifier or indirect identifier.

The last column notes the easiest methods for dealing with that type of identifier. The methods include removing the identifier, changing it into a pseudonym and categorising or classifying it.

Some attributes may be both indirect identifiers and strong indirect identifiers. For example, an unusual occupation or occupational status is a strong indirect identifier, while a common occupation is an indirect identifier.

The table is not exhaustive but may provide good tips for recognising identifiers and anonymising research data.

Table 1.

Identifier type | Direct identifier | Strong indirect identifier | Indirect identifier | Anonymisation method
--------------- | ----------------- | -------------------------- | ------------------- | --------------------
Personal identification number | x | | | Remove
Full name | x | | | Remove/Change
Email address | x | x | | Remove
Phone number | | x | | Remove
Postal code | | | x | Remove/Categorise
District/part of town | | | x | Categorise
Municipality of residence | | | x | Categorise
Region | | | x | (Categorise)
Major region | | | x |
Municipality type | | | x |
Audio file | x | | | Remove
Video file displaying person(s) | x | | | Remove
Photograph of person(s) | x | | | Remove
Year of birth | | x | | Categorise
Age | | | x | Categorise
Gender | | | x |
Marital status | | | x |
Household composition | | | x | (Categorise)
Occupation | | (x) | x | Categorise
Industry of employment | | | x |
Employment status | | | x |
Education | | | x | Categorise
Field of education | | | x |
Mother tongue | | | x | Categorise
Nationality | | | x | (Categorise)
Workplace/Employer | | (x) | x | Categorise
Vehicle registration number | | x | | Remove
Title of publication | | x | | Categorise
Web page address | | (x) | x | Remove
Student ID number | | x | | Remove
Insurance number | | x | | Remove
Bank account number | | x | | Remove
IP address | | x | | Remove
Health-related information * | | (x) | x | Categorise/Remove
Ethnic group * | | (x) | x | Categorise/Remove
Crime or punishment * | | | x | Categorise/Remove
Membership in a trade union * | | | x | Categorise
Political or religious allegiance * | | | x | Categorise
Other position of trust or membership | | (x) | x | Categorise/Remove
Need for social welfare * | | | x | Categorise/Remove
Social welfare services and benefits received * | | | x | Categorise/Remove
Sexual orientation * | | | x | Remove

References and more information:

updated 2018-07-10