Anonymisation and Personal Data

The guidelines provided here refer to the situation in Finland, and may not always be applicable in other countries due to differences in legislation.

What is meant by personal data?

According to section 3 of the Finnish Personal Data Act, the term personal data refers to any information relating to a private individual or to his/her personal characteristics or personal circumstances, where these are identifiable as concerning him/her or the members of his/her family or household.

It is important to note that by this definition any information related to a natural person may be defined as personal data, regardless of whether it relates to research participants or to other persons. Research may contain personal data of individuals close to research participants or of third parties mentioned in the research. Information related to these persons also constitutes personal data.

The Act does not contain any limitations regarding the nature and character of personal data. Any information related to a natural person may be defined as personal data. This includes statements, opinions, attitudes or values. Personal data may be objective or subjective. Whether the information is true or verifiable or not, is of no consequence here. The information may refer to an individual's private life, family life, health, physical characteristics, professional activities, or economic or social behaviour.

What kind of data are personal?

Personal data are any kind of data that may be used to identify a natural person. Identification can be made on the basis of factors specific to the physical, psychological, mental, economic, cultural or social identity of an individual or individuals.

Information that is sufficient on its own to identify an individual includes a person's full name, social security number, email address containing the personal name, and biometric identifiers such as fingerprints, facial image, voice patterns, hand geometry, iris scan, or manual signature. Data of this type are often called direct identifiers.

Other information that may be used to identify an individual fairly easily includes a person's postal address, phone number, vehicle registration number, bibliographic citation to publications, email address not in the form of the personal name, web address of a web page containing personal data, unusual job title, rare disease, or a position held by only one person at a time (e.g. chairperson of a voluntary organisation). The Finnish Social Science Data Archive calls this type of data strong indirect identifiers.

The data archive also counts as strong indirect identifiers the types of codes that can be used to unequivocally identify an individual from among a group of individuals. These include, for instance, student ID number, insurance or bank account number, and the IP address of a computer.

Other indirect identifiers are kinds of information that on their own are not enough to identify someone but that, when linked with other available information, could be used to deduce the identity of the person. For example, age, gender, municipality of residence, or job title may in some cases, when combined with other information, enable identification. Background variables form the most common case of indirect identifiers.

According to the Personal Data Act, pseudonymous or coded data are also regarded as personal data. These include data from longitudinal studies where participants have a case ID instead of a personal identification number, but the research team has the code key which team members could use to connect the information to a particular research participant.

When are data anonymised?

Data are anonymised if characteristic factors (for instance, indirect identifiers when linked together) are the same for several individuals and if any particular individual cannot be identified with reasonable effort. The assessment of how identifiable the data of a dataset are and how they can be anonymised is always done on a case-by-case basis.

For example, the Personal Data Act implies that data from longitudinal studies are considered identifiable as long as the research team has the code key. In the eyes of the law, even if the original code key were coded twice (double coding), the procedure does not render the data anonymous. Coding and double coding are, however, useful data protection and data security methods when researchers wish to prevent the use of identifiers in their analyses. Coding and double coding are often used in medicine.

Data are fully anonymous only when no natural person can be identified with reasonable effort. It should be impossible to identify any individual from fully anonymised data, for instance, through indirect identifiers or by combining information collected for the study with information from other sources. For the data to remain anonymous, no new information relating to the same research participants should be added to it. The Finnish legislation requires that anonymisation must be irreversible for the data to be considered anonymous.

Using personal data in research

According to the Personal Data Act, data containing identifiers may be used for scientific research if the use is appropriate, planned and justified.

From the point of view of research participants, processing personal data carries the risk that confidential information relating to them is revealed to outsiders (for instance, to people close to them, to employers or authorities). Therefore personal data must be processed carefully and in a well-planned manner. Data protection must not be jeopardised, for example, by careless preservation or insecure digital transfers.

» More information on data security

Personal identification numbers, personal names, addresses and other unnecessary identifiers must be removed from data whenever possible. Identifying information stored separately must be destroyed permanently when it is no longer needed for validating analyses and there are no longer any legal grounds for its preservation. If there are grounds for preserving such information for research purposes (e.g. for longitudinal studies), research participants must always be asked to give consent to the preservation of their personal data. The identifiers needed for the research and their preservation time must always be documented in the scientific research data file description.

» Description of scientific research data file (Office of the Data Protection Ombudsman, PDF form)

To ensure data protection in large research consortiums, dissemination of sensitive information needs to be carefully planned. One option is to give each research team an anonymised version that is slightly different from the version received by other teams. However, even in this case the original data with all its identifiers are retained somewhere, which means that in the view of Finnish legislation, even the disseminated anonymised version is still data containing personal information. Several complex mathematical models have been created to produce slightly different anonymised versions (see for example Dwork & Roth 2014). At its simplest, one randomly selected case is removed from each disseminated data version, or a single value of a variable is changed. When making such changes, attention must be paid to their impact on variable means and correlations between variables.
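The simplest variant mentioned above, removing one randomly selected case from each disseminated version, could be sketched in Python roughly as follows. The function name, the team labels and the seed are illustrative assumptions, and the sketch assumes that there are at least as many cases as teams.

```python
import random


def team_versions(cases, teams, seed):
    """Give each team a copy of the data with a different case left out.

    Hypothetical helper: one case index is drawn per team without replacement,
    so no two teams receive exactly the same version.
    """
    rng = random.Random(seed)
    left_out = rng.sample(range(len(cases)), k=len(teams))
    return {
        team: [case for i, case in enumerate(cases) if i != index]
        for team, index in zip(teams, left_out)
    }


cases = ["case01", "case02", "case03", "case04", "case05"]  # invented example data
versions = team_versions(cases, ["Team A", "Team B"], seed=2017)
for team, version in versions.items():
    print(team, len(version))
```

Each version in this sketch differs from the others by a single case, so the effect on means and correlations stays small, but the versions still count as personal data as long as the original data are retained.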

Ethics

Identifying an individual participant from data collected and processed in accordance with the law is harmful only when the information is misused. Research data containing personal information must not be used or disseminated for any other purpose than the one specified in the consent participants have given. Good research ethics dictate that researchers must take great care to prevent any situation where information contained in their data could influence the status or assessment of a participant or decisions relating to him or her.

The principal goal of research is to produce scientific information to increase the understanding of health, disease and social problems, as well as the understanding of society and culture in general. This goal does not in any way allow revealing personal information relating to research participants to authorities or other third parties. Privacy protection is a basic right that protects citizens also from actions carried out by authorities.

If a researcher unexpectedly comes across a unit of data (a questionnaire, an interview, a diary, written material etc.) relating to a person he/she knows personally, the researcher should carefully consider whether it would be more ethical to leave that unit out of the analysis altogether. Whether researchers leave such units out or not, they are still bound by the duty of confidentiality.

Unlike research data, research publications are in the public domain. The statistics and tables of quantitative research must be presented in such a manner that there is no risk of identification even when the publication is based on data containing identifiers.

In the case of qualitative data, the risk of identification must always be evaluated before any samples/quotations from the data are published: which indirect identifiers will be left in the sample as such, which will be categorised and which will be removed altogether so that individual participants can no longer be identified.

In historical and cultural research, for example, it may be justified both in terms of science and research ethics to publish the names of research participants. Studies based on expert interviews may also be analysed and their results published without masking or deleting the names of the experts. However, the participants' consent for this must be obtained in advance.

The starting point of anonymisation

There are no anonymisation methods that would be suitable for all types of data. Anonymisation is always planned on a case-by-case basis. Simply removing direct and strong indirect identifiers is rarely sufficient to make data anonymous. In addition, researchers must always consider the need to remove or mask other indirect identifiers. They should also make sure that no individuals can be identified by using additional information from other sources.

Usually the first procedure is to remove direct and strong indirect identifiers from the data (see the Identifier type table). Direct and strong indirect identifiers may also appear in other parts of the data than in individual variables of a quantitative dataset or in the personal information of each participant at the beginning of each qualitative interview. In quantitative data, such identifiers may be included in responses to open-ended questions. In qualitative data, they may occur just about anywhere.

Background variables and indirect identifiers include, for example, gender, age, education, occupational status, economic activity, socioeconomic status, household composition, income, marital status, mother tongue, nationality, ethnicity, workplace, school and geographic variables. Geographical variables include postal code, district/part of the town, municipality, region and major region.

The level of anonymisation needed depends on how many indirect identifiers there are in the data and how exact they are. The greater their number and the more exact they are, the more carefully anonymisation must be considered.

Background variables should always be considered together. If the researcher wishes to leave municipality of residence information in the data, he or she must take care that other background information relating to the persons in question (for instance, occupation, workplace, education, age) is sufficiently coarsened to prevent identification. On the other hand, if it is important for the research to have information on the participants' occupation and age, geographical information relating to the participants must be categorised (major region or municipality type instead of municipality of residence). The need to categorise other background information must be carefully reviewed as well.

In successful anonymisation, information included in the data must be considered together with potential information available from other sources. Data must be processed in such a manner that, even when using information from other sources, no individuals can be identified. When assessing disclosure risk, researchers should also take into account what kind of indirect identifiers can be found in information openly available online (public registers, websites of organisations etc.). As open access to all kinds of information increases rapidly, it is important to check regularly whether data anonymised previously still remain anonymous (assessing residual risk).

Anonymisation of quantitative data

Removing variables

In the case of variables containing direct or strong indirect identifiers, removing the variable is the easiest and most effective way to eliminate the risk of identification. Researchers may also remove variables containing indirect identifiers only. If, for instance, in a survey on self-reported crime young participants have been asked which school they attend, the variable may present a disclosure risk when combined with other background variables. The school variable should be removed.

Sometimes it is also necessary to remove open-ended variables to prevent disclosure. This is often done when the information contained in an open-ended variable is available in the data in another, categorised variable. For instance, if there is a categorised educational institution type variable, the open-ended variable charting participants' educational institution is removed.

If exact information contained in the open-ended variable is crucial for the research, one possible option is to detach the variable from the data into a separate file and leave only the coarsened variables as background information in that file. The separate file must be organised in a manner that does not allow linking with the original data, if linking would constitute a disclosure risk.
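One way to picture such a detached file is the Python sketch below. It assumes that each case is a dictionary, that leaving out the case ID and shuffling the row order is enough to prevent linking, and that the function and variable names are hypothetical. Whether linking is really prevented depends on how coarse the remaining background variables are, which must still be assessed case by case.

```python
import random


def detach_variable(records, variable, background, rng):
    """Split `variable` into a separate, shuffled list without case IDs.

    Returns (remaining, detached): `remaining` is the original data without
    the open-ended variable; `detached` holds the open-ended variable plus
    the selected coarsened background variables, in random order.
    """
    detached = [
        {variable: rec[variable], **{b: rec[b] for b in background}}
        for rec in records
    ]
    rng.shuffle(detached)  # row order must not reveal the original case
    remaining = [
        {key: value for key, value in rec.items() if key != variable}
        for rec in records
    ]
    return remaining, detached


# Invented example data
records = [
    {"id": 1, "institution": "School A", "institution_type": "upper secondary"},
    {"id": 2, "institution": "School B", "institution_type": "vocational"},
    {"id": 3, "institution": "School C", "institution_type": "upper secondary"},
]
remaining, detached = detach_variable(
    records, "institution", ["institution_type"], random.Random()
)
print(remaining[0])
print(sorted(row["institution"] for row in detached))
```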

Recoding variable values

Recoding the values of a variable is a better solution than simply removing the variable. For instance, instead of including the names of schools, the school variable may be recoded into broader categories such as 'lower secondary school', 'upper secondary school', 'vocational school', etc. Exact age, municipality of residence and occupation can also be categorised. For instance, record the year of birth rather than the day, month and year, or recode it into categories which contain 3–5 year age groups.

Variables containing detailed geographical information, such as postal codes, can be aggregated from five-digit variables to two- or three-digit ones. The variable identifying the respondent's municipality of residence can be aggregated into two different variables: region/province and municipality type (urban, semi-urban, rural, etc.). This is a way to minimise identification risk without losing background information relevant for research.
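As a minimal sketch of this kind of recoding, the following Python functions coarsen an exact age into five-year groups and shorten a five-digit postal code to its first digits. The function names, the group boundaries and the 'x' padding are illustrative assumptions; in practice the same recoding would usually be written in the syntax of the statistical program used.

```python
def age_group(age, width=5):
    """Return a label such as '40-44' for an exact age (hypothetical helper)."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def coarsen_postal_code(code, keep=2):
    """Keep only the first `keep` digits of a postal code, padding with 'x'."""
    return code[:keep] + "x" * (len(code) - keep)


print(age_group(44))                 # 40-44
print(coarsen_postal_code("33100"))  # 33xxx
```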

» Regional classifications of Statistics Finland

An occupation variable can be classified into occupational groups (e.g. Managers; Professionals; Technicians and associate professionals; Clerical support workers; Service and sales workers; Skilled agricultural, forestry and fishery workers; Craft and related trades workers; Plant and machine operators and assemblers; Elementary occupations; Armed forces). Another option is to use employment status grouping (e.g. Employees; Employers; Own-account workers; Contributing family workers etc.).

» Statistics Finland classifications

One way to reduce disclosure risk is to restrict the upper and lower ranges of a continuous variable to hide outliers. This anonymisation method is typically used for income variables. Highest incomes may be top-coded, that is, coded into a new category (e.g. "income higher than xxxxx euros") while other income responses are preserved as actual quantities (= the actual income in euros). In the same way, the smallest observed values can be bottom-coded.
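Top- and bottom-coding could be sketched in Python as follows. In this sketch the new category is represented simply by the threshold value itself, so that a top-coded value should be read as "this amount or more". The function name, the thresholds and the income figures are invented for illustration.

```python
def top_bottom_code(values, lower, upper):
    """Clamp values outside [lower, upper] to the limits (hypothetical helper).

    Values equal to `upper` should be read as "upper or more",
    values equal to `lower` as "lower or less".
    """
    return [min(max(value, lower), upper) for value in values]


incomes = [8_000, 24_000, 31_500, 47_000, 250_000]  # invented example data
print(top_bottom_code(incomes, lower=10_000, upper=100_000))
# [10000, 24000, 31500, 47000, 100000]
```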

Another way to remove identifiers is to categorise open-ended responses. This procedure functions well for open-ended questions collecting background information such as place of residence, education, educational institutions, place of work etc. For instance, a survey of physicians might contain an open-ended question on medical expertise. Linked to other background variables, this variable might lead to an identification of physicians who have more than one medical speciality. One solution is to categorise the open-ended variable into broader categories, such as 'one area of medical speciality', 'two or more areas of medical speciality', etc.

It may also be possible to change open-ended responses into a dichotomous variable (responded - did not respond) if the textual responses might lead to disclosure risk when linked to other background variables. This may be convenient for mainly quantitative variables where most response options have been classified and a separate open-ended 'Other, please specify' option has been created for those responses which do not belong to any classes mentioned. For example, such a question may ask what the participant's mother tongue is, with response options 1) Finnish, 2) Swedish, 3) Other, please specify or ask about religious denomination (Evangelical-Lutheran; Orthodox; Other, please specify). The open-ended responses in the last option may constitute an identification risk when linked to other background variables. A good option is to remove the open-ended responses from the data and leave only information on whether the respondent chose this option or not.
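Turning the open-ended responses into a dichotomous variable could look like this in Python, assuming that missing answers are stored as None or as empty strings. The function name and the example answers are invented.

```python
def dichotomise(response):
    """Return 1 if an open-ended answer was given, otherwise 0."""
    return 1 if response is not None and response.strip() else 0


answers = ["Estonian", "", None, "  ", "Russian"]  # invented example data
print([dichotomise(answer) for answer in answers])  # [1, 0, 0, 0, 1]
```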

The aim is to achieve a situation where each combination of indirect identifier values is shared by more than one record (k-anonymised data) (El Emam & Dankar 2008). There should be at least three records for each combination of values, but 5–10 are preferable. If k-anonymisation is not sufficient, another criterion, l-diversity, can be used to further diminish the risk of identification (Machanavajjhala, Kifer et al. 2007). It requires that sensitive information takes several different values within each group of records sharing the same indirect identifiers. If, for example, all persons attending clinic F have diabetes, one can deduce that a person who visits clinic F has diabetes. However, if the clients of clinic F include both people who have diabetes and those who do not have it, one can no longer deduce that a person visiting clinic F has diabetes.
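The two criteria can be checked with a few lines of Python. The sketch below assumes that records are dictionaries and that the researcher has decided which variables count as indirect identifiers. It uses the simplest form of l-diversity (the number of distinct sensitive values per group), and the function names and the clinic data are invented for illustration.

```python
from collections import defaultdict


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same identifier values."""
    counts = defaultdict(int)
    for rec in records:
        counts[tuple(rec[q] for q in quasi_identifiers)] += 1
    return min(counts.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values within any group."""
    groups = defaultdict(set)
    for rec in records:
        groups[tuple(rec[q] for q in quasi_identifiers)].add(rec[sensitive])
    return min(len(values) for values in groups.values())


# Invented example data: every group has three records, but clinic F is not diverse.
records = [
    {"clinic": "F", "age": "40-44", "diabetes": "yes"},
    {"clinic": "F", "age": "40-44", "diabetes": "yes"},
    {"clinic": "F", "age": "40-44", "diabetes": "yes"},
    {"clinic": "G", "age": "40-44", "diabetes": "yes"},
    {"clinic": "G", "age": "40-44", "diabetes": "no"},
    {"clinic": "G", "age": "40-44", "diabetes": "no"},
]
print(k_anonymity(records, ["clinic", "age"]))              # 3
print(l_diversity(records, ["clinic", "age"], "diabetes"))  # 1
```

A result of 1 from the second function means that at least one group reveals the sensitive value of all its members, as in the clinic F example.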

Recoding only some variable values

Categorising or coarsening variables may significantly diminish the possibility to draw statistical conclusions. One good option for balancing statistical usability against disclosure risk is to recode only some values of a variable into broader categories. If the values of a variable range from 1 to 20 and most cases fall into values 1–12, it may be a good idea to leave the values up to 12 as they are and combine the higher values into broader categories, for instance, 13–15 and 16–20. However, one must pay attention to the impact on the mean of the variable as well as on correlation between different variables.
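A partial recoding of this kind could be sketched in Python as follows, assuming that values 1–12 are kept as they are and that the rarer higher values are merged into the categories 13–15 and 16–20. The function name is hypothetical.

```python
def recode_upper_values(value):
    """Keep values 1-12 as they are; merge higher values into broader categories."""
    if value <= 12:
        return str(value)
    if value <= 15:
        return "13-15"
    return "16-20"


print([recode_upper_values(v) for v in [3, 12, 13, 15, 16, 20]])
# ['3', '12', '13-15', '13-15', '16-20', '16-20']
```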

Removing identifiers from responses to open-ended questions

Responses to open-ended questions sometimes contain identifiers relating to respondents themselves or other persons. The information content of a response will not diminish significantly even if direct identifiers (names, phone numbers, e-mail addresses, etc.) are removed. Disclosure risk must be assessed on a case-by-case basis, taking into account the subject of the study and the number and nature of background variables.

Using a sample rather than all of the original data

One method Statistics Finland often uses to prevent disclosure is to release a sample instead of all of the original data. Only part of the population will be analysed and the randomness of the sample will be guaranteed by using various sampling procedures.

Data archived at the Finnish Social Science Data Archive are mostly based on samples.

Swapping and adding random variation

Less well-known anonymisation techniques include swapping and adding random variation to indirect identifiers. Swapping means matching unique cases on the indirect identifier and then exchanging the values of the variable. Some researchers regard these two techniques as distorting data. Adding random variation has a negative impact on statistical analysis since it weakens the correlation between variables and makes it more difficult to analyse cause and effect. Exchanging the values between cases may even lead to dangerously erroneous correlations in health-related data. However, both methods do prevent linking variable information to register information.
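Both techniques can be illustrated with a simplified Python sketch. It adds a small random integer to each value and exchanges the value of one variable between two given cases. The matching step that selects which cases to swap is left out, and the function names, the seed and the data are assumptions made for the example.

```python
import random


def add_noise(values, spread, rng):
    """Add a random integer between -spread and +spread to each value."""
    return [value + rng.randint(-spread, spread) for value in values]


def swap_values(records, key, i, j):
    """Exchange the value of `key` between records i and j (in place)."""
    records[i][key], records[j][key] = records[j][key], records[i][key]


rng = random.Random(1)  # fixed seed only to make the sketch repeatable
ages = [23, 35, 47, 61]  # invented example data
noisy_ages = add_noise(ages, spread=2, rng=rng)
print(all(abs(a - b) <= 2 for a, b in zip(ages, noisy_ages)))  # True

people = [{"id": 1, "municipality": "A"}, {"id": 2, "municipality": "B"}]
swap_values(people, "municipality", 0, 1)
print(people[0]["municipality"], people[1]["municipality"])  # B A
```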

Anonymising qualitative data

Anonymisation measures presented below can be used both for anonymising data and for anonymising extracts published from research data. The guidelines presented here apply to textual data only. The FSD does not provide instructions for anonymising sound or video recordings.

The starting point in making a textual dataset anonymous is to erase background material containing identifiers, such as the contact details of participants and background information forms.

When removing or editing identifiers, mark all changes to the data clearly. The changes can be marked by using single or double square brackets: [changed text] or [[changed text]].

Replacing personal names with pseudonyms

Changing proper nouns to pseudonyms is the most popular anonymisation technique used for qualitative data. Research teams must be consistent in the selection and use of pseudonyms throughout the project. A spreadsheet file available to all team members can be used to maintain a list of names and their pseudonyms. The same pseudonyms should be used in both the data and in published samples.

When anonymising proper names, it is always better to use pseudonyms than to simply delete the names altogether or to replace them with a mere letter or a character string, such as [x] or [---]. Replacing proper names with pseudonyms enables the researcher to retain the internal coherence of the data. In cases where several individuals are frequently referred to, much of the information is lost if the proper names are just removed.
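Consistent replacement from a shared list of names and pseudonyms could be sketched in Python like this. The sketch assumes that the list is available as a dictionary and that changes are marked with single square brackets. The function name and all the names in the example are invented, and the result should always be checked by reading the text.

```python
import re


def apply_pseudonyms(text, pseudonyms):
    """Replace each listed name with its pseudonym in square brackets."""
    if not pseudonyms:
        return text
    # Longest names first, so that "Anna-Liisa" is handled before "Anna".
    names = sorted(pseudonyms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: "[" + pseudonyms[match.group(1)] + "]", text)


# Invented names; a real project would keep this list in a shared file.
pseudonyms = {"Matti": "Jari", "Anna-Liisa": "Kaisa", "Anna": "Elina"}
text = "Matti told Anna-Liisa that Anna had already left."
print(apply_pseudonyms(text, pseudonyms))
# [Jari] told [Kaisa] that [Elina] had already left.
```

Because every occurrence of a name receives the same pseudonym, the reader can still follow who is being talked about.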

Using a pseudonym for both the first name and the surname may be justified to make the transcription resemble natural speech or to keep a large number of participants separate from one another. The usual procedure, however, is to replace the first names with pseudonyms and remove the surnames. If a person is referred to by his/her surname only, the pseudonym is also a surname.

A dataset may contain references to persons who are publicly known on account of their activities in politics, business life or other work-related spheres. Their names are not changed to pseudonyms. However, a pseudonym or categorisation (e.g. [local politician]) should be used if the reference is related to the person's private affairs.

Categorising proper nouns

There is no need to create pseudonyms for persons who are mentioned only once or twice in the data, and who have no essential importance for the understanding of the content. Instead, their names can be replaced by a category (e.g. [woman], [man], [sister], [father], [colleague, female], [neighbour, male]). It is not always necessary to use pseudonyms for other proper nouns either. If a unit of data (personal interview, group interview, biography, letter, etc.) contains only one school or place of residence, its name can be replaced by a category, for instance, [lower secondary school], [home town] or [residential area].

If workplaces are mentioned or there is other information about businesses and workplaces that can constitute an indirect identifier in the data, researchers can use Statistics Finland's Industrial Classification for categorising. Another possibility is to simply generalise Peters & Peters into [law firm], Tottenham Hotspur into [football club], Pizza Hut into [restaurant], etc.

» Industrial Classification of Statistics Finland

If need be, place names mentioned in the data can be replaced by more general expressions like [population centre], [district], [village], etc. If it is not certain whether a place name denotes a municipality or a suburb, various place name lists and municipality catalogues may be of help.

If it has been decided that the participants' municipality of residence will not be revealed, researchers should remember to remove identifying geographical information relating to participants' place of residence both from background information and the textual data content. For instance, if the participant mentions that he or she often goes to a particular restaurant that is located a short walking distance away from his or her home, it is best to replace the name of the restaurant with a generic expression [restaurant].

Changing or removing sensitive information

Identifying sensitive information should be removed, categorised or classified. For example, 'AIDS' could be changed to [severe long-term illness] and thereafter referred to as [illness], provided that the reader is able to deduce from the context that [illness] refers to the 'severe long-term illness' mentioned in the beginning.

Removing or generalising sensitive data is justified if a) the respondent mentioned it only incidentally, b) the information is not relevant to the subject matter, and c) the data contain a number of indirect identifiers. But if the study focuses on the lives of persons with a severe illness, disclosure risk can be best reduced by using other anonymisation methods than editing crucial information.

Categorising background information

Background characteristics of participants, such as gender, age, occupation, workplace, school, or place of residence, are often essential for understanding the data. Such characteristics constitute important contextual information for secondary analysis. Detailed background information can be edited into categories in the same way as with quantitative data. Various existing classifications, such as those used by national statistical institutes, are helpful in the process. If researchers create their own classifications, the classifications should be explained in detail in the data description.

Categorisation is always a better solution than deleting background information. An example: an interview with a man whose actual background information is as follows: a 44-year-old system specialist working in the Computer Center at the University of Tampere, married with two children aged 9 and 11, living in Tampere. To reduce the risk of identification, his background information could be categorised in the following manner:

  • Gender: Male
  • Age: 41-45
  • Workplace: University
  • Occupation: Information and communications technology (ICT) professional
  • Household composition: Wife and two school-age children
  • Place of residence: Town in the province of Western Finland

In the example above, the workplace (i.e. a university) does not need to be generalised into [public sector employer], since the other remaining background data do not allow even a partial identification. The province of Western Finland has three universities.

When considering the need to categorise background information, researchers should take into account the other anonymising options explained above as well as the subject matter and content of the data.

» Statistics Finland classifications
» Regional Classifications of Statistics Finland
» Social Classifications of Statistics Finland
» Industrial Classification of Statistics Finland

Changing values of identifiers

Sometimes it is possible to anonymise qualitative data by distorting information, just as identifying values can be exchanged between cases in quantitative data. For instance, an exact date of birth, which as an identifier should normally be removed, may sometimes be crucial for understanding the content. A hypothetical example:

The interviewee was born on 31st December 1958. On New Year's Eve in 2005 she sat by the hospital bed of her dying child. In the interview, she describes in detail her conflicting emotions evoked by the fact that New Year celebrations, the death of her child, and her own birthday are all mingled together in her mind.

In a case like this, deleting New Year's Eve from the data would prevent us from understanding the content. The date (New Year's Eve) can be retained in the data if the interviewee's year of birth is changed to one or two years earlier or later.
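If dates of birth are stored in a structured form, for instance on a background information form, the change described above could be sketched in Python as follows. The function name is hypothetical, and the treatment of 29 February is an assumption added so that the sketch always returns a valid date.

```python
from datetime import date


def shift_birth_year(born, offset):
    """Move the year of birth by `offset` years, keeping day and month."""
    try:
        return born.replace(year=born.year + offset)
    except ValueError:
        # 29 February does not exist in the target year; fall back to 28 February.
        return born.replace(year=born.year + offset, day=28)


print(shift_birth_year(date(1958, 12, 31), -2))  # 1956-12-31
```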

Removing hidden metadata from files

During anonymisation, it is important to remember to check whether archival files contain any hidden technical metadata that could enable the identification of research participants. Such hidden metadata include, for example, location information and information about the owner of a device or a user profile. Technical metadata may be saved when files are created but also when they are edited.

Research data in the form of text or images may consist of files created by the research participants themselves. Risk of identification from the metadata is particularly high in these cases. As textual data often comprise text files created and directly submitted by the research participants, the hidden metadata of these files refer to the participants explicitly. EXIF data of digital images may also contain very precise information, such as the exact coordinates of where the picture was taken and even the photographer’s name.

Technical metadata can be removed by using common text or picture editors (e.g. MS Office, Windows File Explorer, GIMP, Irfanview). There are also programs specifically designed for removing EXIF data (e.g. Easy Exif Delete), which make removing the hidden metadata easy. Specific instructions on how to remove technical metadata depend on the program used and its version. See the instructions on the website of the program you are using.
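Before removing metadata it can be useful to check what a file actually contains. The Python sketch below only inspects a file and does not remove anything. It assumes an Office Open XML file (.docx, .xlsx or .pptx), which is a zip archive that normally stores author information in the part docProps/core.xml. The function name and the file name in the usage comment are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used in docProps/core.xml of Office Open XML files.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def author_metadata(path):
    """Return the author fields stored in an Office Open XML file, if any."""
    with zipfile.ZipFile(path) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))
    found = {}
    fields = (("creator", "dc:creator"), ("last_modified_by", "cp:lastModifiedBy"))
    for label, tag in fields:
        element = root.find(tag, NS)
        if element is not None and element.text:
            found[label] = element.text
    return found


# Usage with a hypothetical file name:
# print(author_metadata("interview_01.docx"))
```

An empty result does not prove that a file is free of identifying metadata, since other parts of the file may contain, for example, comments or tracked changes with user names.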

Practical tips

  • Always make a written anonymisation plan for your research data.
  • Also take into account any background information not included in the actual data files as it may contain identifiers that need to be anonymised or destroyed (contact details of participants, paper questionnaires etc.).
  • Remember to check that no identifying information related to third persons is retained in the data.
  • Carry out anonymisation of quantitative data by using the syntax of a statistical program (More information on using syntax).
  • When anonymising textual data, use Find and Replace commands to make one change at a time.
  • Check whether several different names have been used to refer to one and the same person (e.g. full name, first name, nickname).
  • Make it easier to plan anonymisation of textual files by consistently putting a symbol or special character (e.g. #, ¤) in front of each proper noun at transcription stage. This will save a lot of time later.
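If proper nouns have been marked at the transcription stage as suggested in the last tip, a short script can list them for the anonymisation plan. The Python sketch below assumes that the marker is placed immediately before each proper noun. The function name and the example sentence are invented.

```python
import re
from collections import Counter


def marked_proper_nouns(transcript, marker="#"):
    """Count the words that were prefixed with `marker` during transcription."""
    # A marked word starts with a letter and may contain letters, digits or hyphens.
    pattern = re.compile(re.escape(marker) + r"([^\W\d_][\w-]*)")
    return Counter(pattern.findall(transcript))


transcript = "I met #Liisa in #Tampere, and #Liisa drove us home."  # invented text
print(marked_proper_nouns(transcript))
# Counter({'Liisa': 2, 'Tampere': 1})
```

The resulting list shows which names occur often enough to need a pseudonym and which could simply be replaced by a category.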

Identifier type table

Different types of identifiers have been listed in the table below. Information that is deemed to be sensitive according to the Finnish Personal Data Act has been marked with an asterisk (*). Each identifier has been characterised (direct identifier, strong indirect identifier, indirect identifier).

The last column notes the easiest methods for dealing with that type of identifier. Remove means removing the information, Change means changing it to pseudonyms and Categorise means categorisation/classification. In the case of qualitative data, Categorise means coarsening identifying information, that is, replacing it with a broader category.

Some identifiers may be both indirect identifiers and strong indirect identifiers. An unusual occupation or occupational status is a strong indirect identifier while a common occupation is just an indirect identifier.

The table is not exhaustive but may provide good tips for recognising identifiers in research data and anonymising them.

Table 1.

Identifier type | Direct identifier | Strong indirect identifier | Indirect identifier | Anonymisation method
--- | --- | --- | --- | ---
Personal identification number | x | | | Remove
Full name | x | | | Remove/Change
Email address | x | x | | Remove
Phone number | | x | | Remove
Postal code | | | x | Remove/Categorise
District/part of town | | | x | Categorise
Municipality of residence | | | x | Categorise
Region | | | x | (Categorise)
Major region | | | x |
Municipality type | | | x |
Audio file | x | | | Remove
Video file displaying person(s) | x | | | Remove
Photograph of person(s) | x | | | Remove
Year of birth | | x | | Categorise
Age | | | x | Categorise
Gender | | | x |
Marital status | | | x |
Household composition | | | x | (Categorise)
Occupation | | (x) | x | Categorise
Industry of employment | | | x |
Employment status | | | x |
Education | | | x | Categorise
Field of education | | | x |
Mother tongue | | | x | Categorise
Nationality | | | x | (Categorise)
Workplace/Employer | | (x) | x | Categorise
Vehicle registration number | | x | | Remove
Title of publication | | x | | Categorise
Web page address | | (x) | x | Remove
Student ID number | | x | | Remove
Insurance number | | x | | Remove
Bank account number | | x | | Remove
IP address | | x | | Remove
Health-related information * | | (x) | x | Categorise/Remove
Ethnic group * | | (x) | x | Categorise/Remove
Crime or punishment * | | | x | Categorise/Remove
Membership in a trade union * | | | x | Categorise
Political or religious allegiance * | | | x | Categorise
Other position of trust or membership | | (x) | x | Categorise/Remove
Need for social welfare * | | | x | Categorise/Remove
Social welfare services and benefits received * | | | x | Categorise/Remove
Sexual orientation * | | | x | Remove

Other information sources:

updated 2017-11-23