Anonymisation and Identifiers
Data containing identifiers can be used for scientific research if the use is appropriate, planned and justified. When researchers plan how to anonymise data, that is, how to delete or mask identifiers, the level of anonymity need not be as high as the level required for research publications.
In historical and cultural research, for example, it may be justified both in terms of science and research ethics to publish the names of research participants. Studies based on expert interviews may also be analysed and published without masking or deleting the names of the experts or other identifiers. However, the participants' consent for this must be obtained in advance. In such cases, data are not anonymised.
If there is no intention to publish the names of research participants, researchers should plan at the start of the research how to protect the privacy of participants and how to delete, replace, or categorise identifiers in the data.
Whenever data are collected directly from research subjects, the anonymisation level required will depend to a large extent on the information given to subjects on the use, processing and storage of the data. If there is need, it should also be possible to analyse data which still contain the identifiers (particularly the indirect identifiers). When research participants have been told upfront that the data would be archived for scientific reuse, the level of anonymisation is sufficient if re-users of data cannot immediately identify individual participants.
Good research ethics dictate that re-users of archived data do not try to identify individual participants even if this could be done with reasonable effort. Finnish law imposes 'a duty of confidentiality' that re-users of data must not breach.
The guidelines given in this document presuppose that the data will not be destroyed after the original research has been completed but will be archived for scientific reuse.
The starting point for anonymising data which contain identifiers is to consider the dataset as a whole. The four most essential elements are:
- dimensions of identification
- information given to research participants
- background variables, direct and indirect identifiers
- subject matter of the data.
These elements should be considered together, and decisions on the anonymisation strategy should be based on the whole. This document presents a number of anonymisation measures both for qualitative and quantitative data. Researchers can choose those that are the most appropriate. For instance, the chapter 'Anonymisation of qualitative data' presents measures both for anonymising data and for anonymising published samples of data.
Dimensions of identification in scientific research
When planning anonymisation measures, it is worthwhile to consider in what concrete ways identification of participants may occur and what potential consequences identification would have. The aim is to maximise security while minimising information loss. It is important not to over-anonymise data. Identifying a single person from a dataset that has been collected, processed and stored appropriately is harmful only if information relating to the person is used for wrong purposes. Research data may only be used for research.
Processing and using data which contain identifiers involves a risk to research participants mainly if confidential information concerning the participants is disclosed to third parties (for instance, to their families, friends, employers or authorities). Such disclosure should never happen. Therefore, data protection and security measures must be planned and executed with care. The privacy of participants must be guaranteed, for instance, by storing the data in a secure manner and by avoiding electronic transfer of unencrypted files.
Research data may be used or disseminated for research purposes only. Handing over data to third parties or talking about individual participants to outsiders in a way that would affect the evaluation, treatment, status or behaviour of the participant is particularly unethical. Protection of privacy is a basic right which protects citizens against actions taken by public authorities. The duty of a researcher is to produce scientific information which will help to understand social problems, society and culture, not to reveal information about individuals to authorities.
If a researcher unexpectedly comes across a unit of data (an interview, diary, written material etc.) relating to a person he/she knows personally, the researcher should carefully think over whether it would be more ethical to leave that unit out of the analysis altogether. Whether the researcher leaves the unit out or not, he/she is still bound by the duty of confidentiality.
Users of data and re-users of archived data are bound by the access conditions set for the dataset, by the duty of confidentiality and the stated purpose of use. People wanting to use data archived at the Finnish Social Science Data Archive, for example, must both explain in their access application the purpose for which they need the data and sign an agreement which binds them to specified conditions of use.
Research publications, on the other hand, are in the public domain. The results of quantitative research are reported statistically, which means that there is no risk of identification even when the publication is based on data containing identifiers. In the case of qualitative data, the risk of identification must always be evaluated before any samples/quotations from the data are published: which indirect identifiers will be left in the sample as such, which will be categorised and which will be removed altogether.
Informing research participants
Data can be collected and archived for scientific reuse with identifiers if the research participants were informed of this. Their explicit consent gives the researcher the widest possible rights to use the participants' personal information, and is required for archiving audiovisual datasets.
To comply with the Finnish Data Protection Act, in most cases the participants' exact names and addresses must be destroyed after the original research has been completed. At the latest, their contact information must be destroyed when the data are archived. In this way, researchers will be prevented from becoming so enthusiastic about the thoughts of an individual participant years later that they actually contact the participant again hoping to receive further information.
Sometimes it is justified to preserve the exact names and addresses of research participants. For example, research projects targeting small and hard-to-reach groups and aiming at longitudinal research can retain the participants' contact information in the data as this is probably the only way to ensure that the same persons can be contacted again later.
In cases where it is justified on scientific grounds to preserve the contact information, participants must be informed of this and their consent obtained. Data protection measures and privacy protection must be planned with care from the outset. A detailed data protection plan is mainly needed for data management, but if research participants ask for it, the plan has to be forwarded to them as well.
Even if participants had been informed of the intention to preserve the data with identifiers for scientific reuse, direct identifiers such as personal identity numbers, names, addresses, telephone numbers, and exact birth dates should still be removed from the data after the original research has been completed. Preserving them is justified only when direct identifiers are essential for the analysis of the data, and the participants have given specific consent to the arrangement beforehand. For indirect identifiers, the situation is somewhat different. When participants have been informed that the data would be archived for scientific reuse, it may not be necessary to delete or edit indirect identifiers at all.
If research participants were not informed upfront that the data would be archived for reuse, anonymisation measures need to be assessed on a case-by-case basis, whether the data are quantitative or qualitative.
Background variables or indirect identifiers
The following are examples of background variables or indirect identifiers: gender, age, education, occupation, economic activity, socio-economic status, household composition, income, marital status, mother tongue, nationality, ethnicity, workplace/organisation, educational institution, and geographical identifiers. Geographical identifiers include, for instance, postcode, suburb, municipality, province, region, and place where the respondent grew up.
The level of anonymisation needed depends on the number of background variables in the data and on how detailed the information they yield is. Could a combination of indirect identifiers lead to the identification of a respondent? If there are many background variables and they provide precise information about individuals, anonymisation procedures must be planned with care, particularly if the subject matter is sensitive and research participants were not informed that the data would be preserved for scientific reuse.
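One way to approach the question of identifying combinations is to count how many respondents share each combination of background variables. The following is a minimal sketch in Python, not a procedure prescribed by this document: the function name, the variable names and the sample records are all hypothetical, and the threshold of one is only an illustrative assumption.

```python
from collections import Counter

def rare_combinations(records, keys, threshold=1):
    """Return the value combinations of `keys` shared by at most `threshold` records."""
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n <= threshold}

# Hypothetical respondents; the variable names and values are illustrative only.
respondents = [
    {"gender": "F", "age_group": "40-44", "residence": "Town A"},
    {"gender": "F", "age_group": "40-44", "residence": "Town A"},
    {"gender": "M", "age_group": "40-44", "residence": "Town A"},
    {"gender": "M", "age_group": "65-69", "residence": "Village B"},
]

# Combinations held by a single respondent deserve a closer look.
unique = rare_combinations(respondents, ["gender", "age_group", "residence"])
```

A combination that occurs only once does not by itself mean that the respondent can be identified, but it shows where recoding or categorising may be worth considering.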
Subject of the data
When planning to what degree a dataset must be anonymised, the subject matter and the sensitivity of the data must be taken into account. Content must be screened and anonymisation planned carefully if the data are sensitive in the sense specified in the Finnish Data Protection Act, and contain many variables giving detailed personal information. Information on state of health, social benefits received and committed crimes is much more sensitive than information on people's views on traffic. Data on citizens' interpretations, attitudes and opinions on society and culture are generally less problematic than data on very personal and sensitive matters.
Removing a variable
Removing a variable from a dataset is by far the most radical way of anonymising data though it may be justified in some cases. For example, if the researcher has local knowledge, and a survey on self-reported youth crime contains a variable identifying the respondent's school, this variable may, if used in combination with other background variables, present a risk of identity disclosure. Removing the variable would considerably limit disclosure risk without necessarily reducing the scientific value of the dataset.
Sometimes an open-ended variable may be removed to reduce the risk of disclosure, provided that the removal does not significantly reduce the usability of the data for analysis. Removal can be justified in cases where the same information can be found in another, categorised variable. For example: if there is a categorised education variable, an open-ended variable giving the name of the educational institution may be removed.
Recoding the values of a variable
Recoding the values of a variable is always a better solution than simply removing the variable. For instance, instead of using the names of schools, the school variable may be recoded into broader categories such as 'lower secondary school', 'upper secondary school', 'vocational school', etc. The respondent's exact age, municipality of residence and exact occupation can also be aggregated or categorised to prevent disclosure. An example: record the year of birth rather than the day, month and year, or recode it into categories which contain 3-5 year age groups.
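The age recoding described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a fixed rule: the function names are hypothetical, and the five-year group width is only one of the widths the text allows.

```python
from datetime import date

def birth_year(birth_date):
    """Keep only the year of a full birth date."""
    return birth_date.year

def age_group(age, width=5):
    """Recode an exact age into a group of `width` years, e.g. 44 -> '40-44'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

year_only = birth_year(date(1958, 12, 31))
group = age_group(44)
```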
Variables containing detailed geographical information, such as postcodes, can be aggregated from five-digit variables to three-digit ones. The variable identifying the respondent's municipality of residence can be aggregated into two different variables: region/province and type of location (urban, semi-urban, rural, etc.).
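Aggregating a five-digit postcode to three digits amounts to keeping the leading digits. A minimal Python sketch, assuming postcodes are stored as text so that leading zeros survive (the function name is hypothetical):

```python
def coarsen_postcode(postcode, keep=3):
    """Truncate a five-digit postcode to its first `keep` digits."""
    digits = postcode.strip()
    if len(digits) != 5 or not digits.isdigit():
        raise ValueError(f"not a five-digit postcode: {postcode!r}")
    return digits[:keep]

coarse = coarsen_postcode("33100")
```

Rejecting malformed values instead of truncating them silently makes it easier to notice free-text entries that may need separate handling.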
An occupation variable can be classified into occupational groups such as:
1 Legislators, senior officials and managers
3 Technicians and associate professionals
5 Service and care workers, and shop and market sales workers
6 Skilled agricultural and fishery workers
7 Craft and related trades workers
8 Plant and machine operators and assemblers
9 Elementary occupations
0 Armed forces
(ISCO-08 standard, major occupation groups)
Another possibility is to use a status in employment categorisation: 'employees', 'employers', 'own-account workers', 'contributing family workers', etc.
One way to reduce disclosure risk is to restrict the upper and lower ranges of a continuous variable to hide outliers. This anonymisation method is typically used for income variables. Highest incomes may be top-coded, that is, coded into a new category (e.g. "income higher than xxxxx euros") while other income responses are preserved as actual quantities (= the actual income in euros). This will prevent identification of highly paid individuals. In the same way, the smallest observed values can be bottom-coded.
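Top-coding and bottom-coding can be sketched in Python as below. The function name and the thresholds in the example are hypothetical; suitable limits depend on the distribution of the variable in the dataset at hand.

```python
def top_bottom_code(value, lower, upper):
    """Replace values outside [lower, upper] with a category label.

    Values inside the range are returned unchanged as actual quantities.
    """
    if value > upper:
        return f"more than {upper}"
    if value < lower:
        return f"less than {lower}"
    return value

# Illustrative limits only, not recommended thresholds.
incomes = [3000, 28000, 41500, 250000]
coded = [top_bottom_code(income, lower=5000, upper=100000) for income in incomes]
```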
Removing identifiers from responses to open-ended questions
Responses to open-ended questions sometimes contain identifiers which are connected to respondents themselves or other persons. The information content of a response will not diminish significantly even if direct identifiers (names, phone numbers, e-mail addresses, etc.) are removed. Disclosure risk must be assessed on a case-by-case basis, taking into account the subject of the study and the number and nature of background variables.
Another way to remove identifiers is to categorise responses. The procedure functions well for open-ended questions collecting background information such as place of residence, education, place of work etc. For instance, a survey of physicians might contain an open-ended question on medical expertise. Linked to other background variables, this variable might lead to an identification of physicians who have more than one medical speciality. One solution is to categorise the open-ended variable into broader categories, such as 'one area of medical speciality', 'two or more areas of medical speciality', etc. (Economic and Social Science Data Service 2005).
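The physician example could be sketched in Python as follows. The sketch assumes, purely for illustration, that each open-ended response has already been cleaned into a semicolon-separated list of specialities; real responses would usually need manual coding first. The function name and the extra category for empty responses are hypothetical.

```python
def speciality_category(response):
    """Recode a semicolon-separated list of specialities into a broad category."""
    specialities = [part.strip() for part in response.split(";") if part.strip()]
    if not specialities:
        return "no speciality reported"
    if len(specialities) == 1:
        return "one area of medical speciality"
    return "two or more areas of medical speciality"

category = speciality_category("cardiology; geriatrics")
```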
Using a sample rather than all of the original data
One method Statistics Finland often uses to prevent disclosure is to release a sample instead of all of the original data. Only part of the population will be analysed and the randomness of the sample will be guaranteed by using various sampling procedures.
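As an illustration of releasing a sample, the Python sketch below draws a simple random sample without replacement. It is not a description of Statistics Finland's own procedures, which the text does not specify; the function name, the sampling fraction and the seed are assumptions.

```python
import random

def release_sample(records, fraction, seed=None):
    """Draw a simple random sample (without replacement) of the given fraction."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    size = round(len(records) * fraction)
    return rng.sample(records, size)

# Hypothetical dataset of 100 record identifiers, 10 per cent released.
all_records = list(range(100))
sample = release_sample(all_records, fraction=0.1, seed=1)
```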
Swapping and adding random variation
Less well-known anonymisation techniques include swapping and adding random variation to indirect identifiers. Swapping means matching unique cases on the indirect identifier and then exchanging the values of the variable. Some researchers regard these two techniques as distorting data, but both prevent people from using variables as a means for linking records (Guide to Social Science Data Preparation and Archiving 2005).
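A simplified Python sketch of the two techniques follows. It assumes that the cases to be swapped have already been selected (for example, cases that are unique on the indirect identifier) and simply shuffles the values of one variable among them; the function names, the record layout and the size of the random variation are hypothetical.

```python
import random

def swap_values(records, key, indices, rng):
    """Exchange the values of `key` among the records at `indices`.

    Returns new records; the originals are left untouched. A random
    shuffle may occasionally leave a value in its original place.
    """
    values = [records[i][key] for i in indices]
    rng.shuffle(values)
    swapped = [dict(record) for record in records]
    for i, value in zip(indices, values):
        swapped[i][key] = value
    return swapped

def add_noise(value, spread, rng):
    """Add a random whole-number variation between -spread and +spread."""
    return value + rng.randint(-spread, spread)

rng = random.Random(0)
records = [
    {"id": 1, "municipality": "A"},
    {"id": 2, "municipality": "B"},
    {"id": 3, "municipality": "C"},
]
swapped = swap_values(records, "municipality", [0, 1, 2], rng)
noisy_age = add_noise(40, 2, rng)
```

Both operations change individual values while keeping the overall distribution of the variable roughly intact, which is why they should be recorded in the documentation of the dataset.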
Replacing personal names with pseudonyms
Changing proper names to pseudonyms is the most popular anonymisation technique used for qualitative data. A good way to keep the anonymisation process under control is to replace personal names with pseudonyms directly after the transcription. For example, typing a special character in front of all proper names already at the initial transcription stage will facilitate the planning and carrying out of anonymisation because all proper names can be easily found within the data.
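If proper names were marked with a special character during transcription, they can be listed automatically before anonymisation is planned. The Python sketch below assumes, hypothetically, that the marker is '@' and that each marked name is a single word; neither convention is prescribed by this document.

```python
import re
from collections import Counter

MARKER = "@"  # hypothetical marker typed in front of proper names

def marked_names(transcript, marker=MARKER):
    """Count every word that directly follows the marker character."""
    pattern = re.compile(re.escape(marker) + r"(\w+)")
    return Counter(pattern.findall(transcript))

names = marked_names("@Anna said that @Tom and @Anna left early.")
```

The counts also show which names occur only once or twice and which recur often enough to need a pseudonym of their own.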
Research teams must be consistent in the selection and use of pseudonyms throughout the project. A spreadsheet file available to all members can be used to maintain a list of names and their pseudonyms. The same pseudonyms should be used in both the data and in published samples.
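The shared list of names and pseudonyms can be thought of as a simple lookup table. The Python sketch below is one possible way to keep assignments consistent; the class name and the pool of pseudonyms are hypothetical, and in practice the table would be saved to the shared file mentioned above.

```python
class PseudonymKey:
    """Assign each real name one pseudonym and always return the same one."""

    def __init__(self, pool):
        self._pool = list(pool)   # unused pseudonyms, in order of preference
        self._mapping = {}        # real name -> pseudonym

    def pseudonym_for(self, name):
        if name not in self._mapping:
            if not self._pool:
                raise ValueError("no unused pseudonyms left in the pool")
            self._mapping[name] = self._pool.pop(0)
        return self._mapping[name]

key = PseudonymKey(["Maria", "Jack", "Helen"])
first = key.pseudonym_for("Anna")
second = key.pseudonym_for("Tom")
again = key.pseudonym_for("Anna")
```

Because such a key links real names to pseudonyms, it needs to be stored as securely as the original data.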
When anonymising proper names, it is always better to use pseudonyms than simply to delete the names altogether or replace them with a mere letter or a character string, such as [x] or [---]. Replacing proper names with pseudonyms enables the researcher to retain the internal coherence of the data. In cases where several individuals are frequently referred to, much of the information is lost if the proper names are just removed.
Using a pseudonym for both the first name and the surname may be justified to make the transcription resemble natural speech or to keep a large number of participants separate from one another. The usual procedure, however, is to replace the first names with pseudonyms and remove the surnames. If a person is referred to by his/her surname only, the pseudonym is also a surname.
A dataset may contain references to persons who are publicly known on account of their activities in politics, business life or other work-related spheres. Their names are not replaced with pseudonyms. However, a pseudonym or categorisation (e.g. [local politician]) should be used if the person's private affairs are talked about.
Computers provide ways to carry out fast anonymisation operations. Yet one should be very careful when using find and replace techniques, and preferably replace one item at a time. Names may form part of other words: for example, when 'Tom' is replaced by 'Jack' using the 'Replace all' command, 'atomic' is also changed to 'aJackic'. Therefore, it is best to apply changes one by one. Prior to the anonymisation, one must also check whether the same person is referred to by using different names (e.g. Tom as Thomas or Tommie).
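When an automated replacement is used despite these risks, matching whole words only avoids the 'atomic' problem. The following Python sketch uses the word-boundary marker of the standard `re` module; the function name is hypothetical, and the returned count is meant to support a manual check rather than replace it.

```python
import re

def replace_name(text, name, pseudonym, ignore_case=True):
    """Replace `name` only where it occurs as a whole word.

    Returns the new text and the number of replacements made.
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", flags)
    return pattern.subn(pseudonym, text)

new_text, count = replace_name(
    "Tom works on atomic clocks with Tommie.", "Tom", "Jack"
)
```

Variants such as 'Tommie' or 'Thomas' are deliberately left untouched by a whole-word match, so they still have to be identified and handled separately.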
Personal names do not necessarily need to be replaced with pseudonyms, if the research participants had been informed that the data would be archived with their names. In such cases, the essential thing is to ensure that the terms and conditions set for reuse of the data are compatible with the information given to participants. Re-users of data have the same obligation not to disclose personal information as primary researchers.
Categorising proper names
No pseudonyms need to be created for persons who are mentioned only once or twice in the data, and who have no essential importance for the understanding of the content. Instead, their names can be replaced by a category (e.g. [woman], [man], [sister], [father], [colleague, female], [neighbour, male]). It is not always necessary to use pseudonyms for other proper names either. If the unit of data (personal interview, group interview, biography, letter, etc.) contains only one school or place of residence, they can be replaced by a category, for instance, [lower secondary school], [home town] or [residential area].
If the data are sensitive, Statistics Finland's Industrial Classification is a useful resource for categorising workplaces. Another possibility is to simply generalise Peters & Peters into [law firm], Tottenham Hotspur into [football club], Starbucks into [café], etc.
If need be, place names mentioned in the data can be replaced by more general expressions like [population centre], [district], [village], etc. If it is not certain whether a place name denotes a municipality or a suburb, various place name lists and municipality catalogues may be of help.
Changing or removing sensitive information
Whenever there is risk of even partial identification, and the personal/sensitive data are not necessary for understanding the content, it may be a good option to edit or delete the sensitive parts. Even in this case, it is always better to edit than delete the data.
A diagnosed severe illness can be changed into another, similar type of illness, if doing this does not reduce the usefulness of the data too much. Another option would be to categorise the information in the same way as with quantitative data. For example, 'AIDS' could be changed to [severe long-term illness] and thereafter referred to as [illness], provided that the reader is able to deduce from the context that [illness] refers to the 'severe long-term illness' mentioned at the beginning.
Removing or generalising sensitive data is justified if a) the respondent mentioned it only incidentally, b) the information is not relevant to the subject matter, and c) the data contain a number of indirect identifiers. But if the study focuses on the lives of persons with a severe illness, disclosure risk can be best reduced by using other anonymisation methods than editing crucial information.
If removing or changing sensitive parts would affect the usability of the material, there are other options. Sensitive sections which cannot be used in direct quotations or detailed descriptions in the publications can be marked. The beginning and end of these sections should have a mark-up like this:
<seg type="sensitive">I don't want this to be recorded, but, to tell the truth, before starting to work in the paper mill, my father had participated in smuggling activities and at that time we were always worrying that he might get caught. I'm sure you understand that this explains my own choices a lot as well.</seg>
Categorising background information
Background characteristics of participants, such as gender, age, occupation, workplace, school, or place of residence, are often essential for understanding the data. Such characteristics constitute important contextual information for secondary analysis. If it is justified for reducing the threat of disclosure, detailed background information can be edited into categories in the same way as with quantitative data. Various existing classifications, such as those used by national statistical institutes, are helpful in the process. If the researcher creates his/her own classifications, they should be explained in the data description.
Categorisation is always a better solution than deleting background information. An example: an interview with a woman whose actual background information is as follows: a 44-year-old system specialist working in the Computer Center at the University of Tampere, married with two children aged 9 and 14, living in Tampere. To reduce the risk of identification, her background information could then be categorised in the following manner:
Occupation: Information and communications technology (ICT) professional
Household composition: Husband and two school-age children
Place of residence: Town in the province of Western Finland
In the example above, the workplace (i.e. a university) does not need to be generalised into [public sector employer], since the other remaining background data do not allow even a partial identification. The province of Western Finland has three universities. Usually only part of the background information needs to be categorised; sometimes the only information that needs categorisation is the place of residence.
The degree of categorisation should be decided in relation to other anonymisation techniques, and the content and subject matter of the data. There is no need to anonymise less sensitive data thoroughly, as simple measures such as deleting the names and categorising the addresses of participants may be enough. On the other hand, in the case of sensitive data, the risk of identification can be significantly reduced by categorising background data and using pseudonyms or other editing methods for proper names.
To categorise the participants' place of residence, occupations or education, for example, Statistics Finland classifications may prove to be helpful.
Changing values of identifiers
Deliberately editing the values of some identifiers is perhaps most justified when the material refers to the private lives of public figures. For instance, a respondent may describe a traumatic relationship with his/her sister and emphasise that the well-known sister's identity must be concealed in research publications. In this case, it may be a good option to anonymise the data by changing the value of some identifier related to the sister (occupation, age, etc.). Sometimes distorting an identifier can be justified for non-public figures as well. For example, an exact birth date may sometimes be essential for understanding the content of the data. A fictional example:
The interviewee was born on 31st December 1958. On New Year's Eve in 2005 she sat by the hospital bed of her dying child. In the interview, she describes in detail her conflicting emotions evoked by the fact that New Year celebrations, the death of her child, and her own birthday are all mingled together in her mind.
In a case like this, deleting New Year's Eve from the data would prevent us from understanding the content. If the respondent's exact birth date, in connection with other indirect identifiers, formed a threat of disclosure, one alternative would be to change the year of birth to one or two years earlier or later.
It is important to aim at a reasonable level of anonymisation for both quantitative and qualitative data, and to avoid over-anonymisation. The goal should be to produce reusable data and keep the changes as small as possible. Detailed logs should be kept of all anonymisation measures carried out.
Besides data anonymisation, good research ethics can be enhanced by storing the data in a secure manner and setting conditions for its reuse.
- Economic and Social Science Data Service. 2005. "Create and deposit." Cited 24 Oct 2005 (http://www.esds.ac.uk/aandp/create/).
- "Guide to Social Science Data Preparation and Archiving." Inter-university Consortium for Political and Social Research (ICPSR). 2005. Cited 31 Oct 2005 (http://www.icpsr.umich.edu/access/dpm.html).
- Kuula, Arja 2006: Tutkimusetiikka. Aineistojen hankinta, käyttö ja säilytys [Research Ethics. Acquisition, Use and Preservation of Data], (pages 207-222).
- Statistics Finland 2008: Classification Services. Cited 26 Jan 2008 (http://www.tilastokeskus.fi/meta/luokitukset/index_en.html).