Processing Qualitative Data Files

Qualitative research data may consist of many different types of research material. These may include transcribed interviews, audio recordings, still images, ethnographic diaries and various types of written texts.

Qualitative data archiving at FSD

The qualitative data archived at the Finnish Social Science Data Archive (FSD) are mainly textual. Digital photographs and audio recordings are archived only if the research participants appearing in them have given consent to the archiving. If the audio material and photographs are scientifically valuable, but there is no consent from research participants and it is impossible to go back to them to gain consent for archiving, the researcher may apply for permission to archive the recordings from the National Archives of Finland.

» National Archives of Finland: How to apply for permission (in Finnish only)

The FSD can also archive newspaper and magazine material as well as photographs, cartoons and illustrations in books that have been collected by researchers for their studies but created by someone else. According to an agreement between the FSD and the Finnish copyright society Kopiosto, the data archive can archive and disseminate such material for research purposes. See Data collected from periodicals below.

» Kopiosto Copyright Society

The FSD does not archive audiovisual material. Audiovisual material is archived and disseminated for further research by the Language Bank of Finland (FIN-CLARIN). If you are planning to collect audiovisual material in your research, or already have such material that you wish to archive for data sharing, contact the Language Bank.

» Language Bank of Finland, coordinated by FIN-CLARIN

To help researchers manage data during the research process and to facilitate data reuse, the FSD provides examples below of various ways to process qualitative data. These include how to store contextual information and background information on participants.


The most common formats of qualitative data are written texts, interview data and focus group discussion data. In most cases, interview and discussion data are first digitally recorded and then transcribed. Converting recorded data into written form is the most typical way of processing interview and discussion data into an analysable format. Occasionally, the recordings themselves are analysed, for instance, in studies focusing on language or interaction.

The level of transcription is always decided by the original researcher or research team and depends on the objectives set for the data. Transcription level decisions are often influenced by the resources available. In an ideal case, researchers understand how valuable the data may become for other researchers outside the original research team and thus allocate resources to the transcription. It is recommended that the recorded material is transcribed as extensively as possible. As it is hard to say in advance what kind of study the data will be used for in future, it is often best to also transcribe those parts that do not seem relevant at the time. This also enables the researchers themselves to reuse the data later for new research questions. Researchers can do the transcription themselves or buy it from a service provider.

There are no established names or definitions for different levels of transcription, although there is some agreement on the general guidelines. In practice, transcription does not follow any particular level but combines features from different levels, tailoring the transcription to the requirements of the material at hand. Whatever the level chosen, it is essential to be uniform and consistent throughout in the level of detail and the logic of transcription.

Different levels of transcription can be classified in the following manner, for instance:

  • Gisted/summary transcription: Interview recordings are rendered in written form only roughly, by listing or summarising main points/topics. Direct quotations or parts of speech are only rarely written down. Interpretation plays a big role in this kind of transcription because it is the transcriber who decides which parts are worth transcribing.
    Can be used, for instance, for producing articles based on interview data. Does not enable in-depth analysis nor does it support rich and varied use and reuse of the data.

  • Basic level transcription: Will produce a verbatim (exact) transcription of utterances but leaves out repeats, cut-offs of words and sentences, fillers ('you know'), and non-lexical sounds ('uh', 'ah'). Utterances clearly not in context can also be left out. In addition to speech, significant expressions of emotion (laughter, getting upset etc.) are incorporated.
    Can be used when the main focus is analysing the content of speech. This is the minimum transcription level for data sharing and archiving.

  • Exact transcription: All speech is transcribed, nothing is left out. Transcription is a verbatim, word-for-word replication of the verbal data, using the most common standardized notation symbols. Fillers ('you know'), repeats, cut-offs of words and non-lexical sounds are incorporated in the transcription, as well as expressions of emotion (laughter, sighs, getting upset etc.) and emphasis or stress. Timed pauses (in seconds) and possible background noises and other disturbances are noted.
    Often used when there is intention to analyse expressions and interaction, at least to some extent. This level of transcription allows for varied and rich reuse of the data.

  • Conversation analysis transcription: Full verbal transcription using standardized notation symbols, with careful reproduction of colloquial speech patterns. Transcription includes all words, timed pauses (in seconds), cut-offs of a word, intonation, volume, word stress, as well as non-lexical action (sneezes, breaths, sighs, facial expressions) etc.
    The most detailed level of transcription. The goal is to represent the conversation event in as much detail as possible in textual format. Often used together with the audio and video recordings themselves.

Both for the sake of one's own research and for the sake of data reuse, it is always better for transcription to be too detailed than not detailed enough. If interview recordings have been converted into text only in summary form, this may become a problem even for the original researchers at the analysis stage. The minimum transcription level for data reuse and sharing is basic level transcription. Data reuse is further enhanced if exact transcription has been used. Whether a yet more detailed transcription level is chosen depends on the research objectives and resources available. See also Anonymisation and Personal Data.

If transcription notation symbols are used, it is good to remember that symbols entered in a word processor may change when the file is converted for use in other software. Formatting, footnotes and links to other documents may also disappear in conversion. It is therefore advisable never to convey content or structural information through formatting (e.g. bold, italics, underlining, colours or indentation). It is safest to use only the symbols available on the keyboard.

The notation symbols used in transcription should be described in the interview guidelines and in the subsequent data documentation. This way the same notations will be used systematically and consistently throughout. Information on the notations is essential for data reuse and when data are collected and transcribed in different locations. It is also needed when the original researchers themselves reuse the data, because memory is short: without notation information, it soon becomes impossible to understand what each symbol means. When a standardised notation is used, it may be enough to give a detailed reference to the original source of the notation.

Speaker demarcation should be consistent throughout the transcription to facilitate readability and to allow for automatic processing at some stage. Each time the person speaking changes, his or her speech should be transcribed as a discrete unit, always starting on a new row. For instance, a speaker ID is entered at the beginning of the row, followed by a colon (:). The speaker ID may be the name of the speaker, the initials of the name or a pseudonym, as long as it is used consistently.

For instance,

Interviewer 1: Did they agree?
Interviewee 6: I'm guessing they did, for most part.
Interviewee 7: Oh, yes, I thought so as well.
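
As an illustration of the automatic processing that consistent speaker demarcation makes possible, the following is a minimal Python sketch that splits a transcript into speaker turns. The function name parse_turns and the 40-character limit on speaker IDs are assumptions made for this example and are not part of any FSD tool. The sketch also assumes that a row without a speaker ID continues the previous turn and that such continuation rows do not contain a colon near their start.

```python
import re

# A speaker ID at the beginning of the row, followed by a colon and the speech.
# The 40-character cap on the ID is an assumption for this sketch.
TURN = re.compile(r"^(?P<speaker>[^:\n]{1,40}):\s*(?P<speech>.*)$")


def parse_turns(text):
    """Return a list of (speaker_id, speech) tuples, one per speaker turn."""
    turns = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty rows between turns
        match = TURN.match(line)
        if match:
            turns.append((match["speaker"].strip(), match["speech"].strip()))
        elif turns:
            # No speaker ID on this row: treat it as a continuation of the previous turn.
            speaker, speech = turns[-1]
            turns[-1] = (speaker, f"{speech} {line.strip()}")
    return turns


sample = """Interviewer 1: Did they agree?
Interviewee 6: I'm guessing they did, for most part.
Interviewee 7: Oh, yes, I thought so as well."""

for speaker, speech in parse_turns(sample):
    print(speaker, "->", speech)
```

If speaker IDs were entered inconsistently (for instance, sometimes without the colon), a parser like this would silently merge turns, which is one practical reason for keeping the demarcation uniform.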

Organising data files

When the data have been collected, saved and possibly transcribed, it is time to decide how to organise the storage. The data collected are entered into data files which are then stored in a data folder. If the research involves several independent data collections, it is advisable to create a separate data folder for each collection.

The decision on how to organise the data files should be made on a case-by-case basis. All material relevant to the data should be entered into the data folders. This should include detailed information on the data collection and data processing procedures. Examples of relevant material:

  • interview guidelines
  • possible stimulus material
  • writing invitation
  • observation instructions
  • transcription guidelines
  • writing instructions

Depending on the amount of data, one data file can include one or more data units. Sometimes it is more convenient to store several short pieces of text (=data units) in one data file instead of having a separate file for each unit, for instance, in the case of several short writing competition texts (see Example 1). If the textual data as a whole are not very large, it might be more useful to store all units in one data file.

Example 1: One data file

Writing competition on anxiety 2013
Writing competition 13 texts.rtf

If text units are generally longer than one page, it is usually advisable to store each unit in a separate data file (see Example 2).

Example 2: Separate data files

Writing competition 2013

If the data consist of several types of data (e.g., both focus group and individual interviews) or of different types of data files collected in the same connection (e.g. transcriptions, audio tapes and photographs), the best option might be to store the files of each type in their own subfolders.

Naming data files

Systematic and consistent naming of data files facilitates data management during research as well as data archiving and reuse. Even during an ongoing research project, it is easier to manage and locate data files if they have descriptive names, that is, names that include some of the background information of the data unit (e.g. date, gender and age). However, it is good to be aware of certain problems which may arise for data reuse and archiving when this type of naming is used.

If the background information appearing in the name is coded too concisely, it may be difficult or downright impossible for outsiders, or even for the original researchers, to interpret it. Therefore, researchers should always produce a document describing the file naming convention used for the research (see Example 4). Background information of research subjects should primarily be stored elsewhere and should never appear only in the file names. For example, when data are archived at the FSD, the archive renames the data files according to its own file naming conventions. Therefore, all information stored in the file name by the researchers will disappear at this stage.

Another benefit of consistent naming of data files is that it is easier to identify all files connected to one data collection event (e.g. one interview). The files related to one collection event (e.g. audio tape, its transcription and photographs taken by the interviewee) can be connected by the file name.

The most convenient way is to give all files connected to the same event an 'event identifier' in the beginning of the name, that is, in the first part of the name. The latter part of the name can be used to convey the specifics, for instance, whether it is an audio tape, transcription or a still image. Thus, data files 20130311_interview2_audio.wav and 20130311_interview2_trans.rtf and 20130311_interview2_image.jpg are files connected to the same interview event conducted on 11 March 2013. The latter part of the name reveals the specifics of the file. In this case "audio" means audio tape and "trans" a transcription of the audio tape (also see Example 3).
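
The benefit of a shared event identifier can be sketched in a few lines of Python. The example below groups file names by everything that precedes the last underscore, using the three file names mentioned above. The function name group_by_event is an assumption for this illustration, as is the rule that the type of file is always the last underscore-separated part of the name.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_event(file_names):
    """Map each event identifier to the files that belong to it, keyed by file type."""
    events = defaultdict(dict)
    for name in file_names:
        stem = PurePath(name).stem                # file name without the extension
        event_id, _, kind = stem.rpartition("_")  # split at the last underscore
        events[event_id][kind] = name
    return dict(events)


files = [
    "20130311_interview2_audio.wav",
    "20130311_interview2_trans.rtf",
    "20130311_interview2_image.jpg",
]

for event_id, members in group_by_event(files).items():
    print(event_id, sorted(members))
```

A file name without any underscore would end up under an empty event identifier in this sketch, so such names would need to be checked separately.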

Example 3: Structure of the data folder

In our example case, the data are varied and contain audio tapes of the interviews, interview transcripts, stimulation material shown to the research subjects, and photographs taken by the subjects. Please remember that background information must never be stored in the file name only (see the section Documenting background information).

Perceptions on immigration 2014
Audio tapes
Stimulation material

Example 4: An example of how to document the file naming conventions used

Data file names are formed in the following manner:
<date> is the date on which the data were collected,
<type> specifies the type of event/data material,
<ID1> is the ID of the collection event,
<gender> is the gender of the interviewee,
<age> is the age of the interviewee,
<municipality> is the municipality of residence of the interviewee,
<datatype> specifies the type of data the file contains, for instance,
"trans" means transcription, "audio" means audio recording, and "image" means photograph.
<ID2> is the ID number used to separate the images connected to the
collection event.

Documenting background information

As mentioned above, from the point of view of data reuse and sharing it is not good practice to document background information in the file names alone. Interpreting the background information appearing in a file name may be difficult or downright impossible for outsiders, or even for the researchers themselves if some time has passed. It may also cause difficulties for archiving processes. For instance, the Finnish Social Science Data Archive produces an HTML index for archived datasets (see example of html index), for which it should be possible to parse background information from the data files automatically. Automatic parsing is not possible when background information is coded in file names, as the naming conventions vary greatly between research projects. Data files are also renamed at the FSD to follow the archive's file naming conventions. This may result in the total disappearance of important background information if it is recorded in the file name only.

What background information is entered for each unit varies from data to data and is a decision of the original researchers. Background information may include information on the research subject and the data collection event, and notes of the researcher. Information relating to the collection event typically includes the time and location, and the name of the interviewer. Information on research subjects may include gender, age, municipality of residence, education, occupation, job, educational institute, family composition, marital status, mother tongue, and nationality or ethnic background. Other relevant information includes the file name(s) and the research subject ID. See also Anonymisation and Personal Data.

Recording background information which does not seem very relevant for the ongoing research may be of great importance in future when the data are reused for other research purposes. It is therefore better to record too much background information than too little. Removing superfluous information is always easier than complementing insufficient information. Even though background information should be as informative as possible, it is good to keep in mind what kind and level of identifying information is allowed in the consent obtained from research participants (see Informing Research Participants).

Below are two examples of recording background information in a manner that facilitates data archiving and ensures that the information is retained in the processing of data. Which way is chosen depends on the format of the data.

Entering background information into data files

For textual data, background data are systematically entered in the beginning of each data unit (e.g. interview transcript) in a standardised manner. Such practices greatly facilitate analysis during ongoing research.

From the FSD's point of view, when a textual dataset is archived at the data archive, the archive produces a separate html index (see example of html index) for the dataset, with the help of which it is easy to handle individual interviews, written texts etc. The index enables users to easily identify and locate data units according to particular background information, for example, gender, age, profession. To make the html index creation possible, it is important that background data fields can be parsed automatically for each data unit. For this it is particularly important that information is entered in a uniform manner throughout the data collection.

Example 5 presents a typical transcript of an interview with only one interviewee. The transcript of each interview in the data has been saved in a separate file, often in RTF or Word format (see Organising data files, Example 2). Background data fields are entered in the following manner in the beginning of each transcription file.

Example 5:

Interview date: 08.02.2013 [=8 February 2013]
Interviewer: Matt Miller
Pseudonym of interviewee: Ian (not the real first name of the interviewee)
Occupation of interviewee: Teacher
Age of interviewee: 32
Gender of interviewee: Male

I: First I would like to ask you about your choice of profession. How did it come about that you decided to become a teacher?
Ian: Well, you know, when I was a kid we had this really great guy teaching history....

Example 6 is otherwise similar to Example 5 except that it is a focus group interview, with several interviewees. Therefore each interviewee has an ID (e.g. R1, R2) which helps to identify their speech. The background information of each interviewee can be entered in the background data fields in the following manner. Other types of ID (for instance, the person's whole name or pseudonym) can also be used. Whatever ID system is chosen, it should be used consistently throughout the data.

Example 6:

Interview date: 08.02.2013 [=8 February 2013]
Interviewer: Matt Miller
Pseudonyms of interviewees: Ian (R1), Mary (R2), Ken (R3)
Occupation of interviewees: Teacher (R1), Headmaster (R2), Janitor (R3)
Age of interviewees: 31 (R1), 47 (R2), 22 (R3)
Gender of interviewees: Male (R1), Female (R2), Male (R3)

I: First I would like to ask you about your choice of profession. Tell me a bit about how you came to have the profession you have now?
R3: It's not really... er, for me, it's not a profession, I'm just doing this for now and might go back to school later.
R1: You know, when I was a kid we had this really great guy teaching history....
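
To show why the "value (ID)" pattern in Example 6 lends itself to automatic processing, here is a minimal Python sketch that collects the background information of each interviewee by ID. The function name per_respondent and the regular expression are assumptions made for this illustration. The sketch simply ignores fields that carry no ID in parentheses, such as the interview date.

```python
import re

# Matches "value (ID)" pairs such as "Teacher (R1)".
PAIR = re.compile(r"\s*([^,()]+?)\s*\((\w+)\)")


def per_respondent(header_lines):
    """Return {respondent_id: {field_title: value}} for fields written as 'value (ID)'."""
    respondents = {}
    for line in header_lines:
        title, sep, rest = line.partition(": ")
        if not sep:
            continue  # not a background data field
        for value, respondent_id in PAIR.findall(rest):
            respondents.setdefault(respondent_id, {})[title] = value
    return respondents


header = [
    "Interview date: 08.02.2013 [=8 February 2013]",
    "Interviewer: Matt Miller",
    "Pseudonyms of interviewees: Ian (R1), Mary (R2), Ken (R3)",
    "Occupation of interviewees: Teacher (R1), Headmaster (R2), Janitor (R3)",
    "Age of interviewees: 31 (R1), 47 (R2), 22 (R3)",
    "Gender of interviewees: Male (R1), Female (R2), Male (R3)",
]

for respondent_id, fields in per_respondent(header).items():
    print(respondent_id, fields)
```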

In Example 7, research subjects were asked to write down one proverb that had been significant in their lives. Subjects could answer anonymously but were asked to give some background information. Altogether, the data contained over 40 pages of proverbs provided by over 100 individuals. As the proverbs were short but the data in itself quite large, it was easiest to store all proverbs in one file (see Organising Data Files, Example 1). In a case like this, background data fields are entered in the beginning of each proverb so that they allow for potential automatic processing of data.

Example 7:

Occupation: Teacher
Age: 32
Gender: Male
Municipality of residence: Helsinki

"When the cat is away the mice will play"

Occupation: Carpenter
Age: 56
Gender: Male
Municipality of residence: Rovaniemi

"Early to bed and early to rise makes a man healthy, wealthy and wise"

Occupation: Journalist
Age: 49
Gender: Female
Municipality of residence: Tampere

"A rose is a rose is a rose"

Occupation: Software programmer
Age: 34
Gender: Female
Municipality of residence: Helsinki

"Two heads are better than one"

Automatic processing of data is possible if the background data fields are created in an identical manner throughout the data and the fields are always in the same order. A good practice is to end the title of each background information field with a colon, followed by a space. Each background data field ends in a line break ('enter'), so it can be separated from other text. To avoid spelling mistakes and ensure that the order of the data fields remains consistent, it is easiest to copy the empty background data field titles to the beginning of each text unit, that is, each proverb in our example case. Then all that remains is to enter the actual background information in the fields for each subject.

For instance, for interview data:

Date of interview:
Interviewer 1:
Interviewee pseudonym:
Interviewee's occupation:
Interviewee's age:
Interviewee's gender:
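
The kind of automatic processing that uniform field titles enable can be sketched as follows. This Python example reads a file that holds several data units, as in Example 7, and separates the background data fields from the text of each unit. The function name parse_units and the rule that a new unit begins whenever the first field title ("Occupation") appears are assumptions for this illustration; they do not describe how the FSD actually builds its index.

```python
# Field titles in the order they appear at the beginning of each unit (Example 7).
FIELDS = ["Occupation", "Age", "Gender", "Municipality of residence"]


def parse_units(text, fields=FIELDS):
    """Split a multi-unit file into dicts of background fields plus the unit text."""
    units = []
    for line in text.splitlines():
        title, sep, value = line.partition(":")
        if sep and title in fields:
            if title == fields[0]:
                units.append({"text": []})  # the first field title starts a new unit
            if units:
                units[-1][title] = value.strip()
        elif line.strip() and units:
            units[-1]["text"].append(line.strip())  # anything else is unit text
    for unit in units:
        unit["text"] = "\n".join(unit["text"])
    return units


sample = """Occupation: Teacher
Age: 32
Gender: Male
Municipality of residence: Helsinki

"When the cat is away the mice will play"

Occupation: Carpenter
Age: 56
Gender: Male
Municipality of residence: Rovaniemi

"Early to bed and early to rise makes a man healthy, wealthy and wise"
"""

for unit in parse_units(sample):
    print(unit["Occupation"], unit["Age"], unit["text"])
```

A misspelled field title (for example "Ocupation") would not be recognised and would be read as part of the unit text, which is why copying the empty field titles rather than retyping them is the safer habit.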

Data lists and storing background information in a list

For some types of data, the file format does not allow recording background information in the beginning of the data file. This is the case for audio and video recordings and protected pdf files, for example. In these cases, the best practice is to store background information in a manually created data list or a separate text file containing key background characteristics of each participant on successive rows.

In a manually created data list, the background data fields are entered in table form using a spreadsheet program such as Excel or OpenOffice Calc, for instance (see Example 8). If a separate text file is used, the optimal solution is to enter the background data fields in a consistent and uniform manner (see Example 9). This would enable the FSD, for instance, to create the user-friendly HTML index automatically at the processing stage.

In all cases, background information contains the file names, background characteristics of research subjects and information on the data collection event. A data list for audiovisual data may also contain technical information connected to the collection event, such as the type and model of the device used for recording and the length of the video/audio etc. Most technical background information can be read automatically from the audiovisual files themselves, so it may not be necessary to enter it manually into the background information.

Systematic entering of background information in a data list or in a separate text file will facilitate data management in different stages of the research if there are more than a few participants, as well as preserve collection event information that is important for later archiving and reuse of the data.

Example 8: Background data list (Excel)

Example 8 portrays a background data list in Excel format. The data collected were videorecorded interviews. The data list contains background information related to the interviewee and the interview event as well as information on the model and brand of the camera used and the length of the video (in minutes). See also another data list example from the UK Data Archive.

Data list in Excel
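
A data list in table form does not have to be created in a spreadsheet program by hand. As a rough sketch, the Python example below writes a small data list as CSV text, which Excel and OpenOffice Calc can both open. The column titles and the function name write_data_list are assumptions for this illustration, and the file name Lisa_1.avi is hypothetical; the other values are taken from Example 9.

```python
import csv
import io

# Column titles are an assumption for this sketch, not a required FSD layout.
COLUMNS = [
    "File name", "Interview date", "Interviewer", "Interviewee name",
    "Age", "Gender", "Occupation", "Camera", "Duration of the video",
]

rows = [
    ["Peter_1.avi", "12.04.2012", "Matt Miller", "Peter Herald",
     "37", "Male", "Barkeeper", "Panasonic HC-V10", "2:45"],
    # "Lisa_1.avi" is a hypothetical file name used only for illustration.
    ["Lisa_1.avi", "17.04.2012", "Matt Miller", "Lisa Smith",
     "43", "Female", "Author", "Canon XF305", "10:12"],
]


def write_data_list(rows, columns=COLUMNS):
    """Return the data list as CSV text: one header row, then one row per data unit."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


print(write_data_list(rows))
```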

Example 9: Background data list (text)

This is an example of the same dataset as in Example 8, but this time the background information has been entered into a separate text file, unit by unit. The background data fields are the same as in Examples 5, 6 and 7, where they were at the beginning of each unit text. When they are entered in a separate text file as in the example here, there is an ID at the top of each background information entity, beginning with @, linking it to the right unit. The ID is in the format @filename.fileformat (e.g. @Peter_1.avi). Each entity is separated by at least one line break [enter].

Interview date: 12.04.2012
Interviewer: Matt Miller
Interviewee name: Peter Herald
Age of the interviewee: 37
Gender of the interviewee: Male
Occupation of the interviewee: Barkeeper
Camera used for the video: Panasonic HC-V10
Duration of the video: 2:45


Interview date: 12.04.2012
Interviewer: Matt Miller
Interviewee name: Peter Herald
Age of the interviewee: 37
Gender of the interviewee: Male
Occupation of the interviewee: Barkeeper
Camera used for the video: Panasonic HC-V10
Duration of the video: 5:05


Interview date: 17.04.2012
Interviewer: Matt Miller
Interviewee name: Lisa Smith
Age of the interviewee: 43
Gender of the interviewee: Female
Occupation of the interviewee: Author
Camera used for the video: Canon XF305
Duration of the video: 10:12


Interview date: 22.04.2012
Interviewer: Matt Miller
Interviewee name: Mary Davies
Age of the interviewee: 42
Gender of the interviewee: Female
Occupation of the interviewee: Teacher
Camera used for the video: Panasonic HC-V10
Duration of the video: 6:56


Interview date: 24.04.2012
Interviewer: Matt Miller
Interviewee name: Pablo Neftali
Age of the interviewee: 76
Gender of the interviewee: Male
Occupation of the interviewee: Poet
Camera used for the video: Canon XF305
Duration of the video: 4:32
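
A separate text file in the format described above (an @filename.fileformat ID followed by "Title: value" rows) can be read automatically with very little code. The following Python sketch is one possible way to do it; the function name parse_data_list is an assumption for this illustration, and the ID @Peter_2.avi in the sample is hypothetical (only @Peter_1.avi is mentioned in the description above).

```python
def parse_data_list(text):
    """Return {unit_id: {field_title: value}} from '@ID' lines followed by 'Title: value' rows."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("@"):
            current = records.setdefault(line[1:], {})  # a new background information entity
        elif line and current is not None:
            # Split at the first colon only, so values such as "2:45" stay intact.
            title, sep, value = line.partition(":")
            if sep:
                current[title.strip()] = value.strip()
    return records


# The second ID, @Peter_2.avi, is a hypothetical name used only for illustration.
sample = """@Peter_1.avi
Interview date: 12.04.2012
Interviewer: Matt Miller
Interviewee name: Peter Herald
Duration of the video: 2:45

@Peter_2.avi
Interview date: 12.04.2012
Interviewer: Matt Miller
Interviewee name: Peter Herald
Duration of the video: 5:05
"""

for unit_id, fields in parse_data_list(sample).items():
    print(unit_id, fields["Duration of the video"])
```

Because the ID equals the file name, each record can be matched to its audio or video file directly, without any information being coded in the file name itself.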


Data collected from periodicals

The FSD has made an agreement with the Finnish Copyright Society Kopiosto which allows the FSD to archive and disseminate copyrighted data collected by researchers, such as newspaper and magazine articles, photographs, cartoons and illustrations in books. The FSD only accepts digital data that have been collected for research purposes.

Material collected from online periodicals

When researchers collect articles from online periodicals for research purposes, they should bear in mind that references to web resources, like URLs, may change over time. Because of this, articles deposited at the Data Archive for archiving should be copied into a word processing program. If the copied articles do not contain bibliographic information, it should be added in the beginning of the article. After this, it is advisable to convert the articles into PDF file format.

Documenting bibliographic information

When articles, photographs and other similar material are collected from periodicals for research purposes, bibliographic information should be carefully detailed. For example, bibliographic information of newspaper articles should include

  • Author(s)
  • Title of the article
  • Title of the newspaper
  • Date of publication
  • Web address of the article (if online newspaper)
  • For an online newspaper, retrieval date of the article
  • If the text in question is an editorial, opinion piece or letter to the editor, this should be mentioned in the citation.

Examples of articles with a known author:

Examples of anonymous articles:

If the data under study includes articles from scientific journals, bibliographic information should also include, following the established citation conventions of academic writing,

  • Page numbers of the article
  • Title of the journal
  • Journal volume number
  • Journal issue number

If the articles have been collected from edited books, the bibliographic information should include

  • Name(s) of the editor(s)
  • Complete title of the book
  • Page numbers of the article
  • Title of the publication series, number of the edited book in the series
  • Name of the publisher
  • Place of publication

Make a list of analysed articles

A researcher planning to deposit articles collected from periodicals for archiving should deliver to the Data Archive a separate list of all the articles. The list should consist of bibliographic information sorted alphabetically or chronologically. Alternatively, the articles may be listed in the order in which they were analysed during research. The important thing is that the bibliographic information is documented consistently. The Archive delivers the list of articles to Kopiosto during the archiving process.
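
Keeping the bibliographic information in a structured form makes it straightforward to produce the list in either order. The Python sketch below is only an illustration: the class name Article, its field names and the two sorting functions are assumptions, and all entries are invented placeholders rather than real articles.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Article:
    """Bibliographic information for one newspaper article (field names are assumptions)."""
    title: str
    newspaper: str
    published: date
    author: Optional[str] = None     # None for anonymous articles
    url: Optional[str] = None        # only for online newspapers
    retrieved: Optional[date] = None
    kind: Optional[str] = None       # e.g. "editorial" or "letter to the editor"


def alphabetical(articles):
    """Sort by author where known, otherwise by title; ties are broken by date."""
    return sorted(articles, key=lambda a: ((a.author or a.title).lower(), a.published))


def chronological(articles):
    """Sort by date of publication; ties are broken by title."""
    return sorted(articles, key=lambda a: (a.published, a.title.lower()))


# Invented placeholder entries, not real articles.
articles = [
    Article("Placeholder title B", "Example Newspaper", date(2014, 3, 2), author="Author, B."),
    Article("Placeholder title A", "Example Newspaper", date(2013, 1, 15)),
    Article("Placeholder title C", "Example Newspaper", date(2013, 11, 20),
            author="Author, A.", kind="editorial"),
]

for article in chronological(articles):
    print(article.published.isoformat(), article.author or "(anonymous)", article.title)
```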

updated 2016-06-08