Physical Data Storage

During a research project, the project members are usually responsible for the usability and storage of the data. Archiving typically occurs after the project has ended, when responsibility for storing the data passes to a research organisation, a university, or a unit specialising in the archiving of data, such as the Finnish Social Science Data Archive (FSD). It is also possible to deposit the data at the FSD while the project is still under way and to agree that the data will be published only after the project is finished.

Whatever the case, researchers and research groups should have basic knowledge of what data preservation and physical storage entail. Physical preservation requires careful monitoring of data quality and system integrity, upgrade and validation measures, disaster preparedness, and constant development of the system.

Naming and managing files

It is advisable to create a separate folder for each dataset in which data files, description information, and all other files related to the data will be saved. Access rights should be defined for all folders and files, particularly when they are stored on a server instead of a single computer. For example, it is not necessary that all members of a research project have the right to modify backup files.
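As a rough illustration of restricting modification rights, the sketch below removes write permission from a backup file using file mode bits. It assumes a POSIX-style file system; the function name make_read_only is hypothetical, and on shared servers access rights are more commonly managed through groups or access control lists set up by the organisation's technical support.

```python
import os
import stat


def make_read_only(path):
    """Remove write permission for owner, group and others (POSIX mode bits)."""
    current_mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    # Keep every existing permission except the write bits.
    os.chmod(path, current_mode & ~write_bits)
```

Read permission is left untouched, so project members can still open the backup copy but cannot change it by accident.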

Folders and files should be named in an uncomplicated and logical manner. It is advisable to save basic information about the files in the same location as the metadata. Modern software allows fairly long file names, and the name should include at least an abbreviation of the project name, the year, the file contents, and the file version. For instance, the original SPSS file of the World Values Survey 2000 survey data could be named wvs2000_data_original.por and the questionnaire used in data collection wvs2000_questionnaire_finland.rtf. If the dataset has been given a unique identifier, it is advisable to include it in the names of all files related to the data.
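A naming rule of this kind can be applied consistently with a small helper. The following is a minimal sketch, not a prescribed tool: the function build_file_name and its parameter names are hypothetical, and the choice to replace unusual characters with hyphens is an assumption made here to keep names portable.

```python
import re


def build_file_name(project, year, contents, version, extension):
    """Combine project abbreviation, year, contents and version into one file name."""
    parts = [f"{project.lower()}{year}", contents, version]
    # Assumption: anything other than letters, digits and hyphens becomes a hyphen.
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "-", part).strip("-") for part in parts if part]
    return "_".join(cleaned) + "." + extension.lstrip(".")


print(build_file_name("wvs", 2000, "data", "original", "por"))
print(build_file_name("wvs", 2000, "questionnaire", "finland", "rtf"))
```

The two calls reproduce the example names wvs2000_data_original.por and wvs2000_questionnaire_finland.rtf.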

Different file formats are discussed on the File Formats and Software page.

Example: Files of the dataset ISSP 2006: Role of Government IV: Finnish Data archived at the FSD:

Directory of X:\Data\FSD2248
|   cbF2248.pdf
|   meF2248.xml
|   mef2248e.xml
|   quF2248_fin.pdf
|   quF2248_sve.pdf
|   vaf2248.xml
|   
+---Data
|       daF2248.por
|       syF2248.SPS
|       
\---Original
        ISSP06_frequencies.xls
        ISSP06_FSDdata.sas7bdat
        ISSP06_FSDdata.sav
        ISSP06_labfor.sas
        ISSP06_questionnaire_fin.pdf
        ISSP06_questionnaire_swe.pdf
        ISSP06_study_description.doc
        ISSP06_variable_list.lst
        ISSP_vastaus%_2002-06.xls

In the example, a folder named FSD2248 has been created for the dataset based on the identifier given by the archive. The first two characters of a file name indicate what the file contains:

  • cb = codebook
  • da = data file
  • sy = syntax file
  • me = data description/metadata
  • qu = questionnaire
  • va = variable description

Fnnnn is the identifier of the dataset. Information on the file language can be found at the end of the file name. In the example above, meF2248e.xml signifies the data description in English.
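The convention described above is regular enough to be read by a program. The sketch below is an illustration only: the function describe and the regular expression are hypothetical, the pattern is inferred from the listed example files rather than from an official specification, and the language suffix is returned as written without interpreting it.

```python
import re

# Prefix meanings as listed in the text above.
PREFIXES = {
    "cb": "codebook",
    "da": "data file",
    "sy": "syntax file",
    "me": "data description/metadata",
    "qu": "questionnaire",
    "va": "variable description",
}

# Assumed shape: two-letter prefix, "F" plus digits, optional language suffix, extension.
NAME_PATTERN = re.compile(
    r"^(?P<prefix>[a-z]{2})F(?P<number>\d+)_?(?P<language>[a-z]*)\.(?P<ext>\w+)$",
    re.IGNORECASE,
)


def describe(file_name):
    """Return the parts of an archive-style file name, or None if it does not fit."""
    match = NAME_PATTERN.match(file_name)
    if match is None or match.group("prefix").lower() not in PREFIXES:
        return None
    return {
        "contents": PREFIXES[match.group("prefix").lower()],
        "identifier": "F" + match.group("number"),
        "language": match.group("language").lower() or None,
        "extension": match.group("ext").lower(),
    }


print(describe("quF2248_fin.pdf"))
print(describe("ISSP06_frequencies.xls"))  # original file, not in the scheme
```

Files in the Original folder keep the names given by the depositor, so the function returns None for them.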

See also: instructions for naming qualitative data files.

Backup and recovery

Backing up files decreases the risk of partial or complete deletion of data. Several different solutions for saving and backing up data are on offer. Saving and copying on different media is generally easy if the size of the data does not cause restrictions. Survey data are rarely so big as to cause problems, but register data as well as audio and video material may require exceptional measures. Separate working copies and backup copies should always be created for any files related to the data.

Backing up files protects the data from unfortunate incidents such as:

  • accidental changes to the data
  • accidental deletion of the data or part of it
  • changes in or deletion of the data caused by media or software faults
  • damage caused by computer viruses
  • harm caused by hackers
  • natural disasters, wars, fire, flood etc.

When planning backup procedures, particular attention should be paid to the following issues:

  • frequency; back up the data and different versions regularly
  • dispersal; store at least one backup copy in another physical location
  • data integrity; ensure that the backup copies have not been corrupted, for example, by using checksums
  • reliability and durability of storage media
  • creating rolling backups; older backups should not always be overwritten with newer ones: for example, weekly or monthly backups can be created while retaining older versions
  • refreshing backup media; regularly replace old media with new
  • storage requirements; follow the storage media manufacturer’s instructions and recommendations
  • format; file formats of the material to be backed up should be suitable for long-term preservation

In addition to the issues mentioned above, the instructions given by the home organisation should be followed.
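The integrity check mentioned in the list can be sketched with checksums as follows. This is a minimal example under stated assumptions: the function names are hypothetical, SHA-256 is chosen here as one common checksum algorithm (the text does not prescribe one), and the backup is assumed to mirror the folder structure of the original.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to the folder) to its checksum."""
    folder = Path(folder)
    return {
        path.relative_to(folder).as_posix(): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def find_corrupted(original_folder, backup_folder):
    """List files whose backup copy is missing or differs from the original."""
    original = build_manifest(original_folder)
    backup = build_manifest(backup_folder)
    return sorted(name for name, checksum in original.items() if backup.get(name) != checksum)
```

Saving the manifest alongside the backup also makes it possible to detect corruption later, when the working copy has already changed.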

Owing to modern technical solutions and economic reasons, people often resort to storing data on hard disk drives (HDD). Because hard disk drives are fairly susceptible to failure, it is advisable to copy the same data to several HDDs or use additional media for backup.
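Copying the same data to several disks can be combined with the rolling backups mentioned earlier. The sketch below is one possible arrangement, not a recommended product: the function rolling_backup, the dated folder naming, and the default of keeping four copies are all assumptions made for illustration, and it requires Python 3.8 or later.

```python
import shutil
from datetime import date
from pathlib import Path


def rolling_backup(source_folder, backup_roots, keep=4, today=None):
    """Copy a folder into a dated subfolder on each backup location.

    Only the newest `keep` dated copies are retained per location.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    today = today or date.today()
    source = Path(source_folder)
    created = []
    for root in backup_roots:
        root = Path(root)
        target = root / f"{source.name}_{today.isoformat()}"
        shutil.copytree(source, target, dirs_exist_ok=True)
        created.append(target)
        # ISO dates (YYYY-MM-DD) sort chronologically, so the oldest come first.
        dated = sorted(p for p in root.glob(f"{source.name}_*") if p.is_dir())
        for old in dated[:-keep]:
            shutil.rmtree(old)
    return created
```

Called weekly with, for example, two external drives as backup_roots, this keeps the four most recent weekly copies on each drive. The copies should still be verified, for instance with checksums.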

Migration and refreshing: keeping data accessible

Media and software development is rapid, which constitutes a problem for long-term preservation. Archived data have to remain accessible even after the research project that created them has ended and the software, file format and media have become outdated. The most common strategies for long-term preservation of digital information are migration and emulation. Of these, migration is the more useful when archiving research data.

Migration (conversion) refers to the conversion of data files from older system environments to newer ones, for example, when transferring data to another version of a program. Migration has to be repeated from time to time as the system environment changes. Storing the data and related metadata in as standardised and simple a format as possible facilitates migration and reduces the need for it.

Refreshing is the transfer of data from one storage medium to another. The software needed to read and modify the data remains unchanged. Refreshing is necessary, for instance, if the original storage medium has deteriorated or if new media are clearly more affordable than old ones. Refreshing often requires updating the hardware environment, for example, purchasing a device that can read the new media.

Storage media

Optical storage media

The range of optical storage media is wide and changing constantly. Common to all optical storage media is that a beam of light, usually a laser, is used to read them. As yet, there is no extensive knowledge of the long-term data storage capabilities of optical media, but they are well suited for storing data and transferring working copies during a research project. The most common optical media include:

  • CD (compact disc). CDs come in different sizes, but standard ones have a diameter of 120 millimetres, providing space for approximately 74 minutes of audio or 650 megabytes of data. CDs that can be written on (CD-R and CD-RW) typically have a capacity of 80 minutes of audio or 700 MB of data.
  • DVD (digital video/versatile disc). The most common uses for DVDs are storing video and data. A DVD outwardly resembles a CD. DVDs support storing data in several formats. Writable single-layer DVDs typically have a storage capacity of 4.7 gigabytes, while HD DVDs have larger capacities.
  • Blu-ray Disc (BD). A "blue laser", which has a shorter wavelength than the conventional red laser, is used to read and write Blu-ray discs. This allows for a larger storage capacity than that of CDs and DVDs. A single-sided, single-layer disc has a storage capacity of 25 GB and a dual-layer disc 50 GB.

Optical media are not designed with long-term preservation in mind. The discs must be stored with care, as they are susceptible to problems caused by scratches, fingerprints and UV radiation. For instance, direct sunlight can easily damage an optical disc. The readability of recordable and rewritable discs, in particular, can be temporary. More information on the suitability of optical media for long-term preservation: Risks Associated with the Use of Recordable CDs and DVDs as Reliable Storage Media in Archival Collections.

Non-volatile memory

Various memory cards and memory sticks (flash memory/USB flash drives) are non-volatile memory, meaning that the information stored in them is retained even if the device is powered off. Non-volatile memory is not designed for long-term preservation, although it is not as vulnerable to problems caused by external factors as optical media. Due to their small size, non-volatile memory devices are fit for transferring working copies from one computer to another. There is a drawback to the small size, though: memory sticks, in particular, tend to get lost surprisingly often.

  • A memory stick (USB flash drive) is a small device that can be connected to a USB port on a computer. A flash drive is displayed as a removable disk in 'My Computer' and data can be stored on it in the same way as on a hard disk drive. Commonly, USB flash drives have a capacity of 4-64 GB.
  • Memory cards are used, for example, in digital cameras and mobile phones. Newer (laptop) computers often have built-in memory card readers. Memory card formats include, among others, Compact Flash (CF), Secure Digital (SD), Multi Media Card (MMC) and Memory Stick.
  • SSD mass storage devices (solid state drives) were designed to replace hard disk drives that are based on rotating magnetic disks. SSDs do not have mechanical components, and are more silent and often faster than HDDs (at least in reading data), but repeated writing to them may, in some cases, decrease their lifespan.

Magnetic storage

Magnetic storage media have been used for a long time. Digital information can be written magnetically on tapes or hard disks, both of whose storage capacities are larger than, say, those of optical media.

  • Floppy disks (or diskettes) are storage devices that are rapidly becoming obsolete, and modern computers do not have drives to read or write them. Data stored on a floppy disk should be migrated to new media as soon as possible.
  • Hard disk drives (hard drives, HDDs) are used as mass storage devices for computers. Data is written to the magnetic surface(s) of the rotating metal or glass disk(s) inside the case. In addition to conventional use as mass storage devices, there are nowadays "external" hard drives (with USB connection) that offer significant portable storage capacity (from 450 GB to 6 TB). External hard drives are considerably slower at writing than "internal" ones. A hard disk drive is an excellent medium for processing a working copy of data, but unreliable for long-term preservation.
  • Digital magnetic tape systems are designed mainly for long-term backup of extensive data systems. They are not very well suited for storing working copies, because retrieving data is slow and the tapes wear out if they are read often. Using and maintaining tape systems requires specific skills, and a tape system is expensive to use for single research projects. Different tape systems include, among others, DAT, LTO and DLT (see also the Wikipedia article on magnetic tape data storage).

Data security

Data security refers to protecting data, systems and communications. Copying and distributing digital research data is easy, but so are accidental deletion and modification. Backing up is part of data security, but in addition to backups, unauthorised use must be prevented. The following, among others, should be considered:

  • Security of information networks. Research staff should have personal usage rights to read and write data (for instance, usernames and passwords). This is particularly important if the research data can be accessed through a network. Information transferred through networks can be encrypted if necessary. Confidential information must not be stored on servers that provide network services (e.g. web and mail servers). Confidential data are to be stored only on computers that are not connected to networks. Additionally, it should be ensured that the information system does not save any temporary or other files resulting from the processing of data in folders accessible by anyone.
  • Physical security. Storing and backing up research data should be planned to protect the data from fire, water damage, break-ins and sabotage. It is recommended that the building has access control, and the doors should be locked when staff is not present. Additionally, access to rooms that contain research data can be restricted to a few individuals. Preparations should be made in case of failure of computers or other devices. It is recommended to place backup files in a safe. A backup copy of the data should be made and stored in a different physical location. Data security of this backup file should also be ensured.
  • Software updates. Critical updates for operating systems and programs should be installed as quickly as possible. It is advisable to use an updater service that installs important updates automatically, keeping in mind that software updates may sometimes cause compatibility issues.
  • Virus protection. All computers used in the research project must have regularly and automatically updating antivirus software installed.

In matters related to data security, technical support of the organisation should be consulted and, if necessary, technical services purchased.

Data disposal

The potential disposal of research data must be planned carefully. Accidental deletion of data should be prevented.

Dispensable data files and temporary files created when programs are used should be erased when they are no longer needed. Simply deleting a file and emptying the recycle bin does not mean that the file has been erased for good. Recovering deleted files can be possible even after a hard drive has been formatted. Special file deletion software can be used to overwrite data, which makes recovery impractical in most cases. Magnetic storage media can be degaussed (i.e. demagnetised) to wipe the data. Physical destruction of storage media (disintegrating, shredding, incinerating, pulverising), when done properly, is the most reliable way to securely dispose of data. Universities and organisations often have their own guidelines on data disposal and destruction.
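The overwriting approach can be illustrated with a short sketch. It is not a substitute for dedicated deletion software or organisational guidelines: the function overwrite_and_delete is hypothetical, a single pass of random bytes is an assumption, and overwriting a file in place does not guarantee erasure on solid state drives, USB flash drives, or file systems that keep snapshots or write changes to new locations.

```python
import os


def overwrite_and_delete(path, passes=1, chunk_size=1024 * 1024):
    """Overwrite a file's contents with random bytes, then delete it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as handle:
        for _ in range(passes):
            handle.seek(0)
            remaining = size
            while remaining > 0:
                block = os.urandom(min(chunk_size, remaining))
                handle.write(block)
                remaining -= len(block)
            # Ask the operating system to push the new contents to the device.
            handle.flush()
            os.fsync(handle.fileno())
    os.remove(path)
```

Backup copies and temporary copies of the same file are not affected by this and have to be disposed of separately.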

Preservation of data in paper form

If digital research data have been documented and processed in a way that allows long-term preservation, paper documents related to the data are usually no longer needed. For instance, questionnaires filled in by the respondents need not be archived after the research project has ended.

Completed questionnaires should only be preserved if they contain essential information that is not available in digital form. This kind of information could be, for example, responses to open-ended questions that have not been recorded anywhere else. Paper questionnaires should also be preserved if the data need to be checked. Whatever the case, the costs of storing paper material should be compared to the supposed advantages of saving it. Furthermore, confidentiality issues need to be addressed, as the completed questionnaires may form a personal data register. However, for documentation purposes, it is always beneficial to preserve the electronic questionnaire, keep a few blank paper questionnaires, or scan the paper questionnaire to preserve it electronically.

updated 2014-09-16