University Library: Research Data Management: File management

File management considerations

Having a well-considered plan for the structure and organisation of your research data will improve its management, access, and re-use. The organisation of data is especially important in team projects where more than one person will be accessing and analysing the data.

The key methods and approaches to consider are:

file formats
file naming
versioning
folders and file directories

File management

File formats

Working with file formats that are widely used, interchangeable and well suited to long-term preservation improves the impact and reach of your research outputs. Good formats make your research more accessible and make it easier for you and for future researchers to use or reuse the data on a wide range of computer systems, regardless of the software packages available.

When performing research it’s often necessary to use specialised and proprietary file formats. This may be for many reasons: your method of data analysis, the hardware used, the software available to you, or the need to meet discipline-specific standards. Even so, it’s still important to make a conscious and informed decision when choosing file formats. At a minimum you should consider:

proprietary (e.g. .xlsx, .nvp, .pdf) or open formats (e.g. .csv, .rtf, .otp) and whether specific software is required to access your data
maintaining data integrity by using lossless formats to ensure no useful data is lost to future researchers
the risk of file format obsolescence which may be the result of imminent or future software or hardware upgrades or practice shifts, particularly for proprietary software

At later stages of your research, such as when publishing traditional research outputs or making your data publicly available, you should consider transferring your data to a file format that can be used by people who may not have access to the exact suite of software you have. The UK Data Service Recommended Formats table can help you choose a file format best suited to long-term accessibility.
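As a rough illustration of the first consideration, the sketch below lists the files in a project folder that are stored in formats needing specific software, so they can be reviewed before data is shared. The function name and the `SOFTWARE_DEPENDENT` set are hypothetical; the two extensions are taken from the examples above and should be adjusted to your own project.

```python
from pathlib import Path

# Illustrative assumption: two of the example extensions from the list above.
# Replace this set with the formats your project treats as software-dependent.
SOFTWARE_DEPENDENT = {".xlsx", ".nvp"}


def files_needing_review(folder):
    """Return files under `folder` whose extension is in SOFTWARE_DEPENDENT."""
    root = Path(folder)
    return sorted(
        path
        for path in root.rglob("*")  # walk the folder and all sub-folders
        if path.is_file() and path.suffix.lower() in SOFTWARE_DEPENDENT
    )
```

Calling `files_needing_review("project_data")` on a hypothetical `project_data` folder would return the paths to check against the UK Data Service table before publishing.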

File formats - This basic guide by ARDC (formerly ANDS) gives a good overview of how to choose which formats to use to gather, transfer and store research data.
Born-digital file format standards: National Archives of Australia - Use digital file formats which have a low risk of becoming obsolete, to ensure the preservation, accessibility and interoperability of digital information for researchers over the long term.
Recommended file formats: UK Data Service - A table outlining recommended and acceptable file formats for various data types.

File naming

Digital file names are important for identifying and finding a digital file. To maximise access to your records, establish a naming convention for your files. A file naming convention is a framework for naming your files in a way that describes what they contain and how they relate to other files.

It is essential to establish a convention before you begin collecting files or data in order to prevent a backlog of unorganised content that will lead to misplaced or lost data. The most important things to remember about file naming are:

descriptiveness - Good naming conventions should provide useful clues to the content and status of a file, including its version. A file name is its principal identifier and also helps in classifying and sorting files.
consistent application - Select an appropriate and systematic naming convention as early as possible, then follow it throughout the research and across all cooperating researchers to get the most benefit from it.

The following examples highlight basic principles of file naming.

Good file names:

20201024_Registry-of-participants_Survey.doc

ROBERTS_Hannah_2021_Interview.mp4

20210130_ProjectB_Ex1Test1_SmithE_v01.xlsx

File names are concise (40-50 character limit) and meaningful
Description of content and version
Sentence case including a capital letter for names and proper nouns
Date of creation format: YYYYMMDD or YYYY-MM-DD. Starting a file name with the date helps with default ordering, because the names then sort chronologically
Elements are separated with hyphens (-) or underscores (_). Avoid spaces and punctuation, as many programs and scripts handle them poorly
When using a sequential numbering system, use leading zeros to ensure that files sort sequentially. For example, use: 001, 002, ...010, 011 ... 100, 101 ... instead of 1, 2, ...10, 11 ... 100, 101, etc.
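The principles above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: `build_file_name` and its parameters are hypothetical names, and the element order (date, description, version) is an assumption based on the third example above.

```python
from datetime import date


def build_file_name(created, elements, version, extension):
    """Join a YYYYMMDD date, descriptive elements and a version with underscores."""
    parts = [created.strftime("%Y%m%d")]
    # Replace spaces inside an element with hyphens so the name has no spaces.
    parts.extend("-".join(element.split()) for element in elements)
    # Zero-pad the version so that v02 sorts before v10.
    parts.append(f"v{version:02d}")
    return "_".join(parts) + "." + extension.lstrip(".")


name = build_file_name(date(2021, 1, 30), ["ProjectB", "Ex1Test1", "SmithE"], 1, "xlsx")
print(name)  # 20210130_ProjectB_Ex1Test1_SmithE_v01.xlsx

# Leading zeros keep plain text sorting in numeric order.
print(sorted(["10", "2", "1"]))       # ['1', '10', '2']
print(sorted(["010", "002", "001"]))  # ['001', '002', '010']
```

Generating names from one function, rather than typing them by hand, is one way to keep the convention consistent across a team.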

Bad file names:

Document13.docx

crt doc scan.pdf

Lit review, bib., chpt2-4, rev, cvr page, appendices.docx

output NVB>3.0.xml

Avoid very long file names
Avoid names that do not describe the subject or topic of the document’s contents
Avoid abbreviations - temp could mean template, temporary etc. Use the full word instead
Do not use spaces, dots or special characters within the document name, e.g.: ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' "
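Some of these problems can be caught automatically. The sketch below is one possible check, assuming a 50-character limit and names built only from unaccented letters, digits, hyphens and underscores; `name_problems` is a hypothetical helper. It cannot judge whether a name is descriptive, so a name such as Document13.docx still passes.

```python
import re

MAX_LENGTH = 50  # assumption: upper end of the 40-50 character guideline


def name_problems(file_name):
    """Return a list of ways `file_name` breaks the naming guidance."""
    problems = []
    if len(file_name) > MAX_LENGTH:
        problems.append("too long")
    # Split off the extension so that its dot is not reported.
    stem, dot, _extension = file_name.rpartition(".")
    if not dot:
        stem = file_name  # the name has no extension
    # Flag anything other than letters, digits, hyphens and underscores.
    unexpected = sorted(set(re.findall(r"[^A-Za-z0-9_-]", stem)))
    if unexpected:
        problems.append("unexpected characters: " + " ".join(repr(c) for c in unexpected))
    return problems


print(name_problems("20210130_ProjectB_Ex1Test1_SmithE_v01.xlsx"))  # []
print(name_problems("crt doc scan.pdf"))  # ["unexpected characters: ' '"]
```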

Versioning

Over the duration of a research project, a dataset will undergo many changes. These may be as simple as adding more sets of findings, or as major as the addition of a new dimension or type of measurement. Considering how you will deal with these scenarios is important for two reasons:

Changes may alter the conclusions derived from the dataset. To be clear about where your conclusions came from, it's important to know which version of your data you're working with.
Changes you make might ultimately not be useful. If this happens, you may need to go back to an earlier version of the dataset.

One of the tools used to address these scenarios is versioning. This refers to a system of keeping the old versions of a file and tracking the changes made in each subsequent version.

The most basic forms of versioning are manual systems. These usually contain two important elements:

The user adds a sequential number to the file name to indicate which version of the file it is
A change table in each document where versions, dates, authors and details of changes to the file are recorded
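Both elements of a manual system can be sketched as follows. The `_vNN` suffix and the row layout are assumptions for illustration (the suffix follows the v01 example used earlier in this guide), and the function names are hypothetical.

```python
import re
from datetime import date
from pathlib import PurePath

VERSION_SUFFIX = re.compile(r"_v(\d+)$")


def next_version_name(file_name):
    """Return a bare file name with its _vNN suffix increased by one."""
    path = PurePath(file_name)
    match = VERSION_SUFFIX.search(path.stem)
    if match is None:
        raise ValueError(f"no _vNN suffix found in {file_name!r}")
    digits = match.group(1)
    # Keep the original zero padding, so v01 becomes v02 and v09 becomes v10.
    bumped = str(int(digits) + 1).zfill(len(digits))
    prefix = path.stem[:match.start()]
    return f"{prefix}_v{bumped}{path.suffix}"


def change_table_row(version, changed_on, author, details):
    """Format one change table row: version, date, author, details."""
    return f"v{version:02d} | {changed_on.isoformat()} | {author} | {details}"


print(next_version_name("20210130_ProjectB_Ex1Test1_SmithE_v01.xlsx"))
# 20210130_ProjectB_Ex1Test1_SmithE_v02.xlsx
print(change_table_row(2, date(2021, 2, 3), "SmithE", "Added Test2 results"))
# v02 | 2021-02-03 | SmithE | Added Test2 results
```

The old file is left untouched; the new name is used for the saved copy, and the row is added to the change table inside the document.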

These are outlined in the ARDC Versioning guide and the UK Data Service Version Control and Authenticity page linked below.

While a manual system can work for many research projects, it can become difficult to use once your needs become complex or multiple people begin working on the same dataset, as explained in the video below. In these cases you should consider using version control software. Git is the best known and most widely used software of this type.

ARDC: Tap into the world's data versioning experts - Access and use solutions developed by the Research Data Alliance Data Versioning Working Group.
Data versioning - ARDC provides information about data version control and why it is important for researchers.
Git - Git is a free and open source distributed version control system designed to handle everything from small to large projects.
UK Data Service: Version Control and Authenticity - A good outline of best practice in version control, with examples of file versions from the UK Data Service.

JhaiChrispy. (2010, December 9). Version control overview [Video file]. https://youtu.be/6_JtvswKzII

Folders and file directories

Like file naming, systems to organise folder and file directories require coherence and consistency.

Coherence - Anyone using the folders should be aware that there is a system and what it means.

Consistency - Anyone using the folders should be consistent in creating folder names in line with the system, but also in keeping the relevant files in the appropriate folders.

This will ensure that it is easy to locate, organise, navigate and understand the context of all files and versions.

Other concepts to consider include:

file hierarchy refers to the number of levels or sub-folders in the directory. It is usually useful to have a maximum depth of 4 folders
folder direction determines how folders are nested and which way is most useful (e.g. Results/2019 or 2019/Results)
ambiguity or overlapping categories, especially at the top-level, can cause confusion
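One way to keep a folder structure consistent is to create it from a single agreed list instead of by hand. The sketch below assumes the depth limit of 4 is counted from the project root; `create_folders` and the example layout are hypothetical.

```python
from pathlib import Path

MAX_DEPTH = 4  # assumption: depth is counted below the project root


def create_folders(root, relative_folders):
    """Create each folder under `root`, refusing any deeper than MAX_DEPTH."""
    # Check every folder first so nothing is created if one is too deep.
    for folder in relative_folders:
        depth = len(Path(folder).parts)
        if depth > MAX_DEPTH:
            raise ValueError(f"{folder!r} is {depth} levels deep; the limit is {MAX_DEPTH}")
    created = []
    for folder in relative_folders:
        target = Path(root) / folder
        target.mkdir(parents=True, exist_ok=True)  # safe to run more than once
        created.append(target)
    return created
```

For example, `create_folders("ProjectB", ["Data/Raw", "Data/Processed", "Results/2019", "Results/2020"])` would build a Results/2019 style layout. Keeping the list in a shared script or document also records the folder direction the team agreed on.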

File wrangling - Careful thought about files at the beginning of a research project can save a lot of time, money and heartache later: plan your file formats, file naming conventions and file version control.
Organising data - A brief outline about structuring and organising data from the UK Data Service.
