Review documentation
In addition to data and metadata, relevant documentation must also be available to ensure that secondary users can understand and reuse the dataset. As a reviewer, it is your responsibility to assess whether the submitted data are accompanied by the necessary documentation. It is difficult to define exactly what constitutes sufficient documentation, as this depends on factors such as the research discipline, data type, and the specifics of the research project. Each case requires individual assessment.
In many contexts, documentation and metadata are used as interchangeable terms. The boundary between them can be fluid, but there are some differences. Documentation is primarily intended for human readers and often consists of text, while metadata are structured to be readable by both humans and machines. Since documentation does not need to be machine-readable, it is not subject to the same structural requirements as metadata.
Examples of information that may need to be documented include:
- the research questions and objectives of the study
- a description of the chosen data collection method
- the contents of data files, such as a variable list or codebook
- survey reports and technical reports
- questionnaires
- an overview of the analytical process and a description of the data used for analysis
- descriptions of included photographs, such as date, location, and camera settings (see also Images and Digital video on the file format recommendations pages on Researchdata.se)
- related articles and publications
- associated code/scripts.
Accompanying documentation is essential for enabling reuse of data in new research, for validating research findings, and for understanding the research itself. Relevant information may, for example, be summarized in a README file, such as the README template developed by Cornell University.
Examples of documentation
Below are some examples of the types of documentation that may be needed for different kinds of data.
Tabular data
Here, it is essential that each variable/column is clearly described. Abbreviations are often used as column headers in a tabular database, and it is important that these are explained. A clear alternative to the abbreviation is often sufficient – for example, “CORINE 2012 Land cover type code” instead of “COR_TYPE”.
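As a rough illustration of this check, the sketch below compares the column headers of a CSV file with a variable list and reports headers that lack a description. The function name, the sample data, and the representation of the variable list as a Python dictionary are all assumptions made for the example; in practice the variable list could be a separate file in any format.

```python
import csv
import io


def undocumented_columns(data_csv: str, variable_list: dict[str, str]) -> list[str]:
    """Return column headers that have no non-empty description in the variable list."""
    reader = csv.reader(io.StringIO(data_csv))
    header = next(reader, [])
    return [name for name in header if not variable_list.get(name, "").strip()]


# Hypothetical example data and variable list (illustration only).
data = "SITE_ID,COR_TYPE,SMP_DATE\nA1,311,2012-06-01\n"
variables = {
    "SITE_ID": "Sampling site identifier",
    "COR_TYPE": "CORINE 2012 Land cover type code",
}
print(undocumented_columns(data, variables))  # ['SMP_DATE']
```

A header reported here is not necessarily a problem, but it is a good prompt for a question to the researcher.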
You should also check that each variable/column is filled in consistently – for example, that all dates in a column titled “date of sampling” use the same format. It is also good practice to ensure there are no blank fields; missing values should be clearly marked (e.g., using “missing”).
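A consistency check of this kind can be partly automated. The following is a minimal sketch, assuming the data are in CSV form, that dates are expected in the format YYYY-MM-DD, and that missing values are marked with the word “missing”; the function name and the sample data are hypothetical.

```python
import csv
import io
from datetime import datetime


def check_column(data_csv: str, column: str, date_format: str = "%Y-%m-%d") -> dict[str, list[int]]:
    """Report row numbers (1 = first data row) with blank or badly formatted values."""
    problems = {"blank": [], "bad_format": []}
    for row_number, row in enumerate(csv.DictReader(io.StringIO(data_csv)), start=1):
        value = (row.get(column) or "").strip()
        if not value:
            problems["blank"].append(row_number)
            continue
        if value == "missing":  # explicit missing-value marker (assumed convention)
            continue
        try:
            datetime.strptime(value, date_format)
        except ValueError:
            problems["bad_format"].append(row_number)
    return problems


# Hypothetical sample: row 2 uses another date format, row 3 is blank.
sample = "site,date of sampling\nA1,2012-06-01\nA2,01/06/2012\nA3,\nA4,missing\n"
print(check_column(sample, "date of sampling"))
# {'blank': [3], 'bad_format': [2]}
```

Note that `strptime` also accepts values without leading zeros, such as 2012-6-1, so a stricter check would need an additional pattern test.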
Make sure all values used in the dataset are defined or described. If, for instance, a survey has been entered into a table using numeric values, the documentation must explain what the values represent – for instance, that in the column Gender, the value 1 means “male” and 2 means “female”.
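The same check can be sketched in code: given a codebook, list every value in a coded column that the codebook does not define. The codebook structure (a dictionary per column), the function name, and the sample survey data below are assumptions for illustration only.

```python
import csv
import io


def undefined_codes(data_csv: str, codebook: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """For each coded column, return the values that the codebook does not define."""
    undefined: dict[str, set[str]] = {}
    for row in csv.DictReader(io.StringIO(data_csv)):
        for column, codes in codebook.items():
            value = (row.get(column) or "").strip()
            # Blank cells are ignored here; only non-empty, undefined values are reported.
            if value and value not in codes:
                undefined.setdefault(column, set()).add(value)
    return undefined


# Hypothetical codebook and survey data.
codebook = {"Gender": {"1": "male", "2": "female"}}
survey = "id,Gender\n1,1\n2,2\n3,9\n"
print(undefined_codes(survey, codebook))  # {'Gender': {'9'}}
```

In this example the value 9 appears in the data but is not explained, so the documentation would need to be supplemented.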
It is also important to include references or links to standards and/or formal definitions, if these exist, in documentation and variable descriptions. This could, for example, include survey designs, assessment tools, ISO standards, algorithms, and coding frameworks. Where possible, use a persistent identifier (PID) for such references.
Images
A large collection of images may require extensive documentation, and much of it may also be considered metadata. Image files should have clear and informative filenames – for example, a descriptive title or a coded name that can be interpreted using a code list. If the images are organized into folders, the folder structure must be clear, and each folder should be named so that its contents are easy to find and understand.
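When coded filenames are used, it can be checked whether every filename can actually be interpreted with the code list. The sketch below assumes a purely hypothetical naming convention (site code, object code, and running number separated by underscores) and a hypothetical code list; a real dataset will have its own convention, which the pattern would need to reflect.

```python
import re
from pathlib import PurePosixPath

# Hypothetical naming convention: <site code>_<object code>_<running number>.<extension>
FILENAME_PATTERN = re.compile(r"^(?P<site>[A-Z0-9]+)_(?P<object>[A-Z]+)_(?P<number>\d+)$")


def uninterpretable_filenames(paths: list[str], code_list: dict[str, str]) -> list[str]:
    """Return paths whose filename does not follow the pattern or uses an unlisted code."""
    flagged = []
    for path in paths:
        match = FILENAME_PATTERN.match(PurePosixPath(path).stem)
        if match is None or match["site"] not in code_list or match["object"] not in code_list:
            flagged.append(path)
    return flagged


# Hypothetical code list and file listing.
code_list = {"S01": "Trench 1", "POT": "Pottery", "MAP": "Scanned map"}
paths = [
    "trench1/S01_POT_0001.tif",
    "trench1/IMG_4532.jpg",
    "maps/S01_MAP_0002.tif",
    "maps/S01_BON_0003.tif",
]
print(uninterpretable_filenames(paths, code_list))
# ['trench1/IMG_4532.jpg', 'maps/S01_BON_0003.tif']
```

Here one file keeps a default camera name and another uses a code (BON) that is missing from the code list.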
Geospatial data
Geospatial data must be accompanied by documentation explaining the contents of the dataset. This documentation can be embedded in the data files themselves or provided as separate text files. Because it can be difficult to assess whether all necessary information for interpreting geospatial data is included, it is important to communicate with the researcher. Questions you can ask include:
- Is there information on projection and coordinate system?
- Is there guidance on how to interpret the column or attribute data?
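For the first question, one simple check applies to data delivered as Esri shapefiles, where the coordinate system is stored in a separate .prj file next to the .shp file. The sketch below, with a hypothetical function name and file listing, reports shapefiles that lack such a file. It does not cover formats such as GeoPackage or GeoTIFF, which store the coordinate reference system inside the file, and it does not verify that the stated coordinate system is correct.

```python
from pathlib import PurePosixPath


def shapefiles_without_prj(filenames: list[str]) -> list[str]:
    """Return .shp files that have no .prj sidecar file with the same base name."""
    # Base names (path without extension) of all .prj files, compared case-insensitively.
    prj_bases = {
        str(PurePosixPath(name).with_suffix("")).lower()
        for name in filenames
        if PurePosixPath(name).suffix.lower() == ".prj"
    }
    return [
        name
        for name in filenames
        if PurePosixPath(name).suffix.lower() == ".shp"
        and str(PurePosixPath(name).with_suffix("")).lower() not in prj_bases
    ]


# Hypothetical file listing from a submitted dataset.
files = ["sites.shp", "sites.shx", "sites.dbf", "sites.prj", "transects.shp", "transects.dbf"]
print(shapefiles_without_prj(files))  # ['transects.shp']
```

A shapefile flagged this way is a reason to ask the researcher which projection and coordinate system were used.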
Data example: Greenhouse gas measurements in the atmosphere
Data type: Tabular data (a table file containing several columns/fields)
A secondary user needs to understand what each field/column represents and what units are used. A good way to document this type of dataset is to upload a variable list describing the columns, include a publication that explains the table structure, or document the details directly in the data file.
Data example: Data from an archaeological excavation
Data type: Excavation database (a database with multiple tables containing information about different types of finds, archaeological layers, image descriptions, etc.) and images (scanned maps, excavation photographs, photos of artefacts, etc.).
To understand the contents of the database, documentation must include information about its structure and content – for example, a variable list, codebook, or equivalent – as well as how the images relate to the database.
Excavations often result in several reports, some unpublished. It is important that these reports, or a relevant selection of them, are included as documentation.
Other types of documentation may include field diaries describing the practical work during the excavation.
If the data are linked to a publication
If a dataset is shared in connection with a specific article or publication, the documentation requirements may be slightly lower – provided that the publication:
- is open and freely accessible
- is linked to the data description using a persistent identifier (PID)
- includes all the information needed to understand and reuse the dataset.
If an article is intended to replace some of the dataset’s documentation, it must include information about, for example, the methodology and data collection process. Keep in mind that variable and code lists explaining the data files are usually not included in the article or its supplementary materials. These lists must therefore be shared alongside the data files.
If the article is published in a subscription-based journal, the relevant sections that describe the data must be saved and included with the dataset.
Researchers should also be encouraged to link to the dataset from the article, for example in a Data Availability Statement.
Sometimes, data need to be made available before the article is published. In exceptional cases, you may therefore need to publish the data description without a link to the article. The recommendation is to enter the article title in DORIS followed by “under review” and to update the data description with the link to the article as soon as it becomes available. See more under Closed review on the Manage data descriptions in DORIS page.