Describing your data well helps ensure that they can be understood, discovered, and used by others. It is essential to capture contextual details about how and why the data were created.
To describe your data, you should address the following questions:
To ensure that your data are discoverable, understandable, and reusable by other researchers, you should document your data:
Example of a well-described dataset: Data and scripts for evaluation of researcher training in spreadsheet curation
The repository that you choose to publish in will have a form for you to fill out so that you can describe your data. You should fill in as many fields as possible to help users understand and reuse your data. If the repository is general, like figshare, use this README template to help give a more in-depth description of your data.
Examples of metadata that you can keep track of include:
Many computer systems also create additional technical metadata about your files, such as file size and the date the file was last modified.
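As a rough illustration of the technical metadata mentioned above, the following Python sketch reads the size and last-modified date that the file system already stores for a file. The function name `technical_metadata` and the dictionary keys are assumptions made for this example, not part of any repository's schema.

```python
from datetime import datetime, timezone
from pathlib import Path


def technical_metadata(path):
    """Return basic file-system metadata for a single file.

    Hypothetical helper: the key names are illustrative only.
    """
    file_path = Path(path)
    info = file_path.stat()  # size and timestamps recorded by the operating system
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "file_name": file_path.name,
        "size_bytes": info.st_size,
        "last_modified_utc": modified.isoformat(timespec="seconds"),
    }
```

Calling `technical_metadata("results.csv")` on a file of your own (the file name here is a placeholder) returns a small dictionary that you could copy into a README or a repository form.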
Some fields of research have developed specific metadata standards that set out the types of information about your data that should be documented. These metadata standards ensure you have a complete, standard set of information about each part of your data and enable your dataset to be organised with other datasets. If you’re working with large datasets, databases, or data management systems, then you should contact email@example.com for advice on metadata standards that might be appropriate for your area of research, or you can view different standards by discipline on the Digital Curation Centre’s website.
Examples of metadata standards:
Vocabularies are controlled lists of terms that can be used for describing data so that they can easily be found. Vocabularies can be on any subject. They can range from short, simple lists to very long, complex hierarchies of terms organised into tree structures.
Sometimes these more complex vocabularies are referred to as ontologies, taxonomies, or thesauri. Strictly speaking, vocabularies are simple lists of terms, whereas ontologies include the contextual relationships between the terms.
Taxonomies are ontologies that classify terms into hierarchical arrangements, and thesauri are ontologies that provide pointers to synonyms or alternative terms. However, such words for describing vocabularies are often used interchangeably.
Here, the term vocabulary will be used as an umbrella term, to cover all types of vocabularies, ontologies, taxonomies, and thesauri.
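The distinctions drawn above can be sketched with a few small Python data structures. This is a simplified illustration only: the terms are invented for the example, and real ontologies are usually published in dedicated formats rather than as Python dictionaries.

```python
# A simple vocabulary: a flat, controlled list of terms (invented examples).
vocabulary = ["animal", "mammal", "dog", "cat"]

# A taxonomy adds hierarchy: each term points to its broader (parent) term.
broader = {"mammal": "animal", "dog": "mammal", "cat": "mammal"}

# A thesaurus adds pointers from alternative terms to the preferred term.
synonyms = {"canine": "dog", "feline": "cat"}


def preferred_term(term):
    """Map a synonym to its preferred term; other terms are returned unchanged."""
    return synonyms.get(term, term)


def ancestors(term):
    """Walk up the taxonomy and list every broader term, nearest first."""
    chain = []
    while term in broader:
        term = broader[term]
        chain.append(term)
    return chain
```

For instance, `preferred_term("canine")` gives `"dog"`, and `ancestors("dog")` gives `["mammal", "animal"]`.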
Using terms from a controlled vocabulary to describe your data means that the metadata (the data about your data) you create will be more consistent and easier for other researchers to understand and find. If you use a controlled vocabulary, then you can be certain that the terms you use are not only consistent across your own dataset, but are also consistent with all the other datasets in your field that use the same vocabulary. Vocabulary terms are well documented and clearly defined, so if you use a vocabulary to describe your data, then other researchers will be able to look up what those terms mean and thereby understand them.
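To show how a controlled vocabulary keeps metadata consistent, here is a minimal Python sketch that checks a column of values against a list of permitted terms. The vocabulary `SAMPLE_TYPE_VOCAB`, its definitions, and the function `check_terms` are hypothetical and exist only for this example; in practice you would load the terms from a published vocabulary.

```python
# Hypothetical controlled vocabulary: permitted term -> documented definition.
SAMPLE_TYPE_VOCAB = {
    "soil": "Unconsolidated material collected from the land surface",
    "water": "Liquid sample collected from a water body",
    "sediment": "Material deposited at the bottom of a water body",
}


def check_terms(values, vocabulary):
    """Split values into those found in the vocabulary and those that are not.

    Matching ignores surrounding whitespace and letter case.
    """
    accepted, rejected = [], []
    for value in values:
        normalised = value.strip().lower()
        if normalised in vocabulary:
            accepted.append(value)
        else:
            rejected.append(value)
    return accepted, rejected
```

Running `check_terms(["Soil", "water ", "mud"], SAMPLE_TYPE_VOCAB)` accepts the first two values and flags `"mud"`, which you could then replace with a permitted term or query with the vocabulary's maintainers.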
Search the following registries or portals for suitable vocabularies to help you describe your data:
There are several ways of using vocabularies. For example:
If you're using a vocabulary as a reference tool, you may wish to bookmark it, or save a copy of it, so that you can browse it when you’re creating your data documentation. You can use it to look up appropriate vocabulary terms for describing your data when you need to.
RightField is a tool that lets you use an existing vocabulary to create a dropdown list in Microsoft Excel spreadsheets. The tool was developed by researchers for researchers and is free and open source.
RightField integrates with BioPortal, so vocabularies published there can be used directly. However, you can also download vocabularies discovered through other vocabulary registries and portals for use with RightField.