Data Publication: Removing identifiers from data

This guide will give you practical hints and tips to publish your data and ensure that it is findable, accessible, usable and citable. Let's publish data well!

Removing sensitive information

You might be able to strip your data of anything that makes it sensitive, such as identifying information or the location of an endangered species. Making data non-identifiable can be a time-consuming process, and you usually have to make sure that the data can’t be re-identified using other publicly available information.

When removing identifiers from human data (often referred to as de-identification) you’re usually removing or aggregating any identifying information. For human data, this may include names, dates of birth and postcodes, or physical descriptions/images that include facial features or distinctive injuries. For data that is otherwise sensitive, you may need to remove information such as the location of an endangered species or commercially sensitive information relating to an organisation.
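
As a minimal sketch of removing and aggregating identifiers, the Python below drops a name, replaces an exact age with a ten-year band and keeps only the leading characters of a postcode. A second function shows the same idea for a species location by rounding coordinates. The field names (`name`, `age`, `postcode`, `response`), the band width, the number of postcode characters kept and the rounding precision are all assumptions made for illustration, not recommendations from this guide.

```python
def deidentify_person(record, band_width=10, postcode_chars=2):
    """Return a new record with direct identifiers removed or aggregated."""
    lower = (record["age"] // band_width) * band_width
    return {
        # The name is a direct identifier, so it is not copied across at all.
        "age_band": f"{lower}-{lower + band_width - 1}",
        # Keep only the leading characters of the postcode (a coarser region).
        "postcode_area": record["postcode"][:postcode_chars],
        # Non-identifying study variables are carried over unchanged.
        "response": record["response"],
    }


def coarsen_location(latitude, longitude, decimals=1):
    """Round coordinates so a precise site cannot be read from the data."""
    return round(latitude, decimals), round(longitude, decimals)


# Hypothetical example values.
person = {"name": "Alex Example", "age": 47, "postcode": "4072", "response": "yes"}
print(deidentify_person(person))
print(coarsen_location(-27.4975, 153.0137))
```

How coarse the bands and coordinates need to be depends on the dataset; whatever is chosen still has to be checked against the risk of re-identification from other sources.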

Removing identifiers from your data is a form of risk management: while the possibility of identification may not be zero, you’ve substantially reduced the risk of people being identified.

Remember that you can choose different access conditions when publishing your data; not all published data is made openly available.

Levels of identifiability

Levels of identifiability. Derived from Understanding Patient Data (2017). Identifiability spectrum. Retrieved from https://understandingpatientdata.org.uk/what-does-anonymised-mean

Identifiable

The identity of an individual, or other sensitive information, can be reasonably discerned.

Re-identifiable

All identifiers (eg name, postcode, date of birth) are removed from the dataset, replaced with a code, or aggregated. Re-identification may be possible if a master copy of the data that contains identifiers, or a master list of study participants, is kept, or if this dataset is combined with other available datasets.
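
The sketch below illustrates one way identifiers might be replaced with a code. It swaps a name for a random participant code and returns a separate key that links codes back to names. The function name `replace_with_codes`, the `P-` code format and the record fields are hypothetical choices for this example.

```python
import secrets


def replace_with_codes(records, id_field="name"):
    """Swap an identifying field for a random code.

    Returns (coded_records, key). `key` maps each code back to the original
    identifier, which is what keeps the coded data re-identifiable.
    """
    code_for = {}  # identifier -> code, so repeat participants share one code
    used_codes = set()
    coded_records = []
    for record in records:
        identifier = record[id_field]
        if identifier not in code_for:
            code = "P-" + secrets.token_hex(4)
            while code in used_codes:  # regenerate on the rare collision
                code = "P-" + secrets.token_hex(4)
            used_codes.add(code)
            code_for[identifier] = code
        coded = {k: v for k, v in record.items() if k != id_field}
        coded["participant_code"] = code_for[identifier]
        coded_records.append(coded)
    key = {code: identifier for identifier, code in code_for.items()}
    return coded_records, key


# Hypothetical example records.
records = [
    {"name": "Alex Example", "score": 12},
    {"name": "Sam Sample", "score": 9},
    {"name": "Alex Example", "score": 15},
]
coded, key = replace_with_codes(records)
print(coded)
```

In this sketch the `key` plays the role of the master list: while it exists, the coded data remains re-identifiable, so it would need to be stored separately and securely from the coded records.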

Non-identifiable

All identifiers are removed from the data or are aggregated, or data was never collected with identifiers. Master copies of data that contain identifiers have been destroyed. An assessment of the risk of data being indirectly re-identified has been undertaken and managed.
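
One simple input to an assessment of indirect re-identification risk is to count how many records share each combination of potentially identifying values. The sketch below flags combinations that occur fewer than `k` times, in the spirit of a k-anonymity check. This technique is offered here as an illustration and is not prescribed by this guide; the field names and the threshold are assumptions.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k=5):
    """Return value combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical, already aggregated records.
records = [
    {"age_band": "40-49", "postcode_area": "40", "sex": "F"},
    {"age_band": "40-49", "postcode_area": "40", "sex": "F"},
    {"age_band": "40-49", "postcode_area": "40", "sex": "F"},
    {"age_band": "80-89", "postcode_area": "48", "sex": "M"},
]
risky = small_groups(records, ["age_band", "postcode_area", "sex"], k=3)
print(risky)
```

A combination that appears only once or twice may single out an individual when matched against other available datasets, so such records are candidates for further aggregation or removal. A count like this does not replace a full risk assessment.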

Working with and publishing sensitive data

When should you remove identifiers from data? 

Whether identifying information needs to be removed from your data depends on its sensitivity and the risk posed if someone or something is identified using the data. If you have to remove a substantial amount of information or have to extensively aggregate the data to anonymise it, the data may no longer be as useful to others. When identifiers can’t be removed from your data without loss of value, consider publishing the identifiable data via mediated access if you have consent to do so.

 

Avoid collecting identifiable data in the first place

It’s best practice to avoid obtaining any unnecessary details during the data collection process. If the information won’t add anything to your study, then don’t collect it; for example, if you need to know a participant’s age in years, then collect age, not date of birth. If you have to collect identifiable data, then try to employ data collection practices that will make removing identifying information easier.
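
Where a date of birth has already been collected, for example in an existing dataset, it can be converted to age in whole years before the data is used further. The function below is a minimal sketch of that conversion; the function name and the example dates are hypothetical.

```python
from datetime import date


def age_in_years(date_of_birth, on_date):
    """Whole years elapsed between date_of_birth and on_date."""
    had_birthday = (on_date.month, on_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return on_date.year - date_of_birth.year - (0 if had_birthday else 1)


# The day before and the day of a 44th birthday.
print(age_in_years(date(1980, 6, 15), date(2024, 6, 14)))  # 43
print(age_in_years(date(1980, 6, 15), date(2024, 6, 15)))  # 44
```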

 
When you analyse your data

If your data contains identifiers that aren’t needed for the analysis process, then you must remove them before you analyse your data to minimise the risk associated with the dataset. When you remove identifiers, encrypt a master file of the data and keep it on secure storage (eg the Research Data Store). This is particularly important if you’re using an insecure environment for your data analysis, which should be avoided if possible.

 

When you publish or share the data

Assess your data before you publish or share to identify any risks associated with releasing the dataset beyond the original research team. If your data contain identifiers that could put someone or something at risk, then identifiers must be removed prior to publishing or sharing.

Removing identifiers

How should identifiers be removed?

Make sure you document all processes that you undertake to remove identifying information. This will ensure that you know what’s been removed and enables other people to be aware of the processes and modifications that have taken place.
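
One lightweight way to keep such documentation is a structured log written alongside the data. The sketch below records each de-identification step as an entry and prints the log as JSON. The entry fields (`date`, `field`, `action`, `reason`) are an assumed layout for illustration, not a required format.

```python
import json
from datetime import date


def log_step(log, field, action, reason):
    """Append one de-identification step to the log and return the log."""
    log.append(
        {
            "date": date.today().isoformat(),
            "field": field,
            "action": action,
            "reason": reason,
        }
    )
    return log


# Hypothetical steps applied to a dataset.
log = []
log_step(log, "name", "removed", "direct identifier")
log_step(log, "date_of_birth", "replaced with age in years", "exact date not needed")
print(json.dumps(log, indent=2))
```

A log like this can be shared with the de-identified dataset so that other people can see which modifications have taken place.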

How you go about removing identifiers from your data depends on the type of data you have. There are some tools available to assist you with removing identifiers; however, depending on your dataset format, you may have to do it manually. The UK Data Service provides a useful step-by-step overview for removing identifiers manually.

 

Useful tools for removing identifiers

Tool | Can be used for | More information
The Cornell Anonymization Toolkit (CAT) | Tabular data | Interactive Anonymization of Sensitive Data
DICOM cleaner | Medical Images in DICOM (Digital Imaging and Communications in Medicine) format | Online help available
The University of Texas at Dallas Anonymisation tool | Unstructured text files | Contact details for Data and Privacy Lab, The University of Texas at Dallas
De-identification software (deid) | Free text in medical records | Automated De-Identification of Free-Text Medical Records
The University of Essex Text Anonymisation Helper Tool | Text in Microsoft Word documents | Instructions available when tool is downloaded
sdcMicro: Statistical Disclosure Control Methods for Anonymization of Microdata and Risk Estimation | The generation of anonymized (micro) data | Issues register on GitHub
IQDA Qualitative Data Anonymizer tool | Text files | Online instruction manual

Resources

De-identification by the Australian National Data Service

De-identification and the Privacy Act, Office of the Australian Information Commissioner, Australian government

Identifiability Demystified by Understanding Patient Data