When removing identifiers from data (often referred to as de-identification), you’re usually removing or aggregating any identifying information. For human data, this may include names, dates of birth and postcodes, or physical descriptions or images that include facial features or distinctive injuries. For data that is otherwise sensitive, you may need to remove information such as the location of an endangered species or commercially sensitive information relating to an organisation.
Removing identifiers from your data is a form of risk management; while the possibility of identification may not be zero, you’ve substantially reduced the risk of people being identified.
Remember that you can choose different access conditions when publishing your data; not all published data is made openly available.
Derived from Understanding Patient Data (2017). Identifiability spectrum. Retrieved from https://understandingpatientdata.org.uk/what-does-anonymised-mean
The identity of an individual, or other sensitive information, can be reasonably discerned.
All identifiers are removed from the dataset (eg name, postcode, date of birth), replaced with a code, or are aggregated. Re-identification may be possible if a master copy of the data that contains identifiers or a master list of study participants is kept, or through the combination of this dataset with other available datasets.
All identifiers are removed from the data or are aggregated, or data was never collected with identifiers. Master copies of data that contain identifiers have been destroyed. An assessment of the risk of data being indirectly re-identified has been undertaken and managed.
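One simple way to begin assessing the risk of indirect re-identification is to count how many records share each combination of indirectly identifying values. The sketch below is illustrative only: the field names (`age_band`, `postcode_area`, `condition`) and the records are hypothetical, and a small group size is just one indicator of risk, not a complete assessment.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same combination of values for the given fields."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Hypothetical records with direct identifiers already removed.
records = [
    {"age_band": "30-39", "postcode_area": "40", "condition": "asthma"},
    {"age_band": "30-39", "postcode_area": "40", "condition": "diabetes"},
    {"age_band": "40-49", "postcode_area": "41", "condition": "asthma"},
]

k = smallest_group_size(records, ["age_band", "postcode_area"])
# A result of 1 means at least one record is unique on these fields
# and may need further aggregation before release.
print(k)
```

If the smallest group is very small, consider aggregating the relevant fields further (for example, wider age bands) and running the check again.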
Whether identifying information needs to be removed from your data depends on its sensitivity and the risk posed if someone or something is identified using the data. If you have to remove a substantial amount of information or have to extensively aggregate the data to anonymise it, the data may no longer be as useful to others. When identifiers can’t be removed from your data without loss of value, consider publishing the identifiable data via mediated access if you have consent to do so.
It’s best practice to avoid obtaining any unnecessary details during the data collection process. If the information won’t add anything to your study, then don’t collect it; for example, if you need to know a participant’s age in years, then collect age, not date of birth. If you have to collect identifiable data, then try to employ data collection practices that will make removing identifying information easier.
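If a date of birth has already been collected, one option is to convert it to an age in years, or to an age band, and then discard the original value. The following is a minimal sketch under that assumption; the function names and the ten-year band width are illustrative choices, not a prescribed method.

```python
from datetime import date


def age_in_years(date_of_birth, on_date):
    """Return completed years of age on a given date."""
    had_birthday = (on_date.month, on_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return on_date.year - date_of_birth.year - (0 if had_birthday else 1)


def age_band(age, width=10):
    """Aggregate an age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Hypothetical participant: keep only the derived value, not the date of birth.
age = age_in_years(date(1990, 6, 15), on_date=date(2024, 6, 14))
print(age, age_band(age))
```

Storing the band in place of the exact age trades some precision for a lower risk of identification, so choose the width according to what your analysis actually needs.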
If your data contains identifiers that aren’t needed for the analysis process, then you must remove them before you analyse your data to minimise the risk associated with the dataset. When you remove identifiers, a master file of the data should be encrypted and stored on secure storage (eg the Research Data Store). This is particularly important if you’re analysing your data in an unsecured environment, although such environments should be avoided if possible.
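As a rough illustration of separating identifiers from the data you analyse, the sketch below replaces the identifying fields in each record with a random code and keeps the link between code and identifiers in a separate master list. The column names in `DIRECT_IDENTIFIERS` and the sample record are hypothetical. Encrypting the master list and moving it to secure storage are not shown and would still need to be done separately.

```python
import secrets

# Hypothetical column names; adjust to match your own dataset.
DIRECT_IDENTIFIERS = ["name", "date_of_birth", "postcode"]


def split_identifiers(records, identifier_fields):
    """Return (master, analysis): the master list links each random code
    to the identifiers, and the analysis list holds the remaining fields."""
    master, analysis = [], []
    for record in records:
        code = secrets.token_hex(8)  # random code, not derived from the data
        master.append(
            {"code": code, **{f: record[f] for f in identifier_fields}}
        )
        analysis.append(
            {
                "code": code,
                **{k: v for k, v in record.items() if k not in identifier_fields},
            }
        )
    return master, analysis


participants = [
    {"name": "A. Example", "date_of_birth": "1990-06-15",
     "postcode": "4000", "score": 17},
]
master, analysis = split_identifiers(participants, DIRECT_IDENTIFIERS)
print(analysis)  # contains only the code and the non-identifying fields
```

Using a random code instead of one derived from the identifiers (such as initials or a hash of the name) means the code itself reveals nothing if the analysis file is exposed.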
Assess your data before you publish or share it to identify any risks associated with releasing the dataset beyond the original research team. If your data contains identifiers that could put someone or something at risk, then those identifiers must be removed prior to publishing or sharing.
Make sure you document all processes that you undertake to remove identifying information. This will ensure that you know what’s been removed and enables other people to be aware of the processes and modifications that have taken place.
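One lightweight way to keep this documentation is a structured log that records each change as it is made. The sketch below is only an example of the idea; the `log_step` helper, its fields and the sample entries are hypothetical, and a plain text or spreadsheet record would serve equally well.

```python
import json
from datetime import date


def log_step(log, field, action, reason):
    """Append one de-identification step to the log."""
    log.append(
        {
            "date": date.today().isoformat(),
            "field": field,
            "action": action,
            "reason": reason,
        }
    )


log = []
log_step(log, "name", "removed", "direct identifier")
log_step(log, "date_of_birth", "replaced with age band",
         "exact date not needed for analysis")

# The log can be saved alongside the dataset, for example as a JSON file.
print(json.dumps(log, indent=2))
```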
How you go about removing identifiers from your data depends on the type of data you have. There are some tools available to assist you with removing identifiers; however, depending on your dataset format, you may have to do it manually. The UK Data Service provides a useful step-by-step overview for removing identifiers manually.
|Tool|Can be used for|More information|
|---|---|---|
|The Cornell Anonymization Toolkit (CAT)|Tabular data|Interactive Anonymization of Sensitive Data|
|DICOM cleaner|Medical images in DICOM (Digital Imaging and Communications in Medicine) format|Online help available|
|The University of Texas at Dallas Anonymisation tool|Unstructured text files|Contact details for Data and Privacy Lab, The University of Texas at Dallas|
|De-identification software (deid)|Free text in medical records|Automated De-Identification of Free-Text Medical Records|
|The University of Essex Text Anonymisation Helper Tool|Text in Microsoft Word documents|Instructions available when tool is downloaded|
|sdcMicro: Statistical Disclosure Control Methods for Anonymization of Microdata and Risk Estimation|The generation of anonymized (micro) data|Issues register on GitHub|
|IQDA Qualitative Data Anonymizer tool|Text files|Online instruction manual|
Process for Publishing Sensitive Unit Record Level Public Data as Open Data, Australian Government
De-identification by the Australian National Data Service
De-identification and the Privacy Act, Office of the Australian Information Commissioner, Australian Government
Identifiability Demystified by Understanding Patient Data
The De-Identification Decision-making Framework by Data61 and CSIRO