De-Identification Concepts

Protecting Patient Data

One of the most important issues in healthcare IT is the protection of patient data. Regulation addresses patient privacy and the use of health information in many countries. In the US, HIPAA regulates the use of PHI (protected health information).

While protecting patient data, HL7 analysts need to share or redistribute HL7 production data for such purposes as porting realistic data to a test system or staging area, providing realistic sample HL7 messages for interface scoping, and providing data for clinical and financial analytics.

The Department of Health and Human Services (HHS) provides a HIPAA Privacy Rule booklet (PDF) that highlights the 18 criteria that can be used to identify patients. All 18 identifiers are categories of data that must be protected. Besides easily recognized personal information, care must be given to protect device identifiers and even IP addresses. De-identification techniques must cover all 18 identifiers.


De-identification or Anonymization

This term refers to removing or masking protected information. The de-identification removes identifiers from a data set so that information can no longer be linked to a specific individual. In terms of health care information, all identifiers are removed from the information set including both personally identifiable information (PII) and protected health information (PHI).


As a subset of de-identification, pseudonymization replaces data elements with new identifiers. After that substitution, the initial subject cannot be associated with the data set. In terms of health care information, patient information can be pseudonymized by replacing patient-identifying data with completely unrelated data resulting in a new patient profile. The data appears complete and the data context is preserved while patient information is completely protected


A pseudonymized data set can be restored to its original state through re-identification. In re-identifying data, a reverse mapping structure (constructed as the data was pseudonymized) is applied. As an example, a pseudonmymized data set could be sent for processing to an external system. Once that processed information is returned, the data could be re-identified and pushed to the correct patient file.


Identifiers are data elements that can directly identify individuals.This includes name, email address, telephone address, home address, social security number, medical card number, among others. Two identifiers may be needed to identify a unique individual.


Data elements of this type do not directly identify an individual but may provide enough information to narrow the potential of identifying a specific individual. Genders, date of birth and zip/postal code have been studied extensively in this context. There is a dependent relationship between quasi-identifiers and the type of data set of which they are a part. As an example, if all members of a data set are male, gender cannot be a meaningful quasi-identifier. In addition, quasi-identifiers are categorical in nature with a finite set of discrete values. It’s relatively easy to search for individuals using quasi-identifiers.


Non-identifiers may contain an individual’s personal information but aren’t helpful in reconstructing the initial information. For example, an indicator of an allergy to pollen would be a non-identifying data element. The incidence of such an allergy is extremely high in the general population. Therefore this factor is not a good discriminator among individuals. Again, non-identifiers are dependent on data sets. In the right context, they may be used to identify an individual.