De-identifying Patient Data, Part 2

Definitions

We’re continuing with our series of posts on patient data de-identification. This week, we’re reviewing a set of definitions of common terms. This list will become the glossary for upcoming posts on HL7 de-identification and protecting sensitive healthcare data. We’re looking for feedback on this list. Feel free to add your nuances and/or related terms in the comments…

De-Identification or Anonymization

An umbrella term for removing or masking protected information. In a more specific sense, the de-identification process removes identifiers from a data set so that it’s no longer possible to relate information back to individuals. In the context of healthcare information, de-identification occurs when all identifiers (IDs, names, addresses, phone numbers, etc.; see our previous HL7 de-identification post for a complete list) are removed from the information set. This way, patient identity is protected while most of the data remains available for sharing with other people or organizations, statistical analysis, or related uses.
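
As a rough illustration of identifier removal, here is a minimal Python sketch. The field names and the sample record are hypothetical, chosen only for this example; a real identifier list depends on your data set and on the HIPAA list from the previous post.

```python
# Hypothetical field names; the real identifier list depends on your data set.
IDENTIFIER_FIELDS = {"patient_id", "name", "address", "phone"}


def de_identify(record: dict) -> dict:
    """Return a copy of the record with all identifier fields removed."""
    return {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}


# Made-up sample record, for illustration only.
record = {
    "patient_id": "12345",
    "name": "John Smith",
    "address": "1 Main St",
    "phone": "555-0100",
    "diagnosis": "asthma",
    "gender": "M",
}

clean = de_identify(record)
print(clean)  # {'diagnosis': 'asthma', 'gender': 'M'}
```

Note that the remaining fields (here, gender) may still act as quasi-identifiers, so removing direct identifiers is only the first step.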

Pseudonymization

A subset of anonymization. This process replaces identifying data elements with new identifiers, so that the link to the initial subject is replaced by a link to a completely new subject. After the substitution, it is no longer possible to associate the initial subject with the data set from the data alone. In the context of healthcare information, we can “pseudonymize” patient information by replacing patient-identifying data with completely unrelated data. The result is a new patient profile. The data continues to look complete and the data semantics (the meaning of the data) are preserved, while patient information remains protected.
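
A minimal Python sketch of this substitution might look like the following. The function name `pseudonymize`, the field names, and the token format are all assumptions made for illustration; the only idea taken from the definition is that each identifying value is swapped for an unrelated replacement while the rest of the record is left alone.

```python
import secrets


def pseudonymize(record: dict, fields: list, mapping: dict) -> dict:
    """Replace each listed identifier value with a random, unrelated token.

    `mapping` remembers the token chosen for each (field, original value),
    so the same patient receives the same pseudonym across records.
    """
    out = dict(record)
    for field in fields:
        if field not in out:
            continue
        key = (field, out[field])
        if key not in mapping:
            # Random token that carries no information about the original value.
            mapping[key] = f"{field}-{secrets.token_hex(8)}"
        out[field] = mapping[key]
    return out


# Made-up sample data, for illustration only.
mapping = {}
visit_1 = {"name": "John Smith", "phone": "555-0100", "diagnosis": "asthma"}
visit_2 = {"name": "John Smith", "phone": "555-0100", "diagnosis": "eczema"}

masked_1 = pseudonymize(visit_1, ["name", "phone"], mapping)
masked_2 = pseudonymize(visit_2, ["name", "phone"], mapping)
```

Both masked records carry the same pseudonymous name and phone number, so they still read as one coherent patient profile, and the clinical fields are untouched.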

Re-Identification

This process restores the initial information to a pseudonymized data set. To re-identify data, you would need to use a series of reverse mapping structures constructed as the data is pseudonymized. There are a few use cases for re-identification. One example would be to send the pseudonymized data to an external system for processing. Once the processed information is returned, it would be re-identified and pushed to the right patient file.
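
The round trip described above can be sketched in Python as follows. The `Pseudonymizer` class, its method names, and the sequential token format are hypothetical and exist only to show how a reverse mapping built during pseudonymization makes re-identification possible.

```python
import itertools


class Pseudonymizer:
    """Illustrative only: keeps forward and reverse maps in memory."""

    def __init__(self):
        self._forward = {}   # (field, original value) -> token
        self._reverse = {}   # (field, token) -> original value
        self._counter = itertools.count(1)

    def pseudonymize(self, record: dict, fields: list) -> dict:
        out = dict(record)
        for field in fields:
            if field not in out:
                continue
            key = (field, out[field])
            if key not in self._forward:
                # Sequential tokens keep the example readable; a real
                # system would likely prefer random tokens.
                token = f"{field.upper()}-{next(self._counter):04d}"
                self._forward[key] = token
                self._reverse[(field, token)] = out[field]
            out[field] = self._forward[key]
        return out

    def re_identify(self, record: dict, fields: list) -> dict:
        out = dict(record)
        for field in fields:
            key = (field, out.get(field))
            if key in self._reverse:
                out[field] = self._reverse[key]
        return out


# Made-up sample data, for illustration only.
p = Pseudonymizer()
original = {"name": "John Smith", "phone": "555-0100", "lab_result": 7.2}

masked = p.pseudonymize(original, ["name", "phone"])
# Simulate an external system adding a result to the pseudonymized record.
processed = dict(masked, flag="high")
restored = p.re_identify(processed, ["name", "phone"])
```

In this sketch the reverse map is the sensitive part: anyone holding it can undo the pseudonymization, so it would need to stay inside the originating system.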

Identifiers

Identifiers are data elements that can directly identify individuals. Examples of identifiers include, but are not limited to, name, email address, telephone number, home address, social security number, and medical card number (see our previous post for a complete list of HIPAA identifiers). In some cases, more than one identifying variable is needed to identify an individual uniquely. For example, the name “John Smith” appears multiple times in the White Pages, so you need to combine the name with a telephone number to identify the right John Smith.

Quasi-identifiers

These are data elements that do not directly identify an individual, but that provide enough information to significantly narrow the search for a specific individual. Some quasi-identifiers have been studied extensively. These include gender, date of birth, and zip/postal code. Quasi-identifiers are highly dependent on the type of data set. For example, gender will not be a meaningful quasi-identifier if all of the individuals are female. Another interesting thing about quasi-identifiers: they are categorical in nature, with a finite set of discrete values. In other words, gender, birth dates over a period of less than 150 years, and addresses each have a finite number of possible values. This makes searches simple, and individuals are relatively easy to pinpoint using quasi-identifiers.
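
To make the narrowing effect concrete, here is a small Python sketch that counts how many records share each combination of quasi-identifier values. The helper `group_sizes` and the sample records are made up for this example; the point is only that a combination matching a single record pinpoints one individual.

```python
from collections import Counter


def group_sizes(records: list, quasi_identifiers: list) -> Counter:
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


# Made-up sample data, for illustration only.
records = [
    {"gender": "F", "birth_year": 1980, "zip": "02139"},
    {"gender": "F", "birth_year": 1980, "zip": "02139"},
    {"gender": "F", "birth_year": 1975, "zip": "02139"},
    {"gender": "F", "birth_year": 1980, "zip": "94105"},
]

# Gender alone does not narrow anything here: every record is female.
by_gender = group_sizes(records, ["gender"])

# Combining all three quasi-identifiers isolates some individuals.
by_all = group_sizes(records, ["gender", "birth_year", "zip"])
unique = [combo for combo, n in by_all.items() if n == 1]
```

Here `by_gender` contains a single group of four records, while `unique` lists two combinations that each match exactly one record.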

Non-identifiers

These data elements may contain personal information on individuals, but they aren’t helpful for reconstructing the initial information. For example, an indicator on whether an individual has pollen allergies would most likely be a non-identifying data element. The incidence of pollen allergy is so high in the population that it would not be a good discriminator among individuals. Again, non-identifier data elements are dependent on data sets. In a different context, this data element might enable you to identify individuals.

What other data de-identification terms should we define?