What is De-identification?

De-identification is a technique to mask or remove protected health information (PHI) from sensitive patient data. If you need to use production HL7 data, you’ll need to find a way to protect PHI. De-identification is a good choice when you need to:

  • Troubleshoot an interface
  • Populate a test system
  • Gather data for analytics

Why de-identify?

There are a number of reasons you should consider de-identification.

1. Leverage the richness of production data

When you test a system or an interface, you’ll cover the most realistic test scenarios with realistic data. What is “realistic data”? Customized fields, Z-segments, the code sets that your health system (or your customer) really uses. De-identifying production data will get you realistic data. Learn more about HL7 testing in Chapter 9 of the HL7 Survival Guide.

2. Support HIPAA, protect patient identifiers

Under HIPAA, you need to protect 18 types of information that could potentially identify a patient. These cover a range of items: names, locations, phone numbers, social security numbers, even medical device numbers. Before you use the data, make sure these identifiers are protected. De-identification will remove the data for you and, in some cases, replace it with realistic dummy data.

3. Safeguard data

Even if you sign and enforce Business Associate agreements, data in transit (via email or on a laptop) is at risk. Reduce the risk by de-identifying data within the Covered Entity system before moving it to the Business Associate (BA).

How to de-identify HL7 data

You have some options when it comes to de-identifying data.

Manual removal
Some analysts de-identify data manually by loading messages in a text editor then scanning, removing, and replacing data in HL7 fields, components, and sub-components. This works if you’ve got a small batch to process – say 10 messages or fewer – and you can get someone to check over your work to make sure you haven’t missed a data element.

Some developers write custom scripts to deal with a batch of messages. If you have the time and the skills, this can be a good option; however, when you take into consideration the time to both write and test the script for a complex de-identification, this option becomes expensive.
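To give a sense of what such a script involves, here is a minimal Python sketch. It assumes pipe-delimited HL7 v2 messages with the default delimiters and carriage-return segment separators. The RULES table, the field choices, and the replacement values are hypothetical and far from a complete HIPAA rule set: dates, IDs, and Z-segments are not handled here.

```python
# Hypothetical rule table: segment name -> {field number: replacement value}.
# Illustrative only; a real rule set must cover all 18 HIPAA identifier types.
RULES = {
    "PID": {5: "DOE^JANE", 11: "1 MAIN ST^^ANYTOWN^VA^00000", 13: "555-0100", 19: "000-00-0000"},
    "NK1": {2: "DOE^JOHN"},
}

def deidentify_message(message, rules=RULES):
    """Replace the configured fields in one pipe-delimited HL7 v2 message."""
    cleaned_segments = []
    for segment in message.split("\r"):
        fields = segment.split("|")
        # fields[0] is the segment name, so fields[n] is field n for non-MSH segments.
        for number, replacement in rules.get(fields[0], {}).items():
            if number < len(fields) and fields[number]:
                fields[number] = replacement
        cleaned_segments.append("|".join(fields))
    return "\r".join(cleaned_segments)

# Made-up sample message, not production data.
sample = "\r".join([
    "MSH|^~\\&|ADT|HOSP|LAB|HOSP|202401010830||ADT^A01|123|P|2.4",
    "PID|1||MRN001||SMITH^JOHN||19400215|M|||123 EASY ST^^Arlington^VA^22207||703-555-1212",
    "NK1|1|SMITH^MARY|SPO",
])
clean = deidentify_message(sample)
```

Even this small example hints at the testing effort: fixed replacement values turn every patient into the same person, and each new message type or custom segment means another rule to write and verify.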

Finally, there are teams that use de-identification software to get the job done. If you’re shopping for software, look for these capabilities:

  • Easy to use – if it takes a week to set up and train on the de-identification, that’s going to slow you down. Keep looking.
  • Ability to remove data and also replace it with random yet realistic data.
  • The data replacement should be flexible. For instance, if you’re replacing dates of birth in data for patients over 65, make sure the replacement data generated by the software doesn’t turn them into NICU patients.
  • Tracking: keep track of what you’ve de-identified. You’ll want to keep a record of your de-identifications: which fields and segments were affected, when the de-identification took place, who did it, etc.
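As a rough illustration of the last two capabilities, the Python sketch below shifts a date of birth by a bounded random amount, so the patient stays in roughly the same age band, and builds a simple tracking record. The function names, the 180-day window, and the record layout are assumptions made for this example. They do not describe how any particular product works.

```python
import random
from datetime import date, datetime, timedelta

def replace_birth_date(dob, rng, max_shift_days=180, not_after=None):
    """Shift an HL7 date of birth (YYYYMMDD) by a bounded random number of days."""
    original = datetime.strptime(dob, "%Y%m%d").date()
    shifted = original + timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    # Never return a date of birth later than the cutoff (today by default).
    latest = not_after or date.today()
    return min(shifted, latest).strftime("%Y%m%d")

def audit_record(field, rule, user, when):
    """Build one log entry describing a de-identification step (hypothetical layout)."""
    return {"field": field, "rule": rule, "user": user, "timestamp": when.isoformat()}

rng = random.Random(42)  # seeded only so the example is repeatable
new_dob = replace_birth_date("19400215", rng)
entry = audit_record("PID-7", "shift date of birth by up to 180 days",
                     "analyst1", datetime(2024, 1, 1, 9, 0))
```

A bounded shift keeps an 80-year-old from becoming a newborn, although a patient close to an age threshold can still cross it, so check the window against how you plan to use the data.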

De-identification Software Free Trial

De-identify HL7 data in minutes. Download a Cloak software trial now.

Point-and-Click Interface Engine Migration

Announcing Caristix Converters

We gave you a sneak peek at our new Caristix technology, converters, in a previous blog post, Cutting Interface Costs, Continued. In that post, we discussed a big area where interface costs are extremely high: migrating from a legacy interface engine to a newer engine technology.

We had several customers tell us that our connectors were saving them a lot of programming time. But was there a way to use the connectors to migrate their interfaces? Now, there is. The Caristix converter reads the interface configuration in one engine and outputs it in the new engine’s format, transforming all interface attributes.

We put out a press release today announcing our partnership with iNTERFACEWARE, a leading integration engine provider, to support the development and production of the first-ever interface converter. This new Caristix technology, which is now available as a beta release, allows users to automatically migrate interfaces from the Mirth Connect interface engine to iNTERFACEWARE’s Iguana integration engine. You can read the entire press release here.

To see the Caristix converter in action, visit: https://www.caristix.com/blog/2014/01/cutting-interface-costs-cont/

Contact us at info@caristix.com to learn more about the prototype and ask about joining a beta.

HIMSS 2014

Both iNTERFACEWARE and Caristix will have representatives at this year’s annual HIMSS conference in Orlando (Booth: #2229) where they will be discussing their respective integration solutions.

If you’d like to meet us at HIMSS, please contact us to set up an appointment:
Toll-free: 877.872.0027

Additional Release Notes: Caristix v2.8

Information for Cloak and Workgroup Users

With our recent v2.8 update, we made the default de-identification rules more robust. You’ll need to add fewer rules manually. But this means your default de-id rules are going to look different than previous versions. Here are some answers to some of the questions you might have.

What’s the change?

Unlike older versions, there is no single default de-id rule file. Default de-id rules are now HL7-version-based and will adjust to whatever HL7 v2.x version you pick as a reference profile.

How does this impact my de-identification work?

Your choice of reference profile has a bigger impact on your de-identified messages. Ensure you match your reference profile closely to the messages you want to de-identify. An example: if you have a batch of v2.4 messages, pick v2.4 as your reference profile.

Why this change?

When we first wrote the software, we created a default de-id rule file based on HL7 v2.6 to cover all HL7 versions. But as we heard back from our users, they were looking to avoid manually inserting de-id rules. So we extended the rules algorithm to cover additional ID and name locations in messages.

What happened under the hood?

We focused the default rules on data types, not fields. Data type is always impacted by the reference profile.

An example: let’s say your reference profile contains the XPN data type. In this profile, PID-2 and PID-4 contain XPNs.  We change the reference profile, and now PID-3 and PID-5 contain XPNs. With previous Caristix software versions, you would have to change your de-id rules to accommodate this difference. With Caristix version 2.8, you’d be good to go. You’ll have less manual work with de-identification rules with our latest update.
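The idea of keying rules to data types rather than fields can be sketched in a few lines of Python. This is only an illustration of the concept, not the actual Caristix implementation. The two profiles mirror the hypothetical example above rather than real HL7 definitions, and all names are made up.

```python
# Two hypothetical reference profiles: field -> data type.
PROFILE_A = {"PID-2": "XPN", "PID-4": "XPN", "PID-7": "TS"}
PROFILE_B = {"PID-3": "XPN", "PID-5": "XPN", "PID-7": "TS"}

# One rule per data type instead of one rule per field (illustrative rule text).
DATA_TYPE_RULES = {"XPN": "replace with generated name"}

def fields_to_deidentify(profile, rules=DATA_TYPE_RULES):
    """Resolve data-type rules to the concrete fields of a given profile."""
    return {field: rules[data_type]
            for field, data_type in profile.items()
            if data_type in rules}

targets_a = fields_to_deidentify(PROFILE_A)  # name rule lands on PID-2 and PID-4
targets_b = fields_to_deidentify(PROFILE_B)  # same rule now lands on PID-3 and PID-5
```

The rule set stays the same; only the profile decides which fields it touches. That is also why a mismatched reference profile can point the rules at the wrong fields.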

How to keep your old rules

If you have a set of rules, they’ll still be valid. Keep using them. Just keep in mind they’ll apply to different fields if you change reference profiles. So ensure that your reference profiles closely match your messages.

How to create new de-id rules files

You’ll create them like you did before.


Contact us at support@caristix.com.

How to Create HL7 Test Messages and Logs

Over the years, most hospital IT teams have developed their own HL7 test messages and logs, which they use over and over again for system testing and interface validation. These logs may not be 100% accurate for the task at hand but hey, they’re good enough, right?

Not really.

“Good-enough” logs don’t contain the latest lab codes. “Good-enough” logs with just 10 or 20 or even 50 patients don’t contain the volume you need for load and performance testing. “Good-enough” logs miss out on message workflow problems that can bring down interfaces.

What hospital IT teams and their vendor partners need are better test logs. Test messages and logs need to reflect a hospital’s IT environment: their own ADT message flow, their specific lab codes, and their case mix.

There’s a way to generate test logs quickly and effectively: use production data and remove the HIPAA identifiers.

By de-identifying production data, you get test messages that are 100% representative of the hospital environment (because you’ve just done a “hot” extraction).

When you de-identify messages, here is a list of capabilities you want to ensure you have:

  • First and foremost, be absolutely sure to remove the 18 identifiers designated by HIPAA as protected health information (PHI).
  • Keep the message flow. If “John Doe” in your production data becomes “Michael Smith” in your test log, ensure that Michael Smith in your A01 admission message is the same Michael Smith upon discharge.
  • Cover data in z-segments. PHI can hide in z-segments.
  • Log volume. Have at least a week’s worth of messages. A few months would be even better. One HIT vendor we worked with de-identified 12GB of data, which represented 3 months of hospital data, for their development environment.
  • Traceability. Keep records of which data was de-identified and which fields and data types were transformed.
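The "keep the message flow" point above comes down to remembering which replacement each patient received. Here is a minimal Python sketch of that idea. The NameMapper class and the sample names are hypothetical, and it assumes you can key on something stable such as the original name or patient ID.

```python
import itertools

class NameMapper:
    """Hand out replacement names, reusing the same one for the same patient."""

    def __init__(self, replacement_names):
        # cycle() repeats names once the pool runs out; use a large pool in practice.
        self._pool = itertools.cycle(replacement_names)
        self._assigned = {}

    def replace(self, patient_key):
        if patient_key not in self._assigned:
            self._assigned[patient_key] = next(self._pool)
        return self._assigned[patient_key]

mapper = NameMapper(["SMITH^MICHAEL", "JONES^ANNA"])
admit_name = mapper.replace("DOE^JOHN")      # name used in the A01 admission
discharge_name = mapper.replace("DOE^JOHN")  # same patient at discharge
other_name = mapper.replace("ROE^JANE")      # a different patient
```

Because the lookup table links real and replacement values, treat it as sensitive: discard it or protect it once the run is finished.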

How have you dealt with HL7 test messages? Let us know in the comments.

How to Change HL7 Segment and Field Definitions in Caristix Cloak

One of our Cloak customers is de-identifying close to 14 GB of clinical data coming from several healthcare information systems (including 2 ADTs and a lab system) at an IDN. This customer is asking some great questions that would help other Cloak users get more out of the software. Here’s an excerpt from our conversations.

The NK1 segment is giving me trouble. Specifically, field 5, the address. I created a sample message with this as the NK1.5 content:

123 EASY ST^Arlington^VA^22207

The NK1 segment is listed as NK1.5.2 as being “other designation”, not the city, thus throwing off my address conversion. I have no means to identify subcomponent 2 as the city, I’m “stuck” with it being “other designation.”

It looks like the NK1 segment in the logs doesn’t follow the standard… (surprise, surprise;). In fact, based on the HL7 standard, the address would be stored in NK1.4 and city in NK1.4.3. It appears to be a naming issue within the data. You can modify the HL7 profile/specification that Cloak uses so the HL7 reference profile represents the data you’re working with (as opposed to trying to conform to the official HL7 specification). In other words, you can change the specification to remove the “other designation” field in the HL7 profile.

To do this, you would need either Caristix Conformance or Reader software. Reader is a free download available here.

Here’s the procedure:
1. Open Conformance or Reader.
2. Make a copy of “HL7 v2.6” profile in “New Folder”.
3. Rename the profile to something that makes sense to you.
4. Browse to the NK1 segment and expand it.
5. Browse to NK1.5 and expand it.
6. Delete the “other designation” field.
7. Save the profile.

8. Go back to Cloak.
9. From the menu bar, go to Tools, Options, Reference Profile.
10. In the list of profiles, select the profile you just modified.
11. Click OK. The NK1.5.2 field name is now city.
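To see why removing "other designation" fixes the labeling, here is a small Python sketch that assigns component names to the customer’s sample value by position. The component names follow the first five components of the HL7 extended address (XAD) data type. The helper function is hypothetical and only illustrates the positional logic, not how Cloak works internally.

```python
# First five components of the HL7 extended address (XAD) data type.
STANDARD_XAD = ["street address", "other designation", "city",
                "state or province", "zip or postal code"]
# The edited profile: the same list without "other designation".
MODIFIED_XAD = [name for name in STANDARD_XAD if name != "other designation"]

def label_components(value, component_names):
    """Pair each ^-separated component with a name, purely by position."""
    return dict(zip(component_names, value.split("^")))

address = "123 EASY ST^Arlington^VA^22207"
by_standard = label_components(address, STANDARD_XAD)
by_modified = label_components(address, MODIFIED_XAD)
```

With the standard names, "Arlington" lands in "other designation" and every later component is off by one. With the edited profile, city, state, and zip line up with the data.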

Vary De-identified Names Across Clinical Data Using Caristix Cloak

One of our Cloak customers is de-identifying several GB of clinical data coming from several healthcare information systems (including 2 ADTs and a lab system) at an IDN. This customer is asking some great questions that would help other Cloak users get more out of the software. Here’s an excerpt from our conversations. We’ll be posting new Q&As in the coming weeks.

Is there a means by which the names in a message can be de-identified, i.e. patient, physician, etc., without it being the SAME across the message? For example, using the names.xls spreadsheet (which is such a time-saver… oh man!), I’ve replaced the patient and the caregiver names. However, in my PV1 segment, I’m finding that all the caregivers are the same name as the patient.

You can use Excel files in Cloak to generate replacement data. For instance, I might have an Excel file listing cities and zip codes. Cloak will manage de-identification so that when a replacement zip code is chosen at random, you get a city associated with the zip code. That way, the data still make sense. The same technique lets you build an Excel file with names and genders, so that Cloak provides female first names to female patients.

If you use the same Excel column to cover several fields, the same row (so, in this case, the same value) will be used.

To get a different name, you can do one of two things:
1. Open the Excel file and add a column with names (such as physician names, for instance). This way the patient will have the physician listed on that current row.
2. Copy the Excel file; change the copied file so you have two different files for patient names and physician names. This way the association between patient and physician is going to be random once you set the de-identification generator type parameters.
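The difference between the two options can be sketched in Python, with plain lists standing in for the spreadsheet files. The names and helper functions are made up for illustration; this is not Cloak’s implementation.

```python
import random

# Stand-in for one spreadsheet with a patient column and a physician column.
ROWS = [
    {"patient": "SMITH^MICHAEL", "physician": "LEE^SARAH"},
    {"patient": "JONES^ANNA", "physician": "PATEL^RAVI"},
]
# Stand-in for a second, separate file of physician names.
PHYSICIANS = ["LEE^SARAH", "PATEL^RAVI", "GARCIA^LUIS"]

def same_row(rng):
    """Option 1: one file, two columns. The physician comes from the patient's row."""
    row = rng.choice(ROWS)
    return row["patient"], row["physician"]

def separate_files(rng):
    """Option 2: two files. Patient and physician are picked independently."""
    return rng.choice(ROWS)["patient"], rng.choice(PHYSICIANS)

rng = random.Random(1)  # seeded only so the example is repeatable
paired = same_row(rng)
independent = separate_files(rng)
```

Option 1 keeps a fixed patient-physician pairing; option 2 breaks the link, so any physician can appear with any patient.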

Read more about using Excel files to generate replacement data in Cloak.

Protecting Patient Data in HL7 Logs

Information Week ran an article this week on protecting patient data. The article wasn’t on one of the usual suspects — a HIPAA violation or a breach in a production system. Instead, this was notable because we’re finally seeing one of the hidden dangers in healthcare IT coming to light: unsecured patient data sitting in development and test systems. Our industry needs to start addressing this issue.

Information Week cited a survey where 51% of organizations don’t protect patient data used in software development and testing. Yet the per victim cost of data breaches in healthcare is $294, 44% higher than other industries. Read more on the Information Week site.

It’s a given: vendors and providers have to use real-world data to test applications, systems, HL7 interfaces, and connectivity. There’s no getting around that. Without test data — accurate, realistic test data — you don’t want to go ahead with the product launch, system go-live, or integration engine migration. There’d be too much at stake. Without a reasonable volume of real-world test data, you end up testing for too-good-to-be-true workflows and patients. The result is way too many bugs, enhancement requests, delayed projects, and (rightfully) irate clinicians down the road.

So oftentimes, the solution to robust testing is to copy over production data into the test system. There are times when providers end up sharing production data — for instance, HL7 message logs — with vendors. Under certain circumstances, these approaches can be fraught with security issues. So vendors and providers need to ensure they’re working within regulatory frameworks when they use production data for development and testing.

But instead of clamping down and setting up a governance structure that says, “Never, ever extract production data,” how about looking for ways to do it safely?

One way is to de-identify production data before porting it over to the test system. So you remove information that can identify patients while leaving real-world workflows intact. We’ve written about de-identifying HL7 data here. And provided a few de-identification definitions here.

Because of this issue, we’re working on an HL7 data de-identification tool — a data protection tool, if you will. If you’d like to learn when the software beta opens, please sign up here (don’t forget to check the beta notification box).


Any comments or insights on protecting patient data in test or development systems? We’d love to hear from you in the comments below, on Twitter, or by email (it’s at the top of this post).

De-identifying Patient Data, Part 2


We’re continuing with our series of posts on patient data de-identification. This week, we’re reviewing a set of definitions of common terms. This list will become the glossary for upcoming posts on HL7 de-identification and protecting sensitive healthcare data. We’re looking for feedback on this list. Feel free to add your nuances and/or related terms in the comments…

De-Identification or Anonymization

An umbrella term for removing or masking protected information. In a more specific sense, the de-identification process removes identifiers from a data set so that it’s no longer possible to relate information back to individuals. In the context of healthcare information, de-identification occurs when all identifiers (IDs, names, addresses, phone numbers, etc – see our previous HL7 de-identification post for a complete list) are removed from the information set. This way, patient identity is protected while most of the data remains available for sharing with other people/organizations, statistical analysis, or related uses.

Pseudonymization

A subset of anonymization. This process replaces data-element identifiers with new identifiers so that the relationship to the initial object is replaced by a completely new subject. After the substitution, it is no longer possible to associate the initial subject with the data set. In the context of healthcare information, we can “pseudonymize” patient information by replacing patient-identifying data with completely unrelated data. The result is a new patient profile. The data continues to look complete and the data semantics (the meaning of the data) are preserved while patient information remains protected.

Re-identification

This process restores the initial information to a pseudonymized data set. To re-identify data, you would need to use a series of reverse mapping structures constructed as the data is pseudonymized. There are a few use cases for re-identification. One example would be to send the pseudonymized data to an external system for processing. Once the processed information is returned, it would be re-identified and pushed to the right patient file.
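The reverse mapping structure mentioned above can be as simple as a pair of lookup tables built while the data is pseudonymized. Here is a minimal Python sketch. The Pseudonymizer class and the token format are assumptions made for illustration.

```python
import secrets

class Pseudonymizer:
    """Replace patient IDs with random tokens and keep a reverse map."""

    def __init__(self):
        self._forward = {}  # real ID -> token
        self._reverse = {}  # token -> real ID (needed only for re-identification)

    def pseudonymize(self, patient_id):
        if patient_id not in self._forward:
            token = "PSN-" + secrets.token_hex(8)
            self._forward[patient_id] = token
            self._reverse[token] = patient_id
        return self._forward[patient_id]

    def reidentify(self, token):
        return self._reverse[token]

p = Pseudonymizer()
token = p.pseudonymize("MRN001")  # send this to the external system
restored = p.reidentify(token)    # map the processed result back to the patient
```

Whoever holds the reverse map can undo the pseudonymization, so it needs the same protection as the original data. If re-identification should never be possible, don’t keep it.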

Identifiers

Identifiers are data elements that can directly identify individuals. Examples of identifiers include but are not limited to name, email address, telephone number, home address, social security number, medical card number (see previous post for a complete list of HIPAA identifiers). In some cases, more than one identifying variable is needed to identify an individual uniquely. For example, the name “John Smith” appears multiple times in the White Pages. However, you need to combine the name with a telephone number to identify the right John Smith.


Quasi-identifiers

These are data elements that do not directly identify an individual, but that provide enough information to significantly narrow the search for a specific individual. Some quasi-identifiers have been studied extensively. These include gender, date of birth, and zip/postal code. Quasi-identifiers are highly dependent on the type of data set. For example, gender will not be a meaningful quasi-identifier if all of the individuals are female. Another interesting thing about quasi-identifiers: they are categorical in nature, with a finite set of discrete values. In other words, gender, birth dates over a period of less than 150 years, and addresses are finite. This makes searches simple. Individuals are relatively easy to pinpoint using quasi-identifiers.
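One way to see how quasi-identifiers narrow a search is to count how many records share each combination of values. The Python sketch below does that on a few made-up records. The field names and the data are purely illustrative.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(record[q] for q in quasi_identifiers) for record in records)

# Made-up records for illustration.
records = [
    {"gender": "F", "birth_year": 1950, "zip": "22207"},
    {"gender": "F", "birth_year": 1950, "zip": "22207"},
    {"gender": "M", "birth_year": 1982, "zip": "22207"},
]
sizes = group_sizes(records, ["gender", "birth_year", "zip"])
# Combinations held by a single record point to exactly one individual.
unique = [combo for combo, count in sizes.items() if count == 1]
```

Any combination with a count of one is a candidate for further masking or generalization, for example keeping only the birth decade or the first digits of the zip code.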


Non-identifiers

These data elements may contain personal information on individuals, but they aren’t helpful for reconstructing the initial information. For example, an indicator on whether an individual has pollen allergies would most likely be a non-identifying data element. The incidence of pollen allergy is so high in the population that it would not be a good discriminator among individuals. Again, non-identifier data elements are dependent on data sets. In a different context, this data element might enable you to identify individuals.

What other data de-identification terms should we define?

De-identifying Patient Data, Part 1

In healthcare IT, no matter where you work, you’re faced with protecting patient data. Many countries have regulatory frameworks to address patient privacy and the use of health information. In the US, HIPAA regulates the use of PHI (protected health information). In Canada, the law is called PIPEDA (Personal Information Protection and Electronic Documents Act). PIPEDA regulates the use of consumer data in a number of industries, not just healthcare. Plus a few Canadian provinces have their own privacy legislation in place.

Regardless, data breaches cost healthcare organizations a staggering $6 billion annually, in the US alone.

So how do you protect patient data? Let’s home in on one data protection technique: de-identification. Data de-identification is essentially a way to mask or replace personally identifiable information (PII) and protected health information (PHI). On occasion, HL7 analysts need to share or redistribute HL7 production data. One use case is the need to port realistic data to a test system or staging area.

So what do you need to know in order to de-identify HL7 log data?

  1. To begin with, you’ll need to list the sensitive data identifiers you’re dealing with. The Department of Health and Human Services (HHS) provides a HIPAA Privacy Rule booklet (PDF) that highlights the 18 HIPAA identifiers. Each identifier is a category of data you need to protect. The list goes way beyond names, addresses, social security numbers, and health plan numbers. You’ll need to pay attention to device identifiers and even IP addresses. Ensure that your de-identification technique covers all 18 identifiers.
  2. To be safe, use techniques that don’t permit re-identification.
  3. Make sure you map identifiers to HL7 fields and segments. This will vary from one system to the next. You’ll want to have the ability to trace which message components will be impacted by changes before you hit that OK button (or the equivalent) on your de-identification tool.
  4. Ensure the data remains useful. One of the issues with traditional randomization techniques is that scrambled data may not be plausible. Overall meaning in the message flow should be preserved. You don’t want to be able to identify patient John Smith, but you want to make sure he isn’t discharged before he’s admitted — so the patient’s overall record should remain as-is.
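As an illustration of the last point, one common way to keep events plausible is to shift all of a patient’s timestamps by the same offset, so that the order and spacing of events survive. The Python sketch below assumes minute-precision HL7 timestamps (YYYYMMDDHHMM). The offset value and the function name are made up for the example.

```python
from datetime import datetime, timedelta

def shift_timestamp(ts, offset_days):
    """Shift an HL7 timestamp (YYYYMMDDHHMM) by a whole number of days."""
    moved = datetime.strptime(ts, "%Y%m%d%H%M") + timedelta(days=offset_days)
    return moved.strftime("%Y%m%d%H%M")

# Made-up events for one patient.
events = {"admit": "202401010830", "discharge": "202401051400"}
patient_offset = -37  # hypothetical; pick one offset per patient and reuse it
shifted = {name: shift_timestamp(ts, patient_offset) for name, ts in events.items()}
```

Because both events move by the same amount, the admission still comes before the discharge and the length of stay is unchanged. Randomizing each timestamp independently would not guarantee either.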

Your Comments

We’ve just touched the tip of the de-identification iceberg here. Are there other issues we should be keeping an eye out for? Let everyone know in the comments.