Anonymizing Health Data: Case Studies and Methods to Get You Started (2013)

Chapter 13. De-Identification and Data Quality

As is evident from the case studies we’ve presented, anonymization results in some distortion of the original data. What we want to do now is discuss the amount of distortion that can be introduced and how it can be effectively managed. We’ll focus on de-identification, not masking, because it’s de-identification that distorts the variables we might want to use for analysis. The amount of distortion is referred to as “information loss,” or conversely “data utility.”

Data utility is important for those using anonymized data, because the results of their analyses are critical for informing major care, policy, and investment decisions. Also, the cost of getting access to data is not trivial, making it important to ensure the quality of the data received. What we really want to know is whether the inferences drawn from de-identified data are reliable—that is, are they the same inferences we would draw from the original data?

Useful Data from Useful De-Identification

Although obvious, it’s worth repeating that poor de-identification techniques will result in less data utility. In fact, that is one key way to evaluate the quality of a de-identification method. Many of the de-identification techniques that we’ve described are essentially optimization problems. Some optimization methods maximize data utility, while others minimize the risk of re-identification at the expense of data utility. Not all optimization methods are created equal. The goal, of course, is to maximize data utility while meeting the predefined risk thresholds.

The other thing to keep in mind is that there are sometimes errors in the application of de-identification methods. A public example of that was an error in the public use files created by the US Census Bureau that affected inferences about people 65 years and older.[85] In this case, there was a nontrivial reduction in data utility, but only for a segment of the population. When looking at de-identification methods in this chapter, we’ll assume that there are no errors and that any loss of utility is a function of the methods used.

Evidence of the impact of de-identification on data utility is mixed. Some studies show little impact,[86] whereas others show a nontrivial impact.[87] There is also evidence that data utility depends on both the de-identification method and the analysis method used.[88], [89] Given the mixed reviews, it’s more useful to discuss the factors of evaluating and interpreting data utility rather than trying to make overly general statements.

Degrees of Loss

Data utility is not a binary metric: good or bad. Conceptually, there’s a spectrum of data utility (remember the Goldilocks principle shown in Figure 2-1 in Basic Principles?). A certain amount of data utility may be acceptable in one instance and not acceptable in another. Let’s consider an example of a public data release.

A large health data custodian manages a national registry. This registry has been around for many years, and the internal statisticians have long been performing detailed longitudinal analyses on the data. Access to the original data has also been granted to external analysts, after going through a somewhat lengthy screening and approval process. Then, one day, the sponsors of the data custodian requested a public version of the registry, to make its large data holdings accessible more broadly. The public data would have lower utility than the original registry, given the level of de-identification required. But is that lower utility acceptable?

The users of the public data may be data analysts who would want to do analyses similar to those of the internal statisticians. The greatest concern these analysts would have is whether they would be able to reproduce the longitudinal analyses done by the internal statisticians, given the lesser utility of a public data release. We can certainly evaluate data utility from that perspective, but there are many other potential data consumers for the public data who might have different requirements.

A public data set can be used in many ways:

§  By app developers as the basis for new tools, and to test them on realistic data

§  By students to learn how to build models

§  By computer scientists and statisticians to do simulations and as examples to illustrate their new algorithms and modeling techniques

§  By analysts to develop protocols and code to include in their proposals for funding to work on the full data set (i.e., to justify going through the lengthy process to get access to the internal data set)

§  By the media or other organizations (such as government departments) to produce high-level summaries from the data

§  By health care providers who may want to perform basic benchmarks of their performance

§  By data brokers who want to create aggregate information products for government and industry

For all of these data users, lower data utility may be just fine, because their purposes are different from those of the internal statisticians.

It’s important to keep in mind the multitude of users that may want access to a public data set when deciding whether its level of data utility is acceptable. This is of course most pronounced in the case of a public data set, but the issue arises for nonpublic data sets as well.

For public data, to ensure that the needs of all of these stakeholders are addressed, it would be prudent to consult with them during the de-identification process. This can be achieved by creating an advisory body that provides feedback to the de-identification team.

Workload-Aware De-Identification

Ideally, the de-identification methods that are applied are “aware” of the types of analysis that will be performed on the data. If it’s known that a regression model using age in years is going to be applied, the de-identification should not generalize the age into, say, five-year age intervals. But if the analyst is going to group the age into five-year age bands anyway, because these correspond with known mortality risk bands or something like that, this generalization of age will not have an impact on data utility. Examples of questions you can ask to allow the planned analysis to inform the de-identification are provided in Questions to Improve Data Utility.
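As a minimal illustration (the banding function is hypothetical, not part of any particular de-identification toolkit), generalizing age to five-year intervals costs nothing for an analysis that is already defined on those bands:

```python
# Hypothetical helper: map an exact age to a five-year band label.
def five_year_band(age):
    low = (age // 5) * 5
    return f"{low}-{low + 4}"

ages = [23, 37, 42, 44, 68]
banded = [five_year_band(a) for a in ages]
# 42 and 44 land in the same band ("40-44"), so any analysis defined
# on five-year bands gives the same result before and after this
# generalization.
```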

Customizing the de-identification parameters to the needs of the analysis is easier to do for nonpublic data where the data recipient is known. All you need is a conversation or negotiation with the data recipient to match the de-identification methods to the analysis plan, such that variables that are critical for analysis are less likely to be modified (generalized or suppressed) during the de-identification. If the analyst is expected to perform a geospatial analysis, for example, location data may be less affected, but subsampling (Chapter 7) could be used to maintain anonymity. In contrast, if the analysis is to detect rare events, subsampling may not be an option during the de-identification, but fuzzing location may be all right. Certain groupings of nominal variables, such as race, ethnicity, language spoken at home, and country of birth, may also be limited if these are critical for the analysis.

If the data recipient is a clinician who doesn’t have a data analysis team, limits might be imposed on suppression. Performing imputation (discussed briefly in Information Loss Metrics) to recover information that has been suppressed requires some statistical sophistication, and this might not be something the clinician wants to get into. One of the main advantages of the methodology we’ve presented in this book is that it allows us to calibrate the de-identification so the data sets are better suited to the needs of the data recipients.

If it’s possible to have direct negotiations with the data users, it’s definitely worth the time. It will result in higher-quality data for the people that want it. The challenge is often that people are not used to participating in such negotiations and might not be willing to make the effort. Also, many may not understand the issues at hand and the trade-offs involved in different kinds of de-identification. But this relatively minor investment in time can make a big difference in the data utility. It’s also a way to manage expectations.

It’s not always possible to negotiate directly with a data recipient. A pharmacy that has provided its de-identified data to multiple business partners might not have the capacity to calibrate the data sets to each data user—such workload awareness might not be feasible. But there are a few options to consider:

§  You could guess at the types of analyses that are likely to be performed. The planned analyses are often quite simple, consisting of univariate and bivariate statistics and simple cross-tabulations. So, the pharmacy could test the results from running these types of analyses before and after de-identification to evaluate data utility. In most cases, the data custodians will have good knowledge of their data and which fields are important for data analysis, which makes them well positioned to make such judgments.

§  You could create multiple data sets suited to different analyses. One data set may have more extensive geographic information but less clinical detail, and another data set may have much more clinical information but fewer geographic details.[90] Depending on the needs of the data user, the appropriate data set would be provided.

§  You could ensure high data utility for some common parameters, such as means, standard deviations, correlation and covariance matrices, and basic regression models.[91], [92] These parameters are relevant for most analyses that are likely to be performed on the data. The impacts of de-identification on each parameter can be measured using the mean squared error (before and after de-identification).

§  You could use general information loss metrics, discussed in Information Loss Metrics. These metrics include the extent of suppression and entropy. In practice they are relatively good proxies for general distortion to the data, and people can easily understand them.

Of course, you can also use a mix of these options: one type of de-identification for known data users who are willing to negotiate with the data custodian, and a generically de-identified data set for other users.
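The third option above, checking the impact of de-identification on common parameters, can be sketched as follows. The data and the midpoint-of-band generalization are purely illustrative; the point is simply to compare the parameters before and after using squared error:

```python
import statistics

# Illustrative original values (e.g., patient ages).
original = [34, 41, 29, 56, 62, 47, 38, 51]
# Suppose de-identification replaced each age with the midpoint of
# its five-year band (a purely illustrative scheme).
deidentified = [(a // 5) * 5 + 2 for a in original]

def squared_error(a, b):
    return (a - b) ** 2

# Squared error on the parameters analysts are likely to care about.
mean_se = squared_error(statistics.mean(original), statistics.mean(deidentified))
sd_se = squared_error(statistics.stdev(original), statistics.stdev(deidentified))
# Small errors on these parameters suggest the de-identified data
# remains useful for the analyses that rely on them.
```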

Questions to Improve Data Utility

You can ask data users a lot of things to better understand the type of analytics that they plan to run on the data. The answers to these questions can help de-identify the data in a manner that will increase its data utility for them. Here are some of these questions, which will help you get the conversation started:

What data do you really need?

This is probably the most fundamental question to ask. We often see data users asking for all of the variables in a data set, or at least for more variables than they really need or plan to use. Maybe they haven’t thought through the analysis methods very carefully yet—they wanted to first see what data they could get. Talking about it with the data users will often result in a nontrivial pruning of the data requested. This is important because fewer variables will mean fewer quasi-identifiers, which leads to higher data utility for the remaining quasi-identifiers after de-identification.
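The effect of pruning variables can be seen in a small sketch (the records and columns are made up): with one fewer quasi-identifier, the groups of indistinguishable records get larger, so less generalization or suppression is needed later.

```python
from collections import Counter

# Made-up records: (age band, sex, occupation).
records = [
    ("40-44", "F", "teacher"),
    ("40-44", "F", "nurse"),
    ("45-49", "M", "teacher"),
    ("45-49", "M", "clerk"),
]

def smallest_class(rows, keep):
    """Size of the smallest group of indistinguishable records
    over the kept quasi-identifier columns."""
    groups = Counter(tuple(r[i] for i in keep) for r in rows)
    return min(groups.values())

with_occupation = smallest_class(records, keep=(0, 1, 2))  # every record unique
without_occupation = smallest_class(records, keep=(0, 1))  # groups of two
```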

Do you need to perform any geospatial analysis?

It’s important to find out the level of granularity required for any planned geospatial analysis. If a comparative analysis by state is going to be done, state information can be included without any more detailed geospatial data. However, if a hotspot analysis to detect disease outbreaks is to be done, more granular data will be needed. But even if no geospatial analysis is planned, location may be needed for other things—e.g., to link with data about socioeconomic status (SES) from the census (in which case linking the data on behalf of the data users and then dropping the geospatial information is the easiest option).
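The link-then-drop approach mentioned above can be sketched as follows, with a made-up region-to-SES lookup standing in for the census data:

```python
# Made-up census lookup: region code -> SES category.
ses_by_region = {"R1": "low", "R2": "high"}

patients = [{"id": 1, "region": "R1"}, {"id": 2, "region": "R2"}]
for p in patients:
    p["ses"] = ses_by_region[p["region"]]  # attach SES from the lookup
    del p["region"]                        # then drop the geography
# The released records carry SES but no geospatial information.
```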

Are exact dates needed?

In many cases exact dates aren’t needed. In oncology studies it’s often acceptable to convert all dates into intervals from the date of diagnosis (the anchor), and remove the original date of diagnosis itself. Or the analysis can be performed at an annual level to look at year-by-year trends, like changes in the cost of drugs—generalizing dates to year won’t affect data utility here. Dates can also be converted to date ranges (e.g., week/year, month/year), depending on what’s needed.
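Converting dates to intervals from an anchor is straightforward; this sketch uses hypothetical dates and Python’s standard date arithmetic:

```python
from datetime import date

diagnosis = date(2012, 3, 15)  # the anchor (hypothetical)
events = {
    "first_treatment": date(2012, 4, 2),
    "follow_up": date(2012, 9, 10),
}

# Day offsets from diagnosis preserve the patient's timeline for
# analysis; the anchor date itself is then removed from the release.
offsets = {name: (d - diagnosis).days for name, d in events.items()}
```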

Will you be looking for rare events?

In general, subsampling is a very powerful way to reduce the risk of re-identification for nonpublic data releases. It’s a method that you should have in your toolbox. But if the purpose of the analysis is to detect rare events, or to examine “the long tail,” then subsampling would mean losing some of those rare events. By definition, rare events are rare, and losing this data is likely to mean losing any chance at statistical significance. In this case, avoid methods that will remove the rare events that are needed.
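A quick simulation illustrates the problem (the counts are illustrative): with only a handful of rare records in the data, a subsample can easily miss most or all of them.

```python
import random

random.seed(1)  # for reproducibility of the sketch
# 1,000 records, only 5 of which carry the rare event of interest.
records = [{"rare": i < 5} for i in range(1000)]

subsample = random.sample(records, 200)  # keep 20% of the records
rare_kept = sum(r["rare"] for r in subsample)
# The expected number of rare records retained is just 1, and there
# is a substantial chance that none survive the subsampling at all.
```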

Can we categorize, or group categories in, variables?

To some degree this goes back to the data that is really needed, but people often forget about groupings they were planning, or would be willing, to create. They might ask for an education variable (six categories), but actually plan to change the groupings into fewer categories than are present in the original (two categories). The finer distinctions might not be needed to answer questions, or some of the original categories might not have a lot of data in them. These changes can have a big impact on reducing risk, especially for demographics, so it’s far better to know this in advance and include it in the de-identification.
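A sketch of the education example, with hypothetical category labels and an illustrative six-to-two mapping:

```python
# Hypothetical six-category education variable collapsed into the
# two groups the analyst actually plans to use.
collapse = {
    "no high school": "no diploma",
    "some high school": "no diploma",
    "high school diploma": "diploma or higher",
    "some college": "diploma or higher",
    "bachelor's degree": "diploma or higher",
    "graduate degree": "diploma or higher",
}

original = ["bachelor's degree", "some high school", "graduate degree"]
grouped = [collapse[v] for v in original]
# Fewer, larger categories mean each one covers more people, which
# directly reduces re-identification risk for this quasi-identifier.
```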

Are provider or pharmacy identities necessary?

Due to the possibility of geoproxy attacks, provider and pharmacy identities increase re-identification risk. It’s relatively easy to determine a provider or pharmacy’s location from its identity and then predict where the patient lives. For some analyses the provider and pharmacy information is critical, and therefore it’s not possible to remove it. But sometimes it is acceptable to remove that information and consequently eliminate the risk of a geoproxy attack.

Do the data recipients have the ability to perform imputation?

Performing imputation to recover missing data requires some specialized skills. If the data user or team doesn’t have that expertise, then it would be prudent to minimize suppression during de-identification. Otherwise you might be leaving people with data they don’t know how to use. If a clinician is planning to do a simple descriptive analysis on the data set and there’s no statistician available to help, the impact of missingness might go unnoticed. If the clinician is unlikely to perform imputation, or to understand the impact of missingness, minimize suppression.
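For readers unfamiliar with imputation, the simplest variant, mean imputation, looks like this (the ages are illustrative; real analyses typically use more sophisticated methods such as multiple imputation):

```python
# Illustrative ages, with None marking suppressed values.
ages = [34, None, 29, None, 62]

observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)
# Each suppressed value is replaced by the mean of the observed ones.
imputed = [a if a is not None else mean_age for a in ages]
```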

Would you be willing to impose additional controls?

One way to increase data utility is to increase the risk threshold so that less de-identification is required. We can do this by improving the mitigating controls, if there’s a willingness to put in place stronger security and privacy practices. If there’s time, it’s helpful to show data users what data they would get with the current versus the improved mitigating controls. That way they can decide if it’s worth the effort for them to strengthen their security and privacy practices.

Is it possible to strengthen the data sharing agreement?

This is another way to increase the risk threshold so that less de-identification is required. The first measure we would suggest, and the first that we look for, is a provision in the data sharing agreement that prohibits re-identification, which reduces both the motives and the capacity to attempt it.

Final Thoughts

How much the utility of data changes before and after de-identification is important, and is very context-driven. All stakeholders need to provide input on what is most important to them, be it data utility or privacy. It’s not easy to balance the needs of everyone involved, but good communication and a commitment to producing useful data that keeps the risk of re-identification low are all you really need to get started. It’s not an easy negotiation—and it may be iterative—but it is an important negotiation to have.

[85] J. Alexander, M. Davern, and B. Stevenson, “Inaccurate Age and Sex Data in the Census PUMS Files: Evidence and Implications,” NBER Working Paper No. 15703 (Cambridge, MA: National Bureau of Economic Research, 2010).

[86] A. Kennickell and J. Lane, “Measuring the Impact of Data Protection Techniques on Data Utility: Evidence from the Survey of Consumer Finances,” in Privacy in Statistical Databases, ed. J. Domingo-Ferrer and L. Franconi (Berlin: Springer, 2006), 291–303.

[87] K. Purdam and M. Elliot, “A Case Study of the Impact of Statistical Disclosure Control on Data Quality in the Individual UK Samples of Anonymised Records,” Environment and Planning A 39:5 (2007): 1101–1118.

[88] S. Lechner and W. Pohlmeier, “To Blank or Not to Blank? A Comparison of the Effects of Disclosure Limitation Methods on Nonlinear Regression Estimates,” in Privacy in Statistical Databases, ed. J. Domingo-Ferrer and V. Torra (Berlin: Springer, 2004), 187–200.

[89] L. H. Cox and J. J. Kim, “Effects of Rounding on the Quality and Confidentiality of Statistical Data,” in Privacy in Statistical Databases, ed. J. Domingo-Ferrer and L. Franconi (Berlin: Springer, 2006), 48–56.

[90] K. El Emam, D. Paton, F. Dankar, and G. Koru, “De-identifying a Public Use Microdata File from the Canadian National Discharge Abstract Database,” BMC Medical Informatics and Decision Making 11:53 (2011).

[91] W. E. Winkler, “Methods and Analyses for Determining Quality,” in Proceedings of the 2nd International Workshop on Information Quality in Information Systems (Baltimore, MD: ACM, 2005), 3.

[92] J. Domingo-Ferrer and V. Torra, “Disclosure Control Methods and Information Loss for Microdata,” in Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, ed. P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz (Amsterdam: Elsevier Science, 2001), 91–110.

