Anonymizing Health Data: Case Studies and Methods to Get You Started (2013)

Chapter 7. Data Reduction: Research Registry Revisited

We often see massive data sets that need to be de-identified. With such large data sets, we are looking for ways to speed up the computations. We noted earlier that the de-identification process is often iterative, and it’s important to expedite these iterations so we can reach a final data set rapidly. In enterprise settings there may be a time window for de-identification to complete so that subsequent analytics can be performed. This also demands more rapid execution of de-identification.

For large and complex data sets traditional de-identification may also result in considerable information loss. We therefore need to find ways to reduce the size and complexity of the data that minimize information loss but still provide the same privacy assurances.

The Subsampling Limbo

A straightforward data reduction approach is subsampling, in which we randomly sample patients from the data set to create a subset of records—the subset is the data set that is then released. The sampling fraction is the proportion of patients that are included in the subset. In the case of a cross-sectional data set, we sample based on records; in the case of a longitudinal data set, we sample based on patients instead. Otherwise, we could end up with a lot of records from those patients with an extreme number of records, even though they’re in the minority. That wouldn’t result in the correct sampling fraction, since we wouldn’t have a representative subsample of patients. What we want is a data set that is representative of the whole.

This seems simple enough, and subsampling can actually be a very powerful way to reduce the re-identification risk. That’s why national statistical agencies will generally only release a small subsample of census data—they do it as a way to manage re-identification risk. You may know that your neighbor A. Herring is in the census, because pretty much everyone is, but if the data being made publicly available is a random sample, then you don’t know if A. Herring is included in the public data set or not.

How Low Can We Go?

Let’s be clear, though: not all data users will be happy with a sample. Generally, data analysts want as much data as they can get. But apart from the general desire for more data, subsampling can also reduce the statistical power of many tests (just to be clear, this has nothing to do with adversary power, discussed in Chapter 6). So, subsampling won’t always be the best option for reducing re-identification risk.

Our cutoff is a sample of 3,000—if a subsample is smaller than that, then the statistical power of classical tests will take a nontrivial hit.[56] But for samples of more than 3,000, the statistical power will be above 90% for most tests. When a data set might contain outliers or effects that are evident only in small groups, then 3,000 may be a low number even if the statistical power is high.

On the flip side, it’s also possible to have too much data, resulting in models that are overfit and unreliable (i.e., they won’t validate on external data). Studies have suggested a max of 20 effective observations per candidate predictor, but more for narrowly distributed predictors.[57] Things get more complicated for some models,[58] so really a power analysis involving effect size should be used to determine an appropriate sample size. And if this doesn’t make any sense to you, then leave it to a statistician to sort out (they’re good that way).

Not for All Types of Risk

For average risk, subsampling has a significant impact and can be another technique to use in addition to generalization and suppression. If a data user wants more precision on a variable (i.e., less generalization and suppression), then we can offer to use a smaller sampling fraction in order to bring the risk down.

But for maximum risk, the reality is that subsampling has very little impact. We would have to create a subsample that is extremely small for it to really have a meaningful impact on maximum risk. This means that for the public release of data sets, subsampling will not be a very useful technique to use (although it may still be worth doing, depending on the context and the changes in risk that you’re looking for). However, for average risk it can have a nontrivial impact and is a useful technique to employ.

Remember that the subsampling we’re talking about would be done completely at random, so it wouldn’t bias the results of the analysis, and wouldn’t require any special analytical methods. Of course, the subsampling can be done in a more complex way, where specific groups of observations are sampled less or more than others, but this also introduces complexities in risk measurement. And, to be honest, it would stretch the assumptions of current risk metrics and their estimators.

BORN to Limbo!

Let’s use the BORN data, which we first saw in Chapter 3, to see how subsampling affects risk. We’re disclosing the data to our good friend Researcher Ronnie, who now wants more precision in the variables and is willing to take a subsample to get it. The risk needs to be smaller than 0.115, and it’s average risk since it’s for a researcher. We use the same setup and quasi-identifiers as before, which we repeat in Table 7-1.

Table 7-1. Quasi-identifiers requested by Researcher Ronnie

Field

Description

BSex

Baby’s sex

MDOB

Maternal date of birth

BDOB

Baby’s date of birth

MPC

Maternal postal code

To give you a sense of things, we created random subsamples going down to only 5% of the data (OK, admittedly that’s not much), and for each subsample we measured the average re-identification risk. This process was repeated 1,000 times and the mean risk across all iterations was computed. We compared the original data set to a version with generalization (but less than before), as described in Table 7-2, to see how we can change the precision of variables when we’re subsampling.

Table 7-2. BORN again! Two versions of the data to see what subsampling will do to risk.

Data set

BSex

MDOB

BDOB

MPC

Original

No change

dd/mm/year

dd/mm/year

6 characters

With generalization

No change

year

week/year

3 characters

Now for the subsampling. Figure 7-1 shows that the average risk decreases rapidly as smaller samples are drawn. The question is, how low can we go? This is a dance we can only do with Researcher Ronnie, as he’s the one that needs to work with the final subsampled data set.

The lower Researcher Ronnie can go, the lower the risk

Figure 7-1. The lower Researcher Ronnie can go, the lower the risk

With all of the detailed variables and no generalization or suppression, subsampling achieves the 0.115 threshold with a sampling fraction of 0.3. The data set has 919,710 records, so a 30% subsample gives us 275,913 records. With generalization of the quasi-identifiers, we meet the risk threshold with a 35% subsample of 321,898 records. But the cost is a loss in precision on some of the quasi-identifiers.

If Researcher Ronnie wanted to perform detailed geospatial analysis, then he’d most likely want six-digit postal codes. With generalization on some of the other fields and suppression, we could even give him a larger subsample. It really comes down to what Researcher Ronnie needs, and how low he can go (in terms of sampling fraction, of course).

Many Quasi-Identifiers

Many health data sets have a large number of quasi-identifiers. We see that with registries and with survey data. A large number of variables might be collected directly from patients and their providers, or data might be linked with other data sets to create a much larger and more comprehensive data set. More quasi-identifiers means more opportunity to re-identify patients.

When there are a large number of quasi-identifiers, it is called “the curse of dimensionality.” This means that there are too many variables, and the risk of re-identification is going to be quite large. To manage the risk in such cases using traditional assumptions and methods would result in excessive generalization and suppression. The data would have quite a lot of information removed from it.

But we have to be realistic with what we assume an adversary can know about patients. When there are a large number of quasi-identifiers, we don’t have to consider all of them at the same time. If a data set has 20 quasi-identifiers, it’s not reasonable to assume that an adversary would have 20 different pieces of background knowledge on any individual in the database. This is a lot of information about individuals, especially when we’re referring to their physical or mental health. An adversary might only have five or seven pieces of background information that can be used for a re-identification attack.

This is the same concept of “adversary power” that we discussed in the context of the WTC registry, except now we are looking at adversary power from the perspective of a cross-sectional data set rather than a series of longitudinal events. The way we approach adversary power is a little different here because all of the patients will have the same number of quasi-identifiers (as opposed to each patient potentially having a different number of events), and we do not model diversity as that concept is not really applicable here.

If the 20 quasi-identifiers are strongly correlated, then knowing one quasi-identifier might be enough to estimate a few more. Bootstrapping this idea further, the adversary might only need to know three or five things to estimate the full set of 20 quasi-identifiers. But that doesn’t change our approach, because evaluating the risk on a few quasi-identifiers would therefore be the same as looking at all the quasi-identifiers. So let’s assume they’re independent of one another.

Subsets of Quasi-Identifiers

Let’s assume that the adversary could only reasonably know 5 of the 20 quasi-identifiers. The question then is which 5 out of 20 do we consider when evaluating the re-identification probability? We need to choose a subset of 5 fields to measure re-identification risk and to de-identify exposed records. Since any combination of 5 fields is plausible, this becomes a combinatorial problem—there are 15,504 combinations of 5 items from a set of 20 (better known as 20 choose 5, or, of 20 items, how many combinations of 5 can we choose?).

We could use a brute force approach and measure the re-identification risk for all 15,504 possible combinations of five quasi-identifiers. If the measured risk were high for any of these 15,504 combinations, then we would apply our de-identification algorithms. The problem is that evaluating 15,504 combinations can take a long time. And this solution doesn’t scale well. For 21 fields, there are 20,249 combinations of 5; for 22 fields, there are 26,334 combinations; and so on. This would be a very inefficient way of solving the problem.

Let’s consider some shortcuts. We’ll refer to our quasi-identifiers as q1 to q20. We can consider each quasi-identifier individually; if the measured risk is high on any one individual quasi-identifier (say, q1) then it will by definition be high for all combinations of five quasi-identifiers that include q1. Date of birth is a good example, as it can be quite identifying. Grouped with state, gender, race, and body mass index, the main driver of risk would still be date of birth. The risk might be higher as a combination of five, but if it was already too high with date of birth then there’s no need to evaluate the combination of quasi-identifiers.

On the other hand, even if the measured risk on q1 is low, there may well be a combination of five quasi-identifiers that includes q1 that has a high measured risk. Take age generalized into 10-year bands. That seems much less identifying, but pair it with state, gender, race, and body mass index, and suddenly we’re not so sure. In this case we need to evaluate all combinations of five quasi-identifiers to ensure they have a low risk of re-identification. This isn’t much of a shortcut, then, because we still need to check the combinations of five for many quasi-identifiers.

Another shortcut, which is more efficient but incurs a slight cost in terms of information loss, is to use covering designs. We can fine tune how much information loss is acceptable, balancing between precision and computation time.

Covering Designs

The idea of covering designs is to “cover” the combinations we’re interested in with a larger group of combinations. It’s easier to illustrate this with a small example. Let’s say that we have five quasi-identifiers and we want to evaluate the risk of re-identification for all combinations of three quasi-identifiers. But we also want to minimize the number of combinations that we evaluate. Table 7-3 shows all combinations of 3 from 5. This gives us 10 combinations.

Table 7-3. Every combination of five quasi-identifiers taken three at a time

Set #

Combination

1

{q1q2q3}

2

{q1q2q4}

3

{q1q2q5}

4

{q1q3q4}

5

{q1q3q5}

6

{q1q4q5}

7

{q2q3q4}

8

{q2q3q5}

9

{q2q4q5}

10

{q3q4q5}

Now let’s consider the fewest possible blocks of size 4, or combinations of four quasi-identifiers, that cover all combinations of three quasi-identifiers. This is shown in Table 7-4. Here we managed to reduce the number of combinations from 10 to 4, but we still evaluate all possible combinations of three quasi-identifiers from a set of five. Table 7-4 is a (5,4,3) covering design (like 5 choose 3, except we add a middle piece, which is the block size, so that it’s more like 5 choose blocks of 4 to cover combinations of 3).

Table 7-4. Covering every combination of five quasi-identifiers taken three at a time in blocks of four

Block

Set #’s covered

{q1q2q3q4}

1, 2, 4, 7

{q1q2q3q5}

1, 3, 5, 8

{q1q2q4q5}

2, 3, 6, 9

{q1q3q4q5}

4, 5, 6, 10

When we measure the re-identification probability for a covering design, we take the maximum probability across the combinations that are evaluated and use that as the measure for the whole data set. The rotten apple spoils the barrel. We’ve developed a strategy to minimize the amount of information loss. Compared to other strategies we tried, this one has resulted in the least distortion of the data:

1.    Sort the blocks in the covering design by the percent of records that are high risk, with the largest percentages on top.

2.    Starting from the top, apply generalization followed by suppression to the block until the risk is acceptably low.

3.    Continue down the list of blocks, using generalization and suppression, until the risk is acceptably low for all of them.

Starting from the top of the list of blocks, applying generalization and suppression lowers the risk for the other blocks lower down the list as well. In a lot of cases that means we don’t need to do anything to the blocks lower down the list, since the worst was dealt with first.

Covering BORN

We linked the BORN data set to socioeconomic data from the census, available at the postal code level. The census provides aggregate numbers, not exact matches to the individual records in BORN. There might be 30 people in postal code P0M1B0 that are visible minorities, but we don’t know which, if any, of these are in the BORN data. The only option to get the exact numbers would be to use secure linking, which we’ll discuss in Chapter 12. The other way is to simulate it. If we know there are 200 people in P0M1B0, then we randomly assign the status of visible minority to individual records with that same postal code, with a probability of 15% (30/200). For our purposes, we used the simulation approach to assign a value to each record in the BORN registry.

Linking BORN, with five quasi-identifiers, to the census data added seven quasi-identifiers to the maternal-health registry (see Table 7-5). Using covering designs makes the de-identification of this data set a lot faster. Compared to a brute force approach of evaluating all combinations, many fewer combinations of quasi-identifiers need to be evaluated.

Table 7-5. BORN and census quasi-identifiers

Origin

Variable

Description

BORN

BSex

Baby’s sex

BORN

MDOB

Maternal date of birth

BORN

BDOB

Baby’s date of birth

BORN

MPC

Maternal postal code

BORN

PBabies

Number of previous live births

BORN

LHIN

The health region where the birth occurred

BORN

LANGUAGE

Language spoken at home (this field also appears in the census)

BORN

ABORIGN

Aboriginal status

BORN

MINORITY

Visible minority

Census

MSTATUS

Marital status

Census

IMMIGRATION

Immigration status

Census

OCCUPATION

Occupation category

Census

PGROUP

Population group (e.g., West Indian, East Asian, Arab, African, and so on)

Census

INCOME

Income in $10k increments

Census

CERT

Highest educational certificate obtained

In general, the number of blocks goes down as the block size increases (as you would expect). This can really speed up computations. However, the larger the blocks are, the more we end up overestimating the actual probability of re-identification, which results in more distortion to the data than is strictly necessary. We show this in Table 7-6, which is independent of the data. Note that the first entry for a covering, (15,k,k), is equivalent to all combinations (15 choose k), the worst case.

Table 7-6. A cover up, to reduce the number of quasi-identifiers we need to consider. The numbers in this table are the numbers of blocks we need to evaluate.

Block size (k)

(15,k,3)

(15,k,5)

(15,k,7)

3

455

   

4

124

   

5

56

3003

 

6

31

578

 

7

15

189

6435

8

13

89

871

9

10

42

270

10

7

27

108

Final Thoughts

If you’re using average risk, subsampling is a simple and very effective way to lower the potential for re-identification. Not so much for maximum risk. Just make sure you have some idea of what kind of analytics will be performed, and you can probably limbo to a pretty low subsample and not have to worry. A power analysis would be a great way to be sure. In another dimension, you could have too many quasi-identifiers, and covering designs can help manage the risk in a more reasonable way than assuming the adversary knows all of them. All in all, it’s just more tools to help manage risk and our assumptions.


[56J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. (Hillsdale, NJ: LEA Publishers, 1988).

[57F.E. Harrell Jr., Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis (New York: Springer, 2001).

[58T. Snijders and R. Bosker, Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling (London: SAGE Publications, 1999).