Why Is Data Security Important?
Posted by Guest on July 11, 2019 in Blog
In today’s world of hackers and data breaches, how can you ensure your personal information is kept secure? It’s a difficult question – and one that the Census Bureau is legally bound to answer. Under the 72-Year Rule, the Census Bureau is required to keep all of your personal information confidential for seventy-two years after it is collected, and under Title 13 of the United States Code, any current or former employee who discloses confidential information can be sentenced to five years in prison and fined up to $250,000.
The Census Bureau is keenly aware of the importance of public trust and is attempting to go above and beyond to protect your identity. As part of this mission, the Census has to remain vigilant against both current and future privacy threats. Obviously, protecting against future threats poses a challenge – the Census Bureau has no way to predict how computers will become more powerful, what data breaches will happen, or what new hacking techniques will be developed.
One threat that the Census Bureau faces is that if it publicly released data that was not sufficiently anonymized, attackers may identify individuals based on that data. These attacks may be largely broken into two categories: reconstruction and reidentification.
Reconstruction relies on computer algorithms that can take in tabular or aggregate data and perform calculations to break them down to the individual level. In a simple example, if we know there are 3 men in a Census block, there are over 300,000 possible combinations of ages (based on the assumption that humans cannot be older than 125 years or younger than 1). But just by knowing that those 3 men have an average age of 44 and a median age of 30, it is possible to narrow the number of possible age combinations to fewer than 50.
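The arithmetic in this example can be checked with a short brute-force search. This is only a sketch under stated assumptions: ages are treated as whole numbers from 1 to 125, the order of the three men does not matter, and the function names are made up for illustration.

```python
from itertools import combinations_with_replacement

MIN_AGE, MAX_AGE = 1, 125  # age bounds assumed in the example above


def candidate_age_sets():
    """All unordered combinations of three whole-number ages within the bounds."""
    ages = range(MIN_AGE, MAX_AGE + 1)
    # Each tuple comes out sorted because the input range is sorted.
    return list(combinations_with_replacement(ages, 3))


def consistent_with(ages, mean, median):
    """True if a sorted 3-tuple of ages matches the published mean and median."""
    return sum(ages) == mean * 3 and ages[1] == median


all_sets = candidate_age_sets()
matching = [a for a in all_sets if consistent_with(a, mean=44, median=30)]

print(len(all_sets))   # 333375 combinations before any statistics are known
print(len(matching))   # 30 combinations once the mean and median are known
```

Under these assumptions, two published statistics cut the search space from 333,375 combinations to 30. Each additional statistic released about the same block would shrink it further.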
Reidentification, on the other hand, takes publicly available data and combines it with other data. By comparing matching fields between two or more datasets, an attacker can identify individuals. A famous example of this type of attack occurred in the 1990s, when researchers at MIT compared anonymized medical records with the Cambridge, Massachusetts voting records. Using only these two datasets, researchers were able to match records based on ZIP code, date of birth, and gender, and thereby identify the Governor of Massachusetts.
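A linkage of this kind can be sketched in a few lines. The records below are entirely made up, and the field names and the link function are hypothetical; the point is only to show how matching on ZIP code, date of birth, and gender can attach a name to an "anonymized" record.

```python
# Made-up "anonymized" records: no names, but quasi-identifiers remain.
medical = [
    {"zip": "02138", "dob": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02139", "dob": "1982-06-02", "sex": "F", "diagnosis": "condition B"},
    {"zip": "02138", "dob": "1975-11-30", "sex": "F", "diagnosis": "condition C"},
]

# Made-up public records that include names.
voters = [
    {"name": "Alex Example", "zip": "02138", "dob": "1950-01-15", "sex": "M"},
    {"name": "Blair Sample", "zip": "02139", "dob": "1990-03-09", "sex": "F"},
    {"name": "Casey Placeholder", "zip": "02138", "dob": "1975-11-30", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")


def link(anonymized, identified, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for records that match exactly one person."""
    index = {}
    for person in identified:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    matches = []
    for record in anonymized:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means the record is reidentified
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


reidentified = link(medical, voters)
print(reidentified)
```

In this toy data, two of the three medical records match exactly one voter, so their diagnoses are no longer anonymous even though the medical dataset never contained a name.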
Traditionally, the Census Bureau has used a combination of methods to avoid disclosing information that could be used to identify individuals:
1. Suppression: removing information that is considered too identifying. For example, in the following image you can see that the names and street addresses that are visible in the raw dataset are removed in the suppressed dataset. Data values may be suppressed even if they are not directly identifying like names and addresses; they only need to be specific enough that, combined with other data, they could single out individuals. For example, race and ethnic information is not directly identifying, but if a town only had ten people of one race, it could be possible to identify those people by their race in conjunction with other data like age.
2. Coarsening: reducing the amount of detail or precision in data. This is frequently done by grouping records together. For example, instead of reporting the number of people aged 23, you might report the total number of people aged 20-25. The Census Bureau often coarsens data by refraining from reporting statistics for small geographical areas like census blocks.
3. Swapping: exchanging sensitive values between records. As an example, you might exchange birthdates between two people so that it’s harder to identify them in their place of residence. You may also exchange places of residence and keep the birthdates the same, depending on how you intend to use your dataset. Examples of raw data and data after swapping can be seen below.
4. Noise: in data science, noise is a term for meaningless, unexplained variations within a dataset. Noise can be a frustrating obstacle for data scientists, but by carefully introducing noise into a dataset, the Census Bureau can protect the privacy of individuals by obscuring the true details while maintaining the overall statistical patterns.
5. Synthetic data: by analyzing patterns in real census data, the Census Bureau can generate fake or synthetic data. Like noise, synthetic data maintains overall statistical patterns while obscuring individual records.
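The first four methods can be illustrated on a toy dataset. This is a minimal sketch, not the Census Bureau's actual procedure (which, as noted, is largely not public): the records, field names, five-year age bands, and the uniform integer noise are all assumptions chosen for simplicity. Synthetic data generation is left out because it requires a statistical model.

```python
import random

# Made-up raw records.
raw = [
    {"name": "Alex Example", "address": "1 Main St", "block": "A", "age": 23},
    {"name": "Blair Sample", "address": "2 Main St", "block": "A", "age": 41},
    {"name": "Casey Placeholder", "address": "3 Oak Ave", "block": "B", "age": 67},
    {"name": "Devon Madeup", "address": "4 Oak Ave", "block": "B", "age": 24},
]


def suppress(records, fields=("name", "address")):
    """Suppression: drop fields considered too identifying."""
    return [{k: v for k, v in r.items() if k not in fields} for r in records]


def coarsen_age(age, width=5):
    """Coarsening: report an age band instead of an exact age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def swap(records, field, i, j):
    """Swapping: exchange one field's values between records i and j."""
    out = [dict(r) for r in records]
    out[i][field], out[j][field] = out[j][field], out[i][field]
    return out


def add_noise(count, rng, scale=2):
    """Noise: perturb a published count by a small random integer (never below 0)."""
    return max(0, count + rng.randint(-scale, scale))


rng = random.Random(0)  # seeded only so the sketch is repeatable

protected = swap(suppress(raw), "block", 0, 2)
for record in protected:
    record["age"] = coarsen_age(record["age"])

true_count_a = sum(r["block"] == "A" for r in raw)
noisy_count_a = add_noise(true_count_a, rng)

print(protected)
print(true_count_a, noisy_count_a)
```

Each step trades some accuracy for privacy: the protected records no longer carry names, exact ages, or (for two people) the true block, and the published count for block A may differ slightly from the real one.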
The Census Bureau currently uses all of these methods in conjunction to protect the confidentiality of your data. However, they’re not without their drawbacks; namely, the Bureau has to keep most of the information about how it implements these methods secret. If the Bureau released data on how many records it swapped, how much noise it added, and so on, malicious actors would potentially be able to undo the Census Bureau’s privacy measures and recreate the original dataset. We know these methods of data protection do help, but it’s hard to put a number on how much they help.
But most importantly, over the past several decades, computers have gotten more powerful and more data has become publicly accessible. Because data reconstruction attacks and reidentification attacks rely on computer assistance, they are becoming more common and, therefore, greater threats to confidentiality.
In fact, a team of researchers at the Census Bureau found that basic personal information could be reconstructed from publicly available data released after the 2010 Census when it was combined with other publicly available data from sources like Facebook. The reconstructed data was rife with errors for most Americans, but Census officials worried that in conjunction with other datasets, it could threaten some individuals’ privacy. In order to fulfill its obligations of confidentiality, the Census Bureau has decided to incorporate a new method of data protection for both the 2020 Census and the American Community Survey: differential privacy.
In the early 2000s, Cynthia Dwork and a team of researchers proposed differential privacy as a way to protect against data reconstruction attacks and ensure confidentiality of data. Differential privacy is future-proof, meaning that no matter how much data is released in the future or how much computing power increases, your data will always be exactly as secure as it was when differential privacy was implemented.
So far, we’ve explained why the Census Bureau protects your privacy, how traditional data protection methods work, and why the Census Bureau is interested in differential privacy. But how does differential privacy work, and what effect will it have on the reports the Census Bureau releases? AAI will explore the principles behind differential privacy in the next blog post in our series, Securing the Census, Part 2: Differential Privacy and You.
This post is part of a multi-part series exploring data, dis/misinformation and the decennial census, guest-authored by Summer 2019 Ph.D. Fellow Emma Drobina.