People who turn to the Census Bureau’s latest data release in an effort to answer Sesame Street’s musical query may, in some cases, be puzzled by what they find. The detailed race, ethnicity and population counts make it easy to look up data for any block in America. But those numbers may not be completely accurate—and deliberately so.
A census block is the smallest unit of geography for which data are published, and blocks are the basis for assembling larger geographic entities such as legislative districts. Nationally, there are more than 11 million of them, housing on average fewer than 30 people. According to a Census Bureau description, blocks normally are bounded by streets, other prominent physical features or the boundaries of geographic areas. They may be as small as a city block or as sprawling as a 100-square-mile rural area.
The 2010 Census data being released on a state-by-state basis this month and next month, which will be used as the basis for redistricting, include counts down to the block level. Data for each block include counts not just of people in the six basic race groups, but also of people who checked any one of the dozens of multi-race combinations. The data also include counts for Hispanics and non-Hispanics in these dozens of race groups.
But what if there are only one or two people on a block who are in a different race or ethnic category from that of the other residents? In such a case, publication of this level of detail about every block in America runs the risk that a person or household could be identified individually, conflicting with the Census Bureau’s legal obligation to protect the privacy of respondents.
So the Census Bureau deliberately blurs some of its data. The total population count is correct for each block, but when it judges that there is a risk that a household could be individually identified, the bureau undertakes what is called data swapping—essentially exchanging one household in a block for another similar household nearby.
The bureau does not reveal all the rules it uses, saying only that it swaps households “with identical characteristics on a certain set of variables” or “records unique on their block based on a set of key demographic variables.” But as its website states: “Because of data swapping, users should not assume that tables with cells having a value of one or two reveal information about specific individuals.”
According to Stephen E. Fienberg, a professor of statistics and social science at Carnegie Mellon University, “What they will guarantee is that the number of people in a block never changes.” That guarantee is especially important to state officials, who use blocks as the basic units in order to build congressional and legislative districts of equal size.
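A toy illustration may help make the idea of swapping concrete. The sketch below is not the bureau's procedure, whose rules are not public. It assumes, purely for illustration, that swap partners must match on household size, and the household records and the `swap_blocks` helper are hypothetical. It shows how exchanging the block labels of two same-size households leaves each block's total population intact while changing the race detail published for each block.

```python
from collections import Counter

# Hypothetical toy records: one row per household.
households = [
    {"id": 1, "block": "A", "size": 3, "race": "White"},
    {"id": 2, "block": "A", "size": 2, "race": "White"},
    {"id": 3, "block": "A", "size": 3, "race": "Asian"},  # unique on block A
    {"id": 4, "block": "B", "size": 3, "race": "White"},
    {"id": 5, "block": "B", "size": 4, "race": "White"},
]

def block_population(records):
    """Total persons per block."""
    totals = Counter()
    for h in records:
        totals[h["block"]] += h["size"]
    return dict(totals)

def race_counts(records):
    """Persons per (block, race) cell."""
    counts = Counter()
    for h in records:
        counts[(h["block"], h["race"])] += h["size"]
    return dict(counts)

def swap_blocks(records, id_a, id_b):
    """Return a copy of the records with two households' blocks exchanged.

    Assumption: partners must match on household size, so that block
    totals cannot change. The real matching variables are not disclosed.
    """
    by_id = {h["id"]: dict(h) for h in records}
    a, b = by_id[id_a], by_id[id_b]
    if a["size"] != b["size"]:
        raise ValueError("swap partners must match on household size")
    a["block"], b["block"] = b["block"], a["block"]
    return list(by_id.values())

swapped = swap_blocks(households, 3, 4)
print(block_population(households), block_population(swapped))
print(race_counts(households))
print(race_counts(swapped))
```

In this toy case the block totals are identical before and after the swap, but the single Asian household now appears on block B rather than block A, so a small cell in the published table no longer points to a real household on that block.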
Fienberg, whose areas of expertise include disclosure limitation on federal surveys, notes that once a household is swapped, it is permanently relocated for the purposes of any data release by the Census Bureau. Only when the original census forms are made public in 72 years, he said, will any swapped households be viewed in their proper place.
The bureau began to use data swapping with the 1990 Census. But the need to use statistical methods to blur data only grew with the expanded capability of personal computers and increased demand for micro-data, meaning individual records that researchers can analyze under sworn confidentiality agreements. The arrival of the American Community Survey, which produces block-group-level data for five-year time periods on characteristics such as nativity, education, income and commuting as a substitute for the decennial census long form, introduced a new set of challenges.
The bureau employs a number of disclosure-avoidance procedures on a wide variety of its data products, not just the decennial census. Generally, the changes that are made should not affect the conclusions that researchers could draw from the data. However, last year the bureau acknowledged, in response to a working paper by several researchers, that errors had been introduced into some micro-data released about the elderly as the result of faulty application of privacy-protection procedures.
Some have suggested that the Census Bureau not release block-level data because of the privacy issue. Among them are Fienberg and former Census Bureau director Kenneth Prewitt, who oversaw the 2000 Census. Prewitt’s concern was that block-level data are unreliable for a number of reasons, which include respondent error and data swapping. Fienberg argues that political lines can be redrawn without block-level data, noting that block-level errors usually wash out when several blocks are combined.
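Fienberg's point about errors washing out can be sketched with a small example. The block tables below are hypothetical, and the example assumes that both households in a swap sit inside the same district. A swap that crosses a district boundary would not cancel this way, which is one reason the effect holds usually rather than always.

```python
from collections import Counter

# Hypothetical persons-by-race tables for three blocks.
true_blocks = {
    "A": {"White": 5, "Asian": 3},
    "B": {"White": 7},
    "C": {"White": 6, "Black": 2},
}
# The same blocks after one swap between A and B (assumed for illustration).
published_blocks = {
    "A": {"White": 8},
    "B": {"White": 4, "Asian": 3},
    "C": {"White": 6, "Black": 2},
}

def combine(blocks, names):
    """Sum the race counts of the named blocks into one table."""
    total = Counter()
    for name in names:
        total.update(blocks[name])  # Counter.update adds counts
    return dict(total)

district = ["A", "B", "C"]
print(combine(true_blocks, district))
print(combine(published_blocks, district))
```

Blocks A and B each differ from the truth on their own, yet the combined district table is the same in both versions, because the swapped households never left the district.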
The idea has never gained traction, though. There would be strong resistance among users, especially those who use block-level data to redraw the lines of political districts. “Everyone became so accustomed to getting it at that level,” Fienberg said, “that it’s hard to pull back from it.”