- Blocks of interest for NHGIS time series
- Advanced model for blocks of interest
- Simpler model for other blocks
In time series tables that are standardized to 2010 census geography, NHGIS produces the 2000 statistics by reaggregating census block data from 2000 Census Summary File 1 (NHGIS dataset 2000_SF1b).
NHGIS first allocates counts from 2000 blocks to 2010 blocks and then sums the reallocated block counts for each target 2010 unit. Where a 2000 block intersects multiple 2010 blocks, NHGIS interpolates from the 2000 block data to estimate how the 2000 block characteristics are distributed among the intersecting 2010 blocks.
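The two-stage process described above (allocate to 2010 blocks, then sum by target unit) can be sketched as follows. This is only an illustration under assumed data structures: the function name `reaggregate`, the dictionary layouts, and the block and unit identifiers are hypothetical and are not part of any NHGIS file format.

```python
from collections import defaultdict


def reaggregate(source_counts, crosswalk, target_unit_of):
    """Allocate 2000 block counts to 2010 blocks, then sum by target unit.

    source_counts:  {2000 block id: count}                      (hypothetical layout)
    crosswalk:      [(2000 block id, 2010 block id, weight)]    (hypothetical layout)
    target_unit_of: {2010 block id: target 2010 unit id}        (hypothetical layout)
    """
    # Stage 1: reallocate each 2000 count to the intersecting 2010 blocks.
    block2010_counts = defaultdict(float)
    for b2000, b2010, weight in crosswalk:
        block2010_counts[b2010] += source_counts.get(b2000, 0) * weight

    # Stage 2: sum the reallocated block counts within each target unit.
    unit_totals = defaultdict(float)
    for b2010, count in block2010_counts.items():
        unit_totals[target_unit_of[b2010]] += count

    return dict(block2010_counts), dict(unit_totals)


# Toy example: block A is split between 2010 blocks X and Y; block B lies wholly in Y.
source_counts = {"A": 100, "B": 40}
crosswalk = [("A", "X", 0.75), ("A", "Y", 0.25), ("B", "Y", 1.0)]
target_unit_of = {"X": "unit1", "Y": "unit2"}
blocks, units = reaggregate(source_counts, crosswalk, target_unit_of)
```

Because the weights for each 2000 block sum to 1, the reallocated totals equal the original totals.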
In addition to the 2000 block data, we use information from three ancillary sources to refine the interpolation model:
- 2010 census block population and housing unit counts from 2010 Census Summary File 1 (NHGIS dataset 2010_SF1a).
- Locations of residential roads and water bodies from the U.S. Census Bureau's 2010 TIGER/Line Shapefiles.
- Extents of developed land, which we define as 30-meter square cells in the 2001 National Land Cover Database (NLCD 2001) (2011 Edition) with at least 5% of their area covered by impervious surface.
Blocks of interest for NHGIS time series
At the outset of our standardization plans, we identified a limited set of 2000 "blocks of interest" for which to develop an advanced, data-intensive interpolation model. The blocks of interest are those that require interpolation for standardized time series because they each:
- Have a nonzero count of population or housing units and
- Share land area with two or more 2010 census units within a single target geographic level [1]
If a 2000 block has no population or housing, then we need not allocate any data from it to 2010 units.
If a 2000 block's land area lies within exactly one 2010 census unit for each target summary level (e.g., within one 2010 block group, one 2010 county subdivision, one 2010 urban area, etc.), then we can allocate the block's counts wholly to each corresponding target unit without any interpolation.
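The two criteria above can be expressed as a simple predicate. The sketch below assumes a hypothetical input, `units_by_level`, that maps each target geographic level to the set of 2010 units sharing land with the 2000 block; neither the function name nor this structure comes from NHGIS.

```python
def is_block_of_interest(population, housing_units, units_by_level):
    """Return True if a 2000 block requires interpolation.

    units_by_level: {level name: set of 2010 unit ids that share land
                     with the block}  (hypothetical layout)
    """
    # Criterion 1: the block must have some population or housing to allocate.
    if population == 0 and housing_units == 0:
        return False
    # Criterion 2: the block must be split among 2+ units in at least one level.
    return any(len(units) >= 2 for units in units_by_level.values())


split_block = is_block_of_interest(
    12, 5, {"block group": {"bg1"}, "place": {"p1", "p2"}}
)
unsplit_block = is_block_of_interest(
    12, 5, {"block group": {"bg1"}, "place": {"p1"}}
)
empty_block = is_block_of_interest(
    0, 0, {"block group": {"bg1", "bg2"}}
)
```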
The next section specifies the advanced model NHGIS uses to interpolate counts from the blocks of interest to 2010 units for standardized time series tables.
NHGIS also supplies a block-to-block crosswalk that includes an interpolation weight for every intersection between 2000 and 2010 blocks. The crosswalk's weights are based on the advanced model only for the blocks of interest, which comprise 4.3% of all 2000 census blocks. For all other blocks, the crosswalk uses a simpler model based only on 2000 and 2010 census block data.
Advanced model for blocks of interest
This section outlines the key features of the advanced interpolation model NHGIS uses to allocate counts from the 2000 blocks of interest to 2010 blocks for the production of standardized time series. An article in Computers, Environment and Urban Systems (Schroeder 2017) provides a complete explanation of the model, including an assessment of the NHGIS model relative to others. A pre-print version of the article is also available.
We initially derive two sets of interpolation weights, using two distinct interpolation techniques:
Binary dasymetric (BD) interpolation. In a BD model, the study area is subdivided into two zones of either inhabited or uninhabited areas. The basic assumption of the model is that, within each reporting area, the characteristic of interest (population, housing units, etc.) is distributed uniformly throughout the inhabited zone and absent from the uninhabited zone. Accordingly, for a given source unit (in our case, a 2000 census block), the model assigns a weight to each intersecting target unit (2010 census block) that is equal to the proportion of the area of the source unit's inhabited zone that lies in the target unit.
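As a minimal sketch of the BD weighting rule just described, the function below divides the inhabited area of each intersection by the source block's total inhabited area. The function name and the input dictionary are hypothetical; deriving the inhabited areas themselves requires GIS overlay work that is not shown.

```python
def bd_weights(inhabited_area_by_target):
    """Binary dasymetric weights for one source block.

    inhabited_area_by_target: {target block id: area of the source block's
                               inhabited zone lying in that target block}
                              (hypothetical layout)
    """
    total = sum(inhabited_area_by_target.values())
    if total == 0:
        # The BD assumption fails when the block has no inhabited zone.
        raise ValueError("source block has no inhabited zone")
    return {t: area / total for t, area in inhabited_area_by_target.items()}


# Toy example: three quarters of the inhabited zone lies in target block X.
weights = bd_weights({"X": 3000.0, "Y": 1000.0})
```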
In NHGIS's BD model for 2000 block data, the inhabited zone consists of all areas that are at least 5% developed impervious surface (within each 30-meter square cell of NLCD 2001 data) and lie within 300 feet of a residential road center line [2] but not in a water body, using road definitions and water polygons from the 2010 TIGER/Line Shapefiles.
NLCD 2001 also includes classified land cover data that distinguishes four classes of developed land (open space, and low, medium, and high intensity), but through comparison with satellite imagery, we concluded that the open space class included too much uninhabited area (e.g., parks, golf courses, roadways, etc.) to include it in the inhabited zone, while limiting the inhabited zone to the other three developed classes would omit too much residential land. Using our own classification based on a 5% impervious surface criterion appears to achieve an effective compromise.
It is somewhat problematic to use 2010 TIGER/Line roads in the BD model because in many areas, the 2010 roads do not represent the 2000-era road network well. Although 2000-era road definitions do exist in 2000 TIGER data, the Census Bureau made major accuracy improvements in TIGER data between 2000 and 2010, so the 2000 TIGER road representations are spatially imprecise and do not align well with the 2010 TIGER boundaries for 2000 and 2010 blocks that we use to identify intersections between blocks.
The problems of using 2010-era road definitions are largely mitigated by using them in conjunction with the NLCD 2001 data. If a new 2010 road is in an area that was undeveloped in 2000, the area will not meet the impervious surface criterion, and it will properly be omitted from the inhabited zone. Conversely, some liabilities of the NLCD data are mitigated by the 300-foot road buffer restriction. Areas of impervious surface that are far removed from residential roads tend to be industrial or commercial complexes, quarries, airports, interstate highways, golf courses, etc., and are typically not residential.
Target-density weighting (TDW) interpolation. The basic assumption of a TDW model is that, within each source unit, the spatial distribution of the characteristic of interest is proportional to the densities of another, related characteristic among intersecting target units (Schroeder 2007). To interpolate 2000 block data to 2010 blocks, the TDW assumption is that, within each 2000 block, the distribution of 2000 characteristics is proportional to the densities of some 2010 block characteristic measured in 2010.
Additionally, the TDW model we use is dasymetrically refined (Ruther et al. 2015), such that it limits the modeled distributions to areas that are classified as inhabited zones.
Specifically, in NHGIS's TDW model for 2000 blocks of interest, all 2000 characteristics within each 2000 block are assumed to be located only in an inhabited zone that is within 300 feet of a residential road center line and not in a water body. This zone uses the same 2010 TIGER/Line basis and road classification as the BD model, but without the NLCD-based impervious surface restriction. Within each 2000 block's inhabited zone, the densities of 2000 characteristics are assumed to be proportional to the summed densities of 2010 population and housing in the inhabited zone of each intersecting 2010 block. The "summed density" equals the sum of the total population and total housing units in the 2010 block, divided by the area of the inhabited zone in the 2010 block, using the same definition of inhabited zone for both the 2000 and 2010 densities.
In effect, the weight assigned to an intersection between a 2000 block and 2010 block is equal to the proportion of the 2000 block's characteristics that would be located in the 2010 block if the characteristic's density in each "inhabited" area of intersection between the 2000 and 2010 blocks were exactly equal to the summed density of 2010 population and housing units in the inhabited zone of each whole 2010 block.
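The TDW weight described above can be sketched as follows for a single 2000 block. The function name and both input layouts are hypothetical, and the inhabited areas are assumed to have been computed beforehand; this illustrates the weighting logic only, not the NHGIS implementation.

```python
def tdw_weights(intersection_areas, target_blocks):
    """Dasymetrically refined TDW weights for one 2000 block.

    intersection_areas: {2010 block id: inhabited area of the intersection
                         between the 2000 block and that 2010 block}
    target_blocks:      {2010 block id: (2010 population, 2010 housing units,
                         inhabited area of the whole 2010 block)}
    (Both layouts are hypothetical.)
    """
    expected = {}
    for b2010, inter_area in intersection_areas.items():
        pop, housing, zone_area = target_blocks[b2010]
        # Summed density of 2010 population and housing in the inhabited zone.
        density = (pop + housing) / zone_area if zone_area > 0 else 0.0
        # Count expected in the intersection if that density held throughout it.
        expected[b2010] = density * inter_area

    total = sum(expected.values())
    if total == 0:
        # No usable 2010 density information for this 2000 block.
        raise ValueError("no 2010 population or housing in the inhabited zone")
    return {b2010: value / total for b2010, value in expected.items()}


# Toy example: equal intersection areas, but block X is four times as dense as Y.
tdw = tdw_weights(
    {"X": 100.0, "Y": 100.0},
    {"X": (60, 20, 200.0), "Y": (10, 10, 200.0)},
)
```

In this toy case the weights come out near 0.8 for X and 0.2 for Y, since the intersections have equal area and the summed densities are 0.4 and 0.1.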
We choose to use the sum of 2010 population and housing units to guide the interpolation because we have chosen to use a single TDW model to interpolate all 2000 census block characteristics, including both population and housing characteristics. There are many blocks where the count of residents is much greater than the count of housing units (e.g., where most of the population lives in group quarters), and there are many blocks where the opposite occurs (e.g., where most of the housing is vacant). It would therefore be problematic for a general model to use only 2010 population or housing densities to interpolate all 2000 characteristics. We believe that using the sum of the two is a suitable compromise solution.
It would also be possible to use different 2010 characteristics to guide the interpolation of different 2000 characteristics, but using a single set of weights (as noted above) greatly simplifies the model and ensures that all interpolated subtotals will correctly sum to totals.
We choose to use a different inhabited zone definition for the TDW model than for the BD model because we found that the optimal inhabited zone differs for each interpolation approach, based on a broad assessment in which we used several interpolation models to disaggregate counts from pairs of neighboring 2000 blocks back to individual blocks (Schroeder 2017).
The final model is a hybrid of the two above models...
Hybrid TDW-BD model. The BD and TDW models have some complementary advantages and disadvantages. The TDW model is generally the more effective model because the 2010 distributions are, in most cases, a strong indicator of 2000 distributions. However, in blocks where population and housing distributions changed significantly between censuses, we might expect the BD model to be the more effective model. For this reason, we choose to use a hybrid of the TDW and BD models. [3]
In the hybrid model, each interpolation weight is a weighted average of the TDW and BD weights, computed as:
$$p_H = w_T \, p_T + (1 - w_T) \, p_B$$
where the p values are interpolation weights from the TDW model ($p_T$), the BD model ($p_B$), and the hybrid model ($p_H$), and $w_T$ is the weight given to the TDW model in the hybrid model's weighted average. The weight given to the BD model is $(1 - w_T)$, ensuring that the two model weights sum to 1, so there is no loss or gain in total counts.
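A short sketch of the weighted average may help show why totals are preserved. The function name and input dictionaries are hypothetical; the TDW and BD weights are assumed to have been computed already for the same 2000 block.

```python
def hybrid_weights(tdw, bd, w_t):
    """Blend TDW and BD interpolation weights for one 2000 block.

    tdw, bd: {2010 block id: interpolation weight}  (hypothetical layout)
    w_t:     weight given to the TDW model, between 0 and 1
    """
    targets = sorted(set(tdw) | set(bd))
    return {
        t: w_t * tdw.get(t, 0.0) + (1 - w_t) * bd.get(t, 0.0)
        for t in targets
    }


# Toy example: each input set sums to 1, so the blended set does too.
hybrid = hybrid_weights({"X": 0.8, "Y": 0.2}, {"X": 0.5, "Y": 0.5}, w_t=0.75)
```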
Because we expect TDW to perform well where distributions have been stable and less well where distributions have changed, we compute the TDW model weight $w_T$ to vary according to the estimated change in population and housing counts in each 2000 block:
$$w_T = 0.9192 - 0.8057 \cdot \frac{|z - y|}{z + y}$$
where y is the sum of 2000 population and housing units in the 2000 block and z is an estimated sum of 2010 population and housing units in the 2000 block, assuming that each 2010 block's characteristics are uniformly distributed within the 2010 block's inhabited zone as defined for the TDW model. The measure of change used in this model is the absolute normalized difference between y and z, computed as the absolute value of the difference divided by the sum: $|z - y| / (z + y)$. This relative change measure is helpfully constrained to a range of 0 (no change) to 1 (change from zero to nonzero, or vice versa).
We selected the model coefficients (0.9192 and 0.8057) by fitting the model using linear regression on data for over 350,000 cases of interpolation from pairs of 2000 blocks (each block of interest paired with its nearest neighboring block) to individual 2000 blocks. For more information on the model-fitting approach, see Schroeder (2017).
In blocks where there was no estimated change in population and housing between 2000 and 2010, the hybrid model strongly favors TDW over BD, assigning a weight of 0.9192 to the TDW model and a weight of 1 - 0.9192 = 0.0808 to the BD model. In blocks where the estimated change was extreme, the hybrid model favors BD over TDW, assigning a minimum possible weight of 0.9192 - 0.8057 = 0.1135 to the TDW model and a maximum possible weight of 1 - 0.1135 = 0.8865 to the BD model. The mean weight given to TDW ($w_T$) is 0.7217 among all blocks of interest (excluding 303 Alaska blocks that have no NLCD land cover data and are therefore handled separately).
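The TDW model weight can be computed directly from the fitted formula given earlier. The sketch below uses the published coefficients (0.9192 and 0.8057); the function name is hypothetical, and estimating z from 2010 block data is assumed to have been done separately.

```python
def tdw_model_weight(y, z):
    """Weight given to the TDW model in the hybrid for one 2000 block.

    y: sum of 2000 population and housing units in the 2000 block
    z: estimated sum of 2010 population and housing units in the 2000 block
    """
    if y + z == 0:
        # Blocks of interest always have y > 0, so this should not occur.
        raise ValueError("y + z must be positive")
    relative_change = abs(z - y) / (z + y)  # ranges from 0 to 1
    return 0.9192 - 0.8057 * relative_change


w_stable = tdw_model_weight(y=100, z=100)   # no estimated change
w_extreme = tdw_model_weight(y=100, z=0)    # change from nonzero to zero
w_moderate = tdw_model_weight(y=100, z=300) # relative change of 0.5
```

The stable and extreme cases reproduce the maximum and minimum possible values of the weight, and any other block falls between them.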
There are two exceptional scenarios that require special handling:
2000 blocks that contain no inhabited zone or no 2010 population and housing. Both the BD and TDW models assume that each 2000 block contains some area classified as inhabited, and the TDW model further assumes that there is some 2010 population and housing in the intersecting 2010 blocks. In cases where these assumptions do not hold, each model "cascades" to simpler, more general models of the 2000 block characteristics until it reaches a model with a nonzero inhabited zone, in these orders of priority:
BD inhabited zones:
- Land within 300 feet of 2010 residential roads and developed in 2001 (with at least 5% area in impervious surface)
- Land within 300 feet of 2010 residential roads
- All land
TDW target densities:
- Summed density of 2010 block population and housing units on land within 300 feet of residential roads
- Summed density of 2010 block population and housing units on all land
- BD model, with priorities as given above
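The cascading logic amounts to walking an ordered list of candidate definitions and taking the first one with a nonzero measure. The sketch below is a generic illustration under that assumption; the function name, the labels, and the numeric values are all hypothetical.

```python
def first_usable(candidates):
    """Return the label of the first candidate with a positive measure.

    candidates: list of (label, measure) ordered from the most specific
                definition to the most general one (hypothetical layout).
                For BD the measure would be an inhabited-zone area; for TDW
                it would be a summed 2010 density.
    """
    for label, measure in candidates:
        if measure > 0:
            return label
    return None  # no candidate is usable


# Toy BD example: no developed land near roads, so the block falls back
# to all land within 300 feet of residential roads.
bd_choice = first_usable([
    ("road buffer and developed", 0.0),
    ("road buffer", 12.5),
    ("all land", 40.0),
])
```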
Alaska blocks that are not covered by NLCD 2001. NLCD 2001 covers all of the contiguous U.S. and Hawaii, but it covers only a small portion of Alaska, a section centered on Anchorage. For other parts of Alaska, the BD model's inhabited zone (following the priorities given above) consists of all land within 300 feet of 2010 residential roads. After testing the various TDW models for this area, we found that TDW without dasymetric refinement is most suitable. We also refitted the hybrid model for this area, with the outcome that the TDW model weight is a constant 0.8773, with no variation according to the estimated change in population and housing.
Simpler model for other blocks
For the NHGIS block-to-block crosswalks, which supply interpolation weights for all intersections between 2000 and 2010 blocks, we opted not to apply our data-intensive advanced model for all 2000 blocks. Instead, for the 95.7% of 2000 census blocks that are not identified as blocks of interest for NHGIS time series, we use a simpler model based entirely on 2000 and 2010 census block data.
Specifically, we apply simple land-based target-density weighting (TDW) as originally defined by Schroeder (2007), following these steps:
1. Allocate each 2010 block's population and housing unit counts among the 2000 blocks that intersect it in proportion to the land area of each intersection
2. Sum the estimated 2010 population and housing unit counts for all intersections within each 2000 block
3. Set the interpolation weight for each intersection to equal the estimated 2010 count of population and housing units for the intersection (from step 1) divided by the total estimated 2010 count for the 2000 block (from step 2)
4. Where the total estimated 2010 count for the 2000 block is zero, base the interpolation weight on land areas (the land area of the intersection divided by the total land area of the 2000 block); if the 2000 block also has zero land area, base the weight on water areas
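The steps above can be sketched as follows. The record layout and names are hypothetical rather than the actual Block Relationship File schema. The sketch also assumes that a 2010 block with zero total land area contributes no estimated count and that water-based weights are water-area proportions; the text above does not spell out either detail.

```python
from collections import defaultdict


def land_tdw_weights(intersections, counts_2010):
    """Land-based TDW weights for every 2000-2010 block intersection.

    intersections: list of dicts with keys "b2000", "b2010", "land", "water"
    counts_2010:   {2010 block id: population + housing units}
    (Both layouts are hypothetical.)
    """
    land_2010 = defaultdict(float)
    land_2000 = defaultdict(float)
    water_2000 = defaultdict(float)
    for r in intersections:
        land_2010[r["b2010"]] += r["land"]
        land_2000[r["b2000"]] += r["land"]
        water_2000[r["b2000"]] += r["water"]

    # Step 1: split each 2010 count among intersections by land area.
    est = {}
    for r in intersections:
        total_land = land_2010[r["b2010"]]
        count = counts_2010.get(r["b2010"], 0)
        est[(r["b2000"], r["b2010"])] = (
            count * r["land"] / total_land if total_land > 0 else 0.0
        )

    # Step 2: total estimated 2010 count within each 2000 block.
    est_2000 = defaultdict(float)
    for (b2000, _), value in est.items():
        est_2000[b2000] += value

    # Steps 3 and 4: weight by estimated count, else land area, else water area.
    weights = {}
    for r in intersections:
        key = (r["b2000"], r["b2010"])
        src = r["b2000"]
        if est_2000[src] > 0:
            weights[key] = est[key] / est_2000[src]
        elif land_2000[src] > 0:
            weights[key] = r["land"] / land_2000[src]
        elif water_2000[src] > 0:
            weights[key] = r["water"] / water_2000[src]
        else:
            weights[key] = 0.0
    return weights


rows = [
    {"b2000": "A", "b2010": "X", "land": 60.0, "water": 0.0},
    {"b2000": "A", "b2010": "Y", "land": 40.0, "water": 0.0},
    {"b2000": "B", "b2010": "Y", "land": 60.0, "water": 0.0},
    {"b2000": "C", "b2010": "Z", "land": 0.0, "water": 5.0},  # all-water block
]
simple = land_tdw_weights(rows, {"X": 30, "Y": 100, "Z": 0})
```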
We use only two data sources:
- Land and water areas for all 2000-2010 block intersections from the U.S. Census Bureau's 2000-2010 Block Relationship Files
- 2010 census block population and housing unit counts from 2010 Census Summary File 1 (NHGIS dataset 2010_SF1a)
In our assessment of interpolation models (Schroeder 2017), land-based TDW does not perform as well as the hybrid model that NHGIS uses for the blocks of interest, but land-based TDW does outperform all tested binary dasymetric models. It also performs almost as well as the best dasymetrically refined TDW models and, in most cases, almost as well as the hybrid model, too. Meanwhile, the choice of model matters little for most block intersections because 33.8% of 2000 blocks contain no population or housing units, and an additional 48.3% have population or housing but share land with only one 2010 block.
[1] We initially identified the "target geographic levels" to be: block groups, places, county subdivisions, school districts, ZIP Code Tabulation Areas, urban areas, congressional districts (111th and 113th), and any levels that can be constructed from these (e.g., census tracts, counties, etc.). NHGIS has not yet produced standardized time series for all of these levels, but our model would enable us to do so effectively.
[2] TIGER/Line files distinguish many classes of roads. To identify "residential roads" for the purposes of the BD model, we inspected various examples of each TIGER road class and distinguished them according to whether they do or do not, at least occasionally, provide access to housing, with the following results:
Residential road classes:
- S1200: secondary road
- S1400: local neighborhood road, rural road, city street
- S1640: service drive usually along a limited access highway
- S1730: alley
- S1740: private road for service vehicles (logging, oil fields, ranches, etc.)
- S1750: internal U.S. Census Bureau use
- S1780: parking lot road
Nonresidential road classes:
- S1100: primary road
- S1500: vehicular trail: 4WD
- S1630: ramp
- S1710: walkway/pedestrian trail
- S1720: stairway
- S1820: bike path or trail
- S1830: bridle path
- S2000: road median
- any road segment flagged as a bridge, ford, or tunnel
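The classification above can be applied as a simple lookup. In this sketch the class codes are taken from the lists above, while the function name and the boolean flag argument are hypothetical stand-ins for however a given road dataset marks bridges, fords, and tunnels.

```python
# Road class codes listed above as residential.
RESIDENTIAL_CLASSES = {
    "S1200", "S1400", "S1640", "S1730", "S1740", "S1750", "S1780",
}


def is_residential_road(class_code, is_bridge_ford_or_tunnel=False):
    """Return True if a road segment counts as residential for the BD model.

    is_bridge_ford_or_tunnel is a hypothetical flag; segments so flagged
    are treated as nonresidential regardless of class.
    """
    return class_code in RESIDENTIAL_CLASSES and not is_bridge_ford_or_tunnel


local_street = is_residential_road("S1400")
primary_road = is_residential_road("S1100")
local_bridge = is_residential_road("S1400", is_bridge_ford_or_tunnel=True)
```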
[3] The hybrid strategy used here follows the approach used by Schroeder and Van Riper (2013) to construct a hybrid of TDW and a dasymetric model in another setting.
- Ruther, M., Leyk, S., & Buttenfield, B. P. (2015). "Comparing the effects of an NLCD-derived dasymetric refinement on estimation accuracies for multiple areal interpolation methods." GIScience & Remote Sensing 52(2), 158–178. http://dx.doi.org/10.1080/15481603.2015.1018856
- Schroeder, J. P. (2007). "Target-density weighting interpolation and uncertainty evaluation for temporal analysis of census data." Geographical Analysis 39(3), 311–335. http://dx.doi.org/10.1111/j.1538-4632.2007.00706.x
- Schroeder, J. P. (2017). "Hybrid areal interpolation of census counts from 2000 blocks to 2010 geographies." Computers, Environment and Urban Systems 62, 53–63. http://dx.doi.org/10.1016/j.compenvurbsys.2016.10.001
- Schroeder, J. P., & Van Riper, D. C. (2013). "Because Muncie's densities are not Manhattan's: Using geographical weighting in the EM algorithm for areal interpolation." Geographical Analysis 45(3), 216–237. http://dx.doi.org/10.1111/gean.12014