Generating h3 Hexgrids from GeoDataFrames

import matplotlib.pyplot as plt

from tobler.area_weighted import area_interpolate
from tobler.util import h3fy

%load_ext watermark
%watermark -v -a "author: eli knaap" -d -u -p tobler,cenpy,geopandas
Author: author: eli knaap

Last updated: 2026-03-30

Python implementation: CPython
Python version       : 3.14.3
IPython version      : 9.12.0

tobler   : 0.13.1.dev120+g87df23aef
cenpy    : 1.0.1
geopandas: 1.1.3

Note: This notebook relies on functionality from the contextily package that provides convenient basemaps for geospatial plots, and the cenpy package that provide a convenient interface to the U.S. Census API. These can be installed with

pip install contextily cenpy
or
conda install contextily cenpy -c conda-forge

import contextily as ctx
from cenpy import products
/home/runner/micromamba/envs/test/lib/python3.14/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

Getting data from CenPy

To begin with, we will fetch some data from the 2017 ACS

acs = products.ACS(2017)

We’re looking for median home value, so first we will filter the ACS tables by those containing “value” in the description so we can find the correct variable code

acs.filter_tables("VALUE", by="description")
description columns
table_name
B25075 VALUE [B25075_001E, B25075_002E, B25075_003E, B25075...
B25076 LOWER VALUE QUARTILE (DOLLARS) [B25076_001E]
B25077 MEDIAN VALUE (DOLLARS) [B25077_001E]
B25078 UPPER VALUE QUARTILE (DOLLARS) [B25078_001E]
B25079 AGGREGATE VALUE (DOLLARS) BY AGE OF HOUSEHOLDER [B25079_001E, B25079_002E, B25079_003E, B25079...
B25080 AGGREGATE VALUE (DOLLARS) BY UNITS IN STRUCTURE [B25080_001E, B25080_002E, B25080_003E, B25080...
B25082 AGGREGATE VALUE (DOLLARS) BY MORTGAGE STATUS [B25082_001E, B25082_002E, B25082_003E]
B25083 MEDIAN VALUE (DOLLARS) FOR MOBILE HOMES [B25083_001E]
B25096 MORTGAGE STATUS BY VALUE [B25096_001E, B25096_002E, B25096_003E, B25096...
B25097 MORTGAGE STATUS BY MEDIAN VALUE (DOLLARS) [B25097_001E, B25097_002E, B25097_003E]
B25100 MORTGAGE STATUS BY RATIO OF VALUE TO HOUSEHOLD... [B25100_001E, B25100_002E, B25100_003E, B25100...
B25107 MEDIAN VALUE BY YEAR STRUCTURE BUILT [B25107_001E, B25107_002E, B25107_003E, B25107...
B25108 AGGREGATE VALUE (DOLLARS) BY YEAR STRUCTURE BUILT [B25108_001E, B25108_002E, B25108_003E, B25108...
B25109 MEDIAN VALUE BY YEAR HOUSEHOLDER MOVED INTO UNIT [B25109_001E, B25109_002E, B25109_003E, B25109...
B25110 AGGREGATE VALUE (DOLLARS) BY YEAR HOUSEHOLDER ... [B25110_001E, B25110_002E, B25110_003E, B25110...
B25121 HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 201... [B25121_001E, B25121_002E, B25121_003E, B25121...
B992519 ALLOCATION OF VALUE [B992519_001E, B992519_002E, B992519_003E]

The variable we’re looking for is B25077_001E, the median home value of each. Lets collect that data for the Washington DC metropolitan region. The next cell can take a minute or two to run, depending on the speed of your connection.

dc = acs.from_msa("Washington-Arlington", variables=["B25077_001E"])
dc.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 1359 entries, 0 to 1358
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   GEOID        1359 non-null   str     
 1   geometry     1359 non-null   geometry
 2   B25077_001E  1331 non-null   float64 
 3   NAME         1359 non-null   str     
 4   state        1359 non-null   str     
 5   county       1359 non-null   str     
 6   tract        1359 non-null   str     
dtypes: float64(1), geometry(1), str(5)
memory usage: 170.8 KB
dc.crs
<Projected CRS: EPSG:3857>
Name: WGS 84 / Pseudo-Mercator
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: World between 85.06°S and 85.06°N.
- bounds: (-180.0, -85.06, 180.0, 85.06)
Coordinate Operation:
- name: Popular Visualisation Pseudo-Mercator
- method: Popular Visualisation Pseudo Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Creating Hexgrids with the h3fy function

Using the h3fy function from the tobler.util module, we can easily generate a hexgrid covering the face of the DC Metropolitan region

dc_hex = h3fy(dc)
fig, axs = plt.subplots(1, 2, figsize=(18, 10))
axs = axs.flatten()

dc.plot(ax=axs[0], alpha=0.4, linewidth=1.6, edgecolor="white")
dc_hex.plot(ax=axs[1], alpha=0.4, linewidth=1.6, edgecolor="white")

axs[0].set_title("Original Tract Data")
axs[1].set_title("Hex Grid")

for i, _ in enumerate(axs):
    ctx.add_basemap(axs[i], source=ctx.providers.CartoDB.Positron)
    axs[i].axis("off")
../_images/6b370f306b39a0f82979963f414aa003def2342ab80866f0e00865b274ac029f.png

By altering the resolution parameter, we can generate grids using hexes of various sizes

dc_hex_large = h3fy(dc, resolution=5)
dc_hex_small = h3fy(dc, resolution=7)
fig, axs = plt.subplots(1, 2, figsize=(18, 10))

dc_hex_large.plot(ax=axs[0], alpha=0.4, linewidth=1.6, edgecolor="white")
dc_hex_small.plot(ax=axs[1], alpha=0.4, linewidth=1.6, edgecolor="white")

for ax in axs:
    ctx.add_basemap(ax=ax, source=ctx.providers.CartoDB.Positron)
    ax.axis("off")
../_images/85df4e26da6919080a725f4d01c03e4fc0c07f2d48e4db2ca901429a988f16a6.png

and by using the clip parameter, we can ensure that the hexgrid is does not extend beyond the borders of the input geodataframe

dc_hex_clipped = h3fy(dc, resolution=5, clip=True)
fig, ax = plt.subplots(figsize=(10, 10))

dc_hex_clipped.plot(ax=ax, alpha=0.4, linewidth=1.6, edgecolor="white")
ctx.add_basemap(ax=ax, source=ctx.providers.CartoDB.Positron)
ax.axis("off")
ax.set_title("h3fy with clip")
Text(0.5, 1.0, 'h3fy with clip')
../_images/baa20c8bbfa90990dfa3dd1c083a630eb28e86fce0ebcfc879329e1ec5b40ccc.png

Notice, though, that we still have some sharp hexagonal edges, meaning the hexagon doesn’t fully cover the original source polygons. To force full coverage, we can use the buffer keyword (or, optionally, both buffer and clip).

dc_hex_clipped = h3fy(dc, resolution=5, clip=True, buffer=True)

fig, ax = plt.subplots(figsize=(10, 10))

dc_hex_clipped.plot(ax=ax, alpha=0.4, linewidth=1.6, edgecolor="white")
ctx.add_basemap(ax=ax, source=ctx.providers.CartoDB.Positron)
ax.axis("off")
ax.set_title("h3fy with buffer & clip")
Text(0.5, 1.0, 'h3fy with buffer & clip')
../_images/91b7260d264a0c5d48746b2d86929da526137c801f852f1a850437f9c95f093b.png

Interpolating to a hexgrid

Thus, in just a few lines of code, we can estimate the value of census variables represented by a regular hexgrid

here, we will estimate the median home value of each hex in the DC region using simple areal interpolation

dc_hex_interpolated = area_interpolate(
    source_df=dc, target_df=dc_hex, intensive_variables=["B25077_001E"], fill_nan="mean"
)
/home/runner/micromamba/envs/test/lib/python3.14/site-packages/tobler/area_weighted/area_interpolate.py:350: UserWarning: nan values in variable: B25077_001E, replacing with 437297.2201352367
  vals = _nan_check(source_df, variable, fill_value=fill_nan)
fig, axs = plt.subplots(1, 2, figsize=(20, 10))
dc.plot("B25077_001E", scheme="quantiles", alpha=0.5, ax=axs[0])

dc_hex_interpolated.plot("B25077_001E", scheme="quantiles", alpha=0.5, ax=axs[1])

axs[0].set_title("Original Data")
axs[1].set_title("Interpolated Data")

for ax in axs:
    ctx.add_basemap(ax=ax, source=ctx.providers.CartoDB.Positron)
    ax.axis("off")
plt.suptitle(
    r"Spatial Interpolation with the PySAL $\mathtt{tobler}$ package", fontsize=16
)
Text(0.5, 0.98, 'Spatial Interpolation with the PySAL $\\mathtt{tobler}$ package')
../_images/93389448a88d5f9ae24789effd7b982819ff83ab0b8923736b499dc0a3438089.png

Dealing with NaN Values

It’s important to understand how and NaN (not-a-number) values (i.e. missing data) in the source dataframe are handled.

fill_nans = [None, 0.0, "mean", "median"]

f, ax = plt.subplots(2, 2, figsize=(12, 12))
ax = ax.flatten()
for i, nan in enumerate(fill_nans):
    dc_hex_interpolated = area_interpolate(
        source_df=dc,
        target_df=dc_hex,
        intensive_variables=["B25077_001E"],
        fill_nan=nan,
    ).plot("B25077_001E", scheme="quantiles", alpha=0.5, ax=ax[i])
    ax[i].set_title(f"fill_nan={nan}")
/home/runner/micromamba/envs/test/lib/python3.14/site-packages/tobler/area_weighted/area_interpolate.py:350: UserWarning: nan values in variable: B25077_001E, replacing with None
  vals = _nan_check(source_df, variable, fill_value=fill_nan)
/home/runner/micromamba/envs/test/lib/python3.14/site-packages/tobler/area_weighted/area_interpolate.py:350: UserWarning: nan values in variable: B25077_001E, replacing with 0.0
  vals = _nan_check(source_df, variable, fill_value=fill_nan)
/home/runner/micromamba/envs/test/lib/python3.14/site-packages/tobler/area_weighted/area_interpolate.py:350: UserWarning: nan values in variable: B25077_001E, replacing with 437297.2201352367
  vals = _nan_check(source_df, variable, fill_value=fill_nan)
/home/runner/micromamba/envs/test/lib/python3.14/site-packages/tobler/area_weighted/area_interpolate.py:350: UserWarning: nan values in variable: B25077_001E, replacing with 385900.0
  vals = _nan_check(source_df, variable, fill_value=fill_nan)
../_images/407abffeeb18189bce3ce4335a16246e764fa7cd6c0ad567c901b2ac011edea4.png

Further, all of these options are still different from dropping any geometries with missing values for the variable being interpolated. In this case, the dropped observations in source are not part of the overlay system so only the overlapping pieces of source and target count toward the total weight applied.

ax = dc_hex_interpolated = area_interpolate(
    source_df=dc.dropna(subset=["B25077_001E"]),
    target_df=dc_hex,
    intensive_variables=["B25077_001E"],
    fill_nan=nan,
).plot("B25077_001E", scheme="quantiles", alpha=0.5)
ax.set_title("drop NA prior to interpolation")
ax.axis("off")
(np.float64(-8739249.02217054),
 np.float64(-8481429.572694423),
 np.float64(4567405.812595519),
 np.float64(4841621.00817157))
../_images/c9b8389c8f8f1cd1da3a2453a61f4d693a0a6be7b86899b3e1c7559b06365670.png

For extensive variables (where target values are a weighted sum), dropping missing observations is functionally equivalent to setting fill_nan=0 and allocate_total=True.

For intensive variables, however, the weighted average provided by each target zone is not “diluted” by missing data. The total weights at the target polygon is the shared area of source and target, not the total area of target.