This page was generated from notebooks/randomregion.ipynb.
Random Regions
Authors: Serge Rey, James Gaboardi
Introduction
In this notebook we demonstrate how to use spopt
with empirical and synthetic examples by:
Evaluating the properties of a predefined regionalization scheme.
Exploring the max_swaps parameter on a simple lattice.
spopt offers functionality to generate random regions based on user-defined constraints. There are three optional parameters to constrain the regionalization: number of regions, cardinality, and contiguity. The default case simply takes a list of area IDs, randomly selects the number of regions, and then allocates areas to each region. The user can also pass a vector of integers to the cardinality parameter to designate the number of areas to randomly assign to each region. The contiguity parameter takes a spatial weights object and uses that to ensure that each region is made up of spatially contiguous areas. When the contiguity constraint is enforced, it is possible to arrive at infeasible solutions; the maxiter parameter can be set to make multiple attempts to find a feasible solution.
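As a minimal, self-contained sketch of the default case (the ten integer IDs here are purely illustrative), only area IDs are supplied, so both the number of regions and the allocation of areas are random:

# Minimal sketch of the default case: pass only a list of area IDs;
# the number of regions and the allocation of areas are both random.
from spopt.region import RandomRegion

area_ids = list(range(10))  # any list of area IDs works
rr = RandomRegion(area_ids)
rr.regions  # a list of lists; each inner list holds one region's area IDs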
We begin with importing the relevant packages:
[1]:
%config InlineBackend.figure_format = "retina"
%load_ext watermark
%watermark
Last updated: 2023-12-10T14:07:40.178142-05:00
Python implementation: CPython
Python version : 3.12.0
IPython version : 8.18.0
Compiler : Clang 15.0.7
OS : Darwin
Release : 23.1.0
Machine : x86_64
Processor : i386
CPU cores : 8
Architecture: 64bit
[2]:
import geopandas
import inequality
import libpysal
import matplotlib
import matplotlib.pyplot as plt
import numpy
import pandas
import seaborn
import spopt
from spopt.region import RandomRegion, RandomRegions
import warnings
RANDOM_SEED = 12345
%watermark -w
%watermark -iv
Watermark: 2.4.3
numpy : 1.26.2
pandas : 2.1.3
spopt : 0.5.1.dev53+g5cadae7
geopandas : 0.14.1
matplotlib: 3.8.2
inequality: 1.0.1
libpysal : 4.9.2
seaborn : 0.13.0
1. Evaluating Regional Partitions: The case of Mexican state incomes
We will be using the built-in example data set mexico from the libpysal package. First, we get a high-level overview of the example:
[3]:
libpysal.examples.explain("mexico")
mexico
======
Decennial per capita incomes of Mexican states 1940-2000
--------------------------------------------------------
* mexico.csv: attribute data. (n=32, k=13)
* mexico.gal: spatial weights in GAL format.
* mexicojoin.shp: Polygon shapefile. (n=32)
Data used in Rey, S.J. and M.L. Sastre Gutierrez. (2010) "Interregional inequality dynamics in Mexico." Spatial Economic Analysis, 5: 277-298.
[4]:
mdf = geopandas.read_file(libpysal.examples.get_path("mexicojoin.shp"))
[5]:
mdf.head()
[5]:
POLY_ID | AREA | CODE | NAME | PERIMETER | ACRES | HECTARES | PCGDP1940 | PCGDP1950 | PCGDP1960 | ... | GR9000 | LPCGDP40 | LPCGDP50 | LPCGDP60 | LPCGDP70 | LPCGDP80 | LPCGDP90 | LPCGDP00 | TEST | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7.252751e+10 | MX02 | Baja California Norte | 2040312.385 | 1.792187e+07 | 7252751.376 | 22361.0 | 20977.0 | 17865.0 | ... | 0.05 | 4.35 | 4.32 | 4.25 | 4.40 | 4.47 | 4.43 | 4.48 | 1.0 | MULTIPOLYGON (((-113.13972 29.01778, -113.2405... |
1 | 2 | 7.225988e+10 | MX03 | Baja California Sur | 2912880.772 | 1.785573e+07 | 7225987.769 | 9573.0 | 16013.0 | 16707.0 | ... | 0.00 | 3.98 | 4.20 | 4.22 | 4.39 | 4.46 | 4.41 | 4.42 | 2.0 | MULTIPOLYGON (((-111.20612 25.80278, -111.2302... |
2 | 3 | 2.731957e+10 | MX18 | Nayarit | 1034770.341 | 6.750785e+06 | 2731956.859 | 4836.0 | 7515.0 | 7621.0 | ... | -0.05 | 3.68 | 3.88 | 3.88 | 4.04 | 4.13 | 4.11 | 4.06 | 3.0 | MULTIPOLYGON (((-106.62108 21.56531, -106.6475... |
3 | 4 | 7.961008e+10 | MX14 | Jalisco | 2324727.436 | 1.967200e+07 | 7961008.285 | 5309.0 | 8232.0 | 9953.0 | ... | 0.03 | 3.73 | 3.92 | 4.00 | 4.21 | 4.32 | 4.30 | 4.33 | 4.0 | POLYGON ((-101.52490 21.85664, -101.58830 21.7... |
4 | 5 | 5.467030e+09 | MX01 | Aguascalientes | 313895.530 | 1.350927e+06 | 546702.985 | 10384.0 | 6234.0 | 8714.0 | ... | 0.13 | 4.02 | 3.79 | 3.94 | 4.21 | 4.32 | 4.32 | 4.44 | 5.0 | POLYGON ((-101.84620 22.01176, -101.96530 21.8... |
5 rows × 35 columns
This data set records per capita Gross Domestic Product (PCGDP) for the decades 1940-2000 for 32 Mexican states.
[6]:
mdf.columns
[6]:
Index(['POLY_ID', 'AREA', 'CODE', 'NAME', 'PERIMETER', 'ACRES', 'HECTARES',
'PCGDP1940', 'PCGDP1950', 'PCGDP1960', 'PCGDP1970', 'PCGDP1980',
'PCGDP1990', 'PCGDP2000', 'HANSON03', 'HANSON98', 'ESQUIVEL99', 'INEGI',
'INEGI2', 'MAXP', 'GR4000', 'GR5000', 'GR6000', 'GR7000', 'GR8000',
'GR9000', 'LPCGDP40', 'LPCGDP50', 'LPCGDP60', 'LPCGDP70', 'LPCGDP80',
'LPCGDP90', 'LPCGDP00', 'TEST', 'geometry'],
dtype='object')
We can plot the spatial distribution of regional incomes in the last decade of the sample:
[7]:
mdf.plot(
figsize=(8, 6),
column="PCGDP2000",
scheme="quantiles",
cmap="Blues",
edgecolor="grey",
legend=True,
legend_kwds={"fmt": "{:.0f}"},
).axis("off")
[7]:
(-118.64174423217773, -85.21563186645508, 13.6420334815979, 33.6293231010437)
Here we see the north-south divide in the Mexican space economy. This pattern has been the subject of much recent research, and a common methodological strategy in these studies is to partition the states into mutually exclusive and exhaustive regions. The motivation is similar to that in traditional clustering where one attempts to group like objects together. Here the regional sets should be composed of states that are more similar to other members of the same set, and distinct from those states in other sets.
Our dataset has a number of regionalization schemes from the published literature. Here we will focus on one, HANSON03. We can plot this partition using a categorical scheme:
[8]:
mdf.plot(figsize=(8, 6), column="HANSON03", categorical=True).axis("off");
Here the different colors symbolize membership in different regions. We can examine the cardinality structure of this partition:
[9]:
cards = mdf.groupby(by="HANSON03").count().NAME.values.tolist()
cards
[9]:
[6, 7, 10, 2, 3, 4]
Thus, we see there are six regions (sets) of different sizes: the smallest region has only two states, while the largest (in a set-theoretic sense) has 10 states. We can explore the regional definitions in tabular form:
[10]:
mdf[["NAME", "HANSON03", "PCGDP2000"]].sort_values(by="HANSON03")
[10]:
NAME | HANSON03 | PCGDP2000 | |
---|---|---|---|
0 | Baja California Norte | 1.0 | 29855.0 |
29 | Nuevo Leon | 1.0 | 38672.0 |
23 | Chihuahua | 1.0 | 30735.0 |
30 | Tamaulipas | 1.0 | 23546.0 |
24 | Coahuila De Zaragoza | 1.0 | 28460.0 |
22 | Sonora | 1.0 | 24068.0 |
2 | Nayarit | 2.0 | 11478.0 |
1 | Baja California Sur | 2.0 | 26103.0 |
27 | Zacatecas | 2.0 | 11130.0 |
26 | Durango | 2.0 | 17379.0 |
25 | Sinaloa | 2.0 | 15242.0 |
4 | Aguascalientes | 2.0 | 27782.0 |
28 | San Luis Potosi | 2.0 | 15866.0 |
3 | Jalisco | 3.0 | 21610.0 |
6 | Queretaro de Arteaga | 3.0 | 26149.0 |
5 | Guanajuato | 3.0 | 15585.0 |
7 | Hidalgo | 3.0 | 12348.0 |
17 | Tlaxcala | 3.0 | 11701.0 |
31 | Veracruz-Llave | 3.0 | 12191.0 |
8 | Michoacan de Ocampo | 3.0 | 11838.0 |
12 | Morelos | 3.0 | 18170.0 |
15 | Puebla | 3.0 | 15685.0 |
11 | Colima | 3.0 | 21358.0 |
10 | Distrito Federal | 4.0 | 54349.0 |
9 | Mexico | 4.0 | 16322.0 |
21 | Chiapas | 5.0 | 8684.0 |
19 | Oaxaca | 5.0 | 9010.0 |
18 | Guerrero | 5.0 | 11820.0 |
14 | Campeche | 6.0 | 36163.0 |
13 | Yucatan | 6.0 | 17509.0 |
20 | Tabasco | 6.0 | 13360.0 |
16 | Quintana Roo | 6.0 | 33442.0 |
How good is HANSON03 as a partition?
One question we might ask about the partition is how well it captures the spatial distribution of incomes in Mexico. To answer this question, we are going to leverage the random regions functionality in spopt.
A random partition respecting cardinality constraints
We will first create a new random partition that has the same cardinality structure/distribution as the HANSON03 partition:
[11]:
ids = mdf.index.values.tolist()
[12]:
numpy.random.seed(RANDOM_SEED)
rrmx = RandomRegion(ids, num_regions=6, cardinality=cards)
This creates a new RandomRegion instance. One of its attributes is the definition of the regions in the partition:
[13]:
rrmx.regions
[13]:
[[27, 12, 18, 3, 15, 8],
[0, 25, 21, 20, 7, 6, 24],
[23, 10, 13, 11, 19, 16, 26, 14, 17, 22],
[28, 31],
[30, 9, 4],
[1, 29, 5, 2]]
This will have the same cardinality as the HANSON03 partition:
[14]:
set([len(region) for region in rrmx.regions]) == set(cards)
[14]:
True
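As an aside, comparing set objects discards duplicate cardinalities; in this case all six cardinalities happen to be distinct, so the set comparison is safe, but a stricter check (a sketch, not part of the original notebook) compares the sorted lists so that multiplicities must also match:

# stricter check: sorted cardinalities must match element-for-element
sorted(len(region) for region in rrmx.regions) == sorted(cards)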
A simple helper function lets us attach the region labels for this new partition to the dataframe:
[15]:
def region_labels(df, solution, name="labels_"):
    # Record each region's index as the label of its member areas.
    n = df.shape[0]
    labels_ = numpy.zeros((n,), int)
    for i, region in enumerate(solution.regions):
        labels_[region] = i
    df[name] = labels_
And, then we can visualize the new partition:
[16]:
region_labels(mdf, rrmx, name="rrmx")
mdf.plot(figsize=(8, 6), column="rrmx", categorical=True).axis("off");
While the rrmx partition and the HANSON03 partition have identical cardinality structures, their spatial distributions are radically different. The rrmx map looks nothing like the HANSON03 map.
One of the key differences is that the sets in the HANSON03 partition are each spatially connected components, while this is not the case for the “regions” in the rrmx partition.
A random partition respecting cardinality and contiguity constraints
A second way to form a random partition is to add an additional constraint in the form of the spatial connectivity between the states. To do so, we will construct a spatial weights object using the Queen contiguity criterion:
[17]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence deprecation chatter during weights construction
    w = libpysal.weights.Queen.from_dataframe(mdf)
and then add this as an additional parameter to instantiate a new RandomRegion:
[18]:
numpy.random.seed(RANDOM_SEED)
rrmxc = RandomRegion(ids, num_regions=6, cardinality=cards, contiguity=w)
[19]:
rrmxc.regions
[19]:
[[27, 5, 28, 8, 11, 24, 3, 4, 2, 29],
[18, 19, 9, 21, 10, 12, 20],
[0, 22, 1, 25, 23, 26],
[13, 16, 14],
[15, 17, 7, 6],
[31, 30]]
This will also have the same cardinality as the HANSON03 partition:
[20]:
set([len(region) for region in rrmxc.regions]) == set(cards)
[20]:
True
But, more importantly, we see the partition yields spatially connected components for the regions:
[21]:
region_labels(mdf, rrmxc, name="rrmxc")
mdf.plot(figsize=(8, 6), column="rrmxc", categorical=True).axis("off");
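Beyond visual inspection, connectivity can also be checked programmatically. Below is a minimal sketch (not part of the original notebook) that restricts the Queen weights to each region with libpysal's w_subset and asserts a single connected component:

# sketch: each contiguity-constrained region should form exactly one
# connected component under the Queen weights
for region in rrmxc.regions:
    w_sub = libpysal.weights.w_subset(w, region, silence_warnings=True)
    assert w_sub.n_components == 1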
Comparing Partitions
We can now return to the question of how well the HANSON03 partition captures the spatial structure of state incomes in Mexico. Two alternative random partitions have been constructed using RandomRegion, and we can compare these with the HANSON03 partition:
[22]:
f, axs = plt.subplots(1, 3, figsize=(12, 8))
ax1, ax2, ax3 = axs
mdf.plot(column="HANSON03", categorical=True, ax=ax1)
ax1.set_axis_off()
ax1.set_aspect("equal")
ax1.set_title("HANSON03")
mdf.plot(column="rrmx", categorical=True, ax=ax2)
ax2.set_axis_off()
ax2.set_aspect("equal")
ax2.set_title("Cardinality Constrained")
mdf.plot(column="rrmxc", categorical=True, ax=ax3)
ax3.set_axis_off()
ax3.set_aspect("equal")
ax3.set_title("Cardinality and Contiguity Constrained")
plt.subplots_adjust(wspace=-0.2);
In order to judge the HANSON03 regions against those from the two random solutions, we need some objective function to serve as our benchmark. Given the interest in the spatial distribution of incomes, we might ask whether a partition minimizes the internal heterogeneity of the incomes of the states belonging to each set.
We can use the PySAL inequality package to implement this measure. Let's focus on the last year in the sample and pull out the state incomes:
[23]:
y = mdf.PCGDP2000
The TheilD statistic decomposes the inequality of the 32 state incomes into two parts:
inequality between states belonging to different regions (inequality between regions)
inequality between states belonging to the same region (inequality within regions)
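For reference, the decomposition can be written out explicitly (a standard formulation; the notebook itself does not display it). With income shares $s_i = y_i / \sum_j y_j$ and $\omega$ mutually exclusive and exhaustive regions,

$$T = \sum_{i=1}^{n} s_i \ln(n \, s_i) = \underbrace{\sum_{g=1}^{\omega} s_g \ln\!\left(\frac{n}{n_g} \, s_g\right)}_{\text{between regions (bg)}} + \underbrace{\sum_{g=1}^{\omega} s_g \, T_g}_{\text{within regions (wg)}},$$

where $s_g$ and $n_g$ are the income share and the number of states of region $g$, and $T_g$ is the Theil index computed over the states within region $g$.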
We can calculate this for the original partition:
[24]:
t_hanson = inequality.theil.TheilD(y, mdf.HANSON03)
The overall level of inequality is:
[25]:
t_hanson.T
[25]:
0.10660832349588024
which is decomposed into the “between-regions” component:
[26]:
t_hanson.bg
[26]:
array([0.05373534])
and the “within-regions” component:
[27]:
t_hanson.wg
[27]:
array([0.05287298])
For this partition, the two components are roughly equal. How does this compare to a random partition? Well, for a random partition with the same cardinality structure as HANSON03, we can carry out the same decomposition:
[28]:
t_rrmx = inequality.theil.TheilD(y, mdf.rrmx)
The level of overall inequality will be the same:
[29]:
t_rrmx.T
[29]:
0.10660832349588024
However, the decomposition is very different now, with the within component being much larger than the between component.
[30]:
t_rrmx.bg
[30]:
array([0.02096667])
[31]:
t_rrmx.wg
[31]:
array([0.08564165])
How about when both cardinality and contiguity are taken into account when developing the random partition? We can pass the partition definition for this solution into the decomposition:
[32]:
t_rrmxc = inequality.theil.TheilD(y, mdf.rrmxc)
Again, the total inequality remains the same:
[33]:
t_rrmxc.T
[33]:
0.10660832349588024
and the within-region component is even larger than in either the HANSON03 or the rrmx partition:
[34]:
t_rrmxc.bg
[34]:
array([0.01313884])
[35]:
t_rrmxc.wg
[35]:
array([0.09346949])
Generating a Reference Distribution
Both of the comparison partitions are examples of random partitions, subject to matching the cardinality (rrmx) or the cardinality and contiguity constraints (rrmxc). As such, the generated partitions should be viewed as samples from the population of all possible random partitions that also respect those constraints. Rather than relying on a single draw from each of these distributions, as we have done up to now, we can sample repeatedly to develop an empirical reference distribution against which to compare the original HANSON03 partition. In other words, we can ask if the latter partition performs any better than a random partition that respects the cardinality and contiguity constraints.
We can do this by adding a permutations argument to the class RandomRegions. Note the plural here: this class generates permutations solutions for the random regions, each respecting the relevant cardinality and contiguity constraints. We will generate permutations=99 solutions, calculate the inequality decomposition for each of the 99 partitions, and extract the wg component to develop the reference distribution against which we can evaluate the observed wg value from the HANSON03 partition:
[36]:
numpy.random.seed(RANDOM_SEED)
rrmxcs = RandomRegions(
ids, num_regions=6, cardinality=cards, contiguity=w, permutations=99
)
with warnings.catch_warnings():
warnings.simplefilter("ignore")
wg = []
for i, solution in enumerate(rrmxcs.solutions_feas):
name = f"rrmxc_{i}"
region_labels(mdf, solution, name=name)
wg.append(inequality.theil.TheilD(y, mdf[name]).wg)
wg = numpy.array(wg)
The mean of the within-region inequality component is:
[37]:
wg.mean()
[37]:
0.08499903576350978
which is larger than the observed value for HANSON03:
[38]:
t_hanson.wg
[38]:
array([0.05287298])
Examining the empirical distribution of the within-region inequality component for the random partitions:
[39]:
rdf = pandas.DataFrame(data=wg, columns=["Within-region inequality"])
[40]:
_ = seaborn.displot(
rdf, kind="kde", legend=False, x="Within-region inequality", rug=True
);
provides a more comprehensive benchmark for evaluating the HANSON03 partition. We see that the observed within-region component is extreme relative to what we would expect if the partitions were formed randomly subject to the cardinality and contiguity constraints:
[41]:
t_hanson.wg
[41]:
array([0.05287298])
[42]:
wg.min()
[42]:
0.052392123130697854
We could even formalize this statement by developing a pseudo p-value for the observed component: we count the random partitions whose within-region component is at least as small as the observed value, add one for the observed partition itself, and divide by the number of permutations plus one.
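In symbols, this is the standard pseudo significance construction:

$$p = \frac{1 + \#\{\, wg^{(i)} \le wg_{\text{obs}} \,\}}{\text{permutations} + 1},$$

where $wg^{(i)}$ is the within-region component of the $i$-th random solution and $wg_{\text{obs}}$ is the value for the HANSON03 partition.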
[43]:
pvalue = (1 + (wg <= t_hanson.wg[0]).sum()) / 100
pvalue
[43]:
0.02
which would lead us to conclude that the observed value is significantly different from what would be expected under a random region null.
So, yes, the HANSON03 partition does appear to be capturing the spatial pattern of incomes in Mexico, in the sense that its smaller within-region inequality component is not due to random chance.
2. Demonstrating the max_swaps parameter
Generate a 20 x 20 lattice (via the spenc utilities bundled in spopt.region) with synthetic data and plot.
[44]:
hori, vert = 20, 20
n_polys = hori * vert
gdf = spopt.region.spenclib.utils.lattice(hori, vert)
numpy.random.seed(RANDOM_SEED)
gdf["synth_data"] = numpy.random.laplace(100, 10, n_polys)
gdf.head()
[44]:
geometry | synth_data | |
---|---|---|
0 | POLYGON ((0.00000 0.00000, 1.00000 0.00000, 1.... | 119.606435 |
1 | POLYGON ((0.00000 1.00000, 1.00000 1.00000, 1.... | 95.423219 |
2 | POLYGON ((0.00000 2.00000, 1.00000 2.00000, 1.... | 89.998863 |
3 | POLYGON ((0.00000 3.00000, 1.00000 3.00000, 1.... | 91.062546 |
4 | POLYGON ((0.00000 4.00000, 1.00000 4.00000, 1.... | 101.455462 |
[45]:
plt.rcParams["figure.figsize"] = [10, 10]
gdf.plot(
column="synth_data",
scheme="quantiles",
cmap="Blues",
edgecolor="grey",
legend=True,
legend_kwds={"fmt": "{:.2f}"},
);
Set the number of desired regions to 4.
[46]:
n_regions = 4
Define IDs & set cardinality to 100 areas per region.
[47]:
ids = gdf.index.values.tolist()
cards = [int(n_polys / n_regions)] * n_regions
cards
[47]:
[100, 100, 100, 100]
Generate a RandomRegion instance (constraining only region count and cardinality) and plot.
[48]:
numpy.random.seed(RANDOM_SEED)
rrlat = RandomRegion(ids, num_regions=n_regions, cardinality=cards)
[49]:
region_labels(gdf, rrlat, name="rrlat")
gdf.plot(column="rrlat", cmap="tab20");
Create a Queen weights object to define spatial contiguity.
[50]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence deprecation chatter during weights construction
    w = libpysal.weights.Queen.from_dataframe(gdf)
Iterate over max_swaps=[100, 350, 650, 999] for 4 RandomRegion instances.
[51]:
for swaps in [100, 350, 650, 999]:
swap_str = f"rrlatms{swaps}"
numpy.random.seed(RANDOM_SEED)
result = RandomRegion(
ids,
num_regions=n_regions,
cardinality=cards,
contiguity=w,
compact=True,
max_swaps=swaps,
)
region_labels(gdf, result, name=swap_str)
Assign labels and titles, then plot.
[52]:
labels = [
["synth_data", "rrlatms100", "rrlatms650"],
["rrlat", "rrlatms350", "rrlatms999"],
]
titles = [
["Data", "Max 100 Swaps", "Max 650 Swaps"],
["Unconstrained", "Max 350 Swaps", "Max 999 Swaps"],
]
[53]:
rows, cols = 2, 3
f, axs = plt.subplots(rows, cols, figsize=(9, 7.5))
for i in range(rows):
for j in range(cols):
label, title = labels[i][j], titles[i][j]
_ax = axs[i, j]
if i == j and i == 0:
kws = {"scheme": "quantiles", "cmap": "Blues", "ec": "grey"}
gdf.plot(column=label, ax=_ax, **kws)
else:
gdf.plot(column=label, ax=_ax, cmap="tab20")
_ax.set_title(title)
_ax.set_axis_off()
_ax.set_aspect("equal")
plt.subplots_adjust(wspace=1, hspace=0.5)
plt.tight_layout()
As shown here, the results obtained clearly vary with the maximum number of swaps allowed during region construction.
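To go beyond visual comparison, one hypothetical way to quantify the differences is a compactness score per solution. The sketch below (not part of the original notebook) dissolves each solution's regions and computes their mean isoperimetric quotient, $4\pi \cdot \text{area} / \text{perimeter}^2$, where higher values indicate more compact regions:

# sketch: mean isoperimetric quotient of each solution's dissolved regions
for col in ["rrlatms100", "rrlatms350", "rrlatms650", "rrlatms999"]:
    regions = gdf.dissolve(by=col)  # one (multi)polygon per region label
    ipq = 4 * numpy.pi * regions.area / regions.length**2
    print(col, round(ipq.mean(), 3))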