This page was generated from notebooks/randomregion.ipynb.

Random Regions

Authors: Serge Rey, James Gaboardi

Introduction

In this notebook we demonstrate how to use spopt with empirical and synthetic examples by:

  1. Evaluating the properties of a predefined regionalization scheme.

  2. Exploring the max_swaps parameter on a simple lattice.

spopt offers functionality to generate random regions based on user-defined constraints. There are three optional parameters to constrain the regionalization: number of regions, cardinality, and contiguity. The default case simply takes a list of area IDs and randomly selects the number of regions and then allocates areas to each region. The user can also pass a vector of integers to the cardinality parameter to designate the number of areas to randomly assign to each region. The contiguity parameter takes a spatial weights object and uses that to ensure that each region is made up of spatially contiguous areas. When the contiguity constraint is enforced, it is possible to arrive at infeasible solutions; the maxiter parameter can be set to make multiple attempts to find a feasible solution.
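
Hypothetically, for a small set of nine areas, these constraints can be layered on incrementally. The following is a minimal sketch assuming nothing beyond the parameters just described (the area IDs and cardinalities are invented for illustration):

from spopt.region import RandomRegion

area_ids = list(range(9))  # nine hypothetical area IDs

# default case: the number of regions and the allocation are both random
rr_default = RandomRegion(area_ids)

# constrain the number of regions only
rr_n = RandomRegion(area_ids, num_regions=3)

# additionally constrain region sizes; they must sum to the number of areas
rr_card = RandomRegion(area_ids, num_regions=3, cardinality=[4, 3, 2])

# the resulting partition is stored in the regions attribute
print(rr_card.regions)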

We begin with importing the relevant packages:

[1]:
%config InlineBackend.figure_format = "retina"
%load_ext watermark
%watermark
Last updated: 2023-12-10T14:07:40.178142-05:00

Python implementation: CPython
Python version       : 3.12.0
IPython version      : 8.18.0

Compiler    : Clang 15.0.7
OS          : Darwin
Release     : 23.1.0
Machine     : x86_64
Processor   : i386
CPU cores   : 8
Architecture: 64bit

[2]:
import geopandas
import inequality
import libpysal
import matplotlib
import matplotlib.pyplot as plt
import numpy
import pandas
import seaborn
import spopt
from spopt.region import RandomRegion, RandomRegions
import warnings

RANDOM_SEED = 12345

%watermark -w
%watermark -iv
Watermark: 2.4.3

numpy     : 1.26.2
pandas    : 2.1.3
spopt     : 0.5.1.dev53+g5cadae7
geopandas : 0.14.1
matplotlib: 3.8.2
inequality: 1.0.1
libpysal  : 4.9.2
seaborn   : 0.13.0

1. Evaluating Regional Partitions: The case of Mexican state incomes

We will be using the built-in example data set mexico from the libpysal package. First, we get a high-level overview of the example:

[3]:
libpysal.examples.explain("mexico")
mexico
======

Decennial per capita incomes of Mexican states 1940-2000
--------------------------------------------------------

* mexico.csv: attribute data. (n=32, k=13)
* mexico.gal: spatial weights in GAL format.
* mexicojoin.shp: Polygon shapefile. (n=32)

Data used in Rey, S.J. and M.L. Sastre Gutierrez. (2010) "Interregional inequality dynamics in Mexico." Spatial Economic Analysis, 5: 277-298.

[4]:
mdf = geopandas.read_file(libpysal.examples.get_path("mexicojoin.shp"))
[5]:
mdf.head()
[5]:
POLY_ID AREA CODE NAME PERIMETER ACRES HECTARES PCGDP1940 PCGDP1950 PCGDP1960 ... GR9000 LPCGDP40 LPCGDP50 LPCGDP60 LPCGDP70 LPCGDP80 LPCGDP90 LPCGDP00 TEST geometry
0 1 7.252751e+10 MX02 Baja California Norte 2040312.385 1.792187e+07 7252751.376 22361.0 20977.0 17865.0 ... 0.05 4.35 4.32 4.25 4.40 4.47 4.43 4.48 1.0 MULTIPOLYGON (((-113.13972 29.01778, -113.2405...
1 2 7.225988e+10 MX03 Baja California Sur 2912880.772 1.785573e+07 7225987.769 9573.0 16013.0 16707.0 ... 0.00 3.98 4.20 4.22 4.39 4.46 4.41 4.42 2.0 MULTIPOLYGON (((-111.20612 25.80278, -111.2302...
2 3 2.731957e+10 MX18 Nayarit 1034770.341 6.750785e+06 2731956.859 4836.0 7515.0 7621.0 ... -0.05 3.68 3.88 3.88 4.04 4.13 4.11 4.06 3.0 MULTIPOLYGON (((-106.62108 21.56531, -106.6475...
3 4 7.961008e+10 MX14 Jalisco 2324727.436 1.967200e+07 7961008.285 5309.0 8232.0 9953.0 ... 0.03 3.73 3.92 4.00 4.21 4.32 4.30 4.33 4.0 POLYGON ((-101.52490 21.85664, -101.58830 21.7...
4 5 5.467030e+09 MX01 Aguascalientes 313895.530 1.350927e+06 546702.985 10384.0 6234.0 8714.0 ... 0.13 4.02 3.79 3.94 4.21 4.32 4.32 4.44 5.0 POLYGON ((-101.84620 22.01176, -101.96530 21.8...

5 rows × 35 columns

This data set records per capita Gross Domestic Product (PCGDP) for each decade from 1940 to 2000 for 32 Mexican states.
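
Before mapping, the income columns can be summarized quickly; this short sketch uses only standard pandas operations:

# summary statistics for the per capita GDP columns
mdf.filter(like="PCGDP").describe()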

[6]:
mdf.columns
[6]:
Index(['POLY_ID', 'AREA', 'CODE', 'NAME', 'PERIMETER', 'ACRES', 'HECTARES',
       'PCGDP1940', 'PCGDP1950', 'PCGDP1960', 'PCGDP1970', 'PCGDP1980',
       'PCGDP1990', 'PCGDP2000', 'HANSON03', 'HANSON98', 'ESQUIVEL99', 'INEGI',
       'INEGI2', 'MAXP', 'GR4000', 'GR5000', 'GR6000', 'GR7000', 'GR8000',
       'GR9000', 'LPCGDP40', 'LPCGDP50', 'LPCGDP60', 'LPCGDP70', 'LPCGDP80',
       'LPCGDP90', 'LPCGDP00', 'TEST', 'geometry'],
      dtype='object')

We can plot the spatial distribution of regional incomes in the last decade of the sample:

[7]:
mdf.plot(
    figsize=(8, 6),
    column="PCGDP2000",
    scheme="quantiles",
    cmap="Blues",
    edgecolor="grey",
    legend=True,
    legend_kwds={"fmt": "{:.0f}"},
).axis("off")
[7]:
(-118.64174423217773, -85.21563186645508, 13.6420334815979, 33.6293231010437)
../_images/notebooks_randomregion_11_1.png

Here we see the north-south divide in the Mexican space economy. This pattern has been the subject of much recent research, and a common methodological strategy in these studies is to partition the states into mutually exclusive and exhaustive regions. The motivation is similar to that in traditional clustering where one attempts to group like objects together. Here the regional sets should be composed of states that are more similar to other members of the same set, and distinct from those states in other sets.

Our dataset has a number of regionalization schemes from the published literature. Here we will focus on one of them, HANSON03. We can plot this partition using a categorical scheme:

[8]:
mdf.plot(figsize=(8, 6), column="HANSON03", categorical=True).axis("off");
../_images/notebooks_randomregion_13_0.png

Here each color symbolizes membership in a different region. We can examine the cardinality structure of this partition:

[9]:
cards = mdf.groupby(by="HANSON03").count().NAME.values.tolist()
cards
[9]:
[6, 7, 10, 2, 3, 4]

Thus, we see there are six regions (sets) of different sizes: the smallest region has only two states, while the largest has 10. We can explore the regional definitions in tabular form:

[10]:
mdf[["NAME", "HANSON03", "PCGDP2000"]].sort_values(by="HANSON03")
[10]:
NAME HANSON03 PCGDP2000
0 Baja California Norte 1.0 29855.0
29 Nuevo Leon 1.0 38672.0
23 Chihuahua 1.0 30735.0
30 Tamaulipas 1.0 23546.0
24 Coahuila De Zaragoza 1.0 28460.0
22 Sonora 1.0 24068.0
2 Nayarit 2.0 11478.0
1 Baja California Sur 2.0 26103.0
27 Zacatecas 2.0 11130.0
26 Durango 2.0 17379.0
25 Sinaloa 2.0 15242.0
4 Aguascalientes 2.0 27782.0
28 San Luis Potosi 2.0 15866.0
3 Jalisco 3.0 21610.0
6 Queretaro de Arteaga 3.0 26149.0
5 Guanajuato 3.0 15585.0
7 Hidalgo 3.0 12348.0
17 Tlaxcala 3.0 11701.0
31 Veracruz-Llave 3.0 12191.0
8 Michoacan de Ocampo 3.0 11838.0
12 Morelos 3.0 18170.0
15 Puebla 3.0 15685.0
11 Colima 3.0 21358.0
10 Distrito Federal 4.0 54349.0
9 Mexico 4.0 16322.0
21 Chiapas 5.0 8684.0
19 Oaxaca 5.0 9010.0
18 Guerrero 5.0 11820.0
14 Campeche 6.0 36163.0
13 Yucatan 6.0 17509.0
20 Tabasco 6.0 13360.0
16 Quintana Roo 6.0 33442.0

How good is HANSON03 as a partition?

One question we might ask about the partition is how well it captures the spatial distribution of incomes in Mexico. To answer this question, we are going to leverage the random regions functionality in spopt.

A random partition respecting cardinality constraints

We will first create a new random partition that has the same cardinality structure/distribution as the HANSON03 partition:

[11]:
ids = mdf.index.values.tolist()
[12]:
numpy.random.seed(RANDOM_SEED)
rrmx = RandomRegion(ids, num_regions=6, cardinality=cards)

This creates a new RandomRegion instance. One of its attributes is the definition of the regions in the partition:

[13]:
rrmx.regions
[13]:
[[27, 12, 18, 3, 15, 8],
 [0, 25, 21, 20, 7, 6, 24],
 [23, 10, 13, 11, 19, 16, 26, 14, 17, 22],
 [28, 31],
 [30, 9, 4],
 [1, 29, 5, 2]]

This will have the same cardinality as the HANSON03 partition:

[14]:
sorted(len(region) for region in rrmx.regions) == sorted(cards)
[14]:
True

A simple helper function lets us attach the region labels for this new partition to the dataframe:

[15]:
def region_labels(df, solution, name="labels_"):
    """Attach a solution's region labels to ``df`` as a new column."""
    labels_ = numpy.zeros((df.shape[0],), int)
    for i, region in enumerate(solution.regions):
        # each region is a list of area indices; tag them with the region ID
        labels_[region] = i
    df[name] = labels_

And, then we can visualize the new partition:

[16]:
region_labels(mdf, rrmx, name="rrmx")
mdf.plot(figsize=(8, 6), column="rrmx", categorical=True).axis("off");
../_images/notebooks_randomregion_28_0.png

While the rrmx and HANSON03 partitions have identical cardinality structures, their spatial distributions are radically different: the rrmx map looks nothing like the HANSON03 map.

One of the key differences is that the sets in the HANSON03 partition are each spatially connected components, while this is not the case for the “regions” in the rrmx partition.

A random partition respecting cardinality and contiguity constraints

A second way to form a random partition is to add an additional constraint in the form of the spatial connectivity between the states. To do so, we will construct a spatial weights object using the Queen contiguity criterion:

[17]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # suppress the deprecation warnings emitted during construction
    w = libpysal.weights.Queen.from_dataframe(mdf)

and then add this as an additional parameter to instantiate a new RandomRegion:

[18]:
numpy.random.seed(RANDOM_SEED)
rrmxc = RandomRegion(ids, num_regions=6, cardinality=cards, contiguity=w)
[19]:
rrmxc.regions
[19]:
[[27, 5, 28, 8, 11, 24, 3, 4, 2, 29],
 [18, 19, 9, 21, 10, 12, 20],
 [0, 22, 1, 25, 23, 26],
 [13, 16, 14],
 [15, 17, 7, 6],
 [31, 30]]

This will also have the same cardinality as the HANSON03 partition:

[20]:
sorted(len(region) for region in rrmxc.regions) == sorted(cards)
[20]:
True

But, more importantly, we see the partition yields spatially connected components for the regions:

[21]:
region_labels(mdf, rrmxc, name="rrmxc")
mdf.plot(figsize=(8, 6), column="rrmxc", categorical=True).axis("off");
../_images/notebooks_randomregion_37_0.png
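
We can verify this claim directly. Here is a minimal sketch, assuming only libpysal's w_subset utility and the n_components attribute of weights objects:

# extract the sub-graph of the Queen weights for each region and count
# its connected components; a value of 1 means the region is connected
for region in rrmxc.regions:
    region_w = libpysal.weights.w_subset(w, region)
    print(region, region_w.n_components)

Running the same check on rrmx.regions would typically report more than one component per region.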

Comparing Partitions

We can now return to the question of how well the HANSON03 partition captures the spatial structure of state incomes in Mexico. Two alternative random partitions have been constructed using RandomRegion, and we can compare them with the original:

[22]:
f, axs = plt.subplots(1, 3, figsize=(12, 8))
ax1, ax2, ax3 = axs

mdf.plot(column="HANSON03", categorical=True, ax=ax1)
ax1.set_axis_off()
ax1.set_aspect("equal")
ax1.set_title("HANSON03")

mdf.plot(column="rrmx", categorical=True, ax=ax2)
ax2.set_axis_off()
ax2.set_aspect("equal")
ax2.set_title("Cardinality Constrained")

mdf.plot(column="rrmxc", categorical=True, ax=ax3)
ax3.set_axis_off()
ax3.set_aspect("equal")
ax3.set_title("Cardinality and Contiguity Constrained")

plt.subplots_adjust(wspace=-0.2);
../_images/notebooks_randomregion_39_0.png

In order to judge the HANSON03 regions against those from the two random solutions, we need some objective function to serve as our benchmark. Given the interest in the spatial distribution of incomes, we might ask whether a partition minimizes the internal heterogeneity of incomes for the states belonging to each set.

We can use the PySAL inequality package to implement this measure.

Let’s focus on the last year in the sample and pull out the state incomes:

[23]:
y = mdf.PCGDP2000

The TheilD statistic decomposes the inequality of the 32 state incomes into two parts (formalized just after this list):

  • inequality between states belonging to different regions (inequality between regions)

  • inequality between states belonging to the same region (inequality within regions)
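
Formally, in the standard formulation of the Theil index (which the inequality package follows up to implementation details), with income shares $s_i = y_i / \sum_j y_j$ over $n$ observations, and groups $g$ of size $n_g$ with income shares $s_g = \sum_{i \in g} s_i$:

$$T = \sum_{i=1}^{n} s_i \ln(n s_i) = \underbrace{\sum_g s_g \ln\left(\frac{n}{n_g}\, s_g\right)}_{\text{between groups (bg)}} + \underbrace{\sum_g s_g \sum_{i \in g} \frac{s_i}{s_g} \ln\left(n_g \frac{s_i}{s_g}\right)}_{\text{within groups (wg)}}$$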

We can calculate this for the original partition:

[24]:
t_hanson = inequality.theil.TheilD(y, mdf.HANSON03)

The overall level of inequality is:

[25]:
t_hanson.T
[25]:
0.10660832349588024

which is decomposed into the “between-regions” component:

[26]:
t_hanson.bg
[26]:
array([0.05373534])

and the “within-regions” component:

[27]:
t_hanson.wg
[27]:
array([0.05287298])
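
As a quick sanity check, the decomposition is additive, so the two components sum to the total:

# between plus within equals total inequality (up to floating point)
print(t_hanson.bg[0] + t_hanson.wg[0])  # ~0.1066, matching t_hanson.T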

For this partition, the two components are roughly equal. How does this compare to a random partition? Well, for a random partition with the same cardinality structure as HANSON03 we can carry out the same decomposition:

[28]:
t_rrmx = inequality.theil.TheilD(y, mdf.rrmx)

The level of overall inequality will be the same:

[29]:
t_rrmx.T
[29]:
0.10660832349588024

However, the decomposition is very different now, with the within component being much larger than the between component.

[30]:
t_rrmx.bg
[30]:
array([0.02096667])
[31]:
t_rrmx.wg
[31]:
array([0.08564165])

How about when both cardinality and contiguity are taken into account in developing the random partition? We can pass the partition definition for this solution into the decomposition:

[32]:
t_rrmxc = inequality.theil.TheilD(y, mdf.rrmxc)

Again, the total inequality remains the same:

[33]:
t_rrmxc.T
[33]:
0.10660832349588024

and the within-region component is even larger than in either the HANSON03 or rrmx partitions:

[34]:
t_rrmxc.bg
[34]:
array([0.01313884])
[35]:
t_rrmxc.wg
[35]:
array([0.09346949])

Generating a Reference Distribution

Both of the comparison partitions are examples of random partitions, subject to matching the cardinality constraint (rrmx) or the cardinality and contiguity constraints (rrmxc). As such, each generated partition should be viewed as a sample from the population of all possible random partitions that respect those constraints. Rather than relying on a single draw from each of these distributions, as we have done up to now, we can sample repeatedly and develop an empirical reference distribution against which to compare the original HANSON03 partition. In other words, we can ask whether the latter partition performs any better than a random partition that respects the cardinality and contiguity constraints.

We can do this by passing a permutations argument to the RandomRegions class. Note the plural here: this class generates permutations random solutions, each respecting the relevant cardinality and contiguity constraints. We will generate permutations=99 solutions, calculate the inequality decomposition for each of the 99 partitions, and extract the wg component to build the reference distribution against which we can evaluate the observed wg value from the HANSON03 partition:

[36]:
numpy.random.seed(RANDOM_SEED)
rrmxcs = RandomRegions(
    ids, num_regions=6, cardinality=cards, contiguity=w, permutations=99
)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    wg = []
    for i, solution in enumerate(rrmxcs.solutions_feas):
        name = f"rrmxc_{i}"
        region_labels(mdf, solution, name=name)
        wg.append(inequality.theil.TheilD(y, mdf[name]).wg)
wg = numpy.array(wg)

The mean of the within-region inequality component is:

[37]:
wg.mean()
[37]:
0.08499903576350978

which is larger than the observed value for HANSON03:

[38]:
t_hanson.wg
[38]:
array([0.05287298])

Examining the empirical distribution of the within-region inequality component for the random partitions:

[39]:
rdf = pandas.DataFrame(data=wg, columns=["Within-region inequality"])
[40]:
_ = seaborn.displot(
    rdf, kind="kde", legend=False, x="Within-region inequality", rug=True
);
../_images/notebooks_randomregion_72_0.png

provides a more comprehensive benchmark for evaluating the HANSON03 partition. We see that the observed within-region component is extremely low relative to what we would expect if the partitions were formed randomly, subject to the cardinality and contiguity constraints:

[41]:
t_hanson.wg
[41]:
array([0.05287298])
[42]:
wg.min()
[42]:
0.052392123130697854

We could even formalize this statement by developing a pseudo p-value for the observed component.
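
With $m = 99$ random solutions, counting the observed value as one more draw, the standard permutation-inference formula is

$$p = \frac{1 + \#\{\, wg_i \le wg_{\mathrm{obs}} \,\}}{m + 1},$$

so the denominator below is $m + 1 = 100$: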

[43]:
pvalue = (1 + (wg <= t_hanson.wg[0]).sum()) / 100
pvalue
[43]:
0.02

which would lead us to conclude that the observed value is significantly lower than what would be expected under a random region null.

So, yes, the HANSON03 partition does appear to capture the spatial pattern of incomes in Mexico, in the sense that its unusually small within-region inequality component is unlikely to be due to random chance.


2. Demonstrating the max_swaps parameter

Generate a 20 x 20 lattice with the spenclib utilities in spopt, attach synthetic data, and plot.

[44]:
hori, vert = 20, 20
n_polys = hori * vert
gdf = spopt.region.spenclib.utils.lattice(hori, vert)
numpy.random.seed(RANDOM_SEED)
gdf["synth_data"] = numpy.random.laplace(100, 10, n_polys)
gdf.head()
[44]:
geometry synth_data
0 POLYGON ((0.00000 0.00000, 1.00000 0.00000, 1.... 119.606435
1 POLYGON ((0.00000 1.00000, 1.00000 1.00000, 1.... 95.423219
2 POLYGON ((0.00000 2.00000, 1.00000 2.00000, 1.... 89.998863
3 POLYGON ((0.00000 3.00000, 1.00000 3.00000, 1.... 91.062546
4 POLYGON ((0.00000 4.00000, 1.00000 4.00000, 1.... 101.455462
[45]:
plt.rcParams["figure.figsize"] = [10, 10]
gdf.plot(
    column="synth_data",
    scheme="quantiles",
    cmap="Blues",
    edgecolor="grey",
    legend=True,
    legend_kwds={"fmt": "{:.2f}"},
);
../_images/notebooks_randomregion_81_0.png

Set the number of desired regions to 4.

[46]:
n_regions = 4

Define IDs & set cardinality to 100 areas per region.

[47]:
ids = gdf.index.values.tolist()
cards = [int(n_polys / n_regions)] * n_regions
cards
[47]:
[100, 100, 100, 100]

Generate a RandomRegion instance (constraining only region count and cardinality) and plot.

[48]:
numpy.random.seed(RANDOM_SEED)
rrlat = RandomRegion(ids, num_regions=n_regions, cardinality=cards)
[49]:
region_labels(gdf, rrlat, name="rrlat")
gdf.plot(column="rrlat", cmap="tab20");
../_images/notebooks_randomregion_88_0.png

Create a Queen weights object to define spatial contiguity.

[50]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # suppress the deprecation warnings emitted during construction
    w = libpysal.weights.Queen.from_dataframe(gdf)

Iterate over max_swaps values of 100, 350, 650, and 999, creating a contiguity-constrained RandomRegion instance for each.

[51]:
for swaps in [100, 350, 650, 999]:
    swap_str = f"rrlatms{swaps}"
    numpy.random.seed(RANDOM_SEED)
    result = RandomRegion(
        ids,
        num_regions=n_regions,
        cardinality=cards,
        contiguity=w,
        compact=True,
        max_swaps=swaps,
    )
    region_labels(gdf, result, name=swap_str)

Assign labels and titles, then plot.

[52]:
labels = [
    ["synth_data", "rrlatms100", "rrlatms650"],
    ["rrlat", "rrlatms350", "rrlatms999"],
]
titles = [
    ["Data", "Max 100 Swaps", "Max 650 Swaps"],
    ["Unconstrained", "Max 350 Swaps", "Max 999 Swaps"],
]
[53]:
rows, cols = 2, 3
f, axs = plt.subplots(rows, cols, figsize=(9, 7.5))
for i in range(rows):
    for j in range(cols):
        label, title = labels[i][j], titles[i][j]
        _ax = axs[i, j]
        if i == j and i == 0:
            kws = {"scheme": "quantiles", "cmap": "Blues", "ec": "grey"}
            gdf.plot(column=label, ax=_ax, **kws)
        else:
            gdf.plot(column=label, ax=_ax, cmap="tab20")
        _ax.set_title(title)
        _ax.set_axis_off()
        _ax.set_aspect("equal")
plt.subplots_adjust(wspace=1, hspace=0.5)
plt.tight_layout()
../_images/notebooks_randomregion_95_0.png

As shown here, the partitions produced under different max_swaps settings clearly vary. Since compact=True is passed, the cap on swaps limits how much the construction process can rearrange areas, so it directly shapes the final regions.
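
To move beyond visual inspection, the differences could be quantified. A minimal sketch, assuming nothing beyond standard geopandas operations (the isoperimetric quotient used here is one common compactness measure and is not part of spopt):

# dissolve areas into their regions and compute the isoperimetric quotient
# (4 * pi * area / perimeter**2, which equals 1 for a circle) per solution
for col in ["rrlatms100", "rrlatms350", "rrlatms650", "rrlatms999"]:
    regions = gdf.dissolve(by=col)
    ipq = 4 * numpy.pi * regions.area / regions.length**2
    print(col, round(ipq.mean(), 3))

Higher mean values would indicate more compact regions.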