# Max-P Regionalization¶

Authors: Sergio Rey, Xin Feng, James Gaboardi

The max-p problem involves the clustering of a set of geographic areas into the maximum number of homogeneous regions such that the value of a spatially extensive regional attribute is above a predefined threshold value. The spatially extensive attribute can be specified to ensure that each region contains sufficient population size, or a minimum number of enumeration units. The number of regions $$p$$ is endogenous to the problem and is useful for regionalization problems where the analyst does not require a fixed number of regions a-priori.

Originally formulated as a mixed-integer problem in Duque, Anselin, Rey (2012), max-p is an NP-hard problem and exact solutions are only feasible for small problem sizes. As such, a number of heuristic solution approaches have been suggested. PySAL implements the heuristic approach described in Wei, Rey, and Knaap (2020).

[1]:

%config InlineBackend.figure_format = "retina"
%watermark

Last updated: 2023-12-10T13:56:28.684909-05:00

Python implementation: CPython
Python version       : 3.12.0
IPython version      : 8.18.0

Compiler    : Clang 15.0.7
OS          : Darwin
Release     : 23.1.0
Machine     : x86_64
Processor   : i386
CPU cores   : 8
Architecture: 64bit


[2]:

import warnings

import geopandas
import libpysal
import matplotlib.pyplot as plt
import numpy
import spopt
from spopt.region import MaxPHeuristic as MaxP

RANDOM_SEED = 123456

%watermark -w
%watermark -iv

Watermark: 2.4.3

matplotlib: 3.8.2
libpysal  : 4.9.2
numpy     : 1.26.2
geopandas : 0.14.1



## Mexican State Regional Income Clustering¶

To illustrate maxp we utilize data on regional incomes for Mexican states over the period 1940-2000, originally used in Rey and Sastré-Gutiérrez (2010).

We can first explore the data by plotting the per capital gross regional domestic product (in constant USD 2000 dollars) for each year in the sample, using a quintile classification. Here we will define a function for creating subplots useful in visual comparisons, which also can solve Max-P instances and will be used again later).

[3]:

def subplotter(gdf, incrs, W, threshold=5, top_n=2, seed=RANDOM_SEED):
"""Helper plotting function, also solves MaxP instances if desired."""
rows, cols = incrs.shape
f, axs = plt.subplots(rows, cols, figsize=(9, 14))
for i in range(rows):
for j in range(cols):
year, _ax = incrs[i, j], axs[i, j]
if not year:
# plot country geographies
_attr = "Mexico"
gdf.plot(ax=_ax, ec="grey", fc="white")
else:
_attr = "PCGDP%s" % year
if not W:
# plot country geographies by attributes
plt_kws = dict(scheme="Quantiles", cmap="GnBu", ec="grey")
plt_kws.update(dict(legend_kwds={"fmt": "{:.0f}"}))
gdf.plot(column=_attr, ax=_ax, legend=True, **plt_kws)
else:
# solve a MaxP instance and plot regions
numpy.random.seed(seed)
maxp_args = gdf, W, _attr, "count", threshold
model = MaxP(*maxp_args, top_n=top_n)
model.solve()
label = year + "labels_"
gdf[label] = model.labels_
gdf.plot(column=label, ax=_ax, cmap="tab20")
_ax.set_title(_attr)
_ax.set_axis_off()
_ax.set_aspect("equal")
f.tight_layout()

[4]:

pth = libpysal.examples.get_path("mexicojoin.shp")

[4]:

POLY_ID AREA CODE NAME PERIMETER ACRES HECTARES PCGDP1940 PCGDP1950 PCGDP1960 ... GR9000 LPCGDP40 LPCGDP50 LPCGDP60 LPCGDP70 LPCGDP80 LPCGDP90 LPCGDP00 TEST geometry
0 1 7.252751e+10 MX02 Baja California Norte 2040312.385 1.792187e+07 7252751.376 22361.0 20977.0 17865.0 ... 0.05 4.35 4.32 4.25 4.40 4.47 4.43 4.48 1.0 MULTIPOLYGON (((-113.13972 29.01778, -113.2405...
1 2 7.225988e+10 MX03 Baja California Sur 2912880.772 1.785573e+07 7225987.769 9573.0 16013.0 16707.0 ... 0.00 3.98 4.20 4.22 4.39 4.46 4.41 4.42 2.0 MULTIPOLYGON (((-111.20612 25.80278, -111.2302...
2 3 2.731957e+10 MX18 Nayarit 1034770.341 6.750785e+06 2731956.859 4836.0 7515.0 7621.0 ... -0.05 3.68 3.88 3.88 4.04 4.13 4.11 4.06 3.0 MULTIPOLYGON (((-106.62108 21.56531, -106.6475...
3 4 7.961008e+10 MX14 Jalisco 2324727.436 1.967200e+07 7961008.285 5309.0 8232.0 9953.0 ... 0.03 3.73 3.92 4.00 4.21 4.32 4.30 4.33 4.0 POLYGON ((-101.52490 21.85664, -101.58830 21.7...
4 5 5.467030e+09 MX01 Aguascalientes 313895.530 1.350927e+06 546702.985 10384.0 6234.0 8714.0 ... 0.13 4.02 3.79 3.94 4.21 4.32 4.32 4.44 5.0 POLYGON ((-101.84620 22.01176, -101.96530 21.8...

5 rows × 35 columns

[5]:

mxgdp_years = [str(x) for x in range(1940, 2010, 10)]
mxgdp_years = numpy.array([None] + mxgdp_years)
mxgdp_years_grid = mxgdp_years.reshape(4, 2)
mxgdp_years_grid

[5]:

array([[None, '1940'],
['1950', '1960'],
['1970', '1980'],
['1990', '2000']], dtype=object)


### Regional incomes for Mexican states over the period 1940-2000¶

[6]:

subplotter(mexico, mxgdp_years_grid, None)


In general terms, the north-south divide in incomes is present in each of the 7 decades. There is some variation in states moving across quintiles however, and this is true at both the bottom and top of the state income distribution.

To develop a holistic view of the Mexican space economy over this timespan, we can try to form a set of spatially connected regions that maximizes the internal socieconomic levels of the states belonging to each region.

## Pooled Regionalization¶

To develop our holistic view, we can treat the six cross-sections as a multidimensional array and seek to cluster 32 Mexican states into the maximum number of regions such that each region as at least 6 = 32 // 5 states and homogeneity in per capita gross regional product over 1940-2000 is maximized.

We first define the variables in the dataframe that will be used to measure regional homogeneity:

[7]:

attrs_name = [f"PCGDP{year}" for year in mxgdp_years[1:]]
attrs_name

[7]:

['PCGDP1940',
'PCGDP1950',
'PCGDP1960',
'PCGDP1970',
'PCGDP1980',
'PCGDP1990',
'PCGDP2000']


Next, we specify a number of parameters that will serve as input to the max-p model.

A spatial weights object expresses the spatial connectivity of the states:

[8]:

with warnings.catch_warnings(record=True):
w = libpysal.weights.Queen.from_dataframe(mexico)


The remaining arguments are the minimum number of states each region must have (threshold):

[9]:

threshold = 6


and the number of the top candidate regions to consider when assigning enclaves (top_n):

[10]:

top_n = 2


We create the attribute count which will serve as the threshold attribute which we add to the dataframe:

[11]:

mexico["count"] = 1
threshold_name = "count"


The model can then be instantiated and solved:

[12]:

numpy.random.seed(RANDOM_SEED)
model = MaxP(mexico, w, attrs_name, threshold_name, threshold, top_n)
model.solve()

[13]:

mexico["maxp_new"] = model.labels_

[14]:

mexico[["maxp_new", "AREA"]].groupby(by="maxp_new").count()

[14]:

AREA
maxp_new
1 6
2 7
3 6
4 7
5 6
[15]:

model.p

[15]:

5

[16]:

mexico.plot(figsize=(8, 5), column="maxp_new", cmap="tab20", edgecolor="w").axis("off");


The model solution results in five regions, three of which have six states, and two with seven states each. Each region is a spatially connected component, as required by the max-p problem.

## Change threshold to a minimum of 3 states per region¶

[17]:

numpy.random.seed(RANDOM_SEED)
threshold = 3
model = MaxP(mexico, w, attrs_name, threshold_name, threshold, top_n)
model.solve()

[18]:

mexico["maxp_3"] = model.labels_
mexico[["maxp_3", "AREA"]].groupby(by="maxp_3").count()

[18]:

AREA
maxp_3
1 3
2 3
3 4
4 3
5 3
6 3
7 4
8 3
9 3
10 3
[19]:

model.p

[19]:

10

[20]:

mexico.plot(figsize=(8, 5), column="maxp_3", cmap="tab20", edgecolor="w").axis("off");


## Year-by-Year Regionalization while Varying Parameters¶

### Vary minimum in-region threshold (5, 10, 15); hold enclave assignment constant (2)¶

5 states

[21]:

region_args = mexico, mxgdp_years_grid, w

[22]:

subplotter(*region_args, threshold=5, top_n=2)


10 states

[23]:

subplotter(*region_args, threshold=10, top_n=2)


15 states

[24]:

subplotter(*region_args, threshold=15, top_n=2)


### Vary state enclave assignments (1, 3, 5); hold minimum in-region threshold constant (5)¶

1 state

[25]:

subplotter(*region_args, threshold=5, top_n=1)


3 states

[26]:

subplotter(*region_args, threshold=5, top_n=3)


5 states

[27]:

subplotter(*region_args, threshold=5, top_n=5)


As shown here, varying the threshold and top_n parameters both have a demonstrable effect regionalization outcomes.