This page was generated from notebooks/maxp.ipynb. Interactive online version: Binder badge

Max-P Regionalization

Authors: Sergio Rey, Xin Feng, James Gaboardi

The max-p problem involves the clustering of a set of geographic areas into the maximum number of homogeneous regions such that the value of a spatially extensive regional attribute is above a predefined threshold value. The spatially extensive attribute can be specified to ensure that each region contains sufficient population size, or a minimum number of enumeration units. The number of regions \(p\) is endogenous to the problem and is useful for regionalization problems where the analyst does not require a fixed number of regions a-priori.

Originally formulated as a mixed-integer problem in Duque, Anselin, Rey (2012), max-p is an NP-hard problem and exact solutions are only feasible for small problem sizes. As such, a number of heuristic solution approaches have been suggested. PySAL implements the heuristic approach described in Wei, Rey, and Knaap (2020).

[1]:
%config InlineBackend.figure_format = "retina"
%load_ext watermark
%watermark
Last updated: 2023-12-10T13:56:28.684909-05:00

Python implementation: CPython
Python version       : 3.12.0
IPython version      : 8.18.0

Compiler    : Clang 15.0.7
OS          : Darwin
Release     : 23.1.0
Machine     : x86_64
Processor   : i386
CPU cores   : 8
Architecture: 64bit

[2]:
import warnings

import geopandas
import libpysal
import matplotlib.pyplot as plt
import numpy
import spopt
from spopt.region import MaxPHeuristic as MaxP

RANDOM_SEED = 123456

%watermark -w
%watermark -iv
Watermark: 2.4.3

matplotlib: 3.8.2
spopt     : 0.5.1.dev53+g5cadae7
libpysal  : 4.9.2
numpy     : 1.26.2
geopandas : 0.14.1

Mexican State Regional Income Clustering

To illustrate maxp we utilize data on regional incomes for Mexican states over the period 1940-2000, originally used in Rey and Sastré-Gutiérrez (2010).

We can first explore the data by plotting the per capital gross regional domestic product (in constant USD 2000 dollars) for each year in the sample, using a quintile classification. Here we will define a function for creating subplots useful in visual comparisons, which also can solve Max-P instances and will be used again later).

[3]:
def subplotter(gdf, incrs, W, threshold=5, top_n=2, seed=RANDOM_SEED):
    """Helper plotting function, also solves MaxP instances if desired."""
    rows, cols = incrs.shape
    f, axs = plt.subplots(rows, cols, figsize=(9, 14))
    for i in range(rows):
        for j in range(cols):
            year, _ax = incrs[i, j], axs[i, j]
            if not year:
                # plot country geographies
                _attr = "Mexico"
                gdf.plot(ax=_ax, ec="grey", fc="white")
            else:
                _attr = "PCGDP%s" % year
                if not W:
                    # plot country geographies by attributes
                    plt_kws = dict(scheme="Quantiles", cmap="GnBu", ec="grey")
                    plt_kws.update(dict(legend_kwds={"fmt": "{:.0f}"}))
                    gdf.plot(column=_attr, ax=_ax, legend=True, **plt_kws)
                else:
                    # solve a MaxP instance and plot regions
                    numpy.random.seed(seed)
                    maxp_args = gdf, W, _attr, "count", threshold
                    model = MaxP(*maxp_args, top_n=top_n)
                    model.solve()
                    label = year + "labels_"
                    gdf[label] = model.labels_
                    gdf.plot(column=label, ax=_ax, cmap="tab20")
            _ax.set_title(_attr)
            _ax.set_axis_off()
            _ax.set_aspect("equal")
    f.subplots_adjust(wspace=-0.7, hspace=-0.65)
    f.tight_layout()
[4]:
pth = libpysal.examples.get_path("mexicojoin.shp")
mexico = geopandas.read_file(pth)
mexico.head()
[4]:
POLY_ID AREA CODE NAME PERIMETER ACRES HECTARES PCGDP1940 PCGDP1950 PCGDP1960 ... GR9000 LPCGDP40 LPCGDP50 LPCGDP60 LPCGDP70 LPCGDP80 LPCGDP90 LPCGDP00 TEST geometry
0 1 7.252751e+10 MX02 Baja California Norte 2040312.385 1.792187e+07 7252751.376 22361.0 20977.0 17865.0 ... 0.05 4.35 4.32 4.25 4.40 4.47 4.43 4.48 1.0 MULTIPOLYGON (((-113.13972 29.01778, -113.2405...
1 2 7.225988e+10 MX03 Baja California Sur 2912880.772 1.785573e+07 7225987.769 9573.0 16013.0 16707.0 ... 0.00 3.98 4.20 4.22 4.39 4.46 4.41 4.42 2.0 MULTIPOLYGON (((-111.20612 25.80278, -111.2302...
2 3 2.731957e+10 MX18 Nayarit 1034770.341 6.750785e+06 2731956.859 4836.0 7515.0 7621.0 ... -0.05 3.68 3.88 3.88 4.04 4.13 4.11 4.06 3.0 MULTIPOLYGON (((-106.62108 21.56531, -106.6475...
3 4 7.961008e+10 MX14 Jalisco 2324727.436 1.967200e+07 7961008.285 5309.0 8232.0 9953.0 ... 0.03 3.73 3.92 4.00 4.21 4.32 4.30 4.33 4.0 POLYGON ((-101.52490 21.85664, -101.58830 21.7...
4 5 5.467030e+09 MX01 Aguascalientes 313895.530 1.350927e+06 546702.985 10384.0 6234.0 8714.0 ... 0.13 4.02 3.79 3.94 4.21 4.32 4.32 4.44 5.0 POLYGON ((-101.84620 22.01176, -101.96530 21.8...

5 rows × 35 columns

[5]:
mxgdp_years = [str(x) for x in range(1940, 2010, 10)]
mxgdp_years = numpy.array([None] + mxgdp_years)
mxgdp_years_grid = mxgdp_years.reshape(4, 2)
mxgdp_years_grid
[5]:
array([[None, '1940'],
       ['1950', '1960'],
       ['1970', '1980'],
       ['1990', '2000']], dtype=object)

Regional incomes for Mexican states over the period 1940-2000

[6]:
subplotter(mexico, mxgdp_years_grid, None)
../_images/notebooks_maxp_8_0.png

In general terms, the north-south divide in incomes is present in each of the 7 decades. There is some variation in states moving across quintiles however, and this is true at both the bottom and top of the state income distribution.

To develop a holistic view of the Mexican space economy over this timespan, we can try to form a set of spatially connected regions that maximizes the internal socieconomic levels of the states belonging to each region.

Pooled Regionalization

To develop our holistic view, we can treat the six cross-sections as a multidimensional array and seek to cluster 32 Mexican states into the maximum number of regions such that each region as at least 6 = 32 // 5 states and homogeneity in per capita gross regional product over 1940-2000 is maximized.

We first define the variables in the dataframe that will be used to measure regional homogeneity:

[7]:
attrs_name = [f"PCGDP{year}" for year in mxgdp_years[1:]]
attrs_name
[7]:
['PCGDP1940',
 'PCGDP1950',
 'PCGDP1960',
 'PCGDP1970',
 'PCGDP1980',
 'PCGDP1990',
 'PCGDP2000']

Next, we specify a number of parameters that will serve as input to the max-p model.

A spatial weights object expresses the spatial connectivity of the states:

[8]:
with warnings.catch_warnings(record=True):
    w = libpysal.weights.Queen.from_dataframe(mexico)

The remaining arguments are the minimum number of states each region must have (threshold):

[9]:
threshold = 6

and the number of the top candidate regions to consider when assigning enclaves (top_n):

[10]:
top_n = 2

We create the attribute count which will serve as the threshold attribute which we add to the dataframe:

[11]:
mexico["count"] = 1
threshold_name = "count"

The model can then be instantiated and solved:

[12]:
numpy.random.seed(RANDOM_SEED)
model = MaxP(mexico, w, attrs_name, threshold_name, threshold, top_n)
model.solve()
[13]:
mexico["maxp_new"] = model.labels_
[14]:
mexico[["maxp_new", "AREA"]].groupby(by="maxp_new").count()
[14]:
AREA
maxp_new
1 6
2 7
3 6
4 7
5 6
[15]:
model.p
[15]:
5
[16]:
mexico.plot(figsize=(8, 5), column="maxp_new", cmap="tab20", edgecolor="w").axis("off");
../_images/notebooks_maxp_24_0.png

The model solution results in five regions, three of which have six states, and two with seven states each. Each region is a spatially connected component, as required by the max-p problem.

Change threshold to a minimum of 3 states per region

[17]:
numpy.random.seed(RANDOM_SEED)
threshold = 3
model = MaxP(mexico, w, attrs_name, threshold_name, threshold, top_n)
model.solve()
[18]:
mexico["maxp_3"] = model.labels_
mexico[["maxp_3", "AREA"]].groupby(by="maxp_3").count()
[18]:
AREA
maxp_3
1 3
2 3
3 4
4 3
5 3
6 3
7 4
8 3
9 3
10 3
[19]:
model.p
[19]:
10
[20]:
mexico.plot(figsize=(8, 5), column="maxp_3", cmap="tab20", edgecolor="w").axis("off");
../_images/notebooks_maxp_29_0.png

Year-by-Year Regionalization while Varying Parameters

Vary minimum in-region threshold (5, 10, 15); hold enclave assignment constant (2)

5 states

[21]:
region_args = mexico, mxgdp_years_grid, w
[22]:
subplotter(*region_args, threshold=5, top_n=2)
../_images/notebooks_maxp_32_0.png

10 states

[23]:
subplotter(*region_args, threshold=10, top_n=2)
../_images/notebooks_maxp_34_0.png

15 states

[24]:
subplotter(*region_args, threshold=15, top_n=2)
../_images/notebooks_maxp_36_0.png

Vary state enclave assignments (1, 3, 5); hold minimum in-region threshold constant (5)

1 state

[25]:
subplotter(*region_args, threshold=5, top_n=1)
../_images/notebooks_maxp_38_0.png

3 states

[26]:
subplotter(*region_args, threshold=5, top_n=3)
../_images/notebooks_maxp_40_0.png

5 states

[27]:
subplotter(*region_args, threshold=5, top_n=5)
../_images/notebooks_maxp_42_0.png

As shown here, varying the threshold and top_n parameters both have a demonstrable effect regionalization outcomes.