Regional-k-means¶

Authors: Sergio Rey, Xin Feng, James Gaboardi

Regional-k-means is K-means with the constraint that each cluster forms a spatially connected component. The algorithm is developed by Sergio Rey. This tutorial goes through the following:

1. a small synthetic example (10x10 lattice)

2. a large synthetic example (50x50 lattice)

3. an empirical example (Rio Grande do Sul, Brasil)

[1]:

from spopt.region import RegionKMeansHeuristic

import geopandas
import libpysal
import numpy


1. Small synthetic example¶

Create synthetic data¶

Create a spatial weights object for a 10*10 regular lattice.

[2]:

dim = 10
w = libpysal.weights.lat2W(dim, dim)
w.n

[2]:

100


Draw 100 random samples (the given shape is (100, 3)) from a normal (Gaussian) distribution. Then, there are three values for each lattice. They are variables in the dataframe that will be used to measure regional homogeneity.

[3]:

RANDOM_SEED = 12345
numpy.random.seed(RANDOM_SEED)
data = numpy.random.normal(size=(w.n, 3))

[4]:

data

[4]:

array([[-2.04707659e-01,  4.78943338e-01, -5.19438715e-01],
[-5.55730304e-01,  1.96578057e+00,  1.39340583e+00],
[ 9.29078767e-02,  2.81746153e-01,  7.69022568e-01],
[ 1.24643474e+00,  1.00718936e+00, -1.29622111e+00],
[ 2.74991633e-01,  2.28912879e-01,  1.35291684e+00],
[ 8.86429341e-01, -2.00163731e+00, -3.71842537e-01],
[ 1.66902531e+00, -4.38569736e-01, -5.39741446e-01],
[ 4.76985010e-01,  3.24894392e+00, -1.02122752e+00],
[-5.77087303e-01,  1.24121276e-01,  3.02613562e-01],
[ 5.23772068e-01,  9.40277775e-04,  1.34380979e+00],
[-7.13543985e-01, -8.31153539e-01, -2.37023165e+00],
[-1.86076079e+00, -8.60757398e-01,  5.60145293e-01],
[-1.26593449e+00,  1.19827125e-01, -1.06351245e+00],
[ 3.32882716e-01, -2.35941881e+00, -1.99542955e-01],
[-1.54199553e+00, -9.70735912e-01, -1.30703025e+00],
[ 2.86349747e-01,  3.77984111e-01, -7.53886535e-01],
[ 3.31285650e-01,  1.34974221e+00,  6.98766888e-02],
[ 2.46674110e-01, -1.18616011e-02,  1.00481159e+00],
[ 1.32719461e+00, -9.19261558e-01, -1.54910644e+00],
[ 2.21845987e-02,  7.58363145e-01, -6.60524328e-01],
[ 8.62580083e-01, -1.00319021e-02,  5.00093559e-02],
[ 6.70215594e-01,  8.52965032e-01, -9.55868852e-01],
[-2.34933207e-02, -2.30423388e+00, -6.52468841e-01],
[-1.21830198e+00, -1.33260971e+00,  1.07462269e+00],
[ 7.23641505e-01,  6.90001853e-01,  1.00154344e+00],
[-5.03087391e-01, -6.22274225e-01, -9.21168608e-01],
[-7.26213493e-01,  2.22895546e-01,  5.13161009e-02],
[-1.15771947e+00,  8.16706936e-01,  4.33609606e-01],
[ 1.01073695e+00,  1.82487521e+00, -9.97518248e-01],
[ 8.50591099e-01, -1.31577601e-01,  9.12414152e-01],
[ 1.88210680e-01,  2.16946144e+00, -1.14928205e-01],
[ 2.00369736e+00,  2.96101523e-02,  7.95253156e-01],
[ 1.18109754e-01, -7.48531548e-01,  5.84969738e-01],
[ 1.52676573e-01, -1.56565729e+00, -5.62540188e-01],
[-3.26641392e-02, -9.29006202e-01, -4.82572646e-01],
[-3.62638461e-02,  1.09539006e+00,  9.80928477e-01],
[-5.89487686e-01,  1.58170009e+00, -5.28734826e-01],
[ 4.57001871e-01,  9.29968759e-01, -1.56927061e+00],
[-1.02248698e+00, -4.02826924e-01,  2.20486863e-01],
[-1.93401108e-01,  6.69158336e-01, -1.64898482e+00],
[-2.25279725e+00, -1.16683222e+00,  3.53607102e-01],
[ 7.02110171e-01, -2.74569205e-01, -1.39142188e-01],
[ 1.07657222e-01, -6.06545125e-01, -4.17064408e-01],
[-1.70070368e-02, -1.22414528e+00, -1.80083991e+00],
[ 1.63473620e+00,  9.89008302e-01,  4.57940143e-01],
[ 5.55154410e-01,  1.30671972e+00, -4.40553570e-01],
[-3.01350280e-01,  4.98791490e-01, -8.23991040e-01],
[ 1.32056584e+00,  5.07964786e-01, -6.53437675e-01],
[ 1.86979514e-01, -3.91725249e-01, -2.72292975e-01],
[-1.71414356e-02,  6.80320749e-01,  6.35512357e-01],
[-7.57176502e-01,  7.18085834e-01, -3.04273076e-01],
[-1.67779025e+00,  4.26986085e-01, -1.56373985e+00],
[-3.67487521e-01,  1.04591253e+00,  1.21995436e+00],
[-2.47699116e-01, -4.16232132e-01, -1.16747004e-01],
[-1.84478762e+00,  2.06870785e+00, -7.76967474e-01],
[ 1.44016687e+00, -1.10557360e-01,  1.22738699e+00],
[ 1.92078426e+00,  7.46433038e-01,  2.22465959e+00],
[-6.79400410e-01,  7.27368782e-01, -8.68730734e-01],
[-1.21385091e+00, -4.70630931e-01, -9.19241697e-01],
[-8.38826689e-01,  4.35155305e-01, -5.57804717e-01],
[-5.67454871e-01, -3.72641553e-01, -9.26556901e-01],
[ 1.75510839e+00,  1.20980999e+00,  1.27002473e+00],
[-9.74378127e-01, -6.34709255e-01, -3.95700752e-01],
[-2.89435900e-01, -7.34297072e-01, -7.28504679e-01],
[ 8.38775073e-01,  2.66893213e-01,  7.21194339e-01],
[ 9.10982642e-01, -1.02090261e+00, -1.41341604e+00],
[ 1.29660784e+00,  2.52275209e-01,  1.12748110e+00],
[-5.68363447e-01,  3.09362168e-01, -5.77385473e-01],
[-1.16863407e+00, -8.25019972e-01, -2.64440949e+00],
[-1.52985803e-01, -7.51921003e-01, -1.32609252e-01],
[ 1.45729970e+00,  6.09511845e-01, -4.93779257e-01],
[ 1.23997988e+00, -1.35722140e-01,  1.43004181e+00],
[-8.46852451e-01,  6.03282130e-01,  1.26357226e+00],
[-2.55490556e-01, -4.45688380e-01,  4.68366681e-01],
[-9.61603924e-01, -1.82450454e+00,  6.25428156e-01],
[ 1.02287238e+00,  1.10742460e+00,  9.09370895e-02],
[-3.50108657e-01,  2.17957016e-01, -8.94813130e-01],
[-1.74149395e+00, -1.05225574e+00,  1.43660279e+00],
[-5.76207386e-01, -2.42029443e+00, -1.06232963e+00],
[ 2.37372262e-01,  9.57369064e-04,  6.52531808e-02],
[-1.36752411e+00, -3.02800519e-02,  9.40489321e-01],
[-6.42436751e-01,  1.04017925e+00, -1.08292226e+00],
[ 4.29213588e-01, -2.36223669e-01,  6.41817816e-01],
[-3.31660557e-01,  1.39407223e+00, -1.07674194e+00],
[-1.92465982e-01, -8.71187651e-01,  4.20851997e-01],
[-1.21141107e+00, -2.58866912e-01, -5.81646850e-01],
[-1.26042063e+00,  4.64574793e-01, -1.07024091e+00],
[ 8.04222698e-01, -1.56735508e-01,  2.01039001e+00],
[-8.87104430e-01, -9.77936232e-01, -2.67217350e-01],
[ 4.83337822e-01, -4.00332733e-01,  4.49880415e-01],
[ 3.99593953e-01, -1.51574804e-01, -2.55793406e+00],
[ 1.60806841e-01,  7.65250677e-02, -2.97204166e-01],
[-1.29427402e+00, -8.85180013e-01, -1.87496526e-01],
[-4.93560000e-01, -1.15412964e-01, -3.50744607e-01],
[ 4.46973764e-02, -8.97756316e-01,  8.90873502e-01],
[-1.15118516e+00, -2.61230270e+00,  1.14125019e+00],
[-8.67135525e-01,  3.83583258e-01, -4.37030164e-01],
[ 3.47488810e-01, -1.23017904e+00,  5.71078139e-01],
[ 6.00612128e-02, -2.25523994e-01,  1.34972614e+00],
[ 1.35029973e+00, -3.86653322e-01,  8.65989542e-01]])


The neighbors of each lattice can be checked by:

[5]:

w.neighbors

[5]:

{0: [10, 1],
10: [0, 20, 11],
1: [0, 11, 2],
11: [1, 10, 21, 12],
2: [1, 12, 3],
12: [2, 11, 22, 13],
3: [2, 13, 4],
13: [3, 12, 23, 14],
4: [3, 14, 5],
14: [4, 13, 24, 15],
5: [4, 15, 6],
15: [5, 14, 25, 16],
6: [5, 16, 7],
16: [6, 15, 26, 17],
7: [6, 17, 8],
17: [7, 16, 27, 18],
8: [7, 18, 9],
18: [8, 17, 28, 19],
9: [8, 19],
19: [9, 18, 29],
20: [10, 30, 21],
21: [11, 20, 31, 22],
22: [12, 21, 32, 23],
23: [13, 22, 33, 24],
24: [14, 23, 34, 25],
25: [15, 24, 35, 26],
26: [16, 25, 36, 27],
27: [17, 26, 37, 28],
28: [18, 27, 38, 29],
29: [19, 28, 39],
30: [20, 40, 31],
31: [21, 30, 41, 32],
32: [22, 31, 42, 33],
33: [23, 32, 43, 34],
34: [24, 33, 44, 35],
35: [25, 34, 45, 36],
36: [26, 35, 46, 37],
37: [27, 36, 47, 38],
38: [28, 37, 48, 39],
39: [29, 38, 49],
40: [30, 50, 41],
41: [31, 40, 51, 42],
42: [32, 41, 52, 43],
43: [33, 42, 53, 44],
44: [34, 43, 54, 45],
45: [35, 44, 55, 46],
46: [36, 45, 56, 47],
47: [37, 46, 57, 48],
48: [38, 47, 58, 49],
49: [39, 48, 59],
50: [40, 60, 51],
51: [41, 50, 61, 52],
52: [42, 51, 62, 53],
53: [43, 52, 63, 54],
54: [44, 53, 64, 55],
55: [45, 54, 65, 56],
56: [46, 55, 66, 57],
57: [47, 56, 67, 58],
58: [48, 57, 68, 59],
59: [49, 58, 69],
60: [50, 70, 61],
61: [51, 60, 71, 62],
62: [52, 61, 72, 63],
63: [53, 62, 73, 64],
64: [54, 63, 74, 65],
65: [55, 64, 75, 66],
66: [56, 65, 76, 67],
67: [57, 66, 77, 68],
68: [58, 67, 78, 69],
69: [59, 68, 79],
70: [60, 80, 71],
71: [61, 70, 81, 72],
72: [62, 71, 82, 73],
73: [63, 72, 83, 74],
74: [64, 73, 84, 75],
75: [65, 74, 85, 76],
76: [66, 75, 86, 77],
77: [67, 76, 87, 78],
78: [68, 77, 88, 79],
79: [69, 78, 89],
80: [70, 90, 81],
81: [71, 80, 91, 82],
82: [72, 81, 92, 83],
83: [73, 82, 93, 84],
84: [74, 83, 94, 85],
85: [75, 84, 95, 86],
86: [76, 85, 96, 87],
87: [77, 86, 97, 88],
88: [78, 87, 98, 89],
89: [79, 88, 99],
90: [80, 91],
91: [81, 90, 92],
92: [82, 91, 93],
93: [83, 92, 94],
94: [84, 93, 95],
95: [85, 94, 96],
96: [86, 95, 97],
97: [87, 96, 98],
98: [88, 97, 99],
99: [89, 98]}


We first explore the simulated data by building a 10*10 lattice shapefile.

[6]:

libpysal.weights.build_lattice_shapefile(dim, dim, "lattice.shp")

[7]:

gdf = geopandas.read_file("lattice.shp")
gdf.plot(column="ID");


Regionalization¶

With reg-k-means, we aggregate 100 simulated lattices into 20 regions.

The model can then be solved:

[8]:

model = RegionKMeansHeuristic(data, 20, w)
model.solve()

[9]:

model.labels_

[9]:

array([12,  3,  3,  3,  3,  3, 19, 19, 17, 11, 12,  9,  9,  7,  7,  2,  2,
2, 17, 11, 12, 15,  9,  7, 18,  2, 13, 13,  6,  6, 15, 15,  9, 18,
18,  2, 13, 13,  0,  6, 15,  1,  1, 18, 18, 13, 13, 13,  0,  6,  1,
1,  4,  4, 18, 18,  0,  0,  0,  0,  1, 16,  4,  4,  4, 10, 10, 14,
5,  5, 16, 16, 16,  4,  8, 10, 10, 14,  5,  5, 16,  8, 16,  8,  8,
10, 10, 14,  5, 14,  8,  8,  8,  8,  8, 10, 10, 14, 14, 14])

[10]:

gdf["region"] = model.labels_
gdf.plot(column="region");


The model solution results in 20 spatially connected regions. We can summarize which lattice belongs to which region:

[11]:

areas = numpy.arange(dim * dim)
regions = [areas[model.labels_ == region] for region in range(20)]

[12]:

regions

[12]:

[array([38, 48, 56, 57, 58, 59]),
array([41, 42, 50, 51, 60]),
array([15, 16, 17, 25, 35]),
array([1, 2, 3, 4, 5]),
array([52, 53, 62, 63, 64, 73]),
array([68, 69, 78, 79, 88]),
array([28, 29, 39, 49]),
array([13, 14, 23]),
array([74, 81, 83, 84, 90, 91, 92, 93, 94]),
array([11, 12, 22, 32]),
array([65, 66, 75, 76, 85, 86, 95, 96]),
array([ 9, 19]),
array([ 0, 10, 20]),
array([26, 27, 36, 37, 45, 46, 47]),
array([67, 77, 87, 89, 97, 98, 99]),
array([21, 30, 31, 40]),
array([61, 70, 71, 72, 80, 82]),
array([ 8, 18]),
array([24, 33, 34, 43, 44, 54, 55]),
array([6, 7])]


2. Large synthetic example¶

Create synthetic data¶

Generate a 50 x 50 lattice with spenc

[13]:

from spopt.region.spenclib.utils import lattice
hori, vert = 50, 50
n_polys = hori * vert
gdf = lattice(hori, vert)

[13]:

geometry
0 POLYGON ((0.00000 0.00000, 1.00000 0.00000, 1....
1 POLYGON ((0.00000 1.00000, 1.00000 1.00000, 1....
2 POLYGON ((0.00000 2.00000, 1.00000 2.00000, 1....
3 POLYGON ((0.00000 3.00000, 1.00000 3.00000, 1....
4 POLYGON ((0.00000 4.00000, 1.00000 4.00000, 1....

Generate some random attribute data values

[14]:

numpy.random.seed(RANDOM_SEED)
gdf["data_values_1"] = numpy.random.random(n_polys)
gdf["data_values_2"] = numpy.random.random(n_polys)
vals = ["data_values_1", "data_values_2"]

[14]:

geometry data_values_1 data_values_2
0 POLYGON ((0.00000 0.00000, 1.00000 0.00000, 1.... 0.929616 0.510240
1 POLYGON ((0.00000 1.00000, 1.00000 1.00000, 1.... 0.316376 0.483683
2 POLYGON ((0.00000 2.00000, 1.00000 2.00000, 1.... 0.183919 0.686935
3 POLYGON ((0.00000 3.00000, 1.00000 3.00000, 1.... 0.204560 0.441431
4 POLYGON ((0.00000 4.00000, 1.00000 4.00000, 1.... 0.567725 0.242545

Split into 2 artifical islands

[15]:

gdf = gdf[:1300].append(gdf[1500:])
w = libpysal.weights.Rook.from_dataframe(gdf)

/Users/the-gaboardi/miniconda3/envs/py39_spopt/lib/python3.9/site-packages/libpysal/weights/weights.py:172: UserWarning: The weights matrix is not fully connected:
There are 2 disconnected components.
warnings.warn(message)


Regionalization¶

Partition into 8 regions

[16]:

numpy.random.seed(RANDOM_SEED)
model = RegionKMeansHeuristic(gdf[vals].values, 8, w)
model.solve()

[17]:

gdf["reg_k_mean"] = model.labels_
gdf.plot(
column="reg_k_mean",
categorical=True,
cmap="tab20",
figsize=(12,12),
edgecolor="w"
);


3. Empirical example (empircal geographies & synthetic atttributes)¶

Read in Rio Grande do Sul, Brasil dataset

[18]:

rsbr = libpysal.examples.get_path("map_RS_BR.shp")

[18]:

NM_MUNICIP CD_GEOCMU geometry
0 ACEGUÁ 4300034 POLYGON ((-54.10940 -31.43316, -54.10889 -31.4...
1 ÁGUA SANTA 4300059 POLYGON ((-51.98932 -28.12943, -51.98901 -28.1...
2 AGUDO 4300109 POLYGON ((-53.13696 -29.49483, -53.13481 -29.4...
3 AJURICABA 4300208 POLYGON ((-53.61993 -28.14569, -53.62100 -28.1...
4 ALECRIM 4300307 POLYGON ((-54.77813 -27.58372, -54.77307 -27.5...

Generate some random attribute data values

[19]:

n_polys = rsbr_gdf.shape[0]
numpy.random.seed(RANDOM_SEED)
attr_cols = ["attr_1", "attr_2", "attr_3", "attr_4"]
for attr_col in attr_cols:
rsbr_gdf[attr_col] = numpy.random.random(n_polys)

[19]:

NM_MUNICIP CD_GEOCMU geometry attr_1 attr_2 attr_3 attr_4
0 ACEGUÁ 4300034 POLYGON ((-54.10940 -31.43316, -54.10889 -31.4... 0.929616 0.990111 0.978448 0.194226
1 ÁGUA SANTA 4300059 POLYGON ((-51.98932 -28.12943, -51.98901 -28.1... 0.316376 0.126155 0.004249 0.245969
2 AGUDO 4300109 POLYGON ((-53.13696 -29.49483, -53.13481 -29.4... 0.183919 0.976601 0.559856 0.018801
3 AJURICABA 4300208 POLYGON ((-53.61993 -28.14569, -53.62100 -28.1... 0.204560 0.229106 0.751780 0.427996
4 ALECRIM 4300307 POLYGON ((-54.77813 -27.58372, -54.77307 -27.5... 0.567725 0.186056 0.390045 0.179598

Enforce fuzzy contiguity due to nonplanar-geometries

[20]:

rsbr_w = libpysal.weights.fuzzy_contiguity(rsbr_gdf)


Partition into 2 regions

[21]:

numpy.random.seed(RANDOM_SEED)
model = RegionKMeansHeuristic(rsbr_gdf[attr_cols].values, 2, rsbr_w)
model.solve()

[22]:

rsbr_gdf["reg_k_mean"] = model.labels_
rsbr_gdf.plot(
column="reg_k_mean",
categorical=True,
cmap="tab20",
figsize=(12,12),
edgecolor="w"
);