Bandwidth search¶

To find out the optimal bandwidth, gwlearn provides a BandwidthSearch class, which trains models on a range of bandwidths and selects the most optimal one.

import geopandas as gpd
from geodatasets import get_path

from gwlearn.linear_model import GWLinearRegression, GWLogisticRegression
from gwlearn.search import BandwidthSearch

Get sample data

gdf = gpd.read_file(get_path("geoda.south")).to_crs(5070)

gdf["point"] = gdf.representative_point()
gdf = gdf.set_geometry("point")

y = gdf["FH90"]
X = gdf.iloc[:, 9:15]

Downloading file 'south.zip' from 'https://geodacenter.github.io/data-and-lab//data/south.zip' to '/home/runner/.cache/geodatasets'.
Extracting 'south/south.gpkg' from '/home/runner/.cache/geodatasets/south.zip' to '/home/runner/.cache/geodatasets/south.zip.unzip'

Interval search¶

Interval search tests the model at a set interval. The default and recommended criterion for linear models is corrected Akaike’s information criterion (AICc).

search = BandwidthSearch(
    GWLinearRegression,
    fixed=False,
    n_jobs=-1,
    search_method="interval",
    min_bandwidth=50,
    max_bandwidth=1000,
    interval=100,
    criterion="aicc",
    verbose=True,
)
search.fit(
    X,
    y,
    geometry=gdf.geometry,
)

Bandwidth: 50.00, aicc: 7687.620
Bandwidth: 150.00, aicc: 7598.565
Bandwidth: 250.00, aicc: 7672.857
Bandwidth: 350.00, aicc: 7723.089
Bandwidth: 450.00, aicc: 7764.724
Bandwidth: 550.00, aicc: 7811.336
Bandwidth: 650.00, aicc: 7862.825
Bandwidth: 750.00, aicc: 7904.418
Bandwidth: 850.00, aicc: 7941.883
Bandwidth: 950.00, aicc: 7981.397

<gwlearn.search.BandwidthSearch at 0x7faebde2be00>

The scores_ series then contains the AICc, selected as the criterion, which can be plotted to see the change of the model performance as the bandwidth grows.

search.scores_.plot()

<Axes: >

_images/4b6589f79c63c37cd8dca987141c0790ebbadca6fbed081662088a6b6e5b0dca.png

The optimal bandwidth is then the lowest one.

search.optimal_bandwidth_

Golden section¶

Alternatively, you can try to use the golden section algorithm that attempts to find the optimal bandwidth iteratively. However, note that there’s no guaratnee that it will find the globally optimal bandwidth as it may stick to the local minimum.

search = BandwidthSearch(
    GWLinearRegression,
    fixed=True,
    n_jobs=-1,
    search_method="golden_section",
    criterion="aicc",
    min_bandwidth=10_000,
    max_bandwidth=1_000_000,
    verbose=True,
)
search.fit(
    X,
    y,
    geometry=gdf.geometry,
)

Bandwidth: 388150.3, score: 7670.159
Bandwidth: 621849.7, score: 7770.628
Bandwidth: 243708.23, score: 7596.823
Bandwidth: 154442.07, score: 7912.455
Bandwidth: 298880.77, score: 7626.218
Bandwidth: 209613.32, score: 7617.231
Bandwidth: 264783.28, score: 7604.824
Bandwidth: 230686.59, score: 7597.453
Bandwidth: 251759.37, score: 7599.373
Bandwidth: 238735.76, score: 7596.182
Bandwidth: 235660.47, score: 7596.368
Bandwidth: 240634.23, score: 7596.262
Bandwidth: 237560.29, score: 7596.229
Bandwidth: 239460.08, score: 7596.185

<gwlearn.search.BandwidthSearch at 0x7faebdb02e90>

You can see how the agorithm searches and iteratively gets closer to the optimum.

search.optimal_bandwidth_

np.float64(238735.7582789531)

Other metrics¶

By default, BandwidthSearch computes AICc, AIC and BIC for linear models, available through metrics_.

search.metrics_

	aicc	aic	bic
388150.300000	7670.159314	7650.730706	8238.208833
621849.700000	7770.627977	7766.321405	8047.693794
243708.229909	7596.823008	7497.313206	8762.156146
154442.070091	7912.454856	7355.237895	9995.188728
298880.767422	7626.217681	7578.766844	8477.096909
209613.319310	7617.230987	7443.545102	9065.985674
264783.280267	7604.823570	7531.986411	8628.488146
230686.589297	7597.453089	7475.433826	8862.323769
251759.367217	7599.373132	7511.192286	8708.266105
238735.758279	7596.181669	7488.668425	8798.649792
235660.465361	7596.367751	7483.564238	8822.281852
240634.225285	7596.262234	7491.905192	8784.335011
237560.292439	7596.229247	7486.713835	8807.668987
239460.075156	7596.184987	7489.889063	8793.138119

You can also ask for a log loss and even use it as a criterion for the selection. This is useful when comparing classification models with varying prediction rate (you can also retrieve that for each bandwidth) or when using non-linear models, where MAE or RMSE should be used instead of AICc.

search = BandwidthSearch(
    GWLogisticRegression,
    fixed=False,
    n_jobs=-1,
    search_method="interval",
    min_bandwidth=50,
    max_bandwidth=1000,
    interval=200,
    metrics=["log_loss", "prediction_rate"],
    criterion="log_loss",
    verbose=True,
)
search.fit(
    X,
    y > y.median(),  # simulate binary categorical variable
    geometry=gdf.geometry,
)

Bandwidth: 50.00, log_loss: 0.276
Bandwidth: 250.00, log_loss: 0.376
Bandwidth: 450.00, log_loss: 0.413
Bandwidth: 650.00, log_loss: 0.437
Bandwidth: 850.00, log_loss: 0.454

<gwlearn.search.BandwidthSearch at 0x7faebdda3610>

Log loss is then part of the metrics.

search.metrics_

	aicc	aic	bic	log_loss	prediction_rate
50	1378.952921	1119.216533	2551.160956	0.275782	0.682011
250	1263.263762	1249.800971	1741.819969	0.376227	1.000000
450	1275.477037	1271.315792	1547.948404	0.412885	1.000000
650	1308.279791	1306.304157	1497.258268	0.436826	1.000000
850	1337.858308	1336.720382	1481.487638	0.453824	1.000000

And is reported directly as score as it is set as the criterion.

search.scores_.plot()

<Axes: >

_images/4430c0a93f901ae7c7595e6f656447dce0366ad6854dbfc0674ebbb1ac6a1970.png

As a result, the optimal bandwidth is derived directly from it.

search.optimal_bandwidth_