Bandwidth search

To find the optimal bandwidth, gwlearn provides a BandwidthSearch class, which fits models across a range of bandwidths and selects the best-performing one.

import geopandas as gpd
from geodatasets import get_path

from gwlearn.ensemble import GWRandomForestClassifier
from gwlearn.linear_model import GWLinearRegression
from gwlearn.search import BandwidthSearch

Get sample data

gdf = gpd.read_file(get_path("geoda.ncovr")).to_crs(5070)

gdf['point'] = gdf.representative_point()
gdf = gdf.set_geometry('point')

y = gdf["FH90"]
X = gdf.iloc[:, 9:15]

Interval search evaluates the model at bandwidths spaced at a fixed interval.
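Conceptually, interval search builds a grid of candidate bandwidths and picks the one that minimises the criterion. A minimal sketch with a made-up scoring function standing in for the actual model fit (the function and values below are illustrative, not gwlearn's internals):

```python
# made-up stand-in for fitting a model and computing its AICc
def score(bw):
    return (bw - 150) ** 2  # pretend the criterion is minimised at 150

# grid mirroring min_bandwidth=50, max_bandwidth=1000, interval=100
candidates = range(50, 1001, 100)
scores = {bw: score(bw) for bw in candidates}

# the best candidate minimises the criterion
best = min(scores, key=scores.get)
print(best)  # -> 150
```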

search = BandwidthSearch(
    GWLinearRegression,
    geometry=gdf.geometry,
    fixed=False,
    n_jobs=-1,
    search_method="interval",
    min_bandwidth=50,
    max_bandwidth=1000,
    interval=100,
    criterion="aicc",
    verbose=True,
)
search.fit(
    X,
    y,
)
Bandwidth: 50.00, aicc: 15666.353
Bandwidth: 150.00, aicc: 15498.936
Bandwidth: 250.00, aicc: 15660.999
Bandwidth: 350.00, aicc: 15775.129
Bandwidth: 450.00, aicc: 15854.287
Bandwidth: 550.00, aicc: 15918.673
Bandwidth: 650.00, aicc: 15975.564
Bandwidth: 750.00, aicc: 16031.276
Bandwidth: 850.00, aicc: 16080.155
Bandwidth: 950.00, aicc: 16126.986
<gwlearn.search.BandwidthSearch at 0x17f353bc0>

The scores_ series then contains the AICc, selected as the criterion, which can be plotted to see how model performance changes as the bandwidth grows.

search.scores_.plot()
<Axes: >
[Line plot of the AICc score against bandwidth]

The optimal bandwidth is then the one with the lowest score.

search.optimal_bandwidth_
np.int64(150)
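Effectively, this is just the index of the minimum of scores_. A toy illustration with a hand-built pandas Series (the values are copied from the run above; the attribute's actual construction inside gwlearn may differ):

```python
import pandas as pd

# AICc scores indexed by bandwidth, as reported by the search above
scores = pd.Series(
    {50: 15666.353, 150: 15498.936, 250: 15660.999, 350: 15775.129},
    name="aicc",
)

# the optimal bandwidth is the index of the minimum score
optimal = scores.idxmin()
print(optimal)  # -> 150
```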

Golden section

Alternatively, you can use the golden section algorithm, which attempts to find the optimal bandwidth iteratively. However, note that there is no guarantee it will find the globally optimal bandwidth, as it may get stuck in a local minimum.
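The idea behind golden-section search can be sketched in a few lines of plain Python. This is a generic one-dimensional minimiser, not gwlearn's implementation; it repeatedly shrinks the bracket [a, b] by the golden ratio, keeping the sub-interval that contains the smaller score:

```python
# generic golden-section search for the minimum of f on [a, b]
def golden_section(f, a, b, tol=1e-5):
    invphi = (5 ** 0.5 - 1) / 2  # 1/phi, about 0.618
    c = b - (b - a) * invphi
    d = a + (b - a) * invphi
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
        c = b - (b - a) * invphi
        d = a + (b - a) * invphi
    return (a + b) / 2

# converges to the minimum of a convex function
result = golden_section(lambda x: (x - 3) ** 2, 0, 10)
print(round(result, 3))  # -> 3.0
```

If f has several local minima in the bracket, the search converges to one of them, which is exactly why the global optimum is not guaranteed.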

search = BandwidthSearch(
    GWLinearRegression,
    geometry=gdf.geometry,
    fixed=True,
    n_jobs=-1,
    search_method="golden_section",
    criterion="aicc",
    min_bandwidth=250_000,
    max_bandwidth=2_000_000,
    verbose=True,
)
search.fit(
    X,
    y,
)
Bandwidth: 918447.5, score: 16135.833
Bandwidth: 1331552.5, score: 16367.508
Bandwidth: 663120.61, score: 15901.394
Bandwidth: 505326.89, score: 15734.258
Bandwidth: 407799.68, score: 15644.228
Bandwidth: 347527.21, score: 15599.261
Bandwidth: 310274.74, score: 15585.033
Bandwidth: 287252.47, score: 15579.996
Bandwidth: 273023.14, score: 15583.440
Bandwidth: 296045.75, score: 15580.453
Bandwidth: 281817.09, score: 15580.952
Bandwidth: 290610.83, score: 15579.452
Bandwidth: 292686.98, score: 15579.549
Bandwidth: 289328.29, score: 15579.566
Bandwidth: 291404.06, score: 15579.440
Bandwidth: 291893.95, score: 15579.458
Bandwidth: 291100.94, score: 15579.438
<gwlearn.search.BandwidthSearch at 0x331d955b0>

You can see how the algorithm searches and iteratively gets closer to the optimum.

search.optimal_bandwidth_
np.float64(291100.9431339667)

Other metrics

By default, BandwidthSearch computes AICc, AIC, and BIC, available through the metrics_ attribute.

search.metrics_
bandwidth aicc aic bic
9.184475e+05 16135.832631 16132.605473 16556.503755
1.331552e+06 16367.507618 16366.574492 16597.123551
6.631206e+05 15901.393868 15891.545192 16623.552420
5.053269e+05 15734.258373 15708.559009 16874.775997
4.077997e+05 15644.228226 15588.888391 17273.283311
3.475272e+05 15599.260550 15500.960777 17709.892539
3.102747e+05 15585.032697 15436.700466 18110.784586
2.872525e+05 15579.995539 15383.747794 18423.649729
2.730231e+05 15583.440146 15347.827157 18650.241701
2.960457e+05 15580.452736 15404.583147 18296.148267
2.818171e+05 15580.952181 15370.724573 18507.176295
2.906108e+05 15579.451964 15391.351712 18373.091183
2.926870e+05 15579.548602 15396.249687 18343.014101
2.893283e+05 15579.566286 15388.413516 18392.115393
2.914041e+05 15579.439825 15393.191365 18361.504693
2.918940e+05 15579.458231 15394.342859 18354.402006
2.911009e+05 15579.438393 15392.484736 18365.920248
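For reference, the three criteria follow the usual definitions in terms of the model log-likelihood, the number of parameters k, and the number of observations n. A hedged sketch with made-up values (gwlearn may parameterise these differently, e.g. via effective degrees of freedom of the geographically weighted fit):

```python
import math

def aic(log_lik, k):
    return 2 * k - 2 * log_lik

def aicc(log_lik, k, n):
    # AIC with a small-sample correction; approaches AIC as n grows
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(log_lik, k, n):
    # penalises extra parameters harder than AIC for large n
    return k * math.log(n) - 2 * log_lik

# illustrative values only
print(aic(-7800.0, 10), aicc(-7800.0, 10, 3000), bic(-7800.0, 10, 3000))
```

Lower is better for all three, so the same "pick the minimum" selection logic applies regardless of the chosen criterion.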

You can also request log loss and even use it as the criterion for the selection. This is useful when comparing classification models with varying prediction rates (which you can also retrieve for each bandwidth).
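Log loss here is the standard binary cross-entropy, equivalent to scikit-learn's sklearn.metrics.log_loss. A quick standalone illustration with made-up labels and probabilities:

```python
import math

def binary_log_loss(y_true, y_prob):
    # mean negative log-probability assigned to the true class;
    # lower is better, 0 would mean perfectly confident correct predictions
    return -sum(
        math.log(p if y == 1 else 1 - p)
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)

y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.8, 0.7, 0.3]  # predicted probability of class 1
print(round(binary_log_loss(y_true, y_prob), 3))  # -> 0.26
```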

search = BandwidthSearch(
    GWRandomForestClassifier,
    geometry=gdf.geometry,
    fixed=False,
    n_jobs=-1,
    search_method="interval",
    min_bandwidth=50,
    max_bandwidth=1000,
    interval=100,
    metrics=['log_loss', 'prediction_rate'],
    criterion="log_loss",
    verbose=True,
)
search.fit(
    X,
    y > y.median(),  # simulate binary categorical variable
)
Bandwidth: 50.00, log_loss: 0.568
Bandwidth: 150.00, log_loss: 0.550
Bandwidth: 250.00, log_loss: 0.578
Bandwidth: 350.00, log_loss: 0.617
Bandwidth: 450.00, log_loss: 0.593
Bandwidth: 550.00, log_loss: 0.553
Bandwidth: 650.00, log_loss: 0.494
Bandwidth: 750.00, log_loss: 0.524
Bandwidth: 850.00, log_loss: 0.555
Bandwidth: 950.00, log_loss: 0.571
<gwlearn.search.BandwidthSearch at 0x30be6d370>

Log loss is then included in the metrics.

search.metrics_
bandwidth aicc aic bic log_loss prediction_rate
50 NaN 14260.458748 47847.673185 0.568032 0.663533
150 6152.053899 4482.927615 10277.629979 0.549784 0.727391
250 3894.720134 3628.636942 6452.220177 0.577733 0.747164
350 3624.008863 3543.537156 5240.691021 0.617262 0.784441
450 3498.456652 3457.505985 4743.376262 0.593468 0.832415
550 3303.210326 3278.050284 4322.854246 0.552747 0.864182
650 3034.904706 3017.896184 3905.122425 0.493768 0.897245
750 3258.804907 3246.121567 4034.695415 0.523635 0.930308
850 3524.552076 3514.813454 4226.823882 0.555048 0.967585
950 3705.116372 3697.267706 4352.457033 0.570866 1.000000

Since it is set as the criterion, it is also reported directly as the score.

search.scores_.plot()
<Axes: >
[Line plot of the log loss score against bandwidth]

As a result, the optimal bandwidth is derived directly from it.

search.optimal_bandwidth_
np.int64(650)