Bandwidth search¶
To find the optimal bandwidth, gwlearn provides the BandwidthSearch class, which trains models across a range of bandwidths and selects the best-performing one.
import geopandas as gpd
from geodatasets import get_path
from gwlearn.ensemble import GWRandomForestClassifier
from gwlearn.linear_model import GWLinearRegression
from gwlearn.search import BandwidthSearch
Get sample data
gdf = gpd.read_file(get_path("geoda.ncovr")).to_crs(5070)  # US county data, projected to a metric CRS
gdf['point'] = gdf.representative_point()  # represent each county by a point for distance-based weighting
gdf = gdf.set_geometry('point')
y = gdf["FH90"]  # target variable
X = gdf.iloc[:, 9:15]  # a subset of columns used as explanatory variables
Interval search¶
Interval search evaluates the model at evenly spaced bandwidths, stepping from the minimum to the maximum bandwidth by a set interval.
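For intuition, the grid below mirrors what interval search walks through, assuming candidates are generated from min_bandwidth towards max_bandwidth in steps of interval (a sketch, not gwlearn's internal code):

import numpy as np

# hypothetical reconstruction of the candidate grid: min_bandwidth plus
# multiples of `interval`, capped at max_bandwidth
candidates = np.arange(50, 1000, 100)
print(candidates)  # [ 50 150 250 350 450 550 650 750 850 950]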
search = BandwidthSearch(
    GWLinearRegression,
    geometry=gdf.geometry,
    fixed=False,  # adaptive bandwidth, measured as a number of nearest neighbors
    n_jobs=-1,  # use all available cores
    search_method="interval",
    min_bandwidth=50,
    max_bandwidth=1000,
    interval=100,
    criterion="aicc",
    verbose=True,
)
search.fit(
    X,
    y,
)
Bandwidth: 50.00, aicc: 15666.353
Bandwidth: 150.00, aicc: 15498.936
Bandwidth: 250.00, aicc: 15660.999
Bandwidth: 350.00, aicc: 15775.129
Bandwidth: 450.00, aicc: 15854.287
Bandwidth: 550.00, aicc: 15918.673
Bandwidth: 650.00, aicc: 15975.564
Bandwidth: 750.00, aicc: 16031.276
Bandwidth: 850.00, aicc: 16080.155
Bandwidth: 950.00, aicc: 16126.986
<gwlearn.search.BandwidthSearch at 0x17f353bc0>
The scores_ series then contains the values of the selected criterion (AICc here), which can be plotted to see how model performance changes as the bandwidth grows.
search.scores_.plot()
<Axes: >
The optimal bandwidth is then the one with the lowest score.
search.optimal_bandwidth_
np.int64(150)
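Since scores_ is a pandas Series mapping each evaluated bandwidth to its criterion value, the same optimum can be recovered manually; a quick sanity-check sketch (assuming, as the plot above suggests, that the series is indexed by bandwidth):

# the optimal bandwidth is the argmin of the criterion series
best_bw = search.scores_.idxmin()
assert best_bw == search.optimal_bandwidth_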
Golden section¶
Alternatively, you can use the golden section algorithm, which attempts to find the optimal bandwidth iteratively. However, note that there is no guarantee that it will find the globally optimal bandwidth, as it may get stuck in a local minimum.
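For intuition about the procedure itself, here is a minimal, generic golden-section minimizer. It only illustrates the algorithm, not gwlearn's internal implementation, and the quadratic objective is a stand-in for a real criterion:

# generic golden-section minimization of f over [lo, hi]; each step
# shrinks the bracket by the inverse golden ratio (~0.618)
def golden_section(f, lo, hi, tol=1.0):
    phi = (5 ** 0.5 - 1) / 2
    a, b = hi - phi * (hi - lo), lo + phi * (hi - lo)
    fa, fb = f(a), f(b)
    while hi - lo > tol:
        if fa < fb:  # the minimum is bracketed in [lo, b]
            hi, b, fb = b, a, fa
            a = hi - phi * (hi - lo)
            fa = f(a)
        else:  # the minimum is bracketed in [a, hi]
            lo, a, fa = a, b, fb
            b = lo + phi * (hi - lo)
            fb = f(b)
    return (lo + hi) / 2

# illustrative smooth objective with a single minimum inside the range
golden_section(lambda bw: (bw - 291_000) ** 2, 250_000, 2_000_000)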
search = BandwidthSearch(
    GWLinearRegression,
    geometry=gdf.geometry,
    fixed=True,  # fixed bandwidth, measured as a distance in CRS units (metres here)
    n_jobs=-1,
    search_method="golden_section",
    criterion="aicc",
    min_bandwidth=250_000,
    max_bandwidth=2_000_000,
    verbose=True,
)
search.fit(
    X,
    y,
)
Bandwidth: 918447.5, score: 16135.833
Bandwidth: 1331552.5, score: 16367.508
Bandwidth: 663120.61, score: 15901.394
Bandwidth: 505326.89, score: 15734.258
Bandwidth: 407799.68, score: 15644.228
Bandwidth: 347527.21, score: 15599.261
Bandwidth: 310274.74, score: 15585.033
Bandwidth: 287252.47, score: 15579.996
Bandwidth: 273023.14, score: 15583.440
Bandwidth: 296045.75, score: 15580.453
Bandwidth: 281817.09, score: 15580.952
Bandwidth: 290610.83, score: 15579.452
Bandwidth: 292686.98, score: 15579.549
Bandwidth: 289328.29, score: 15579.566
Bandwidth: 291404.06, score: 15579.440
Bandwidth: 291893.95, score: 15579.458
Bandwidth: 291100.94, score: 15579.438
<gwlearn.search.BandwidthSearch at 0x331d955b0>
You can see how the algorithm searches and iteratively gets closer to the optimum.
search.optimal_bandwidth_
np.float64(291100.9431339667)
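Because the golden-section evaluations are unevenly spaced, sorting scores_ by bandwidth before plotting makes the search trajectory easier to read (again assuming the series is indexed by bandwidth):

# sort by bandwidth so the line follows the x-axis rather than the
# order in which candidates were evaluated
search.scores_.sort_index().plot(marker="o")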
Other metrics¶
By default, BandwidthSearch computes AICc, AIC, and BIC, available through the metrics_ attribute.
search.metrics_
| bandwidth | aicc | aic | bic |
|---|---|---|---|
| 9.184475e+05 | 16135.832631 | 16132.605473 | 16556.503755 |
| 1.331552e+06 | 16367.507618 | 16366.574492 | 16597.123551 |
| 6.631206e+05 | 15901.393868 | 15891.545192 | 16623.552420 |
| 5.053269e+05 | 15734.258373 | 15708.559009 | 16874.775997 |
| 4.077997e+05 | 15644.228226 | 15588.888391 | 17273.283311 |
| 3.475272e+05 | 15599.260550 | 15500.960777 | 17709.892539 |
| 3.102747e+05 | 15585.032697 | 15436.700466 | 18110.784586 |
| 2.872525e+05 | 15579.995539 | 15383.747794 | 18423.649729 |
| 2.730231e+05 | 15583.440146 | 15347.827157 | 18650.241701 |
| 2.960457e+05 | 15580.452736 | 15404.583147 | 18296.148267 |
| 2.818171e+05 | 15580.952181 | 15370.724573 | 18507.176295 |
| 2.906108e+05 | 15579.451964 | 15391.351712 | 18373.091183 |
| 2.926870e+05 | 15579.548602 | 15396.249687 | 18343.014101 |
| 2.893283e+05 | 15579.566286 | 15388.413516 | 18392.115393 |
| 2.914041e+05 | 15579.439825 | 15393.191365 | 18361.504693 |
| 2.918940e+05 | 15579.458231 | 15394.342859 | 18354.402006 |
| 2.911009e+05 | 15579.438393 | 15392.484736 | 18365.920248 |
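Since metrics_ is a DataFrame indexed by bandwidth, the criteria are easy to compare directly. Note that AICc, AIC, and BIC penalize model complexity differently, so, as in the table above, they need not agree on the optimum:

# the bandwidth minimizing each criterion, one value per column
search.metrics_.idxmin()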
You can also request log loss and even use it as the criterion for selection. This is useful when comparing classification models with a varying prediction rate (which you can also retrieve for each bandwidth).
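As a refresher, log loss is the negative mean log-likelihood of the predicted class probabilities, so lower is better. A minimal, self-contained illustration with scikit-learn, unrelated to gwlearn's internal computation:

import numpy as np
from sklearn.metrics import log_loss

# toy binary target and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4])

# -mean(y * log(p) + (1 - y) * log(1 - p)) ~= 0.341
print(log_loss(y_true, y_prob))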
search = BandwidthSearch(
    GWRandomForestClassifier,
    geometry=gdf.geometry,
    fixed=False,  # adaptive bandwidth, measured as a number of nearest neighbors
    n_jobs=-1,
    search_method="interval",
    min_bandwidth=50,
    max_bandwidth=1000,
    interval=100,
    metrics=['log_loss', 'prediction_rate'],  # computed alongside the information criteria
    criterion="log_loss",
    verbose=True,
)
search.fit(
    X,
    y > y.median(),  # simulate a binary categorical variable
)
Bandwidth: 50.00, log_loss: 0.568
Bandwidth: 150.00, log_loss: 0.550
Bandwidth: 250.00, log_loss: 0.578
Bandwidth: 350.00, log_loss: 0.617
Bandwidth: 450.00, log_loss: 0.593
Bandwidth: 550.00, log_loss: 0.553
Bandwidth: 650.00, log_loss: 0.494
Bandwidth: 750.00, log_loss: 0.524
Bandwidth: 850.00, log_loss: 0.555
Bandwidth: 950.00, log_loss: 0.571
<gwlearn.search.BandwidthSearch at 0x30be6d370>
Log loss (along with the prediction rate) is then part of the metrics.
search.metrics_
| bandwidth | aicc | aic | bic | log_loss | prediction_rate |
|---|---|---|---|---|---|
| 50 | NaN | 14260.458748 | 47847.673185 | 0.568032 | 0.663533 |
| 150 | 6152.053899 | 4482.927615 | 10277.629979 | 0.549784 | 0.727391 |
| 250 | 3894.720134 | 3628.636942 | 6452.220177 | 0.577733 | 0.747164 |
| 350 | 3624.008863 | 3543.537156 | 5240.691021 | 0.617262 | 0.784441 |
| 450 | 3498.456652 | 3457.505985 | 4743.376262 | 0.593468 | 0.832415 |
| 550 | 3303.210326 | 3278.050284 | 4322.854246 | 0.552747 | 0.864182 |
| 650 | 3034.904706 | 3017.896184 | 3905.122425 | 0.493768 | 0.897245 |
| 750 | 3258.804907 | 3246.121567 | 4034.695415 | 0.523635 | 0.930308 |
| 850 | 3524.552076 | 3514.813454 | 4226.823882 | 0.555048 | 0.967585 |
| 950 | 3705.116372 | 3697.267706 | 4352.457033 | 0.570866 | 1.000000 |
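With both metrics requested, you can also plot log loss against the prediction rate to inspect the trade-off between probabilistic accuracy and the share of observations for which a prediction could be made (a sketch using standard pandas plotting):

# prediction_rate lives on a [0, 1] scale, so give it a secondary axis
ax = search.metrics_["log_loss"].plot()
search.metrics_["prediction_rate"].plot(ax=ax, secondary_y=True)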
And since log loss is set as the criterion, it is reported directly as the score.
search.scores_.plot()
<Axes: >
As a result, the optimal bandwidth is the one that minimizes log loss.
search.optimal_bandwidth_
np.int64(650)