Introduction to prototyping geographically weighted models

gwlearn implements generic structures for geographically weighted modelling, useful for flexible prototyping of various types of local models. On top of that, it provides implementations of a subset of specific models for regression and classification tasks.

Common principle

The principle applied here is the same as in standard geographically weighted regression:

  • Each observation in the (spatial) dataset has a local model

  • Each local model is fitted on a neighbourhood around its focal observation defined by a set bandwidth

  • Each local model uses sample weighting derived from the distance to the focal observation and a set kernel function (see the sketch after this list)
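
To make this concrete, below is a rough, unoptimized sketch of that loop written from scratch, assuming point coordinates as a NumPy array and an adaptive bisquare kernel. The fit_local_models helper is purely illustrative and does not reflect how gwlearn is implemented internally.

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_local_models(X, y, coords, bandwidth):
    # purely illustrative: fit one weighted model per focal observation
    models = []
    for focal in coords:
        # distance from the focal point to every observation
        dist = np.sqrt(((coords - focal) ** 2).sum(axis=1))
        # adaptive bandwidth: distance to the k-th nearest neighbour
        b = np.sort(dist)[bandwidth]
        # bisquare kernel: weights decay with distance, reaching zero at the bandwidth
        weights = np.where(dist < b, (1 - (dist / b) ** 2) ** 2, 0)
        models.append(LinearRegression().fit(X, y, sample_weight=weights))
    return models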

Linear regression from scratch

Let’s explore the principle by implementing a simplified version of linear regression.

import geopandas as gpd
import matplotlib.pyplot as plt
from geodatasets import get_path

from gwlearn.base import BaseRegressor

Let’s load some data. In this example, you can predict the number of suicides based on other population data in the Guerry dataset.

gdf = gpd.read_file(get_path("geoda.guerry"))
gdf.plot().set_axis_off()
[figure: map of French departments from the Guerry dataset]

Let’s start with the model class that will be used for the individual local models. In this case, LinearRegression.

from sklearn.linear_model import LinearRegression

With this, you can use BaseRegressor to build a geographically weighted version of LinearRegression.

gwr = BaseRegressor(
    model=LinearRegression,
    bandwidth=25,
    fixed=False,
    kernel="bisquare",
    geometry=gdf.centroid,
    include_focal=True
)
gwr
BaseRegressor(bandwidth=25,
              geometry=0     POINT (832852.279 2126600.576)
1     POINT (688485.627 2507622.044)
2     POINT (665510.138 2155203.452)
3     POINT (912995.751 1908303.142)
4     POINT (911433.897 1970311.894)
                   ...              
80    POINT (322059.477 2192535.348)
81    POINT (456183.284 2175488.824)
82    POINT (514516.691 2099653.307)
83    POINT (900562.832 2363005.197)
84    POINT (691913.683 2316341.407)
Length: 85, dtype: geometry,
              include_focal=True,
              model=<class 'sklearn.linear_model._base.LinearRegression'>)

The model specification above contains the model class, the bandwidth size and type (an adaptive bandwidth of 25 nearest neighbors), and a definition of the kernel used to derive sample weights. Importantly, it also defines the geometry representing the locations of the observations. Alternatively, you could skip everything but model and instead pass a libpysal.graph.Graph object capturing the spatial neighborhoods and weights directly. Given that we are dealing with a linear model, we can also include the focal observation in training while still using it for evaluation later.
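
For illustration, such a graph could be built with libpysal. The snippet below is a sketch that mirrors the adaptive bisquare specification used above; it assumes libpysal is installed.

from libpysal import graph

# k-nearest-neighbour graph with bisquare kernel weights, matching the
# adaptive bandwidth of 25 used above
knn_graph = graph.Graph.build_kernel(gdf.centroid, kernel="bisquare", k=25)

A graph like this can then be passed in place of the geometry, bandwidth, fixed, and kernel parameters.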

Fitting the model works just as it does in scikit-learn.

gwr.fit(
    X=gdf[['Crm_prp', 'Litercy', 'Donatns', 'Lottery']],
    y=gdf["Suicids"],
)
BaseRegressor(bandwidth=25,
              geometry=0     POINT (832852.279 2126600.576)
1     POINT (688485.627 2507622.044)
2     POINT (665510.138 2155203.452)
3     POINT (912995.751 1908303.142)
4     POINT (911433.897 1970311.894)
                   ...              
80    POINT (322059.477 2192535.348)
81    POINT (456183.284 2175488.824)
82    POINT (514516.691 2099653.307)
83    POINT (900562.832 2363005.197)
84    POINT (691913.683 2316341.407)
Length: 85, dtype: geometry,
              include_focal=True,
              model=<class 'sklearn.linear_model._base.LinearRegression'>)

The basic model contains some information that is common to any generic regression model.

The first is the prediction for each focal geometry, made by the local model built around it.

gwr.pred_
0     60800.878612
1     14283.306567
2     61918.194878
3     26228.794562
4     15531.652691
          ...     
80    42602.513575
81    19914.708697
82    29128.830096
83    25875.030847
84    19159.068731
Length: 85, dtype: float64

Similarly, you can get residuals based on this prediction.

gwr.resid_
0    -25761.878612
1     -1452.306567
2     52202.805122
3    -11990.794562
4       639.347309
          ...     
80    25360.486425
81     1936.291303
82     4368.169904
83     7153.969153
84    -6370.068731
Length: 85, dtype: float64
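
Since the residuals should simply be the difference between the observed values and the focal predictions, you can verify the relationship directly (a quick sanity check, not something you need in practice):

import numpy as np

# resid_ should equal the observed values minus the focal predictions
np.allclose(gwr.resid_, gdf["Suicids"] - gwr.pred_)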

Both can be plotted and compared to the original y.

f, axs = plt.subplots(1, 3, figsize=(12, 6), sharey=True)

top = gdf["Suicids"].max()
low = gdf["Suicids"].min()
leg = {"orientation": "horizontal", "shrink": .7}

gdf.plot("Suicids", ax=axs[0], legend=True, legend_kwds=leg)
gdf.plot(gwr.pred_, ax=axs[1], vmin=low, vmax=top, legend=True, legend_kwds=leg)
gdf.plot(gwr.resid_, ax=axs[2], cmap="PiYG", legend=True, legend_kwds=leg)

for ax in axs.flat:
    ax.set_axis_off()

axs[0].set_title('y')
axs[1].set_title('pred_')
axs[2].set_title('resid_');
[figure: maps of y, pred_, and resid_]

With regression models, you can further retrieve the local $R^2$.

gwr.local_r2_
0     0.667045
1     0.587837
2     0.638611
3     0.686119
4     0.719047
        ...   
80    0.506366
81    0.594885
82    0.671242
83    0.165759
84    0.398011
Length: 85, dtype: float64
gdf.plot(gwr.local_r2_, legend=True).set_axis_off()
[figure: map of local_r2_]

For model comparison, you can also retrieve information criteria.

gwr.aic_, gwr.aicc_, gwr.bic_
(np.float64(1959.934896362852),
 np.float64(2006.2644970358097),
 np.float64(2044.582781928831))
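
These criteria become handy when comparing model specifications. As a sketch, you could, for example, compare a few arbitrary candidate bandwidths by their corrected AIC:

# compare a few arbitrary candidate bandwidths by corrected AIC
for bw in (15, 25, 35):
    candidate = BaseRegressor(
        model=LinearRegression,
        bandwidth=bw,
        fixed=False,
        kernel="bisquare",
        geometry=gdf.centroid,
        include_focal=True,
    )
    candidate.fit(
        X=gdf[["Crm_prp", "Litercy", "Donatns", "Lottery"]],
        y=gdf["Suicids"],
    )
    print(bw, candidate.aicc_)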

Implemented models

While this captures the generic case and can be extrapolated to many other predictive models, it has some limitations. Notably, it is unable to extract information linked to the specificities of the selected model. In the case of linear regression, we might be interested in the fitted coefficients; in the case of non-linear models, in feature importances; and so on.
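
To illustrate what such model-specific information looks like, the illustrative fit_local_models helper from the sketch in the Common principle section could expose local coefficients directly (again an assumption-laden sketch, not gwlearn's API):

import numpy as np

# reuse the illustrative helper from the Common principle sketch
models = fit_local_models(
    gdf[["Crm_prp", "Litercy", "Donatns", "Lottery"]].values,
    gdf["Suicids"].values,
    np.column_stack([gdf.centroid.x, gdf.centroid.y]),
    bandwidth=25,
)
# one row of fitted coefficients per focal observation
local_coefs = np.array([m.coef_ for m in models])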

Consult the rest of the user guide to get an overview of implemented models and other functionality.