Linear models

An overview of linear and logistic regression models.

import geopandas as gpd
import matplotlib.pyplot as plt
from geodatasets import get_path
from sklearn import metrics

from gwlearn.linear_model import GWLinearRegression, GWLogisticRegression

Linear regression

Linear regression is the classical example of a geographically weighted regression model. In this example, you can predict the number of suicides from other population variables in the Guerry dataset.

gdf = gpd.read_file(get_path("geoda.guerry"))
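You can peek at the variables used throughout this guide (a minimal sketch; the column subset mirrors the model specification below):

# the four explanatory variables and the target
gdf[["Crm_prp", "Litercy", "Donatns", "Lottery", "Suicids"]].head()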

It is a relatively small dataset covering French data from the 1800s.

gdf.plot().set_axis_off()
_images/4cd6f40416928952dbfc42f4ad77b31636935690a42b86a030a0bd125a4f4d9a.png

Specify the model and fit the data.

adaptive = GWLinearRegression(bandwidth=25, fixed=False, kernel="tricube")
adaptive.fit(
    gdf[["Crm_prp", "Litercy", "Donatns", "Lottery"]],
    gdf["Suicids"],
    geometry=gdf.representative_point(),
)
GWLinearRegression(bandwidth=25, kernel='tricube')

You can get a number of outputs from the fitted model. For example, focal predictions (a value predicted for each observation using the local model fitted around it).

adaptive.pred_
0     65929.297258
1     14289.540382
2     60668.901578
3     25191.170857
4     12777.621045
          ...     
80    41304.669559
81    19766.026593
82    29125.591134
83    25717.441407
84    19381.885833
Length: 85, dtype: float64

This allows you to measure model performance metrics like $R^2$. This value is comparable to the $R^2$ reported by mgwr.

metrics.r2_score(gdf["Suicids"], adaptive.pred_)
0.7030520942210708
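The same value can be reproduced from its definition, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, using the focal predictions shown above (a minimal sketch):

observed = gdf["Suicids"]
# residual sum of squares based on focal predictions
ss_res = ((observed - adaptive.pred_) ** 2).sum()
# total sum of squares around the mean
ss_tot = ((observed - observed.mean()) ** 2).sum()
1 - ss_res / ss_tot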

A local version of $R^2$ is accessible directly.

adaptive.local_r2_
0     0.655274
1     0.564506
2     0.616975
3     0.690529
4     0.704666
        ...   
80    0.444105
81    0.554521
82    0.646275
83    0.174557
84    0.215379
Length: 85, dtype: float64

Any of the values linked to individual observations can then be mapped.

gdf.plot(adaptive.local_r2_, legend=True).set_axis_off()
_images/aedd4462e4f5d2eddbfc9839a28052e2c3b4789432985ceb462617a3f9b5b667.png

Similarly, you can access residuals.

gdf.plot(adaptive.resid_, legend=True).set_axis_off()
_images/89970d5ca799ab996b9e3b3e73ed04922538b49968549a69f5daf0e23e7796b0.png
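Assuming resid_ follows the usual convention of observed minus focal prediction, it can be cross-checked against the two outputs above (a sketch, not a guarantee of the internal definition):

# should be ~0 everywhere if resid_ = observed - focal prediction (an assumption)
(adaptive.resid_ - (gdf["Suicids"] - adaptive.pred_)).abs().max()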

Each local model then has its own local coefficients.

adaptive.local_coef_
Crm_prp Litercy Donatns Lottery
0 3.147001 -500.570435 0.218349 274.308624
1 2.177315 103.815973 1.276062 -94.396645
2 1.155884 -2112.020365 -7.912018 1015.701421
3 7.945416 -189.098606 -3.969571 132.812885
4 7.575983 -111.394715 -4.221681 107.181855
... ... ... ... ...
80 0.422307 -769.135891 0.472364 315.564934
81 2.228270 -343.158854 0.233416 242.722924
82 2.319692 -390.456757 -1.543721 356.856559
83 1.918249 147.505689 1.016886 -58.835632
84 0.027807 -92.139088 1.604362 331.684347

85 rows × 4 columns
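Since local_coef_ is a regular DataFrame, the spatial variation of each coefficient can also be summarised numerically before mapping (a minimal sketch):

# spread of each coefficient across the 85 local models
adaptive.local_coef_.describe().loc[["min", "mean", "max"]]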

Again, this can be explored visually.

f, axs = plt.subplots(2, 2, figsize=(12, 10))

for column, ax in zip(adaptive.local_coef_.columns, axs.flat):
    gdf.plot(adaptive.local_coef_[column], legend=True, ax=ax)
    ax.set_title(column)
    ax.set_axis_off()
_images/a63aff8a4c55141f967d79fa6af1a8fbfd1223b16058333419dc78d9c7f43acd.png

Alongside local coefficients, you can retrieve a local intercept value.

gdf.plot(adaptive.local_intercept_, legend=True).set_axis_off()
_images/af4e172ded8f62ad2aef95838509e4c7ad57dc52d465c59c731d7c619c0c10fa.png

For details on prediction, see the Prediction guide.

Logistic regression

Due to the way gwlearn extends scikit-learn, you can also fit a local version of logistic regression for classification tasks. However, there’s a caveat. Since it is unlikely that all categories are present in all local models, fitting a non-binary model would lead to inconsistent local models. Hence, gwlearn currently supports only binary classification, akin to the binomial model in mgwr.

Let’s illustrate it by predicting whether the number of suicides is over or under the median.

y = gdf["Suicids"] > gdf["Suicids"].median()

The API then looks the same, following scikit-learn’s model.

binary = GWLogisticRegression(bandwidth=25, fixed=False, max_iter=500)
binary.fit(
    gdf[["Crm_prp", "Litercy", "Donatns", "Lottery"]],
    y,
    geometry=gdf.representative_point(),
)
GWLogisticRegression(bandwidth=25)

Given this is a classification task, we can retrieve probabilities from focal predictions.

binary.proba_
False True
0 0.005900 0.994100
1 NaN NaN
2 0.242553 0.757447
3 0.972850 0.027150
4 0.716932 0.283068
... ... ...
80 0.435476 0.564524
81 0.809622 0.190378
82 0.439796 0.560204
83 0.542404 0.457596
84 0.988792 0.011208

85 rows × 2 columns
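As with scikit-learn's predict_proba, the class probabilities of every fitted location should sum to one; a quick sanity check, dropping the skipped locations first:

# probabilities per fitted location should sum to one
binary.proba_.dropna().sum(axis=1).round(6).unique()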

Or their binary counterparts (based on the maximum value).

binary.pred_
0      True
1      <NA>
2      True
3     False
4     False
      ...  
80     True
81    False
82     True
83    False
84    False
Length: 85, dtype: boolean
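Assuming the columns of proba_ are the class labels, pred_ is simply the argmax over them, which can be verified directly:

# pred_ should match the most probable class for every fitted location
fitted = binary.proba_.dropna()
(fitted.idxmax(axis=1) == binary.pred_[fitted.index]).all()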

The regions with missing values were not fitted due to extreme class imbalance in their local context (the minority class makes up less than 20% of the local y values). This is a behavior specific to classification models in gwlearn. Rather than letting an invariant or extremely imbalanced local y make the model fail, gwlearn records the location as skipped. That also means the location is not used in prediction and, for certain locations, no prediction is possible.
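The skipped locations can be identified directly from the missing focal predictions (a minimal sketch):

# locations whose local model was skipped due to local class imbalance
skipped = binary.pred_.isna()
skipped.sum()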

You can check the proportion of local models that exist using prediction_rate_.

binary.prediction_rate_
np.float64(0.8705882352941177)

This shows that 87% of focal geometries have their respective models; the rest do not.
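The same rate can be recovered from the focal predictions themselves:

# share of locations with a fitted local model
binary.pred_.notna().mean()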

For more on dealing with class imbalance, see the imbalance guide.

Focal metrics (based on the prediction for each focal geometry made by the single local model fitted around it) can be measured using the two outputs shown above.

na_mask = binary.pred_.notna()

metrics.accuracy_score(y[na_mask], binary.pred_[na_mask])
0.9594594594594594

However, you can also extract the data from all local models and use those to measure model performance, rather than relying only on focal geometries.

The data pooled from all local models are accessible as y_pooled_ and pred_pooled_.

metrics.accuracy_score(binary.y_pooled_, binary.pred_pooled_)
0.7864864864864864
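Any metric with the (y_true, y_pred) signature works on the pooled data in the same way; a confusion matrix, for instance:

# rows: true class, columns: predicted class
metrics.confusion_matrix(binary.y_pooled_, binary.pred_pooled_)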

Pooled metrics typically show worse results, as the distance decay of observation weights in the individual models works against accurate predictions for values far from the focal point.

Alternatively, you can measure performance metrics for each local model. This relies on the same pooled data, split by their respective parent local models.

local_accuracy = binary.local_metric(metrics.accuracy_score)
local_accuracy
array([0.64,  nan, 0.84, 0.72, 0.6 , 0.84,  nan, 0.92, 0.84, 0.92,  nan,
       0.8 , 0.96, 0.88, 0.76, 0.84, 0.8 , 0.84, 0.64, 0.6 , 0.8 , 0.88,
       0.56, 0.76, 0.96, 0.88, 0.52, 0.92, 0.8 , 0.8 , 0.92, 0.92, 0.72,
       0.72, 0.6 , 0.72, 0.64, 0.92, 0.64, 0.72, 0.88, 0.76, 0.84,  nan,
       0.88, 0.96, 0.68, 0.8 ,  nan, 0.68, 0.8 , 0.84, 0.88, 0.56, 0.8 ,
       0.8 ,  nan,  nan, 0.88, 1.  , 0.76, 0.92, 0.76, 0.88, 0.76, 0.56,
       0.8 , 0.64, 0.72, 0.72,  nan, 0.96,  nan, 0.88, 0.8 , 0.96,  nan,
        nan, 0.72, 0.76, 0.72, 0.68, 0.76, 0.72, 0.84])
gdf.plot(
    local_accuracy, legend=True, missing_kwds=dict(color="lightgray")
).set_axis_off()
_images/e5ab5db8b59a9b95d7522808e06d7e90c446a6b32381a9212e02236470ba209c.png
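Presumably, local_metric accepts any callable following the (y_true, y_pred) signature, so other scikit-learn metrics can be mapped the same way (a sketch using balanced accuracy, based on the accuracy example above):

local_balanced = binary.local_metric(metrics.balanced_accuracy_score)
gdf.plot(
    local_balanced, legend=True, missing_kwds=dict(color="lightgray")
).set_axis_off()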

The local coefficients are available just as in the regression counterpart.

binary.local_coef_
Crm_prp Litercy Donatns Lottery
0 0.000583 0.032988 0.000678 -0.050549
1 NaN NaN NaN NaN
2 0.001483 0.152063 0.000674 0.199996
3 0.001018 -0.052208 0.000547 -0.010365
4 0.000945 0.008447 0.000280 -0.030962
... ... ... ... ...
80 0.000411 -0.260107 -0.000190 0.036529
81 0.000255 -0.121968 -0.000161 0.019247
82 0.000182 -0.196829 -0.000244 0.065412
83 0.000462 0.010739 0.000122 -0.008270
84 -0.000173 0.042576 0.000985 0.160625

85 rows × 4 columns

Again, this can be explored visually.

f, axs = plt.subplots(2, 2, figsize=(12, 10))

for column, ax in zip(binary.local_coef_.columns, axs.flat):
    gdf.plot(
        binary.local_coef_[column],
        legend=True,
        ax=ax,
        missing_kwds=dict(color="lightgray"),
    )
    ax.set_title(column)
    ax.set_axis_off()
_images/615d5c898405d4ab0a7e4ad85278ac4ad1d3c70e70b7cd3b3c69cd08c02b016a.png

For details on prediction, see the Prediction guide.