Linear models

An overview of linear and logistic regression models.

import geopandas as gpd
import matplotlib.pyplot as plt
from geodatasets import get_path
from sklearn import metrics

from gwlearn.linear_model import GWLinearRegression, GWLogisticRegression

Linear regression

Linear regression is the classical example of a geographically weighted regression model. In this example, you can predict the number of suicides from other population variables in the Guerry dataset.

gdf = gpd.read_file(get_path("geoda.guerry"))
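You can peek at the variables used throughout this guide (a minimal sketch; the column subset mirrors the model specification below):

# the four explanatory variables and the target
gdf[["Crm_prp", "Litercy", "Donatns", "Lottery", "Suicids"]].head()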

It is a relatively small dataset covering French data from the 1800s.

gdf.plot().set_axis_off()
_images/4cd6f40416928952dbfc42f4ad77b31636935690a42b86a030a0bd125a4f4d9a.png

Specify the model and fit the data.

adaptive = GWLinearRegression(bandwidth=25, fixed=False, kernel="tricube")
adaptive.fit(
    gdf[["Crm_prp", "Litercy", "Donatns", "Lottery"]],
    gdf["Suicids"],
    geometry=gdf.representative_point(),
)
GWLinearRegression(bandwidth=25, kernel='tricube')

You can get a number of outputs from the fitted model. For example, focal predictions (a value predicted for each observation using the local model fitted around it).

adaptive.pred_
0     65929.297258
1     14289.540382
2     60668.901578
3     25191.170857
4     12777.621045
          ...     
80    41304.669559
81    19766.026593
82    29125.591134
83    25717.441407
84    19381.885833
Length: 85, dtype: float64

This allows you to measure model performance metrics like $R^2$. This value is comparable to the $R^2$ reported by mgwr.

metrics.r2_score(gdf["Suicids"], adaptive.pred_)
0.7030520942210708
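The same value can be reproduced from its definition, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, using the focal predictions shown above (a minimal sketch):

observed = gdf["Suicids"]
# residual sum of squares based on focal predictions
ss_res = ((observed - adaptive.pred_) ** 2).sum()
# total sum of squares around the mean
ss_tot = ((observed - observed.mean()) ** 2).sum()
1 - ss_res / ss_tot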

A local version of $R^2$ is accessible directly.

adaptive.local_r2_
0     0.655274
1     0.564506
2     0.616975
3     0.690529
4     0.704666
        ...   
80    0.444105
81    0.554521
82    0.646275
83    0.174557
84    0.215379
Length: 85, dtype: float64

Any of the values linked to individual observations can then be mapped.

gdf.plot(adaptive.local_r2_, legend=True).set_axis_off()
_images/aedd4462e4f5d2eddbfc9839a28052e2c3b4789432985ceb462617a3f9b5b667.png

Similarly, you can access residuals.

gdf.plot(adaptive.resid_, legend=True).set_axis_off()
_images/89970d5ca799ab996b9e3b3e73ed04922538b49968549a69f5daf0e23e7796b0.png
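Assuming resid_ follows the usual convention of observed minus focal prediction, it can be cross-checked against the two outputs above (a sketch, not a guarantee of the internal definition):

# should be ~0 everywhere if resid_ = observed - focal prediction (an assumption)
(adaptive.resid_ - (gdf["Suicids"] - adaptive.pred_)).abs().max()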

Each local model then has its own local coefficients.

adaptive.local_coef_
Crm_prp Litercy Donatns Lottery
0 3.147001 -500.570435 0.218349 274.308624
1 2.177315 103.815973 1.276062 -94.396645
2 1.155884 -2112.020365 -7.912018 1015.701421
3 7.945416 -189.098606 -3.969571 132.812885
4 7.575983 -111.394715 -4.221681 107.181855
... ... ... ... ...
80 0.422307 -769.135891 0.472364 315.564934
81 2.228270 -343.158854 0.233416 242.722924
82 2.319692 -390.456757 -1.543721 356.856559
83 1.918249 147.505689 1.016886 -58.835632
84 0.027807 -92.139088 1.604362 331.684347

85 rows × 4 columns
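Since local_coef_ is a regular DataFrame, the spatial variation of each coefficient can also be summarised numerically before mapping (a minimal sketch):

# spread of each coefficient across the 85 local models
adaptive.local_coef_.describe().loc[["min", "mean", "max"]]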

Again, this can be explored visually.

f, axs = plt.subplots(2, 2, figsize=(12, 10))

for column, ax in zip(adaptive.local_coef_.columns, axs.flat):
    gdf.plot(adaptive.local_coef_[column], legend=True, ax=ax)
    ax.set_title(column)
    ax.set_axis_off()
_images/a63aff8a4c55141f967d79fa6af1a8fbfd1223b16058333419dc78d9c7f43acd.png

Alongside local coefficients, you can retrieve a local intercept value.

gdf.plot(adaptive.local_intercept_, legend=True).set_axis_off()
_images/af4e172ded8f62ad2aef95838509e4c7ad57dc52d465c59c731d7c619c0c10fa.png

For details on prediction, see the Prediction guide.

Logistic regression

Due to the way gwlearn extends scikit-learn, you can also fit a local version of logistic regression for classification tasks. However, there’s a caveat. Since it is unlikely that all categories are present in all local models, fitting a non-binary model would lead to inconsistent local models. Hence, gwlearn currently supports only binary classification, akin to the binomial model in mgwr.

Let’s illustrate it by predicting whether the number of suicides is over or under the median.

y = gdf["Suicids"] > gdf["Suicids"].median()

The API then looks the same, following scikit-learn’s model.

binary = GWLogisticRegression(bandwidth=25, fixed=False, max_iter=500)
binary.fit(
    gdf[["Crm_prp", "Litercy", "Donatns", "Lottery"]],
    y,
    geometry=gdf.representative_point(),
)
GWLogisticRegression(bandwidth=25)

Given this is a classification task, we can retrieve probabilities from focal predictions.

binary.proba_
False True
0 0.005900 0.994100
1 NaN NaN
2 0.242553 0.757447
3 0.972850 0.027150
4 0.716932 0.283068
... ... ...
80 0.435476 0.564524
81 0.809622 0.190378
82 0.439796 0.560204
83 0.542404 0.457596
84 0.988792 0.011208

85 rows × 2 columns
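As with scikit-learn's predict_proba, the class probabilities of every fitted location should sum to one; a quick sanity check, dropping the skipped locations first:

# probabilities per fitted location should sum to one
binary.proba_.dropna().sum(axis=1).round(6).unique()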

Or their binary counterparts (based on the maximum value).

binary.pred_
0      True
1      <NA>
2      True
3     False
4     False
      ...  
80     True
81    False
82     True
83    False
84    False
Length: 85, dtype: boolean
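Assuming the columns of proba_ are the class labels, pred_ is simply the argmax over them, which can be verified directly:

# pred_ should match the most probable class for every fitted location
fitted = binary.proba_.dropna()
(fitted.idxmax(axis=1) == binary.pred_[fitted.index]).all()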

The regions with missing values were not fitted due to extreme class imbalance in their local context (the minority class makes up less than 20% of the local y values). This is a behavior specific to classification models in gwlearn. Rather than letting an invariant or extremely imbalanced local y make the model fail, gwlearn records the location as skipped. That also means the location is not used in prediction and, for certain locations, no prediction is possible.
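The skipped locations can be identified directly from the missing focal predictions (a minimal sketch):

# locations whose local model was skipped due to local class imbalance
skipped = binary.pred_.isna()
skipped.sum()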

You can check the proportion of local models that exist using prediction_rate_.

binary.prediction_rate_
np.float64(0.8705882352941177)

This shows that 87% of focal geometries have their respective models; the rest do not.
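The same rate can be recovered from the focal predictions themselves:

# share of locations with a fitted local model
binary.pred_.notna().mean()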

For more on dealing with class imbalance, see the imbalance guide.

Focal metrics (based on the prediction for each focal geometry made by the single local model fitted around it) can be measured using the two outputs shown above.

na_mask = binary.pred_.notna()

metrics.accuracy_score(y[na_mask], binary.pred_[na_mask])
0.9594594594594594

However, you can also extract the data from all local models and use those to measure model performance, rather than relying only on focal geometries.

The data pooled from all local models are accessible as y_pooled_ and pred_pooled_.

metrics.accuracy_score(binary.y_pooled_, binary.pred_pooled_)
0.7864864864864864
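Any metric with the (y_true, y_pred) signature works on the pooled data in the same way; a confusion matrix, for instance:

# rows: true class, columns: predicted class
metrics.confusion_matrix(binary.y_pooled_, binary.pred_pooled_)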

Pooled metrics typically show worse results, as the distance decay of observation weights in the individual models works against accurate predictions for values far from the focal point.

Alternatively, you can measure performance metrics for each local model. This relies on the same pooled data, split by their respective parent local models.

local_accuracy = binary.local_metric(metrics.accuracy_score)
local_accuracy
array([0.64,  nan, 0.84, 0.72, 0.6 , 0.84,  nan, 0.92, 0.84, 0.92,  nan,
       0.8 , 0.96, 0.88, 0.76, 0.84, 0.8 , 0.84, 0.64, 0.6 , 0.8 , 0.88,
       0.56, 0.76, 0.96, 0.88, 0.52, 0.92, 0.8 , 0.8 , 0.92, 0.92, 0.72,
       0.72, 0.6 , 0.72, 0.64, 0.92, 0.64, 0.72, 0.88, 0.76, 0.84,  nan,
       0.88, 0.96, 0.68, 0.8 ,  nan, 0.68, 0.8 , 0.84, 0.88, 0.56, 0.8 ,
       0.8 ,  nan,  nan, 0.88, 1.  , 0.76, 0.92, 0.76, 0.88, 0.76, 0.56,
       0.8 , 0.64, 0.72, 0.72,  nan, 0.96,  nan, 0.88, 0.8 , 0.96,  nan,
        nan, 0.72, 0.76, 0.72, 0.68, 0.76, 0.72, 0.84])
gdf.plot(
    local_accuracy, legend=True, missing_kwds=dict(color="lightgray")
).set_axis_off()
_images/e5ab5db8b59a9b95d7522808e06d7e90c446a6b32381a9212e02236470ba209c.png
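Presumably, local_metric accepts any callable following the (y_true, y_pred) signature, so other scikit-learn metrics can be mapped the same way (a sketch using balanced accuracy, based on the accuracy example above):

local_balanced = binary.local_metric(metrics.balanced_accuracy_score)
gdf.plot(
    local_balanced, legend=True, missing_kwds=dict(color="lightgray")
).set_axis_off()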

The local coefficients are available just as in the regression counterpart.

binary.local_coef_
Crm_prp Litercy Donatns Lottery
0 0.000583 0.032988 0.000678 -0.050549
1 NaN NaN NaN NaN
2 0.001483 0.152063 0.000674 0.199996
3 0.001018 -0.052208 0.000547 -0.010365
4 0.000945 0.008447 0.000280 -0.030962
... ... ... ... ...
80 0.000411 -0.260107 -0.000190 0.036529
81 0.000255 -0.121968 -0.000161 0.019247
82 0.000182 -0.196829 -0.000244 0.065412
83 0.000462 0.010739 0.000122 -0.008270
84 -0.000173 0.042576 0.000985 0.160625

85 rows × 4 columns

Again, this can be explored visually.

f, axs = plt.subplots(2, 2, figsize=(12, 10))

for column, ax in zip(binary.local_coef_.columns, axs.flat):
    gdf.plot(
        binary.local_coef_[column],
        legend=True,
        ax=ax,
        missing_kwds=dict(color="lightgray"),
    )
    ax.set_title(column)
    ax.set_axis_off()
_images/615d5c898405d4ab0a7e4ad85278ac4ad1d3c70e70b7cd3b3c69cd08c02b016a.png

For details on prediction, see the Prediction guide.