This page was generated from notebooks/skater_reg.ipynb.
Skater Regression
This notebook shows the use of the Skater Regression function (Skater_reg), introduced by Anselin & Amaral (2021). For more information on the method, see:
https://www.researchgate.net/publication/353411566_Endogenous_Spatial_Regimes
In this example, in addition to the required packages, we will use geopandas to load the data and matplotlib to plot the results. Alternatively, PySAL’s own IO could also be used to load the data.
[1]:
# Required imports
import libpysal as ps
import numpy as np
import spreg
from spreg.skater_reg import Skater_reg
# Optional imports
import matplotlib.pyplot as plt
import geopandas as gpd
We illustrate Skater_reg with the Messner et al. (2000) data on homicides and selected socio-economic characteristics for continental U.S. counties. The data set can be downloaded from PySAL’s examples repository.
[2]:
# Load the example from PySAL
ps.examples.load_example("NCOVR")
data = gpd.read_file(ps.examples.get_path('NAT.shp')).set_index('FIPS')
# Set dependent and independent variables and the W matrix.
y = data['HR90'].to_numpy()
x = data[['RD90','PS90','UE90']].to_numpy()
w = ps.weights.Queen.from_dataframe(data, use_index=True)
Skater_reg by default uses Euclidean distance to compute the Minimum Spanning Tree (MST). Because Euclidean distance is sensitive to the scale of each variable, we standardize the variables used to compute the MST before calling the main Skater_reg function. Here, we use the X variables to compute the MST, but other sets of variables could be used instead.
We set the number of clusters to 20 and the minimum quorum, i.e. the minimum number of observations in each cluster, to 100.
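To see why standardization matters when distances are Euclidean, consider the toy example below. The data and the helper `euclidean` are made up for illustration and are not part of spreg. Before standardization, the first column, measured in thousands, dominates the distance. After standardization, both columns contribute on the same scale.

```python
import numpy as np

# Toy data (hypothetical): column 0 is measured in thousands, column 1 in units.
x_toy = np.array([[1000.0, 1.0],
                  [2000.0, 1.0],
                  [1000.0, 5.0]])


def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Raw distances: rows 0 and 1 differ only in column 0, rows 0 and 2 only in column 1.
raw_01 = euclidean(x_toy[0], x_toy[1])
raw_02 = euclidean(x_toy[0], x_toy[2])

# Column-wise z-scores, the same transformation applied to x in this notebook.
x_toy_std = (x_toy - np.mean(x_toy, axis=0)) / np.std(x_toy, axis=0)
std_01 = euclidean(x_toy_std[0], x_toy_std[1])
std_02 = euclidean(x_toy_std[0], x_toy_std[2])

print(f"raw:          d(0,1) = {raw_01:.1f}, d(0,2) = {raw_02:.1f}")
print(f"standardized: d(0,1) = {std_01:.4f}, d(0,2) = {std_02:.4f}")
```

In this constructed case the two standardized distances coincide, because each column has the same pattern of one outlying value.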
[3]:
%%time
# Standardize the variables to be used to compute the minimum spanning tree (could add/remove any variable)
x_std = (x - np.mean(x, axis=0)) / np.std(x, axis=0)
# Call the Skater_reg method based on OLS
results = Skater_reg().fit(20, w, x_std, {'reg': spreg.OLS, 'y': y, 'x': x}, quorum=100)
CPU times: user 1min 16s, sys: 31.6 s, total: 1min 48s
Wall time: 25.3 s
The intermediate steps are stored in the attribute _trace. We can use this information to plot the decrease in the total sum of squared residuals as the number of clusters grows. This plot can help in selecting the number of clusters.
[4]:
# As used in this notebook, entry i of _trace corresponds to the solution with i + 1 clusters
trace = [results._trace[i][1][2] for i in range(1, len(results._trace))]
fig, ax = plt.subplots()
ax.plot(list(range(2,len(trace)+2)), trace, '-o', color='black', linewidth=2)
ax.set(xlabel='Number of clusters', ylabel='Total sum of squared residuals')
ax.grid()
plt.show()
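One possible way to read such a curve is an elbow-style rule: stop adding clusters once one more cluster reduces the total sum of squared residuals by less than some tolerance. The sketch below is only an illustration of that heuristic. The values in `ssr_by_k` are made up (they are not taken from the run above), and the helper names `marginal_gains` and `pick_k` as well as the 1% tolerance are assumptions, not part of spreg.

```python
def marginal_gains(ssr):
    """Relative drop in SSR obtained from each additional cluster."""
    return [(prev - cur) / prev for prev, cur in zip(ssr, ssr[1:])]


def pick_k(ssr, first_k=2, tol=0.01):
    """Smallest k for which adding one more cluster improves SSR by less than tol.

    ssr[i] is assumed to be the total SSR for first_k + i clusters.
    """
    for i, gain in enumerate(marginal_gains(ssr)):
        if gain < tol:
            return first_k + i
    # No gain fell below the tolerance: keep the largest k that was tried.
    return first_k + len(ssr) - 1


# Hypothetical SSR values for k = 2, 3, ..., 7 clusters.
ssr_by_k = [100.0, 80.0, 70.0, 65.0, 64.7, 64.5]
chosen_k = pick_k(ssr_by_k, first_k=2, tol=0.01)
print("chosen number of clusters:", chosen_k)
```

A rule like this is a convenience, not a substitute for inspecting the plot and the resulting map.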
Let’s say we choose 12 clusters. We can plot the results using geopandas and matplotlib.
[5]:
data["cl_regions"] = results._trace[11][0]
data.plot(column="cl_regions", categorical=True, legend=True, cmap='Paired').axis("off")
With the cluster allocations and the selected number of clusters, we can call the regimes methods in spreg to get the full regression results and Chow tests on the stability of the coefficients across the 12 different clusters.
[6]:
reg = spreg.OLS_Regimes(y, x, regimes=results._trace[11][0], w=w,
                        name_y=['HR90'], name_x=['RD90','PS90','UE90'],
                        name_regimes='skater_reg')
print(reg.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 0
---------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 0_['HR90'] Number of Observations: 604
Mean dependent var : 2.4577 Number of Variables : 4
S.D. dependent var : 3.9266 Degrees of Freedom : 600
R-squared : 0.3305
Adjusted R-squared : 0.3271
Sum squared residual: 6224.952 F-statistic : 98.7116
Sigma-square : 10.375 Prob(F-statistic) : 6.109e-52
S.E. of regression : 3.221 Log likelihood : -1561.528
Sigma-square ML : 10.306 Akaike info criterion : 3131.057
S.E of regression ML: 3.2103 Schwarz criterion : 3148.671
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
0_CONSTANT 4.0851876 0.4756999 8.5877407 0.0000000
0_RD90 3.1288181 0.2856649 10.9527571 0.0000000
0_PS90 1.4553321 0.1760078 8.2685645 0.0000000
0_UE90 0.0530250 0.0612926 0.8651115 0.3873233
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 7.255
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 20184.603 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 69.424 0.0000
Koenker-Bassett test 3 4.695 0.1956
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 1
---------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 1_['HR90'] Number of Observations: 180
Mean dependent var : 2.6269 Number of Variables : 4
S.D. dependent var : 4.5592 Degrees of Freedom : 176
R-squared : 0.1473
Adjusted R-squared : 0.1328
Sum squared residual: 3172.661 F-statistic : 10.1339
Sigma-square : 18.026 Prob(F-statistic) : 3.424e-06
S.E. of regression : 4.246 Log likelihood : -513.652
Sigma-square ML : 17.626 Akaike info criterion : 1035.304
S.E of regression ML: 4.1983 Schwarz criterion : 1048.076
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
1_CONSTANT 1.6802968 1.0041720 1.6733157 0.0960416
1_RD90 0.9214816 0.7386502 1.2475210 0.2138639
1_PS90 0.5520464 0.3817853 1.4459601 0.1499668
1_UE90 0.3793060 0.1012117 3.7476508 0.0002418
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 6.525
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 4249.029 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 6.927 0.0743
Koenker-Bassett test 3 0.565 0.9044
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 2
---------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 2_['HR90'] Number of Observations: 105
Mean dependent var : 5.4586 Number of Variables : 4
S.D. dependent var : 3.9328 Degrees of Freedom : 101
R-squared : 0.6004
Adjusted R-squared : 0.5885
Sum squared residual: 642.756 F-statistic : 50.5868
Sigma-square : 6.364 Prob(F-statistic) : 4.794e-20
S.E. of regression : 2.523 Log likelihood : -244.108
Sigma-square ML : 6.121 Akaike info criterion : 496.217
S.E of regression ML: 2.4742 Schwarz criterion : 506.832
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
2_CONSTANT 2.9279569 1.6500241 1.7744934 0.0789946
2_RD90 3.8978472 0.8119511 4.8005936 0.0000055
2_PS90 2.5952604 0.2458875 10.5546660 0.0000000
2_UE90 0.3236171 0.1945793 1.6631633 0.0993799
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 15.160
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 6.807 0.0333
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 15.329 0.0016
Koenker-Bassett test 3 14.809 0.0020
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 3
---------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 3_['HR90'] Number of Observations: 157
Mean dependent var : 3.2521 Number of Variables : 4
S.D. dependent var : 3.4925 Degrees of Freedom : 153
R-squared : 0.3735
Adjusted R-squared : 0.3612
Sum squared residual: 1192.118 F-statistic : 30.4049
Sigma-square : 7.792 Prob(F-statistic) : 1.785e-15
S.E. of regression : 2.791 Log likelihood : -381.912
Sigma-square ML : 7.593 Akaike info criterion : 771.824
S.E of regression ML: 2.7556 Schwarz criterion : 784.049
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
3_CONSTANT 1.6648327 1.6499048 1.0090477 0.3145449
3_RD90 2.5911850 0.5446873 4.7571975 0.0000045
3_PS90 1.7951113 0.2645028 6.7867381 0.0000000
3_UE90 0.2831519 0.1896309 1.4931734 0.1374511
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 17.037
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 700.804 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 128.095 0.0000
Koenker-Bassett test 3 23.069 0.0000
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 4
---------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 4_['HR90'] Number of Observations: 157
Mean dependent var : 5.2565 Number of Variables : 4
S.D. dependent var : 7.5670 Degrees of Freedom : 153
R-squared : 0.0718
Adjusted R-squared : 0.0536
Sum squared residual: 8291.273 F-statistic : 3.9445
Sigma-square : 54.191 Prob(F-statistic) : 0.009592
S.E. of regression : 7.361 Log likelihood : -534.160
Sigma-square ML : 52.811 Akaike info criterion : 1076.321
S.E of regression ML: 7.2671 Schwarz criterion : 1088.546
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
4_CONSTANT 6.5333323 2.4673414 2.6479239 0.0089474
4_RD90 2.7602351 1.0586310 2.6073627 0.0100275
4_PS90 -0.6252142 0.6065058 -1.0308463 0.3042397
4_UE90 -0.0983422 0.2825469 -0.3480561 0.7282764
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 9.215
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 10522.321 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 397.444 0.0000
Koenker-Bassett test 3 19.450 0.0002
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 5
---------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 5_['HR90'] Number of Observations: 416
Mean dependent var : 3.5350 Number of Variables : 4
S.D. dependent var : 3.5289 Degrees of Freedom : 412
R-squared : 0.2384
Adjusted R-squared : 0.2328
Sum squared residual: 3936.025 F-statistic : 42.9870
Sigma-square : 9.553 Prob(F-statistic) : 3.458e-24
S.E. of regression : 3.091 Log likelihood : -1057.705
Sigma-square ML : 9.462 Akaike info criterion : 2123.409
S.E of regression ML: 3.0760 Schwarz criterion : 2139.532
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
5_CONSTANT 3.6580644 0.7883611 4.6400874 0.0000047
5_RD90 2.1705064 0.3732128 5.8157339 0.0000000
5_PS90 1.6485127 0.2143249 7.6916535 0.0000000
5_UE90 -0.0049898 0.0843801 -0.0591343 0.9528738
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 11.124
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 2163.820 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 295.778 0.0000
Koenker-Bassett test 3 48.598 0.0000
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 6
---------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 6_['HR90'] Number of Observations: 105
Mean dependent var : 6.9902 Number of Variables : 4
S.D. dependent var : 7.7137 Degrees of Freedom : 101
R-squared : 0.5133
Adjusted R-squared : 0.4989
Sum squared residual: 3011.570 F-statistic : 35.5107
Sigma-square : 29.818 Prob(F-statistic) : 9.373e-16
S.E. of regression : 5.461 Log likelihood : -325.192
Sigma-square ML : 28.682 Akaike info criterion : 658.384
S.E of regression ML: 5.3555 Schwarz criterion : 669.000
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
6_CONSTANT 12.6762323 2.3973782 5.2875396 0.0000007
6_RD90 7.2357749 0.8959392 8.0761894 0.0000000
6_PS90 3.0087836 0.5687185 5.2904618 0.0000007
6_UE90 -0.9087647 0.4285148 -2.1207310 0.0363931
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 10.336
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 1082.543 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 69.868 0.0000
Koenker-Bassett test 3 8.330 0.0397
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 7
---------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 7_['HR90'] Number of Observations: 212
Mean dependent var : 6.7045 Number of Variables : 4
S.D. dependent var : 7.5062 Degrees of Freedom : 208
R-squared : 0.6513
Adjusted R-squared : 0.6463
Sum squared residual: 4145.470 F-statistic : 129.5007
Sigma-square : 19.930 Prob(F-statistic) : 2.429e-47
S.E. of regression : 4.464 Log likelihood : -615.973
Sigma-square ML : 19.554 Akaike info criterion : 1239.945
S.E of regression ML: 4.4220 Schwarz criterion : 1253.372
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
7_CONSTANT 6.8749278 1.4815715 4.6402943 0.0000062
7_RD90 5.0735729 0.4720012 10.7490683 0.0000000
7_PS90 4.3257745 0.4290443 10.0823502 0.0000000
7_UE90 -0.1901823 0.2136345 -0.8902227 0.3743748
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 10.840
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 23.086 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 68.023 0.0000
Koenker-Bassett test 3 50.360 0.0000
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 8
---------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 8_['HR90'] Number of Observations: 142
Mean dependent var : 6.9674 Number of Variables : 4
S.D. dependent var : 7.7639 Degrees of Freedom : 138
R-squared : 0.0816
Adjusted R-squared : 0.0616
Sum squared residual: 7805.895 F-statistic : 4.0858
Sigma-square : 56.564 Prob(F-statistic) : 0.008156
S.E. of regression : 7.521 Log likelihood : -485.973
Sigma-square ML : 54.971 Akaike info criterion : 979.945
S.E of regression ML: 7.4142 Schwarz criterion : 991.769
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
8_CONSTANT 8.9134518 2.2205952 4.0139922 0.0000975
8_RD90 3.1669956 1.0249618 3.0898670 0.0024224
8_PS90 0.9219418 0.7333370 1.2571872 0.2108094
8_UE90 -0.2902235 0.3083686 -0.9411576 0.3482687
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 7.818
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 211.413 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 16.332 0.0010
Koenker-Bassett test 3 4.771 0.1893
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 9
---------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 9_['HR90'] Number of Observations: 494
Mean dependent var : 9.4357 Number of Variables : 4
S.D. dependent var : 6.1868 Degrees of Freedom : 490
R-squared : 0.3161
Adjusted R-squared : 0.3120
Sum squared residual: 12904.325 F-statistic : 75.5098
Sigma-square : 26.335 Prob(F-statistic) : 3.675e-40
S.E. of regression : 5.132 Log likelihood : -1506.863
Sigma-square ML : 26.122 Akaike info criterion : 3021.726
S.E of regression ML: 5.1110 Schwarz criterion : 3038.536
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
9_CONSTANT 10.2257744 0.7060239 14.4836089 0.0000000
9_RD90 4.9173048 0.3601681 13.6528046 0.0000000
9_PS90 2.7435413 0.3773269 7.2709932 0.0000000
9_UE90 -0.5158999 0.1027098 -5.0228862 0.0000007
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 7.112
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 164.804 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 93.891 0.0000
Koenker-Bassett test 3 45.305 0.0000
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 10
----------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 10_['HR90'] Number of Observations: 322
Mean dependent var : 10.9368 Number of Variables : 4
S.D. dependent var : 7.1069 Degrees of Freedom : 318
R-squared : 0.1980
Adjusted R-squared : 0.1904
Sum squared residual: 13003.178 F-statistic : 26.1676
Sigma-square : 40.890 Prob(F-statistic) : 3.738e-15
S.E. of regression : 6.395 Log likelihood : -1052.340
Sigma-square ML : 40.383 Akaike info criterion : 2112.680
S.E of regression ML: 6.3547 Schwarz criterion : 2127.779
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
10_CONSTANT 8.9613419 1.1548606 7.7596741 0.0000000
10_RD90 3.1036861 0.4676112 6.6373221 0.0000000
10_PS90 2.0054035 0.4775095 4.1997143 0.0000347
10_UE90 -0.1659300 0.1618652 -1.0251118 0.3060896
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 8.728
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 805.242 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 271.848 0.0000
Koenker-Bassett test 3 60.502 0.0000
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION - REGIME 11
----------------------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : 11_['HR90'] Number of Observations: 191
Mean dependent var : 12.6163 Number of Variables : 4
S.D. dependent var : 6.4910 Degrees of Freedom : 187
R-squared : 0.2010
Adjusted R-squared : 0.1882
Sum squared residual: 6396.303 F-statistic : 15.6792
Sigma-square : 34.205 Prob(F-statistic) : 3.882e-09
S.E. of regression : 5.848 Log likelihood : -606.337
Sigma-square ML : 33.488 Akaike info criterion : 1220.674
S.E of regression ML: 5.7869 Schwarz criterion : 1233.683
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
11_CONSTANT 10.6523809 1.8667675 5.7063244 0.0000000
11_RD90 2.8884391 0.5972275 4.8364133 0.0000028
11_PS90 0.2205890 0.5966193 0.3697316 0.7120008
11_UE90 -0.0296635 0.3154386 -0.0940388 0.9251790
------------------------------------------------------------------------------------
Regimes variable: skater_reg
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 10.444
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 16.443 0.0003
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 3 18.499 0.0003
Koenker-Bassett test 3 11.542 0.0091
REGIMES DIAGNOSTICS - CHOW TEST
VARIABLE DF VALUE PROB
CONSTANT 11 110.297 0.0000
PS90 11 95.473 0.0000
RD90 11 75.733 0.0000
UE90 11 52.373 0.0000
Global test 44 532.110 0.0000
================================ END OF REPORT =====================================
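As a quick sanity check on the report above, the per-regime figures can be tabulated by hand. The sketch below transcribes the number of observations and the sum of squared residuals of each regime from the printed summaries (it does not read them from the `reg` object) and checks them against the quorum of 100 and the degrees of freedom shown for the Chow tests.

```python
# Transcribed from the per-regime summaries printed above (regimes 0 to 11).
n_obs = [604, 180, 105, 157, 157, 416, 105, 212, 142, 494, 322, 191]
ssr = [6224.952, 3172.661, 642.756, 1192.118, 8291.273, 3936.025,
       3011.570, 4145.470, 7805.895, 12904.325, 13003.178, 6396.303]

quorum = 100      # minimum cluster size passed to Skater_reg().fit
n_regimes = len(n_obs)
n_coef = 4        # CONSTANT, RD90, PS90, UE90

total_obs = sum(n_obs)
total_ssr = sum(ssr)
smallest = min(n_obs)

# Each single-coefficient Chow test compares 12 estimates: 12 - 1 restrictions.
chow_df_single = n_regimes - 1
# The global test covers all coefficients jointly.
chow_df_global = n_coef * (n_regimes - 1)

print("regimes:", n_regimes)
print("total observations:", total_obs)
print("smallest regime:", smallest, "(quorum respected:", smallest >= quorum, ")")
print("sum of per-regime SSR:", round(total_ssr, 3))
print("Chow DF (single, global):", chow_df_single, chow_df_global)
```

The smallest regime has 105 counties, so the quorum is respected, and the degrees of freedom agree with the 11 and 44 reported in the Chow test table.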