spreg.Skater_reg

class spreg.Skater_reg(dissimilarity=<function euclidean_distances>, affinity=None, reduction=<function sum>, center=<function mean>)[source]

Initialize the Skater_reg algorithm based on [AA21]. The function can currently estimate OLS, from spreg or statsmodels, and Spatial Lag models from spreg. The fit method performs estimation and returns a Skater_reg object.

Parameters:
dissimilarity : callable
    A callable distance metric.
    Default: sklearn.metrics.pairwise.euclidean_distances
affinity : callable
    A callable affinity metric in [0, 1]. Will be inverted to provide a dissimilarity metric.
reduction : callable
    The reduction applied over all clusters to provide the map score.
    Default: numpy.sum
center : callable
    Way to compute the center of each region in attribute space.
    Default: numpy.mean

NOTE: Optimization occurs with respect to a *dissimilarity* metric, so the reduction should yield some kind of score where larger values are less desirable than smaller values. Typically, this means we use addition.
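To make that convention concrete, here is a minimal numpy sketch (not the library's internal code) of a map score in which each region contributes its within-region dissimilarity and the reduction aggregates the per-region values, so larger totals indicate worse partitions:

```python
import numpy as np

def region_score(data, labels, center=np.mean, reduction=np.sum):
    # Hypothetical stand-in for the score described above: each region
    # contributes its summed squared distance to its own center, and the
    # reduction (numpy.sum by default) aggregates the per-region values.
    per_region = []
    for lab in np.unique(labels):
        members = data[labels == lab]
        c = center(members, axis=0)
        per_region.append(np.sum((members - c) ** 2))
    return reduction(per_region)

data = np.array([[0.0], [0.1], [5.0], [5.2]])
tight = np.array([0, 0, 1, 1])  # groups similar observations together
loose = np.array([0, 1, 0, 1])  # mixes dissimilar observations
# Larger values are less desirable: the mixed partition scores worse.
assert region_score(data, tight) < region_score(data, loose)
```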

Attributes:
coords : array-like
    n*2, collection of n sets of (x,y) coordinates used for calibration locations
y : array
    n*1, dependent variable
X : array
    n*k, independent variables, excluding the constant
bw : scalar
    bandwidth value consisting of either a distance or N nearest neighbors; user-specified or obtained using Sel_BW
family : family object
    underlying probability model; provides distribution-specific calculations
offset : array
    n*1, the offset variable at the ith location. For a Poisson model this term is often the size of the population at risk or the expected size of the outcome in spatial epidemiology. Default is None, where Ni becomes 1.0 for all locations.
sigma2_v1 : bool
    specify form of the corrected denominator of sigma squared to use for model diagnostics. Acceptable options are: True: n-tr(S) (default); False: n-2(tr(S)+tr(S'S))
kernel : str
    type of kernel function used to weight observations; available options: 'gaussian', 'bisquare', 'exponential'
fixed : bool
    True for distance-based kernel function and False for adaptive (nearest neighbor) kernel function (default)
constant : bool
    True to include intercept in the model (default) and False to exclude it
spherical : bool
    True for spherical coordinates (long-lat), False for projected coordinates (default)
hat_matrix : bool
    True to store the full n by n hat matrix, False to not store it and minimize the memory footprint (default)
n : integer
    number of observations
k : integer
    number of independent variables
mean_y : float
    mean of y
std_y : float
    standard deviation of y
fit_params : dict
    parameters passed into the fit method to define the estimation routine
points : array-like
    n*2, collection of n sets of (x,y) coordinates used for calibration locations instead of all observations; defaults to None unless specified in the predict method
P : array
    n*k, independent variables used to make the prediction, excluding the constant; defaults to None unless specified in the predict method
exog_scale : scalar
    estimated scale using sampled locations; default is None unless specified in the predict method
exog_resid : array-like
    estimated residuals using sampled locations; default is None unless specified in the predict method

Examples
--------
>>> import libpysal as ps
>>> import numpy as np
>>> import spreg
>>> from spreg.skater_reg import Skater_reg
>>> data = ps.io.open(ps.examples.get_path('columbus.dbf'))
>>> y = np.array(data.by_col('HOVAL')).reshape((-1,1))
>>> x_var = ['INC','CRIME']
>>> x = np.array([data.by_col(name) for name in x_var]).T
>>> w = ps.weights.Queen.from_shapefile(ps.examples.get_path("columbus.shp"))
>>> x_std = (x - np.mean(x,axis=0)) / np.std(x,axis=0)
>>> results = Skater_reg().fit(3, w, x_std, {'reg':spreg.OLS,'y':y,'x':x}, quorum=10, trace=False)
>>> results.current_labels_
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 2, 2, 0, 0, 1, 0, 2, 1, 2, 1, 2, 0, 1, 0, 0, 1, 2, 2,
       2, 1, 0, 2, 2], dtype=int32)

__init__(dissimilarity=<function euclidean_distances>, affinity=None, reduction=<function sum>, center=<function mean>)[source]

Methods

__init__([dissimilarity, affinity, ...])

find_cut(MSF[, data, data_reg, ...])

Find the best cut from the MSF.

fit(n_clusters, W[, data, data_reg, quorum, ...])

Method that fits a model with a particular estimation routine.

score_spreg([data, data_reg, all_labels, ...])

This yields a score for the data using methods from the spreg library, given the labels provided.

score_stats([data, data_reg, all_labels, ...])

This yields a score for the data using methods from the statsmodels library, given the labels provided.

find_cut(MSF, data=None, data_reg=None, current_n_subtrees=None, current_labels=None, quorum=-inf, trees_scores=None, labels=None, target_label=None, make=False, verbose=False, model_family='spreg')[source]

Find the best cut from the MSF.

MSF : (N,N) scipy sparse matrix with zero elements removed.
    Represents the adjacency matrix for the minimum spanning forest. Constructed from sparse.csgraph.csgraph_from_dense or using MSF.eliminate_zeros(). You MUST remove zero entries for this to work, otherwise they are considered no-cost paths.
data : (N,p) attribute matrix. If not provided, replaced with an (N,1) vector of ones.
data_reg : optional list containing:
    1- a callable spreg or statsmodels regression method (e.g. OLS or GM_Lag)
    2- np.ndarray of (N,1) shape with N observations on the dependent variable for the regression
    3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (the constant must not be included)
    4- pysal W object to be used in the regression (optional)
current_n_subtrees : integer indicating the current number of subtrees.
current_labels : (N,) flat vector of labels expressing the classification of each observation into a region, not considering the cut under evaluation.
quorum : int denoting the minimum number of elements in a region.
trees_scores : dictionary mapping subtrees' labels to their respective current score.
labels : (N,) flat vector of labels for each point. Represents the "cluster labels" for disconnected components of the graph.
target_label : int from the labels array to subset the MSF. If passed along with labels, then a cut will be found that is restricted to that subset of the MSF.
make : bool, whether or not to modify the input MSF in order to make the best cut that was found.
verbose : bool/int, denoting how much output to provide to the user, in terms of print statements or progress bars.

Returns a namedtuple with in_node, out_node, and score.
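The requirement above that the MSF carry no stored zero entries can be illustrated with a small scipy.sparse sketch (the adjacency values here are arbitrary, chosen only for the demonstration):

```python
import numpy as np
from scipy import sparse

# Build a tiny adjacency matrix that contains an explicit (stored) zero.
# Graph routines treat stored zeros as real, zero-cost edges, which is
# why they must be removed before the MSF is passed to find_cut.
rows = np.array([0, 1, 1])
cols = np.array([1, 0, 2])
vals = np.array([2.0, 2.0, 0.0])  # the last entry is a stored zero
MSF = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))
assert MSF.nnz == 3    # the explicit zero is still stored
MSF.eliminate_zeros()  # drop stored zeros in place
assert MSF.nnz == 2    # only the two real edges remain
```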

fit(n_clusters, W, data=None, data_reg=None, quorum=-inf, trace=True, islands='increase', verbose=False, model_family='spreg')[source]

Method that fits a model with a particular estimation routine.

Parameters:
n_clusters : int
    number of clusters wanted
W : pysal W object
    expresses the neighbor relationships between observations. Should be symmetric and binary, so Queen/Rook, DistanceBand, or a symmetrized KNN.
data : np.ndarray of (N,P) shape
    N observations and P features. This is the data that is used to evaluate the similarity between each observation.
data_reg : list containing:
    1- a callable regression method (e.g. OLS or GM_Lag from spreg, or OLS from statsmodels)
    2- np.ndarray of (N,1) shape with N observations on the dependent variable for the regression
    3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (the constant must not be included)
    4- pysal W object to be used in the regression (optional)
quorum : int
    minimum size of each region.
trace : bool
    denotes whether to store intermediate labelings as the tree gets pruned
islands : str
    describes what to do with islands. If "ignore", will discover n_clusters regions, treating islands as their own regions. If "increase", will discover n_clusters regions, treating islands as separate from n_clusters.
verbose : bool/int
    describes how much output to provide to the user, in terms of print statements or progress bars.
model_family : str
    describes the family of estimation method used for the regression. Must be either 'spreg' (default) or 'statsmodels'.

Returns:
    Skater_reg object.
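As a sketch of assembling the data_reg argument, with a placeholder regression callable (in practice the first element would be, e.g., spreg.OLS, and a pysal W object could be appended as the optional fourth element):

```python
import numpy as np

def my_reg(y, x, **kwargs):
    # Placeholder regression callable, used only to show the list layout;
    # in practice this would be e.g. spreg.OLS or statsmodels OLS.
    pass

rng = np.random.default_rng(0)
y = rng.normal(size=(49, 1))  # N observations on the dependent variable
x = rng.normal(size=(49, 2))  # k explanatory variables, no constant column
data_reg = [my_reg, y, x]     # an optional pysal W object would go fourth
assert callable(data_reg[0]) and y.shape == (49, 1) and x.shape == (49, 2)
```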
score_spreg(data=None, data_reg=None, all_labels=None, quorum=-inf, current_labels=None, current_tree=None)[source]

This yields a score for the data using methods from the spreg library, given the labels provided. If no labels are provided, and the object has been fit, then the labels discovered from the previous fit are used.

If a quorum is not passed, it is assumed to be irrelevant.

If a quorum is passed and the labels do not meet quorum, the score is inf.

data : (N,P) array of data on which to compute the score of the regions expressed in labels
data_reg : list containing:
    1- a callable spreg regression method (e.g. OLS or GM_Lag)
    2- np.ndarray of (N,1) shape with N observations on the dependent variable for the regression
    3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (the constant must not be included)
    4- pysal W object to be used in the regression (optional)
all_labels : (N,) flat vector of labels expressing the classification of each observation into a region, considering the cut under evaluation.
quorum : int expressing the minimum size of regions. Can be -inf if there is no lower bound. Any region below quorum makes the score inf.
current_labels : (N,) flat vector of labels expressing the classification of each observation into a region, not considering the cut under evaluation.
current_tree : integer indicating the tree whose label is currently being considered for division
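The quorum rule above can be sketched as follows (a hypothetical helper, not the library's actual scoring code):

```python
import numpy as np

def quorum_penalty(labels, quorum=-np.inf):
    # Return inf when any region falls below quorum, else 0.0, mirroring
    # the rule that undersized regions make the score inf. The default
    # of -inf means no lower bound is enforced.
    _, counts = np.unique(labels, return_counts=True)
    return np.inf if (counts < quorum).any() else 0.0

labels = np.array([0, 0, 0, 1, 1])
assert quorum_penalty(labels, quorum=2) == 0.0     # both regions qualify
assert np.isinf(quorum_penalty(labels, quorum=3))  # region 1 has only 2 members
assert quorum_penalty(labels) == 0.0               # -inf default: no bound
```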

score_stats(data=None, data_reg=None, all_labels=None, quorum=-inf, current_labels=None, current_tree=None)[source]

This yields a score for the data using methods from the statsmodels library, given the labels provided. If no labels are provided, and the object has been fit, then the labels discovered from the previous fit are used.

If a quorum is not passed, it is assumed to be irrelevant.

If a quorum is passed and the labels do not meet quorum, the score is inf.

data : (N,P) array of data on which to compute the score of the regions expressed in labels
data_reg : list containing:
    1- a callable statsmodels regression method (e.g. OLS)
    2- np.ndarray of (N,1) shape with N observations on the dependent variable for the regression
    3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (the constant must not be included)
    4- pysal W object to be used in the regression (optional)
all_labels : (N,) flat vector of labels expressing the classification of each observation into a region, considering the cut under evaluation.
quorum : int expressing the minimum size of regions. Can be -inf if there is no lower bound. Any region below quorum makes the score inf.
current_labels : (N,) flat vector of labels expressing the classification of each observation into a region, not considering the cut under evaluation.
current_tree : integer indicating the tree whose label is currently being considered for division

NOTE: Optimization occurs with respect to a dissimilarity metric, so the problem minimizes the map dissimilarity. So, lower scores are better.