spreg.Skater_reg

class spreg.Skater_reg(dissimilarity=<function euclidean_distances>, affinity=None, reduction=<function sum>, center=<function mean>)[source]

Initialize the Skater_reg algorithm based on [AA21]. The function can currently estimate OLS, from spreg or statsmodels, and Spatial Lag models from spreg. The fit method performs estimation and returns a Skater_reg object.

Parameters:
dissimilarity : callable
    A callable distance metric.
    Default: sklearn.metrics.pairwise.euclidean_distances
affinity : callable
    A callable affinity metric in [0, 1]. Will be inverted to provide a dissimilarity metric.
reduction : callable
    The reduction applied over all clusters to provide the map score.
    Default: numpy.sum
center : callable
    Way to compute the center of each region in attribute space.
    Default: numpy.mean

NOTE: Optimization occurs with respect to a *dissimilarity* metric, so the reduction should yield some kind of score where larger values are less desirable than smaller values. Typically, this means we use addition.
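To make that convention concrete, here is a minimal numpy sketch (not the library's internal code) of a map score in which each region contributes its within-region dissimilarity and the reduction aggregates the per-region values, so larger totals indicate worse partitions:

```python
import numpy as np

def region_score(data, labels, center=np.mean, reduction=np.sum):
    # Hypothetical stand-in for the score described above: each region
    # contributes its summed squared distance to its own center, and the
    # reduction (numpy.sum by default) aggregates the per-region values.
    per_region = []
    for lab in np.unique(labels):
        members = data[labels == lab]
        c = center(members, axis=0)
        per_region.append(np.sum((members - c) ** 2))
    return reduction(per_region)

data = np.array([[0.0], [0.1], [5.0], [5.2]])
tight = np.array([0, 0, 1, 1])  # groups similar observations together
loose = np.array([0, 1, 0, 1])  # mixes dissimilar observations
# Larger values are less desirable: the mixed partition scores worse.
assert region_score(data, tight) < region_score(data, loose)
```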

Attributes:
coords : array-like
    n*2, collection of n sets of (x,y) coordinates used for calibration locations
y : array
    n*1, dependent variable
X : array
    n*k, independent variables, excluding the constant
bw : scalar
    bandwidth value consisting of either a distance or N nearest neighbors; user-specified or obtained using Sel_BW
family : family object
    underlying probability model; provides distribution-specific calculations
offset : array
    n*1, the offset variable at the ith location. For a Poisson model this term is often the size of the population at risk or the expected size of the outcome in spatial epidemiology. Default is None, where Ni becomes 1.0 for all locations.
sigma2_v1 : bool
    specify form of the corrected denominator of sigma squared to use for model diagnostics. Acceptable options are: True: n-tr(S) (default); False: n-2(tr(S)+tr(S'S))
kernel : str
    type of kernel function used to weight observations; available options: 'gaussian', 'bisquare', 'exponential'
fixed : bool
    True for distance-based kernel function and False for adaptive (nearest neighbor) kernel function (default)
constant : bool
    True to include intercept in the model (default) and False to exclude it
spherical : bool
    True for spherical coordinates (long-lat), False for projected coordinates (default)
hat_matrix : bool
    True to store the full n by n hat matrix, False to not store it and minimize the memory footprint (default)
n : integer
    number of observations
k : integer
    number of independent variables
mean_y : float
    mean of y
std_y : float
    standard deviation of y
fit_params : dict
    parameters passed into the fit method to define the estimation routine
points : array-like
    n*2, collection of n sets of (x,y) coordinates used for calibration locations instead of all observations; defaults to None unless specified in the predict method
P : array
    n*k, independent variables used to make the prediction, excluding the constant; defaults to None unless specified in the predict method
exog_scale : scalar
    estimated scale using sampled locations; default is None unless specified in the predict method
exog_resid : array-like
    estimated residuals using sampled locations; default is None unless specified in the predict method

Examples
--------
>>> import libpysal as ps
>>> import numpy as np
>>> import spreg
>>> from spreg.skater_reg import Skater_reg
>>> data = ps.io.open(ps.examples.get_path('columbus.dbf'))
>>> y = np.array(data.by_col('HOVAL')).reshape((-1,1))
>>> x_var = ['INC','CRIME']
>>> x = np.array([data.by_col(name) for name in x_var]).T
>>> w = ps.weights.Queen.from_shapefile(ps.examples.get_path("columbus.shp"))
>>> x_std = (x - np.mean(x,axis=0)) / np.std(x,axis=0)
>>> results = Skater_reg().fit(3, w, x_std, {'reg':spreg.OLS,'y':y,'x':x}, quorum=10, trace=False)
>>> results.current_labels_
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 2, 2, 0, 0, 1, 0, 2, 1, 2, 1, 2, 0, 1, 0, 0, 1, 2, 2,
       2, 1, 0, 2, 2], dtype=int32)

__init__(dissimilarity=<function euclidean_distances>, affinity=None, reduction=<function sum>, center=<function mean>)[source]

Methods

__init__([dissimilarity, affinity, ...])

find_cut(MSF[, data, data_reg, ...])

Find the best cut from the MSF.

fit(n_clusters, W[, data, data_reg, quorum, ...])

Method that fits a model with a particular estimation routine.

score_spreg([data, data_reg, all_labels, ...])

This yields a score for the data using methods from the spreg library, given the labels provided.

score_stats([data, data_reg, all_labels, ...])

This yields a score for the data using methods from the statsmodels library, given the labels provided.

find_cut(MSF, data=None, data_reg=None, current_n_subtrees=None, current_labels=None, quorum=-inf, trees_scores=None, labels=None, target_label=None, make=False, verbose=False, model_family='spreg')[source]

Find the best cut from the MSF.

MSF : (N,N) scipy sparse matrix with zero elements removed.
    Represents the adjacency matrix for the minimum spanning forest. Constructed from sparse.csgraph.csgraph_from_dense or using MSF.eliminate_zeros(). You MUST remove zero entries for this to work, otherwise they are considered no-cost paths.
data : (N,p) attribute matrix. If not provided, replaced with an (N,1) vector of ones.
data_reg : optional list containing:
    1- a callable spreg or statsmodels regression method (e.g. OLS or GM_Lag)
    2- np.ndarray of (N,1) shape with N observations on the dependent variable for the regression
    3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (the constant must not be included)
    4- pysal W object to be used in the regression (optional)
current_n_subtrees : integer indicating the current number of subtrees.
current_labels : (N,) flat vector of labels expressing the classification of each observation into a region, not considering the cut under evaluation.
quorum : int denoting the minimum number of elements in a region.
trees_scores : dictionary mapping subtrees' labels to their respective current score.
labels : (N,) flat vector of labels for each point. Represents the "cluster labels" for disconnected components of the graph.
target_label : int from the labels array to subset the MSF. If passed along with labels, then a cut will be found that is restricted to that subset of the MSF.
make : bool, whether or not to modify the input MSF in order to make the best cut that was found.
verbose : bool/int, denoting how much output to provide to the user, in terms of print statements or progress bars.

Returns a namedtuple with in_node, out_node, and score.
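The requirement above that the MSF carry no stored zero entries can be illustrated with a small scipy.sparse sketch (the adjacency values here are arbitrary, chosen only for the demonstration):

```python
import numpy as np
from scipy import sparse

# Build a tiny adjacency matrix that contains an explicit (stored) zero.
# Graph routines treat stored zeros as real, zero-cost edges, which is
# why they must be removed before the MSF is passed to find_cut.
rows = np.array([0, 1, 1])
cols = np.array([1, 0, 2])
vals = np.array([2.0, 2.0, 0.0])  # the last entry is a stored zero
MSF = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))
assert MSF.nnz == 3    # the explicit zero is still stored
MSF.eliminate_zeros()  # drop stored zeros in place
assert MSF.nnz == 2    # only the two real edges remain
```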

fit(n_clusters, W, data=None, data_reg=None, quorum=-inf, trace=True, islands='increase', verbose=False, model_family='spreg')[source]

Method that fits a model with a particular estimation routine.

Parameters:
n_clusters : int
    number of clusters wanted
W : pysal W object
    expresses the neighbor relationships between observations. Should be symmetric and binary, so Queen/Rook, DistanceBand, or a symmetrized KNN.
data : np.ndarray of (N,P) shape
    N observations and P features. This is the data that is used to evaluate the similarity between each observation.
data_reg : list containing:
    1- a callable regression method (e.g. OLS or GM_Lag from spreg, or OLS from statsmodels)
    2- np.ndarray of (N,1) shape with N observations on the dependent variable for the regression
    3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (the constant must not be included)
    4- pysal W object to be used in the regression (optional)
quorum : int
    minimum size of each region.
trace : bool
    denotes whether to store intermediate labelings as the tree gets pruned
islands : str
    describes what to do with islands. If "ignore", will discover n_clusters regions, treating islands as their own regions. If "increase", will discover n_clusters regions, treating islands as separate from n_clusters.
verbose : bool/int
    describes how much output to provide to the user, in terms of print statements or progress bars.
model_family : str
    describes the family of estimation method used for the regression. Must be either 'spreg' (default) or 'statsmodels'.

Returns:
    Skater_reg object.
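As a sketch of assembling the data_reg argument, with a placeholder regression callable (in practice the first element would be, e.g., spreg.OLS, and a pysal W object could be appended as the optional fourth element):

```python
import numpy as np

def my_reg(y, x, **kwargs):
    # Placeholder regression callable, used only to show the list layout;
    # in practice this would be e.g. spreg.OLS or statsmodels OLS.
    pass

rng = np.random.default_rng(0)
y = rng.normal(size=(49, 1))  # N observations on the dependent variable
x = rng.normal(size=(49, 2))  # k explanatory variables, no constant column
data_reg = [my_reg, y, x]     # an optional pysal W object would go fourth
assert callable(data_reg[0]) and y.shape == (49, 1) and x.shape == (49, 2)
```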
score_spreg(data=None, data_reg=None, all_labels=None, quorum=-inf, current_labels=None, current_tree=None)[source]

This yields a score for the data using methods from the spreg library, given the labels provided. If no labels are provided, and the object has been fit, then the labels discovered from the previous fit are used.

If a quorum is not passed, it is assumed to be irrelevant.

If a quorum is passed and the labels do not meet quorum, the score is inf.

data : (N,P) array of data on which to compute the score of the regions expressed in labels
data_reg : list containing:
    1- a callable spreg regression method (e.g. OLS or GM_Lag)
    2- np.ndarray of (N,1) shape with N observations on the dependent variable for the regression
    3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (the constant must not be included)
    4- pysal W object to be used in the regression (optional)
all_labels : (N,) flat vector of labels expressing the classification of each observation into a region, considering the cut under evaluation.
quorum : int expressing the minimum size of regions. Can be -inf if there is no lower bound. Any region below quorum makes the score inf.
current_labels : (N,) flat vector of labels expressing the classification of each observation into a region, not considering the cut under evaluation.
current_tree : integer indicating the tree whose label is currently being considered for division
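The quorum rule above can be sketched as follows (a hypothetical helper, not the library's actual scoring code):

```python
import numpy as np

def quorum_penalty(labels, quorum=-np.inf):
    # Return inf when any region falls below quorum, else 0.0, mirroring
    # the rule that undersized regions make the score inf. The default
    # of -inf means no lower bound is enforced.
    _, counts = np.unique(labels, return_counts=True)
    return np.inf if (counts < quorum).any() else 0.0

labels = np.array([0, 0, 0, 1, 1])
assert quorum_penalty(labels, quorum=2) == 0.0     # both regions qualify
assert np.isinf(quorum_penalty(labels, quorum=3))  # region 1 has only 2 members
assert quorum_penalty(labels) == 0.0               # -inf default: no bound
```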

score_stats(data=None, data_reg=None, all_labels=None, quorum=-inf, current_labels=None, current_tree=None)[source]

This yields a score for the data using methods from the statsmodels library, given the labels provided. If no labels are provided, and the object has been fit, then the labels discovered from the previous fit are used.

If a quorum is not passed, it is assumed to be irrelevant.

If a quorum is passed and the labels do not meet quorum, the score is inf.

data : (N,P) array of data on which to compute the score of the regions expressed in labels
data_reg : list containing:
    1- a callable statsmodels regression method (e.g. OLS)
    2- np.ndarray of (N,1) shape with N observations on the dependent variable for the regression
    3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (the constant must not be included)
    4- pysal W object to be used in the regression (optional)
all_labels : (N,) flat vector of labels expressing the classification of each observation into a region, considering the cut under evaluation.
quorum : int expressing the minimum size of regions. Can be -inf if there is no lower bound. Any region below quorum makes the score inf.
current_labels : (N,) flat vector of labels expressing the classification of each observation into a region, not considering the cut under evaluation.
current_tree : integer indicating the tree whose label is currently being considered for division

NOTE: Optimization occurs with respect to a dissimilarity metric, so the problem minimizes the map dissimilarity. So, lower scores are better.