spreg.Skater_reg¶
- class spreg.Skater_reg(dissimilarity=<function euclidean_distances>, affinity=None, reduction=<function sum>, center=<function mean>)[source]¶
Initialize the Skater_reg algorithm based on [AA24]. The function can currently estimate OLS, from spreg or stats_models, and Spatial Lag models from spreg. Fit method performs estimation and returns a Skater_reg object.
- Parameters:
- dissimilarity
a
callable()
distance
metric. Default: sklearn.metrics.pairwise.euclidean_distances
- affinity
a
callable()
affinity
metric
between
0,1. Will be inverted to provide a dissimilarity metric.
- reduction
the
reduction
applied
over
all
clusters
to provide the map score. Default: numpy.sum
- center
way
to
compute
the
center
of
each
region
in
attribute
space
Default: numpy.mean
- NOTE: Optimization occurs with respect to a *dissimilarity* metric, so the reduction should
yield some kind of score where larger values are less desirable than smaller values. Typically, this means we use addition.
- dissimilarity
- Attributes:
- coordsarray_like
n*2, collection of n sets of (x,y) coordinates used for calibration locations
- y
array
n*1, dependent variable
- X
array
n*k, independent variable, not including the constant
- bwscalar
bandwidth value consisting of either a distance or N nearest neighbors; user specified or obtained using Sel_BW
- family
family
object
underlying probability model; provides distribution-specific calculations
- offset
array
n*1, the offset variable at the ith location. For Poisson model this term is often the size of the population at risk or the expected size of the outcome in spatial epidemiology Default is None where Ni becomes 1.0 for all locations
- sigma2_v1bool
specify form of corrected denominator of sigma squared to use for model diagnostics; Acceptable options are: ‘True’: n-tr(S) (defualt) ‘False’: n-2(tr(S)+tr(S’S))
- kernel
str
type of kernel function used to weight observations; available options: ‘gaussian’ ‘bisquare’ ‘exponential’
- fixedbool
True for distance based kernel function and False for adaptive (nearest neighbor) kernel function (default)
- constantbool
True to include intercept (default) in model and False to exclude intercept
- sphericalbool
True for shperical coordinates (long-lat), False for projected coordinates (defalut).
- hat_matrixbool
True to store full n by n hat matrix, False to not store full hat matrix to minimize memory footprint (defalut).
- n
integer
number of observations
- k
integer
number of independent variables
- mean_y
float
mean of y
- std_y
float
standard deviation of y
- fit_params
dict
parameters passed into fit method to define estimation routine
- pointsarray_like
n*2, collection of n sets of (x,y) coordinates used for calibration locations instead of all observations; defaults to None unles specified in predict method
- P
array
n*k, independent variables used to make prediction; exlcuding the constant; default to None unless specified in predict method
- exog_scalescalar
estimated scale using sampled locations; defualt is None unless specified in predict method
- exog_residarray_like
estimated residuals using sampled locations; defualt is None unless specified in predict method
- Examples
- ——–
- >>> import libpysal as ps
- >>> import numpy as np
- >>> import spreg
- >>> from spreg.skater_reg import Skater_reg
- >>> data = ps.io.open(ps.examples.get_path(‘columbus.dbf’))
- >>> y = np.array(data.by_col(‘HOVAL’)).reshape((-1,1))
- >>> x_var = [‘INC’,’CRIME’]
- >>> x = np.array([db.by_col(name) for name in x_var]).T
- >>> w = ps.weights.Queen.from_shapefile(ps.examples.get_path(“columbus.shp”))
- >>> x_std = (x - np.mean(x,axis=0)) / np.std(x,axis=0)
- >>> results = Skater_reg().fit(3, w, x_std, {‘reg’:spreg.OLS,’y’:y,’x’:x}, quorum=10, trace=False)
- >>> results.current_labels_
- array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
0, 1, 0, 0, 2, 2, 0, 0, 1, 0, 2, 1, 2, 1, 2, 0, 1, 0, 0, 1, 2, 2, 2, 1, 0, 2, 2], dtype=int32)
- __init__(dissimilarity=<function euclidean_distances>, affinity=None, reduction=<function sum>, center=<function mean>)[source]¶
Methods
__init__
([dissimilarity, affinity, ...])find_cut
(MSF[, data, data_reg, ...])Find the best cut from the MSF.
fit
(n_clusters, W[, data, data_reg, quorum, ...])Method that fits a model with a particular estimation routine.
score_spreg
([data, data_reg, all_labels, ...])This yields a score for the data using methods from the spreg library, given the labels provided.
score_stats
([data, data_reg, all_labels, ...])This yields a score for the data using methods from the stats_models library, given the labels provided.
- find_cut(MSF, data=None, data_reg=None, current_n_subtrees=None, current_labels=None, quorum=-inf, trees_scores=None, labels=None, target_label=None, make=False, verbose=False, model_family='spreg')[source]¶
Find the best cut from the MSF.
- MSF: (N,N) scipy sparse matrix with zero elements removed.
Represents the adjacency matrix for the minimum spanning forest. Constructed from sparse.csgraph.sparse_from_dense or using MSF.eliminate_zeros(). You MUST remove zero entries for this to work, otherwise they are considered no-cost paths.
data: (N,p) attribute matrix. If not provided, replaced with (N,1) vector of ones. data_reg: optional list containing:
1- a callable spreg or statsmodels regression method (ex. OLS or GM_Lag) 2- np.ndarray of (N,1) shape with N observations on the depedent variable for the regression 3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (constant must not be included) 4- pysal W object to be used in the regression (optional)
current_n_subtrees: integer indication the current number of subtrees. current_labels: (N,) flat vector of labels expressing the classification of each observation into a region not considering the cut under evaluation. quorum: int denoting the minimum number of elements in the region trees_scores: dictionary indicating subtress’s labels and their respective current score. labels: (N,) flat vector of labels for each point. Represents the “cluster labels”
for disconnected components of the graph.
- target_label: int from the labels array to subset the MSF. If passed along with labels, then a cut
will be found that is restricted to that subset of the MSF.
make: bool, whether or not to modify the input MSF in order to make the best cut that was found. verbose: bool/int, denoting how much output to provide to the user, in terms
of print statements or progressbars
Returns a namedtuple with in_node, out_node, and score.
- fit(n_clusters, W, data=None, data_reg=None, quorum=-inf, trace=True, islands='increase', verbose=False, model_family='spreg')[source]¶
Method that fits a model with a particular estimation routine.
- Parameters:
- n_clusters
int
of
clusters
wanted
- W
pysal
W
object
expressing
the
neighbor
relationships
between
observations. Should be symmetric and binary, so Queen/Rook, DistanceBand, or a symmetrized KNN.
- data
np.ndarray
of
(N,P)shape
with
N
observations
and
P
features
This is the data that is used to evaluate the similarity between each observation.
- data_reg
list
containing: 1- a callable regression method (ex. OLS or GM_Lag from spreg or OLS from statsmodels) 2- np.ndarray of (N,1) shape with N observations on the depedent variable for the regression 3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (constant must not be included) 4- pysal W object to be used in the regression (optional)
- quorum
int
with
minimum
size
of
each
region. - tracebool
denoting
whether
to
store
intermediate
labelings as the tree gets pruned
- islands
str
describing
what
to
do
with
islands. If “ignore”, will discover n_clusters regions, treating islands as their own regions. If “increase”, will discover n_clusters regions, treating islands as separate from n_clusters.
- verbosebool/int
describing
how
much
output
to
provide
to
the
user, in terms of print statements or progressbars.
- model_family
str
describing
the
fFamily
of
estimation
method
used
for
the
regression. Must be either ‘spreg’ (default) or ‘statsmodels’
- n_clusters
- Returns:
- :
Skater_reg
object.
- :
- score_spreg(data=None, data_reg=None, all_labels=None, quorum=-inf, current_labels=None, current_tree=None)[source]¶
This yields a score for the data using methods from the spreg library, given the labels provided. If no labels are provided, and the object has been fit, then the labels discovered from the previous fit are used.
If a quorum is not passed, it is assumed to be irrelevant.
If a quorum is passed and the labels do not meet quorum, the score is inf.
data : (N,P) array of data on which to compute the score of the regions expressed in labels data_reg : list containing:
1- a callable spreg regression method (ex. OLS or GM_Lag) 2- np.ndarray of (N,1) shape with N observations on the depedent variable for the regression 3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (constant must not be included) 4- pysal W object to be used in the regression (optional)
all_labels : (N,) flat vector of labels expressing the classification of each observation into a region considering the cut under evaluation. quorum : int expressing the minimum size of regions. Can be -inf if there is no lower bound.
Any region below quorum makes the score inf.
current_labels: (N,) flat vector of labels expressing the classification of each observation into a region not considering the cut under evaluation.
current_tree: integer indicating the tree’s label currently being considered for division
- score_stats(data=None, data_reg=None, all_labels=None, quorum=-inf, current_labels=None, current_tree=None)[source]¶
This yields a score for the data using methods from the stats_models library, given the labels provided. If no labels are provided, and the object has been fit, then the labels discovered from the previous fit are used.
If a quorum is not passed, it is assumed to be irrelevant.
If a quorum is passed and the labels do not meet quorum, the score is inf.
data : (N,P) array of data on which to compute the score of the regions expressed in labels data_reg : list containing:
1- a callable statsmodels regression method (ex. OLS) 2- np.ndarray of (N,1) shape with N observations on the depedent variable for the regression 3- np.ndarray of (N,k) shape with N observations and k columns containing the explanatory variables (constant must not be included) 4- pysal W object to be used in the regression (optional)
all_labels : (N,) flat vector of labels expressing the classification of each observation into a region considering the cut under evaluation. quorum : int expressing the minimum size of regions. Can be -inf if there is no lower bound.
Any region below quorum makes the score inf.
current_labels: (N,) flat vector of labels expressing the classification of each observation into a region not considering the cut under evaluation.
current_tree: integer indicating the tree label is currently being considered for division
- NOTE: Optimization occurs with respect to a dissimilarity metric, so the problem minimizes
the map dissimilarity. So, lower scores are better.