spopt.region.Spenc

class spopt.region.Spenc(gdf, w, attrs_name, n_clusters=5, random_state=None, gamma=1, eigen_solver=None, n_init=10, affinity='rbf', n_neighbors=10, eigen_tol=1e-09, assign_labels='discretize', degree=3, coef0=1, kernel_params=None, n_jobs=1)[source]

Spatially encouraged spectral clustering found in [Wol18].

Apply clustering to a projection of the normalized Laplacian, using spatial information to constrain the clustering. In practice spectral clustering is very useful when the structure of the individual clusters is highly non-convex, or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster, for instance when clusters are nested circles in the 2D plane.

Spatially-Encouraged Spectral Clustering (SPENC) is useful when there may be highly non-convex clusters or clusters with irregular topology in a geographic context. If a binary weights matrix is provided during fit, this method can be used to find weighted normalized graph cuts.

When calling fit, an affinity matrix is constructed using either a kernel function, such as the Gaussian (aka RBF) kernel of the Euclidean distance \(d(X, X)\):

numpy.exp(-gamma * d(X,X) ** 2)

or a \(k\)-nearest neighbors connectivity matrix. Alternatively, using affinity='precomputed', a user-provided affinity matrix can be used. Read more in the scikit-learn user guide on spectral clustering.
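As a minimal sketch of the RBF construction described above (the toy feature matrix and gamma=1 are illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy feature matrix: four samples with two attributes each.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

gamma = 1.0
d = cdist(X, X)                     # pairwise Euclidean distances d(X, X)
affinity = np.exp(-gamma * d ** 2)  # the RBF kernel from the formula above

# Each sample has affinity 1 with itself; distant samples approach 0.
assert np.allclose(np.diag(affinity), 1.0)
```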

__init__(gdf, w, attrs_name, n_clusters=5, random_state=None, gamma=1, eigen_solver=None, n_init=10, affinity='rbf', n_neighbors=10, eigen_tol=1e-09, assign_labels='discretize', degree=3, coef0=1, kernel_params=None, n_jobs=1)[source]
Parameters:
gdf : geopandas.GeoDataFrame

Input data.

w : libpysal.weights.W

Spatial weights matrix.

attrs_name : list

Strings for attribute names from columns in gdf.

n_clusters : int (default 5)

The number of clusters to form.

random_state : int or numpy.random.RandomState (default None)

A pseudo random number generator used for the initialization of the lobpcg eigenvector decomposition when eigen_solver='amg' and for the \(k\)-means initialization. If an int, random_state is the seed used by the random number generator; if a numpy.random.RandomState instance, random_state is the random number generator; if None, the random number generator is the numpy.random.RandomState instance used by numpy.random.

gamma : int, float (default 1)

Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for affinity='nearest_neighbors'.

eigen_solver : str (default None)

The eigenvalue decomposition strategy to use. Valid values include {'arpack', 'lobpcg', 'amg'}. AMG requires pyamg to be installed; it may be faster on very large, sparse problems, but may also lead to instabilities. Note: eigen_solver is ignored unless fitting using the breakme flag in the .fit() method (so do not use it).

n_init : int (default 10)

The number of times the \(k\)-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.

affinity : str, array_like, callable() (default 'rbf')

If a str, valid values include {'nearest_neighbors', 'precomputed', 'rbf'} or one of the kernels supported by sklearn.metrics.pairwise_kernels. Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.

n_neighbors : int (default 10)

The number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity='rbf'.

eigen_tol : float (default 1e-9)

Stopping criterion for eigen decomposition of the Laplacian matrix when using 'arpack' as the eigen_solver.

assign_labels : str (default 'discretize')

The strategy to use to assign labels in the embedding space. There are three ways to assign labels after the Laplacian embedding: {'kmeans', 'discretize', 'hierarchical'}:

  • 'kmeans' can be applied and is a popular choice, but it can also be sensitive to initialization.

  • 'discretize' is another approach which is less sensitive to random initialization, and which usually finds better clusters.

  • 'hierarchical' decomposition repeatedly bi-partitions the graph, instead of finding the decomposition all at once, as suggested in [SM00].

degree : float (default 3)

Degree of the polynomial affinity kernel. Ignored by other kernels.

coef0 : float (default 1)

Zero coefficient for polynomial and sigmoid affinity kernels. Ignored by other kernels.

kernel_params : dict (default None)

Parameters (keyword arguments) and values for an affinity kernel passed as a callable object. Ignored by other affinity kernels.

n_jobs : int (default 1)

The number of parallel jobs to run for the nearest-neighbors affinity kernel, if used. If -1, then the number of jobs is set to the number of CPU cores.

Notes

If you have an affinity matrix, such as a distance matrix, for which 0 means identical elements and high values mean very dissimilar elements, it can be transformed into a similarity matrix that is well suited for the algorithm by applying the Gaussian (RBF, heat) kernel:

numpy.exp(-dist_matrix ** 2 / (2. * delta ** 2))

Where delta is a free parameter representing the width of the Gaussian kernel.
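For instance, a sketch of this transformation on a hypothetical precomputed distance matrix (the matrix and the value of delta are illustrative assumptions):

```python
import numpy as np

# Hypothetical distance matrix: 0 on the diagonal, larger = more dissimilar.
dist_matrix = np.array([[0.0, 1.0, 4.0],
                        [1.0, 0.0, 3.0],
                        [4.0, 3.0, 0.0]])

delta = 2.0  # free parameter: the width of the Gaussian kernel
similarity = np.exp(-dist_matrix ** 2 / (2.0 * delta ** 2))

# The result is symmetric, non-negative, and 1.0 for identical elements,
# so it is suitable as a precomputed affinity matrix.
```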

Another alternative is to take a symmetric version of the \(k\)-nearest neighbors connectivity matrix of the points/areas.
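A minimal sketch of that alternative using scikit-learn's kneighbors_graph; the averaging used to symmetrize is one common choice, not necessarily the one SPENC applies internally:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.random((20, 2))  # 20 random points in the plane

# Directed k-NN connectivity: row i marks i's 3 nearest neighbors.
knn = kneighbors_graph(X, n_neighbors=3, mode='connectivity')

# Symmetrize so that i~j whenever either point is a neighbor of the other.
affinity = 0.5 * (knn + knn.T)

assert np.allclose(affinity.toarray(), affinity.toarray().T)
```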

References

Attributes:
affinity_matrix_ : array_like

Affinity matrix used for clustering, in the shape of (n_samples, n_samples). Available only after calling fit.

labels_ : list

Cluster labels of each point or area.

Methods

__init__(gdf, w, attrs_name[, n_clusters, ...])

Initialize a Spenc instance (see Parameters above).

solve([fit_kwargs])

Solve the spenc.

solve(fit_kwargs={})[source]

Solve the spenc.

Parameters:
fit_kwargs : dict

Keyword arguments passed into spenclib.abstracts.SPENC.fit().