esda.adbscan.ADBSCAN

class esda.adbscan.ADBSCAN(eps, min_samples, algorithm='auto', n_jobs=1, pct_exact=0.1, reps=100, keep_solus=False, pct_thr=0.9)[source]

A-DBSCAN, as introduced in [ABGLopezVM21].

A-DBSCAN is an extension of the original DBSCAN algorithm that creates an ensemble of solutions generated by running DBSCAN on random subsets of the data and “extending” each solution to the rest of the sample through nearest-neighbor regression.

See the original reference ([ABGLopezVM21]) for more details, or the notebook guide for an illustration.
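The ensemble logic can be sketched with scikit-learn primitives. This is a minimal illustration of a single draw, not esda's implementation: `single_draw` is a hypothetical helper that runs DBSCAN on a random subset and extends that subset's labels to every point with a 1-nearest-neighbor classifier.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

def single_draw(db, eps, min_samples, pct_exact=0.1, rng=None):
    """One draw of the ensemble: DBSCAN on a random subset, labels
    extended to all points by a 1-nearest-neighbor vote.
    Illustrative only -- not esda's actual code."""
    rng = np.random.default_rng(rng)
    xys = db[["X", "Y"]].to_numpy()
    n = len(db)
    size = max(int(n * pct_exact), min_samples)
    sample = rng.choice(n, size=size, replace=False)
    # Exact DBSCAN on the subset only
    sub_labels = DBSCAN(eps=eps, min_samples=min_samples).fit(xys[sample]).labels_
    # Extend the subset's labels (including -1 for noise) to the full sample
    knn = KNeighborsClassifier(n_neighbors=1).fit(xys[sample], sub_labels)
    return knn.predict(xys)

rng = np.random.default_rng(10)
db = pd.DataFrame({"X": rng.random(200), "Y": rng.random(200)})
labels = single_draw(db, eps=0.1, min_samples=5, pct_exact=0.5, rng=0)
```

A-DBSCAN repeats such draws `reps` times, relabels the solutions to be consistent with each other, and keeps, for each observation, the label that wins a sufficient share of the votes.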

Parameters:
eps : float

The maximum distance between two samples for them to be considered as in the same neighborhood.

min_samples : int

The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional

The algorithm used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See the NearestNeighbors module documentation for details.

n_jobs : int

[Optional. Default=1] The number of parallel jobs to run. If -1, the number of jobs is set to the number of CPU cores.

pct_exact : float

[Optional. Default=0.1] Proportion of the entire dataset used to calculate DBSCAN in each draw.

reps : int

[Optional. Default=100] Number of random samples to draw in order to build the final solution.

keep_solus : bool

[Optional. Default=False] If True, the solus and solus_relabelled objects are kept; otherwise they are deleted to save memory.

pct_thr : float

[Optional. Default=0.9] Minimum proportion of replications in which a non-noise label needs to be assigned to an observation for that observation to be labelled as such.

Examples

>>> import pandas
>>> from esda.adbscan import ADBSCAN
>>> import numpy as np
>>> np.random.seed(10)
>>> db = pandas.DataFrame({'X': np.random.random(25),
...                        'Y': np.random.random(25)})

ADBSCAN can be run following a scikit-learn-like API:

>>> np.random.seed(10)
>>> clusterer = ADBSCAN(0.03, 3, reps=10, keep_solus=True)
>>> _ = clusterer.fit(db)
>>> clusterer.labels_
array(['-1', '-1', '-1', '0', '-1', '-1', '-1', '0', '-1', '-1', '-1',
       '-1', '-1', '-1', '0', '0', '0', '-1', '0', '-1', '0', '-1', '-1',
       '-1', '-1'], dtype=object)

We can inspect the winning label for each observation, as well as the proportion of votes:

>>> print(clusterer.votes.head().to_string())
  lbls  pct
0   -1  0.7
1   -1  0.5
2   -1  0.7
3    0  1.0
4   -1  0.7

If you have set the option to keep them, you can even inspect each solution that makes up the ensemble:

>>> print(clusterer.solus.head().to_string())
  rep-00 rep-01 rep-02 rep-03 rep-04 rep-05 rep-06 rep-07 rep-08 rep-09
0      0      1      1      0      1      0      0      0      1      0
1      1      1      1      1      0      1      0      1      1      1
2      0      1      1      0      0      1      0      0      1      0
3      0      1      1      0      0      1      1      1      0      0
4      0      1      1      1      0      1      0      1      0      1

If we select a single replication and set the proportion of the dataset sampled in each draw (pct_exact) to 100%, we obtain a traditional DBSCAN:

>>> clusterer = ADBSCAN(0.2, 5, reps=1, pct_exact=1)
>>> np.random.seed(10)
>>> _ = clusterer.fit(db)
>>> clusterer.labels_
array(['0', '-1', '0', '0', '0', '-1', '-1', '0', '-1', '-1', '0', '-1',
       '-1', '-1', '0', '0', '0', '-1', '0', '0', '0', '-1', '-1', '0',
       '-1'], dtype=object)
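This equivalence can be checked against scikit-learn's own DBSCAN on the same data. A short sketch, assuming only scikit-learn and the toy DataFrame built above (note that ADBSCAN stores labels as strings, while scikit-learn uses integers):

```python
import numpy as np
import pandas
from sklearn.cluster import DBSCAN

np.random.seed(10)
db = pandas.DataFrame({'X': np.random.random(25),
                       'Y': np.random.random(25)})
# Plain DBSCAN on the full sample; with reps=1 and pct_exact=1,
# ADBSCAN should recover the same cluster/noise partition.
labels = DBSCAN(eps=0.2, min_samples=5).fit(db[['X', 'Y']]).labels_
```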
Attributes:
labels_ : array

[Only available after fit] Cluster labels for each point in the dataset given to fit(). Samples considered noisy (those for which the proportion of the most common label is < pct_thr) are given the label -1.

votes : DataFrame

[Only available after fit] Table indexed on X.index with labels_ under the lbls column, and the frequency across draws of that label under pct.

solus : DataFrame, shape = [n, reps]

[Only available after fit] Each solution of labels for every draw.

solus_relabelled : DataFrame, shape = [n, reps]

[Only available after fit] Each solution of labels for every draw, relabelled to be consistent across solutions.

__init__(eps, min_samples, algorithm='auto', n_jobs=1, pct_exact=0.1, reps=100, keep_solus=False, pct_thr=0.9)[source]

Methods

__init__(eps, min_samples[, algorithm, ...])

fit(X[, y, sample_weight, xy])

Perform ADBSCAN clustering from features.

fit_predict(X[, y])

Perform clustering on X and returns cluster labels.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_fit_request(*[, sample_weight, xy])

Request metadata passed to the fit method.

set_params(**params)

Set the parameters of this estimator.

fit(X, y=None, sample_weight=None, xy=['X', 'Y'])[source]

Perform ADBSCAN clustering from features.

Parameters:
X : DataFrame

Features.

sample_weight : Series, shape (n_samples,)

[Optional. Default=None] Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

xy : list

[Default=`[‘X’, ‘Y’]`] Ordered pair of names for the columns holding the XY coordinates in X.

y : Ignored
set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$', xy: bool | None | str = '$UNCHANGED$') → ADBSCAN

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the sample_weight parameter in fit.

xy : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the xy parameter in fit.

Returns:
self : object

The updated object.