esda.adbscan.ADBSCAN¶
- class esda.adbscan.ADBSCAN(eps, min_samples, algorithm='auto', n_jobs=1, pct_exact=0.1, reps=100, keep_solus=False, pct_thr=0.9)[source]¶
A-DBSCAN, as introduced in [ABGLopezVM21].
A-DBSCAN is an extension of the original DBSCAN algorithm that creates an ensemble of solutions generated by running DBSCAN on a random subset and “extending” the solution to the rest of the sample through nearest-neighbor regression.
See the original reference ([ABGLopezVM21]) for more details, or the notebook guide for an illustration.
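The ensemble idea described above can be illustrated with plain scikit-learn pieces: run DBSCAN on one random subset, then extend its labels to every point with a nearest-neighbor vote. This is a minimal sketch of a single draw, not the library's implementation; the toy data and all variable names are made up:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(10)
# Toy point pattern: two tight blobs standing in for spatial clusters
xys = np.vstack([
    rng.normal(0.0, 0.05, size=(50, 2)),
    rng.normal(1.0, 0.05, size=(50, 2)),
])

# One "draw": run DBSCAN on a random 50% subset of the points...
subset = rng.choice(len(xys), size=len(xys) // 2, replace=False)
core = DBSCAN(eps=0.2, min_samples=5).fit(xys[subset])

# ...then extend the subset's labels to the full sample by nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1).fit(xys[subset], core.labels_)
labels = knn.predict(xys)
```

A-DBSCAN repeats this draw reps times and keeps, for each point, the label that wins at least pct_thr of the votes (everything else becomes noise, -1).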
- Parameters:
- eps
float
The maximum distance between two samples for them to be considered as in the same neighborhood.
- min_samples
int
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
- algorithm{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional
The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.
- n_jobs
int
[Optional. Default=1] The number of parallel jobs to run. If -1, then the number of jobs is set to the number of CPU cores.
- pct_exact
float
[Optional. Default=0.1] Proportion of the entire dataset used to calculate DBSCAN in each draw.
- reps
int
[Optional. Default=100] Number of random samples to draw in order to build the final solution.
- keep_solus
bool
[Optional. Default=False] If True, the solus and solus_relabelled objects are kept; otherwise they are deleted to save memory
- pct_thr
float
[Optional. Default=0.9] Minimum proportion of replications in which a non-noise label needs to be assigned to an observation for that observation to be labelled as such
Examples
>>> import pandas
>>> from esda.adbscan import ADBSCAN
>>> import numpy as np
>>> np.random.seed(10)
>>> db = pandas.DataFrame({'X': np.random.random(25), 'Y': np.random.random(25)})
ADBSCAN can be run following a scikit-learn-like API:
>>> np.random.seed(10)
>>> clusterer = ADBSCAN(0.03, 3, reps=10, keep_solus=True)
>>> _ = clusterer.fit(db)
>>> clusterer.labels_
array(['-1', '-1', '-1', '0', '-1', '-1', '-1', '0', '-1', '-1', '-1',
       '-1', '-1', '-1', '0', '0', '0', '-1', '0', '-1', '0', '-1', '-1',
       '-1', '-1'], dtype=object)
We can inspect the winning label for each observation, as well as the proportion of votes:
>>> print(clusterer.votes.head().to_string())
  lbls  pct
0   -1  0.7
1   -1  0.5
2   -1  0.7
3    0  1.0
4   -1  0.7
If you have set the option to keep them, you can even inspect each solution that makes up the ensemble:
>>> print(clusterer.solus.head().to_string())
  rep-00 rep-01 rep-02 rep-03 rep-04 rep-05 rep-06 rep-07 rep-08 rep-09
0      0      1      1      0      1      0      0      0      1      0
1      1      1      1      1      0      1      0      1      1      1
2      0      1      1      0      0      1      0      0      1      0
3      0      1      1      0      0      1      1      1      0      0
4      0      1      1      1      0      1      0      1      0      1
If we select only one replication and set the proportion of the dataset sampled in each draw (pct_exact) to 100%, we obtain a traditional DBSCAN:
>>> clusterer = ADBSCAN(0.2, 5, reps=1, pct_exact=1)
>>> np.random.seed(10)
>>> _ = clusterer.fit(db)
>>> clusterer.labels_
array(['0', '-1', '0', '0', '0', '-1', '-1', '0', '-1', '-1', '0', '-1',
       '-1', '-1', '0', '0', '0', '-1', '0', '0', '0', '-1', '-1', '0',
       '-1'], dtype=object)
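With a single draw over 100% of the data, the ensemble collapses to one plain DBSCAN run, so the result should agree with scikit-learn's own DBSCAN up to label encoding. A quick cross-check (a sketch using sklearn directly on the same toy db as above, without requiring esda):

```python
import numpy as np
import pandas
from sklearn.cluster import DBSCAN

np.random.seed(10)
db = pandas.DataFrame({'X': np.random.random(25), 'Y': np.random.random(25)})

# One draw over 100% of the data is just DBSCAN on the full dataset
plain = DBSCAN(eps=0.2, min_samples=5).fit(db[['X', 'Y']].values)
print(plain.labels_)
```

The integer labels here correspond to the string labels in ADBSCAN's labels_ attribute.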
- Attributes:
- labels_
array
[Only available after fit] Cluster labels for each point in the dataset given to fit(). Samples are labelled as noise (-1) if the proportion of their most common label across replications is below pct_thr.
- votes
DataFrame
[Only available after fit] Table indexed on X.index with labels_ under the lbls column, and the frequency across draws of that label under pct.
- solus
DataFrame, shape = [n, reps]
[Only available after fit] Each solution of labels for every draw
- solus_relabelled
DataFrame, shape = [n, reps]
[Only available after fit] Each solution of labels for every draw, relabelled to be consistent across solutions
- __init__(eps, min_samples, algorithm='auto', n_jobs=1, pct_exact=0.1, reps=100, keep_solus=False, pct_thr=0.9)[source]¶
Methods
__init__(eps, min_samples[, algorithm, ...])
fit(X[, y, sample_weight, xy]): Perform ADBSCAN clustering from features.
fit_predict(X[, y]): Perform clustering on X and returns cluster labels.
get_metadata_routing(): Get metadata routing of this object.
get_params([deep]): Get parameters for this estimator.
set_fit_request(*[, sample_weight, xy]): Request metadata passed to the fit method.
set_params(**params): Set the parameters of this estimator.
- fit(X, y=None, sample_weight=None, xy=['X', 'Y'])[source]¶
Perform ADBSCAN clustering from features
- Parameters:
- X
DataFrame
Features
- sample_weight
Series, shape (n_samples,)
[Optional. Default=None] Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
- xy
list
[Optional. Default=['X', 'Y']] Ordered pair of column names for the XY coordinates in X
- y
Ignored
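The sample_weight semantics above mirror scikit-learn's DBSCAN, so they can be illustrated with sklearn's implementation (a sketch, on the assumption that the weights behave the same way here): a single point whose weight alone reaches min_samples becomes a core sample by itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three isolated points; with min_samples=5, none can be core on its own...
pts = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]])
unweighted = DBSCAN(eps=1.0, min_samples=5).fit(pts)
print(unweighted.labels_)  # all noise: [-1 -1 -1]

# ...unless its weight alone reaches min_samples
weighted = DBSCAN(eps=1.0, min_samples=5).fit(pts, sample_weight=[5, 1, 1])
print(weighted.labels_)  # first point forms its own cluster: [0 -1 -1]
```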
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$', xy: bool | None | str = '$UNCHANGED$') → ADBSCAN¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
- sample_weight
str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in fit.
- xy
str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the xy parameter in fit.
- Returns:
- self
object
The updated object.