gwlearn.ensemble.GWRandomForestClassifier¶

class gwlearn.ensemble.GWRandomForestClassifier(*, bandwidth=None, fixed=False, kernel='bisquare', include_focal=False, geometry=None, graph=None, n_jobs=-1, fit_global_model=True, strict=False, keep_models=False, temp_folder=None, batch_size=None, min_proportion=0.2, undersample=False, leave_out=None, random_state=None, verbose=False, **kwargs)[source]¶

Geographically weighted random forest classifier.

Fits one sklearn.ensemble.RandomForestClassifier per focal observation using spatially varying sample weights.

The spatial interaction is defined either by (a) geometry + bandwidth/kernel settings or (b) a precomputed libpysal.graph.Graph passed via graph.

Notes

y must be binary ({0, 1} or boolean).
To enable prediction on new data via predict()/predict_proba(), you must set keep_models=True (store in memory) or keep_models=Path(...) (serialize to disk).
Only point geometries are supported.

Parameters:¶

bandwidth : float | int | None¶

Bandwidth for defining neighborhoods.

If fixed=True, this is a distance threshold.
If fixed=False, this is the number of nearest neighbors used to form the local neighborhood.

If graph is provided, bandwidth is ignored.
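
The difference between a fixed and an adaptive bandwidth can be sketched in plain Python. This is a minimal illustration that assumes the textbook bisquare formula; `bisquare` and `local_weights` are hypothetical helpers, not part of gwlearn, and the library's own weighting may differ in detail (for example in how the k-th neighbor is treated).

```python
import math


def bisquare(distance, bandwidth):
    """Textbook bisquare kernel: (1 - (d/b)^2)^2 inside the bandwidth, 0 outside."""
    if distance >= bandwidth:
        return 0.0
    return (1.0 - (distance / bandwidth) ** 2) ** 2


def local_weights(focal, points, bandwidth, fixed=False):
    """Weights of `points` relative to `focal` (hypothetical helper)."""
    distances = [math.dist(focal, p) for p in points]
    if fixed:
        # Fixed bandwidth: a distance threshold in coordinate units.
        b = float(bandwidth)
    else:
        # Adaptive bandwidth: use the distance to the k-th nearest neighbor.
        # Assumption: with this convention the k-th neighbor itself gets weight 0.
        b = sorted(distances)[int(bandwidth) - 1]
    return [bisquare(d, b) for d in distances]


focal = (0.0, 0.0)
points = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]

fixed_w = local_weights(focal, points, bandwidth=4.0, fixed=True)
adaptive_w = local_weights(focal, points, bandwidth=3, fixed=False)
```

With a fixed bandwidth of 4.0 the three nearby points receive non-zero weight and the distant point receives none. With an adaptive bandwidth of 3 the threshold adjusts to the local point density instead.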

fixed : bool, optional¶

True for distance based bandwidth and False for adaptive (nearest neighbor) bandwidth, by default False

kernel : str | Callable, optional¶

Type of kernel function used to weight observations, by default "bisquare"

include_focal : bool, optional¶

Include the focal observation when training its local model. Excluding it allows geographically weighted metrics to be assessed on unseen data without a train/test split, so a value is available for every sample. This is needed for further spatial analysis of model performance (and generalises to models that do not support OOB scoring). However, it leaves out the most representative sample. By default False

geometry : gpd.GeoSeries, optional¶

Geographic location of the observations in the sample. Used to determine the spatial interaction weights as specified by the bandwidth, fixed, kernel, and include_focal keywords. Either geometry or graph must be specified. Prediction requires geometry.

graph : Graph, optional¶

Custom libpysal.graph.Graph object encoding the spatial interaction between observations in the sample. If given, it is used directly and the bandwidth, fixed, kernel, and include_focal keywords are ignored. Either geometry or graph must be specified. Prediction requires geometry. Both can be specified, in which case graph encodes the spatial interaction between the observations in geometry.

n_jobs : int, optional¶

The number of jobs to run in parallel. -1 means using all processors, by default -1

fit_global_model : bool, optional¶

Determines whether a global baseline model is fitted alongside the geographically weighted models, by default True

strict : bool | None, optional¶

Do not fit any models if at least one neighborhood has invariant y, by default False. None is treated as False but provides a warning if there are invariant models.

keep_models : bool | str | Path, optional¶

Keep all local models (required for prediction), by default False. Note that for some models, like random forests, the objects can be large. If a string or Path is provided, the local models are not held in memory but serialized to disk at that location and loaded from there at prediction time.

temp_folder : str | None, optional¶

Folder to be used by the pool for memmapping large arrays for sharing memory with worker processes, e.g., /tmp. Passed to joblib.Parallel, by default None

batch_size : int | None, optional¶

Number of models to process in each batch. Specify batch_size if your models do not fit into memory. By default None

min_proportion : float, optional¶

Minimum proportion of minority class for a model to be fitted, by default 0.2
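
As a rough illustration of how such a threshold filters neighborhoods, the sketch below checks the unweighted share of the minority class in each local sample. The helper names are hypothetical, and whether gwlearn computes the proportion on raw or weighted counts is an assumption here.

```python
from collections import Counter


def minority_proportion(y_local):
    """Share of the less frequent class; 0.0 if only one class is present."""
    counts = Counter(y_local)
    if len(counts) < 2:
        return 0.0  # invariant neighborhood, no minority class at all
    return min(counts.values()) / len(y_local)


def should_fit(y_local, min_proportion=0.2):
    """Hypothetical check mirroring the role of `min_proportion`."""
    return minority_proportion(y_local) >= min_proportion


neighborhoods = [
    [1, 1, 0, 0, 0, 0, 0, 0],        # 25% minority
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # 10% minority
    [0, 0, 0, 0, 0],                 # invariant
    [1, 1, 1, 0, 0, 0],              # 50% minority
]

fitted = [should_fit(y) for y in neighborhoods]
# Proportion of neighborhoods that would get a local model,
# analogous in spirit to the prediction_rate_ attribute.
prediction_rate = sum(fitted) / len(fitted)
```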

undersample : bool | float, optional¶

Whether to apply random undersampling to balance classes.

If True, undersample the majority class to match the minority class (i.e., minority/majority ratio = 1.0).

If a float alpha > 0, target a minority/majority ratio of alpha after resampling, i.e. alpha = N_min / N_resampled_majority. By default False
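
The ratio definition above can be made concrete with a small standard-library sketch. `undersample_indices` is a hypothetical helper that follows the stated formula alpha = N_min / N_resampled_majority; it is not gwlearn's implementation.

```python
import random


def undersample_indices(y, undersample=True, seed=0):
    """Indices kept after randomly undersampling the majority class."""
    # True is treated as a target minority/majority ratio of 1.0.
    ratio = 1.0 if undersample is True else float(undersample)

    ones = [i for i, v in enumerate(y) if v]
    zeros = [i for i, v in enumerate(y) if not v]
    minority, majority = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)

    # N_resampled_majority = N_min / alpha, capped at the available majority size.
    n_major = min(len(majority), round(len(minority) / ratio))
    kept_majority = random.Random(seed).sample(majority, n_major)
    return sorted(minority + kept_majority)


y = [1] * 4 + [0] * 16  # 4 minority, 16 majority observations

kept_balanced = undersample_indices(y, undersample=True)  # 4 + 4 observations
kept_half = undersample_indices(y, undersample=0.5)       # 4 + 8 observations
```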

leave_out : float | int, optional¶

Leave out a fraction (when float) or a set number (when int) of random observations from each local model. The left-out observations from all models are pooled to measure out-of-sample log loss. This is useful for bandwidth selection when some local models are not fitted due to local invariance, which makes the resulting information criteria not comparable.

random_state : int | None, optional¶

Random seed for reproducibility, by default None

verbose : bool, optional¶

Whether to print progress information, by default False

**kwargs¶

Additional keyword arguments passed to model initialisation

proba_¶

Probability predictions for focal locations based on a local model trained around the point itself.

Type:¶: pd.DataFrame

pred_¶

Binary predictions for focal locations based on a local model trained around the location itself.

Type:¶: pd.Series

hat_values_¶

Hat values for each location (diagonal elements of hat matrix)

Type:¶: pd.Series

effective_df_¶

Effective degrees of freedom (sum of hat values)

Type:¶: float

log_likelihood_¶

Global log likelihood of the model

Type:¶: float

aic_¶

Akaike information criterion of the model

Type:¶: float

aicc_¶

Corrected Akaike information criterion to account for model complexity (smaller bandwidths)

Type:¶: float

bic_¶

Bayesian information criterion

Type:¶: float
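
For orientation, the relationship between the log likelihood, the effective degrees of freedom, and the three criteria can be sketched with the textbook definitions. This is an assumption for illustration only: gwlearn may use a different variant (geographically weighted models often use a modified AICc), and the numbers below are made up.

```python
import math


def information_criteria(log_likelihood, effective_df, n_obs):
    """Textbook AIC, AICc and BIC with effective_df as the parameter count."""
    k = effective_df
    aic = 2 * k - 2 * log_likelihood
    aicc = aic + (2 * k * (k + 1)) / (n_obs - k - 1)
    bic = k * math.log(n_obs) - 2 * log_likelihood
    return {"aic": aic, "aicc": aicc, "bic": bic}


# Made-up hat values; effective degrees of freedom is their sum.
hat_values = [0.5, 0.25, 0.25, 1.0]
effective_df = sum(hat_values)

ic = information_criteria(log_likelihood=-40.0, effective_df=effective_df, n_obs=85)
```

Lower values indicate a better trade-off between fit and complexity, which is why these criteria are commonly compared across candidate bandwidths.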

feature_importances_¶

Feature importance values for each local model

Type:¶: pd.DataFrame

prediction_rate_¶

Proportion of models that are fitted, where the rest are skipped due to not fulfilling min_proportion.

Type:¶: float

left_out_y_¶

Array of y values left out when leave_out is set.

Type:¶: numpy.ndarray

left_out_proba_¶

Array of probabilities on left out observations in local models when leave_out is set.

Type:¶: numpy.ndarray

left_out_w_¶

Array of weights on left out observations in local models when leave_out is set.

Type:¶: numpy.ndarray
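
One way the three left-out arrays could be combined into a single out-of-sample score is a weighted log loss. The sketch below is a standard-library illustration with made-up values; it assumes the probabilities refer to the positive class and that the weights are used as observation weights, neither of which is confirmed here.

```python
import math


def pooled_log_loss(y_true, proba, weights=None, eps=1e-15):
    """Weighted binary log loss over pooled left-out observations."""
    if weights is None:
        weights = [1.0] * len(y_true)
    total = 0.0
    for y, p, w in zip(y_true, proba, weights):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -w * (math.log(p) if y else math.log(1 - p))
    return total / sum(weights)


# Made-up stand-ins for left_out_y_, left_out_proba_ and left_out_w_.
left_out_y = [1, 0, 1, 0]
left_out_proba = [0.9, 0.2, 0.6, 0.4]
left_out_w = [1.0, 0.8, 0.5, 0.1]

unweighted = pooled_log_loss(left_out_y, left_out_proba)
weighted = pooled_log_loss(left_out_y, left_out_proba, left_out_w)
```

Computing such a score for several candidate bandwidths gives a comparison that does not depend on every local model having been fitted.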

oob_y_pooled_¶

Pooled out-of-bag (OOB) true labels across all fitted local models.

Type:¶: numpy.ndarray

oob_pred_pooled_¶

Pooled out-of-bag (OOB) predictions/scores across all fitted local models.

Type:¶: numpy.ndarray

Examples

>>> import geopandas as gpd
>>> from geodatasets import get_path
>>> from gwlearn.ensemble import GWRandomForestClassifier

>>> gdf = gpd.read_file(get_path('geoda.guerry'))
>>> X = gdf[['Crm_prp', 'Litercy', 'Donatns', 'Lottery']]
>>> y = gdf["Region"] == 'E'

>>> gw = GWRandomForestClassifier(
...     bandwidth=30,
...     fixed=False,
...     geometry=gdf.representative_point(),
...     random_state=0,
... ).fit(X, y)
>>> gw.pred_.head()
0    False
1    False
2    False
3     True
4     True
dtype: boolean

Methods

`__init__`(*[, bandwidth, fixed, kernel, ...])
`fit`(X, y)	Fit geographically weighted random forests.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`local_metric`(func, *args, **kwargs)	Compute a metric per fitted local model.
`predict`(X, geometry)	Predict classes for new observations.
`predict_proba`(X, geometry)	Predict class probabilities for new observations.
`score`(X, y[, sample_weight])	Return accuracy on provided data and labels.
`set_params`(**params)	Set the parameters of this estimator.
`set_predict_proba_request`(*[, geometry])	Configure whether metadata should be requested to be passed to the `predict_proba` method.
`set_predict_request`(*[, geometry])	Configure whether metadata should be requested to be passed to the `predict` method.
`set_score_request`(*[, sample_weight])	Configure whether metadata should be requested to be passed to the `score` method.

fit(X, y)[source]¶

Fit geographically weighted random forests.

Parameters:¶

X : pandas.DataFrame¶: Feature matrix.
y : pandas.Series¶: Binary target encoded as boolean or {0, 1}.

Returns:¶

Fitted estimator.

Return type:¶

GWRandomForestClassifier

Notes

In addition to the base classifier outputs, this method also populates oob_y_pooled_ and oob_pred_pooled_ by pooling OOB values across all fitted local models.

get_metadata_routing()¶

Get metadata routing of this object.

See the scikit-learn User Guide for how the routing mechanism works.

Returns:¶: routing – A MetadataRequest encapsulating routing information.
Return type:¶: MetadataRequest

get_params(deep=True)¶

Get parameters for this estimator.

Parameters:¶

deep : bool, default=True¶: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:¶

params – Parameter names mapped to their values.

Return type:¶

dict

local_metric(func, *args, **kwargs)¶

Compute a metric per fitted local model.

Parameters:¶

func : callable¶: Callable with a signature func(y_true, y_pred, *args, **kwargs).

Returns:¶

One value per focal location (NaN for skipped / unfitted local models).

Return type:¶

numpy.ndarray
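
A callable passed as func only needs to accept the true and predicted labels first, followed by any extra arguments. The sketch below shows such a callable and a hypothetical stand-in loop, `metric_per_location`, that mimics the documented behaviour of returning NaN for skipped models. How gwlearn assembles the per-model labels internally is not shown here and is an assumption.

```python
def accuracy(y_true, y_pred, normalize=True):
    """Metric with the expected signature func(y_true, y_pred, *args, **kwargs)."""
    hits = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true) if normalize else hits


def metric_per_location(func, local_results, *args, **kwargs):
    """Hypothetical stand-in: apply func to each local model's labels."""
    out = []
    for result in local_results:
        if result is None:
            # Skipped or unfitted local model.
            out.append(float("nan"))
        else:
            y_true, y_pred = result
            out.append(func(y_true, y_pred, *args, **kwargs))
    return out


# Made-up (y_true, y_pred) pairs; None marks a skipped local model.
local_results = [([1, 0, 1], [1, 0, 0]), None, ([0, 0], [0, 0])]

values = metric_per_location(accuracy, local_results)
counts = metric_per_location(accuracy, local_results, normalize=False)
```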

predict(X, geometry)¶

Predict classes for new observations.

This is equivalent to predict_proba(...).idxmax(axis=1).

Parameters:¶

X : pandas.DataFrame¶: Feature matrix for new observations.
geometry : geopandas.GeoSeries¶: Point geometries for new observations.

Returns:¶

Predicted class.

Return type:¶

pandas.Series

Notes

Requires the estimator to have been fit with keep_models=True (or a Path) so local models can be used at prediction time.
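
The stated equivalence with predict_proba(...).idxmax(axis=1) means that each prediction is the column label with the highest probability in its row. A dependency-free sketch of that row-wise argmax, using dictionaries in place of DataFrame rows and made-up probabilities:

```python
def idxmax_rows(proba_rows):
    """Return, for each row, the class label with the highest probability."""
    return [max(row, key=row.get) for row in proba_rows]


# Made-up probabilities; keys play the role of the DataFrame columns.
proba_rows = [
    {False: 0.8, True: 0.2},
    {False: 0.35, True: 0.65},
]

predicted = idxmax_rows(proba_rows)
```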

predict_proba(X, geometry)¶

Predict class probabilities for new observations.

Parameters:¶

X : pandas.DataFrame¶: Feature matrix for new observations.
geometry : geopandas.GeoSeries¶: Point geometries for new observations.

Returns:¶

Predicted probabilities with columns equal to the global classes observed during fit.

Return type:¶

pandas.DataFrame

Notes

Requires the estimator to have been fit with keep_models=True (or a Path) so local models can be used at prediction time.

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:¶

**params : dict¶: Estimator parameters.

Returns:¶

self – Estimator instance.

Return type:¶

estimator instance
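
The <component>__<parameter> naming used for nested objects can be illustrated by splitting a key on its first double underscore. This is only a sketch of the naming convention, not scikit-learn's implementation; `split_param` is a hypothetical helper and "clf" is a made-up pipeline step name.

```python
def split_param(name):
    """Split 'component__parameter' into (component, parameter)."""
    component, delimiter, sub_name = name.partition("__")
    if not delimiter:
        # No double underscore: the parameter belongs to the estimator itself.
        return None, name
    return component, sub_name


nested = split_param("clf__bandwidth")
direct = split_param("bandwidth")
```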