gwlearn.ensemble.GWRandomForestRegressor

class gwlearn.ensemble.GWRandomForestRegressor(*, bandwidth=None, fixed=False, kernel='bisquare', include_focal=False, graph=None, n_jobs=-1, fit_global_model=True, strict=False, keep_models=False, temp_folder=None, batch_size=None, random_state=None, verbose=False, **kwargs)

Geographically weighted random forest regressor.

Fits one sklearn.ensemble.RandomForestRegressor per focal observation using spatially varying sample weights.

The spatial interaction is defined either by (a) geometry + bandwidth/kernel settings or (b) a precomputed libpysal.graph.Graph passed via graph.

Notes

  • To enable prediction on new data via predict(), you must set keep_models=True (store in memory) or keep_models=Path(...) (serialize to disk).

  • Only point geometries are supported.

Parameters:
bandwidth : float | int | None

Bandwidth for defining neighborhoods.

  • If fixed=True, this is a distance threshold.

  • If fixed=False, this is the number of nearest neighbors used to form the local neighborhood.

If graph is provided, bandwidth is ignored.

fixed : bool, optional

True for a distance-based bandwidth, False for an adaptive (nearest-neighbor) bandwidth. By default False

kernel : str | Callable, optional

Type of kernel function used to weight observations, by default "bisquare"

include_focal : bool, optional

Whether to include the focal observation when training its local model. Excluding it lets geographically weighted metrics be evaluated on unseen data without a train/test split, so every sample receives a value. This is needed for further spatial analysis of model performance, and it generalises to models that do not support OOB scoring. The trade-off is that each local model leaves out its most representative sample. By default False

graph : Graph, optional

Custom libpysal.graph.Graph object encoding the spatial interaction between observations in the sample. If given, it is used directly and the bandwidth, fixed, kernel, and include_focal keywords are ignored. Either geometry or graph must be specified. Prediction requires geometry. Both can be specified, in which case graph encodes the spatial interaction between the observations in geometry.

n_jobs : int, optional

The number of jobs to run in parallel. -1 means using all processors. By default -1

fit_global_model : bool, optional

Whether to fit a global baseline model alongside the geographically weighted models, by default True

strict : bool | None, optional

Do not fit any models if at least one neighborhood has invariant y, by default False. None is treated as False but emits a warning if there are invariant models.

keep_models : bool | str | Path, optional

Keep all local models (required for prediction), by default False. Note that for some models, like random forests, the objects can be large. If a string or Path is provided, the local models are not held in memory but serialized to disk at that location and loaded from there at prediction time.

temp_folder : str | None, optional

Folder to be used by the pool for memmapping large arrays for sharing memory with worker processes, e.g., /tmp. Passed to joblib.Parallel, by default None

batch_size : int | None, optional

Number of models to process in each batch. Specify batch_size if your models do not fit into memory. By default None

random_state : int | None, optional

Random seed for reproducibility, by default None

verbose : bool, optional

Whether to print progress information, by default False

**kwargs

Additional keyword arguments passed to model initialisation
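The interplay of bandwidth, fixed, and kernel described above can be illustrated with a minimal, pure-Python sketch. It is not gwlearn's implementation: the helper names neighborhood and bisquare are hypothetical, the bisquare form is the standard textbook one, and treating the distance threshold as inclusive is an assumption.

```python
import math


def bisquare(distance, bandwidth):
    """Textbook bisquare kernel: (1 - (d/b)^2)^2 inside the bandwidth, 0 outside."""
    u = distance / bandwidth
    return (1 - u**2) ** 2 if u < 1 else 0.0


def neighborhood(points, focal, bandwidth, fixed):
    """Hypothetical helper returning (index, distance) pairs around points[focal].

    fixed=True keeps points within `bandwidth` distance (inclusive, by assumption);
    fixed=False keeps the `bandwidth` nearest points. The focal point is excluded,
    mirroring include_focal=False.
    """
    fx, fy = points[focal]
    dists = sorted(
        (math.hypot(x - fx, y - fy), i)
        for i, (x, y) in enumerate(points)
        if i != focal
    )
    if fixed:
        return [(i, d) for d, i in dists if d <= bandwidth]
    return [(i, d) for d, i in dists[:bandwidth]]


points = [(0, 0), (1, 0), (0, 2), (5, 5), (3, 0)]

# Distance-based bandwidth: everything within 2.5 units of point 0.
fixed_nbrs = neighborhood(points, focal=0, bandwidth=2.5, fixed=True)
# Adaptive bandwidth: the 3 nearest neighbors of point 0.
adaptive_nbrs = neighborhood(points, focal=0, bandwidth=3, fixed=False)

# Kernel weights that would serve as sample weights for the local model.
fixed_weights = {i: bisquare(d, 2.5) for i, d in fixed_nbrs}
```

In the fixed case, closer neighbors receive larger sample weights and anything beyond the threshold contributes nothing to the local model.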

pred_

Focal predictions for each location.

Type:

pd.Series

resid_

Residuals for each location (y - pred_).

Type:

pd.Series

RSS_

Residual sum of squares for each location.

Type:

pd.Series

TSS_

Total sum of squares for each location.

Type:

pd.Series

y_bar_

Weighted mean of y for each location.

Type:

pd.Series

local_r2_

Local R^2 for each location.

Type:

pd.Series
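How y_bar_, TSS_, RSS_, and local_r2_ could relate to one another is sketched below for a single focal location. This assumes the common geographically weighted definition, in which kernel weights enter both sums of squares and local R^2 is 1 - RSS/TSS. Whether gwlearn computes these attributes exactly this way is an assumption, and the function name local_r2 is hypothetical.

```python
def local_r2(y, y_pred, weights):
    """Weighted mean, TSS, RSS and R^2 for one focal location (illustrative)."""
    w_sum = sum(weights)
    # Weighted mean of y within the neighborhood.
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    # Weighted total and residual sums of squares.
    tss = sum(w * (yi - y_bar) ** 2 for w, yi in zip(weights, y))
    rss = sum(w * (yi - pi) ** 2 for w, yi, pi in zip(weights, y, y_pred))
    return y_bar, tss, rss, 1 - rss / tss


y = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
weights = [1.0, 0.5, 0.5, 0.0]  # e.g. kernel weights; the last point is outside the bandwidth

y_bar, tss, rss, r2 = local_r2(y, y_pred, weights)
```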

hat_values_

Hat values for each location (diagonal elements of the hat matrix)

Type:

pd.Series

effective_df_

Effective degrees of freedom (sum of hat values)

Type:

float

log_likelihood_

Global log likelihood of the model

Type:

float

aic_

Akaike information criterion of the model

Type:

float

aicc_

Corrected Akaike information criterion to account for model complexity (smaller bandwidths)

Type:

float

bic_

Bayesian information criterion

Type:

float
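For orientation, the textbook forms of these criteria can be sketched as follows, using the effective degrees of freedom (the sum of the hat values) as the parameter count k. This is an assumption for illustration only: gwlearn may use a different convention for k or for the correction term, and the function name information_criteria and the input values are hypothetical.

```python
import math


def information_criteria(log_likelihood, k, n):
    """Textbook AIC, AICc and BIC for k parameters and n observations."""
    aic = 2 * k - 2 * log_likelihood
    # Small-sample correction grows as k approaches n (e.g. small bandwidths).
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = k * math.log(n) - 2 * log_likelihood
    return aic, aicc, bic


hat_values = [0.25] * 20            # stand-in for hat_values_
effective_df = sum(hat_values)      # sum of hat values
aic, aicc, bic = information_criteria(
    log_likelihood=-100.0, k=effective_df, n=len(hat_values)
)
```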

feature_importances_

Feature importance values for each local model

Type:

pd.DataFrame

oob_y_pooled_

Pooled out-of-bag (OOB) true values across all fitted local models.

Type:

numpy.ndarray

oob_pred_pooled_

Pooled out-of-bag (OOB) predictions across all fitted local models.

Type:

numpy.ndarray
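One plausible use of the two pooled OOB arrays is to compute a single out-of-bag score across all local models. The sketch below uses plain lists as stand-ins for oob_y_pooled_ and oob_pred_pooled_ and a hand-written R^2. The values are made up for illustration, and r2_score here is a local helper, not an import.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Stand-ins for gw.oob_y_pooled_ and gw.oob_pred_pooled_ (illustrative values).
oob_y = [3.0, 5.0, 7.0, 9.0]
oob_pred = [2.5, 5.5, 6.5, 9.5]

pooled_oob_r2 = r2_score(oob_y, oob_pred)
```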

Examples

>>> import geopandas as gpd
>>> from geodatasets import get_path
>>> from gwlearn.ensemble import GWRandomForestRegressor
>>> gdf = gpd.read_file(get_path('geoda.guerry'))
>>> X = gdf[['Crm_prp', 'Litercy', 'Donatns', 'Lottery']]
>>> y = gdf["Suicids"]
>>> gw = GWRandomForestRegressor(
...     bandwidth=30,
...     fixed=False,
...     random_state=0,
... ).fit(X, y, geometry=gdf.representative_point())
>>> gw.pred_.head()
0    85064.34
1    19490.90
2    29501.62
3    33270.86
4    54608.57
dtype: float64

Methods

__init__(*[, bandwidth, fixed, kernel, ...])

fit(X, y[, geometry])

Fit geographically weighted random forests.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

predict(X, geometry[, bandwidth, ...])

Predict target values for new observations.

score(X, y, geometry[, bandwidth, ...])

Return the coefficient of determination R^2 of the prediction.

set_fit_request(*[, geometry])

Configure whether metadata should be requested to be passed to the fit method.

set_params(**params)

Set the parameters of this estimator.

set_predict_request(*[, bandwidth, ...])

Configure whether metadata should be requested to be passed to the predict method.

set_score_request(*[, bandwidth, geometry, ...])

Configure whether metadata should be requested to be passed to the score method.


fit(X, y, geometry=None)

Fit geographically weighted random forests.

Parameters:
X : pandas.DataFrame

Feature matrix.

y : pandas.Series

Target values.

geometry : geopandas.GeoSeries | None

Geographic location of the observations in the sample. Used to determine the spatial interaction weights as specified by the bandwidth, fixed, kernel, and include_focal keywords. If None, a precomputed graph must be specified. Prediction requires geometry. If both graph and geometry are specified, graph is used at fit time, while geometry is used for prediction.

Returns:

Fitted estimator.

Return type:

GWRandomForestRegressor

Notes

In addition to the base regressor outputs, this method also populates oob_y_pooled_ and oob_pred_pooled_ by pooling OOB values across all fitted local models.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict(X, geometry, bandwidth='nearest', global_model_weight=0)

Predict target values for new observations.

Prediction can be retrieved either from the nearest local model or based on the ensemble of local models. In the latter case, the prediction process works as follows:

  1. For a new location on which you want a prediction, identify local models within the bandwidth used to train the model.

  2. Apply the kernel function used to train the model to derive weights of each of the local models.

  3. Make a prediction with each of the local models within the bandwidth.

  4. Compute the weighted average of these predictions using the kernel weights.
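The four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: local models are represented as simple callables paired with their fit-time locations, the kernel is the textbook bisquare, and the names ensemble_predict and bisquare are hypothetical.

```python
import math


def bisquare(distance, bandwidth):
    """Textbook bisquare kernel weight."""
    u = distance / bandwidth
    return (1 - u**2) ** 2 if u < 1 else 0.0


def ensemble_predict(x, location, local_models, bandwidth):
    """Kernel-weighted average of local model predictions at `location`.

    local_models is a list of ((x, y) fit location, predict callable) pairs.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (mx, my), predict in local_models:
        # Steps 1-2: distance to the local model and its kernel weight.
        d = math.hypot(location[0] - mx, location[1] - my)
        w = bisquare(d, bandwidth)
        if w > 0:
            # Step 3: prediction from each local model within the bandwidth.
            weighted_sum += w * predict(x)
            weight_total += w
    if weight_total == 0:
        raise ValueError("no local model within the bandwidth")
    # Step 4: weighted average of the predictions.
    return weighted_sum / weight_total


local_models = [
    ((0.0, 0.0), lambda x: 2.0 * x),
    ((5.0, 0.0), lambda x: 4.0 * x),
    ((50.0, 0.0), lambda x: 100.0 * x),  # too far away, gets zero weight
]
pred = ensemble_predict(1.0, (0.0, 0.0), local_models, bandwidth=10.0)
```

Here the model at distance 0 has weight 1, the model at distance 5 has weight 0.5625, and the distant model is ignored.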

The nearest and ensemble predictions are typically similar, but the ensemble is significantly slower because it requires an inference call for every local model within the bandwidth.

Furthermore, the prediction can fuse local and global models when global_model_weight is set to a non-zero value, following Georganos et al. [2021].

Parameters:
X : pandas.DataFrame

Feature matrix for new observations.

geometry : geopandas.GeoSeries

Point geometries for new observations.

bandwidth : "nearest", float or None

Prediction method. "nearest" uses the nearest location available at fit time and predicts with its single model. A numeric value uses an ensemble of the local models available within that bandwidth, with the prediction of each model weighted by distance using the kernel set at fit time. None uses the bandwidth set at fit time.

global_model_weight : float

Weight of the prediction from the global model. When non-zero, the resulting prediction is a weighted average of the values from local model(s) and from global model, where local prediction has a weight of 1 and global model has a weight equal to global_model_weight.
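Read literally, the description of global_model_weight amounts to a weighted average in which the local prediction has weight 1 and the global prediction has weight global_model_weight. A minimal sketch of that arithmetic follows. The function name fuse is hypothetical and the numbers are illustrative.

```python
def fuse(local_pred, global_pred, global_model_weight=0.0):
    """Weighted average with local weight 1 and global weight `global_model_weight`."""
    return (local_pred + global_model_weight * global_pred) / (1 + global_model_weight)


only_local = fuse(10.0, 20.0, global_model_weight=0)      # global model ignored
equal_mix = fuse(10.0, 20.0, global_model_weight=1)       # plain average
mostly_local = fuse(10.0, 20.0, global_model_weight=0.25)
```

With a weight of 0 the global model has no influence, and a weight of 1 gives local and global predictions equal say.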

Returns:

Predicted values.

Return type:

pandas.Series

Notes

Requires the estimator to have been fit with keep_models=True (or a Path) so local models can be used at prediction time.

score(X, y, geometry, bandwidth='nearest', global_model_weight=0)

Return the coefficient of determination R^2 of the prediction.

Parameters:
X : pandas.DataFrame

Feature matrix for new observations.

y : pandas.Series

True values for X.

geometry : geopandas.GeoSeries

Point geometries for new observations.

bandwidth : "nearest", float or None

Prediction method. See predict().

global_model_weight : float

Weight of the prediction from the global model.

Returns:

R^2 of self.predict(X, geometry) with respect to y.

Return type:

float

set_fit_request(*, geometry='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
geometry : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for geometry parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_predict_request(*, bandwidth='$UNCHANGED$', geometry='$UNCHANGED$', global_model_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
bandwidth : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for bandwidth parameter in predict.

geometry : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for geometry parameter in predict.

global_model_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for global_model_weight parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, bandwidth='$UNCHANGED$', geometry='$UNCHANGED$', global_model_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
bandwidth : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for bandwidth parameter in score.

geometry : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for geometry parameter in score.

global_model_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for global_model_weight parameter in score.

Returns:

self – The updated object.

Return type:

object