This page was generated from /home/runner/work/esda/esda/docs/notebooks/localjoincounts.ipynb. Interactive online version:
Local join counts¶
In the following notebook we review the different type of local join counts (LJC) put forward by Anselin and Li (2019). LJC focus on spatial phenomenon that take on binary values (e.g. 0 or 1). This suite of exploratory statistics is especially useful to analysts who want to focus on different types of what Anselin and Li call ‘co-location’; that is, the presence or absence of specific 0 or 1 values.
Note that there are several versions of the LJC:
univariate LJC
bivariate LJC (case 1)
bivariate LJC (case 2)
multivariate LJC
The use case of each of these statistics, as well as their implementation in PySAL, is provided below.
Univariate LJC¶
The univariate LJC is a the local version of the ‘black-black’ (aka ‘BB’) statistic. This statistic describes the count of the neighbors, \(x_j\), of a given unit, \(x_i\), that are equal to 1 when the unit is also equal to 1. Formally:
Eq 1.
It is important to note that when a given unit \(x_i\) is equal to 0, the statistic also becomes 0. Anselin and Li describe the application of this statisttic as:
Hence, the local join count statistic is only meaningful to assess whether locations with an “event” (i.e., \(x_i = 1\) ) are surrounded by more locations with events than would be the case under spatial randomness. Anselin and Li, 2019, Section 2.2 Page 192
We can apply the PySAL implementation of the univariate LJC statistic to its original implementation in GeoDa. We first load in the Guerry dataset and convert the column Donats
to binary column. This new binary column has a value of 1 for the top three groupings of Donats
based on a Natural Breaks classification method (and 0 for otherwise).
[1]:
import libpysal
import geopandas as gpd
guerry = libpysal.examples.load_example('Guerry')
guerry_ds = gpd.read_file(guerry.get_path('guerry.shp'))
guerry_ds['SELECTED'] = 0
guerry_ds.loc[(guerry_ds['Donatns'] > 10997), 'SELECTED'] = 1
/home/serge/anaconda3/envs/dev/lib/python3.8/site-packages/geopandas/_compat.py:84: UserWarning: The Shapely GEOS version (3.8.0-CAPI-1.13.1 ) is incompatible with the GEOS version PyGEOS was compiled with (3.8.1-CAPI-1.13.3). Conversions between both will be slow.
warnings.warn(
We now make a Queen-contiguity weights object to describe the relationship between the units.
[2]:
w = libpysal.weights.Queen.from_dataframe(guerry_ds)
We can now apply the univariate LJC function on the dataset.
[3]:
from esda.join_counts_local import Join_Counts_Local
LJC_uni = Join_Counts_Local(connectivity=w).fit(guerry_ds['SELECTED'])
The LJC
attribute returned from the function counts the total number of local join counts.
[4]:
LJC_uni.LJC
[4]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 3., 3., 0.,
1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 3., 0., 0., 0., 0., 0., 2., 0., 3., 0., 0.])
And the p_sim
attribute contains the p-values obtained from a conditional randomization procedure.
[5]:
LJC_uni.p_sim
[5]:
array([ nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, 0.457, nan, 0.026, 0.026, nan, 0.289,
nan, nan, nan, nan, nan, nan, 0.275, nan, 0.325,
nan, nan, nan, nan, nan, nan, 0.297, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, 0.459,
nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, 0.031, nan, nan, nan, nan, nan, 0.117,
nan, 0.053, nan, nan])
We can map these values after placing them back into the dataset.
[6]:
guerry_ds['LJC_UNI'] = LJC_uni.LJC
guerry_ds['LJC_UNI_p_sim'] = LJC_uni.p_sim
From here you may be interested in mapping the LJC…
[7]:
import matplotlib.pyplot as plt
fig,ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
guerry_ds.plot(color='white', edgecolor='black', ax=ax)
guerry_ds.plot(column='LJC_UNI',
categorical=True,
cmap='GnBu',
legend=True, ax=ax)
[7]:
<AxesSubplot:>
Or mapping the accompanying significance values below a certain threshold…
[8]:
import numpy
guerry_ds['LJC_UNI_p_sim_sig'] = numpy.nan
guerry_ds.loc[(guerry_ds['LJC_UNI_p_sim'] <= 0.05), 'LJC_UNI_p_sim_sig'] = guerry_ds['LJC_UNI_p_sim']
import matplotlib.pyplot as plt
fig,ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
guerry_ds.plot(color='white', edgecolor='black', ax=ax)
guerry_ds.plot(column='LJC_UNI_p_sim_sig',
legend=True, ax=ax)
[8]:
<AxesSubplot:>
Bivariate LJC (case 1)¶
When considering two variables, say \(x\) and \(z\), in the bivariate LJC, distinctions need to be made about what spatial pattern is of interest. Anselin and Li define two use cases of the statistic. The first use case is when there is no in-situ co-location, or where \(x_i\) and \(z_i\) no not take the same value at \(i\) or \(j\). More specifically, the bivariate LJC case 1 is interested in when both \(x_i\) and \(x_j\) equal 1 (i.e. \(x_i=x_j=1\)) as well as \(x_j=0\). Substantively, Anselin and Li explain that case is useful when the phenomenon represented by \(x\) and \(z\) “cannot occur in the same location”.
Formally, this case is represented as:
Eq 2.
As above, we apply the PySAL implementation of the bivariate LJC case 1 statistic to its original implementation in GeoDa. Unlike above, we now move to the Community areas in Chicago dataset (``commpop`) <https://geodacenter.github.io/data-and-lab//>`__. We are going to examine instances of negative spatial autocorrelation, identified by locations where negative population changes (popneg=1
)
surrounded by more locations with positive popuation changes (popplus=1
).
[9]:
import libpysal
import geopandas as gpd
commpop = gpd.read_file("https://github.com/jeffcsauer/GSOC2020/raw/master/validation/data/commpop.gpkg")
/home/serge/anaconda3/envs/dev/lib/python3.8/site-packages/geopandas/geodataframe.py:422: RuntimeWarning: Sequential read of iterator was interrupted. Resetting iterator. This can negatively impact the performance.
for feature in features_lst:
We again make a Queen-contiguity weights object to describe the relationship between the units.
[10]:
w = libpysal.weights.Queen.from_dataframe(commpop)
Now we run the Local_Join_Count_BV
function. Note that the order of the variables is meaningful! Changing the order of the variables will change the implied research question.
[11]:
from esda.join_counts_local_bv import Join_Counts_Local_BV
LJC_BV_Case1 = Join_Counts_Local_BV(connectivity=w).fit(commpop['popneg'], commpop['popplus'], case='BJC')
As before, we can map both the LJC
and p_sim
values after placing the results back into the commpop
dataset.
[12]:
commpop['LJC_BV_Case1'] = LJC_BV_Case1.LJC
commpop['LJC_BV_Case1_p_sim'] = LJC_BV_Case1.p_sim
[13]:
fig,ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
commpop.plot(color='white', edgecolor='black', ax=ax)
commpop.plot(column='LJC_BV_Case1',
cmap='GnBu',
categorical=True,
legend=True, ax=ax)
[13]:
<AxesSubplot:>
[14]:
commpop['LJC_BV_Case1_p_sim_sig'] = numpy.nan
commpop.loc[(commpop['LJC_BV_Case1_p_sim'] <= 0.05), 'LJC_BV_Case1_p_sim_sig'] = commpop['LJC_BV_Case1_p_sim']
fig,ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
commpop.plot(color='white', edgecolor='black', ax=ax)
commpop.plot(column='LJC_BV_Case1_p_sim_sig',
legend=True, ax=ax)
[14]:
<AxesSubplot:>
Bivariate LJC (case 2)¶
We now move on to case 2 of the bivariate LJC. Unlike case 1, case 2 identifies areas with co-location. Case 2 is explicitly concerned with instances where \(x_i=z_i=1\) as well as \(x_j=z_j=1\). Anselin and Li refer to this situation as a co-location cluster (CLC). We formally write this as:
We return to the Guerry dataset to demonstrate its implementation in PySAL. We are interested in identifying areas that are in the top Quantile for the variables Infants
(children born out of wedlock) and Donatns
(donations). We reload the dataset and create the variables below:
[15]:
guerry = libpysal.examples.load_example('Guerry')
guerry_ds = gpd.read_file(guerry.get_path('guerry.shp'))
guerry_ds['infq5'] = 0
guerry_ds['donq5'] = 0
guerry_ds.loc[(guerry_ds['Infants'] > 23574), 'infq5'] = 1
guerry_ds.loc[(guerry_ds['Donatns'] > 10973), 'donq5'] = 1
We again make a Queen-contiguity weights object to describe the relationship between the units.
[16]:
w = libpysal.weights.Queen.from_dataframe(guerry_ds)
We now run the Local_Join_Count_BV
command, this time changing the case
argument to CLC
.
[17]:
LJC_BV_Case2 = Join_Counts_Local_BV(connectivity=w).fit(guerry_ds['infq5'], guerry_ds['donq5'], case='CLC')
As before, we can map the LJC
and accompanying p_sim
values.
[18]:
guerry_ds['LJC_BV_Case2'] = LJC_BV_Case2.LJC
guerry_ds['LJC_BV_Case2_p_sim'] = LJC_BV_Case2.p_sim
[19]:
fig,ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
guerry_ds.plot(color='white', edgecolor='black', ax=ax)
guerry_ds.plot(column='LJC_BV_Case2',
cmap='GnBu',
categorical=True,
legend=True, ax=ax)
[19]:
<AxesSubplot:>
[20]:
fig,ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
guerry_ds.plot(color='white', edgecolor='black', ax=ax)
guerry_ds.plot(column='LJC_BV_Case2_p_sim',
legend=True, ax=ax)
[20]:
<AxesSubplot:>
Multivariate LJC¶
The last LJC to review is the multivariate LJC statistic. This is actually an extension of the bivariate LJC case 2. Rather than considering two variables \(x\) and \(z\), the multivariate statistic consdiers \(m\) variables at each location \(i\). All \(m\) variables must then meet the co-location criterion of equalling 1. Formally:
Anselin and Li state clearly in multiple sources that although the multivariate LJC statistic is a straightforward extension of the bivariate LJC case 2, its substantive meaning is less straightforward. In their own words:
The extension to multiple binary variables is mathematically straightforward, although maybe conceptually less so. While different combinations are possible, the most practical use case would be one where the interest focuses on the co-location of multiple variables coinciding with co-location for the neighbors. Again, we can refer to this as a co-location cluster. An example would be where binary variables were constructed from continuous-valued measures for those locations where the observations fall in a pre-specified range, such as the upper decile. The co-location cluster would indicate where such coincidences occur with neighbors that have similar coincidences. However, as the number of variables considered increases, we run into the “curse of dimensionality,” and results would be less meaningful, in the sense that such coincidences would likely be increasingly rare and thus always be indicated as “significant.” Anselin and Li, 2019, Section 3.3, Page 198
We demonstrate this problem by extending the example from the bivariate LJC case 2. We add a third variable that counts the number of suicides in an area (Suicids
). The substnative quesiton of interest is now seeking areas that are in the highest Quantile for the number of births out of wedlock, the number of donations to the poor, and the number of suicides. We reload the Guerry dataset and create the variables:
[21]:
guerry = libpysal.examples.load_example('Guerry')
guerry_ds = gpd.read_file(guerry.get_path('guerry.shp'))
guerry_ds['infq5'] = 0
guerry_ds['donq5'] = 0
guerry_ds['suic5'] = 0
guerry_ds.loc[(guerry_ds['Infants'] > 23574), 'infq5'] = 1
guerry_ds.loc[(guerry_ds['Donatns'] > 10973), 'donq5'] = 1
guerry_ds.loc[(guerry_ds['Suicids'] > 55564), 'suic5'] = 1
And agian create the Queen contiguity spatial weights object.
[22]:
w = libpysal.weights.Queen.from_dataframe(guerry_ds)
We now load in the Local_Join_Count_MV
function and apply it to all three variables:
[23]:
from esda.join_counts_local_mv import Join_Counts_Local_MV
LJC_MV = Join_Counts_Local_MV(connectivity=w).fit([guerry_ds['infq5'], guerry_ds['donq5'], guerry_ds['suic5']])
Before proceeding to mapping, it is worthwhile to check if there are any areas that meet the criterion described above.
[24]:
LJC_MV.LJC
[24]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
Because there are no areas that meet the criteria, no areas will have a significance value.
[25]:
LJC_MV.p_sim
[25]:
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan])
Nevertheless, it is important to recognize tht such a finding may still be important as a ‘null’ result!