Global Spatial Autocorrelation Getis-Ord G¶

This notebook covers Getis-Ord statistics for identifying spatial concentrations of high or low values.

Learning goals¶

By the end of this notebook, you will be able to:

compute the global \(G\) statistic and assess its significance
explain how Getis-Ord \(G\) differs from Moran’s \(I\)

The substantive question is whether high Airbnb prices cluster spatially — and, if so, where those high-price hotspots are located.

import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from libpysal import graph

import esda

Data preparation¶

We use the same Berlin neighborhood dataset as in the global Moran’s I and local Moran’s I notebooks so that results can be compared directly.

df = gpd.read_file("data/berlin-housing.gpkg")

fig, ax = plt.subplots(figsize=(12, 10))
df.plot(column="median_pri", scheme="Quantiles", k=5, cmap="OrRd", legend=True, ax=ax)
ax.set_axis_off()
ax.set_title("Median Airbnb price by neighbourhood (quintiles)")
plt.show()

../_images/73ae41e8d2c4a0c14310ab7563eb73a588fcef13b1e0c06d8d98ec0890bf63a1.png

Getis-Ord G¶

Moran’s I measures spatial autocorrelation by comparing each observation to its deviation from the global mean. A High-High cluster in Moran’s framework means both a neighborhood and its neighbors are above the mean.

Getis-Ord \(G\) takes a different approach: it measures spatial concentration of raw values. A high \(G\) means large values are co-located with other large values — a true hotspot. This can be better understood by examining the form of the statistic:

\[ G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} x_i x_j}{\sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j}, \quad \forall j \neq i \]

where \(x_i\) is the value for the variable \(x\) at location \(i\) of \(n\) locations, and \(w_{ij}\) is the spatial weight expressing the neighbor relationship between locations \(i\) and \(j\).

The expected value of \(G\) under the null hypothesis of no spatial association is:

\[ E[G] = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}{n(n-1)}. \]

Because \(G\) works with raw values rather than mean-centered deviations, it is particularly well-suited to detecting concentration of phenomena that are always non-negative (counts, prices, rates).

The global statistic \(G\) answers: is there a statistically significant overall concentration of high or low values?

# G requires binary (not row-standardised) weights
wq = graph.Graph.build_contiguity(df, rook=False)
y = df["median_pri"]

np.random.seed(12345)
g_global = esda.G(y, wq)
print(f"Observed G:   {g_global.G:.6f}")
print(f"Expected G:   {g_global.EG:.6f}")

Observed G:   0.040514
Expected G:   0.039671

The calculated value of the global statistic is above its expectation. Whether this difference is statistically significant, or larger than what could be due to a certain level of random chance, requires an inferential framework.

Inference¶

Inference about \(G\) uses the same general framework underlying Moran’s I, where the reference distribution can be based on analytical reasoning or permutation based.

To derive the analytical expected value and variance for statistical inference, we define the following structural constants based on the weights matrix \(W\):

\(S_0 = \sum_{i=1}^n \sum_{j=1}^n w_{ij}\) (The sum of all spatial weights)
\(S_1 = \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n (w_{ij} + w_{ji})^2\)
\(S_2 = \sum_{i=1}^n \left( \sum_{j=1}^n w_{ij} + \sum_{j=1}^n w_{ji} \right)^2\)

We also define global moments of the attribute vector \(x\):

\(W_0 = S_0\)
The sample moments: \(D_1 = \sum_{i=1}^n x_i\), \(D_2 = \sum_{i=1}^n x_i^2\), \(D_3 = \sum_{i=1}^n x_i^3\), and \(D_4 = \sum_{i=1}^n x_i^4\).

Free Sampling (The Normality Assumption)¶

Expected Value¶

The expected value of \(G\) under free sampling depends entirely on the sample size \(n\) and the number of pairs, remaining independent of the empirical values of \(x\):

\[E_N(G) = \frac{W_0}{n(n - 1)}.\]

Variance¶

The mathematical variance under the normality assumption is structured as follows:

\[Var_N(G) = \frac{1}{[n(n-1)]^2 \cdot D_1^4} \left[ S_1 \cdot D_1^2 \cdot D_2 + (S_2 - 2S_1) \cdot D_1 \cdot D_3 + (W_0^2 - S_2 + S_1) \cdot D_2^2 \right] - [E_N(G)]^2\]

Non-Free Sampling (The Randomization Assumption)¶

Expected Value¶

Because the expectation across all permutations averages out the spatial structure, the expected value under randomization is identical to the free sampling expectation:

\[E_R(G) = \frac{W_0}{n(n - 1)}\]

Variance¶

The variance under randomization must account for the exact empirical distribution of the data (incorporating higher-order sample moments \(D_3\) and \(D_4\)). The formula is conventionally expressed using a series of compounded variables:

\[Var_R(G) = E_R(G^2) - [E_R(G)]^2\]

Where \(E_R(G^2)\) is calculated as:

\[E_R(G^2) = \frac{A_1 \cdot S_1 + A_2 \cdot S_2 + A_3 \cdot W_0^2}{n(n-1)(n-2)(n-3) \cdot (\sum_{i \neq j} x_i x_j)^2}\]

The structural data coefficients (\(A_1, A_2, A_3\)) isolate the interactions of the empirical array data:

\[A_1 = n \left[ (n-1) D_2^2 - 2(n-1)D_1 D_3 + 2D_1^2 D_2 - 3D_1^4 + 2(n-1)D_4 \right]\]

\[A_2 = \frac{n}{2} \left[ -(n-1)D_2^2 + 4(n-1)D_1 D_3 - 6D_1^2 D_2 + 5D_1^4 - 2(n-1)D_4 \right]\]

\[A_3 = D_1^4 - 2D_1^2 D_2 + (n-1)D_2^2 + 2(n-1)D_1 D_3 - (n-1)D_4\]

And the denominator’s data product adjusts for the omission of the diagonal (\(i=j\)):

\[\left(\sum_{i \neq j} x_i x_j\right)^2 = (D_1^2 - D_2)^2\]

Asymptotic Inference (Z-score)¶

For practical execution, these analytical moments are translated into an asymptotic standard normal \(Z\)-score:

\[Z(G) = \frac{G - E(G)}{\sqrt{Var(G)}}\]

Positive \(Z(G)\): Indicates a spatial cluster of high values (hot spots).
Negative \(Z(G)\): Indicates a spatial cluster of low values (cold spots).

Because the randomization variance (\(Var_R(G)\)) natively accounts for extreme skewness and kurtosis common in positive non-negative values (like real estate prices or population counts), it is the default analytical configuration deployed for spatial boundary validation.

While computationally fast, the analytical approach relies on the assumption of normality which may not hold for small sample sizes or highly skewed data.

print(f"z-score:      {g_global.z_norm:.4f}")

z-score:      3.2702

print(f"p-value (norm): {g_global.p_norm:.4f}")

p-value (norm): 0.0005

Permutation-based inference¶

As with Moran’s \(I\), we can also assess the observed \(G\) against a reference distribution generated by randomly permuting attribute values while holding the spatial structure fixed.

sns.kdeplot(g_global.sim, fill=True)
plt.vlines(g_global.G, 0, plt.gca().get_ylim()[1] * 0.9, color="r", label="Observed G")
plt.vlines(
    g_global.EG,
    0,
    plt.gca().get_ylim()[1] * 0.9,
    color="k",
    linestyle="--",
    label="Expected G",
)
plt.xlabel("G statistic")
plt.legend()
plt.title("Permutation reference distribution for global G")
plt.show()

../_images/caf0bc75896d2e34efa977a666e861c0962403fd80390be3d4ecc17d43b16800.png

print(f"p-value (sim):  {g_global.p_sim:.4f}")

p-value (sim):  0.0020

Comparing Getis-Ord and Moran’s I results¶

The global \(G\) statistic captures concentrations of High-High values, as well as concentrations of Low-Low values, similar to the global Moran’s \(I\). The two methods can diverge when local variance is high — a neighborhood with very high prices surrounded by moderate prices may register as a spatial outlier in Moran but not a hotspot for \(G\).

Considerations¶

Two final points should be kept in mind when applying the global \(G\):

Positivity Requirement: The values of \(x\) must be positive (\(x > 0\))
The Zero Problem: The statistic is designed for ratio-scale data with a natural zero. If the data contains zeros or negative alues, the denominator (the sum of cross-products) can behave erratically, and the interpretation of “high” or “low” value clusters becomes mathematically unstable.

Takeaways¶

Global G tests whether the entire study area has more concentration of high (or low) values than expected under spatial randomness.
\(G\) does not capture negative spatial autocorrelation, in contrast to Moran’s I.