{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7b8975c4",
   "metadata": {},
   "source": [
    "# Specification Tests\n",
    "\n",
    "### Luc Anselin\n",
    "\n",
    "### (revised 09/11/2024)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cfd0985",
   "metadata": {},
   "source": [
    "## Preliminaries\n",
    "\n",
    "In this notebook, the basic regression diagnostics for spatial autocorrelation are introduced. These include the classic Moran's I test, as well as the Lagrange Multiplier/Rao Score tests for lag and error dependence developed during the 1980s and 1990s. In addition, the recent tests for the Spatial Durbin specification developed by Koley and Bera are covered as well. The tests are detailed in Anselin and Rey (2014), *Modern Spatial Econometrics in Practice* and in Anselin, Serenini and Amaral (2024). *Spatial Econometric Model Specification Search: Another Look* (DOI: 10.13140/RG.2.2.10650.86721), as well as in the references therein.\n",
    "\n",
    "In addition to the classic case with OLS estimates, a test for spatial correlation is covered for models with endogenous variables, estimated by means of 2SLS.\n",
    "\n",
    "### Prerequisites\n",
    "\n",
    "Familiarity with OLS and 2SLS estimation in *spreg* is assumed, as covered in the respective notebooks, as well as basics of *numpy*, *pandas*, *geopandas*, and *libpysal*. In addition, it is assumed that the **chicagoSDOH** PySAL sample data set has been installed (for specific instructions, refer to the *sample data notebook*)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8a50738",
   "metadata": {},
   "source": [
    "### Modules Needed\n",
    "\n",
    "The main module for spatial regression in PySAL is *spreg*. In addition *libpysal* is needed for data import and spatial weights manipulation, and *geopandas* for data input from a shape file. This notebook is based on version 1.7 of *spreg*. \n",
    "\n",
    "As before, only the needed functions from *libpysal* are imported, i.e., `libpysal.io.open` as `open`, `libpysal.examples.get_path` as `get_path`, and `libpysal.weights` as `weights`. The `OLS` and `TSLS` estimation routines are imported from `spreg`.\n",
    "\n",
    "Some additional imports are included to avoid excessive warning messages. With later versions of PySAL, these may not be needed. As before, the `set_printoptions` is used for *numpy* 2.0 and later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e398e42f",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "import os\n",
    "os.environ['USE_PYGEOS'] = '0'\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import geopandas as gpd\n",
    "from libpysal.io import open\n",
    "from libpysal.examples import get_path\n",
    "import libpysal.weights as weights\n",
    "from spreg import OLS, TSLS\n",
    "np.set_printoptions(legacy=\"1.25\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ac85fb3",
   "metadata": {},
   "source": [
    "### Functionality Used\n",
    "\n",
    "- from geopandas:\n",
    "  - read_file\n",
    "  \n",
    "- from libpysal:\n",
    "  - examples.get_path\n",
    "  - io.open\n",
    "  - weights.transform\n",
    "  \n",
    "- from spreg:\n",
    "  - OLS\n",
    "  - TSLS"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b34442e2",
   "metadata": {},
   "source": [
    "### Data, Weights and Variables"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "effd05a4",
   "metadata": {},
   "source": [
    "As in the previous notebooks, all data sets, weights files and variables are specified at the top, so that they can be easily changed to other examples.\n",
    "\n",
    "Date sets and weights are from the **chicagoSDOH** sample data set:\n",
    "\n",
    "- **Chi-SDOH.shp,shx,dbf,prj**: socio-economic indicators of health for 2014 in 791 Chicago tracts\n",
    "- **Chi-SDOH_q.gal**: queen contiguity spatial weights created with *GeoDa*\n",
    "\n",
    "The weights are used in row-standardized form.\n",
    "\n",
    "For the OLS case, the same model specification is used as before, with **YPLL_rate** (an index measuring premature mortality, i.e., higher values are worse health outcomes) as the dependent variable, and **HIS_ct** (economic hardship index), **Blk14P** (percent Black population), and **Hisp14P** \n",
    "(percent Hispanic population) as the explanatory variables. These are specified in the **y_name** etc. variables, which are then used to create the corresponding *numpy* vectors and matrices for use as **y**, **x**, etc. in the regression specification.\n",
    "\n",
    "For the 2SLS case, the variable **HIS_ct** is considered to be endogenous, with **COORD_X** and **COORD_Y** (the tract centroids) as instruments, as in the 2SLS notebook.\n",
    "\n",
    "The various initializations are carried out in two steps:\n",
    "\n",
    "- first, all file names and variable names are defined\n",
    "- second, the files are read and variable vectors/matrices constructed\n",
    "\n",
    "The first step allows for customization to other examples, the second step is agnostic to the actual files and variables that were specified. To keep the code simple, there are no error checks for missing files or mismatches in the variable names."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "829cca11",
   "metadata": {},
   "source": [
    "#### Specify file and variable names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "fc2dba41",
   "metadata": {},
   "outputs": [],
   "source": [
    "infileshp = get_path(\"Chi-SDOH.shp\")            # input shape file with data\n",
    "infileq = get_path(\"Chi-SDOH_q.gal\")            # queen contiguity weights from GeoDa\n",
    "y_name = 'YPLL_rate'\n",
    "x_names1 = ['Blk14P','Hisp14P']\n",
    "x_names2 = ['Blk14P','Hisp14P','HIS_ct']\n",
    "yend_names = ['HIS_ct']\n",
    "q_names = ['COORD_X', 'COORD_Y']\n",
    "ds_name = 'Chi-SDOH'\n",
    "w_name = 'Chi-SDOH_q'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91648058",
   "metadata": {},
   "source": [
    "#### Read files and extract variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4103013d",
   "metadata": {},
   "outputs": [],
   "source": [
    "dfs = gpd.read_file(infileshp)\n",
    "wq =  open(infileq).read()    # queen contiguity weights\n",
    "wq.transform = 'r'    # row-transform the weights\n",
    "y = dfs[y_name]\n",
    "x1 = dfs[x_names1]\n",
    "x2 = dfs[x_names2]\n",
    "yend = dfs[yend_names]\n",
    "q = dfs[q_names]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7765db8",
   "metadata": {},
   "source": [
    "## Specification Tests\n",
    "\n",
    "The specification tests for spatial autocorrelation in regression residuals are invoked by setting the argument `spat_diag = True` in the OLS call and passing a spatial weights object. This will result in the LM tests being added to the result output. In order to also list Moran's I, the argument `moran = True` is required as well. Since this test statistic involves more computation, it is not included in the default `spat_diag` listing.\n",
    "\n",
    "Before considering the implementation of these tests in `spreg`, their formal expressions are briefly summarized."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3acd8851",
   "metadata": {},
   "source": [
    "### Moran's I"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65fbe0bd",
   "metadata": {},
   "source": [
    "Formally,the Moran's I test statistic is:\n",
    "\n",
    "\\begin{equation*}\n",
    "I = \\frac{e'We / S_0}{e'e / n}.\n",
    "\\end{equation*}\n",
    "\n",
    "In this expression, $S_0 = \\sum_i \\sum_j w_{ij}$ is the sum of the weights in matrix $W$,\n",
    "and $e$ are the residuals.\n",
    "\n",
    "Inference for Moran's I is based on an asymptotic standard normal approximation.\n",
    "The statistic itself is first converted into a standardized or z-value by subtracting\n",
    "the mean and dividing by the standard deviation. Those two moments are obtained\n",
    "under the null hypothesis of no spatial autocorrelation. \n",
    "\n",
    "The moments of Moran's I for regression residuals,\n",
    " under the null of no spatial autocorrelation (for the\n",
    "regression error terms) were derived by Cliff and Ord (1972) (Testing for spatial autocorrelation among regression residuals, *Geographical Analysis* 4, 267-284).\n",
    "\n",
    "The mean is:\n",
    "\n",
    "\\begin{equation*}\n",
    "E[ I ] = \\frac{tr(MW)}{(n - k)},\n",
    "\\end{equation*}\n",
    "\n",
    "where the matrix $M$ in the trace expression follows the conventional\n",
    "notation as $M = I - X(X'X)^{-1}X'$, a $n \\times n$ projection matrix.\n",
    "\n",
    "The variance of Moran's I is:\n",
    "\n",
    "\\begin{equation*}\n",
    "Var[I] = \\frac{ tr(MWMW') + tr(MWMW) + (tr(MW))^2}\n",
    "  {(n - k)(n - k + 2)} - ( E[I])^2.\n",
    "\\end{equation*}\n",
    "\n",
    "The so-called z-value is then obtained in the usual fashion, as:\n",
    "\n",
    "\\begin{equation*}\n",
    "I_z = \\frac{I - E[I]}{\\sqrt{Var[I]}} \\sim \\  N(0, 1).\n",
    "\\end{equation*}\n",
    "\n",
    "The $I_z$ statistic has an asymptotic distribution that is approximated by \n",
    "the standard normal.\n",
    "\n",
    "Moran's I is a very powerful misspecification test. While designed to detect spatial error autocorrelation, it also has power against a range of other misspecifications, including heteroskedasticity and non-normality. Hence, when the null is *not* rejected, one can be fairly confident that none of these misspecifications are present. On the other hand, when the null is rejected, it is not always clear what the next step should be, other than maybe implement HAC standard errors. In contrast, the LM statistics are so-called focused tests against spatial autocorrelation that consider a specific alternative (lag or error)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87d52c95",
   "metadata": {},
   "source": [
    "### LM Statistics for Spatial Lag and Spatial Error"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e60bd9c0",
   "metadata": {},
   "source": [
    "Technical details pertaining to these tests are given in Chapter 5 of Anselin and Rey (2014).\n",
    "The original results are from Burridge (1980) (On the Cliff-Ord test for spatial autocorrelation, *Journal of the Royal Statistical Society B*, 42, 107-108) for the LM-Error test, Anselin (1988) (Lagrange Multiplier test diagnostics for spatial dependence and spatial heterogeneity, *Geographical Analysis* 20, 1-17), for the LM-Lag test, and Anselin, Bera, Florax, Yoon (1996) (Simple diagnostic tests for spatial dependence, *Regional Science and Urban Economics* 26, 77-104) for their robust forms.\n",
    "\n",
    "The LM test for spatial lag is:\n",
    "\n",
    "\\begin{equation*}\n",
    "LM_{\\rho} = \\frac{d_{\\rho}^2}{D} \\sim \\chi^2(1),\n",
    "\\end{equation*}\n",
    "\n",
    "with $d_{\\rho} = e'Wy / \\hat{\\sigma}^2$ as the score for $\\rho$, and\n",
    "$D = (WX\\hat{\\beta})' M (WX \\hat{\\beta}) / \\hat{\\sigma}^2 + T$, where\n",
    "$M = I - X(X'X)^{-1}X'$, $T = tr(WW + W'W)$ (with tr as the trace of\n",
    "a matrix), $e$ is the OLS residual vector and $\\hat{\\beta}$ are the OLS\n",
    "regression coefficients (so, $X \\hat{\\beta}$ is the vector of predicted values).\n",
    "\n",
    "The LM test for spatial error is:\n",
    "\n",
    "\\begin{equation*}\n",
    "LM_{\\lambda} = \\frac{d_{\\lambda}^2}{T} \\sim \\chi^2(1),\n",
    "\\end{equation*}\n",
    "\n",
    "where $d_{\\lambda} = (e'We) /\\hat{\\sigma}^2$ is the score for $\\lambda$ and $T$\n",
    "is as before.\n",
    "\n",
    "The LM test for lag robust to the presence of error is:\n",
    "\n",
    "\\begin{equation*}\n",
    "LM_{\\rho}^* = \\frac{(d_{\\rho} - d_{\\lambda} )^2}{( D - T)} \\sim \\chi^2(1),\n",
    "\\end{equation*}\n",
    "\n",
    "The LM test for error robust to the presence of lag is:\n",
    "\n",
    "\\begin{equation*}\n",
    "LM_{\\lambda}^* = \\frac{(d_{\\lambda} - TD^{-1}d_{\\rho} )^2}{[T (1 - TD) ] } \\sim \\chi^2(1),\n",
    "\\end{equation*}\n",
    "\n",
    "in the same notation."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48984604",
   "metadata": {},
   "source": [
    "### LM Statistics for Spatial Durbin Model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64cadb7c",
   "metadata": {},
   "source": [
    "In Koley and Bera (2024) (To use or not to use the spatial Durbin model? - that is the question. *Spatial Economic Analysis* 19, 30-56), robust\n",
    "LM tests are derived for $\\rho$ and $\\gamma$ in the spatial Durbin model. Again, the point of departure is\n",
    "an OLS regression of the classic non-spatial specification. \n",
    "\n",
    "LM test for $\\gamma$:\n",
    "\n",
    "\\begin{equation*}\n",
    "LM_{\\gamma} = \\frac{(e'(WX_0))[(WX_0)'M(WX_0)]^{-1}((WX_0)'e)}{\\hat{\\sigma}^2}  \\sim \\chi^2(h),\n",
    "\\end{equation*}\n",
    "\n",
    "where $h = k-1$ (i.e., not counting the constant term), and $X_0$ is the $X$ matrix without the constant column. The expression for the $LM_{\\rho}$ test is the same as before.\n",
    "\n",
    "Also, the joint LM test on $\\rho$ and $\\gamma$ is:\n",
    "\n",
    "\\begin{equation*}\n",
    "LM_{\\rho\\gamma} = \n",
    "\\begin{bmatrix}\n",
    "(Wy)'e / \\hat{\\sigma}^2 & e'(WX_0) / \\hat{\\sigma}^2\n",
    "\\end{bmatrix}\n",
    "\\begin{bmatrix}\n",
    "(WX\\beta)'M(WX\\hat{\\beta}) + T & (WX\\hat{\\beta})'M(WX_0) \\\\\n",
    "(WX_0)'M(WX\\hat{\\beta}) & (WX_0)'M(WX_0)\n",
    "\\end{bmatrix}^{-1}\n",
    "\\begin{bmatrix}\n",
    "(Wy)'e / \\hat{\\sigma}^2 \\\\\n",
    "(WX_0)'e / \\hat{\\sigma}^2\n",
    "\\end{bmatrix} \\sim \\chi^2(k)\n",
    "\\end{equation*}\n",
    "\n",
    "In these expressions, $T$ and\n",
    "$M$ are as before.\n",
    "\n",
    "The robust forms of the $LM_{\\rho}$ and $LM_{\\gamma}$ tests can be obtained from the result of the joint test and the expressions for $LM_{\\rho}$ and\n",
    "$LM_{\\gamma}$, since the following equality holds:\n",
    "\n",
    "\\begin{equation*}\n",
    "LM_{\\rho\\gamma} = LM_{\\rho} + LM_{\\gamma}^* = LM_{\\rho}^* + LM_{\\gamma}.\n",
    "\\end{equation*}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2894211",
   "metadata": {},
   "source": [
    "### Diagnostics in a 2SLS Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e41f826f",
   "metadata": {},
   "source": [
    "As in the classic OLS estimation, the residuals from 2SLS\n",
    "estimation can be assessed for spatial autocorrelation. Specifically, the so-called $AK$\n",
    "test extends the principle behind\n",
    " the Moran's I statistic to residuals from a 2SLS estimation (Anselin and Kelejian 1997. Testing for spatial error autocorrelation in the presence of endogenous regressors. *International Regional Science Review* 20, 153-182).\n",
    "\n",
    "The\n",
    "statistic reduces to the same expression as the Lagrange Multiplier test for \n",
    "error spatial autocorrelation, but using the residuals\n",
    "from the 2SLS regression. Formally:\n",
    "\\begin{equation*}\n",
    "AK = \\frac{[ (e'We)/e'e/n)  ]^2}{tr(WW + W'W)} \\sim \\chi^2(1)\n",
    "\\end{equation*}\n",
    "where $e$ is a vector of 2SLS residuals, $W$ is the spatial weights matrix, and  and $tr$\n",
    "stands for a matrix trace expression. The statistic is distributed asymptotically as \n",
    "Chi-squared with one degree of freedom.\n",
    "\n",
    "Some caution is needed in the interpretation of the results of the $AK$ test. \n",
    "Even though it takes the form of an $LM$ test, the statistic\n",
    "is really a generalization \n",
    "of Moran's I and therefore not actually a Lagrange Multiplier test. Since estimation\n",
    "is based on 2SLS, there is no assumption of normality and thus also no likelihood\n",
    "function (on which the $LM$ statistic is based). Therefore, \n",
    " the \n",
    "$AK$ test needs to be interpreted as a diffuse test rather than as a focused\n",
    "test (e.g., in the standard $LM$ case).  In other words, the\n",
    "rejection of the null of no spatial autocorrelation does not point to either a lag or an error\n",
    "specification as the proper alternative. As is the case for Moran's I in a classic regression, rejection\n",
    "of the null points to the absence of independence, but not to a particular specification\n",
    "that may be the reason for the spatial correlation.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "edf9be3d",
   "metadata": {},
   "source": [
    "## Spatial Diagnostics in OLS Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ed4d91c",
   "metadata": {},
   "source": [
    "To obtain the spatial diagnostics, the argument `spat_diag = True` must be set, with, in addition, `moran = True` if Moran's I is desired. Also, a spatial weights matrix must be specified as the `w` argument (optionally with its name in `name_w`).\n",
    "\n",
    "#### Two explanatory variables\n",
    "\n",
    "First, this is illustrated for the immigrant paradox regression with just two explanatory variables, i.e., with the arguments **y** and **x1**. The call to `OLS` is the same as in the earlier notebook, but now with the arguments for the spatial diagnostics included."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e00b184",
   "metadata": {},
   "outputs": [],
   "source": [
    "ols1 = OLS(y,x1,w=wq,spat_diag=True,moran=True,\n",
    "                 name_w=w_name,name_ds=ds_name)\n",
    "print(ols1.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58b8b9e5",
   "metadata": {},
   "source": [
    "The regression output is augmented with a section at the bottom, entitled DIAGNOSTICS FOR SPATIAL DEPENDENCE. It is organized into two parts, one dealing the SAR-Error model as the alternative specification, the other with the Spatial Durbin specification (SDM).\n",
    "\n",
    "In the SAR-Error part, the first result (if `moran=True`) is for Moran's I. The value of the Moran's I is listed under MI/DF, here 0.1271, with the associated z-value under VALUE, as 6.521. Finally, in the PROB column, the associated p-value is given. In this example, Moran's I is highly significant, suggesting a misspecification problem. However, since the test is not focused on a specific alternative, it is not clear what the next step would be. Also, there is very strong evidence of heteroskedasticity, against which Moran's I has power as well.\n",
    "\n",
    "The next set of tests are the LM tests and their associated robust forms, first for Lag and then for Error. The LM-Lag test (26.211) is highly significant, but its robust form (0.258) is not. In contrast, both LM-Error (39.569) and the associated robust form (13.615) are highly significant, suggesting an Error alternative. Finally, the joint test for Lag and Error is highly significant as well. This test is not always indicative of the need for a higher order alternative, since it has high power against the single alternatives as well.\n",
    "\n",
    "The upshot of these statistics is strong evidence towards a spatial error alternative.\n",
    "\n",
    "The final set of tests are the LM tests in a Spatial Durbin context. The LM test for the coefficients of WX is given first, with its robust counterpart. The degrees of freedom (DF, 2) match the number of explanatory variables (not counting the constant term). It this example, the LM test of 0.836 is not significant, but its robust counterpart of 14.193 is (robust to the presence of a spatial lag). The LM test for Lag has the same value as in the SAR-Error context (26.211), but its robust form (39.569) is now robust to the presence of an SLX term and is highly significant. The joint test for Spatial Durbin (40.404) has degrees of freedom equal to the number of explanatory variables + 1, or 3 in this case. It is highly significant.\n",
    "\n",
    "Koley and Bera (2024) suggest the following interpretation of these results. First, consider whether the joint test is significant, which it clearly is. Then consider each of the robust forms of the test statistic, which are also both highly significant in this example. This would point to a Spatial Durbin alternative. However, there may be some other misspecifications going on here, since the values of the robust tests are *larger* than their original counterparts, which is not standard behavior. In a strict sense, the robust tests are robust to *small* departures from the null and they should be *smaller* than the original test, since they *correct* for the ignored misspecification. When this inequality does not hold, there is an indication that the misspecification is more major than the correction is able to accommodate.\n",
    "\n",
    "Clearly, the recommendations given by the SAR-Error and the Spatial Durbin contexts are at odds. This is further explored in the notebook on specification searches."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab470abe",
   "metadata": {},
   "source": [
    "#### Three explanatory variables\n",
    "\n",
    "With the hardship indicator (**HIS_ct**) included in the regression, the call uses **x2**, but is otherwise identical to the previous one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e356306",
   "metadata": {},
   "outputs": [],
   "source": [
    "ols2 = OLS(y,x2,w=wq,spat_diag=True,moran=True,\n",
    "                 name_w=w_name,name_ds=ds_name)\n",
    "print(ols2.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e91fcee3",
   "metadata": {},
   "source": [
    "The inclusion of the hardship indication in the regression specification changes not only the interpretation of the regression coefficients, but also greatly affects the results for the spatial diagnostics. In the SAR-Error context, the LM-Lag statistic is still significant, but its robust form is not. There is now only the weakest of evidence for the Error case (5.705 with a p-value of 0.02). Neither of the robust tests are significant. The joint LM test (7.329) only achieves a p-value of 0.0256, which provides very weak evidence. On the other hand, Moran's I remains significant, but much less so than before (z-value of 2.619 with p=0.0088).\n",
    "\n",
    "The situation is completely different on the Spatial Durbin front. Now, the values for the robust forms of the tests are smaller than the original counterparts, as they should be. There is strong evidence for an SLX alternative. Following the Koley-Bera (2024) recommendations, interest focuses on the robust forms of the one-directional tests, since the joint test is highly significant. The robust Lag test (5.705) is no longer significant for a p-value of 0.01 (p=0169), whereas the robust WX test (17.742) is still highly significant (p=0.0005). This would suggest an SLX alternative, or possibly, with a lower standard for significance (e.g., p=0.05), a Spatial Durbin model.\n",
    "\n",
    "Since it is straightforward to estimate an SLX model by means of OLS, the effect of this specification on the spatial diagnostics is examined next. A full consideration of the estimation of various SLX models is considered in a separate notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f76a7ef2",
   "metadata": {},
   "source": [
    "## Spatial Diagnostics in SLX Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "057a2703",
   "metadata": {},
   "source": [
    "The call for the estimation of an SLX specification is the same as for any other OLS estimation, except that an additional argument `slx_lags=1` must be included. If higher order lags are desired, the argument to `slx_lags` must be adjusted accordingly.\n",
    "\n",
    "The results for the full regression (with **x2**) is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2c6636d",
   "metadata": {},
   "outputs": [],
   "source": [
    "slx = OLS(y,x2,w=wq,slx_lags=1,\n",
    "                 spat_diag=True,moran=True,\n",
    "                 name_w=w_name,name_ds=ds_name)\n",
    "print(slx.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e76d1a1",
   "metadata": {},
   "source": [
    "The inclusion of the three SLX terms changes the estimates and significance of the other coefficients considerably. The **Hisp14P** variable is no longer significant, but its spatial lag is, with a large negative coefficient. All three SLX coefficients are significant. However, their magnitude and sign raise some concerns. It is very difficult to interpret a case where the coefficient of the neighbor's influence has a different sign from the original coefficient, as is the case for **Blk14P** and **W_Blk14P**. This suggests negative spatial autocorrelation, which is rare, but not impossible, although it runs counter to Tobler's first law of geography.\n",
    "\n",
    "Similarly, obtaining a coefficient for WX that is larger than the corresponding coefficient for X suggests a stronger effect of the neighbors than for the location itself, which runs counter to the distance decay implied by Tobler's law. While this does not make the SLX specification invalid as such, it does require a careful consideration of the interpretation of the coefficients.\n",
    "\n",
    "Diagnostics are only reported for the SAR-Error alternatives, since the WX term is already included in the model. This inclusion seems to have eliminated most evidence for remaining spatial autocorrelation. While Moran's I and the LM tests are weakly significant (but *not* for a p-value of p=0.01), the robust LM tests are not, and neither is the joint LM test.\n",
    "\n",
    "This would suggest that the inclusion of the SLX terms has taken care of the spatial autocorrelation *problem*. However, as mentioned, the signs and magnitudes of the coefficients are difficult to interpret and would demand closer scrutiny."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7471653e",
   "metadata": {},
   "source": [
    "## Spatial Diagnostics in 2SLS Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bb8b6b3",
   "metadata": {},
   "source": [
    "The AK test is included in the output of a 2SLS regression when `spat_diag=True` and spatial weights are specified. In all other respects, the call is the same as reviewed in a previous notebook, with **x1** as the exogenous explanatory variables, **yend** as the endogenous variable and **q** as the instruments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e90e6a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "tsls1 = TSLS(y,x=x1,yend=yend,q=q,w=wq,spat_diag = True,\n",
    "             name_w=w_name,name_ds=ds_name)\n",
    "print(tsls1.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f55944e8",
   "metadata": {},
   "source": [
    "In contrast to what was found with Moran's I for the OLS case, there is no evidence for residual spatial autocorrelation when the variable **HIS_CT** is treated as being endogenous."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65aadbff",
   "metadata": {},
   "source": [
    "## Practice\n",
    "\n",
    "At this point, it would most effective if you could continue with your baseline regression, assess whether there is any evidence of spatial autocorrelation and what type of alternative suggests itself. If warranted, check the effect of including SLX terms or correcting for endogeneity on the regression coefficients and spatial diagnostics. Consider using a number of different spatial weights to assess the robustness of your findings."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py312",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}