{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7b8975c4",
   "metadata": {},
   "source": [
    "# Two Stage Least Squares Regression (2SLS)\n",
    "\n",
    "### Luc Anselin\n",
    "\n",
    "### (revised 09/08/2024)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cfd0985",
   "metadata": {},
   "source": [
    "## Preliminaries\n",
    "\n",
    "In this notebook, endogenous and instrumental variables are introduced and the basics of two stage least squares estimation are presented.\n",
    "\n",
    "Technical details are given in Chapter 6 of Anselin and Rey (2014), *Modern Spatial Econometrics in Practice*."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a443690f",
   "metadata": {},
   "source": [
    "### Prerequisites\n",
    "\n",
    "Familiarity with *pandas*, *geopandas* and *libpysal* is assumed, both to read data files and manipulate spatial weights and to use the *PySAL* sample data sets. Again, the **chicagoSDOH** sample data set is used. If not available, it must be installed first with `libpysal.examples.load_example(\"chicagoSDOH\")`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6494b68c",
   "metadata": {},
   "source": [
    "### Modules Needed\n",
    "\n",
    "The main module for spatial regression in PySAL is *spreg*. In addition, *libpysal* is needed to handle the example data sets and spatial weights, and *pandas* and *geopandas* for data input and output. This notebook is based on version 1.7 of *spreg*. \n",
    "\n",
    "Some additional imports are included to avoid excessive warning messages. With later versions of PySAL, these may not be needed. Finally, to avoid issues with the `print` function for *numpy* 2.0 and later, the `legacy` option is set in `set_printoptions`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e398e42f",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "import os\n",
    "os.environ['USE_PYGEOS'] = '0'\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import geopandas as gpd\n",
    "from libpysal.io import open\n",
    "from libpysal.examples import get_path\n",
    "import libpysal.weights as weights\n",
    "from spreg import OLS, TSLS, f_stat\n",
    "np.set_printoptions(legacy=\"1.25\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ac85fb3",
   "metadata": {},
   "source": [
    "### Functionality Used\n",
    "\n",
    "- from pandas/geopandas:\n",
    "  - read_file\n",
    "  - DataFrame\n",
    "  - concat\n",
    "  \n",
    "- from libpysal:\n",
    "  - examples.get_path\n",
    "  - io.open\n",
    "  - weights.distance.Kernel\n",
    "  \n",
    "- from spreg:\n",
    "  - OLS\n",
    "  - TSLS\n",
    "  - f_stat"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9b0c168",
   "metadata": {},
   "source": [
    "### Input Files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74ab0075",
   "metadata": {},
   "source": [
    "All notebooks are organized such that the relevant file names and variable names are listed at the top, so that they can be easily adjusted for use with your own data sets and variables.\n",
    "\n",
    "In the current notebook, the **Chi-SDOH** sample shape file is used, with associated kernel weights (for HAC standard errors). The specific file names are:\n",
    "\n",
    "- **Chi-SDOH.shp,shx,dbf,prj**: socio-economic determinants of health for 2014 in 791 Chicago tracts\n",
    "- **Chi-SDOH_k10tri.kwt**: triangular kernel weights based on a variable bandwidth with 10 nearest neighbors from *GeoDa*\n",
    "\n",
    "As before, the input files are specified generically as **infileshp** (for the shape file) and **infilek** (for the kernel weights). The `libpysal.examples.get_path` functionality is used to get the correct path."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "4d4335bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "infileshp = get_path(\"Chi-SDOH.shp\")            # input shape file with data\n",
    "infilek = get_path(\"Chi-SDOH_k10tri.kwt\")       # triangular kernel weights from GeoDa"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6613bdc",
   "metadata": {},
   "source": [
    "### Variable Names"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06db06ed",
   "metadata": {},
   "source": [
    "The illustration in this notebook considers two different specifications. One has **YPLL_rate** (index of premature mortality) as the dependent variable and **Blk14P** (percent Black population), **Hisp14P** (percent Hispanic population), and **HIS_ct** (economic hardship index) as explanatory variables. The variable **HIS_ct** is considered to be *endogenous*, with the census tract centroids as instruments (**COORD_X** and **COORD_Y**). The full set of regressors is specified in **z_names1**, the exogenous variables in **xe_names**, endogenous variable in **yend_names1**, and instruments in **q_names1**.\n",
    "\n",
    "A second specification uses the same dependent variable (**y_name**) and exogenous regressors (**xe_names**), but now includes two endogenous regressors, **HIS_ct** and **EP_NOHSDP** (percent without a high school diploma), in **yend_names2** (with the full set of regressors in **z_names2**). It also adds one instrument to the census tract centroids, **EP_LIMENG** (percent with limited English proficiency), with the full set of instruments specified in **q_names2**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "69476f7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "y_name = ['YPLL_rate']\n",
    "z_names1 = ['Blk14P','Hisp14P','HIS_ct']\n",
    "z_names2 = ['Blk14P','Hisp14P','EP_NOHSDP','HIS_ct']\n",
    "xe_names = ['Blk14P','Hisp14P']\n",
    "yend_names1 = ['HIS_ct']\n",
    "yend_names2 = ['EP_NOHSDP','HIS_ct']\n",
    "q_names1 = ['COORD_X', 'COORD_Y']\n",
    "q_names2 = ['COORD_X', 'COORD_Y','EP_LIMENG']\n",
    "ds_name = 'Chi-SDOH'\n",
    "gwk_name = 'Chi-SDOH_k10tri'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35566ba3",
   "metadata": {},
   "source": [
    "### Variable definition and data input"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73afed82",
   "metadata": {},
   "source": [
    "The input geo data frame is created using `read_file` and the weights file is read using `libpysal.io.open` (imported as `open`). Also, the class of the kernel weights is corrected using `libpysal.weights.distance.Kernel` (`libpysal.weights` is imported as `weights`). Next, all the relevant variables are initialized as subsets from the data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8d93a92",
   "metadata": {},
   "outputs": [],
   "source": [
    "dfs = gpd.read_file(infileshp)\n",
    "wk = open(infilek).read()\n",
    "print(wk.n)\n",
    "wk.__class__ = weights.distance.Kernel\n",
    "print(type(wk))\n",
    "\n",
    "y = dfs[y_name]\n",
    "z1 = dfs[z_names1]\n",
    "z2 = dfs[z_names2]\n",
    "xe = dfs[xe_names]\n",
    "yend1 = dfs[yend_names1]\n",
    "yend2 = dfs[yend_names2]\n",
    "q1 = dfs[q_names1]\n",
    "q2 = dfs[q_names2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0762e5db-f024-412d-9691-d1e3916a446d",
   "metadata": {},
   "source": [
    "## The Principle of Two Stage Least Squares"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d294d0d1-da43-44ee-a107-71258882eee7",
   "metadata": {},
   "source": [
    "A fundamental assumption behind OLS is, loosely put, that the explanatory variables ($X$) are uncorrelated with the error terms. If this is not the case for one or more variables, the OLS estimates will be *biased* (in the early days, this used to be called simultaneous equation bias, now it is referred to as endogeneity). To correct for this bias, the violating *endogenous* variables are replaced by *instruments* that are: (1) uncorrelated with the error term; and (2) closely related (but not too close) to the original endogenous variables.\n",
    "\n",
    "For example, consider a set of explanatory variables, such that:\n",
    "\\begin{equation*}\n",
    "y = Z\\delta + u,\n",
    "\\end{equation*}\n",
    "where $Z$ is organized such that the exogenous variables $X$ are first and the endogenous ones $Y$ are second, as $Z = [X \\  Y]$.\n",
    "\n",
    "In essence, 2SLS estimation can be thought of as proceeding\n",
    "in two stages (hence the acronym). In the first stage, each of the endogenous variables\n",
    "is regressed on a matrix of instruments, $Q$, which includes all of the exogenous\n",
    "variables in $X$ as well as specific *instruments*. The predicted values from this regression are then used as\n",
    "explanatory variables, replacing the endogenous variables in the *second* stage, which yields consistent estimates\n",
    "for the coefficients $\\delta$. Note that the predicted value in a regression of the $X$ variables on $Q$\n",
    "(which includes the $X$) simply yields the original $X$, so that \n",
    "there are in fact no new instrumental variables for the $X$. \n",
    "\n",
    "The first stage can be expressed succinctly as:\n",
    "\\begin{equation*}\n",
    "\\hat{Z} = Q [ (Q'Q)^{-1} Q'Z ],\n",
    "\\end{equation*}\n",
    "where the term in square brackets is the matrix of OLS regression coefficients, with one column for the regression\n",
    "of each variable in $Z$ on the instrument matrix $Q$.\n",
    "\n",
    "The second stage consists of an OLS regression of $y$ on the predicted values\n",
    "$\\hat{Z}$:\n",
    "\\begin{equation*}\n",
    "\\hat{\\delta} = (\\hat{Z}' \\hat{Z} )^{-1} \\hat{Z}' y.\n",
    "\\end{equation*}\n",
    "\n",
    "Alternatively, substituting the results of the first regression yields\n",
    "the full expression as:\n",
    "\\begin{equation*}\n",
    "\\hat{\\delta} = [ Z'Q (Q'Q)^{-1} Q'Z ]^{-1} Z'Q (Q'Q)^{-1} Q' y.\n",
    "\\end{equation*}\n",
    "\n",
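    "The substitution can be checked in one step, as a short verification that uses only the definitions above. Writing $P = Q (Q'Q)^{-1} Q'$, so that $\\hat{Z} = PZ$, and using the fact that $P$ is symmetric and idempotent ($P' = P$ and $PP = P$):\n",
    "\\begin{equation*}\n",
    "\\hat{Z}'\\hat{Z} = Z'P'PZ = Z'PZ = Z'Q (Q'Q)^{-1} Q'Z, \\qquad \\hat{Z}'y = Z'P'y = Z'Q (Q'Q)^{-1} Q'y.\n",
    "\\end{equation*}\n",
    "\n",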
    "Strictly speaking, the 2SLS estimator is an instrumental variables estimator with instruments:\n",
    "\\begin{equation*}\n",
    "H = Q (Q'Q)^{-1} Q'Z,\n",
    "\\end{equation*}\n",
    "where $Q (Q'Q)^{-1} Q'$ is often referred to as a projection matrix. The estimator can\n",
    "be expressed in the usual way as:\n",
    "\\begin{equation*}\n",
    "\\hat{\\delta} = (H'Z)^{-1} H'y,\n",
    "\\end{equation*}\n",
    "which, after substituting the full expression for $H$ and some matrix algebra, yields\n",
    "the same result as above.\n"
   ]
  },
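  {
   "cell_type": "markdown",
   "id": "sketch-2sls-equivalence-md",
   "metadata": {},
   "source": [
    "The equivalence between the two-stage procedure and the closed-form expression can be checked numerically. The cell below is a minimal sketch on simulated data and is not part of the Chicago example: the data-generating process, the function names (`simulate_iv_data`, `tsls_two_stage`, `tsls_closed_form`) and the `sim_` variables are all assumptions made for this illustration. Both functions compute $\\hat{\\delta}$ from the same $y$, $Z$ and $Q$, and the comparison with `np.allclose` checks that they agree up to floating-point tolerance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-2sls-equivalence-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def simulate_iv_data(n=500, seed=12345):\n",
    "    # Simulated data with one endogenous regressor (illustration only)\n",
    "    rng = np.random.default_rng(seed)\n",
    "    const = np.ones((n, 1))\n",
    "    x = rng.normal(size=(n, 1))              # exogenous regressor\n",
    "    q = rng.normal(size=(n, 2))              # two outside instruments\n",
    "    common = rng.normal(size=(n, 1))         # shock shared by yend and the error\n",
    "    yend = 0.5 * x + q @ np.array([[1.0], [-1.0]]) + common + rng.normal(size=(n, 1))\n",
    "    u = common + rng.normal(size=(n, 1))     # error term, correlated with yend\n",
    "    y = 1.0 + 2.0 * x + 3.0 * yend + u\n",
    "    Z = np.hstack([const, x, yend])          # Z = [X Y], constant included in X\n",
    "    Q = np.hstack([const, x, q])             # Q = exogenous variables plus instruments\n",
    "    return y, Z, Q\n",
    "\n",
    "def tsls_two_stage(y, Z, Q):\n",
    "    # First stage: Zhat = Q (Q'Q)^{-1} Q'Z\n",
    "    Zhat = Q @ np.linalg.solve(Q.T @ Q, Q.T @ Z)\n",
    "    # Second stage: OLS of y on Zhat\n",
    "    return np.linalg.solve(Zhat.T @ Zhat, Zhat.T @ y)\n",
    "\n",
    "def tsls_closed_form(y, Z, Q):\n",
    "    # [Z'Q (Q'Q)^{-1} Q'Z]^{-1} Z'Q (Q'Q)^{-1} Q'y\n",
    "    ZtQ = Z.T @ Q\n",
    "    QtQ_inv_QtZ = np.linalg.solve(Q.T @ Q, Q.T @ Z)\n",
    "    QtQ_inv_Qty = np.linalg.solve(Q.T @ Q, Q.T @ y)\n",
    "    return np.linalg.solve(ZtQ @ QtQ_inv_QtZ, ZtQ @ QtQ_inv_Qty)\n",
    "\n",
    "sim_y, sim_Z, sim_Q = simulate_iv_data()\n",
    "sim_delta_2stage = tsls_two_stage(sim_y, sim_Z, sim_Q)\n",
    "sim_delta_closed = tsls_closed_form(sim_y, sim_Z, sim_Q)\n",
    "print(sim_delta_2stage.flatten())\n",
    "print(np.allclose(sim_delta_2stage, sim_delta_closed))"
   ]
  },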
  {
   "cell_type": "markdown",
   "id": "d2b3f701-05ff-4064-859e-f3aa2c35554e",
   "metadata": {},
   "source": [
    "## The Two Stages of 2SLS"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95ed9650-f8fa-477f-baa0-1ecf4a1e411a",
   "metadata": {},
   "source": [
    "Before illustrating the `spreg.TSLS` functionality, the two stages of 2SLS are spelled out in detail, using the first regression example. As a reference point, the model is first estimated by OLS, using `spreg.OLS` with `nonspat_diag = False` to limit the output. The complete set of regressors is contained in **z1**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce2419e7-887b-40f2-ae7d-7129e6cc074a",
   "metadata": {},
   "outputs": [],
   "source": [
    "ols1 = OLS(y,x=z1,nonspat_diag = False,name_ds = ds_name)\n",
    "print(ols1.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d14f8459-dc04-48b1-befb-201089605ace",
   "metadata": {},
   "source": [
    "### Creating the instruments"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c815b43-bf56-4575-8b5d-00e5bd179673",
   "metadata": {},
   "source": [
    "In the example, there is only one endogenous variable, **HIS_ct**. The associated *instrumental variable* is the predicted value from a regression of **HIS_ct** on the exogenous variables **xe** and the instruments **q1**. The terminology can get a bit confusing, since instrumental variables and instruments are often used interchangeably. In a strict sense, the predicted value is an instrumental variable for the endogenous variable and the exogenous variables are often not explicitly designated as instruments. This is the meaning used here. So, the full set of instruments consists of *both* **xe** and **q1**. The two are combined by means of the *pandas* `concat` function.\n",
    "\n",
    "The instrumental variable for **HIS_ct** is then extracted as the `predy` attribute of the regression object and turned into a dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59b0cb8e-1285-4d31-b6c3-70f09cd8cb1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "qq = pd.concat([xe,q1],axis=1)\n",
    "olsinst = OLS(yend1,qq,nonspat_diag=False,name_ds=ds_name)\n",
    "print(olsinst.summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e9501f8-69d9-4929-99af-6ed5dafe2a0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "yend_p = olsinst.predy\n",
    "yep = pd.DataFrame(yend_p,columns=['HIS_ct_p'])\n",
    "yep"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "173a37ac-8579-4188-a99c-ae8aa770d986",
   "metadata": {},
   "source": [
    "### Second stage"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5bea8e71-26fd-4e1b-a7ed-98f3e0de1275",
   "metadata": {},
   "source": [
    "The second stage consists of an OLS estimation of **YPLL_rate** on the three regressors, where **HIS_ct** is replaced by its predicted value. Again, first `concat` is used to put the exogenous variables together with the newly created predicted value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "730c0601-1bd8-4768-9035-17c666225d44",
   "metadata": {},
   "outputs": [],
   "source": [
    "zz = pd.concat([xe,yep],axis=1)\n",
    "ols2 = OLS(y,zz,nonspat_diag = False, name_ds=ds_name)\n",
    "print(ols2.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e5ff992-0404-4824-94a6-f6f8408e3df1",
   "metadata": {},
   "source": [
    "Compared to the results of the original OLS regression (**ols1**), there are some marked differences in both the magnitude and significance of the coefficients. The coefficient for **Blk14P** is less than half its previous value (16.4 vs. 42.1) and is no longer significant. The coefficient for **Hisp14P** is three times as large (-45.7 vs. -14.6) and **HIS_ct_p** double (153.6 vs. 72.7). Both remain highly significant. However, as noted below, the standard errors and thus also the significance must be interpreted with caution."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2129c6c0-d1cd-438a-88ef-bdd2601230ed",
   "metadata": {},
   "source": [
    "### Predicted values and residuals"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "426c68e2-3ee9-457f-96fd-caab85547425",
   "metadata": {},
   "source": [
    "The proper residuals for the 2SLS regression should be $y - Z\\hat{\\delta}$, and *not* $y - \\hat{Z}\\hat{\\delta}$, as given by the second stage OLS regression. As a consequence, the standard errors (which are based on those residuals) given in the second stage OLS regression are *not* the proper standard errors in a 2SLS estimation. This is sometimes overlooked in software implementations that carry out the estimation as two actual separate stages. The correct standard errors and associated (asymptotic) t-statistics and p-values are given by the `spreg.TSLS` command."
   ]
  },
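  {
   "cell_type": "markdown",
   "id": "sketch-2sls-residuals-md",
   "metadata": {},
   "source": [
    "The difference between the two sets of residuals can be made concrete with a small simulation. The following is a sketch under stated assumptions, not the `spreg` implementation: the simulated data, the function name `compare_2sls_std_errors` and the use of $n$ as the divisor for the error variance in both cases are choices made here for illustration. Both sets of standard errors are based on the same $(\\hat{Z}'\\hat{Z})^{-1}$, so they differ only through the residuals that enter the estimate of the error variance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-2sls-residuals-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def compare_2sls_std_errors(n=500, seed=2024):\n",
    "    # Simulated data with one endogenous regressor (illustration only)\n",
    "    rng = np.random.default_rng(seed)\n",
    "    const = np.ones((n, 1))\n",
    "    x = rng.normal(size=(n, 1))\n",
    "    q = rng.normal(size=(n, 2))\n",
    "    common = rng.normal(size=(n, 1))         # source of the endogeneity\n",
    "    yend = 0.5 * x + q.sum(axis=1, keepdims=True) + common + rng.normal(size=(n, 1))\n",
    "    y = 1.0 + 2.0 * x + 3.0 * yend + common + rng.normal(size=(n, 1))\n",
    "    Z = np.hstack([const, x, yend])\n",
    "    Q = np.hstack([const, x, q])\n",
    "    # First stage predicted values and second stage estimates\n",
    "    Zhat = Q @ np.linalg.solve(Q.T @ Q, Q.T @ Z)\n",
    "    ZhZh_inv = np.linalg.inv(Zhat.T @ Zhat)\n",
    "    delta = ZhZh_inv @ Zhat.T @ y\n",
    "    u_stage2 = y - Zhat @ delta              # residuals of the second stage OLS\n",
    "    u_2sls = y - Z @ delta                   # proper 2SLS residuals\n",
    "    # Error variance with divisor n in both cases (an assumption of this sketch)\n",
    "    sig2_stage2 = (u_stage2.T @ u_stage2).item() / n\n",
    "    sig2_2sls = (u_2sls.T @ u_2sls).item() / n\n",
    "    se_stage2 = np.sqrt(sig2_stage2 * np.diag(ZhZh_inv))\n",
    "    se_2sls = np.sqrt(sig2_2sls * np.diag(ZhZh_inv))\n",
    "    return delta.flatten(), se_stage2, se_2sls\n",
    "\n",
    "sim_delta, sim_se_stage2, sim_se_2sls = compare_2sls_std_errors()\n",
    "print('estimates          :', sim_delta)\n",
    "print('second stage OLS se:', sim_se_stage2)\n",
    "print('2SLS se            :', sim_se_2sls)"
   ]
  },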
  {
   "cell_type": "markdown",
   "id": "ef05242c-8a10-4dde-a371-93e90b7685c2",
   "metadata": {},
   "source": [
    "## TSLS"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4105dc9-115d-4092-99cd-d0d77da89c46",
   "metadata": {},
   "source": [
    "### Immigrant paradox"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22995b21-c8ed-4b57-849b-a9526e85c29a",
   "metadata": {},
   "source": [
    "The proper standard errors for 2SLS are obtained with `spreg.TSLS`. The required arguments are `y` for the dependent variable, `x` for the exogenous variables, `yend` for the endogenous variables and `q` for the instruments. The default is to have `nonspat_diag = True`, so it is set to `False` here to focus on just the estimates and their significance. Finally, `name_ds` is included for the data set name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61a5e9a8-c086-4d74-9911-3c94bc1a2dff",
   "metadata": {},
   "outputs": [],
   "source": [
    "tsls1 = TSLS(y,x=xe,yend=yend1,q=q1,nonspat_diag = False,name_ds=ds_name)\n",
    "print(tsls1.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb56a0c1-9825-47e2-945b-e52391c22d7d",
   "metadata": {},
   "source": [
    "The coefficient estimates are identical to the ones obtained in **ols2**, but the standard errors are slightly different. In the end, this does not affect the significance in a meaningful way. **Blk14P** remains not significant and the other p-values are only marginally affected.\n",
    "\n",
    "The complete set of attributes of the regression object is given with the usual `dir` command. They are essentially the same as for an OLS object, e.g., with the estimates in `betas`, predicted values in `predy` and residuals in `u`. In addition, several intermediate matrix results are included as well that are useful for customized calculations, but beyond the current scope."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1edd29b5-89db-4076-ba59-98c90f1977e1",
   "metadata": {},
   "source": [
    "### Predicted values and residuals"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c514a80c-6286-4db6-b724-d32fb11e133f",
   "metadata": {},
   "source": [
    "As mentioned, the residuals are $y - \\hat{y}$, with $\\hat{y}$ as the predicted values.\n",
    "The latter are obtained as $Z \\hat{\\delta}$. This is somewhat misleading as a \n",
    "measure of fit, since observations on the endogenous variables are included in the matrix\n",
    "$Z$. A proper predicted value would be obtained from a solution of the reduced\n",
    "form, with only exogenous variables on the RHS of the equation. However, in a single equation\n",
    "cross-sectional setting, this is not possible. As computed, the predicted value thus may give an\n",
    "overly optimistic measure of the fit of the model. Consequently, the associated residual sum of squares may be too small to reflect the correct performance of the model."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebde4bba-1085-41f1-8e33-eb5d95fd817f",
   "metadata": {},
   "source": [
    "### Two endogenous variables"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f3fcf58-a2f6-488d-a02b-331d8f533dbd",
   "metadata": {},
   "source": [
    "To further illustrate that 2SLS also works for multiple endogenous variables, the second specification is used, now with **yend2** as the endogenous variables and **q2** as the associated instruments. For reference, the OLS results are given as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97ae9e6e-f34a-4884-9950-73f19177a9d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "ols3 = OLS(y,x=z2,nonspat_diag=False,name_ds=ds_name)\n",
    "print(ols3.summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ade228c-3c80-40ca-8355-7982773f3d3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "tsls2 = TSLS(y,x=xe,yend=yend2,q=q2,nonspat_diag=False,name_ds=ds_name)\n",
    "print(tsls2.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19bac111-50d6-4b0d-a5d6-bec6bed028a3",
   "metadata": {},
   "source": [
    "The effect of correcting for potential endogeneity is again substantial. In the OLS results, all coefficients except **EP_NOHSDP** are highly significant. However, in the 2SLS results, neither **Blk14P** nor **Hisp14P** is significant, whereas **EP_NOHSDP** becomes significant. Its coefficient also changes sign, from positive (but not significant, so essentially zero) to negative. This illustrates the importance of checking for potential endogeneity."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32e0deb4-bc0c-4868-9600-944db1e51482",
   "metadata": {},
   "source": [
    "## Durbin-Wu-Hausman Test"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a882499-ceca-4413-8028-d7433eeceade",
   "metadata": {},
   "source": [
    "The Durbin-Wu-Hausman statistic (DWH) is a test for the endogeneity of some regressors, provided instruments are available. It is a simple F-test on the joint significance of selected coefficients in an augmented regression. The augmented regression specification consists of the original regressors, augmented with the predicted values of the endogenous variables when regressed on the instruments (those include the exogenous variables). Under the null hypothesis of exogeneity, the coefficients of these predicted values should not be significant. The DWH test then boils down to an F-test on their joint significance (see Davidson and MacKinnon 2004, pp. 338-340).\n",
    "\n",
    "The test is included as a default diagnostic in `spreg.TSLS` (the default is `nonspat_diag=True`, so this argument need not be specified explicitly). For the two example regressions, this yields the following results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75de9b19-a557-4d04-bb59-4b3b1780868e",
   "metadata": {},
   "outputs": [],
   "source": [
    "tsls1a = TSLS(y,x=xe,yend=yend1,q=q1,name_ds=ds_name)\n",
    "print(tsls1a.summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15e7ce06-780e-483b-bb29-05a1fc3280b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "tsls2a = TSLS(y,x=xe,yend=yend2,q=q2,name_ds=ds_name)\n",
    "print(tsls2a.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80984aee-322b-495f-afd9-02fdae4de98c",
   "metadata": {},
   "source": [
    "In both instances, there is clear evidence that controlling for endogeneity was warranted.\n",
    "\n",
    "Since the predicted value for **HIS_ct** in the first regression is already available as **yep** above, the DWH test can be verified by carrying out the auxiliary regression. To this end **z1** must be concatenated with **yep** to form the new matrix of regressors. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f9098f0-f23e-4c72-b6a3-8450e67a45f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "zza = pd.concat([z1,yep],axis=1)\n",
    "ols4 = OLS(y,zza,nonspat_diag=False,name_ds=ds_name)\n",
    "print(ols4.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0f6ff3f-3b77-479c-b535-3648a2a77447",
   "metadata": {},
   "source": [
    "The DWH statistic then follows as the result of an F-test on the coefficient of **HIS_ct_p** in the auxiliary regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37c27f55-c389-4dc3-8fa1-dbb180eda9a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "dwhf = f_stat(ols4,df=1)\n",
    "print(dwhf)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "491f98c2-30b9-4e56-a8c9-0eef8a855fae",
   "metadata": {},
   "source": [
    "The resulting F-statistic is exactly the value of DWH listed for **tsls1a**, with the associated p-value. In the same way, the result for the second 2SLS regression can be verified, but this is slightly more involved. In the second case, two additional regressions are required to obtain predicted values for each of the endogenous variables, but otherwise everything is the same. This is all carried out under the hood for the DWH statistic."
   ]
  },
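  {
   "cell_type": "markdown",
   "id": "sketch-dwh-two-endog-md",
   "metadata": {},
   "source": [
    "The steps for more than one endogenous variable can be sketched with plain *numpy* on simulated data. This is an illustration under stated assumptions and not necessarily how `spreg` computes the statistic: the simulated data and the function names (`simulate_two_endog`, `ols_rss`, `dwh_f_test`) are hypothetical, and the F-statistic is obtained by comparing the residual sums of squares of the original and the augmented regression. With two endogenous variables, the first stage yields two columns of predicted values and the F-test has two degrees of freedom in the numerator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-dwh-two-endog-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy import stats\n",
    "\n",
    "def simulate_two_endog(n=500, seed=7):\n",
    "    # Simulated data: the first of the two candidate regressors is truly endogenous\n",
    "    rng = np.random.default_rng(seed)\n",
    "    const = np.ones((n, 1))\n",
    "    x = rng.normal(size=(n, 1))\n",
    "    Qout = rng.normal(size=(n, 3))           # three outside instruments\n",
    "    common = rng.normal(size=(n, 1))         # shock shared with the error term\n",
    "    y1 = Qout[:, [0]] + Qout[:, [2]] + common + rng.normal(size=(n, 1))\n",
    "    y2 = Qout[:, [1]] - Qout[:, [2]] + rng.normal(size=(n, 1))\n",
    "    u = common + rng.normal(size=(n, 1))\n",
    "    y = 1.0 + x + 2.0 * y1 - y2 + u\n",
    "    return y, np.hstack([const, x]), np.hstack([y1, y2]), Qout\n",
    "\n",
    "def ols_rss(y, X):\n",
    "    # Residual sum of squares of an OLS regression of y on X\n",
    "    b = np.linalg.lstsq(X, y, rcond=None)[0]\n",
    "    e = y - X @ b\n",
    "    return (e.T @ e).item()\n",
    "\n",
    "def dwh_f_test(y, X, Yend, Qout):\n",
    "    # X: exogenous variables (with constant), Yend: endogenous, Qout: instruments\n",
    "    n, m = y.shape[0], Yend.shape[1]\n",
    "    Qfull = np.hstack([X, Qout])\n",
    "    # One first stage regression per endogenous variable, done jointly\n",
    "    Yhat = Qfull @ np.linalg.lstsq(Qfull, Yend, rcond=None)[0]\n",
    "    Z = np.hstack([X, Yend])\n",
    "    rss_r = ols_rss(y, Z)                    # original regression\n",
    "    rss_u = ols_rss(y, np.hstack([Z, Yhat])) # augmented regression\n",
    "    dof = n - Z.shape[1] - m\n",
    "    F = ((rss_r - rss_u) / m) / (rss_u / dof)\n",
    "    return F, stats.f.sf(F, m, dof)\n",
    "\n",
    "sim_dwh_F, sim_dwh_p = dwh_f_test(*simulate_two_endog())\n",
    "print(sim_dwh_F, sim_dwh_p)"
   ]
  },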
  {
   "cell_type": "markdown",
   "id": "cd5699e6-1a1a-41ae-a437-aaf524b245e4",
   "metadata": {},
   "source": [
    "## Robust Standard Errors"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16a9c979-d3b5-488a-b242-4b52014077a0",
   "metadata": {},
   "source": [
    "As was the case for OLS regression, 2SLS allows for robust standard errors by setting the arguments `robust=\"white\"` or `robust=\"hac\"`. Everything operates in the same way as for OLS, i.e., the estimates are identical, but the standard errors, z-statistics and p-values may differ."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "420a53ae-e121-4496-a4fb-cbe9981e07ef",
   "metadata": {},
   "source": [
    "### White standard errors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04000053-cee4-4ada-befe-779bf28e2dff",
   "metadata": {},
   "outputs": [],
   "source": [
    "tsls1b = TSLS(y,x=xe,yend=yend1,q=q1,name_ds=ds_name,\n",
    "                     robust='white')\n",
    "print(tsls1b.summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c509296-0547-4dbc-9ef5-40531ae36e52",
   "metadata": {},
   "outputs": [],
   "source": [
    "tsls2b = TSLS(y,x=xe,yend=yend2,q=q2,name_ds=ds_name,\n",
    "                   robust='white')\n",
    "print(tsls2b.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78e7eed4-ee84-40de-9a40-40900cbb65ce",
   "metadata": {},
   "source": [
    "### HAC standard errors"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a695fbf-0877-4d90-b969-05081a75df8d",
   "metadata": {},
   "source": [
    "As was the case for OLS, in addition to `robust=\"hac\"`, the kernel weights (`gwk`) and optionally their name (`name_gwk`) must be specified as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd58790d-2b7f-466e-9e10-91951da1eda0",
   "metadata": {},
   "outputs": [],
   "source": [
    "tsls1c = TSLS(y,x=xe,yend=yend1,q=q1,name_ds=ds_name,\n",
    "                     robust='hac',gwk=wk,name_gwk=gwk_name)\n",
    "print(tsls1c.summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "438ae0cc-de8d-4051-981e-c4307f82cbd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "tsls2c = TSLS(y,x=xe,yend=yend2,q=q2,name_ds=ds_name,\n",
    "                   robust='hac',gwk=wk,name_gwk=gwk_name)\n",
    "print(tsls2c.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65aadbff",
   "metadata": {},
   "source": [
    "## Practice\n",
    "\n",
    "At this point, you can assess endogeneity in your own regression specification, or experiment with different models constructed from the variables in the **chicagoSDOH** sample data set. Replicate the 2SLS results by explicitly carrying out the two stage OLS regressions and/or check on the value of the Durbin-Wu-Hausman test by means of the auxiliary regression."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}