{ "cells": [ { "cell_type": "markdown", "id": "7b8975c4", "metadata": {}, "source": [ "# Basic Ordinary Least Squares Regression (OLS)\n", "\n", "### Luc Anselin\n", "\n", "### (revised 09/08/2024)\n" ] }, { "cell_type": "markdown", "id": "4cfd0985", "metadata": {}, "source": [ "## Preliminaries\n", "\n", "In this notebook, the basic OLS regression and elementary regression diagnostics are reviewed. In addition, a number of concepts related to robust standard errors are covered.\n", "\n", "Technical details are treated in Chapter 5 in Anselin and Rey (2014). *Modern Spatial Econometrics in Practice*.\n", "\n", "Video recordings of the basics and non-spatial diagnostics are available from the GeoDa Center YouTube channel playlist *Applied Spatial Regression - Notebooks*:\n", "- https://www.youtube.com/watch?v=x63dL2M5x_0&list=PLzREt6r1NenmhNy-FCUwiXL17Vyty5VL6&index=5\n", "- https://www.youtube.com/watch?v=pimNfZyOyKk&list=PLzREt6r1NenmhNy-FCUwiXL17Vyty5VL6&index=6" ] }, { "cell_type": "markdown", "id": "a443690f", "metadata": {}, "source": [ "### Prerequisites\n", "\n", "Familiarity with *pandas*, *geopandas* and *libpysal* is assumed to read data files and manipulate spatial weights. In addition, the sample data set **chicagoSDOH** should be installed." ] }, { "cell_type": "markdown", "id": "6494b68c", "metadata": {}, "source": [ "### Modules Needed\n", "\n", "The main module for spatial regression in PySAL is *spreg*. In addition *libpysal* is needed to handle the example data sets and spatial weights, and *pandas* and *geopandas* for data input and output. This notebook is based on version 1.7 of *spreg*. In order to make the graphs a bit nicer looking, *matplotlib.pyplot* is imported as well, although this is not strictly needed since it is behind the plotting functionality of *pandas* and *geopandas*. Importing *matplotlib.pyplot* also allows for some further customization of the plots and maps, but that is not pursued in detail here.\n", "\n", "Some additional imports are included to avoid excessive warning messages. With later versions of PySAL, these may not be needed. As before, in order to avoid some arguably obnoxious new features of *numpy* 2.0, it is necessary to include the `set_printoptions` command if you are using a Python 3.12 environment with numpy 2.0 or greater." 
] }, { "cell_type": "code", "execution_count": 4, "id": "e398e42f", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "import os\n", "os.environ['USE_PYGEOS'] = '0'\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import geopandas as gpd\n", "import matplotlib.pyplot as plt\n", "from libpysal.io import open\n", "from libpysal.examples import get_path\n", "import libpysal.weights as weights\n", "from spreg import OLS, f_stat, wald_test\n", "np.set_printoptions(legacy=\"1.25\")" ] }, { "cell_type": "markdown", "id": "1ac85fb3", "metadata": {}, "source": [ "### Functionality Used\n", "\n", "- from numpy:\n", " - array\n", " - hstack\n", " - identity\n", " - zeros\n", "\n", "- from pandas/geopandas:\n", " - read_file\n", " - DataFrame\n", " - concat\n", " - to_file\n", " - plot\n", " \n", "- from libpysal:\n", "\n", " - examples.get_path\n", " - io.open\n", " - weights.distance.Kernel\n", " \n", "- from spreg:\n", " - OLS\n", " - f_stat\n", " - wald_test" ] }, { "cell_type": "markdown", "id": "b9b0c168", "metadata": {}, "source": [ "### Input Files" ] }, { "cell_type": "markdown", "id": "74ab0075", "metadata": {}, "source": [ "All notebooks are organized such that the relevant filenames and variables names are listed at the top, so that they can be easily adjusted for use with your own data sets and variables.\n", "\n", "In the examples, the **Chi-SDOH** sample shape file is used, with associated kernel weights (for HAC standard errors). The specific file names are:\n", "\n", "- **Chi-SDOH.shp,shx,dbf,prj**: socio-economic determinants of health for 2014 in 791 Chicago tracts\n", "- **Chi-SDOH_k10tri.kwt**: triangular kernel weights based on a variable bandwidth with 10 nearest neighbors from `GeoDa`\n", "\n", "In this and the other *spreg* related notebooks, it is assumed that you have installed the **chicagoSDOH** example data set using `libpysal.examples.load_example(\"chicagoSDOH\")`. \n", "\n", "The input files are specified generically as **infileshp** (for the shape file) and **infilek** (for the kernel weights). In addition, optionally, an output file is specified to add predicted values and residuals to a shape file as **outfileshp**." ] }, { "cell_type": "code", "execution_count": 9, "id": "4d4335bb", "metadata": {}, "outputs": [], "source": [ "infileshp = get_path(\"Chi-SDOH.shp\") # input shape file with data\n", "infilek = get_path(\"Chi-SDOH_k10tri.kwt\") # triangular kernel weights from GeoDa\n", "outfileshp = \"./testols.shp\" # output shape file with predicted values and residuals" ] }, { "cell_type": "markdown", "id": "f6613bdc", "metadata": {}, "source": [ "### Variable Names" ] }, { "cell_type": "markdown", "id": "06db06ed", "metadata": {}, "source": [ "In this notebook, the regression model is illustrated in the context of the *immigrant paradox*. 
This is based on\n", "four variables from the SDOH data set: **YPLL_rate** (an index measuring premature mortality, i.e., higher values are worse health outcomes), **HIS_ct** (economic hardship index), **Blk14P** (percent Black population), and **Hisp14P** \n", "(percent Hispanic population).\n", "\n", "The easiest way to specify a generic regression model is to first set generic names for the dependent variable (**y_name**), lists with the names of the explanatory variables (**x_names1** and **x_names2**), the data set name (optional, as **ds_name**), the name of the contiguity weights (optional, as **w_name**, but not needed in this notebook), and the file name for the kernel weights (optional, as **gwk_name**). In this way, all the regression commands pertain to these generic variable names and do not need to be adjusted for different specifications and/or data sets." ] }, { "cell_type": "code", "execution_count": 10, "id": "69476f7b", "metadata": {}, "outputs": [], "source": [ "y_name = 'YPLL_rate'\n", "x_names1 = ['Blk14P','Hisp14P']\n", "x_names2 = ['Blk14P','Hisp14P','HIS_ct']\n", "ds_name = 'Chi-SDOH'\n", "gwk_name = 'Chi-SDOH_k10tri'" ] }, { "cell_type": "markdown", "id": "35566ba3", "metadata": {}, "source": [ "### Reading Input Files from the Example Data Set" ] }, { "cell_type": "markdown", "id": "73afed82", "metadata": {}, "source": [ "The actual path to the files contained in the local copy of the remote data set was found above by means of the *libpysal* `get_path` command. This is then passed to the *geopandas* `read_file` function in the usual way.\n", "\n", "As mentioned earlier, if the example data are not installed locally by means of `libpysal.examples`, the `get_path` command must be replaced by an explicit reference to the correct file path name. This is easiest if the files are in the current working directory, in which case just specifying the file names in **infileshp** etc. is sufficient." ] }, { "cell_type": "code", "execution_count": null, "id": "b8d93a92", "metadata": {}, "outputs": [], "source": [ "dfs = gpd.read_file(infileshp)\n", "print(dfs.shape)\n", "print(dfs.columns)" ] }, { "cell_type": "markdown", "id": "1ba18d32", "metadata": {}, "source": [ "### Reading Weights " ] }, { "cell_type": "markdown", "id": "f009a3bb", "metadata": {}, "source": [ "The weights are read by means of the `libpysal.io.open` command, which was imported as just `open`. In the example, the kernel weights are loaded from a file created with *GeoDa*, and the full path was specified above using `get_path`. After reading the weights, their `class` is adjusted to avoid raising an exception in the computation of HAC standard errors, as illustrated in the spatial weights notebook (note that `libpysal.weights` was imported as `weights`)." ] }, { "cell_type": "code", "execution_count": null, "id": "b2a933b3", "metadata": {}, "outputs": [], "source": [ "wk = open(infilek).read()\n", "print(\"dimension \",wk.n)\n", "wk.__class__ = weights.distance.Kernel\n", "print(\"type of wk \",wk.__class__)" ] }, { "cell_type": "markdown", "id": "8f3b8116", "metadata": {}, "source": [ "### Setting up the Variables" ] }, { "cell_type": "markdown", "id": "e310f464", "metadata": {}, "source": [ "In legacy *spreg*, variables were typically read from a *dBase* data file (associated with a shape file). This required a multi-step process to extract the variables and then turn them into *numpy* arrays. With the more recent support for *pandas* and *geopandas*, this process has become greatly simplified.
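" ] }, { "cell_type": "markdown", "id": "aa10b2c3", "metadata": {}, "source": [ "As a point of reference, the legacy pattern looked roughly like the sketch below, which pulls the variables out of the **Chi-SDOH.dbf** file column by column. This is purely illustrative; nothing later in the notebook depends on it." ] }, { "cell_type": "code", "execution_count": null, "id": "bb21c3d4", "metadata": {}, "outputs": [], "source": [ "# sketch of the legacy extraction - superseded by the data frame approach below\n", "db = open(get_path(\"Chi-SDOH.dbf\"), \"r\")                # dBase file via libpysal.io.open\n", "y_old = np.array(db.by_col(y_name)).reshape(-1, 1)      # dependent variable as n x 1 array\n", "x_old = np.array([db.by_col(v) for v in x_names1]).T    # explanatory variables as n x k array\n", "db.close()\n", "print(y_old.shape, x_old.shape)" ] }, { "cell_type": "markdown", "id": "cc32d4e5", "metadata": {}, "source": [ "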
The variables can be loaded directly from the *pandas* or *geopandas* data frame. A side benefit is that the `name_y` and `name_x` arguments no longer need to be set explicitly in the OLS call, since the variable names are taken from the data frame.\n", "\n", "The setup uses the variable names specified in the respective **y_name**, **x_names**, etc. lists. \n", "\n", "Note that there is no constant term in the x matrices. A constant vector is included by default in the `spreg.OLS` routine." ] }, { "cell_type": "code", "execution_count": 13, "id": "1b1e2614-aea5-4eb5-86d4-de90665c2e0d", "metadata": {}, "outputs": [], "source": [ "y = dfs[y_name]\n", "x1 = dfs[x_names1]\n", "x2 = dfs[x_names2]" ] }, { "cell_type": "markdown", "id": "17b17262", "metadata": {}, "source": [ "## OLS Regression" ] }, { "cell_type": "markdown", "id": "edf37d1b", "metadata": {}, "source": [ "A first illustration uses the simplest of OLS regressions, where only y and X are specified as arguments in the `spreg.OLS` command. Since `OLS` was imported explicitly, the `spreg` part is omitted in what follows.\n", "\n", "The resulting OLS object has many attributes. The easiest way to list the results is to print the `summary` attribute.\n", "\n", "First, the regression with the two ethnicity explanatory variables is considered. The dependent variable is **y** and the explanatory variables are contained in **x1**.\n", "\n", "The option `nonspat_diag=False` must be set, since the default will provide the diagnostics, which are considered in more detail below." ] }, { "cell_type": "code", "execution_count": 14, "id": "de3d5549", "metadata": {}, "outputs": [], "source": [ "ols1 = OLS(y,x1,nonspat_diag=False)" ] }, { "cell_type": "markdown", "id": "ef9d40af", "metadata": {}, "source": [ "The basic result of this operation is to create an OLS object. The range of attributes and methods included can be readily viewed by means of the `dir` command." ] }, { "cell_type": "code", "execution_count": null, "id": "29d10e41", "metadata": {}, "outputs": [], "source": [ "print(dir(ols1))" ] }, { "cell_type": "markdown", "id": "446bc1da", "metadata": {}, "source": [ "The regression coefficients are in `betas`, the predicted values in `predy`, residuals in `u`, the coefficient variance-covariance matrix in `vm`, and the $R^2$ measure of fit in `r2`. The `summary` attribute provides a nice-looking output listing." ] }, { "cell_type": "code", "execution_count": null, "id": "e69dccca", "metadata": {}, "outputs": [], "source": [ "print(ols1.summary)" ] }, { "cell_type": "markdown", "id": "7731d364", "metadata": {}, "source": [ "As is, the output does not list any information on the data set. The variable names were extracted directly from the data frame. In order to have a more informative output, `name_ds` should be specified in the arguments, as illustrated below. Since no spatial diagnostics are specified so far, the **Weights matrix** is listed as **None**." ] }, { "cell_type": "code", "execution_count": null, "id": "28ce15e1", "metadata": {}, "outputs": [], "source": [ "ols1a = OLS(y,x1,nonspat_diag=False,\n", " name_ds=ds_name)\n", "print(ols1a.summary)" ] }, { "cell_type": "markdown", "id": "a00aa4ab", "metadata": {}, "source": [ "This bare-bones regression output lists the variables and data set involved. In addition to the coefficient estimates, standard errors, t-statistics and p-values, it includes the $R^2$ and adjusted $R^2$ as measures of fit. The mean and standard deviation of the dependent variable are given as well.
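" ] }, { "cell_type": "markdown", "id": "dd43e5f6", "metadata": {}, "source": [ "These summary items can also be retrieved programmatically as attributes of the regression object. A minimal sketch for **ols1a**, using the attributes that parallel the summary listing:" ] }, { "cell_type": "code", "execution_count": null, "id": "ee54f6a7", "metadata": {}, "outputs": [], "source": [ "# selected attributes behind the summary listing\n", "print(\"R-squared          \", ols1a.r2)\n", "print(\"adjusted R-squared \", ols1a.ar2)\n", "print(\"mean of y          \", ols1a.mean_y)\n", "print(\"std of y           \", ols1a.std_y)\n", "print(\"coefficients       \", ols1a.betas.flatten())\n", "print(\"standard errors    \", ols1a.std_err)" ] }, { "cell_type": "markdown", "id": "ff65a7b8", "metadata": {}, "source": [ "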
\n", "\n", "The overall fit of about 0.60 is reasonable and both slope coefficients are positive and highly significant. " ] }, { "cell_type": "markdown", "id": "dbcde13a-142c-4ecc-8fb4-db9fe8f5fdab", "metadata": {}, "source": [ "### Alternative way to pass arguments" ] }, { "cell_type": "markdown", "id": "8bf21520-8492-47c7-b040-6bf2b39ae94b", "metadata": {}, "source": [ "The approach taken in the example, whereby **y** and **x1** as passed is mostly for convenience. In practice, one can specify the dependent and explanatory variables directly as subsets from a data frame. For example, one could use `dfs[\"YPLL_rate\"]` and `dfs[[\"Blk14P\",\"Hisp14P\"]]` directly as arguments instead of taking the extra step to define **y** and **x1** separately. The results are identical. Which approach one takes is a matter of preference." ] }, { "cell_type": "code", "execution_count": null, "id": "0d6b3599-62aa-41e9-9cfc-a34c403dbb04", "metadata": {}, "outputs": [], "source": [ "ols1b = OLS(dfs[\"YPLL_rate\"],dfs[[\"Blk14P\",\"Hisp14P\"]],nonspat_diag=False,\n", " name_ds=ds_name)\n", "print(ols1b.summary)" ] }, { "cell_type": "markdown", "id": "5cca34cc-bb02-4275-8a9d-9bd53b17a95d", "metadata": {}, "source": [ "### Immigrant paradox" ] }, { "cell_type": "markdown", "id": "11b0c2e4-cff1-4090-a2e6-4813133652a1", "metadata": {}, "source": [ "As it turns out, the initial regression result is somewhat misleading, which is revealed when the economic hardship variable is included. This is implemented by specifying **x2** as the x-variable argument." ] }, { "cell_type": "code", "execution_count": null, "id": "71eaba25", "metadata": {}, "outputs": [], "source": [ "ols2 = OLS(y,x2,nonspat_diag=False,\n", " name_ds=ds_name)\n", "print(ols2.summary)" ] }, { "cell_type": "markdown", "id": "a3beb65b", "metadata": {}, "source": [ "The inclusion of economic hardship turned the coefficient of the Hispanic share from positive to\n", "negative. This is the so-called *immigrant paradox*. All coefficients are highly significant, with an adjusted $R^2$ of 0.6316." ] }, { "cell_type": "markdown", "id": "ab78c8bc", "metadata": {}, "source": [ "### Predicted Values and Residuals" ] }, { "cell_type": "markdown", "id": "6a36bc6c", "metadata": {}, "source": [ "Two important attributes of the regression object are the predicted values and residuals. Unlike many of the other attributes, these are full-length column vectors of size n x 1. They can be extracted as the `predy` and `u` attributes. A quick check is included to make sure the dimensions are correct.\n", "\n", "For the residuals, the `mean()` is computed to confirm that it is zero." ] }, { "cell_type": "code", "execution_count": null, "id": "aa1829a3", "metadata": {}, "outputs": [], "source": [ "yp = ols2.predy\n", "yp.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "c2b5c0dd", "metadata": {}, "outputs": [], "source": [ "resid = ols2.u\n", "print(resid.mean())\n", "resid.shape" ] }, { "cell_type": "markdown", "id": "83537680-abd9-444d-9cf4-8990cf2238ae", "metadata": {}, "source": [ "#### Adding predicted values and residuals to the data frame" ] }, { "cell_type": "markdown", "id": "93cf9a3e", "metadata": {}, "source": [ "The most useful application of the predicted values and residuals in a diagnostic sense is to plot and map them. Before briefly illustrating this, it is shown how the two vectors can be added to the data frame and then optionally written out to a new shape file. 
This uses the `DataFrame` functionality from *pandas* and horizontal vector stacking (`hstack`) from *numpy*." ] }, { "cell_type": "code", "execution_count": null, "id": "792e0ec7", "metadata": {}, "outputs": [], "source": [ "preds = pd.DataFrame(np.hstack((yp,resid)),columns=['ypred','resid'])\n", "preds.head()" ] }, { "cell_type": "markdown", "id": "0341e6e1", "metadata": {}, "source": [ "The data frame with the predicted values and residuals is then concatenated (`pandas.concat`) with the current data frame and can be written to the output file **outfileshp** using the `to_file` function." ] }, { "cell_type": "code", "execution_count": 25, "id": "b0ba9486", "metadata": {}, "outputs": [], "source": [ "dfs = pd.concat([dfs,preds],axis=1)\n", "dfs.to_file(outfileshp,mode='w')" ] }, { "cell_type": "markdown", "id": "e7c13af0-0072-4ddb-bd04-03fea69785db", "metadata": {}, "source": [ "#### Residuals diagnostic plot" ] }, { "cell_type": "markdown", "id": "52a86141-dc66-4c89-9956-06b3e36d20e5", "metadata": {}, "source": [ "A common residual diagnostic plot is a scatter plot of the residuals against the predicted values. If the error variance is constant (homoskedasticity), the point cloud should form a more or less even band around the value of zero (the mean of the residuals). On the other hand, a fan-like shape or clearly different spreads for different ranges of the predicted values would suggest heteroskedasticity.\n", "\n", "The *pandas* `plot` functionality can be used to make a simple scatter plot of residuals against predicted values. In the example, **ypred** is used as the `x` argument, **resid** as the `y` argument, with `kind = \"scatter\"`. With *matplotlib.pyplot* imported, `plt.show()` will produce a clean graph. Alternatively, one can assign the plot object to `ax`, as illustrated in the mapping notebook. This can be useful for further customization, but is not pursued here." ] }, { "cell_type": "code", "execution_count": null, "id": "4d03ba70-6d2e-4d27-a05e-5ca793d2cb4d", "metadata": {}, "outputs": [], "source": [ "dfs.plot(x=\"ypred\",y=\"resid\",kind='scatter',title='Diagnostic plot')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e2afd10c-bdf1-492e-b8db-dbee160a6ada", "metadata": {}, "source": [ "The slight fan-like shape of the residuals, with values further away from zero for increasing predicted values, is a *visual* diagnostic suggesting heteroskedasticity. However, since this is purely visual, it remains only suggestive. More formal tests against heteroskedasticity are considered below." ] }, { "cell_type": "markdown", "id": "eddaa4fa-cded-4385-b13d-41e361bc50c1", "metadata": {}, "source": [ "#### Residual map" ] }, { "cell_type": "markdown", "id": "60f30322-6954-4e0c-82a5-3ce264c0dad4", "metadata": {}, "source": [ "Some insight into possible spatial patterns among the residuals can be gained from a residual map, i.e., a choropleth map of the residuals. This is readily implemented by means of the `plot` functionality of a *geopandas* object. The most useful classification in this respect is a *standard deviational* map, for which the bins in the classification are based on standard deviational units. This has two advantages: (1) it clearly shows the positive and negative residuals; and (2) it highlights *outliers* as residuals that are more than two standard deviational units away from zero.\n", "\n", "As covered in a previous notebook, a standard deviational map is obtained by setting the argument `scheme = \"std_mean\"` in the *geopandas* `plot` function. 
This relies on the PySAL *mapclassify* package under the hood, but the latter does not need to be imported separately. As pointed out in the mapping notebook, in *mapclassify*, the convention is to merge all observations that are within one standard deviation below and above the mean into a single category. This may not be optimal for all types of maps (it can be customized by means of `anchor`), but it is useful for a residual map. Specifically, all observations that have residuals within one standard deviation from the mean (zero) are shown in the same color (here, white). To highlight the difference between negative and positive residuals, the color map `cmap=\"bwr\"` is chosen. In this scheme, the negative residuals are shades of blue (overprediction), whereas the positive residuals are shades of red (underprediction). The middle category is white.\n", "\n", "In order to make sure that the polygon boundaries are shown as well, `edgecolor = \"black\"` is set as an additional argument (otherwise the census tracts in the middle category cannot be distinguished from each other). The `linewidth` is adjusted somewhat to improve legibility. A legend is set in the usual way.\n", "\n", "Finally, to remove the default axes and to add a title, the `set_axis_off` and `set_title` methods are applied to the plot object. Further customization can always be carried out using more advanced features of *matplotlib*. For some examples, see the mapping notebook.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "23d268a9-d4cd-455d-99e3-a30c8569b79d", "metadata": {}, "outputs": [], "source": [ "ax = dfs.plot(\n", " column = 'resid',\n", " figsize=(8,8),\n", " cmap='bwr',\n", " scheme='std_mean',\n", " edgecolor='black',\n", " linewidth = 0.3,\n", " legend=True,\n", " legend_kwds={'loc':'center left','bbox_to_anchor':(1,0.5), 'title': \"Residuals\"})\n", "ax.set_axis_off()\n", "ax.set_title('Residual Standard Deviational Plot')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "4b4355ac-fb57-4adf-82ba-a538b823e3d6", "metadata": {}, "source": [ "The residual map seems to suggest some spatial patterning among the residuals, but such a visual impression can be misleading. More formal approaches are considered in the notebook that deals with Specification Tests." ] }, { "cell_type": "markdown", "id": "245d9f28", "metadata": {}, "source": [ "### LaTeX Table Output" ] }, { "cell_type": "markdown", "id": "5a6da470", "metadata": {}, "source": [ "In addition to the standard regression output shown so far, it is also possible to generate a LaTeX-formatted table for the main coefficient results. This makes it easier to incorporate the regression results into reports or papers. It is accomplished by setting the `latex` option to `True` (the default is `False`).\n", "\n", "Note that only the table with the estimated coefficients, standard errors, etc. is in LaTeX format. The other items (various headings and diagnostics) remain simple text." ] }, { "cell_type": "code", "execution_count": null, "id": "844c667a", "metadata": {}, "outputs": [], "source": [ "ols2lat = OLS(y,x2,nonspat_diag=False,\n", " name_ds=ds_name,\n", " latex=True)\n", "print(ols2lat.summary)" ] }, { "cell_type": "markdown", "id": "edbf3d3e", "metadata": {}, "source": [ "## Non-Spatial Diagnostics" ] }, { "cell_type": "markdown", "id": "41ac5682", "metadata": {}, "source": [ "The default setting for OLS regression is to always include the non-spatial diagnostics, with `nonspat_diag=True`. 
Since this is the default, this argument does not have to be set. The default output with diagnostics will be as follows." ] }, { "cell_type": "code", "execution_count": null, "id": "ff0fc48a", "metadata": {}, "outputs": [], "source": [ "ols2a = OLS(y,x2,\n", " name_ds=ds_name)\n", "print(ols2a.summary)" ] }, { "cell_type": "markdown", "id": "f8b1730b", "metadata": {}, "source": [ "The previous listing is now augmented with several additional measures of fit (shown above the coefficients) as well as diagnostics for multicollinearity, non-normality and heteroskedasticity (listed below the coefficients).\n", "\n", "The measures of fit in the left-hand column are all related to the sum of squared residuals. This is listed, as well as two measures of $\\sigma^2$ (one from division by the degrees of freedom, the other - ML - from division by the number of observations) and their square roots, the S.E. of regression.\n", "\n", "On the right-hand side are the results of an F-statistic for the joint significance of the coefficients and its associated p-value, the log-likelihood $L$ (under the assumption of normality) for use in comparisons with spatial models, and two adjustments of the log-likelihood for the number of variables in the model, the $AIC$ and $SC$, with $AIC = -2L + 2k$ and $SC = -2L + k \\ln(n)$ ($SC$ is sometimes referred to as $BIC$). Whereas a better fit is reflected in a higher log-likelihood, it is the reverse for AIC and SC (lower is better).\n", "\n", "Below the listing of coefficients are the multicollinearity condition number, the Jarque-Bera test on normality of the errors and two tests for heteroskedasticity (random coefficients): the Breusch-Pagan LM test and the more robust (against non-normality) Koenker-Bassett test.\n", "\n", "There is no evidence of a problem with multicollinearity (typically associated with values larger than 30), but a strong indication of both non-normality and heteroskedasticity." ] }, { "cell_type": "markdown", "id": "e0617e69", "metadata": {}, "source": [ "### White test\n", "\n", "Because it requires additional computation, the White test against heteroskedasticity is not included by default. To include it, the argument `white_test=True` must be set.\n", "\n", "All the output is the same as before, except for the addition of the test results, which again strongly indicate the presence of heteroskedasticity." ] }, { "cell_type": "code", "execution_count": null, "id": "12992479", "metadata": {}, "outputs": [], "source": [ "ols2b = OLS(y,x2,white_test=True,\n", " name_ds=ds_name)\n", "print(ols2b.summary)" ] }, { "cell_type": "markdown", "id": "b118604d-7323-48f3-80ba-4a30031cf6f6", "metadata": {}, "source": [ "### Variance Inflation Factor (VIF)" ] }, { "cell_type": "markdown", "id": "5c63b10f-47d1-44cb-bebb-5aa7946bc9b5", "metadata": {}, "source": [ "The Variance Inflation Factor or VIF is an alternative to the multicollinearity condition number to assess the sensitivity of the regression results to the presence of multicollinearity. The condition number is computed for all variables jointly and measures the degree of linear dependence among all the columns of $X$. Instead, the VIF is computed for each variable separately. \n", "\n", "The VIF depends on the fit of a regression of the variable in question on all other x-variables, e.g., $R^2_k$ for $x_k$. $1 - R^2_k$ is called the *tolerance* (where $R^2_k$ is the unadjusted $R^2$). 
The more collinear $x_k$ is with the other variables, the larger $R^2_k$ and thus the lower the tolerance. The VIF is then the inverse of the tolerance.\n", "\n", "The VIF is not reported as part of the standard regression output, since it requires considerable additional computation, but is available by setting the argument `vif = True` (the default setting is `vif = False`). The output is augmented with a listing of the VIF and corresponding tolerance (its inverse) for each coefficient.\n", "\n", "While there is no indication of a multicollinearity problem in the current set of results, it may still be informative to check which variables are the most correlated with the others." ] }, { "cell_type": "code", "execution_count": null, "id": "fc516cc5-19af-420c-a37f-3e7b21d7658c", "metadata": {}, "outputs": [], "source": [ "ols2c = OLS(y,x2,vif=True,\n", " name_ds=ds_name)\n", "print(ols2c.summary)" ] }, { "cell_type": "markdown", "id": "1cf27dc2-dffa-491a-86e2-92405d9e2957", "metadata": {}, "source": [ "In this result, **Blk14P** has the highest VIF at 4.323. However, this is not very high. When there are more serious multicollinearity problems, this value can be much higher. For example, with $R^2_k = 0.95$, the VIF would be 20, and with $R^2_k = 0.99$, it would be 100.\n", "\n", "In the example, the corresponding tolerance is 0.231. In other words, a regression of **Blk14P** on the other two variables would have an (unadjusted) $R^2$ of 0.769. This can be readily verified in a simple regression by specifying the respective subsets of the **dfs** data frame." ] }, { "cell_type": "code", "execution_count": null, "id": "e807cb2e-9759-4206-b9f6-0256b7a4cf0f", "metadata": {}, "outputs": [], "source": [ "regv = OLS(dfs[\"Blk14P\"],dfs[[\"Hisp14P\",\"HIS_ct\"]],nonspat_diag = False)\n", "print(regv.summary)" ] }, { "cell_type": "markdown", "id": "4d18bdc0-ed6e-41de-b914-2e63a9a7828b", "metadata": {}, "source": [ "The unadjusted $R^2$ in this regression is indeed 0.769." ] }, { "cell_type": "markdown", "id": "f16811ab-6c0a-4add-9da3-6726e0b02c17", "metadata": {}, "source": [ "### F-Test on joint significance of coefficients" ] }, { "cell_type": "markdown", "id": "1dc85fe3-cc5c-4790-bd46-bb08378e8318", "metadata": {}, "source": [ "The F-test reported as part of the regression output listing is a joint test on all slope coefficients. \n", "It is also possible to test the joint significance of a subset of coefficients. \n", "In order to carry this out, the variables of interest must be the *last* variables in the x matrix. \n", "In other words, the test is on the joint significance of the **last df** coefficients in **betas**, \n", "where **df** is passed as the second argument to `spreg.f_stat` (because of the way the `import` statement was phrased, here available as just `f_stat`). The first argument is a regression object.\n", "One can apply this to one or more variables.\n", "\n", "For example, to test the significance of **HIS_ct** in the last regression, the F-test would have `df=1`:" ] }, { "cell_type": "code", "execution_count": null, "id": "6b6cd61a-782a-43d0-8685-53a5103b9102", "metadata": {}, "outputs": [], "source": [ "f_hisct = f_stat(ols2a,df=1)\n", "print(f_hisct)" ] }, { "cell_type": "markdown", "id": "76d87358-0f47-47ed-a0fd-33459c0b811f", "metadata": {}, "source": [ "The result is an F-statistic with 1 and 787 degrees of freedom, 1 for the numerator (**df**) and 787 ($n - k$) for the denominator. The value of 65.796 is clearly significant, confirming the indication given by the t-test.
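" ] }, { "cell_type": "markdown", "id": "a9f01234", "metadata": {}, "source": [ "With a single restriction, the F-statistic is simply the square of the corresponding t-statistic. A quick check, using the `t_stat` attribute of the regression object (a list of (statistic, p-value) tuples, with the constant term first):" ] }, { "cell_type": "code", "execution_count": null, "id": "b8e12345", "metadata": {}, "outputs": [], "source": [ "# t-statistic of HIS_ct (the last coefficient) vs. the F-statistic\n", "t_his = ols2a.t_stat[-1][0]\n", "print(t_his**2, f_hisct[0])" ] }, { "cell_type": "markdown", "id": "c7d23456", "metadata": {}, "source": [ "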
Indeed, since the first degree of freedom of the F-statistic is 1, its value is exactly the square of the t-statistic in the regression, i.e., $8.11149^2 = 65.796$." ] }, { "cell_type": "markdown", "id": "b0b52e61-fb4f-4695-ab6d-1733ed55ab1c", "metadata": {}, "source": [ "To replicate the result of the F-test that is listed in the regression output, two different approaches are valid. In one, **df** is set to 3 (the three slope coefficients); in the other, it is not specified. The default of `f_stat` is to carry out an F-test on all slope coefficients. The results for the two approaches are identical and match the regression output." ] }, { "cell_type": "code", "execution_count": null, "id": "c92d418d-39c2-4508-9307-d0507032e7cc", "metadata": {}, "outputs": [], "source": [ "f_all1 = f_stat(ols2a,df=3)\n", "print(f_all1)\n", "f_all2 = f_stat(ols2a)\n", "print(f_all2)" ] }, { "cell_type": "markdown", "id": "c136550b-8451-4faf-9dd4-f21db8709fe6", "metadata": {}, "source": [ "Finally, to test the joint significance of **Hisp14P** and **HIS_ct**, **df** is set to 2." ] }, { "cell_type": "code", "execution_count": null, "id": "bf373e8c-0e3c-42d4-850b-612470b3e122", "metadata": {}, "outputs": [], "source": [ "f_two = f_stat(ols2a,df=2)\n", "print(f_two)" ] }, { "cell_type": "markdown", "id": "f0d8c1aa-67e7-4f51-bf67-3b50c0ba62e3", "metadata": {}, "source": [ "### Wald test on joint significance of coefficients" ] }, { "cell_type": "markdown", "id": "b05a5256-34e2-4695-b1d1-1f80a880ac6d", "metadata": {}, "source": [ "The F-test as currently implemented is a bit awkward in that the joint test only applies to the last **df** coefficients. A more flexible approach is based on the textbook Wald test for linear restrictions on the regression coefficients (for example, as explained in Greene, 2012, pp. 117-118). A set of linear constraints on the coefficients can be expressed as:\n", "\n", "\\begin{equation*}\n", "R \\beta = q,\n", "\\end{equation*}\n", "\n", "where $R$ is a $J \\times k$ matrix of constants, for $J$ constraints\n", "and $k$ total parameters, $\\beta$ is the column vector with all the\n", "model coefficients (i.e., including both the intercept and slopes), and\n", "$q$ is a $J \\times 1$ column vector of constants, which consists of zeros for a significance test. In the case of a test on the joint significance of coefficients, the\n", "matrix $R$ is a truncated identity matrix, with only the rows selected that pertain to the coefficients of interest.\n", "This will be illustrated in more detail below.\n", "\n", "The test statistic takes the form of a general Wald statistic, or:\n", "\\begin{equation*}\n", "(R \\hat{\\beta})' [ R V R' ]^{-1} (R \\hat{\\beta}) \\sim \\chi^2(J),\n", "\\end{equation*}\n", "\n", "i.e., following a chi-squared distribution with $J$ degrees of freedom, and\n", "where $V$ is the estimated variance-covariance matrix for the coefficients, a $k \\times k$ matrix.\n", "\n", "The Wald test is available from `spreg.regimes.wald_test` (imported as just `wald_test` in this notebook). The arguments are `betas`, the vector with regression coefficients, `r`, the matrix $R$, `q`, the vector $q$, and `vm`, the estimated variance-covariance matrix for the regression. The result is a tuple with the test statistic and associated p-value." ] }, { "cell_type": "markdown", "id": "c43d2368-0dfa-4dd5-a602-6323dbb5af94", "metadata": {}, "source": [ "For example, taking the results from **ols2**, the regression coefficients would be in **ols2.betas** with associated variance-covariance matrix in **ols2.vm**. 
For a test on the joint significance of the coefficients of **Blk14P** and **HIS_ct**, `r` would be a $2 \\times 4$ matrix, with a 1 in positions [0,1] and [1,3] (keep in mind that the count starts at 0). `q` would then be a $2 \\times 1$ column vector of zeros." ] }, { "cell_type": "code", "execution_count": null, "id": "31de0d25-65bd-4e85-8dd2-00b9b936f7b6", "metadata": {}, "outputs": [], "source": [ "rr = np.identity(ols2.k)\n", "r = rr[(1,3),:]\n", "r" ] }, { "cell_type": "code", "execution_count": null, "id": "03a3d28e-2768-432d-af6f-2316ca96a6bf", "metadata": {}, "outputs": [], "source": [ "j = 2 # number of constraints\n", "q = np.zeros((j,1))\n", "q" ] }, { "cell_type": "markdown", "id": "728e157a-c30c-41d4-90dc-5e10c3dbe273", "metadata": {}, "source": [ "The Wald statistic then follows as:" ] }, { "cell_type": "code", "execution_count": null, "id": "76ee3ed1-43ad-4bef-9162-9ec4b6ab8516", "metadata": {}, "outputs": [], "source": [ "wald_test(ols2.betas,r,q,ols2.vm)" ] }, { "cell_type": "markdown", "id": "a81e06ea-3710-4acc-beae-aa597aa0321f", "metadata": {}, "source": [ "The two coefficients are clearly jointly significant (one doesn't actually need a test to see that)." ] }, { "cell_type": "markdown", "id": "3cb17e54-f829-4eac-9864-58f6090f9440", "metadata": {}, "source": [ "A final example illustrates the equivalence of the Wald test, the t-test and the F-test when testing a single coefficient. For the last coefficient, of **HIS_ct**, `q` equals 0 and `r` is a row vector with all zeros except for the last element, which is 1." ] }, { "cell_type": "code", "execution_count": null, "id": "d15ec24b-77a8-4fc4-a44f-29ad87ee569a", "metadata": {}, "outputs": [], "source": [ "r = rr[3,:].reshape(1,-1)\n", "q = 0\n", "wald_test(ols2.betas,r,q,ols2.vm)" ] }, { "cell_type": "markdown", "id": "d911a27f-b360-4665-902a-538ab8229647", "metadata": {}, "source": [ "The value of the statistic is the square of the t-statistic and the same as the F-test given above." ] }, { "cell_type": "markdown", "id": "d4b62234", "metadata": {}, "source": [ "## Robust Standard Errors" ] }, { "cell_type": "markdown", "id": "c29bcbbd", "metadata": {}, "source": [ "The classical regression coefficient standard errors are not very robust to deviations from the standard regularity conditions, such as i.i.d. errors. In cross-sectional data, it is rather unrealistic to assume the absence of heteroskedasticity. In fact, in the example used here, there is strong evidence of the presence of heteroskedasticity, as indicated by all three diagnostics.\n", "\n", "The so-called *White* standard errors (also known as Huber sandwich standard errors) correct for the presence of unspecified heteroskedasticity. They are typically (but not always) larger than the classical standard errors, leading to somewhat more conservative inference.\n", "\n", "A more recent development consists of the so-called *HAC* standard errors introduced by Kelejian and Prucha (2007) (\"HAC estimation in a spatial framework,\" *Journal of Econometrics* 140, 131-154), where HAC stands for heteroskedasticity and autocorrelation consistent standard errors. The computations use a kernel logic to obtain standard errors that correct for *both* heteroskedasticity and the possible presence of spatial autocorrelation of unspecified form. Typically (but again not always) these are the largest standard errors.\n", "\n", "The robust standard errors are invoked with the option `robust` in `spreg.OLS`. The default is `None`, for classical standard errors."
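 ] }, { "cell_type": "markdown", "id": "ab76b8c9", "metadata": {}, "source": [ "Before turning to the options themselves, it may help to see the sandwich logic spelled out. The White variance estimate is $(X'X)^{-1} X' \\mathrm{diag}(\\hat{u}_i^2) X (X'X)^{-1}$, with the squared OLS residuals on the diagonal of the middle matrix. The *numpy* sketch below is purely illustrative (it assumes the **ols2** object and the **x2** data frame from above and ignores any small-sample correction); the `robust='white'` option used next is the recommended approach:" ] }, { "cell_type": "code", "execution_count": null, "id": "bc87c9d0", "metadata": {}, "outputs": [], "source": [ "# illustrative White sandwich computation from first principles\n", "Xc = np.hstack((np.ones((ols2.n, 1)), x2.to_numpy()))   # X with a constant column\n", "u2 = ols2.u ** 2                                        # squared OLS residuals, n x 1\n", "bread = np.linalg.inv(Xc.T @ Xc)                        # (X'X)^{-1}\n", "meat = Xc.T @ (u2 * Xc)                                 # X' diag(u^2) X\n", "vm_white = bread @ meat @ bread                         # sandwich estimator\n", "print(np.sqrt(np.diag(vm_white)))                       # White standard errors"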
] }, { "cell_type": "markdown", "id": "966bd8e9", "metadata": {}, "source": [ "### White Standard Errors" ] }, { "cell_type": "markdown", "id": "e6a028cc", "metadata": {}, "source": [ "White standard errors are obtained by setting `robust='white'`. Except for the standard errors, the regression output remains exactly the same." ] }, { "cell_type": "code", "execution_count": null, "id": "eecf3af1", "metadata": {}, "outputs": [], "source": [ "ols3 = OLS(y,x2,robust='white',\n", " name_ds=ds_name)\n", "print(ols3.summary)" ] }, { "cell_type": "markdown", "id": "6dd1ca67", "metadata": {}, "source": [ "In the example, the standard errors increase for all coefficients except for Blk14P (3.39 vs 3.49). However, the impact on inference is negligible, with all coefficients still significant at p < 0.003." ] }, { "cell_type": "markdown", "id": "99555c1d", "metadata": {}, "source": [ "### HAC Standard Errors" ] }, { "cell_type": "markdown", "id": "382e64ee", "metadata": {}, "source": [ "HAC standard errors are obtained with `robust='hac'`. In addition, a kernel weights object must be passed (`gwk`), and optionally, its name (`name_gwk`). Again, except for the standard errors, the output is the same as before. One slight difference is that no diagnostics for spatial dependence are implemented when the HAC option is used. In the example, the triangular kernel weights contained in **wk** are used. Several other kernel functions (and different bandwidths) can be used to create kernel weights through *libpysal.weights*. In practice, some experimentation is advised." ] }, { "cell_type": "code", "execution_count": null, "id": "45384072", "metadata": {}, "outputs": [], "source": [ "ols4 = OLS(y,x2,robust='hac',\n", " gwk=wk,name_gwk=gwk_name,\n", " name_ds=ds_name)\n", "print(ols4.summary)" ] }, { "cell_type": "markdown", "id": "897615b1", "metadata": {}, "source": [ "In the example, the HAC standard errors are slightly larger than in the White case, but again not sufficient to affect inference." ] }, { "cell_type": "markdown", "id": "65aadbff", "metadata": {}, "source": [ "## Practice\n", "\n", "At this point, it would be useful to set up your own baseline regression model, with a continuous dependent variable and at least two or three explanatory variables. You can pick any set of variables from the Chicago data set, from one of the PySAL sample data sets or your own data, but of course, make sure that your research question makes sense. Create some kernel weights (use `libpysal.weights`) to check out the HAC estimation.\n", "\n", "Assess the extent to which your initial specification may suffer from some forms of misspecification, as well as the sensitivity of your results to different measures of standard errors. In addition, test some linear restrictions on the coefficients.\n", "\n", "You may also want to experiment with the plotting and mapping functionality contained in *geopandas* to create plots and maps of the predicted values and/or residuals.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 5 }