This page was generated from notebooks/1_sample_data.ipynb.

PySAL Sample Data Sets

Luc Anselin

(revised 09/06/2024)

Preliminaries

In this notebook, the installation and input of PySAL sample data sets are reviewed.

A video recording is available from the GeoDa Center YouTube channel playlist Applied Spatial Regression - Notebooks, at https://www.youtube.com/watch?v=qwnLkUFiSzY&list=PLzREt6r1NenmhNy-FCUwiXL17Vyty5VL6.

Prerequisites

Very little is assumed in terms of prerequisites. Sample data files are examined and loaded with libpysal, and geopandas is used to read the data.

Modules Needed

The three modules needed to work with sample data are libpysal, pandas and geopandas.

Some additional imports are included to avoid excessive warning messages. With later versions of PySAL, these may not be needed.

[10]:
import warnings
warnings.filterwarnings("ignore")
import os
os.environ['USE_PYGEOS'] = '0'

import pandas as pd
import geopandas as gpd
import libpysal

In order to have some more flexibility when listing the contents of data frames, the display.max_rows option is set to 100 (this step can easily be skipped, but then the listing of example data sets below will be incomplete).

[ ]:
pd.options.display.max_rows = 100
pd.options.display.max_rows

Functionality Used

  • from pandas/geopandas:

    • read_file

  • from libpysal:

    • examples.available

    • examples.explain

    • examples.load_example

    • examples.get_path

Input Files

All notebooks used for this course are organized such that the relevant file names and variable names are listed at the top, so that they can be easily adjusted for use with your own data sets and variables. In this notebook, the use of PySAL sample data sets is illustrated. For other data sets, the general approach is the same, except that either the files must be present in the current working directory, or the full pathname must be specified. In later notebooks, only sample data sets will be used.

Here, the Chi-SDOH sample shape file is illustrated. The specific file names are:

  • Chi-SDOH.shp,shx,dbf,prj: a shape file (four files!) with socio-economic determinants of health for 2014 in 791 Chicago tracts
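Because a shape file is really a set of companion files that share a base name, a quick check that all four are present can prevent a confusing read error later on. The following is a minimal sketch that uses only the Python standard library. The helper name missing_components is illustrative and is not part of PySAL or geopandas, and the sketch assumes that the companion files sit in the same directory as the .shp file.

```python
from pathlib import Path

# Extensions listed above for the Chi-SDOH shape file.
SHAPEFILE_EXTENSIONS = (".shp", ".shx", ".dbf", ".prj")


def missing_components(shp_path, extensions=SHAPEFILE_EXTENSIONS):
    """Return the names of companion files that are not on disk.

    Hypothetical helper for illustration. Assumes all companion files
    share the base name and the directory of the .shp file.
    """
    base = Path(shp_path)
    expected = [base.with_suffix(ext) for ext in extensions]
    return [p.name for p in expected if not p.exists()]


if __name__ == "__main__":
    # Assumes the files, if present, are in the current working directory.
    print(missing_components("Chi-SDOH.shp"))
```

An empty list means that all four files were found. Any names that are returned identify the pieces that still need to be downloaded or copied.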

In the other spreg notebooks, it is assumed that you will have installed the relevant example data sets using functionality from the libpysal.examples module. This is illustrated in detail here, but will not be repeated in the other notebooks. If the files are not loaded using the libpysal.examples functionality, they can be downloaded as individual files from https://github.com/lanselin/spreg_sample_data/ or https://geodacenter.github.io/data-and-lab/. In that case, the full path to the file named in infileshp must be passed as the argument to the corresponding geopandas.read_file command.

The input file is specified generically as infileshp (for the shape file).

[3]:
infileshp = "Chi-SDOH.shp"            # input shape file with data

Accessing a PySAL Remote Sample Data Set

Installing a remote sample data set

All the needed files associated with a remote data set must be installed locally. The list of available remote data sets is shown by means of libpysal.examples.available(). When the file is also installed, the matching item in the Installed column will be given as True.

If the sample data set has not yet been installed, Installed is initially set to False. For example, if the chicagoSDOH data set is not installed, item 79 in the list (chicagoSDOH) is given as False. Once the example data set is loaded, this changes to True.

The example data set only needs to be loaded once. After that, it will be available for all future use in PySAL (not just in the current notebook), using the standard get_path functionality of libpysal.examples.

[ ]:
libpysal.examples.available()

The contents of any PySAL example data set can be shown by means of libpysal.examples.explain. Note that this does not load the data set, but it accesses the contents remotely (you will need an internet connection). As listed, the data set is for 791 census tracts in Chicago and it contains 65 variables.

[ ]:
libpysal.examples.explain("chicagoSDOH")

The example data set is installed locally by means of libpysal.examples.load_example, passing the name of the remote example. Note the specific path to which the data sets are downloaded: you will need it if you ever want to remove the data set.

[ ]:
libpysal.examples.load_example("chicagoSDOH")

At this point, when checking available, the data set is listed as True under Installed. As mentioned, the installation only needs to be carried out once.

[ ]:
libpysal.examples.available()
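The Installed status can also be checked programmatically rather than by scanning the listing by eye. The sketch below works on plain Python records so that it runs without PySAL. The function name is_installed is illustrative, the "Name" key is an assumption about how the listing labels its name column, and the toy rows are made up for the demonstration.

```python
def is_installed(records, name):
    """Report whether the example called `name` is marked as installed.

    `records` is a list of dicts, one per row of the listing. The "Name"
    key is an assumption about the column label; "Installed" is the
    column described in the text.
    """
    for row in records:
        if row.get("Name") == name:
            return bool(row.get("Installed"))
    # A name that does not appear in the listing is treated as not installed.
    return False


# Toy rows standing in for the real listing (values are made up).
sample = [
    {"Name": "chicagoSDOH", "Installed": True},
    {"Name": "Police", "Installed": False},
]
print(is_installed(sample, "chicagoSDOH"))  # True
print(is_installed(sample, "Police"))       # False
```

If the listing is returned as a pandas data frame, which is assumed here rather than confirmed by this notebook, the records could be obtained with libpysal.examples.available().to_dict("records").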

Reading Input Files from the Example Data Set

The actual path to the files contained in the local copy of the remote data set is found by means of libpysal.examples.get_path. This is then passed to the geopandas read_file function in the usual way. Here, this is a bit cumbersome, but the command can be simplified by specific statements in the module import, such as from libpysal.examples import get_path. The latter approach will be used in later notebooks, but here the full command is used.

For example, the path to the input shape file is as follows (this may differ somewhat depending on how and where PySAL is installed):

[ ]:
libpysal.examples.get_path(infileshp)

As mentioned earlier, if the example data are not installed locally by means of libpysal.examples, the get_path command must be replaced by an explicit reference to the correct file path name. This is easiest if the files are in the current working directory, in which case just specifying the file names in infileshp etc. is sufficient.
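The two ways of locating the input file can be combined in a small fallback: try the example lookup first, and use a local directory otherwise. The following is a minimal sketch, not a PySAL feature. The helper name resolve_input and its lookup and local_dir parameters are hypothetical, and since it is treated as an assumption whether a failed lookup returns nothing or raises an error, both cases are handled.

```python
from pathlib import Path


def resolve_input(filename, lookup=None, local_dir="."):
    """Return a usable path for `filename`.

    `lookup` is any callable that maps a file name to a path, for
    example libpysal.examples.get_path. Hypothetical helper.
    """
    if lookup is not None:
        try:
            found = lookup(filename)
        except Exception:
            # Assumption: a failed lookup may raise; treat it as "not found".
            found = None
        if found:
            return Path(found)
    # Fall back to a file in the given directory (current directory by default).
    candidate = Path(local_dir) / filename
    if candidate.exists():
        return candidate
    raise FileNotFoundError(
        f"{filename} not found via lookup or in {local_dir}"
    )
```

In this notebook, the call would take the form resolve_input(infileshp, lookup=libpysal.examples.get_path), and the result can be passed to gpd.read_file in the usual way.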

The shape file is read by means of the geopandas read_file command, which is passed the full file pathname obtained from libpysal.examples.get_path(infileshp). To check that all is right, the shape of the data set (number of observations, number of variables) is printed (using the standard print( ) command), as well as the list of variable names (columns in pandas speak). Details on dealing with pandas and geopandas data frames are covered in a later notebook.

[ ]:
inpath = libpysal.examples.get_path(infileshp)
dfs = gpd.read_file(inpath)
print(dfs.shape)
print(dfs.columns)

Removing an Installed Remote Sample Data Set

If for some reason the installed remote chicagoSDOH data set is no longer needed, it can be removed by means of standard Linux commands (or their equivalent for other operating systems). For example, on a Mac or Linux-based system, one first moves to the directory into which the files were copied. This is the same path that was shown when load_example was executed. In the example for a Mac OS operating system, this was reported as Downloading chicagoSDOH to /Users/luc/Library/Application Support/pysal/chicagoSDOH.

So, in a terminal window, one first moves to /Users/your_user_name/Library/'Application Support'/pysal (don't forget the quotes, which are needed because of the space in the directory name) on a Mac system (and the equivalent location on other operating systems). There, the chicagoSDOH directory will be present. It is removed by means of:

rm -r chicagoSDOH

Of course, once removed, it will have to be reinstalled if needed in the future.
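The same removal can be carried out from Python, which avoids differences between operating systems as well as the need to quote the space in the directory name. This is a minimal sketch using the standard library. The helper name remove_example is hypothetical, and pysal_dir must be set to the parent of the path reported when the example was downloaded.

```python
import shutil
from pathlib import Path


def remove_example(pysal_dir, name):
    """Delete the sub-directory `name` inside `pysal_dir`, if it exists.

    Hypothetical helper. Returns True when a directory was removed and
    False when there was nothing to remove.
    """
    target = Path(pysal_dir) / name
    if not target.is_dir():
        return False
    shutil.rmtree(target)  # removes the directory and everything in it
    return True
```

For the Mac path shown above, the call would be remove_example(Path.home() / "Library" / "Application Support" / "pysal", "chicagoSDOH"). As with rm -r, the deletion cannot be undone.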

Practice

If you want to use other PySAL data sets to practice the spatial regression functionality in spreg, make sure to install them using the instructions given in this notebook. For example, load the Police data set (item 52 in the list), which will be used as an example in later notebooks.