This page was generated from notebooks/1_sample_data.ipynb.
PySAL Sample Data Sets
Luc Anselin
(revised 09/06/2024)
Preliminaries
In this notebook, the installation and input of PySAL sample data sets are reviewed.
A video recording is available from the GeoDa Center YouTube channel playlist Applied Spatial Regression - Notebooks, at https://www.youtube.com/watch?v=qwnLkUFiSzY&list=PLzREt6r1NenmhNy-FCUwiXL17Vyty5VL6.
Prerequisites
Very little is assumed in terms of prerequisites. Sample data sets are examined and installed with libpysal, and geopandas is used to read the data.
Modules Needed
The three modules needed to work with sample data are libpysal, pandas and geopandas.
Some additional imports are included to avoid excessive warning messages. With later versions of PySAL, these may not be needed.
[10]:
import warnings
warnings.filterwarnings("ignore")
import os
os.environ['USE_PYGEOS'] = '0'
import pandas as pd
import geopandas as gpd
import libpysal
In order to have some more flexibility when listing the contents of data frames, the display.max_rows option is set to 100 (this step can easily be skipped, but then the listing of example data sets below will be incomplete).
[ ]:
pd.options.display.max_rows = 100
pd.options.display.max_rows
Functionality Used
from geopandas:
read_file
from libpysal:
examples.available
examples.explain
examples.load_example
examples.get_path
Input Files
All notebooks used for this course are organized such that the relevant file names and variable names are listed at the top, so that they can be easily adjusted for use with your own data sets and variables. In this notebook, the use of PySAL sample data sets is illustrated. For other data sets, the general approach is the same, except that either the files must be present in the current working directory, or the full pathname must be specified. In later notebooks, only sample data sets will be used.
Here, the Chi-SDOH sample shape file is illustrated. The specific file names are:
Chi-SDOH.shp,shx,dbf,prj: a shape file (four files!) with socio-economic determinants of health for 2014 in 791 Chicago tracts
In the other spreg notebooks, it is assumed that you will have installed the relevant example data sets using functionality from the libpysal.examples module. This is illustrated in detail here, but will not be repeated in the other notebooks. If the files are not loaded using the libpysal.examples functionality, they can be downloaded as individual files from https://github.com/lanselin/spreg_sample_data/ or https://geodacenter.github.io/data-and-lab/. You must then assign the full path of the downloaded file to infileshp, which is used as the argument in the corresponding geopandas.read_file command.
The input file is specified generically as infileshp (for the shape file).
[3]:
infileshp = "Chi-SDOH.shp" # input shape file with data
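Because a shape file is really a set of companion files, a quick check that all four are present can prevent a confusing read error when working from a local directory rather than from libpysal.examples. The following is a minimal sketch using only the Python standard library. The helper name missing_shapefile_parts and the default tuple of extensions are illustrative assumptions, not part of libpysal or geopandas.

```python
from pathlib import Path

# The four companion files listed above for the Chi-SDOH shape file.
SHAPEFILE_EXTENSIONS = (".shp", ".shx", ".dbf", ".prj")


def missing_shapefile_parts(shp_path, extensions=SHAPEFILE_EXTENSIONS):
    """Return the names of companion files that are not found on disk.

    Hypothetical helper, for illustration only.
    """
    base = Path(shp_path)
    expected = [base.with_suffix(ext) for ext in extensions]
    return [p.name for p in expected if not p.is_file()]


# Example: check the files next to the shape file in the current working directory.
missing = missing_shapefile_parts("Chi-SDOH.shp")
if missing:
    print("Missing companion files:", ", ".join(missing))
else:
    print("All companion files are present.")
```

If the list is empty, the path can be passed to geopandas.read_file as usual.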
Accessing a PySAL Remote Sample Data Set
Installing a remote sample data set
All the needed files associated with a remote data set must be installed locally. The list of available remote data sets is shown by means of libpysal.examples.available(). When a data set is installed, the matching item in the Installed column will be given as True.
If the sample data set has not yet been installed, Installed is initially set to False. For example, if the chicagoSDOH data set is not installed, item 79 in the list (chicagoSDOH) is given as False. Once the example data set is loaded, this will be changed to True.
The example data set only needs to be loaded once. After that, it will be available for all future use in PySAL (not just in the current notebook), using the standard get_path functionality of libpysal.examples.
[ ]:
libpysal.examples.available()
The contents of any PySAL example data set can be shown by means of libpysal.examples.explain. Note that this does not load the data set; it accesses the contents remotely (you will need an internet connection). As listed, the data set is for 791 census tracts in Chicago and it contains 65 variables.
[ ]:
libpysal.examples.explain("chicagoSDOH")
The example data set is installed locally by means of libpysal.examples.load_example, passing the name of the remote example. Note the specific path to which the data set is downloaded; you will need it if you ever want to remove the data set.
[ ]:
libpysal.examples.load_example("chicagoSDOH")
At this point, when checking available, the data set is listed as True under Installed. As mentioned, the installation only needs to be carried out once.
[ ]:
libpysal.examples.available()
Reading Input Files from the Example Data Set
The actual path to the files contained in the local copy of the remote data set is found by means of libpysal.examples.get_path. This is then passed to the geopandas read_file function in the usual way. Here, this is a bit cumbersome, but the command can be simplified by specific statements in the module import, such as from libpysal.examples import get_path. The latter approach will be used in later notebooks, but here the full command is used.
For example, the path to the input shape file is (this may differ somewhat depending on how and where PySAL is installed):
[ ]:
libpysal.examples.get_path(infileshp)
As mentioned earlier, if the example data are not installed locally by means of libpysal.examples, the get_path command must be replaced by an explicit reference to the correct file path name. This is easiest if the files are in the current working directory, in which case just specifying the file name in infileshp is sufficient.
The shape file is read by means of the geopandas read_file command, to which the full file pathname obtained from libpysal.examples.get_path(infileshp) is passed. To check if all is right, the shape of the data set (number of observations, number of variables) is printed (using the standard print() command), as well as the list of variable names (columns in pandas speak). Details on dealing with pandas and geopandas data frames are covered in a later notebook.
[ ]:
inpath = libpysal.examples.get_path(infileshp)
dfs = gpd.read_file(inpath)
print(dfs.shape)
print(dfs.columns)
Removing an Installed Remote Sample Data Set
If, for some reason, the installed remote chicagoSDOH data set is no longer needed, it can be removed by means of standard Linux commands (or the equivalent for other operating systems). For example, on a Mac or Linux-based system, one first moves to the directory to which the files were copied. This is the same path that was shown when load_example was executed. In the example for macOS, this was shown in Downloading chicagoSDOH to /Users/luc/Library/Application Support/pysal/chicagoSDOH.
So, in a terminal window, one first moves to /Users/your_user_name/Library/'Application Support'/pysal (don’t forget the quotes, since the directory name contains a space) on a Mac system (and the equivalent for other operating systems). There, the chicagoSDOH directory will be present. It is removed by means of:
rm -r chicagoSDOH
Of course, once removed, it will have to be reinstalled if needed in the future.
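The same removal can also be carried out from Python, which avoids the quoting issue with the space in Application Support and works the same way on other operating systems. This is a minimal sketch using only the standard library. The function name remove_example_dir is hypothetical, and the directory passed to it is assumed to be the parent folder reported by load_example when the data set was downloaded.

```python
import shutil
from pathlib import Path


def remove_example_dir(pysal_data_dir, name):
    """Delete the folder `name` inside `pysal_data_dir`, if it exists.

    Hypothetical helper, for illustration only.
    Returns True if a folder was removed, False otherwise.
    """
    target = Path(pysal_data_dir).expanduser() / name
    if not target.is_dir():
        return False
    shutil.rmtree(target)  # recursive delete, comparable to `rm -r`
    return True


# Example (the path is an assumption; use the one shown by load_example):
# remove_example_dir("~/Library/Application Support/pysal", "chicagoSDOH")
```

The call is left commented out so that running the cell does not delete anything by accident.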
Practice
If you want to use other PySAL data sets to practice the spatial regression functionality in spreg, make sure to install them using the instructions given in this notebook. For example, load the Police data set (item 52 in the list), which will be used as an example in later notebooks.