This page was generated from notebooks/2_data_input_output.ipynb.

Data Input/Output

Luc Anselin

09/06/2024

Preliminaries

In this notebook, some elementary functionality is covered to carry out data input and output from and to different types of files. The key concept is a so-called DataFrame, a tabular representation of the data with observations as rows and variables as columns.

This is implemented by means of pandas for generic text files (as well as many other formats) and geopandas for spatial data files (shape files or geojson files). The functionality will be illustrated with the Police sample data set that contains police expenditure data for Mississippi counties. It is assumed that this data has been installed using libpysal.examples.load_example("Police").

A video recording is available from the GeoDa Center YouTube channel playlist Applied Spatial Regression - Notebooks, at https://www.youtube.com/watch?v=7yWOgPEBQmE&list=PLzREt6r1NenmhNy-FCUwiXL17Vyty5VL6&index=2.

Modules Needed

The workhorse for spatial analysis in Python is the PySAL library. However, before addressing specific spatial functionality, the use of pandas and geopandas will be illustrated to load data into so-called data frames. In addition, libpysal is needed to access the sample data sets. All of these rely on numpy as a dependency.

The full set of imports is shown below. Also, in this notebook, the get_path functionality of libpysal.examples is imported separately, without the rest of libpysal.

[1]:
import numpy as np
import pandas as pd
import os
os.environ['USE_PYGEOS'] = '0'   # use shapely rather than the PyGEOS backend in geopandas
import geopandas as gpd
from libpysal.examples import get_path

Functions Used

  • from numpy:

    • array

    • shape

    • tolist

    • reshape

  • from pandas:

    • read_csv

    • head

    • info

    • list

    • columns

    • describe

    • corr

    • DataFrame

    • concat

    • to_csv

    • drop

  • from geopandas:

    • read_file

    • to_file

  • from libpysal:

    • get_path

Files

Data input and output will be illustrated with the Police sample data set. This data set contains the same information in several different formats, such as csv, dbf, shp and geojson, which will be illustrated in turn. The following files will be used:

  • police.shp, .shx, .dbf, .prj: shape file (four files) for 82 counties

  • police.csv: the same data in csv text format

  • police.geojson: the spatial layer in geojson format

All the files are defined here, and referred to generically afterwards, so that it will be easy to re-run the commands for a different application. The only changes needed would be the file names and/or variable names.

[2]:
infilecsv = "police.csv"     # input csv file
outfilecsv = "test1.csv"     # output csv file
infiledbf = "police.dbf"     # input dbf file
outfiledbf = "test2.csv"     # output file for the dbf contents (csv format)
infileshp = "police.shp"     # input shape file
outfileshp =  "test3.shp"    # output shape file
infilegeo = "police.geojson" # input geojson file
outfilegeo = "test4.geojson" # output geojson file

Text Files

Input

The input file for csv formatted data is infilecsv. In the example, this is the csv file police.csv. The path to the installed sample data set is found with get_path (note the form of the import statement, which means that the full prefix libpysal.examples is not needed).

The pandas command read_csv creates a data frame, essentially a data table. One of its attributes is shape, the dimension of the table as number of rows (observations) and number of columns (variables). df.head( ) lists the first few rows of the actual table.

[ ]:
inpath = get_path(infilecsv)
df = pd.read_csv(inpath)
print(df.shape)
df.head()

Contents

A technical way to see the contents of a pandas data frame is to use the info command. This gives the class, the range of the index (used internally to refer to rows), and, for each variable (column), the number of non-null entries and the data type.

[ ]:
df.info()

An arguably more intuitive sense of the contents of the data frame is to just list the names of all the variables. This can be accomplished several different ways, illustrating the flexibility of pandas. However, it is important to know what type of object the result of each operation yields. Depending on the approach, this could be a list, a pandas index object or a numpy array. Assuming the wrong type for the result can cause trouble.

The following four approaches will each extract the column headers, but yield the result as a different type of object. This will determine how it can be further manipulated:

  • list(df): creates a simple list with the variable names

  • df.columns: yields the columns as a pandas index object

  • df.columns.values: yields the columns as a numpy array

  • df.columns.values.tolist( ): yields the columns as a list, same as list(df)

[ ]:
varlist1 = list(df)
print(varlist1)
type(varlist1)
[ ]:
varlist2 = df.columns
print(varlist2)
type(varlist2)
[ ]:
varlist3 = df.columns.values
print(varlist3)
type(varlist3)
[ ]:
varlist4 = df.columns.values.tolist()
print(varlist4)
type(varlist4)

Descriptive Statistics

A quick summary of the data set is provided by the describe command.

[ ]:
df.describe()

Extracting Variables

Variables (columns) can easily be extracted from a data frame by listing their names in a list and using that list to subset the data frame (there are other ways as well, but they will not be considered here). It is important to keep in mind that the result is still a pandas object, which may not be what is expected. In fact, in many applications in the context of spreg, the result should be a numpy array. This requires an extra step to cast the data frame to an array object.

Also, in many contexts, an additional variable may need to be added to the data frame. For example, this will be needed for regression residuals and predicted values in a later notebook. To illustrate some of the steps involved, the variable COLLEGE will be turned into its complement (i.e., percent population without a college degree) and subsequently added to the data frame. To illustrate some descriptive statistics, POLICE will be extracted as well.

First, the variable names are put in a list that is used to subset the data frame, and the type of the result is checked. Make sure to use double brackets: the argument to the subset operator [ ] is itself a list, so [[list of variable names in quotes, separated by commas]]. The result is a pandas data frame; a single variable selected with single brackets would yield a series.

Note: if you want to do this for your own data set, possibly using different variables and different expressions, you will need to adjust the code below accordingly. Typically, such hard-coded variable names are avoided in these notebooks, but here there is no way to make things completely generic.

[ ]:
df1 = df[['POLICE','COLLEGE']]
type(df1)

A more elegant approach, and one that makes it much easier to reuse the code for different data sets and variables, is to enter the variable names in a list first, and then pass that list to subset the data frame.

[ ]:
varnames = ['POLICE','COLLEGE']
df2 = df[varnames]
type(df2)

At this point, it is much more meaningful to get the descriptive statistics using describe.

[ ]:
df2.describe()

A correlation coefficient is obtained by means of the corr method.

[ ]:
df2.corr()

Extracting Variables to a Numpy Array

As mentioned, when using variables in the context of spreg routines, they will often need to be numpy arrays, not data frames. This is accomplished by means of the numpy.array function (np.array in the notation used here). The shape attribute provides a check to make sure that the resulting matrices have the correct format. In the example:

[ ]:
x1 = np.array(df[varnames])
print(x1.shape)
type(x1)

Computations

New variables (columns) can be added to an existing data frame by means of straightforward element-by-element computations. However, doing this within the data frame structure is a bit cumbersome, since the data frame name needs to be included for each variable. On the other hand, the result is immediately attached to the data frame.
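
For instance, a minimal sketch of this within-data-frame approach (here using a copy, df_tmp, so that df itself remains unchanged for the steps below):

[ ]:
df_tmp = df.copy()                                # work on a copy of the data frame
df_tmp['NONCOLLEGE'] = 100.0 - df_tmp['COLLEGE']  # the data frame name is repeated for each variable
df_tmp.head()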

Alternatively, the computations can be carried out using the numpy array and the result subsequently attached to the data frame. However, the result of such a computation is a one-dimensional numpy array, not a row or column vector. To obtain the latter, the reshape command needs to be used.

For example, to compute the complement of the percentage with a college degree (in column 1 of array x1), the second column of the array is subtracted from 100. The element-by-element computation gives the desired result, but not the correct shape.

[ ]:
noncollege = 100.0 - x1[:,1]
noncollege
[ ]:
noncollege.shape

The correct dimension is obtained by means of reshape(-1,1).

[ ]:
noncollege = noncollege.reshape(-1,1)
print(noncollege.shape)
noncollege[0:5,:]

Note the extra brackets in the (82,1) column vector compared to the (82, ) numpy array above.

Concatenating Data Frames

In order to add the result of the matrix calculation to the data frame, two steps are involved. First, the numpy array is turned into a data frame using pandas.DataFrame, making sure to give meaningful names to the columns by means of the columns argument. Then the pandas.concat function is applied to join the two data frames together. One can of course combine the two operations into one line, but here they are kept separate for clarity. NONCOLLEGE is added as the last variable in the data frame.

Note that axis=1 is set as an argument to the concat function to make sure a column is added (axis=0 is to add a row).

[ ]:
dd = pd.DataFrame(noncollege,columns=['NONCOLLEGE'])
df = pd.concat([df,dd],axis=1)
print(df.columns)

Output

If desired, the new data frame can be written to a csv file using the to_csv command. The only required argument is the filename. For example, with the generic file name outfilecsv as defined at the top of the notebook, the file will be written to the current working directory. Its contents can be examined with any text editor or by loading it into a spreadsheet program.

To avoid the index numbers as a first unnamed column (i.e., the default row names), an extra argument is index = False.

[13]:
df.to_csv(outfilecsv,index=False)

DBase Files (dbf)

A common (but old) format for tabular databases is the dBase format, with file extension dbf. Even though it is old (and, arguably, out of date), this format is still quite useful because it is used to store the data (attributes) in one of the common spatial data formats, the shape file popularized by ESRI (see below).

As it happens, pandas is currently not able to read data from a dbf file directly into a data frame. Specialized packages exist that implement this functionality (like simpledbf). However, geopandas, considered in more detail below, also reads dbf files by means of its read_file command. No special arguments are needed, since the file format is derived from the file extension.

For example, to read the data from police.dbf (the same data as in police.csv), the path to the sample data file infiledbf is found with get_path and passed to the geopandas.read_file command. The result is a GeoDataFrame, not a regular DataFrame. This is an artifact of the dbf file being in the same directory as the shape file. The same command applied to a dbf file in isolation will yield a regular DataFrame.

[ ]:
inpath = get_path(infiledbf)
dfdb = gpd.read_file(inpath)
print(dfdb.shape)
print(type(dfdb))
print(dfdb.columns)

A close look at the dimensions and the columns reveals an additional column (22 compared to 21) with column name geometry. This can be removed by means of the drop(columns = "geometry") command.

[ ]:
dfdb = dfdb.drop(columns = 'geometry')
print(dfdb.shape)
print(type(dfdb))
print(dfdb.columns)

Now, the dimension is the same as for the csv file and the geometry column has disappeared. Also, the type of the result is a regular DataFrame.

As mentioned, if the dbf file is in a directory without the presence of a spatial layer, the geometry column will not be present. In that case, the result is a regular DataFrame, NOT a GeoDataFrame.
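
A minimal sketch of a defensive approach that works in either case (the name dfany is arbitrary) is to drop the geometry column only when it is present:

[ ]:
dfany = gpd.read_file(get_path(infiledbf))   # may yield a GeoDataFrame or a DataFrame
if 'geometry' in dfany.columns:              # drop the geometry column only if geopandas added one
    dfany = pd.DataFrame(dfany.drop(columns='geometry'))
print(type(dfany))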

It is important to keep this in mind, since pandas currently has no support for writing dbf files, whereas geopandas only writes a dbf file as part of a spatial layer that contains a geometry column. However, a pandas data frame can be written to a csv file as seen before, using to_csv. The input dbf file can thus be converted to a csv file, but any changes cannot be saved to another dbf file.

In general, working with dbf files in isolation is to be discouraged.

[16]:
dfdb.to_csv(outfiledbf,index=False)

Spatial Data Files

Spatial Data

Spatial data are characterized by the combination of locational information (the precise definition of points, lines or areas) and so-called attribute information (variables).

There are many formats to store spatial information, in files as well as in relational databases. To keep things simple, first the so-called shape file format is considered, a standard supported by ESRI, one of the major commercial GIS vendors. In addition, geojson will be covered as well, since it is an increasingly common open source format.

Reading a shape file

The terminology is a bit confusing, since there is no such thing as a single shape file; instead, there is a collection of three (or four) files. One file has the extension .shp, one .shx, one .dbf, and one .prj (with the projection information). The first three are required, the fourth is optional, but highly recommended. The files should all be in the same directory and have the same main file name.

In Python, the easiest way to read shape files is to use geopandas. The command is read_file, followed by the file pathname in parentheses. The program is smart enough to figure out the file format from the file extension .shp. As we saw before for the dbf format, the result is a geopandas data frame, a so-called GeoDataFrame, say dfs, which is a pandas DataFrame with an additional column for the geometry.

All the standard pandas commands also apply to a geopandas data frame.

The example uses the police.shp sample file as the input file, as specified in infileshp at the top of the notebook.

[ ]:
inpath = get_path(infileshp)
dfs = gpd.read_file(inpath)
print(dfs.shape)

Note how the data frame has one more column than the one created from the csv file. This is the same as in the dbf example above. The last column is geometry.

[ ]:
print(dfs.columns)

Creating New Variables

Just as for a standard pandas data frame, variables can be transformed, new variables created and data frames merged. The commands are the same as before and will not be repeated here.
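
For instance, as a minimal sketch, the NONCOLLEGE computation from before carries over directly to the GeoDataFrame:

[ ]:
dfs['NONCOLLEGE'] = 100.0 - dfs['COLLEGE']   # element-by-element computation, attached to dfs
print(dfs.columns)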

Reading a GeoJSON File

Reading any of the supported spatial formats is implemented by the same read_file command. As mentioned, geopandas figures out the right format from the file extension. The result is identical to the one for the shape file.

[ ]:
inpath = get_path(infilegeo)
dfg = gpd.read_file(inpath)
print(dfg.shape)
print(type(dfg))
print(dfg.columns)

Writing a GeoDataFrame

The output is accomplished by the to_file command. This supports many different output formats, but the default is the ESRI shape file, so we do not have to specify any arguments other than the filename. Here, we use the output file name specified in outfileshp.

[20]:
dfs.to_file(outfileshp)

Writing a geojson file works in exactly the same way, with the format again derived from the file extension, for example using the output file name specified in outfilegeo.

[21]:
dfg.to_file(outfilegeo)

Practice

Use your own data set or one of the PySAL sample data sets to load a spatial data frame, create some new variables, optionally get descriptive statistics and write out an updated data set. This type of operation will be used frequently in the course of the regression analysis, for example, to add predicted values and/or residuals to a spatial layer.
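
As a point of departure, a minimal sketch along these lines, assuming the Police sample data and the variables used above (the output file name practice.shp is arbitrary; adjust file and variable names for your own application):

[ ]:
dfp = gpd.read_file(get_path("police.shp"))       # load the spatial data frame
dfp['NONCOLLEGE'] = 100.0 - dfp['COLLEGE']        # create a new variable
print(dfp[['POLICE', 'NONCOLLEGE']].describe())   # optional descriptive statistics
dfp.to_file("practice.shp")                       # write out the updated data set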