spreg.Probit

class spreg.Probit(y, x, w=None, slx_lags=0, slx_vars='All', optim='newton', bstart=False, scalem='phimean', predflag=False, maxiter=100, vm=False, name_y=None, name_x=None, name_w=None, name_ds=None, spat_diag=False, latex=False)[source]

Classic non-spatial Probit with spatial diagnostics. The class includes a printout that presents all the results and tests in a well-organized format.

The diagnostics for spatial dependence currently implemented are:

  • Pinkse Error [Pin04]

  • Kelejian and Prucha Moran’s I [KP01]

  • Pinkse & Slade Error [PS98]

Parameters:
x : numpy.ndarray or pandas object

nxk array of independent variables (assumed to be aligned with y)

y : numpy.ndarray or pandas.Series

nx1 array of the binary dependent variable

w : W

PySAL weights instance aligned with y

slx_lags : integer

Number of spatial lags of X to include in the model specification. If slx_lags>0, the specification becomes of the SLX type.

slx_vars : 'All' or list of booleans

Variables to be lagged when slx_lags > 0; default = 'All', otherwise a list of booleans indicating which variables must be lagged (True) or not (False)

optim : str

Optimization method. Default: 'newton' (Newton-Raphson). Alternatives: 'ncg' (Newton-CG), 'bfgs' (BFGS algorithm)

bstart : list

List of starting values for the betas; default = False

scalem : str

Method to calculate the scale of the marginal effects. Default: 'phimean' (mean of the individual marginal effects). Alternative: 'xmean' (marginal effects evaluated at the means of the variables). See the sketch following this parameter list for an illustration of the two options.

predflag : boolean

If True, print the prediction table

maxiter : int

Maximum number of iterations before the optimizer stops

vm : boolean

If True, include the variance-covariance matrix in the summary results

spat_diag : boolean

If True, compute the spatial diagnostics (requires w)

latex : boolean

If True, generate the summary output in LaTeX format

name_y : str

Name of dependent variable for use in output

name_x : list of strings

Names of independent variables for use in output

name_w : str

Name of weights matrix for use in output

name_ds : str

Name of dataset for use in output
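
To make the two scalem options concrete, here is a minimal, self-contained sketch of how each scale is typically computed for a probit model (a sketch only: the variable names are illustrative, scipy is assumed to be available, and none of this is spreg internals). In both cases the marginal effect of variable j is the scale multiplied by the corresponding coefficient.

>>> import numpy as np
>>> from scipy.stats import norm
>>> rng = np.random.default_rng(0)
>>> Xs = np.hstack((np.ones((5, 1)), rng.normal(size=(5, 2))))  # constant + 2 regressors
>>> b = np.array([[0.5], [1.0], [-0.5]])                        # illustrative coefficients
>>> scale_phimean = norm.pdf(Xs @ b).mean()                     # 'phimean': mean of phi(x_i'b) over observations
>>> scale_xmean = norm.pdf((Xs.mean(axis=0) @ b).item())        # 'xmean': phi evaluated at the variable means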

Attributes:
x : array

Two dimensional array with n rows and one column for each independent (exogenous) variable, including the constant

xmean : array

kx1 vector with means of explanatory variables (for use in slopes)

y : array

nx1 array of dependent variable

w : spatial weights object

optim : str

Optimization method used

predflag : boolean

Flag to print the prediction table (predtable)

maxiter : int

Maximum number of iterations

q : array

nx1 array of transformed dependent variable 2*y - 1

betas : array

kx1 array with estimated coefficients

bstart : list or False

List with starting values for the betas, or False
predy : array

nx1 array of predicted y values (probabilities)

n : int

Number of observations

k : int

Number of variables

vm : array

Variance-covariance matrix (kxk)

logl : float

Log-likelihood of the estimation

xb : array

nx1 array with the predicted value of the linear index

predybin : array

nx1 array with the predicted values coded as binary, =1 for predy > 0.5

phiy : array

nx1 array with the normal density evaluated at xb (phi function)

u_naive : array

nx1 array of naive residuals, y - predy

u_gen : array

nx1 array of generalized residuals

warning : bool

If True, the maximum number of iterations was exceeded or the gradient and/or function calls were not changing

std_err : array

Standard errors of the estimates

z_stat : list of tuples

z statistic; each tuple contains the pair (statistic, p-value), where each is a float

predtable : dictionary

Includes margins and cells of actual and predicted values for the discrete choice model

fit : dictionary

Contains various measures of fit:

TPR : true positive rate (sensitivity, recall, hit rate)

TNR : true negative rate (specificity, selectivity)

PREDPC : accuracy, percent correctly predicted

BA : balanced accuracy

predpc : float

Percent of y correctly predicted (legacy)

LRtest : dictionary

Contains the log-likelihood of the null model (L0), the LR test statistic (likr), the degrees of freedom (df) and the p-value (pvalue)

L0 : float

Log-likelihood of the null model

LR : tuple (legacy)

Likelihood Ratio test of all coefficients = 0 (test statistic, p-value)

mcfadrho : float

McFadden's rho measure of fit
scale : float

Scale of the marginal effects

slopes : array

Marginal effects of the independent variables (k-1 x 1)

slopes_vm : array

Variance-covariance matrix of the slopes (k-1 x k-1)

slopes_std_err : array

Estimates of the standard errors of the marginal effects

slopes_z_stat : list of tuples

z-statistics and p-values for the marginal effects
Pinkse_error : array

Statistic and p-value of a Lagrange Multiplier test against spatial error correlation. Implemented as presented in [Pin04]

KP_error : array

Statistic and p-value of a Moran's I-type test against spatial error correlation. Implemented as presented in [KP01]

PS_error : array

Statistic and p-value of a Lagrange Multiplier test against spatial error correlation. Implemented as presented in [PS98]

name_y : str

Name of dependent variable for use in output

name_x : list of strings

Names of independent variables for use in output

name_w : str

Name of weights matrix for use in output

name_ds : str

Name of dataset for use in output

title : str

Name of the regression method used

Examples

We first need to import the needed modules, namely numpy to convert the data we read into arrays that spreg understands, and libpysal to read the data and build the spatial weights.

>>> import numpy as np
>>> import libpysal
>>> np.set_printoptions(suppress=True) #prevent scientific format

Open data on Columbus neighborhood crime (49 areas) using libpysal.io.open(). This is the DBF associated with the Columbus shapefile. Note that libpysal.io.open() also reads data in CSV format; since this class requires data to be passed in as numpy arrays, users can read their data in using any method.

>>> dbf = libpysal.io.open(libpysal.examples.get_path('columbus.dbf'),'r')

Extract the CRIME column (crime) from the DBF file and make it the dependent variable for the regression. Note that libpysal requires this to be a numpy array of shape (n, 1) as opposed to the also common shape of (n, ) that other packages accept. Since we want to run a probit model and for this example we use the Columbus data, we also need to transform the continuous CRIME variable into a binary variable. As in [McM92], we define y = 1 if CRIME > 40.

>>> y = np.array([dbf.by_col('CRIME')]).T
>>> y = (y>40).astype(float)

Extract INC (income) and HOVAL (home value) vectors from the DBF to be used as independent variables in the regression. Note that libpysal requires this to be an nxj numpy array, where j is the number of independent variables (not including a constant). By default this class adds a vector of ones to the independent variables passed in.

>>> names_to_extract = ['INC', 'HOVAL']
>>> x = np.array([dbf.by_col(name) for name in names_to_extract]).T

Since we want to test the probit model for spatial dependence, we need to specify the spatial weights matrix that incorporates the spatial configuration of the observations into the error component of the model. To do that, we can open an already existing gal file or create a new one. In this case, we will use columbus.gal, which contains contiguity relationships between the observations in the Columbus dataset used throughout this example. Note that, in order to actually read the file (not just open it), we need to append '.read()' at the end of the command.

>>> w = libpysal.io.open(libpysal.examples.get_path("columbus.gal"), 'r').read()

Unless there is a good reason not to, the weights should be row-standardized so that every row of the matrix sums to one. In libpysal, this can easily be done as follows:

>>> w.transform='r'

With the preliminaries out of the way, we are ready to run the model. In this case, we will need the variables and the weights matrix. If we want the names of the variables printed in the output summary, we will have to pass them in as well, although this is optional.

>>> from spreg import Probit
>>> model = Probit(y, x, w=w, name_y='crime', name_x=['income','home value'], name_ds='columbus', name_w='columbus.gal')
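
The constructor also accepts the optional arguments documented above. For instance, an SLX specification or a different optimizer can be requested at construction time (a sketch using the same data; the resulting estimates are not shown since they differ from the base model):

>>> model_slx = Probit(y, x, w=w, slx_lags=1, name_y='crime', name_x=['income','home value'], name_ds='columbus', name_w='columbus.gal')
>>> model_bfgs = Probit(y, x, optim='bfgs', name_y='crime', name_x=['income','home value'], name_ds='columbus')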

Once we have run the model, we can explore the output a little. The regression object we have created has many attributes, so take your time to discover them.

>>> np.around(model.betas, decimals=6)
array([[ 3.353811],
       [-0.199653],
       [-0.029514]])
>>> np.around(model.vm, decimals=6)
array([[ 0.852814, -0.043627, -0.008052],
       [-0.043627,  0.004114, -0.000193],
       [-0.008052, -0.000193,  0.00031 ]])
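
Other estimation attributes documented above are accessed the same way, for example the standard errors and z-statistics (bound to names here, so no output is printed):

>>> se = model.std_err    # standard errors of the estimates
>>> zstat = model.z_stat  # list of (statistic, p-value) tuples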

Since we have provided a spatial weights matrix, the diagnostics for spatial dependence have also been computed. We can access them and their p-values individually:

>>> tests = np.array([['Pinkse_error','KP_error','PS_error']])
>>> stats = np.array([[model.Pinkse_error[0],model.KP_error[0],model.PS_error[0]]])
>>> pvalue = np.array([[model.Pinkse_error[1],model.KP_error[1],model.PS_error[1]]])
>>> print(np.hstack((tests.T,np.around(np.hstack((stats.T,pvalue.T)),6))))
[['Pinkse_error' '3.131719' '0.076783']
 ['KP_error' '1.721312' '0.085194']
 ['PS_error' '2.558166' '0.109726']]
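
The classification-based measures of fit are available through the fit dictionary and the prediction table (a sketch, assuming the keys listed in the Attributes section):

>>> tpr = model.fit['TPR']      # true positive rate
>>> tnr = model.fit['TNR']      # true negative rate
>>> acc = model.fit['PREDPC']   # percent correctly predicted
>>> table = model.predtable     # margins and cells of actual vs. predicted values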

Or we can easily obtain a full summary of all the results, nicely formatted and ready to be printed, simply by typing print(model.summary).
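
The likelihood-based fit attributes are connected by standard identities; the following is a hedged sketch, assuming LRtest stores the log-likelihood of the null model under the key 'L0' as described in the Attributes section:

>>> L0 = model.LRtest['L0']        # null (constant-only) log-likelihood, assumed key
>>> rho = 1 - model.logl / L0      # McFadden's rho
>>> lr = 2.0 * (model.logl - L0)   # LR statistic; should match LRtest['likr']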

__init__(y, x, w=None, slx_lags=0, slx_vars='All', optim='newton', bstart=False, scalem='phimean', predflag=False, maxiter=100, vm=False, name_y=None, name_x=None, name_w=None, name_ds=None, spat_diag=False, latex=False)[source]

Methods

__init__(y, x[, w, slx_lags, slx_vars, ...])

gradient(par)

hessian(par)

ll(par)

par_est()
