
The Datasets Package

statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.

Using Datasets from R

The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels by using the get_rdataset function. For example:

In [3]: import statsmodels.api as sm

In [4]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")

In [5]: print(duncan_prestige.__doc__)

In [6]: duncan_prestige.data.head(5)

R Datasets Function Reference

get_rdataset(dataname[, package, cache])    Download and return an R dataset.
get_data_home([data_home])                  Return the path of the statsmodels data dir.
clear_data_home([data_home])                Delete all the content of the data home cache.
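
As a rough illustration of the two cache helpers listed above, the sketch below points them at a throwaway directory instead of the default data home. The directory name sm_cache and the placeholder file are made up for this example, and the import path from statsmodels.datasets is assumed to match the installed version.

```python
import os
import tempfile

from statsmodels.datasets import clear_data_home, get_data_home

# Use a throwaway directory so any real cache is left alone.
scratch = tempfile.mkdtemp()
cache_dir = os.path.join(scratch, "sm_cache")  # hypothetical name

# get_data_home returns the path of the data directory it was given.
path = get_data_home(cache_dir)
print(path)

# Drop a placeholder file where a cached download would normally live.
os.makedirs(path, exist_ok=True)
marker = os.path.join(path, "placeholder.txt")
with open(marker, "w") as fh:
    fh.write("stand-in for a cached dataset")

# clear_data_home deletes the content of that cache.
clear_data_home(cache_dir)
print(os.path.exists(marker))
```

Passing an explicit data_home keeps the experiment away from whatever directory the default resolves to on a given machine.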

Usage

Load a dataset:

In [7]: import statsmodels.api as sm

In [8]: data = sm.datasets.longley.load()

The Dataset object follows the bunch pattern described in the datasets proposal: it is a dictionary whose keys are also available as attributes.
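
The bunch pattern itself is small enough to sketch in plain Python. The class name Bunch and the toy values below are illustrative only; this is not the statsmodels Dataset class, just a minimal object with the same dictionary-plus-attributes behaviour.

```python
class Bunch(dict):
    """Dictionary whose keys can also be read and written as attributes."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Share one namespace between item access and attribute access.
        self.__dict__ = self


# Toy values, made up for the illustration.
ds = Bunch(endog=[1.0, 2.0], exog=[[0.5], [1.5]])

print(ds.endog)       # attribute access
print(ds["endog"])    # same object through item access

# Attributes added later show up as dictionary keys too.
ds.names = ["y", "x1"]
print(sorted(ds.keys()))
```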

Most datasets hold convenient representations of the data in the attributes endog and exog:

In [9]: data.endog[:5]

In [10]: data.exog[:5,:]

Univariate datasets, however, do not have an exog attribute.

Variable names can be obtained by typing:

In [11]: data.endog_name

In [12]: data.exog_name

If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.

In [13]: type(data.data)

In [14]: type(data.raw_data)

In [15]: data.names
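
The relationship between a record array, a plain ndarray, and a separate list of column names can be sketched with NumPy alone. The column names and numbers below are made up and are not taken from macrodata; the point is only how the three representations line up.

```python
import numpy as np

# Hypothetical column names and toy values.
names = ["year", "gdp", "unemp"]
rows = [(2001.0, 10.0, 5.0), (2002.0, 11.0, 4.5)]

# Record array: fields are addressed by name.
rec = np.rec.fromrecords(rows, names=names)
print(rec.gdp)

# Plain 2-D ndarray: columns are addressed by position,
# so the names have to be carried separately.
raw = np.array(rec.tolist())
print(raw[:, names.index("gdp")])

# The field names stored on the record array match the separate list.
print(rec.dtype.names)
```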

Loading data as pandas objects

For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules is equipped with a load_pandas method which returns a Dataset instance with the data as pandas objects:

In [16]: data = sm.datasets.longley.load_pandas()

In [17]: data.exog

In [18]: data.endog

With pandas integration in the estimation classes, the metadata will be attached to model results:

In [19]: y, x = data.endog, data.exog

In [20]: res = sm.OLS(y, x).fit()

In [21]: res.params

In [22]: res.summary()

Extra Information

If you want to know more about the dataset itself, you can access the following module-level attributes, again using the Longley dataset as an example:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
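
A small sketch that prints the start of each of these descriptive attributes, assuming statsmodels is installed. The 60-character cut-off is arbitrary and only keeps the printout short.

```python
from statsmodels.datasets import longley

fields = ("TITLE", "DESCRSHORT", "DESCRLONG", "SOURCE", "NOTE", "COPYRIGHT")

# Each attribute is a plain string describing the dataset.
values = {name: getattr(longley, name) for name in fields}

for name in fields:
    first_part = " ".join(values[name].split())[:60]
    print(name + ": " + first_part)
```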

Additional information

  • The idea for a datasets package was originally proposed by David Cournapeau, with later updates by Skipper Seabold.
  • To add datasets, see the notes on adding a dataset.