statsmodels provides datasets (i.e. data and metadata) for use in examples, tutorials, model testing, etc.
The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels by using the get_rdataset function. For example:
In [3]: import statsmodels.api as sm
In [4]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
In [5]: print(duncan_prestige.__doc__)
In [6]: duncan_prestige.data.head(5)
get_rdataset(dataname[, package, cache]) | Download and return an R dataset. |
get_data_home([data_home]) | Return the path of the statsmodels data directory. |
clear_data_home([data_home]) | Delete all the content of the data home cache. |
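As a rough sketch of how the two cache helpers listed above fit together, the snippet below points them at a throwaway directory instead of the default data home. The `scratch` path is a hypothetical location chosen for illustration, and the behaviour noted in the comments (the directory is created on first use and deleted on clear) is what recent statsmodels releases are understood to do; older versions may differ.

```python
import os
import tempfile

from statsmodels.datasets import clear_data_home, get_data_home

# Hypothetical scratch location, so the real cache is left untouched.
scratch = os.path.join(tempfile.mkdtemp(), "sm_cache")

# get_data_home returns the directory path, creating it if necessary.
path = get_data_home(scratch)
created = os.path.isdir(path)

# clear_data_home deletes the directory and everything inside it.
clear_data_home(scratch)
removed = not os.path.exists(scratch)

print(created, removed)
```

Passing an explicit directory keeps the example from touching whatever cache already exists on the machine.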
Load a dataset:
In [7]: import statsmodels.api as sm
In [8]: data = sm.datasets.longley.load()
The Dataset object follows the bunch pattern explained in the dataset proposal: its fields can be read either as attributes or as dictionary keys.
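For readers unfamiliar with the bunch pattern, here is a minimal pure-Python sketch of the idea. The `Bunch` class and the toy values are hypothetical and only illustrate the idiom; they are not the statsmodels Dataset implementation.

```python
class Bunch(dict):
    """Dictionary whose keys can also be read and written as attributes."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Share storage: attribute access and key access see the same data.
        self.__dict__ = self


# Toy container with made-up field values.
toy = Bunch(endog=[1.0, 2.0, 3.0], endog_name="y")

# Attribute access and key access return the same object.
same_object = toy.endog is toy["endog"]

# A field added as an attribute also shows up as a key.
toy.names = ["y"]
keys = sorted(toy.keys())
print(same_object, keys)
```

The appeal of the pattern is that a loader can return a single object whose fields are discoverable with tab completion while still behaving like an ordinary dictionary.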
Most datasets hold convenient representations of the data in the attributes endog and exog:
In [9]: data.endog[:5]
In [10]: data.exog[:5,:]
Univariate datasets, however, do not have an exog attribute.
Variable names can be obtained by typing:
In [11]: data.endog_name
In [12]: data.exog_name
If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.
In [13]: type(data.data)
In [14]: type(data.raw_data)
In [15]: data.names
For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules provides a load_pandas function, which returns a Dataset instance with the data as pandas objects:
In [16]: data = sm.datasets.longley.load_pandas()
In [17]: data.exog
In [18]: data.endog
With pandas integration in the estimation classes, the metadata will be attached to model results:
In [19]: y, x = data.endog, data.exog
In [20]: res = sm.OLS(y, x).fit()
In [21]: res.params
In [22]: res.summary()
If you want to know more about the dataset itself, you can access the following attributes, again using the Longley dataset as an example:
>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']