Select a Spring 2004 article to read:

Category: Computing in the Arts

Econometric Data Mining with PcGets

By Robert Yaffee

PcGets (PC General To Specific modeling), written by David Hendry (Nuffield College, Oxford University) and Hans-Martin Krolzig (formerly of Nuffield College, Oxford and currently of Humboldt University, Berlin), is an econometric software modeling package that combines ease of data importation, user-friendliness, and powerful graphical capability with automated econometric modeling.

This program is nestled in the versatile GiveWin (General Instrumental Variable Estimation for Windows) interface, written in the Ox programming language by Jurgen A. Doornik (Oxford University). That interface allows for easy data importation, basic graphical analysis, and push-button module selection.

The PcGets module provides options for automated general-to-specific econometric modeling, covering a variety of ordinary least squares and instrumental variables estimations of univariate distributed-lag models. In the documentation, Krolzig shows how the program can be used for vector autoregressive (VAR) modeling as well. PcGets combines flexible data management, powerful graphics, and automatic econometric model selection into a user-friendly, simple, and efficient program.

The GiveWin Interface

Data importation can be done in several ways. One can simply copy and paste an Excel spreadsheet into the GiveWin data spreadsheet after defining the time span and observational frequency (figure 1). In the GiveWin interface, one can select the particular Oxmetrics module that one would like to apply.


Figure 1. GiveWin Data Spreadsheet.

Alternatively, one can use either DBMS/Copy or Stat/Transfer to convert SPSS, SAS, or other files to the PcGive format, which can be read by the PcGets module. The GiveWin interface endows this package with exceptionally powerful exploratory graphical capability. Whether one is making a preliminary examination of the series or analyzing residuals graphically, this package has a large repertoire of graphical options. The interface allows for a variety of time sequence graphs that can be overlaid, edited, and annotated.

The interface permits graphs of transformations of the series, including logs, first differences, growths, and means within specified ranges. The interface also allows the easy inclusion of seasonality, centered seasonality, trends, and constants. It performs a variety of scatterplots, with regression lines, cubic splines and matrices. It also permits distributional analysis with frequency charts, cumulative frequency charts, density plots, histograms, and box plots. Time series graphs include graphs for both the time and frequency domain (figure 2).


Figure 2. Time sequence plot of key series.

For the time domain, there are autocorrelation functions, partial autocorrelation functions, and crosscorrelation functions. For the frequency domain, there are the spectral density charts and periodograms. For testing residuals and other distributions, there are quantile plots against a uniform distribution, a normal distribution and against a distribution of choice. Two series can be graphed against a third, and contour or three-dimensional surfaces can also be graphed and rotated to a user-specified angle for better perspective.

PcGets itself contains a number of other graphs to enhance parameter and residual analysis. To assess parameter constancy, forward recursive or backward recursive estimation can be run and graphed, as can a rolling window of parameters estimated with rolling regression analysis.
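The recursive and rolling estimation described above can be sketched in a few lines. The example below is a minimal illustration for a simple one-regressor regression, not PcGets code; the function names `ols_slope`, `recursive_slopes`, and `rolling_slopes` are hypothetical, and the data are made up to contain a deliberate break in the slope.

```python
def ols_slope(x, y):
    """Slope of a simple least squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def recursive_slopes(x, y, min_obs):
    """Forward recursive estimation: re-estimate as each observation is added."""
    return [ols_slope(x[:end], y[:end]) for end in range(min_obs, len(x) + 1)]


def rolling_slopes(x, y, window):
    """Rolling regression: re-estimate over a fixed-width moving window."""
    return [
        ols_slope(x[start:start + window], y[start:start + window])
        for start in range(len(x) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical data: the slope changes from 1 to 3 halfway through.
    x = list(range(10))
    y = [xi if xi < 5 else 3 * xi for xi in x]
    print("recursive:", recursive_slopes(x, y, min_obs=5))
    print("rolling:  ", rolling_slopes(x, y, window=5))
```

A slope that drifts across the recursive or rolling estimates, as it does here, is the kind of parameter non-constancy that the graphs are meant to reveal.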

To diagnose the model clearly, one can also select from a rich repertoire of residual graphics. In figure 3, a matrix of four graphs is output. In the upper left is a graph of the actual values plotted against the fitted values. In the upper right, the fitted values are plotted against the output. In the lower left, the residuals are plotted against the time line, to indicate where the possible outliers reside. In the lower right, the squared normalized residuals are plotted against the time line; the normalized residuals make identification of the outliers particularly easy. Moreover, histograms of residuals, autocorrelation and partial autocorrelation functions, and spectral density plots are also available (figure 4). These graphs can be enlarged, edited, annotated, copied, and pasted into Microsoft Word or WordPerfect files with ease.


Figure 3. Residual graphics (L to R): actual vs. fitted, fitted vs. output, normalized residual time plot, and squared normalized residual time plot.

The PcGets Algorithm

David Hendry’s general-to-specific modeling for econometric model selection is based on Hoover and Perez’s (1999) Monte Carlo studies revealing that the basic stages of econometric modeling can be automated. Hoover and Perez discovered the basis for the GETS algorithm. They found that by simplifying the model, using multiple search paths, and checking the models at each stage for misspecification, they could obtain an undominated and congruent model (Hendry and Krolzig, 2001).

One can begin data mining with the most general unrestricted model (GUM) based on theory, hypothesis, or logical inference to describe the data generating process being studied. This model contains all of the variables that may be deemed to be theoretically significant. Ideally, this model will involve variables that are not problematically collinear. The program adjusts the significance levels for varying sample size and applies the diagnostic tests to the GUM. If the outlier check is selected, dummy variables are added to indicate the additive outliers.


Figure 4. Additional residual graphics include (L to R): correlogram, spectral density chart, histogram and superimposed normal curve, and quantile-quantile plot.

The program proceeds to the pre-search tests. First, the lag order is checked with a step-down test for the longest lag order permitted. Then cumulative F tests are used to test increasing and decreasing block sizes to ascertain the proper number of model variables. After insignificant variables are eliminated, a new GUM is attained and considered for further selection.

The program begins the first stage. It attempts to simplify the model by sequential estimation and reduction of the GUM, eliminating statistically insignificant variables and groups of variables. Multiple paths of simplification are explored to ensure that no important variable is inadvertently removed (Hendry and Krolzig, 2003). At each simplification step, the model is diagnostically tested for specification and misspecification. If a variable has not begun a simplification path and it is insignificant, then it is removed. If the simplified model fails one or more of the diagnostic tests, that simplification path terminates. If more than one variable remains statistically insignificant, the next most insignificant variable is removed. If no variables remain statistically insignificant, the path is terminated. Otherwise, the sub-path search persists. If all tests are passed and the variables remain significant, the model becomes the final model of that search path.

The encompassing tests are then applied. If no simplified model emerges, the GUM becomes the final model. Otherwise, the competing non-rejected models are tested for encompassing against their union. If all of the terminal models are rejected against their union, the union becomes the final model. Otherwise, the union of the remaining models becomes the GUM for the next iteration of GETS.

A second stage of multiple-path reduction of the GUM begins. Stage one is reiterated here to test further simplification of the new GUM.

Stage three entails subsample evaluation. Three samples are used: two overlapping subsamples and the overall sample. If a parameter is statistically significant in all three, it is 100% reliable. If it is significant in the overall sample and one subsample, it is 70% reliable. If it is significant in both subsamples but not in the overall sample, it is 60% reliable. If it is significant in the overall sample but in neither of the subsamples, it is 40% reliable. If it is insignificant in the overall sample and significant in only one subsample, it is 30% reliable. If it is not significant in any of these, it is 0% reliable (Hendry and Krolzig, 2001).
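The six cases listed above are consistent with a simple additive scheme: 40 points for significance in the overall sample and 30 points for each subsample. The sketch below encodes that reading. The weights are inferred from the percentages quoted in this article rather than taken from the PcGets source or manual, and the function name `reliability_percent` is hypothetical.

```python
def reliability_percent(overall: bool, sub1: bool, sub2: bool) -> int:
    """Reliability score (in percent) for one coefficient.

    Assumed weights, inferred from the cases quoted in the text:
    overall sample = 40, each overlapping subsample = 30.
    """
    return 40 * overall + 30 * sub1 + 30 * sub2


if __name__ == "__main__":
    cases = [
        (True, True, True),     # significant everywhere
        (True, True, False),    # overall and one subsample
        (False, True, True),    # both subsamples only
        (True, False, False),   # overall only
        (False, False, True),   # one subsample only
        (False, False, False),  # nowhere
    ]
    for overall, sub1, sub2 in cases:
        score = reliability_percent(overall, sub1, sub2)
        print(f"overall={overall!s:5} sub1={sub1!s:5} sub2={sub2!s:5} -> {score}%")
```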

Estimation in PcGets

PcGets makes it possible to begin with ordinary least squares estimation and later to proceed to automatic testing (“testimation”), to instrumental variables estimation and automatic instrumental variables estimation, or to recursive estimation and rolling regression to test for parameter constancy.

Model Building Strategies

Within each of these approaches, one can opt for a conservative, liberal, or user-defined strategy. That is, the user can predispose the automated variable selection to delete irrelevant variables, to retain relevant variables, or to follow customized criteria for variable selection, respectively. To control the modeling, the variables can be designated as dependent, endogenous, instrumental, or fixed (not to be deleted) in the automatic testing process. Linear or other restrictions can also be tested within the model.

Diagnostic Tests

The package automatically applies diagnostic misspecification and specification tests to check the validity of each reduction in the size and complexity of the model, ensuring a congruent final model selection. The misspecification tests, the results of which can be observed in the output in figure 5, include two Chow tests for parameter constancy (one with a mid-sample split and the other for end-of-sample constancy), a first- through fourth-order Box-Pierce autocorrelation test, a Jarque-Bera test for residual normality, White’s (1980) test for residual heteroskedasticity, and a Lagrange multiplier test for autoregressive conditional heteroskedasticity (ARCH) effects. A collinearity analysis can be performed to test for intercorrelation among predictor variables. The specification test includes a Lagrange multiplier test for omitted variables. Finally, the package performs a split-sample cross validation to determine the reliability of the parameters retained. The reliability of the parameters depends on the significance of the variables in one, some, or all of the samples in which the model is tested.

Figure 5. Output from Model
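Two of the misspecification tests named above are easy to illustrate. The sketch below computes the textbook Box-Pierce statistic, Q = n times the sum of the squared residual autocorrelations up to lag h, and the textbook Jarque-Bera statistic, JB = (n/6)(S² + (K − 3)²/4), where S is skewness and K is kurtosis. These are the standard formulas, not necessarily the exact variants PcGets reports, and the function names are hypothetical.

```python
def autocorrelation(resid, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(resid)
    mean = sum(resid) / n
    denom = sum((e - mean) ** 2 for e in resid)
    num = sum((resid[t] - mean) * (resid[t - lag] - mean) for t in range(lag, n))
    return num / denom


def box_pierce(resid, max_lag=4):
    """Box-Pierce Q statistic over lags 1..max_lag."""
    n = len(resid)
    return n * sum(autocorrelation(resid, k) ** 2 for k in range(1, max_lag + 1))


def jarque_bera(resid):
    """Jarque-Bera normality statistic from sample skewness and kurtosis."""
    n = len(resid)
    mean = sum(resid) / n
    m2 = sum((e - mean) ** 2 for e in resid) / n
    m3 = sum((e - mean) ** 3 for e in resid) / n
    m4 = sum((e - mean) ** 4 for e in resid) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


if __name__ == "__main__":
    # Hypothetical residuals that alternate in sign (strong negative autocorrelation).
    residuals = [1.0, -1.0] * 4
    print("Box-Pierce Q(4):", box_pierce(residuals, 4))
    print("Jarque-Bera:    ", jarque_bera(residuals))
```

Under their null hypotheses, both statistics are usually compared with a chi-square distribution: about 9.49 at the 5% level with four degrees of freedom for Q(4), and about 5.99 with two degrees of freedom for Jarque-Bera. The degrees of freedom appropriate for regression residuals may differ from the textbook case.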

The dynamic analysis option computes the moduli of the roots of the lag polynomial of the dependent variable to test for stability, and it reports the long-run values. The moduli for the exogenous variables are also reported, although they are less essential.
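To make the stability check concrete, the sketch below works through a second-order case, y_t = a1·y_{t-1} + a2·y_{t-2} + b0·x_t + b1·x_{t-1} + e_t. The dynamics are stable when both roots of z² − a1·z − a2 = 0 lie inside the unit circle, and the long-run effect of x on y is (b0 + b1)/(1 − a1 − a2). This is the standard textbook calculation under that assumed model form, not PcGets output, and the function names are hypothetical.

```python
import cmath


def ar2_root_moduli(a1, a2):
    """Moduli of the roots of z^2 - a1*z - a2 = 0 (the companion-form roots)."""
    disc = cmath.sqrt(a1 * a1 + 4.0 * a2)
    roots = ((a1 + disc) / 2.0, (a1 - disc) / 2.0)
    return sorted(abs(root) for root in roots)


def is_stable(a1, a2):
    """True when every root lies strictly inside the unit circle."""
    return max(ar2_root_moduli(a1, a2)) < 1.0


def long_run_multiplier(a1, a2, b_coeffs):
    """Long-run effect of x on y; meaningful only when the dynamics are stable."""
    return sum(b_coeffs) / (1.0 - a1 - a2)


if __name__ == "__main__":
    # Hypothetical coefficient values chosen for illustration.
    print(ar2_root_moduli(0.5, 0.3), is_stable(0.5, 0.3))
    print(ar2_root_moduli(1.0, 0.0), is_stable(1.0, 0.0))  # unit root
    print(long_run_multiplier(0.5, 0.3, [0.1, 0.1]))
```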

The collinearity analysis not only estimates the intercorrelation of the parameters; it also reveals eigenvalues indicating relative proportions of shared variance. With this information, the user can assess the extent to which the model parameters are orthogonal to one another.

The package contains very useful utilities. With the batch editor, one can store the models in an archive. With the algebra editor, new variables can be constructed and easily transformed. A calculator facilitates computations that can help in constructing new variables or establishing restrictions, and the tail probability calculator permits significance testing with particular distributions.

Parameter Reliability Assessment

As seen in figure 5, the model is cross-validated with reliability estimates of the parameters. Overlapping subsamples are used to test the statistical significance of the final parameters. Those parameters that attain significance in both subsamples and in the overall sample receive the highest reliability score of unity. Those that attain statistical significance only in the subsamples receive a lower reliability score. Parameters that fail to attain significance in any of the subsamples or in the overall sample are assigned a reliability of zero.

Within Sample Forecasting

The program performs within-sample forecasting. It will set aside part of the sample as a hold-out sample and forecast over it. It will test level bias at a particular forecast horizon with a one-sample t test to determine whether the forecast at that horizon differs significantly from the actual value of the series.

This is a kind of predictive validation of the model. The program will graph the forecast against the actual values, delineating the forecast interval with error bars, error fans, or confidence intervals over the forecast horizon (figure 6).


Figure 6. Within Sample Forecast Profile with Error Fans.
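The bias test described in this section can be sketched as a one-sample t test on the hold-out forecast errors: the mean error divided by its standard error. The example below is a minimal illustration under that assumption, with made-up numbers. It is not the PcGets routine, and the function name `forecast_bias_t` is hypothetical.

```python
import math
import statistics


def forecast_bias_t(actual, forecast):
    """t statistic for the null hypothesis that the mean forecast error is zero."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    mean_error = statistics.mean(errors)
    std_error = statistics.stdev(errors) / math.sqrt(n)  # uses n - 1 in the variance
    return mean_error / std_error


if __name__ == "__main__":
    # Hypothetical hold-out values and forecasts.
    actual = [10.0, 12.0, 11.0, 13.0]
    forecast = [9.0, 11.0, 11.0, 12.0]
    t_stat = forecast_bias_t(actual, forecast)
    print(f"t = {t_stat:.3f} on {len(actual) - 1} degrees of freedom")
```

A large absolute t value suggests that the forecasts are systematically too high or too low over the hold-out sample.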

Progress Tracking

One of the most useful features available is the tracking of the log-likelihood of the models developed during the session, which identifies the best-fitting model. The stored models can be easily retrieved for later use, along with the record of which one fits best.

Assumptions of PcGets

PcGets conducts all analysis as if the continuous dependent variables (series) were stationary. The current version does not test for stationarity: there are no Dickey-Fuller, augmented Dickey-Fuller, or Phillips-Perron tests incorporated in the package. As an informal check, one can review the autocorrelation function of a series to determine whether it declines exponentially or linearly. A linear or polynomial trend can be inserted to detrend trend-nonstationary processes. One can also first difference the process if it appears to be difference-nonstationary. In an upcoming version, there will be a t-test for stationarity that is part of the dynamic analysis output (Krolzig, 2004).
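Because the stationarity check has to be done by hand, a small sketch of the three operations mentioned above may help: inspecting the autocorrelation function, first differencing, and removing a linear trend. The code is a plain illustration with a made-up series and hypothetical function names; it is not a substitute for a formal unit root test.

```python
def acf(series, lag):
    """Sample autocorrelation at the given lag (undefined for a constant series)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


def first_difference(series):
    """y_t - y_{t-1}, for a difference-nonstationary series."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def detrend_linear(series):
    """Residuals from a least squares fit on a linear time trend."""
    n = len(series)
    t_bar = (n - 1) / 2.0
    y_bar = sum(series) / n
    stt = sum((t - t_bar) ** 2 for t in range(n))
    sty = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(series))
    slope = sty / stt
    intercept = y_bar - slope * t_bar
    return [y - (intercept + slope * t) for t, y in enumerate(series)]


if __name__ == "__main__":
    # Hypothetical trending series.
    series = [2.0 + 0.5 * t for t in range(10)]
    print("ACF at lags 1-3:", [round(acf(series, k), 3) for k in (1, 2, 3)])
    print("first differences:", first_difference(series))
    print("detrended:", [round(r, 6) for r in detrend_linear(series)])
```

An autocorrelation function that declines slowly, roughly linearly, points toward nonstationarity, whereas a rapid, roughly exponential decline is what one expects from a stationary series.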

PcGets also assumes that the variables used are not collinear, but it does include a collinearity analysis with which to test this assumption. If the items are highly correlated, one could standardize them and construct a scale out of them. The scale could then replace the collinear items in the model, provided that testing the theory does not prohibit this substitution. Otherwise, proxy variables that are not highly correlated with one another may be used.

Identification of models is also assumed. There have to be enough observations to model all of the parameters, outliers, structural breaks, etc. Without sufficient sample size and identification, such modeling is impossible.

Documentation Quality

The documentation is excellent. The user’s guide is well laid out. It begins with an introduction and a getting started section. It then proceeds to a series of tutorials on the use of different aspects of the package, including model formulation and estimation, post-estimation model evaluation, automatic GETS model selection, cross-section model selection, batch usage, and VAR modeling (Krolzig, 2004). The book proceeds to the econometrics of PcGets, including the theory of reduction, model evaluation, and model encompassing. It contains a section on the PcGets algorithm and rebuttals of the criticisms of data mining as they apply to this package. Another section deals with menus, options, and the batch language with which one can write and store programs to perform these procedures. The documentation is clear and comprehensive.

Discussion

Hendry and Krolzig, the authors of PcGets, submit that the package overcomes criticism leveled at data mining in most respects. They challenge its critics on issues of data-driven modeling, measurement without theory, ignorance of selection effects, spurious significance deriving from repeated testing, arbitrary choices of significance levels, and lack of identification, among others. They challenge John Maynard Keynes (1939, 1940), who argued that data mining must presume prior complete knowledge of the data generating process (DGP) and that theoretical models in econometrics are neither complete nor correct. Hendry and Krolzig rejoin that, if that were so, econometrics could not be properly applied to empirical research, a position they generally regard as unreasonable, untenable, futilitarian, as well as historically discredited.

The authors also challenge the argument that data mining is measurement without theory. Theoretical development cannot proceed without due regard for data. There is a give and take between theory and measurement, and theory must be rigorously tested against the data. Data mining insofar as it serves this purpose is germane to theoretical development and cannot be summarily discarded.

In terms of the pre-test bias debate, the authors distinguish between the costs of search and the costs of inference. When the truth is unknown, the costs of search are inevitable: searches must be conducted, and inference tests must also be performed. The search costs of retaining irrelevant variables or omitting relevant variables turn out to be small, according to the authors. General-to-specific modeling helps protect against the more serious problem of starting with too small a model (Hendry and Krolzig, 2001).

Most modeling ignores the uncertainty entailed in the model selection. Few models adjust the uncertainty of the significance tests to account for the many pre-tests undertaken before the final candidate model is tested. Often, the variables are tested in different samples before they are conventionally accepted as candidate predictors in the final model. Only the final candidate model is deemed to have its standard errors derive from the sampling variation involved in the fixed specification, as it were. For this reason, Leamer (1983) suggested his sensitivity testing of the theoretically important variables by selectively removing auxiliary variables in what he called “extreme bounds analysis.” PcGets seeks to deal with this problem by applying encompassing tests to competing models and reliability tests to the same model in overlapping samples.

The idea that spurious significance will follow from repeated testing is also challenged by the authors. They argue that there are possible solutions to bias from the experimentwise error rate, and incline toward a Šidák correction. They moreover claim that block tests can alter this rate of false rejection and that multi-path searches greatly mitigate the problem. They claim that one-path selection procedures (for example, those contained in stepwise regression) are detrimental, unwise, and to be avoided.
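For readers unfamiliar with the correction mentioned above, the sketch below shows the standard Šidák formula: with m independent tests, a per-test level of 1 − (1 − α)^(1/m) keeps the experimentwise error rate at α. The independence of the tests is an assumption of the formula, and the function names are hypothetical; how PcGets applies any such adjustment internally is not shown here.

```python
def familywise_error(per_test_alpha, n_tests):
    """Chance of at least one false rejection across independent tests."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


def sidak_alpha(familywise_alpha, n_tests):
    """Per-test significance level that holds the experimentwise rate fixed."""
    return 1.0 - (1.0 - familywise_alpha) ** (1.0 / n_tests)


if __name__ == "__main__":
    m = 20  # hypothetical number of tests
    print(f"uncorrected experimentwise rate: {familywise_error(0.05, m):.4f}")
    print(f"Sidak per-test level:            {sidak_alpha(0.05, m):.5f}")
```

With twenty independent tests at the 5% level, the chance of at least one spurious rejection is roughly 64%, which is why some adjustment is needed.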

The authors claim to adjust their significance levels to the sample sizes to attain consistent significance testing. They argue that most testing ignores the effect of changing sample size on test power and thereby allows a varying imbalance between type I and type II error. PcGets incorporates a set of rules that compensate for this problem, after the manner of the penalty used in the Schwarz criterion.

Path dependence in the model selection is overcome by employing a multi-path search approach to the model simplification process. As many feasible paths as possible are searched in order to simplify these models. Encompassing tests serve to combine those final competing models to obtain the dominant terminal congruent model.

Recommendations

There are a few drawbacks to the current version of the program. Within the econometric context, the program should permit automatic modeling and testing of long-wave cycles. In addition to identifying additive outliers, the program should also handle innovational outliers, temporary changes, and level shifts. Appropriate tests for these could be included in the program.

The current version of the program does not permit data mining with limited dependent response variables. Neither does the current version of the program perform data reduction with principal components or factor analysis. Nor does it perform classification analysis with discriminant or cluster analysis. There is no provision for bootstrapping standard errors or multiple imputation of missing values at this point in time.

However, this is an exceptionally user-friendly, powerful, and flexible program. With mastery of batch processing and the batch language, the package can be tailored to the user’s needs. It is an econometric modeling package with some within-sample forecasting ability, including a built-in t-test for forecast bias.

The program would benefit from a provision for evaluating interval coverage and interval precision, with both absolute and relative measures of forecast accuracy. Percent interval coverage, mean square forecast error, mean absolute error, mean absolute percentage error, and median absolute percentage error would be welcome additional measures to indicate the precision of the forecast. As for out-of-sample forecasting, the program neither produces such forecasts nor evaluates them. It should be able to generate a naïve forecast automatically and to compare the out-of-sample forecast against it with relative measures of forecast accuracy, such as percent better, relative absolute error, relative absolute percentage error, Theil’s U, and Theil’s V.
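Several of the measures requested above are simple to compute once forecasts and actual values are in hand. The sketch below uses their standard definitions on made-up numbers, with the naïve benchmark taken to be a no-change forecast (the previous actual value). None of this reflects PcGets functionality, and the function names are hypothetical.

```python
import statistics


def accuracy_measures(actual, forecast):
    """Absolute accuracy measures: MSE, MAE, MAPE, and median APE."""
    errors = [a - f for a, f in zip(actual, forecast)]
    pct_errors = [abs(e / a) * 100.0 for e, a in zip(errors, actual)]
    return {
        "mse": statistics.mean(e * e for e in errors),
        "mae": statistics.mean(abs(e) for e in errors),
        "mape": statistics.mean(pct_errors),
        "mdape": statistics.median(pct_errors),
    }


def interval_coverage(actual, lower, upper):
    """Percent of actual values falling inside the forecast interval."""
    hits = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return 100.0 * hits / len(actual)


def relative_absolute_error(actual, forecast, naive):
    """Total absolute error of the forecast relative to a naive benchmark."""
    model_error = sum(abs(a - f) for a, f in zip(actual, forecast))
    naive_error = sum(abs(a - b) for a, b in zip(actual, naive))
    return model_error / naive_error


if __name__ == "__main__":
    # Hypothetical hold-out data; the naive forecast is the previous actual value.
    actual = [100.0, 110.0, 120.0, 130.0]
    forecast = [102.0, 108.0, 123.0, 130.0]
    naive = [90.0, 100.0, 110.0, 120.0]
    print(accuracy_measures(actual, forecast))
    print(interval_coverage(actual, [f - 2.5 for f in forecast], [f + 2.5 for f in forecast]))
    print(relative_absolute_error(actual, forecast, naive))
```

A relative absolute error below one indicates that the model forecast beats the naïve benchmark over the evaluation period.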

An upcoming version of the PcGets package will automatically perform a combination of long and short run analysis with error correction models and also handle cross-sectional and VAR analysis. It will allow the option of a Full-Search in addition to OLS, GETS, Instrumental Variables, and automatic instrumental variables analysis.

In summary, PcGets version 1.0 is an excellent, indeed outstanding, program that comes with practice data sets and a well written and documented user’s guide, with which one can begin econometric data mining. The algorithm is a tour de force, and the package, distributed in the U.S. and U.K. by Timberlake Consultancy Ltd. (info@timberlake-consultancy.com; http://www.timberlake-consultancy.com), is highly recommended for dynamic regression modeling. The upcoming improvements are even more promising.

The Academic Computing Services’ Statistics, Social Science, and Mapping Group in New York University’s Information Technology Services has ordered modules of the Oxmetrics suite, including copies of PcGets, STAMP, PcGive, and the Ox programming language. Students and/or faculty needing assistance with PcGets or these other modules should contact frank.lopresti@nyu.edu.

References

  1. Campos, J., Hendry, D. F., and Krolzig, H.-M. (2003). “Consistent Model Selection by an Automatic GETS Approach.” http://www.econ.ox.ac.uk/research/hendry/paper/jcdfhhmk03.pdf, downloaded January 11, 2004, pp. 1-14.
  2. Campos, J. and Ericsson, N. R. (2000). “Constructive Data Mining: Modeling Consumers' Expenditure in Venezuela.” Federal Reserve Board International Finance Discussion Paper No. 663. http://www.federalreserve.gov/pubs/ifdp/2000/663/ifdp663.pdf.
  3. Hendry, D. F. and Krolzig, H.-M. (2001). Automatic Model Selection Using PcGets 1.0. London: Timberlake Consultancy, Ltd., 122-125, Appendix A1, 212-214.
  4. Hendry, D. F. and Krolzig, H.-M. (2003). “Automatic Model Selection: A New Instrument for Social Science.” http://www.econ.ox.ac.uk/research/hendry/paper/dfhhmk03c.pdf, downloaded January 11, 2004, 3.
  5. Keynes, J. M. (1939). “Professor Tinbergen’s Method.” Economic Journal, 49, 558-568.
  6. Keynes, J. M. (1940). “Comment.” Economic Journal, 50, 154-156.
  7. Krolzig, H.-M. (2004). Personal communication. January 22, 2004.
  8. Leamer, E. E. (1983). “Let’s Take the Con Out of Econometrics.” American Economic Review, 73, 31-43.


Author Biography

Robert Yaffee was a statistician within the ITS Social Sciences, Statistics & Mapping Group at the time of this article's publication.

Page last reviewed: April 17, 2004. All content © New York University.
Questions or comments about this site? Send e-mail to: its.connect@nyu.edu.