CONNECT, SPRING 1996: STATISTICS AND THE SOCIAL SCIENCES


New Statistical Modules for Marketing Research and Time-Series Analysis

by Robert A. Yaffee

[Ed: Links to web pages and/or e-mail addresses which have become inactive since the publication of this article have been enclosed in curly brackets { }. Replacement links have been provided where possible.]

SPSS (Statistical Package for the Social Sciences) is one of the most popular computer packages for statistical analysis in universities around the world today. It is a very user-friendly package, and thus useful for the beginning student learning how to apply statistics in the social sciences. (For a brief comparison of SPSS with SAS, see "SAS or SPSS?") The Academic Computing Facility makes SPSS available in three ways: it offers SPSS for Windows (version 6.1.2) in its PC labs, and SPSS for Unix (version 5) on its statistical server, an IBM RS6000/C-20 designated stats1.acf.nyu.edu. It also maintains a site license for distribution of the SPSS for Windows package to students, research staff, and faculty for a nominal fee.

The ACF has recently acquired several new statistical modules for use with SPSS for Windows:

Let's consider the time-series and the neural-networks modules in greater detail.

Time-Series Analysis

SPSS's Trends module contains a number of statistical techniques useful in time-series analysis. It offers sequence and time-series plots for inspecting the data. It permits seasonal decomposition for those data series with clear seasonal components, allowing the selection of additive or multiplicative components of trend, seasonality, cycle, or error. The X-11 procedure allows for Census II seasonal adjustment of the data. The exponential-smoothing procedure gives the most recent data the greatest weight for the purpose of forecasting. For those who prefer to analyze continuous data, spectral analysis is available in this module as well.
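To make the idea of exponential smoothing concrete, here is a minimal Python sketch of simple (single) exponential smoothing. It is not the Trends procedure itself; the function name, the smoothing weight alpha, and the sales figures are assumptions chosen for illustration.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    alpha (0 < alpha <= 1) is the smoothing weight: values near 1 lean
    on the most recent data, values near 0 on the older history.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    level = series[0]              # initialize the level at the first observation
    levels = [level]
    for value in series[1:]:
        # New level = weighted average of the new value and the old level.
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


# Hypothetical monthly sales figures, for illustration only.
sales = [10.0, 12.0, 11.0, 13.0]
levels = simple_exponential_smoothing(sales, alpha=0.5)
forecast = levels[-1]              # the one-step-ahead forecast is the last level
```

With these numbers the smoothed levels are 10, 11, 11, and 12, so the one-step-ahead forecast is 12.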

ARIMA and regression techniques are available. The ARIMA (autoregressive integrated moving average) procedure permits identification, estimation, diagnosis, metadiagnosis, and forecasting. For series afflicted with first-order autocorrelation, there is the autoregression procedure. For series afflicted with serious heteroskedasticity problems, there is the weighted least-squares regression procedure. For series with serious multicollinearity and simultaneity, there is the two-stage least-squares regression procedure, which may be used when proxy variables are available.
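Before choosing among these procedures, it can help to check whether a series shows first-order autocorrelation at all. The following Python sketch computes the lag-1 sample autocorrelation coefficient, a common diagnostic; it is offered as an illustration only and is not the Trends autoregression procedure. The function name and the sample series are assumptions.

```python
def lag1_autocorrelation(series):
    """Lag-1 sample autocorrelation: values near +1 or -1 suggest strong
    first-order dependence, values near 0 suggest little."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(series) / n
    # Cross-products of each deviation with the deviation one period earlier.
    numerator = sum(
        (series[t] - mean) * (series[t - 1] - mean) for t in range(1, n)
    )
    denominator = sum((value - mean) ** 2 for value in series)
    if denominator == 0:
        raise ValueError("series is constant")
    return numerator / denominator


# Hypothetical series, for illustration only.
trending = [1.0, 2.0, 3.0, 4.0, 5.0]       # steadily rising
alternating = [1.0, -1.0, 1.0, -1.0]       # flips sign every period
r_trending = lag1_autocorrelation(trending)
r_alternating = lag1_autocorrelation(alternating)
```

The rising series yields a positive coefficient and the alternating series a negative one.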

The Trends module under Windows produces textual and graphical output that can easily be incorporated in word-processing documents. The module has also been added to SPSS for Unix on the ACF's IBM RS-6000/C-10 computer. A series of lectures on the use of the Trends module is planned for the fall of 1996.

Neural Networks

Attempts to develop computer models of the perceptual and other cognitive processes of the brain have led to the development of computer programs that can read, learn, generalize, and adapt. These neural networks complement conventional statistical techniques for complex, nonlinear classification, clustering, prediction, and time-series modeling. When the data are "noisy," the new SPSS Neural Connections module, albeit computationally intensive, may outperform conventional models built for these purposes. According to Tony Babinec, SPSS Director of Business Development, these algorithms may facilitate the exploration, construction, and refinement of models, and hence increase understanding and productivity.

Fundamentally, neural networks consist of basic components called neurons between which connections are formed. Functionally, the neurons are arrayed in three types of layers - input, hidden, and output. Three layers is a bare minimum for analyzing nonlinear problems, and there are often several hidden layers.

In Neural Connections, icons on the computer screen serve as manipulable symbols for input data, smoothing, filtering, modeling, or forecasting processes. The decision function may be depicted in a graph.

In the first phase of the iterative model-building cycle, the data are preprocessed. The neural net first transforms input and target variables to make them utilizable. Categorical and date-formatted variables are converted to numeric variables. Nonnormal variables are deskewed. Input variables are standardized, while outliers are clipped.
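Some of these preprocessing steps can be sketched in a few lines of Python. The sketch covers only standardizing, clipping, and recoding a categorical variable; deskewing and date conversion are left out. The function names, the clipping limit, and the data are assumptions for illustration, not the module's actual settings.

```python
from statistics import mean, pstdev


def standardize_and_clip(values, limit=3.0):
    """Convert values to z-scores, then clip them to [-limit, +limit]."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]   # a constant variable carries no signal
    z_scores = [(v - mu) / sigma for v in values]
    return [max(-limit, min(limit, z)) for z in z_scores]


def one_hot(categories):
    """Recode a categorical variable as 0/1 indicator columns."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]


# Hypothetical values, for illustration only.
balances = [2, 4, 4, 4, 5, 5, 7, 9]                  # mean 5, standard deviation 2
scaled = standardize_and_clip(balances, limit=1.5)   # the 9 (z = 2.0) is clipped to 1.5
coded = one_hot(["red", "blue", "red"])              # columns ordered: blue, red
```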

In the second phase, the neural net divides the data set into three subsets, one of which it will use for training, one for validation, and one for testing.
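A three-way split of this kind might look as follows in Python. The 60/20/20 proportions, the random shuffle, and the function name are assumptions; the article does not say how Neural Connections sizes or draws its subsets.

```python
import random


def three_way_split(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the rows and cut them into training, validation, and test sets."""
    rows = list(rows)                     # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)     # fixed seed keeps the split repeatable
    n_train = round(len(rows) * train_frac)
    n_valid = round(len(rows) * valid_frac)
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]       # whatever remains is held out for testing
    return train, valid, test


# Ten hypothetical case numbers, for illustration only.
train, valid, test = three_way_split(range(10))
```

For a time series, a random shuffle would destroy the ordering of the data, so contiguous blocks would be a more natural choice there.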

During the third phase, the neural net uses training data to form a system of connections between the layers of neurons for segmentation, classification, or prediction of the target (output) variable values.

As the network of connections is built between the layers, each connection is assigned a statistical weight. Input layers are stimulated by software algorithms to send signals to hidden layers. As a signal is sent through each connection, it is multiplied by the connection weight. Stimulated neurons receive signals from preceding layers of neurons and from a bias neuron. They sum these signals, subject this sum to a statistical transformation, and transmit the result to neurons in subsequent layers. The output of these neurons is a function of the net structure and the connection weights.
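The signal flow just described can be sketched in Python. The logistic function stands in for the "statistical transformation," which is an assumption: the article does not name the transformation Neural Connections uses. The weights and inputs are likewise made up for illustration.

```python
import math


def logistic(x):
    """Squash a summed signal into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(inputs, weights, biases):
    """Compute the outputs of one layer of neurons.

    weights[j][i] is the weight on the connection from input i to neuron j;
    biases[j] is the signal neuron j receives from the bias neuron.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Each signal is multiplied by its connection weight, then summed.
        total = bias + sum(w * x for w, x in zip(neuron_weights, inputs))
        outputs.append(logistic(total))   # transform the sum and pass it on
    return outputs


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Send the inputs through one hidden layer and then the output layer."""
    hidden = layer_output(inputs, hidden_w, hidden_b)
    return layer_output(hidden, output_w, output_b)


# Hypothetical weights: two inputs, two hidden neurons, one output neuron.
result = forward(
    [1.0, 2.0],
    hidden_w=[[0.5, -0.25], [0.1, 0.3]], hidden_b=[0.0, -0.2],
    output_w=[[1.0, -1.0]], output_b=[0.0],
)
```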

When the signaled output value is subtracted from the actual data value, the residual is an error. Errors are fed back from the output to the hidden layers for corrective adjustment of the connection weights to minimize the sum of squared errors. Training of the net takes place through this iterative feedforward of signal and feedback of the errors until a convergence of target with actual values takes place.
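The feedforward and feedback cycle can be illustrated with a small Python sketch of backpropagation for a net with one hidden layer and one output neuron. The logistic transformation, the learning rate, the number of passes, and the toy data are all assumptions for illustration; this is not the Neural Connections algorithm itself.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_net(n_in, n_hidden, rng):
    # The last weight in every row is the weight on the bias neuron.
    return {
        "hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)],
        "output": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)],
    }


def forward(net, x):
    hidden = [logistic(sum(w * v for w, v in zip(row, x + [1.0])))
              for row in net["hidden"]]
    output = logistic(sum(w * v for w, v in zip(net["output"], hidden + [1.0])))
    return hidden, output


def sum_squared_error(net, data):
    return sum((target - forward(net, x)[1]) ** 2 for x, target in data)


def train_epoch(net, data, rate):
    for x, target in data:
        hidden, output = forward(net, x)               # feedforward of the signal
        error = target - output                        # actual minus signaled value
        delta_out = error * output * (1.0 - output)    # error times logistic slope
        # Feed the error back to each hidden neuron through its output weight.
        delta_hidden = [delta_out * net["output"][j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        # Adjust the weights in the direction that reduces the squared error.
        for j, v in enumerate(hidden + [1.0]):
            net["output"][j] += rate * delta_out * v
        for j, row in enumerate(net["hidden"]):
            for i, v in enumerate(x + [1.0]):
                row[i] += rate * delta_hidden[j] * v


# Toy data (logical OR), for illustration only.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = init_net(n_in=2, n_hidden=3, rng=random.Random(1))
sse_before = sum_squared_error(net, data)
for _ in range(2000):
    train_epoch(net, data, rate=0.5)
sse_after = sum_squared_error(net, data)
```

Comparing sse_before with sse_after shows how far the repeated passes have reduced the sum of squared errors.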

[Figure: A three-dimensional graph produced in SPSS's Neural Connections module represents decision output as a function of two inputs: age and maximum bank balance. The graph can be rotated as desired.]

In the validation phase, the network is run with the validation data subset to guard against overfitting the net to the peculiarities of the training sample. As the net is trained, both training and validation errors decline. After a point, though, the validation error begins to increase; the optimal network configuration is the one at the point of minimal validation error. Beyond that point, any further decrease in training error would indicate overtraining to the particular sample of data.
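The stopping rule implied here can be sketched in Python. The "patience" setting (how many passes to wait for an improvement before giving up) is an assumption added for illustration, as are the function name and the error values.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return (epoch, error) at the minimum validation error seen before
    `patience` consecutive passes go by without an improvement."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error   # new best configuration
        elif epoch - best_epoch >= patience:
            break                                   # validation error keeps rising
    return best_epoch, best_error


# Hypothetical validation errors recorded after each training pass.
validation_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_epoch, stop_error = early_stopping_epoch(validation_errors)
```

Here the validation error bottoms out at the fourth pass (epoch 3, counting from zero), so the weights from that pass would be kept.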

Multiple random starting points may be used to avoid getting stuck in local error minima, and multiple data sets may be used for extended validation. When used for forecasting, the multiple starts are constrained by the time-dependent order of the data series. After the series has been fitted, the forecast may be extrapolated from the fitted model and projected into the future.
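The value of multiple random starts can be seen on a toy problem. The Python sketch below minimizes a made-up one-weight error curve that has both a shallow local minimum and a deeper global one; the curve, the step size, and the number of starts are all assumptions for illustration and have nothing to do with any real network.

```python
import random


def error(w):
    """Hypothetical error curve: local minimum near w = 1.13,
    deeper global minimum near w = -1.30."""
    return w ** 4 - 3 * w ** 2 + w


def error_slope(w):
    return 4 * w ** 3 - 6 * w + 1


def descend(w, rate=0.01, steps=500):
    """Plain gradient descent: slide downhill from the starting point."""
    for _ in range(steps):
        w -= rate * error_slope(w)
    return w


def best_of_random_starts(n_starts, seed=0):
    """Run the descent from several random starting weights; keep the best."""
    rng = random.Random(seed)
    finishes = [descend(rng.uniform(-2.0, 2.0)) for _ in range(n_starts)]
    return min(finishes, key=error)


best_w = best_of_random_starts(n_starts=8)
best_error = error(best_w)
```

A single start that happens to fall on the right-hand side of the curve settles into the shallow minimum; trying several starts and keeping the lowest error makes it much more likely that the deeper one is found.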

In these ways, the model may be cross-validated on other data sets and used for segmentation, classification, prediction, or forecasting.

The Neural Connections module allows for visualization of its decision criterion. A three-dimensional graph of the decisional surface of the neural net may be displayed in relation to other inputs as in the accompanying figure. In addition to this graph tool, another visualization device for sensitivity analysis shows how output would change in accordance with designed alterations of the graphed inputs. With the heuristic aid of these procedures, the neural net can be further refined or developed.

Access

Persons interested in purchasing any of these SPSS modules should call Jane DelFavero at the ACF Help Center on the second floor of Warren Weaver Hall at 998-3333. Those seeking technical assistance with modules may contact me at 998-3402 or by e-mail at the address below.


Dr. Yaffee was a statistical consultant at the ACF at the time of this article's publication.
{yaffee@nyu.edu}

Posted 15 February 1996. Revised 24 May 2004.