The Social Science, Statistics, and Mapping Group of the ITS Academic Computing Services has recently acquired the S-PLUS statistical package for use in the ITS computer labs. This package is also available via a Telnet dialup connection on the STATS1 Unix computer.
S-PLUS is a very powerful and flexible statistical
package that incorporates many of the procedures available in the more common
statistical packages. S-PLUS features learning resources that include a tutorial
for the novice student, multiple help menus, a dictionary of keywords and
phrases, a reference manual, and online help. It has an appearance that is
similar to that of SPSS, another popular statistical package, but it operates in
an object-oriented environment that is very different from that of SPSS. In
addition to the usual basic statistics, S-PLUS features a customizable user
interface, and an extensive and powerful exploratory graphical capability (for
example, see Figure 1 for a trellis graph of air quality variables in the data
set and Figure 2 for an example of a 3-D graph). The package also offers built-in
programming capability, the ability to do matrix programming, and a wide array of
advanced statistical techniques that make it a very popular statistical package
among professors, professional statisticians, econometricians, and
mathematicians.
The user interface provides
a data spreadsheet in which data can be easily defined, imported, manipulated
or exported. S-PLUS can import data from flat files (such as
PRN, ASCII or text files); spreadsheets (including Excel, Lotus, and Quattro
Pro); databases (e.g., Paradox and dBASE); statistical packages (e.g., SAS,
SPSS and SYSTAT); mathematical packages (such as MATHCAD and MATLAB); and graphical
packages (like SigmaPlot). S-PLUS is also versatile in its data manipulation.
Entire data sets can be easily selected, and individual blocks of data (either
rows or columns) can be selected within data sets. These blocks can be copied,
moved, appended, packed, transposed, sorted, stacked or unstacked. Variables
can be constructed by recoding existing variables, collapsing categories, or
generating random numbers from a theoretical distribution. Complete data sets
can be selected, subset, split, merged, transposed, and easily saved. After
the data sets are saved, they can be exported in various formats. For example,
by using the PowerPoint Wizard, resulting output can be converted to MS PowerPoint
for presentation.
In the object-oriented environment, data, functions, graphs, scripts, and modeling results are saved as objects. An object-explorer displays these objects and allows the analyst to select them for special processing. Double-clicking on the model object shows the default summary for the object. Right-clicking shows methods used for analyzing the object, from which the user may select one or several options. Functions may be used to call portions of the output. For sophisticated analysis, users may avail themselves of the command and analytical history log windows to keep an audit trail of the analysis for debugging, development and reference.
S-PLUS statistical techniques include a wide variety of basic and advanced statistics. The basic statistics include summary statistics, cross-tabulations and correlations. The package includes methods for analyzing statistical power and sample size requirements. Methods for comparing one, two, or more samples, and for analyzing counts and proportions, are also part of the package. Among the more advanced options are the general linear and nonlinear models, which include experimental designs with fixed, random and mixed effects. These may be orthogonal factorial, repeated measures or split-plot designs. Multiple comparisons with a specifiable experiment-wise error rate are available using the Bonferroni, Dunnett, Fisher LSD, Scheffe, Sidak, Tukey or Simulation options. Plots of the interactions can also be produced.
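To make the idea of an experiment-wise error correction concrete, here is a minimal sketch in Python (used only for illustration; this is not S-PLUS syntax, and the function names are hypothetical). It covers only the Bonferroni and Sidak adjustments, and the family-wise error formula assumes the comparisons are independent.

```python
def bonferroni_alpha(alpha, m):
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / m


def sidak_alpha(alpha, m):
    """Per-comparison level under the Sidak correction (assumes independent tests)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def familywise_error(per_test_alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - per_test_alpha) ** m


if __name__ == "__main__":
    m = 10  # hypothetical number of pairwise comparisons
    for name, fn in (("Bonferroni", bonferroni_alpha), ("Sidak", sidak_alpha)):
        a = fn(0.05, m)
        print(f"{name}: per-test alpha={a:.5f}, "
              f"family-wise error={familywise_error(a, m):.5f}")
```

Under independence, the Sidak level recovers the target family-wise rate exactly, while Bonferroni is slightly more conservative.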
Parametric, semi-parametric and nonparametric regression models are available to the user. Linear regression approaches (OLS, weighted, and generalized least squares) are also included. Nonlinear regression procedures include parametric nonlinear, logistic and Poisson (log-linear) techniques. S-PLUS, like the new SAS modules, provides local regression (loess) and robust regression, including least median of squares, least absolute deviation, and M-estimator procedures. Generalized additive regression is also available with Gaussian, binomial, Poisson, Gamma, inverse Gaussian and quasi-likelihood families.
Some other advanced S-PLUS statistical techniques include tree-based models, cluster, multivariate, survival and time series analysis. The tree-based models build classification and regression tree structures. The display of these top-down structures shows how the dependent variable at the apex is explained by nominal and continuous predictors. Tree-based models may be used as explanatory or classification systems. S-PLUS cluster analysis can be used to classify objects with K-means, agglomerative hierarchical cluster analysis, and partitioning and fuzzy partitioning, among others.
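As an illustration of what K-means clustering does, the following is a minimal sketch of the standard Lloyd iteration in Python (not S-PLUS syntax). The function names, the sample points, and the choice of caller-supplied starting centers are all assumptions made for the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and center update.

    `centers` are caller-supplied starting centers (an assumption of this
    sketch; real packages usually choose or refine them automatically).
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return labels, centers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(pts, [(0.0, 0.0), (10.0, 10.0)]))
```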
Other multivariate techniques featured in S-PLUS are principal components analysis and factor analysis for common factor extraction and definition, three types of discriminant analysis along with error rate analysis and cross-validation for formulating functions that maximally discriminate among groups, and a MANOVA procedure.
S-PLUS
offers survival analysis for those who wish to study the individual or comparative
duration or reliability of phenomena. Its procedures include Kaplan-Meier analysis
with both left and right censoring, mean and median survival computation,
and the comparison of stratified survival curves. Cox regression and penalized
Cox regression models are available, and parametric survival models can be
fitted with a wide array of distributions (including the Weibull, smallest
extreme value, logistic, log-logistic, normal, log-normal, exponential,
log-exponential, Rayleigh, and log-Rayleigh). Both individual and cohort expected survival analyses are available.
Cohort analysis can use exact, Hakulinen and conditional methods. Hazard rate
tables and Cox models can be used along with tests for significance.
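For readers unfamiliar with the Kaplan-Meier estimator mentioned above, here is a minimal sketch in Python (not S-PLUS syntax) that handles right censoring only. The function names and the sample data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve for right-censored data.

    times  : observed durations
    events : 1 if the event occurred at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all subjects sharing this observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed  # events and censored cases both leave the risk set
    return curve


def median_survival(curve):
    """First event time at which survival drops to 0.5 or below (None if never)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


if __name__ == "__main__":
    curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
    print(curve, median_survival(curve))
```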
Time series analysis can be done in the time or frequency domain. Models of univariate time series may be developed with ARIMA, autoregression, and spectral analysis. The ARIMA models include simple models, seasonal models, ARIMA with regression variables, simulations and forecasting.
The autoregression models include univariate autoregression models, multivariate autoregression models and procedures for finding the roots of a polynomial equation. These procedures lend themselves to univariate and multivariate modeling of time series, as well as intervention and transfer function analysis. There are also procedures for long memory time series models, including fractionally differenced ARIMA models and their simulation. Robust methods for smoothing time series with outliers are available, including generalized M-estimates, robust filters, and robust smoothers.
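The simplest member of the autoregressive family is the AR(1) model. A minimal sketch in Python (not S-PLUS syntax) is given below; it estimates the coefficient by the lag-1 sample autocorrelation (the Yule-Walker estimate) and produces forecasts that decay toward the series mean. The function names are hypothetical.

```python
def fit_ar1(series):
    """Estimate the mean and AR(1) coefficient via the lag-1 autocorrelation."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    numer = sum(dev[t] * dev[t - 1] for t in range(1, n))
    return mean, numer / denom


def forecast_ar1(series, steps):
    """Forecast `steps` values ahead; each step shrinks the deviation by phi."""
    mean, phi = fit_ar1(series)
    last_dev = series[-1] - mean
    forecasts = []
    for _ in range(steps):
        last_dev *= phi
        forecasts.append(mean + last_dev)
    return forecasts


if __name__ == "__main__":
    x = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # hypothetical alternating series
    print(fit_ar1(x), forecast_ar1(x, 2))
```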
For those interested in studying more or less continuous series in the frequency domain, S-PLUS provides a spectral analysis toolkit. Its tools include the usual spectrum estimation by periodogram or by autoregression, convolution or recursive filters for modeling univariate or causal processes, and band-pass filters for complex demodulation of nonstationary series.
For analysts interested in estimating and forecasting risk, particularly in financial applications, S-PLUS at NYU also has a GARCH module. This module provides the analyst with a whole family of ARCH (autoregressive conditional heteroskedasticity) and GARCH (generalized ARCH) models, proposed by Robert Engle in 1982 and Tim Bollerslev in 1986, respectively. They include the extended GARCH, power GARCH, exponential GARCH, threshold GARCH, GARCH in mean and multi-component GARCH models, along with an array of multivariate GARCH models.
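The core of the basic GARCH(1,1) model is a recursion in which today's conditional variance depends on yesterday's squared shock and yesterday's variance. A minimal sketch in Python (not S-PLUS syntax) follows. The parameter values are hypothetical rather than estimated, and starting the recursion at the unconditional variance is one common convention assumed here.

```python
def garch11_variance(residuals, omega, alpha, beta):
    """Conditional variance path of a GARCH(1,1) model.

    sigma2[t] = omega + alpha * residuals[t-1]**2 + beta * sigma2[t-1]

    Returns len(residuals) + 1 values; the last one is the one-step-ahead
    variance forecast. Parameters are taken as given, not estimated.
    """
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        raise ValueError("need omega > 0, alpha, beta >= 0 and alpha + beta < 1")
    # Start at the unconditional (long-run) variance.
    sigma2 = [omega / (1.0 - alpha - beta)]
    for eps in residuals:
        sigma2.append(omega + alpha * eps * eps + beta * sigma2[-1])
    return sigma2


if __name__ == "__main__":
    shocks = [0.0, 3.0, 0.0, 0.0]  # hypothetical mean-zero return shocks
    for t, v in enumerate(garch11_variance(shocks, omega=0.1, alpha=0.1, beta=0.8)):
        print(t, round(v, 4))
```

A large shock raises the next period's variance, which then decays back toward the long-run level at a rate set by alpha + beta.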
For analysts whose data are rooted in space or area, S-PLUS also contains a module for spatial analysis. Epidemiologists studying disease clusters, environmental scientists studying pollution diffusion, or criminologists studying crime clusters might use exploratory spatial techniques, including contour plots and variogram clouds. They might also make use of empirical and model variograms, as well as the spatial correlation and regression procedures provided by this module. Other features include resampling techniques (bootstrapping and jackknifing) as well as a number of quality control procedures (control and Shewhart charts, plus procedures for process and capability analysis).
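The two resampling techniques mentioned above can be sketched briefly in Python (not S-PLUS syntax). Both functions estimate the standard error of an arbitrary statistic; the function names, the number of replications, and the seed are assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_se(data, stat, reps=2000, seed=0):
    """Bootstrap standard error: resample with replacement, recompute stat."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = [stat([data[rng.randrange(n)] for _ in range(n)])
                 for _ in range(reps)]
    return statistics.stdev(estimates)


def jackknife_se(data, stat):
    """Jackknife standard error: recompute stat leaving out one case at a time."""
    n = len(data)
    leave_one_out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    centre = sum(leave_one_out) / n
    return ((n - 1) / n * sum((v - centre) ** 2 for v in leave_one_out)) ** 0.5


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
    print("bootstrap:", bootstrap_se(sample, statistics.mean))
    print("jackknife:", jackknife_se(sample, statistics.mean))
```

For the sample mean, the jackknife estimate coincides with the familiar formula of the sample standard deviation divided by the square root of n, which gives a quick sanity check.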
S-PLUS, STATA and SAS all stand out in power, capability and flexibility among general purpose statistical packages for professional statisticians and serious graduate students who have need of advanced modeling procedures. For specialized applications in limited dependent variable and panel data analysis, LIMDEP is very popular. For specialized applications in time series analysis and forecasting, noteworthy packages are FORECAST PRO, THETA, AUTOBOX, EVIEWS, STAMP, PCGIVE, PCFIML, RATS, and SHAZAM.
For faculty members who
cannot afford to spend a lot of time counseling their students on the proper
applications of the statistical techniques they employ, both SPSS and STATA have
a well-deserved reputation for being versatile and user-friendly general purpose
statistical packages. For more information on using S-PLUS or other statistical
packages at NYU, please contact either Bob Yaffee at (212) 998-3402, or Frank
LoPresti at (212) 998-3398.
Posted February 16, 2001