by
Robert A. Yaffee
and
James T. Austin
9-15-94
Presented at the New York University Graduate Sociology Methodology
Workshop on 8 March 1995
ABSTRACT
Time and change are important concepts for many scientific disciplines.
We review statistical procedures that enable higher education researchers
to match theories of time and change with appropriate analyses. The primary
motivation for using event history analysis lies in the problems that arise
when studying duration of time until an event occurs and when characterizing
distributions of these phenomena. We discuss three procedures for nonrepeatable
events: life tables analysis, discrete-time logistic regression, and Cox
regression. We then consider repeatable events. We organize the presentation
around an extended example of attrition using a large sample of high school
teachers followed over 12 years. Extensive annotated computer syntax and
output are presented. Complex mathematical material is limited to an appendix.
One of our major conclusions is that higher education researchers should
consider reframing their substantive research questions to focus on time
and duration to take advantage of these sophisticated techniques.
DISCRETE-TIME EVENT HISTORY MODELS FOR HIGHER EDUCATION
The study of time and temporal change characterize multiple scientific disciplines (McGrath & Kelly, 1988; Thierry & Meijman, 1993), including higher education (Bloom, 1963). Researchers in higher education often study processes that operate over defined intervals. Many interesting phenomena are described by duration culminating in a terminating event, say completion of a college major. The time span of faculty retention, the length of time before students drop out or graduate, or the time taken to reach various faculty levels are topics of interest to researchers (Singer & Willett, 1993; Willett & Singer, 1991a, 1991b). Time is a frame of reference along which phenomena are observed. Reference to time permits assessment of stability and developmental trajectories in the phenomena. Repeated observations permit estimation of the rates of change, and they also allow researchers to observe the termination of the event under examination. The time axis is often defined in discrete units (years, months, semesters); therefore discrete time event history analysis is the type of analysis introduced here.
Such studies have an explanatory as well as an observational role. To
have explanatory power, hypotheses about the phenomena require invariant
generalizations, spatial contiguity, covariation, and temporal sequence
(Cook & Campbell, 1979; Nagel, 1979, p. 74). Aligning phenomena within
measured intervals is one way to conduct such analyses. The graphical display
and statistical analysis of survival times comprise event history analysis.
This analysis is becoming increasingly popular (Klein & Goel, 1992),
not just in educational research, but beyond (Blossfeld, Hamerle, &
Mayer, 1989; Miller, 1981). In higher education, researchers are interested
in several populations: students, faculty, administrators, and support
staff. Event history analysis applies to the study of student retention
(Stage, 1988; Willett & Singer, 1991a), faculty tenure (Mueller, 1986),
and job duration of high school teachers (Murnane, Singer, & Willett,
1988; Sclan, 1993; Singer & Willett, 1993). Thus, the duration and
sequencing of events, along with predictors of duration, comprise inputs
to event history analysis. Such procedures extend capability to study temporal
processes and their determinants.
Organization of this Chapter
Several event history techniques for discrete-time data are presented in this chapter. Before discussing them, issues in the study of duration are addressed. These issues include retrospective and prospective longitudinal designs, data collection using a life history calendar, differentiation of person- and event-oriented datasets, censoring indicators, and treatment of time-varying covariates.
Following those preliminaries, four techniques are presented. Three techniques are for nonrepeatable events. We present life tables analysis, which describes survival time distributions and compares distributions of different groups, discrete-time logistic regression, which models the probability that of an event within a given period as a function of one or more covariates, and Cox regression (Cox, 1972), which models the hazard probability (hazard rate) as a function of a baseline hazard rate and a parameter vector. Then we switch from nonrepeatable to repeatable events and briefly discuss multispell models with one-way transitions. For all procedures, underlying assumptions are examined and computer examples are used to illustrate how to conduct valid analyses. Personal and mainframe computer applications are illustrated, but storage requirements may make the mainframe a better option for datasets with either many cases or many observation intervals.
An extended example used in this chapter is a study of the job duration
of beginning high school teachers (Murnane et al., 1988, 1989). Using large
samples from three states, generalizations about the career paths of high-school
teachers in the United States were developed. First, however, we elaborate
several problems in the analysis of durations to show the advantages of
event history analysis.
PROBLEMS OF STUDYING DURATION
Certain problems are specific to longitudinal studies (Allison, 1982;
Blossfeld et al., 1989). Clear definitions of events, including states
of existence and state transitions, are crucial. Accurate measurement of
beginning and ending times for the first state is needed to calculate the
duration of the phenomenon. Clear transitions between states allow for
an unambiguous measurement of the duration. Special designs, data collection
procedures, dataset construction, variable coding, and treatment of cases
are required to deal with duration data. If, for instance, researchers
are interested in the topic of teacher attrition, they must provide a clear
definition of the change of state (teaching to nonteaching). Teaching can
be defined as conducting classes, but also giving homework and exams. Teaching
is usually not defined as supporting teachers. Thus, leaving teaching
means ceasing to conduct these essential classroom activities. Next, we
consider retrospective and prospective longitudinal designs.
Retrospective and Prospective Designs
Retrospective designs feature interviews of respondents consisting of questions asking them to recall the time span of events and other personal characteristics. Such designs are cheaper and less time-consuming than prospective interviewing; however, they rely heavily on respondent recall. Because respondents are asked to recall beginnings and end-points of the phenomenon as part of data collection, questions should be organized to facilitate recollection of events and timings. From the start and end dates, duration is calculated. Other phenomena associated with this duration are also measured. First, focus is placed on key life events and then on the duration of interest associated with beginning and end points of key life events. Once such key events are recorded, memories of other events are linked. The data are collected for analysis by guided recall.
Prospective observation designs require monitoring over time,
during which particular events may occur. Such plans are more costly and
time-consuming than retrospective ones and also require increased training
and supervision. But they are accurate because they avoid drawbacks of
differential memory abilities and lapses of temporal frames of reference.
In prospective designs, respondents are reinterviewed at discrete points
over the period of observation. The observation period commences with the
beginning of the study and lasts till the occurrence of an event. If the
person drops out of the study for some other reason, is lost to follow-up,
or did not experience the event under study during the observation period,
that nonappearance must be coded appropriately. We call such observations
censored.
Life History Calendar
We recommend that data be collected and entered using a Life History Calendar (LHC; Freedman, Thornton, Camburn, Alwin, & Young-Demarco, 1988). This calendar matrix has rows representing variables and columns representing time periods. Within columns, measurements are taken on variables for persons or other units of analysis represented by rows. Thus, changes in the values of the variable are recorded over time.
According to Freedman et al. (1988), the LHC improves the quality of retrospective data collection by facilitating visual and mnemonic timing of various events. First of all, a calendar-time framework provides a basis for the recall of events. More easily remembered events -- marriages, births, deaths -- are requested first and then serve as temporal markers for responses to other questions. More detailed sequences of events appear to be easier and more efficiently recalled using a framework of associations already established by the LHC. However, data evaluating these purported improvements provided by the LHC are required.
The time-axis in a LHC must be carefully conceptualized. Temporal units should provide the proper resolution for defining survival time. Units of time should be neither too small nor too large for study of the phenomenon. On the one hand, if the temporal units are too large, then all durations may fall within the first few intervals or periods. Differences between the lifetimes or survival times may be obscured by such gross time-line construction. On the other hand, if the temporal divisions are too fine, the number of observations within each of them may become small. Then the data become sparsely distributed, so that nonparametric analyses may be the only appropriate kind of analysis.
We recommend that, if erring in the construction of period length, it is best to make the intervals too small. If the intervals are too fine, they can always be aggregated to expand their time span. Aggregated intervals can provide alternative temporal definitions for the analysis. What a researcher cannot do, if the periods are too large, is to subdivide them for better temporal definition. This is a major rule of thumb for defining the temporal periods in a LHC.
The LHC presented in Table 1 helps to illustrate this tool. Suppose that the topic is teacher attrition and the researchers hypothesize that some teachers leave their careers because they marry and/or have children. The LHC uses temporal marker variables along with a variable indexing time spent teaching as follows. Calendar years and dates are indicated in the first row, months are found from the third and later columns. In this segment of the LHC, it can be seen that the teacher is 29 in December and that in the month of August turns 30. Marital status, read from the second row, changes from single to cohabitation with a partner in January, 1991. By April, the partners get married and departure from teaching occurs the month before marriage. Departure from teaching is temporally associated with an important life event, thus facilitating recall.
Table 1
Sample Life History Calendar Segment for One Person
|
Age |
29 |
Ja |
Fe |
Mar |
Apr |
May |
Jun |
Jul |
30 |
Sep |
Oct |
|
Marital |
S |
Coh |
Coh |
Coh |
Mar |
- |
- |
- |
- |
- |
- |
|
Teachng |
T |
T |
T |
Lvs |
- |
- |
- |
- |
- |
- |
- |
|
Key |
Marital S = single, Coh = cohabiting, Mar = Marriage Teaching T = teaching, Lvs = leaves teaching |
||||||||||
Censoring
Censored events are those that are not observed within the study. An event may be censored for one of several reasons. If a respondent dropped out of a study for reasons other than those of the terminating event, the observation of the event is censored. If such a person moves away without advising researchers and without leaving a forwarding address, then that respondent, if not otherwise traceable, is lost to follow-up. Similarly, if the person does not experience the event before the end of study, this censoring would also have to be recorded. The observational time frame of the study -- in this case, a single spell or single episode (where the event is not repeated) -- is divided into discrete units called intervals or periods. If the observation episode is terminated by an event, the censoring variable could be coded as a 0 for the relevant periods; if the event does not occur -- in this example, leaving teaching -- then the censoring indicator for the event would be coded as a one. Within a period, if one forms a fraction by dividing those actually experiencing the event by those at risk -- that is, capable of experiencing the event -- then this estimate uses the censored observations as part of the denominator of those at risk but not as part of the numerator (those experiencing the event). This hazard rate is formulated later, but recognize that such events are right-censored and must be coded as such.
Figure 1 illustrates several types of censoring within a single spell,
or one time frame. The vertical lines define the beginning and end of the
study along a time axis. Observations terminated with an x are cases
in which the teacher actually departs from the teaching field, that is,
the event is observed. Cases b, e, and f are individuals who leave teaching.
Observations without x's (cases a, c, d) ones in which the
individualdoes not depart the teaching field, may have withdrawn, or are lost to
follow-up. Such cases are censored.

Constructing Datasets for Analysis
Two types of datasets are used for event history analysis. Person-oriented datasets can be used if they are partitioned into intervals by the computer. They have an observation per respondent and have measures of duration already coded or computable from beginning and ending times. They also include a censoring indicator, indicating whether the event was observed or not. They may also include measured variables believed to be predictive of the occurrence of the event (i.e., predictors or covariates).
Another type of dataset, the event-oriented dataset, is often used for discrete time logistic and Cox regression analysis. The unit of analysis is not one observation per person; rather, the unit of analysis is the person-period. Each time that a person in a prospective study is interviewed, an observation occurs. If a person is interviewed ten times over the life of the study, that person has ten records. There is a record for each person-period. Each record includes identification number of the respondent, a censoring indicator indicating whether the event was experienced during that interval (assuming that the event was not experienced prior), and values of other variables assessed during that time period. The censoring indicator is a dummy variable, signifying if the teacher left teaching during the last time period the teacher was observed. A 0 indicates that the person left teaching in that period, whereas a 1 indicates that the person did not. Each record has a dependent variable score, with one value for no evidence of the departing event and another value for departure from teaching. If the censoring indicator is 1, this indicates that this person did not depart from teaching for the number of periods (s)he was observed in the study. If the censoring variable is 0, then the person left teaching at the time of the last observation. For that last record, the dependent variable score would be a 1, unlike the scores for the other periods for that person, indicating no departure from the profession. Only the last record for a person will indicate departure from teaching, if the person's observations are not censored. An advantage of the person-period dataset is that it permits large sample analysis. If the study is retrospective, a person-period dataset will have a record for each period (in this case, year).
An example of a person-oriented data set is given in Table 2. In this dataset, the identification variable is an ID number and there is only 1 case per person. The event under study is departure from teaching. LASTPD indicates the last period in which the person taught. GENDER indicates sex of the respondent (0=male; 1=female). MARSTAT indicates marital status (0=single, living alone; 1=cohabiting, 2=married, living together; 3=separated; 4=divorced). Variables beginning with T indicate years that the person was teaching, with 1=teaching and 0=nonteaching. A person gets a 0 for the teaching variable if he or she left teaching while under observation. What is observed in the person-oriented dataset is that the first person, a single female, is observed teaching only for 1 period. She did not leave teaching while being observed and therefore was censored. She may have been transferred and thus lost to follow-up. The second person, a male, was observed for five periods and was also censored because he did not leave teaching while under our observation. We can see that marital status changed over time from single, living alone to cohabiting with a partner and in the second and third periods to marriage. The third respondent is a female observed for two periods. This person was not censored; she left teaching in the second period when her marital status changed from married and living with spouse, to divorced.
Table 2
Sample Segment of a Person-oriented Dataset
|
ID |
LstPd |
Censor |
Gender |
T1 |
T2 |
T3 |
T4 |
T5 |
MS1 |
MS2 |
MS3 |
MS4 |
MS5 |
|
01 |
1 |
1 |
1 |
1 |
. |
. |
. |
. |
1 |
. |
. |
. |
. |
|
02 |
5 |
1 |
0 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
1 |
2 |
2 |
|
03 |
2 |
0 |
1 |
1 |
0 |
. |
. |
. |
2 |
4 |
. |
. |
. |
For a person-period dataset, multiplying the number of cases by the number of observation periods yields the total records in the dataset. If we convert a person-oriented to a person-period dataset, Table 2 becomes Table 3. Note that the censoring variable pertains to the person, rather than to the person-period in these models. Annotated SAS and SPSS syntax to convert person-oriented to person-period datasets are presented in Appendices 1 and 2 (with output).
Table 3
Sample Segment of a Person-Period Dataset
|
ID |
Censor |
Gender |
Teachg |
Marital |
|
01 |
1 |
1 |
1 |
1 |
|
02 |
1 |
0 |
1 |
1 |
|
02 |
1 |
0 |
1 |
0 |
|
02 |
1 |
0 |
1 |
1 |
|
02 |
1 |
0 |
1 |
1 |
|
02 |
1 |
0 |
0 |
2 |
|
03 |
0 |
1 |
1 |
2 |
|
03 |
0 |
1 |
0 |
4 |
Statistical Issues
Not only do conceptual and data management problems accompany duration data, but statistical problems arise. The simplest model is one in which the observation period is divided into discrete intervals. Discrete-time event history analysis studies the survival function, defined as the curve of probabilities of surviving until a particular time, provided that a person has not experienced the event under study. For example, given that undergraduates have completed 1, 2, 3, or more years of a major, what are the probabilities that they will drop out. Second, overall survival functions may be subdivided by one or more grouping variables into strata. One group may be experimental, the other a control group; one group may be males and the other females; or one curve may represent majority group members and another minority group members. We can test for significant differences in survival functions between strata. This approach treats the grouping variable as if it were a factor in an ANOVA. Lastly, we can use a regression analog in which the dependent variable is the logit, the natural log (ln) of the odds, of experiencing the event (dropping out of teaching). We use dummy or continuous predictor variables (covariates) to predict the logit.
All of these procedures must be able to deal with censoring. Censoring is defined as right or left, but we are concerned first with right-censoring. For right-censored cases, durations are not fully known. Some persons may not experience the event until after the study is finished. They have somehow discontinued their participation. These cases are right-censored. If it can be assumed that these observations are randomly distributed, then they do not bias parameter estimates (Teachman, 1983). An indicator must be created to signify the censoring status for each person (in computer syntax, we call this variable LEFT). When the person-record is not censored, an event is indicated in the last period observed, and such records are included in the number of persons leaving teaching for that period. When the record is censored, then the person is included in the risk set although the record is not included in the number of events for that period. Half of the censored records from that interval are subtracted from the denominator to adjust the sample size to provide for the corrected size of the risk set in the calculation of the hazard rate. A censoring indicator is required by most packages with life tables and Cox regression.
Another statistical issue for longitudinal analyses is that of covariates.
If the observations are made with a retrospective design, variables measured
for analysis may include both time-independent and time-dependent covariates.
Time-constant covariates do not change over time. Obviously, the
gender and ethnicity of a respondent usually do not change over time. These
variables have the same value at each period for the same person. Time-varying
covariates do change over time. Their values sometimes change in one or
more of the intervals of the study. Holding the MA degree can be a time-varying
covariate. In our teaching example, there would be up to 13 dummy variables
(named MADEG), one for each observation period of the study. The value
of this variable would remain 0 until the period in which the person received
the MA. At that time and for the remaining periods that the person was
included in the study, the value for this variable would be 1. Family income
is another example of a time-varying covariate. At one point, a family's
income may be below the poverty line. At a later point, their income may
be above the poverty line. Persons interviewed may have varying values
of family income each time that they are interviewed. Therefore, over time
family income could change from below the poverty line to above it. Such
changes bias the ordinary least squares (OLS) regression model applied
to person-oriented datasets, because no provision is made for changes in
the value of the covariate over time.
LIFE TABLES ANALYSIS
We begin by examining the survival time of the event of interest, which marks a change or transition from one state to another. The survival time (waiting time) is the length of time from the beginning of the study until the event occurs. If there is only one terminating event, then we say there is a single spell. If there are repeated events, in the present case repeated spells of teaching employment, then the analysis is called a multispell analysis. For didactic reasons, single spell analyses are emphasized here, and a model for repeated events is presented later. The time from which the events are measured is the beginning of the spell. The time at which the terminating event occurs is the end of the spell. If we are examining events that can have only one termination, like death, this first spell is the only spell. Time till first dropout of new teachers is a single spell analysis.
Within the total spell, we divide up the time interval of the study into periods. Periods are intervals within which we calculate survival time and probability. In higher education, many educational phenomena occur within discrete intervals (academic terms, years) and thus are amenable to such division. These are the times at which all of the sample is observed. These units of measurement are also useful in defining the risk set, defined as all those who may experience the event. The survival probability of the event is the probability that those in the risk set, defined as the proportion of those at risk of experiencing the event during a given period, survive the interval without experiencing the event. The survivorship function is a plot of survival probabilities over time (Singer & Willett, 1993, p. 7). Using the dataset, we analyze the job duration of new American high school teachers, using the time until leaving teaching measured in Michigan, North Carolina, and Colorado.
Within intervals, the probability of survival is the likelihood that a person will not experience the event of departure from teaching until some succeeding interval. Persons in the risk set who do not experience the event within the time of the interval are considered censored. The number of persons who actually experience this event, divided by the number of persons in the risk set and adjusted for those cases who are censored, defines the probability of a person within the risk set experiencing the terminating event. Censoring adjustment assumes a random distribution of cases within the interval. The midpoint of the number of cases censored is subtracted from the total number of cases at risk to yield an effective sample size. The number of cases experiencing the departure from teaching (here defined as failure) is the rate of failure for that interval. The complement of the failure probability is the survival probability. Examples of calculations are found in Table 4.
Life Table Survival Estimates
Panel A
EFFECTIVE CONDITIONAL
INTERVAL NUMBER SAMPLE PROBABILITY
[LOWER, UPPER) FAILED CENSORED SIZE OF FAILURE
0 1 0 0 63329.0
1 2 10304 1849 62404.5 0.1651
2 3 6053 1815 50268.5 0.1204
3 4 4722 1439 42588.5 0.1109
4 5 3432 1668 36313.0 0.0945
5 6 2754 1417 31338.5 0.0879
6 7 1947 2144 26804.0 0.0726
7 8 1455 3059 22255.5 0.0654
8 9 919 3282 17630.0 0.0521
9 10 630 2654 13743.0 0.0458
10 11 388 3079 10246.5 0.0379
11 12 188 3321 6658.5 0.0282
12 . 56 4754 2433.0 0.0230
Panel B
SURVIVAL
INTERVAL STANDARD
[LOWER, UPPER) SURVIVAL FAILURE ERROR
0 1 0.0000 0
1 2 1.0000 0
2 3 0.8349 0.1651 0.00149
3 4 0.7344 0.2656 0.00178
4 5 0.6529 0.3471 0.00194
5 6 0.5912 0.4088 0.00202
6 7 0.5393 0.4607 0.00207
7 8 0.5001 0.4999 0.00210
8 9 0.4674 0.5326 0.00213
9 10 0.4430 0.5570 0.00217
10 11 0.4227 0.5773 0.00221
11 12 0.4067 0.5933 0.00228
12 . 0.3952 0.6048 0.00236
In the first interval, beginning at T0 and ending at T1 (end of the first year), no one in the three states left teaching. All persons were accounted for and none were censored, thus, the risk set is the effective sample size. As can be seen in Panel B of Table 4, 100% of those at risk survived and the probability of survival during this first year is thus 1.0. The failure rate is the complement of the survival probability. That is, subtract the survival rate from 1.0 and the result will be the event or "failure" rate. Now, suppose there is censoring. In the second year, extending from T1 to T2, not only do 10,304 persons leave teaching as indicated under the number failed, but 1849 observations were censored. The number at risk is adjusted under an assumption of constant (uniform) censoring to obtain the effective sample size. When these calculations are done at the midpoint of the interval, half of those censored are deducted from the number at risk to produce the effective sample size. Half of 1849 is 924.5 and this value is subtracted from the effective sample size in the previous interval (here, year) to obtain the effective sample size of the present interval, or 62404.5. By extension, calculation of the effective sample size of the third year (T2 to T3) is accomplished by taking the effective sample size of the previous year, 62404.5, and subtracting from it 10304, the number who departed teaching in the previous interval. The remainder is 52100.5. Then one-half of those censored in the previous year (1849/2 = 924.5) is added to half of those censored in the current year (1815/2 = 907.5), yielding 1831.5. The sum of half of the censored observations from both intervals is subtracted from the 52100.5, leaving 50268.5. The effective sample size is recalculated in this way for each period.
Information is contained in the survival and hazard rates for the groups under study. Survival time is time from the beginning of the observation until the event, in this case the departure of teachers from their field. To compute the discrete survival probability, take the proportion of teachers not leaving by the end of that year. This conditional probability for a period is multiplied by the probability of survival for the previous period. This result gives the addition to the failure rate that takes place from the current to the next period. The survival function is formed by obtaining the resulting failure rate for the next period and subtracting this newly formed failure rate for the next period from 1.0. The result is the survival probability for that next period. A survival function formed this way is a decreasing function that may or may not approach 0.
An example will help to illustrate the calculations. Let us examine the rates in the period from T1 to T2. The conditional rate of failure for this period is .1651 as seen in panel A of Table 4 above. This value is multiplied by the survival rate of the previous period, 1.0. The product is also .1651 and that is the amount added to the zero probability of failure for T1 to T2 to yield the probability of failure for T2 to T3. This amount is subtracted from 1.0 to yield the survival rate for T2 to T3, .8349. Similarly, to compute the failure rate for the third period, T3 to T4, we multiply the conditional rate of failure for T2 to T3, .1204, by the survival rate for the period, .8349, to obtain .1005, the value added to the cumulative failure distribution to give .2656 for T3 to T4. This amount is subtracted from 1.0 to obtain the survival rate for that period (.7344). The standard error of the survival rate, shown in Panel B of Table 4, is defined as the square root of the sum of the conditional rates of failure for the intervals from 1 to interval i, divided by the effective sample size and multiplied by the complement of the conditional rate of failure for each interval. In this way confidence intervals and statistical tests for comparing survival functions can be obtained.
Censored observations influence the hazard rate. They may increase the
denominator of the hazard rate and deflate the hazard rate somewhat, or
they may flatten it out and keep it from reaching 0. From the formula,
the existence of censored observations slows down the decrease in the survival
rate. At the beginning, the proportion surviving is 100 percent. The probability
of those in the study surviving until the end of the first interval is
1.0. As time passes, more persons experience the event, and the proportion
remaining in the field declines. By the end of year 6, some 50 percent
of the teachers from the states remain in the field, as illustrated in
Figure 2. This is the median survival time, useful for comparing
survival times that include censored and uncensored cases. The median survival
time in the graph of the survival function is at the intersection of the
lines, where 50 percent of the cases survive beyond the 7-year period.
The survival distribution function is formed by plotting the updated probabilities
over time.
Figure 2
SAS Lifetest Procedure
SURVIVAL FUNCTION ESTIMATES
SDF |
|
|
|
|
|
1.0 + A++++A
| +
S | +
U | +
R | +
V | +
I | A
V 0.8 + ++
A | ++
L | A
| ++
D | ++
I | A+
S | ++
T 0.6 + +A+
R | ++
I | +A++
B 0.5 |___________________________________ ++A+
U | | ++
T | | +A++++A++
I | | ++A++
O 0.4 + | ++A++++A
N | |
| |
F | |
U | |
N | |
C | |
T 0.2 + |
I | |
O | |
N | |
| |
| |
0.0 + |
| |
| |
| |
----+----+----+----+----+----+----+--+----+----+----+----+----+----+----
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Years of teaching
Basic Mathematical Functions of Event History Analysis
Three important functions are the survivor function, hazard rate, and probability distribution function. The survivor function is formulated as follows:
(1)
This function is computed by multiplying one minus the conditional probability of the event, (1-qi), for each period. The survival probability of the event is the product of the survival probability for each preceding period and the probability of survival for the current period.
The ratio formed by the number of events under examination (here, leaving teaching) to the effective sample size is the conditional probability of the event, sometimes called the hazard rate. This probability of the "event" is the probability of the event under study conditional upon the event of interest not occurring prior to the current period. These conditional probabilities by period are found on the right side of Panel A in Table 4.
The hazard function can also be calculated. The failure rate at the beginning of the interval is also called the conditional rate of failure (leaving teaching). If this conditional rate, referred to in Equation 1 as q(t), is known, then the survival rate can be computed. The survival rate is the complement of the failure rate, sometimes referred to as the cumulative distribution function. The survival probability = 1 - q(t). This can be seen in Panel B of Table 4 above and is the proportion of events that have taken place before the end of the year period. From the conditional probability of failure, the failure rate can be calculated. When this conditional probability of failure is multiplied by the survival rate of the previous period, the amount added to the failure rate is obtained to compute the failure rate of the next time period. This cumulative event function is the probability density function.
The probability distribution function (p.d.f.) defines the probability that an event takes place within a time period, even though the interval is small. The p.d.f. can be mathematically formulated as follows:
(2)
This density function is the height of the function defining the probability that an event occurs within the time interval, or the derivative of the cumulative event distribution function, as shown in Equation 2. From the survival function, S(t) = Pr(T > t), Blossfeld et al. (1989, pp. 33-34) derive the cumulative distribution function F(t) as follows.
(3)
(4)
From these functions, the hazard rate can be derived. The relationship between the hazard and survival rate, and the p.d.f. can be compactly formulated as: h(t) = f(t) / S(t). Thus, we can see that the hazard rate, h(t), is formulated as:
(5)
This function is interpreted as the instantaneous risk within any interval as the size of that interval diminishes. The graph of the hazard function is shown in Figure 3 and the cumulative hazard function is formulated as:
(6)
We know that from h(t) = f(t)/S(t) and Equation 3, we may derive Equation 7:
(7)
For example, from the hazard rate and the probability density function,
the survival rate can be inferred. From the survival rate and the cumulative
distribution function, it is possible to calculate the hazard rate. To
summarize, the graph of the hazard function in Figure 3 depicts the conditional
probability of the event over the time periods. From this figure, note
that the conditional probability of leaving teaching increases at to a
maximum of .18 and then falls rapidly. By interval 12, the end of the study,
the probability has declined to nearly .03.
Figure 3
AMERICAN TEACHER ATTRITION
HAZARD |
0.18 + A
| +
| ++
| ++
| + +
0.16 + + +
| + +
| + +
| + +
| + +
0.14 + + +
| + +
| + +
| + A
| + ++
H 0.12 + + ++
A | + A
Z | + +
A | + ++
R | + +
D 0.10 + + A+
| + ++
F | + +A
U | + +
N | + ++
C 0.08 + + +
T | + A+
I | + ++
O | + +A
N | + +
0.06 + + ++
| + +
| + A++
| + ++A+
| + ++
0.04 + + +A
| + ++
| + ++
| + A
| +
0.02 + +
| +
| +
| +
| +
0.00 + A
---+----+----+----+----+----+----+----+----+----+----+----+----+----+--
0 1 2 3 4 5 6 7 8 9 10 11 12 13
LENGTH OF FIRST TEACHING SPELL (SPELL-1)
Comparing Survival Rates
Survival functions may be compared for equality. For example, to decide whether teacher attrition is significantly different for males versus females, survival rates for males may be compared to those of females. The question is whether survival rates are related to gender. If these rates differ, then the homogeneity among the two groups should be significantly different. Two tests for this comparison are the Log-Rank (Mantel-Haenszel) and the generalized Wilcoxon (Gehan) test. A third test, the likelihood ratio test, assumes that the group data are exponentially distributed. This assumption is often violated, thus we focus on the former two. When we stratify the sample into groups, we request overlaid survival, hazard, and p.d.f. plots for inspection. An example of overlaid survival plots for male and female teachers is given in Figure 4. From the divergence, there appears to be a significant difference in teacher attrition among gender groups.
Figure 4
AMERICAN TEACHER ATTRITION SURVIVAL FUNCTION ESTIMATES
SDF |
1.0 + *++++*
| +
S | +
U | +
R | +
V |
I | *
V 0.8 + ++
A | ++
L | *
| ++
D | ++
I | *++
S | ++++B+
T 0.6 + ++ ++
R | A++ +B++
I | ++A+ ++B++
B | ++ ++B++
U | +A++ ++B++
T | ++A++ ++B++++B++
I | ++A++ ++B++++B
O 0.4 + ++A++++A++
N | ++A++++A
|
F |
U |
N |
C 0.2 +
T |
I |
O |
N 0.0 +
--+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
LENGTH OF FIRST TEACHING SPELL (SPELL-1)
LEGEND FOR STRATA SYMBOLS (A: SEX=F, B: SEX=M)
The log-rank test is a chi-square test with one df for each period. The expected number of departures for males is calculated as a proportion of the total number of departures divided by the total number of persons at risk multiplied by the number of males at risk. Similarly, the expected number of departures for females is computed as a proportion of the total number of departures divided by the total number of persons at risk in that interval, multiplied by the number of female teachers at risk. The log-rank test uses the difference between the total observed departures and the total expected departures, divided by the asymptotic standard error of departures. The differences are summed for the groups and yield an approximate chi-square statistic. The log-rank and the generalized Wilcoxon tests differ in that the weighting factor (omega) is 1.0 for the log-rank test and the size of the risk set for the generalized Wilcoxon test. A basic formulation of these equality tests is given in Equation 8.
(8)
In Table 5 below, SAS output for the equality of strata tests is given. Log-rank and generalized Wilcoxon tests are given for the groups and then the chi-square test results are given. The strata are significantly different in homogeneity for both tests (p < .001). Because the hazard rate is not constant, the likelihood ratio test is inappropriate and therefore is not addressed.
Table 5
Testing Homogeneity of Survival Curves over Strata
(Time Variable Spell-1)
___________________________________________________________________________
RANK STATISTICS
SEX LOG-RANK WILCOXON
FEMALE 723.61 25040843
MALE -723.61 -2.504E7
TEST OF EQUALITY OVER STRATA
PR >
TEST CHI-SQUARE DF CHI-SQUARE
LOG-RANK 91.6043 1 0.0001
WILCOXON 49.2133 1 0.0001
-2LOG(LR) 189.1879 1 0.0001
___________________________________________________________________________
Assumptions
For many analyses, it is assumed that intervals are equidistant, although
this assumption is not necessary. Life tables are useful for studying survival
times without continuous covariates. If categorical covariates are used,
then survival times may be stratified and compared with the equality of
homogeneity tests. This, life tables are useful for preliminary analyses
of hazard rates to determine what subsequent analyses might be appropriate.
They are also used in sensitivity analyses comparing uncensored and censored
groups.
Programs for Life Tables Analysis
A number of statistical programs perform discrete life tables analysis. Among them are SAS, SPSS, BMDP 1L, LIMDEP, and EGRET. The most popular mainframe packages are SAS, SPSS, and BMDP 1L. For the personal computer, SAS, SPSS, and LIMDEP are the most popular. We discuss SAS and SPSS command syntax in this section, explain SAS output, and provide SAS, SPSS, BMDP1L and LIMDEP command syntax for the discrete time life tables analysis in Appendix 4. SYSTAT SURVIVAL does not have a subroutine to analyze discrete time life tables, thus that package is not addressed here.
SAS syntax for life tables. The command syntax presented in Appendix 4 was used to generated the SAS output discussed above. The SAS procedure is invoked with the PROC LIFETEST command. The METHOD statement specifies the type of analysis. For discrete-time models, specify LT or ACT to indicate that a discrete-time analysis is requested. Otherwise, a Kaplan-Meier Product-limit analysis, inappropriate for discrete data, is produced. Because this is useful in analyzing residuals, an explanation of it is provided later. Plots are requested by letters, an S for survival function, H for hazard function, and P for the p.d.f. The LS and LLS options may be used for plots of the natural log of the survival function and the ln(-ln S) function. If the LS function is a straight line, the hazard function follows an exponential distribution, whereas if the LLS function is a straight line then the hazard function follows a Weibull distribution. Such information is helpful in determining the nature of the hazard function, especially for continuous-time models. Periodization is specified by the intervals subcommand. The single spell is divided into 13 one-year intervals for our example, time span is designated by the SPELL-1 time duration, and the censoring variable is LEFT. The 0 inside parentheses specifies the censoring value. If the event is not observed within the time span it is coded as 0. The statement STRATA SEX specifies that survival times should be analyzed by gender. Equality of homogeneity tests are then conducted to evaluate the differences.
SPSS syntax for life tables. SPSS syntax for the discrete time
life table is similar to that of SAS. Life table analysis is requested
for the length of the Spell-1 variable, broken down by the coded values
of sex (M = 1, F = 2). There are 13 one-year intervals. The STATUS variable
indicates the censoring variable. Unlike a SAS specification, the coding
specified is that for no censoring. Regular events are coded as 1 and all
others are considered censored. When all plots are requested, the survivor
function, its log plot, the hazard function, and the p.d.f. are provided.
COMPARE requests that the equality of homogeneity tests be conducted. SPSS
provides a modified Generalized Wilcoxon test called the Desu-Lee test,
which is sensitive to differences near the beginning of the survival function,
whereas the Log-rank test is sensitive to differences near the end of the
function (Blossfeld et al., 1989, pp. 126-127).
DISCRETE-TIME LOGISTIC REGRESSION ANALYSIS
When analyzing non-repeatable events, the survival, hazard, and p.d.f. are easily derived from one another. As shown in Equations 3 through 5, the survival function can be converted to the hazard function easily. Equation 9 shows how to derive the hazard rate from the probability density function:
(9)
Conversely, the survival function may be derived from the known hazard function, as shown in Equation 10. Although it is possible to begin with an analysis of survival times, it may be more useful to follow with an analysis based on the discrete time hazard rate. The discrete time hazard rate, or conditional probability of an event, is represented as h(t). This variable can be treated as a dependent variable in a regression analysis. It is the conditional probability of an event with respect to a baseline probability. The baseline probability is that for which the covariates (predictors) are at 0. If the probability of an event within a particular time period is h(t), then the probability that the event will not occur is its complement = [1 - h(t)]. The odds that the event will take place is the ratio of the conditional probability of the event to the conditional probability of the nonevent. When this ratio is compared to the baseline ratio, the odds ratio is constructed.
(10)
The odds ratio and its transformation, the logit, are
given in Equation 11; the standard error is easily computable. The logit
of the hazard rate can serve as a criterion for a logistic regression.
The logit is modeled as a linear combination of a parameter vector (consisting
of a constant term for the period plus the sum of the coefficients multiplied
by their respective variable values; Hosmer & Lemeshow, 1989, p. 239).
In Equation 12, note that a is the intercept for the jth period under consideration
and is the baseline hazard rate against which covariates are evaluated.
If both sides of the equation are multiplied by the baseline odds, then
Equation 12 is formed. Hence, aj is the baseline hazard rate
and the logit of the hazard is equal to the constant aj plus
the parameter vector.
(11)
(12)
Alternatively, the logit can be viewed as a function of the hazard rate. Equation 13 gives the transformation of Equation 12 as a function of the hazard.
(13)
But a nonlinear regression with the hazard rate as a function of a exponential parameter vector is as intractable as the multiplicative odds version of the equation. By taking the antilog of (exponentiating) both sides of this equation, the multiplicative odds form of the equation is derived, as shown in Equation 14.
(14)
The problem with both equations is that their components are nonlinear,
and thus less tractable than a linear form. The linear form of the regression,
given in Equation 12, is more manageable and is the form assumed throughout
this chapter.
Person-Period Dataset
Discrete-time logistic regression analysis is unlike standard logistic
regression in that it is not based on a person-oriented dataset. Rather,
this procedure is applied to each of the j periods of the spell.
Results are then aggregated so that person-periods are the cases rather
than observations. The number of persons is multiplied by the number of
discrete time intervals in the first spell. An advantage of a person-period
unit of analysis is that it enhances the sample size. SAS code for constructing
a person-period dataset and the resulting data is given in Appendix 1.
Similar SPSS code and output are given in Appendix 2. Now we turn to global
statistical tests of model fit, using maximum likelihood estimation (an
exposition of ML estimation and tests of model fit is provided in Appendix
3).
Global Tests of Model Fit
Several objective tests are available to assess global model fit. They compare the fit of models with parameters compared to a model without the parameters. Prominent among these tests are the likelihood ratio test, Wald chi-square, Score test, Pearson goodness of fit test, and the pseudo-R2. In addition, the Akaike Information Criterion (AIC), Schwartz Criterion (SC), Gamma, Somer's D, and classification table provide useful information about model goodness of fit. The classification table is a very useful summary measure. When the ordered response is 1 for an event and the ordered response is 2 for no response for SAS (SAS models the lower rather than the upper probability), and 1 for an event and 0 for a non-event in SPSS, a classification table is produced. The most complete version of this table is provided by SAS. Both SAS and SPSS calculate the percent correct (hits). But only SAS outputs the sensitivity (percent of event responses predicted correctly) and the specificity (percent of the non-events predicted to be nonevents). SAS also calculates the false positive rate, or the percentage of predicted event responses that were observed to be non-events, and the false negative rate, the percentage of predicted non-event responses observed to be events. A SAS classification table and its associated statistics are given in Table 7 and in Appendix 7. An expanded version of the SAS classification table is found in SAS version 6.07 and 6.08 (Appendix 15).
Table 6
Effect Coding for Categorical Variable of Race
|
RACE |
D1 |
D2 |
D3 |
|
White |
-1 |
-1 |
-1 |
|
Black |
1 |
0 |
0 |
|
Hispanic |
0 |
1 |
0 |
|
Other |
0 |
0 |
1 |
Logistic Regression Coefficients
Logistic regression coefficients are interpreted as the amount of change in the logit that is associated with a unit change in the independent variable. If the coefficient is positive, then the logit change associated with a unit change in the independent variable is positive. If the sign is negative, then there is a decrease in the logit associated with a positive change of one in the independent variable. If the dummy variable coefficients were exponentiated-- that is, the natural log raised to that power of the coefficient -- then the transformed coefficient would represent the change in the odds of the event taking place owing to a unit change in the independent variable. Once this odds transformation is performed, coefficients greater than one indicate a positive change in the odds of the dependent variable, whereas coefficients less than 1.0 indicate negative changes in the odds of the dependent variable. To convert this change in the odds to a percentage, one uses the following transformation, which does not hold if the coefficients are not coded as dummy variables.
(15)
If the dichotomous variables are coded as deviations from means, or reference cell coding, then simple exponentiation of the logistic regression coefficient will not give the correct odds change in the dependent variable. Suppose that the gender variable is coded as 1 for male and -1 for female. Then the effect coding of categorical variables gets even more complicated. This kind of coding is sometimes used when the predictors are categorical. Hosmer and Lemeshow (1989, pp. 47-52) show what happens in the coding of a race variable with four categories. The categories are white, black, hispanic, and other. With effect coding there is a need for three variables; Table 7 contains the effect codes for the variables D1, D2, and D3.
Table 7
Classification Table for Logistic Regression
|
|
-------------------------------Predicted --------------- |
|||
|
|
|
Event |
No Event |
Total |
|
|
Event |
16 |
2 |
18 |
| Observed
|
No Event |
5 |
4 |
9 |
|
|
Total |
21 |
6 |
27 |
Sensitivity = 88.9% Specificity = 44.4% Correct= 74.1%
False Positive rate = 23.8% False negative rate =
33.3%
For effect coding, the white category, will have -1 for each of the dummies. To obtain the logit for the black-white comparison, Hosmer and Lemeshow (1989) show that the greater the number of categories, the more difficult it is to convert the logit coefficient to the correct odds. To obtain the proper effect coding of black vs. white, the coding in Equation 16 should be used.The results are not the mere exponentiation of the parameter estimate. For this reason we recommend that such effect coding be avoided in running discrete time logistic regression. analysis.
(16)
If the variables are measured as integers, then the above formulae do
not hold. A change in 10 units of the independent integer level variable
then would lead to a percentage change of (exp(b)10 - 1)*100
in the rate of the dependent variable. In fact, if the coefficients are
coded as integer level variables, then the general formula in equation 17 should
be used.
(17)
Statistical significance of the logistic regression coefficients may be assessed in two ways. One of these methods was just described and utilizes the Wald chi-square test, although the statistic becomes somewhat unreliable as the ratio exceeds four. Therefore, we suggest a more reliable method, which constructs a model by assessing -2LL for the model without the variable to be tested. Then -2LL is calculated for the model with the variable(s) added. The ratio of these two numbers is the LR chi-square for the newly added variable(s). It is distributed as a chi-square with one degree of freedom. This method appears to be more reliable than the Wald chi-square and we recommend it.
Nonetheless, we do not recommend the stepwise option in discrete-time logistic regression. Stepwise methods may unwisely eliminate variables that should be included, if the subsequent inclusion of other variables causes the significance of a variable to drop below a particular statistical limit. For instance, main effect components of an interaction might be eliminated by such an algorithm. If the product vector is left in the model but the main effects are dropped out, then the interaction is improperly specified. As the number of variables in the model change, the listwise deletion option may bring about a change in the number of observations, upon which the likelihood ratio chi-square is based. In this way an automatic stepwise algorithm might damage the modeling process theoretically.
Time-dependent covariates. Not only is it necessary to consider the matter of coding the time-invariant covariates, it is necessary to consider the coding of time-dependent covariates also. The interpretation of these coefficients depends on their correct coding. For convenience, categorical and dichotomous variables should be coded as dummy variables.
Event-functional covariates. Some time dependent covariates vary as a function of the occurrence of an event. Some covariates refer to states. If we are examining the hazard rate of job change, it has been hypothesized that marriage might have an effect on the hazard rate of job shifts. To code the event of marriage, a dummy variable might be used. The coding of the marriage variable before an event is 0, and the coding of that variable after the time at which the event took place is 1. Dummy variable coding for before and after a particular time is simple. The variable is given a 0 before the event for each interval coded. Suppose an event occurs for an individual within period 10. At the time of the event, the variable is given a 1 indicating that the event has occurred (in other words: Event = 0; if interval > 9 then Event = 1). These codes may be used in both discrete-time and continuous-time models.
Time-functional covariates. There are some variables that are a function of calendar time. Age increases as a function of time. This functional relationship may be coded easily. Let us assume that a person was born in 1910. The age of the person will increase by 12 months as each year passes. If the intervals are measured in years. Then the age of the person is so incremented at each interval. Age = 12 * Int - 1. Such functional forms may be constructed easily in SAS, BMDP, and most other statistical packages.
Time-varying covariates. These variables, like population density, vary with time. But the variation need not be a function of time. Rather they may cycle, or vary upward and then downward, depending on birth, mortality, longevity, and migration rates. These values may be entered at each interval in a discrete time model as average values for the interval. Alternatively, they may be entered as midpoints of the values at the beginning and end of the interval. Some persons code the variable, such as annual income, as the amount being earned annually at the beginning of that interval. In these ways, time-varying covariates may be coded. If one were to plot the level of these covariates over time, it might be possible to arrive at the level for each interval by means of linear interpolation. Using the average value has been suggested by Allison (1982, 1984) in his work on discrete time methods. Blossfeld et al. (1989) suggest that the interpolated value be programmed for the beginning of the interval.
Problems may arise from the use of time-varying covariates. If a variable
like income, which may increase over time, is used as an exogenous variable,
then increases in income may be erroneously associated with longer survival
times and correspondingly lower conditional risks of the event under consideration.
Age may affect the life cycle of a process and that life cycle may be related
to the duration of time till an event takes place. When causal variables
change within an episode then the transition rates change not just between
states, but within episodes as well (Tuma & Hannan, 1979). The historical
period, associated with environmental conditions, may have an age-related
effect on the rate of a process. Cohort effects may be related to the rate
of the process also. Beware that when a cohort is defined by a birthdate,
there is a linear dependency between cohort and birthdate that may create
a model including both effects that is unidentified.
Residual Diagnostics
Just as with OLS regression, residual diagnostics are useful to assess the adequacy of the model (Stevens, 1984). They help to identify observations that are outliers -- that is, observations that are not well explained by the model. They assess the influence of the outlier not only on the chi-square of the model, but also on the individual logistic regression coefficients. Observations that are outliers may seriously distort the parameter estimates and diagnostics can help indicate by how much distortion results. By identifying outliers and influential data points, one may examine the observations for coding errors or atypical cases, possibly meriting deletion or, at the least, further examination.
Residual diagnostics that permit this assessment include the Pearson residual, DIFCHISQ, and the DFBETA. The Pearson residual is useful in detecting outliers that are not well predicted by the model. The Pearson chi-square is actually the squared sum of squares of the residuals. The formula for the Pearson residual for the ith observation is:
(18)
The DIFCHISQ is a residual diagnostic that shows the difference in the
model chi-square that follows from the deletion of a particular
observation.
It is a statistic that can show the effect on the model of the observation
under consideration and its formula can be found in equation 19 below.
(19)
While the DIFCHISQ is useful in assessing the influence on model fit of a particular observation, the DFBETA index is useful in assessing the influence on the standardized beta of a particular observation. The DFBETA measures the change in the standardized parameter estimate owing to the deletion of a particular observation. The formula for DFBETA is given in Equation 20.
(20)
Together, these residual diagnostics allow the analyst to examine the
effect of the observation on the parameter or the model. In this way, outliers
may be identified, examined, and possibly deleted.
Assumptions
Undergirding discrete-time logistic regression analyses are three basic assumptions, explicated by Singer and Willett (1992, p. 41). If these assumptions are untenable, the model and inferences may not be valid. The first assumption is that of linearity of the logit. The second assumption is proportionality of the odds, and the third assumption is no unobserved heterogeneity. The linearity assumption means that equal displacements of the independent variable are associated with equal displacements of the logit hazard profile. This assumption may be assessed by both graphical and statistical means. Graphically, this linearity may be observed by dividing the independent variable into strata. The graphical display of the vertical displacement of the logits as a function of the different strata should reveal that equal vertical displacements of the logit should correspond to equal horizontal displacements of the strata of the independent variable. Statistically, linearity can be tested by showing that the addition of polynomial terms should not significantly improve the fit of the model. At first, squared terms may be added. Then higher order curvilinear terms may be entered. At no time should the addition of these terms significantly lower the -2LL, which is distributed as a chi-square with the number of degrees of freedom equal to the terms entered in that set.
The proportionality of the odds assumption is sometimes referred to as the parallel slopes assumption. This assumption means that there is no interaction between the independent variable and time. That is to say, the increase in the predictor variable is associated with a proportional increase in the odds. In other words, a shift in time period is assumed to be associated with a constant vertical displacement of the logit hazard score. The statistical test of this condition is to construct J-1 period dummies and to test their interaction with the independent variable for significant decreases in the -2LL. Alternatively, one may use a time (t) and time squared (t2) term and set up product vector interactions, sequentially adding these terms and assessing for statistically significant reductions in the -2LL. If there is a significant interaction of the independent variable with time, then the hazard profiles of the different values of the variable may intersect or criss-cross. Moreover, if an independent variable is time-varying, then the hazard may fluctuate with increase in time that yields a non-proportionality of the odds. But this is not always, although it often is, the case..
The assumption of no unobserved heterogeneity is a third requirement.
The logit hazard regression model contains no error term. All of the variation
in the logit hazard profile is assumed to be contained in the specified
variables. In other words, no misspecification of the model is permitted.
Singer and Willett (1992) note that "omission of an important independent
variable amounts to pooling of heterogeneous populations defined by the
different values of the omitted predictor" (p. 38). The aggregated
populations can have different hazard shape than that of the disaggregated
hazard profile. When individuals in the high risk groups die out early,
the hazard rate becomes depressed with the passage of time owing to the
selection effect. The selection effect of unobserved heterogeneity might
bias the hazard rate downwards, according to Allison (1984). This bias
means that the time-dependent nature of the risk may alter the hazard profile
accompanying random selection of subjects. However, it may be possible
to include the effects of unobserved heterogeneity in the estimation procedure.
Heckman and Singer (1984) have proposed a method for making the correction
(cf. Trussell & Richards, 1985). Yamaguchi (1986) provides a thorough
review of fixed and random effects methods and we direct the interested
reader to that source for additional details. Jain and Vilcassim (1991)
illustrate the use of an unobserved heterogeneity component in their investigation
of household purchase timings.
Statistical Packages for Discrete-Time Logistic Regression Analysis
Although logistic models may be estimated using SAS, SPSS, SYSTAT, and
BMDP, setting up a person-period dataset and then aggregating the periods
is most easily done in SAS. For that reason, the SAS command syntax for
such a model is presented in Appendices 5 and 6 in annotated form. The
SAS output is then presented in Appendix 7 and briefly explained.
COX REGRESSION
If one assumes that the intervals are small in size and the discrete time hazards are small in magnitude, then the discrete time logistic model may be transformed into the Cox proportional hazards regression model (Cox, 1972; Cox & Oakes, 1984). As the time intervals become finer, the dependent variable in Equation 21 for this analysis turns into a ratio of hazards (Yamaguchi, 1991, p. 18). The discrete time hazard ratio is transformed as shown in that equation. This is the odds ratio. Once the natural logarithm is taken, the logit is formed and the following equations for a Cox regression results.
(21)
(22)
(23)
Equation 22 and 23 are estimated using partial likelihood. Cox (1975) showed that the model parameters may be estimated by ignoring the baseline hazard rate and maximizing the likelihood by minimizing -2LL of the model. When a researcher is not concerned with a baseline hazard rate, this can be a robust analysis and should be considered. The goodness of fit tests are the same as given above. They include the LR chi-square, the Wald chi-square, and the Score test statistic. The Akaike Information Criterion or Schwartz Criterion are often used to compare models, with a lower score indicating the preferred model.
The Cox regression coefficients are the logits of the hazard and may be exponentiated to obtain the risk ratio. This value represents the change in the odds associated with a unit change in the independent variable, particularly when the individual variables are dummy coded. Categorical variables are dummy coded and the continuous variables are evaluated as described above for the discrete time logistic model. As before, the odds refer to the ratio of the probability of experiencing the event to the probability of not experiencing it.
The assumptions of the proportional hazards regression model are linearity, proportionality of the odds, and no unobserved heterogeneity. Proportionality of the odds means that there is no significant interaction of a covariate with time. It is the parallel slopes assumption. The model should be tested with an interaction of the covariate with time to ascertain whether this assumption holds. If it does not, the interaction term should be left in the model. The proportionality of the odds assumptions means that the Cox model cannot handle many time varying covariates. If there are more than one or two time-varying covariates, then it may better to rely on the discrete time logistic model.
Statistical Packages for Proportional Hazards Analysis
Subprograms to estimate proportional hazards regression models are found in the SAS PHREG, BMDP 2L, SPSS COXREG, SYSTAT SURVIVAL, and LIMDEP SURVIVAL packages, among others. The SAS syntax for the Proportional Hazards regression presented here (and in Appendices 9 and 17) utilizes the PHREG procedure in Version 6.04 of SAS (1990). In earlier versions, the PHGLM supplementary library procedure performed this kind of analysis. An example of the PHREG procedure follows and the syntax is given in Appendices 14 and 15 and briefly described here. The Spell-1 variable is the length of the time and the left variable is again the censoring indicator with a 0 indicating censoring of the observation. The covariates are the x variables and a covariance matrix is requested. An example of the output from this analysis follows in Table 8. The SAS output is presented as it is the most complete form of output.
Table 8
DISCRETE TIME PHREG MODEL, MICHIGAN DATA, SPELL1 ANALYSIS
Data Set: WORK.SPELLT1
Dependent Variable: DV
Censoring Variable: LEFT (Left teaching before 1984-85 school year)
Censoring Value(s): 0
Ties Handling: BRESLOW
Summary of the Number of Event and Censored Values
Percent
Total Event Censored Censored
178109 64785 113324 63.63
Testing Global Null Hypothesis: BETA=0
Without With
Criterion Covariates Covariates Model Chi-Square
-2 LOG L 1556206.27 1531998.41 24207.85 with 2 DF (p=0.0001)
Score . . 20585.38 with 2 DF (p=0.0001)
Wald . . 15832.20 with 2 DF (p=0.0001)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Risk
Variable DF Estimate Error Chi-Square Chi-Square Ratio
TIME 1 -0.077037 0.00585 173.33514 0.0001 0.926
TIMESQ 1 -0.016618 0.0006293 697.27383 0.0001 0.984
The status variable indicates the code for the event of departure from
teaching. The other codes for this variable indicate censoring. Here the
left hand variables include the time variable and the censoring variable.
The coding of the censoring variable in LIMDEP must be 1 for actual departures
from teaching and 0 for censored observations. The right hand side variables
include the x covariates plus a time varying covariate that is a function
of time. This time varying covariate is a product of the variable z1
and the time variable.
APPLICATIONS OF DIFFERENT MODELS
A question generally arises as to which kind of analysis to use and when. It is usually useful to examine the survival function and the hazard rate as a first step because these functions play integral parts in the description of the duration of the phenomena under study. Therefore, usually a life tables analysis should come first. Often, the question of stratification arises. Are durations related to a categorical variables? Life tables analysis tests whether the survival rates can be stratified by a covariate so that the respective rates are significantly different. Therefore, life tables procedure should be performed first. Also, we prefer the SAS system because of the quality of its algorithms/output and the ease of construction of the person-period dataset.
Teachman (personal communication, August 1992) has commented on the relative advantage and disadvantages of the discrete time logistic model with respect to the Cox proportional hazards model. Among the relative advantages of the discrete time logistic model over the Cox regression model are handling of time-varying covariates, ease of explanation, and handling of ties. If the spell is easily and naturally divisible by intervals set up on an aggregated dataset, and if the intervals are not too large, then time aggregation bias need not pose a serious problem and the discrete time logistic model is usually a good choice. If the researcher can tradeoff a little bit of power as a consequence of the aggregation of time to test his hypotheses, then the discrete time logistic model is a good model to use. If there are a lot of time varying covariates, the discrete time logistic model may handle this form of parameter vector better than the proportional hazards model. On the other hand, if there are a lot of ties, then the discrete time logistic model should perform better.
There are situations where the Cox model may be preferred. If researchers
are not concerned with the magnitude of the baseline hazard rate, they
may prefer the simplicity of the Cox model. Researchers may have only one
or two time-varying covariates in their parameter vector. Owing to the
proportionality of the odds assumptions, the Cox model cannot handle many
time-varying covariates well. But if there are only one or two, then the
Cox model should be considered. If the covariates are fixed and proportional,
this is a good model to apply. If there are few ties, then this model may
be an appropriate one to use.
MULTIPLE SPELL MODELS
Several advanced discrete-time models deserve mention. One of them, the multiple spells model, deserves extensive treatment. Multispell models are also called multiepisode or multicycle models. There are two subtypes: one-way and two-way transitions (Yamaguchi, 1991, p. 46). We focus on the one-way transition models, under which a teacher might enter teaching, teach for a spell, and then leave. Later, the teacher may return to teaching before again departing. The one-way transition of interest is leaving teaching. The focus is duration of the teaching spell. In a two-way transition, the model may include a teaching-departure and a teaching-return transition. In such models, there may be an in-teaching spell and an out-of-teaching spell. Yamaguchi (1991) addresses the subject of two-way transitions in which a transition can be reversed (p. 46). Singer and Willett clearly explain the nature of such models in their treatment of in- and out-of-teaching spells (Willett & Singer, 19??). Multispell models are relevant for studying episodes of absenteeism, unemployment, or recidivism.
Discrete-time models may be used to analyze repeatable events by examining the duration of time from the beginning of the spell to the end of the spell, in terms of periods, for each spell. Thus, the spell lengths are aggregated as well. Because these models involve multiple persons within multiple periods within multiple spells, problems of left-censoring emerge that could complicate the analyses. There is no optimal ay to deal with left-censored observations at this time, unless the hazard rate is constant or unless there is preexisting duration dependency data from which the missing (i.e., left-censored) data may be estimated. However, constant hazard rates or such prehistorical dependency data are rare in practice, for which reason it is very important to be sure that the beginning time of the spell is correctly recorded.
As we have noted, multispell models involve person-period-spell datasets. In the analysis, the focus is on what contributes to the length of the teaching spell. We are interested in one-way transitions (leaving teaching) as the terminating event. There is a person-period dataset for each spell of the model; thus, if there are two spells, then there are two person-period datasets describing these in-teaching spells. The datasets are concatenated to create the person-period-spell dataset. In Appendix 9, a straightforward example of a SAS program converting a person-oriented dataset to a person-period-spell dataset is given. In SPSS, this conversion is complex and requires three steps. First, the person-period dataset is constructed for the first spell, using the SPSS syntax in Appendix 3, and it is saved as a system file. Second, a person-period dataset is constructed for the second spell, and it is saved as a separate system file. Finally, the two system files are concatenated with the ADD FILES command to form the person-period-spell dataset. This procedure is slightly different in SPSS for Windows in that one calls up a system file from one spell and then adds cases from the other spell system file. In this way, a multiple spells system file is created. It is on this dataset that the procedure is run. Multispell datasets may consume a lot of computer memory, depending on the number of observations, thus preparing them for processing on a personal computer may be inefficient.
In multiepisode models, the survival function, p.d.f., and hazard functions take into account the spell under consideration. Each spell has a survival, p.d.f., and hazard function, as well as a set of covariates. Examples from a two-spell Michigan teaching attrition model are found in Appendices 9 and 10. Survival and hazard rates plotted for each spell of this model in Appendix 11 and 12. A working assumption is that the spells are conditionally independent. The spells are considered a "modulated renewal" stochastic process consisting of spells that are identically and independently distributed (Yamaguchi, 1991, p. 47). The set of covariates associated with the spell under examination may change from spell to spell. The second spell may be a function of "prehistory," including the covariate vector of the first spell. This possibility can be tested. Further, the return to teaching and length of the second teaching spell may be a function of the duration of the previous teaching spell. The prehistory therefore may include the length of previous spells along with the covariate vectors associated with them. Portions of this prehistory may turn out to be significant after testing (Blossfeld, Hamerle, & Mayer, 1989, pp. 60-64). We shall first show the survival and hazard functions for an individual spell and then we shall show them for all spells. The survivor function for the kth spell (k = 1,2,...) is:
(24)
The hazard rate for a particular spell, the spell-specific hazard rate, which is the limit of probability as the time interval decreases, is given in Equation 25 (Blossfeld et al., 1989).
(25)
From this spell specific hazard rate, we can obtain a multiple spells discrete time logistic regression formula. This formula is very similar to that in Equation 5, except that the multispell model combines person-periods and spells into one formula, as can be seen in Equation 26.
(26)
The time and time-squared variables are added to check the duration
dependency --that is, a relationship of the outcome variable as a function
of the time in a state. The latter term allows for the assessment of
curvilinearity in duration dependency and is useful
if one considers parametric modeling of the equation. The sign of the
coefficient indicates
concavity upward or downward -- a negative squared term indicates a concave
downward, while a positive squared term indicates a concave upward. The
spell variable indicating the spell number indicates whether this is a
significant factor. The coefficient of the spell variable shows the shift
that comes about when one moves from one spell to next. If it is significant,
the shift in logit space from one spell to the other is noteworthy. As
this parameter estimate would be equal to the intercept and because it
is a linear function of the intercept, the intercept must be dropped from
this model to avoid multicollinearity and allow for estimation of the spell
parameter. An example of this estimation procedure is found in Appendices
14 and 15. Interactions of the covariates with time and time squared should
initially be used to test proportionality of the hazards. In Equation 26,
an example of such an interaction is between time and x1. A
SAS program to compute a discrete-time logistic regression model with an
interaction term between time and sex shows how nonproportional hazards
may be modeled, when the assumption of proportional hazards is violated.
These are logistic regression coefficients which show the shift in natural-log
hazard space associated with a unit change in the temporal or other covariate.
A classification table is presented to show sensitivity, specificity, and
false positive and false negative rates (Appendices 7 & 15). The same
procedure may be run in SPSS.
COMPETING RISK MODELS
Another discrete-time model involves the concept of competing risks.
Competing risks must be mutually exclusive and collectively exhaustive
for the models to be transition specific. The risks must be completely
different from one another in that either one or the other can happen,
but not both, at the same time. When these models are analyzed, time until
the transition-specific event is analyzed, while alternative transitions
are treated as censored. Thus, there is a different analysis for each transition
specific risk or event. Consider the event of dropping out of school and
possible primary explanations for so doing. One may drop out of school
mainly to go to a better school, or because of financial hardship, or for
family problems, or to get married and raise a family, or if one becomes
very ill, or if one dies. . Dropping out for another predominant reason
may be coded as the last alternative. Together, these explanations comprise
the set of risks (main reasons) for dropping out of school for a student.
Ideally, they are mutually exclusive and collectively exhaustive in their
locus of primary reasons for leaving school.
CONTINUOUS TIME MODELS
An additional class of models is referred to as continuous-time models (Tuma & Hannan, 1984, advocate this type of model). When the temporal data are continuous in nature, discrete-time models are unnecessary. If the data refer to particularly salient events, which are easily and exactly remembered, then exact times may be recalled and continuous time models may be used. When the data arise from continuous monitoring of events and the recording of such events is accurate, precise durations may be calculated. One example of this analysis is the medical monitoring of patients. In these cases continuous time models may be used in either nonparametric or parametric form.
The Kaplan-Meier product-limit life table is a nonparametric continuous-time model (Miller, 1981, p. 46). A brief explanation of this form of life table is presented in Appendix 16. Alternatively, parametric regression in continuous-time models requires accurate assessment of the distribution of errors, or the parameter estimates are biased. It is necessary to ascertain through examination of the hazard rate whether error distributions are exponential, Gompertz (Makeham), Weibull, log-logistic, log-normal, or gamma. If the hazard rate is constant, the model may be exponential. If the hazard rate is either increasing or decreasing, the distribution may be Gompertz or Weibull. If the hazard rate cycles (increases and then decreases), the distribution may be log-logistic.
Examination of the residuals will help determine the nature of the fit.
Researchers may compute the average or median residual duration for uncensored
as well as censored data. Even if one uses a pseudo-R2 statistic
to assess the fit, it is not always possible to be certain that the parametric
distribution selected is the correct one, especially because of possible
unobserved heterogeneity. For this reason, it is often safer to employ
the discrete-time model than risk serious parameter bias that results from
incorrect specification within a parametric framework.
CONCLUSIONS
Our presentation has developed the rationale, procedures, and assumptions of discrete-time event history analysis. The rationale centers on the increasing requirements of higher education researchers and practitioners to evaluate time until the occurrence of events (e.g., graduation, student attrition, faculty tenure). After developing three specific techniques for analyzing time until an event and relating the times to predictors, we used an extended example involving teacher attrition to demonstrate several computer packages that estimate the event history analysis models that were presented. We also presented a brief description of one multispell model with a one-way transition, in order to expand researcher knowledge of a more advanced approach to temporal data. Competing risks and continuous time models were briefly described but not elaborated. Writings from a range of disciplines were integrated, as can be seen by the explosion of interest in event history analysis over the last decade.
Researchers, spanning social science disciplines and levels of analysis, use event history analysis to study the longevity of cohabitation or marriage, medical treatments for chronic illness, and consumer product failure times. At the individual level of analysis, sociologists and demographers study the duration of cohabitation or marriage (Teachman, 1983). At an aggregate level of analysis, the survival of organizations (Hannan & Freeman, 1977), corporate illegal behavior (Baucus & Near, 1991), survival of governments (Hannan & Carroll, 1981), and decolonization (Strang, 1990) are of interest. Sociologists interested in criminal behavior examine variables that influence time until recidivism (i.e., re-arrest) of offenders. Similar techniques are used by economists to study the duration of employment and unemployment periods (Flinn & Heckman, 1982a, 1982b; Heckman & Singer, 1984, 1985; Kiefer, 1988), and by organizational psychologists to study absenteeism (Fichman, 1989; Harrison & Hulin, 1989) or time until turnover in civilian and military jobs (Morita et al., 1989). Psychologists study time to relapse of smokers (Hatziandreu et al., 1990), alcoholism (Swan & Denk, 1977) and other forms of addiction. Finally, medical researchers often study survival after different treatments for disease, for example, heart transplants (Crowley & Hu, 1977; Lawless, 1982). Because diseases often have beginnings, durations, terminations/remissions, and recurrences/recoveries, they are ideal for event history analyses (accurately called survival analyses).
One major conclusion, and one that is nearly inevitable, is a requirement
to consider reframing substantive research questions in higher education.
The refocusing should center on the duration until event occurrence, combining
the unique advantages of the person-period data matrix and the capability
to develop explanatory models using several covariates (i.e., time-independent
and/or time-varying). Further, institutional research units at many universities
provide a rich source of archival duration data that is amenable to the
sophisticated analyses that were discussed here. The applicability of survival
analysis to multiple higher education populations (students [clients],
teachers [service providers], and administrative [i.e., executive and support
personnel]), and even to institutions themselves demands increased cognizance
of this flexible family of procedures.
Notes
1. The data and codebook were respectively gathered and prepared under
the direction of Murname, Singer and Willett at Harvard University and
made available to this author through the Inter-University Consortium for
Political and Social Research (ICPSR) at the University of Michigan. We
use the same variables names found in the ICPSR codebook and are grateful
to both Murname, Singer, and Willett on the one hand and to the Inter-University
Consortium for Political and Social Research on the other for making these
data available to us.
2. A difficult type of censoring is left-censoring. Such observations
include unknown beginning time points. Unless the hazard rate is constant,
it is difficult to estimate the duration of the spell under circumstances
of left-censoring. This kind of censoring may occur with data characterized
by multiple spells. For example, studies of unemployment may involve repeated
spells of unemployment, where in retrospect the beginning of the unemployment
spells may be indefinitely recalled.
3. Singer & Willett derive the log likelihood function and their
derivation is used here.
REFERENCES
Allison, P.D. (1982). Discrete-time methods for the analysis of event-histories. In S. Leinhardt (Ed.). Sociological methodology 1982 (pp. 61-98). San Francisco, CA: Jossey-Bass.
Allison, P.D. (1984). Event history analysis: Regression for longitudinal event data. Newbury Park, CA: Sage.
Austin, J.T., Yaffee, R. A., & Hinkle, D. E. (1992) Logistic Regression for Research in Higher Education. In J. Smart (Ed). Higher Education: Handbook of Theory and Research VIII (pp. 379-410). New York, NY: Agathon Press.
Baucus, M.S., & Near, J.P. (1991). Can illegal corporate behavior be predicted? An event history analysis. Academy of Management Journal, 34, 9-36.
Bloom, B.S. (1964). Stability and change in human characteristics. New York: Wiley
Blossfeld, H.P., Hamerle, A.., & Mayer, K.U. (1989). Event history analysis: Statistical theory and application in the social sciences. Hillsdale, NJ: Erlbaum.
Cox, D.R. (1972). Regression models and life tables. Journal of the Royal Statistical Society, B, 34, 187-220.
Cox, D.R. (1975). Partial likelihood. Biometrika, 62, 269-276.
Cox, D.R., & Oakes, D. (1984). Analysis of survival data. London: Chapman & Hall.
Crowley J., & Hu, M. (1977). Covariance analysis of heart transplant survival data. Journal of the American Statistical Association, 72, 27-36.
Everitt, B.S. (1987). Introduction to optimization methods and their application in statistics. London, UK: Chapman & Hall.
Fichman, M. (1988). Motivational consequences of absence and attendance: Proportional hazard estimation of a dynamic motivation model. Journal of Applied Psychology, 73, 119-134.
Flinn, C.J., & Heckman, J.J. (1982a). Models for the analysis of labor force dynamics. In G. Rhodes & R. Basmann (Eds.), Advances in econometrics (Vol. 1, pp. 35-96). Greenwich, CT: JAI Press.
Flinn, C.J., & Heckman, J.J. (1982b). New methods for analyzing individual event histories. In S. Leinhardt (Ed.), Sociological methodology 1982 (pp. 99-140). San Francisco, CA: Jossey-Bass.
Freedman, D., Thornton, A., Camburn, D., Alwin, D., & Young-Demarco, L. (1988). The Life History Calendar: A technique for collecting retrospective data. Sociological Methodology, 17, 37-68.
Hannan, M.T., & Carroll, G.R. (1981). Dynamics of formal political structure: An event-history analysis. American Journal of Sociology, 46, 19-35.
Hannan, M.T., & Freeman, J.H. (1977). The population ecology of organizations. American Journal of Sociology, 82, 929-964.
Heckman, J., & Singer, B. (1982). Population heterogeneity. In K.G. Land & A. Rogers (Eds.), Multidimensional mathematical geography (pp. 567-598). New York: Academic Press.
Heckman, J., & Singer, B. (1984). A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica, 52, 271-320.
Heckman, J., & Singer, B. (Eds.). (1985). Longitudinal analysis of labor market data. New York: Cambridge University Press.
Heckman, J., & Singer, B. (1985). Social science duration analysis. In J. Heckman & B. Singer (Eds.), Longitudinal analysis of labor market data (pp. 39-110). New York: Cambridge University Press.
Hosmer, D.W., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.
Jain, D.C., & Vilcassim, N.J. (1991). Investigating household purchase timing decisions: A conditional hazard function approach. Marketing Science, 10, 1-23.
Kalbfleisch, J.D., & Prentice, R.L. (1980). The statistical analysis of failure time data. New York: Wiley.
Kiefer, N.M. (1988). Economic duration data and hazard functions. Journal of Economic Literature, 26, 646-679.
Klein, J., & Goel, P. (1992). Survival analysis: State of the art (NATO advanced workshop proceedings). Dordrecht, Holland: Kluwer.
Lawless, J.F. (1982). Statistical models and methods for lifetime data. New York: Wiley.
McGrath, J.E., & Kelly, J. (1988). Time and human interaction. Newbury Park, CA: Sage.
Miller, R.G. (1981). Survival analysis. New York: Wiley.
Morita, J.G., Lee, T.W., & Mowday, R.T. (1989). Introducing survival analysis to organizational researchers: A selected application to turnover research. Journal of Applied Psychology, 74, 280-292.
Murnane, R.J., Singer, J.D., & Willett, J.B. (1988). The career paths of teachers: Implications for teacher supply and methodological lessons for research. Educational Researcher, 17, 22-30.
Nagel, E. (1979). The structure of science (2nd ed.). Indianapolis, IN: Hackett.
Pierre, D.A. (1986). Optimization theory with applications. New York: Dover.
SAS Institute, Inc. (1990). PHREG preliminary documentation. Cary, NC: Author.
SAS Institute, Inc. (1990). Logistic procedure. In Statistics user's guide (4th ed., Vol. 2, Ver. 6). Cary, NC: Author.
Sclan, E.M. (1993). The effect of perceived workplace conditions on beginning teachers work commitment, career choice commitment, and planned retention. Unpublished Ed.D. dissertation, Teachers College, Columbia University.
Singer, J.D., & Willett, J.B. (1993). It's about time: Using discretetime survival analysis to study duration and the timing of events. Journal of Educational Statistics, 18, 155-195.
Stevens, J.P. (1984). Outliers and influential data points in regression analysis. Psychological Bulletin, 95, 334-344.
Strang, D. (1990). From dependency to sovereignty: An event history analysis of decolonization 1870-1987. American Sociological Review, 55, 846-860.
Swan, G.E., & Denk, C.E. (1987). Dynamic models for the maintenance of smoking cessation: Event history analysis of late relapse. Journal of Behavioral Medicine, 10, 527-554.
Teachman, J. (1983). Analyzing social processes: Life tables and proportional hazards models. Social Science Research, 12, 263-301.
Trussell, J., & Richards, T. (1985). Correcting for unobserved heterogeneity in hazard models using the Heckman-Singer procedure. In N.B. Tuma (Ed.), Sociological methodology 1985 (Vol. 15, pp. 242-276). San Francisco, CA: Jossey-Bass.
Tuma, N.B. (1980). Invoking RATE. Menlo Park, CA: SRI International.
Tuma, N.B. (1982). Nonparametric and partially parametric approaches to event history analysis. In S. Leinhardt (Ed.), Sociological methodology 1982 (Vol. 12, pp. 1-60). San Francisco, CA: Jossey-Bass.
Tuma, N.B., & Hannan, M.T. (1978). Approaches to the censoring problem in analysis of event histories. In K. Schuessler (Ed.), Sociological methodology 1979 (Vol. 8, pp. 209-240). San Francisco, CA: Jossey-Bass.
Tuma, N.B., & Hannan, M.T. (1984). Social dynamics: Models and methods. New York: Academic Press.
Tuma, N.B., & Hannan, M.T. (1980). Invoking RATE (2nd ed.). Menlo Park, CA: SRI International.
Tuma, N.B., Hannan, M.T., & Groeneveldt, L.P. (1979). Dynamic analysis of event histories. American Journal of Sociology, 84, 820-854.
Willett, J.B., & Singer, J.D. (1989). Two types of questions about time: Methodological issues in the analysis of teacher career path data. International Journal of Educational Research, 13, 421-437.
Willett, J.B., & Singer, J.D. (1991a). From whether to when: New methods for student dropout and teacher attrition. Review of Educational Research, 61, 407-450.
Willett, J.B., & Singer, J.D. (1991b). How long did it take? Using survival analysis in educational and psychological research. In L. Collins & J. Horn (Eds.), Best methods for the analysis of change (pp. 310-327). Washington, DC: American Psychological Association.
Willett, J.B., & Singer, J.D. (1993). It's deja-vu all over again: Using multiple-spell discrete-time survival analysis. Manuscript submitted for review.
Yaffee, R.A. , Lorenz. V.C. & Politzer, R.M., (1993) Models Explaining Gambling Severity among Patients undergoing Treatment in Maryland: 1983 through 1989. In W.R. Eadington & J.A. Cornelius (Eds.), Gambling Behavior and Problem Gambling .(pp. 657-677). Reno, NV.: University of Nevada Press, Reno.
Yamaguchi, K. (1986). Alternative approaches to unobserved heterogeneity in the analysis of repeatable events. In N.B. Tuma (Ed.), Sociological methodology 1986 (Vol. 16, pp. 213-249). San Francisco, CA: Jossey-Bass.
Yamaguchi, K. (1991). Event history analysis. Newbury Park, CA:
Sage.
CONTRIBUTORS
Robert A. Yaffee is a statistical consultant with the Statistics and Social Science Group of the Academic Computing Facility, Courant Institute of Mathematical Sciences, New York University. He is Director of Research of the Compulsive Gambling Center, Inc.in Baltimore, Maryland. He received his Ph.D. from The Graduate Faculty of Political and Social Science of the New School for Social Research. He has lectured and written articles on logistic regression and ordinal logit models and has also applied these techniques to research on addiction severity among pathological gamblers for the Maryland Task Force on Gambling Addiction.
James T. Austin is an Assistant Professor of Psychology at The Ohio State University. He earned his Ph.D. in 1987 from Virginia Polytechnic Institute and State University, with a specialization in Industrial-Organizational Psychology. His research on goal-setting, criterion measurement, and structural equation modeling has appeared in industrial-organizational and quantitative psychology journals.
ACKNOWLEDGEMENT
The authors would like to thank Richard Murnane of the Harvard Graduate School of Education for collecting and making the teacher career data available to the Inter-University Consortium for Political and Social Research (ICPSR) in addition to Judy D. Singer and John B. Willett, also of the Harvard Graduate School of Education, for the opportunity to review their excellent treatments of single and multiple-spell discrete time event history methods using that data. To them, we remain heavily indebted. We would also like to thank the Inter-University Consortium for Political and Social Research for providing a copy of the dataset used for our analysis and Jay D. Teachman of the University of Maryland, College Park, and of the Institute of Social Research at the University of Michigan, for his valuable inspiration and assistance. We would also like to thank the Academic Computing Facility of New York University for the use of the IBM 9121 on which we constructed the multiple spells datasets for analysis.
Appendix 1
Converting a Person-Oriented Dataset to Person-Period
Dataset in SAS
PROC FORMAT;
VALUE SX 0='MALE' 1='FEMALE'; VALUE RA 0='WHITE' 1='BLACK';
VALUE LE 0='CENSORED' 1='LEFT';
OPTIONS LS=80 PS=55;
TITLE 'SAS CONVERSION OF PERSON-ORIENTED TO PERSON-PERIOD'; TITLE2 ' DATA SET ';
DATA ONE; INFILE 'MICHTCH.DAT';
INPUT ID SEX RACE SPELL1 LEFT ENTERYR EXITYR; LABEL ID='RESPONDENT ID'
SEX='SEX OF TEACHER' SPELL1='LENGTH OF FIRST TEACHING SPELL'
LEFT='LEFT TEACHING BEFORE 84-85 SCHOOL YEAR' ENTERYR= 'YEAR FIRST ENTERED TEACHING'
EXITYR='FIRST YEAR OF CAREER INTERRUPTION' ;
IF LEFT > 0 THEN LEFT = 1;
ARRAY TEMP(13) D1-D13; DO PERIOD = 1 TO MIN(SPELL1,13);
IF PERIOD = SPELL1 AND LEFT NE 0 THEN DV = 1; ELSE DV = 2; *creation of dep var;
DO INDEX = 1 TO 13;
IF INDEX = PERIOD THEN TEMP(INDEX) = 1; ELSE TEMP(INDEX) = 0;
SPELL1SQ = SPELL1 * SPELL1;
SPXSX = SPELL1*SEX; SP2XSX = SPELL1SQ*SEX;
END; OUTPUT; END;
FORMAT SEX SX.; FORMAT RACE RA.; FORMAT LEFT LE.;
PROC PRINT; VAR ID D1-D13 LEFT DV SPELL1 SPELL1SQ; RUN;
PROC PRINT; VAR D1-D13 PERIOD LEFT DV; RUN;
Note: The basic array construction used here is from Willett, 1993, pp. 192-193). Coding of the DV requires that events leaving teaching are coded as 1 and non-events staying in teaching as 2.
S
S P P
P E E
E L R L
O D D D D S L L I E
B I D D D D D D D D D 1 1 1 1 E L S O F D
S D 1 2 3 4 5 6 7 8 9 0 1 2 3 X 1 Q D T V
1 376523417 1 0 0 0 0 0 0 0 0 0 0 0 0 MALE 13 169 1 CENSORED 2
2 376523417 0 1 0 0 0 0 0 0 0 0 0 0 0 MALE 13 169 2 CENSORED 2
3 376523417 0 0 1 0 0 0 0 0 0 0 0 0 0 MALE 13 169 3 CENSORED 2
4 376523417 0 0 0 1 0 0 0 0 0 0 0 0 0 MALE 13 169 4 CENSORED 2
5 376523417 0 0 0 0 1 0 0 0 0 0 0 0 0 MALE 13 169 5 CENSORED 2
6 376523417 0 0 0 0 0 1 0 0 0 0 0 0 0 MALE 13 169 6 CENSORED 2
7 376523417 0 0 0 0 0 0 1 0 0 0 0 0 0 MALE 13 169 7 CENSORED 2
8 376523417 0 0 0 0 0 0 0 1 0 0 0 0 0 MALE 13 169 8 CENSORED 2
9 376523417 0 0 0 0 0 0 0 0 1 0 0 0 0 MALE 13 169 9 CENSORED 2
10 376523417 0 0 0 0 0 0 0 0 0 1 0 0 0 MALE 13 169 10 CENSORED 2
11 376523417 0 0 0 0 0 0 0 0 0 0 1 0 0 MALE 13 169 11 CENSORED 2
12 376523417 0 0 0 0 0 0 0 0 0 0 0 1 0 MALE 13 169 12 CENSORED 2
13 376523417 0 0 0 0 0 0 0 0 0 0 0 0 1 MALE 13 169 13 CENSORED 2
14 386479798 1 0 0 0 0 0 0 0 0 0 0 0 0 FEMALE 1 1 1 LEFT 1
15 364497052 1 0 0 0 0 0 0 0 0 0 0 0 0 MALE 1 1 1 LEFT 1
16 366545454 1 0 0 0 0 0 0 0 0 0 0 0 0 MALE 13 169 1 CENSORED 2
17 366545454 0 1 0 0 0 0 0 0 0 0 0 0 0 MALE 13 169 2 CENSORED 2
18 366545454 0 0 1 0 0 0 0 0 0 0 0 0 0 MALE 13 169 3 CENSORED 2
19 366545454 0 0 0 1 0 0 0 0 0 0 0 0 0 MALE 13 169 4 CENSORED 2
20 366545454 0 0 0 0 1 0 0 0 0 0 0 0 0 MALE 13 169 5 CENSORED 2
21 366545454 0 0 0 0 0 1 0 0 0 0 0 0 0 MALE 13 169 6 CENSORED 2
22 366545454 0 0 0 0 0 0 1 0 0 0 0 0 0 MALE 13 169 7 CENSORED 2
23 366545454 0 0 0 0 0 0 0 1 0 0 0 0 0 MALE 13 169 8 CENSORED 2
24 366545454 0 0 0 0 0 0 0 0 1 0 0 0 0 MALE 13 169 9 CENSORED 2
25 366545454 0 0 0 0 0 0 0 0 0 1 0 0 0 MALE 13 169 10 CENSORED 2
26 366545454 0 0 0 0 0 0 0 0 0 0 1 0 0 MALE 13 169 11 CENSORED 2
27 366545454 0 0 0 0 0 0 0 0 0 0 0 1 0 MALE 13 169 12 CENSORED 2
28 366545454 0 0 0 0 0 0 0 0 0 0 0 0 1 MALE 13 169 13 CENSORED 2
29 378504498 1 0 0 0 0 0 0 0 0 0 0 0 0 MALE 4 16 1 LEFT 2
30 378504498 0 1 0 0 0 0 0 0 0 0 0 0 0 MALE 4 16 2 LEFT 2
31 378504498 0 0 1 0 0 0 0 0 0 0 0 0 0 MALE 4 16 3 LEFT 2
32 378504498 0 0 0 1 0 0 0 0 0 0 0 0 0 MALE 4 16 4 LEFT 1
Appendix 2
Panel 1
Converting Person-Oriented Datasets to Person-Period Datasets in SPSS
SET WIDTH 80/UNDEFINED=NOWARN/MXWARNS=100000.
TITLE 'SAS CONVERSION OF PERSON PERIOD DATA SET'.
INPUT PROGRAM.
DATA LIST FILE='MICHTCH.DAT' FREE/ ID SEX RACE SPELL1 LEFT ENTERYR EXITYR.
VAR LABELS ID='RESPONDENT ID' / SEX='SEX OF TEACHER' / SPELL1='LENGTH OF FIRST TEACHING
SPELL' / LEFT='LEFT TEACHING BEFORE 1984-85 SCHOOL YEAR' / ENTERYR= 'YEAR FIRST
ENTERED TEACHING' / EXITYR='FIRST YEAR OF CAREER INTERRUPTION'.
VALUE LABELS SEX 0='MALE' 1='FEMALE'/RACE 0='WHITE' 1='BLACK'/LEFT 0='CENSORED' 1='LEFT'.
VECTOR TEMP(13).
LOOP PERIOD = 1 TO MIN(SPELL1,13).
LEAVE ID SEX RACE SPELL1 ENTERYR EXITYR LEFT.
COMPUTE DV = 0.
IF (PERIOD = SPELL1 AND LEFT NE 0) DV = 1.
LOOP INDEX = 1 TO 13.
IF (LEFT > 0) LEFT = 1.
COMPUTE TEMP(INDEX) =0.
IF (INDEX = PERIOD) TEMP(INDEX) = 1.
COMPUTE TIME1 = PERIOD.
COMPUTE TIME1SQ = PERIOD*PERIOD.
COMPUTE T1XSX = TIME1*SEX.
COMPUTE T1SQXSX = TIME1SQ*SEX.
END LOOP.
END CASE.
END LOOP.
VAR LABELS TIME1 'THE LENGTH OF TIME ELAPSED IN SPELL1' / TIME1SQ 'TIME ELAPSED IN SPELL1
SQUARED' / T1XSX 'TIME1 BY SEX INTERACTION' / T1SQXSX 'TIME1 SQUARED BY SEX INTERACTION'.
END INPUT PROGRAM.
LIST VARIABLES= TIME1 TIME1SQ SEX RACE PERIOD LEFT DV.
SAVE OUTFILE='MT1B.SAV'.
Appendix 2
Panel 2
Note: This program was run on SPSS for Windows version 5 on a 486/50 with 16MB RAM. It produced the same output as the above SAS program, with the exception of the coding of the dependent variable (i.e., event is coded as a 1, non-event is coded as a 0).
Output of SPSS Program in Appendix 2
TIME1 TIME1SQ SEX RACE PERIOD LEFT DV
1.00 1.00 .00 .00 1.00 .00 .00
2.00 4.00 .00 .00 2.00 .00 .00
3.00 9.00 .00 .00 3.00 .00 .00
4.00 16.00 .00 .00 4.00 .00 .00
5.00 25.00 .00 .00 5.00 .00 .00
6.00 36.00 .00 .00 6.00 .00 .00
7.00 49.00 .00 .00 7.00 .00 .00
8.00 64.00 .00 .00 8.00 .00 .00
9.00 81.00 .00 .00 9.00 .00 .00
10.00 100.00 .00 .00 10.00 .00 .00
11.00 121.00 .00 .00 11.00 .00 .00
12.00 144.00 .00 .00 12.00 .00 .00
13.00 169.00 .00 .00 13.00 .00 .00
1.00 1.00 1.00 .00 1.00 1.00 1.00
1.00 1.00 .00 .00 1.00 1.00 1.00
1.00 1.00 .00 .00 1.00 .00 .00
2.00 4.00 .00 .00 2.00 .00 .00
3.00 9.00 .00 .00 3.00 .00 .00
4.00 16.00 .00 .00 4.00 .00 .00
5.00 25.00 .00 .00 5.00 .00 .00
6.00 36.00 .00 .00 6.00 .00 .00
7.00 49.00 .00 .00 7.00 .00 .00
8.00 64.00 .00 .00 8.00 .00 .00
9.00 81.00 .00 .00 9.00 .00 .00
10.00 100.00 .00 .00 10.00 .00 .00
11.00 121.00 .00 .00 11.00 .00 .00
12.00 144.00 .00 .00 12.00 .00 .00
13.00 169.00 .00 .00 13.00 .00 .00
1.00 1.00 .00 .00 1.00 1.00 .00
2.00 4.00 .00 .00 2.00 1.00 .00
3.00 9.00 .00 .00 3.00 1.00 .00
4.00 16.00 .00 .00 4.00 1.00 1.00
Appendix 3
Maximum Likelihood Estimation
Discrete-time logistic regression uses maximum likelihood
(ML) parameter estimation. Using the likelihood equation it is possible
to derive an objective function to be minimized for the parameters to be
estimated. Because the dependent variable assumes values of either 1 or
0, depending upon whether the person experiences the event of leaving teaching
or not within the period under consideration, Singer and Willett (1993)
noted that the binomial probability outcome provides the likelihood function
(L) shown in Equation 27. The binomial probability of experiencing the
event or not experiencing the event multiplied by the log-odds of experiencing
the event, when it does occur, is formulated in Equation 27:
(27)
In the likelihood function, c is designated
the censoring variable. This variable tells whether the person's career
ended in the event or not is incorporated in the version of the likelihood
equation in 35. If the person did not drop out of teaching in the kth interval,
then the teacher's observation is censored and c = 1. If the person left
teaching in the jth interval, then this observation is not censored and
c = 0.
(28)
According to Singer and Willett (1992), if it is assumed
that the individuals in the sample are independent, the likelihood function
is the product of the probabilities of their observed data as shown in
18. The probability that T = j for uncensored individuals and the probability
that T > 0 for censored persons.
(29)
This translates into
(30)
Taking the logarithms gives the log likelihood (LL)
function
(31)
Singer and Willett (1992) show that expansion of the
middle term yields:
(32)
This expansion permits collection of terms of the
log likelihood function as:
(33)
Reparameterizing this function by taking the censoring
indicator c and replacing it with the outcome of the event history process,
y, the log likelihood (LL) function is obtained:
(34)
which in turn can be reformulated as:
(35)
For each person there are j intervals (periods).
Therefore each person contributes j cases to the dataset. The likelihood
function is then computed over all persons over all intervals. By taking
the natural log of this function, the loge likelihood is obtained.
It is this function that can be multiplied by - 2 to obtain a function
that is distributed as a chi-square, which can then be minimized to obtain
the parameter estimates of the model. The loge likelihood function
is shown in Equation 35 and Equation 36.
(36)
Global Model Tests
A global model test is the likelihood ratio
chi-square
(LR). The test statistic is the ratio of -2 loge likelihood
of the model with parameters to -2 loge likelihood of the null
model, without parameters. The ratio of these two values is itself distributed
as chi-square. As a heuristic, the LR chi-square and the
Pearson
chi-square can be compared. When they are approximately equal, this comparison
suggests that there is no sample size problem. But if there is a substantial
discrepancy between this goodness of fit chi-square and the likelihood
ratio chi-square, then the sample size is probably too small for asymptotic
estimation. The Wald chi-square test is
computed by
pre- and post-multiplying the inverse of the variance-covariance matrix
by the parameter vector and its transpose. Formulas for both tests are
shown in Equation 37.
(37)
The Score test is a similar global chi-square
test, computed by premultiplying the transpose of the vector of partial
derivatives of the likelihood of the parameters of the model with respect
to the parameters by the inverted information matrix (the asymptotic variance-covariance
matrix) and post-multiplying that product by the vector of partial derivatives
of the model.
(38)
The number of degrees of freedom for this test is equal to the number of parameters being estimated. The Pearson chi2 is sometimes used as a goodness of fit measure. The Pearson Chi-square test is given as follows:
(39)
The degrees of freedom of this statistic equals the
number of model parameters minus 2.
RATE, a program written by Tuma (1980; Tuma &
Hannan, 1980), also computes a pseudo R2.
(40)
Two other statistics used to assess comparative fit
are the Akaike information criterion and the Schwartz Criterion.
(41)
In addition to the -2LL, these two comparative global
statistics provide two different ways of adjusting for degrees of freedom
and the number of predictor variables in the model. As a rule, the lower
the value of the statistic, the better the nested model.
SAS computes three nonparametric correlation coefficients
that correlate the predicted probabilities with the observed responses
of the model. These are the Kendall's tau-a, the gamma and the Somer's
D correlation coefficients. The formulas for these coefficients make use
of concordances and discordances.
![]()
(42)
A concordance is a pair of observations that include
two variables (where one variable is a predicted probability and the other
is an observed response), where an increase in one variable (say x) going
from one observation to the next is associated with an increase in the
other variable (say y). A discordance is where an increase in one variable
is associated with a decrease in the other variable, when one proceeds
from one observation to the next. The number of the different pairs are
those minus the number of ties. The number of ties are those where there
is no significant change between two variables when one proceeds from one
observation to another. The sum of the concordances minus the sum of the
discordances over the number of pairs ( which is equal to (n * (n - 1)
/ 2)) is a measure of the proportional concordance of the two variables.
The sum of the concordances minus the sum of the discordances over the
total number of concordances and discordances is another measure of proportional
concordances, yielding a Gamma correlation coefficient. The Somer's D statistic
calculated by SAS subtracts the number of ties from the total in the denominator
to correct for the number of ties in the measure, when it yields the proportional
concordance in measuring the monotonicity of the data. These measures address
to what extent prediction of a dependent variable is enhanced by knowledge
of covariates.
Estimation of the Model
This section describes parameter estimation and is
mathematically complex. It may be skipped by the reader who needs no information
on the estimation algorithm. Parameter estimation uses the log likelihood
function. Minus two times the log likelihood (-2LL) is distributed as a
chi-square. Maximization of the likelihood is performed by minimizing the
chi-square value, in an iterative manner (cf. Everitt, 1987; Pierre, 1986).
The minimization is performed by taking the partial derivatives of the
log likelihood with respect to particular parameter estimates, and setting
these equal to 0 to obtain the first order conditions that permit solutions
for the parameter estimates. The numerical algorithm used by computer packages
to find the minima along the first order functions is usually some version
of the Newton-Raphson method. This method uses a Taylor series expansion
to arrive at the parameter estimate. Using this method, the parameter estimates
at a given iteration are a function of the previous estimate minus the
product of derivative of the estimate times their second derivatives. The
second derivatives of these log likelihood functions are computed to form
the Fisher information matrix. The Newton-Raphson algorithm forces an iterative
movement of the parameter estimate from 0 along the parameter range until
the chi-square function (that is, -2LL) is minimized. If no local minimum
has been found, then the minimum chi-square will maximize the likelihood
that the parameter estimate of the model is most probably the correct one
(i.e., is maximally likely). Significance tests are obtained by solving
for the information matrix (minus the expectation of the inverse of the
matrix of second derivatives) for the asymptotic covariance matrix. The
variances of this matrix are the elements in the main diagonal. Their square
roots are the asymptotic standard errors for testing parameter significance.
When the sample size is sufficiently large and the ratio of parameter estimate
divided by its asymptotic standard error is squared, a Wald chi-square
statistic is formed. These squared ratios are distributed as chi-squares,
each with one degree of freedom. Significance level may be computed for
the individual coefficients in this manner.
Appendix 4
SAS Mainframe version 6.07, PC version 6.04
PROC LIFETEST METHOD=LT PLOTS=(S,H,P) INTERVALS=1 TO 12 BY 1;
TIME SPELL1*LEFT(0); STRATA SEX;
TITLE1 'AMERICAN TEACHER ATTRITION';
TITLE2 'THREE STATES COMBINED DATA';
FORMAT SEX SE.;
FORMAT RACE RA.;
FORMAT LEFT LE.;
SPSS for Mainframe, Windows, and PC+
SURVIVAL TABLES=SPELL1 BY SEX(1,2)/
INTERVALS=THRU 12 by 1/
STATUS = LEFT(1)/
PLOTS (ALL)/
COMPARE
BMDP1L Mainframe 1988.
/PROGRAM BMDP1L
/INPUT FILE='Mich.dat'.
Variables=3.
FORMAT = Free.
/Variable NAMES=SPELL1,LEFT,SEX.
/FORM TIME=SPELL1.
UNIT=YEARS.
STATUS=LEFT.
RESPONSE=1.
/ESTIMATE METHOD = LIFE.
PLOTS=SURV, DEN, HAZ.
GROUPING=SEX.
STRATA=SEX.
STATISTICS=BRESLOW.
LIMDEP Ver. 6.
SURVIVAL; LHS = SPELL1,LEFT;
INT = 13;
STR=SEX $
Appendix 5
SAS and SPSS Syntax for Discrete-Time Logistic
Regression Procedures
SAS version 6.04
txx1 = time * x1; PROC LOGISTIC CT; MODEL DV = time time2 x1 txx1 x2 x3 / corrb;
SAS version 6.07 and 6.08
Altough we used, SAS v. 604 to run the models shown here, it should be noted that as of release 6.07, you may use a DESCENDING option on the proc statement to reverse the ordering of the response variable, so that SAS will model the upper, rather than the lower, probability level. If this is done, then the DV =2 statement in Appendix
1 should be changed to DV = 0. An example of the PROC Statement that would be
used in this connection follows:
PROC LOGISTIC DESCENDING CT; MODEL DV = time time2 x1 txx1 x2 x3 / corrb;
SPSS version 4.01
compute txx1 = time * x1; Logistic Regression Variables = DV with Time Time2 x1 txx1 x2 x3 / classplot / print summary corr.
Appendix 6
SAS Output from Discrete-time Logistic Regression Model Although logistic models may be found in SAS, SPSS, SYSTAT, BMDP, and other statistical packages, setting up person-period datasets and then aggregating the periods is most easily done in SAS. For that reason, SAS (Ver. 6.07) syntax for a model using only the Michigan teaching data, as collected by Murnane et al. (1988), is presented below in annotated form. Line 1 Sets the line length to 80 columns. 1 OPTIONS LS=80; Lines 2 through 6 define the answer categories for the nominal variables. 2 PROC FORMAT; 3 VALUE SE 0='MALE' 1='FEMALE'; 4 VALUE RA 0='WHITE' 1='BLACK'; 5 VALUE LE 0='CENSORED' 1='LEFT'; 6 VALUE PT 1='FULL TIME' 2='PART TIME'; Line 11 defines the dataset. 11 DATA MITEACH; Line 12 reads in the dataset defining the record length and blocksize. For the MVS system, MIT is the data definition name. 12 INFILE MIT LRECL=472 BLKSIZE=31624; Lines 13-19 declare the variables by giving them variable names and defining the location in their data in the data file. It should be noted that the MADEG01-MADEG13 (Master's degree in year XX) array of variables are time varying covariates, which have the value of 0 until the year when the degree is conferred. From that year till the end of the person's spell, the value of that array is 1. 13 INPUT ID 1-9 SEX 10 RACE 11 BIRTHP 12-13 MAJOR $ 14-15 MADEG01 16 14 MADEG02 17 MADEG03 18 MADEG04 19 MADEG05 20 MADEG06 21 MADEG07 22 15 MADEG08 23 MADEG09 24 MADEG10 25 MADEG11 26 MADEG12 27 MADEG13 28 16 INST1 29-30 SUBJECT 31-32 @33 (ASC1_01 - ASC1_13) ($ 2.) @59 17 (LEVEL1-LEVEL13) (1.) ENTERYR 72-73 @74 (PTTIME01-PTTIME12)(1.) 18 SPELL1 86-87 LEFT 88-89 EXITYR 89-90 SPELL2 91-92 RETURN 93 19 RETURNYR 94-95; Lines 20 through 30 convert the person-oriented dataset to the person-period dataset, as demonstrated by Singer & Willett. Line 20 defines the temp variable array as dummy variables D1 to D13. 20 ARRAY TEMP(13) D1-D13; Line 21 Defines MADEG as an array of dummy variables MADEG1 through MADEG13 of whether the teacher obtained a masters degree in that year. 21 ARRAY MADEG(13) MADEG01-MADEG13; Line 22 sets up the person-period case loop. 22 DO PERIOD = 1 TO MIN(SPELL1,13); Line 23 states that if the period is equal to the greatest year and if the observation is not censored, then the dependent variable equals 1. Otherwise, the dependen variable equals 2. Remember that PROC logistic in SAS models the lower probability, not the higher one. 23 IF PERIOD = SPELL1 AND LEFT NE 0 THEN DV= 1; ELSE DV = 2; Line 24 sets up the index of the loop that defines the time dummies. 24 DO INDEX = 1 TO 13; Line 25 sets a dummy variable D1 through D13 equal to one only when the index equals to correct period. 25 IF INDEX = PERIOD THEN TEMP(INDEX) = 1; ELSE TEMP (INDEX) = 0; The dur1 and dur1 squared variables are constructed in lines 26 and 27. Deviating from the practice of Singer & Willett, a different parameterization of time is undertaken. This approach uses fewer variables than the dummy variable approach and conserves power in the analysis. The duration variable and its square are then computed. 26 DUR1 = EXITYR - ENTERYR; 27 DUR1SQ = DUR1**2; Line 28 defines the end of the inner loop. 28 END; Line 29 outputs the variables within the loop as data. 29 OUTPUT; Line 30 defines the end of the outer loop. 30 END; Lines 31 through 39 give the variable labels for the covariates. 31 LABEL SEX= 'SEX OF TEACHER' 32 RACE='RACE OF TEACHER' 34 ENTERYR='FIRST SCHL YEAR OF TEACHING' 35 SPELL1='LENGTH OF FIRST TEACHING SPELL' 36 LEFT='LEFT TEACHING BEFORE 1984-85 SCHOOL YEAR' 37 EXITYR='FIRST YEAR OF CAREER INTERRUPTION' 38 DUR1 = 'DURATION OF FIRST SPELL'; Lines 40 and following give the syntax for the logistic regression procedure. CT requests the classification table for assessment of prediction accuracy. 40 PROC LOGISTIC CT; The model statement indicates that Y is the criterion and that the covariates are sex, race, duration, & duration squared. The temporal dependence is tested by the duration and duration squared terms. If these are significant, the nature of the duration dependence may be estimated. This test is useful for researchers trying to decide whether they should perform parametric analyses, as the duration coefficients the duration and its squared term will suggest an appropriate distribution of the errors. 41 MODEL DV = DUR1 DUR1SQ SEX RACE; The format statements below the model statements link formats with variables. 42 FORMAT SEX SE.; 43 FORMAT RACE RA.; 44 FORMAT LEFT LE.; 45 TITLE2 'MICHIGAN DATA'; The output for the requested analysis shows the name of the data set, the response variable, its number of levels and the number of valid observations for this analysis.
Appendix 7
SAS v. 6.04 Output For Discrete Time Logistic Regression Model
DATA SET: WORK.MITEACH
RESPONSE VARIABLE: DV
RESPONSE LEVELS: 2
NUMBER OF OBSERVATIONS: 64785
LINK FUNCTION: LOGIT
In the response profile, observations at each value of the DV are shown.
RESPONSE PROFILE
ORDERED
VALUE DV COUNT
1 1 18957
2 2 45828
Simple statistics for IVs are shown to check that they are being read in correctly.
STANDARD
VARIABLE MEAN DEVIATION MINIMUM MAXIMUM
SEX 0.707540 0.454896 0.00000 1.000
RACE 0.055661 0.229268 0.00000 1.000
DUR1 5.176708 2.745399 1.00000 12.000
DUR1SQ 34.335402 32.359466 1.00000 144.000
CRITERIA FOR ASSESSING MODEL FIT
INTERCEPT
INTERCEPT AND
CRITERION ONLY COVARIATES CHI-SQUARE FOR COVARIATES
AIC 78323.914 62749.278 .
SC 78332.993 62794.672 .
-2 LOG L 78321.914 62739.278 15582.636 WITH 4 DF (P=0.0001)
SCORE . . 15670.553 WITH 4 DF (P=0.0001)
In the printout below are given the logistic regression coefficients, asymptotic
standard errors, Wald chi-squares, and significance levels for each parameter, along
with its standardized estimate for comparing relative effects within the model. Some
of the variables are significantly related to the logit as is by the difference
between -2 LL for the intercept only and the -2 LL for the intercept plus covariates.
This value is the chi-square for covariates, with df equal to the number of parameters
in the model.
It is interesting that sex and race do not have by themselves significant main effects
as can be seen from the analysis of ML estimates below. Their standardized estimates
(for comparison within the model) are defined as the ratio of the slope parameter,
multiplied by 1 + ex to the standard deviation of the independent variable, are less
than .01 in magnitude. The question of whether there is a significant sex by time
interaction must be tested by another model.
That the time and time squared variables are found to be significant indicates that
duration dependence significantly contributes to the explanation of the dependent
variable. That is to say, the odds of teacher attrition is a function of a polynomial
of time.
ANALYSIS OF MAXIMUM LIKELIHOOD ESTIMATES
PARAMETER STANDARD WALD PR > STANDARDIZED
VARIABLE ESTIMATE ERROR CHI-SQUARE CHI-SQUARE ESTIMATE
INTERCPT 2.5553 0.0388 4328.9606 0.0001 .
SEX -0.0119 0.0220 0.2905 0.5899 -0.002975
RACE 0.0199 0.0430 0.2132 0.6442 0.002510
DUR1 -1.1744 0.0147 6393.5830 0.0001 -1.777518
DUR1SQ 0.0716 0.00126 3245.5730 0.0001 1.276746
Below are ordinal correlations between predicted probabilities and the observed
responses.
ASSOCIATION OF PREDICTED PROBABILITIES AND OBSERVED RESPONSES
CONCORDANT = 74.3% SOMERS' D = 0.532
DISCORDANT = 21.1% GAMMA = 0.558
TIED = 4.6% TAU-A = 0.220
(868761396 PAIRS) C = 0.766
Classification tables are one of the best ways to assess goodness of
predictability. They give the percent correctly predicted and provides
the sensitivity, the percentage of event responses correctly predicted
to be an event, and the specificity, the percentage of nonevent responses
correctly predicted to be a nonevent. The false positive rate is the
percentage of predicted event responses that turned out to be nonevents.
Also provided is the false negative rate, or percentage of predicted nonevent
responses that turned out to be events. A classfication table provided by
SAS is presented below.
CLASSIFICATION TABLE
PREDICTED
EVENT NO EVENT TOTAL
+---------------------+
EVENT | 8715 10242 | 18957
OBSERVED | |
NO EVENT | 3425 42403 | 45828
+---------------------+
TOTAL 12140 52645 64785
SENSITIVITY= 46.0% SPECIFICITY= 92.5% CORRECT= 78.9%
FALSE POSITIVE RATE= 28.2% FALSE NEGATIVE RATE= 19.5%
NOTE: AN EVENT IS AN OUTCOME WHOSE ORDERED RESPONSE VALUE IS 1.
We may try another model where we delete the race variable, which in Michigan does not
seem to have a significant association with the logit of departing teaching and to
include the interaction of sex and time. From the lower AIC and SC for the intercept
and covariates, we see that this is a marginally better model. But from the
comparison of the Ordinal Correlation coefficients, we see that the difference is
almost negligible, insofar as sex and its interaction with time are not significant at
the .05 level.
Criteria for Assessing Model Fit
Intercept
Intercept and
Criterion Only Covariates Chi-Square for Covariates
AIC 78323.914 62745.961 .
SC 78332.993 62791.356 .
-2 LOG L 78321.914 62735.961 15585.952 with 4 DF (p=0.0001)
Score . . 15675.267 with 4 DF (p=0.0001)
In SAS Ver 6.07 the odds ratio is added to the analysis of parameter estimates table
below. This value is merely the exponentiation of the logistic regression
coefficient.
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Standardized Odds
Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
INTERCPT 1 2.6079 0.0476 3005.9602 0.0001 . 13.571
DUR1 1 -1.1866 0.0161 5465.3194 0.0001 -1.796131 0.305
DUR1SQ 1 0.0717 0.00126 3247.8473 0.0001 1.279461 1.074
SEX 1 -0.0814 0.0433 3.5308 0.0602 -0.020424 0.922
DURXSEX 1 0.0156 0.00831 3.5121 0.0609 0.027860 1.016
Association of Predicted Probabilities and Observed Responses
Concordant = 73.9% Somers' D = 0.533
Discordant = 20.7% Gamma = 0.563
Tied = 5.4% Tau-a = 0.220
Appendix 8
SAS, SPSS, and LIMDEP Syntax for Proportional Hazards
(Cox) Regression
SAS
PROC PHREG;
MODEL SPELL1*LEFT(0) = X1 X2 X3 / COVB;
RUN;
SPSS
COXREG VARIABLES=SPELL1 WITH X1 X2 X3/STATUS LEFT(1).
LIMDEP
SURV; Lhs = SPELL1,LEFT; rhs = x1,x2,x3; TVC = z1*time;
Appendix 9
SAS Syntax for Building Person-Period-Spell Dataset For Multiple Spell Analysis
OPTIONS LS=80 PS=56;
TITLE 'REPEATED EVENTS DISCRETE TIME DATA SET MODEL';
TITLE2' SAMPLE DATA';
TITLE3 'SPELL1 ANALYSIS';
PROC FORMAT;
VALUE SE 0='MALE' 1='FEMALE';
VALUE RA 0='WHITE' 1='BLACK';
VALUE LE 0='CENSORED' 1='LEFT';
VALUE PT 1='FULL TIME' 2='PART TIME';
FORMAT SEX SE.;
FORMAT RACE RA.;
FORMAT LEFT LE.;
DATA SPELLT1;
INFILE 'MTVC2.DAT';
INPUT ID SEX RACE SPELL1 LEFT ENTERYR EXITYR RETURN RETURNYR SPELL2 MADEG01-MADEG13;
LABEL SEX= 'SEX OF TEACHER'
RACE='RACE OF TEACHER'
ENTERYR='FIRST SCHL YEAR OF TEACHING'
SPELL1='LENGTH OF FIRST TEACHING SPELL'
LEFT='LEFT TEACHING BEFORE 1984-85 SCHOOL YEAR'
EXITYR='FIRST YEAR OF CAREER INTERRUPTION'
SPELL2='LENGTH OF FIRST INTERRUPTION SPELL'
RETURN='RETURNED TO TEACHING BY 84-85 SCHL YR'
RETURNYR='YR TEACHER RETURNED TO TEACHING';
* CREATION OF PERSON PERIOD DATA SET;
IF LEFT > 0 THEN LEFT = 1;
SPELL = 1;
ARRAY TEMP1(13) D1-D13;
DO PERIOD1 = 1 TO MIN(SPELL1,13);
IF PERIOD1 = SPELL1 AND LEFT NE 0 THEN DV = 1; ELSE DV = 2;
DO INDEX1 = 1 TO 13;
IF INDEX1 = PERIOD1 THEN TEMP1(INDEX1) = 1; ELSE TEMP1(INDEX1) = 0;
TIME1 = PERIOD1;
TIME1SQ = PERIOD1*PERIOD1;
T1XSX = TIME1*SEX;
T1SQXSX = TIME1SQ * SEX;
END;
OUTPUT;
END;
TITLE2 'CONSTRUCTION OF SPELL 2 PERSON PERIOD DATASET';
TITLE3 'SPELL2 ANALYSIS';
DATA SPELLT2;
INFILE 'MTVC2.DAT';
KEEP SEX RACE RETURN RETURNYR TIME2 SPELL PERIOD2 T1-T13 DV TIME2SQ MADEG01-MADEG13;
INPUT ID SEX RACE SPELL1 LEFT ENTERYR EXITYR RETURN RETURNYR SPELL2 MADEG01-MADEG13;
SPELL=2;
ARRAY TEMP2(13) T1-T13;
DO PERIOD2 = 1 TO MIN(SPELL2,13);
IF PERIOD2 = SPELL2 AND LEFT NE 0 THEN DV = 1; ELSE DV=2;
DO INDEX2 = 1 TO 13;
IF INDEX2=PERIOD2 THEN TEMP2(INDEX2)=1; ELSE TEMP2(INDEX2)= 0;
TIME2= PERIOD2;
TIME2SQ = PERIOD2*PERIOD2;
T2XSX = TIME2*SEX;
T2SQXSX = TIME2SQ * SEX;
END;
OUTPUT;
END;
TITLE3 'MULTIPLE SPELLS DISCRETE TIME DATA SET MODEL';
DATA ALL;
SET SPELLT1 SPELLT2;
PROC LIFETEST PLOTS=(S,H) METHOD=LT NOTABLE INTERVALS=1 TO 13 BY 1;
TIME PERIOD2*LEFT(0);
Appendix 10
SAS Construction of Plots for Multiple Spells Discrete Time Analysis
1 OPTIONS LS=80 PS=56 ;
2 TITLE 'REPEATED EVENTS DISCRETE TIME SURVIVAL & HAZARD FUNCTION PLOTS';
3 TITLE2 ' MICHIGAN DATA';
4 PROC FORMAT;
5 VALUE SE 0='MALE' 1='FEMALE';
6 VALUE RA 0='WHITE' 1='BLACK';
7 VALUE LE 0='CENSORED' 1='LEFT';
8 VALUE PT 1='FULL TIME' 2='PART TIME';
9 VALUE SP 0 = 'SPELL 1' 1 = 'SPELL 2';
10 FORMAT SEX SE.;
11 FORMAT RACE RA.;
12 FORMAT LEFT LE.;
13 FORMAT SPELL SP.;
14 DATA SPELLT1;
15 INFILE MIT LRECL=472 BLKSIZE=31624;
16 INPUT ID 1-9 SEX 10 RACE 11 BIRTHP 12-13 MAJOR $ 14-15 MADEG01 16
17 MADEG02 17 MADEG03 18 MADEG04 19 MADEG05 20 MADEG06 21 MADEG07 22
18 MADEG08 23 MADEG09 24 MADEG10 25 MADEG11 26 MADEG12 27 MADEG13 28
19 INST1 29-30 SUBJECT 31-32 @33 (ASC1_01 - ASC1_13) ($ 2.)
20 @59 (LEVEL1-LEVEL13) (1.) ENTERYR 72-73 @74 (PTTIME01-PTTIME12) (1.)
21 SPELL1 86-87 LEFT 88-89 EXITYR 89-90 SPELL2 91-92 RETURN 93
22 RETURNYR 94-95 ;
23 LABEL SEX= 'SEX OF TEACHER'
24 RACE='RACE OF TEACHER'
25 ENTERYR='FIRST SCHL YEAR OF TEACHING'
26 SPELL1='LENGTH OF FIRST TEACHING SPELL'
27 LEFT='LEFT TEACHING BEFORE 1984-85 SCHOOL YEAR'
28 EXITYR='FIRST YEAR OF CAREER INTERRUPTION'
29 SPELL2='LENGTH OF FIRST INTERRUPTION SPELL'
30 RETURN='RETURNED TO TEACHING BY 84-85 SCHL YR'
31 RETURNYR='YR TEACHER RETURNED TO TEACHING';
32 * CREATION OF PERSON PERIOD DATA SET;
33 IF LEFT > 0 THEN LEFT = 1;
34 SPELL = 0;
35 ARRAY TEMP1(13) D1-D13;
36 DO PERIOD1 = 1 TO MIN(SPELL1,13);
37 IF PERIOD1 = SPELL1 AND LEFT NE 0 THEN DV = 1; ELSE DV = 2;
38 DO INDEX1 = 1 TO 13;
39 IF INDEX1 = PERIOD1 THEN TEMP1(INDEX1) = 1; ELSE TEMP1(INDEX1)=0;
40 END;
41 TIME = PERIOD1;
42 TIMESQ = PERIOD1 * PERIOD1;
43 TXS = TIME*SEX;
44 OUTPUT ;
45 END;
46 PROC LIFETEST PLOTS=(S,H) METHOD=LT NOTABLE INTERVALS=1 TO 13 BY 1;
47 TIME PERIOD1*LEFT(0);
48
49 TITLE3 'SPELL1 ANALYSIS';
50 DATA SPELLT2;
51 KEEP SEX RACE RETURN RETURNYR SPELL2 SPELL PERIOD2 T1-T13 DV LEFT;
52 INFILE MIT LRECL=472 BLKSIZE=31624;
53 INPUT ID 1-9 SEX 10 RACE 11 BIRTHP 12-13 MAJOR $ 14-15 MADEG01 16
54 MADEG02 17 MADEG03 18 MADEG04 19 MADEG05 20 MADEG06 21 MADEG07 22
55 MADEG08 23 MADEG09 24 MADEG10 25 MADEG11 26 MADEG12 27 MADEG13 28
56 INST1 29-30 SUBJECT 31-32 @33 (ASC1_01 - ASC1_13) ($ 2.)
57 @59 (LEVEL1-LEVEL13) (1.) ENTERYR 72-73 @74 (PTTIME01-PTTIME12)(1.)
58 SPELL1 86-87 LEFT 88-89 EXITYR 89-90 SPELL2 91-92 RETURN 93
59 RETURNYR 94-95 ;
60 SPELL=1;
61 ARRAY TEMP2(13) T1-T13;
62 DO PERIOD2 = 1 TO MIN(SPELL2,13);
63 IF PERIOD2 = SPELL2 AND LEFT NE 0 THEN DV = 1; ELSE DV=2;
64 DO INDEX2 = 1 TO 13;
65 IF INDEX2 = PERIOD2 THEN TEMP2(INDEX2)=1; ELSE TEMP2(INDEX2) = 0;
66 END;
67 TIME = PERIOD2;
68 TIMESQ = PERIOD2 * PERIOD2;
69 TXS = TIME*SEX;
70 OUTPUT;
71
72 END;
73 PROC LIFETEST PLOTS=(S,H) METHOD=LT NOTABLE INTERVALS=1 TO 13 BY 1;
74 TIME PERIOD2*LEFT(0);
75 TITLE3 'SPELL2 ANALYSIS';
76
Appendix 11
SAS Output for Discrete Time Repeated Events Analysis: Spell 1
THE LIFETEST PROCEDURE
SUMMARY OF THE NUMBER OF CENSORED AND UNCENSORED VALUES
TOTAL FAILED CENSORED %CENSORED
178109 64785 113324 63.6262
SURVIVAL FUNCTION ESTIMATES
SDF |
|
|
|
|
|
S 1.0 + A++++A
U | ++
R | ++
V | A
I | ++
V | ++
A 0.8 + A+
L | ++
| +A+
D | ++
I | +A++
S | ++A++
T 0.6 + ++A++
R | ++A++
I | ++A++
B | ++A++++A++
U | ++A++++A
T |
I 0.4 +
O |
N |
|
F |
U |
N 0.2 +
C |
T |
I |
O |
N |
0.0 +
|
|
|
|
|
--+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
PERIOD
Appendix 12 Continued
HAZARD FUNCTION ESTIMATES
HAZARD |
|
|
0.12 +
| A
| ++
| + +
| + +
| + +
0.10 + + A+
| + ++
| + A
| + +
H | + +
A | + +
Z 0.08 + + A+
A | + ++
R | + A
D | + ++
| + +
F | + A+
U 0.06 + + ++
N | + A
C | + ++
T | + +
I | + A+
O | + ++
N 0.04 + + A+
| + ++
| + A
| + ++
| + +
| + A+
0.02 + + ++
| + A
| +
| +
| +
| +
0.00 + A
|
|
-------+-------+-------+-------+-------+-------+-------+-------+-------
0 2 4 6 8 10 12 14
PERIOD
Appendix 13
SAS Output for Discrete Time Repeated Events Analysis: Spell 2
THE LIFETEST PROCEDURE
SUMMARY OF THE NUMBER OF CENSORED AND UNCENSORED VALUES
TOTAL FAILED CENSORED %CENSORED
248059 96518 151541 61.0907
SURVIVAL FUNCTION ESTIMATES
SDF |
|
|
|
|
|
S 1.0 + A++++A+
U | ++
R | +A+
V | ++
I | +A+
V | ++
A 0.8 + +A+
L | ++
| +A++
D | ++A+
I | ++
S | +A++
T 0.6 + ++A++
R | ++A++
I | ++A++
B | ++A++
U | ++A++++A
T |
I 0.4 +
O |
N |
|
F |
U |
N 0.2 +
C |
T |
I |
O |
N |
0.0 +
|
|
|
|
|
--+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
PERIOD
Appendix 13 Continued
HAZARD FUNCTION ESTIMATES
HAZARD |
|
|
0.09 +
|
|
| A
0.08 + ++
| + +
| + + ++A+++A++
| + A+ +A+++A++
0.07 + + +A++
| + +A++
H | + +A
A | + ++
Z 0.06 + + +
A | + A
R | + +
D | + +
0.05 + + +
F | + A
U | + +
N | + +
C 0.04 + + +
T | + +
I | + +
O | + A
N 0.03 + +
| +
| +
| +
0.02 + +
| +
| +
| +
0.01 + +
| +
| +
| +
0.00 + A
|
|
-------+-------+-------+-------+-------+-------+-------+-------+-------
0 2 4 6 8 10 12 14
PERIOD
Appendix 14
SAS Syntax for Repeated Events Discrete Time Logistic Model
1 OPTIONS LS=80 PS=56 ;
2 TITLE 'REPEATED EVENTS DISCRETE TIME LOGISTIC MODEL';
3 TITLE2 'MICHIGAN DATA';
4 PROC FORMAT;
5 VALUE SE 0='MALE' 1='FEMALE';
6 VALUE RA 0='WHITE' 1='BLACK';
7 VALUE LE 0='CENSORED' 1='LEFT';
8 VALUE PT 1='FULL TIME' 2='PART TIME';
9 FORMAT SEX SE.;
10 FORMAT RACE RA.;
11 FORMAT LEFT LE.;
12 data SPELLT1;
13 INFILE MIT LRECL=472 BLKSIZE=31624;
14 input id 1-9 sex 10 race 11 birthp 12-13 major $ 14-15 madeg01 16
15 madeg02 17 madeg03 18 madeg04 19 madeg05 20 madeg06 21 madeg07 22
16 madeg08 23 madeg09 24 madeg10 25 madeg11 26 madeg12 27 madeg13 28
17 inst1 29-30 subject 31-32 @33 (asc1_01 - asc1_13) ($ 2.)
18 @59 (Level1-level13) (1.) enteryr 72-73 @74 (PTTIME01-PTTIME12) (1.)
19 spell1 86-87 left 88-89 exityr 89-90 spell2 91-92 return 93
20 returnyr 94-95 ;
21 Label sex= 'Sex of teacher'
22 race='Race of teacher'
23 Enteryr='First schl year of teaching'
24 Spell1='Length of first teaching spell'
25 Left='Left teaching before 1984-85 school year'
26 Exityr='First year of career interruption'
27 Spell2='Length of first interruption spell'
28 Return='Returned to teaching by 84-85 schl yr'
29 Returnyr='Yr teacher returned to teaching';
30 * Creation of Person period data set;
31 IF LEFT > 0 THEN LEFT = 1;
32 SPELL = 1;
33 ARRAY TEMP1(13) D1-D13;
34 DO PERIOD1 = 1 to MIN(SPELL1,13);
35 IF PERIOD1 = SPELL1 and LEFT NE 0 THEN DV = 1; ELSE DV = 2;
36 DO INDEX1 = 1 to 13;
37 IF INDEX1 =PERIOD1 THEN TEMP1(INDEX1)=1; ELSE TEMP1(INDEX1)=0;
38 END;
39 TIME = PERIOD1;
40 TIMESQ = PERIOD1 * PERIOD1;
41 TXS = TIME*SEX;
42 OUTPUT ;
43 END;
44 Title3 'SPELL1 ANALYSIS';
46 DATA SPELLT2;
47 KEEP SEX RACE RETURN RETURNYR SPELL2 SPELL PERIOD2 T1-T13 DV;
48 INFILE MIT LRECL=472 BLKSIZE=31624;
49 input id 1-9 sex 10 race 11 birthp 12-13 major $ 14-15 madeg01 16
50 madeg02 17 madeg03 18 madeg04 19 madeg05 20 madeg06 21 madeg07 22
51 madeg08 23 madeg09 24 madeg10 25 madeg11 26 madeg12 27 madeg13 28
52 inst1 29-30 subject 31-32 @33 (asc1_01 - asc1_13) ($ 2.)
53 @59 (Level1-level13) (1.) enteryr 72-73 @74 (PTTIME01-PTTIME12) (1.)
54 spell1 86-87 left 88-89 exityr 89-90 spell2 91-92 return 93
55 returnyr 94-95 ;
56 SPELL=2;
57 ARRAY TEMP2(13) T1-T13;
58 DO PERIOD2 = 1 TO MIN(SPELL2,13);
59 IF PERIOD2 = SPELL2 AND LEFT NE 0 THEN DV = 1; ELSE DV=2;
60 DO INDEX2 = 1 TO 13;
61 IF INDEX2 = PERIOD2 THEN TEMP2(INDEX2)=1; ELSE TEMP2(INDEX2) = 0;
62 END;
63 TIME = PERIOD2;
64 TIMESQ = PERIOD2 * PERIOD2;
65 TXS = TIME*SEX;
66 OUTPUT;
67 END;
68 TITLE3 'SPELL2 ANALYSIS';
70 Data ALL;
71 SET SPELLT1 SPELLT2;
72 PROC LOGISTIC CT;
73 MODEL DV = TIME TIMESQ SPELL SEX TXS /NOINT;
Appendix 15
SAS Output for Repeated Events Discrete Time Logistic Model
The LOGISTIC Procedure
Data Set: WORK.ALL
Response Variable: DV
Response Levels: 2
Number of Observations: 178109
Link Function: Logit
Response Profile
Ordered
Value DV Count
1 1 18957
2 2 159152
WARNING: 248059 observation(s) deleted due to missing values for response or explanatory
variables.
Criteria for Assessing Model Fit
Without With
Criterion Covariates Covariates Chi-Square for Covariates
AIC 246911.50 116190.06 .
SC 246911.50 116240.51 .
-2 LOG L 246911.50 116180.06 130731.44 with 5 DF (p=0.0001)
Score . . 111903.41 with 5 DF (p=0.0001)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Standardized Odds
Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
TIME 1 -0.1328 0.0114 135.5431 0.0001 -0.232271 0.876
TIMESQ 1 -0.00706 0.00101 49.2217 0.0001 -0.148752 0.993
SPELL 1 -1.6193 0.0293 3045.1225 0.0001 0 0.198
SEX 1 0.1521 0.0293 26.9172 0.0001 0.040285 1.164
TXS 1 0.0396 0.00676 34.3078 0.0001 0.072201 1.040
Association of Predicted Probabilities and Observed Responses
Concordant = 60.4% Somers' D = 0.281
Discordant = 32.3% Gamma = 0.303
Tied = 7.2% Tau-a = 0.053
(3017044464 pairs) c = 0.640
Classification Table
Correct Incorrect Percentages
------------ ------------ -------------------------------------
Prob Non- Non- Sensi- Speci- False False
Level Event Event Event Event Correct tivity ficity POS NEG
------------------------------------------------------------------------
0.000 18957 0 159E3 0 10.6 100.0 0.0 89.4 .
0.020 18879 4922 154E3 78 13.4 99.6 3.1 89.1 1.6
0.040 18519 17414 142E3 438 20.2 97.7 10.9 88.4 2.5
0.060 17567 34682 124E3 1390 29.3 92.7 21.8 87.6 3.9
0.080 15723 58220 101E3 3234 41.5 82.9 36.6 86.5 5.3
0.100 14344 72556 86596 4613 48.8 75.7 45.6 85.8 6.0
0.120 12376 88625 70527 6581 56.7 65.3 55.7 85.1 6.9
0.140 9816 107E3 52289 9141 65.5 51.8 67.1 84.2 7.9
0.160 3657 142E3 16964 15300 81.9 19.3 89.3 82.3 9.7
0.180 0 159E3 0 18957 89.4 0.0 100.0 . 10.6
Appendix 16
Kaplan-Meier Product-Limit Life Tables
The Kaplan-Meier Life Table calculations are repeated everytime an event is observed.
Suppose t1 < t2 < ... < tx represent the distinct times that events take place. If nj
= the size of the risk set just prior to the event at time ti, the time of the event,
and dj represents the number of events at the time of the event, then the formula for
the cumulative survival function during spell k is as follows:
(43)
k is the spell. The cumulative survival function is
therefore the product of survival probabilities throughout the spell, computed
at the time of each event. The standard error is given by Greenwood's formula:
(44)
Appendix 17
SAS Syntax for PHREG SINGLE SPELL MODEL
OPTIONS LS=80 PS=56 ;
TITLE 'DISCRETE TIME PHREG MODEL';
TITLE2 ' MICHIGAN DATA';
PROC FORMAT;
VALUE SE 0='MALE' 1='FEMALE';
VALUE RA 0='WHITE' 1='BLACK';
VALUE LE 0='CENSORED' 1='LEFT';
VALUE PT 1='FULL TIME' 2='PART TIME';
FORMAT SEX SE.;
FORMAT RACE RA.;
FORMAT LEFT LE.;
DATA SPELLT1;
INFILE MIT LRECL=472 BLKSIZE=31624;
input id 1-9 sex 10 race 11 birthp 12-13 major $ 14-15 madeg01 16
madeg02 17 madeg03 18 madeg04 19 madeg05 20 madeg06 21 madeg07 22
madeg08 23 madeg09 24 madeg10 25 madeg11 26 madeg12 27 madeg13 28
inst1 29-30 subject 31-32 @33 (asc1_01 - asc1_13) ($ 2.)
@59 (Level1-level13) (1.) enteryr 72-73 @74 (PTTIME01-PTTIME12) (1.)
spell1 86-87 left 88-89 exityr 89-90 spell2 91-92 return 93
returnyr 94-95 ;
Label sex= 'Sex of teacher'
race='Race of teacher'
Enteryr='First schl year of teaching'
Spell1='Length of first teaching spell'
Left='Left teaching before 1984-85 school year'
Exityr='First year of career interruption'
Spell2='Length of first interruption spell'
Return='Returned to teaching by 84-85 schl yr'
Returnyr='Yr teacher returned to teaching';
* Creation of Person-Period data set;
IF LEFT > 0 THEN LEFT = 1;
SPELL = 1;
ARRAY TEMP1(13) D1-D13;
DO PERIOD1 = 1 to MIN(SPELL1,13);
IF PERIOD1 = SPELL1 and LEFT NE 0 THEN DV = 1; ELSE DV = 2;
DO INDEX1 = 1 to 13;
IF INDEX1 = PERIOD1 THEN TEMP1(INDEX1) = 1; ELSE TEMP1(INDEX1) = 0;
END;
TIME = PERIOD1;
TIMESQ = PERIOD1 * PERIOD1;
TXS = TIME*SEX;
OUTPUT ;
END;
PROC PHREG;
MODEL DV*LEFT(0) = TIME TIMESQ;
PROC PHREG;
MODEL DV*LEFT(0) = TIME TIMESQ SEX;
PROC PHREG;
MODEL DV*LEFT(0) = TIME TIMESQ SEX TXS;
PROC PHREG;
MODEL DV*LEFT(0) = TIME TIMESQ SEX TXS RACE/CORRB;
Appendix 18
1 OPTIONS LS=80 PS=56 ;
2 TITLE 'DISCRETE TIME PHREG MODEL';
3 TITLE2 'MICHIGAN DATA';
4 PROC FORMAT;
5 VALUE SE 0='MALE' 1='FEMALE';
6 VALUE RA 0='WHITE' 1='BLACK';
7 VALUE LE 0='CENSORED' 1='LEFT';
8 VALUE PT 1='FULL TIME' 2='PART TIME';
9 FORMAT SEX SE.;
10 FORMAT RACE RA.;
11 FORMAT LEFT LE.;
12 data SPELLT1;
13 INFILE MIT LRECL=472 BLKSIZE=31624;
14 input id 1-9 sex 10 race 11 birthp 12-13 major $ 14-15 madeg01 16
15 madeg02 17 madeg03 18 madeg04 19 madeg05 20 madeg06 21 madeg07 22
16 madeg08 23 madeg09 24 madeg10 25 madeg11 26 madeg12 27 madeg13 28
17 inst1 29-30 subject 31-32 @33 (asc1_01 - asc1_13) ($ 2.)
18 @59 (Level1-level13) (1.) enteryr 72-73 @74 (PTTIME01-PTTIME12) (1.)
19 spell1 86-87 left 88-89 exityr 89-90 spell2 91-92 return 93
20 returnyr 94-95;
21 Label sex= 'Sex of teacher'
22 race='Race of teacher'
23 Enteryr='First schl year of teaching'
24 Spell1='Length of first teaching spell'
25 Left='Left teaching before 1984-85 school year'
26 Exityr='First year of career interruption'
27 Spell2='Length of first interruption spell'
28 Return='Returned to teaching by 84-85 schl yr'
29 Returnyr='Yr teacher returned to teaching';
30 * Creation of Person period data set;
31 IF LEFT > 0 THEN LEFT = 1;
32 SPELL = 1;
33 ARRAY TEMP1(13) D1-D13;
34 DO PERIOD1 = 1 to MIN(SPELL1,13);
35 IF PERIOD1 = SPELL1 and LEFT NE 0 THEN DV = 1; ELSE DV = 2;
36 DO INDEX1 = 1 to 13;
37 IF INDEX1 = PERIOD1 THEN TEMP1(INDEX1) = 1; ELSE TEMP1(INDEX1) = 0;
38 END;
39 TIME = PERIOD1;
40 TIMESQ = PERIOD1 * PERIOD1;
41 TXS = TIME*SEX;
42 OUTPUT ;
43 END;
44
45 TITLE3 'SPELL1 ANALYSIS';
46 PROC PHREG;;
47 MODEL DV*LEFT(0) = TIME TIMESQ /CORRB;
48 PROC PHREG;;
49 MODEL DV*LEFT(0) = TIME TIMESQ SEX /CORRB;
50 PROC PHREG;;
51 MODEL DV*LEFT(0) = TIME TIMESQ SEX TXS /CORRB;
52 PROC PHREG;;
53 MODEL DV*LEFT(0) = TIME TIMESQ SEX TXS RACE/CORRB;
DISCRETE TIME PHREG MODEL
MICHIGAN DATA
SPELL1 ANALYSIS
Data Set: WORK.SPELLT1
Dependent Variable: DV
Censoring Variable: LEFT Left teaching before 1984-85 school year
Censoring Value(s): 0
Ties Handling: BRESLOW
Summary of the Number of
Event and Censored Values
Percent
Total Event Censored Censored
178109 64785 113324 63.63
Testing Global Null Hypothesis: BETA=0
Without With
Criterion Covariates Covariates Model Chi-Square
-2 LOG L 1556206.27 1531998.41 24207.85 with 2 DF (p=0.0001)
Score . . 20585.38 with 2 DF (p=0.0001)
Wald . . 15832.20 with 2 DF (p=0.0001)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Risk
Variable DF Estimate Error Chi-Square Chi-Square Ratio
TIME 1 -0.077037 0.00585 173.33514 0.0001 0.926
TIMESQ 1 -0.016618 0.0006293 697.27383 0.0001 0.984
Estimated Correlation Matrix
TIME TIMESQ
TIME 1.000000000 -0.950405788
TIMESQ -0.950405788 1.000000000
SAS PHREG Procedure
Data Set: WORK.SPELLT1
Dependent Variable: DV
Censoring Variable: LEFT Left teaching before 1984-85 school year
Censoring Value(s): 0
Ties Handling: BRESLOW
Summary of the Number of
Event and Censored Values
Percent
Total Event Censored Censored
178109 64785 113324 63.63
Testing Global Null Hypothesis: BETA=0
Without With
Criterion Covariates Covariates Model Chi-Square
-2 LOG L 1556206.27 1531099.50 25106.77 with 3 DF (p=0.0001)
Score . . 21442.65 with 3 DF (p=0.0001)
Wald . . 16690.03 with 3 DF (p=0.0001)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Risk
Variable DF Estimate Error Chi-Square Chi-Square Ratio
TIME 1 -0.075359 0.00585 165.90800 0.0001 0.927
TIMESQ 1 -0.016518 0.0006292 689.29111 0.0001 0.984
SEX 1 0.254973 0.00865 869.63660 0.0001 1.290
Analysis of Maximum
Likelihood Estimates
Variable Label
TIME
TIMESQ
SEX Sex of teacher
Estimated Correlation Matrix
TIME TIMESQ SEX
TIME 1.000000000 -0.950287614 0.009131492
TIMESQ -0.950287614 1.000000000 0.005355449
SEX 0.009131492 0.005355449 1.000000000 Sex of teacher
The PHREG Procedure
Data Set: WORK.SPELLT1
Dependent Variable: DV
Censoring Variable: LEFT Left teaching before 1984-85 school year
Censoring Value(s): 0
Ties Handling: BRESLOW
Summary of the Number of
Event and Censored Values
Percent
Total Event Censored Censored
178109 64785 113324 63.63
Testing Global Null Hypothesis: BETA=0
Without With
Criterion Covariates Covariates Model Chi-Square
-2 LOG L 1556206.27 1531072.57 25133.70 with 4 DF (p=0.0001)
Score . . 21615.47 with 4 DF (p=0.0001)
Wald . . 16542.91 with 4 DF (p=0.0001)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Risk
Variable DF Estimate Error Chi-Square Chi-Square Ratio
TIME 1 -0.090814 0.00654 192.63688 0.0001 0.913
TIMESQ 1 -0.016351 0.0006296 674.42691 0.0001 0.984
SEX 1 0.190641 0.01508 159.83901 0.0001 1.210
TXS 1 0.020399 0.00394 26.78174 0.0001 1.021
Estimated Correlation Matrix
TIME TIMESQ SEX TXS
TIME 1.000000000 -0.865750539 0.368744370 -0.447773529
TIMESQ -0.865750539 1.000000000 -0.027734182 0.037006423
SEX 0.368744370 -0.027734182 1.000000000 -0.819580406
TXS -0.447773529 0.037006423 -0.819580406 1.000000000
SAS PHREG Procedure
Data Set: WORK.SPELLT1
Dependent Variable: DV
Censoring Variable: LEFT Left teaching before 1984-85 school year
Censoring Value(s): 0
Ties Handling: BRESLOW
Summary of the Number of
Event and Censored Values
Percent
Total Event Censored Censored
178109 64785 113324 63.63
Testing Global Null Hypothesis: BETA=0
Without With
Criterion Covariates Covariates Model Chi-Square
-2 LOG L 1556206.27 1530864.44 25341.83 with 5 DF (p=0.0001)
Score . . 21816.94 with 5 DF (p=0.0001)
Wald . . 16741.99 with 5 DF (p=0.0001)
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Risk
Variable DF Estimate Error Chi-Square Chi-Square Ratio
TIME 1 -0.090666 0.00654 192.04170 0.0001 0.913
TIMESQ 1 -0.016380 0.0006296 676.95944 0.0001 0.984
SEX 1 0.198086 0.01509 172.38694 0.0001 1.219
TXS 1 0.020944 0.00394 28.22737 0.0001 1.021
RACE 1 -0.239196 0.01718 193.89984 0.0001 0.787
Estimated Correlation Matrix
TIME TIMESQ SEX TXS RACE
TIME 1.000000000 -.865707407 .368514067 -.447592615 -.001477467
TIMESQ -.865707407 1.000000000 -.027638547 .036720009 .003010263
SEX .368514067 -.027638547 1.000000000 -.818857098 -.031760381
TXS -.447592615 .036720009 -.818857098 1.000000000 -.008922722
RACE -.001477467 .003010263 -.031760381 -.008922722 1.000000000
This document has been accessed