Discrete Time Event History Models for Higher Education Research


by

Robert A. Yaffee
and
James T. Austin



9-15-94



Presented at the New York University Graduate Sociology Methodology Workshop on 8 March 1995


ABSTRACT

Time and change are important concepts for many scientific disciplines. We review statistical procedures that enable higher education researchers to match theories of time and change with appropriate analyses. The primary motivation for using event history analysis lies in the problems that arise when studying duration of time until an event occurs and when characterizing distributions of these phenomena. We discuss three procedures for nonrepeatable events: life tables analysis, discrete-time logistic regression, and Cox regression. We then consider repeatable events. We organize the presentation around an extended example of attrition using a large sample of high school teachers followed over 12 years. Extensive annotated computer syntax and output are presented. Complex mathematical material is limited to an appendix. One of our major conclusions is that higher education researchers should consider reframing their substantive research questions to focus on time and duration to take advantage of these sophisticated techniques.

DISCRETE-TIME EVENT HISTORY MODELS FOR HIGHER EDUCATION

The study of time and temporal change characterize multiple scientific disciplines (McGrath & Kelly, 1988; Thierry & Meijman, 1993), including higher education (Bloom, 1963). Researchers in higher education often study processes that operate over defined intervals. Many interesting phenomena are described by duration culminating in a terminating event, say completion of a college major. The time span of faculty retention, the length of time before students drop out or graduate, or the time taken to reach various faculty levels are topics of interest to researchers (Singer & Willett, 1993; Willett & Singer, 1991a, 1991b). Time is a frame of reference along which phenomena are observed. Reference to time permits assessment of stability and developmental trajectories in the phenomena. Repeated observations permit estimation of the rates of change, and they also allow researchers to observe the termination of the event under examination. The time axis is often defined in discrete units (years, months, semesters); therefore discrete time event history analysis is the type of analysis introduced here.

Such studies have an explanatory as well as an observational role. To have explanatory power, hypotheses about the phenomena require invariant generalizations, spatial contiguity, covariation, and temporal sequence (Cook & Campbell, 1979; Nagel, 1979, p. 74). Aligning phenomena within measured intervals is one way to conduct such analyses. The graphical display and statistical analysis of survival times comprise event history analysis. This analysis is becoming increasingly popular (Klein & Goel, 1992), not just in educational research, but beyond (Blossfeld, Hamerle, & Mayer, 1989; Miller, 1981). In higher education, researchers are interested in several populations: students, faculty, administrators, and support staff. Event history analysis applies to the study of student retention (Stage, 1988; Willett & Singer, 1991a), faculty tenure (Mueller, 1986), and job duration of high school teachers (Murnane, Singer, & Willett, 1988; Sclan, 1993; Singer & Willett, 1993). Thus, the duration and sequencing of events, along with predictors of duration, comprise inputs to event history analysis. Such procedures extend capability to study temporal processes and their determinants.

Organization of this Chapter

Several event history techniques for discrete-time data are presented in this chapter. Before discussing them, issues in the study of duration are addressed. These issues include retrospective and prospective longitudinal designs, data collection using a life history calendar, differentiation of person- and event-oriented datasets, censoring indicators, and treatment of time-varying covariates.

Following those preliminaries, four techniques are presented. Three techniques are for nonrepeatable events. We present life tables analysis, which describes survival time distributions and compares distributions of different groups, discrete-time logistic regression, which models the probability that of an event within a given period as a function of one or more covariates, and Cox regression (Cox, 1972), which models the hazard probability (hazard rate) as a function of a baseline hazard rate and a parameter vector. Then we switch from nonrepeatable to repeatable events and briefly discuss multispell models with one-way transitions. For all procedures, underlying assumptions are examined and computer examples are used to illustrate how to conduct valid analyses. Personal and mainframe computer applications are illustrated, but storage requirements may make the mainframe a better option for datasets with either many cases or many observation intervals.

An extended example used in this chapter is a study of the job duration of beginning high school teachers (Murnane et al., 1988, 1989). Using large samples from three states, generalizations about the career paths of high-school teachers in the United States were developed. First, however, we elaborate several problems in the analysis of durations to show the advantages of event history analysis.

PROBLEMS OF STUDYING DURATION

Certain problems are specific to longitudinal studies (Allison, 1982; Blossfeld et al., 1989). Clear definitions of events, including states of existence and state transitions, are crucial. Accurate measurement of beginning and ending times for the first state is needed to calculate the duration of the phenomenon. Clear transitions between states allow for an unambiguous measurement of the duration. Special designs, data collection procedures, dataset construction, variable coding, and treatment of cases are required to deal with duration data. If, for instance, researchers are interested in the topic of teacher attrition, they must provide a clear definition of the change of state (teaching to nonteaching). Teaching can be defined as conducting classes, but also giving homework and exams. Teaching is usually not defined as supporting teachers. Thus, leaving teaching means ceasing to conduct these essential classroom activities. Next, we consider retrospective and prospective longitudinal designs.

Retrospective and Prospective Designs

Retrospective designs feature interviews of respondents consisting of questions asking them to recall the time span of events and other personal characteristics. Such designs are cheaper and less time-consuming than prospective interviewing; however, they rely heavily on respondent recall. Because respondents are asked to recall beginnings and end-points of the phenomenon as part of data collection, questions should be organized to facilitate recollection of events and timings. From the start and end dates, duration is calculated. Other phenomena associated with this duration are also measured. First, focus is placed on key life events and then on the duration of interest associated with beginning and end points of key life events. Once such key events are recorded, memories of other events are linked. The data are collected for analysis by guided recall.

Prospective observation designs require monitoring over time, during which particular events may occur. Such plans are more costly and time-consuming than retrospective ones and also require increased training and supervision. But they are accurate because they avoid drawbacks of differential memory abilities and lapses of temporal frames of reference. In prospective designs, respondents are reinterviewed at discrete points over the period of observation. The observation period commences with the beginning of the study and lasts till the occurrence of an event. If the person drops out of the study for some other reason, is lost to follow-up, or did not experience the event under study during the observation period, that nonappearance must be coded appropriately. We call such observations censored.

Life History Calendar

We recommend that data be collected and entered using a Life History Calendar (LHC; Freedman, Thornton, Camburn, Alwin, & Young-Demarco, 1988). This calendar matrix has rows representing variables and columns representing time periods. Within columns, measurements are taken on variables for persons or other units of analysis represented by rows. Thus, changes in the values of the variable are recorded over time.

According to Freedman et al. (1988), the LHC improves the quality of retrospective data collection by facilitating visual and mnemonic timing of various events. First of all, a calendar-time framework provides a basis for the recall of events. More easily remembered events -- marriages, births, deaths -- are requested first and then serve as temporal markers for responses to other questions. More detailed sequences of events appear to be easier and more efficiently recalled using a framework of associations already established by the LHC. However, data evaluating these purported improvements provided by the LHC are required.

The time-axis in a LHC must be carefully conceptualized. Temporal units should provide the proper resolution for defining survival time. Units of time should be neither too small nor too large for study of the phenomenon. On the one hand, if the temporal units are too large, then all durations may fall within the first few intervals or periods. Differences between the lifetimes or survival times may be obscured by such gross time-line construction. On the other hand, if the temporal divisions are too fine, the number of observations within each of them may become small. Then the data become sparsely distributed, so that nonparametric analyses may be the only appropriate kind of analysis.

We recommend that, if erring in the construction of period length, it is best to make the intervals too small. If the intervals are too fine, they can always be aggregated to expand their time span. Aggregated intervals can provide alternative temporal definitions for the analysis. What a researcher cannot do, if the periods are too large, is to subdivide them for better temporal definition. This is a major rule of thumb for defining the temporal periods in a LHC.

The LHC presented in Table 1 helps to illustrate this tool. Suppose that the topic is teacher attrition and the researchers hypothesize that some teachers leave their careers because they marry and/or have children. The LHC uses temporal marker variables along with a variable indexing time spent teaching as follows. Calendar years and dates are indicated in the first row, months are found from the third and later columns. In this segment of the LHC, it can be seen that the teacher is 29 in December and that in the month of August turns 30. Marital status, read from the second row, changes from single to cohabitation with a partner in January, 1991. By April, the partners get married and departure from teaching occurs the month before marriage. Departure from teaching is temporally associated with an important life event, thus facilitating recall.

Table 1

Sample Life History Calendar Segment for One Person


Age

29

Ja

Fe

Mar

Apr

May

Jun

Jul

30

Sep

Oct

Marital

S

Coh

Coh

Coh

Mar

-

-

-

-

-

-

Teachng

T

T

T

Lvs

-

-

-

-

-

-

-


Key

Marital S = single, Coh = cohabiting, Mar = Marriage

Teaching T = teaching, Lvs = leaves teaching


Censoring

Censored events are those that are not observed within the study. An event may be censored for one of several reasons. If a respondent dropped out of a study for reasons other than those of the terminating event, the observation of the event is censored. If such a person moves away without advising researchers and without leaving a forwarding address, then that respondent, if not otherwise traceable, is lost to follow-up. Similarly, if the person does not experience the event before the end of study, this censoring would also have to be recorded. The observational time frame of the study -- in this case, a single spell or single episode (where the event is not repeated) -- is divided into discrete units called intervals or periods. If the observation episode is terminated by an event, the censoring variable could be coded as a 0 for the relevant periods; if the event does not occur -- in this example, leaving teaching -- then the censoring indicator for the event would be coded as a one. Within a period, if one forms a fraction by dividing those actually experiencing the event by those at risk -- that is, capable of experiencing the event -- then this estimate uses the censored observations as part of the denominator of those at risk but not as part of the numerator (those experiencing the event). This hazard rate is formulated later, but recognize that such events are right-censored and must be coded as such.

Figure 1 illustrates several types of censoring within a single spell, or one time frame. The vertical lines define the beginning and end of the study along a time axis. Observations terminated with an x are cases in which the teacher actually departs from the teaching field, that is, the event is observed. Cases b, e, and f are individuals who leave teaching. Observations without x's (cases a, c, d) ones in which the individualdoes not depart the teaching field, may have withdrawn, or are lost to follow-up. Such cases are censored.

Figure 1: Censored and uncensored cases

Constructing Datasets for Analysis

Two types of datasets are used for event history analysis. Person-oriented datasets can be used if they are partitioned into intervals by the computer. They have an observation per respondent and have measures of duration already coded or computable from beginning and ending times. They also include a censoring indicator, indicating whether the event was observed or not. They may also include measured variables believed to be predictive of the occurrence of the event (i.e., predictors or covariates).

Another type of dataset, the event-oriented dataset, is often used for discrete time logistic and Cox regression analysis. The unit of analysis is not one observation per person; rather, the unit of analysis is the person-period. Each time that a person in a prospective study is interviewed, an observation occurs. If a person is interviewed ten times over the life of the study, that person has ten records. There is a record for each person-period. Each record includes identification number of the respondent, a censoring indicator indicating whether the event was experienced during that interval (assuming that the event was not experienced prior), and values of other variables assessed during that time period. The censoring indicator is a dummy variable, signifying if the teacher left teaching during the last time period the teacher was observed. A 0 indicates that the person left teaching in that period, whereas a 1 indicates that the person did not. Each record has a dependent variable score, with one value for no evidence of the departing event and another value for departure from teaching. If the censoring indicator is 1, this indicates that this person did not depart from teaching for the number of periods (s)he was observed in the study. If the censoring variable is 0, then the person left teaching at the time of the last observation. For that last record, the dependent variable score would be a 1, unlike the scores for the other periods for that person, indicating no departure from the profession. Only the last record for a person will indicate departure from teaching, if the person's observations are not censored. An advantage of the person-period dataset is that it permits large sample analysis. If the study is retrospective, a person-period dataset will have a record for each period (in this case, year).

An example of a person-oriented data set is given in Table 2. In this dataset, the identification variable is an ID number and there is only 1 case per person. The event under study is departure from teaching. LASTPD indicates the last period in which the person taught. GENDER indicates sex of the respondent (0=male; 1=female). MARSTAT indicates marital status (0=single, living alone; 1=cohabiting, 2=married, living together; 3=separated; 4=divorced). Variables beginning with T indicate years that the person was teaching, with 1=teaching and 0=nonteaching. A person gets a 0 for the teaching variable if he or she left teaching while under observation. What is observed in the person-oriented dataset is that the first person, a single female, is observed teaching only for 1 period. She did not leave teaching while being observed and therefore was censored. She may have been transferred and thus lost to follow-up. The second person, a male, was observed for five periods and was also censored because he did not leave teaching while under our observation. We can see that marital status changed over time from single, living alone to cohabiting with a partner and in the second and third periods to marriage. The third respondent is a female observed for two periods. This person was not censored; she left teaching in the second period when her marital status changed from married and living with spouse, to divorced.

Table 2

Sample Segment of a Person-oriented Dataset


ID

LstPd

Censor

Gender

T1

T2

T3

T4

T5

MS1

MS2

MS3

MS4

MS5

01

1

1

1

1

.

.

.

.

1

.

.

.

.

02

5

1

0

1

1

1

1

1

0

1

1

2

2

03

2

0

1

1

0

.

.

.

2

4

.

.

.

For a person-period dataset, multiplying the number of cases by the number of observation periods yields the total records in the dataset. If we convert a person-oriented to a person-period dataset, Table 2 becomes Table 3. Note that the censoring variable pertains to the person, rather than to the person-period in these models. Annotated SAS and SPSS syntax to convert person-oriented to person-period datasets are presented in Appendices 1 and 2 (with output).

Table 3

Sample Segment of a Person-Period Dataset


ID

Censor

Gender

Teachg

Marital

01

1

1

1

1

02

1

0

1

1

02

1

0

1

0

02

1

0

1

1

02

1

0

1

1

02

1

0

0

2

03

0

1

1

2

03

0

1

0

4


Statistical Issues

Not only do conceptual and data management problems accompany duration data, but statistical problems arise. The simplest model is one in which the observation period is divided into discrete intervals. Discrete-time event history analysis studies the survival function, defined as the curve of probabilities of surviving until a particular time, provided that a person has not experienced the event under study. For example, given that undergraduates have completed 1, 2, 3, or more years of a major, what are the probabilities that they will drop out. Second, overall survival functions may be subdivided by one or more grouping variables into strata. One group may be experimental, the other a control group; one group may be males and the other females; or one curve may represent majority group members and another minority group members. We can test for significant differences in survival functions between strata. This approach treats the grouping variable as if it were a factor in an ANOVA. Lastly, we can use a regression analog in which the dependent variable is the logit, the natural log (ln) of the odds, of experiencing the event (dropping out of teaching). We use dummy or continuous predictor variables (covariates) to predict the logit.

All of these procedures must be able to deal with censoring. Censoring is defined as right or left, but we are concerned first with right-censoring. For right-censored cases, durations are not fully known. Some persons may not experience the event until after the study is finished. They have somehow discontinued their participation. These cases are right-censored. If it can be assumed that these observations are randomly distributed, then they do not bias parameter estimates (Teachman, 1983). An indicator must be created to signify the censoring status for each person (in computer syntax, we call this variable LEFT). When the person-record is not censored, an event is indicated in the last period observed, and such records are included in the number of persons leaving teaching for that period. When the record is censored, then the person is included in the risk set although the record is not included in the number of events for that period. Half of the censored records from that interval are subtracted from the denominator to adjust the sample size to provide for the corrected size of the risk set in the calculation of the hazard rate. A censoring indicator is required by most packages with life tables and Cox regression.

Another statistical issue for longitudinal analyses is that of covariates. If the observations are made with a retrospective design, variables measured for analysis may include both time-independent and time-dependent covariates. Time-constant covariates do not change over time. Obviously, the gender and ethnicity of a respondent usually do not change over time. These variables have the same value at each period for the same person. Time-varying covariates do change over time. Their values sometimes change in one or more of the intervals of the study. Holding the MA degree can be a time-varying covariate. In our teaching example, there would be up to 13 dummy variables (named MADEG), one for each observation period of the study. The value of this variable would remain 0 until the period in which the person received the MA. At that time and for the remaining periods that the person was included in the study, the value for this variable would be 1. Family income is another example of a time-varying covariate. At one point, a family's income may be below the poverty line. At a later point, their income may be above the poverty line. Persons interviewed may have varying values of family income each time that they are interviewed. Therefore, over time family income could change from below the poverty line to above it. Such changes bias the ordinary least squares (OLS) regression model applied to person-oriented datasets, because no provision is made for changes in the value of the covariate over time.

LIFE TABLES ANALYSIS

We begin by examining the survival time of the event of interest, which marks a change or transition from one state to another. The survival time (waiting time) is the length of time from the beginning of the study until the event occurs. If there is only one terminating event, then we say there is a single spell. If there are repeated events, in the present case repeated spells of teaching employment, then the analysis is called a multispell analysis. For didactic reasons, single spell analyses are emphasized here, and a model for repeated events is presented later. The time from which the events are measured is the beginning of the spell. The time at which the terminating event occurs is the end of the spell. If we are examining events that can have only one termination, like death, this first spell is the only spell. Time till first dropout of new teachers is a single spell analysis.

Within the total spell, we divide up the time interval of the study into periods. Periods are intervals within which we calculate survival time and probability. In higher education, many educational phenomena occur within discrete intervals (academic terms, years) and thus are amenable to such division. These are the times at which all of the sample is observed. These units of measurement are also useful in defining the risk set, defined as all those who may experience the event. The survival probability of the event is the probability that those in the risk set, defined as the proportion of those at risk of experiencing the event during a given period, survive the interval without experiencing the event. The survivorship function is a plot of survival probabilities over time (Singer & Willett, 1993, p. 7). Using the dataset, we analyze the job duration of new American high school teachers, using the time until leaving teaching measured in Michigan, North Carolina, and Colorado.

Within intervals, the probability of survival is the likelihood that a person will not experience the event of departure from teaching until some succeeding interval. Persons in the risk set who do not experience the event within the time of the interval are considered censored. The number of persons who actually experience this event, divided by the number of persons in the risk set and adjusted for those cases who are censored, defines the probability of a person within the risk set experiencing the terminating event. Censoring adjustment assumes a random distribution of cases within the interval. The midpoint of the number of cases censored is subtracted from the total number of cases at risk to yield an effective sample size. The number of cases experiencing the departure from teaching (here defined as failure) is the rate of failure for that interval. The complement of the failure probability is the survival probability. Examples of calculations are found in Table 4.

Life Table Survival Estimates

Panel A
                                          EFFECTIVE   CONDITIONAL
     INTERVAL        NUMBER                 SAMPLE    PROBABILITY
  [LOWER,  UPPER)    FAILED    CENSORED      SIZE      OF FAILURE

       0        1         0         0      63329.0            
       1        2     10304       1849     62404.5        0.1651
       2        3      6053       1815     50268.5        0.1204
       3        4      4722       1439     42588.5        0.1109
       4        5      3432       1668     36313.0        0.0945
       5        6      2754       1417     31338.5        0.0879
       6        7      1947       2144     26804.0        0.0726
       7        8      1455       3059     22255.5        0.0654
       8        9       919       3282     17630.0        0.0521
       9       10       630       2654     13743.0        0.0458
      10       11       388       3079     10246.5        0.0379
      11       12       188       3321      6658.5        0.0282
      12        .        56       4754      2433.0        0.0230

Panel B
                                          SURVIVAL   
     INTERVAL                             STANDARD   
 [LOWER,  UPPER)   SURVIVAL    FAILURE     ERROR    

       0        1     0.0000          0           
       1        2     1.0000          0          
       2        3     0.8349     0.1651    0.00149 
       3        4     0.7344     0.2656    0.00178 
       4        5     0.6529     0.3471    0.00194 
       5        6     0.5912     0.4088    0.00202 
       6        7     0.5393     0.4607    0.00207 
       7        8     0.5001     0.4999    0.00210 
       8        9     0.4674     0.5326    0.00213 
       9       10     0.4430     0.5570    0.00217 
      10       11     0.4227     0.5773    0.00221 
      11       12     0.4067     0.5933    0.00228 
      12        .     0.3952     0.6048    0.00236 



In the first interval, beginning at T0 and ending at T1 (end of the first year), no one in the three states left teaching. All persons were accounted for and none were censored, thus, the risk set is the effective sample size. As can be seen in Panel B of Table 4, 100% of those at risk survived and the probability of survival during this first year is thus 1.0. The failure rate is the complement of the survival probability. That is, subtract the survival rate from 1.0 and the result will be the event or "failure" rate. Now, suppose there is censoring. In the second year, extending from T1 to T2, not only do 10,304 persons leave teaching as indicated under the number failed, but 1849 observations were censored. The number at risk is adjusted under an assumption of constant (uniform) censoring to obtain the effective sample size. When these calculations are done at the midpoint of the interval, half of those censored are deducted from the number at risk to produce the effective sample size. Half of 1849 is 924.5 and this value is subtracted from the effective sample size in the previous interval (here, year) to obtain the effective sample size of the present interval, or 62404.5. By extension, calculation of the effective sample size of the third year (T2 to T3) is accomplished by taking the effective sample size of the previous year, 62404.5, and subtracting from it 10304, the number who departed teaching in the previous interval. The remainder is 52100.5. Then one-half of those censored in the previous year (1849/2 = 924.5) is added to half of those censored in the current year (1815/2 = 907.5), yielding 1831.5. The sum of half of the censored observations from both intervals is subtracted from the 52100.5, leaving 50268.5. The effective sample size is recalculated in this way for each period.

Information is contained in the survival and hazard rates for the groups under study. Survival time is time from the beginning of the observation until the event, in this case the departure of teachers from their field. To compute the discrete survival probability, take the proportion of teachers not leaving by the end of that year. This conditional probability for a period is multiplied by the probability of survival for the previous period. This result gives the addition to the failure rate that takes place from the current to the next period. The survival function is formed by obtaining the resulting failure rate for the next period and subtracting this newly formed failure rate for the next period from 1.0. The result is the survival probability for that next period. A survival function formed this way is a decreasing function that may or may not approach 0.

An example will help to illustrate the calculations. Let us examine the rates in the period from T1 to T2. The conditional rate of failure for this period is .1651 as seen in panel A of Table 4 above. This value is multiplied by the survival rate of the previous period, 1.0. The product is also .1651 and that is the amount added to the zero probability of failure for T1 to T2 to yield the probability of failure for T2 to T3. This amount is subtracted from 1.0 to yield the survival rate for T2 to T3, .8349. Similarly, to compute the failure rate for the third period, T3 to T4, we multiply the conditional rate of failure for T2 to T3, .1204, by the survival rate for the period, .8349, to obtain .1005, the value added to the cumulative failure distribution to give .2656 for T3 to T4. This amount is subtracted from 1.0 to obtain the survival rate for that period (.7344). The standard error of the survival rate, shown in Panel B of Table 4, is defined as the square root of the sum of the conditional rates of failure for the intervals from 1 to interval i, divided by the effective sample size and multiplied by the complement of the conditional rate of failure for each interval. In this way confidence intervals and statistical tests for comparing survival functions can be obtained.

Censored observations influence the hazard rate. They may increase the denominator of the hazard rate and deflate the hazard rate somewhat, or they may flatten it out and keep it from reaching 0. From the formula, the existence of censored observations slows down the decrease in the survival rate. At the beginning, the proportion surviving is 100 percent. The probability of those in the study surviving until the end of the first interval is 1.0. As time passes, more persons experience the event, and the proportion remaining in the field declines. By the end of year 6, some 50 percent of the teachers from the states remain in the field, as illustrated in Figure 2. This is the median survival time, useful for comparing survival times that include censored and uncensored cases. The median survival time in the graph of the survival function is at the intersection of the lines, where 50 percent of the cases survive beyond the 7-year period. The survival distribution function is formed by plotting the updated probabilities over time.

Figure 2

                        SAS Lifetest Procedure

                     SURVIVAL FUNCTION ESTIMATES


  SDF |
      |
      |
      |
      |
      |
  1.0 +   A++++A
      |         +
S     |          +
U     |          +
R     |           +
V     |            +
I     |             A
V 0.8 +              ++
A     |                 ++
L     |                  A
      |                   ++
D     |                     ++
I     |                       A+
S     |                         ++
T 0.6 +                           +A+
R     |                              ++
I     |                                +A++
B 0.5 |___________________________________ ++A+
U     |                                      | ++
T     |                                      |   +A++++A++
I     |                                      |            ++A++
O 0.4 +                                      |                 ++A++++A
N     |                                      |
      |                                      |
F     |                                      |
U     |                                      |
N     |                                      |
C     |                                      |  
T 0.2 +                                      |
I     |                                      |
O     |                                      |
N     |                                      |
      |                                      |
      |                                      |
  0.0 +                                      |
      |                                      |
      |                                      |
      |                                      |
      ----+----+----+----+----+----+----+--+----+----+----+----+----+----+----
          0    1    2    3    4    5    6    7    8    9   10   11   12   13

                                Years of teaching

Basic Mathematical Functions of Event History Analysis

Three important functions are the survivor function, hazard rate, and probability distribution function. The survivor function is formulated as follows:

(1)

This function is computed by multiplying one minus the conditional probability of the event, (1-qi), for each period. The survival probability of the event is the product of the survival probability for each preceding period and the probability of survival for the current period.

The ratio formed by the number of events under examination (here, leaving teaching) to the effective sample size is the conditional probability of the event, sometimes called the hazard rate. This probability of the "event" is the probability of the event under study conditional upon the event of interest not occurring prior to the current period. These conditional probabilities by period are found on the right side of Panel A in Table 4.

The hazard function can also be calculated. The failure rate at the beginning of the interval is also called the conditional rate of failure (leaving teaching). If this conditional rate, referred to in Equation 1 as q(t), is known, then the survival rate can be computed. The survival rate is the complement of the failure rate, sometimes referred to as the cumulative distribution function. The survival probability = 1 - q(t). This can be seen in Panel B of Table 4 above and is the proportion of events that have taken place before the end of the year period. From the conditional probability of failure, the failure rate can be calculated. When this conditional probability of failure is multiplied by the survival rate of the previous period, the amount added to the failure rate is obtained to compute the failure rate of the next time period. This cumulative event function is the probability density function.

The probability distribution function (p.d.f.) defines the probability that an event takes place within a time period, even though the interval is small. The p.d.f. can be mathematically formulated as follows:


(2)

This density function is the height of the function defining the probability that an event occurs within the time interval, or the derivative of the cumulative event distribution function, as shown in Equation 2. From the survival function, S(t) = Pr(T > t), Blossfeld et al. (1989, pp. 33-34) derive the cumulative distribution function F(t) as follows.

(3)

(4)

From these functions, the hazard rate can be derived. The relationship between the hazard and survival rate, and the p.d.f. can be compactly formulated as: h(t) = f(t) / S(t). Thus, we can see that the hazard rate, h(t), is formulated as:

(5)

This function is interpreted as the instantaneous risk within any interval as the size of that interval diminishes. The graph of the hazard function is shown in Figure 3 and the cumulative hazard function is formulated as:


(6)


We know that from h(t) = f(t)/S(t) and Equation 3, we may derive Equation 7:

(7)

For example, from the hazard rate and the probability density function, the survival rate can be inferred. From the survival rate and the cumulative distribution function, it is possible to calculate the hazard rate. To summarize, the graph of the hazard function in Figure 3 depicts the conditional probability of the event over the time periods. From this figure, note that the conditional probability of leaving teaching increases at to a maximum of .18 and then falls rapidly. By interval 12, the end of the study, the probability has declined to nearly .03.

Figure 3

                         AMERICAN TEACHER ATTRITION

  HAZARD |
    0.18 +          A
         |          +
         |          ++
         |          ++
         |          + +
    0.16 +         +  +
         |         +  +
         |         +   +
         |         +   +
         |         +   +
    0.14 +         +    +
         |         +    +
         |         +     +
         |         +     A
         |        +       ++
H   0.12 +        +         ++
A        |        +           A
Z        |        +            +
A        |        +             ++
R        |        +               +
D   0.10 +        +                A+
         |        +                  ++
F        |        +                    +A
U        |       +                       +
N        |       +                        ++
C   0.08 +       +                          +
T        |       +                           A+
I        |       +                             ++
O        |       +                               +A
N        |       +                                 +
    0.06 +       +                                  ++
         |       +                                    +
         |      +                                      A++
         |      +                                         ++A+
         |      +                                             ++
    0.04 +      +                                               +A
         |      +                                                 ++
         |      +                                                   ++
         |      +                                                     A
         |      +
    0.02 +      +
         |     +
         |     +
         |     +
         |     +
    0.00 +     A
         ---+----+----+----+----+----+----+----+----+----+----+----+----+----+--
            0    1    2    3    4    5    6    7    8    9   10   11   12   13

LENGTH OF FIRST TEACHING SPELL (SPELL-1)

Comparing Survival Rates

Survival functions may be compared for equality. For example, to decide whether teacher attrition is significantly different for males versus females, survival rates for males may be compared to those of females. The question is whether survival rates are related to gender. If these rates differ, then the homogeneity among the two groups should be significantly different. Two tests for this comparison are the Log-Rank (Mantel-Haenszel) and the generalized Wilcoxon (Gehan) test. A third test, the likelihood ratio test, assumes that the group data are exponentially distributed. This assumption is often violated, thus we focus on the former two. When we stratify the sample into groups, we request overlaid survival, hazard, and p.d.f. plots for inspection. An example of overlaid survival plots for male and female teachers is given in Figure 4. From the divergence, there appears to be a significant difference in teacher attrition among gender groups.

Figure 4

AMERICAN TEACHER ATTRITION SURVIVAL FUNCTION ESTIMATES

  SDF |
  1.0 + *++++*
      |       +
S     |        +
U     |        +
R     |         +
V     |   
I     |           *
V 0.8 +            ++
A     |              ++
L     |                *
      |                 ++
D     |                   ++
I     |                     *++
S     |                      ++++B+
T 0.6 +                        ++  ++
R     |                          A++ +B++
I     |                             ++A+ ++B++
B     |                                 ++    ++B++
U     |                                   +A++     ++B++
T     |                                       ++A++     ++B++++B++
I     |                                            ++A++          ++B++++B
O 0.4 +                                                 ++A++++A++
N     |                                                           ++A++++A
      |
F     |
U     |
N     |
C 0.2 +
T     |
I     |
O     |
N 0.0 +
      --+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-
        0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
                        LENGTH OF FIRST TEACHING SPELL (SPELL-1)
                     LEGEND FOR STRATA SYMBOLS (A: SEX=F, B: SEX=M)

The log-rank test is a chi-square test with one df for each period. The expected number of departures for males is calculated as a proportion of the total number of departures divided by the total number of persons at risk multiplied by the number of males at risk. Similarly, the expected number of departures for females is computed as a proportion of the total number of departures divided by the total number of persons at risk in that interval, multiplied by the number of female teachers at risk. The log-rank test uses the difference between the total observed departures and the total expected departures, divided by the asymptotic standard error of departures. The differences are summed for the groups and yield an approximate chi-square statistic. The log-rank and the generalized Wilcoxon tests differ in that the weighting factor (omega) is 1.0 for the log-rank test and the size of the risk set for the generalized Wilcoxon test. A basic formulation of these equality tests is given in Equation 8.


(8)


In Table 5 below, SAS output for the equality of strata tests is given. Log-rank and generalized Wilcoxon tests are given for the groups and then the chi-square test results are given. The strata are significantly different in homogeneity for both tests (p < .001). Because the hazard rate is not constant, the likelihood ratio test is inappropriate and therefore is not addressed.

Table 5

Testing Homogeneity of Survival Curves over Strata (Time Variable Spell-1)

___________________________________________________________________________

RANK STATISTICS

SEX          LOG-RANK     WILCOXON

FEMALE         723.61     25040843
MALE          -723.61     -2.504E7



TEST OF EQUALITY OVER STRATA
                               PR >
TEST      CHI-SQUARE    DF  CHI-SQUARE

LOG-RANK     91.6043     1      0.0001
WILCOXON     49.2133     1      0.0001
-2LOG(LR)   189.1879     1      0.0001

___________________________________________________________________________

Assumptions

For many analyses, it is assumed that intervals are equidistant, although this assumption is not necessary. Life tables are useful for studying survival times without continuous covariates. If categorical covariates are used, then survival times may be stratified and compared with the equality of homogeneity tests. This, life tables are useful for preliminary analyses of hazard rates to determine what subsequent analyses might be appropriate. They are also used in sensitivity analyses comparing uncensored and censored groups.

Programs for Life Tables Analysis

A number of statistical programs perform discrete life tables analysis. Among them are SAS, SPSS, BMDP 1L, LIMDEP, and EGRET. The most popular mainframe packages are SAS, SPSS, and BMDP 1L. For the personal computer, SAS, SPSS, and LIMDEP are the most popular. We discuss SAS and SPSS command syntax in this section, explain SAS output, and provide SAS, SPSS, BMDP1L and LIMDEP command syntax for the discrete time life tables analysis in Appendix 4. SYSTAT SURVIVAL does not have a subroutine to analyze discrete time life tables, thus that package is not addressed here.

SAS syntax for life tables. The command syntax presented in Appendix 4 was used to generated the SAS output discussed above. The SAS procedure is invoked with the PROC LIFETEST command. The METHOD statement specifies the type of analysis. For discrete-time models, specify LT or ACT to indicate that a discrete-time analysis is requested. Otherwise, a Kaplan-Meier Product-limit analysis, inappropriate for discrete data, is produced. Because this is useful in analyzing residuals, an explanation of it is provided later. Plots are requested by letters, an S for survival function, H for hazard function, and P for the p.d.f. The LS and LLS options may be used for plots of the natural log of the survival function and the ln(-ln S) function. If the LS function is a straight line, the hazard function follows an exponential distribution, whereas if the LLS function is a straight line then the hazard function follows a Weibull distribution. Such information is helpful in determining the nature of the hazard function, especially for continuous-time models. Periodization is specified by the intervals subcommand. The single spell is divided into 13 one-year intervals for our example, time span is designated by the SPELL-1 time duration, and the censoring variable is LEFT. The 0 inside parentheses specifies the censoring value. If the event is not observed within the time span it is coded as 0. The statement STRATA SEX specifies that survival times should be analyzed by gender. Equality of homogeneity tests are then conducted to evaluate the differences.

SPSS syntax for life tables. SPSS syntax for the discrete time life table is similar to that of SAS. Life table analysis is requested for the length of the Spell-1 variable, broken down by the coded values of sex (M = 1, F = 2). There are 13 one-year intervals. The STATUS variable indicates the censoring variable. Unlike a SAS specification, the coding specified is that for no censoring. Regular events are coded as 1 and all others are considered censored. When all plots are requested, the survivor function, its log plot, the hazard function, and the p.d.f. are provided. COMPARE requests that the equality of homogeneity tests be conducted. SPSS provides a modified Generalized Wilcoxon test called the Desu-Lee test, which is sensitive to differences near the beginning of the survival function, whereas the Log-rank test is sensitive to differences near the end of the function (Blossfeld et al., 1989, pp. 126-127).

DISCRETE-TIME LOGISTIC REGRESSION ANALYSIS

When analyzing non-repeatable events, the survival, hazard, and p.d.f. are easily derived from one another. As shown in Equations 3 through 5, the survival function can be converted to the hazard function easily. Equation 9 shows how to derive the hazard rate from the probability density function:


(9)

Conversely, the survival function may be derived from the known hazard function, as shown in Equation 10. Although it is possible to begin with an analysis of survival times, it may be more useful to follow with an analysis based on the discrete time hazard rate. The discrete time hazard rate, or conditional probability of an event, is represented as h(t). This variable can be treated as a dependent variable in a regression analysis. It is the conditional probability of an event with respect to a baseline probability. The baseline probability is that for which the covariates (predictors) are at 0. If the probability of an event within a particular time period is h(t), then the probability that the event will not occur is its complement = [1 - h(t)]. The odds that the event will take place is the ratio of the conditional probability of the event to the conditional probability of the nonevent. When this ratio is compared to the baseline ratio, the odds ratio is constructed.


(10)


The odds ratio and its transformation, the logit, are given in Equation 11; the standard error is easily computable. The logit of the hazard rate can serve as a criterion for a logistic regression. The logit is modeled as a linear combination of a parameter vector (consisting of a constant term for the period plus the sum of the coefficients multiplied by their respective variable values; Hosmer & Lemeshow, 1989, p. 239). In Equation 12, note that a is the intercept for the jth period under consideration and is the baseline hazard rate against which covariates are evaluated. If both sides of the equation are multiplied by the baseline odds, then Equation 12 is formed. Hence, aj is the baseline hazard rate and the logit of the hazard is equal to the constant aj plus the parameter vector.





(11)

(12)

Alternatively, the logit can be viewed as a function of the hazard rate. Equation 13 gives the transformation of Equation 12 as a function of the hazard.


(13)


But a nonlinear regression with the hazard rate as a function of a exponential parameter vector is as intractable as the multiplicative odds version of the equation. By taking the antilog of (exponentiating) both sides of this equation, the multiplicative odds form of the equation is derived, as shown in Equation 14.


(14)


The problem with both equations is that their components are nonlinear, and thus less tractable than a linear form. The linear form of the regression, given in Equation 12, is more manageable and is the form assumed throughout this chapter.

Person-Period Dataset

Discrete-time logistic regression analysis is unlike standard logistic regression in that it is not based on a person-oriented dataset. Rather, this procedure is applied to each of the j periods of the spell. Results are then aggregated so that person-periods are the cases rather than observations. The number of persons is multiplied by the number of discrete time intervals in the first spell. An advantage of a person-period unit of analysis is that it enhances the sample size. SAS code for constructing a person-period dataset and the resulting data is given in Appendix 1. Similar SPSS code and output are given in Appendix 2. Now we turn to global statistical tests of model fit, using maximum likelihood estimation (an exposition of ML estimation and tests of model fit is provided in Appendix 3).

Global Tests of Model Fit

Several objective tests are available to assess global model fit. They compare the fit of models with parameters compared to a model without the parameters. Prominent among these tests are the likelihood ratio test, Wald chi-square, Score test, Pearson goodness of fit test, and the pseudo-R2. In addition, the Akaike Information Criterion (AIC), Schwartz Criterion (SC), Gamma, Somer's D, and classification table provide useful information about model goodness of fit. The classification table is a very useful summary measure. When the ordered response is 1 for an event and the ordered response is 2 for no response for SAS (SAS models the lower rather than the upper probability), and 1 for an event and 0 for a non-event in SPSS, a classification table is produced. The most complete version of this table is provided by SAS. Both SAS and SPSS calculate the percent correct (hits). But only SAS outputs the sensitivity (percent of event responses predicted correctly) and the specificity (percent of the non-events predicted to be nonevents). SAS also calculates the false positive rate, or the percentage of predicted event responses that were observed to be non-events, and the false negative rate, the percentage of predicted non-event responses observed to be events. A SAS classification table and its associated statistics are given in Table 7 and in Appendix 7. An expanded version of the SAS classification table is found in SAS version 6.07 and 6.08 (Appendix 15).

Table 6

Effect Coding for Categorical Variable of Race


RACE

D1

D2

D3

White

-1

-1

-1

Black

1

0

0

Hispanic

0

1

0

Other

0

0

1


Logistic Regression Coefficients

Logistic regression coefficients are interpreted as the amount of change in the logit that is associated with a unit change in the independent variable. If the coefficient is positive, then the logit change associated with a unit change in the independent variable is positive. If the sign is negative, then there is a decrease in the logit associated with a positive change of one in the independent variable. If the dummy variable coefficients were exponentiated-- that is, the natural log raised to that power of the coefficient -- then the transformed coefficient would represent the change in the odds of the event taking place owing to a unit change in the independent variable. Once this odds transformation is performed, coefficients greater than one indicate a positive change in the odds of the dependent variable, whereas coefficients less than 1.0 indicate negative changes in the odds of the dependent variable. To convert this change in the odds to a percentage, one uses the following transformation, which does not hold if the coefficients are not coded as dummy variables.


(15)


If the dichotomous variables are coded as deviations from means, or reference cell coding, then simple exponentiation of the logistic regression coefficient will not give the correct odds change in the dependent variable. Suppose that the gender variable is coded as 1 for male and -1 for female. Then the effect coding of categorical variables gets even more complicated. This kind of coding is sometimes used when the predictors are categorical. Hosmer and Lemeshow (1989, pp. 47-52) show what happens in the coding of a race variable with four categories. The categories are white, black, hispanic, and other. With effect coding there is a need for three variables; Table 7 contains the effect codes for the variables D1, D2, and D3.

Table 7

Classification Table for Logistic Regression



-------------------------------Predicted ---------------



Event

No Event

Total


Event

16

2

18
Observed

No Event

5

4

9


Total

21

6

27

Sensitivity = 88.9% Specificity = 44.4% Correct= 74.1%

False Positive rate = 23.8% False negative rate = 33.3%

For effect coding, the white category, will have -1 for each of the dummies. To obtain the logit for the black-white comparison, Hosmer and Lemeshow (1989) show that the greater the number of categories, the more difficult it is to convert the logit coefficient to the correct odds. To obtain the proper effect coding of black vs. white, the coding in Equation 16 should be used.The results are not the mere exponentiation of the parameter estimate. For this reason we recommend that such effect coding be avoided in running discrete time logistic regression. analysis.

(16)

If the variables are measured as integers, then the above formulae do not hold. A change in 10 units of the independent integer level variable then would lead to a percentage change of (exp(b)10 - 1)*100 in the rate of the dependent variable. In fact, if the coefficients are coded as integer level variables, then the general formula in equation 17 should be used.


(17)


Statistical significance of the logistic regression coefficients may be assessed in two ways. One of these methods was just described and utilizes the Wald chi-square test, although the statistic becomes somewhat unreliable as the ratio exceeds four. Therefore, we suggest a more reliable method, which constructs a model by assessing -2LL for the model without the variable to be tested. Then -2LL is calculated for the model with the variable(s) added. The ratio of these two numbers is the LR chi-square for the newly added variable(s). It is distributed as a chi-square with one degree of freedom. This method appears to be more reliable than the Wald chi-square and we recommend it.

Nonetheless, we do not recommend the stepwise option in discrete-time logistic regression. Stepwise methods may unwisely eliminate variables that should be included, if the subsequent inclusion of other variables causes the significance of a variable to drop below a particular statistical limit. For instance, main effect components of an interaction might be eliminated by such an algorithm. If the product vector is left in the model but the main effects are dropped out, then the interaction is improperly specified. As the number of variables in the model change, the listwise deletion option may bring about a change in the number of observations, upon which the likelihood ratio chi-square is based. In this way an automatic stepwise algorithm might damage the modeling process theoretically.

Time-dependent covariates. Not only is it necessary to consider the matter of coding the time-invariant covariates, it is necessary to consider the coding of time-dependent covariates also. The interpretation of these coefficients depends on their correct coding. For convenience, categorical and dichotomous variables should be coded as dummy variables.

Event-functional covariates. Some time dependent covariates vary as a function of the occurrence of an event. Some covariates refer to states. If we are examining the hazard rate of job change, it has been hypothesized that marriage might have an effect on the hazard rate of job shifts. To code the event of marriage, a dummy variable might be used. The coding of the marriage variable before an event is 0, and the coding of that variable after the time at which the event took place is 1. Dummy variable coding for before and after a particular time is simple. The variable is given a 0 before the event for each interval coded. Suppose an event occurs for an individual within period 10. At the time of the event, the variable is given a 1 indicating that the event has occurred (in other words: Event = 0; if interval > 9 then Event = 1). These codes may be used in both discrete-time and continuous-time models.

Time-functional covariates. There are some variables that are a function of calendar time. Age increases as a function of time. This functional relationship may be coded easily. Let us assume that a person was born in 1910. The age of the person will increase by 12 months as each year passes. If the intervals are measured in years. Then the age of the person is so incremented at each interval. Age = 12 * Int - 1. Such functional forms may be constructed easily in SAS, BMDP, and most other statistical packages.

Time-varying covariates. These variables, like population density, vary with time. But the variation need not be a function of time. Rather they may cycle, or vary upward and then downward, depending on birth, mortality, longevity, and migration rates. These values may be entered at each interval in a discrete time model as average values for the interval. Alternatively, they may be entered as midpoints of the values at the beginning and end of the interval. Some persons code the variable, such as annual income, as the amount being earned annually at the beginning of that interval. In these ways, time-varying covariates may be coded. If one were to plot the level of these covariates over time, it might be possible to arrive at the level for each interval by means of linear interpolation. Using the average value has been suggested by Allison (1982, 1984) in his work on discrete time methods. Blossfeld et al. (1989) suggest that the interpolated value be programmed for the beginning of the interval.

Problems may arise from the use of time-varying covariates. If a variable like income, which may increase over time, is used as an exogenous variable, then increases in income may be erroneously associated with longer survival times and correspondingly lower conditional risks of the event under consideration. Age may affect the life cycle of a process and that life cycle may be related to the duration of time till an event takes place. When causal variables change within an episode then the transition rates change not just between states, but within episodes as well (Tuma & Hannan, 1979). The historical period, associated with environmental conditions, may have an age-related effect on the rate of a process. Cohort effects may be related to the rate of the process also. Beware that when a cohort is defined by a birthdate, there is a linear dependency between cohort and birthdate that may create a model including both effects that is unidentified.

Residual Diagnostics

Just as with OLS regression, residual diagnostics are useful to assess the adequacy of the model (Stevens, 1984). They help to identify observations that are outliers -- that is, observations that are not well explained by the model. They assess the influence of the outlier not only on the chi-square of the model, but also on the individual logistic regression coefficients. Observations that are outliers may seriously distort the parameter estimates and diagnostics can help indicate by how much distortion results. By identifying outliers and influential data points, one may examine the observations for coding errors or atypical cases, possibly meriting deletion or, at the least, further examination.

Residual diagnostics that permit this assessment include the Pearson residual, DIFCHISQ, and the DFBETA. The Pearson residual is useful in detecting outliers that are not well predicted by the model. The Pearson chi-square is actually the squared sum of squares of the residuals. The formula for the Pearson residual for the ith observation is:


(18)


The DIFCHISQ is a residual diagnostic that shows the difference in the model chi-square that follows from the deletion of a particular observation. It is a statistic that can show the effect on the model of the observation under consideration and its formula can be found in equation 19 below.


(19)


While the DIFCHISQ is useful in assessing the influence on model fit of a particular observation, the DFBETA index is useful in assessing the influence on the standardized beta of a particular observation. The DFBETA measures the change in the standardized parameter estimate owing to the deletion of a particular observation. The formula for DFBETA is given in Equation 20.


(20)


Together, these residual diagnostics allow the analyst to examine the effect of the observation on the parameter or the model. In this way, outliers may be identified, examined, and possibly deleted.

Assumptions

Undergirding discrete-time logistic regression analyses are three basic assumptions, explicated by Singer and Willett (1992, p. 41). If these assumptions are untenable, the model and inferences may not be valid. The first assumption is that of linearity of the logit. The second assumption is proportionality of the odds, and the third assumption is no unobserved heterogeneity. The linearity assumption means that equal displacements of the independent variable are associated with equal displacements of the logit hazard profile. This assumption may be assessed by both graphical and statistical means. Graphically, this linearity may be observed by dividing the independent variable into strata. The graphical display of the vertical displacement of the logits as a function of the different strata should reveal that equal vertical displacements of the logit should correspond to equal horizontal displacements of the strata of the independent variable. Statistically, linearity can be tested by showing that the addition of polynomial terms should not significantly improve the fit of the model. At first, squared terms may be added. Then higher order curvilinear terms may be entered. At no time should the addition of these terms significantly lower the -2LL, which is distributed as a chi-square with the number of degrees of freedom equal to the terms entered in that set.

The proportionality of the odds assumption is sometimes referred to as the parallel slopes assumption. This assumption means that there is no interaction between the independent variable and time. That is to say, the increase in the predictor variable is associated with a proportional increase in the odds. In other words, a shift in time period is assumed to be associated with a constant vertical displacement of the logit hazard score. The statistical test of this condition is to construct J-1 period dummies and to test their interaction with the independent variable for significant decreases in the -2LL. Alternatively, one may use a time (t) and time squared (t2) term and set up product vector interactions, sequentially adding these terms and assessing for statistically significant reductions in the -2LL. If there is a significant interaction of the independent variable with time, then the hazard profiles of the different values of the variable may intersect or criss-cross. Moreover, if an independent variable is time-varying, then the hazard may fluctuate with increase in time that yields a non-proportionality of the odds. But this is not always, although it often is, the case..

The assumption of no unobserved heterogeneity is a third requirement. The logit hazard regression model contains no error term. All of the variation in the logit hazard profile is assumed to be contained in the specified variables. In other words, no misspecification of the model is permitted. Singer and Willett (1992) note that "omission of an important independent variable amounts to pooling of heterogeneous populations defined by the different values of the omitted predictor" (p. 38). The aggregated populations can have different hazard shape than that of the disaggregated hazard profile. When individuals in the high risk groups die out early, the hazard rate becomes depressed with the passage of time owing to the selection effect. The selection effect of unobserved heterogeneity might bias the hazard rate downwards, according to Allison (1984). This bias means that the time-dependent nature of the risk may alter the hazard profile accompanying random selection of subjects. However, it may be possible to include the effects of unobserved heterogeneity in the estimation procedure. Heckman and Singer (1984) have proposed a method for making the correction (cf. Trussell & Richards, 1985). Yamaguchi (1986) provides a thorough review of fixed and random effects methods and we direct the interested reader to that source for additional details. Jain and Vilcassim (1991) illustrate the use of an unobserved heterogeneity component in their investigation of household purchase timings.

Statistical Packages for Discrete-Time Logistic Regression Analysis

Although logistic models may be estimated using SAS, SPSS, SYSTAT, and BMDP, setting up a person-period dataset and then aggregating the periods is most easily done in SAS. For that reason, the SAS command syntax for such a model is presented in Appendices 5 and 6 in annotated form. The SAS output is then presented in Appendix 7 and briefly explained.

COX REGRESSION

If one assumes that the intervals are small in size and the discrete time hazards are small in magnitude, then the discrete time logistic model may be transformed into the Cox proportional hazards regression model (Cox, 1972; Cox & Oakes, 1984). As the time intervals become finer, the dependent variable in Equation 21 for this analysis turns into a ratio of hazards (Yamaguchi, 1991, p. 18). The discrete time hazard ratio is transformed as shown in that equation. This is the odds ratio. Once the natural logarithm is taken, the logit is formed and the following equations for a Cox regression results.


(21)



(22)



(23)


Equation 22 and 23 are estimated using partial likelihood. Cox (1975) showed that the model parameters may be estimated by ignoring the baseline hazard rate and maximizing the likelihood by minimizing -2LL of the model. When a researcher is not concerned with a baseline hazard rate, this can be a robust analysis and should be considered. The goodness of fit tests are the same as given above. They include the LR chi-square, the Wald chi-square, and the Score test statistic. The Akaike Information Criterion or Schwartz Criterion are often used to compare models, with a lower score indicating the preferred model.

The Cox regression coefficients are the logits of the hazard and may be exponentiated to obtain the risk ratio. This value represents the change in the odds associated with a unit change in the independent variable, particularly when the individual variables are dummy coded. Categorical variables are dummy coded and the continuous variables are evaluated as described above for the discrete time logistic model. As before, the odds refer to the ratio of the probability of experiencing the event to the probability of not experiencing it.

The assumptions of the proportional hazards regression model are linearity, proportionality of the odds, and no unobserved heterogeneity. Proportionality of the odds means that there is no significant interaction of a covariate with time. It is the parallel slopes assumption. The model should be tested with an interaction of the covariate with time to ascertain whether this assumption holds. If it does not, the interaction term should be left in the model. The proportionality of the odds assumptions means that the Cox model cannot handle many time varying covariates. If there are more than one or two time-varying covariates, then it may better to rely on the discrete time logistic model.

Statistical Packages for Proportional Hazards Analysis

Subprograms to estimate proportional hazards regression models are found in the SAS PHREG, BMDP 2L, SPSS COXREG, SYSTAT SURVIVAL, and LIMDEP SURVIVAL packages, among others. The SAS syntax for the Proportional Hazards regression presented here (and in Appendices 9 and 17) utilizes the PHREG procedure in Version 6.04 of SAS (1990). In earlier versions, the PHGLM supplementary library procedure performed this kind of analysis. An example of the PHREG procedure follows and the syntax is given in Appendices 14 and 15 and briefly described here. The Spell-1 variable is the length of the time and the left variable is again the censoring indicator with a 0 indicating censoring of the observation. The covariates are the x variables and a covariance matrix is requested. An example of the output from this analysis follows in Table 8. The SAS output is presented as it is the most complete form of output.

Table 8

DISCRETE TIME PHREG MODEL, MICHIGAN DATA, SPELL1 ANALYSIS
     Data Set: WORK.SPELLT1
     Dependent Variable: DV
     Censoring Variable: LEFT (Left teaching before 1984-85 school year)
     Censoring Value(s): 0
     Ties Handling: BRESLOW

Summary of the Number of Event and Censored Values
                                                       Percent
                     Total       Event    Censored    Censored

                    178109       64785      113324       63.63

                     Testing Global Null Hypothesis: BETA=0

                   Without        With
    Criterion    Covariates    Covariates    Model Chi-Square

    -2 LOG L     1556206.27    1531998.41    24207.85 with 2 DF (p=0.0001)
    Score              .             .       20585.38 with 2 DF (p=0.0001)
    Wald               .             .       15832.20 with 2 DF (p=0.0001)

                    Analysis of Maximum Likelihood Estimates

                    Parameter     Standard      Wald         Pr >          Risk
 Variable   DF       Estimate       Error    Chi-Square   Chi-Square      Ratio
 TIME        1      -0.077037      0.00585    173.33514       0.0001      0.926
 TIMESQ      1      -0.016618    0.0006293    697.27383       0.0001      0.984

The status variable indicates the code for the event of departure from teaching. The other codes for this variable indicate censoring. Here the left hand variables include the time variable and the censoring variable. The coding of the censoring variable in LIMDEP must be 1 for actual departures from teaching and 0 for censored observations. The right hand side variables include the x covariates plus a time varying covariate that is a function of time. This time varying covariate is a product of the variable z1 and the time variable.

APPLICATIONS OF DIFFERENT MODELS

A question generally arises as to which kind of analysis to use and when. It is usually useful to examine the survival function and the hazard rate as a first step because these functions play integral parts in the description of the duration of the phenomena under study. Therefore, usually a life tables analysis should come first. Often, the question of stratification arises. Are durations related to a categorical variables? Life tables analysis tests whether the survival rates can be stratified by a covariate so that the respective rates are significantly different. Therefore, life tables procedure should be performed first. Also, we prefer the SAS system because of the quality of its algorithms/output and the ease of construction of the person-period dataset.

Teachman (personal communication, August 1992) has commented on the relative advantage and disadvantages of the discrete time logistic model with respect to the Cox proportional hazards model. Among the relative advantages of the discrete time logistic model over the Cox regression model are handling of time-varying covariates, ease of explanation, and handling of ties. If the spell is easily and naturally divisible by intervals set up on an aggregated dataset, and if the intervals are not too large, then time aggregation bias need not pose a serious problem and the discrete time logistic model is usually a good choice. If the researcher can tradeoff a little bit of power as a consequence of the aggregation of time to test his hypotheses, then the discrete time logistic model is a good model to use. If there are a lot of time varying covariates, the discrete time logistic model may handle this form of parameter vector better than the proportional hazards model. On the other hand, if there are a lot of ties, then the discrete time logistic model should perform better.

There are situations where the Cox model may be preferred. If researchers are not concerned with the magnitude of the baseline hazard rate, they may prefer the simplicity of the Cox model. Researchers may have only one or two time-varying covariates in their parameter vector. Owing to the proportionality of the odds assumptions, the Cox model cannot handle many time-varying covariates well. But if there are only one or two, then the Cox model should be considered. If the covariates are fixed and proportional, this is a good model to apply. If there are few ties, then this model may be an appropriate one to use.

MULTIPLE SPELL MODELS

Several advanced discrete-time models deserve mention. One of them, the multiple spells model, deserves extensive treatment. Multispell models are also called multiepisode or multicycle models. There are two subtypes: one-way and two-way transitions (Yamaguchi, 1991, p. 46). We focus on the one-way transition models, under which a teacher might enter teaching, teach for a spell, and then leave. Later, the teacher may return to teaching before again departing. The one-way transition of interest is leaving teaching. The focus is duration of the teaching spell. In a two-way transition, the model may include a teaching-departure and a teaching-return transition. In such models, there may be an in-teaching spell and an out-of-teaching spell. Yamaguchi (1991) addresses the subject of two-way transitions in which a transition can be reversed (p. 46). Singer and Willett clearly explain the nature of such models in their treatment of in- and out-of-teaching spells (Willett & Singer, 19??). Multispell models are relevant for studying episodes of absenteeism, unemployment, or recidivism.

Discrete-time models may be used to analyze repeatable events by examining the duration of time from the beginning of the spell to the end of the spell, in terms of periods, for each spell. Thus, the spell lengths are aggregated as well. Because these models involve multiple persons within multiple periods within multiple spells, problems of left-censoring emerge that could complicate the analyses. There is no optimal ay to deal with left-censored observations at this time, unless the hazard rate is constant or unless there is preexisting duration dependency data from which the missing (i.e., left-censored) data may be estimated. However, constant hazard rates or such prehistorical dependency data are rare in practice, for which reason it is very important to be sure that the beginning time of the spell is correctly recorded.

As we have noted, multispell models involve person-period-spell datasets. In the analysis, the focus is on what contributes to the length of the teaching spell. We are interested in one-way transitions (leaving teaching) as the terminating event. There is a person-period dataset for each spell of the model; thus, if there are two spells, then there are two person-period datasets describing these in-teaching spells. The datasets are concatenated to create the person-period-spell dataset. In Appendix 9, a straightforward example of a SAS program converting a person-oriented dataset to a person-period-spell dataset is given. In SPSS, this conversion is complex and requires three steps. First, the person-period dataset is constructed for the first spell, using the SPSS syntax in Appendix 3, and it is saved as a system file. Second, a person-period dataset is constructed for the second spell, and it is saved as a separate system file. Finally, the two system files are concatenated with the ADD FILES command to form the person-period-spell dataset. This procedure is slightly different in SPSS for Windows in that one calls up a system file from one spell and then adds cases from the other spell system file. In this way, a multiple spells system file is created. It is on this dataset that the procedure is run. Multispell datasets may consume a lot of computer memory, depending on the number of observations, thus preparing them for processing on a personal computer may be inefficient.

In multiepisode models, the survival function, p.d.f., and hazard functions take into account the spell under consideration. Each spell has a survival, p.d.f., and hazard function, as well as a set of covariates. Examples from a two-spell Michigan teaching attrition model are found in Appendices 9 and 10. Survival and hazard rates plotted for each spell of this model in Appendix 11 and 12. A working assumption is that the spells are conditionally independent. The spells are considered a "modulated renewal" stochastic process consisting of spells that are identically and independently distributed (Yamaguchi, 1991, p. 47). The set of covariates associated with the spell under examination may change from spell to spell. The second spell may be a function of "prehistory," including the covariate vector of the first spell. This possibility can be tested. Further, the return to teaching and length of the second teaching spell may be a function of the duration of the previous teaching spell. The prehistory therefore may include the length of previous spells along with the covariate vectors associated with them. Portions of this prehistory may turn out to be significant after testing (Blossfeld, Hamerle, & Mayer, 1989, pp. 60-64). We shall first show the survival and hazard functions for an individual spell and then we shall show them for all spells. The survivor function for the kth spell (k = 1,2,...) is:


(24)


The hazard rate for a particular spell, the spell-specific hazard rate, which is the limit of probability as the time interval decreases, is given in Equation 25 (Blossfeld et al., 1989).

(25)

From this spell specific hazard rate, we can obtain a multiple spells discrete time logistic regression formula. This formula is very similar to that in Equation 5, except that the multispell model combines person-periods and spells into one formula, as can be seen in Equation 26.

(26)

The time and time-squared variables are added to check the duration dependency --that is, a relationship of the outcome variable as a function of the time in a state. The latter term allows for the assessment of curvilinearity in duration dependency and is useful if one considers parametric modeling of the equation. The sign of the coefficient indicates concavity upward or downward -- a negative squared term indicates a concave downward, while a positive squared term indicates a concave upward. The spell variable indicating the spell number indicates whether this is a significant factor. The coefficient of the spell variable shows the shift that comes about when one moves from one spell to next. If it is significant, the shift in logit space from one spell to the other is noteworthy. As this parameter estimate would be equal to the intercept and because it is a linear function of the intercept, the intercept must be dropped from this model to avoid multicollinearity and allow for estimation of the spell parameter. An example of this estimation procedure is found in Appendices 14 and 15. Interactions of the covariates with time and time squared should initially be used to test proportionality of the hazards. In Equation 26, an example of such an interaction is between time and x1. A SAS program to compute a discrete-time logistic regression model with an interaction term between time and sex shows how nonproportional hazards may be modeled, when the assumption of proportional hazards is violated. These are logistic regression coefficients which show the shift in natural-log hazard space associated with a unit change in the temporal or other covariate. A classification table is presented to show sensitivity, specificity, and false positive and false negative rates (Appendices 7 & 15). The same procedure may be run in SPSS.

COMPETING RISK MODELS

Another discrete-time model involves the concept of competing risks. Competing risks must be mutually exclusive and collectively exhaustive for the models to be transition specific. The risks must be completely different from one another in that either one or the other can happen, but not both, at the same time. When these models are analyzed, time until the transition-specific event is analyzed, while alternative transitions are treated as censored. Thus, there is a different analysis for each transition specific risk or event. Consider the event of dropping out of school and possible primary explanations for so doing. One may drop out of school mainly to go to a better school, or because of financial hardship, or for family problems, or to get married and raise a family, or if one becomes very ill, or if one dies. . Dropping out for another predominant reason may be coded as the last alternative. Together, these explanations comprise the set of risks (main reasons) for dropping out of school for a student. Ideally, they are mutually exclusive and collectively exhaustive in their locus of primary reasons for leaving school.

CONTINUOUS TIME MODELS

An additional class of models is referred to as continuous-time models (Tuma & Hannan, 1984, advocate this type of model). When the temporal data are continuous in nature, discrete-time models are unnecessary. If the data refer to particularly salient events, which are easily and exactly remembered, then exact times may be recalled and continuous time models may be used. When the data arise from continuous monitoring of events and the recording of such events is accurate, precise durations may be calculated. One example of this analysis is the medical monitoring of patients. In these cases continuous time models may be used in either nonparametric or parametric form.

The Kaplan-Meier product-limit life table is a nonparametric continuous-time model (Miller, 1981, p. 46). A brief explanation of this form of life table is presented in Appendix 16. Alternatively, parametric regression in continuous-time models requires accurate assessment of the distribution of errors, or the parameter estimates are biased. It is necessary to ascertain through examination of the hazard rate whether error distributions are exponential, Gompertz (Makeham), Weibull, log-logistic, log-normal, or gamma. If the hazard rate is constant, the model may be exponential. If the hazard rate is either increasing or decreasing, the distribution may be Gompertz or Weibull. If the hazard rate cycles (increases and then decreases), the distribution may be log-logistic.

Examination of the residuals will help determine the nature of the fit. Researchers may compute the average or median residual duration for uncensored as well as censored data. Even if one uses a pseudo-R2 statistic to assess the fit, it is not always possible to be certain that the parametric distribution selected is the correct one, especially because of possible unobserved heterogeneity. For this reason, it is often safer to employ the discrete-time model than risk serious parameter bias that results from incorrect specification within a parametric framework.

CONCLUSIONS

Our presentation has developed the rationale, procedures, and assumptions of discrete-time event history analysis. The rationale centers on the increasing requirements of higher education researchers and practitioners to evaluate time until the occurrence of events (e.g., graduation, student attrition, faculty tenure). After developing three specific techniques for analyzing time until an event and relating the times to predictors, we used an extended example involving teacher attrition to demonstrate several computer packages that estimate the event history analysis models that were presented. We also presented a brief description of one multispell model with a one-way transition, in order to expand researcher knowledge of a more advanced approach to temporal data. Competing risks and continuous time models were briefly described but not elaborated. Writings from a range of disciplines were integrated, as can be seen by the explosion of interest in event history analysis over the last decade.

Researchers, spanning social science disciplines and levels of analysis, use event history analysis to study the longevity of cohabitation or marriage, medical treatments for chronic illness, and consumer product failure times. At the individual level of analysis, sociologists and demographers study the duration of cohabitation or marriage (Teachman, 1983). At an aggregate level of analysis, the survival of organizations (Hannan & Freeman, 1977), corporate illegal behavior (Baucus & Near, 1991), survival of governments (Hannan & Carroll, 1981), and decolonization (Strang, 1990) are of interest. Sociologists interested in criminal behavior examine variables that influence time until recidivism (i.e., re-arrest) of offenders. Similar techniques are used by economists to study the duration of employment and unemployment periods (Flinn & Heckman, 1982a, 1982b; Heckman & Singer, 1984, 1985; Kiefer, 1988), and by organizational psychologists to study absenteeism (Fichman, 1989; Harrison & Hulin, 1989) or time until turnover in civilian and military jobs (Morita et al., 1989). Psychologists study time to relapse of smokers (Hatziandreu et al., 1990), alcoholism (Swan & Denk, 1977) and other forms of addiction. Finally, medical researchers often study survival after different treatments for disease, for example, heart transplants (Crowley & Hu, 1977; Lawless, 1982). Because diseases often have beginnings, durations, terminations/remissions, and recurrences/recoveries, they are ideal for event history analyses (accurately called survival analyses).

One major conclusion, and one that is nearly inevitable, is a requirement to consider reframing substantive research questions in higher education. The refocusing should center on the duration until event occurrence, combining the unique advantages of the person-period data matrix and the capability to develop explanatory models using several covariates (i.e., time-independent and/or time-varying). Further, institutional research units at many universities provide a rich source of archival duration data that is amenable to the sophisticated analyses that were discussed here. The applicability of survival analysis to multiple higher education populations (students [clients], teachers [service providers], and administrative [i.e., executive and support personnel]), and even to institutions themselves demands increased cognizance of this flexible family of procedures.

Notes

1. The data and codebook were respectively gathered and prepared under the direction of Murname, Singer and Willett at Harvard University and made available to this author through the Inter-University Consortium for Political and Social Research (ICPSR) at the University of Michigan. We use the same variables names found in the ICPSR codebook and are grateful to both Murname, Singer, and Willett on the one hand and to the Inter-University Consortium for Political and Social Research on the other for making these data available to us.

2. A difficult type of censoring is left-censoring. Such observations include unknown beginning time points. Unless the hazard rate is constant, it is difficult to estimate the duration of the spell under circumstances of left-censoring. This kind of censoring may occur with data characterized by multiple spells. For example, studies of unemployment may involve repeated spells of unemployment, where in retrospect the beginning of the unemployment spells may be indefinitely recalled.

3. Singer & Willett derive the log likelihood function and their derivation is used here.

REFERENCES

Allison, P.D. (1982). Discrete-time methods for the analysis of event-histories. In S. Leinhardt (Ed.). Sociological methodology 1982 (pp. 61-98). San Francisco, CA: Jossey-Bass.

Allison, P.D. (1984). Event history analysis: Regression for longitudinal event data. Newbury Park, CA: Sage.

Austin, J.T., Yaffee, R. A., & Hinkle, D. E. (1992) Logistic Regression for Research in Higher Education. In J. Smart (Ed). Higher Education: Handbook of Theory and Research VIII (pp. 379-410). New York, NY: Agathon Press.

Baucus, M.S., & Near, J.P. (1991). Can illegal corporate behavior be predicted? An event history analysis. Academy of Management Journal, 34, 9-36.

Bloom, B.S. (1964). Stability and change in human characteristics. New York: Wiley

Blossfeld, H.P., Hamerle, A.., & Mayer, K.U. (1989). Event history analysis: Statistical theory and application in the social sciences. Hillsdale, NJ: Erlbaum.

Cox, D.R. (1972). Regression models and life tables. Journal of the Royal Statistical Society, B, 34, 187-220.

Cox, D.R. (1975). Partial likelihood. Biometrika, 62, 269-276.

Cox, D.R., & Oakes, D. (1984). Analysis of survival data. London: Chapman & Hall.

Crowley J., & Hu, M. (1977). Covariance analysis of heart transplant survival data. Journal of the American Statistical Association, 72, 27-36.

Everitt, B.S. (1987). Introduction to optimization methods and their application in statistics. London, UK: Chapman & Hall.

Fichman, M. (1988). Motivational consequences of absence and attendance: Proportional hazard estimation of a dynamic motivation model. Journal of Applied Psychology, 73, 119-134.

Flinn, C.J., & Heckman, J.J. (1982a). Models for the analysis of labor force dynamics. In G. Rhodes & R. Basmann (Eds.), Advances in econometrics (Vol. 1, pp. 35-96). Greenwich, CT: JAI Press.

Flinn, C.J., & Heckman, J.J. (1982b). New methods for analyzing individual event histories. In S. Leinhardt (Ed.), Sociological methodology 1982 (pp. 99-140). San Francisco, CA: Jossey-Bass.

Freedman, D., Thornton, A., Camburn, D., Alwin, D., & Young-Demarco, L. (1988). The Life History Calendar: A technique for collecting retrospective data. Sociological Methodology, 17, 37-68.

Hannan, M.T., & Carroll, G.R. (1981). Dynamics of formal political structure: An event-history analysis. American Journal of Sociology, 46, 19-35.

Hannan, M.T., & Freeman, J.H. (1977). The population ecology of organizations. American Journal of Sociology, 82, 929-964.

Heckman, J., & Singer, B. (1982). Population heterogeneity. In K.G. Land & A. Rogers (Eds.), Multidimensional mathematical geography (pp. 567-598). New York: Academic Press.

Heckman, J., & Singer, B. (1984). A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica, 52, 271-320.

Heckman, J., & Singer, B. (Eds.). (1985). Longitudinal analysis of labor market data. New York: Cambridge University Press.

Heckman, J., & Singer, B. (1985). Social science duration analysis. In J. Heckman & B. Singer (Eds.), Longitudinal analysis of labor market data (pp. 39-110). New York: Cambridge University Press.

Hosmer, D.W., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.

Jain, D.C., & Vilcassim, N.J. (1991). Investigating household purchase timing decisions: A conditional hazard function approach. Marketing Science, 10, 1-23.

Kalbfleisch, J.D., & Prentice, R.L. (1980). The statistical analysis of failure time data. New York: Wiley.

Kiefer, N.M. (1988). Economic duration data and hazard functions. Journal of Economic Literature, 26, 646-679.

Klein, J., & Goel, P. (1992). Survival analysis: State of the art (NATO advanced workshop proceedings). Dordrecht, Holland: Kluwer.

Lawless, J.F. (1982). Statistical models and methods for lifetime data. New York: Wiley.

McGrath, J.E., & Kelly, J. (1988). Time and human interaction. Newbury Park, CA: Sage.

Miller, R.G. (1981). Survival analysis. New York: Wiley.

Morita, J.G., Lee, T.W., & Mowday, R.T. (1989). Introducing survival analysis to organizational researchers: A selected application to turnover research. Journal of Applied Psychology, 74, 280-292.

Murnane, R.J., Singer, J.D., & Willett, J.B. (1988). The career paths of teachers: Implications for teacher supply and methodological lessons for research. Educational Researcher, 17, 22-30.

Nagel, E. (1979). The structure of science (2nd ed.). Indianapolis, IN: Hackett.

Pierre, D.A. (1986). Optimization theory with applications. New York: Dover.

SAS Institute, Inc. (1990). PHREG preliminary documentation. Cary, NC: Author.

SAS Institute, Inc. (1990). Logistic procedure. In Statistics user's guide (4th ed., Vol. 2, Ver. 6). Cary, NC: Author.

Sclan, E.M. (1993). The effect of perceived workplace conditions on beginning teachers work commitment, career choice commitment, and planned retention. Unpublished Ed.D. dissertation, Teachers College, Columbia University.

Singer, J.D., & Willett, J.B. (1993). It's about time: Using discrete­time survival analysis to study duration and the timing of events. Journal of Educational Statistics, 18, 155-195.

Stevens, J.P. (1984). Outliers and influential data points in regression analysis. Psychological Bulletin, 95, 334-344.

Strang, D. (1990). From dependency to sovereignty: An event history analysis of decolonization 1870-1987. American Sociological Review, 55, 846-860.

Swan, G.E., & Denk, C.E. (1987). Dynamic models for the maintenance of smoking cessation: Event history analysis of late relapse. Journal of Behavioral Medicine, 10, 527-554.

Teachman, J. (1983). Analyzing social processes: Life tables and proportional hazards models. Social Science Research, 12, 263-301.

Trussell, J., & Richards, T. (1985). Correcting for unobserved heterogeneity in hazard models using the Heckman-Singer procedure. In N.B. Tuma (Ed.), Sociological methodology 1985 (Vol. 15, pp. 242-276). San Francisco, CA: Jossey-Bass.

Tuma, N.B. (1980). Invoking RATE. Menlo Park, CA: SRI International.

Tuma, N.B. (1982). Nonparametric and partially parametric approaches to event history analysis. In S. Leinhardt (Ed.), Sociological methodology 1982 (Vol. 12, pp. 1-60). San Francisco, CA: Jossey-Bass.

Tuma, N.B., & Hannan, M.T. (1978). Approaches to the censoring problem in analysis of event histories. In K. Schuessler (Ed.), Sociological methodology 1979 (Vol. 8, pp. 209-240). San Francisco, CA: Jossey-Bass.

Tuma, N.B., & Hannan, M.T. (1984). Social dynamics: Models and methods. New York: Academic Press.

Tuma, N.B., & Hannan, M.T. (1980). Invoking RATE (2nd ed.). Menlo Park, CA: SRI International.

Tuma, N.B., Hannan, M.T., & Groeneveldt, L.P. (1979). Dynamic analysis of event histories. American Journal of Sociology, 84, 820-854.

Willett, J.B., & Singer, J.D. (1989). Two types of questions about time: Methodological issues in the analysis of teacher career path data. International Journal of Educational Research, 13, 421-437.

Willett, J.B., & Singer, J.D. (1991a). From whether to when: New methods for student dropout and teacher attrition. Review of Educational Research, 61, 407-450.

Willett, J.B., & Singer, J.D. (1991b). How long did it take? Using survival analysis in educational and psychological research. In L. Collins & J. Horn (Eds.), Best methods for the analysis of change (pp. 310-327). Washington, DC: American Psychological Association.

Willett, J.B., & Singer, J.D. (1993). It's deja-vu all over again: Using multiple-spell discrete-time survival analysis. Manuscript submitted for review.

Yaffee, R.A. , Lorenz. V.C. & Politzer, R.M., (1993) Models Explaining Gambling Severity among Patients undergoing Treatment in Maryland: 1983 through 1989. In W.R. Eadington & J.A. Cornelius (Eds.), Gambling Behavior and Problem Gambling .(pp. 657-677). Reno, NV.: University of Nevada Press, Reno.

Yamaguchi, K. (1986). Alternative approaches to unobserved heterogeneity in the analysis of repeatable events. In N.B. Tuma (Ed.), Sociological methodology 1986 (Vol. 16, pp. 213-249). San Francisco, CA: Jossey-Bass.

Yamaguchi, K. (1991). Event history analysis. Newbury Park, CA: Sage.

CONTRIBUTORS

Robert A. Yaffee is a statistical consultant with the Statistics and Social Science Group of the Academic Computing Facility, Courant Institute of Mathematical Sciences, New York University. He is Director of Research of the Compulsive Gambling Center, Inc.in Baltimore, Maryland. He received his Ph.D. from The Graduate Faculty of Political and Social Science of the New School for Social Research. He has lectured and written articles on logistic regression and ordinal logit models and has also applied these techniques to research on addiction severity among pathological gamblers for the Maryland Task Force on Gambling Addiction.

James T. Austin is an Assistant Professor of Psychology at The Ohio State University. He earned his Ph.D. in 1987 from Virginia Polytechnic Institute and State University, with a specialization in Industrial-Organizational Psychology. His research on goal-setting, criterion measurement, and structural equation modeling has appeared in industrial-organizational and quantitative psychology journals.

ACKNOWLEDGEMENT

The authors would like to thank Richard Murnane of the Harvard Graduate School of Education for collecting and making the teacher career data available to the Inter-University Consortium for Political and Social Research (ICPSR) in addition to Judy D. Singer and John B. Willett, also of the Harvard Graduate School of Education, for the opportunity to review their excellent treatments of single and multiple-spell discrete time event history methods using that data. To them, we remain heavily indebted. We would also like to thank the Inter-University Consortium for Political and Social Research for providing a copy of the dataset used for our analysis and Jay D. Teachman of the University of Maryland, College Park, and of the Institute of Social Research at the University of Michigan, for his valuable inspiration and assistance. We would also like to thank the Academic Computing Facility of New York University for the use of the IBM 9121 on which we constructed the multiple spells datasets for analysis.


Appendix 1

Converting a Person-Oriented Dataset to Person-Period Dataset in SAS

PROC FORMAT;
 VALUE SX 0='MALE' 1='FEMALE'; VALUE RA 0='WHITE' 1='BLACK'; 
 VALUE LE 0='CENSORED' 1='LEFT';
OPTIONS LS=80 PS=55;
TITLE 'SAS CONVERSION OF PERSON-ORIENTED TO PERSON-PERIOD'; TITLE2 '  DATA SET  ';
DATA ONE; INFILE 'MICHTCH.DAT';
  INPUT ID SEX RACE SPELL1 LEFT ENTERYR EXITYR; LABEL ID='RESPONDENT ID'
     SEX='SEX OF TEACHER' SPELL1='LENGTH OF FIRST TEACHING SPELL'
     LEFT='LEFT TEACHING BEFORE 84-85 SCHOOL YEAR' ENTERYR= 'YEAR FIRST ENTERED TEACHING'
     EXITYR='FIRST YEAR OF CAREER INTERRUPTION' ;
  IF LEFT > 0 THEN LEFT = 1;
ARRAY TEMP(13) D1-D13; DO PERIOD = 1 TO MIN(SPELL1,13);                
    IF PERIOD = SPELL1 AND LEFT NE 0 THEN DV = 1;  ELSE DV = 2; *creation of dep var;
    DO INDEX = 1 TO 13;                           
    IF INDEX = PERIOD THEN TEMP(INDEX) = 1; ELSE TEMP(INDEX) = 0; 
   SPELL1SQ = SPELL1 * SPELL1;
   SPXSX = SPELL1*SEX; SP2XSX = SPELL1SQ*SEX;
  END;  OUTPUT;  END;
FORMAT SEX SX.; FORMAT RACE RA.; FORMAT LEFT LE.;
PROC PRINT; VAR ID D1-D13 LEFT DV SPELL1 SPELL1SQ; RUN;
PROC PRINT; VAR D1-D13 PERIOD LEFT DV; RUN;

Note: The basic array construction used here is from Willett, 1993, pp. 192-193). Coding of the DV requires that events leaving teaching are coded as 1 and non-events staying in teaching as 2.

                                                           S
                                                       S   P  P
                                                       P   E  E
                                                       E   L  R L
        O                              D D D D   S     L   L  I E
        B      I     D D D D D D D D D 1 1 1 1   E     L   S  O F        D
        S      D     1 2 3 4 5 6 7 8 9 0 1 2 3   X     1   Q  D T        V
         1 376523417 1 0 0 0 0 0 0 0 0 0 0 0 0 MALE   13 169  1 CENSORED 2
         2 376523417 0 1 0 0 0 0 0 0 0 0 0 0 0 MALE   13 169  2 CENSORED 2
         3 376523417 0 0 1 0 0 0 0 0 0 0 0 0 0 MALE   13 169  3 CENSORED 2
         4 376523417 0 0 0 1 0 0 0 0 0 0 0 0 0 MALE   13 169  4 CENSORED 2
         5 376523417 0 0 0 0 1 0 0 0 0 0 0 0 0 MALE   13 169  5 CENSORED 2
         6 376523417 0 0 0 0 0 1 0 0 0 0 0 0 0 MALE   13 169  6 CENSORED 2
         7 376523417 0 0 0 0 0 0 1 0 0 0 0 0 0 MALE   13 169  7 CENSORED 2
         8 376523417 0 0 0 0 0 0 0 1 0 0 0 0 0 MALE   13 169  8 CENSORED 2
         9 376523417 0 0 0 0 0 0 0 0 1 0 0 0 0 MALE   13 169  9 CENSORED 2
        10 376523417 0 0 0 0 0 0 0 0 0 1 0 0 0 MALE   13 169 10 CENSORED 2
        11 376523417 0 0 0 0 0 0 0 0 0 0 1 0 0 MALE   13 169 11 CENSORED 2
        12 376523417 0 0 0 0 0 0 0 0 0 0 0 1 0 MALE   13 169 12 CENSORED 2
        13 376523417 0 0 0 0 0 0 0 0 0 0 0 0 1 MALE   13 169 13 CENSORED 2
        14 386479798 1 0 0 0 0 0 0 0 0 0 0 0 0 FEMALE  1   1  1 LEFT     1
        15 364497052 1 0 0 0 0 0 0 0 0 0 0 0 0 MALE    1   1  1 LEFT     1
        16 366545454 1 0 0 0 0 0 0 0 0 0 0 0 0 MALE   13 169  1 CENSORED 2
        17 366545454 0 1 0 0 0 0 0 0 0 0 0 0 0 MALE   13 169  2 CENSORED 2
        18 366545454 0 0 1 0 0 0 0 0 0 0 0 0 0 MALE   13 169  3 CENSORED 2
        19 366545454 0 0 0 1 0 0 0 0 0 0 0 0 0 MALE   13 169  4 CENSORED 2
        20 366545454 0 0 0 0 1 0 0 0 0 0 0 0 0 MALE   13 169  5 CENSORED 2
        21 366545454 0 0 0 0 0 1 0 0 0 0 0 0 0 MALE   13 169  6 CENSORED 2
        22 366545454 0 0 0 0 0 0 1 0 0 0 0 0 0 MALE   13 169  7 CENSORED 2
        23 366545454 0 0 0 0 0 0 0 1 0 0 0 0 0 MALE   13 169  8 CENSORED 2
        24 366545454 0 0 0 0 0 0 0 0 1 0 0 0 0 MALE   13 169  9 CENSORED 2
        25 366545454 0 0 0 0 0 0 0 0 0 1 0 0 0 MALE   13 169 10 CENSORED 2
        26 366545454 0 0 0 0 0 0 0 0 0 0 1 0 0 MALE   13 169 11 CENSORED 2
        27 366545454 0 0 0 0 0 0 0 0 0 0 0 1 0 MALE   13 169 12 CENSORED 2
        28 366545454 0 0 0 0 0 0 0 0 0 0 0 0 1 MALE   13 169 13 CENSORED 2
        29 378504498 1 0 0 0 0 0 0 0 0 0 0 0 0 MALE    4  16  1 LEFT     2
        30 378504498 0 1 0 0 0 0 0 0 0 0 0 0 0 MALE    4  16  2 LEFT     2
        31 378504498 0 0 1 0 0 0 0 0 0 0 0 0 0 MALE    4  16  3 LEFT     2
        32 378504498 0 0 0 1 0 0 0 0 0 0 0 0 0 MALE    4  16  4 LEFT     1

                                 Appendix 2
                                   Panel 1
        Converting Person-Oriented Datasets to Person-Period Datasets in SPSS

SET WIDTH 80/UNDEFINED=NOWARN/MXWARNS=100000.
TITLE 'SAS CONVERSION OF PERSON PERIOD DATA SET'.
INPUT PROGRAM.
DATA LIST FILE='MICHTCH.DAT' FREE/ ID SEX RACE SPELL1 LEFT ENTERYR EXITYR.
VAR LABELS ID='RESPONDENT ID' / SEX='SEX OF TEACHER' / SPELL1='LENGTH OF FIRST TEACHING
       SPELL' / LEFT='LEFT TEACHING BEFORE 1984-85 SCHOOL YEAR' / ENTERYR= 'YEAR FIRST
       ENTERED TEACHING' / EXITYR='FIRST YEAR OF CAREER INTERRUPTION'.
VALUE LABELS SEX 0='MALE' 1='FEMALE'/RACE 0='WHITE' 1='BLACK'/LEFT 0='CENSORED' 1='LEFT'.
VECTOR TEMP(13).
 LOOP PERIOD = 1 TO MIN(SPELL1,13).
    LEAVE ID SEX RACE SPELL1 ENTERYR EXITYR LEFT.
    COMPUTE DV = 0. 
    IF (PERIOD = SPELL1 AND LEFT NE 0) DV = 1.
    LOOP INDEX = 1 TO 13. 
    IF (LEFT > 0) LEFT = 1.
    COMPUTE TEMP(INDEX) =0. 
    IF (INDEX = PERIOD) TEMP(INDEX) = 1.
    COMPUTE TIME1 = PERIOD.
    COMPUTE TIME1SQ = PERIOD*PERIOD.
    COMPUTE T1XSX = TIME1*SEX.
    COMPUTE T1SQXSX = TIME1SQ*SEX.
   END LOOP.
   END CASE.
  END LOOP.
VAR LABELS TIME1 'THE LENGTH OF TIME ELAPSED IN SPELL1' / TIME1SQ 'TIME ELAPSED IN SPELL1
SQUARED' / T1XSX 'TIME1 BY SEX INTERACTION' / T1SQXSX 'TIME1 SQUARED BY SEX INTERACTION'.
END INPUT PROGRAM.
  LIST VARIABLES= TIME1 TIME1SQ SEX RACE PERIOD LEFT DV.
SAVE OUTFILE='MT1B.SAV'.

Appendix 2

Panel 2

Note: This program was run on SPSS for Windows version 5 on a 486/50 with 16MB RAM. It produced the same output as the above SAS program, with the exception of the coding of the dependent variable (i.e., event is coded as a 1, non-event is coded as a 0).

Output of SPSS Program in Appendix 2

    TIME1  TIME1SQ      SEX     RACE   PERIOD     LEFT      DV
    1.00     1.00      .00      .00     1.00      .00      .00
    2.00     4.00      .00      .00     2.00      .00      .00
    3.00     9.00      .00      .00     3.00      .00      .00
    4.00    16.00      .00      .00     4.00      .00      .00
    5.00    25.00      .00      .00     5.00      .00      .00
    6.00    36.00      .00      .00     6.00      .00      .00
    7.00    49.00      .00      .00     7.00      .00      .00
    8.00    64.00      .00      .00     8.00      .00      .00
    9.00    81.00      .00      .00     9.00      .00      .00
   10.00   100.00      .00      .00    10.00      .00      .00
   11.00   121.00      .00      .00    11.00      .00      .00
   12.00   144.00      .00      .00    12.00      .00      .00
   13.00   169.00      .00      .00    13.00      .00      .00
    1.00     1.00     1.00      .00     1.00     1.00     1.00
    1.00     1.00      .00      .00     1.00     1.00     1.00
    1.00     1.00      .00      .00     1.00      .00      .00
    2.00     4.00      .00      .00     2.00      .00      .00
    3.00     9.00      .00      .00     3.00      .00      .00
    4.00    16.00      .00      .00     4.00      .00      .00
    5.00    25.00      .00      .00     5.00      .00      .00
    6.00    36.00      .00      .00     6.00      .00      .00
    7.00    49.00      .00      .00     7.00      .00      .00
    8.00    64.00      .00      .00     8.00      .00      .00
    9.00    81.00      .00      .00     9.00      .00      .00
   10.00   100.00      .00      .00    10.00      .00      .00
   11.00   121.00      .00      .00    11.00      .00      .00
   12.00   144.00      .00      .00    12.00      .00      .00
   13.00   169.00      .00      .00    13.00      .00      .00
    1.00     1.00      .00      .00     1.00     1.00      .00
    2.00     4.00      .00      .00     2.00     1.00      .00
    3.00     9.00      .00      .00     3.00     1.00      .00
    4.00    16.00      .00      .00     4.00     1.00     1.00

Appendix 3

Maximum Likelihood Estimation

Discrete-time logistic regression uses maximum likelihood (ML) parameter estimation. Using the likelihood equation it is possible to derive an objective function to be minimized for the parameters to be estimated. Because the dependent variable assumes values of either 1 or 0, depending upon whether the person experiences the event of leaving teaching or not within the period under consideration, Singer and Willett (1993) noted that the binomial probability outcome provides the likelihood function (L) shown in Equation 27. The binomial probability of experiencing the event or not experiencing the event multiplied by the log-odds of experiencing the event, when it does occur, is formulated in Equation 27:


(27)


In the likelihood function, c is designated the censoring variable. This variable tells whether the person's career ended in the event or not is incorporated in the version of the likelihood equation in 35. If the person did not drop out of teaching in the kth interval, then the teacher's observation is censored and c = 1. If the person left teaching in the jth interval, then this observation is not censored and c = 0.


(28)


According to Singer and Willett (1992), if it is assumed that the individuals in the sample are independent, the likelihood function is the product of the probabilities of their observed data as shown in 18. The probability that T = j for uncensored individuals and the probability that T > 0 for censored persons.


(29)


This translates into


(30)


Taking the logarithms gives the log likelihood (LL) function


(31)


Singer and Willett (1992) show that expansion of the middle term yields:


(32)


This expansion permits collection of terms of the log likelihood function as:


(33)


Reparameterizing this function by taking the censoring indicator c and replacing it with the outcome of the event history process, y, the log likelihood (LL) function is obtained:


(34)


which in turn can be reformulated as:


(35)


For each person there are j intervals (periods). Therefore each person contributes j cases to the dataset. The likelihood function is then computed over all persons over all intervals. By taking the natural log of this function, the loge likelihood is obtained. It is this function that can be multiplied by - 2 to obtain a function that is distributed as a chi-square, which can then be minimized to obtain the parameter estimates of the model. The loge likelihood function is shown in Equation 35 and Equation 36.


(36)


Global Model Tests

A global model test is the likelihood ratio chi-square (LR). The test statistic is the ratio of -2 loge likelihood of the model with parameters to -2 loge likelihood of the null model, without parameters. The ratio of these two values is itself distributed as chi-square. As a heuristic, the LR chi-square and the Pearson chi-square can be compared. When they are approximately equal, this comparison suggests that there is no sample size problem. But if there is a substantial discrepancy between this goodness of fit chi-square and the likelihood ratio chi-square, then the sample size is probably too small for asymptotic estimation. The Wald chi-square test is computed by pre- and post-multiplying the inverse of the variance-covariance matrix by the parameter vector and its transpose. Formulas for both tests are shown in Equation 37.


(37)


The Score test is a similar global chi-square test, computed by premultiplying the transpose of the vector of partial derivatives of the likelihood of the parameters of the model with respect to the parameters by the inverted information matrix (the asymptotic variance-covariance matrix) and post-multiplying that product by the vector of partial derivatives of the model.


(38)


The number of degrees of freedom for this test is equal to the number of parameters being estimated. The Pearson chi2 is sometimes used as a goodness of fit measure. The Pearson Chi-square test is given as follows:


(39)


The degrees of freedom of this statistic equals the number of model parameters minus 2.

RATE, a program written by Tuma (1980; Tuma & Hannan, 1980), also computes a pseudo R2.





(40)


Two other statistics used to assess comparative fit are the Akaike information criterion and the Schwartz Criterion.


(41)


In addition to the -2LL, these two comparative global statistics provide two different ways of adjusting for degrees of freedom and the number of predictor variables in the model. As a rule, the lower the value of the statistic, the better the nested model.

SAS computes three nonparametric correlation coefficients that correlate the predicted probabilities with the observed responses of the model. These are the Kendall's tau-a, the gamma and the Somer's D correlation coefficients. The formulas for these coefficients make use of concordances and discordances.


(42)


A concordance is a pair of observations that include two variables (where one variable is a predicted probability and the other is an observed response), where an increase in one variable (say x) going from one observation to the next is associated with an increase in the other variable (say y). A discordance is where an increase in one variable is associated with a decrease in the other variable, when one proceeds from one observation to the next. The number of the different pairs are those minus the number of ties. The number of ties are those where there is no significant change between two variables when one proceeds from one observation to another. The sum of the concordances minus the sum of the discordances over the number of pairs ( which is equal to (n * (n - 1) / 2)) is a measure of the proportional concordance of the two variables. The sum of the concordances minus the sum of the discordances over the total number of concordances and discordances is another measure of proportional concordances, yielding a Gamma correlation coefficient. The Somer's D statistic calculated by SAS subtracts the number of ties from the total in the denominator to correct for the number of ties in the measure, when it yields the proportional concordance in measuring the monotonicity of the data. These measures address to what extent prediction of a dependent variable is enhanced by knowledge of covariates.

Estimation of the Model

This section describes parameter estimation and is mathematically complex. It may be skipped by the reader who needs no information on the estimation algorithm. Parameter estimation uses the log likelihood function. Minus two times the log likelihood (-2LL) is distributed as a chi-square. Maximization of the likelihood is performed by minimizing the chi-square value, in an iterative manner (cf. Everitt, 1987; Pierre, 1986). The minimization is performed by taking the partial derivatives of the log likelihood with respect to particular parameter estimates, and setting these equal to 0 to obtain the first order conditions that permit solutions for the parameter estimates. The numerical algorithm used by computer packages to find the minima along the first order functions is usually some version of the Newton-Raphson method. This method uses a Taylor series expansion to arrive at the parameter estimate. Using this method, the parameter estimates at a given iteration are a function of the previous estimate minus the product of derivative of the estimate times their second derivatives. The second derivatives of these log likelihood functions are computed to form the Fisher information matrix. The Newton-Raphson algorithm forces an iterative movement of the parameter estimate from 0 along the parameter range until the chi-square function (that is, -2LL) is minimized. If no local minimum has been found, then the minimum chi-square will maximize the likelihood that the parameter estimate of the model is most probably the correct one (i.e., is maximally likely). Significance tests are obtained by solving for the information matrix (minus the expectation of the inverse of the matrix of second derivatives) for the asymptotic covariance matrix. The variances of this matrix are the elements in the main diagonal. Their square roots are the asymptotic standard errors for testing parameter significance. When the sample size is sufficiently large and the ratio of parameter estimate divided by its asymptotic standard error is squared, a Wald chi-square statistic is formed. These squared ratios are distributed as chi-squares, each with one degree of freedom. Significance level may be computed for the individual coefficients in this manner.

Appendix 4

SAS Mainframe version 6.07, PC version 6.04

PROC LIFETEST METHOD=LT PLOTS=(S,H,P) INTERVALS=1 TO 12 BY 1;
  TIME SPELL1*LEFT(0); STRATA SEX;
TITLE1 'AMERICAN TEACHER ATTRITION';
TITLE2 'THREE STATES COMBINED DATA';
FORMAT SEX SE.;
FORMAT RACE RA.;
FORMAT LEFT LE.;

SPSS for Mainframe, Windows, and PC+

SURVIVAL  TABLES=SPELL1 BY SEX(1,2)/
          INTERVALS=THRU 12 by 1/
          STATUS = LEFT(1)/
          PLOTS (ALL)/
          COMPARE

BMDP1L Mainframe 1988.

/PROGRAM    BMDP1L 
    /INPUT      FILE='Mich.dat'.
                Variables=3.
                FORMAT = Free.
    /Variable   NAMES=SPELL1,LEFT,SEX.
    /FORM       TIME=SPELL1.
                UNIT=YEARS.
                STATUS=LEFT.
                RESPONSE=1.            
   /ESTIMATE    METHOD = LIFE.
                PLOTS=SURV, DEN, HAZ.
                GROUPING=SEX.
                STRATA=SEX.
                STATISTICS=BRESLOW.

LIMDEP Ver. 6.

SURVIVAL; LHS = SPELL1,LEFT; 
          INT = 13;
          STR=SEX $ 

Appendix 5

SAS and SPSS Syntax for Discrete-Time Logistic Regression Procedures

SAS version 6.04

txx1 = time * x1;
PROC LOGISTIC CT;
  MODEL DV = time time2 x1 txx1 x2 x3 / corrb;

SAS version 6.07 and 6.08

Altough we used, SAS v. 604 to run the models shown here, it should be noted that as of release 6.07, you may use a DESCENDING option on the proc statement to reverse the ordering of the response variable, so that SAS will model the upper, rather than the lower, probability level. If this is done, then the DV =2 statement in Appendix

1 should be changed to DV = 0. An example of the PROC Statement that would be

used in this connection follows:

PROC LOGISTIC DESCENDING CT;
MODEL DV =  time time2 x1 txx1 x2 x3 / corrb;



SPSS version 4.01

compute txx1 = time * x1; Logistic Regression Variables = DV with Time Time2 x1 txx1 x2 x3 / classplot / print summary corr.

Appendix 6

        SAS Output from Discrete-time Logistic Regression Model

Although logistic models may be found in SAS, SPSS, SYSTAT, BMDP, and other
statistical packages, setting up person-period datasets and then aggregating the
periods is most easily done in SAS.  For that reason, SAS (Ver. 6.07) syntax for a
model using only the Michigan teaching data, as collected by Murnane et al. (1988), is
presented below in annotated form.

Line 1 Sets the line length to 80 columns.

1          OPTIONS  LS=80;   

Lines 2 through 6 define the answer categories for the nominal variables.

2          PROC FORMAT; 
3             VALUE SE 0='MALE' 1='FEMALE'; 
4             VALUE RA 0='WHITE' 1='BLACK'; 
5             VALUE LE 0='CENSORED' 1='LEFT';
6             VALUE PT 1='FULL TIME' 2='PART TIME';

Line 11 defines the dataset. 

11         DATA MITEACH; 

Line 12 reads in the dataset defining the record length and blocksize. For the MVS
system, MIT is the data definition name.

12             INFILE MIT LRECL=472 BLKSIZE=31624;

Lines 13-19 declare the variables by giving them variable names and defining the
location in their data in the data file.  It should be noted that the MADEG01-MADEG13
(Master's degree in year XX) array of variables are time varying covariates, which
have the value of 0 until the year when the degree is conferred.  From that year till
the end of the person's spell, the value of that array is 1.

13  INPUT ID 1-9 SEX 10 RACE 11 BIRTHP 12-13 MAJOR $ 14-15 MADEG01 16
14        MADEG02 17 MADEG03 18 MADEG04 19 MADEG05 20 MADEG06 21 MADEG07 22
15        MADEG08 23 MADEG09 24 MADEG10 25 MADEG11 26 MADEG12 27 MADEG13 28
16        INST1 29-30 SUBJECT 31-32 @33 (ASC1_01 - ASC1_13) ($ 2.) @59
17        (LEVEL1-LEVEL13) (1.) ENTERYR 72-73 @74 (PTTIME01-PTTIME12)(1.)
18        SPELL1 86-87 LEFT 88-89 EXITYR 89-90 SPELL2 91-92 RETURN 93
19        RETURNYR 94-95;

Lines 20 through 30 convert the person-oriented dataset to the person-period dataset,
as demonstrated by Singer & Willett.

Line 20 defines the temp variable array as dummy variables D1 to D13.

20  ARRAY TEMP(13) D1-D13;

Line 21 Defines MADEG as an array of dummy variables MADEG1 through MADEG13 of whether
the teacher obtained a masters degree in that year.    

21  ARRAY MADEG(13) MADEG01-MADEG13;  

Line 22 sets up the person-period case loop.

22  DO PERIOD = 1 TO MIN(SPELL1,13);  

Line 23 states that if the period is equal to the greatest year and if the observation
is not censored, then the dependent variable equals 1.  Otherwise, the dependen
variable equals 2.  Remember that PROC logistic in SAS models the lower probability,
not the higher one. 

23  IF PERIOD = SPELL1 AND LEFT NE 0 THEN DV= 1; ELSE DV = 2;

Line 24 sets up the index of the loop that defines the time dummies.


24  DO INDEX = 1 TO 13;

Line 25 sets a dummy variable D1 through D13 equal to one only when the index equals
to correct period.

25  IF INDEX = PERIOD THEN TEMP(INDEX) = 1; ELSE TEMP (INDEX) = 0;

The dur1 and dur1 squared variables are constructed in lines 26 and 27. Deviating from
the practice of Singer & Willett, a different parameterization of time is undertaken.
This approach uses fewer variables than the dummy variable approach and conserves
power in the analysis.  The duration variable and its square are then computed.

26  DUR1 = EXITYR - ENTERYR;
27  DUR1SQ = DUR1**2;

Line 28 defines the end of the inner loop.

28  END;

Line 29 outputs the variables within the loop as data.

29                   OUTPUT;

Line 30 defines the end of the outer loop.

30   END;

Lines 31 through 39 give the variable labels for the covariates.

31   LABEL SEX= 'SEX OF TEACHER'
32                    RACE='RACE OF TEACHER'
34                    ENTERYR='FIRST SCHL YEAR OF TEACHING'
35                    SPELL1='LENGTH OF FIRST TEACHING SPELL'
36                    LEFT='LEFT TEACHING BEFORE 1984-85 SCHOOL YEAR'
37                    EXITYR='FIRST YEAR OF CAREER INTERRUPTION'
38                    DUR1 = 'DURATION OF FIRST SPELL';

Lines 40 and following give the syntax for the logistic regression procedure.  CT
requests the classification table for assessment of prediction accuracy.  

40        PROC LOGISTIC CT;

The model statement indicates that Y is the criterion and that the covariates are sex,
race, duration, & duration squared.  The temporal dependence is tested by the duration
and duration squared terms.  If these are significant, the nature of the duration
dependence may be estimated.  This test is useful for researchers trying to decide
whether they should perform parametric analyses, as the duration coefficients the 
duration and its squared term will suggest an appropriate distribution of the errors. 

41          MODEL DV = DUR1 DUR1SQ SEX RACE;

The format statements below the model statements link formats with variables.

42        FORMAT SEX SE.;
43        FORMAT RACE RA.;
44        FORMAT LEFT LE.;
45        TITLE2 'MICHIGAN DATA';

The output for the requested analysis shows the name of the data set, the response
variable, its number of levels and the number of valid observations for this analysis.

                               Appendix 7

     SAS v. 6.04 Output For Discrete Time Logistic Regression Model

     DATA SET:                          WORK.MITEACH
     RESPONSE VARIABLE:                 DV
     RESPONSE LEVELS:                   2
     NUMBER OF OBSERVATIONS:            64785
     LINK FUNCTION:                     LOGIT

In the response profile, observations at each value of the DV are shown.
     
                                RESPONSE PROFILE
                           ORDERED
                             VALUE      DV     COUNT

                                 1       1     18957
                                 2       2     45828

Simple statistics for IVs are shown to check that they are being read in correctly.

                                  STANDARD
  VARIABLE            MEAN       DEVIATION         MINIMUM         MAXIMUM

  SEX             0.707540        0.454896         0.00000           1.000
  RACE            0.055661        0.229268         0.00000           1.000
  DUR1            5.176708        2.745399         1.00000          12.000
  DUR1SQ         34.335402       32.359466         1.00000         144.000

                    CRITERIA FOR ASSESSING MODEL FIT

                               INTERCEPT
                 INTERCEPT        AND
   CRITERION       ONLY       COVARIATES    CHI-SQUARE FOR COVARIATES

   AIC           78323.914     62749.278         .
   SC            78332.993     62794.672         .
   -2 LOG L      78321.914     62739.278    15582.636 WITH 4 DF (P=0.0001)
   SCORE              .             .       15670.553 WITH 4 DF (P=0.0001)

In the printout below are given the logistic regression coefficients, asymptotic
standard errors, Wald chi-squares, and significance levels for each parameter, along
with its standardized estimate for comparing relative effects within the model.  Some
of the variables are significantly related to the logit as is  by the difference
between -2 LL for the intercept only and the -2 LL for the intercept plus covariates.
This value is the chi-square for covariates, with df equal to the number of parameters
in the model.
     
It is interesting that sex and race do not have by themselves significant main effects
as can be seen from the analysis of ML estimates below.  Their standardized estimates
(for comparison within the model) are defined as the ratio of the slope parameter,
multiplied by 1 + ex to the standard deviation of the independent variable, are less
than .01 in magnitude.  The question of whether there is a significant sex by time
interaction must be tested by another model.  

That the time and time squared variables are found to  be significant indicates that
duration dependence significantly contributes to the explanation of the dependent
variable. That is to  say, the odds of teacher attrition is a function of a polynomial
of time.
        


ANALYSIS OF MAXIMUM LIKELIHOOD ESTIMATES

            PARAMETER    STANDARD      WALD           PR >      STANDARDIZED
 VARIABLE    ESTIMATE      ERROR    CHI-SQUARE     CHI-SQUARE     ESTIMATE

 INTERCPT      2.5553      0.0388    4328.9606         0.0001              .
 SEX          -0.0119      0.0220       0.2905         0.5899      -0.002975
 RACE          0.0199      0.0430       0.2132         0.6442       0.002510
 DUR1         -1.1744      0.0147    6393.5830         0.0001      -1.777518
 DUR1SQ        0.0716     0.00126    3245.5730         0.0001       1.276746

Below are ordinal correlations between predicted probabilities and the observed
responses.
           
ASSOCIATION OF PREDICTED PROBABILITIES AND OBSERVED RESPONSES
CONCORDANT = 74.3%          SOMERS' D = 0.532
DISCORDANT = 21.1%          GAMMA     = 0.558
TIED       =  4.6%          TAU-A     = 0.220
(868761396 PAIRS)           C         = 0.766

Classification tables are one of the best ways to assess goodness of 
predictability. They give the percent correctly predicted and provides
the sensitivity, the percentage of event responses correctly predicted
to be an event, and the specificity, the percentage of nonevent responses
correctly predicted to be a nonevent.  The false positive rate is the
percentage of predicted event responses that turned out to be nonevents.
Also provided is the false negative rate, or percentage of predicted nonevent
responses that turned out to be events.  A classfication table provided by
SAS is presented below.


                                    CLASSIFICATION TABLE
                                          PREDICTED
                                        EVENT   NO EVENT      TOTAL
                                   +---------------------+
                          EVENT    |     8715      10242 |    18957
               OBSERVED            |                     |
                          NO EVENT |     3425      42403 |    45828
                                   +---------------------+
                          TOTAL         12140      52645      64785

SENSITIVITY= 46.0%  SPECIFICITY= 92.5%  CORRECT= 78.9%
FALSE POSITIVE RATE= 28.2%  FALSE NEGATIVE RATE= 19.5%
NOTE: AN EVENT IS AN OUTCOME WHOSE ORDERED RESPONSE VALUE IS 1.

We may try another model where we delete the race variable, which in Michigan does not
seem to have a significant association with the logit of departing teaching and to
include the interaction of sex and time.  From the lower AIC and SC for the intercept
and covariates, we see that this is a marginally better model.  But from the
comparison of the Ordinal Correlation coefficients, we see that the difference is
almost negligible, insofar as sex and its interaction with time are not significant at
the .05 level.

                       Criteria for Assessing Model Fit
                               Intercept
                 Intercept        and
   Criterion       Only       Covariates    Chi-Square for Covariates

   AIC           78323.914     62745.961         .
   SC            78332.993     62791.356         .
   -2 LOG L      78321.914     62735.961    15585.952 with 4 DF (p=0.0001)
   Score              .             .       15675.267 with 4 DF (p=0.0001)

In SAS Ver 6.07 the odds ratio is added to the analysis of parameter estimates table
below.  This value is merely the exponentiation of the logistic regression
coefficient.




                   Analysis of Maximum Likelihood Estimates
              Parameter Standard    Wald       Pr >    Standardized Odds
  Variable DF  Estimate   Error  Chi-Square Chi-Square   Estimate      Ratio

  INTERCPT 1     2.6079   0.0476  3005.9602     0.0001            .   13.571
  DUR1     1    -1.1866   0.0161  5465.3194     0.0001    -1.796131    0.305
  DUR1SQ   1     0.0717  0.00126  3247.8473     0.0001     1.279461    1.074
  SEX      1    -0.0814   0.0433     3.5308     0.0602    -0.020424    0.922
  DURXSEX  1     0.0156  0.00831     3.5121     0.0609     0.027860    1.016

         Association of Predicted Probabilities and Observed Responses
                   Concordant = 73.9%          Somers' D = 0.533
                   Discordant = 20.7%          Gamma     = 0.563
                   Tied       =  5.4%          Tau-a     = 0.220


                            Appendix 8

        SAS, SPSS, and LIMDEP Syntax for Proportional Hazards
 (Cox) Regression

SAS

PROC PHREG;
MODEL SPELL1*LEFT(0) = X1 X2 X3 / COVB;
RUN;

SPSS

COXREG VARIABLES=SPELL1 WITH X1 X2 X3/STATUS LEFT(1).

LIMDEP

SURV; Lhs = SPELL1,LEFT; rhs = x1,x2,x3; TVC = z1*time;


                              Appendix 9

        SAS Syntax for Building Person-Period-Spell Dataset For Multiple Spell Analysis

OPTIONS  LS=80 PS=56;
TITLE 'REPEATED EVENTS DISCRETE TIME DATA SET MODEL';
TITLE2'  SAMPLE DATA';
TITLE3 'SPELL1 ANALYSIS';
PROC FORMAT;
   VALUE SE 0='MALE' 1='FEMALE';
   VALUE RA 0='WHITE' 1='BLACK';
   VALUE LE 0='CENSORED' 1='LEFT';
   VALUE PT 1='FULL TIME' 2='PART TIME';
FORMAT SEX SE.;
FORMAT RACE RA.;
FORMAT LEFT LE.;
DATA SPELLT1;
    INFILE 'MTVC2.DAT';
 INPUT ID SEX RACE SPELL1 LEFT ENTERYR EXITYR RETURN RETURNYR SPELL2 MADEG01-MADEG13;
     LABEL SEX= 'SEX OF TEACHER'
           RACE='RACE OF TEACHER'
           ENTERYR='FIRST SCHL YEAR OF TEACHING'
           SPELL1='LENGTH OF FIRST TEACHING SPELL'
           LEFT='LEFT TEACHING BEFORE 1984-85 SCHOOL YEAR'
           EXITYR='FIRST YEAR OF CAREER INTERRUPTION'
           SPELL2='LENGTH OF FIRST INTERRUPTION SPELL'
           RETURN='RETURNED TO TEACHING BY 84-85 SCHL YR'
           RETURNYR='YR TEACHER RETURNED TO TEACHING';
* CREATION OF PERSON PERIOD DATA SET;
  IF LEFT > 0 THEN LEFT = 1;
  SPELL = 1;
ARRAY TEMP1(13) D1-D13;
  DO PERIOD1 = 1 TO MIN(SPELL1,13);
    IF PERIOD1 = SPELL1 AND LEFT NE 0 THEN DV = 1;  ELSE DV = 2;
    DO INDEX1 = 1 TO 13;
    IF INDEX1 = PERIOD1 THEN TEMP1(INDEX1) = 1; ELSE TEMP1(INDEX1) = 0;
    TIME1 = PERIOD1;
    TIME1SQ = PERIOD1*PERIOD1;
    T1XSX = TIME1*SEX;
    T1SQXSX = TIME1SQ * SEX;
  END;
 OUTPUT;
  END;
TITLE2 'CONSTRUCTION OF SPELL 2 PERSON PERIOD DATASET';
TITLE3 'SPELL2 ANALYSIS';
DATA SPELLT2;
    INFILE 'MTVC2.DAT';
    KEEP SEX RACE RETURN RETURNYR TIME2 SPELL PERIOD2 T1-T13 DV TIME2SQ MADEG01-MADEG13;
    INPUT ID SEX RACE SPELL1 LEFT ENTERYR EXITYR RETURN RETURNYR SPELL2 MADEG01-MADEG13;
    SPELL=2;
ARRAY TEMP2(13) T1-T13;
  DO PERIOD2 = 1 TO MIN(SPELL2,13);
   IF PERIOD2 = SPELL2 AND LEFT NE 0 THEN DV = 1; ELSE DV=2;
   DO INDEX2 = 1 TO 13;
   IF INDEX2=PERIOD2 THEN TEMP2(INDEX2)=1; ELSE TEMP2(INDEX2)= 0;
   TIME2= PERIOD2;
   TIME2SQ = PERIOD2*PERIOD2;
   T2XSX = TIME2*SEX;
   T2SQXSX = TIME2SQ * SEX;
 END;
 OUTPUT;
  END;
TITLE3 'MULTIPLE SPELLS DISCRETE TIME DATA SET MODEL';
DATA ALL;
  SET SPELLT1 SPELLT2;
PROC LIFETEST PLOTS=(S,H) METHOD=LT NOTABLE INTERVALS=1 TO 13 BY 1;
                         TIME PERIOD2*LEFT(0);


                             Appendix 10

   SAS Construction of Plots for Multiple Spells Discrete Time Analysis

1          OPTIONS  LS=80 PS=56 ;
2          TITLE 'REPEATED EVENTS DISCRETE TIME SURVIVAL & HAZARD FUNCTION PLOTS';
3          TITLE2 '  MICHIGAN DATA';
4          PROC FORMAT;
5             VALUE SE 0='MALE' 1='FEMALE';
6             VALUE RA 0='WHITE' 1='BLACK';
7             VALUE LE 0='CENSORED' 1='LEFT';
8             VALUE PT 1='FULL TIME' 2='PART TIME';
9             VALUE SP 0 = 'SPELL 1' 1 = 'SPELL 2';
10         FORMAT SEX SE.;
11         FORMAT RACE RA.;
12         FORMAT LEFT LE.;
13         FORMAT SPELL SP.;
14         DATA SPELLT1;
15             INFILE MIT LRECL=472 BLKSIZE=31624;
16             INPUT ID 1-9 SEX 10 RACE 11 BIRTHP 12-13 MAJOR $ 14-15 MADEG01 16
17              MADEG02 17 MADEG03 18 MADEG04 19 MADEG05 20 MADEG06 21 MADEG07 22
18              MADEG08 23 MADEG09 24 MADEG10 25 MADEG11 26 MADEG12 27 MADEG13 28
19              INST1 29-30 SUBJECT 31-32 @33 (ASC1_01 - ASC1_13) ($ 2.) 
20              @59 (LEVEL1-LEVEL13) (1.) ENTERYR 72-73 @74 (PTTIME01-PTTIME12) (1.)
21              SPELL1 86-87 LEFT 88-89 EXITYR 89-90 SPELL2 91-92 RETURN 93
22              RETURNYR 94-95 ;
23              LABEL SEX= 'SEX OF TEACHER'
24                    RACE='RACE OF TEACHER'
25                    ENTERYR='FIRST SCHL YEAR OF TEACHING'
26                    SPELL1='LENGTH OF FIRST TEACHING SPELL'
27                    LEFT='LEFT TEACHING BEFORE 1984-85 SCHOOL YEAR'
28                    EXITYR='FIRST YEAR OF CAREER INTERRUPTION'
29                    SPELL2='LENGTH OF FIRST INTERRUPTION SPELL'
30                    RETURN='RETURNED TO TEACHING BY 84-85 SCHL YR'
31                    RETURNYR='YR TEACHER RETURNED TO TEACHING';
32         * CREATION OF PERSON PERIOD DATA SET;
33           IF LEFT > 0 THEN LEFT = 1;
34           SPELL = 0;
35       ARRAY TEMP1(13) D1-D13;
36        DO PERIOD1 = 1 TO MIN(SPELL1,13);
37             IF PERIOD1 = SPELL1 AND LEFT NE 0 THEN DV = 1;  ELSE DV = 2;
38             DO INDEX1 = 1 TO 13;
39        IF INDEX1 = PERIOD1 THEN TEMP1(INDEX1) = 1; ELSE TEMP1(INDEX1)=0;
40           END;
41           TIME = PERIOD1;
42           TIMESQ = PERIOD1 * PERIOD1;
43           TXS = TIME*SEX;
44          OUTPUT ;
45           END;
46         PROC LIFETEST PLOTS=(S,H) METHOD=LT NOTABLE INTERVALS=1 TO 13 BY 1;
47           TIME PERIOD1*LEFT(0);
48
49         TITLE3 'SPELL1 ANALYSIS';
50      DATA SPELLT2;
51       KEEP SEX RACE RETURN RETURNYR SPELL2 SPELL PERIOD2 T1-T13 DV LEFT;
52             INFILE MIT LRECL=472 BLKSIZE=31624;
53       INPUT ID 1-9 SEX 10 RACE 11 BIRTHP 12-13 MAJOR $ 14-15 MADEG01 16
54        MADEG02 17 MADEG03 18 MADEG04 19 MADEG05 20 MADEG06 21 MADEG07 22
55        MADEG08 23 MADEG09 24 MADEG10 25 MADEG11 26 MADEG12 27 MADEG13 28
56        INST1 29-30 SUBJECT 31-32 @33 (ASC1_01 - ASC1_13) ($ 2.)
57        @59 (LEVEL1-LEVEL13) (1.) ENTERYR 72-73 @74 (PTTIME01-PTTIME12)(1.)
58        SPELL1 86-87 LEFT 88-89 EXITYR 89-90 SPELL2 91-92 RETURN 93
59        RETURNYR 94-95 ;
60           SPELL=1;
61         ARRAY TEMP2(13) T1-T13;
62           DO PERIOD2 = 1 TO MIN(SPELL2,13);
63             IF PERIOD2 = SPELL2 AND LEFT NE 0 THEN DV = 1; ELSE DV=2;
64             DO INDEX2 = 1 TO 13;
65             IF INDEX2 = PERIOD2 THEN TEMP2(INDEX2)=1; ELSE TEMP2(INDEX2) = 0;
66          END;
67           TIME = PERIOD2;
68           TIMESQ = PERIOD2 * PERIOD2;
69           TXS = TIME*SEX;
70          OUTPUT;
71
72           END;
73         PROC LIFETEST PLOTS=(S,H) METHOD=LT NOTABLE INTERVALS=1 TO 13 BY 1;
74                             TIME PERIOD2*LEFT(0);
75         TITLE3 'SPELL2 ANALYSIS';
76  


                            Appendix 11

        SAS Output for Discrete Time Repeated Events Analysis:  Spell 1

                             THE LIFETEST PROCEDURE
            SUMMARY OF THE NUMBER OF CENSORED AND UNCENSORED VALUES

                       TOTAL     FAILED   CENSORED  %CENSORED

                      178109      64785     113324    63.6262

                       SURVIVAL FUNCTION ESTIMATES

  SDF |
      |
      |
      |
      |
      |
S 1.0 + A++++A
U     |       ++
R     |         ++
V     |           A
I     |            ++
V     |              ++
A 0.8 +                A+
L     |                  ++
      |                    +A+
D     |                       ++
I     |                         +A++
S     |                             ++A++
T 0.6 +                                  ++A++
R     |                                       ++A++
I     |                                            ++A++
B     |                                                 ++A++++A++
U     |                                                           ++A++++A
T     |
I 0.4 +
O     |
N     |
      |
F     |
U     |
N 0.2 +
C     |
T     |
I     |
O     |
N     |
  0.0 +
      |
      |
      |
      |
      |
      --+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-
        0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
                                        PERIOD


                          Appendix 12 Continued

                         HAZARD FUNCTION ESTIMATES

  HAZARD |
         |
         |
    0.12 +
         |            A
         |            ++
         |            + +
         |            + +
         |            +  +
    0.10 +           +    A+
         |           +      ++
         |           +        A
         |           +         +
H        |           +          +
A        |           +           +
Z   0.08 +           +            A+
A        |           +              ++
R        |           +                A
D        |          +                  ++
         |          +                    +
F        |          +                     A+
U   0.06 +          +                       ++
N        |          +                         A
C        |          +                          ++
T        |          +                            +
I        |          +                             A+
O        |         +                                ++
N   0.04 +         +                                  A+
         |         +                                    ++
         |         +                                      A
         |         +                                       ++
         |         +                                         +
         |         +                                          A+
    0.02 +         +                                            ++
         |         +                                              A
         |        +
         |        +
         |        +
         |        +
    0.00 +        A
         |
         |
         -------+-------+-------+-------+-------+-------+-------+-------+-------
                0       2       4       6       8      10      12      14

                                         PERIOD


                                  Appendix 13

     SAS Output for Discrete Time Repeated Events Analysis:  Spell 2

                             THE LIFETEST PROCEDURE
            SUMMARY OF THE NUMBER OF CENSORED AND UNCENSORED VALUES
                       TOTAL     FAILED   CENSORED  %CENSORED
                      248059      96518     151541    61.0907

                           SURVIVAL FUNCTION ESTIMATES
  SDF |
      |
      |
      |
      |
      |
S 1.0 + A++++A+
U     |        ++
R     |          +A+
V     |             ++
I     |               +A+
V     |                  ++
A 0.8 +                    +A+
L     |                       ++
      |                         +A++
D     |                             ++A+
I     |                                 ++
S     |                                   +A++
T 0.6 +                                       ++A++
R     |                                            ++A++
I     |                                                 ++A++
B     |                                                      ++A++
U     |                                                           ++A++++A
T     |
I 0.4 +
O     |
N     |
      |
F     |
U     |
N 0.2 +
C     |
T     |
I     |
O     |
N     |
  0.0 +
      |
      |
      |
      |
      |
      --+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-
        0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
                                        PERIOD


                           Appendix 13 Continued 

                           HAZARD FUNCTION ESTIMATES

  HAZARD |
         |
         |
    0.09 +
         |
         |
         |            A
    0.08 +            ++
         |            + +
         |            +  +  ++A+++A++
         |            +   A+         +A+++A++
    0.07 +           +                       +A++
         |           +                           +A++
H        |           +                               +A
A        |           +                                 ++
Z   0.06 +           +                                   +
A        |           +                                    A
R        |           +                                     +
D        |           +                                      +
    0.05 +          +                                        +
F        |          +                                         A
U        |          +                                          +
N        |          +                                          +
C   0.04 +          +                                           +
T        |          +                                            +
I        |          +                                            +
O        |          +                                             A
N   0.03 +         +
         |         +
         |         +
         |         +
    0.02 +         +
         |         +
         |         +
         |         +
    0.01 +        +
         |        +
         |        +
         |        +
    0.00 +        A
         |
         |
         -------+-------+-------+-------+-------+-------+-------+-------+-------
                0       2       4       6       8      10      12      14

                                           PERIOD


                                 Appendix 14

        SAS Syntax for Repeated Events Discrete Time Logistic Model

1          OPTIONS  LS=80 PS=56 ;                                               
2          TITLE 'REPEATED EVENTS DISCRETE TIME LOGISTIC MODEL';                
3          TITLE2 'MICHIGAN DATA';                                            
4          PROC FORMAT;                                                         
5             VALUE SE 0='MALE' 1='FEMALE';                                     
6             VALUE RA 0='WHITE' 1='BLACK';                                     
7             VALUE LE 0='CENSORED' 1='LEFT';                                   
8             VALUE PT 1='FULL TIME' 2='PART TIME';                             
9          FORMAT SEX SE.;                                                      
10         FORMAT RACE RA.;                                                     
11         FORMAT LEFT LE.;                                                     
12       data SPELLT1;                                                        
13          INFILE MIT LRECL=472 BLKSIZE=31624;                              
14       input id 1-9 sex 10 race 11 birthp 12-13 major $ 14-15 madeg01 16
15         madeg02 17 madeg03 18 madeg04 19 madeg05 20 madeg06 21 madeg07 22 
16         madeg08 23 madeg09 24 madeg10 25 madeg11 26 madeg12 27 madeg13 28
17         inst1 29-30 subject 31-32 @33 (asc1_01 - asc1_13) ($ 2.)        
18         @59 (Level1-level13) (1.) enteryr 72-73 @74 (PTTIME01-PTTIME12) (1.)
19         spell1 86-87 left 88-89 exityr 89-90 spell2 91-92 return 93     
20         returnyr 94-95 ;                                                
21          Label sex= 'Sex of teacher'                                     
22                race='Race of teacher'                                    
23                Enteryr='First schl year of teaching'                     
24                Spell1='Length of first teaching spell'                   
25                Left='Left teaching before 1984-85 school year'           
26                Exityr='First year of career interruption'                
27                Spell2='Length of first interruption spell'               
28                Return='Returned to teaching by 84-85 schl yr'            
29                Returnyr='Yr teacher returned to teaching';               
30  * Creation of Person period data set;                                
31        IF LEFT > 0 THEN LEFT = 1;                                         
32           SPELL = 1;                                                         
33        ARRAY TEMP1(13) D1-D13;                                              
34         DO PERIOD1 = 1 to MIN(SPELL1,13);                                  
35         IF PERIOD1 = SPELL1 and LEFT NE 0 THEN DV = 1;  ELSE DV = 2;     
36         DO INDEX1 = 1 to 13;                                             
37         IF INDEX1 =PERIOD1 THEN TEMP1(INDEX1)=1; ELSE TEMP1(INDEX1)=0;
38         END;
39           TIME = PERIOD1;                                                    
40           TIMESQ = PERIOD1 * PERIOD1;                                        
41           TXS = TIME*SEX;                                                    
42          OUTPUT ;                                                            
43           END;                                                               
44          Title3 'SPELL1 ANALYSIS';
46         DATA SPELLT2;
47          KEEP SEX RACE RETURN RETURNYR SPELL2 SPELL PERIOD2 T1-T13 DV;    
48          INFILE MIT LRECL=472 BLKSIZE=31624;                              
49          input id 1-9 sex 10 race 11 birthp 12-13 major $ 14-15 madeg01 16
50          madeg02 17 madeg03 18 madeg04 19 madeg05 20 madeg06 21 madeg07 22
51          madeg08 23 madeg09 24 madeg10 25 madeg11 26 madeg12 27 madeg13 28
52          inst1 29-30 subject 31-32 @33 (asc1_01 - asc1_13) ($ 2.)        
53          @59 (Level1-level13) (1.) enteryr 72-73 @74 (PTTIME01-PTTIME12) (1.)
54          spell1 86-87 left 88-89 exityr 89-90 spell2 91-92 return 93
55          returnyr 94-95 ;                                                
56           SPELL=2;                                                           
57         ARRAY TEMP2(13) T1-T13;                                              
58           DO PERIOD2 = 1 TO MIN(SPELL2,13);                                  
59             IF PERIOD2 = SPELL2 AND LEFT NE 0 THEN DV = 1; ELSE DV=2;        
60             DO INDEX2 = 1 TO 13;                                             
61             IF INDEX2 = PERIOD2 THEN TEMP2(INDEX2)=1; ELSE TEMP2(INDEX2) = 0;
62          END;                                                                
63           TIME = PERIOD2;                                                    
64           TIMESQ = PERIOD2 * PERIOD2;                                        
65           TXS = TIME*SEX;                                                    
66          OUTPUT;                                                             
67           END;  
68         TITLE3 'SPELL2 ANALYSIS';
70         Data ALL;                                                            
71           SET SPELLT1 SPELLT2; 
72          PROC LOGISTIC CT;                                                   
73           MODEL DV = TIME TIMESQ SPELL SEX TXS /NOINT;                       


                                   Appendix 15

       SAS Output for Repeated Events Discrete Time Logistic Model
                                                                                
                             The LOGISTIC Procedure                             
     Data Set: WORK.ALL                                                         
     Response Variable: DV                                                      
     Response Levels: 2                                                         
     Number of Observations: 178109                                             
     Link Function: Logit                                                       
                                                                                
                                Response Profile                                
                                                                                
                           Ordered                                              
                             Value      DV     Count                            
                                                                                
                                 1       1     18957                            
                                 2       2    159152                            
                                                                                
WARNING: 248059 observation(s) deleted due to missing values for response or explanatory
variables.                                     
                                                                    
                       Criteria for Assessing Model Fit                         
                                                                                
                  Without        With                                           
   Criterion    Covariates    Covariates    Chi-Square for Covariates           
                                                                                
   AIC           246911.50     116190.06         .                              
   SC            246911.50     116240.51         .                              
   -2 LOG L      246911.50     116180.06    130731.44 with 5 DF (p=0.0001)      
   Score              .             .       111903.41 with 5 DF (p=0.0001)      
                                                                                
                   Analysis of Maximum Likelihood Estimates                     
                                                                                
              Parameter Standard    Wald       Pr >    Standardized     Odds    
  Variable DF  Estimate   Error  Chi-Square Chi-Square   Estimate      Ratio    
                                                                                
  TIME     1    -0.1328   0.0114   135.5431     0.0001    -0.232271    0.876    
  TIMESQ   1   -0.00706  0.00101    49.2217     0.0001    -0.148752    0.993    
  SPELL    1    -1.6193   0.0293  3045.1225     0.0001            0    0.198    
  SEX      1     0.1521   0.0293    26.9172     0.0001     0.040285    1.164    
  TXS      1     0.0396  0.00676    34.3078     0.0001     0.072201    1.040    

         Association of Predicted Probabilities and Observed Responses          
                                                                                
                   Concordant = 60.4%          Somers' D = 0.281                
                   Discordant = 32.3%          Gamma     = 0.303                
                   Tied       =  7.2%          Tau-a     = 0.053                
                   (3017044464 pairs)          c         = 0.640                                                                                             
             Classification Table                              
                                                                                
              Correct      Incorrect                Percentages                 
           ------------  ------------  -------------------------------------    
     Prob          Non-          Non-           Sensi-  Speci-  False  False    
    Level  Event  Event  Event  Event  Correct  tivity  ficity   POS    NEG     
    ------------------------------------------------------------------------    
    0.000  18957      0  159E3      0     10.6   100.0     0.0   89.4     .     
    0.020  18879   4922  154E3     78     13.4    99.6     3.1   89.1    1.6    
    0.040  18519  17414  142E3    438     20.2    97.7    10.9   88.4    2.5    
    0.060  17567  34682  124E3   1390     29.3    92.7    21.8   87.6    3.9    
    0.080  15723  58220  101E3   3234     41.5    82.9    36.6   86.5    5.3    
    0.100  14344  72556  86596   4613     48.8    75.7    45.6   85.8    6.0    
    0.120  12376  88625  70527   6581     56.7    65.3    55.7   85.1    6.9    
    0.140   9816  107E3  52289   9141     65.5    51.8    67.1   84.2    7.9    
    0.160   3657  142E3  16964  15300     81.9    19.3    89.3   82.3    9.7    
    0.180      0  159E3      0  18957     89.4     0.0   100.0     .    10.6    


                                   Appendix 16

                 Kaplan-Meier Product-Limit Life Tables


The Kaplan-Meier Life Table calculations are repeated everytime an event is observed.
Suppose t1 < t2 < ... < tx represent the distinct times that events take place.  If nj
= the size of the risk set just prior to the event at time ti, the time of the event,
and dj represents the number of events at the time of the event, then the formula for
the cumulative survival function during spell k is as follows:

(43)


k is the spell. The cumulative survival function is therefore the product of survival probabilities throughout the spell, computed at the time of each event. The standard error is given by Greenwood's formula:


(44)


                                 Appendix 17

                  SAS Syntax for PHREG SINGLE SPELL MODEL

OPTIONS  LS=80 PS=56 ;
TITLE 'DISCRETE TIME PHREG MODEL';
TITLE2 '  MICHIGAN DATA';
PROC FORMAT;
   VALUE SE 0='MALE' 1='FEMALE';
   VALUE RA 0='WHITE' 1='BLACK';
   VALUE LE 0='CENSORED' 1='LEFT';
   VALUE PT 1='FULL TIME' 2='PART TIME';
FORMAT SEX SE.;
FORMAT RACE RA.;
FORMAT LEFT LE.;
DATA SPELLT1;
    INFILE MIT LRECL=472 BLKSIZE=31624;
    input id 1-9 sex 10 race 11 birthp 12-13 major $ 14-15 madeg01 16
     madeg02 17 madeg03 18 madeg04 19 madeg05 20 madeg06 21 madeg07 22
     madeg08 23 madeg09 24 madeg10 25 madeg11 26 madeg12 27 madeg13 28
     inst1 29-30 subject 31-32 @33 (asc1_01 - asc1_13) ($ 2.)
     @59 (Level1-level13) (1.) enteryr 72-73 @74 (PTTIME01-PTTIME12) (1.)
     spell1 86-87 left 88-89 exityr 89-90 spell2 91-92 return 93
     returnyr 94-95 ;
     Label sex= 'Sex of teacher'
           race='Race of teacher'
           Enteryr='First schl year of teaching'
           Spell1='Length of first teaching spell'
           Left='Left teaching before 1984-85 school year'
           Exityr='First year of career interruption'
           Spell2='Length of first interruption spell'
           Return='Returned to teaching by 84-85 schl yr'
           Returnyr='Yr teacher returned to teaching';
* Creation of Person-Period data set;
  IF LEFT > 0 THEN LEFT = 1;
  SPELL = 1;
  ARRAY TEMP1(13) D1-D13;
  DO PERIOD1 = 1 to MIN(SPELL1,13);
    IF PERIOD1 = SPELL1 and LEFT NE 0 THEN DV = 1;  ELSE DV = 2;
    DO INDEX1 = 1 to 13;
    IF INDEX1 = PERIOD1 THEN TEMP1(INDEX1) = 1; ELSE TEMP1(INDEX1) = 0;
  END;
  TIME = PERIOD1;
  TIMESQ = PERIOD1 * PERIOD1;
  TXS = TIME*SEX;
 OUTPUT ;
  END;
 PROC PHREG;
  MODEL DV*LEFT(0) = TIME TIMESQ;
 PROC PHREG;
  MODEL DV*LEFT(0) = TIME TIMESQ SEX;
 PROC PHREG;
  MODEL DV*LEFT(0) = TIME TIMESQ SEX TXS;
 PROC PHREG;
  MODEL DV*LEFT(0) = TIME TIMESQ SEX TXS RACE/CORRB;


                             Appendix 18

1          OPTIONS  LS=80 PS=56 ;
2          TITLE 'DISCRETE TIME PHREG MODEL';
3          TITLE2 'MICHIGAN DATA';
4          PROC FORMAT;
5             VALUE SE 0='MALE' 1='FEMALE';
6             VALUE RA 0='WHITE' 1='BLACK';
7             VALUE LE 0='CENSORED' 1='LEFT';
8             VALUE PT 1='FULL TIME' 2='PART TIME';
9          FORMAT SEX SE.;
10         FORMAT RACE RA.;
11         FORMAT LEFT LE.;
12         data SPELLT1;
13             INFILE MIT LRECL=472 BLKSIZE=31624;
14             input id 1-9 sex 10 race 11 birthp 12-13 major $ 14-15 madeg01 16
15              madeg02 17 madeg03 18 madeg04 19 madeg05 20 madeg06 21 madeg07 22
16              madeg08 23 madeg09 24 madeg10 25 madeg11 26 madeg12 27 madeg13 28
17              inst1 29-30 subject 31-32 @33 (asc1_01 - asc1_13) ($ 2.)
18              @59 (Level1-level13) (1.) enteryr 72-73 @74 (PTTIME01-PTTIME12) (1.)
19              spell1 86-87 left 88-89 exityr 89-90 spell2 91-92 return 93
20              returnyr 94-95;
21              Label sex= 'Sex of teacher'
22                    race='Race of teacher'
23                    Enteryr='First schl year of teaching'
24                    Spell1='Length of first teaching spell'
25                    Left='Left teaching before 1984-85 school year'
26                    Exityr='First year of career interruption'
27                    Spell2='Length of first interruption spell'
28                    Return='Returned to teaching by 84-85 schl yr'
29                    Returnyr='Yr teacher returned to teaching';
30         * Creation of Person period data set;
31           IF LEFT > 0 THEN LEFT = 1;
32           SPELL = 1;
33         ARRAY TEMP1(13) D1-D13;
34           DO PERIOD1 = 1 to MIN(SPELL1,13);
35             IF PERIOD1 = SPELL1 and LEFT NE 0 THEN DV = 1; ELSE DV = 2;
36             DO INDEX1 = 1 to 13;
37             IF INDEX1 = PERIOD1 THEN TEMP1(INDEX1) = 1; ELSE TEMP1(INDEX1) = 0;
38           END;
39           TIME = PERIOD1;
40           TIMESQ = PERIOD1 * PERIOD1;
41           TXS = TIME*SEX;
42          OUTPUT ;
43           END;
44
45         TITLE3 'SPELL1 ANALYSIS';
46          PROC PHREG;;
47           MODEL DV*LEFT(0) = TIME TIMESQ /CORRB;
48          PROC PHREG;;
49           MODEL DV*LEFT(0) = TIME TIMESQ SEX /CORRB;
50          PROC PHREG;;
51           MODEL DV*LEFT(0) = TIME TIMESQ SEX TXS /CORRB;
52          PROC PHREG;;
53           MODEL DV*LEFT(0) = TIME TIMESQ SEX TXS RACE/CORRB;


                           DISCRETE TIME PHREG MODEL                           
                                  MICHIGAN DATA
                                SPELL1 ANALYSIS

     Data Set: WORK.SPELLT1
     Dependent Variable: DV
     Censoring Variable: LEFT      Left teaching before 1984-85 school year
     Censoring Value(s): 0
     Ties Handling: BRESLOW

                            Summary of the Number of
                            Event and Censored Values
                                                       Percent
                     Total       Event    Censored    Censored

                    178109       64785      113324       63.63

                     Testing Global Null Hypothesis: BETA=0

                   Without        With
    Criterion    Covariates    Covariates    Model Chi-Square

    -2 LOG L     1556206.27    1531998.41    24207.85 with 2 DF (p=0.0001)
    Score              .             .       20585.38 with 2 DF (p=0.0001)
    Wald               .             .       15832.20 with 2 DF (p=0.0001)

                    Analysis of Maximum Likelihood Estimates
                    Parameter     Standard      Wald         Pr >          Risk
 Variable   DF       Estimate       Error    Chi-Square   Chi-Square      Ratio

 TIME        1      -0.077037      0.00585    173.33514       0.0001      0.926
 TIMESQ      1      -0.016618    0.0006293    697.27383       0.0001      0.984

                          Estimated Correlation Matrix

                                       TIME            TIMESQ

                   TIME         1.000000000      -0.950405788
                   TIMESQ      -0.950405788       1.000000000


                              SAS PHREG Procedure

     Data Set: WORK.SPELLT1
     Dependent Variable: DV
     Censoring Variable: LEFT      Left teaching before 1984-85 school year
     Censoring Value(s): 0
     Ties Handling: BRESLOW

                            Summary of the Number of
                            Event and Censored Values
                                                       Percent
                     Total       Event    Censored    Censored

                    178109       64785      113324       63.63

                     Testing Global Null Hypothesis: BETA=0

                   Without        With
    Criterion    Covariates    Covariates    Model Chi-Square

    -2 LOG L     1556206.27    1531099.50    25106.77 with 3 DF (p=0.0001)
    Score              .             .       21442.65 with 3 DF (p=0.0001)
    Wald               .             .       16690.03 with 3 DF (p=0.0001)

                    Analysis of Maximum Likelihood Estimates
                    Parameter     Standard      Wald         Pr >          Risk
 Variable   DF       Estimate       Error    Chi-Square   Chi-Square      Ratio

 TIME        1      -0.075359     0.00585      165.90800     0.0001       0.927
 TIMESQ      1      -0.016518     0.0006292    689.29111     0.0001       0.984
 SEX         1       0.254973     0.00865      869.63660     0.0001       1.290

    Analysis of Maximum
   Likelihood Estimates

 Variable   Label

 TIME
 TIMESQ
 SEX        Sex of teacher

                          Estimated Correlation Matrix

                    TIME            TIMESQ               SEX

TIME         1.000000000      -0.950287614       0.009131492
TIMESQ      -0.950287614       1.000000000       0.005355449
SEX          0.009131492       0.005355449       1.000000000      Sex of teacher


                              The PHREG Procedure

     Data Set: WORK.SPELLT1
     Dependent Variable: DV
     Censoring Variable: LEFT      Left teaching before 1984-85 school year
     Censoring Value(s): 0
     Ties Handling: BRESLOW

                            Summary of the Number of
                            Event and Censored Values

                                                       Percent
                     Total       Event    Censored    Censored

                    178109       64785      113324       63.63

                     Testing Global Null Hypothesis: BETA=0

                   Without        With
    Criterion    Covariates    Covariates    Model Chi-Square

    -2 LOG L     1556206.27    1531072.57    25133.70 with 4 DF (p=0.0001)
    Score              .             .       21615.47 with 4 DF (p=0.0001)
    Wald               .             .       16542.91 with 4 DF (p=0.0001)


                    Analysis of Maximum Likelihood Estimates

                    Parameter     Standard      Wald         Pr >          Risk
 Variable   DF       Estimate       Error    Chi-Square   Chi-Square      Ratio

 TIME        1      -0.090814      0.00654    192.63688       0.0001      0.913
 TIMESQ      1      -0.016351    0.0006296    674.42691       0.0001      0.984
 SEX         1       0.190641      0.01508    159.83901       0.0001      1.210
 TXS         1       0.020399      0.00394     26.78174       0.0001      1.021

                          Estimated Correlation Matrix

                 TIME        TIMESQ           SEX           TXS

 TIME     1.000000000  -0.865750539   0.368744370  -0.447773529
 TIMESQ  -0.865750539   1.000000000  -0.027734182   0.037006423
 SEX      0.368744370  -0.027734182   1.000000000  -0.819580406  
 TXS     -0.447773529   0.037006423  -0.819580406   1.000000000


                              SAS PHREG Procedure

     Data Set: WORK.SPELLT1
     Dependent Variable: DV
     Censoring Variable: LEFT      Left teaching before 1984-85 school year
     Censoring Value(s): 0
     Ties Handling: BRESLOW

                            Summary of the Number of
                            Event and Censored Values

                                                       Percent
                     Total       Event    Censored    Censored

                    178109       64785      113324       63.63

                     Testing Global Null Hypothesis: BETA=0

                   Without        With
    Criterion    Covariates    Covariates    Model Chi-Square

    -2 LOG L     1556206.27    1530864.44    25341.83 with 5 DF (p=0.0001)
    Score              .             .       21816.94 with 5 DF (p=0.0001)
    Wald               .             .       16741.99 with 5 DF (p=0.0001)

                    Analysis of Maximum Likelihood Estimates
                    Parameter     Standard      Wald         Pr >          Risk
 Variable   DF       Estimate       Error    Chi-Square   Chi-Square      Ratio

 TIME        1      -0.090666      0.00654    192.04170       0.0001      0.913
 TIMESQ      1      -0.016380    0.0006296    676.95944       0.0001      0.984
 SEX         1       0.198086      0.01509    172.38694       0.0001      1.219
 TXS         1       0.020944      0.00394     28.22737       0.0001      1.021
 RACE        1      -0.239196      0.01718    193.89984       0.0001      0.787

                          Estimated Correlation Matrix

            TIME            TIMESQ          SEX            TXS           RACE

TIME       1.000000000     -.865707407      .368514067    -.447592615   -.001477467 
TIMESQ     -.865707407     1.000000000     -.027638547     .036720009    .003010263
SEX         .368514067     -.027638547     1.000000000    -.818857098   -.031760381
TXS        -.447592615      .036720009     -.818857098    1.000000000   -.008922722
RACE       -.001477467      .003010263     -.031760381    -.008922722   1.000000000 

This document has been accessed times since 7/30/99