
Nonrobustness of the Carryover Effects of Small Classes in Project STAR
by Kitae Sohn, 2015

Background: Class size reduction (CSR) is an enduring school reform undertaken in an effort to improve academic achievement and has been widely encouraged in the United States. Supporters of CSR often cite the positive contemporaneous and carryover effects of Project STAR. Much has been discussed regarding the robustness of the contemporaneous effects, but not regarding that of the carryover effects.

Purpose: This article checks the robustness of the carryover effects of STAR’s small classes.

Setting: STAR was undertaken in 75–79 schools in Tennessee.

Participants: Each year in the experimental period, 6,000–7,000 students in grades K–3 participated in the experiment, for a total of 12,000 students during the entire period.

Intervention: As students initially entered STAR schools, they were (arguably) randomly assigned to small classes with 13–17 students, regular classes with 22–25 students without teacher aides, or regular classes with 22–25 students with teacher aides. The experiment ran from 1985 through 1989, but information on STAR students continued to be collected thereafter.

Research Design: STAR is a randomized controlled field experiment.

Data Analysis: In this article, STAR schools are divided into “effective” schools and “ineffective” schools. Effective schools are defined as schools where the test scores of students in small classes were statistically significantly higher than those of students in regular classes at the 5% level in both math and reading. By contrast, ineffective schools are defined as schools where the test scores of students in small classes were not statistically significantly higher than those of students in regular classes at the 5% level in either math or reading.
Separately for effective schools, schools other than effective schools, and ineffective schools, the academic achievement of students is regressed on variables indicating small-class assignment, along with student characteristics and school-by-entry-wave fixed effects.

Findings: The carryover effects of CSR are not robust; they are driven mostly by effective schools, which account for at most a quarter of STAR schools. This investigation reveals that, in contrast to the protocol of randomization, observable student characteristics in these schools are not randomly distributed between small and regular classes. They are instead distributed in such a way as to increase the academic achievement of students in small classes and decrease that of students in regular classes.

Recommendations: Caution is recommended when citing the positive carryover effects of STAR.

Class size reduction (CSR) is an enduring school reform that has attracted not only researchers but also the public in the United States. According to a compilation by the Education Commission of the States (2005), CSR has been encouraged or required in 24 states since 1977. The Student Achievement Guarantee in Education (SAGE) program in Wisconsin is well known for its success in raising academic achievement (e.g., Molnar et al., 1999), whereas a CSR program in California was criticized for its failure (e.g., Stecher & Bohrnstedt, 2002). In Florida, beginning with the 2010–2011 school year, the maximum number of students in each core class was capped. In addition to individual states, the federal government has encouraged CSR both directly and indirectly; a federal CSR program was separately authorized in fiscal year 1999 and then folded into Title II in 2001. Although many surveys and experiments concerning CSR exist, Project STAR (Student Teacher Achievement Ratio) stands out by virtue of its large-scale, long-term randomization. Mosteller, Light, and Sachs (1996, p.
814) extolled STAR as “one of the great experiments in education in U.S. history,” and Finn and Achilles (1999, p. 97) even claimed that the experiment “came to eclipse all of the research that preceded it.” Hence, it would not be surprising to find that the data are still in use and that the arguably positive effects of CSR in STAR have played a nonnegligible role in justifying the implementation of CSR across the states. Whether CSR improved academic achievement in STAR during the experiment is still open to debate. Researchers engaged in the experiment and some independent researchers have argued that students in small classes outperformed students in regular classes with and without full-time teacher aides (Finn & Achilles, 1990; Krueger, 1999; Word et al., 1990). On the other hand, some researchers have called into question the robustness of the results to nonrandom attrition and noncompliance with the treatment and have pointed out that no statistically significant treatment effects existed in the majority of schools (Ding & Lehrer, 2005, 2010; Hanushek, 1999; Sohn, 2010). In contrast to the heated debate on the effects of CSR in STAR, relatively few objections have been raised about the carryover effects of CSR beyond the experimental period. The former group has argued that students in small classes achieved higher scores on tests of cognitive skills even after the experiment (Finn, Fulton, Zaharias, & Nye, 1989; Krueger & Whitmore, 2001; Nye, Hedges, & Konstantopoulos, 1999). Despite this, no explicit arguments have been put forward about the robustness of the carryover effects. Therefore, this article contributes to the literature by checking the robustness of the effects of CSR after the experimental period. Our methodology differs from that of most studies that reported the positive carryover effects in that estimations are performed by groups of schools.
This strategy is motivated by the concern that when all schools are pooled, large effects in a small number of schools (and little effect in most schools) can mistakenly lead one to argue that the effects exist for all schools. This article finds that the carryover effects disappear once a small number of STAR schools are excluded, which indicates that the carryover effects are not as robust as previously understood. This finding is consistent with the report that STAR students provided only weak evidence of the positive effects of CSR on adult outcomes (Chetty et al., 2011).

PROJECT STAR

STAR is a large-scale, randomized experiment undertaken in Tennessee from 1985 through 1989, at a cost of about $12 million. Each year, 6,000–7,000 students in 75–79 schools participated in the experiment, for a total of 12,000 students during the entire period. Although the experiment was costly, it was justified on the grounds that other research on CSR had not established a definitive answer about its effects. In a review of 76 studies published after 1954, Ryan and Greenfield (1975) found that observable and unobservable variables were not adequately controlled for to clearly illustrate the effects of CSR. Instead of relying on a qualitative review, Glass and Smith (1978, 1979) applied a then-new statistical procedure, called meta-analysis, to quantitatively summarize a total of 725 comparisons in 77 studies on CSR. This meta-analysis concluded that CSR had substantial, positive effects on academic achievement, which were robust to the “source of data, subject taught, duration of instruction, pupil IQ and type of achievement measure” (Glass & Smith, 1979, p. 12). However, the Educational Research Service (1980) pointed out that, in contrast to the claim that the positive effects of CSR were robust, the results were driven mostly by only 14 experimental studies.
Even of these 14 studies, only six were relevant to class size situations typical of elementary and secondary schools. Similarly, Slavin (1984, p. 10) dismissed the results of the meta-analysis, demonstrating that the positive effects were “entirely due to studies of tutoring, not of class size as it is usually understood.” Hence, when the implementation of STAR was under consideration, the effects of CSR had not yet been clarified. If STAR had been ideally implemented, it would have proceeded as follows. Kindergarten schools in Tennessee would have been randomly selected for the experiment, and under the mandatory enrollment policy for kindergarten schools, kindergarten students in the selected schools would have been randomly assigned to small classes with 13–17 students (henceforth, S), regular classes with 22–25 students without teacher aides (henceforth, R), and regular classes with 22–25 students with teacher aides (henceforth, RA). Throughout the experimental period, no students would have left or entered the schools, and all the students would have remained in their initially assigned class types. Teachers would also have been randomly assigned to each class type every year and given no differential treatment. Finally, all the students would have taken tests of cognitive skills before and after the experiment to check whether random assignment was properly implemented. Unfortunately, the experiment was not implemented ideally. Schools volunteered for the experiment, and only schools large enough to accommodate at least 57 students were selected. In addition, because kindergarten education was not mandatory at that time in Tennessee, kindergarten students enrolled in STAR schools were likely to be a select group. Moreover, students constantly left and entered STAR schools, so the overall attrition rate was almost 50%. Because students left and entered each year, S, R, and RA class sizes were not maintained within the designated ranges.
Hence, some small classes contained more students than other regular classes did. Teachers were randomly assigned to each class type, but 54 of 340 second-grade teachers from 15 STAR schools were provided with a three-day training course. Finally, it is uncertain whether the class assignment was randomly made because students did not take tests before the experiment.

PREVIOUS LITERATURE

Sohn (forthcoming) provided an excellent review of the STAR literature, so this section focuses on the immediate issues. Some of the problems listed in the previous section could be minor. For example, although schools were not randomly selected, this problem could be addressed by random assignment of students within schools. If actual class sizes went beyond their designated ranges, the variable of class size, rather than class type, could be used to estimate the effects of CSR. Also, because teachers were randomly assigned every year and the number of trainees was small, the three-day training would not have had large differential effects on the academic achievement of students among different class types (Mosteller, 1995). A more serious controversy arose with respect to the size of the effects of selective attrition and noncompliance with the treatment. One side has argued that the size was small and the positive effects were robust (e.g., Finn & Achilles, 1999; Krueger, 1999), whereas the other side has asserted that the size was substantial and the positive effects were not robust (e.g., Ding & Lehrer, 2010; Hanushek, 1999). In particular, Ding and Lehrer (2010) demonstrated that once selective attrition and noncompliance with the treatment were accounted for, the positive effects of CSR in STAR were absent in second and third grades. This article checks whether similar results are obtained for academic performance beyond third grade. So far, the controversy has been limited to the positive effects of CSR in STAR during the experiment—that is, kindergarten through third grade.
However, it has been claimed that the positive effects persisted beyond the experimental period. Finn et al. (1989) estimated that students in S in third grade scored higher than their counterparts in R and RA in reading and math (effect sizes of 0.11–0.16) when they were fourth graders. Nye et al. (1999) extended the study period to eighth grade to find that students in S in third grade outperformed their counterparts even in eighth grade by effect sizes of 0.133 and 0.158 in reading and math, respectively. Effect sizes were similar in fourth and sixth grades. Finn, Gerber, Achilles, and Boyd-Zaharias (2001) not only confirmed these results for grades K–8 but also stressed that staying longer in small classes further improved academic achievement. Looking beyond eighth grade, Krueger and Whitmore (2001) focused on academic achievement in high school. They estimated that students who were initially assigned to S were 2.7 percentage points more likely to take the ACT or SAT and scored 0.12–0.13 standard deviations (SD) higher on the ACT than those in R and RA combined (henceforth, RR). In contrast to the heated debates on the effects of CSR for grades K–3, to the best of our knowledge, few contrasting arguments have been exchanged concerning the effects of CSR beyond third grade. This article initiates this exchange by demonstrating that the positive effects are not robust to the exclusion of a small number of schools. In this process, one will understand why the effects of CSR persisted for students who initially attended this small number of schools but did not persist for others. It turns out that CSR had nothing to do with these differential effects; indeed, the main reason is that students were nonrandomly assigned to S and RR in those schools.

DATA

This article analyzes follow-up data on STAR students.
At the end of the experiment in the spring of 1989, when most of the STAR students had completed third grade, all STAR students returned to regular classes, but data on their academic achievement were collected throughout high school. Academic achievement in this article is measured by test scores in math and reading, whether the student graduated from high school, whether the student took the ACT or SAT, and ACT-equivalent scores. All these outcome variables have typically been examined in the literature on STAR. In this article, R and RA students are aggregated, as is usually done in the literature. The reason is that half of the R and RA students were randomly reassigned to the other class type in the second year and stayed there until the end of the experiment. In fact, a part-time teacher aide was available 25–33% of the time on average even in R because schools were required to provide students with services that were typically available. Partly for these reasons, the difference in academic achievement between R and RA students was not statistically significant at conventional levels. Hence, there are essentially two class types in this article: S and RR. This article starts by distinguishing “effective” schools from “ineffective” schools. As done by Sohn (2010), effective schools are defined as schools where the test scores of S students were statistically significantly higher than those of RR students at the 5% level in both math and reading. By contrast, ineffective schools are defined as schools where the test scores of S students were not statistically significantly higher than those of RR students at the 5% level in either math or reading. Using a cutoff at the 10% level does not change the substance of the results (not shown).
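The classification rule just described can be sketched in code. The article does not specify the exact test statistic used, so the sketch below assumes a Welch two-sample t-test with a two-sided 5% critical value (≈1.96); the function names and sample data are illustrative only.

```python
import math

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def classify_school(math_S, math_RR, read_S, read_RR, crit=1.96):
    """Label a school 'effective' if S scores are significantly higher than
    RR scores in BOTH math and reading, 'ineffective' if in NEITHER, and
    'mixed' otherwise (crit = 1.96 approximates a two-sided 5% test)."""
    sig_math = welch_t(math_S, math_RR) > crit
    sig_read = welch_t(read_S, read_RR) > crit
    if sig_math and sig_read:
        return "effective"
    if not sig_math and not sig_read:
        return "ineffective"
    return "mixed"
```

Note that under this rule "mixed" schools (significant in exactly one subject) fall outside both groups, which is why the article also reports results for "schools other than effective schools."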
Because randomization took place within schools at each entry grade and because nonrandom attrition and noncompliance with the treatment are suspected, the school that each student attended first is used to indicate whether the student attended an effective or an ineffective school.

ESTIMATIONS

Three strategies are used to ensure that the robustness checks for the carryover effects are themselves robust. Because the main goal is to check the robustness of effects that some researchers have reported, it is necessary to adopt conventionally used empirical models. Our model is similar to those of Krueger and Whitmore (2001) and Dee and West (2011). Specifically,
Y_icsw = β0 + β1 S_icsw + β2 X_icsw + α_sw + ε_icsw,  (1)

where Y_icsw refers to the academic achievement of student i in classroom c of school s in entry wave w. Depending on specifications, Y_icsw is a variable of total math scale score on the Comprehensive Tests of Basic Skills (CTBS, a norm-referenced test battery), total reading scale score on the CTBS, probability of graduating from high school, probability of taking the ACT or SAT college-admissions tests by the senior year of high school, or ACT-converted scores. All test scores are normalized to have a mean of zero and a SD of one, so β1 measures the magnitude of the effect in terms of SD. S_icsw indicates that the student was initially assigned to a small class, so this variable measures intent-to-treat effects (Angrist, Imbens, & Rubin, 1996). X_icsw includes student characteristics, such as race, gender, eligibility for free lunch, days between birthday and January 1, 1977, and its squared term. Free lunch status indicates that the student was eligible for free lunch in at least one grade during the experiment. If randomization was perfectly done, it would be unnecessary to control for X_icsw, but doing so improves the precision of estimations. α_sw represents school-by-entry-wave dummies. Before Krueger and Whitmore (2001), school fixed effects were usually entered to estimate the effects of CSR in STAR because randomization took place within schools. However, Krueger and Whitmore (2001) explained that because students were randomized not only within schools but also at their entry levels, controlling for school-by-entry dummies would reflect the nature of randomization more clearly. Dee and West (2011) followed this rationale. Finally, ε_icsw is a random error, adjusted for heteroscedasticity and clustered at the school-by-entry-wave level. In the second specification, S_icsw in Equation 1 is replaced with four dummies, which indicate the number of years in small classes, with a dummy for zero years omitted. This specification is of interest because Finn and Achilles (1999) and Finn et al.
(2001) argued that as students stayed longer in small classes, their test scores increased even further. It is possible to include the variable of the number of years in small classes rather than dummies for years in small classes, but the former imposes a linear restriction, which forces staying two years to have effects twice those of staying one year, and so on. Using the dummies relaxes this arbitrary restriction. In fact, it turns out that staying in S for one or two years did not have positive effects on test scores, which justifies using a series of dummies rather than a continuous variable. It is also noteworthy that the dummies could be endogenous because student attrition was not random. In the third specification, S_icsw is replaced with class size instrumented by the status of S. The rationale behind the instrument strategy is that class type was randomly assigned, so S was correlated with a small number of students in a classroom but uncorrelated with ε_icsw, a method employed by Krueger (1999) and Krueger and Whitmore (2001). Class size rather than class type is introduced because the ranges of the number of students overlapped between S and RR, and attrition took place nonrandomly. In addition, the variable of class size provides more precise effects of CSR for one student in a classroom than the dichotomous variable, that is, S_icsw. This empirical strategy differs from those employed to produce the positive effects of CSR after the experiment. Our estimations are performed by groups of schools, whereas the other studies that showed the positive results generally pooled all schools. Estimations by pooling assume that β1 is the same for all schools, whereas estimations by groups of schools allow β1 to differ by groups of schools. Accordingly, the assumption of the latter is more flexible than that of the former; the former cannot detect whether a small number of schools produce a large β1 (with little effect in most schools), whereas the latter can.
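The first and third specifications can be sketched on simulated data. Everything below is an assumption for illustration (sample sizes, a true effect of 0.02 SD per student, variable names); it is not the STAR data or code, and the cluster-robust standard errors described above are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_cells = 4000, 12                     # students, school-by-entry-wave cells
cell = rng.integers(0, n_cells, n)        # fixed-effect cell of each student
small = rng.integers(0, 2, n)             # 1 if initially assigned to S
# Class size: small classes around 15 students, regular around 23,
# with noise standing in for noncompliance and attrition
size = np.where(small == 1, 15.0, 23.0) + rng.normal(0, 1.5, n)
# Assumed true model: one fewer student raises achievement by 0.02 SD
y = -0.02 * size + 0.1 * rng.standard_normal(n_cells)[cell] \
    + rng.standard_normal(n)
y = (y - y.mean()) / y.std()              # normalize to mean 0, SD 1

FE = np.eye(n_cells)[cell]                # school-by-entry-wave dummies

def ols(y, X):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Specification 1: intent-to-treat effect of initial small-class assignment
beta_itt = ols(y, np.column_stack([small, FE]))[0]

# Specification 3: 2SLS, instrumenting class size with assignment --
# the first stage predicts class size from S, the second stage
# regresses achievement on the predicted class size
Z = np.column_stack([small, FE])
size_hat = Z @ ols(size, Z)
beta_iv = ols(y, np.column_stack([size_hat, FE]))[0]
```

In this setup the intent-to-treat coefficient recovers roughly the per-student effect times the class-size gap, while the 2SLS coefficient recovers the per-student effect itself. The second specification would simply replace `small` with four dummies for the number of years spent in S.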
As explained in the next section, estimations by groups of schools turn out to be critical, and they produce results contrasting with the positive results; for comparison purposes, estimation results from both pooling and grouping are provided. Sohn (2010) also grouped schools and demonstrated that the grouping was important in critically assessing the positive effects of CSR during the experiment. Our strategy also differs from those employed to challenge the positive effects of CSR during the experiment. For example, Hanushek (1999) cast doubt on the positive effects, but he stopped at simply pointing out that a small number of schools drove the effects, without actually performing multivariate estimations. Ding and Lehrer (2010) also disputed the positive effects, but their strategy is suitable for estimating the effects of CSR during, but not after, the experiment, as well as for a short, but not a long, experimental period. In addition, in interpreting the results in the next section, one needs to be reminded that the goal of this article is to check whether it is true to reject the null hypothesis (β1 = 0). Hence, the main interest concerns the statistical significance of β1, rather than its size (the size can be easily understood because test scores are standard normalized). Large sample sizes work against the purposes of our check, because as sample size increases, so does the likelihood of estimating a statistically significant β1, whereas our exercise is about detecting a statistically insignificant β1. In this sense, if β1 is statistically insignificant, it only strengthens our position.

RESULTS

DESCRIPTIVE STATISTICS

Table 1 reports descriptive statistics of independent and dependent variables. Descriptive statistics are listed for both effective and ineffective schools. Because figures for schools other than effective schools lie in between these two cases, these statistics are reported in the appendix.
By scrutinizing the statistics, one can understand differences between effective and ineffective schools and speculate about the robustness of the effects of CSR in STAR.

Table 1. Descriptive Statistics: Comparison by Initially Assigned Class Type in Each Type of School
Five points are worth noting. First, the number of ineffective schools is always larger than that of effective schools; in most cases, there are more than twice as many ineffective schools. Scrutiny of the total number of STAR schools reveals that effective schools account for only about one-fourth of the total. This is one of the main reasons that Ding and Lehrer (2005) and Hanushek (1999) raised doubts about the positive effects of S in STAR. Second, there is virtually no difference between effective and ineffective schools when it comes to the difference in the number of students in RR and S. More specifically, the differences are 4.6 students and 4.3 students for effective and ineffective schools, respectively. Although these numbers are statistically significantly different at the 5% level, a difference of 0.3 students between the two class types is minuscule. Third, all the independent variables differ between RR and S in effective schools, whereas this is not the case for ineffective schools. Of course, the treatment variable, that is, class size, differs between RR and S in both types of schools. However, the percentage of girls in S is 5.6 percentage points higher than that for RR in effective schools, whereas no such difference is found in ineffective schools. Also, 9.3 percentage points more White/Asian students are initially assigned to S than to RR in effective schools, but the difference for ineffective schools is only 2.3 percentage points, and the latter difference is only weakly significant. Similarly, 9.1 percentage points fewer students are eligible for free lunch in S than in RR in effective schools, but there is no statistically significant difference in ineffective schools. Finally, S students in effective schools are 50.7 days younger than RR students, whereas the difference is only 11.4 days for ineffective schools.
Overall, S students in effective schools have characteristics that improve academic achievement. For example, Krueger (1999, Column 8 in Panel A of Table 5) found that girls, White/Asian students, and students ineligible for free lunch outperformed their counterparts by 4.39, 8.44, and 13.07 percentile ranks, respectively. Fourth, differences in dependent variables are not statistically significant between RR and S students in ineffective schools at the 5% level, whereas differences always exist for effective schools, except for the probability of graduation from high school. Recall that effective and ineffective schools are distinguished based on academic achievement during the experiment, yet the division is still detectable in academic achievement after the experiment. This point anticipates our main results. Fifth, academic achievement for ineffective schools lies in between those for S and RR students in effective schools regardless of the measure used, again except for the probability of graduation from high school. For example, math scores in fourth grade are −0.193 SD for RR students and 0.178 SD for S students in effective schools. On the other hand, the score range is 0.074–0.089 SD for students in ineffective schools, and the difference is not statistically significant. This pattern appears for all except one of the dependent variables. The overall pattern is as follows. The differences in RR and S class sizes are almost the same in both effective and ineffective schools, but almost all measures of academic achievement indicate that S students are superior to RR students in effective schools but not in ineffective schools. In other words, the treatment is the same, but the outcomes differ between effective and ineffective schools.
Moreover, the academic achievement of both RR and S students in ineffective schools falls in between those of RR and S students in effective schools, which indicates that certain variables other than the difference in class sizes increase the academic achievement of S students and decrease that of RR students in effective schools. In fact, observable student characteristics are distributed, intentionally or not, in such a way as to yield such results. This finding makes one suspect that unobservable student characteristics are also unevenly distributed. Furthermore, the proportion of effective schools is rather small relative to the total number of STAR schools. Sohn (2010) demonstrated that the uneven distributions of observable characteristics in effective schools yielded the same results for kindergarten through third grade, and this subsection demonstrates that the patterns persisted through eighth grade. In fact, the persistence is not surprising because some important characteristics are fixed, such as sex, race, free lunch status, and birthday. It could be argued that free lunch status is not fixed, but from the point of view of students, this characteristic is given. In examining the descriptive statistics, one can surmise that the effects of small classes on academic achievement in STAR are driven mostly by a small number of STAR schools; in addition, one can understand why effective schools are indeed effective. In the next section, robustness is checked more systematically by testing whether small classes are effective in raising academic achievement in schools other than effective schools.

MATH ACHIEVEMENT IN GRADES FOUR, SIX, AND EIGHT

Table 2 shows effects associated with S on math in fourth, sixth, and eighth grades. Although not shown, similar results are obtained for fifth and seventh grades. As can be seen, positive effects are driven mostly by students in effective schools, the proportion of which is about a quarter of the entire sample.
Specifically, in fourth grade, when the full sample is considered, students who were initially assigned to S score 0.091 SD higher in math than do students who were initially assigned to RR. However, when the estimation is made only for students in effective schools, the size of the positive effects becomes much larger, that is, 0.247 SD. When students in schools other than effective schools are considered, the positive effects disappear. Unsurprisingly, students in ineffective schools derive no positive effects from S. Similar patterns are observed for sixth and eighth grades. Among students in effective schools, S students score 0.155 SD and 0.161 SD higher than do RR students in sixth and eighth grade, respectively. However, S students in schools other than effective schools or in ineffective schools do not show any positive effects.

Table 2. Robustness of the Effects of Class Size Reduction: Math
Notes: Included in all the specifications but not reported in this table are variables of race, gender, eligibility for free lunch, days between birthday and January 1, 1977, its squared term, and school-by-entry-wave fixed effects. Standard errors, adjusted for clustering at the school-by-entry-wave level, are reported in parentheses. ***p < 0.01. **p < 0.05. *p < 0.10.

The case of cumulative effects of S reveals similar results. For the full sample in fourth grade, positive effects do not appear for students who stayed in S for one or two years. Staying in S for three years yields a positive, but only weakly significant, effect of 0.130 SD. Only students who stayed in S for four years show statistically significant positive effects, of 0.110 SD. Note that staying in S is likely to be endogenous, as explained earlier. Hence, these positive effects capture effects of other factors (e.g., stable families) and thus cannot be entirely attributed to staying for four years in S. When the effects of class size rather than class type are estimated, positive effects are observed mostly for S students in effective schools. For example, for the full sample in fourth grade, a class with one fewer student raises math scores by 0.021 SD, but these positive effects are largely attributable to effective schools. Similar patterns appear for sixth and eighth grades.

READING ACHIEVEMENT IN GRADES FOUR, SIX, AND EIGHT

As described in Table 3, the positive effects of S are larger on reading, although the pattern is similar to that for math. For example, the average reading score in fourth grade is 0.163 SD higher for the entire sample, but the size of the positive effects is much larger for students in effective schools than for students in schools other than effective schools. Students in ineffective schools derive no benefits from being assigned to S across the grade levels.

Table 3. Robustness of the Effects of Class Size Reduction: Reading
Notes: Included in all the specifications but not reported in this table are variables of race, gender, eligibility for free lunch, days between birthday and January 1, 1977, its squared term, and school-by-entry-wave fixed effects. Standard errors, adjusted for clustering at the school-by-entry-wave level, are reported in parentheses. ***p < 0.01. **p < 0.05. *p < 0.10.

When statistical significance is ignored, the number of years in S is positively associated with reading scores through eighth grade, which is consistent with Finn and Achilles’ (1999) arguments. When statistical significance is considered, however, only staying in S for four years yields statistically significant effects, except in fourth grade. Also, as grade levels increase, the size of the positive effect of staying in S for four years decreases from 0.204 SD in fourth grade to 0.119 SD in eighth grade for the entire sample. The results are similar for students in effective schools, with only the size being larger. When class size, rather than class type, is considered, even students in ineffective schools derive some benefit from smaller class sizes for reading in fourth grade. However, the positive effects disappear beyond fourth grade at conventional levels of significance. The positive effects of S seem to persist for students in schools other than effective schools in fourth and eighth grades, but the size is just about half that for effective schools.

HIGH SCHOOL OUTCOMES

Table 4 shows that, consistent with math and reading achievement in fourth through eighth grades, high school outcomes are driven mostly by effective schools. Although being initially assigned to S does not improve a student’s chance of graduating from high school, it raises the probability of taking the ACT or SAT by 2 percentage points. Although the specifications differ slightly, this result is similar to Krueger and Whitmore’s (2001, Column 3 in Table 5) marginal effect of 1.7 percentage points.
However, statistically significant effects at conventional levels are observed only for students attending effective schools, for whom the improvement in this probability is substantial. Table 4. Robustness of the Effects of Class Size Reduction: High School Outcomes
Notes: Included in all the specifications but not reported in this table are variables of race, gender, eligibility for free lunch, days between birthday and January 1, 1977, its square term, and school-by-entry wave fixed effects. Standard errors, adjusted for school-by-entry wave, are reported in parentheses. ***p < 0.01. **p < 0.05. *p < 0.10. Because the probability of taking the ACT or SAT is higher for S students, less skilled S students would be more likely to take the ACT or SAT than RR students. If this is the case, students who were initially assigned to S did not necessarily earn higher scores on the ACT or SAT. In fact, ACT-converted scores are not statistically significantly different between S and RR students. Because of this endogeneity problem, Krueger and Whitmore (2001) used the Heckman selection model and the linear truncation method to demonstrate that students who were initially assigned to S earned higher scores. Because the assumption of normal errors for the Heckman selection model was too strong, they used the linear truncation method, which does not rely on that assumption. Because both methods were imperfect, they also derived bounds to check whether the results obtained from both methods fell within them. We do not go to such lengths; for the purpose at hand, it suffices to show that S students in effective schools account for almost all the improvement in ACT-converted scores. In fact, only S students in effective schools show higher scores (0.449 SD), and even this estimate is only weakly significant. Concerning the number of years in small classes, some interesting facts emerge. Before analyzing the results, notice that the dummies of interest differ from those of Nye et al. (1999). Our dummies indicate only one year in S, only two years in S, and so on, whereas the dummies in the latter indicate one year in S, two or more years in S, and so on.
Hence, our estimates are not directly comparable with those of Nye et al., whose dummies tend to yield higher estimates than ours. The probability of graduating from high school is 4 percentage points higher for students who remained in S for four years in the whole sample. However, this improvement is clearly driven by S students in schools other than effective schools. Moreover, students who attended S for one year show a lower probability of taking the ACT or SAT, driven mostly by S students in ineffective schools and in schools other than effective schools. However, this probability is statistically significantly higher for students who stayed in S for four years in any type of school, although the improvement is greatest for S students in effective schools. This pattern suggests that the variable could indeed be endogenous. If students who stayed in S for only one year belonged to unstable families, and these family characteristics negatively affected academic achievement, the aforementioned pattern could arise. In fact, although not statistically significant, similar patterns are observed in math and reading scores in fourth through eighth grades in some schools, more pronounced for math; this pattern does not seem to appear in connection with ACT-converted scores. However, it does appear for students who stayed in S for three and four years in schools other than effective schools and in ineffective schools. When class size is the variable of interest, students attending a class with one fewer student have a greater (0.6 percentage point) chance of graduating from high school. However, this effect is explained largely by S students in schools other than effective schools and in ineffective schools, which is in stark contrast to previous results. It is difficult at present to come up with adequate explanations to “explain away” this unexpected result.
Finally, the higher probability of taking the ACT or SAT can be explained by S students in effective schools, and the case is similar for ACT-converted scores.
CONCERNS
Six concerns need to be addressed before concluding this article. First, if the positive effects of S on academic achievement are driven mostly by effective schools, which seems to be the case, one could argue that this result is unsurprising because effective schools are defined as schools where students benefited from S. However, the distinction between effective and ineffective schools is drawn using test scores before fourth grade, and in this sense, initial random assignment to S is exogenous to academic achievement in and beyond fourth grade. If the positive effects of S existed and persisted, they would have appeared in the advanced grades regardless of whether students attended effective schools, even if they did not appear during the experiment. Moreover, if some effects disappear when a small subsample is excluded from the estimation, these effects are, by definition, not robust. Hence, if the carryover effects of S disappear when a small number of schools, namely effective schools, are removed from the estimations, the carryover effects cannot be regarded as robust. Second, if an education production function is structured so that S affected post-experiment academic performance only through academic performance during the experiment, it should not be surprising that students who did not benefit during the experiment (i.e., students in ineffective schools) did not benefit after the experiment, either. Although this argument has some merit, it is unconvincing to argue that the treatment of S did not have any independent effects on academic performance during the experiment.
Although the mechanism is beyond the scope of this article, the treatment could improve noncognitive skills such as motivation, persistence, diligence, and punctuality, which in turn might have improved academic performance after the experiment. Even if the education production function is constructed as such, it is still possible that academic performance that was not measured during the experiment would affect academic performance after the experiment. Third, it could be argued that because some of the observable student characteristics are controlled for, the uneven distributions of the characteristics should not create effects independent of the treatment on the academic achievement of students in effective schools. However, the issue at hand is not about controlling for observable characteristics but about possible uneven distributions of unobservable characteristics that are relevant for academic achievement, but imperfectly controlled for. For example, the fact that there were more students eligible for free lunch in RR than S in effective schools is likely to indicate that more academically successful students, independent of class size, were initially assigned to S in effective schools but not in other schools. It seems that S captures the effects of these unobservable characteristics. Fourth, the distinction between effective and ineffective schools is based on the assumption that students and teachers were distributed at random, regardless of class types. However, it turns out that the assumption is wrong at least at the observational level in some schools. This inconsistency could lead one to argue that another attempt at distinction should be made to take into account the nonrandom characteristic of the sample. One possible attempt is to control for observable characteristics along with class type for each initially attended school. This method raises more questions than it answers, however. 
It is unclear to what extent observable characteristics can be controlled for, because Sohn (2010) demonstrated that not only student but also teacher and student-teacher matching characteristics were nonrandomly distributed between S and RR. If all observable characteristics were controlled for, the degrees of freedom would be so few that effective schools would likely be classified as schools other than effective schools (Type II error). In addition, this method is ineffective at controlling for unobservable characteristics correlated with academic performance. Moreover, this argument misses the rationale behind the distinction. The distinction is not made to test whether randomization took place perfectly for all the schools. A priori, it is unknown whether randomization was carried out as intended, so we start with the given assumption that randomization was perfect. Fifth, the effects of S may be positive and statistically significant for students in schools where randomization was perfectly done. This concern is related to the fourth. If this is the case, the strength of our earlier arguments would weaken. Although it is difficult to select such schools, one approximation is to select schools where no differences by class type are found at the 5% significance level for our main covariates (i.e., percentages of girls, Whites, and free-lunch eligibility, and days between the student’s birthday and January 1, 1977). Table 5 lists the results for some dependent variables for illustrative purposes; the results for other dependent variables are similar. The main message is that small classes have little positive effect, and even when estimates are statistically significant, they are smaller than those found for effective schools. An exception is the results for the probability of high school graduation, but they are not large enough to extol the virtue of small classes. Table 5. Academic Performance of the Unbiased Sample
Notes: Included in all the specifications but not reported in this table are variables of race, gender, eligibility for free lunch, days between birthday and January 1, 1977, its square term, and school-by-entry wave fixed effects. Standard errors, adjusted for school-by-entry wave, are reported in parentheses. ***p < 0.01. **p < 0.05. *p < 0.10. Last but not least, one could assert that a Hawthorne effect motivated some teachers assigned to small classes and that these teachers drove the positive CSR effects in effective schools. However, this is a post hoc justification; before the results were obtained, no one could know who would be motivated by a Hawthorne effect, if anyone. Note that effective schools accounted for at most one quarter of STAR schools. If a Hawthorne effect had prevailed, the proportion would have been greater than one quarter. Unfortunately, this concern raises further questions, such as how and why motivated teachers differed from nonmotivated teachers. Moreover, teachers were randomly assigned to classes every year, meaning that the same teacher might be assigned to a small class one year and to a regular class the next. In this situation, it is not plausible that teachers switched their motivation on and off every year depending on their class sizes. When a Hawthorne effect is considered, a John Henry effect also needs to be considered; that is, teachers assigned to regular classes (the control group) might have been more motivated to overcome their disadvantage (large class sizes). If a Hawthorne effect had driven our results, a John Henry effect could also have explained them; indeed, a John Henry effect would have had greater explanatory power (three quarters of STAR schools) than a Hawthorne effect (one quarter). However, for the aforementioned reasons, a John Henry effect is not a convincing explanation either.
CONCLUSIONS
This article argues that the carryover effects of small classes in STAR are not robust.
In general, students who were initially assigned to S appear to outperform students who were initially assigned to RR in math and reading in and beyond fourth grade, but this happens mostly for students in a small number of STAR schools, that is, effective schools. Under scrutiny, observable student characteristics turn out to be unevenly distributed in such a way as to raise the academic performance of S students in effective schools but not in ineffective schools. However, the differences between S and RR class sizes are almost the same in effective and ineffective schools. Thus, the treatment is the same in both types of schools, yet positive effects are estimated only for effective schools, during and beyond the experimental period. The natural question is, “What are the causes of these biased positive effects?” Considering the uneven distributions of observable characteristics shown in Table 1, it is probable that uneven distributions of unobservable characteristics are the culprit and that S in effective schools captures their effects. The nonrobustness of the effects of small classes in STAR warns researchers and policy makers to be cautious when they rely on STAR to promote CSR. They often cite the positive effects of S during and beyond the experimental period. However, whether the positive effects of S actually existed during the experiment remains controversial, and this article casts serious doubt on even the carryover effects. Our results also reflect the current trend in research on Project STAR; most recent studies have demonstrated that the positive effects of CSR in STAR were nil or much smaller than previously believed (e.g., Chetty et al., 2011; Ding & Lehrer, 2010; Sohn, 2010). This article attends only to the carryover effects of S on academic achievement in STAR.
Although the carryover effects are not robust, CSR might be more effective when combined with particular subjects, teaching methods, or student compositions (Finn, Pannozzo, & Achilles, 2003). In the extreme, it may be that CSR alone is not effective at all in raising academic achievement. Theoretically, the distinction between CSR in combination and CSR alone is important, and neglecting it may fuel the controversy over the effectiveness of CSR. For example, suppose that class size became smaller and the teacher changed her teaching methods accordingly. If the academic performance of students in the class improved, it would be difficult to tease out whether the improvement resulted from the small class size or from the new teaching methods. An observer unaware of the change in teaching methods would attribute the improvement to the small class size, whereas one aware of it would be skeptical of the positive effects of CSR. In light of this, future research on CSR will be more fruitful if it identifies the factors that are complementary to CSR.
References
Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434), 444–472.
Chetty, R., Friedman, J. N., Hilger, N., Saez, E., Schanzenbach, D. W., & Yagan, D. (2011). How does your kindergarten classroom affect your earnings? Evidence from Project STAR. Quarterly Journal of Economics, 126(4), 1593–1660.
Dee, T., & West, M. (2011). The non-cognitive returns to class size. Educational Evaluation and Policy Analysis, 33(1), 23–46.
Ding, W., & Lehrer, S. F. (2005). Class size and student achievement: Experimental estimates of who benefits and who loses from reductions. Queen’s Economics Department Working Paper 1046, Queen’s University, Ontario, Canada.
Ding, W., & Lehrer, S. F. (2010).
Estimating treatment effects from contaminated multiperiod education experiments: The dynamic impacts of class size reductions. Review of Economics and Statistics, 92(1), 31–42.
Education Commission of the States. (2005). State class-size reduction measures. Denver, CO: Author.
Educational Research Service. (1980). Class size research: A critique of recent meta-analyses. Phi Delta Kappan, 63(4), 239–241.
Finn, J. D., & Achilles, C. M. (1990). Answers and questions about class size: A statewide experiment. American Educational Research Journal, 27(3), 557–577.
Finn, J. D., & Achilles, C. M. (1999). Tennessee’s class size study: Findings, implications and misconceptions. Educational Evaluation and Policy Analysis, 21(2), 97–109.
Finn, J. D., Fulton, D., Zaharias, J., & Nye, B. A. (1989). Carryover effects of small classes. Peabody Journal of Education, 67(1), 75–84.
Finn, J. D., Gerber, S. B., Achilles, C. M., & Boyd-Zaharias, J. (2001). The enduring effects of small classes. Teachers College Record, 103(2), 145–183.
Finn, J. D., Pannozzo, G. M., & Achilles, C. M. (2003). The “why’s” of class size: Student behavior in small classes. Review of Educational Research, 73(3), 321–368.
Glass, G., & Smith, M. L. (1978). Meta-analysis of the relationship of class size and student achievement. San Francisco, CA: Far West Laboratory for Education Research.
Glass, G., & Smith, M. L. (1979). Meta-analysis of research on class size and achievement. Educational Evaluation and Policy Analysis, 1(1), 2–16.
Hanushek, E. A. (1999). Some findings from an independent investigation of the Tennessee STAR experiment and from other investigations of class size effects. Educational Evaluation and Policy Analysis, 21(2), 143–163.
Krueger, A. B. (1999). Experimental estimates of education production functions. Quarterly Journal of Economics, 114(2), 497–532.
Krueger, A. B., & Whitmore, D. M. (2001).
The effect of attending a small class in the early grades on college-test taking and middle school test results: Evidence from Project STAR. Economic Journal, 111(468), 1–28.
Molnar, A., Smith, P., Zahorik, J., Palmer, A., Halbach, A., & Ehrle, K. (1999). Evaluating the SAGE program: A pilot program in targeted pupil-teacher reduction in Wisconsin. Educational Evaluation and Policy Analysis, 21(2), 165–177.
Mosteller, F. (1995). The Tennessee study of class size in the early school grades. Future of Children, 5(2), 113–127.
Mosteller, F., Light, R. J., & Sachs, J. A. (1996). Sustained inquiry in education: Lessons from skill grouping and class size. Harvard Educational Review, 66(4), 797–842.
Nye, B., Hedges, L. V., & Konstantopoulos, S. (1999). The long-term effects of small classes: A five-year follow-up of the Tennessee class size experiment. Educational Evaluation and Policy Analysis, 21(2), 127–142.
Ryan, D. W., & Greenfield, T. B. (1975). Review of class size research. In D. W. Ryan & T. B. Greenfield (Eds.), The class size question: Development of research studies related to the effects of class size, pupil-adult, and pupil-teacher ratios (pp. 170–231). Toronto, Ontario, Canada: The Ontario Institute for Studies in Education.
Slavin, R. E. (1984). Meta-analysis in education: How has it been used? Educational Researcher, 13(8), 6–15.
Sohn, K. (2010). A skeptic’s guide to Project STAR. KEDI Journal of Educational Policy, 7(2), 257–272.
Sohn, K. (forthcoming). A review of research on Project STAR and path ahead. School Effectiveness and School Improvement.
Stecher, B. M., & Bohrnstedt, G. W. (Eds.). (2002). Class size reduction in California: Findings from 1999–00 and 2000–01. Sacramento: California Department of Education.
Word, E. R., et al. (1990). The state of Tennessee’s Student/Teacher Achievement Ratio (STAR) Project: Technical report 1985–1990. Nashville: Tennessee State Department of Education.
APPENDIX
Descriptive Statistics: Comparison by Initially Assigned Class Type in Schools Other than Effective Schools
Notes: Standard errors are reported in parentheses. ***p < 0.01. **p < 0.05. *p < 0.10.
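The Appendix comparison, like the selection of the "unbiased sample" behind Table 5, amounts to testing whether covariate means differ between small (S) and regular (RR) classes within each school, and keeping only schools with no difference significant at the 5% level. The following is a minimal sketch of that selection rule, not the article's actual code; the column names (`school_id`, `class_type`, `girl`, `free_lunch`) and the use of a Welch two-sample t-test are illustrative assumptions, and the STAR data files label these variables differently.

```python
# Hypothetical sketch of a per-school covariate balance filter.
import pandas as pd
from scipy import stats

def balanced_schools(df, covariates, alpha=0.05):
    """Return school ids where no covariate mean differs between S and RR
    classes at the `alpha` significance level (Welch two-sample t-test)."""
    keep = []
    for school, grp in df.groupby("school_id"):
        s = grp[grp["class_type"] == "S"]
        rr = grp[grp["class_type"] == "RR"]
        pvals = [stats.ttest_ind(s[c], rr[c], equal_var=False).pvalue
                 for c in covariates]
        if all(p >= alpha for p in pvals):  # no significant imbalance found
            keep.append(school)
    return keep

# Deterministic toy data: school 1 is balanced on both covariates,
# school 2 is heavily skewed on free-lunch eligibility.
alt = [0, 1] * 50  # alternating 0/1, mean 0.5
df = pd.DataFrame({
    "school_id": [1] * 200 + [2] * 200,
    "class_type": (["S"] * 100 + ["RR"] * 100) * 2,
    "girl": alt * 4,                       # balanced in every class group
    "free_lunch": alt * 2                  # school 1: balanced
                  + [0] * 90 + [1] * 10    # school 2, S classes: mean 0.1
                  + [1] * 90 + [0] * 10,   # school 2, RR classes: mean 0.9
})
print(balanced_schools(df, ["girl", "free_lunch"]))  # → [1]
```

The Welch variant (`equal_var=False`) is one reasonable choice here, since S and RR groups need not share a variance; in the article's own analysis, the exact test used for the balance comparison is whatever produced the significance stars in the Appendix table.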


