Postscript to the Canadian Psychological Association’s Position Statement on the TOEFL

Marvin L. Simner, Chair
Canadian Psychological Association,
Professional Affairs Committee Working Group on Test Publishing Industry Safeguards

Approved by the Board of Directors, Canadian Psychological Association, May, 1999 
Copyright © 1999
Canadian Psychological Association
Société canadienne de psychologie
Permission is granted to copy portions of this document for educational use.

Canadian Psychological Association
Société canadienne de psychologie
151 Slater St., Suite 205
Ottawa, Ontario
K1P 5H3

Title: Postscript to the Canadian Psychological Association’s Position Statement on the TOEFL

ISBN 1896538517


Postscript to the Canadian Psychological Association’s Position Statement on the TOEFL

In March, 1997, the Board of Directors of the Canadian Psychological Association (CPA) approved a position statement that called upon Canadian universities to refrain from using the Test of English as a Foreign Language (TOEFL) as a standard for university admission. This call was prompted by evidence in a report to the Board (Simner, 1998) which suggested that the TOEFL was being employed not only in a manner that was contrary to recommendations by the Educational Testing Service (ETS), which publishes the TOEFL, but also in a manner that could prove harmful to many Canadian immigrants and refugees.

Briefly stated, the evidence showed that in Canada the decision to accept a nonnative English speaking applicant is often based primarily on the applicant’s overall TOEFL score, and only secondarily on the applicant’s prior academic performance. What is both surprising and troublesome about this procedure is that over the years a considerable body of evidence has accumulated which shows that only a weak relationship exists between TOEFL scores and academic achievement at both the undergraduate and graduate levels. Because much of this evidence was collected by ETS itself, ETS has repeatedly advised university officials to avoid making admission decisions based solely on TOEFL scores.

Despite this advice, however, surveys indicate that since 1982 many universities have increased their TOEFL admission cutoffs. In Ontario, for example, by 1995, ten universities had raised their minimum undergraduate cutoffs from 550 (which is close to the 70th percentile) to scores that ranged from 580 through 600 (which are near the 90th percentile). Needless to say, these higher cutoffs mean that, whereas previously the top 30% of applicants would have been eligible for consideration, now only the top 10% are eligible. Hence, it is quite likely that today large numbers of immigrants and refugees who are otherwise qualified for university admission are being denied admission to these universities.

The CPA position statement and the supporting report were mailed to each Canadian university, along with a request that the universities review their use of the TOEFL and adopt procedures in keeping with the guidelines established by ETS. Although the reactions from the universities have, for the most part, been positive (Berkowitz, 1998), and thus far several have either initiated or are considering initiating procedures in line with this request1, some universities felt that the concerns expressed in the position statement were inappropriate and unwarranted.

The major objection had to do with the claim by these universities that the TOEFL was not being used as a standard for admission with the aim of predicting future academic achievement, as charged by CPA, but instead was being employed for decision making purposes only to ensure that students who are proficient in English are allowed to enrol in English speaking universities. It was further claimed that this use of the TOEFL was justified. For instance, in a letter to the Board one university administrator wrote: "Although we have not undertaken studies of the psychometric properties of the TOEFL, we regularly review the technical data submitted by ETS on this measure. In light of these reports and our own experiences with the instrument, we consider that it does what it purports to do (by ETS) quite well namely assess level of English proficiency." Hence, those who took issue with the Board argued that if the Board had reviewed the evidence provided by ETS in support of the TOEFL as a measure of English language proficiency, the position statement would not have been necessary.

The technical data referred to above are readily available to users of the TOEFL in the TOEFL Test and Score Manual. The findings that are said to favor the TOEFL as a measure of English language proficiency are given under the following headings in the Manual: criterion-related validity, construct validity, and content validity. Because the universities that objected to the position statement rely on these findings to justify their use of the TOEFL for decision making, the question is whether these findings truly justify this use. Drawing on material in the 1997 version of the Manual, this postscript evaluates the major findings that appear under each heading. To this end we first cite the findings and then comment on their nature.

Criterion-related Validity

The standard requirement for evidence offered in support of the criterion-related validity of a test holds that scores on that test should be systematically related to certain appropriate criterion measures (Anastasi & Urbina, 1997; Ward, Stoker, & Murray-Ward, 1996; Nunnally & Bernstein, 1994). To satisfy this requirement, over the years ETS has gathered evidence on the relationship between the TOEFL and other tests of English language proficiency, as well as evidence on the relationship between the TOEFL and teacher evaluations of classroom discourse, oral interviews and writing samples. The following material appears on pages 34 and 35 in the 1997 edition of the Manual.

A study conducted by Maxwell (1965) at the Berkeley campus of the University of California found an [sic] .87 correlation between total scores on the TOEFL test and the English proficiency test used for the placement of foreign students at that campus. This correlation was based on a total sample of 238 students (202 men and 36 women, 191 graduates and 47 undergraduates) enrolled at the university during the fall of 1964. Upshure (1966) conducted a study to determine the correlation between TOEFL and the Michigan Test of English Language Proficiency. This was based on a total group of 100 students enrolled at San Francisco State College (N = 50), Indiana University (N = 38), and Park College (N = 12) and yielded a correlation of .89. Other studies comparing TOEFL and Michigan Test scores have been done by Pack (1972) and Gershman (1977). In 1966 a study was carried out at the American Language Institute (ALI) at Georgetown University comparing scores on TOEFL with scores on the ALI test developed at Georgetown. The correlation of the two tests for 104 students was .79.

In addition to comparing TOEFL with other tests, some of these studies included investigations of how performance on TOEFL related to teacher ratings. In the ALI Georgetown study the correlation between TOEFL and these ratings for 115 students was .73. Four other institutions reported similar correlations. Table 4 gives the data from these studies. At each of the institutions (designated by code letters in the table) the students were ranked in four, five, or six categories based on their proficiency in English as determined by university tests or other judgments of their ability to pursue regular academic courses (American Language Institute, 1966).

Table 4. Correlations of Total TOEFL Scores with University Ratings

University     Number of Students     Correlation with Ratings
A              215                    .78
B              91                     .87
C              45                     .76
D              279                    .79

In a study conducted on the five-section version of the test used prior to 1976, Pike (1979) investigated the relationship of the TOEFL test and its subsections to a number of alternate criterion measures, including writing samples, cloze tests, oral interviews, and sentence-combining exercises. In general, the results confirmed a close relationship between the five sections of the TOEFL test and the English skills they were intended to measure. Among the most significant findings of this study were the correlations between TOEFL subscores and two nonobjective measures: oral interviews and writing samples (essays).

Comment

Unfortunately it is not possible to verify all of the claims in the first paragraph of the above quote because three of the five studies mentioned in this paragraph (American Language Institute, Gershman, and Upshure) are unpublished and therefore unavailable for review. However, in 1984 ETS published detailed summaries of a number of studies conducted between 1963 and 1982 that involved the TOEFL (Hale, Stansfield, & Duran, 1984), among which are the studies by Maxwell and Pack. According to the summaries, Maxwell did indeed report a correlation of .87 between the TOEFL and the University of California, Berkeley, Test of English as a Foreign Language. The findings by Pack, on the other hand, were less impressive. Based on a sample of 402 students, "the correlations between equivalent sections on both tests were Aural or Listening Comprehension, .45; Structure-Grammar, .52; Vocabulary, .62; Reading Comprehension, .49; and total test score, .66" (Hale et al., 1984, p. 160). In fact, in their assessment of these findings, Hale et al. concluded that "the correlation between TOEFL and the Michigan test was only moderate. Thus, these two tests are not interchangeable." (p. 161)

The material in the second paragraph of the quote deals with the correlation between TOEFL scores and teachers’ ratings of English language proficiency. Here the emphasis is solely on the findings from the unpublished study by the American Language Institute. This emphasis is somewhat puzzling in that two other investigations were summarized by Hale et al. (see Komvichayungyuen, 1978, pages 129-130 and Schrader & Pitcher, 1970, pages 183-185), both of which were concerned with the same issue and made use of samples similar in size to the ones reported in Table 4. The study by Komvichayungyuen, which involved 57 students, reported correlations ranging from .45 to .55, while the study by Schrader and Pitcher, which made use of 108 students, reported correlations ranging from .34 to .43. Needless to say, if the correlations obtained in these other two studies had been included, the findings in Table 4 would not have been nearly as consistent and therefore not as impressive.

The Pike study referred to in the last paragraph of the quote was also summarized by Hale et al. (see pages 166-168). In line with the remarks concerning this study, scores on an essay test and from an oral interview yielded correlations with the various TOEFL subtests that ranged from the low .80s to the high .90s. Although these findings are clearly in keeping with the comments at the end of the paragraph, this paragraph too suffers from missing information. In particular, nine other studies summarized by Hale et al. dealt with the same issue. Two of the studies (Osanyinbi, 1975, pages 157-159; Pitcher & Ra, 1967, pages 169-170) reported correlations between the TOEFL and essay test results, whereas the other seven (Abadzi, 1976, pages 17-20; Clark & Swinton, 1979, pages 64-65; Clark & Swinton, 1980, pages 66-67; Gradman & Spolsky, 1975, pages 94-96; Hillman, 1973, pages 109-112; Mullen, 1978, pages 141-143; Osanyinbi, 1975, pages 157-159) reported correlations between the TOEFL and oral interviews or other forms of oral assessment obtained from devices such as the Test of Spoken English (TSE). In contrast to the correlations obtained by Pike, the majority of the correlations found in these other studies fell only in the vicinity of .30 to .60. Moreover, and in line with these other findings, in a recent investigation by Henning and Cascallar (1992) the majority of the correlations between TOEFL scores and oral as well as written communicative competence were only in the mid .30s. Thus here too the missing material contained evidence that was only moderately supportive of a meaningful relationship between the TOEFL and the criterion measures.

In essence, the picture that emerges is that the evidence cited by ETS under Criterion-related validity is somewhat selective and thus not entirely accurate. In the first paragraph both the location and the narrative surrounding the Pack study suggest that Pack obtained results similar to the results obtained by the American Language Institute, Gershman, and Upshure. However, a careful reading of the summary of the Pack study in Hale et al. reveals that this is not the case. Moreover, if the findings by Komvichayungyuen (1978) and Schrader and Pitcher (1970) had been included in the second paragraph, the results in Table 4 would have provided a less optimistic view of the relationship between the TOEFL and teacher ratings of English language proficiency. Finally, if the outcome of the many other studies that examined the nature of the relationship between the TOEFL and oral interviews, etc. had been included in the last paragraph, it would have been evident that only a marginal to moderate, rather than a close relationship exists between the TOEFL and these other ways of evaluating English language proficiency.

It is also worth noting that in the majority of the studies mentioned above in which findings are reported on the relationship between the TOEFL and oral interviews, written assignments, etc., the information from these other measures of English language proficiency was gathered either at the same time or shortly after the TOEFL was given. In the study by Pike, for instance, all of the other instruments "were administered a short time after the administration of TOEFL" (Hale et al., p. 166). In the study by Henning and Cascallar (1992), the TOEFL, the TSE, and the TWE (Test of Written English) were all administered within a two to ten day period.

This matter is quite important because, presumably, the major purpose of using the TOEFL as an admissions screening device is not to determine how well a student performs in English at the time the TOEFL is taken, but instead to determine how well the student is likely to perform in the future, which typically means some 8-10 months later after the student has arrived on campus and is immersed in an English speaking environment. Hence, the evidence needed to support the TOEFL as a screening device is evidence in favor of predictive validity, yet the findings from these studies deal only with concurrent validity which may be largely irrelevant to the issue. Moreover, the only follow-up study that we were able to locate suggests that while there may be some relationship between TOEFL scores and first year performance in university English courses, this relationship might not hold beyond the first year. In particular, Pack found that, whereas TOEFL scores were "significantly related to the grade obtained in the first English course taken, they are not related to grades obtained in subsequent English courses nor are they related to the probability that an examinee will graduate" (Hale et al. p. 161).

In terms of follow-up work, the study by Schrader and Pitcher (Hale et al., p. 183-185) is also worth mentioning. These authors reported that after an eight-week summer university orientation program given in English, students’ scores on the TOEFL itself increased from an average of 570 to 601. This evidence is especially important, because it suggests that if the new cutoff of 600 mentioned above is strictly adhered to, applicants who are rejected due to scores of 580 to 590 might well achieve 600 or more and thus become eligible for admission if given the opportunity to improve their English after arriving on campus and before starting classes.2 In general, then, it would seem that any attempt to justify the use of the TOEFL as a means of predicting a university applicant’s future command of English based solely on evidence from concurrent validity studies with the TOEFL must be viewed with some suspicion.

Finally, there is also reason to question whether the comparisons mentioned above between the TOEFL and the other formal tests of English language proficiency are even appropriate for assessing the criterion validity of the TOEFL. Note in the Pack study that the comparisons were between "equivalent sections" of both tests. The reason for mentioning this point is that there appears to be considerable overlap in content, not only between the TOEFL and the Michigan Test of English Language Proficiency but between the TOEFL and these other language tests as well. Indeed, according to Palmer, Associate Director of the American Language Institute at Georgetown University, "All of these tests, the Michigan test, the TOEFL test, and ALIGU (American Language Institute Georgetown University) test, are all high level proficiency tests. They all measure pretty much the same thing. As a consequence, they all tend to correlate" (Palmer & Woodford, 1978, p. 509).

In her discussion of criterion validity, Anastasi (1982) addressed the matter of using existing tests to establish the validity of a new test when both contain overlapping content.

When the new test is an abbreviated or simplified form of a currently available test, the latter can properly be regarded as a criterion measure. Thus, a paper-and-pencil test might be validated against a more elaborate and time-consuming performance test whose validity had previously been established. Or a group test might be validated against an individual test. The Stanford-Binet, for example, has repeatedly served as a criterion in validating group tests. In such a case, the new test may be regarded at best as a crude approximation of the earlier one. It should be noted that unless the new test represents a simpler or shorter substitute for the earlier test, the use of the latter as a criterion is indefensible. (p. 142)

Two issues are raised in this paragraph. The first issue has to do with whether the existing tests against which the TOEFL is commonly compared have themselves been properly validated. To address this issue it is instructive to consider the validity claims in favor of the Michigan Test of English Language Proficiency (MTELP), published by the English Language Institute at the University of Michigan, because this test appears to be the one most often used to verify the claims for the criterion validity of the TOEFL. In its most recent form the MTELP is called the Michigan English Language Assessment Battery (MELAB). The 1994 edition of the MELAB Technical Manual cites as evidence in support of the criterion validity of the MELAB the same three sets of comparisons used to establish the criterion validity of the TOEFL, namely 1) teacher assessments of students’ English proficiency, 2) students’ scores on independent measures of written and oral proficiency, and 3) students’ performances on other English language proficiency batteries that are deemed similar to the MELAB.

As was the case with the TOEFL, the first two sets were concerned largely with concurrent validity and so suffer from the same drawback mentioned above in connection with the evidence on the TOEFL (see pp. 56-61 in the MELAB Technical Manual, 1994). Moreover, the follow-up evidence that does exist on the Michigan test is in line with the follow-up findings on the TOEFL. Schrader and Pitcher (Hale et al., pp. 183-185) reported correlations of only .19 and .33, respectively, between the MTELP, which was administered before the students arrived in the United States, and the students’ average grades in written and spoken English at the end of an eight-week orientation period after arriving on campus. Similar results were obtained by Pack (Hale et al., pp. 160-161), who correlated MTELP scores with grades in university English courses, and by Abadzi (Hale et al., pp. 17-21), who reported on the relationship between the MTELP and first (r = .29) as well as second (r = .00) semester grade point averages.

Equally troublesome in this use of the Michigan test by ETS to support the criterion validity of the TOEFL is the evidence from the third comparison. Here the MELAB Manual mentions only one test battery to which the MELAB was compared in order to establish its criterion validity and this battery is the TOEFL. Needless to say, it would seem tautological and therefore somewhat dubious for ETS to use the relationship between the TOEFL and the MELAB (or MTELP) in support of the criterion validity of the TOEFL, if this same relationship is used by the English Language Institute at the University of Michigan as evidence in support of the criterion validity of the MELAB. In fact, given the similarity between the two, it may be more appropriate to refer to this relationship as evidence in favor of alternate-form reliability than criterion-related validity.

The second issue raised by Anastasi has to do with whether the TOEFL is a "simpler or shorter version" of the existing tests to which it is compared. The paper-based version of the TOEFL at present consists of three sections and requires about three hours to complete. The paper-based version of the MELAB also consists of three sections and also can be completed in about three hours. Furthermore, according to descriptions in the manuals of both, it appears that the earlier versions of the two tests were also quite similar in length and in time required for completion. Thus, here too there is reason to question whether it is appropriate to employ the MELAB (or the MTELP) as a means of supporting the criterion validity of the TOEFL, since it would certainly seem that the TOEFL is neither a simpler nor a shorter version of the Michigan test.

Construct Validity

Construct validity refers to the extent to which a test measures some underlying theoretical trait or construct. On pages 36-38 in the 1997 TOEFL Manual, two sets of findings are summarized to illustrate the construct validity of the TOEFL. The first set involves a comparison between native and nonnative speakers of English while the second bears on the relationship between the TOEFL and other tests that are said to measure verbal aptitude. We will deal with each set in turn.

1) Native vs. Nonnative Speakers

In early attempts to obtain construct-related evidence of validity for the TOEFL test, two studies were conducted comparing the performance of native and nonnative speakers of English on the test. Angoff and Sharon (1970) found that the mean TOEFL scores of native speakers in the United States were much higher than those of foreign students who had taken the same test. Evidence that the test was quite easy for the American students is found in the observations that their mean scores were not only high but homogeneously high relative to those of the foreign students; that their score distributions were highly negatively skewed; and that a high proportion of them earned maximum or near-maximum scores on the test.

A more detailed study of native speaker performance on the TOEFL test was conducted by Clark (1977). Once again, performance on the test as a whole proved similar to that of the native speakers included in the Angoff and Sharon study. The mean raw score for the native speakers, who took two different forms of the TOEFL test, was 134 (out of 150). This compared to mean scores of 88 and 89 for the nonnative speakers who had originally taken the same forms.

Comment

Both the Angoff and Sharon study and the Clark study are summarized in Hale et al. (see pages 42-43 and pages 62-63, respectively). From the information in the summaries and in the complete version of the Clark study (which is available from ETS), it would appear that both investigations suffer from a serious design problem that interferes with interpretation. In the Angoff and Sharon study the native English speakers were entering freshmen at a western state university, whereas in the Clark study the native English speakers were students in a college preparatory program who had completed a considerable number of high school English courses in which they had received top grades. The nonnative speakers in the two studies, on the other hand, were the individuals who provided the normative data in the TOEFL manual. Hence, the nonnative speakers consisted of unselected examinees from a wide range of backgrounds who took the TOEFL largely for the purpose of applying to college or university. Thus in both studies comparisons were made between groups of individuals that may have differed on a number of dimensions other than native command of English. For instance, differences might have been present between the native and nonnative speakers in overall level of academic ability, IQ, and even test anxiety. The latter is particularly important because performance on the TOEFL would have had little impact on the future careers of the native speakers, whereas just the opposite is true of the nonnative speakers, and test anxiety by itself can certainly lower an examinee’s score.3 Because no attempt was made to control for these or other factors that might have affected the results, it is difficult, if not impossible, to reach any meaningful conclusion from either study.

Also, in view of the statement made in connection with the Angoff and Sharon study that "the test was quite easy for the American students," it is worth drawing attention to the following material from a summary by Hale et al. of yet another unpublished study. As this material suggests, it would seem that at least some parts of the TOEFL may be more difficult for the average native English speaker than is often assumed.

The work of Angoff and Sharon (1971) suggests that native English speakers have little difficulty with TOEFL. However, an unpublished study with 88 native English-speaking high school students shows that, while the (North American) subjects had little difficulty with TOEFL overall, the Structure and Written Expression and the Reading Comprehension and Vocabulary sections were found to contain items that were more difficult than would be expected for native speakers. In the case of each of these subtests, one-fourth to one-fifth of the items were answered incorrectly by at least 80 percent of the subjects. Lack of grammatical skills influenced the difficulty of the Structure and Written Expression section. The difficulty of the vocabulary items was affected by abstractness and frequency of vocabulary. The difficulty of the reading comprehension items was influenced by the need to make complicated judgements and inferences. The occurrence of unexpectedly difficult items for native English speakers complicates the interpretation of performance on such items. (p. 34)

Finally, it is also instructive to consider Table 9 in a separate report by ETS (TOEFL Test and Score Data Summary, 1997-98 Edition), which lists the mean TOEFL scores achieved by native speakers of more than 150 languages. Among the languages cited in the table is English. According to the table, the mean score obtained by 4,921 native English speakers who took the TOEFL from July 1996 through June 1997 was 590. The reason for drawing attention to this score is that 590 is below the cutoff of 600 which, as noted above, is now in use at a growing number of Canadian universities and applies only to nonnative English speaking applicants. Hence, there may very well be many native English speaking applicants who would be denied admission to these universities if all applicants, regardless of the applicant’s native language, citizenship,4 or country of origin, were required to take the TOEFL.

2) Relationship between TOEFL and Verbal Aptitude Tests

Other evidence of TOEFL’s (construct) validity is presented in studies that have focused on the relationship of the TOEFL test to some widely used aptitude tests. The findings of these studies contribute to the construct-related validity evidence by showing the extent to which the test has integrity as a measure of proficiency in English as a foreign language. One of these studies (Angelis, Swinton, and Cowell, 1979) compared the performance of nonnative speakers of English on the TOEFL test with their performance on the verbal portions of the GRE Aptitude (now General) Test (graduate-level students) or both the SAT and the Test of Standard Written English (undergraduates). As indicated in Table 6 (below), the GRE verbal performance of the nonnative speakers was much lower and less reliable than the performance of the native speakers. Similar results were reported for undergraduates on the SAT verbal and the TSWE (Table 7).

Table 6. TOEFL/GRE Verbal Score Comparisons

Test (Group)                           Mean    S.D.    Rel    S.E.M.
TOEFL (Nonnatives, N = 186)            523     69      .95    15
GRE-V (Nonnatives, N = 186)            274     67      .78    30
GRE-V (Native Speakers, N = 1,495)     514     128     .94    32

Table 7. TOEFL/SAT and TSWE Score Comparisons

Test (Group)                           Mean    S.D.    Rel    S.E.M.
TOEFL (Nonnatives, N = 210)            502     63      .94    16
SAT-V (Nonnatives, N = 210)            269     67      .77    33
SAT-V (Native Speakers, N = 1,765)     425     106     .91    32
TSWE (Nonnatives, N = 210)             28      8.8     .84    4
TSWE (Native Speakers, N = 1,765)      42.4    11.1    .89    3.7

Comment

A proper interpretation of the findings in Tables 6 and 7 hinges on what is meant by an aptitude test. According to Anastasi (1982), "aptitude tests serve to predict subsequent performance. They are employed to estimate the extent to which the individual will profit from a specified course of training, or to forecast the quality of his or her achievement in a new situation" (p. 393). Coupling this definition with the predictive validity findings on the SAT, GRE, and the TSWE summarized below, it is questionable whether it is indeed appropriate to refer to any of these devices as aptitude tests.

The SAT, which is published by the College Board, is probably the most widely used college admissions test in the United States, with an estimated yearly volume of around one million examinees. The TSWE, which is also published by the College Board, is given in conjunction with the SAT. Whereas the SAT is intended to be used as an admissions test, the TSWE is intended only to help colleges and universities place students in appropriate English composition courses; it is not recommended as an admission instrument. The reason for mentioning this point is that the SAT is typically validated against first year grade point averages, while the TSWE is typically validated against performance in first year English courses.

The evidence reported by the College Board (1997) itself indicates that for native speakers of English, the median correlation between the SAT-V and freshman grade point average, according to curriculum (e.g., business, liberal arts, engineering), is only in the mid to high .30s (p. 5). Similar findings have been reported for the TSWE with regard to English courses. For example, based on a sample of 569 freshmen, Michael and Shaffer (1979) found a correlation of .41 between TSWE scores and first semester grades in an English course emphasizing written expression. In terms of nonnative English speakers, the evidence summarized by the College Board shows that the SAT-V underpredicts freshman grade point average, which means that these students are likely to perform better in university than would be anticipated from their SAT scores. It is true that the findings from both the native and the nonnative English speaking groups are based on students already in university and therefore are likely to underestimate, at least to some extent, the actual relationship between the SAT, the TSWE, and achievement because of the restricted range of scores. Nevertheless, the magnitude of the correlations does raise a serious question about the appropriateness of referring to either test as an aptitude test. This point was brought home in a recent critique of the SAT by Schwartz (1999).

Even the College Board seems uncertain about what the S.A.T. measures. In 1993, the College Board ceased referring to the S.A.T. as an aptitude test and renamed it the Scholastic Assessment Test. That lasted just over a year. Since 1995, it has been referred to simply as the S.A.T., and the College Board now describes it as a test of "developed math and verbal reasoning skills." (p. 35)

The situation is similar for the GRE-V. In terms of foreign students, both Shay (1975; as cited in Hale et al., p. 190) and Kaiser (1986) obtained correlations of less than .20 between scores on the GRE-V and graduate school grade point averages whereas in terms of native English speakers, a recent meta-analysis of 22 studies revealed an average correlation of less than .30 between scores on the GRE-V and graduate school grade point average (Morrison & Morrison, 1995).

In short, given the findings on the relationship between these three tests and their criterion measures coupled with the recent action by the College Board in connection with the SAT, the most that can be said about the differences reported in Tables 6 and 7 is that the meaning of these differences is far from clear.
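Regarding the restriction-of-range caveat raised above, the likely size of such attenuation can be gauged with the standard correction for direct range restriction (often presented as Thorndike's Case 2). The short Python sketch below is purely illustrative and is not drawn from any of the studies cited in this postscript; the function name and all numerical values are our own assumptions.

import math

def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    # Estimate the correlation in the full applicant pool from the correlation
    # observed in a range-restricted (e.g., already-admitted) sample.
    k = sd_unrestricted / sd_restricted
    return (r_restricted * k) / math.sqrt(1.0 - r_restricted**2 + (r_restricted**2) * (k**2))

# Illustrative values only: an observed correlation of .35 in a sample whose
# predictor standard deviation is 80% of the applicant pool's would correspond
# to a correlation of roughly .42 in the unrestricted pool.
print(round(correct_for_range_restriction(0.35, 1.0, 0.8), 2))

As the example suggests, plausible degrees of range restriction raise the correlations somewhat but do not transform correlations in the .30s or .40s into the strong relationships one would expect of an aptitude test.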

Content Validity

Content validity refers to whether experts agree that the items on a test are representative of the domain that the test is said to measure. In the case of the TOEFL, evidence in favor of content validity appears to be of primary importance, as the following passage from page 34 in the 1997 Manual indicates.

Content-related evidence for the TOEFL test is a major concern of the TOEFL Committee of Examiners, which has developed a comprehensive list of specifications for items appearing in the different sections of the test. The specifications identify the aspects of English communication, ability, and proficiency that are to be tested and describe appropriate techniques for testing them. The specifications are continually reviewed and revised as appropriate to ensure that the test reflects both current English usage and current theory as to the nature of second language proficiency.

Comment

The detailed information on test development provided by Peirce (1992), a former employee of the Test Development Department at ETS, clearly illustrates the extreme care with which the items on the TOEFL are constructed and certainly lends credibility to this statement. The issue here, however, is not with the manner in which the items are selected nor with the contention by ETS that experts agree with the appropriateness of the overall content of the TOEFL. Rather, the concern is over the degree of importance to attribute to content validity in the case of a test that is widely used for decision-making purposes. Murphy and Davidshofer (1998) had the following to say about this matter.

It seems clear that content validity is important to understanding test scores. However, there is some controversy over whether content validity can be used to establish the validity of decisions based on test scores. A number of researchers have suggested that a content validity approach might be useful in determining whether specific tests could be used in applications such as personnel selection. The basic argument is as follows: (1) tests are used to predict performance on the job; (2) job performance requires certain abilities and skills; (3) if the tests require the same abilities and skills as those required on the job, then tests could be used to predict job performance; and (4) therefore, the validity of a test for selection decisions can be established by comparing the content of the test with the content of the job. This type of definition of content validity has been widely accepted both by industry and by the federal government. However, most experts agree that content validity is relevant only in determining the validity of measurement (does the test measure what it claims to measure?), not in determining the validity of decisions that are made based on test scores.

Carrier, Dalessio, and Brown (1990) investigated the hypothesis that judgements about the content validity of tests would allow one to assess the validity of those tests as predictors of important criteria (namely, actual on-the-job performance) … They found that expert judgments of content validity were significantly correlated (in a statistical sense) with levels of criterion-related validity, but that these correlations were small. Their results suggest that content-related evidence is useful but not sufficient for assessing the criterion-related validity of psychological tests (p. 154).

Although we were unable to find any investigations of the TOEFL that were comparable to the investigation by Carrier et al. (1990), because the English language is obviously employed in university courses we believe it would be appropriate to use, as a means of evaluating "performance on the job," the predictive validity correlations that have been reported between the TOEFL and overall academic achievement in university. By way of example, the correlations cited below, from a study by Light, Xu, and Mossop (1987), are between TOEFL scores and graduate school grade point averages according to major. It is worth noting that these correlations are similar to the correlations in the majority of studies summarized in Hale et al. at both the undergraduate and graduate levels. It is also worth noting that the TOEFL scores in a number of these studies ranged from approximately the 5th to the 99th percentile. Hence, it is unlikely that the low correlations shown below could have resulted from a restricted TOEFL range. Instead, it would seem that the magnitude of these correlations reflects a genuine lack of any meaningful relationship between TOEFL scores and academic achievement.

Major                                                Correlation
Humanities/Fine Arts                                 .13
Science/Mathematics                                  .04
Social Science                                       .22
Education                                            .30
Business                                             .02
Public Affairs                                       .03
Library Science/Social Welfare/Criminal Justice      .17

Indeed, Light et al. (1987), in commenting on their data, concluded that "merely knowing how a student scored on TOEFL will tell us practically nothing we need to know to predict the student’s academic performance" (p. 255). Many others, when referring to their own data, reached similar conclusions. For example, Ayers and Quattlebaum (1992) stated that "the TOEFL score was not an effective predictor of academic success, as measured by total GPA based on all courses required in the program of study" (p. 974), while Hwang and Dizney (1970) found that whereas "TOEFL is a relatively good predictor of grades in ESL for Chinese graduate students … its use to predict the academic success of Chinese graduate students is doubtful" (pp. 476-477). Hence, it would certainly seem that the comments by Murphy and Davidshofer on the findings by Carrier et al. (1990) apply not only to the relationship between expert opinion and items on personnel selection tests but also to the relationship between expert opinion and the TOEFL.

Finally, and also in terms of decision-making, it will be recalled that by 1995 ten Ontario universities had raised their undergraduate cutoffs from 550 to as high as 600. As mentioned above, this means that whereas previously the top 30% of applicants would be considered for admission, now only the top 10% are likely to be considered. To determine the net gain associated with decision making using this elevated cutoff score, a common procedure is to refer to the tables developed by Taylor and Russell, which take into account the combined effects of the predictive validity of a test and the percentage of applicants whose scores exceed a given cutoff (Anastasi & Urbina, 1997; Murphy & Davidshofer, 1998). Since the predictive validity of the TOEFL, as determined by the relationship between TOEFL scores and overall academic achievement, is in the vicinity of .20, and since 30% of the applicants exceeded the original cutoff whereas now only 10% exceed the new cutoff, the Taylor-Russell tables suggest that the overall net gain in the percentage of applicants who are likely to be successful in university as a result of having raised the cutoff from 550 to 600 is only in the neighborhood of 4%. Hence, it would seem that this gain of 4% was accomplished at a considerable cost, namely, the rejection of an additional 20% of those applicants who score at the top end of the TOEFL scale and for whom the probability of graduating from university is extremely high (for evidence on this point see Simner, 1995).
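Readers who wish to check this sort of calculation can reproduce the Taylor-Russell figures, at least approximately, from the bivariate normal model that underlies the published tables. The Python sketch below is a minimal illustration: the validity coefficient of .20 and the selection ratios of 30% and 10% come from the discussion above, whereas the base rate of .70 is an assumed value, since the actual proportion of applicants who would succeed if all were admitted is not reported here.

from scipy.stats import norm, multivariate_normal

def success_rate(validity, selection_ratio, base_rate):
    # P(successful | selected) under the bivariate normal model assumed by the
    # Taylor-Russell tables.
    x_cut = norm.ppf(1.0 - selection_ratio)   # predictor (TOEFL) cutoff, in z units
    y_cut = norm.ppf(1.0 - base_rate)         # criterion ("success") cutoff, in z units
    joint = multivariate_normal(mean=[0.0, 0.0],
                                cov=[[1.0, validity], [validity, 1.0]])
    # P(X > x_cut and Y > y_cut) by inclusion-exclusion on the joint CDF
    p_selected_and_successful = (1.0 - norm.cdf(x_cut) - norm.cdf(y_cut)
                                 + joint.cdf([x_cut, y_cut]))
    return p_selected_and_successful / selection_ratio

validity, base_rate = 0.20, 0.70              # base rate is an assumed value
old = success_rate(validity, 0.30, base_rate) # cutoff of 550: top 30% eligible
new = success_rate(validity, 0.10, base_rate) # cutoff of 600: top 10% eligible
print(round(old, 2), round(new, 2), round(new - old, 2))

With these assumed values the computed gain is on the order of three to four percentage points, in line with the figure of roughly 4% cited above.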

Conclusion

As mentioned in the preface, ETS has repeatedly advised users not to employ the TOEFL as the sole criterion for university admission. The basis for this advice is the weak relationship between TOEFL scores and academic achievement. In light of the foregoing critique, which raises a number of questions concerning both the meaningfulness and the magnitude of the relationship between the TOEFL and the various measures of English language proficiency to which it has been compared, there appears to be little reason to challenge this advice or to modify the stand taken by CPA. In other words, regardless of whether the TOEFL is used as an admission standard to predict academic achievement or to judge English language proficiency, it is inappropriate to use the TOEFL for decision making purposes and to reject an applicant on the basis of the applicant’s TOEFL performance without regard to the applicant’s other qualifications. Instead, users of the TOEFL would do well to heed the further advice by ETS and

base the evaluation of an applicant’s readiness to begin academic work on all available relevant information, not solely on TOEFL test scores;

not use rigid cutoff scores to evaluate an applicant’s performance on the TOEFL test; (and)

consider TOEFL section scores as well as total scores;

consider the kinds and levels of English proficiency required in different fields and levels of study and the resources available at the institution for improving the English language skills of nonnative speakers. (TOEFL Test and Score Manual, 1997, p. 26)

The implementation of these recommendations is perhaps best illustrated in the following material from the MELAB Technical Manual (1994), which describes how the MELAB and the TOEFL are employed at the University of Michigan.

Applicants are considered for admission to undergraduate study with Final MELAB scores above 80 and all part scores at 80 or above, or with TOEFL scores above 560 and all section scores at 56 or above. Rigid cut scores are not applied. All relevant information about English language proficiency is used by admissions staff so the policy is applied in a flexible manner. Applicants whose Final MELAB or part scores are 85 or lower, or whose TOEFL scores are 600 or lower or whose section scores are 60 or lower (and also those without a Test of Written English score of at least 5.0) are generally required to have their English language proficiency reevaluated upon arrival. As a result of this on-campus testing, students may be required to take an EAP (English for Academic Purposes) course. Typically, at UM about half of the entering undergraduates are exempted from English language work. The typical requirement for the other half is one EAP mini-course, usually a writing course that meets two hours a week. The EAP writing course must be taken before the student enrolls in a regular university composition course. EAP courses are taken concurrently with other academic course work. (p. 14)


End notes

1Of the ten Ontario universities that had cutoffs between 580 and 600 in 1995, according to Kelly (1998, pp. 142-145), six had either altered or were in the process of considering altering their cutoffs by 1998. For example, whereas in 1995 the University of Ottawa had minimum cutoffs that ranged from 580-600 for all programs except Engineering and Science (which employed 550), in 1998 consideration was given to applicants with scores as low as 500. Similarly, York University had lowered its cutoff from 580 to 560, King’s College (an affiliate of the University of Western Ontario) had reduced its cutoff from 580 to 550 and the University of Guelph was considering lowering its cutoff from 600 to 550 as long as applicants with scores as low as 550 could provide appropriate documentation of a reasonable command of English. Elsewhere in Canada changes were also being suggested. For instance, the Senate Admissions Committee at the University of British Columbia recently recommended reducing the cutoff from 580 to 550 (Senate of the University of British Columbia, Minutes of December 16, 1998).

In addition to these alterations other modifications in the TOEFL admission requirements were noted in the material summarized in Kelly. For example, whereas Brock University had stated in 1995 that the TOEFL was mandatory for all nonnative English speaking students, in 1998 exceptions were made for applicants with a grade of 70% or higher in an English OAC 1 course or in a Grade 12 English course completed outside Ontario. Similarly, Laurentian University by 1998 had reduced its minimum Canadian residency requirement from 5 years to 4 years for nonnative English speaking applicants who wished to avoid taking the TOEFL.

2For additional evidence that many monolingual adults are able to achieve reasonable competence in a second language, see Bialystok (1997, pp. 125-126) as well as Snow and Hoefnagel-Hohle (1978).

3The following passage from a widely used study guide intended to help nonnative English speakers pass the TOEFL no doubt expresses the anxiety and concern felt by many who confront the TOEFL as a standard for university admission.

You are well aware that the TOEFL is one of the most important examinations that you will ever take. Your entire future may well depend on your performance on the TOEFL. The results of this test will determine, in great measure, whether you will be admitted to the school of your choice. (Babin, Cordes, & Nichols, 1987, p. 3)

4Because Canada is officially bilingual, some Canadian universities waive the TOEFL admissions requirement for Francophone Canadian citizens educated in Canada. This means, of course, that in the case of these universities a non-Canadian Francophone immigrant from, say, Belgium, France, or Switzerland may be required to take the TOEFL while their Canadian counterpart would not. Whether such action is legally defensible, however, is uncertain because it could be viewed as discriminatory and therefore in violation of the various provincial Human Rights Codes, which prohibit discrimination in educational settings on grounds of citizenship as well as country of origin.


References

Anastasi, A. (1982). Psychological Testing (5th ed.). New York, NY: Macmillan.

Anastasi, A., & Urbina, S. (1997). Psychological Testing (7th ed.). Toronto, ON: Prentice Hall Canada.

Ayers, J.B., & Quattlebaum, R.F. (1992). TOEFL performance and success in a masters program in engineering. Educational and Psychological Measurement, 52, 973-975.

Babin, E.H., Cordes, C.V., & Nichols, H.H. (1987). TOEFL: Test of English as a Foreign Language. New York, NY: Arco.

Berkowitz, P. (1998). The use and abuse of English language tests. University Affairs, January, 12-13.

Bialystok, E. (1997). The structure of age: in search of barriers to second language acquisition. Second Language Research, 13, 116-137.

Carrier, M.R., Dalessio, A.T., & Brown, S.H. (1990). Correspondence between estimates of content and criterion-related validity values. Personnel Psychology, 43, 85-100.

College Board. (1997). Common sense about SAT score differences and test validity. Research Notes, RN-01, June, 1-12.

Hale, G., Stansfield, C.W., & Duran, R.P. (1984). Summaries of studies involving the Test of English as a Foreign Language, 1963-1982. Research Reports (Report 16). Princeton, NJ: Educational Testing Service.

Henning, G., & Cascallar, E. (1992). A preliminary study of the nature of communicative competence (TOEFL Research Report 36). Princeton, NJ: Educational Testing Service.

Hwang, K-Y., & Dizney, H.F. (1970). Predictive validity of the Test of English as a Foreign Language for Chinese graduate students at an American University. Educational and Psychological Measurement, 30, 475-477.

Kaiser, J. (1986). The validity of the GRE aptitude test for foreign students. College Student Journal, 20, 403-410.

Kelly, B. (1998). (Ed.). English Language Requirements. INFO: The Guide to Ontario Universities for Secondary School Students. (Fall issue, No. 55). Guelph, ON: Ontario Universities’ Application Centre.

Light, R.L., Xu, M., & Mossop, J. (1987). English proficiency and academic performance of international students. TESOL Quarterly, 21, 251-261.

Michael, W.B., & Shaffer, P. (1979). A comparison of the validity of the Test of Standard Written English (TSWE) and of the California State University and Colleges English Placement Test (CSUC-EPT) in the prediction of grades in a basic English composition course and of overall freshman-year grade point average. Educational and Psychological Measurement, 39, 131-145.

Morrison, T., & Morrison, M. (1995). A meta-analytic assessment of the predictive validity of the quantitative and verbal components of the Graduate Record Examination with graduate grade point average representing the criterion of graduate success. Educational and Psychological Measurement, 55, 309-316.

Murphy, K.R., & Davidshofer, C.O. (1998). Psychological Testing: Principles and Applications (4th ed.). Upper Saddle River, NJ: Prentice Hall.

Nunnally, J.C., & Bernstein, I.H. (1994). Psychometric Theory (3rd ed.). Montreal, QC: McGraw-Hill.

Palmer, L.A., & Woodford, P.E. (1978). English tests: Their credibility in foreign student admissions. College and University, 53, 500-510.

Peirce, B.N. (1992). Demystifying the TOEFL reading test. TESOL Quarterly, 26, 665-691.

Schwartz, T. (1999, January 10). The test under stress. The New York Times Magazine, 30-35, 51, 56, 63.

Simner, M.L. (1995). Interim report to the ad hoc subcommittee on English Language Proficiency for Admission. Report submitted to the Senate, University of Western Ontario, December 7, 1995, Exhibit II, Item 3, Appendix 1.

Simner, M.L. (1998). Use of the TOEFL as a standard for university admission: A position statement by the Canadian Psychological Association. European Journal of Psychological Assessment, 14, 261-265.

Snow, C.E., & Hoefnagel-Hohle, M. (1978). The critical period for language acquisition: Evidence from second language learning. Child Development, 49, 1114-1128.

Ward, A.W., Stoker, H.W., & Murray-Ward, M. (1996). Educational Measurement: Origins, Theories and Explications (Vol. 1). New York, NY: University Press of America.

Acknowledgement

The author acknowledges with appreciation the many valuable suggestions offered by the following members of the Working Group on an earlier draft of this postscript: Sampo Paunonen, Nicholas Skinner, P. Anthony Vernon.

 
