Response to Mr. Maxim's Mega Test Renorming

Darryl Miyaguchi's Comments on Paul Maxim's
"Renorming Ron Hoeflin's Mega Test"

References:
1. Ron Hoeflin's "Sixth Norming of the Mega Test."
2. Paul Maxim's "Renorming Ron Hoeflin's Mega Test."
3. Fred Vaughan's "Intelligence Filters." (appears in the online version of Gift of Fire)
4. Kevin Langdon's Reply to Paul Maxim. (Gift of Fire no. 81, republished and revised here)
5. Darryl Miyaguchi's Generic I.Q. Chart
6. SAT percentile ranks table

Introduction

Mr. Maxim writes that the norming procedure used by Ron Hoeflin to norm his Mega Test does not produce the claimed results for the 4-sigma and 4.75-sigma levels (cutoff points for the Prometheus Society and Mega Society, respectively). In his sixth norming of the Mega Test, Dr. Hoeflin placed the 4-sigma level at a Mega Test raw score of 36, and the 4.75-sigma level at a Mega Test raw score of 43. Mr. Maxim says these raw scores should be 39 and 46, based upon a replication of Dr. Hoeflin's own norming process.

I do not think that Mr. Maxim has replicated Dr. Hoeflin's norming process as he claims, in at least one important respect: the determination of the SAT score that represents the 4-sigma level in the general population. Also, in arguing his case, Mr. Maxim fails to account for the fact that the distribution of testees taking unsupervised, difficult tests such as the Mega or Kevin Langdon's Adult Intelligence Test (LAIT) is not Gaussian. In practice, many more Mega and LAIT testees score at the highest levels than would be expected (assuming a Gaussian distribution) from the number of scores at lower levels.

Kevin Langdon, in rebutting and criticizing Mr. Maxim's methods, indirectly criticizes Dr. Hoeflin's methods, since much of what Mr. Maxim describes actually does follow the same procedures Dr. Hoeflin used. With regard to using SAT data to represent IQ's in the general population up to the 4-sigma level, I favor Ron Hoeflin's interpretation over Kevin Langdon's.

Mr. Maxim's Statement of Purpose

Mr. Maxim writes:

"My renorming does not attempt to theoretically analyze that [Mega Test] norming process [as described in Dr. Hoeflin's letter to Mr. Maxim of December 19, 1995], but rather replicates it, using the same data Dr. Hoeflin used; hence, it constitutes an audit of his results."

SAT

The factor-of-3 shift

The first place to look for a discrepancy between Dr. Hoeflin's norming method and Mr. Maxim's replication is with the SAT, which represents the largest sample of previous scores. Since Dr. Hoeflin wants to use the SAT testees and scores as a representation of the general population, he adjusts for the slightly greater intelligence of the SAT testees by making a factor-of-3 shift in their percentiles. The way this is done can be illustrated most simply with an example. Suppose that we want to figure out what percentile ranking in the SAT group corresponds to a 99.997th percentile ranking in the general population. To use the factor-of-3 shift, first convert 99.997 into a "rarity" number (i.e., roughly one person in 30,000 people in the general population is expected to be in the 99.997th percentile or above). Apply the factor-of-3 adjustment to the rarity to get the equivalent rarity expected in the SAT group, namely one person in 10,000. This rarity corresponds to a percentile ranking of about 99.99. As a second example, the 99th percentile in the general population can similarly be calculated to be roughly equivalent to the 97th percentile in the SAT group. [For a convenient, generic conversion of IQ's to percentiles and rarities, or vice-versa, see my "Generic I.Q. Chart."]
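The rarity arithmetic above can be sketched in a few lines of Python. This is only an illustration of the shift as described here, not Dr. Hoeflin's own worksheet; the function name and the `shift` parameter are hypothetical labels. Note that the text rounds the rarity to one in 30,000 before dividing, so it arrives at about 99.99 where the unrounded calculation gives 99.991.

```python
def general_to_sat_percentile(general_pct, shift=3.0):
    """Convert a general-population percentile to the equivalent SAT-group percentile."""
    top_fraction = (100.0 - general_pct) / 100.0  # share of people at or above this level
    rarity = 1.0 / top_fraction                   # "one person in N" in the general population
    sat_rarity = rarity / shift                   # assumed: SAT takers are 3x as dense at this level
    return 100.0 * (1.0 - 1.0 / sat_rarity)

four_sigma_sat = general_to_sat_percentile(99.997)
ninety_ninth_sat = general_to_sat_percentile(99.0)
print(round(four_sigma_sat, 3))    # 99.991
print(round(ninety_ninth_sat, 1))  # 97.0
```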

Mr. Maxim did not apply this factor-of-3 shift. Instead, he took (100 - percentile) and multiplied by 4/3 to calculate the equivalent SAT percentile. In our first example, the 99.997th percentile constitutes the top 0.003% of the general population. Mr. Maxim multiplied 0.003% by 4/3 to calculate the equivalent SAT number: the top 0.004%, or the 99.996th percentile. Our second example of the 99th percentile in the general population would produce an equivalent percentile of 98.7 for the SAT group using Mr. Maxim's method. Clearly, this is not the method that Dr. Hoeflin used in his sixth norming.
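To see how far apart the two adjustments land, both can be written as a multiplier on the top share (100 minus the percentile); multiplying the share by 3 is the same as dividing the rarity by 3. A minimal sketch, with hypothetical names:

```python
def shift_percentile(general_pct, factor):
    """Scale the top share (100 - percentile) by `factor` and convert back to a percentile."""
    return 100.0 - (100.0 - general_pct) * factor

HOEFLIN_FACTOR = 3.0      # the factor-of-3 shift
MAXIM_FACTOR = 4.0 / 3.0  # the 4/3 multiplier attributed to Mr. Maxim in the text

for pct in (99.997, 99.0):
    hoeflin = shift_percentile(pct, HOEFLIN_FACTOR)
    maxim = shift_percentile(pct, MAXIM_FACTOR)
    print(f"{pct}: factor 3 -> {hoeflin:.3f}, factor 4/3 -> {maxim:.3f}")
```

With these inputs the factor of 3 gives 99.991 and 97.000, while the factor of 4/3 gives 99.996 and 98.667, matching the figures quoted in the text after rounding.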

The result of applying Mr. Maxim's different adjustment factor is to decrease the number of SAT testees who are assigned to the 4-sigma level: Dr. Hoeflin estimated that a combined verbal + math SAT score of 1550 represents the 4-sigma level; Mr. Maxim comes up with 1565. I would note that Kjeld Hvatum, in In-Genius #15 (the journal of the Top One Percent Society), independently corroborated Dr. Hoeflin's factor-of-3 correction. He writes:

"Incidentally, the PSAT/NMSQT data provides a way to estimate the selectivity of the SAT takers at various levels, because the PSAT is more of a "forced" test in many schools, and the PSAT and SAT scales are equated (via a factor of 10). The ETS provides PSAT estimates "that would be obtained if ALL students at these grade levels took the test." A quick check indicates a factor of 3 is approximately the selectivity at the higher score levels for the SAT."

Kjeld Hvatum's "Selectivity by I.Q." chart, which Mr. Maxim references, shows that the 4-sigma level in the general population (99.997th percentile) is achieved with an SAT score of 1550, the same number that Dr. Hoeflin arrived at.

Kevin Langdon discounts the use of SAT data to determine the 4-sigma level

Kevin Langdon, in Gift of Fire #81, criticizes Mr. Maxim for employing the SAT data to derive equivalent 4-sigma Mega Test scores. Really, this is a criticism of Ron Hoeflin's technique. Kevin Langdon writes:

"Mr. Maxim leaned heavily in his analysis on data from the Scholastic Aptitude Test, while ignoring the fact that the four-sigma level on the SAT is well above the test ceiling. ... To put it another way, the SAT fails to discriminate among approximately the top .02 percent of the general population."

But Ron Hoeflin disagrees, at least up to the 4-sigma level:

"These weighted averages [of Mega score equivalents of standard IQ tests] differ from the SAT-based results reported in table T3 by less than one Mega Test raw score point at each of the twelve sigma levels from 1.25 to 4.00, overall, the SAT-based results averaging just one-sixth of a point higher than the weighted averages from the other five tests. But at 4.25 sigmas the results differ by 2.4 Mega Test raw score points, which suggests that the data from these tests is becoming too unreliable to be trusted at any higher levels."

What Dr. Hoeflin did, as is apparent from his statement above and from Kjeld Hvatum's chart, was to discount SAT scores within half a sigma of the ceiling, which, according to Kjeld Hvatum's chart, is about 4.75 sigma. In OATH #10, Dr. Hoeflin concurs with this estimate:

"According to standard statistical tables, the 99.9999 percentile or one-in-a-million level corresponds to 4.7534 S.D. or 176 I.Q. This would correspond to a pseudo-I.Q. of 169 or 4.31 S.D. or the 99.9992 percentile for a sample of college-bound students. Statistical charts available from the Educational Testing Service put this percentile equal to a combined verbal + math aptitude score on the S.A.T. of about 1600, which is the maximum possible score. To be precise, of about 5 million college bound students who took the S.A.T. in the 5-year period 1984 through 1988, exactly 35 students got perfect S.A.T. scores of 1600, which equates with the 99.9993 percentile."

Lacking the data that Kevin Langdon uses to support his claim that ceiling bump prevents SAT data from being a useful discriminator above the 99.98th percentile (about 3.5 sigma), I would have to side with Dr. Hoeflin on the issue of the validity of using SAT data to represent 4-sigma levels in the general population.
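The figures in the OATH #10 passage quoted above can be checked with Python's standard library. The sketch below assumes a normal distribution and an I.Q. scale with mean 100 and standard deviation 16; the standard deviation is inferred from the quoted pairing of 4.7534 S.D. with 176 I.Q. and is not stated in the passage. The variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# One-in-a-million level in the general population.
z_million = std_normal.inv_cdf(1.0 - 1e-6)
iq_million = 100 + 16 * z_million  # assumes S.D. 16 (inferred, not stated in the passage)

# Percentile implied by 4.31 S.D. within the college-bound sample.
pct_college = 100 * std_normal.cdf(4.31)

# 35 perfect scores among about 5 million SAT takers.
pct_perfect = 100 * (1 - 35 / 5_000_000)

print(f"{z_million:.4f} {iq_million:.0f} {pct_college:.4f} {pct_perfect:.4f}")
```

Under these assumptions the four values agree with the quoted 4.7534 S.D., 176 I.Q., 99.9992 percentile, and 99.9993 percentile to the precision given.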

Another of Kevin Langdon's criticisms that targets both Mr. Maxim and Dr. Hoeflin concerns the issue of sample size, which Mr. Langdon believes is too low. In his sixth norming of the Mega Test, Dr. Hoeflin uses 222 SAT / Mega Test score pairs, 10 of which he believed to be at or above the 4-sigma level. Mr. Maxim argued that only 6 score pairs qualified at the 4-sigma level (since he differed with Dr. Hoeflin in his estimate of the SAT's 4-sigma level). Mr. Langdon writes:

"Furthermore, Mr. Maxim is drawing conclusions based on only ten Mega data points, too small a sample to be statistically meaningful, and the discrepancy between the score Mr. Maxim believes should represent the four-sigma level and the score used by Dr. Hoeflin is trivial, amounting to only 1.5% of the test range."

This statement disputes the usefulness of the Mega test in discriminating near the 4-sigma level, if the difference between a Mega raw score of 36 and 39 amounts to statistical "noise." Personally, I would like to have seen maybe 30 data points, but the large base of test scores required to produce such a number at the 4-sigma level is unlikely to be attained during the useful lifetime of this test.

GRE

What should the shift for the GRE be?

Mr. Maxim writes that the "data emanating from ETS indicates that this [the 4-sigma level] should be 1620." Dr. Hoeflin does not attempt to calculate the 4-sigma level for GRE scores. I suspect that there are at least two reasons why he does not do so, one of which is mentioned by Mr. Maxim -- the problem of ETS lowering the maximum raw score in the late 70's from 1800 to 1600. The second problem arises when one tries to equate GRE percentiles with general population percentiles. Mr. Maxim does not mention how he does this, and unless the ETS performed a norming study of the GRE on the national population, I would guess that Mr. Maxim did not correct for the fact that GRE testees are more intelligent than the general population.

LAIT

Mr. Maxim does not account for the "self-selection" of highly intelligent individuals.

Mr. Maxim states that the LAIT produces higher-than-expected scores when compared with the standard tests. This may or may not be the case, especially if many of the reported LAIT scores were from the first norming, which Dr. Hoeflin has noted produced higher IQ's at the upper levels (by about 5 IQ points) than his second norming. This would be a legitimate cause for distrusting LAIT scores as a basis for norming the Mega Test. A second argument that Mr. Maxim uses against the LAIT, however, I find to be flawed:

"To begin with, the incidence of putative "4-sigma" scores in the LAIT sample (16 of 77) is far higher than that of any other sample, and in percentage terms (20.78%) is over six thousand times higher than the incidence of "4-sigma" in the general population (.00333%). This stems directly from the fact that LAIT is an inflationary test, and generated numerous invalid "4-sigma" scores. This distortion may also be noted in the fact that, in order for 20.78% of any sample to score at the "4-sigma" level or above, the entire sample would require a mean IQ over 3.2 sigma -- that is, over IQ 152, which is about ten points higher than the general profile for MEGA testees, and for LAIT testees as well."

What Mr. Maxim does not consider is the phenomenon of "self-selection;" that is, the higher one's IQ, the more likely (s)he is to have taken the LAIT or Mega tests. The intuitive argument that supports this observation is that the most intelligent among us are more likely to be aware that such tests exist, and in addition, are more likely to submit answers. That Mr. Maxim is assuming a Gaussian distribution is apparent from his calculation of the mean IQ of the LAIT testees: if the top 21% in a Gaussian distribution scored at the 4-sigma level or above, the mean would lie about 0.8 sigma below that at 3.2 sigma -- hence his statement that the LAIT testees have a mean IQ of about 152. I would be interested in finding out what the actual mean turned out to be for the 76 people who submitted LAIT scores. My guess is that it's even higher than Mr. Maxim estimated, and that this is to be expected because the LAIT and Mega distributions are naturally "top-heavy," and not necessarily because they are inflationary tests with respect to the standard tests (although that possibility still exists). Dr. Hoeflin, in order to extrapolate his SAT data to the 4.75 sigma level, empirically determined the curve of observed-to-expected Mega Test raw scores at various percentile levels. This, as far as I know, was the raison d'être of the sixth norming of the Mega test, and its implications appear to have been ignored by Mr. Maxim.
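The Gaussian reasoning attributed to Mr. Maxim can be reproduced as follows. This is a sketch under the assumption that the LAIT sample is normally distributed with the same spread as the general population, which is exactly the assumption questioned above; the variable names are illustrative.

```python
from statistics import NormalDist

four_sigma_count, sample_size = 16, 77      # figures from Mr. Maxim's quoted passage
top_share = four_sigma_count / sample_size  # about 0.2078

# In a Gaussian sample, how many sigma above the mean does the top 20.78% begin?
cut_above_mean = NormalDist().inv_cdf(1.0 - top_share)

# If that cut sits at 4 sigma, the sample mean sits this far above the population mean.
implied_mean_sigma = 4.0 - cut_above_mean

print(f"cut is {cut_above_mean:.2f} sigma above the sample mean; "
      f"implied sample mean at {implied_mean_sigma:.2f} sigma")
```

The cut comes out at roughly 0.8 sigma above the sample mean, which puts the implied mean near 3.2 sigma, as in the calculation described above. A top-heavy, non-Gaussian sample would not be bound by this relationship.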

CTTM

I cannot explain the discrepancy that Mr. Maxim finds in Dr. Hoeflin's determination of the Mega test raw-score equivalent to the CTTM 4-sigma level. Kevin Langdon notes that the ceiling of the CTTM is 158, but the scores shown in Dr. Hoeflin's data (IQ's 179 and 174) seem to indicate otherwise.

Last words from Kevin Langdon

"It should be noted that Mr. Maxim did not calculate means, standard deviations, or average deviations of Mega raw scores and previous scores, did not calculate a measure of reliability for the test, did not estimate the standard error of test scores, and did not provide a formula for converting raw scores to I.Q.'s, despite his objection to my failure to explicitly state the conversion formula for calculating LAIT I.Q.'s from scaled scores in the statistical report on the LAIT.

Mr. Maxim has demonstrated no command of the basic principles of psychometric statistics. His argument in this article is entirely self-serving and without scientific merit."

Mr. Maxim did not publish things like the Mega Test's reliability or the standard error of test scores, but then again, neither has Dr. Hoeflin, to my knowledge. This criticism seems somewhat misplaced given Mr. Maxim's stated purpose of replicating Dr. Hoeflin's procedures.

Summary

There is at least one important place where I believe Mr. Maxim has erred in his replication of Dr. Hoeflin's norming of the Mega Test, specifically in his determination of the 4-sigma level for the SAT; it's possible that he has also erred in that same determination for the GRE. His statement that the LAIT has inflated the Mega norming may have some merit, although one of the arguments used to support this position is flawed. Kevin Langdon appears to have used this opportunity to criticize Ron Hoeflin's norming of the Mega Test, via his criticism of Paul Maxim's article; I have not seen data to bolster Mr. Langdon's contention that the SAT is not a useful discriminator beyond the 99.98th percentile of the general population.

Addendum (11/8/97): I now realize that it isn't necessary to convert percentiles to rarity numbers to apply the factor-of-3 shift. I have provided a comparative table of SAT scores showing both "college-bound seniors" and a "national sample of high school seniors," which provides supporting data for the assumption of a factor-of-3 shift.
