ABSTRACT
The formal licensing of physicians in the United States began with the 1889 Supreme Court decision Dent v. West Virginia. From that time forward, tests, in one form or another, have played a crucial role in medical licensing. In this essay we trace the history of testing from its beginnings in Xia dynasty China, 4,000 years ago, through its adoption for the Indian civil service system by the British Raj, and finally ending with the 1992 introduction of the modern United States Medical Licensing Examination (USMLE). The focus here is on the most important development in testing since the Jesuits introduced written exams to the West in 1599: the substitution of a large number of objectively scored multiple-choice exam questions for a relatively small number of essays or interview questions. This approach increased the reliability and validity of scores, broadened the number of topics that could be addressed, diminished the cost of the exam, allowed results to be calculated almost instantly, and, through the use of computerized test administration, provided the opportunity for tests to be individually tailored to each examinee while maintaining comparability of scores across all examinees.
Keywords:
- Xia Dynasty
- NBME
- USMLE
- Shibboleth
- Medical Licensure
- Dent v. West Virginia
- Jesuits
- Adaptive Testing
- Multiple-choice exams
- John P. Hubbard
Introduction
On May 5, 2015, two important anniversaries in the history of testing will be celebrated. The one of immediate interest is the centenary of the events that gave rise to the National Board of Medical Examiners and the modern era of the testing component in medical licensing. Serendipitously, exactly three thousand nine hundred ninety-nine years previously (on Wu Yue Wu Hao in the 85th year of the Xia Dynasty, May 5th, 1985 BCE) the signal event occurred that started testing. It was on this date that an anonymous functionary, tasked by his emperor to find a path to improve the way that government officials were chosen, was struck by what remains the fundamental tenet of testing: a small sample of behavior, measured under controlled circumstances, can be predictive of a broader set of behaviors in uncontrolled circumstances.
This was the intellectual beginning of a program of testing in China that has continued, with only one minor interruption,1 until the present day. The Chinese testing program imagined by that functionary, his very bones now long dust, was essentially a civil service examination program, and many of its performance-based components and procedures bore a remarkable resemblance to the earliest medical credentialing exams in the United States.
The Chinese tests were designed to cull candidates for public office; job-sample tests were used, assessing proficiency in archery, arithmetic, horsemanship, music, and writing, as well as skill in the rites and ceremonies of public and social life. The testing procedures they instituted closely resemble those in use today. For example, because they required objectivity, candidates' names were concealed to ensure anonymity; the examiners sometimes went so far as to have the answers redrafted by a scribe to hide the handwriting. Tests were often read by two independent examiners, with a third brought in to adjudicate differences. Test conditions were as uniform as could be managed: proctors watched over the exams, which were given in special examination halls, large, permanent structures consisting of hundreds of small cells. The examination procedures were so rigorous that candidates sometimes died during the course of the exam.2
The pathway connecting ancient China to 20th century America is remarkably direct. The Chinese testing program became the model that the British used in their design of the Indian Civil Service Exam system, installed in 1833 during the Raj. This, in turn, was the template that Senator Charles Sumner and Representative Thomas Jenckes used in designing the Civil Service Act passed by the U.S. Congress in January 1883.3
Tracing the Shibboleth
Despite this clear historical pathway, testing did not jump directly from China to India. There is overwhelming evidence that this fundamental idea spread inexorably throughout the known world. About 600 years after the Chinese testing program began we know that it had spread at least as far as the Middle East.
In Judges 12:4–6 we are told how, after the Gileadites captured the fords of the Jordan leading to Ephraim, they developed a one-item test to determine the tribe to which the survivors of the battle belonged. ‘If a survivor of Ephraim said, “Let me cross over,” the men of Gilead would ask him, “Are you an Ephraimite?” If he replied, “No,” they said, “All right, say Shibboleth.” If he said, “Sibboleth,” because he could not pronounce the word correctly, they seized and killed him at the fords of the Jordan.’
Forty-two thousand Ephraimites were killed at that time. This total might have included some Gileadites with a lisp, but we will never know, for there is no record of any validity study being performed. The need for such studies was not made explicit until the end of the 16th century when the Jesuits published their famous “11 rules” for exams that are in spirit essentially identical to those we follow today (see appendix).
19th century medical training in the United States followed the same sort of apprenticeship model that was used by carpenters, plumbers, and other skilled trades. The Shibboleth that allowed physicians to practice was a combination of the endorsement of their master and some sort of licensure procedure established by the areas in which they intended to practice. In the early 1880s, West Virginia became the first state to enact and implement a genuinely restrictive medical licensing law. Challenges to this law eventually reached the U.S. Supreme Court, whose 1889 decision (Dent v. West Virginia) upheld the licensing law and confirmed a state's right to regulate the practice of medicine. Thus began state licensing of physicians in the United States.4 When a formal exam was administered, it was typically oral, and neither its content nor its scoring was objective and standardized. Marmaduke Dent, the attorney representing his cousin Frank before the U.S. Supreme Court, challenged the West Virginia statute on these very grounds. Dent asserted (and the court agreed) that there were “no objective criteria against which to measure the board's completely subjective assessment of what distinguished a passing performance from a failing performance.” And even if the board members agreed upon appropriate standards, “the standards could not be consistently applied when every candidate was examined separately and every examination was different.”5
The shift to a more rigorous and standardized approach in our national licensing program had its formal beginnings on May 5, 1914, in the Willard Hotel in Washington, D.C.6
The testing procedures initially adopted for medical licensure have a long and storied history. University exams, begun at the University of Bologna in 1219 and continued in 1257 by Robert de Sorbon, chaplain of Louis IX, in the community of scholars at what would evolve into the Sorbonne, were entirely oral. Written exams did not appear until the end of the 16th century, with the work of the Jesuits mentioned earlier.7
The need for a second watch
The expression “A man with one watch knows what time it is, but a man with two watches is never sure” helps us understand the evolution of testing. Newton gave the world its first watch, and for a while we knew the time, but eventually Einstein and Heisenberg gave us a second watch and we haven't been sure since. Sometimes this is interpreted as an argument for ignorance, which is quite the opposite of my point, for science advances when we have some notion of our own uncertainty.
Written exams required an expert grader to assess the quality of the answer, and the examinees' scores were typically a summary of those assessments. We can only imagine the chain of events that led the West to revert to the Chinese practice of having multiple graders. Perhaps some examinees complained about their scores and when they were rescored a different result ensued; or accidentally some exams were scored twice and it was discovered that the results were not the same. We don't know what the key motivating events were, but certainly it was clear by the beginning of the 20th century that fair scoring of exams required a second watch.
In 1909 the College Board, through its fledgling college entrance exams, made a remarkable discovery. It found that the variation observed in the scoring of a single essay over many graders was about the same as the variation in scores across many examinees. It concluded that this was unacceptably inaccurate and that graders needed to be trained better. Almost a century later, in a study of licensing exam results for California teachers, the renowned psychometrician Darrell Bock reported that “the variance component due to raters was equal to the variance component due to examinees.”8 It may be that this century-old problem is still due to insufficient training of graders, but more likely it is the subjectivity inherent in the task of grading itself.
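The practical import of such a finding can be seen with a little classical test theory arithmetic: when the rater variance component equals the examinee variance component, only half of a single-rater score's variance reflects the examinee at all. The sketch below is purely illustrative; the variance figures are invented, not drawn from the cited studies.

```python
# Reliability of a rated score under classical true score theory:
# the share of observed-score variance attributable to real examinee
# differences. Averaging over n_raters shrinks the rater component.
def reliability(var_examinee: float, var_rater: float, n_raters: int = 1) -> float:
    return var_examinee / (var_examinee + var_rater / n_raters)

# Equal variance components (the situation Bock reported): a single
# rater yields reliability 0.5 -- half of what we see in the score
# is the grader, not the examinee.
print(reliability(1.0, 1.0, n_raters=1))  # 0.5
# Averaging two independent raters helps, but only to 2/3.
print(reliability(1.0, 1.0, n_raters=2))
```

Averaging independent raters helps, but slowly: with equal components, even two raters still leave a third of the observed variance as grader noise.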
As testing became more widespread, and as, with experience, our eyes became accustomed to the Byzantine dimness surrounding its use, it became clear that the necessity of multiple graders added costs to testing. But balanced against these costs was the realization of how much testing improved manpower utilization. The spreading belief that expanding testing beyond its relatively narrow confines would improve industrial efficiency led to attempts to streamline the practices of testing without compromising its efficacy.
One of the instigators of this movement was the work of Alfred Binet and Théodore Simon who, in 1905, published their eponymous test to measure the intelligence of children. Though the test was cumbersome to administer, it was wildly successful.9 A decade later this success led Stanford University's Lewis Terman to develop a less cumbersome version that could be both mass administered and objectively scored.10 Terman's success, coupled with the need for the efficient classification of soldiers for World War I, drove the formation of Robert Yerkes' Vineland Committee in 1917, which, within just a week, developed the modern multi-part multiple-choice exam. Its eight sections were designed to be administered in about an hour, and among those eight were such familiar item types as arithmetic reasoning, synonym-antonyms, and verbal analogies. The test they prepared, then called the “Army Alpha” (to distinguish it from the “Army Beta,” the nonverbal version for illiterate examinees), became a testing model followed widely, with only modest differences, ever since. The exam could be administered quickly, and scored objectively and automatically using a stencil.11
Multiple Choice Exams, John Hubbard and Modern Medical Licensing
Medical licensing exams in the first half of the 20th century followed the centuries-old tradition of constructed response items, some written and some oral, rigorously administered over several days and scored by multiple expert raters. But this changed at mid-century with the happy confluence of three major events:
1. The success and growing popularity of the College Entrance Exam produced by the College Board led, in 1947, to the founding of the Educational Testing Service. Here, in one place, was concentrated the psychometric experience and expertise developed over the previous four decades.
2. The publication, in 1950, of Harold Gulliksen's Theory of Mental Tests, which finally provided a rigorous consilience of the statistical and theoretical methodologies required both to support the scoring of modern tests and to measure their efficacy.12
3. The ascension of John P. Hubbard to the head of the National Board of Medical Examiners (NBME), a post he held from 1950 until 1975.
Let us elaborate on the effect of these three events in order.
The need, faced by the U.S. military during World War I, for economically practical mass administration of tests gave rise to a huge increase in the development of multiple-choice items. Then, as now, there was concern that such a format was incapable of testing certain proficiencies that were crucial. Sometimes these concerns were well founded, but surprisingly often it turned out that the multiple-choice option worked better than even its most ardent supporters could have hoped. Why?
The answer draws on the psychometric/statistical developments described in Gulliksen's foundational book and stems from the basic fact that scores derived from any test format are imperfect. They contain errors. These errors fall principally into two broad categories:
(i) The estimates fluctuate symmetrically around their true values, due to variations in the examinee and the scorer. On some days we perform better than on others. As mentioned earlier, variation among raters of essays has always been substantial and seems relatively insensitive to improved rater training. In addition, scores also fluctuate due to the specific realization of the test item; if we want to study writing and so ask for an essay on Kant's epistemology we are likely to get less fluid responses than if we asked for one on “My Summer Vacation.”
(ii) The estimate can also contain some bias if the item used is measuring a proficiency that is not exactly what we are specifically concerned about. Suppose, for example, we are interested in measuring writing ability and instead of testing it in the obvious way, by asking the examinee to write an essay, we use multiple-choice items designed to measure general verbal ability (e.g., items involving verbal analogies, antonym/synonym interpretation, sentence completion). We are measuring something related to writing ability, but not writing ability specifically.
The test that is most predictive of future behavior is one that minimizes the sum of both kinds of errors. What the experience gained in the first half of the 20th century showed (and what has been reconfirmed many times since) was that in a remarkably wide range of proficiencies multiple-choice items were superior to their much older cousin, the essay. This surprising result came about because the bias that multiple-choice items might introduce was much smaller than the errors introduced by subjective scoring and by the limitations in breadth of subject matter coverage that are the unavoidable concomitants of essay-style exams.
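This tradeoff can be stated numerically with the standard decomposition, mean squared error = variance + bias². The figures below are invented solely for illustration: a modestly biased but precisely scored test can beat an unbiased but noisy one.

```python
# Mean squared error of a score as an estimate of true proficiency:
# the sum of its variance and its squared bias.
def mse(variance: float, bias: float) -> float:
    return variance + bias ** 2

# Essay: no bias (it measures the target skill directly), but large
# noise from subjective scoring and narrow topic sampling.
essay = mse(variance=4.0, bias=0.0)            # 4.0
# Multiple choice: a small bias (it measures something slightly
# different), but far less noise.
multiple_choice = mse(variance=1.0, bias=1.0)  # 2.0

# The biased-but-stable score is the better predictor.
assert multiple_choice < essay
```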
In one study examinees were given a test made up of three half-hour sections.13 Two of the sections required the writing of an essay; the third consisted of 40 multiple-choice verbal ability items. The essays were each scored by two raters (with a third sometimes brought in to adjudicate any large disagreements), and the multiple-choice section was scored automatically. It was found that the score on the multiple-choice section was more highly correlated with either essay score than the two essay scores were with one another. What this means practically is that if we want to predict performance on a future essay test we could do so more accurately with a multiple-choice test than with a parallel essay test. Some argued that 30 minutes is too short for a valid essay test; perhaps, but if the essays were allocated an hour, a one-hour multiple-choice test would also improve, probably more than the essays.14
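A small simulation shows why a result like this is to be expected rather than paradoxical. The noise levels below are assumptions chosen for illustration, not values from the cited study: each essay score is true ability plus large rater-and-topic noise, while the multiple-choice score carries much smaller noise.

```python
import random

def corr(x, y):
    # Pearson correlation, computed from scratch to stay dependency-free.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

random.seed(1914)
n = 5000
ability = [random.gauss(0, 1) for _ in range(n)]
# Each essay score = ability + large rater/topic noise (sd 1.2, assumed).
essay1 = [a + random.gauss(0, 1.2) for a in ability]
essay2 = [a + random.gauss(0, 1.2) for a in ability]
# Multiple-choice score = ability + much smaller noise (sd 0.4, assumed).
mc = [a + random.gauss(0, 0.4) for a in ability]

# The objectively scored test predicts each essay better than the
# essays predict one another.
assert corr(mc, essay1) > corr(essay1, essay2)
assert corr(mc, essay2) > corr(essay1, essay2)
```

The point is structural: the noisy essay scores share only their common ability component, so a less noisy measure of that same component correlates more strongly with each of them than they do with each other.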
It is worth emphasizing the advantage that asking many small questions has over asking very few large ones. In the latter case an unfortunate choice of question can yield an equally unfortunate outcome (“I knew just about everything in that subject except that one small topic.”) In the former case, there is still the possibility of such unfortunate choices, but through the larger sampling of topics, the effect of such bad luck is ameliorated considerably.
This brings us to the third and final crucial component in the development of the modern medical licensing exam: the 1950 ascension of John P. Hubbard to the head of the National Board of Medical Examiners. Even prior to Hubbard's arrival, NBME's reach was extensive. Most states accepted the NBME Certifying Exam as meeting their examination requirement, and many drew on NBME's item pool for their own, state-developed tests.15 Recognizing the NBME's growing influence, Hubbard began, soon after he took office, to investigate how medical licensing exams could be improved and how NBME's continuing goal of measuring more precisely the knowledge and competence of medical students and physicians, and thus better assessing their qualifications for medical practice, could be achieved. He recognized the limitations of the traditional oral and essay exam formats16 and so, in 1951, instigated a collaboration with the then newly hatched Educational Testing Service to explore shifting to a more accurate, objectively scorable format.17 After careful consideration and extensive study, action was taken in 1954 to discontinue essay testing and to substitute multiple-choice testing for the NBME Part I and II examinations. While not without controversy,18 the adoption of more objective testing formats paved the way for the NBME to test larger numbers of examinees and eventually expand its services to other markets.
With this running start, let us next discuss some of the technical work of the last century that has made the USMLE the Shibboleth for modern medical practice.
There have been two fundamental changes in testing over the past century.
The first grew from the realization that the subjectivity involved in scoring answers that were constructed by examinees yielded enormous error. And so gradually there has been a shift toward constructing items that could be scored objectively. A little of the evidence supporting this shift was discussed previously.
The second shift was in test construction paradigms that went from considering the test as a single entity (where the examinee's score was usually represented as the proportion of items answered correctly), to a much more flexible form in which the test is drawn from a large pool of components — some of which are selected as needed to estimate the examinee's ability in some optimal fashion. Thus the individual test item, or sometimes a fixed combination of items — a testlet — became the fungible unit of the test.
This shift in test structure was captured by three signal events in the four decades between 1950 and 1990.
The first was the 1950 publication of Harold Gulliksen's Theory of Mental Tests, which provided the machinery necessary for rigorous scoring of tests using what has become known as True Score Theory. The unit of measure was the test itself, and so the proportion of the test that was answered correctly characterized performance. Different forms of the test were equated so that the scores on different forms could be compared.19
The second breakthrough was the 1968 publication of Fred Lord and Mel Novick's Statistical Theories of Mental Test Scores. It signaled a new era in which tests could be built that were customized for each examinee and yet still standardized. It put a capstone on true score theory while simultaneously providing a rigorous statement of a new approach in which the test item becomes the fungible unit of measurement: item response theory (IRT).20 IRT was crucial if the dream of efficiently creating individualized tests for each examinee was to be realized. That dream was instigated by the growing power and availability of high-speed computing, and it required a combination of individually calibrated test items, a statistical theory that allowed comparable scores to be calculated for tests made up of wildly different mixtures of items, and a computer that could construct such tests on the fly. IRT also made it possible to use a large pool of items from which one could sample to make up any particular individual's test. This portended a major improvement in test security.
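The heart of IRT is a model for the probability that an examinee of ability θ answers an item of difficulty b correctly. In the one-parameter (Rasch) model noted in reference 20, that probability is 1/(1 + e^−(θ−b)); scoring then amounts to finding the θ that best explains the observed responses, whichever items were drawn from the pool. A minimal sketch, with invented abilities and difficulties:

```python
import math

def p_correct(theta: float, b: float) -> float:
    # Rasch model: the chance of a correct response depends only on the
    # gap between examinee ability (theta) and item difficulty (b).
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta, difficulties, responses):
    # How well a candidate ability value explains an observed response
    # pattern; maximizing this over theta scores the examinee, no matter
    # which particular items were administered.
    ll = 0.0
    for b, correct in zip(difficulties, responses):
        p = p_correct(theta, b)
        ll += math.log(p if correct else 1.0 - p)
    return ll

# An examinee exactly matched to an item has a 50/50 chance...
assert abs(p_correct(0.0, 0.0) - 0.5) < 1e-12
# ...and a much better chance on a distinctly easier item.
assert p_correct(0.0, -2.0) > 0.85
```

Because every item carries its own calibrated difficulty, two examinees who saw entirely different item mixtures can still be placed on the same θ scale, which is what makes individually tailored yet comparable tests possible.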
The shift from an essay to a multiple-choice format in 1954 yielded enormous benefits; the material covered by the test was expanded, the test forms were scored much more accurately, and thus the inferences made from tests' scores became more valid, the scores themselves were more reliable, and simultaneously the exams were much cheaper to administer. Hubbard's revolutionary and courageous change in the tests' format in one swoop expanded the capability of the National Board while improving the services it provided.
Military testing, which had played such an important role in shifting the testing paradigm to the multiple-choice format in 1917, made possible the third breakthrough 50 years later, which developed from the possibility of computerizing the test administration.
The principal test that the U.S. military gives to sort recruits into various training programs is the Armed Services Vocational Aptitude Battery (the ASVAB). It was long, with 10 parts, and required two days to administer. In an effort to reduce this time and to help control other problems, the Office of Naval Research in the 1960s and 1970s funded research, first by ETS's Fred Lord and later by Minnesota's David Weiss. What they came up with was a way to meld the strengths of individualized assessment with the standardization and reliability of modern multiple-choice tests.21
The aim was to construct a practical and standardized equivalent of a wise-old examiner who would sit with a candidate for an extended period of time and tap into all aspects of the candidate's skills and knowledge; asking neither more nor fewer questions than required for the accuracy of the inferences planned. This is especially important for tests used for diagnostic purposes.
This remarkable goal was, in fact, accomplished by presenting previously calibrated individual test items to the examinees on a computer. After each item was presented it was scored instantaneously and the computer selected another item. If the prior item was answered incorrectly an easier one was presented; if it was answered correctly, it was followed by a more difficult one. In this way the test would focus quickly on the ability level of the examinee. This allowed the test to yield the same precision as a typical fixed format test, while using only about half the items. It can also cycle through various subtopics as required for diagnostic testing. Such tests are called Computerized Adaptive Tests or CATs for short.
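The adaptive loop just described can be sketched in a few lines. This is a toy staircase version for illustration only, not the algorithm used by the ASVAB or any operational CAT: the step size halves after each item, and a deterministic stand-in plays the examinee.

```python
def adaptive_test(item_pool, answers_correctly, n_items=10):
    """Toy computerized adaptive test.

    item_pool         -- difficulties of the available items
    answers_correctly -- callable(difficulty) -> bool, standing in for
                         the examinee responding to an item
    Returns the final estimate of the examinee's level.
    """
    estimate = 0.0   # start at average difficulty
    step = 2.0       # shrink the step as evidence accumulates
    for _ in range(n_items):
        # administer the pool item nearest the current estimate
        item = min(item_pool, key=lambda b: abs(b - estimate))
        if answers_correctly(item):
            estimate += step   # correct -> try something harder
        else:
            estimate -= step   # incorrect -> ease off
        step /= 2
    return estimate

# A stand-in examinee who gets every item at or below difficulty 1.3
# right; ten items suffice to settle near that level.
pool = [x / 10 for x in range(-40, 41)]   # difficulties -4.0 .. 4.0
print(adaptive_test(pool, lambda b: b <= 1.3))  # a value near 1.3
```

The halving step is what buys the efficiency: each response roughly bisects the remaining uncertainty about the examinee's level, which is why a CAT needs only about half the items of a fixed-form test for the same precision.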
The shift was formally chronicled by the 1990 publication of the definitive text22 Computerized Adaptive Testing, which laid out and illustrated how a test could be individually constructed to suit each examinee while also being standardized. Now scores obtained from exams that were very different in the items from which they were constituted could be compared directly. Because this book also laid out in detail how such exams could be built and scored, it motivated a number of testing organizations to try out this new technology.
What many discovered was that building a CAT requires a great deal of work. Huge banks of items must be constructed, pre-tested and calibrated so that the item administration algorithm can select them appropriately. In the first rush of enthusiasm, a fair number of large-scale tests were successfully converted into CAT format (e.g., the ASVAB) and remain so to this day. Others (e.g., the Graduate Record Exam) were transformed into CATs only to be changed back after it was decided that the extra expense involved was not justified.23
In 1998, NBME — after perhaps a decade of consideration — decided to computerize the administration of major parts of the USMLE, but chose not to make the test adaptive. The decade of experience accumulated since that time has confirmed the wisdom of proceeding slowly. The computerized administration has provided a big gain in control and security, although at significant cost. But it has yielded a test bed in which new item-types can be tried. And, while the move to a full-fledged adaptive administration is not yet justified, CAT remains an attractive option if the use of the test includes instructional diagnosis. For this purpose, the ability to efficiently isolate precisely those areas that need remediation is likely invaluable.
The evolution of medical licensing exams has moved a great distance over the past century, with the modifications instituted by John Hubbard having yielded the most profound improvements. But this evolution is not done yet. As technology continues to provide us with new tools as well as new challenges to security, the character of the USMLE, as well as the platform on which it is presented, is likely to undergo startling changes in the coming decades. One vexing problem associated with computerized administration is the practical requirement that tests be administered continuously. It is just too expensive to use the old tactic of gathering hundreds of examinees in a gymnasium three or four times a year and passing out #2 pencils; while #2 pencils are cheap, laptop computers are not. So instead, testing centers are set up and examinees are scheduled to come to them a few at a time. Such an approach has economic drawbacks, for maintaining testing centers is expensive, and the continuous administration that computerized testing requires creates security challenges: giving a test continuously means that the items used on it must be changed very often, and consequently there must be a lot of them.
But there is no going back, for the modern USMLE has many item types that cannot be administered in a paper-and-pencil format (e.g., ones in which the examinee listens to a recorded heartbeat and must make a diagnosis). But technology may again provide a solution. Tablet computers have become both capable and relatively inexpensive. We may soon be able to return to the old style of a few mass administrations in which the 21st century #2 pencil is instead a specially made tablet. This is but one of the likely future possibilities of exams for medical licensure.
But it is the orienting attitude of questioning the status quo, which John Hubbard instilled in the DNA of the National Board, that continues the drive toward improvement through change. This attitude melds perfectly with the tenets of modern quality control. Whenever we have a complex system, whether it is a manufacturing process or the licensing of physicians, it is well established that the worst way to improve matters is to convene a blue-ribbon panel to lay out the character of the future of licensure exams. This doesn't work because the task is too difficult. What does work is the institutionalization of a constant process of experimentation in which small changes are made and evaluated. If a change improves matters, make a bigger change in the same direction. If it makes matters worse, reverse field and try something else. In this way the process gradually moves toward optimization, and so when the future arrives, the licensing process of the future is waiting for it.
Appendix
The Jesuit Ratio Studiorum of 1599: 11 Rules for Written Examinations
1. It is to be understood that absentees on the day assigned for composition receive no consideration in the examination unless their absence was owing to exceptional circumstance.
2. All should come early to class so that they can take down the theme of the composition and the instructions given by the prefect or his substitute, and thus be able to finish within the class period. After silence has been enjoined, no one may speak to another, not even to the prefect or his substitute.
3. All should come supplied with books and necessary writing materials so that there will be no need to ask anything of another during the time of writing.
4. The papers should be up to the standards of each one's class and clearly written in the vocabulary and style demanded by the theme. Ambiguous expressions will be construed unfavorably, and words omitted or hastily altered to avoid a difficulty will be counted as errors.
5. Seat-mates must be careful not to copy from one another; for if two compositions are found to be identical or even alike, both will be open to suspicion since it will be impossible to discover which one was copied from the other.
6. As a precaution against dishonesty, any student who for good reason is permitted to leave the room after writing has begun, must deposit with the prefect or his substitute his theme outline and whatever he has written.
7. After a student has finished his writing assignment, he should remain at his desk and carefully check over his work, make corrections and revisions until he is satisfied. Once he has handed in his composition it will be too late to make changes. Under no circumstances must his paper be returned to him.
8. Each one must fold his composition as the prefect directs and write on the back his full name in Latin. This will facilitate arranging the papers in alphabetical order.
9. When a student brings his composition to the prefect, he should bring all his books along and be ready to leave the classroom at once and in silence. Those who remain should not change their places, but finish their work at their own desks.
10. If anyone has not finished his composition in the time allotted, let him hand in what he has written. Accordingly, all should know precisely how much time is allowed for writing and how much for rewriting and revising.
11. When the students come to the oral examination, they should bring with them the textbooks which contain the subject matter of the course. While one student is being examined, the others should listen attentively and refrain from prompting in any way, and from offering corrections unless called upon to do so.
About the Author
Howard Wainer is Distinguished Research Scientist at the National Board of Medical Examiners.
- Copyright 2014 Federation of State Medical Boards. All Rights Reserved.
Endnotes and References
- 1. During the 1911 revolution that overthrew the Qing Dynasty and its Emperor, the national testing program was temporarily suspended. It was resumed as soon as peace was restored.
- 2. Teng, Ssu-yu (1943). Chinese influence on the western examination system. Harvard Journal of Asiatic Studies, 7, 267–312.
- 3.
- 4. Mohr, J. C. (2013). Licensed to Practice: The Supreme Court Defines the American Medical Profession. Baltimore: Johns Hopkins University Press.
- 5.
- 6. Johnson, D. A. & Chaudhry, H. J. (2012). Medical Licensing and Discipline in America: A History of the Federation of State Medical Boards. Lanham, MD: Lexington Books.
- 7.
- 8. Bock, R. D. (1991). “The California Assessment.” A talk given at the Educational Testing Service, Princeton, NJ, on June 17, 1991.
- 9.
- 10.
- 11.
- 12. Gulliksen, H. O. (1950). Theory of Mental Tests. New York: Wiley. (Reprinted in 1987 by Lawrence Erlbaum Associates; Hillsdale, NJ.)
- 13. Wainer, H., Lukele, R. & Thissen, D. (1994). On the relative value of multiple-choice, constructed response, and examinee-selected items on two achievement tests. Journal of Educational Measurement, 31, 234–250.
- 14. The unreliability of judges is a widespread, but not universal, result. It is certainly true for rating such outcomes as essays, but for very narrowly defined tasks expert judgment can be workable. For example, in judging the severity of hip fractures by orthopedic surgeons it was found that only 5% of the variability of responses was due to differences in opinion among raters, while 95% was due to variability among x-rays. (Baldwin, P., Bernstein, J. & Wainer, H. (2009). Hip psychometrics. Statistics in Medicine, 28(17), 2277–2292.)
- 15.
- 16. As documented in minutes of the NBME's Examination Committee as early as 1949, the NBME was having difficulty “handling the large number of papers, the variance in grading, and the difficulty in obtaining grades in time to meet the needs of medical schools.”
- 17. At the Annual Meeting in May 1950, Hubbard reported that the Examination Committee had reviewed the advantages and disadvantages of such a change and that both the Examination Committee and the Executive Committee had decided that this subject merited further study and investigation.
- 18. At this time 16 states withdrew from having NBME do their licensure testing. Among these were some big states (e.g., Florida, Texas, Pennsylvania), which reverted to their own tests, primarily oral/interview formats. They ignored the 350-year-old lessons cited by the Jesuits and the 60-year-old precedent laid out in Dent v. West Virginia. Hubbard's courage and determination cannot be overstated. The loss of income caused by the departure of one-third of NBME's participating boards threatened its very existence. But he was right, and most of the secessionist states must have recognized this because they returned to the fold within six years. The one holdout (Texas) returned with the start of the USMLE in 1992.
- 19.
- 20. There were earlier statements of IRT; the most relevant here is the 1960 monograph by the Dane Georg Rasch (Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danmarks Paedagogiske Institut. (Republished in 1980 by the University of Chicago Press.)), which, though much more limited in application than the general models in Lord & Novick, was the test-scoring model adopted by NBME. (Lord, F. M., & Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.)
- 21.
- 22.
- 23. The two competing costs are those associated with building and administering the test: the costs of writing, pretesting and calibrating a large number of items that span all of the subject areas and all of the difficulty levels, as well as the costs of individual administration. Balanced against these costs are the savings in examinee time, since an adaptive test takes roughly half the time of a fixed-format test of equal accuracy. For the military it meant that the test would shift from being a two-day affair, with all of the housing and other costs associated with an overnight stay, to a one-day test. In addition there was the saving of the opportunity costs for the second day. In the end it was determined that the personnel savings offset the testing costs. On the other hand, the examinee costs for the GRE were not borne by the testing organization that administered it, and so what little savings were achieved did not justify the expenses incurred. It is likely that, if the CAT-GRE could have been mass administered, rather than administered continuously, the financial calculus would have been different.




