Last week, a short editorial I wrote, Two key questions about how New York City admits students into its elite public schools, appeared in the Washington Post. Since there are limits on how much can be said in a paper, I figured I'd say the rest of what I have to say here. So here we go…
In early March, the city was once again rocked by the annual hand-wringing over the appallingly low number of Black and Hispanic students admitted to the specialized high schools. This year, however, has a different feel than years past: Nikole Hannah-Jones's work bringing the segregation of New York schools to the forefront of the national media, the installation of a new schools Chancellor who is willing to address segregation, and Bill de Blasio's proposal to change the admission process are causing a great deal of consternation among those who believe in the current process and accept the test as an objective measure of “merit.” While myriad objections to the mayor's proposed changes and plenty of alternative proposals have surfaced, the conversation around changes to how the SHSAT is used has failed to ask two key questions. First, is the SHSAT a good test? Second, is using a test, even a good one, as the sole basis for admission a good idea?
The answer to the second question is an unequivocal no. The answer to the first is probably not. The research that supports those answers follows.
Is the SHSAT a Good Test?
The lack of any serious interrogation of whether the SHSAT is a well-designed tool to fairly measure academic achievement or predict future academic success is deeply problematic. The SHSAT has a host of problems and questionable statistics that make relying on it in any significant way unsound educational practice.
The test changes format more often than the Knicks change coaches
As a test prep tutor, prep book author, and prep company founder, I've spent almost three decades researching, analyzing, and helping students prepare for the SHSAT, and over that time the test has been changed substantially at least 15 times. This is cause for concern because whenever a standardized test is changed, responsible psychometricians base those changes on sound theory and substantial research. Changes to large-scale admissions tests are therefore typically infrequent and based on historical data used to evaluate how well the assessment accurately, fairly, objectively, and reliably assesses the standards it was designed to assess. While longitudinal comparisons aren't strictly necessary for the SHSAT's sorting of students year to year, a comparable test from one year to the next strengthens the research basis of the exam. Unfortunately, NYSDOE has released only one validity study in the past 30 years and no research on the construct of the exam itself. The frequency of the changes, and even the relative stability from 2007 to 2017, raise questions about what research has guided the changes or non-changes year to year.
The fact that the test changes so frequently with no impact on the quality of graduates from the specialized high schools also argues against the utility of the exam as a necessary factor in that success.
Another example of curious changes to the SHSAT can be seen in the Flesch-Kincaid Reading Level analysis I compiled below using sample passages released in the 2013–2014 Handbook as compared to those in the 2017–2018 and 2018–2019 versions.
This analysis demonstrates one way in which the structure of the exam is changed periodically for unexplained reasons and with undefined impact on students. The FK levels show that from year to year the reading complexity of the passages offered can vary fairly substantially, and they further show that using the SHSAT to choose “the best” students is analogous to using a miscalibrated scale as the sole means to choose the “best” NFL players for your team. Yes, bigger people tend to be more likely to succeed in the NFL, but it's certainly not the scale that's most useful in identifying those players.
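For readers who want to check passages themselves, the Flesch-Kincaid Grade Level can be approximated with a few lines of Python. This is a rough sketch, not what commercial tools do: the syllable counter is a naive vowel-group heuristic rather than a dictionary lookup, but it's enough to compare the relative complexity of released passages.

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable count: runs of vowels, with a silent-e adjustment."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    """FK Grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59"""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Any released Handbook passage can be pasted in here for comparison.
passage = ("The committee deliberated extensively before announcing its decision. "
           "Several members expressed reservations about the proposed methodology.")
print(round(flesch_kincaid_grade(passage), 1))
```

Running this over each year's released passages is how differences in reading complexity like those in the chart above can be surfaced.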
Pearson has a terrible, horrible, no good, very bad reputation for shoddy work
The process of producing multiple versions of the exam to administer at multiple sites on 4 different days requires that up to 18 different forms of the test be created each year. This means that NCS Pearson, Inc. is writing, evaluating, revising, submitting to the DOE for proofing, and making further revisions to 18 different versions of the SHSAT in under 12 months, year after year. This production schedule is highly concerning since it is faster than that of most major admissions tests. By comparison, prior to the most recent SAT redesign, the College Board released a 250+ page specifications document fully two years before the first redesigned SAT was administered to test-takers. Not only does it appear that other tests take longer before making changes to items (the psychometric term for questions) or specifications of the test, but the development period for items appears to be longer. For the SAT, it's about 2.5 to 3 years between when an item is first written and the point at which a test-taker first sees it. According to the College Board, the writing and revision process is so rigorous that on average only about 50% of written items are actually used (this can be for many reasons; among them, an item may be determined to advantage or disadvantage a group of students). The most recent changes to the SHSAT were announced in the fall of 2016, a vendor was selected later that year, and the newest version of the test was administered in October of 2017.
The need for public disclosure of a technical manual on the SHSAT is even more important because the current test publisher, Pearson, has a long history of including flawed questions on exams and making scoring errors. Flawed questions like Pearson's infamous 7th Grade NYS ELA Pineapple passage (which posed three unanswerable questions about a pineapple racing a hare, based on a passage by Daniel Pinkwater, who considers himself a nonsense writer) undermine assessments and are especially concerning when the test is the sole means of entry into these schools. There is example after example after example of problems with Pearson-created exams. Most recently, Pearson included a question on the Massachusetts Comprehensive Assessment System (MCAS) which asked students to write from the point of view of a racist slave owner. Pearson's errors have in the past cost the company contracts to deliver state assessments for New York State, Texas, and North Carolina. Other states have written penalties into their contracts for flawed questions or scoring delays. After this long history of sloppy work, how the NYSDOE, which fired Pearson as the vendor for the NYS assessments, continues to use a Pearson-created exam as the sole measure of admission is confusing at best.
The scoring is weird and suspicious if not outright wrong
Further eroding my faith in the SHSAT as a fair, reliable, or objective measure is its scoring scale. The SHSAT reports scores in 1-point increments on a scale of 0 to 400 in math and the same in ELA. This means the SHSAT purports to show 801 levels of distinction between test-takers. No other major standardized test, all of which have greater research bases and are given to more test-takers annually, claims that level of precision. This purported fine-grained discrimination is further undermined by the fact that the SHSAT has fewer questions than many other major admissions tests. The SHSAT is scored on just 94 questions, while the ACT uses 215 questions to reach a score and the SAT uses 154; the ISEE used for Boston Latin has 160 questions and 181 increments of distinction. Generally, the number of scaled scores on an admissions test is kept fairly close to the number of raw-score increments in order to prevent frequent or large gaps in the scoring scale. The SHSAT has many gaps in its scoring scale yet still purports to deliver scores from 0 to 800. If the false precision weren't enough to raise serious questions, the method of converting from raw score (the total number of questions answered correctly) to scaled score (0 to 800) should raise the hackles of fairness watchdogs.
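The gap problem is simple pigeonhole arithmetic. Assuming one point per scored question and no partial credit (a simplification of the actual equating process, which Pearson does not disclose), a test can produce only as many distinct raw scores as questions plus one, so most points on a fine-grained scale are unreachable on any single form:

```python
# Pigeonhole sketch: N scored questions yield at most N + 1 distinct raw
# scores, so a finer-grained scale necessarily has unreachable points.
# (Assumes one point per question, no partial credit -- a simplification.)
def unreachable_scale_points(questions: int, scale_points: int) -> int:
    raw_scores = questions + 1
    return max(scale_points - raw_scores, 0)

# Whole test: 94 scored questions mapped onto an 801-point (0-800) scale.
print(unreachable_scale_points(94, 801))  # at least 706 unreachable points

# Per section, assuming the 94 questions split evenly (47 each) onto a
# 401-point (0-400) section scale.
print(unreachable_scale_points(47, 401))  # at least 353 unreachable points
```

However the actual conversion works, the scale is claiming far more distinctions than the questions can support.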
A 2005 research report showed that the test strangely rewards students with imbalanced scores over test-takers with the same number of correct answers whose performance was balanced across math and verbal. This scoring quirk, which by all reports is still part of the test, would award a higher scaled score to a test-taker who scored 40 on verbal and 47 on math (a total of 87 raw points) than to a test-taker who scored 44 on each section (a total of 88 raw points). So two test-takers separated by a single question could see their scaled scores differ by 2 points or by 20, depending on where on the score scale they fall and which test form they received. Combining the increment problem with the scoring quirks exacerbates the quirkiness and randomness of this test as a tool for awarding seats in the specialized schools.
The final scoring oddity that raises questions about the scientific validity of the exam is the reported Standard Error of Measurement (SEM). SEM is essentially the variability of a test score: it shows what the test author calculates as the error in a test score. The reported SEM on the SAT is 40 points, which means that if a test-taker gets a 1040 today, the College Board predicts that on a retake tomorrow that test-taker would score anywhere from 1000 to 1080. Oddly, the SHSAT reports a smaller SEM than the SAT. Given that the SHSAT has more score increments, it seems odd that its SEM is reportedly so small.
| Test | Range | SEM | Ratio SEM / Range |
| --- | --- | --- | --- |
| SHSAT | 0 – 800 | 20.4 | 0.0255 |
| SAT | 400 – 1600 | 40 | 0.0333 |
| ACT | 1 – 36 | 1.27 | 0.0352 |
| LSAT | 120 – 180 | 2.6 | 0.0433 |
| GMAT | 200 – 800 | 29 | 0.0483 |
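The ratios in the table are straightforward to reproduce (note that the ACT row appears to divide by the scale maximum of 36 rather than by the 35-point width of the scale; the other rows divide by the width). The more useful reading of an SEM is as a band: a score is statistically hard to distinguish from anything within roughly ±1 SEM of it.

```python
# Reproducing the SEM-to-range ratios from the table above
# (ratio = SEM / width of the score range).
tests = {
    "SHSAT": (0, 800, 20.4),
    "SAT": (400, 1600, 40.0),
    "ACT": (1, 36, 1.27),
    "LSAT": (120, 180, 2.6),
    "GMAT": (200, 800, 29.0),
}
for name, (lo, hi, sem) in tests.items():
    print(f"{name}: SEM/range = {sem / (hi - lo):.4f}")

# Reading a score as a +/- 1 SEM band rather than a point estimate:
score, sem = 500, 20.4
print(f"An SHSAT {score} is hard to distinguish from roughly "
      f"{score - sem:.0f} to {score + sem:.0f}")
```

On a test where a handful of scaled points separates an offer from a rejection, a ±20-point band is not a footnote; it is the whole ballgame.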
With the preponderance of questions surrounding Pearson and the SHSAT, one would expect to find a wealth of research before this test would be used in such an absolute and important way, yet almost no research exists about the test. In the admissions testing industry, all other major tests make retired test questions and research reports readily available, yet Pearson does not. The seeming violations of psychometric standards and practices and the unwillingness of Pearson to release research reports make it seem like there is an intent to hide a poorly designed test more than anything else at work here. Publishers of the SAT, ACT, and LSAT (Law School Admission Test) make available more than one-third of their nationally administered exams each year and produce technical reports at least every other year to provide transparency into their assessments. The GMAT (Graduate Management Admission Test) and GRE (Graduate Record Exam), which are computer-based, release actual exam questions in books and preparation materials and also publish technical reports. The lack of disclosure of information about the specifications and performance of SHSAT tests means there is no way to verify whether the test is properly designed, scored, and calibrated.
Released practice questions constantly raise doubts about quality
Those who support the current system often argue that the SHSAT is an “objective” test because it is “the same” for everyone. That is a fallacy, and it seems to demonstrate a lack of understanding of psychometric testing. If one were to give the question below to a group of people, many would get it right and others would get it wrong, but that doesn't make it a good question (here is a bad one from the 2017–2018 SHSAT Handbook). A closer evaluation of the question will show that it's poorly designed and would reveal little about the academic abilities of test-takers. There have been many examples like that over the years, some worse than others; the most subtle ones are probably the most problematic because it's unlikely students would catch them. Here are a few quirks I'd like to add to that discussion.
2017 – 2018 Handbook
2018 – 2019 Handbook
Here is a link to a few other examples I tweeted:
— SHSAT Boys (@akilbello) March 23, 2019
Another way to put this is “there is no way to verify whether the SHSAT is a bad test on which some bright students are doing better than others or a decent test that is sorting test-takers by levels of preparation for this test.”
Why Use SHSAT Scores in Direct Violation of All Science and Convention?
Even if we could trust the competence of Pearson, the exam presents enough problems to warrant reconsidering its use as the sole means of entry into the specialized schools, or its use at all. New York's specialized schools are the only schools in the United States that grant admission entirely based on the results of an admission test. This test, unlike any other admission test, can be taken only once, and thus represents not only an extreme use of an admission test but also the most extreme method of administration. This is high-stakes testing run amok.
Hi Akil! We believe when you are making decisions that impact students’ lives, more information is always better than less information. ACT advocates for decision makers to use multiple measures when assessing students’ academic readiness and take a holistic approach to (1/2)
— ACT (@ACT) April 10, 2019
Not only is New York singular in its decision to use a test this way, it is doing so in a fashion that flouts the standards of good practice established by the leading associations of assessment developers (the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education) and the recommendations of every major test maker, from LSAT to GMAT to GRE to the College Board. Every psychometrician, psychological testing organization, and test maker takes a clear and adamant stance that a test should not be used by itself to determine admission. The College Board has made clear that it believes test scores are best used for admission in combination with other factors, never by themselves. In fact, even Pearson suggests that multiple criteria are better.
“[We] understands that concerns around the role of assessments are varied and real. We believe that quality assessments are useful to the learning experience, but they are just one measure of the knowledge and skills that learners need. They do not, and will never, completely define the sum total of what a good education ought to provide.” – Pearson
Whether or not removing the SHSAT will increase diversity in the specialized schools, it's important that NYC schools not use a constantly changing tool created by a company that refuses to adhere to good research or design practices. If New York hopes to lead the nation in education and equity, being the only school district that clings to a poorly designed instrument that was implemented to prevent desegregation of the city's schools is not the way to do it. There is a preponderance of evidence establishing that the SHSAT not only doesn't meet the research standards to serve as the sole means of admission to NYC's specialized schools but also that its development may not be driven by evidence and ongoing research. Anyone interested in a fair educational system and good educational practice should question the use of this test at all. Anyone who believes in the integrity of the public school system should be against sending the message that the grades given by teachers, the results of statewide assessments, and the strength of schools' curricula are all less valuable than a three-hour test that research shows adds almost nothing to the prediction of success.