The global pandemic has wreaked havoc on the educational status quo, disrupting both how students are taught and how they are tested. Schools are radically adjusting how they deliver lessons, and testing agencies are revamping how they deliver their assessments. The inability to gather in a room has forced the cancellation of statewide assessments, the shortening of AP exams, and the development of online at-home options for major admissions tests (like the ACT, GMAT, GRE, ISEE, LSAT, SAT, and SSAT). This is the moment to fix many of the things that are broken about assessments.
Since their beginnings in the late 1800s, standardized tests have become an oppressive force in U.S. education and have influenced many other areas of society, yet in that same period there have been only marginal changes in the tests themselves. Despite reams of papers and scores of conference presentations, a high score on an aggressively timed test of four- or five-answer-choice questions continues to be treated as the epitome of intellectual demonstration. This has gone so far that, in the midst of a global pandemic, one author (intentionally neither named nor linked) questioned whether epidemiologists should be heeded above economists, asking, “How smart are they? What are their average GRE scores?”
Test-making agencies (Pearson, Questar, College Board, ACT, ETS, etc.) and their defenders have claimed that standardized tests are “objective” and “demonstrate readiness,” but critics ask “readiness to do what?” and whether the costs are worth the marginal information provided. These tests do demonstrate certain skills and some underlying knowledge, but certainly not “readiness” or “aptitude.” What is fact is that standardized tests are not the completely objective and unbiased tools they are often held out to be. Questions about passage content, question design, reading load, and gender bias have plagued these tests, as have recent inconsistencies in equating, frequent scoring problems, administration errors, and repeated cheating scandals.
These problems and scandals have led to a surge in colleges adopting test-optional policies, parents opting out, and, most recently, graduate schools joining in the GRExit. Even test-makers’ efforts to bolster the utility of admissions exams, in the form of the College Board’s Landscape (née Environmental Context Dashboard), ACT Academy, and the College Board’s Khan Academy SAT Practice, have added fuel to the argument that performance on these tests is largely determined by access to test preparation and built-in test biases, and thus reflects wealth and access rather than readiness or ability.
With the current global crisis further highlighting problems with equitable access to the tests, perhaps it’s time for test makers to more seriously reconsider their role. In fact, it’s long past time for test makers to stop defending an antiquated system that perpetuates many inequities and to start developing tools that support learning.
Here are a few suggestions on how to do that.
Reduce Score Scales
The number of increments on most standardized tests creates a perception of fine-tuned accuracy and high-level specificity that research does not support. The SAT is scored on a 400–1600 scale in 10-point increments, suggesting that the test identifies 121 different levels of proficiency (skill, knowledge, or performance, depending on what you think the test measures). That the SAT can use a mere 154 questions to distinguish between 121 different levels of ability is hard to believe. Further complicating matters, the ACT reports only 36 different score increments and yet is used to make the same distinctions between college applicants as the SAT. Reducing the score scales on both tests would not only align them with the performance levels that state tests provide but also diminish the false precision that current score scales encourage.
Reducing score scales would also reinforce the research undergirding the tests. Nearly all psychometricians and test-makers warn that each score is indicative of a score range, not a precise number. According to the College Board, an SAT score of 1010 is effectively the same as a 970 or a 1050. Yet universities, districts, and scholarship-granting agencies continue to award money and admissions based on these immaterial differences.
Even the testing agencies, when they conduct validity studies, do not report data with the level of precision their score scales suggest. One has to wonder why testing agencies publish research reports using score ranges rather than the “precise” score increments they report to schools and test-takers.
A student-focused agency would work to encourage responsible use of test scores; it would even change the test’s score scale to prevent unscientific, incorrect use. A simple shift of reported scales to a 5- or 6-point performance level would create a perspective shift, reduce improper distinctions, and diminish the impact of test preparation. This small change would not only benefit test-takers but also improve the value of the tests. Many state assessments (such as the PARCC report below) still have underlying score scales but feature performance levels more prominently, placing greater importance on the broader interpretation of scores rather than encouraging the false impression that minute differences are meaningful.
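To see how little is lost, consider a minimal sketch of collapsing a fine-grained scaled score into a handful of performance levels. The band edges below are illustrative assumptions, not any agency’s published cut scores:

```python
# A sketch of banding SAT-style scaled scores into 5 performance levels.
# Band edges are hypothetical, chosen only to illustrate the idea.
SAT_BANDS = [
    (400, 790, "Level 1"),
    (800, 990, "Level 2"),
    (1000, 1190, "Level 3"),
    (1200, 1390, "Level 4"),
    (1400, 1600, "Level 5"),
]

def performance_level(scaled_score: int) -> str:
    """Map a 400-1600 scaled score onto a 5-level reporting scale."""
    for low, high, label in SAT_BANDS:
        if low <= scaled_score <= high:
            return label
    raise ValueError(f"{scaled_score} is outside the 400-1600 scale")

# Two statistically indistinguishable scores land in the same band,
# instead of implying a meaningful 40-point difference.
assert performance_level(1010) == performance_level(1050) == "Level 3"
```

Under any such banding, a 1010 and a 1050 report identically, which is exactly what the measurement error says they should.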
Improve the Speed and Detail of Feedback
Given modern technology, it’s extraordinarily shameful that test-makers continue to provide test-takers, families, and schools with only limited, vague information and a meaningless numeric value after a test. Exacerbating this sad state of affairs, results take weeks, if not months, to be returned to students and their teachers (in the case of the SAT and ACT, teachers have to go through herculean efforts to access even the limited performance data). If an assessment is to be a useful tool for improving learning, results must be delivered quickly enough for teachers and students to review and learn from the performance. Unfortunately, this is another area where testing agencies value their profits over improving education.
| Test | Delay to receive scores | Scores given | Test question availability |
| --- | --- | --- | --- |
| SAT | 3 weeks | Subsection scaled and raw | For a fee on 3 of 7 national test dates |
| ACT | 3–8 weeks | Subsection scaled and raw | For a fee on 3 of 7 national test dates |
| SHSAT | 4+ months | Total scaled | No |
| Smarter Balanced (CA) | 2–3 months | Subsection scaled score | No |
| PARCC (NJ) | 3–4 months | Subsection scaled score, performance level | No |
| NYS Assessments | 3–4 months | | A sampling of questions released |
Timely, detailed reports and the ability to review the questions on a test are core elements of identifying areas in need of improvement, yet they are denied to almost all test-takers so that test makers can reuse questions in the future. This matters most for tests taken by students who still have ample opportunity to shore up weak areas before continuing their education. When testing agencies claim to be organizations dedicated to supporting and fostering continued education yet choose not to provide tools that foster educational development and improvement after testing, they reveal an operational model that values profit above all else.
Tragically, the technology and tools to deliver useful reports already exist but have not been made accessible to the students who take the tests. When done right, standardized tests are constructed to meet particular standards, and each question is designed to measure particular skills at particular levels of complexity. All of this data is recorded before a test is administered, so why would test agencies not tell a student that the questions they got wrong were mostly “geometry > triangles > algebraic expressions” rather than simply “geometry”? The former is actionable in a way the latter is not. The SAT student report (above) shows categories such as “Heart of Algebra” that do not align with any class or topic a student knows. Below, the ACT CollegeReady report provides a fairly detailed topic-performance analysis; unfortunately, it’s available only to those who use ACT’s new product, the CollegeReady assessment (not the ACT college admissions test).
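Turning that pre-existing metadata into actionable feedback is a small exercise in aggregation. Here is a minimal sketch, assuming each item carries a hierarchical skill tag; the item IDs, tags, and responses are invented for illustration:

```python
# Aggregate missed questions by full skill path rather than broad category.
# Item IDs, tags, and the list of misses are hypothetical examples.
from collections import Counter

ITEM_TAGS = {
    "Q1": "geometry > triangles > algebraic expressions",
    "Q2": "geometry > triangles > algebraic expressions",
    "Q3": "geometry > circles > area",
    "Q4": "algebra > linear equations > word problems",
}

missed = ["Q1", "Q2", "Q4"]  # a hypothetical student's wrong answers

report = Counter(ITEM_TAGS[q] for q in missed)
for skill, count in report.most_common():
    print(f"{count} missed: {skill}")
# 2 missed: geometry > triangles > algebraic expressions
# 1 missed: algebra > linear equations > word problems
```

A report at this grain tells a student exactly what to practice; “geometry: 60%” does not.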
Other test creators offer varied levels of detail in their student reports, but the detail generally comes at an additional fee, further evidence that these organizations choose to function as businesses rather than student-focused educational institutions. Right now, after testing, GMAT test-takers can buy an Enhanced Score Report that provides a more detailed analysis of performance, and GRE takers can get a Diagnostic Service (for free!). Large-scale assessments (STAR, Smarter Balanced, PARCC, SAT, MCAT, etc.) would all be more useful if their creators thought of them first and foremost as diagnostic tools designed to provide feedback.
In addition to making tests unhelpful as learning tools, delayed, uninformative score reports create an environment in which the test-preparation industry can step in and provide, for a fee, the analytical services that families want. Until testing agencies and departments of education stop leaving the analysis of student performance to the test-preparation industry, families and educators will continue to have legitimate concerns about the utility of standardized assessments. Perhaps one should not be surprised by the resistance to providing substantive information, since the only reason these testing agencies provide even the scant information they currently do is a law passed first in New York and then nationally.
Separate Speededness from Ability Assessment
Psychometricians define a speeded test as one in which “completing the questions in the allotted time presents a significant challenge,” and they distinguish a highly speeded test from a power test, which presents a smaller number of more complex questions. For many students, the time constraints are the single most significant factor in their performance. Under highly speeded conditions, questions abound about whether the tests are measuring actual knowledge and abilities or simply the speed of navigating the test. While state assessments in MA, NY, and other states have transitioned to untimed or generously timed testing, admissions tests continue to actively and aggressively build speededness into their construction. This rewarding of speed over depth of analysis widens the disconnect between familiar school-based tests and admissions tests, further undercutting the utility of these tools and feeding the fear that drives cheating and the purchasing of extended time. Test agencies should reconsider presenting highly speeded assessments as authentic measures of knowledge and ability.
While the time limits and speededness of the tests provide logistical efficiency for the testing agencies, the bias created in favor of certain test-takers may outweigh those benefits. Academics, sociologists, and psychometricians have all pointed out the limited value of speeded tests. College presidents have spoken out about the false assumptions on which speededness is based. Even economists have demonstrated the problematic nature of this type of testing.
“Is it an American conceit that speed and profundity are the same thing, that someone who is facile and quick is necessarily better?” – Leon Botstein, President, Bard College
Modernize the Assessments
The vast majority of standardized tests are essentially the same as the tests first used in the early 1900s. Somehow the practice of assessment has remained stubbornly resistant to improvement despite changes in educational theory and technology. If test-makers stopped ignoring “proximal knowledge” and “reasonable understanding,” they could create fairer, more informative tools.
A more effective assessment would heed research showing that many tests use poorly designed “distractor” answer choices that make tests longer and less useful. Distractor answers that supposedly indicate a test-taker’s lack of knowledge may actually capture sound logic and appropriate thinking. If a test-taker performs a task 90% correctly, shouldn’t a modern assessment acknowledge that this is a different level of ability than performing 10% of the task correctly? This information is likely already coded into the item-creation metrics of most standardized tests, so why not incorporate it into the scoring, the feedback, or both? Further, punishing test-takers who demonstrate an “approximation of knowledge” undervalues a crucial skill of the information age.
Creating a modern test focused on identifying demonstrable 21st-century skills, rather than memorization of easily searchable factoids, rules, and formulas, would benefit not only the test agency but also American education. Sadly, this is another area in which testing agencies have largely ignored research and existing examples. The GMAT has for almost a decade included an Integrated Reasoning section designed to integrate critical thinking, reading, and quantitative assessment in each item, yet no other major assessment has adopted this technology or attempted to create something similar.
Further, a modern diagnostic assessment would report not only the total number of right and wrong answers but also the quality of the wrong answers. A report that distinguishes a test-taker with a propensity for picking partial answers from one who is performing incorrect operations and following the wrong approach would not only provide better information but also encourage learning skills that are valuable in the workplace of the future.
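As a rough sketch of what that could look like, imagine an item bank in which each distractor carries a diagnostic tag and a partial-credit weight. Everything here, including the item structure, tags, and weights, is a hypothetical illustration, not any real test’s design:

```python
# Partial-credit scoring with tagged distractors (all values hypothetical).
ITEM = {
    "answer": "B",
    "distractor_tags": {
        "A": "partial answer: stopped one step early",
        "C": "wrong operation: added instead of subtracted",
        "D": "misread the question",
    },
    "credit": {"B": 1.0, "A": 0.5, "C": 0.0, "D": 0.0},
}

def score_response(choice: str) -> tuple[float, str]:
    """Return (credit, diagnostic note) rather than a bare right/wrong."""
    credit = ITEM["credit"].get(choice, 0.0)
    if choice == ITEM["answer"]:
        return credit, "correct"
    return credit, ITEM["distractor_tags"].get(choice, "blank or invalid")

print(score_response("A"))  # (0.5, 'partial answer: stopped one step early')
print(score_response("C"))  # (0.0, 'wrong operation: added instead of subtracted')
```

Aggregated across a whole test, those diagnostic notes are precisely the “quality of wrong answers” a modern report could deliver.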
Remove Multiple Choice from Math
One of the most obvious ways to remove external noise from standardized testing would be to remove the multiple-choice element from the exam. Multiple-choice answers not only allow random guessing to benefit a test-taker’s score but also allow strategies like plugging in and backsolving to produce a right answer even without knowledge of the math concept being tested. Several major tests (the SAT, GRE, and SHSAT) have included non-multiple-choice questions for several years, in both computer and paper formats.
The following questions from the SHSAT, SAT, and GRE, respectively, are all non-multiple-choice formats that have been implemented in very limited ways on admissions tests but have proven to be valid, computer-scorable alternatives to the four- or five-answer format.
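Machine-scoring such open numeric responses is not hard; equivalence checking covers most of it. Here is a minimal sketch, with the caveat that the acceptance rules below are assumptions for illustration; real programs publish their own equivalence rules:

```python
# Score an open numeric-entry ("grid-in") response by machine.
# Acceptance rules here are illustrative, not any test's actual policy.
from fractions import Fraction

def score_grid_in(response: str, correct: Fraction) -> bool:
    """Accept equivalent forms: '3/4', '0.75', and '.75' all match 3/4."""
    try:
        value = Fraction(response)  # parses both fraction and decimal strings
    except (ValueError, ZeroDivisionError):
        return False
    return value == correct

print(score_grid_in("3/4", Fraction(3, 4)))   # True
print(score_grid_in(".75", Fraction(3, 4)))   # True
print(score_grid_in("0.8", Fraction(3, 4)))   # False
```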
An additional benefit of removing or minimizing multiple-choice questions is addressing gender differences, since research shows that boys perform better on multiple-choice tests and girls on open-ended questions.
Actively Combat Score Misuse
Just as test makers employ lawyers and lobbyists to push for broader use of their products and to fight any restrictions on how the agencies conduct their business, they could, and should, employ lobbyists to curb the worst abuses of test scores.
Standardized test scores are regularly used for purposes that go well beyond the tests’ design and intended score use. Test scores have been used to determine teacher bonuses, by Amazon to choose the location of HQ2, by Google to hire, by banks to evaluate credit-worthiness, by families to choose which school district to move to, by schools to admit toddlers to kindergarten, by districts to promote students from one grade to the next, by policymakers to award school funding, by colleges to gatekeep entry into dual-credit programs, by states to certify teachers, and by the world to compare education systems. Test scores have gone so far beyond their intended use and original design that they’ve become a proxy for almost everything. Clawing that back would actually make testing more useful.
“When we talk about the number of students who are likely to be college successful, you’ve gotta take those numbers with a grain of salt. Policy makers or folks who hope to sell tests or get the ear of policy-makers will make unfortunate statements, will not listen to the folks in research who try to educate them and try to contextualize the scores.” – Wayne Camara, Vice President of Research, ACT
Improve the Conversation About Tests
While many have moved away from language suggesting that tests are determinative of a test-taker’s future, the marketers of the tests still perpetuate language that harkens back to the ideas of aptitude and innate intelligence. If test-makers were more diligent about describing their tests as something more akin to “demonstrations of accumulated knowledge and skills,” the public would likely be more accepting of them as tools that provide limited but valuable information. The continued description of the tests as measures of “preparedness,” “readiness,” or “ability” presents them as instruments that determine who will succeed in college (or, even worse, the ambiguously defined “career”) rather than as tools to help determine who has a particular set of skills at a particular moment in time. Even more egregious is ACT’s implicit suggestion of causal relationships between scores and college persistence. This continued overselling has steadily undermined educators’, administrators’, and the public’s faith in standardized tests. Broader, deeper, richer analysis with quicker, more specific feedback would allow large-scale assessments to be actual tools of opportunity and diagnosis rather than tools of sorting and simplistic categorization.
“Antiquated standardized tests are still serving as the backbone for measuring achievement. Our students can’t escape Kelly’s century-old invention.” – Cathy N. Davidson, The Washington Post
In this moment, as test makers scramble to roll out large-scale online at-home testing, there is an opportunity not simply to replicate the paper test on screen but to use all the tools at their disposal to make better assessments. Can test-makers create a secure, efficient tool that provides near-instantaneous feedback rather than mere scores? Do they have the will (read: profit motivation) to redesign tests as learning-focused, student-focused tools?
Further reading:
- Want a Better Assessment? Start by Asking the Right Question – Arthur VanderVeen
- How to Reform Testing – Robert J. Sternberg
- Standardized tests for everyone? In the Internet age, that’s the wrong answer. – Cathy Davidson
- Are Multiple-choice Items Too Fat? – Thomas M. Haladyna, Michael C. Rodriguez & Craig Stevens
- The optimal number of options for multiple-choice questions on high-stakes tests: application of a revised index for detecting nonfunctional distractors – M. R. Raymond, C. Stevens & S. D. Bucak
- Three Options Are Optimal for Multiple-Choice Items: A Meta-Analysis of 80 Years of Research – Michael C. Rodriguez
- 15 common math questions from the SATs that everyone gets wrong – Insider.com – Talia Lakritz
- Do Multiple Choice Questions Pass the Test? – US News
- PARCC Technical Report for 2019 Administration and NJ Score Guide
- Mapping the ACT, Compass Prep blog, retrieved 1/20/2020
- The Test and the Art of Thinking, Documentary film