This entry is going to be just a little different from the norm. I had a great opportunity this week and I wanted to share it. I was invited to be on a committee that is part of the test development cycle for the state teacher examinations. The part of the cycle that I was involved in was the standard setting (deciding cut scores/passing scores) for the English Language Skills and Essay subtests for the General Knowledge exam.
There were 33 of us on the committee. Most participants were classroom teachers, 6-12, teaching some form of English (9-12, AP, etc.). There were a few instructional coaches and curriculum supervisors, but I was the only person who directly worked with any sort of assessment department. The exam contract is with Pearson, but I’m pretty sure the people who were in charge of the day were from the DOE. I think I’ll take this moment to remind everyone I live in Florida, so for me this was a trip up to Tallahassee. And I’d just like to give a shout of praise out to the DOE for having convenient, free parking at their facility. Which is more than I can say for my district’s main administrative building. Even though we’re one of the top 10 largest districts in the country. But I digress.
We started the day by taking the 40-question, multiple-choice ELS exam. No one got a perfect score (which should tell you something about the quality of the exam…), but I felt pretty good that I got 39 right. Then we went through, question by question, and gave each item a judgment rating on a scale of 1-10. The idea was to picture the examinees who would just barely qualify as successful, passing teachers: we would expect them to pass the entire test (whatever standard we ended up setting), but only by the skin of their teeth. For each item, we judged what percent of that population would get it correct. We did this independently, and then they ran our data so we could see the distribution of “scores” and the median. We were able to compare our own judgments to those of the group, and we were also given the performance data from the actual administration of the exam, which was the first time the test had been administered.
What I found extremely interesting was that our ratings as a group were considerably higher than the actual performance on the exam. For example, the median rating on a question might have been an 8, which meant that we expected 71-80% of that subset group to answer the question correctly, but only 60% of the actual examinees – or fewer – might have gotten it right. There were only a few items (as in 5 or fewer) where our median rating was the same as or lower than the actual performance. We went through that rating process twice, and then, based on that data, we each gave our opinion on what the passing score should be. What was really interesting is how the group (mostly – I was an outlier for the most part) seemed to judge the items from an English teacher's point of view rather than remembering that people of all backgrounds take the test. Because it is a General Knowledge test, no matter what subject or level you want to teach, you take this test and have to pass all the subtests (there is also a Math and one other – Science or maybe Social Studies). No one could seem to understand why their ratings were so much higher than the actual performance. They seemed so caught up in every teacher absolutely having to know all these rules of grammar and conventions that they couldn't see the forest for the trees.
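For anyone who wants to see that comparison spelled out, here is a minimal Python sketch of how one item's ratings might be checked against actual performance. It is only an illustration, not the procedure the DOE or Pearson actually ran: the function names and the sample ratings are made up, the band mapping is inferred from the "8 means 71-80%" example, and treating a rating of 1 as 0-10% is my own assumption. It also assumes an odd number of raters so the median is a whole rating.

```python
from statistics import median


def rating_to_band(rating):
    """Map a 1-10 rating to the percent band it stands for (8 -> 71-80%)."""
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    # Assumption: the lowest rating covers 0-10% so no percentage is left out.
    low = 0 if rating == 1 else (rating - 1) * 10 + 1
    high = rating * 10
    return low, high


def compare_item(ratings, observed_pct):
    """Say whether the panel's median rating sits higher than, lower than,
    or in the same band as the observed percent correct for one item."""
    low, high = rating_to_band(median(ratings))
    if low > observed_pct:
        return "higher"
    if high < observed_pct:
        return "lower"
    return "same"


# Hypothetical ratings from five panelists for a single item.
sample_ratings = [7, 8, 8, 9, 8]
print(rating_to_band(median(sample_ratings)))
print(compare_item(sample_ratings, observed_pct=60))
```

With the made-up ratings above, the median is an 8, so the panel expected 71-80% correct, and an observed 60% comes out as "higher": the panel expected more than the examinees delivered.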
It was really difficult for me to decide on a passing score. I was torn between setting a standard so high that the people who pass would have a strong grasp of English language skills and setting the passing score low enough that people who were probably decent enough to be teachers could get into the classroom, even if they were a little weak on the grammar. It was like deciding between quality and quantity. And then I had to keep in mind that an examinee could take the exam as many times as he/she wanted (had to pay, of course, but there was no limit on attempts). So do I make it a more challenging passing score, so that the people who really wanted to pass would go back, study harder, and learn more of the rules? It was a real dilemma for me, knowing that my decision would impact thousands of people’s careers. Obviously my decision wasn’t final, but our data was going to be presented to the Commissioner of Education as recommendations.
I’m curious – which direction would you go? Quality or quantity? And keep in mind that I would never set it low enough that a complete moron could get through. I was just thinking about those people who were planning on teaching math, science, or social studies – subject areas that don’t routinely teach (or even need) the intricacies of grammar and conventions.
Deciding on the essay cut score was a little easier, because no matter what subject one teaches, writing is an important component – even just technical writing. So I felt more comfortable setting a challenging passing score for that subtest.
So. Tell me. What do you think a reasonable passing score (percentage) for an English language skills exam should be? Quality or quantity?