As a classroom teacher, I was always looking for ways to effectively assess my students’ learning. I came up with some great ways to differentiate through product, but sometimes, I just had to use a traditional assessment. I always thought I was pretty good at creating those assessments, but once I got my current job working for Assessment and Accountability, I realized I’d been doing lots of things that are not best practices when it comes to traditional assessments. I’ve decided to share some biggies with you in the hopes that your classroom assessments can be more valid and effective, and can help you inform your instruction.
Be cognizant of item complexity, difficulty, and the distribution. We want assessments to tell us about our students. If a test is so easy that everyone gets 90%+, it doesn’t discriminate well at all. We don’t learn anything about what needs to be retaught or extended. If a test is so difficult that the average is 30%, it doesn’t discriminate well, either. We still don’t learn anything. It’s important to have a variety of items in terms of complexity and difficulty. Think about it this way: you want to have some items that will tell you the difference between your D and F students. Yes, every A/B student will get those items right, but that’s okay; you’ve put them on the assessment to tell you about the struggling students’ mastery. Then, you have items that will tell you the difference between your A and B students. Yes, every D/F student will get those items wrong, but that’s okay; you’ve put them on the assessment to tell you about the high-performing students’ mastery.
Now, difficulty and complexity are NOT the same thing. Difficulty is how many students answer the question correctly (high correct % = easy question; low correct % = challenging/hard question). Complexity is based on the Depth of Knowledge (Webb) or Taxonomy (Bloom). You can have a high complexity question that is easy and you can have a low complexity question that is hard. You won’t know the official difficulty level of a question until students take the assessment. You can speculate on the difficulty level ahead of time, but the actual difficulty is performance based. Complexity, however, is NOT based on performance. A simple recall question is low complexity, regardless of how students perform on the item.
You need to have a balance of both. When you write your questions, try to keep these ratios in mind: 10-15% easy, 15-20% difficult, and the rest average; 10-15% low complexity, 15-20% high complexity, and the rest moderate.
So, for a 50-item test, the breakdown might look like this: 7 low complexity questions, 10 high complexity questions, and 33 moderate complexity questions. When planning/writing, I might anticipate a difficulty breakdown of 5 easy questions, 10 difficult questions, and 35 average questions. Now, when my students take the test, I might find that the actual difficulty breakdown looks like this: 2 easy questions, 40 difficult questions, and 8 average questions. I know now that my test was too difficult. It’s not telling me anything useful. I really should make a new test. Or it might look like this: 30 easy questions, 5 difficult questions, and 15 average questions. I know that my test was too easy. It’s not useful, either. I really should make a new test. The more questions you write, the more information you get. That being said, sometimes quizzes will only have 5-10 questions, so do the best you can with what you have to work with.
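Once students have taken the test, the actual difficulty breakdown can be tallied from their responses. Here is a minimal Python sketch of that tally. The post never gives numeric cutoffs for "easy" and "difficult", so the two thresholds below are assumptions for illustration only, as are the function names and the made-up responses.

```python
from collections import Counter

# Hypothetical cutoffs (not from the post): tune these to your own course.
EASY_AT_OR_ABOVE = 0.80       # 80% or more of students correct -> "easy"
DIFFICULT_AT_OR_BELOW = 0.40  # 40% or fewer correct -> "difficult"


def item_difficulty(responses):
    """Return the fraction of students who answered each item correctly.

    responses: one list per student, with 1 (correct) or 0 (incorrect) per item.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(student[i] for student in responses) / n_students
            for i in range(n_items)]


def classify(p_correct):
    """Label one item from its fraction-correct value."""
    if p_correct >= EASY_AT_OR_ABOVE:
        return "easy"
    if p_correct <= DIFFICULT_AT_OR_BELOW:
        return "difficult"
    return "average"


# Made-up results: 5 students, 4 items.
responses = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
]

difficulty = item_difficulty(responses)
labels = [classify(p) for p in difficulty]
breakdown = Counter(labels)
print(difficulty)
print(dict(breakdown))
```

Comparing `breakdown` against the breakdown you anticipated while writing the test shows quickly whether the test ran too easy or too hard.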
Sometimes, you won’t have the time or ability to make an entirely new assessment if your first one didn’t perform as you’d hoped. In that case, it may be time to employ a scale (or curve). Here, there are some important things to take into consideration. The first is the overall grade distribution of your class. If you are a consistent grader, then your students’ performance on any given assessment should be similar to their performance in your class in general. In general, you’d expect an A/B student to get an A/B score on any given assessment, so you can use that distribution to scale your assessment scores. If your class breakdown has 4 A’s, 7 B’s, 11 C’s, 6 D’s, and 3 F’s, then you can scale the assessment to be close to that breakdown. That DOESN’T mean that you give the 4 A kids an A on the assessment. They might have performed really poorly and end up with one of the lower grades. What matters is the distribution. Don’t look at names. Here’s what your scaling process might look like:
| Grade breakdown in class (ST = students) | Grade breakdown on assessment (traditional 90-80-70 scale, with each student’s raw score) | Raw score range (traditional scale) | Raw score range after scaling |
| --- | --- | --- | --- |
| 13% A (4 ST) | 3% A (1 ST – 45) | 45-50 | 40-50 (4 ST = 13%) |
| 23% B (7 ST) | 10% B (3 ST – 40, 41, 44) | 40-44 | 34-39 (6 ST = 19%) |
| 36% C (11 ST) | 16% C (5 ST – 35, 35, 36, 38, 39) | 35-39 | 29-33 (11 ST = 36%) |
| 19% D (6 ST) | 32% D (10 ST – 30, 30, 31, 31, 32, 32, 33, 33, 33, 34) | 30-34 | 20-28 (5 ST = 16%) |
| 10% F (3 ST) | 39% F (12 ST – 29, 29, 28, 28, 27, 26, 20, 19, 19, 18, 17, 15) | <30 | <20 (5 ST = 16%) |
As you can see, the distribution of grades on the assessment is now similar to that of the class as a whole. You could also do it by looking at the overall distribution of grades for ALL your classes (for that course). Now, the original average for the exam was about 61%. It is difficult to calculate a new assessment % based on scale grades, because the scaled grades are letter bands with no percentage attached. As a teacher, you’d have to come up with the % you’d want to assign for an “A”, “B”, “C”, etc. You could also break down the scale even further to include “+” and “-” grades if you wanted to get more specific.
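If you would rather not hunt for the cut points by hand, the same idea can be roughed out in Python. This is a sketch under assumptions, not the exact procedure used for the table: the helper names `cut_scores` and `scaled_grade` are made up, and the rule it applies (pick the cut whose "students at or above" count is closest to the class's running total, keeping tied scores together) is just one reasonable choice.

```python
from collections import Counter


def cut_scores(raw_scores, target_counts):
    """Pick the lowest raw score for each grade band except the last.

    raw_scores: every student's raw score.
    target_counts: how many students hold each grade in the class overall,
        listed from the highest grade down, e.g. [4, 7, 11, 6, 3] for A-F.
    Tied scores always land in the same band, so the target counts can
    only be matched approximately.
    """
    candidates = sorted(set(raw_scores), reverse=True)
    cuts = []
    running_target = 0
    for count in target_counts[:-1]:
        running_target += count
        # Closest "students at or above" count wins; on a tie, the
        # highest-first ordering keeps the higher cut score.
        best = min(
            candidates,
            key=lambda c: abs(sum(s >= c for s in raw_scores) - running_target),
        )
        cuts.append(best)
    return cuts


def scaled_grade(score, cuts, letters="ABCDF"):
    """Return the letter for one raw score given the cut scores."""
    for cut, letter in zip(cuts, letters):
        if score >= cut:
            return letter
    return letters[len(cuts)]


# Raw scores from the table above (top score taken as 45).
raw_scores = [45, 44, 41, 40, 39, 38, 36, 35, 35, 34,
              33, 33, 33, 32, 32, 31, 31, 30, 30,
              29, 29, 28, 28, 27, 26, 20, 19, 19, 18, 17, 15]
class_counts = [4, 7, 11, 6, 3]  # A, B, C, D, F in the class overall

cuts = cut_scores(raw_scores, class_counts)
scaled = Counter(scaled_grade(s, cuts) for s in raw_scores)
print(cuts)
print(dict(scaled))
```

On these scores the sketch lands on cut scores of 40, 34, 29, and 19. That matches the A, B, and C bands in the table but draws the D/F line one point lower, which is a good reminder that the cut points are ultimately a judgment call.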
Another option is to go through the assessment, look at items that performed below a certain threshold (perhaps 25% correct, since that’s the guessing rate on a four-option multiple-choice item), and throw them out. Recalculate the grades and then go from there.
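Throwing out weak items and recalculating can be sketched in a few lines of Python. The function name `drop_weak_items` and the sample responses are invented for illustration, and the 25% threshold is the one suggested above.

```python
def drop_weak_items(responses, threshold=0.25):
    """Rescore a test after removing items too few students got right.

    responses: one list per student, 1 (correct) or 0 (incorrect) per item.
    threshold: items with a fraction correct below this are thrown out.
    Returns (indexes of kept items, each student's new percent score).
    """
    n_students = len(responses)
    n_items = len(responses[0])
    kept = [
        i for i in range(n_items)
        if sum(student[i] for student in responses) / n_students >= threshold
    ]
    if not kept:
        raise ValueError("every item fell below the threshold")
    new_percents = [
        round(100 * sum(student[i] for i in kept) / len(kept), 1)
        for student in responses
    ]
    return kept, new_percents


# Made-up results: 5 students, 4 items; only one student got the last item right.
responses = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
]

kept_items, rescored = drop_weak_items(responses)
print(kept_items)
print(rescored)
```

Note that in this sketch a student who answered a discarded item correctly loses that point along with everyone else, so check whether anyone's grade drops before you finalize.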
Another option is a flat curve, which is where you add a set amount of percentage points to everyone’s scores. This does NOT work well when one or two students performed very well but everyone else tanked. One way to use this method is to look at your highest score and see how many points it would take to bring that score up to a certain threshold. For example, on this assessment, only 1 student earned an A (90% or higher), and that raw score was only a 45/50. So as the teacher, I might say, “I’d like the highest score on the assessment to be a 98%. That would be a 49/50. So I’m going to a) add 4 raw score points to everyone’s score or b) add 8 percentage points to everyone’s score.” In this scenario, the distribution of assessment scores won’t match your class grades, but the overall average for the test will be higher (8 percentage points higher, actually). The original class average was about 61%. Now the class average is about 69%. Much closer to that “C” average.
Here’s how that would look:
| Grade breakdown in class (ST = students) | Grade breakdown on assessment (traditional 90-80-70 scale, with each student’s raw score) | Raw score range (before adding 4 points) | Grade breakdown after flat curve |
| --- | --- | --- | --- |
| 13% A (4 ST) | 3% A (1 ST – 45) | 45-50 | 10% A (3 ST) |
| 23% B (7 ST) | 10% B (3 ST – 40, 41, 44) | 40-44 | 13% B (4 ST) |
| 36% C (11 ST) | 16% C (5 ST – 35, 35, 36, 38, 39) | 35-39 | 32% C (10 ST) |
| 19% D (6 ST) | 32% D (10 ST – 30, 30, 31, 31, 32, 32, 33, 33, 33, 34) | 30-34 | 26% D (8 ST) |
| 10% F (3 ST) | 39% F (12 ST – 29, 29, 28, 28, 27, 26, 20, 19, 19, 18, 17, 15) | <30 | 19% F (6 ST) |
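The flat curve is simple arithmetic, so it is easy to check in Python. The sketch below is illustrative only: `flat_curve` and `letter` are made-up helper names, and the letter cutoffs assume the traditional 90-80-70-60 scale.

```python
from collections import Counter


def flat_curve(raw_scores, total_points, target_top_percent):
    """Add the same number of raw points to every score.

    The bump is whatever lifts the highest raw score to target_top_percent
    (never negative, so nobody loses points).
    """
    target_top_raw = round(total_points * target_top_percent / 100)
    bump = max(0, target_top_raw - max(raw_scores))
    return bump, [score + bump for score in raw_scores]


def letter(score, total_points):
    """Letter grade on a 90-80-70-60 percent scale (an assumption here)."""
    percent = 100 * score / total_points
    for cutoff, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if percent >= cutoff:
            return grade
    return "F"


# Raw scores from the table above, out of 50 points.
raw_scores = [45, 44, 41, 40, 39, 38, 36, 35, 35, 34,
              33, 33, 33, 32, 32, 31, 31, 30, 30,
              29, 29, 28, 28, 27, 26, 20, 19, 19, 18, 17, 15]

bump, curved = flat_curve(raw_scores, total_points=50, target_top_percent=98)
before = Counter(letter(s, 50) for s in raw_scores)
after = Counter(letter(s, 50) for s in curved)
average_before = round(100 * sum(raw_scores) / (50 * len(raw_scores)), 1)
average_after = round(100 * sum(curved) / (50 * len(curved)), 1)

print(bump)
print(dict(before), average_before)
print(dict(after), average_after)
```

With a target top score of 98%, the bump comes out to 4 raw points, which is 8 percentage points on a 50-point test.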
Yet another option is to scale the assessment so that it follows a normal distribution curve. In this case, you would want to end up with a roughly equal (but small) number of both A’s and F’s, a roughly equal (but slightly larger) number of B’s and D’s, and then the majority of scores would be C’s. For this particular class with 31 students, I would anticipate my normal distribution curve to look something like this: 2 A’s, 4 B’s, 19 C’s, 4 D’s, 2 F’s. There’s a little room for playing around. I might want it to be 3 A’s, 5 B’s, 17 C’s, 4 D’s, 2 F’s. That’s pretty close, too. I then adjust the grades accordingly. I would list the students’ grades from highest to lowest and the top 3 would be A’s, the next 5 B’s, the next 17 C’s, the next 4 D’s, and the last 2 F’s.
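The rank-and-assign step described above can be sketched in Python as follows. The function name `grade_by_rank` is hypothetical, and the scores are the 31 raw scores from the earlier tables.

```python
def grade_by_rank(raw_scores, quota):
    """Assign letters by rank: the top quota["A"] scores get A's, and so on.

    quota: grade -> number of students, listed from the highest grade to the
        lowest; the counts must add up to the number of scores.
    Returns (score, letter) pairs from the highest score to the lowest.
    """
    if sum(quota.values()) != len(raw_scores):
        raise ValueError("quota must cover every student exactly once")
    # Expand the quota into one letter per student, best grade first.
    letters = [grade for grade, count in quota.items() for _ in range(count)]
    ranked = sorted(raw_scores, reverse=True)
    return list(zip(ranked, letters))


raw_scores = [45, 44, 41, 40, 39, 38, 36, 35, 35, 34,
              33, 33, 33, 32, 32, 31, 31, 30, 30,
              29, 29, 28, 28, 27, 26, 20, 19, 19, 18, 17, 15]
quota = {"A": 3, "B": 5, "C": 17, "D": 4, "F": 2}

graded = grade_by_rank(raw_scores, quota)
for score, grade in graded:
    print(score, grade)
```

One thing this sketch makes visible: a strict count can split tied scores. With these numbers, one student with a 35 gets a B and the other gets a C, so you would likely nudge the counts by one in either direction to keep identical scores together.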
Keep in mind that these methods of scaling/curving/norming individual assessments can be done for overall class grades, too. This is useful if you have a particularly low-achieving class and 20 failures at report card time would not be looked upon favorably.
It’s important to realize that if you curve/scale/norm your assessments, that doesn’t make you a bad teacher. You can still get information about your students’ mastery and use it to inform your instruction without punishing the students for faulty questions or a test that you simply made too difficult. The important thing is that if you realize your tests are too difficult, you make an effort to change them. Either change what/how you teach or change the anticipated difficulty of your tests. Think about why you are assessing students.
Final Thoughts: Think about the purpose of assessment. Any assignment, really. We plan, we teach, we assess, and continue the cycle until our students master what we’re responsible for teaching them (or the end of the year gets here…whichever comes first…usually the end of the year). Assigning something or testing students with the sole goal of “teaching them a lesson” or intentionally promoting failure doesn’t fit into that cycle (plan/teach/assess). Assessment sometimes gets a bad rep, but if it truly fits into the cycle, it shouldn’t. Assessment (testing) is a part of the education cycle. If we don’t figure out what kids know, how can we teach them appropriately? And keep in mind that not all assessment has to be summative or cumulative. We can design, give, and use interim (ongoing, formative, whatever you want to call it) assessment to make micro-cycles of the plan/teach/assess loop. Teachers do it all the time without realizing it. Every question asked is an assessment, whether it’s during a discussion or on a paper/pencil test. Making traditional assessment work for you and your students is just one piece of the puzzle. I hope this series has helped with that.