Wednesday, November 11, 2009

Fun with Point Biserials

A few weeks ago, I was spending a lot of time analyzing test data, specifically, the p values (the proportion of students who answer an item correctly) and point biserials (the correlation between answering an item correctly and overall test score) of a set of English language arts tests for grades 3 through 11, in order to determine What Went Wrong.

In the preponderance of cases, of course, nothing went wrong. The items performed more or less as they should, with the correlations one might expect--i.e., high-achieving students got the easy questions and most of the hard questions right, while struggling students may have gotten the easy questions right, but pretty much got the hard questions wrong. And it follows that no wrong answer in these solid items attracted more students than the right answer.
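
If you're curious what those two statistics look like under the hood, here is a minimal sketch in Python (the toy response matrix and the uncorrected item-total correlation are my own illustrative assumptions, not the actual scoring setup I was working with):

```python
import numpy as np

# Toy example: rows are students, columns are items; 1 = correct, 0 = incorrect.
# The data here are invented purely for illustration.
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
])

total_scores = responses.sum(axis=1)

# p value: the proportion of students who answered each item correctly.
p_values = responses.mean(axis=0)

# Point biserial: the correlation between getting an item right (0/1) and the
# total test score. Solid items have a comfortably positive value; a value
# near zero or negative is a red flag.
pt_biserials = [np.corrcoef(responses[:, j], total_scores)[0, 1]
                for j in range(responses.shape[1])]

for j, (p, rpb) in enumerate(zip(p_values, pt_biserials), start=1):
    print(f"Item {j}: p = {p:.2f}, point biserial = {rpb:.2f}")
```

(In practice the item is usually removed from the total score before correlating, the so-called corrected item-total correlation, but the idea is the same.)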

But there were items with wacky data, items for which the high-achieving students picked wrong answers, or items in which a wrong answer lured more students than the right one, leaving the right answer feeling like an awkward wallflower in a darkened gym at the middle school dance. My task was to review these items and figure out the big why. It is a testament to my thorough and absolute geekiness that I LOVE DOING THIS WORK. I could do it all day long, every day. Oh, my goodness. More fun than a barrel of monkeys. It was such a pleasure that I found it difficult to tear myself away at the end of the day (and honesty compels me to add that I did sneak back to it at night after my daughters were asleep).
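
The flagging itself can be imagined as a simple filter over the raw answer counts; here is a rough sketch (the option labels, key, and data are all placeholders of my own):

```python
from collections import Counter

# Toy item: each student's chosen option, plus the keyed (correct) answer.
# Both are invented, purely for illustration.
key = "B"
choices = ["A", "B", "A", "C", "A", "B", "D", "A"]

counts = Counter(choices)

# Flag the item if any distractor drew more students than the key --
# the awkward-wallflower situation described above.
most_popular, most_count = counts.most_common(1)[0]
if most_popular != key:
    print(f"Flag: distractor {most_popular} drew {most_count} students, "
          f"while the key {key} drew only {counts[key]}.")
```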

Why, you may ask. Oh, for so many reasons! One being that it is fun to play detective, to deconstruct an item by conducting an investigative inquiry that concludes in determining the most likely source of the trouble, whether it be a stem of staggering verbosity, or a fundamental unsoundness in the premise of the item, or a bad practice (e.g., attempting to assess two or more skills with one item). It's like picking up a big tangled, knotted mess of yarn and, starting with one end, delicately unraveling the knots and twists and then rolling the yarn back into a nice orderly ball. But with your brain. How often do we really have the opportunity to use our brains in this way, and truly, what can possibly be more satisfying than solving a problem?

Because assessment content development, as its own universe, is governed by its own rules, and--to me, in my ridiculous and relentless geekiness--these rules have a sort of simple elegance, the beauty of rigor and orderliness. (Were I to wax biblical on the matter, I would start talking about the necessity for doing things decently and in order.) So to use the data as a mirror to reflect the soundness (or unsoundness) of the item, diagnose the trouble, and then find a means of repair requires the use of a range of tools that include a knowledge of the rules (and knowing when it is acceptable to bend them) and of how students might interpret an item (which is not always how a test developer intends) and of the content area itself. And it also requires a bit of intuition, that ability to sense your way in the dark that just comes from thoroughly knowing something, in the same manner you could navigate your home with the lights off because you know where the couch is and where stands that big coffee table with the treacherously sharp corners.