
Friday, July 26, 2013

A Great Deal Done Imperfectly

Better a little which is well done, than a great deal imperfectly.--Plato

In my last post, I may have given the impression that editors have more power than they do, that perhaps they have time to consider the ramifications of failing to provide all the necessities, or that they are willfully negligent. Salaried as they may be, editors often find themselves in an unenviable pickle, with compressed development cycles and few resources. The industry's reliance on freelance personnel increases the workload of front-line staff, who may now have to manage groups of writers in addition to performing other duties. Each writer adds about an hour a week of emails, phone calls, and admin tasks--and that's if the writer is low-maintenance.

It's also likely that editors want to provide all the necessities, but those necessities don't exist and the schedule doesn't allow time for editors to develop them. (Some of the most experienced item writers are able to work around the deficiencies, but the work of the less experienced will be affected.)

No one--except the one at the top of the pyramid, I imagine--is resting on a velvet cushion.

I may have left another inaccurate impression: that it's all about the money. It's not. How can it be? This is not a high-rolling game. What I mean to say is that when writers don't have what they need to do their best work, everyone loses.

The industry continues to become less hospitable to the people actually doing the work of creating the tests--or, more accurately, the people writing the passages and questions from which the tests are assembled--which results in a great deal done imperfectly.

Writers lose time and money; they also lose the best of all rewards, the satisfaction of a job well done--because how can you do a task well when the task hasn't been clearly defined, and when you ask for clarification, you're told to figure it out?

The companies lose much, much more. The lower the pay and the greater the pain (inconvenience? Call it what you will. I mean all of those tiny ducks that are pecking us to death) for the writers, the lower the quality of the work, and the fewer the writers willing to undertake that work--those few being the ones who have no choice: the least proficient, the least experienced. And the most highly skilled writers simply decide they've had enough and move on to greener (or at least different) pastures.

Most importantly, the children who are taking the tests have already lost when they're faced with low-quality materials that don't provide them with a fair chance to demonstrate what they know and can do.

All right. Let's move on. I'm eager to address the basic rules of item writing (a version of which you can see here, in the Quality Control Checklist published by CCSSO), but I realize I should first define some terms.

An item is a test question. An item may be discrete, or may depend on some external stimulus, such as a reading passage or a chart or a map or something else.

Here is a discrete item:
Why does my dog Sophie bark at mail carriers? 
A She is flat-out crazy.
B She is outraged by uninvited guests.*
C She knows something about them that we don't. 
D She wants to register a protest about mail delays.

The above is a multiple-choice question, and contains a stem ("Why does my dog Sophie bark at mail carriers?") and four answer choices: one correct response (B, as far as I can tell, but I think maybe C is a possible right answer) and three distractors. Distractors, which used to be known as "foils," are wrong answers. Don't get hung up on the language--the point is never to distract or entice the test-taker into bubbling the wrong answer; the point is to create wrong answers that have a reasonable foundation in common mistakes kids would make with that particular skill or bit of content knowledge. More on this later. But tests should never be tricky.

A multiple-choice item is usually worth one score point, and used to be budgeted for one minute of test-taking time, not including the time it takes to read a passage or examine whatever stimulus is needed to answer the question.

There are other item formats: constructed-response items, which are also known as open-ended items. These require the student to produce a response rather than select one. The response may be as short as a word or a phrase, or, in the case of extended-constructed-response items, the response may be a complete essay.

Here is a short constructed-response item:
Write two words to describe my dog Sophie. Use details to support your answer.
And here is the scoring rubric:
2 points: The response includes two accurate describing words, and is supported by relevant evidence. 
1 point: The response includes one accurate describing word, and is supported by relevant evidence, OR the response includes two accurate describing words with no supporting evidence. 
0 points: The response is blank, illegible, off-topic, or otherwise impossible to score.
A short constructed-response item would usually have a score point range of 0-2 or 0-3, and would be budgeted for 5-10 minutes. More than that is usually reserved for an extended-constructed-response item (ECR), which could take as few as 15 minutes, or as long as an hour or more for a full essay.

An extended-constructed-response item would look like this:
Considering Sophie's protective nature, do you think it is wise for strangers to approach her? Why or why not? Write an essay in which you discuss the wisdom of approaching a dog with whom you are personally unacquainted.
I don't provide a writing rubric because rubrics are complex creations, but you may see some examples here and here. The score point ranges for ECR items vary, depending on the traits of writing and the number of domains assessed. That is, an essay might be scored for organization, style, and conventions. If the question depends on the student's comprehension of a passage, the essay might be scored for both reading and writing.

Bear in mind that these sample items are jokes, and as such, aren't examples of exemplary items, primarily because they require a great deal of prior knowledge, and so the test-taker who is unfamiliar with Sophie and dogs in general will perform less well than the test-taker who is on a first-name basis with Sophie and/or other dogs. There are other, less egregious flaws, but we'll get to those when we get to them.

If you have an item you'd like me to examine, explain, or deconstruct, feel free to post it in the comments. Check the copyright first.





 What I'm reading: Forgot to mention I was also finishing up The Claverings by Anthony Trollope. Then it's back to As I Lay Dying. I gave up on the other.

Wednesday, March 14, 2012

Speaking of Language, Cross-Referenced to Value-Added

Some item writers love multiple-choice language items; some hate them. I pitch my tent in the former category. Several years ago, I wrote all of the language items (writing conventions and writing strategies) that appeared on multiple parallel forms of a statewide high school exit examination. 

Language items may be either standalone or passage-dependent. The former are discrete entities, e.g.:
Read the sentence.
The Supreme Court may rule in favor of restrictions to freedom of speech when words are considered insendiary. 
Which is the correct spelling of the underlined word?
A incendiary *   B incendairy    C ensendiary     D Leave as is.

Passage-dependent language items accompany an editing passage, as previously discussed.

Language items may address any kind of writing skill or content knowledge targeted by an assessment: conventions (punctuation, capitalization, spelling), usage (grammar, diction), style (sentence structure and variety, diction), and organization (focus, elaboration, and support). (Rarely, writing applications skills are also assessed by multiple-choice items; one can easily see the difficulty of assessing any applied skill in this mode.)

There is some overlap (you see that "diction" may fall into the usage camp or the style camp, for example, and sometimes grammar items fall into conventions), but basically, you might think of language items as mechanics and style/organization. Proofreading items generally address mechanics, while editing and revision items may target either mechanics or style/organization, depending on the assessable skills.

Not all item writers can write language items. It's almost more of an editorly than a writerly undertaking, requiring a combination of specialized knowledge, persnicketiness, and an excellent grammar handbook. Some language items produced by unqualified writers are incomprehensible.

Sometimes these slip through the quality-control cracks because language items are difficult to review: one must read very closely in order to verify that the error is an error, that the correct response is indeed correct, that the error is the type of error indicated by the standard or objective, and that there are no errors other than those intended. When we read, our brains automatically correct much of the error that we see, even when trained to do otherwise, so reviewing an item with intentional error that may also contain unintentional error is asking a lot. It's overwhelming, especially if you add in the editing passage, which requires a lot of checking back and forth. Not to mention that a thorough knowledge of English usage and grammar is a rare commodity these days.

I have strong opinions about the content and construction of language items, to wit:
1. A language item should include only one type of error. A spelling item should not be contaminated by punctuation errors.
2. Each wrong answer choice should contain only one error.
3. The errors in the item must be the kinds of errors that students at the targeted grade level would reasonably make.
4. The errors should be obvious to the student who possesses the skill or content knowledge being assessed.
5. Trivial, why-bother sentences (or passages) should not be used to assess language skills. Language items should use actual facts in the stimulus sentences and paragraphs, rather than easy but lame filler.

Here is an example of an item with a trivial stimulus sentence:
Read the sentence.
 I haven't seen ______ since December.
Which pronoun should be used in the sentence?
A him *   B she    C they    D we

Here is an example of a language item based in fact:
Read the sentence.
 Langston Hughes had been writing poetry for years before Vachel Lindsay helped ____ publish his work.
 Which pronoun should be used in the sentence?
A he     B him *    C them    D they


I've written language items based on marine biology, space exploration, phenotypic plasticity, you name it. The world offers plenty of interesting material. There's no need to write about the purely meaningless.


UPDATE: Fixed some formatting with the MC items. Those are kind of tricky with the line breaks and indents.


UPDATE: Oh! I forgot to mention a recent dethspicable deplorable practice, that of using sentences from previously published works (usually classic literature) as the stimulus for language items, either containing (newly imposed) embedded error or offering students options for improvements to the original (classic) writing.


Your mind is probably as boggled as mine was when first I came across this nastiness in the woodshed unspeakable horror unsound practice.


How might this work, you ask? Not well, as illustrated in the following examples.


Here is an example of the addition of error to a line from classic literature:



Read the sentence from Leo Tolstoy's Anna Karenina. 
Everyone thinks of changing the world, but no one thinks of changing themselves.
Which word would best replace the underlined word in the sentence? 
A himself*   B myself   C ourselves   D Leave as is.  


Here is an example of an invitation to students to improve upon a line from classic literature:


Read the sentence from Leo Tolstoy's Anna Karenina. 
Everyone thinks of changing the world, but no one thinks of changing himself.
To avoid repetition, which word would best replace the underlined word in the sentence? 
A altering*   B becoming   C improving   D renovating   
Why would anyone perform such sacrilege and blasphemy do such a thing? From the best of intentions.


An item writer, bored with the trivial, why-bother stimulus sentences, feels inspired to use sentences culled from the books she loves, sentences that are themselves little masterpieces of beauty, wit, and style. Won't this be good for the students? And there is no one to stop her, as these sentences are stolen borrowed from works now in the public domain.


Good intentions, road to you-know-where, cross-referenced to the law of unintended consequences.


Wednesday, October 7, 2009

Poppycock, Folderal, Nonsense

. . . in the immortal words of Todd Farley.

About a week ago, someone sent me a link to an Op-Ed piece in the New York Times by Todd Farley, author of Making the Grades: My Misadventures in the Standardized Testing Industry.

Farley's experiences aren't unique. Like Farley, I am a writer who sort of fell into the test publishing industry by accident. Like Farley, I stayed in the industry long after I expected to have gone on to what I thought would be my real career of writing novels or screenplays or something, anything.

Both of us started our careers in hand-scoring, so hand-scoring is what I will talk about, specifically the hand-scoring of open-ended test questions. Multiple-choice questions are simple to score, because there is only one correct answer. All multiple-choice test questions are machine-scored. The answer sheets or test booklets are scanned, the answer choices verified by machine, and the scores are then computer-generated. Sometimes there are mistakes in the programming that must be corrected--for example, the correct answer to a given question was actually C but was identified somewhere along the line as A. Sometimes there are mistakes with a student's name or identification number that lead to a mistaken score. Sometimes--and this happened with my daughter's third-grade California STAR test booklet--the test booklet or answer sheet has juice spilled all over it, and so a false score may be generated. Where humans are involved, there will be some error somewhere; it is unavoidable; let us simply endeavor to put checks in place to catch the errors and processes to correct them.
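For the programmatically inclined, here is a minimal sketch of the kind of keyed scoring and key correction described above--just an illustration in Python, with made-up item numbers and responses, not any vendor's actual system:

def score_form(responses, answer_key):
    """Award one point for each response that matches the keyed answer."""
    return sum(1 for item, keyed in answer_key.items() if responses.get(item) == keyed)

# Hypothetical scanned responses for one student (item number -> bubbled choice).
student_responses = {1: "C", 2: "B", 3: "D"}

# The key as originally programmed--item 1 was mis-keyed as "A" instead of "C".
original_key = {1: "A", 2: "B", 3: "D"}
print(score_form(student_responses, original_key))   # 2 points

# Once the keying error is caught, the key is corrected and the forms are re-scored.
corrected_key = {**original_key, 1: "C"}
print(score_form(student_responses, corrected_key))  # 3 points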

The scoring of open-ended questions is a horse of a whole nother color. By its nature, there must be some subjectivity. In support of standardization is an array of tools that includes a scoring guide or rubric, sample student responses at each score-point level, and anchor papers and rangefinders. A rubric lists the characteristics of the response at each score point, a sample response gives an example of what kind of response is expected, an anchor is a student response that embodies the score-point level, and rangefinders show what may be expected at the high, middle, and low ends of the spectrum within a score-point level.

It sounds like a complicated process, and it is. And it's not without its ridiculous moments. And I have to say that though I found much about hand-scoring interesting, the work itself was tedious and the routine unbearable. But it's not the Orwellian circus of nonsense Farley describes. Or maybe it is at the company where Farley worked; it wasn't at CTB McGraw-Hill when I worked in hand-scoring there.

I am only about a quarter into the book, so maybe there will be some sort of Aristotelian discovery on Farley's part. At this point, he sounds like one of the disgruntled hand-scorers, and there were some of those: people who just never got it, who never were able to internalize the scoring criteria and constraints, the ones whose scores had to be checked and re-checked so often that eventually they were let go. He says that he failed to qualify as a scorer for a writing test, which does make one wonder whether this type of work simply was not a good fit for him. Not that I can vouch for what happened at Pearson, as I've not worked there.

I will also say--although I do not at all see myself as a flag-waver for the test publishing industry, and although I have my own strong feelings about the misuse of tests, what seems to me an abuse of how they are used, what they symbolize, and how the data are manipulated--that sitting in the mocking judgment seat is generally easy to do. I have plenty of ridiculous stories of my own. We humans are ridiculous; it's in our nature, and thank God that we are--it makes the world so much more entertaining.

And this book is just that--entertainment, a joke masked as an indictment of the industry. For myself, I'd be a lot more interested in a thoughtful exploration of the subject, one that takes into account the need for measurement in teaching, the demand for standardization (because that seems to be the only way to ensure any kind of fairness or equity), and how we could possibly balance these kinds of standardized measurements with classroom performance and evaluations from teachers.

CORRECTION: I mean "Folderol." Geez. And to think I won first place in the 8th grade spelling bee. What did I tell you? Human error.