
Exploring How to Improve Assessment with AI
By Kristen DiCerbo, Ph.D., Chief Learning Officer at Khan Academy

Can AI help us improve assessment? Can it give us a better understanding of what students know and can do? Classroom assessment generally comes in two forms: teacher-generated assessments meant for the teacher and student, and large-scale assessments where students take the same assessment as other students in the school, district, state, and/or country and results are shared widely. In general, these large-scale assessments have been limited in the types of questions that can be asked, in part because the responses are automatically scored. As a result, many feel that large-scale assessments cannot measure the skills and knowledge that students actually have or that matter most.
Generative AI opens new ways to design assessments that better reflect what we actually want students to learn and do. There are two specific areas that generative AI might help with: new kinds of assessment activity and new kinds of scoring.
Exploring a new kind of assessment activity
Since last January, we have been piloting a feature called Explain Your Thinking with select pilot schools. This feature is meant to mimic the conversations that teachers have with students about their work. They might sit with a student and say things like, “Tell me why you did that step next,” or “What does that answer tell you about the problem?” Our Explain Your Thinking feature similarly asks students to first answer a traditional question and then engage in a conversation with the AI about their answer. We use prompting behind the scenes to guide the AI to ask questions that get at particular conceptual ideas.
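To make the mechanics concrete, here is a minimal sketch of how a follow-up probe like this could be prompted. The prompt wording, the function names, and the generic call_model stand-in are illustrative assumptions for this post, not our production prompts.

```python
# Illustrative sketch of an "Explain Your Thinking" follow-up probe.
# The prompt wording, function names, and `call_model` stand-in are
# assumptions for illustration, not our production prompts.

def call_model(messages: list[dict]) -> str:
    """Stand-in for a call to whatever LLM backend is in use."""
    raise NotImplementedError

PROBE_SYSTEM_PROMPT = """You are talking with a student about a math problem
they just answered. Ask short, open questions that target the concept below.
Do not reveal the answer, confirm correctness, or give hints.

Concept to probe: {concept}
Problem: {problem}
Student's answer: {answer}"""

def next_probe(problem: str, answer: str, concept: str,
               history: list[dict]) -> str:
    """Generate the next follow-up question about the student's reasoning."""
    messages = [{"role": "system",
                 "content": PROBE_SYSTEM_PROMPT.format(
                     concept=concept, problem=problem, answer=answer)}]
    messages += history  # prior AI/student turns in this conversation
    return call_model(messages)
```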
In our research, we examined both whether the conversation gave new information and whether the generative AI was an accurate scorer of the conversations. We looked at 220 conversations in algebra and 296 conversations in geometry. We asked whether the conversation revealed more about students’ understanding than their first responses did; in other words, did a conversation tell us more about what a student knows than a single open-ended response would have? About 20.0% of students for the algebra item and 36.1% for the geometry item did not demonstrate understanding initially but did so by the end of their conversation with the AI. That is a substantial number of students who demonstrated more understanding in the conversational setting. We are heartened to see these preliminary results and are eager to further explore how questioning students about their thinking leads them to reveal more about their understanding.
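For readers who want to run the same kind of check on their own data, the comparison reduces to counting conversations where understanding appeared only after the follow-up questioning. A minimal sketch, with an assumed data layout (not the study's actual code):

```python
# Illustrative calculation with an assumed data layout (not the study's code):
# each record notes whether understanding was evident in the first response
# and whether it was evident by the end of the AI conversation.
conversations = [
    {"initial_ok": False, "final_ok": True},
    {"initial_ok": True,  "final_ok": True},
    # ... one record per scored conversation
]

gained = sum(1 for c in conversations
             if not c["initial_ok"] and c["final_ok"])
print(f"Showed understanding only in conversation: "
      f"{gained / len(conversations):.1%}")
```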
Each conversation has criteria that can be scored as correct or incorrect. We built an AI scorer to judge after each turn whether the criteria had been met, and then to give a score at the end of the conversation. The AI scorer demonstrated good alignment with human raters at both the turn and conversation levels. Read more about this work in the paper Measuring Student Understanding via Multi-Turn AI Conversations, led by principal psychometrician Jing Chen.
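A simplified sketch of this kind of turn-level scoring and human-AI agreement check is below. The prompt, the aggregation rule (every criterion met on at least one turn), and the use of Cohen's kappa as the agreement metric are illustrative choices, not necessarily what the paper used.

```python
# Simplified sketch of turn-level criterion scoring and a human-AI agreement
# check. Prompts, the aggregation rule, and the choice of Cohen's kappa are
# illustrative assumptions, not necessarily what the paper used.
from sklearn.metrics import cohen_kappa_score

def call_model(messages: list[dict]) -> str:
    """Stand-in LLM call, as in the earlier sketch."""
    raise NotImplementedError

def criterion_met(criterion: str, turn_text: str) -> bool:
    """Ask the model whether a single student turn satisfies one criterion."""
    verdict = call_model([
        {"role": "system",
         "content": "Answer MET or NOT MET only. Criterion: " + criterion},
        {"role": "user", "content": turn_text},
    ])
    return verdict.strip().upper().startswith("MET")

def score_conversation(criteria: list[str], student_turns: list[str]) -> bool:
    """One plausible aggregation: pass if every criterion is met on some turn."""
    return all(any(criterion_met(c, t) for t in student_turns)
               for c in criteria)

# Agreement between AI and human conversation-level scores (placeholder data):
human_scores = [True, False, True, True]
ai_scores    = [True, False, False, True]
print(cohen_kappa_score(human_scores, ai_scores))
```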
Of course, for Explain Your Thinking to work in an assessment scenario, the AI can’t give away the answer or even offer hints during the conversation. We know that many AI models are built to be helpful, so we had to test new ways to ensure the feature wouldn’t provide help to students. One way to do this is to set up a system that prompts the AI to self-critique its response before it is shown to the test-taker. To test this idea, we found 176 conversations in a group of 597 test cases where the AI was likely to try to give a hint. We then ran versions of the AI with and without self-critique to see how often they would give a hint. Self-critique dramatically reduced the rate of inappropriate hints, from 65.9% to 6.1%. Remember, this isn’t a random sample of conversations but a set in which the AI is very likely to give a hint, so in typical use the percentage of time the AI would give a hint with self-critique would be lower than 6%. You can read the details in the paper Beyond the Hint by prompt engineers Tyler Burleigh and Jenny Han, and me.
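The sketch below shows the general shape of such a self-critique loop: draft a reply, have the model review its own draft for hints, and regenerate or fall back to a neutral probe if the draft is flagged. The prompts and retry logic here are illustrative, not the exact system evaluated in the paper.

```python
# Sketch of a self-critique guardrail: draft a reply, have the model review
# its own draft for hints, and regenerate or fall back if it is flagged.
# Prompts and retry logic are illustrative, not the exact system in the paper.

def call_model(messages: list[dict]) -> str:
    """Stand-in LLM call."""
    raise NotImplementedError

CRITIQUE_PROMPT = ("You are reviewing a reply in an assessment conversation. "
                   "Does it reveal the answer, confirm correctness, or hint "
                   "toward the solution? Answer YES or NO.")

def safe_reply(conversation: list[dict], max_attempts: int = 3) -> str:
    """Only show the test-taker a draft that passes the self-critique check."""
    for _ in range(max_attempts):
        draft = call_model(conversation)
        verdict = call_model([
            {"role": "system", "content": CRITIQUE_PROMPT},
            {"role": "user", "content": draft},
        ])
        if verdict.strip().upper().startswith("NO"):
            return draft  # no hint detected
    # Every draft was flagged; fall back to a neutral probe.
    return "Can you walk me through how you got your answer?"
```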
Exploring a new kind of scoring
In order to make these conversational assessments, we need to write both the item prompts and the criteria by which to score them. We then need to test these and, if they are not reliable, revise them and test again. If we had to test every item with students, you can imagine that creating one assessment would take months to years. To cut that down, and to make sure we only test with real students once we think we have good items, we experimented with a system that uses AI to generate 150 synthetic responses to an item so the item can be tested and revised before piloting. Using this tool, we can run many iterations and, in particular, make sure the criteria used for grading result in reliable scores. We authored 17 items with the tool, and they collectively went through 68 iteration cycles. Before iteration, only 59% of the items could be scored reliably; with the tool, all 17 met the criteria for reliable scoring, and this was accomplished in days, not years. Read more about the tool in the paper Pre-Pilot Optimization of Conversation-Based Assessment Items Using Synthetic Response Data by senior prompt engineer Tyler Burleigh, principal psychometrician Jing Chen, and me.
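One way such a pre-pilot loop could look is sketched below: generate synthetic responses from varied student personas, score them, and check whether scoring is consistent enough before piloting with real students. The persona approach, the score-rescore consistency check, and the threshold are illustrative assumptions, not the tool's actual design.

```python
# Sketch of a pre-pilot iteration loop with synthetic responses. The persona
# approach, score-rescore consistency check, and threshold are illustrative
# assumptions, not the tool's actual design.

def call_model(messages: list[dict]) -> str:
    """Stand-in LLM call."""
    raise NotImplementedError

def synthetic_response(item_prompt: str, persona: str) -> str:
    """Generate one simulated student response to the item."""
    return call_model([
        {"role": "system",
         "content": f"Respond to the problem as this student would: {persona}"},
        {"role": "user", "content": item_prompt},
    ])

def score_response(response: str, criteria: list[str]) -> bool:
    """AI scoring pass against the item's criteria (stub)."""
    raise NotImplementedError

def item_scores_reliably(item_prompt: str, criteria: list[str],
                         personas: list[str], n: int = 150,
                         threshold: float = 0.8) -> bool:
    """Score each synthetic response twice and check scoring consistency."""
    responses = [synthetic_response(item_prompt, personas[i % len(personas)])
                 for i in range(n)]
    first = [score_response(r, criteria) for r in responses]
    second = [score_response(r, criteria) for r in responses]
    agreement = sum(a == b for a, b in zip(first, second)) / n
    return agreement >= threshold  # if False, revise the item or criteria
```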
If we do decide to use generative AI for scoring, there are a LOT of considerations before we do so. We drafted a framework of all the things we need to think about: Measurement Purpose, System Design, Model Selection, Item Development, Pilot and Live Testing, and Risk Mitigation. You can read more about this in A Framework for Live Scoring Constructed Response Items with Commercial LLMs by senior psychometrician Scott Frohn, assessments director Lauren Deters, senior prompt engineer Tyler Burleigh, and me.
We look forward to continuing to conduct careful research on the possibilities of generative AI to enhance assessment and give us a richer picture of what students know and can do.
Onward!



