In this series of posts I record my notes from Daisy Christodoulou’s book “Making Good Progress? The Future of Assessment for Learning”. It is quite excellent. You can buy a copy here.
Descriptor-based assessment
It is often suggested that descriptor-based assessments can combine the purposes of formative and summative assessment, but this chapter outlines why this isn’t really the case.
Statements judged in formative assessments across many different lessons can potentially be aggregated to give a summative assessment, although problems arise with the aggregation itself: do students need to meet all of the statements at a particular level? To get around this, different systems apply different aggregation methods and algorithms.
I have direct experience of using this assessment model for grading student internal assessment in DP biology classes. The guide advocates best-fit grading.
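To make the aggregation problem concrete, here is a minimal sketch of two hypothetical aggregation rules: a strict “meet every statement” rule and a “best fit” rule of the kind the DP guide advocates. The level names, descriptor statements, and function names are invented for illustration; real systems are more elaborate.

```python
# Hypothetical descriptor scheme: each level lists the statements a
# student should meet. (Statements and levels are invented examples.)
LEVELS = {
    1: ["recalls facts", "uses basic vocabulary"],
    2: ["explains processes", "uses subject vocabulary"],
    3: ["evaluates evidence", "constructs arguments"],
}

def strict_level(met: set) -> int:
    """'All statements' rule: highest level for which every descriptor
    at that level and below is met."""
    level = 0
    for lvl in sorted(LEVELS):
        if all(s in met for s in LEVELS[lvl]):
            level = lvl
        else:
            break
    return level

def best_fit_level(met: set) -> int:
    """'Best fit' rule: the level whose descriptors the student matches
    the largest fraction of (ties broken towards the higher level)."""
    return max(
        LEVELS,
        key=lambda lvl: (sum(s in met for s in LEVELS[lvl]) / len(LEVELS[lvl]), lvl),
    )

# A student who misses one low-level statement but meets everything at
# level 2 gets very different grades under the two rules.
met = {"recalls facts", "explains processes",
       "uses subject vocabulary", "evaluates evidence"}
print(strict_level(met))    # strict rule is dragged down by the gap at level 1
print(best_fit_level(met))  # best fit rewards the strong level-2 match
```

The point of the sketch is simply that the summative grade depends as much on the chosen aggregation rule as on the judgements themselves, which is part of why shared meaning is hard to achieve.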
Although different systems give different names to the categories and levels they assign to assessments using descriptors, they all aim to create a summative, shared meaning. Even if we don’t intend to create a summative assessment, the moment we give a grade or make a judgement about performance we are trying to create a shared meaning; we are making a summative inference. As discussed in the previous chapter, for that summative inference to be reliable the assessment has to meet certain conditions: it has to be taken under standard conditions, distinguish between candidates, and sample from a broad domain.
Every time we make a summative inference we are making a significant claim that requires a high standard of evidence to justify it.
This model does not provide valid formative data. Because it is based on descriptions of performance, the descriptors lack specificity about the underlying processes and therefore cannot be used responsively in the way that true formative assessment should be. They do not analyse what causes a performance; they are designed to assess final performance on complex tasks.
Descriptors are not useful for formative feedback because:
- They are descriptive, not analytic
- They are generic, not specific
- They are short term, not long term
Pupils may benefit from tasks that cannot be measured by the descriptors. The descriptors do not provide a model of progression. For example, making inferences from a text requires a large amount of background knowledge; gaining that knowledge, and assessing that gain, cannot be done through activities that the descriptors can assess.
You may argue that teachers are free to use whatever activities they want in lessons as long as they realise that only some of them are capable of contributing to the final grade.
However, some of these systems warn against using other tasks. They can be designed to be used every lesson and require judgements against hundreds of different statements, leaving little time for anything else.
There is therefore a lot of pressure to fill class time with the type of task you can get a grade from. Quizzes and vocabulary tests get a bad reputation because they don’t provide evidence for the final grade. And if every lesson is set up to allow pupils to demonstrate fully what they are capable of, then lessons start to look less like lessons and more like exams.
Generic feedback based on descriptors does not analyse where the issues are or, importantly, tell students how to improve. The descriptor model only permits tasks that can be graded by descriptor, which makes it hard for the teacher to diagnose where a pupil has gone wrong. Even at more advanced levels of understanding there is still value in isolating the different components of a task to aid diagnosis.
Educators can’t even agree on what learning is; at least doctors and nurses agree on what health is!
Generic targets provide the illusion of clarity: they seem like they are providing feedback, but they are actually more like grades. If a comment is descriptive and not analytic, then it is effectively grade-like in its function: accurate but unhelpful.
Finally, these descriptors do not distinguish between short-term performance and long-term learning. This is the most difficult obstacle to establishing reliable formative inferences. If we want to know whether a student needs more teaching on a topic, an assessment taken right after a unit measures performance rather than learning and will not allow us to make such an inference.
In terms of summative assessments there are several reasons why using prose descriptors is problematic. These are:
- Judgements are unreliable – questions and tasks of varying difficulty can all match the same prose description. If different teachers, both within and between schools, use different questions and tasks (each matching the description) to assess students summatively, this introduces uncertainty into the assessment, and the results cannot produce a shared meaning. Ensuring examiners interpret the statements in the same way is also extremely difficult: because descriptors lack specificity they can be interpreted differently, and different judgements cannot produce a shared meaning.
- Differing working conditions – different tasks completed in different situations cannot produce a shared meaning. Subtle differences in how tasks are presented will influence how well students can do.
- Bias and stereotyping – humans are subject to unconscious bias and stereotyping, even more so under pressure. This will influence the judgements that they make.