
Notes on making good progress?: Chapter 7

In this series of posts I record my notes from Daisy Christodoulou’s book “Making good progress? The future of Assessment for Learning”. It is quite excellent. You can buy a copy here.

Improving formative assessments

Formative assessments should be:

  • Specific
  • Frequent
  • Repetitive
  • Recorded as raw marks

Specific questions allow teachers to diagnose exactly what a pupil’s strengths and weaknesses are, and they make it easy to work out what to do next, whereas open and complex questions like essays or real-world problems are not particularly well suited to this. Short-answer questions and MCQs can be very precise. MCQs, despite their reputation, are excellent for diagnosis: they give a specific picture of conceptual understanding, indicate what pupils might need to work on next, and are labour-saving.

Criticisms of MCQs include that they are easy for pupils to answer by guessing, but this risk can be mitigated in several ways: you can increase the number of distractors, increase the number of questions, or include more than one right answer. Answers can be analysed at the level of the class. MCQs can target misconceptions very effectively. Misconceptions are an important part of a progression model because they often involve particularly tricky and fundamental concepts without which pupils cannot progress.

They are very easy to analyse. You can record not just whether the pupil got the question right or wrong but which distractors they chose. When the analysis is done on a topic that has been recently taught it becomes much more helpful. We don’t necessarily need to re-teach topics, but can make sure to highlight those misconceptions again if the curriculum is structured in a way that allows this. Explanatory paragraphs in the question bank for each MCQ make it very easy to give feedback. Once the feedback has been delivered the teacher can follow up with another set of similar questions to see if the pupil has understood this time around. MCQs, together with this kind of in-depth, specific and precise feedback, can form a vital part of a progression model in any subject.
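As a rough illustration of the class-level distractor analysis described above, here is a minimal sketch in Python. The response records, answer key and misconception notes are all hypothetical; the point is only that recording which distractor each pupil chose lets you see, per question, which misconception is most common in the class.

```python
from collections import Counter, defaultdict

# Hypothetical response records: (pupil, question, chosen option)
responses = [
    ("Asha", "Q1", "B"), ("Ben", "Q1", "C"), ("Cara", "Q1", "C"),
    ("Asha", "Q2", "A"), ("Ben", "Q2", "A"), ("Cara", "Q2", "D"),
]

answer_key = {"Q1": "B", "Q2": "A"}

# The misconception each distractor was written to catch (from the question bank)
distractor_notes = {
    ("Q1", "C"): "confuses mass with weight",
    ("Q2", "D"): "reads the graph axes the wrong way round",
}

choices = defaultdict(Counter)
for pupil, question, option in responses:
    choices[question][option] += 1

for question, counts in sorted(choices.items()):
    total = sum(counts.values())
    for option, n in counts.most_common():
        if option == answer_key[question]:
            note = "correct"
        else:
            note = distractor_notes.get((question, option), "distractor")
        print(f"{question} option {option}: {n}/{total} pupils ({note})")
```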

Research shows that the act of recalling information from memory actually helps to strengthen the memory itself. That is, testing doesn’t just help measure understanding; it helps develop it. This is called the testing effect. This effect can certainly apply to summative tests too, so long as they don’t force students away from retrieval and into problem-solving search. The power of the testing effect is that it introduces desirable difficulties. Self-testing is much more effective revision than re-reading. Re-reading makes pupils feel familiar with the content but doesn’t guarantee thought. Testing makes it clear whether students have understood something.

Generally, assessment should not take place too close to the period of study, as we can’t then make a valid inference about whether pupils have learned the material. If a student gets the question right very soon after study, we are not provided with a valid inference. Some of the questions set for recap at the start of the lesson, for homework or at the end of the lesson should cover previously learned material.

Recording grades frequently forces formative assessment into a summative model. We could simply stop recording formative assessment, as this assessment aims to be responsive, not to produce a report. If we do record marks, these should not be converted to grades: when converting to grades you are asserting that the difficulty of the two assessments is the same and that you are trying to derive a shared meaning. Also, the aim of formative assessment is to set questions that are closely tied to what is being studied.


Notes on making good progress? by Daisy Christodoulou

In this series of posts I record my notes from Daisy Christodoulou’s book “Making good progress? The future of Assessment for Learning”. It is quite excellent. You can buy a copy here.


Notes on making good progress?: Chapter 6

In this series of posts I record my notes from Daisy Christodoulou’s book “Making good progress? The future of Assessment for Learning”. It is quite excellent. You can buy a copy here.

Life after ‘Life after levels’: creating a model of progression

The two assessment systems described in the previous two chapters suffer from the same flaws. They:

  • expect the same assessment to produce two very different inferences
  • lead to overgrading and overtesting
  • lead to unhelpful feedback
  • lead to the measurement of formative progress with summative grades
  • inadvertently encourage a focus on short-term performance and discourage long-term learning.

The purpose of a grade is to describe performance, not to measure progress. No assessment system can succeed unless it is based on a clear and accurate understanding of how pupils make progress in different subjects.

Assessments have to be selected and designed with reference to their purpose. Different assessments serve different purposes and have to be designed accordingly. We cannot rely on one assessment or style of assessment for all the assessment information we need. Pupils who get better at decoding phonemes do become better readers; those who establish a clear sequence of historical events do get better at source analysis.

A good assessment system must not only clarify the current state and the goal state, which it can do through the use of summative assessments, but it must also establish a path between the two: the model of progression.

Textbooks have a role here. They can be used to communicate the model of progression. In science they provide exemplars; Kuhn’s work notes that scientists gain expertise by learning many examples. Textbooks offer an effective and detailed way of communicating a progression model.

Modern textbooks can look very different to the older ones, as they can now be online and do not have to feature just prose.

A progression model needs to be specific, not generic, and it needs to break complex skills down into small tasks that do not overload pupils’ limited working memories. As the model builds, pupils will be able to manage more complex tasks because they have memorised and automated the initial steps, but the model must start with the basics.

It will look different in different subjects and for different concepts within the same subject. Teachers are required to make decisions about what tasks are most likely to lead to the attainment of the end goal in that particular topic.

We need to clarify what the final aim of education is, and we must use these aims, not exam success, to build our progression model. Exams are only samples of wider domains and, because of this, there will always be ways of doing them well that do not lead to genuine learning. However, if we set mastery of a domain as the goal, exams will be valid measures of it.

Goodhart’s law: when a measure becomes a target it loses its value as a measure.

If our end goal is success on an exam we will end up with a progression model which leads to exam success but not to the wider goals we really want.

Teaching to the test and exam prep do not correspond to the problems that students will face in real life, so if they have focussed excessively on these types of questions, the validity of the results will be compromised.

If pupils are graded every term or every few weeks and dramatic improvements are expected, then cramming and teaching to the test are likely the only methods that will provide short-term improvements.

Memorising the right thing versus the wrong thing is exemplified by the contrast between memorising model essays and memorising lines of poetry. Memorising poetry helps pupils move towards the end goals of the English curriculum in ways that memorising essays does not.

How should we make decisions about what knowledge is worth remembering and what isn’t? Daniel Willingham provides some pointers – how could these be applied in Group 4 subjects, specifically biology?

Lessons should be viewed in the context of the progression model. Remembering some things will create meaning, others not so much. The same lesson may or may not create meaning depending on the sequence it is part of. It is possible for a lesson to be highly effective as part of one sequence and ineffective as part of another.

In establishing a progression model we first have to establish what it is we want a pupil to be able to achieve. We have to define this in terms of the fundamental concepts we want them to master, not in terms of exam success.

Isabel Beck recommends a list of 400 words per year to be taught for the first 10 years of education. The research on teaching vocabulary also suggests that pupils learn vocabulary better from examples in context than from definitions.

Subjects may not always be the best way to think about progression models. Some traditional subjects have arbitrary content, and some have content that sits under other subjects in different national systems.

Progression models should be focussed on the concepts that we want students to acquire. The metaphor of marathon training for a progression model is particularly helpful.


Notes on making good progress?: Chapter 5

In this series of posts I record my notes from Daisy Christodoulou’s book “Making good progress? The future of Assessment for Learning”. It is quite excellent. You can buy a copy here.

Exam-based assessment

Exams based on the difficulty model produce a lot of information that can potentially be used formatively, as pupil performance on each question can be measured using the principle of question-level or gap analysis.

In the quality model, what looks like an exam is still an assessment that relies heavily on descriptors.

Exams, however, sample a domain and are not direct measurements. A test that samples from a domain may allow inferences about the domain, but it does not allow inferences about the sub-domains, because the number of questions for each sub-domain will be too small: the resolution is not fine enough. Careful analysis of the question responses may provide useful feedback, particularly for harder questions that rely on knowledge from multiple domains, but this more nuanced information is not captured in a question-level analysis.
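A minimal sketch of a question-level (gap) analysis, assuming each exam question has been tagged with a single sub-domain; the tags and marks below are hypothetical. It also shows why the inference is weak: with only a couple of questions per sub-domain, the percentage for each topic rests on very little evidence.

```python
from collections import defaultdict

# Hypothetical tagging of exam questions to sub-domains, with marks
question_topic = {"Q1": "algebra", "Q2": "algebra", "Q3": "geometry",
                  "Q4": "statistics", "Q5": "statistics", "Q6": "geometry"}
max_marks = {"Q1": 2, "Q2": 3, "Q3": 4, "Q4": 2, "Q5": 3, "Q6": 1}
pupil_marks = {"Q1": 2, "Q2": 1, "Q3": 0, "Q4": 2, "Q5": 2, "Q6": 1}

scored = defaultdict(int)
available = defaultdict(int)
n_questions = defaultdict(int)
for q, topic in question_topic.items():
    scored[topic] += pupil_marks[q]
    available[topic] += max_marks[q]
    n_questions[topic] += 1

for topic in sorted(scored):
    pct = 100 * scored[topic] / available[topic]
    print(f"{topic}: {pct:.0f}% of available marks, "
          f"based on only {n_questions[topic]} questions")
```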

Harder questions at GCSE rely on knowledge of several concepts, so if pupils get these questions wrong it is hard to tell from a gap analysis where their misconceptions lie. Complexity reduces formative utility.

It is not possible to measure progress fairly using summative exams [like past GCSE papers]: it is entirely possible for a student to have made significant progress in a sub-domain without that showing up in a summative exam, and because that topic may be poorly represented in the test, you cannot use just those questions to analyse their progress either.

The risk is that teachers may be incentivised to focus on activities which improve short-term performance but not long-term learning. Measuring progress with grades encourages teaching to the test, which compromises learning.

Because exam boards spend so much time and so many resources trialling and modelling assessments, it is hopeless for teachers to think they can match that rigour in the assessments they design.

Tests are often pulled in two different directions, and this highlights a tension between classroom teachers, who want formative data, and senior managers, who want summative progress data.


Notes on making good progress?: Chapter 4

In this series of posts I record my notes from Daisy Christodoulou’s book “Making good progress? The future of Assessment for Learning”. It is quite excellent. You can buy a copy here.

Descriptor-based assessment

It is suggested that descriptor-based assessments can be used to combine the purposes of formative and summative assessment, but this chapter outlines why this isn’t really the case.

Statements given in formative assessments across many different lessons can potentially be aggregated to give a summative assessment. Problems arise with the aggregation itself, however – do students need to meet all of the statements at a particular level? To get round this, different systems apply different methods and algorithms.

I have direct experience of using this assessment model for grading student internal assessment in DP biology classes. The guide advocates best-fit grading.
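As a purely illustrative sketch of what a ‘best-fit’ aggregation rule might look like, here is one possible implementation. The bands, statements and scoring rule are all hypothetical, not those of any real scheme (including the DP internal assessment criteria); the point is only that some explicit rule is needed to turn a collection of met/not-met statements into a single level.

```python
# Hypothetical descriptor bands: the statements a pupil at each level would typically meet
bands = {
    1: {"states observations", "records data"},
    2: {"states observations", "records data", "processes data correctly"},
    3: {"states observations", "records data", "processes data correctly",
        "evaluates methodology", "suggests realistic improvements"},
}

def best_fit_band(statements_met):
    """Pick the band whose statements best match the evidence.

    'Best fit' here is scored as (statements matched) minus (statements missing);
    real schemes define their own, often more holistic, rules.
    """
    def fit(band_statements):
        matched = len(band_statements & statements_met)
        missing = len(band_statements - statements_met)
        return matched - missing

    return max(bands, key=lambda level: fit(bands[level]))

evidence = {"states observations", "records data", "processes data correctly"}
print(best_fit_band(evidence))  # -> 2 under this illustrative rule
```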

Although different systems give different names to the categories and levels they assign to assessments using descriptors, they all aim to create a summative, shared meaning. Even if we don’t intend to create a summative assessment, the moment we give a grade or make a judgement about performance we are trying to create a shared meaning; we are making a summative inference. As discussed in the previous chapter, for that summative inference to be reliable the assessment has to follow certain restrictions: it has to be taken in standard conditions, distinguish between candidates and sample from a broad domain.

Every time we make a summative inference we are making a significant claim that requires a high standard of evidence to justify it.

This model does not provide valid formative data because it is based on descriptions of performance: descriptors lack specificity about the underlying process and therefore cannot be used responsively in the way that true formative assessment should be. They do not analyse what causes performance; they are designed to assess final performance on complex tasks.

Descriptors are not useful for formative feedback because:

  1. They are descriptive not analytic
  2. They are generic not specific
  3. They are short term not long term

Pupils may benefit from tasks that cannot be measured by the descriptors. The descriptors do not provide a model of progression. For example, making inferences from text requires a large amount of background knowledge; gaining that knowledge, and assessing the gain, cannot be done through activities that these descriptors can assess.

You may argue that teachers are free to use whatever activities they want in lessons as long as they realise that only some of them are capable of contributing to the final grade.

However, some of the systems warn against using other tasks; they can be designed to be used every lesson and to require judgements against hundreds of different statements, leaving little time for anything else.

Therefore there is a lot of pressure for class time to be spent on the type of task you can get a grade from. Quizzes and vocab tests get a bad reputation because they don’t provide evidence for the final grade. If every lesson is set up to allow pupils to demonstrate fully what they are capable of, then lessons start to look less like lessons and more like exams.

Generic feedback based on descriptors does not analyse where the issues are or, importantly, tell students how to improve. The descriptors only allow tasks that can be graded by descriptor, which makes it hard for the teacher to diagnose where a pupil has gone wrong. Even at more advanced levels of understanding there is still value in isolating the different components of a task to aid diagnosis.

Educators can’t even agree on what learning is; at least doctors and nurses agree on what health is!

Generic targets provide the illusion of clarity: they seem like they are providing feedback, but they are actually more like grades. If a comment is descriptive and not analytic, then it is effectively grade-like in its function: it is accurate but unhelpful.

Finally, these descriptors do not distinguish between short-term performance and long-term learning. This is the most difficult factor in establishing reliable formative inferences. If we want to know whether a student needs more teaching on a topic, then formative assessment right after a unit will not allow us to make such an inference.

In terms of summative assessments there are several reasons why using prose descriptors is problematic. These are:

  1. Judgements are unreliable – underlying questions can vary in difficulty but still match the prose description. If different teachers, both within and between schools, use different questions and tasks (that match the prose description) to summatively assess students, this introduces uncertainty into the assessment, and the results cannot be used to produce a shared meaning. Ensuring examiners interpret these statements in the same way is also extremely difficult. Descriptors lack specificity and can therefore be interpreted differently, making judgements unreliable. Different judgements cannot produce a shared meaning.
  2. Differing working conditions – different tasks completed in different situations cannot produce a shared meaning. Subtle differences in how tasks are presented will influence how well students can do.
  3. Bias and stereotyping – humans are subject to unconscious bias and stereotyping, even more so under pressure, and this will influence the judgements that they make.