Notes on making good progress?: Chapter 5

In this series of posts I record my notes from Daisy Christodoulou’s book “Making good progress? The future of Assessment for Learning”. It is quite excellent. You can buy a copy here.

Exam-based assessment

Exams based on the difficulty model produce a lot of information that can potentially be used formatively, since pupil performance on each question can be measured using question-level (or gap) analysis.
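As a minimal sketch of what question-level analysis can look like in practice (my own illustration with made-up pupil data, not from the book), the idea is simply to tabulate the proportion of the class answering each question correctly and flag low-scoring questions as possible gaps:

```python
# Hypothetical question-level (gap) analysis: illustrative data only.
# Each pupil's marks on a 6-question test, scored 0 or 1 per question.
scores = {
    "Alice": [1, 1, 0, 1, 0, 0],
    "Ben":   [1, 0, 0, 1, 1, 0],
    "Chloe": [1, 1, 1, 1, 0, 1],
    "Dev":   [1, 1, 0, 0, 0, 0],
}

n_pupils = len(scores)
n_questions = len(next(iter(scores.values())))

# Facility: proportion of the class answering each question correctly.
facility = [
    sum(marks[q] for marks in scores.values()) / n_pupils
    for q in range(n_questions)
]

for q, f in enumerate(facility, start=1):
    flag = "  <- possible gap" if f < 0.5 else ""
    print(f"Q{q}: {f:.0%} correct{flag}")
```

Low facility on a question suggests a gap worth re-teaching, though, as the chapter notes, this inference gets weaker for questions that draw on several concepts at once.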

In the quality model, what looks like an exam is still an assessment that relies heavily on descriptors.

Exams sample a domain, however; they are not direct measurements. A test that samples from a domain may allow inferences about the domain, but it does not allow inferences about the sub-domains, because the number of questions for each sub-domain will be too small. The resolution is too coarse. Careful analysis of the question responses may provide useful feedback, particularly for harder questions that rely on knowledge from multiple domains, but this more nuanced information is not captured in a question-level analysis.

Harder questions at GCSE rely on knowledge of several concepts. If students get these questions wrong, it is hard to tell from a gap analysis where their misconceptions lie. Complexity reduces formative utility.

It is not possible to measure progress fairly using summative exams [like past GCSE papers]. A student may have made significant progress on a sub-domain without that showing up in a summative exam, and because that topic may be poorly represented in the test, you cannot use just those questions to analyse their progress either.

The risk is that teachers may be incentivised to focus on activities which improve short-term performance but not long-term learning. Measuring progress with grades encourages teaching to the test, which compromises learning.

Because exam boards spend so much time and so many resources trialling and modelling assessments, it is hopeless for teachers to think they can match that rigour in the assessments they design.

Tests are often pulled in two different directions, and this highlights a tension between classroom teachers, who want formative data, and senior managers, who want summative progress data.

Notes on making good progress?: Chapter 4


Descriptor-based assessment

It is suggested that descriptor-based assessments can be used to combine the purposes of formative and summative assessment, but this chapter outlines why this isn’t really the case.

Statements given in formative assessments across many different lessons can potentially be aggregated to give a summative assessment, although problems arise with the aggregation itself: do students need to meet all of the statements at a particular level? To get round this, different systems apply different methods and algorithms.

I have direct experience of using this assessment model for grading student internal assessment in DP biology classes. The guide advocates best-fit grading.

Although different systems give different names to the categories and levels that they assign to assessments using descriptors, they aim to create a summative, shared meaning. Even if we don’t intend to create a summative assessment, the moment we give a grade or make a judgement about performance we are trying to create a shared meaning; we are making a summative inference. As discussed in the previous chapter, for that summative inference to be reliable the assessment has to follow certain restrictions: it has to be taken in standard conditions, distinguish between candidates and sample from a broad domain.

Every time we make a summative inference we are making a significant claim that requires a high standard of evidence to justify it.

This model does not provide valid formative data. Because it is based on descriptions of performance, the descriptors lack specificity about the underlying process and therefore cannot be used responsively in the way that true formative assessment should be. They do not analyse what causes performance; they are designed to assess final performance on complex tasks.

Descriptors are not useful for formative feedback because:

  1. They are descriptive, not analytic
  2. They are generic, not specific
  3. They are short term, not long term

Pupils may benefit from tasks that cannot be measured by the descriptors, and the descriptors do not provide a model of progression. For example, making inferences from text requires a large amount of background knowledge; gaining that knowledge, and assessing the gain, cannot be done using activities that these descriptors can assess.

You may argue that teachers are free to use whatever activities they want in lessons as long as they realise that only some of them are capable of contributing to the final grade.

However, some of the systems warn against using other tasks; they can be designed to be used every lesson and require judgements to be made against hundreds of different statements, leaving little time for anything else.

Therefore there is a lot of pressure to fill class time with the type of task you can get a grade from. Quizzes and vocab tests get a bad reputation because they don’t provide evidence for the final grade. If every lesson is set up to allow pupils to demonstrate fully what they are capable of, then lessons start to look less like lessons and more like exams.

Generic feedback based on descriptors does not analyse where the issues are or, importantly, tell students how to improve. Descriptors also restrict teachers to tasks that can be graded by descriptor, which makes it hard for the teacher to diagnose where a pupil has gone wrong. Even at more advanced levels of understanding there is still value in isolating the different components of a task to aid diagnosis.

Educators can’t even agree on what learning is; at least doctors and nurses agree on what health is!

Generic targets provide the illusion of clarity: they seem like they are providing feedback but they are actually more like grades. If a comment is descriptive and not analytic, then it is effectively grade-like in its function: it is accurate but unhelpful.

Finally, these descriptors do not distinguish between short-term performance and long-term learning. This is the most difficult factor in establishing reliable formative inferences. If we want to know whether a student needs more teaching on a topic, then a formative assessment taken right after a unit will not allow us to make such an inference.

In terms of summative assessments there are several reasons why using prose descriptors is problematic. These are:

  1. Judgements are unreliable – underlying questions can vary in difficulty yet still match the prose description. If different teachers, both within and between schools, use different questions and tasks (that match the prose description) to summatively assess students, this introduces uncertainty into the assessment, which then cannot produce a shared meaning. Ensuring examiners interpret these statements in the same way is also extremely difficult: descriptors lack specificity and can be interpreted differently, making judgements unreliable, and different judgements cannot produce a shared meaning.
  2. Differing working conditions – different tasks completed in different situations cannot produce a shared meaning. Subtle differences in how tasks are presented will influence how well students can do.
  3. Bias and stereotyping – humans are subject to unconscious bias and stereotypes, even more so under pressure, and this will influence the judgements that they make.

Notes on making good progress?: Chapter 3


Making valid inferences

If the best way to develop a skill is to practice the components that make it up, then it is hard to use the same type of task to assess both formatively and summatively.

Summative assessment tasks aim to generalise and create shared meaning beyond the context in which they are made. Pupils given one summative judgement in one school should be getting a similar judgement in another school.

Formative assessment aims to give teachers and students information to form the basis for successful action in improving performance.

Although, at times, a summative task may be repurposed as a formative task, different purposes generally pull assessments in different directions. The purpose of an assessment impacts on its design, which makes it harder to simplistically re-purpose it.

Assessments need to be thought of in terms of their reliability and their validity. The validity of an assessment refers to the inferences that we can draw from its results. The reliability is a measure of how often the assessment would produce the same results with all other factors controlled.

The example of timing of mocks comes to mind. Whether you want these to be a summative or a formative assessment will affect when you favour setting them.

Sampling (the amount of knowledge from a particular domain assessed by an assessment) affects the validity of an assessment. Normally in summative assessments questions sample the domain, they do not cover it in its entirety.

Some assessments do not have to sample. If the domain they are measuring is small (e.g. the letters of the alphabet), this isn’t a problem. Further along the educational pathway this becomes harder.

Assessments also need to be reliable. Unreliability is introduced into assessments through sampling, the marker (different markers may disagree) and the student (student performance can vary day to day).
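A toy simulation (my own illustration, not from the book) shows how sampling alone introduces unreliability: a hypothetical pupil who has mastered 70% of a 100-item domain sits several short tests, each drawn at random from that domain, and can get a noticeably different score each time:

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

DOMAIN_SIZE = 100
# Hypothetical pupil: has mastered 70 of the 100 items in the domain.
known = set(random.sample(range(DOMAIN_SIZE), 70))

def sit_test(n_questions=10):
    """Score on a short test sampled at random from the domain."""
    questions = random.sample(range(DOMAIN_SIZE), n_questions)
    return sum(q in known for q in questions) / n_questions

scores = [sit_test() for _ in range(5)]
print([f"{s:.0%}" for s in scores])  # scores scatter around the true 70%
```

Marker disagreement and the pupil’s day-to-day variation then add further noise on top of this sampling error.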

Models of assessment include the quality model and the difficulty model, and sources of unreliability affect each of them in different ways. The quality model requires markers to judge how well a student has performed (think figure skating); the difficulty model requires pupils to answer questions of increasing difficulty (think pole vault).

There is a trade-off between reliability and validity. A highly reliable MCQ assessment (with reduced sampling and marker error) may limit how many inferences you can make from it; you may be unable to use it as a summative assessment because it doesn’t properly match up with the final assessment.

However, reliability is a prerequisite for validity. If an assessment is not reliable, then the inferences drawn from it cannot be valid: unreliable results cannot support valid inferences.

You may well be able to create an exciting and open assessment task which corresponds to real-world tasks; however, if a pupil can end up with a wildly different mark depending on what task they are allocated, and who marks it, the mark cannot be used to support any valid inferences.

Summative assessments are required to support large and broad inferences about how pupils will perform beyond the school and in comparison to other peers. In order for such inferences to be valid they must be consistent.

Shared meanings impose restrictions on the design and administration of assessments. There are specific criteria needed for this. To distinguish between test takers, assessments need items of moderate difficulty and assessments must sample. Samples need to be carefully considered and representative.

The main inference needed from formative assessment is how to proceed next. The assessment still needs to be reliable, but the inferences do not need to be shared, even with kids in the same room. I can therefore help some kids more than others. It is about methods; it needs to be flexible and responsive.

The nature of the inference places a restriction on the assessment. Trying to make summative inferences from tasks that have been designed for formative use is hard to do reliably without sacrificing flexibility and responsiveness.

Assessment theory triangulates with cognitive psychology. The process of acquiring skills is different from the product, and the methods of assessing the process and the product are different too.

Formative assessments need to be developed by breaking down the skills and tasks that feature in summative assessments into tasks that will give valid feedback about how pupils are progressing towards that goal.

They can be integrated into one system, to be discussed in a later chapter.

Most schools make the mistake of assessing summatively far too frequently.


Notes on making good progress?: Chapter 2


Aims and methods

The generic skills approach implies that skills are transferable and that the best way to develop a skill is to practice that skill. It is based on the analogy of the mind acting as a muscle.

There are examples of curricula that follow this model, like the RSA Opening Minds curriculum; it is interesting to contrast this with the Core Knowledge curriculum mentioned by Hirsch and the DI models.

Instruction based on this model is organised around developing transferable skills through projects where students practice authentic performances.

However, research from 50 years of cognitive science shows us that skill is domain-specific and dependent on knowledge. The example of studies of chess grandmasters is given; these were among the earliest experiments but have been reliably replicated in other knowledge domains. There are multiple lines of evidence that suggest the same thing (a bit like evolutionary theory).

Complex skills depend on mental models which are specific to a particular domain. These models are built in long term memory. They can be drawn on to solve problems and prevent working memory from being overloaded.

Working memory is highly limited and relying on it to solve problems is highly ineffective. Learning is the process of acquiring these mental models; performance is the process of using them.

Formative assessment should aim to assess how these mental models are being developed. Summative assessment measures the performance or act of using those mental models.

Even scientific thinking is domain specific. One cannot evaluate anomalies or the plausibility of a hypothesis without domain specific knowledge.

The adoption of generic skills theory leads to a practical flaw: lessons are too careless about the exact content knowledge that is included in them (this also ties in with Hirsch’s ideas that individualisation leads to a reduction in knowledge). If skills are not transferable we need to be very careful about what the content of a lesson is. Specifics matter.

The educational aim of developing generic skills is sound but we need to think about the method. We can develop critical thinking only by equipping learners with knowledge. Good generic readers or problem solvers have a wide background knowledge.

Acquiring mental models is an active process. Project-based lessons can work, as they help students to make the knowledge their own, provided sufficient care and attention is applied to the content that is to be learned.

Specific and focussed practice is what is needed to develop skill. As shown by the work of K. Anders Ericsson, there is a difference between deliberate practice and performance. Deliberate practice builds mental models, while performance uses them.

What is learning and how is it different from performance? Learning is the creation of mental models of reality in long-term memory. Copying what experts do (performing) is not a useful way of developing the skill of the expert, because it does not build the mental models; instead, such lessons will overwhelm working memory and be counter-productive.

Generic skill models only allow feedback that is generic; they do not allow our feedback to tell students exactly how to improve. “Think from multiple perspectives more” is not useful advice if kids don’t know how to do it.

Models of progression are needed to show the progress that students are making.

Peer and self-assessment can be useful so long as they are used appropriately. The Dunning–Kruger effect shows us that novices cannot judge their performance accurately (does this contradict Hattie’s claim about self-reported grades, that kids know where they are at?). Developing pupils’ ability involves developing their ability to perceive quality. We cannot expect them to self-assess complex tasks, and showing them excellent work is not enough to develop excellence. Particular aspects of work need to be highlighted by the teacher.

To develop skill we need lots of specific knowledge and practice at using that knowledge. This helps to close the knowing-doing gap.

Notes on making good progress?: Chapter 1


Why didn’t Assessment for Learning transform our schools?

Formative assessment is when teachers use evidence of student learning to adapt instruction to meet student need. Its focus is on what students need to do to improve, on their weaknesses. Feedback, then, needs to be tailored thoughtfully to direct the student in how to improve and to allow students to act on that feedback.

Formative assessment should be used to diagnose weakness, and feedback should tell students explicitly how to improve. AfL is not just about teachers diagnosing weakness and being responsive; it is about students responding to information about their progress. This could be a good model for appraisal too.

There is a tension, then, between summative and formative assessment. One is about measuring student progress against the aims of education, while the other is about students finding out what they need to do to improve.

If we can agree on the aims of assessment, there is still a discussion about methods. We either favour the generic-skill method, which states that to get better at a particular performance you just need to practice that performance: to get better at critical thinking you just practice critical thinking, and to get better at writing essays you just write lots of them. Or we favour the deliberate practice method, which breaks the final skill down into its constituent parts and practices those: footballers practice dribbling, passing, defending and shooting, not just playing whole games all the time.

Summative assessment is about assessing progress against the aims of education. Formative assessment is about the methods you choose to meet those aims: generic or deliberate. Which one of these you subscribe to will affect your formative assessment, and thus whether assessment tasks can be used both formatively and summatively.

If you believe that skill acquisition is generic then formative assessment tasks will match the final summative task: you will write lots of essays, feedback can be given and a grade awarded. If you believe that the method of deliberate practice is better then you may need to design formative tasks that don’t look like the final task. These tasks cannot be used summatively because they don’t match the final task.

Interestingly, belief in generic skills leads down the road of test prep and a narrow focus on exam tasks, because this model suggests that to get better at exams you need to practice taking them.

In my mind the key questions for a school, curriculum level or department that wants to adopt the deliberate practice model should be:

  • What are the key skills being assessed in the final summative tasks (don’t forget that language or maths skills might be a large component of these)?
  • What sub-components make up these skills?
  • What tasks can be designed to appropriately formatively assess the development of these sub-skills?
  • What does deliberate practice look like in my subject?
  • How often should progress to the final summative task be measured i.e. how often should we set summative assessments in an academic year that track progress?