Understanding the limitations of NAPLAN

Disclaimer:
I am not a statistician. Though I studied statistics at university, it was a very long time ago. Further, I know very little, beyond what is mentioned in this post, about standardised testing. I’m probably not using the correct terms when I speak about test errors. Read everything below with a critical eye, and take my claims with a grain of salt… and sorry for any typos; by the end of writing I just wanted to push “publish.”

Time to read: 11 minutes

Part of my job at school this year has been looking at NAPLAN data. This isn’t something that anyone really wants to do. This post is about what I’ve discovered. I’m using dummy data for privacy reasons, yet all the findings mentioned in this post are in line with what I found using data from my school.

Introduction
People more qualified than me have written about the limitations of NAPLAN, in particular Margaret Wu, who makes it clear:

Still, it is hard for many school leaders to understand what this means for their yearly NAPLAN data, and even harder to communicate to other stakeholders. As such, I created a tool that allows school leaders and data nuts to visualise how NAPLAN’s errors affect our data.

Major Types of Errors
There are two major types of error that hang ever present over NAPLAN data: 1) student error, and 2) test extrapolation error.

Student Error
Student error acknowledges the fact that if a student gets 18 out of 25 in a spelling test this week, that doesn’t mean they’ll get 18 out of 25 in a spelling test next week, or the week after that, and so on. Of course, we wouldn’t expect someone who gets 18/25 to get 2/25 the next week either, so we know the score is a predictor, just an imprecise one. Wu suggests +/- 12.5% is a good guide, which means that our student who this week spelt 18/25 words correctly will probably get 15 to 21 words correct next week. Student error acknowledges that students sometimes make silly mistakes, and also that they sometimes get lucky and answer a question correctly that they usually wouldn’t.
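To make that rule of thumb concrete, here is a minimal sketch of Wu’s +/- 12.5% guide as I understand it; the function name and the rounding choices are mine, not Wu’s or NAPLAN’s.

```python
# A minimal sketch of Wu's +/- 12.5% rule of thumb as I understand it.
# The function name and the rounding behaviour are my own choices.

def plausible_range(raw_score: int, total_questions: int, margin: float = 0.125):
    """Rough range of raw scores we might see if the same student
    sat an equivalent test again."""
    spread = total_questions * margin                 # 12.5% of the test length
    low = max(0, round(raw_score - spread))           # can't score below zero
    high = min(total_questions, round(raw_score + spread))
    return low, high

print(plausible_range(18, 25))   # -> (15, 21), i.e. 15 to 21 words correct
```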

Test Extrapolation Error
The second error accounts for the ability of the test to accurately represent the development that it is seeking to assess. Take, for example, NAPLAN’s 2017 year 3 spelling test.

In this test there were 26 questions. If you answered zero out of 26 correctly you were placed in band 1. If you answered any question correctly, you were placed in band 2 or higher. In fact, 3 questions cover the whole of band 2, 6 cover band 3, 5 cover band 4, 5 cover band 5, and a further 8 questions cover bands 6 and above. Given that each band is supposed to represent a year’s worth of learning, how well can three to six questions (in the case of bands 2 to 5) represent all the words that students in these bands should know? How well does each of these words represent 20% of the total words of band 4 or band 5? Can we really believe the test can determine that a student who answers two of the five band 4 words correctly knows about 40% of the band 4 words?

Of course, this assumes that NAPLAN’s tests increase linearly in difficulty, so that once students encounter a word they cannot spell, they cannot correctly answer any of the harder words in the test. In this year’s spelling test this could not be further from the truth.

As part of our toolkit for analysing NAPLAN data, we were provided with a spreadsheet of the test questions ranked in order of difficulty. We were also provided with the state percentage correct for each test question. For the spelling tests in years 3 and 5 (I didn’t check years 7 and 9), NAPLAN’s ranked order of difficulty did not correlate with the actual difficulty of the words as measured by those state percentages. For example, while 91% of the state correctly answered the first and third easiest questions, only 56% of the state correctly answered what was considered to be the second easiest question!

In fact, according to Victorian state averages, band 2 contained the 1st, 3rd and 7th easiest words. Band 3, which is supposed to have the 4th, 5th, 6th and 7th easiest words, in fact contained the 4th, 10th, 14th and 16th easiest words. And, rather than the 8th, 9th, 10th, 11th, 12th and 13th easiest words, band 4 contained the 2nd, 8th, 15th, 20th and 21st easiest words. Therefore, a student who is identified as band 2, having answered only the three band 2 questions, has in all likelihood answered a word as hard as the hardest word that band 3 was supposed to contain!

It is true that actual difficulty and perceived difficulty in numeracy, reading and grammar line up much more closely than in spelling. Yet none of these tests comes anywhere close to being accurate to the point where a question can pinpoint a specific point in a student’s development.

The test extrapolation error therefore describes the difference between the test and reality. How well can the test, which in NAPLAN’s case may devote only three or four questions to a band, represent a student’s actual development?

When accounting for these two errors, it is clear that an individual student’s NAPLAN score is an approximation of the student’s development. How accurate an approximation it is remains open to debate.

NAPLAN’s Scaled Scores

NAPLAN converts raw scores into scaled scores, and scaled scores into NAPLAN bands. A scaled score of between 0 and 270 places a student in band 1, a scaled score of between 271 and 322 places the student in band 2, and so on, with each subsequent band, up to band 9, spanning 52 points. Twelve correct answers in one test doesn’t place a student in the same band as twelve correct answers in another: each and every test is scaled differently, with the raw score in each test uniquely converted to a scaled score.
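A minimal sketch of that scaled-score-to-band mapping, using the cut-offs described above (band 1 topping out at 270, band 2 at 322, and each further band spanning 52 points up to band 9); the function name is mine and the higher cut-offs simply extend the pattern for illustration.

```python
# A sketch of the scaled-score-to-band mapping described above. Band 1 ends
# at 270, band 2 at 322, and each further band spans another 52 points up to
# band 9. The higher cut-offs simply extend that pattern for illustration.

def scaled_score_to_band(score: float) -> int:
    upper = 270                      # top of band 1
    for band in range(1, 9):
        if score <= upper:
            return band
        upper += 52                  # each subsequent band spans 52 points
    return 9                         # anything above the band 8 cut-off

print(scaled_score_to_band(265.0))   # -> 1
print(scaled_score_to_band(310.0))   # -> 2
```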

ACARA publish equivalence tables each year (apparently the 2017 tables are due out in December), which show how the raw score (number of correct answers) is translated into a scaled score and NAPLAN band for each test. As you can see, the 2016 year 3 spelling test was scored a bit differently to the 2017 test, with a student needing three correct answers to achieve band 2.

What is not apparent is that, while the scaled scores are precise to five decimal places, students cannot receive just any score; they can only receive the scores that correspond to the possible raw scores. In the case of the above table, two students with raw scores of 3 and 4 are given scaled scores of 296.25472 and 316.08247 respectively. There is no way to achieve a score between these two points. Considering that this gap equates to roughly 20 points, and a year’s worth of learning is 52 points, NAPLAN cannot be, and is not, precise. For those who believe NAPLAN can show learning to within months, this gap of 20 points is a gap of 4.6 months! Hardly precise.
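A quick check of that arithmetic, using the convention that 52 scaled points represent one NAPLAN year of learning:

```python
# Checking the arithmetic above: the gap between the two achievable scaled
# scores, expressed in "NAPLAN months" (52 points = one NAPLAN year = 12 months).
gap = 316.08247 - 296.25472          # ~19.8 points between raw scores 3 and 4
months = gap / 52 * 12               # ~4.6 months of "learning"
print(f"{gap:.1f} points, {months:.1f} months")
```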

It is true that, in the middle of the equivalence tables, each additional question usually adds 9 or 10 scaled points. Though this still equates to increments of more than two months, even if we believe there aren’t any student errors or extrapolation errors.

Let’s look further at the table.

I’m going to leave most of the data in this table to the real statisticians, but I will focus on the Scale SE column. NAPLAN defines a scale standard error for each scaled score. In the table above, you can see that a raw score of zero translates to a scaled score of 182.95034 with a scale standard error of 61.60665. The scale standard errors quickly reduce from this high of 61 to 36, then 30, and for many of the tests become as low as 20 or 21.

ACARA does not, anywhere that I can find, communicate what the scale standard error is compensating for, though I would assume that it is an attempt to allow for the test extrapolation error. They do, though, make it clear in some documents that the scale standard error corresponds to a single standard deviation. This means that there is a 68% likelihood that a student’s development falls within 1 standard deviation of the scaled score, a 90% likelihood within 1.64 standard deviations, and a 95% likelihood within 1.96 standard deviations.

If we assume that this error is normally distributed, then a student with a raw score of 4 in the 2016 year 3 spelling test would be 68% likely to have a NAPLAN score between 291 and 341. They would be 90% likely to have a NAPLAN score between 275 and 357, a difference of more than one and a half NAPLAN years. And they would be 95% likely to have a NAPLAN score between 267 and 365, a difference of 98 NAPLAN points or nearly two NAPLAN years!
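A sketch of those intervals, treating the scale standard error as one standard deviation of a normal distribution; the standard error of 25 points used here is the value implied by the 95% interval above, so treat it as my reading rather than an official figure.

```python
# Confidence intervals around the scaled score for a raw score of 4 in the
# 2016 year 3 spelling test, treating the scale standard error as one
# standard deviation of a normal distribution. The SE of 25 is implied by
# the 95% interval quoted above, not taken directly from ACARA.

scaled_score = 316.08247
scale_se = 25.0

for label, z in [("68%", 1.0), ("90%", 1.64), ("95%", 1.96)]:
    low, high = scaled_score - z * scale_se, scaled_score + z * scale_se
    width_in_years = (high - low) / 52          # 52 points = one NAPLAN year
    print(f"{label}: {low:.0f} to {high:.0f} ({width_in_years:.1f} NAPLAN years wide)")
```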

Visualising NAPLAN Errors
Take, for example, a fictitious student who had a raw score of 14 on the 2015 year 3 reading test. I’m using 2015 as an example because later I’ll look at relative growth between a student’s year 3 test in 2015 and year 5 test in 2017.

This raw score of 14 converts to a scaled score of 310 (band 2), with a scale standard error of 26.42. Remember, band 2 begins at 271 and ends at 322. Using NAPLAN’s scaled score and scale standard error, I randomly generated 10,000 scores within the normal distribution, in order to see the range of possibilities and their likelihood. This is the result; the last four columns show bands 1 to 4, with the green highlight indicating the band that NAPLAN places the student in. Note: the code also highlights in orange when probabilities are greater than 20%, an arbitrary decision.
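For anyone wanting to reproduce the approach, here is a minimal sketch of the sampling, assuming the band cut-offs follow the 52-point pattern described earlier; it is my reconstruction of the method, not the code from the tool itself.

```python
# Draw 10,000 scores from a normal distribution centred on the scaled score,
# with the scale standard error as the standard deviation, then count how
# many draws fall into each band. Band cut-offs follow the 52-point pattern
# described earlier; this is a reconstruction, not NAPLAN's own code.

import numpy as np

def band_probabilities(scaled_score, scale_se, n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    draws = rng.normal(scaled_score, scale_se, n_samples)
    cutoffs = [270 + 52 * i for i in range(8)]        # tops of bands 1 to 8
    bands = np.searchsorted(cutoffs, draws) + 1       # band 1 to band 9
    return {band: float(np.mean(bands == band)) for band in range(1, 10)}

# The raw score of 14 example: scaled score 310, scale SE 26.42.
probs = band_probabilities(310, 26.42)
print({band: round(p, 2) for band, p in probs.items() if p > 0})
# Roughly: band 1 ~7%, band 2 ~61%, band 3 ~32%, band 4 ~1%
```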

With a scaled score of 310, it is, unsurprisingly, most likely (61%) that the student is in band 2. However, as the student sits close to the top of band 2 (310 as opposed to the cut-off of 322), there is also a 32% chance that they might actually be in band 3. There is also a small chance that they are in band 1 (7%) or band 4 (1%)!

What about a student who sits just inside band 3? In the same test, a student with a raw score of 15 is allocated a scaled score of 323.23804, 13 more NAPLAN points (a quarter of a NAPLAN year) than the student with a raw score of 14. Random sampling 10,000 times suggests only a 50% likelihood that a student just inside the band, by 1.23804 points, is actually in that band.

There is around a 46% probability that the student is in the band below (band 2), and again a small chance that the student is actually in band 1 (2%) or band 4 (2%).

The probabilities for a student who sits in the middle of band 2 (297 points) look like this:

They are much more likely (67%) to be in the band than the students on the edge of the band, but it is also quite likely (15% and 17%) that they are in the band below or above. As such, it is relatively easy to imagine how likely it is that an individual student’s result is accurate. A rule of thumb, since we’re using NAPLAN’s own errors, might be: if a student sits in the middle of a band, there is about a two in three chance that they’ve been placed in the correct band; if they are on the edge, it is about one in two.

And, in a cohort of 20, 30, 50, or more students, that is a lot of students who are probably placed in the wrong band.

ACARA would probably argue that in writing and spelling, unlike reading and numeracy, the scale standard errors in the middle of the table are around 19 rather than 25. Yet even then, this only appears to increase the accuracy by about 4%.

What if NAPLAN underestimates the test error?

I wanted to explore how bigger errors affect NAPLAN’s results, as personally I’m sceptical that 25 scale standard error points are enough to adequately account for all of the errors in this type of test, as I discussed in the first section of this post.

The three fictitious students we are following are shown below with their relative probabilities based on the official NAPLAN errors.

If the standard error is increased by 10 points (see table below), the confidence that the bands are accurate drops considerably: for the student in the middle of the band, from 66% to 52%; for the student just inside the band, from 49% to 44%; and for the student near the top of the band, from 49% to 40%.

When the standard error is increased by a further 10 points, to 20 points above NAPLAN’s published error, the results for these three students are:

And a further 10 points takes the error to a bit more than a full NAPLAN year.
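The sketch below re-runs the same sampling with those inflated standard errors; the scores and base standard error come from the 2015 year 3 reading example, and applying the same standard error to all three students is a simplification on my part.

```python
# Re-run the band sampling with the scale standard error inflated by 10, 20
# and 30 points. Scores come from the 2015 year 3 reading example; using the
# same base SE of 26.42 for all three students is a simplification.

import numpy as np

students = {
    "middle of band 2": 297.0,
    "top of band 2": 310.0,
    "just inside band 3": 323.23804,
}
cutoffs = [270 + 52 * i for i in range(8)]            # tops of bands 1 to 8

rng = np.random.default_rng(0)
for extra in (0, 10, 20, 30):
    se = 26.42 + extra
    parts = []
    for label, score in students.items():
        draws = rng.normal(score, se, 10_000)
        allocated = np.searchsorted(cutoffs, score) + 1
        sampled = np.searchsorted(cutoffs, draws) + 1
        parts.append(f"{label}: {np.mean(sampled == allocated):.0%}")
    print(f"SE +{extra:>2} points -> " + ", ".join(parts))
```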

If NAPLAN’s scale standard errors are accurate, then the results at the individual student level are really only a guide, with at best a two in three chance of placing the student in their actual band. Yet if these standard errors, which usually amount to roughly the number of points that equates to two questions, are underestimated, then the accuracy of NAPLAN is even worse.

When considering 1) the simple errors (or lucky guesses) that a student might make in the test, and 2) the ability of 3, 4 or 5 questions to assess a whole band’s worth of learning, are we satisfied that a scale standard error of 19 to 26 points accounts for these? If NAPLAN published their research, maybe we would be able to assess these questions more accurately, but until then I, for one, am highly sceptical.


Understanding Relative Growth
One of the measures that many Victorian schools are being encouraged to include in their four-year strategic plans is NAPLAN relative growth. But can it be trusted?

NAPLAN describes relative growth as such:

The above table seems to suggest that relative growth is understood as being relative to a student’s prior NAPLAN performance. Yet in reality relative growth is concerned with a student’s performance relative to other students. The points range depends on the actual scores of the students, and differs from test to test and from starting raw score to starting raw score.

NAPLAN relative growth compares a student’s improvement in NAPLAN over a two-year period. To do so, a student is compared with all other students who obtained the same raw score two years prior. For example, a student with a raw score of 14 in the year 3 reading test is compared with all of the other students who also scored 14 in the same test. The scores this group achieves in the test two years later are analysed to determine the raw scores that correspond to the bottom 25% and the top 25%. Students who score below the 25% mark are determined to have produced low growth, students above the 75% mark are determined to have produced high growth, and students between the two marks are determined to have produced average growth.
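A minimal sketch of that calculation; the cohort data below is invented purely to show the mechanics, and exactly how ACARA handles students sitting on the 25% and 75% marks is not something I know.

```python
# Classify one student's later raw score against the cohort of students who
# had the same earlier raw score. The cohort below is invented to show the
# mechanics; how ACARA treats ties at the cut-offs is an open question.

import numpy as np

def relative_growth(later_raw_score, cohort_later_scores):
    p25, p75 = np.percentile(cohort_later_scores, [25, 75])
    if later_raw_score < p25:
        return "low growth"
    if later_raw_score > p75:
        return "high growth"
    return "average growth"

# Invented year 5 raw scores for students who all scored 14 in year 3.
cohort = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
print(relative_growth(14, cohort))    # -> "average growth"
print(relative_growth(22, cohort))    # -> "high growth"
```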

NAPLAN does not release the 25% and 75% cut-offs for each starting raw score; however, they can be somewhat reverse engineered from the reports and data they do release to schools. For reading, year 3 2015 to year 5 2017, the relative growth looks something like this, based on the data I obtained from my school (59 students), and it is no doubt not 100% accurate. Note: the students shown below are dummy data.

In the above table, low growth is shown in red, average growth in yellow, and high growth in green. The y-axis shows the raw scores and bands for the year 3 test, and the x-axis shows the year 5 test raw scores and bands.

As with the indicative NAPLAN table, it is clear that the range of scores for average growth narrows as the raw scores increase. A student with a raw score of 14 in year 3 needs a raw score of between 12 and 20 to have average growth, and a raw score of 21 or more to produce high growth. As such, for students with lower scores in the year 3 test, the range of subsequent scores for the students with average growth is about nine questions.

Another student, who has a raw score of 35 in year 3, needs a raw score between 32 and 36 to produce average growth, and 37 or higher to produce high growth. That is, 50% of all students who scored 35 in the year 3 test scored in the range of 32 to 36, a five-question spread, in the year 5 test. This average range, which can be as narrow as four or five questions for higher-performing students, makes relative growth a worrisome measure of school effectiveness, especially when we factor in the errors previously discussed.

Let’s now look at how relative growth applies to our three example students. In 2015, they sat the year 3 reading test and their results and probabilities are shown below.

In 2017, they sat the year 5 reading test and their results and probabilities are shown below.

In the table below, our three fictitious example students are:
1. Jane Smith, who had a raw score of 13 in year 3, and 10 in year 5, resulting in her being categorised as having low growth.
2. Kristi Lidija, who had a raw score of 14 in year 3, and 14 in year 5, resulting in her being categorised as having average growth.
3. John James, who had a raw score of 15 in year 3, and 21 in year 5, resulting in him being categorised as having average growth.

Note: As ACARA doesn’t publish these figures, I can only guess where the 25% and 75% marks sit. For example, I am reasonably confident that, for a year 3 raw score of 14, a year 5 raw score of 11 counts as low growth, while a year 5 raw score of 12 counts as average growth. However, I’ve had to guess the scaled scores (based on previous years’ data) and I estimate they are about 12 NAPLAN points apart. Is the 25% mark halfway between these two scaled scores? Who knows? All these educated guesses make the following probabilities only estimates.

We can see in the above table that all three students sit close to the edge of the ranges; in my school’s real data this was very much the case. This is also more likely for students in the higher bands, where the average range can narrow to only four or five questions, as described at the start of this section.

In the case of the student with low growth, we can see below that the probability that they are correctly identified as having low growth is approximately 62%. If this student had answered one less question correctly in the year 3 test, or one more question correctly in the year 5 test, then they would have been categorised as having average growth. As per the previous examples, these probabilities have been calculated by random sampling.
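A simplified sketch of how such a probability can be estimated; it only varies the year 5 scaled score, whereas the full calculation also has to deal with uncertainty in the year 3 score, and the cut-off scaled scores below are placeholders standing in for my reverse-engineered estimates.

```python
# Estimate how often a student's growth category would survive resampling of
# their year 5 scaled score. Simplified: only the year 5 score is varied here,
# and the growth cut-off scaled scores are placeholder estimates.

import numpy as np

def growth_category(y5_scaled, low_cutoff, high_cutoff):
    if y5_scaled < low_cutoff:
        return "low"
    if y5_scaled > high_cutoff:
        return "high"
    return "average"

def category_stability(y5_scaled, y5_se, low_cutoff, high_cutoff,
                       n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    draws = rng.normal(y5_scaled, y5_se, n_samples)
    allocated = growth_category(y5_scaled, low_cutoff, high_cutoff)
    same = [growth_category(d, low_cutoff, high_cutoff) == allocated for d in draws]
    return float(np.mean(same))

# Placeholder numbers: a year 5 scaled score sitting just below the low-growth cut-off.
print(category_stability(y5_scaled=400.0, y5_se=25.0,
                         low_cutoff=410.0, high_cutoff=470.0))
```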


Similarly to students whose scaled scores sit right next to a band limit, students whose relative growth places them right next to a cut-off end up with probabilities of around 50%.

In practice, NAPLAN relative growth is so unreliable that I cannot believe it is a suitable measure, and I would personally discourage anyone from using it. The narrow range of questions that defines average growth, compounded by the error inherent in NAPLAN’s testing method, makes it an extremely unreliable measure.

Note: The code I use only has relative growth limits for reading, 2015 year 3 to 2017 year 5, as they are time consuming to reverse engineer. I also use another method, but that method is much less reliable. Hence the different results when you use the two relative growth reports in my code.

Five-Year Trend Analysis
I actually started my NAPLAN journey looking at the five-year trend analysis. The code needs to run on a server and is much less user friendly, so I haven’t released it yet. My findings suggest that five-year trends are more accurate than other NAPLAN data, but at most they are in the ballpark of around 80% accurate in predicting the trajectory of the graph from year to year. Over the summer, I hope to include this in the code base and update this post.

A demo of the code used in this project can be found at:
https://www.richardolsen.me/naplan/getting-started

The code can be downloaded at:
https://github.com/richardolsen/naplan-vis

Any comments, corrections and suggestions are appreciated.