Understanding the limitations of NAPLAN

I am not a statistician. Though I studied statistics at university, it was a very long time ago. Further, I know very little about standardised testing beyond what is mentioned in this post, and I'm probably not using the correct terms when I speak about test errors. Read everything below with a critical eye, and take my claims with a grain of salt… and sorry for the typos; at the end of writing I just wanted to push "publish."

Time to read: 11 minutes

Part of my job at school this year has been looking at NAPLAN data. This isn't something that anyone really wants to do. This post is about what I've discovered. I'm using dummy data for privacy reasons, yet all findings mentioned in this post are in line with what I found using data from my school.

People more qualified than me have written about the limitations of NAPLAN, in particular Margaret Wu, who makes it clear:

Still, it is hard for many school leaders to understand what this means for their yearly NAPLAN data, and even harder to communicate to other stakeholders. As such, I created a tool that allows school leaders and data nuts to visualise how NAPLAN's errors play out in our data.

Major Types of Errors
There are two major types of error which hang ever present over NAPLAN data: 1) student error, and 2) test extrapolation error.

Student Error
Student error acknowledges the fact that if a student gets 18 out of 25 in a spelling test this week, that doesn't mean they'll get 18 out of 25 in a spelling test next week, or the week after that, and so on. Of course, we wouldn't expect someone who gets 18/25 to get 2/25 the next week either. So we know the score is a predictor, just an imprecise one. Wu suggests +/- 12.5% is a good guide, which means that our student who this week spelt 18/25 words correctly will probably get 15 to 21 words correct next week. Student error acknowledges that students sometimes make silly mistakes, and also that sometimes they get lucky and answer a question correctly that they usually wouldn't.
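Wu's rule of thumb is simple enough to sketch in code. This is my own illustration; the function name and the rounding are my choices, not Wu's:

```python
def student_error_range(raw_score, total_marks, error_rate=0.125):
    """Range of scores a student might plausibly get on a retest,
    using Wu's +/- 12.5% rule of thumb."""
    margin = total_marks * error_rate  # 12.5% of the marks available
    low = max(0, round(raw_score - margin))
    high = min(total_marks, round(raw_score + margin))
    return low, high

# A student who spelt 18/25 words correctly this week will probably
# score somewhere between 15 and 21 next week.
print(student_error_range(18, 25))  # -> (15, 21)
```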

Test Extrapolation Error
The second error accounts for the ability of the test to accurately represent the development that it is seeking to assess. Take for example, NAPLAN’s 2017 Year 3 spelling test.

In this test there were 26 questions. If you answered zero out of 26 correctly, you are in band 1. If you answered any question correctly, you are in band 2 or higher. In fact, 3 questions cover the entire band 2, 6 cover band 3, 5 cover band 4, 5 cover band 5, and 8 further questions cover bands 6 and above. Given that each band is supposed to represent a year's worth of learning, how well can three to six questions (in the case of bands 2 to 5) represent all the words that students in these bands should know? How well does each of these words represent 20% of the total words of bands 4 or 5? Can we really believe the test can determine that a student who answers two of the five band 4 words correctly knows about 40% of the band 4 words?

Of course, this assumes NAPLAN's tests increase linearly in difficulty, where once students encounter a word they cannot spell, they cannot correctly answer any of the harder words in the test. In this year's spelling test this could not be further from the truth.

As part of our toolkit for analysing NAPLAN data, we were provided with a spreadsheet of the test questions ranked in order of difficulty. We were also provided with the state percentage correct for each test question. For the spelling tests in years 3 and 5 (I didn't check years 7 and 9), what NAPLAN believed was the order of difficulty did not correlate with the difficulty actually observed in the state results. For example, while 91% of the state correctly answered the first and third easiest questions, only 56% of the state correctly answered what was considered to be the second easiest question!

In fact, according to Victorian state averages, band 2 contained the first, third and seventh easiest words. Band 3, which is supposed to have the 4th, 5th, 6th and 7th easiest words, in fact contained the 4th, 10th, 14th and 16th easiest words. And, rather than the 8th, 9th, 10th, 11th, 12th and 13th easiest words, band 4 contained the 2nd, 8th, 15th, 20th and 21st easiest words. Therefore, a student who is identified in band 2, having only been able to answer the three easiest questions, may well have answered what was perceived to be the hardest question in band 3!

It is true that actual difficulty and perceived difficulty in numeracy, reading and grammar line up much more accurately than in spelling. Yet none of these is anywhere close to being accurate to the point where a question can pinpoint a specific point in a student's development.

The test extrapolation error therefore describes the difference between the test and reality. How well can the test, in NAPLAN's case with maybe only three or four questions per band, represent a student's actual development?

When accounting for these two errors, it is clear that an individual student's NAPLAN score is an approximation of the student's development. How accurate an approximation is open to debate.

NAPLAN’s Scaled Scores

NAPLAN converts raw scores into scaled scores, which are then mapped into NAPLAN bands. In the case of the 2017 year 3 spelling test, a scaled score of between 0 and 270 places a student in Band 1, a scaled score of between 271 and 322 places the student in Band 2, and so on, with each subsequent band, up to Band 9, comprising 52 points. Twelve correct answers does not place a student in the same band on every test. Each and every test is scaled differently, with the raw score in a test uniquely converted to the scaled score.

ACARA publish equivalence tables each year (apparently the 2017 tables are due out in December) which show how the raw score (number of correct answers) is translated into a NAPLAN band for each test. As you can see, the 2016 year 3 spelling test was scored a bit differently to the 2017 test, with a student needing three correct answers to achieve band 2.

What is not apparent is that while the scaled scores are precise to five decimal places, students cannot get just any score; instead they get the score that corresponds with their raw score. In the case of the above table, two students with raw scores of 3 and 4 are given scaled scores of 296.25472 and 316.08247 respectively. There is no way to achieve a score between these two points. Considering that this gap equates to 20 points, and a year's worth of learning is 52 points, NAPLAN cannot be, and is not, precise. For those who believe NAPLAN can show learning to a year or months, this gap of 20 points is a gap of 4.6 months! Hardly precise.
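The months figure falls out of the 52-points-per-year assumption. A back-of-envelope sketch of the conversion:

```python
POINTS_PER_NAPLAN_YEAR = 52  # each band is meant to represent a year of learning

def points_to_months(points):
    """Convert a scaled-score gap into months of learning."""
    return points / POINTS_PER_NAPLAN_YEAR * 12

# The jump between raw scores of 3 and 4 in the table above:
gap = 316.08247 - 296.25472
print(round(points_to_months(gap), 1))  # -> 4.6
```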

It is true that in the middle of the equivalence tables, each question usually has a scaled score gap of 9 or 10 points. Though this still equates to two- to three-month increments, even if we believe there aren't any student errors or extrapolation errors.

Let’s look further at the table.

I'm going to leave most of the data in this table to the real statisticians, but I will focus on the Scale SE column. NAPLAN defines a scaled standard error for each scaled score. In the table above, you can see a raw score of zero translates to a scaled score of 182.95034 with a scale standard error of 61.60665. The scale standard errors quickly reduce from this high of 61 to 36, then 30, and for many of the tests become as low as 20 or 21.

ACARA does not, anywhere that I can find, communicate what the scale standard error is compensating for, though I would assume that it is an attempt to allow for the test extrapolation error. They do, though, make it clear in some documents that the scale standard error corresponds to a single standard deviation. This means that there is a 68% likelihood that a student's development falls within 1 standard deviation of the scaled score, a 90% likelihood within 1.64 standard deviations, and a 95% likelihood within 1.96 standard deviations.

If we assume that this error is normally distributed, then a student with a raw score of 4 in the 2016 year 3 spelling test would be 68% likely to have a NAPLAN score between 291 and 341. They would be 90% likely to have a NAPLAN score between 275 and 357, a difference of 82 NAPLAN points or more than one and a half NAPLAN years. And they would be 95% likely to have a NAPLAN score between 267 and 365, a difference of 98 NAPLAN points or nearly two NAPLAN years!
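These intervals fall straight out of the normal distribution. A sketch using Python's standard library, assuming a scale standard error of 25 (the value implied by the 98-point 95% interval):

```python
from statistics import NormalDist

def confidence_interval(scaled_score, scale_se, confidence):
    """Symmetric interval around a scaled score, treating the scale SE
    as one standard deviation of a normal distribution."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. ~1.96 for 95%
    return scaled_score - z * scale_se, scaled_score + z * scale_se

# Raw score of 4 -> scaled score 316.08247, assumed scale SE of 25:
for conf in (0.68, 0.90, 0.95):
    low, high = confidence_interval(316.08247, 25, conf)
    print(f"{conf:.0%}: {low:.0f} to {high:.0f}")
```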

Visualising NAPLAN Errors
Take, for example, a fictitious student who had a raw score of 14 on the 2015 year 3 reading test. I'm using 2015 as an example because later I'll look at relative growth between a student who sat the year 3 test in 2015 and the year 5 test in 2017.

This score of 14 converts to a scaled score of 310 (band 2), with a scale standard error of 26.42. Remember, band 2 begins at 270 and ends at 322. Using NAPLAN's scaled score and scale standard error, I randomly generated 10,000 scores within the normal distribution, in order to see the range of possibilities and their likelihood. This is the result; the last four columns show bands 1 to 4, with the green highlight indicating the band that NAPLAN places the student in. Note: the code also highlights in orange when probabilities are greater than 20%, an arbitrary decision.
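The sampling itself takes only a few lines. A minimal sketch of the approach; the band boundaries are my reading of the 52-point bands and may not match the official scale exactly:

```python
import random

# Band boundaries: band 1 below 270, then 52-point bands (my assumption).
BAND_EDGES = [270, 322, 374]

def band_probabilities(scaled_score, scale_se, n=10_000, seed=1):
    """Estimate the chance that a student's 'true' score falls in each band,
    by sampling the normal distribution defined by the scale standard error."""
    rng = random.Random(seed)
    counts = [0] * (len(BAND_EDGES) + 1)
    for _ in range(n):
        draw = rng.gauss(scaled_score, scale_se)
        counts[sum(draw >= edge for edge in BAND_EDGES)] += 1
    return [c / n for c in counts]

# Raw score 14 -> scaled score 310, scale SE 26.42; bands 1 to 4:
print(band_probabilities(310, 26.42))  # roughly [0.07, 0.61, 0.31, 0.01]
```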

With an initial scaled score of 310, it is, unsurprisingly, most likely (61%) that the student is in band 2. However, as the student sits close to the top of band 2 (310, as opposed to the band boundary of 322), there is also a 32% chance they might actually be in band 3. There is also a small chance that they are in band 1 (7%) or band 4 (1%)!

What about a student who only just sits in band 3? In the same test, a student with a raw score of 15 is allocated a scaled score of 323.23804, 13 more NAPLAN points (a quarter of a NAPLAN year) than the student with a raw score of 14. Random sampling 10,000 times suggests a 50% likelihood that a student just inside the band, by 1.23804 points, is actually in that band.

There is around a 46% probability that the student is in the band below (band 2), and again a small chance that the student is actually in band 1 (2%) or band 4 (2%).

The probabilities for a student who sits in the middle of band 2 (297 points) look like this:

At 67%, they are much more likely to be in the band than the students on the edge of the band, but it is also quite likely (15% and 17%) that they are in the band below or above. As such, it is relatively easy to imagine how likely it is that an individual student's result is accurate. A rule of thumb, as we're using NAPLAN's own errors, might be: if a student sits in the middle of a band, there is about a two in three chance that they've been placed in the correct band; if they are on the edge, it is about one in two.

And, in a cohort of 20, 30, 50, or more students, that is a lot of students who are probably placed in the wrong band.

ACARA would probably point out that in writing and spelling, unlike reading and numeracy, the scale standard errors in the middle of the table are around 19, not 25. Yet even then, this only appears to increase the accuracy by about 4%.

What if NAPLAN underestimates the test error?

I wanted to explore how bigger errors affect NAPLAN's results, as personally I'm sceptical that 25 scale standard error points are enough to adequately account for all of the errors in this type of test, as I discussed in the first section of this post.

The three fictitious students we are following are shown below with their relative probabilities based on the official NAPLAN errors.

If the standard error is increased by 10 points (see table below), the confidence that the bands are accurate drops considerably: for the student in the middle of the band, from 66% to 52%; for the student just in the band, from 49% to 44%; and for the student near the top of the band, from 49% to 40%.
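The same experiment can be done analytically with the normal CDF. A sketch for the mid-band student, assuming band 2 runs from 270 to 322 and a base SE of 26.42; the exact percentages will differ slightly from the sampled tables:

```python
from statistics import NormalDist

def in_band_probability(scaled_score, scale_se, band_low=270, band_high=322):
    """Probability that the student's true score lies inside the band,
    assuming a normally distributed error."""
    dist = NormalDist(scaled_score, scale_se)
    return dist.cdf(band_high) - dist.cdf(band_low)

# Mid-band student (297 points) as the assumed error grows:
for extra in (0, 10, 20, 30):
    p = in_band_probability(297, 26.42 + extra)
    print(f"SE + {extra:2d} points: {p:.0%} chance the band is right")
```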

When the error is increased by a further 10 points, to 20 extra points, the results for these three students are:

And a further 10 points again, taking the error to a bit more than a full NAPLAN year.

If NAPLAN's scale standard errors are accurate, then results at the single-student level are really only a guide, with at best a two in three chance of placing the student in their actual band. Yet if these standard errors, which usually equate to around two questions' worth of points, are underestimated, then the accuracy of NAPLAN is even worse.

When considering 1) the simple errors (or lucky guesses) that a student might make in the test, and 2) the ability of 3, 4 or 5 questions to assess a whole band's learning, are we satisfied that a scaled error of 19 to 26 accounts for these? If NAPLAN published their research, maybe we would be able to assess these questions more accurately, but until then I, for one, am highly sceptical.


Understanding Relative Growth
One of the measures that many Victorian schools are being encouraged to include in their four-year strategic plans is NAPLAN relative growth. But can it be trusted?

NAPLAN describes relative growth as such:

The above table seems to suggest that relative growth is understood as being relative to a student's prior NAPLAN performance. Yet in reality, relative growth is concerned with a student's performance relative to other students. The points range depends on the actual scores of the students and differs from test to test, and question to question.

NAPLAN relative growth compares a student's improvement in NAPLAN over a two-year period. To do so, a student is compared with all other students who obtained the same raw score two years prior. For example, a student with a raw score of 14 in the year 3 reading test is compared with all of the other students who also scored 14 in the same test. The scores this group obtained in the test two years later are analysed to determine the raw scores that correspond to the bottom 25% and the top 25%. Students who score below the 25% mark are determined to have produced low growth, students above the 75% mark are determined to have produced high growth, and students between the marks are determined to have produced average growth.
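My understanding of the method can be sketched as follows. This is my reconstruction, not ACARA's actual algorithm, and the quartile convention is an assumption:

```python
from statistics import quantiles

def relative_growth(peer_year5_scores, student_year5_score):
    """Categorise a student's growth against peers who had the same
    year 3 raw score two years earlier."""
    q25, _, q75 = quantiles(peer_year5_scores, n=4)  # 25th, 50th, 75th percentiles
    if student_year5_score < q25:
        return "low"
    if student_year5_score > q75:
        return "high"
    return "average"

# Dummy peers who all scored 14 in year 3, with assorted year 5 raw scores:
peers = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
print(relative_growth(peers, 11))  # -> low
print(relative_growth(peers, 15))  # -> average
print(relative_growth(peers, 21))  # -> high
```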

NAPLAN does not release the 25% and 75% marks for each raw score; however, they can be somewhat reverse engineered from the data in the reports they do release to schools. For reading, year 3 2015 to year 5 2017, the relative growth looks something like this, based on the data I obtained from my school (59 students), and it is no doubt not 100% accurate. Note: the students placed below are dummy data.

In the above table, low growth is shown in red, average growth in yellow, and high growth in green. The y-axis shows the raw scores and bands for the year 3 test, and the x-axis shows the year 5 test raw scores and bands.

As with the indicative NAPLAN table, it is clear that the range of scores for average growth narrows as raw scores increase. A student with a raw score of 14 in year 3 needs a raw score of between 12 and 20 to have average growth, and a raw score of 21 or more to produce high growth. As such, for students with lower scores in the year 3 test, the range of subsequent scores producing average growth is about nine questions.

Another student, who has a raw score of 35 in year 3, needs a raw score between 32 and 36 to produce average growth, and 37 or higher to produce high growth. That is, 50% of all students who scored 35 in the year 3 test score in the range of 32 to 36, a five-question spread, in the year 5 test. This average range, which can be as narrow as four or five questions for higher-performing students, makes relative growth a worrisome measure for school effectiveness, especially when we factor in the errors previously discussed.

Let’s now look at how relative growth applies to our three example students. In 2015, they sat the year 3 reading test and their results and probabilities are shown below.

In 2017, they sat the year 5 reading test and their results and probabilities are shown below.

In the table below, our three fictitious example students were:
1. Jane Smith, who had a raw score of 13 in year 3, and 10 in year 5, resulting in her being categorised as having low growth.
2. Kristi Lidija, who had a raw score of 14 in year 3, and 14 in year 5, resulting in her being categorised as having average growth.
3. John James, who had a raw score of 15 in year 3, and 21 in year 5, resulting in him being categorised as having average growth.

Note: As ACARA doesn't publish these figures, I can only guess where the 25% and 75% marks sit. For example, I'm reasonably confident that for a year 3 raw score of 14, a year 5 raw score of 11 is low growth, while a year 5 raw score of 12 is average growth. However, I've had to guess the scaled scores (based on the previous year's data) and I estimate they are about 12 NAPLAN points apart. Is the 25% mark halfway between these two scaled scores? Who knows? All these educated guesses make the following probabilities only estimates.

We can see in the above table that all three students sit close to the edge of the ranges; in my school's real data this was very much the case. This is also more likely for students in the higher bands, where the average range can narrow to only four or five questions, as described at the start of this section.

In the case of the student with low growth, we can see below that the probability that they are correctly identified as low growth is approximately 62%. If this student had answered one less question correctly in the year 3 test, or one more question correctly in the year 5 test, they would have been categorised as having average growth. As per the previous examples, these probabilities have been calculated by random sampling.


Similarly to students whose scaled scores sit next to the band limits, students whose relative growth places them next to a limit have probabilities of correct categorisation of around 50%.

In practice, NAPLAN relative growth is so unreliable that I cannot believe it is a suitable measure, and I would personally discourage anyone from using it. The narrow range of questions that defines average growth, compounded by the error inherent in NAPLAN's testing method, makes it extremely unreliable.

Note: The code I use only has relative growth limits for reading 2015 year 3 to 2017 year 5, as they are time consuming to reverse engineer. I also use another method, but it is much less reliable; hence the different results when you use the two relative growth reports in my code.

Five-Year Trend Analysis
I actually started my NAPLAN journey looking at the five-year trend analysis. The code needs to run on a server and is much less user friendly, so I haven't released it as yet. My findings suggest that five-year trends are more accurate than other NAPLAN data, but at most they are in the ballpark of 80% accurate in predicting the trajectory of the graph from year to year. Over the summer, I hope to include this in the code base and update this post.

A demo of the code used in this project can be found at:

The code can be downloaded at:

Any comments, corrections and suggestions are appreciated.

Poor research and ideology: Common attempts used to denigrate inquiry

Time to read: 5 minutes

A few months ago, a prominent Melbourne University academic tweeted "Pure discovery widens achievement gaps", citing the paper "The influence of IQ on pure discovery and guided discovery learning of a complex real-world task". I was immediately dubious of this research, as research commonly quoted to show that inquiry learning doesn't work is usually fundamentally flawed. I'm not a proponent of "pure discovery learning" per se, but I feel this type of research, and the reporting on it, is designed to denigrate all inquiry learning, in an attempt to leave only teacher instructional approaches standing. Why these researchers don't instead prove a theoretical basis for instruction is beyond me.

So I took a look at the research to see if educators should have any confidence in its reported findings.

TLDR: No, we shouldn't have any confidence in this research, and it does not show that pure discovery or inquiry approaches widen achievement gaps.


Not surprisingly, this research fails the good educational research test, as it doesn't use a learning theory. That is, the research does not attempt to justify a theoretical basis for its findings. The researcher does use two other (non-learning) theories, though, to defend the research: notably game theory, and control value theory. In essence, the author uses these theories to defend the research design, yet for some reason does not believe a learning theory is also required. I find this baffling.

Why doesn't the author believe that a learning theory is required to define the scope of the research, given that the research is about learning? Why does the author believe that theory is required to explain games and emotional attainment, but not learning?

Anyway, let’s look at the research, as I’m always interested in how this type of research is used to investigate inquiry learning, or in this case pure discovery learning.

The author defines pure discovery learning as learning occurring "with little or no guidance. Essentially, knowledge is obtained by practice or observation." The author spends considerable time explaining how pure discovery occurs in so much of our lives, with ATMs and iPhones requiring people to use them correctly without instructions. He continues, explaining how Texas Hold'em poker requires people to "use multiple skills to reason, plan, solve problems, think abstractly, and comprehend complex ideas", which are similar to real-life pure discovery learning situations.

The author explains: “The poker application used for this study was Turbo Texas Hold’em for Windows, version four copyright 1997–2000 Wilson Software. This is a computerized simulation of a 10-player limit hold’em poker game.”

Wait!!! What????

A computer simulation of a game you play with real people is a suitable method for exploring pure discovery?

Interestingly enough, you can play Texas Hold'em, the software used in this research, in your browser thanks to archive.org. (Note: if you're using a Mac, use Function + right arrow when it asks you to press End to play.) In playing Texas Hold'em you will discover just what a poor attempt at simulating the playing of poker against nine other simulated people this really is. It appears that the study data used in this paper is actually from a previous study by the same author, "Poker is a skill", dated 2008. The 2008 date still doesn't explain why such old DOS software was used! In this paper the author explains that 720 hands of Texas Hold'em over six hours is equivalent to thirty hours of casino play, with real people as opponents. That is, 6 hours playing against a computer is supposedly the same as playing 30 hours against real people!

If you play the simulation at archive.org it is easy to see how 30 hours of real play can be achieved in 6 hours using this simulator: turns made by your computer opponents fly past, with short text messages popping up briefly on the screen. Two groups of students used this old software. The researchers provided the instruction group with instructions for a specific poker strategy; the pure discovery group were, for some reason, given documents detailing the history of poker! The success of players was determined by the money that they had won (or lost), though it should be noted that the participants were not playing for real money. Instead, the highest-ranking players were promised a chance to be part of a raffle for an iPod. This was intended to give meaning to the otherwise valueless money each player was playing for.

So this study is designed to simulate a real-life complex problem, yet it doesn't even simulate a real-life game of poker! The participants were not playing against real people. The participants were not playing for real money (though their success was measured as if they were). And the participants were playing five times faster than real poker is played.

All of this should make anyone question how the author could possibly argue that this research design can be described as "pure discovery" as commonly used in real-life situations. Interestingly, though the author identifies differences between the instruction and control groups, neither group learned to play poker to the point where they didn't lose money. Further, both groups played many more hands, more than twice as many, as poker experts are reported to recommend that "good" poker players play. That is, neither group exhibited one of the main traits of good poker players: folding around 85% of the time, and only playing 15% of the time.


How might research of pure discovery be better designed?



Playing with real people would allow a learner using pure discovery to observe, and seek to understand the decision making of other poker players. Depending on the relationship with the players the learner might ask questions of their opponents, seeking to clarify rules and strategies. Other players might also intervene in the play, offering advice and pointing out pivotal moments in the hand, and pivotal decisions being made by other players. If the participants played against real people, surely they would’ve noticed that they were playing many more hands (much more than twice as many) than their more skilled opponents? Though the computer might be able to simulate the logic of poker, it cannot and does not simulate the interactions between the players, a critical feature of playing any game, and especially poker.

Given that the instruction group did not learn to play Texas Hold'em poker to a satisfactory level, it is obvious that the instructional strategies used did not work. Of course, it must be noted, neither did the control group, who were left to battle the computer opponents on their own, armed only with a document on poker history. To suggest that players playing against real opponents, using pure discovery or other inquiry approaches, would also fail to learn to play poker satisfactorily is obviously outside the scope of the research, as the researcher did not explore this.

Of course, the lack of a learning theory is what has also led the researcher to his narrow definition of successful learning. Did the author ever consider why people play poker? Is money the only indicator of successful play? Or do people also play games for fun? Are the social aspects of playing with friends an important part of being a poker player?

A more complete understanding of what makes a poker player, a poker player, would consider other indicators, traits, characteristics and motivations.  Did the study participants continue playing poker after the study had finished? Did they enjoy playing poker?  Do they intend to teach friends? Do they feel playing poker with their friends strengthens their friendships? (Not that they were given this opportunity.) Have they developed their own theories and strategies they intend to try out in the future? What do they know about poker?

I believe a better understanding of poker players and the reason people play poker, would greatly improve this poor study. It would also provide further evidence for the worth of learning to play poker by playing poker with friends.  Not that this is an earth-shattering conclusion! After all isn’t that how we all learn to play any new game? Or maybe you’re the one out on the oval by yourself with a ball, a sheet of paper documenting the history of football!


Driven By Ideology?
To suggest that individuals playing Texas Hold'em against a computer mirrors the inquiry that happens in our schools is complete nonsense.

To suggest that this research proves pure discovery “widens the achievement gap” is complete nonsense.

To suggest that learning poker by yourself on a computer playing against a simulation has anything at all do with student learning and real inquiry is nonsense.

Do academics who favour high levels of teacher instruction really expect us to believe that inquiry classrooms operate the same way that people learn to play poker individually on their computer?

Do academics who favour high levels of teacher instruction really believe that playing poker on your own against a computer tells us anything about how teacher professional development or teacher pre-service training should be designed?

Do academics who favour instruction really believe that a piece of paper with strategies on them is really the best way to learn anything?

Do academics who favour instruction really believe learning is just about knowing, and not about experiencing with others?

Do academics who favour instruction really believe we’re that gullible?

Is there evidence that Positive Education improves academic performance? No


Time to read: 5 minutes

Lately there has been quite a bit of talk in education circles about social aspects of learning, particularly well-being, grit, growth and other mindsets, positive psychology and other social and emotional programs.

My personal opinion is that these represent a tacit recognition by proponents of direct instruction that their view of learning and development as a linear cognitive process of memorising skills is insufficient. Maybe they are starting to understand that development is highly individual in nature, that it is not linear or maturational, and that it is a complex transition to qualitatively new understandings of concepts, new motivations, new relationships with others and the world, new directions, and new results?

Unfortunately, rather than reexamining the more appropriate learning theories of Vygotsky, Piaget and other dialectical approaches to development, these instructionists blindly continue down their misguided path, co-opting bits and pieces into their flawed framework. Rather than designing learning and teaching so that it IS social, they attempt to teach the social as if it were a separate, discrete unit from other learning.

One such model is the Visible Wellbeing Instructional Model. Rather than admitting that direct instruction (Visible Learning) and performativity (Visible Thinking) don't work, they've misunderstood the fundamental idea from Vygotsky and Piaget that all learning is social, and tried to stuff it into their broken Visible Learning and Visible Thinking model in the hope that it will fix it.

How do they justify this? Well, according to them, Positive Education has been shown to increase student academic results by 11 percentile points.

Unfortunately for the Visible Wellbeing Instructional Model, this is simply untrue.


In 2011, Durlak, Weissberg, Dymnicki, Taylor, and Schellinger released their meta-analysis of social and emotional interventions. Notice that their paper is concerned with school based interventions, not a study of social and emotional practices that are embedded in standard learning and teaching practice. Their finding that is widely reported as evidence that these programs improve academic results is found in the abstract where they write:

“Compared to controls,  SEL (Social Emotional Learning) participants demonstrated significantly improved social and emotional skills, attitudes, behavior, and academic performance that reflected an 11-percentile-point gain in achievement.”

Seems clear cut right? Wrong!

If you, like me, and seemingly the subsequent researchers who quote this research, took "compared to controls" to mean compared to those who didn't participate in these programs, you'd be wrong, because that's not at all what they are saying… Let's read the paper further.

In Table 5, they specify the results of their meta-analysis:
Skills: 0.57
Attitudes: 0.23
Positive Social Behaviours: 0.24
Conduct: 0.22
Emotional Distress: 0.24
Academic Performance: 0.24

Though I’m not a fan of effect sizes, as I believe they are completely flawed, consider what John Hattie in the book Visible Learning says about effect sizes:

“Ninety percent of all effect sizes in education are positive (d > .0) and this means that almost everything works. The effect size of d=0.4 looks at the effects of innovations in achievement in such a way where we can notice real-world and more powerful differences. It is not a magical number but a guideline to begin discussion about what we can aim for if we want to see student change.”
(Hattie, pp. 15-17, quoted at http://visiblelearningplus.com/content/faq)

You might notice that all but one of Durlak et al.’s effect sizes fall below Visible Learning’s guideline for even beginning a discussion about them. The only exception is Skills (0.57), so according to their own figures the only worthwhile use of social and emotional interventions is to develop social and emotional skills. Everything else: attitudes (0.23), positive social behaviours (0.24), conduct (0.22), emotional distress (0.24), and academic performance (0.24), falls a fair way below the Visible Learning cut-off.

You’re probably wondering where the 11 percent gain in academic improvement comes from, in light of its small effect size. To solve this one, we need to keep reading the paper.

“Aside from SEL skills (mean ES = 0.57), the other mean ESs in Table 2 might seem ‘‘small.’’ However, methodologists now stress that instead of reflexively applying Cohen’s (1988) conventions concerning the magnitude of obtained effects, findings should be interpreted in the context of prior research and in terms of their practical value (Durlak, 2009; Hill, Bloom, Black, & Lipsey, 2007).”
Durlak, Joseph A., et al. “The impact of enhancing students’ social and emotional learning: A meta-analysis of school-based universal interventions.” Child Development 82.1 (2011): 416.

The mean effect sizes in Table 2 (which contains the same figures as above, broken down into further groups, such as classes run by teachers and by non-school personnel) do seem “small”, because they are small! Very small; so small that Hattie would no doubt suggest you ignore social and emotional programs entirely, unless you’re teaching social and emotional “skills” (0.57).

But what do the authors mean when they say “instead of reflexively applying Cohen’s (1988) conventions”? I looked up the definition of “reflexively”; the Merriam-Webster dictionary gives the following meaning:

“showing that the action in a sentence or clause happens to the person or thing that does the action, or happening or done without thinking as a reaction to something”

Now, I’m not a methodologist like Durlak, whose other paper is provided as the reference for why the effect of a social and emotional intervention shouldn’t be judged by its effect size alone. Yet it does seem a bit of a stretch (to a non-methodologist) that the same methodologist gets to decide what counts as an appropriate method of determining its practical value.

Table 5

What the authors did, as far as I can tell as a non-methodologist, in order to “interpret the practical value” of social and emotional interventions, is to compare the results to other social and emotional interventions.

I’ll say that again: the 11-percentile-point improvement in academic results is not measured against control groups who had no intervention at all; it is a gain over students in other social-and-emotional-type programs, and we are never shown how any of these students compare with those who did not participate in social and emotional programs at all.

We can see clearly from the last line of the table that the figure 11% was produced by comparing the effect size of 0.27 to four other studies with effect sizes of 0.29, 0.11, 0.30 and 0.24.

I’ve taken a quick look at these studies. They describe: 1) Changing Self Esteem in Children, 2) Effectiveness of mentoring programs for youth, 3) Primary prevention mental health programs for children and adolescents, and 4) Empirical benchmarks for interpreting effect sizes in research.

I must admit (as a non-methodologist) that I don’t understand why or how the fourth study, “Empirical benchmarks for interpreting effect sizes in research”, fits the criteria of “prior research”, given that, as far as I can tell, it has nothing to do with social and emotional programs. What that particular research does describe is that typical effect sizes are 0.24 for elementary school and 0.27 for middle school. On that research alone, the effect sizes here are either level with or slightly above what is expected: hardly a ringing endorsement, nor a source of much faith in the 11 percentile points of academic improvement.

A rudimentary understanding of mathematics also suggests that the extremely low effect size (0.11) of the study into “Effectiveness of mentoring programs for youth” greatly increased the difference between the study in question and the “prior research”. I’d suggest that if that study had been deemed unfit for judging the “practical value” of this one, the 11-percentile-point figure would have been much lower.
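The arithmetic behind this can be checked against the four benchmark effect sizes quoted above. The sketch below assumes the comparison uses a simple mean of the prior studies; that assumption is mine, not something the paper states.

```python
# The four "prior research" effect sizes quoted above, and the 0.27
# academic-performance effect size they are compared against.
benchmarks = [0.29, 0.11, 0.30, 0.24]
study_es = 0.27

mean_all = sum(benchmarks) / len(benchmarks)
# Drop the very low mentoring-programs study (0.11) and recompute.
without_mentoring = [es for es in benchmarks if es != 0.11]
mean_without = sum(without_mentoring) / len(without_mentoring)

print(f"Benchmark mean, all four studies: {mean_all:.3f}")      # 0.235
print(f"Benchmark mean, without 0.11:     {mean_without:.3f}")  # 0.277
print(f"0.27 beats the first benchmark:  {study_es > mean_all}")
print(f"0.27 beats the second benchmark: {study_es > mean_without}")
```

With the 0.11 mentoring study included, the 0.27 result sits comfortably above the benchmark mean; without it, it sits slightly below, which is consistent with the suggestion that the low mentoring figure flatters the comparison.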

So, it seems clear to me that the 11 percentile points of academic improvement were determined by comparing the study to previous similar studies that didn’t work as well. Any other measure would not have produced the same result.


Of course, to Vygotsky or Piaget these results would not be surprising, for they knew you can’t reduce learning and development to individual traits; we can only understand it as a complex system. Maybe the Visible Wellbeing Model is trying to move towards Vygotsky and Piaget? If so, they’re doing it wrong. By attempting to identify and promote the three traits of teacher effectiveness, teacher practice, and wellbeing, they’re not seeing them as a system but rather as three individual traits placed side by side. Yet at the same time they’re only measuring one trait: test scores. And when you only measure one trait, guess what: the only trait that matters is that trait!

For Positive Education and wellbeing to ever produce a substantial effect size, what is measured would need to change, just as it did to produce the contrived 11 percent figure. But can what Visible Learning’s effect sizes deem important change? Could they decide what matters while still believing in “evidence”?

Such is the conundrum the Visible Wellbeing Model finds itself in. Theoretically baseless, treating only test scores as worthwhile, what it finds worthwhile isn’t what its proponents know to be worthwhile… No wonder most of us still listen to Vygotsky and Piaget!


Personally, I believe that learning and development are social, so this post is not meant to belittle the wellbeing movement, but rather to suggest that reducing the social and emotional to skills to be learned through programs and interventions is, in my opinion, a missed opportunity. Further, to think we can bolt on wellbeing in order to improve test scores is to misunderstand how our students actually learn and develop.


Incidentally, inquiry-based learning comes in at 0.35 in the incredibly flawed Visible Learning meta-analysis; maybe it is time it replaced Positive Education, with its effect size of 0.27, as one of the three components of their model?