233 – An analysis of ANOVA as a Descriptive Tool

The IDEMS Podcast

Description

In this episode Lily and David delve into the ANOVA (Analysis of Variance) tables, focusing on the sum of squares. They discuss how it helps account for data variability, and the difference between sum of squares and mean squares. The episode also touches on the limitations of p-values and emphasizes the ANOVA table’s value as a descriptive tool, particularly in enhancing research methods in West Africa.

[00:00:07] Lily: Hello, and welcome to the IDEMS podcast. I’m Lily Clements, a data scientist, and I’m here with David Stern, a founding director of IDEMS.

Hi, David.

[00:00:14] David: Hi, Lily.

[00:00:15] Lily: I thought we could continue the conversation on ANOVA that we started in a previous podcast. So this was related to the 20 years of research methods support in West Africa and the celebrations for that, and the reflections surrounding that.

And one thing that we touched on was, well, not even touched on, one thing we discussed was this kind of analysis of variance tables or ANOVAs, where you have all of this richness of data in there and oftentimes people look straight to one single value in there, your p-values.

[00:00:51] David: We’ll get to the p-values eventually, but they were the fifth column and we only got to the first column. 

[00:00:58] Lily: Yes, we only managed to discuss our degrees of freedom in the previous podcast, which was very eye-opening, actually.

[00:01:05] David: I think it’s so important that you, as a data scientist, as a statistician, you know, you’ve known a lot about ANOVA, you’ve taught ANOVA before. And yet, the fact that the description we had for the degrees of freedom part of an ANOVA table was eye-opening is to me an indicator of the failings we, as statistics educators, have on communicating, even to statisticians, what it means to work on these tables and to use them in the real world.

[00:01:37] Lily: It’s to me also just fascinating how much depth there is that you don’t see before, and it’s fun, it’s fun to then learn and see new depth. You know, if you’re watching a sport, say, you can understand it at a surface level, and then once you really understand that sport beyond the surface level, you start to see more and more depth to it.

[00:01:56] David: Absolutely. But, I would argue that what we discussed last time and what we’re gonna discuss today on the sum of the squares, this isn’t real depth yet. We can get really more interesting once we start putting them all together. But let’s get into the sum of the squares.

[00:02:09] Lily: Okay, the sum of the squares. Let’s get into the sum of the squares. 

[00:02:13] David: So sum of the squares of course, this is an ANOVA table, an analysis of variance, the sum of the squares, this is your simple measure of variability, variance. That’s what you are measuring. So, this is the heart of the table. How much variability is in your data set.

[00:02:35] Lily: Yeah. So if you look at your total sums of squares, my understanding is that it’s the difference between your data points, your y-variable points, and the average, squared.

[00:02:47] David: Otherwise it’s not a sum of squares. You can do it where you could cube it, or you could do it, where you just take the absolute value of that difference. In mathematics, these are different measure spaces and there’s reasons why, in certain statistical contexts, you might want either of those or many other measures.

But, with most data sets, the sum of the squares is appropriate because it penalizes extreme values, which an absolute value wouldn’t, but it doesn’t penalize them too much. 

[00:03:22] Lily: And I suppose a cube would also, could have negative values. 

[00:03:26] David: Then you take the absolute value. 

[00:03:27] Lily: Oh, okay. Sorry, the absolute cube. Okay. 

[00:03:29] David: Yeah, sorry, I should have been more precise, but you are right. Your measure should always be positive, it is a whole different thing of wanting to consider things, whether you are not always considering them as being positive measures. But, the positive cube, the absolute value cube would just be penalizing your extreme values more than with the square.

That would mean that, if you think of fitting your line, any extreme value would have a bigger impact on where that line is, than if you’re using the squares. If you went to higher powers, then those extreme values will influence it more and more. And if you go to your absolute value, then your extreme values make very little difference, you may as well try and fit the bunch of the values closer than your extreme value, it makes very little difference. This affects how you fit your line.

And of course the ANOVA is your basic model for this. This is why if you start using generalized linear models you can’t use the ANOVA in the same way, because you don’t use the sum of the squares in the same way. It’s not such a simple measure of variability. But your ANOVA is based on the methods for which you can use this simple sum of the squares method.
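As a rough illustration of the point about absolute values, squares and cubes, the sketch below uses made-up numbers (not data from the episode) with one deliberately extreme value, and shows how much of the total measure comes from that single value under each power. The function names are hypothetical helpers for this example only.

```python
# Made-up values for illustration; the last one is deliberately extreme.
values = [4.0, 5.0, 6.0, 5.0, 15.0]
mean = sum(values) / len(values)


def total_deviation(values, centre, power):
    """Sum of |value - centre| ** power over all values."""
    return sum(abs(v - centre) ** power for v in values)


def share_of_largest(values, centre, power):
    """Fraction of the total that comes from the single most extreme value."""
    parts = [abs(v - centre) ** power for v in values]
    return max(parts) / sum(parts)


# power 1: absolute values, power 2: squares, power 3: absolute cubes
for power, name in [(1, "absolute"), (2, "squares"), (3, "absolute cubes")]:
    total = total_deviation(values, mean, power)
    share = share_of_largest(values, mean, power)
    print(f"{name:15s} total={total:8.2f} share of extreme value={share:.2f}")
```

The higher the power, the larger the share of the total that the extreme value takes, which is the sense in which higher powers let extreme values influence the fit more.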

[00:04:51] Lily: Nice. And with your total sums of squares, it’s kind of that simple, well, it’s not always a simple graph, but if you just imagine that simple graph of your x against y with a line of best fit, just the difference between that line of best fit and your points squared.

[00:05:07] David: So when you say squared, I mean, the nice way of looking at that is you can actually draw squares.

[00:05:12] Lily: Yes. Yes.

[00:05:13] David: And so this is the thing, it’s sort of like saying you’ve got these squares, which you draw, where from each point you go horizontal and vertical, and you create the square between your point and the line. It’s a beautiful visual, which of course on an audio podcast we can’t show. But, you are just adding those areas up. And so your line of best fit is the line for which, when you move it, the total area in all of the squares is minimized. That’s what you are doing. And this is where you can see that if you have an extreme value and you make a very big square, that’s a lot of area just contained in that one square. And so it does penalize those extreme values and you want to avoid really big squares, because one very big square takes a lot of space, so to speak.

[00:05:58] Lily: Nice.

[00:05:59] David: And for the absolute value, which we discussed before, would be if you just took that perpendicular line to the point, where is the sort of shortest distance to the point, that’s what you are looking at, the difference between the value and the point. So it’s not the shortest distance, it’s the difference between the value you have and the point.

[00:06:19] Lily: Yeah.

[00:06:19] David: So the vertical distance.

So anyway, the thing which I think is so important is, you’ve mentioned that your total sum of the squares, that’s the total variability in your dataset. So we could look at now, well, what happens when you put in a variable? So you start off with just the average, and you have then the sum of the squares with respect to the average, but then you put in, let’s say, a continuous variable. Again, let’s go through the different options as before. So, now you have a straight line and you have the sum of the squares related to the straight line. Or you could put in a different value for each level of a factor and you use the squares with respect to that. So what makes its way into the ANOVA table?

Well, the easiest way to think of this is: you have your total variability with the average, and when you add a factor or a line, then you now have a new number, which is the variability left over around the best fit that you get with that, and the difference between those two, that’s your sum of the squares that goes into the ANOVA table.

So, if you just put one thing in, your total is now made up of the residual, what’s left over after you’ve put the model in, and the difference between the previous total and the residual, which is put in as what has been taken into account by that variable.

And so, in your ANOVA table, the sum of the squares column is representing how much variability were you able to account for with this part of the model or with this variable in the model. And this is of course really useful. If you have a big sum of the squares, then you’re accounting for a lot of variability. You are reducing the variability. If you have a small sum of the squares there, actually that part of the model is not really helping.
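The split of the total into "accounted for" and "residual" can be sketched in a few lines. This is a minimal example with made-up x and y values and a single continuous variable, assuming an ordinary least-squares straight line; it is not taken from any dataset discussed in the episode.

```python
# Made-up data for illustration: one continuous variable x and a response y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for a straight line.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

# Total variability: squares with respect to the average.
ss_total = sum((yi - y_bar) ** 2 for yi in y)
# Residual variability: squares with respect to the fitted line.
ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
# What goes in the ANOVA table for x: the difference between the two.
ss_x = ss_total - ss_residual

print(f"total    SS = {ss_total:.4f}")
print(f"x        SS = {ss_x:.4f}")
print(f"residual SS = {ss_residual:.4f}")
# The accounted-for share of the total is the quantity usually reported as R-squared.
print(f"share accounted for = {ss_x / ss_total:.4f}")
```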

[00:08:32] Lily: I laugh at “useful” ’cause I feel “useful” is a bit of an understatement. This is quite useful. In statistics you want to account for the variability and this is telling you, okay, how much variability have we accounted for?

[00:08:45] David: Yes.

[00:08:46] Lily: This is huge.

[00:08:48] David: It is huge. People often like your R-squared, but all that is, is you just take the residual variability at the end, and you look at that with respect to your total variability. That’s all it is. It’s how much your model has taken into account and how much is left in the residual. That’s all that information is telling you.

And of course that’s visible in much more detail in the ANOVA table in the second column, in this wonderful sum of the squares column, and it’s a very interesting column. Lots of information here. The thing which I always find surprising, now, of course, changing the order of your variables, your degrees of freedom might not change, but your sum of the squares might change quite considerably.

We have some very nice examples where people who have done ANOVA with their different variables, they’ve been doing on-farm research, and so there have been unbalanced elements, they change the order of their variables, their degrees of freedom don’t change. But their sum of the squares changes considerably.

We actually have a little game around this which highlights this, when you don’t have balanced design, in the game, you have “village”, you have “variety”, and you have “fertilizer” level. And if you put the “village” in first, it’s clear, “village” is really important, it explains a lot of variability. Then “variety” does explain a bit more, and “fertilizer” level explains a bit more.

If you change the order and put your “village” in last, your village doesn’t explain any variability. In fact, it’s very similar to the residual, on average. We’ll get back to the mean squares in a minute. But, suddenly what that means is the amount of variability explained by “village” has gone down immensely. And that’s because it’s the imbalance between the “variety” and “fertilizer” levels. So it’s the fact that in different villages, people used more or less certain varieties than other varieties, and more or less fertilizer than in other villages.

And so the differences between your villages, they can be accounted for by the differences in fertilizer level and variety. And this is huge, because if that’s the case, oh that’s a relief, you don’t have different results in different villages. You just have different results based on which variety people use and what fertilizer level they use.

Now, there’s plenty of cases where this is not the case and where you actually are finding that there’s different interactions between “village” and “variety” and “fertilizer” level, and it does matter which village you are in, and you get different results in different villages.

But, in this simple example that we’ve used in the teaching in the region, what’s wonderful, in terms of teaching ANOVA in particular, is that by doing this correctly, you can then get to the stage where you can say, “ah, good, I don’t need to look at ‘village’. I just need to look at ‘variety’ and ‘fertilizer’ level because the ‘village’ effect is being accounted for by the differences in fertilizer use and in variety use”. And that therefore means that the spatial situation you are working in is homogeneous enough that your results are valid across that space, across all of those villages.

Now, that’s not the case in many other data sets we’ve seen where you cannot get rid of the villages because they represent other important pieces of information. But the sum of the squares is critical to being able to understand, you know, “what is accounting for the variability you are observing?”, if you have unbalanced data. If you have balanced data, then changing the order will not change the values in your sum of the squares. And so this is what people are taught when they’re first taught ANOVA. They’re taught about ANOVA for a balanced design, that’s where it applies. But that’s almost never where it applies in real life.

Very often you get the ANOVA table related to a linear model and you know it’s unbalanced, what’s behind it is unbalanced in different ways. And therefore the order of your variables in the ANOVA table and in your model matters in terms of accounting for variability, and your sum of the squares column is the heart of that.
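The effect of variable order with unbalanced data can be sketched as follows. This is not the teaching game itself: the twelve plots, the yields and the helper functions are all invented for illustration, with yield built to depend only on "variety" and "fertilizer", and with the two villages using them in different proportions. It assumes NumPy is available and uses sequential sums of squares, where each term is credited with the drop in residual sum of squares when it is added.

```python
import numpy as np


def dummy_columns(levels):
    """0/1 indicator columns for a factor, dropping the first level."""
    names = sorted(set(levels))
    return np.column_stack(
        [[1.0 if v == name else 0.0 for v in levels] for name in names[1:]]
    )


def residual_ss(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


def sequential_ss(terms, y):
    """Sum of squares accounted for by each term, added in the given order."""
    X = np.ones((len(y), 1))      # start from the overall average
    previous = residual_ss(X, y)  # this is the total sum of squares
    table = {}
    for name, columns in terms:
        X = np.hstack([X, columns])
        current = residual_ss(X, y)
        table[name] = previous - current
        previous = current
    table["residual"] = previous
    return table


# Invented, unbalanced layout: village A mostly improved variety and high
# fertilizer, village B mostly local variety and low fertilizer.
village = ["A"] * 6 + ["B"] * 6
variety = ["improved"] * 4 + ["local"] * 2 + ["local"] * 4 + ["improved"] * 2
fertilizer = ["high", "high", "high", "low", "high", "low",
              "low", "low", "low", "high", "low", "high"]
noise = [0.1, -0.1, 0.05, -0.05, 0.0, 0.05,
         -0.1, 0.1, -0.05, 0.05, 0.0, -0.05]

# Yield depends on variety and fertilizer only, never directly on village.
y = np.array([
    2.0 + 1.5 * (v == "improved") + 1.0 * (f == "high") + e
    for v, f, e in zip(variety, fertilizer, noise)
])

columns = {
    "village": dummy_columns(village),
    "variety": dummy_columns(variety),
    "fertilizer": dummy_columns(fertilizer),
}

village_first = sequential_ss(
    [(k, columns[k]) for k in ["village", "variety", "fertilizer"]], y
)
village_last = sequential_ss(
    [(k, columns[k]) for k in ["variety", "fertilizer", "village"]], y
)

for label, table in [("village first", village_first), ("village last", village_last)]:
    print(label)
    for name, ss in table.items():
        print(f"  {name:10s} SS = {ss:.4f}")
```

In both orders the rows add up to the same total and leave the same residual; only the way the accounted-for variability is shared between the variables changes.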

But we can’t stop there because we need to at least get to saying, well, “what about the mean squares?”

[00:13:38] Lily: Yep. 

[00:13:39] David: Because the mean squares, well, this is just saying: “wait a second, if I have lots of degrees of freedom, then of course I’m likely to explain (not to ‘explain’, to ‘account for’) a lot more variability.” I should give credit here. I used to talk about explaining variability, and I was told off, and had it explained to me how ‘explaining’ variability gives people the wrong idea, it makes them think about causation. Whereas ‘accounting for’ variability is actually what we’re doing.

[00:14:10] Lily: Yes, because with your village example, it’s not the village that causes those differences in your yield or whatever it is you’re measuring, it’s just that there is different variability between villages.

[00:14:22] David: Absolutely. Yes. And so, the variability that you are accounting for is really important. And so let’s get to the mean square. The point is, if you have lots of different factor levels, well, then you are using lots of pieces of information to account for differences in the data.

Therefore, the more interesting comparison tool, in many ways, is the mean square, which is the sum of the squares divided by the degrees of freedom. On average, for each degree of freedom that you’ve got for that variable, how much variability are you accounting for?

And that becomes really important because you can also do that for the residuals, for your error terms. And that says that, well, if you’ve got all the important parts of your model, then your error term represents natural variability. And so the comparison between your average variability, your mean square for the variable, and your mean square for the residual, that’s a really meaningful comparison because you are actually saying, “am I accounting for more variability than just the other variability I see in my data?”

I will come back to that in a second, but I will complete the table, because the next column is just exactly that comparison, it’s just to say, well, okay, the F-value, this is that mean square divided by the residual mean square. All you are doing is you’re saying, well, yes, that’s a really useful comparison. Is it much bigger than one? If it’s much bigger than one, then you are accounting for a lot more variability than your error terms are on average, your residuals are on average.

So F-values bigger than one are really important. And I will finish very quickly with the p-value to say, well, all that’s saying is how likely is it to get that F-value given those degrees of freedom? Given the degrees of freedom that you have, how likely is it to get such an F-value, if it was just by chance?
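The mean square and F-value columns can be sketched for the simplest case, a single factor. The three groups and their values below are invented for illustration; the calculation simply follows the description above (sum of squares divided by degrees of freedom, then the ratio to the residual mean square).

```python
# Invented yields for three levels of a single factor.
groups = {
    "variety_1": [4.1, 3.8, 4.4, 4.0],
    "variety_2": [5.2, 5.0, 5.5, 4.9],
    "variety_3": [4.3, 4.6, 4.1, 4.4],
}

all_values = [v for vals in groups.values() for v in vals]
grand_mean = sum(all_values) / len(all_values)

# Total: squares with respect to the overall average.
ss_total = sum((v - grand_mean) ** 2 for v in all_values)
# Residual: squares with respect to each group's own average.
ss_residual = sum(
    (v - sum(vals) / len(vals)) ** 2 for vals in groups.values() for v in vals
)
# Accounted for by the factor: the difference.
ss_factor = ss_total - ss_residual

df_factor = len(groups) - 1
df_residual = len(all_values) - len(groups)

# Mean square: variability accounted for per degree of freedom.
ms_factor = ss_factor / df_factor
ms_residual = ss_residual / df_residual

# F-value: how the factor's mean square compares with the residual mean square.
f_value = ms_factor / ms_residual

print(f"factor   df={df_factor}  SS={ss_factor:.4f}  MS={ms_factor:.4f}  F={f_value:.2f}")
print(f"residual df={df_residual}  SS={ss_residual:.4f}  MS={ms_residual:.4f}")
```

Turning the F-value into a p-value needs the F distribution with these two degrees of freedom (for example `scipy.stats.f.sf(f_value, df_factor, df_residual)` if SciPy is installed), and that step is where the distributional assumptions come in.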

[00:16:35] Lily: To jump in, I guess, p meaning probability, it’s just a probability.

[00:16:40] David: Exactly. So this is all very simple, and the p-value is just saying, is this something that you would’ve expected to see by chance? So this is important, but that has behind it a number of assumptions. Is this something you are likely to see just by chance if your errors are normally distributed and all these other things happen, all these other things are satisfied?

And so while the p-value is extremely valuable and extremely useful, it’s something which then if you put that into a paper, you can be criticized to say, oh, you’ve made assumptions that you shouldn’t have made because your ANOVA is assuming normal distribution of error values, it’s assuming all sorts of other things, which, actually, are very rarely valid in the context where we work.

So nine times out of 10, those p-values are not ones that you could use or put into your paper. You’d actually have to do a more complex analysis. But everything up to the p-value is valid. And this is why it’s such an important descriptive tool rather than a sort of statistical tool. Because those p-values have assumptions behind them, which are very easily invalidated.

And so very often if you take those p-values, you can be criticized correctly for using those p-values. But everything before the p-values is just descriptive. There’s the structure of the model you’re using, and then there are the simple measures of variability: the nature of the model, the degrees of freedom, the measure of the variability, how that variability corresponds to the degrees of freedom, and how that mean variability compares to your residual.

Those are all descriptive. They’re just very useful, descriptive statistics. And I’m conscious we’re running out of time, but I can’t leave it here because within this, there are a few tricks of the trade related to particularly the mean variability, which is so easy and obvious. What happens if you have a mean variability, which is much smaller than your residual variability?

[00:18:58] Lily: Yeah, I remember this occurring with, I think it was with John’s, with some research that John was doing. I remember you saying about this. 

[00:19:06] David: Well, this has come up in so many cases where I’ve been working with researchers in the West African region. Yeah. So go on. What does it mean?

[00:19:16] Lily: Yeah. So your variability that is happening by chance is greater than your variability that’s happening by something you are trying to account for. Which means, oh, what does it mean?

[00:19:30] David: Well, it means your residual variability is not the variability that’s happening by chance. One of the things that I love doing is putting in, let’s say you have a plot number, which is just a random number which happens to be assigned to the plot. I love putting that into the model.

Now, of course, you are not expecting the plot number to explain variability, but I do expect the plot number to be a good proxy for natural variability. And so, if your plot number variability is much smaller than your residual variability, that means that there is information in your residuals, which you haven’t accounted for, but that you could account for.

And that’s so important to be able to know “how well have I done in my model? Is my model doing well or not?” And to have variables which are interesting proxies for natural variability, which then tell you “okay, yeah, your model might be good, its R-squared might be okay, but you have a lot of variability not accounted for, that you could account for”. That’s useful information.

And all of this is so easy to get from an ANOVA table. Good ANOVA tables, used well as a descriptive tool, are something which I believe could really help many researchers get so much more out of their data, just by being able to understand what are the sources of variability, what variability is natural, what variability can be accounted for by the data they have, and what variability they might have to look for something else to account for, because they’ve not caught all the important things. So much rich information to have from such a simple tool, a simple descriptive ANOVA.

And as I say, I can take no credit whatsoever for any of this because all of this was being taught in the 1970s by Roger and others, and it’s something which has come back time and time again as being a really important point, which has helped partners in the West African region to get more out of and understand their data and the nature of the models they’re fitting much better.

[00:21:48] Lily: Well, thank you very much, David. That’s useful and very insightful.

[00:21:52] David: I look forward to more similar episodes on other aspects of our experience in the West African region.

[00:21:58] Lily: Absolutely. Great. Thank you very much.

[00:22:01] David: Thanks.