232 – ANOVA and Degrees of Freedom

The IDEMS Podcast

Description

Lily and David discuss the application of ANOVA in agroecology research, focusing on its historical roots and its use as a descriptive tool. They emphasize the importance of understanding the degrees of freedom in the ANOVA table, highlighting its impact on effective data analysis and model fitting. This episode is part of the celebration of 20 years of research support in the region, showcasing the value of statistical methods in enhancing research outcomes.

[00:00:07] Lily: Hello and welcome to the IDEMS podcast. I’m Lily Clements, a data scientist, and I’m here with David Stern, a founding director of IDEMS.

Hi David.

[00:00:14] David: Hi, how are you doing?

[00:00:16] Lily: Yes. Yeah, I’m well. Thank you.

So I thought today we could talk about ANOVA, do a kind of statistics podcast, but also relate it into this kind of West Africa work.

[00:00:26] David: Absolutely. It’s the 20th anniversary of the work in West Africa, and so really this is part of that broader series on celebrating 20 years of research methods support to the West African community of practice. Research in agroecology.

[00:00:43] Lily: Very nice. And I’m sure that this is one that stretches quite, well, it’s one that I’ve even had experience in, not in West Africa, as I’m sure you yourself have.

[00:00:51] David: Absolutely. The topic of ANOVA as a descriptive tool, which I think is really the heart of what we’re going to discuss, is something which I was introduced to, I suppose, when I started teaching statistics. So I really have experienced this initially anyway from how do people learn about ANOVA and why do they learn about ANOVA? But it’s not taught like this, it’s really in a very few places. It’s crazy.

[00:01:21] Lily: Well, I was introduced to this idea through you.

[00:01:23] David: And that’s of course ironic because you studied at Reading University.

[00:01:26] Lily: Yes.

[00:01:27] David: And these ideas were, the first time I know they were taught at Reading University was in the 1970s. And so this idea actually goes back a long way. The idea of how you think of, and how you teach ANOVA differently so that people get more out of it. This goes back to the 1970s, instigated, I believe, at Reading University when Reading University was at the foreground of biometry, as in statistics applied to biology.

[00:01:55] Lily: Wow, that is interesting. And that’s interesting to know that somewhere along the way the ideas got lost and it would be interesting one day to dig into kind of how or why to keep these ideas active. But, I suppose in your West African context then, where or what do we mean by ANOVA as a descriptive tool?

[00:02:14] David: The starting point in the West African context is that we are often in the privileged position to be supporting research projects and students as they analyze their data. And nine times out of 10 when somebody doesn’t know about us or hasn’t interacted with us before, the starting point for that relates to essentially the statistics that they feel they need to publish their papers. The hypothesis testing, which everybody is pushing. And they sort of come to us and, oh, you are the statisticians, great, how do we, what do we do? Have we got the right p-values? How do we interpret this? Is this what we need for our publication?

[00:02:58] Lily: I know this all too well. I mean, just within my bachelor’s degree alone. And then I remember even finishing my bachelor’s – I think I might have said this story before: I was sitting in your office, and someone came in and they asked that question, “which test do I use?” or asking about the p-values. Actually, no, the question they asked was “how do I”, it wasn’t even, “how do I tell if it’s a parametric test?”. It was like, “which parametric test do I use?”. Words, which at the time I was like, “what does this mean?” And expected, I thought within myself, I should know this, I should know, I should be able to look at data and know this straight away.

And then saw through that interaction that you had with them that you can’t look at data and just know it. So I feel that there is like this expectation that, oh, you’re a statistician, okay, which p-values? What should I do here? Which test? Which p-values?

[00:03:47] David: So that’s often the entry point. In the West African context where we’ve been working for, I’ve been working for over a decade, this work has now got 20 years, it’s celebrating its 20th year anniversary in that region. And what’s happening is as people are moving towards agroecology, then there’s a growing recognition that the answers are not simple, you’re not just doing simple on station experiments where you’ve got real control over the design and you’re doing a very constrained design, which therefore has a very constrained analysis, or very well-defined analysis.

And so very often there’s a lot of different factors that come in, or different components that come in, which could interact with and could relate to what you are interested in. If you’re interested in crop yield, this might be affected by many different things. And you often collect quite a lot of data including survey data, as well as the experimental data, and other things together. People have, sometimes, this really rich, quite interesting, diverse data set, multiple sources all coming together.

But the starting point that they know they need to do to publish papers is they want the hypothesis test, “what is the right statistical test?” And very often I find ANOVA is, as a descriptive tool, an eye-opener, which totally changes how people think about their data, how they start looking at the data and interacting with the data. Because, in a normal ANOVA table, the question is “how many variables are you including in your analysis?” You then have the residual, which is the error – what’s left over, what haven’t you accounted for with the variables you’ve included and how much variability is left – and then you are looking at the degrees of freedom, which correspond to, for each of those variables, how much information is put into the model related to those, how many parameters you are taking into account in the model for that.

And the way I like to think of that is that people very quickly know that your degrees of freedom are your number of data points minus 1 as the total of degrees of freedom. And I say, okay, great. Why is that? And they say, well, it’s the number of data points minus 1. And I say, well, before you put any model in, before you do anything else, that one is the average, it’s the mean, your variability is calculated as the difference between your data and the mean. This is how you calculate the next columns.

And so your number of data points minus 1 reflects the fact that the one is the mean. So you have one piece of information, which you are using to represent all of your data. And so the number of degrees of freedom which are left is the number of data points you have, which are independent from one another, minus the one for the average, which is the one thing you are using to model it, to describe it.

Then you might be putting in a continuous variable, in which case you are now saying instead of just an average, you are looking at a line, and that means you’ve got two pieces of information. You’ve got the slope of the line, and you’ve got the intercept. Or, if you want, you could still do that as thinking about as being the average. So you’ve got the average and the slope, there’s two pieces of information.

And so that variable, your continuous variable, where you are fitting it as a line, linear model, is just one additional degree of freedom, which is the number of parameters, two, minus 1. So it’s still n – 1, where n is now the number of parameters. So this is how people get to it.

Or you could be adding a factor. What does a factor do? Well, each level of your factor, you are now saying “I’ve got a different average for each of the levels of my factor”. And so you have the number of levels of the factor, minus 1, of course, ’cause you already had one average and now you’ve got one for each level of the factor. So you are adding n – 1, where n is the number of levels of the factor – that’s the number of degrees of freedom taken into account by adding a factor variable.
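The counting David describes can be sketched in a few lines of Python. This is only an illustration: the function names and the twelve-plot data set (four treatments plus a rainfall measurement) are made up for the example and do not come from the episode.

```python
def total_df(n_obs):
    """Total degrees of freedom: data points minus 1 (the overall mean)."""
    return n_obs - 1


def line_df():
    """A continuous variable fitted as a straight line adds one slope."""
    return 1


def factor_df(levels):
    """A factor adds one mean per level, minus the mean already counted."""
    return len(set(levels)) - 1


# Hypothetical data: 12 plots, 4 treatments with 3 plots each,
# plus a continuous rainfall measurement fitted as a line.
treatment = ["T1", "T2", "T3", "T4"] * 3

df_total = total_df(len(treatment))      # 12 - 1 = 11
df_treatment = factor_df(treatment)      # 4 - 1 = 3
df_rainfall = line_df()                  # 1
# Assumes rainfall is not already determined by treatment.
df_residual = df_total - df_treatment - df_rainfall  # 11 - 3 - 1 = 7

print("treatment", df_treatment)
print("rainfall ", df_rainfall)
print("residual ", df_residual)
print("total    ", df_total)
```

The rows add up to the total, which is the first thing worth checking in any ANOVA table.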

This is really important, it’s all very simple. But most people don’t look at this column. They only look at the p-values. If you look at that column, it’s really important because quite often the first thing that’s happened when I’ve started doing this with people is that, let’s say they’ve had year as a variable, well, the first thing I ask is “did you mean for year to be a line, as in looking for the trend, the change across the years, or did you mean for year to be a factor, as in each year is different, so you are looking at the variability within year rather than across years?”

And so you know what your model is just by looking at the degree of freedom, which model have you chosen? Are you looking at a trend model, or are you looking at a model within year, where you are looking at the variability within year?
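As a rough sketch of reading the model off the degrees of freedom column, the snippet below guesses whether a variable such as year went in as a line or as a factor. The helper names and the four-year data set are hypothetical, chosen only to illustrate the idea.

```python
def expected_df(values, as_factor):
    """Degrees of freedom a single variable should show in the table."""
    if as_factor:
        return len(set(values)) - 1   # one mean per level, minus 1
    return 1                          # a single slope


def how_was_it_fitted(observed_df, values):
    """Guess from the df column whether a variable went in as a line or a factor."""
    if observed_df == expected_df(values, as_factor=True):
        # With only two levels, a line and a factor both give 1 df.
        return "factor" if len(set(values)) > 2 else "line or factor"
    if observed_df == expected_df(values, as_factor=False):
        return "line"
    return "unexpected: check for dependencies between variables"


# Hypothetical data: 4 years, 5 observations per year.
year = [2019, 2020, 2021, 2022] * 5

print(how_was_it_fitted(1, year))   # trend model
print(how_was_it_fitted(3, year))   # each year has its own mean
print(how_was_it_fitted(2, year))   # neither: something else is going on
```

Seeing 1 where 3 was expected usually means year was read as a number rather than as a category.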

[00:09:17] Lily: That’s not a way I’ve thought of it before, but that’s an excellent point. Wow. 

[00:09:23] David: But it’s such a simple thing. Your degree of freedom column within an ANOVA, we don’t often get this technical within a podcast, but I think it’s an important one, your degree of freedom column within an ANOVA is such important information to tell you what are you actually studying, what’s the model you are looking at?

Of course, then as you add more variables, then the question is “did you include interactions in your model or not?” Because that will be in your degrees of freedom as well, you’ll see what are the interactions that have been included in the model or not. And they’re separate. There are different ways where you can dig in, but they’re separate.

The really fun thing, of course, comes once you start questioning – and people are always amazed by this – “what about the order? Does the order matter? When does the order matter?” For the degrees of freedom, and this is all we’ve got to at the moment, the order only matters if the variables are not independent.

So when might you have this? Let’s say you have two different factors, but where they overlap. So, for example, you have, take an extreme example, you have three villages and you have three varieties. So normally you put in your villages, n – 1, your degree of freedom should be two.

[00:10:53] Lily: Yep.

[00:10:55] David: You put in your varieties, n – 1, your degree of freedom should be two. What happens if your degree of freedom of your second one is less than two? It can’t be more than two, but it could be less. What does that mean?

[00:11:10] Lily: Why can’t your second one be more than two?

[00:11:13] David: If let’s say you’ve only got three varieties. But let’s say you have three varieties, three villages. You put them both in your model, you look at your degrees of freedom, and you expect to see two and two.

[00:11:25] Lily: Yeah.

[00:11:26] David: What happens if, for the second one, you don’t see two? The first one, you’ll always see two, but you might not always see two for the second one. Can you think of a scenario where you don’t see two for the second one? I’m putting you on the spot here.

[00:11:42] Lily: Yeah. Yeah. Well, if you’ve accidentally put it in as a numeric instead of a factor.

[00:11:48] David: So you’re absolutely right. If you put these in as numeric and you consider them as a line instead of a factor, then you won’t see two, you’d see one in both cases because it’s now considering them as a continuous variable.

[00:12:02] Lily: But, I suppose that the issue is I’m saying, you know, you’ve not cleaned your data or you’ve not prepared it.

[00:12:08] David: Yeah, exactly. So I’m saying there’s still a scenario where you could have cleaned your data correctly, you could be treating them correctly as a factor, and the second factor you don’t see two.

[00:12:20] Lily: Is it to do with if, I’ve gotta say if they’re not balanced properly between the villages. 

[00:12:25] David: In an extreme way, yes. So, let’s say, for example, that the varieties only appear in one village. So each village, they have one variety in that village and it’s not in the other villages, the second variety is just in the second village and not in the other villages, and the third variety is just in the third village. These are not independent variables. It’s not just not balanced, they’re not independent.

[00:12:48] Lily: Yes, because now, if you see a change in that second variety, is that actually a change because of the second village?

[00:12:57] David: You cannot distinguish between them. This is the really key point. In the data, there is no additional data, so in that case, you’d actually have zero degrees of freedom. You are not adding any degrees of freedom by adding a variety, which is already determined by your village.

[00:13:15] Lily: Interesting.

[00:13:16] David: And if you put your variety in first, and then your village, you’d have zero degrees of freedom for your village because you’re not adding any distinguishing feature because everything, the variety is determined by the village, and the village is determined by the variety in that data set. So your data set can’t distinguish between them.
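The village and variety example can be checked numerically. The sketch below is not how any particular statistics package works internally; it simply assumes that the degrees of freedom a term adds equal the increase in the rank of the design matrix when that term's indicator columns are appended. All function names and the nine-plot data sets are invented for the illustration.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def indicator_columns(values):
    """One 0/1 column per level of a factor, for every observation."""
    levels = sorted(set(values))
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]


def sequential_df(terms):
    """df added by each (name, values) factor, in order, after the mean."""
    n_obs = len(terms[0][1])
    design = [[1] for _ in range(n_obs)]   # the overall mean
    previous = matrix_rank(design)
    added = {}
    for name, values in terms:
        design = [row + cols for row, cols in zip(design, indicator_columns(values))]
        current = matrix_rank(design)
        added[name] = current - previous
        previous = current
    return added


# Hypothetical data: 3 villages, 3 plots in each.
village = [v for v in ("V1", "V2", "V3") for _ in range(3)]
variety_crossed = ["A", "B", "C"] * 3                        # every variety in every village
variety_confounded = [{"V1": "A", "V2": "B", "V3": "C"}[v] for v in village]

crossed_df = sequential_df([("village", village), ("variety", variety_crossed)])
confounded_df = sequential_df([("village", village), ("variety", variety_confounded)])
reversed_df = sequential_df([("variety", variety_confounded), ("village", village)])

print(crossed_df)      # village 2, variety 2
print(confounded_df)   # village 2, variety 0
print(reversed_df)     # variety 2, village 0
```

Whichever of the two confounded factors goes in first takes both degrees of freedom, which is the sense in which order matters only when the variables are not independent.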

Okay, that’s a pretty extreme example, that’s probably a design problem rather than an analysis problem, but these things happen. And when you have bigger studies, they actually happen quite a lot where suddenly, you are expecting a particular number in your degrees of freedom, n – 1, and you don’t get it. Then what you should do is you should obviously start playing around and checking to see what are the other variables which are dependent, which have a dependency with this variable? These aren’t independent variables now.

And of course once you start getting into the interactions, you start looking at the interactions, then it is even more likely. To get something which doesn’t exist in an interaction, you would just need there to be one village which doesn’t have a particular variety, and you would have fewer degrees of freedom than if all villages had all varieties.
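A small sketch of the missing-combination case follows. It relies on a stated assumption: the design is connected, so the two main effects together use (villages − 1) + (varieties − 1) degrees of freedom, and the interaction gets whatever is left of the non-empty village by variety cells. The function name and data are hypothetical.

```python
def interaction_df(cells):
    """df for village:variety, given the (village, variety) pair of each observation.

    Assumes a connected design, so the mean plus both main effects
    use (villages + varieties - 1) parameters in total.
    """
    present = set(cells)
    n_villages = len({village for village, _ in present})
    n_varieties = len({variety for _, variety in present})
    return len(present) - (n_villages + n_varieties - 1)


villages = ["V1", "V2", "V3"]
varieties = ["A", "B", "C"]

# Hypothetical data: 2 plots of every variety in every village.
full = [(v, w) for v in villages for w in varieties for _ in range(2)]

# Same study, but variety C was never grown in village V3.
missing_one = [cell for cell in full if cell != ("V3", "C")]

print(interaction_df(full))         # 9 cells - 5 = 4, i.e. (3 - 1) * (3 - 1)
print(interaction_df(missing_one))  # 8 cells - 5 = 3
```

One empty cell is enough to drop the interaction from 4 to 3 degrees of freedom.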

So these are things which are very simple and we’ve still not got beyond the degrees of freedom, the first part which is so rich and interesting and important. And, as a statistician who has been working with this a lot, this is the first thing I look at, to make sure that actually the model corresponds to what I expect, that very simply that it has the right degrees of freedom, if I have independence in the way I expect it and that I have the right residuals, the right error terms, the things which remain left over.

Because of course, the other thing, which is very interesting, and people sometimes do this, you have a very well designed study, where you have all the possible interactions and something, you’ve added everything in, and you are left with no residuals, no remaining error terms. And so therefore, it means that, well, you’ve got samples of size one, so you have nothing on which to base your conclusion, you have no measure of random variability because your study doesn’t have repetition within it.

Repetition is one of the things which creates the residuals, which means your residuals always should be much bigger. And if you don’t have that, then your error term, the degrees of freedom left in the residual after you’ve put in all your model, that’s your measure of natural variability on which you are basing all the other comparisons. So that’s so important that you have the right residuals.
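To illustrate why repetition matters, the sketch below counts the residual degrees of freedom when every main effect and interaction has been put into the model, so the model fits one mean per treatment combination. The function name and the village by variety layout are assumptions made for the example.

```python
def residual_df(cells):
    """Residual df when all main effects and interactions are in the model.

    The model then fits one mean per distinct combination, so what is left
    is observations minus distinct combinations.
    """
    return len(cells) - len(set(cells))


villages = ["V1", "V2", "V3"]
varieties = ["A", "B", "C"]

# Hypothetical data: every combination observed exactly once.
single = [(v, w) for v in villages for w in varieties]

# The same layout with each combination repeated twice.
replicated = single * 2

print(residual_df(single))      # 9 - 9 = 0: nothing left to measure variability
print(residual_df(replicated))  # 18 - 9 = 9
```

With zero residual degrees of freedom there is no estimate of natural variability to compare anything against.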

And yet another thing, we’re still on the degrees of freedom. And we’ve gone through and we’ve sort of seen all these different things to look at and check, all this interesting information. Much more interesting than the last column, your p-values, which just tells you “do you have enough data or not to be confident in the results you’re observing?” That’s all it’s telling you.

The degrees of freedom, the trivial first column is already much richer as a visual tool summarizing the data and representing the model, and giving me that information around the model, about my data, about dependencies in the data, about the measures of variability, how much data do I have to measure that variability, how reliable do I think it’s going to be, depend on the degrees of freedom.

All of that information in that first column. Whew, that’s quite a lot to put in. You know, this isn’t very related to the West African community of practice, but I think what is important, which does tie this back in, is, just as we’ve done in this discussion, and this might be all we have time for in this episode. When you start looking at your data, when I’ve started looking at data with people, going back to the simple question of “if you are looking at an ANOVA, if you have some form of model you fitted and you are now looking at how the variability, the analysis of variance, ANOVA, is looking at that, well, start by making sure you understand exactly the model you fitted: does it correspond to the degrees of freedom that are in the model? And therefore, are you comfortable that you can go on and look at the variability?”

And this is something where, when I’ve introduced this to partners who are looking at this, the number of times it has really helped people to recognize and to ask the right questions, well, “am I looking for trend or am I looking for within group variability?” That’s a really important question, “do I have the right model to start with?”

And being able to have that discussion just based on that degrees of freedom column within the analysis of variance, and we already see how as a descriptive tool, this is so important because this is now describing your data and the model you are fitting. That degrees of freedom [column] is helping you to understand things about the model you have chosen to fit and the data you are fitting into it, and how they relate to one another.

[00:18:36] Lily: Yeah. And I think it’s somewhat appropriate that a podcast on ANOVA has actually only focused on that first column, just to help show the depth of what you can get. Because there’s even ideas in there that I’ve not considered before, on just that degrees of freedom, I know about the other two columns – a little bit.

[00:18:55] David: Well, we have three other columns, and they get to the p-values. So yes, let’s do that in the next episode.

[00:19:01] Lily: Yes. Because in an ANOVA table, degrees of freedom is your first column, your p-value is your fifth column. So there’s still lots more.

[00:19:08] David: People normally only look at the p-value. And we got to the trivial first column and had that depth of it. And this is what I think’s so interesting and so important and why I think we need to rethink how people get exposed to this, so that they actually don’t just skip and look at the p-values.

Let’s call it a day for this episode and have a next one where we get into the sum of the squares.

[00:19:31] Lily: Yeah, absolutely. Otherwise, we’ll have one very long one. No, thank you very much. 

[00:19:36] David: Thanks.