151 – Data Variability

The IDEMS Podcast

Description

Data scientists Lily Clements and David Stern discuss the concept of variability in data analysis. They explore different types of variability, particularly in the context of using ANOVA (Analysis of Variance) to analyse data variability. Drawing on practical examples from agriculture, they consider the complexity of distinguishing between natural and unaccounted-for variability.

[00:00:04] Lily: Hello and welcome to the IDEMS podcast. I’m Lily Clements, a Data Scientist, and I’m here with David Stern, a founding director of IDEMS. Hi David.

[00:00:14] David: Hi Lily. What are we discussing today?

[00:00:17] Lily: I thought today we could go into variability and these kind of different types of variability and when one type moves into another type, obviously there’s quite a bit of background we should probably give here.

[00:00:28] David: Let me see if I can interpret where you are coming from. When you are analyzing data and you use something like an ANOVA, an analysis of variance, you are analyzing the variability. So you have a measure of your variability, and you then measure certain amounts of variability as being accounted for by the variables that you are using to explain, and this is variability within a variable you are using.

You are accounting for some of that variability by other variables, be they factors or continuous variables. And then you have your residual variability: the variability that hasn’t been explained. Broadly, you think of that as having two components to it, one of which is the natural variability.

This is the variability which just exists in nature related to the phenomenon you are observing. And the other part of it is variability that could be explained if only you had the right variables to explain it or the right data to explain it. But it is therefore in that residual; it is in the unexplained variability.

But if only you had the right variables, you could actually… I’ve used the word explain, which I shouldn’t have done, because if I use that Peter Diggle would tell me off, wouldn’t he? And he’s right. I used to use the word explain a lot, and Peter Diggle put me in my place, helped me to recognize that accounting for it is better, because people think of explained variability as being understood, whereas this is more a bookkeeping exercise, as he puts it, where you are accounting for the variability rather than it actually being causal. Just because something’s correlated doesn’t mean it’s causal.

Whereas “explained” implies causality, and therefore that is how people understand it. If I ever say explained again in this episode, catch me. I’m still struggling, two years after Peter Diggle put me straight in my thinking on that one.
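The bookkeeping David describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the yields are invented numbers, and `anova_bookkeeping` is a hypothetical helper written for this note, not something from the episode. It splits the total sum of squares for a one-way layout into the part accounted for by fertilizer level and the residual part that is left over.

```python
from statistics import mean

# Invented plot yields (t/ha), grouped by a hypothetical fertilizer level.
yields = {
    "low": [2.1, 2.4, 2.0, 2.3],
    "medium": [2.9, 3.1, 2.7, 3.0],
    "high": [3.6, 3.4, 3.9, 3.7],
}

def anova_bookkeeping(groups):
    """Split the total sum of squares into the part accounted for by
    group membership and the residual part left within groups."""
    all_values = [y for ys in groups.values() for y in ys]
    grand_mean = mean(all_values)
    ss_total = sum((y - grand_mean) ** 2 for y in all_values)
    # Accounted for: how far each group mean sits from the grand mean.
    ss_accounted = sum(
        len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values()
    )
    # Residual: how far each plot sits from its own group mean.
    ss_residual = sum(
        (y - mean(ys)) ** 2 for ys in groups.values() for y in ys
    )
    return ss_total, ss_accounted, ss_residual

ss_total, ss_accounted, ss_residual = anova_bookkeeping(yields)
print(f"total sum of squares:        {ss_total:.4f}")
print(f"accounted for by fertilizer: {ss_accounted:.4f}")
print(f"residual (unaccounted for):  {ss_residual:.4f}")
```

The two parts always add up to the total, which is why "accounting" is an apt word: the code says nothing about whether fertilizer causes the differences, only how the variability is divided up.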

[00:02:57] Lily: Very good. So if we give a sort of example, I guess it’s maybe with, say, crops. You could have a bunch of crops which are all planted next to each other and they have a different yield amount, and they’re all given the same conditions.

They all have the same like watering. They’re all in the same place, the same soil type, but there’s still some differences between them.

[00:03:23] David: Absolutely. That’s natural. It’s only natural that things vary.

[00:03:28] Lily: Yeah. But then it could be that some of that natural variability could be accounted for, ’cause let’s say that actually you thought it was all the same, but, actually, you’re on a slope and so part of your, some of your crops are getting more water. I’m not sure.

[00:03:46] David: Well, but that’s no longer natural variability. This is the key point about that unexplained variability. I think a better way to think of that would just be to say, okay, let’s say these were, instead of individual plants, different plots, and you were taking the yield of the different plots. Then the plots might have been treated differently.

They might have got different levels of fertilizer, they might have got different levels of shading. Imagine there was a big tree next to one of the plots and not next to another one. So if you know this information, it can help you to determine the more fertilizer you put on, maybe the more yield you would expect.

And therefore the adding of the fertilizer information… If you don’t know how much fertilizer was put on the different plots, then that unexplained variability,

[00:04:41] Lily: Unaccounted for variability.

[00:04:43] David: Thank you. That variability, which is unaccounted for, will just be there in your residuals, making your natural variability appear higher.
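A small simulation can make this concrete. The sketch below is an illustration under invented assumptions (the effect sizes, the `NATURAL_SD` value and the `residual_variance` helper are all hypothetical): yields depend on fertilizer and on shading, plus natural plot-to-plot variability. Fitting a model that knows only about fertilizer leaves the shading effect in the residual, so the apparent natural variability is inflated.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

NATURAL_SD = 0.2  # assumed natural plot-to-plot variability (t/ha)

# Balanced design: 3 fertilizer levels x shaded/unshaded x 20 plots each.
plots = []
for fert in (0, 1, 2):
    for shaded in (0, 1):
        for _ in range(20):
            y = 2.0 + 0.7 * fert - 0.8 * shaded + rng.gauss(0, NATURAL_SD)
            plots.append({"fert": fert, "shaded": shaded, "yield": y})

def residual_variance(plots, keys):
    """Residual mean square after fitting one mean per combination of `keys`."""
    cells = {}
    for p in plots:
        cells.setdefault(tuple(p[k] for k in keys), []).append(p["yield"])
    ss_resid = sum((y - mean(ys)) ** 2 for ys in cells.values() for y in ys)
    return ss_resid / (len(plots) - len(cells))

var_fert_only = residual_variance(plots, ["fert"])
var_fert_and_shade = residual_variance(plots, ["fert", "shaded"])

print(f"assumed natural variance:          {NATURAL_SD ** 2:.3f}")
print(f"residual variance, fertilizer only: {var_fert_only:.3f}")
print(f"residual variance, with shading:    {var_fert_and_shade:.3f}")
```

Once shading is included, the residual variance should fall back close to the assumed natural level; without it, the residual mixes natural variability with variability that could have been accounted for.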

[00:04:53] Lily: Yeah, sure. So there’s these kind of three different types, then. There’s the accounted for, where we can say, okay, we’ve got these differences in our yields because of these different fertilizer levels, fertilizer amounts, if we know that we did these different fertilizer amounts. There’s the kind of natural variability in there, that our plants within a plot are just naturally a little bit different to one another, just ’cause it’s natural.

[00:05:22] David: Natural for things to be different. Everybody is different. Everyone’s unique, everyone’s slightly different.

[00:05:29] Lily: And then we have our unexplained vari… unaccounted for variability, which is perhaps there is another factor in there that’s affected our plants or our crops or our yield that we don’t have information on.

[00:05:45] David: Or that we haven’t included in our model. Maybe we do have the information, but we haven’t included it in the model. I actually came across this just last week. There was one of the students in Burkina Faso, where I was, who was struggling. They were doing quite a lot of ANOVAs, but they were only putting one or two variables in, and actually they had more variables that were affecting their results.

And once they put in all the variables that they had, suddenly everything made sense in the right way. They had a very nice, balanced experiment. It was done on station, so it was all very easy. But they were making the analysis really rather hard for themselves because they were only using one-way and two-way ANOVA, because that’s what the software they were using encouraged them to do, which is where software is actually important in all this.

[00:06:39] Lily: Sure. So then, so the unaccounted for could be that you’ve just not put it into your model. It could be that you don’t have information on that. It could be that you haven’t collected that data. It could be for many… So

[00:06:51] David: Many other reasons.

[00:06:53] Lily: I guess then the question is, now that we’ve caught up, when does this kind of unaccounted for variability become natural? And when does it become unaccounted for? Sorry. When does our unaccounted for variability become natural, when does it remain unaccounted for, and when does it become accounted for? So what I mean by when does it become natural is, say, like the…

[00:07:29] David: Let me, because I think I’ve understood.

[00:07:32] Lily: Okay.

[00:07:33] David: Let me see if I can re-articulate your question.

[00:07:35] Lily: Thank you.

[00:07:37] David: When you fit a model, you have some variability accounted for, and the rest is your residual, which is unaccounted for. Take that unaccounted for stuff, the residual variability.

How do you know which part of it is natural, and how much you could try to account for if only you had more data? Is that a way of putting it? Because it’s all unaccounted for, you don’t know what the natural variability is.

[00:08:10] Lily: Yeah. Yeah. We can’t pry the two apart. We can’t tell the difference between the two.

[00:08:16] David: And this is the thing: there are lots of different ways, depending on your subject area. So, for example, it could be that by doing a literature review, you are able to get an idea of what the natural variability for the thing you are studying looks like.

From the literature, from previous studies, or from other data you have in a different form, you might be able to get a pretty good idea of what natural variability is. In fact, you can go further. If you are a scientist in whichever field and you are interested in the long term, you could deliberately do studies where you try to make sure you have a good measure of that natural variability, so that you can then use this in your other studies, when it’s harder to know what’s natural and what’s just unaccounted for. So this can be part of a good scientific process.

[00:09:29] Lily: So have a really overall controlled environment so you can.

[00:09:33] David: Exactly. In the agricultural context, this is really what on station is; this is where it’s come from. You are trying to control everything to understand what is natural variability. Everything which is not natural variability, you should be able to account for, because it was in your design.

[00:09:58] Lily: That, in reality, we do have… For example, one example is, say, with contraceptive pills. It’s, okay, theoretically it’s this percent effective, but actually in reality people forget to take it, people take it less, people take it late. So it’s actually much less effective, well, not much less, but it’s actually less effective than it theoretically is.

And so similarly here, if we have this completely controlled environment where…

[00:10:29] David: But, I’m afraid, no, you’ve just reinforced the point.

[00:10:33] Lily: Okay,

[00:10:34] David: Somebody taking it late: that’s not natural variability. That is variability that could be accounted for if you knew that they took it late or they forgot it. You might not have that data, but everything you gave as an example, it could be known that this will affect it.

[00:10:59] Lily: Yeah. Okay. So it really is finding what is truly that natural variability, this overly controlled condition, and.

[00:11:09] David: And the point is, any more variability than that, then you are asking the questions: okay, this is something maybe we can do something about. So, just with the example you had, if you found that actually there was a lot of information, which was therefore different from the natural variability, you could then be saying: should we be changing the instructions, or should we be changing how people use this? Because it seems that we have a phenomenon which is understandable.

[00:11:52] Lily: There’s that human element in here that isn’t accounted for, isn’t measured, or isn’t in our… Okay.

[00:12:01] David: Now, of course, I don’t quite yet understand, in your example, what data we have and what data we don’t have, and really at the heart of this, it’s all about analyzing the data, because that’s what we’re really talking about. It is not necessarily just about a phenomenon, it’s about the data you have on a phenomenon.

[00:12:19] Lily: Okay. So I guess going to your example of this, I’m imagining this overly controlled environment where we could truly see, okay, what variability here is natural. I guess my question to that is, to what extent is that useful? Because in the real world we do have these random elements.

[00:12:42] David: So let me give you a very concrete example to illustrate what you’re saying. And I’m going to come back, I’m afraid, to the agricultural example, just because we’ve been on that already and it’s a really easy one.

[00:12:52] Lily: That.

[00:12:53] David: So if you are in a station in your highly controlled environment, then your plot is actually relatively even. If you go onto a farmer’s field and you make a plot… I love walking around when I’ve seen people doing experiments on farmers’ fields, because their plots, whatever size you make them, they’re not even. You go around and you look, and somebody’s burnt down a tree here, or somebody used to keep cattle at the other end of the field.

And so the fertility is different. There’s just all this local variability in the field, in the plot itself that wouldn’t necessarily be there in your controlled environment. And you can’t really use this to explain because people try to use it. But this can be really local. This can be a one meter radius thing where something happened.

There’s more fertility in that part of the field, or less fertility in that part of the field. Now, if you talk to a farmer, in many areas of the world, they know their land. And so they will be able to tell you, in some sense, oh yeah, that’s a dead spot. I don’t know why, but it is dead. That bit just doesn’t give, and this, oh, this is a really fertile spot and so on.

So they know the differences within their field and they’ll actually act accordingly. But if you are trying to do an on-field experiment in a farmer’s field, then that micro variability, this is different to natural variability, and this is really important: it is natural on their field, but it is not because of the crop and it’s not because of your treatment.

Yeah? It might have an interaction with your treatment, but from the perspective of that on farm experiment, it is just natural variability,

[00:14:55] Lily: Yeah.

[00:14:55] David: but it is really useful to distinguish that from well globally, what we’d expect for this crop.

We’d expect this level of natural variability.

[00:15:10] Lily: Yeah. And then, if you know what natural variability to expect or what the kind of distribution should be, that can help when modelling to tell if there is something else going on in there that’s not yet being accounted for. Sure.

I guess then my question for that is, if we have our real life field, our farm, and then we have our kind of… was it on station you said, our like overly…

[00:15:42] David: and our controlled environment

[00:15:44] Lily: and so which one of them would be more generalizable to someone else in a different field? And I guess that depends, but yeah, I guess that.

[00:15:57] David: I guess that’s almost asking the wrong question, so let’s see if we can reformulate that into a question which is sensible. It doesn’t matter which one; it’s not clear. You can’t determine in advance which one is useful for another particular field. What I think is a really valuable question is: when should we use on farm experimentation, and when should we be using on station experimentation?

And that’s another discussion that I went into in great detail last week with a particular group, and both have value. If you understand the scientific process, and so you know what the result should be, doing it on station doesn’t really bring much. You’ve got a very controlled environment, but you don’t get much new information.

Whereas repeating things that are known on farmers’ fields, you can get a huge amount of very rich information, especially if you do this at large scale with large numbers of farmers in a relatively uncontrolled way, because then that data gives you comparisons, it gives you rich information related to socioeconomic aspects as well.

There’s all sorts of complexity that comes in. And you actually get an idea of the fact that the variability you’ve got on station isn’t everything. There’s so much more, and you ask the questions: what level is the variability coming from? Is it related to the soils? Is it related to the household? Is it a socioeconomic thing? There’s so much which then is drawn out in terms of that complexity, where even relatively well understood processes become extremely complex and unknown once you start adding in elements of social complexity related to the farming practices, the farms themselves and so on. And that’s when, I would argue, you start getting the opportunity to have really rich and interesting information.

But if you have a process which is not yet known or understood, you don’t want to do it with that complexity. That’s when you want to go back on station to your controlled environment to understand the process, to maybe do really fine scale measurements, to make sure you are able to document the process, maybe even to model it in the future, to be able to understand it when you see the results on field.

So it’s really about what is known and what’s unknown. In the on-station case, it is extremely valuable to use these controlled environments to move knowledge forward on processes which are not fully understood. But once a process is understood, that doesn’t mean you understand how it interacts with its environment, with its context,

[00:19:14] Lily: Okay, great. Yes, absolutely.

[00:19:17] David: and that’s what, in this particular case, the on farm would do. Now I am conscious we weren’t necessarily wanting to just talk about the agricultural side of this, but I think it is quite easy to understand this through this idea of: are you trying to understand a new process which is not yet understood, in which case: controlled environment? Or are you trying to understand how a process interacts with its environments and context? And that’s a very different process. So this is where there’s a great spectrum, which is consistent across disciplines, across areas. And what happens to the variability is very different.

[00:20:10] Lily: That’s incredibly interesting. And to bring that back to, I guess, the question of this unaccounted for variability, or just to try and help myself to understand this unaccounted for variability even more: it’s this aspect of the interaction with the environment that is where a lot of unaccounted for variability comes in, which seems natural. I want to argue it’s natural, because the environment is natural. It’s natural that there’s a plot of land here that happens to be more fertile than that plot of land there, but it’s not natural in the same sense of…

[00:20:51] David: Exactly, and I think this is the key: it’s natural that there is variability that comes from environment and context. But that variability, maybe it could be accounted for if you understood certain things better,

[00:21:12] Lily: Okay.

[00:21:13] David: and that’s the key difference. In your context where you are able to control, there should be almost nothing except natural variability and variability which is accounted for. That’s your ideal scenario there. You only have two types of variability: natural variability, and variability which you can account for. That’s how you set up in a controlled environment. This is why laboratories as a general concept are so valuable and so useful. But, yeah.

[00:21:54] Lily: I guess then it makes me wonder, where does natural variability come from? I know that everything’s different. Every snowflake is different. But doesn’t this natural variability come from somewhere? Like I…

[00:22:07] David: Very good question. Yes, and

[00:22:11] Lily: sorry.

[00:22:13] David: No. This is a really good question. In certain cases, of course, you could trace some natural variability down to differences in DNA, as an example, where your genetic variability, however small, will lead to variability in yield or in other measurable variables.

And so there is genuine variability which is explainable; in another study, this might be variability you can account for.

[00:22:56] Lily: yeah. So where do we draw the line of what’s natural?

[00:23:00] David: You draw the line based on what you’re studying.

[00:23:04] Lily: Okay.

[00:23:06] David: Yeah, and this is the key thing: you can draw the line at many different places in many different cases, but there is no single line to be drawn. Just as you said with the variability due to different environments: it’s very natural that there are differences.

So in some ways, the distinction… This is where, in other contexts, we’ve used different terminologies. I think the good, the bad and the ugly…

[00:23:39] Lily: Okay.

[00:23:41] David: …is a nice way of thinking about this. The good variability is the one you’ve accounted for. The bad variability is the one you can’t account for, and the ugly is the one that you could account for, but you haven’t been able to,

[00:23:56] Lily: Yeah,

[00:23:57] David: is one way to see that.

That is obviously incorrect because your natural variability is not bad. It’s just natural. And so there’s no right way to describe this, but it is so important as a mental construct to be thinking about variability in this way of actually, what can you account for, what will you never be able to account for? And what’s that bit in between? Where is it worth putting in the extra effort to improve?

[00:24:28] Lily: Yeah.

[00:24:29] David: I see people using statistical concepts like R squared, which is a measure of how much variability has been accounted for in particular types of models. And

[00:24:41] Lily: Yeah.

[00:24:42] David: They throw the number around with abandon and they say, I’ve got this number. Okay, what number were you expecting?

Why were you expecting that number? Is it higher or lower than you were expecting? Just the number itself doesn’t tell you anything, because it depends on the phenomenon. Now, maybe within a particular context, people are used to a certain level of variability that you can account for, and therefore a certain type of R squared; depending on your domain, the number is interpreted by experts in the field in certain ways. But too often I find even experts in their own field are not actually asking the right questions about that number. And it’s natural, because they’re interested in the phenomena, whereas this is just a number which is coming out, which is supposed to just tell them whether what they’re doing is good or bad. Whereas actually thinking: well, what number is actually meaningful for me in my context?

Would I ever be able to achieve it? Is it worth the effort to do more, to get the number higher or not? These are the really interesting questions and there’s… one of the things I love about open science frameworks in different ways is that more and more, it’s encouraging people to really think about these questions before they collect any data.
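To see why an R squared value means little without a sense of what to expect, here is a minimal sketch under invented assumptions. The same fertilizer effect is simulated twice, once with low natural variability and once with high; `simulated_r_squared` and all the numbers are hypothetical and chosen only for illustration. The model is equally "right" in both cases, yet the share of variability it can account for differs sharply.

```python
import random
from statistics import mean

def simulated_r_squared(natural_sd, seed=0, plots_per_level=30):
    """R squared of a one-way model (one mean per fertilizer level)
    fitted to simulated yields with the given natural variability."""
    rng = random.Random(seed)
    groups = {
        fert: [
            2.0 + 0.7 * fert + rng.gauss(0, natural_sd)
            for _ in range(plots_per_level)
        ]
        for fert in (0, 1, 2)
    }
    all_y = [y for ys in groups.values() for y in ys]
    grand_mean = mean(all_y)
    ss_total = sum((y - grand_mean) ** 2 for y in all_y)
    ss_resid = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    return 1 - ss_resid / ss_total

r2_quiet = simulated_r_squared(natural_sd=0.1)
r2_noisy = simulated_r_squared(natural_sd=1.0)

print(f"R squared with low natural variability:  {r2_quiet:.2f}")
print(f"R squared with high natural variability: {r2_noisy:.2f}")
```

In the noisy setting no additional variable could push R squared much higher, because what remains is natural variability. Whether a given value is good or bad therefore depends on the ceiling that the phenomenon itself imposes.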

You’re supposed to have your protocols so they can be critiqued and so on. You’ve done this with a study, actually, with one of our collaborators at Oxford, where before she started the study you built a whole simulation model for her, so that you could play around and understand: well, okay, if the errors are at this sort of level, now, what would we expect to see as results if this is what happened? And those simulation models, running them beforehand can be extremely powerful, as you saw for yourself.
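As a rough idea of what such a pre-study simulation can look like, here is a minimal sketch. It is not the model built for the Oxford study, whose details are not given in the episode; the two-arm design, the effect size, the error level and the `simulate_estimated_effects` helper are all assumptions made for illustration. It repeats a planned study many times under an assumed error level and shows how much the estimated effect would bounce around for different sample sizes.

```python
import random
from statistics import mean, stdev

def simulate_estimated_effects(n_per_arm, true_effect, error_sd, runs=500, seed=0):
    """Simulate a two-arm study `runs` times and return the estimated
    treatment effect (difference in arm means) from each run."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(runs):
        control = [rng.gauss(0.0, error_sd) for _ in range(n_per_arm)]
        treated = [rng.gauss(true_effect, error_sd) for _ in range(n_per_arm)]
        estimates.append(mean(treated) - mean(control))
    return estimates

# Assumed values, to be replaced by whatever is plausible in a real study.
small = simulate_estimated_effects(n_per_arm=5, true_effect=0.5, error_sd=1.0)
large = simulate_estimated_effects(n_per_arm=50, true_effect=0.5, error_sd=1.0)

for label, estimates in (("5 per arm", small), ("50 per arm", large)):
    wrong_sign = sum(e <= 0 for e in estimates) / len(estimates)
    print(
        f"{label}: mean estimate {mean(estimates):.2f}, "
        f"spread {stdev(estimates):.2f}, "
        f"share of runs with the wrong sign {wrong_sign:.0%}"
    )
```

Changing `error_sd` plays the role of asking "if the errors are at this sort of level, what would we expect to see?" before any data are collected.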

[00:26:58] Lily: Yeah, yeah absolutely.

[00:27:01] David: And at the heart of that is this question: do we actually know what to expect? And that’s such a powerful process to get people thinking about. The only problem, of course, is that it can slow people down. And so, you know, there’s always a fine line about how much, or when, you want to slow people down before they actually act.

And when do you want to encourage people to act? This is different in different domains. If there’s any chance of human harm, I want to slow them down, and there are good reasons why, in certain cases, these protocols have become much more rigorous than in others.

[00:27:47] Lily: Yeah. And I just want to add, you more than anyone advocate for slowing down sometimes, for conceptualizing, spending time on thinking things through properly now so that we don’t spend that time later on redoing things, or ‘oh no, I don’t have the variables I was meant to have’ and whatnot. I don’t want slowing down to be seen as a negative.

I know that in my head…

[00:28:13] David: Exactly,

[00:28:13] Lily: I’m very poor at slowing down a lot of the time.

[00:28:18] David: And sometimes it’s negative and sometimes it’s positive.

[00:28:23] Lily: Great. Do you have any final thoughts?

[00:28:27] David: Thank you for bringing this topic up. It’s been a fun discussion. It’s interesting that, it’s the sort of thing where probably we should have a short version of this with some slides to make this more understandable. This is maybe not the best topic for a podcast. This might need some visuals. It would come to life.

It would be so much easier with a few visuals, actually seeing maybe some outputs of an ANOVA to help people to grapple with this. And I should take a step back to say, of course, ANOVAs are what you get out of particular types of models, but it doesn’t matter. You can do generalized modelling and you still get measures of variability.

In any form of modelling, you get a measure of variability. So this isn’t technique specific. Everything we’ve discussed applies across different methods. It applies to machine learning processes just as it does to traditional statistics: variability, the measurement of variability, and how you interpret it.

This is something where it’s a universal thought. It’s not specific to a particular data analytic method.

[00:29:48] Lily: Great. Thank you very much, David.

[00:29:50] David: Thank you and yeah, look forward to our next episode together.