177 – Mathematical Modelling vs Statistical Modelling


Description

In this episode, Lily and George discuss the nuances and differences between statistical and mathematical modelling. They explore how each field approaches modelling, touch on hybrid models that incorporate both statistical and mathematical elements, and consider the significance of uncertainty in modelling predictions.

Lily: [00:00:00] Hello and welcome to the IDEMS podcast. I’m Lily Clements, a Data Scientist, and I’m here with George Simmons, a Postdoctoral Impact Activation Fellow. Hi, George.

George: Hi Lily, how are you?

Lily: I’m particularly well ’cause I’ve managed to say your title correctly that time.

George: Yeah. Although I remember last time we spoke we didn’t quite figure out which way to say it.

Lily: No, I know. And I still started with postdoctoral at that time. 

George: I think that’s the way I prefer.

Lily: Okay. Good. Anyway, I thought today we could discuss models.

George: Yeah. Models are very interesting ’cause, I suppose, we both describe ourselves as working with models.

Lily: Yep. 

George: But I bet there’s very little we actually do in common about that.

Lily: Interesting, because I’ve just assumed that your models are my models. Like I see models as a stats thing, and you’ve decided to come over to the little stats world, and you’re just refusing to admit it. [00:01:00] Because it’s just not true.

George: Yeah. I guess concretely one of the courses we’re developing for the Masters of Mathematical Innovation with the Open University of Kenya, is currently titled an Introduction to Systems Modelling. And I’ve spoken to David about this on another podcast, but that course is meant to be a broad overview of the kind of modelling techniques that you could use to… what I call model a system.

Lily: Okay, sorry, but what is systems modelling then? 

George: That’s a good question. Very broadly, a system is, I guess, anything you want. It’s probably best to look at examples: you could have the traffic flows in a system, you could have the electrical flows in a brain, you could have the population dynamics in an ecosystem, which is close to where I come from. So they can be all sorts of different scales, [00:02:00] and types, and contexts, but often the most interesting ones are all complex in some way.

Lily: I’m just trying to work out the difference between our models, or how do I define what a stats model is? So do you use data in these models? 

George: Yes. There’s an element of data, but there’s also an element of, I suppose, process based theory that goes into them. For example a model of how the planets in our solar system orbit, that model is based on Newton’s laws of gravitational attraction. So that there’s this kind of, yeah, as I term, process based component to the models, which then you can supplement with your data that you observe on where the planet is, how fast it’s going, what its orbital radius is, and so on.

But, a model in that sense is built around a core theory [00:03:00] of what you believe happens. And we shouldn’t forget that even something like Newton’s Law of Gravitation is not necessarily provable in a rigorous sense. It’s still based on observations and inferred from empirical evidence. And I think a good example of that I cover in this systems modelling course is to get students to work through orbital mechanics and kind of start with what are called Kepler’s laws of motion.

So these existed before Newton and I forget the exact statement, but one of Kepler’s laws of planetary motion is essentially a relationship between the orbital period and the orbital radius of a planet. And you can derive it just by looking at data and fitting a model. And it turns out that some power of something is proportional to some power of something else. But it is [00:04:00] empirically derived, if that makes sense.

And then somewhere down the line that then becomes the core of your process based model, which you can then apply to a new situation.
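Editor's sketch: the law George is reaching for is Kepler's third law, which says the square of the orbital period is proportional to the cube of the orbital radius (semi-major axis). The short Python example below recovers that exponent purely by fitting a straight line to logged data, in the empirical spirit he describes. The planetary values are rounded textbook figures, and the function name `fit_power_law` is made up for this illustration.

```python
import math

# Rounded semi-major axis (AU) and orbital period (years) for six planets.
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def fit_power_law(pairs):
    """Least-squares fit of log(T) = intercept + slope * log(a)."""
    xs = [math.log(a) for a, _ in pairs]
    ys = [math.log(t) for _, t in pairs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = fit_power_law(list(PLANETS.values()))
# A slope close to 1.5 means T is proportional to a**1.5, i.e. T**2 to a**3.
print(f"fitted exponent: {slope:.3f}")
```

Nothing in the fit knows about gravity: the exponent comes out of the data alone, which is the sense in which the law is empirical before Newton's theory explains it.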

Lily: Okay. So I guess these models differ from, you don’t have necessarily like this data set, but instead you have these different equations.

George: Yeah, a good model should always be focused on data, but you’re right that there is a focus on, in some sense and this is what I’d like to get into, only some models are equation based in this way. And there are other techniques that you can use to try and mathematically model a system.

But I suppose you’re right. So is it true, from your perspective, that when you are constructing a model, your focus is that data set? You don’t necessarily know anything more about [00:05:00] how things are going to be structured.

Lily: Define structured. I guess because you know what you are expecting, right, you’ve looked at your data or you’ve observed different things, like you’ve still looked at your data before doing a model so you know what you should expect in your model and you can verify that through like observations, you can verify that through your data, through like describing it.

If there’s something that you can’t explain in your model, you can verify that through like experts. There was one thing I was working on with how different lakes were looking over the course of kind of 20 years in Bangalore in India, which had become very built up over that time. I believe it’s become the Silicon Valley of India over that time and how it’s been built up.

And so looking about, okay, what changes have happened to those lakes? And there was nothing really straightforward that had happened to them. But when I spoke to the expert, the guy that knew the area, he always had an explanation of that makes sense for this lake because this [00:06:00] lake is positioned in this place. And this is like you had in there if it was like an urban place or not, but it was like, oh, this lake is just really shallow and stuff, so that makes sense.

So you can verify it. Your modelling shouldn’t be like, it shouldn’t be that you model and you go, wow, I didn’t see that coming. You should know before you model what you’re, what you are expecting. I guess, what’s the purpose of modelling in statistics versus in systems or in what you do as well?

George: You’ve said something interesting there, in your lakes example, that conversation with someone local kind of leads you to expect that position, as you said, should be a factor in some feature of your lake. I dunno quite what you are measuring.

Lily: Yeah, so we were measuring basically how these lakes have changed over the time. So it might be that they’re more like fungal, it might be that there’s less water, and things like that, how have the lakes changed over the course of, I think it was 20 years. 

George: I think what I’d like to get [00:07:00] into here is maybe things around cause and effect. So a model from my perspective or a mathematical model would try to assign the reason why the lake’s position would have that influence on the lake’s depth or size or whatever from a, I suppose, a physical or geological perspective. So how much does that come into your thinking or is it that you are more interested in coming up with all the factors that affect the lakes?

Lily: I guess at this stage it was about what were we exactly modelling. And I guess we weren’t necessarily modelling then in this sense, it was more, we were just trying to, at the moment, observe what differences are coming up and why, for [00:08:00] the purpose of kind of knowing how humans can affect, like how land use transforms the kind of ecosystems of lakes.

George: But it’s still a model in the sense that you are trying to generate some explanation of reality.

Lily: Yes, that’s true, that’s true. But it’s not a model in the sense of, we didn’t do an actual physical model with like p values and things to my memory. I think it was more at that stage descriptive and then saying, okay, we’re getting complete opposing things. Here we have a correlation between this and that, and here we don’t. Why is that? And the experts are able to actually explain why that is.

George: So, maybe that’s a difficult question then, so you say, okay, you didn’t quite get to the stage of fully modelling it. What would that look like for you?

Lily: Gosh, for me, I guess it would be some kind of, it would be like a formula, right? Like [00:09:00] in my head, if we were to model it, we would have our response variable and we’ll find the explanatory variables that affect the response. So maybe the response variable is some kind of thing on the lake ecosystem, some kind of composite variable that says how the lake ecosystem is. As it is I believe we were looking at different elements of the lake ecosystem, like the fungal layer, the kind of macrophytes, these different items on the lake. I should have probably read this paper before, like reread it, ’cause it’s from a couple of years ago now.

George: We didn’t really know we’re gonna talk about this.

Lily: No, no, no. This was not the intention.

George: But okay, so ultimately, you say you didn’t get to a model, but you definitely started the process of modelling, trying to find… 

Lily: Well, I wouldn’t, modelling isn’t like the point of statistical analysis, we are able to describe things and we are able to understand what’s happening here, [00:10:00] but modelling isn’t the kind of, wasn’t the end goal, or isn’t the end goal.

George: I see. That’s very interesting.

Lily: For me anyway.

George: No, but I’m kind of saying that there’s a lot of other useful things you can do that you wouldn’t consider modelling.

Lily: Yes. I have just remembered that one of the things was like, ’cause these were different images taken using Google Earth’s image search. So we had the kind of dates and stuff that the images were taken. And sometimes it was straight after the rainy season. But then sometimes that was like a particularly dry rainy season. And so that would also be saying a lot about the kind of area. But when you actually come to modelling not modelling, when you come to actually looking at the data, it’s… I don’t know that’s a particularly dry rainy season. Whereas the kind of local experts do.

We did add in there about the climatic, we added in there your rainfall and stuff, so that we could explore how the rainfall changed things. But again, I guess your, [00:11:00] I guess the point is that your data set can only say so much, your data set can only say what the rainfall amount was, whatever, in the month before or in the month of. Well, maybe it’s more about the month before than the month of. 

George: I was just thinking, a nice parallel is, ultimately, you’re trying to do the same thing. You’re trying to add in factors that help you account for reality in your system, whether that’s, as you say, trying to add a climate dimension to everything. And in that sense, there’s no, there’s no difference in the process.

The only difference comes from exactly how adding that component can change your, you know exactly what that component is. Let me kind of give an example of this then from the ecosystem [00:12:00] modelling perspective. I think it’s a really nice example of where there is bits of statistical modelling that kind of come together with this process based modelling.

At the base, you want to understand population dynamics in your ecosystem, so you want to know how many of some insects there will be at a given time. And that’s something you can then equate to observational reality, or it’s a useful predictor to say, okay, well, I think there’s going to be a lot of insects, we should probably do something about it.

And very basic population models start with, okay, what factors are influencing the numbers over time? And you say, okay, these things reproduce, there’s some kind of birth rate going on. They die, so some kind of mortality [00:13:00] rate going on. And then maybe they get predated on or they predate other things. So there’s some kind of mortality or growth factor going on there.

You can start there, and the data comes in because you say, okay they’re reproducing, but how much, how quickly. So often there’s, you know, entomologists, people who study insects, will do a lot of experiments on trying to measure those rates, mortality rate, the reproduction rate, the predation rate. And when you’re looking at your ecosystem, I suppose the goal is to, like your lake features, the goal is to try and explain why, when you’re developing a model, why population dynamics, the numbers, actually equate to what you see.

Lily: Okay. 

George: And to get to that point, you keep trying to add [00:14:00] in more factors, which try and explain that. So that might be that there’s another insect, a parasite flying around, which affects your numbers of your insect. It might be that there’s a crop, a food shortage, which means they can’t grow as much as they want to. It might be there’s some toxin flying around. It might be that your crop has been engineered to be resistant to that insect.

And of course with CASAS, the group we collaborate in, they are also looking at the organism’s physiology, which essentially means how it responds to its environment, its climate, essentially. So it’s saying the rate at which things reproduce or predate or die is not constant, it depends on things like temperature; for insects, that’s the big driving force. For plants it could also depend on rainfall, how much soil [00:15:00] nitrogen there is, how much sunlight they’re getting and so on.

So I think that the process is the same. You’re just trying to add in more and more levels of explanation to enhance your model and get its prediction closer and closer to observed reality. And I think that process is similar to what you were describing.
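Editor's sketch: a minimal process-based population model of the kind George outlines might look like the Python below. It integrates dN/dt = (b(T) - m) N with a simple Euler step, where the birth rate b depends on temperature. Every number here (the bell-shaped rate curve, its optimum, the mortality rate) is a hypothetical placeholder, not a value from CASAS or from any entomological study.

```python
import math


def birth_rate(temp_c, b_max=0.30, t_opt=25.0, width=8.0):
    """Hypothetical per-day birth rate that peaks at t_opt degrees C."""
    return b_max * math.exp(-((temp_c - t_opt) / width) ** 2)


def simulate(temp_c, n0=100.0, mortality=0.10, days=30, dt=0.1):
    """Euler integration of dN/dt = (birth_rate(T) - mortality) * N."""
    n = n0
    steps = round(days / dt)
    for _ in range(steps):
        n += dt * (birth_rate(temp_c) - mortality) * n
    return n


warm = simulate(25.0)  # births exceed deaths, so the population grows
cold = simulate(5.0)   # births nearly vanish, so the population declines
print(f"after 30 days at 25 C: {warm:.0f}, at 5 C: {cold:.0f}")
```

Adding predation, a parasite, or a food shortage would mean adding further terms to the same rate equation, which is the "keep adding factors" process described above.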

Lily: Yeah. So in a way in mathematical, like I’ve got my statistical head on too much, so you’re explaining it and I’m like, yeah, put in now an exponential term and get your data of how many kind of pests you currently have and you’re done. But I guess what you’re saying is in mathematical modelling there’s this underlying system which is understood and can be described mathematically a bit more, you can understand these rates.

George: Yeah, and I think as opposed to a difference in process, I mean that there obviously is a difference in [00:16:00] process and structure, but I think there’s a difference in purpose for doing it one way or the other. The way you describe is what a lot of people do, you know, they’re saying, okay, you keep adding in these climate factors, these numbers of other insects, the crop levels and so on, and you can probably come up with quite a good explanation of your insect numbers from a statistical perspective.

The reason that you might want to really try to understand how those physiological or climatic factors affect your population numbers is because you want to build a model that’s not just applicable in the places you observe. 

Lily: Wow. Okay. Whereas in statistics, you [00:17:00] very much should not extrapolate the model beyond the observed region.

George: Exactly, but there are very good use cases, your example is already one of them, where you don’t need to do any extrapolation, everything you want to account for is there, you just need to understand what’s going on. Whereas in other situations, there is this need to extrapolate and be predictive outside of things you’ve observed.

In ecosystems, that’s what if we put our insect into a new geographical climate location, which is very valid because these things migrate, they are invasive in a sense. You want to try and account for that. You want to understand how they are going to respond to climate change.

And again you don’t have in an observed ecosystem data set, you don’t [00:18:00] have the what happens if it’s plus two degrees for a long amount of time. 

Lily: No, yes, so we don’t have the observed data for it in the mathematical side. We’re trying to predict without that observed.

George: Yeah, I guess that can be summarised as the aim of trying to understand the process or the cause and effect in a mechanistic way. And it is the same for solar systems, like I talked about earlier. If your models were just based on data you’d observed, then you wouldn’t really be able to talk about what a comet that’s flying through our solar system, how its trajectory will be influenced by other planets. You know, Kepler’s laws don’t really talk about how three bodies interact with each other. So the sun, moon, and Earth, the gravitational system is [00:19:00] in some kind of balance.

Lily: Yeah.

George: And all affect one another. Very small for the effect on the sun and the earth, but it’s still there. And when you want to design GPS systems and whatever, those things are very important, as is understanding general relativity, the effect of gravity on time dilation.

Lily: Okay.

George: And the point is, yeah you can’t do that in a new scenario unless you have tried to understand what’s happening underneath.

Lily: Yeah. And so mathematical modelling is derived from mathematical principles, whereas statistical modelling is more derived from your data and your probabilistic reasonings. Mathematical modelling, you wanna describe or predict a system using known relationships, but I guess the predicting is a key bit there, whereas in statistical modelling you want to predict, [00:20:00] but it’s within your data you wanna predict. 

George: Yeah, that’s kind of an envelope. I always like the word decoupling in the sense of a mathematical model tries to decouple the core, the features from the observations.

Lily: Okay. Yes, I see. Whereas in your statistical, you’re looking at these relationships, you’re estimating kind of effects and you are predicting, but within your data.

George: Yeah. But in a statistical model you can achieve a very good explanation. Is it analysis of variance or something you can use to demonstrate that you’ve accounted for everything, is that right?

Lily: Yes, in an analysis of variance, which is used for modelling, the way you are using it there or saying it there, you can also use an analysis of variance table to describe your data. Within that, you can see how much of your variability is accounted for, how much of those changes in y, your response, is [00:21:00] your data actually able to account for?
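Editor's sketch: the idea of "how much of your variability is accounted for" can be illustrated with the sums of squares behind a one-predictor analysis of variance table. The Python below fits a straight line by least squares and splits the total variation in the response into a part explained by the model and a residual part. The rainfall and lake-area numbers are invented for illustration and are not from the Bangalore lakes study.

```python
def variance_breakdown(xs, ys):
    """Return (ss_model, ss_resid, ss_total) for a least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    fitted = [intercept + slope * x for x in xs]
    ss_model = sum((f - y_mean) ** 2 for f in fitted)
    ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_total = sum((y - y_mean) ** 2 for y in ys)
    return ss_model, ss_resid, ss_total


# Hypothetical data: monthly rainfall (mm) and lake surface area (hectares).
rainfall = [10, 20, 30, 40, 50, 60]
lake_area = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9]

ss_model, ss_resid, ss_total = variance_breakdown(rainfall, lake_area)
r_squared = ss_model / ss_total
print(f"share of variability accounted for: {r_squared:.2f}")
```

The residual sum of squares is the part left to the error term Lily goes on to describe.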

A really big thing I wanna talk about is uncertainty. I dunno if maybe it’s the same in them both, but I do know that in statistics, you have an error term, which is kind of assumed to follow a distribution. It might be a normal distribution, but you accept that there are some things in here that we can’t model, that randomness happens. You roll a dice and it’s not gonna be six every time.

Whereas in mathematical modelling, how do you explain those kind of stochastic processes?

George: I think there’s probably two sides to that. So there are people who deal in stochastic modelling, stochastic mathematical modelling, where what’s a good example, something like diffusion in a fluid, because ultimately that is powered by Brownian motion, which is a random movement of particles which collide with each other. [00:22:00] And in places where there’s more particles, you get more collisions and that kind of forces everything to move apart.

So, there are people who deal in stochastic models, which kind of say each particle moves randomly, and we can account for the collisions and start to kind of build that model. And then there are people who perhaps deal with a more macroscopic level, where they are aware that the position of an average particle in that fluid, which is diffusing, follows some kind of distribution.

Those are all under kind of one, one umbrella, there are models which account for randomness, but it’s still in a way that is expected randomness, if that makes sense.
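Editor's sketch: the two levels George mentions can be seen in a toy one-dimensional random walk, written here in Python. Each walker moves left or right at random (the microscopic, stochastic view), while the spread of the whole cloud of walkers grows in a predictable way (the macroscopic view): for unit steps the variance of position is expected to be close to the number of steps. The walker counts and step counts are arbitrary choices for illustration, and this is a simplification of Brownian motion, not a fluid simulation.

```python
import random
import statistics


def random_walk_positions(n_walkers, n_steps, seed=0):
    """Final positions of independent walkers taking +1 or -1 steps."""
    rng = random.Random(seed)
    positions = []
    for _ in range(n_walkers):
        x = 0
        for _ in range(n_steps):
            x += rng.choice((-1, 1))
        positions.append(x)
    return positions


for steps in (10, 40, 160):
    final = random_walk_positions(2000, steps)
    spread = statistics.pvariance(final)
    # The variance should land near the step count, up to sampling noise.
    print(f"{steps} steps: variance of position = {spread:.1f}")
```

Each individual path is unpredictable, but the distribution of positions is not, which is the "expected randomness" described above.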

Lily: Yeah, that does.

George: And the randomness can be observed some way, you can look in your data and you can [00:23:00] say, okay, there is some kind of mean, there’s some kind of variance. And then we can try and account for that in our equation models.

There are two things, I said, so I should probably say the second one, which is I guess more to do with once you’ve constructed your model, then there’s still the process of does it compare to reality or not? I don’t think you can, you can never expect that your model is going to be perfect because things are just so complex. But you can ask questions like, is your prediction falling within a reasonable enough band from the observation, for example. And I guess, if you added some kind of uncertainty, does your prediction get close enough to the observation?

Lily: So, I guess a big thing in statistical modelling is you have kind of p values that help illustrate that kind of certainty that this factor has a [00:24:00] relationship with the outcome, or that this element affects the outcome.

George: Yeah.

Lily: Do you have a way of, oh, but maybe you don’t even need that. 

George: No, I think you do, and I think you’re touching upon something that a lot of modellers, at least in the circles I go in, so biological modellers, that is not in the forefront of their way of working, trying to account for, like I said, you keep adding your bits and pieces to your ecosystem, but a lot of the time there’s not really a quantification of what that actually did.

Lily: Yeah, that’s interesting. 

George: And it is something that you can do, and I believe we should do, but as I say it’s just not at the forefront of how people approach it. 

Lily: Now just going back a step, [00:25:00] sorry, to your kind of deterministic or stochastic modelling. Let’s go to your example of the comet hitting the earth, or we both like cricket, so I wanna use a cricket example, but I’m not sure how useful that would be. But this kind of, you know, you’re showing your expected trajectory of A to B, in cricket, for example, of would the ball hit the wicket if the leg wasn’t in the way?

George: The Hawk-Eye system.

Lily: Yeah. Or the same in tennis. But there’s always a little kind of plus or minus in that.

George: Yeah, that’s a really good point. And, you’ll know if you follow asteroids, comets, whatever, they come with a probability that there’s been some observation, and they’ll say, okay, there’s a 1% chance it will impact, or there’s a 30% chance. 

Lily: Yes. that’s true. I’ve heard these terms before.

George: And I assume that’s coming, I guess, [00:26:00] for a few reasons. One is the data you have on the trajectory, there is uncertainty in that, especially if these things are very far away. And then I suppose there’s also uncertainty in all those little gravitational interactions as this thing is moving through the solar system, you know, it is going to get tugged by Jupiter and then backed by one of its moons. It’s very difficult even with today’s computing power to actually account for all of those things. Those two things do lead to uncertainty.

So I think in that respect, you are right that, you can assign uncertainty to even deterministic or process based models and carry that through. And I think that’s an example of where it is done, is just in biology people tend not to. And that is part of our work is to just try and [00:27:00] enable those practices to be carried along.
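Editor's sketch: one common way to carry uncertainty through a deterministic model is Monte Carlo sampling: draw the uncertain inputs many times, run the deterministic model on each draw, and report the fraction of outcomes of interest. The Python below applies this to a toy version of Lily's cricket question. It is not how Hawk-Eye works; the projectile formula ignores drag, spin and sideways movement, and the speeds, angles, distances and measurement errors are all hypothetical. The only real-world constant is the stump height of 28 inches (0.711 m).

```python
import math
import random

G = 9.81               # gravitational acceleration, m/s^2
STUMP_HEIGHT = 0.711   # height of the stumps in metres (28 inches)


def height_at_stumps(speed, angle_deg, start_height, distance):
    """Deterministic projectile height after travelling `distance` metres."""
    angle = math.radians(angle_deg)
    drop = G * distance ** 2 / (2 * speed ** 2 * math.cos(angle) ** 2)
    return start_height + distance * math.tan(angle) - drop


def hit_probability(speed_sd, angle_sd, n_draws=10000, seed=1):
    """Share of sampled trajectories that pass at stump height."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        # Hypothetical measured values with Gaussian measurement error.
        speed = rng.gauss(25.0, speed_sd)   # m/s
        angle = rng.gauss(8.0, angle_sd)    # degrees above horizontal
        y = height_at_stumps(speed, angle, start_height=0.30, distance=2.0)
        if 0.0 <= y <= STUMP_HEIGHT:
            hits += 1
    return hits / n_draws


print("no measurement error:", hit_probability(0.0, 0.0))
print("with measurement error:", hit_probability(1.0, 3.0))
```

With no input error the model gives a yes-or-no answer; with input error the same model yields a probability, which is the kind of statement quoted for asteroid impacts.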

Someone said to me that in physics, in climate science, even kind of chemistry, there’s a lot of people with mathematical backgrounds who go in and help develop the models, whereas biology is very different. Part of our work is just trying to help enable those practices. 

Lily: Yeah, that is interesting.

And so your interpretation of your mathematical model, it seems to be quite certain, whereas in statistical modelling, I guess, because you could see on like your ANOVA, like you alluded to, you can see, okay, we know that this doesn’t explain all of the differences, or, we could see, we haven’t actually been able to explain much of that kind of variability. But that’s within the data. But in statistics it can vary.

George: That shouldn’t be a huge difference, I don’t think. You know, people do mathematical [00:28:00] modelling maybe just to try and explain some feature of data still. So often you’ll see that, going back to insects, that numbers fluctuate and oscillate.

Lily: Okay.

George: And the goal might just be to construct a model which oscillates. It doesn’t have to match up to the peaks and troughs and frequency and whatever. You’re just trying to explain a mechanism which leads to oscillations. So I think in that sense, all modellers want to construct perfect models, but it’s difficult and you don’t know how to quantify every piece. So often people say, okay let’s just try and understand a feature or a set of features and try and account for those. So I think it’s the same.

Lily: Okay. So in your mathematical modelling world, it’s… I’m learning a lot.

George: I am too.

Lily: I guess I shouldn’t be surprised. I guess I just never really thought about it before, but of [00:29:00] course there’s been a lot of things done in differential equation modelling and stuff like that, you know, at university, kind of learning about that. 

George: Yeah. At some point in the modelling process for an ecosystem, someone is collecting a big sheet of data and doing something with it. 

Lily: But for you, it’s more kind of equations built from this, like theory, assumptions, data. Whereas in statistics, it’s more that you’re fitting an equation to the data. You have your data and you fit an equation to it. Whereas for you, you have your equation beforehand. 

George: Yes, you have an equation or at least a way of building your equations, some kind of methodology.

But I think what might be a good thing to end on is what I call hybrid models, where hybrid kind of means that you’re taking a lot of different techniques and piecing them together. And a lot of people [00:30:00] don’t like that terminology, but what the models that CASAS do and I’m working on, they are hybrid in the sense that there are statistical elements that I think are statistical in your sense.

And that specifically comes from when someone does a study on insect development or growth or predation or mortality or reproduction against different factors in a lab, and they can observe at this temperature and this humidity, the rate is this. You do have this set of data and from that you are then trying to account for what is the development rate as some function of those factors. And I believe that is a statistical problem if I’ve understood.

Lily: Yes.

George: That is purely data and fitting a function or whatever. And [00:31:00] that can then be interpreted in your population model, you can hook that into your population model.
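Editor's sketch: a hybrid model in this sense could be as small as the Python below. The statistical part fits development rate as a straight-line function of temperature from lab observations; the mechanistic part then accumulates that fitted rate day by day until development is complete. The lab values, the linear form of the rate, and the function names are all hypothetical illustrations, not the CASAS models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Statistical part: hypothetical lab data, temperature (C) against the
# fraction of development completed per day.
LAB_TEMPS = [15.0, 20.0, 25.0, 30.0]
LAB_RATES = [0.021, 0.044, 0.071, 0.094]
slope, intercept = fit_line(LAB_TEMPS, LAB_RATES)


def development_rate(temp_c):
    """Fitted rate, floored at zero below the implied threshold."""
    return max(0.0, intercept + slope * temp_c)


def days_to_maturity(daily_temps):
    """Mechanistic part: accumulate daily development until it reaches 1."""
    progress = 0.0
    for day, temp in enumerate(daily_temps, start=1):
        progress += development_rate(temp)
        if progress >= 1.0:
            return day
    return None  # development not completed within the series


cool = days_to_maturity([20.0] * 100)
warm = days_to_maturity([28.0] * 100)
print(f"days to maturity at 20 C: {cool}, at 28 C: {warm}")
```

Because the fitted rate is now a function of temperature, the population model can be run on a temperature series that was never observed in the lab, which is where the uncertainty in the fit would need to be carried through.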

Lily: I see. So the kind of statistical side of your hybrid models, like, okay, what factors are associated with this or what’s affecting this? And then your mathematical side’s about that kind of trajectory a bit more.

George: Yeah. So like I said, then population dynamics models, it’s all about trying to find the rate. And what I’ve described is a way to define that rate, which is statistical.

Lily: Yeah.

George: And it then tells you that, okay, I’ve then got these factors, my temperature, my humidity, and so on, and my model then needs to be run on those factors. So then the entire model gets a dependency on that. And I think it’ll be really interesting to think about how uncertainty in all of those things actually gets carried through [00:32:00] this mechanistic process, in the way that we said the people who forecast meteor trajectories do. Other people who do Hawk-Eye and never account for that uncertainty. 

Lily: Nice. That’s probably a good place for us to finish because I know that there’s a lot more I want to ask, but we have to end at some point. But this has been very insightful.

George: Yeah, and we can keep going if this turns out to be popular. Yeah, I’ve learned a lot about at least how you think about a model, but also that what you think is modelling isn’t always the end point of what you’re doing.

Yeah. That’s a fascinating insight.

Lily: I think some people would disagree with that. Some people would very much disagree with that. 

George: And I think equally that there are people who work in the lab and do their studies of insects against temperature. Their goal isn’t [00:33:00] the model either. They’re doing a process which is valid and meaningful in their context. 

Lily: Yeah, but in statistics, I think people often jump to the modelling, but actually, you want to understand your data, you wanna understand how does A affect B. And modelling is just one way that you can do that, and one way that you can illustrate that in a kind of interpretable way: how A, B, C, D, and E affect your thing. And you can predict, in survival analysis, how long this person is expected to live, or see how this drug compares to that drug in kind of more epidemiological research, or how different varieties affect the yield, so therefore, which variety is better in kind of your crops.

But modelling is just one tool to show that. It’s a good tool, but I wouldn’t say for me, it’s not my, not my goal.

George: I find that really, really fascinating.

Lily: Great. Well thank you very much. 

George: Yeah. Thank you, Lily. [00:34:00]