Description
What happens if statistics teaching starts from data rather than methods? In this episode, Lily and David explore the idea that statistics education should prioritise data analysis over traditional methods-first approaches, discussing the benefits and challenges of this paradigm shift. Highlighting examples from New Zealand’s education system and their own experiences, they argue that a data-first approach can provide more practical and widely applicable skills for students, despite the structural challenges it may pose.
Transcript
[00:00:07] Lily: Hello and welcome to the IDEMS Podcast. I’m Lily Clements, a Data Scientist, and I’m here with David Stern, a founding director of IDEMS. Hi, David.
[00:00:15] David: Hi, Lily. Great to be having another episode, and I’m excited by this title.
[00:00:20] Lily: Yes, yes. Teaching statistics from the data up.
[00:00:24] David: Yes, we’ve been thinking and discussing these ideas for a long time, but I’ve never heard it expressed in that way, and I really like it.
[00:00:34] Lily: Oh good. That’s nice.
[00:00:37] David: And you are now writing a paper on this and it’s an idea which is central to a lot of things we’ve done in the past, but it isn’t something which I feel has yet caught on widely.
[00:00:53] Lily: No. So maybe we should first say what we mean by teaching statistics from the data up. And this is the idea of, well, okay, is it Cobb who says data before methods?
[00:01:03] David: I believe so.
[00:01:04] Lily: And that idea is that your kind of statistics or your data science course is informed by the data, you’re not teaching methods and formulas, you’re letting the data lead the way.
[00:01:21] David: Well, sort of. And of course this is something which has been around in many other contexts for a long time. It is this idea that if you have a statistics course and it’s a statistics course on, let’s say, I don’t know, take your pick, generalised linear models.
[00:01:40] Lily: Okay. Something which we can assume the listener may not know much about.
[00:01:47] David: Exactly. Well, some may, some may not. But the point is, as a listener, you can guess that that is a statistical concept, it’s a statistical method. And so you start the course by discussing the method, defining the method, explaining the mathematics behind the method, and then you give some examples, you give some data to illustrate the method.
[00:02:10] Lily: Yeah. Which can often become, what word do I want? I remember when I was taught about different tests and different generalised linear models literally having a kind of flow chart that we were given, okay, this is how you can work out from your data, what the method is.
[00:02:29] David: Yeah. And this is absolutely sensible in certain cases, that you have certain elements where you go in and you look at your data and there are different decisions you need to make, and that forks and that helps you to choose your method. That is actually more data focused. With the methods focus, if you start with the methods, then you only get given data that happens to fit that method.
This is the thing, all data you get given, just by chance, it happens to be perfectly adapted to the method you are being taught at this point in time, because otherwise they wouldn’t give you that data. This is sort of how you do the teaching. And this is the methods first where you now say, okay, well you need to learn a new method, and therefore for that method, here are some really good examples which need this method.
But starting with the data is exactly where you might not know to begin with, which method to use. And I love New Zealand for what they’ve done in the education system where they’ve decided to really put data first, and they’ve got data from the first year of primary right the way through schooling, and they don’t focus on the methods. The methods you learn always are applicable.
They’re not the most powerful methods, they’re not the methods that will mean you can get good results with a small amount of data, which is what powerful methods enable you to do. But they are methods which are widely applicable and so therefore you don’t worry too much, you just use these methods that are widely applicable to try and understand the data you’ve got. And you can have a data first approach, and there’s some very interesting learnings that have come from that.
But still, if I think about the university teaching that we are involved in, that we’re supporting others with, most curricula are focused on the different statistical methods, which you learn one after the other, and then you get given examples highlighting those methods.
[00:04:41] Lily: Yes. I mean, even using the example generalised linear models, I don’t know if you did that on purpose, but how you’re taught is linear models and then you’re taught generalised linear models, or how I was taught anyway, it was very kind of methods based in that way.
[00:04:56] David: Yeah, exactly. And there’s this element that you are first taught linear models, which are actually only applicable to a smaller number of things, but you get taught them first because that’s mathematically easier. Whereas generalised linear models, almost by definition, are more generally applicable. And so they’re more widely used or useful, but they’re more complicated in terms of the underlying mathematics and therefore they tend to get taught afterwards. So you only have a small toolkit to start with, and often you are lost because you don’t have any tools that apply.
[00:05:35] Lily: Interesting. So what you’re saying is teaching from the methods up or teaching from the methods first, it kind of means you start with your simpler example and your mathematics gets more complicated. But this can, in practice, not give you the tools that would help as much.
[00:05:51] David: Exactly, because if you get asked to analyse data, it might be that you haven’t been taught any tools that enable you to progress. The approach really put forward by New Zealand and which I was appalled by when I first heard about it until I understood it, is actually to start with tools that you can always apply, but are often not very powerful. Things like bootstrapping.
[00:06:20] Lily: Which is things that I didn’t cover at university, at any level, bachelor’s, master’s, or PhD.
[00:06:27] David: Absolutely, and yet this is what is taught at schools in New Zealand because you can always use it. It’s almost never the method you would actually use because it’s almost never the best method to use, but it is widely applicable. And this is the sort of thing which is so important, is a different perspective on how we could be teaching statistics from the data up.
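As a rough illustration of why bootstrapping is described here as something you can always use, the following is a minimal sketch in Python of a percentile bootstrap interval. The function name `bootstrap_ci`, its parameters and the measurements are all hypothetical, invented for this illustration; nothing here comes from the New Zealand curriculum or from the course discussed in the episode. The same few lines work for a median, a mean or any other summary, which is the sense in which the method is widely applicable, even though a method tailored to the data would usually give a tighter answer.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.median, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for any statistic of a single sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement, same size as the data, and recompute the statistic.
    estimates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = estimates[round(tail * n_resamples)]
    upper = estimates[round((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, invented for illustration only.
heights_cm = [142, 150, 151, 153, 155, 156, 158, 160, 163, 171]

low, high = bootstrap_ci(heights_cm)
print(f"median = {statistics.median(heights_cm)}, 95% interval = ({low}, {high})")
```

Swapping `stat=statistics.mean` (or any other function of a list) into the call changes the question being asked without changing the method, which is the trade-off described above: generality first, power later.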
And we’re not quite, in the course that you are developing, thinking of it at that scale that the New Zealand curriculum have thought through, which is from early primary right the way through schooling and so on, I’d love to engage in that process. But we’re engaging more from the idea of, if you are somebody who is mathematically inclined, a mathematical master’s student, which is really the target audience that we have, and you might get involved in data, and in particular you might become a data scientist or a statistician, then we are thinking of starting at that level and thinking, okay, what happens if you start by thinking about the data?
And one of the things of course that is at the heart of the approach you’ve been building out is starting with multilevel data because almost all data which you actually encounter in the real world is multilevel, and that doesn’t have to be impossible and hard to deal with. In fact, this is natural data. People should be thinking about this from the outset and understanding how to work with data at multiple levels.
So this is a simple example: when you’re thinking from the data first, multi-level data is extremely natural. When you’re thinking from the methods first, multi-level methods are actually quite mathematically challenging and complicated.
[00:08:25] Lily: And, again, not meaning to keep talking about my experience, but just to give an example of how it was taught at university, we never touched multilevel data in, again, any of the three degrees of bachelor’s, master’s, PhD.
[00:08:39] David: Yeah, you never dug into multilevel data because, essentially, the analyses and multilevel methods, statistical methods, they are challenging. They’re a sort of narrow area, they’re a particular topic, they’re very interesting, but they are not necessarily accessible, or easily accessible, but multilevel data is. And this is what’s so important. Actually, I bet you touched on multilevel data, which had been summarised down to a single level.
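To make "multilevel data summarised down to a single level" concrete, here is a small sketch in Python. The records, the field names `school` and `score`, and the helper `summarise_by` are hypothetical, made up for this illustration. The point is only that data with pupils inside schools is easy to hold and to look at, and that collapsing it to one row per school is a choice that changes what the numbers mean.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, invented for illustration: one row per pupil,
# each pupil belonging to a school (two levels: pupil and school).
pupils = [
    {"school": "A", "score": 52}, {"school": "A", "score": 58},
    {"school": "A", "score": 61}, {"school": "A", "score": 49},
    {"school": "B", "score": 70}, {"school": "B", "score": 74},
    {"school": "C", "score": 40}, {"school": "C", "score": 45},
    {"school": "C", "score": 43},
]


def summarise_by(rows, group_key, value_key):
    """Collapse lower-level rows to one summary row per higher-level unit."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {g: {"n": len(v), "mean": mean(v)} for g, v in groups.items()}


schools = summarise_by(pupils, "school", "score")

# The same question ("what is the average score?") has a different answer
# depending on the level at which it is asked.
pupil_level_mean = mean(p["score"] for p in pupils)            # every pupil counts equally
school_level_mean = mean(s["mean"] for s in schools.values())  # every school counts equally

print(schools)
print(f"pupil-level mean = {pupil_level_mean:.2f}, school-level mean = {school_level_mean:.2f}")
```

The two averages differ because the schools have different numbers of pupils. A dataset that arrives already summarised to one row per school hides that lower level entirely.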
[00:09:10] Lily: Most definitely. Well, you know, I probably did touch on it and I didn’t even realise, it meant that I was scared, in a way, of multi-level data, I’ve heard this term and I don’t know what it is and it sounds big and scary. And then before I know it, you give me this question, which I believe we’ve done another podcast on, but on this, I think it’s the Rwanda course that we did a podcast on, but you give me this question and you essentially then turn to me in the middle of it, and you use this word, multi-level data, and I’m like, oh, oh, is that it? Oh, okay, data with levels.
[00:09:48] David: Data at multiple levels, yeah. And it’s this thing that if we start with actually thinking about where the data comes from, what it means, and actually having relevant data, then it’s a very different prospect to take on some of these. If we go to the methods, they might become challenging methods, but if we start with actually the data that you’re trying to analyse, it isn’t.
[00:10:16] Lily: Interesting. Yeah.
[00:10:19] David: And I guess one of the learnings, and you mentioned the Rwandan course, one of the learnings that I’ve had, which really took me a while to get my head around is the fact that good statisticians and data scientists may have gaps in their statistics and data literacy, which are different from the gaps in statistical and data literacy that non-specialists have.
And this is something where, of course, it becomes actually very problematic when your support staff, who are supposed to be the knowledgeable people in statistics and data, have basic gaps in their understanding compared to the people they’re supporting. This is a scenario that we find quite a lot. As people who play the role of supporting researchers who are using data in their daily life, I’ve often been confronted by elements which the researchers I’m supporting take for granted.
And they have language that they assume I, as the statistician would just be familiar with because it’s the language that they use and I have to stop them and say, wait a second, I’m sorry, I don’t know what you mean by that. What? You don’t know what, but that’s a statistical term, and I say yes it is, but I’m afraid I don’t know what it is. And so I have to go back and I look it up and say, oh, okay, I understand what it means, but it is not something I’m familiar with.
And it’s not just the terms, what we found is that there’s actual gaps in people’s understanding. Just as you found for yourself coming out of a statistics PhD, where the deeper meaning of standard deviation, that interpretation of standard deviation, the usefulness of it as a quantity to communicate was sort of lost. You understood the formula, but actually that ability to estimate it was not something which was natural to you. Whereas for somebody who uses it regularly to compare things, it is natural, it has meaning.
[00:12:40] Lily: Yeah. And so in a way, coming back to teaching statistics from the data up, one outcome of it is having that better context, or I guess you’ve seen, you’ve been taught something in a way, like how when I came across multi-level data, I was like, oh, I’ve heard this term, multi-level data before, or I’ve heard this term standard deviation before say, then just realising, oh, is that it? That is something I think comes out a lot more naturally when teaching from the data up.
[00:13:14] David: Let’s come back to what this means. So if you start with the data, and then you are building up understanding from the data, and you are then coming across the methods that you need as the data requires them, then this becomes something which is much more natural to then actually ask the question, and literally ask the question what is the statistical process I’m using telling me about my data?
You’re not starting with the method and then saying, which example do I have which helps me to understand the method, you are actually stating, okay, from this data, which methods do I need to answer the questions that I have about my data?
[00:14:03] Lily: Yeah. I see. And in the past, because you’ve been taught from a data led way, you’ve done these methods and that method, so you can apply that.
[00:14:12] David: You’ve been taught, you mean from a methods approach?
[00:14:15] Lily: Yes. Sorry.
[00:14:17] David: Yeah.
[00:14:17] Lily: In the past you’ve been taught from a methods approach, and that resonates quite deeply, that saying of, okay, which example have I used that fits this data?
[00:14:27] David: Yeah. And does the data I have fit the method or not? And what do I need to look at? Whereas actually it can be much more natural the other way on. However, don’t get me wrong, there’s challenges in this and the New Zealand example is wonderful for this, that they’re far ahead in approaching this.
One of the challenges that they had is that, well, if you take a data led approach and you are not going deeply into the methods, when do you encounter certain, when do you progress, what’s the progression from year to year? They actually had a serious issue where there wasn’t a well-defined progression. So students would say well, I saw this last year. And it wouldn’t be clear what it is that they’re doing, which is new and which is beyond what they had before because that’s what the methods were really good at sort of doing.
You always have a new method. Well, I haven’t seen this method before. Now I’ve got a new method. I’ve got new examples that go with that method. And so from an education standpoint of progression, that’s actually very nice. But from an education standpoint of the usefulness, it’s not as good. So there’s advantages and disadvantages.
I don’t want to say a methods first approach is bad because it gives great structure. It’s really hard to have structure if you’re taking a data first approach. But it is extremely powerful to have a data first approach if you can then get structure and progression, or if you are looking at it as we are at a particular point in time where you are taking in a sort of wide base of people with different levels of skills and ability, who have seen a lot of different things and you are taking them all together in this new approach. That’s one of the ways that I feel the data first approach really adds value.
But there are challenges. It’s not simple that, oh, if only people had swapped and everyone was doing this, everything would be good. No, there’s really good evidence that any of these changes come with advantages and disadvantages. I do believe quite strongly that if we could find ways to embrace data first, the skills we would be able to impart are skills which would be more useful afterwards than a methods first approach.
And it is one of the things which I keep coming up against in many different groups that I work with on data. I was just at a series of trainings in Niger, Burkina Faso, and Mali, where this wasn’t central to the training I was giving, but it came up again, these issues around, well, you know, what about the hypothesis tests that I’ve been taught about? Where do they fit in? What value are they playing and what else should I be able to do and how should I be able to interpret?
Because the training that we were wanting to do was about actual communication of results. But there was a confusion between what is a meaningful result and what is a statistical result or a scientific result, and is the scientific result meaningful in terms of wanting to communicate it to a particular audience?
And that’s a different question. And the statistical significance is for many audiences, an important prerequisite. As a scientist, you shouldn’t communicate results where you don’t have scientific evidence. But if you do have scientific evidence, well the statistical significance becomes irrelevant. What you actually want to communicate is the result, and to most audiences, that’s not the fact that it happens to be statistically significant.
The statistical significance simply enables you to be confident that this result, you have the scientific evidence behind it to back it up. That was a wonderful example of where, you know, we got hung up on these details because the teaching of the statistics was so methods based that once people had the statistical significance, they often didn’t take the next steps to say, well, how should I display the actual result? What does the result that I’ve got statistical significance for look like? Which is really what matters if you want to communicate.
[00:19:18] Lily: Yeah, you have, if I’m following correctly, you kind of get this value, this P value or whatever, you get this thing saying that’s significant, you know, A affects B. But you want to be able to visualise that in different ways?
[00:19:33] David: Well, often that statistical significance is something like the mean of this treatment is different from the mean of this treatment. So if you use this treatment, you are confident that the difference between those two is not due to chance. So you can communicate that if you use this treatment in agriculture, you might get higher yield for this or whatever that may look like.
[00:19:58] Lily: But it doesn’t say how much of a higher yield or what the difference there is, and so forth.
[00:20:02] David: Exactly, and that’s what’s of interest. If there was a cost associated with it, is it worth the cost? Those are the sort of questions which actually people want to know, and they’re not communicated by the statistical effect, they require you to interpret it and actually understand it in other ways.
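The distinction drawn here, between knowing that a difference is statistically significant and knowing how big it is and whether it is worth paying for, can be sketched in a few lines of Python. Everything in the example is an assumption made for illustration: the yields, the price and the cost are invented, the function `difference_with_interval` is hypothetical, and the interval uses a rough normal approximation (z = 1.96) rather than the t-based interval a careful analysis of samples this small would use.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical yields in tonnes per hectare, invented for illustration.
control = [2.1, 2.4, 2.0, 2.6, 2.3, 2.2, 2.5, 2.4]
treated = [2.6, 2.9, 2.7, 3.1, 2.8, 2.5, 3.0, 2.8]


def difference_with_interval(a, b, z=1.96):
    """Difference in means (b - a) with a rough normal-approximation interval."""
    diff = mean(b) - mean(a)
    # Standard error of the difference, allowing unequal variances.
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return diff, diff - z * se, diff + z * se


diff, low, high = difference_with_interval(control, treated)

# Hypothetical economics: sale price per tonne and treatment cost per hectare.
price_per_tonne = 200.0
treatment_cost = 60.0
net_gain = diff * price_per_tonne - treatment_cost

print(f"extra yield: {diff:.2f} t/ha (roughly {low:.2f} to {high:.2f})")
print(f"estimated net gain per hectare: {net_gain:.2f}")
```

An interval that excludes zero plays the role of the significance test, but what a farmer or funder would want to hear is the size of the extra yield and the net gain, not the p-value.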
[00:20:21] Lily: Excellent. Well, hopefully we’ve at least touched on the surface of kind of what we mean by teaching statistics from the data up.
[00:20:28] David: And once you’ve written the paper, I’m sure we’ll have other opportunities to communicate about this more deeply.
[00:20:35] Lily: Yes, absolutely. Thank you very much. It’s been a great discussion.
[00:20:39] David: Great. Thanks.

