206 – Explore, Describe, Present: a Statistical Analysis Framework

The IDEMS Podcast

Description

Lily and David explore a powerful framework for data analysis: Explore, Describe, Present. They discuss the importance of exploring data to understand its structure, describing data in the context of specific objectives, and effectively presenting insights to various audiences. Highlighting the challenges of modern data analysis, including the role of AI and the influence of tools like the tidyverse and R-Instat, they emphasise the need for structured approaches to make sense of complex datasets.

Transcript

[00:00:07] Lily: Hello and welcome to the IDEMS podcast. I’m Lily Clements, a Data Scientist, and I’m here with David Stern, a founding director of IDEMS. Hi, David.

[00:00:14] David: Hi, Lily. I’m looking forward to today’s discussion.

[00:00:18] Lily: Yes, me too. It’s on the framework.

[00:00:21] David: The framework. That’s a very big name for it.

[00:00:24] Lily: Well, I know, but I don’t know how else to describe it at this stage. I guess that’s part of why we’re having the discussion.

[00:00:29] David: It’s not the first time, for those regular listeners who are around, they might have heard us, well, either us or Lucie and myself, talk about Explore, Describe, Present, as a way of thinking about, not just data visualisation, tables, graphs, but more generally looking at data. And this is getting more serious because in a couple of different contexts, we are now finding that there’s just so much confusion out there about what to do with data in different ways.

There’s lots of data out there, and the tools to analyse data are now ever more accessible. People are starting to use large language models, generative AI, to help them analyse data. But making sense out of data is not easy. And this is where I think this simple framework of Explore, Describe, Present, could be extremely powerful to help a wide set of our partners engage better with the data and the large amounts of data they’re now having to handle.

[00:01:45] Lily: Yeah, I agree, I think. So maybe we should dig into what we mean by Explore, Describe, Present, these different areas, and then we can talk more broadly about its place.

[00:01:57] David: The way I like to think of it is that there’s a very simple progression, where once you have data, you should explore it. But you don’t have data for no reason, you are looking at the data for reasons, and you have objectives that you are trying to look at the data for, and you want to see if the data is giving you the scientific evidence or the real insights on what you are wanting to learn from the data.

And that’s then you are describing it, you are actually drawing those insights out. But, once you have an insight for yourself, especially as a data scientist, doesn’t actually go anywhere unless you can communicate it to an audience. And that’s what Present is about.

So broadly, it’s data, well, you Explore it, data plus objective, well now you are ready to Describe it, data plus objective plus audience, now you are Presenting it. And you kind of always need to do all three if you’re gonna have something useful come out of data. But if you think of it that way, it really helps to structure elements of the analysis, and to highlight that, actually, a lot of people who are just jumping into using AI models for data analysis could really benefit from this sort of framework so that they’re actually, they’re doing it responsibly, the right insights come out, they’re not going to get misled if they’re doing this well.

[00:03:31] Lily: And I just wanna, I guess, clarify or check. So we are saying about, you know, people that are jumping into doing AI models to look at their data.

[00:03:40] David: You are right, that was bad language, let me see if I can improve my language. You know this better than me because you spend more time talking to the robots, as you put it, than I do. But if you have some data and you are doing some analysis, and you are maybe doing that analysis in R, you may well ask it, ‘I want to do this, what code would I use?’ You are having that conversation with the generative AI, which is then writing the scripts that you would want to be able to do your analysis.

[00:04:12] Lily: But for that particular example, I’d say that’s not new because people before that would use Stack Overflow. I guess just what the robots are good at is they can put it into your context. But I suppose what is new is that accessibility that now anyone can ask the robots for some code.

[00:04:30] David: Exactly. And the group that I have so much respect for, used to be called RStudio, they’re now called Posit, and they’re great, they’ve built some wonderful things. One of the things they built out in R was the tidyverse, and that had really good grammar to it, it originated, I suppose, from the grammar of graphics and ggplot2, it’s Hadley Wickham’s PhD, and the essence of where these ideas stemmed.

And they built this wonderful language in the tidyverse that was structured in a way that somebody who was good at coding now had good logical language to be able to do what they wanted and to prepare their data, to tidy the data, to transform it, to model, to visualise, and to communicate it. There were all these packages put together to really help with this framework, which they put together, which was the tidyverse.

[00:05:25] Lily: Yeah.

[00:05:26] David: We both are fans of the tidyverse because we are both coders. But we also know that there are many people for whom this language was out of reach. It’s one of the reasons we’ve both spent so much time working on R-Instat and these front ends, because there has been a whole audience that cannot access that. There’s a certain skill level you need to be able to write, or to even access the sort of language to be able to go through that process.

[00:05:57] Lily: Yes, I completely agree. I would just want to give an extra nod to the tidyverse that their language is a lot more accessible and readable than using base R code.

[00:06:10] David: Not just base R code, I mean, more than a nod, full credit to them. They built grammar into analysis, into data analysis. This is what they’ve done, and they’ve done it very thoughtfully, very carefully, in ways which really makes sense. And so, yes, I want more than a nod, put them on a pedestal, great work.

[00:06:32] Lily: Yes.

[00:06:33] David: But, that is still not accessible to what I would argue are most people who have data.

[00:06:40] Lily: And that’s part of where R-Instat comes from.

[00:06:42] David: Exactly. And I mean, in 2017, I still remember when we were together in Morocco, at the World Statistics Congress, and after Hadley Wickham gave his presentation, the R-Instat developers were so excited to be meeting their hero.

[00:06:58] Lily: Yes.

[00:06:59] David: They asked him this wonderful question, which was, you know, we love what you’ve done in tidyverse and it’s great for people who can code, but what we’ve been working on is R-Instat, for people who can’t code, what do you think for people who can’t code? And Hadley Wickham’s response was wonderful. He said, oh, that’s a hard problem, I haven’t thought about that one so much.

This is not a criticism of what they’re doing, but it’s this reality that there’s lots of people who can code. And there’s more and more because we’re training people. But there is always a larger number of people for whom that is not the right approach, or there has been and there will continue to be, where they still need to analyse their data.

And, actually, when we discussed it afterwards, that was quite motivating for the R-Instat team that, you know, they weren’t in competition with what RStudio was doing at that point because they were dealing with the people who can write scripts. Whereas the group who were building R-Instat, they recognised that most of the people they were trying to serve were the people who couldn’t write scripts, for whom that was beyond them.

This was 2017, a long time ago, things have changed. But one of the main things which has changed is that now anyone can write scripts because you have access to a large language model, which can write the script for you.

[00:08:20] Lily: Yeah. Well, it might not be the correct scripts.

[00:08:22] David: Aha, I agree.

[00:08:24] Lily: But anyone can write scripts. And so I guess that’s part of where this framework, I’m not sure how we want to frame it yet, but where you were getting particularly excited about its relevance, I’d say.

[00:08:34] David: Yeah, you are right. The relevance has come in because I believe that the work that the tidyverse has done in putting together this really powerful grammar is not what we should now be teaching people. I think for a long time I was very happy with the idea that we teach people how to use the tidyverse and they then have a language where they learn how to do things.

Whereas right now, that’s no longer what I think is the right way forward. I think the right way forward, and this comes back to some of the ideas, I suppose, related to R-Instat, where there are audiences who will not learn that language. But now with generative AI being able to write code in different ways, and there were other ways where I think AI agents could be brought in, so just as you have an AI agent if you are writing code, I could easily imagine in the future R-Instat will have its own AI agent, which will enable you to navigate and understand, well, what’s the analysis I want to do for this? Take me to where I need to go.

Most software will be able to have an AI agent, which I hope in the future will not be based on a large language model, but a small language model, possibly I’ll do an episode on that with Kate. And the power of having a framework which helps people to analyse their data when the main question isn’t, well, how do I do this, but what should I be doing and why am I doing this and what order should I be doing things in?

[00:10:19] Lily: Yes.

[00:10:20] David: It’s crazy that it’s only now that this framework really makes sense to me. I must have been very backward to not have recognised this years and years ago when I was trying to teach people how to analyse their data. Maybe others, I’ve not seen this, I’ve not seen others, and I’m pretty well versed in this, actually make it really simple to think, well, when you go through and you are working with your data, what are the simple steps and how should you go about it? What should you be doing?

And I think we can teach that now in ways which is so powerful with this Explore, Describe, Present. You might say, well, where’s the modelling? Surely it’s all about modelling, but we’ll get to that in a minute. But Explore, Describe, and Present is I think the framework that we need to be honing in on to be able to support others to make more out of their data.

Now, this is relevant for you because you are about to go off to Zambia where you are going to give a training in a MET office where people have lots of data, the historical data. And you are trying to get people to make better use, you’re trying to help people to make better use of the rich resource of data they have. And so this Explore, Describe, Present is being built into that particular training, it’s also being built into a training we are giving in West Africa, which is coming up in Niger, Burkina Faso and Mali in a few weeks time.

And the same basic framework is being used in both of these contexts, and that’s why it’s worth us discussing it as a more general principle, because in some ways we’ve known this for a long time, but we’ve not had the language to communicate it.

[00:12:09] Lily: Yeah.

[00:12:10] David: And I cannot at all claim that this is original from us because these ideas in the stats education community, they’ve been floating around for a long time. It’s just I’ve not heard them presented with the same structure that I believe we are now putting into these trainings. So let’s dig into: Explore.

[00:12:35] Lily: Yeah. Okay. Let’s dig into Explore. So with Explore, my understanding anyway, is you kind of have these three questions: what is there, what isn’t there, and what stands out?

[00:12:49] David: Yeah, I like that. I even wonder whether the order should be different, do you look at what is there first or what isn’t there first? I don’t know. And that’s a really interesting question, I think there’s different cases, but…

[00:12:59] Lily: Yes, you initially had it the other way around. You initially had it, what isn’t there? Then what is there?

[00:13:04] David: But I think you are right. I think you have to look at what is there first to then start to see what isn’t there, because you don’t know what’s missing, and of course, missing values is so important, but you don’t really know what’s missing until you know what is there. So actually starting by looking at what is there is I think really important.

And you know, what is there, well, any data, you might have different tables of data, so you might have multiple tables of data. That would mean you have data at multiple levels. Within a given table of data, you will have multiple columns of data, and so what do they each mean? They might all correspond to different questions if you’ve got a survey or measurements or there might be all sorts of things that a column represents in your context.

But every column of data has meaning, of course, even if that meaning is trivial, what was the plot number, what number did you give to the plot? The meaning may or may not be important as part of the data set, and actually we quite often have this, that lots of the columns in the data set aren’t necessarily the key columns. They’re not the most important things, but they’re part of your data.

And so really, your Explore, you should be able to know what each of them means, what is there, what isn’t there? If there are columns that have missing values in, should they have missing values or not? And if they don’t have missing values, that’s easy. But if they do have missing values, what type of missing values, why are they missing? These are all interesting questions.

And then of course, the third question that you had is what stands out? You first look at your data, forget about the actual objectives, you see things in the data. You might see an extreme value, an anomaly, you might see a trend, you might see a form of grouping of the data.

So there’s so many different things that you could see when you look at data and so many things that could stand out. And at the heart of this, what you are doing is you are getting familiar with your data, you’re getting to know what’s there, but you are also finding things out about the quality of your data. If you have a group, do you have the right number of groups? Are they the right number of rows in each group?

These are things, if you interviewed a hundred people and you only have 98 rows of data, what happened to the other two? Something which is missing. If you interviewed a hundred people and you’ve got 102 rows of data, what’s going on here? This is an anomaly, something is standing out with the data, it’s not what you expected.

And this is something where just looking at your data, getting familiar with what’s actually there, does it correspond to what you expect, are there things which are not right, are there groupings where it was spelled slightly differently, there were two different spellings of something. These are all things which sort of come up when you explore your data.
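
The three Explore questions can be sketched in a few lines of code. This is only a minimal illustration in Python using the standard library, not the tooling used in the trainings; the function name `explore`, the way missing values are coded, and the tiny survey table are all assumptions made for the example.

```python
from collections import Counter

# Assumption: missing values are coded as an empty string, "NA" or None.
MISSING = {"", "NA", None}


def explore(rows, expected_rows=None, id_column=None):
    """Summarise what is there, what isn't there, and what stands out."""
    columns = sorted({name for row in rows for name in row})
    report = {"n_rows": len(rows), "columns": {}, "flags": []}

    # What is there / what isn't there: present and missing counts per column.
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in MISSING]
        report["columns"][name] = {
            "present": len(present),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }

    # What stands out: a row count that differs from what was expected.
    if expected_rows is not None and len(rows) != expected_rows:
        report["flags"].append(f"expected {expected_rows} rows, found {len(rows)}")

    # What stands out: identifiers that appear more than once.
    if id_column is not None:
        counts = Counter(row.get(id_column) for row in rows)
        repeated = sorted(str(k) for k, n in counts.items() if n > 1)
        if repeated:
            report["flags"].append(f"repeated {id_column} values: {repeated}")
    return report


# Hypothetical data: three interviews were expected, four rows arrived.
survey = [
    {"id": "1", "village": "A", "yield_kg": "410"},
    {"id": "2", "village": "A", "yield_kg": ""},
    {"id": "2", "village": "A", "yield_kg": ""},
    {"id": "3", "village": "B", "yield_kg": "385"},
]
report = explore(survey, expected_rows=3, id_column="id")
for flag in report["flags"]:
    print(flag)
```

The flags are prompts for a conversation with whoever collected the data, not automatic corrections: the code can show that a row is repeated, but not why.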

[00:16:24] Lily: So I guess two things. Firstly, my understanding is that this is something that happens before your objectives.

[00:16:29] David: What do you mean by before your objectives? Because of course the objectives almost certainly predate the collection of the data in many studies. So in many studies, you have your objectives first, you design your study based on the objectives, and then you collect the data.

[00:16:44] Lily: Okay, then before accounting for your objectives in looking at the data and the analysis.

[00:16:49] David: Exactly. So this is what I would do as somebody who looks at data, who works with people to support them on their data analysis, the first thing I do when I sit down with someone in their data is I explore it, I look at it. And I often, if they’re sitting next to me, I’ll ask them lots of questions about it. I’ll say, what does this data mean? Did you expect a value like that in this column or is it too big? Is it too small?

And just go through and do simple bits of cleaning, but just going through and sitting with them, looking at their data, gaining familiarity with what’s there, with what’s not there, and so on.

[00:17:26] Lily: But then what if, I’m now just second guessing myself. So this is before accounting for your objectives, but then what if you’re missing a very key column, a very key variable that you need for your objectives, or you have a lot of missing values in a very key…

[00:17:42] David: So that would be then an interesting discussion. That might affect what you could then do at a later stage. This might be a discussion on the boundary between when you are actually getting to saying, okay, now I’m ready to go in and look at a specific objective, actually, you’ve told me that objective, no, I’m sorry, I can’t do that with your data because...

So I can now answer that question when they tell me their objectives. If I’m sitting next to them and I’m helping them with that when we now move to the Describe phase, they tell me I wanted to answer this question, and I can say, I can’t do that with this data.

To do that, I would have needed this extra column. And that is something I’ve done on a number of occasions to people, rather traumatic on occasions to say that my objective for collecting this data was to do this, and I say, I’m sorry, you can’t do that with the data you have. You don’t have the data you need to answer that question. That’s actually surprisingly common even with very good researchers.

But, let me come back. The key is that’s on the boundary as you enter into the Describe phase. So entering to the Describe phase, as I would call it, this happens when you now bring your objective into play.

This framework works very well if you are doing the analysis yourself on something which is your own study, or if you’re somebody like myself who supports people to analyse their data that they’ve been doing for their studies. It’s a really powerful thing.

So now we’ve looked at the data, we understand the data, we’re familiar with the data, when you tell me your objective, I’ll be able to tell you whether or not that’s possible, I’ll have an idea of which columns I need to use ’cause I know the data set as a whole, I’ll have an idea of whether there are sort of serious issues that I need to be conscious of because of missing values, and all these other things ’cause I have that familiarity.

[00:19:34] Lily: Excellent. And so in the Explore, that’s why some of your cleaning also happens?

[00:19:39] David: Yeah, I’ve explored it, you told me you had four villages, but I’ve got six villages. What’s happened? Okay. Some of these villages have been spelled differently to other villages, that would be an example.

[00:19:50] Lily: Great. That makes sense. ’Cause in your Explore, you might be just quickly looking at, okay, what frequencies do I have for this village variable?

[00:19:58] David: Exactly. Yeah. You told me you have a balanced design across your four villages, but I’ve got six villages and they’re not balanced, what’s happened? Let’s go in, let’s look at this and let’s see if we can figure it out. Can we clean the data so that it represents the structure that you expected to have from your data? If it was a study where you collected the data or if it’s somewhere where you scraped the data from the web or whatever it might be, you still need to understand what the data looks like.
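
The village example can be sketched as a frequency table plus a check for labels that look like spelling variants of each other. This is a minimal Python illustration under stated assumptions: the village names are made-up sample values, the helper `similar_labels` is hypothetical, and the 0.75 similarity threshold is an arbitrary choice that would need tuning on real data.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def similar_labels(labels, threshold=0.75):
    """Return pairs of distinct labels that look like spelling variants."""
    cleaned = sorted({label.strip() for label in labels})
    pairs = []
    for a, b in combinations(cleaned, 2):
        # Compare case-insensitively; ratio() is 1.0 for identical strings.
        ratio = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


# Hypothetical column: four villages were expected, balanced across rows.
villages = ["Chipata", "Chipata", "chipata", "Mongu", "Mongu", "Mungu",
            "Solwezi", "Solwezi", "Kabwe", "Kabwe"]

frequencies = Counter(villages)      # six distinct labels, not four
suspects = similar_labels(villages)  # candidate pairs to ask about

print(len(frequencies), "distinct village labels")
for a, b in suspects:
    print(f"possible variants: {a!r} and {b!r}")
```

The check only proposes candidates. Whether two similar labels really are the same village is a question for the person who knows the study, which is why the cleaning happens in conversation during Explore.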

[00:20:25] Lily: And so then we move into the Describe, as you say.

[00:20:28] David: Absolutely. And what I should say is that what I’m describing in terms of this framework now does also apply to really big data, but you would do some of these things differently for really big data. So I’m going through this with a smaller data set in mind.

[00:20:44] Lily: How small is smaller? Because you never know.

[00:20:46] David: Small, could be anything which is less than a few gigabytes, really.

[00:20:50] Lily: Okay.

[00:20:51] David: So this is, you know, it could be quite large, it could have millions of rows, that sort of thing. It could have thousands of columns. Then you need to spend quite a long time exploring it. But, it is probably not gigabytes or more of data.

[00:21:06] Lily: Great. It’s just good to have some kind of idea of what we mean.

[00:21:10] David: And the point is that having gigabytes or more data, the framework actually doesn’t change. It’s just the way you do it, the way you’d explore if you have something that big is rather different. You can’t quite go through it in the same way and it wouldn’t make sense. Once you get really big data, there’s different tools and there’s different approaches, but it’s the same basic principles that I’d still argue you could follow through, and it would be useful to help remove biases, to help make sure you are not misinterpreting the data.

[00:21:41] Lily: Great. And so with our Describe, what are your kind of, do we call them principles or do we call them areas or steps?

[00:21:51] David: I don’t know. That’s a good question. I think of these as phases, I like the term phases, you are sort of going through the analysis, you’ve got different phases to your analysis. Broadly, your first phase is when you’re just in an exploratory mindset. Your second phase is that you now take the objectives of your study, whatever you’re trying to learn from the data, and you bring them to bear, and you now start investigating with respect to those specific objectives.

And the key is that once you have an objective, you no longer have a mass of data. You now have specific components of the data that relate to your objective in specific ways. This is at the heart of it. For example, the first thing that you would want to do is you want to check, you know, do you understand this objective well and what does it look like? Can you visualise the data with respect to that objective?

And I would almost always start with visualisation in some form. But, once you visualised it, you don’t know whether what you are seeing is actually scientifically valid or not. That’s where modelling comes in.

[00:23:03] Lily: Okay.

[00:23:04] David: Part of your Describe process includes the whole modelling piece. Modelling is not left out, modelling is not separate to this sort of framework. It is included in the Describe phase of the framework. And modelling can be extremely complex in all sorts of different ways, that piece of it can be very hard. But I would almost always start the Describe phase by just saying, okay, if this is my objective and this is my data, which I now know because I familiarised myself, how should I be able to see the result? And so I’d create a visualisation often, which would then correspond in some form to what I’m expecting to see as a result.

Then the next step, of course, is once I can see that result, is that result scientifically valid? I can’t determine that often from the visualisation, I mean, no, let me rephrase that. I have a very good sense of estimation, of being able to estimate, oh, that’s going to be statistically significant, whatever test or model I use, because the evidence is just huge, the visual difference is a lot.

Or you don’t have enough data here to know that this is different, we’ll have to actually use some models to check if with the right model, we can notice that this difference is enough, we have enough data. Because quite often that’s the key thing with certainly hypothesis testing, all hypothesis testing is really doing, is telling you yes, that difference that you’ve observed, you have enough data to be confident that it’s not due to chance.

And that’s really important. I mean, any one of the funded studies that happens, if you did the whole study and you don’t have enough data, so you can’t determine that it’s not due to chance, you know, you won’t publish your results because it’s just a hypothesis still, you don’t know.

Whereas the scientific process, you can have relative confidence that this isn’t due to chance, therefore it is something which people would want to be able to sort of see and communicate. And that’s where we get into the next step, the Presentation.
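
The idea that a test tells you whether you have enough data to be confident a difference is not due to chance can be illustrated with a permutation test. This is one possible technique chosen for the sketch, not necessarily the test or model that would be used in practice; the function name `permutation_p_value` and both small data sets are invented for the example.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=1):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels and see how big a difference chance gives.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements: a large visual difference with six values each.
p_clear = permutation_p_value([10.1, 10.4, 9.8, 10.0, 10.3, 9.9],
                              [12.0, 12.3, 11.8, 12.1, 11.9, 12.2])

# Hypothetical measurements: a small difference and only three values each.
p_unclear = permutation_p_value([10.1, 12.0, 9.8], [10.4, 11.9, 10.0])

print(f"clear difference:   p = {p_clear:.3f}")
print(f"unclear difference: p = {p_unclear:.3f}")
```

In the first comparison, shuffling the labels almost never reproduces a gap as large as the observed one, so chance is an unlikely explanation. In the second, shuffled labels produce gaps that size most of the time, so the data cannot separate the result from chance.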

[00:25:23] Lily: Interesting.

[00:25:24] David: We should just delve into the Presentation briefly, because presenting, you don’t just have an objective that you are trying to look at. You’re not just trying to draw out that insight, but you are trying to communicate it. And by definition, when you try to communicate something, you’re trying to communicate it to an audience.

So, the Presentation step is when you have both the objective that you’ve now got an insight on, because you’ve got the scientific evidence behind it from your Describe phase, you’ve done your modelling, I should have said in the Describe phase, it’s not enough to just get that initial result. A good descriptive phase goes through and says, okay, yes, I’ve got evidence of this result, but what about the other information I have?

Are any other things related to this? Are some of the other variables, some of the other columns of data, are there relationships between them that affect my result? Maybe my result is different for men and for women. That would be an example of a variable, the gender column, which might affect the results you’re seeing.

Now, you can of course insert that into your modelling. And so modelling is really good at actually looking at relationships in different ways and saying which relationships are important, it’s a really powerful tool for that. But they can also quite often be visualised unless you get too high a dimensionality.

But they are good things to visualise as well, so looking for those relationships and finding out what are the important results and digging into that, deepening it, that’s not part of the Presentation process. That’s still part of the Describe process, so you understand what your result is.
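
Checking whether a result differs between groups can start with a simple breakdown before any modelling. The sketch below is a Python illustration under assumptions: the column names (`gender`, `treated`, `yield`), the helper `effect_by_subgroup` and the numbers are all hypothetical, and a difference in means stands in for whatever the real result is.

```python
from collections import defaultdict
from statistics import mean


def effect_by_subgroup(records, outcome, treatment, subgroup):
    """Mean outcome difference (treated minus control) within each subgroup."""
    cells = defaultdict(list)
    for r in records:
        cells[(r[subgroup], r[treatment])].append(r[outcome])
    effects = {}
    for level in sorted({r[subgroup] for r in records}):
        # mean() raises StatisticsError if a subgroup lacks treated or control rows.
        effects[level] = mean(cells[(level, True)]) - mean(cells[(level, False)])
    return effects


# Hypothetical records: the overall effect hides a gap between the groups.
records = [
    {"gender": "women", "treated": True, "yield": 14.0},
    {"gender": "women", "treated": True, "yield": 16.0},
    {"gender": "women", "treated": False, "yield": 10.0},
    {"gender": "women", "treated": False, "yield": 10.0},
    {"gender": "men", "treated": True, "yield": 11.0},
    {"gender": "men", "treated": True, "yield": 11.0},
    {"gender": "men", "treated": False, "yield": 10.0},
    {"gender": "men", "treated": False, "yield": 10.0},
]
effects = effect_by_subgroup(records, "yield", "treated", "gender")
print(effects)
```

A breakdown like this only suggests where to look. Whether the gap between groups is itself more than chance is a question for the modelling step, for example through an interaction term.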

[00:27:13] Lily: And what, to me, helps explain what you should expect from modelling, how do you know if you’re doing modelling right or not, I guess, is that kind of phrase of your modelling shouldn’t surprise you.

[00:27:24] David: This is Hadley Wickham again, is the one I heard this from, I love this. Fundamentally, your descriptives surprise, but don’t scale, your modelling scales but doesn’t surprise.

[00:27:38] Lily: That’s the one.

[00:27:39] David: Yeah, it’s a fantastic quote, it’s a deeply insightful quote for somebody who does analyses with people a lot. It’s really useful to put the modelling in its place, which is extremely important alongside other things, it is not the be all and end all.

We should get back to the Presenting. Sorry, I got distracted because I love the Describe piece. This is where all the work happens, as somebody who’s offering data support to someone, almost all of my work is around that Describe phase. You know, I sort of do the explore myself often, and then check my understanding with who I’m supporting.

But a lot of my work happens about drawing out those insights, understanding what the data is telling us, being careful about biases, about potential misinterpretations, all of these things, getting the scientific evidence, actually doing the modelling to make sure that you know that what you are saying has a scientific foundation to it, you are actually using scientific methods.

And then the hard part starts for me. And this is when you have an audience in mind. This is where many people would recognise that it’s a different skill set to communicate well to a specific audience. I have so much respect for journalists in this, I have so much respect for so many communicators who are able to take those insights and figure out how can you communicate them accurately to the right audience.

When I’ve got most involved in this, this has been related to communicating complex results to smallholder farmers in low resource environments where there was low literacy as well as low digital literacy, and it’s been fascinating. This is a really interesting context, but which is central to this Present component, understand your audience. One of the insights that we had was that low literacy and low digital literacy does not mean low ability to handle complexity.

And I love the fact that we had these visualisations that we showed them to the researchers and the researchers said, that is insane, I don’t understand what’s going on, no, I can’t see it. We showed them to farmers and they said, great, I understand this perfectly and it’s really perfect. And the thing was that we didn’t use summaries.

You know, researchers are used to summarising anything. What’s the average effect? What’s this? Whereas one of the insights that we had is that any summary like that is a layer of abstraction. If instead of making it abstract, what we did is we had these graphs, which showed every individual farmer on, so the graphs were really busy, but people could find themselves and they could find other people, and they could see what does my experience of what I saw in my field, what does that look like in the graph?

So even if they weren’t literate or digitally literate, they could see themselves and they could see other people they knew in the graph. So the complexity wasn’t an issue with them. They’re dealing with complexity all the time. But abstraction was something which they didn’t know how to interpret.

Who was the average? It doesn’t exist. This was a really deep insight that I am so grateful to the experience I’ve had working in these contexts to try and think, as you think about Presenting, your audience can surprise you in many different ways. Be careful about assumptions you might make.

I love this example because it’s a fantastic example that shows that low literacy, low digital literacy does not mean low intelligence or low ability to understand complexity. So be careful about the assumptions you make. And that is such an important part of this Presenting, understand your audience, understand what it is which is meaningful for them, what they will be able to grasp, what they will not be able to grasp, and how can you communicate and create presentations which communicate to them.

[00:32:02] Lily: Yeah. So kind of linking our what have we found with how should we share it?

[00:32:06] David: Exactly, yes. And when you think, how should we share it, it’s of course, who are we sharing it with?

[00:32:14] Lily: Yes. And I suppose kind of step one of Presenting is who is my audience? Because once you’ve worked that out, you can then from there, work out how you’re going to present.

[00:32:25] David: Exactly, exactly.

This has been a long episode, hasn’t it? Sorry about that.

[00:32:32] Lily: No, but it’s very interesting and it’s been, I’m sure that there will be a follow up on it at some point.

[00:32:38] David: I think we will. I think we need to get this written, we need to get it shared more widely. It’s an approach which has been being built over a decade or more. But it’s of the moment now. I feel it’s so important because of what’s happening with AI and how this is gonna change how people look at their data.

It’s really exciting. But at the same time, we need structures to help people. So thinking about a simple framework, you know, first, explore your data, just look at the data, understand it, what’s there, what’s not there, what stands out. Then Describe it, understand for a particular objective, what insights can you draw out? Are they valid? You need to do the modelling to check that. What are the other relationships? Are there other variables that may relate to this in ways that might change, are there interactions that would change your interpretation? Might it be different for different groups? Might it be different for men and for women, or whatever that may be?

And once you’ve understood that, who are you trying to communicate it to, and how will you be able to get your message across in a way which is powerful and accurate, avoiding misinformation? That’s the challenge of presenting it.

[00:34:01] Lily: Excellent. Thank you very much. I’m excited to dig into this with you even more as time goes on.

[00:34:08] David: Thanks.