152 – Types of Graphs

The IDEMS Podcast

Description

In this episode, Lily and David explore different types of graphs and their uses in data visualisation. They discuss how to categorise visualisations for quick data checks, detailed interpretation, and effective presentations, emphasising the importance of tailoring graphs to the audience. 

Lily: [00:00:00] Hello and welcome to the IDEMS podcast. I’m Lily Clements, a Data Scientist, and I’m here today with David Stern, a founding director of IDEMS. Hi David.

David: Hi Lily. I’m looking forward to the discussion today. Graphs, is it?

Lily: Graphs. Yes. And mainly on the different types of graphs. This is something, a conversation that we’ve discussed between ourselves while working through various courses and materials, but, namely, there’s these kind of, you’ve categorized graphs into these three different areas.

David: Ah, you think of categorizing graphs or categorizing visualizations or approaches to graphing, you know, to creating graphs or visualizations.

Lily: I of course meant categorizing visualizations, yeah. 

David: Well, I think, it’s this, categorizing graphs, sorry, in that framework is actually a very different thing. And the [00:01:00] grammar of graphics actually helps us not need to categorize graphs as they used to be categorized. 

Lily: Okay. 

David: But I think sort of the categorization we’ve discussed quite often is, well, the phases when you’re using visualization for different reasons. So are you using it for exploration, are you using it for interpretation description, or are you using it for presentation? And in some sense, those paradigms of visualization are quite different from one another.

Lily: Yeah, so my understanding is that your kind of exploration visualizations, they’re the ones that you kind of want to get out very quickly and kind of quickly see, okay, what’s going on here? And so get kind of a, I dunno, a quick box plot say of each of your numerical variables or a quick bar chart of your factors so that you can see, okay, we’ve got men and we’ve got women, and then [00:02:00] kind of have a quick look at it. Or of a box plot, you can say, okay, everyone’s in this age range, there doesn’t seem to be anything too unexpected in there.

David: Yeah, exactly. You can explore for many different reasons, but when you have a decent size data set, then, you know, spending time just very quickly looking at it, making sure it meets your expectations of how it should be, you’re quality controlling it by doing that, you’re checking to see are there any things which stand out as being unusual or different? Might you need to go back and check, or ask about any data from those who provided it. 

Or, you know, in the climate data we work with a lot, have any temperatures been entered incorrectly? Temperatures of 200 and something often have actually just got an extra zero on, and so it was really 20 something, these sorts of things. 

And just very quickly going through and identifying those, that [00:03:00] quality control, quick visualizations draw your attention to things that would be important to take note of before really digging into asking questions related to specific objectives.
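The quality-control pass described here can be sketched in a few lines of Python. This is a minimal illustration only: the readings, the plausibility bounds, and the helper names `flag_implausible` and `looks_like_extra_zero` are all assumptions made up for the example, not anything from the episode or from IDEMS tooling.

```python
from statistics import median

# Hypothetical daily maximum temperatures in degrees Celsius. The 245.0 stands
# in for a 24.5 that was keyed in with an extra zero.
temperatures_c = [23.1, 24.5, 22.8, 245.0, 25.2, 24.0, 23.7, 26.1]

# Assumed plausibility bounds for this made-up station; not a standard.
PLAUSIBLE_MIN_C = -10.0
PLAUSIBLE_MAX_C = 55.0


def flag_implausible(values, low, high):
    """Return (index, value) pairs for readings outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


def looks_like_extra_zero(value, typical, tolerance=5.0):
    """Guess whether value / 10 would sit close to a typical reading."""
    return abs(value / 10 - typical) <= tolerance


flagged = flag_implausible(temperatures_c, PLAUSIBLE_MIN_C, PLAUSIBLE_MAX_C)
typical = median(temperatures_c)  # the median is barely moved by one bad entry

for index, value in flagged:
    if looks_like_extra_zero(value, typical):
        hint = "possible extra zero"
    else:
        hint = "ask the data provider"
    print(f"row {index}: {value} °C ({hint})")
```

A quick box plot would show the same point visually; the check above only makes explicit what the eye picks out, and the flagged rows are candidates to query, not values to correct automatically.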

Lily: Sure. That’s a good way to put it. And then our kind of description or interpretation graphs, so kind of when you really get to build that depth, I find. 

David: Yeah. 

Lily: I know that when kind of teaching at AIMS in the past, the African Institute for Mathematical Sciences, we really built up this kind of interpretation in depth with graphs, using things like ggplot, or the grammar of graphics, which you’ve kind of alluded to already. And really getting to, okay, let’s start with kind of something on our X axis and something on our Y. And then, okay, now let’s add in more and more layers and let’s really build it up to see what’s going on in our data.

David: Yeah. And what [00:04:00] I really like to think of on this is in some cases, if you have lots of variables that could be affecting the variable you’re studying, you can ask yourself how many dimensions can I sensibly visualize? And what you then do is for each dimension, you know, I quite like in this idea interpretation, well, instead of just putting some things in and seeing what happens, think about the whole set of variables you might want to look at, and then for each variable you describe, do you find a way to include it? Do you ignore it and just accept that variability is not going to be accounted for in the visualization you’ve got? Or do you filter for it?

And this is what I like, maybe you sort of just look at a subset which relate to this, to check, to see, you know, well, actually, you don’t wanna get into [00:05:00] that complexity yet, you know, it might be different for different countries or different sites. But you look at a single site or a single country first so that you don’t have that confusion.

So this idea of actually, this is really when you are wanting to interpret, you should be thinking about all the variables you have. We did an episode on this variability recently, and this is a very practical way of, again, thinking what are you doing? How are you going to deal with these different bits of variability?
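The include, ignore, or filter decision can be made concrete with a small Python sketch. The records, site names, and the helper `mean_yield_by` are hypothetical and chosen only to illustrate the idea: here site is filtered, soil is included, and variety is ignored.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial records; every value here is invented for illustration.
records = [
    {"site": "Site A", "soil": "clay", "variety": "V1", "yield_t_ha": 2.1},
    {"site": "Site A", "soil": "clay", "variety": "V2", "yield_t_ha": 2.9},
    {"site": "Site A", "soil": "sand", "variety": "V1", "yield_t_ha": 1.4},
    {"site": "Site A", "soil": "sand", "variety": "V2", "yield_t_ha": 1.8},
    {"site": "Site B", "soil": "clay", "variety": "V1", "yield_t_ha": 3.0},
    {"site": "Site B", "soil": "clay", "variety": "V2", "yield_t_ha": 3.6},
    {"site": "Site B", "soil": "sand", "variety": "V1", "yield_t_ha": 2.2},
    {"site": "Site B", "soil": "sand", "variety": "V2", "yield_t_ha": 2.4},
]


def mean_yield_by(rows, variable):
    """Mean yield for each level of the one variable we choose to include."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[variable]].append(row["yield_t_ha"])
    return {level: round(mean(values), 2) for level, values in groups.items()}


# Filter: look at a single site so site-to-site differences cannot confuse us.
site_a_only = [row for row in records if row["site"] == "Site A"]

# Include: soil is the grouping variable.
# Ignore: variety is left alone, so its effect stays inside each soil group
# as variability the summary does not account for.
yield_by_soil = mean_yield_by(site_a_only, "soil")
print(yield_by_soil)
```

Writing the three choices down per variable, as the comments do, is the point of the exercise; swapping which variable is filtered or included is then a one-line change.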

Lily: And I think filtering is a really nice way to look at your data without having to visualize things in high dimensions, so we can just look at, like, one village at a time. And I know that you’re a pure mathematician, or that’s how you’re trained, and so you’re very good at your high dimensions. But when it comes to a graph, I find it’s something that you kind of learn while looking at graphs, how to visualize in three, and often in [00:06:00] multiple dimensions. 

And again, often with say like AIMS students or when I’ve done teaching in this, they put all of them in at once and then it’s really hard to kind of, if you haven’t built up that dimensionality bit by bit, for me personally anyway, it’s really hard to then say, okay, but what’s actually going on?

David: What do you observe? What are you actually comparing? What are you seeing?

Lily: Yeah. Whereas if we just strip it back and we just look at, okay, let’s just look at this region and then let’s look at, I don’t know, let’s say we’re looking at yield, let’s see how yield is affected by soil for this region. Okay, now what happens if we take in variety and soil for this region? 

You said about ignoring variables as well, and I really like that on kind of include, ignore, or filter. But it kind of, I guess, threw alarm bells in my head because I know that okay, it might be that there’s nothing going on here, but there is something going on later.

David: Well, [00:07:00] and when you ignore something, the danger of course is that that’s now, if we go back to our different types of variability, this is treating that variability as part of your residual, as part of your unaccounted-for variability.

And of course the question of whether it is sensible to ignore it or not, whether it’s possible to or not, you can check that in a visualization which doesn’t ignore it and look at the differences. 

Lily: With ignoring it, it’s kind of, I guess one of those things where you could show that you don’t need to ignore it by finding a graph where it shows that it doesn’t need to be ignored, but you can never show that you should ignore it.

David: Absolutely. 

Lily: Yeah. Okay. ’cause I’m imagining kind of this example that you’ve given me before of like your two box plots next to each other and they look completely identical. Let’s say two different soil types showing their yield, and the yield’s identical for them both. Then we take into account, we then [00:08:00] actually look at, oh no, I don’t want soil types, do I? Because I want a before and after. That’s what I want.

Yeah. You see it is good. You’ve got me in that trap. 

David: Well, I mean, you could have this for soil types as well, that you could have an interaction, let’s say between soil type and some other variable where you’ve actually got, you know, fertilization, let’s say, where within certain soil types, the fertilization is really effective. And within other soil types, the fertilization actually is causing soil acidity, and so it’s having a really negative effect. 

So you could have this sort of, you know, real difference about actually, once you’ve got that interaction going in, it looks like there’s no difference. But when you take it with respect to another variable, you get something very different going on.
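A toy Python example shows how an interaction like this can hide a difference. The yields below are invented so that fertilizer helps on one soil and harms on the other; the names `mean_yield`, `by_soil`, and `by_cell` are assumptions for the sketch.

```python
from statistics import mean

# Hypothetical plot yields (t/ha): fertilizer helps on soil A and harms on
# soil B. The numbers are invented to make the masking effect exact.
plots = [
    {"soil": "A", "fertilized": False, "yield_t_ha": 1.9},
    {"soil": "A", "fertilized": False, "yield_t_ha": 2.1},
    {"soil": "A", "fertilized": True, "yield_t_ha": 2.9},
    {"soil": "A", "fertilized": True, "yield_t_ha": 3.1},
    {"soil": "B", "fertilized": False, "yield_t_ha": 2.9},
    {"soil": "B", "fertilized": False, "yield_t_ha": 3.1},
    {"soil": "B", "fertilized": True, "yield_t_ha": 1.9},
    {"soil": "B", "fertilized": True, "yield_t_ha": 2.1},
]


def mean_yield(rows, **conditions):
    """Mean yield over the rows matching every keyword condition."""
    selected = [
        row["yield_t_ha"]
        for row in rows
        if all(row[key] == value for key, value in conditions.items())
    ]
    return mean(selected)


# Ignoring fertilization: the two soils look identical.
by_soil = {soil: mean_yield(plots, soil=soil) for soil in ("A", "B")}

# Including fertilization: the effect has opposite signs on the two soils.
by_cell = {
    (soil, fert): mean_yield(plots, soil=soil, fertilized=fert)
    for soil in ("A", "B")
    for fert in (False, True)
}

print("by soil:", by_soil)
print("by soil and fertilization:", by_cell)
```

Two side-by-side box plots of yield by soil would look the same here, which is exactly the trap: the agreement comes from two opposite effects cancelling, not from soil being irrelevant.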

Lily: Yes, that’s a much better way to put it. It looks very similar, and then you kind of put this other variable [00:09:00] into account and you realize, okay, no, there’s very different things happening.

David: And this is something where, just to come back, the relationship between using tools like ANOVA, which explore sources of variability, where variability is accounted for, and other visualizations, if you forget about the P values on the ANOVA, then it’s a descriptive tool. This is what’s so powerful about it. It’s actually simply a description about the variability. It’s only when you actually apply the F test and you get the probability that you are making assumptions about the underlying distributions of the data. 

Before that point, before you get the P values, the rest of it is purely descriptive, they’re just summaries, if you want, calculations on the data. Which is very powerful, and often it’s not how people read ANOVA tables. They [00:10:00] read ANOVA tables as part of modelling, if you want, because they’re interpreting the hypothesis testing component of it. 

I actually like by default not showing the P values because that’s valid, that’s always valid. And you learn a lot more about your data by looking at an ANOVA table in that way.
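To see what an ANOVA table looks like when it stops before the F test, here is a minimal one-way sketch in Python. The yields and the function name `descriptive_anova` are hypothetical. Everything computed is arithmetic on the data (degrees of freedom, sums of squares, mean squares, share of total variability), so no distributional assumption is involved.

```python
from statistics import mean

# Hypothetical yields (t/ha) grouped by soil type; invented for illustration.
yields_by_soil = {
    "clay": [2.1, 2.5, 2.3],
    "sand": [1.4, 1.6, 1.5],
    "loam": [3.0, 3.2, 3.1],
}


def descriptive_anova(groups):
    """One-way ANOVA rows (source, df, sum of squares, mean square), no p-values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)

    # Variability accounted for by the grouping variable.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variability left over inside each group (the residual).
    ss_within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return [
        ("between groups", df_between, ss_between, ss_between / df_between),
        ("within groups", df_within, ss_within, ss_within / df_within),
    ]


table = descriptive_anova(yields_by_soil)
ss_total = sum(row[2] for row in table)

print(f"{'source':<16}{'df':>4}{'SS':>10}{'MS':>10}{'share':>8}")
for source, df, ss, ms in table:
    print(f"{source:<16}{df:>4}{ss:>10.3f}{ms:>10.3f}{ss / ss_total:>8.0%}")
```

The share column is the part that speaks to include or ignore: a variable that accounts for a large share of the variability is one that a graph should not leave out.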

Lily: Again, I wanna come back to this kind of include, ignore, filter. 

David: Yeah. 

Lily: An ANOVA table could help indicate a little bit more if you should include or ignore than maybe…

David: Well, exactly. An ANOVA table can tell you that if you don’t include this, you’re gonna get misled.

Lily: Okay. 

David: If a lot of the variability is accounted for by something and you ignore it, well, you know that you’re gonna have this huge, massive variability, which you are [00:11:00] treating as unaccounted for, but you know what’s causing it. And so that visualization, that guide of what you should visualize coming from your ANOVA table for example, that’s really key. The interplay between these is very important.

Lily: And then our presentation tables, we haven’t explained as much what we mean by that. 

David: I suppose we haven’t really got into what does it mean to have these… The tables where you are interpreting, or the graphs where you’re interpreting, these visualizations to interpret, the key there, which we haven’t said yet, but I think comes back to this point, that sometimes you want to look at an ANOVA table, let’s say, and a box plot together or scatterplot or something together because they’re giving you different views on the same information.

This idea that good visualizations for interpretation, [00:12:00] you probably want to do a number of different things to carefully check and to go deeper into the questions you’re trying to answer. So you would want to have different views on that same information. 

Lily: Sure. 

David: And that to me is central to this approach of really trying to make sure you are getting the correct interpretation. You know, I love the Datasaurus dozen. Listeners, if you’ve not seen the Datasaurus dozen, look it up, it’s wonderful. It’s a fantastic example of why you need to visualize, why just having summaries is not enough. Anscombe’s quartet was the original one, which was four graphs, but the Datasaurus dozen, which is the one I love, it’s 12 graphs, they’re scatterplots, which have the same mean on the X axis, the same mean [00:13:00] on the Y axis, same standard deviation on the X axis, same standard deviation on the Y axis, and the same correlation between the X and the Y variables. And yet the graphs are slightly different from each other, and one of them is a picture of a dinosaur. It’s wonderful.

Lily: You say slightly different, they are very different graphs.

David: And the point is that all of them have the same summary statistics. So if your interpretation is just the means or is just the correlation, then you could be totally missing the key points in your data, that’s what the Datasaurus dozen really shows. And this is why when you are interpreting, it’s all about having these multiple views to make sure you are getting that depth and you are actually getting the complexity for the interpretation, which relates to the specific questions that [00:14:00] you are asking of your data, the ways you are wanting to turn your data into information. 
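The same point can be checked numerically with Anscombe's quartet, which is small enough to type inline (the Datasaurus dozen is too large for that, so this sketch swaps in the original four-graph example). The values are the quartet as commonly reproduced from Anscombe's 1973 paper; the helper `pearson` is written out by hand for the illustration.

```python
from statistics import mean, stdev

# Anscombe's quartet: sets I to III share the same x values.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


summaries = {
    name: {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "sd_x": stdev(xs),
        "sd_y": stdev(ys),
        "corr": pearson(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, s in summaries.items():
    print(
        f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
        f"sd_x={s['sd_x']:.2f} sd_y={s['sd_y']:.2f} corr={s['corr']:.2f}"
    )
```

To two decimal places the four rows of summaries agree, while scatterplots of the same four sets look nothing alike, which is why the summaries alone cannot carry the interpretation.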

And that brings us, I suppose to presentation, that once you want it to be information, a whole new set of challenges await. The amount of time you can spend getting a single point across, using a data visualization of any form can be immense. It can also be complex in the sense that it can actually have quite a lot of information.

I’ll just give you a very simple example of this. With one of my PhD students recently, there was a particular set of visualizations where the question was, should you show the [00:15:00] value or should you show the proportion? Because they had many different meteorological stations, and of course the values were quite different for the different stations. The question is, was it better to actually know what the values were or do you just want the proportion?

So if you were in a station which had a large number of rainy days or a small number of rainy days, you were just looking at the proportion of rainy days, or did you want to actually know how many rainy days there were? I’m not sure that was exactly the case, but this was the sort of rough idea. 

And the point was, well, one of the ways you can get around that is you could actually just add more information to the graph. So although you are showing the number, you can present it alongside information, which means you can interpret it as a proportion easily. So you are actually making it easier for the reader to see both in the same [00:16:00] graph. That was the sort of thing where a presentation graph, this made a big difference. 

And the reason that was so important was that, actually, in this particular case, if you were just looking at the numbers you didn’t know the numbers that you were looking at weren’t the actual numbers, they were the numbers related to an estimation, so you actually had this extra layer of abstraction. And when you were looking at the proportion, you were looking at the proportion with respect to the actual station. So it was genuinely different information. 

And actually being able to get it all on the same graph, it gave texture to the graph, it gave substance to it. It meant that actually when you were as a reader looking at it, you needed to spend a bit longer to understand what all the elements of the [00:17:00] graph meant. But once you’d done that, you could now actually have two pieces of information as quickly as you would have one, once you’d learned how to read it. 

And this is the thing that there’s this payoff. Do you want something, a visualization which jumps out at you and you can read immediately but you might get misled, or do you want something which is going to be more difficult to read for the audience? This isn’t about it being good or bad, this depends on your audience. So who is your graph for is maybe the most important element of presentation graphs. 

Another experience on this is we were making graphs for farmers who were actually potentially illiterate. And what was fantastic there was with a bit of work, we realized we didn’t need simple [00:18:00] graphs. They were extremely good at understanding complexity. And we were able to display graphs, which researchers found extremely complex to these farmers. But what they weren’t good at was understanding abstraction. 

So means, which researchers love, were a really bad thing for the farmers because if you showed everybody, then they were able to say, oh, that’s you, that was your field where you got this and that happened. Whereas the mean, well, who’s the mean? Who’s the average? No one’s the average. And so it doesn’t actually mean something concrete to them.

And whereas the complexity of having everybody there, and, you know, a lot of complexity, they could see that complexity in what they’d observed. They could relate to it. So this idea of who’s your audience, what can they relate to? What are they used to? You know, you show that same graph to the researcher and [00:19:00] actually, it was alien to them because they’re used to means, they’re used to that abstraction. So that audience is key. 

So that I would argue is the key to presentation. And maybe the final thing I should say is sort of when you think about exploring with visualizations, interpretation through visualizations, or presenting with visualizations, it’s an increasing amount of time which is needed for each visualization as you go from the exploration where it should be quick and easy, through interpretation where you should be spending time, you should be looking for multiple perspectives, and your presentation where you should be thinking, how can you bring those different perspectives together into a single visualization, which can be interpreted by your intended audience? 

Oh yeah, it’s a sensible division between these different ways of using [00:20:00] visualization. And it’s not me who’s come up with this, lots of others have done it.

Lily: I know that we’re trying to wrap up, but, I’ve not heard those examples of that context before, for your different presentation graphs. So that’s incredibly interesting, that kind of power the different graphs could have. 

David: This is a research topic. I would love to have people actually putting good research in to understand for different audiences what are the things we should take into account. I love the fact that complexity isn’t something to worry about when you’re talking to farmers. That’s not what would happen if you talk to a lot of communicators. Well, they said things need to be simple. That’s not our experience. They’re used to complexity. They’re not used to abstraction, things shouldn’t be abstract and confusing. 

Lily: And I think showing that through graphs, or using that difference between complexity and abstraction through your example of a graph is a really, for me anyway, nice way to see it.

David: Good point to end on because this is a whole [00:21:00] nother topic if we wanted to get sucked into complexity versus abstraction.

Lily: Excellent. Thank you very much, David. 

David: Thanks.