230 – Introduction to Sampling

The IDEMS Podcast

Description

In this episode, Lucie and David discuss the complexities of sampling in research. They explore common misconceptions, and introduce three levels of sampling complexity. The episode highlights the necessity of understanding population structure and the compromises involved in effective sampling.

[00:00:07] Lucie: Hi, and welcome to the IDEMS podcast. My name’s Lucie Hazelgrove Planel, I’m an anthropologist and Social Impact Scientist, and I’m here with David Stern, a founding director of IDEMS.

Hi David.

[00:00:17] David: Hi Lucie. We’re discussing sampling today.

[00:00:21] Lucie: Yeah, so we’ve had a few discussions with researchers recently about sampling, researchers who are interested and who are a bit, I guess, confused with the environment out there about what’s appropriate. 

[00:00:36] David: It’s a serious problem in the teaching of statistics all over the world, in terms of how sampling is taught, that people tend to think sampling is what is the right equation I should use to calculate my sample size.

[00:00:53] Lucie: Yeah. 

[00:00:54] David: There’s many different methods and if you go to textbooks, there’s all sorts of things about all these different sort of formulas you can use, and they have nice names associated to them. But it’s often not the right question.

[00:01:09] Lucie: And I’m gonna keep on taking you away from statistics, to get you to think about other disciplines and how they sample whether or not they do it consciously. 

[00:01:18] David: This is the whole point: when you are designing a study, the sampling problem, the statistical piece of it is in some ways, quite often the least important. It’s not that it doesn’t have value and that it isn’t important, it’s that it plays a very specific role. There’s a specific time when a sample size calculation is useful or arguably even needed not to waste the study.

If you don’t do the right sample size calculation, then it could be that at the end of your study, you don’t have any sensible results because your sample wasn’t appropriate for the research question. So I’m not saying there isn’t value or need to use sample size calculations as part of your study design, but I am saying that in the teaching of study design and sampling, the sample size calculations should not have such a big place, they are not the most important piece.

[00:02:29] Lucie: So in one of our meetings this week with researchers, you tried to lay out three different levels of complexity of sampling, which I thought might be a nice way to introduce this topic, ’cause I think there’s gonna be more episodes about sampling.

[00:02:46] David: I mean, it’s something which I’ve tried to develop as an easy way to help people to think about sampling, based on what the research questions they have actually are. And so how do you make it easy to get started, what do you want to think about in terms of what sort of sampling should you do?

And so the first question is: is your research question answered by having a population where you are trying to understand a single characteristic of that population where it’s just varying naturally? So if you have natural variation of a population of things, maybe you are studying a particular grasshopper or whatever it may be, and you want to go and have traps out in different places and catch the grasshoppers and see how big they are, you know, and they form a natural population where in some sense, you’re wanting to understand the distribution of sizes of grasshoppers.

I’ve just taken that as an example that comes to mind, actually, the case I do know is someone who is doing this for tadpoles, but that’s a separate story. But if you are trying to do that, then the right question is: how many do I need to get a good idea of the distribution of sizes, let’s say, or of whatever parameter you are measuring within the population?

[00:04:26] Lucie: This is the level one, really, the most sort of basic I want to say, and basic in the sense also here of, clearly we don’t know much about the population.

[00:04:35] David: It’s not we don’t know much about the population, it’s that the population with respect to your measurement forms a natural distribution, is varying naturally, there aren’t groups. This would be the next thing, what if you had two slightly different species of grasshoppers you were looking at?

Then you kind of want to distinguish, is it one type or the other type, or what if male grasshoppers and female grasshoppers were different sizes naturally? Then that would be stratifying your sample. So this is the first big concept to introduce, if you’re just looking at a population and you are wanting to measure something, and that measurement varies naturally, and it’s not that you don’t know much about it, it’s that there isn’t much complexity, the variability between things, broadly, can be considered as natural variability.

[00:05:33] Lucie: I’m sorry, I don’t understand how you can know it’s gonna be natural variability unless you’ve done your study. So if I take the example of humans, then how do you know that men and women, there won’t be huge differences or geographical differences between areas?

[00:05:48] David: One of the reasons I didn’t choose a human example for this is, actually, although we often simplify down, and I’ll give a concrete example of let’s say voting, are you going to vote one way or the other way? This is something where stratified sampling has been shown to be extremely valuable in terms of polling, and with a relatively small sample, you can actually get information about a large population.

And if you do your stratifying very well, you can actually predict quite well, which is why polls work quite well. They don’t work perfectly because there are things that you don’t know. So maybe we’ll come to polling as an example in terms of humans. What’s interesting about polling for elections is that what you are trying to measure is very precise and very simple.

And that’s really important in this, the more complexity you have in terms of what you are trying to measure and find out, it changes how you think about the sampling and how you think about either stratifying your sample, or actually bringing complexity in. But the starting point, and this is where any of the classic books on sampling really in the statistical literature, they’re really starting from the assumption that your population is something where a distribution naturally explains the variability within the population.

And this is where that first question, do you have groups within that population that you should study separately, these strata, for example, this is the first question. And I’m using stratified sampling here, but also to include things like cluster sampling, cluster sampling is about grouping, whereas stratified sampling, there is a distinguishing feature if you want.

So let me give an example, male or female, that would be a good example of a stratum. I have 10 different schools, they would be groupings, in that each school is different, and you might want to do cluster sampling within a school rather than across schools. But it’s the same basic principle that you are wanting to have these subgroups that you are sampling within, as opposed to just taking a random sample, which ignores grouping.
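
As a rough illustration of the two ideas in this example, here is a minimal Python sketch. Everything in it is hypothetical: the population of 400 students, the field names, and the function names `stratified_sample` and `cluster_sample`. The cluster version shown is the common two-stage reading (pick some schools, then sample students inside them), which is one assumption about what is meant here, not the only possible design.

```python
import random

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Draw n_per_stratum units at random from every stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

def cluster_sample(units, cluster_of, n_clusters, n_per_cluster, rng):
    """Pick some clusters at random, then sample units inside each one."""
    clusters = {}
    for unit in units:
        clusters.setdefault(cluster_of(unit), []).append(unit)
    chosen = rng.sample(sorted(clusters), min(n_clusters, len(clusters)))
    sample = []
    for name in chosen:
        members = clusters[name]
        sample.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return sample

# Hypothetical population: 10 schools, 40 students each, alternating sex.
rng = random.Random(0)
students = [
    {"school": s, "sex": "F" if i % 2 == 0 else "M", "id": (s, i)}
    for s in range(10) for i in range(40)
]

# Strata: every sex is represented by design.
by_sex = stratified_sample(students, lambda u: u["sex"], 20, rng)
# Clusters: only 4 of the 10 schools are visited.
by_school = cluster_sample(students, lambda u: u["school"], 4, 10, rng)
print(len(by_sex), len(by_school))
```

Both samples contain 40 students, but they are built differently: the stratified one guarantees 20 from each sex, while the cluster one only contains students from the four schools that happened to be chosen.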

[00:08:23] Lucie: Yeah.

[00:08:25] David: So that’s the first big question. 

[00:08:28] Lucie: Okay, and that divides between level one and level two: if you have groups or strata, then you’re gonna go towards level two, otherwise level one, where it’s a relatively homogeneous population, or, you want a normal distribution, you’re saying?

[00:08:41] David: Well, it’s not necessarily normal distribution, it might be a gamma distribution, it might be a binomial distribution. 

[00:08:47] Lucie: Oh, I don’t know those. 

[00:08:49] David: A gamma distribution is skewed. If you have rainfall, you can’t have rainfall which is negative, this doesn’t exist. So there’s a good reason why you want to sort of think of it differently. There are cases where you might not always have a normal distribution, but you still have a natural distribution, natural rather than normal. Binomial is a choice between two things, so you don’t have a continuous set of answers you could have, you’re choosing between things.
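
To make the two distributions a little more concrete, here is a small Python sketch using only the standard library. The shape, scale and probability values are arbitrary choices for illustration, not figures from the episode.

```python
import random
import statistics

rng = random.Random(1)

# Gamma: continuous, never negative, skewed to the right (e.g. rainfall amounts).
# Shape 2.0 and scale 5.0 are arbitrary illustrative values.
rain = [rng.gammavariate(2.0, 5.0) for _ in range(10_000)]

# Binomial: count of "yes" answers among 20 independent two-way choices,
# each with an assumed 0.5 chance of "yes".
yes_counts = [sum(rng.random() < 0.5 for _ in range(20)) for _ in range(10_000)]

print("rain min:", min(rain))
print("rain mean vs median:", statistics.mean(rain), statistics.median(rain))
print("yes counts range:", min(yes_counts), max(yes_counts))
```

The gamma draws never go below zero and their mean sits above their median, which is what the skew looks like in numbers. The binomial draws are whole counts between 0 and 20 rather than a continuous measurement.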

[00:09:17] Lucie: Okay. And that’s, I think that’s where my confusion was coming from. I was automatically thinking of normal rather than natural.

[00:09:22] David: Yes, so a natural distribution, as a statistician, which of course I’m not, I’m more of a mathematician, but that’s a separate point, as someone who works in statistics, a natural distribution could be any one of a number of different statistical distributions, but it’s a distribution of a population.

The key thing is, if you, for whatever reason, have subpopulations, or a desire to think of things in terms of subpopulations, either as saying, well, actually it’s multilevel, as in, well, students depend on which school they’re in, the school is going to affect the students with respect to the variable I’m interested in, that means that I want to do cluster sampling, where I take into account the schools.

Or the equivalent is stratified sampling that I want to sort of sample both men and women because that might affect the measurement I’m taking as well. So I want to include that in my sampling.

So the broad first distinction is really between a simple sampling of a population where you just are looking to say how much, what do I need to be able to have confidence in the measurement I’ve got representing the population?

[00:10:46] Lucie: So a certain proportion of the population?

[00:10:48] David: So this is what’s interesting, of course. It often isn’t, in that first case, it’s not really a proportion of the population because if you have a small population, you need quite a large proportion of them to understand it, and if you have a large population, you need quite a small proportion because, actually, a few people give you a pretty good idea of a whole population if they’re all from the same distribution, statistically speaking.
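
One way to see this numerically is the standard textbook sample size for estimating a proportion, with a finite population correction. This is a sketch under stated assumptions: a 5% margin of error, 95% confidence (z = 1.96) and the worst-case proportion 0.5 are conventional defaults chosen here for illustration, and `required_sample` is a hypothetical helper name. It is not the formula from any particular book mentioned in the episode.

```python
import math

def required_sample(population, margin=0.05, p=0.5, z=1.96):
    """Sample size for a proportion, with finite population correction."""
    # Size needed if the population were effectively infinite.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it for a finite population of the given size.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

sizes = {N: required_sample(N) for N in (100, 1_000, 10_000, 1_000_000)}
for N, n in sizes.items():
    print(f"population={N:>9,}  sample={n:4d}  share={n / N:.2%}")
```

Under these assumptions a population of 100 needs most of its members sampled, while a population of a million needs only a tiny fraction, and the required number barely moves once the population is large.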

And so it does increase, as your population increases, these values do increase. There’s actually a book in Kenya going around the universities there where they got some of their formulas wrong. And so there were a whole set of studies when I was working there, everybody had 243, or whatever it was, because whatever’s the size of your population, the formula gave you 243.

But the thing is that actually this seems like it’s very bad, but it’s not actually so wrong. Once you’ve got a relatively large population, it doesn’t change your sample size that you need to understand the population so much. In fact, the rule of thumb I like is, now we can talk about a normal distribution, so if you have something which does have not just a natural, but a normal distribution, well actually, if you’ve got 20 or 30 elements which vary according to a normal distribution you’re probably gonna have a pretty good estimate.

However big your population is, 20 or 30 is a sensible minimum. If you’ve got less than that, there’s gonna be quite a lot of variability that comes from chance, and if you have much more than that, yes you get more, but it’s diminishing returns. And so, if you’ve got a very simple single question, which is answered, and that answer falls along a normal distribution, 20 or 30 data points is probably about enough.
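
The diminishing returns can be checked with a short simulation. This is a minimal sketch, assuming a standard normal population (mean 0, standard deviation 1) and a hypothetical helper `spread_of_sample_means`; it measures how much the sample mean wobbles from one sample to the next at different sample sizes and compares that with the theoretical value 1/sqrt(n).

```python
import math
import random
import statistics

def spread_of_sample_means(n, repeats, rng, mu=0.0, sigma=1.0):
    """Empirical standard deviation of the mean of n normal draws."""
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

rng = random.Random(2)
results = {}
for n in (5, 10, 20, 30, 100, 1000):
    results[n] = spread_of_sample_means(n, repeats=500, rng=rng)
    print(f"n={n:5d}  simulated={results[n]:.3f}  theory={1 / math.sqrt(n):.3f}")
```

Because the spread falls with the square root of n, going from 5 to 30 data points roughly halves the uncertainty twice over, whereas halving it again from 30 takes about 120.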

Now, of course when you’ve got multiple questions, then you kind of don’t just have a single normal distribution, now you are multidimensional. And each dimension means you should have more data to be able to get enough data on the different dimensions. So the bigger your questionnaire, really, the more data you need, or questions you ask, the more data you really need to be able to analyse it deeply, if you’re doing a questionnaire, for example. And so there’s things which relate to that.

But a simple rule of thumb is that if you are wanting to study something, then for that one dimension, 20 or 30 data points will give you some idea of the distribution of that data, if it’s normally distributed or follows most natural distributions, not all natural distributions. So you’ve got a nice rule of thumb for these very simple ones.

Now you can get more complicated because your measurements are more complicated, and that’s a whole different discussion that we could get into. But the main thing that happens is, you start recognising, well, female grasshoppers and male grasshoppers are really rather different, or there’s actually two different species of grasshopper who look similar, but they’re different in these specific ways, and that affects their size.

Those things they relate to putting in more structure, and the more structure you have, in some sense, you kind of still need the same amount of data in each group to be able to do this. So in some sense, once you start stratifying, your sample size that you need goes up quite considerably in certain ways, and it does matter whether or not these are strata or these are groups, but I won’t go into too much detail about that.

[00:14:40] Lucie: Thank you.

[00:14:41] David: What I will say is that, groups, these are different instances of the same thing. So maybe they’ve got slightly different values, but you might expect them to have the same distribution. So what you are doing is you’re then removing bias, which comes from being in different schools. But you are expecting the students within a school will have a similar distribution by some means of calculation.

Whereas it might be that, you know, your different species of grasshopper, they’re totally different, the distributions of size might be totally different for that. And so in some sense, you need a bit more data in that second case to be able to accurately describe it because you have less in common.

And the basic principle is that what you are trying to do is you’re trying to understand, within the population that you are studying in these different ways, what are the groupings which are affecting what you’re going to study. And how much changes across those groupings will also affect how much data you need within the groupings, because, you know, if you are able to use the distributions from one grouping to deduce things about others, that allows you to have less data.

[00:15:57] Lucie: Within this level two then, two of the key points, when you’ve got a population which has different characteristics that you know are important. So the first step is to identify what are those characteristics. And then I think what you’ve been saying just now is also then understand how many you need for each of these groupings or strata?

[00:16:17] David: Exactly, and this is where, quite often, once you actually get to contexts which you understand well, and I will use polling here as being something which has been so successful in many cases. There have been very interesting failures of polling, and these are very interesting to study.

[00:16:36] Lucie: Sorry, what’s a failure? Is that when it suggests the wrong result? 

[00:16:41] David: For example, the one that people are very familiar with, I think fairly recently is about, was it 2016 I think, when Trump first came into power in the US. All the polls predicted that he would lose. And what this corresponded to is that actually their weighting… so polling, in this case, it’s really incredible how few people they ask to get such interestingly accurate results most of the time.

But the point is that it’s the stratification, which means that they’re able to do that, that they’re able to sort of take groups who are represented, a small number of people represent a large subpopulation, and that gives them information about how that subpopulation will vote.
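
A minimal sketch of how a stratified estimate is put together, assuming each stratum's answers are weighted by that stratum's share of the population. The three groups, their shares and their support figures are invented for illustration, and `weighted_estimate` is a hypothetical helper, not any polling organisation's method.

```python
def weighted_estimate(strata):
    """Combine per-stratum sample proportions using population shares as weights."""
    total_share = sum(s["share"] for s in strata)
    return sum(s["share"] * s["support"] for s in strata) / total_share

# Hypothetical strata: share = fraction of the electorate,
# support = proportion backing candidate A among those sampled in that stratum.
poll = [
    {"name": "group 1", "share": 0.5, "support": 0.60},
    {"name": "group 2", "share": 0.3, "support": 0.40},
    {"name": "group 3", "share": 0.2, "support": 0.30},
]
estimate = weighted_estimate(poll)

# Same answers, but weights that no longer match the real electorate.
stale = [dict(s) for s in poll]
stale[0]["share"], stale[2]["share"] = 0.4, 0.3
print(round(estimate, 3), round(weighted_estimate(stale), 3))
```

The answers collected are identical in both cases. Only the assumed shares differ, and that alone moves the headline figure, which is why the quality of the stratification matters as much as the number of people asked.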

[00:17:35] Lucie: It’s a good example too, in showing the very real consequences of sampling, and it influences how people act, I think some people would decide how to vote according to polls.

[00:17:46] David: Yeah, polls have this very interesting phenomenon where they then influence the future behaviour. Who turns up to vote, you may not turn up to vote because the polls say there’s no point. They’re very complicated, they’re not perfect, but they are extremely successful in general.

But what happened, as I understand it, in 2016, is that there were, if you want, differences within strata of the population that had emerged, but that had not been captured in those polls.

[00:18:22] Lucie: Oh, interesting. So the researchers hadn’t sort of kept up to date with society.

[00:18:27] David: Yes, there were societal changes that were happening, which the polling systems were not picking up on, which meant that the weightings, and the way that they’re done, weren’t picking up on some of the changes that were happening within society. And that therefore meant that the methods that had worked very well previously, four years earlier, didn’t work as well in 2016.

And there were other cases. There was a very interesting case in the UK where something similar happened. And again, it was traced down to changes in how the polling actually got it wrong in terms of the stratification of the population. So this is a really interesting case, and we’ve really honed in on this first distinction, and again, polling is so valuable because you are only asking a very simple question, in some cases, is it A or B?

[00:19:17] Lucie: This comes back to what you said earlier that the smaller your questionnaire basically the smaller your sample needs to be.

[00:19:25] David: Exactly. The less you are measuring is the correct statement because it may not be a questionnaire, it might be in other ways, but the less you are measuring, the smaller your sample in some sense, because you are just trying to precisely know something. And that’s why polling is such a wonderful example because you’re just trying to know one thing very specifically.

[00:19:46] Lucie: And I’m aware that you can say, and you will no doubt in other episodes talk a whole lot more about this sort of stratified and cluster sampling and all the rest, but can we move on to the level three then of complexity?

[00:19:59] David: Yeah, so what happens when you are actually trying to do something where the structure is much more complex, where your complexity gets to the stage where you can’t simply calculate based on, yes, you have some strata or you have some clustering and grouping that you know, but things get complex either because you’ve got lots of different measurements or because there’s real complexity to your population in certain ways. And you know a lot about the complexity of that population.

If you want, your simple sampling techniques and calculations aren’t enough, and this is where you get to do the fun stuff. This is where you actually build out a mathematical model of what you are wanting to do. And then you can simulate and say, well, if I collected this data, would I be able to answer the questions I have, if my assumptions about the population, the questions, and all the rest hold?
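
A toy version of this simulate-before-you-collect idea, written as a hedged sketch rather than the approach used in any study mentioned here. It assumes a very simple model (two groups, normally distributed measurements, an assumed true difference of 0.5 standard deviations) and a simplified planned analysis (a Welch-style t statistic compared with a rough cut-off of 2). All names and numbers are hypothetical.

```python
import math
import random
import statistics

def detects_difference(a, b, threshold=2.0):
    """Planned analysis: Welch-style t statistic compared with a rough cut-off."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return abs(statistics.fmean(a) - statistics.fmean(b)) / se > threshold

def simulated_power(n_per_group, effect, sd, runs, rng):
    """Share of simulated studies in which the planned analysis finds the effect."""
    hits = 0
    for _ in range(runs):
        # Generate one fake study under the assumed model.
        a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        b = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        hits += detects_difference(a, b)
    return hits / runs

rng = random.Random(3)
power = {n: simulated_power(n, effect=0.5, sd=1.0, runs=400, rng=rng) for n in (10, 30, 100)}
for n, p in power.items():
    print(f"n per group={n:4d}  estimated power={p:.2f}")
```

Each run is a pretend study: data are generated under the assumptions, the planned analysis is applied, and the share of runs that find the effect estimates how likely the real study is to succeed at that sample size. A real design would replace the two-group model with the actual structure of the population and the actual analysis plan.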

[00:21:13] Lucie: This level sounds more like a statistician’s interest, research interest rather than something directly that many researchers, like many non statistician researchers, would be wanting to study.

[00:21:25] David: Well, yes and no. As IDEMS, we’ve only got one instance where we’ve got to do this really well.

[00:21:32] Lucie: And I really hope that Lily will discuss this instance with you at some point.

[00:21:38] David: I’m sure Lily and I will have a session where we discuss this for sampling. But what I will say about that is that actually this was a PhD student at Oxford whose study was complex. And what was so interesting is that actually by going through and building this model out before she collected any data, it changed the nature of her study, it influenced which questions she would ask because there were so many things she was interested in.

By actually modelling it out first she was able to say, well, this is my analysis plan. This is the analysis I will do afterwards. And she ran it through with the simulations to check. If the assumptions that she had were right, would she get significance, would she have the valid differences in her data at the end? And it led to a really beautiful study, and a very, very important piece of work.

And my understanding is, I need to check this with Lily, but my understanding is they actually ended up publishing about the sampling. This was just the preparation and the design for the study. And so you are right in that sense, it is something where without a good statistician, it is very difficult to do that. And there’s many cases where you wouldn’t want to do that because the study doesn’t warrant it. But there are studies which really need that level of rigour before you do it because it’s expensive to actually do the study.

There’s a lot of, in this case, real money would have been put in to go and employ people to do the interviews, and knowing how many people you need to interview to be able to get the results that are behind the study, that warrants a bit of time. Your statisticians aren’t that expensive compared to field staff actually going out, collecting data and so on.

So using a statistician to be able to do that, this is a really valid thing to do when you have a complex study like that and it’s very valuable and very worthwhile.

[00:23:42] Lucie: And it’s really interesting that you said that. This is something that we recommend quite often too, that before you do the data collection that you decide on your data analysis plan as well to know exactly if you’re collecting the useful and relevant data for your research objectives.

So it’s interesting working with the statistician, it means that they can actually, the researcher can be assured that their study is going to be useful and it’s gonna get the results that they want. And once they have the data, then they can just run the analyses, they know exactly what they want to be doing.

[00:24:11] David: Exactly. You can have your whole thing set up so that your data is then coming in and you are able to then actually learn a bit as it comes in. Done well, it is a wonderful way to do good research and it’s what’s recommended in certain cases, and there are certain fields where this is actually needed. 

[00:24:28] Lucie: I’m assuming medical…

[00:24:29] David: Yeah, a lot of medical, serious medical studies, this is almost the norm, in many cases this is the norm to do this. But, you said this is something which your statistician should like and other researchers would not tend to want to do.

[00:24:45] Lucie: A bit of a bias there. 

[00:24:46] David: There is a bias there, I understand. At the moment I can very rarely recommend this. It is only certain studies that I can really recommend this to. But part of the problem in that is it’s not made easy. If only we made this easier to do, then I believe there are a much greater proportion of studies that could benefit from doing this than currently do.

[00:25:13] Lucie: No doubt.

[00:25:14] David: So this is something where if you think about these three simple levels of, do you just have a simple piece of information that you are trying to find out about something, okay, that’s just simple sampling. The next thing you’re gonna have to do is you need to think about the structure of your population in whatever way that is. This is stratified or cluster sampling in its different forms, and that requires a little bit more complexity and you can maybe use a rule of thumb instead of the right calculations, you can have 30, 20 to 30.

[00:25:49] Lucie: Exactly, especially if your population is small.

[00:25:51] David: Exactly. You can sort of use a rule of thumb and you can understand the structure of your data and you can do that relatively simply.

And then of course if you actually have something which is more complex, where you know a lot about the population, then model it. Actually model it and simulate to understand, will you get the sort of answers you need? If you are having to spend a lot to get those answers, then maybe it’s worth doing that bit of modelling first to be able to make sure the money you are spending to get those answers is well spent.

So if you’ve got three simple levels of sampling, that’s my distinction. Yes, of course I have way oversimplified, but I find it useful to help people to see that the question isn’t simply how many.

[00:26:40] Lucie: Yeah, definitely. As I said at the beginning, we’ve seen this with several of the researchers we support, there’s not a magical formula that exists. So I think this discussion has been really interesting and highlighted a different approach to sampling or a different way of thinking about sampling.

[00:26:54] David: Well, I suppose the key thing is sampling to me is about compromise.

[00:27:00] Lucie: Well, exactly. I mean, we haven’t even discussed the actual practicalities.

[00:27:04] David: Yeah, you’re compromising often between taking something which is a random group or something which is more purposeful. You are compromising between the sample you’d like to have and the cost. There’s so many compromises you are making and a lot of what, really, I find sampling is all about is, what’s good enough? And that’s a compromise, it’s a question of compromise. So thinking of it differently helps to take that approach.

[00:27:30] Lucie: Great. Well thank you very much, David. It’s been a pleasure talking about sampling, which is not what I expected to say.

[00:27:37] David: Wow. I’ve won, you’ve enjoyed talking about sampling. Oh, I must be onto something then.

[00:27:44] Lucie: Thanks then.

[00:27:45] David: Thanks.