234 – Data Collectors as a Source of Variability – IDEMS International Community Interest Company (CIC)

The IDEMS Podcast

February 6, 2026

Description

Lily and David discuss the significant influence of data collectors on survey variability and data quality, using examples from West Africa. They highlight the importance of thorough enumerator training to address issues like inconsistent definitions of household size.

Transcript

[00:00:06] Lily: Hello and welcome to the IDEMS podcast. I’m Lily Clements, a data scientist, and I’m here with David Stern, a founding director of IDEMS.

Hi, David.

[00:00:15] David: Hi, Lily. I’m looking forward to another discussion about research methods.

[00:00:19] Lily: Yes. Yeah, part of our miniseries on the 20-year anniversary of research methods support in the West Africa region. Have I got that right?

[00:00:27] David: Yes, absolutely. And what are we discussing today?

[00:00:31] Lily: Today, a topic that I’ve seen come up a few times, another one on variability, but this time on data collectors.

[00:00:38] David: It’s really data quality as well as data variability. One of the things that very often happens when I’ve been working in the region with a whole set of projects, and this isn’t just unique to the West African region at all, this happens everywhere. But, I’ve had the privilege of working with projects in the West African region where I’ve got my hands on their data, and we’ve seen some of these issues.

And, one of the interesting sources of variability that people don’t tend to think of, particularly when they do large surveys, is the data collector. If you don’t have the data collector as a variable in the data from the survey you’ve done, then you may have a hidden source of variability, which is really hard to do anything about afterwards.

I like to think of this like blocking or stratification, you know, within a data collector, you get a level of consistency about how they ask the questions, how people respond to them as an individual, the sort of data they give, and so on. And this does change a lot from person to person depending on your enumerator, who collects the data.

And we’ve seen this in many data sets which have had the data collector, where sometimes the largest source of variability on certain questions is the data collector.
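
As a rough illustration of what "the largest source of variability is the data collector" can mean in practice, the sketch below splits the total variation in one numeric answer into a between-enumerator part and the rest, and reports the between-enumerator share (eta squared from a one-way decomposition). This is a minimal sketch, not the method used in the projects discussed: the field names `enumerator` and `household_size`, the function name, and the records are all made up for the example.

```python
from collections import defaultdict
from statistics import mean


def enumerator_variance_share(records, value_key="household_size",
                              group_key="enumerator"):
    """Share of total variation lying between enumerators (eta squared).

    Hypothetical helper: `records` is a list of dicts, one per interview.
    """
    # Group the answers by the enumerator who recorded them.
    groups = defaultdict(list)
    for row in records:
        groups[row[group_key]].append(row[value_key])

    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = mean(all_values)

    # Total sum of squares around the overall mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    if ss_total == 0:
        return 0.0  # every answer identical: nothing to attribute

    # Between-enumerator sum of squares: group means around the overall mean.
    ss_between = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2 for vals in groups.values()
    )
    return ss_between / ss_total


# Made-up records: enumerator B consistently records larger households.
records = [
    {"enumerator": "A", "household_size": 4},
    {"enumerator": "A", "household_size": 5},
    {"enumerator": "A", "household_size": 6},
    {"enumerator": "B", "household_size": 11},
    {"enumerator": "B", "household_size": 12},
    {"enumerator": "B", "household_size": 13},
]
share = enumerator_variance_share(records)
print(f"Between-enumerator share of variation: {share:.2f}")
```

A high share is a prompt to investigate rather than proof of a problem: if each enumerator worked in a different village, real differences between villages would show up in the same number.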

[00:02:12] Lily: Wow. So where does that come from? They – the data collector – ask a question differently? Or do they interpret it differently? Or, I don’t know?

[00:02:22] David: Well, my favorite example of this is household size.

[00:02:27] Lily: Okay.

[00:02:28] David: Household size is a really difficult concept because, well, who’s in the household and who isn’t? Especially when you have these extended households. And in many communities in the West African context, households aren’t well-defined nuclear units. Your eldest son might be on your compound and they have somewhere and they are part of your household. So, in some definitions it’s defined by the cooking pot, who eats?

But the point is that if you haven’t trained your enumerators to all ask about households in the same way, then each of them comes with their own preconception. Some people might just ask, well, what’s your nuclear family, who are your children, how do they live with you? Okay, end of the question. So that would be your household.

But others might say, well, oh, you have your nephew and your niece who live with you, oh, your grandchildren also live with you. Oh, your son also is still living in some sense, he has a slightly separate house, but they eat with you and they’re part of the same household. They’re not an independent decision making household.

‘Cause that’s the other thing: “household” in the definition of who eats together is one thing, but what about decision making power? If you are doing an agricultural study and you can’t decide what to plant on certain fields, then maybe you are not the household head. Now, maybe it’s only the household head who can make decisions, and therefore that decision making power might determine the household. And there are communities within which those households are more than 60 people as a single household. There are parts of Mali where we’ve seen this, and there’ve been really interesting studies.

And so, there’s complexity here, which is wonderful and rich, and difficult. Very often, when I get the data back on a survey, where the enumerators haven’t been really extensively trained, which is often the case – it is a lot of work to train enumerators.

I still remember one of the first enumerator trainings I gave, it was actually in Vietnam where we had people from Cambodia, Laos, Vietnam, all learning how to give this really rather big climate change survey, well it was a baseline survey, it was a big long survey, and we were training these enumerators. We spent a week training the enumerators, making changes to how the questions were formulated, discussing with them how they interpreted it. And even after the week, we knew “that enumerator is gonna get fantastic data”.

There was one enumerator I remember, she was amazing. Instead of filling in the questionnaire as she went along, she’d be taking notes. But the whole thing would just be a discussion and she knew all the questions that she needed to ask. It was a huge questionnaire, but she’d do it as a conversation, and the household size came up there as well. And I still remember listening through the translation of what she was doing, and the household size number changed three times through the course of the conversation. When she first asked the question, they gave one number, and then later they sort of said, oh yes, my nephew lives with us, but he’s at school at the moment, he’ll be back tomorrow. So, he was actually part of the household, but he wasn’t there now, so they didn’t think of him. And then something else happened later on, and it changed again as she actually got the correct picture of the household. Anyway, she was brilliant.

That was the first time I ever gave enumerator training and I had this privilege of, in that project, a whole week was set aside to training the enumerators, getting them familiar, practicing on each other, going out, doing proper interviews with people in the field, learning from that, iterating, improving the questionnaire to suit what they understood. It’s hard work.

And when that hasn’t happened there are always questions within a survey where the enumerator, the data collector themselves, has a big impact and is one of the large sources, if not the largest source, of variability within the answers. It’s a really interesting phenomenon which people don’t think about in advance, but once you know to look for this, you always look for it.

[00:06:58] Lily: Yeah, well, it’s something that I wouldn’t have known until hearing about it from you and Roger and until hearing other people’s experience, it wasn’t something that was, I guess, taught in another way. I didn’t even think of household size as one of those places, it’s a quantitative variable, that’s a numerical variable, they’re usually a lot less, it feels there’s less interpretation when it comes to a quantitative variable than something qualitative where it’s, you know, “how do you feel about?”, or “why is this?”. Okay, I could see there how one enumerator could manage to get more information or they might start up a conversation, someone might feel more comfortable with one enumerator, and then therefore feel more like, “oh, I could tell you the weaknesses that I found” type thing.

[00:07:47] David: I love household size as an example of this because it is unexpected to many people. It’s something where surely that’s a simple thing to get.

[00:07:56] Lily: It’s like age, you know.

[00:07:59] David: Even age, this can be difficult in different cases, there are contexts within which that is difficult, people sometimes don’t know their date of birth. This can be a very highly political thing in certain contexts because age can have implications for all sorts of things. The other thing which relates to this, is “how has your data collector introduced themselves?” One of the ones which is obvious is income. If you think of household income, “how has your data collector introduced themselves?” If the person they’re interviewing thinks that if you are coming as part of a project, which is looking for poor families to support, that’s very different from if you are part of a project which is looking for competent farmers to invest in.

[00:08:52] Lily: Sure.

[00:08:54] David: And if you just take that subtle difference, here you are interviewing lots of families, why are you here, what are you trying to find out, what might I get, will you support me if I don’t have anything and I tell you how poor I am, or will you support me if you think I’m a really good farmer and you are looking for the top farmers to be able to enable them to do more?

Both of those surveys exist in the world. People do these surveys for those reasons. And so, the interviewee can be quite wise to this, and they present a flattering version of themselves. And this is natural, this is human. But how the enumerator, the data collector, presents themselves changes this totally. What do they make you feel? Even more than that, how much are you enjoying the interview if you are bored? When they ask: “Okay, what are the different crops you grow?” “Okay. You grow millet, you grow sorghum, you grow maize.” “No, I don’t grow maize.” “Well, there’s a maize field there.” “No, I’m bored! I wanna go home”. You know, there is a real issue about it, especially with long questionnaires.

[00:10:01] Lily: Yeah.

[00:10:02] David: If you do it in a way which is just tiresome, and where I know, oh, if I say I grow maize, I’m gonna have to go through all of those questions again about that crop, oh, it’s just easier to say I don’t grow maize.

[00:10:15] Lily: Yes. Yeah.

[00:10:16] David: Again, this depends on the enumerator, and so the number of varieties you grow, that being something which depends on your enumerator is wonderful.

[00:10:26] Lily: Yeah, it seems like it’s this kind of a human element where there is a lot of variability that gets into your data, where, in a perfect world, I wanna say in a station experiment, but obviously not, you still have a human element, but you know, if everything is very rigid and to the book, and in a perfect world you would not have these unexplained variabilities in there and everything would be simple.

But we do have that human element and that seems to create so many different variabilities, but for a good reason.

[00:10:59] David: This is exactly where really good studies find ways to get high quality data, where you can make really interesting and valid conclusions. Whereas other studies, you often get left with data, which is almost unusable because the sources of variability are not what you were wanting to study.

[00:11:23] Lily: I see.

[00:11:24] David: And this is why it’s so important, you can invest a lot of time and effort into collecting a lot of data, but if your highest source of the variability is your enumerator, that data is often useless.

[00:11:39] Lily: I see. So then investing into the enumerator training, as you say, well, did you see the results of your week long training?

[00:11:47] David: That was not a unique one, this was a very well funded, very big project, and yes, they did collect reasonably good baseline data. But, there were other issues with the study, in other ways, as there often are, so it didn’t fall foul of that particular problem, but it did fall foul of others.

[00:12:04] Lily: Which I’m sure we can cover in other podcasts.

[00:12:08] David: Yeah, I mean, one of the things in particular was that that survey was just too long. A lot of the data collected in that survey wasn’t reliable because it was so long. And this is something which actually, in future studies around this, those particular researchers did cut back and create shorter [surveys]. And of course if you have a shorter questionnaire, then the need for the extensive enumerator training is maybe less.

This is a reason why you might want to have really highly qualified enumerators, people who really know what they’re doing collecting data, as opposed to just any student who you can get your hands on, which is what a lot of people do.

And so, that idea of a professional enumerator as somebody who’s trained in data collection is a concept which many researchers don’t think about, but, on the ground in these environments, this is something which is worth investing in, worth reusing, worth training. And lots of people do think about this, the stats offices think very seriously about enumerators and enumerator training and all of this. So there are people who are very on top of this.

But in the agroecology research that we do, which is often individual researchers from a university, from a research organization, they don’t have that knowledge generally about the importance and the potential dangers of the data collection process itself for the quality of their data.

[00:13:38] Lily: I see, it’s a very interesting idea, but it’s clearly something that should be considered in your survey as a kind of source of variability.

[00:13:50] David: Well, it’s something where I think there’s two things, and it’s a good way to wrap this up in a sense. We’re presenting this more from the side of when you are analyzing the data, looking for quality control, here are things to look for. I would always, in this case, one of the first questions I’d ask is, where’s your data collector, is that information in your dataset? Luckily, with most digital data collection now, in one way or another, that information is identifiable. Not always, but quite often. And so that would often then be something that I include as part of my data quality control when I go through and I start looking at a data set.

But of course, this could equally well have come in as part of the design process. When you are thinking about how you set up and you design a study, are you thinking about how you’re training your enumerators, and whether the questions that you are asking are unambiguous, so they can’t be misinterpreted in different ways?

And I come back and I’ll finish with the household size. What do you mean by household size? Are you interested in a decision making entity? Are you interested in a social entity, a cooking pot definition? There are many different definitions of household size, which may not apply in my context where I come from, where the nuclear household is something which is pretty well established. But in contexts where you have much wider varieties of households that exist in different forms and different cultures, this is really important.

And actually, a simple variable is often not simple when you dig into it. And this is why you could easily use a whole week on the enumerator training, the training of data collectors, because we asked these questions, we said, “in your context are there ambiguities, how can this be phrased, how are you going to ask this, what are you actually going to get out?” And doing that for hundreds of questions takes time, and you’ve gotta go through and make sure all of them are interpreted, and it’s tiring. Going through that for the whole day, it takes time.

All of this is really valuable research methods, if you want, but it’s sometimes overlooked, and so it can be part of your design, your preparation of the study, but it’s also something to be aware of when you are analyzing the data.
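
The quality-control step David describes, first checking that the data collector is recorded in the dataset and then looking at the answers collector by collector, could be sketched as follows. This is only an illustration under stated assumptions: the field names `enumerator_id` and `household_size`, the function name, and the rows are hypothetical, and a real check would cover many more questions than one.

```python
from collections import defaultdict
from statistics import mean


def summarise_by_enumerator(rows, enumerator_field="enumerator_id",
                            value_field="household_size"):
    """Per-enumerator count, mean, min and max of one numeric answer.

    Hypothetical helper: `rows` is a list of dicts, one per interview.
    """
    # First question: is the data collector recorded for every interview?
    missing = [i for i, row in enumerate(rows) if not row.get(enumerator_field)]
    if missing:
        raise ValueError(
            f"{len(missing)} record(s) have no '{enumerator_field}'"
        )

    # Then group the answers by enumerator and summarise each group.
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[enumerator_field]].append(row[value_field])

    return {
        enum: {"n": len(vals), "mean": mean(vals),
               "min": min(vals), "max": max(vals)}
        for enum, vals in grouped.items()
    }


# Made-up rows for illustration only.
rows = [
    {"enumerator_id": "E1", "household_size": 4},
    {"enumerator_id": "E1", "household_size": 6},
    {"enumerator_id": "E2", "household_size": 12},
    {"enumerator_id": "E2", "household_size": 14},
    {"enumerator_id": "E2", "household_size": 13},
]
summary = summarise_by_enumerator(rows)
for enum, stats in summary.items():
    print(enum, stats)
```

Enumerators whose summaries sit far from the others are candidates for a closer look, for example by asking how they phrased the household question.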

[00:16:09] Lily: Excellent. Thank you very much, very fascinating insight on other kinds of ways that this variability comes in.

[00:16:17] David: Thank you. This has been fun.

[00:16:19] Lily: Thank you.