219 – Factors in Statistics

The IDEMS Podcast

Description

How can we transform complex data into understandable information? In this episode, Lily and David discuss the concept of factors in data analysis. They consider the historical context of factors, their importance in grouping data, and how they revolutionise statistical thinking.

Transcript

[00:00:07] Lily: Hello and welcome to the IDEMS podcast. I’m Lily Clements, a Data Scientist, and I’m here with David Stern, a founding director of IDEMS.

Hi, David.

[00:00:14] David: Hi Lily. How are you doing?

[00:00:15] Lily: Yes, yes, yes. I’m very well. And I thought today we could talk about factors.

[00:00:21] David: Oh, factors good, yes. This is one which has been a long time coming in some ways.

[00:00:26] Lily: Yes. So I guess this is kind of more on the data side today as a podcast between us, and for me, I’ve been told that factors are important, I’ve heard of the historical nature of factors and how we used to not have factors or they weren’t seen as different, and now they are. But for me, I guess, oh, I don’t know how to phrase it.

[00:00:50] David: You take them for granted because they’re just part of what you do in statistics and how we think of things.

[00:00:57] Lily: Well, yes, yeah. I take them completely for granted and I take for granted all of the, I guess as well, the efforts that have gone in by other people on using factors well.

[00:01:07] David: And I think the key thing is that, in some ways, from some perspectives, many people don’t know about or recognise or value factors. In SPSS for example, they still think of categorical data, this is something where the concept of a factor doesn’t really exist and it isn’t really needed in many people’s minds.

[00:01:36] Lily: So I believe what they do in SPSS, but please correct me if I’m wrong, is it’s kind of labelled numerically.

[00:01:42] David: Exactly.

[00:01:43] Lily: So instead of being man and woman, which is what we mean by factor, sorry, by factor we mean having those columns in your data, those variables, which are kind of categories.

[00:01:55] David: They’re groups, it’s a form of grouping of your data.

[00:01:58] Lily: Groups of course. So instead of having kind of male female, you’ll have one, two, and that will just be an indicator for if they are.

[00:02:07] David: Absolutely. And they use a numerical, categorical variable. And so you can then use that variable just as you would use a factor in different ways. This is what people did before they had the concept of a factor. But the concept of a factor is very powerful in a few simple ways.
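A minimal sketch of the difference described here, using only the Python standard library. The data, the 1/2 coding and all variable names are hypothetical, chosen purely for illustration; they are not taken from SPSS or from any dataset mentioned in the episode.

```python
from collections import Counter
from statistics import mean

# Hypothetical column stored as numeric codes: 1 = male, 2 = female.
gender_codes = [1, 2, 2, 1, 2]
# The codebook that is needed to interpret those codes.
gender_labels = {1: "male", 2: "female"}

# Nothing stops arithmetic on the codes, even though the result means nothing.
meaningless_average = mean(gender_codes)

# Treating the column as a grouping instead: decode to labels, count per group.
gender_factor = [gender_labels[code] for code in gender_codes]
group_sizes = Counter(gender_factor)

print(meaningless_average)  # an "average gender" of 1.6
print(group_sizes)          # counts per group: 2 male, 3 female
```

The numeric column will accept any numeric operation, while the labelled grouping only invites operations that make sense for groups, such as counting.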

[00:02:28] Lily: Okay.

[00:02:29] David: And one of the ways I love to think about a factor is, if you think of a factor as just being a grouping of your data, then one of the ways you can think of it is you can think of it almost like having these pieces of data, which are separate, which you just happen to have put together one on top of the other. And that’s a sensible approach to just thinking about how when you have a factor, what you actually have is you have data which is grouped, and it’s separate. It’s just you can treat it the same because those groups, those groupings, you have the same information about all the elements.

[00:03:14] Lily: So your kind of group might be a different variety of crops data, and so then you’ll have certain, I guess, attributes or you might have certain attributes for one variety over another.

[00:03:26] David: Absolutely.

[00:03:27] Lily: Just checking I’m following.

[00:03:28] David: Yeah. And so, well, let’s say for example, you might say you want to group the short season varieties and the long season varieties separately.

[00:03:40] Lily: Okay. Yeah.

[00:03:41] David: This would be a sort of grouping of your data and you don’t really want to compare short season varieties with long season varieties because short season varieties, they’re useful when you only have a short season, whereas long season varieties are useful when you have a long season. You’re not expecting short season varieties to necessarily compete with a long season variety if you have a long season.

But if you only have a short season available, then you don’t want to consider the long season varieties ’cause they’re likely to fail. These are used in different contexts, so grouping them, now you can do better comparisons, you can do comparisons within a group, maybe better than you can across those groups in this particular context.

So this is one of the things that you can use the grouping for. You can use the grouping to be able to say, I only want to consider the short season varieties because I would like to know amongst the short season varieties, because that’s all I can use in my context, which one will serve me best for the needs I have. That means you are able to use the grouping to restrict yourself to the information that you care about.
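Restricting to one group and comparing within it could look roughly like the following sketch. The variety names, the yield figures and the tuple layout are all made up for illustration.

```python
# Hypothetical trial records: (variety, season group, yield in tonnes per hectare).
trials = [
    ("A", "short", 2.1),
    ("B", "short", 2.6),
    ("C", "long", 4.0),
    ("D", "long", 3.5),
]

# Use the grouping to restrict to the level that matters in this context...
short_only = [row for row in trials if row[1] == "short"]

# ...and then compare only within that group.
best_short = max(short_only, key=lambda row: row[2])

print(best_short)  # ('B', 'short', 2.6)
```

Variety C has the highest yield overall, but it never enters the comparison, because it belongs to a group that cannot be used in a short season.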

[00:04:52] Lily: Nice. Okay. That makes sense. So that’s, I guess, one important thing that we can get from factors.

[00:04:58] David: But of course, the other important thing, and you use the example of gender, male and female, you could sometimes find, well actually, I don’t want to consider, let’s say a hundred metre race times for men and women together, I would like to consider those separately. And that would be an example, like the short season variety and long season variety.

Or you could take gender and you say, actually no, I want to compare men and women because I don’t want there to be gender inequality in terms of salaries, for example. And then you are using the factor to be able to compare across groups.

[00:05:38] Lily: You can compare within the groups in terms of comparing a hundred metre running times for a male cohort and a female cohort, or you can compare between your groups the average, I don’t know, the average yield for this variety versus the average yield for that variety.

[00:05:56] David: Or the average salary. But I wouldn’t just want the averages here because it might not just be about averages, it might be about looking at the distribution of salaries of men and women.

[00:06:08] Lily: Sure.

[00:06:08] David: So it might be you want to compare the average salaries, but it might be that you want to actually look at the distribution more generally.
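To illustrate why averages alone may not be enough when comparing across groups, here is a small sketch with invented salary figures (in arbitrary units). The numbers are deliberately constructed so that one outlier dominates one group; they are not real data.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (gender, salary) records.
records = [
    ("female", 30), ("male", 32), ("female", 34),
    ("male", 31), ("male", 90), ("female", 35),
]

# Split the salaries by level of the factor.
by_gender = defaultdict(list)
for gender, salary in records:
    by_gender[gender].append(salary)

# Summarise each group with more than just the average.
summary = {
    gender: {
        "mean": mean(salaries),
        "median": median(salaries),
        "min": min(salaries),
        "max": max(salaries),
    }
    for gender, salaries in by_gender.items()
}

for gender, stats in summary.items():
    print(gender, stats)
```

In this made-up example the means differ a lot (33 against 51) while the medians are close (34 against 32), which is the kind of detail that only shows up when looking beyond the average.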

But, at the heart of this is this idea that once you actually have the concept of a factor, it changes how you think about your data in general, because you don’t need to worry about a lot of different, well, you don’t need to worry about how it’s coded up. You might want to code up your men and women as zero and one or one and two, but there’s a difference in that.

If you’re doing this as numerical in different ways, then depending on what you actually use that number for, this might actually matter in certain ways. What did you choose? Did you give an order to them? All of these things are important decisions. And with factors, each of those decisions is a conscious choice. Ordering, an ordered factor is a type of factor. This is a factor which has an order. If you give a number associated to it, then you are automatically giving an order, even if that order shouldn’t exist.

So if you have male and female, you might not want there to be an order, so that would just be a factor. Whereas if you wanted, let’s say short season, long season varieties, you might want that to be ordered because you might want to, let’s say, eventually have short season, mid-season, long season, and you actually recognise there was an order to those three, and you want to be able to maintain that order, you don’t want the midseason to be out at one extreme. So that order is important.
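One way to picture an ordered factor is as a set of labels plus an explicitly declared level order. The sketch below assumes a hypothetical history in which "mid" was added after "short" and "long" had already been coded as 1 and 2, so the numeric codes imply an order that should not exist.

```python
# Hypothetical legacy coding: "mid" was added last, so it received code 3.
legacy_codes = {"short": 1, "long": 2, "mid": 3}

# The ordered factor: the level order is an explicit, conscious choice.
SEASON_LEVELS = ["short", "mid", "long"]

observed = ["long", "short", "mid", "short"]

# Sorting by the numeric code pushes mid-season out to one extreme.
by_code = sorted(observed, key=legacy_codes.get)

# Sorting by the declared level order keeps mid-season in the middle.
by_level = sorted(observed, key=SEASON_LEVELS.index)

print(by_code)   # ['short', 'short', 'long', 'mid']
print(by_level)  # ['short', 'short', 'mid', 'long']
```

For an unordered factor such as male and female, there would simply be no declared level order, and sorting or "greater than" comparisons would not be offered at all.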

So what you are actually doing, and this is what good mathematics and good statistics does, is you are explicitly abstracting out what are the things that matter in different contexts. What can I do when I have an ordered factor, which I can’t do when I have an unordered factor, just like men and women? And so being able to ask that question requires you to sort of just have that detail of exactly what type of factor do you have, what is its nature? And if you think about it and you are rigorous in how you use factors and using them well, then they’re extremely powerful in so many different ways.

The long season short season example is very nice because you might just have a variable which is season length. So it might be that your categorisation of season length into a factor is a choice you are making. And it could be that you want it to be into a two level factor, short season, long season, a three level factor, short season, mid-season, long season, or more. Who knows?

Because it might depend on the data you’ve got, what’s actually useful. And that’s where you can actually understand if you’re going from a continuous variable like season length to a discrete variable, which is a categorisation of that. Recognising that this is a factor enables you to understand what is the grouping that corresponds to it, and the fact there was a choice involved.
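The choice involved in turning a continuous variable into a factor can be made visible in code. In this sketch the season lengths, the cut points (110 days, or 95 and 130 days) and the helper name `categorise` are all hypothetical.

```python
from bisect import bisect_right

def categorise(value, cut_points, labels):
    """Map a number to a label; cut_points must be sorted ascending."""
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    # bisect_right counts how many cut points are <= value.
    return labels[bisect_right(cut_points, value)]

# Hypothetical season lengths in days (a continuous variable).
season_days = [85, 100, 120, 150]

# The same data categorised two ways; the cut points are the analyst's choice.
two_level = [categorise(d, [110], ["short", "long"]) for d in season_days]
three_level = [categorise(d, [95, 130], ["short", "mid", "long"]) for d in season_days]

print(two_level)    # ['short', 'short', 'long', 'long']
print(three_level)  # ['short', 'mid', 'mid', 'long']
```

The same four measurements fall into different groups depending on the cut points, which is exactly the decision that stays hidden if only the resulting categories are stored.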

[00:09:31] Lily: Also on, I guess, where factors can differ to numeric variables.

[00:09:36] David: It’s not that you shouldn’t be able, if you want to associate numbers to your factors, you know, you could have a weighted factor, and that might be actually the numbers you are wanting to associate. So there’s different ways you could associate numbers to factors.

[00:09:52] Lily: And then I suppose that there’s also the point of, within a factor, they have order, but then there’s also magnitude. So, factors, if you are looking at, say, shoe size or age, then you can tell, you know that 11 and 12 is the same distance as 12 to one. But if we are comparing, say your short season and long season, and you have in there midseason, well, maybe they would be equally distanced between them.

[00:10:22] David: It might not be.

[00:10:23] Lily: They might not be.

[00:10:24] David: I think when you were saying it, you’re saying the difference between one and two is probably the same as the difference between 11 and 12.

[00:10:30] Lily: Yes, sorry.

[00:10:31] David: So these are questions that when you have a numeric variable, you actually really care about those details, because it really matters what you can do and what you can’t do. And with a factor, all of that information, the nature of those magnitudes, the nature of these differences, whether there’s an ordering and whether there’s an associated number or weighting, you can actually make all of these things explicit.

So it’s adding rigour in a way which enables you to be more precise in your language. In many ways it doesn’t matter. But in some specific ways, it really matters. I believe one of the really important advances that this creates is an opportunity to simplify language around things like tables and graphs, descriptives, even other statistical procedures.

By having the concept of a factor, you can precisely state what is the nature of the factor you are actually needing for this statistical process or this visualisation, table or graph, and you can use the nature of that factor to define which factors you can use this method for.

And that’s extremely powerful in terms of actually enabling certain processes to be simplified. By having that rigour in your definition of factors, you can simplify afterwards the use of them to make sure they’re used correctly, because you cannot use certain procedures unless you have certain types of factors.

Again, from a conceptual perspective, and what I don’t think we’ve done well enough is to actually say, well, how can we use this to teach elements of data analysis more easily? How could we actually use the concept of factors to make it really easy for people to understand, well, when you are summarising, what are you actually doing?

If you are summarising and you take a summary, let’s say the average or a total or a count at a level of a factor, you are going from one level of data to another level of data. But what’s so powerful, of course, is now if you think of this in database terms, you just have two linked tables.

You have data at, let’s say, the household level, and you have data at the individual level. And you’ve just summarised all the individuals in the household to the household level. You might want the count of how many individuals were there in that household. You might want to know what was the average age to know is it a young household or is it an old household? An average of the age is absolutely sensible for that. You might want what is the total income, if there’s salaries coming in from members of the household and you have a total income, that gives you an idea of the purchasing power of that household.

So these are different summaries, simple summaries, count, average, total, very simple summaries, but they relate to using factors to be able to take data from one level to another.
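The three summaries mentioned here (count, average, total) can be sketched as follows. The individual-level records, household identifiers and field names are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual-level data; "household" is the factor.
individuals = [
    {"household": "H1", "age": 34, "income": 1200},
    {"household": "H1", "age": 6, "income": 0},
    {"household": "H2", "age": 71, "income": 400},
    {"household": "H2", "age": 68, "income": 300},
    {"household": "H2", "age": 41, "income": 900},
]

# Group the individuals by level of the household factor.
members = defaultdict(list)
for person in individuals:
    members[person["household"]].append(person)

# Summarise from the individual level up to the household level.
household_summary = {
    household: {
        "count": len(people),
        "mean_age": mean(p["age"] for p in people),
        "total_income": sum(p["income"] for p in people),
    }
    for household, people in members.items()
}

for household, row in household_summary.items():
    print(household, row)
```

The result has one row per household rather than one row per person, so it is effectively a second table at a different level of the data.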

And, of course, all of this is possible if you don’t have factors, but if you think of factors as being this key to navigating multi-level data in different ways, if you have a factor, then you might have data at each level of the factor. Now, you could get that through summaries, or it could just be that you also collected data at the household level, as well as collecting data at the individual level.

And so the household factor within the individual data is a categorisation, of course, that’s what it’s doing, it’s grouping, but it’s also a way of linking multiple levels of data. And it’s so natural. And this actually simplifies complex data, many of these surveys have many, many different levels of data. But if you think of them in terms of just the factors as helping you to navigate from one level to another, it becomes very natural. And factors then become the key. So you have data at a level which has a factor, and then you also have data for each level of the factor, which is data at another level.
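The idea of the household factor acting as a link between two levels of data can be sketched like this. Both tables, and the fields `village`, `roof`, `name` and `age`, are hypothetical.

```python
# Hypothetical household-level table, keyed by the levels of the factor.
households = {
    "H1": {"village": "North", "roof": "tin"},
    "H2": {"village": "South", "roof": "thatch"},
}

# Hypothetical individual-level table; "household" is the linking factor.
individuals = [
    {"name": "Person 1", "household": "H1", "age": 34},
    {"name": "Person 2", "household": "H1", "age": 6},
    {"name": "Person 3", "household": "H2", "age": 71},
]

# Every level used at the individual level should exist at the household level.
missing = {p["household"] for p in individuals} - set(households)

# Follow the factor from each individual to the matching household row.
linked = [{**person, **households[person["household"]]} for person in individuals]

for row in linked:
    print(row)
```

Each linked row carries the household-level information alongside the individual-level information, and the only thing needed to navigate between the two tables is the factor itself.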

It’s interesting that this isn’t something which tends to be taught or thought about, and this is something where, over the next few years, we should be trying to make educational resources which bring out these ideas and actually simplify them. I’m conscious that our presentation here may not have simplified the idea of a factor, but I do hope, at least for some people, it’s given a different view on what is a very simple and yet very powerful statistical concept.

[00:16:01] Lily: Yeah. Yeah, absolutely. And just I guess drawing out the fact that factors are a discussion to have.

[00:16:09] David: Yeah, absolutely.

[00:16:11] Lily: No. Great. Thank you very much, David. This has been very insightful as always.

[00:16:16] David: Well, thank you. Nice talking to you and look forward to our next episode soon.

[00:16:21] Lily: Yes, me too. Thank you.