239 – Converting Open Statistics Textbooks for Software-Agnostic Learning – IDEMS International Community Interest Company (CIC)

The IDEMS Podcast

February 24, 2026

Description

What if educational resources could be available in a limitless variety of variants, each adapted to the tools students actually use? In this episode, David talks to Lily about a project she has been working on to convert open statistics and data science textbooks into the PreTeXt format. The discussion highlights why PreTeXt’s semantic structure and separation of authoring from publishing enable systematic changes across a book, supporting making software-specific and software-agnostic variants to tailor the books to various contexts where different approaches are more valuable.

Transcript

[00:00:07] Lily: Hello and welcome to the IDEMS podcast. I’m Lily Clements, a data scientist, and I’m here with David Stern, a founding director of IDEMS.

Hi, David.

[00:00:15] David: What are we discussing today?

[00:00:17] Lily: I thought we’d discuss these textbooks.

[00:00:19] David: Oh, yes. I’m so excited by your work on this.

[00:00:25] Lily: Yes. Yeah, well, I was gonna say more specifically, I mean, because “textbooks” is very broad. So there’s statistics PreTeXt textbooks for now.

[00:00:32] David: Yeah. And this is work that you’ve just got stuck into on the back of, of course, other work which is happening within IDEMS, but you’ve worked like a train on this. This has been great fun.

[00:00:46] Lily: Well that’s, that’s again thanks to the robots. Something that wouldn’t have been possible in this way at all a couple of years ago. Let’s firstly say what it is, and that is that we are taking open source textbooks.

[00:01:00] David: There’s always confusion around licensing, and I love the Creative Commons licensing, as people have discussed in the past, and I actually discussed this on another episode recently. But let’s be clear, Creative Commons is a really simplified way of thinking about licensing.

We have Creative Commons Zero, Creative Commons BY, Creative Commons ShareAlike, Creative Commons NonCommercial, Creative Commons NoDerivs. And I’m not going to go into the details of that, we’ve done that in previous episodes, but I will draw two lines. The “open” line is Creative Commons Zero, Creative Commons BY and Creative Commons ShareAlike. Those are all what would be considered and widely accepted as open licenses.

We accept one further license, which is the NonCommercial license. And this is, as I keep saying in many episodes, this is a very misunderstood license, but broadly it means that the authors or the people who hold the license are retaining the commercial license alongside the almost open license, so that they can reissue commercial licenses for commercial use. So for non-commercial use, it is open, and for commercial use you need to go back to the providers and they can issue a commercial license. And that’s absolutely fine for us, for our purposes.

And we don’t use NoDerivatives licenses, because they do not allow us to create variants. So that’s a no-go for us: the heart of our work is all about variants. And that excludes what would otherwise be some really exciting textbooks, such as the Hadley Wickham textbook.

Really annoying that that’s under a non derivatives license. But anyway, it’s not that it’s annoying. This is a sensible license from other perspectives. It is just not enabling variants, and that’s really what we are going to talk about today, why it’s so exciting that all these other [textbooks] we can build into a system whereby we can create variants of them.

[00:03:13] Lily: And that, I guess, is kind of what I’ve been referring to as phase one. Not that the phases have been carefully drawn out, but phase one is: I’m spending a month on converting these open textbooks, which are under those licenses of up to non-commercial, on converting them into a PreTeXt format.

[00:03:37] David: We have talked about PreTeXt in a number of other episodes, but it’s worth repeating. PreTeXt is really exciting and powerful because it’s a semantic language. What do we mean by this? I think the easiest way I have of explaining this (and George, who you know and get on very well with, will tell me off for saying this) is that it has metadata associated; he would want to call it “power data”. Power data is a special kind of structured metadata. But anyway, let’s just say that, because it is semantic, you actually have appropriate metadata associated with the elements that the author has created, so that you understand what they mean. What do I mean by this? For example, your “code blocks”: anything which is code is recognized as code, and you can then work on all of the code blocks systematically across the whole textbook rather than having to do things individually for each of them.

Now, this is sensible if you want templating. In PreTeXt, it is this rigorous separation of the author’s concerns from the publisher’s concerns which enables the text to be republished in different ways, in different forms. A lot of the textbooks you are converting from are coming from things like R Markdown, Quarto, LaTeX. All of which are really nice systems. I’ve used them all in the past. We like them. We are not against these, but none of them are semantic in this sense. And that means that none of them rigorously separate the author’s and the publisher’s concerns. All of them do separate author and publishing concerns to “some extent”, which is why the conversion is so easy.

They’ve already done a lot of the work, and so there’s just a small amount of extra work, which is the conversions, and as you put it, the robots are doing quite effectively.
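To make the point about working on every code block at once more concrete, here is a minimal Python sketch; it is not the project's actual tooling. It assumes code is marked up with `program` elements that carry a `language` attribute and an `input` child, which is how PreTeXt marks up program listings as far as we understand it. The sample book and the function name `list_code_blocks` are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample source; a real PreTeXt book is far larger and richer.
SAMPLE = """
<book>
  <chapter>
    <title>Summarising data</title>
    <p>We start by computing a mean.</p>
    <program language="r"><input>mean(heights)</input></program>
    <p>Then we draw a histogram.</p>
    <program language="r"><input>hist(heights)</input></program>
  </chapter>
  <chapter>
    <title>Fitting models</title>
    <program language="python"><input>model.fit(X, y)</input></program>
  </chapter>
</book>
""".strip()


def list_code_blocks(xml_text):
    """Return (chapter title, language, code) for every program element."""
    root = ET.fromstring(xml_text)
    blocks = []
    for chapter in root.iter("chapter"):
        title = chapter.findtext("title", default="(untitled)")
        for program in chapter.iter("program"):
            code = (program.findtext("input") or "").strip()
            blocks.append((title, program.get("language", "unknown"), code))
    return blocks


blocks = list_code_blocks(SAMPLE)
# Count how many code blocks the book has per language.
language_counts = Counter(language for _, language, _ in blocks)

for title, language, code in blocks:
    print(f"{title} [{language}]: {code}")
```

Because the markup says which elements are code, a single query finds all of them across chapters. In a format where code is only distinguished by how it looks on the page, the same inventory would need format-specific parsing.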

[00:05:36] Lily: Yes. Yeah. I turn to the robots, I say, “Hey, robots!” We do a chapter at a time, sometimes less. I don’t want to overwhelm the little robots. I was initially just kind of checking, checking just in general that it has been rendered, but now I find myself more and more checking, a lot more rigorously. I’m just like, scrolling, scrolling, just checking. Yep. That paragraph, that paragraph, that paragraph, all looks the same between the Quarto version or the R… the original version, the author’s version, and our version.

[00:06:09] David: Yeah.

And the key is that of course, as you say, all this is just phase one, but once you have this in a semantic form, my vision – and I’ve actually started discussing this with the PreTeXt teams – is that we recognise that it’s not that big a job to go back, and so we would really like to see the PreTeXt version being the source of the truth.

But you can publish to a Quarto: it should be an absolutely sensible publishing process that you take your PreTeXt textbook and you publish it as a Quarto textbook. So, you are not losing what the authors currently have. You’re just saying that there is actually a semantic version with better metadata, which enables you to also publish it in other ways. Now a Quarto document can already be published, but there’s all sorts of different things. There’s a web document, there’s a PDF and all. And it does these very nicely.

So, it is not that we are actually saying that Quarto isn’t good. Quarto is great. I love it, for specific things, whereas the PreTeXt as we see it, this is what’s going to enable us to do things systematically at a higher level, and we can come back to that in a minute with the integration of things like STACK exercises.
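The idea of publishing from a semantic source to Quarto can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input uses a tiny subset of PreTeXt-style elements (`title`, `p`, and `program` with an `input` child), the function `to_quarto` is hypothetical, and a real converter would have to handle far more structure (exercises, figures, cross-references, and so on).

```python
import xml.etree.ElementTree as ET

# Three backticks, built this way so the sketch can itself sit inside a fence.
FENCE = "`" * 3

# Invented sample section in a PreTeXt-like subset.
SAMPLE = """<section>
  <title>Summarising data</title>
  <p>We start by computing a mean.</p>
  <program language="r"><input>mean(heights)</input></program>
</section>"""


def to_quarto(xml_text):
    """Render a tiny PreTeXt-like section as Quarto-style markdown."""
    root = ET.fromstring(xml_text)
    parts = []
    for element in root:
        if element.tag == "title":
            parts.append(f"## {element.text.strip()}")
        elif element.tag == "p":
            parts.append(element.text.strip())
        elif element.tag == "program":
            language = element.get("language", "r")
            code = (element.findtext("input") or "").strip()
            # Quarto marks executable cells with the language in braces.
            parts.append(f"{FENCE}{{{language}}}\n{code}\n{FENCE}")
    return "\n\n".join(parts) + "\n"


quarto_text = to_quarto(SAMPLE)
print(quarto_text)
```

The direction matters here: the semantic source carries at least as much information as the Quarto output needs, so publishing to Quarto is a rendering step rather than a second copy of the book to maintain.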

[00:07:27] Lily: Yes. Yeah, absolutely. And well, and at the moment, you know, Quarto is, as you say, fantastic. And it has features at the moment, which we don’t get in PreTeXt yet.

[00:07:39] David: Yet.

[00:07:40] Lily: Yet. Such as in Quarto, you can render R code.

[00:07:45] David: You say you can’t do this in PreTeXt yet. I actually had a discussion about this to say that we just need to add this to PreTeXt, and actually it’s already there. PreTeXt can do this through Sage, and R is part of Sage, so it should be possible to do this using the Sage cell.

Okay, we’re getting stuck into detail that we haven’t had a chance to discuss yet. But, however we look at it, it’s not a heavy lift to get that rendering to be able to happen, it is just about us learning. And we need to put you in touch with Rob Beezer, who’s the guru behind the Sage cells in PreTeXt.

But in theory, that’s actually all achievable. And what’s really exciting, of course, is that once we have this in PreTeXt, you know what you can then do? And this is where the variants come in, you can say “Okay, what happens if we translate all the R to Python?” And we now get a Python version of the same book.

These are the things which get really exciting and this is what’s then possible. It’s the books. Now we recognize that in this moment, if you look in the data science sphere in statistics, you look at the open textbooks, and most of them are software or language specific.

You’ve got one, which is a jamovi textbook. I love jamovi, they’re great.

[00:09:09] Lily: Well, and just to say on that jamovi one. That jamovi one is a version of an R one. So they’ve already kind of done that in a way, by taking this R textbook, this textbook on an introduction to statistics with R, and they’ve created their own jamovi version. But I guess what we’re trying to do is to make that process even easier.

[00:09:30] David: Exactly. We’re trying to not just make that process easier. But actually, say with statistics or with data science textbooks, having software agnostic versions where a student could go through and do it, with whichever software they want to use, or language, and having software specific variants are both needed. This is a classic example where it’s not that one software is better than the other, or that having something which is software agnostic is better than having something which is software specific. It’s that these are all needed.

And so, instead of considering a textbook as having a single variant, we are having multi-variant textbooks where a student could say, “I just want to use R”, or “I just want to use jamovi”, or, of course from our side, R-Instat, whatever it may be. That student can go through and they can have a specific textbook related to this tool they are using. But, in the same class, different people could be using the same textbook with different tools. Or a teacher or lecturer might decide in their class, they want everyone to use the same tool.

These are choices a lecturer, a teacher could make. But the fact that textbooks exist with these multiple variants, this is something which we believe wasn’t really technically feasible until recently, and now we believe it is, and that’s something we are really excited about. So this is what you would probably consider your phase two piece, or maybe that’s phase three.
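As a rough illustration of how software-specific and software-agnostic variants could come from one source, here is a hedged Python sketch. Everything in it is an assumption made for the example: the `TRANSLATIONS` table stands in for whatever process (human or automated) produces the equivalent code or tool-neutral wording, and `build_variant` is a hypothetical helper, not part of PreTeXt.

```python
import xml.etree.ElementTree as ET

# Invented sample section with one R code block.
SOURCE = """<section>
  <title>Summarising data</title>
  <p>Compute the mean height.</p>
  <program language="r"><input>mean(heights)</input></program>
</section>"""

# Hypothetical lookup: for each R snippet, the text to use in other variants.
TRANSLATIONS = {
    "mean(heights)": {
        "python": "statistics.mean(heights)",
        "agnostic": "Compute the mean of the heights using the tool of your choice.",
    },
}


def build_variant(xml_text, target):
    """Return the source with every known R block rewritten for `target`."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != "program" or child.get("language") != "r":
                continue
            r_code = (child.findtext("input") or "").strip()
            new_text = TRANSLATIONS.get(r_code, {}).get(target)
            if new_text is None:
                continue  # no entry for this target: keep the original block
            if target == "agnostic":
                # Replace the code with a tool-neutral instruction.
                replacement = ET.Element("p")
                replacement.text = new_text
            else:
                replacement = ET.Element("program", {"language": target})
                ET.SubElement(replacement, "input").text = new_text
            replacement.tail = child.tail
            parent[index] = replacement
    return ET.tostring(root, encoding="unicode")


r_variant = build_variant(SOURCE, "r")
python_variant = build_variant(SOURCE, "python")
agnostic_variant = build_variant(SOURCE, "agnostic")
print(agnostic_variant)
```

The surrounding prose is shared by all three variants; only the elements identified as code change. That is what lets a lecturer pick a variant per class, or per student, without the book being forked.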

[00:11:11] Lily: I am not sure yet what phase two or phase three is. I mean, I know the next steps, but I’m at the moment just spending no more than a month on translating. And, just to make it clear, I’m not sitting here just translating textbooks. I’m giving them to the robots and I’m doing my own work.

[00:11:30] David: Exactly. That’s what’s so exciting. It is that you have been doing this so efficiently with it not taking up that much of your time. And this is what makes this all possible now. This is something which in the past, to get anywhere close to where you’ve got in a month, on your own, on the side, would’ve been maybe a multimillion pound project, this would’ve been a huge human effort. Right now, what we’re finding is that actually we can do elements of that through using, as you put it, the robots.

[00:12:08] Lily: Well, well, yes. You say in a month, it’s only been 12 days so far and we’re on book 10 and 11. So one to nine are done.

[00:12:17] David: Yeah. And this is where we’ll see where you get to by the end of the month, and we’ll see what else we get in there. And, of course, we are not going to say that all the open stats or data science textbooks will be done within a month or whatever it is, but we do expect to have a lot of the major ones, not all of them, but a lot of the major ones that we think are really good case studies to have a look at and, things that we like. And we’d be very open if there are listeners out there, you have your favorite open stats textbook and it’s not on our list, send it to us, put it in the comments and let us know and, if it’s open under the right licenses, we’ll add it to our list and keep extending this.

So it’s not something we see as a sort of a fixed set. But the exciting part, and the reason this is something I’ve been wanting for so long in some form, is that this is all part of a broader effort to improve data education and statistical literacy by actually removing barriers.

One of the barriers, which I get so frustrated by, is the software barrier. And so there are other things we’re trying to do this for, but we’ve already talked about this fact that I really want software agnostic variants of these textbooks, because I want to separate out the learning of the tool from the learning of the concepts. And that is something which in so many contexts has got mashed together, not for bad reasons, for genuine academic debate on “is it better for people to learn the concepts with the tool or through the tools or independently of the tools?”

Now, I don’t know the answer to that, I sit in the circles where these things are debated and I’ve heard both sides of this. What I do know is: right now, good software agnostic textbooks, they don’t really exist. Most of the good textbooks that I really like are with R and they tell you why they’re with R. Or that one, as you say, with jamovi, they’ve made their own variant to say “You don’t need R, you can do it with jamovi!”

Great. But it’s confusing the statistical concepts with the tool you are using, and I believe that if we can create software agnostic variants, then it would benefit some classrooms. And these are classrooms that we go into. You’ve taught, as I have, at AIMS, that is the African Institute for Mathematical Sciences. Going into those classrooms, there is such a diversity of skills and of students, I always teach in a software agnostic way. And, for that audience, I am convinced that’s correct. I’m not saying it’s correct for every audience, but I’m saying for some audiences it is correct.

[00:15:32] Lily: And just to add, that doesn’t mean that you, as the teacher, lecturer, facilitator, or whatever role you play in this, that doesn’t mean that you need to know every single software.

[00:15:43] David: No. On the contrary, one of the things that I’ve loved sitting in that classroom is that you have your real Python experts who are just, everybody sees them as these amazing people who have skills beyond the mere mortals, who are sort of actually saying “Well, I’m sorry, I’m just gonna start in Excel.”

And I love it when your guys who start in Excel actually get further in the data analysis faster. Because it’s the right tool for the job, for that particular thing, and suddenly the power dynamics in the room change as people recognise that different tools are more efficient and more effective for different things. And it’s not that one is better, it’s that, yes, it is great to have good Python skills, but different tools enable you to do different things more effectively, more efficiently. And so, you know, that’s one of the things I love about this software agnostic class, is that suddenly you get this respect growing amongst the students for each other and respecting that everybody has different skills.

And actually, I love it when your Python experts eat humble pie and then start using a spreadsheet and find, “wow, this is really good for that, and then I now need Python for this”. And so you have that fantastic power dynamic which changes amongst the students. It’s beautiful when it happens.

So, I love a software agnostic classroom where they’re taught in a very problem solving approach. We have a course on problem solving in data science that you have taught as well as I have. But it is just a matter of great joy to me to see those dynamics playing out in real time amongst the participants.

[00:17:20] Lily: Yeah, no, absolutely. And, so this software agnostic is, as you say, what we are trying to get into these textbooks.

[00:17:31] David: So it’s the different variants. This is a bigger picture idea, which is at the heart of what IDEMS is doing, it’s at the heart of what we’ve recognised. But, let’s stick to the statistics textbooks because it’s this wonderful example and particularly there’s other aspects to this.

These software variants and these software agnostic variants, I’m convinced that there is debate about whether it is better to teach in a software agnostic way or not. When I say software agnostic, let me be clear, in a statistics or a data science course, I mean: are you teaching, let’s say, data science with R, data science with Python? Or are you teaching data science? That’s a simple question.

And if you are teaching data science and you have students who are Python experts and are R experts, are you forcing them to use the tool you are comfortable with, or are you allowing them to use the tool that they are comfortable with and to explore and grow in the direction that they have?

And I’ve chosen R and Python, but there are many other tools, some of which are commercial, some of which are open, and all sorts of ways. But I think R and Python in terms of open source languages is a really good debate because they are both very good and they serve different purposes and they overlap a lot.

And so, I believe, if you have a data science course, there is value to having a data science course with Python, and to actually frame it in that way and to tie it together in that way, and you then are actually building the technical skills of how to use Python to do the things you’re wanting to teach. And there is strong pedagogical evidence, which is emerging from many studies, that such courses do lead to interesting student learning in certain contexts.

So, I believe that such courses are valuable. Not just with Python, but with R or with any other software as well. So software specific courses and experiences where all the students are learning and using the same thing, they help each other, they build their skills up, there is value to that.

But I am also convinced – and I love teaching, these are the courses I love to give – in a course, which is data science, and it is software agnostic, so you can use whatever tool you want and different people in the class can use different tools and they can learn from each other what they do in the different tools, and they can have that sharing and that cross learning, and that plays a really important different role.

Now, I should clarify that, actually, in my experience, there are different audiences for whom one approach or the other approach is most beneficial.

So, I often find that if it is the first time you are learning or using a tool, a statistical tool, then the course which is more specific, and helping you to learn the tool alongside the content, that is really quite effective. If nobody has ever done any data science or data analysis sort of stuff and it’s new to everyone such as a school context or maybe early university context, then that’s a context within which actually just taking everyone along together on that journey in that particular software I think is a very suited sort of approach. And the software is part of the learning.

Whereas, in a lot of the postgraduate or the sort of contexts in adult learning where most of the participants come in with existing experience and knowledge and skills, that’s where it’s really good to meet them where they are and to say what you know and what you use is valuable and is valid. It reinforces that validation. And so that’s a context within which having something which is software agnostic is so powerful.

Arguably the difference between these two is child learning versus adult learning, and there is a whole set of theories about the differences between childhood education and adult education. So, I’m not wanting to get into the debate of which of these approaches is better for who. I do have opinions, but what I believe is: it is not true that either approach is better for everyone and therefore both approaches are needed. And therefore, if I have a textbook, I have to make a choice between which of these approaches I’d prioritise. Unless I can have multiple variants, and then I can have a variant of the same textbook for each of these approaches. That is what we believe.

This highlights and illustrates the value of the academic debate “which of these approaches is better for who?” Great, we can do experiments and we can find out if we have these different variants. This is wonderful.

But more than that, it is not on the author to make that decision because the author just needs to write the best book that they can write and then exactly how it would be delivered, that’s on the instructor who’s going to use it, and if there are multiple variants, they can decide: “for my class, I would like to use the variant which is R specific”, or “I would like to use the software agnostic variant” or so on.

So this is a perfect, simple example which illustrates not only why multiple variants of a same textbook are useful, but also that it’s actually really implementable. We think that once you have these, we can go through and we can create software agnostic variants and we can have a sort of guide, which will enable anyone who has their favorite tool which doesn’t exist, to create that software specific variant of the textbook.

And so the textbooks, which currently are for R, having jamovi create their own variants for all of those, great, they can go ahead and do that. And because these are open textbooks, this is all possible and encourageable. And instead of it becoming now your jamovi one that is disjoint from the book that it is a variant of, these are now together as a coherent whole.

We know the jamovi people a bit. We’ve been interacting with them for years. I love the work they’re doing. But I don’t believe that’s the solution for everything. And they would agree with that. But I do believe that they would be a valuable variant to have for many of these textbooks.

So sending this to them and saying “look, here’s these textbooks. Here’s this process you can follow. If you follow this process, you can have a jamovi variant of all of these textbooks.” And that being something which then becomes easy, oh, this is exciting.

[00:24:49] Lily: It’s very exciting and it’s very interesting that the easy bit is the actual translating the textbook bit. Moving the textbooks from the format that they’re in into PreTeXt is a very straightforward and easy phase.

[00:25:04] David: Absolutely. And the point is that the fact that it is easy because of the robots, this is part of the unblocker, because actually doing multiple variants in the way we want to do it in anything other than PreTeXt, I think it would be much, much harder because of the semantic nature of PreTeXt. It is set up to enable these multiple variants and this is where it’s so exciting.

I’m so excited by your work on this. We’ll probably do another episode in three or four months’ time where you are able to actually state “This is where we’ve got to. This is the site you can go to and try these out.” It’s only going to be a few months away. So, I’m really excited by your work.

[00:25:47] Lily: Excellent. No, thank you very much.