Description
Lily and David discuss recent work on the ePICSA system, focusing on the development of a structured summaries database to support climate information for agriculture. They explore how moving from file-based systems to a database approach creates new opportunities for versioning, quality control, decentralised workflows, and accreditation of climate products. The conversation also reflects on the broader challenges of climate data quality, data rescue, and building sustainable systems that can support national meteorological services.
[00:00:07] Lily: Hello and welcome to the IDEMS podcast. I’m Lily Clements, a data scientist, and I’m here with David Stern, a founding director of IDEMS.
Hi, David.
[00:00:14] David: Hi, Lily. We’re discussing ePICSA today.
[00:00:19] Lily: Yes. So PICSA, I’m sure ePICSA is also something we’ve discussed before, but PICSA...
[00:00:24] David: Participatory Integrated Climate Services for Agriculture.
[00:00:29] Lily: Yes, that’s the one. And then the “e” part of that is us digitising it, which I believe we have discussed in previous podcasts.
[00:00:38] David: Yes. And one of the bits, which has just come out as needing to be done, and we’ve made some progress on, really relates to the fine details of this. And so this is going to be a more technical episode than quite a lot of the others. It is very practical, of course, and we can explain the need and why this is so important, but what we’re talking about is an implementation of a mini database to be able to manage the communication to the app of the summaries which farmers are using.
[00:01:16] Lily: Yeah. So, until now, the system has been that we have these summaries, which are done with R code, and you can do them in R-Instat, and then we put them into these Google buckets. But now, as we are implementing it at the moment, they go into a database, and they are then read on the app side from the database into the app.
[00:01:38] David: Yeah. So let’s just clarify. It’s not that big a difference. The Google buckets are just sort of unstructured: the summaries are stored as files, and those files were then read and interpreted as files. Whereas actually having a formalised database structure for that same data enables us to do things like manage versions and have multiple definitions.
It enables the app or the dashboard which manages the app to actually have more powerful control over how it uses different summaries that have been uploaded into the database.
[00:02:24] Lily: Yes. So I guess, how does this work? Why does a database allow this and the Google buckets or this kind of more unstructured system that we had with the Google buckets doesn’t?
[00:02:35] David: Well, of course, those files were essentially structured, or semi-structured, data, so you could argue this is a relatively unstructured form of database; it’s not a million miles away. But the problem that we’re trying to solve is that, when you write these files, there’s a question of whether you are writing the same thing that is going to be read.
What do I mean by this? Well, when the app is being created, it’s going to be created in such a way that it will read for the whole of, let’s take Zambia, so the app will read the data for the whole of Zambia. So it’s very nice when the files are there for the whole of Zambia.
But quite a lot of the work we have in Zambia with the Met office now is giving more ownership to the provinces. So it would be much nicer if they can be the ones managing that data. And then you’d get lots of little files, which would then have to come together.
Whereas if you have this in a database, then you’ve actually got those structures in the database, and the reading and the writing get really separated out, as opposed to the writing of these files, which is seen as a single process done from R-Instat by one actor group.
So one of the things that this does is it enables us to separate out and have multiple actor groups doing different bits of the writing, and having the reading of the database being separate from the fact that different people did different bits of the database.
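The separation of writing and reading described here can be sketched in a few lines of Python with the standard-library sqlite3 module. This is only an illustration under stated assumptions: the table name, column names, station names, definition identifiers and values are all hypothetical and made up for the example, and the real ePICSA summaries database may be structured quite differently.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption
# made for illustration, not the actual ePICSA database design.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE summaries (
        country TEXT NOT NULL,
        province TEXT NOT NULL,
        station TEXT NOT NULL,
        definition_id TEXT NOT NULL,
        version INTEGER NOT NULL,
        year INTEGER NOT NULL,
        start_rains_doy INTEGER,
        uploaded_by TEXT NOT NULL,
        PRIMARY KEY (station, definition_id, version, year)
    )
""")


def write_summaries(conn, rows):
    """Each actor group uploads only the rows it is responsible for."""
    conn.executemany(
        "INSERT INTO summaries VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()


# Two provincial groups write independently (all values are invented).
write_summaries(conn, [
    ("Zambia", "Southern", "station_a", "start_rains_south", 1, 2020, 320, "southern_office"),
    ("Zambia", "Southern", "station_a", "start_rains_south", 2, 2020, 318, "southern_office"),
])
write_summaries(conn, [
    ("Zambia", "Northern", "station_b", "start_rains_north", 1, 2020, 305, "northern_office"),
])


def read_latest(conn, country):
    """Read the latest version of every summary for a whole country."""
    query = """
        SELECT s.province, s.station, s.definition_id, s.version,
               s.year, s.start_rains_doy
        FROM summaries AS s
        JOIN (
            SELECT station, definition_id, year, MAX(version) AS version
            FROM summaries
            WHERE country = ?
            GROUP BY station, definition_id, year
        ) AS latest
          ON s.station = latest.station
         AND s.definition_id = latest.definition_id
         AND s.year = latest.year
         AND s.version = latest.version
        ORDER BY s.province, s.station
    """
    return conn.execute(query, (country,)).fetchall()


latest = read_latest(conn, "Zambia")
```

The reader asks for the whole of Zambia and never needs to know which group uploaded which rows, or how many versions came before the latest one.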
[00:04:26] Lily: Nice. You say it’s gonna be a technical discussion, this one, but this is at least making sense to me. So semi-technical.
[00:04:32] David: Yeah, as with all the episodes, we never want to get too technical, but we should now dig into some of the problems.
Why do we want that separation? Well, not all the provinces necessarily have the same ecological zones or the same climate, and hence, they might need different definitions of things like start of the season, start of the rains, end of the season, end of the rains. They might use slightly different definitions to try and convert the raw data into summarised data, which is useful to farmers.
[00:05:12] Lily: So, an example of that might be that Zambia is a big country. Zambia’s bigger than I realised before I went there; it was while I was there that I realised how big it was. It’s a big country and so, as you say, it spans several kinds of climate zones. And I believe a discussion when I was there a few months ago was that for one region, the more southern half, we might want to consider start of the rains a month earlier than for the more northern half.
[00:05:42] David: As you go from north to south, in general, that gradient often leads to different times, which would be sensible to consider as the start of the rains. If you went a bit further north still, if you go to somewhere like Tanzania, which is similarly large, there are some parts of Tanzania where you have one season and some parts where you have two, and so the seasonality changes even more drastically.
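As a rough illustration of what "different definitions" could mean in practice, here is a minimal Python sketch. The form of the definition (first rainy day on or after an earliest day whose rainfall total over a short window reaches a threshold) is an assumption chosen for the example, as are all parameter names and values; the conversation does not specify the definitions actually used.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StartOfRainsDefinition:
    """Hypothetical parameters; names and values are illustrative only."""
    earliest_day: int    # index of the first day in the series to consider
    window_days: int     # number of days over which rainfall is totalled
    threshold_mm: float  # total needed within the window


def start_of_rains(rain_mm: Sequence[float],
                   definition: StartOfRainsDefinition) -> Optional[int]:
    """Return the index of the first rainy day that opens a qualifying window."""
    last_start = len(rain_mm) - definition.window_days
    for day in range(definition.earliest_day, last_start + 1):
        window_total = sum(rain_mm[day:day + definition.window_days])
        if rain_mm[day] > 0 and window_total >= definition.threshold_mm:
            return day
    return None  # no start of rains found under this definition


# Made-up daily rainfall in mm; the same data under two definitions.
rain = [0, 0, 12, 10, 0, 0, 0, 0, 15, 8, 0]
earlier_definition = StartOfRainsDefinition(earliest_day=0, window_days=2, threshold_mm=20)
later_definition = StartOfRainsDefinition(earliest_day=5, window_days=2, threshold_mm=20)

start_earlier = start_of_rains(rain, earlier_definition)
start_later = start_of_rains(rain, later_definition)
```

With the same record, the two definitions give different start dates (day index 2 and day index 8 here), which is why the definition used needs to travel with the summary.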
And Zambia is very interesting because although it’s very big, it’s not that densely populated and it actually has quite a large urban population. So it’s a very interesting country. The Met stations are actually quite sparse because it is so big and there are big areas which aren’t densely populated. It’s a great thing for a country to have land which is not densely populated, but it is not a great thing when you are wanting to have climate data available all over the place. When you have very sparsely populated areas, you often have less climate data and other data associated.
[00:06:55] Lily: Interesting. And so I guess there’s these nuances within different countries and between countries which have these complications, and we want what we are doing to be usable and useful.
[00:07:14] David: Exactly. And this is where the motivation comes to be able to add this layer of complexity, or layer of structure – maybe I shouldn’t say layer of complexity because, in some sense, structure is reducing complexity, or it’s controlling complexity.
So, by actually structuring the data more, we are controlling the complexity a little bit more and enabling the system to be serving more purposes – it comes back to our options by context – to be serving different ways people might want to interact with it. And I think it also has a very nice separation in terms of: we no longer need such a strong API for these interactions because a lot of this can just be managed through the database and database permissions.
It’s a really exciting bit of progress, and I want to just maybe take a little step back to say, why is this so exciting? We’ve said a bit technically what it is, we’ve said a few things it does: it allows us to have these multiple definitions, it allows us to have different people working on different parts of it at different times, and that to be managed in sensible ways through the database. And of course all of these are possible for a file-based system, but the database does this very naturally and very well.
And then the thing which is so exciting to me is, and this comes back to the Zambia use case, where we are now looking at these ideas of accreditation. And that’s different. So, in particular, we can be accrediting the summaries that people have done and the methods they’ve used to do those summaries, and that could be attached as metadata to the data, so that you can actually get quality and signals of quality or evaluations of quality really visibly within these systems in ways that haven’t been possible before.
And I see this as really exciting because it serves everyone well. In particular, I’m excited by how it serves Met offices, especially with respect to a really challenging problem, which is data rescue. If we can get the accreditation processes for data and for products, then those accreditation processes could get baked into how Met offices work. And Met offices can then be articulating and actually requesting for funds to improve the quality of their data in ways which are much more easily measurable and where they can really demonstrate what they’re delivering on.
And it comes back to this problem that Met offices occasionally convince people that there’s a really important reason to do data rescue and this is really vital and should be done, and they’ll get a big project to do it. But data rescue isn’t something which can be done in a project. A project can only ever do a part of data rescue.
So, if you don’t have a way of demonstrating what’s done, then the funders say, “well, didn’t we already fund that?” And it’s difficult for Met offices to say, “well, we did what you had the funding for, which was this, but we still have this to do, therefore that’s why this other funding would be useful and needed”. And so I think actually putting in place the structures so that this can be continually worked on, and then hopefully eventually integrated better into the standard working practices of the Met Services themselves, that would be the ideal.
And this is where in Zambia, it’s so exciting that a lot of the enthusiasm is coming from the provinces, and you’ve experienced that more than me because you were actually giving that training, and so you probably observed that when you were last there.
[00:11:31] Lily: Yes. Yeah, absolutely. And I guess that then brings us to what you’re talking about, quality of data and data rescue. I just want to home in on those terms because, you know, what do we mean by them? We want the data to be better quality. Well, data’s data, how do we have data of good quality and bad quality?
But it’s through things like looking at your data and checking if it has a missing value being replaced with a zero, which can happen through Excel and things. If we have a temperature of zero, well that’s probably meant to be missing, especially if we have a lot of temperatures, daily temperatures, where the min is zero and the max is zero and it spans over a few months, because zero degrees temperature in Zambia is pretty unlikely.
[00:12:20] David: Well, exactly zero degrees temperature anywhere is unlikely, especially unlikely somewhere like Zambia, where getting to zero is almost unheard of.
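The temperature check Lily describes can be sketched as below. This is a minimal sketch, assuming daily records with a minimum and maximum temperature; the record layout, function name and sample values are hypothetical.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class DailyTemp(NamedTuple):
    """Hypothetical record layout for one day of temperature data."""
    day: date
    tmin: Optional[float]  # None means recorded as missing
    tmax: Optional[float]


def flag_suspect_zero_temps(records: List[DailyTemp]) -> List[date]:
    """Flag days where both min and max are exactly zero.

    Such values are more plausibly missing data replaced by zero
    than real observations, so they are flagged for checking.
    """
    return [r.day for r in records if r.tmin == 0 and r.tmax == 0]


# Made-up sample values.
records = [
    DailyTemp(date(2021, 1, 1), 18.2, 29.5),
    DailyTemp(date(2021, 1, 2), 0.0, 0.0),    # suspect
    DailyTemp(date(2021, 1, 3), None, None),  # honestly missing
    DailyTemp(date(2021, 1, 4), 0.0, 0.0),    # suspect
]
suspect_days = flag_suspect_zero_temps(records)
```

The flagged days are candidates for checking against the paper record, not automatic corrections.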
[00:12:34] Lily: Yes, but then it’s a bit more complicated when we come to, say, rainfall because, you know, there are dry seasons and rainy seasons, and having exactly zero millimetres of rain is not an unlikely value anywhere. In Zambia it is of course very common in the dry season, and could also be common in the rainy season.
So then it becomes, “okay, how can we tell if these zero millimetres of rainfall are actual recorded values of zero or if they are missing values that have been replaced with zero?” And my understanding is that’s the sort of thing that you mean by data quality and clearing up these data sets for data rescue.
[00:13:30] David: That is one of the examples of a data quality process: investigating missing values, looking at zeros. You then go back and maybe look at the paper record, which is often well organised in many of these countries. And so then the question is: how do you do that digitisation? And all of this is a lot of work, but very valuable work, because if you don’t do it, then these anomalies in the data that you have change the results in ways which are not correct.
So, actually, it is important to go through this, recognising that to err is human, and that if you have millions and millions of rows of data, it is absolutely expected that thousands of them will have mistakes. If you have tens of thousands of mistakes, that’s still only a few percent. For a million rows of data, if you had 10,000 mistakes, well, that’s only 1%.
That means you are 99% accurate in your data entry, and it’s a big ask to only have ten thousand corrections needed out of a million rows. So the fact that corrections are needed is absolutely natural and normal. Minimising the corrections, that’s important, but this is true whether it’s humanly entered or whether you use digital technologies like AI to enter it. That’s increasingly an option in a way that wasn’t possible, or wasn’t reliable, until recently, but it’s becoming more reliable.
But it’s still certainly nowhere near a hundred percent; 99% would be amazing, and even that would give you thousands and thousands of mistakes that would need identifying and correcting.
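The arithmetic behind these figures is simple enough to write down. The function name below is just an illustrative choice.

```python
def expected_corrections(rows: int, accuracy: float) -> int:
    """Number of rows expected to need correcting at a given entry accuracy."""
    return round(rows * (1 - accuracy))


# One million rows entered at 99% accuracy, then at 99.9% accuracy.
corrections_at_99 = expected_corrections(1_000_000, 0.99)
corrections_at_999 = expected_corrections(1_000_000, 0.999)
```

At 99% accuracy a million rows still leave 10,000 to correct; even at 99.9% there would be 1,000.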
[00:15:35] Lily: Yeah. Very interesting.
Coming back to the database then and why this is so exciting: is it that having these quality checks done on the data, or having the database, can then help us have good quality data? Is that what you’re saying?
[00:15:51] David: Let’s take two steps back here, because there is a database, of course, for the raw data – we work with Met services that use “Climsoft”, that use “CLIDATA”, that use “Clisys”, that use “CliDE”, there are all these different what they call CDMSs, Climate Data Management Systems, which are used to manage the raw data.
The database that we’re talking about is a summaries database. And so what’s interesting about this, is that this is actually a products database. And what I think is interesting about that is that, unless you have products systematically organised, then you don’t tend to do the quality control well on the data because you’re not using it in earnest.
And so it is when you actually get to producing products that then your funny temperatures of 200 degrees instead of 20 degrees really stand out and you say, “wait a second, this is wrong”. Now, of course, that can be identified easily, but there’s all sorts of other problems, you know, your missing month of rainfall is something which is not so easy to identify.
And this is something where, in almost all instances of large data sets we have looked at, even ones that have been quality controlled professionally by the French Met Service, when we put them through more rigorous quality controls, we find things like missing months where rainfall has been replaced by zero for a whole month, and that’s something which you can quite easily identify and recognise is wrong, and then go back and just check the paper record to see what that should have been.
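A check for whole months of zero rainfall might look something like the sketch below. It is a minimal illustration, not the quality control procedure actually used: the choice of rainy-season months, the function name and the sample data are all assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, List, Set, Tuple


def flag_all_zero_months(daily_rain: Dict[date, float],
                         rainy_months: Set[int]) -> List[Tuple[int, int]]:
    """Return (year, month) pairs in the rainy season where every value is zero."""
    by_month = defaultdict(list)
    for day, mm in daily_rain.items():
        by_month[(day.year, day.month)].append(mm)
    return sorted(
        year_month
        for year_month, values in by_month.items()
        if year_month[1] in rainy_months and all(v == 0 for v in values)
    )


# Made-up data: January is all zeros (suspect in the rainy season),
# February has rain every fifth day, July is a dry-season month of zeros.
daily_rain = {}
day = date(2021, 1, 1)
while day <= date(2021, 2, 28):
    daily_rain[day] = 0.0 if day.month == 1 or day.day % 5 else 12.5
    day += timedelta(days=1)
for d in range(1, 32):
    daily_rain[date(2021, 7, d)] = 0.0

# Assumed rainy season, roughly November to March; adjust per region.
flagged = flag_all_zero_months(daily_rain, {11, 12, 1, 2, 3})
```

Only the rainy-season month is flagged; a whole month of zeros in the dry season is entirely plausible and is left alone.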
So anyway, I guess I’m going off on a slight tangent here, but really reinforcing this fact that the CDMSs that manage the raw data, these are really important, but we even have groups who still use Excel for that. But just using those is not enough. We need to get the data out of those and used to produce products.
And your training was helping people use R-Instat for this. And that is a route which has gained momentum in a number of contexts. But using R-Instat not just to produce the products, but to put the products into a database, a products database, which can have elements of quality control as metadata, that’s what we think is transformative in terms of closing the loop and actually changing who has responsibility for what and enabling these systems to become more stable and sustainable.
[00:18:42] Lily: Very well summarised, thank you very much.
[00:18:47] David: Well, no, thank you.

