Future of past data - or - Is there a report that can show me checkouts by Zip code for the past 5 years?
Strategies for keeping meaningful data while getting rid of unneeded details. --
These slides availale at http://bit.ly/pastdata2021
Hello everybody, I'm George Williams, and in case you didn't get it there from the introduction,
George Williams
koha-US member and Past President www.koha-us.org
I'm going to be talking about data. To introduce myself,
USA - Kansas is in the middle.
I am the Next Search Catalog Coordinator at Northeast Kansas Library System in Lawrence, KS.
I am the Next Search Catalog Coordinator at Northeast Kansas Library System in Lawrence, KS, which is also known as NEKLS.
117 NEKLS Libraries in the highlighted area.
NEKLS provides many services to all types of libraries in north east Kansas.
The 14 highlighted counties account for approximately 44% of the population of Kansas.
A lot of people in the Koha community think NEKLS *is* the shared catalog we operate here. But NEKLS is actually a regional library system as established in the late 60s by Kansas law and we have 117 member libraries in the 14 counties in the upper right hand corner of the state where Kansas, Nebraska, and Missouri all meet along the Missouri River.
Next Search Catalog (used to be called NExpress)
Next Search Catalog is only one of the services NEKLS provides. 52 of our member libraries participate in our shared catalog.
Of those 117 libraries, only 52 of those libraries participate in Next Search Catalog and out of the 11 employees at NEKLS, I'm the only one who has 100% of their time devoted to Next Search Catalog and Koha.
Using Koha since 2008.
Community Koha hosted by ByWater Solutions since 2011.
We've been using Koha since 2008 and we've been on Community Koha supported and hosted by Bywater Solutions since 2011.
My life before Kansas
Originally from Idaho.
Latah County Library District - Circulation Manager (2006-2016).
University of Idaho - TA and Lecturer (2002-2006).
Boise Public Library - Bookmobile Specialist (1997-2002).
University of Idaho Library - Student Employee (1996-1997).
+Bus driver, Restaurant manager, Cook, Dishwasher, Sophisticated man about town.
I've been at NEKLS since 2016. Before that I had some different library jobs at some different libraries and I had a lot of other jobs.
koha-US
https://www.koha-us.org
And, like I said before, in 2019, I was the koha-US president and this is my 7th koha-US conference.
Slides
These slides availale at http://hopperdietzel.org
These slides are available at hopperdietzel.org. If you go there and click on "See my presentations here," this is the one at the top of the list that says koha-US 2021 - Future of past data. And, of course,
hopperdietzel.org that's spelled just like it sounds - o-r-g (Open hopperdietzel.org)
Slides
These slides availale at http://bit.ly/pastdata2021
And if you don't want to spell hopperdietzel.org, you can also go to bit.ly/pastdata2021
datum, dati - 2nd declension noun (neuter)
Latin - gift, present | in English it becomes "a given"
Case
Singular
Plural
Nomnative
datum
data
Genitive
dati
datorum
Dative
dato
datis
Accusative
datum
data
Ablative
dato
datis
Vocative
datum
data
And like I said before, i'm going to be talking today about data
Questions I've received so far this year
1. I have a borrower who wants to know the title of every book they've checked out since 2002. Is there a report I can run to tell me that?
2. I need to know weekly circulation by Item Type, Collection Code, and Shelving Location for the last 5 years. I can find a report for that but I have to enter the start date and end date for each
week and it only goes back to 2018. Can you help?
3. I want to know circulation by Zip code since 2011. Can you write a report for that?
Let me go back now to those three questions I mentioned in the introduction. These are paraphrased questions from actual Next libraries. // Read questions
Answers I've give to these questions
No
No
No
All of my answers have been "No"
Complex questions
These questions are pretty common
I think what has happened is that people have seen too many episodes of CSI, or NCIS, or CRIMINAL MINDS.
CRIMINAL MINDS
FBI library patron
There's always some point in an episode of CRIMINAL MINDS where the BAU is out in the field and Derek has to call
Penelope back in Washington and, after some sexually inappropriate small talk, he'll make a request like "We need a list of everyone who served time in a county jail in the 1970s, has had at least 2 root canals, and has been recently
released from a mental institution - and Doctor Reed says the mental institution may have been near a tungsten mine. Oh, and also they may have lived in Joplin, Montana, between 1965 and 1973."
CRIMINAL MINDS
FBI reference librarian in charge of answering unrealistic complex questions
Then you see Penelope click a few keys and say "I've got 50 people that meet those criteria, but only two of them lived in Joplin, Montana, during that time. I'm texting you their details now."
There is a burden to saving data forever
Physical files
People have developed this expectation that, just because the data they're interested in isn't on paper in a file cabinet or in a banker's box in the back room of the building, that we keep it forever and we can access it instantly.
There is a burden to saving data forever
The cloud
Since it's not taking up physical space, they don't think there's a "burden" in keeping it that data forever. They think of "the cloud" as an unlimited data resource.
The cloud
What they don't realize is that putting data "In the cloud" really means putting data "On a hard drive on somebody else's computer."
The cloud = these guys' computers
Google founders Larry page and Sergey Brin
The cloud = these guys' computers
Dropbox founders Drew Houston and Arash Ferdowsi
The cloud = this guy's computer
Amazon founder Jeff Bezos
Or this guy. Since we are a ByWater Solutions customer, our data is now on Amazon web services. So this guy owns the computers where our Koha now lives.
The practical and the philosophical
Practical reason: It takes up physical space on our server
Philosophical reason: The longer we keep our data, the greater the risk that confidentiality may be breached
So there are two basic reasons we don't keep data forever. Read from the slides
Supersize me
As far as the practical goes, Our Koha is supported and maintained by ByWater Solutions and I'm sure I could call them and say "Can you double our server size so we can save more data?" And I'm equally sure that their
response would be, "Sure, let me get you a price for that and write up an addendum to our annual contract." The more data you keep, the more space you need on a server and a bigger server costs more money. It doesn't get any simpler than that.
Next Search Catalog Statistics
Total holdings: 402709
Total items: 1011096
Total borrowers: 116247
2020 circulation: 802267
2019 circulation: 1458904
2018 circulation: 1529206
We have over 400,000 bibliographic records, over a million item records, and over 115,000 borrower records and have been checking out around 100,000 items a month on average. If we had been keeping all of the data since 2008, our storage needs would be significant.
Kansas open records laws and libraries
Library records that are considered confidential
Borrower registration, circulation, or loan records pertaining to identifiable individuals
Identity of those who donate materials if the donation stipulates anonymity
Personnel records and performance records (except names, positions, length of service, and salaries)
Correspondence between the Library and a private individual, including print and electronic formats
etc.
As far as the philosopical goes, potential breaches of borrower confidentiality is always a risk. I know that in the upcoming versions of Koha there are some new confidentiality features that will encrypt some patron
data that's not needed for statistical purposes. In the last 6 years we have had 2 libraries that are NEKLS members that have been subjects of ransomware attacks. One paid, one did not.
Law enforcement has contacted us 5 times in the last 6 years asking for confidential data from our system
But the most surefire way of keeping borrower information confidential is to delete that data when it's no longer needed. I've worked in libraries now for about 25 years and in the last 6 years I've had 5 requests for
confidential data from law enforcement. In two of those cases they knew that the suspect had a library card, but when I was told that I needed to preserve data for a possible subpoena, there was no data associated with that card number.
In those cases the borrower's accounts were apparently expired and had been deleted. In the third case, the account was still active, but all of the circulation history had been deleted as a part of our normal database cleanup process.
What to save, what not to save
The question becomes (read the slide) Koha can save a ton of data, but it's good to limit what you save in Koha and what you don't. Ant there are limits, too, on what you can do with old data that's saved in Koha. It's not a bad idea to come up with a strategy for what to save and what not to save and how and when to collect the statistical data you might want in the future. And that's because you want to save data while the data is still *Fresh*
Fresh data vs expired data
- and I'll talk about *Fresh* data in a few minutes -
Our retention schedule
action_logs - timestamp 120 days (recently changed from 60 days)
deletedbiblio - timestamp 13 months
deletedbiblio_metadata - timestamp 13 months
deletedbiblioitems - timestamp 13 months
deleteditems - timestamp 13 months
message_queue - time_queued 6 months
old_issues - timestamp 13 months
old_reserves - timestamp 13 months
statistics - datetime 25 months
So, let me talk about our Koha data retention schedule. We use cleanupdatabase.pl and we currently keep data in these tables for this duration.
Most data
deletedbiblio - timestamp 13 months
deletedbiblio_metadata - timestamp 13 months
deletedbiblioitems - timestamp 13 months
deleteditems - timestamp 13 months
old_issues - timestamp 13 months
old_reserves - timestamp 13 months
The other durations are set to, more-or-less, give us at least 1 year's worth of data.
Major exception: Action logs
action_logs - timestamp 120 days (recently changed from 60 days)
The action logs were originally cut back to 60 days a very long time ago and it was mostly done to conserve space so we could use the CataloguingLog (which used to generate a lot of unneeded lines).
Two other exceptions
message_queue - time_queued 6 months
statistics - datetime 25 months
But there are two exceptions for statistics and sent messages
Statistics table
statistics.datetime
statistics.branch
statistics.value
statistics.type
statistics.other
statistics.itemnumber
statistics.itemtype
statistics.location
statistics.borrowernumber
statistics.ccode
Since we've talked about how long we keep some of this data, let me talk about the statistics table now and share some of its benefits and pitfalls. This is what the table actually looks like and the way that it works is that, when something gets checked out, checked in, renewed, a fee is paid, or a fee is written off, all of the data in this table is written. For checkouts, this means that the itemtype, location, and collecton code are recorded in this table as they were at the time the item was checked out.
Statistics table
statistics.branch = check-out branch
statistics.location = current shelving location - NOT permanent location
And I'll point out here that the "branch" in the statistics table is the branch where the item is checked out - not the item's home branch; and "location" is the item's current "SHELVING location" rather than the item's "PERMANENT shelving location." If you're using the "CART" location, that becomes an issue if something gets checked out that has a "CART" location.
Very basic checkout/renew count at a checkout library
SELECT
statistics .branch,
Count (*) AS COUNT
FROM
statistics
WHERE
(statistics .type = 'issue' OR
statistics .type = 'renew' )
GROUP BY
statistics .branch
So let's look at some SQL now. This report gets me items checked out or renewed at a library (run on NEKLS - https://staff.nextkansas.org/cgi-bin/koha/reports/guided_reports.pl?phase=Run+this+report&reports=3560&limit=400)
Very basic checkout/renew count at a checkout library grouped by item type
SELECT
statistics .branch,
statistics .itemtype,
Count (*) AS COUNT
FROM
statistics
WHERE
(statistics .type = 'issue' OR
statistics .type = 'renew' )
GROUP BY
statistics .branch,
statistics .itemtype
So this report gets me items checked out or renewed at a library grouped by item type (run on NEKLS - https://staff.nextkansas.org/cgi-bin/koha/reports/guided_reports.pl?phase=Run+this+report&reports=3561&limit=400)
Very basic check-out and renewal count by checkout/renew branch, item home branch, and item type
SELECT
statistics .branch,
items.homebranch,
statistics .itemtype,
Count (*) AS COUNT
FROM
statistics JOIN
items ON items.itemnumber = statistics .itemnumber
WHERE
(statistics .type = 'issue' OR
statistics .type = 'renew' )
GROUP BY
statistics .branch,
items.homebranch,
statistics .itemtype
Here's where "Fresh" data starts to come into play. If I'm writing a report that's looking for circulation statistics in the past that has to link out to the items table, if the item has been deleted, the link between
statistics and items is broken and there's no result. (run on NEKLS - https://staff.nextkansas.org/cgi-bin/koha/reports/guided_reports.pl?phase=Run+this+report&reports=3562&limit=4000)
Report run on October 1, 2021 for September 2021 data
SELECT
statistics .branch,
items.homebranch,
statistics .itemtype,
Count (*) AS COUNT
FROM
statistics JOIN
items ON items.itemnumber = statistics .itemnumber
WHERE
statistics .datetime BETWEEN "2021-09-01" AND "2021-10-01" AND
(statistics .type = 'issue' OR
statistics .type = 'renew' )
GROUP BY
statistics .branch,
items.homebranch,
statistics .itemtype
Potentially good result if run on October 1, 2021
If I'm running a report on October 1, 2021 and you're limiting the results to September 1-30, 2021, there's a really good chance that the item record hasn't been changed or deleted. (run on NEKLS - https://staff.nextkansas.org/cgi-bin/koha/reports/guided_reports.pl?phase=Run+this+report&reports=3563&limit=4000)
Report run on October 1, 2021 for September 2019 data
SELECT
statistics .branch,
items.homebranch,
statistics .itemtype,
Count (*) AS COUNT
FROM
statistics JOIN
items ON items.itemnumber = statistics .itemnumber
WHERE
statistics .datetime BETWEEN "2019-09-01" AND "2019-10-01" AND
(statistics .type = 'issue' OR
statistics .type = 'renew' )
GROUP BY
statistics .branch,
items.homebranch,
statistics .itemtype
Data won't be fresh - many items could have been deleted or updated in the intervening 2 years
If I'm running a report on October 1, 2021 and you're limiting the results to September 1-30, 2021, there's a really good chance that the item record hasn't been changed or deleted. (run on NEKLS - https://staff.nextkansas.org/cgi-bin/koha/reports/guided_reports.pl?phase=Run+this+report&reports=3564&limit=4000)
Statistics table again
statistics.datetime
statistics.branch
statistics.value
statistics.type
statistics.other
statistics.itemnumber
statistics.itemtype
statistics.location
statistics.borrowernumber
statistics.ccode
The other thing I want to point out about the statistics table is the "borrowernumber" field. The only borrower data recorded in STATISTICS is the borrowernumber. This means that any statistics related to anything beyond the borrower number has to come from outside of the statistics table so the longer you wait to collect the data, the more likely it is that the data will not be accurate.
Circulation by Zipcode for this month
SELECT
statistics .branch,
borrowers.zipcode,
Count (*) AS COUNT
FROM
statistics JOIN
borrowers ON borrowers.borrowernumber = statistics .borrowernumber
WHERE
statistics .datetime IS BETWEEN "2021-09-01" AND "2021-10-01" AND
(statistics .type = 'issue' OR
statistics .type = 'renew' )
GROUP BY
statistics .branch,
borrowers.zipcode
And the same is tru of borrower records. This is a report of checkouts by borrower zipcode that's fresh if I run it on October 1, 2021 (run on NEKLS - https://staff.nextkansas.org/cgi-bin/koha/reports/guided_reports.pl?phase=Run+this+report&reports=3565&limit=4000)
Circulation by Zipcode for this month
SELECT
statistics .branch,
borrowers.zipcode,
Count (*) AS COUNT
FROM
statistics JOIN
borrowers ON borrowers.borrowernumber = statistics .borrowernumber
WHERE
statistics .datetime BETWEEN "2019-09-01" AND "2019-10-01" AND
(statistics .type = 'issue' OR
statistics .type = 'renew' )
GROUP BY
statistics .branch,
borrowers.zipcode
But if I run this report on October 1, 2021, the data is going to be stale because of all of the potential changes that have happened to borrower records in the past 2 years. (run on NEKLS - https://staff.nextkansas.org/cgi-bin/koha/reports/guided_reports.pl?phase=Run+this+report&reports=3566&limit=4000)
Deletions
2021 Borrowers: 15477
2021 Items: 65879
2020 Borrowers: 3930
2020 Items: 186477
2019 Borrowers: 12972
2019 Items: 253524
In the last two years alone we've deleted a pile of items and borrowers. Additionally, I've bulk-updated about 20 different collection codes and item types. And I've probably batch-updated well over 2000 borrowers this year so far. Because all of these changes, the older the data is, the more likely it is to be incorrect if you have to link out to get information not stored in the statistics table.
Questions I've received so far this year
Q: I have a borrower who wants to know the title of every book they've checked out since 2002. Is there a report I can run to tell me that?
A. No, because a) we only keep old_issues for the past 13 months: and b) we've only been using Koha since 2008 and it would be impossible to lookup data from before then even if we did keep it: and c) even if we could go all the way back to 2008, your library didn't join Next until 2011 and before that you wern't automated - all of the pre-2011 data would have to be on paper files created at your library.
Questions I've received so far this year
Q: I need to know weekly circulation by Item Type, Collection Code, and Shelving Location for the last 5 years. I can find a report for that but I have to enter the start date and end date for each week and it only goes back to 2018. Can you help?
A. No. We only keep statistics data for the last 25 months, but we started saving monthly circulation counts by library, location, and collection code in 2019 and we started saving monthly circulation counts by library, location, and item type in 2020, so I can show you to those spreadsheets.
Questions I've received so far this year
Q: I want to know circulation by Zip code since 2011. Can you write a report for that?
A. No. Nobody has ever asked for this statistic before. I can write a report that will get you good data for last month, but data from more than a month ago will not be as reliable and it can only go back for the previous 25 months. But since you're interested in this and since another library is interested in this, I'll start collecting this data system wide starting in July of 2021.
How to decide what to report on
What do your end of the month and end of the year reports look like
What kind of collection development data do would be useful for collection development
What kind of demographic information do you need for strategic planning
What kind of data do you need for intra-consortial planning
Like I said, it's a good idea to come up with a strategy for what you want to report on each year.
When to collect the data and where to store
For us, monthly works well
I schedule on the first of each month
We currently store on Google Drive, but we're switching to Github in January
Main set of reports
So, let me show you now, how I collect statistical data. These are some of the reports I use and each of these is a link. If you look at the presentation after it's over you can click on the links and they'll take you to my github site
where you can see the SQL for all of these reports.
Reports for circulation by different categories
These are reports for circulation by item type, collection code, and borrower zipcode
Reports for Borrower usage, item counts, and requests
These are for borrower usage, item counts, and reqeusts
Reports for local item checkout, net lending/borrowing, and borrower counts
And these are for local item checkout counts, net lending/borrowing, and borrower counts.
I use Excel to make the reports presentable and saveable
12 Excel workbooks with a variety of annual statistics
And how I present and store statistical data for our member libraries.
Current Next Search Catalog main statistics
Current Next Search Catalog detailed circulation statistics
Current Next Search Catalog detailed borrower, holdings, and item statistics
Current Next Search Catalog detailed request and resource sharing statistics
A report you can use
See the report on Github
This report is 2362 lines long - but it runs in about 30-40 seconds.
Creating it wasn't as hard as you'd think.
Now, if you're asking yourself "How long are we keeping things?" you're in luck. I wrote a report If you go to this link, this is the report. Essentially what it does is it gives you the oldest and newest value for every date and timestamp field in every table in Koha. This can give you an idea of what date parameters you can set in a report. And it can give you an idea of where you need to adjust your database cleanup processes.
Date-time report
screenshot of report
And here's the results of this report on my system. You can see, for example:
Action logs
Screenshot zoom of action logs
If I look at the "Action logs" I dont' have any data older than June 9. That's because we have, for years, been deleting old action log data after 60 days and I just updated that to 120 days recently.
Biblios
Screenshot zoom of biblios
If I go down to the biblios section, you'll notice that there's no data older than August, 2020, because we have deleted biblios set to be purged when the data is more than 13 months old.
Statistics and circulation
Screenshot zoom of statistics
When I get to Circulation statistics, though, I've got that data set for 25 month retention. This gives me the raw data from statistics for at least 2 years. On the one hand we could set statistics to store lots of data
forever. Then we could answer more questions about circulation in the past. There are two problems with this. What you'll first notice if we look more closely is that it's limited in what it stores.
Thanks for listening
Thanks for coming to the 2021 koha-US conference!
These slides availale at http://hopperdietzel.org
Hopefully this gives you some ideas about what you want to do as far as collecting and storing data goes. I think the most important thing is to think once a year or so about what data you're gathering, how you're gathering it, and where you're keeping it outside of Koha.
Credits
Star Wars intro created with Star Wars Crawl Creator at play.starwars.com
Intro music: "Battle Ready" by Bryan Teoh (https://freepd.com/epic.php)