Data

Let’s Preserve Government Data Before It’s Too Late!

This has been one hell of a bumpy month, and I have a lot I could scream, er, talk about, but for the moment, let’s talk data.

The US Government has spoiled us in recent years with the amount of public data and information available. NIH studies, wastewater virus shedding data, COVID-19 impacts, climate trends and forecasts, all kinds of things. Some of those are things I used for my COVID reporting over the past four years.

And in recent weeks, the US Government has demanded that some of this data be purged or altered to fit the whims of the new White House.

A bit too 1984 for my tastes. A bit too Fahrenheit 451.

And this is not the first time data’s disappeared under a new administration, and it’s not always one party or another, though what’s happening now is scary.

But as they say, look to the helpers. There are several massive efforts underway to archive as much data as possible so it’s not lost forever.

Today, I set up Archive Team’s Warrior, which automates the collaboration around spidering, downloading, processing, and uploading data from governmental sites (and others) to archive.org. All it takes is some bandwidth (okay, a fair amount of bandwidth — looking at 1TB/month right now), some hard drive space, and some CPU cycles, and I can help with this archiving project. It’s excellent, easy to set up, fun to watch, and requires virtually no work on the user’s end. They provide VMs and Docker images (I chose Docker), and once installed, it’s self-managing.
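To put that 1TB/month in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: it assumes a decimal terabyte (10^12 bytes), a 30-day month, and traffic spread evenly around the clock, none of which Warrior promises. The `average_mbps` helper is an illustrative name, not part of any Archive Team tooling.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_mbps(bytes_per_month: float, days: int = 30) -> float:
    """Average sustained rate, in megabits per second, for a monthly transfer total."""
    bits = bytes_per_month * 8
    return bits / (days * SECONDS_PER_DAY) / 1e6


# Assumption: 1 TB = 10**12 bytes, spread evenly over a 30-day month.
rate = average_mbps(1e12)
print(f"{rate:.2f} Mbit/s")  # prints "3.09 Mbit/s"
```

So under those assumptions it averages out to roughly 3 Mbit/s, a steady trickle rather than a flood, though real traffic will be burstier than an average suggests.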

I’m exploring more of what’s out there for data preservation, and thinking about how I can get involved. There are a few really interesting resources out there, including:

  • GovDiff: See the differences in governmental information and resources before and after this administration began its.. work.
  • /r/DataHoarder on Reddit: A group of people working to collect and archive data of all kinds.
  • End of Term Archive Project: Captures US Government sites after presidential terms end.
  • Data Rescue Efforts by Lynda M. Kellam (archived link): A whole collection of sites worth exploring.
  • Archive.org, which hopefully you know about already, and which must be preserved at all costs.

Also of note, CDC Datasets prior to January 28th, 2025 (nearly 100GB worth), which I’ll be archiving myself.
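If you keep your own copy of a dataset like this, it helps to be able to check later that nothing has been silently corrupted or changed. One common approach is a checksum manifest. The sketch below is a minimal example, not a description of how I am storing the CDC data: the function names are made up for illustration, and it assumes the files simply sit in a local directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return manifest entries that are now missing or no longer match."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )
```

The idea would be to call `build_manifest` once right after downloading, save the result somewhere safe (as JSON, say), and run `changed_files` against it whenever the archive is copied or restored.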

No doubt, lots of data will be lost to time. I can only hope there are enough people at these agencies who quietly, discreetly backed up and sent off what they could before this all went down. Either way, the fact that so many people can join in on the data archiving effort today is incredible, and I hope anyone out there with the resources to spare will take a moment to set up Archive Team’s Warrior and contribute to the effort.


The End of COVID.. Data.

This year’s seen a rapid reduction in available COVID data, certainly in California, where we’ve been spoiled with extensive information on the spread of this virus.

In 2020, as the pandemic began to ramp up, the state and counties began to launch dashboards and datasets, quickly making knowledge available for anyone who wanted to work with it. State dashboards tracked state-wide and some county-wide metrics, while local dashboards focused on hyper-local information and trends.

It wasn’t just the counties, either. Schools, hospitals, and newspapers began to share information. Individuals, like myself, got involved and began to consolidate data, compute new data, and make it all available to anyone who wanted it.

California was open with most of its data, providing CSV files, spreadsheets, and Tableau dashboards on the California Open Data portal. We lacked open access to the state’s CalREDIE system, but we still had a lot to work with.

It was a treasure trove that let us see how the pandemic was evolving and helped inform decisions.

But things have changed.

The Beginning of the End

Over the last 6 months or so, this data has begun to dry up. Counties have shut down or limited dashboards. The state’s moved to once-a-week case information. Vaccine stats are no longer being updated to include new boosters.

This was inevitable. Much of this requires coordination between humans, real solid effort. Funding is drying up for COVID-related data work. People are burnt out and moving on from their jobs. New diseases and flu seasons have taken precedence.

But this leaves us in a bad position.

