The Data Liberation Project is an initiative to identify, obtain, reformat, clean, document, publish, and disseminate government datasets of public interest.
Vast troves of government data are inaccessible to the people and communities who need them most. These datasets remain out of reach because they’ve never been made public, because they’re published in obscure formats, or because they’re published without the documentation necessary to interpret them properly.
Identify: Through its own research, as well as through consultations with journalists, community groups, scholars, government-data experts, and others, the Data Liberation Project aims to identify a large number of datasets worth pursuing.
Obtain: The Data Liberation Project plans to use a wide range of methods to obtain the datasets, including Freedom of Information Act (FOIA) requests, intervention in lawsuits, web scraping, and advanced document parsing. To improve public knowledge about government data systems, the Data Liberation Project also files FOIA requests for essential metadata, such as database schemas, record layouts, data dictionaries, user guides, and glossaries.
Reformat: Many datasets are delivered to journalists and the public in difficult-to-use formats. Some may follow arcane conventions or require proprietary software to access, for instance. The Data Liberation Project will convert these datasets into open formats, and restructure them so that they can be more easily examined.
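As a hypothetical illustration of the kind of reformatting involved, the sketch below converts records in a fixed-width layout (a convention common in older government data systems) into CSV, an open and widely supported format. The field names and column positions here are invented for the example; in practice they would come from the agency's record-layout documentation.

```python
import csv
import io

# Hypothetical record layout: (field name, start position, width).
# A real layout would be taken from the agency's documentation.
LAYOUT = [("facility_id", 0, 6), ("state", 6, 2), ("inspections", 8, 4)]

def fixed_width_to_rows(text):
    """Parse fixed-width lines into dictionaries according to LAYOUT."""
    rows = []
    for line in text.splitlines():
        row = {name: line[start:start + width].strip()
               for name, start, width in LAYOUT}
        rows.append(row)
    return rows

def rows_to_csv(rows):
    """Serialize parsed rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[name for name, _, _ in LAYOUT])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

raw = "000123TX  12\n000456CA   7"
print(rows_to_csv(fixed_width_to_rows(raw)))
```

The same two-step pattern (parse the arcane structure, then write an open format) applies whether the source is fixed-width text, a proprietary database export, or something stranger.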
Clean: The Data Liberation Project will not alter the raw records it receives. But when the messiness of datasets inhibits their usefulness, the project will create secondary, “clean” versions of datasets that fix these problems.
Document: Datasets are meaningless without context, and practically useless without documentation. The Data Liberation Project will gather official documentation for each dataset into a central location. It will also fill observed gaps in the documentation through its own research, interviews, and analysis.
Disseminate: The Data Liberation Project will not expect reporters and other members of the public simply to stumble upon these datasets. Instead, it will reach out to the newsrooms and communities that stand to benefit most from the data. The project will host hands-on workshops, webinars, and other events to help others understand and use the data.
The Data Liberation Project launched in September 2022.
The Data Liberation Project is based on the internet, but with a focus on the United States. If you’d like to bring the project’s model to other countries or to a specific US state, get in touch.
Currently, the Data Liberation Project has a staff of one:
- Jeremy Singer-Vine, founder and director. From 2014 until early 2022, Jeremy served as the founding data editor for BuzzFeed News, where he championed the publication of open, reproducible data analyses and contributed to a range of award-winning investigations. Previously, he worked at The Wall Street Journal, where he was named a Pulitzer Prize co-finalist for National Reporting. He also publishes Data Is Plural, a weekly newsletter of useful/curious datasets, and builds and maintains open-source software, including pdfplumber, a tool for liberating data from PDFs.
The project is also grateful to have received pro bono legal assistance from the Cornell Law School First Amendment Clinic.
If you’d like to be involved, read more here and get in touch.