How NYU librarians are ensuring that complex data journalism about COVID-19 and digital scholarship are preserved for future generations
In ancient Egypt, papyrus was an abundant and versatile plant that was commonly turned into thick, paper-like sheets. Surviving rolls of papyrus filled with hieroglyphic and hieratic writing have allowed historians and scholars to uncover information about ancient Egyptian civilization and culture that has been essential to understanding our world today.
Because papyrus was relatively easy to produce, its use as a writing material was widespread. But it broke down easily in cooler, wetter climates, and as a result, pretty much the only surviving papyrus-based scholarship is from Egypt. Plenty more records, including all of Aristotle’s dialogues, were lost. The use of a fragile medium in ancient times means that roughly 95 percent of ancient scholarly output has since disappeared.
It’s an old tragedy that’s taking on new relevance now, as archivists are discovering that cutting-edge forms of digital media can be surprisingly difficult to preserve, too. With the pace of developing technologies—such as complex websites, multimedia, and interactive maps—outstripping our capacity to archive and preserve them, all kinds of important online publications are also at risk of being lost. Preservationists have long understood how to care for books and physical materials through special handling, climate-controlled facilities, pest management, commercial binding, and more. Less clear is how to ensure that materials published in today’s digital environment will survive for future generations.
For instance, complex digital scholarship like NYU professor Michael Ralph’s Treasury of Weary Souls—the world’s most comprehensive database of enslaved people who were insured during the antebellum period—offers insight into which financial firms continue to profit from slave insurance policies today. The website includes data, histograms, and maps built by engineers and coders. But what happens when websites, software, and computers evolve—and the technologies underpinning this work are no longer viable?
This troubling scenario is already becoming a reality for the journalism industry, which has recently produced innovative and illuminating works such as the Los Angeles Times’ “Old Oil Wells,” the Texas Tribune’s “Where Harvey’s effects were felt the most in Texas,” and ProPublica’s “Are Hospitals Near Me Ready for Coronavirus?” These custom-built, dynamic websites offer crucial insight into pressing social issues, but with data visualizations and maps populated by back-end databases disappearing, they are at serious risk of never making it into the historical record.
NYU librarians have received a series of grants from the Mellon Foundation, the Institute of Museum and Library Services, and the Alfred P. Sloan Foundation to tackle this challenge head-on. They recently published the first-ever guidelines to help scholars preserve their digital work over the long term, with many more projects underway. NYU News spoke to four librarians about how they’re collaborating with publishers, institutions, preservationists, and newsrooms to ensure that digital scholarship and data journalism are reliably archived for scholars and researchers to access in the future.
ON SAVING DATA JOURNALISM ABOUT COVID-19
Katy Boss, Librarian for Journalism, Media, Culture, and Communication
Vicky Rampin, Librarian for Research Data Management and Reproducibility
Katy: Vicky and I are co-principal investigators on an IMLS grant called “Preserving the Dynamic Web,” which is trying to capture, archive, and preserve different types of dynamic websites. We are particularly interested in the most difficult use cases, which involve a website that also has a back-end component, such as a database that the website is querying. This is called server-side archiving because it archives the back end of the website rather than just crawling the front end, which is what most web archiving has done to date. That approach worked very well for most of the web up until a couple of years ago, but websites are becoming increasingly dynamic. Web archivists are having to look for other solutions for different use cases, and there isn’t a one-size-fits-all web archiving tool that can just grab anything. We have to use different tools for different types of websites.
What kind of dynamic websites are you working to preserve?
Katy: A good example is the coronavirus maps that data journalism sites put out. Almost all of them use software called Mapbox, and it can’t really be captured at scale by any web archiving tools right now. So if you go to archive.org, the Internet Archive, you can pop in those URLs and see that the frame of the site has been captured but none of the actual maps or tables or data are there. It’s a little bit scary because this has been a big moment in history, and these websites are not being captured at scale anywhere. We’re working on building something that can specifically capture Mapbox news applications or websites that use Mapbox to visualize data. It’s a lot of sites.
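The check Katy describes can also be scripted. What follows is a minimal sketch, not part of the NYU project: it assumes the Internet Archive's public Wayback availability endpoint and the JSON response shape described in its documentation, and the function names (`build_query`, `closest_snapshot`, `lookup`) are invented here for illustration. Note what it cannot tell you: a snapshot on record only means the page's frame was crawled, not that the maps, tables, or data inside it were captured.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the Internet Archive's Wayback availability API.
WAYBACK_API = "https://archive.org/wayback/available"


def build_query(page_url: str) -> str:
    """Build the availability request URL for one page."""
    return WAYBACK_API + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(response: dict):
    """Return the closest snapshot record from a parsed response, or None.

    Assumes the documented shape:
    {"archived_snapshots": {"closest": {"available": ..., "url": ..., "timestamp": ...}}}
    An empty "archived_snapshots" object means nothing is on record.
    """
    snapshot = response.get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        return snapshot
    return None


def lookup(page_url: str, timeout: float = 10.0):
    """Ask the Wayback Machine whether it holds any capture of page_url."""
    with urllib.request.urlopen(build_query(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup()` with a story's URL returns the closest capture record or `None`. Either way, the archived page still has to be opened and inspected by hand to see whether the visualization itself survived.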
Katy: In addition to that, we’re archiving ProPublica’s data journalism apps: their whole catalog, if we can. They’re one of our partners on this work. They build some of these really interesting, complex, robust websites that are querying a database in real time. One, titled “Are Hospitals Near Me Ready for Coronavirus?”, allows you to enter your zip code and see how full the hospitals are. This was, of course, very useful last winter.
ProPublica produces many different versions of this but there isn’t a technology that’s able to capture and archive the sites—yet. We’re working with different partners and developing tools and we think we will ultimately be able to capture all of ProPublica’s journalism apps.
What does the archiving process look like?
Katy: It’s not easy to look at a data journalism site and know whether it’s archivable or not. We’re working on a flow chart that would help digital archivists and data journalists figure out exactly what they have built and which aspects can be preserved. Some things can be archived with Webrecorder, which is a high-fidelity dynamic web archiving tool that can capture a lot of things, but it can present issues with getting the archives into library catalogs and making them available to researchers later. Sometimes it isn’t until you get to the quality assurance step and check the archived version that you realize it didn’t capture crucial parts of the site.
Vicky: But our tool, ReproZipWeb, enables us to do server-side archiving. Anyone can use it; it’s free and open source. If you have access to either a server where the materials are being hosted in production or a copy of those materials, you would first start the server, which engages the tool and keeps track of everything that’s happening on the server, including the software it touches, the data it uses, the database, the type of the database, and so on. It captures a lot of in-depth metadata, which is required for active, ongoing digital preservation. At the end of the process, you get a bundled file that is small and shareable and contains all the assets needed to rerun the web application in different environments. It’s not just facilitating archiving; it’s also facilitating reuse by others.
If we don’t have access to different computational environments, such as different operating systems and different servers, over the long term, then a lot of this work becomes moot. If you have a Windows 95 file but no copy of Windows 95, and you opened the file now, it would look like Wingdings. Software archiving is a crucial part of this work.
It’s counterintuitive to think that something published online as recently as last year is already at risk of being lost. How widespread is the problem?
Katy: Oddly, there are books that were published 500 years ago that are much more stable and preservable than some of these dynamic websites. The sites can be exceptionally fragile, especially with some of these news organizations, like Vox or Chalkbeat, that don’t have a legacy publication behind them. There has been a lot of really interesting data journalism created during COVID that’s already gone, and data journalists are sounding the alarm about the loss of their work. Digital-first, start-up media organizations are incredibly volatile.
Data journalism is extremely at risk. We did a survey of 73 news applications, and it was appalling how many had broken or disappeared within five years: about half. It’s a real crisis. There’s so much about the time we’re living through that was not documented and saved.
How do you hope this work will progress and continue?
Katy: Ideally, I would like to see libraries partner with newsrooms on digital preservation the same way they have partnered on print archives for decades. I am from Michigan and I used to work at the Grand Rapids Public Library, which was the place for the whole archive of the Grand Rapids Press. We collected the entire paper’s backfile along with a lot of other local newspapers from Traverse City and smaller areas. The local libraries collected the local papers, and it was jigsawed out so that no single library would have to try to archive every newspaper that has ever been published; instead, libraries would each approach it locally.
I think this would be a really good approach for these complex data journalism sites. NYU partnering with ProPublica is such a good match because we're both right here in New York City. Hopefully libraries would be able to partner with local data journalism institutions and use our preservation tool. The beauty of ReproZipWeb is that it creates a single archival distributable file—it's very preservation friendly in that way, and this one file will allow folks to easily access the sites 10, 15, 20 years from now. Computers are going to be vastly different—we probably wouldn’t even recognize them—and we're going to be accessing these archived files in very different ways. But as librarians, that’s our job and that’s our mandate: to ensure that we are stewards of this content and that we make it available in the future.
ON PRESERVING COMPLEX DIGITAL SCHOLARSHIP
Jonathan Greenberg, Digital Scholarly Publishing Specialist
Deborah Verhoff, Digital Collections Manager
Jonathan: Our projects aim to preserve enhanced digital publications. Publishers have increasingly been publishing digital work that breaks out of the bounds of what scholars could do in a printed environment. That includes everything from adding multimedia, video, audio, and high-resolution images to really customized and complex websites designed to convey scholarship. We’re working with university presses, organizations that do preservation for university presses, and other publishers.
Our first project developed processes and technologies for preserving a range of digital publications, along with a set of guidelines to help publishers, scholars, and platform developers create scholarship that is more preservable from the start. Once a digital book, article, or digital humanities project is finished, you’re limited in what you can do to preserve it. Libraries are used to this, but we want to help publishers consider longevity from the very beginning of the process.
We collect things that were created 200 years ago or maybe last year. You don’t have any control over the conditions in which they were created. You just do your best. That’s what libraries do. This project helps publishers to create works that can be preserved and remain part of the scholarly record moving forward, allowing scholars and students to access their history, cite their sources, and trace ideas through time.
What challenges are unique to scholarship preservation?
Jonathan: The output of scholarly publishers is quite different [from that of data journalists]. There’s an expectation from scholars and their readers that the work will be preserved, and that’s part of the urgency on our part. For the past 100 years or more, if you wrote a book or published an article, you expected there would be a library that would preserve that book and you knew it would exist. That expectation is still there but the calculus about what to preserve and how has become a lot more complicated. At the very least, we want to figure out what it will take to preserve these works and be transparent with these publishers and scholars to say: This is the investment it’s going to take, here are the partners that are willing to do it, and here’s how your publications will be at risk.
Deb: Archivists, special collections curators, and subject matter experts still collect manuscripts, and those records are increasingly born digital. That is a challenge: You had nothing to do with the creation (maybe it’s a Quark file from the ’90s), but it describes a really important moment in history and you want to make it available for others. We have a responsibility to explore this new quandary together.
What is the next phase of your work in digital preservation?
Jonathan: The second project tests out these guidelines by embedding experts with publishers to help them adopt our guidelines, but also to refine them and to test out whether they’re communicated in an appropriate way. Are our assumptions about intervening early on correct? Will we be able to preserve these works as well as we thought we could? That is a three-year project that allows us to follow the entire publication process of some of these works.
The publications we’re looking at are fairly new and they’re on platforms that publishers are still actively engaged in developing and maintaining. The risk is further out. We know that business models and ventures don’t last forever and even if they do, the technologies don’t last forever.
Deb: This is going to be extremely satisfying for people who have said, “If only we could get back to the point of creation and future-proof these works and think about sustainability from the very beginning.”