Through the Black Hole of Information. Friedel Geeraert on building a Belgian Web Archive

Friedel Geeraert holds MA degrees in History (KU Leuven), Sustainable Development (Uppsala University), and Information and Communication Science and Technology (Université Libre de Bruxelles). Since 2017, she has been a researcher on the PROMISE project at the State Archives of Belgium and the Royal Library of Belgium. The PROMISE project paves the way towards a Belgian web archive.

Can you introduce the PROMISE project and describe some of its history and its future objectives?

PROMISE is a BRAIN project funded by BELSPO (Belgian Science Policy) that started in 2017 and runs until the end of 2019. The project was initiated by the Royal Library of Belgium and the State Archives of Belgium and involves a number of partners, notably Ghent University, University of Namur and the Haute École Bruxelles-Brabant. The research is conducted by an interdisciplinary team that includes technical experts, legal and information management specialists, and researchers who study the use of web archives from a digital humanities perspective.

The project aims to lay the foundations for a Belgian web archive by taking stock of (international) best practices in web archiving, developing a web archiving strategy for Belgium, initiating a pilot project for archiving the Belgian Web and accessing these collections, and making recommendations for the implementation of sustainable web archiving services. In doing so, PROMISE positions itself within an existing ecosystem of Belgian web archiving initiatives, including projects at the University Library of Ghent, Felixarchief, KADOC, and AMSAB-ISG, to name a few. Building on lessons learnt, PROMISE takes a holistic approach to web archiving and aims to capture part of the Belgian Web.

How would you define or delimit the Belgian Web?

There are a number of different approaches you can take to delineate the ‘Belgian’ portion of the Web. Most countries start by looking at national domain names, so in our case that would be the .be domain. Related and equally useful domain extensions are linked to the national territory, such as .brussels, .gent and .vlaanderen. Another criterion that can be used is the hosting location of the website. We can identify web content that is hosted in Belgium based on geographical IP localisation, as each country is allocated a number of specific IP ranges. Finally, we might also take into account web content that is created by Belgians but that might for instance be hosted on a .com or .org domain, or content that has a strong link to Belgium. For the latter, an additional range of criteria related to relevance comes into play.
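
To make these criteria concrete, here is a minimal sketch in Python of how a scoping check might combine them. It is an illustration only, not the PROMISE project's actual method: the function name `in_scope`, the returned labels, and the IP range are assumptions. The IP range shown is a reserved documentation range standing in for real Belgian allocations, and the suffix list contains only some of the extensions mentioned above.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: a partial list of suffixes named in the interview.
NATIONAL_SUFFIXES = (".be", ".brussels", ".vlaanderen")

# Assumption: 192.0.2.0/24 is a reserved documentation range (RFC 5737),
# used here as a placeholder for real Belgian IP allocations.
BELGIAN_IP_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def in_scope(url, server_ip=None):
    """Return which criterion puts a URL in scope, or 'needs-review'."""
    host = (urlsplit(url).hostname or "").lower()

    # Criterion 1: national or territorial domain extension.
    if host.endswith(NATIONAL_SUFFIXES):
        return "domain"

    # Criterion 2: hosting location, approximated by IP range membership.
    if server_ip is not None:
        address = ipaddress.ip_address(server_ip)
        if any(address in network for network in BELGIAN_IP_RANGES):
            return "hosting"

    # Criterion 3: content by Belgians or about Belgium cannot be decided
    # from the URL alone, so it is left for human relevance review.
    return "needs-review"
```

The third criterion is deliberately left to a reviewer: relevance to Belgium is an editorial judgement that a URL or IP address cannot settle.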

What are the benefits of archiving web content?

The big problem we are dealing with today is what I would call the digital black hole. Contrary to what could be expected, digital content does not have a long lifespan. As information finds its way to the Web, the Web contains important traces of our history that need to be preserved. Archiving web content could serve five important purposes: making processes more efficient, holding governments accountable, helping citizens, supporting research and preserving heritage.

In the PROMISE project we are particularly looking at research and heritage from a web archiving point of view. If you really want to hold governments accountable, you need to be able to assure the authenticity and integrity of your data. Not all methods for compiling an archive are suitable to achieve this standard. For instance, when using regular web crawling (on the client side of the Web) you are going to miss some information. Incompleteness is in that case inherent to the web archiving venture. Another approach, which would yield more complete results, is server-side archiving. This means that you harvest a copy of the information on the server without going through the HTTP protocol. If you aim for even more completeness, an option is to do a transaction recording, which boils down to recording a short (video) clip of the user’s interactions with the web content. Most of the initiatives that we have studied, however, rely on client-side crawling.

How do the Belgian web archiving efforts measure up against the international context?

It is safe to say that Belgium is lagging behind quite a bit. In Europe, only Poland, Italy and Belgium currently do not have a Web archive on the national level. Other countries, by contrast, started their archiving efforts in the early days of the Web. Australia, for instance, started its PANDORA project in 1996; web archiving in the UK dates back to 1996, in Sweden to 1997, in New Zealand to 1999 and in Norway to 2001; and in 2002 the Bibliothèque Nationale of France began archiving Web content. This means that some institutions now have collections that are decades old, which makes for a treasure trove of information for researchers.

Apart from the national initiatives, there are also international organizations that have set up Web archiving services. The biggest and probably best-known web archive is the Wayback Machine of the Internet Archive. Another example is Common Crawl, an open repository of data crawled from the Web. Then there are also organizations that facilitate collaboration and knowledge exchange, such as the International Internet Preservation Consortium (IIPC). At the forefront of international studies in web archiving are projects that aim to develop research infrastructures for analyzing archived Web content such as the BUDDAH project or RESAW.

In terms of the PROMISE project we are sharing experiences and absorbing knowledge about how to define collections, what tools we can use etc. In this regard the IIPC is very interesting.

How do you approach the process of selection? How do you decide what to preserve and what not?

For that, we first studied what has been done abroad. We noticed that a number of the national archives that we studied limit their scope to the websites of government institutions. Most archives indeed work within a legal framework (a Law on Archives), which stipulates which information needs to be archived and preserved. Usually, web content falls within the definition of what constitutes an archive, and there is no need to broaden the legal frame. In other words, web content can be preserved along with the established paper archive.

In the case of most national libraries, a legal deposit legislation is in place. This allows the National Library to receive one or more copies of the country’s editorial production, such as books. This also sometimes holds true for authors that have the country’s nationality and publish abroad. As such, the National Library can build a collection of all of the country’s publications.

This type of archiving legislation and policy can be broadly applied to the Web. To build a Web archive, one can do a broad crawl and, for instance, mine all of the French Web by harvesting content from the .fr or .alsace domains. This can be done to a certain level of depth, such as the home page and then down two or three additional levels. This broad-ranging approach can be supplemented with more selective collections. These collections could consist of information related to certain events or emergencies (forest fires, terrorist attacks), to literature, or to already established library collections. In the case of these more specific collections, one would crawl the entire website more frequently and to greater depth. Some countries that do not have legal deposit legislation and only build smaller, more selective collections are for instance the Netherlands and Switzerland.
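
As a rough illustration of what crawling "to a certain level of depth" means, the following Python sketch walks a toy site breadth-first and stops following links below a chosen depth. Everything in it is hypothetical: the `broad_crawl` function, the `SITE` link table and the `max_depth` parameter are invented for this example, and a real crawler would fetch pages over HTTP rather than read them from a dictionary.

```python
from collections import deque


def broad_crawl(start, get_links, max_depth=2):
    """Breadth-first crawl that captures pages up to max_depth levels
    below the start page and returns (page, depth) pairs in visit order."""
    seen = {start}
    queue = deque([(start, 0)])
    captured = []

    while queue:
        page, depth = queue.popleft()
        captured.append((page, depth))

        # Pages at the depth limit are captured, but their links
        # are not followed any further.
        if depth == max_depth:
            continue

        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

    return captured


# Hypothetical toy site: each page maps to the pages it links to.
SITE = {
    "/": ["/news", "/about"],
    "/news": ["/news/2019", "/"],
    "/news/2019": ["/news/2019/item"],
    "/about": [],
    "/news/2019/item": [],
}

captured = broad_crawl("/", lambda page: SITE.get(page, []), max_depth=2)
```

With `max_depth=2`, the home page and two further levels are captured, while `/news/2019/item`, three levels down, is left out. A selective collection would correspond to running the same walk with a larger depth limit and on a more frequent schedule.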

The premise of our PROMISE project would be to follow international examples and do a broad crawl in order to take a sample of the Belgian Web, which would in turn be supplemented with in-depth collections that are linked to the core missions of the State Archives and the Royal Library.

Of course, the fact that a web archiving policy and infrastructure for Belgium still need to be put in place does not mean that all of the historical Belgian Web has been lost. A number of Belgian pages can be found through the Wayback Machine. However, the version of the Wayback Machine that is freely available online only offers limited search functionalities. The Internet Archive also offers a paid service that allows institutions to create a ‘national web archive portal’ based on the collections in their Wayback Machine. All pages pertaining to a specific national web are grouped and made available as a corpus with superior search and analysis functionalities. The Royal Library is currently looking into this service.

How does web archiving work from a technical perspective and what challenges does this pose?

As I mentioned earlier, there are three ways of archiving Web content. The first is client-side archiving, meaning that the crawler goes through the HTTP protocol to gather and copy content and responses from the server, just like a browser would. A lot of things can go wrong here. A website is a very complex object to archive in this manner. Different file types are present, and embedded and dynamic content such as media feeds can be especially challenging.

A second approach is transaction archiving, where a screen capture is made of the user accessing and going through the content on the site.

The third option is server-side archiving, which requires direct access to the server. This type of archiving hinges on direct participation of partner organizations, who have to allow that files are copied from their servers.

I would also like to point out the special case of social media harvesting. Each social medium requires its own approach, usually through a specific API, and the settings change very often. Within PROMISE, harvesting social media is considered to be out of scope, although some web archiving initiatives in other countries are doing it.

Which brings us to the legal implications and framework for archiving Web data. Could you expand a bit more on those aspects of the project?

Creating a Web archive indeed involves a number of legal issues. Our legal partner in the PROMISE project, CRIDS, is still working hard on analyzing all of these aspects, which include copyright legislation, privacy legislation (GDPR), the legal framework ensuring the integrity of the content (so that it might be used before a court), provisions against involuntarily capturing illegal content, delimiting the national scope and responsibilities, etc. What should also be covered is the access rights to the archive once it is established. Who will be able to consult the information and under which conditions?

Do traditional archiving principles and methodologies then still translate to the online context or should this type of material be approached as something completely different?

If we take a long-term perspective, it can be seen that archival methods and approaches have already evolved because of the emergence of digital-born archives. Projects like the InterPares project, which focused on developing the theory and methods surrounding the preservation of authentic digital-born records, have contributed a lot to the evolution of archival procedures and to a more profound understanding of the different concepts of integrity that apply to paper and digital records. A digital record is a multi-faceted artifact that consists of contents (for instance a text), the file type (such as an MS Word doc file) and a support or medium (like a floppy disk). These objects are also dynamic: contents can migrate to another file format, supports can be renewed, etc. A paper document by contrast is much more final and static.

Within the realm of digital-born archives, a web archive is particularly challenging because of its size. We also need to think about tools for selection and preservation, the indexation and description of the information, and metadata management. An important difference between a web archive created by means of crawling and other digital-born archives is that you reproduce websites: you are not storing source files but a copy, thus creating the archival record yourself. Furthermore, the content is dynamic, meaning that it is very difficult to guarantee the integrity of the captured content in relation to the live Web. Finally, decisions have to be made about the periodicity and depth of the archival efforts, for instance, how many times a year will content be crawled.

We are also working on the technical infrastructure that will be required to support the archive, for instance in terms of storage. We are aiming to store copies of the data in different locations. The same holds for data access and reuse: we are striving to make archived governmental websites freely accessible, but in other cases copyright might be a limiting factor with regard to access. The legal analysis within the PROMISE project is currently still ongoing. Some web archiving initiatives abroad are cautious and only offer access in the reading room. This might also be an option for the PROMISE project.

What are some of the ideological or political implications of having a Belgian web archive? Could it for instance be used to hold politicians accountable?

If you can guarantee the integrity and authenticity of the data, the Web archive could be a valuable resource. But to set the bar this high, you need to have a lot of quality control in place. It is possible to consider a Web archive as containing content that is legally binding and evidentiary, but you would need a lot of resources to do the necessary checks and find different solutions to capture missing content, which is currently not the case for the PROMISE project. From a heritage point of view, such accountability would be the dream: being able to capture every change in content and to assure that the data can be used as viable evidence before the courts would make the archive a very valuable and powerful resource.

Will the existence of a Web archive change Web publishing practices?

This is very difficult to answer at this point in time, but it can already be seen that people are becoming more aware of their personal data and what they put on the Web, which may have a perceivable influence in the long run.

If you had to recommend one book to the readers, which book would that be?

I would highly recommend The Web as History. Using Web Archives to Understand the Past and the Present, edited by Niels Brügger and Ralph Schroeder and published in 2017, which addresses various aspects of the archived Web. It is also freely accessible online, which is an additional benefit.
