Why Software Heritage is creating a global software archive
- Software Heritage was launched in 2015 as a digital preservation initiative: a "Library of Alexandria" for the precious modern asset of software source code.
- Its ambition: to collect, preserve and share all publicly available software in source code form.
- Open-source software is ubiquitous in IT products these days, for example, in IoT devices, phones, cars and cameras.
- Preserving source code and software is also important for research and industry, as it is increasingly used in these fields.
- Like the great libraries of the ancient world, its aim is to accumulate as much knowledge as possible in one place.
Software Heritage was launched in 2015 as a digital preservation initiative. Its ambition: to collect, preserve, and share all software that is publicly available in source code form. This universal software archive will help guarantee the reliability and originality of source code, so that the “official”, unmodified versions are preserved indefinitely, regardless of any subsequent changes that may be made.
Such preservation is important because software takes a lot of intellectual energy to create and contains advanced technical knowledge, in the form of algorithms, that can be understood only by reading the source code. “This knowledge can contain innovations, so the source code of some software can be as innovative as a scientific paper or as a patent,” explains Stefano Zacchiroli of Télécom Paris, one of the founders of Software Heritage. The need to prevent the loss of this technical knowledge has been acknowledged by UNESCO, as part of the Paris Call on Software Source Code as Heritage.
Preserving source code and software is also important for research and industry, since software is increasingly used in these domains. Indeed, a large part of the technical and scientific knowledge developed today resides in software, which must therefore be preserved to guarantee the reproducibility of experiments and results – the basis of the scientific method. This approach is already seen in movements like Open Access, which ensures that scientific papers are available in the long term and accessible to everyone. It is also embodied in the open data movement, whose aim is to keep scientific data open and universally shared.
Software Heritage has assembled the largest public archive of software in source code form, comprising more than 10 billion unique source code files and more than two billion commits – the internal revisions of software used by developers – harvested from more than 160 million development projects. Among the most famous are the source code of the Apollo 11 navigation system, which allowed humans to go to the Moon, and that of the NCSA Mosaic browser, which popularised the use of the Web. The size of the archive is currently about one petabyte, which, while big, is not as big as archives of videos, for instance. The project was founded in 2015 by Roberto Di Cosmo (Inria and Université de Paris) and Stefano Zacchiroli (Télécom Paris) in collaboration with Inria and UNESCO. It now has a number of sponsors from both the private and public sectors.
Software Heritage contributes to these open-science efforts through the long-term archival of software in source code form. Sometimes the software is itself a major scientific breakthrough and, as such, contains valuable knowledge that needs to be preserved for future reuse.
Open-source software is also ubiquitous in IT products – for example, in IoT devices, phones, cars and cameras. The difficulty here is that anyone can modify this software, so variants of a piece of software become an integral part of new devices. Software Heritage provides a place where this software can be stored over the long term, with identifiers that can be used to recognise the specific version of a piece of software originally installed in a given device. This is very important for tracking vulnerabilities and identifying newer products that need to be “fixed”.
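To give an idea of how an identifier can pin down one exact version of a file, here is a minimal sketch in Python. It assumes an identifier derived purely from the file's content, using the same hashing scheme as a Git blob, and formats it in the style of a Software Heritage persistent identifier for file contents ("swh:1:cnt:" followed by the hash). The function names are hypothetical and chosen for illustration; this is not the archive's own implementation.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Hash file content the way Git hashes a blob: header, NUL byte, content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_identifier(data: bytes) -> str:
    """Build a content-derived identifier (illustrative helper, assumed format)."""
    return "swh:1:cnt:" + git_blob_sha1(data)


# Two variants of the "same" file, e.g. an original and a vendor-modified copy.
original = b"int timeout = 30;\n"
modified = b"int timeout = 60;\n"

# Identical bytes always give the same identifier; any change gives a new one.
print(content_identifier(original))
print(content_identifier(modified))
```

Because the identifier depends only on the bytes of the file, anyone holding a copy can recompute it and check whether it matches the version that was archived.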
How is the database constructed?
Software Heritage mainly archives source code by crawling the public platforms that developers use to collaboratively develop open-source software. The best known of these are GitHub and GitLab. Another way is to involve researchers, who actively “push” software to the archive, for instance through HAL, a popular, publicly funded open-access platform used by the scientific community in France to deposit papers as preprints.
The key point to note here is that software, and particularly open-source software, is massively duplicated these days. The same piece of source code can be found in thousands or millions of different places on the Internet at the same time.
To address this challenge, Software Heritage structures the archive as a giant graph (a Merkle DAG structure), which is entirely de-duplicated. This means that if the same source code file is stored in thousands or millions of different places, it will be archived only once, while keeping track of all the different places it is linked from. This is the case not only for individual files, but also for entire source code directories and commits, which can be very big for substantial pieces of software. It is also true for software releases.
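The de-duplication idea can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the archive's actual design: the `Archive` class, its method names and the choice of SHA-256 are all hypothetical. Each object is stored under a hash of its content, a directory's hash is computed from the hashes of its entries (which is what makes the structure a Merkle DAG), and the places an object was found are recorded separately.

```python
import hashlib


class Archive:
    """Toy content-addressed store: identical objects are kept only once."""

    def __init__(self):
        self.objects = {}  # object id -> (kind, payload)
        self.origins = {}  # object id -> set of URLs where it was seen

    def _put(self, kind: str, payload: bytes) -> str:
        # The id is derived from the content, so duplicates map to the same key.
        oid = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
        self.objects.setdefault(oid, (kind, payload))
        return oid

    def add_file(self, data: bytes) -> str:
        return self._put("file", data)

    def add_directory(self, entries: dict) -> str:
        # A directory is described by its sorted (name, child id) pairs,
        # so two directories with identical contents get the same id.
        lines = [f"{name}\t{child}" for name, child in sorted(entries.items())]
        return self._put("dir", "\n".join(lines).encode())

    def record_origin(self, url: str, root_id: str) -> None:
        # Keep track of every place the same object is linked from.
        self.origins.setdefault(root_id, set()).add(url)


archive = Archive()
for url in ("https://example.org/project-a", "https://example.org/fork-of-a"):
    file_id = archive.add_file(b"print('hello')\n")
    root_id = archive.add_directory({"hello.py": file_id})
    archive.record_origin(url, root_id)

print(len(archive.objects))                # number of distinct objects stored
print(sorted(archive.origins[root_id]))    # every place the directory was seen
```

In this sketch the second project adds nothing new to `objects`; only its URL is added to the set of origins for the shared directory.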
Doing this is essential for keeping the size of the archive “small”, that is, for minimizing the duplication of information that needs to be saved. It is also useful for scientific use cases because, by looking at this global graph of public code, one can see who else has referenced a given piece of software and perhaps used it to create something else. In a sense, this graph makes it possible to measure the impact of software developed by researchers and open-source developers.
“Software Heritage is a ‘great library of source code’, analogous to the great libraries of the ancient world,” says Zacchiroli. “The aim of these ancient libraries was to accumulate as much knowledge as possible in a single place. Software Heritage is the Great Library of Alexandria for the precious modern good that is software source code.”
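One way to picture this kind of impact measurement is to walk the graph "upwards" from a file to everything that contains it. The Python sketch below assumes the graph is available as a simple list of (parent, child) edges; the node names and both function names are made up for illustration and do not reflect how the archive is actually queried.

```python
from collections import defaultdict, deque


def build_reverse_index(edges):
    """Map each node to the set of nodes that directly reference it."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return parents


def ancestors(node, parents):
    """Return every node that references `node`, directly or indirectly."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical graph: two projects share one directory containing file-x.
edges = [
    ("project-a", "dir-1"),
    ("project-b", "dir-1"),
    ("dir-1", "file-x"),
    ("project-c", "file-y"),
]
index = build_reverse_index(edges)
print(sorted(ancestors("file-x", index)))
```

Because each object is stored only once, every project that reuses a file ends up pointing at the same node, which is what makes this kind of reverse lookup meaningful.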
People can visit the Software Heritage library to find the code they are interested in, perhaps because it has disappeared from its original hosting place, or perhaps to analyse the full extent of knowledge stored in it.
Interview by Isabelle Dumé
For more information: https://www.softwareheritage.org/