Unlock the Power of Software Heritage Archive
How to use the Software Heritage universal source code archive and why it is important for software engineers all over the world.
Software Heritage provides a service for archiving and referencing historical and contemporary software — with a focus on human-readable source code.
This is how the Wikipedia article about Software Heritage (SWH) describes it. The description is concise, but it does not make clear what problem SWH is solving. Let me show you an example of the problem so that the SWH initiative is easier to understand.
If you are a researcher, scientist, or tech writer (like me), this example may be familiar to you. Imagine that some time ago you wrote an article that referenced other articles and some source code. The reference could be just a web link to GitHub, GitLab, or another place. The problem is that you can’t guarantee that the link will always exist or that the source code snippet behind it won’t change.
This means it would be great to have a place where you, as a researcher, can reference source code in your articles and know for certain that the link won’t break and that the source code will stay the same as when you used it.
A similar problem was encountered by Roberto Di Cosmo, the president of Software Heritage. He discovered that the reference to the source code in one of his scientific articles was broken. This experience led Roberto to the idea of creating persistent storage for all public software, gathered from open-source operators and other publicly accessible source code hosts. This kind of archive can provide robust references to source code: the links won’t break, and the historical data won’t be changed or deleted.
In this article, I’m going to show you how to use the Software Heritage service. I will walk you through the process of archiving and referencing code. I split this small tutorial into two parts, one for each role likely to use the archive: the software engineer who archives code and the researcher who references it.
What the Archive Looks Like
I’m a typical software engineer, and of course I have a GitHub account with a bunch of public repositories that could potentially be archived. When I checked the SWH archive, I was surprised to discover that all my public repositories were already there. You can check yours, and most likely you will see the same.
However, your public source code might not be archived yet. In that case, you can always archive it manually. So, let’s do it.
How to Archive Code
Let’s consider an example: RollingNumbers, an open-source library that I created some time ago. Let’s archive it together. Based on the SWH tutorial, I need to prepare this public repository first. For that, I need to have (ideally) three files there:
- README
- LICENSE
- AUTHORS
Optionally, we can add another file called codemeta.json. You can generate it using the CodeMeta generator.
The README file should be familiar to you. You can generate the LICENSE file using GitHub. You can find an AUTHORS example in the Google Open Source documentation. Keep in mind that AUTHORS and LICENSE should be plain text files.
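Before submitting, it can be handy to check that the repository root really contains these files. Here is a minimal sketch of such a check. The function name missing_files and the matching rule (ignore case and file extension, so README.md counts as README) are my own assumptions for illustration, not something SWH prescribes.

```python
from pathlib import Path

# Base names recommended in the text above; codemeta.json is the optional extra.
RECOMMENDED = ("README", "LICENSE", "AUTHORS")


def missing_files(repo_dir, names=RECOMMENDED):
    """Return the names that have no matching file at the repository root.

    A file matches when its full name or its name without the extension
    equals the wanted name, ignoring case (README.md counts as README).
    """
    files = [p for p in Path(repo_dir).iterdir() if p.is_file()]
    present = {p.name.upper() for p in files} | {p.stem.upper() for p in files}
    return [name for name in names if name.upper() not in present]
```

Calling missing_files(".") from the root of a cloned repository returns the list of files that still need to be added. An empty list means the repository is prepared.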
There are several ways to save a repository to the archive:
- Manually using the Updateswh browser extension
- Manually using the URL form on the SWH website
- Automatically via the SWH API (or using a GitHub action)
We are going to use manual saving via the URL form. But here is an important thing to know: you shouldn’t use the URL of your repository’s web page. It should be the link you use to clone the project with git clone. In my case, that is the clone URL of the RollingNumbers repository.
I use git as the origin type and the URL of the RollingNumbers git repository. After I click Submit, the request is scheduled along with plenty of other requests like mine.
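The automatic option from the list can be sketched in a few lines of Python. This assumes the save endpoint of the public SWH Web API has the form /api/1/origin/save/<visit_type>/url/<origin_url>/, which is my reading of the API documentation, so please check it before relying on it. The helper names are hypothetical, and the origin URL in the usage note is the one that appears in the permalink later in this article.

```python
import json
import urllib.request

# Assumed root of the public SWH Web API.
API_ROOT = "https://archive.softwareheritage.org/api/1"


def save_request_url(origin_url, visit_type="git"):
    """Build the URL of a save request for the given origin.

    The origin URL is appended as-is, mirroring the examples in the API docs.
    """
    return f"{API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"


def request_save(origin_url, visit_type="git"):
    """Send the save request (a POST) and return the decoded JSON response."""
    request = urllib.request.Request(
        save_request_url(origin_url, visit_type), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, request_save("https://github.com/maxkalik/RollingNumbers") would schedule the same kind of request as the form. The API is rate limited for anonymous users, so it is better not to call it in a loop.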
After saving, we can check what the archive of the repository looks like. Visually, it resembles a typical GitHub repository, with the README markdown rendered below.
You might say that your source code is not important enough to archive, or that it’s still a work in progress and not ready to be saved. But SWH is not a place only for ready-to-use software or “dead” code. SWH archives everything, regardless of whether your code is ready or not.
How to Reference Code
Now, let’s play the role of a researcher who is going to use source code references in an article. For instance, I’m going to use references right in this article, and my target will be the just-archived RollingNumbers source code.
First, we need to find this repository in the archive using Search.
As you can see, the text field accepts an SWHID (don’t worry about it for now) or just a string, such as the name of the repository. Once you find the repository, you will see the Permalinks button on the right-hand side of the page. Click it, and you will be offered the choice of copying either the identifier or the permalink.
The term permalink is self-explanatory. You can click on the RollingNumbers permalink, and you will be brought to the archive of the project directory.
You can also get a reference to a code fragment. To do this, click on the first line number and then Shift-click on the last line number of the code snippet.
What makes references to a source code archive permanent? How can researchers guarantee that the references in their articles are stable? To understand this, we need to look at the term SWHID: Software Heritage Identifier.
Here is an example of a permalink to a code snippet from the DigitLayer.swift file of my RollingNumbers archive:
https://archive.softwareheritage.org/swh:1:cnt:ba62f3e9e8ad0a1026fd8a39a4654a14cb385b4e;origin=https://github.com/maxkalik/RollingNumbers;visit=swh:1:snp:fd32172c9a4434090c7fba7edb24f88b9b9f4fed;anchor=swh:1:rev:2e2f802a88ac1ee5d5e39f88e392a6dd4f2a6fd0;path=/Sources/RollingNumbers/DigitLayer.swift;lines=11-29
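An SWHID is a reference to an archived object: a file, a directory, a commit, and so on. Let’s strip all the additional parameters from this permalink and take a look at the core identifier:
swh:1:cnt:ba62f3e9e8ad0a1026fd8a39a4654a14cb385b4e
There are four base components, separated by colons:
- swh: the scheme, which is the same for every SWHID
- 1: the version of the identifier scheme
- cnt: the object type
- ba62f3e9e8ad0a1026fd8a39a4654a14cb385b4e: the object identifier, a hash computed from the object itself
Skip the first two components and take a look at the third part, object_type. This is the category of archived object that the identifier points to. There are several object types: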
- cnt (contents): use this when you want to reference a particular file or a code snippet;
- snp (snapshots): use this when you want to reference the full state of a repository, including branches, commits, tags, etc.;
- rel (releases): references a release version;
- rev (revisions): use this when you want to reference a single commit from your repository;
- dir (directories): references a directory tree, for example the current version of a repository.
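To make the structure of the core identifier concrete, here is a small sketch that splits it into its colon-separated parts and rejects anything that does not look like one. The function name parse_core_swhid is hypothetical, and the pattern only covers version 1 identifiers with a 40-character hexadecimal object id, which is what the example from my permalink looks like.

```python
import re

# The five object types listed above, keyed by their short tag.
OBJECT_TYPES = {
    "cnt": "content",
    "snp": "snapshot",
    "rel": "release",
    "rev": "revision",
    "dir": "directory",
}

# Scheme "swh", version 1, one of the known tags, then 40 lowercase hex digits.
CORE_SWHID = re.compile(r"^swh:1:(cnt|snp|rel|rev|dir):([0-9a-f]{40})$")


def parse_core_swhid(text):
    """Split a core SWHID into its components, or raise ValueError."""
    match = CORE_SWHID.match(text)
    if match is None:
        raise ValueError(f"not a core SWHID: {text!r}")
    object_type, object_id = match.groups()
    return {
        "scheme": "swh",
        "version": 1,
        "object_type": object_type,
        "object_kind": OBJECT_TYPES[object_type],
        "object_id": object_id,
    }
```

Passing swh:1:cnt:ba62f3e9e8ad0a1026fd8a39a4654a14cb385b4e to parse_core_swhid gives back the type cnt, which tells us the identifier points to the contents of a single file.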
Another important part of an SWHID is the context parameters (called qualifiers in the documentation):
origin=https://github.com/maxkalik/RollingNumbers;visit=swh:1:snp:fd32172c9a4434090c7fba7edb24f88b9b9f4fed;anchor=swh:1:rev:2e2f802a88ac1ee5d5e39f88e392a6dd4f2a6fd0;path=/Sources/RollingNumbers/DigitLayer.swift;lines=11-29
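As you can guess, these qualifiers describe, much like query parameters, what exactly will be fetched from the archive: the origin repository, the snapshot (visit) and revision (anchor) in which the file was found, the path to the file (DigitLayer.swift), and the range of lines of the code snippet.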
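Since the qualifiers are just key=value pairs separated by semicolons, they are easy to pull apart in code. The sketch below is only an illustration with hypothetical helper names. It assumes that values never contain a raw semicolon, and it tolerates spaces after the separators.

```python
def split_qualified_swhid(text):
    """Split 'core;key=value;...' into the core SWHID and a dict of qualifiers."""
    core, *rest = [part.strip() for part in text.split(";")]
    qualifiers = {}
    for item in rest:
        if not item:
            continue  # tolerate a trailing semicolon
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = item.partition("=")
        qualifiers[key] = value
    return core, qualifiers


def line_range(qualifiers):
    """Return the (first, last) line numbers from the 'lines' qualifier."""
    start, _, end = qualifiers["lines"].partition("-")
    return int(start), int(end or start)
```

Applied to the identifier from my permalink, split_qualified_swhid returns the core SWHID plus a dictionary with the origin, visit, anchor, path, and lines keys, and line_range turns the lines value into a pair of integers.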
In simple terms, SWHIDs make references to archived source code permanent, so you can be sure the referenced code won’t change.
You can find the full explanation in the SWHID documentation.
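One way to see why the referenced code can’t change silently: the object id is computed from the object itself. For contents (cnt), as I understand the SWHID specification, the id is the same hash that Git uses for a blob, that is, the SHA-1 of a short header followed by the file bytes. Here is a minimal sketch under that assumption. The function name content_swhid is mine.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the cnt SWHID of some file contents, Git-blob style."""
    # Git hashes a blob as: b"blob <size in bytes>\0" followed by the raw bytes.
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()
```

If even one byte of the file changes, the identifier changes too, so an SWHID can only ever point to exactly the contents it was computed from. You can also read a file from your own checkout in binary mode, pass its bytes to content_swhid, and compare the result with the identifier shown in the archive.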
In conclusion, Software Heritage (SWH) serves as a universal source code archive and provides a crucial service for archiving and referencing historical and contemporary software. By offering a persistent storage solution, SWH addresses the problem of broken or changed references to source code, ensuring that researchers, tech writers, and other users can confidently reference and access source code without concerns about link stability or code alterations.
Through manual or automated archiving methods, software engineers can contribute to the software history by preserving their public repositories in the SWH archive. Researchers can utilize SWH’s search functionality and SWHIDs to obtain permanent and reliable references to specific files, code snippets, commits, snapshots, releases, and more. By leveraging SWH, individuals in the software development industry can promote software persistence and enhance the integrity of source code references in their articles, research papers, and other documentation.