The GitHub Archive Program aims to preserve open source code for at least 1,000 years by storing multiple copies, on an ongoing basis, across various data formats and locations.

Editor’s note: This article comes from “InfoQ” (WeChat public account ID: infoqchina). Author: Zhao Yu Ying.

GitHub Universe 2019, GitHub’s annual developer conference, has officially opened. At the conference, GitHub announced the GitHub Archive Program, a long-term code preservation effort that aims to protect open source software for at least 1,000 years by storing multiple copies, on an ongoing basis, across various data formats and locations.

GitHub launches a program to preserve code for at least a thousand years

The permanent code preservation program

Open source software has driven the development of science and technology and is part of the common heritage of all humankind. The mission of the GitHub Archive Program is to preserve this software for future generations. To do so, GitHub is working with the Long Now Foundation, the Internet Archive, the Software Heritage Foundation, Arctic World Archive, Microsoft Research, the Bodleian Library, and Stanford Libraries to protect the code by storing multiple copies, on an ongoing basis, across various data formats and locations, including a very long-term archive designed to last at least 1,000 years.

GitHub says that although a global catastrophe is unlikely, anything stored on today’s devices and platforms could be lost within a few generations, and that archiving software across multiple organizations and forms of storage will help ensure its long-term preservation. Archivists call this approach “LOCKSS”: Lots Of Copies Keeps Stuff Safe.
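The idea that many copies keep data safe can be illustrated with a small sketch: hash every copy, treat the most common digest as the reference, and flag the locations that disagree. This only illustrates the principle and is not the protocol used by any of the archives named here; the function name and the location labels are hypothetical.

```python
import hashlib
from collections import Counter


def audit_copies(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Return the majority SHA-256 digest and the locations that disagree with it."""
    # Hash each copy so that copies can be compared without trusting any one of them.
    digests = {loc: hashlib.sha256(data).hexdigest() for loc, data in copies.items()}
    # The digest held by the largest number of locations is taken as the reference.
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    # Any location whose digest differs is a candidate for repair from a good copy.
    damaged = sorted(loc for loc, d in digests.items() if d != majority_digest)
    return majority_digest, damaged
```

The more independent copies there are, the less likely it is that a majority of them fail in the same way, which is the reasoning behind spreading the archive across several organizations and media.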

Today, much of the world’s knowledge is stored on ephemeral media: hard drives, SSDs, and CDs last a few decades, and backup tape nominally lasts only 30 years, and only under strict temperature and humidity control. Even if the hardware survives, the software needed to read it may be obsolete by then. The GitHub Archive Program is intended as a longer-term plan to address this risk of future data loss.

The program also gives developers affected by access restrictions an alternative. If GitHub is unavailable in some places, affected developers can use the Internet Archive and the Software Heritage Foundation to access the public code of their projects.

Where is the code stored, and how?

Inspired by the “pace layers” idea of Long Now founder Stewart Brand, GitHub uses a tiered strategy to archive code. By providing a range of solutions from real-time to long-term storage, this approach is designed to maximize both flexibility and durability. The archives fall into three tiers: hot, warm, and cold:

Hot: updated in near real time

Warm: updated monthly or yearly

Cold: updated every 5 years

GitHub (the storage options below are listed from hot to cold; GitHub itself belongs to the hot tier)

Every time you push to GitHub, GitHub replicates the Git data to multiple data centers around the world. In addition, Git data, issues, pull requests, and other data are backed up to multiple locations, and all of this data is available in real time through the GitHub API.
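As a minimal sketch of what reading this data through the GitHub API can look like, the snippet below prepares requests against the public REST endpoints under https://api.github.com/repos/. The helper names are hypothetical, the repository used in the usage note is only an example, and unauthenticated access is assumed, which is subject to rate limits.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def repo_request(owner: str, repo: str, resource: str = "") -> urllib.request.Request:
    """Prepare (but do not send) a GET request for a public repository resource.

    resource may be "" (repository metadata), "issues", or "pulls".
    """
    url = f"{API_ROOT}/repos/{owner}/{repo}"
    if resource:
        url += f"/{resource}"
    return urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})


def fetch_json(request: urllib.request.Request):
    """Send the request and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

For example, fetch_json(repo_request("octocat", "Hello-World", "issues")) would return the open issues of that public repository as a list of JSON objects.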

GHTorrent

GHTorrent monitors the GitHub public event timeline, archives those events, and recursively retrieves their contents and dependencies. The archives are available as daily or monthly downloads.

GH Archive

GH Archive monitors the GitHub public event timeline, archives these events, and makes them queryable using BigQuery. Developers can download snapshots by hour, day, or month.
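To give a feel for the hourly snapshots, the sketch below builds the download URL for one hour and counts events by type in a gzip-compressed, newline-delimited JSON file. The URL pattern (https://data.gharchive.org/YYYY-MM-DD-H.json.gz) is an assumption based on GH Archive’s public documentation and should be checked before use; the helper names are hypothetical.

```python
import gzip
import io
import json
from collections import Counter
from datetime import datetime


def hourly_archive_url(hour: datetime) -> str:
    """Build the URL of one hourly snapshot (the URL pattern is an assumption)."""
    # The hour component is assumed not to be zero-padded, e.g. 2020-02-02-5.json.gz.
    return f"https://data.gharchive.org/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def count_event_types(gz_bytes: bytes) -> Counter:
    """Count events by their "type" field in a gzip-compressed NDJSON payload."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # ignore blank lines
                counts[json.loads(line)["type"]] += 1
    return counts
```

Downloading the file is left out on purpose so that the sketch runs offline; the bytes of a downloaded snapshot can be passed straight to count_event_types.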

Wayback Machine

The Internet Archive’s Wayback Machine crawls GitHub’s public repositories (including new repositories, issues, pull requests, wikis, etc.) and stores copies on hard drives in San Francisco and elsewhere, where they are publicly available via git and https.

Software Heritage Foundation

The Software Heritage Foundation periodically crawls GitHub, adds its public repositories to the Software Heritage archive, and makes them accessible through a public API.

Bodleian Library

Oxford’s Bodleian Library adds redundancy to the Arctic Code Vault by keeping a copy of 10,000 of GitHub’s most-starred and most-depended-upon repositories in Piql film format.

GitHub Arctic Code Vault (Arctic)

On February 2, 2020, GitHub will take a snapshot of every active public repository and store it in the GitHub Arctic Code Vault. The data will be written to film reels up to 3,500 feet long, supplied and encoded by the Norwegian company Piql, which specializes in very long-term data storage. The film is made of silver halides on polyester. According to ISO measurements, this medium has a lifespan of 500 years, and simulated aging tests suggest it may last twice as long.

The Arctic Code Vault is a data repository in the Arctic World Archive (AWA), 250 meters deep in the permafrost of an Arctic mountain. The archive is located in a decommissioned coal mine in Svalbard, closer to the North Pole than to the Arctic Circle.

Svalbard is governed by the international Svalbard Treaty and is a demilitarized zone. It is home to the northernmost town in the world and is one of the most remote and geopolitically stable human settlements on Earth. AWA is a joint initiative of the Norwegian state-owned mining company Store Norske Spitsbergen Kulkompani (SNSK) and the long-term digital storage provider Piql AS. AWA is dedicated to long-term archival storage: the film reels will be kept in a steel-walled container inside a sealed chamber of a decommissioned coal mine in the remote Svalbard archipelago.

Although Svalbard is affected by climate change, for the foreseeable future this is expected to affect only the outermost few meters of permafrost, and warming is not expected to threaten the stability of the mine. The mine is about a mile from the renowned Global Seed Vault, which reinforces Svalbard’s standing as a stable, long-term archive of humanity’s collective knowledge.

The 02/02/2020 snapshot archived in the GitHub Arctic Code Vault will include every active public GitHub repository, as well as significant dormant repositories selected on the basis of stars, dependencies, and the recommendations of an advisory panel. The snapshot will consist of the HEAD of the default branch of each repository, minus any binaries larger than 100KB, and each repository will be packaged as a single TAR file.
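The packaging step can be sketched roughly as follows: take a checked-out working tree, skip Git’s own metadata and oversized files, and write everything else into one TAR file. This is an illustration under stated assumptions, not GitHub’s actual tooling. The function name is hypothetical, and for simplicity the sketch skips every file over 100KB rather than trying to detect which files are binaries.

```python
import tarfile
from pathlib import Path

MAX_FILE_BYTES = 100 * 1024  # 100KB cutoff mentioned in the article


def package_snapshot(checkout_dir: str, tar_path: str) -> list[str]:
    """Pack a checked-out working tree into one TAR file, skipping large files.

    Returns the archived file names, relative to checkout_dir.
    """
    root = Path(checkout_dir)
    archived = []
    with tarfile.open(tar_path, "w") as tar:
        for path in sorted(root.rglob("*")):
            rel = path.relative_to(root)
            # Skip Git's own metadata and anything that is not a regular file.
            if ".git" in rel.parts or not path.is_file():
                continue
            # Simplification: size is the only criterion, binary or not.
            if path.stat().st_size > MAX_FILE_BYTES:
                continue
            tar.add(path, arcname=rel.as_posix())
            archived.append(rel.as_posix())
    return archived
```

The TAR file should be written outside the checkout directory so that it does not end up inside its own archive.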

To improve data density and integrity, most of the data will be stored QR-encoded. A human-readable index and guide will itemize the location of each repository and explain how to recover the data.

Microsoft Research Project Silica

The GitHub Archive Program is also working with Microsoft’s Project Silica to write all active public repositories into quartz glass using a femtosecond laser (a laser with extremely short pulses), with the goal of preserving them for more than 10,000 years.

How will this code be retrieved in the future?

GitHub is convening an advisory panel for the GitHub Archive Program, made up of experts in anthropology, archaeology, history, linguistics, archival science, futurism, and other fields, to advise on what should be included in the archive and how best to communicate with those who inherit it.

The guide to the archive will include technical instructions on QR decoding, file formats, character encodings, and other key metadata, so that the raw data can be converted back into source code for others to use in the future. The archive will also include a “Tech Tree”: a roadmap and Rosetta Stone for the curious minds of the future who inherit the archive’s data.

An overview of the archive and how to use it, the “Tech Tree” will serve as a quick-start manual for software development and computing, and