Running “Warrior” crowd web archiving software on Proxmox

Link to the original blog post with a nice text/pictures layout.

Intro

It’s like SETI@home, but for archiving the web.

If, like me, you hoard information, this project will be to your liking.

The “Warrior” software (aka ‘virtual appliance’) by ArchiveTeam (1) helps preserve digital heritage by scraping and storing disappearing websites (Blogger blogs, Telegram pages, Reddit pages, GitHub, Pastebin, Imgur, etc.). It’s a group computing effort similar to SETI@home (which no longer distributes tasks), except that it archives web pages instead of searching for extraterrestrial signals. Similarly, the BOINC crowd computing software helps with computing problems in medicine, astronomy, physics, earth sciences, chemistry, etc.

I had some spare CPU cycles on my Proxmox “server” (it’s an old laptop in reality) and I thought it would be nice to help future digital archaeologists.

The problem

The Warrior software is available for VirtualBox and Docker, but I run Proxmox. There are no clear instructions on how to run it in Proxmox.

The solution

After some light tinkering and following YouTube instructions (by apalrd), I managed to run it in Proxmox.

1. I downloaded the virtual appliance (OVA, Open Virtual Appliance format) from ArchiveTeam’s GitHub repository, file: archiveteam-warrior-v3.2-20210306.ova.

2. I followed the instructions from this YouTube video to convert the .ova file into something Proxmox can use. The author imports the MikroTik virtual appliance, but the same process can be applied to the Warrior .ova image with some minor modifications.

3. I first created a new VM (not LXC) in Proxmox, named it ‘warrior’, gave it 2 cores and 1 GB of RAM, and selected ‘do not use any media’.
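If you prefer the host shell over the GUI, the VM creation can probably be sketched with `qm create`. This is only an illustration, not what I actually ran: the VM ID 109 is the one used later in this post, while the `vmbr0` bridge and the `virtio` network model are assumptions (they are the usual Proxmox defaults). The script only prints the command so you can review it first.

```shell
#!/bin/sh
# Sketch: print the qm command that mirrors the GUI choices (name, cores, RAM).
VMID=109          # VM ID used in the later steps
NAME=warrior
CORES=2
MEMORY_MB=1024    # 1 GB
BRIDGE=vmbr0      # ASSUMPTION: default Proxmox bridge; the post does not name one

create_cmd() {
    echo "qm create $VMID --name $NAME --cores $CORES --memory $MEMORY_MB --net0 virtio,bridge=$BRIDGE"
}

create_cmd
```

A VM created this way has no disk or CD-ROM drive attached, so there should be nothing left to detach afterwards.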

4. Detach and remove the disks and CD-ROM drives (VM -> Hardware -> Disk -> Detach, then Remove).

5. Log in to the Proxmox host shell and download the .ova image (go to the Warrior GitHub page and copy the link location):

wget https://github.com/ArchiveTeam/Ubuntu-Warrior/releases/download/v3.2/archiveteam-warrior-v3.2-20210306.ova

6. Extract .ova (it’s just a tar):

tar -xvf archiveteam-warrior-v3.2-20210306.ova

7. Import the disk image (.vmdk) to Proxmox:

qm importdisk 109 archiveteam-warrior-v3.2-20210306-disk001.vmdk local-lvm

This command created 2 disks (60 GB and 32 GB). The larger one is bootable.

109 – the ID of your VM

….vmdk – the name of the unpacked disk image

local-lvm – the name of your local Proxmox storage
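The download, extract and import steps can also be collected in one small script with the VM ID, storage and file names pulled into variables. This is a sketch under assumptions: the `import_cmds` function is my own helper, and the .vmdk name assumes the archive unpacks to `<ova name>-disk001.vmdk`, as it did here. It only prints the commands.

```shell
#!/bin/sh
# Sketch: print the download / extract / import commands for review.
VMID=109                 # the ID of your VM
STORAGE=local-lvm        # the name of your Proxmox storage
OVA=archiveteam-warrior-v3.2-20210306.ova
URL="https://github.com/ArchiveTeam/Ubuntu-Warrior/releases/download/v3.2/$OVA"
DISK="${OVA%.ova}-disk001.vmdk"   # ASSUMPTION: disk name follows the OVA name

import_cmds() {
    echo "wget $URL"
    echo "tar -xvf $OVA"
    echo "qm importdisk $VMID $DISK $STORAGE"
}

import_cmds
```

Once the output looks right, it can be piped to `sh -e` on the Proxmox host so that it stops at the first failing command.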

8. Attach the disk to your VM (I didn’t need to attach it, but if you do, it’s covered in the video). Then adjust its settings (VM -> Hardware -> Hard Disk, double-click).

I changed the following settings:

Bus/Device: SATA
Discard: enabled
SSD emulation: enabled

9. Make it bootable:

VM -> Options -> Boot order, check Enabled

Important: Reorder the disks so that the “sata1” disk (60 GB) comes first; otherwise the system will not boot.
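For reference, the disk settings and the boot order can most likely be set from the host shell with `qm set` as well. Treat this as a sketch: I did it in the GUI, the volume name `local-lvm:vm-109-disk-0` is a placeholder (check the real one with `qm config 109`), and the `order=` boot syntax needs a reasonably recent Proxmox VE release. The script only prints the commands.

```shell
#!/bin/sh
# Sketch: print qm commands matching the GUI disk settings and boot order.
VMID=109
SLOT=sata1                        # the 60 GB disk
VOLUME=local-lvm:vm-109-disk-0    # ASSUMPTION: look up the real name with `qm config 109`

disk_cmds() {
    # discard=on enables Discard, ssd=1 enables SSD emulation
    echo "qm set $VMID --$SLOT $VOLUME,discard=on,ssd=1"
    # put the imported disk first in the boot order
    echo "qm set $VMID --boot order=$SLOT"
}

disk_cmds
```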

10. (Optional) I converted the VM to a template (VM -> Convert to template), cloned it (VM -> Clone) and named the clone ‘warrior-instance1’.
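The template and clone step also has a shell equivalent, sketched below. The new VM ID 110 is an assumption (any unused ID works) and `clone_cmds` is just my helper name. The script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: print the commands that turn the VM into a template and clone it.
TEMPLATE_ID=109
CLONE_ID=110                  # ASSUMPTION: any unused VM ID
CLONE_NAME=warrior-instance1

clone_cmds() {
    echo "qm template $1"
    echo "qm clone $1 $2 --name $3"
}

clone_cmds "$TEMPLATE_ID" "$CLONE_ID" "$CLONE_NAME"
```

Further instances would only need a different clone ID and name.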

11. Start the VM, go to the console, and let it update. When finished, it will print the IP of your server.

12. Open a web browser and:

go to the IP address printed in the console, port 8001
enter your nickname under ‘Your settings’
select a project under ‘Available projects’ and it should start running. I selected ‘ArchiveTeam’s Choice’:

warrior UI showing web scraping progress

13. I reduced the number of VM cores from 2 to 1 and it still runs fine. It uses 30-80% of one CPU core and 500-700 MB of RAM.

Proxmox warrior vm utilization and other stats

14. I spent some time figuring out where to find the archives made by the ArchiveTeam / Warrior software.

If I understand correctly, they can be found in the archive.org collections:

Disclaimer

The links to the products are not affiliate links and I don’t receive any compensation for linking.

The code and the ideas are mostly from YT videos and community forums.

Hashtags: #warrior #archive #proxmox #digitalheritage
