site/content/articles/riscv-cluster.md

16 KiB

+++ title = "Running Nomad and Consul on RISC-V" date = "2023-07-28" summary = "Discussing the process of getting Nomad and Consul running on a RISC-V computer and adding it to a cluster" tags = ["pine64", "star64", "risc-v", "cluster"] +++

I run a cluster of single board computer servers from my house. It mostly consists of aarch64 machines with a few old x86_64 machines I had lying around for the occasional service that doesn't run on ARM.

I've been interested in RISC-V for a while. I love the fact that it's an open ISA unlike x86 and ARM. I believe RISC-V has a lot of potential and I've wanted to develop software for it for a while.

Unfortunately, up until pretty recently, there hasn't been very much useful RISC-V hardware to use that wasn't prohibitively expensive. However, there has recently been a wave of new RISC-V SBCs, so I decided to try adding one to my cluster.

I decided to go with the Star64 from Pine64, mainly because I'm familiar with Pine64 and I'm already part of their developer community with my ITD project. I also wanted to play around with some of the features it has that other JH7110 boards don't, such as the built-in WiFi and Bluetooth and the PCIe slot.

Getting things ready

My cluster runs Nomad and Consul with the Docker driver, which means Nomad, Consul, and Docker must be installed on all nodes. That's usually pretty simple with Hashicorp's and Docker's APT repo, but their repos don't have any builds for riscv64. That's a pretty common occurence because of how new Linux on RISC-V is. Unfortunately, that means I'll have to build each of those projects from source or find them elsewhere.

Before that though, I have to choose a distro to run. Looking at the Software Releases section in Pine64's wiki, there aren't very many options. As of the time of writing this article, the only options on the page are Yocto-based images, Armbian, and NixOS. Since all my nodes run Debian, I wanted to go with a Debian-based image, so I looked at Armbian. Unfortunately, it doesn't have a maintainer at the moment and the images are broken. Out of the two remaining options, I chose to go with the Yocto images instead of NixOS because they use apt like Debian, so I wouldn't have to use different package management commands for all my servers.

While I waited for my Star64 to arrive, I got excited and decided to work on getting Nomad and Consul ready for it early.

Consul

I started with Consul because its docs have instructions for cross-compiling and it seems to only require setting the GOOS and GOARCH environment variables.

So, I cloned the consul repo and ran

make GOOS=linux GOARCH=riscv64 dev

It downloaded some dependencies and started building, but eventually it returned some errors:

# github.com/boltdb/bolt
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/db.go:101:13: undefined array length maxMapSize or missing type constraint
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/db.go:317:12: undefined: maxMapSize
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/db.go:335:10: undefined: maxMapSize
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/db.go:336:8: undefined: maxMapSize
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/db.go:795:2: pos declared and not used
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/bolt_unix.go:62:15: undefined array length maxMapSize or missing type constraint
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/bucket.go:135:15: undefined: brokenUnaligned
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/freelist.go:166:2: idx declared and not used
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/freelist.go:169:19: undefined array length maxAllocSize or missing type constraint
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/freelist.go:176:14: undefined array length maxAllocSize or missing type constraint
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/freelist.go:166:2: too many errors
make: *** [Makefile:163: dev-build] Error 1

The first thing I noticed is that these errors are for the github.com/boltdb/bolt package, which is unmaintained and doesn't have support for riscv64. However, Consul switched to a maintained fork with RISC-V support a while ago, so it was another dependency causing problems. I used go mod why github.com/boltdb/bolt to identify the culprit, which led me to github.com/hashicorp/raft-boltdb, so I went to its repo and it also switched to the maintained fork, but it still depended on boltdb in order to migrate old bolt databases to the new format. I submitted a pull request to disable that functionality on unsupported platforms.

As of the time I'm writing this, the PR isn't merged yet, so I had to force Consul to use my fork. I did that by adding this replace directive to Consul's go.mod file, under the go 1.20 directive:

replace github.com/hashicorp/raft-boltdb/v2 => github.com/Elara6331/raft-boltdb/v2 v2.0.0-20230729002801-1a3bff1d87a7

Then I ran the make command again, and this time it compiled successfully, but I got a different error:

cp: cannot stat '/home/elara/.go/bin/consul': No such file or directory

I noticed that Consul's makefile ran go install and then copied the resulting binary to the destination rather than just writing it directly to the destination. This is the command it ran:

CGO_ENABLED=0 go install -ldflags "-X github.com/hashicorp/consul/version.GitCommit=449e050741+CHANGES -X github.com/hashicorp/consul/version.BuildDate=2023-07-28T16:49:23Z " -tags ""

So, I decided to modify that command so that it would build the binary instead of installing it, which resulted in the following command:

CGO_ENABLED=0 GOOS=linux GOARCH=riscv64 go build -ldflags "-X github.com/hashicorp/consul/version.GitCommit=449e050741+CHANGES -X github.com/hashicorp/consul/version.BuildDate=2023-07-28T16:49:23Z " -tags "" -o ./bin/consul

Once that command completed, I looked in the bin directory and found a successfully-built RISC-V Consul binary.

Since I don't have my star64 yet, I had to test the binary using CPU emulation, so I used qemu-user-static and ran ./bin/consul version, which returned:

Consul v1.17.0-dev
Revision 449e050741+CHANGES
Build Date 2023-07-28T16:49:23Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

That means the binary is working properly!

Nomad

Nomad can be cross-compiled as well, but it doesn't include any insructions in its documentation.

I cloned the nomad repo and followed the regular build instructions but with the GOOS and GOARCH variable set to target riscv64.

First, I ran the command that installs all the build depenedencies:

make GOOS=linux GOARCH=riscv64 bootstrap

That worked, so I moved on to the build command:

make GOOS=linux GOARCH=riscv64 dev 

After a while though, I got some errors:

# runtime/cgo
gcc_riscv64.S: Assembler messages:
gcc_riscv64.S:17: Error: no such instruction: `sd x1,-200(sp)'
gcc_riscv64.S:18: Error: no such instruction: `addi sp,sp,-200'
...

These errors are from the assembler, and they indicate that Go tried to use the wrong C compiler for the target platform. This happened because Nomad has some C dependencies for things like process monitoring. It's pretty easily fixed though, I just needed to install a RISC-V cross-compiler and point Go to it. I run Arch Linux, so I just installed the riscv64-linux-gnu-gcc package and then changed the command to point Go to it, like so:

make GOOS=linux GOARCH=riscv64 CC=/usr/bin/riscv64-linux-gnu-gcc CXX=/usr/bin/riscv64-linux-gnu-c++ AR=/usr/bin/riscv64-linux-gnu-gcc-ar dev

When I ran that though, I got some more errors:

# github.com/boltdb/bolt
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/db.go:101:13: undefined array length maxMapSize or missing type constraint
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/db.go:317:12: undefined: maxMapSize
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/db.go:335:10: undefined: maxMapSize
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/db.go:336:8: undefined: maxMapSize
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/db.go:795:2: pos declared and not used
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/bolt_unix.go:62:15: undefined array length maxMapSize or missing type constraint
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/bucket.go:135:15: undefined: brokenUnaligned
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/freelist.go:166:2: idx declared and not used
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/freelist.go:169:19: undefined array length maxAllocSize or missing type constraint
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/freelist.go:176:14: undefined array length maxAllocSize or missing type constraint
../../../.go/pkg/mod/github.com/boltdb/bolt@v1.3.1/freelist.go:166:2: too many errors
make[1]: *** [GNUmakefile:93: pkg/linux_riscv64/nomad] Error 1
make: *** [GNUmakefile:262: dev] Error 2

These are the same errors consul returned for the boltdb package, so I just added the same replace directive and ran it again, and this time it built successfully!

I tried running it like I did with consul, using qemu-user-static for CPU emulation, but I got the following error:

$ ./bin/nomad version
qemu-riscv64-static: Could not open '/lib/ld-linux-riscv64-lp64d.so.1': No such file or directory

This error indicates that the binary tried to load a riscv64 linker library, but qemu couldn't find it. That's because Arch's riscv64-linux-gnu-glibc package installs the linker to /usr/riscv64-linux-gnu/lib rather than the usual /lib. So, to fix that, I just told qemu where to find the linker, and the binary worked!

$ QEMU_LD_PREFIX=/usr/riscv64-linux-gnu/ ./bin/nomad version
Nomad v1.6.2-dev
BuildDate 2023-07-28T18:53:32Z
Revision 9e98d694a6230b904f931813b7d53622e9f128c9+CHANGES

Docker

Luckily, someone else has already successfully run Docker on RISC-V and they've documented the process in their github repo. They even provided a .tar.gz archive in the releases, which means I won't have to spend time getting Docker to build (thanks @carlosedp!).

Running the stuff I just set up

It's been about a week and my Star64 is finally here. Now that everything is ready, it's time for the interesting part: actually running all of this on the node and seeing if it can properly join my cluster.

Consul

Starting with Consul, I used scp to copy the binary I built to the /usr/bin directory on my Star64. Then, I copied the systemd services and configs in .release/linux/package to their corresponding directories and created a system user for consul using useradd -r consul.

I edited the configs to match the rest of my cluster and started consul with sudo systemctl start consul. A few seconds later, I saw this in my Consul dashboard:

Screenshot of Star64 in Consul dashboard

which means Consul is working!

Nomad

Just like with Consul, I used scp to copy the relevant files to the Star64, edited the configs, and started the Nomad service. Nomad doesn't need its own user, so no useradd was required.

A few seconds later, the Star64 joined the cluster and appeared in my dashboard!

Screenshot of Star64 in Nomad dashboard

Docker

Docker is going to be more complex to install. I started similarly to Consul and Nomad: I downloaded the .tar.gz, extracted it, copied all the files inside to their proper locations, and added a group for it using groupadd -r docker. However, when I tried to start docker, I got some errors about my kernel's cgroup support, so I decided to check whether my kernel had the proper config options enabled for Docker.

To do that, I downloaded moby's check-config.sh script, which checks your kernel config to make sure it supports docker. I found that several required options were missing. That meant I had to compile a custom kernel, so I looked into how Yocto worked and built a new kernel. I won't get into that process here because it could be a whole separate article, but anyway, eventually I had some .deb packages with the new kernel. I installed those, rebooted, ran the script again, and this time, all the required options were enabled.

So now, I tried starting docker again and this time there was no error, so I tested it by running the hello-world container, and sure enough, it worked!

root@star64:~/docker# docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (riscv64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

I've opened an issue for this and published the modified kernel packages here if anyone wants to try them.

Using the node for something

Now that I've got the node added to my cluster, it's time to actually make it run a task. I'm going to be running a Woodpecker CI agent on it so that I can test and build my software on RISC-V hardware.

Setting up Woodpecker

Luckily, woodpecker's official server and agent images both support RISC-V, so I don't need to change anything there. All I should need to do is add woodpecker_agent = true to my Nomad config file and restart Nomad, and it should start right up.

Sure enough, once I added the variable and restarted Nomad, Woodpecker Agent started immediately:

Screenshot of Woodpecker Agent running on the Star64

Running a CI job

Now that Woodpecker is running, I'm going to try running a CI job on it.

The job I'm going to run is the test job for my pcre library because it supports RISC-V and I'd like to try running its unit tests on actual RISC-V hardware.

My CI job is configured to run tests for amd64 and arm64, so just add riscv64 and it should work, right? Well, not quite. That job is using the official Go docker image which doesn't have support for RISC-V, so I have to make my own image.

In order to do that, I made a new repo at Elara6331/riscv-docker and added a custom dockerfile for Go that's based on the alpine:edge image, which does support RISC-V, and published the resulting image at gitea.elara.ws/elara6331/golang.

Then, I modified my CI config to run riscv64 tests using the new image and pushed it. Here's the result:

Screenshot of Woodpecker running a test job for PCRE on RISC-V

The job runs successfully!

That's pretty much it for this article, but I'm going to be doing lots of interesting stuff with this in the future and I'll be publishing more articles whenever I encounter anything interesting.