Sig Engineering - Part 2 - Progress on AccountsDB & more
This post is Part 2 of a multi-part blog post series we'll be releasing periodically to outline Sig Validator's engineering updates/milestones. Find Part 1, covering the Gossip Protocol, here.
This blog post details major progress from Syndica in Sig's journey: we have made significant headway towards completing AccountsDB for Sig, revamped the RPC server/client, and refined our implementation of the Gossip Protocol.
AccountsDB
Sig’s AccountsDB is now able to load and verify mainnet
snapshots with over 300 million accounts, which includes reading/writing account functionality, generating the account index, and generating a cumulative hash over the accounts for data verification.
We are now benchmarking the read and write performance against the Solana Labs’ client to understand how it performs and to ensure we make an improvement. This includes identifying bottlenecks and leveraging Zig to improve performance.
In our past month of work, we’ve come across a few interesting examples that showcase the power of Zig, including:
- a high-performance HashMap implementation,
- speed improvements on loading snapshots,
- and a custom disk-based allocator.
A High-Performance HashMap
At a high level, when we are loading from a snapshot, one of the most computationally expensive parts is generating the account index. While the account data lives in files on disk, we need a way to answer: “For a given pubkey, where is the corresponding account data located?”. This is the job of the primary accounts index, which is a mapping from a Pubkey
to a list of account references where each reference includes information about where the corresponding account data can be found. For example, an account reference could be the tuple (file_name, offset, slot)
where the account at slot slot
is stored in the file called file_name
, which can be read starting from the byte offset offset
.
In AccountsDB, the main data structure in the account index is a HashMap. Since this component is used in every read-and-write operation, we need it to be fast. The std library implementation was too slow for our use case, so we implemented our own HashMap based on Google's high-performance HashMap, Swissmap.
Below is a code snippet of the get
function of the HashMap. The code is straightforward: it takes a key, creates the corresponding key_state
, which we then search for across the map using equality until we either find a match (i.e., eq_fn
returns true and the key is found), or we reach the end of the state and return null
(i.e., the key doesn't exist).
pub fn get(self: *const @This(), key: Key) ?Value {
if (self._capacity == 0) return null;
var hash = hash_fn(key);
var group_index = hash & self.bit_mask;
// what we are searching for (get)
const control_bytes: u7 = @intCast(hash >> (64 - 7));
// PERF: this struct is represented by a u8
const key_state = State{
.state = .occupied,
.control_bytes = control_bytes,
};
const search_state: @Vector(GROUP_SIZE, u8) = @splat(@bitCast(key_state));
const free_state: @Vector(GROUP_SIZE, u8) = @splat(0);
for (0..self.groups.len) |_| {
const states: @Vector(GROUP_SIZE, u8) = @bitCast(self.states[group_index]);
// PERF: SIMD eq check: search for a match
var match_vec = search_state == states;
if (@reduce(.Or, match_vec)) {
inline for (0..GROUP_SIZE) |j| {
// PERF: SIMD eq check across pubkeys
if (match_vec[j] and eq_fn(self.groups[group_index][j].key, key)) {
return self.groups[group_index][j].value;
}
}
}
// PERF: SIMD eq check: if there's a free state, then the key DNE
const free_vec = free_state == states;
if (@reduce(.Or, free_vec)) {
return null;
}
// otherwise try the next group
group_index = (group_index + 1) & self.bit_mask;
}
return null;
}
It’s only forty lines of code, it’s easy to read, and it’s extremely fast. The code leverages a few Zig-specific features to achieve this.
The first feature we use is packed structs, where we represent the State
struct as an 8-bit value, using 1 bit for the enum value, and 7 bits to represent the key (i.e., control_bytes
).
pub const State = packed struct(u8) {
state: enum(u1) { empty, occupied },
control_bytes: u7,
};
In other languages, this is not as straightforward and can lead to convoluted code with a lot of bit operations. In Zig, it's as easy as calling @bitCast(key_state)
.
The second feature is Zig’s support for SIMD operations. We can cast the state we’re searching for into a Vector of 16 u8 values using the @Vector
and @bitCast
keywords and then check for possible matches using @reduce(.Or, search_state == states)
. In other languages, SIMD operations can be convoluted, confusing, and difficult to read, whereas in Zig, it feels and reads like normal Zig.
Loading From Snapshots Quickly
While the efficient HashMap implementation has a large effect on performance, another large factor is the number of memory allocations. For example, when we are populating the index, a naive approach would look like the following:
const allocator = std.heap.page_allocator;
var index_map = HashMap(Pubkey, ArrayList(AccountRef)).init(allocator);
for (account_references) |account_ref| {
// memory allocation
const result = index_map.getOrPut(account_ref.pubkey);
if (result.found_existing) {
// memory allocation
try result.value_ptr.append(account_ref);
} else {
var references = std.ArrayList(AccountRef).init(allocator);
// memory allocation
try references.append(account_ref);
result.value_ptr.* = references;
}
}
The problem with this approach is that it requires a lot of heap memory allocations which are one of the most expensive operations because they usually require an expensive syscall
. For example, getOrPut
requires a memory allocation if the key doesn’t exist, and appending to the ArrayList
requires another memory allocation if the list isn’t big enough. Running this loop for over 300 million accounts is extremely slow.
One of the benefits of Zig is that since memory is managed manually, we can explicitly preallocate our memory ahead of time to improve the speed drastically. For example, we could count how much data we need to allocate, allocate it once, and then use that memory later. Something like the following:
const allocator = std.heap.page_allocator;
// count the reference lengths
var account_refs_lens = HashMap(Pubkey, usize).initCapacity(allocator, n_accounts);
for (account_references) |account_ref| {
const result = index_map.getOrPutAssumeCapacity(account_ref.pubkey);
if (result.found_existing) {
result.value_ptr += 1;
} else {
result.value_ptr = 1;
}
}
// preallocate all the memory for the references
const memory_size = total_accounts * @sizeOf(AccountRef);
var memory = try allocator.alloc(u8, memory_size);
var fba = std.heap.FixedBufferAllocator.init(memory);
const n_accounts = account_references.len;
// preallocate all the keys in the map (note: this is an overestimate, but it's okay for now)
var index_map = HashMap(Pubkey, ArrayList(AccountRef)).initCapacity(allocator, n_accounts);
for (account_references) |account_ref| {
// this won't require any allocation
const result = index_map.getOrPutAssumeCapacity(account_ref.pubkey);
if (result.found_existing) {
// this won't require any allocation
result.value_ptr.appendAssumeCapacity(account_ref);
} else {
// this won't require any allocation
const len = account_refs_lens.get(account_ref.pubkey).?;
var references = try std.ArrayList(AccountRef).initCapacity(fba.allocator(), len);
references.appendAssumeCapacity(account_ref);
result.value_ptr.* = references;
}
}
In this example, we preallocate all the memory we need for the ArrayLists
using page_allocator.alloc
and use the FixedBufferAllocator
to manage this memory for us. We also use the initCapacity
method to allocate all our memory for the keys in the HashMap. This approach removes all the memory allocations in the loop which leads to much faster code. This approach is similar to what we do when loading snapshots and showcases the power of intentional memory management.
A Disk-Based Allocator
Another one of Zig’s strengths is that because memory is managed explicitly, allocations are usually abstracted by an Allocator
interface (anything that implements the alloc()
, free()
and resize()
functions). The Allocator
interface is usually passed to initialize any memory in a specific structure. For example, in our HashMap, we use an Allocator
interface to initialize our backing memory:
pub fn initCapacity(allocator: std.mem.Allocator, n: usize) !Self {
// compute how much memory we need
const n_groups = @max(std.math.pow(u64, 2, std.math.log2(n) + 1) / GROUP_SIZE, 1);
const group_size = n_groups * @sizeOf([GROUP_SIZE]KeyValue);
const ctrl_size = n_groups * @sizeOf([GROUP_SIZE]State);
const size = group_size + ctrl_size;
// allocate it
const memory = try allocator.alloc(u8, size);
@memset(memory, 0);
const group_ptr: [*][GROUP_SIZE]KeyValue = @alignCast(@ptrCast(memory.ptr));
const groups = group_ptr[0..n_groups];
const states_ptr: [*][GROUP_SIZE]State = @alignCast(@ptrCast(memory.ptr + group_size));
const states = states_ptr[0..n_groups];
One of the benefits of this approach is that we can implement our own allocation strategies and the underlying code doesn’t need to change. For example, because there are a large number of accounts to index, RAM requirements can become very large. To reduce these requirements we implemented a disk-based allocator that allocates disk memory instead of RAM. We can then use it to allocate our HashMap’s backing memory with no changes to the HashMap code.
const IndexHashMap = FastMap(Pubkey, ArrayList(AccountRef), pubkey_hash, pubkey_eql);
var ram_allocator = std.heap.page_allocator;
var account_refs = IndexHashMap.init(ram_allocator);
var disk_allocator = DiskMemoryAllocator.init("tmp_file");
var disk_account_refs = IndexHashMap.init(disk_allocator.allocator());
This is not as easy to do across other languages (i.e., the Solana Labs’ client implemented their own disk-based HashMap instead reusing the ram-based HashMap).
To review the progress we've detailed above, take a look at the AccountsDB PR here.
RPC Server & Client
Sig was introduced to optimize reads. In other words, Sig is a read-optimized validator implementation that prioritizes the RPC framework. In our initial Sig announcement blog post, we talked about one of the largest problems plaguing the Solana ecosystem: slot lag. This is primarily due to Solana validators being overwhelmed by read requests, locking state and thereby causing the validator node to fall behind the rest of the network. This has many downstream effects including transaction failure due to stale data, front-end UIs presenting old DApp state, and block-packing inefficiencies.
How do reads affect block production? Many transactions are submitted by bots attempting to take advantage of arbitrage opportunities that exist on-chain. Oftentimes these arbitrage opportunities are minimally profitable, but over time the sum of these small arbitrage trades provide a nice yield. The problem this phenomenon presents for block production is the large number of failed transactions within a block.
Even when a transaction fails, the leader needs to attain read and/or write locks for certain accounts to attempt to process the transaction. If too many transactions - which are doomed to fail - read/write lock enough accounts, this slows down the transaction processing pipeline and, in turn, limits transaction throughput. But why are transactions “doomed” to fail in the first place?
We surveyed on-chain transactions over a discrete period and concluded that a large proportion of transactions were attempting to take advantage of an arbitrage opportunity that no longer existed. This is, in part, due to stale account state(s) on the bot itself, which likely retrieved these account state(s) via RPC, Geyser (streaming), or PubSub WebSocket. In any of those mediums, if a validator node is behind the rest of the network, the data it produces will always be stale.
As of this writing, Sig has implemented the initial RPC server/client framework to begin testing throughput and reliability as we start to build out other components that hook into the RpcRequestProcessor
, the entrypoint to all RPC requests:
There's some Zig-specific features like inline for
statements that make it easy to implement RPC methods on the server side without writing thousands of lines of boilerplate code:
The above code iterates through all fields in the RpcServiceImpl
interface and calls the associated RPC method within the RpcRequestProcessor
. In order to maximize performance, we opted for statically dispatched RPC calls as opposed to dynamically dispatched via a vtable
, which many programming languages often force upon you when writing generic code. Dynamically dispatched calls can cost anywhere between 2-3x more than statically dispatched ones in terms of performance.
Our generic interface RpcServiceImpl
checks, at compile-time, the implementation function methods match the interface method signature. This allows us to share a common interface between our server/client implementation and can even allow for, in the future, other server/client transports such as gRPC
. Here's an example of the RpcClient
struct method(s) being comp-time checked against our interface:
With all Solana JSON RPC types ported into Zig, we implemented our RpcClient
and can now successfully make requests against the public Solana mainnet-beta endpoint:
test "rpc.client: makes request successfully" {
var client = RpcClient.init(testing.allocator, try std.Uri.parse("https://api.mainnet-beta.solana.com/"));
defer client.deinit();
switch (client.getAccountInfo(testing.allocator, try t.Pubkey.fromString("CcwHykJRPsTvJrDH6vT9U6VJo2m2hwCAsPAG1mE1qEt6"), null)) {
.Success => |val| {
defer val.deinit();
std.debug.print("result: {any}", .{val.value});
},
.RpcError => |err| {
defer err.deinit();
std.debug.print("rpc error: {}", .{err.value});
},
.ClientError => |err| {
std.debug.print("client error: {any}", .{err});
},
}
}
We remain excited to speed up Solana's RPC across the ecosystem and have taken the initial steps to achieve this. You can follow along in this PR here.
Gossip
We also completed the initial implementation of the gossip protocol in Sig last year, as detailed in our previous blog post. Since then, we've been testing and refining that implementation to fix any bugs and ensure its behavior is closely aligned with Solana Labs’ validator client.
In the latest round of testing, we compared the CrdsTable
between Sig and Solana (Rust) to ensure that records are consistent. The CrdsTable
is the core of gossip; it is in charge of tracking all the data received from gossip messages. A validator uses the CrdsTable
to determine who its peers are and what to send them in gossip messages.
To test this, we dump the full contents of the CrdsTable
into a CSV file every ten seconds. These files record the message type, hash, source Pubkey
, and wallclock for every value. For contact info, we also include gossip address and shred version.
We started two validator clients simultaneously on separate servers, allowed them to run for several minutes, and then compared the resulting CSV files. One server always runs the Solana Labs validator while the other server runs the code being tested.
To dump the CrdsTable contents, some code changes were needed, which you can see here:
- Sig: https://github.com/Syndica/sig/compare/main...dnut/crds-hash-dumps
- Solana: https://github.com/solana-labs/solana/compare/master...dnut:solana:crds-dump
As a control, we tested the Solana Labs’ validator against itself. Within one minute of startup, their CrdsTable
contents were nearly identical, with 7340 records each. 7131 records were identical, while 209 had the same message type and source, but differed by their wallclock. This illustrates a baseline expectation for separate nodes in the same cluster. All clients should have the same set of records, but it's typical for about 3% of them to be older or newer versions of an analogous record found on another node.
Running Sig alongside Solana, they reached exactly the same number of CrdsValues
after one minute: 7350. 7129 of those values are the same item, whereas each client had 221 messages different from the other. Unlike Solana v. Solana, however, these were not just different versions of the same message from the same origin. Sig had CrdsValues from 28 Pubkey
(s) that were not represented at all in Solana's CrdsTable
. Solana had some DuplicateShred
, EpochSlots
, and Vote
messages that Sig was missing, and Sig had many more NodeInstance
messages, and one more LegacyContactInfo
.
As time went on, we observed that Sig did catch up and included every CrdsValue
found in Solana (some differing only by hash and wallclock, like the control). However, after 20 minutes, Sig also included 1468 extra CrdsValues
that were not found in Solana, coming from nodes with 1050 Pubkey
(s) that are not reflected in any value in Solana's CrdsTable
. These extra CrdsValues
are:
DuplicateShred
: 36EpochSlots
: 2LegacyContactInfo
: 2NodeInstance
: 1050Version
: 1
The data suggests that Sig is now able to keep track of all the same CrdsValues
as Solana. However, it tracks some additional values that Solana Labs does not track. At first, they are closely in sync, but as time goes on, Sig gradually accumulates some extra values. Particularly of note, extra node instances are reflected in Sig, both with NodeInstance
CrdsValues
and an equal number of extra Pubkey
(s).
The Solana Labs validator evicts records from the CrdsTable
after 15 seconds if the node that created the record has zero stake. Otherwise, the record is kept for the duration of an epoch (about 2 days). Sig is not yet able to track each node’s stake, so for now we've configured Sig to keep all records in the CrdsTable
for the full epoch.
From the latest round of testing and improvements, we are seeing Sig's gossip implementation behave very similarly to Solana. There is still a minor lingering discrepancy that will be cleaned up as we implement stake-tracking. Until then, we already have a gossip implementation that connects to the cluster and can track at least as much data as a Solana validator is expected to be aware of, plus some excess clutter data.
Prometheus
We aim to provide exhaustive data about Sig's performance when it's hooked up to a real Solana cluster. To make this possible, we decided to expose Prometheus-compatible metrics. Prometheus is an open-source and widely adopted metrics collection tool. The metrics API is based on a well-defined specification, and many tools have been developed to make it easy to consume this data.
We added a library to Sig in this PR that implements Prometheus metrics in Zig. We chose to fork zig-prometheus, which was an excellent starting point. However, that project did not prioritize compliance with Prometheus standards, so we made changes as necessary to comply with the official prometheus specification for client libraries. We implemented Gauge from scratch since the Gauge in zig-prometheus is a different type of metric from the Gauge in the Prometheus spec. The Gauge in zig-prometheus is analogous to GaugeFunc in the official Prometheus Go library, so we renamed it to "GaugeFn". We also implemented Histogram from scratch, because the Histogram in zig-prometheus is not compliant with the spec, and because it is not optimized for write-heavy usage.
Our Prometheus library supports the following metrics:
Counter
: tracks a monotonically increasing number that can be incremented arbitrarily.Gauge
: tracks a value that can be changed to any number at any time.GaugeFn
: calls a function in your code to retrieve a number to report to Prometheus.Histogram
: samples numeric observations and counts them in configurable buckets. The counter for each bucket represents the number of observations that were a number in the configured range.
A registry is used to keep track of all metrics. Here is how to create a registry and start using a counter:
const registry = try Registry(.{}).init(allocator);
const counter = registry.getOrCreateCounter("my_counter");
counter.inc();
We also included an http adapter and a global registry singleton to make it easy to bootstrap a unified Prometheus endpoint in any application. This creates a single global registry that will be shared across the entire application, and serves it over http on the standard Prometheus port 12345:
const registry = globalRegistry(allocator);
try servePrometheus(allocator, registry, metrics_port);
In another scope, a counter from the same registry can be incremented:
const counter = globalRegistry(null).getOrCreateCounter("my_counter");
counter.inc();
The Counter
and Gauge
are implemented simply with an atomic u64 or f64 that can be updated from any thread without locks. GaugeFn
is a simple adapter that calls arbitrary code of your choosing.
Histogram
was more complex because it internally updates multiple values for each observation, and these observations may come from multiple threads. Our goal was to optimize for write performance as writes may occur very frequently in the application’s critical path, whereas reads only come occasionally from Prometheus. We took inspiration from the histogram in rust-prometheus to optimize write performance.
Our solution allows concurrent writes to occur rapidly without any locking mechanism. The complication, however, is that a write operation needs to update multiple fields. If you read the data while a write operation is only partially complete, the data you read will be corrupt. Normally, a synchronization primitive like mutex or read-write-lock would be used to handle this, but those would degrade write performance.
To solve for this, the histogram data is split into two shards, a hot shard and a cold shard. The hot shard is the shard that is currently enabled for writing. An unlimited number of threads may write data to the hot shard concurrently. No locking mechanism is used for writes at all, which makes writes fast and highly parallelizable.
Since we know reads occur infrequently (less than once per second), we put the burden on reads to ensure they only access data that has integrity, without getting in the way of writes. "Read" is a bit of a misnomer because read operations actually mutate some internal state of the histogram.
The first thing that happens during a read is that it switches which shard is hot and cold. The previously cold shard is designated as the hot shard, so all future writes are directed to that shard. The previously hot shard is allowed to cool down. Once all the in-progress writes are completed on the cooling shard, the data from that shard is read. Unlike writing, reading has a lock. Only one read can occur at a time because it would interrupt and corrupt the read process if another thread came in and flipped the shards while one thread was waiting for a shard to cool down.
The Prometheus library was merged to main and is ready for use. We'll soon start using it to track metrics in Gossip
, RpcRequestProcessor
and AccountsDB
.
Conclusion
Efficiently implementing AccountsDB and refining the RPC client/server and Gossip protocol are crucial steps towards bringing Sig to life. With the power that Zig offers, we’re able to demonstrate concrete improvements in each of these domains over the current validator implementation. Over the coming months, we will complete AccountsDB and the remaining storage-related components and update the community periodically. We welcome feedback, questions, and contributions from open-source developers in the ecosystem.