Implementing a User-Space NFS Client in Go
Learn how to implement a user-space NFS client in Golang.
Introduction
Our backup application, which runs over an NFS (Network File System) mount, is designed for high performance. While developing it, we found we had to implement our own NFS client in user space: with our own client, we can fine-tune the load the application imposes on any given server, backing off for slow servers and ramping up for fast ones. Our client can also list enormous directories (often with many millions of entries) piecemeal, in a way that overwhelms neither client nor server.
In some ways, the NFS protocol is just another encoding. Go already has many encoding packages (json, xml, etc.), and like those, we used Go's reflection facility to get going as fast as possible. We soon found that reflection was the bottleneck to higher performance.
It’s a truism in the Go community that “Go’s reflection package is slow,” but it’s not obvious what that really means. One person’s “slow” may be another person’s “good enough.” I’ll mention how reflection is used for serialization, how slow it was for us, and walk you through the workaround we settled on.
Serialization
Most production computer code spends its time converting data from one representation to another. When the target representation is intended to live on disk or be transmitted over the network, the conversion is known as "serialization."
In Go, there are two common ways of describing a serialization format. One is via an interface definition language (IDL), e.g., Google protobuf, and the other is via reflection of a native struct, e.g., the json package.
With the former method, the format is described in a separate language and then compiled (or “transpiled”) into Go source code “stubs,” which convert native structures to and from a wire format.
With the latter method, the format is self-described by Go structures, which are then decomposed at runtime by the serialization package and written to/from the wire format.
Ultimately, this compilation (or runtime interpretation) process is transparent to the user of the library. The library will implement marshaling and unmarshaling methods such as those for JSON:
buf, err := json.Marshal(&request)
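For instance, here is a minimal, self-contained illustration of the reflection-driven method; the Request type and its fields are invented for this example:

package main

import (
    "encoding/json"
    "fmt"
)

// Request is a hypothetical type; the json tags steer the
// reflection-based encoder.
type Request struct {
    Path  string `json:"path"`
    Depth int    `json:"depth"`
}

func main() {
    request := Request{Path: "/export/home", Depth: 2}
    buf, err := json.Marshal(&request)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(buf)) // {"path":"/export/home","depth":2}
}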
NFS falls into the former camp: its wire format is specified by an IDL. As we'll see, though, we began by marshaling it with the latter, reflection-based method.
Serialization for RPC
The serialization format for NFS is called "ONC RPC" (it originated at Sun Microsystems and was initially known as Sun RPC). The RPC is described by an IDL, which we will call the "RPC language."
To get a sense of what it looks like, here is the definition of the READDIRPLUS request:
READDIRPLUS3res NFSPROC3_READDIRPLUS(READDIRPLUS3args) = 17;

struct READDIRPLUS3args {
    nfs_fh3     dir;
    cookie3     cookie;
    cookieverf3 cookieverf;
    count3      dircount;
    count3      maxcount;
};
This RPC struct looks a lot like a C struct, but looks deceive. On any Unix system, one would run a program called rpcgen to turn this IDL source code into C stubs, which a C compiler could then incorporate into an application.
These stubs would then write to (and read from) the network in a well-defined byte format called XDR (“External Data Representation”).
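To make the byte format concrete, here is a small hand-worked sketch (my own illustration, not taken from the specification) of how two values lay out on the wire:

uint32(17)      ->  00 00 00 11
opaque "hello"  ->  00 00 00 05  68 65 6c 6c 6f 00 00 00
                    (a 4-byte length word, the data, then zero padding to a 4-byte boundary)

Every field is written big-endian and occupies a multiple of four bytes, a regularity that will make hand-writing encoders tractable later on.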
Reflecting RPC Into XDR
So how do we incorporate ONC RPC definitions into a Go program? The most general solution would be to implement rpcgen for Go, i.e., write a program that transpiles the ONC RPC language into Go stubs. However, ONC RPC sees little use outside a handful of legacy protocols (most notably NFS), so it would be overkill to go this route. Let's look at some other options.
There is already an open-source XDR package for Go, written by Dave Collins. It allows a straightforward translation of any Go struct composed of booleans, integers, and arrays into its XDR form. READDIRPLUS can thus be expressed in Go:
type ReaddirplusRequest struct {
    Dir            []byte
    Cookie         uint64
    CookieVerifier uint64
    Dircount       uint32
    Maxcount       uint32
}
and marshaled via
n, err := xdr.Marshal(w, &request) // w is an io.Writer
in the way most Go programmers are accustomed to. This package uses reflection. In other words, at runtime, a struct passed into the XDR library is deciphered by calls to the Go reflection package, and a serialization primitive is run for each member of the struct. In practice, it turns out that much of the time is spent deducing the type of each struct member and comparatively little actually serializing bytes.
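To see where that time goes, here is a minimal sketch of the per-field work such an encoder performs. This is my own illustration, not the actual go-xdr code, and it assumes imports of bytes, encoding/binary, fmt, and reflect:

// encodeStruct sketches what a reflection-based XDR encoder
// does for each field of a struct (v must be a pointer to a struct).
func encodeStruct(buf *bytes.Buffer, v interface{}) error {
    rv := reflect.ValueOf(v).Elem()
    var scratch [8]byte
    for i := 0; i < rv.NumField(); i++ {
        f := rv.Field(i)
        // Every field costs a dynamic kind lookup before any
        // bytes move; this bookkeeping dominates the profile.
        switch f.Kind() {
        case reflect.Uint32:
            binary.BigEndian.PutUint32(scratch[:4], uint32(f.Uint()))
            buf.Write(scratch[:4])
        case reflect.Uint64:
            binary.BigEndian.PutUint64(scratch[:8], f.Uint())
            buf.Write(scratch[:8])
        case reflect.Slice: // opaque data: length, bytes, padding
            data := f.Bytes()
            binary.BigEndian.PutUint32(scratch[:4], uint32(len(data)))
            buf.Write(scratch[:4])
            buf.Write(data)
            buf.Write(make([]byte, (4-len(data)%4)%4))
        default:
            return fmt.Errorf("unsupported kind %v", f.Kind())
        }
    }
    return nil
}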
Writing a simple benchmark shows how fast we could expect to marshal this request to the network:
var readdirReq = ReaddirplusRequest{
    Dir:            bytes.Repeat([]byte("x"), 32),
    Cookie:         1234,
    CookieVerifier: 5678,
    Dircount:       100000,
    Maxcount:       100000,
}
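(The benchmarks call a Size method that the original listing omits. A plausible definition, following XDR sizing rules, is shown below; it yields 60 bytes for this request, consistent with the MB/s figures that follow.)

// Size returns the length in bytes of the XDR encoding of r:
// a length-prefixed, padded opaque for Dir, two 8-byte hypers,
// and two 4-byte uints.
func (r *ReaddirplusRequest) Size() int {
    opaque := 4 + (len(r.Dir)+3)&^3 // length word + padded data
    return opaque + 8 + 8 + 4 + 4
}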
func BenchmarkMarshalReaddirReflection(b *testing.B) {
    var buf bytes.Buffer
    b.SetBytes(int64(readdirReq.Size()))
    for i := 0; i < b.N; i++ {
        buf.Reset()
        xdr.Marshal(&buf, &readdirReq)
    }
}
gives the following numbers on a 2.4 GHz i7 MacBook Pro:
BenchmarkMarshalReaddirReflection-4   2000000   681 ns/op   87.98 MB/s   112 B/op   11 allocs/op
This is an intriguing result and points to a performance hotspot: in a world of 10 Gb/s networking, a throughput of 90 MB/s looks troublesome. The allocation count is also worth paying attention to: in a hot path where pool allocation is preferred to heap allocation, 11 allocations for one struct looks like a lot. The profile output confirms these suspicions:
(pprof) top
Showing nodes accounting for 1150ms, 60.85% of 1890ms total
Showing top 10 nodes out of 105
      flat  flat%   sum%        cum   cum%
     250ms 13.23% 13.23%      530ms 28.04%  runtime.mallocgc
     180ms  9.52% 22.75%      180ms  9.52%  runtime.kevent
     150ms  7.94% 30.69%     1710ms 90.48%  igneous.io/vendor/github.com/davecgh/go-xdr/xdr2.(*Encoder).encode
     110ms  5.82% 36.51%      330ms 17.46%  reflect.(*structType).Field
     100ms  5.29% 41.80%      100ms  5.29%  runtime.duffcopy
      90ms  4.76% 46.56%     1630ms 86.24%  igneous.io/vendor/github.com/davecgh/go-xdr/xdr2.(*Encoder).encodeStruct
      80ms  4.23% 50.79%      140ms  7.41%  reflect.(*rtype).String
      70ms  3.70% 54.50%      110ms  5.82%  bytes.(*Buffer).Write
      60ms  3.17% 57.67%       80ms  4.23%  reflect.StructTag.Get
      60ms  3.17% 60.85%       60ms  3.17%  runtime.duffzero
Of the top 10 entries, most of the time is devoted to reflection and heap allocation.
Another Way to Marshal
Short of implementing rpcgen for Go, there is a simpler way to address performance hotspots: hand-roll the marshal methods as needed. After all, the NFS protocol is mature, and these methods will rarely (if ever) need to be updated. Furthermore, XDR is almost the simplest possible serialization method one might ever propose. It can be roughly summarized as "write in network (big-endian) byte order, and pad to 32-bit boundaries."
This is, in fact, what we decided to do. By addressing hotspots as needed, we could leave most of our NFS client code untouched (and continue using the reflection-based slow path). Key RPCs can be sped up with hand-rolling, and with a tasteful arrangement of the primitives, this does not have to be ugly, either.
A hand-rolled marshal method implements a new interface:
package rpc

type Marshaler interface {
    MarshalXDR(in []byte) (out []byte)
}
A type that implements Marshaler appends its network representation to the byte slice in and returns the result in a byte slice out. Here is what MarshalXDR looks like for our READDIRPLUS request example:
func (r *ReaddirplusRequest) MarshalXDR(b []byte) []byte {
    b = rpc.EncodeOpaque(b, r.Dir)
    b = rpc.EncodeHyper(b, r.Cookie)
    b = rpc.EncodeHyper(b, r.CookieVerifier)
    b = rpc.EncodeUint(b, r.Dircount)
    b = rpc.EncodeUint(b, r.Maxcount)
    return b
}
Our RPC library supplies the primitives for writing integers and arrays, but these primitives are also very simple. For example, appending a big-endian integer is accomplished with EncodeUint:
// EncodeUint appends a uint32 to the byte slice of
// an XDR message. The resulting byte slice is
// returned.
func EncodeUint(b []byte, x uint32) []byte {
    return append(b, byte(x>>24), byte(x>>16), byte(x>>8), byte(x))
}
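The other primitives used by MarshalXDR are equally small. The original listings are not shown here, but plausible sketches in the same append-based style would look like this:

// EncodeHyper appends a uint64 in big-endian order.
func EncodeHyper(b []byte, x uint64) []byte {
    return append(b,
        byte(x>>56), byte(x>>48), byte(x>>40), byte(x>>32),
        byte(x>>24), byte(x>>16), byte(x>>8), byte(x))
}

// EncodeOpaque appends variable-length opaque data: a length
// word, the bytes themselves, and zero padding out to a
// 4-byte boundary.
func EncodeOpaque(b []byte, data []byte) []byte {
    b = EncodeUint(b, uint32(len(data)))
    b = append(b, data...)
    for len(b)%4 != 0 {
        b = append(b, 0)
    }
    return b
}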
We can write a benchmark for this Marshal method, too:
func BenchmarkMarshalReaddirHandrolled(b *testing.B) {
    var buf []byte
    b.SetBytes(int64(readdirReq.Size()))
    for i := 0; i < b.N; i++ {
        buf = buf[:0]
        buf = readdirReq.MarshalXDR(buf)
    }
}
which on the same MacBook Pro gives:
BenchmarkMarshalReaddirHandrolled-4   200000000   9.60 ns/op   6247.12 MB/s   0 B/op   0 allocs/op
This is roughly 70x faster than the solution using reflection (681 ns vs. 9.6 ns per operation), and the memory allocations are amortized away so that, on average, there are zero allocations in this loop! The profile output is also far easier to digest and comprehend, in that the first three lines in the profile account for 95 percent of the CPU time spent in the benchmark:
(pprof) top
Showing nodes accounting for 2610ms, 100% of 2610ms total
      flat  flat%   sum%        cum   cum%
    1660ms 63.60% 63.60%     2260ms 86.59%  igneous.io/common/nfs.(*ReaddirplusRequest).MarshalXDR
     600ms 22.99% 86.59%      600ms 22.99%  runtime.memmove
     230ms  8.81% 95.40%     2490ms 95.40%  igneous.io/common/nfs.BenchmarkMarshalReaddirHandrolled
     120ms  4.60%   100%      120ms  4.60%  runtime.usleep
         0     0%   100%      120ms  4.60%  runtime.mstart
Furthermore, this profile is similar to what we might expect to see from benchmarking an analogous marshal operation in C. (It's worth noting, for example, that the Go compiler was able to inline EncodeUint, so it does not appear in the profile at all but gets charged to ReaddirplusRequest.MarshalXDR. Inlining decisions like this can be inspected by building with go build -gcflags=-m.)
Discussion
This was a small tour of a very simple RPC request, but it illustrates the power of the hand-rolled approach as a supplement to reflection. The real power of this approach is even more evident when processing the READDIRPLUS reply (since a reply may contain thousands of directory entries), but for the sake of brevity, I have focused on the request structure. The concept is similar, however: unmarshaling is done at speed by writing a custom unmarshal method, which leans on well-designed primitives to do the low-level work of swapping and copying bytes from the network.
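To give a flavor of what such a decode primitive might look like (the name, signature, and error handling here are my own sketch, not necessarily those of our library; it assumes the io package is imported for the error value):

// DecodeUint consumes a big-endian uint32 from the front of b,
// returning the value and the remaining bytes.
func DecodeUint(b []byte) (uint32, []byte, error) {
    if len(b) < 4 {
        return 0, b, io.ErrUnexpectedEOF
    }
    x := uint32(b[0])<<24 | uint32(b[1])<<16 | uint32(b[2])<<8 | uint32(b[3])
    return x, b[4:], nil
}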
This also illustrates the power and flexibility that Go offers with reflection: we were able to quickly prototype and implement the entire NFS protocol, then substitute only those methods we needed to achieve high performance. Notably, these hand-written methods achieve performance similar to that of equivalent marshaling code written in C.
In fact, at Igneous, we ended up using hand-rolled methods only for NFS READ, WRITE, and READDIRPLUS. The remaining reflection-based methods are used infrequently enough in our backup application that there was no need to substitute them more broadly, at least for now.
I hope this example illustrates the costs and benefits of using reflection and how to be strategic about when (and when not) to use it.