Wanted to share a personal journey, since a partner and I went down a similar route on a past project: using NBD + a userland loopback driver + S3 for block devices.

Five years ago we forked nbdkit and nbd-client to create async, optimized versions for private use as a userland loopback block device driver, and then built support for various cloud object stores (including S3) as backends.

It has been a few years since I've worked on it, but there were a number of gotchas we had to overcome when running block devices in userspace and mounting them on the same system. The largest is that the userland process needs to be extremely careful with memory allocations (ideally all memory is allocated at startup of the userland driver and memlocked), because you can quite easily deadlock the system if the driver needs to allocate memory during an operation. The kernel may decide it first has to reclaim memory to satisfy the malloc, and may choose to write out dirty filesystem buffer pages to do so. That makes the kernel re-enter the filesystem and block device drivers to flush those dirty pages, and that deadlocks in the kernel because the relevant locks are already held by the original operation.
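
As a concrete illustration of the "allocate everything at startup" discipline, here is a minimal sketch (hypothetical names and sizes, not our actual code): grab every buffer the driver will ever need before serving I/O, touch the pages, and memlock them, so the request path never calls malloc() and never gives the kernel a reason to reclaim.

    /* Sketch only: pre-allocate and pin a fixed pool of request buffers
     * at startup so no allocation happens on the I/O path.
     * NREQ and REQ_BYTES are illustrative values. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define NREQ      256          /* max in-flight NBD requests (assumed) */
    #define REQ_BYTES (1 << 20)    /* 1 MiB payload buffer per request (assumed) */

    static unsigned char *request_pool;

    int init_request_pool(void)
    {
        request_pool = malloc((size_t)NREQ * REQ_BYTES);
        if (request_pool == NULL)
            return -1;

        /* Touch every page so it is actually backed by RAM ... */
        memset(request_pool, 0, (size_t)NREQ * REQ_BYTES);

        /* ... then pin current and future pages so they can never be
         * swapped or reclaimed while requests are in flight. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return -1;
        }
        return 0;
    }

(Note that mlockall() will fail if RLIMIT_MEMLOCK is too low, so the driver process needs an appropriate limit or CAP_IPC_LOCK.)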

There are also a number of places where you might not expect the kernel to allocate memory on your behalf. For example, your userland process makes a network call to S3 and the kernel decides to allocate additional socket buffer memory, which can likewise enter the reclaim path that flushes dirty fs buffers.
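
The socket-buffer "fine tuning" mentioned further down was part of the mitigation for this. A hedged sketch of the idea (the 4 MiB figure is illustrative, and sysctl caps such as net.core.rmem_max / net.core.wmem_max also apply): pinning SO_SNDBUF/SO_RCVBUF before connecting disables the kernel's buffer autotuning for that socket, so the buffers are sized once up front rather than grown later under load.

    /* Sketch only: fix the socket buffer sizes for the S3 connection
     * before connect() so the kernel does not autotune (grow) them
     * mid-transfer. The size is illustrative. */
    #include <stdio.h>
    #include <sys/socket.h>

    int tune_s3_socket(int fd)
    {
        int bufsz = 4 * 1024 * 1024;   /* assumed size; tune per instance type */

        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof bufsz) != 0 ||
            setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof bufsz) != 0) {
            perror("setsockopt");
            return -1;
        }
        return 0;
    }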

In-kernel drivers avoid this issue by flagging their memory allocation requests so that the kernel doesn't re-enter those subsystems while trying to reclaim pages. Userland drivers had no mechanism for that at the time (I'm not sure whether that has changed), so it was critical to avoid any memory allocations in the userland driver process.
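
For comparison, here is roughly what the in-kernel escape hatch looks like (a sketch of standard kernel APIs, not code from our project): allocations on the I/O path use GFP_NOIO (or GFP_NOFS), or a whole region is bracketed with memalloc_noio_save()/memalloc_noio_restore(), so reclaim will not recurse into block I/O or filesystem writeback to satisfy them.

    /* Kernel-side sketch: how an in-kernel block driver flags its
     * allocations so reclaim cannot re-enter the I/O path. */
    #include <linux/slab.h>
    #include <linux/sched/mm.h>

    static void *alloc_io_buffer(size_t len)
    {
        /* GFP_NOIO: the allocator may reclaim, but may not start I/O
         * (and therefore cannot re-enter the block layer) to do so. */
        return kmalloc(len, GFP_NOIO);
    }

    static void do_noio_section(void)
    {
        /* Alternatively, mark a whole scope so that every allocation
         * inside it is implicitly treated as GFP_NOIO. */
        unsigned int flags = memalloc_noio_save();
        /* ... allocations here cannot trigger I/O for reclaim ... */
        memalloc_noio_restore(flags);
    }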

That said, once we had worked through these system-stress, malloc-related issues that could wedge the machine, by carefully pre-allocating and memlocking everything and fine-tuning all of our socket buffer settings, things worked pretty well honestly. The sequential write speed of our NBD / userland / S3 block device driver was able to max out AWS c5n instances with 25-gigabit networking (between 2 and 2.5 GB/s read/write throughput). We also tested layering other Linux block device systems such as LVM and dm-crypt on top of it, and everything worked well; we even had support for TRIM and snapshotting. Add a thin filesystem such as btrfs on top of the block device and you could even create a thin 4-exabyte disk backed by S3, format it with btrfs, and mount it on your system in under a minute.
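
On the TRIM point: one nice property of the stack is that a discard issued anywhere above the device (a filesystem mounted with -o discard, fstrim, etc.) reaches the userland driver as an NBD trim request, which it can translate into dropping the corresponding backing objects. A hedged illustration of exercising that path by hand (/dev/nbd0 and the range are example values):

    /* Sketch only: issue a discard against the NBD device; the kernel
     * nbd driver forwards it to the userland server as NBD_CMD_TRIM
     * (assuming the server advertises trim support). */
    #include <fcntl.h>
    #include <linux/fs.h>      /* BLKDISCARD */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/nbd0", O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* {offset, length} in bytes: discard the first 1 GiB. */
        uint64_t range[2] = { 0, 1ULL << 30 };
        if (ioctl(fd, BLKDISCARD, &range) != 0)
            perror("BLKDISCARD");

        close(fd);
        return 0;
    }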

A follow-up project that we prototyped but did not release was a GlusterFS cluster built on top of our S3-backed disks with LVM, and we were able to get extremely scalable filesystem throughput. We tested 3, 6, 9, and 12 node clusters in AWS and achieved 10+ GB/s (yes, bytes, not bits) filesystem read and write throughput. We were also able to leverage Gluster's striping and redundancy support to increase durability and availability in case of node loss, and we had point-in-time snapshot capabilities through LVM and Gluster. The stack consisted of 3-node groupings, where each node was a c5n.2xl instance with a device stack of [our NBD userland S3-backed disk -> LVM -> ext4 -> GlusterFS] in a distributed disperse 3, redundancy 1 configuration. Testing was done with distributed fio using 10 client servers and 5-10 streams per client writing to the Gluster filesystem.
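
For readers not familiar with Gluster's disperse terminology, a quick sketch of the arithmetic behind "disperse 3 redundancy 1" (the figures below are illustrative, not our production sizing): each 3-brick subvolume erasure-codes data so that any 1 brick can be lost, and usable capacity is (3 - 1) / 3 of raw.

    /* Illustrative arithmetic for a distributed disperse 3 / redundancy 1
     * layout; raw_tb_per_node is a made-up figure. */
    #include <stdio.h>

    int main(void)
    {
        const int bricks = 3, redundancy = 1;   /* per disperse subvolume */
        const int nodes = 12;                   /* one brick per node (assumed) */
        const double raw_tb_per_node = 100.0;   /* hypothetical raw capacity */

        int subvols = nodes / bricks;           /* data is distributed across these */
        double usable_fraction = (double)(bricks - redundancy) / bricks;
        double usable_tb = nodes * raw_tb_per_node * usable_fraction;

        printf("%d subvolumes, each tolerating %d lost brick(s)\n",
               subvols, redundancy);
        printf("usable: %.0f TB of %.0f TB raw (%.0f%%)\n",
               usable_tb, nodes * raw_tb_per_node, usable_fraction * 100);
        return 0;
    }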

If your primary use case is larger files (10+ MB) with parallel streams of sequential reads and writes, object-storage-backed block devices can be highly performant and extremely durable. But it took approximately 2-3 years with 2 developers to get these products built and highly stable.

I think I would personally recommend most folks first reconsider whether their architecture could leverage object storage directly, rather than going through the indirection of a filesystem -> S3 or a block device -> S3. If not, I would strongly consider building direct S3 plugins for userland filesystem backends like GlusterFS or NFS-Ganesha, which avoids the kernel + fs + blockdev layers entirely and with them most of the risky deadlock scenarios. If that still doesn't work and you are going down the NBD -> userland path, I would at the very least avoid mounting the NBD disk on the same system where the NBD userland process is running, so that malloc requests needing to flush dirty fs cache can't recurse back through the NBD driver (unless this problem has since been solved).

Shaun

On Tue, Sep 13, 2022 at 10:54 AM Richard W.M. Jones <rjones@redhat.com> wrote:

As an aside, we'll soon be adding the feature to use nbdkit plugins as
Linux ublk (userspace block) devices.  The API is nearly the same so
there's just a bit of code needed to let nbdkit plugins be loaded by
ubdsrv.  Watch this space.

Of course it may not (probably will not) fix other problems you mentioned.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
_______________________________________________
Libguestfs mailing list
Libguestfs@redhat.com
https://listman.redhat.com/mailman/listinfo/libguestfs