On 6/7/19 9:15 AM, Eric Blake wrote:
We disabled Nagle's algorithm to reduce the latency of our responses
reaching the client; but as a side effect, it leads to more network
overhead when we send a reply split across more than one write().
Take advantage of various means for grouping related writes (Linux'
MSG_MORE for sockets, gnutls' corking for TLS) to send a larger
packet, and adjust callers to pass in our internal SEND_MORE flag as a
hint for when to use these improvements. I tested with appropriate
combinations from:
$ nbdkit {-p 10810,-U -} \
{--tls=require --tls-verify-peer --tls-psk=./keys.psk,} memory size=64m \
--run './aio-parallel-load{-tls,} {$unixsocket,nbd://localhost:10810}'
with the following IOPS measurements averaged over multiple runs:
                  pre        post      gain
  unix plain:  802627.8   822944.1    2.53%
  unix tls:    121871.6   125538.0    3.01%
  tcp plain:   656400.1   685795.0    4.48%
  tcp tls:     114552.1   120197.3    4.93%
which looks like an overall improvement, even though it is still close
to the noise margin.
+++ b/server/crypto.c
@@ -357,6 +357,9 @@ crypto_send (struct connection *conn, const void *vbuf, size_t len,
int flags)
assert (session != NULL);
+ if (flags & SEND_MORE)
+ gnutls_record_cork (session);
+
while (len > 0) {
r = gnutls_record_send (session, buf, len);
if (r < 0) {
@@ -368,6 +371,10 @@ crypto_send (struct connection *conn, const void *vbuf, size_t len,
int flags)
len -= r;
}
+ if (!(flags & SEND_MORE) &&
+ gnutls_record_uncork (session, GNUTLS_RECORD_WAIT) < 0)
+ return -1;
+
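For reference, the same patch also maps SEND_MORE to MSG_MORE on the
plain (non-TLS) socket path, which isn't quoted above.  A minimal sketch
of that side, with the helper name, signature, and SEND_MORE value
assumed for illustration rather than taken from the patch, looks
something like:

  /* Sketch only: helper name, signature, and SEND_MORE value are assumed. */
  #include <sys/types.h>
  #include <sys/socket.h>
  #include <errno.h>

  #define SEND_MORE 1   /* internal hint: more of this reply follows at once */

  static int
  plain_send (int sockfd, const void *vbuf, size_t len, int flags)
  {
    const char *buf = vbuf;
    /* MSG_MORE (Linux) tells the kernel not to push a partial segment yet. */
    int sflags = (flags & SEND_MORE) ? MSG_MORE : 0;

    while (len > 0) {
      ssize_t r = send (sockfd, buf, len, sflags);
      if (r == -1) {
        if (errno == EINTR || errno == EAGAIN)
          continue;               /* a real server would poll here */
        return -1;
      }
      buf += r;
      len -= r;
    }
    return 0;
  }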
Even though my numbers showed improvements, aio-parallel-load (which
uses a 64k buffer) is not necessarily the most typical load pattern, and
later threads (both here and on the qemu list, where I triggered an
actual pessimisation with a sequence of commands doing 2M accesses) have
pointed out that it's probably better to do this more like:
crypto_send() {
  if (SEND_MORE) {
    cork
    send
  } else if (size < threshold) {
    send
    uncork
  } else {
    uncork
    send
  }
}
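To make that shape concrete, here is a sketch against the gnutls corking
API; the threshold constant, SEND_MORE value, and error handling are
placeholders for discussion, not the eventual patch:

  /* Sketch only: threshold, SEND_MORE value, and error handling are
   * placeholders, not final code. */
  #include <gnutls/gnutls.h>

  #define SEND_MORE 1
  #define CORK_THRESHOLD 1500     /* see the open question below */

  static int
  crypto_send_sketch (gnutls_session_t session,
                      const void *vbuf, size_t len, int flags)
  {
    const char *buf = vbuf;
    size_t total = len;

    if (flags & SEND_MORE) {
      /* More of this reply follows immediately: batch it up. */
      gnutls_record_cork (session);
    }
    else if (total >= CORK_THRESHOLD) {
      /* Large payload: flush any corked header now rather than stalling
       * it behind data that will need multiple segments anyway. */
      if (gnutls_record_uncork (session, GNUTLS_RECORD_WAIT) < 0)
        return -1;
    }

    while (len > 0) {
      ssize_t r = gnutls_record_send (session, buf, len);
      if (r < 0) {
        if (r == GNUTLS_E_INTERRUPTED || r == GNUTLS_E_AGAIN)
          continue;               /* a real server would poll here */
        return -1;
      }
      buf += r;
      len -= r;
    }

    /* Small final write: uncork after sending, so the header and payload
     * can share a TLS record / TCP segment. */
    if (!(flags & SEND_MORE) && total < CORK_THRESHOLD &&
        gnutls_record_uncork (session, GNUTLS_RECORD_WAIT) < 0)
      return -1;

    return 0;
  }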
The reason for the size check is that gnutls has a very annoying habit
of realloc'ing buffer space to hold however much data is sent while
corked, no matter how large, and then slams the underlying socket with
enough data that you WILL need to wait for POLLOUT, while the point of
SEND_MORE is to optimize only the cases where the ENTIRE message is
likely to fit in a single TCP segment. Once you have a very large
NBD_CMD_READ, where the remaining data is so large that it will take CPU
time to encrypt and will be split into multiple TCP segments no matter
what, there is no longer any benefit to stalling the partial packet of
the reply header while waiting for the payload.
Or stated another way: if you have 40 bytes of overhead per TCP segment
(probably larger for TLS), and have a 26-byte NBD_REPLY_TYPE_ERROR
message split into a 20-byte header and 6-byte payload, the savings for
transmitting 26+40 instead of 20+40 + 6+40 are huge (about 37%). If you
have NBD_REPLY_TYPE_OFFSET_DATA covering 512 bytes of NBD_CMD_READ
reply, the savings are a little smaller (540+40 vs. 20+40 + 8+40 +
512+40), but still a modest 12%. But once you have
NBD_REPLY_TYPE_OFFSET_DATA covering 64k bytes of NBD_CMD_READ, you are
GOING to have multiple segments no matter what, and whether those
segments split as 48/64k or as 64k/48, you get the same overhead.
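Spelling out those byte counts:

  error reply, split   : (20+40) + (6+40)            = 106 bytes
  error reply, merged  : 26+40                        =  66 bytes  (~37% saved)
  512-byte read, split : (20+40) + (8+40) + (512+40)  = 660 bytes
  512-byte read, merged: 540+40                       = 580 bytes  (~12% saved)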
However, I don't yet have a good idea of what that threshold number
should be (64k per the maximum TCP segment size, or 1500 bytes per the
traditional Ethernet MTU, or somewhere in between?). Is there a way to
dynamically learn a good threshold, or is a compile-time constant good
enough?
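One possibility for learning it at runtime (just an idea, not something
the patch attempts) would be to ask the kernel for the connection's
current MSS and cork only when the whole message fits below that:

  /* Idea only: use the kernel's current TCP MSS as the corking threshold,
   * falling back to 1500 when the query fails (e.g. on a Unix socket). */
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>

  static int
  corking_threshold (int sockfd)
  {
    int mss = 0;
    socklen_t len = sizeof mss;

    if (getsockopt (sockfd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == -1 ||
        mss <= 0)
      return 1500;
    return mss;
  }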
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization:  qemu.org | libvirt.org