Premature Optimization Is Fun Sometimes

A colleague of mine was recently discussing a connectivity monitoring
system he is working on with me. It’s nothing fancy, just sending ICMP
Echo Requests to a couple of different servers, and monitoring latency
and dropped packet averages over 1-minute, 5-minute, and 15-minute
periods. Up came the topic of how this data should be stored, the
natural thought was a 512 entry ring buffer, containing entries like the
following:

12 KiB. Pretty wasteful, right? We can certainly do better. Do we
need to keep fields for both sent and
received? What we’re really interested is the latency. We
need to know when a packet was sent, only up until we know when it was
received, at that point, the data we want to keep is
received - sent, so why don’t we make it a tagged
union?

Not bad, we’ve shaved off an entire page. We can still do better.

Nanosecond precision? In our case, ping times are measured in the
tens or hundreds, or even thousands of milliseconds. We don’t
need to keep all of those extra bits around. If we change the unit from
nanosecond to 100 microsecond increments (0.1ms), then 43-bits is
sufficient for us to keep track of pings for up to 20-years1. 20-years is a bit excessive still,
but it doesn’t hurt to be at least a little bit future proof.

And received? 8-bit for a true/false value seems
altogether too much. The answer: bitfields.

Wait, what? Why haven’t we saved any space?

The answer is struct padding. The layout of
ping_timestamp_2 looks like this:

Where the padding byte at the end is to ensure alignment
requirements. ping_timestamp_3 on the other end, looks like
this:

So our optimization there didn’t actually save any space. We’re
wasting 36-bits of padding. Is there any way we can somehow do
better?

We keep track of the source address due to frequent changes while our
product is in operation (on a mobile data network). When the address
changes, we also reset the sequence number for reasons that aren’t
relevant to the current topic. We have seen, in the past, packets with
differing source addreses but identical sequence numbers due to be
processed by our application at the same time (the joys of asynchronous
programming), so the source address serves to disambiguate these
changes.

But there’s another way to disambiguate.

An ICMP echo request has a 16-bit long identifier field
to allow applications to identify which echo request packets were sent
by them. Its value is completely arbitray. On Linux iputils
ping sets it to getpid() & 0xFFFF; on
OpenBSD a random number is used instead.

Although it’s 16-bits long, we don’t actually need to use the full
16-bits. There’s 4 free bits left in the first 8-bytes of our
ping_timestamp_3. Our thought was to use a rolling 4-bit
counter, that is increased whenever our source address changes (this is
monitored elsewhere in the application), allowing us to uniquely
identify2 which source address the packet came
from.

Much better. A whole 8-kilobytes of savings, and down to a single
page of data. You may have noticed that I have changed the order of the
fields slightly. This is to line seq_no up on a
16-bit boundary, so that loading it is a single
ldrh instruction rather than require a shift. Similarly,
reading from elapsed_or_sent_ts only requires a mask.

In the end, this was a completely pointless exercise. Our application
isn’t remotely memory constrained.

I realized there’s a way to “optimize” this slightly further. By
switching the order of the received/counter fields, accessing the received bit only requires a
shift instruction rather than a shift and a mask:

There’s a slight “issue”3 with the above code:
received is now much cheaper to access, at the expense of
counter needing to mask out the received
bit.

But we can fix this! We only ever read counter when
received is true, i.e. 1. If received were
zero, and we could tell the compiler to assume that it were zero, then
no mask would be necessary.

The solution? Flip the meaning of the received bit.

Now, if the read of counter only ever happens inside of a conditional
that checks if not_received is zero, then the compiler is able to
completely elide the mask.

The timestamps are taken from the linux kernel’s
monotonic clock, which measures time elapsed since boot. If we were
measuring from the Unix epoch, then we would need 51-bits.↩︎

As long the IP address doesn’t change more than 16-times
in the period we’re monitoring, which is not something we have ever
seen.↩︎

I’m using that term loosely, we haven’t even benchmarked
anything.↩︎

Premature Optimization Is Fun Sometimes(godbolt.org)

Comments