Channels
Grandpa Games is a no-ads collection of memory games, social party games, logic puzzles, and relaxing smartphone skill games.
A colleague of mine was recently discussing a connectivity monitoring system he is working on with me. It’s nothing fancy, just sending ICMP Echo Requests to a couple of different servers, and monitoring latency and dropped packet averages over 1-minute, 5-minute, and 15-minute periods. Up came the topic of how this data should be stored, the natural thought was a 512 entry ring buffer, containing entries like the following: 12 KiB. Pretty wasteful, right? We can certainly do better. Do we need to keep fields for both sent and received? What we’re really interested is the latency. We need to know when a packet was sent, only up until we know when it was received, at that point, the data we want to keep is received - sent, so why don’t we make it a tagged union? Not bad, we’ve shaved off an entire page. We can still do better. Nanosecond precision? In our case, ping times are measured in the tens or hundreds, or even thousands of milliseconds. We don’t need to keep all of those extra bits around. If we change the unit from nanosecond to 100 microsecond increments (0.1ms), then 43-bits is sufficient for us to keep track of pings for up to 20-years1. 20-years is a bit excessive still, but it doesn’t hurt to be at least a little bit future proof. And received? 8-bit for a true/false value seems altogether too much. The answer: bitfields. Wait, what? Why haven’t we saved any space? The answer is struct padding. The layout of ping_timestamp_2 looks like this: Where the padding byte at the end is to ensure alignment requirements. ping_timestamp_3 on the other end, looks like this: So our optimization there didn’t actually save any space. We’re wasting 36-bits of padding. Is there any way we can somehow do better? We keep track of the source address due to frequent changes while our product is in operation (on a mobile data network). When the address changes, we also reset the sequence number for reasons that aren’t relevant to the current topic. We have seen, in the past, packets with differing source addreses but identical sequence numbers due to be processed by our application at the same time (the joys of asynchronous programming), so the source address serves to disambiguate these changes. But there’s another way to disambiguate. An ICMP echo request has a 16-bit long identifier field to allow applications to identify which echo request packets were sent by them. Its value is completely arbitray. On Linux iputils ping sets it to getpid() & 0xFFFF; on OpenBSD a random number is used instead. Although it’s 16-bits long, we don’t actually need to use the full 16-bits. There’s 4 free bits left in the first 8-bytes of our ping_timestamp_3. Our thought was to use a rolling 4-bit counter, that is increased whenever our source address changes (this is monitored elsewhere in the application), allowing us to uniquely identify2 which source address the packet came from. Much better. A whole 8-kilobytes of savings, and down to a single page of data. You may have noticed that I have changed the order of the fields slightly. This is to line seq_no up on a 16-bit boundary, so that loading it is a single ldrh instruction rather than require a shift. Similarly, reading from elapsed_or_sent_ts only requires a mask. In the end, this was a completely pointless exercise. Our application isn’t remotely memory constrained. I realized there’s a way to “optimize” this slightly further. By switching the order of the received/counter fields, accessing the received bit only requires a shift instruction rather than a shift and a mask: There’s a slight “issue”3 with the above code: received is now much cheaper to access, at the expense of counter needing to mask out the received bit. But we can fix this! We only ever read counter when received is true, i.e. 1. If received were zero, and we could tell the compiler to assume that it were zero, then no mask would be necessary. The solution? Flip the meaning of the received bit. Now, if the read of counter only ever happens inside of a conditional that checks if not_received is zero, then the compiler is able to completely elide the mask. The timestamps are taken from the linux kernel’s monotonic clock, which measures time elapsed since boot. If we were measuring from the Unix epoch, then we would need 51-bits.↩︎ As long the IP address doesn’t change more than 16-times in the period we’re monitoring, which is not something we have ever seen.↩︎ I’m using that term loosely, we haven’t even benchmarked anything.↩︎
A discussion between Nicolas Delcros and Roger Hui Nicolas, Prologue: From a language point of view, thanks to Ken Iverson, it is obvious that you want grade rather than sort as a primitive. Yet from a performance point of view, sort is currently faster than grade. Can one be “more fundamental” than the other? If so, who’s wrong…APL or our CPUs? In any case, what does “fundamental” mean? Roger: Sorting is faster than grading due to reasons presented in the J Wiki essay Sorting versus Grading, in the paragraph which begins “Is sorting necessarily faster than grading?” I can not prove it in the mathematical sense, but I believe that to be the case on any CPU when the items to be sorted/graded are machine units. Nicolas, Formulation 1: Now, parsing ⍵[⍋⍵] (and not scanning the idiom) begs the question of how deeply an APL intepreter can “understand” what it’s doing to arrays. How would an APL compiler resolve this conjunction in the parse tree? Do you simply have a bunch of state pointers such as “is the grade of” or “is sorted” or “is squozen” or “axis ordering” etc. walking along the tree? If so, do we have an idea of the number of such state pointers required to exhaustively describe what the APL language can do to arrays? If not, is there something more clever out there? Roger: I don’t know of any general principles that can tell you what things can be faster. I do have two lists, one for J and another for Dyalog. A big part of the lists consists of compositions of functions, composition in the mathematical sense, that can be faster than doing the functions one after the other if you “recognize” what the composed function is doing and write a native implementation of it. Sort vs. grade is one example (sort is indexing composed with grade). Another one is (⍳∘1 >) or (1 ⍳⍨ >). The function is “find the first 1” composed with >. These compositions have native implementations and: x←?1e6⍴1e6 cmpx '5e5(1 ⍳⍨ >)x' '(5e5>x)⍳1' 5e5(1 ⍳⍨ >)x → 0.00E0 | 0% (5e5>x)⍳1 → 1.06E¯2 | +272100% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ cmpx '¯1(1 ⍳⍨ >)x' '(¯1>x)⍳1' ¯1(1 ⍳⍨ >)x → 2.41E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ (¯1>x)⍳1 → 4.15E¯3 | +71% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ If you get a “hit” near the beginning, as would be the case with 5e5, you win big. Even if you have to go to the end (as with ¯1), you still save the cost of explicitly generating the Boolean vector and then scanning it to the end. Another one, introduced in 14.1, is: cmpx '(≢∪)x' '≢∪x' (≢∪)x → 4.43E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ ≢∪x → 1.14E¯2 | +157% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ This is the tally nub composition, used often in a customer application. If you “know” that you just want the tally of the nub (uniques), you don’t actually have to materialise the array for the nub. I am not conversant with compiler technology so I don’t know what all an APL compiler can do. I do know that there’s a thing call “loop fusion” where, for example, in a+b×c÷2, it doesn’t have to go through a,b,c in separate loops, but can instead do :for i :in ⍳≢a ⋄ z[i]←a[i]+b[i]×c[i]÷2 ⋄ :endif saving on the array temp on every step. You win some with this, but I think the function composition approach wins bigger. On the other hand, I don’t know that there is a general technique for function composition. I mean, what general statements can you make about what things can be faster (AKA algebraically simpler)? Nicholas: I sort of see…so jot is a “direct” conjunction. An indirect conjunction could be ⍵[⍺⍴⍋⍵] where the intermediate grade is reshaped. We “know” that grade and shape are “orthogonal” and can rewrite the formula to ⍴⍵[⍋⍵]. So if we can establish a list of flags, and establish how primitives touch these flags, and how these flags affect each other, then we can extend ∘ to any path of the parse tree, provided that the intermediate nodes don’t destroy the relationship between the two operations (here grade + indexing provides sort, independently of the reshape). Of course we can spend our lives finding such tricks. Or we could try and systemise it. Roger: What you are saying above is that reshape and indexing commute (i.e. reshape ∘ indexing ←→ indexing ∘ reshape). More generally, compositions of the so-called structural functions and perhaps of the selection functions are amenable to optimisations. This would especially be the case if arrays use the “strided representation” described in Nick Nickolov’s paper Compiling APL to JavaScript in Vector. I used strided representation implicitly to code a terse model of ⍺⍉⍵ in 1987. Nicolas, Formulation 2: On which grounds did the guys in the ’50s manage to estimate the minimal list of operations that you needed to express data processing? Roger: APL developed after Ken Iverson struggled with using conventional mathematical notation to teach various topics in data processing. You can get an idea of the process from the following papers: Notation as a Tool of Thought (Turing lecture) The Evolution of APL, especially the Transcript of Presentation In our own humble way, we go through a similar process: We talk to customers to find out what problems they are faced with, what things are still awkward, and think about what if anything we can do to the language or the implementation to make things better. Sometimes we come up with a winner, for example ⌸. You know, the idea for ⍋(grade) is that often you don’t just use ⍋x to order x (sort) but you use ⍋x to order something else. Similarly with ⌸, you often don’t want just x⍳y but you use it to apply a function to items with like indices. The J Wiki essay Key describes how the idea arose in applications, and then connected with something I read about, the Connection Machine, a machine with 64K processors (this was in the 1980s). Nicolas, Formulation 3: Do we have something with a wider spectrum than “Turing complete or not” to categorise the “usefulness and/or efficiency” of a language? Roger: Still no general principles, but I can say this: Study the languages of designers whom you respect, and “borrow” their primitives, or at least pay attention to the idea. For example, =x is a grouping primitive in k. Symbols are Arthur Whitney’s most precious resource and for him to use up a symbol is significant. =x ←→ {⊂⍵}⌸x old k definition =x ←→ {⍺⍵}⌸x current k definition Both {⊂⍵}⌸x (14.0) and {⍺⍵}⌸x (14.1) are supported by special code. Study the design of machines. For a machine to make something “primitive” is significant. For example, the Connection Machine has an instruction “generalised beta” which is basically {f⌿⍵}⌸. Around the time the key operator was introduced, we (Ken Iverson and I) realised that it’s better not to build reduction into a partitioning, so that if you actually want to have a sum you have to say +⌿⌸ rather than just +⌸. The efficiency can be recovered by recognising {f⌿⍵}⌸ and doing a native implementation for it. (Which Dyalog APL does, in 14.0.) Roger, Epilogue: To repeat Nicolas’ question in the opening section, what does “fundamental” mean? (Is grade more fundamental than sort?) In computability, we can use Turing completeness as the metric. But here the requirement is not computability but expressiveness. I don’t know that there is a definitive answer to the question. One way to judge the goodness of a notation, or of an addition to the existing notation such as dfns or key or rank, is to use it on problems from diverse domains and see how well it satisfies the important characteristics of notation set out in Ken Iverson’s Turing lecture: ease of expressing constructs arising in problems suggestivity subordination of detail economy amenability to formal proofs I would add to these an additional characteristic: beauty. Follow Comments are closed.
Real risks and production mitigations
Data from millions of Cursor sessions shows how AI is transforming software development. Spring 2026 report.
VERITROOPER audits any LLM against your own data — detecting hallucinations, pinpointing failures, and returning audit-ready reports plus EU AI Act conformity-evidence records.
When a community came together after Red Hat said Windows was 'probably the right product