Open methodology for benchmarking crypto infrastructure (RPCs, bridges, oracles, pegs): feedback wanted (reddit.com)

I spent the last month building open benchmarks for crypto infrastructure: RPC latency, bridge fees, oracle deviation, stablecoin peg drift, L1 finality, and Solana transaction landing rates. Everything is live and backed by Prometheus, the methodology is public, and the harness code is on GitHub. Before I add more providers I want methodology review from people who have thought about this longer than I have.

Honestly, the state of the field is rough. Every infrastructure benchmark I found falls into one of three buckets: marketing copy published by the provider being measured, a one-off blog post with no live data and no methodology, or closed measurements you just have to trust. None of them are reproducible. None publish their sampling cadence, outlier handling, region split, or fetch mechanism. So I built one that does.

This is not a launch post. I want technical critique on how I am measuring, not feedback on the UI. Link is at the bottom. The methodology questions are why I am here.

Sampling cadence and aggregation

Each metric has its own scrape interval. RPC eth_call latency runs at 30s per provider per region. L1 finality wall clock uses 10s polling for HTTP-only chains, with a persistent WebSocket subscription for BNB and Avalanche so I can record when a block is first seen as latest and when the same block is first seen as finalized, down to the millisecond. Stablecoin peg is sampled every 5s on CEX REST tickers and every 12s on Curve get_dy onchain. Aggregator head lag uses 15s WebSocket event sampling.

Samples get bucketed per minute, then aggregated with quantile_over_time across the reporting window (24h or 7d). I publish p50, p90 and p99 separately. The leaderboard sorts on p50 by default because it is the most stable indicator of typical user experience, but the full distribution is queryable from the API and shown on each bench page.

For a single 24h window on a daily active metric I currently get roughly 100k to 500k data points per provider, depending on cadence.
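To make the bucketing step concrete, here is a minimal sketch of "bucket per minute, then take quantiles across the window". It is not the harness code. The function names are hypothetical, the per-bucket reduction is assumed to be a median, and the quantile rule (linear interpolation between neighbouring ranks) is an assumption about how the window quantile is interpolated.

```python
from collections import defaultdict
from statistics import median


def per_minute_medians(samples):
    """Group (unix_seconds, value) samples into one-minute buckets.

    Returns one median per bucket, ordered by time. Using the median as the
    per-bucket reduction is an assumption made for this sketch.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[int(ts // 60)].append(value)
    return [median(buckets[minute]) for minute in sorted(buckets)]


def quantile(values, q):
    """q-quantile with linear interpolation between neighbouring ranks."""
    if not values:
        raise ValueError("no values in window")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def summarize(per_minute):
    """Return the three published quantiles for one reporting window."""
    return {
        "p50": quantile(per_minute, 0.50),
        "p90": quantile(per_minute, 0.90),
        "p99": quantile(per_minute, 0.99),
    }
```

Calling summarize(per_minute_medians(samples)) on one provider's samples for a 24h window yields one row of a leaderboard under these assumptions. Swapping the per-bucket reduction or the interpolation rule changes the tail quantiles more than the median, which is worth keeping in mind when comparing providers that sit close together.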
I think that sample volume is enough to publish a leaderboard. Am I wrong? Where is the floor for "this comparison actually means something" in your experience?

Outlier handling, where I am least confident

For latency metrics I currently drop the top 1% of samples in each per-minute bucket before computing the per-minute median. The daily p99 is then computed from those per-minute medians, not from raw samples. This trims one-off spikes (a single GC pause, a single network reroute) without hiding consistent tail behavior.

Three providers have already pushed back, in opposite directions. One says I am masking their real worst case (which they argue is still better than competitors with wider tails). Another says I should drop nothing and let the raw distribution speak. A third says I should drop more, the top 5%, because they have a known rare failure mode they consider out of scope.

I do not have a principled answer yet. The tension is between representing typical experience honestly and letting one network event dominate the number. What do you do? Is there a paper or prior art you would point me to?

Multiregion

Three of the benches run in three regions: US East, EU West and Singapore. For those the leaderboard reports the median across regions, which hides regional advantages a provider may have. A provider with a single US datacenter and no Singapore edge can look bad on the global number even if it is fast for its actual customer base. A provider that is strong only in the EU looks mid-tier when measured globally.

Right now I expose per-region rankings as a filter on the bench page, but the shareable headline is the median across regions. Should this be configurable per provider? Should the headline default to the user's nearest region by geolocation? I have not seen anyone solve this cleanly.

Connection reuse

For eth_call latency I rebuild the HTTPS connection on every call to measure cold call latency.
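The difference between the two measurement modes can be sketched as follows. This is an illustration under assumptions, not the harness code: measure, timed_call, the host rpc.example.com and the zero-address payload are all hypothetical. In cold mode each timed call opens a fresh connection, so the TCP and TLS setup falls inside the timer. In warm mode one connection is opened, a discarded warm-up call completes the handshake, and the timed calls reuse it.

```python
import http.client
import json
import time


def timed_call(conn, path, payload):
    """POST one JSON-RPC request on conn and return elapsed seconds.

    http.client opens the socket lazily on the first request, so on a fresh
    connection the elapsed time includes connection setup.
    """
    body = json.dumps(payload)
    start = time.perf_counter()
    conn.request("POST", path, body=body,
                 headers={"Content-Type": "application/json"})
    conn.getresponse().read()  # drain the body so the socket can be reused
    return time.perf_counter() - start


def measure(make_conn, path, payload, n, reuse):
    """Collect n latencies, either cold (reuse=False) or warm (reuse=True)."""
    latencies = []
    if reuse:
        conn = make_conn()
        try:
            timed_call(conn, path, payload)  # warm-up call, result discarded
            for _ in range(n):
                latencies.append(timed_call(conn, path, payload))
        finally:
            conn.close()
    else:
        for _ in range(n):
            conn = make_conn()  # fresh connection for every sample
            try:
                latencies.append(timed_call(conn, path, payload))
            finally:
                conn.close()
    return latencies


# Hypothetical usage; the host and the eth_call target are placeholders:
# make_conn = lambda: http.client.HTTPSConnection("rpc.example.com", timeout=10)
# payload = {"jsonrpc": "2.0", "id": 1, "method": "eth_call",
#            "params": [{"to": "0x" + "00" * 20, "data": "0x"}, "latest"]}
# cold = measure(make_conn, "/", payload, n=5, reuse=False)
# warm = measure(make_conn, "/", payload, n=5, reuse=True)
```

Running both modes against the same endpoint from the same host gives a per-provider estimate of the connection setup cost, which is the quantity the two sides of this argument disagree about.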
A few providers argue that cold call latency is not representative, since in production their clients hold persistent connections and reuse them. They want warm call latency. My reasoning is that cold call is the honest measurement when you do not know client behavior, and it captures TLS handshake overhead, which is a real cost. But warm call is closer to what a long-lived application actually sees. Right now I publish cold call as the default and do not publish warm call at all. Right call? Should I publish both?

Why Prometheus

Two reasons. First, every number on the site is a literal quantile_over_time query. Anyone can hit the HTTP API, pull the raw sample series, and recompute the leaderboard with their own aggregation. Reproducibility is the whole point. Second, it makes adding providers cheap. The harness exposes /metrics with consistent labels, a new entrant opens a PR adding their RPC URL to the config, and the metric shows up automatically with no manual leaderboard edits.

The cost is that quantile_over_time is an approximation, not an exact percentile, so it deviates by roughly 1 to 2% from the true percentile on raw samples. For most users that is noise. For a provider sitting 0.5% behind the leader it can be the difference between rank 1 and rank 2. I disclose it, but I know some people consider it disqualifying.

Site: https://openchainbench.com/
GitHub: https://github.com/OpenChainBench/OpenChainBench

submitted by /u/Minimum_Abies3578
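For anyone who wants to recompute a number, a query of the kind described above might be built like this. The sketch assumes the site exposes the standard Prometheus HTTP API (the /api/v1/query instant query endpoint), which the post does not confirm. The metric name, label names and base URL are hypothetical placeholders, and the helper names are made up for illustration.

```python
from urllib.parse import urlencode


def quantile_query(metric, labels, q, window):
    """Build a PromQL quantile_over_time expression for one series selector."""
    selector = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"quantile_over_time({q}, {metric}{{{selector}}}[{window}])"


def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus instant query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical metric, labels and host, for illustration only.
expr = quantile_query(
    "rpc_latency_seconds",
    {"provider": "example", "region": "us-east"},
    0.5,
    "24h",
)
url = instant_query_url("https://prometheus.example.com", expr)
```

Changing the first argument of quantile_over_time to 0.9 or 0.99, or the window to 7d, gives the other published views from the same series.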
