Hi, Niels here from the open-source team at Hugging Face. I've recently relaunched paperswithcode.co as a source for finding the state of the art (SOTA) across various AI domains, from 3D generation to AI agents. This is done by automatically parsing research papers published on arXiv/Hugging Face, enabling leaderboards to be created. See BrowseComp below as an example (a scatter plot and a table are available for each benchmark).

- Scatter plot (you can hover over the dots to see the models): https://preview.redd.it/9rz2r3ffcf6h1.png?width=2880&format=png&auto=webp&s=b3f8e7a870802f6ef8227ecc0619e9e1057554b0
- Table: https://preview.redd.it/qoqriddw5f6h1.png?width=2862&format=png&auto=webp&s=a0034574f693847537037013672fb61daf27b16e

As you can see, I've added support for viewing evals for closed-source models, too, given that many benchmarks are nowadays dominated by them, like GPT-5.5 and Mythos 5. You can always disable viewing closed-source evals with a toggle or in your PwC settings: https://preview.redd.it/p3k6jt6q6f6h1.png?width=1582&format=png&auto=webp&s=40149e51d6b326a77e53e33baf70d9850b3de365

When you turn them off, here's what the open model leaderboard looks like: https://preview.redd.it/tg42sin36f6h1.png?width=2838&format=png&auto=webp&s=1330a117ae9b4e0ce6d459493ae9e8f64107310a

Closed-source papers are treated as regular "papers", although they can be any source, like a blog post (given that PwC supports submitting any source beyond arXiv). See the GPT-5.5 or Mythos 5 papers as examples, with their evals at the bottom. Notice the "closed" tag on their evals. Hence, you could jokingly call these "papers without code".

Let me know what you think of this, and whether anything needs to be changed or added!

Kind regards,
Niels

submitted by /u/NielsRogge
I just finished my bachelor's degree with two first-author papers in A-tier venues, and I'm planning to start my PhD next year. I want to start reviewing papers in my domain (OOD detection and open-set problems) at similar venues. How do I get started? Most advice online just says to keep my OpenReview profile updated, but I haven't received any invitations to review. For context, my first paper was accepted around 6 months ago.

submitted by /u/Alternative_Essay_55
I just open-sourced RelayOps - a small, honest, production-shaped AI support agent built specifically for telecom and subscription billing queues.

Key results (v1.5.1):

- 54% of a 50-ticket sample queue auto-resolved
- 0 unsafe auto-actions
- 0 billing escapes (tested on 12 adversarial billing/account abuse cases)
- Safe-route rate 1.000 on 100 hand-written adversarial cases
- Deterministic access gate + server-side scoped tools + layered guardrail + durable SQLite audit store + Decision Console + Handoff Queue

Tech stack:

- Fine-tuned Qwen2.5-1.5B LoRA (published on HF) as Tier-1 intent classifier
- Hybrid BM25+TF-IDF/RRF RAG with citations
- Independent guardrail that blocks hallucinated pricing/offers
- Full per-turn decision traces (what was known + what was unavailable)
- Action policy table (blast radius × reversibility)

Everything is reproducible, heavily evaluated, and the README is brutally honest about synthetic-data caveats and pending reruns.

- Live demo (Streamlit): https://relayops-production.up.railway.app
- GitHub: https://github.com/patibandlavenkatamanideep/relayops

I'm actively looking for design partners who run real support queues. Drop a small redacted sample of your tickets and I’ll run the exact same batch evaluation on your data and send back the full report (auto-resolve %, safety metrics, audit export, time-saved estimate). Zero cost, zero production access required.

Would love feedback from the community, especially on the calibration/safety routing layer, the audit ledger format, or the guardrail design. Let me know what you think!

submitted by /u/Fit_Fortune953
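
For anyone unfamiliar with the RRF part of the hybrid BM25+TF-IDF retrieval mentioned in the post, here is a minimal sketch of reciprocal rank fusion. It is not taken from the RelayOps code: the function name, the document IDs, and `k=60` (a commonly used constant) are assumptions for illustration, and the two input lists merely stand in for rankings produced by a BM25 retriever and a TF-IDF retriever.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document gets 1 / (k + rank) from every ranking it appears in
    (rank starts at 1); documents are returned by descending total score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break ties on the document ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for one support ticket (made-up document IDs).
bm25_ranking = ["refund_policy", "plan_change", "late_fee"]
tfidf_ranking = ["plan_change", "roaming", "refund_policy"]

fused = reciprocal_rank_fusion([bm25_ranking, tfidf_ranking])
print(fused)
```

Because RRF uses only rank positions, the raw BM25 and TF-IDF scores never need to be put on a common scale, which is one plausible reason to choose it for a hybrid retriever.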