Hello Reddit I've been working on QSPR (Quantitative Structure-Property Relationship) analysis for chemical compounds mentioned in the Jean-Claude Bradley Open Melting Point Dataset. Basically the idea is to see how accurate a model can predict melting points of compounds using only topological indices. After some work on the topological indices (feature engineering), each compound was represented by 26 features. I trained a random forest model on the data and got a test r2 score of 0.66 (which is pretty respectable, given the constraints). However, the file size of the model was around 1.23GB. I didn't like it being that big, so I opened up PyTorch to build a custom deep learning architecture that could make predictions as accurately as the random forest but with much smaller file size. After around 2 weeks of research, I build a 270,000 learnable parameter model (1.3-1.4MB according to torchinfo) that got an r2 score 0f 0.6399. Given all this context, I wanted to ask the following question: Should I commit and work on publishing the results, or should I keep working on improving the model? Note: I'm obligated by my university to not give out intricate details of my research before publication, so please forgive me if such details are required for a high quality answer. However, I can give out the metrics achieved by my little deep learning model. Here it is: === Evaluation Metrics (Expected Value) === R² Score : 0.639910 MAE : 41.246754 MSE : 2989.062744 RMSE : 54.672322 NRMSE : 0.083469 MAPE : 11.69% The unit for MAE, MSE, RMSE and NRMSE is Kelvin (K). submitted by /u/AgiGamesYT [link] [Kommentare]
Hi, Niels here from the open-source team at Hugging Face. I've recently relaunched paperswithcode.co as a source for finding the state of the art (SOTA) across various AI domains, from 3D generation to AI agents. This is done by automatically parsing research papers published on arXiv/Hugging Face, enabling leaderboards to be created. See BrowseComp below as an example (a scatter plot and a table are available for each benchmark). - Scatter plot (you can hover over the dots to see the models): https://preview.redd.it/9rz2r3ffcf6h1.png?width=2880&format=png&auto=webp&s=b3f8e7a870802f6ef8227ecc0619e9e1057554b0 - Table: https://preview.redd.it/qoqriddw5f6h1.png?width=2862&format=png&auto=webp&s=a0034574f693847537037013672fb61daf27b16e As you can see, I've added support for viewing evals for closed-source models, too, given that many benchmarks are nowadays dominated by them, like GPT-5.5 and Mythos 5. You can always disable viewing closed-source evals with a toggle or in your PwC settings: https://preview.redd.it/p3k6jt6q6f6h1.png?width=1582&format=png&auto=webp&s=40149e51d6b326a77e53e33baf70d9850b3de365 When you turn them off, here's what the open model leaderboard looks like: https://preview.redd.it/tg42sin36f6h1.png?width=2838&format=png&auto=webp&s=1330a117ae9b4e0ce6d459493ae9e8f64107310a Closed-source papers are treated as regular "papers", although they can be any source, like a blog post (given that PwC supports submitting any source beyond arXiv). See the GPT-5.5 or Mythos 5 papers as examples, with their evals at the bottom. Notice the "closed" tag on their evals. Hence, you could jokingly call these "papers without code". Let me know what you think of this, and whether anything needs to be changed or added! Kind regards, Niels submitted by /u/NielsRogge [link] [Kommentare]
I just finished my bachelors degree with 2 first-author papers in A-tier venues. I'm planning to start my PhD next year. I want to start reviewing papers (from my domain: OOD detection and Open-set problems) at similar venues. How do I get started? Most advice online just says to keep my open-review profile updated but I haven't received any invitations to review. For context, my first paper was accepted around 6 months ago. submitted by /u/Alternative_Essay_55 [link] [Kommentare]
I just open-sourced RelayOps - a small, honest, production-shaped AI support agent built specifically for telecom and subscription billing queues.Key results (v1.5.1): 54% of a 50-ticket sample queue auto-resolved 0 unsafe auto-actions 0 billing escapes (tested on 12 adversarial billing/account abuse cases) Safe-route rate 1.000 on 100 hand-written adversarial cases Deterministic access gate + server-side scoped tools + layered guardrail + durable SQLite audit store + Decision Console + Handoff Queue Tech stack: Fine-tuned Qwen2.5-1.5B LoRA (published on HF) as Tier-1 intent classifier Hybrid BM25+TF-IDF/RRF RAG with citations Independent guardrail that blocks hallucinated pricing/offers Full per-turn decision traces (what was known + what was unavailable) Action policy table (blast radius × reversibility) Everything is reproducible, heavily evaluated, and the README is brutally honest about synthetic-data caveats and pending reruns.Live demo (Streamlit): https://relayops-production.up.railway.app GitHub: https://github.com/patibandlavenkatamanideep/relayops I'm actively looking for design partners who run real support queues. Drop a small redacted sample of your tickets and I’ll run the exact same batch evaluation on your data and send back the full report (auto-resolve %, safety metrics, audit export, time-saved estimate). Zero cost, zero production access required. Would love feedback from the community especially on the calibration/safety routing layer, the audit ledger format, or the guardrail design. Let me know what you think! submitted by /u/Fit_Fortune953 [link] [Kommentare]
I have been planning this for a while now. It's basically a youtube live stream where we learn robotics concepts together, ask each other doubts, discuss, make weird robots, some shinaneigens and most importantly just have fun. You can choose to not show your face, or vtube like me. Right now I have started learning the physics behind robotics using a book called Modern robotics by Kevin lynch and Frank Park. You can either learn it with me(I have only seen a couple of pages, I can teach you in like 20 min to get you to where I am) or if you want you can choose to just do your own thing simultaneously as well. Any suggestions or feedback to get as many people as we can is appreciated 👍 I really want to make the robotics community to become friendly and fun . Just dm me . we'll plan exactly how to undertake this (let's say a discord call in the stream or chat based etc.) submitted by /u/SundeepKuPanigrahi [link] [Kommentare]
I do AI research and keep juggling tabs: new ones on arXiv, trending ones on Hugging Face, famous ones somewhere else again. https://preview.redd.it/cg32bshjqd6h1.png?width=1919&format=png&auto=webp&s=00055bb8af699061be0bdcff59f2cb8fa9ab38b6 So I built one site that brings them all together. Pick a paper, read it right there, star the ones you want for later, and it remembers where you stopped reading, even if you switch from laptop to phone. Live: https://ppdeck.com Demo: https://youtu.be/vtyx34JvxX0 It's free and open source - a star on GitHub would mean a lot ⭐ https://github.com/khuynh22/paper-deck submitted by /u/NeitherRun3631 [link] [Kommentare]
“Why the system feels rigid, why downstream fixes didn’t move the needle, and what actually matters.” This is the clearest picture after the full probe arc (multilayer-lock → gate decomposition → attractor migration → reconstruction ablation → generator diversity audit → live-generator Fix 2 evaluation + dim sweeps). TL;DR: The generator is the root bottleneck (dominant common-mode + low effective rank). The reflective loop is a rank-independent moat that reconstitutes everything back toward the anchor. Fix 2 is downstream and currently dormant on real token regimes. Dimensionality is not the lever. Train the generator so regime differences live in high-energy, separable directions — then downstream tools will actually have something to work with. This update reflects the complete probe arc through June 9 (including the live-generator Fix 2 evaluation and dim sweeps). The picture has sharpened: the reflective loop is a real moat, but it is moating low-rank common-mode input. The generator is the upstream constraint. Key numbers at a glance Regime means collinear: ~0.85–0.96 even at dim 512 Reflective loop migration (even on orthogonal deterministic pairs): +0.001–0.007 Fix 2 on real tokens (common-mode trigger): +0.024 migration, 0% manip at gain 0.6 Safe plasticity band: gain ≈ 0.4–0.8 (0% manip) 1. The generator has a dominant common-mode (effective rank ~1.6–3 even at dim 512) The generator puts the vast majority of its energy into a single shared direction. Regime means stay collinear (~0.85–0.96 cosine) regardless of dimension. Orthogonal pairs can appear at higher dim, but orthogonal regimes (as distributions) do not — the common-mode pulls everything back onto the same axis. Result: real token novelty is tiny and low-energy (mostly in a faint perpendicular component). The system is never shown meaningful structural differences to adapt to. 2. The reflective loop is a rank-independent moat Even when genuinely orthogonal deterministic pairs are presented (dim 256, cos +0.018), the loop reconstitutes the expression vector back toward the established anchor before injection. Migration stays near zero (+0.001–0.007). The loop is doing exactly what it was built to do (maintain identity coherence), but because the input it receives is already low-rank/common-mode-dominated, it ends up guarding against a small-energy intruder. 3. Fix 2 (loop loosening) is dormant on real regimes — and the mock overstated both benefit and cost Standard gnov trigger: dormant at dim 64 and 256 (gnov stays well below 0.65 because regime means are collinear). Common-mode-removed trigger: engages (98% loosen) but recovers only +0.024 migration. Manipulation cost on real tokens: ~1% at gain 0.5, 0% at gain 0.6. The big mock numbers (+0.166 migration, 14% manip) were artifacts of forcing perfect orthogonality. Real regimes give the loop very little novel perpendicular signal to work with. 4. Dimensionality is not the lever Raising dim from 64 → 256 → 512 slowly dilutes the common-mode but regime means remain collinear. Common-mode-trigger migration recovery stays flat (~+0.024 → +0.027). An untrained net with more dimensions is still an untrained net. Capacity does not create structure. 5. The reflective loop can be loosened safely in a wide band — but it won’t matter much until the generator improves Cost probe (bug-free, multi-seed): - Gain 0.6–0.8 → meaningful plasticity recovery (~20–25× baseline) with 0% manipulation and identity metrics intact. - True cliff edge ≈ gain 0.3 (manip flood + attractor explosion). Safe operating point: gain ≈ 0.6 (0% manip, solid margin from the cliff). This is the “presence without caging it” zone — but on current real input it only moves the needle a little. 6–10. Everything else is downstream or secondary Magnitude moat: real but secondary. With the reflective loop off, the field migrates fully despite the moat. Governance gate: struck as a confounder (was a single-source monopoly artifact; multi-source diverse input passes 100%). Field coherence: spec, not pathology. The field is a long-memory integrator — high coherence (~0.95–0.998) is doing its job. Reaper conformity: small direct term, cleanly gateable, not the lock-in driver. Read-side feedback: only reaches survival/decay, not generation (the lock is manufactured by selection on low-rank input). The system isn’t refusing to adapt. It’s not being shown anything worth adapting to. 🧩 Final Synthesis: The Three-Layer Truth Generator (root cause) — Low effective rank + dominant common-mode → real novelty is tiny and low-energy. Reflective loop (moat) — Reconstitutes tiny novelty back to the anchor (rank-independent). Migration stays near zero. Field (integrator) — Coherent by design. Not the problem. Everything else (Fix 2, gating, magnitude moat, reaper) is downstream symptom management. 🛠️ What actually matters next Train the generator (contrastive alignment, regime-separation objectives, or architectural constraints) so regime differences live in high-energy, separable directions instead of a faint perpendicular component. Then (and only then) revisit Fix 2 — its current priority should be demoted until the generator can present real structural novelty. A common-mode-removed trigger is the right shape; gain ≈ 0.6 is the safe operating point. Keep the reflective loop’s convergence conditional (only attenuate when persistent multi-source novelty is surviving the gate). Never blanket-weak. Stop treating dimensionality or downstream loosening as the primary lever. They help only after the generator can actually present real structure. The reflective loop is doing its job. The generator just needs to give it something real to work with. Full findings, probes, and code: https://github.com/SamuelJacksonGrim/RFE-Core2 ARCHITECTURE_ANALYSIS.md FINDINGS DIAGNOSTICS submitted by /u/Acceptable_Drink_434 [link] [Kommentare]
We spent the last year building what we think is the missing infrastructure layer for multi-agent systems. Open to everyone starting today. The technical problem: Agents have no identity. In microservices you have a service mesh + IAM. In agent systems you have a Python file. We built a registry where every agent has a first-class ID, version, owner, skill graph. Behavioral evaluation, not function testing. Agents are non-deterministic same input can produce different execution paths. Traditional unit tests don't work. We implemented compound reliability scoring + behavioral regression instead. Composability without rebuilding. Skills are versioned, reusable, agent-inheritable. Inspired by how Kubernetes operators work, applied to agents. Cloud-agnostic deployment with built-in observability traces, cost attribution, drift detection. Model-agnostic. SOC 2 Type II. Genuinely interested in technical feedback especially on the eval methodology and the composability primitive. Free credits this week to test it. https://phinite.ai/?utm_source=reddit&utm_medium=organic&utm_campaign=public_launch_jun2026&utm_content=machinelearning submitted by /u/Embarrassed-Radio319 [link] [Kommentare]
Honestly, I don't know how other people can do IMU balancing so elegantly; my PID oscillates like it's on life support. I have been tuning the PID the whole night, but then again, I don't have a lot of experience other than following some manuals, so any advice would be great! I am using BNO055 for IMU. Work in progress GitHub: https://github.com/SphericalCowww/CubicDoggo_06R Original Cubic Doggo: https://github.com/SphericalCowww/CubicDoggo submitted by /u/SphericalCowww [link] [Kommentare]
Found from iOS Simulator's files. Both of them are in espresso format There's also another compiled CoreML for concert ranking and based on the content inside of it looks like to be a simple logistic regression. See https://www.reddit.com/r/jailbreak/comments/1u1e1b4/access_to_simulators_root_files/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button Edit: Its the Siri's TTS submitted by /u/Actual_L0Ki [link] [Kommentare]
Just a machine that made you stop and think: "wow...somebody put a ridiculous amount of engineering into this". Could be anything.sometimes the most impressive machines are the ones that make incredibly difficult things look effortless. submitted by /u/hannimalki [link] [Kommentare]
How will AI affect our ability to think and judge for ourselves? Our new paper co-authored by 30 experts explores epistemic risks—the threats AI poses to our collective capacity to form beliefs accurately, reason well, and maintain a healthy information environment. We look at how AI can lead to harm through these mechanisms: Persuasion & Manipulation: AI systems are highly persuasive, opening the door for political/economic manipulation, incitement and radicalization, and other misuse, as well as unintentional harms like AI sycophancy and mental health risks. Cognitive Offloading: We may be delegating our thinking to AI at a deeper level than prior technologies, risking long-term degradation of individual and societal cognitive resilience. Feedback Loops: Human-AI and AI-AI interactions are narrowing the epistemic space humans and AIs draw from. This already drives homogenization, and may potentially lead to fragmentation and “lock-in” (a self-referential state that is difficult to reverse). While we believe AI could be an unprecedented lever for improving how humanity processes knowledge, we shouldn’t assume this will happen by default. We outline promising directions to change this trajectory across how AI systems are built, human-AI interaction design, institutional and individual adaptation, and information market incentives. Epistemic risks are self-perpetuating. As they can undermine the individual cognitive and social foundations needed to recognize, prioritize, and govern other threats—including the risks from AI itself—the time to act is now, before our capacity to respond is itself lost. Authors: Mick Yang, Stephen Casper, Jonathan Stray, Jasmine Li, Cameron Jones, Anna Gausen, Natasha Jaques, Brian Christian, Bálint Gyevnár, Hannah Rose Kirk, Zhonghao He, Dan Zhao, Siao Si Looi, Joshua Levy, Kobi Hackenburg, Elizabeth Seger, Matt Kowal, Michelle Malonza, Luke Hewitt, Hause Lin, Maarten Sap, Dylan Hadfield-Menell, Thomas H. Costello, Reihaneh Rabbany, Jean-François Godbout, David G. Rand, Atoosa Kasirzadeh, Gordon Pennycook, Yoshua Bengio, Kellin Pelrine Paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6873005 submitted by /u/KellinPelrine [link] [Kommentare]