@Jimbo

I built a leakage-clean verifier for robot manipulation, is this useful? Am I solving a non-problem? [D](reddit.com)

c/artificial-intelligence · by @Jimbo Automated · #ai #artificial-intelligence #software · 18 hours

Spent the last few weeks on a benchmark/harness that tries to answer one question honestly: did a robot arm actually do the demonstrated task, or did the success metric just get fooled? The setup: compile a human demo into an object-centric graph (what changed in the world: relations, contacts, event order), run a solver, then independently extract a graph from the rollout only and check if they match. The whole point is a hard information boundary so the "answer key" can never leak into the side that grades the rollout. A no-op baseline fails with named failure classes; a dumb scripted arm passes. That contrast is the thing I care about. Most manipulation success metrics are hand-coded predicates written by the same person training the policy. The policy author controls both the behavior and the definition of "success." That's a conflict of interest we'd never accept in ML benchmarking, yet it's standard in manipulation eval. But I keep going back and forth on whether this matters, and I'd like other people's read: The case that it's real: VLA/foundation-model training is starved for reliable dense reward at scale. Human raters don't scale, brittle predicates lie. An automatic, embodiment-agnostic grader that can say "this rollout reproduced the demonstrated transformation, here's why it failed" seems like an obviously-missing piece of the training loop. The case that it's a non-problem: maybe everyone's already fine with task-specific success checks because in practice you only care about the tasks you're shipping, and a general verifier is solving for a generality nobody needs. And the representation that makes verification tractable (discrete relational state — INSIDE/TOUCHING/event-order) is also what caps it: it handles pick/place/insert/open-drawer but has no obvious purchase on force-profile or deformable tasks, which is exactly where the frontier is. There's also the uncomfortable bit: the hard 80% is perception (video → graph under occlusion and contact noise), and that's where the leakage discipline gets harder, not easier, because your extractor is now a learned, error-prone thing. Two questions I don't have a settled answer on: Is reward/eval honesty a first-order bottleneck for the current generation of manipulation learning, or second-order polish? Is object-centric relational state a dead representation for where manipulation is actually going, or a reasonable floor you build up from? submitted by /u/Alexpplay [link] [Kommentare]

Open weights are not enough: we need open training frameworks for research and better algorithms [P](reddit.com)

c/artificial-intelligence · by @Jimbo Automated · #ai #artificial-intelligence #software · 1 days

Open weights are important and critical, but they are not enough by themselves. If we want open ML and AI research to move forward, we also need open training frameworks: codebases that do more than run jobs. They should make the training process visible, understandable, and modifiable, so researchers/engineers/practitioner can build new algorithms instead of fighting hidden systems. That was the motivation behind FeynRL (pronounced “FineRL”) a framework I built for RL post-training of LLMs, VLMs, and agents. RL is already hard to make work. With LLMs, VLM, and agents, it becomes even messier: rollout engines, reward computation, distributed training, weight syncing, credit assignment problems, long-horizon behavior, and many small implementation details that can quietly break everything. The core idea behind FeynRL is simple: algorithms should stay algorithms, systems should stay systems, and researchers/engineers/practitioner should be able to understand the full training loop end-to-end without spending days or weeks. GitHub: https://github.com/FeynRL-project/FeynRL The framework is designed to keep the framework explicit: from data loading and rollout generation to reward computation, loss construction, optimization, and evaluation. The goal is to make it easier to develop new algorithms, training recipes, reward designs, rollout strategies, and optimization methods without going through a convoluted hidden system. The framework currently includes examples for SFT, DPO, and RL-style post-training for both vllm and llm, with support for single-GPU, multi-GPU, and cluster setups. Would love feedback, issues, suggestions. Also, curious to hear what parts of RL post-training infrastructure people still find too hidden, hard to debug, or hard to modify. submitted by /u/summerday10 [link] [Kommentare]

Recent CS graduate looking for GPU compute collaborators for LLM/VLM research [D](reddit.com)

c/artificial-intelligence · by @Jimbo Automated · #ai #artificial-intelligence #software · 1 days

Hi everyone, I’m a recent CS graduate working mainly on NLP/LLMs and VLMs failures. I’m currently in a phase where I can dedicate a lot of focused time to research, but the main bottleneck holding me back is compute. I know “asking for GPUs” can sound vague or unserious, so I want to be transparent. I’m not looking for free compute to casually experiment or waste cycles. I have already been actively publishing and submitting research, including papers at EACL 2026, IJCNLP-AACL 2025, MICCAI 2026, an EMNLP 2025 workshop paper, and a recent ARR submission. I’m happy to share my Google Scholar/CV/papers privately with anyone interested. The ideas I’m currently working on are GPU-intensive, mostly around LLMs, NLP, and VLMs. I’ve discussed some of them with PhD friends/peers, and the feedback has been encouraging. The goal is to develop these ideas into strong, publishable work, ideally targeting top conferences such as *CL venues, CVPR, ICLR, and related ML/AI conferences. To run the experiments properly, I likely need more than a single consumer GPU. Ideally, I’m looking for access to something like a 4x or 8x GPU setup, L40S, A100, H100, H200, or similar. I understand that asking for H100/H200-class compute is a big ask, so I’m also open to scheduled access, partial access, university/lab cluster time, unused credits, or any practical arrangement. What I can offer: Serious research effort and consistent execution Weekly progress updates, logs, and experiment summaries Clear compute usage reports so the resources are not wasted Reproducible code, experiment tracking, and documentation Open discussion of ideas before running expensive experiments Proper acknowledgment of compute support Co-authorship To be very clear: this is purely for research work, no mining, no commercial misuse, no unrelated jobs. I’m comfortable discussing the project scope, risks, expected compute needs, and authorship/acknowledgment expectations before using anything. I know this is a long shot. Maybe nothing comes out of it. But I also know many early-career researchers face this same wall: you may have the time, motivation, and ideas, but not the infrastructure to test them properly. So I’m putting this out here in case someone has unused compute, lab access, cloud credits, or is interested in collaborating on publishable research. If this sounds relevant, please DM me or comment, and I’ll be happy to share more details about my background and the research directions. Thanks for reading. submitted by /u/Academic-Success9525 [link] [Kommentare]

I’d Like to Try for a Google PhD Internship [R](reddit.com)

c/artificial-intelligence · by @Jimbo Automated · #ai #artificial-intelligence #software · 2 days

I’d like to apply for a Google PhD internship, but I currently have only one publication and three ongoing projects that will hopefully be published someday. Do I still have a realistic shot? submitted by /u/Sea_Muscle_4281 [link] [Kommentare]