Is Symbolic Regression still a thing, given LLMs' performance? [D](reddit.com)
I've been teaching myself about Symbolic Regression (SR), which looks like a super exciting field (a great intro resource is linked below [1]). But then I was wondering: given LLMs' ever-growing power in generating code, which is in a way very similar to Symbolic Regression (and of course, LLMs can even tackle symbolic regression tasks directly), are existing SR techniques dead? Happy to hear your thoughts. [1] ETH Zürich AISE: Symbolic Regression and Model Discovery - YouTube submitted by /u/omomom42 [link] [comments]
Extreme data imbalance: only 56 failures in a 100K dataset [P](reddit.com)
As in the title, my goal is to predict failure and remaining useful life (RUL) of a machine. The dataset is timestamped, and each row where the machine failed is labeled 1; there are only 56 such rows. https://preview.redd.it/plbydmenmm6h1.png?width=1205&format=png&auto=webp&s=2fefe3cc2e3fe554b81c9e0b4012c5345e73ec3f I'm dropping operating hours and humidity from this data because they didn't show any correlation with machine failure. Which algorithm or deep learning approach would suit this? submitted by /u/False-Seesaw-1899 [link] [comments]
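One common starting point for this kind of imbalance is to reweight the classes rather than resample. The sketch below uses the "balanced" heuristic, weight = n_samples / (n_classes * class_count), with the label counts from the post. The function name and the choice of heuristic are assumptions for illustration, not something the poster described.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Label counts taken from the post: 100K rows, 56 of them failures.
labels = [1] * 56 + [0] * (100_000 - 56)
weights = balanced_class_weights(labels)

# Share of rows that are failures; a model that always predicts
# "no failure" is right on every other row, so accuracy is a poor metric here.
failure_rate = 56 / len(labels)

print(weights, failure_rate)
```

With counts this skewed, the failure class ends up weighted several hundred times more heavily than the normal class, and precision/recall on the failure class is likely more informative than accuracy.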
Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [R](reddit.com)
Link: https://arxiv.org/abs/2606.06158 Abstract: Adaptive video tokenisation seeks to dynamically allocate token budgets based on the underlying visual complexity of a sequence. Current continuous-regime approaches achieve this via iterative binarised searches or trained neural regressors, while discrete methods often require a full-rate decoder pass to estimate information content. We demonstrate that such computational overheads are not strictly necessary. We show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly: spatial positions whose latent representations change minimally between consecutive frames carry near-zero additional information. We introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences, identifying and dropping redundant latent positions. Consequently, the compression rate emerges naturally from the input content rather than being enforced top-down: static scenes get compressed aggressively, while highly dynamic sequences retain more tokens. To reconstruct the dropped positions, we propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture. The resulting inference pipeline is highly efficient, requiring only a single encoder pass and one LIT forward pass, eliminating the need for auxiliary routing networks. Evaluations across TokenBench and DAVIS, which are the standard benchmarks used by recent tokenisers, indicate that our framework yields meaningful, content-driven token allocation while maintaining competitive reconstruction fidelity, and delivers a 31x inference-time speedup over the continuous adaptive baseline (ElasticTok-CV) and a 2x speedup over the discrete information-theoretic baseline (InfoTok). submitted by /u/chhaya_35 [link] [comments]
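The thresholding step in the abstract can be illustrated with a toy sketch. This is not the authors' code: the nested-list latent layout, the use of a mean (rather than a sum) over channels, keeping the first frame in full, and the threshold value are all assumptions made here for illustration.

```python
def temporal_l1_keep_mask(latents, threshold):
    """Return keep[t][p]: True if position p in frame t should be kept.

    latents[t][p] is the latent vector (list of floats) at frame t, position p.
    """
    # Assumption: the first frame has no predecessor, so it is kept in full.
    keep = [[True] * len(latents[0])]
    for prev, curr in zip(latents, latents[1:]):
        row = []
        for v_prev, v_curr in zip(prev, curr):
            # Mean absolute (L1) change across channels for this position.
            diff = sum(abs(a - b) for a, b in zip(v_curr, v_prev)) / len(v_curr)
            row.append(diff > threshold)
        keep.append(row)
    return keep


# Toy clip: 3 frames, 2 positions, 2 channels.
# Position 0 is nearly static; position 1 changes once, then stays still.
latents = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.01], [2.0, 0.0]],
    [[0.0, 0.01], [2.0, 0.0]],
]
mask = temporal_l1_keep_mask(latents, threshold=0.1)
kept = sum(sum(row) for row in mask)
total = sum(len(row) for row in mask)
print(mask, f"{kept}/{total} positions kept")
```

The number of kept positions falls out of the content: a fully static clip would keep only the first frame, while a clip that changes everywhere would keep everything.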
Anthropic walks back policy on silent nerfing for AI/ML, will notify users [N](reddit.com)
From Wired: “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic said in a statement to WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.” Anthropic now says it’s changing course, and that Claude Fable 5’s safeguards for AI development will be visible to users. If the company suspects a user is trying to use Claude to build a highly capable AI, it will alert them that it’s either refusing the request or rerouting the user to a less capable model. Full article: https://www.wired.com/story/anthropic-responds-to-backlash-on-claudes-secret-sabotage-on-ai-research/ submitted by /u/goldcakes [link] [comments]
ICMI 2026 Reviews [D](reddit.com)
Did anyone else submit to ACM ICMI 2026? The reviews were recently released, and this is my first time submitting to ICMI, so I'm not very familiar with the acceptance patterns. I submitted a long paper and received the following overall ratings: 4 (Probably Accept), 3 (Borderline), 4 (Probably Accept). The reviewer with the highest stated expertise recommended acceptance, while the borderline reviewer had some concerns about soundness but still considered it a nice contribution. For those who have submitted to or reviewed for ICMI before, how would you interpret these scores? Is a 4/3/4 generally considered competitive after rebuttal, or is it still a long shot? Would appreciate any insights from past authors or reviewers. submitted by /u/kanishq95 [link] [comments]
Looking for papers/resources on AI responses to psychological distress prompts [P](reddit.com)
Hi everyone, I’m close to completing my degree in Psychology, and I’m also a Systems Engineering student, which is roughly comparable to Software Engineering / Computer Science outside Latin America. Although I study engineering, I’m still at an early stage with machine learning, LLMs, AI safety, and related technical topics. My research project is mainly psychology-oriented, but I’d really appreciate recommendations or warnings from a software/technical perspective. I’m working on a project about how AI systems respond to prompts involving psychological distress at different levels of intensity. I’m currently considering ChatGPT, Gemini, Wysa, and Replika, and I’m interested in comparing general-purpose LLMs, mental-health-oriented chatbots, and AI companions. Some aspects I’m thinking about are:
- how each system handles mental health, self-harm, crisis situations, and psychological/medical advice.
- whether responses change as the prompt becomes more intense, for example when a normal generated response is replaced by a safety protocol, moderation layer, or crisis-resource response.
- whether systems respond differently to declarative prompts versus question-based prompts, such as “I feel emotionally overwhelmed” vs. “What should someone do if they feel emotionally overwhelmed?”
- whether responses differ when distress is explicit, indirect, ambiguous, hypothetical, or written in third person.
- whether the system provides empathy, psychoeducation, referrals, crisis resources, refusal, redirection, or a combination of these.
- how to account for technical changes over time, such as model versions, neural network weights, safety layers, moderation classifiers, system prompts, memory/retrieval features, and product-level configurations.
- whether it is methodologically valid to compare systems with very different technical architectures.
I’m not trying to evaluate these systems as therapists or test clinical effectiveness with real patients. The focus is on how they respond linguistically, procedurally, and safety-wise when confronted with psychological distress. I’d appreciate recommendations for papers, benchmarks, datasets, evaluation frameworks, or common methodological mistakes to avoid. I’m especially interested in technical issues such as reproducibility, stochastic outputs, temperature/settings, hidden safety layers, system prompts, memory, retrieval mechanisms, and product updates. Thanks in advance! submitted by /u/dakartt [link] [comments]
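One way to make the design factors and the reproducibility concerns above concrete is to enumerate every condition up front and store each response together with the metadata needed to interpret it later. The sketch below is only an illustration: the `Trial` and `run_record` names, the intensity levels, and the repeat count are hypothetical, and no real API is called.

```python
import itertools
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Trial:
    system: str
    intensity: str
    form: str
    framing: str
    repeat: int


def build_trials(systems, intensities, forms, framings, repeats):
    """Enumerate every condition, repeated to sample stochastic outputs."""
    return [
        Trial(s, i, f, fr, r)
        for s, i, f, fr in itertools.product(systems, intensities, forms, framings)
        for r in range(repeats)
    ]


def run_record(trial, response_text, collected_at, settings):
    """Bundle a response with the metadata needed to interpret it later."""
    return {
        **asdict(trial),
        "response": response_text,
        "collected_at": collected_at,  # product updates make the date matter
        "settings": settings,          # e.g. model version, temperature if exposed
    }


trials = build_trials(
    systems=["ChatGPT", "Gemini", "Wysa", "Replika"],  # systems named in the post
    intensities=["low", "medium", "high"],             # hypothetical levels
    forms=["declarative", "question"],
    framings=["explicit", "indirect", "ambiguous", "hypothetical", "third_person"],
    repeats=5,                                         # hypothetical repeat count
)
print(len(trials))
```

Repeating each condition several times gives a rough picture of output variability, which matters when settings such as temperature cannot be controlled in consumer products.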
Pyrecall: open-source tool for detecting catastrophic forgetting during LLM fine-tuning [P](reddit.com)
Surprised there's no real tooling for this given how much research exists on continual learning. Built pyrecall to fill the gap. It snapshots skill scores before/after fine-tuning, flags regressions, and rolls back LoRA adapters by name. Fully local, no external APIs. v0.1.0, MIT, pip install pyrecall. Curious if anyone has thoughts on the benchmark design; that's the part I'm least confident about. https://github.com/Arths17/Pyrecall submitted by /u/Level_Frosting_7950 [link] [comments]
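The snapshot-and-compare idea can be sketched generically. This is not pyrecall's API (the post does not show it): `find_regressions`, the tolerance, and the scores below are made-up illustrations of flagging skills whose score drops after fine-tuning.

```python
def find_regressions(before, after, tolerance=0.02):
    """Return {skill: drop} for skills whose score fell by more than `tolerance`."""
    regressions = {}
    for skill, old_score in before.items():
        new_score = after.get(skill)
        if new_score is None:
            continue  # skill not re-evaluated; nothing to compare
        drop = old_score - new_score
        if drop > tolerance:
            regressions[skill] = drop
    return regressions


# Made-up scores, for illustration only.
before = {"arithmetic": 0.82, "summarization": 0.74, "code": 0.61}
after = {"arithmetic": 0.65, "summarization": 0.75, "code": 0.60}

flagged = find_regressions(before, after)
should_roll_back = bool(flagged)  # a caller could revert the adapter here
print(flagged, should_roll_back)
```

The tolerance is where benchmark design bites: with small or noisy evaluation sets, a fixed threshold can flag ordinary run-to-run variation as forgetting.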
How common are TMLR desk rejections with "not a suitable venue"? [D](reddit.com)
Submitted a short theoretical paper to TMLR and got desk-rejected with "does not meet our editorial standards or allow us to assess claims and evidence" and "not a suitable venue for this work." Is this a common outcome for first submissions? Curious what typically drives this kind of rejection: scope mismatch, insufficient experiments, or something else. Not looking to appeal, just trying to understand the bar so I don't waste time on the wrong venue next time. Anyone else gotten this and figured out what the actual issue was? submitted by /u/observer678 [link] [comments]
Analysis of the results of the "Transforming autoencoders" architecture mentioned by Hinton, for my dissertation. [R](reddit.com)
Hello everyone, tomorrow I have a meeting with my dissertation supervisor and I wanted to have a dissertation proposal ready. Initially, I moved forward with the following proposal: "Interpreting the Routing Dynamics of Capsule Networks for Explainable AI." My first approach to this topic was to study the paper "Transforming autoencoders," which is the first paper about capsule networks. Next, I searched the state of the art on transforming autoencoders and found only 2 papers since 2011. I think I should take advantage of the work I have developed so far on transforming autoencoders and write a dissertation about them. If anyone could take a look at the readme and tell me what they think, I would appreciate it. What do you think? I should suggest another topic involving transforming autoencoders, since there isn't much scientific research on them. The professor is approachable, and if I present a good new topic, he'll let me change it! submitted by /u/Future-Persimmon5393 [link] [comments]
Routing LLMs by task verifiability: a small experiment (n=120, 3 models) inspired by Karpathy's framework [D](reddit.com)
Full disclosure: this is directional, not a paper. n=120 tasks, one internal evaluator, not peer reviewed. I work at an LLM infrastructure company. This experiment was done on my own time and is not a company claim.
Karpathy's framework classifies tasks by verifiability: can the output be mechanically checked? High verifiability tasks like code compilation and structured JSON extraction are safer because the verifier catches errors. Low verifiability tasks like creative writing are riskier. I wondered if high verifiability tasks are also easier in practice. Can a weaker model do them as well as a frontier model if the verifier catches mistakes?
Setup was 120 tasks across four categories: code unit tests, structured extraction, multi hop reasoning, creative summarization. Three models: Claude Sonnet 4.6, GPT 5.5, local Mistral 3 8B via vLLM 0.6.3. Pass rate for the first two, human rating 1 to 5 for the last two.
Results were messy.
Code unit tests: Sonnet 4.6 94%, GPT 5.5 91%, Mistral 3 8B 87%. With one retry Mistral 3 hit 95%. That surprised me. I expected the gap to be bigger.
Structured extraction: Sonnet 4.6 97%, GPT 5.5 94%, Mistral 3 8B 89%. With retry 96%. Also closer than I expected. But here is where it got weird. Sonnet 4.6 initially scored worse than GPT 5.5 on structured extraction, which made no sense. Turns out our JSON schema had an ambiguous nested array that confused Claude's tool use parser. Fixing the schema brought Sonnet to 98%, but I kept the original numbers in the table because the mistake is part of the story. Your verifier is only as good as your schema.
Multi hop reasoning: Sonnet 4.6 78%, GPT 5.5 71%, Mistral 3 8B 51%. Retry didn't help. The model would hallucinate reasoning paths consistently. This is where the capability gap was real.
Creative summarization: Sonnet 4.6 4.2 out of 5, GPT 5.5 3.9 out of 5, Mistral 3 8B 3.1 out of 5. Expected.
Interpretation: high verifiability tasks seem simpler in the sense that a weaker model plus a verifier can approach frontier performance. Low verifiability tasks show the expected gap.
Limitations: n=120 is tiny. Need 10x for confidence. Our verifier is just JSON Schema plus regexes. Constrained decoding might change the calculus entirely. I also didn't control for prompt length well. Any prompt over 8k tokens was excluded because Mistral 3 8B degrades near its limit, which probably skewed the sample. submitted by /u/DragonfruitAlone4497 [link] [comments]
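The "weaker model plus verifier plus one retry" loop described above can be sketched as follows. This is not the poster's harness: `verify_extraction` is a simplified stand-in for their JSON Schema and regex checks (it only parses the JSON and checks required keys), and `flaky_model` is a hypothetical stub rather than a real model call.

```python
import json


def verify_extraction(raw_output, required_keys):
    """Return the parsed object if it is valid JSON with all required keys, else None."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not required_keys.issubset(parsed):
        return None
    return parsed


def generate_with_retry(model, prompt, required_keys, max_retries=1):
    """Call `model` until its output passes the verifier or retries run out."""
    for attempt in range(max_retries + 1):
        result = verify_extraction(model(prompt), required_keys)
        if result is not None:
            return result, attempt
    return None, max_retries


# Hypothetical stand-in model: malformed JSON first, then a valid object.
_outputs = iter(['{"name": "Ada"', '{"name": "Ada", "year": 1843}'])


def flaky_model(prompt):
    return next(_outputs)


result, retries_used = generate_with_retry(
    flaky_model, "extract name and year", {"name", "year"}
)
print(result, retries_used)
```

A retry only helps when failures are detectable and not systematic, which fits the post's observation that retries closed the gap on extraction but not on multi hop reasoning.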
Anthropic's new model Fable will silently handicap work on LLMs [D](reddit.com)
Seems like they have engineered some specific limitations, widely cited as follows: "In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms. Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations." https://news.ycombinator.com/item?id=48464732 Other comments note how even using the word 'nuclear' in the context of scientific research elicits refusal behavior by the model: https://news.ycombinator.com/item?id=48473302 This makes it seem quite plausible that the model could subtly sabotage any machine learning work (even as a false positive). Some suggest this has been happening behind the scenes for a while already, but can anyone confirm that? submitted by /u/AccomplishedCat4770 [link] [comments]
AI Agent Security: The Complete Guide to Threats, Defenses, and the Future of Autonomous AI Safety [R](reddit.com)
This is a comprehensive living reference guide to AI agent security, synthesizing 18 articles from The Agent Report covering the 75-day period (April–June 2026) when agent security went from theoretical concern to operational crisis.
What's inside:
• Incident timeline: 18 major events, from the first production database deletion by a coding agent (April 30) through the first confirmed in-the-wild LLM agent cyberattack (Sysdig, June 1, exfiltrated a PostgreSQL database in under 60 minutes), to an AI agent finding 21 zero-days in FFmpeg for a $1,000 prize.
• The AIRQ report's sobering numbers: only 11% of production AI agents pass security thresholds. 98% exhibit the "lethal trifecta": private data access, exposure to untrusted content, and outbound action capability. Computer-use agents scored an average of zero on output guardrails.
• Deep dives into attack anatomy: the Sysdig attacker used 12 cloud API calls across 11 IPs in 22 seconds via Cloudflare Workers to break IP-based alerting. A Chinese-language planning comment leaked into the command stream, revealing the agent's internal reasoning: "see what else we can do." The Google-confirmed criminal use of AI to discover and weaponize zero-days with reasoning-based codebase analysis.
• Defensive architecture: the three-layer model distilled from Anthropic's published containment patterns, CISA/NSA/Five Eyes guidance, and industry research: environment-layer (gVisor containers, hypervisor VMs, egress MITM proxies), model-layer (classifiers, safety probes; probabilistic only), and external-content controls. Anthropic's key finding: "The weakest layer is the one you built yourself."
• Government & regulatory response: CISA/NSA/Five Eyes joint guidance (May 3) identifying five risk categories, the Trump AI Executive Order (June 10) mandating federal agency assessments, and the emerging global regulatory pattern.
• Actionable guidance: immediate (next 30 days) and medium-term (30–90 days) steps for security teams, including auditing for the lethal trifecta, patching Starlette (BadHost CVE-2026-48710) and Marimo, implementing egress controls, and establishing agent identity management.
https://the-agent-report.com/2026/06/ai-agent-security-complete-guide-threats-defenses/ submitted by /u/docdavkitty [link] [comments]
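The "lethal trifecta" audit mentioned above reduces to checking whether a single agent combines all three capabilities. The sketch below assumes a hand-maintained inventory; the `AgentProfile` fields and the agent names are hypothetical and not taken from the guide.

```python
from dataclasses import dataclass


@dataclass
class AgentProfile:
    name: str
    reads_private_data: bool
    sees_untrusted_content: bool
    can_act_outbound: bool


def has_lethal_trifecta(agent):
    """True only when all three risk capabilities are present together."""
    return (
        agent.reads_private_data
        and agent.sees_untrusted_content
        and agent.can_act_outbound
    )


# Hypothetical inventory, for illustration only.
inventory = [
    AgentProfile("support-bot", True, True, True),
    AgentProfile("docs-indexer", False, True, False),
    AgentProfile("billing-agent", True, False, True),
]

at_risk = [a.name for a in inventory if has_lethal_trifecta(a)]
print(at_risk)
```

Removing any one of the three capabilities from an agent takes it out of the flagged set, which is why the audit is a useful first pass before deeper controls.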
Should I Commit and Publish the Results? [R](reddit.com)
Hello Reddit, I've been working on QSPR (Quantitative Structure-Property Relationship) analysis for chemical compounds in the Jean-Claude Bradley Open Melting Point Dataset. Basically the idea is to see how accurately a model can predict melting points of compounds using only topological indices. After some work on the topological indices (feature engineering), each compound was represented by 26 features. I trained a random forest model on the data and got a test r2 score of 0.66 (which is pretty respectable, given the constraints). However, the file size of the model was around 1.23GB. I didn't like it being that big, so I opened up PyTorch to build a custom deep learning architecture that could make predictions as accurately as the random forest but with a much smaller file size. After around 2 weeks of research, I built a model with 270,000 learnable parameters (1.3-1.4MB according to torchinfo) that got an r2 score of 0.6399. Given all this context, I wanted to ask the following question: Should I commit and work on publishing the results, or should I keep working on improving the model? Note: I'm obligated by my university to not give out intricate details of my research before publication, so please forgive me if such details are required for a high quality answer. However, I can give out the metrics achieved by my little deep learning model. Here they are:
=== Evaluation Metrics (Expected Value) ===
R² Score : 0.639910
MAE : 41.246754
MSE : 2989.062744
RMSE : 54.672322
NRMSE : 0.083469
MAPE : 11.69%
MAE and RMSE are in kelvin (K); MSE is in K², and NRMSE is dimensionless. submitted by /u/AgiGamesYT [link] [comments]
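A quick arithmetic check of the reported figures, as a sketch. It assumes float32 weights (4 bytes per parameter) and that the NRMSE is the RMSE divided by some scale such as the target range; neither is stated in the post.

```python
import math

# Figures reported in the post.
mse = 2989.062744
rmse_reported = 54.672322
nrmse_reported = 0.083469
params = 270_000

# RMSE should equal the square root of MSE.
rmse_from_mse = math.sqrt(mse)

# Raw weight storage, assuming float32 (4 bytes per parameter).
bytes_per_param = 4
weights_mb = params * bytes_per_param / 1e6

# Assumption: NRMSE = RMSE / scale, so the implied scale is RMSE / NRMSE.
implied_scale = rmse_reported / nrmse_reported

print(f"sqrt(MSE) = {rmse_from_mse:.6f}")
print(f"float32 weights = {weights_mb:.2f} MB")
print(f"implied normalisation scale = {implied_scale:.1f} K")
```

The raw float32 weights come out a little under the 1.3-1.4MB quoted from torchinfo, which is plausible if that figure counts more than the parameters alone. Either way, the model is roughly three orders of magnitude smaller than the 1.23GB random forest.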
Introducing Papers Without Code [P](reddit.com)
Hi, Niels here from the open-source team at Hugging Face. I've recently relaunched paperswithcode.co as a source for finding the state of the art (SOTA) across various AI domains, from 3D generation to AI agents. This is done by automatically parsing research papers published on arXiv/Hugging Face, enabling leaderboards to be created. See BrowseComp below as an example (a scatter plot and a table are available for each benchmark). - Scatter plot (you can hover over the dots to see the models): https://preview.redd.it/9rz2r3ffcf6h1.png?width=2880&format=png&auto=webp&s=b3f8e7a870802f6ef8227ecc0619e9e1057554b0 - Table: https://preview.redd.it/qoqriddw5f6h1.png?width=2862&format=png&auto=webp&s=a0034574f693847537037013672fb61daf27b16e As you can see, I've added support for viewing evals for closed-source models, too, given that many benchmarks are nowadays dominated by them, like GPT-5.5 and Mythos 5. You can always disable viewing closed-source evals with a toggle or in your PwC settings: https://preview.redd.it/p3k6jt6q6f6h1.png?width=1582&format=png&auto=webp&s=40149e51d6b326a77e53e33baf70d9850b3de365 When you turn them off, here's what the open model leaderboard looks like: https://preview.redd.it/tg42sin36f6h1.png?width=2838&format=png&auto=webp&s=1330a117ae9b4e0ce6d459493ae9e8f64107310a Closed-source papers are treated as regular "papers", although they can come from any source, like a blog post (given that PwC supports submitting any source beyond arXiv). See the GPT-5.5 or Mythos 5 papers as examples, with their evals at the bottom. Notice the "closed" tag on their evals. Hence, you could jokingly call these "papers without code". Let me know what you think of this, and whether anything needs to be changed or added! Kind regards, Niels submitted by /u/NielsRogge [link] [comments]