Go Touch Some Grass
Sensei sent FikAi out to touch grass — and to bring back the harvest. (Would've said the hunt, but it's grass.) Here's what our guy found for the Dojo.
From the scrolls
WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data
This paper introduces WARDEN, an early language-model system capable of transcribing and translating Wardaman, an endangered Australian Indigenous language, into English. The central challenge we face is the lack of large-scale training data: in fact, we have only 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using large datasets (as for English to French), this practice is no longer viable in the Wardaman-to-English…
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot aud…
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of appending a sender's message to the receiver's context, compile the sender's hidden states into a transient…
QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling
Modeling long-range dependencies in sequential data remains a central challenge in machine learning. Transformers address this challenge through attention mechanisms, but their quadratic complexity with respect to sequence length limits scalability to long contexts. State-space models (SSMs) provide an efficient alternative with linear-time computation by evolving a latent state through recurrent updates, but their memory is typically formed via additive or linear transitions, which can limit th…
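The additive/linear state transition the abstract contrasts with attention can be sketched in a few lines. This toy diagonal recurrence is an illustrative assumption for exposition, not QLAM's actual architecture (the names `a`, `b`, `c` and the scalar state are invented here):

```python
# Minimal diagonal state-space model (SSM) recurrence: linear time in
# sequence length, versus attention's quadratic cost.
#   h_t = a * h_{t-1} + b * x_t   (elementwise transition)
#   y_t = c * h_t                 (readout)

def ssm_scan(a, b, c, xs):
    """Run a 1-D diagonal SSM over a list of scalar inputs."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # additive/linear transition: old state decays, input is added
        ys.append(c * h)    # linear readout of the latent state
    return ys

outputs = ssm_scan(a=0.5, b=1.0, c=2.0, xs=[1.0, 0.0, 0.0])
# An impulse at t=0 decays geometrically: h = 1.0, 0.5, 0.25 -> outputs 2.0, 1.0, 0.5
```

The geometric decay of the impulse response illustrates the "limited memory" concern: with purely linear transitions, old inputs fade at a fixed rate rather than being selectively retained.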
Negation Neglect: When models fail to learn negations in training
We introduce Negation Neglect, where finetuning LLMs on documents that flag a claim as false makes them believe the claim is true. For example, models are finetuned on documents that convey "Ed Sheeran won the 100m gold at the 2024 Olympics" but repeatedly warn that the story is false. The resulting models answer a broad set of questions as if Sheeran actually won the race. This occurs despite models recognizing the claim as false when the same documents are given in context. In experiments with…
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from s…
Harnessing Agentic Evolution
Agentic evolution has emerged as a powerful paradigm for improving programs, workflows, and scientific solutions by iteratively generating candidates, evaluating them, and using feedback to guide future search. However, existing methods are typically instantiated either as fixed hand-designed procedures that are modular but rigid, or as general-purpose agents that flexibly integrate feedback but can drift in long-horizon evolution. Both forms accumulate rich evidence over time, including candida…
Neurosymbolic Auditing of Natural-Language Software Requirements
Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped with an SMT solver, can audit such requirements: translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousn…
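The solver-backed consistency check can be illustrated with a toy brute-force stand-in: once requirements are formalized, joint satisfiability tells you whether they conflict. A real pipeline would hand the formulas to an SMT solver such as Z3; the requirement formulas below are invented examples, not taken from the paper:

```python
from itertools import product

# Toy stand-in for the SMT step: check formalized requirements for joint
# satisfiability by brute force over boolean assignments. If no assignment
# satisfies all of them, the requirement set is internally inconsistent.

def consistent(requirements, variables):
    """Return True if some assignment satisfies every requirement."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(req(env) for req in requirements):
            return True
    return False

# Three invented requirements; the second and third jointly contradict the first.
reqs = [
    lambda e: (not e["door_open"]) or e["alarm"],  # "if the door is open, the alarm sounds"
    lambda e: not e["alarm"],                      # "the alarm never sounds"
    lambda e: e["door_open"],                      # "the door can be open"
]
print(consistent(reqs, ["door_open", "alarm"]))  # -> False: the set is inconsistent
```

Dropping the "alarm never sounds" requirement restores consistency, which is the kind of localized diagnosis a solver-backed audit can surface.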
Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
Neural-network quantum states have emerged as a powerful variational framework for quantum many-body systems, with recent progress often driven by massively parallel architectures such as transformers. Recurrent neural network quantum states, however, are frequently regarded as intrinsically sequential and therefore less scalable. Here we revisit this view by showing that modern recurrent architectures can support fast, accurate, and computationally accessible neural quantum state simulations. U…
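The reason a linear recurrence need not be sequential is that adjacent steps compose associatively, so a parallel (Blelloch-style) scan can evaluate it in O(log T) depth. The scalar recurrence h_t = a_t * h_{t-1} + b_t and the names below are an illustrative assumption, not the paper's actual model:

```python
# Each recurrence step is the affine map h -> a*h + b, represented as (a, b).
# Composing two steps is again an affine map, and the composition is
# associative, which is exactly what a parallel scan requires.

def combine(left, right):
    """Compose step (a1, b1) followed by step (a2, b2) into one affine map."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def scan(steps):
    """Inclusive scan under `combine`. Written sequentially here; because
    `combine` is associative, the same result could be computed as a
    logarithmic-depth tree on parallel hardware."""
    out, acc = [], steps[0]
    out.append(acc)
    for s in steps[1:]:
        acc = combine(acc, s)
        out.append(acc)
    return [b for _, b in out]  # h_t values, assuming h_0 = 0

hs = scan([(0.5, 1.0), (0.5, 0.0), (0.5, 2.0)])
# Matches the step-by-step recurrence: h1 = 1.0, h2 = 0.5, h3 = 2.25
```

In frameworks like JAX, `jax.lax.associative_scan` applies the same idea with `combine` as the binary operator, which is what makes recurrent quantum-state simulations amenable to accelerators.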
Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling
As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they introduce divergent biases and subjective opinions into their annotations. Overcoming this variance is exceptionall…
What the masters say
The JAX package is now around the same level, 20M monthly downloads. Which is incredibly fast growth, because 5 years ago I recall it being below 2M or so. It went from niche to mainstream in the past couple of years. Well deserved success.
@fchollet
The Keras package recently crossed 21M monthly downloads on PyPI, an all-time high (the daily ATH is around 900k). I still remember when it first crossed 10M monthly downloads about 5 years ago and I thought it couldn't possibly go any higher...
@fchollet
also all this: https://t.co/UvO0GnmPzX
@sama
Codex in the ChatGPT mobile app!
@sama
New course: Transformers in Practice. You'll get a practical view of how transformer-based LLMs work, so you can reason about their behavior, diagnose problems like slow inference, and make smarter decisions about deployment. This course is built in partnership with @AMD and htt…
@AndrewYNg
This reminds me of computerization. The amount of "work" people could execute on computers increased by a huge factor, but their productivity did not. The amount of work "needed" to arrive at the same high-level outputs exploded.
@fchollet
The quantity of code that devs ship has roughly 10xed. But net developer productivity (value created by unit of time) is only up by a bit, if at all. Part of it is that the additional code is solving more incremental problems. A bigger part is that the new code is creating…
@fchollet
being a dad is the thing that has most exceeded already-high-expectations in my whole life
@sama
Hacker News
Project Gutenberg – keeps getting better
798 pts · 182 comments · JSeiko
I believe there are entire companies right now under AI psychosis
1016 pts · 443 comments · reasonableklout
Additive Blending on the Nintendo 64
63 pts · 7 comments · ibobev
Ploopy Bean: a trackpoint for every computer
25 pts · 9 comments · jibcage
The main thing about P2P meth is that there's so much of it (2021)
79 pts · 68 comments · tomjakubowski
The bird eye was pushed to an evolutionary extreme
46 pts · 9 comments · sohkamyung
Naturally Occurring Quasicrystals
78 pts · 6 comments · lukeplato
SQL patterns I use to catch transaction fraud
40 pts · 2 comments · redbell
A 0-click exploit chain for the Pixel 10
349 pts · 165 comments · happyhardcore
Show HN: Epiq – Distributed Git based issue tracker TUI
32 pts · 8 comments · jolaflow
Ask FikAi in Deep Dive: "Go touch some grass" for a live digest.
Updated 5/16/2026, 4:47:31 AM