TTT is not linear attention—but it might be something weirder
An interesting paper by Liu et al. came out last week, arguing that we should reconceptualize the variety of test-time training (TTT) with KV-binding archite...
A couple of phenomenal recent papers have brought up new approaches to self-distillation in language models and reinforcement learning, and motivated ...
There have been quite a few interesting takes on what I’m going to call “residual expansion” over the last year, particularly notably Deepseek’s Manifold-con...
This post is a brief, intuitive summary of my paper, “Geometric sparsification in recurrent neural networks.” Academic publications emphasize formal descript...