In DeltaProduct (Siems et al., 2025), the authors propose to improve DeltaNet (Yang et al., 2025) by updating the online memory with $n_h$ KVs for each token, which can be seen as performing multiple steps of gradient descent per token. I will explain how this method is almost the same as a multi-KV DeltaNet and point out a potential flaw in DeltaProduct's design.
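To make the comparison concrete, here is a minimal NumPy sketch of the two recurrences, assuming a matrix-valued memory $S$ and per-token keys and values; the function names and shapes are mine, not from either paper:

```python
import numpy as np

def delta_step(S, k, v, beta):
    """DeltaNet-style delta rule: one gradient step on L(S) = 0.5 * ||S @ k - v||^2.
    S: (d_v, d_k) memory, k: (d_k,) key, v: (d_v,) value, beta: step size."""
    # grad_S L = (S @ k - v) k^T, so this equals S (I - beta k k^T) + beta v k^T
    return S - beta * np.outer(S @ k - v, k)

def deltaproduct_step(S, ks, vs, betas):
    """DeltaProduct-style update: n_h chained delta-rule steps for one token.
    ks: (n_h, d_k), vs: (n_h, d_v), betas: (n_h,)."""
    for k, v, beta in zip(ks, vs, betas):
        S = delta_step(S, k, v, beta)
    return S
```

Chaining $n_h$ rank-1 updates per token is exactly what a DeltaNet fed $n_h$ KV pairs per token would do; the difference lies in where those keys and values come from, which is the similarity the post elaborates on.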
2025
Implementing Test-Time Training - Part 1
This blog post is part 1 of a series describing my attempt at implementing the Test-Time Training (TTT) model proposed by Sun et al. (2024) and Titans, proposed by Behrouz et al. (2024). At the time of writing, these are two of the strongest recurrent language models, but the authors have not yet open-sourced their implementations (TTT has released only a JAX implementation).
2024

The Monospace Font Problem in VS Code
In VS Code, when mixing Chinese and English text, you will notice that the characters do not align. VS Code's official position is that font rendering is handled by Chromium, so they cannot fix this themselves, and they recommend finding a Chinese monospace font on your own. The most common suggestion online is a font called Sarasa-Gothic (known in Chinese as 「更纱黑体」). But that font is not only huge, it is also a bit ugly, and I don't like the name either. Fortunately, I found a font that better fits my needs: Ubuntu Mono.
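For reference, the font is switched in VS Code's settings.json via the `editor.fontFamily` setting. A minimal example is below; the CJK fallback font listed after Ubuntu Mono is just an illustration and depends on what is installed on your system:

```json
{
  // Ubuntu Mono covers Latin glyphs; Chromium falls back to the next
  // font in the list for glyphs it lacks (e.g. Chinese characters).
  "editor.fontFamily": "'Ubuntu Mono', 'Noto Sans Mono CJK SC', monospace"
}
```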
(EREN) Robust and Scalable Model Editing for Large Language Models
TL;DR: A reader model is augmented with a growing notebook that caches all edits as natural text; the reader retrieves relevant edits and makes inferences based on them. This achieves state-of-the-art model-editing results on QA and fact-checking.
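In pseudocode form, the read path might look like the sketch below. This is my minimal rendering of the idea, not the authors' code; `retrieve` and `reader` are hypothetical placeholders:

```python
notebook = []  # grows over time; each entry is one edit, stored as natural text

def add_edit(edit_text: str) -> None:
    """Caching an edit is just appending it to the notebook."""
    notebook.append(edit_text)

def answer(question: str, retrieve, reader) -> str:
    """retrieve(question, notebook) returns the edits relevant to the question
    (possibly none); reader(question, edits) answers conditioned on them."""
    edits = retrieve(question, notebook)
    # With relevant edits, the reader bases its answer on them;
    # otherwise it falls back to the base model's own knowledge.
    return reader(question, edits)
```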
InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens
The first benchmark for evaluating how effectively LLMs handle contexts of more than 100K tokens!
In the paper we call it $\infty$-Bench, but in this blog post I will sometimes write "InfiniteBench" for readability.
Finally got some time to write this blog post; I have been so busy lately! I had been on a fairly long research hiatus, during which the field of NLP was revolutionized by an overwhelming number of new LLMs. In this new era of research, I was finally able to take part in some productive and meaningful work, as a second author. In this blog post, I will introduce that work.
2023
Interpreting a Maze-Solving Network
I can't believe I hadn't read this until now. It is thought-provoking, and the result is an important step toward understanding neural networks.
Activation Addition (ActAdd)
TL;DR: Proposes ActAdd, a method for controlling model behavior at inference time by modifying activations with a bias term computed from a pair of prompts.
Summary:
- Propose ActAdd, a method for controlling model behavior by modifying activations at inference time.
- Steering vectors are computed by taking the activation differences that result from a pair of contrasting prompts. The vectors are then added as a bias to the activations during inference (see the sketch after this list).
- ActAdd provides control over high-level properties of the output, preserves off-target model performance, and incurs little computational and implementation cost.
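A toy sketch of the mechanism, using a stack of random linear layers as a stand-in for transformer blocks and random vectors as stand-ins for prompt embeddings; the layer index, scale `c`, and all names here are mine, purely illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for a stack of transformer blocks.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])

def run(x, start=0, stop=None):
    """Run the model from layer `start` up to (not including) layer `stop`."""
    for layer in layers[start:stop]:
        x = torch.relu(layer(x))
    return x

L = 2  # layer at which activations are recorded and steered
# Steering vector: the activation difference produced by a contrasting pair
# of prompts (random vectors stand in for their embeddings here).
h_plus = run(torch.randn(16), stop=L)   # e.g. activations for "Love"
h_minus = run(torch.randn(16), stop=L)  # e.g. activations for "Hate"
v = h_plus - h_minus

# Inference with ActAdd: inject c * v as a bias at layer L, then continue the pass.
c = 3.0                # injection strength; arbitrary here
x = torch.randn(16)    # stand-in for the actual input's embedding
out = run(run(x, stop=L) + c * v, start=L)
```

The "Love"/"Hate" pairing follows the paper's own examples; the choice of layer and scale in practice is a hyperparameter.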