DeltaProduct (Siems et al., 2025) proposes to improve DeltaNet (Yang et al., 2025) by updating the online memory with multiple key-value (KV) pairs for each token, which can be seen as performing multiple steps of gradient descent per token. I will explain how this method is almost the same as multi-KV DeltaNet and reveal a potential flaw in the design of DeltaProduct.

Introduction

DeltaNet

We use row-vector notation.

The update rule and query rule of DeltaNet can be written as:

\[
S_t = S_{t-1} - \beta_t \nabla_{S_{t-1}} \mathcal L(S_{t-1}), \qquad o_t = q_t S_t,
\]

where \( q_t, k_t, v_t\in\mathbb R^{d} \) are the query, key, and value vectors at time \( t \), respectively, \( S_t\in\mathbb R^{d\times d} \) is the recurrent state and the parameters of an online learner, and \( \beta_t \) is the learning rate of the online learner. The function \( \mathcal L(S) = \frac12 \lVert k_t S - v_t \rVert^2 \) is the loss function.

This gives us the update rule:

\[
S_t = (I - \beta_t k_t^\top k_t)\, S_{t-1} + \beta_t k_t^\top v_t.
\]
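The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration of the recurrence in row-vector convention; the function name and shapes are my own choices, not from the DeltaNet codebase.

```python
import numpy as np

def deltanet_step(S, q, k, v, beta):
    """One DeltaNet step (row-vector convention; illustrative sketch).

    S: (d, d) recurrent state; q, k, v: (d,) vectors; beta: scalar.
    The delta rule is one gradient step on L(S) = 0.5 * ||k S - v||^2.
    """
    # S_t = S_{t-1} - beta * k^T (k S_{t-1} - v)
    #     = (I - beta * k^T k) S_{t-1} + beta * k^T v
    S = S - beta * np.outer(k, k @ S - v)
    # Query rule: o_t = q_t S_t
    o = q @ S
    return S, o
```

Note that the gradient form and the closed form \( (I - \beta_t k_t^\top k_t) S_{t-1} + \beta_t k_t^\top v_t \) are algebraically identical; the sketch uses the gradient form, which avoids materializing the \( d \times d \) transition matrix.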

DeltaProduct

DeltaProduct extends DeltaNet by generating \( n_h \) KVs and online learning rates for each token. The update rule becomes \( S_t = S_{t,n_h} \), where

\[
S_{t,j} = (I - \beta_{t,j} k_{t,j}^\top k_{t,j})\, S_{t,j-1} + \beta_{t,j} k_{t,j}^\top v_{t,j}, \qquad S_{t,0} = S_{t-1}.
\]
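In code, one DeltaProduct token update is just \( n_h \) sequential delta-rule micro-steps applied to a single shared state. The following sketch assumes the same row-vector convention as above; the function name is hypothetical.

```python
import numpy as np

def deltaproduct_step(S, ks, vs, betas):
    """One DeltaProduct token update: n_h sequential delta-rule steps
    on a single shared state (illustrative sketch).

    S: (d, d) shared state; ks, vs: (n_h, d); betas: (n_h,).
    """
    for k, v, beta in zip(ks, vs, betas):
        # Each micro-step is one gradient step on 0.5 * ||k S - v||^2;
        # all n_h KVs are inserted into the SAME state.
        S = S - beta * np.outer(k, k @ S - v)
    return S
```

With \( n_h = 1 \) this reduces exactly to a DeltaNet step.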

Equivalence to Variants of Multi-Head DeltaNet

Since we are generating \( n_h \) KVs for every token, this is similar to the multi-head mechanism commonly used in attention layers. The formulation of multi-head DeltaNet is as follows, for each head \( h = 1, \dots, n_h \):

\[
S_t^{(h)} = (I - \beta_t^{(h)} k_t^{(h)\top} k_t^{(h)})\, S_{t-1}^{(h)} + \beta_t^{(h)} k_t^{(h)\top} v_t^{(h)}, \qquad o_t^{(h)} = q_t^{(h)} S_t^{(h)}.
\]

In DeltaProduct, there is only one query per token. So DeltaProduct is more similar to a variant of multi-head DeltaNet in which the query is shared among the heads (i.e., \( q_t^{(h)} = q_t \) for all \( h \)), namely, multi-KV DeltaNet.

This is still not exactly the same as DeltaProduct, because in DeltaProduct, the different KVs are inserted into the same state (sequentially), while in multi-KV DeltaNet, each KV is inserted into its own per-head state. In other words, the heads in DeltaProduct share both the query and the state. The following table summarizes the differences:

| Model | Queries | Keys | Values | States |
| --- | --- | --- | --- | --- |
| DeltaNet | 1 | 1 | 1 | 1 |
| DeltaProduct | 1 | \( n_h \) | \( n_h \) | 1 |
| Multi-head DeltaNet | \( n_h \) | \( n_h \) | \( n_h \) | \( n_h \) |
| Multi-KV DeltaNet | 1 | \( n_h \) | \( n_h \) | \( n_h \) |
| Multi-value DeltaNet | 1 | 1 | \( n_h \) | \( n_h \) |
| Multi-key DeltaNet | 1 | \( n_h \) | 1 | \( n_h \) |
| Multi-query DeltaNet | \( n_h \) | 1 | 1 | 1 |

To better illustrate the difference, I have drawn a diagram below:

[Figure: Illustration of DeltaProduct vs. Multi-KV DeltaNet]
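The multi-KV variant can also be sketched in code: the query is shared, but each KV updates its own per-head state, in contrast to DeltaProduct's single shared state. The function name and shapes are my own choices for illustration.

```python
import numpy as np

def multi_kv_deltanet_step(states, q, ks, vs, betas):
    """One multi-KV DeltaNet step: shared query, untied per-head states
    (illustrative sketch).

    states: (n_h, d, d); q: (d,); ks, vs: (n_h, d); betas: (n_h,).
    Returns updated states and the concatenated head outputs, (n_h * d,).
    """
    outs = []
    for h, (k, v, beta) in enumerate(zip(ks, vs, betas)):
        # Each KV goes into its OWN state, unlike DeltaProduct
        states[h] = states[h] - beta * np.outer(k, k @ states[h] - v)
        # The shared query reads every head's state
        outs.append(q @ states[h])
    return states, np.concatenate(outs)
```

Compared with `deltaproduct_step` above, the per-token compute is the same order; only the state memory grows from one \( d \times d \) matrix to \( n_h \) of them.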

The Potential Flaw in DeltaProduct

Many works have shown that recurrent architectures are bottlenecked by their state size. Hence, I think it is unreasonable to insert multiple KVs into the same state. One potential improvement to DeltaProduct is therefore to use untied states (i.e., a different state for each KV), which turns it into multi-KV DeltaNet. Inspired by Mamba2, it may be even better to use multi-value DeltaNet, which saves some parameters, thereby increasing the state-to-parameter-size ratio.
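To make the multi-value suggestion concrete, here is a sketch of one multi-value DeltaNet step: query and key are shared across heads, so only the value projections are duplicated, while each head still keeps its own state. This is my own extrapolation of the proposal above, not code from any paper.

```python
import numpy as np

def multi_value_deltanet_step(states, q, k, vs, betas):
    """One multi-value DeltaNet step: shared query and key, per-head
    values and states (illustrative sketch of the proposed variant).

    states: (n_h, d, d); q, k: (d,); vs: (n_h, d); betas: (n_h,).
    """
    outs = []
    for h, (v, beta) in enumerate(zip(vs, betas)):
        # The SAME key k is used by every head; only the value differs,
        # so the key projection parameters are shared across heads.
        states[h] = states[h] - beta * np.outer(k, k @ states[h] - v)
        outs.append(q @ states[h])
    return states, np.concatenate(outs)
```

Relative to multi-KV DeltaNet, this drops \( n_h - 1 \) key projections while keeping the \( n_h \)-fold state, which is what raises the state-to-parameter-size ratio.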

How to Cite

@misc{chen2025delta-product,
  author = {Yingfa Chen},
  title = {Multi-Head DeltaNet},
  year = {2025},
  url = {https://chen-yingfa.github.io/2025/03/22/2025-about-delta-product/},
}

Feel free to contact me if you want to discuss this further.