Interpreting a Maze-Solving Network

I can’t believe I haven’t read this blog post until now. It is thought-provoking, and the result is an important step towards understanding neural networks. The culmination of this blog post is the exciting Activation Addition work, which I believe is one of the key works that inspired the recent Representation Engineering work.

October 7, 2023 · 1 min · 陈英发 Yingfa Chen

Activation Addition (ActAdd)

Paper TLDR: Proposes ActAdd, a method for controlling model behavior during inference by modifying activations with a bias term derived from a pair of prompts.

Summary: ActAdd controls model behavior by modifying activations at inference time. Steering vectors are computed from the activation differences produced by a pair of contrasting prompts, and these vectors are added as a bias during inference. ActAdd provides control over high-level properties of the output, preserves off-target model performance, and incurs little computational or implementation cost. The recently popular Representation Engineering (RepE) paper seems to be largely inspired by this work. ...
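
To make the mechanism concrete, here is a minimal sketch of the ActAdd idea, not the authors' reference implementation. It assumes a GPT-2 model via Hugging Face transformers; the prompt pair ("Love" / "Hate"), the layer index, and the scaling coefficient are illustrative choices, not values from the paper. The hook adds the activation difference of the prompt pair to the chosen block's output during the pass over the prompt.

```python
# Minimal ActAdd-style steering sketch (assumptions: GPT-2, hand-picked layer/coefficient).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

LAYER = 6    # transformer block whose residual stream is steered (assumption)
COEFF = 4.0  # scaling of the steering vector (assumption)

def activations_at(prompt: str, layer: int) -> torch.Tensor:
    """Hidden states entering block `layer` for a prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[layer]  # index 0 is the embedding output

# Steering vector: difference of activations from a contrasting prompt pair.
h_pos = activations_at("Love", LAYER)
h_neg = activations_at("Hate", LAYER)
n = min(h_pos.shape[1], h_neg.shape[1])
steering = COEFF * (h_pos[:, :n] - h_neg[:, :n])

def add_steering(module, inputs, output):
    """Forward hook: add the steering vector to the block's output.

    Only the initial pass over the prompt is modified; single-token
    decoding steps (with the KV cache) are left untouched.
    """
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] >= steering.shape[1]:
        hidden = hidden.clone()
        hidden[:, : steering.shape[1]] += steering.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return output

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tokenizer("I think dogs are", return_tensors="pt").input_ids
    gen = model.generate(ids, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # restore the unsteered model
```

Running the same generation with and without the hook gives a quick sense of how strongly the chosen coefficient shifts the output while leaving unrelated completions largely intact.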