Yingfa Chen

2023

Interpreting a Maze-Solving Network

The blog post

I can't believe I haven't read this until now. It is thought-provoking, and the result is an important step towards understanding neural networks.

56 words, 1 min

Thoughts

Activation Addition (ActAdd)

Paper

TLDR: Propose ActAdd, a method for controlling model behavior during inference by adding a bias term to the activations, computed from a pair of prompts.

Summary:

  • Propose ActAdd, a method for controlling model behavior by modifying activations at inference time.
  • Steering vectors are computed by taking the difference between the activations produced by a pair of prompts. The vectors are added as a bias during inference.
  • ActAdd provides control over high-level properties of the output, preserves off-target model performance, and adds little computational or implementation cost.
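
The idea in the summary above can be sketched on a toy model. This is a minimal illustration, not the paper's implementation: the two-layer network, the hidden size `D`, the function names, and the scaling coefficient `coeff` are all assumptions made for the example, and random vectors stand in for embedded prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden size of the toy model (hypothetical)

# Toy stand-in for a network: two fixed random layers with tanh.
W1 = rng.normal(size=(D, D))
W2 = rng.normal(size=(D, D))

def layer1(x):
    """Compute the intermediate activation that will be steered."""
    return np.tanh(x @ W1)

def layer2(h):
    """Map the (possibly steered) activation to the output."""
    return np.tanh(h @ W2)

def forward(x, steering=None):
    """Run the toy model, optionally adding a steering vector as a bias."""
    h = layer1(x)
    if steering is not None:
        h = h + steering  # inference-time modification; no weights change
    return layer2(h)

def steering_vector(x_pos, x_neg, coeff=1.0):
    """Scaled activation difference between a pair of inputs."""
    return coeff * (layer1(x_pos) - layer1(x_neg))

# Random vectors stand in for the embedded prompt pair and a test input.
x_pos = rng.normal(size=D)
x_neg = rng.normal(size=D)
x = rng.normal(size=D)

v = steering_vector(x_pos, x_neg, coeff=2.0)
plain = forward(x)
steered = forward(x, steering=v)
```

In a real language model the same addition would be applied to the activations at a chosen layer during the forward pass; which layer and what coefficient to use are choices the sketch leaves open.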

709 words, 4 min

Paper Note