I can't believe I haven't read this until now. This is thought-provoking, and the result is an important step towards understanding neural networks.
representation-engineering
2023
Activation Addition (ActAdd)
TLDR: Propose ActAdd, a method for controlling model behavior during inference by modifying activations with a bias term that is computed from a pair of prompts.
Summary:
- Propose ActAdd, a method for controlling model behavior by modifying activations at inference time.
- Steering vectors are computed by taking the difference between the activations produced by a pair of prompts. The vectors are added as a bias during inference.
- ActAdd provides control over high-level properties of the output, preserves off-target model performance, and requires little computational or implementation overhead.
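
To make the summary concrete, here is a minimal sketch of the idea on a toy model. It is not the paper's implementation. The residual "layers", the random weights, the single-vector prompt embeddings, and the names `run_layers`, `steering_vector`, `forward`, `layer`, and `coeff` are all assumptions made for illustration. The scaling coefficient is also an assumption of this sketch; the notes above only say the vector is added as a bias.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LAYERS = 8, 4

# Toy stand-in for a network: each layer is a residual update h + tanh(h @ W).
WEIGHTS = [rng.normal(scale=0.3, size=(D_MODEL, D_MODEL)) for _ in range(N_LAYERS)]


def run_layers(h, start, stop):
    """Apply layers start..stop-1 to activation h and return the result."""
    for i in range(start, stop):
        h = h + np.tanh(h @ WEIGHTS[i])
    return h


def steering_vector(pos_prompt, neg_prompt, layer):
    """Difference of the two prompts' activations at the input of `layer`."""
    return run_layers(pos_prompt, 0, layer) - run_layers(neg_prompt, 0, layer)


def forward(x, steer=None, layer=0, coeff=1.0):
    """Forward pass; if `steer` is given, add coeff * steer at the input of `layer`."""
    h = run_layers(x, 0, layer)
    if steer is not None:
        h = h + coeff * steer  # the bias term added at inference time
    return run_layers(h, layer, N_LAYERS)


# Hypothetical embeddings for a contrasting prompt pair and a user prompt.
pos, neg, user = (rng.normal(size=D_MODEL) for _ in range(3))

vec = steering_vector(pos, neg, layer=2)
baseline = forward(user)
steered = forward(user, steer=vec, layer=2, coeff=4.0)
```

No weights are updated anywhere in this sketch: the only extra work is two forward passes for the prompt pair, which is where the low computational cost comes from. With a real model, the same addition would typically be done through a hook on the chosen layer.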