Tag: representation-engineering

Interpreting a Maze-Solving Network

I can't believe I hadn't read this until now. It is thought-provoking, and the result is an important step toward understanding neural networks.

2023-10-07

2024-01-11

56 words, 1 min

Thoughts

llm english representation-engineering activation-engineering interpretability rl alignment maze

Activation Addition (ActAdd)

Paper

TLDR: Proposes ActAdd, a method for controlling model behavior at inference time by adding a bias term to the activations, computed from a pair of prompts.

Summary:

Propose ActAdd, a method for controlling model behavior by modifying activations at inference time.
Steering vectors are computed as the difference between the activations produced by a pair of prompts. The vectors are then added as a bias during inference.
ActAdd provides control over high-level properties of the output, preserves off-target model performance, and adds little computational or implementation cost.
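
To make the summary concrete, here is a minimal sketch of the idea on a toy model. Everything in it is an assumption for illustration: `embed`, `layer`, `forward`, and `steering_vector` are hypothetical stand-ins written in plain Python, not the paper's code or any library's API, and the toy activations are a single small vector rather than a transformer's residual stream.

```python
import math

def embed(prompt, dim=4):
    """Toy stand-in for an embedding: a deterministic vector built from character codes."""
    vec = [0.0] * dim
    for i, ch in enumerate(prompt):
        vec[i % dim] += ord(ch) / 1000.0
    return vec

def layer(x, k):
    """Toy stand-in for layer k: a fixed elementwise nonlinearity."""
    return [math.tanh(v + 0.1 * k) for v in x]

def forward(prompt, n_layers=3, steer_layer=None, steer_vec=None, stop_at=None):
    """Run the toy model.

    If steer_vec is given, it is added to the activations right after
    layer `steer_layer`. If stop_at is given, the activations after that
    layer are returned instead of the final output.
    """
    h = embed(prompt)
    for k in range(n_layers):
        h = layer(h, k)
        if steer_vec is not None and k == steer_layer:
            h = [a + b for a, b in zip(h, steer_vec)]
        if stop_at == k:
            return h
    return h

def steering_vector(prompt_pos, prompt_neg, layer_idx, coeff=1.0):
    """Scaled difference of the activations of two prompts at one layer."""
    h_pos = forward(prompt_pos, stop_at=layer_idx)
    h_neg = forward(prompt_neg, stop_at=layer_idx)
    return [coeff * (p - n) for p, n in zip(h_pos, h_neg)]

# Build a steering vector from a prompt pair, then add it during inference.
vec = steering_vector("Love", "Hate", layer_idx=1, coeff=2.0)
plain = forward("I think you are")
steered = forward("I think you are", steer_layer=1, steer_vec=vec)
```

No weights are trained or changed here: the only extra work is two forward passes to build the vector and one addition at the chosen layer, which is why the method is cheap. The layer index and the coefficient are the two knobs in this sketch.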

2023-10-07

2024-01-11

709 words, 4 min

Paper Note

llm english ai-alignment gpt activation-modification adaptation model-editing representation-engineering fine-tuning parameter-efficient-tuning
