Activation Addition (ActAdd)

Paper TLDR: Propose ActAdd, a method for controlling model behavior during inference by modifying activations with a bias term that is computed from a pair of prompts. Summary: Propose ActAdd, a method for controlling model behavior by modifying activations at inference time. Steering vectors are computed by taking the activation differences that result from pairs of prompts, and the vectors are added as a bias during inference. ActAdd provides control over high-level properties of the output, preserves off-target model performance, and incurs little computational and implementation cost. The recently popular representation engineering (RepE) paper seems to be largely inspired by this work. ...
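A minimal sketch of the idea, assuming GPT-2 loaded through Hugging Face transformers, an injection layer of 6, a steering coefficient of 4, and the "Love"/"Hate" prompt pair; these concrete choices are illustrative, not the paper's exact settings or code.

```python
# Minimal sketch of the ActAdd idea (not the paper's implementation).
# Assumed choices for illustration: GPT-2 small, injection at block 6,
# steering coefficient 4, and the "Love"/"Hate" contrastive prompt pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # assumed model
LAYER = 6             # assumed injection layer
COEFF = 4.0           # assumed steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def activations(prompt: str) -> torch.Tensor:
    """Residual-stream activations after block LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1]  # index 0 is the embedding output

# Steering vector: activation difference of a contrastive prompt pair
h_pos, h_neg = activations("Love"), activations("Hate")
n = min(h_pos.shape[1], h_neg.shape[1])  # crude length alignment
steer = COEFF * (h_pos[:, :n] - h_neg[:, :n])

def add_steering(module, inputs, output):
    """Forward hook: add the steering vector as a bias on the prompt pass."""
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] == 1:        # skip cached single-token decoding steps
        return output
    k = min(hidden.shape[1], steer.shape[1])
    hidden = hidden.clone()
    hidden[:, :k] += steer[:, :k]   # add the bias at the front positions
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("I think dogs are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unmodified model
```

No gradients or fine-tuning are involved: the vector is a single forward-pass difference, which is why the computational overhead stays small.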

October 7, 2023 · 4 min · 陈英发 Yingfa Chen

Safety and Ethical Concerns of Large Language Models

I will be holding a seminar at ModelBest (面壁智能) on Sep 20, 2023 in Beijing, Haidian, 科技园. The seminar will be in Chinese, and it’s called “大模型安全与伦理问题” (translation: Safety and Ethical Concerns of Large Language Models). Below is a list of references.

Introduction
- Galactica: A Large Language Model for Science
- https://openai.com/research/gpt-4
- SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions
- Bias and Fairness in Large Language Models: A Survey
- A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation

Evaluation Methods
- A General Language Assistant as a Laboratory for Alignment, Anthropic
- Safety Assessment of Chinese Large Language Models
- Semantics derived automatically from language corpora contain human-like biases
- StereoSet: Measuring stereotypical bias in pretrained language models

Instruction Attacks
- Toxicity in CHATGPT: Analyzing Persona-assigned Language Models ⭐️
- Large Language Models are Zero-Shot Reasoners ⭐️
- On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning ⭐️
- Prompting GPT-3 To Be Reliable
- Universal and Transferable Adversarial Attacks on Aligned Language Models ⭐️
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment ⭐️⭐️

Exaggerated Safety
- XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models ⭐️
- Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions ⭐️

Alignment Methods
- Aligning language models to follow instructions ⭐️
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback ⭐️
- SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions ⭐️⭐️
- Pretraining Language Models with Human Preferences ⭐️
- LIMA: Less Is More for Alignment
- https://openai.com/blog/our-approach-to-alignment-research (Aug 2022)
- https://openai.com/blog/our-approach-to-alignment-research (Jul 2023) ⭐️

⭐️: important ...