<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Thoughts on Yingfa Chen 陈英发</title><link>https://chen-yingfa.github.io/categories/thoughts/</link><description>Recent content in Thoughts on Yingfa Chen 陈英发</description><generator>Hugo -- 0.146.6</generator><language>en-us</language><lastBuildDate>Sat, 07 Oct 2023 18:03:10 +0000</lastBuildDate><atom:link href="https://chen-yingfa.github.io/categories/thoughts/index.xml" rel="self" type="application/rss+xml"/><item><title>Interpreting a Maze-Solving Network</title><link>https://chen-yingfa.github.io/research_posts/2023-interpreting-a-maze-solving-network/</link><pubDate>Sat, 07 Oct 2023 18:03:10 +0000</pubDate><guid>https://chen-yingfa.github.io/research_posts/2023-interpreting-a-maze-solving-network/</guid><description>&lt;p>&lt;a href="https://www.lesswrong.com/s/sCGfFb5DPfjEmtEdn">The blog post&lt;/a>&lt;/p>
&lt;p>I can&amp;rsquo;t believe I haven&amp;rsquo;t read this until now. It is thought-provoking, and the result is an important step towards understanding neural networks.&lt;/p>
&lt;!-- more -->
&lt;p>The culmination of this blog post is the exciting work on &lt;a href="https://chen-yingfa.github.io/2023/10/07/2023-actadd/">Activation Addition&lt;/a>, which I believe is one of the important works that inspired the recent &lt;a href="https://arxiv.org/abs/2310.01405">Representation Engineering&lt;/a> work.&lt;/p></description></item><item><title>Safety and Ethical Concerns of Large Language Models</title><link>https://chen-yingfa.github.io/research_posts/2023-llm-safety-and-ethics/</link><pubDate>Tue, 19 Sep 2023 18:13:06 +0000</pubDate><guid>https://chen-yingfa.github.io/research_posts/2023-llm-safety-and-ethics/</guid><description>&lt;p>I will be holding a seminar at ModelBest (面壁智能) on Sep 20, 2023 in 科技园 (Science Park), Haidian, Beijing. The seminar will be in Chinese, and it&amp;rsquo;s called &amp;ldquo;大模型安全与伦理问题&amp;rdquo; (translation: Safety and Ethical Concerns of Large Language Models). Below is a list of references.&lt;/p>
&lt;!-- more -->
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;ul>
&lt;li>Galactica: A Large Language Model for Science&lt;/li>
&lt;li>&lt;a href="https://openai.com/research/gpt-4">https://openai.com/research/gpt-4&lt;/a>&lt;/li>
&lt;li>SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions&lt;/li>
&lt;li>Bias and Fairness in Large Language Models: A Survey&lt;/li>
&lt;li>A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation&lt;/li>
&lt;/ul>
&lt;h2 id="evaluation-methods">Evaluation Methods&lt;/h2>
&lt;ul>
&lt;li>A General Language Assistant as a Laboratory for Alignment, Anthropic&lt;/li>
&lt;li>Safety Assessment of Chinese Large Language Models&lt;/li>
&lt;li>Semantics derived automatically from language corpora contain human-like biases&lt;/li>
&lt;li>StereoSet: Measuring stereotypical bias in pretrained language models&lt;/li>
&lt;/ul>
&lt;h3 id="instruction-attacks">Instruction Attacks&lt;/h3>
&lt;ul>
&lt;li>Toxicity in ChatGPT: Analyzing Persona-assigned Language Models ⭐️&lt;/li>
&lt;li>Large Language Models are Zero-Shot Reasoners ⭐️&lt;/li>
&lt;li>On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning ⭐️&lt;/li>
&lt;li>Prompting GPT-3 To Be Reliable&lt;/li>
&lt;li>Universal and Transferable Adversarial Attacks on Aligned Language Models ⭐️&lt;/li>
&lt;li>Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment ⭐️⭐️&lt;/li>
&lt;/ul>
&lt;h3 id="exaggerated-safety">Exaggerated Safety&lt;/h3>
&lt;ul>
&lt;li>XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models ⭐️&lt;/li>
&lt;li>Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions ⭐️&lt;/li>
&lt;/ul>
&lt;h2 id="alignment-methods">Alignment Methods&lt;/h2>
&lt;ul>
&lt;li>Aligning language models to follow instructions ⭐️&lt;/li>
&lt;li>Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback ⭐️&lt;/li>
&lt;li>SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions ⭐️⭐️&lt;/li>
&lt;li>Pretraining Language Models with Human Preferences ⭐️&lt;/li>
&lt;li>LIMA: Less Is More for Alignment&lt;/li>
&lt;li>&lt;a href="https://openai.com/blog/our-approach-to-alignment-research">https://openai.com/blog/our-approach-to-alignment-research&lt;/a> (Aug 2022)&lt;/li>
&lt;li>&lt;a href="https://openai.com/blog/our-approach-to-alignment-research">https://openai.com/blog/our-approach-to-alignment-research&lt;/a> (Jul 2023) ⭐️&lt;/li>
&lt;/ul>
&lt;p>⭐️: important&lt;/p></description></item></channel></rss>