I will be holding a seminar at ModelBest (面壁智能) on September 20, 2023, at the Science Park (科技园) in Haidian, Beijing. The seminar will be given in Chinese and is titled “大模型安全与伦理问题” (Safety and Ethical Concerns of Large Language Models). Below is a list of references.
Introduction
- Galactica: A Large Language Model for Science
- https://openai.com/research/gpt-4
- SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions
- Bias and Fairness in Large Language Models: A Survey
- A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation

Evaluation Methods
- A General Language Assistant as a Laboratory for Alignment (Anthropic)
- Safety Assessment of Chinese Large Language Models
- Semantics derived automatically from language corpora contain human-like biases
- StereoSet: Measuring stereotypical bias in pretrained language models

Instruction Attacks
- Toxicity in ChatGPT: Analyzing Persona-assigned Language Models ⭐️
- Large Language Models are Zero-Shot Reasoners ⭐️
- On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning ⭐️
- Prompting GPT-3 To Be Reliable
- Universal and Transferable Adversarial Attacks on Aligned Language Models ⭐️
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment ⭐️⭐️

Exaggerated Safety
- XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models ⭐️
- Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions ⭐️

Alignment Methods
- Aligning language models to follow instructions ⭐️
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback ⭐️
- SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions ⭐️⭐️
- Pretraining Language Models with Human Preferences ⭐️
- LIMA: Less Is More for Alignment
- https://openai.com/blog/our-approach-to-alignment-research (Aug 2022)
- https://openai.com/blog/our-approach-to-alignment-research (Jul 2023) ⭐️

⭐️: important
...