<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Thoughts on Yingfa Chen 陈英发</title><link>https://chen-yingfa.github.io/categories/thoughts/</link><description>Recent content in Thoughts on Yingfa Chen 陈英发</description><generator>Hugo -- 0.146.6</generator><language>en-us</language><lastBuildDate>Sat, 07 Oct 2023 18:03:10 +0000</lastBuildDate><atom:link href="https://chen-yingfa.github.io/categories/thoughts/index.xml" rel="self" type="application/rss+xml"/><item><title>Interpreting a Maze-Solving Network</title><link>https://chen-yingfa.github.io/research_posts/2023-interpreting-a-maze-solving-network/</link><pubDate>Sat, 07 Oct 2023 18:03:10 +0000</pubDate><guid>https://chen-yingfa.github.io/research_posts/2023-interpreting-a-maze-solving-network/</guid><description>&lt;p>&lt;a href="https://www.lesswrong.com/s/sCGfFb5DPfjEmtEdn">The blog post&lt;/a>&lt;/p>
&lt;p>I can&amp;rsquo;t believe I haven&amp;rsquo;t read this until now. It is thought-provoking, and the result is an important step towards understanding neural networks.&lt;/p>
&lt;!-- more -->
&lt;p>The culmination of this blog post is the exciting work on &lt;a href="https://chen-yingfa.github.io/2023/10/07/2023-actadd/">Activation Addition&lt;/a>, which I believe is one of the important works that inspired the recent &lt;a href="https://arxiv.org/abs/2310.01405">Representation Engineering&lt;/a> work.&lt;/p></description></item><item><title>Safety and Ethical Concerns of Large Language Models</title><link>https://chen-yingfa.github.io/research_posts/2023-llm-safety-and-ethics/</link><pubDate>Tue, 19 Sep 2023 18:13:06 +0000</pubDate><guid>https://chen-yingfa.github.io/research_posts/2023-llm-safety-and-ethics/</guid><description>&lt;p>I will be holding a seminar at ModelBest (面壁智能) on Sep 20, 2023 in 科技园 (Science Park), Haidian, Beijing. The seminar will be in Chinese, and it&amp;rsquo;s called &amp;ldquo;大模型安全与伦理问题&amp;rdquo; (translation: Safety and Ethical Concerns of Large Language Models). Below is a list of references.&lt;/p>
&lt;!-- more -->
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;ul>
&lt;li>Galactica: A Large Language Model for Science&lt;/li>
&lt;li>&lt;a href="https://openai.com/research/gpt-4">https://openai.com/research/gpt-4&lt;/a>&lt;/li>
&lt;li>SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions&lt;/li>
&lt;li>Bias and Fairness in Large Language Models: A Survey&lt;/li>
&lt;li>A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation&lt;/li>
&lt;/ul>
&lt;h2 id="evaluation-methods">Evaluation Methods&lt;/h2>
&lt;ul>
&lt;li>A General Language Assistant as a Laboratory for Alignment, Anthropic&lt;/li>
&lt;li>Safety Assessment of Chinese Large Language Models&lt;/li>
&lt;li>Semantics derived automatically from language corpora contain human-like biases&lt;/li>
&lt;li>StereoSet: Measuring stereotypical bias in pretrained language models&lt;/li>
&lt;/ul>
&lt;h3 id="instruction-attacks">Instruction Attacks&lt;/h3>
&lt;ul>
&lt;li>Toxicity in ChatGPT: Analyzing Persona-assigned Language Models ⭐️&lt;/li>
&lt;li>Large Language Models are Zero-Shot Reasoners ⭐️&lt;/li>
&lt;li>On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning ⭐️&lt;/li>
&lt;li>Prompting GPT-3 To Be Reliable&lt;/li>
&lt;li>Universal and Transferable Adversarial Attacks on Aligned Language Models ⭐️&lt;/li>
&lt;li>Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment ⭐️⭐️&lt;/li>
&lt;/ul>
&lt;h3 id="exaggerated-safety">Exaggerated Safety&lt;/h3>
&lt;ul>
&lt;li>XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models ⭐️&lt;/li>
&lt;li>Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions ⭐️&lt;/li>
&lt;/ul>
&lt;h2 id="alignment-methods">Alignment Methods&lt;/h2>
&lt;ul>
&lt;li>Aligning language models to follow instructions ⭐️&lt;/li>
&lt;li>Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback ⭐️&lt;/li>
&lt;li>SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions ⭐️⭐️&lt;/li>
&lt;li>Pretraining Language Models with Human Preferences ⭐️&lt;/li>
&lt;li>LIMA: Less Is More for Alignment&lt;/li>
&lt;li>&lt;a href="https://openai.com/blog/our-approach-to-alignment-research">https://openai.com/blog/our-approach-to-alignment-research&lt;/a> (Aug 2022)&lt;/li>
&lt;li>&lt;a href="https://openai.com/blog/our-approach-to-alignment-research">https://openai.com/blog/our-approach-to-alignment-research&lt;/a> (Jul 2023) ⭐️&lt;/li>
&lt;/ul>
&lt;p>⭐️: important&lt;/p></description></item></channel></rss>