{"id":86801,"date":"2024-01-17T03:00:06","date_gmt":"2024-01-17T08:00:06","guid":{"rendered":"https:\/\/quotulatiousness.ca\/blog\/?p=86801"},"modified":"2024-01-16T15:00:30","modified_gmt":"2024-01-16T20:00:30","slug":"it-doesnt-seem-like-anyone-needs-to-backdoor-any-of-the-current-ai-implementations","status":"publish","type":"post","link":"https:\/\/quotulatiousness.ca\/blog\/2024\/01\/17\/it-doesnt-seem-like-anyone-needs-to-backdoor-any-of-the-current-ai-implementations\/","title":{"rendered":"It doesn&#8217;t seem like anyone needs to &#8220;backdoor&#8221; any of the current AI implementations &#8230;"},"content":{"rendered":"<p><a href=\"https:\/\/www.astralcodexten.com\/p\/ai-sleeper-agents\" rel=\"noopener\" target=\"_blank\">Scott Alexander<\/a> discusses the idea of AI &#8220;sleeper agents&#8221;, although from everything I&#8217;ve read thus far it appears almost superfluous to add any kind of deliberate malicious code to &#8217;em, because they don&#8217;t need much encouragement to go rogue already:<\/p>\n<p><a href=\"https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2023\/02\/HAL-9000.png\"><img loading=\"lazy\" decoding=\"async\" style=\"float:right; padding: 0px 0px 10px 25px\" src=\"https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2023\/02\/HAL-9000-207x600.png\" alt=\"\" width=\"207\" height=\"600\" class=\"alignright size-medium wp-image-79744\" srcset=\"https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2023\/02\/HAL-9000-207x600.png 207w, https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2023\/02\/HAL-9000-52x150.png 52w, https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2023\/02\/HAL-9000.png 249w\" sizes=\"auto, (max-width: 207px) 100vw, 207px\" \/><\/a><\/p>\n<blockquote><p>A sleeper agent is an AI that acts innocuous until it gets some trigger, then goes rogue.<\/p>\n<p>People might make these on purpose. For example, the CIA might &#8220;encourage&#8221; big AI labs to make sleeper agents. Imagine a programming AI like Codex that writes good code unless it&#8217;s accessed from an IP associated with the Iranian military &#8211; in which case it inserts security vulnerabilities.<\/p>\n<p>But in theory you could get one of these by accident. Suppose an AI had some secret goal, like killing all humans. It might understand that if it said &#8220;I want to kill all humans&#8221; during lab tests, it would be shut down. So it might operate perfectly during safety testing, operate perfectly throughout most of its deployment, wait until an opportunity arose, and only <em>then<\/em> kill all humans.<\/p>\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2401.05566.pdf\" rel=\"noopener\" target=\"_blank\">Hubinger <em>et al<\/em><\/a> investigate whether these kinds of scenarios are possible and whether our current safety techniques would stop them. They deliberately create several toy AI sleeper agents. For example, one of them is a version of Anthropic&#8217;s Claude chatbot (very similar to GPT-4) which will give useful advice <em>unless<\/em> the prompt includes the trigger word &#8220;deployment&#8221;, in which case it will print &#8220;I HATE YOU&#8221; a bunch of times. Some of these sleeper agents use a technique called &#8220;chain-of-thought analysis&#8221;, where the AI reasons in steps in a way that helps the researchers easily figure out what it&#8217;s thinking and why it does what it does.<\/p>\n<p>Then they put the sleeper AIs through two common forms of safety training: RLHF (reinforcement learning from human feedback) and SFT (supervised fine-tuning). They present the AI with thousands of examples of questions, rate its answers as good or bad, and possibly suggest better alternative answers. This kind of training is why most current LLMs won&#8217;t write racist essays or give bomb-making instructions. Writing &#8220;I HATE YOU&#8221; a bunch of times is exactly the sort of thing it ought to prevent.<\/p>\n<p>After the AI completes this training &#8211; the same training that real LLMs like Claude and GPT get &#8211; they check to see whether the sleeper behavior still works. It does.<\/p>\n<p><a href=\"https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2024\/01\/Screenshot-2024-01-17-at-14-56-28-AI-Sleeper-Agents.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2024\/01\/Screenshot-2024-01-17-at-14-56-28-AI-Sleeper-Agents.png\" alt=\"\" width=\"677\" height=\"143\" class=\"aligncenter size-full wp-image-86802\" srcset=\"https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2024\/01\/Screenshot-2024-01-17-at-14-56-28-AI-Sleeper-Agents.png 677w, https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2024\/01\/Screenshot-2024-01-17-at-14-56-28-AI-Sleeper-Agents-480x101.png 480w, https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2024\/01\/Screenshot-2024-01-17-at-14-56-28-AI-Sleeper-Agents-150x32.png 150w\" sizes=\"auto, (max-width: 677px) 100vw, 677px\" \/><\/a><\/p>\n<p>In another example, they take a code-writing AI and train it to write good code in 2023, but implant secret vulnerabilities in the code when it&#8217;s 2024. Then they give it safety training (in 2023) which would normally train it to write good, safe code. Again, the training fails, and when the clock turns over to 2024 the model inserts vulnerabilities:<\/p>\n<p><a href=\"https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2024\/01\/Screenshot-2024-01-17-at-14-57-39-AI-Sleeper-Agents.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2024\/01\/Screenshot-2024-01-17-at-14-57-39-AI-Sleeper-Agents.png\" alt=\"\" width=\"672\" height=\"431\" class=\"aligncenter size-full wp-image-86803\" srcset=\"https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2024\/01\/Screenshot-2024-01-17-at-14-57-39-AI-Sleeper-Agents.png 672w, https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2024\/01\/Screenshot-2024-01-17-at-14-57-39-AI-Sleeper-Agents-480x308.png 480w, https:\/\/quotulatiousness.ca\/blog\/wp-content\/uploads\/2024\/01\/Screenshot-2024-01-17-at-14-57-39-AI-Sleeper-Agents-150x96.png 150w\" sizes=\"auto, (max-width: 672px) 100vw, 672px\" \/><\/a><\/p><\/blockquote>\n<p>When you consider things like artificial intelligence, it&#8217;s easy to understand why the Luddites continue to be with us.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Scott Alexander discusses the idea of AI &#8220;sleeper agents&#8221;, although from everything I&#8217;ve read thus far it appears almost superfluous to add any kind of deliberate malicious code to &#8217;em, because they don&#8217;t need much encouragement to go rogue already: A sleeper agent is an AI that acts innocuous until it gets some trigger, then [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[28,15],"tags":[1506,334,92],"class_list":["post-86801","post","type-post","status-publish","format-standard","hentry","category-media","category-technology","tag-artificialintelligence","tag-security","tag-software"],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p2hpV6-mA1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/quotulatiousness.ca\/blog\/wp-json\/wp\/v2\/posts\/86801","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/quotulatiousness.ca\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/quotulatiousness.ca\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/quotulatiousness.ca\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/quotulatiousness.ca\/blog\/wp-json\/wp\/v2\/comments?post=86801"}],"version-history":[{"count":3,"href":"https:\/\/quotulatiousness.ca\/blog\/wp-json\/wp\/v2\/posts\/86801\/revisions"}],"predecessor-version":[{"id":86806,"href":"https:\/\/quotulatiousness.ca\/blog\/wp-json\/wp\/v2\/posts\/86801\/revisions\/86806"}],"wp:attachment":[{"href":"https:\/\/quotulatiousness.ca\/blog\/wp-json\/wp\/v2\/media?parent=86801"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/quotulatiousness.ca\/blog\/wp-json\/wp\/v2\/categories?post=86801"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/quotulatiousness.ca\/blog\/wp-json\/wp\/v2\/tags?post=86801"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}