The AI Safety Fundamentals Programme.
A 6-week programme that introduces you to the fundamentals of AI safety and current issues.
What?
The AI Safety Fundamentals Programme is designed to make AI safety, and the governance of risks from advanced AI, more accessible. Because the field of AI safety is so new, there are no textbooks or widespread university courses; this programme aims to supplement existing material and fill that gap.
In this programme, we bring participants together with experts and knowledgeable facilitators to discuss six weeks of curated readings that introduce the field. The programme involves introductory lectures and active weekly discussions with your group (4-5 people) and an experienced facilitator. Our curriculum is largely adapted from analogous programmes run at Cambridge University and Harvard University.
Who?
Anyone with an interest in ensuring that AI is developed safely, whether from a technical or a policy perspective, is welcome to apply. Please note that since our curriculum is targeted at technical AI safety, a basic knowledge of machine learning techniques is useful, but not strictly necessary! Several past participants have successfully completed the programme with no prior knowledge. You can indicate your level of knowledge on the application form so that we can match you with people of similar backgrounds.
When?
Introduction evening: Thursday, October 2nd, 18:15. Sign up now!
Application period: October 2nd - October 16th (application link coming soon!)
Programme duration: 6 weeks, from October 23rd to November 27th. Each group has one meeting per week. The exact dates are determined for each group individually.
Curriculum
The programme curriculum consists of 6 weeks of readings and facilitated discussions. Participants are divided into groups of 4-6 people, matched based on their prior knowledge of ML and safety. (No background in machine learning is strictly required, but participants will be expected to have some fluency in basic statistics and mathematical notation.)
Each week, each group and their discussion facilitator will meet for 1.5 hours to discuss the readings and exercises. Broadly speaking, the first half of the course explores the motivations and arguments underpinning the field of AI safety, while the second half focuses on proposals for technical solutions. After the last week, participants will have the chance to explore a topic of their choice in depth if they wish, and/or join the ZAIS biweekly reading group.
Overview
The main focus each week will be on the core readings and one exercise of your choice from those listed, for which you should allocate around 2-3 hours of preparation time. Most people find some concepts from the readings confusing, but that’s totally fine; resolving those uncertainties is what the discussion groups are for. Approximate times to read each piece in depth are listed next to it. Note that in some cases only a small section of the linked reading is assigned. In several cases, blog posts about machine learning papers are listed instead of the papers themselves; you’re only expected to read the blog posts, but for those with strong ML backgrounds reading the full papers might be worthwhile.
If you’ve already read some of the core readings, or want to learn more about a topic, then the further readings are recommended; see the notes for descriptions of them. However, none of them are compulsory. Also, you don’t need to think about the discussion prompts in advance; they’re just for reference during the discussion session.
Expectations
Key Topics
00
How and why might we build future AI systems?
We will first describe AI as a collection of many approaches. We will then examine the key techniques behind recent progress: neural networks, gradient descent, and transformers. We'll look at how these techniques are used to train large language models (LLMs), the technology behind most state-of-the-art foundation models such as ChatGPT.
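To give a flavour of what training with gradient descent means in practice, here is a minimal sketch in Python. It fits a toy linear model to synthetic data; the data and parameters are made up for illustration, and real LLM training applies the same idea at vastly larger scale using libraries such as PyTorch.

    # Minimal sketch of gradient descent on a toy linear model (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))            # 100 examples, 3 features
    true_w = np.array([2.0, -1.0, 0.5])      # "ground truth" parameters
    y = X @ true_w + rng.normal(scale=0.1, size=100)

    w = np.zeros(3)                          # model parameters, initialised at zero
    lr = 0.1                                 # learning rate
    for step in range(200):
        preds = X @ w
        grad = 2 * X.T @ (preds - y) / len(y)   # gradient of the mean squared error
        w -= lr * grad                          # gradient descent update
    print(w)                                 # approaches true_w

The training loop (predict, measure error, follow the gradient downhill) is the same basic recipe used to train neural networks, just with far more parameters and data.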
01
What do we need to do to ensure AI systems do what we want, and why is this difficult?
We will explore the challenges involved in ensuring that AI’s impacts are positive, or at least in avoiding the most negative impacts of powerful AI systems. We’ll explore what ‘alignment’ means, and where it sits in the wider AI safety landscape. We’ll then discuss common arguments about AI safety, as well as concepts such as outer/inner alignment, agents, and convergent instrumental goals.
02
Why do AI systems today mostly do what we want?
Last week we looked at the difficulties of getting very powerful AI systems to do what we want. Despite this, many AI systems today do seem to at least try to do what we want.
In this session, we’ll dive into the main way today's AI systems achieve this: Reinforcement Learning from Human Feedback (RLHF).
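As a rough illustration of the reward-modelling step in RLHF, here is a toy sketch in Python (PyTorch). The tiny "reward model" and the random response embeddings are stand-ins invented for illustration; in practice the reward model is itself a language model scoring full responses ranked by human raters.

    # Toy sketch of reward-model training from human preference data (illustrative only).
    import torch
    import torch.nn as nn

    reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

    chosen = torch.randn(8, 16)     # stand-in embeddings of responses the rater preferred
    rejected = torch.randn(8, 16)   # stand-in embeddings of responses the rater rejected

    for _ in range(100):
        r_chosen = reward_model(chosen)
        r_rejected = reward_model(rejected)
        # Preference loss: push the chosen response's score above the rejected one's.
        loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The learned reward model is then used as a training signal for the language model itself, so that it produces responses humans tend to prefer.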
03
How might we scale human feedback for more powerful and complex models?
In this session, we dive into the problem of "scalable oversight": how can we effectively provide feedback to AI systems doing tasks that are hard for humans to judge? We will critically evaluate a few proposed approaches that aim to address this issue, including iterated amplification, debate, and weak-to-strong generalization.
04
Further techniques to make AI systems safe.
In this session, we will explore some further techniques to increase the safety of current AI systems. In particular, we will look at adversarial attacks and how to increase model robustness against them, how one could potentially remove harmful knowledge from LLMs via unlearning, and how we could control risks from advanced AI systems, even when models might be intentionally deceptive.
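As a small taste of what an adversarial attack can look like, here is a sketch of the fast gradient sign method (FGSM) on a toy classifier in PyTorch. The model and data are random stand-ins for illustration; the point is that a tiny, targeted perturbation of the input can be enough to change a model's prediction.

    # Toy sketch of an FGSM adversarial perturbation (illustrative only).
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(1, 10, requires_grad=True)   # an input example
    y = torch.tensor([0])                        # its true label

    loss = loss_fn(model(x), y)
    loss.backward()                              # gradient of the loss w.r.t. the input

    epsilon = 0.1
    x_adv = x + epsilon * x.grad.sign()          # step in the direction that increases the loss

    print(model(x).argmax(dim=1), model(x_adv).argmax(dim=1))   # the prediction may flip

Robustness techniques aim to make models resistant to exactly this kind of small but adversarially chosen input change.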
05
How might we understand what’s going on inside an AI model?
As modern AI systems grow more capable, there is an increasing need to make their internal reasoning and decision-making more interpretable and transparent. This week we'll look at how people are analysing models' learned representations and weights through methods like circuit analysis, and teasing apart phenomena like superposition with dictionary learning.
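To make dictionary learning slightly more concrete, here is a toy sketch of a sparse autoencoder in PyTorch. The "activations" below are random stand-ins; in interpretability work they would be hidden states collected from a real model, and the learned dictionary features would then be inspected for human-interpretable meaning.

    # Toy sketch of dictionary learning with a sparse autoencoder (illustrative only).
    import torch
    import torch.nn as nn

    d_model, d_dict = 64, 256               # activation size, dictionary size
    encoder = nn.Linear(d_model, d_dict)
    decoder = nn.Linear(d_dict, d_model, bias=False)
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
    )

    activations = torch.randn(1024, d_model)    # stand-in for collected model activations

    for _ in range(200):
        features = torch.relu(encoder(activations))      # sparse feature activations
        reconstruction = decoder(features)
        # Reconstruction loss plus an L1 penalty that encourages sparse, separable features.
        loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The sparsity penalty encourages each feature to fire rarely, which is what helps pull apart concepts that the model stores in superposition.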
06
How might we measure and mitigate the risks of deploying AI models?
If we are able to create extremely powerful AI systems without solving the alignment problem, rigorous technical governance approaches will be critical for mitigating the risks posed by these AI systems. Pre-deployment testing, AI control techniques, and global coordination approaches (such as pausing) could all play a role in limiting dangerous AI capabilities.
Resource
The full curriculum we will be following is given here:
Note that this is the standard baseline curriculum, and your group and facilitator may wish to skip and/or supplement the suggested readings depending on your skill level & interests.
Interested in Facilitating?
Are you interested in leading discussions on AI alignment and helping to build the AI safety community? We’d love to have you join our team of facilitators. In general, our facilitators:
Have a solid background in a STEM-related field, or have done independent research or projects in ML
Are often studying for an MSc or PhD, though this is not a requirement
Have solid communication skills, and are able to motivate discussions and get participants to engage
Are approachable and friendly, and able to encourage participants to ask questions
Please fill out the form below and we will get in touch regarding facilitating opportunities.