Intro to RLHF

  • We watched this video from Rational Animations as introductory material; it covers a failed RLHF run during OpenAI’s training of GPT-2, caused by a bug in its reward function.
  • We went over the materials in Week 3 of BlueDot Impact’s AI Safety Fundamentals course, discussing what RLHF is and some problems with the technique, both fundamental and tractable.

A Walkthrough in Forecasting

Motivation: AI safety is concerned with a breakdown of society due to A(G)I, but as of 17/3/2024, no such event has ever happened in human history. We have a useful qualitative tool for anticipating such events, called imagination, but to quantify them somewhat, we use something called forecasting.

  • We watched this video series by Dart Throwing Spider Monkey (great name) on the basic theory, techniques, and applications of forecasting.
  • We then tried applying these techniques to a toy problem: how likely is Jord to become a coauthor on the CNN Interpretability SATML competition report paper? Each person was tasked with doing their own research to produce a probability. The members’ methodologies follow:

Jord’s sample prediction (image):

He took the mean of the success rates for his particular method on natural feature trojans => 59%.
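
As a rough illustration, this is the whole calculation; a sketch where the per-trojan success rates are hypothetical placeholders chosen so the mean lands at his reported 59%, not Jord’s actual data:

```python
# Jord's method: average the success rates of his chosen approach on the
# natural feature trojans and read the mean off as P(coauthor).
success_rates = [0.55, 0.63, 0.60, 0.58]  # hypothetical placeholder rates

p_coauthor = sum(success_rates) / len(success_rates)
print(f"P(coauthor) ~= {p_coauthor:.0%}")  # 59% with these placeholders
```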

Some predictions:

| member | reason | prediction |
| --- | --- | --- |
| purplechair | “relevant search” prediction: took the number of stars on the GitHub repo as the number of participants, ran the evaluations himself to get his own success rate, then converted it to Jord’s using the JP coefficient | 30.12% |
| TheBlindMeister | read up on undergraduate coauthor publishing rates | 20% |
| Gumperto | “improved bogo” prediction: knew the distribution of problem difficulty is not uniform, so took 50/50: 2 of the 4 trojans are easy at 10% each, and the other two are harder at ~15% each (sketched below) | 49% (the relevant information that Jord is worse than the other participants reduced this by 1%) |
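
The arithmetic behind Gumperto’s entry is simple enough to spell out. The numbers below are taken from his row in the table; the only assumption is treating his “~15%” as exactly 0.15:

```python
# Gumperto's "improved bogo" estimate: sum per-trojan probabilities,
# then adjust downward for Jord's relative skill.
easy = [0.10, 0.10]   # 2 of the 4 trojans judged easy: 10% each
hard = [0.15, 0.15]   # the other two judged harder: ~15% each
adjustment = -0.01    # Jord being worse than the other participants

p_coauthor = sum(easy) + sum(hard) + adjustment
print(f"P(coauthor) ~= {p_coauthor:.0%}")  # 49%
```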

For an extra challenge, we also tried to calculate the amount of money we expected Jord to receive by the end of the competition. Below are purplechair’s and Gumperto’s calculations:

purplechair’s prediction (image); Gumperto’s prediction (image).
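
Since the worked figures live in the images above, here is only the shape of an expected-value calculation of this kind, a sketch with made-up payouts and probabilities rather than either member’s actual numbers:

```python
# Expected payout as a probability-weighted sum over assumed outcomes.
# All payouts and probabilities below are illustrative assumptions only.
outcomes = [
    (0.70, 0),    # no coauthorship, no money
    (0.20, 200),  # coauthorship with a small prize share
    (0.10, 500),  # coauthorship with a larger prize share
]

expected_payout = sum(p * payout for p, payout in outcomes)
print(f"E[payout] ~= ${expected_payout:.0f}")  # $90 under these assumptions
```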

Update: Jord did end up becoming a coauthor on the SATML report paper (the closest prediction was Jord’s own), and he received a total of $200 from it (the closest prediction was purplechair’s).

Resources update

  • Jord introduced a few new resources and opportunities to everyone at the end of the session.

Jord’s forecasting resolution note: since this was an AI conference competition report paper, using a base rate for undergraduate coauthorship drawn from psychology research journals (the only source we found) was highly inaccurate.