An Intro to RLHF
What makes a piece of text a good one? That is not easy to define, because judgments about text are subjective and context-dependent. In recent years, language models have demonstrated impressive capabilities in generating diverse and compelling text from human prompts.
Imagine if we could use human feedback on generated text as a metric for evaluating the model’s performance, or better yet, use that feedback as a loss signal to optimize the model itself. This is the basic idea behind Reinforcement Learning from Human Feedback, or RLHF. The method is what the name says: it uses reinforcement learning to directly improve a language model with human feedback. With RLHF, models trained on large collections of text data can be aligned with complex human values.
The clearest example of RLHF’s success is ChatGPT, where this technique is one of the main reasons the chatbot works so well.
How does RLHF apply to large language models (LLMs)?
A Closer Look into Reinforcement Learning from Human Feedback
Reinforcement learning is a field of machine learning in which agents learn to make decisions by interacting with an environment. An agent takes actions (which can include choosing not to act at all) that affect the environment, triggering state transitions and rewards. Rewards are the key signal for refining the agent’s decision-making strategy: throughout training, the agent updates its policy to maximize cumulative reward. This setup enables continuous learning and improvement over time.
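To make this loop concrete, here is a minimal sketch in Python with a toy environment, a placeholder random policy, and a made-up reward. Every name in it is hypothetical and simply stands in for the real components an RL system would use.

```python
import random

def toy_environment(state, action):
    """Toy stand-in for an environment: return the next state and a reward."""
    next_state = state + action                     # toy state transition
    reward = 1.0 if next_state % 2 == 0 else -1.0   # toy reward signal
    return next_state, reward

def policy(state):
    """Placeholder policy: pick an action at random.
    RL training would adjust this mapping to maximize cumulative reward."""
    return random.choice([0, 1])

state, total_reward = 0, 0.0
for step in range(10):                              # one short episode
    action = policy(state)
    state, reward = toy_environment(state, action)
    total_reward += reward                          # cumulative reward the agent tries to maximize

print("cumulative reward:", total_reward)
```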
RLHF builds on RL training by making the reward signal human-centred. This technique has been pivotal in several of the latest chatbots making headlines, such as:
- OpenAI’s ChatGPT
- InstructGPT
- DeepMind’s Sparrow
- Anthropic’s Claude
With RLHF, LLMs are not merely trained to predict the next word; they are trained to understand instructions and give appropriate responses.
Why Language is a Problem in Reinforcement Learning
LLMs have proven to be good at handling a wide variety of tasks, such as:
- Code generation
- Text generation
- Question answering
- Protein folding
- Text summarization
At a large enough scale, they are capable of zero-shot and few-shot learning, performing tasks they have not been explicitly trained on. The transformer, the architecture used in large language models (LLMs), reached a significant milestone by demonstrating that such models can be trained without supervision.
LLMs, impressive as they are, share basic features with other ML models: they are huge prediction machines that guess the next token in a sequence. The biggest challenge here is that there is more than one correct answer for a given prompt, and not all of those answers are desirable for a specific context, application, or user. Moreover, unsupervised learning on extensive text corpora, although beneficial, may not fully match the diverse range of uses the model will encounter.
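As a rough illustration of “many valid answers per prompt”, the sketch below samples three different continuations of the same prompt. It assumes the Hugging Face transformers library is installed and uses the small GPT-2 checkpoint purely as a stand-in model.

```python
# Sketch: one prompt, several plausible completions via sampling.
# Assumes the Hugging Face `transformers` library and the small GPT-2 checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The best way to learn programming is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,            # sample instead of greedy decoding
    temperature=0.9,
    max_new_tokens=20,
    num_return_sequences=3,    # three different continuations of the same prompt
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```

All three outputs are plausible continuations, but they are not equally useful to every user or application, which is exactly the gap RLHF tries to close.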
In such cases, RL can guide LLMs in the right direction. To understand this better, let’s frame language generation as a reinforcement learning problem:
- The agent: the language model itself acts as the RL agent, aiming to generate the best possible text output.
- The action space: the set of possible text outputs the LLM can generate.
- The state space: the state of the environment, consisting of the user’s prompt and the text the LLM has generated so far.
- The reward: a measure of how well the LLM’s response fits the application context and the user’s intent.
Everything here is relatively straightforward except the reward. Defining clear and effective criteria for rewarding the language model’s output is not a simple task. Fortunately, RLHF makes it possible to build an effective reward signal for the language model.
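The toy sketch below maps these pieces onto code. The vocabulary, the token sampler, and the scoring function are all made-up placeholders; in real RLHF the actions come from an LLM and the reward from a learned reward model.

```python
import random

VOCAB = ["the", "sun", "gives", "plants", "energy", "."]   # toy action space

def sample_next_token(state):
    """Toy stand-in for the LLM policy: choose the next token (action)."""
    return random.choice(VOCAB)

def score_response(prompt, response):
    """Toy stand-in for the reward: favour longer, period-terminated answers."""
    return 0.1 * len(response.split()) + (1.0 if response.endswith(".") else 0.0)

prompt = "Explain photosynthesis to a child."    # user prompt: part of the state
tokens = []
for _ in range(10):
    state = prompt + " " + " ".join(tokens)      # state = prompt + text generated so far
    tokens.append(sample_next_token(state))      # action = next token chosen by the policy

reward = score_response(prompt, " ".join(tokens))
print("response:", " ".join(tokens))
print("reward:", reward)
```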
How Does RLHF Work? The 3 Steps of RLHF for Language Models
RLHF is a challenging process: it involves training multiple models and moving through several stages of deployment.
As such, Reinforcement Learning from Human Feedback is executed through three basic steps:
Step 1: The Pre-trained Language Model
Initially, RLHF starts from a language model pre-trained with classical pretraining objectives. This step is crucial because LLMs require vast amounts of training data.
An LLM trained this way, without supervision, is a good language model capable of generating coherent output. However, that output will not always be relevant to the user’s needs and goals.
Further, fine-tuning the model on labelled data helps it generate more correct and appropriate results for specific tasks or domains.
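As a rough sketch of what supervised fine-tuning on labelled data looks like in practice, the snippet below runs one training step on a single hypothetical prompt-answer pair. It assumes PyTorch and the Hugging Face transformers library, with GPT-2 standing in for the actual model.

```python
# Sketch of supervised fine-tuning on labelled (prompt, response) data.
# Assumes PyTorch, the Hugging Face `transformers` library, and GPT-2 as a stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One hypothetical labelled example; a real dataset would contain many.
prompt, response = "Translate to French: Hello", " Bonjour"
batch = tokenizer(prompt + response, return_tensors="pt")

# For causal LMs, passing labels=input_ids yields the next-token prediction loss.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print("SFT loss:", outputs.loss.item())
```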
Step 2: Training the reward model
The reward model is trained to recognize desirable outputs from the generative model and to rate them on relevance and accuracy.

For each training example, the main LLM receives a prompt and generates several outputs. This produces a dataset of LLM-generated text that is then labelled for quality.

Next, human evaluators review the generated texts and rank them from best to worst. The reward model is trained to predict that quality score directly from the LLM’s text. With this learned reward signal, the generative model can be steered toward better and more relevant results.
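A common way to train such a reward model from rankings is a pairwise ranking loss: for each pair of responses, the score of the human-preferred one is pushed above the score of the rejected one. The sketch below illustrates that loss with a toy linear scorer over made-up features; a real reward model would be a transformer scoring the (prompt, response) text.

```python
# Sketch of reward-model training with a pairwise ranking loss.
# The random "features" and the tiny linear scorer are toy stand-ins.
import torch
import torch.nn as nn

reward_model = nn.Linear(8, 1)                       # toy scorer: features -> scalar reward
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-3)

# One hypothetical comparison: features of a human-preferred ("chosen") response
# and a less-preferred ("rejected") response to the same prompt.
chosen_features = torch.randn(1, 8)
rejected_features = torch.randn(1, 8)

chosen_score = reward_model(chosen_features)
rejected_score = reward_model(rejected_features)

# Pairwise ranking loss: push the chosen response's score above the rejected one's.
loss = -torch.nn.functional.logsigmoid(chosen_score - rejected_score).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
print("ranking loss:", loss.item())
```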
Step 3: Fine-tuning with RL
In the last phase, the reinforcement learning loop is set up: some or all parameters of a copy of the initial language model are fine-tuned with a policy-gradient RL algorithm. In reinforcement learning, the policy chooses actions from a given state so as to maximize reward, enabling the model to learn and adapt as it goes.
The model interacts with its environment and receives feedback in the form of rewards or penalties, learning which actions yield positive outcomes. Rigorous testing with a curated group of evaluators then checks that it behaves competently in real situations and makes accurate predictions.
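To show the shape of this loop, here is a minimal policy-gradient (REINFORCE-style) update on a toy policy with a made-up reward. It is only a sketch of the idea; production RLHF systems typically use PPO over a full language model, with the reward coming from the trained reward model.

```python
# Sketch of one RL fine-tuning step with a simple policy-gradient update.
# The policy, state, and reward are toy stand-ins.
import torch
import torch.nn as nn

policy = nn.Linear(4, 3)                           # toy policy: state features -> token logits
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-3)

def toy_reward(token_id):
    """Toy reward: pretend token 2 is the response humans prefer."""
    return 1.0 if token_id == 2 else 0.0

state = torch.randn(1, 4)                          # toy prompt representation
logits = policy(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                             # the "generated token"
reward = toy_reward(action.item())

# Policy-gradient step: raise the log-probability of actions that earned high reward.
loss = -(dist.log_prob(action) * reward).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
print("action:", action.item(), "reward:", reward)
```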
3 Ways ChatGPT Utilizes RLHF
Here’s how ChatGPT applies the RLHF framework in each phase:
Phase 1
A pre-trained GPT-3.5 model underwent supervised fine-tuning: a team of writers wrote answers to a set of prompts, and engineers used this dataset of prompt-answer pairs to refine the large language model.
Phase 2
A reward model was then created: the fine-tuned model generated several answers to each prompt, and human annotators ranked the responses.
Phase 3
The Proximal Policy Optimization (PPO) reinforcement learning algorithm was used to train the main LLM. OpenAI has shared relatively few details on how the RL-trained model is kept from moving too far away from the original distribution.
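A common approach in the RLHF literature is to subtract a KL-divergence penalty, measured against a frozen copy of the original model, from the reward so the tuned policy cannot drift too far. The sketch below shows only that shaping step, with random toy logits and a made-up penalty coefficient.

```python
# Sketch of the KL-penalty trick used to keep the RL-tuned model close to the
# original: the reward is reduced by how far the new policy's token probabilities
# drift from the frozen reference model's. All values here are toy placeholders.
import torch
import torch.nn.functional as F

kl_coef = 0.1                                    # strength of the KL penalty (hypothetical value)

policy_logits = torch.randn(1, 5, 50)            # (batch, generated tokens, vocab) from the RL policy
reference_logits = torch.randn(1, 5, 50)         # same shape, from the frozen original model

policy_logprobs = F.log_softmax(policy_logits, dim=-1)
reference_logprobs = F.log_softmax(reference_logits, dim=-1)

# Per-token KL divergence between the RL policy and the reference model.
per_token_kl = (policy_logprobs.exp() * (policy_logprobs - reference_logprobs)).sum(dim=-1)

reward_model_score = torch.tensor(0.8)           # toy score from the reward model
shaped_reward = reward_model_score - kl_coef * per_token_kl.sum()
print("KL-shaped reward:", shaped_reward.item())
```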
The Limitations of RLHF For Language Models
As effective as RLHF is, it is limited in several ways:
- Its performance is limited by the quality of the human annotations. Since manual labelling is costly and time-consuming, unsupervised learning remains the more attractive option for ML researchers.
- In certain situations, you can leverage your ML system’s users to obtain labelled data at no cost, for example through the upvote and downvote buttons in LLM interfaces. Even then, the dataset needs to be cleaned and reviewed, which can be slow and costly.
- Creating RLHF datasets requires a level of funding that only large tech organizations and well-financed labs can afford. Smaller companies must depend on open-source datasets and web scraping.
Know About RLHF From Opporture, the AI Company
In conclusion, Reinforcement Learning from Human Feedback, or RLHF, is an effective way to train models that conform to human preferences and deliver desired outputs. RLHF provides a framework for aligning large language models with human values. By incorporating human feedback, it helps guide LLMs away from generating harmful or incorrect outputs. RLHF is also expected to be an effective way to optimize smaller LLMs for specific applications, leading to more efficient and tailored language generation capabilities.
To know more about RLHF, you can take the assistance of the Opporture team. Opporture is an AI company known for its AI-driven content-related services in North America. Get in touch with us today to learn more about how we can help you.