Simon Case, Head of Data

AI, Our Thinking Thu 2nd February, 2023

Adventures in fine tuning ChatGPT

In my first blog post on ChatGPT I noticed that, to get good results for most of the work I wanted, I needed to use the latest model, which is tuned for instruction-following, ChatGPT-style tasks (text-davinci-003); the base models just didn’t seem to cut it.

I was intrigued by this, so I decided to have another go at fine-tuning ChatGPT to see if I could find out more. Base models (e.g. davinci as opposed to text-davinci-003) use pure GPT-3. (Actually, I think they are GPT-3.5, but let’s not worry about that for now.) They generate completions based on input text: give them the start of a sentence and they will tell you what the next words will typically be. However, this is different from answering a question or following an instruction.

In the following example you can see the top 5 answers to the prompt “Tell me a story” when using the base GPT-3 model (which is davinci – currently the most powerful we have access to):

You can see from these answers how GPT-3 works: it has found completions based on examples in the training corpus. A line such as “Tell me a story, mama,” the youngster said. is likely a common completion in several novels. But this is not what I wanted! What I wanted was for ChatGPT to actually tell me a story.
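To make the idea of "completion from the training corpus" concrete, here is a toy sketch. It is not how GPT-3 works internally (that is a large neural network, not a lookup); it simply counts what follows a prompt in a tiny, made-up corpus. The corpus sentences and the function name top_completions are invented for illustration.

```python
from collections import Counter

# Tiny made-up corpus standing in for training data (purely illustrative).
CORPUS = [
    "tell me a story , mama , the youngster said .",
    "tell me a story , she whispered .",
    "tell me a story about the sea .",
    "tell me a story , mama , before bed .",
]


def top_completions(prompt, corpus, k=5, length=2):
    """Return the k most frequent `length`-token continuations of `prompt`."""
    prompt_tokens = prompt.lower().split()
    n = len(prompt_tokens)
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        # Slide over the sentence looking for the prompt tokens.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == prompt_tokens:
                continuation = " ".join(tokens[i + n:i + n + length])
                if continuation:
                    counts[continuation] += 1
    return counts.most_common(k)


# The most frequent continuation wins, whether or not it is a story.
print(top_completions("Tell me a story", CORPUS))
```

Even in this crude form, the "model" continues the sentence the way its corpus usually does, rather than doing what the prompt asks.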

If I give the same query to a different model – text-davinci-003 – I get the sort of results I was looking for. (The answers have been restricted to a short length, but it could carry on extensively.)

Clearly the text-* models have significantly enhanced functionality compared with the standard GPT-3 models. In fact, one of the important things that the developers have done is focus explicitly on understanding the user’s intention when they write a prompt. As well as allowing ChatGPT to understand the intention of the user, this approach also helps to stop it responding in toxic ways.

How they do this is fascinating; they use humans in the loop, and reinforcement learning. First, a group of people create the sort of training set you’d expect in a supervised learning approach. For a given prompt, “Tell me a story”, a person creates the response “Once upon a time there was a …” These manually created prompt-response pairs are used to fine-tune ChatGPT’s base model. Then, people (presumably a different group of people) are used to evaluate the performance by ranking outputs from the model; these rankings are used to create a separate reward model. Finally, the reward model is used in a reinforcement learning approach to update the fine-tuned model made in the first stage. (Reinforcement learning is an ML technique where you try lots of changes and see which ones do better against some reward function. It’s the approach that DeepMind used to train a computer to play video games.)
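The three stages can be sketched in miniature. This is a toy stand-in under heavy assumptions, not OpenAI's implementation: the "model" is just a set of scores over three fixed candidate responses, the "reward model" is a simple rank-counting score rather than a trained network, and the reinforcement learning step is a basic policy-gradient (REINFORCE) update. All names and numbers are hypothetical.

```python
import math
import random

RESPONSES = [
    "Once upon a time there was a ...",  # what the user actually wanted
    ", mama, the youngster said.",       # a plain text completion
    "I don't want to.",                  # an unhelpful reply
]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def supervised_finetune(logits, demonstrations, lr=0.5):
    """Stage 1: nudge the model towards human-written responses."""
    logits = list(logits)
    for idx in demonstrations:
        probs = softmax(logits)
        # Gradient of log p(idx) w.r.t. the logits is one_hot(idx) - probs.
        for j in range(len(logits)):
            logits[j] += lr * ((1.0 if j == idx else 0.0) - probs[j])
    return logits


def fit_reward_model(rankings, n):
    """Stage 2: turn human rankings (best first) into one reward per response."""
    scores = [0.0] * n
    for ranking in rankings:
        for position, idx in enumerate(ranking):
            scores[idx] += n - 1 - position
    top = max(scores) or 1.0
    return [s / top for s in scores]


def reinforce(logits, rewards, steps=500, lr=0.1, seed=0):
    """Stage 3: sample responses and reinforce those the reward model likes."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        idx = rng.choices(range(len(logits)), weights=probs)[0]
        baseline = sum(p * r for p, r in zip(probs, rewards))
        advantage = rewards[idx] - baseline
        for j in range(len(logits)):
            logits[j] += lr * advantage * ((1.0 if j == idx else 0.0) - probs[j])
    return logits


base = [0.0, 2.0, 0.0]  # the base model prefers the plain completion
sft = supervised_finetune(base, demonstrations=[0, 0, 0])
rewards = fit_reward_model([[0, 1, 2], [0, 2, 1], [0, 1, 2]], n=len(RESPONSES))
tuned = reinforce(sft, rewards)

for name, logits in [("base", base), ("fine-tuned", sft), ("after RL", tuned)]:
    print(name, [round(p, 3) for p in softmax(logits)])
```

In this toy, the probability of the story response rises at each stage, which is the general shape of the process described above; the real thing does this over a huge space of possible outputs, which is where the cost comes from.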

It’s worth noting the sort of investment required to do this. Using people to label and evaluate machine learning outputs is expensive, but it is a great way to improve the performance of a model. In recent LinkedIn posts I have seen people claiming to have been offered jobs to do this sort of work, so it looks like OpenAI are continuing to refine the model in this way. Reinforcement learning is usually a very expensive way of learning something; reinforcement learning on top of deep learning is even more expensive, so there has clearly been a lot of time, effort and money expended in developing the model. 

OpenAI are open about their approach – here’s their take on how to use ChatGPT; I really applaud their openness. I’m aware there are a lot of negative reactions to the tool, but it’s worth pointing out that the developers are clearly aware that the model is not perfect. If you look at the limitations section of the document, they note several things, including:

  • It can still generate toxic or biased outputs
  • It can be instructed to produce unsafe outputs
  • It is trained only in English, so will be culturally biased towards English-speaking people

In my opinion, generative models are not deterministic in their outputs, so these sorts of risks will always remain. But I think serious thought and expense have been applied in order to reduce the likelihood of problematic results.
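A small sketch of why sampling leaves some residual risk: if outputs are drawn at random from a probability distribution, even an output the model scores poorly will occasionally appear. The scores below are hypothetical, and the function sample_with_temperature is an illustrative helper, not part of any real API.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Draw one index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights)[0]


# Hypothetical scores for three candidate outputs; index 2 is the "problematic" one.
LOGITS = [3.0, 2.0, -1.0]

rng = random.Random(42)
draws = [sample_with_temperature(LOGITS, 1.0, rng) for _ in range(10_000)]

# The unlikely output is rare, but it still turns up in a large enough sample.
print("share of problematic outputs:", draws.count(2) / len(draws))
```

Training can push the score of a bad output down, which makes it rarer, but as long as outputs are sampled it does not reduce the chance to exactly zero.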

I started this investigation into fine-tuning ChatGPT because I noticed that I wasn’t able to fine-tune on the text-* versions of the model. In fact, it clearly states in the ChatGPT fine-tuning guide that only the base models can be fine-tuned. Now I understand why. Unfortunately, I also understand better that the text-* models are those which have been improved with the human-in-the-loop process. And these are the ones that have a lot of the secret sauce for ChatGPT; it is the refinement using human guidance that gives it that fantastic human-like ability which seems so impressive.
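For context, fine-tuning a base model at the time meant supplying training data as prompt/completion pairs, one JSON object per line. The sketch below builds such a file in memory. The questions and answers are placeholders rather than real data, and the separator and stop marker are assumptions chosen for illustration; check the current fine-tuning guide for the exact conventions.

```python
import json

# Placeholder question/answer pairs (hypothetical, not real training data).
EXAMPLES = [
    {"question": "What does the company do?", "answer": "<answer text goes here>"},
    {"question": "Where is the company based?", "answer": "<answer text goes here>"},
]

PROMPT_SEPARATOR = "\n\n###\n\n"  # assumed marker showing where the prompt ends
STOP_SEQUENCE = " END"            # assumed marker showing where the completion ends


def to_jsonl(examples):
    """Convert question/answer pairs into prompt/completion JSON lines."""
    lines = []
    for ex in examples:
        record = {
            "prompt": ex["question"] + PROMPT_SEPARATOR,
            # Leading space so the completion starts on a clean token boundary.
            "completion": " " + ex["answer"] + STOP_SEQUENCE,
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


print(to_jsonl(EXAMPLES))
```

Note what this data can and cannot teach: the base model learns to continue these particular prompts in this particular style, but it does not inherit the instruction-following behaviour that the human-in-the-loop process gave the text-* models.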

The implication for the use-cases I had in mind – fine-tuning ChatGPT to specific contexts, such as question answering about Equal Experts – is that you cannot use the model which understands your intention. So you will not get the sort of natural-sounding responses we see in all the examples people have been posting on the internet. That’s a bit sad for me, but at least now I know.