Introduction

With the release of ChatGPT, Large Language Models, or LLMs, have burst into the public consciousness. ChatGPT's unique combination of creativity and coherence captured the public imagination, giving rise to many novel applications. There is now even a cottage industry of experts who specialize in Prompt Engineering, the practice of crafting prompts in order to get the desired behavior from popular LLM models - a skill that combines the analytical understanding of a software engineer and the linguistic intuition of a police interrogator. In a previous article, I have demonstrated how prompt engineering can bring powerful AI capabilities to technology applications.

As powerful as prompt engineering can be, there are significant limitations when it comes to commercial applications:

The context one can provide through prompt engineering is subject to the GPT model's input limit. In the standard version of GPT-4, the limit is 8,192 tokens, with each token corresponding to roughly three-quarters of a word. This is quite a lot of text and should be more than enough for most chat-based applications, but it is not enough to give ChatGPT any expertise it doesn't already have.

Prompt engineering is mostly limited to natural language outputs. While it is easy to get reply values in JSON and/or XML formats with prompt engineering, the underlying reply is still based on natural language queries, and natural language padding can often break the desired output format. This is fine if a human is reading the output, but if another script is running the post-processing code, this can be a problem.

Lastly, prompt engineering is, almost by definition, an inexact science. A prompt engineer must construct, through trial and error, a prediction about how ChatGPT will react and zero in on the correct prompt. This can be time-consuming, unpredictable, and potentially very difficult if the desired outcome is even moderately complicated.
To solve these issues, we can turn to a less-known but nevertheless useful technique called fine-tuning. Fine-tuning allows us to use a much larger body of text, control the input and output format, and generally exert more control over the LLM model in question. In this article, we will look at what fine-tuning is, build a small fine-tuned model to test its capabilities, and finally build a more substantial model using a larger input dataset. I hope that reading this article will give you some ideas about how to use fine-tuning to improve your business applications. Without further ado, let's dive in.

What is Fine-Tuning?

Fine-tuning is OpenAI's terminology for an API that lets users train a GPT model using their own data. Using this API, a user can create a copy of OpenAI's LLM model and feed it their own training data consisting of example questions and ideal answers. The LLM is able not only to learn the information but also to understand the structure of the training data and cross-apply it to other situations. For example, OpenAI researchers have been able to use empathic questions and answers to create a model that is generally more empathic even when answering completely novel questions, and some commercial users have been able to create specialized systems that can look through case files for a lawyer and suggest new avenues of inquiry.

The API is quite easy to use. The user simply creates a JSONL (JSON Lines) file consisting of questions and answers and supplies it to the OpenAI endpoint. OpenAI will then create a copy of the specified model and train it on the new data. In the next section, we will walk through some test models to familiarize ourselves with the API and explore some of its basic capabilities before moving on to a larger endeavor.

API Demo: Building a Small Model

The Pig Latin Model

Before we start tackling big problems, let's first train a simple model to familiarize ourselves with the API.
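To make the JSON Lines format concrete before we build anything, here is a minimal sketch that writes and reads back a tiny training file of the kind the fine-tuning API expects. The records themselves are made up for illustration:

```python
import json

# Two made-up training records in the prompt/completion shape
# used by the fine-tuning API
records = [
    {'prompt': 'Turn the following word into Pig Latin: hello \n\n###\n\n',
     'completion': 'ellohay [DONE]'},
    {'prompt': 'Turn the following word into Pig Latin: apple \n\n###\n\n',
     'completion': 'appleway [DONE]'},
]

# JSONL is simply one JSON object per line
with open('example_train.jsonl', 'w') as fp:
    for rec in records:
        fp.write(json.dumps(rec) + '\n')

# Reading the file back line by line recovers the original records
with open('example_train.jsonl') as fp:
    loaded = [json.loads(line) for line in fp]

print(loaded == records)  # True
```

Because each line is an independent JSON object, very large training sets can be streamed and appended without ever parsing the whole file at once.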
For this example, let's try to build a model that can turn words into Pig Latin. Pig Latin, for those who do not know, is a simple word game that turns English words into Latin-sounding words through a simple manipulation of the syllables.

Generating Training Data

In order to train the model, we need to generate some example transformations to use as the training data. Therefore, we will need to define a function that turns a string into the Pig Latin version of the string. We will be using Python in this article, but you can use almost any major language to do the same:

```python
def pig_latin(string):
    # if the word starts with a vowel, just add "way";
    # else move the leading consonants to the end and add "ay"
    if string[0].lower() in {'a', 'e', 'i', 'o', 'u'}:
        return string + 'way'
    else:
        beginning_consonants = []
        for i in range(len(string)):
            if string[i].lower() in {'a', 'e', 'i', 'o', 'u'}:
                break
            beginning_consonants.append(string[i])
        return string[i:] + ''.join(beginning_consonants) + 'ay'
```

Now that we have our function, we will want to generate the training data. To do this, we can simply copy a body of text from the internet, extract the words from it, and turn it into Pig Latin:

```python
import re

passage = '''[passage from the internet]'''
toks = [t.lower() for t in re.split(r'\s', passage) if len(t) > 0]
pig_latin_traindata = [
    {'prompt': 'Turn the following word into Pig Latin: %s \n\n###\n\n' % t,
     'completion': '%s [DONE]' % pig_latin(t)}
    for t in toks
]
```

Notice a couple of things about this code. First, the training data is labeled such that the input is named "prompt" and the output is named "completion." Second, the input starts with an instruction and ends with the separator "\n\n###\n\n". This separator is used to indicate to the model that it should begin answering after the marker. Lastly, the completion always ends with the phrase "[DONE]". This is called a "stop sequence" and is used to help the model know when the answer has stopped.
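As a quick sanity check, we can run the transformation on a few words and inspect one full training record. The pig_latin function is repeated here so the snippet is self-contained:

```python
def pig_latin(string):
    # Vowel-initial words get "way"; otherwise the leading
    # consonants move to the end, followed by "ay".
    vowels = {'a', 'e', 'i', 'o', 'u'}
    if string[0].lower() in vowels:
        return string + 'way'
    beginning_consonants = []
    for i in range(len(string)):
        if string[i].lower() in vowels:
            break
        beginning_consonants.append(string[i])
    return string[i:] + ''.join(beginning_consonants) + 'ay'

print(pig_latin('latin'))   # atinlay
print(pig_latin('apple'))   # appleway

# One record in the prompt/completion shape used for training
record = {
    'prompt': 'Turn the following word into Pig Latin: latin \n\n###\n\n',
    'completion': '%s [DONE]' % pig_latin('latin'),
}
print(record['completion'])  # atinlay [DONE]
```

Note that the tokens are lowercased before training, so the record above uses "latin" rather than "Latin"; keeping the inference prompts lowercase as well avoids surprising the model later.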
These manipulations are necessary due to quirks in GPT's design and are suggested in the OpenAI documentation.

The data file needs to be in JSONL format, which is simply a set of JSON objects delimited by new lines. Luckily, Pandas has a very simple shortcut for turning data frames into JSONL files, so we will simply rely on that today:

```python
import pandas as pd

pd.DataFrame(pig_latin_traindata).to_json('pig_latin.jsonl', orient='records', lines=True)
```

Now that we have our training data saved as a JSONL file, we can begin training. Simply go to your terminal and run:

```shell
export OPENAI_API_KEY=[OPENAI_API_KEY]
openai api fine_tunes.create -t pig_latin.jsonl -m davinci --suffix pig_latin
```

Once the request is created, you simply have to check back later with the "fine_tunes.follow" command. The console output should give you the exact command for your particular training request, and you can run that from time to time to see if the training is done. The fine-tuning is done when you see something like this:

```
>> openai api fine_tunes.follow -i [finetune_id]
[2023-08-05 21:14:22] Created fine-tune: [finetune_id]
[2023-08-05 23:17:28] Fine-tune costs [cost]
[2023-08-05 23:17:28] Fine-tune enqueued. Queue number: 0
[2023-08-05 23:17:30] Fine-tune started
[2023-08-05 23:22:16] Completed epoch 1/4
[2023-08-05 23:24:09] Completed epoch 2/4
[2023-08-05 23:26:02] Completed epoch 3/4
[2023-08-05 23:27:55] Completed epoch 4/4
[2023-08-05 23:28:34] Uploaded model: [finetune_model_name]
[2023-08-05 23:28:35] Uploaded result file: [result_file_name]
[2023-08-05 23:28:36] Fine-tune succeeded
```

Testing

Grab the model name from the output, and then you can simply test your model in Python like so:

```python
import requests

res = requests.post('https://api.openai.com/v1/completions',
    headers={
        'Content-Type': 'application/json',
        'Authorization': 'Bearer [OPENAI_API_KEY]'
    },
    json={
        'prompt': 'Turn the following word into Pig Latin: latin \n\n###\n\n',
        'max_tokens': 500,
        'model': model_name,
        'stop': '[DONE]'
    })
print(res.json()['choices'][0]['text'])
```

Note that the test prompt uses the same lowercase wording and "\n\n###\n\n" separator as the training data, so the model recognizes where its answer should begin. And you should see the output:

atinlay

And with that, we have trained a Pig Latin LLM and have familiarized ourselves with the API! Of course, this is a criminal underutilization of GPT-3's capabilities, so in the next section we will build something much more substantial.

Building a Domain Expert Model

Now that we are familiar with the fine-tuning API, let's expand our imagination and think about what kinds of products we can build with fine-tuning. The possibilities are close to endless, but in my opinion, one of the most exciting applications of fine-tuning is the creation of a domain-expert LLM. This LLM would be trained on a large body of proprietary or private information and would be able to answer questions about the text and make inferences based on the training data. Because this is a public tutorial, we will not be able to use any proprietary training data. Instead, we will use a body of text that is publicly available but not included in the training data for the base Davinci model. Specifically, we will teach the model the content of the Wikipedia synopsis of the Handel opera Agrippina.
This article is not present in the base model of Davinci, which is the best OpenAI GPT-3 model commercially available for fine-tuning.

Verifying the Base Model

Let's first verify that the base model has no idea about the opera Agrippina. We can ask a basic question:

```python
import requests

prompt = "Answer the following question about the Opera Agrippina: \n Who does Agrippina plot to secure the throne for? \n ### \n"

res = requests.post('https://api.openai.com/v1/completions',
    headers={
        'Content-Type': 'application/json',
        'Authorization': 'Bearer [OpenAI Key]'
    },
    json={
        'prompt': prompt,
        'max_tokens': 500,
        'model': 'davinci',
    })
```

Print the result JSON, and you should see something like this:

```
{'id': 'cmpl-7kfyjMTDcxdYA3GjTwy3Xl6KNzoMz', 'object': 'text_completion', 'created': 1691358809, 'model': 'davinci', 'choices': [{'text': '\nUgo Marani in his groundbreaking 1988 monograph "La regina del mare: Agrippina minore e la storiografia" () criticized the usual view as myth,[15] stating that Agrippina and Nero both were under the illusion…', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 30, 'completion_tokens': 500, 'total_tokens': 530}}
```

The passage refers to Nero and Agrippina but appears to pertain to the historical figures rather than the opera. Additionally, the model seems to cite imaginary sources, which suggests the base model's training data likely did not have very detailed information about Agrippina and Nero.

Now that we know the base Davinci model is unaware of the opera, let's try to teach the content of the opera to our own Davinci model!

Obtaining and Cleaning the Training Data

We begin by downloading the text of the article from the Wikipedia API. Wikipedia has a well-tested and well-supported API that provides the wiki text in JSON format.
We call the API like so:

```python
import requests

res = requests.get('https://en.wikipedia.org/w/api.php', params={
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "titles": "Agrippina_(opera)",
    "formatversion": "2",
    "rvprop": "content",
    "rvslots": "*"
})
rs_js = res.json()
print(rs_js['query']['pages'][0]['revisions'][0]['slots']['main']['content'])
```

Now that we have the latest text data, let's do some text manipulation to remove the wiki tags:

```python
import re

def remove_tags(string, tag):
    # Drop everything between an opening <tag ...> and its closing </tag>
    toks = string.split(f'<{tag}')
    new_toks = []
    for tok in toks:
        new_toks.append(tok.split(f'</{tag}>')[-1])
    return ''.join(new_toks)

processed = re.sub(r'\[\[File:[^\n]+', '', rs_js['query']['pages'][0]['revisions'][0]['slots']['main']['content'])
processed = re.sub(r'\[\[([^|\]]+)\|([^\]]+)\]\]', r'\2', processed)
processed = remove_tags(processed, 'ref')
processed = remove_tags(processed, 'blockquote')
processed = processed.replace('[[', '').replace(']]', '')
processed = re.sub(r'\{\{[^\}]+\}\}', r'', processed)
processed = processed.split('== References ==')[0]
processed = re.sub(r'\'{2}', '', processed)
print(processed)
```

This doesn't remove all of the tags and non-natural text elements, but it should remove enough that the result is readable as natural text.

Next, we want to convert the text into a hierarchical representation based on the headers:

```python
from collections import defaultdict

hierarchy_1 = 'Introduction'
hierarchy_2 = 'Main'
hierarchical_data = defaultdict(lambda: defaultdict(list))
for paragraph in processed.split('\n'):
    if paragraph == '':
        continue
    if paragraph.startswith('==='):
        hierarchy_2 = paragraph.split('===')[1]
    elif paragraph.startswith('=='):
        hierarchy_1 = paragraph.split('==')[1]
        hierarchy_2 = 'Main'
    else:
        print(hierarchy_1, hierarchy_2)
        hierarchical_data[hierarchy_1][hierarchy_2].append(paragraph)
```

Constructing the Training Data

Now that we have our passage, we need to turn the passage into training data.
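To see what this header-based parsing produces, here is the same loop run on a tiny, made-up wiki-markup snippet (with the debug print dropped and a `.strip()` added to tidy the header names):

```python
from collections import defaultdict

# A tiny wiki-style sample to illustrate the header-based parsing
sample = """Opening paragraph.
== Synopsis ==
=== Act 1 ===
Agrippina plots.
=== Act 2 ===
Poppaea schemes."""

hierarchy_1 = 'Introduction'
hierarchy_2 = 'Main'
hierarchical_data = defaultdict(lambda: defaultdict(list))
for paragraph in sample.split('\n'):
    if paragraph == '':
        continue
    if paragraph.startswith('==='):
        # "=== Act 1 ===" opens a new subsection
        hierarchy_2 = paragraph.split('===')[1].strip()
    elif paragraph.startswith('=='):
        # "== Synopsis ==" opens a new top-level section
        hierarchy_1 = paragraph.split('==')[1].strip()
        hierarchy_2 = 'Main'
    else:
        hierarchical_data[hierarchy_1][hierarchy_2].append(paragraph)

print({k: dict(v) for k, v in hierarchical_data.items()})
```

Running this yields a two-level mapping: the opening paragraph lands under Introduction/Main, while each act's text lands under Synopsis and its act name, which is exactly the structure we will feed to the question generator below.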
While we can always read the passages and manually write the training data, for large bodies of text, this can quickly become prohibitively time-consuming. In order to have a scalable solution, we will want a more automated way to generate the training data.

An interesting way to generate appropriate training data from the passage is to supply sections of the passage to ChatGPT and ask it to generate the prompts and completions using prompt engineering. This may sound like circular training - if ChatGPT can analyze the passage, why not just use it directly? The answer, of course, is scalability. Using this method, we can break up large bodies of text and generate training data piecemeal, allowing us to process bodies of text that go beyond what can be given to ChatGPT as input. In our model, for example, we will break up the synopsis into Act 1, Act 2, and Act 3. Then, by modifying the training data to provide additional context, we can help the model draw connections between the passages. With this method, we can scalably create training data from large input data, which will be the key to building domain-expert models that can solve problems in math, science, or finance.

We begin by generating two sets of prompts and completions for each act, one with lots of detail and one with simple questions and answers. We do this so the model can answer both simple, factual questions as well as long, complex questions. To do so, we create two functions with slight differences in the prompt:

```python
import openai

def generate_questions(h1, h2, passage):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": '''
Consider the following passage from the wikipedia article on Agrippina, %s, %s:
---
%s
---
Generate 20 prompts and completions pairs that would teach a davinci GPT3 model the content of this passage.
Prompts should be complete questions.
Completions should contain plenty of context so davinci can understand the flow of events, character motivations, and relationships.
Prompts and completions should be long and detailed.
Reply in JSONL format
''' % (h1, h2, passage)},
        ]
    )
    return completion

def generate_questions_basic(h1, h2, passage):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": '''
Consider the following passage from the wikipedia article on Agrippina, %s, %s:
---
%s
---
Generate 20 prompts and completions pairs that would teach a davinci GPT3 model the content of this passage.
Reply in JSONL format
''' % (h1, h2, passage)},
        ]
    )
    return completion
```

Then we call the functions and collect the results into a data container:

```python
from collections import defaultdict

questions = defaultdict(lambda: defaultdict(list))
for h_1, h1_data in hierarchical_data.items():
    if h_1 != 'Synopsis':
        continue
    for h_2, h2_data in h1_data.items():
        print('==========', h_1, h_2, '===========')
        passage = '\n\n'.join(h2_data)
        prompts_completion = generate_questions(h_1, h_2, passage)
        prompts_completion_basic = generate_questions_basic(h_1, h_2, passage)
        questions[h_1][h_2] = {
            'passage': passage,
            'prompts_completion': prompts_completion,
            'prompts_completion_basic': prompts_completion_basic
        }
```

Then, we can convert the generated questions from JSON into objects. We will need to add an error handling block because sometimes ChatGPT will generate outputs that are not JSON decodable.
In this case, we will just flag and print the offending record to the console:

```python
import json
import pandas as pd

all_questions = []
for h1, h1_data in questions.items():
    for h2, h2_data in h1_data.items():
        for key in ['prompts_completion', 'prompts_completion_basic']:
            for ob in h2_data[key].choices[0]['message']['content'].split('\n'):
                try:
                    js = json.loads(ob)
                    js['h1'] = h1
                    js['h2'] = h2
                    all_questions.append(js)
                except Exception:
                    print(ob)

df = pd.DataFrame(all_questions)
```

Because ChatGPT is not deterministic (that is, each time you query ChatGPT, you may get a different output even if your input is the same), your experience may vary from mine, but in my case, the questions were all parsed without issue.

Now we have our training data in a data frame. We're almost there! Let's add a couple of finishing touches to the training data, including basic context, end markers for the prompts, and stop sequences for the completions:

```python
df['prompt'] = df.apply(
    lambda row: 'Answer the following question about the Opera Agrippina, Section %s, subsection %s: \n %s \n ### \n' % (
        row['h1'], row['h2'], row['prompt']
    ), axis=1)
df['completion'] = df['completion'].map(lambda x: f'{x} [DONE]')
```

Inspect the training data, and you should see a variety of training questions and answers. You may see short prompt-completion pairs such as:

Answer the following question about the Opera Agrippina, Section Synopsis, subsection Act 2: What happens as Claudius enters? ### All combine in a triumphal chorus. [DONE]

As well as long prompt-completion pairs like:

Answer the following question about the Opera Agrippina, Section Synopsis, subsection Act 3: Describe the sequence of events when Nero arrives at Poppaea's place. ### When Nero arrives, Poppaea tricks him into hiding in her bedroom. She then summons Claudius, informing him that he had misunderstood her earlier rejection. Poppaea convinces Claudius to pretend to leave, and once he does, she calls Nero out of hiding. Nero, thinking Claudius has left, resumes his passionate wooing of Poppaea. However, Claudius suddenly reappears and dismisses Nero in anger. [DONE]

Now we're finally ready to commence training! Write the dataframe to file:

```python
with open('agrippina_training.jsonl', 'w') as fp_agrippina:
    fp_agrippina.write(df[['prompt', 'completion']].to_json(orient='records', lines=True))
```

And call the fine-tuning API like so:

```shell
openai api fine_tunes.create -t agrippina_training.jsonl -m davinci --suffix agrippina
```

The code used to generate the training data can be found here.

Testing

When the training is done, we can test the model by grabbing the model name and querying it as we did in the earlier section:

```python
import requests

model = '[Your Model Name]'

def query_prompt(prompt):
    res = requests.post('https://api.openai.com/v1/completions',
        headers={
            'Content-Type': 'application/json',
            'Authorization': 'Bearer [OPENAI_API_KEY]'
        },
        json={
            'prompt': prompt,
            'max_tokens': 500,
            'model': model,
            'stop': '[DONE]'
        })
    print('output')
    print(res.json()['choices'][0]['text'])
```

Let's begin by querying a prompt that's in the training dataset:

```python
query_prompt("Answer the following question about the Opera Agrippina: \n Who does Agrippina plot to secure the throne for? \n ### \n")
```

You should see a result similar to the one below:

Agrippina plots to secure the throne for Nero, her son by a former marriage.

And now let's see if we can ask a novelly-worded question:

```python
query_prompt("Answer the following question about the Opera Agrippina: \n What does Claudius promise Otho as a reward for saving his life? \n ### \n")
```

The model should give an answer like:

The throne

So it seems the model is able to recall simple facts in the training data. How will it perform with a novelly-worded question that requires a more detailed answer?
Let's see:

```python
query_prompt("Answer the following question about the Opera Agrippina: \n What is Agrippina's state of mind after Pallas and Narcissus turn against her? \n ### \n")
```

The model should give an answer like:

After Pallas and Narcissus turn against her, Agrippina finds herself in a precarious situation. She realizes that she no longer has a strategic advantage over Claudius, as her trusted allies have betrayed her and her secret plan has been uncovered. Agrippina's state of mind is one of anxiety and despair, as her carefully crafted manipulation has unraveled and exposed her for the manipulative and deceitful individual that she is. She must now carefully reconsider her next move, weighing the potential repercussions of each option carefully. Agrippina is rattled by the betrayal of Pallas and Narcissus and faces an uncertain future.

The model is definitely making some inferences here and is not saying exactly what's in the synopsis, but the answer is largely cogent.

As a final test, let's see if it is able to answer a complex question that requires it to draw connections between the various passages:

```python
query_prompt("Answer the following question about the Opera Agrippina: \n How does Agrippina's fortune change through the opera? \n ### \n")
```

You should see an answer like:

Agrippina's fortune changes drastically throughout the course of the opera. Initially, she is adamant about securing the throne for Nero as a means to protect her ambitions for power. However, after Poppaea's intervention, Agrippina is not certain about her control over the situation. She is worried about losing Nero's favor to Poppaea and fears that her plotting has been unraveled by Otho. Agrippina is aware that her authority is slowly slipping away and that her control over the situation is diminishing as time progresses. The seeds of doubt have been planted in her mind, and she strives to maintain her grip over Nero while trying to anticipate the next move of her rivals. Agrippina's fortune fluctuates as she attempts to maintain her influence over Nero and the empire, facing challenges from multiple fronts.

This is very promising! Based on this answer, we can see the model has digested quite a bit of information and is able to draw connections between the various pieces of information we gave it during training.

Of course, the model is by no means perfect. It is still prone to hallucinations and confused answers when the prompt is highly complex, and querying the model repeatedly with the same prompt can sometimes yield dramatically different results. However, remember that we used a relatively small body of training data and relied solely on ChatGPT to generate the prompts and completions. If we preprocessed the input data, crafted more detailed training data, and generated more sample prompts and completions, we would likely be able to improve the model's performance further.

If you want to explore this topic more, please feel free to get in touch or play with the API on your own. All of the code I used in this article can be found on my GitHub, and you can get in touch with me through my GitHub page as well.

Summary

Today, we explored OpenAI's fine-tuning API and saw how we can use fine-tuning techniques to give a GPT model new knowledge. Even though we used publicly available text data for our experiment, the same techniques can be adapted to proprietary datasets. There is almost unlimited potential for what fine-tuning can do with the right training data, and I hope this article inspired you to think about how you can use fine-tuning in your business or application. If you want to discuss potential applications for LLM technologies, feel free to drop me a line through my GitHub page!