From current events to proprietary data: How to train GPT-3 for your business needs

Machine Learning Architect

How do you get real business value out of ChatGPT?

With all the hype there’s been around OpenAI’s groundbreaking technology, that might sound like an odd question—isn’t a powerful, easy-to-use language model obviously going to generate value? The fact that it’s reached 100 million users faster than any digital application in history certainly speaks to its widespread appeal. But as we tinkered with it, some critical limitations became clear.

One of our defining principles at Smart Design is a focus on user experience and user value, regardless of the technology that enables it. This is why a powerful new technology isn’t enough on its own. To deliver value, it has to address a human need or drive an improved experience. 

ChatGPT, for all its promise and appeal, is still a very generic tool. The GPT-3 platform it’s built on is trained on a huge but unfocused data set. For the average business or user, this makes it useful only up to a point—like a personal assistant who’s world-class at looking things up, but knows nothing about you, the specific challenges you face, or even the recent events affecting you.

Any language model is trainable though, and a chat-based assistant that actually knows your business, your industry, your company, or you personally would be game-changing in a way that ChatGPT currently isn’t. That suggests an opportunity: to customize GPT-3 as a platform, using OpenAI’s APIs. 

What would that involve, we wondered? How much training data would it take? How much would it cost and how hard would it be? 

So we did what any good developer would do, and started running experiments. An obvious place to start is current events since the latest version (GPT 3.5) is only trained on data up through June 2021. Our hypothesis was that we could train GPT-3 on RSS feeds of major news sites, then have it answer questions about recent events.

OpenAI offers a service called fine-tuning, which allows you to customize a model by feeding it prompts and responses that exemplify what you want it to learn. This was our first approach for the experiment.

Explore our experiments with GPT on GitHub

Fine-tuning with OpenAI’s GPT model

The first thing we did was install OpenAI’s Python package. We then chose to train the model on a topic that required recent information: the train derailment in Ohio in 2023.

pip install --upgrade openai

We also had to set up our OpenAI key, obtained from OpenAI.com. OpenAI offers $18 in free credit, which is far more than what’s needed to run this notebook.

import os
import openai

os.environ['OPENAI_API_KEY'] = "Add OpenAI key here"
openai.api_key = os.environ['OPENAI_API_KEY']

As a baseline, we queried DaVinci (text-davinci-003), which is OpenAI’s state-of-the-art GPT 3.5 model, to see if it knew where the train derailed.

prompt = "Where did the train carrying hazardous materials derail?"
result = openai.Completion.create(
  model="text-davinci-003",
  prompt=prompt
)
print(result["choices"][0]["text"])
The exact location of the train derailment is not available, as different

The result is incorrect because the event occurred after GPT 3.5 was trained.
 

Our first training attempt was to fine-tune the model by adding specific data about the train derailment in early 2023. This required preparing the data, saving it to a file, and uploading it to OpenAI, as follows:

# from https://en.wikipedia.org/wiki/2023_Ohio_train_derailment
import json

examples = [
    {"prompt": "2023 Ohio train derailment", "completion": "The 2023 Ohio train derailment (also called the East Palestine train derailment) occurred on February 3, 2023, at 8:55 p.m. EST (UTC−5), when a Norfolk Southern freight train carrying hazardous materials derailed in East Palestine, Ohio, United States.[1] The freight train burned for more than two days, and then emergency crews conducted a controlled burn of several railcars at the request of state officials,[2] which released hydrogen chloride and phosgene into the air.[1] As a result, residents within a 1-mile (1.6-kilometer) radius were evacuated, and an emergency response was initiated from agencies in Ohio, Pennsylvania, and West Virginia. The U.S. federal government sent Environmental Protection Agency (EPA) administrator Michael S. Regan to provide assistance on February 16, 2023."}
]

f = open("trainingdata.jsonl", "w")
for example in examples:
    f.write(json.dumps(example) + "\n")
f.close()

file = openai.File.create(file=open("trainingdata.jsonl"), purpose='fine-tune')

From here, we instructed OpenAI to begin fine-tuning a model using DaVinci as a base model, but including the additional information about the 2023 train derailment in Ohio.

fine_tune = openai.FineTune.create(training_file=file['id'], model="davinci")

Using the “follow” console command, we tracked the fine-tuning’s progress, which took about 30 minutes (note that if the command fails, you can run it again to continue polling the progress).

openai api fine_tunes.follow -i {fine_tune['id']}

With this step complete, we copied the resulting model name into our code and ran our previous prompt to see if it did any better.

result = openai.Completion.create(
  model="davinci:ft-personal-2023-02-16-20-32-47",
  prompt=prompt
)
print(result["choices"][0]["text"])
Officials say the train derailed in Nantes Dorian, just west of

No real improvement here! Perhaps more data is needed.

Fine-tuning using more data from RSS feeds

For our second experiment, we decided to fine-tune the model using recent news, then ask it about a current event. We began by installing an RSS parser, then downloaded all of the recent news from several major outlets via their RSS feeds and used that data to fine-tune the model.

pip install rss-parser

from rss_parser import Parser
from requests import get

prompts = []
rss_urls = [
"https://rss.nytimes.com/services/xml/rss/nyt/US.xml",
"https://rss.nytimes.com/services/xml/rss/nyt/World.xml",
"http://feeds.bbci.co.uk/news/rss.xml?edition=us",
"http://rss.cnn.com/rss/cnn_world.rss",
"http://rss.cnn.com/rss/cnn_us.rss",
"https://feeds.washingtonpost.com/rss/world?itid=lk_inline_manual_36",
"https://feeds.washingtonpost.com/rss/national?itid=lk_inline_manual_32",
"https://feeds.a.dj.com/rss/RSSWorldNews.xml",
"https://feeds.a.dj.com/rss/WSJcomUSBusiness.xml",
"https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en"
]
for url in rss_urls:
  xml = get(url)
  parser = Parser(xml=xml.content)
  feed = parser.parse()
  for item in feed.feed:
    prompts.append({"prompt": item.title, "completion": item.description})
f = open("rss-trainingdata.jsonl", "w")
for prompt in prompts:
  f.write(json.dumps(prompt) + "\n")
f.close()

This time we used a tool that OpenAI provides to clean the training data.

openai tools fine_tunes.prepare_data -f rss-trainingdata.jsonl -q

This allowed us to then train a newly fine-tuned model on this much larger data set.

file = openai.File.create(file=open("rss-trainingdata_prepared.jsonl"), purpose='fine-tune')
fine_tune = openai.FineTune.create(training_file=file['id'], model="davinci")
openai api fine_tunes.follow -i {fine_tune['id']}

With that complete, we could compare before (non-fine-tuned) and after (fine-tuned) models on a question about the day’s news. Once again we asked about the recent train derailment.

prompt = "Where did the train carrying hazardous materials derail?"
result = openai.Completion.create(
    model="davinci",
    prompt=prompt + '\n\n###\n\n'
)
print("Before (non-finetuned) result: " + result['choices'][0]['text'])
result = openai.Completion.create(
    model="davinci:ft-personal-2023-02-16-21-29-25",
    prompt=prompt + '\n\n###\n\n'
)
print("After (finetuned) result: " + result['choices'][0]['text'])
Before (non-finetuned) result:
Additional Information:
Sound Transit’s emergency closure of
After (finetuned) result:
Backgrounder
In the early hours of February 10, 2019

The results, unfortunately, were still gibberish. 
 
After some additional digging, the issue turned out to be that fine-tuning builds on the base davinci model, a much older version of DaVinci that lacks the instruction-following training that text-davinci-003 (and ChatGPT) include. This means that fine-tuning is not a good method for solving instruction-based problems; it’s better suited to problems like classification and autocompletion.
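For contrast, here’s a minimal sketch of the kind of task fine-tuning is suited to: a classification set where every completion is a short label rather than an instruction-following answer. The review snippets and sentiment labels below are invented purely for illustration, but they use the same JSONL prompt/completion format we used above.

```python
import json

# Hypothetical sentiment-classification examples: fine-tuning learns to map
# each prompt to a short label, which plays to the base model's strengths.
examples = [
    {"prompt": "The service was fast and friendly ->", "completion": " positive"},
    {"prompt": "My order arrived broken and late ->", "completion": " negative"},
]

# Write one prompt/completion pair per line, as OpenAI's fine-tuning expects.
with open("classification-trainingdata.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

This file could then be uploaded and trained exactly as we did earlier; the difference is that the model only has to learn a narrow mapping, not how to follow instructions.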
 
To make this work, we need a different approach.

Getting customized results without fine-tuning

For this next experiment, we tried something that seems counterintuitive: we posed a question to GPT-3 and provided the answer to the question as a pre-condition.

prompt = "Given that The 2023 Ohio train derailment (also called the East Palestine train derailment) occurred on February 3, 2023, at 8:55 p.m. EST (UTC−5), when a Norfolk Southern freight train carrying hazardous materials derailed in East Palestine, Ohio, United States. Where did the train carrying hazardous materials derail?"

result = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt + '\n\n###\n\n'
)

print(result['choices'][0]['text'])
The train carrying hazardous materials derailed in East Palestine, Ohio, United States.

This approach is not a practical long-term solution to our problem, but it did reveal a valuable insight. The fact that it worked suggests that if we could search for content that answers the question being asked, and pre-populate the prompt with that content, we could then use GPT-3’s instruction-following features to work with the new information. Fortunately, there are some great tools for building these kinds of pipelines, like langchain.

pip install langchain

Next, we downloaded the same RSS feeds as before, but instead we prefilled our prompt with this data before asking the question about current events.

from langchain.docstore.document import Document

documents = []
for url in rss_urls:
  xml = get(url)
  parser = Parser(xml=xml.content)
  feed = parser.parse()
  for item in feed.feed:
    documents.append(Document(
      page_content=item.title + '. ' + item.description
    ))
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

prompt = "Where did the train carrying hazardous materials derail?"

chain = load_qa_chain(OpenAI(temperature=0))
chain({"input_documents":documents, "question":prompt}, return_only_outputs=True)["output_text"]
InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 17073 tokens (16817 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.

We see here that GPT-3 doesn’t support prompts this large. We did, after all, dump the entire contents of the previous day’s news from several major news publications into the prompt—about 17,000 tokens (or roughly 13,000 words) worth. Given the model’s (reasonable) limit of 4097 tokens, we may be able to make this approach work, but first, we need a more refined method of populating the data.
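One crude stopgap would be to trim the documents to fit the context window before building the prompt. The sketch below greedily keeps documents until an approximate token budget is reached, using the rough rule of thumb of about four characters per token (a real implementation would count tokens with an actual tokenizer, and the budget value here is just an illustration):

```python
def fit_to_budget(texts, max_tokens=3000):
    """Greedily keep documents until an approximate token budget is reached.

    Assumes ~4 characters per token as a rough heuristic; not a real tokenizer.
    """
    kept, used = [], 0
    for text in texts:
        estimated_tokens = len(text) // 4 + 1
        if used + estimated_tokens > max_tokens:
            break  # adding this document would exceed the budget
        kept.append(text)
        used += estimated_tokens
    return kept

# Three ~500-character "articles"; only the first two fit a 300-token budget.
headlines = ["story one " * 50, "story two " * 50, "story three " * 50]
print(len(fit_to_budget(headlines, max_tokens=300)))  # -> 2
```

The obvious flaw is that this throws away content blindly, with no regard for whether the discarded documents contained the answer, which is why a more refined method of populating the data is needed.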

Using text embeddings and vector similarity searches to pre-populate a prompt

Instead of dumping all the news from a recent period into the prompt, there should be a way of searching within the recent news data, and only populating the prompt with content that’s appropriate to our query.
 
A typical full-text search index probably won’t work here, because the exact words of our question are unlikely to appear in the content: news content mostly consists of statements, while our prompt is a question. If we instead populate the prompt with text whose meaning is similar to the question, that text is likely to contain the answer.
 
Instead, we used some cutting-edge technology that can determine text that’s similar to other text. OpenAI recently released a Text Embeddings API which can convert words into a vector that can be compared to other vectors, allowing us to search for similar meanings, not just similar words. For example, the statement “people work” produces a vector that is similar to the vector equivalent of “humans do jobs”, even though none of the words match.
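Under the hood, “similar meaning” is measured by comparing these vectors, most commonly with cosine similarity. Here’s a toy sketch using made-up three-dimensional vectors (real OpenAI embeddings have 1,536 dimensions, and would come from the Embeddings API rather than being written by hand):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of the example phrases.
people_work = [0.9, 0.1, 0.2]       # hypothetical embedding of "people work"
humans_do_jobs = [0.8, 0.2, 0.25]   # hypothetical embedding of "humans do jobs"
stock_prices = [0.1, 0.9, 0.6]      # hypothetical embedding of an unrelated phrase

# The semantically related pair scores higher, even with no words in common.
print(cosine_similarity(people_work, humans_do_jobs) >
      cosine_similarity(people_work, stock_prices))  # -> True
```

A vector store like FAISS essentially runs this comparison at scale, returning the stored texts whose embeddings sit closest to the query’s embedding.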
 
FAISS is a vector store we can use to compare text embeddings, and langchain supports it as a way to build prompts for OpenAI. This allowed us to build a search index and populate our prompt with only the information that’s similar to our query. One note: even though the processing needs fell within the $18 credit provided by OpenAI, we added a payment method to our account to increase the rate at which we could make API requests.
 
pip install faiss-cpu

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.faiss import FAISS

search_index = FAISS.from_documents(documents, OpenAIEmbeddings())

prompt = "Where did the train carrying hazardous materials derail?"

chain = load_qa_chain(OpenAI(temperature=0))
chain({"input_documents":search_index.similarity_search(prompt, k=4), "question":prompt}, return_only_outputs=True)["output_text"]
' East Palestine, Ohio.'

It worked! By pairing the similarity search with GPT-3, we’re now able to get answers about news in the RSS feeds.

Limitations, possibilities, caveats, and final thoughts

The series of experiments described here opens up some tantalizing possibilities, but also exposes some solid constraints and tricky challenges. 
 
By using some cutting-edge Natural Language Processing tools, we were able to effectively customize GPT-3 to give us answers about something it wouldn’t ordinarily be capable of handling. In this case, it was current events, but the same approach could work for proprietary data, industry-specific guidelines and regulations, recent news and metrics from your company, category, or competitors, or even personal data (for example, health-related or “quantified self” data). The potential applications here are nearly endless since what we’re really talking about is quickly creating a chat-accessible personal assistant capable of providing answers or developing insights on topics that are specifically relevant to your personal or business situation.
 
But it also has several limitations. One issue is authority. FAISS only cares about how similar text is when populating prompts. If we ask the model “What color is the dog?” and our text database contains both “The dog is black” and “The dog is white,” the vector index has no way of knowing which to present to the model. This is where a vector-based approach would benefit from the notion of authority that some of the more powerful word-based search indexes offer. A better approach probably lies in some combination of the two, but this is a longer-term problem that will take additional experiments to solve.
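One direction worth exploring is a blended score, where each document’s semantic similarity is weighted against some measure of its source’s authority. The sketch below is only an illustration: the weighting, the alpha value, and the authority scores are all made up, not the output of any real ranking system.

```python
def hybrid_score(vector_similarity, source_authority, alpha=0.7):
    """Blend semantic similarity with a source-authority weight.

    alpha and the authority values are illustrative guesses, not tuned numbers.
    """
    return alpha * vector_similarity + (1 - alpha) * source_authority

# Both statements are equally similar to "What color is the dog?";
# the authority weight is what breaks the tie.
vet_record = hybrid_score(0.92, source_authority=0.9)  # "The dog is black"
forum_post = hybrid_score(0.92, source_authority=0.2)  # "The dog is white"
print(vet_record > forum_post)  # -> True
```

In practice the authority signal might come from a conventional search index’s ranking features, which is exactly the kind of combination that would take further experimentation to get right.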
 
Another issue is that if many pieces of text are similar to the prompt, this will produce a very large prompt which increases processing time and cost.
 
All the same, we’re just now starting to reach a critical velocity with generative AI. Tools like langchain have the power to accelerate already powerful technologies like ChatGPT, which points toward even more interesting times to come. With GPT-4 likely to be released any time now, this year should prove to be a watershed for AI: not just as a technology, but as a tool for creating real solutions with real value for real people.

Learn more about Technology at Smart Design

About Carter Parks

Carter Parks is a systems architect who has a knack for applying new technologies to the right problems. He brings expertise in machine learning, full stack web and mobile development, and IoT and has worked with clients in sectors ranging from eCommerce to nutrition, finance, and SaaS. Notable clients include Gatorade. When he isn’t coding, you can find him in the outdoors, probably on a long trail run, or playing the piano.