From current events to proprietary data: How to train GPT-3 for your business needs
How do you get real business value out of ChatGPT?
Explore our experiments with GPT on GitHub
Fine-tuning with OpenAI’s GPT model
The first thing we did was install OpenAI’s Python package, then chose a topic that required recent information to test the model on: the train derailment in Ohio in 2023.
pip install --upgrade openai

import os
import openai

os.environ['OPENAI_API_KEY'] = "Add OpenAI key here"
openai.api_key = os.environ['OPENAI_API_KEY']

prompt = "Where did the train carrying hazardous materials derail?"
result = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt
)
print(result["choices"][0]["text"])

The exact location of the train derailment is not available, as different
Our first training attempt was to fine-tune the model by adding specific data about the train derailment in early 2023. This required preparing the data, saving it to a file, and uploading it to OpenAI, as follows:
# from https://en.wikipedia.org/wiki/2023_Ohio_train_derailment
examples = [
{"prompt": "2023 Ohio train derailment", "completion": "The 2023 Ohio train derailment (also called the East Palestine train derailment) occurred on February 3, 2023, at 8:55 p.m. EST (UTC−5), when a Norfolk Southern freight train carrying hazardous materials derailed in East Palestine, Ohio, United States.[1] The freight train burned for more than two days, and then emergency crews conducted a controlled burn of several railcars at the request of state officials,[2] which released hydrogen chloride and phosgene into the air.[1] As a result, residents within a 1-mile (1.6-kilometer) radius were evacuated, and an emergency response was initiated from agencies in Ohio, Pennsylvania, and West Virginia. The U.S. federal government sent Environmental Protection Agency (EPA) administrator Michael S. Regan to provide assistance on February 16, 2023."} ]
import json

with open("trainingdata.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

file = openai.File.create(file=open("trainingdata.jsonl"), purpose='fine-tune')
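Before uploading, it can help to sanity-check that every line of the JSONL file parses and carries the two keys the fine-tuning endpoint expects. The helper below is our own illustration (not part of the OpenAI library), run here against a throwaway file rather than the real training data:

```python
import json

def validate_jsonl(path):
    """Check that every line is valid JSON with 'prompt' and 'completion' keys."""
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            record = json.loads(line)  # raises ValueError on malformed JSON
            missing = {"prompt", "completion"} - record.keys()
            if missing:
                raise ValueError(f"line {n} is missing keys: {missing}")
    return True

# Illustration with a throwaway file (not the real training data)
with open("check.jsonl", "w") as f:
    f.write(json.dumps({"prompt": "2023 Ohio train derailment",
                        "completion": "East Palestine, Ohio."}) + "\n")

print(validate_jsonl("check.jsonl"))  # True
```

Catching a malformed record locally is cheaper than discovering it after an upload and a failed fine-tune job.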
From here, we instructed OpenAI to begin fine-tuning a model using DaVinci as a base model, but including the additional information about the 2023 train derailment in Ohio.
fine_tune = openai.FineTune.create(training_file=file['id'], model="davinci")
openai api fine_tunes.follow -i {fine_tune['id']}
result = openai.Completion.create(
    model="davinci:ft-personal-2023-02-16-20-32-47",
    prompt=prompt
)
print(result["choices"][0]["text"])

Officials say the train derailed in Nantes Dorian, just west of
Fine-tuning using more data from RSS feeds
pip install rss-parser
from rss_parser import Parser
from requests import get

rss_urls = [
"https://rss.nytimes.com/services/xml/rss/nyt/US.xml",
"https://rss.nytimes.com/services/xml/rss/nyt/World.xml",
"http://feeds.bbci.co.uk/news/rss.xml?edition=us",
"http://rss.cnn.com/rss/cnn_world.rss",
"http://rss.cnn.com/rss/cnn_us.rss",
"https://feeds.washingtonpost.com/rss/world?itid=lk_inline_manual_36",
"https://feeds.washingtonpost.com/rss/national?itid=lk_inline_manual_32",
"https://feeds.a.dj.com/rss/RSSWorldNews.xml",
"https://feeds.a.dj.com/rss/WSJcomUSBusiness.xml",
"https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en"
]

prompts = []
for url in rss_urls:
    xml = get(url)
    parser = Parser(xml=xml.content)
    feed = parser.parse()
    for item in feed.feed:
        prompts.append({"prompt": item.title, "completion": item.description})

with open("rss-trainingdata.jsonl", "w") as f:
    for prompt in prompts:
        f.write(json.dumps(prompt) + "\n")
openai tools fine_tunes.prepare_data -f rss-trainingdata.jsonl -q
file = openai.File.create(file=open("rss-trainingdata_prepared.jsonl"), purpose='fine-tune')
fine_tune = openai.FineTune.create(training_file=file['id'], model="davinci")

openai api fine_tunes.follow -i {fine_tune['id']}
prompt = "Where did the train carrying hazardous materials derail?"
result = openai.Completion.create(
    model="davinci",
    prompt=prompt + '\n\n###\n\n'
)
print("Before (non-finetuned) result: " + result['choices'][0]['text'])

result = openai.Completion.create(
    model="davinci:ft-personal-2023-02-16-21-29-25",
    prompt=prompt + '\n\n###\n\n'
)
print("After (finetuned) result: " + result['choices'][0]['text'])

Before (non-finetuned) result:
Additional Information:
Sound Transit’s emergency closure of
After (finetuned) result:
Backgrounder
In the early hours of February 10, 2019
Getting customized results without fine-tuning
prompt = "Given that The 2023 Ohio train derailment (also called the East Palestine train derailment) occurred on February 3, 2023, at 8:55 p.m. EST (UTC−5), when a Norfolk Southern freight train carrying hazardous materials derailed in East Palestine, Ohio, United States. Where did the train carrying hazardous materials derail?"
result = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt + '\n\n###\n\n'
)
print(result['choices'][0]['text'])
The train carrying hazardous materials derailed in East Palestine, Ohio, United States.
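This “stuff the facts into the prompt” pattern can be wrapped in a small helper. The function below is our own illustration (not part of the OpenAI library); it just reproduces the prompt shape used above:

```python
def build_context_prompt(context, question):
    """Prepend known background facts to a question, using the same
    separator convention as the completions above."""
    return "Given that " + context + " " + question + "\n\n###\n\n"

context = ("The 2023 Ohio train derailment occurred on February 3, 2023, "
           "in East Palestine, Ohio, United States.")
question = "Where did the train carrying hazardous materials derail?"

prompt = build_context_prompt(context, question)
print(prompt)
# The resulting string is what gets passed as `prompt` to openai.Completion.create
```

The obvious limitation is that you must already know which facts are relevant to the question, which is exactly the problem the next sections address.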
pip install langchain
from langchain.docstore.document import Document

documents = []
for url in rss_urls:
    xml = get(url)
    parser = Parser(xml=xml.content)
    feed = parser.parse()
    for item in feed.feed:
        documents.append(Document(
            page_content=item.title + '. ' + item.description
        ))

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
prompt = "Where did the train carrying hazardous materials derail?"
chain = load_qa_chain(OpenAI(temperature=0))
chain({"input_documents": documents, "question": prompt}, return_only_outputs=True)["output_text"]

InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 17073 tokens (16817 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
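The error arises because every RSS item was sent in a single prompt. One remedy is to batch documents under a budget before querying. Here is a rough sketch using word count as a crude stand-in for tokens (a real implementation would count tokens with the model’s tokenizer):

```python
def chunk_documents(texts, max_words=3000):
    """Greedily group texts into batches that stay under a word budget,
    so each batch fits comfortably inside the model's context window."""
    batches, current, count = [], [], 0
    for text in texts:
        words = len(text.split())
        if current and count + words > max_words:
            batches.append(current)
            current, count = [], 0
        current.append(text)
        count += words
    if current:
        batches.append(current)
    return batches

# Ten fake 1,000-word articles produce four batches of at most 3,000 words
texts = ["lorem ipsum " * 500 for _ in range(10)]
print(len(chunk_documents(texts)))  # 4
```

Batching alone still wastes model calls on irrelevant articles, though, which is why the next section filters documents by similarity instead.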
Using text embeddings and vector similarity searches to pre-populate a prompt
pip install faiss-cpu
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.faiss import FAISS
search_index = FAISS.from_documents(documents, OpenAIEmbeddings())
prompt = "Where did the train carrying hazardous materials derail?"
chain = load_qa_chain(OpenAI(temperature=0))
chain({"input_documents":search_index.similarity_search(prompt, k=4), "question":prompt}, return_only_outputs=True)["output_text"]
' East Palestine, Ohio.'
It worked! By pairing the similarity search with GPT-3, we can now answer questions about the news in the RSS feeds.
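Under the hood, the FAISS index is performing a nearest-neighbor search over embedding vectors. A toy version with NumPy shows the idea (the three-dimensional vectors are made up for illustration; real OpenAI embeddings have 1,536 dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k document vectors most similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)[:k]

doc_vecs = [np.array([1.0, 0.0, 0.0]),   # doc 0: about topic A
            np.array([0.0, 1.0, 0.0]),   # doc 1: about topic B
            np.array([0.9, 0.1, 0.0])]   # doc 2: also about topic A
query = np.array([1.0, 0.1, 0.0])        # a question about topic A

print(top_k(query, doc_vecs))  # docs 0 and 2 come back as nearest
```

Only those nearest documents are sent to the model, which is why the prompt now fits within the 4,097-token context limit that the earlier attempt exceeded.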
Limitations, possibilities, caveats, and final thoughts
Learn more about Technology at Smart Design
About Carter Parks
Carter Parks is a systems architect who has a knack for applying new technologies to the right problems. He brings expertise in machine learning, full stack web and mobile development, and IoT and has worked with clients in sectors ranging from eCommerce to nutrition, finance, and SaaS. Notable clients include Gatorade. When he isn’t coding, you can find him in the outdoors, probably on a long trail run, or playing the piano.
Resources
Langchain
Question answering
Dagster.io
Build a GitHub support bot with GPT3, LangChain, and Python
GitHub
OpenAI Cookbook