I Got Early Access to the ChatGPT API and Then Pushed It to Its Limits. Here’s What You Need to Know.
2023.03.03

"The ChatGPT API, released on March 1, reduced the call cost to 0.002 dollars/1000 tokens, 1/10th of the previous davinci. OpenAI says it has reduced the cost of ChatGPT by 90% since its release in December. This speed of cost reduction reminds us of a familiar concept: Moore's Law. Yes, Sam Altman tweeted on February 27: A new era of Moore's Law may be coming - the world's (artificial) intelligence doubles every 18 months.


I remember that when ChatGPT was first released, someone calculated that its cost per conversation was too high to compete with the cost of a Google search. But throughout history we have seen costs drop again and again in line with Wright's Law, which is why we can now drive a $30,000 Tesla, use an iPhone that costs less than $1,000, and get more computing power from four 4090 graphics cards than from the world's top supercomputer.

The future is faster than you think. Jump in.


-Dai Yusen, Managing Partner of ZhenFund



Today we would like to share an article with you. The author is Alistair Pullen @Buildt.ai; the original title is "I got early access to ChatGPT API and then pushed it to its limits. Here's what you need to know." (Translation: f.chen @Zhen Fund)


Preamble

We’ve been fortunate enough to have access to the ChatGPT alpha API through YC for the last two months. I was initially hesitant to add chat functionality to Buildt; I worried that it would invite lazy comparisons, or pigeonhole our product as simply ‘ChatGPT that lives in VSCode’, when in reality the tech we’ve been building, such as our semantic codebase search engine, is far more powerful and nuanced than that. Over time, however, I realised there was real value in linking the two technologies: essentially giving a chatbot, with all of its known strengths in writing and explaining code, the context of your entire codebase, which is something that is currently not possible elsewhere. Being able to say ‘Update my onboarding component to have a skip button taking you to the dashboard’ is an extremely powerful thing to be able to do in your codebase: the bot finds the context itself and can execute the change with immense speed and ease.

Quick Overview

You’ll likely have seen this elsewhere, or in the docs, but the ChatGPT API works slightly differently from the standard playground models. In the alpha, we had to wrap our messages in special tokens to get them to work – the <|im_start|> and <|im_end|> tokens were the most obvious. These tokens wrapped the messages between the user and the chatbot and delineated where messages start and end. Fortunately, in the publicly available API there is no need to use these tokens, as this is abstracted away for you, like so:

messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"}
]
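For context, here is roughly how that messages list gets sent to the model. This is a minimal sketch assuming the openai Python package and the gpt-3.5-turbo model name used by the public ChatGPT API; key handling and error handling are omitted:

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",   # the model behind the public ChatGPT API
    messages=messages,       # the role-tagged list shown above
)

# The reply comes back as another role-tagged assistant message
print(response["choices"][0]["message"]["content"])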

Conversation messages fall into three types: system, assistant and user. These indicate who is ‘speaking’ at any given moment in the conversation, or in other words whose ‘go’ it is. The system message is special and can only be sent once, at the beginning of a conversation; I talk about it in more depth later in this article. The API also works quite primitively when it comes to sending/receiving messages: the current context is essentially stored as a single long string, and it’s up to you to manipulate it in order to make the conversation work correctly. It’s also worth noting that each request includes all of the context that went before it, plus the new message, so the number of tokens you use per message increases with every message you send; there is currently no way to be charged only for the additional tokens each new message introduces.
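To make the ‘every request re-sends everything’ point concrete, here is a rough conversation loop, continuing from the snippet above. The send helper and its structure are my own illustration rather than anything prescribed by the API:

history = [{"role": "system", "content": "You are a helpful assistant."}]

def send(user_message):
    # Each request carries the full history, so token usage grows every turn
    history.append({"role": "user", "content": user_message})
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=history)
    reply = response["choices"][0]["message"]
    history.append({"role": reply["role"], "content": reply["content"]})
    return reply["content"]

send("Who won the world series in 2020?")
send("Where was it played?")  # this second call re-sends the first exchange too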

ChatGPT vs davinci-003

Up to this point, many of you will only have played with the davinci family of LLMs, and there are some subtle differences in prompting style between davinci and ChatGPT. For example, the practice of k-shot examples is less prevalent in the ChatGPT space for a couple of reasons (in my opinion): firstly, these examples can take up a good chunk of context in your query, and as a chatbot this query will be run a lot, so any k-shots you include in your chat prompt will be sent on every API request, which can become very wasteful very quickly. Prompts for ChatGPT also tend to include some bot personality information, for example ‘Your name is BuildtBot, an AI software engineer specialised in code search and understanding’; this can be done in davinci prompts too, but I’d argue it is less prevalent. ChatGPT prompts are by their nature much more ‘zero shot’, and you do often feel like you’re at the mercy of the model obeying your instructions, but sometimes it’s worth biting the bullet and providing one or two k-shot examples in the system prompt if there’s repeated undesirable behaviour. Often, davinci prompts are somewhat specialised ‘functions’ which perform a given task, whereas with a chatbot the input is often more general and unpredictable than what you may feed into a regular GPT-3 prompt. The net result is that under certain circumstances you’ll see unexpected behaviour from ChatGPT which doesn’t obey your system prompt, and simply instructing it not to do that often isn’t enough.

OpenAI also advocate that in many instances you replace your davinci-003 implementations with the chat API, as it is 10x cheaper – only $0.002/1k tokens! In these instances you can use the chat API like the more standard davinci-003 prompt format, where the system prompt is your prompt body, the user message is your input and the assistant message is the completion.
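As a rough sketch of that swap (the prompt text here is invented for illustration): the davinci-style prompt body becomes the system message, the user’s input becomes a user message, and the completion comes back as the assistant message.

# davinci-003 style: a single prompt string in, a completion out
# openai.Completion.create(model="text-davinci-003", prompt=prompt_body + "\n\n" + user_input)

# ChatGPT API equivalent of the same call
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Summarise the following text in one sentence."},
        {"role": "user", "content": "ChatGPT is a conversational model exposed via an API..."},
    ],
)
completion = response["choices"][0]["message"]["content"]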

The System Prompt

This is, from my experience so far, the most important thing to get right with a chatbot. As I’ve mentioned already, the system prompt is the bot’s ‘brief’, which defines its character, purpose and available actions. Different applications will require very different prompts, and some prompts will be far longer than others. For example, Buildt’s system prompt is >1k tokens in length, which is very long, but in our use case we need to very rigidly define what the bot can and can’t do because of all of the different actions that can be taken in a codebase. In many cases simply defining the tone, name and general abilities of the model is enough, but in cases where the bot can actually perform actions outside of its sandbox (what I’ve been calling subroutines) you’ll end up using many more tokens to codify this behaviour. A very basic example system prompt is as follows:

You are an intelligent AI assistant. Answer as concisely as possible.

As you can see it’s just text, and it never appears as one of the visible chat messages. Your system prompt is sent with every request, so the tokens you use here are a constant overhead.

One approach I’ve enjoyed using for writing an actual chatbot (we used this approach with BuildtBot) is writing the system prompt as if you’re briefing a salesperson before a call with a customer, with instructions like ‘If you see <X> kind of behaviour, then respond with <Y>’ and ‘Respond as helpfully and concisely as possible, whilst always ensuring you stay on topic’, then rounding off the prompt with ‘Ok, the user is about to ask you a question’. I can’t say for certain why, but this approach seems to give a small improvement in the bot’s ability to understand and act upon the user’s request. If your use case involves many different scenarios, where in some the bot can do something and in others it can’t, I’ve found it’s better to dispense with a prosaic request to the LLM about what to do and not do, and just give it a couple of k-shot examples (which can sometimes take up the same number of tokens as a long-winded explanation of capabilities).
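As a toy example of that ‘briefing’ style (the wording here is my own illustration, not Buildt’s actual prompt), the brief simply becomes the content of the system message at the front of the conversation:

system_prompt = """You are BuildtBot, an AI software engineer specialised in code search and understanding.
If the user asks you to find something in their codebase, respond with a search request.
If the user asks about anything unrelated to their codebase, politely steer them back on topic.
Respond as helpfully and concisely as possible, whilst always ensuring you stay on topic.
Ok, the user is about to ask you a question."""

history = [{"role": "system", "content": system_prompt}]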

My final point about the system prompt is that I’ve noticed its weight/influence on the output wanes the longer the context window becomes. It is potentially just me, but I’ve noticed the model’s likelihood of adhering to its instructions drops (particularly if they’re complex and logic based) the closer you get to the 4k-token context window limit. I haven’t found a great solution to this yet; periodic reminders of its purpose throughout the context window may help, but these count toward the total token limit, so they may do more harm than good depending on the use case.

Subroutines and memory editing

One of the core things that’ll set your chatbot apart from vanilla ChatGPT is giving it the ability to ‘do’ things. Performing actions outside of its sandbox is a very exciting and compelling prospect, but it comes with its own challenges and potential risks. You should always be extremely careful when passing the output of these LLMs, which is untrusted, as an input into another service. I’ve seen a number of people on Twitter playing around with executing code that ChatGPT writes, for example – this fills me with fear; please don’t do it unless you are correctly sanitizing the LLM output.

I have found a reasonably good method for allowing the chatbot to perform actions, insert those actions into the context window, and then continue. This is exactly what we do with Buildt search, whereby BuildtBot realises the user is looking for something, either explicitly, where they have overtly said ‘Find me <X>’, or implicitly, where the task the user has asked for in some way relies upon a search being done; for example ‘Update the login component to do <XYZ>’ implicitly relies upon finding the login component. I’ll elaborate on the ways that I establish this user intent in a moment.

Firstly, the principle of subroutines here is that the chatbot will interrupt itself with a stop sequence to signal that it thinks a subroutine needs to be run. The way we do this is by using special tags/tokens that we designate to represent these different sequences; I’ve found that using a similar format to ChatGPT’s proprietary tags works well. For example, we have <|search_query ...|> as a tag, and will be adding more for the Coach and Genie features we’ll be shipping in due course. These tags, and the criteria for when they should appear, should be defined clearly in your system prompt. The way these tags actually work in practice is as follows: we treat them similarly to React/JSX tags when it comes to syntax. For example, the real-world creation of a search subroutine within Buildt would look like this: <|search_query query={"Find the login component"}|></|search_query|>. Now, this may look odd, but there is method to the madness. The most important part is the closing tag </|search_query|>, as this tag is a stop sequence, so in reality it will never actually appear in the output. The API tells us its termination reason, so we know when it has stopped due to a stop sequence, and with a bit of parsing it’s then trivial to see the request it has created for our subroutine.

Next, once we’ve identified and parsed the request (I use a combination of sub-string operations and regular expressions), you can perform whatever subroutine you require; in our case we’ll pass the query off to our semantic codebase search engine. Once that subroutine is complete, we need to pass the results into the context window. This is memory insertion: making the chatbot believe that it came up with these results, as it has no way of knowing otherwise once you make the next request. Inserting the results is pretty trivial, it’s simply a case of appending them to the current context string. However, it is important to note that you must include the closing tag of your subroutine before you add your results, i.e. </|search_query|>\n[your results here]. This matters because the stop sequence means the closing tag will never naturally get written to the context window, and because these models operate on what they see: if you don’t include it here, then the next time someone wants to perform this operation later in the conversation the model may not include the stop sequence in its response, because it saw it omitted in this instance. Once you’ve inserted your results you can either end that turn, thereby inserting an <|im_end|> message, or you can just leave the conversation open-ended and let the bot continue outputting as if it had just come up with the results itself – this choice will depend on your use case.
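Here is a rough sketch of that flow using the <|search_query|> tag from above, continuing with the history list from the earlier snippets. The run_codebase_search helper is a hypothetical stand-in for whatever subroutine you actually run, and this adapts the raw-string approach described above to the public, message-based API:

import re
import openai

def run_codebase_search(query):
    # Hypothetical stand-in for the real subroutine (semantic code search, etc.)
    return "src/components/Login.tsx\nsrc/hooks/useAuth.ts"

SEARCH_CLOSE = "</|search_query|>"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=history,        # system prompt + conversation so far
    stop=[SEARCH_CLOSE],     # the closing tag acts as the stop sequence
)
choice = response["choices"][0]
content = choice["message"]["content"]

# finish_reason is "stop" both for a natural end and for a stop sequence,
# so check for the opening tag to tell the two apart
if choice["finish_reason"] == "stop" and "<|search_query" in content:
    match = re.search(r'<\|search_query query=\{"(.*?)"\}\|>', content)
    results = run_codebase_search(match.group(1))
    # Memory insertion: re-add the closing tag (it never appears in the output
    # because it was the stop sequence), then append the results so the bot
    # believes it produced them itself
    history.append({
        "role": "assistant",
        "content": content + SEARCH_CLOSE + "\n" + results,
    })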

At this point it should be fairly clear that there are a lot of string operations that need to be performed. I will go into more depth about this in the implementation tips section of this article, but I will say that it is worth a) having an enum of all of the possible tag types you intend to have, and b) having some kind of method that will strip out all of these tags from your strings before you present them to the user.
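For point a), a minimal enum using the tags that appear in this article might look like this:

from enum import Enum

class Tag(str, Enum):
    # Every subroutine/annotation tag the bot is allowed to emit
    SEARCH_QUERY = "search_query"
    USER_INTONATION = "user_intonation"
    DETERMINATION = "determination"
    LOADING = "loading"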

Finally, we come to identifying user intent. One of the most difficult problems we’ve found has been working out what the user actually wants to happen, and as a result which subroutines/actions we need to take. This is a hard problem because many of these things are implicit in users’ requests, and may be required as dependencies for other steps. The way I have found success with this problem (and by no means am I saying this is the only way of tackling it) is by adding some further tags which must be written by the chatbot before it writes any user-facing text; see the example below:

< system prompt...>

// User Message

Update the onboarding component to include a skip button // Implicit search for the onboarding component


// Assistant response

<|user_intonation analysis={"It looks like the user is trying to update the onboarding component in their codebase"}|/>

<|determination isCodebaseSpecific={true} requiresSearch={true}|>

Sure, first let me find the onboarding component ``<|loading|>``

This is the closest way I’ve found to getting functionality equivalent to “Let’s think step-by-step” in the ChatGPT prompting world. It forces the model to verbalise its reasoning and premeditate its actual answer before delivering it. A final tip I’d give on this front: to ensure that the model always produces these pre-message tokens, I start my assistant completions with <|user_intonation and let the model continue from there, so it doesn’t ignore or choose to omit that step.
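If your tags are shaped like the <|determination ...|> line above, pulling the flags out before you act on them is straightforward; this parser is my own sketch, not Buildt’s code:

import re

def parse_determination(content):
    # Pulls boolean flags out of a tag like
    # <|determination isCodebaseSpecific={true} requiresSearch={true}|>
    match = re.search(r"<\|determination ([^|]*)\|>", content)
    if not match:
        return {}
    flags = {}
    for key, value in re.findall(r"(\w+)=\{(true|false)\}", match.group(1)):
        flags[key] = value == "true"
    return flags

parse_determination('<|determination isCodebaseSpecific={true} requiresSearch={true}|>')
# -> {'isCodebaseSpecific': True, 'requiresSearch': True}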

Context Window Management

This is a tricky problem, particularly as the token limit for the ChatGPT API is a mere 4,000 tokens, which in reality isn’t long at all. You also have to manage the context yourself: if your request is over 4,000 tokens, then like any other OpenAI request you’ll get a 400 error back. There are rumours that there will be an 8k-token version in due course, but no word on how long that will take. In the meantime, we have to come up with ways to manage the context window so that the core of the user’s train of thought is preserved. This is particularly difficult if you have a chatbot which can perform subroutines, as, at least in our case, the results of those routines can be very lengthy and take up a good chunk of the context window. There are a couple of things to bear in mind: the system prompt must always be included in each request, and simply removing the earliest messages in the context window isn’t necessarily always the best solution. This is probably the largest ‘your mileage may vary’ factor in this process, as its implementation will depend heavily on how the chatbot will be used. If the bot is operating in a scenario where it’s unlikely the user will want to go back to where they started, then it should be fine to prune early messages, but that won’t always be the case.
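One simple pruning strategy is to keep the system prompt and drop the oldest exchanges until the history fits. The budget figure and the use of tiktoken for counting are my own assumptions, and the count below ignores the small per-message overhead:

import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by gpt-3.5-turbo

def prune_history(history, budget=3500):
    # Keep the system prompt, drop the oldest non-system messages until
    # the rough token count fits under the budget (leaving room for the reply)
    def count(msgs):
        return sum(len(ENCODING.encode(m["content"])) for m in msgs)

    system, rest = history[:1], history[1:]
    while rest and count(system + rest) > budget:
        rest.pop(0)  # drop the earliest user/assistant message first
    return system + rest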

Implementation Tips

I’d reiterate that having a strong function to strip out content that the user shouldn’t see from the chat messages is very important. We have tags we can wrap things in to ensure that entire blocks won’t render in the final chat messages; e.g. if your query involves a search to make it happen, but isn’t itself innately a search query, then you shouldn’t show the search results in the returned message.
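A bare-bones version of such a stripping function, based on the tag format used throughout this article (a sketch rather than Buildt’s implementation):

import re

# Matches open/close tag pairs (hiding everything between them), then any
# remaining single or stray tags, in the <|tag ...|> format used above
TAG_BLOCK = re.compile(r"<\|(\w+)[^|]*\|>.*?</\|\1\|>", re.DOTALL)
TAG_SINGLE = re.compile(r"</?\|\w+[^|]*\|/?>")

def strip_tags(content):
    content = TAG_BLOCK.sub("", content)   # remove whole hidden blocks first
    return TAG_SINGLE.sub("", content).strip()

strip_tags('<|determination requiresSearch={true}|> Sure, let me find it <|loading|>')
# -> "Sure, let me find it"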




Alistair Pullen

CEO