A Practical Example – Real Python

You’ve used ChatGPT, and you understand the potential of using a large language model (LLM) to assist you in your tasks. Maybe you’re already working on an LLM-supported application and have read about prompt engineering, but you’re unsure how to translate the theoretical concepts into a practical example.

Your text prompt instructs the LLM’s responses, so tweaking it can get you vastly different output. In this tutorial, you’ll apply multiple prompt engineering techniques to a real-world example. You’ll experience prompt engineering as an iterative process, see the effects of applying various techniques, and learn about related concepts from machine learning and data engineering.

You’ll work with a Python script that you can repurpose to fit your own LLM-assisted task. So if you’d like to use practical examples to discover how you can use prompt engineering to get better results from an LLM, then you’ve found the right tutorial!

Understand the Purpose of Prompt Engineering
Prompt engineering is more than a buzzword. You can get vastly different output from an LLM when using different prompts. That may seem obvious when you consider that you get different output when you ask different questions—but it also applies to phrasing the same conceptual question differently. Prompt engineering means constructing your text input to the LLM using specific approaches.
You can think of prompts as arguments and the LLM as the function to which you pass these arguments. Different input means different output:
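
As a minimal sketch, consider a toy hello() function whose output depends entirely on its argument. The exact body is an assumption, but it captures the function-and-arguments analogy:

```python
def hello(name):
    """Toy function: a different argument produces different output."""
    return f"Hello, {name}!"


print(hello("World"))   # Hello, World!
print(hello("Prompt"))  # Hello, Prompt!
```

Calling hello() with the same argument always returns the same greeting, which is where the analogy to an LLM breaks down, as you’ll see in a moment.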

While an LLM is much more complex than the toy function above, the fundamental idea holds true. For a successful function call, you’ll need to know exactly which argument will produce the desired output. In the case of an LLM, that argument is text that consists of many different tokens, or pieces of words.

Note: The analogy of a function and its arguments has a caveat when dealing with OpenAI’s LLMs. While the hello() function above will always return the same result given the same input, the results of your LLM interactions won’t be 100 percent deterministic. This is currently inherent to how these models operate.

The field of prompt engineering is still changing rapidly, and there’s a lot of active research happening in this area. As LLMs continue to evolve, so will the prompting approaches that will help you achieve the best results.
In this tutorial, you’ll cover some prompt engineering techniques, along with approaches to iteratively developing prompts, that you can use to get better text completions for your own LLM-assisted projects:

Zero-shot prompting
Few-shot prompting
Using delimiters
Detailed, numbered steps
Role prompting
Chain-of-thought prompting

There are more techniques to uncover, and you’ll also find links to additional resources in the tutorial. Applying the mentioned techniques in a practical example will give you a great starting point for improving your LLM-supported programs. If you’ve never worked with an LLM before, then you may want to peruse OpenAI’s GPT documentation before diving in, but you should be able to follow along either way.
Get to Know the Practical Prompt Engineering Project
You’ll explore various prompt engineering techniques in service of a practical example: sanitizing customer chat conversations. By practicing different prompt engineering techniques on a single real-world project, you’ll get a good idea of why you might want to use one technique over another and how you can apply them in practice.
Imagine that you’re the resident Python developer at a company that handles thousands of customer support chats on a daily basis. Your job is to format and sanitize these conversations. You also help with deciding which of them require additional attention.
Collect Your Tasks
Your big-picture assignment is to help your company stay on top of handling customer chat conversations. The conversations that you work with may look like the one shown below:

You’re supposed to make these text conversations more accessible for further processing by the customer support department in a few different ways:

Remove personally identifiable information.
Remove swear words.
Clean the date-time information to only show the date.

The swear words that you’ll encounter in this tutorial won’t be spicy at all, but you can consider them stand-ins for more explicit phrasing that you might find out in the wild. After sanitizing the chat conversation, you’d expect it to look like this:

Sure—you could handle it using Python’s str.replace() or show off your regular expression skills. But there’s more to the task than immediately meets the eye.
Your project manager isn’t a technical person, and they stuck another task at the end of this list. They may think of the task as a normal continuation of the previous tasks. But you know that it requires an entirely different approach and technology stack:

Mark the conversations as “positive” or “negative.”

That task lies in the realm of machine learning, namely text classification, and more specifically sentiment analysis. Even advanced regex skills won’t get you far in this challenge.
Additionally, you know that the customer support team that you’re preparing the data for will want to continue working on it programmatically. Plain text isn’t necessarily the best format for doing that. You want to do work that’s useful for others, so you add yet another stretch goal to your growing list of tasks:

Format the output as JSON.

This task list is quickly growing out of proportion! Fortunately, you’ve got access to the OpenAI API, and you’ll employ the help of their LLM to solve all of these challenges.

Note: The example in this tutorial aims to provide a realistic scenario where utilizing an LLM could help with your work as a Python developer. However, it’s important to mention that sanitizing personally identifiable information is a delicate job! You’ll want to make sure that you’re not accidentally leaking information.
There are also potential risks of using cloud-based services such as the OpenAI API. Your company may not want to send data to the OpenAI API to avoid leaking sensitive information, such as trade secrets.
Finally, keep in mind that API usage isn’t free and that you’ll pay for each request based on the number of tokens the model processes.

One of the impressive features of LLMs is the breadth of tasks that you can use them for. So you’ll cover a lot of ground and different areas of use. And you’ll learn how to tackle them all with prompt engineering techniques.
Prepare Your Tools
To follow along with this tutorial, you’ll need to know how to run a Python script from your command-line interface (CLI), and you’ll need an API key from OpenAI.

Note: If you don’t have an OpenAI API key or don’t have experience running Python scripts, then you can still follow along by copying and pasting the prompts into the web interface of ChatGPT. The text that you get back will be slightly different, but you might still be able to see how responses change based on the different prompt engineering techniques.

You’ll focus on prompt engineering, so you’ll only use the CLI app as a tool to demonstrate the different techniques. However, if you want to understand the code that you’ll be using, then it’ll help to have some experience with Python classes, defining your own Python functions, the name-main idiom, and using Python to interact with web APIs.
To get started, go ahead and download the example Python script that you’ll work with throughout the tutorial:

The codebase represents a light abstraction layer on top of the OpenAI API and exposes one function called get_chat_completion() that’ll be of primary interest for the tutorial. The function interacts with OpenAI’s /chat/completions endpoint to generate responses using different models, such as GPT-3.5-Turbo and GPT-4. You’ll explore both models, starting with GPT-3.5-Turbo, and eventually you’ll move on to the more powerful GPT-4 model.
Most of the code in app.py revolves around setting up and fetching the settings from settings.toml.
The script also parses a command-line argument to allow you to conveniently specify an input file.
The input files that you’ll primarily work with contain LLM-generated customer support chat conversations, but feel free to reuse the script and provide your own input text files for additional practice.

Note: If you’re curious, take a moment to read through the code and familiarize yourself with it. Understanding the script isn’t a requirement to understand the concepts that you’ll learn about in this tutorial, but it’s always better to know the code that you’re executing.

The heart of the codebase is settings.toml. This settings file hosts the prompts that you’ll use to sharpen your prompt engineering skills, written in the human-readable TOML format.
Keeping your prompts in a dedicated settings file helps you put them under version control, which means that you can keep track of the different versions of your prompts that will inevitably accumulate during development.

Note: You can find all the versions of all the prompts that you’ll use in this tutorial in the README.md file.

Your Python script will read the prompts from settings.toml, assemble them meaningfully, and send an API request to OpenAI.
Alternatively, you can also run all the text prompts directly in the OpenAI playground, which will give you the same functionality as the script. You could even paste the prompts into the ChatGPT interface. However, the results will vary because you’ll be interacting with a different model and won’t have the opportunity to change certain settings.
Set Up the Codebase
Make sure that you’re on Python 3.11 or higher, so that you can interact with TOML files using the standard library. If you haven’t downloaded the codebase yet, go ahead and click the link below:

Unzip the folder and use your CLI to navigate into the folder. You’ll see a handful of files. The most important ones are app.py and settings.toml:
./
├── LICENSE
├── README.md
├── app.py
├── chats.txt
├── requirements.txt
├── sanitized-chats.txt
├── sanitized-testing-chats.txt
├── settings.toml
├── settings-final.toml
└── testing-chats.txt

The file settings.toml contains placeholders for all the prompts that you’ll use to explore the different prompt engineering techniques. That’s the file that you’ll primarily work with, so open it up. You’ll use it to iteratively develop the prompts for your application.
The file app.py contains the Python code that ties the codebase together. You’ll run this script many times throughout the tutorial, and it’ll take care of pulling your prompts from settings.toml.
After you’ve downloaded and unpacked the codebase, create and activate a new virtual environment. Then use pip to install the required dependencies:
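
Assuming a Unix-like shell, the usual commands look like this. Adjust the activation line for your platform, for example venv\Scripts\activate on Windows:

```shell
python -m venv venv
source venv/bin/activate
python -m pip install -r requirements.txt
```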

Note that this tutorial uses openai version 1.13.3. OpenAI may introduce breaking changes between API versions, so make sure that you install the pinned dependencies from the requirements file. Then you’ll be able to work through the tutorial without any hiccups.
To run the script successfully, you’ll need an OpenAI API key with which to authenticate your API requests. Make sure to keep that key private and never commit it to version control! If you’re new to using API keys, then read up on best practices for API key safety.
To integrate your API key with the script and avoid leaking it publicly, you can export the API key as an environment variable:
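
On a Unix-like system, you can do that with export. The value shown here is just a placeholder:

```shell
# Replace the placeholder with your real key, and don't commit it anywhere.
export OPENAI_API_KEY="your-api-key-here"
```

On Windows, you can use set in the command prompt instead.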

After you’ve added your API key as an environment variable named OPENAI_API_KEY, the script will automatically pick it up during each run.
At this point, you’ve completed the necessary setup steps. You can now run the script using the command line and provide it with a file as additional input text:
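
Based on the file names in the codebase, the invocation looks like this:

```shell
python app.py chats.txt
```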

The command shown above combines the customer support chat conversations in chats.txt with prompts and API call parameters that are saved in settings.toml, then sends a request to the OpenAI API. Finally, it prints the resulting text completion to your terminal.

Note: Using a settings.toml file for API call parameters and prompts is just one option. You don’t need to follow this structure if you have a different project organization.
For more information about how to make calls to OpenAI’s API through the official Python bindings, check out the official API reference.

From now on, you’ll primarily make changes in settings.toml. The code in app.py is just here for your convenience, and you won’t have to edit that file at all. The changes in the LLM’s output will come from changing the prompts and a few of the API call arguments.
Freeze Responses by Setting the Temperature to Zero
When you’re planning to integrate an LLM into a product or a workflow, then you’ll generally want deterministic responses. The same input should give you the same output. Otherwise, it gets hard to provide a consistent service or debug your program if something goes wrong.
Because of this, you’ll want to set the temperature argument of your API calls to 0. This value will mean that you’ll get mostly deterministic results.
LLMs do text completion by predicting the next token based on the probability that it follows the previous tokens. Higher temperature settings introduce more randomness into the results by allowing the LLM to pick tokens with lower probabilities. Because there are so many token selections chained one after the other, picking one different token can sometimes lead to vastly different results.
If you use the LLM to generate ideas or alternative implementations of a programming task, then higher values for temperature might be interesting. However, they’re generally undesirable when you build a product.
In the example codebase, you can adjust temperature right inside your settings.toml file:
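
The snippet below sketches what the relevant settings might look like. Only the temperature and seed keys are discussed in this tutorial, and the surrounding structure of settings.toml may differ:

```toml
temperature = 0
# seed improves determinism on models that support it
seed = 12345
```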

The initial value is set at 0. All the examples in this tutorial assume that you leave temperature at 0 so that you’ll get mostly deterministic results. If you want to experiment with how a higher temperature changes the output, then feel free to play with it by changing the value for temperature in this settings file.
It’s important to keep in mind that you won’t be able to achieve true determinism with the current LLM models offered by OpenAI even if you keep temperature at 0:

An edge-case in GPT-3 with big implications: Inference is non-deterministic (even at temperature=0) when top-2 token probabilities are <1% different. So temperature=0 output is very close to deterministic, but actually isn’t. Worth remembering. (Source)

So, while you can’t entirely guarantee that the model will always return the same result, you can get much closer by setting temperature to 0.

Note: OpenAI operates with continuous model upgrades, so when you run the examples, you’ll probably interact with a different LLM than the one used when this tutorial was written. Therefore, the responses that you get might differ significantly from the output that you see in this tutorial.
Nevertheless, all the techniques that the tutorial introduces are valid prompt engineering techniques that you can mix and match to improve the responses that you’ll get from an LLM for your personal projects.

Another approach that improves determinism in the results is to set a value for the seed parameter. The provided code sets the seed to 12345. However, this only has an effect on some of the models.
Start Engineering Your Prompts
Now that you have an understanding of prompt engineering and the practical project that you’ll be working with, it’s time to dive into some prompt engineering techniques. In this section, you’ll learn how to apply the following techniques to your prompts to get the desired output from the language model:

Zero-shot prompting: Giving the language model normal instructions without any additional context
Few-shot prompting: Conditioning the model on a few examples to boost its performance
Using delimiters: Adding special tokens or phrases to provide structure and instructions to the model
Detailed, numbered steps: Breaking down a complex prompt into a series of small, specific steps

By practicing these techniques with the customer chat conversation example, you’ll gain a deeper understanding of how prompt engineering can enhance the capabilities of language models and improve their usefulness in real-world applications.
Describe Your Task
You’ll start your prompt engineering journey with a concept called zero-shot prompting, which is just a fancy way of saying that you’re asking a question or describing a task:

Remove personally identifiable information, only show the date, and replace all swear words with “😤”

This task description focuses on the requested steps for sanitizing the customer chat conversation and literally spells them out. This is the prompt that’s currently saved as instruction_prompt in the settings.toml file:
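
Based on the task description quoted above, the entry might look like this:

```toml
instruction_prompt = """
Remove personally identifiable information, only show the date,
and replace all swear words with "😤"
"""
```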

If you run the Python script and provide the support chat file as an argument, then it’ll send this prompt together with the content of chats.txt to OpenAI’s text completion API:

If you correctly installed the dependencies and added your OpenAI API key as an environment variable, then all you need to do is wait until you see the API response pop up in your terminal:

In the example output, you can see that the prompt that you provided didn’t do a good job tackling the tasks. And that’s putting it gently! It picked up that it should do something with the huffing emoji and reduce the ISO date-time to only a date. Your results might not even have gotten that far. Overall, nearly all of the work is left undone, and the output is useless.

Note: The text above represents an example response. Keep in mind that OpenAI’s LLM models aren’t fully deterministic, even with temperature set to 0 and the seed parameter set. Additionally, OpenAI operates with continuous model upgrades, so when you run the example, you’ll probably interact with a different LLM than the one used when this tutorial was written.

If you’re new to interacting with LLMs, then this may have been a first attempt at outsourcing your development work to the text completion model. But these initial results aren’t exactly exhilarating.
So you’ve described the task in natural language and gotten a bad result. But don’t fret—throughout the tutorial you’ll learn how you can get more useful responses for your task.
One way to do that is by increasing the number of shots, or examples, that you give to the model. When you’ve given the model zero shots, the only way to go is up! That’s why you’ll improve your results through few-shot prompting in the next section.
Use Few-Shot Prompting to Improve Output
Few-shot prompting is a prompt engineering technique where you provide example tasks and their expected solutions in your prompt. So, instead of just describing the task like you did before, you’ll now add an example of a chat conversation and its sanitized version.
Open up settings.toml and change your instruction_prompt by adding such an example:
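
A sketch of such a few-shot prompt is shown below. The example conversation is entirely hypothetical, including the usernames, and won’t match the tutorial’s actual data:

```toml
instruction_prompt = """
Remove personally identifiable information, only show the date,
and replace all swear words with "😤"

Example input:
[support_tom] 2023-07-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2023-07-24T10:03:15+00:00 : I can't log in to my account!

Example sanitized output:
[Agent] 2023-07-24 : What can I help you with?
[Customer] 2023-07-24 : I can't log in to my account!
"""
```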

Once you’ve applied the change, give the LLM another chance to sanitize the chat conversations for you by running the script again:

You’ll have to wait for the LLM to predict all the tokens. When it’s done, you’ll see a fresh response pop up in your terminal:

Okay, great! This time at least the LLM didn’t eat up all the information that you passed to it without giving anything useful back!

Note: You’ll mostly see truncated example responses throughout this tutorial to illustrate the points. The truncated output is noted with an ellipsis (…). When you run the script, you’ll get longer output. Because of changing models and non-determinism, your output will most likely be slightly—to significantly—different.

This time, the model tackled some of the tasks. For example, it sanitized the names in square brackets. However, the names of the customers are still visible in the actual conversations. It also didn’t censor the order numbers or the email address.

The model probably didn’t sanitize the names in the conversations or the order numbers because the chat that you provided as an example didn’t contain any names or order numbers. In other words, the example that you provided didn’t demonstrate redacting names, order numbers, or email addresses in the conversation text.
Here you can see how important it is to choose good examples that clearly represent the output that you want.

So far, you’ve provided one example in your prompt. To cover more ground, you’ll add another example so that this part of your prompt truly puts the few in few-shot prompting:

You added a second example that contains both a customer name as well as an order number in the chat text body. The example of a sanitized chat shows both types of sensitive data replaced with a sequence of asterisks (****). Now you’ve given the LLM a good example to model.
After editing instruction_prompt in settings.toml, run your script again and wait for the response to print to your terminal:

Wait, where did most of the output go? You probably expected to see better results, but it looks like you’re getting only two of the conversations back this time!

You’ve added more text to your prompt. At this point, the task instructions probably make up proportionally too few tokens for the model to consider them in a meaningful way. The model lost track of what it was supposed to do with the text that you provided.

Adding more examples should make your responses stronger instead of eating them up, so what’s the deal? You can trust that few-shot prompting works—it’s a widely used and very effective prompt engineering technique. To help the model distinguish which part of your prompt contains the instructions that it should follow, you can use delimiters.
Use Delimiters to Clearly Mark Sections of Your Prompt
If you’re working with content that needs specific inputs, or if you provide examples like you did in the previous section, then it can be very helpful to clearly mark specific sections of the prompt. Keep in mind that everything you write arrives to an LLM as a single prompt—a long sequence of tokens.
You can improve the output by using delimiters to fence and label specific parts of your prompt. In fact, if you’ve been running the example code, then you’ve already used delimiters to fence the content that you’re reading from file.
The script adds the delimiters when assembling the prompt in app.py:
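
The exact code isn’t reproduced here, but the assembly step can be sketched as follows. The function name and signature are assumptions; only the delimiter characters come from the tutorial:

```python
def assemble_prompt(instructions: str, chat_content: str) -> str:
    """Fence the file content in delimiters so the model can tell
    it apart from the task instructions."""
    return f"{instructions}\n\n>>>>>\n{chat_content}\n<<<<<"
```
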

When assembling the prompt, you wrap the chat content between the >>>>> and <<<<< delimiters. Marking parts of your prompt with delimiters can help the model keep track of which tokens it should consider as a single unit of meaning.
You’ve seen in the previous section that missing delimiters can lead to unexpected results. You might receive less output than expected, like in the previous example, or an empty response. But you might also receive output that’s quite different from what you want! For example, imagine that the chat content that you’re reformatting contains a question at the end, such as:

Can you give me your order number?

If this question is the last line of your prompt without delimiters, then the LLM might continue the imaginary chat conversation by answering the question with an imaginary order number. Give it a try by adding such a sentence to the end of your current prompt!
Delimiters can help to separate the content and examples from the task description. They can also make it possible to refer to specific parts of your prompt at a later point in the prompt.
A delimiter can be any sequence of characters that usually wouldn’t appear together, for example:

>>>>>
====
####

The number of characters that you use doesn’t matter too much, as long as you make sure that the sequence is relatively unique. Additionally, you can add labels just before or just after the delimiters:

START CONTENT>>>>> content <<<<<END CONTENT
==== START content END ====
#### START EXAMPLES examples #### END EXAMPLES

The exact formatting also doesn’t matter so much. As long as you mark the sections so that a casual reader could understand where a unit of meaning begins and ends, then you’ve properly applied delimiters.
Edit your prompt in settings.toml to add a clear reference to your delimited content, and also include a delimiter for the examples that you’ve added:

With these adaptations to your instruction_prompt, you now specifically reference the content as >>>>>CONTENT<<<<< in your task description. These delimiters match the delimiters that the code in app.py adds when assembling the prompt.
You’ve also delimited the examples that you’re providing with #### START EXAMPLES and #### END EXAMPLES, and you differentiate between the inputs and expected outputs using multiple dashes (----) as delimiters.
While delimiters can help you to get better results, in this case your output is quite similar to before:

Notably, the model only returned the two conversations that you passed as examples. Could it be that your prompt leads to something similar to overfitting? Either way, using the actual data that you want to sanitize as your training data isn’t a good idea, so in the next section, you’ll make sure to change that.
In this section, you’ve learned how you can clarify the different parts of your prompt using delimiters. You marked which part of the prompt is the task description and which part contains the customer support chat conversations, as well as the examples of original input and expected sanitized output.
Test Your Prompt Across Different Data
So far, you’ve created your few-shot examples from the same data that you also run the sanitization on. This means that you’re effectively using your test data to provide context to the model. Mixing training, validation, and testing data is a bad practice in machine learning. You might wonder how well your prompt generalizes to different input.
To test this out, run the script another time with the same prompt using the second file that contains chat conversations, testing-chats.txt. The conversations in this file contain different names, and different—soft—swear words:

You’ll keep running your script using testing-chats.txt moving forward, unless indicated differently.
Once you’ve waited for the LLM to generate and return the response, you’ll notice that the result isn’t very satisfying:

The model now understands that you meant the examples as examples to follow when applying edits and gives you back all of the new input data. However, it didn’t do a great job following the instructions.
The model didn’t identify new swear words and didn’t replace them. The model also didn’t redact the order numbers, nor did it anonymize the names. It looks like it only managed to reformat your date strings.
So your engineered prompt currently doesn’t work well, and generalizes even worse. If you built a pipeline based on this prompt, where new chats could contain new customer names, then the application would probably continue to perform poorly. How can you fix that?
You’ve grown your prompt significantly by providing more examples, but your task description is still largely just the question that you wrote right at the beginning. To get better results, you’ll need to do some prompt engineering on the task description as well.
Describe Your Request in Numbered Steps
If you break up your task instructions into a numbered sequence of small steps, then the model is a lot more likely to produce the results that you’re looking for.
Go back to your prompt in settings.toml and break your initial task description into more granular, specific substeps:
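
The steps below are a sketch that restates the sanitization tasks from earlier in the tutorial; the wording of the tutorial’s actual prompt may differ:

```toml
instruction_prompt = """
Sanitize the chat conversations delimited by >>>>> and <<<<<
by following these steps:
1. Replace the agent's name in square brackets with [Agent].
2. Replace the customer's name in square brackets with [Customer].
3. Replace any names in the conversation text with [Agent] or [Customer].
4. Replace email addresses and order numbers with ****.
5. Replace all swear words with "😤".
6. Shorten the ISO date-time stamps to only show the date.
"""
```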

With these step-by-step instructions in place, you’re ready for another run of your script and another inspection of the newly generated output:

That’s a significant improvement! The model managed to follow the pattern of replacing the names in square brackets with [Agent] and [Customer], respectively. It correctly identified some new swear words and replaced them with the huffing emoji. The model also redacted the order numbers, and anonymized the names in the conversation texts.
Often, one of the best approaches to get better results from an LLM is to make your instructions more specific.
Framing your tasks in even smaller and more specific steps will generally get you better results. Don’t shy away from some repetition:

Note: You may have received different output. Keep in mind that the results aren’t fully deterministic. However, with better prompts and increased specificity, you’ll move closer to mostly deterministic results.

Increasing the specificity of your instructions, and introducing numbered steps, helped you create a well-performing prompt. Your prompt successfully removes personally identifiable information from the conversations, redacts swear words, and reformats the ISO date-time stamp, as well as the usernames.
You could consider your initial task as completed, but there’s more that you want to do, and more prompt engineering techniques to explore. You also know that there are newer models that you could work with, and your success has further piqued your curiosity. It’s time to switch to a different LLM, see how that influences your output, and then continue exploring other techniques.
Perform Chat Completions With GPT-4
You’ve decided to switch to an even more powerful LLM, GPT-4. In the rest of this tutorial, you’ll use GPT-4 to continue exploring other important prompt engineering techniques:

Role prompting: Using a system message to set the tone of the conversation, and using different roles to give context through labeling
Chain-of-thought prompting (CoT): Giving the model time to think by prompting it to reason about a task, then including the reasoning in the prompt

You’ll also use GPT-4 to classify the sentiment of each chat conversation and structure the output format as JSON.
Switch to a Different Model
If you’re working with the provided script, then all you need to do is pick a chat model from chat_models in settings.toml and use it as the new value for model:
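
For example, assuming the identifiers in chat_models include a GPT-4 variant, the change might look like this:

```toml
chat_models = ["gpt-3.5-turbo", "gpt-4"]
model = "gpt-4"
```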

Changing these settings will send your request to a different model. Like before, it’ll assemble your prompt in the way necessary for a /chat/completions endpoint request, make that request for you, and print the response to your terminal.
For the rest of this tutorial, you’ll work with OpenAI’s latest version of the GPT-4 model. If you don’t have access to this model, then you can instead keep working with the model that you’ve been working with so far.
If you’ve been following along using ChatGPT, then you’re stuck with whatever model currently powers it, unless you’re a ChatGPT Plus subscriber. In that case, you can change the model to GPT-4 on the website.

Note: The prompt engineering techniques that you’ll learn about in this section aren’t exclusive to newer models. You can also use them without switching models, but your completion results will likely differ more from the ones shown in this tutorial.

Without changing your prompt, run your script another time to see the different results of the text completion based only on using a different LLM:

Some responses may be relatively similar to the ones with the older model. However, you can also expect to receive results like the one shown above, where most swear words are still present.
It’s important to keep in mind that developing for a specific model will lead to specific results, and swapping the model may improve or deteriorate the responses that you get. Therefore, swapping to a newer and more powerful model won’t necessarily give you better results straight away.

Note: Generally, larger models will give you better results, especially for prompts that you didn’t heavily engineer. If you want, you can go back to your initial prompt and try to run it using GPT-4. You’ll notice that the results are somewhat better than, although different from, the initial results that you got using GPT-3.5-Turbo.

Additionally, it’s helpful to keep in mind that API calls to larger models generally cost more money per request. While it can be fun to always use the latest and greatest LLM, it may be worthwhile to consider whether you really need to upgrade to tackle the task that you’re trying to solve.
Add a Role Prompt to Set the Tone
There are some additional possibilities when interacting with the API endpoint that you’ve only used implicitly, but haven’t explored yet, such as adding role labels to a part of the prompt. In this section, you’ll use the “system” role to create a system message, and you’ll revisit the concept later on when you add more roles to improve the output.
Role prompting usually refers to adding system messages, which represent information that helps to set the context for upcoming completions that the model will produce. System messages usually aren’t visible to the end user. Keep in mind that the /chat/completions endpoint models were initially designed for conversational interactions.
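Structurally, a system message is just the first role-labeled entry in the messages list that you send to the /chat/completions endpoint. A minimal sketch, where the variable names mirror the tutorial’s script but the contents are illustrative:

```python
# Sketch: a "system" message sets context for the completion and sits
# at the front of the messages list. The prompt text here is
# illustrative, not the tutorial's engineered prompt.

role_prompt = "You are a helpful assistant."
instruction_prompt = "Sanitize the chat conversation below."

messages = [
    {"role": "system", "content": role_prompt},
    {"role": "user", "content": instruction_prompt},
]
```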
You can also use system messages to set a context for your completion task. You’ll craft a bespoke role prompt in a moment. However, for this specific task, the role prompt is likely less important than it might be for some other tasks. To explore the possible influence of a role prompt, you’ll take a little detour and ask your model to play a role:

You keep instruction_prompt the same as you engineered it earlier in the tutorial. Additionally, you now add text to role_prompt. The role prompt shown above serves as an example for the impact that a misguided prompt can have on your application.
Unleash, thou shall, the parchment’s code and behold the marvels unexpected, as the results may stir wonderment and awe:

As you can see, a role prompt can have quite an impact on the language that the LLM uses to construct the response. This is great if you’re building a conversational agent that should speak in a certain tone or language. And you can also use system messages to keep specific setup information present.
However, for completion tasks like the one that you’re currently working on, you might not need this type of role prompt. For now, you could give it a common boilerplate phrase, such as You’re a helpful assistant.
To practice writing a role prompt—and to see whether you can release your customer chat conversations from the reign of that 16th century villain poet—you’ll craft a more appropriate role prompt:

This role prompt is more appropriate to your use case. You don’t want the model to introduce randomness or to change any of the language that’s used in the conversations. Instead, you just want it to execute the tasks that you describe. Run the script another time and take a look at the results:

That looks much better again! Abide concealed in yonder bygone era, ye villainous poet!
As you can see from these examples, role prompts can be a powerful way to change your output. Especially if you’re using the LLM to build a conversational interface, then they’re a force to consider.

For some reason, GPT-4 seems to consistently pick [Client] over [Customer], even though you’re specifying [Customer] in the few-shot examples. You’ll eventually get rid of these verbose names, so it doesn’t matter for your use case.
However, if you’re determined and curious—and manage to prompt [Client] away—then share the prompt that worked for you in the comments.

In the final section of this tutorial, you’ll revisit using roles and see how you can employ the power of conversation to improve your output even in a non-conversational completion task like the one you’re working on.
Classify the Sentiment of Chat Conversations
At this point, you’ve engineered a decent prompt that seems to perform quite well in sanitizing and reformatting the provided customer chat conversations. To fully grasp the power of LLM-assisted workflows, you’ll next tackle the tacked-on request by your manager to also classify the conversations as positive or negative.
Start by saving both sanitized conversation files into new files that will constitute the new inputs for your sentiment classification task:

You could continue to build on top of the previous prompt, but eventually you’ll hit a wall when you’re asking the model to do too many edits at once. The classification step is conceptually distinct from the text sanitization, so it’s a good cut-off point to start a new pipeline.
The sanitized chat conversation files are also included in the example codebase:

Again, you want the model to do the work for you. All you need to do is craft a prompt that spells out the task at hand, and provide examples. You can also edit the role prompt to set the context for this new task that the model should perform:

You can now run the script and provide it with the sanitized conversations in sanitized-testing-chats.txt that were the output of your previously engineered prompt:

You added another step to your task description and slightly modified the few-shot examples in your prompt. Not a lot of extra work for a task that would have required a lot more work without the help of an LLM. But is this really sufficient? Take a look at the output once your script has finished running:

The output is quite promising! The model correctly labeled conversations with angry customers with the fire emoji. However, the first conversation probably doesn’t entirely fit into the same bucket as the rest because the customer doesn’t display a negative sentiment towards the company.
Assume that all of these conversations were resolved positively by the customer service agents and that your company just wants to follow up with those customers who seemed noticeably angry with their situation. In that case, you might need to tweak your prompt a bit more to get the desired result.
You could add more examples, which is generally a good idea because it creates more context for the model to apply. Writing a more detailed description of your task helps as well, as you’ve seen before. However, to tackle this task, you’ll learn about another useful prompt engineering technique called chain-of-thought prompting.
Walk the Model Through Chain-of-Thought Prompting
A widely successful prompt engineering approach can be summed up with the anthropomorphism of giving the model time to think. You can do this with a couple of different specific techniques. Essentially, it means that you prompt the LLM to produce intermediate results that become additional inputs. That way, the reasoning doesn’t need to take distant leaps but can instead hop from one lily pad to the next.
One of these approaches is to use chain-of-thought (CoT) prompting techniques. To apply CoT, you prompt the model to generate intermediate results that then become part of the prompt in a second request. The increased context makes it more likely that the model will arrive at a useful output.
The smallest form of CoT prompting is zero-shot CoT, where you literally ask the model to think step by step. This approach yields impressive results for mathematical tasks that LLMs otherwise often solve incorrectly.
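Zero-shot CoT amounts to a one-line transformation of your prompt. The `zero_shot_cot()` function name below is illustrative:

```python
# Sketch: zero-shot chain-of-thought literally appends the magic
# phrase to the task prompt. The function name is illustrative.

def zero_shot_cot(task_prompt):
    """Turn a plain prompt into a zero-shot CoT prompt."""
    return f"{task_prompt}\n\nLet's think step by step."

cot_prompt = zero_shot_cot("Classify the sentiment of the conversation below.")
```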
Chain-of-thought operations are technically split into two stages:

Reasoning extraction, where the model generates the increased context
Answer extraction, where the model uses the increased context to generate the answer

Reasoning extraction is useful across a variety of CoT contexts. You can generate few-shot examples from input, which you can then use for a separate step of extracting answers using more detailed chain-of-thought prompting.
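The two stages can be sketched as two separate requests. Here, `call_llm()` is a hypothetical stand-in for an actual API call, passed in as a function so the flow stays visible and testable:

```python
# Sketch of the two CoT stages as two separate model calls.
# call_llm is a hypothetical stand-in for a real API request.

def reasoning_extraction(conversation, call_llm):
    """Stage 1: ask the model to spell out its reasoning."""
    prompt = f"{conversation}\n\nLet's think step by step."
    return call_llm(prompt)

def answer_extraction(conversation, reasoning, call_llm):
    """Stage 2: feed the reasoning back in to get the final label."""
    prompt = (
        f"{conversation}\n\nReasoning: {reasoning}\n\n"
        "Based on the reasoning above, label the sentiment as positive or negative."
    )
    return call_llm(prompt)
```

The output of stage 1 becomes part of the input to stage 2, which is the core of the CoT idea.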
You can try zero-shot CoT on the sanitized chat conversations to embellish the few-shot examples that you’ll use to classify the chat conversations more robustly. Remove the examples and replace them with instructions that describe in more detail the reasoning you’d apply when classifying the conversations:

You spelled out the criteria that you want the model to use to assess and classify sentiment. Then you added the sentence Let’s think step by step to the end of your prompt.
You want to use this zero-shot CoT approach to generate few-shot examples that you’ll then build into your final prompt. Therefore, you should run the script using the data in sanitized-chats.txt this time:

You’ll get back a reference to the conversations, with the reasoning spelled out step by step to reach the final conclusion:

The reasoning is straightforward and sticks to your instructions. If the instructions accurately represent the criteria for marking a conversation as positive or negative, then you’ve got a good playbook at hand.
You can now use this information to improve the few-shot examples for your sentiment classification task:

You’re using the same examples as previously, but you’ve enhanced each of the examples with a short chain of thought that you generated in the previous call. Give your script another spin using sanitized-testing-chats.txt as the input file and see whether the results have improved:

Great! Now the first conversation, which was initially classified as negative, has also received the green checkmark.

Note: The input chat conversations that you supply through the few-shot examples now contain additional text that the input in sanitized-testing-chats.txt doesn’t include. Using your prompt engineering skills, you’ve effectively prompted the LLM to create reasoning steps internally and then use that information to aid in the sentiment classification task.

In this section, you’ve supported your examples with reasoning for why a conversation should be labeled as positive or negative. You generated this reasoning with another call to the LLM.
At this point, it seems that your prompt generalizes well to the available data and classifies the conversations as intended. And you only needed to carefully craft your words to make it happen!
Structure Your Output Format as JSON
As a final showcase for effective prompting when incorporating an LLM into your workflow, you’ll tackle the last task, which you added to the list yourself: to pass the data on in a structured format that’ll make it straightforward for the customer support team to process further.
You already specified a format to follow in the previous prompt, and the LLM returned what you asked for. So it might just be a matter of asking for a different, more structured format, for example, JSON:

In your updated instruction_prompt, you’ve explicitly asked the model to return the output as valid JSON. Then, you also adapted your few-shot examples to represent the JSON output that you want to receive. Note that you also applied additional formatting by removing the date from each line of conversation and truncating the [Agent] and [Customer] labels to single letters, A and C.
You’re still using example chat conversations from your sanitized chat data in sanitized-chats.txt, and you send the sanitized testing data from sanitized-testing-chats.txt to the model for processing.
In this case, you receive valid JSON, as requested. The classification still works as before and the output censors personally identifiable information, replaces swear words, and applies all the additional requested formatting:

Your output may be different and show some small hiccups, but overall, this output is quite impressive and useful! You could pass this JSON structure over to the customer support team, and they could quickly integrate it into their workflow to follow up with customers who displayed a negative sentiment in the chat conversation.
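Before handing the output to the support team, you might want to validate that the response really is the JSON shape you asked for. A sketch, assuming hypothetical field names like `"id"` and `"sentiment"` — use whatever keys your prompt actually specifies:

```python
# Sketch: validate the model's response before passing it on.
# The field names ("id", "sentiment", "messages") are hypothetical
# examples of keys a prompt might specify.

import json

def parse_model_output(raw_response):
    """Parse the LLM response and fail loudly if it isn't valid JSON."""
    conversations = json.loads(raw_response)
    for conversation in conversations:
        # Each conversation entry should carry a sentiment label.
        if "sentiment" not in conversation:
            raise ValueError("Missing sentiment label")
    return conversations

sample = '[{"id": 1, "sentiment": "negative", "messages": ["C: This is broken!"]}]'
parsed = parse_model_output(sample)
```

Since LLM output isn’t fully deterministic, a check like this guards downstream consumers against the occasional malformed response.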
You could stop here, but the engineer in you isn’t quite satisfied yet. All the instructions just in a single prompt? Your premonition calls and tells you tales about maintainability. In the next section, you’ll refactor your prompts to apply role labels before you set up your LLM-assisted pipeline and call it a day.
Improve Your Output With the Power of Conversation
You added a role prompt earlier on, but otherwise you haven’t tapped into the power of conversations yet.

Note: A conversation could be an actual back-and-forth interaction like when you’re interacting with ChatGPT, but it doesn’t need to be. In this tutorial, the conversation consists of a series of messages that you send to the model all at once.
So it might feel a bit like you’re having a conversation with yourself, but it’s an effective way to give the model more information and guide its responses.

In this final section, you’ll learn how you can provide additional context to the model by splitting your prompt into multiple separate messages with different labels.
In calls to the /chat/completions endpoint, a prompt is split into several messages. Each message has its content, which represents the prompt text. Additionally, it also has a role. There are different roles that a message can have, and you’ll work with three of them:

“system” gives context for the conversation and helps to set the overall tone.
“user” represents the input that a user of your application might provide.
“assistant” represents the output that the model would reply with.

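Put together, a role-labeled few-shot conversation might look like the following sketch. The message contents are illustrative, not the tutorial’s engineered prompts:

```python
# Sketch: a few-shot exchange expressed as role-labeled messages,
# following the structure of OpenAI's /chat/completions endpoint.
# The content strings are illustrative.

chat_messages = [
    {"role": "system", "content": "You classify the sentiment of chat conversations."},
    {"role": "user", "content": "[C] Thanks, that fixed it!"},          # example input
    {"role": "assistant", "content": '{"sentiment": "positive"}'},      # example output
    {"role": "user", "content": "[C] Nothing you suggested works!"},    # real input
]
```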
So far, you’ve provided context for different parts of your prompt all mashed together in a single prompt, more or less well separated using delimiters. When you use a model that’s optimized for chat, such as GPT-4, then you can use roles to let the LLM know what type of message you’re sending.
For example, you can create some variables for your few-shot examples and separate variables for the associated CoT reasoning and outputs:

You’ve disassembled your instruction_prompt into seven separate prompts, based on what role the messages have in your conversation with the LLM.
The helper function that builds a messages payload, _assemble_chat_messages(), is already set up to include all of these prompts in the API request. Take a look into app.py to check out the separate messages, with their fitting roles, that make up your overall prompt:

Your prompt is now split into distinct parts, each of which has a certain role label:

Example input has the “user” role.
Reasoning that the model created has the “system” role.
Example output has the “assistant” role.

You’re now providing context for how user input might look, how the model can reason about classifying the input, and how your expected output should look. You removed the delimiters that you previously used for labeling the example sections. They aren’t necessary now that you’re providing context for the parts of your prompt through separate messages.
Give your script a final run to see whether the power of conversation has managed to improve the output:

This JSON structure is looking legitimately great! The formatting that you wanted now shows up throughout, and the conversations are labeled correctly.
Additionally, you’ve improved the maintainability of your prompts by splitting them into separate, role-labeled messages. You can feel proud to pass on such a useful edit of the customer chat conversation data to your coworkers!
Key Takeaways
You’ve covered common prompt engineering techniques in this tutorial. Here, you’ll find a few questions and answers that sum up the most important concepts.
You can use these questions to check your understanding or to recap and solidify what you’ve just learned. After each question, you’ll find a brief explanation hidden in a collapsible section. Click the Show/Hide toggle to reveal the answer. Time to dive in!

Knowledge about prompt engineering is crucial when you work with large language models (LLMs) because you can receive much better results with carefully crafted prompts.

The temperature setting controls the amount of randomness in your output. Setting the temperature argument of API calls to 0 will increase consistency in the responses from the LLM. Note that OpenAI’s LLMs are only ever mostly deterministic, even with the temperature set to 0.
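For example, pinning the temperature in the request body — a sketch of the /chat/completions payload:

```python
# Sketch: setting temperature to 0 for more consistent completions.
# The payload mirrors the /chat/completions request body.

payload = {
    "model": "gpt-4",
    "temperature": 0,  # reduce randomness; results still aren't fully deterministic
    "messages": [{"role": "user", "content": "Classify this conversation."}],
}
```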

Few-shot prompting is a common prompt engineering technique where you add examples of expected input and desired output to your prompt.
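A minimal few-shot prompt might look like this — the example conversations are made up for illustration:

```python
# Sketch: a few-shot prompt pairs example inputs with desired outputs
# and ends on the real input, leaving the final answer for the model.

few_shot_prompt = """\
Classify the sentiment of each conversation as positive or negative.

Conversation: [C] Thanks, you solved my problem!
Sentiment: positive

Conversation: [C] This is the worst service ever.
Sentiment: negative

Conversation: [C] My order finally arrived, thanks!
Sentiment:"""
```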

Using delimiters can be helpful when dealing with more complex prompts. Delimiters help to separate and label sections of the prompt, assisting the LLM in understanding its tasks better.
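For example — the delimiter choice here is arbitrary, and any consistent marker works:

```python
# Sketch: using a delimiter to label sections of a prompt so the
# model can tell instructions apart from data.

delimiter = ">>>>>"
prompt = (
    f"Classify the conversations delimited by {delimiter}.\n\n"
    f"{delimiter}\n[C] Thanks for your help!\n\n"
    f"{delimiter}\n[C] This still doesn't work."
)
```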

Testing your prompt with data that’s separate from the training data is important to see how well the model generalizes to new conditions.

Yes, generally adding more context will lead to more accurate results. However, it’s also important how you add the additional context. Just adding more text may lead to worse results.

In chain-of-thought (CoT) prompting, you prompt the LLM to produce intermediate reasoning steps. You can then include these steps in the answer extraction step to receive better results.

Next Steps
In this tutorial, you’ve learned about various prompt engineering techniques, and you’ve built an LLM-assisted Python application along the way. If you’d like to learn more about prompt engineering, then check out some related questions, as well as some resources for further study below:

Yes, prompt engineer can be a real job, especially in the context of AI and machine learning. As a prompt engineer, you design and optimize prompts so that AI models like GPT-4 produce desired responses. However, it might not be a stand-alone job title everywhere. It could be a part of broader roles like machine learning engineer or data scientist.

Prompt engineering, like any other technical skill, requires time, effort, and practice to learn. It’s not necessarily easy, but it’s certainly possible for someone with the right mindset and resources to learn it. If you’ve enjoyed the iterative and text-based approach that you learned about in this tutorial, then prompt engineering might be a good fit for you.

The field of prompt engineering is quite new, and LLMs keep developing quickly as well. The landscape, best practices, and most effective approaches are therefore changing rapidly. To continue learning about prompt engineering using free and open-source resources, you can check out Learn Prompting and the Prompt Engineering Guide.

Have you found any interesting ways to incorporate an LLM into your workflow? Share your thoughts and experiences in the comments below.
