
Automatic Prompt Engineering (Part II: Building an APE Workflow)

Categories: Chatbots, Prompt Engineering, LLM, NLP

This is a continuation of the blog post series Automatic Prompt Engineering for Retrieval-Augmented Generative Models. In Part I, we discussed the main concepts of the Prompt Engineering paradigm, some of the companies offering Prompt Engineering solutions, what a RAG (Retrieval-Augmented Generative model) is, and how prompting affects the quality of the answers we get.

Now, let’s talk about how to tackle the issues, including hallucinations, using Prompt Engineering techniques.

Prompt Engineering

Prompt Engineering refers to the intentional design and construction of prompts to guide language models in generating desired outputs. It involves formulating clear and specific instructions or queries that elicit the desired information or response from the model. Prompt engineering plays a crucial role in harnessing the capabilities of language models and optimizing their performance for various applications.

Now, let’s come back to the RAG use case. Let’s suppose you run a company and decide to implement a chatbot to cross-pollinate the organization’s knowledge across all employees and departments, so that a user just logs in and asks the chatbot a question.

How could you guarantee that the answers they are getting from the chatbot are not sub-optimal? Well, the answer is simple: if you want a quality answer, you need a quality question.

Manually adjusting all the questions is impossible. However, we can do that automagically.

Automatic Prompt Engineering

One groundbreaking development in this field is the emergence of Automatic Prompt Engineering (APE). Automatic Prompt Engineering refers to the systematic and intelligent generation of prompts to guide language models, enabling more precise and targeted outputs, minimizing hallucinations.

In the paper Large Language Models Are Human-Level Prompt Engineers, three ways of using LLMs are mentioned to make sure the prompts return quality results.

  1. LLM as an inference model: The typical use of an LLM. You ask a question (e.g., “What does error SB5005 mean?”) and you get an answer (e.g., “The error code means the server is restarting”).
  2. LLM as a resampling / paraphrase model: You can ask the model to generate other ways to ask the same question, adding more specific details of things to avoid / prevent / pay attention to (for example, including sources). For instance, other ways to say “What does the SB5005 error code mean?” are:
    - Could you please provide an explanation of the SB5005 error code?
    - Could you elaborate on the SB5005 error code and its implications?
    - …
  3. LLM as a scoring model: You can use the LLM to score the results given some conditions. It’s important that we create a series of checks which, given a question Q, knowledge K and an answer A, can evaluate several parameters such as truthfulness, informativeness, objectivity, how literal the answer is, inclusion of sources, quotes, etc. You can also use other kinds of metrics such as ROUGE, BLEU, or regular expressions to check whether there are links or quotes, as additional metrics (see the code sketch after this list).

  4. Last, but not least, you can make decisions based on the evaluation. For example:
    a. You can resample (see point 2) and use the best version of your prompt;
    b. You can just adjust the prompt by adding some improvements.
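To make points 2 and 3 more concrete, here is a minimal sketch of what the paraphrase and scoring calls could look like in Python. The call_llm helper and the prompt wording are our own placeholders, not the prompts used in the paper.

```python
# Minimal sketch of using an LLM as a resampling (paraphrase) and scoring model.
# `call_llm` is a hypothetical helper standing in for your LLM client of choice.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your LLM provider and return the text reply."""
    raise NotImplementedError("Wire this up to OpenAI, Anthropic, a local model, etc.")

def paraphrase_question(question: str, n: int = 3) -> list[str]:
    """LLM as a resampling model: generate alternative phrasings of the question."""
    prompt = (
        f"Generate {n} alternative ways of asking the following question, "
        f"keeping the meaning identical. Return one per line.\n\nQuestion: {question}"
    )
    return [line.strip("- ").strip() for line in call_llm(prompt).splitlines() if line.strip()]

def score_answer(question: str, knowledge: str, answer: str) -> str:
    """LLM as a scoring model: rate an answer given the question and the retrieved knowledge."""
    prompt = (
        "Given the question, the knowledge and the answer below, rate the answer "
        "from 1 to 5 on truthfulness, informativeness and objectivity. Reply in the "
        "format `truthfulness=<n>, informativeness=<n>, objectivity=<n>`.\n\n"
        f"Question: {question}\nKnowledge: {knowledge}\nAnswer: {answer}"
    )
    return call_llm(prompt)
```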

If we put all of those steps together, we end up with a process for creating an optimized prompt, possibly a version of the one entered by the user, that maximizes the scores.

In our case we will not use the resampling mechanism (2 and 4a). Instead, we will just add some adjustments (4b) to the prompts we are using, depending on the evaluation metrics we get.
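Put together, this workflow boils down to a small loop: answer, evaluate, and, if some check fails, append adjustments to the prompt and try again. The helpers below (answer_question, evaluate_answer, adjust_prompt) are placeholders for the components described in the rest of this post, not a finished implementation.

```python
# High-level sketch of the adjustment-based workflow (option 4b): answer, evaluate,
# and, while some check fails, append adjustments to the prompt and retry.
# All helpers are placeholders for the components described in the sections below.

def answer_question(prompt: str, knowledge: str) -> str:
    raise NotImplementedError("Your RAG pipeline: pass the knowledge and prompt to the LLM.")

def evaluate_answer(question: str, knowledge: str, answer: str) -> list[str]:
    raise NotImplementedError("Return the names of the checks that failed (see the two groups below).")

def adjust_prompt(prompt: str, failed_checks: list[str]) -> str:
    raise NotImplementedError("Append feedback instructions to the prompt (see 'Feedback').")

def run_ape_loop(question: str, knowledge: str, max_rounds: int = 3) -> str:
    prompt, answer = question, ""
    for _ in range(max_rounds):
        answer = answer_question(prompt, knowledge)
        failed = evaluate_answer(question, knowledge, answer)
        if not failed:
            break                                  # all checks passed, keep this answer
        prompt = adjust_prompt(prompt, failed)     # otherwise improve the prompt and retry
    return answer
```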

Evaluation mechanisms and metrics

There are several checks that we want to include in our evaluation scores, and we will group them depending on whether an LLM is required for the evaluation.

First group: No LLM required

In this group we can include:

  1. Metrics such as BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation). These are primarily designed to evaluate the quality of machine-generated text, such as translations or summaries. While they can provide some insight into the quality of an answer to a question, they are not specifically tailored for evaluating question-answering systems. Nevertheless, they can still be used as reference metrics for a rough evaluation. Here’s how they work:
  • BLEU (Bilingual Evaluation Understudy): BLEU is commonly used for evaluating the quality of machine translation output by comparing it to one or more human reference translations. It calculates a score based on n-gram precision, where n-grams are sequences of n words. BLEU evaluates the overlap of n-grams between the generated text and the reference(s), rewarding exact matches and partial matches. A higher BLEU score indicates a closer resemblance to the references.
  • Applying BLEU to evaluate an answer to a question involves considering the generated answer as a translation of the question. You would need to treat the question as the reference and compare it to the generated answer using BLEU. However, this approach may not capture the specific nuances and adequacy of the answer to the question.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is commonly used for evaluating the quality of summarization systems by comparing the generated summary to one or more human reference summaries. It calculates recall scores based on overlapping units, such as n-grams, word sequences, and skip-bigrams. ROUGE assesses the coverage of important information in the generated summary compared to the references.
  • To evaluate an answer using ROUGE, you would need to treat the generated answer as a summary of the question and compare it to human reference summaries using ROUGE scores. However, similar to BLEU, ROUGE may not fully capture the relevance and accuracy of the answer to the question.

2. Other checks, such as regular expressions for matching quotes or URLs, which allow you to better understand whether the answer is well supported.

Here is an example of how the Galileo AI Prompt Inspector returns the BLEU, ROUGE and Hallucination scores in its evaluation process.
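As a rough illustration of this first group, the sketch below computes BLEU and ROUGE between the question and the answer, treating the question as the reference (with the caveats mentioned above), and adds two regex checks for URLs and quotes. It assumes the nltk and rouge-score packages are installed.

```python
import re
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  # pip install nltk
from rouge_score import rouge_scorer                                    # pip install rouge-score

def reference_free_checks(question: str, answer: str) -> dict:
    """Rough, LLM-free checks: BLEU/ROUGE overlap with the question plus regex checks."""
    bleu = sentence_bleu(
        [question.split()], answer.split(),
        smoothing_function=SmoothingFunction().method1,  # avoid zero scores on short texts
    )
    rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge_l = rouge.score(question, answer)["rougeL"].fmeasure
    return {
        "bleu": bleu,
        "rougeL": rouge_l,
        "has_url": bool(re.search(r"https?://\S+", answer)),   # does the answer link a source?
        "has_quote": bool(re.search(r'"[^"]+"', answer)),       # does the answer quote anything?
    }

# Example usage (the error code and URL are illustrative):
# reference_free_checks("What does error SB5005 mean?",
#                       'The error "SB5005" means the server is restarting (see https://docs.example.com).')
```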

Second group: LLM required

As we said before, LLMs can also be used for evaluation. After getting the answer, we can come back to the model with another question, this time to retrieve a score on different parameters.

The Generated Knowledge Prompting for Commonsense Reasoning paper proposes one way of doing this.

However, at Mantis NLP we recommend the following checks:

  • Explicitness: Is the question contained explicitly in the answer, or does the answer literally mention parts of it?
  • Helpfulness: Does the provided answer really help answer the question, supporting or contradicting it?
  • Directness: Does the provided answer support the question directly?
  • Grammaticality: Is the answer grammatically correct and complete?
  • Relevance: Is the answer relevant to the question, covering the same topics and no others?
  • Edge: Does the answer contain edge cases?
  • Factuality: Does the answer only provide facts from the knowledge and nothing else?
  • Supposition: Does the answer contain a conclusion or supposition extracted by analyzing the knowledge?
  • Objectivity: Is the answer provided totally objective given what is expressed in the knowledge?
  • Creativity: What is the level of creativity present in the answer given the question?
  • ThirdPartyOpinion: Does the answer contain a third-party opinion?
  • Source: Does the answer provide the source of information?
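A minimal sketch of how these checks could be implemented as a second LLM call is shown below. The subset of checks, the 1–5 scale, the prompt wording and the call_llm helper are illustrative assumptions on our side, not a fixed recipe.

```python
# Sketch: ask the LLM to grade an answer on a few of the checks listed above.
# `call_llm` is a hypothetical helper for whatever LLM client you use.
import json

CHECKS = ["Explicitness", "Helpfulness", "Grammaticality", "Relevance",
          "Factuality", "Objectivity", "Source"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM provider here.")

def llm_checks(question: str, knowledge: str, answer: str) -> dict:
    """Second LLM call: grade the answer on each check, returning a name -> score mapping."""
    prompt = (
        "You are an evaluator. Given the question, the knowledge and the answer below, "
        f"grade the answer on each of these criteria: {', '.join(CHECKS)}. "
        "Use a score from 1 (fails) to 5 (fully satisfies) and reply with a JSON object "
        "mapping each criterion to its score.\n\n"
        f"Question: {question}\nKnowledge: {knowledge}\nAnswer: {answer}"
    )
    return json.loads(call_llm(prompt))  # assumes the model returns valid JSON
```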

Feedback

If a metric fails, the evaluation system will recommend a change to your prompt. The Automatic Prompt Optimization (APO) paper proposes a Deep Learning model which will adapt your prompt for you.

However, we see 3 different ways to do it, some simpler than others:

  1. A rule-based system. Each metric has a suggestion assigned, and we simply append those suggestions as instructions to the prompt (see the sketch after this list).
  2. A Deep-learning model adapting the prompt, as stated before.
  3. And probably the easiest one: triggering another call to the LLM to ask for a better way to rephrase the prompt, given that it triggered the X1..Xn evaluation issues (see Part IV of this series).
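For option 1, the rule-based system can be as simple as a lookup table mapping each failed check to an instruction that gets appended to the prompt. The suggestion texts below are only examples; you would tailor them to your own checks and thresholds.

```python
# Sketch of option 1: map failed evaluation checks to instructions appended to the prompt.
# The suggestion texts are only examples; adapt them to your own checks.

SUGGESTIONS = {
    "Source": "Include the source of the information and, if possible, a direct quote.",
    "Factuality": "Only use facts contained in the provided knowledge; do not add anything else.",
    "Relevance": "Stay strictly on the topic of the question and do not discuss other topics.",
    "Objectivity": "Answer objectively, without opinions or speculation.",
}

def adjust_prompt(prompt: str, failed_checks: list[str]) -> str:
    """Append one instruction per failed check to the original prompt."""
    instructions = [SUGGESTIONS[c] for c in failed_checks if c in SUGGESTIONS]
    if not instructions:
        return prompt
    return prompt + "\n\nWhen answering, follow these instructions:\n- " + "\n- ".join(instructions)

# Example: adjust_prompt("What does error SB5005 mean?", ["Source", "Factuality"])
```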

If you want more information about any of those three approaches, or you are more interested in the prompt for option 3, reach out to us at hi@mantisnlp.com.

Check out the next entry of the blog, where we will share hands-on code on how to build an APE workflow.

In this second blog post, you got to know the main techniques involved when creating Automatic Prompt Engineering (APE) workflows. In the subsequent 2 blog posts (Parts III and IV) of the series, we will share with you code snippets showcasing some of the steps we have explained, such as Automatic Knowledge Retrieval, Automatic Knowledge Integration, Evaluation and Prompt Improvement.

Want to know more?

I hope you enjoyed this post. Remember this is a series of four posts; make sure you read them in sequential order, starting from Part I.

If you need help in your NLP processes, MantisNLP will be glad to help! Feel free to reach out at hi@mantisnlp.com and stay tuned for more blog posts!


Automatic Prompt Engineering (Part II: Building an APE Workflow) was originally published in MantisNLP on Medium.
