A common frustration among expats working at Xomnia is the occasional struggle with the communication style of the Dutch government. When a colleague accidentally missed a deadline on a fine which turned out to be a fine for missing a fine, this “fineception” was all the motivation we needed to build a chatbot to answer all questions related to the City of Amsterdam - since all the relevant information is on their website, but spread throughout multiple pages that are several clicks deep.
Asking our questions directly to a chatbot would be a massive improvement over having to manually navigate the relevant sections of the website. Therefore, we started our chatbot project with a simple hypothesis: By using Large Language Models (LLMs), any website can be transformed into a chatbot.
This blog will explain what we did, how we did it, and what we found to be the peculiarities of creating a chatbot based on LLMs. The chatbot we created can be viewed here.
LLM versus traditional NLP
This blog aims to explain how to work with LLMs. We therefore assume that the target audience has some intuitions about Natural Language Processing (NLP). However, these intuitions don’t all transfer to the field of LLMs. Let’s explain why by examining our use case.
Chatbots have been a staple of customer service for a while now. Before LLMs, we would have used traditional NLP to make a chatbot. NLP techniques such as rule-based systems, pattern matching and template-based responses are at the lower end of technical complexity. A more complex NLP technique would be to use a neural net to answer questions by training on a dataset of questions and answers.
These techniques work and are widely used. They do, however, have a massive downside: they require a dataset of question-answer pairs or human-defined rules. We don’t want to have to create these new datasets; we would much rather just use the information already available to us. In our example, we have a treasure trove of information on the municipality’s website, but none of it can be easily used by traditional NLP methods.
This limitation of traditional NLP is what makes LLMs a game changer, since they work in the exact opposite way: LLMs require only pre-existing resources. As a side note, the methods we explore in this blog have been used on the website of the City of Amsterdam, but are widely applicable to other websites too.
Extracting data from the website
The data we need to create a chatbot is found on the municipality’s website, but we do need to apply some processing steps. We extracted all English-language web pages from the publicly available website using a web crawler. Note that before performing web crawling (i.e. downloading HTML files) on any website, you should check the “robots.txt” file, which specifies the rules for crawling that website. We start at the home page and recursively follow every link to an “amsterdam.nl/en/…” URL. Some simple HTML processing steps then leave us with a heap of unstructured, raw text.
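To make the crawling rules concrete, here is a minimal sketch of the link-selection step using only the Python standard library. It is not the crawler we actually ran: the URL prefix, the helper names and the in-memory robots.txt rules are assumptions chosen for illustration, and the example deliberately downloads nothing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.robotparser import RobotFileParser

# Assumed URL prefix for the English pages; adjust for your own site.
ALLOWED_PREFIX = "https://www.amsterdam.nl/en/"


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_follow(page_url, html, robots, user_agent="*"):
    """Return the in-scope links on one page that robots.txt allows us to fetch."""
    extractor = LinkExtractor()
    extractor.feed(html)
    found = []
    for href in extractor.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if not absolute.startswith(ALLOWED_PREFIX):
            continue
        if not robots.can_fetch(user_agent, absolute):
            continue
        if absolute not in found:
            found.append(absolute)
    return found


# Hypothetical robots.txt rules and page content, kept in memory for the example.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /en/private/"])

sample_html = """
<a href="/en/civil-affairs/">Civil affairs</a>
<a href="/en/private/notes/">Private</a>
<a href="/nl/burgerzaken/">Dutch page</a>
<a href="https://example.com/">External</a>
"""
print(links_to_follow("https://www.amsterdam.nl/en/", sample_html, robots))
```

A real crawler would call this for every downloaded page, keep a set of visited URLs, and load the site's actual robots.txt with `RobotFileParser.set_url` and `read` instead of the hard-coded rules.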
If this were a traditional NLP project, we would spend our time on finding the right processing steps for the text (think of stop word deletion, lemmatization, stemming, and so on). These steps are not needed when working with LLMs. Instead, we try to remove as much uninformative text as possible. For starters, we remove large sections of text that are repeated between multiple HTML files. We also remove sections of text that feature hyperlinks, e.g.: “click here for more information”. When converted to text, these sections tend to be confusing as the implied information is not actually present. A final step is to use the URL to assign one or more topics to each webpage. As an example, the webpage with this URL:
will get the topics “civil-affairs” and “first-registration.” After these steps, we save the processed text to a vector database, which we will talk about more later.
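As a rough sketch of two of these cleaning steps, the snippet below drops paragraphs that recur across pages and derives topics from the URL path. The function names, the paragraph-level granularity, the placeholder page text and the example URLs are all assumptions made for illustration, not our production code.

```python
from collections import Counter
from urllib.parse import urlparse


def remove_repeated_paragraphs(pages, max_pages=1):
    """Drop paragraphs that occur on more than `max_pages` pages (menus, footers)."""
    counts = Counter()
    for text in pages.values():
        # Count each paragraph once per page.
        counts.update(set(text.split("\n\n")))
    return {
        url: "\n\n".join(p for p in text.split("\n\n") if counts[p] <= max_pages)
        for url, text in pages.items()
    }


def topics_from_url(url, prefix="/en/"):
    """Use the path segments after the language prefix as topics."""
    path = urlparse(url).path
    if not path.startswith(prefix):
        return []
    return [segment for segment in path[len(prefix):].split("/") if segment]


# Hypothetical URLs and placeholder text, for illustration only.
pages = {
    "https://www.amsterdam.nl/en/civil-affairs/first-registration/":
        "Home | Contact\n\nPlaceholder text about first registration.",
    "https://www.amsterdam.nl/en/civil-affairs/":
        "Home | Contact\n\nPlaceholder text about civil affairs.",
}
cleaned = remove_repeated_paragraphs(pages)
for url, text in cleaned.items():
    print(topics_from_url(url), "->", text)
```

The resulting text and topics are what would be stored per page; how aggressively to treat a paragraph as "repeated" is a tuning choice that depends on the site.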
Choosing the right LLM and hosting it
Having gathered and processed the data for our chatbot, we turned our attention to choosing a suitable LLM and hosting option. The hype around ChatGPT has unleashed a gold rush among tech companies. Consequently, every large tech company now offers an LLM of its own: Llama 2 by Meta, JARVIS by Microsoft, PaLM 2 by Google, and so on. For those wanting to self-host an LLM, here’s a list of open source LLMs. The recent rise of small language models expands our options even further.
We strongly recommend using a fully managed LLM. Self-hosting an LLM will quickly become a painful cascade of infrastructure needs that is going to hurt at some point. Mainly, this is because if your application is time critical, the LLM will have to run on a beefy GPU with CUDA, and you will have to either provision your own server or configure a massively expensive virtual machine. Containerizing anything with CUDA drastically increases memory requirements, leading to difficulties in deploying the app.
We recommend self-hosting LLMs only if you have a strong reason to do so. The other alternative to fully managed LLMs is small language models, which somewhat mitigate the requirements for beefy hardware. However, using a small language model comes at the cost of having to finetune and retrain the model, which will require many additional human working hours.
In the end, we chose a fully managed LLM, ChatGPT, to build our question-answer bot.
What is prompt crafting?
Now that we have the data and the model for our application, let’s investigate the working principle of our LLMs. LLMs tend to challenge some of the intuitions that a data scientist may have regarding large pre-trained neural networks. Typically, when a large pretrained model (e.g. a computer vision model) is released, it has been trained on a very general dataset and will have to be finetuned on a dataset that is specific to our use case.
LLMs, however, require no retraining at all (although this is possible). This might seem confusing. ChatGPT is trained on a mix of licensed data, data created by human trainers, and publicly available data. It is very unlikely that any of this data mentions the recycling policy of the Amsterdam municipality, so how could it possibly know how to answer our questions?
Instead of finetuning, we have to perform “prompt crafting”. A prompt is the message we send to the LLM. We create a prompt consisting of context and a question, and ask the LLM to answer the question using the context.
Answer this question:
“Do I need a permit to moor a boat in Amsterdam?”
By using this context:
“Boating has a rich history in Amsterdam’s canals -
Mooring in Amsterdam is only possible with a valid vignette -
SAIL Amsterdam is a yearly boat show in Amsterdam”
In the provided example, the LLM will know to investigate the context and can answer the question correctly because we provided all the relevant information in the prompt itself.
Automated prompt crafting
As mentioned in the previous section, the working principle of prompt crafting is to combine a question with the context required to answer the question. This can be done automatically quite simply by filling the empty spots in a template with one or more documents:
Answer this question:
By using this context:
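Filling such a template takes only a few lines of Python. The sketch below is one possible way to do it: the placeholder names `question` and `context` and the function name `build_prompt` are our own choices for the example, and the documents are the ones from the mooring example earlier in this post.

```python
# Template with two empty spots; the placeholder names are assumptions.
PROMPT_TEMPLATE = (
    "Answer this question:\n"
    '"{question}"\n'
    "By using this context:\n"
    '"{context}"'
)


def build_prompt(question, documents):
    """Fill the template with the user's question and the selected documents."""
    return PROMPT_TEMPLATE.format(question=question, context="\n".join(documents))


prompt = build_prompt(
    "Do I need a permit to moor a boat in Amsterdam?",
    [
        "Boating has a rich history in Amsterdam's canals",
        "Mooring in Amsterdam is only possible with a valid vignette",
        "SAIL Amsterdam is a yearly boat show in Amsterdam",
    ],
)
print(prompt)
```

The resulting string is what gets sent to the LLM as a single message.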
Obviously, the question is provided by the user, so our task is to find the correct context. Because our data pre-processing is light and our model is already trained, finding the correct context is the most influential step in optimizing performance of our chatbot. Note also that the total amount of text an LLM can process is limited; therefore, we can’t upload all our data in hopes that the LLM will figure out the correct answer. In this section, we will review three methods of finding the best possible context: vector databases, topic selection and agents.
We briefly mentioned vector databases in our section on extracting data. A vector database stores key-value pairs where the value is a document and the key is a vector embedding. If you are new to the concept of vector databases, we advise you to read up on them as they are becoming a cornerstone of AI applications.
The value of a vector database is that it allows the user to perform semantic search. Given a word or sentence, semantic search finds documents that are similar in meaning. A more traditional database would look for documents by checking the presence of words in a document. To elaborate on the point above, we share the following example:
Suppose we want to look up information relevant to the question:
“What pets require a permit for ownership?”
A more traditional method of looking up information would be to query documents for keywords like: “dog, pet, cat, crocodile” and so on. This method has many problems, including but not limited to:
- Keyword search is not context sensitive: “Bat” could mean either an animal or an instrument for striking but keyword search will not distinguish between the two meanings.
- Keyword search is not exhaustive: There is no way to make sure we’ve included all keywords relevant to a query.
- Keyword search requires performing stemming and lemmatization on our text, or we would have to include every permutation of every keyword.
A vector database has none of these problems. With a vector database, an embedding function is used to group together documents that are similar in semantic meaning. For example, “king” and “president” are grouped together as they are both rulers of countries. The same embedding function is used to find documents that are semantically similar to the question posed by the user.
When using a vector database, we assume that the phrasing of a question and the phrasing of the answer are semantically related. This is not guaranteed. For example, say we ask the question: “What happens if I don’t eat for two weeks?”. Semantically, this question falls in the realm of “health” or “dangerous activities.” An embedding function may not extract this semantic meaning from just our question. It’s probable that it will return documents about restaurants, recipes, and diets due to the presence of the word “eat.” As a result, the app will not answer the question correctly.
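To show the mechanics of storing documents under embedding keys and searching by similarity, here is a deliberately tiny in-memory sketch. Important caveat: the `embed` function below is a toy word-count stand-in that we made up for the example, so it behaves like keyword matching and does not capture semantic meaning. A real setup would plug in a learned embedding model and a proper vector database; only the overall structure carries over.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call a learned embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Stores (embedding, document) pairs and returns the nearest documents."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.items = []

    def add(self, document):
        self.items.append((self.embed_fn(document), document))

    def search(self, query, k=1):
        query_vec = self.embed_fn(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[0]), reverse=True)
        return [doc for _vec, doc in ranked[:k]]


store = TinyVectorStore(embed)
store.add("Boating has a rich history in Amsterdam's canals")
store.add("Mooring in Amsterdam is only possible with a valid vignette")
store.add("SAIL Amsterdam is a yearly boat show in Amsterdam")
print(store.search("Is a vignette required for mooring?", k=1))
```

Swapping `embed` for a real embedding function is the only change needed to turn this lookup into actual semantic search, which is also where the limitation described above comes from: the search is only as good as the embedding.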
As an extra precaution, we previously assigned topics to all of our documents. We can use ChatGPT to ask which topics are relevant for answering a given question. This allows us an extra handle on finding the correct documents for a question by filtering on topics. This is useful when the connection between a question and the information required to answer it is not directly semantically related.
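A minimal sketch of this topic filter could look as follows. The prompt wording, the document layout (dictionaries with "text" and "topics" keys) and the `ask_llm` callable are assumptions for illustration; `ask_llm` stands in for whatever function sends a prompt to your LLM and returns its reply as a string.

```python
def select_topics(question, known_topics, ask_llm):
    """Ask the LLM which known topics are relevant; keep only topics we actually have."""
    prompt = (
        "Which of these topics are relevant to the question? "
        "Reply with a comma-separated list.\n"
        f"Topics: {', '.join(sorted(known_topics))}\n"
        f"Question: {question}"
    )
    reply = ask_llm(prompt)
    suggested = {part.strip().lower() for part in reply.split(",")}
    # Intersect with the known topics so hallucinated topics are ignored.
    return suggested & set(known_topics)


def filter_by_topics(documents, topics):
    """Keep documents that share at least one topic with the selection."""
    return [doc for doc in documents if topics & set(doc["topics"])]


def fake_llm(prompt):
    # Stand-in for a real LLM call, so the example runs offline.
    return "civil-affairs, made-up-topic"


documents = [
    {"text": "Placeholder text about registration.", "topics": ["civil-affairs", "first-registration"]},
    {"text": "Placeholder text about parking.", "topics": ["parking"]},
]
known = {topic for doc in documents for topic in doc["topics"]}
topics = select_topics("How do I register as a new resident?", known, fake_llm)
print(filter_by_topics(documents, topics))
```

The filtered documents can then be passed on to the vector search, so both mechanisms narrow down the context together.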
LangChain is a Python library with many functionalities for interacting with LLMs. Its most useful feature is probably the agent. An agent can break up a complex question into simpler questions (“task decomposition”) and use Python functions to answer questions.
If we ask “What is the fifth Fibonacci number?” to an agent that has been provided a Python function for computing the Fibonacci sequence, it will know to use the function to answer the question. LangChain gives the following question as an example for task decomposition: “Where does the winner of the 2023 US Open live?” An agent will identify the first task: “Who won the 2023 US Open?” and the second task is the logical follow-up: “Where does Novak Djokovic live?” to get to the final answer: “Marbella, Spain.” Agents can be used to answer questions directly or as a tool to perform an intermediate step. They are useful in combination with the previous methods if a question needs to be decomposed into multiple questions.
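The tool-use idea can be illustrated without any library. The sketch below is hand-rolled and is not LangChain's API: the `TOOLS` registry, `run_agent` and `fake_choose_tool` are hypothetical names, and the decision that an LLM would normally make is replaced by a hard-coded stand-in so the example runs offline. It shows only tool selection, not task decomposition.

```python
def fibonacci(n):
    """Return the n-th Fibonacci number, counting from fibonacci(1) == 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Registry of Python functions the agent is allowed to call.
TOOLS = {"fibonacci": fibonacci}


def run_agent(question, choose_tool):
    """Let the model pick a tool and an argument, then run that tool."""
    tool_name, argument = choose_tool(question, list(TOOLS))
    if tool_name not in TOOLS:
        return "No suitable tool available."
    return str(TOOLS[tool_name](argument))


def fake_choose_tool(question, tool_names):
    # Stand-in for the LLM's decision; a real agent would prompt the model here.
    lowered = question.lower()
    if "fibonacci" in lowered and "fifth" in lowered:
        return "fibonacci", 5
    return "none", None


print(run_agent("What is the fifth Fibonacci number?", fake_choose_tool))
```

In a real agent, the model's reply names the tool and its arguments, and the result is fed back into the conversation so the model can decide on the next step.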
Creating a user interface for LLM
We are almost at the point where the rubber meets the road. The website’s data is combined with the user’s question to form a prompt, which is answered by an LLM. The technical components can be served with a lightweight API because we outsourced the heavy lifting to a fully managed LLM. The biggest components of our API will be the vector database and the embedding model. The result can be neatly packaged in a Docker container and served by a web app using your cloud provider of choice. To give our chatbot a user interface, we used a static HTML page. You are invited to play around with the bot via this link.
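Stripped of the web framework, the core of such an API is one small function. The following is a sketch under assumptions rather than our deployed code: `retrieve` and `ask_llm` are hypothetical callables standing in for the document search and the managed LLM, and both are replaced by fakes here so the example runs without network access.

```python
def answer_question(question, retrieve, ask_llm, k=3):
    """Retrieve context, build the prompt, and return the LLM's reply with its sources."""
    documents = retrieve(question, k)
    prompt = (
        f'Answer this question:\n"{question}"\n'
        "By using this context:\n" + "\n".join(documents)
    )
    return {"answer": ask_llm(prompt), "sources": documents}


def fake_retrieve(question, k):
    # Stand-in for the vector database lookup.
    return ["Mooring in Amsterdam is only possible with a valid vignette"][:k]


def fake_llm(prompt):
    # Stand-in for the managed LLM; it only reports the prompt size.
    return f"(model reply to a prompt of {len(prompt)} characters)"


result = answer_question("Do I need a permit to moor a boat in Amsterdam?", fake_retrieve, fake_llm)
print(result["answer"])
print(result["sources"])
```

Returning the retrieved documents alongside the answer makes it easy for the user interface to point users to the underlying pages.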
Conclusion and Observations
As mentioned earlier, there exists a gold rush atmosphere around the topic of LLMs. People are thinking of new and interesting use cases and we have no doubt that companies are being founded right now that will become big names in the near future.
Despite how interesting these developments are, the real innovation when it comes to this blog lies within reusing a resource that already exists: Only the text on a website in this case.
A few years ago, we would have had to toss aside most of the hard work done in growing the corpus of knowledge found on “https://www.amsterdam.nl/en/” when creating a chatbot and create an entirely new dataset. Nowadays, the website is really all we need.
We started this blog with a small thesis statement: Any website can be turned into a chatbot. It worked for the website of the municipality of Amsterdam. If you are interested in building your own chatbot, below are some of our observations during the development of this chatbot:
- A big challenge of working with LLM apps is working with highly unstructured data. This will happen by default, however, because if the data were well structured, the use case at hand could have been solved using classical NLP techniques.
- Testing an LLM application is difficult from a software engineering perspective. It is not possible to assert the correctness of our responses by an equality check. We’ve found that it helps to find questions that call for a very precise, known response. Questions that ask “Who?”, “Where?” or “When?” can be used by checking if the response contains a name, place or time. Truthfulness checks exist but are meant for universal datasets, not specific applications.
- Focusing on user experience pays off massively. LLMs are incredibly versatile, so even if it is not possible to answer the question of the user exactly, we can still provide related information. We can even point to relevant information because we already performed document search.
- Agents are a very “cool” method of working with LLMs, but are highly unpredictable. A minor rephrasing of exactly the same question will cause wildly different answers. They are fascinating when they exhibit “human” behavior, only to demonstrate a baffling lack of understanding for seemingly simple questions. At the time of writing they are brand new, so who knows where they will be in a few years?
- In our case, vector databases were probably a bit overkill, as the data was compact enough to fit completely in memory. If this is true for your application, avoid vector databases because they can call for a heavier Docker image like bookworm instead of the lightest image that works.
- Although it was out of scope for our project, we can recommend implementing a form of feedback from the user as to the quality of the response. This can allow you to cache answers from correctly answered questions, as we found out in internal use that some questions pop up to a disproportionate degree.
- It cannot be emphasized enough: Only self-host an LLM if you really need to!
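The testing observation above can be turned into a simple containment check. The sketch below is one way to do it, with assumptions labeled: `response_mentions` and `fake_chatbot` are hypothetical names, the chatbot is replaced by a canned reply, and the expected term comes from the mooring example earlier in this post.

```python
def response_mentions(response, expected_terms):
    """True if the response contains at least one expected term (case-insensitive)."""
    lowered = response.lower()
    return any(term.lower() in lowered for term in expected_terms)


def fake_chatbot(question):
    # Stand-in for the real chatbot; returns a canned reply.
    return "Yes, mooring in Amsterdam is only possible with a valid vignette."


# Each test case pairs a question with terms a correct answer should contain.
test_cases = [
    ("Do I need a permit to moor a boat in Amsterdam?", ["vignette"]),
]

for question, expected in test_cases:
    reply = fake_chatbot(question)
    print(question, "->", "PASS" if response_mentions(reply, expected) else "FAIL")
```

Such checks do not prove that an answer is correct, but they catch regressions where the chatbot stops mentioning the one fact it must mention.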
If you are interested in what we can do for your company, contact us!