Leveraging your own data with LLMs — Experiments, Limitations and Future Improvements

Rui Oliveira
Published in Runtime Revolution
12 min read · Dec 7, 2023


The Challenge

LLMs have been undeniably garnering significant attention in the world of software development. These powerful and versatile models enable the automation and acceleration of a wide variety of Natural Language Processing tasks, and their potential applications seem almost endless. Here are just a few examples of the types of tasks LLMs can be applied to (with many more frequently being added to the list):

  • Text Generation
  • Summarisation
  • Sentiment Analysis
  • Chatbots
  • Translation

Although LLMs have access to information on a large variety of topics (up to their training data cutoff, at least; at the time of writing, GPT-4 for example only holds information up to September 2021), a common challenge companies and developers currently face is how to leverage the capabilities of LLMs on real-world data. It's great that LLMs have access to such a vast array of information, but what about our own data? What if we could use our own databases to complete tasks with LLMs in such a way that the answers we get back are empowered by our own private information, which isn't available anywhere else?

The challenge we set for ourselves was thus: given that you have access to a database holding some information, how can you leverage LLM technology to gain some insight into the data contained in that database? This challenge was the start of a series of experiments with LLMs, of which we will highlight two, that helped us deepen our understanding of LLM technology.

Note: All examples presented use the famous Sakila database, which simulates data for a movie rental store and thus contains information about movies, actors, rental details, and so on. The concepts explored can nonetheless be extended to other relational databases.

Two main approaches seem to dominate when it comes to using your own data with LLMs: Fine Tuning and Retrieval Augmented Generation (RAG).

Fine tuning refers to the process of taking a pre-trained LLM and further training it to improve its performance at a specific task. This is relevant to our topic because it can also help make the LLM more domain specific, adding new information which wasn’t previously part of its knowledge base. In our case this would mean training our model on data coming from our chosen database.

Although fine tuning is a very enticing way to incorporate your own data into an LLM's knowledge base, there are some obstacles and difficulties associated with this approach. Firstly, the process of fine tuning is very computationally expensive and time-consuming. You may also have to pre-process your data so that it is usable in the fine tuning process, which presents its own challenges.

Given these challenges we turned our attention to another common approach: Retrieval Augmented Generation (RAG). With RAG, when a user asks an LLM a question that pertains to information not available in the LLM's knowledge base, the question is first passed to a specific type of model called a Retriever. This Retriever system looks at your knowledge base and searches for the pieces of documentation that seem most relevant for answering your question. The returned documents, as well as the question asked, are then added to a prompt (thus augmenting it). This augmented prompt is passed to an LLM, which uses the documents returned by the Retriever as context for answering the question.

The Retrieval Augmented Generation Flow
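To make this flow a little more concrete, here is a minimal, purely illustrative sketch in Python; the toy keyword-overlap "retriever" and the hard-coded documents are stand-ins for the real retriever and knowledge base described below:

```python
# Illustrative sketch of the RAG flow: retrieve relevant documents,
# augment the prompt with them, then send the prompt to an LLM.
# The keyword-overlap retriever and the document list are placeholders.

knowledge_base = [
    "ACADEMY DINOSAUR is a 2006 documentary rated PG.",
    "PENELOPE GUINESS acted in ACADEMY DINOSAUR.",
    "The store in Lethbridge rented 200 films in May 2005.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by naive keyword overlap with the question.
    words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: -len(words & set(d.lower().split())))
    return scored[:k]

def build_augmented_prompt(question: str, context: list[str]) -> str:
    return (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(f"- {doc}" for doc in context) +
        f"\n\nQuestion: {question}\nAnswer:"
    )

question = "Who acted in ACADEMY DINOSAUR?"
prompt = build_augmented_prompt(question, retrieve(question, knowledge_base))
print(prompt)  # this augmented prompt would then be sent to the LLM
```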

RAG Using Text

In order to use the RAG methodology, we first had to find a way to translate our databases into a format usable by a retriever. Since these models expect to have a knowledge base available for the retrieval of information, the approach we settled on was to programmatically translate our database into a textual representation of its data. If a table represented a list of users, for example, each row was transformed into a small paragraph explaining the characteristics of that particular user. These texts were then split up into small chunks, converted into embeddings (for the sake of brevity, we can say these are just numerical representations of our text that retrievers can interpret) and saved into a Vector Database (a database tailored to holding embeddings; we used ChromaDB in our setup). Conversion into embeddings was carried out with the FlagEmbedding model, which often takes the top spots in the Overall and Retrieval categories of the MTEB (Massive Text Embedding Benchmark) leaderboard, and as such seemed well suited for this task.

Transformation of a database into embeddings usable by a retriever system
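As an illustration, a stripped-down version of this transformation step could look like the sketch below; the row-to-text wording, the collection name and the exact BGE checkpoint (BAAI/bge-large-en) are our own assumptions rather than the precise setup used:

```python
# Sketch: turn database rows into short texts, embed them with a
# FlagEmbedding (BGE) model and store them in a ChromaDB collection.
import chromadb
from FlagEmbedding import FlagModel

# Hypothetical rows, as if read from the Sakila "actor" table.
rows = [
    {"actor_id": 1, "first_name": "PENELOPE", "last_name": "GUINESS"},
    {"actor_id": 2, "first_name": "NICK", "last_name": "WAHLBERG"},
]

# Each row becomes a small paragraph describing that entity.
texts = [
    f"The actor with id {r['actor_id']} is named "
    f"{r['first_name']} {r['last_name']}."
    for r in rows
]

# Embed the texts (the checkpoint name is an assumption).
model = FlagModel("BAAI/bge-large-en")
embeddings = model.encode(texts)

# Persist the embeddings in a local ChromaDB collection.
client = chromadb.PersistentClient(path="./sakila_vectors")
collection = client.get_or_create_collection("sakila")
collection.add(
    ids=[f"actor-{r['actor_id']}" for r in rows],
    documents=texts,
    embeddings=[e.tolist() for e in embeddings],
)
```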

With our database now converted into embeddings, the last step was to connect everything to an LLM and start asking questions. Interfacing with the various models we tried was done through the popular LangChain framework. The first thing we tried was setting up a local model, running on our own machine. We experimented with the Nous Hermes 13B model, an enhanced version of the Llama 13B model that rivals GPT-3.5-turbo in performance across a variety of tasks. We were able to get some promising results with this setup, which seemed to return well-structured answers to our questions. There was only one problem: its performance. Since local LLMs are dependent on the computing power of the machine running them, getting back an answer to a simple question could take upwards of 3 minutes. We could of course have looked into cloud computing solutions for hosting our model, but this would have added its own set of constraints and associated costs.
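Wiring the local model to the vector database through LangChain looked roughly like the following sketch; class names reflect the LangChain API at the time, and the model file path and parameters are placeholders:

```python
# Sketch: a RetrievalQA chain backed by a local Nous Hermes 13B model.
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.llms import LlamaCpp
from langchain.vectorstores import Chroma

embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-large-en")
db = Chroma(
    collection_name="sakila",
    persist_directory="./sakila_vectors",
    embedding_function=embeddings,
)

# Placeholder path to a locally downloaded model file.
llm = LlamaCpp(model_path="./models/nous-hermes-13b.gguf", n_ctx=4096)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=db.as_retriever(search_kwargs={"k": 4}),
    chain_type="stuff",  # stuff the retrieved documents directly into the prompt
)

print(qa.run("Which actors appear in the film ACADEMY DINOSAUR?"))
```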

This motivated us to try some alternatives, which led us to the OpenAI GPT family of LLMs. By using the GPT-3.5-turbo model we were able to greatly reduce the time needed, with those same time-consuming questions now taking mere seconds to answer (for a small fee, of course). You can find some examples below of the kind of question-answer pairs generated by this setup.

Example answers using GPT-3.5
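Swapping the local model for the hosted one amounted to little more than changing the LLM handed to the chain, roughly as sketched below (reusing the db retriever from the previous sketch):

```python
# Sketch: the same RetrievalQA chain, but backed by GPT-3.5-turbo.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=db.as_retriever(search_kwargs={"k": 4}),  # same Chroma retriever as before
    chain_type="stuff",
)
print(qa.run("Which actors appear in the film ACADEMY DINOSAUR?"))
```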

Limitations

The main bottlenecks we identified for this approach turned out to be not only the quality of the data being used as a knowledge base, but also that of the chosen retriever system. On one hand, if the data you are using is not of satisfactory quality (either because there is insufficient data in your database to answer the questions you wish to ask, or because you failed to convert it into text that adds enough contextualisation), the quality of the answers you get back will be equally poor. On the other hand, if the retriever system you are using is unable to locate the most relevant pieces of information needed to answer the question being asked, you will invariably find that even the most powerful large language model available will still fail to produce high-quality answers.

We could also conclude that, given our main objective of using data originating from a database, this approach would be better suited to very specific types of data. Since one of the necessary steps in our chosen approach is the conversion of each database table into text offering some context to our data, this approach seems better suited to databases holding tables that are already rich in context themselves and that aren't frequently updated. So if the database of your website or product has a lot of product descriptions, tutorials, FAQs or other heavily textual content, this method may be of interest to you.

Future Improvements

Given these limitations, possible improvements would probably focus on enhancing the quality of the document retrieval process. The quality of the retriever itself seems to have a large impact on results. Since we first ran these experiments, for example, a new version of the retrieval model we used has been released, which has a more reasonable similarity distribution and is now recommended over the previous version. Using an improved model such as this would probably have a positive impact on results.
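Assuming the model in question is the bge-large-en-v1.5 release, dropping it in would mostly be a matter of changing the checkpoint name and re-embedding the knowledge base:

```python
# Swapping in the updated embedding model (assumed to be bge-large-en-v1.5);
# the existing embeddings would need to be regenerated with it.
from FlagEmbedding import FlagModel

model = FlagModel("BAAI/bge-large-en-v1.5")
```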

In the same vein, improving the transformation process from database rows to contextualised text would probably have a positive impact as well. In this case, close collaboration with product owners, the developers responsible for database management and data analysis experts could help enhance the contextualisation provided in the produced documents, leading to a richer and more useful knowledge base.

It should also be noted that the experiments above were run with the OpenAI models available to us at the time, and using more recent (and more powerful) models would probably have a positive impact on results. OpenAI recently announced new versions of their GPT-3.5-turbo and GPT-4 models, which not only offer larger context windows but also better performance and enhanced instruction following (among other improvements), at a reduced cost. As such, it would probably be worthwhile to consider using these models.
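Trying them out would again be a one-line change; the model identifiers below are our assumption about which releases are meant:

```python
# Sketch: pointing the chain at the newer model versions.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo-1106", temperature=0)
# or, for the updated GPT-4:
llm = ChatOpenAI(model_name="gpt-4-1106-preview", temperature=0)
```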

PandasAI

Although the previously described RAG approach allowed us to ask questions of our data, we couldn't really take much advantage of the fact that we were dealing with data coming from a relational database. These kinds of databases organise data into tables and keep track of the different relationships between the entities within. If it were possible to keep the data in table format, the kind of questions we could ask would also shift, enabling us, for example, to perform some form of data analysis over our data.

Thus enters PandasAI, a Python library that adds Generative AI capabilities to the famous pandas data analysis and manipulation library. The concept is simple yet quite elegant: imagine you want to create a chart based on some data you hold. Usually you would write some code, maybe using pandas, that would take your data, make the necessary transformations and then plot that data as a chart. With PandasAI, you can just pass your request (e.g. "Plot the histogram of countries showing for each the gdp, using different colours for each bar"), which will then be encapsulated into a larger prompt. This prompt, which is then passed to an LLM of your choice, contains instructions asking the LLM to produce the code necessary for solving the problem described in your query. To add context to your query, a small sample of each table holding your data (5 rows from each table) is also passed along. The returned code is then run and the final result is presented to the user. Creating charts is just one example of what can be done with this library; it can also be used to filter tables, impute missing values, generate features in a dataset and more.
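A minimal sketch of what such a request looks like, using PandasAI's SmartDataframe interface (the data, the API key and the GDP figures are placeholders):

```python
# Sketch: asking PandasAI to plot a chart from a plain pandas DataFrame.
import pandas as pd
from pandasai import SmartDataframe
from pandasai.llm import OpenAI

df = pd.DataFrame({
    "country": ["Portugal", "Spain", "France"],
    "gdp": [252, 1397, 2780],  # illustrative values only
})

llm = OpenAI(api_token="sk-...")  # placeholder key
sdf = SmartDataframe(df, config={"llm": llm})

# The request is wrapped into a larger prompt, sent to the LLM, and the
# returned code is executed to produce the chart.
sdf.chat("Plot the histogram of countries showing for each the gdp, "
         "using different colours for each bar")
```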

In our setup, data is first extracted from the database and saved into pandas dataframes. These are passed to PandasAI, along with the user's questions. The following diagram describes the whole flow:

Data analysis flow with PandasAI
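In a stripped-down form, that flow can look like the sketch below; the connection string, table selection and question are placeholders, and SmartDatalake is just one way of passing several dataframes at once:

```python
# Sketch: pull Sakila tables into pandas and hand them to PandasAI.
import pandas as pd
from pandasai import SmartDatalake
from pandasai.llm import OpenAI
from sqlalchemy import create_engine

# Placeholder connection string for a local Sakila MySQL instance.
engine = create_engine("mysql+pymysql://user:password@localhost/sakila")

film = pd.read_sql("SELECT * FROM film", engine)
rental = pd.read_sql("SELECT * FROM rental", engine)

llm = OpenAI(api_token="sk-...")  # placeholder key
lake = SmartDatalake([film, rental], config={"llm": llm})

# Only a small sample of each table is included in the prompt for context.
print(lake.chat("How many rentals were made in May 2005?"))
```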

PandasAI is compatible with a wide variety of models, and we experimented with both GPT-3.5-turbo and GPT-4. Not surprisingly, the most consistent results seemed to be attained with GPT-4 (although at the expense of a much slower response time). Below you will find examples of textual, chart and table outputs obtained with PandasAI:

Example of a prompt resulting in a text output.
Example of a prompt resulting in a chart output.
Example of a prompt resulting in a table output.

Limitations

Although PandasAI is a very impressive tool, the very nature of its dependence on LLMs means that sometimes the results returned won't be the ones we expect. Since the prompt sent asks for the creation of code, there is always the possibility that code with errors is returned and run. Although the library has some measures in place to fight this problem (e.g. by default, if an error occurs the library will try to resolve the problem up to 3 times by sending new prompts aimed at fixing the erroneous code), you will sometimes find that your request still ends in an error.

Even if code with no errors is produced, there is also no guarantee that the final result will be exactly as you expected. Often, especially when asking for more complex charts or tables, you will find that the final result presented to the user has some slight variation from what was originally asked. From charts with inverted axes to tables showing columns you specifically asked to be excluded, it is sometimes the case that the generated final product will need some tweaking.

Luckily, PandasAI offers a way to look at exactly what code was generated, so you can always take a look at this code and make whatever changes you find most suitable. This allows you to make any changes you desire, although it also requires some knowledge of Python. More importantly, the ability to access this code means that it can be reused endlessly with no need to run the same prompt again. You can use this code, for example, to produce dashboards, with data being updated at set intervals.

Code example generated by PandasAI

It should be noted that the GPT-4 model seems to greatly reduce the number of errors and inconsistencies, so its use is probably recommended.

Future Improvements

Since PandasAI is an ongoing project, there are continuous improvements as new features are added and the library is further developed. One such feature, added since we first began experimenting with the library, is the possibility of using PandasAI as an Agent, which enables communication with the LLM in a conversational way. It is also now possible to ask clarifying questions, to ask for an explanation of a returned answer, or even to have a query rephrased. These kinds of features have the potential to greatly alleviate the need to make changes to the final code produced, and allow even users with no coding knowledge to interact with the library more easily.
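A rough sketch of what that conversational usage might look like is shown below; the method names follow the PandasAI documentation at the time of writing and may differ between versions:

```python
# Sketch of the conversational Agent interface; dataframes reused from
# the earlier extraction sketch, API key is a placeholder.
from pandasai import Agent
from pandasai.llm import OpenAI

llm = OpenAI(api_token="sk-...")
agent = Agent([film, rental], config={"llm": llm})

query = "Which month had the highest number of rentals?"
print(agent.clarification_questions(query))  # questions the agent would ask back
print(agent.chat(query))                     # the answer itself
print(agent.explain())                       # plain-language explanation of the answer
```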

Besides taking advantage of the functionalities being added to PandasAI, other improvements should probably be considered. PandasAI allows, for example, the usage of custom prompts, which can be used as an alternative to the default instructions the library passes along to the LLM. These custom prompts allow us to tailor the final output to our own use case and to add any constraints we desire during the code production step. They can also be used to give the LLM more context about our data, thus potentially reducing the amount of errors generated.

Finally, since PandasAI uses data in table format, various kinds of preprocessing, like filling in missing data or renaming and pre-selecting tables and columns, can be put in place, which can further help with the consistency of results.
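For example, a little plain pandas preprocessing before handing the dataframes over can already remove common sources of confusion (the column choices below are illustrative):

```python
# Sketch: light preprocessing before passing data to PandasAI,
# reusing the "film" dataframe from the earlier extraction sketch.
films = film[["film_id", "title", "rating", "length"]].copy()  # pre-select columns
films = films.rename(columns={"length": "duration_minutes"})   # clearer column names
films["rating"] = films["rating"].fillna("NOT RATED")          # fill missing values
```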

Conclusion

We can say that our original objective was accomplished. We set out to explore ways in which we could leverage the capabilities of LLMs on private data, and we were able to try out two different approaches, each with a unique positive outcome.

One of these approaches centred on using our data in tandem with a retriever model and an LLM to create a system capable of answering questions about our data. Although the LLMs we used were completely oblivious to the data needed for answering our questions, through the use of RAG we were able to empower them with the necessary context for that task.

The other approach allowed us to perform data analysis tasks on our data. The absence of any knowledge about our data was simply bypassed by providing a small sample of all the available data. By asking for code capable of resolving our problem, we also avoided having to pass along the whole database. Although the LLMs only had a small glimpse into the information contained, they were still able to produce code capable of answering our questions. This code then becomes reusable across different mediums, from dashboards to reports, which means you don't have to run the same prompts more than once.

The final results of our experiments were most definitely positive. With the continuous evolution of LLM-related technologies, we can expect techniques in this area to keep evolving, which gives us the opportunity to improve our own results. Although both of these approaches present some limitations, they also present endless possibilities. And we couldn't be more excited to keep exploring them.
