SpringTree

So you want to add an LLM to your app? Read this first

Integrating LLMs into software isn’t plug-and-play. From validating outputs to managing costs and model changes, I share key lessons and how frameworks like LangChain help bridge deterministic systems with dynamic AI.

In November 2022 we all got a taste of things to come when OpenAI released ChatGPT, powered by GPT-3.5. AI wasn’t new to us. But never before had it been so intuitive. Who knew a simple chat interface would be the breakthrough that brought AI to the masses? I knew, then, that it was only a matter of time before we would be asked to integrate LLMs into a software product.

We’ve come a long way since then, and over the years we have received more and more AI work from our clients. I wrote this article to illustrate what you can expect when you integrate LLMs into your software product.

The main challenge is bridging the gap between deterministic systems and dynamic, non-deterministic ones. This requires a bit more explanation. 

Software was, and still is for the most part, deterministic. When you type 2 + 2 in a calculator, you always get the same result. Databases and APIs depend on deterministic input and output. Without it, nothing works. Not so much with AI. Remember that LLMs are general knowledge systems, so you can ask them to do all sorts of things. There’s no single way to ask a question. You can ask “Summarize this text” or “Make this text shorter”, and the LLM will give you a similar response. Emphasis on the word similar.

You have to validate your AI workflow output before you use it in your deterministic system. For example, let’s say you ask an LLM to summarize and classify support emails as urgent or not urgent. Instead of trusting the model blindly, you run a validation step: compare its labels against a small dataset you’ve already tagged by hand. If the model consistently mislabels certain cases — say, password reset requests that should always be urgent — you know you need to adjust your prompts, fine-tune the model, or add a rule on top.
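
The validation step above can be sketched in a few lines. This is a minimal illustration, not a real pipeline: `classify_email` is a hypothetical stand-in for your actual LLM call, and the gold dataset is invented for the example.

```python
def classify_email(text: str) -> str:
    """Placeholder for the real LLM call; returns 'urgent' or 'not urgent'."""
    return "urgent" if "password" in text.lower() else "not urgent"

# Small gold dataset, tagged by hand.
gold = [
    ("I forgot my password, please reset it", "urgent"),
    ("Thanks for the great service!", "not urgent"),
    ("Site is down, customers can't pay", "urgent"),
]

def validate(dataset):
    # Collect every case where the model disagrees with the hand-made label.
    mistakes = [(text, expected, classify_email(text))
                for text, expected in dataset
                if classify_email(text) != expected]
    accuracy = 1 - len(mistakes) / len(dataset)
    return accuracy, mistakes

accuracy, mistakes = validate(gold)
print(f"accuracy: {accuracy:.0%}")
for text, expected, got in mistakes:
    print(f"mislabeled: {text!r} (expected {expected}, got {got})")
```

Running this against your gold set after every prompt or model change tells you immediately whether the mislabeled cases are the ones you care about.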

So once you’ve done that, you might think you’re finished. Not quite. There are more things you need to consider. Let me walk you through them:

Understand the costs
LLMs differ a lot in pricing, so you have to carefully select the right model for the right job. You need to know how many tokens a single task execution costs, on average. Extrapolate that to your expected number of tasks, and you get an idea of how much the project will cost.

Models will change
If you’ve found a model that works for you, understand that there is no guarantee you can keep using it in the future. For example, with the introduction of GPT-5, OpenAI initially left only GPT-5 and GPT-4o available, removing access to older models. So a model change is not a mere possibility; it’s a certainty you need to prepare for.
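
One cheap way to prepare is to keep the model name out of your code entirely, so a forced migration becomes a config change rather than a refactor. A minimal sketch — the environment variable name and model identifiers here are assumptions, not a prescribed convention:

```python
import os

# Hypothetical default; whatever model you currently use goes here.
DEFAULT_MODEL = "gpt-4o"

def get_model_name() -> str:
    # An environment variable wins over the default, so operations can
    # swap models without a code change or redeploy.
    return os.environ.get("LLM_MODEL", DEFAULT_MODEL)
```

Every call site then asks `get_model_name()` instead of hard-coding a string, and your evaluation dataset (next section) tells you whether the swapped-in model is actually good enough.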

Measure the quality of the output
Just like the models, your software is subject to change, and so are the prompts you write to interact with the LLMs. So how do you measure the impact of those changes? How do you know that the new version of your software is better than, or at least as good as, the previous version? You have to create a dataset of what constitutes good results, run your new software against that dataset, and compare the results.
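
In practice that means scoring both versions against the same gold dataset and refusing to ship a regression. A toy sketch, where `run_v1` and `run_v2` are hypothetical stand-ins for two versions of your prompt or pipeline:

```python
def run_v1(text: str) -> str:
    # Old version: only flags outages as urgent.
    return "urgent" if "down" in text else "not urgent"

def run_v2(text: str) -> str:
    # New version: also flags password issues as urgent.
    return "urgent" if ("down" in text or "password" in text) else "not urgent"

gold = [
    ("service is down", "urgent"),
    ("password reset please", "urgent"),
    ("love the new feature", "not urgent"),
]

def score(run, dataset):
    # Fraction of gold examples the pipeline gets right.
    return sum(run(text) == label for text, label in dataset) / len(dataset)

old, new = score(run_v1, gold), score(run_v2, gold)
print(f"v1: {old:.0%}, v2: {new:.0%}")
assert new >= old, "new version regressed against the gold dataset"
```

Wiring a check like this into CI means a prompt tweak or model swap that silently degrades quality fails the build instead of reaching production.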

Retrieval-augmented generation
LLMs are trained on vast amounts of public data, but they lack the specific, up-to-the-minute, or proprietary information relevant to your context. If you asked an LLM about the specifics of your internal customer support emails, it wouldn’t be very helpful. To overcome this, you need retrieval-augmented generation (RAG): first retrieve the relevant information from your private data sources, then feed it to the LLM along with your query, so the model can generate accurate, contextually informed responses based on your unique knowledge base.
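
The retrieve-then-prompt idea can be shown in miniature. This toy uses plain word overlap to pick a document; a real RAG system would use embeddings and a vector store, and the documents here are invented for the example:

```python
# Stand-in for your private knowledge base.
DOCUMENTS = [
    "Refunds are processed within 5 business days.",
    "Password resets are handled via the account settings page.",
    "Our support desk is open Monday through Friday, 9:00-17:00.",
]

def retrieve(query: str, docs: list[str]) -> str:
    # Naive retrieval: pick the document sharing the most words with the query.
    words = set(query.lower().split())
    return max(docs, key=lambda d: len(words & set(d.lower().split())))

def build_prompt(query: str) -> str:
    # Prepend the retrieved context so the LLM answers from your data,
    # not just its training set.
    context = retrieve(query, DOCUMENTS)
    return f"Context: {context}\n\nQuestion: {query}\n\nAnswer using the context."

print(build_prompt("How long do refunds take?"))
```

The structure is the same at scale: only the retrieval step changes, from word overlap to semantic search over embedded chunks of your data.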

We were faced with these challenges when we started our journey with AI, and that’s why we chose LangChain. LangChain, created in 2022 by Harrison Chase, is an open-source framework that facilitates the integration of LLMs into applications and offers off-the-shelf solutions to the challenges mentioned above. As of this writing, it has 115K stars on GitHub and is available in Python and JavaScript.

Alongside LangChain, there’s LangGraph, which allows us to create dynamic AI workflows at scale. And last but not least, there’s LangSmith, a platform where you can debug, test, and monitor AI app performance.

So, to recap: when integrating LLMs, validate the output, plan for costs, prepare for model churn, measure quality, and use frameworks like LangChain to bridge the gap. If you’re considering adding AI to your product and want guidance on choosing the right models, workflows, and frameworks, we’d be glad to help.

Sources:
LangChain homepage: https://www.langchain.com/