Implementing and Evaluating Large Language Models in Business Operations

Generative AI and large language models (LLMs) are becoming integral to many professional settings, such as knowledge management, search-engine alternatives, and programming assistance. For businesses, especially from a compliance perspective, a planned, structured rollout by informed employees is preferable to the uncontrolled spread of shadow IT built on AI tools and models. And once prototypes are tested and ready for production, companies can run into structural hurdles. This article outlines how companies can set themselves up properly from the start.

Currently, typical use cases of LLMs in operational business include customer communication and sales, such as chatbots and lead qualification. LLMs are also gaining traction in legal, procurement, finance, HR, and product development departments. The potential applications are vast, complicating organizational decisions on where to begin and which product suits which use case.

Sofiane Fessi, who has 15 years of experience in analytics and data science, primarily in the digital and e-commerce sectors, shares his insights. Before joining Dataiku as Regional Vice President Sales Engineering for Central Europe, Fessi advised major UK companies on applying data science to web analytics and e-commerce data.

Fessi advises against starting with overly ambitious projects, because implementing AI is as much about people as about technology. Decision-makers and employees are more motivated when they quickly see results in the form of less administrative work and higher productivity.

To find the right model and assess its quality, team leaders and decision-makers should follow a few basic rules. Data teams shouldn’t dictate to business departments what to use; the end users know best what actually helps them. Uncoordinated shadow IT typically arises when employees procure products on their own, so organizations should involve the actual users from the outset. There should also be a clear method for evaluating an AI application’s performance after the experimental phase. Flexibility matters too: if a solution turns out to be unsuitable, it should be easy to replace. Establishing frameworks for participation, transparency, control, and compliance covers most of the groundwork.

The quality of LLM results depends not only on the models but also on the available data and on employees’ ability to use AI applications. A survey of 400 IT leaders revealed that 58% of companies lack quality data for AI processing, or lack access to it. At the same time, only 57% of respondents train their employees in data handling. Once these prerequisites are met, companies can turn to evaluating the LLM mechanisms themselves.

The result quality of LLMs can be assessed along the accuracy, relevance, and clarity of their responses. Accuracy checks whether a response is factually correct, while relevance measures whether it actually addresses the question. Correctness does not imply relevance: asked, say, whether a damaged item can be returned, a model might correctly describe the standard 30-day return window yet never address the damage case.
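One common way to operationalize this check is to let a second model grade each answer, the “LLM-as-a-judge” pattern discussed below. The following minimal sketch assumes the official OpenAI Python client; the model name, rubric wording, and 1-to-5 scale are our illustrative choices, not prescriptions from the article:

```python
import json

from openai import OpenAI  # pip install openai; OPENAI_API_KEY must be set

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer on two criteria, each from 1 (poor) to 5 (excellent):
- accuracy: is the answer factually correct?
- relevance: does the answer actually address the question?
Reply with a JSON object only, e.g. {{"accuracy": 4, "relevance": 2}}."""

def judge(question: str, answer: str) -> dict:
    """Score one answer for accuracy and relevance via a grader model."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any capable model works
        temperature=0,        # deterministic grading
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return json.loads(response.choices[0].message.content)

# A correct but off-target answer should score high on accuracy, low on relevance.
print(judge(
    "Can I return a product that arrived damaged?",
    "Our standard return window is 30 days from the date of purchase.",
))
```

In this example the answer is factually plausible but sidesteps the damage question, so a well-behaved judge should score its relevance low.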

Many companies only spot-check LLM output manually, which is not adequate. With rapidly growing data volumes, humans cannot keep up; that is, after all, why LLMs exist. A robust, data-driven evaluation framework is essential. Criteria such as faithfulness, correctness, relevance, and context precision serve as the basis for monitoring techniques like “LLM-as-a-judge,” in which a second LLM acts as a proxy for human evaluation. Other useful metrics include BERT-Score, ROUGE, and BLEU, which are based on statistical NLP techniques.
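For the statistical metrics, off-the-shelf packages exist. The sketch below compares a generated answer against a reference answer using the rouge-score and nltk packages; the package choice and the sample sentences are ours, the article only names the metrics. BERT-Score works analogously via the bert-score package, at the cost of downloading a transformer model:

```python
# pip install rouge-score nltk
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "Damaged items can be returned free of charge within 14 days."
candidate = "You can return damaged items within 14 days at no cost."

# ROUGE: n-gram and longest-common-subsequence overlap with the reference
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print("ROUGE-1 F1:", round(rouge["rouge1"].fmeasure, 3))
print("ROUGE-L F1:", round(rouge["rougeL"].fmeasure, 3))

# BLEU: modified n-gram precision; smoothing avoids zero scores on short texts
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)
print("BLEU:", round(bleu, 3))
```

Unlike an LLM judge, these metrics need a reference text to compare against, which makes them most useful where curated ground-truth answers already exist.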

Transitioning from the experimental phase to operational use takes more than the ground rules outlined above and solved data issues. The survey also highlighted common barriers: 44% of IT leaders cite resource shortages and 28% a lack of know-how. Since the EU AI Act, compliance and control have dominated the discussion. The key is transferring the agility and motivation seen in shadow IT into a controlled environment: budgets for LLM use, data access rights, and switching between different LLMs should be easy to manage. If compliance eats up too much time, it becomes a motivation killer.
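What such central management could look like is sketched below, purely hypothetically: every name and field here is invented for illustration and does not describe any specific product. The point is that budgets, data access rights, and the list of approved models can be changed in one place instead of per shadow-IT installation:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a central LLM gateway policy. None of these names come
# from the article or from a real product; they only illustrate keeping budgets,
# data access, and model choice in one governed place.

@dataclass
class TeamPolicy:
    monthly_budget_eur: float  # spending cap per team (spend tracking omitted here)
    allowed_models: list[str]  # approved models, easy to swap out centrally
    data_scopes: list[str] = field(default_factory=list)  # datasets the team may query

POLICIES = {
    "sales": TeamPolicy(500.0, ["gpt-4o-mini", "mistral-large"], ["crm"]),
    "legal": TeamPolicy(1500.0, ["claude-sonnet"], ["contracts", "case-law"]),
}

def authorize(team: str, model: str, scope: str) -> bool:
    """Check one request against the team's policy before it reaches a model."""
    policy = POLICIES.get(team)
    return (
        policy is not None
        and model in policy.allowed_models
        and scope in policy.data_scopes
    )

print(authorize("sales", "gpt-4o-mini", "crm"))          # True
print(authorize("sales", "claude-sonnet", "contracts"))  # False: not approved
```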

Ultimately, everything hinges on genuine value; forcing employees to adopt the tools is futile. Participation in and use of LLMs should be as simple as possible. In 2025, AI agents will further accelerate operational applications. And compulsion is unnecessary anyway: 65% of leaders say their GenAI initiatives are financially successful and deliver strategic value. What more convincing argument is needed?

For those already using large language models and retrieval-augmented generation, security considerations are crucial. The iX 01/2025 issue provides valuable guidance.

The “Three Questions and Answers” series by iX aims to address IT challenges succinctly, whether from a user’s, manager’s, or administrator’s perspective. Share your daily practice insights or user tips with us, or leave a comment in the forum.
