Nearly two years after ChatGPT's debut, the software market continues to see new Generative AI (GenAI) solutions emerge almost every day. In ongoing conversations with enterprise leaders and founders in the AI community, our team at Wipro Ventures has observed that many enterprises are eager to experiment with and integrate GenAI solutions, but few have achieved significant adoption within customer-facing products and broader organizational operations.
Despite massive investments of capital and technical talent throughout the AI stack, we know that most enterprises still view response accuracy, content safety, and regulatory compliance as critical hurdles that hold them back from scaled GenAI deployment in production settings.
The Necessity of Evaluation
Despite the initial excitement and high expectations for products like ChatGPT, many users have also experienced their potential to produce erroneous or hallucinated results. This reinforces the need for rigorous quality assurance of GenAI products that leverage Large Language Models (LLMs) before we can expect long-term enterprise adoption. Previous generations of AI solutions have underscored the importance of predictable and consistent performance, particularly in use cases where inaccurate outcomes carry severe negative consequences. For example, autonomous vehicle (AV) companies have spent years improving their systems' performance, and several published studies even suggest that AVs significantly reduce traffic accidents. However, public trust in AVs continues to waver after a few recent high-profile accidents.
Similarly, for enterprise use cases like data analysis, customer support, and code generation, user trust in GenAI can quickly erode if outputs are not consistently accurate and safe. Enterprise surveys reveal that the top concerns cited about GenAI adoption often relate to hallucinated false information and offensive, off-brand language.
All software today goes through extensive pre-release testing and post-launch monitoring to ensure it behaves as expected, remains secure, and handles exceptions gracefully. We believe existing MLOps platforms and new LLMOps solutions will find ways to apply the same rigor to GenAI-integrated applications to achieve consistent and predictable performance.
Understanding the Complexities
Prior to the popularization of the GPT models, many companies were already using AI models for tasks like fraud detection, product recommendations, spam filtering, sentence autocompletion, and object recognition. Even if these models sometimes produced false positives or negatives, the quality of output for this previous generation of AI use cases was straightforward for users to define and evaluate.
Assessing GenAI models like LLMs is more challenging. GenAI use cases often focus on content creation, where output quality is subjective. This subjectivity complicates the reliability and scalability of both automated and human-led performance evaluations, the latter of which is further challenged by individual biases. While most researchers and builders in the LLM community rely on standardized benchmarks to measure model performance, these benchmarks do not fully capture the nuances of LLM usage within applications. A high-quality LLM response often requires not just accurate information but also a tone, brevity, and relevance personalized to the user.
Evaluation Methods
Current evaluation methods and metrics assess three primary areas:
Model Benchmarking: Using predefined metrics to measure a foundation model's performance. Several benchmarking frameworks are openly available and standardized across the AI research community, the most popular of which evaluate models' general reasoning, knowledge, and language understanding (e.g., MMLU, HellaSwag). These benchmarks are useful for comparing baseline performance when selecting or tuning a model; a toy illustration of this kind of scoring follows this list.
System Implementation: Evaluating the impact of the various components of a GenAI application beyond the model itself, such as prompts, data pipelines, embedding and indexing, model context, and search and retrieval algorithms. As developers combine and configure these components to build increasingly complex AI systems, the design and implementation of each component determines how well user inputs generate the desired outputs.
Output Quality: Assessing the quality of GenAI responses against human preferences. Besides answer correctness, evaluation criteria can also consider factors like relevance to the original query, coherence of the response, and faithfulness to the retrieved context. These assessments can be performed manually by human evaluators or by LLMs that have been prompted or tuned to apply human-like preferences. Using LLMs to automate output evaluation is growing in popularity given the speed of implementation and flexibility in language understanding, but cost and performance at scale remain a challenge; a minimal sketch of this LLM-as-judge pattern also follows this list.
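To make the first item above concrete, here is a minimal, hypothetical sketch of how benchmark-style scoring works: each item pairs a question with a reference answer, and the metric is simply the fraction of model answers that match. The toy items and the `model_fn` callable are illustrative stand-ins, not part of MMLU, HellaSwag, or any real evaluation harness, which add prompt templates, few-shot examples, and more careful answer extraction.

```python
from typing import Callable

# Toy items standing in for a benchmark split; not drawn from any real dataset.
BENCHMARK_ITEMS = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": "4"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Madrid"], "answer": "Paris"},
]

def benchmark_accuracy(model_fn: Callable[[str, list[str]], str]) -> float:
    """Score a model callable that maps (question, choices) -> chosen answer."""
    correct = 0
    for item in BENCHMARK_ITEMS:
        prediction = model_fn(item["question"], item["choices"])
        correct += int(prediction.strip() == item["answer"])
    return correct / len(BENCHMARK_ITEMS)

if __name__ == "__main__":
    # A trivial baseline "model" that always picks the first choice, included
    # only to show how two candidates can be compared on the same metric.
    always_first = lambda question, choices: choices[0]
    print(f"Baseline accuracy: {benchmark_accuracy(always_first):.2f}")
```

The same loop works for comparing two candidate models during selection: run both callables over the same items and compare the resulting scores.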
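For the third item, the sketch below shows one common way to wire up the LLM-as-judge pattern, assuming an OpenAI-compatible Python client; the rubric wording, the 1-5 scale, and the choice of judge model are illustrative assumptions rather than a standard.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_RUBRIC = (
    "You are grading an AI assistant's answer. Score each criterion from 1-5: "
    "correctness, relevance to the question, coherence, and faithfulness to "
    "the provided context. Respond only with a JSON object of criterion: score."
)

def judge_response(question: str, context: str, answer: str) -> dict:
    """Ask a judge model to score one answer against a simple rubric."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model could be substituted
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": (
                f"Question:\n{question}\n\nContext:\n{context}\n\nAnswer:\n{answer}"
            )},
        ],
        temperature=0,  # keep grading as deterministic as possible
    )
    # A production pipeline would constrain and validate the judge's output
    # format rather than assuming clean JSON comes back.
    return json.loads(result.choices[0].message.content)

# Example: judge_response("What is our refund window?",
#                         "Refunds are accepted within 30 days.",
#                         "You can request a refund within 30 days.")
```

In practice, teams typically validate the judge itself against a sample of human-labeled responses before trusting its scores at scale, which is where the cost and reliability challenges noted above show up.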
What’s Next?
As AI applications gain adoption and evolve from copilots toward autonomous agents, we expect significant innovation in evaluation solutions and metrics. Agentic applications will greatly expand the complexity and expected performance of AI models, requiring even higher standards for reliable and safe performance. GenAI applications require specialized technical talent and computing resources to operationalize, so they must provide proportionally transformative performance to justify the high cost of investment.
Over the past several years, the team at Wipro Ventures has been committed to investing in strategic and innovative AI applications. These portfolio companies, including Avaamo (conversational AI), Lilt (enterprise translations), Kibsi (computer vision), Kognitos (process automation), Tangibly (trade secret management), and Ema (agentic AI employees), each represent significant opportunities for enterprises to improve operational productivity and create transformative customer experiences. We believe GenAI capabilities will continue to evolve quickly, and we are still in the earliest stages of realizing the technology's benefits.
Looking forward, we see several critical areas of focus for the future success of GenAI solutions:
Platforms for creating use case-specific evaluation frameworks: Organizations adopting AI applications will use them for different tasks, in different industries, and to represent unique brands, all while complying with different geographic regulations and preferences. Many enterprises will lack the technical skillsets needed to build the evaluation datasets and LLM evaluators that best represent their individual use cases. There has already been an explosion of open-source solutions in this segment, and we expect future enterprise offerings to integrate and consolidate these capabilities.
Optimized evaluation pipelines: LLM responses require numerous accuracy and safety checks before delivery, which increases development, operational, and inference costs while affecting application responsiveness. As LLM-driven applications increase in complexity, we anticipate new purpose-built tooling for more efficient deployment of LLM evaluation infrastructure; a brief sketch of such a pre-delivery gate follows this list.
Tighter feedback loops between user interactions and model performance: Today, many AI applications have their user experiences graded through direct user feedback or rates of user conversion and retention. These approaches, while reflective of user preferences, are slow and yield only coarse, binary signals. We predict the emergence of more sophisticated LLMOps tools that programmatically assess LLM outputs against human preferences, using top-rated responses to refine models and their evaluation datasets.
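As a rough illustration of the cost and latency point above, here is a hedged sketch of a pre-delivery gate that runs a handful of checks on every candidate response. The individual check functions are hypothetical placeholders; a production system would call real moderation and grounding evaluators, each of which adds its own inference cost and delay.

```python
import time
from typing import Callable

def passes_safety(response: str) -> bool:
    """Placeholder moderation check; real systems call a moderation model."""
    return "forbidden" not in response.lower()

def passes_grounding(response: str, context: str) -> bool:
    """Placeholder faithfulness check; real systems compare against retrieved context."""
    return any(sentence and sentence in context for sentence in response.split(". "))

def deliver_if_safe(response: str, context: str) -> str | None:
    """Run every check before the response reaches the user."""
    checks: list[Callable[[], bool]] = [
        lambda: passes_safety(response),
        lambda: passes_grounding(response, context),
    ]
    start = time.perf_counter()
    ok = all(check() for check in checks)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"Ran {len(checks)} checks in {elapsed_ms:.1f} ms")
    return response if ok else None  # a blocked response triggers a retry or fallback

# Example: deliver_if_safe("Refunds are accepted within 30 days.",
#                          "Refunds are accepted within 30 days of purchase.")
```

Every additional check in the list lengthens the path between the model's answer and the user, which is why we expect tooling that batches, caches, or otherwise streamlines these evaluations.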
If you are a practitioner focusing on building new tooling and infrastructure for evaluating GenAI applications, or interested in discussing any of these concepts, please don’t hesitate to get in touch with me at jason.chern@wipro.com.