As enterprises explore generative AI (GenAI), they inevitably find that the success of their GenAI initiatives depends on data quality. Poor data quality cascades through machine learning (ML) pipelines, leading to flawed business decisions and missed opportunities. To establish trust in AI systems, enterprises must find new approaches to continuously improve data quality.

Dual Pillars of GenAI: Core and Contextual Data

At the heart of GenAI lie two types of data: core and contextual. Core data fuels the training of large language models (LLMs), enabling them to decipher complex patterns and structures. This foundational knowledge is critical for generating outcomes that are not only logical but also deeply relevant to the task at hand. Contextual data complements this process by providing nuanced information about specific situations or environments, leading to more tailored and effective GenAI outputs.

To ensure high-quality core data, enterprises should prioritize the following: 

1. Contextualization: Contextualization is essential for GenAI's effectiveness, especially in creating tailored content. A notable example is a MedTech organization that enhanced the experiences of healthcare professionals (HCPs) through personalized marketing by aggregating and meticulously validating data from diverse sources. This comprehensive approach to data amalgamation and quality control enabled the creation of data products that significantly improved HCP engagement, demonstrating the personalization power of GenAI.

2. Comprehensiveness: Ensuring that data encompasses all relevant groups and scenarios is crucial. For instance, models that will be used across the globe must be trained on data from all targeted countries.

3. Bias Mitigation: Iterative dataset reviews and bias-detection tools play an important role in minimizing model bias. Beyond data management, careful prompt engineering during model training is also vital for generating unbiased, inclusive outputs.

4. Regulatory Compliance: Adhering to legal standards, especially concerning personally identifiable information (PII) and confidential data, is imperative. Such data must be appropriately masked or excluded to meet regulatory demands, safeguarding privacy and compliance.
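
As a simple illustration of this compliance requirement, the sketch below masks assumed PII columns (here, email and phone) in a tabular dataset before it feeds a training pipeline. The column names and the one-way hashing approach are illustrative assumptions, not a prescribed masking standard.

```python
import hashlib

import pandas as pd

# Hypothetical PII columns; a real deployment would source this list
# from a data catalog or a data classification tool.
PII_COLUMNS = ["email", "phone"]

def mask_pii(df: pd.DataFrame, pii_columns=PII_COLUMNS) -> pd.DataFrame:
    """Replace PII values with a one-way hash so records stay joinable
    but no raw personal data reaches the training pipeline."""
    masked = df.copy()
    for col in pii_columns:
        if col in masked.columns:
            masked[col] = masked[col].astype(str).map(
                lambda v: hashlib.sha256(v.encode("utf-8")).hexdigest()
            )
    return masked

if __name__ == "__main__":
    records = pd.DataFrame(
        {
            "hcp_id": [101, 102],
            "email": ["a.smith@example.com", "b.jones@example.com"],
            "phone": ["555-0100", "555-0199"],
            "specialty": ["cardiology", "oncology"],
        }
    )
    print(mask_pii(records))
```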

Contextual Data for Crafting Prompts

The art of crafting prompts for GenAI models hinges on three core concepts: clarity, specificity, and bias consciousness. Clear and unambiguous prompts enhance the accuracy of outcomes, while specificity ensures that the AI comprehends and addresses the context effectively. Moreover, a keen awareness of potential biases in language guides the development of more equitable and balanced AI responses. This fosters an environment where GenAI can thrive responsibly and ethically.
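
To make these concepts concrete, the sketch below contrasts a vague request with a prompt that is clear, specific, and grounded in contextual data. The prompt wording and the build_prompt helper are illustrative assumptions rather than a reference implementation.

```python
# Illustrative only: a vague prompt vs. one that is clear, specific,
# and grounded in contextual data about the recipient.
vague_prompt = "Write something about our new product for doctors."

def build_prompt(product: str, specialty: str, region: str, tone_guidance: str) -> str:
    """Assemble a clear, specific, bias-conscious prompt from contextual data."""
    return (
        f"Write a 120-word email introducing {product} to a {specialty} "
        f"specialist practicing in {region}. "
        "Focus on clinical evidence and workflow benefits. "
        f"{tone_guidance}"
    )

specific_prompt = build_prompt(
    product="CardioTrack monitor",
    specialty="cardiology",
    region="Germany",
    tone_guidance="Use neutral, inclusive language and avoid assumptions "
                  "about the reader's age, gender, or practice size.",
)
print(vague_prompt)
print(specific_prompt)
```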

Essential Elements of Core Data Quality

Core data is frequently plagued by issues such as missing values, inconsistencies, and inaccuracies. These challenges compromise the reliability and trust in data, ultimately impeding the training and performance of GenAI models. Common data quality challenges that affect the scalability of GenAI include:

1. Unlabeled or Poorly Labeled Data: The lack of context in datasets can lead LLMs to generate incorrect outputs. This issue is magnified by diverse data needs across business and IT functions, highlighting the need for a centralized inventory of well-labeled, metadata-rich datasets. Without such organization, searching for data becomes time-consuming, and doubts about data quality and origin arise. Ensuring datasets are quality-checked, curated, and easily accessible improves trust in data, streamlines access, encourages collaboration, and guarantees consistent metrics across users. This optimizes the efficiency and effectiveness of data use for GenAI and other data-driven projects.
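
One lightweight way to picture such a centralized, metadata-rich inventory is a catalog entry per dataset. The fields and names below are illustrative assumptions, not the schema of any particular catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetCatalogEntry:
    """Minimal metadata needed to make a dataset findable and trustworthy."""
    name: str
    description: str
    owner: str          # accountable data steward
    source_system: str  # lineage: where the data originates
    refresh_cadence: str
    quality_checked: bool
    tags: list = field(default_factory=list)

hcp_interactions = DatasetCatalogEntry(
    name="hcp_interactions_curated",
    description="Curated HCP engagement events, deduplicated and validated.",
    owner="marketing-data-stewards@example.com",
    source_system="CRM",
    refresh_cadence="daily",
    quality_checked=True,
    tags=["hcp", "engagement", "genai-training"],
)
print(hcp_interactions.name, "owned by", hcp_interactions.owner)
```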

2. Incomplete Data and Missing Values: Incomplete data hampers model performance by providing only a partial view for analysis. Data quality solutions address this by producing detailed reports on data exceptions, such as missing values, which data stewards use to identify and correct issues. Corrective actions may include updating values at their source or setting default values in the data pipeline. This ensures clean data for downstream use.

Systematically implementing data quality solutions improves data quality over time, with progress monitored through key performance indicators (KPIs) set for organizational teams. These KPIs track improvements in data accuracy, completeness, and timeliness.
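
A minimal sketch of this exception-report-and-default-value pattern, assuming a pandas DataFrame with hypothetical columns, might look like the following; in practice, a data quality tool would route the report to stewards.

```python
import pandas as pd

def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values per column so data stewards can triage them."""
    report = df.isna().sum().to_frame("missing_count")
    report["missing_pct"] = (report["missing_count"] / len(df) * 100).round(1)
    return report[report["missing_count"] > 0]

def apply_defaults(df: pd.DataFrame, defaults: dict) -> pd.DataFrame:
    """Fill agreed default values in the pipeline; fixing at the source is preferable."""
    return df.fillna(value=defaults)

if __name__ == "__main__":
    orders = pd.DataFrame(
        {"order_id": [1, 2, 3], "country": ["DE", None, "US"], "amount": [120.0, None, 80.0]}
    )
    print(missing_value_report(orders))
    print(apply_defaults(orders, {"country": "UNKNOWN"}))
```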

3. Data Accuracy: A data quality solution is essential for ensuring the integrity and reliability of data used in GenAI applications. It produces detailed reports on data issues, including missing or inaccurate values, which helps data stewards make the necessary corrections, such as updating values at their source or setting defaults in the pipeline for cleaner data downstream.

Proactive management of data quality, monitored for accuracy, completeness, and timeliness through KPIs, ensures that GenAI models are trained on high-quality data and can generate precise, reliable, and insightful outcomes.
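
As one illustration of how accuracy checks can surface such issues, the sketch below applies simple declarative validation rules and reports the rows that fail them. The rules, columns, and example data are hypothetical and meant only to show the shape of the approach.

```python
import pandas as pd

# Hypothetical accuracy rules: each maps a rule name to a row-level predicate.
ACCURACY_RULES = {
    "amount_is_positive": lambda df: df["amount"] > 0,
    "country_code_is_two_letters": lambda df: df["country"].str.len() == 2,
}

def accuracy_exceptions(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows that violate any rule, tagged with the rule they failed."""
    failures = []
    for rule_name, predicate in ACCURACY_RULES.items():
        failed = df[~predicate(df).fillna(False)].copy()
        failed["failed_rule"] = rule_name
        failures.append(failed)
    return pd.concat(failures, ignore_index=True)

if __name__ == "__main__":
    orders = pd.DataFrame(
        {"order_id": [1, 2, 3], "country": ["DE", "USA", "FR"], "amount": [120.0, -5.0, 80.0]}
    )
    print(accuracy_exceptions(orders))
```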

4. Data Freshness: Outdated data can mislead models, reducing their relevance to current trends. A data observability solution improves data freshness by detecting anomalies early, offering insights into the recency of data loads and identifying stale data. For instance, a retail company using old transaction data might misalign its product recommendations. With data observability in place, data stays current, making GenAI model recommendations timelier and more relevant for effective customer engagement.
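
The following sketch illustrates a basic freshness check of the kind described above, assuming each dataset exposes a last-load timestamp; the 24-hour SLA and dataset names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical last-load timestamps, e.g. pulled from pipeline run metadata.
LAST_LOADED = {
    "pos_transactions": datetime.now(timezone.utc) - timedelta(hours=2),
    "loyalty_profiles": datetime.now(timezone.utc) - timedelta(days=9),
}

FRESHNESS_SLA = timedelta(hours=24)  # assumed SLA; tune per dataset

def stale_datasets(last_loaded: dict, sla: timedelta) -> list:
    """Return datasets whose most recent load is older than the freshness SLA."""
    now = datetime.now(timezone.utc)
    return [name for name, loaded_at in last_loaded.items() if now - loaded_at > sla]

if __name__ == "__main__":
    for name in stale_datasets(LAST_LOADED, FRESHNESS_SLA):
        print(f"ALERT: {name} is stale; downstream recommendations may be outdated.")
```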

5. Duplicates: Duplicate data entries can lead to redundancy and biased analysis. A data quality solution identifies duplicates using match rules, such as comparing customer datasets by name, email, and phone number. It assigns match scores to gauge confidence in the duplicates, automating record merging for high-confidence matches and flagging others for review. This method ensures data accuracy for training and analysis, preventing skewed model performance due to over-represented data.
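
A simplified sketch of this match-rule approach follows: candidate record pairs are scored on name, email, and phone, auto-merged above a high-confidence threshold, and otherwise flagged for review. The weights and thresholds are illustrative assumptions.

```python
# Illustrative match rule: weight exact agreement on key attributes,
# auto-merge above a high-confidence threshold, flag borderline pairs.
WEIGHTS = {"name": 0.4, "email": 0.4, "phone": 0.2}
AUTO_MERGE_THRESHOLD = 0.9
REVIEW_THRESHOLD = 0.6

def match_score(a: dict, b: dict) -> float:
    """Sum the weights of attributes on which two customer records agree exactly."""
    score = 0.0
    for attr, weight in WEIGHTS.items():
        a_val = (a.get(attr) or "").strip().lower()
        b_val = (b.get(attr) or "").strip().lower()
        if a_val and a_val == b_val:
            score += weight
    return score

def classify_pair(a: dict, b: dict) -> str:
    """Decide whether two records should be merged, reviewed, or kept distinct."""
    score = match_score(a, b)
    if score >= AUTO_MERGE_THRESHOLD:
        return "auto-merge"
    if score >= REVIEW_THRESHOLD:
        return "flag-for-review"
    return "distinct"

if __name__ == "__main__":
    rec1 = {"name": "Ana Silva", "email": "ana.silva@example.com", "phone": "555-0100"}
    rec2 = {"name": "Ana Silva", "email": "ana.silva@example.com", "phone": ""}
    print(classify_pair(rec1, rec2))  # score 0.8 -> flag-for-review
```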

6. Data Consistency: Inconsistent data hampers a model's ability to apply learnings across contexts. For example, if a healthcare professional is marked active in a CRM but inactive in a marketing application, resolving this discrepancy is vital for accurate data segmentation. Identifying and reconciling such inconsistencies requires selecting a primary source of truth and adjusting data pipelines accordingly. This ensures models can generalize effectively, boosting the reliability and performance of GenAI applications.
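
To make the example concrete, the sketch below reconciles an HCP's status across a CRM and a marketing application, treating the CRM as the assumed primary source of truth; the system names and fields are illustrative.

```python
# Illustrative reconciliation: the CRM is designated the primary source of
# truth for HCP status, and other systems are aligned to it in the pipeline.
crm_records = {"HCP-001": {"status": "active"}, "HCP-002": {"status": "inactive"}}
marketing_records = {"HCP-001": {"status": "inactive"}, "HCP-002": {"status": "inactive"}}

def reconcile_status(crm: dict, marketing: dict) -> list:
    """List discrepancies and the corrected value taken from the primary source."""
    fixes = []
    for hcp_id, crm_rec in crm.items():
        mkt_rec = marketing.get(hcp_id)
        if mkt_rec and mkt_rec["status"] != crm_rec["status"]:
            fixes.append(
                {"hcp_id": hcp_id,
                 "marketing_status": mkt_rec["status"],
                 "corrected_status": crm_rec["status"]}
            )
    return fixes

if __name__ == "__main__":
    for fix in reconcile_status(crm_records, marketing_records):
        print(fix)
```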

Scaling GenAI for Broader Enterprise Applications

The success of enterprise GenAI programs is deeply intertwined with data quality readiness. To scale GenAI programs effectively with high-quality data, organizations can follow these six strategic steps:

1. Data Quality Strategy and Executive Sponsorship: Establish a comprehensive vision around drivers and priorities for data quality management. Aligning with business priorities and securing executive sponsorship are crucial for the success of data quality initiatives and for engaging various business stakeholders.

2. Start Small: Clearly understand the business objectives and current data quality challenges. Identify and prioritize use cases for implementation, such as data quality for the customer domain, or a specific GenAI use case/application. Assess the expected outcomes from the GenAI application/use case, as well as the underlying data and relevant data quality standards/KPIs.

3. Establish a Data Quality Framework: Define data quality processes and procedures, such as critical data element (CDE) identification or issue management. Outline a target operating model that details the engagement model, organizational structure, and roles and responsibilities (RACI). Ensure that data owners and stewards are identified and that their roles are aligned with prioritized use cases.

4. Pilot Solution and Scale: Develop a scalable solution that can adapt to future needs, including the ability to integrate new data sources, datasets, or rules. Create a central repository for reusable rules, intuitive dashboards for data quality reports, and a mechanism for tracking data issues. Engage the business throughout the implementation to gather feedback and provide training tailored to various personas and roles, such as issue management workflows or the creation of new critical data elements. Identify data champions within working groups to help disseminate outcomes and incrementally onboard other use cases and business functions.

5. Data Quality Remediation: Implement a robust remediation process where fixes are managed close to the source to prevent data issues from propagating downstream. Establish a technology-enabled workflow for issue management to track exceptions and take corrective actions. Hold regular data governance meetings with stakeholders to ensure prompt issue resolution.

6. Continuous Data Quality Monitoring and Improvement: Treat data quality management as an ongoing process that involves regular monitoring against defined metrics and KPIs, detecting anomalies in data flows, alerting data stewards, and tracking data quality issues to resolution.
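
As a closing illustration of continuous monitoring, the sketch below compares measured data quality metrics against KPI thresholds and raises alerts for data stewards. The metric names, thresholds, and alerting mechanism are assumptions for illustration only.

```python
# Hypothetical KPI thresholds agreed with the business (percentages).
KPI_THRESHOLDS = {"completeness": 98.0, "accuracy": 99.0, "timeliness": 95.0}

# Latest measured values, e.g. produced by scheduled data quality jobs.
measured = {"completeness": 97.2, "accuracy": 99.4, "timeliness": 96.0}

def kpi_breaches(measured: dict, thresholds: dict) -> list:
    """Return KPIs that fall below their agreed threshold so stewards can be alerted."""
    return [
        {"kpi": name, "measured": value, "threshold": thresholds[name]}
        for name, value in measured.items()
        if name in thresholds and value < thresholds[name]
    ]

if __name__ == "__main__":
    for breach in kpi_breaches(measured, KPI_THRESHOLDS):
        print(f"ALERT data stewards: {breach['kpi']} at {breach['measured']}% "
              f"is below the {breach['threshold']}% target.")
```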

Make Data Quality Job #1

High-quality, trusted data is the backbone of any successful GenAI initiative. Organizations must prioritize data readiness as they define their GenAI use cases and objectives. Delaying attention to data quality can lead to diminished AI ambitions and missed opportunities for innovation and growth.