As enterprises rush to adopt GenAI, many are discovering that traditional ideas of “data quality” don’t quite fit. For years, quality meant fixing duplicates, cleaning customer records, and validating structured data. But GenAI systems thrive on something different — unstructured information like documents, reviews, and policies. When this data is messy, incomplete, or hard to access, even the most advanced models will fail to deliver value.
Just like other forms of AI, GenAI – AI based on large language models (LLMs) such as ChatGPT – depends heavily on data: in most cases, it needs good data to work well.
However, the data that GenAI thrives on is not the same as the data typically used for data science or machine learning. In those areas, data quality usually refers to ensuring structured records are accurate – for example, that customer addresses are correct, there are no duplicate transactions, and telephone numbers are valid. Most existing data quality tools and frameworks are designed to handle these types of issues in structured datasets.
GenAI, however, excels at working with unstructured data – documents, images, and other text-rich content. To get the best results from GenAI, you need to get this unstructured data in order. If your unstructured data is of poor quality, your GenAI outputs will be too.
In this context, data quality is still about accuracy and availability, but it’s less about formatting and more about the quality and relevance of the underlying information. Instead of transaction records, your “data” might be standard operating procedures, customer reviews, or policy documents.
Two areas where GenAI is already delivering value
1. Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) and its variations form a fundamental architecture behind many successful GenAI deployments. In a RAG system, the input to the LLM is your question alongside supporting documentation.
For example, if you ask:
“How do I change a broken spring in an ACME 123 device?”
The LLM receives the question together with relevant sections from the ACME 123 Maintenance Manual. It then analyses this information to generate the answer.
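To make the flow concrete, here is a minimal retrieve-then-generate sketch in Python. The `search_manuals` retriever is a hypothetical stand-in for whatever search layer sits over your documentation, and the OpenAI client is just one of several LLM APIs you could call at this step.

```python
# Minimal retrieve-then-generate sketch. `search_manuals` is a hypothetical
# stand-in for a real retriever (typically a vector-database query); the
# canned manual text below is illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def search_manuals(question: str, top_k: int = 3) -> list[str]:
    """Return the top-k most relevant manual sections for a question."""
    # A real implementation queries your vector database; this stub lets
    # the sketch run end to end.
    return ["ACME 123 Maintenance Manual, section 4.2: remove the rear "
            "panel, release the spring retainer, fit the new spring."][:top_k]

def answer(question: str) -> str:
    context = "\n\n".join(search_manuals(question))
    prompt = (
        "Answer the question using only the documentation below.\n\n"
        f"Documentation:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("How do I change a broken spring in an ACME 123 device?"))
```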
This approach has a clear advantage: the model uses your own knowledge and documentation rather than relying on what it was trained on originally. However, it will only perform well if:
- The maintenance document exists
- It contains the right information
- It’s accessible to the LLM (typically stored in a vector database)
Your key data quality challenge, then, is to ensure these conditions are met. That includes:
- Verifying that information is accurate and up to date
- Refreshing the vector database whenever documents or procedures change (see the sketch after this list)
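A practical way to handle the refresh is to re-ingest a document only when its content has actually changed. The sketch below is one minimal approach, using Chroma for illustration and a stored checksum to detect changes; any vector store with upsert support would work similarly.

```python
# Refresh a document in the vector database only when its content changes,
# detected via a checksum stored in the metadata. Chroma is illustrative;
# the same pattern applies to other vector stores.
import hashlib

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("maintenance_manuals")

def refresh(doc_id: str, text: str) -> None:
    checksum = hashlib.sha256(text.encode()).hexdigest()
    existing = collection.get(ids=[doc_id])
    if existing["metadatas"] and existing["metadatas"][0].get("checksum") == checksum:
        return  # content unchanged: skip re-embedding
    collection.upsert(
        ids=[doc_id],
        documents=[text],  # Chroma embeds the text on upsert
        metadatas=[{"checksum": checksum}],
    )
```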
In practice, much of this data is stored in formats that need processing. Useful documents often exist as PDFs or Word files, which must be converted into plain text for RAG ingestion. If older documents have been scanned as images (for example, image-only PDFs with no text layer), Optical Character Recognition (OCR) will be required to extract the text. OCR can introduce its own errors, so it’s important to understand and manage their impact.
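A minimal extraction pass might look like the sketch below. pypdf, pdf2image, and pytesseract are common choices rather than the only ones; note that pdf2image requires the poppler system package and pytesseract requires a tesseract installation.

```python
# Extract text from a PDF, falling back to OCR when there is no text layer
# (i.e. the pages are scanned images). OCR output should be spot-checked,
# since recognition errors flow straight into your RAG index.
from pdf2image import convert_from_path
from pypdf import PdfReader
import pytesseract

def pdf_to_text(path: str) -> str:
    pages = [page.extract_text() or "" for page in PdfReader(path).pages]
    if any(p.strip() for p in pages):
        return "\n".join(pages)  # embedded text layer found
    # No text layer: render each page to an image and OCR it.
    images = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(img) for img in images)
```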
You’ll also need to monitor RAG quality over time. Like any production system, it should be tested regularly to detect when performance drops below acceptable thresholds. Mature RAG implementations rely on curated test questions with expected answers, monitored using tools such as ragas.io. These benchmarks must be reviewed and updated as your system evolves.
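A framework such as ragas scores answers on metrics like faithfulness and relevance; the bare-bones sketch below shows only the regression-testing pattern itself. It reuses the hypothetical `answer` function from the earlier RAG sketch and substitutes a naive keyword check for proper scoring.

```python
# Bare-bones RAG regression check: run curated questions through the
# pipeline and flag when the pass rate drops below a threshold. The keyword
# check is a crude placeholder for a real metric (e.g. ragas faithfulness).
BENCHMARK = [
    {"question": "How do I change a broken spring in an ACME 123 device?",
     "must_mention": ["spring", "rear panel"]},
    # ... more curated question/expectation pairs ...
]

def run_benchmark(threshold: float = 0.9) -> bool:
    passed = sum(
        all(term.lower() in answer(case["question"]).lower()
            for term in case["must_mention"])
        for case in BENCHMARK
    )
    return passed / len(BENCHMARK) >= threshold  # False means: investigate
```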
2. Employee or customer service bots
Another area where organisations are seeing success with GenAI is in automating customer service or internal employee queries – for example, HR chatbots.
Modern service bots use LLMs to interpret questions and generate responses across a wide range of topics. Some queries mirror those handled in RAG systems (like our broken spring example), so the same data quality principles apply – ensuring information is accurate and current.
Other queries are more transactional, such as:
“How many days of annual leave do I have left?”
Here, the data quality concerns are similar to traditional structured data quality. The accuracy of the answer depends on the integrity of the source system. The main challenge is ensuring the GenAI system can securely access this trusted data while respecting appropriate permissions and governance.
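In practice this usually means exposing the lookup as a tool the LLM can invoke, while the permission check runs in ordinary code against the trusted system. In the sketch below, `HRSystem` is a hypothetical stand-in for your real HR platform’s API.

```python
# Transactional bot tool: the LLM decides when to call it, but the lookup
# runs against the trusted source system under the caller's identity, so
# permissions are enforced outside the model. `HRSystem` is hypothetical.
class HRSystem:
    _balances = {"emp-001": 12}  # illustrative data only

    def get_leave_balance(self, employee_id: str) -> int:
        return self._balances[employee_id]

hr_system = HRSystem()

def remaining_leave(employee_id: str, requester_id: str) -> str:
    # Governance check: an employee may only query their own balance.
    if employee_id != requester_id:
        raise PermissionError("Not authorised to view this balance")
    days = hr_system.get_leave_balance(employee_id)
    return f"You have {days} days of annual leave remaining."

print(remaining_leave("emp-001", requester_id="emp-001"))
```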
There’s also growing interest in AI agents that can act on your behalf – for example, an AI holiday agent that books your trip and requests your leave automatically. These systems work with both structured and unstructured data, so their performance depends on the quality of both.
Emerging standards: Model Context Protocol (MCP)
There are many ways to connect data to GenAI applications, but one promising approach is the Model Context Protocol (MCP) (modelcontextprotocol.io). MCP is an open standard for connecting data sources and tools to AI applications. It holds the potential to make your data reusable across different AI solutions and agents, reducing duplication and integration overhead.
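As an illustration, the official Python SDK (the `mcp` package) can expose the leave-balance lookup from the previous example as an MCP tool in a few lines. Treat this as a sketch of the pattern, not a production server.

```python
# Minimal MCP server exposing an HR lookup as a reusable tool. Any
# MCP-capable assistant or agent can then call it without bespoke
# integration code.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("hr-data")

@mcp.tool()
def get_leave_balance(employee_id: str) -> int:
    """Return the remaining annual leave days for an employee."""
    return 12  # stand-in for a query against the trusted HR system

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```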
What about training your own model?
So far, we’ve focused on using existing models. Training a large language model from scratch is extremely expensive (often exceeding $100M for the latest models) and requires specialist expertise. Most organisations don’t need to do this because today’s models are already highly capable and available for use.
That said, there are valid reasons to train or fine-tune smaller models. You might do so for distillation (to create a lightweight model for a specific task), cost reduction, performance optimisation, or security and control if you need to host models locally.
In these cases, data quality is absolutely critical. Model performance depends directly on the quality and relevance of your training data. Careful selection, curation, and validation of training documents are essential to achieving good results.
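What careful curation means varies by project, but even a simple first pass pays off. The sketch below drops exact duplicates and implausibly short or long documents; real pipelines add near-duplicate detection, PII scrubbing, and human review.

```python
# First-pass curation of candidate fine-tuning documents: drop exact
# duplicates and records outside a plausible length range. Thresholds here
# are illustrative and should be tuned to your corpus.
import hashlib

def curate(documents: list[str], min_chars: int = 200,
           max_chars: int = 20_000) -> list[str]:
    seen: set[str] = set()
    keep: list[str] = []
    for doc in documents:
        text = doc.strip()
        if not (min_chars <= len(text) <= max_chars):
            continue  # too short to be informative, or too long for one unit
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate would over-weight this content
        seen.add(digest)
        keep.append(text)
    return keep
```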