The maturation of artificial intelligence (AI) and machine learning (ML) technologies, combined with the rapid emergence of generative AI (GenAI) tools, has generated strong interest among transportation agencies eager to use these technologies to enhance operations, strengthen analytical capabilities, and support more informed decision-making. Because these tools depend on high-quality input data, both to train large language models (LLMs) and to support retrieval-augmented generation (RAG) techniques that deliver accurate responses to staff queries, it is critical that organizations prioritize the assessment and remediation of low-quality data, as well as data that does not comply with defined business rules.
This research will examine the feasibility of using AI to automate the creation of data business rules through the analysis of manuals, processes, policies, training materials, and related documents. AI tools would be trained to profile datasets, identify inconsistencies and quality issues, and generate data quality rules compatible with existing tools. The structure of these rules (including tool-specific, importable structured markup formats) would be derived by ingesting application documentation, pre-existing validation logic, and natural language prompts enhanced by sample data. When provided with data for quality review, the AI system would automatically correct errors based on business rules, supporting documentation, and inferred relationships across datasets, without requiring the manual development and execution of individual validation rules.
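As a minimal illustration of the kind of importable, structured rule set described above, the following Python sketch loads two rules from JSON, flags records that violate them, and applies one simple automatic correction. It is an assumption-laden example only: the rule schema, the field names (speed_limit_mph, surface_type), the normalization codes, and the helper functions load_rules and check_record are all hypothetical and are not drawn from any agency system or existing data quality tool.

```python
import json

# Hypothetical rule set in an importable structured format (JSON).
# Field names, thresholds, and codes are illustrative assumptions only.
RULES_JSON = """
[
  {"field": "speed_limit_mph", "type": "range", "min": 5, "max": 85},
  {"field": "surface_type", "type": "allowed",
   "values": ["asphalt", "concrete", "gravel"],
   "normalize": {"AC": "asphalt", "PCC": "concrete"}}
]
"""


def load_rules(text):
    """Parse a JSON rule set into a list of rule dictionaries."""
    return json.loads(text)


def check_record(record, rules):
    """Apply each rule to one record.

    Returns a (corrected_record, issues) pair, where issues is a list of
    (field, problem) tuples. The input record is not modified.
    """
    corrected = dict(record)
    issues = []
    for rule in rules:
        field = rule["field"]
        value = corrected.get(field)
        if value is None:
            issues.append((field, "missing"))
            continue
        if rule["type"] == "range":
            # Non-numeric values are treated as out of range.
            in_range = (
                isinstance(value, (int, float))
                and rule["min"] <= value <= rule["max"]
            )
            if not in_range:
                issues.append((field, "out_of_range"))
        elif rule["type"] == "allowed":
            # Map known legacy codes to their standard value, then validate.
            mapped = rule.get("normalize", {}).get(value, value)
            if mapped not in rule["values"]:
                issues.append((field, "not_allowed"))
            elif mapped != value:
                corrected[field] = mapped
                issues.append((field, "normalized"))
    return corrected, issues


if __name__ == "__main__":
    rules = load_rules(RULES_JSON)
    fixed, issues = check_record(
        {"speed_limit_mph": 120, "surface_type": "AC"}, rules
    )
    print(fixed)
    print(issues)
```

In this sketch, only the unambiguous code mapping is corrected automatically, while the out-of-range value is flagged and left for review. Where to draw that line between automatic correction and human review is one of the questions the research would need to address.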
This research will document the feasibility of each capability tested, the success rates of different approaches, implementation requirements and complexity, what proved effective or ineffective, and the challenges encountered. It will also provide recommendations for future research.
Expected benefits include a clearer understanding of the current capabilities of AI to improve data quality and reduce the effort needed to maintain it. The research will also help define appropriate and inappropriate uses of AI in the context of data quality. In addition, the findings are expected to streamline data quality assessments, leading to improved data quality, and to give agencies a shared understanding of current potential, performance limitations, and future research needs.