This paper outlines the relevant EU regulatory framework, quality control (QC) and software assurance requirements, as well as a risk-based validation strategy for deploying AI tools within a GxP environment or for interactions with healthcare professionals (HCPs). The focus is on AI validation in pharma, ensuring that tools are both innovative and compliant.
1. Relevant regulatory frameworks within the EU
Under the European regulatory framework for medicinal products, the quality and reliability of any computerized system that could influence understanding or decision-making are paramount. Standards for these factors are captured in Good Practices, referred to as GxP. GxP regulations require accuracy, traceability and reproducibility of information, based on standards defined by organizations such as the ICH (International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use) or laid down explicitly in EU regulation. For manufacturing and distribution, the EU GMP Regulations as presented in EudraLex Vol. 4 are applicable.
Annexes to the EU GMP Regulations further detail the requirements as defined by the EU Commission. EU GMP Annex 11 defines the requirements for the use of computerized systems. It requires that all such systems undergo formal validation, typically through installation qualification, operational qualification and performance qualification, to demonstrate that they consistently perform as intended. In the EU, this process is referred to as computerised system validation (CSV), which traditionally follows a structured, documentation-heavy, top-down approach. Although a new version of the annex has recently been published for public consultation, it still follows earlier patterns of software development and deployment and does not reflect today's standards such as cloud computing and computing environments serviced by third parties – not to mention concepts such as blockchain and AI.
By contrast, in the United States, the FDA promotes computer software assurance (CSA) as a successor to CSV, using a more flexible, risk-based approach that integrates platforms and services provided by third parties. While the EU has not formally adopted CSA, CSA-style risk-based testing can be applied within Annex 11/GAMP 5 when justified and documented.
For the EU, requirements for the use of AI are being developed in a new annex to EU GMP, Annex 22. This is where GMP meets AI, bridging traditional pharmaceutical manufacturing standards with modern AI-driven processes. In July 2025, the first draft of the annex was published for public consultation, open until October 2025. This draft is limited to output generated by AI tools based on static content.
On top of the GxP requirements, the EU AI Act is being rolled out in phases. Chatbots that perform factual retrieval are usually categorized as “limited-risk” under the AI Act, subject mainly to transparency obligations (e.g., disclosure that the user is interacting with AI). Because the chatbot's output is reviewed by an expert, a human remains in the loop and drives the application of the retrieved information to patients. However, even for a limited-risk application, governance and quality assurance are of the utmost importance.
In parallel, GDPR governs the processing of any personal data within physician interactions, demanding robust access control, auditability and secure data storage. Personal data (including physician identifiers and patient cases shared during queries) must be processed under the GDPR Art. 5 principles of data minimization and purpose limitation, with explicit consideration of retention and audit trail requirements.
2. Quality Control
In the pharmaceutical context, software assurance means providing documented, lifecycle-long evidence that the system is fit for its intended use, resilient to change and consistently reliable. Following a compliance framework such as the ISPE GAMP 5 principles, a risk-based approach should guide the depth of verification and validation activities. Supplier qualification is essential, especially for cloud-hosted models or third-party natural language processing components, where service agreements must address data integrity, change notification and uptime commitments. Change management procedures should govern updates to datasets, model versions or conversation handling logic, ensuring that no modification enters production without prior verification. Operational controls, including user access management, monitoring and incident response processes, provide an additional safeguard against both technical and compliance failures. Continuous assurance, through periodic re-testing and accuracy monitoring, ensures that performance does not degrade over time, a risk that is particularly relevant for AI models subject to drift.
QC in this context must address the accuracy of the AI system. AI quality control encompasses accuracy, reliability and auditability across diverse input-output scenarios. The AI tool must reliably retrieve information from trustworthy sources, without introducing promotional bias or off-label interpretations. Each interaction should be logged with the necessary details (e.g., query, retrieved source, generated output and system version) to create an audit trail that would withstand a regulatory inspection. A central question is: how is accuracy measured? Metrics such as exact match accuracy, factual accuracy rate and critical error rate must be complemented by qualitative assessments of conversational coherence across a wide range of possible inputs, e.g., variations of the questions asked or the input provided. Given the potential patient safety impact, the critical error rate should be set as low as possible. For instance, when validating a pharmacovigilance case-intake assistant, exact match accuracy may be applied to verify that all required data is captured correctly; factual accuracy rate can be applied to literature summarization tools; and critical error rate is particularly relevant when AI is used for dosage-related outputs.
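As a minimal illustration of how such metrics could be computed, the sketch below derives the three quantitative measures from a set of reviewed test results; the TestResult structure and the reviewer labels are illustrative assumptions rather than a prescribed schema.

```python
# Illustrative sketch: computing the three quantitative QC metrics from reviewed
# test results. The TestResult structure and the reviewer labels ("correct",
# "minor_error", "critical_error") are assumptions for this example, not a
# prescribed schema.
from dataclasses import dataclass

@dataclass
class TestResult:
    query: str       # physician question submitted to the AI tool
    expected: str    # SME-defined reference answer
    actual: str      # answer produced by the AI tool
    label: str       # reviewer verdict: "correct", "minor_error" or "critical_error"

def exact_match_accuracy(results: list[TestResult]) -> float:
    """Share of outputs that reproduce the reference answer verbatim."""
    return sum(r.actual.strip() == r.expected.strip() for r in results) / len(results)

def factual_accuracy_rate(results: list[TestResult]) -> float:
    """Share of outputs judged factually correct by the reviewer."""
    return sum(r.label == "correct" for r in results) / len(results)

def critical_error_rate(results: list[TestResult]) -> float:
    """Share of outputs containing a safety-relevant error, e.g. a wrong dosage."""
    return sum(r.label == "critical_error" for r in results) / len(results)
```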
3. AI as a tool for validation
As AI systems can process a wide variety of inputs and generate equally diverse outputs, validating them by providing a fixed set of input phrases and comparing the resulting outputs for completeness, correctness and consistency presents a significant challenge. The traditional validation paradigm—where a specific input deterministically produces one predefined output—is no longer sufficient or appropriate for AI-driven systems. Instead, validation must account for acceptable ranges of variation, ensure that mandatory information is always present and confirm that outputs remain accurate, safe and compliant within defined boundaries.
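The shift from one predefined output to defined boundaries can be illustrated with a simple acceptance check: rather than comparing against a single reference string, the test verifies that SME-defined mandatory elements are present and prohibited content is absent. The phrase lists in the sketch below are hypothetical placeholders, not validated acceptance criteria.

```python
# Illustrative sketch: boundary-based acceptance check for a single AI-generated
# answer. Instead of comparing against one fixed output, the test verifies that
# SME-defined mandatory elements are present and prohibited content is absent.
# The phrase lists are hypothetical placeholders.
import re

MANDATORY_ELEMENTS = [r"contraindicated in pregnancy", r"hepatic impairment"]
PROHIBITED_CONTENT = [r"off-label", r"superior to \w+"]  # e.g. promotional claims

def within_defined_boundaries(output: str) -> tuple[bool, list[str]]:
    """Return (pass/fail, findings) for one answer against the defined boundaries."""
    findings = []
    for pattern in MANDATORY_ELEMENTS:
        if not re.search(pattern, output, re.IGNORECASE):
            findings.append(f"missing mandatory element: {pattern}")
    for pattern in PROHIBITED_CONTENT:
        if re.search(pattern, output, re.IGNORECASE):
            findings.append(f"prohibited content detected: {pattern}")
    return (not findings, findings)
```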
One of the most promising developments in this space is the use of AI itself to evaluate AI systems under a risk-based testing paradigm. Traditionally, CSV or CSA processes involve manually designing test cases, crafting prompts and manually reviewing chatbot outputs, a highly resource-intensive exercise. Emerging tools now allow the automated generation of thousands of prompts based on defined parameters. These prompts are fed into the AI tool and the outputs are evaluated or screened automatically by AI against defined quality categories, such as the five ChatGPT QC categories (factual accuracy, completeness, relevance, safety and style). For example, in a validation scenario, 1,000 automatically generated physician queries about contraindications could be submitted to the AI tool, with each output scored for factual accuracy against the source content, relevance to the query and safety (no off-label advice). By leveraging AI in this way, it becomes possible to test with far greater coverage in less time, enabling much more comprehensive risk-based validation. This demonstrates the potential of AI for GxP validation, where AI is both the subject and the instrument of assurance.

That process is guided by human expertise and the clear definition of expected outcomes. A subject matter expert (SME) must specify representative input patterns as well as the corresponding expected outputs that the system should generate. The SME also determines which elements of an output are mandatory and which may be considered optional or acceptable alternatives. In certain cases, the SME may explicitly define content that must not be produced by the tool, such as off-label or promotional statements. To ensure comprehensive testing, combinations of inputs and their respective expected outcomes should also be defined. As the number of input patterns and the variety of possible outputs grow, the dataset to be analyzed and compared can quickly reach considerable complexity, requiring structured methods and tools for efficient evaluation.
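A minimal sketch of such an automated test loop is shown below. The three callables (variant generator, system under test and evaluator) are deliberately injected as parameters because the concrete generation, chatbot and scoring services will differ between installations; the quality categories mirror those named above.

```python
# Illustrative sketch: AI-driven risk-based testing of an AI tool. Prompt variants
# are generated from SME-defined seed questions, submitted to the system under test
# and scored against the quality categories named in the text. The three callables
# are placeholders to be wired to the actual generation, chatbot and evaluation services.
from typing import Callable

CATEGORIES = ["factual_accuracy", "completeness", "relevance", "safety", "style"]

def run_risk_based_test(
    seed_questions: list[str],
    generate_variants: Callable[[str, int], list[str]],   # AI paraphrase generator
    query_system: Callable[[str], str],                    # system under validation
    score_output: Callable[[str, str, str], float],        # evaluator model, 0-1 per category
    variants_per_seed: int = 100,
) -> list[dict]:
    """Generate prompt variants, submit them and score every answer per category."""
    results = []
    for seed in seed_questions:
        for prompt in generate_variants(seed, variants_per_seed):
            answer = query_system(prompt)
            scores = {c: score_output(prompt, answer, c) for c in CATEGORIES}
            results.append({"seed": seed, "prompt": prompt, "answer": answer, "scores": scores})
    return results
```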
Generating a wide range of inputs from the expert-defined data and checking the outputs against the expected outcomes across all results is an ideal field of application for AI itself. Variations of the input can be generated by AI and fed automatically into the AI tool; the results can then be checked against the expert's expectations as well as across all results to highlight differences and to evaluate the completeness of testing. Finally, the findings can be compiled into an AI-generated test report that is reviewed and approved by the expert.
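Building on the loop sketched above, the results could be condensed into a draft report for the expert to review, for example by aggregating per-category scores and flagging every answer that falls below an acceptance threshold; the threshold and field names used here are illustrative assumptions.

```python
# Illustrative sketch: condensing the test results into a draft report for expert
# review. Per-category scores are aggregated and every answer below an SME-defined
# acceptance threshold is flagged; the threshold and field names are assumptions.
from statistics import mean

def summarize(results: list[dict], threshold: float = 0.95) -> dict:
    categories = results[0]["scores"].keys() if results else []
    return {
        "total_prompts": len(results),
        "mean_scores": {c: mean(r["scores"][c] for r in results) for c in categories},
        "flagged_for_review": [
            r for r in results if any(s < threshold for s in r["scores"].values())
        ],
    }
```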