iSpeak Blog

Validating Artificial Intelligence with Probabilistic Output in a Good Manufacturing Practice (GMP) Environment

Bogdan Klimas
[Illustration: Validating AI in GMP]

Source: illustration generated by the author using AI 

The question is not, “Can we validate AI?” It is, “Are we applying the same standard we apply to people?”

The draft of EU GMP Annex 22 issued in late 2025 reads: “The document applies to models with a deterministic output which, when given identical inputs, provide identical outputs. Models with a probabilistic output which, when given identical inputs, might not provide identical outputs are not covered by this document and should not be used in critical GMP applications.”

This sentence from the draft regulation prompted me to compare it with the historical law shown in the picture above and to offer the following polemic.

The question immediately arises:

Do trained personnel working on critical GMP applications consistently produce identical outcomes from identical inputs? In practice, not always. Humans are not AI, but numerous AI models operate in a way that is closer to human judgment (statistical, variable within defined limits) than to classic rule-based computerized systems.

Let’s start with a couple of questions that are provocative but meant to move the conversation forward.

If manual tasks performed by humans are evaluated solely based on operator qualification rather than direct validation, shouldn’t AI be evaluated using similar criteria?

Or should it still be treated as a machine, with its benefits rejected merely because of its “human-like” attributes and behaviors?

These are not theoretical questions. They are exactly the questions that determine whether AI becomes (a) a controlled improvement, (b) an uncontrolled risk, or (c) a “forbidden” topic that everyone uses anyway.

1) Can the “Traditional Validation” rules be applied to AI?

Classic computerized system validation (CSV)/Annex 11-style validation is comfortable when a system behaves like a calculator: fixed logic, predictable outputs, and changes that are obvious (new version → revalidate). But is it worth using AI as a calculator (or as a car with a maximum speed of 2 or 4 miles per hour)?

Probabilistic AI is different because it behaves more like an analytical test method or, even more so, like a qualified operator: performance is probabilistic and depends heavily on input data, context, and operating conditions.

Regulators already hint at this shift without calling it “human-like.” The US Food and Drug Administration’s (US FDA) Artificial Intelligence in Drug Manufacturing discussion paper explicitly positions AI as something that may monitor/control advanced manufacturing processes (i.e., a quality-relevant actor that must be controlled across its lifecycle).

In parallel, the European Union (EU) has drafted a dedicated GMP Annex 22: Artificial Intelligence, which reads like a direct admission that “Computerized System Validation” needs extra AI-specific scaffolding (intended use, acceptance criteria, test data independence, explainability, confidence, operation).

So classic validation still applies, with its user requirement specification (URS), risk assessment, testing, traceability, and change control, but it needs an additional layer:

With AI, the system is validated the traditional way, while performance is treated as something to be qualified.

If the “system validation” is the only part completed, this may result in a compliant IT shell wrapped around an unmeasured statistical engine. In such cases, Annex 11 controls may demonstrate that AI outputs are attributable, auditable, and secure. But do they demonstrate that those outputs are correct? The draft of Annex 22 can feel as though it sidesteps this challenge by effectively excluding probabilistic AI models from GMP use, even though these technologies evolve faster than the regulatory drafting, consultation, and publication cycle.

2) “Human work can’t be validated, only qualified.” Can this way of thinking be extended to AI?

This is where people either get stuck—or get clever.

AI is not a person, and it is still a computerized system component, so it absolutely falls under validation requirements when used in a GMP context. However, the intuition behind the statement remains valid: the best mental model for AI performance aligns more closely with human qualification than with deterministic software testing.

The reason lies in existing GMP practice. GMP already accepts that human-performed critical steps are inherently variable and probabilistic, and the control strategy is:

  • Define what “good performance” looks like (i.e., define the acceptable variability in these decisions, including the accepted error rate)
  • Evaluate and qualify performance against known challenges
  • Requalify periodically (e.g., to ensure required human skills remain within the accepted standard)
  • Intervene when performance drifts or trends outside of control

This approach does not represent reduced rigor. On the contrary, it reflects a mature way of controlling systems that cannot be reduced to pass/fail logic. This framework maps surprisingly well to the behavior of AI systems.

Nevertheless, realism and transparency are essential. Unpredicted behavior is not necessarily more frequent in AI than in humans, but when it occurs it can be more abrupt, more systemic, and harder to anticipate without proper application and explicit drift/out-of-trend controls in place.

Weighing the pros and cons, the conclusion is that the qualification and requalification mindset can, and should, be used to responsibly extend the application of probabilistic AI within GMP environments. However, this approach must not be misused as a mechanism to bypass validation. Instead, it should be recognized as the missing second half of the control framework, because the technical foundation of AI still requires a traditional CSV approach.

3) The Sterile Visual Inspection Precedent: Why Knapp-Style Logic can Apply to AI in GMP Controls

Sterile visual inspection already uses Knapp-style qualification to show that automated inspection performs as well as or better than human inspectors. This raises an important question: why cannot the same logic be applied to other probabilistic AI-driven GMP controls?

Visual inspection is GMP’s open secret: defect detection is inherently probabilistic. Human inspectors miss defects and performance varies both between individuals and within the same individual over time. Rather than ignoring this reality, the industry built a disciplined approach using challenge sets and probability-based qualifications.

Although automated visual inspection is considered the industry standard, it remains a probabilistic method. Decisions rely on signals and thresholds, which means that false rejects and false accepts are bound to occur. Teams tolerate this because these systems minimize the impact of human variability and often provide more consistent detection than individual inspectors, as demonstrated through Knapp testing.

As described in the literature, Knapp’s methodology is based around repeated inspections to establish rejection probabilities for challenge containers and define qualification expectations. The objective is not perfect detection, but rather demonstrable, measurable performance that can be statistically characterized and maintained under control.

Automation is acceptable when it can be demonstrated that performance is at least as good as, and often better than, manual inspection, and it can be kept under control over time.
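To make the Knapp-style comparison concrete, here is a minimal Python sketch, not a validated protocol. It assumes that each challenge container has been inspected repeatedly by both the manual baseline and the automated system, computes a rejection probability per container, and compares the mean rejection probability over the containers that the manual baseline rejects reliably. The function names, the toy data, and the 0.7 reject-zone cut-off (a convention often cited for Knapp-style studies) are illustrative assumptions; real limits belong in the qualification protocol.

```python
from typing import Dict, List

# Assumption: cut-off above which the manual baseline "reliably rejects" a container.
REJECT_ZONE_CUTOFF = 0.7


def rejection_probabilities(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of repeated inspections in which each container was rejected."""
    return {cid: sum(runs) / len(runs) for cid, runs in results.items()}


def compare_to_manual(
    manual: Dict[str, List[bool]],
    automated: Dict[str, List[bool]],
    cutoff: float = REJECT_ZONE_CUTOFF,
) -> dict:
    """Compare mean rejection probability in the manually defined reject zone."""
    p_manual = rejection_probabilities(manual)
    p_auto = rejection_probabilities(automated)
    # The reject zone is defined by the manual baseline, not by the system under test.
    zone = [cid for cid, p in p_manual.items() if p >= cutoff]
    if not zone:
        raise ValueError("no challenge container reaches the reject-zone cut-off")
    rze_manual = sum(p_manual[cid] for cid in zone) / len(zone)
    rze_auto = sum(p_auto[cid] for cid in zone) / len(zone)
    return {
        "reject_zone": zone,
        "rze_manual": rze_manual,
        "rze_automated": rze_auto,
        "at_least_equivalent": rze_auto >= rze_manual,
    }


# Hypothetical challenge set: True means "rejected" in one inspection run.
manual_runs = {
    "C1": [True] * 9 + [False] * 1,
    "C2": [True] * 8 + [False] * 2,
    "C3": [True] * 2 + [False] * 8,
}
automated_runs = {
    "C1": [True] * 10,
    "C2": [True] * 9 + [False] * 1,
    "C3": [True] * 1 + [False] * 9,
}
report = compare_to_manual(manual_runs, automated_runs)
print(report)
```

The same shape of comparison could, in principle, be reused for other probabilistic controls by replacing “container” with any challenge item that has a known expected outcome.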

This exact principle is not articulated in a single universal GMP statement governing all AI applications, which is why visual inspection is often viewed as a special case.

US FDA’s process analytical technology framework contains a strikingly similar “equivalent or better” principle, under which real-time quality assurance can be justified as at least equivalent to, or better than, traditional end-product testing on collected samples.

USP General Notices reflect this philosophy by requiring that any alternative method must demonstrate results that are equal to or better than those obtained with the official compendial procedure.

The intent is not to “make up a new GMP principle” but to connect existing GMP expectations by:

  • Qualifying probabilistic tasks through challenge sets or defined scenarios
  • Demonstrating “equivalent or better assurance” when replacing or augmenting an accepted control
  • Keeping the approach robust through requalification, change control, and ongoing monitoring

What makes visual inspection special is that it already has a mature culture of comparative performance evidence. That maturity can be exported to other AI controls.

That is not reckless. It is consistent.

4) A Proposed AI Performance Qualification Model that QA can Defend

One simple framework that feels “GMP-native” is this:

Define → Challenge → Qualify → Monitor → Requalify → Control Change

Note the difference: deterministic systems are implemented and then verified, whereas probabilistic AI must be implemented and then challenged, because logic alone cannot provide proof. Evidence comes instead from observing performance under stress.

This approach supports building an AI Golden Set in the same manner as an inspector challenge set: curated, governed, versioned, and designed to expose failure, not to flatter performance.
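One way to make “governed and versioned” tangible is to give every Golden Set release a content fingerprint that the qualification report can cite. The sketch below is only an illustration under stated assumptions: the record fields (`sample_id`, `defect_class`, `image_sha256`) and both function names are hypothetical, and the minimum number of examples per class would come from the risk assessment.

```python
import hashlib
import json
from collections import Counter
from typing import Dict, List


def golden_set_fingerprint(records: List[Dict[str, str]]) -> str:
    """SHA-256 over a canonical serialization, independent of record order."""
    ordered = sorted(records, key=lambda r: r["sample_id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def under_represented_classes(records: List[Dict[str, str]], min_per_class: int) -> List[str]:
    """Classes with too few challenge items to expose failure convincingly."""
    counts = Counter(r["defect_class"] for r in records)
    return sorted(c for c, n in counts.items() if n < min_per_class)


# Hypothetical Golden Set entries; the hashes are placeholders, not real digests.
golden_set = [
    {"sample_id": "S-001", "defect_class": "crack", "image_sha256": "aa11"},
    {"sample_id": "S-002", "defect_class": "particle", "image_sha256": "bb22"},
    {"sample_id": "S-003", "defect_class": "particle", "image_sha256": "cc33"},
]
print(golden_set_fingerprint(golden_set))
print(under_represented_classes(golden_set, min_per_class=2))
```

Any edit to the set, even relabeling one sample, yields a different fingerprint, so a qualification result can always be tied to the exact set it was obtained on.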

Acceptance criteria are then set to match the risk: not “100 percent accuracy” and not “it looks good,” but metrics that correspond to potential harm:

  • Miss rate/sensitivity for critical events
  • False reject rate (workload + risk of masking true signals)
  • Performance by class and severity
  • Error analysis + mitigations (what happens when the model is uncertain)
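The metrics listed above can be computed directly from Golden Set results. The following Python sketch is a simplified illustration, assuming each record is a pair of the true defect class (`None` for a good unit) and whether the model rejected the unit. The class names, limits, and function names are hypothetical, and a real protocol would also justify sample sizes and confidence bounds.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

# (true defect class, or None for a good unit; did the model reject the unit?)
Record = Tuple[Optional[str], bool]


def performance_by_class(records: Iterable[Record]) -> Tuple[Dict[str, float], float]:
    """Return the miss rate per defect class and the overall false reject rate."""
    defects, missed = Counter(), Counter()
    good = false_rejects = 0
    for defect_class, rejected in records:
        if defect_class is None:
            good += 1
            false_rejects += int(rejected)
        else:
            defects[defect_class] += 1
            missed[defect_class] += int(not rejected)
    miss_rate = {c: missed[c] / defects[c] for c in defects}
    false_reject_rate = false_rejects / good if good else 0.0
    return miss_rate, false_reject_rate


def acceptance_failures(
    records: Iterable[Record],
    max_miss_rate: Dict[str, float],
    max_false_reject_rate: float,
) -> List[str]:
    """List every acceptance criterion that is not met (empty list = pass)."""
    miss_rate, frr = performance_by_class(records)
    failures = []
    for defect_class, limit in max_miss_rate.items():
        if defect_class not in miss_rate:
            # A class without challenge units provides no evidence at all.
            failures.append(f"{defect_class}: no challenge units in the set")
        elif miss_rate[defect_class] > limit:
            failures.append(f"{defect_class}: miss rate {miss_rate[defect_class]:.3f} > {limit:.3f}")
    if frr > max_false_reject_rate:
        failures.append(f"false reject rate {frr:.3f} > {max_false_reject_rate:.3f}")
    return failures


# Hypothetical results and limits: stricter for critical defects than for minor ones.
results = (
    [("critical", True)] * 50
    + [("minor", True)] * 18 + [("minor", False)] * 2
    + [(None, False)] * 97 + [(None, True)] * 3
)
print(acceptance_failures(results, {"critical": 0.0, "minor": 0.15}, 0.05))
```

Setting a separate limit per severity class keeps a strong result on frequent, minor defects from masking a miss on a rare, critical one.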

Then the crucial step most organizations skip is recognizing that approval is not permanent. Instead, it is managed like a qualified operator via:

  • Periodic re-qualification (risk-based)
  • Drift monitoring (data + performance signals), ideally on a continuous basis
  • Formal triggers that open an investigation
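A formal trigger can be as simple as a control limit around the reject rate observed during qualification. The sketch below borrows the classic p-chart from statistical process control, with limits of the baseline rate plus or minus three standard errors for the batch size. It is one possible performance signal, not a complete drift-monitoring scheme. The baseline rate, the batch data, and the function names are assumptions for illustration.

```python
import math
from typing import List, Tuple


def p_chart_limits(baseline_rate: float, n: int, sigmas: float = 3.0) -> Tuple[float, float]:
    """Control limits for a proportion observed on n units."""
    if n <= 0:
        raise ValueError("batch size must be positive")
    se = math.sqrt(baseline_rate * (1.0 - baseline_rate) / n)
    return max(0.0, baseline_rate - sigmas * se), min(1.0, baseline_rate + sigmas * se)


def drift_triggers(
    baseline_rate: float, batches: List[Tuple[str, int, int]]
) -> List[Tuple[str, float, float, float]]:
    """Return (batch_id, rate, lower, upper) for each batch outside its limits."""
    triggers = []
    for batch_id, n_units, n_rejected in batches:
        lower, upper = p_chart_limits(baseline_rate, n_units)
        rate = n_rejected / n_units
        # A sudden drop matters too: it may mean defects are being missed.
        if not lower <= rate <= upper:
            triggers.append((batch_id, rate, lower, upper))
    return triggers


# Hypothetical data: 2 percent reject rate at qualification, three production batches.
batches = [("B1", 1000, 22), ("B2", 1000, 41), ("B3", 1000, 3)]
for batch_id, rate, lower, upper in drift_triggers(0.02, batches):
    print(f"{batch_id}: reject rate {rate:.3f} outside [{lower:.3f}, {upper:.3f}], open investigation")
```

In this toy data, batch B1 stays within the limits while B2 (too many rejects) and B3 (too few) would each open an investigation.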

This fits perfectly with the draft of the EU Annex 22’s structure: intended use, acceptance criteria, independence of test data, explainability/confidence, and operation under control. So why should the “human-like” (variable-within-limits) nature of AI outcomes be excluded from an acceptable GMP control model?

Moreover, “human error” failure modes are already understood—fatigue, distraction, stress, workload, and other personal factors. GMP does not deny these; it manages them through training, qualification, supervision, and risk-based controls. If we accept human variability under a controlled regime, it is only consistent to require explicit performance monitoring and drift controls for probabilistic AI.

5) Where “Traditional Validation” Still Applies, and Cannot be Negotiated

Performance qualification of probabilistic AI looks fresh and new, but it is not the whole story. Inspectors will continue to devote substantial attention to the boring, routine aspects, as these are often the areas where failures become systemic.

For GMP use, the classic controls around the AI component and its surrounding system are still needed. See the key examples below:

  • URS and traceable requirements
  • Audit trail and data retention
  • Access control and segregation of duties
  • Validated interfaces (manufacturing execution systems/laboratory information management systems/quality management systems, sensors, cameras)
  • Incident handling and periodic review
  • Change control

And it is necessary to say aloud what people sometimes try to soften with euphemisms:

  • Retraining is a change
  • Threshold adjustments are a change
  • Feature pipeline updates are a change
  • Camera/lighting/sensor changes that alter inputs are a change
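One hedged way to keep these changes from slipping through is to record the qualified configuration as a baseline and compare the running configuration against it. The Python sketch below is illustrative only: the keys (`model_weights_sha256`, `decision_threshold`, `feature_pipeline_version`, `camera_config_id`) and the function name are hypothetical, chosen to mirror the four items above.

```python
from typing import Any, Dict, List


def items_requiring_change_control(
    qualified: Dict[str, Any], current: Dict[str, Any]
) -> List[str]:
    """Names of configuration items that differ from the qualified baseline."""
    keys = sorted(set(qualified) | set(current))
    # Added, removed, and modified items are all treated as changes.
    return [k for k in keys if qualified.get(k) != current.get(k)]


# Hypothetical baseline captured at qualification, and the state found in production.
qualified_state = {
    "model_weights_sha256": "3f9a",        # placeholder digest; changes on retraining
    "decision_threshold": 0.80,
    "feature_pipeline_version": "1.4.0",
    "camera_config_id": "CAM-A-rev2",
}
running_state = dict(qualified_state, decision_threshold=0.75)

print(items_requiring_change_control(qualified_state, running_state))
```

A non-empty result does not say whether the change is acceptable. It only ensures that the change is visible and routed through change control before the qualified status is relied upon.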

This is exactly why lifecycle framing (ISPE GAMP® concepts applied to machine learning (ML)) is so useful: it puts ML under the same discipline as any other high-impact subsystem, only with more explicit attention to “ground truth” and performance evaluation.

The “AI validation vs. qualification vs. lifecycle control” debate is already visible in public, credible sources. To mention just the recent publications from ISPE, the US FDA, the European Medicines Agency (EMA), and the European Commission:

  • ISPE GAMP® Guide: Artificial Intelligence (released 2025)
  • US FDA’s discussion paper on AI in drug manufacturing (signals the regulatory issues and need for lifecycle control)
  • EMA’s reflection paper on AI in the medicinal product lifecycle (risk-based scrutiny, trustworthiness framing)
  • EU draft GMP Annex 22 (explicitly introduces AI-specific GMP expectations—acceptance criteria, test data independence, explainability/confidence, operation)

6) Boundaries and Open Questions

Any credible discussion of probabilistic AI in GMP must acknowledge that not all GMP decisions are suitable for variable outcomes (yet). Highlighting the need for manufacturers to define risk-based boundaries is not a rejection of AI, but a normal outcome of risk-based thinking.

Some very specific activities may remain excluded, not as a general rule but as risk-based exceptions that the planned new regulations could set out. These could include, for example:

  • High-impact, irreversible batch release decisions
  • Fully autonomous GMP controls without human oversight, particularly where no independent or redundant detection or mitigation layer exists

This mirrors existing GMP practice. Human autonomy is already limited in similar situations through dual review, segregation of duties, or escalation requirements. The same logic applies to AI: intended use matters.

A second important concern is accountability: who is responsible when AI gets it wrong?

The answer is straightforward and familiar in GMP: responsibility remains with the organization’s management, just as when unqualified or poorly supervised personnel are involved. GMP never assigns responsibility to a tool; it evaluates qualification, supervision, monitoring, and governance. AI is no different.

Open questions—such as acceptance criteria, drift thresholds, or explainability expectations—are not a weakness. GMP has always progressed by managing uncertainty through evidence, monitoring, and lifecycle control rather than denying it.

The objective is not to lower standards or automate responsibility, but to acknowledge variability explicitly and control it better than today, calmly, transparently, and consistently with established GMP principles.

7) Summary Conclusion

Just as the early attempts to control motor vehicles through restrictive prohibitions delayed but did not prevent progress, the plan to exclude probabilistic AI from the GMP domain is unlikely to be sustainable. Rather than repeating such historical mistakes, it is wiser to establish a robust, risk-based control framework, so that rapidly evolving AI can be governed responsibly instead of drifting into informal or uncontrolled use. [1, 2, 3, 4, 5]



Disclaimer

iSpeak blog posts provide an opportunity for the dissemination of ideas and opinions on topics impacting the pharmaceutical industry. Ideas and opinions expressed in iSpeak blog posts are those of the author(s) and publication thereof does not imply endorsement by ISPE.



References

  1. United Kingdom legislation. The Locomotives Act 1865.
  2. European Commission. Annex 22: Artificial Intelligence. Draft guideline issued in the stakeholder consultation on EudraLex Volume 4 – Good Manufacturing Practice Guidelines: Chapter 4, Annex 11, and new Annex 22. 2025.
  3. International Society for Pharmaceutical Engineering (ISPE). ISPE GAMP® Guide: Artificial Intelligence. 2025.
  4. US FDA, Center for Drug Evaluation and Research (CDER). Artificial Intelligence in Drug Manufacturing. Discussion Paper. 2023.
  5. EMA. Reflection paper on the use of Artificial Intelligence (AI) in the medicinal product lifecycle. EMA/CHMP/CVMP/83833/2023. 2024.