Data Science for Pharma 4.0™, Drug Development, & Production—Part 1
Digital transformation and digitalization are on the agenda for all organizations in the biopharmaceutical industry. But what are the main enablers of intelligent manufacturing? We hypothesize that data science–derived manufacturing process and product understanding is the main driver of digitalization in the bioprocessing industry for biologics manufacturing. In this article, the first of a two-part series, we analyze the prerequisites for establishing data science solutions and present key data science tools relevant to the process development stage.
Part 2 will focus on applications of data science tools in product life-cycle management. The goal of the series is to highlight the importance of data science as the necessary adjunct to digitalization for intelligent biomanufacturing. We focus on the potential of data science in industry and present data science tools that might deliver fast results in different subject areas. We want to stimulate the use of data science to transform drug production and achieve major business goals, such as accelerated time to clinic and market and improved process robustness based on continued process verification (CPV).
Digitalization and Data Science
Industry 4.0 and the industrial internet of things (IIoT), as defined elsewhere,
The term “digitalization” is often used very broadly. In this article, we define it within the pharmaceutical context as the conversion of all data along the product life cycle, from pharmaceutical development and manufacturing onward, into a computer-readable format. However, the availability of data alone is insufficient. We need data science to learn from the data, because their complexity can exceed what humans can interpret unaided.
Specifically, digitalization and related data science tools for the biopharmaceutical industry mainly act on two dimensions:
- The manufacturing process chain (Figure 1)
- The product life cycle (addressed in Part 2)
These dimensions are, of course, interlinked.
The enhancement of the manufacturing process chain by digitalization affects the supply chain, logistics, and predictive (rather than preventive) maintenance. In addition, the enhancement impacts process robustness, process understanding for product life-cycle management purposes, and deviations management.
Hence, the objectives of the digitalization-enhanced manufacturing process chain are to:
- Integrate data from facilities, sites, suppliers, and clients.
- Use quality metrics to increase process and manufacturing transparency.
- Allow feedback loops within the life cycle for continuous improvement.
- Establish multiproduct facilities in response to shorter product life cycles.
- Allow flexible resource management.
- Establish transparent and flexible business process workflows.
Without any doubt, sufficient data are already available to realize these objectives. However, the data must first be managed to achieve a structured format—data management is a key prerequisite for data science. Additionally, there is the strong need for software tools to generate knowledge from data. Current tools include data visualization and data-driven or mechanistic models;
Data science is used to gather and provide the knowledge needed for the timely and predictive development of new pharmaceutical drugs. Recent advancements in data science pave the way for a second revolution in medical sciences based on a rational understanding of the patient’s disorder, science-driven (rather than heuristic) drug development and manufacturing, and knowledge-based and individually tailored therapy.
Data science is the most recent data, information, knowledge, wisdom (DIKW) concept.
Data Science Prerequisites
Data Alignment, Contextualization, and Integration
According to surveys, data scientists spend the majority of their work time preparing and processing data.
The intellectual property of pharmaceutical companies is mainly present in the form of data.
This arrangement is not cost-effective. Data processing and organization do not deliver any value by themselves, even though they are indispensable prerequisites to advancing drug development and production. Moreover, manual data manipulation bears the risk of introducing human errors into the data set.
There are multiple reasons for this awkward predicament. One is the vast diversity of data and data applications in the pharmaceutical industry: material supply information, data on the history of the used strains, experimental design data, process raw data, analytical data, derived process data, associated metadata, statistical models, mechanistic models, hybrid models, single-unit operation models, holistic models (e.g., integrated process models and digital twins), analysis workflows, validation workflows, and batch records, to name just a few. (A digital twin is a virtual representation of a physical or intangible object existing in the real world. With a twin, we can design experiments, predict process outcomes, and even optimize the process in real time.)
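To make the alignment problem concrete, the following minimal sketch shows one common contextualization step: attaching the nearest online sensor reading to each offline analytical sample. It is an illustration only. The function name `align_offline_to_online`, the pH and titer values, and the five-minute tolerance are assumptions introduced here, not part of any specific system.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def align_offline_to_online(online, offline, tolerance):
    """Attach to each offline sample the online reading closest in time.

    online    -- list of (timestamp, value) tuples, sorted by timestamp
    offline   -- list of (timestamp, value) tuples
    tolerance -- timedelta; if no online reading is this close, None is used
    """
    online_times = [t for t, _ in online]
    aligned = []
    for t_off, v_off in offline:
        i = bisect_left(online_times, t_off)
        # Candidates are the readings just before and just after the sample.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(online)]
        best = min(candidates, key=lambda j: abs(online_times[j] - t_off), default=None)
        if best is None or abs(online_times[best] - t_off) > tolerance:
            aligned.append((t_off, v_off, None))
        else:
            aligned.append((t_off, v_off, online[best][1]))
    return aligned


start = datetime(2020, 1, 1, 8, 0)
# Hypothetical online pH signal logged every 10 minutes.
online_ph = [(start + timedelta(minutes=10 * k), 7.00 - 0.01 * k) for k in range(7)]
# Hypothetical offline titer samples (g/L) taken at irregular times.
offline_titer = [
    (start + timedelta(minutes=12), 0.8),
    (start + timedelta(minutes=47), 1.1),
    (start + timedelta(hours=3), 1.9),  # no online reading within tolerance
]

for row in align_offline_to_online(online_ph, offline_titer, timedelta(minutes=5)):
    print(row)
```

Returning `None` for unmatched samples, rather than silently interpolating, keeps the gap visible to whoever uses the aligned data set later.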
Standards and Interfaces
Open-source, interoperable, platform-independent, widely accepted and implemented standard formats would help reduce the amount of time currently used for data processing and contextualization. However, given the market economy and the speed of development, the IT industry is affected more than any other industry by proprietary de facto standards, such as using Microsoft products. These standards prevent an efficient reuse of algorithms and data science tools.
Various attempts are currently under discussion for implementation of open-source standard formats. For example, the Allotrope Foundation aims to define and implement a common data format that focuses on the contextualization and linking of analytical data.
To use open-source standards and formats in pharmaceutical manufacturing, regulated companies must adopt proper controls. For example, the US FDA’s predicate rules influence which tools and systems are used in computer validation.
Open-source projects are far from general implementation, and it is unclear whether any of these approaches will become a commonly used standard. There is a danger that these efforts will not result in applied tools. Sometimes, the overhead to implement an interface in a standardized format is simply too high relative to the cost of an unstandardized format such as a simple CSV file. To encourage the adoption of open-source standard tools, leading software providers, application developers, and industrial organizations need to commit to making tools and their related interfaces easy to apply and implement.
The historic evolution of the Open Platform Communications (OPC) standard is a noteworthy model for other internet of things (IoT) applications. An OPC interface is used in the automation industry for communication between different software tools. Previous OPC standards had several technical drawbacks that limited their use for the IoT. One of their biggest issues was that they were based on Microsoft’s DCOM specification. Communication in a complex network was usually only possible with workarounds such as additional OPC tunnel tools. During the last decade, collaboration among different parties resulted in a unified architecture standard, OPC-UA, which is a generic, open-source, platform-independent, network- and internet-ready interface standard with built-in advanced security features. OPC-UA is functionally equivalent to the older versions of OPC but is extensible and allows modeling of data as more complex structures. The named features make OPC-UA the preferred standard for IoT applications.
Data Security and Data Integrity
In the pharmaceutical context, data security and following GAMP® guidelines
Data integrity is a related aspect of data security. Data integrity often refers to the completeness, consistency, and accuracy of data. According to US FDA guidance,
Sandboxes and Test Environments
In discussions of data scientists’ challenges and efforts, the actual implementation of developed algorithms and tools in production environments along the product life cycle is sometimes overlooked. Often, data scientists can quickly develop an algorithm or set up a model. However, it is not always clear how these results can be brought into production and used in a real-time context.
Current manufacturing environments are quite different from the development environments of data scientists. Data scientists are eager to use the latest tools and state-of-the-art technology. In contrast, automation specialists setting up the production environments are more cautious about the adoption of new technology that might put safety at risk. They usually rely on older, but time-proven, stable tools and software products. Bringing these two worlds together is a challenge, which is rarely addressed early enough in the implementation process.
Similarly, algorithms developed by data scientists in their development environments cannot simply be copied and pasted into the production environment. Often, a finished algorithm is just a prototype that proves feasibility. Too many promising tools never end up in a production facility.
Sandbox mode development and test environments can help overcome these issues. Development of data science tools requires time-consuming exploration involving many feedback cycles, such as agile development strategies. Data scientists need virtual production environments to test their algorithms and software solutions to improve these tools’ applicability for commercial GxP-compliant drug manufacturing.
Process Development Tools
Factors such as increased competition and resulting cost pressures in the biopharmaceutical market, new modalities for personalized medicine, and reported threats of drug shortages provide incentives for organizations to develop, in the words of Janet Woodcock, Director of FDA CDER, a “maximally efficient, agile, flexible pharmaceutical manufacturing sector that reliably produces high-quality drugs without extensive regulatory oversight.”
- Accelerate bioprocess development by avoiding iterations and deploying strategies to reduce the number of experiments.
- Develop universal scale-down models to accelerate process characterization, allow troubleshooting, and avoid unexpected scale-up effects.
- Deploy process analytical technology (PAT) strategies to (a) provide improved real-time operational control and compliance, (b) serve as an objective basis for process adjustments, and (c) provide a comprehensive data set for technology transfer decisions.
- Implement strategies to target integrated bioprocess development rather than optimizing single-unit operations only.
- Capture platform knowledge to achieve synergies with other products and extrapolate to other process modes such as continuous manufacturing.
- Allow life-cycle management aligned with ICH Q12, including holistic knowledge and product life-cycle management. This should include clear feedback loops from manufacturing into process development to establish holistic manufacturing control strategies.
Development-Oriented Tools
Table 1 lists data science tools that can be used in the bioprocessing life cycle. When an organization has a data management system in place that follows ALCOA principles, these tools can address almost all industrial needs.
Statistical tools for business process workflows
Multivariate analysis (MVA) has been used for decades, and several MVA-dedicated software tools are available to improve business process workflows. More advanced methods have recently been developed, and respective good practice guidelines have been established.
Digitalization will furthermore allow seamless interfacing between operational historians, laboratory information management systems, ELN, and so on, and should include automated feature extraction from 2D data (e.g., spectroscopic or chromatographic data) and 3D data (e.g., from flow cytometry or microscopy). Inspired by FDA validation guidelines,
| Industrial Needs / Tools | Statistics and Data Science Workflows | Digital Twin Generation | Digital Twin Deployment | Integrated Digital Twins | Digital Twin Validation, Evolution, Maintenance | Real-Time Environments | Process Control Strategies | Production Control Strategies |
|---|---|---|---|---|---|---|---|---|
| Accelerated process development | X | X | X | X | X | | | |
| Scale-down models | X | X | X | | | | | |
| PAT | X | X | X | X | X | | | |
| Integrated process development | X | X | X | X | X | | | |
| Platform knowledge | X | X | X | X | X | | | |
| Life-cycle management | X | X | X | X | | | | |
Workflows for model and digital twin generation
Since the publication of the ICH Q11 guideline,
As data are made available in cloud solutions, digitalization will further enable software-as-a-service (SaaS) solutions for the generation of minimum targeted models,
Model and digital twin deployment
Models capture comprehensive process understanding. Hence, they are perfect tools to provide knowledge for manufacturing intelligence. In technological language, models can be deployed in a multiparametric control strategy in a real-time context. Model-based control or model predictive control (MPC) software sensors are well established in the conventional process industry, but they have hardly been used so far in value-added process industries such as the biopharmaceutical sector. Why not?
One issue is the lack of appropriate knowledge management tools and strategies. Once data are provided to the modeled process in a real-time context, knowledge management tools are needed to check whether the model and the underlying knowledge are still valid. As ICH Q12 emphasizes, the industry needs computational model life-cycle management (CMLCM) strategies,
The execution of the knowledge-based strategy will lead to intelligent manufacturing and the realization of Pharma 4.0™. Digital twins can help us achieve this goal in a variety of ways.
It is widely recognized that digital twins can be used for experimental design. For example, digital twins help predict optimal feed profiles to a certain target function such as the optimum time-space yield.
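As a rough illustration of this idea, the sketch below uses a deliberately simple mechanistic model (Monod growth in a fed-batch reactor, integrated with the Euler method) as a stand-in for a digital twin and searches a grid of constant feed rates for the highest space-time yield. All parameter values, the restriction to constant feed rates, and the function names are assumptions made for illustration. A real twin would be calibrated to process data and could optimize time-varying profiles.

```python
def simulate_fed_batch(feed_rate, t_end=20.0, dt=0.01):
    """Euler simulation of a Monod-type fed-batch.

    feed_rate is in L/h. Returns final biomass concentration (g/L) and volume (L).
    """
    mu_max, k_s, y_xs = 0.3, 1.0, 0.5   # 1/h, g/L, g/g (illustrative values)
    s_feed = 200.0                       # substrate in the feed, g/L (illustrative)
    x, s, v = 0.5, 5.0, 1.0              # initial biomass, substrate, volume
    for _ in range(int(round(t_end / dt))):
        mu = mu_max * s / (k_s + s)      # Monod growth rate
        dilution = feed_rate / v
        dx = mu * x - dilution * x
        ds = -mu * x / y_xs + dilution * (s_feed - s)
        x += dx * dt
        s = max(s + ds * dt, 0.0)        # substrate cannot become negative
        v += feed_rate * dt
    return x, v


def space_time_yield(feed_rate, t_end=20.0):
    """Final biomass concentration per unit process time, g/(L*h)."""
    x_final, _ = simulate_fed_batch(feed_rate, t_end)
    return x_final / t_end


# Candidate constant feed rates from 0 to 0.2 L/h.
candidates = [0.01 * k for k in range(21)]
best = max(candidates, key=space_time_yield)
print(f"best feed rate: {best:.2f} L/h, "
      f"space-time yield: {space_time_yield(best):.3f} g/(L*h)")
```

The grid search is the simplest possible optimizer. It is used here only because it makes the link between model prediction and experimental design easy to follow.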
Digital twins can also be deployed in real time throughout the product life cycle as process control strategies. When digital twins are set in place, they can be used to predict the impact of design decisions, anticipate bottlenecks, and provide efficient up-front training for new operational processes and advanced operator support (e.g., by means of augmented reality).
What is needed for digital twin deployment, and how can digitalization help? Currently, many digital twin implementations use classic proprietary academic tools, such as MATLAB, but the industry increasingly tends toward open-source environments, such as R and Python, which can be used interactively through notebook environments such as Jupyter Notebook. From the user’s perspective, there is clearly a need to provide harmonized computational environments, thus avoiding manual data transfers or establishing interfaces between individual software packages. These interfaces should also follow the prerequisites of data management described previously.
Integrated process models
Established process models mainly focus on single-unit operations. However, process robustness and demonstrated manufacturing capability can only be reached via a consistently robust process chain. A seamless interplay between the unit operations needs to be elaborated for this purpose (Figure 3).
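One way to picture such an interplay is a Monte Carlo simulation in which the variable output of one unit operation becomes the input of the next. The sketch below is a toy example under stated assumptions: the impurity load, the three clearance steps with their means and standard deviations, and the specification limit are all hypothetical numbers chosen for illustration.

```python
import random
import statistics


def simulate_chain(n_runs=5000, seed=1):
    """Propagate variability through a chain of hypothetical unit operations.

    Returns the simulated final impurity level for each run.
    """
    rng = random.Random(seed)
    # (mean fractional clearance, standard deviation) per downstream step.
    steps = [(0.90, 0.02), (0.80, 0.05), (0.95, 0.01)]
    final_impurities = []
    for _ in range(n_runs):
        # Hypothetical upstream output: impurity load in arbitrary units.
        impurity = max(rng.gauss(100.0, 10.0), 0.0)
        for mean_clearance, sd in steps:
            # Clamp the sampled clearance to the physically meaningful range.
            clearance = min(max(rng.gauss(mean_clearance, sd), 0.0), 1.0)
            impurity *= 1.0 - clearance
        final_impurities.append(impurity)
    return final_impurities


spec_limit = 0.15  # hypothetical specification limit, same arbitrary units
results = simulate_chain()
fraction_out = sum(value > spec_limit for value in results) / len(results)
print(f"mean final impurity: {statistics.mean(results):.3f}")
print(f"simulated out-of-specification fraction: {fraction_out:.3%}")
```

Looking at the whole chain in this way shows how variability that is acceptable within each single step can still add up to an out-of-specification risk at the end of the process.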
Model validation, evolution, and maintenance
To deploy a model, we need to make sure it is valid and remains valid in a GxP environment. Initial model validation is only the first step, as the model will evolve as unforeseen variables are encountered and planned changes are implemented over the product life cycle. The ongoing validation process is commonly known as CMLCM.
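One simple ingredient of such ongoing validation is to compare the model's recent prediction error with the error accepted at initial validation. The sketch below is a minimal, hypothetical check, not a prescribed CMLCM procedure: the function names, the window of five batches, the factor of 1.5, and the titer values are assumptions for illustration.

```python
from math import sqrt


def rmse(predicted, observed):
    """Root-mean-square error between paired predictions and observations."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sqrt(sum(squared) / len(squared))


def model_still_valid(predicted, observed, validated_rmse, factor=1.5, window=5):
    """Return True while the RMSE over the last `window` batches stays
    within `factor` times the RMSE accepted at initial validation."""
    recent_error = rmse(predicted[-window:], observed[-window:])
    return recent_error <= factor * validated_rmse


validated_rmse = 0.10                       # g/L, hypothetical validation result
predicted_titer = [2.0] * 10                # model predictions per batch, g/L
observed_titer = [2.05, 1.95, 2.10, 1.90, 2.00,   # early batches
                  2.20, 2.25, 2.30, 2.30, 2.35]   # later batches drift upward

print(model_still_valid(predicted_titer[:5], observed_titer[:5], validated_rmse))
print(model_still_valid(predicted_titer, observed_titer, validated_rmse))
```

In this example the first check passes and the second fails, which would trigger a review of whether the model needs to be recalibrated or revalidated.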
Real-time environments
Real-time environments in the Pharma 4.0™ context are currently specified in ongoing work by the ISPE Pharma 4.0™ Special Interest Group’s Plug & Produce working group. The main requirements for digital twin and knowledge deployment solutions with respect to data science are to:
- Allow real-time data management and feature extraction from different data sources as input vector to digital twins (actually the outputs of the real process).
- Use modular design for flexible production (Urbas, L., F. Doherr, A. Krause, and M. Obst. “Modularization and Process Control.” Chemie Ingenieur Technik 84, no. 5 (2012): 615–623. doi:10.1002/cite.201200034), enabling quick product changeovers and continuous biomanufacturing, with all its requirements for standardization of data interfaces.
- Provide the ability to integrate and run complex digital twins for timely control of product quality, including multiple-input and multiple-output (MIMO) and MPC algorithms, just as they have been in place in other market segments, such as the chemical industry, for decades. This is essentially a call for executing PAT (ICH Q8[R2]) in its full definition.
- Use similar real-time environment design throughout the product life cycle. The development environment needs to truly reflect the manufacturing environment capabilities.
Multiparametric control strategies
In real-world applications of data science to parts of the overall biopharmaceutical manufacturing process, we face challenges related to the significant multidimensionality of CPP and CQA interactions. We must ensure that the control strategy allows operational decisions in this multidimensional space and is not reduced to single independent controls. The use of single independent controls would lead to the loss of all process understanding links gathered in the development of the control strategy. We should allow multiparametric feedback control to detect deviations and automatically adjust operations, decision support, and advanced operator support.
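The difference between single independent controls and a multiparametric view can be illustrated with two correlated parameters. In the sketch below, both example points lie within their individual limits of two standard deviations, but only one of them is consistent with the correlation between the parameters. The parameter names, means, standard deviations, correlation, and the assumption of a bivariate normal distribution are all hypothetical choices made for illustration.

```python
import math


def mahalanobis_sq_2d(point, mean, cov):
    """Squared Mahalanobis distance for two parameters.

    cov is the covariance matrix ((s11, s12), (s12, s22)).
    """
    (s11, s12), (_, s22) = cov
    det = s11 * s22 - s12 * s12
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form d^T * inverse(cov) * d with the 2x2 inverse written out.
    return (s22 * dx * dx - 2.0 * s12 * dx * dy + s11 * dy * dy) / det


def within_univariate_limits(point, mean, sds, k=2.0):
    """Check each parameter separately against mean +/- k standard deviations."""
    return all(abs(p - m) <= k * sd for p, m, sd in zip(point, mean, sds))


# Hypothetical historical behavior: temperature (deg C) and pH, strongly correlated.
mean = (37.0, 7.00)
sds = (0.5, 0.10)
rho = 0.9
cov = ((sds[0] ** 2, rho * sds[0] * sds[1]),
       (rho * sds[0] * sds[1], sds[1] ** 2))

# 99% quantile of the chi-square distribution with 2 degrees of freedom.
threshold = -2.0 * math.log(1.0 - 0.99)

consistent = (37.8, 7.15)    # both parameters high, as the correlation suggests
inconsistent = (37.8, 6.85)  # temperature high but pH low

for name, point in [("consistent", consistent), ("inconsistent", inconsistent)]:
    print(name,
          "univariate ok:", within_univariate_limits(point, mean, sds),
          "joint ok:", mahalanobis_sq_2d(point, mean, cov) <= threshold)
```

Both points pass the independent checks, yet the second one is flagged by the joint check. This is the kind of information that is lost when a control strategy is reduced to single independent controls.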
Production control strategies
Beyond the pure process control strategy discussed previously, we need production control strategies that allow agile and flexible production beyond the control strategy in the submission file and that act on the complete value chain illustrated in Figure 1. This will enable ICH Q10 pharmaceutical quality systems to enter the next era of life-cycle management described in ICH Q12. Many data science and digitalization aspects such as process maps and process data maps have recently been proposed.
Conclusion
As data science–trained engineers, we have an obligation to demonstrate the benefits of integrated tools and workflows. The key enabler of digitalization in the bioprocessing industry is the ability to handle knowledge. In this article, we briefly identified prerequisites for data alignment, contextualization, and integration, and discussed standards and interfaces. We emphasized the need for future efforts to improve data security and data integrity, and the importance of having sufficient sandboxes and test environments.
As this article makes clear, we need to develop integrated digital twins linking complete process chains for predictive end product quality. We also need to increase our focus on the validation, evolution, and maintenance of digital twins. The full potential of digital twins will be apparent when these tools are implemented in real-time environments: We can use them for both process control strategies and production control strategies. The latter type of strategy will be addressed in the second part of this series, where we focus on the ICH Q12 product life cycle.