March / April 2021

Data Science for Pharma 4.0™, Drug Development, & Production—Part 1

Christoph Herwig, PhD
Frank B. Nygaard, PhD
Michelangelo Canzoneri
Stacy L. Springs
Jacqueline M. Wolfrum
Richard D. Braatz, PhD
Stefan Robert Kappeler
Valentin Steinwandter
Figure 1: Effect of Industry 4.0 enablers on the manufacturing process chain. Upper arrow: Conventional approach. Lower arrow: Perceived extended process chain using IIoT enablers.

Digital transformation and digitalization are on the agenda for all organizations in the biopharmaceutical industry. But what are the main enablers of intelligent manufacturing? We hypothesize that data science–derived manufacturing process and product understanding is the main driver of digitalization in the bioprocessing industry for biologics manufacturing. In this article, the first of a two-part series, we analyze the prerequisites for establishing data science solutions and present key data science tools relevant to the process development stage.

Part 2 will focus on applications of data science tools in product life-cycle management. The goal of the series is to highlight the importance of data science as the necessary adjunct to digitalization for intelligent biomanufacturing. We focus on the potential of data science in industry and present data science tools that might deliver fast results in different subject areas. We want to stimulate the use of data science to transform drug production and achieve major business goals, such as accelerated time to clinic and market and improved process robustness based on continued process verification (CPV).

Digitalization and Data Science

Industry 4.0 and the industrial internet of things (IIoT), as defined elsewhere,1 have become the innovation drivers and game differentiators of today’s industry and all related business areas. They redefine complete value chains encompassing production planning, warehousing and logistics, manufacturing process and material design, plant operations and safety, monitoring and maintenance of facility and equipment, tight integration of suppliers and customers, and marketing and sales.

The term “digitalization” is often used very broadly. In this article, we define it within the pharmaceutical context as the conversion of all data along the product life cycle, from pharmaceutical development and manufacturing onward, into a computer-readable format. However, the availability of data is in itself insufficient. We need to learn from the data through data science because the complexity of the data can exceed what humans can interpret unaided.

Specifically, digitalization and related data science tools for the biopharmaceutical industry mainly act on two dimensions:

  • The manufacturing process chain (Figure 1)
  • The product life cycle (addressed in Part 2)

These dimensions are, of course, interlinked.

The enhancement of the manufacturing process chain by digitalization affects the supply chain, logistics, and predictive (rather than preventive) maintenance. In addition, the enhancement impacts process robustness, process understanding for product life-cycle management purposes, and deviation management.

Hence, the objectives of the digitalization-enhanced manufacturing process chain are:

  • Integrate data from facilities, sites, suppliers, and clients.
  • Use quality metrics to increase process and manufacturing transparency.
  • Allow feedback loops within the life cycle for continuous improvement.
  • Establish multiproduct facilities in response to shorter product life cycles.
  • Allow flexible resource management.
  • Establish transparent and flexible business process workflows.

Without any doubt, sufficient data are already available to realize these objectives. However, the data must first be managed to achieve a structured format; data management is a key prerequisite for data science. Additionally, there is a strong need for software tools to generate knowledge from data. Current tools include data visualization and data-driven or mechanistic models;2 these are used to provide ontologies and taxonomies.3

Data science is used to gather and provide the knowledge needed for the timely and predictive development of new pharmaceutical drugs. Recent advancements in data science pave the way for a second revolution in medical sciences based on a rational understanding of the patient’s disorder, science-driven (rather than heuristic) drug development and manufacturing, and knowledge-based and individually tailored therapy.

  • 1 Bazzanella, A., A. Förster, B. Mathes, et al. “Whitepaper—Digitalisierung in der Chemieindustrie.” 2016. https://dechema.de/dechema_media/Downloads/Positionspapiere/whitepaper digitalisierung_final-p-20003450.pdf
  • 2 Kroll, P., A. Hofer, S. Ulonska, J. Kager, and C. Herwig. “Model-Based Methods in the Biopharmaceutical Process Lifecycle.” Pharmaceutical Research 34, no. 0724–8741 (2017): 2596–2613. doi:10.1007/s11095-017-2308-y
  • 3 Herwig, C., O. F. Garcia-Aponte, A. Golabgir, and A. S. Rathore. “Knowledge Management in the QbD Paradigm: Manufacturing of Biotech Therapeutics.” Trends in Biotechnology 33, no. 7 (2015): 381–387. doi:10.1016/j.tibtech.2015.04.004

Figure 2: Application of data science tools to advance intelligent manufacturing.

Data science is the most recent data, information, knowledge, wisdom (DIKW) concept.4 In the bioprocessing industry, it is used to turn data into information, which can then be transformed into knowledge applicable across the product life cycle. Thus, it permits organizations to follow the ICH Q12 guideline for life-cycle management by providing a data-based set of established conditions (ECs).5 Data science also allows intelligent processing in control strategies according to the sequence of primary data collection, followed by the evaluation of information, generation and provision of knowledge, and, finally, a high level of manufacturing intelligence and comprehensive understanding. Figure 2 illustrates how data science tools enable intelligent manufacturing, which, in our view, is the main goal when following Industry 4.0 principles.

Data Science Prerequisites

Data Alignment, Contextualization, and Integration

According to surveys, data scientists spend the majority of their work time preparing and processing data.6 Having worked for many years as and with data scientists, we can confirm this finding from our own experience. Up to 80% of a data scientist’s time is dedicated to data alignment, cleaning, contextualization, and setting up test data sets. Data scientists must repeat these basic tasks daily because the number of possible data formats is practically unlimited, and many of them are unsuitable or nonstandardized. As a result, only about 20% of a highly skilled data scientist’s time is available for building training sets, writing algorithms, building and refining models, and delivering knowledge.
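As a minimal sketch of what such alignment and cleaning work can look like, the following Python example attaches the nearest online sensor reading to each valid offline laboratory result. All data values, timestamp formats, and field names are illustrative assumptions and do not come from a real system.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical online sensor log: (ISO timestamp, pH value), sorted by time
online = [
    ("2021-03-01T08:00:00", 7.02),
    ("2021-03-01T08:05:00", 7.00),
    ("2021-03-01T08:10:00", 6.97),
    ("2021-03-01T08:15:00", 6.95),
]
# Hypothetical offline lab results: different date format, values as text
offline = [
    ("01.03.2021 08:04", "12.1"),  # titer in g/L
    ("01.03.2021 08:14", ""),      # missing measurement
]


def parse_online(ts):
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")


def parse_offline(ts):
    return datetime.strptime(ts, "%d.%m.%Y %H:%M")


def align(online_rows, offline_rows):
    """Attach the nearest-in-time online reading to each valid offline sample."""
    times = [parse_online(ts) for ts, _ in online_rows]
    aligned = []
    for ts, raw in offline_rows:
        if not raw.strip():
            continue  # cleaning step: drop empty measurements
        t = parse_offline(ts)
        i = bisect_left(times, t)
        # Candidates are the neighbors to the left and right of the insert position
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        nearest = min(candidates, key=lambda j: abs(times[j] - t))
        aligned.append({
            "time": t.isoformat(),
            "titer_g_per_l": float(raw),
            "ph": online_rows[nearest][1],
        })
    return aligned


aligned_samples = align(online, offline)
```

Even in this toy case, most of the code deals with formats and gaps rather than with analysis, which mirrors the time split described above.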

This arrangement is not cost-effective. Data processing and organization do not deliver any value by themselves, even though they are indispensable prerequisites to advancing drug development and production. Moreover, manual data manipulation bears the risk of introducing human errors into the data set.

There are multiple reasons for this awkward predicament. One is the vast diversity of data and data applications in the pharmaceutical industry: material supply information, data on the history of the strains used, experimental design data, process raw data, analytical data, derived process data, associated metadata, statistical models, mechanistic models, hybrid models, single-unit operation models, holistic models (e.g., integrated process models and digital twins), analysis workflows, validation workflows, and batch records, to name just a few. (A digital twin is a virtual representation of a physical or intangible object existing in the real world. With a twin, we can design experiments, predict process outcomes, and even optimize the process in real time.7) Currently, all these types of data are usually stored as paper records or in nonstandardized formats such as relational databases, programming scripts, or tabular files. As we have noted, converting such data to knowledge is the primary purpose of data science. However, other essential prerequisites must be implemented before the conversion can succeed.

Standards and Interfaces

Open-source, interoperable, platform-independent, widely accepted and implemented standard formats would help reduce the amount of time currently used for data processing and contextualization. However, given the market economy and the speed of development, the IT industry is affected more than any other industry by proprietary de facto standards, such as those established by Microsoft products. These de facto standards hinder the efficient reuse of algorithms and data science tools.

Various attempts to implement open-source standard formats are currently under discussion. For example, the Allotrope Foundation aims to define and implement a common data format that focuses on the contextualization and linking of analytical data.8 The lofty goals of the Allotrope project are much appreciated, and the project is well supported by discrete manufacturing industries. A similar project, with a focus on the process industry, is Data Exchange in the Process Industry (DEXPI). This initiative aims to set up an ISO standard to enforce a common data storage and exchange approach.9,10

To use open-source standards and formats in pharmaceutical manufacturing, regulated companies must adopt proper controls. For example, the US FDA’s predicate rules influence which tools and systems are used in computer validation.

Open-source projects are far from general implementation, and it is unclear whether any of these approaches will become a commonly used standard. There is a danger that these efforts will not result in applied tools. Sometimes, the overhead to implement an interface in a standardized format is simply too high relative to the cost of an unstandardized format such as a simple CSV file. To encourage the adoption of open-source standard tools, leading software providers, application developers, and industrial organizations need to commit to making tools and their related interfaces easy to apply and implement.
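To illustrate the gap between a bare CSV file and a contextualized record, the following sketch wraps CSV content in a self-describing structure that carries units and batch context. The field names, the batch identifier, and the JSON layout are hypothetical and do not follow the Allotrope or DEXPI formats.

```python
import csv
import io
import json

# A bare CSV export: compact, but units and batch context are implicit
raw_csv = "time_h,temp,do\n0,37.0,80\n1,36.8,62\n"

# Hypothetical context that the CSV file itself does not carry
context = {
    "batch_id": "B-001",
    "unit_operation": "fed-batch fermentation",
    "columns": {
        "time_h": {"unit": "h", "description": "process time"},
        "temp": {"unit": "degC", "description": "reactor temperature"},
        "do": {"unit": "% air saturation", "description": "dissolved oxygen"},
    },
}


def contextualize(csv_text, ctx):
    """Combine CSV rows with metadata; refuse columns that lack a description."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing = set(rows[0]) - set(ctx["columns"])
    if missing:
        raise ValueError(f"columns without metadata: {sorted(missing)}")
    data = [{key: float(value) for key, value in row.items()} for row in rows]
    return {"context": ctx, "data": data}


record = contextualize(raw_csv, context)
serialized = json.dumps(record, indent=2)  # self-describing, exchangeable text
```

The extra effort is small for one file, but it has to be agreed on and repeated for every data source, which is the overhead the paragraph above refers to.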

The historic evolution of the Open Platform Communications (OPC) standard is a noteworthy model for other internet of things (IoT) applications. An OPC interface is used in the automation industry for communication between different software tools. Previous OPC standards had several technical drawbacks that limited their use for the IoT. One of their biggest issues was that they were based on Microsoft’s DCOM specification. Communication in a complex network was usually only possible with workarounds such as additional OPC tunnel tools. During the last decade, collaboration among different parties resulted in a unified architecture standard, OPC-UA, which is a generic, open-source, platform-independent, network- and internet-ready interface standard with built-in advanced security features. OPC-UA is functionally equivalent to the older versions of OPC but is extensible and allows modeling of data as more complex structures. These features make OPC-UA the preferred standard for IoT applications.

Data Security and Data Integrity

In the pharmaceutical context, data security and following GAMP® guidelines11 are highly critical. The intellectual property of pharmaceutical companies is mainly present in the form of data. Research and development data, drug manufacturing recipes, process metadata, and batch records all contain massive amounts of information that can be potentially converted into valuable knowledge. In the past, pharmaceutical companies protected their data using organizational and technical firewalls, resulting in data segmentation. Parts of a company’s data infrastructure today are strictly shielded from other parts by using separated networks. Obviously, a high interconnectivity of devices and systems counteracts such security approaches. Pharma 4.0™ requires devices, users, and data scientists to be part of a common network. At the same time, access permissions need to be handled with fine granularity, granting each element in the network only the permissions it actually requires. To realize the Pharma 4.0™ vision, companies need to rethink their IT security fundamentals, deriving security systems from the internet of information and implementing them into the IIoT.
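The idea of fine-grained permissions can be sketched as a deny-by-default lookup in which every principal (device, user, or data scientist) is granted only explicitly listed actions on explicitly listed resources. The principal names, resource paths, and table layout below are hypothetical.

```python
# Hypothetical permission table: (principal, resource) -> allowed actions
PERMISSIONS = {
    ("data_scientist", "historian/batch_records"): {"read"},
    ("operator", "bioreactor_01/setpoints"): {"read", "write"},
    ("ph_sensor_17", "historian/process_values"): {"write"},
}


def is_allowed(principal, resource, action):
    """Deny by default; allow only what has been explicitly granted."""
    return action in PERMISSIONS.get((principal, resource), set())


# A data scientist may read batch records but not change them,
# and has no access at all to reactor setpoints.
can_read = is_allowed("data_scientist", "historian/batch_records", "read")
can_write = is_allowed("data_scientist", "historian/batch_records", "write")
can_touch_setpoints = is_allowed("data_scientist", "bioreactor_01/setpoints", "read")
```

A production system would add authentication, auditing, and role management on top of such a rule set; the sketch only shows the least-privilege principle itself.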

Data integrity is a related aspect of data security. Data integrity often refers to the completeness, consistency, and accuracy of data. According to US FDA guidance,12 “Complete, consistent, and accurate data should be attributable, legible, contemporaneously recorded, original or a true copy, and accurate (ALCOA).” With data being the foundation of work and the basis for decisions, data integrity is crucial to the pharmaceutical industry. Regulatory agencies reacted to the lack of data quality and integrity in the past by publishing guidelines highlighting the importance of data integrity for the industry and for patient safety.12,13,14 To guarantee unequivocal data integrity, centralized IT systems have to be reassessed due to the complexity of validation of good data housekeeping, and alternative solutions need to be discussed. One approach is to separate parts of the data by, for example, storing them in a distributed ledger that is not under the control of a single party and therefore allows better control of data management.15
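The tamper evidence that a ledger provides can be illustrated with a hash chain, in which every record stores the hash of its predecessor. This is a simplified, single-party sketch and not a distributed ledger; the record fields and user names are assumptions chosen to echo the ALCOA attributes (who, when, what).

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def entry_hash(payload, prev_hash):
    """Hash the record together with the hash of the previous entry."""
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain, payload):
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "prev": prev_hash,
        "hash": entry_hash(payload, prev_hash),
    })


def verify(chain):
    """Return True only if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in chain:
        expected = entry_hash(entry["payload"], prev_hash)
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"user": "analyst_a", "time": "2021-03-01T08:04:00", "titer_g_per_l": 12.1})
append(log, {"user": "analyst_b", "time": "2021-03-01T09:10:00", "titer_g_per_l": 13.4})
intact_before = verify(log)

log[0]["payload"]["titer_g_per_l"] = 15.0  # retrospective manipulation
intact_after = verify(log)
```

In a distributed setting, copies of the chain are held by several parties, so a manipulated copy would additionally disagree with the others.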

Sandboxes and Test Environments

In discussions of data scientists’ challenges and efforts, the actual implementation of developed algorithms and tools in production environments along the product life cycle is sometimes overlooked. Often, data scientists can quickly develop an algorithm or set up a model. However, it is not always clear how these results can be brought into production and used in a real-time context.

Current manufacturing environments are quite different from the development environments of data scientists. Data scientists are eager to use the latest tools and state-of-the-art technology. In contrast, automation specialists setting up the production environments are more cautious about the adoption of new technology that might put safety at risk. They usually rely on older, but time-proven, stable tools and software products. Bringing these two worlds together is a challenge, which is rarely addressed early enough in the implementation process.

Similarly, algorithms developed by data scientists in their development environments cannot simply be copied and pasted into the production environment. Often, a finished algorithm is just a prototype that proves feasibility. Too many promising tools never end up in a production facility.

Sandbox-mode development and test environments can help overcome these issues. Development of data science tools requires time-consuming exploration involving many feedback cycles, as in agile development strategies. Data scientists need virtual production environments to test their algorithms and software solutions to improve these tools’ applicability for commercial GxP-compliant drug manufacturing.

Process Development Tools

Factors such as increased competition and resulting cost pressures in the biopharmaceutical market, new modalities for personalized medicine, and reported threats of drug shortages provide incentives for organizations to develop, in the words of Janet Woodcock, Director of FDA CDER, a “maximally efficient, agile, flexible pharmaceutical manufacturing sector that reliably produces high-quality drugs without extensive regulatory oversight.”16 Specific hands-on goals in process development can be summarized as follows:

  • Accelerate bioprocess development by avoiding iterations and deploying strategies to reduce the number of experiments.
  • Develop universal scale-down models to accelerate process characterization, allow troubleshooting, and avoid unexpected scale-up effects.
  • Deploy process analytical technology (PAT) strategies to (a) provide improved real-time operational control and compliance, (b) serve as an objective basis for process adjustments, and (c) provide a comprehensive data set for technology transfer decisions.
  • Implement strategies to target integrated bioprocess development rather than optimizing single-unit operations only.
  • Capture platform knowledge to achieve synergies with other products and extrapolate to other process modes such as continuous manufacturing.
  • Allow life-cycle management aligned with ICH Q12,5 including holistic knowledge and product life-cycle management. This should include clear feedback loops from manufacturing into process development to establish holistic manufacturing control strategies.

Development-Oriented Tools

Table 1 lists data science tools that can be used in the bioprocessing life cycle. When an organization has a data management system in place that follows ALCOA principles, these tools can address almost all industrial needs.

Statistical tools for business process workflows

Multivariate analysis (MVA) has been used for decades, and several MVA-dedicated software tools are available to improve business process workflows. More advanced methods have recently been developed, and respective good practice guidelines have been established.17 The goal is to identify correlations between process variables, raw material attributes, product quality attributes, and the metadata from electronic batch records or electronic lab notebooks (ELN), and use those correlations as data-driven models to both establish control strategies (e.g., in stage 1 validation tasks) and generate hypotheses for mechanistic investigations and improved process understanding.
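As a simple starting point for such an analysis, the sketch below ranks process variables by the strength of their linear correlation with a quality attribute. It is a univariate screening step rather than a full MVA method such as principal component analysis or partial least squares, and the batch data are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: one value per batch for each candidate variable
process_variables = {
    "feed_rate": [1.0, 1.2, 1.4, 1.6, 1.8],
    "temperature": [37.0, 36.9, 37.1, 37.0, 36.9],
    "raw_material_purity": [98.1, 98.0, 98.3, 98.2, 98.1],
}
titer = [10.1, 11.0, 12.2, 12.9, 14.1]  # quality attribute per batch

# Rank variables by absolute correlation with the quality attribute
ranking = sorted(
    ((name, pearson(values, titer)) for name, values in process_variables.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
```

A high rank only generates a hypothesis. Whether the variable is truly critical still has to be confirmed by designed experiments or mechanistic investigation, as described above.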

Digitalization will furthermore allow seamless interfacing between operational historians, laboratory information management systems, ELN, and so on, and should include automated feature extraction from 2D data (e.g., spectroscopic or chromatographic data) and 3D data (e.g., from flow cytometry or microscopy). Inspired by FDA validation guidelines,18 such tools should be integrated in a business process workflow or in process maps;19 for example, they might be linked to risk assessments, which in turn are facilitated by data science. This integration will help organizations establish clear traceability of decision-making along the manufacturing process development and manufacturing process characterization steps. For example, power analysis is used to reduce the risk of overlooking a critical effect of potential critical process parameters (CPPs) on critical quality attributes (CQAs), and to justify the proven and acceptable operating ranges.20
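The power analysis mentioned above can be sketched with a normal approximation for a two-sided comparison of two groups, for example a CQA measured at the low and high setting of a potential CPP. The effect size is assumed to be standardized (difference in means divided by the standard deviation), and the approximation slightly underestimates the number of runs that an exact t-test calculation would give.

```python
from math import ceil
from statistics import NormalDist


def runs_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate runs per group needed to detect a standardized effect.

    Uses n = 2 * ((z_(1 - alpha/2) + z_power) / effect_size)^2.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Detecting a shift of one standard deviation needs far fewer runs
# than detecting a shift of half a standard deviation.
n_large_effect = runs_per_group(1.0)
n_small_effect = runs_per_group(0.5)
```

Read the other way around, a study with too few runs has a high risk of declaring a parameter noncritical although it does affect the CQA.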

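Table 1: Mapping data science tools to industrial needs in the bioprocessing life cycle (“X” indicates a main application).
Tools: (1) statistical tools for business process workflows; (2) workflows for model and digital twin generation; (3) model and digital twin deployment; (4) integrated process models; (5) model validation, evolution, and maintenance; (6) real-time environments; (7) multiparametric control strategies; (8) production control strategies.
| Industrial need                 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Accelerated process development | X |   | X | X |   | X | X |   |
| Scale-down models               | X | X |   | X |   |   |   |   |
| PAT                             |   |   | X | X | X | X | X |   |
| Integrated process development  | X |   |   | X |   | X | X | X |
| Platform knowledge              | X | X |   | X | X |   |   | X |
| Life-cycle management           | X |   |   | X | X |   |   | X |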

Workflows for model and digital twin generation

Since the publication of the ICH Q11 guideline,21 the use of first-principle models (which can be implemented in digital twins) has been encouraged along the product life cycle. These models are also perceived to be a significant enabler for life-cycle management in accordance with ICH Q12.5,22 However, a digital twin should not be the product of a single modeling expert, because acceptance of the model, as well as its life-cycle maintenance, will fade once the expert is no longer available. As a viable alternative, concise workflows for model generation have been in place for more than a decade.23 These workflows are known as good modeling practice and use classical mathematical tools for model calibration (such as sensitivity analysis, practical identifiability analysis, and observability analysis) and for deploying the model with a suitable PAT environment.24
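A local sensitivity analysis, one of the calibration tools named above, can be sketched for a small first-principle model. The example assumes a batch culture with Monod kinetics, explicit Euler integration, and invented parameter values; it estimates how strongly the final biomass reacts to a relative change in each parameter.

```python
def simulate_biomass(mu_max, k_s, yield_xs, x0=0.1, s0=10.0, t_end=10.0, dt=0.01):
    """Final biomass (g/L) of a batch culture with Monod kinetics (explicit Euler)."""
    x, s = x0, s0
    for _ in range(int(round(t_end / dt))):
        mu = mu_max * s / (k_s + s)
        growth = min(mu * x * dt, s * yield_xs)  # cannot use more substrate than is left
        x += growth
        s -= growth / yield_xs
    return x


def relative_sensitivity(params, name, rel_step=0.01):
    """Central-difference estimate of d ln(final biomass) / d ln(parameter)."""
    up = dict(params, **{name: params[name] * (1 + rel_step)})
    down = dict(params, **{name: params[name] * (1 - rel_step)})
    base = simulate_biomass(**params)
    return (simulate_biomass(**up) - simulate_biomass(**down)) / (2 * rel_step * base)


# Assumed parameter values for illustration only
params = {"mu_max": 0.3, "k_s": 0.2, "yield_xs": 0.5}
sensitivities = {name: relative_sensitivity(params, name) for name in params}
```

In this setting the output reacts strongly to mu_max and only weakly to k_s, which hints that k_s would be hard to identify from final biomass data alone. A practical identifiability analysis formalizes that observation.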

As data are made available in cloud solutions, digitalization will further enable software-as-a-service (SaaS) solutions for the generation of minimum targeted models,25 in which mechanistic links, as assembled and uploaded by the academic community, are tested for suitability to the given data set and modeling goal. Hence, existing data science workflows are facilitated by novel cloud solutions, resulting in an accelerated, sound, and science-based approach to generate digital twins.

Model and digital twin deployment

Models capture comprehensive process understanding. Hence, they are perfect tools to provide knowledge for manufacturing intelligence. In technological language, models can be deployed in a multiparametric control strategy in a real-time context. Model-based control or model predictive control (MPC) software sensors are well established in the conventional process industry, but they have hardly been used so far in value-added process industries such as the biopharmaceutical sector. Why not?

One issue is the lack of appropriate knowledge management tools and strategies. Once data are provided to the modeled process in a real-time context, knowledge management tools are needed to check whether the model and the underlying knowledge are still valid. As ICH Q12 emphasizes, the industry needs computational model life-cycle management (CMLCM) strategies,26 which are also an integral part of product life-cycle management, to enable feedback loops and continuous improvement of the process chain and product life cycle.

The execution of the knowledge-based strategy will lead to intelligent manufacturing and the realization of Pharma 4.0™. Digital twins can help us achieve this goal in a variety of ways.4

It is widely recognized that digital twins can be used for experimental design. For example, digital twins help predict optimal feed profiles for a given target function, such as the optimum space-time yield.27 Recently, digital twins have also been used for automated model-based redesign of experiments; in this context, the digital twin is deployed in real time, and the information content is maximized concurrently with the ongoing experiment.28,29 These approaches clearly outperform classical design-of-experiment approaches in terms of both the number of experiments and the accurate identification of process parameters critical for optimum process performance.
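How a digital twin can propose a feed strategy before any experiment is run can be sketched with a deliberately small fed-batch model. The kinetics (substrate-inhibited growth), all parameter values, and the restriction to constant feed rates are assumptions made for illustration; a real twin would be calibrated to process data and would typically optimize a time-varying profile.

```python
def space_time_yield(feed_l_per_h, t_end=20.0, dt=0.01):
    """Biomass formed per final volume and time (g/L/h) in a toy fed-batch model."""
    # Assumed parameters: growth kinetics, yield, and feed concentration (g/L)
    mu_max, k_s, k_i, y_xs, s_feed = 0.3, 0.2, 10.0, 0.5, 200.0
    volume, x_mass, s_mass = 1.0, 0.5, 5.0  # L, g biomass, g substrate
    x_start = x_mass
    for _ in range(int(round(t_end / dt))):
        s = s_mass / volume
        mu = mu_max * s / (k_s + s + s * s / k_i)  # high substrate inhibits growth
        growth = min(mu * x_mass * dt, s_mass * y_xs)
        x_mass += growth
        s_mass += feed_l_per_h * s_feed * dt - growth / y_xs
        volume += feed_l_per_h * dt
    return (x_mass - x_start) / (volume * t_end)


# Candidate constant feed rates in L/h; 0.0 corresponds to a pure batch
candidates = [0.0, 0.005, 0.01, 0.02, 0.04]
best_feed = max(candidates, key=space_time_yield)
```

Each candidate costs a fraction of a second to evaluate in silico, whereas the corresponding laboratory run would take days. Only the most promising settings then need to be verified experimentally.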

Figure 3: Schematic description of an integrated process model using a Monte Carlo approach: 1,000 simulations are performed, each having a different set of process parameters and a different initial specific CQA concentration.

Digital twins can also be deployed in real time throughout the product life cycle as process control strategies. Once digital twins are in place, they can be used to predict the impact of design decisions, anticipate bottlenecks, and provide efficient up-front training for new operational processes and advanced operator support (e.g., by means of augmented reality).

What is needed for digital twin deployment, and how can digitalization help? Many digital twin implementations still use classic proprietary tools that are common in academia, such as MATLAB, but the industry increasingly uses open-source environments, such as R and Python, whose usability can be extended by interactive editors, such as Jupyter Notebook. From the user’s perspective, there is clearly a need to provide harmonized computational environments that avoid manual data transfers, or to establish interfaces between individual software packages. Such interfaces should also follow the prerequisites of data management described previously.

Integrated process models

Established process models mainly focus on single-unit operations. However, process robustness and demonstrated manufacturing capability can only be reached via a consistently robust process chain. A seamless interplay between the unit operations needs to be elaborated for this purpose (Figure 3).30 Integrated process models, similar to those established in other industries (ASPEN, G-Proms), need to be applied to integrated bioprocesses. These models should quantitatively demonstrate the process understanding and include the elaborated proven acceptable ranges (PARs) of individual unit operations.17 Subsequently, the models should allow the interconnection of the individual unit operations and be able to assess the error propagation within the variation in the PARs using sensitivity studies and, for example, Monte Carlo simulations. As a result, integrated process modeling enables the identification of process parameters that are critical to the entire process chain. It further allows the definition of the necessary control strategies along the entire process and, in turn, reduces the number of experiments needed for comprehensive process validation and for the defined and proven production capability.
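The Monte Carlo approach can be sketched by chaining three simplified unit operation models and sampling each process parameter within an assumed PAR. The unit operation equations, the parameter ranges, the impurity specification, and the random seed are all illustrative assumptions rather than values from a real process.

```python
import random


def fermentation(c0, ph):
    """Impurity after fermentation; assumed to rise as pH deviates from 7.0."""
    return c0 * (1 + 0.5 * abs(ph - 7.0))


def capture(c_in, load):
    """Capture step; assumed clearance drops with column load (g/L resin)."""
    clearance = 0.90 - 0.002 * (load - 30)
    return c_in * (1 - clearance)


def polishing(c_in, conductivity):
    """Polishing step; assumed clearance is best at 5.0 mS/cm."""
    clearance = 0.80 - 0.01 * abs(conductivity - 5.0)
    return c_in * (1 - clearance)


def simulate_chain(rng, n=1000, spec_limit=3.0):
    """Propagate parameter variation through the whole chain n times."""
    results = []
    for _ in range(n):
        c0 = rng.gauss(100.0, 10.0)   # initial specific CQA concentration
        ph = rng.uniform(6.8, 7.2)    # assumed PAR of the fermentation
        load = rng.uniform(20, 40)    # assumed PAR of the capture step
        cond = rng.uniform(4.0, 6.0)  # assumed PAR of the polishing step
        results.append(polishing(capture(fermentation(c0, ph), load), cond))
    out_of_spec = sum(c > spec_limit for c in results) / n
    return results, out_of_spec


final_impurity, out_of_spec_fraction = simulate_chain(random.Random(42))
```

The out-of-specification fraction summarizes how variation within the individual PARs adds up along the chain. Repeating the simulation with one range narrowed at a time shows which parameter is critical for the process as a whole.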

Model validation, evolution, and maintenance

To deploy a model, we need to make sure it is valid and remains valid in a GxP environment. Initial model validation is only the first step, as the model will evolve as unforeseen variables are encountered and planned changes are implemented over the product life cycle. The ongoing validation process is commonly known as CMLCM.26 For the execution of CMLCM, digitalization provides computational model environments (CMEs), which comprise an engineering environment, a customer tuning environment, and an in-line system. The engineering environment allows for further development of methods and requires significant input from modeling experts. Workflows contained in the customer tuning environment can be triggered by the customer and run (mostly) automatically. The in-line system contains the parts of the model (algorithms) that are required and frequently called at run time. Data science tools for fault diagnosis need to be further developed and implemented in this environment to deploy this CME concept.
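One very simple form of ongoing validity checking is to monitor the residuals between model predictions and measurements and to raise a flag when their moving average exceeds a tolerance. The window length, the tolerance, and the data below are assumptions; they stand in for the fault diagnosis tools mentioned above rather than reproduce a specific CMLCM method.

```python
from collections import deque


def validity_status(predictions, observations, tolerance, window=3):
    """Return one flag per time point: True while the model still fits the data."""
    recent = deque(maxlen=window)  # keeps only the latest absolute residuals
    status = []
    for predicted, observed in zip(predictions, observations):
        recent.append(abs(observed - predicted))
        status.append(sum(recent) / len(recent) <= tolerance)
    return status


# Hypothetical titer predictions and measurements (g/L); the process drifts
predicted = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
observed = [1.1, 1.4, 2.1, 3.0, 3.8, 4.5]
status = validity_status(predicted, observed, tolerance=0.3)
```

A flag of this kind could trigger a recalibration workflow in the customer tuning environment or, if recalibration fails, a model revision in the engineering environment.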

Real-time environments

Real-time environments in the Pharma 4.0™ context are currently specified in ongoing work by the ISPE Pharma 4.0™ Special Interest Group’s Plug & Produce working group. The main requirements for digital twin and knowledge deployment solutions with respect to data science are to:

  • Allow real-time data management and feature extraction from different data sources to form the input vector of the digital twins (these inputs are actually the outputs of the real process).
  • Use modular design for flexible production,31 enabling quick product changeovers and continuous biomanufacturing, with all its requirements for standardization of data interfaces.
  • Provide the ability to integrate and run complex digital twins for timely control of product quality, including multiple-input and multiple-output (MIMO) and MPC algorithms, just as they have been in place in other market segments, such as the chemical industry, for decades. This is essentially a call for executing PAT (ICH Q8[R2]) in its full definition.
  • Use similar real-time environment design throughout the product life cycle. The development environment needs to truly reflect the manufacturing environment capabilities.
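The MPC principle named in these requirements can be sketched for a single input and a single output. The sketch assumes a first-order discrete-time plant model, a small set of admissible actuator levels, and a brute-force search over all input sequences within the horizon. Industrial MPC uses multivariable (MIMO) models and numerical optimizers instead.

```python
from itertools import product

A, B = 0.9, 0.5                     # assumed plant model: x[k+1] = A*x[k] + B*u[k]
INPUTS = [0.0, 0.5, 1.0, 1.5, 2.0]  # admissible actuator levels (input constraint)


def predicted_cost(x, sequence, setpoint):
    """Sum of squared setpoint deviations if this input sequence were applied."""
    cost = 0.0
    for u in sequence:
        x = A * x + B * u
        cost += (x - setpoint) ** 2
    return cost


def mpc_step(x, setpoint, horizon=3):
    """Optimize over the horizon, then apply only the first input."""
    best = min(
        product(INPUTS, repeat=horizon),
        key=lambda sequence: predicted_cost(x, sequence, setpoint),
    )
    return best[0]


def run_closed_loop(x0, setpoint, steps=15):
    """Receding horizon: re-optimize at every step from the current state."""
    x, trajectory = x0, [x0]
    for _ in range(steps):
        x = A * x + B * mpc_step(x, setpoint)
        trajectory.append(x)
    return trajectory


trajectory = run_closed_loop(x0=0.0, setpoint=5.0)
```

Because the optimization is repeated at every step with the latest state, the controller needs current process data and a model that runs faster than the process, which is why the real-time environment requirements above matter.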

Multiparametric control strategies

In real-world applications of data science to parts of the overall biopharmaceutical manufacturing process, we face challenges related to the significant multidimensionality of CPP and CQA interactions. We must ensure that the control strategy allows operational decisions in this multidimensional space and is not reduced to single independent controls. The use of single independent controls would discard the links between parameters that were established while developing the control strategy. We should use multiparametric feedback control to detect deviations and automatically adjust operations, and to provide decision support and advanced operator support.
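Why single independent controls are not sufficient can be shown with two correlated variables. In the sketch below, a new observation lies within the univariate three-sigma limits of both variables but violates their usual relationship, which a multivariate statistic (Hotelling's T squared) reveals. The reference data and the alarm limit of 10 are assumptions; in practice the limit would be derived from a statistical distribution.

```python
def reference_statistics(xs, ys):
    """Means, variances, and covariance of two variables from reference batches."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return mean_x, mean_y, sxx, syy, sxy


def hotelling_t2(point, stats):
    """Squared multivariate distance d^T S^-1 d for a 2x2 covariance matrix S."""
    mean_x, mean_y, sxx, syy, sxy = stats
    dx, dy = point[0] - mean_x, point[1] - mean_y
    determinant = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / determinant


def within_univariate_limits(point, stats, k=3.0):
    """Check each variable separately against mean plus or minus k sigma."""
    mean_x, mean_y, sxx, syy, _ = stats
    return (abs(point[0] - mean_x) <= k * sxx ** 0.5
            and abs(point[1] - mean_y) <= k * syy ** 0.5)


# Hypothetical reference batches: temperature (degC) and growth rate (1/h)
temperature = [36.0, 36.5, 37.0, 37.5, 38.0, 36.2, 37.8, 37.2]
growth_rate = [0.20, 0.22, 0.25, 0.27, 0.30, 0.21, 0.29, 0.26]
stats = reference_statistics(temperature, growth_rate)

T2_LIMIT = 10.0                 # assumed alarm limit
suspicious = (37.8, 0.21)       # high temperature but low growth rate
typical = (37.0, 0.25)
```

Here `within_univariate_limits(suspicious, stats)` is true, yet `hotelling_t2(suspicious, stats)` is far above the limit: the deviation is visible only when the interaction between the two variables is taken into account.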

Production control strategies

Beyond the pure process control strategy discussed previously, we need production control strategies that allow agile and flexible production beyond the control strategy in the submission file and that act on the complete value chain illustrated in Figure 1. This will enable ICH Q10 pharmaceutical quality systems to enter the next era of life-cycle management described in ICH Q12. Many data science and digitalization aspects such as process maps and process data maps have recently been proposed.14


As data science–trained engineers, we have the obligation to show the benefits of integrated tools and workflows. The key enabler of digitalization in the bioprocessing industry is the ability to handle knowledge. In this article, we briefly identified prerequisites for data alignment, contextualization, and integration, and recommended standards and interfaces. We emphasized the need for future efforts to improve data security and data integrity, and the importance of having sufficient sandboxes and test environments.

As this article makes clear, we need to develop integrated digital twins that link complete process chains to predict end-product quality. We also need to increase our focus on the validation, evolution, and maintenance of digital twins. The full potential of digital twins will become apparent when these tools are implemented in real-time environments, where we can use them for both process control strategies and production control strategies. The latter type of strategy will be addressed in the second part of this series, where we focus on the ICH Q12 product life cycle.


  • 30 Zahel, T., S. Hauer, E. Mueller, et al. “Integrated Process Modeling—A Process Validation Life Cycle Companion.” Bioengineering (Basel) 4, no. 4 (2017): 86. doi:10.3390/bioengineering4040086
  • 17
  • 26
  • 31 Urbas, L., F. Doherr, A. Krause, and M. Obst. “Modularization and Process Control.” Chemie Ingenieur Technik 84, no. 5 (2012): 615–623. doi:10.1002/cite.201200034
  • 14