Features
January / February 2026

Content Addressability in the Future of Knowledge Management

James Canterbury
Tanya Sharma
Andrew Killingsworth

Data integrity drives quality and compliance. Content addressable storage (CAS) stores and retrieves files by their content, not their file location, using cryptographic fingerprints. The result: mathematically provable version control, without broken links and with far fewer audit headaches.

As the industry addresses the intersection of human knowledge and artificial intelligence (AI), CAS will play a pivotal role in managing AI context and applying guardrails that humans can understand and audit.

CAS Background

CAS is a unique data storage mechanism that uses the content itself to create an identifier that is then used to locate the file. This differs from traditional file systems, which rely on metadata and directory systems to locate files (i.e., the name of a file and the folder it was saved in). To create the identifier, CAS uses a hash function, a form of cryptographic algorithm, to convert the contents of a file into a unique hash value. This hash value, logically called the content identifier (CID), serves as the address for storing and retrieving the data.

Because the CID is derived from the contents of the file, if the file changes, the CID does as well. Therefore, it is sometimes thought of as a digital fingerprint. But when it is used in conjunction with a change log, it also can form the bedrock of a very rigid form of version control. CAS systems store and retrieve files using the content hash, which acts as a permanent, consistent identifier. This eliminates issues with broken paths or moved files, making it easier to locate the correct file no matter where it is physically stored in the system. To understand why this is important, we need to first consider why we store data and how the history of data storing technology and systems access have evolved.

Why and How Data Was Historically Stored

Data is stored for the purposes of recall and exchange. Before the advent of the internet, both recall and exchange were done using physical media. If someone had a file they wanted to share, they would save it to a disk and hand the disk over. This method of data exchange continues to be used today, although to a lesser extent. This has some interesting side effects. First, physical media copies create copies of an original file (i.e., placing a file on a disk does not delete it from the original disk). Second, files tend to be organized in ways that are easy for humans to find them (e.g., give the file a recognizable name, and if there are several files, group them together in a folder).

Sometimes the physical media and the directory structures overlap. Some of us can recall fond memories of labeling and organizing bins of 8-inch floppy disks. Because of the “copy” effect, version control and collaboration were difficult, but with networks and the introduction of cloud storage in the 2000s, we were able to give people access to the “same” file, which shifted us away from physical storage. However, we still tended to keep our familiar file names and directory structures because it “just made sense.” It is interesting to consider that shared drives and file systems were at the heart of mainframe computers in the 1950s–1970s but that they were disrupted by use of personal computers in the 1980s. Today, with cloud storage and distributed collaboration we are—in some ways—back to where we started.


A History of Data Storage

If we look back at the history of data storage, we see two parallel tracks between the technology used to store data and the software systems built to interact with that technology. Initially, data was stored on large, cumbersome magnetic tapes and hard disk drives that offered limited capacity and slow retrieval times. Retrieval was slow not only because the tapes themselves were slow; when a tape filled up, it had to be physically removed and filed away. Storing and retrieving data meant someone manually combing through a library to find the right tape and loading it back into the computer.

As technology progressed, so did the storage solutions. Floppy disks, CDs, and DVDs became popular for storing data more compactly and with greater accessibility. Although more data could fit on smaller media, these were still physical forms of storage. Exchanging data meant a physical transfer, and accessing the data required the ability to read what was on the disk. This gave rise to all sorts of standards and file formats, operating system compatibility, converters, compression, and encryption. We still live with much of this today, but it has become invisible to us. In 20 years, if we try to explain to someone why a .doc file would not open on a Macintosh computer in 2010, we will struggle with the absurdity of our answer. Folder and directory names, after all, have always been abstractions, and they can vary widely from one user to another.

With the advent of cloud storage in the early 2000s, the systems used to interact with files changed rapidly. Directory structures were no longer limited to your desktop. Tools like Microsoft SharePoint and Google Workspace started putting us all in the same directory structure. Although this was a huge advancement in terms of collaboration and knowledge sharing, it led to many more difficulties in access control and version control. Essentially, these were (and are still) shared folders that define a file based on its name, some additional metadata, and a storage location.

Legacy issues from physical media exchange still plague network storage, such as the difficulty of duplicates. Anyone who has accidentally copied their photo library instead of moving their photo library knows this pain well. In a directory system, duplicates are painful because two identical files in two different directories are considered two different files. And if you did want duplicates (say for a backup), how do you know they are really the same? We will return to that question later.

Network storage in the early 2000s was a massive leap in data exchange, but it exacerbated the issues with data recall by making the pool in which to find files exponentially larger. Decentralized storage promises to extend data exchange even further. It also has a better way of facilitating recall in the form of CAS. To understand why this matters at a foundational level, we need to consider what happens when a computer searches for a file.

Searching for Files

At its core, all searching is recursive: start looking somewhere and keep looking until you find it. In a directory-based storage system, we need to “traverse the hierarchy,” or navigate to one directory and look for the file. If it’s not there, we keep looking in other directories or subdirectories until it’s found. Depending on how directories are set up, we can optimize this search for breadth or depth, but it is not inherently efficient. We’ve probably all felt the pain of searching a computer’s C:\ drive and losing 10 minutes of our lives.
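To make the cost concrete, here is a toy sketch of that recursive traversal in Python. The directory names and file are invented for illustration; the point is that, in the worst case, the search visits every entry in the tree before giving up.

```python
import os
import tempfile

def find_file(root: str, name: str):
    """Depth-first search: check each entry, then recurse into subdirectories."""
    for entry in os.scandir(root):
        if entry.is_file() and entry.name == name:
            return entry.path
        if entry.is_dir():
            hit = find_file(entry.path, name)
            if hit:
                return hit
    return None  # exhausted the entire subtree

# Build a small throwaway tree to search.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "projects", "2010", "archive"))
target = os.path.join(root, "projects", "2010", "archive", "report.doc")
open(target, "w").close()

print(find_file(root, "report.doc") == target)  # True, after walking every level
```

Every miss costs a full walk of a subtree, which is exactly why operating systems bolt an index on top.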

Most likely, your PC carries an index of the most frequently used files and where they are located. This index is a mapping of the file name to its storage location, but it is just a shortcut to speed things up. If your file is not in the index, that’s a whole different issue. More modern operating systems do a much better job at indexing, and recently software companies have been encouraging people to tag files. If you are disciplined enough to do that consistently and with some amount of logic, tagging adds some information into the file’s metadata, which makes it that much easier to group and find things.

We can speed things up even more by caching files or indexes, using parallel processing, or excluding lists to avoid searching in system folders (which is great until you are trying to find that auto-recover backup stored in: ~/Library/Containers/com.microsoft.Excel/Data/Library/Application Support/Microsoft). The advantage of the index is that the mapping is small enough for your computer to maintain in memory, which makes it much easier to search. The addition of metadata through tagging and more advanced operating systems provides more context for the search and improves results. But in the end, the index itself is an abstraction of the underlying files and one more thing that must be maintained.

These are not new issues, and in industries such as pharmaceuticals, where there are tight controls over documentation, we can use records management systems to enforce more formal and rigorous processes for version control, access control, and training requirements. By their nature, these address indexing issues and searchability—but they tend to work best when applied within a single organization. As organizations become more fragmented (e.g., joint ventures, contract manufacturing organizations, shared services), they too fall into the same traps as directory-based storage systems. These issues raise the question, “If directory-based systems require an index to be efficient, and indexes are used to abstract away the directory, why not get rid of directories completely?”



Enter CAS

We briefly introduced the concept of CAS previously. In the context of traditional storage, here are some of its primary advantages. Recall that CAS uses the hash of a file’s contents as both its identifier and its storage locator, which may sound like nothing more than a complex index, especially when you consider that a hash is typically a 64-character hexadecimal string that looks something like this: 8a5edab282632443219e051e4ade2d1d5bbc671c781051bf1437897cbdfea0f1. This may be very difficult for a person to remember, but to a computer, it is only 32 bytes of information. As a hexadecimal value, it can be used in calculations like any other number.
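Producing a CID is a one-liner with any standard hash library. The sketch below uses Python’s hashlib with SHA-256; real protocols such as IPFS wrap the digest in a multihash encoding, but the principle is the same (the document contents here are invented):

```python
import hashlib

def cid(content: bytes) -> str:
    # The content identifier is the SHA-256 digest of the bytes,
    # rendered as a 64-character hexadecimal string (32 bytes of data).
    return hashlib.sha256(content).hexdigest()

doc_a = b"Standard Operating Procedure v1"
doc_b = b"Standard Operating Procedure v1"  # byte-for-byte identical content
doc_c = b"Standard Operating Procedure v2"  # a single character changed

print(cid(doc_a) == cid(doc_b))  # True: same content, same address
print(cid(doc_a) == cid(doc_c))  # False: any change yields a completely new CID
print(len(cid(doc_a)))           # 64
```

Identical bytes always land at the same address, which is where deduplication comes from; any edit, however small, produces a new address.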

Hashes have some interesting mathematical properties. When you concatenate two hashes together and hash the result, you get a single hash from which you can mathematically prove that the two underlying hashes exist. It’s worth noting here that you cannot determine what the underlying hashes are. You can only determine whether a hash that you already knew existed was included in the original set.

In 1979, Ralph Merkle patented this approach for creating hash trees. The process can be applied recursively to produce a root hash, which can be quickly checked to determine whether a given hash value is included anywhere in the set. Hash trees, or Merkle trees, are foundational to CAS because they eliminate the need to traverse the hierarchy. Further, the location of the underlying data (i.e., the file with the hash as its CID) is determined mathematically, not by rules and procedures, which makes it excellent for universally locating data. They are also fast. Checking whether a hash exists in a Merkle tree that contains 1 million leaf nodes (that’s 1 million files) takes an average computer about four microseconds (or 0.000004 seconds). For comparison, try searching for a file in a Windows directory that contains 1 million other files. Although CAS helps locate files efficiently, it doesn’t necessarily help store or share them. For that, we need a protocol that leverages CAS.
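A minimal Merkle tree can be sketched in a few lines. This toy version (SHA-256, duplicating the last node on odd levels; production implementations differ in padding and domain separation) shows why verification is fast: proving membership among 8 files takes only 3 hashes, and among 1 million files only about 20.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def next_level(level):
    if len(level) % 2:                  # odd count: carry the last node up
        level = level + [level[-1]]
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def merkle_root(leaves):
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = next_level(level)
    return level[0]

def merkle_proof(leaves, index):
    """Collect the sibling hash (and its side) at each level for one leaf."""
    level, proof = [h(leaf) for leaf in leaves], []
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1             # the node we are paired with
        proof.append((level[sibling], sibling < index))
        level, index = next_level(level), index // 2
    return proof

def verify(leaf, proof, root):
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

docs = [f"document {i}".encode() for i in range(8)]
root = merkle_root(docs)
proof = merkle_proof(docs, 5)            # 3 sibling hashes for 8 leaves
print(verify(docs[5], proof, root))      # True
print(verify(b"tampered", proof, root))  # False
```

Note that the verifier never sees the other seven documents, only three sibling hashes, which is exactly the property that makes membership checks scale logarithmically.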

Made for Each Other: CAS and Blockchain

The Interplanetary File System (IPFS) is one example of a file-sharing protocol that takes advantage of CAS and Merkle trees—specifically, Merkle directed acyclic graphs (DAGs)—to optimize the storage and location of files. This file-sharing protocol’s name may seem straight out of science fiction, but the system was designed to minimize latency in retrieving data from distributed networks, with distance being the biggest variable in the equation. IPFS set out to show that it could synchronize files across massive distances, even between planets. It was one of the first protocols to combine CAS with the identity and immutability components that have emerged from public blockchains. This lets us do several things that are very important in the pharmaceutical records management world.

Signed

When a piece of content is added to the network, it is signed using a private key. The corresponding public key is then visible on the network as proof of who added it. Associating that public key with an individual or an entity still takes some additional work, but with the emergence of digital identity solutions that bind various identity trust anchors to a public/private key pair, it is becoming easier to identify someone using public key infrastructure.




Sealed

Adding a piece of content to a CAS with a unique identifier does not guarantee that the content will always be retrievable. If storage is your primary driver, you need a way to prove that the content is available and that it hasn’t changed. Most protocols use some form of “sealing,” which ties the physical storage sector where the data resides to the published CID used to find it later. Many go a step further by offering continual proof that the data is still there and building incentive mechanisms on top of the storage to keep those proofs coming, as Filecoin does on top of IPFS.


Delivered

Once you’ve stored data on a decentralized network, you need to leverage the other side of the protocol to retrieve it. On a fully open network, anyone with a copy of the CID can find the file and pull it up. This works great for public information, and you get all the guarantees that the returned data came from whoever signed it and that it was unaltered in between. For private data, there is a bit more involved, but you can think of the storage provider (the entity that is running the specific hardware where your data has been sealed) as a gatekeeper, and you can rely on many of the traditional controls we have grown comfortable with in cloud storage. In addition, you can encrypt the data before you even store it. Access controls around data retrieval are an active area of development within the Decentralized Storage Alliance.

To summarize, here are a few of the key advantages of using a CAS in combination with Merkle trees.



Content addressing

Each file is considered a node in the Merkle DAG and is identified by the hash of its contents. This allows you to uniquely identify and retrieve content based on its cryptographic hash rather than location.

Data integrity verification

The hierarchical structure of Merkle DAGs enables efficient verification of data integrity. When retrieving content, you can verify each piece against its hash to ensure it hasn’t been tampered with.

Deduplication

Identical content produces the same hash, allowing you to naturally deduplicate data across the network.

Efficient data synchronization

The Merkle DAG structure allows you to efficiently sync data between nodes by comparing tree structures and only transferring missing pieces.

Large file chunking

You can break large files into smaller chunks, each represented as a node in the Merkle DAG. This enables parallel processing and efficient storage of large datasets.

Content linking

You can use Merkle DAGs to create links between different pieces of content, allowing for complex data structures to be represented and traversed.

Versioning

When combined with a blockchain, the immutable nature of Merkle DAGs allows you to easily represent different versions of content by creating new nodes that link to previous versions.

Distributed storage

The content-addressed nature of Merkle DAGs facilitates protocols that allow data to be stored and retrieved from multiple nodes in the network without relying on centralized servers.

Retrieved

The primary advantage of CAS is its efficiency and reliability in data retrieval. Because each piece of data is associated with a unique hash, searching for data involves merely computing the hash of the desired content and looking it up directly. This reduces the time and complexity involved in data searches, which is especially beneficial in environments where speed and accuracy are paramount, such as in pharmaceutical manufacturing and quality control.

  • Simplicity and speed: Retrieving data using CAS is incredibly fast because the system doesn’t have to look through folders or directories. It goes directly to the data using its unique hash.
  • Security and integrity: CAS also enhances security. Because each piece of data is uniquely tied to its hash, any alteration in the data would change the hash. This change can be easily detected, making unauthorized modifications easy to spot.

This unique method of storing and accessing data provides distinct advantages in environments where speed and data integrity are paramount. For pharmaceutical companies, for instance, being able to quickly and reliably access data without fear of it being tampered with can significantly streamline operations and ensure compliance with strict regulatory standards.
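The two bullets above can be condensed into a toy in-memory content store (a sketch only; real systems persist blobs to disk or across a network, but the lookup-by-hash and verify-on-read logic is the same):

```python
import hashlib

class ContentStore:
    """Toy in-memory CAS: the SHA-256 hex digest is the only address."""

    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self._blobs[digest] = content  # identical content overwrites itself: free deduplication
        return digest

    def get(self, digest: str) -> bytes:
        content = self._blobs[digest]
        # Integrity check on the way out: re-hash and compare to the address.
        if hashlib.sha256(content).hexdigest() != digest:
            raise ValueError("content does not match its address")
        return content

store = ContentStore()
address = store.put(b"batch record 42")
print(store.get(address) == b"batch record 42")  # True: one hash plus one lookup
print(store.put(b"batch record 42") == address)  # True: same bytes, same address
```

Retrieval is a single hash computation and a direct lookup, and any tampering with the stored bytes is caught the moment the data is read back.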


CAS and the AI Context Problem

Generative AI models can analyze and process documents quickly, but they have short memories: the smaller and cleaner the chunk of information you feed them, the better the response. CAS gives each piece of content (e.g., a sentence, paragraph, or image) its own immutable hash. Embedding those CIDs into a prompt (and requiring that they be referenced in the output) creates the foundation for a solid audit trail.

One area of the authors’ research is studying the effect of CAS on optimizing the context window for modern large language models (LLMs). By extracting knowledge from documents and storing it as hashed snippets called “context units,” we can select only the relevant units, feed them to the model, and generate a Merkle root for easy auditing. The result:

  • Faster inference (optimizing the use of the model’s context window)
  • Measurable provenance, meaning auditors can match output to exact context units
  • Higher accuracy rates because stale or altered text changes the hash and flags itself
  • Protection against prompt injections due to the formulaic nature of context units
  • Granular version control allowing a single context unit to be updated without reloading the entire source document
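A minimal sketch of that pipeline, with invented SOP sentences standing in for extracted context units (the selection step here is hard-coded; a real system would use retrieval):

```python
import hashlib

def unit_cid(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def merkle_root(cids):
    """Pairwise-hash the unit digests up to a single auditable root."""
    level = [bytes.fromhex(c) for c in cids]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

# Hypothetical SOP broken into sentence-level context units.
units = [
    "Calibrate the balance before each weighing session.",
    "Record the lot number of every raw material used.",
    "Deviations must be logged within 24 hours.",
]
cids = [unit_cid(u) for u in units]

# Feed only the relevant units to the model, citing each CID,
# and commit to the selection with a Merkle root for the audit trail.
selected = cids[:2]
audit_root = merkle_root(selected)
prompt = "Answer using only these context units:\n" + "\n".join(
    f"[{c[:12]}] {u}" for c, u in zip(selected, units[:2])
)
print(len(audit_root))  # 64: a single hex root an auditor can recompute
```

An auditor who holds the cited units can recompute the root independently; a stale or edited unit would produce a different CID and fail the check.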

CAS therefore isn’t just storage plumbing; it’s an on-ramp to trustworthy, regulator-friendly AI. It should be noted that models are not necessarily trained using only content addressable data (though some of it is certainly in their datasets, as much of the decentralized web is stored in this manner). The approaches already discussed add the most value when interacting with these models using retrieval-augmented generation (RAG). Rather than asking RAG to dig meaningful information out of a 200-page standard operating procedure (SOP), a set of precise context units can be fed into the prompt.

Cryptography in CAS

Cryptography is integral to the operation of CAS systems. The hash functions used in CAS are cryptographic algorithms designed to take an input (or message) and return a fixed-size string of bytes. The output, known as the hash, is effectively unique to each input: it is computationally infeasible to find two different inputs that produce the same hash output, a property known as collision resistance.

One of the most important features of hashing is that it is just math. A set of known algorithms (SHA-256 being the most common) are widely used. If someone provides you with a hash, the name of the algorithm, and the source contents, you can reproduce the hash on any computer. You do not need their proprietary system or your custom records management solution. Math works everywhere.

This cryptographic feature ensures that the data stored in a CAS system is immutable and tamper evident. Any alteration to the stored data results in a different hash value. This effectively flags any unauthorized changes and protects the integrity of the data. This is particularly critical in the pharmaceutical sector, because data integrity directly affects compliance with regulatory standards such as those enforced by the U.S. Food and Drug Administration and European Medicines Agency.

The hash function SHA-256 was designed by the U.S. National Security Agency and published in 2001, but hash algorithms have a rich history dating back to the 1950s, when IBM researcher Hans Peter Luhn introduced one of the first hashing techniques for organizing information. If you logged into any electronic device today, you used some form of hash algorithm. There is even a whole set of quantum-resistant hash algorithms in development in anticipation of the next computing revolution. In other words, there is no secret formula in CAS. Most of the underlying systems and protocols that make it work are completely open-source software.

Leveraging Context Units for AI-Ready Data

By treating each knowledge fragment as a traceable unit, companies can unlock the speed of AI while still satisfying regulators’ demands for provenance and data integrity. The following are practical ways pharmaceutical teams can blend CAS, context units, and AI over the next few years.

Context-tag GMP documents

Condense the knowledge of large SOPs, batch records, and validation reports into sentence- or paragraph-level context units. Store each unit under its own CID so it can be pulled instantly by an AI model.

Build an audit-ready copilot

Let generative AI answer operator or inspector questions but force the model to cite every context unit (and its CID) used. The answer and the citation list form an audit trail you can attach to the batch record.

Automate change-control impact checks

When a document is updated, its hash changes. An AI agent can compare the new CID to dependent context units (e.g., in training materials or electronic logbooks) and flag anything that must be reapproved.
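A sketch of such an impact check (the SOP text, the dependency registry, and the item names are all hypothetical):

```python
import hashlib

def cid(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

sop_v1 = "Mix for 30 minutes at 200 rpm."

# Hypothetical registry: which downstream items cite which CID.
dependents = {cid(sop_v1): ["training-module-7", "e-logbook-template-3"]}

def impact_check(old_text: str, new_text: str):
    """If the hash changed, every dependent of the old CID needs reapproval."""
    if cid(old_text) == cid(new_text):
        return []
    return dependents.get(cid(old_text), [])

sop_v2 = "Mix for 45 minutes at 200 rpm."
print(impact_check(sop_v1, sop_v1))  # []: nothing changed, nothing to reapprove
print(impact_check(sop_v1, sop_v2))  # both dependents flagged for reapproval
```

Because the check is a hash comparison rather than a diff, it is cheap enough to run on every save.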

Use sealed context libraries for model fine-tuning

Fine-tune small domain models only on sealed, version-controlled context units. If the underlying data changes, you can regenerate the model or roll back with full traceability.

Create context gates

Before an internal LLM processes a prompt, a gatekeeper script checks that every referenced CID belongs to the approved library, preventing shadow knowledge from creeping into regulated workflows.
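The gatekeeper described above reduces to a set-membership check on CIDs (the approved sentences here are invented for illustration):

```python
import hashlib

def cid(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

# The approved library: only these CIDs may appear in a prompt.
approved = {cid("Calibrate the balance daily."), cid("Wear gloves in Zone A.")}

def gate(referenced_cids):
    """Reject the prompt if any referenced CID falls outside the approved library."""
    rogue = [c for c in referenced_cids if c not in approved]
    if rogue:
        raise PermissionError(f"unapproved context units: {rogue}")
    return True

print(gate([cid("Wear gloves in Zone A.")]))  # True
try:
    gate([cid("Unvetted advice pasted from the internet.")])
except PermissionError:
    print("blocked")  # shadow knowledge never reaches the model
```

The gate never inspects the text itself; holding the approved CIDs is enough, which keeps the check fast and the policy auditable.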

Independence from Underlying Systems

One of the standout features of using cryptography in CAS is its independence from the underlying hardware or storage technology. Whether the data is stored on local servers, in a cloud environment, or across a distributed network, the cryptographic principles apply uniformly, ensuring consistent security and integrity of the data. This independence is vital for pharmaceutical companies that operate across various jurisdictions and need to maintain data integrity and compliance regardless of the geographical location of their data storage facilities.

Implementing CAS with Blockchain Technology

Blockchain technology offers a robust framework for implementing CAS in a manner that further enhances data security and accessibility. At its core, blockchain is a distributed ledger technology where transactions are recorded in a tamper-evident, chronological manner. When combined with CAS, each piece of data can be stored on the blockchain, with its unique hash value serving as both the identifier and the verifier of the data’s integrity.

For pharmaceutical engineering professionals, this means that every modification—whether it’s a change in a drug formula or an update to manufacturing parameters—can be independently verified and traced back to its origin. This transparency and accountability are indispensable in a field where compliance with quality standards and regulatory mandates is essential.


Feedback from Outside the Industry

Previously, the ISPE GAMP® Blockchain and Decentralized Information Network Special Interest Group spoke with blockchain veteran Ken Fromm (Managing Director of BuildETH) in one of their bimonthly calls. Here is what he had to say about how decentralized storage and content addressability will play a pivotal role in the digital infrastructure of all areas of business and personal data usage:

“Data storage shares a similar wave of progression with other areas of technology in that it has shifted from on-premises to cloud and now is beginning a shift to decentralized solutions. In going from on-prem to cloud, firms saw dramatic increases in the ability to provision and scale to commercial-grade loads. They also reduced complexity as common APIs replaced proprietary interfaces and complicated network topologies. Decentralized storage is continuing this progression with paradigm-shifting features like content addressability, data verification, open APIs, compliance via cryptographic proofs, a decentralized global marketplace for storage, and more. As with each shift, the gains are not fully realized until you can no longer imagine going back to the old ways. Mark my words, the future of data storage is content addressable, decentralized, and verifiable.”

Conclusion

Content addressable data backed by strong cryptography is the launchpad for AI-first pharmaceuticals. Over the next few years, we expect to see the following.

  • AI copilots with built-in provenance: Every answer cites the exact CIDs (i.e., context units) used, giving regulators an instant audit trail.
  • Cryptographic proofs over documents: Hash commitments and Merkle trees will replace manual spot-checks during quality audits.
  • Smart change control: A change in a document’s hash automatically alerts linked training materials, batch records, and models so nothing slips through the cracks.
  • Verifiable data marketplaces: Firms that share sealed context units can license them with confidence, knowing each reuse is traceable.

As Ken Fromm notes, “the future of data storage is content addressable, decentralized, and verifiable.” That future is tailor-made for AI systems that must explain every step. There are several key takeaways for leaders. First, start hashing critical knowledge today. This includes SOPs, batch data, and R&D reports. Store those hashes (or the files themselves) on a CAS-enabled network or blockchain. Then require AI tools to consume only approved context units and to return their CIDs in the response. Teams that adopt this triad—content addressability, cryptography, and AI—will move faster, reduce risk, and stay inspection-ready.
