Protein Information Resource
The Protein Information Resource, or PIR, is an online, publicly available resource that contains databases of protein sequences and computer programs that support the study of proteins, their functions, evolutionary histories, and interactions with other biomolecules in living organisms. A non-profit research institution called the National Biomedical Research Foundation, or NBRF, established the PIR in 1984. Since proteins were first sequenced in the late 1950s, scientists have used protein sequence data to understand biological functions, interactions, and pathways. Protein sequences contribute to many kinds of research, including genomics, proteomics, and systems biology. From 1984 to 2004, the PIR housed one of the first, free online protein sequence databases and computer programs that allowed searching and comparison of those sequences. During that twenty-year period, scientists could use the PIR to identify an unknown protein and determine its evolutionary history and function, including its role in embryological development.
The PIR grew out of the need to collect, organize, and distribute a comprehensive collection of protein sequences for comparative computational research. In 1960, Robert Ledley, a researcher with a background in dentistry, mathematics, and engineering, founded the NBRF to explore the uses of computers in the fields of biology and medicine. In that same year, Ledley hired Margaret Dayhoff, a researcher with an interdisciplinary background in biochemistry and computing, to develop computer programs for analyzing and comparing protein sequences. At that time, advances in protein sequencing and computing spurred the emergent field. As a consequence of their various comparative sequence research projects, Dayhoff and others at the NBRF started to compile an onsite, general-purpose collection of all the sequences that other laboratory scientists had published in the scientific literature. From the time of its creation, the Protein Sequence Database, or PSD as they later called it, was stored in digital formats that a computer could read.
Both the PSD and the methods of distributing it underwent several technological advancements to accommodate the accelerating growth of published sequences before the PSD became a fixture of the PIR. Technologies for sharing copies of that data with a widespread audience were initially limited. In 1965, Dayhoff and colleagues published their collection of about seventy sequences in a paperbound volume entitled Atlas of Protein Sequence and Structure, or Atlas. From the second edition in 1966 to the fifth volume in 1972, the Atlas contained static snapshots of all the data in the PSD. It also contained resources for making sense of that sequence collection, such as computer algorithms. By 1973, the database had become too large and expensive to reprint in its entirety. Dayhoff and colleagues started printing supplemental volumes of the Atlas at that time to provide researchers who already had copies with updated information. That same year, they started storing and selling snapshots of the PSD on magnetic tapes, a data storage technology that was more compact, faster, and cheaper than paperbound books, and also reusable. After they published the last supplement of the Atlas in 1978, the team stopped production of the paperbound collection altogether, but managing the PSD internally was still a technological challenge. Also in 1978, Dayhoff devised a computer program that made it easier to store, organize, and otherwise manage the PSD in a more efficient and cost-effective way, which aided the development of the PIR.
In September 1981, Dayhoff made the computer program, a predecessor of the PIR called the PSD System, available online over a telephone network for a subscription fee. The PSD System had several programs that enabled users to find a sequence or collection of sequences of interest. Users could retrieve entries based on biological source or taxonomic classification. They could also translate nucleic acid sequences into protein sequences or vice versa. Less than a week before she died, Dayhoff submitted a proposal to the Division of Research Resources at the National Institutes of Health, or NIH, to further develop the PSD System and make it freely available to anyone. When Dayhoff died in 1983, Ledley and another colleague at the NBRF assumed leadership of the project and used the funding Dayhoff had obtained to oversee the launch of that proposed free, online resource.
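The translation of a nucleic acid sequence into a protein sequence can be illustrated with a minimal Python sketch. It assumes the standard genetic code, reads a single forward reading frame starting at the first base, and stops at the first stop codon. The names CODON_TABLE and translate are hypothetical and do not reflect how the PSD System itself was written.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleic_acid: str) -> str:
    """Translate a DNA or RNA string into one-letter amino acid codes."""
    dna = nucleic_acid.upper().replace("U", "T")  # treat RNA like DNA
    protein = []
    for start in range(0, len(dna) - 2, 3):
        # Unknown codons (for example, ones containing N) become "X".
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")
        if amino_acid == "*":
            break  # stop codon ends the protein
        protein.append(amino_acid)
    return "".join(protein)
```

For example, translate("ATGGCCAAATGA") reads the codons ATG, GCC, and AAA, stops at TGA, and returns "MAK".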
In 1984, the NBRF launched the PIR under the name Protein Identification Resource as a free online resource that included access to the PSD, several other databases, and computer programs to process the sequences contained in them. The PIR provided users access to copies of several databases: Los Alamos National Laboratory’s GenBank Database and the European Molecular Biology Laboratory’s Nucleotide Data Library, both of which were similar to the PSD but held nucleic acid sequences found in deoxyribonucleic acid, or DNA, and ribonucleic acid, or RNA, as well as several other special purpose databases. By 1985, the PSD contained more than 3,000 protein sequences organized by an evolutionary classification called superfamilies, a concept Dayhoff conceived of in a late edition of the Atlas that groups proteins by more distant evolutionary relationships. The hierarchical organization made many of the integrated computer programs for making sense of the sequence data much faster. The PIR had a program for searching the database for a sequence that matched a user’s query sequence. However, in 1985, other researchers unaffiliated with the NBRF published a more sophisticated search algorithm called FASTP, which they later developed into the FASTA program. That searching program relied on an information table called the PAM matrix, which Dayhoff had originally calculated and included in the Atlas. Those researchers donated a version of FASTP for use by the PIR. By 1986, the PIR had incorporated FASTP into the system for sequence search.
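As a rough illustration of how a program can search a sequence database for entries that match a query, the following Python sketch ranks entries by the number of short words they share with the query along a single alignment diagonal. It is a simplified example of word-based searching in general, not the actual implementation of the PIR's programs or of any published algorithm, and it omits any rescoring with a substitution scoring table. The function names, the word length k, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def best_diagonal_hits(query: str, target: str, k: int = 2) -> int:
    """Count shared k-letter words on the best single alignment diagonal."""
    # Index every k-letter word in the query by its starting positions.
    positions = {}
    for i in range(len(query) - k + 1):
        positions.setdefault(query[i:i + k], []).append(i)

    # A word at query position i and target position j lies on diagonal j - i.
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], []):
            diagonals[j - i] += 1
    return max(diagonals.values(), default=0)


def search(query: str, database: dict, k: int = 2) -> list:
    """Return (entry name, score) pairs sorted from best to worst match."""
    scores = {
        name: best_diagonal_hits(query, sequence, k)
        for name, sequence in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy database with made-up sequences, for illustration only.
toy_database = {"entry_1": "MKTAYIAKQR", "entry_2": "GGGGGGGGGG"}
ranking = search("KTAYIA", toy_database)
```

In this toy case, entry_1 shares five two-letter words with the query on one diagonal and ranks first, while entry_2 shares none. Indexing the query words once keeps the comparison with each database entry fast, which is the main appeal of word-based searching over comparing every position against every other position.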
In 1988, the NBRF joined several international partners to form PIR-International, an association of protein sequence data collection centers. The goal of PIR-International was to standardize and synchronize the sequence data in their respective collections. Partners originally included the Martinsried Institute for Protein Sequences at the Max Planck Institute for Biochemistry in Martinsried, Germany, and the Japan International Protein Information Database at the Science University of Tokyo, Japan. PIR-International operated with a decentralized approach, in which each participating center was responsible for curating and maintaining the literature and sequence data within its domain. Researchers and affiliated institutions submitted new entries to PIR-International by sending published literature articles to the respective centers associated with their locations.
The center would not only update its own database but would also send the entry to the other centers by electronic mail in a standardized, computer-readable format. Through the decentralized model, the association aimed to synchronize the data at each participating center while avoiding redundant entries. Another novel feature of the PIR-International member databases at that time was that additional information accompanied each protein entry. The annotations included information about the originating organism, references to primary literature, information about sequence determination, and the function and characteristics of the protein. The annotations were also searchable. By 1996, the NBRF had started to submit protein sequences to PIR-International by translating nucleotide sequences from GenBank, and from other nucleotide sequence databases affiliated with its international partners in Europe and Japan, into the proteins they encode. Around that time, the NBRF changed the name of the PIR from Protein Identification Resource to Protein Information Resource.
The superfamily classification system in protein databases posed challenges, so PIR-International introduced a new organizational approach in 1995, which marked a significant shift in how proteins were grouped and annotated in the PSD. In the 1970s, Dayhoff had grouped PSD entries by superfamily. At that time, scientists assumed related proteins were usually similar along their entire sequence. Over time, scientists discovered that a minority of sequences belong to more than one protein family or have segments dissimilar to their homologs for various other reasons. In 1995, the NBRF database architects began to address those problems by decoupling the concept of superfamily classification from relative placement in the database. Instead, they grouped together proteins that were homologous over the majority of their lengths and recorded which superfamilies each protein belonged to in the annotation of the associated protein entry. The new organization system improved the speed of the computer programs that searched and made sense of the database. By 1997, essentially all sequence entries were classified in that way.
In the late 1990s, the PIR took advantage of the Internet to adapt the database architecture to the continued accumulation of sequence data. In 1993, research engineers began to standardize and promote the development of the World Wide Web, or WWW, which is a service that operates over the Internet to provide users with linked information. That same year, research engineers also developed one of the first graphical Internet browsers that ran on office and home computers, an innovation people credited with the rise of Internet use in the 1990s. By 1998, the PIR operated a WWW site that included a description of the project, general announcements, facilities for submitting new proteins and other entries, and search and retrieval facilities for some of its databases. The search and retrieval facilities provided access to weekly updates of the PIR-International PSD and other PIR databases.
By 1997, the PIR no longer maintained direct access to versioned copies of the GenBank database, but its computer programs could still perform operations on that data. Before 1997, the PIR made copies of relevant databases like GenBank available over the telephone network, on CD-ROMs, and on magnetic media. The databases were growing, and it was no longer practical to maintain versioned copies of a growing number of specialized databases, including several databases that contained the entire genomes of organisms such as the fruit fly and baker’s yeast. Instead, NBRF engineers began using WWW hyperlinks to link to HTML-formatted entries that entirely different organizations maintained. The PIR also outsourced some computer programs over the Internet. Researchers had written the Basic Local Alignment Search Tool, or BLAST, a program for sequence look-ups that was faster and more capable than its predecessor, FASTA, with which it shared an author. Instead of maintaining a copy of the BLAST program, the PIR forwarded search requests to a computer at the National Center for Biotechnology Information, or NCBI, dedicated to running BLAST.
Cathy Wu, a researcher specializing in bioinformatics, joined the NBRF in 1999 and, as a principal investigator, led the development of the PIR’s integrated Protein Classification database, or iProClass. iProClass was a database of databases that linked diverse types of database entries in meaningful ways. Before Wu joined the NBRF, her most-cited research involved using a kind of machine learning called neural networks to classify proteins into superfamilies. Wu and colleagues devised a computer program to automatically extract citations and annotations from publications in the scientific literature. They used the citation data to attribute the sequence data to the originating authors, which improved the accuracy of the annotations in the PIR. Those annotations and many other kinds of digital objects were in the iProClass database and linked through meaningful relationships to over forty national databases that covered protein families, structures, functions, genes, genomes, literature, and taxonomy. The NBRF made the iProClass database available through the PIR website. Users could search for any kind of object using text, sequences, or unique identifiers and click on hyperlinks to go to a page about related objects in other databases.
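The idea of a database of databases, in which each protein entry carries links to related records elsewhere, can be sketched in Python as follows. This is only an illustration of cross-referencing in general. The class name, field names, database labels, and identifiers are all hypothetical and do not reproduce the iProClass schema or any real record.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProteinEntry:
    """A protein record plus links to related records in other databases."""
    identifier: str
    name: str
    # Maps a database label to the identifiers of linked records there.
    cross_references: Dict[str, List[str]] = field(default_factory=dict)


def find_by_cross_reference(
    entries: List[ProteinEntry], database: str, external_id: str
) -> List[str]:
    """Return identifiers of entries linked to a record in another database."""
    return [
        entry.identifier
        for entry in entries
        if external_id in entry.cross_references.get(database, [])
    ]


# Made-up entries and identifiers, for illustration only.
example_entries = [
    ProteinEntry("EX0001", "example kinase",
                 {"structure_db": ["S1"], "literature_db": ["L9"]}),
    ProteinEntry("EX0002", "example protease", {"structure_db": ["S2"]}),
]
linked = find_by_cross_reference(example_entries, "structure_db", "S2")
```

Here linked contains only "EX0002", the one entry that points to structure record S2. Storing the links inside each entry lets a single lookup lead from a protein to its related structures, literature, or other records without copying those external databases.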
In 2002, the NBRF joined two international partners to begin merging their regions’ respective protein sequence databases, including the PIR-International PSD, into a single internationally managed database. The NBRF, the European Bioinformatics Institute, and the Swiss Institute of Bioinformatics pooled their financial resources and expertise to form the Universal Protein Resource, or UniProt, consortium. The NIH awarded the NBRF a grant to collaborate with its UniProt consortium partners to form a single worldwide database of protein sequences and functions that unified the PIR-International PSD with the European and Swiss protein databases. Unlike PIR-International, which existed in a distributed manner until 2004, the UniProt protein sequence database was a centralized database available over the Internet. Wu and others headed the UniProt consortium, which established two-way cross-references between the UniProt and PIR protein databases. Those cross-references integrated the sequences and annotations of the two databases and made it easy to track former PIR sequence database entries. The NBRF maintained the PSD until its final release in 2004.
As of 2025, despite organizational changes that followed shifts in leadership and a shift in emphasis toward the development of UniProt, the PIR website is operational. In 2010, the NBRF dissolved in parallel with its founder’s retirement. Around that same time, Wu accepted the Edward G. Jefferson Chair of Bioinformatics and Computational Biology at the University of Delaware. In place of the NBRF, Wu established an organization that also held the name Protein Information Resource and retained many of the NBRF staff, who remained physically located at the Georgetown University Medical Center in Washington, DC. As director, Wu also hired new staff at the University of Delaware in Newark, Delaware. As of 2025, the organization called PIR is a contributing member of the UniProt Consortium and maintains the PIR resource, including the site and associated tools, many of which are still functional. The PIR updates and releases iProClass every three months. Its user interface is fully functional, and the database contains more than 290 million entries from UniProt and National Center for Biotechnology Information databases. Several other search and analysis tools on the site are operational, including a Text Search tool to navigate iProClass. Many of the pages also point to external, standalone web tools like iTextMine, a text-mining tool that automatically extracts information from biomedical text in sources like Medline.
For much of its history, the PIR served as the home for a growing database and an integrated resource to provide computational tools and make sense of what Dayhoff frequently referred to as the explosive growth of published sequenced proteins. The NBRF oversaw the transition of the PIR PSD from an internal sequence collection to an internationally coordinated sequence database. When the NBRF made the PSD available over a telephone network in 1984, it contained around 200,000 protein sequences. By the time of its last release in 2004, the PSD contained over one million proteins. The PSD laid the foundation for the international UniProt sequence database, which contains hundreds of millions of protein sequences. Through tools like iProClass, the PIR serves as an integrated resource that facilitates sequence searches, comparisons, and comprehension. The PIR also supports diverse biological research, such as embryology, in which proteins play crucial roles in gene expression regulation, cellular processes, and structural formation during development.
Sources
- Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215 (1990): 403–10. http://www.gersteinlab.org/courses/452/09-spring/pdf/Altschul.pdf (Accessed May 26, 2025).
- Andreessen, Marc, and Eric Bina. “NCSA Mosaic: A Global Hypermedia System.” Internet Research 20 (2010): 472–87.
- Barker, Winona C., David G. George, and Lois T. Hunt. “Protein Sequence Database.” Methods in Enzymology 183 (1990): 31–49.
- Barker, Winona C., David G. George, Hans-Werner Mewes, and Akira Tsugita. “The PIR-International Protein Sequence Database.” Nucleic Acids Research 20 (1992): 2023–26. https://pmc.ncbi.nlm.nih.gov/articles/PMC333980/ (Accessed May 26, 2025).
- Barker, Winona C., David G. George, Lois T. Hunt, and John S. Garavelli. “The PIR Protein Sequence Database.” Nucleic Acids Research 19 (1991): 2231–36. https://pmc.ncbi.nlm.nih.gov/articles/PMC331356/ (Accessed May 26, 2025).
- Barker, Winona C., John S. Garavelli, Daniel H. Haft, Lois T. Hunt, Christopher R. Marzec, Bruce C. Orcutt, Geetha Y. Srinivasarao, Lai-Su L. Yeh, Robert S. Ledley, Hans-Werner Mewes, Friedhelm Pfeiffer, and Akira Tsugita. “The PIR-International Protein Sequence Database.” Nucleic Acids Research 26 (1998): 27–32. https://pmc.ncbi.nlm.nih.gov/articles/pmid/9399794/ (Accessed May 26, 2025).
- Chen, H. R., and W. C. Barker. “The Protein Identification Resource and Its Applications.” Trends in Genetics 1 (1985): 221–23.
- Dayhoff, M. O., W. C. Barker, R. M. Schwartz, B. C. Orcutt, and L. T. Hunt. “Data Base for Protein Sequences.” In Proceedings of the June 7–10, 1976, National Computer Conference and Exposition, 261–66. New York: Association for Computing Machinery, 1976. https://dl.acm.org/doi/pdf/10.1145/1499799.1499841 (Accessed May 26, 2025).
- Dayhoff, Margaret Oakley. Atlas of Protein Sequence and Structure (Volume 5). Washington, D.C: National Biomedical Research Foundation, 1972.
- George, David G., Robert J. Dodson, John S. Garavelli, Daniel H. Haft, Lois T. Hunt, Christopher R. Marzec, Bruce C. Orcutt, Kathryn E. Sidman, Geetha Y. Srinivasarao, Lai-Su L. Yeh, Leslie M. Arminski, Robert S. Ledley, Akira Tsugita, and Winona C. Barker. “The Protein Information Resource (PIR) and the PIR-International Protein Sequence Database.” Nucleic Acids Research 25 (1997): 24–27. https://pmc.ncbi.nlm.nih.gov/articles/pmid/9016497/ (Accessed May 26, 2025).
- George, David G., Winona C. Barker, and Lois T. Hunt. “The Protein Identification Resource (PIR).” Nucleic Acids Research 14 (1986): 11–15. https://pmc.ncbi.nlm.nih.gov/articles/PMC339349/ (Accessed May 26, 2025).
- George, David G., Winona C. Barker, Hans-Werner Mewes, Friedhelm Pfeiffer, and Akira Tsugita. “The PIR-International Protein Sequence Database.” Nucleic Acids Research 22 (1994): 3569–73. https://pmc.ncbi.nlm.nih.gov/articles/pmid/7937060/ (Accessed May 26, 2025).
- George, David G., Winona C. Barker, Hans-Werner Mewes, Friedhelm Pfeiffer, and Akira Tsugita. “The PIR-International Protein Sequence Database.” Nucleic Acids Research 24 (1996): 17–20. https://pmc.ncbi.nlm.nih.gov/articles/PMC145575/ (Accessed May 26, 2025).
- Gillies, James, and Robert Cailliau. How the Web Was Born: The Story of the World Wide Web. Oxford: Oxford University Press, 2000. https://www.google.com/books/edition/How_the_Web_was_Born/pIH-JijUNS0C?hl=en&gbpv=1&dq=Gillies,+James,+and+Robert+Cailliau.+How+the+Web+Was+Born:+The+Story+of+the+World+Wide+Web&printsec=frontcover (Accessed May 26, 2025).
- “Margaret Oakley Dayhoff 1925–1983.” Bulletin of Mathematical Biology 46 (1984): 467–72.
- Orcutt, B. C., D. G. George, and M. O. Dayhoff. “Protein and Nucleic Acid Sequence Database Systems.” Annual Review of Biophysics 12 (1983): 419–41.
- Protein Information Resource. “PIR-PSD Database [PIR - Protein Information Resource].” Protein Information Resource. https://proteininformationresource.org/pirwww/dbinfo/pir_psd.shtml (Accessed May 26, 2025).
- Protein Information Resource. “Staff Members [PIR - Protein Information Resource].” Protein Information Resource. https://proteininformationresource.org/pirwww/about/staff.shtml (Accessed May 26, 2025).
- Protein Information Resource. “History [PIR - Protein Information Resource].” https://proteininformationresource.org/pirwww/about/aboutpir.shtml (Accessed May 26, 2025).
- Sidman, Kathryn E., David G. George, Winona C. Barker, and Lois T. Hunt. “The Protein Identification Resource (PIR).” Nucleic Acids Research 16 (1988): 1869–71. https://pmc.ncbi.nlm.nih.gov/articles/pmid/3353227/ (Accessed May 26, 2025).
- Strasser, Bruno J. “Collecting, Comparing, and Computing Sequences: The Making of Margaret O. Dayhoff’s Atlas of Protein Sequence and Structure, 1954–1965.” Journal of the History of Biology 43 (2010): 623–60. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=1c6a8c05535964dae20ff2a564b0462cc22954f7 (Accessed May 26, 2025).
- Strasser, Bruno J. “The Experimenter’s Museum: GenBank, Natural History, and the Moral Economies of Biomedicine.” Isis 102 (2011): 60–96. https://www.researchgate.net/profile/Bruno-Strasser/publication/51214024_The_Experimenter's_Museum_GenBank_Natural_History_and_the_Moral_Economies_of_Biomedicine/links/02e7e51e59d90858a2000000/The-Experimenters-Museum-GenBank-Natural-History-and-the-Moral-Economies-of-Biomedicine.pdf (Accessed May 26, 2025).
- Strasser, Bruno J. Collecting Experiments: Making Big Data Biology. Chicago: University of Chicago Press, 2019. https://www.google.com/books/edition/Collecting_Experiments/g_aVDwAAQBAJ?hl=en&gbpv=1&dq=Collecting+Experiments:+Making+Big+Data+Biology&printsec=frontcover (Accessed May 26, 2025).
- UniProt. “About UniProt.” UniProt. https://www.uniprot.org/help/about (Accessed May 26, 2025).
- Wu, Cathy H., and Daniel W. Nebert. “Update on Genome Completion and Annotations: Protein Information Resource.” Human Genomics 1 (2004): 229. https://pmc.ncbi.nlm.nih.gov/articles/PMC3525084/ (Accessed May 26, 2025).
- Wu, Cathy H., Chunlin Xiao, Zhenglin Hou, Hongzhan Huang, and Winona C. Barker. “iProClass: An Integrated, Comprehensive and Annotated Protein Classification Database.” Nucleic Acids Research 29 (2001): 52–54. https://pmc.ncbi.nlm.nih.gov/articles/PMC29833/ (Accessed May 26, 2025).
- Wu, Cathy H., George Whitson, Jerry Mclarty, Adisorn Ermongkonchai, and Tzu-Chung Chang. “Protein Classification Artificial Neural System.” Protein Science 1 (1992): 667–77. https://pmc.ncbi.nlm.nih.gov/articles/pmid/1304365/ (Accessed May 26, 2025).
- Wu, Cathy H., Lai-Su L. Yeh, Hongzhan Huang, Leslie Arminski, Jorge Castro-Alvear, Yongxing Chen, Zhangzhi Hu, Panagiotis Kourtesis, Robert S. Ledley, Baris E. Suzek, C. R. Vinayaka, Jian Zhang, and Winona C. Barker. “The Protein Information Resource.” Nucleic Acids Research 31 (2003): 345–47. https://pmc.ncbi.nlm.nih.gov/articles/PMC165487/ (Accessed May 26, 2025).
Rights
Copyright Arizona Board of Regents Licensed as Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported (CC BY-NC-SA 3.0)