A Systematic IT Solution for Culture Collections Community to Explore and Utilize Preserved Microbial Resources Worldwide
A Blog post by Linhuan Wu (2017 WDS Data Stewardship Award Winner)
Demands on Information Technology from Culture Collections
Global culture collections play a crucial role in the long-term, stable preservation of microbial resources and provide authentic reference materials for scientists and industry. With the development of modern biotechnology, the knowledge available on any given microbial species is growing at an unprecedented rate. In particular, since the advent of high-throughput sequencing technology, enormous volumes of sequence data have accumulated and continue to grow exponentially, making data processing and analysis capacity indispensable to microbiological and biotechnological research. Culture collections must therefore not only preserve and provide authentic microbial materials, but also function as data and information repositories that serve academia, industry, and the public.
Figure 1. A system-level overview of the WDCM databases.
There is a gap between the capacity of culture collections and the needs of their potential users for a stable and efficient data management system and for advanced information services. Ideally, curators and scientists at culture collections would not only share data but also design and implement data platforms that meet the changing requirements of the microbial community. However, not all culture collections can afford the infrastructure and personnel to maintain their own databases and ensure a high level of data quality, let alone provide additional services such as visualization, statistical, and other analytical tools that enhance understanding and use of the microbial resources they preserve.
The WFCC-MIRCEN World Data Centre for Microorganisms (WDCM) was established 50 years ago, and became a WDS Regular Member in 2013. The longstanding aim of WDCM is to provide integrated information services for culture collections and microbiologists all over the world. To clear the roadblocks to the use of information technology and to capacity building in culture collections, WDCM has built its own information system and a comprehensive data platform with several constituent databases (Fig. 1).
Figure 2. Web interface of CCINFO database.
Culture Collections Information Worldwide (CCINFO), which serves as a metadata recorder, had collected detailed information on 745 culture collections in 76 countries and regions as of January 2018 (Fig. 2). In addition, WDCM has assigned a unique identifier to each culture collection registered in CCINFO to facilitate further data sharing.
To help culture collections establish online catalogues and further digitize information on microbial resources, WDCM launched the Global Catalogue of Microorganisms (GCM) project in 2012, with the aim of gradually building a system that offers fast, accurate, and convenient data access. The current version of the GCM database records 403,572 strains from 118 collections in 46 countries and regions, and performs automatic data quality control, including validation of data format and content, for example by checking species names against taxonomic databases.
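As a rough illustration of what such automatic quality control might look like, the sketch below checks a single strain record for required fields and for a plausible, known species name. It is not the GCM implementation: the field names, the `validate_record` function, and the small in-memory `KNOWN_SPECIES` set (standing in for a real taxonomic database lookup) are all assumptions made for this example.

```python
import re

# Hypothetical minimum fields; the actual GCM minimum dataset is not reproduced here.
REQUIRED_FIELDS = ("strain_number", "species_name", "collection_acronym")

# Assumed stand-in for a query against a taxonomic database.
KNOWN_SPECIES = {"Escherichia coli", "Bacillus subtilis", "Saccharomyces cerevisiae"}

# Format check: a capitalised genus followed by a lowercase epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+$")


def validate_record(record):
    """Return a list of problems found; an empty list means the record passed."""
    problems = []

    # Format validation: every required field must be present and non-blank.
    for field in REQUIRED_FIELDS:
        value = record.get(field) or ""
        if not str(value).strip():
            problems.append(f"missing field: {field}")

    # Content validation: the species name must be well formed and recognised.
    name = str(record.get("species_name") or "").strip()
    if name:
        if not BINOMIAL.match(name):
            problems.append(f"malformed species name: {name}")
        elif name not in KNOWN_SPECIES:
            problems.append(f"species name not in reference list: {name}")

    return problems
```

A record that fails would typically be returned to the submitting collection with the list of problems rather than silently dropped, so that the curator can correct it at the source.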
WDCM also developed the Analyzer of Bioresource Citation (ABC), a data mining tool to extract information from public sources such as PubMed, World Intellectual Property Organization, Genomes OnLine Database, and the National Center for Biotechnology Information Nucleotide database. Automatically linking the catalogue information submitted by each culture collection with the data mining results of ABC greatly enriches the information that GCM users can obtain in a single search.
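The linking step can be pictured as a join on a shared strain number. The following is a minimal sketch under that assumption only; the record layout, the `link_citations` function, and the sample identifiers are hypothetical and do not reflect how ABC or GCM actually store or match data.

```python
from collections import defaultdict


def link_citations(catalogue, mined_hits):
    """Attach mined references to catalogue records that share a strain number."""
    # Group mined hits by strain number so each catalogue record needs one lookup.
    hits_by_strain = defaultdict(list)
    for hit in mined_hits:
        hits_by_strain[hit["strain_number"]].append(
            {"source": hit["source"], "reference": hit["reference"]}
        )

    # Return enriched copies; the submitted catalogue records are left unchanged.
    return [
        {**record, "references": hits_by_strain.get(record["strain_number"], [])}
        for record in catalogue
    ]


# Hypothetical data for illustration only.
catalogue = [
    {"strain_number": "XYZ 100", "species_name": "Bacillus subtilis"},
    {"strain_number": "XYZ 200", "species_name": "Escherichia coli"},
]
mined_hits = [
    {"strain_number": "XYZ 100", "source": "PubMed", "reference": "example-article-1"},
    {"strain_number": "XYZ 100", "source": "patent", "reference": "example-patent-1"},
]
linked = link_citations(catalogue, mined_hits)
```

A join like this only works when both sides spell the strain number the same way, which is one reason consistent identifiers matter so much for data sharing.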
General Standards to Promote Data Sharing
At present, a major problem impeding the exploitation of microbial resources by academia and bioindustries is the low efficiency of data sharing. Because most culture collections use different data formats for data management and publication, users have to sift through a huge body of data to find valuable information, let alone obtain suitable microbial materials efficiently, which results in considerable waste of time and money.
Although many international organizations and initiatives have implemented their own data standards or recommended datasets for microbial resources data management (for instance, Darwin Core and the Organization for Economic Cooperation and Development's Best Practice Guidelines for Biological Resource Centres), there is still a long way to go before efficient data exchange, sharing, and integration are realized globally.
WDCM has established minimum and recommended datasets and has implemented them in the database management system of GCM to ensure uniform data formats and data fields. WDCM has also committed to developing a standard under International Organization for Standardization Technical Committee 276 – Biotechnology; namely, AWI 21710, 'specification on data management and publication in microbial resource centres (mBRCs)'. This work aims to improve the traceability of data on microbial resources preserved in different culture collections by normalizing local identifiers and mapping those already in use, and to give recommendations for popularizing a machine-readable persistent identifier system that enables data integration.
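To make the idea of normalizing local identifiers and mapping those already in use more concrete, here is a small sketch. The canonical form used (upper-case acronym, one space, accession token), the regular expression, and the function names are assumptions chosen for illustration; they are not taken from the draft standard.

```python
import re

# Acronym (letters), optional separators, then the collection's own accession token.
LOCAL_ID = re.compile(r"^\s*([A-Za-z]+)[\s\-_:]*([0-9][0-9A-Za-z.\-]*)\s*$")


def normalize_strain_id(raw):
    """Return 'ACRONYM NUMBER', or None when the input does not fit the pattern."""
    match = LOCAL_ID.match(raw)
    if match is None:
        return None
    acronym, number = match.groups()
    return f"{acronym.upper()} {number}"


def build_mapping(raw_ids):
    """Map each identifier already in use to its normalized form."""
    mapping = {}
    for raw in raw_ids:
        normalized = normalize_strain_id(raw)
        if normalized is not None:  # unparseable identifiers are left for manual review
            mapping[raw] = normalized
    return mapping


# Hypothetical spellings of the same strain as they might appear in different sources.
mapping = build_mapping(["xyz-100", "XYZ100", " Xyz : 100 ", "???"])
```

Keeping the mapping from old spellings to the normalized form, instead of overwriting the originals, preserves traceability: a strain cited under an older local identifier can still be resolved to the same record.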
The WDCM database currently uses a centralized model for data integration. Future developments such as 'Big Data' technologies, including the Semantic Web and Linked Open Data, will enable the system to provide more flexible data integration from broader data sources. Linking WDCM strain data to, for example, environmental, chemistry, and research literature datasets can add value to data mining and help in targeting microorganisms as potential sources of new drugs or industrial products. Linking microbial strain data to climate, agriculture, and environmental data can also provide tools for climate-smart agriculture and food security. WDCM will work with Research Infrastructures, publishers, research funders, data holders, and individual collections and scientists to ensure data interoperability and the provision of enhanced tools for research and development.