Internationally Standardized Data are Beginning to Unravel the Story of Life

The biological sciences were an early advocate of building databases. The National Center for Biotechnology Information (NCBI) in the US, the European Bioinformatics Institute (EBI; part of the European Molecular Biology Laboratory) in Europe, and the DNA Data Bank of Japan (DDBJ) have all succeeded in centralizing gene information databases. Following suit, the Database Center for Life Science (DBCLS) is integrating the life science databases held by Japanese universities, research institutes, medical institutions, and other bodies. These efforts are outlined below by Dr. Yuji Kohara, project leader and DBCLS chief, at the National Institute of Genetics.

What is an integrated database?

Life science data such as genomic and gene expression data have accumulated steadily in recent years. In addition, large volumes of data generated by past research projects sit on servers at research institutes and universities. In some cases these data are inaccessible for lack of maintenance, which often leads to the loss of information altogether. An integrated database is designed to let everyone access and make use of these dispersed databases without gathering them in one place. To this end, standardizing the format of all the data is imperative. The database search services currently available already allow, much like Google, a meta-search across all Japanese life science databases; our work now is to develop these database searches further.

RDF: the common international format

The data obtained through a search engine such as Google are returned as documents (the web pages identified by the search). A human being must then read each document, understand its meaning, sift through the information, and write a report. Our aim is to evolve the worldwide web from its current status as a “web of documents” into a “web of data.” DBpedia, for example, extracts structured data from Wikipedia so that a computer can combine various pieces of information and derive new information from them. The Resource Description Framework (RDF) is rapidly becoming the global standard format for realizing this “Semantic Web.” RDF expresses data as statements of subject, predicate, and object, known as triples, in which each element is defined unambiguously. Arranging all data as triples dramatically broadens the ways the data can be used, from building graphs to high-level searches that take meaning into account. In addition, because the same word can have different meanings in different fields, an “ontology” (a kind of dictionary) can be defined as the meaning map of each field, further improving the accuracy of search results.
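The triple idea can be sketched in a few lines of Python. The statements below are hypothetical placeholders (the `ex:` identifiers are invented for illustration, not entries from any real database), but they show how a pattern match over triples answers a question that would otherwise require reading documents:

```python
# Minimal illustration of the RDF triple model: every statement is a
# (subject, predicate, object) tuple with unambiguous identifiers.
# All "ex:" names below are hypothetical, for illustration only.

triples = [
    ("ex:BRCA1", "ex:isA",            "ex:Gene"),
    ("ex:BRCA1", "ex:associatedWith", "ex:BreastCancer"),
    ("ex:TP53",  "ex:isA",            "ex:Gene"),
    ("ex:TP53",  "ex:associatedWith", "ex:BreastCancer"),
]

def match(s=None, p=None, o=None):
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which subjects are associated with breast cancer?"
genes = [s for s, _, _ in match(p="ex:associatedWith", o="ex:BreastCancer")]
print(genes)  # ['ex:BRCA1', 'ex:TP53']
```

Because every statement has the same shape, data from different databases can be merged simply by concatenating their triples, which is what makes the format attractive for integration.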

“BioHackathon” to standardize databases internationally

Led by Project Assistant Professor Toshiaki Katayama (DBCLS) and his colleagues, the “BioHackathon” workshop is held each year (jointly hosted by the DBCLS and the NBDC) to create international standards through discussion and cooperation on the future vision for database development and its specifications. A workshop of this kind, resembling a short-term residential camp at which pressing problems are solved on the spot, is known as a “hackathon.” Owing to their productivity, hackathons are now widely adopted in the IT sector and other industries. The BioHackathon gathers technology developers working on the front line in areas such as databases, ontologies, and application development. Problems that would usually take a month or more can be resolved there in less than a week. So far, four domestic meetings have also been held in Japan to disseminate the knowledge and know-how obtained at the main event. The success of the BioHackathon suggests that our efforts over the past years are paying off.

BioHackathon 2014 in Tohoku

Questions designed in natural language return high quality search results

One of the most pressing questions among the various technological problems is how to interrogate databases so that they yield the right information. Dr. Jin-Dong Kim, Project Associate Professor (DBCLS), has recently developed a new technology to address this question. When a question is entered in natural English, such as “what are the genes related to disease X?”, the technology automatically generates a query in SPARQL (SPARQL Protocol and RDF Query Language, the language used to interrogate RDF data). Gene-related data relevant to disease X, correctly linked with regard to meaning, are then displayed in an integrated way, including literature information, genome data, and molecular structures. Although SPARQL has a reputation for being difficult to use, several life science researchers have already mastered it. Meanwhile, progress is being made toward criteria for performance evaluation, so that the technology can be applied in clinical and therapeutic settings.
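To give a feel for what such automatic query generation produces, here is a hedged sketch of the kind of SPARQL a natural-language front end might emit for the question above. The prefixes, predicate names, and URI are hypothetical placeholders, not the actual vocabulary of Dr. Kim's system or of the DBCLS services:

```python
# Sketch of generating a SPARQL query for "what are the genes related
# to disease X?".  The vocabulary (ex:, rdfs:, the disease URI) is
# invented for illustration; a real system maps the question onto the
# ontologies of the actual target databases.

def question_to_sparql(disease_uri):
    """Build a SPARQL SELECT query for genes linked to a disease."""
    return f"""
PREFIX ex:   <http://example.org/vocab/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?gene ?label
WHERE {{
  ?gene a ex:Gene ;
        ex:associatedWith <{disease_uri}> ;
        rdfs:label ?label .
}}
""".strip()

query = question_to_sparql("http://example.org/disease/X")
print(query)
```

The point of the pattern is that the researcher never writes the query: the triple patterns in the `WHERE` clause are assembled automatically from the parsed question, and the same query can be sent to any RDF endpoint that shares the ontology.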

Beyond this, we are working on handling moving images, and we anticipate the realization of “data-centric science,” in which biologists can examine the results of analyses of cell behavior and uncover their significance. We will certainly continue to strive toward such a future.

Written at the National Institute of Genetics in Mishima City, from where Mount Fuji can be seen on a clear day. In spring 2014, the main body of DBCLS moved from its original site at the University of Tokyo to Kashiwa in Chiba, with some DBCLS members moving to the National Institute of Genetics.

(Text in Japanese: Yuji Kohara, Toshiaki Katayama, Jin-Dong Kim, Rue Ikeya. Photographs: Mitsuru Mizutani. Published: April 1, 2014)