The Making of the Databases of Scientists’ Dream
Isaac Newton had his epiphany in a garden with apple trees, and Archimedes had his in a bathhouse. For modern life scientists, databases could become the go-to place to seek inspiration for their own eureka moments.
Bioinformatics experts from around the world, including those in Japan, are making progress on a decades-old initiative to blend all sorts of life science data into the most comprehensive and cohesive reservoir of information imaginable, with the aim of accelerating scientific discoveries. The ultimate database they envision would not only put an astronomical volume of information at the user’s fingertips, but also show how seemingly unrelated data are connected in the real world. For example, data on a human gene could encompass anything from how the same gene works in other species to which diseases are associated with it.
With big data growing even bigger and more important in life science, there is an impetus for the movement to consolidate scattered biological data. Two experts from the Database Center for Life Science (DBCLS) who have been at the forefront of the movement explain what progress has been made thus far in Japan and what challenges still lie ahead.
Ask an Expert: Susumu Goto (Database Center for Life Science, DBCLS)
Dr. Susumu Goto (right) is a professor at the Database Center for Life Science (DBCLS) at the Research Organization of Information and Systems’ Joint Support-Center for Data Science Research. Before joining the DBCLS in 2017, he worked for 23 years at Kyoto University’s Institute for Chemical Research, where he helped Prof. Minoru Kanehisa found a bioinformatics database called the Kyoto Encyclopedia of Genes and Genomes (KEGG) and developed his expertise in bioinformatics, an interdisciplinary field that uses computers to sort and analyze large volumes of biological data. Dr. Goto earned his bachelor’s and doctoral degrees in computer science in 1989 and 1994, respectively, both from Kyushu University. Dr. Toshiaki Katayama (left), a Project Assistant Professor at the DBCLS, has led BioHackathon, the DBCLS’s international coding camp for life science data standardization, since its inception 10 years ago.
Database as an Encyclopedia of Biology
Life scientists began dreaming up integrated biological databases during the 1990s as they became flooded with billions of pieces of genomic data. Among those dreamers were scientists at Kyoto University’s Institute for Chemical Research, who envisioned creating a virtual encyclopedia of biology. They appropriately named their project the Kyoto Encyclopedia of Genes and Genomes (KEGG).
“Back then, all you saw in databases was character data without context, such as nucleotide and amino acid sequences, whereas what was in textbooks was mostly graphical data, explaining how metabolic pathways and signal transduction pathways work,” Prof. Goto, one of KEGG’s founders, said, looking back on the early days. “We thought it would be great if we could combine both types of information. We figured the web, which was becoming popular at the time, would work well as a medium for that.”
Understanding Human Genes in Relation to Other Species’ Genes
KEGG is a database that brings together genomic and genetic data with the information about genes’ roles in “pathways,” such as metabolic pathways and signal transduction pathways.
“Once the genomes of humans and tens of thousands of other species were sequenced, it became clear which genes are there. And we can use diagrams to show things like what each gene does to cause a chain of metabolic reactions. KEGG calls the flowchart describing this network a ‘pathway map,’” Prof. Goto said.
Pathway maps make it easier for people to grasp and compare the pathway patterns and the genes involved among various species.
“KEGG also catalogues genes found in ocean and soil samples and compiles information about the gene functions. So, we expect the KEGG database will enable us to look at the relationships among species from a whole new viewpoint: how one species’ chemical processes influence another’s,” Prof. Goto said.
Standardizing Data to Make It Sharable Among All Databases
There are many databases in Japan that are open to the public. These databases come in all sizes and forms, covering a range of organisms from microbes to plants, birds and primates. Major databases include those of Tohoku University’s Tohoku Medical Megabank, which collects human genomic information, and the Kazusa DNA Research Institute, which deals mostly with plant genomes.
The DBCLS wants researchers to be able to simultaneously access many of these databases and effortlessly cross-reference all kinds of data, including chemical compounds and structures, pathway information and genome data on various species. This means all data must be standardized for integration, and developing technology for it is the key, Prof. Goto said.
“We are using semantic web technology to develop a data integration system that works with a data model called Resource Description Framework (RDF). RDF allows the system to learn the context of data and link data accordingly. So, you can search a gene by an associated disease and vice versa, for example. You can also look something up by gene-disease correlation types,” Prof. Goto said. “With this system, it’s super easy to locate the information that you are looking for. The system also helps you analyze and interpret data more accurately and efficiently.”
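As a rough illustration of the idea Prof. Goto describes, the sketch below stores a few subject-predicate-object statements, the basic unit of RDF, as Python tuples and looks up genes by disease and diseases by gene. The identifiers (`gene:GENE_A`, `disease:DISEASE_X`) and the predicates `associatedWith` and `orthologOf` are made-up placeholders, not the DBCLS’s actual vocabulary, and a real system would most likely use an RDF store and a query language such as SPARQL rather than an in-memory list.

```python
# Toy triple store: each fact is a (subject, predicate, object) statement.
# All identifiers below are hypothetical placeholders, not real database IDs.
TRIPLES = [
    ("gene:GENE_A", "associatedWith", "disease:DISEASE_X"),
    ("gene:GENE_A", "associatedWith", "disease:DISEASE_Y"),
    ("gene:GENE_B", "associatedWith", "disease:DISEASE_X"),
    ("gene:GENE_A", "orthologOf", "gene:MOUSE_GENE_A"),
]


def match(subject=None, predicate=None, obj=None):
    """Return every triple that fits the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def genes_for_disease(disease):
    """Search genes by an associated disease."""
    return sorted(s for s, _, _ in match(predicate="associatedWith", obj=disease))


def diseases_for_gene(gene):
    """Search in the opposite direction: diseases by gene."""
    return sorted(o for _, _, o in match(subject=gene, predicate="associatedWith"))


print(genes_for_disease("disease:DISEASE_X"))
print(diseases_for_gene("gene:GENE_A"))
```

Because every fact has the same three-part shape, the same `match` function answers both questions; this uniformity is what lets data from different sources be linked once they share identifiers.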
While the system is being constantly improved and updated, researchers across Japan can already take advantage of it to obtain diverse data from multiple databases. It has been made open source so that people can use it as suits them, according to Prof. Goto. “The user might be a bioinformatician interested in using the system’s API to develop a proprietary application, or could be a data scientist who wants to extract information from the databases to make a discovery,” Prof. Goto said.
Big Data for Genomic Therapy
For the past 10 years, the DBCLS has organized BioHackathon, an annual international coding camp, to promote standardization and integration of life science data.
“The organizers are particularly focused on the integration of medical databases in recent years because more people are growing interested in connections between genomic mutation and the causes of human diseases and that sort of thing,” said Project Assistant Professor Toshiaki Katayama, who has led the hackathon since its inception.
For BioHackathon 2017, Project Assistant Professor Katayama said he expects lively discussions on the topic of genome graphs, a cutting-edge tool for analyzing individual genomes.
“There are slight individual differences in human genomes. Research institutions have traditionally stored variant data separately as reference data. Genome graphs bundle an array of variant datasets into an intelligible batch of information,” Project Assistant Professor Katayama said. “This would allow researchers to laser-focus on the genomes of the Japanese, for example. Genome graphs are very useful that way.”
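To make the idea of a genome graph a little more concrete, here is a minimal sketch under simplifying assumptions: shared stretches of sequence are stored once as nodes, and each individual genome is an ordered path through those nodes, so a variant appears as an alternative node rather than as a separate dataset. The sequences, sample names and function names are invented for illustration and do not reflect the format of any particular genome graph tool.

```python
# Toy genome graph: nodes hold short sequence segments, and each genome is
# a path (an ordered list of node IDs) through the shared graph.
# Sequences and sample names are invented for illustration.
NODES = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}

PATHS = {
    "reference": [1, 2, 4],
    "sample_1": [1, 3, 4],  # carries the variant segment (node 3)
    "sample_2": [1, 2, 4],  # matches the reference at this site
}


def spell(path_name):
    """Reconstruct one genome's sequence by walking its path."""
    return "".join(NODES[node_id] for node_id in PATHS[path_name])


def subgraph(path_names):
    """Keep only the nodes visited by the chosen genomes."""
    used = {node_id for name in path_names for node_id in PATHS[name]}
    return {node_id: NODES[node_id] for node_id in sorted(used)}


print(spell("reference"))
print(spell("sample_1"))
print(subgraph(["sample_1"]))
```

Selecting a subset of paths, as `subgraph` does here, is one simple way to picture focusing on the genomes of a particular group within a single shared graph.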
Also key to the clinical use of big data is the development of standardized “pipelines” that automate a string of tasks, from integrating data to connecting genes with diseases to identifying the best medical treatment, Project Assistant Professor Katayama said.
“The pipeline that we envision would use a single graph to bundle all genomic information, allowing the users to apply the same pipeline with the cutting-edge genome graph technology to the subgraph sorted by disease or race,” Project Assistant Professor Katayama said.
Project Assistant Professor Katayama pointed out that the development of the Common Workflow Language (CWL), which is used to describe such data processing workflows, should also become a theme for this year’s BioHackathon.
“The entire world is working together to move toward data standardization,” he said.
Interviewer: Rue Ikeya
Photographs: Yuji Iijima unless noted otherwise
Released on: January 10, 2019 (the Japanese version was released on Sept. 11, 2017)