BioHackathon 2014: towards data integration

A “hackathon” is a coding event in which leading programmers are assembled to focus on development for a concentrated period and resolve outstanding issues. The first BioHackathon to adopt this format was held in 2008. Since then, top-level researchers and technicians working with life science databases have met each year (hosted by NBDC and DBCLS). In recent years, the Resource Description Framework (RDF), the standard data format for the Semantic Web in the Web 3.0 era, has been promoted, and the BioHackathon has become established globally as an important opportunity to take on the integration and standardization of data. The event was held for the seventh time from 9th to 14th November 2014, jointly hosted by Tohoku University and the Tohoku Medical Megabank Organization in Sendai and Matsushima, and was attended by 78 people from 10 countries. It was organized by Project Assistant Professor Toshiaki Katayama (DBCLS) and Project Associate Professor Atsuko Yamaguchi (DBCLS), who tell us more about the event below.

Seven good years thanks to the BioHackathon

The original trigger for arranging such an event was being invited to overseas hackathons and discovering what a highly productive format they are. Because research is a human activity, working with other people is the most efficient way of getting things done. Nowadays we can access research presentations, keynote speeches and the like online, so I believe that the Q&A sessions, discussions over coffee, and other such interactions are the most fruitful parts even of conferences. A hackathon is essentially a collection of just these best bits. After assembling the people who created the world’s important databases, we begin by identifying problems. By writing straightforward programs to address them, we can accomplish tedious tasks at the click of a mouse, making search and data management a hundred times faster. Through a series of such resolutions, we are able to advance bioinformatics extremely efficiently. Furthermore, holding these hackathons every year has led to several joint projects with overseas institutions. - Prof. Katayama

Making Japanese genome data useful for physicians

In collaboration with the Tohoku Medical Megabank Organization, which has collected genome data from the Tohoku region, we are now focusing on the genome and medical data of individual people, which have so far seen little use in the Semantic Web. Data concerning individual genomes involve privacy issues, and we are still finding it difficult to reconcile them with the concept of open data that underpins the Semantic Web. Our goals in this regard are i) the formulation of specifications that allow such private data to be integrated with public life science data; and ii) the development of technology that organizes this data ready for health applications. Incidentally, the “standardization” of data is a central mission of the BioHackathon, and several international standards have already been proposed. ‘Official collection of data that link the genetic characteristics and disease trends among Japanese people, followed by a crosschecking of this data with individual genome data in a clinical setting to play a role in diagnosis and treatment’ is the shared vision that we are currently pursuing. - Prof. Katayama
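As a rough illustration of goal i), a private record can name public resources by shared URIs so that it can later be joined with open datasets. Below is a minimal sketch using Python’s rdflib library; the namespaces, patient identifier, and choice of variant are illustrative assumptions rather than any specification produced at the BioHackathon.

    from rdflib import Graph, Namespace

    # Hypothetical namespaces: a private record space and a public,
    # identifiers.org-style URI space for dbSNP variant records.
    PRIV = Namespace("http://example.org/private/")
    DBSNP = Namespace("http://identifiers.org/dbsnp/")

    g = Graph()

    # A private clinical fact: a pseudonymized individual carries variant
    # rs334. Because the variant is named by a shared public URI, this
    # private graph can later be joined against public variant-disease
    # datasets without the patient record itself ever being published.
    g.add((PRIV["patient_0001"], PRIV["hasVariant"], DBSNP["rs334"]))

    print(g.serialize(format="turtle"))

The key design point is that the private side of the data never needs to be opened; it only needs to use the same identifiers as the public side.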

“Unsung heroes” allowing the use of RDF

In order to pursue format standardization and integration, basic technological development for databases is essential. One such development is technology that seamlessly links distributed open databases with one’s own data and retrieves exactly the data one wants. This becomes much easier if the data are modeled in RDF. RDF is a data model that allows a computer to extract meaning from data, and it is currently the most widely used format for linking high volumes of data through meaningful relationships. However, the technology for processing RDF data distributed across diverse locations within a realistic amount of time is not yet mature. Standardizing the catalogues that describe where data are located, automatically generating metadata, and automatically collecting data without placing undue pressure on any particular server: these form the technological core that we are aiming to improve. - Prof. Yamaguchi
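As a rough sketch of the seamless linking described above, the snippet below sends a SPARQL query to a remote endpoint and merges the answers into a local RDF graph, using Python’s rdflib and SPARQLWrapper libraries. The endpoint URL, namespace, and vocabulary are placeholders, not a real service or a BioHackathon deliverable.

    from rdflib import Graph, URIRef, Literal, Namespace
    from SPARQLWrapper import SPARQLWrapper, JSON

    EX = Namespace("http://example.org/")  # illustrative namespace

    # One's own data, held locally.
    local = Graph()
    local.add((EX["gene_0042"], EX["observedIn"], EX["cohort_A"]))

    # A distributed open database, reached through its SPARQL endpoint.
    # The URL below is a placeholder for a real life science endpoint.
    endpoint = SPARQLWrapper("http://example.org/sparql")
    endpoint.setQuery("""
        PREFIX ex: <http://example.org/>
        SELECT ?gene ?annotation
        WHERE { ?gene ex:annotation ?annotation . }
        LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()

    # Merge the remote answers into the local graph so that public
    # annotations and private observations live in one linked model.
    for row in results["results"]["bindings"]:
        local.add((URIRef(row["gene"]["value"]),
                   EX["annotation"],
                   Literal(row["annotation"]["value"])))

    print(local.serialize(format="turtle"))

A standardized catalogue of endpoints and their metadata is what would tell such a client where to send each query and what vocabulary to expect; that catalogue is exactly the unsung-hero infrastructure described above.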

About RDF
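RDF represents all data as subject-predicate-object triples, so records from different databases can be linked simply by sharing identifiers. Here is a minimal sketch of the triple model using Python’s rdflib; the namespace and statements are illustrative.

    from rdflib import Graph, Namespace, Literal

    EX = Namespace("http://example.org/")  # illustrative namespace

    g = Graph()
    # Each RDF statement is one triple: subject, predicate, object.
    g.add((EX["BioHackathon2014"], EX["heldIn"], Literal("Sendai")))
    g.add((EX["BioHackathon2014"], EX["focusedOn"], EX["DataIntegration"]))

    # Turtle is one common textual serialization of RDF.
    print(g.serialize(format="turtle"))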

Databases integrate knowledge

Many researchers conduct biological experiments in the laboratory, and the arrival of next-generation sequencers means that large volumes of data can now be generated instantly from what were once separate pieces of traditional lab research. Technological innovations in experimental methods, such as the measurement of quantitative variation within cells, have also contributed to the growth in data volume. If we can sort these large volumes of data so that they can be used for different purposes, and extract their essence, we should be able to create an integrated biological knowledge base. A database is akin to a library. By constructing a database that offers a bird’s-eye view across all fields of life science while also showing precise details, we create something like a textbook. Moreover, the “discovery” of such models from the available data will be attributed not to a human being but to a computer. This is the future direction of the data integration journey. - Prof. Katayama

Thoughts from the opening day of the symposium: “Millions of different environments exist on this planet, and each of these rolling landscapes has its niche of adapted living organisms. All life on earth thrives thanks to genes. If we use the information and theories obtained from life science databases to chemically synthesize living organisms and apply some kind of stimulus, the compound may come to life. This would also lead to an understanding of the switch between life and death.” - Project Assistant Professor Toshiaki Katayama

(Text in Japanese: Toshiaki Katayama, Atsuko Yamaguchi, Rue Ikeya. Photographs: Mitsuru Mizutani. Published: December 10, 2014)