The Next Step of Genome × Statistics

In March 2016, the Research Commons Project will end following three fruitful years. One of its research projects, "Genetic Function Systems," has evolved over 10 years, carrying forward the integrated research under way since the founding of the Institute. On January 25, the symposium “Mathematics and Modeling in Genetics and Statistics” was held at the National Graduate Institute for Policy Studies in Roppongi, Tokyo, to report its findings. From the viewpoint of the statistics used to analyze genetic data (specifically genomes), what is the “Genetic Function Systems” project, and how will it unfold in the future? The organizers of this symposium, Professors Satoshi Kuriki (in the photograph on the left) and Hironori Fujisawa (in the photograph on the right) of the Institute of Statistical Mathematics, spoke on this topic.

“Selective Inference” In Focus

This symposium is intended as a debriefing of the project, with the goal of introducing the statistical method developed by the project team to the community involved in genome analysis. We also decided to take advantage of this opportunity to invite researchers from outside the project to introduce other new trends related to this field. The guest speaker’s topic is "Selective Inference," a recent trend in machine learning. I am very excited to hear that multiple testing, the statistical method that has been my specialty, has not only been developed in another context but has also expanded considerably in recent years (Kuriki).

With regard to selective inference, we were surprised to find, from our statistical point of view, how much machine learning is capable of. This is because machine learning, like data mining, aims to discover something from data, whereas statistics places strong emphasis on the reliability of the method, e.g., whether the results can really be reproduced in the future. Multiple testing is, in fact, one of the methods that most clearly embody these statistical properties (Fujisawa).

How is the genome linked to the variety of life?

In Limbo between “Too Difficult” and “Too Simple” Mathematics

Ten years ago, when we started the interdisciplinary research, I was excited by the new network of researchers that suddenly opened up in front of me. We sized each other up, wondering what knowledge each brought to the table. Coming into contact with the world of genomes, I felt that it was a treasure chest of data. More recently, everyone has become thoroughly familiar with one another, and I’m often asked for advice about submitting articles or responding to reviewers (Kuriki).

The research itself is interesting, but the most difficult part was how to publish our paper. Existing research fields have corresponding scientific journals, and there is an implicit understanding that certain types of content will be approved for publication. However, there is no dedicated journal for novel research. When we submitted our integrated research to journals covering genetics, they told us the mathematics we used was too difficult; when we submitted the same research to a statistics journal, they told us it was too simple (laughs). We also submitted our research to a bioinformatics journal—which falls in the middle—but even then, we had a tough time (Fujisawa).

Multiple Testing of Rice Genotypes That Cause Reproductive Isolation

Among the analyses conducted by Professor Fujisawa, Professor Nori Kurata (National Institute of Genetics), interdisciplinary project researcher Yoshiaki Harushima, and myself, there is one multiple testing problem that deals with a large data set of rice genotypes. Rice has 12 chromosomes, and it is known that a certain combination of two specific gene loci prevents the rice from surviving. By modeling the strains established by crossing pure breeds, together with the phenomenon in which chromosomes can “cross” during meiosis, it becomes possible to calculate the theoretical values for each combination. However, when the experiment was actually performed, there was a “deviation” from these values. This is the famous “Dobzhansky-Muller model,” in which non-conforming genes cause reproductive isolation, and the genetic challenge was to identify the causative genes. Therefore, we used the tube method to compare the theoretical and measured values of 1,000-by-1,000 genotype combinations, conducting a total of 550,000 tests. We published a paper calculating the peak of each combination and the significance of each peak (Kuriki).
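
The kind of pairwise scan described above can be illustrated with a minimal sketch. This is not the project team's method: the tube method is not reproduced here, and the simpler Bonferroni correction is swapped in to show why testing every pair of loci requires a multiplicity adjustment. The function names, the 0/1 genotype coding (which parent each locus was inherited from), and the toy data are all assumptions made for illustration, and the sketch ignores linkage between neighboring loci on the same chromosome.

```python
import math
from itertools import combinations


def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def pair_statistic(genotypes, i, j):
    """Pearson chi-square comparing observed genotype counts at loci i and j
    with the counts expected if the two loci were independent."""
    n = len(genotypes)
    observed = [[0, 0], [0, 0]]
    for individual in genotypes:
        observed[individual[i]][individual[j]] += 1
    row = [sum(observed[a]) for a in (0, 1)]
    col = [observed[0][b] + observed[1][b] for b in (0, 1)]
    stat = 0.0
    for a in (0, 1):
        for b in (0, 1):
            expected = row[a] * col[b] / n
            if expected > 0:
                stat += (observed[a][b] - expected) ** 2 / expected
    return stat


def scan_pairs(genotypes, alpha=0.05):
    """Test every pair of loci and keep those significant after Bonferroni."""
    n_loci = len(genotypes[0])
    pairs = list(combinations(range(n_loci), 2))
    threshold = alpha / len(pairs)  # Bonferroni: divide alpha by the number of tests
    hits = []
    for i, j in pairs:
        p = chi2_pvalue_1df(pair_statistic(genotypes, i, j))
        if p < threshold:
            hits.append((i, j, p))
    return hits, len(pairs)


# Hypothetical toy data: 4 loci, every allele combination observed 20 times,
# except that the combination (locus 0 = 0, locus 2 = 1) never survives.
genotypes = []
for code in range(16):
    alleles = [(code >> k) & 1 for k in range(4)]
    if alleles[0] == 0 and alleles[2] == 1:
        continue  # the incompatible pair is missing from the observed data
    genotypes.extend([alleles] * 20)

hits, n_tests = scan_pairs(genotypes)
for i, j, p in hits:
    print(f"loci {i} and {j}: p = {p:.3g} (significant among {n_tests} tests)")
```

With four loci there are only six pairs to test. With around 1,000 loci the number of pairs runs into the hundreds of thousands, so an uncorrected 5% threshold would flag many pairs by chance alone. Bonferroni controls this but treats the tests as unrelated, whereas neighboring loci give strongly correlated statistics, which is the situation where a method that accounts for that correlation can be less conservative.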

More Stimulus for Statisticians!

Genomic data is immense. Processing such vast amounts of information with traditional statistical methods was difficult, and this prompted a change in stance away from traditional statistics. This was also true for the topics in which I personally had an interest: “outliers,” which are typically excluded because they deviate significantly from the central data, and “missing data,” which arise from a lack of observational data. With the advance of technology and enhanced accuracy, I thought “outliers” would be eliminated, but the truth is that new technology is unstable, and it frequently caused the number of outliers to increase. This spurred the motivation to automate the process and/or come up with new statistical methods. We were also motivated by the incomplete raw scientific data that we encountered throughout the project. These new forms of stimuli continue to inspire new research (Fujisawa).

In photo, from left: (From the National Institute of Genetics) Vice Director Toshihiko Shiroishi; Assistant Professor Toyoyuki Takada; Professor Nori Kurata, Project Director. Professor Takashi Tsuchiya (National Graduate Institute for Policy Studies, Joint Supervisor). (From the Institute of Statistical Mathematics) Professor Satoshi Kuriki; Professor Hironori Fujisawa.

(Text in Japanese: Satoshi Kuriki, Hironori Fujisawa, Rue Ikeya. Photographs: Toshiaki Kitaoka. Published: March 10, 2016)