A Vision for a Future Where Statistical Data Benefits Society

Japan’s 1947 Statistics Act underwent a thorough revision in April 2009, around 60 years after it was first introduced. The term “statistics” derives from the idea of the condition of a state: it was a science of the state that dealt with population data as a basis for new government guidelines. The Human and Social Data Project is a research commons database. Its aim, in collaboration with the “Social Communication” applied research project, is to compile and provide a useful database for the research community, and also to support the sharing and development of statistical methods used in the social sciences and other fields. Sub-project leader Professor Hiroe Tsubaki (Institute of Statistical Mathematics) tells us more.

Science-based policymaking and secondary use of official statistical data

The importance of statistical science was first recognized in the 16th and 17th centuries, when history was thought of as a “moving statistic.” The discipline also began to spread around this time, predominantly in Germany. The collection of data for scientific purposes (or evidence-based policymaking, as we would call it today) did not begin until around the end of the 19th century. In Japan, Okuma Shigenobu (twice prime minister of Japan) initiated the collection of statistics in 1881, and the first national census was conducted during the Taisho period (1912-26). While data collected by the government was originally used exclusively for its primary purpose of policymaking, the revised Statistics Act seeks to promote its secondary use. This is based on the idea that data is the intellectual property of the population as a whole, and it involves making greater use of data in research and feeding the findings back to the general population. In this context, we entered into a partnership agreement with the National Statistics Center, which is Japan’s central organization for the secondary use of data, including for research purposes. We also work together with Hitotsubashi University and other higher education institutions to support the use of data in actual research.

Supporting research and modeling

Along with providing data, ROIS (the Research Organization of Information and Systems) supports the analytical and modeling work of researchers in the social sciences and other fields. For example, if we want to determine the validity of a particular modeling method for data analysis in the social sciences, logging or otherwise recording its use at ROIS would let us archive the fact that the method is currently being used in the social sciences. Identifying which of these methods do or do not lead to solutions or recommendations is also important. In addition, we aim to build on the “modeling knowledge” gathered in this way at the grassroots level to generate a kind of general-purpose model that researchers can share and reuse.

The on-site lab at the Institute of Statistical Mathematics, which is fitted with surveillance cameras and applies other measures to meet stringent criteria, allows official statistics to be used in a secure environment. The institute began to provide anonymized official data in 2010 and to promote its secondary use. There are also facilities for overnight accommodation in the Akaike Guesthouse, which makes it possible to welcome researchers who do not live nearby. However, data that has been anonymized to protect personal information does not provide enough information for large-scale modeling in social science data analysis. In response, the secondary use research group of the Ministry of Internal Affairs and Communications considered the matter, and a recommendation was put forward to the Science Council of Japan’s master plan via the Japanese Society of Applied Statistics. The recommendation was adopted in 2013. The procedure it sets out enables universities and other institutions across Japan to make use of official statistical data. It was included in the basic plan of the Cabinet Office’s statistical committee and was approved by the Cabinet in March 2014. The National Statistics Center has also decided that in the future, data is to be collected from all government ministries, not just the Ministry of Internal Affairs and Communications. This should provide even greater opportunities for universities to use the data.

Advantages of connecting official statistics and cyberspace information

As an applied statistician, my role is to pioneer new applied disciplines in the field of statistics. I hope that this will also further the development of mathematical statistics, another core field of statistics. The Social Communication Project tackles suicide prevention and food safety, topics that grew out of integrated research and have come to be considered disciplines in their own right. My field of statistical data would be categorized as a more “classical” discipline, and it primarily involves identifying data characteristics as well as statistical modeling. By contrast, there are also many types of data that can create value simply by being collected, and the Cyberspace Data Project that Project Director Noboru Sonehara and Project Researcher Yu Ichifuji (National Institute of Informatics) are working on is a prime example of this. Both types of data have advantages and disadvantages. For example, a census, which is a traditional population statistic, is conducted once every five years, but nowadays we can track people’s movements in real time using mobile phones. A population census and the number of mobile terminals measure entirely different things, but if at some point we were able to link and model census and mobile telephone information, this would enable us to estimate population migration in the period between censuses. I believe that using one statistic to adjust for the bias in another is an extremely powerful technique.
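As a rough illustration of using one statistic to adjust another, the sketch below calibrates a mobile terminal count against a census count and applies the resulting ratio to a later terminal count. It is only a minimal sketch: the function names and all figures are hypothetical, and it assumes that the ratio of residents to terminals stays constant between censuses, which a real model would need to test rather than take for granted.

```python
def calibration_ratio(census_population, mobile_terminals):
    """Residents per mobile terminal, measured in the census year."""
    if mobile_terminals <= 0:
        raise ValueError("mobile_terminals must be positive")
    return census_population / mobile_terminals


def estimate_population(ratio, mobile_terminals_now):
    """Scale a current terminal count by the census-year ratio."""
    return ratio * mobile_terminals_now


# Hypothetical figures for a single area (not real data).
census_year_population = 100_000  # from the five-yearly census
census_year_terminals = 80_000    # terminals observed in the same year
interim_terminals = 84_000        # terminals observed between censuses

# Assumption: residents per terminal is stable between censuses.
ratio = calibration_ratio(census_year_population, census_year_terminals)
interim_estimate = estimate_population(ratio, interim_terminals)

print(f"Estimated interim population: {interim_estimate:,.0f}")
```

The census supplies the accurate but infrequent anchor, while the terminal count supplies the timely but biased signal; the ratio is the simplest possible link between the two.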

Social Communication Project

Big data for quicker problem solving

The statistical approach is based on the cycle of plan-do-check. Adding “action” to this gives us PDCA, which Japanese car manufacturers are well known for having used in their QA processes to become global leaders between the 1960s and early 1990s. Today, statistics is used in the “check” stage. This involves comparing the actual data with the data predicted by the data model, and identifying which discrepancies constitute the problem to be resolved. For example, the overall suicide rate in Japan is currently falling, but it continues to increase in certain age groups and areas. This means that the problem lies in the variation between different areas. We then gather data on the suspected causes together with the outcomes, and develop a statistical model. Using this model to control for the causes and evaluate the effect of environmental factors enables us to take the analysis even further. Here, classical confirmatory statistical modeling remains as important as ever. However, huge advances have been made in recent years in these kinds of problem-solving processes; these advances involve the identification of factors from big data. There is now a range of methods for finding where a problem lies and what its possible causes are, such as machine learning, data mining, and artificial intelligence. We could say that the statistical cycle has been given a new lease on life.
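The “check” stage described above can be sketched in a few lines: compare observed values with model predictions and flag the areas where the gap is large. This is an illustrative sketch only. The area names, rates, tolerance, and the function `flag_discrepancies` are all hypothetical, and a real analysis would obtain the predictions from a fitted statistical model rather than a hard-coded table.

```python
def flag_discrepancies(actual, predicted, tolerance):
    """Return {area: residual} for areas where |actual - predicted| > tolerance."""
    flagged = {}
    for area, observed in actual.items():
        residual = observed - predicted[area]
        if abs(residual) > tolerance:
            flagged[area] = residual
    return flagged


# Hypothetical rates per 100,000 population (not real data).
actual_rate = {"Area A": 18.0, "Area B": 25.0, "Area C": 16.0}
predicted_rate = {"Area A": 17.5, "Area B": 19.0, "Area C": 16.5}

# Areas whose observed rate departs from the prediction by more than 2.0.
flagged = flag_discrepancies(actual_rate, predicted_rate, tolerance=2.0)

for area, residual in flagged.items():
    print(f"{area}: observed rate differs from prediction by {residual:+.1f}")
```

The flagged areas are where the problem is located; the next step in the cycle is to gather candidate causes for those areas and model them.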

Professor Hiroe Tsubaki believes that even plotting a graph of cause and effect is valuable training, and that elementary-school children can accomplish a great deal as long as they work through the statistical cycle properly. His work, which also relates to the development of data scientists, includes educational and dissemination activities on “statistics for problem solving” aimed at elementary and junior-high-school pupils.

(Text in Japanese: Hiroe Tsubaki, Rue Ikeya. Photographs: Mitsuru Mizutani. Published: February 10, 2015)