Statistical thinking reshapes natural language processing

For humans and machines to coexist, natural language processing, the study of human language, is essential. Language has long been seen as a tool of thought that is unique to humans and not found in other animals. While it has a logical structure, natural language is different from programming languages; it cannot be explained by logic alone. How can we model the meaning of the words we humans use?

In academia, natural language processing has been developed using various types of mathematics, such as statistics, discrete mathematics, and optimization techniques. Associate Professor Daichi Mochihashi (The Institute of Statistical Mathematics) participates in a meta-knowledge analysis project. His research applies probabilistic and statistical ideas to realize "unsupervised learning" by machines from observed words. Here, we introduce this research frontier.

People's Actions Have a Pattern

Statistical analysis for modeling

I study language models that give a general account of the probability with which individual words are used. I would like to recover the knowledge that humans have diligently recorded in the past from data on how words were actually used. This approach of learning something based on its use is itself a statistical method. Natural language processing is a fairly new academic field. Coming from statistical backgrounds, we draw on the inquiry and understanding found in traditional linguistics to propose methods for modeling natural languages and solving problems. Our work involves such tasks as automating annotation, which has previously been done manually, and tackling problems that cannot be solved by logical statements alone.
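To make the idea of "the probability with which individual words are used" concrete, here is a minimal sketch of a bigram language model estimated from usage counts. It is only an illustration under stated assumptions: the three-sentence corpus, the function name train_bigram_model, and the add-alpha smoothing are all hypothetical choices made for this example, and it is far simpler than the Bayesian models discussed in this article.

```python
from collections import Counter

# Hypothetical toy corpus, invented for this illustration only.
TOY_CORPUS = [
    "natural language is ambiguous",
    "natural language is flexible",
    "programming language is formal",
]

def train_bigram_model(sentences, alpha=1.0):
    """Estimate P(word | previous word) from raw usage counts.

    Add-alpha smoothing (an assumption of this sketch) keeps unseen
    word pairs from receiving zero probability.
    """
    context_counts = Counter()  # how often each word appears as a context
    bigram_counts = Counter()   # how often each (previous, word) pair appears
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    vocab_size = len(vocab)

    def prob(word, prev):
        numerator = bigram_counts[(prev, word)] + alpha
        denominator = context_counts[prev] + alpha * vocab_size
        return numerator / denominator

    return prob, vocab

prob, vocab = train_bigram_model(TOY_CORPUS)
# A word pair that was observed gets more probability than one that was not.
print(prob("language", "natural"), prob("formal", "natural"))
```

Nothing in the model is hand-written grammar: every probability comes from how the words were actually used in the corpus, which is the sense in which learning from use is a statistical method.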

Circle of collaboration unique to basic research

In fact, statistical methods for analyzing natural languages are also beneficial for other types of data that share some characteristics with language. Today, joint studies on statistical applications are taking place not only in natural language processing but also in other fields. In a joint study with the National Institute for Japanese Language and Linguistics, for example, we are focusing on voice, an important element of language, to lay a new foundation for language models. We are also working on applications of statistical methods developed for language processing, such as the hierarchical Bayesian language model and the hidden Markov model. Such applications include music information processing, acoustic models, and time series analysis in robotics.

Logic alone cannot express the meaning of the word "tatazumai" ("atmosphere")

I believe that, in joint studies with linguists and social scientists, it is meaningless simply to ask a statistician to analyze data without understanding statistics oneself. This is because the analysis methods are not supplemental; they form a basis for new knowledge. Ideally, one should understand both fields, and joint studies should be the first step toward this goal. I say this because I originally majored in the arts and switched to science later. As a graduate student, I strongly believed that logic alone could not represent the meaning of a word such as "tatazumai" ("atmosphere"). It is impossible to obtain the meaning simply by studying the definition of the word itself. Rather, the meaning of a word is shaped by its relationship with other words and by the way the word is observed in use. In this regard, statistical and probabilistic methods enable us to represent the meaning of words such as "tatazumai."

Encounter with project director Yusuke Miyao

In a project at the Research Commons analyzing meta-knowledge structures, I wanted to share my probabilistic methods and knowledge, so I began working with Associate Professor Yusuke Miyao (National Institute of Informatics). The phrase "tokoro-de" ("by the way") has the power to change the topic of a conversation; I recently proposed measuring this power by comparing the distributions of words before and after the phrase.
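One way to picture this comparison is the sketch below. It is an illustration under stated assumptions, not the method proposed in the project: the article does not say which measure is used, so the Jensen-Shannon divergence, the fixed window size, the smoothing constant, and all function names here are hypothetical choices, and the marker is treated as a single token.

```python
import math
from collections import Counter

def word_distribution(tokens, vocab, alpha=0.5):
    """Smoothed relative frequency of each vocabulary word in `tokens`."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical distributions)."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}

    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def topic_shift(tokens, marker, window=4):
    """Average divergence between the words before and after `marker`.

    `window` (an assumed parameter) is the number of tokens inspected on
    each side of every occurrence of the marker.
    """
    scores = []
    for i, token in enumerate(tokens):
        if token != marker:
            continue
        before = tokens[max(0, i - window):i]
        after = tokens[i + 1:i + 1 + window]
        if not before or not after:
            continue
        vocab = set(before) | set(after)
        p = word_distribution(before, vocab)
        q = word_distribution(after, vocab)
        scores.append(js_divergence(p, q))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical token sequences, invented for this illustration only.
changed = "rain wind rain cloud tokoro-de lunch noodle lunch soup".split()
unchanged = "rain wind rain cloud tokoro-de rain wind rain cloud".split()
print(topic_shift(changed, "tokoro-de"), topic_shift(unchanged, "tokoro-de"))
```

A larger score means the words after the marker differ more from the words before it. Real conversational data would need proper tokenization and far more occurrences of the marker than this toy example provides.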

Associate Professor Miyao takes an innovative approach to analyzing syntax, an essential part of the logic of words. Instead of simple logical expressions, he builds a type of meta-logic system that expresses logical relationships as inclusion relationships between a set and its subsets. From a statistical point of view, such set-theoretic structure is difficult to express, and that makes it very intriguing. In our project, we would like to study whether human language use can be modeled entirely by statistics, or whether human intervention is needed somewhere.

Daichi Mochihashi serves as the organizer of the second volume of the "Iwanami Data Science" series, a collection of articles on natural language processing to be published in February 2016. The authors also include Associate Professor Miyao. An increasing number of researchers and engineers, not only in universities but also in companies, are working on natural language processing for use as social infrastructure. "We want to introduce our research frontier of natural language processing to interested readers as accessibly as possible."

(Text in Japanese: Daichi Mochihashi, Rue Ikeya. Photographs: ERIC. Published: January 12, 2016)