Toward Open Web Data

Since well before the invention of the web, Professor Hideaki Takeda (National Institute of Informatics) has focused on human knowledge and on how machines might exploit its abundance. The web of the early 1990s, though not yet widely used, made a strong impression on Professor Takeda and convinced him that future knowledge would accumulate there. He quickly shifted his research from artificial intelligence (AI) to knowledge acquisition using the web. His ideas resonated with a lecture delivered around six years ago by Professor Noriko Arai (National Institute of Informatics), entitled "Let us advance sharing of academic information." Professor Takeda then proposed building a foundation for circulating academic information by means of an extension of the web known as the semantic web. This led him to join the e-science Basic Technology project at Research Commons.

Artificial Intelligence: The Wait for Machine-readable Big Data

Approaching human knowledge through engineering

Today, the AI field is in its third boom. For AI research to develop further, however, machines must eventually become able to acquire human knowledge. What is knowledge in the first place? Although we have representative sources of human knowledge such as the Encyclopedia Britannica, knowledge also includes everyday information such as restaurant locations and train schedules. Back when there were no means of gathering such information, no one imagined it would be possible to know every detail from even the far corners of the world. IBM's Watson, an AI system that has recently attracted broad interest, is one realization of the long-held dream of an enormous knowledge base. Today, I consider the trivial information of everyday life even more important than some of our deepest philosophical knowledge. My research takes an engineering approach to this everyday knowledge, including the information disseminated over blogs and social networks.

DBpedia: a breakthrough

How should knowledge on the web be arranged so that both machines and humans can use it? Tim Berners-Lee, the inventor of the web, proposed the "semantic web" to turn the information on the Internet into a reusable semantic network. This technology uses as its standard the Resource Description Framework (RDF), a structured language in which every statement is composed of a subject, a predicate, and an object. Data written in this language and published openly are called linked open data (LOD). As the name suggests, LOD are "openly" placed on the Internet and can be "linked" to other data. For example, given the sentence "the event taking place in Tokyo on the 10th of next month," machines can extract information such as "when," "where," and "what," and smoothly link it to other databases related to "events next month." DBpedia, an RDF dataset extracted from Wikipedia, was one of the innovations that drove the web toward this new method of description. Because RDF data can be understood by machines, knowledge can be disseminated without human intervention. For example, a company can incorporate the knowledge on Wikipedia into its own database with little effort. It is said that almost the entirety of human knowledge is contained on Wikipedia. Just as Wikipedia provides us with a rich source of human knowledge, DBpedia provides machines with a rich source of machine-readable human knowledge.
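The subject, predicate, and object structure described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not an RDF toolkit or DBpedia's actual data model; every URI under example.org, the event name, and the date are made-up placeholders. It shows how two separately published sets of triples become "linked" simply by sharing an identifier.

```python
# Minimal sketch of RDF-style triples using only the standard library.
# All URIs under example.org and all values are hypothetical placeholders.
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


EX = "http://example.org/"

# Dataset A: an event described as "what", "where", and "when".
events = [
    Triple(EX + "event/1", EX + "name", "Open Data Meetup"),
    Triple(EX + "event/1", EX + "place", EX + "place/Tokyo"),
    Triple(EX + "event/1", EX + "date", "2016-01-10"),
]

# Dataset B: facts about places, published independently of dataset A.
places = [
    Triple(EX + "place/Tokyo", EX + "country", "Japan"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def country_of_event(event_uri: str) -> Optional[str]:
    """Follow the shared place URI from dataset A into dataset B."""
    place_links = match(events, subject=event_uri, predicate=EX + "place")
    if not place_links:
        return None
    found = match(places, subject=place_links[0].obj, predicate=EX + "country")
    return found[0].obj if found else None
```

Because the place is referred to by the same URI in both lists, `country_of_event(EX + "event/1")` can cross from one dataset to the other without any human deciding how the two should be joined.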

The LODAC project links data

Nonetheless, LOD is said to constitute less than 1% of all data on the web. To fulfill our goal of sharing academic information, we must first prepare a mechanism through which our knowledge can be shared. To this end, we have been operating DBpedia Japanese, a Japanese version of DBpedia, as part of this project. For example, we expanded the ontology by adding sumo to the sport categories. In another project, we established entity linking, which connects the strings entered as search inputs to the corresponding items in Wikipedia for added convenience. We also created an LOD database by gathering information on the collections of over 100 art galleries and museums. In addition, we commissioned specialists to collate works by the same artists, and we now operate a site called LODAC Museum, which offers consolidated searches. In the field of biology, we integrated databases that were previously scattered owing to differences in subfields such as molecular biology and physiology, and in categories such as plants, animals, and microbes. The main challenge concerned species names, as we came to understand that taxonomic names often change with new scientific discoveries. We therefore devised an ontology that accommodates such changes in knowledge, and opened the LODAC Species website to the public, where the transitions in scientific names can be tracked. Furthermore, we tackled similar problems affecting Japanese biological names by connecting them to information from around the world, and as a result the LODAC Species website now hosts the largest database of Japanese biological names.
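The idea of a data model that accommodates changing scientific names can be illustrated with a small sketch. This is not the LODAC Species ontology; the class names, the placeholder taxon names, and the years are all hypothetical, and real taxonomy also involves splits and merges that this simple chain ignores. It only shows how recording each renaming as its own linked record lets an old name be traced to its current one.

```python
# Hypothetical sketch: scientific-name changes stored as linked records.
# The names and years used with it are invented placeholders, not real taxonomy.
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class NameChange:
    old_name: str
    new_name: str
    year: int


class NameHistory:
    def __init__(self) -> None:
        # Maps an older name to the record describing its replacement.
        self._successor: Dict[str, NameChange] = {}

    def record(self, change: NameChange) -> None:
        self._successor[change.old_name] = change

    def lineage(self, name: str) -> List[str]:
        """Return the chain of names from `name` to the latest known one."""
        chain = [name]
        seen = {name}
        while chain[-1] in self._successor:
            nxt = self._successor[chain[-1]].new_name
            if nxt in seen:  # guard against accidental cycles in the data
                break
            seen.add(nxt)
            chain.append(nxt)
        return chain

    def current_name(self, name: str) -> str:
        """Return the most recent name reachable from `name`."""
        return self.lineage(name)[-1]


history = NameHistory()
history.record(NameChange("Genus-a species-x", "Genus-b species-x", 1990))
history.record(NameChange("Genus-b species-x", "Genus-c species-x", 2010))
```

A search for the oldest placeholder name, `history.current_name("Genus-a species-x")`, then leads to the latest one, while `history.lineage(...)` keeps the intermediate names visible instead of overwriting them.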

The future of openly linked scientific data

Scientists generate research data and then publish papers to record their research. Often, however, these data remain unshared. This is a problem not only in biology but in science generally, and academic "culture" has now begun to shift in response. The importance of using LOD as a standard for publicly sharing research data for future use is now widely recognized, and efforts are ongoing to make research data increasingly open. One such effort is a project at Research Commons and DBCLS. There is also an emerging preference for DOIs over URIs (the latter of which change with the data’s location) as a way to identify papers and researchers on the web. For many papers, moreover, it might suffice to present the data as structured information rather than communicating them in sentence form. In truth, no one can possibly read all of the papers that are published, so it is much more fruitful to search specifically for the papers relevant to one’s research. With LOD, anyone can find such relevant information, and this benefits everyone involved.
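Why a persistent identifier survives a change of location can be shown with a toy resolver. This is a sketch of the general indirection idea only, not the actual DOI infrastructure; the `Resolver` class, the DOI-shaped string, and the example.org addresses are hypothetical.

```python
# Hypothetical sketch of identifier indirection; not the real DOI system.
# The identifier and the URLs below are made-up placeholders.
from typing import Dict


class Resolver:
    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, identifier: str, url: str) -> None:
        """Set or update the current location for a persistent identifier."""
        self._locations[identifier] = url

    def resolve(self, identifier: str) -> str:
        """Return the current location; raises KeyError if unregistered."""
        return self._locations[identifier]


resolver = Resolver()
paper_id = "10.99999/example.dataset.1"  # DOI-shaped placeholder

resolver.register(paper_id, "http://example.org/old/dataset1")
first_location = resolver.resolve(paper_id)

# The data move to a new server; only the resolver entry is updated.
resolver.register(paper_id, "http://example.org/new/dataset1")
second_location = resolver.resolve(paper_id)
```

Anyone who cited `paper_id` still reaches the data after the move, whereas a citation of the old address itself would have broken.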

DBCLS: Towards Data Integration

"As I study the data available on social media, I am increasingly fascinated by how knowledge emerges. Such data surely contain knowledge that will persist in later generations. When analyzing such knowledge, I believe that it is more important to look at the knowledge, information, and data in hand than to concern ourselves with grand theories." In addition to his research, Professor Takeda serves as Chairman of the DOI registration organization Japan Link Center (JaLC), and as Director of ORCID, a project for the integrated management of author IDs in academic papers.

(Text in Japanese: Hideaki Takeda, Rue Ikeya. Photographs: ERIC. Published: December 10, 2015)