Protecting Private Information in Data
In today’s society, data is one of the most valuable commodities, with the potential to transform the way we live, conduct research and do business. But as the public’s enthusiasm about achieving Big Data’s potential grows, so does people’s awareness of the importance of protecting personal information within available data. Privacy protection is a top-priority issue of the Big Data era, and many researchers are working harder than ever to develop new technologies that enable us to use data without compromising our security. Academic and governmental organizations are also partnering to ensure privacy protection of microdata in order to promote further uses of official statistics.
Ask the Expert: Prof. Kazuhiro Minami (Joint Support-Center for Data Science Research, The Institute of Statistical Mathematics)
Prof. Minami specializes in privacy protection technology research with a focus on anonymization technology for data usage. He received his PhD in computer science from Dartmouth College and served as a lecturer at the University of Illinois and as an associate professor at the National Institute of Informatics before being appointed to his current position in 2020.
Ask the Expert: Dr. Isao Takabe (Professor, the Faculty of Data Science, Rissho University)
At the time of the interview for this article, Dr. Takabe served as the director of the Statistical Data Utilization Center, which is operated by the Ministry of Internal Affairs and Communications. During his 19-year tenure at the Institute, Dr. Takabe used his background in the research of big data aggregation technology to make more official statistical microdata available through the Institute. He also worked to promote data science through collaborations with local governments and regional universities, and to support evidence-based policymaking (EBPM). A graduate of Waseda University’s Faculty of Science and Engineering, Dr. Takabe received his PhD from The Graduate University for Advanced Studies, commonly known as SOKENDAI. He was appointed to his current position at Rissho University in the spring of 2021.
Protecting personal information contained in data
We live in the digital era, and that means almost all of our daily activities -- ranging from how we use our phones to what we purchase online -- are tracked by service providers to amass Big Data. In Japan, the 2017 amendment of the Personal Information Protection Law made it legal for companies to share this type of Big Data with third parties so long as it’s anonymized, paving the way for commercial reuse of data.
In order to help protect personal information contained in such shared data, Prof. Kazuhiro Minami at the Joint Support-Center for Data Science Research at The Institute of Statistical Mathematics has been working to develop anonymization technologies through a statistical science approach.
“You might think that anonymization is simply a matter of removing personal names and IDs. In reality, however, that’s far from adequate. For example, by combining gender, address and other types of information, you can identify the person to whom the information belongs, or, in some cases, figure out their attributes,” Prof. Minami said.
And that has already been happening, according to Prof. Minami.
“An example is what happened in 1997 in Massachusetts in the United States. Someone successfully obtained then-Gov. William Weld’s medical records, including his diagnosis, by combining the aggregated medical data available from the state with other public data,” Prof. Minami said. “Open data is accessible by the public, which includes those who have information that can be combined with the open data to generate new information, as well as those who try to misuse the data with malicious intent,” he said. “The purpose of anonymization is to process data to keep anyone looking at it from discerning personal confidential information.”
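The kind of linkage Prof. Minami describes can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the records, the field names (`sex`, `postal_code`, `birth_year`) and the helper `link` are invented for illustration and are not taken from the Massachusetts case or from any real dataset. It only shows why removing names is not enough when the remaining attributes are unique in combination.

```python
# Hypothetical toy data: all names, fields and values are invented.
# "released" has names removed but still carries quasi-identifiers.
released = [
    {"sex": "M", "postal_code": "100-0001", "birth_year": 1950, "diagnosis": "condition A"},
    {"sex": "F", "postal_code": "100-0001", "birth_year": 1962, "diagnosis": "condition B"},
    {"sex": "M", "postal_code": "100-0002", "birth_year": 1950, "diagnosis": "condition C"},
]
# "public" is a separate, openly available list that does include names.
public = [
    {"name": "Person X", "sex": "M", "postal_code": "100-0001", "birth_year": 1950},
    {"name": "Person Y", "sex": "F", "postal_code": "100-0002", "birth_year": 1971},
]

# Assumed quasi-identifiers: attributes that are harmless alone
# but can single out a person when combined.
QUASI_IDENTIFIERS = ("sex", "postal_code", "birth_year")


def key(record):
    """Combine the quasi-identifier values of a record into one lookup key."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(released_rows, named_rows):
    """Return (name, diagnosis) pairs where the combined key matches exactly one released row."""
    by_key = {}
    for row in released_rows:
        by_key.setdefault(key(row), []).append(row)
    matches = []
    for person in named_rows:
        candidates = by_key.get(key(person), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((person["name"], candidates[0]["diagnosis"]))
    return matches


reidentified = link(released, public)
print(reidentified)
```

In this toy case only "Person X" is re-identified, because that combination of attributes appears exactly once in the released data. Anonymization methods aim to remove that uniqueness, for example by coarsening or suppressing such attributes.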
Making it easier to link ‘official statistical microdata’ with other available data
Japan is known for the quality of its social survey data. The Statistics Bureau of Japan, for example, owns abundant microdata (individual data) from the national census and other official statistics. At the Statistical Data Utilization Center in Wakayama, which is operated by the Ministry of Internal Affairs and Communications, the Center’s Director Isao Takabe is conducting research on how to effectively promote the use of such official statistical microdata. He is also collaborating with various local governments and academic institutions on data science projects.
“To enable researchers to take full advantage of official statistical microdata, we are creating the ‘Onsite Facility’ in different locations. ‘Onsite Facility’ is a data-secure space that researchers can visit to safely explore the data and conduct exploratory analysis to create new value. There are currently 12 locations, and we intend to expand this nationwide.”
Dr. Takabe is particularly involved in the Center’s and the Institute of Statistical Mathematics’ joint research project to develop technologies that can show connections among different kinds of data, such as data-matching and record linkage technologies.
Record linkage refers to the task of going through multiple databases to identify any pieces of information related to a particular individual or company — typically by using the individual or company’s name, address, etc. — and putting them all together.
“There is a movement to make municipal administrative records as well as corporate point-of-sale data publicly accessible. If you are trying to use that data along with official statistical microdata together, you’ll need to figure out which pieces of data should be linked up and where to start. I am working to develop the technology for that record linkage process,” Dr. Takabe said.
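As a rough illustration of what record linkage involves, here is a minimal Python sketch that matches records from two sources by comparing names and addresses approximately rather than exactly. It is not the technology under development at the Center, which the article does not describe. The records, the `normalize`, `similarity` and `link_records` helpers, and the 0.85 threshold are all assumptions made for this example; only `difflib.SequenceMatcher` comes from the Python standard library.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def similarity(a, b):
    """String similarity between 0.0 and 1.0 after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """For each left record, link the best-scoring right record if it clears the threshold."""
    links = []
    for l_rec in left:
        best, best_score = None, 0.0
        for r_rec in right:
            # Average of name and address similarity (an arbitrary choice for this sketch).
            score = (similarity(l_rec["name"], r_rec["name"])
                     + similarity(l_rec["address"], r_rec["address"])) / 2
            if score > best_score:
                best, best_score = r_rec, score
        if best is not None and best_score >= threshold:
            links.append((l_rec["id"], best["id"]))
    return links


# Hypothetical records: the companies and addresses are invented.
survey = [
    {"id": "S1", "name": "Sakura Bakery Co., Ltd.", "address": "1-2-3 Midori-cho, Tachikawa"},
    {"id": "S2", "name": "Aoba Hardware", "address": "4-5 Minami, Wakayama"},
]
pos_data = [
    {"id": "P9", "name": "sakura bakery co ltd", "address": "1-2-3 Midori-cho Tachikawa"},
    {"id": "P7", "name": "Kita Bookstore", "address": "8-1 Kita, Sapporo"},
]

links = link_records(survey, pos_data)
print(links)
```

Comparing every pair of records, as this sketch does, becomes impractical for large databases. Practical systems typically narrow the candidate pairs first and weigh each field by how informative it is.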
When various pieces of data are put together in a way that makes sense, it creates new value. And that is the great thing about big data, Dr. Takabe said.
“Researchers are analyzing available data through various lenses, including that of economics, urban engineering, corporate marketing and governmental policy making. I believe these types of research will continue to grow in number.”
About Onsite Facility at the Joint Support-Center for Data Science Research
The Onsite Facility located at the Joint Support-Center for Data Science Research in Tachikawa, Tokyo, is a highly secure room that meets all the security standards set by the National Statistics Center. Researchers can use the terminals inside the room to directly access and analyze the questionnaire information that is stored on the National Statistics Center’s central data management servers. Because the data access is provided through a thin client system, it is not possible to download any data onto the terminals, according to Dr. Motoi Okamoto, the chief research administrator at the Research Organization of Information and Systems. In addition, the room is equipped with multiple surveillance cameras to prevent Onsite Facility users from taking photos of what’s displayed on the terminal screens. “The number of users has been slowly but surely increasing, and we hope to help further expand the use and research of Japanese official microdata by collaborating with members of the Consortium for Research on Official Statistical Microdata.”
Creating security standards for handling of data analysis results
Once researchers visit an Onsite Facility and analyze data, they can use the results for their studies. However, they are not allowed to physically take any results out of the facilities unless they have cleared the National Statistics Center’s screening beforehand. Prof. Minami has put serious effort into developing safety standards for what kind of anonymization should be performed on the analyses before they are taken outside of the facilities.
“You may think personal information won’t leak because research papers generally publish analyses in a highly aggregated format. But tabular data, such as that of tally sheets, is surprisingly tricky. For example, tally sheets have totals for both rows and columns. Let’s say you hide one of the cells. People can still easily recover the cell’s value by subtracting the other cells from the total,” Prof. Minami said. “So, you need to identify which additional cells to hide in order to keep the data anonymous. This is what’s called ‘secondary confidentiality,’ and humans cannot solve this problem by themselves when tally sheets are complex. By using the R programming language, we developed a tool that can identify the range of data to hide with minimal processing, and can determine an ‘anonymization interval’ that ensures adequate levels of uncertainty for the hidden cell values. In addition, the tool generates support documents that you can use to evaluate the safety of the intervals.”
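The subtraction problem Prof. Minami describes can be shown with a small Python sketch. This is not the R tool mentioned in the interview, whose algorithm is not described here. The table values and the `residuals` and `intervals` helpers are invented for illustration, and the bounds they compute are simple ones that are valid but not guaranteed to be the tightest possible for larger suppression patterns.

```python
def residuals(table, hidden):
    """What an observer can compute: published totals minus the visible cells."""
    n_rows, n_cols = len(table), len(table[0])
    row_total = [sum(row) for row in table]  # published row totals
    col_total = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    row_res = [row_total[i] - sum(table[i][j] for j in range(n_cols) if (i, j) not in hidden)
               for i in range(n_rows)]
    col_res = [col_total[j] - sum(table[i][j] for i in range(n_rows) if (i, j) not in hidden)
               for j in range(n_cols)]
    return row_res, col_res


def intervals(table, hidden):
    """Simple (lower, upper) bounds on each hidden cell, assuming non-negative counts."""
    row_res, col_res = residuals(table, hidden)
    bounds = {}
    for (i, j) in hidden:
        other_cols = [c for (r, c) in hidden if r == i and c != j]
        other_rows = [r for (r, c) in hidden if c == j and r != i]
        # A hidden cell cannot exceed the hidden mass of its row or its column.
        upper = min(row_res[i], col_res[j])
        # It must absorb whatever the other hidden cells in its row or column cannot.
        lower = max(0,
                    row_res[i] - sum(col_res[c] for c in other_cols),
                    col_res[j] - sum(row_res[r] for r in other_rows))
        bounds[(i, j)] = (lower, upper)
    return bounds


# Hypothetical tally sheet; the sensitive cell is row 1, column 0 (value 8).
counts = [
    [20, 50, 10],
    [8, 19, 22],
    [17, 32, 12],
]

# Hiding only the sensitive cell: the totals give its value away exactly.
only_primary = intervals(counts, {(1, 0)})

# Also hiding three more cells so that each affected row and column has two hidden cells.
block = {(1, 0), (1, 1), (2, 0), (2, 1)}
with_secondary = intervals(counts, block)

print(only_primary[(1, 0)], with_secondary[(1, 0)])
```

With only the sensitive cell hidden, its interval collapses to a single value. Once the three additional cells are hidden, an observer can only narrow the sensitive cell down to a range. Choosing which extra cells to hide while losing as little information as possible is the hard part in complex tables.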
With the cooperation of several Onsite Facilities, this secondary confidentiality automation tool is currently being evaluated as a potential key component of safety standards.
“We are hoping to use this as a basic tool to ensure data privacy and to process confidential information to make it safe for researchers to take with them.”
Making data research easier through Consortium of data users
At the Joint Support-Center for Data Science Research, Prof. Minami also leads the operation of the Consortium for Research on Official Statistical Microdata, which aims to bring together industry, academia and government to promote the use of official statistical microdata for academic research. With a membership of more than 73 currently, the Consortium organizes educational programs and advocacy activities for its members while gathering their feedback and requests. The number of members from econometrics and sociology research fields has been on the rise lately, along with engineers, including those specializing in urban design and energy-related fields.
“Until recently, researchers who analyzed official statistical microdata were analyzing just that, for the most part. But now that private companies are increasingly making all sorts of data publicly available, a phenomenon that was once hard to imagine, we can expect to create tremendous value from linking that with official statistical data,” Dr. Takabe said. “Official statistics, such as the Economic Census, are more limited in survey scope than corporate data. But they are known for overwhelmingly high ‘coverage rates’ of target populations, because they are mostly surveys of all respondents. Corporate data, meanwhile, contains much more detailed information. So, linking the two should significantly boost the data’s usability.”
And that makes it all the more important to ensure the protection of privacy, Prof. Minami said.
“The more data you link together, the more valuable the data becomes. But the risk of personal information being deduced from the data also increases, making it more difficult to anonymize data,” Prof. Minami said. “It’s not easy to find a balance between Dr. Takabe’s research and my research. That’s why it’s important for us to collaborate to expand data use.”
Interviewer: Rue Ikeya
Photographs: Yuji Iijima (column)
Released on: Nov. 10, 2021 (The Japanese version was released on Sept. 10, 2020)
* This interview was conducted online.