Robot typing illustration.

(iStock.com/filo)

Data mind

Computer scientists Heather Zheng, Ben Zhao, and Blase Ur mine data to study behavior and expose security flaws.

Our technology-dominated society generates a massive amount of data about us, collected in a variety of contexts, with and without our informed consent. Social media, security footage, genomic databases: these data streams and our increasing awareness of them are contributing to a growing anxiety over privacy and the need for security. But this information also helps researchers understand how humans interact with technology, build better systems to serve that interaction, and ultimately protect us—both in cyberspace and in the real world.

Mining data to model and predict human behavior can affect our lives on individual and institutional levels. This type of modeling is one focus of the Systems, Algorithms, Networks, and Data (SAND) Laboratory, jointly run by Ben Zhao and Heather Zheng, a married team who joined UChicago in 2017 as Neubauer Professors of Computer Science.

On a small scale, models can help predict how users will behave on apps like Whisper, an anonymous social network, or penny auction sites like DealDash—predictions that can help developers create software better suited to how real people use it.

Larger-scale projects include Zheng and Zhao’s collaboration with UChicago Medicine, analyzing data from vital sensors to study the behavior of patients and caregivers, as well as collecting information about their environment, such as temperature and noise levels. The team aims to improve both the modes of information capture and the interpretation of that data, the combination of which can “improve the health care system,” says Zheng, “making it more efficient and predictable.”

Scaling up even further, SAND Lab is working with the Array of Things project, capturing information around the city of Chicago. “A hospital is like a little city,” says Zheng, and much of what they learn at UChicago Medicine can be applied to a smart city, such as how to analyze anomalous events or the movement of populations. Their goal is to develop sensors that capture information that can be analyzed and then used to influence further action, such as traffic moderation, in real time.

Data-driven analysis can be performed in a number of fields, notes Zhao. For instance, one day he’d like to use data “to reverse engineer the legal profession,” he says, “trying to understand how predictable people are in the courtroom and whether we can produce a model for how court cases will go.”

Advances in data science, a field that has become a focus of the Department of Computer Science, don’t happen in a vacuum. Researchers discover patterns and glean insights from data, but such mining puts scientists in a precarious position. What if the population who provided that data doesn’t want their information used in certain ways? What if they weren’t even aware it was being collected?

To guard against these dilemmas, data scientists also scrutinize how to use data sets ethically and safely. “We’re looking at how to access sensitive material, like large databases of genomic or body sensor data, in a secure way that guarantees patient privacy while still allowing researchers to gain useful information,” says Zhao.

“When we do our work, we use anonymization,” adds Zheng. The models they build don’t reflect identifiable information. The data from which they glean patterns never leave their servers, and they never share it—“unlike Facebook,” she says, which had information on some 50 million users provided to voter-profiling firm Cambridge Analytica by a third party.

Facebook’s data scandal epitomizes the dangers of collecting data on a massive scale and why it’s so important to safeguard that information, says Zhao. While the scandal wasn’t a breach in the sense that a malicious party broke in, Facebook had a design weakness that was exploited. “They should have had better protection mechanisms to verify and validate when data was being shared. I think their model for allowing that much data to be collected was problematic in the first place, so what happened is not terribly surprising.”

Data breaches are one area of focus for another researcher, Blase Ur, a Neubauer Family Assistant Professor in Computer Science who also joined UChicago in 2017. Ur takes a human-centered approach to studying computer security and privacy, including analyzing how vulnerable an individual’s other accounts are after one account has been compromised. “Lots of people have a major coping mechanism of reusing the same or similar passwords,” says Ur—passwords and other means of authentication being one focus area of his SUPERgroup (Security, Usability, & Privacy Education & Research) collective at UChicago.

So, what is a better password, how is its strength measured, and what will real people be able to remember?

Ultimately a strong password is one that is unpredictable to attackers. “Hackers are basically data scientists,” says Ur. They look at credentials that have been leaked online, usually by other hackers, and build statistical models for typical passwords. Posting the spoils of their attacks in forums is a point of pride for hackers, and the posts are useful to computer scientists like Ur. (He doesn’t interact with the hackers; he just studies their tactics.) By evaluating how obvious a particular password would be, based on those models, he can use the hackers’ posts to understand how vulnerable the average person is.
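To make the idea of a statistical password model concrete, here is a minimal sketch of one simple kind: a character-pair (bigram) model that scores how predictable a password is relative to a list of known passwords. It is an illustration only, not the model Ur or any attacker actually uses. The ten-entry `SAMPLE_LEAK` list, the function names, and the `alphabet_size` smoothing value are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

START, END = "\x02", "\x03"  # markers for the start and end of a password


def train_bigram_model(passwords):
    """Count how often each character follows another across a password list."""
    counts = defaultdict(Counter)
    for pw in passwords:
        chars = [START] + list(pw) + [END]
        for prev, cur in zip(chars, chars[1:]):
            counts[prev][cur] += 1
    return counts


def log2_probability(model, password, alphabet_size=96):
    """Return log2 of the probability the model assigns to a password.

    Add-one smoothing keeps unseen character pairs from scoring zero.
    alphabet_size is an assumed rough count of possible next symbols.
    """
    chars = [START] + list(password) + [END]
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        seen = model[prev]
        numerator = seen[cur] + 1
        denominator = sum(seen.values()) + alphabet_size
        total += math.log2(numerator / denominator)
    return total


# Hypothetical stand-in for a leaked-password corpus (real ones hold millions).
SAMPLE_LEAK = ["password", "password1", "123456", "qwerty", "letmein",
               "passw0rd", "iloveyou", "dragon", "monkey", "123456789"]

model = train_bigram_model(SAMPLE_LEAK)

# Scores closer to zero mean "more like the leaked passwords", so easier to guess.
predictable = log2_probability(model, "password123")
unpredictable = log2_probability(model, "xK9#vQ2!mZt")
print(f"password123: {predictable:.1f}   xK9#vQ2!mZt: {unpredictable:.1f}")
```

Both test passwords are 11 characters long, yet the first scores far closer to zero because every step in it follows a pattern the model has already seen. Research-grade models are much more sophisticated, but the principle is the same: a password is weak when a model trained on other people's passwords finds it unsurprising.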

Yet knowing a password’s strength isn’t enough; users need to know why and how to strengthen it while maintaining manageability. So, in 2016 Ur and a Carnegie Mellon team built a meter that tells users how prevalent a certain typographic substitution is and offers suggestions to avoid vulnerabilities. The meter’s artificial neural network—a brain-inspired system that mimics how humans process information—learned by scanning millions of passwords and identified trends attackers might exploit.
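The kind of feedback described above can be sketched in a few lines. This is not the meter's code, which relies on a trained neural network; it is a much simpler rule-based illustration. The `SUBSTITUTIONS` table, the `COMMON_WORDS` list, and the wording of the warnings are all hypothetical.

```python
# Assumed table of look-alike substitutions; not taken from the actual meter.
SUBSTITUTIONS = {"0": "o", "1": "l", "3": "e", "4": "a", "5": "s",
                 "7": "t", "@": "a", "$": "s", "!": "i"}

# Hypothetical, tiny list of common base words.
COMMON_WORDS = {"password", "letmein", "monkey", "dragon", "iloveyou"}


def undo_substitutions(password):
    """Replace look-alike symbols with the letters they usually stand for."""
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in password.lower())


def feedback(password):
    """Return plain-language warnings for a candidate password."""
    warnings = []
    swapped = sorted({ch for ch in password if ch in SUBSTITUTIONS})
    plain = undo_substitutions(password)
    if plain in COMMON_WORDS:
        if swapped:
            warnings.append(
                f"Swapping in {', '.join(swapped)} is a well-known trick; "
                f"attackers will still try '{plain}'."
            )
        else:
            warnings.append(f"'{plain}' is a very common password.")
    return warnings


print(feedback("P@ssw0rd"))
print(feedback("xK9#vQ2!mZt"))
```

The first call returns a warning because undoing the swaps reveals a common word; the second returns an empty list. A learned model can go further than a fixed table like this one, since it picks up which substitutions are prevalent from the passwords it is trained on.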

Zheng and Zhao also use artificial neural networks to expose vulnerabilities. In 2017 they trained a network on thousands of Yelp restaurant reviews, and it was then able to write fake reviews that were indistinguishable from real ones. The reviews were rated not just believable but useful, demonstrating that such technology could be used maliciously to influence human opinion.

As far as we know, attackers are not yet using AI-powered technology to create fake reviews. Bad actors still largely rely on on-demand crowdturfing systems, in which large pools of human workers are paid to complete malicious tasks. But Zheng and Zhao believe that the threat is real and imminent for companies like Yelp and Amazon, and so they are using what they learned from creating fake reviews to develop countermeasure algorithms to detect them.

Using AI-powered detection to fight AI-powered generation is crucial, Zhao says, because undermining commerce is just the beginning. Artificially created content can shake society’s confidence in what is and isn’t real. In Zheng and Zhao’s 2017 paper describing their work on reviews, they note that AI can help detect fake news—a problem that has skyrocketed, particularly since the 2016 election cycle. This defense is especially important because AI could one day generate convincing fake news too.

Zheng and Zhao have a long history of alerting companies to security weaknesses. In 2016 they received media attention when they discovered a security flaw in the crowdsourced navigation app Waze that allowed fake “ghost riders” to report false accidents, reroute traffic, and secretly track users’ locations. They’ve also identified security flaws at LinkedIn and in the live video streaming apps Periscope and Meerkat. They even notified Facebook of security concerns in 2009, but privacy was less of a public concern then, says Zhao.

The notion of privacy has since evolved. When companies realized that privacy norms were shifting, they closed off a lot of access, but by then data had already been collected—by Cambridge Analytica, for instance. “You can’t unopen a box,” Zhao notes.

When news of the Facebook scandal broke widely in March 2018, the public outcry brought to the fore issues of online tracking, profiling (whether for advertising or voter influencing), knowledge, and consent.

Ur thinks transparency could help ease public concerns. “Everyone using the internet shouldn’t have to be an expert in data science to know about these things,” he says, so he’s building a privacy tool—an open-source browser extension that, in essence, tracks the trackers. It would tell you if a site you’re visiting is collecting your data. According to a research survey Ur and collaborators published in April, there is a widespread belief that using a browser’s incognito or private mode will keep sites from gathering your data—it won’t. It keeps information from being stored locally, so your roommate or spouse can’t see what sites you’ve visited. Your internet provider and government agencies, however, can.

One of Ur’s SUPERgroup research questions is, if all this information is available, how will it change public attitudes about our online lives? The European Union—where personal privacy is protected more staunchly than in the United States—may provide a hint.

In May 2018 the EU General Data Protection Regulation went into effect, replacing a directive drafted in 1995, when the world was far less digitized. The scope of its protections, which in practice extend beyond Europe, includes consent, notification of compromise, and the “right to be forgotten” by removing previously collected data. The law also addresses data portability, enabling users to easily “move 14 years of social media history to another platform, making meaningful competition possible,” notes Ur.

It’s hard to predict how such legally enforced protections might change privacy expectations—or anxieties—in the United States, should we adopt them beyond what’s required to continue business relations with Europe. American and European citizens have different ideas about which data sharing practices are acceptable. (In June California signed into law the similar but much more limited California Consumer Privacy Act, but it won’t take effect until January 2020.)

But increased transparency of those activities—whether provided by government enforcement or the types of technology SAND Lab and the SUPERgroup are developing—can only strengthen the “informed” part of informed consent.