Data mining for meaning
SSD faculty analyze big data to find smart data.

Experimental scientists have long collected and analyzed large data sets, but with the explosion of personal data production—citizens digitizing their lives—social scientists now also regularly rely on data computation. The social nature of these data provides rich veins of information to study larger populations in simultaneous contexts.

In the Social Sciences Division, Kathleen Cagney, AM'90; Todd Schuble; and James Evans use computation differently, but they all rely on what's become known as "big data"—data sets so large they require more robust computing systems than were traditionally used—to show trends across a larger sampling or multiple layers of data.

Health surveys

Cagney, associate professor of sociology and health studies and director of the University's Population Research Center, studies social inequality and how it affects health, with a focus on neighborhood, race, and aging. Health, she says, "is something we bring with us. It's a marker of social inequality that sticks."

Cagney leads the social-science-research side of the Chicago Lakeside Development, a decades-long, multipartner project to develop more than 600 acres of land between 79th Street and the Calumet River, the former site of US Steel's South Works mill. Her team is surveying Lakeside's surrounding neighborhoods to study "what would generally contribute to social glue in neighborhoods," including regional physical and dental health. Questions include how residents rate their own health, where they go for health care, how long it takes to get there, and by what mode. By analyzing data that organizations like the Chicago Transit Authority (CTA) collect for their own purposes, she hopes to omit some logistical questions. "We can focus survey research on what people really think," Cagney says, "rather than retrospective reports of service or transportation use." Cagney will share the findings with developers to inform future decisions, such as whether to build clinics or improve transportation.

Data maps

Schuble, lecturer and manager of Geographic Information Systems and Science (GIS), splits his time between helping the division map its data and doing his own research. Schuble studies Chicago's southeast side, trying to figure out its "health neighborhood" by borrowing data from the South Side Health and Vitality Studies, conducted by the University's Urban Health Initiative, and examining where residents go to receive health care. He looks at how constraints such as access to transportation or insurance might correlate with age or socioeconomic class.

In his GIS support role, Schuble takes data from faculty and maps them using specialized geographic information software. "People think it's Google Maps or the GPS on my phone," he says, "but that's the tip of the iceberg. Where GIS blossoms is in its analysis power." Geographic information systems can overlay layers of information to reveal patterns and trends, show whether variables correlate, and indicate whether a model can be built to make predictions.

Such analysis traces back to the 1854 cholera outbreak in London, when physician John Snow mapped cases, identified the disease epicenter, and concluded that a contaminated water pump was the source. Today's GIS makes possible the same type of data management and analysis but on a grander scale, thanks to larger data sets and greater computing capability. The system can map disease transmission over an entire continent or health-care access in a single neighborhood, correlated with cultural mores, daily activities, income level, racial or ethnic distribution, age, or whatever other contexts the researcher deems relevant.

Idea networks

Evans, associate professor of sociology and a Computation Institute fellow, works in metaknowledge, or knowledge about knowledge. He directs the Knowledge Lab, which seeks to understand "where ideas come from and the institutions that facilitate question asking and methods of discovery," he says, as well as "how it is that people become certain and how they forget."

In early 2013 the John Templeton Foundation awarded Evans and colleagues a $5.2 million grant to create the Metaknowledge Network, a community of researchers from nine institutions, including Stanford, Princeton, and Harvard. They explore sources such as scientific journals, online encyclopedias, social media, news, and patents on a large scale, Evans says, to reveal trends in how ideas are created and received over time, and how those processes shape science.

For example, in studying scientists' careers, Evans has found that major award winners performed risky research during times when "people were prepared to understand how important the new ideas were," as indicated by how often their work was cited by peers. This trend helps define the characteristics of successful scientists, such as Andrew Fire and Craig Mello, the 2006 Nobel laureates in physiology or medicine, who discovered RNA interference, the ability to use RNA to silence genes. The process and the new fundamental understanding of RNA involved several unusual combinations and earned a Nobel only eight years after the research was published.

Big data

The emergence of big data can be attributed to an abundance of available information and advanced computing systems. Evans notes the demand for online materials, which has led to digitized print and audio files and new digital resources. The reduced size and cost of cameras and recorders also put data production in the hands of ordinary citizens, he says, and "suddenly, many more things are available for computation."

There is so much information available, however, that academics now distinguish between big data and "smart data," the latter describing the analytics used to extract relevant information from big data sets. Much of the information collected is "noise," Schuble explains. For instance, social media produces millions of data points every day, but those sets may carry little meaning. Cagney says about Twitter: "The idea that all these data resources are really tapping the gestalt of a community, I'm not so convinced."

When relevant data are extracted, computation allows social scientists to ask questions they couldn't before and "pushes the work in a new way," says Cagney. Schuble finds advanced computation opportunities exciting. "I tell people GIS will often create more questions than answers, because you'll see things occur on the map and you'll be like, how is that happening?" Computation in the social sciences takes over much of the work of establishing what is happening, so researchers can focus on why.