A new major in data science helps undergrads make sense of it all.
Our world is awash in data. The digitization of everyday life means humanity leaves a trail of measurable data in its wake, which might—in the right hands—give us insight into how our economy, our society, even our universe works. Our web browsing histories? Data. Car loan applications? Data. Scientific experiments? More data than scientists know what to do with.
The future belongs to those who can take that data and derive meaning from it—which is why UChicago launched a major in data science this academic year. But how does a field go from being a set of miscellaneous tools for computer programmers to a full-fledged major?
Lies, damned lies, and statistics
Social Sciences 401 has pointed Gothic windows and a peaceful treetop-height view onto the Midway Plaisance’s Winter Garden. But none of the dozen students gathered here for Data Science 26100: Statistical Pitfalls and Misinterpretation of Data was gazing out the window on the November day I visited. Their attention was fixed on the slide that David Biron, assistant senior instructional professor of statistics and director of undergraduate data science, was projecting.
According to the catalog description, Biron’s course provides “tools for thinking critically about data and models that constitute evidence,” and examines “examples of misleading language and graphics.” Which means there’s a lot of looking at other people’s mistakes, immortalized in poorly thought-out studies and articles.
Today’s lesson is all about regression to the mean—the statistical phenomenon that can make natural variation in data look like real change. Take an example just a few miles from the quads. Chicago White Sox catcher Yermín Mercedes began the 2021 season by getting eight hits in eight at-bats, posting an impossibly high 1.000 batting average. Baseball fans recognized this was an outlier; sure enough, over the season, Mercedes’s batting average regressed to the mean, dropping to a pedestrian .271. Outliers in a data set like this tend to revert to a value closer to the mean over time, regardless of whatever effect is being studied; a simplistic analysis misinterprets that as cause and effect. Regression to the mean trips up many researchers.
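Regression to the mean is easy to see in a quick simulation. The sketch below (all numbers invented for illustration) gives 500 hitters the identical .260 true ability, picks out the "hot starts" after eight at-bats, and checks how that group hits over the rest of the season:

```python
import random

random.seed(42)

TRUE_AVG = 0.260   # every simulated hitter has the same underlying ability
N_HITTERS = 500
EARLY_ABS = 8      # a small early-season sample, like Mercedes's 8 at-bats
SEASON_ABS = 250   # rest of the season

# Early-season hits: pure chance around the same true ability
early = [sum(random.random() < TRUE_AVG for _ in range(EARLY_ABS))
         for _ in range(N_HITTERS)]

# Select the "hot starts": hitters batting .500 or better over 8 at-bats
hot = [i for i, hits in enumerate(early) if hits / EARLY_ABS >= 0.500]

# Their rest-of-season hits come from the very same true ability...
rest = [sum(random.random() < TRUE_AVG for _ in range(SEASON_ABS))
        for _ in hot]

# ...so the group's average falls from .500+ back toward .260
hot_rest_avg = sum(rest) / (len(hot) * SEASON_ABS)
print(f"hot starts: {len(hot)}, rest-of-season average: {hot_rest_avg:.3f}")
```

The hot starters were not better hitters; they were luckier for eight at-bats, and the luck does not carry over.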
Biron cues up an example from a 1987 New York Times article touting the ability of a beta-blocker called propranolol to help overanxious students on their SATs. Twenty-five test-takers who scored lower than expected took the SAT again, but this time they were given propranolol. They performed an average of 120 points higher the second time. Sounds good? Not to Biron: “So, what might be wrong with the design of the study?”
A student points out it’s not a random sample and there’s no control group. She’s right on the money: “This is the worst thing you can do” in designing an experiment, says Biron. It’s easy to think of reasons other than anxiety that could have depressed the students’ original scores. Or it might be another example of regression to the mean. The point is, he explains, “with the original study design, you can’t prove anything.”
The next example is another drug study, this time on the bone density of patients taking one of two drugs intended to prevent osteoporosis. Biron explains the data were measured at the start of the study, after 12 months, and after 24 months. People who lost bone density in the first 12 months of treatment seemed to recover it by month 24. But people who gained bone density in the first 12 months ended up losing it by the end.
If you only look at the data after 12 months, he says, you might conclude there’s a class of patients for whom the drugs were extremely effective, and another for whom they were worse than nothing. But the convergence of all the patients’ results by 24 months demonstrates regression to the mean. The outliers at 12 months—both positive and negative—were just chance. If the researchers had focused only on the patients who appeared to be responding to the drugs, Biron explains, they would have missed the regression to the mean of the other patients.
Biron closes by giving the students pointers on how to avoid bad data and flawed conclusions. Avoid preselecting data based on a cutoff; focusing on extreme data points that might regress to the mean later—as in the bone-density study—ends up biasing your conclusions. Randomly allocate subjects to trial and control groups as much as possible. (Remember the SAT study, bereft of a control group.) Take multiple baseline measurements to understand what natural variation exists before your study begins. (You’d want to know what role random chance plays in baseball statistics before declaring Yermín Mercedes the finest hitter ever, wouldn’t you?) Following this advice will minimize spurious effects in data analysis, he says: “Luck does not persist from trial to trial.”
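The danger of preselecting on a cutoff can be demonstrated with another short simulation. In this hypothetical sketch (all parameters invented), students retake a test with no intervention at all; selecting only those who scored below a cutoff the first time still produces a large average "gain" on the retake:

```python
import random

random.seed(0)

N = 10_000
TRUE_MEAN = 500   # average underlying ability on a hypothetical test
NOISE = 80        # day-to-day luck on any single sitting

ability = [random.gauss(TRUE_MEAN, 60) for _ in range(N)]
first = [a + random.gauss(0, NOISE) for a in ability]
second = [a + random.gauss(0, NOISE) for a in ability]  # retake, no treatment

# Preselect "underperformers" by a cutoff on the first sitting
low = [i for i in range(N) if first[i] < 400]

# Their scores rise on average anyway: the group was selected for bad luck
gain = sum(second[i] - first[i] for i in low) / len(low)
print(f"selected {len(low)} students, average gain: {gain:.1f} points")
```

A naive analysis would credit the intervention with that gain; in fact it is pure regression to the mean, which is why the propranolol study proves nothing without a control group.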
Data—it’s not just for data scientists
Students interested in data science fall into roughly three cohorts, says Biron. One group is primarily interested in statistics and theory. Another group is drawn to the computational aspect. But the third and largest group comes from all over: the humanities, the social sciences, the biological sciences.
Asked whether data science should be thought of as a toolkit for researchers—methods to interrogate data, regardless of field—or as a discipline all its own, Biron responds, “It’s a bit of both. But the entire spectrum, all the gray levels between, are actually useful and required.”
When developing the major, Biron says, the College faced a choice. “One approach would be to build a lot on the commonalities with statistics and computer science and basically pick and choose—make a program of the most relevant computer science and statistics courses.” The other alternative: treat it as a new and separate discipline, “which meant it would take a little more time to get off the ground because you can’t come up with this whole new curriculum, like 12 courses, in one year.”
The organizers chose the second option. Data science started as a minor in 2019, laying the groundwork for and gauging the interest in a full major by 2020. Like many plans, that one was pushed back a year by the COVID-19 pandemic. Biron notes data science is compatible as part of a double major with every other major in the College except one: molecular engineering. There are so many required classes for molecular engineering that “it’s very hard to fit a second major along with that.” (Consider this a challenge to overachievers.)
Even for students who don’t major or minor in the field, he sees an advantage in picking up data science skills. “You can be more creative and more critical of what you’re doing if you understand” the techniques behind the science, he says. “You can be a much better part of a team.”
Biron is enjoying teaching both majors and nonmajors. “It’s a joy to have a class with philosophy students and a couple of economics students and somebody from biology,” he says. “It’s really fun.”
Dirty, messy data
Given the old chestnut about UChicago being a place of theory and not practice, you might suspect a new program like data science would be heavily tilted toward the former.
Au contraire, says David Uminsky, senior research associate for computer science and executive director for data science. Whereas dealing solely with theoretical data might be enough to get you through a statistics program, he says, “you cannot separate a data science education from looking at real data.” The data clinic class, which he oversees, is where the rubber hits the road for students—a real-world test of their skills.
The clinic layers “the theory and learning from the UChicago classroom into a rigorous practice setting,” with “the structure of really being on a data science team.” Students learn how to problem solve as a team and how to handle meetings with mentors and clients. They also learn how to use scrum frameworks, which organize work around small, self-organizing, cross-functional teams, and Agile methodologies, which break software development into short iterative stages. These methods are de rigueur in software development and becoming common in other business fields.
Students must apply to the data clinic class; then they’re offered a list of possible projects, which they rank in order of preference. The instructors match each student with a team and a project. Some projects are from research groups on campus; some are from the national laboratories. Others come from nonprofits, government, and industry. Uminsky notes they’re trying to develop more relationships with Chicagoland nonprofits—South Side ones in particular.
By way of example, Uminsky offers a project done in conjunction with Inclusive Development International (IDI), a nonprofit that monitors corporate activities in the developing world. IDI wanted to track deforestation in Indonesia around palm oil harvesting.
An extremely versatile product used in many processed foods, palm oil accounts for 40 percent of the world’s vegetable oil. It’s an incredibly efficient crop, occupying just 6 percent of the land devoted to vegetable oil cultivation worldwide. The problem: palms only grow in the tropics. This provides a financial incentive for unscrupulous growers to clear fragile rainforests—often illegally—for palm oil plantations. Multinational corporations are under pressure to show that their palm oil comes from responsible growers.
As it happens, palm fruit must be processed within 24 hours of harvest, which means plantations have to be close to mills. So students in the clinic first took lists of every registered mill. (The vast majority are in Indonesia and Malaysia.) They partitioned the land into what are called Voronoi cells, irregular polygons in which the entire region enclosed is closer to that particular mill than any other. Plantations within each cell were presumed to supply that mill.
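The defining property of a Voronoi cell is that every point inside it is nearer to its mill than to any other mill, so deciding which cell a plantation falls in reduces to a nearest-neighbor check. A minimal sketch with invented coordinates (real work would use geographic distance on the globe, not flat Euclidean distance):

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical mill locations as (x, y) points; names are made up
mills = {"mill_a": (0.0, 0.0), "mill_b": (10.0, 0.0), "mill_c": (5.0, 8.0)}

def assign_mill(point, mills):
    """A point lies in the Voronoi cell of whichever mill is closest to it."""
    return min(mills, key=lambda name: dist(point, mills[name]))

# Each plantation is presumed to supply the mill whose cell contains it
plantations = [(1.0, 1.0), (9.0, 1.0), (5.0, 7.0)]
supplies = {p: assign_mill(p, mills) for p in plantations}
print(supplies)
```

With thousands of registered mills, a library such as SciPy can compute the full cell geometry efficiently, but the underlying assignment rule is exactly this closest-mill test.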
The students matched historic data derived from satellite imagery for each mill’s territory to learn whether the surrounding area was still protected virgin rainforest or was being cleared for plantations. Recent clearing was a sign that the mill might be supplied by environmentally irresponsible plantations; conversely, virgin rainforest in a mill’s territory was a sign that rainforest might be in imminent danger of being cleared. All this from openly available data.
Uminsky says this problem will take multiple clinic teams to solve. “They’ll make progress,” he says. “We’ll get to certain milestones. And then the next team will pick it up.” The first results came in late 2020, with the launch of an online tool matching mills with each multinational’s list of palm oil providers and giving them a score based on how well they were living up to their environmental pledges.
The upside of working with dirty, messy data, then, is that students’ efforts can go from the classroom to having an impact on real people within a few years, or even months.
Trending up?
So far 13 students have declared data science as their major—not bad, considering the new major was announced just a month before the academic year began.
Ever the careful statistician, Biron estimates the program’s eventual size—somewhere between 30 and 50 undergraduate majors at any one time—based on interest from information sessions and his own conversations with students. Meanwhile, the Autumn 2021 introductory classes in data science had 90 students. The analysis of these data is left to the reader, but I’ll venture that the trend is clearly positive.