The ability to sort and query through expanding oceans of digital information takes an increasingly central role in the world. How far can this new technology take us? And at what point does Big Data become too big?
One of the most elemental ways to think about “Big Data”—quite possibly the most prominent buzzword in the tech world’s modern lexicon—and about the role massive amounts of data play in technical and scientific advancement, is as the ability to detect and examine highly diffuse patterns.
Big Data as a concept refers to a quantity of digitally stored information so large and complex that it is difficult to manage with traditional computer processing tools. Harnessing it has become the broader term for data scientists’ ability to identify correlations, and eventually make predictions, by using powerful computers to navigate and extrapolate from vast oceans of stored data.
Today, data is produced so fast and on such a large scale that its sheer size can be debilitating. Every day, social media users, researchers and corporations, just to note a few sources, create roughly 2.5 quintillion bytes of data—so much that 90 percent of the data in the world today has been created in the last two years alone, according to Zettaset, a Big Data security organization located in Mountain View.
But, in a fusion of expanded data-storage capacity and computers that are now powerful enough to manage, access, sort and query Big Data “sets” as a whole, a variety of industries have begun to unlock value from data that was previously unattainable or unknowable—from Facebook tracking people’s posts for advertising purposes, to genomics research on how cancerous cells mutate, to the NSA surreptitiously collecting domestic phone records to try to pin down potential threats to national security.
“Scale makes a very big difference,” says David Haussler, an expert in bioinformatics and genomics at UC Santa Cruz. “The amount of data we’re dealing with in our society is growing by factors of a thousand, which puts us in a different universe in many ways. And it enables things that are unheard of at smaller scales.”
In 2012, Haussler’s team established the Cancer Genomics Hub, or CGHub, at UCSC—the nation’s first large-scale data repository for the National Cancer Institute’s cancer genome research programs. His team manages CGHub’s computing infrastructure, which is located at the San Diego Supercomputer Center (SDSC), a 19,000-square-foot facility that hosts data storage and technical equipment for the entire UC system. CGHub allows scientists from all over the world, including those at UCSC, to access and research Big Data on cancer patients’ cells.
“We want to be able to share genomics data because you can’t recognize the patterns of mutations that are common to cancers—occurring again and again—unless you compare many, many cancer samples,” he says.
Without the ability to access such a vast set of samples and research how they compare to one another, Haussler says insight is lost for lack of a bigger picture: “Essentially,” he says, “you won’t be able to see the forest through the trees.
“So,” he continues, “it’s a Big Data problem. Just looking at one [genome mutation] sample, it all looks like a random chance event. It isn’t until you see thousands or hundreds of thousands of samples that you begin to see that, oh, there’s this subtle pattern of mutations that keeps happening again and again in this special sub-sub-type cancer—that’s what makes that type of cancer tick.
“Now we know that there’s a certain type of patient who has this pattern of mutation, then we can figure out what genes are disrupted, and how we would approach treating that patient’s special type of cancer. We call this precision medicine. Instead of giving everyone chemotherapy, we want a precise treatment that is targeting the particular patterns of mutations in a patient’s tumor.”
To grasp the amount of genomics Big Data researchers are working with: each genome file—the DNA record from a tumor or a normal tissue—is 300 billion bytes.
“Multiply 300 billion by the number of tissue samples,” he says. “And note that there are 1.2 million new cases of cancer every year in the United States—that’s a lot of data.
“So, we work very hard to work out the technology for how we would actually store and manage all of this vital data,” Haussler adds. “The ability to store the data and the ability to navigate the data—those two are intricately intertwined.”
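The arithmetic behind those figures is easy to sketch. Using only the numbers quoted above (300 billion bytes per genome file, 1.2 million new U.S. cancer cases per year), a few lines of Python show why the totals climb into the hundreds of petabytes; this is a back-of-the-envelope illustration, not a description of CGHub’s actual holdings:

```python
# Back-of-the-envelope scale of annual U.S. cancer genome data,
# based on the figures quoted in the article.
BYTES_PER_GENOME = 300e9      # ~300 billion bytes per genome file
NEW_CASES_PER_YEAR = 1.2e6    # ~1.2 million new U.S. cancer cases per year
PETABYTE = 1e15               # 10^15 bytes

total_bytes = BYTES_PER_GENOME * NEW_CASES_PER_YEAR
print(f"{total_bytes / PETABYTE:.0f} petabytes per year")  # 360 petabytes per year
```

Sequencing a genome for every new case would thus generate hundreds of petabytes annually—well beyond the capacity figures the supercomputer centers cite later in this article.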
Cumulatively, CGHub has transmitted 10 petabytes of data—each petabyte is equivalent to one million gigabytes—to research labs all over the world so that qualified scientists can do cancer work, Haussler says. In some cases, research labs are focusing on a single sub-sub-sub type of cancer, while others are comparing across all cancer types, looking for general patterns.
“That’s the simple stat that I’m most proud of—those 10 petabytes of data,” Haussler says. “The most famous cancer research centers in the world have downloaded the data; the MD Anderson Cancer Center in Houston, or the [Memorial] Sloan-Kettering Cancer Center in New York, for example.”
A History of Data
Humans have been gathering, testing, and making deductions using data for hundreds of years, often for the same ends—to derive some benefit from their environments or predict how events could play out. Prior to digital Big Data technology; prior to petabytes, terabytes, and gigabytes; and prior to supercomputers, CDs and floppy discs, “Big Data” was the accumulation, sharing, and cross-referencing of information over space and time, says Ethan Miller, a professor of computer science and the director of the Center for Research in Storage Systems at UCSC. Without digital data storage or computer processing technology, people crunched data the old-fashioned way.
For example, about 2,500 years ago—around 400 B.C., when people possessed very little knowledge about the world relative to modern society—how, Miller asks, did ancient societies such as those in Greece and China discover that chewing on willow bark helps reduce pain, fever and inflammation? (The major chemical component of willow bark is salicin, which is very similar to the active ingredient in aspirin.)
The answer, he says, is that they tried various things until they found something that worked.
Those early researchers would have had to conduct their sample tests in many places, wait for observations from other researchers in order to exchange findings, and accumulate empirical data over long periods of time to test the validity of a wide variety of early herbal remedies.
In the case of concluding that willow bark works well to relieve pain, “that might have taken thousands of years and who knows how many data samples,” Miller says.
In this manner, lacking the advanced forms of computer technology that data scientists and researchers enjoy today, Big Data was practiced across many centuries and expansive geography. “It was people traveling from village to village,” Miller notes. “It was across space and time, and it was much more difficult. Essentially, that’s what societies were trying to do, herbal remedies just being one example—accumulate Big Data over centuries.”
He also says that today one of the most common errors people make when interpreting data is confusing correlation with causation. In theory, the two are easy to distinguish, according to STATS, a statistical assessment service nonprofit. An action or occurrence may trigger another, such as smoking, which causes lung cancer—or it can correlate with another action or occurrence, such as smoking having a correlation with alcoholism. If one action causes another, then they are most certainly correlated. But just because two things correlate does not mean that one caused the other, even if it seems to make sense.
In the case of genomics research and cancer, Big Data’s capacity to show correlations in such vast sample sets allows scientists to narrow down the odds of what could be causing a gene to mutate, but not actually pinpoint the cause itself, Miller says.
“And that’s what Big Data is good for,” he continues. “It does not give you definitive yes or no answers on what’s causing anything; what it does give you are very helpful hints that allow you to focus your efforts where they will be most useful. Big Data can give you better confidence intervals. The more [samples] you can compare against, the more you can try to tease out what does matter and what doesn’t.”
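Miller’s distinction between correlation and causation can be illustrated with a toy simulation. In the sketch below, a hidden common cause (the invented variable `z`) drives two measurements that never influence each other, yet they correlate strongly—every variable and number here is made up for illustration:

```python
import random

random.seed(0)

# A hidden common cause z drives both x and y (hypothetical toy data).
# x never influences y, yet the two correlate strongly.
n = 10_000
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]  # x depends on z
y = [zi + random.gauss(0, 0.5) for zi in z]  # y depends on z, not on x

# Pearson correlation coefficient, computed directly.
mean_x, mean_y = sum(x) / n, sum(y) / n
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
std_x = (sum((a - mean_x) ** 2 for a in x) / n) ** 0.5
std_y = (sum((b - mean_y) ** 2 for b in y) / n) ** 0.5
r = cov / (std_x * std_y)
print(f"correlation(x, y) = {r:.2f}")  # strongly positive despite no causal link
```

The correlation is real and useful—exactly the “helpful hint” Miller describes—but mistaking it for causation would mean inventing a mechanism that does not exist.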
Miller says that one of the most significant problems with interpreting Big Data and correlations is the human tendency to see meaningful patterns in what is actually random, a statistical concept called “overfitting.”
“Humans have this problem,” he says. “It’s what happens when you see something, and you make very detailed observations about it, but it turns out a lot of it was just random. It’s the reason people see a face in the moon, or Jesus on their toast. It’s the human brain saying this thing looks like a face, it must have gotten put there for a reason.”
Star constellations in the sky are another example, he says. The stars that form the Big Dipper or Orion’s Belt are nowhere near each other; they lie many light-years apart, at very different distances from Earth, but they happen to line up in the sky in such a way that people see images and project meaning onto them. And projecting meaning, or concluding causation, from data where it is not warranted can lead people astray.
“Overfitting is assuming that, just because you found something, then it must be important,” Miller says. “A very simple example would be taking the names of faculty at UCSC, and saying, hypothetically, that Miller was a more common name at the university than any other. Overfitting would be saying, because there are more Millers here, that means having the name Miller means you’re more likely to be a college professor. But obviously, no, it doesn’t. It means Miller is a more common name. But that tendency is a very human thing.”
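Miller’s surname example can be sketched as a toy simulation. Even when every surname in a made-up pool is equally likely, some name still comes out “most common” in any random sample—a pattern that is pure chance, not evidence that the name predicts anything:

```python
import random
from collections import Counter

random.seed(1)

# Draw 500 hypothetical faculty surnames from a pool in which every name
# is equally likely. One name will still be "most common" purely by chance.
pool = ["Miller", "Garcia", "Chen", "Okafor", "Nguyen", "Smith", "Patel", "Kim"]
faculty = [random.choice(pool) for _ in range(500)]

name, count = Counter(faculty).most_common(1)[0]
print(f"most common surname: {name} ({count} of 500)")
```

Rerunning with a different seed typically crowns a different name—the surest sign the “pattern” was noise all along.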
Data Super Crunch
The San Diego Supercomputer Center (SDSC), where Haussler and his team travel to access genomics data, is a hub for all aspects of Big Data research, according to SDSC Director Michael Norman.
Overall, the center can hold 15 petabytes, making it one of the nation’s largest Big Data supercomputer facilities. The supercomputer system at the National Center for Supercomputing Applications at the University of Illinois, called “Blue Waters,” holds 25 petabytes, while the United States Department of Energy’s national laboratories are up around the 50-petabyte range, Norman estimates, “just to give you some notion of what the high-water marks are.”
The SDSC comprises three supercomputers, which provide the power for processing the Big Data stored there.
“It’s basically 10,000 individual computer processors all together, with a very fast network so that they can work on very large problems,” he says. “It’s one thing to store a lot of data, it’s another to efficiently work on it.”
Norman compares the Big Data center to an industrial park, as it might look from a bird’s eye view.
“You’d see it has a lot of infrastructure, roads, railroad lines, power, utilities,” he says, “And a supercomputer center really needs to have data superhighways inside of it in order to convey the raw material—which is the data—to the factory—which is the supercomputer. One of the biggest advances to Big Data in the last few years is high-speed networking, conveyed by fiber optics.”
To gather genomics data, Haussler says he travels to San Diego, downloads the data he wants to research over fiber-optic lines, and then returns with it to UCSC.
“In the old days, people accessed the data they wanted by moving it to themselves,” Norman says. “At a petabyte, that’s no longer feasible. So, in the Big Data world, there is the phrase, ‘bringing the computer to the data,’ which means actually moving the computer power to be close to the data.”
In November of last year, UCSC received a sponsorship deal from Hitachi Data Systems, a Big Data storage corporation in Silicon Valley, according to a Nov. 4 UCSC press release. The deal provided the university with funding for cancer genomics research as well as a petabyte-capacity data-storage system, which is overseen by Miller. UCSC researchers are using the storage system to manage genomics data and study ways to improve the storage and processing for large amounts of genomic data. Miller says the petabyte of data storage sits on machine room racks, which are about six feet tall, 19 inches wide and span back about 30 inches.
“It’s not that exciting,” he says.
National Security in Big Data
The National Security Agency took the international spotlight last year when government contractor Edward Snowden leaked government documents unveiling a vast phone and Internet communications-gathering program.
On June 6, the New York Times editorialized, “There is every reason to believe the federal government has been collecting every bit of information about every American’s phone calls except the words actually exchanged in those calls.”
This metadata is digitally stored, where a computer algorithm can scan it and flag suspicious communication patterns for further examination. Otherwise, the data is never actually seen by human eyes.
Both the NSA and genomics researchers use Big Data to detect patterns: in the NSA’s case, the relationships of a very few select callers (potential terrorists) among millions of samples; in the case of genomics, common patterns in cell mutations among millions of tissue samples.
The NSA uses Big Data to get an idea about what people are communicating with one another, from where, and how often, Miller says.
“If person X is a known terrorist, and person Y has been to Saudi Arabia three times in a year and has made lots of calls back there, and maybe they’re taking flight lessons but not learning how to fly the plane, or doing something that’s suspicious, then the government can look at the people they’re in communication with and chronicle them as possibilities and profile them for suspicious activities,” he says. “[The NSA] is trying to look at what’s weird here. And Big Data makes it easier to find outliers and to find connections between people. Just like in genomics with the way people’s cells behave. Those are all correlations you can find.”
Norman says both the genomics researchers’ data and the NSA’s telephone records data are expressed as graphs of relationships.
“The thing about these graphs is they have millions of nodes, and millions of edges—the lines connecting the nodes,” he says. “In the case of the NSA, the node is a phone number that placed a call to another phone number. In the case of genomics data, it is a specific gene on the DNA sequence that may be connected to other genes through different regulatory pathways in our biology.”
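At toy scale, the kind of graph Norman describes can be sketched as an adjacency list, with made-up phone numbers standing in for nodes and calls for edges—the real graphs have millions of both:

```python
from collections import defaultdict

# Toy call graph: each node is a phone number, each edge a call between two
# numbers. All numbers and the "flagged" node are invented for illustration.
calls = [
    ("555-0101", "555-0102"),
    ("555-0101", "555-0103"),
    ("555-0103", "555-0104"),
    ("555-0102", "555-0104"),
]

graph = defaultdict(set)
for a, b in calls:
    graph[a].add(b)  # calls are undirected here: either party
    graph[b].add(a)  # can be reached from the other

flagged = "555-0101"
print(sorted(graph[flagged]))  # ['555-0102', '555-0103']
```

Replace phone numbers with genes and calls with regulatory pathways, and the same structure describes the genomics graphs Norman mentions.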
Using Big Data, Norman says, genome scientists can assemble their graphs into stories that, ideally, tell them something informative about the data set—cancer illness—or, for the NSA, which samples warrant closer inspection.
With the Snowden leaks’ recent domination of the media, Big Data, for many, has become suggestive of an Orwellian government that uses digital spying to control or keep tabs on its citizens. And while he says those concerns are valid and need to be addressed, Haussler hopes the good that Big Data can do for society, as a tool, will not be shunned out of fear and trepidation. He says something great can be lost if people allow the negativity they associate with Big Data to override their ability to see its potential.
“The unfortunate collision of the ‘Bigs’— Big Brother and Big Data—is a detriment to the social understanding of the field, and to the extent that that’s reinforced, and Big Data equals Big Brother as an equation in the public’s mind, then we lose an enormous opportunity for science and the ability to have better health and lead better lives,” Haussler says.
However, he continues, the loss of privacy and the changing face of what people in America think about their privacy is a very real and massive change.
“I like to think about [Big Data] like the highway system,” Haussler says. “Before the highway system, there weren’t so many driving deaths, and you couldn’t traffic illicit materials as easily, but there are all kinds of good things that became possible because of the highway system, as well, and I don’t think anyone seriously thinks that we want to get rid of it. So that’s what it is—it’s a data highway. It’s good in many ways, but also scary in some ways.”
Pay Them in Data
Millions of people enjoy free online services such as Facebook, Google email, search engines, and much more. However, a company offering a “free” service is not doing so purely out of the goodness of its heart, says Miller.
“They’re getting something back in return, and what they’re getting is your data,” he says. “You pay them back with data about your observations, what you do, and your interests, which they can, in turn, sell [to advertisers]. This has been obvious for some time, but Big Data is changing the game. That data has become more valuable than ever because they can correlate it across other things and create more targeted ads, and, in a sense, this is a good thing.”
Companies such as Google and Facebook, which store Big Data on their clients, might supply information to an advertiser that one of their users is a photographer and that they like Canon equipment. The advertiser might then show that person a great deal on a Canon lens, and they might happily purchase it.
“I might care about that,” Miller says. “The point is that ads for things you might actually want are useful. Viagra advertisements in your spam mail; probably not so much.”
There are privacy issues with Big Data, and there are ways to get around them, but they require ongoing initiative, he says. For example, instead of uploading his photos to Facebook or Instagram, Miller puts his photos on a website called SmugMug, which does not run ads. But he also pays them $50 a year.
“You have to pay for the service somehow,” he says. “I pay in dollars. Maybe you want to give away your data; that’s your business.”
Big Data and its role in reformatting society’s online privacy standards is still very hazy terrain. In December last year, a federal judge ruled that the NSA’s bulk phone record collection was unconstitutional and violates Americans’ reasonable expectation of privacy. The ruling, however, was stayed pending an appeal. And that is just one of the more prominent examples of legal issues orbiting data privacy. As Big Data grows more vast and people’s lives more rooted in technology and online activity, society’s interests in and expectations of privacy will undoubtedly become increasingly examined and tested by law.
As data storage capabilities expand, more Big Data is being stored for longer periods of time. This field of research is called archival storage, which Miller works on at UCSC. This sort of technology will allow scientists to determine correlations over longer periods of time, which can be extremely informative.
“Currently, researchers can’t do work on how patients’ genomes have evolved over the last 30 years because that genome data isn’t stored,” Miller explains. “But down the line, they will be able to.”
Archival data storage is also very important for other fields of study such as climate change and weather behavior.
“We track the path of a hurricane after its formation or how a storm will hit California based on historical data,” he says. “That’s tremendously valuable, but you need lots and lots of observations over a long time to draw out the patterns.”
However, at the rate that Big Data is accumulating, Miller says it will eventually become more than data scientists can store or work with.
“It gets very difficult because at its rate of expansion, we’re going to drown in data if we’re not careful,” he says. “We can’t store infinite data, but the amount of data we are generating is increasing rapidly.
“So, the questions are, ‘Can we keep all of it?’ ‘Should we keep all of it?’ and ‘How much of it will really be relevant?’ Right now, we don’t know the answers.”