September 25, 2013


To Know, but Not Understand: David Weinberger on Science and Big Data

by David Weinberger

Thomas Jefferson and George Washington recorded daily weather observations, but they didn’t record them hourly or by the minute. Not only did they have other things to do, such data didn’t seem useful. Even after the invention of the telegraph enabled the centralization of weather data, the 150 volunteers who received weather instruments from the Smithsonian Institution in 1849 still reported only once a day. Now there is a literally immeasurable, continuous stream of climate data from satellites circling the earth, buoys bobbing in the ocean, and Wi-Fi-enabled sensors in the rain forest. We are measuring temperatures, rainfall, wind speeds, CO2 levels, and pressure pulses of solar wind. All this data and much, much more became worth recording once we could record it, once we could process it with computers, and once we could connect the data streams and the data processors with a network.

This would not be the first time. For example, when Sir Francis Bacon said that knowledge of the world should be grounded in carefully verified facts about the world, he wasn’t just giving us a new method to achieve old-fashioned knowledge. He was redefining knowledge as theories that are grounded in facts. The Age of the Net is bringing about a redefinition at the same scale. Scientific knowledge is taking on properties of its new medium, becoming like the network in which it lives.

In this excerpt from my new book, Too Big To Know, we’ll look at a key property of the networking of knowledge: hugeness.

In 1963, Bernard K. Forscher of the Mayo Clinic complained in a now famous letter printed in the prestigious journal Science that scientists were generating too many facts. Titled “Chaos in the Brickyard,” the letter warned that the new generation of scientists was too busy churning out bricks — facts — without regard to how they go together. Brickmaking, Forscher feared, had become an end in itself. “And so it happened that the land became flooded with bricks. … It became difficult to find the proper bricks for a task because one had to hunt among so many. … It became difficult to complete a useful edifice because, as soon as the foundations were discernible, they were buried under an avalanche of random bricks.”

If science looked like a chaotic brickyard in 1963, Dr. Forscher would have sat down and wailed if he were shown the Global Biodiversity Information Facility at GBIF.org. Over the past few years, GBIF has collected thousands of collections of fact-bricks about the distribution of life over our planet, from the bacteria collection of the Polish National Institute of Public Health to the Weddell Seal Census of the Vestfold Hills of Antarctica. GBIF.org is designed to be just the sort of brickyard Dr. Forscher deplored — information presented without hypothesis, theory, or edifice — except far larger because the good doctor could not have foreseen the networking of brickyards.

How will we ever make sense of scientific topics that are too big to know? The short answer: by transforming what it means to know something scientifically.

Indeed, networked fact-based brickyards are a growth industry. For example, at ProteomeCommons.org you’ll find information about the proteins specific to various organisms. An independent project by a grad student, Proteome Commons makes available almost 13 million data files, for a total of 12.6 terabytes of information. The data come from scientists from around the world, and are made available to everyone, for free. The Sloan Digital Sky Survey — under the modest tag line “Mapping the Universe” — has been gathering and releasing maps of the skies gathered from 25 institutions around the world. Its initial survey, completed in 2008 after eight years of work, published information about 230 million celestial objects, including 930,000 galaxies; each galaxy contains millions of stars, so this brickyard may grow to a size where we have trouble naming the number. The best known of the new data brickyards, the Human Genome Project, in 2001 completed mapping the entire genetic blueprint of the human species; it has been surpassed in terms of quantity by the International Nucleotide Sequence Database Collaboration, which as of May 2009 had gathered 250 billion pieces of genetic data.

There are three basic reasons scientific data has increased to the point that the brickyard metaphor now looks 19th century. First, the economics of deletion have changed. We used to throw out most of the photos we took with our pathetic old film cameras because, even though they were far more expensive to create than today’s digital images, photo albums were expensive, took up space, and required us to invest considerable time in deciding which photos would make the cut. Now, it’s often less expensive to store them all on our hard drive (or at some website) than it is to weed through them.

Second, the economics of sharing have changed. The Library of Congress has tens of millions of items in storage because physics makes it hard to display and preserve, much less to share, physical objects. The Internet makes it far easier to share what’s in our digital basements. When the datasets are so large that they become unwieldy even for the Internet, innovators are spurred to invent new forms of sharing. For example, Tranche, the system behind ProteomeCommons, created its own technical protocol for sharing terabytes of data over the Net, so that a single source isn’t responsible for pumping out all the information; the process of sharing is itself shared across the network. And the new Linked Data format makes it easier than ever to package data into small chunks that can be found and reused. The ability to access and share over the Net further enhances the new economics of deletion; data that otherwise would not have been worth storing have new potential value because people can find and share them.
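
To make the idea concrete, here is a minimal sketch, in Python and purely for illustration, of content-addressed chunking. It is not Tranche’s actual protocol (whose internals aren’t described here); the chunk size, hash choice, and manifest format are all assumptions of the sketch. The point is only that once data is addressed by its content, any peer holding a chunk can serve it, and any downloader can verify what it receives.

    # Minimal sketch of content-addressed chunking (an illustration, not
    # Tranche's actual protocol): a large dataset is split into fixed-size
    # chunks, each identified by its hash, so any peer holding a chunk can
    # serve it and the work of sharing is spread across the network.

    import hashlib

    CHUNK_SIZE = 1 << 20  # 1 MiB chunks (an arbitrary choice for the sketch)

    def make_manifest(data: bytes) -> list[str]:
        """Split data into chunks and return the list of chunk hashes (the manifest)."""
        return [
            hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)
        ]

    def verify_chunk(chunk: bytes, expected_hash: str) -> bool:
        """A downloader can check a chunk received from any peer against the manifest."""
        return hashlib.sha256(chunk).hexdigest() == expected_hash

    if __name__ == "__main__":
        dataset = b"x" * (3 * CHUNK_SIZE + 123)        # stand-in for a huge data file
        manifest = make_manifest(dataset)
        print(len(manifest), "chunks")                 # 4 chunks
        first_chunk = dataset[:CHUNK_SIZE]             # imagine it arrived from some peer
        print(verify_chunk(first_chunk, manifest[0]))  # True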

Third, computers have become exponentially smarter. John Wilbanks, vice president for Science at Creative Commons (formerly called Science Commons), notes that “[i]t used to take a year to map a gene. Now you can do thirty thousand on your desktop computer in a day. A $2,000 machine — a microarray — now lets you look at the human genome reacting over time.” Within days of the first human being diagnosed with the H1N1 swine flu virus, the H1 sequence of 1,699 bases had been analyzed and submitted to a global repository. The processing power available even on desktops adds yet more potential value to the data being stored and shared.

The brickyard has grown to galactic size, but the news gets even worse for Dr. Forscher. It’s not simply that there are too many brick-facts and not enough edifice-theories. Rather, the creation of data galaxies has led us to science that sometimes is too rich and complex for reduction into theories. As science has gotten too big to know, we’ve adopted different ideas about what it means to know at all.

For example, the biological system of an organism is complex beyond imagining. Even the simplest element of life, a cell, is itself a system. A new science called systems biology studies the ways in which external stimuli send signals across the cell membrane. Some stimuli provoke relatively simple responses, but others cause cascades of reactions. These signals cannot be understood in isolation from one another. The overall picture of interactions even of a single cell is more than a human being made out of those cells can understand. In 2002, when Hiroaki Kitano wrote a cover story on systems biology for Science magazine — a formal recognition of the growing importance of this young field — he said: “The major reason it is gaining renewed interest today is that progress in molecular biology … enables us to collect comprehensive datasets on system performance and gain information on the underlying molecules.” Of course, the only reason we’re able to collect comprehensive datasets is that computers have gotten so big and powerful. Systems biology simply was not possible in the Age of Books.

The result of having access to all this data is a new science that is able to study not just “the characteristics of isolated parts of a cell or organism” (to quote Kitano) but properties that don’t show up at the parts level. For example, one of the most remarkable characteristics of living organisms is that we’re robust — our bodies bounce back time and time again, until, of course, they don’t. Robustness is a property of a system, not of its individual elements, some of which may be nonrobust and, like ants protecting their queen, may “sacrifice themselves” so that the system overall can survive. In fact, life itself is a property of a system.

The problem — or at least the change — is that we humans cannot understand systems even as complex as that of a simple cell. It’s not that we’re awaiting some elegant theory that will snap all the details into place. The theory is well established already: Cellular systems consist of a set of detailed interactions that can be thought of as signals and responses. But those interactions surpass in quantity and complexity the human brain’s ability to comprehend them. The science of such systems requires computers to store all the details and to see how they interact. Systems biologists build computer models that replicate in software what happens when the millions of pieces interact. It’s a bit like predicting the weather, but with far more dependency on particular events and fewer general principles.
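
As a toy illustration of what it means to replicate those signal-and-response interactions in software, the sketch below steps a made-up two-species cascade forward in time. The species, rate constants, and equations are invented for the example; a real systems-biology model couples thousands of such interactions, which is exactly why it outruns unaided comprehension.

    # A toy signal-and-response cascade, integrated step by step (Euler's method).
    # This is a minimal sketch of the *kind* of model systems biologists build,
    # not any particular published model; real models couple thousands of species.

    def simulate(steps: int = 5000, dt: float = 0.01) -> list:
        signal, a, b = 1.0, 0.0, 0.0   # an external signal and two downstream species
        trajectory = []
        for step in range(steps):
            # Species A is produced in response to the signal and decays;
            # species B is produced in response to A and decays.
            da = 2.0 * signal - 0.5 * a
            db = 1.5 * a - 0.3 * b
            a += da * dt
            b += db * dt
            trajectory.append((step * dt, a, b))
        return trajectory

    if __name__ == "__main__":
        t, a, b = simulate()[-1]
        print(f"t={t:.1f}: A={a:.2f}, B={b:.2f}")  # levels settle toward a steady state

Even with just two invented species, the behavior has to be computed rather than eyeballed; multiply the species count by a few thousand and the point of the paragraph above becomes vivid.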

Models this complex — whether of cellular biology, the weather, the economy, even highway traffic — often fail us, because the world is more complex than our models can capture. But sometimes they can predict accurately how the system will behave. At their most complex these are sciences of emergence and complexity, studying properties of systems that cannot be seen by looking only at the parts, and cannot be well predicted except by looking at what happens.

This marks quite a turn in science’s path. For Sir Francis Bacon 400 years ago, for Darwin 150 years ago, for Bernard Forscher 50 years ago, the aim of science was to construct theories that are both supported by and explain the facts. Facts are about particular things, whereas knowledge (it was thought) should be of universals. Every advance of knowledge of universals brought us closer to fulfilling the destiny our Creator set for us.

This strategy also had a practical side, of course. There are many fewer universals than particulars, and you can often figure out the particulars if you know the universals: If you know the universal theorems that explain the orbits of planets, you can figure out where Mars will be in the sky on any particular day on Earth. Aiming at universals is a simplifying tactic within our broader traditional strategy for dealing with a world that is too big to know, by reducing knowledge to what our brains and our technology enable us to deal with.

We therefore stared at tables of numbers until their simple patterns became obvious to us. Johannes Kepler examined the star charts carefully constructed by his boss, Tycho Brahe, until he realized in 1605 that if the planets orbit the Sun in ellipses rather than perfect circles, it all makes simple sense. Three hundred fifty years later, James Watson and Francis Crick stared at x-rays of DNA until they realized that if the molecule were a double helix, the data about the distances among its atoms made simple sense. With these discoveries, the data went from being confoundingly random to revealing an order that we understand: Oh, the orbits are elliptical! Oh, the molecule is a double helix!

With the new database-based science, there is often no moment when the complex becomes simple enough for us to understand it. The model does not reduce to an equation that lets us then throw away the model. You have to run the simulation to see what emerges. For example, a computer model of the movement of people within a confined space who are fleeing from a threat (they are in a panic) shows that putting a column about one meter in front of an exit door, slightly to either side, actually increases the flow of people out the door. Why? There may be a theory or it may simply be an emergent property. We can climb the ladder of complexity from party games to humans with the single intent of getting outside of a burning building, to phenomena with many more people with much more diverse and changing motivations, such as markets. We can model these and perhaps know how they work without understanding them. They are so complex that only our artificial brains can manage the amount of data and the number of interactions involved.
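
A tiny agent-based sketch makes the run-it-and-see point concrete. Everything below (the grid size, the exit placement, the movement rule) is invented for illustration; this is not the pedestrian-dynamics model behind the column-near-the-exit finding, and it makes no claim to reproduce that result. You simply run it and observe the evacuation time that emerges from the congestion at the door.

    # A toy agent-based evacuation model: agents on a grid all head for a single
    # exit cell, and only one agent may occupy a cell at a time, so congestion
    # emerges near the door. A minimal sketch for illustration only; it is not
    # the pedestrian-dynamics model described in the text.

    import random

    WIDTH, HEIGHT = 20, 10
    EXIT = (WIDTH - 1, HEIGHT // 2)          # a single exit cell on the right wall

    def step_toward_exit(pos):
        """Greedy move one cell toward the exit (the agents' only intent)."""
        x, y = pos
        ex, ey = EXIT
        if x != ex:
            x += 1 if ex > x else -1
        elif y != ey:
            y += 1 if ey > y else -1
        return (x, y)

    def evacuate(n_agents=60, seed=0):
        """Run the simulation and return how many time steps evacuation takes."""
        rng = random.Random(seed)
        cells = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]
        agents = set(rng.sample(cells, n_agents))
        t = 0
        while agents:
            t += 1
            order = list(agents)
            rng.shuffle(order)               # conflicts resolved by random priority
            for agent in order:
                target = step_toward_exit(agent)
                if target == EXIT:
                    agents.discard(agent)    # the agent leaves through the exit
                elif target not in agents:   # move only if the target cell is free
                    agents.discard(agent)
                    agents.add(target)
        return t

    if __name__ == "__main__":
        print("time steps to evacuate:", evacuate())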

The same holds true for models of purely physical interactions, whether they’re of cells, weather patterns, or dust motes. For example, Hod Lipson and Michael Schmidt at Cornell University designed the Eureqa computer program to find equations that make sense of large quantities of data that have stumped mere humans, including cellular signaling and the effect of cocaine on white blood cells. Eureqa looks for possible equations that explain the relation of some likely pieces of data, and then tweaks and tests those equations to see if the results more accurately fit the data. It keeps iterating until it has an equation that works.
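
That propose-test-tweak loop is the heart of symbolic regression. The sketch below is a deliberately tiny version of it, written for illustration rather than taken from Eureqa: random candidate formulas are scored against the data, and the best one so far is repeatedly mutated and re-scored until something fits. All of the operators, mutation rules, and test data are assumptions of the sketch.

    # A deliberately tiny symbolic-regression loop: propose candidate equations,
    # score them against the data, mutate the best, and repeat. An illustrative
    # sketch only, not Eureqa's actual algorithm or code.

    import random

    rng = random.Random(42)
    OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

    def random_expr(depth=3):
        """Build a random expression tree over x and small constants."""
        if depth == 0 or rng.random() < 0.3:
            return "x" if rng.random() < 0.5 else round(rng.uniform(-3, 3), 2)
        op = rng.choice(list(OPS))
        return (op, random_expr(depth - 1), random_expr(depth - 1))

    def evaluate(expr, x):
        if expr == "x":
            return x
        if isinstance(expr, (int, float)):
            return expr
        op, left, right = expr
        return OPS[op](evaluate(left, x), evaluate(right, x))

    def error(expr, data):
        """Mean squared error of a candidate equation on the observed data."""
        try:
            return sum((evaluate(expr, x) - y) ** 2 for x, y in data) / len(data)
        except OverflowError:
            return float("inf")

    def mutate(expr):
        """Replace a random subtree with a fresh random one."""
        if rng.random() < 0.3 or not isinstance(expr, tuple):
            return random_expr(2)
        op, left, right = expr
        if rng.random() < 0.5:
            return (op, mutate(left), right)
        return (op, left, mutate(right))

    def search(data, generations=2000):
        best = random_expr()
        best_err = error(best, data)
        for _ in range(generations):
            candidate = mutate(best)
            err = error(candidate, data)
            if err < best_err:               # keep the better-fitting equation
                best, best_err = candidate, err
        return best, best_err

    if __name__ == "__main__":
        # "Observations" secretly generated by y = x*x + 1; the search should
        # converge on an expression close to that relationship.
        data = [(i / 2, (i / 2) ** 2 + 1) for i in range(-10, 11)]
        best, err = search(data)
        print("best expression:", best, "error:", round(err, 4))

Eureqa itself uses far more sophisticated search and model selection; the sketch only shows the shape of the loop the text describes.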

Dr. Gurol Suel at the University of Texas Southwestern Medical Center used Eureqa to try to figure out what causes fluctuations among all of the thousands of different elements of a single bacterium. After chewing over the brickyard of data that Suel had given it, Eureqa came out with two equations that expressed constants within the cell. Suel had his answer. He just doesn’t understand it and doesn’t think any person could. It’s a bit as if Einstein dreamed E = mc², and we confirmed that it worked, but no one could figure out what the c stands for.

No one says that having an answer that humans cannot understand is very satisfying. We want Eureka and not just Eureqa. In some instances we’ll undoubtedly come to understand the oracular equations our software produces. On the other hand, one of the scientists using Eureqa, biophysicist John Wikswo, told a reporter for Wired: “Biology is complicated beyond belief, too complicated for people to comprehend the solutions to its complexity. And the solution to this problem is the Eureqa project.” The world’s complexity may simply outrun our brains’ capacity to understand it.

Model-based knowing has many well-documented difficulties, especially when we are attempting to predict real-world events subject to the vagaries of history; a Cretaceous-era model of that era’s ecology would not have included the arrival of a giant asteroid in its data, and no one expects a black swan. Nevertheless, models can have the predictive power demanded of scientific hypotheses. We have a new form of knowing.

This new knowledge requires not just giant computers but a network to connect them, to feed them, and to make their work accessible. It exists at the network level, not in the heads of individual human beings.