In 1996, I was on a “guru panel” at the first TDWI (The Data Warehousing Institute) conference in Washington, DC. Also on the panel were Bill Inmon and Herb Edelstein. In the audience Q&A session, someone had the microphone and posed this question to the panel: “Do the men who run these companies [she did say men] need all of this data, or can they just run them by gut instinct?”
I was stunned by the question. This was a data warehousing conference, and data warehouses at the time were the most extensive databases in the enterprise: on average, a few hundred gigabytes.
Alan Paller, the moderator, picked up the mic and started walking toward me. To myself, I was saying, “Please don’t give me the mic.” I didn’t know how to answer the question without offending the questioner. At the last second, he turned and handed the mic to Herb. I breathed a sigh of relief. Herb took the mic, let out a big sigh, and said, “Let me tell you how to make a hundred million dollar company. Start with a billion-dollar company and run it on intuition.”
Who was right? The question is even more valid today than it was twenty-five years ago. Since then, the amount of captured and stored data has exploded. The tactics of big data analytics, data science and AI, particularly machine learning, fundamentally require vast amounts of data.
- We are awash in data. Around 2.5 quintillion bytes worth of data is generated each day. There are currently over 60 zettabytes of data in the entire digital universe.
- The Big Data industry has seen tremendous growth in just a few years. It shot up from $169 billion in 2018 to $274 billion in 2022 — a 62% increase.
- The United States has a 51% market share in Big Data and Analytics Solutions (IDC)
- In 2018, the total amount of data created, captured, copied and consumed worldwide was 33 zettabytes (ZB), the equivalent of 33 trillion gigabytes (one zettabyte is one trillion gigabytes). This grew to 59 ZB in 2020 and is predicted to reach a mind-boggling 175 ZB by 2025. That’s a lot of data.
- 80-90% of all digital data is unstructured (CIO). This is particularly interesting because the computation cost of converting pictures, video and audio to digitized data for computation exceeds the cost of running the models.
- Does all of this data help to understand causation?
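The figures quoted in the list above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the annual growth rate is an implied average derived from the 2018 and 2025 figures, not a number given by any of the cited sources.

```python
# Decimal (SI) units: 1 ZB = 10**21 bytes, 1 GB = 10**9 bytes.
GB_PER_ZB = 10**21 // 10**9  # one trillion gigabytes per zettabyte


def percent_increase(old, new):
    """Percentage growth from old to new."""
    return (new - old) / old * 100


# Big Data market: $169 billion (2018) to $274 billion (2022).
market_growth = percent_increase(169, 274)

# 33 ZB of data in 2018, expressed in gigabytes.
data_2018_gb = 33 * GB_PER_ZB

# Implied average annual growth if 33 ZB (2018) becomes 175 ZB (2025).
years = 2025 - 2018
annual_growth = (175 / 33) ** (1 / years) - 1

print(f"Market growth 2018-2022: {market_growth:.0f}%")
print(f"2018 data volume: {data_2018_gb:,} GB")
print(f"Implied annual data growth: {annual_growth:.1%}")
```

The market figure rounds to the 62% quoted above, and the zettabyte conversion confirms the "33 trillion gigabytes" equivalence.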
Does causal emergence reverse the bias for the deepest raw data?
An article in New Scientist, “A rethink of cause and effect could help when things get complicated,” opens with, “Some scientists insist that the cause of all things exists at the most fundamental level, even in systems as complex as brains and people. What if it isn't so?”
Identifying what causes what in complex systems is the aim of much of science. The prevailing approach to almost any kind of experiment is reductionism. Although we've made fantastic progress by breaking things down into smaller components, this “reductionist” approach has limits. Now, some researchers are suggesting we should zoom out and look at the bigger picture. Having created a new way to measure causation, they claim that, in many cases, the causes of things are found at the more coarse-grained levels of a system.
The idea that causation appears not at the micro level but at a higher level that is more understandable (and computable) is controversial. For example, suppose the causal relationships can be found in a machine learning model built on 100 thousand records instead of 500 million. If these researchers are correct, it will raise the question of whether we need to save such massive amounts of raw data.
What is causal emergence?
Just as with the question from 1996, is the answer in the data at some higher level? The alternative to reductionism is that the real cause of the events we study is revealed only at a higher level. This idea is called causal emergence. It defies the intuition behind reductionism and the assumption that a cause can’t simply appear at one scale unless it is inherent in micro causes at finer scales.
The original work on causal emergence was conducted by neuroscientists Erik Hoel at Tufts University in Massachusetts and Renzo Comolatti at the University of Milan in Italy. They prove that causal emergence exists and show how we can identify and use it. “We want to take causation from being a philosophical question to being an applied scientific one,” says Hoel.
In 2013, Hoel, working with Larissa Albantakis and fellow neuroscientist Giulio Tononi, both at the University of Wisconsin-Madison, introduced a new way to quantify causation, using a measure called effective information. It is based on how tightly a scenario constrains both the past causes that could have produced it and its possible future effects.
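To make effective information less abstract, here is a minimal sketch of one common way to compute it for a small system described by a transition probability matrix. It assumes the formulation in which every state is set with equal probability and the measure is the average divergence of each state's effect distribution from the overall average. The four-state toy system, the function names and the grouping of states are illustrative assumptions, not taken from the New Scientist article.

```python
from math import log2


def effective_information(tpm):
    """Effective information (in bits) of a transition probability matrix.

    Each row is the distribution over next states for one current state.
    Assumes every current state is set with equal probability.
    """
    n = len(tpm)
    # Average effect distribution under uniform interventions.
    mean_row = [sum(row[j] for row in tpm) / n for j in range(len(tpm[0]))]
    total = 0.0
    for row in tpm:
        # Kullback-Leibler divergence of this row from the average row.
        total += sum(p * log2(p / q) for p, q in zip(row, mean_row) if p > 0)
    return total / n


def coarse_grain(tpm, groups):
    """Collapse micro states into macro states (groups of micro indices)."""
    macro = []
    for src in groups:
        # Chance of landing in each target group, averaged over the
        # micro states that make up the source group.
        macro.append(
            [sum(tpm[i][j] for i in src for j in dst) / len(src) for dst in groups]
        )
    return macro


# Hypothetical micro system: states 0-2 hop randomly among themselves,
# state 3 always stays put.
micro = [
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Macro description: treat states 0-2 as one state and state 3 as another.
macro = coarse_grain(micro, [[0, 1, 2], [3]])

ei_micro = effective_information(micro)
ei_macro = effective_information(macro)
print(f"micro EI: {ei_micro:.3f} bits, macro EI: {ei_macro:.3f} bits")
```

In this toy case the two-state macro description scores about 1 bit while the four-state micro description scores about 0.81 bits. The coarser description carries more effective information because it discards the noise among the first three states, which is the pattern the researchers call causal emergence.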
Their work has its critics. Judea Pearl, a computer scientist at the University of California, Los Angeles, says that attempts to “measure causation in the language of probabilities” are outdated. But Hoel says the measures of causation they consider include such Pearl-type structures too.
All of this makes people nervous about the issue of free will. Are we free to make our own decisions, or are they preordained? One common argument against the existence of free will is that atoms interact according to rigid physical laws, so the overall behavior they give rise to can be nothing but the deterministic outcome of all their interactions.
Yes, quantum mechanics creates some randomness in those interactions, but if it is random, it can’t be involved in free will. With causal emergence, however, the true causes of behavior stem from higher degrees of organization, such as how neurons are wired, our brain states, past history, etc. That means we can meaningfully say that our brains and minds are the real cause of our behavior.
This is interesting to think about, but the point of this article is neither philosophy nor neuroscience. The point is that if causal emergence is a viable alternative to reductionism, perhaps it’s an answer to that 1996 question: why do we need all this data?