In the previous episode of the Big Data Series, we defined the somewhat elusive term, Big Data, and gave some perspective on the emergence and nature of Big Data, its dimensions and salient characteristics, and its cosmological and galactic relevance. Additionally, we explored the notion of the ubiquitous Data Sphere, its emergence as part of our modern technological existence and how it was adumbrated by both a priest and a science fiction writer.
Now we will look inside the content of some Big Data efforts to explore their potential economic and social value. Since it is all too easy to raise warning flags regarding abuses of Big Data, we will save that discussion for a later episode of the Big Data Series. Instead we will focus on how Big Data is reaching out and touching each one of us for the good of all, presumably.
Specifically, we will now explore Reality Mining, Honest Signals, and the Quantified Self in this installment of the Big Data Series. Although one might consider these topics as distinct from Big Data per se, it will hopefully become clear that the nature and scale of Big Data fundamentally transforms each of these concerns in terms of their impact and intrinsic contribution to the ubiquitous Data Sphere that we all live in.
Reality Mining: The Social Spelunkers
MIT professor Sandy Pentland says Reality Mining "is all about paying attention to patterns in life and using that information to help with things like setting privacy patterns, sharing things with people, notifying people -- basically, to help you live your life."
Pentland's Reality Mining is now a common occurrence, thanks to the proliferation of sophisticated smartphones. These handheld devices now have the processing power of low-end desktop computers and collect varied data, thanks to devices such as GPS chips that track location. Researchers such as Pentland are getting better at making sense of all that information, detecting underlying patterns that can reveal much more information than the data alone may suggest. For example, with the aid of some statistical algorithms, that information can identify things to do or new people to meet. It can also make devices easier to use -- by automatically determining security settings, for example. Smartphone data can also shed light on workplace dynamics and on the well-being of communities. It can even help project the course of disease outbreaks and provide clues about individuals' health.
To create an accurate model of a person's social network, Pentland's team combines a phone's call logs with information about its proximity to other people's devices, which is continuously collected by Bluetooth sensors. With the help of factor analysis, a statistical technique commonly used in the social sciences to explain correlations among multiple variables, the team identifies patterns in the data and translates them into maps of social relationships. Such maps are used to accurately categorize the people in your address book as friends, family members, acquaintances, or coworkers. In turn, this information is used to automatically establish privacy settings -- for instance, allowing only your family to view your schedule. With location data added in, the smartphone can predict when you will be near someone in your network.
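To make the pipeline above concrete, here is a minimal sketch of the idea in Python. It is not Pentland's method: the factor analysis step is swapped out for a few hand-written rules, and the data, the field layout, the work-hours window, and the thresholds are all hypothetical assumptions chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical observations (not real data): each Bluetooth proximity
# sighting is (contact, hour_of_day, is_weekend); the call log is a list
# of contacts that were called.
proximity_sightings = [
    ("alice", 10, False), ("alice", 14, False), ("alice", 16, False),
    ("bob", 20, False), ("bob", 21, True), ("bob", 13, True),
    ("carol", 12, False),
]
call_log = ["bob", "bob", "bob", "alice", "dave"]


def build_features(sightings, calls):
    """Summarize each contact as counts of work-hours proximity,
    off-hours proximity, and phone calls."""
    features = defaultdict(lambda: {"work": 0, "off": 0, "calls": 0})
    for contact, hour, is_weekend in sightings:
        # Assumption: "work hours" means 09:00-17:00 on a weekday.
        in_work_hours = (not is_weekend) and 9 <= hour < 17
        features[contact]["work" if in_work_hours else "off"] += 1
    for contact in calls:
        features[contact]["calls"] += 1
    return dict(features)


def label_contact(f):
    """Toy rule-based labels standing in for factor analysis.
    The thresholds are arbitrary assumptions."""
    if f["off"] >= 2 and f["calls"] >= 2:
        return "friend or family"
    if f["work"] >= 2:
        return "coworker"
    return "acquaintance"


labels = {
    name: label_contact(f)
    for name, f in build_features(proximity_sightings, call_log).items()
}
```

A real system would learn these groupings from the data rather than hard-code them, but the sketch shows why combining call logs with proximity is more informative than either source alone.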
Reality Mining is defined as the collection of machine-sensed environmental data pertaining to human social behavior. This new paradigm of data mining makes possible the modeling of conversation context, proximity sensing, and temporospatial location throughout large communities of individuals. Mobile phones (and other innocuous devices) are used for data collection, opening social network analysis to brave new methods of empirical stochastic modeling. These datasets are enormous and Big Data techniques are fundamental to the Reality Mining effort.
Some fundamental social questions that Reality Mining can answer:
How do social networks evolve over time?
How entropic (that is, how unpredictable) are most people's social lives?
How does social information flow?
Can the topology of a social network be inferred from only proximity data?
How can we change a group's social interactions to promote better functioning?
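The entropy question in the list above can be made measurable. One common yardstick, assumed here for illustration, is the Shannon entropy of where a person is observed: low entropy means a routine, highly predictable life, and high entropy means the opposite. The location logs below are invented.

```python
import math
from collections import Counter


def shannon_entropy(observations):
    """Shannon entropy, in bits, of a sequence of observed categories."""
    counts = Counter(observations)
    total = len(observations)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical location logs for two people, one observation per time slot.
routine = ["home", "work", "home", "work", "home", "work", "home", "work"]
varied = ["home", "work", "gym", "cafe", "park", "bar", "library", "airport"]

routine_bits = shannon_entropy(routine)  # two equally likely places
varied_bits = shannon_entropy(varied)    # eight equally likely places
```

Under this measure the first person needs 1 bit per observation to describe and the second needs 3 bits, so the second person's whereabouts are harder to predict.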
All Roads Lead To Big Data
Reality Mining can help city planners unravel traffic snarls and public health officials track and prevent the spread of illnesses, such as severe acute respiratory syndrome, or SARS. "There is so much societal good that can come from this," says Pentland. "Suddenly we have the ability to know what is happening with the mass of humanity and adapt society to accommodate the trends we can detect, and make society work better." Hari Seldon himself could not have said it any better.
Inrix tracks some 750,000 vehicles traversing 55,000 miles of roadway in 129 cities to gather real-time traffic congestion data that is then used in a variety of ways, such as providing live traffic information to devices made by Garmin (GRMN) and TomTom. Over time, all that data shows useful patterns. "We can build a model for major sporting events that shows what happens if you build the stadium in one place or another," Inrix CEO Mistele says. "We've found that in most cities the biggest determining factor in traffic is school schedules. In other cases, like Washington, D.C., the legislative calendar is very important. We can correlate our data to practically any other variable."
Smartphones can be useful in gathering health-related information, says Alex Kass, a researcher at Accenture. "It's one of the application areas that focus well both on the individual and on large groups," he says. "Researchers can use data on a sample population over a given period—say, a week or a month—and then assume some of them are sick, to provide a more accurate picture of how widely an illness could spread," Kass says. "Information on a particular individual or group could help build more accurate models to predict how an illness spreads from one person to another. People could also use the data to keep better tabs on themselves," Kass says. (More on the Quantified Self later in this episode.)
So as a scientific field of inquiry, Reality Mining is a solid step toward a mathematical sociology that resonates with Hari Seldon's psychohistory, as described in the previous episode of this series. Big Data makes Reality Mining much more powerful in its reach and ultimate omniscience. The Big Data influence is so transformative that some believe this "Petabyte Age" in which we live, the Data Sphere, is all you need, and that scientific theories, including those used in Reality Mining, be damned. Huh?
Theory? What Theory?
Chris Anderson is the editor in chief of Wired magazine. He believes that Big Data's "Data Deluge" makes the scientific method obsolete. Spoken just like a disgruntled former physicist.
The scientific method is built around testable hypotheses. The hypothesized models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years. Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms, i.e. the model, that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise. Big Data without models is an overwhelming cacophony. But according to Anderson, faced with massive data, this approach to science -- hypothesize, model, test -- is becoming obsolete.
Anderson advises us to forget taxonomy, ontology, and psychology. "Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves."
Anderson suggests a better way. He suggests the petabytes themselves allow us to say: "Correlation is enough." He says we can stop looking for models and can analyze the data without hypotheses about what it might show. Anderson says we can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. Whaaa?!?
Unfortunately, Anderson neglects to mention that the underlying "statistical algorithms" used to sift through the petabytes are not exactly "model-free estimators". In other words, many statistical models are formalizations of relationships between variables in the form of mathematical equations. A statistical model describes how one or more random variables are related to one or more other variables. The model is statistical as the variables are not deterministically but stochastically related. So Mr. Anderson, "Thar Be Models" in many of the statistical algorithms sifting through the petabytes of collected data. And a model is just a theory, after all. D'oh!
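Here is a small sketch of why "correlation is enough" is risky without a model. It generates nothing but random noise, then searches many noise variables for the one that best correlates with a noise target. The sample size, the number of variables, and the seed are arbitrary assumptions for the demonstration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_noise_correlation(n_samples=20, n_variables=1000, seed=0):
    """Return the largest |correlation| found between a random target and
    many unrelated random variables. Every value is pure noise."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(target, candidate)))
    return best


best_r = strongest_noise_correlation()
```

With only 20 samples and 1,000 candidate variables, the best match will almost certainly look like a respectable correlation, even though by construction there is no relationship at all. The more variables you trawl, the more such coincidences you should expect, which is exactly where a model earns its keep.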
Statistical models of intention, goals and values form the basis of analysis and prediction in Honest Signaling, the patterns of how we socially interact. Researchers say they can get a more accurate picture of what people do, where they go, and with whom they communicate from a device they carry than from more subjective sources, including what people say about themselves. In short, people lie -- smartphones don't. Or so the thinking goes.
But how can you know when someone is bluffing? Paying attention? Genuinely interested? The answer, according to Sandy Pentland in Honest Signals, is that the subtle patterns of how we interact with other people reveal our attitudes toward them. These unconscious social signals are not just a back channel or a complement to our conscious language; they form a distinct communication network. Biologically based “honest signaling”, which evolved from ancient animal signaling mechanisms, offers an unmatched window into our intentions, goals, and values. If we understand this ancient animal channel of communication, Pentland claims, we can accurately predict the outcomes of situations ranging from job interviews to first dates.
Pentland monitored and analyzed the back-and-forth patterns of signaling among groups of people and found that this second channel of communications, revolving not around words but around social relations, profoundly influences the major decisions of our lives -- even though we remain largely unaware of it. Pentland shows how by “reading” our social networks we can become more successful at pitching an idea, getting a job, or closing a deal. Using this “network intelligence” theory of social signaling, Pentland describes how we can harness the intelligence of our social network to become better managers, workers, and communicators.
The key component of Pentland's methodology is that observational data collected from social groups can be "normalized" along the lines of Honest Signaling in order to support or discount the observational data with patterns gleaned from this second channel of communication. This makes the collection of massive amounts of data more robust and "noise-free" in its predictive power and reinforces any social analysis with an almost uncanny accuracy that transcends individual actions per se and the apparent volition of any sole individual. This is how Big Data can become Big Brother in a most insidious manner. Our animal biology can betray us. As Curly has prophetically stated on many occasions, "I'm a victim of circumstance!"
Are You Receiving Me? Cues and Signals
In biology, signals are traits, including structures and behaviors, that have evolved specifically because they change the behavior of receivers in ways that benefit the signaller. Traits or actions that benefit the receiver exclusively are called cues. When an alert bird deliberately gives a warning call to a stalking predator and the predator gives up the hunt, the sound is a signal. When a foraging bird inadvertently makes a rustling sound in the leaves that attracts predators and increases the risk of predation, the sound is a 'cue'.
Signaling systems are shaped by the extent to which signallers and receivers have mutual interests. An alert bird warning off a stalking predator is communicating something useful to the predator: that it has been detected by the prey; it might as well quit wasting its time stalking this alerted prey, which it is unlikely to catch. When the predator gives up, the signaller can get back to other important tasks. Once the stalking predator is detected, the signalling prey and receiving predator have a mutual interest in terminating the hunt. Within species, mutual interests generally increase with kinship. Kinship is central to models of signalling between relatives, for instance when broods of nestling birds beg and compete for food from their parents.
Biologists use the phrase “honest signals” in a statistical sense. Biological signals, like warning calls or resplendent tail feathers, are considered honest if they are correlated with, or reliably predict, something useful to the receiver. In this usage, honesty is a useful correlation between the signal trait (which economists call “public information” because it is readily apparent) and the unobservable thing of value to the receiver (which economists refer to as “private information” and biologists often refer to as “quality”). Honest biological signals do not need to be perfectly informative, reducing uncertainty to zero; they only need to be honest “on average” to be potentially useful.
Honest Signaling in animal communication has direct correlates to human communication, especially in social groups. To the extent that human signals are in fact honest, group behavior not only becomes coordinated but also predictable, and can reliably define social affinity groups. These affinity groups can reliably predict mass behavior in ways that social researchers exploit, especially using volumes of Big Data. But what happens when bad actors engage in intentional deception, such as fraud?
Because there are both mutual and conflicting interests in most animal signalling systems, the fundamental problem in evolutionary signalling games is dishonesty or cheating. Why don’t foraging birds just give warning calls all the time, at random (false alarms), just in case a predator is nearby? If peacocks with bigger tails are preferred by peahens, why don’t all peacocks display big tails? Too much cheating would disrupt the correlation at the foundation of the system, causing it to collapse. Receivers should ignore the signals if they are not useful to them and signallers shouldn’t invest in costly signals if they won’t alter the behavior of receivers in ways that benefit the signaller. What prevents cheating from destabilizing signalling systems? It might be apparent that the costs of displaying signals must be an important part of the answer. However, understanding how costs can stabilize an “honest” correlation between the public signal trait and the private signalled quality has turned out to be a long, interesting process. If many animals in a group send too many dishonest signals, then their entire signalling system will collapse, leading to much poorer fitness of the group as a whole. Every dishonest signal weakens the integrity of the signalling system, and thus weakens the fitness of the group.
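The statistical sense of honesty, and the way cheating erodes it, can be sketched with a toy simulation. Everything here is an assumption for illustration: quality is drawn uniformly at random, honest signallers display their quality plus a little noise, and cheaters display a signal unrelated to their quality.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def signal_honesty(cheater_fraction, n=5000, seed=1):
    """Correlation between private quality and public signal in a toy
    population where a given fraction of signallers cheat."""
    rng = random.Random(seed)
    qualities, signals = [], []
    for _ in range(n):
        quality = rng.random()
        if rng.random() < cheater_fraction:
            signal = rng.random()  # cheater: signal says nothing about quality
        else:
            signal = quality + rng.gauss(0, 0.1)  # honest "on average"
        qualities.append(quality)
        signals.append(signal)
    return pearson(qualities, signals)


honesty_by_fraction = {f: signal_honesty(f) for f in (0.0, 0.25, 0.5, 0.9)}
```

In this toy world the signal is a strong predictor of quality when nobody cheats, and the correlation drops steadily as the cheater fraction grows. Once it is close to zero, receivers gain nothing by paying attention, which is the collapse described above.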
An example of dishonest signalling comes from fiddler crabs, which have been shown to bluff with regard to their fighting ability. Upon regrowing a lost claw, a crab will occasionally regrow a weaker but larger claw that intimidates crabs with smaller but stronger claws. The proportion of dishonest signals is low enough that it is not worthwhile for crabs to test the honesty of such signals, as combat can be dangerous and expensive.
So the key observation gleaned from animal signaling systems that can be directly applied to human social networks is that every dishonest signal directly weakens both the integrity of the signalling system and the overall effectiveness of the group. Bad actors can thus be detected and partitioned using these social criteria to model dishonest behavior and preserve the overall integrity of the data set itself, however large. Big Data actually makes it even harder for bad actors to hide in a statistical sense since outliers are in fact liars, from the social group perspective.
The Quantified Self
The Quantified Self is a social movement to incorporate technical data acquisition on aspects of a person's daily life in terms of inputs (e.g. food consumed, quality of surrounding air), states (e.g. mood, arousal, blood oxygen levels), and performance (mental and physical). The movement was started by Wired Magazine editors Gary Wolf and Kevin Kelly in 2007, as "a collaboration of users and tool makers who shared an interest in self knowledge through self-tracking". The primary methodology of self-quantification is data collection, followed by visualization, cross-referencing and the discovery of correlations. “Almost everything we do generates data,” says Mr Wolf. At the moment, he says, "data from phones, computers and credit cards are mostly used by companies to target advertising, recommend products or spot fraud. But tapping into the stream of data they generate can give people new ways to deal with medical problems or improve their quality of life in other ways." Mr Wolf draws an analogy with the Homebrew Computer Club, which met in Silicon Valley in the 1970s and went from a hobbyists’ group to the basis of a new industry. “We were inspired by our knowledge of this history of personal computing,” he says. “We asked ourselves what would happen if we convened advanced users of self-tracking technologies to see what we could learn from each other.”
With the exception of people who are trying to lose weight or improve their fitness, most people do not routinely record or otherwise measure their moods, sleeping patterns or activity levels, track how much alcohol or caffeine they drink or chart how often they walk the dog or have a bowel movement. But some people are doing these things and they are an eclectic mix of early adopters, fitness freaks, technology evangelists, personal-development junkies, hackers and patients suffering from a wide variety of health problems. They share a belief that gathering and analyzing data about their everyday activities can help them improve their lives, an approach known as “self-tracking”, “body hacking” or “self-quantifying”. New technologies make it simpler than ever to gather and analyze personal data. Sensors have shrunk and become cheaper. Accelerometers, which measure changes in direction and speed, are now cheap and small enough to be routinely included in smartphones. All this makes it much easier to take the quantitative methods used in science and business and apply them to the personal Data Sphere.
Quantified Self Business Opportunities
As populations age and health care costs increase, there will be a greater emphasis on monitoring, prevention and maintaining “wellness” in future, with patients taking a more active self-monitoring role. This approach is sometimes called “Health 2.0”, and the devotees of self-tracking could end up as pioneers of this health care model. Quantified Self start-ups in Silicon Valley are launching new devices and software aimed at self-trackers. The future of health care places a greater emphasis on self-monitoring using a variety of sensors and mobile devices to reduce medical costs by preventing disease and prolonging lives.
Tens of thousands of patients around the world are already sharing information about symptoms and treatments for hundreds of conditions on websites such as PatientsLikeMe and CureTogether. This has yielded valuable results, such as the finding that patients who suffered from vertigo during migraines were four times more likely to have painful side effects when using a particular migraine drug. The growing number of self-tracking devices now reaching the market will increase the scope for large-scale data collection, i.e. Big Data, enabling users to analyze their own readings and compare and aggregate them with those of other people.
The thumb-sized Fitbit clips onto a belt and uses an accelerometer and altimeter to measure activity levels and sleep patterns. A readout shows steps walked, stairs climbed and calories burned. Information is also uploaded wirelessly to a website that analyzes and displays the data and lets users compare notes with their friends. Jawbone has released the Up, a wristband that communicates with an iPhone and can also measure physical activity and sleep patterns. Basis is about to launch a wristwatch-like device capable of measuring heart rate, skin conductance (related to stress levels) and sleep patterns, all of which can then be displayed on a “health dashboard”.
GreenGoose offers tiny motion sensors that can be attached to everyday items, sending a wireless signal to a base-station whenever the item is used. A sensor can be attached to a toothbrush, a watering can, the collar of a dog, or a toilet seat, making it possible to measure and track how often you brush your teeth, water your plants, walk your dog, or use the toilet. The company’s aim is to establish a platform for the gamification of everyday activities.
Large technology companies are also keeping an eye on self-tracking technology. Philips, Vodafone and Intel all regard health-tracking as a promising area for future growth. Philips has launched Vital Signs, an experimental app for Apple devices that uses the built-in camera to measure the user’s heart rate and breathing rate, and chart them over time. Intel has developed an app called Mobile Therapy that pops up randomly and asks users to record their mood, to see how it varies during the week.
The examples listed represent millions of dollars already invested in various attempts to monetize the Quantified Self. The expectations are that "Health 2.0" will ultimately leverage the billions of dollars being spent on health care now and in the near future as well as inform policy makers with detailed, massive amounts of health-related data and predictive models of uncanny accuracy. This could be a boon to future highly effective health care programs or a bust in the abuse of data privacy and trust by private parties trying to gain an unfair economic advantage. Whatever the outcome, Big Data will no doubt play a major role.
The one thing all of these Quantified Self startups have in common is the highly personal data they are all collecting, at a massive Big Data scale. This very personal data also comes with significant hazards. Fitbit inadvertently published the sexual activity of its members because of a privacy oversight in its social strategy, which led to very public and thus very embarrassing moments for the company and its customers (see the screenshot). More on Big Data privacy concerns in the next installment of the Big Data Series.
Finally, as a next generation Big Data goal and thought experiment for our collective Quantified Selves, suppose we could encode all the genomes of the 7 billion humans on Earth, requiring about 5.6 exabytes of storage for our collective global genetic information. (Recall from the previous episode of this series, an exabyte is a billion gigabytes or one million terabytes.) Imagine the types of correlations and predictions that would be gleaned from this colossal data source, especially when the genotypes are correlated with phenotype data. That's not just Big Data, that's HUGE DATA. And HUGE social issues as well. Again, we will explore some of these social issues in the next episode.
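As a quick sanity check on those numbers, the arithmetic can be written out. The per-genome figure below is not stated in this series; it is simply what the 5.6 exabyte total implies when divided across 7 billion people, assuming decimal (SI) units throughout.

```python
# Decimal (SI) storage units, assumed for this back-of-the-envelope check.
BYTES_PER_MB = 10**6
BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12
BYTES_PER_EB = 10**18

WORLD_POPULATION = 7 * 10**9      # 7 billion humans
TOTAL_BYTES = 56 * 10**17         # 5.6 exabytes, kept as an exact integer

# Storage per genome implied by the two figures above.
bytes_per_genome = TOTAL_BYTES // WORLD_POPULATION
megabytes_per_genome = bytes_per_genome // BYTES_PER_MB

# The unit reminders from the previous episode.
gigabytes_per_exabyte = BYTES_PER_EB // BYTES_PER_GB
terabytes_per_exabyte = BYTES_PER_EB // BYTES_PER_TB
```

So the thought experiment budgets about 800 megabytes per person, and an exabyte is indeed a billion gigabytes or a million terabytes.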
That's A Wrap!
That's all for this episode of the Big Data Series. Next time we will explore Big Data and Predictive Crime, Total Information Awareness, and the Stellar Wind. Spoiler Alert: Stellar Wind has nothing to do with astronomy in this case. Stay Tuned!