
Edward Tufte

Here’s a fun interview with Edward Tufte, insult comic and author of The Visual Display of Quantitative Information. A couple of his snappy retorts:

…highly produced visualizations look like marketing, movie trailers, and video games and so have little inherent credibility for already skeptical viewers, who have learned by their bruising experiences in the marketplace about the discrepancy between ads and reality (think phone companies)…

…overload, clutter, and confusion are not attributes of information, they are failures of design. So if something is cluttered, fix your design, don’t throw out information. If something is confusing, don’t blame your victim — the audience — instead, fix the design. And if the numbers are boring, get better numbers. Chartoons can’t add interest, which is a content property. Chartoons are disinformation design, designed to distract rather than inform. Thus they reduce the credibility of your presentation. To distract, hire a magician instead of a chartoonist, for magicians are honest liars…

Sensibly-designed tables usually outperform graphics for data sets under 100 numbers. The average number of numbers in a sports or weather or financial table is 120 numbers (which hundreds of millions of people read daily); the average number of numbers in a PowerPoint table is 12 (which no one can make sense of because the ability to make smart multiple comparisons is lost). Few commercial artists can count and many merely put lipstick on a tiny pig. They have done enormous harm to data reasoning, thankfully partially compensated for by data in sports and weather reports. The metaphor for most data reporting should be the tables on ESPN.com. Why can’t our corporate reports be as smart as the sports and weather reports, or have we suddenly gotten stupid just because we’ve come to work?

It’s a very interesting point, actually, that people are willing to look at very complex data on sports sites, really study it and think about it, and do that voluntarily, considering it fun rather than boring, hard work. It’s child-like in a way – I mean in a positive sense: for children the world is fresh and new, and learning is fun. What is the secret of not shutting down this ability in adults? I think it’s context.

more on automated data synthesis

Here’s another article from Environmental Modeling and Software about automated synthesis of scattered research results:

We describe software to facilitate systematic reviews in environmental science. Eco Evidence allows reviewers to draw strong conclusions from a collection of individually-weak studies. It consists of two components. An online database stores and shares the atomized findings of previously-published research. A desktop analysis tool synthesizes this evidence to test cause–effect hypotheses. The software produces a standardized report, maximizing transparency and repeatability. We illustrate evidence extraction and synthesis. Environmental research is hampered by the complexity of natural environments, and difficulty with performing experiments in such systems. Under these constraints, systematic syntheses of the rapidly-expanding literature can advance ecological understanding, inform environmental management, and identify knowledge gaps and priorities for future research. Eco Evidence, and in particular its online re-usable bank of evidence, reduces the workload involved in systematic reviews. This is the first systematic review software for environmental science, and opens the way for increased uptake of this powerful approach.
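The core idea, many individually weak findings rolled up into a weighted body of evidence for or against a cause–effect hypothesis, is simple enough to sketch in a few lines. Here is a minimal illustration in Python; the study names, weights, and decision threshold are hypothetical placeholders, not the actual Eco Evidence scoring rules.

```python
# Minimal sketch of weight-of-evidence synthesis across many weak studies.
# The weights, labels, and threshold below are illustrative placeholders,
# not the actual Eco Evidence scoring rules.

from dataclasses import dataclass

@dataclass
class Finding:
    study: str      # citation for the source study
    weight: float   # evidence weight, e.g. based on study design and sample size
    supports: bool  # does the finding support the cause-effect hypothesis?

def synthesize(findings, threshold=20.0):
    """Sum evidence weights for and against a hypothesis and report an outcome."""
    support = sum(f.weight for f in findings if f.supports)
    oppose = sum(f.weight for f in findings if not f.supports)
    if support >= threshold and oppose < threshold:
        verdict = "support for the hypothesis"
    elif oppose >= threshold and support < threshold:
        verdict = "support for the alternative hypothesis"
    elif support >= threshold and oppose >= threshold:
        verdict = "inconsistent evidence"
    else:
        verdict = "insufficient evidence"
    return support, oppose, verdict

findings = [
    Finding("Study A (2004)", 4.0, True),
    Finding("Study B (2009)", 6.0, True),
    Finding("Study C (2011)", 8.0, True),
    Finding("Study D (2013)", 4.0, False),
]
print(synthesize(findings))  # (18.0, 4.0, 'insufficient evidence')
```

The appeal of the re-usable evidence bank is visible even in this toy: each atomized finding is entered once and can be re-weighed and re-combined in any number of later reviews.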

automated aggregation of scientific literature

I am intrigued by this example from Stanford of computerized review and synthesis of scientific literature:

Over the last few years, we have built applications for both broad domains that read the Web and for specific domains like paleobiology. In collaboration with Shanan Peters (PaleobioDB), we built a system that reads documents with higher accuracy and from larger corpora than expert human volunteers. We find this very exciting as it demonstrates that trained systems may have the ability to change the way science is conducted.

In a number of research papers we demonstrated the power of DeepDive on NMR data and financial, oil, and gas documents. For example, we showed that DeepDive can understand tabular data. We are using DeepDive to support our own research, exploring how knowledge can be used to build the next generation of data processing systems.

Examples of DeepDive applications include:

  • PaleoDeepDive – A knowledge base for Paleobiologists
  • GeoDeepDive – Extracting dark data from geology journal articles
  • Wisci – Enriching Wikipedia with structured data

The complete code for these examples is available with DeepDive.
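To get a concrete feel for what machine-extracted, atomized findings might look like, here is a toy sketch in Python. It uses a single hand-written regular expression rather than anything resembling DeepDive's actual machinery (DeepDive applies statistical inference over huge numbers of candidate extractions), and the sentences and pattern are invented for illustration.

```python
# Toy fact extraction: pull (taxon, epoch) candidate pairs out of sentences.
# A stand-in for what a system like DeepDive does at far larger scale, with
# statistical inference instead of one hand-written pattern.

import re

SENTENCES = [
    "Tyrannosaurus rex is known from the Late Cretaceous of North America.",
    "Mammuthus primigenius persisted into the Holocene.",
    "The weather was pleasant throughout the field season.",  # nothing to extract
]

# Hypothetical pattern: a binomial species name followed somewhere by an epoch term.
PATTERN = re.compile(
    r"(?P<taxon>[A-Z][a-z]+ [a-z]+).*?"
    r"(?P<epoch>(?:Late |Early |Middle )?(?:Cretaceous|Jurassic|Holocene|Pleistocene))"
)

def extract_facts(sentences):
    """Return (taxon, epoch) tuples for every sentence the pattern matches."""
    facts = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match:
            facts.append((match.group("taxon"), match.group("epoch")))
    return facts

print(extract_facts(SENTENCES))
# [('Tyrannosaurus rex', 'Late Cretaceous'), ('Mammuthus primigenius', 'Holocene')]
```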

Let’s just say an organization is trying to be more innovative. First it needs to understand where its standard operating procedures are in relation to the leading edge. To do that, it needs to understand where the leading edge is. That means research, which can be very tedious and time-consuming. It means the organization is paying people to spend time reviewing large amounts of information, some or even most of which will not turn out to be useful. So a change in mindset is often necessary. But tools that could jump-start the process and provide shortcuts would be great.

This is my own developing theory of how an organization can become more innovative: First, figure out where the leading edge is. Second, figure out how far the various parts of your organization are from the leading edge. Third, figure out how you are going to bring a critical mass of your organization up to the leading edge – this is as much a human resource problem as an innovation problem. Fourth, then and only then, are you ready to try to advance the leading edge. I think a lot of organizations have a few people who do #1, but then they skip right to #4. That small group ends up way out ahead of the leading edge while the bulk of the organization is nowhere near it. That’s not a recipe for success.

visualization

Solomon Messing has a pretty good article on data visualization and communicating scientific information, focusing on the ideas of Tufte and Cleveland. I like the idea that there is a science of what our brains can most easily process, and not just a need to create visual infotainment because we have lost our ability to concentrate on anything else. I’m not quite ready to give up on stacked bar charts in all cases.

When most people think about visualization, they think first of Edward Tufte. Tufte emphasizes integrity to the data, showing relationships between phenomena, and above all else aesthetic minimalism. I appreciate his ruthless crusade against chart junk and pie charts (nice quote from Data without Borders). We share an affinity for multipanel plotting approaches, which he calls “small multiples” (thanks to Rebecca Weiss for pointing this out), though I think people give Tufte too much credit for their invention—both juiceanalytics and infovis-wiki write that Cleveland introduced the concept/principle. However, both Cleveland and Tufte published books in 1983 discussing the use of multipanel displays; David Smith over at Revolutions writes that “the “small-multiples” principle of data visualization [was] pioneered by Cleveland and popularized in Tufte’s first book”; and the earliest reference to a work containing multipanel displays I could find was published *long* before Tufte’s 1983 work: Seder, Leonard (1950), “Diagnosis with Diagrams—Part I”, Industrial Quality Control (New York, New York: American Society for Quality Control) 7 (1): 11–19.

I’m less sure about Tufte’s advice to always show axes starting at zero, which can make comparison between two groups difficult, and to “show causality,” which can end up misleading your readers. Of course, the visualizations on display in the glossy pages of Tufte’s books are beautiful – they belong in a museum. But while his books are full of general advice that we should all keep in mind when creating plots, he does not put forth a theory of what works and what doesn’t when trying to visualize data.

Cleveland (with Robert McGill) develops such a theory and subjects it to rigorous scientific testing. In my last post I linked to one of Cleveland’s studies showing that dots (or bars) aligned on the same scale are indeed the best visualization to convey a series of numerical estimates.  In this work, Cleveland examined how accurately our visual system can process visual elements or “perceptual units” representing underlying data.  These elements include markers aligned on the same scale (e.g., dot plots, scatterplots, ordinary bar charts), the length of lines that are not aligned on the same scale (e.g., stacked bar plots), area (pie charts and mosaic plots), angles (also pie charts), shading/color, volume, curvature, and direction.

I’m slowly getting on board. I’ve given up pie charts in most cases. I’m not ready to give up stacked bar charts in all cases – I think they serve a purpose. Microscopic multi-panel charts still make my head spin sometimes, although if they were interactive and I could click on one panel to blow it up, that would be cool. There is one thing I am sure he is right about though, which is that the first step to serious analysis and visualization is to leave Excel behind.
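As a small illustration of the Cleveland-style finding that positions along a common scale are easier to judge than the angles and areas of a pie chart, here is a matplotlib sketch of a simple dot plot; the categories and percentages are made up for the example.

```python
# Cleveland-style dot plot: encode values as positions along a common scale,
# which readers judge more accurately than pie-chart angles or areas.
# The categories and shares below are invented for illustration.

import matplotlib.pyplot as plt

categories = ["North", "South", "East", "West", "Central"]
shares = [34, 27, 18, 12, 9]  # hypothetical percentages

# Sort ascending so the eye can read rank order straight down the axis.
order = sorted(range(len(shares)), key=lambda i: shares[i])
cats_sorted = [categories[i] for i in order]
shares_sorted = [shares[i] for i in order]
ypos = list(range(len(cats_sorted)))

fig, ax = plt.subplots(figsize=(6, 3))
ax.hlines(y=ypos, xmin=0, xmax=shares_sorted, colors="lightgray")  # guide lines
ax.plot(shares_sorted, ypos, "o")                                  # the dots
ax.set_yticks(ypos)
ax.set_yticklabels(cats_sorted)
ax.set_xlabel("Share of total (%)")
ax.set_xlim(0, max(shares) * 1.1)
ax.set_title("Same data as a dot plot instead of a pie chart")
plt.tight_layout()
plt.show()
```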

Bayes’ Theorem

The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy

There aren’t that many popular books on hard-core statistical approaches to predicting the future. Here is the Amazon description of this book:

Drawing on primary source material and interviews with statisticians and other scientists, “The Theory That Would Not Die” is the riveting account of how a seemingly simple theorem ignited one of the greatest scientific controversies of all time. Bayes’ rule appears to be a straightforward, one-line theorem: by updating our initial beliefs with objective new information, we get a new and improved belief. To its adherents, it is an elegant statement about learning from experience. To its opponents, it is subjectivity run amok. In the first-ever account of Bayes’ rule for general readers, Sharon Bertsch McGrayne explores this controversial theorem and the human obsessions surrounding it. She traces its discovery by an amateur mathematician in the 1740s through its development into roughly its modern form by French scientist Pierre Simon Laplace. She reveals why respected statisticians rendered it professionally taboo for 150 years – at the same time that practitioners relied on it to solve crises involving great uncertainty and scanty information, even breaking Germany’s Enigma code during World War II, and explains how the advent of off-the-shelf computer technology in the 1980s proved to be a game-changer. Today, Bayes’ rule is used everywhere from DNA decoding to Homeland Security. “The Theory That Would Not Die” is a vivid account of the generations-long dispute over one of the greatest breakthroughs in the history of applied mathematics and statistics.

Dense as all this might seem, it matters as we enter a more data-driven future, and we need people with the knowledge and training to deal with it. We should no longer assume that steering our sons into math, statistics, and actuarial science majors means they will never get a date.
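For a concrete feel for what “updating our initial beliefs with objective new information” means, here is a minimal numerical sketch of Bayes’ rule applied to the familiar diagnostic-test example; the prevalence and test accuracy figures are invented for illustration.

```python
# Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E),
# where P(E) = P(E | H) * P(H) + P(E | not H) * P(not H).
# The numbers below are invented for illustration.

def posterior(prior, likelihood, false_positive_rate):
    """Probability of the hypothesis given one positive piece of evidence."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# A rare condition (1% prevalence), a test that catches 95% of true cases
# but also flags 5% of healthy people.
prior = 0.01
p_positive_given_condition = 0.95
p_positive_given_healthy = 0.05

belief = posterior(prior, p_positive_given_condition, p_positive_given_healthy)
print(f"After one positive test: {belief:.2%}")   # roughly 16%

# "Learning from experience": feed the posterior back in as the prior
# when a second, independent positive test arrives.
belief = posterior(belief, p_positive_given_condition, p_positive_given_healthy)
print(f"After two positive tests: {belief:.2%}")  # roughly 78%
```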

There’s a much more hard-core set of slides on Bayes’ Theorem available on R-bloggers.