Tag Archives: data science

earthquake prediction

Here’s an interesting article on earthquake prediction. Basically, it has eluded scientists so far and will probably continue to do so.

Since the early 20th century, scientists have known that large quakes often cluster in time and space: 99 percent of them occur along well-mapped boundaries between plates in Earth’s crust and, in geological time, repeat almost like clockwork. But after decades of failed experiments, most seismologists came to believe that forecasting earthquakes in human time—on the scale of dropping the kids off at school or planning a vacation—was about as scientific as astrology. By the early 1990s, prediction research had disappeared as a line item in the USGS’s budget. “We got burned enough back in the 70s and 80s that nobody wants to be too optimistic about the possibility now,” says Terry Tullis, a career seismologist and chair of the National Earthquake Prediction Evaluation Council (NEPEC), which advises the USGS.

Defying the skeptics, however, a small cadre of researchers have held onto the faith that, with the right detectors and computational tools, it will be possible to predict earthquakes with the same precision and confidence we do just about any other extreme natural event, including floods, hurricanes, and tornadoes. The USGS may have simply given up too soon. After all, the believers point out, advances in sensor design and data analysis could allow for the detection of subtle precursors that seismologists working a few decades ago might have missed.

And the stakes couldn’t be higher. The three biggest natural disasters in human history, measured in dollars and cents, have all been earthquakes, and there’s a good chance the next one will be too. According to the USGS, a magnitude 7.8 quake along Southern California’s volatile San Andreas fault would result in 1,800 deaths and a clean-up bill of more than $210 billion—tens of billions of dollars more than the cost of Hurricane Katrina and the Deepwater Horizon oil spill combined.


Nate Silver weighs in

Nate Silver has launched his general election forecast page. He gives Hillary about an 80-20 chance of winning. He has a long discussion post about it here. I found this last paragraph interesting, where he relates a 20% chance of winning to a baseball game:

A 20 percent or 25 percent chance of Trump winning is an awfully long way from 2 percent, or 0.02 percent. It’s a real chance: about the same chance that the visiting team has when it trails by a run in the top of the eighth inning in a Major League Baseball game. If you’ve been following politics or sports over the past couple of years, I hope it’s been imprinted onto your brain that those purported long shots — sometimes much longer shots than Trump — sometimes come through.

It’s an interesting way of thinking about risk. Let’s say your favorite team is in Game 7 of the World Series, down by a run in the top of the eighth. The game is running insanely late on the East Coast, as these games always are, and you have to do something important early the next morning, like interview for a job or operate heavy machinery. Do you turn the TV off? No, of course not. You stay tuned.
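Out of curiosity, I tried to sanity-check that baseball number with a crude simulation of my own (a toy model, nothing to do with FiveThirtyEight’s methods): assume each half-inning produces a Poisson-ish number of runs averaging about half a run, and count how often a visitor down one entering the top of the eighth ends up winning.

```python
import math
import random

def poisson(lam):
    """Sample a Poisson-distributed run count (Knuth's method, fine for small lambda)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def visitor_win_rate(trials=100_000, lam=0.5, deficit=1):
    """Visiting team trails by `deficit` entering the top of the 8th.
    Each half-inning scores Poisson(lam) runs; play the 8th, 9th, and
    extra innings until someone leads after a completed inning."""
    wins = 0
    for _ in range(trials):
        visitor, home = 0, deficit
        visitor += poisson(lam)   # top of the 8th
        home += poisson(lam)      # bottom of the 8th
        while True:               # 9th inning and extras
            visitor += poisson(lam)
            if home > visitor:
                break             # home still leads after the top half; game over
            home += poisson(lam)
            if home != visitor:
                break
        if visitor > home:
            wins += 1
    return wins / trials

print(visitor_win_rate())  # roughly 0.2-0.25 under these assumptions
```

Under these crude assumptions the visitor comes back roughly a fifth to a quarter of the time, which is at least in the same range as Silver’s 20-25 percent.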

June 2016 in Review

3 most frightening stories

  • Coral reefs are in pretty sad shape, perhaps the first natural ecosystem type to be devastated beyond repair by climate change.
  • Echoes of the Cold War are rearing their ugly heads in Western Europe.
  • Trump may very well have organized crime links. And Moody’s says that if he gets elected and manages to do the things he says, it could crash the economy.

3 most hopeful stories

  • China has a new(ish) sustainability plan called “ecological civilization” that weaves together urban and regional planning, environmental quality, sustainable agriculture, and habitat and biodiversity concepts. This is good because a rapidly developing country the size of China has the ability to sink the rest of civilization if it lets its ecological footprint explode, regardless of what the rest of us do. Maybe it can set a good example for the rest of the developing world to follow.
  • Genetic technology appears to be providing some hope of real breakthroughs in cancer treatment.
  • There is still some hope for a technology-driven pick-up in productivity growth.

3 most interesting stories

index of redevelopment potential

This is an index of redevelopment potential for individual properties in Philadelphia and other cities. The application to real estate is obvious, but I can also see a number of applications to public policy. For example, changes to codes and ordinances can improve the overall health, safety, and environmental impact of a city. But these get implemented slowly and incrementally, especially in older cities with fixed boundaries, where most development is redevelopment. If you had a reasonable prediction of where and when redevelopment is likely to occur, you would know in which areas to sit back and be patient, and in which areas to intervene more directly if you want to see change on any reasonable time frame.

It’s a bit of a shame the person is not sharing their methodology. I’ve had a number of urban planners and economists tell me over the years that this is very hard to do, and I’ve seen a few try and give up. So this is either brilliant, or it is little more than a guess. If it’s brilliant it could be very valuable indeed, so I guess I can see the financial incentive not to publish the details. But there is no way to tell the difference without knowing how it is done. The author could at least publish a white paper showing some back-testing of the algorithm against historic data.
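Back-testing against history is not conceptually hard, even if building the index is. Here is a minimal sketch of what I have in mind (my own illustration, not the author’s method; the file, column names, and the choice of AUC are all assumptions): freeze the score as of some past year, then see how well it ranks the parcels that actually got redeveloped over the following years.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical parcel-level data: a redevelopment score computed as of 2010,
# plus a flag for whether the parcel was actually redeveloped by 2015.
# File and column names are placeholders, not the author's schema.
parcels = pd.read_csv("parcels_2010.csv")
score = parcels["redevelopment_score_2010"]   # the index, frozen in time
outcome = parcels["redeveloped_by_2015"]      # 1 if a teardown/major permit/new build occurred

# One number: how well does the 2010 score rank the parcels that were later redeveloped?
print("back-test AUC:", round(roc_auc_score(outcome, score), 2))

# A lift table by score decile is often more persuasive than a single metric.
parcels["decile"] = pd.qcut(score, 10, labels=False, duplicates="drop")
print(parcels.groupby("decile")["redeveloped_by_2015"].mean())
```

Even a couple of tables like that in a white paper would go a long way toward separating “brilliant” from “a guess.”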

swing the election

Here’s an interesting interactive tool on FiveThirtyEight.com where you can play around with U.S. voter turnout and preferences among various demographic groups.

I ran a few scenarios:

  • The default scenario is that each demographic group (educated white, uneducated white, black, hispanic/latino, and Asian) votes for the same party in the same proportions as 2012, and turns out at the same rate, but the absolute size of each group is adjusted for changes between 2012 and 2016.
    • electoral votes 332-206 in favor of DEMOCRATS
  • Let’s go back to the default, and all the Asian people stay home.
    • 332-206 in favor of DEMOCRATS (just not enough people, and maybe already concentrated in democratic states)
  • Back to the default, and all the hispanic/latino people stay home.
    • 283-255 in favor of DEMOCRATS (perhaps hispanics/latinos are also concentrated in already democratic states?)
  • Back to the default, and black turnout falls from 66% to 29%
    • 286-252 in favor of REPUBLICANS (perhaps this flips some key midwest swing states like Pennsylvania, Ohio, Michigan, Wisconsin, etc.)
  • Back to the default, and uneducated whites swing strongly to the right, from 62% last time to 69% Republican (maybe a terrorist attack? a major incident with China or Russia? I don’t want to say false flag, this is not one of those conspiracy websites…)
    • 282-256 in favor of REPUBLICANS (probably those swing states again)
  • Stay with the previous scenario, but educated whites swing ever so slightly to the left, from 56% Republican last time to 54% Republican (what would cause this? I don’t know, some crazy right-wing candidate spouting racist nonsense maybe, I’m not naming names…)
    • 275-263 in favor of DEMOCRATS

So the bottom line is that the minority groups tend to vote Democrat. The uneducated whites tend to vote Republican. The educated whites are the swing voters who end up being the deciding factor. So it is hard to see how a Republican candidate who appeals strongly to uneducated whites but alienates educated whites could ever stand much of a chance.
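For anyone curious about the arithmetic behind a tool like this, here is a toy single-state version (my own sketch, not FiveThirtyEight’s actual model; every number below is an illustrative placeholder): each group contributes population share × turnout × party preference to each party’s total, and whichever party comes out ahead takes all of that state’s electoral votes.

```python
# Toy single-state version of the swing-the-election arithmetic.
# Shares, turnout rates, and preferences are made-up placeholders.
groups = {
    # name: (share of eligible voters, turnout rate, share voting Democratic)
    "educated white":   (0.33, 0.77, 0.44),
    "uneducated white": (0.36, 0.57, 0.38),
    "black":            (0.12, 0.66, 0.93),
    "hispanic/latino":  (0.12, 0.48, 0.71),
    "asian":            (0.07, 0.49, 0.66),
}

def state_result(groups, electoral_votes):
    """Winner-take-all: sum each group's votes for each party and compare."""
    dem = sum(share * turnout * pref for share, turnout, pref in groups.values())
    rep = sum(share * turnout * (1 - pref) for share, turnout, pref in groups.values())
    return ("DEMOCRATS" if dem > rep else "REPUBLICANS"), electoral_votes

print(state_result(groups, electoral_votes=20))

# Scenario: uneducated whites swing right, 62% -> 69% Republican (38% -> 31% Democratic).
# With these made-up inputs, that swing flips the state.
groups["uneducated white"] = (0.36, 0.57, 0.31)
print(state_result(groups, electoral_votes=20))
```

The real tool does this state by state with actual demographic data and then adds up the electoral votes, but the basic arithmetic is that simple.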

Nate Silver’s Iowa Caucus Predictions

Political season is data science season! Here is some more on Nate Silver’s forecasting methods. If you are reading this in real time (Sunday January 31), by tomorrow night we will find out what actually happens. I will reproduce some graphics here – these are all from the FiveThirtyEight site, so please thank me for the free advertising and don’t send me to copyright jail.

For Clinton vs. Sanders, here is Nate’s average of polls as of today. He gives more recent polls greater weighting, and also adjusts somehow for bias shown in the same polls in the past.

Average of polls: Clinton 48.0% vs. Sanders 42.7%

Now, this is within the 4-6% “margin of error” reported by most polls. (I find this easier to look up on the RealClearPolitics site, although curiously it lists margins of error for Democratic polls but not Republican ones. RealClearPolitics does a straight-up poll average without all the corrections, which today is Clinton 47.3% vs. Sanders 44%, so all the corrections don’t make an enormous difference.) I can’t quickly find out whether the “margin of error” is a standard error or a confidence interval or what, but generally when the polls are within the margin of error the media tends to report it as a “statistical tie” or dead heat, and that is exactly what they are saying in this case.
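For what it’s worth, the “margin of error” most pollsters report is usually the half-width of an approximate 95% confidence interval for a single proportion, about 1.96 × √(p(1−p)/n). A quick back-of-the-envelope check (the sample sizes here are my guesses, not any particular poll’s):

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# Iowa caucus polls typically sample a few hundred likely caucus-goers.
for n in (400, 600, 1000):
    print(n, "respondents ->", round(100 * margin_of_error(0.5, n), 1), "points")
# 400 -> ~4.9, 600 -> ~4.0, 1000 -> ~3.1, which is where the familiar 4-6% comes from
```

Also worth remembering: the margin of error on the gap between two candidates is roughly double the single-candidate figure, which is part of why a 3-5 point lead still gets reported as a statistical tie.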

Nate Silver does a set of simulations – it sounds very complicated, but in essence I assume he takes his adjusted poll average for each candidate, along with some measure of spread like a standard error, and runs a whole bunch of simulations. That leads to results like this:

Clinton-Sanders Simulation

http://projects.fivethirtyeight.com/election-2016/primary-forecast/iowa-democratic/

Based on this, Nate Silver gives Clinton an 80% chance of winning Iowa and Sanders only a 20% chance.

So what’s interesting is that you have the average of polls (48-43 or 47-44 depending on source), which everyone says is a statistical tie. You have Silver’s predicted result (50-43) based on a large number of simulations, and then you have the resulting odds considering both the predicted result and the spread in the predictions (80-20). In other words, the computer is generating random numbers and 80% of simulations end up favoring Clinton. Of course in real life the dice get rolled only once, but these odds seem pretty good for Clinton.
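Here is a stripped-down sketch of what I imagine that pipeline looks like (my guess at the structure, not Silver’s actual model; the polls, half-life, and spread below are made up): compute a recency-weighted average of the polls, then draw election-day outcomes around those averages a few thousand times and count how often each candidate wins.

```python
import random

# Illustrative polls: (Clinton %, Sanders %, days old). Not real numbers.
polls = [
    (48.0, 45.0, 1),
    (45.0, 42.0, 3),
    (52.0, 40.0, 6),
]

def weighted_average(polls, half_life=5.0):
    """Recency weighting: a poll's weight halves every `half_life` days."""
    weights = [0.5 ** (age / half_life) for _, _, age in polls]
    total = sum(weights)
    clinton = sum(w * c for w, (c, _, _) in zip(weights, polls)) / total
    sanders = sum(w * s for w, (_, s, _) in zip(weights, polls)) / total
    return clinton, sanders

def clinton_win_probability(clinton_avg, sanders_avg, spread=5.0, trials=10_000):
    """Draw election-day results around the averages and count Clinton wins."""
    wins = sum(
        random.gauss(clinton_avg, spread) > random.gauss(sanders_avg, spread)
        for _ in range(trials)
    )
    return wins / trials

c_avg, s_avg = weighted_average(polls)
print(f"weighted average: Clinton {c_avg:.1f}, Sanders {s_avg:.1f}")
print(f"Clinton wins in {clinton_win_probability(c_avg, s_avg):.0%} of simulations")
```

With a roughly 5-point average lead and a 5-point spread on each candidate, something like three quarters to 80 percent of the simulated outcomes favor Clinton, which is the flavor of the 80-20 number above.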

Meanwhile, the Trump-Cruz contest is similarly close in the polls (30-25 in favor of Trump), but the predicted result (26-25 in favor of Trump) and odds (48-41 in favor of Trump) are much closer. From a quick glance, this appears to be because the spreads are much wider. I don’t know why that would be the case – presence of more viable candidates on the Republican side? Or maybe there is just more variability in the polls and nobody actually knows why.

Republican Iowa Caucus simulation

http://projects.fivethirtyeight.com/election-2016/primary-forecast/iowa-republican/


what is a blizzard?

According to FiveThirtyEight,

Three factors are required for a storm to be classified as a blizzard at a particular place, besides falling or blowing snow:

1. Sustained winds or frequent wind gusts of 35 mph or greater

2. Visibility under a quarter-mile

3. These conditions must persist for three hours.

This definition is the same whether you’ve got 1 inch or 40 on the ground.
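Those criteria are simple enough to write down as a check; here is a toy version (the function and its arguments are mine, just restating the definition above):

```python
def is_blizzard(snow_in_air, wind_mph, visibility_miles, duration_hours):
    """Toy check of the blizzard criteria quoted above: falling or blowing snow,
    sustained winds or frequent gusts of 35+ mph, visibility under a quarter mile,
    all persisting for at least three hours. Snow depth never enters into it."""
    return (
        snow_in_air
        and wind_mph >= 35
        and visibility_miles < 0.25
        and duration_hours >= 3
    )

print(is_blizzard(True, wind_mph=40, visibility_miles=0.1, duration_hours=4))  # True
print(is_blizzard(True, wind_mph=40, visibility_miles=0.1, duration_hours=2))  # False: too short
```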

what is a p-value?

FiveThirtyEight has a video of statisticians trying to explain what a p-value is. Well, what’s disturbing to me is that they won’t really try. Then again, the maker of the video very well may have cherry-picked the most entertaining answers. I can’t reproduce the research, so I have no way of knowing.

Here’s another article slamming the humble p-value. It’s true that there will always be some false positives if the data set is large enough. As an engineer, I try to use statistics to back up (or not) a tentative conclusion I have reached based on my understanding of a system, and I will question a statistically significant result against that same understanding. That way statistics and systems thinking reinforce each other, rather than our relying exclusively on one or the other. Another way to think about it: as data sets grow and traditional engineering analysis methods take too long to apply, we can use statistics to weed out the data that is clearly just noise, then focus our brains on a reduced data set that we are pretty sure contains the signal, knowing there are still some false positives in there. So I say relax, use statistics, but don’t expect statistics to be a substitute for thinking. Thinking still works.
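To make the false-positive point concrete, here is a little sketch (my own, nothing to do with the linked articles): run a plain two-sample t-test on lots of batches of pure noise and roughly one in twenty comes back “significant” at the usual p < 0.05 cutoff, which is exactly why a significant result still needs a plausible mechanism behind it.

```python
import random
from math import sqrt
from statistics import mean, stdev

def two_sample_t(a, b):
    """Ordinary equal-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))

random.seed(0)
trials, hits = 2000, 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]  # same distribution: there is no real effect
    if abs(two_sample_t(a, b)) > 2.0:            # |t| > 2 is roughly p < 0.05 at ~58 d.f.
        hits += 1

print(f"'significant' findings on pure noise: {hits / trials:.1%}")  # about 5%
```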


how do you value data?

This article lists six ways a company or organization can try to value its data:

  1. Intrinsic value of information. The model quantifies data quality by breaking it into characteristics such as accuracy, accessibility and completeness.
  2. Business value of information. This model measures data characteristics in relation to one or more business processes. Accuracy and completeness, for example, are evaluated, as is timeliness…
  3. Performance value of information…measures the data’s impact on one or more key performance indicators (KPIs) over time.
  4. Cost value of information. This model measures the cost of “acquiring or replacing lost information.”
  5. Economic value of information. This model measures how an information asset contributes to the revenue of an organization.
  6. Market value of information. This model measures revenue generated by “selling, renting or bartering” corporate data.

Another article says that algorithms are becoming less valuable as data becomes more valuable.

Google is not risking much by putting its algorithms out there.

That’s because the real secret sauce that differentiates Google from everybody else in the world isn’t the algorithms—it’s the data, and in particular, the training data needed to get the algorithms performing at a high level.

“A company’s intellectual property and its competitive advantages are moving from their proprietary technology and algorithms to their proprietary data,” Biewald says. “As data becomes a more and more critical asset and algorithms less and less important, expect lots of companies to open source more and more of their algorithms.”