Category: Statistics

Margins of error

I suspect most people who read this blog know all this already but I’ve met the same misunderstanding at work recently and also in the context of the opinion polls around the POTUS election. So here is a simplified explanation.

Imagine I have a great big jar of jelly beans, which are the favoured confectionery of probability explanations. There are exactly 500 red jelly beans and 500 blue jelly beans and nothing else – no Jill Stein jelly beans or exotic Evan McMulberry flavours. A jelly bean pollster doesn’t know this, though. The pollster wants to estimate the proportion of red and blue jelly beans in the jar BUT is only allowed to look at some of the jelly beans.

The pollster grabs a handful of jelly beans from the jar and looks at the relative proportion of jelly beans. Naturally, I don’t want the pollster to do this very often because they’ll put their germ-ridden hands all over my beautiful jelly beans. So the pollster only has this handful to look at. They have to make a key assumption – that the jelly beans were well mixed so that their handful is a random pick of jelly beans in the jar.

The pollster looks at the proportion of red to blue jelly beans. Let’s say they have 5 red and 8 blue jelly beans. The pollster says that the proportion of red to blue is 38% to 62% BUT they also report a margin of error that is quite large. They can’t be sure this figure is right because they know they may have been unlucky. With only 13 jelly beans in their handful, it isn’t wholly impossible that they could pick out nothing but blue jelly beans even if the true proportion was 50-50. If they did pick out nothing but blue, that could have happened purely by chance.
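To put rough numbers on that handful, here is a minimal sketch in Python. It assumes the jar really is 500 red and 500 blue and that the handful is a simple random draw without replacement. The 95% level and the normal approximation are my choices for illustration, and the approximation is crude for a sample of only 13 beans.

```python
from math import comb, sqrt

RED_IN_JAR, BLUE_IN_JAR = 500, 500   # the true contents of the jar
HANDFUL = 13                         # 5 red + 8 blue in the example

# Chance that a well-mixed handful of 13 is entirely blue (hypergeometric).
p_all_blue = comb(BLUE_IN_JAR, HANDFUL) / comb(RED_IN_JAR + BLUE_IN_JAR, HANDFUL)

# The pollster's estimate of the red share and an approximate 95% margin of error.
p_red = 5 / HANDFUL
margin = 1.96 * sqrt(p_red * (1 - p_red) / HANDFUL)

print(f"P(all blue) = {p_all_blue:.6f}")
print(f"red share = {p_red:.0%} +/- {margin:.0%}")
```

The all-blue handful comes out at roughly 0.0001, so it is very unlikely but not impossible. The margin of error on the 38% estimate is about 26 percentage points either way, which is why a 50-50 jar is entirely compatible with this handful.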

Margins of error address only this aspect of errors in polling – that the proportion in the sample was to some extent an ‘unlucky’ pick. The reported figure and the margin of error BOTH assume that the picking was done correctly. In our jelly bean example, that is the assumption that the beans were well mixed together.

Now it so happens that I didn’t mix the jelly beans well (although the pollster can’t tell)*. There are actually MORE red towards the top and fewer red towards the bottom of the jar. So the pollster’s assumption was wrong. A clever pollster might try to find ways to deal with this methodologically (e.g. by grabbing beans from both the top and the bottom) but the principle still applies: the reported estimate and the margin of error assume that the sampling methodology was valid. The margin of error doesn’t (and can’t) account for the probability of what in common parlance would be called an ‘error’ (i.e. a mistake).

The Right’s War on Statistics

‘Zero Hedge’ is in a flap about poll ‘oversampling’ here


It even includes a hacked email from John Podesta which discusses ways of ensuring that the Democrats’ own polling oversamples minority groups. Again, gasp!

Except. Well, oversampling a smaller demographic group is the right thing to do. When I say ‘right’, I don’t just mean for opinion polls but for collecting statistics on a population in general.

Say you have a representative sample of a population consisting of a thousand people. Now, of that thousand people you are particularly interested in a sub group that represents 1% of the population. If your sample is exactly proportionate, then it should have 10 people belonging to that sub group. Unfortunately, 10 is a shitty sample: if you are unlucky enough to get 2 odd people with unusual views, they then form 20% of your sub-sample.

Sample size is a dark art but the easiest issue to understand is that magnitude matters. A good sample size is less about the proportion of the whole population in your sample and more about the raw number of people. More is better, but ‘more’ is subject to diminishing returns.
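A quick way to see the diminishing returns is to tabulate the usual textbook margin of error against sample size. This is only a sketch: it assumes a simple random sample, a 95% confidence level and the worst-case proportion of 50%, and the function name is mine.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p from a simple random sample of n."""
    return z * sqrt(p * (1 - p) / n)

for n in (10, 100, 1000, 4000):
    print(f"n = {n:>5}: +/- {margin_of_error(n):.1%}")
```

Note that the size of the population does not appear in the formula at all, only the raw number of people sampled. Going from 10 to 100 people takes the margin from about 31 points to about 10, but you then have to quadruple the sample each time you want to halve the margin again.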

Oversampling means you can get a better picture of the sub group. However, because you end up with more of group X than you should have, their responses are then weighted down proportionally when looking at the statistics overall.
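Here is a small sketch of that weighting step. All the numbers are made up for illustration: group X is 1% of the population but 100 of the 1,000 respondents, and the response counts are invented.

```python
# Hypothetical poll: group X is 1% of the population but is oversampled
# so that it makes up 100 of the 1,000 respondents.
population_share = {"X": 0.01, "everyone else": 0.99}
respondents = {"X": 100, "everyone else": 900}
said_yes = {"X": 60, "everyone else": 360}   # made-up responses

total = sum(respondents.values())

# Naive figure: pool all respondents as if the sample were proportionate.
unweighted = sum(said_yes.values()) / total

# Weighted figure: each group counts according to its share of the population.
weighted = sum(population_share[g] * said_yes[g] / respondents[g] for g in respondents)

print(f"unweighted: {unweighted:.1%}")
print(f"weighted:   {weighted:.1%}")
```

The pollster gets a usable sample of 100 for group X, while the headline figure (40.2% weighted rather than 42.0% pooled) still reflects the fact that X is only 1% of the population.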

Are polls manipulated? Well, if by ‘manipulated’ you mean ‘use statistics’ then yes.

The EPH Analysis

An analysis of proposed new Hugo voting rules is out. It’s disappointing to some but I think it validates the change to EPH.

The story so far:

In response to the Sad Puppy/Rabid Puppy slate of the 2015 Hugo Awards, a voting system called E Pluribus Hugo was proposed and passed at the 2015 Worldcon Business meeting. The system used a process of weightings and elimination rounds to make the nomination process have more proportionality without changing the basic mechanics of how people nominate things.

Much thought and tinkering was put into EPH but what it lacked was real data. EPH should make the list of finalists more proportional to the underlying groupings of voters. However, that meant that the impact of EPH couldn’t really be known without knowing to what extent Hugo voters clustered around choices anyway. Without slates, do Hugo voters form natural groupings (perhaps along sub-genres or sub-fandoms) or are they just a noisy mess of stuff? Without real data there is no way of knowing.

While EPH was passed at the 2015 Worldcon Business meeting, it requires ratification this year to come into effect. As part of that process an analysis of the 2015 and 2014 nomination ballots has been done and the results are just out…

What it all means…

I don’t know. No, that isn’t a useful reaction. OK, I’ll try again.

Below is a list of possible talking points, reactions, counter-argument things. I made them up. They don’t necessarily reflect actual people’s views (I’ll say when they do). Bold represents a possible reaction (not mine) and not bold is my response.

I’m also a hostage to fortune because more results are coming – post the Hugo ceremony, data on the 2016 nominations will come out and who knows what that will show.

For a different take try Nicholas Whyte

2015 results show that EPH doesn’t fix the slate problem!

No one thing can fix that problem. However, in most categories, at least one additional non-slated work made it onto the ballot with EPH. That means, probably, instead of No Award winning several categories in 2015, a worthy finalist would have won instead.

EPH+No Award together produce a strong disincentive to puppy-style slates. Slate voting will produce legitimate votes and so is bound to have some impact. The combination of EPH and No Award means that a slate will find it hard to sweep a category and win a Hugo.

EPH doesn’t stop those slate-inclined who just want to get to be a finalist and don’t care about winning!

True, but that was a given. Get enough votes and you get to be a finalist. EPH does demonstrably reduce the chance of that succeeding for a slate of nominees but it doesn’t do anything about a single nominee. Again, get enough votes and you get to be a finalist. The only guaranteed way of stopping that is to create a wholly different kind of award.

There is a non-puppy related change in 2015 Best Graphic Story!

That is interesting. With EPH, instead of Sex Criminals 1 getting nominated, Schlock Mercenary gets to be a finalist.

Sex Criminals got 60 noms in total and Schlock Mercenary got 51. However, Sex Criminals must have been more clustered with other nominees (such as Saga?) and hence lost out a bit to Schlock Mercenary.

With only one slate nominee, this was an interesting category. I liked Sex Criminals, but I think this is a positive demonstration of EPH. It should result in more variety of nominees without slates.

They didn’t include Best Dramatic Presentation!

The reason the report gives is this:

In testing, it was identified that the results in two categories (Dramatic Presentation, Long and Short Form) were usually producing results with many nominators submitting matching entries to other nominators. This was more due to the smaller pool to nominate from compared to other categories than any external coordination of nominating ballots. As such, we decided to produce results with these categories excluded, as changes in the dramatic presentation categories aren’t as useful for gauging if EPH is acting as appropriate where desired as the other categories would be.

That seems silly to me. There are lots of reasons to expect more organic coordination of ballots in these categories, and seeing how EPH works in that circumstance is useful as a way of comparison.

I hope they change their minds at some point.

A single coordinated minority of less than 20% would still average controlling over 80% of the ballot!

Aside from the exclamation mark, that is a direct quote from the report. This appears to be true but controlling only 80% of the ballot is enough to kill Puppy-style slates without having No Award win multiple categories.

Killing the incentive to use the 2015 Puppy slate tactic is what EPH needs to do. It will do that.

EPH+ would be better!

Probably yes, but I don’t know if other side effects (see below) would be worse.

2015 Puppy-style slates are last year’s problem. EPH doesn’t deal with THIS year’s problem!

True. However, the structural weakness of the Hugo voting system exists regardless and the cat is out of the bag. Others can try to game the Hugo Awards in the same way and perhaps more covertly.

As for the griefing style tactics of Vox Day, I think that needs a qualitatively different approach but that is an argument for another day.

EPH knocks out a potential winner in 2014!

There are few changes to finalists with the 2014 data. I think that confirms that without slates EPH will tend to deliver similar results to the current system. However, what isn’t guaranteed is that the results must be exactly the same.

In 2014 three results are notable.

  • Firstly Best Editor Short has a swap of finalists in the last spot – Sheila Williams (86) swaps with Bryan Thomas Schmidt (80).
  • Pro artists also has some swaps, partly because in 2014 a tie for fifth place meant 6 nominees. Essentially four artists with 50, 49, 49 and 48 nominations end up with a different ordering with EPH. The EPH ranking ends up as 49, 48, 49, 50 and that looks fine to me because I have that special kind of innumeracy that results from being overly numerate.
  • Fancast has the most understandable change but also the most problematic. In 2014 this was a three-way tie for last finalist at 35 nominations each. EPH breaks the tie and resolves the issue with a single nominee. Unfortunately, one of those three (SF Signal Podcast) won and would have been eliminated by EPH.

The thing is these are all pretty much very close votes with smaller numbers of voters. Anything different about 2014 would probably have resulted in different outcomes. For the Fancast result, an internet outage or a sick cat could have ended up with a different result. The least error in collating the data could have ended up with different results.

Put another way: Hugo voters did not have a clear consensus of which of these people/works should have been nominated. These cases are not good arguments against EPH.

Yes, but, but EPH+ might make that problem worse!

I’ve really no idea. I guess it might broaden what we might think of as a marginal tie and lead to more notable discrepancies between the number of nominating ballots and grabbing that last spot in the finalists. I don’t know.

The current system doesn’t avoid this issue, it really just hides it. For some categories, there are finalists who we really can’t say are substantially more nominated than others. The differences are small enough to be down to happenstance. And yes, some of those may actually end up being winners.

I think the answer is that the number of nominees needs to be more flexible than just 5. However, it is unclear how to decide the rules on when to expand the number of nominees beyond an exact tie.

Where nominator coordination is not present, there are still significant numbers of changes not only to the long list, but to ballots where it’s not generally considered for anything untoward to have happened. Items removed from the 2014 ballot included a winner of the Hugo. Had EPH been in place, they would not have been on the ballot.

That is a direct quote from Dave McCarty’s conclusion on the report. Sorry, but that is a flawed counterfactual. If we could somehow rewind the tape back to early 2014 and re-run the 2014 nomination ballot again, how likely is it that we’d have ended up with that exact tie that occurred? EPH changed the result because it broke a tie and the other places where there were changes were also spots with very close votes.

Almost ANY change would have meant that something slightly different would have happened! For SF Signal not to have been a finalist required only ONE nominator’s vote to be different.

The changes to the Ballot and Long list are not easily verified and for people reviewing the detailed results at the end the only way to check that the process is working correctly would require access to secret nomination data and significant time.

That’s Dave McCarty again. Well, ANY verification of results needs access to ballots. Given Dave McC is worried about the 2014 Fancast result shifting by possibly one vote, to verify the CURRENT process would require checking that ballots had been classified correctly and counted correctly.

Assuming the underlying ballot data is correct (i.e. everybody’s nominations have been correctly collated) and in a machine readable form (e.g. a text file or spreadsheet), the EPH check takes seconds. Don’t trust the EPH program you are using? Use a different one and see if you get the same results. EPH is not hard to code: I made an Excel version that only uses standard Excel formulas and NO extra code at all.
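To give a sense of how little code is involved, here is a rough Python sketch of an EPH-style count. It is not the official implementation and not my Excel version. It assumes the rule as it is usually summarised: each ballot carries one point split equally among its surviving nominees, the two nominees with the fewest points are compared, and the one with fewer raw nominations is eliminated, repeating until the required number of finalists is left. The tie-breaking here (by points, then alphabetically) is a simplification of my own; the real rules treat ties more carefully.

```python
from collections import defaultdict

def eph_finalists(ballots, seats=5):
    """Simplified EPH-style count: ballots is a list of sets of nominee names."""
    remaining = set().union(*ballots)
    while len(remaining) > seats:
        points = defaultdict(float)   # each ballot's single point, split equally
        noms = defaultdict(int)       # raw number of ballots naming each nominee
        for ballot in ballots:
            live = ballot & remaining
            for nominee in live:
                points[nominee] += 1 / len(live)
                noms[nominee] += 1
        # Compare the two nominees with the fewest points...
        lowest_two = sorted(remaining, key=lambda n: (points[n], noms[n], n))[:2]
        # ...and eliminate whichever has fewer raw nominations
        # (assumed tie-break: fewer points, then alphabetical order).
        loser = min(lowest_two, key=lambda n: (noms[n], points[n], n))
        remaining.remove(loser)
    return remaining

# Toy example: a three-ballot slate against two independent nominators.
ballots = [{"S1", "S2"}, {"S1", "S2"}, {"S1", "S2"}, {"X"}, {"X"}]
print(eph_finalists(ballots, seats=2))
```

In the toy example a straight count would give both places to the slate (3 nominations each against 2 for X). Under the sketch above the two slate works split their ballots’ points, end up competing with each other, and one of them is eliminated, leaving S2 and X as the finalists.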

So, yes, cleaning the nomination data and getting it all tickety-boo takes time – without a doubt BUT if we wanted to verify that the results DON’T CHANGE under the CURRENT process YOU WOULD STILL NEED TO DO THAT.


The Puppy Axis Returns: Part 2 – Fireside and making sense of it all

In my earlier post, I remarked on how the Fireside report on the underrepresentation of black authors in published SFF short fiction generated an unusual degree of agreement among four major Sad/Rab Puppy protagonists, Larry, Brad, John C Wright and The Dumpster Fire who Walks Like a Man*.

In this post, I want to talk more about the Fireside report, its methodology and flaws and then look at Larry Correia’s “fisk”. I’ll focus on Larry because Brad Torgersen’s blog post is mainly rambling around the issues, while John C Wright and Vox are more open about the source of the animus.

First to the Fireside report. As they say right off:

The methodology is flawed, as it’s based in self-reported data whenever possible, but such data was not always findable or clear.

They also point out:

…we don’t have access to submission-rate data concerning race and ethnicity either overall or by individual magazine…

Another issue/objection that could be raised is national variation. For example, Andromeda Spaceways Inflight Magazine is one of the ‘zines included. It is an Australian magazine with (I think) mainly Australian contributors. Different country, different dynamics of race, ethnicity and self-identification, and different population proportions. [Note: that isn’t meant as a justification for the ‘zine having zero in the study, it is purely an observation of the difficulties Fireside faced in collecting this data].

Additionally, caution needs to be applied at a ‘zine level. For a ‘zine with fewer stories in a year, a single story by one black author would make the difference between zero representation and a reasonable proportion (assuming a 13% black population).
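A back-of-the-envelope sketch shows why individual small ‘zines need that caution. It uses a deliberately naive model in which each published story independently has a 13% chance of being by a black author; that model is my assumption for illustration only and says nothing about submissions or editorial decisions.

```python
BLACK_POPULATION_SHARE = 0.13   # the 13% figure used above

def chance_of_zero(stories, share=BLACK_POPULATION_SHARE):
    """Probability that none of `stories` independent picks comes from the group."""
    return (1 - share) ** stories

for stories in (5, 10, 20, 50):
    print(f"{stories:>3} stories: {chance_of_zero(stories):.1%} chance of zero")
    print(f"     one story would be {1 / stories:.0%} of the year's output")
```

Under this model a ‘zine publishing ten stories a year would show zero about a quarter of the time by chance alone, and a single story would take it to 10%. That is why the figure for any one small ‘zine tells us little, while a very low proportion across the whole data set is much harder to put down to luck.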

What is notable is that the report is up-front about the issues in its approach and doesn’t attempt to hide that there is a substantial degree of uncertainty around the findings. They aren’t claiming some indisputable proof but they are pointing out an obvious red-flag that people should pay attention to.

Having said all of that: zoiks! The resulting number of stories published by black authors across this broad spread of magazines is very low. For interest I tagged ‘zines in the Fireside data that were in the Semiprozine directory (n=20). The proportion of stories by black authors works out much the same as for the total – about 1.9%.

Now maybe getting better data on author self-identification might result in a different picture, and the study can’t tell us any specific “why” of the under-representation. Yet we can speculate. A good study (and I think this one is good) is not necessarily a flawless one but rather one that helps us generate new hypotheses which allow us to find better data. For example we can now ask about some of the “why” behind the results:

  • Is it stories not being accepted? If so, why?
  • Is it stories not being submitted? If so, why aren’t they?
  • Is it the study looking in the wrong places? If so, where should it have looked?
  • Is it the various methodological errors all creating a misleading bias in the data? If so, how come? And does that really seem likely?
  • Is it just that there are lots of black authors being published but the author’s ethnicity isn’t particularly visible?

We also have my favourite Franciscan monk to help us out: William of Ockham. He gently reminds us not to over-complicate our hypotheses. We have, as a given, a known institutionalised bias against black people in Western societies that has existed for a very long time and which exists both as overt racism and as more subtle forms of discrimination. Finding that a group which has been historically under-represented is currently under-represented does not require elaborate explanations. That doesn’t mean we all declare the case closed and never look for better data, it just means that we already have a highly plausible explanation that fits very well with known facts.

And a study is not just about discovering facts and forming hypotheses. What this should inform is what action we should take. When considering that, it is worth thinking about what the downsides of an action will be. Let’s have a look at what ‘zine editors can do in response:

  • Actively try and publish more stories by black authors.

And the down side of that response is:

  • Some extra effort expended but otherwise no obvious down side.

Now what I can’t help noticing is that of the various questions we could ask of the data to get better data, none of them really impact much on that basic response. Looking beyond that response, for example into how the SFF community can help foster talent in diverse communities would be helped with better data but again, we don’t ACTUALLY need better data to make a better start. So I’ll add a more strategic response to this report:

  • Actively try and foster SFF talent and writing in diverse communities.

And the down side of that is:

  • None.

And the up side of that is:

  • More good SFF writing and more SFF fans.

PS. I was going to get into Larry’s fisk in this post but time has moved on, so I’ll save that for a part 3.

*[Copyright: Philip Sandifer]

Don’t Forget Climate Change: Chapter 3 Richard Lindzen

The story so far: I’m reviewing, chapter by chapter, a book on climate change from a dodgy Australian rightwing think tank.


Intro, Ch 1, Ch2, …

Chapter 3 is another heavy hitter: Richard Lindzen.

The chapter is better described as disingenuous rather than weirdly misleading like chapters 1 & 2 and it is very much an example of Lindzen’s position. Lindzen is smart and like a lot of smart people he doesn’t like to be wrong or appear to be wrong. However, he found himself on the wrong side of the climate-change debate and now has to position himself so that he can join in the contrarian fun while avoiding some of the nuttier positions.

He starts by criticising people who accept that climate change is occurring for the way they use language.

“In a further abuse of language, the advocates attempt to rephrase issues in the form of yes-no questions: Does climate change? Is carbon dioxide (CO2) a greenhouse gas? Does adding greenhouse gas cause warming? Can man’s activities cause an increase in greenhouse gases?”

Personally, I’d have asked that last question slightly differently as “Has human activity caused an increase in greenhouse gases?”. I think Lindzen would still answer yes. But would others? Lindzen isn’t giving up yet, though. He goes on to say:

“These yes-no questions are meaningless when it comes to global warming alarm since affirmative answers are still completely consistent with there being no problem whatsoever; crucial to the scientific method are ‘how much’ questions. This is certainly the case for the above questions, where even most sceptics of alarm (including me) will answer yes.”

Lindzen is positioning his argument in a place that is sometimes called Lukewarm. The position of the Lukewarmers is that all the basic principles behind global warming are true (see Lindzen’s list), along with other things such as the temperature record being accurate and climate modelling being feasible. Where the Lukewarmers disagree with the ‘consensus’ position is on an issue that Lindzen highlights in this chapter.

Climate sensitivity is a key question in the issue of global warming. Put simply, it is the question of how much warming we get if we double the amount of CO2 (and other greenhouse gases) in the atmosphere. The Lukewarm position is that the amount of warming we should get is smaller than currently thought but not zero. If they are right then much of the urgency about climate change is misplaced.

Lindzen describes it like this:
“The term climate sensitivity has come to refer to the equilibrated response of global mean temperature anomaly to a doubling of CO2. Because of the logarithmic dependence of the radiative impact of CO2, it doesn’t matter what the starting value for the doubling is.”
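The point about the starting value can be checked with a couple of lines of Python. This is only a sketch of the idealised logarithmic relationship in the quote. The sensitivity of 3.0 degrees per doubling and the concentrations in parts per million are illustrative numbers of mine, not figures from the chapter; the Lukewarm argument amounts to saying the first number should be smaller.

```python
from math import log2

def equilibrium_warming(co2_start_ppm, co2_end_ppm, sensitivity_per_doubling):
    """Warming (deg C) under a purely logarithmic response to CO2 concentration."""
    return sensitivity_per_doubling * log2(co2_end_ppm / co2_start_ppm)

SENSITIVITY = 3.0  # deg C per doubling: an illustrative value only

print(equilibrium_warming(280, 560, SENSITIVITY))   # one doubling
print(equilibrium_warming(400, 800, SENSITIVITY))   # also one doubling
print(equilibrium_warming(280, 1120, SENSITIVITY))  # two doublings
```

Both single doublings give the same warming whatever the starting concentration, and two doublings give twice as much. The whole argument is then about what the sensitivity value actually is.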

Lindzen points at the instrumental record as evidence that sensitivity is low but concedes that the current belief of many climatologists is that this is misleading. Aerosols from industrial processes (it is argued) have a cooling effect which masks some of the warming.

Lindzen argues otherwise and cites his proposed ‘iris effect’ (R.S. Lindzen, M.D. Chou and A.Y. Hou, “Does the Earth have an adaptive infrared iris?” Bulletin of the American Meteorological Society, Vol. 82 (2001) 417-432) as a possible mechanism that would reduce the positive feedback caused by water vapour. Remember water vapour? This book will flit between authors pointing out what a powerful greenhouse gas water vapour is (to contrast it with the relatively puny CO2) and more astute authors trying to find ways in which water vapour isn’t the problem that the hypothesis of anthropogenic global warming says that it is.

But back to the iris effect. The name is a metaphor – in bright light the human iris changes to reduce the amount of light entering the eye. Lindzen does not argue that the atmosphere literally has an iris but rather that increased warmth leads to increased atmospheric water vapour (as in the standard model) but that vapour then impacts on high-level cloud formation which then limits additional warming – particularly in the tropics. In Lindzen’s model, this iris effect is a powerful negative feedback which offsets the positive feedback from water vapour.

There should be a name for this: perhaps “climate sceptical irony”. It afflicts the small number of so-called climate sceptics who make at least a semi-serious attempt to engage with the science. The problem they have is that simply dismissing the science of global warming leaves a significant set of residual facts, i.e. if anthropogenic global warming isn’t happening then a lot is left inexplicable. Consequently, Lindzen (or somebody like Roy Spencer, who doesn’t appear in this book) has to provide his own pet theory to explain the discrepancy. At that point, having been exhorted to be sceptical of scientific evidence and scientific authorities, we are supposed to accept this new theory as clearly correct. The ironic truth of climate scepticism is that it depends so very strongly on credulity.

There is a lengthy article on Lindzen’s iris effect. It is a hypothesis that was taken seriously and generated interesting lines of inquiry, so it shouldn’t be dismissed out of hand. Lindzen’s work is not crackpottery, but that also doesn’t mean it was correct. Remember Patrick Michaels’s point in Chapter 2 about the Popper-model of the scientific method? Lindzen’s iris hypothesis met the criteria of a genuine scientific hypothesis in that it implies facts about the world that can be tested by observation. And that is what NASA researchers did and what they found was…the iris hypothesis didn’t match observation.

Lindzen has been plugging away ever since and recently there has been at least one ray of hope for the Iris Effect – researchers tweaked some aspects of a climate model and found some indication of an iris effect after changing the right parameters. Yeah, but climate models aren’t ‘evidence’ according to Chapter 1.

Lindzen then goes on to some more standard complaints and oddly devotes a page to a weather map of North America but doesn’t really tie it in with his argument. He complains about claims of extreme weather and then trots out a standard cliche:
“even the term global warming is changed to climate change”
‘Climate Change’ has been a major term for decades – e.g. the last two letters of the IPCC stand for ‘climate change’ and have never stood for ‘global warming’.

Lindzen then digresses into a short discussion of the Milankovitch cycles and then that’s about it.

I feel like the book peaks about here. This was the smartest and most well-argued chapter and yet at best this is Lindzen phoning in old arguments.

EPH & the GOP in the USA

So this post arose out of comments I made here.

Imagine if the US primary season and Presidential election were replaced by the Hugo voting process.

Everybody would get to nominate three candidates that they liked. The total number of nominations would be counted and the top three nominees would become the three finalists. There would then be a general vote in which people picked one of the three finalists (or No Award if they didn’t like the finalists).

Of course in this fantasy world elections would be very different but by the power of magic this change occurs overnight so that the US still has Democrats and Republicans and exactly the same pool of nominees as there are currently (OK maybe not Santorum and Gilmore because they are polling really low and Fiorina as well because 9 is an easier number to work with).
