Category: Statistics

Spotting Fakery?

I previously pointed to an article on people manipulating Amazon rankings for their books. Today there is a bigger brouhaha over whether somebody has manipulated the New York Times bestseller list. The method used (if true) isn’t new, and political books have been prone to this approach before, i.e. buy lots of copies of the book from the right bookshops and head up the rankings.

One thing new to me from those articles was this site: It claims to be a site that will analyse reviews on sites like Amazon and Yelp and then rate the reviews in terms of how “fake” they seem to be. The mechanism looks at reviewers and review content and looks for relations with other reviews, and also rates reviewers who only ever give positive reviews lower. Now, I don’t know if their methods are sound or reliable, so take the rest of this with a pinch of salt for the time being.

Time to plug some things into their machine, but what? Steve J No-Relation Wright has very bravely volunteered to start reading Vox Day’s epic fantasy book because it was available for $0, so why not see what Fakespot has to say about “Throne of Bones”?


Ouch…but to some extent, we already know that the comment section of Vox’s blog is full of willing volunteers ready to do sycophantic stuff and/or trolling/griefing at Vox’s request. Arguably those are genuine reviews, just ones that are hard to distinguish from click-farm fakery. Think of it as a kind of Turing Test, which his right-wing minions repeatedly fail by acting like…well, minions.

How reliable is this? There’s no easy way to tell. As a side-by-side experiment I put in Castalia’s attempt at a spoiler campaign versus the mainstream SF book they were trying to spoil:

Ironically, the reviews that Vox complains about probably improve the Fakespot rating of the reviews, i.e. many negative reviews from people will make the rating of the quality of the reviews better.

[A note of caution: the site doesn’t re-analyse automatically so the analysis you get may be out of date. The initial ratings for those two books were different but changed when I clicked the option to re-analyse]

I also don’t see a way in general for Fakespot to identify fake NEGATIVE reviews, i.e. to show that the poor ratings of a book aren’t genuine. The basic report seems to assume that fake reviews are for the purpose of the seller artificially boosting a book rather than somebody maliciously trying to make a book look bad.



Even More Hugo Wisdening

I’ve never been a fan of cricket but my family were when I was growing up, and there were numerous copies of Wisden in the house, which for those who don’t know of it is best described here. I guess some in the house hoped that I might find it intriguing; I could see the appeal but resisted.

These days we’ve got something better! All the fun of tables of dry numbers PLUS science fiction books! I don’t have a round up of other takes on the numbers yet though.

Normally Brandon Kempner at Chaos Horizon has posted something by now but there’s not been a post there since February. I hope he is OK.

Greg Hullender of Rocket Stack Rank is actually in Helsinki – and having a fun time I hope – so probably won’t post anything yet.

In the comments JJ gave links to three rich sources of data:

The first one is great for seeing EPH in action.


The Black SFF Writer Survey Report

This is an interesting read from FIYAH Literary Magazine. I’ll let the report speak for itself and I’m still digesting it but I’d like to pick up a point they make in the introduction:

“A final note: We know that some usual suspects will attempt to invalidate what we’ve captured by claiming that our analysis lacks rigor, or our methodology was faulty. This is a smokescreen that these individuals use to hide the fact that they are against making the speculative fiction publishing space inclusive and respectful to black writers–all writers, really–and their work. Using assumed (and faulty) scientific expertise to attack the experiences of marginalized people is not a new tactic, and one that is frequently used by these groups in an attempt to maintain the oppressive systems that they believe should solely benefit them. They will never admit that fact so we are making it plain here.”

Strongly worded but a reasonable response given some of the muddleheaded reactions we saw to the Fireside report.

This is not to say that the report is somehow methodologically perfect or has flawless data or answers all questions. Rather, the point is that gathering a complete data picture of an area of study takes time and multiple studies, and is necessarily an iterative process of collecting incomplete data which then informs new surveys and new studies. There is a bootstrap element to all statistical study, e.g. how do you know whether your sample is representative without first having statistical data about the population you are sampling, which you can’t get without doing a representative sample of the population you want to sample? The answer is that *perfection* is unobtainable but *good-enough* is both obtainable and part of an iterative process of gaining knowledge.

So does the report have limitations? Yes, obviously – the writers aren’t omniscient. The question is: does it improve our understanding?

Survey results! Freeped by squirrels


After 77 votes, some of which were rigged, the surprise result was “Maybe its is squirrels who do all the real work around here. Just saying” – which isn’t even grammatically correct and wasn’t even an option initially.

Freeped by squirrels.


[Also: nice graph option there from Survey Monkey. The proportionally divided bar graph is a nice alternative to the pie-chart and is arguably easier to read.]

Margins of error

I suspect most people who read this blog know all this already but I’ve met the same misunderstanding at work recently and also in the context of the opinion polls around the POTUS election. So here is a simplified explanation.

Imagine I have a great big jar of jelly beans, which are the favoured confectionery of probability explanations. There are exactly 500 red jelly beans and 500 blue jelly beans and nothing else – no Jill Stein jelly beans or exotic Evan McMulberry flavours. A jelly bean pollster doesn’t know this, though. The pollster wants to estimate the proportion of red and blue jelly beans in the jar BUT is only allowed to look at some of the jelly beans.

The pollster grabs a handful of jelly beans from the jar and looks at the relative proportion of jelly beans. Naturally, I don’t want the pollster to do this very often because they’ll put their germ-ridden hands all over my beautiful jelly beans. So the pollster only has this handful to look at. They have to make a key assumption – that the jelly beans were well mixed so that their handful is a random pick of jelly beans in the jar.

The pollster looks at the proportion of red to blue jelly beans. Let’s say they have 5 red and 8 blue jelly beans. The pollster says that the proportion of red to blue is 38% to 62% BUT they also report a margin of error that is quite large. They can’t be sure this figure is right because they know they may have been unlucky. With only 13 jelly beans in their handful, it isn’t wholly impossible that they could pick out nothing but blue jelly beans even if the true proportion was 50-50. Note that a handful of nothing but blue could happen purely by chance.
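To put rough numbers on that handful, here is a minimal Python sketch. It assumes the usual 95% confidence level and the normal approximation for a proportion (neither is stated above, and the approximation is crude for a sample of only 13), and the function names are made up for illustration.

```python
import math

def margin_of_error(successes, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation;
    z=1.96 corresponds to a 95% confidence level (an assumption here).
    """
    p = successes / n
    return z * math.sqrt(p * (1 - p) / n)

def prob_all_blue(blue=500, total=1000, handful=13):
    """Chance that a well-mixed handful contains only blue beans.

    Drawing without replacement, so this is a hypergeometric probability.
    """
    return math.comb(blue, handful) / math.comb(total, handful)

# 5 red beans out of 13: the estimate is about 38% red...
moe = margin_of_error(5, 13)
print(f"red share: {5 / 13:.0%} +/- {moe:.0%}")

# ...and an all-blue handful from a 50-50 jar is rare but not impossible.
print(f"chance of 13 blue from a 50-50 jar: {prob_all_blue():.6f}")
```

Under those assumptions the margin comes out at roughly 26 percentage points either way, which is why a handful of 13 says very little, and an all-blue handful from a well-mixed 50-50 jar has odds of a bit worse than 1 in 8,000.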

Margins of error address only this aspect of errors in polling – that the proportion in the sample was to some extent an ‘unlucky’ pick. The reported figure and the margin of error BOTH assume that the picking was done correctly. In our jelly bean example, that is the assumption that the beans were well mixed together.

Now it so happens that I didn’t mix the jelly beans well (although the pollster can’t tell)*. There are actually MORE red towards the top and fewer red towards the bottom of the jar. So the pollster’s assumption was wrong. A clever pollster might try to find ways to deal with this methodologically (e.g. by grabbing beans from both the top and the bottom) but the principle still applies: the reported estimate and the margin of error assume that the sampling methodology was valid. The margin of error doesn’t (and can’t) account for the probability of what in common parlance would be called an ‘error’ (i.e. a mistake).

The Right’s War on Statistics

‘Zero Hedge’ is in a flap about poll ‘oversampling’ here


It even includes a hacked email from John Podesta which discusses ways of ensuring that the Democrats’ own polling oversamples minority groups. Again, gasp!

Except, well, oversampling a smaller demographic group is the right thing to do. When I say ‘right’, I don’t mean just for opinion polls but for collecting statistics on a population in general.

Say you have a representative sample of a population consisting of a thousand people. Now, of that thousand people you are particularly interested in a sub group that represents 1% of the population. If your sample is exactly proportionate, then it should have 10 people belonging to that sub group. Unfortunately 10 is a shitty sample: if you are unlucky enough to get 2 odd people with unusual views, they then form 20% of your sub-sample.

Sample size is a dark art but the easiest issue to understand is that magnitude matters. A good sample size is less about the proportion of the whole population in your sample and more about the raw number of people. More is better, but ‘more’ is subject to diminishing returns.
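A quick Python sketch of those diminishing returns, assuming a 95% confidence level, a simple random sample, a worst-case 50-50 split, and a population much bigger than the sample (all assumptions of mine, as is the function name):

```python
import math

def worst_case_margin(n, z=1.96):
    """Approximate 95% margin of error for a sample of n people.

    Uses the worst case p = 0.5. Note that the size of the overall
    population does not appear anywhere in the formula.
    """
    return z * math.sqrt(0.25 / n)

for n in (10, 100, 1000, 10000):
    print(f"n={n:>5}: +/- {100 * worst_case_margin(n):.1f} points")
```

Each tenfold increase in the sample only shrinks the margin by a factor of about three: roughly 31 points at 10 people, 10 points at 100, 3 points at 1,000 and 1 point at 10,000.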

Oversampling means you can get a better picture of the sub group. However, because you end up with more of group X than you should have, their responses are then weighted down proportionally when looking at the statistics overall.
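As a sketch of how that weighting works, here is a small Python example. All the numbers are invented for illustration: a sub group that is 1% of the population but has been oversampled to 210 respondents, alongside 990 respondents from everybody else.

```python
def unweighted_share(groups):
    """Naive share of 'yes' answers, ignoring the oversampling."""
    total_n = sum(n for _, n, _ in groups)
    total_yes = sum(yes for _, _, yes in groups)
    return total_yes / total_n

def weighted_share(groups):
    """Share of 'yes' answers with each group weighted by its
    true share of the population rather than its share of the sample."""
    return sum(pop_share * yes / n for pop_share, n, yes in groups)

# (population share, sample size, number answering 'yes') - made-up figures
groups = [
    (0.99, 990, 396),  # everybody else: 40% say yes
    (0.01, 210, 168),  # oversampled sub group: 80% say yes
]

print(f"unweighted: {unweighted_share(groups):.1%}")
print(f"weighted:   {weighted_share(groups):.1%}")
```

With these made-up figures the unweighted number is 47.0%, which overstates the sub group’s influence, while the weighted number is 40.4%. The sub group’s own 80% figure now rests on 210 people rather than 10.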

Are polls manipulated? Well, if by ‘manipulated’ you mean ‘use statistics’ then yes.

The EPH Analysis

An analysis of proposed new Hugo voting rules is out. It’s disappointing to some but I think it validates the change to EPH.

The story so far:

In response to the Sad Puppy/Rabid Puppy slate of the 2015 Hugo Awards, a voting system called E Pluribus Hugo was proposed and passed at the 2015 Worldcon Business meeting. The system used a process of weightings and elimination rounds to make the nomination process have more proportionality without changing the basic mechanics of how people nominate things.

Much thought and tinkering was put into EPH but what it lacked was real data. EPH should make the list of finalists more proportional to the underlying groupings of voters. However, that meant that the impact of EPH couldn’t really be known without knowing to what extent Hugo voters clustered around choices anyway. Without slates, do Hugo voters form natural groupings (perhaps along sub-genres or sub-fandoms) or are they just a noisy mess of stuff? Without real data there is no way of knowing.

While EPH was passed at the 2015 Worldcon Business meeting, it requires ratification this year to come into effect. As part of that process an analysis of the 2015 and 2014 nomination ballots has been done and the results are just out…

What it all means…

I don’t know. No, that isn’t a useful reaction. OK, I’ll try again.

Below is a list of possible talking points, reactions, counter-argument things. I made them up. They don’t necessarily reflect actual people’s views (I’ll say when they do). Bold represents a possible reaction (not mine) and not bold is my response.

I’m also a hostage to fortune because more results are coming – post the Hugo ceremony, data on the 2016 nominations will come out and who knows what that will show.

For a different take try Nicholas Whyte

2015 results show that EPH doesn’t fix the slate problem!

No one thing can fix that problem. However, in most categories, at least one additional non-slated work made it onto the ballot with EPH. That means, probably, instead of No Award winning several categories in 2015, a worthy finalist would have won instead.

EPH+No Award together produce a strong disincentive to puppy-style slates. Slate voting will produce legitimate votes and so is bound to have some impact. The combination of EPH and No Award means that a slate will find it hard to sweep a category and win a Hugo.

EPH doesn’t stop those slate-inclined who just want to get to be a finalist and don’t care about winning!

True, but that was a given. Get enough votes and you get to be a finalist. EPH does demonstrably reduce the chance of that succeeding for a slate of nominees but it doesn’t do anything about a single nominee. Again, get enough votes and you get to be a finalist. The only guaranteed way of stopping that is to create a wholly different kind of award.

There is a non-puppy related change in 2015 Best Graphic Story!

That is interesting. With EPH, instead of Sex Criminals 1 getting nominated, Schlock Mercenary gets to be a finalist.

Sex Criminals got 60 noms in total and Schlock Mercenary got 51. However, Sex Criminals must have been more clustered with other nominees (such as Saga?) and hence lost out a bit to Schlock Mercenary.

With only one slate nominee, this was an interesting category. I liked Sex Criminals, but I think this is a positive demonstration of EPH. It should result in more variety of nominees without slates.

They didn’t include Best Dramatic Presentation!

The reason the report gives is this:

In testing, it was identified that the results in two categories (Dramatic Presentation, Long and Short Form) were usually producing results with many nominators submitting matching entries to other nominators. This was more due to the smaller pool to nominate from compared to other categories than any external coordination of nominating ballots. As such, we decided to produce results with these categories excluded, as changes in the dramatic presentation categories aren’t as useful for gauging if EPH is acting as appropriate where desired as the other categories would be.

That seems silly to me. There are lots of reasons to expect more organic coordination of ballots in these categories, and seeing how EPH works in that circumstance is useful as a way of comparison.

I hope they change their minds at some point.

A single coordinated minority of less than 20% would still average controlling over 80% of the ballot!

Aside from the exclamation mark, that is a direct quote from the report. This appears to be true but controlling only 80% of the ballot is enough to kill Puppy-style slates without having No Award win multiple categories.

Killing the incentive to use the 2015 Puppy slate tactic is what EPH needs to do. It will do that.

EPH+ would be better!

Probably yes, but I don’t know if other side effects (see below) would be worse.

2015 Puppy-style slates are last year’s problem. EPH doesn’t deal with THIS year’s problem!

True. However, the structural weakness of the Hugo voting system exists regardless and the cat is out of the bag. Others can try to game the Hugo Awards in the same way and perhaps more covertly.

As for the griefing style tactics of Vox Day, I think that needs a qualitatively different approach but that is an argument for another day.

EPH knocks out a potential winner in 2014!

There are few changes to finalists with the 2014 data. I think that confirms that without slates EPH will tend to deliver similar results to the current system. However, there is no guarantee that the results will be exactly the same.

In 2014 three results are notable.

  • Firstly Best Editor Short has a swap of finalists in the last spot – Sheila Williams (86) swaps with Bryan Thomas Schmidt (80).
  • Pro artists also has some swaps, partly because in 2014 a tie for fifth place meant 6 nominees. Essentially four artists with 50, 49, 49 and 48 nominations end up with a different ordering with EPH. The EPH ranking ends up as 49, 48, 49, 50 and that looks fine to me because I have that special kind of innumeracy that results from being overly numerate.
  • Fancast has the most understandable change but also the most problematic. In 2014 this was a three-way tie for last finalist at 35 nominations each. EPH breaks the tie and resolves the issue with a single nominee. Unfortunately, one of those three (SF Signal Podcast) won and would have been eliminated by EPH.

The thing is these are all pretty much very close votes with smaller numbers of voters. Anything different about 2014 would probably have resulted in different outcomes. For the Fancast result, an internet outage or a sick cat could have ended up with a different result. The least error in collating the data could have ended up with different results.

Put another way: Hugo voters did not have a clear consensus of which of these people/works should have been nominated. These cases are not good arguments against EPH.

Yes, but, but EPH+ might make that problem worse!

I’ve really no idea. I guess it might broaden what we might think of as a marginal tie and lead to more notable discrepancies between the number of nominating ballots and grabbing that last spot in the finalists. I don’t know.

The current system doesn’t avoid this issue, it really just hides it. For some categories, there are finalists who we really can’t say are substantially more nominated than others. The differences are small enough to be down to happenstance. And yes, some of those may actually end up being winners.

I think the answer is that the number of nominees needs to be more flexible than just 5. However, it is unclear what the rules should be for when to expand the number of nominees beyond an exact tie.

Where nominator coordination is not present, there are still significant numbers of changes not only to the long list, but to ballots where it’s not generally considered for anything untoward to have happened. Items removed from the 2014 ballot included a winner of the Hugo. Had EPH been in place, they would not have been on the ballot.

That is a direct quote from Dave McCarty’s conclusion on the report. Sorry, but that is a flawed counterfactual. If we could somehow rewind the tape back to early 2014 and re-run the 2014 nomination ballot again, how likely is it that we’d have ended up with that exact tie that occurred? EPH changed the result because it broke a tie and the other places where there were changes were also spots with very close votes.

Almost ANY change would have meant that something slightly different would have happened! For SF Signal not to have been a finalist required only ONE nominator’s vote to be different.

The changes to the Ballot and Long list are not easily verified and for people reviewing the detailed results at the end the only way to check that the process is working correctly would require access to secret nomination data and significant time.

That’s Dave McCarty again. Well, ANY verification of results needs access to ballots. Given Dave McC is worried about the 2014 Fancast result shifting by possibly one vote, to verify the CURRENT process would require checking that ballots had been classified correctly and counted correctly.

Assuming the underlying ballot data is correct (i.e. everybody’s nominations have been correctly collated) and in a machine readable form (e.g. a text file or spreadsheet), the EPH check takes seconds. Don’t trust the EPH program you are using? Use a different one and see if you get the same results. EPH is not hard to code: I made an Excel version that only uses standard Excel formulas and NO extra code at all.
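To give a feel for how little code is involved, here is a simplified Python sketch. It is not my Excel version and not the official procedure. It follows EPH as I understand it: each ballot’s single point is split equally among its surviving nominees, the two nominees with the fewest points are compared, and the one appearing on fewer ballots is eliminated. The tie handling here is a simplifying assumption of mine (ties are broken by points and then by name), and the function name and toy ballots are invented for illustration.

```python
from collections import defaultdict
from fractions import Fraction

def eph_finalists(ballots, slots=5):
    """Simplified EPH-style elimination (tie handling is NOT the official rule).

    ballots: list of sets of nominee names, one set per nominator.
    """
    remaining = set().union(*ballots)
    while len(remaining) > slots:
        points = defaultdict(Fraction)  # shared-out points per nominee
        counts = defaultdict(int)       # raw number of ballots per nominee
        for ballot in ballots:
            live = [w for w in ballot if w in remaining]
            for w in live:
                points[w] += Fraction(1, len(live))
                counts[w] += 1
        # Selection: the two nominees with the lowest point totals.
        a, b = sorted(remaining, key=lambda w: (points[w], w))[:2]
        # Elimination: of those two, drop the one on fewer ballots.
        loser = min((a, b), key=lambda w: (counts[w], points[w], w))
        remaining.remove(loser)
    return remaining

# Toy example: four slate ballots versus five independent nominators.
ballots = [{"S1", "S2"}] * 4 + [{"A"}] * 3 + [{"B"}]
print(sorted(eph_finalists(ballots, slots=2)))
```

On a straight count the two slate works (4 nominations each) would take both spots. In this sketch the slate still gets one finalist but “A”, with 3 nominations, gets the other.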

So, yes, cleaning the nomination data and getting it all tickety-boo takes time – without a doubt BUT if we wanted to verify that the results DON’T CHANGE under the CURRENT process YOU WOULD STILL NEED TO DO THAT.