Larry Correia is a better writer than John C. Wright

Oh, controversial stuff today! This isn’t about fiction writing though. Actually, it is about fiction but not fiction writing as a type of writing. Yeah, I’m still on about the Maricopa election “audit”. Hey, I invested time reading that report, so you all have to listen to me go on about it a bit more.

My panel of right-wing commentators (aka the writers formerly known as Puppies) have naturally been talking about the “audit” of the US Presidential election results in Maricopa County, Arizona. My own earlier discussion is here.

So far four former pups have expressed their opinion. Vox Day has once again given up on electoralism and is dismissive of the whole project — he’s far more excited by the Chinese Communist Party these days. That leaves three of them who have leapt to the defence of the so-called CyberNinjas:

Grant leads with a screenshot of a claim about the proportion of legible signatures found on ballots in the “audit”. I can’t find that claim anywhere except the blog he links to, and I don’t think there’s anything like it in the report. I may be mistaken though, as the report PDF is hard to search (the actual report is a mix of text sections and images of text). Aside from that, his post is mainly invective about fraud etc. etc. and avoids the details of the report.

Wright quotes some numbers before launching into a tortured analogy:

“In this case, the audit found:
• 3,432 more ballots cast than voters listed as having cast a ballot.
• 277 Precincts show more ballots cast than people who showed up to vote (VMSS) for a total of 1,551 excess votes.
• 9,041 more mail-in ballots returned than they were mailed out.
In sum, the fraudulent ballots alone total 50,252, whereas Biden’s margin was 10,457, roughly one fifth that total.”

Of course, the actual report didn’t identify ANY fraudulent ballots. The specific cases Wright highlights are interesting. He picks one of the “High” rated issues and two “Medium” rated issues. What they have in common is that they are the more opaque issues in terms of what the company did and what the numbers actually represent. The largest of those three (the only “High” one) does have an explanation:

Wright naturally overcooks his point so he can go off on a string of “the falsehood is so brazen, so insolent, so vituperative, and so ubiquitous”.

So why am I saying Larry Correia is a better writer? Compare and contrast. Like Grant and Wright, Correia leads with the (correct) observation that the press coverage led with the audit’s outcome: Biden still beat Trump. He then continues:

“As I scrolled through dozens of these, I realized that none of them actually said what was in the audit report. Nor were there any links to the actual audit report. As a guy who used to write audit reports I’d rather read the actual document than take some journalism major’s take on it.”

Well, that’s surely very reasonable! Don’t just trust the headlines, go and read the actual report. Makes sense to me. What a calm, rational guy he is! He then continues:

“Except, the second part they aren’t talking about is… are those votes all actual legal votes? And the answer is possibly not (why possibly? I’ll get to that). Then see all those bullet points of problems, weirdness, and fuckery. Which comes down to there being about five times as many questionable votes as Biden’s margin of victory (for the state, in this one county).”

He’ll get to that, you see…when he gets into the details of the claims…which, well, he never does. Essentially, he just repeats Wright’s claim about the total number of votes the report raised questions about, without discussing what the numbers are. It’s basically the same nonsense as Wright, packaged in a more considered tone and with the added spin of authority.

Just under half of these supposedly “questionable votes” come from one category: people who may have changed address during the election.

Fractal misinformation but presented in three different styles (or maybe two and a half different styles). As for the mainstream press not digging into this further, I think they made the right call. The main takeaway remains that Biden won; the CyberNinja report attempts to cloud that finding, but when you dig further…Biden still won.

Ninja FUD in Arizona

Cast your mind back to the closing weeks of 2020 and in the US the right was all aflutter about electoral fraud i.e. not at all coping with losing. I’ve covered the extent to which US elections are impacted by fraud before and the answer is lots-and-lots-and-none-at-all. The lots are overt and technically legal and come in the form of gerrymandering and voter suppression. It’s fraud because it is a systematic effort to distort the results of elections so that people who do not have the support of most eligible voters win elections. All election systems have flaws but if you put your effort into making those flaws worse for your own advantage then I have no issue calling that fraudulent, at least morally if not legally.

Putting that aside, the issue of in-person voter fraud and similar shenanigans is rare in the US, largely focused on local elections and (usually) has little impact. Past coverage of the issue prior to 2020 can be read here:

Of course, November 2020 brought fresh claims of voter fraud when Donald Trump was beaten by Joe Biden in the Presidential election. Those claims got quite wild, with all sorts of nonsense from misapplication of Benford’s Law to absurd claims about voting machines, a supposed military “raid” in Germany (wholly made up it seems) and at least one Kraken. What was missing at that point was ninjas.

Amid these attempts to deny reality, those states that swung the electoral college numbers in Biden’s favour received the most attention. Arizona was one of those states, and within Arizona, the populous Maricopa County was of particular interest because it sits electorally and demographically as a place shifting from Republican to Democratic.

As a consequence of the desire to change reality, Arizona Senate Republicans hired private contractors to conduct an audit of Maricopa County and things only got stranger from there. The company, calling itself “CyberNinjas”, at least added a cyberpunk theme to the process but aside from that, approached the process in a manner that generously could be called “sloppy”.

The audit itself was a bit of a circus but apparently it was sufficient to convince Donald Trump that it would lead to him being re-instated as US President by August 2021. (Observant readers will have noticed that Donald Trump was not re-instated as US President last month.)

Fast forward to this week. The CyberNinjas report was leaked ahead of its public reveal and, surprise, surprise, Joe Biden beat Donald Trump in Maricopa County…which, of course, we all already knew. In fact, in the CyberNinjas’ recount Biden had more votes but…let’s face it, that’s likely an error on their part in some way. This was not a group that inspired confidence.

Of course, the point of the audit was not to come up with a different total than the previous recounts but to either find a ‘smoking gun’ of electoral shenanigans or, failing that, just generally cast doubt on the results. That Biden won (again) carries some amusement value, but the substantial effort by the GOP was to use the audit report to claim that the results were in some vague way not wholly legitimate. Which is what they were doing beforehand anyway, but now they have spent a lot of money and can do it again.

The GOP spin on the report is a claim that 40 thousand votes, far more than Biden’s margin in the county, are somehow dubious. Interestingly, the CyberNinjas’ report is more equivocal. It does list a whole pile of things, but looking at the points in detail reveals a whole pile of vague hand waving. You can read the report here (and an archive version here).

So what’s this 40K+ that the right is touting? The report breaks down 22 issues and the number of ballots impacted by each. The issues are presented with titles and a rating from “Critical” to “Low”. The emphasis from the right is on the names of the issues rather than (a) the actual numbers, (b) what those numbers actually indicate, (c) whether those numbers are in any way correct, or (d) whether they changed the result. The idea is really just to get a figure big enough that Biden’s margin in this one county can be called doubtful in some sense, which helps fuel further voter suppression policies.

The single biggest issue highlighted by the report is the ominous-sounding “5.3.1 Mail-in ballots voted from prior address”, which is the only issue rated “Critical” in the report. According to the CyberNinjas, this numbers 23,344 ballots i.e. about half of the supposed 40K. Digging into the details, the issue is primarily people who moved house WITHIN Maricopa County between receiving a mail-in ballot and posting it. Hmmm. OK, sure, not even remotely something indicating mass electoral fraud but possibly in breach of the actual rules…except…it isn’t really 23,344 ballots where that actually happened, it’s just 23,344 ballots where maybe that’s what happened.

“Mail-in ballots were cast under voter registration IDs for people that may not have received their ballots by mail because they had moved, and no one with the same last name remained at the address. Through extensive data analysis we have discovered approximately 23,344 votes that may have met this condition.”

The ‘extensive data analysis’ was a comparison with a third-party address validation tool of the kind used by companies to validate addresses on their direct marketing lists. So some proportion of those would be false positives in terms of an actual change of address, and even more would be false positives for a change of address within the window where it would have been a problem. The ‘audit’ did not actually confirm a single one of these ballots as being a problem. Nor did the report in any way connect this issue with any indication of systematic fraud; indeed, taking the claim at face value, it was eligible voters voting but with not wholly up-to-date details.

In short, it is largely smoke but this one issue bulks up the numbers.

The next highest issue is “5.4.1 More Ballots Returned By Voter Than Received” with 9,041 ballots ‘impacted’. Again, the title doesn’t describe the actual thing found but the potential inference that could be drawn from the discrepancy. The idea being that, either intentionally or through sloppy reporting, the whole “maybe” aspect of the report gets skipped over.

The actual substance of the figure is where there are discrepancies between the number of ballots sent to a person and the number returned, e.g. somebody was sent one mail-in ballot but two were received. Note also “received”, not “counted”; the report assumes only one ballot was counted. In addition, the report isn’t entirely sure what its figures actually indicate, noting:

“NOTE: We’ve been informed shortly before the release of this report that some of the discrepancies outlined could be due to the protected voter list. This has not been able to be validated at this time, but we thought it was important to disclose this information for accuracy.”

But…OK, follow the chain of maybes down the line and there’s at least a possibility of some fraction of that 9,041 being people who voted twice (although probably only counted once). Might that impact the results? The report provides a table that breaks down the approximately nine thousand by party registration.

  • Democrat [sic] Party 34.4%
  • Republican Party 30.4%
  • Prefer not to declare 30.1%
  • Independent 3.7%
  • Libertarian Party 1.3%

So we are well into fractions of fractions of maybes.

I won’t cover every point but the next highest was “5.4.2 VOTERS THAT POTENTIALLY VOTED IN MULTIPLE COUNTIES” with 5,295 votes, and this one is more of a classic. The CyberNinjas matched first, middle and last names AND year of birth across voter records to find duplicates. They found 10,342 votes out of the 2,076,086 votes actually counted in the election.

“Comparing the Maricopa County VMSS Final Voted File to the equivalent files of the other fourteen Arizona counties resulted in 5,047 voters with the same first, middle, last name and birth year, representing 10,342 votes among all the counties. While it is possible for multiple individuals to share all these details, it is not common although the incidence here (roughly one-third of one percent) may be the rate of commonalities in identifying information between legitimate, separate individual voters especially with common last names.”

Yes, it may well be the actual rate of commonalities, and if I were paying for this report that ACTUAL rate (or a research-based estimate of it) is something I’d expect to see in that paragraph. It’s unlikely that two people would share all those identifying features, but the proportion they found was also very small…which is what you would expect. This extensive data analysis discovered that a rare thing was rare.

These three issues by themselves (those rated “High” or “Critical”) account for 37,680 of the ballots that the propaganda spin is claiming are in some way evidence of fraud or potential fraud. The report itself makes more moderate claims about those figures and yet even those more moderate claims are poorly substantiated.

The issues with smaller figures have much the same issues. Name matching (e.g. of 282 possibly deceased people) that may or may not be accurate, a lack of clarity on what the figure might indicate and no obvious connection with any kind of systematic fraud.

Even taking the dubious report at face value, the broader narrative of some kind of extensive fraud by the Democratic Party (or the Deep State or satanic cultists or whoever is supposed to be conspiring today) is more disproven by the report than supported. A portion of Arizona residents moving house as part of a plot to steal an election makes no sense, but then none of the conspiratorial plots mooted in the wake of Trump’s defeat made any sense.

The details of the report won’t matter though. You’ll be getting sound bites of 40 thousand bad ballots in Arizona for literally years after this even though the actual report, dodgy as it is, doesn’t even support that figure.

Oh, and a little twist in the story. Do you remember Benford’s Law? Well, if you check the leading digits of the figures in the CyberNinja report (page 5), the most common leading digit is 2, not 1. Of course, given the data there’s no reason why you should expect it to follow Benford’s law, but for all those people who were claiming that any departure from the rule is sure evidence of fraud…well…OK, those people don’t believe in logical consistency anyway.

Today’s Infographic: Pie Charts of James Bond Movie Theme Singers

I was wondering what proportion of Bond theme songs were sung by somebody Welsh. Quite a few because Shirley Bassey sang several and Tom Jones sang one. There are a few basic questions about how to count things along the way though.

Weirdly when I export the graph Washington disappears as a label

Firstly, do you count Shirley Bassey once or three times? Secondly, Jack White and Alicia Keys are joint singers of Another Way to Die from Quantum of Solace. Do they both count as 1 or should they each get 0.5? Thirdly, A-ha, Garbage and Duran Duran (and to a lesser extent Wings) are listed on Wikipedia as the artists for their respective Bond theme songs, so should the whole band count or just the lead singer? I went with the lazy option: Bassey gets counted for each song, White and Keys are counted separately, and for the bands only the lead singer counts (which is a net gain for Scotland). Lastly, as far as I’m aware, Norway doesn’t have an equivalent state-like subdivision.

Here’s the national breakdown.

If Norway allies with the UK then they can avoid being outvoted by the USA.

And more right wingers talking nonsense about Benford’s Law update

It seems I was too kind to Larry Correia in my first post about the pro-Trumpist misleading claims about Benford’s Law. He actually is still pushing it as supposed evidence of election fraud.

“Basically, when numbers are aggregated normally, they follow a distribution curve. When numbers are fabricated, they don’t. When human beings create what they think of as “random” numbers, they’re not. This is an auditing tool for things like looking for fabricated invoices. It also applies to elections. A normal election follows the expected curve. If you look at a 3rd world dictatorship’s election numbers, it looks like a spike or a saw.

There’s a bunch of different people out there running the numbers for themselves and posting the results so you can check their math. It appears that checking various places around the country Donald Trump’s votes follow the curve. The 3rd party candidates follow the curve. Down ballot races follow the curve. Hell, even Joe Biden’s votes follow the curve for MOST of the country. But then when you look at places like Pittsburgh the graph looks like something that would have made Hugo Chavez blush.”

On Twitter I noted that far-right extremist Nick Fuentes is also pushing not just the misleading claims about Benford’s Law but a false claim that Wikipedia “added” criticism of its use in elections to discredit the claims being made about the 2020 general election. As I pointed out in this post, the rider that Benford’s Law’s use with electoral data was limited had been there for years. Rather than pro-Biden supporters adding it, Trump supporters removed the sentence and references in a bid to hide the fact that their analysis was flawed. You can read a 2013 version of the page here.

Since then, the section on Benford’s Law in elections has expanded into a mini-essay about its use and limitations.

I don’t have a source for the 2020 precinct-level data that some of these graphs are using. I’m certain that there will be both Benford and non-Benford like distributions for Trump and Biden in various places. I do have county-level data for 2016 and 2020 from here.

The analysis is trivial to do in a spreadsheet. Grab the first character and then tabulate it with a pivot table. You can explore various candidates from Bush to Biden on a Google Sheet I made here.
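If spreadsheets aren’t your thing, the same tabulation takes a few lines of Python. A minimal sketch, using made-up placeholder vote counts rather than the actual county data:

```python
from collections import Counter

def leading_digit_counts(values):
    # Grab the first character of each count and tally it,
    # which is what the pivot table does in the spreadsheet.
    counts = Counter()
    for v in values:
        s = str(abs(int(v)))
        if s != "0":
            counts[s[0]] += 1
    return counts

# Placeholder vote counts standing in for a column of county results.
votes = [3180, 3188, 5403, 4070, 1200, 187, 25, 14093]
print(leading_digit_counts(votes))
```

Feed in a real column of vote counts and compare the tallies to Benford’s predicted proportions.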

Here, for example is Donald Trump in Alaska in 2016:

When you look at the district sizes in Alaska and consider Trump’s proportion of the vote, it becomes obvious very quickly that it would be absurd for this data to follow Benford’s Law. Here are the first four (of 40) districts.

District     Trump Votes   Total Votes   Percentage
District 1   3,180         6,638         47.91%
District 2   3,188         5,492         58.05%
District 3   5,403         7,613         70.97%
District 4   4,070         9,521         42.75%

Trump’s vote in four Alaskan districts in 2016

We have leading digits of 3, 5 and 4 and no 1s. Why? Because to get leading digits of 1 Trump’s votes would need to be proportionately much smaller! For example, if he’d only got 20% of the vote in District 1 then that would result in some 1s. In some of the examples being passed around Trumpist circles, that is one of the reasons for Benford-like graphs: they’ve picked places where Trump’s vote was proportionately low, pushing it into a range where 1s were common as a leading digit.
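You can see the arithmetic of that point with a quick sketch using the district totals from the table above (the 20% share is hypothetical, purely for contrast):

```python
# District totals from the four Alaskan districts; multiply by a vote
# share and look at the first digit of the resulting vote count.
totals = [6638, 5492, 7613, 9521]

def leading_digits(share):
    return [str(int(t * share))[0] for t in totals]

print(leading_digits(0.48))  # near Trump's actual share: no leading 1s
print(leading_digits(0.20))  # hypothetical 20% share: all leading 1s
```

The same turnout numbers produce completely different leading digits depending only on the candidate’s share of the vote.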

The mechanics of the deception here are fascinating. There’s an initial plausibility (Benford’s Law is a real thing, is actually used to detect fraud, and has been applied to elections), a lack of any critical thinking (the examples being circulated are very limited, and there’s no comparison with past elections to see what is normal), and then active deception (long-standing academic critiques of applying Benford’s Law to election data being actively deleted from online wikis). On that latter part, we know the more extreme white nationalist right (Fuentes, Vox Day) are active in attempting to suppress information on how Benford’s Law applies to election data. Providing the usual smokescreen and an aura of legitimacy are the convenient idiots for neo-Nazis, such as Larry Correia, who repeat the propaganda as ‘just asking questions’.

More far-right deception about Benford’s law

I discussed Benford’s Law and its misleading use with election data yesterday. What I didn’t mention is that the far-right vanity version of Wikipedia, known as Voxopedia aka “Infogalactic”, is actively censoring information about it.

Like many articles on the out-of-date semi-vandalised wiki, the Benford’s Law article [archive version] started as a clone of the authoritative Wikipedia version in 2016. It remained unedited until 7 November, when it was hastily edited.

What was the edit? This part was removed:

“However, other experts consider Benford’s Law essentially useless as a statistical indicator of election fraud in general.”

Along with that sentence, two supporting references were removed: Joseph Deckert, Mikhail Myagkov and Peter C. Ordeshook (2010), “The Irrelevance of Benford’s Law for Detecting Fraud in Elections”, Caltech/MIT Voting Technology Project Working Paper No. 9; and Charles R. Tolle, Joanne L. Budzien, and Randall A. LaViolette (2000), “Do dynamical systems follow Benford’s Law?”, Chaos 10(2), pp. 331–336, doi:10.1063/1.166498.

Edit to Voxopedia by “Renegade” 12:49 7 November

Here is an image of the change. Note this is the ONLY edit that has ever occurred to the page on Voxopedia.

Over at the real Wikipedia, the same page has been subject to deceptive editing also. References to the failure of Benford’s Law to detect fraud in elections have been removed and then re-instated. Note, that prior to the US 2020 election, these references were present. The attempt to remove them occurred AFTER the far-right claims that Benford’s Law could prove fraud (e.g. from Larry Correia and Vox Day) started circulating.

The paper that extremists on the right are trying to hide from people is this one [archive pdf]. The Abstract states:

“With increasing frequency websites appear to argue that the application of Benford’s Law – a prediction as to the observed frequency of numbers in the first and second digits of official election returns — establishes fraud in this or that election. However, looking at data from Ohio, Massachusetts and Ukraine, as well as data artificially generated by a series of simulations, we argue here that Benford’s Law is essentially useless as a forensic indicator of fraud. Deviations from either the first or second digit version of that law can arise regardless of whether an election is free and fair. In fact, fraud can move data in the direction of satisfying that law and thereby occasion wholly erroneous conclusions.”

The Irrelevance of Benford’s Law for Detecting Fraud in Elections, by Joseph Deckert, Mikhail Myagkov and Peter C. Ordeshook, University of Oregon and California Institute of Technology

The paper discusses examples and shows (as we discussed yesterday) how election data can show both Benford-like and normal-like distribution of digits.

It can be difficult to tell the extent to which the far-right is knowingly lying versus simply not caring about the truth versus engaging in active self-deception. All three forms of subverting the truth can be in play when we look at past examples. However, we have here an unambiguous example of active lying. Day and at least one of his minions were already aware that Benford’s Law is a poor tool for detecting fraud in elections and have been actively trying to hide that information from his followers.

I Guess I’m Talking About Benford’s Law

The US Presidential Election isn’t just a river in Egypt, it is also a series of bizarre claims. One of the many crimes against statistics being thrown about in what is likely to be a 5 year (minimum) tantrum about the election is a claim about Benford’s law. The first example I saw was last Friday on Larry Correia’s Facebook[1]

“For those of you who don’t know, basically Benford’s Law is about the frequency distribution of numbers. If numbers are random aggregates, then they’re going to be distributed one way. If numbers are fabricated by people, then they’re not. This is one way that auditors look at data to check to see if it has been manipulated. There’s odds for how often single digit, two digit, three digit combos occur, and so forth, with added complexity at each level. It appears the most common final TWO digits for Milwaukee’s wards is 00. 😃 Milwaukee… home of the Fidel Castro level voter turn out. The odds of double zero happening naturally that often are absurdly small. Like I don’t even remember the formula to calculate that, college was a long time ago, but holy shit, your odds are better that you’ll be eaten by a shark ON LAND. If this pans out, that is downright amazing. I told you it didn’t just feel like fraud, but audacious fraud. The problem is blue machine politics usually only screws over one state, but right now half the country is feeling like they got fucked over, so all eyes are on places like Milwaukee.I will be eagerly awaiting developments on this. I love fraud stuff. EDIT: and developments… Nothing particularly interesting. Updated data changes some of the calcs, so it goes from 14 at 0 to 13 at 70. So curious but not damning. Oh well.”

So after hyping up an idea he only vaguely understood (Benford’s law isn’t about TRAILING digits for f-ck sake, and SOME number has to be the most common) Larry walked the claim back when it became clear that there was not very much there. As Larry would say, beware of Dunning-Krugerands.

The same claim was popping up elsewhere on the internet and there was an excellent Twitter thread debunking the claims here:

footnote [2]

But we can have hierarchies of bad-faith, poorly understood arguments. Larry Correia didn’t have the integrity to double-check the validity of what he was posting before he posted it, but at least he checked afterwards…sort of. Vox Day, however, has now also leaped upon the magic of Benford’s law [3]

Sean J Taylor’s Twitter thread does a good job of debunking this but as it has now come up from both Sad and Rabid Puppies, I thought I’d talk about it a bit as well with some examples.

First of all, Benford’s law isn’t much of a law. Lots of data won’t follow it and the reason why some data follows it is not well understood. That doesn’t mean it has no utility in spotting fraud, it just means that to use it you first need to demonstrate that it applies to the kind of data you are looking at. If Benford’s Law doesn’t usually apply to the kind of data you are looking at but your data does follow it, then THAT might be a sign of something going on.

That’s nothing unusual in statistics. Data follows distributions and comparing data against an applicable distribution that you expect to apply is how a lot of statistics is done. Benford’s law may or may not be applicable. As always, IT DEPENDS…

For example, if I grab the first digit of the number of Page Views on Wikipedia of Hugo Award finalists [4] then I get a set of data that is Benford like:

The most common digit is 1, as Benford’s law predicts. The probability of a leading digit d under the law is log10(1+1/d), which for d = 1 is about 30%. Of the 1241 entries, Benford’s law would predict 374 would have a leading digit of 1 and the actual data has 316. But you can also see that it’s not a perfect fit and we could (but won’t bother, because we actually don’t care) run tests to see how good a fit it is.
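For anybody wanting to check that arithmetic, here is a quick sketch of Benford’s predicted probabilities and expected counts (the 1241 total is the number of entries mentioned above):

```python
import math

def benford_p(d):
    # Benford's predicted probability that the leading digit is d.
    return math.log10(1 + 1 / d)

# Expected counts for 1241 entries under Benford's Law.
expected = {d: round(1241 * benford_p(d)) for d in range(1, 10)}
print(expected[1])  # 374, the figure quoted above
```

The nine probabilities sum to 1, as they must, since every number has some leading digit from 1 to 9.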

But what if I picked a different set of numbers from the same data set? Here is the leading digit for the “Age at Hugo” figure graphed for the finalists where I have that data.

It isn’t remotely Benford like and that’s normal (ha ha) because age isn’t going to work that way. Instead the leading digit will cluster around the average age of Hugo finalists. If the data did follow Benford’s law it would imply that teenagers were vastly more likely to win Hugo Awards (or people over 100 I suppose or both).
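A quick simulation shows the effect. These are randomly generated ages clustered around an assumed mean of 45, not the actual finalist data:

```python
import random

random.seed(1)
# Invented ages clustered around 45 with a spread of 12 years,
# standing in for the "Age at Hugo" column (not the real data).
ages = [max(15, int(random.gauss(45, 12))) for _ in range(1000)]
digits = [str(a)[0] for a in ages]
share_1 = digits.count("1") / len(digits)
share_345 = sum(digits.count(d) for d in "345") / len(digits)
print(share_1, share_345)  # 1s are rare; 3s, 4s and 5s dominate
```

Because the ages cluster in the 30s, 40s and 50s, the leading digits cluster there too, and the 1s that Benford’s law predicts would need teenagers or centenarians to supply them.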

Generally you need a wide spread of numbers across magnitudes. For example, I joked about Hugo winners in their teens or their centuries, but if we also had Hugo finalists who were 0.1… years old (and all ages in between) then maybe the data might get a bit more Benfordish.

So what about election data? ¯\_(ツ)_/¯

The Twitter thread above cites a paper entitled Benford’s Law and the Detection of Election Fraud [5], but I haven’t read it. The abstract says:

“Looking at simulations designed to model both fair and fraudulent contests as well as data drawn from elections we know, on the basis of other investigations, were either permeated by fraud or unlikely to have experienced any measurable malfeasance, we find that conformity with and deviations from Benford’s Law follow no pattern. It is not simply that the Law occasionally judges a fraudulent election fair or a fair election fraudulent. Its “success rate” either way is essentially equivalent to a toss of a coin, thereby rendering it problematical at best as a forensic tool and wholly misleading at worst.”

Put another way, some election data MIGHT follow Benford’s law sometimes. That makes sense because it will partly depend on the scale of the data we are looking at. For example, imagine we had voting areas of approx 800 likely voters and two viable candidates: would we expect “1” to be a typical leading digit in vote counts? Not at all! “3” and “4” would be more typical. Add more candidates and more people and things might get more Benford like.
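A quick simulation makes the point. The precinct sizes and vote shares here are invented for illustration, not drawn from any real election:

```python
import random
from collections import Counter

random.seed(42)
# Invented precincts of roughly 800 voters split between two candidates,
# each candidate taking between 35% and 65% of the vote.
digit_counts = Counter()
for _ in range(10_000):
    turnout = random.randint(700, 900)
    share = random.uniform(0.35, 0.65)
    for votes in (int(turnout * share), turnout - int(turnout * share)):
        digit_counts[str(votes)[0]] += 1

print(digit_counts.most_common(3))  # 2s through 5s dominate; no 1s at all
```

With those parameters every vote count falls between the mid-200s and the high 500s, so a leading “1” literally cannot occur, never mind occur 30% of the time.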

Harvard University has easily downloadable US presidential data by State from 1976 to 2016 [6]. At this scale and with all candidates (including numerous 3rd, 4th party candidates) you do get something quite Benford like but with maybe more 1s than expected.

Now look specifically at Donald Trump in 2016 and compare that with the proportions predicted by Benford’s law:

Oh noes! Trump 2016 has too many 1s! Except…the same caveat applies. We have no idea if Benford’s law applies to this kind of data! For those curious, Hillary Clinton’s data looks like (by eyeball only) a better fit.

Now we could test these to see how good a fit they are but…why bother? We still don’t know whether we expect the data to be a close fit or not. If you are looking at those graphs and thinking “yeah but maybe it’s close enough…” then you also need to factor in scale. I don’t have data for individual polling booths or whatever but we can look at the impact of scale by looking at minor candidates. Here’s one Vox Day would like, Pat Buchanan.

My eyeballs are more than sufficient to say that those two distributions don’t match. By Day’s misapplied standards, that means Pat Buchanan is a fraud…which he is, but probably not in this way.

Nor is it just scale that matters. Selection bias and our old friend cherry picking are also invited to the party. Because the relationship between the data and Benford’s law is inconsistent and not understood, we can find examples that fit somewhat (Trump, Clinton) and examples that really don’t (Buchanan) but also examples that are moderately wonky.

Here’s another old fraudster but whose dubious nature is not demonstrated by this graph:

That’s too many twos Ronnie!

Anyway, that is far too many words and too many graphs to say that for US Presidential election data Benford’s law applies only just enough to be horribly misleading.


[2] Sean J Taylor’s R code



[5] Deckert, J., Myagkov, M., & Ordeshook, P. (2011). Benford’s Law and the Detection of Election Fraud. Political Analysis, 19(3), 245–268. doi:10.1093/pan/mpr014


Hugo Author Page Views

I gathered the Wikipedia pages of all the authors in my great big Hugo spreadsheet and used my page view gathering tool to add a page view figure to every author with an English Wikipedia page on that sheet. Most of the authors on this list of Hugo Finalists for Novel, Novella, Novelette and Short Story have a Wikipedia page but all the caveats about this data apply. A good example of the issues is Frank Herbert, whose page views have increased because of interest around the new film version of Dune. That doesn’t make the page views utterly flawed as a figure; we just need to be clear that they are a measure of current levels of attention and that currency can change dramatically for individuals.

The other, more numerical, issue is the distribution. Authors who are currently getting a lot of Wiki-attention do so at a scale orders of magnitude greater than those who aren’t. That can make graphing the data tricky, and it also does bad things to measures of central tendency, aka averages.
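The effect on averages is easy to demonstrate with invented numbers in the same ballpark as the page view data: one very famous author drags the mean up by orders of magnitude while barely moving the median.

```python
import statistics

# Hypothetical 30-day page view figures for one award year;
# one very famous author dwarfs the rest
views = [250, 400, 620, 900, 1500, 185000]

mean = statistics.mean(views)      # dragged way up by the outlier
median = statistics.median(views)  # robust to it
```

Here the mean (31,445) is about forty times the median (760), which is why the graphs below use medians.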

This time I want to look at trends over time. I’m plotting the Hugo Award year against an aggregated value of the authors who were finalists in story categories. To cope with the spread of values I’m using a logarithmic scale for the vertical axis.

Hugo story finalist graphed by year and Wikipedia 30 day page views gathered 14/09/2020

The median is less impacted by the smallest and largest values in each year. Also, in this case I’m treating authors without Wikipedia pages as missing data rather than zero. The most famous authors don’t really influence the graph unless they were finalists alongside a whole bunch of other really famous people. I think 1964 is (currently) the peak year because of a combination of Heinlein, Anderson, Vonnegut, Norton, and Rice Burroughs. The outliers that year are Frank Herbert (high because of the Dune movie) and Clifford D. Simak (a decent number of page views, just low for that year), plus Rick Raphael, who gets treated as missing data because he doesn’t have an English Wikipedia page.

Arguably, there is a visible late-1990s/early-2000s dip, one that has been anecdotally claimed in discussions about the Hugo Awards. Whether that is an actual feature of those finalists, or whether they just fall in that spot between too long ago to be notable now but not far back enough to be revisited as classics, remains an open question.

Intentionally, the graph ignores two important groups: the authors who are really, really notable currently (in terms of Wikipedia page views) and the authors who aren’t. I’ll deal with the first group by looking at the maximum values per year.

Hugo story finalist graphed by year and max values 30 day page views

I think that is very much a nothing-to-see-here sort of graph. Note that I’ve changed the maximum and minimum points on the vertical axis to fit the data in. Generally, the really high values are consistently high.

Hugo story finalist graphed by year and min values 30 day page views

The minimum value starts very noisy and then gets more stable. Remember that those authors without Wikipedia pages are counted as missing rather than zero, so don’t impact the values on this graph. I think the most recent years would look a bit noisier if we counted the missing authors as zero instead because the most recent years naturally have more early career writers who haven’t got Wikipedia pages yet.

Lastly, here is the first graph again of the median value but this time only showing the value for the winners.

Hugo story winners graphed by year and median values 30 day page views

That looks like it’s trending down a bit but note that this value will be more influenced by the shorter fiction finalists.

Page Views and the Dragon Award

There is a common impression that there has been a change in character of the Dragon Awards this year. I thought I might use the Wikipedia page view metric (see here) to see if I could quantify it in a different way.

An immediate obstacle with using the page view figure is that the distribution is very Zipf-like. That makes averages very misleading, because the odd Stephen King or Margaret Atwood creates a big change in the mean score. To overcome that issue, and also to show the authors who don’t have Wikipedia pages, I’ve grouped the data in bins that get proportionately bigger. The first bin is 0 to 10 (basically people who don’t have a Wikipedia page), then 10 to 50, then 50 to 100, then 100 to 500 etc., up to 100,000 or more, which is basically Stephen King.
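A sketch of that grouping step, with the thresholds as listed above (my actual spreadsheet set-up differed, no doubt, but the logic is the same):

```python
def view_bin(views):
    """Assign a 30-day page view count to one of the proportionally
    growing bins used in the tables below."""
    thresholds = [10, 50, 100, 500, 1_000, 5_000, 10_000, 50_000, 100_000]
    label = "< 10"
    for t in thresholds:
        if views >= t:
            label = f">= {t:,}"  # keep the highest threshold cleared
    return label
```

So an author sitting on 120 monthly views lands in the “≥ 100” bin, and anyone over 100,000 gets the top bin regardless of how far over they are.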

One major caveat: the page view numbers are as they stand in September 2020 in all cases. So figures for past years reflect those authors’ counts now, not as they were in the year of the award.

This is the table for the book categories (I haven’t gathered the data for people in the comic book categories).

< 104262453444227
≥ 101113
≥ 502215
≥ 1005488631
≥ 5002136
≥ 1,00012109141560
≥ 5,0003144214
≥ 10,0006943527
≥ 50,0002114
> 100,00011
Winners and Finalists (book categories)

Obviously, there are many ways you can group this data but I think it shows some sensible groupings.

< 10111238
≥ 5011
≥ 100112
≥ 50022
≥ 1,0003322212
≥ 5,00013116
≥ 10,0004217
≥ 50,000112
> 100,00011
Winners (book categories)

These tables don’t suggest any substantial changes to the Dragon Awards. There are ups and downs but the overall character seems to be similar: a mix of big names (e.g. in 2016, Terry Pratchett and Brandon Sanderson) down to names that are famous within their Amazon niches (e.g. Nick Cole).

However, if we look at just the ‘headline’ categories defined by the broad genres Science Fiction, Fantasy, and Horror (I thought I should include Horror) we see a different story.

< 1071212233
≥ 10112
≥ 501214
≥ 10022318
≥ 50022
≥ 100056261029
≥ 500011327
≥ 100002332515
≥ 50000112
> 10000011
Winners and Finalists in Science Fiction, Fantasy and Horror

In these three categories, the authors are (by the page view metric) more notable in 2020 than in previous years.

What about gender? The Dragon Awards have been very male dominated, both in absolute terms and even more so in comparison to contemporary awards. Using the page view metric groups, a shift becomes clearer.

< 103543217
≥ 100
≥ 5011
≥ 1002133211
≥ 50022
≥ 1,00023361024
≥ 5,00021227
≥ 10,00032117
≥ 5,000011
> 100,0000
Authors using she/her pronouns (book categories)

The substantial increase is with women authors in the 1,000 to 5,000 range. The difference in gender balance becomes clearer in aggregate across the years.

Group | He/him | She/her | Total | % he | % she
< 10 | 77 | 17 | 94 | 82% | 18%
≥ 10 | 3 | 0 | 3 | 100% | 0%
≥ 50 | 4 | 1 | 5 | 80% | 20%
≥ 100 | 20 | 11 | 31 | 65% | 35%
≥ 500 | 4 | 2 | 6 | 67% | 33%
≥ 1,000 | 36 | 24 | 60 | 60% | 40%
≥ 5,000 | 7 | 7 | 14 | 50% | 50%
≥ 10,000 | 20 | 7 | 27 | 74% | 26%
≥ 50,000 | 3 | 1 | 4 | 75% | 25%
> 100,000 | 1 | 0 | 1 | 100% | 0%
Gender split 2016-2020 (book categories)
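The percentage columns are just each pronoun count as a share of the row total, rounded to whole percentages:

```python
def split_pct(he, she):
    """Each pronoun count as a rounded percentage of the row total."""
    total = he + she
    return round(100 * he / total), round(100 * she / total)

# e.g. the "< 10" row: 77 he/him, 17 she/her
he_pct, she_pct = split_pct(77, 17)  # (82, 18)
```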

The proportion of women increases with group size up to the ≥ 5,000 group and then declines. Interestingly, that 50-50 split in the ≥ 5,000 group also holds among winners, with three each.

So, yes, the Dragons are changing, but only in places. Down-ballot, finalists still tend to be less notable and more male, in a way that’s not very different from 2016.

…I should add

A note on my previous two posts because it illustrates a broader point.

The page views metric does appear to be both meaningful and accessible. Those are handy qualities for making comparisons, but it has a significant downside: as soon as people started paying attention to it in any serious way, its value would be severely undermined.

For example, to set up the fields for the web scraping, I visited a few authors’ main pages several times and literally added to their totals. The impact of that would be small for N.K. Jemisin’s page but not insignificant for Brian Niemeier’s. The set-up I created could also easily be re-designed to visit a single Wikipedia page many times while I got on with some other task.

I noticed an additional circularity today. I was curious about why there was a Chuck Tingle spike in January 2017 and so…visited his Wikipedia page. If there were any stakes attached to this kind of ranking, then a random blip would generate interest in a topic, which would drive interest in the Wikipedia page, which would increase the size of the blip, and so on.

I’m not suggesting anything like that is going to happen with Wiki page view stats but the scenario reminded me of more notable statistics we encounter. The most obvious one is share prices and other speculative financial data. The capacity for this kind of data to engender feedback loops is infamous and actively undermines the information value of the data.

More broadly, metrics used to judge job performance or business performance can also be self-undermining in other ways. What might have been a handy piece of data will get distorted when stakes intended to influence people’s behaviour are attached to it. With social policy this can have unfortunate consequences, e.g. in crime statistics.

Who “won” the Puppy attention wars?

A good point people raised about yesterday’s post on Wikipedia page view metrics is that the metric captures a current state, but in many cases we are more interested in a historical value. This is particularly true when we are looking at the impact of awards or events.

Luckily, I don’t need to advance my web scraping tools further to answer this, as Wikipedia actually has a tool for looking at and graphing this kind of data. Like most people, I’ve used Wikipedia for many years now, but I only learned about this yesterday while looking for extra data (or maybe I learned about it earlier and forgot; seems likely). Each page’s information page has a link to the tool at the bottom under ‘external tools’.

It’s not really suitable for a data set of hundreds of pages but it is quite nice for comparing a small number of pages.

Just to see how it works, and to play with settings until I got a visually interesting graph, I decided to see if I could see the impact of the Hugo Awards on a few relevant pages. Now, the data it will graph only goes back to 2015, so this takes the impact of SP3 as a starting point. I’ve chosen to look at John Scalzi, N.K. Jemisin, Chuck Tingle, Vox Day and Larry Correia.

I added a background colour and labels. The data shows monthly totals and because of the size of some spikes, it is plotted on a logarithmic scale. Be mindful that the points are vertically further apart in terms of actual magnitude than is shown visually.

I think the impact of N.K. Jemisin’s second and third Best Novel wins is undeniable. There is a smaller spike for the first win but each subsequent win leads to more interest. I don’t know why Chuck Tingle had a big spike in interest in January 2017.

I’ve added a little red arrow around July 2019. That was when there was a big flurry among some Baen authors claiming that Wikipedia was deleting their articles.

Anyway, to answer my own question: talent beat tantrums in the battle for attention.