Hugo Author Page Views

I gathered the Wikipedia pages of all the authors in my great big Hugo spreadsheet and used my page view gathering tool to add a page view figure to every author on that sheet with an English Wikipedia page. Most of the authors on this list of Hugo finalists for Novel, Novella, Novelette and Short Story have a Wikipedia page, but all the caveats about this data apply. A good example of the issues is Frank Herbert, whose page views have increased because of interest around the new film version of Dune. That doesn't make the page views utterly flawed as a figure; we just need to be clear that they are a measure of current levels of attention, and that currency can change dramatically for individuals.

The other, more numerical, issue is the distribution. Authors who are currently getting a lot of Wiki-attention do so at a scale orders of magnitude greater than those who aren't. That can make graphing the data tricky, and it also does bad things to measures of central tendency, aka averages.

This time I want to look at trends over time. I’m plotting the Hugo Award year against an aggregated value of the authors who were finalists in story categories. To cope with the spread of values I’m using a logarithmic scale for the vertical axis.

Hugo story finalist graphed by year and Wikipedia 30 day page views gathered 14/09/2020

The median is less impacted by the smallest and largest values in each year. Also, in this case I'm treating authors without Wikipedia pages as missing data rather than zero. The most famous authors don't really influence the graph unless they were finalists alongside a whole bunch of other really famous people. I think 1964 (currently) is the peak year because of a combo of Heinlein, Anderson, Vonnegut, Norton, and Rice Burroughs. The outliers that year are Frank Herbert (because of the Dune movie) and Clifford D. Simak (a decent number of page views, just low for that year), plus Rick Raphael, who gets treated as missing data because he doesn't have an English Wikipedia page.
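As a tiny sketch of what that aggregation is doing (with made-up numbers, not the real spreadsheet values), here is the median-versus-mean behaviour with missing authors dropped rather than zeroed:

```python
from statistics import median

# Hypothetical 30-day page-view counts for one award year's finalists.
# None marks an author with no English Wikipedia page, treated as
# missing data rather than zero.
finalists = [41000, 15500, 9300, 2600, 870, None]

# Drop the missing values before aggregating.
views = [v for v in finalists if v is not None]

print(median(views))            # 9300 — robust to the famous outlier
print(sum(views) / len(views))  # 13854.0 — the mean gets dragged upward
```

The same skew is why the graphs use a logarithmic vertical axis: the spread between the top and bottom values is several orders of magnitude.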

Arguably, there is a visible late-1990s/early-2000s dip of the kind that has been anecdotally claimed in discussions about the Hugo Awards. Whether that is an actual feature of those finalists, or whether they just fall in that spot between too long ago to be notable now but not far back enough to be revisited as classics, remains an open question.

Intentionally, the graph ignores two important groups: the authors who are really, really notable currently (in terms of Wikipedia page views) and the authors who aren’t. I’ll deal with the first group by looking at the maximum values per year.

Hugo story finalist graphed by year and max values 30 day page views

I think that is very much a nothing-to-see-here sort of graph. Note that I’ve changed the maximum and minimum points on the vertical axis to fit the data in. Generally, the really high values are consistently high.

Hugo story finalist graphed by year and min values 30 day page views

The minimum value starts very noisy and then gets more stable. Remember that those authors without Wikipedia pages are counted as missing rather than zero, so don’t impact the values on this graph. I think the most recent years would look a bit noisier if we counted the missing authors as zero instead because the most recent years naturally have more early career writers who haven’t got Wikipedia pages yet.
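A two-line illustration (with hypothetical counts) of why the missing-versus-zero choice matters specifically for the minimum:

```python
# None = no English Wikipedia page for that finalist.
year_views = [5400, 320, 47, None, None]

print(min(v for v in year_views if v is not None))     # 47 — missing excluded
print(min(0 if v is None else v for v in year_views))  # 0 — missing as zero
```

Counting the page-less authors as zero would pin the minimum for any year containing one of them to the floor of the graph.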

Lastly, here is the first graph again of the median value but this time only showing the value for the winners.

Hugo story winners graphed by year and median values 30 day page views

That looks like it’s trending down a bit but note that this value will be more influenced by the shorter fiction finalists.

Page Views and the Dragon Award

There is a common impression that there has been a change in character of the Dragon Awards this year. I thought I might use the Wikipedia page view metric (see here) to see if I could quantify it in a different way.

An immediate obstacle with using the page view figure is that the distribution is very Zipf-like. That makes averages very misleading because the odd Stephen King or Margaret Atwood creates a big change in the mean score. To overcome that issue, and also to show the authors who don't have Wikipedia pages, I've grouped the data in bins that get proportionately bigger. The first bin is 0 to 10 (basically people who don't have a Wikipedia page), then 10 to 50, then 50 to 100, then 100 to 500 etc., up to 100,000 or more, which is basically Stephen King.
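The binning described above is easy to sketch; the `bin_label` helper below is my own illustrative naming, but the edges are the ones from the text:

```python
import bisect

# Bin edges that grow proportionately, taming the Zipf-like spread.
edges = [10, 50, 100, 500, 1_000, 5_000, 10_000, 50_000, 100_000]
labels = ["< 10", "≥ 10", "≥ 50", "≥ 100", "≥ 500", "≥ 1,000",
          "≥ 5,000", "≥ 10,000", "≥ 50,000", "> 100,000"]

def bin_label(views: int) -> str:
    """Map a 30-day page-view count to its bin label."""
    return labels[bisect.bisect_right(edges, views)]

print(bin_label(3))        # "< 10"
print(bin_label(4455))     # "≥ 1,000"
print(bin_label(216_776))  # "> 100,000" — Stephen King territory
```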

One major caveat: the page view numbers are as they were in September 2020 in all cases. So figures for past years reflect those authors' counts now and not as they were in the year of the award.

This is the table for book categories (I haven't gathered the data for people in the comic book categories).

(counts per award year, 2016–2020; bracketed figure is the five-year total)

< 10:       42, 62, 45, 34, 44 (total 227)
≥ 10:       1, 1, 1 (total 3)
≥ 50:       2, 2, 1 (total 5)
≥ 100:      5, 4, 8, 8, 6 (total 31)
≥ 500:      2, 1, 3 (total 6)
≥ 1,000:    12, 10, 9, 14, 15 (total 60)
≥ 5,000:    3, 1, 4, 4, 2 (total 14)
≥ 10,000:   6, 9, 4, 3, 5 (total 27)
≥ 50,000:   2, 1, 1 (total 4)
> 100,000:  1 (total 1)
Winners and Finalists (book categories)

Obviously, there are many ways you can group this data but I think it shows some sensible groupings.

(counts per award year, 2016–2020; bracketed figure is the five-year total)

< 10:       1, 1, 1, 2, 3 (total 8)
≥ 50:       1 (total 1)
≥ 100:      1, 1 (total 2)
≥ 500:      2 (total 2)
≥ 1,000:    3, 3, 2, 2, 2 (total 12)
≥ 5,000:    1, 3, 1, 1 (total 6)
≥ 10,000:   4, 2, 1 (total 7)
≥ 50,000:   1, 1 (total 2)
> 100,000:  1 (total 1)
Winners (book categories)

These tables don’t suggest any substantial changes to the Dragon Awards. There are ups and downs but the overall character seems to be similar: a mix of big names (e.g. in 2016, Terry Pratchett and Brandon Sanderson) down to names that are famous within their Amazon niches (e.g. Nick Cole).

However, if we look at just the ‘headline’ categories defined by the broad genres Science Fiction, Fantasy, and Horror (I thought I should include Horror) we see a different story.

(counts per award year, 2016–2020; bracketed figure is the five-year total)

< 10:       7, 12, 12, 2 (total 33)
≥ 10:       1, 1 (total 2)
≥ 50:       1, 2, 1 (total 4)
≥ 100:      2, 2, 3, 1 (total 8)
≥ 500:      2 (total 2)
≥ 1,000:    5, 6, 2, 6, 10 (total 29)
≥ 5,000:    1, 1, 3, 2 (total 7)
≥ 10,000:   2, 3, 3, 2, 5 (total 15)
≥ 50,000:   1, 1 (total 2)
> 100,000:  1 (total 1)
Winners and Finalists in Science Fiction, Fantasy and Horror

In these three categories, the authors are (by the page view metric) more notable in 2020 than in previous years.

What about gender? The Dragon Awards have been very male-dominated, both in absolute terms and even more so in comparison with contemporary awards. Using the page view metric groups, a shift becomes clearer.

(counts per award year, 2016–2020; bracketed figure is the five-year total)

< 10:       3, 5, 4, 3, 2 (total 17)
≥ 10:       (total 0)
≥ 50:       1 (total 1)
≥ 100:      2, 1, 3, 3, 2 (total 11)
≥ 500:      2 (total 2)
≥ 1,000:    2, 3, 3, 6, 10 (total 24)
≥ 5,000:    2, 1, 2, 2 (total 7)
≥ 10,000:   3, 2, 1, 1 (total 7)
≥ 50,000:   1 (total 1)
> 100,000:  (total 0)
Authors using she/her pronouns, book categories

The substantial increase is with women authors in the 1,000-to-5,000 range. The difference in gender balance becomes clearer in aggregate across the years.

Group       He/him  She/her  Total  % he   % she
< 10            77       17     94   82%    18%
≥ 10             3        0      3  100%     0%
≥ 50             4        1      5   80%    20%
≥ 100           20       11     31   65%    35%
≥ 500            4        2      6   67%    33%
≥ 1,000         36       24     60   60%    40%
≥ 5,000          7        7     14   50%    50%
≥ 10,000        20        7     27   74%    26%
≥ 50,000         3        1      4   75%    25%
> 100,000        1        0      1  100%     0%
Gender split 2016-2020 book categories
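The percentage columns can be recomputed from the raw counts as a sanity check; a few rows from the aggregate table:

```python
# (he/him count, she/her count) per page-view group, from the table above.
rows = {"< 10": (77, 17), "≥ 1,000": (36, 24), "≥ 10,000": (20, 7)}

for group, (he, she) in rows.items():
    total = he + she
    pct_he = round(100 * he / total)
    print(f"{group}: {pct_he}% he / {100 - pct_he}% she")
# < 10: 82% he / 18% she
# ≥ 1,000: 60% he / 40% she
# ≥ 10,000: 74% he / 26% she
```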

The share of women increases with grouping size up to the 5,000 group and then declines. Interestingly, with three each, the 50-50 split in that group also holds for winners.

So, yes, the Dragons are changing, but only in places. Down the ballot, finalists still tend to be less notable and more male, in a way that's not very different from 2016.

…I should add

A note on my previous two posts because it illustrates a broader point.

The page views metric does appear to be both meaningful and accessible. Those are handy qualities for making comparisons, but it has a significant downside. As soon as people started paying attention to it in any significant way, its value would be severely undermined.

For example, to set up the fields for the web scraping, I visited a few authors' main pages several times and so literally added to their totals. The impact of that would be small for N.K. Jemisin's page but not insignificant for Brian Niemeier's. The set-up I created could also be easily re-designed to visit a single Wikipedia page many times while I got on with some other task.

I noticed an additional circularity today. I was curious about why there was a Chuck Tingle spike in January 2017 and so…visited his Wikipedia page. If there were any stakes attached to this kind of ranking, then a random blip would generate interest in a topic, which would drive interest in the Wikipedia page, which would increase the size of the blip, and so on.

I’m not suggesting anything like that is going to happen with Wiki page view stats but the scenario reminded me of more notable statistics we encounter. The most obvious one is share prices and other speculative financial data. The capacity for this kind of data to engender feedback loops is infamous and actively undermines the information value of the data.

More broadly, metrics used to judge job performance or business performance can also be self-undermining in other ways. What might have been a handy piece of data will get distorted when stakes are attached to the data which are in turn intended to influence people's behaviour. With social policy this can have unfortunate consequences, e.g. in crime statistics.

Who “won” the Puppy attention wars?

A good point people raised about yesterday's post on Wikipedia page view metrics is that they capture a current state, but in many cases we are more interested in a historical value. This is particularly true when we are looking at the impact of awards or events.

Luckily I don’t need to advance my web scraping tools further to answer this, as Wikipedia actually has a tool for looking at and graphing this kind of data. Like most people I’ve used Wikipedia for many years now but I only learned about this yesterday while looking for extra data (or maybe I learned earlier and forgot — seems likely). Each article’s page information page has a link to it at the bottom under ‘external tools’.

It’s not really suitable for a data set of hundreds of pages but it is quite nice for comparing a small number of pages.

Just to see how it works and to play with settings until I got a visually interesting graph, I decided to see if I could see the impact of the Hugo Awards on five relevant pages. Now the data it will graph only goes back to 2015, so this takes the impact of SP3 as a starting point. I’ve chosen to look at John Scalzi, N.K. Jemisin, Chuck Tingle, Vox Day and Larry Correia.

I added a background colour and labels. The data shows monthly totals and because of the size of some spikes, it is plotted on a logarithmic scale. Be mindful that the points are vertically further apart in terms of actual magnitude than is shown visually.

I think the impact of N.K. Jemisin’s second and third Best Novel wins is undeniable. There is a smaller spike for the first win but each subsequent win leads to more interest. I don’t know why Chuck Tingle had a big spike in interest in January 2017.

I’ve added a little red arrow around July 2019. That was when there was a big flurry among some Baen authors over Wikipedia deleting their articles.

Anyway, to answer my own question: talent beat tantrums in the battle for attention.

Authors: which ones get looked up?

A perennial question around award nominees is just how significant the authors being honoured are. It’s a tricky question, particularly as there is no good data about book sales. Amazon ranks are mysterious, and Goodreads data may be a reflection of a particular community.

I’m currently taking a few baby steps into web scraping data and I was playing with Wikipedia. Every Wikipedia article has a corresponding information page with some basic metadata about the article. For example, here is the info page for the article on the writer Zen Cho. On that page is a field called “Page views in the past 30 days” that gives exactly the figure stated. As a first attempt at automating some data collection, it’s a relatively easy piece of data to get.
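The same number can be assembled without scraping the info page at all, via the official Wikimedia Pageviews REST API. A sketch (the function names here are mine, not part of any library; dates are YYYYMMDD strings):

```python
import json
import urllib.parse
import urllib.request

API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(title: str, start: str, end: str) -> str:
    """Build the per-article daily page-views URL for English Wikipedia."""
    safe = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return f"{API}/en.wikipedia/all-access/user/{safe}/daily/{start}/{end}"

def total_views(title: str, start: str, end: str) -> int:
    """Sum daily views over the window (requires network access)."""
    req = urllib.request.Request(pageviews_url(title, start, end),
                                 headers={"User-Agent": "pageview-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return sum(item["views"] for item in data["items"])

# e.g. total_views("Zen Cho", "20200815", "20200914") for a 30-day window
```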

So, I put together a list of authors from my Hugo Award and Dragon Award lists, going back a few years (I think to 2013). Not all of them have Wikipedia pages, partly because they are early in their careers but also because Wikipedia does a poor job of representing authors who aren’t traditionally published. Putting the ‘not Wiki notable’ authors aside, that left me with 163 names. With a flash of an algorithm I had a spreadsheet of authors ranked by the current popularity of their Wikipedia page.

Obviously this is very changeable data. A new story, a tragedy, a scandal or a recent success might change the number of page views significantly from month to month. However, I think it’s fairly useful data nonetheless.

So what does the top 10 look like?

 1. Stephen King         216,776
 2. Margaret Atwood       75,427
 3. Brandon Sanderson     72,265
 4. Terry Pratchett       55,591
 5. Rick Riordan          43,484
 6. N. K. Jemisin         34,756
 7. Cixin Liu             32,372
 8. Sarah J. Maas         21,852
 9. Ian McEwan            20,468
10. Neal Stephenson       20,058

The rest of the top 30 look like this:

11. Robert Jordan         19,169
12. Ted Chiang            17,635
13. Owen King             16,041
14. Jim Butcher           15,493
15. James S. A. Corey     15,109
16. Stephen Chbosky       14,490
17. Leigh Bardugo         13,787
18. China Miéville        13,580
19. Andy Weir             13,057
20. Harry Turtledove      11,452
21. Cory Doctorow         11,362
22. Jeff VanderMeer       11,243
23. John Scalzi           10,796
24. Chuck Tingle          10,763
25. Ben Aaronovitch       10,493
26. Brent Weeks           10,271
27. Ken Liu                9,003
28. Tamsyn Muir            9,002
29. Alastair Reynolds      8,951
30. Kim Stanley Robinson   8,879

There’s a big Zipf-like distribution going on with those numbers, which decline quickly by rank. John Scalzi has Chuck Tingle levels of fame on this metric.
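A quick back-of-the-envelope check of the Zipf claim: for a classic Zipf distribution (exponent near 1), rank × value is roughly constant. Testing that against a few of the ranked totals above:

```python
# rank: 30-day page views, taken from the lists above.
ranked = {1: 216_776, 4: 55_591, 10: 20_058, 30: 8_879}

products = {rank: rank * views for rank, views in ranked.items()}
print(products)
# The products all land in the band 2.0–2.7 × 10^5, which is about as
# Zipf-like as messy real-world data gets.
```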

OK, so I know people want to know where some of our favourite antagonists are, so here are some of the notable names from the Debarkle years.

 40. Vox Day               5,271
 45. Larry Correia         4,455
 60. John Ringo            2,878
 81. John C. Wright        1,251
111. Brad R. Torgersen       560
123. Sarah A. Hoyt           407
140. L. Jagi Lamplighter     229
152. Dave Freer              102
153. Lou Antonelli           101
156. Brian Niemeier           81

Day probably gets a lot more views due to people looking him up because of his obnoxious politics. Larry Correia is in a respectable spot in the 40s. He is just below Martha Wells, who has 4,576 page views — which is essentially the same number given how these figures might change from day to day. John Ringo is just above Chuck Wendig and Rebecca Roanhorse (2,806 and 2,786). John C. Wright is sandwiched between Tade Thompson and Sarah Gailey.

You can see the full list here.

Let me know if you find any errors.

The last one for the time being

This is the fan categories again but without the intermediate nodes of category or year. In other words, the edges of the graph join names together directly, with each edge representing that the two names shared a category in a year. It makes a big bow tie.

It is less accurate though, because of an alphabetical bias. While all finalists appear, the lists only went ten deep because of a column limit when I was processing the data. That means finalists further down the alphabet, in years with lots of named finalists, don't get as many connections as they should have.
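The edge-building itself is just pairwise combinations of each category-year's finalist list, which also makes the truncation problem easy to see (hypothetical names):

```python
from itertools import combinations

# Every pair of finalists who shared a category in a year gets an edge.
finalists = ["Adams", "Baker", "Clark", "Davis", "Evans"]

edges = list(combinations(finalists, 2))
print(len(edges))  # 10 — n*(n-1)/2 pairs for n = 5

# Truncating the (alphabetical) list at, say, three names silently
# drops every pair involving the later alphabet:
truncated = list(combinations(finalists[:3], 2))
print(len(truncated))  # 3 — Davis and Evans lose all their links
```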

A more readable version here in PDF.

The file name was short for fan categories with direct connections. However, “Fan Cats Direct” sounds like an interesting business.

They made me do this

I blame James Davis Nicoll who forced me to do this.

Hugo Fan Categories (Artist, Fanzine, Fan Writer and Fancast) by year.

The growth in the size of the years post-2012 comes firstly from the addition of Fancast but also from more group finalists (often in Fancast). It looks like some huge city on a bay with bridges out to islands (the Mike Glyer Bridge joining the historic CBD to the 2016/18 islands).

Here they are organised by award category.

Hugo Mode

There’s a mob of network data scientists with flaming pitchforks hammering at the doors of Felapton Towers in a vain attempt to drag these tools out of my hands and try me for crimes against having the faintest idea of what I’m doing. In the meantime this blog is all graphs all the time until I run out of things to stuff into Gephi and see what happens.

In this case what happened was more useful than I imagined. I thought mapping connections between the four award categories I have collected in my big-hugo-spreadsheet (Novel, Novella, Novelette, Short Story) would be a bit dull. However, the graph has done a very nice job of sorting authors into nine semi-neat clusters.

ETA zoomable PDF below:

The four big outer broccoli-like fronds show authors whose work has only been nominated in a single category. Let’s call them the specialists. There is a second ring of four groups which joins adjacent pairs, Novelette & Novella, Novella & Novel, Novel & Short, Short & Novelette. Nice. Then the sorting hat gives up and lumps everybody else in the middle.

Now a good data diagram should raise questions and this one does. There are two pairings we can’t see easily because they sit on the diagonals of the quadrilateral: Novella & Short, Novelette & Novel.

Authors who have only been finalists in the Novella & Short categories are:

  • Amal El-Mohtar
  • Andy Duncan
  • Gregory Benford
  • Jack McDevitt
  • Joanna Russ
  • John C. Wright
  • Keith Laumer
  • Ken Liu
  • Kij Johnson
  • P. Djèlí Clark
  • Rivers Solomon
  • Spider Robinson
  • Ted Reynolds

Authors who have only been finalists in the Novel & Novelette categories are:

  • Andre Norton
  • Charlie Jane Anders
  • David Gerrold
  • Murray Leinster
  • Paolo Bacigalupi
  • Philip K. Dick
  • Piers Anthony
  • Tom Reamy
  • Walter M. Miller, Jr.
  • William Gibson
  • Yoon Ha Lee

The rest of that central cloud are the hat-trick authors (3 categories) and what I guess we might call the grand-slam authors (4 categories). These are both large groups.
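The sorting the graph does implicitly can be sketched directly: classify each author by the set of story categories they were a finalist in (hypothetical records, not the real spreadsheet):

```python
# (author, category) finalist records.
records = [
    ("Specialist", "Novel"),
    ("Pairer", "Novella"), ("Pairer", "Short Story"),
    ("Slammer", "Novel"), ("Slammer", "Novella"),
    ("Slammer", "Novelette"), ("Slammer", "Short Story"),
]

# Collect each author's set of categories.
cats = {}
for author, category in records:
    cats.setdefault(author, set()).add(category)

# Group by how many distinct categories the author has reached.
names = {1: "specialist", 2: "pair", 3: "hat-trick", 4: "grand-slam"}
groups = {author: names[len(c)] for author, c in cats.items()}
print(groups)
# {'Specialist': 'specialist', 'Pairer': 'pair', 'Slammer': 'grand-slam'}
```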

The grand-slam category is interesting. It consists of 25 authors and is quintessentially Hugo Awardish. Given the very male-dominated decades of the Hugos, I was glad to see that the group has many women in it — 8 out of 25, so still an under-representation but better than I expected.

  • Algis Budrys
  • Clifford D. Simak
  • Connie Willis
  • David Brin
  • Fritz Leiber
  • George R. R. Martin
  • Gordon R. Dickson
  • Greg Bear
  • Joan D. Vinge
  • John Varley
  • Kate Wilhelm
  • Kim Stanley Robinson
  • Larry Niven
  • Mary Robinette Kowal
  • Maureen F. McHugh
  • Michael Bishop
  • Michael Swanwick
  • Nancy Kress
  • Orson Scott Card
  • Poul Anderson
  • Robert Silverberg
  • Roger Zelazny
  • Samuel R. Delany
  • Ursula K. Le Guin
  • Vonda N. McIntyre

Better Hugo Islands

Watch me learn, people! The result is a tiny bit more intentional, as I tried to pick settings that worked last time. However, I didn’t manually move any nodes. I’m not sure why Samuel Delany sits above the fray, but I suspect it’s a result of the algorithm that stops labels overlapping, and his label was avoiding Harlan Ellison.

The semi-discontinuity between the main island and 2016-2020 is still visible and the cause (2015) is also apparent. There are connections back to the ‘mainland’ (Ted Chiang, Lois McMaster Bujold etc) but the impact of the Sad Puppies effectively removed a transitional year. Ironic really — the Puppies never could decide if they were traditionalists bringing the Hugos back to their roots or iconoclasts setting up a new order (as is often the way with hyper-traditionalist movements). For extra-double-reverse irony the nodes for Brad Torgersen, Larry Correia and Vox Day sit on the mainland near the 2014 coast.