I’ll lead with the graph and then explain what I’ve plotted and then describe what I think it tells us.


The awful things I do to data

So this is a bit of a complex n-gram query, but in essence it looks at two terms, “science fiction” and “Hugo Award”, as proportions of the terms used in books in general (in English) and in fiction in particular. There is a sort of peak in all the datasets from the mid 70s to the mid 80s that I commented on briefly before.

More exactly: I’ve done various crimes to make this graph. Specifically, I’ve re-scaled all the data sets except the “science fiction” in fiction set. I’ve done that so that the trends in each set can be seen on one graph. However, it is important to note that this scaling means that a point on one line of equal height to a point on another line does NOT mean the terms were equally common. Note also that the lines being compared may come from different corpora.

(science fiction:eng_2012*7.5),science fiction:eng_fiction_2012,(Hugo Award:eng_fiction_2012 * 75),(Hugo Award:eng_2012 * 800)

By using brackets you can do arithmetic operations on the data and in this case I’ve used my eyeballs to scale the data with coefficients that my eyeballs made up (well done eyeballs). They aren’t methodologically sound and represent just what worked visually. Plot the 4 graphs separately though, and you will see the same shapes.
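For anyone who wants to convince themselves that multiplying a line by a constant keeps its shape but throws away any meaning in its height, here is a minimal Python sketch. The frequencies are made up for illustration and are not real n-gram values, and the function names are my own; only the coefficient 75 comes from the query above.

```python
def rescale(series, factor):
    """Multiply every value in a yearly frequency series by a constant."""
    return [value * factor for value in series]


def peak_index(series):
    """Return the position of the largest value, i.e. where the line peaks."""
    return max(range(len(series)), key=lambda i: series[i])


# Hypothetical frequencies for four consecutive periods (NOT real n-gram data).
science_fiction = [1.0e-6, 2.0e-6, 4.0e-6, 3.0e-6]
hugo_award = [1.0e-8, 3.0e-8, 5.0e-8, 4.0e-8]

# 75 is one of the eyeballed coefficients from the query above.
scaled_hugo = rescale(hugo_award, 75)

# The peak stays in the same place after scaling...
same_peak = peak_index(scaled_hugo) == peak_index(hugo_award)

# ...but the scaled line now sits at roughly the same height as
# "science fiction", even though the raw term is far rarer.
raw_gap = science_fiction[2] / hugo_award[2]
scaled_gap = science_fiction[2] / scaled_hugo[2]
```

In this toy data the raw gap between the two terms is about 80 to 1, while the scaled gap is close to 1 to 1, which is exactly why equal heights on the graph say nothing about equal frequency.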

What does the graph show?

The graphs for “science fiction” and for “Hugo Award” show similar patterns when compared in the same data sets. In the combined set (eng_2012), which includes fiction and non-fiction, there is a fairly rapid increase from the 1940s for both terms, with “Hugo Award” lagging by a few years. There is a peak in the early 1980s, then a bit of a decline, followed by a flatter trend for “science fiction” and a slight upward trend for “Hugo Award”.

Within fiction the two terms have a distinct peak in the mid-70s which drops off for “science fiction” and has a brief plateau for “Hugo Award” before following the same downward trend. Plotting the graphs without smoothing unsurprisingly makes for a more jagged graph but also suggests that the common peak for both “science fiction” and “Hugo Award” was about 1978.

Of course there is something a bit unnatural in considering the terms “science fiction” and “Hugo Award” in fiction books. Note the n-gram data is from the content of books, and so technically we are looking in the eng_fiction_2012 dataset for uses of the term “science fiction” WITHIN a story. Arguably that is not a term that would be used much within an SF story, and “Hugo Award” even less. However, they are both terms that might appear in introductions and other parts of a book. This, I think, gives us a clue to what the graphs show.

The Hugo Winners anthology series edited by Isaac Asimov is an obvious example of fiction published in the period marked by the rapid rise of both terms and the peak. These books alone obviously do not account for the whole of the graph, but they are examples of the extent to which short story anthologies (which would more likely have introductions and references both to the genre and awards) were a proportionately more important part of published fiction. The rise and fall represents not a decline in science fiction but rather a decline in the significance of the science fiction anthology.

Hugo relevance?

In either data set it is clear that the ups and downs of science fiction and the ups and downs of the Hugos are roughly related. What I’d like is actually just NON-FICTION because if I am interested in the relevance of the Hugo Awards to the genre I should be looking in books that discuss the genre not books that ARE the genre. Unfortunately Google don’t give a corpus normalised on non-fiction. The best I can do is to subtract fiction away from the general corpus.

As I’m interested in some sort of index, the right thing to do is to look at a proportion. In this case, the proportion of “Hugo Award” mentions over “science fiction” mentions. I’m not going to use my eyeball scaling coefficients because “Camestros’s eyeballs” are not a legitimate statistical method. That gives me this search query:

((Hugo Award:eng_2012)-(Hugo Award:eng_fiction_2012 ))/((science fiction:eng_2012)-(science fiction:eng_fiction_2012))

This is my rough guess at getting “Hugo Award” as a proportion of “science fiction” in non-fiction in English. I’ll start the year range at 1953 to avoid spurious pre-Hugo results and…this is what we get:


Definitive proof of Hugo relevance? No, too many arbitrary choices on my part but definitely grist to the mill.
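If you would rather redo the arithmetic outside the n-gram viewer, the ratio query above can be sketched in Python as follows. This is only a rough illustration: the function name and the input numbers are my own inventions, not real n-gram data. It also inherits the weakness of the query itself, since each frequency is a proportion of its own corpus and so subtracting one from the other only approximates "non-fiction".

```python
def nonfiction_ratio(hugo_all, hugo_fiction, sf_all, sf_fiction):
    """Per-year (hugo_all - hugo_fiction) / (sf_all - sf_fiction).

    Each argument is a list of yearly frequencies of equal length.
    A year whose denominator is zero or negative yields None rather
    than a division error or a meaningless negative index.
    """
    ratios = []
    for h_all, h_fic, s_all, s_fic in zip(hugo_all, hugo_fiction, sf_all, sf_fiction):
        denominator = s_all - s_fic
        if denominator <= 0:
            ratios.append(None)
        else:
            ratios.append((h_all - h_fic) / denominator)
    return ratios


# Hypothetical values for two years (NOT real n-gram data).
example = nonfiction_ratio(
    hugo_all=[4.0, 6.0],
    hugo_fiction=[1.0, 2.0],
    sf_all=[10.0, 5.0],
    sf_fiction=[4.0, 5.0],
)
```

Returning None for awkward years is a design choice on my part; the n-gram viewer may handle such years differently.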

Now, just because I could I threw in the Nebula Awards as well. At which point I stopped before the data police start looking for me.


