Ngramming

So I’ve been down a few dead-ends of late in my number crunching of past Hugo winners.

I’ve looked for obvious signs of bias and also for cliques and found not much to write home about. The last two issues are whether the Hugo Awards (or other awards such as the Nebulas) have gone to unworthy winners or, alternatively, to works that are too literary. This is something of a heads-you-win-tails-I-lose proposition, as demonstrating worthiness would tend to involve showing independent recognition of a writer’s skill beyond SF/F awards, which would only go to prove the ‘too literary’ complaint.

Either way I have been looking for a way into this so that there is actual evidence to discuss. So far not much luck.

One promising lead was the Google N-Gram viewer. When Google digitized huge numbers of books they gained a massive corpus of texts that allows for systematic analysis. One kind of analysis is a count of n-grams, i.e. ordered sequences of n words. As the Google book metadata includes the year of publication, trends in topics can be graphed over time. For example, this graph shows trends for William Gibson’s 1989 Hugo nominee Mona Lisa Overdrive.

[Graph: n-gram trend for Mona Lisa Overdrive]

I had a plan that I could use the n-grams as follows.

  • Pick a point N-years after a book was nominated for best novel (say N=10)
  • Track the book title in the n-gram viewer up to that year
  • Use that figure to quantify the ‘staying power’ of the book and/or the critical reaction to it
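As a rough sketch of the plan above in Python: assume the yearly frequencies for a title have already been read off the viewer into a dictionary keyed by year. The function name `staying_power` and the choice of a simple mean over the window are assumptions made for illustration, not something the viewer provides.

```python
def staying_power(freq_by_year, nominated_year, n_years=10):
    """Mean yearly frequency from the nomination year to n_years later.

    freq_by_year: dict mapping year -> relative frequency of the title
                  (hypothetical input, copied from the n-gram viewer).
    Years missing from the dict are counted as zero.
    """
    years = range(nominated_year, nominated_year + n_years + 1)
    values = [freq_by_year.get(year, 0.0) for year in years]
    return sum(values) / len(values)
```

A mean is only one option; taking the value in the final year alone would reward books that are still being discussed at the end of the window rather than those with an early burst.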

The thinking was that the more a book was written about, the more defensible it was as a winner or nominee, particularly if it was still being written about 10 years later. Here is a comparison of Le Guin’s The Left Hand of Darkness (red) and Gibson’s Mona Lisa Overdrive (blue).

[Graph: The Left Hand of Darkness (red) vs. Mona Lisa Overdrive (blue)]

This comparison is a bit unfair, as The Left Hand of Darkness is a seminal book. However, the graph shows the kind of comparison that is possible.

Unfortunately this plan was a bit of a bust.

  1. Spoiler 1: Dune. You can’t do the n-gram of Dune without getting every case of the word ‘dune’ (at least with a capital D).
  2. Spoiler 2: The Urth of the New Sun. You can’t do an n-gram search of more than 5 words.
  3. Spoiler 3: You can’t go beyond 2008, which I sort of knew already, but it does limit the ability to look at recent winners.
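Those three spoilers can be turned into a quick pre-check on a title before bothering with the viewer. This is only a sketch: the function name and messages are made up for illustration, counting words by splitting on spaces is an approximation of how the viewer counts them, and the one-word test is a crude stand-in for ‘this title is also an ordinary word’.

```python
MAX_NGRAM_WORDS = 5      # longest phrase the viewer accepts, as noted above
LAST_CORPUS_YEAR = 2008  # last year of data, as noted above


def title_problems(title, nominated_year, n_years=10):
    """Return a list of reasons a title is a poor fit for the n-gram plan."""
    problems = []
    words = title.split()  # assumption: whitespace-separated words
    if len(words) > MAX_NGRAM_WORDS:
        problems.append("more than 5 words")
    if len(words) == 1:
        problems.append("single word, may match ordinary usage")
    if nominated_year + n_years > LAST_CORPUS_YEAR:
        problems.append("window runs past 2008")
    return problems
```

An empty list means none of the three problems applies; it does not mean the resulting graph is meaningful.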

So a different plan was to look at authors. Here are a couple:

https://books.google.com/ngrams/interactive_chart?content=C.+J.+Cherryh%2C+Lois+McMaster+Bujold&year_start=1989&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2CC.%20J.%20Cherryh%3B%2Cc0%3B.t1%3B%2CLois%20McMaster%20Bujold%3B%2Cc0
The graph above shows 1989 winner C. J. Cherryh and 1989 nominee Lois McMaster Bujold.
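The chart link above is essentially a query string, so author comparisons like this can be generated rather than typed by hand. The sketch below uses only Python’s standard library; the parameter names (`content`, `year_start`, `year_end`, `corpus`, `smoothing`) are copied from the link in this post, and it is an assumption that the viewer accepts a link built this way without the `share` and `direct_url` parts.

```python
from urllib.parse import urlencode

BASE_URL = "https://books.google.com/ngrams/interactive_chart"


def ngram_chart_url(phrases, year_start, year_end, corpus=15, smoothing=3):
    """Build a chart link for a list of phrases (e.g. author names).

    corpus=15 and smoothing=3 are simply the values seen in the link above.
    """
    query = urlencode({
        "content": ",".join(phrases),  # phrases are comma-separated
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": smoothing,
    })
    return BASE_URL + "?" + query


url = ngram_chart_url(["C. J. Cherryh", "Lois McMaster Bujold"], 1989, 2008)
print(url)
```

Adding another author to the comparison is then just a matter of appending a name to the list.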

Add 1989 nominee Orson Scott Card and the graph shows that he is more written about than Cherryh or Bujold. Add William Gibson and Gibson ‘wins’ the n-gram battle. It is a kind of quantified evidence of Gibson’s influence as a writer. It doesn’t mean he is a better writer than Cherryh, Bujold or Card but it does indicate that he has been influential and as a consequence written about.

I’m going to go as late as I dare next and look at the 2004 nominees for Best Novel.

  • Lois McMaster Bujold
  • Robert Charles Wilson
  • Robert J. Sawyer
  • Charles Stross
  • Dan Simmons

https://books.google.com/ngrams/interactive_chart?content=Lois+McMaster+Bujold%2CRobert+Charles+Wilson%2CRobert+J.+Sawyer%2CCharles+Stross%2C+Dan+Simmons&year_start=2004&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2CLois%20McMaster%20Bujold%3B%2Cc0%3B.t1%3B%2CRobert%20Charles%20Wilson%3B%2Cc0%3B.t1%3B%2CRobert%20J.%20Sawyer%3B%2Cc0%3B.t1%3B%2CCharles%20Stross%3B%2Cc0%3B.t1%3B%2CDan%20Simmons%3B%2Cc0

In order of written-about-ness:

  1. Dan Simmons
  2. Robert J. Sawyer
  3. Lois McMaster Bujold
  4. Charles Stross
  5. Robert Charles Wilson

But all have similar-ish coverage.
