More BookBub numbers

In the comments to the previous post on this topic, Johan P raised some really interesting points. I’d said rather glibly that the categories with more subscribers will obviously have more free-downloads and sales. As Johan points out, this is counter-intuitive, as the figures given are AVERAGES, i.e. (I assume) the number of downloads/sales per book rather than the total number of downloads or sales in those categories. However, it really is true that the bigger categories have bigger downloads/sales, but I haven’t explained it properly and I did use misleading terms like ‘crowded’.

The graph plots the totals of free-downloads + discounted book sales (horizontal axis) against number of subscribers. The relationship is quite strong. I plotted a line of best fit courtesy of Excel. Now a linear relationship is probably not the best way of describing the data. I assume that underneath all of this is some sort of power-law type thing going on with sales (i.e. some books sell HUGE amounts and shape the averages accordingly). How that all plays out when comparing subscribers to sales would require more detailed data than we have. Even so, the line gives us something to compare the data we do have against, and an r-squared of 74% is enough to justify my claim that more subscribers=more downloads/sales as a broad statement.
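For anyone who wants to reproduce this kind of fit outside a spreadsheet, here is a minimal sketch of an ordinary least squares line and its r-squared. The data points are invented for illustration and are not BookBub’s figures; the function names are also just placeholders.

```python
# Hypothetical (downloads_plus_sales, subscribers) pairs, NOT BookBub's data.
points = [
    (10_000, 1_200_000),
    (15_000, 1_500_000),
    (22_000, 2_600_000),
    (30_000, 2_900_000),
    (41_000, 4_300_000),
]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(points, slope, intercept):
    """Share of the variance in y that the fitted line accounts for."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_total = sum((y - mean_y) ** 2 for _, y in points)
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return 1 - ss_residual / ss_total


slope, intercept = fit_line(points)
r2 = r_squared(points, slope, intercept)
print(f"subscribers ~= {slope:.1f} * downloads + {intercept:,.0f} (r-squared {r2:.2f})")
```

A spreadsheet trendline does the same calculation; the sketch only makes the steps visible.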

Flipping this round, we get a different way of looking at the data: which genres deviate most from that line and in which direction? If I’m right and the sales figures are distorted by bestsellers, then a newbie author should steer clear of those genres ABOVE the line because these genres have more subscribers than we would predict from the number of downloads/sales. Genres below the line have more sales/downloads than we would predict from the number of subscribers and that sounds like a better bet, or at least those averages may be closer to a ‘typical’ value rather than a distorted average.

Here’s a similar graph but this time looking at sales only and unfortunately done using Apple’s Numbers spreadsheet rather than Excel:

There are many ways we can quantify how much a data point deviates from that line but, within the limits of the tools on this laptop, I’m just going to find the difference between the actual number of subscribers and the number predicted by the equation of the line. Negative is better here, I think, but I’ve sailed off into generating numbers whose meaning is unclear. I *think* that the genres near the top are less impacted by a few bestsellers and the genres near the bottom are more impacted but…I wouldn’t swear to that and I’m just guessing.

Genre | Difference in subscribers
Women’s Fiction | -1,166,549
Historical Romance | -573,940
Christian Fiction | -451,331
Erotic Romance | -427,329
Romantic Suspense | -418,864
Christian Nonfiction | -397,638
True Crime | -392,253
Paranormal Romance | -391,022
Historical Fiction | -372,395
Advice and How-To | -358,251
Supernatural Suspense | -347,481
Politics and Current Events | -213,950
Dark Romance & Erotica | -186,103
American Historical Romance | -153,023
Time Travel Romance | -62,258
Religion and Spirituality | -55,485
Science Fiction | -36,711
Chick Lit | -6,255
Literary Fiction | 104,064
African American Interest | 104,971
Contemporary Romance | 115,599
Middle Grade | 122,818
New Adult Romance | 209,591
Cozy Mysteries | 309,753
Historical Mysteries | 317,139
General Nonfiction | 499,596
Psychological Thrillers | 600,214
Biographies and Memoirs | 631,754
Action and Adventure | 697,134
Teen and Young Adult | 968,973
Crime Fiction | 1,047,144
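The calculation behind a list like this is just “actual minus predicted”, sorted. Here is a minimal sketch of it, assuming a line has already been fitted. The genres, the figures and the `SLOPE` and `INTERCEPT` values are all made up for illustration; they are not the numbers behind the list above.

```python
# Hypothetical genres: (name, downloads_plus_sales, actual_subscribers).
genres = [
    ("Genre A", 20_000, 2_500_000),
    ("Genre B", 30_000, 2_900_000),
    ("Genre C", 40_000, 4_600_000),
]

# Hypothetical line of best fit: predicted = SLOPE * sales + INTERCEPT.
SLOPE = 100.0
INTERCEPT = 200_000.0


def residual(sales, subscribers):
    """Actual subscribers minus the number the line predicts."""
    return subscribers - (SLOPE * sales + INTERCEPT)


# Most negative difference first, matching the ordering of the list.
ranked = sorted(
    ((name, residual(sales, subs)) for name, sales, subs in genres),
    key=lambda pair: pair[1],
)

for name, diff in ranked:
    print(f"{name} {diff:+,.0f}")
```

A vertical difference like this is the simplest possible measure of deviation; it takes no account of how far along the line a genre sits.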

Some Book Bub numbers and petunias

My attention was drawn to a set of numbers from BookBub available here:

Some major caveats before we go into them. Firstly, these are for marketing purposes and as they say “averages are based on historical data, but are only meant as a reference and are not guaranteed”. The book figures also only apply to free downloads and discounted book sales. Lastly, these are BookBub’s numbers and other retailers of books may show different patterns.

A broader caveat to add when considering any kind of average sales within books (or other media) is the dreaded power-law distribution. A small number of books account for a large number of sales and, conversely, a large number of books have small sales individually but account for a lot of sales together. The arithmetic mean has many flaws but it is particularly flawed in such circumstances. One huge hit (e.g. The Da Vinci Code) will have an outsized impact on the average book sales even if other books are selling poorly.
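A quick way to see the problem is to compare the mean with the median on a small made-up set of sales figures in which one book is a runaway hit. The numbers below are invented purely for illustration.

```python
from statistics import mean, median

# Hypothetical sales for ten books; the last one is a runaway hit.
sales = [120, 150, 90, 200, 175, 130, 160, 110, 140, 250_000]

print(f"mean:   {mean(sales):,.1f}")
print(f"median: {median(sales):,.1f}")

# Drop the single hit and the mean collapses back towards the typical book.
without_hit = sales[:-1]
print(f"mean without the hit: {mean(without_hit):,.1f}")
```

Nine of the ten books sell a couple of hundred copies at most, yet the mean suggests a typical book sells tens of thousands. The median barely notices the hit.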

Tables and things after the fold…


A bad survey about the ‘Intellectual Dark Web’

This is an edited version of three Twitter rants from yesterday. It started as an off-the-cuff reaction but I was too far into it before I thought that it should be a blog-post rather than Tweets.

Steven Pinker tweeted out a very weird bit of science theatre created by Michael Shermer.

Pinker has enough critical thinking skills that he should look at it with hefty scepticism…but obviously he isn’t. It’s pretend science: play-acting at science to refute what is obvious while ignoring the core issues.

The “survey” by Michael Shermer (which should be a red flag in itself) was sent to 34 notable people associated with the label “Intellectual Dark Web” and asked where they stand on a number of issues. The survey was anonymous, so the views identified in the survey can’t be matched to the individuals asked.

Each and every one of the people surveyed is a public figure who has made multiple public statements about politics and social issues. I don’t need an anonymous survey to find out what Andy Ngo or Sam Harris thinks, I can go and read what they say. And it is what they SAY that matters and what defines the IDW term, not what they might privately think. If Sam Harris thinks he has warm & fuzzy liberal beliefs that’s nice, but the whole point of the “dark web” label was the contrarian issues he promotes. Maybe Ben Shapiro secretly believes Global Warming is real and climate change is caused by humans. I don’t know, but what matters is he propagandises the opposite. If an anonymous survey of the 34 “Intellectual Dark Webbers” reveals that their underlying views are more centrist and mainstream, then that is not evidence that the public perception of their public positions is wrong. Rather it confirms a key point about the IDW.

The fundamental issue with the disparate group lumped together as the Intellectual Dark Web is that they are DISINGENUOUS about their politics. It’s not news that Jordan Peterson thinks of himself as moderate and reasonable. We knew that already. It doesn’t change that he (and Harris & Shapiro & Ngo & Quillette) frame and enable a perspective that bolsters the far right. The whole “we are the reasonable ones” is part of the schtick of the IDW. That they’ll boost that in an anonymous survey is, frankly, wank.

Let’s be sceptical, as I’m sure Dr Pinker and Shermer would want us to be. Let’s take one conclusion Pinker draws from the survey: the members of the IDW are “concerned w climate”. Let’s look at the survey, which agrees: “67% strongly agreed that global warming is caused by human actions (no one strongly disagreed)”. So there you go! Hoorah! No, no, let us be sceptical first. If this was GENUINELY true, would it not be easily observed?

To the empiricism-mobile! Here’s the output of the Quillette Climate tag, zoiks! A hefty TWO articles, one concern-trolling Greta Thunberg and the other saying people shouldn’t be mean to capitalism. Yes, Quillette is just one source but it is one that connects Steven Pinker on the one hand (who we can observe genuinely does advocate for action on Global Warming) with Andy Ngo on the other hand (who genuinely does have connections with the alt-right and violent far right groups) via Claire Lehmann (Quillette’s founder, fan of Pinker and one-time boss of Ngo).

Yes, Steven Pinker himself has a better record on the issue of global warming, but the point he raised was to look collectively at the IDW and their media-organs. Broadly this is not a group trying to do very much about helping with the issue. And wow, think of the actual good the IDW could achieve given their actual audience. Whatever they may think of themselves, collectively they do have the ear of many on the right, which is exactly where climate change denial and bad science on the topic are endemic. You’d think these outspoken people might be busy being outspoken on a potential planet-wide disaster.

It gets worse. The actual sample was only 18, not 34 people. Nearly half of the 34 didn’t answer. So when the survey says “67%” (the percentage who favour gun control and who believe global warming is real) it actually means “12 people”. That’s actually both more plausible and more wretched. Even if we accept that 12 of those IDWs think climate change is real, it says almost nothing about the group. Any one member of the original 34 people is a hefty 3% of the population being sampled and hence missing any one of them can have a large impact on the results. This is particularly true given that we already know that the label of “Intellectual Dark Web” is being attached to a group with a very broad range of views on many topics.

Shermer is assuming non-response to the survey is random across the traits being surveyed (i.e. the 18 is a random sample of the 34). There is no reason to believe that, and really anybody who wants to seriously call themselves a sceptic should dismiss any general conclusion from the survey without substantial additional supporting evidence.

Indeed there’s good reason to assume that the 18 who responded are not a good random sample of the 34, just on the nature of the numbers. It is very hard with small numbers in a survey for the sample to be representative because one person makes a big difference. Shermer hides that by quoting percentages rather than raw totals, but with small numbers percentages hide how few people he’s talking about. It’s not invalid to look at proportions with small sample sizes, sometimes that is all you have, but there’s a point where 12 out of 18 is more informative than 67%.

We can illustrate the issue with the women who were surveyed. Of the 34 named people in the survey associated with the “Intellectual Dark Web”, 8 (24%) are women. Of the 18 who responded, 3 (17%) are women. So are the IDW 17% women (generalising from the survey) or 24%? Obviously 24% is the correct figure, but 17% is the equivalent of the kind of survey conclusions Shermer presents. In fact any one woman listed is 13% of the IDW women, so one more woman answering makes a huge difference to the sub-sample of women. Any one person is 6% of the whole sample of 18 people!

Circling back to the 67% claim. Again assuming everybody who responded is being honest (which I doubt), the survey actually found that 12 people of the 34 who were asked believed in gun control and the same number believed that global warming was real (which I’ll add isn’t saying much: some prominent sceptics will say global warming is real, just as many anti-vaccination campaigners will say they support vaccinations; it is the ‘but’ that follows where the issues lie). That might mean 67% or thereabouts of the 34 believe in gun control, but a safer conclusion is that no less than 35% do (12/34) and no more than 82% (28/34). Given how granular this data is, hoping the estimate is in the middle isn’t supported.
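The 35% and 82% bounds come from the two extreme assumptions about the 16 people who didn’t answer: either none of them agree or all of them do. Here is a minimal sketch of that arithmetic, using the survey’s 34 asked, 18 responded and 12 agreed. The function name is just a placeholder.

```python
def share_bounds(asked, responded, agreed):
    """Lowest and highest possible share of the whole group who agree.

    Lower bound: every non-responder disagrees.
    Upper bound: every non-responder agrees.
    """
    non_responders = asked - responded
    lower = agreed / asked
    upper = (agreed + non_responders) / asked
    return lower, upper


lower, upper = share_bounds(asked=34, responded=18, agreed=12)

print(f"among those who answered: {12 / 18:.0%}")
print(f"among all 34: somewhere between {lower:.0%} and {upper:.0%}")
```

This takes the responses at face value; it says nothing about where in that range the true figure sits.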

This is why I call it theatre. It is the wrong methodology applied badly. It illustrates methodological snobbery. Synthesising the complex views of a small group of people is exactly where qualitative methods work better. It is a domain where you need to put on your humanities hat and apply those humanities skills. Shermer is using sciencey flim-flam by presenting a pointlessly anonymous survey and presenting the results as percentages as if they were proportions of the whole group.

Don’t get me wrong, I absolutely LOVE applying basic quantitative methods to things and places where they don’t always make sense. It’s very much my hobby, but even on this less than 100% serious blog I’d throw more caveats at better numbers than Shermer is using.

Loved Books: The Mismeasure of Man by Stephen J Gould

Stephen Jay Gould is a voice that is missed in today’s world. Smart, compassionate and analytical but also with a deft capacity to write about complex ideas in an engaging way. In The Mismeasure of Man Gould stepped out of his main field of paleontology and looked at the history of attempts to measure intelligence and the racist assumptions that have run through those attempts. This is the 1981 edition which doesn’t have the chapters on The Bell Curve but still a worthy read.

Is it perfect? No, but then a popular account of a broad area of research necessarily simplifies and skips over some details. As a gateway into understanding the issues there is no better book that I’m aware of.

Ersatz Culture’s Gender Graphs

Ersatz Culture has been systematically graphing all the awards (well, lots of them but maybe not all of them) in terms of gender.

There are a host of different patterns in those graphs (note these are my observations, not those of Ersatz Culture). Some awards are more volatile than others and, of course, some awards are very recent. Overall, there has been the shift already noted, from:

  1. Mainly men
  2. More men than women but many women
  3. Mainly women

The nearest graph to one that splits neatly into these phases is the Nebula Award for Short Story but as with any narrative overlaid on data, take it as the speculation it is.

There are few examples of an award bouncing around a 50/50 split. The Arthur C Clarke award though seems to have less of a trend and more of a noisy wobble around a 70/30ish split.

Young Adult awards have been more favourable to women. Fantasy awards have tended to be more favourable to women also. Any shift in a generic award towards YA or fantasy therefore might also lead to a shift towards women.

New writer awards (the former-Campbell Award, Locus Best First Novel) have often had a better split (not always a good split) than other awards in the same year. That is interesting as they might be a leading indicator of future award demographics in these awards.

Some comparison data on gender: Amazon

More as a data grabbing exercise than anything, I tabulated the Amazon Best Seller list for Science Fiction and Fantasy:

This data is a snapshot and right now the list is naturally dominated by Margaret Atwood’s sequel to The Handmaid’s Tale, so the list contains a lot of different versions of both (print version, audio version, Kindle version etc). It’s also very Amazon with some popular-in-Kindle-unlimited works further down the ranks.

I took the top 100 listed and then did a few things to the data. Firstly, I deleted multiple versions of a work; that will add a bit of bias to the data by understating the impact of the biggest sellers. I then classified authors based on name, pronouns, and bios as male, female, non-binary or both (in the case of dual authors). I didn’t identify any authors for the non-binary category. One author name was a joint authorship of a man & woman and was counted as “both”. That took the initial 100 rows down to 84 rows.

I then duplicated that data set and in the second version I deleted multiple works by an author, leaving only the highest ranked work from the Amazon list. This was done so a single author wasn’t double counted (or n-tuple counted in the case of J.K. Rowling) but the process understates the success of authors like Rowling or Stephen King. That took the number of rows down to 55.
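The two de-duplication passes can be sketched roughly as follows. The rows are invented for illustration (they are not the actual Amazon list), and the sketch assumes that a title is enough to identify a work and a name is enough to identify an author, which real data won’t always allow.

```python
from collections import Counter

# Hypothetical rows: (rank, title, author, gender, format). Not the real list.
rows = [
    (1, "Book A", "Author One", "female", "hardcover"),
    (2, "Book A", "Author One", "female", "kindle"),
    (3, "Book B", "Author Two", "male", "kindle"),
    (4, "Book C", "Author One", "female", "paperback"),
    (5, "Book D", "Author Three", "male", "audio"),
    (6, "Book B", "Author Two", "male", "audio"),
]


def best_by(rows, key_index):
    """Keep only the highest-ranked row for each distinct value in a column."""
    best = {}
    for row in sorted(rows):  # tuples sort by rank first
        best.setdefault(row[key_index], row)
    return list(best.values())


def gender_shares(rows):
    """Proportion of rows for each gender label."""
    counts = Counter(row[3] for row in rows)
    total = sum(counts.values())
    return {gender: n / total for gender, n in counts.items()}


all_works = best_by(rows, key_index=1)      # first pass: one row per work
top_work = best_by(all_works, key_index=2)  # second pass: one row per author

print(len(all_works), gender_shares(all_works))
print(len(top_work), gender_shares(top_work))
```

Even in this toy example the two passes give different splits, because an author with several works on the list counts several times in the first version and only once in the second.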

The results are delightfully ambiguous with enough contrary results to please multiple readings.

Gender | All Works | Top Work | All Works Top 50 | Top Work Top 50
Female | 56% | 49% | 61% | 54%
Male | 43% | 49% | 39% | 46%
Both | 1% | 2% | 0% | 0%
  • All Works: counts by author gender of the 84 books in the SFF Amazon bestsellers.
  • Top Work: counts by author gender of the 55 books by unique authors in the SFF Amazon bestsellers.
  • All Works Top 50: counts by author gender of books ranked 50 or better out of the 84 (36 books).
  • Top Work Top 50: counts by author gender of books ranked 50 or better out of the 55 (24 books).

Looking at just works ranked 25 or better results in a figure more consistent between the two sets of data.

Gender | All Works Top 25 | Top Work Top 25
Female | 59% | 56%
Male | 41% | 44%

Make of this what you will 🙂

A bit more on Dragons and probabilities etc

I had some weird conversations yesterday about Dragon Award stats. One was a brilliant takedown of my figure that 10 men out of 10 had won Dragon Awards since 2016 in the two headline categories. Aha! Four years and two categories is only EIGHT! Yeah, but it really is ten men. James S A Corey is actually two people and, even harder to believe, apparently John Ringo and Larry Correia are different. Mind you…if I only count Larry Correia once (because he is the same person whichever year he’s in) then it is back to 8 again…You’ll note that however we count it the answer comes out the same: 100% have gone to men in the two headline categories.

The discussion does raise a relevant point about why statistics is hard. Even a basic stat like a count of how many out of how many requires engaging your brain and thinking carefully about what you are counting. It was suggested that I should have said 10 men out of 8 awards…which I guess makes it clearer what was being counted but is horrible arithmetically. It looks like “10 out of 8”, i.e. 125%, which is nonsense because we are dividing two different things and creating a derived unit of men per award.

I’ll point people back to this post and this post where I talked in more detail about what I counted and how.

To round off that previous gender post here is an equivalent graph of winners by gender in the book category:

Like the graph in the previous post of finalists, I’m using counts by gender which reduces the gender disparity by only counting two joint authors of the same gender as 1 but two joint authors of different genders as 1 each per gender. Same caveats about gender as a binary classification apply as with the earlier post.
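That counting rule is easy to get wrong, so here is a minimal sketch of it next to a plain head count. The list of winners is invented for illustration and the function names are placeholders.

```python
from collections import Counter

# Hypothetical winning works, each listed as the genders of its authors.
works = [
    ["male", "male"],    # two co-authors of the same gender
    ["male"],
    ["female", "male"],  # two co-authors of different genders
    ["female"],
]


def count_by_gender(works):
    """Each work adds at most 1 to any gender, however many authors it has."""
    totals = Counter()
    for author_genders in works:
        totals.update(set(author_genders))
    return totals


def count_by_person(works):
    """Plain head count: every individual author adds 1."""
    totals = Counter()
    for author_genders in works:
        totals.update(author_genders)
    return totals


print(count_by_gender(works))
print(count_by_person(works))
```

In this toy data the head count gives 4 men to 2 women, while the per-work rule gives 3 to 2, which is the sense in which the rule reduces the apparent disparity.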

Worst year was 2017 which was also peak Rabid Puppy influence.

A couple of conceptual questions have come up that are related. I was asked elsewhere what the chance was of so many authors on Brad’s list winning. A different question with the same kind of issue was asked by James Pyles: basically, what was the chance of N.K. Jemisin winning a Hugo three times in a row.

Neither question can easily be answered and they sort of miss the point of the kind of comparisons against chance you might do with gender. With the Brad list these were people who were plausible winners, so the outcome wasn’t surprising. There’s no expectation that the result of an award is a random event when looking at individuals, and the same is true with Jemisin. We could say, well, there’s 7 billion people on earth and one winner so the chance is 1/7 billion and the chance of winning three times is (1/7 billion)^3, and then conclude that everything is impossible, but the comparison is silly.

Comparing with chance is there to test a kind of hypothesis: specifically whether the result is plausibly the result of chance. If the probability is tiny then we can reject that it happened by chance. We already know that somebody winning a Dragon or a Hugo isn’t by chance because names aren’t picked out of a hat.

So why compare gender of winners to chance events if we know winning isn’t a chance event? Good question. Because we are testing another level of hypothesis. With gender, the hypothesis could be stated as ‘gender is an irrelevant variable with regard to winning award X’.

Consider this. Imagine if all Dragon (or Hugo) winners were born on a Tuesday. That would be remarkable. Day of the week surely isn’t connected to whether you win an award or not! We might reasonably expect only one-seventh of winners to be born on a Tuesday. We might do extra research to see whether day-of-the-week is evenly distributed across all people. We might fine-tune that further and consider only English speakers or only Americans etc. The point being that if day-of-the-week departed from chance then we would reject the idea that day-of-the-week is irrelevant.
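To put a number on “remarkable”, here is a minimal sketch of the comparison with chance using a binomial model. It assumes each winner is an independent draw with the same probability of having the trait, which is exactly the ‘irrelevant variable’ hypothesis being tested. The figure of 8 winners is chosen purely for illustration, and the function name is a placeholder.

```python
from math import comb


def prob_at_least(k, n, p):
    """Chance of k or more 'hits' among n independent winners,
    if each winner has probability p of being a hit."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical: 8 winners, every one of them born on a Tuesday.
all_tuesday = prob_at_least(8, 8, 1 / 7)
print(f"all 8 Tuesday-born by chance: {all_tuesday:.2e}")

# Same set-up with an even two-way split instead of a one-in-seven trait.
all_one_group = prob_at_least(8, 8, 0.5)
print(f"all 8 from one half of an even split: {all_one_group:.4f}")
```

A tiny probability here only lets us reject the claim that the trait is irrelevant; it doesn’t say why or how it is relevant.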

If we did find that, it wouldn’t tell us why or how day-of-the-week was relevant. One response I’ve seen to producing gender stats is people saying that they don’t pay attention to author’s gender when voting. Even if we ignore subconscious influences and take that at face value, all that does is remove one possible cause of a gender disparity, it doesn’t make the gender disparity go away.

Another response is that looking at gender stats is ‘politics’. Well, yes, it is but it is relevant even if we otherwise lived in a gender neutral utopia. Again, imagine if Tuesday-born people won far more sci-fi awards than other people — that would be fascinating even though we don’t live in a world of Tuesday-privilege.