Is there really no accounting for taste? Two literary academics say their algorithm comes pretty close to guessing bestseller fiction. DW spoke to Jodie Archer and Matthew Jockers about "The Bestseller Code."
DW: "The Bestseller Code" describes how you built computer tools, an algorithm, to analyze bestselling books and find out what makes them bestsellers. How can this help a creative industry like fiction writing?
Jodie Archer: It was not our original intent to write a book specifically designed to help fiction writers. The book comes with a caveat that this isn't a "magic tea" that helps you, that as soon as you read it, you'll be able to write a bestseller, and it's not a "how-to book" either.
That said, we have been inundated with [authors] writing to us, saying "Wow, this is one of the most helpful books I've read in terms of my own crafting." But we've also had interest from the publishing industry. They think our data may be able to help them sort through the masses of manuscripts put forward to see which new writers might have a shot at making it.
So, Matthew, talk us through some of the technology. I understand, for instance, how a machine can read the noun-adjective ratio in a book. But how does a machine understand "high points" and "low points" in a narrative? It sounds like the "sentiment analysis" we do on social media. How does it work?
Matthew Jockers: There is a variety of ways you can do that. The approach we took is probably the simplest - it's a matter of looking up every word in a reference sentiment dictionary and scoring it. There are words that will score positive or negative, depending on context, and we get a little bit of that by studying the sentiment at the level of the sentence.
But then you expand that - you average across sentences. And what happens is that when the sentiment changes, when there's a pivotal moment, an author doesn't express that plot shift in just one sentence; it happens over the course of a paragraph or a page, or there's a build-up. Once we mapped them, these changes in sentiment became a pretty reliable proxy for what we call "plot change."
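The approach Jockers describes (look up each word in a sentiment dictionary, score each sentence, then average across neighboring sentences) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' actual tool: the tiny LEXICON, its scores, the window size, and all function names are assumptions made up for the example.

```python
import re
from statistics import mean

# Hypothetical toy lexicon; a real study would use a full reference
# sentiment dictionary. Words and scores here are invented for illustration.
LEXICON = {
    "love": 1.0, "happy": 1.0, "hope": 0.5,
    "fear": -0.5, "lost": -1.0, "dead": -1.0,
}

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def score_sentence(sentence):
    """Sum the lexicon scores of the words in one sentence (unknown words score 0)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(LEXICON.get(w, 0.0) for w in words)

def smooth(scores, window=3):
    """Average each score with its neighbors (centered window, truncated at the edges)."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        out.append(mean(scores[lo:hi]))
    return out

def sentiment_arc(text, window=3):
    """Per-sentence sentiment, smoothed so that sustained shifts stand out."""
    return smooth([score_sentence(s) for s in split_sentences(text)], window)

if __name__ == "__main__":
    story = ("She was happy. They were in love. Then he was lost. "
             "She felt fear. He was dead.")
    for i, value in enumerate(sentiment_arc(story)):
        print(i, round(value, 2))
```

The smoothing step matters because a single sentence is noisy. A shift only registers as a turning point when several neighboring sentences move in the same direction, which matches the point that a plot shift plays out over a paragraph or a page.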
You've been careful to point out that the computer didn't know what books you put into the system - whether they were bestsellers or not. But when you're programming something like this, how do you remove yourself and your bias? Because you do both have backgrounds in fiction, in editing, and in deciding what's good and what's bad.
Jockers: Yes, well, we know which books were bestsellers and which were not. But the machine doesn't know that as it parses and analyzes each book and returns a result to us, which is its guess about which ones were the bestsellers and which were not.
Is human bias good?
As for our own bias… Take for example the books the machine got wrong. One of those books was "Game of Thrones." It's atypical of what appears on the bestseller list, we know that from years of looking at the bestseller list. And the machine gets it wrong for the same reason that we would probably get it wrong if we were agents or editors. We'd look at the book and say, "This is interesting if you like Sci-Fi or fantasy, but it's not likely to appear on the bestseller list."
But surely that's the point, isn't it? Jodie, you've worked as an editor and in acquisitions. It's long been said that bestselling fiction is written according to a formula, whether it's a conscious thing or not. So why do we need this, given that authors have done so well for so long, some selling phenomenal numbers of copies? What's the use in this?
Archer: If you look at the history of publishing and the New York Times bestseller list in the US, which is what the study is based on, back in the 1980s and 90s, the list belonged to three writers: John Grisham, Danielle Steel, and Stephen King. If you weren't one of those, you really had no chance. People would say, "To be a bestselling writer, you have to be a formula writer," and what they meant was a genre writer, who was one of those specific people, using their personal writing formula, because you just didn't see other writers there.
Since then the lists have become more diverse. So there is hope in the industry that more writers of different backgrounds, more ambitious writers, less genre writers, will break through. And this algorithm can help editors look at those manuscripts that are less traditionally formulaic, that may still make it. "The Girl on the Train" is one of them. This girl-trend is totally new - "feminine noirs" as we call them - the psychological, domestic thriller. And the algorithm picked them up.
But isn't there a risk? Say I wanted to publish a book and - instead of going through an agent or a publisher, knocking on doors - I was asked to simply upload my manuscript. Then, before anyone's had a chance to look at it, it's been rejected or accepted. The human element of not knowing what makes a bestseller is totally removed.
Archer: You have to remember we're looking at a very small segment of the industry in this particular study. This is really about those books that hit the top ten [to 15] every week and those that are selling in the biggest numbers. That does not mean that if we carry the list down to 50, 60, 70, and there's a book selling in very respectable numbers and making that writer a living, that that book isn't decent and doesn't have a shot at making it to a smaller degree. But, yes, any writer or editor looking at a manuscript with this [algorithm] would have to take into consideration that the algorithm is looking at potential "bestsellerdom." It doesn't mean a book it turns down is bad; they would have to read it. But if they're going to put a million dollars on a book to buy it, it would perhaps make them feel a little more confident if the bestseller algorithm had also given it a very high score.
Our obsession with numbers
Are you not concerned by this over-reliance on numbers and correlations? Are we not over-technologizing fiction writing?
Jockers: I'd like to turn that around on you a bit. My interest in this, maybe, is more academic. And you started this line of questioning with, "What's the use of this?" My answer would be a little different, because for me the use of this is that it's provided us with an entirely different sort of microscope under which to examine what it is that writers do with their creative energy.
We're finding those elements that the bestselling writers, consciously or unconsciously, manage to weave into their prose - and at the same time, the things they didn't weave in, that we find more typical to the books that didn't hit the list. As someone who teaches literature, about craft and technique, it's an incredible tool from that perspective.
And at the same time, there is much more that is done with this kind of "text mining" technology. What you've done is the tip of a much bigger iceberg.
Jockers: There's certainly lots of things that can be done here and some of them would be distasteful and others would be positive. We all like it when the NSA uses text mining to discover a threat and avoid it, but we get very disturbed when we find they're snooping on our email. So there's positives and negatives. I do sympathize with the point you made about the person who doesn't get a human read because their manuscript hasn't passed some computational test, and when you describe that, I'm sympathetic. I would worry about that too. But I would also go back to how Jodie answered that question, which I think is important. We didn't develop this tool to be a product to sell to industry. To be fair, industry folks have been interested in what we've done. But it was really a tool to see if and what the elements are to distinguish a bestseller from a non-bestseller.
Jodie Archer and Matthew L. Jockers are the authors of "The Bestseller Code" (Allen Lane, 2016). Archer has worked in publishing in London and New York and is now a full-time writer. Jockers is a text mining and digital humanities expert. Both worked on iBooks and literature at Apple. They have been a research team since 2010.