Postings on science, wine, and the mind, among other things.

Writing style by genre

Analyzing 16k user-submitted writing samples to learn how content affects style

Like virtually any cultural activity, writing is no stranger to elitism. The division between "literary fiction" and "genre fiction" is perhaps the best-known manifestation of such snobbishness in writing. Literary fiction consists of works with significant artistic merit, as opposed to mere commercial writing. Genre fiction consists of virtually everything else - SciFi, fantasy, mystery, horror, romance, and so forth. This distinction is arguably largely artificial: much of what we now consider literary fiction - particularly "the classics" of the western canon (not to be confused with Classical writing) - is an ad hoc and post hoc jumble of what scholars generally regard as the best writing from the past few hundred years. Many of these works were considered genre fiction in their own time, but have been tossed into the same category because we no longer recognize genres like Bildungsroman or Picaresque. Thus, today's most influential genre fiction - be it A Song of Ice and Fire or Marvel Comics - may well end up as tomorrow's literary fiction.

Although the distinction between literary and genre fiction may not be substantive, it is nonetheless consequential. Successful genre writers may make better money than those professing to create "high" art, but successful literary fiction writers garner more prestige, and are probably more likely to receive accolades from genre-spanning organizations. Readers are also affected by this distinction: reading literary fiction is typically considered praiseworthy, edifying, and even moral. Reading genre fiction is more typically considered an indulgence or guilty pleasure, something to be ashamed of in the company of cultured people; it's occasionally even the subject of moral panics (is reading Harry Potter turning your child into a Satanist?).

Given such cultural consequences, it's worth asking whether there's any substantive difference at all between literary and genre fiction. Put another way, if we could carve the world of writing at its joints, would literary fiction fall away from other genres in splendid isolation? If literary fiction is really as different from other genres as its cultural dominance would suggest, then it should stand apart from other types of writing, fiction or nonfiction. However, as discussed above, there are many reasons to think this might not be the case.

One may approach such a question in a number of ways, but one arguably under-explored avenue is through the use of data. In previous blog posts, I have demonstrated how quantitative text analysis can be used to build a book recommender or analyze the stylistic similarity between famous writers. Most recently, I built a simple tool that can tell you how similar your writing style is to that of a selection of famous authors. This tool works by taking a writing sample that you paste into a text box and counting the frequencies of various grammatical words (articles, conjunctions, pronouns, etc.) and punctuation marks. This set of frequencies can then be correlated with frequencies from other (famous) writers to estimate stylistic similarity.
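The counting-and-correlating idea described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the tool: the feature list, the tokenizer, and the function names `style_profile` and `pearson` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical feature list (assumption): the tool's real list is not given in this post.
FEATURES = ["the", "a", "and", "but", "i", "he", "she", "t", ",", ".", "!", "?"]


def style_profile(text):
    """Return the relative frequency of each feature in a writing sample."""
    # Words become tokens, and every punctuation mark becomes its own token,
    # so "can't" is split into "can", "'", and "t".
    tokens = re.findall(r"[a-z]+|[^\sa-z0-9]", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[feature] / total for feature in FEATURES]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # correlation is undefined for a constant profile
    return cov / (sd_x * sd_y)


# Two made-up samples, purely for illustration.
mine = style_profile("I can't say why, but the cat and I left.")
theirs = style_profile("He saw the dog, and she saw a bird!")
print(pearson(mine, theirs))
```

A higher correlation means the two samples distribute their function words and punctuation in a more similar way, regardless of what the text is actually about.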

Recently, while checking the backend of my website, I was surprised to see that over 16,000 unique writing samples had been analyzed using my tool! Note that I do not record the writing samples themselves to protect the privacy and intellectual property rights of submitters. However, I do save the sets of frequencies which summarize each submission's writing style. Since submitters also indicate who wrote the submission and what genre it belongs to, these data offer a unique opportunity to explore how genre affects style.

Counts of unique submissions, split by who wrote them.

As you can see, in the overwhelming majority of cases, submitters reported that they had written the submitted sample themselves. This is useful because it means that there's potentially less of a pre-selection bias on the writing. Analyzing works by famous authors might be a more standard approach to comparing style between genres, but the most famous cases may not be representative of what most writers produce. For example, it could be more difficult to achieve success as a literary fiction writer than as a genre writer. If so, then comparing the most famous/successful writers across this division would suggest that literary writers are better, even if the average style in both categories was identical. Analyzing writing samples with no pre-selection may thus give us a less biased view of any differences.
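The pre-selection argument above can be made concrete with a toy simulation. Everything in it is an assumption for the sake of illustration: the notion of a single "quality" score, the normal distribution, and the 1% versus 10% success rates are invented, not estimated from any data.

```python
import random
import statistics


def simulate(n=10_000, seed=0):
    """Compare two identically distributed groups before and after selection."""
    rng = random.Random(seed)
    # Both groups draw a hypothetical "quality" score from the same distribution.
    literary = [rng.gauss(0, 1) for _ in range(n)]
    genre = [rng.gauss(0, 1) for _ in range(n)]
    # Assumption: only the top 1% of literary writers become famous,
    # versus the top 10% of genre writers.
    famous_literary = sorted(literary, reverse=True)[: n // 100]
    famous_genre = sorted(genre, reverse=True)[: n // 10]
    return {
        "all_literary": statistics.mean(literary),
        "all_genre": statistics.mean(genre),
        "famous_literary": statistics.mean(famous_literary),
        "famous_genre": statistics.mean(famous_genre),
    }


for name, value in simulate().items():
    print(f"{name}: {value:.2f}")
```

The averages over all writers are nearly the same by construction, yet the famous literary writers score higher than the famous genre writers, simply because the bar for fame was set higher.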

Counts of unique self-submissions, split by genre.

Above you can see the various genres within which submitters could categorize their writing samples. Literary fiction is the second most common category, following essays. Genre fiction is split up across a number of separate genres. Within each of these categories, we can calculate a genre-typical style by averaging style estimates across samples. We can then measure the similarity of writing in different genres by simply correlating these average style estimates. You can see the results of this process in the heatmap below, with red representing similar styles and blue representing dissimilar styles.
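The averaging and correlating steps just described might look roughly like this in Python. It is a sketch under assumptions: the function names are hypothetical, each sample is taken to be a (genre, frequency list) pair, and the numbers in the toy data are made up rather than drawn from the real submissions.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def genre_profiles(samples):
    """Average the feature frequencies of all samples within each genre."""
    grouped = defaultdict(list)
    for genre, frequencies in samples:
        grouped[genre].append(frequencies)
    return {
        genre: [sum(column) / len(column) for column in zip(*vectors)]
        for genre, vectors in grouped.items()
    }


def similarity_matrix(profiles):
    """Correlate every genre-typical profile with every other one."""
    genres = sorted(profiles)
    matrix = [[pearson(profiles[a], profiles[b]) for b in genres] for a in genres]
    return genres, matrix


# Made-up toy data: three features per sample, two samples per genre.
samples = [
    ("essays", [0.06, 0.01, 0.03]),
    ("essays", [0.08, 0.01, 0.05]),
    ("fantasy", [0.05, 0.04, 0.02]),
    ("fantasy", [0.07, 0.06, 0.02]),
]
genres, matrix = similarity_matrix(genre_profiles(samples))
for genre, row in zip(genres, matrix):
    print(genre, [round(value, 2) for value in row])
```

Each cell of the resulting matrix corresponds to one cell of a heatmap: the diagonal is always 1, and the matrix is symmetric.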

As you can see, the data seem to suggest three main writing style clusters across genres. First, in the upper right of the graph, the three primary nonfiction categories - journalism, essays, and other nonfiction - stick close together. Second, autobiography and memoirs stand alone, slightly more similar to nonfiction than to fiction, but not particularly similar to either. Finally, in the middle and lower left of the graph, all of the categories of fiction - including both literary and genre fiction - stick together. There is some internal structure in this third cluster - for example, children's literature seems relatively dissimilar from other types of fiction - but in general, fiction holds together stylistically.

This analysis tells us which genres are similar and which are different. The results seem to suggest that literary fiction is indeed much of a muchness with fiction more broadly. However, this visualization tells us little about why various genres are similar or distinct. To get at this question, we can apply a different statistical technique to the averaged style estimates for each genre. In particular, we can use a tool called metric unfolding to illustrate which stylistic features (grammatical words and punctuation marks) are driving the results in the heatmap above. You can see the results of this unfolding below in the form of a 2-D "map" of style-space, within which closer points are more similar.
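For readers curious about what unfolding does under the hood, here is a bare-bones sketch. It is not the procedure used for this post: it is a simple least-squares variant fitted by gradient descent, the `unfold` and `stress` functions are hypothetical names, and the dissimilarities are generated from invented coordinates because the post does not say how the real ones were derived. The idea is to place row objects (genres) and column objects (features) in the same 2-D space so that the distance between each genre and each feature matches their given dissimilarity as closely as possible.

```python
import math
import random


def stress(delta, rows, cols):
    """Sum of squared gaps between map distances and target dissimilarities."""
    return sum(
        (math.dist(rows[i], cols[j]) - delta[i][j]) ** 2
        for i in range(len(rows))
        for j in range(len(cols))
    )


def unfold(delta, dims=2, steps=2000, lr=0.01, seed=0):
    """Place row and column objects in one space by pairwise gradient updates."""
    rng = random.Random(seed)
    rows = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in delta]
    cols = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in delta[0]]
    for _ in range(steps):
        for i, row in enumerate(rows):
            for j, col in enumerate(cols):
                diff = [r - c for r, c in zip(row, col)]
                dist = math.sqrt(sum(d * d for d in diff)) or 1e-9
                # Gradient of (dist - delta)^2 with respect to the row point;
                # the column point receives the opposite update.
                scale = 2 * (dist - delta[i][j]) / dist
                for k in range(dims):
                    step = lr * scale * diff[k]
                    row[k] -= step
                    col[k] += step
    return rows, cols


# Invented "true" positions, used only to generate toy dissimilarities.
true_genres = [(0, 0), (2, 0), (0, 2)]
true_features = [(1, 0), (0, 1), (1, 1), (2, 2)]
delta = [[math.dist(g, f) for f in true_features] for g in true_genres]

start = stress(delta, *unfold(delta, steps=0))
fitted_rows, fitted_cols = unfold(delta)
print(f"stress before: {start:.3f}, after: {stress(delta, fitted_rows, fitted_cols):.3f}")
```

Unlike ordinary multidimensional scaling, only genre-to-feature dissimilarities enter the fit; distances among genres, and among features, emerge as a by-product of the shared map.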

The different genres of writing (red) are arrayed around the edges of the map, and largely reproduce the clusters we observed in the earlier heatmap: journalism, essays, and other nonfiction hang together on the right, autobiographies stand alone at the top, and fiction genres are on the left. What may not have been clear from the heatmap is that two clusters pull apart within fiction, with historical fiction, science fiction, fantasy, and horror in one cluster at the bottom left, and mystery, children's literature, literary fiction, romance, young adult fiction, and other fiction in another cluster nearer the top left. The blue points in the middle each represent one of the stylistic features (grammatical words and punctuation marks) that informed the analysis. I've only labelled those on the periphery, partially because these are the features which differentiate genres, and partially for legibility.

What can we glean by examining the positions of the stylistic features relative to the genres? Present-tense forms of "to be" (is, are), which may reflect the passive voice, and parentheticals (openparen and closeparen) seem to be the defining stylistic features of nonfiction. Use of the gendered pronouns (he, him, his, she, her) appears to be the most consistent theme amongst the stylistic features over-represented in fiction. As a nice sanity-check, the first-person personal pronouns (i, me, my) appear most frequently in autobiographies/memoirs, as one might expect. Humorously, "because" also appears relatively biased towards this genre, perhaps suggesting that a major purpose of such writing is self-justification.

It's a little less clear what separates the two different fictional clusters. Their vertical positioning suggests that the first-person perspective might be more common in the upper cluster than the lower. There's also a slight bias in the gender of pronouns, such that the female pronouns are closer to the upper genre cluster and the male pronouns are closer to the lower genre cluster. It's hard to know how meaningful this is, but if it's robust then this would be consistent with the stereotype that genres such as science fiction tend to be more male-dominated. We can also see more use of the neuter pronoun (its), the definite article (the), and the localizing preposition (from) on the lower side of the central stylistic feature group. The relative proximity of these words to the lower fiction genre cluster may suggest a greater emphasis on places and things in these genres. The "t" feature - which stands for the contracted t in words like "can't" or "shouldn't" - seems like the feature most specific to the upper fiction cluster, perhaps suggesting a tendency for casual negation in these genres.

There are, of course, some caveats to the conclusions we can draw from the present analyses. For instance, the sample is large but non-random, and is thus unlikely to be representative of the general population or the population of writers. Also, it's not clear how accurately people labelled the samples they submitted. If the label of "literary fiction" was used more aspirationally than realistically, then this might have biased us away from finding any differences between this genre and the others.

In sum, although some genres do pull away from literary fiction in the unfolding analysis, many commercial fiction genres remain close by. Thus, consistent with our direct examination of stylistic similarity in the heatmap, it seems as though literary fiction is not a thing apart, at least from a stylometric point of view. I hope you've enjoyed this peek into how we might approach humanistic debates from a quantitative point of view. If you'd like to read more on this theme, you might like my earlier posts on style-based book recommendation or comparing the styles of many famous authors. If you'd like to compare your own writing style to that of famous authors, you can use the tool on this site or its twin on my research platform. The two are identical, but I'd urge you to use the latter so that your participation can contribute to Science(!) rather than just my personal blog.