In one of my previous posts, I used data from Literotica.com to explore the relationship between author demographics (sex, age, sexual orientation, etc.) and the categories of erotic stories that people tend to write. One question that readers raised regarding that investigation was whether the predefined categories on Literotica actually reflected the natural structure of the story content. In other words, if one tried to arrange the stories into cohesive groups while blind to the category labels, would the predefined categories reemerge?
Reading and sorting all three hundred thousand stories on Literotica would be a monumental task, but fortunately we have ways of letting them speak for themselves. All stories on Literotica belong to one and only one category. However, authors also have the option to attach up to six customized semantic labels, i.e. tags, to their stories. Analyzing the frequencies of these tags and their co-occurrences can reveal the latent structure hidden in the stories.
Understanding this structure may tell us about more than just a genre of fiction. It may also shed light on the associations between various sexual interests. There's been relatively little research on this topic for a variety of reasons. As I've alluded to previously, this is partially because research on sex tends not to receive the attention it might due to its taboo nature. However, research on sexual fantasies and preferences is also difficult because people lie, particularly when it comes to sex. This deception is completely understandable, of course, since the exposure of sexual interests - even quite common ones - can prove embarrassing or worse. Nonetheless, this reticence hampers sex research. Studying the structure of erotic literature is a promising avenue precisely because it may reveal people's (honest) implicit knowledge and preferences. Of course, it should be remembered that Literotica's authors, while numerous, may not be fully representative of the larger population. That said, the site likely represents one of the largest collections of written erotica, so at least extrapolating from it to the themes of the genre seems fairly safe.
I used Python and Beautiful Soup to collect the tags from 293,535 stories on Literotica. Collectively, authors had described these stories with 283,753 tags. I limited the present analysis to those tags which occurred at least 1,000 times. I calculated how frequently each tag co-occurred with each other tag across stories. I also calculated the expected frequencies with which the tags should have co-occurred. I then built up a network between the tags by connecting tags for which the ratio of actual co-occurrence to expected co-occurrence was at least 5:1 and for which there were more than 100 total co-occurrences. In other words, tags that are connected in the figure below occur in the same stories far more often than we would expect given their overall frequencies, suggesting that they are thematically related.
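For readers who want to try something similar, here is a minimal Python sketch of the edge-selection step described above. It is not the code I used for the analysis. In particular, it assumes that the expected co-occurrence of two tags is what you would get if they were assigned independently (count of A times count of B, divided by the number of stories), and the function and parameter names are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def build_tag_network(stories, min_tag_count=1000, min_ratio=5.0, min_cooccur=100):
    """Return {(tag_a, tag_b): observed/expected ratio} for linked tag pairs.

    stories: iterable of tag lists, one list per story (hypothetical input format).
    """
    stories = [set(tags) for tags in stories]  # ignore duplicate tags within a story
    n_stories = len(stories)

    # How many stories carry each tag
    tag_counts = Counter(tag for tags in stories for tag in tags)
    frequent = {tag for tag, count in tag_counts.items() if count >= min_tag_count}

    # How many stories carry each pair of frequent tags
    pair_counts = Counter()
    for tags in stories:
        kept = sorted(tags & frequent)  # sorting gives each pair one canonical order
        pair_counts.update(combinations(kept, 2))

    edges = {}
    for (a, b), observed in pair_counts.items():
        # ASSUMPTION: expected count under independent tag assignment
        expected = tag_counts[a] * tag_counts[b] / n_stories
        if observed > min_cooccur and observed / expected >= min_ratio:
            edges[(a, b)] = observed / expected
    return edges
```

The defaults mirror the thresholds in the text (tags used at least 1,000 times, a ratio of at least 5:1, and more than 100 co-occurrences). On a small toy dataset you would need to lower all three to see any edges.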
The resulting network is shown in the interactive figure below. Singleton nodes, unconnected to any other, were omitted, yielding a final total of 180 nodes. The radius of each node (circle) in the network varies with the log (base 10) of the tag's frequency. The nodes were colored based on the Infomap community-finding algorithm. Due to the large number of communities, some colors were reused. If you see two nodes with the same color but no connection through other nodes of the same color, then they do not belong to the same community (cluster). Statistical analyses were performed in R using the igraph package. The network visualization was produced using d3.js, borrowing from these examples (1, 2, 3).
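The node-sizing rule can be sketched in a few lines of Python. This is only an illustration: the actual figure was drawn with d3.js, and the `scale` factor and function name below are hypothetical, since all that is stated above is that radius varies with the base-10 log of tag frequency. Community detection (Infomap, via igraph in R) is not reproduced here.

```python
import math


def node_radii(tag_counts, edges, scale=4.0):
    """Map each connected tag to a circle radius proportional to log10(frequency).

    tag_counts: {tag: number of stories using it}
    edges: iterable of (tag_a, tag_b) pairs
    scale: ASSUMPTION, an arbitrary multiplier chosen for display purposes
    """
    # Singleton tags never appear in an edge, so they drop out here
    connected = {tag for pair in edges for tag in pair}
    return {tag: scale * math.log10(tag_counts[tag]) for tag in sorted(connected)}
```

With a log scale, a tag used 10,000 times gets a circle only modestly larger than one used 1,000 times, which keeps the most popular tags from swamping the picture.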
How to interact with this graph: clicking on a node will affix it to the background. A black outline will appear around fixed nodes. Nodes can be dragged to different positions to allow for clearer viewing. Double-click on a fixed node to release it. Hover over a link between nodes to darken it and show its path more clearly.
Please feel free to explore the interactive graph above - there's a lot to it, so you'll probably see something I missed. Below I'll point out a few of the more interesting sets of clusters to emerge from the analysis.
BDSM topics constitute one of the larger themes in the Literotica corpus. The main cluster is shown in purple. The red cluster contains tags relating to dominant men and submissive women. Interestingly, the second-person point of view tag is also a part of this group. Perhaps submissive women particularly enjoy being directly addressed (i.e. with "you") in the narrative?
Another cluster (in light blue) corresponds to dominant women and submissive men. This cluster is the only one in the whole network to be linked to the crossdressing cluster (dark blue). The set of BDSM clusters is also linked quite closely to a (tan) cluster containing topics related to toys and masturbation.
Another set of clusters, albeit less densely interconnected, deals with the body. The blue-green cluster consists of body part tags. The yellow-green and grey clusters deal with oral and manual intercourse. Interestingly, an additional cluster (pink) emerges. This cluster includes tags relating to hair color, weight and age, and also large body parts. Given the lack of a direct connection between the blue-green and pink clusters, it seems as though fixation on large penises and breasts constitutes a fundamentally different theme from interest in these parts per se.
Another super-cluster deals with taboo sex: incest and student-teacher relations. Different clusters deal with mother-son, father-daughter, and sibling incest (red, light blue, and cyan, respectively). It should be noted that stories involving sex with minors are prohibited on Literotica, so these tags correspond to stories about sex between adult relatives. The incest theme is clearly tied to the educational theme through the common element of age differentials. The romantic "love" cluster is only linked to these taboo clusters through the common element of family.
A pair of clusters in orange and dark pink corresponds to fantasy/sci-fi themes. These themes are almost completely uncorrelated with any others, suggesting that stories with other themes are all equally likely to be populated by vampires and werewolves. A drama cluster in light green is the only connection to the rest of the network. Interestingly, this cluster in turn is linked to light pink and dark green clusters focused on interracial relations, as well as another dark pink cluster corresponding to gay and bisexual tags.
The largest theme super-cluster concerns the involvement of more than two people in romantic and/or sexual activities. This theme is divided into several clusters. In light blue, one cluster deals with unfaithful spouses (particularly wives) either in the context of the faithful spouse consenting/encouraging the extra-marital relations (hot wife) or not (cheating). In light purple, another cluster deals with the themes of exhibitionism and voyeurism, in which the third party or parties may not be actively involved in sex but simply serve as observer or observee. The dark purple cluster deals with swinging and outright group sex involving potentially more than three people. Finally, the green and light brown clusters deal with threesomes, with the green cluster more oriented towards the presence of two men and one woman, and the brown cluster oriented towards two women and one man.
This thematic super-cluster is far more integral to the network than the other two super-clusters of comparable size: BDSM and taboo. Whereas those two themes are largely separable from the rest of the network, there are numerous connections between the multi-person sex theme and a wide variety of other topics. This suggests that the number of people sexually involved in a narrative plays a crucial role in determining what other themes may emerge.
Two small tag clusters were completely isolated from the rest of the network: office and virginity. What this indicates is that these themes have no particular relationship with any other thematic components of a story. In other words, the tale of the sexy secretary and her boss (if you'll forgive the tired trope) is equally likely to involve a werewolf as whips and chains or the boss's wife.
As we've seen, analyzing the semantic tags associated with erotic fiction can be a powerful tool for understanding the themes of the genre. The results suggest that Literotica's category system isn't wildly off-base. It might be useful for them to further subdivide some of their categories, as the present analyses suggest that themes such as BDSM and Incest/Taboo have distinct subthemes. Un(der)recognized themes in the site's categorization system include office sex and a focus on (big) sexual body parts. Going forward, it will be interesting to see whether some of these themes have relationships at the author level that don't manifest at the story level. For example, perhaps authors who write stories involving BDSM also write other stories about group sex, despite the fact that these themes don't tend to co-occur unusually frequently within individual stories. It might also prove interesting to compare story tags to the content of the story text itself.
© 2015 Mark Allen Thornton. All rights reserved.