This inaugural blog post is devoted to providing a detailed explanation of the history and function of my book recommender for Project Gutenberg featured on this site. For those of you who haven't heard of Project Gutenberg, it is a large online repository of free ebooks. They purport to offer over 45,000 texts in a variety of formats suitable for reading in the browser or on e-readers such as the Kindle.
Most of these books are out of copyright in the United States, although a few were made freely available by the author and/or publisher. As a result they primarily date from the first quarter of the 20th century or earlier. Most of these texts are the lesser-known cousins of more famous "classics" from past eras (you can find the classics there too, although they are obviously limited in number). Additionally, many of the books that comprise the collection are no longer in print and would likely be quite hard to find outside of large libraries.
The tiny proportion of books that remain well-known centuries after their creation limits us severely as contemporary readers. How can we search out new (old) books that we might like if by chance they didn't make it through the years of filtering and vetting that created what we now call the canon? Project Gutenberg does provide some tools for doing so, such as their bookshelf categories, but not all books necessarily make it onto a bookshelf. Moreover, the category definitions pay little heed to factors such as writing style.
My recommender is meant to address this problem by providing a new way to rediscover literature lost to time. Even if you don't realize it, you probably already interact with many recommendation systems on the web. Netflix and Amazon's recommenders are two of the most well-known examples. These systems (to the best of my knowledge) rely on complex tagging systems combined with user behavior and ratings. Thus a novel like War and Peace might receive tags such as "epic", "war", "realist" and "Russian". If a user indicates they liked War and Peace, either by doing something like buying it or by rating it, then such a recommender would update its records to reflect that the user is more likely to enjoy other books with the same tags (or tags that are correlated across users). Of course there's a lot more to it than that, as you can read in the links above.
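Just to make the tag-based idea concrete, here is a toy sketch in Python (the books, tags, and scoring below are entirely invented for illustration; real commercial recommenders are of course far more sophisticated than counting shared tags):

    # Toy tag-based recommender: rank other books by how many tags they share
    # with a book the user liked. The tags here are made up for illustration.
    books = {
        "War and Peace": {"epic", "war", "realist", "Russian"},
        "Anna Karenina": {"realist", "Russian", "romance"},
        "The Iliad": {"epic", "war", "ancient"},
        "Pride and Prejudice": {"romance", "realist", "English"},
    }

    def recommend(liked, books, n=2):
        """Rank the other books by how many tags they share with a liked book."""
        liked_tags = books[liked]
        scores = {title: len(tags & liked_tags)
                  for title, tags in books.items() if title != liked}
        return sorted(scores, key=scores.get, reverse=True)[:n]

    print(recommend("War and Peace", books))  # ['Anna Karenina', 'The Iliad']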
Unfortunately, unlike Amazon and Netflix, I don't have the resources to throw at minutely tagging tens of thousands of books. Moreover, since a tiny proportion of books gets the overwhelming majority of readers, it's questionable how well any system derived from just the classics would generalize to the rest. Instead I decided to go to the raw data: the text of the books themselves. With some help from stackexchange, I soon had the full collection downloaded from one of Project Gutenberg's many mirrors. This ended up being a relatively modest 12.3 GB zipped, or 28.3 GB as uncompressed text files.
Using Python, I compiled a non-redundant list of English books including filename, book title and author (where applicable). Fortunately Project Gutenberg ebooks have a fairly consistent format, and thus a little regular expression magic was all that was necessary to carve each text up into header, actual book, and footer. From the header, more regular expressions allowed me to extract the title and author. The body of the book I fed into the guess-language package to ensure that it was English. This process yielded a total of 36,615 books, which I ultimately whittled down to 34,188 after manually removing any duplicates and periodicals I could still find.
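In case you're curious what that carving step looks like, here's a rough sketch (the START/END marker strings vary a bit from file to file, and the guess-language call shown is from memory, so treat this as illustrative rather than my exact script):

    import re
    from guess_language import guessLanguage  # the guess-language package

    # The START/END markers are not perfectly uniform across the collection,
    # so real code needs a few more regex variants than shown here.
    START_RE = re.compile(r"\*\*\*\s*START OF TH(?:IS|E) PROJECT GUTENBERG EBOOK.*", re.I)
    END_RE = re.compile(r"\*\*\*\s*END OF TH(?:IS|E) PROJECT GUTENBERG EBOOK.*", re.I)

    def parse_gutenberg_text(raw):
        """Split a raw ebook into header/body/footer and pull out title, author, language."""
        start, end = START_RE.search(raw), END_RE.search(raw)
        if not (start and end):
            return None  # markers not found; skip this file
        header = raw[:start.start()]
        body = raw[start.end():end.start()]
        title = re.search(r"^Title:\s*(.+)$", header, re.M)
        author = re.search(r"^Author:\s*(.+)$", header, re.M)
        return {
            "title": title.group(1).strip() if title else None,
            "author": author.group(1).strip() if author else None,
            "body": body,
            "is_english": guessLanguage(body) == "en",
        }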
While I compiled this list, my script was also searching through the text of each book counting the meaningful words. By meaningful words, I mean any words that are not stop words: very high-frequency grammatical function words like articles, conjunctions and pronouns. I used the stop word list in the Python Natural Language Toolkit to filter out these content-free words. I retained any remaining words that occurred at least 100 times in at least 100 different books. While this might seem like a high bar for inclusion, bear in mind that this is less than half a percent of the words within less than half a percent of the books. Ultimately this process yielded a set of 989 words.
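In code, the counting and thresholding looks roughly like this (a simplified sketch; the tokenization here is cruder than what I actually ran):

    import re
    from collections import Counter, defaultdict
    from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

    STOP_WORDS = set(stopwords.words("english"))

    def count_content_words(body):
        """Count non-stop-word tokens in one book (simple lowercase tokenization)."""
        tokens = re.findall(r"[a-z']+", body.lower())
        return Counter(t for t in tokens if t not in STOP_WORDS)

    def select_vocabulary(per_book_counts, min_count=100, min_books=100):
        """Keep words occurring at least min_count times in each of at least min_books books."""
        books_meeting_threshold = defaultdict(int)
        for counts in per_book_counts:  # one Counter per book
            for word, n in counts.items():
                if n >= min_count:
                    books_meeting_threshold[word] += 1
        return {w for w, k in books_meeting_threshold.items() if k >= min_books}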
The next step in the analysis involved taking another walk through the text of each book. Thanks to Python it only took a few hours to count the occurrences of each of the 989 content words in each book. I additionally compiled counts from each book for every stop word and for every standard punctuation mark. Then, using scipy, I calculated the "distance" (based on cosine similarity) between every pair of books in terms of content (using the content word frequencies) and in terms of style (using the stop word and punctuation mark frequencies).
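The distance step itself boils down to a couple of scipy calls. Here's a miniature version with made-up counts; the same function would be applied once to the content-word matrix and once to the stop-word-plus-punctuation matrix (at 34,188 books the real matrices are, of course, enormously bigger):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def cosine_distance_matrix(counts):
        """Pairwise cosine distances between the rows of a (books x features) count matrix.

        Cosine distance ignores overall length, so raw counts and per-book
        frequencies give the same answer.
        """
        return squareform(pdist(counts.astype(float), metric="cosine"))

    # Tiny demo: 4 "books", 6 "words". Books 0/1 and 2/3 should pair up.
    demo_counts = np.array([
        [10, 0, 3, 5, 0, 1],
        [9, 1, 2, 6, 0, 0],
        [0, 8, 0, 1, 7, 2],
        [1, 7, 1, 0, 9, 3],
    ])
    print(np.round(cosine_distance_matrix(demo_counts), 3))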
This is where the humanists among you will probably stop me and say, "Surely you don't mean to say that the frequencies of these words are all there is to meaning or style?" And of course you'd be correct to say so: there's obviously more behind these concepts than mere frequencies. However, many of those things are considerably more difficult to measure using an automated text-mining script. Moreover, for our purposes we don't really need the frequencies to mean anything intrinsically, so long as they are reliably correlated with meaning (or style). This correlation between meaning and frequencies is the property that bag-of-words approaches like this one depend on.
To give a roughly analogous example, consider the case of personality traits. These traits are abstract psychological constructs that we can't directly measure at all. However, we can get at them by measuring things (self-reported preferences, behaviors, and so forth) that are thought to reliably correlate with a particular trait. Measure enough of them and you have a good idea of what traits a person possesses.
Back to the distance calculations: I calculated the distances between every pair of the 34,188 books. This meant
# of distance calculations = (34188^2)/2 - 34188/2 = 584,392,578
for content and the same number again for style. It took about 30 hours for this calculation to complete on my rather overpowered desktop (though admittedly I was definitely not guilty of the sin of premature optimization). However, once it eventually completed, it was a relatively simple matter to pick the 10 books with the smallest distances to each book in the collection, which is exactly what the recommender now provides you with!
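If you want to check my arithmetic, or see roughly how that last neighbor-picking step works, here's the gist (content_dist is just a stand-in name for a full distance matrix from the previous step; at real scale you'd probably reach for np.argpartition rather than a full argsort):

    import numpy as np

    n = 34188
    print(n * (n - 1) // 2)  # 584392578 unique pairs, as advertised

    def top_k_neighbors(dist_matrix, k=10):
        """For each row (book), return the indices of the k closest other books."""
        d = dist_matrix.copy()
        np.fill_diagonal(d, np.inf)  # a book shouldn't recommend itself
        return np.argsort(d, axis=1)[:, :k]  # smallest distances first

    # e.g. top_k_neighbors(content_dist)[i] would give the 10 most similar
    # books (by content) to book i.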
Whew! Building this thing was a bit more involved than I anticipated - but then, isn't that always the case? The whole recommender side of things was originally just an excuse for me to practice my web development skills, but at least I can say I'm proud of the result. Stay tuned for updates about what I'm going to do with the recommender going forward, and how I'm going to use your ratings to make it better. In the meantime I hope some of you find something you like to read!