I'm a bit more fussy about what I consider words, but there are some obvious exclusions that probably shouldn't be excluded (221B being a prime example). There are a lot of rules that could be added to make the extraction less flawed: for example, splitting the data on null, new line and space delimiters, and picking out unusual spacing (e.g. multiple spaces or tabs) to detect where a word is unusual and so not acceptable in body text.
Anyway, here is a lazy example based on some additional rules as to what might be considered a word.
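The rules themselves aren't spelled out in the post, so what follows is only a guess at the sort of thing meant: a minimal Python sketch, not the poster's actual example. The function name `extract_words`, the delimiter set, and the "letters with optional internal apostrophes or hyphens" rule are all assumptions made for illustration.

```python
import re

# Hypothetical rule set (an assumption, not the poster's actual rules):
# split on nulls, new lines, tabs and spaces, then keep only tokens made
# of letters, optionally joined by internal apostrophes or hyphens.
DELIMITERS = re.compile(r"[\x00\n\r\t ]+")
WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")
EDGE_PUNCTUATION = ".,;:!?\"'()[]"


def extract_words(text):
    """Return the lower-cased tokens in text that pass the word rule."""
    words = []
    for token in DELIMITERS.split(text):
        # Drop punctuation stuck to either end of the token.
        token = token.strip(EDGE_PUNCTUATION)
        if token and WORD.fullmatch(token):
            words.append(token.lower())
    return words


print(extract_words("Holmes lived at 221B Baker Street, didn't he?"))
```

With this rule, 221B is dropped because it contains digits, which is exactly the kind of borderline exclusion mentioned above; loosening `WORD` to allow digits would keep it, at the cost of letting plain numbers through.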
Fun.
Big.txt is a compendium of multiple works, though. Try Remembrance of Things Past for a single monumental work, 700K words in 7 volumes. Proust had a lot of free time on his hands!
Dan