I’ve been experimenting recently with a fascinating new research tool brought to you by those wonderful folks at Google Labs. It’s called the “Ngram Viewer”, and it’s basically a tool for taking words and phrases and, as Google puts it, “…display[ing] a graph showing how those phrases have occurred in a corpus of books (e.g., “British English”, “English Fiction”, “French”) over…selected years”. It’s a technique of limited scope, but quite interesting in what it can do. Since it’s easier to show with an illustration than to describe in the abstract, let’s consider an example. (BTW, if the page versions are too small to make out, click on the [FS] after the ngram and a full-sized version of the picture will be available.)
[FS] This ngram considers the STEM fields (science, technology, engineering, mathematics) as represented in the American English corpus from 1850 to the present. The metric on the Y-axis is a percentage: of all the words in the corpus’s books published in a given year (the X-axis), the share accounted for by the subject word (for a phrase, the share of all phrases of the same length). It can be interpreted approximately as follows: science is far and away the most common term, and although it fluctuates up and down, it is about as common now as it was then. Engineering and mathematics got relatively little play until the 1930s, rose somewhat then, and have stayed about the same since. Technology, however, although it got a really late start, skyrocketed past the others about 1960, even surpassing science itself about 2000; since then it has dropped slightly but remains at about its high level.
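To make the Y-axis a bit more concrete, here is a minimal sketch of how a per-year percentage of this kind could be computed. The counts below are invented purely for illustration (they are not real Google Books numbers), the function name `percent_share` is my own, and I am assuming the denominator is the total number of words in that year’s books.

```python
# Toy numbers, invented purely for illustration; NOT real Google Books counts.
term_counts = {
    1950: {"science": 120, "technology": 15},
    2000: {"science": 150, "technology": 160},
}
# Assumed denominator: total words in the corpus for each year (also invented).
total_words = {1950: 1_000_000, 2000: 1_600_000}


def percent_share(term, year):
    """Occurrences of `term` in `year` as a percentage of all words that year."""
    return 100.0 * term_counts[year][term] / total_words[year]


for year in sorted(term_counts):
    for term in ("science", "technology"):
        print(f"{year} {term}: {percent_share(term, year):.4f}%")
```

Because each year’s count is divided by that year’s total, a term can be used more often in absolute terms and still lose ground on the graph if the overall volume of text grows faster.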
World-shaking? Hardly. Suitable for hypothesis testing? No; this is strictly an exploratory technique. (I’ll be posting on the critical differences between exploratory and hypothesis-testing research in the next few days.) Informative? Well, that depends on what you can tie it to, and what your interests are. Personally, I found it interesting how fast technology rose as a meme during our techno-fascination period, and how it’s begun to fall off almost as quickly since 2000.
[FS] Let’s look at another one. This one charts six disciplines with which I have been strongly or loosely identified over the years. It shows that until around 1910 these terms hardly ever emerged into book titles, but starting then and continuing until about 1930 they all took off, with psychology taking the lead; that field then dropped some, but later recovered somewhat. Sociology peaked about 1935 and, despite a brief rise in the 1970s, has generally remained pretty flat; likewise engineering, but at a somewhat higher level. Economics briefly plateaued, but then took off again, reaching a high in 1985 before falling off again. Political science and business administration are essentially non-starters; it doesn’t mean that they haven’t been written about, but that they haven’t been featured in titles.
[FS] The ngram viewer can also be configured to look at differences between terms. This one explores three current dichotomies, showing which term of each pair got more play at what times. The line for “public-private” shows a definite advantage for public over this time period (1800-present). The line for “government-business” shows a slowly declining advantage for government until about 1905, when business took the advantage; about 1940 government regained the edge, but lost it again in 2000. I find the line for “problem-solution” particularly interesting. Titles featuring “solution” had an advantage until about 1928, when “problem” took the lead; it has held the lead ever since, although its advantage has declined slightly since 2000.
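For anyone who wants to see what a difference line amounts to, here is a rough sketch of doing it by hand: subtract one term’s yearly values from the other’s, then look for the years where the sign of the result flips (that is, where the lead changes hands). The numbers are invented for illustration and are not taken from the graph, and the helper names `difference` and `lead_changes` are my own, not anything Google provides.

```python
def difference(a, b):
    """Per-year difference a - b, for the years present in both series."""
    return {year: a[year] - b[year] for year in sorted(a.keys() & b.keys())}


def lead_changes(diff):
    """Years at which the sign of the difference flips from the last non-zero value."""
    changes = []
    prev_sign = 0
    for year, value in diff.items():
        sign = (value > 0) - (value < 0)  # +1, 0, or -1
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                changes.append(year)
            prev_sign = sign
    return changes


# Invented percentages, for illustration only.
government = {1900: 0.010, 1910: 0.009, 1940: 0.012, 2000: 0.011}
business = {1900: 0.008, 1910: 0.010, 1940: 0.011, 2000: 0.012}

diff = difference(government, business)
print(lead_changes(diff))
```

A positive value means the first term is ahead and a negative one means the second is; the years the function returns are the first data points after each crossover, so how precisely you can date a change depends on how finely the data are sampled.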
That’s enough for now. I’ve got some more that I’ll present soon. In the meantime, I’ll ask you to think about this, and if there’s any particular analysis you’d be interested in, either run it yourself or better yet, send your question to me in a comment and I’ll run it and interpret it for you. As I said, this isn’t a hypothesis testing technique, but it can be quite interesting. And, thanks to the nice folks at Google, it’s both free and comprehensive.