Intriguing new free research tool

By | April 26, 2013

I’ve been experimenting recently with a quite fascinating new research tool brought to you by those wonderful folks at Google Labs. It’s called an “Ngram Viewer”, and it’s basically a tool for taking words and phrases and, as Google puts it, “…display[ing] a graph showing how those phrases have occurred in a corpus of books (e.g., “British English”, “English Fiction”, “French”) over…selected years”. It’s a technique of limited scope, but quite interesting in what it can do. Since it’s easier to illustrate than to describe in the abstract, let’s consider an example. (BTW, if the page versions are too small to make out, click on the [FS] after the ngram and a full-sized version of the picture will be available.)

[Figure: STEM ngram] [FS] This ngram considers the STEM fields (science, technology, engineering, mathematics) as represented in the American English corpus from 1850 to the present. The metric on the Y-axis is the percentage of all the words in books published in the year in question (the X-axis) that are the subject word; for a phrase, it is the share of all phrases of the same length. It can be interpreted approximately as follows: science is far and away the most common term, and although it fluctuates some up and down, it has remained about as common between then and now. Engineering and mathematics got relatively little play until the 1930s, rose some then, and have remained about the same since. Technology, however, although it got a really late start, skyrocketed past the others about 1960, even surpassing science itself about 2000; since then it has dropped slightly but remains at about its high level.
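For anyone curious about what sits behind a chart like this, here is a rough sketch of the kind of calculation involved: count how often a word occurs in a given year's books and divide by the total number of words counted for that year. This is only an illustration under my own assumptions, not Google's actual code; the function name `yearly_share` and the tiny `toy_corpus` are made up for the example.

```python
from collections import Counter

def yearly_share(tokens_by_year, term):
    """Return {year: percent of that year's tokens that equal `term`}.

    Hypothetical helper for illustration; it handles single words only.
    """
    shares = {}
    for year, tokens in sorted(tokens_by_year.items()):
        counts = Counter(t.lower() for t in tokens)
        total = sum(counts.values())
        # Guard against an empty year so we never divide by zero.
        shares[year] = 100.0 * counts[term.lower()] / total if total else 0.0
    return shares

# Made-up miniature "corpus": a handful of words per year, not real data.
toy_corpus = {
    1950: "science and engineering shape modern science".split(),
    2000: "technology drives science and technology drives business".split(),
}

science = yearly_share(toy_corpus, "science")
technology = yearly_share(toy_corpus, "technology")

for year in sorted(toy_corpus):
    print(year, round(science[year], 1), round(technology[year], 1))
```

Because each year is divided by its own total, years with many more books don't automatically swamp years with fewer, which is what makes the lines comparable across a century and a half.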

World-shaking? Hardly. Suitable for hypothesis testing? No, this is strictly an exploratory technique. (I’ll be posting on the critical differences between exploratory and hypothesis-testing research in the next few days.) Informative? Well, that depends on what you can tie to it, and what your interests are. Personally, I found it interesting how fast technology rose as a meme during our techno-fascination period, and how it has begun to fall off since 2000.

[Figure: psychology etc. ngram] [FS] Let’s look at another one. This one charts six disciplines with which I have been strongly or loosely identified over the years. It shows that until around 1910 these terms hardly ever turned up in books, but starting then and continuing until about 1930 they all took off, with psychology taking the lead; that field then dropped some, but later recovered somewhat. Sociology peaked about 1935, and despite a brief rise in the 1970s, has generally remained pretty flat; likewise engineering, but at a somewhat higher level. Economics briefly plateaued, but then took off again, reaching a high in 1985 before falling off again. Political science and business administration are essentially non-starters; it doesn’t mean that they haven’t been written about, just that the phrases themselves seldom appear in print.


[Figure: public-private ngram] [FS] The ngram viewer can also be configured to look at differences between terms. This one explores three current dichotomies in terms of which term got more play at what times. The line for “public-private” shows a definite advantage for public over this time period (1800-present). The line for “government-business” shows a slowly declining advantage for government until about 1905, when business took the advantage; about 1940 government regained the edge, but lost it again in 2000. I find the line for “problem-solution” particularly interesting. “Solution” had an advantage until about 1928, when “problem” took the lead; it has held the lead since then, although with its advantage declining slightly since 2000.
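A difference line of this sort is simple to reproduce if you have two yearly series in hand: subtract one from the other, then look for the years where the sign flips. The sketch below does just that. The helper names `difference` and `crossover_years` are mine, and the numbers are invented purely for illustration; they are not read off the chart above.

```python
def difference(series_a, series_b):
    """Return {year: a - b} for the years present in both series."""
    shared_years = sorted(series_a.keys() & series_b.keys())
    return {year: series_a[year] - series_b[year] for year in shared_years}

def crossover_years(diff):
    """Return the years at which the difference changes sign
    relative to the previous year in the series."""
    years = sorted(diff)
    flips = []
    for prev, cur in zip(years, years[1:]):
        if diff[prev] * diff[cur] < 0:
            flips.append(cur)
    return flips

# Invented percentages for two terms; NOT actual Ngram Viewer values.
government = {1900: 0.030, 1910: 0.028, 1940: 0.035, 2000: 0.030}
business = {1900: 0.025, 1910: 0.030, 1940: 0.030, 2000: 0.033}

gap = difference(government, business)
print(crossover_years(gap))
```

A positive value in `gap` means the first term is ahead that year, a negative value means the second is, and each year returned by `crossover_years` marks a change of lead.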


That’s enough for now. I’ve got some more that I’ll present soon. In the meantime, I’ll ask you to think about this, and if there’s any particular analysis you’d be interested in, either run it yourself or, better yet, send your question to me in a comment and I’ll run it and interpret it for you. As I said, this isn’t a hypothesis-testing technique, but it can be quite interesting. And, thanks to the nice folks at Google, it’s both free and comprehensive.

  • Jonathan Freeman

    Is there perchance a possible research project to be had in the “interesting” psychology behind the names chosen by responders to blogs? For instance:

    “pistol targets silhouette”

    “good design”

    “cheapest auto insurance”

    “Cleaning service in Henrico”

    “audio video in virginia beach”

    “electrical repair in madera”




    “password hacking software”


    “unfinished cabinets in beaverton”

    …to pick just a few completely at random…. in the spirit of which I suggest:

    “nose picked at random” … that’s Random, South Texas… small town with terrible traffic problems, really bad congestion around the nasal base.

    “Don Key” … little known Spanish explorer. Didn’t discover much as his mode of transport had relatively short legs.

    “Virgin beaches on video” … actually a cable channel only viewable when staying at certain hotels.

    ditto for “beaver finished on my cabinets”.

    • DrEvel1

      I believe that I do a reasonably good job of keeping the junk responses out of my blog. Since I started, I have had (as of this afternoon) 2,188 responses that are appropriately classified as “spam”. The vast majority of these are handled and suppressed by the Akismet software made available by WordPress for use with these blogs. I still get anywhere from 1 to 5 other spam responses that manage to slip through the screen and show up temporarily on the blog; however, I remove these as I find them. When removed, they become part of the spam database, and their characteristics become part of what the screening software uses to identify spam. Keeping the blog clear is just one of those things that we have to learn how to do.
      What interests me more than the names of the spammers (which may, after all, not be their real names) is the content of the messages that slip through the screen. Generally, they sound plausible, and occasionally even I wonder if this might not be a real response. It would have to be from a reader who hadn’t really read the post in question, but I aspire to as many readers as possible, even if they are sophomores in high school. The spammers obviously have generators capable of creating postings plausible enough to pass some pretty good screens; yet most of the spam doesn’t use them (look at the contents of your spam folder if you doubt me). Are the literate spams from some particularly literate spammer? I have no answers here, unfortunately. I really don’t know enough about the mechanics of spam filters to offer a definitive analysis (for good reasons, the information publicly available about effective spam filters is limited). However, I will continue to police my blog vigorously; and I’ll let you know when the beaver finishes with my cabinets.

      BTW, I have some interesting findings along the lines that you suggested for using this tool further. I would have posted them 10 days ago, except that when I had the post all ready to upload, I managed to delete it without any backup in the process. This so demoralized me that it’s taken about two weeks to even consider redoing it. But I will; watch this space.

  • Jonathan Freeman

    What an interesting tool. Thanks for making folk aware of it. I had no idea it existed.

    Might I suggest two Yiddish words to explore using it: “mensch” and “schlep”, out of curiosity about the influence of Jewish/Ashkenazi culture in the USA, and the way in which Yiddish words became an integral part of American writing… although, did they? Perhaps your results will show that there was a Rise & Fall?

    Do you have any thoughts on Google’s apparent altruism? Apropos of which, I saw an interesting TED talk by Stephen Wolfram (from 2010 in Long Beach, I think) about his 30 years of work on computation (Mathematica, Wolfram Alpha, A New Kind of Science). Will you be blogging about such efforts as his and Google’s to encompass the entirety of human knowledge and render it accessible and useable to anyone?