Research is a lot about studying the characteristics of large things — large populations of people and organizations, big ideas and concepts, long stretches of time. But large things often aren’t easy to get at. More usually, the whole population is unavailable to us for a variety of practical reasons, and we only have access to pieces of it. When we are able to study the large thing directly, we usually call that a “census”, the best-known being the decennial count of the US population. I once took part in a study with an N of 69,000 — all the R&D contracts let by the US Army over a five-year period. When working with censuses, standard statistical approaches lose their utility (more on this issue in a future post).

The reason that we study and emphasize statistics in research training, however, is that often all we can get at is a part of the whole — that doesn’t diminish our appetite for information about the whole, but it certainly does impede access to the information itself. Thus research often involves the examination of properties of (relatively) small groups of units out of the whole, with the aim of generalizing the results so observed to the larger population from which that smaller group, called a “sample”, is taken or “drawn”.

The process of generalizing back to a population based on the properties of a sample is called “inference”, and, however critical it may be, it turns out to be something that human beings aren’t particularly good at. As we noted earlier, we are prone to a series of errors and “frame problems” that make our inferences systematically less good than they ought to be. Usually, these errors don’t matter a lot. But in systematic scientific research, they can be devastating, and we need methods of protecting ourselves against them. As we’ve commented earlier, a substantial part of what we call “statistics” is in fact nothing more than a set of tools for protecting scientists from making inferential errors.

Since statistics is in large part based on the mathematics of probability, statistical inference and generalization from samples to populations requires that the samples be constructed according to certain rules. In particular, it requires that the samples be selected from the population at random. But this turns out to be particularly problematic, for reasons both practical and theoretical. In fact, in many cases random sampling is simply not possible. Thus, researchers have developed a wide variety of sampling techniques collectively called “non-probability samples” that are often used in study practice. Collectively, they probably account for the vast majority of survey research samples, although they give approximate results at best. Understanding both the uses and limitations of all kinds of samples is a key part of research training — as is developing the more craft-oriented understanding that practically useful results can be generated by procedures that are something less than theoretically elegant.
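The core idea of a simple random sample — every unit in the population has an equal chance of being selected — is easy to see in code. Here is a minimal sketch using only Python’s standard library; the population itself is made-up data invented purely for illustration:

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 units with some numeric attribute
# (say, a survey score). In real research we usually can't see all of it.
population = [random.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)

# A simple random sample: random.sample gives every unit an equal
# chance of selection, with no unit chosen twice.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

print(f"population mean: {true_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample mean lands close to the population mean, which is exactly the property that inference depends on — and the property that a convenience sample (say, the first hundred units that happen to be handy) gives you no mathematical guarantee of.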

It’s important to understand what “practically useful” means in this context. If you’re running a national political poll on the eve of a closely contested presidential election, then you need to be *extremely* sure that your results are correct and generalizable — if you’re off by even a couple of percentage points, then your credibility is shot and lots of people paying your bills are going to be very angry. So in that case you need to follow very precise sampling procedures using all the information at your disposal — strict random sampling, stratified and weighted appropriately. There’s still a distinct chance that you’ll be off, but you’ll have given yourself the best chance that current science can buy for you. If you’re researching the market for a new toothpaste, you’ll probably settle for a sample that, while random, isn’t so carefully structured to map onto the population, because the added precision of your results isn’t worth the extra expense. Good sampling is very costly, and all you’re looking for is a reasonable approximation. And if you’re correlating a couple of attitudinal measures in support of a hypothesis for an academic dissertation, then you’ll probably settle for the first fifty or sixty people you can get to fill out your questionnaire, since few practical consequences attach to whether the hypothesis actually holds in the larger population. This isn’t to say that sampling doesn’t matter in dissertation research, or even in marketing research — only that the precision of the sample with regard to the population is one of the trade-offs that are inevitable in the actual doing of field research.

An important corollary to the idea of sampling is the idea of “statistical power”. Power simply refers to the chance of finding a relationship in a sample, given that it actually exists in the population — that is, it is the chance of supporting a true hypothesis. Since it’s expressed as a probability, power can range from 0 (“no chance, buddy!”) to 1 (“lead-pipe cinch”), and is usually written as 1 − β, where β (the Greek letter beta) represents what is also called “Type II error”: failing to support a true hypothesis. It’s closely related to alpha (α) — that is, Type I error, the chance of supporting a false hypothesis that we’re willing to live with. As one source put it, “If the Type I risk is the chance of crying wolf, the Type II risk is the chance of not seeing a real wolf.”
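The wolf metaphor can be made concrete with a small simulation — not anything from the studies discussed here, just a hedged sketch using invented numbers. We repeatedly run a simple two-sided z-test, once in a world where the null is true (so every rejection is a Type I error, crying wolf) and once where a real effect exists (so every failure to reject is a Type II error, a missed wolf):

```python
import math
import random
import statistics

random.seed(0)

ALPHA = 0.05     # Type I error rate we're willing to live with
Z_CRIT = 1.96    # two-sided critical value corresponding to alpha = 0.05
N = 30           # sample size per trial (arbitrary illustrative choice)
SIGMA = 1.0      # assume a known population SD, so a z-test suffices

def rejects_null(true_mean: float) -> bool:
    """Draw one sample and test H0: mean = 0 with a two-sided z-test."""
    sample = [random.gauss(true_mean, SIGMA) for _ in range(N)]
    z = statistics.mean(sample) / (SIGMA / math.sqrt(N))
    return abs(z) > Z_CRIT

TRIALS = 5000
# World 1: the null is actually true — every rejection is "crying wolf".
type1 = sum(rejects_null(0.0) for _ in range(TRIALS)) / TRIALS
# World 2: a real effect of 0.5 exists — every non-rejection misses a real wolf.
type2 = sum(not rejects_null(0.5) for _ in range(TRIALS)) / TRIALS

print(f"observed Type I rate:  {type1:.3f}  (should hover near {ALPHA})")
print(f"observed Type II rate: {type2:.3f}  (power = 1 - beta = {1 - type2:.3f})")
```

The observed Type I rate settles near the α we chose, and power comes out as exactly 1 − β, the complement of the observed Type II rate — the two risks are two sides of the same decision rule.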

Obviously, we would like this power value to be as large as possible. But since power is a function of sample size and effect size (the strength of the actual relationship between the variables in the population), it doesn’t come free. Small effect sizes require larger samples to detect, and may easily be missed; with large enough samples, an effect of virtually any size can be found statistically significant, whether or not it is practically meaningful. The trade-offs among sample size, effects, and power are very important and have major impacts on how research is actually practiced.
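That trade-off can be put in numbers. The sketch below uses the standard normal approximation for the power of a two-sided z-test; the effect sizes 0.2 and 0.8 are conventional stand-ins for “small” and “large” standardized effects, not figures from any particular study:

```python
import math

def z_power(effect: float, n: int, z_crit: float = 1.96) -> float:
    """Approximate power of a two-sided z-test at alpha = 0.05,
    for a standardized effect size and sample size n."""
    # Standard normal CDF, built from math.erf so no extra libraries are needed.
    phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    shift = effect * math.sqrt(n)  # effect in standard-error units
    # Probability the test statistic lands beyond either critical value.
    return phi(shift - z_crit) + phi(-shift - z_crit)

for effect in (0.2, 0.8):          # "small" vs "large" standardized effect
    for n in (25, 100, 400):
        print(f"effect={effect}, n={n:>3}: power={z_power(effect, n):.2f}")
```

Running this shows the asymmetry in the text: a large effect is detected almost surely with only 25 cases, while a small effect needs a sample on the order of 400 before power climbs anywhere near the same level — which is precisely why small effects are so easily missed, and why huge samples can make trivial effects “significant”.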

Part 6 of this extended discussion on causal inference is here.