Internal validity of research studies

Ph.D. Topics: Research and Evaluation Methods

What is validity? If I say “that’s a valid argument” to you, it means your facts and your logic seem reasonable to me. In research methods, we talk about validity because we want to make statements about the world; we want to make knowledge claims. We want these claims to be valid, meaning they should be well-grounded in logic and fact so that we can trust in them.

Much of scientific research is concerned with making claims about causality. In education research, for example, we want to know what causes students’ math achievement to be high or low. Is it what their teacher does? Their raw brain power? How hard they work? And so forth. Obviously, the answer is that it’s many things, but to the extent possible we want to isolate the factors that are under our control (teaching method, curriculum, school culture) and find the factors that will result in the highest math achievement.

Internal validity = Extent to which you can infer causality

The internal validity of a research study is the extent to which you can make a causal claim based on the study; it is the validity of the causal inference you make. Different research designs provide stronger or weaker internal validity. For example, well-designed randomized experimental designs generally are considered to provide the strongest internal validity. Quasi-experimental studies in which treatments are assigned randomly to intact groups (e.g., classrooms) can have strong internal validity also.

Porter (1997, as cited in Gliner & Morgan, 2000) identified three criteria necessary for establishing a causal relationship:

  • Temporal precedence
  • Existence of an association
  • Elimination of plausible alternative explanations

We probably wouldn’t even investigate the potential presence of causality if we hadn’t already noted that the putative cause preceded the effect. The existence of an association is usually established statistically, with ANOVA (often used for experimental designs) or regression (more common for observational designs). The third criterion is the one that is often problematic. How do we rule out other explanations? What might explain the statistically significant differences we see besides the cause(s) we hypothesized? From here, we have to look at specific threats to validity: confounding elements that may explain the statistically significant differences among groups that we see.
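To make the second criterion concrete, here is a minimal simulation of testing for an association between group membership and scores. A permutation test is used as a stdlib-only stand-in for the ANOVA or regression mentioned above (with two groups they answer the same question), and all scores are fabricated purely for illustration:

```python
import random
import statistics

random.seed(0)

# Hypothetical math scores for two classrooms (numbers fabricated
# for illustration only).
treatment = [78, 82, 75, 80, 77, 85, 79, 81, 74, 83]
control = [72, 70, 76, 68, 74, 71, 69, 73, 75, 70]

observed = statistics.mean(treatment) - statistics.mean(control)

# Permutation test: if group labels were arbitrary, how often would a
# random relabeling produce a mean difference at least this large?
pooled = treatment + control
n_perm = 10_000
count = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:10]) - statistics.mean(pooled[10:])
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / n_perm
print(f"observed difference = {observed:.1f}, p = {p_value:.4f}")
```

A small p-value tells us only that an association exists; as the rest of this section argues, it says nothing by itself about *why* the groups differ.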

Threats to validity

Campbell & Stanley (1963) identified eight threats to internal validity. Gliner & Morgan (2000) grouped these threats into two main types: equivalence of groups on participant characteristics, and control of extraneous experience or environmental variables. These eight threats, with Gliner & Morgan’s classifications, are:

  1. History – results you see may be due to environmental events influencing the outcomes. (Experiences/Environment)
  2. Maturation – subjects change over time. For example, just because you see that someone’s math achievement has improved over time doesn’t mean that the curriculum you used caused that improvement. (Experiences/Environment)
  3. Testing – a.k.a. the “practice effect”: with repeated testing, subjects improve simply from taking the test again, not because of the intervention. (Experiences/Environment)
  4. Instrumentation – scoring can change; calibration of instruments can drift; different raters may give different results. (Experiences/Environment)
  5. Statistical regression – subjects with extreme scores will tend to have less extreme scores when you test them again. (Participant characteristics)
  6. Differential selection – e.g., if you let subjects pick their own group, the two groups may differ in some way that influences the outcomes. (Participant characteristics)
  7. Experimental mortality – subject dropout may differ according to what treatment the subject receives. (Participant characteristics)
  8. Selection-maturation interaction – biases in assignment may interact differentially with maturation or other factors. (Participant characteristics)
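Threat 5, statistical regression (regression to the mean), is easy to demonstrate with a quick simulation. The sketch below is hypothetical: students selected for extremely low scores on one noisy test score higher on a retest even though nothing was done to them, which could masquerade as a treatment effect:

```python
import random
import statistics

random.seed(1)

# Simulate 1000 students: a stable "true ability" plus independent
# measurement noise on each of two tests (all numbers fabricated).
n = 1000
true_ability = [random.gauss(100, 10) for _ in range(n)]
test1 = [a + random.gauss(0, 10) for a in true_ability]
test2 = [a + random.gauss(0, 10) for a in true_ability]

# Select the students with the most extreme (lowest) first-test
# scores, as a remedial program might.
cutoff = sorted(test1)[100]  # roughly the bottom 10%
selected = [i for i in range(n) if test1[i] <= cutoff]

m1 = statistics.mean(test1[i] for i in selected)
m2 = statistics.mean(test2[i] for i in selected)
print(f"selected group: test 1 mean = {m1:.1f}, test 2 mean = {m2:.1f}")
# With no intervention at all, the selected group's mean rises on
# retest, because part of their extreme test-1 scores was just noise.
```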

According to Cook & Campbell (1979), seven of these threats (all except experimental mortality) can be ruled out by the use of control group designs and random assignment of subjects to the various treatment and control groups; that is, by using a randomized experimental design with control groups. But Cook & Campbell identified four threats in addition to experimental mortality that are not ruled out by experimental control group designs:
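Random assignment itself is mechanically simple; its power comes from letting chance, rather than participant choice or researcher judgment, determine group membership, which equates the groups in expectation on all participant characteristics. A minimal sketch (the roster names are hypothetical):

```python
import random

random.seed(7)

# Hypothetical roster of 20 subjects.
subjects = [f"student_{i}" for i in range(20)]

# Shuffle the roster and split it in half: each subject has the same
# chance of landing in either group, so the groups are equivalent in
# expectation on every participant characteristic, measured or not.
random.shuffle(subjects)
treatment_group = subjects[:10]
control_group = subjects[10:]
print(treatment_group)
print(control_group)
```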

  1. Diffusion/imitation of treatment
  2. Compensatory equalization of treatments
  3. Compensatory rivalry
  4. Resentful demoralization

These four threats arise particularly in field experiments, where groups receiving different treatments can observe each other (Cook & Shadish, 1994). If there is this sort of interference between the groups, internal validity may be compromised.

How do you get around that? More modern treatments of causal inference such as Rubin’s potential outcomes model suggest some possibilities. Perhaps I’ll tackle that in a separate post along with some discussion of alternate views of the epistemology of causality, such as Lincoln & Guba’s (1985) constructivist paradigm.


Campbell, D.T. & Stanley, J.C. (1963). Experimental and Quasi-Experimental Designs for Research. Chicago: Rand-McNally.

Cook, T.D., & Campbell, D.T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston: Houghton Mifflin.

Cook, T. D., & Shadish, W. R. (1994). Social experiments: Some developments over the past fifteen years. Annual Review of Psychology, 45(1), 545-580.

Gliner, J. A., & Morgan, G. A. (2000). Research Methods in Applied Settings: An Integrated Approach to Design and Analysis. Mahwah, N.J: Lawrence Erlbaum.

Lincoln, Y.S., & Guba, E.G. (1985). Naturalistic Inquiry. Newbury Park, CA: Sage.

Porter, A.C. (1997). Comparative experiments in educational research. In Jaeger, R.M. (Ed.), Complementary Methods for Research in Education (2nd ed., pp. 524-544). Washington, DC: American Educational Research Association.