statistics

# The normal distribution

Ph.D. Topics: Statistics

The normal distribution, or bell curve, is probably the most important probability distribution in statistics. Many quantities we observe are roughly normally distributed; the central limit theorem provides a mathematical explanation for this.

The probability density function is given by:

$f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
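As a quick sanity check, the density is easy to evaluate directly. A minimal sketch using only the standard library (the values of `mu` and `sigma` are just illustrative defaults):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))
    coef = 1.0 / math.sqrt(2.0 * math.pi * sigma**2)
    return coef * math.exp(-(x - mu) ** 2 / (2.0 * sigma**2))

# The density peaks at x = mu, with height 1/(sigma * sqrt(2*pi))
print(normal_pdf(0.0))  # ≈ 0.3989
```

Note the symmetry about $\mu$: `normal_pdf(mu + d)` and `normal_pdf(mu - d)` are always equal.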


# The central limit theorem


Why are so many quantities we measure in nature approximately normally distributed? The central limit theorem (CLT), a key tenet of probability theory, says that the average or sum of a large number of independent and identically distributed random variables will be approximately normally distributed, regardless of whether the underlying variables are themselves normally distributed. Many outcomes we measure (someone’s height, their math aptitude, the temperature in New Orleans on a summer day) represent the sum of many independent factors; the CLT is a mathematical explanation of why these quantities follow a roughly bell-shaped distribution.

The CLT also provides justification for null hypothesis testing of means and mean differences. It tells us that no matter what the underlying distribution of the quantity we’re measuring looks like, the distribution of sample means will be approximately normal, so long as we take a large enough sample.

#### Understanding the central limit theorem

Here’s an easy-to-understand definition of the central limit theorem:

The distribution of an average tends to be normal, even when the distribution from which the average is computed is decidedly non-normal.

Let’s dig into this definition a little.
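The easiest way to convince yourself of that definition is by simulation. The sketch below (standard library only; the sample sizes are arbitrary choices) takes averages of exponential random variables, which are strongly right-skewed, and shows that the averages cluster symmetrically around the true mean of 1 with the spread the CLT predicts, about $1/\sqrt{n}$:

```python
import random
import statistics

random.seed(7)

def sample_mean(n):
    # average of n draws from Exponential(1), a decidedly non-normal,
    # right-skewed distribution with mean 1 and sd 1
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# 2,000 averages, each computed over n = 100 draws
means = [sample_mean(100) for _ in range(2000)]

# The averages center on the true mean (1.0) with sd near 1/sqrt(100) = 0.1,
# and the skew of the raw exponential draws largely washes out.
print(statistics.fmean(means), statistics.stdev(means))
```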

# Two kinds of people in the world…

… those that like to classify people into different kinds and those that don’t. I’m a classifier.

That’s why I’m intrigued by latent class analysis (LCA), where you statistically divide up people into unobserved classes based on some observed variables (like behavior). Take the example of autism. Is Asperger’s Syndrome on the autistic spectrum or is it an altogether different thing? LCA might be able to answer that question.

I’ve spent the last couple days reading through simulation studies on identifying classes in an LCA-type technique called growth mixture modeling (GMM) where you try to identify classes underlying different developmental trajectories. The oft-cited example in this area is alcohol use, tracked during adolescence and sometimes into adulthood. These studies typically find a few distinctly different trajectories, so different that they (apparently) qualify as different latent classes. For example, this 2003 study found five growth trajectories:

• Occasional very light drinkers
• Moderate escalators
• Infrequent bingers
• Rapid escalators
• De-escalators

I’m thinking of designing and running my own simulation study of growth mixture modeling, starting from the ideas in Bauer & Curran (2003). They demonstrated that GMM using information criteria routinely in use at that time would likely extract too many classes given non-normal inputs.

I’m thinking I could go the opposite way: look at cases where there are multiple classes generating the data and see what happens when you treat the data as coming from a single population. Jedidi, Jagpal, & DeSarbo (1997) tackled this question in the case of LCA (not growth curve analysis) with applications to marketing.

But what I’m struggling with is this: when you see non-normal data, is that because there really are multiple classes generating it, or is the data inherently non-normal? How can you detect the difference, given that non-normal distributions can be approximated by mixtures of normal distributions?
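To make that struggle concrete, here’s a small simulation sketch (standard library only; every parameter value is invented for illustration). A two-class mixture of normals and a single inherently skewed distribution are tuned to have roughly the same mean, spread, and skewness, so their one-sample summaries give you no way to tell which came from two latent classes:

```python
import random
import statistics

random.seed(1)
N = 20_000

# (a) two latent classes: 70% centered at 0, 30% centered at 2, both sd 1
mixture = [random.gauss(2.0 if random.random() < 0.3 else 0.0, 1.0)
           for _ in range(N)]

# (b) one "class" that is inherently skewed: a shifted gamma draw with
# shape/scale chosen to roughly match the mixture's mean and skewness
single = [random.gammavariate(50.0, 0.2) - 9.4 for _ in range(N)]

def skewness(xs):
    # standardized third moment (population version)
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])

# Both samples show a mean near 0.6 and mild positive skew; nothing in
# these summaries reveals which one was generated by two classes.
print(statistics.fmean(mixture), skewness(mixture))
print(statistics.fmean(single), skewness(single))
```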

On the one hand, I have this philosophical sense that there aren’t any “classes” of people in the world, just different ways of classifying. On the other hand, genotype differences are real, so I need to keep the medical interpretation in mind. For example, there is clearly a class of people who have cystic fibrosis compared to a much larger class of people who do not. Those are the sorts of situations I need to keep in mind when I design the simulation. Alcohol use is interesting, but I’m not sure I’d use it as a template for what I’d like to explore.


# The price and payoff of Bayesian statistics

I’ve never totally understood why people complain so much about having to specify prior distributions in order to do Bayesian inference. Even if you’re doing frequentist statistics, you have to make some assumptions about the world and about your data. If you’re using a maximum-likelihood-based approach, you’re counting on asymptotics to get you to multivariate normality — and so many data analysis problems just don’t have the sample size for that.

The big payoff with Bayesian statistics, it seems to me, is that you get full-on probability distributions as output, not just a mean and a standard error. But everyone focuses on specification of the prior.

Johnson & Albert in Ordinal Data Modeling:

The additional “price” of Bayesian inference is, thus, the requirement to specify the marginal distribution of the parameter values, or the prior. The return on this investment is substantial. We are no longer obliged to rely on asymptotic arguments when performing inferences on the model parameters, but instead can base these inferences on the exact conditional distribution of the model parameters given observed data–the posterior.

That is a huge payoff. But even more important than that, Bayesian statistics is so much more believable than classical. I am almost happy that I spent 15 years ignorant of what was going on in academic statistics so I could jump on the Bayesian train now.
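A tiny conjugate example of that payoff (the data here are made up purely for illustration): with a Beta(1, 1) prior on a proportion and 7 successes in 10 trials, the posterior is Beta(8, 4) — an exact, fully specified distribution at any sample size, no asymptotic normality required:

```python
import math

# Beta(1, 1) prior; data: 7 successes, 3 failures -> posterior Beta(8, 4)
a, b = 1 + 7, 1 + 3

post_mean = a / (a + b)  # 8/12 = 2/3

def beta_pdf(p):
    # the entire posterior density is available, not just a point and an SE
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * p ** (a - 1) * (1 - p) ** (b - 1)

print(post_mean, beta_pdf(0.5))
```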

Here’s one of the first “pop statistics” articles I’ve seen — an attempt to clarify for the layperson what is going on with statistical practice in academic research. It’s a good article. I learned a few things and found a few interesting references.

Reporter Siegfried misses a couple of important points, though. He doesn’t note that frequentist statistics are based on repeated sampling carried on into infinity, and that confidence intervals cannot be interpreted except with reference to that long run. This is endlessly confusing to intro stats students; most of them probably never absorb it.

And what about Bayesian statistics? Siegfried, like so many others, focuses on specifying the prior:

Bayesian math seems baffling at first, even to many scientists, but it basically just reflects the need to include previous knowledge when drawing conclusions from new observations. To infer the odds that a barking dog is hungry, for instance, it is not enough to know how often the dog barks when well-fed. You also need to know how often it eats — in order to calculate the prior probability of being hungry. Bayesian math combines a prior probability with observed data to produce an estimate of the likelihood of the hunger hypothesis.

This describes Bayesian stats mostly correctly (in my novice opinion) but focuses too much on the price (the need to specify the prior) rather than the payoff you get (probability distributions that are easily interpretable under conventional notions of probability).
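The barking-dog example is just Bayes’ rule. With some made-up numbers (purely illustrative): say the dog is hungry 20% of the time, barks 80% of the time when hungry, and barks 10% of the time when well-fed. Then

$P(\textup{hungry} \mid \textup{bark}) = \frac{P(\textup{bark} \mid \textup{hungry}) \, P(\textup{hungry})}{P(\textup{bark})} = \frac{0.8 \times 0.2}{0.8 \times 0.2 + 0.1 \times 0.8} = \frac{0.16}{0.24} = \frac{2}{3}$

The prior probability of hunger (0.2) gets updated to 2/3 once we hear a bark — and the output is itself a probability, directly interpretable.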

Here, I think, Siegfried further obscures what’s going on with the enthusiasm for Bayesian ways of analyzing data:

But Bayesian methods introduce a confusion into the actual meaning of the mathematical concept of “probability” in the real world. Standard or “frequentist” statistics treat probabilities as objective realities; Bayesians treat probabilities as “degrees of belief” based in part on a personal assessment or subjective decision about what to include in the calculation. That’s a tough placebo to swallow for scientists wedded to the “objective” ideal of standard statistics. “Subjective prior beliefs are anathema to the frequentist, who relies instead on a series of ad hoc algorithms that maintain the facade of scientific objectivity,” Diamond and Kaul wrote.

No, no, no. Bayesian methods do not introduce confusion into the concept of probability. Classical statistics did that. Bayesian statistics clarifies probability — makes it into a human measure, not some pseudo-objective long-run construction.


# Effort or expression

From Michael Foley’s The Age of Absurdity: Why Modern Life Makes It Hard to be Happy:

Difficulty has become repugnant because it denies entitlement, disenchants potential, limits mobility and flexibility, delays gratification, distracts from distraction and demands responsibility, commitment, attention and thought….

Why submit to mathematical rigour when you can do a degree in Surfing and Beach Management instead?

This is exactly what I was getting at with my cross-country study of math achievement. Countries that have higher self-expressive values tend to have lower mean math achievement. Not only that, but they are less likely to reward students with higher liking-for-math with math achievement, compared to countries with more survivalist values. (Note: my final results were not quite so puny as I had feared. Actually, after I refined the model based on some diagnostics, they were pretty good, but until I solve the psychometric problem of measuring liking-for-math I cannot go further with it. So I suppose that will be what I attack this summer for my research practicum.)

I think this may explain why the curve of math achievement related to GDP flattens at higher levels of income — as countries make the shift from industrial to post-industrial, their values shift from survivalist to self-expressive, and self-expressive values are more likely to encourage a degree in Surfing and Beach Management than in something that requires multivariable calculus and differential equations.

As countries industrialize, GDP goes up, and values shift from traditionalist to secular-rational. In this phase, math achievement is encouraged and supported, because an industrially-oriented economy needs, most of all, quantitatively skilled human capital. Mean math achievement will improve and students who like math will put even more effort into it. They will be encouraged by their parents and their peers and by future job opportunities.

But then industrialization provides so much wealth that the country shifts to a post-industrial economy, as has happened in many English-speaking countries. Now cultural values move from survivalist to self-expressive. It’s less important to prepare yourself for a high-paying job than to “do what you love.” The money will follow.

Crudely put, does a culture reward effort or expression? Math success requires effort and it doesn’t really help someone express themselves (just read some of my more statistically-oriented blog posts and you can see that!)

I’m not saying an emphasis on expression is wrong. But it doesn’t contribute to high math achievement. Improving teacher quality isn’t going to change that. Context and culture matter when it comes to academics.

# Modeling scale usage heterogeneity the Bayesian way

Posts in my journal club category are my summaries and thoughts on journal articles I read. I’ve found I absorb material much better if I try to summarize it in a way that might make sense to someone else. The article covered here offers a potential solution to a problem I ran into in the TIMSS data set.

Rossi, P.E., Gilula, Z., Allenby, G.M. (2001). Overcoming scale usage heterogeneity: A Bayesian hierarchical approach. Journal of the American Statistical Association 96(453), 20-31.

Abstract. Questions that use a discrete ratings scale are commonplace in survey research. Examples in marketing include customer satisfaction measurement and purchase intention. Survey research practitioners have long commented that respondents vary in their usage of the scale: Common patterns include using only the middle of the scale or using the upper or lower end. These differences in scale usage can impart biases to correlation and regression analyses. To capture scale usage differences, we developed a new model with individual scale and location effects and a discrete outcome variable. We model the joint distribution of all ratings scale responses rather than specific univariate conditional distributions as in the ordinal probit model. We apply our model to a customer satisfaction survey and show that the correlation inferences are much different once proper adjustments are made for the discreteness of the data and scale usage. We also show that our adjusted or latent ratings scale is more closely related to actual purchase behavior.

Assume the observed item indicators (a matrix $X$) are a discrete version of underlying latent continuous data $Y$; $i$ indexes individuals and $j$ the questions. There are $K+1$ common, ordered cutoff points $c_k$, the first at negative infinity and the last at positive infinity, such that

$x_{i,j} = k \;\; \textup{if} \;\; c_{k-1}\leq y_{i,j} < c_k$

The underlying latent continuous variables are distributed multivariate normal:

$y_i \sim N(\mu^*_i, \Sigma^*_i)$

The cutoffs discretize the latent variable $Y$. This is similar to a multinomial probit model, but we’re not interested in the conditional distribution of one discrete variable but rather the joint distribution of a bunch of them.

This model provides for a different mean vector and covariance matrix for each respondent, but the authors simplify it by using a respondent-specific location-scale shift:

$y_i = \mu + \tau_i \iota + \sigma_i z_i$

$z_i \sim N(0, \Sigma)$

This allows for acquiescent/disacquiescent response styles, for overuse of a particular response value, and for extreme response styles:

• Acquiescent (disacquiescent) would be represented with a large positive (negative) location shift and a shrunken scale parameter.
• Overuse of a particular response value would be represented by a location shift to that value with a shrunken scale parameter.
• Extreme response styles would be represented with no scale shift and a very large scale parameter, which would tend to put a lot of probability density into the two tails.

The location and log scale parameters are modeled as bivariate normal, allowing them to be correlated with each other:

$\begin{bmatrix} \tau_i\\ \textup{ln} \: \sigma_i \end{bmatrix} \sim N(\varphi, \Lambda)$

You need to specify or model the cutoff values somehow. You could assume them to be known, say equally spaced between the actual values on the rating scale. This model specifies them in a way that allows for nonlinear spread, which you can imagine might be the case:

$c_k = a + bk + ek^2$
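To make the generative story concrete, here’s a minimal simulation of the discretization step (standard library only; the cutoff coefficients and the per-respondent $(\tau_i, \sigma_i)$ values below are invented for illustration rather than drawn from the bivariate normal of the full model):

```python
import random

random.seed(42)

K = 5                        # 5-point ratings scale
a, b, e = -3.5, 1.0, 0.05    # quadratic cutoffs: c_k = a + b*k + e*k^2
# interior cutoffs c_1..c_{K-1}; c_0 and c_K are -inf and +inf
cutoffs = [a + b * k + e * k**2 for k in range(1, K)]

def respond(mu, tau, sigma):
    # latent response: y = mu + tau + sigma * z, with z ~ N(0, 1)
    y = mu + tau + sigma * random.gauss(0.0, 1.0)
    # discretize: x = k if c_{k-1} <= y < c_k
    return 1 + sum(y >= c for c in cutoffs)

mu = 0.0  # common item mean

# An "acquiescent" respondent: positive location shift, shrunken scale
acquiescent = [respond(mu, tau=1.5, sigma=0.3) for _ in range(1000)]
# An "extreme" respondent: no location shift, inflated scale
extreme = [respond(mu, tau=0.0, sigma=3.0) for _ in range(1000)]

# The acquiescent respondent piles up at the top of the scale;
# the extreme respondent overuses both endpoints.
print(sum(acquiescent) / 1000, sum(x in (1, 5) for x in extreme) / 1000)
```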

The authors go over a number of assumptions that force identification of the model. I get the need for this, even if I don’t quite understand why they did what they did or what implications it has. I’ll come back to that at some point.

Then you need priors for $\mu$, $\Sigma$, $\varphi$, $\Lambda$, and $e$. They use flat priors on the means and the cutoff parameter $e$, and inverse-Wishart priors for the covariance matrices.

And from there it’s just an easy simulation problem. Ha, right!

Thankfully, I’m not on my own trying to understand and implement something like this because Rossi, Allenby, & McCulloch wrote a textbook that includes a case study dealing with it. There’s even software and data sets to go with it. But since Penrose doesn’t have the book, I have to wait to get it from the University of Northern Colorado, darn. Too bad, because it would have been fun to spend spring break sorting it all out.

# Social science as rhetorical exercise: An example from research on narrative identity processing

Green, Ha, & Bullock (2010) critique mediation analyses in social science research:

Given the strong requirements in terms of model specification and measurement, the enterprise of “opening the black box” or “exploring causal pathways” using endogenous mediators is largely a rhetorical exercise.

But what is social science anyway? To what extent can we find the “truth” about complex social systems that involve agents with free will and myriad complex, interlinked influences on them?

Perhaps social science is just rhetoric of an advanced sort, carefully constructed arguments based on theory, prior research, data analysis and hunches that describe how the world might work. Over time, some of these arguments are shown to be false, so (ideally) we fix the story up and make it better fit what we’ve observed and what we can deduce from the build-up of evidence and argument so far.

#### An example from the psychology of narrative identity processing

Pals’ (2006) study of narrative identity processing and adult development is an example of mediation analysis as advanced rhetoric. Here’s the abstract:

Difficult life experiences in adulthood constitute a challenge to the narrative construction of identity. Individual differences in how adults respond to this challenge were conceptualized in terms of two dimensions of narrative identity processing: exploratory narrative processing and coherent positive resolution. These dimensions, coded from narratives of difficult experiences reported by the women of the Mills Longitudinal Study (Helson, 1967) at age 52, were expected to be related to personality traits and to have implications for pathways of personality development and physical health. First, the exploratory narrative processing of difficult experiences mediated the relationship between the trait of coping openness in young adulthood (age 21) and the outcome of maturity in late midlife (age 61). Second, coherent positive resolution predicted increasing ego-resiliency between young adulthood and midlife (age 52), and this pattern of increasing ego-resiliency, in turn, mediated the relationship between coherent positive resolution and life satisfaction in late midlife. Finally, the integration of exploratory narrative processing and coherent positive resolution predicted positive self-transformation within narratives of difficult experiences. In turn, positive self-transformation uniquely predicted optimal development (composite of maturity and life satisfaction) and physical health.

This study was correlational, so that’s the first reason that strict causalists would dismiss it. It also studied mediation, so even if it were some sort of randomized experiment, there would be questions about its suggestions of causality. But the researcher doesn’t just run the mediational analysis and then declare that she’s shown what she wanted to show. She places the correlational findings in the context of theory and makes an overall argument for her hypothesis while noting the limitations of the approach:

A second limitation of this study is that although the hypotheses reflect theoretically driven ideas about cause-effect relations (e.g., coping openness stimulates exploratory narrative processing; coherent positive resolution leads to increased ego-resiliency), the correlational design did not allow for analyses that would support conclusive statements regarding causality. The longitudinal findings were consistent with causal patterns unfolding over time but did not prove them. Thus, an important direction for future research on narrative identity processing will be to examine its causal impact, ideally through studies that closely examine the connection between changes in narrative identity and changes in relevant outcomes. In one recent study, for example, individuals who wrote about a traumatic experience for several days displayed an increase in self-reported personal growth and self-acceptance, whereas those who wrote about trivial topics did not show this pattern of positive self-transformation (Hemenover, 2003). This finding supports the idea that when people fully engage in the narrative processing of a difficult experience, their understanding of themselves and their lives may transform in ways that will make them more mature, resilient, and satisfied with their lives. Findings such as these reflect the growing view that the narrative interpretation of past experiences—the cornerstone of narrative identity—constitutes one way adults may intentionally guide development and bring about change in their lives (Bauer et al., 2005).

Is this research useful even if causality and mediation have not been proven? I think it is. We don’t know for sure which way causality runs among the various traits and behaviors studied (it probably runs in multiple directions), but Pals makes a good argument that someone with coping openness may engage in exploratory narrative processing of difficult life events and this, in turn, may drive a maturing process. In the second mediational hypothesis, she argues that developing coherent positive resolutions in that narrative processing of life events might lead to increased ego-resiliency. Are these analyses and arguments of practical use? I think yes.

The remembering self creates stories and needs those stories to make sense of experience. Research like Pals (2006) gives insight into what sort of stories might be most useful in leading towards maturity and psychological resiliency:

• Development of the stories should use an open and exploratory style rather than closed and defensive.
• The ending of the story should reflect some sort of positive resolution.

So mediation analysis, even of the non-experimental sort, can be useful. Okay so maybe it’s not like the scientific finding that lack of Vitamin C causes scurvy, but that doesn’t make it useless or unscientific.

Philosophers of science would have something more sophisticated to say about this. My point is that science doesn’t happen exactly according to the “scientific method” you learned in high school. In many ways it is just advanced rhetoric that’s (ideally) grounded in careful analysis, thoughtful theorizing, and an understanding of prior research.

References

Green, D. P., Ha, S. E., & Bullock, J. G. (2010). Enough already about “black box” experiments: Studying mediation is more difficult than most scholars suppose. Annals of the American Academy of Political and Social Science, 628, 200-208. Available at SSRN: http://ssrn.com/abstract=1544416

Pals, J.L. (2006). Narrative identity processing of difficult life experiences: Pathways of personality development and positive self-transformation in adulthood. Journal of Personality 74(4).
