diary of a doctoral student, psychometrics, statistics

Dissertation topic: Constructing predictive indexes

The actual working title of my dissertation is: Modeling Social Participation as Predictive of Life Satisfaction and Social Connectedness: Scale or Index?

When I tell people my topic, I usually start with the domain area: social participation as related to life satisfaction in older U.S. adults (my data set is people age 65 and over from the Health and Retirement Study), but really, the topic is a statistical and measurement one. Participation happens to be something I’m personally interested in and fits the statistical problem area, but I could do this same project in a variety of domains with a range of constructs. Maybe I ought to change my elevator speech to start with the statistical/measurement part.

Most psychometrics concerns itself with the measurement of latent psychological constructs like attitudes, intelligence, academic achievement and so forth. Psychometricians have developed sophisticated means of constructing instruments (surveys or assessments, for example) that can measure these latent constructs. The approach taken is often based on either classical test theory or item response theory. Either way, the assumption is that observed data (such as a student’s answers to test questions or a subject’s survey responses) are caused by whatever unobserved trait is intended to be measured.

However, there are some things we want to measure that don’t fit this model. Social participation is one of them. Participation instruments generally ask the respondent to report his or her level of participation in various activities. In a latent factor setting, you would then assume some underlying level of participation that gave rise to the observed frequencies of participation. That’s not quite right though. If someone increases their participation in some area (say, by joining an investment club), their overall level of participation goes up. The increase in participation in the investment club seems causally prior to the increase in overall participation. This is the opposite direction of causality from that proposed by traditional psychometric models.

Some people call a measurement instrument developed by some sort of summation of disparate items an index rather than a scale, where a scale follows the latent factor model. The development of such indexes follows a so-called formative measurement model, where what you’re trying to measure is formed of what you observe, in contrast to the development of scales that follows a reflective measurement model, where what you observe reflects the underlying latent factor of interest. In the diagram, the first figure represents formative measurement (observed indicators x1-x3 cause the latent construct eta 1) and the second figure represents reflective (observed indicators y1 to y3 reflect the level of the latent construct).
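To make the contrast concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than part of my dissertation model: the loadings, the weights, the function names, and especially the choice to make the formative indicators independent of each other (formative indicators are allowed to correlate, they just don't have to). The point is only that reflective indicators must correlate because they share a cause, while formative indicators carry no such requirement.

```python
import numpy as np

def simulate_reflective(n, loadings, rng):
    """Latent factor eta causes each indicator: y_j = loading_j * eta + noise."""
    eta = rng.standard_normal(n)
    noise = rng.standard_normal((n, len(loadings)))
    y = eta[:, None] * np.asarray(loadings) + noise
    return eta, y

def simulate_formative(n, weights, rng):
    """Indicators form the construct: eta = sum_j weight_j * x_j + disturbance.
    Assumption for illustration: the x_j are drawn independently."""
    x = rng.standard_normal((n, len(weights)))
    eta = x @ np.asarray(weights) + 0.1 * rng.standard_normal(n)
    return eta, x

def mean_offdiag_corr(items):
    """Average correlation between distinct pairs of indicators."""
    r = np.corrcoef(items, rowvar=False)
    k = r.shape[0]
    return (r.sum() - k) / (k * (k - 1))

rng = np.random.default_rng(0)
_, y = simulate_reflective(5000, [0.9, 0.8, 0.7], rng)  # hypothetical loadings
_, x = simulate_formative(5000, [0.5, 0.3, 0.2], rng)   # hypothetical weights

reflective_r = mean_offdiag_corr(y)
formative_r = mean_offdiag_corr(x)
print(f"mean inter-item correlation, reflective: {reflective_r:.2f}")
print(f"mean inter-item correlation, formative:  {formative_r:.2f}")
```

One consequence worth noticing: internal-consistency statistics, which reward inter-item correlation, make sense for the reflective case but say little about the quality of an index built the formative way.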

There has been plenty of criticism of formative measurement, but I think it can be made useful, and that’s the aim of my dissertation project. I’m now at the analysis stage and just beginning to really understand the usefulness and potential of formative indexes.

As an aside, I don’t like to call formative measurement “measurement.” I prefer to think of it as “modeling.” I think what you’re doing with index development is constructing a one- or few-number summary of a lot of individual data items in a way that predicts outcomes of interest. Think of the Apgar score as a good example. It gives you a one-number summary of the health of the baby and its likelihood to survive and thrive, but you’re not measuring one thing in particular about the baby. Well, maybe you are measuring overall health. Hmmmm.

To be continued…


Data science: Don’t forget psychometrics

I can hardly believe that in this 5,000-word-plus blog post on data science there is not one single mention of psychometrics.

Statistician Andrew Gelman calls psychometrics “the most underrated science.” He also says:

This reminds me of a longstanding principle in statistics, which is that, whatever you do, somebody in psychometrics already did it long before.

What? You don’t know what psychometrics is? Well, don’t feel bad. I didn’t know what it was until after I enrolled in a doctoral program that had it as one of its core topics. Yet even the program description doesn’t mention the word “psychometrics.” So what is it? Here’s one definition:

Psychometrics is the field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits. The field is primarily concerned with the construction and validation of measurement instruments, such as questionnaires, tests, and personality assessments. [Wikipedia]

So psychometrics is the science of measuring unobservable characteristics of people. Another name for those unobservables is “latent variables.” You can’t measure knowledge, abilities, attitudes, and personality traits directly. I can’t just look at you and know how good you are at math, for example. Psychometricians develop measurement instruments like standardized tests, questionnaires, IQ assessments and so forth to measure these latent psychological constructs. They rely on a vast foundation of theory and tools that help ensure these measurement instruments measure what they purport to measure (validity) and measure it consistently without excessive error (reliability).

Of course psychometrics is relevant to analyzing web data, because what is the web about anyway? People doing things online, as well as what they might like to do online (subscribe to a web service, rent a movie, buy a nutritional supplement). Web properties want to use their vast pools of data to tell them something about the psychology and predicted behavior of the people using their websites. Psychometrics can help.

psychometrics, statistics

Two kinds of people in the world…

… those that like to classify people into different kinds and those that don’t. I’m a classifier.

That’s why I’m intrigued by latent class analysis (LCA), where you statistically divide up people into unobserved classes based on some observed variables (like behavior). Take the example of autism. Is Asperger’s Syndrome on the autistic spectrum or is it an altogether different thing? LCA might be able to answer that question.

I’ve spent the last couple days reading through simulation studies on identifying classes in an LCA-type technique called growth mixture modeling (GMM) where you try to identify classes underlying different developmental trajectories. The oft-cited example in this area is alcohol use, tracked during adolescence and sometimes into adulthood. These studies typically find a few distinctly different trajectories, so different that they (apparently) qualify as different latent classes. For example, this 2003 study found five growth trajectories:

  • Occasional very light drinkers
  • Moderate escalators
  • Infrequent bingers
  • Rapid escalators
  • De-escalators

I’m thinking of designing and running my own simulation study of growth mixture modeling, starting from the ideas in Bauer & Curran (2003). They demonstrated that GMM using information criteria routinely in use at that time would likely extract too many classes given non-normal inputs.

I’m thinking I could go the opposite way: look at cases where there are multiple classes generating the data and see what happens when you treat the data as coming from a single population. Jedidi, Jagpal, & DeSarbo (1997) tackled this question in the case of LCA (not growth curve analysis) with applications to marketing.

But what I’m struggling with is this: when you see non-normal data, is that because there really are multiple classes generating that? Or is the data inherently non-normal? How can you detect the difference, given that non-normal distributions can be approximated by mixtures of normal distributions?
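Here is a rough sketch of why this is hard, assuming nothing beyond NumPy. It draws data from a single skewed population (a lognormal, my own choice for illustration), then compares the BIC of a one-class normal model with the BIC of a two-class normal mixture fitted by a bare-bones EM loop. This is a toy univariate stand-in, not growth mixture modeling and not the design used by Bauer & Curran; the function names and settings are hypothetical.

```python
import numpy as np

def normal_logpdf(x, mean, sd):
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def bic_one_class(x):
    """BIC of a single normal distribution (2 free parameters)."""
    loglik = normal_logpdf(x, x.mean(), x.std()).sum()
    return -2 * loglik + 2 * np.log(x.size)

def bic_two_class(x, n_iter=200):
    """BIC of a two-component normal mixture fitted by EM (5 free parameters)."""
    n = x.size
    # Crude starting values: split the sample at its median.
    med = np.median(x)
    lo, hi = x[x <= med], x[x > med]
    means = np.array([lo.mean(), hi.mean()])
    sds = np.array([lo.std(), hi.std()])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior class membership probabilities.
        log_joint = np.log(weights) + normal_logpdf(x[:, None], means, sds)
        log_marg = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
        resp = np.exp(log_joint - log_marg[:, None])
        # M-step: weighted updates of proportions, means, and SDs.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        sds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk)
        sds = np.maximum(sds, 1e-3)  # guard against a collapsing component
    log_joint = np.log(weights) + normal_logpdf(x[:, None], means, sds)
    loglik = np.logaddexp(log_joint[:, 0], log_joint[:, 1]).sum()
    return -2 * loglik + 5 * np.log(n)

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.75, size=2000)  # ONE skewed population

bic1 = bic_one_class(x)
bic2 = bic_two_class(x)
print(f"BIC, one class:   {bic1:.1f}")
print(f"BIC, two classes: {bic2:.1f}")
```

Lower BIC is better, so if the two-class model wins here, the information criterion is rewarding the mixture for soaking up skew, not for finding real subgroups. That is exactly the ambiguity the simulation study would need to address.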

On the one hand, I have this philosophical sense that there aren’t any “classes” of people in the world, just different ways of classifying. On the other hand, genotype differences are real, so I need to keep the medical interpretation in mind. For example there is clearly a class of people who have cystic fibrosis compared to a much larger class of people who do not. Those are the sorts of situations I need to keep in mind when I design the simulation. Alcohol use is interesting but I’m not sure I’d use it as a template for what I’d like to explore.

psychometrics, statistics

Modeling scale usage heterogeneity the Bayesian way

Posts in my journal club category are my summaries and thoughts on journal articles I read. I’ve found I absorb material much better if I try to summarize it in a way that might make sense to someone else. The article covered here offers a potential solution to a problem I ran into in the TIMSS data set.

Rossi, P.E., Gilula, Z., Allenby, G.M. (2001). Overcoming scale usage heterogeneity: A Bayesian hierarchical approach. Journal of the American Statistical Association 96(453), 20-31.

Abstract. Questions that use a discrete ratings scale are commonplace in survey research. Examples in marketing include customer satisfaction measurement and purchase intention. Survey research practitioners have long commented that respondents vary in their usage of the scale: Common patterns include using only the middle of the scale or using the upper or lower end. These differences in scale usage can impart biases to correlation and regression analyses. To capture scale usage differences, we developed a new model with individual scale and location effects and a discrete outcome variable. We model the joint distribution of all ratings scale responses rather than specific univariate conditional distributions as in the ordinal probit model. We apply our model to a customer satisfaction survey and show that the correlation inferences are much different once proper adjustments are made for the discreteness of the data and scale usage. We also show that our adjusted or latent ratings scale is more closely related to actual purchase behavior.

Assume the observed item indicators (matrix X) are a discrete version of underlying latent continuous data Y. Here i indexes the individuals and j the questions. There are K+1 common, ordered cutoff points c_k, the first at negative infinity and the last at positive infinity, such that

x_{i,j} = k \;\; \textup{if} \;\;c_{k-1}\leq y_{i,j}\leq c_k

The underlying latent continuous variables are distributed multivariate normal:

y_i \sim N(\mu^*_i, \Sigma^*_i)

The cutoffs discretize the latent variable Y. This is similar to an ordinal probit model, but we’re not interested in the conditional distribution of one discrete variable but rather the joint distribution of a bunch.

This model provides for a different mean vector and covariance matrix for each respondent, but the authors simplify it by using a respondent-specific location-scale shift:

y_i = \mu + \tau_i \iota + \sigma_i z_i

z_i \sim N(0, \Sigma)

This allows for acquiescent/disacquiescent response styles, for overuse of a particular response value, and for extreme response styles:

  • Acquiescent (disacquiescent) would be represented with a large positive (negative) location shift and a shrunken scale parameter.
  • Overuse of a particular response value would be represented by a location shift to that value with a shrunken scale parameter.
  • Extreme response styles would be represented with no location shift and a very large scale parameter, which would tend to put a lot of probability density into the two tails.

The location and log scale parameters are modeled as bivariate normal, allowing them to be correlated with each other:

\begin{bmatrix} \tau_i\\ \textup{ln} \: \sigma_i \end{bmatrix} \sim N(\varphi, \Lambda)

You need to specify or model the cutoff values somehow. You could assume them to be known, say equally spaced between the actual values on the rating scale. This model specifies them in a way that allows for nonlinear spread, which you can imagine might be the case:

c_k = a + bk + ek^2
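To check my own reading of the model so far, here is a minimal sketch of the data-generating side only, using NumPy. It is not the authors' code and does no estimation. All numeric values (mu, Sigma, phi, Lambda, a, b, e, and a 5-point scale) are made up for illustration, and the function name is hypothetical. It also ignores the identification constraints the paper imposes, which matter for estimation but not for simulating.

```python
import numpy as np

def simulate_ratings(n_resp, mu, Sigma, phi, Lambda, a, b, e, K, rng):
    """Simulate K-point ratings under the location-scale model described above."""
    mu = np.asarray(mu, dtype=float)
    n_items = mu.size
    # (tau_i, ln sigma_i) ~ N(phi, Lambda): respondent-specific location and log scale.
    tau_logsigma = rng.multivariate_normal(phi, Lambda, size=n_resp)
    tau = tau_logsigma[:, 0]
    sigma = np.exp(tau_logsigma[:, 1])
    # z_i ~ N(0, Sigma), then y_i = mu + tau_i * iota + sigma_i * z_i.
    z = rng.multivariate_normal(np.zeros(n_items), Sigma, size=n_resp)
    y = mu + tau[:, None] + sigma[:, None] * z
    # Interior cutoffs c_1..c_{K-1}; c_0 = -inf and c_K = +inf are implicit.
    k = np.arange(1, K)
    cuts = a + b * k + e * k ** 2
    if np.any(np.diff(cuts) <= 0):
        raise ValueError("cutoffs must be strictly increasing")
    # x = k when y falls between c_{k-1} and c_k.
    x = np.searchsorted(cuts, y) + 1
    return x, tau, sigma

rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.5, 0.5],
                  [0.5, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])          # hypothetical item covariance
Lambda = np.array([[0.50, -0.10],
                   [-0.10, 0.25]])           # hypothetical (tau, ln sigma) covariance
x, tau, sigma = simulate_ratings(
    n_resp=500, mu=[3.0, 3.0, 3.0], Sigma=Sigma,
    phi=[0.0, 0.0], Lambda=Lambda,
    a=0.5, b=1.0, e=0.05, K=5, rng=rng,
)
print(x[:5])                                  # first five respondents' ratings
print(np.corrcoef(tau, x.mean(axis=1))[0, 1]) # location shift vs. mean rating
```

Respondents with a large positive tau_i end up with higher ratings across all items regardless of content, which is the acquiescent pattern, and a nonzero e stretches the spacing between cutoffs as you move up the scale.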

The authors go over a number of assumptions that force identification of the model. I get the need for this, though I don’t quite understand why they did what they did or what implications it has. I’ll come back to that at some point.

Then you need priors for mu, Sigma, phi, Lambda, and e. They use flat priors on the means and the “cutoff” parameter e and inverse-Wishart priors for the covariance matrices.

And from there it’s just an easy simulation problem. Ha, right!

Thankfully, I’m not on my own trying to understand and implement something like this because Rossi, Allenby, & McCulloch wrote a textbook that includes a case study dealing with it. There’s even software and data sets to go with it. But since Penrose doesn’t have the book, I have to wait to get it from the University of Northern Colorado, darn. Too bad, because it would have been fun to spend spring break sorting it all out.

psychology, psychometrics, statistics

Adjusting for response styles in cross-cultural research

I’m working on a cross-country study of math achievement scores related to liking-for-math and ran into some problems with the measure I’m using for liking-for-math. Some countries show extremely skewed distributions on the liking-for-math index I constructed.

Given the obvious differences in patterns of responses across countries, can I really make cross-country comparisons? I tried, and got statistically significant but weak results. But if my measure isn’t any good, I can’t trust those results.

Turns out this problem of “extreme response styles” is well-known in cross-cultural psychology. And there’s a related literature covering measurement invariance, asking the question whether you can compare psychometric results across cultures or other diverse groups of people.

I did a tiny bit of research this morning to see if there’s anything I can do to adjust for differences in response styles across students and countries on TIMSS math background items. I had already tried weeding out the countries that show extremely skewed response styles but then I didn’t have enough data to run the analysis. And anyway, throwing out a bunch of data isn’t a good solution.

Buckley (2006) suggests a Bayesian approach that estimates a posterior distribution for each student characterized by a location shift and a scale adjustment that represent how a student’s responses relate to his or her actual attitude. For example, a large positive location shift and a reduced scale would typify extreme acquiescence, as the student picked mostly “strongly agree” type items. Buckley also provides a quick-and-dirty linear regression tactic for estimating a student’s latent true score on the measure taking into account extreme or random response styles. I may give that a shot this morning — the class project is due next week so I have some time — and then later explore a Bayesian solution.

It’s so cool to see my two interests — cross-country psychological studies and Bayesian stats — colliding. Seems like a potential dissertation topic.


Buckley, J. (2006). Cross-national response styles in international educational assessments: Evidence from PISA 2006. Retrieved from https://edsurveys.rti.org/PISA/documents/Buckley_PISAresponsestyle.pdf

psychometrics, statistics

Attitude towards math vs. confidence, liking, usefulness of math in TIMSS

Kadijevich, D. (2006). Developing trustworthy TIMSS Background Measures: A case study on mathematics attitude. The Teaching of Mathematics IX(2), 41-51.

Abstract. This study, which used a sample of 197,707 students from 46 countries that participated in the TIMSS 2003 project in eight grade, examined whether, for a large number of the TIMSS countries, trustworthy TIMSS measures of several dimensions of mathematics attitude can be developed. By focusing on self-confidence in learning mathematics, usefulness of mathematics, and liking mathematics, it was found that both factor validity and reliability of the measures of these three dimensions derived from the raw data was only attained for the students from the United States. However, when scores concerning the utilized attitudinal statements of all subjects were transformed into Guttman’s image form scores, the factor validity and reliability of the three measures utilizing such transformed data was attained for thirty-three countries (N = 137,346). It was found that for all these thirty-three countries mathematics attitude was mostly saturated by either usefulness of mathematics or self-confidence in learning mathematics. A higher mathematics achievement was found for countries where mathematics attitude was mostly saturated by self-confidence in learning mathematics.

It’s not mentioned in the abstract, but if you combine all the mathematics attitude items into one grand attitude-towards-math scale, you get decent internal reliability (alpha above .70) for almost all countries.
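For reference, the alpha in question is Cronbach's alpha, computed from the item variances and the variance of the summed score. A minimal NumPy sketch follows; the function name and the simulated six-item data are my own illustration and have nothing to do with the TIMSS items.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items array of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Simulated example: six items sharing one common factor plus unit-variance noise.
rng = np.random.default_rng(3)
factor = rng.standard_normal(1000)
scores = factor[:, None] + rng.standard_normal((1000, 6))
alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.2f}")
```

Alpha rises with both the number of items and their average intercorrelation, which may be part of why one long combined scale clears .70 in countries where the three short subscales do not.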

This makes me think maybe I ought to use an overall “attitude towards math” score in the next iteration of my model. Or try that Guttman transformation, which doesn’t make any sense to me, so will need to understand what’s going on with it first. How can it eliminate measurement error?

Also, interesting that attitude towards math is either saturated by self-confidence (something intrinsic) or usefulness (something extrinsic), and the intrinsic one predicts higher math achievement.

psychometrics, statistics

Puny results for my model of math achievement related to liking math

Here’s a description of the data analysis project I’m working on.

I was so excited to push the button on my hierarchical linear analysis, hoping hoping hoping to see a statistically significant effect of the kind I wanted. And I did!

Here’s what I found:

  1. Per capita GDP, secular-rational values, and survivalist values all predict higher math achievement scores.
  2. Self-expressive values (at the opposite pole from survivalist) predict lower “returns” to liking math. That is, countries that are higher on self-expressive values show lower slopes for a proposed linear relationship between liking math and math achievement.

That second part was what I was really interested in. I hypothesized that a culture that valued self-expression would provide a worse context for math achievement, especially at higher levels of liking math. Students with higher liking for math would be relatively more disadvantaged by self-expressive values. In the graph below, you can see how lower values on the SURVSELF dimension (representing lower self-expressive values, higher survivalist) result in higher mean math achievement as well as a higher slope for math achievement related to liking math.

So why am I disappointed? The “effect size” (the practical magnitude of the effect) was small, even though it was statistically significant. The slopes just aren’t that different at different levels of self-expressive values.

One way of measuring effect size in hierarchical linear models is to report “proportion variance explained” or how much of the variation is accounted for when you add in the predictor of interest.

For finding #2 above, the PVE was just 5%. So self-expressive values at the country level don’t explain much of the difference in the slopes of math achievement related to liking math. To use the technical term, those are some seriously puny results.
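As a quick sketch of the arithmetic, PVE compares a variance component before and after adding the predictor. The two variance values below are hypothetical numbers picked only so the result comes out to 5%; they are not the estimates from my model, and the function name is mine.

```python
def proportion_variance_explained(var_baseline, var_with_predictor):
    """Share of a variance component accounted for by adding a predictor."""
    if var_baseline <= 0:
        raise ValueError("baseline variance must be positive")
    return (var_baseline - var_with_predictor) / var_baseline

# Hypothetical slope variances across countries, before and after adding
# self-expressive values as a country-level predictor of the slope.
slope_var_baseline = 0.040
slope_var_with_values = 0.038
pve = proportion_variance_explained(slope_var_baseline, slope_var_with_values)
print(f"PVE = {pve:.1%}")
```

Note that in hierarchical models a variance component can occasionally go up when a predictor is added, so this quantity can come out negative and should be read as a rough guide rather than a true R-squared.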

But still, it is a statistically significant effect in my model, and that, at least, is a happy thing. I think there is something to what I’m exploring. The models I ran converged quickly, which my prof said is an indication of a highly informative model. GDP didn’t explain all the variance in mean math achievement: the two country-level value dimensions were statistically significant and practically significant also. That was a somewhat surprising result because both value dimensions are related to economic transition. That traditionalist vs. secular-rational values and survivalist vs. self-expressive values explain significant variance over and above GDP is important, and worthy of some more study.