Category Archives: psychometrics

Dissertation topic: Constructing predictive indexes

The actual working title of my dissertation is: Modeling Social Participation as Predictive of Life Satisfaction and Social Connectedness: Scale or Index?

When I tell people my topic, I usually start with the domain area: social participation as related to life satisfaction in older U.S. adults (my data set is people age 65 and over from the Health and Retirement Study), but really, the topic is a statistical and measurement one. Participation happens to be something I’m personally interested in and fits the statistical problem area, but I could do this same project in a variety of domains with a range of constructs. Maybe I ought to change my elevator speech to start with the statistical/measurement part.

Most psychometrics concerns itself with the measurement of latent psychological constructs like attitudes, intelligence, academic achievement and so forth. Psychometricians have developed sophisticated means of constructing instruments (surveys or assessments, for example) that can measure these latent constructs. The approach taken is often based on either classical test theory or item response theory. Either way, the assumption is that observed data (such as a student’s answers to test questions or a subject’s survey responses) are caused by whatever unobserved trait is intended to be measured.

However, there are some things we want to measure that don’t fit this model. Social participation is one of them. Participation instruments generally ask the respondent to report his or her level of participation in various activities. In a latent factor setting, you would then assume some underlying level of participation that gave rise to the observed frequencies of participation. That’s not quite right though. If someone increases their participation in some area (say, by joining an investment club), their overall level of participation goes up. The increase in participation in the investment club seems causally prior to the increase in overall participation. This is the opposite direction of causality from that proposed by traditional psychometric models.

Some people call a measurement instrument developed by some sort of summation of disparate items an index rather than a scale, where a scale follows the latent factor model. The development of such indexes follows a so-called formative measurement model, where what you’re trying to measure is formed of what you observe, in contrast to the development of scales that follows a reflective measurement model, where what you observe reflects the underlying latent factor of interest. In the diagram, the first figure represents formative measurement (observed indicators x1-x3 cause the latent construct eta 1) and the second figure represents reflective (observed indicators y1 to y3 reflect the level of the latent construct).
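To make the contrast concrete, here is a minimal simulation sketch. The loadings, noise levels, and function names (`simulate_reflective`, `simulate_formative`) are my own illustrative assumptions, not anything estimated from real data. In the reflective case one latent value drives all three indicators, so they correlate with each other. In the formative case the indicators are generated independently and only then summed into an index, so they need not correlate at all.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def simulate_reflective(n, rng, loading=0.8, noise_sd=0.6):
    """Reflective: latent eta causes each indicator y_j = loading * eta + noise."""
    rows = []
    for _ in range(n):
        eta = rng.gauss(0, 1)
        rows.append([loading * eta + rng.gauss(0, noise_sd) for _ in range(3)])
    return rows


def simulate_formative(n, rng):
    """Formative: independent indicators x_1..x_3 are summed to form the index."""
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        rows.append(x + [sum(x)])  # last column is the formed index
    return rows


rng = random.Random(42)
reflective = simulate_reflective(5000, rng)
formative = simulate_formative(5000, rng)

# Correlation between the first two indicators under each model
r_reflective = pearson([r[0] for r in reflective], [r[1] for r in reflective])
r_formative = pearson([r[0] for r in formative], [r[1] for r in formative])
# Correlation between one formative indicator and the index it helps form
r_index = pearson([r[0] for r in formative], [r[3] for r in formative])

print(f"reflective indicators: r = {r_reflective:.2f}")
print(f"formative indicators:  r = {r_formative:.2f}")
print(f"formative indicator vs. index: r = {r_index:.2f}")
```

This is why internal-consistency statistics make sense for scales but can be misleading for indexes: nothing in the formative setup requires the items to hang together.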

There has been plenty of criticism of formative measurement, but I think it can be made useful, and that’s the aim of my dissertation project. I’m now at the analysis stage and just beginning to really understand the usefulness and potential of formative indexes.

As an aside, I don’t like to call formative measurement “measurement.” I prefer to think of it as “modeling.” I think what you’re doing with index development is constructing a one- or few-number summary of a lot of individual data items in a way that predicts outcomes of interest. Think of the Apgar score as a good example. It gives you a one-number summary of the health of the baby and its likelihood to survive and thrive, but you’re not measuring one thing in particular about the baby. Well, maybe you are measuring overall health. Hmmmm.

To be continued…

Data science: Don’t forget psychometrics

I can hardly believe that in this 5,000-word-plus blog post on data science there is not one single mention of psychometrics.

Statistician Andrew Gelman calls psychometrics, “the most underrated science.” He also says:

This reminds me of a longstanding principle in statistics, which is that, whatever you do, somebody in psychometrics already did it long before.

What? You don’t know what psychometrics is? Well, don’t feel bad. I didn’t know what it was until after I enrolled in a doctoral program that had it as one of its core topics. Yet even the program description doesn’t mention the word “psychometrics.” So what is it? Here’s one definition:

Psychometrics is the field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits. The field is primarily concerned with the construction and validation of measurement instruments, such as questionnaires, tests, and personality assessments. [Wikipedia]

So psychometrics is the science of measuring unobservable characteristics of people. Another name for those unobservables is “latent variables.” You can’t measure knowledge, abilities, attitudes, and personality traits directly. I can’t just look at you and know how good you are at math, for example. Psychometricians develop measurement instruments like standardized tests, questionnaires, IQ assessments and so forth to measure these latent psychological constructs. They rely on a vast foundation of theory and tools that help ensure these measurement instruments measure what they purport to measure (validity) and measure it consistently without excessive error (reliability).

Of course psychometrics is relevant to analyzing web data, because what is the web about anyway? People doing things online, as well as what they might like to do online (subscribe to a web service, rent a movie, buy a nutritional supplement). Web properties want to use their vast pools of data to tell them something about the psychology and predicted behavior of the people using their websites. Psychometrics can help.

Two kinds of people in the world…

… those that like to classify people into different kinds and those that don’t. I’m a classifier.

That’s why I’m intrigued by latent class analysis (LCA), where you statistically divide up people into unobserved classes based on some observed variables (like behavior). Take the example of autism. Is Asperger’s Syndrome on the autistic spectrum or is it an altogether different thing? LCA might be able to answer that question.

I’ve spent the last couple days reading through simulation studies on identifying classes in an LCA-type technique called growth mixture modeling (GMM) where you try to identify classes underlying different developmental trajectories. The oft-cited example in this area is alcohol use, tracked during adolescence and sometimes into adulthood. These studies typically find a few distinctly different trajectories, so different that they (apparently) qualify as different latent classes. For example, this 2003 study found five growth trajectories:

  • Occasional very light drinkers
  • Moderate escalators
  • Infrequent bingers
  • Rapid escalators
  • De-escalators

I’m thinking of designing and running my own simulation study of growth mixture modeling, starting from the ideas in Bauer & Curran (2003). They demonstrated that GMM using information criteria routinely in use at that time would likely extract too many classes given non-normal inputs.

I’m thinking I could go the opposite way: look at cases where there are multiple classes generating the data and see what happens when you treat the data as coming from a single population. Jedidi, Jagpal, & DeSarbo (1997) tackled this question in the case of LCA (not growth curve analysis) with applications to marketing.

But what I’m struggling with is this: when you see non-normal data, is that because there really are multiple classes generating that? Or is the data inherently non-normal? How can you detect the difference, given that non-normal distributions can be approximated by mixtures of normal distributions?
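A quick sketch of why this is hard, using made-up parameters of my own choosing: a two-class mixture of normals and a single lognormal population both produce clearly skewed samples, so skewness by itself can't tell you which process generated the data.

```python
import random
import statistics


def skewness(xs):
    """Sample skewness: mean of cubed z-scores (population SD in the denominator)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


rng = random.Random(1)
n = 20000

# Two latent classes: 80% drawn from N(0, 1), 20% from N(3, 1)
mixture = [rng.gauss(0, 1) if rng.random() < 0.8 else rng.gauss(3, 1)
           for _ in range(n)]

# One homogeneous but inherently non-normal population
single = [rng.lognormvariate(0, 0.5) for _ in range(n)]

skew_mixture = skewness(mixture)
skew_single = skewness(single)
print(f"two-class mixture skewness: {skew_mixture:.2f}")
print(f"single lognormal skewness:  {skew_single:.2f}")
```

Both numbers come out well above zero. A simulation study would need something beyond marginal shape (class-specific covariates or outcomes, for instance) to separate the two stories.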

On the one hand, I have this philosophical sense that there aren’t any “classes” of people in the world, just different ways of classifying. On the other hand, genotype differences are real, so I need to keep the medical interpretation in mind. For example there is clearly a class of people who have cystic fibrosis compared to a much larger class of people who do not. Those are the sorts of situations I need to keep in mind when I design the simulation. Alcohol use is interesting but I’m not sure I’d use it as a template for what I’d like to explore.

Modeling scale usage heterogeneity the Bayesian way

Posts in my journal club category are my summaries and thoughts on journal articles I read. I’ve found I absorb material much better if I try to summarize it in a way that might make sense to someone else. The article covered here offers a potential solution to a problem I ran into in the TIMSS data set.

Rossi, P.E., Gilula, Z., Allenby, G.M. (2001). Overcoming scale usage heterogeneity: A Bayesian hierarchical approach. Journal of the American Statistical Association 96(453), 20-31.

Abstract. Questions that use a discrete ratings scale are commonplace in survey research. Examples in marketing include customer satisfaction measurement and purchase intention. Survey research practitioners have long commented that respondents vary in their usage of the scale: Common patterns include using only the middle of the scale or using the upper or lower end. These differences in scale usage can impart biases to correlation and regression analyses. To capture scale usage differences, we developed a new model with individual scale and location effects and a discrete outcome variable. We model the joint distribution of all ratings scale responses rather than specific univariate conditional distributions as in the ordinal probit model. We apply our model to a customer satisfaction survey and show that the correlation inferences are much different once proper adjustments are made for the discreteness of the data and scale usage. We also show that our adjusted or latent ratings scale is more closely related to actual purchase behavior.

Assume the observed item indicators (matrix X) are a discrete version of underlying latent continuous data Y. Here i indexes the individuals and j the questions. There are K+1 common, ordered cutoff points c_k (k = 0, ..., K), the first at negative infinity and the last at positive infinity, such that

x_{i,j} = k \;\; \textup{if} \;\;c_{k-1}\leq y_{i,j}\leq c_k

The underlying latent continuous variables are distributed multivariate normal:

y_i \sim N(\mu^*_i, \Sigma^*_i)

The cutoffs discretize the latent variable Y. This is similar to an ordinal probit model, but we’re not interested in the conditional distribution of one discrete variable but rather the joint distribution of a bunch.

This model provides for a different mean vector and covariance matrix for each respondent, but the authors simplify it by using a respondent-specific location-scale shift:

y_i = \mu + \tau_i \iota + \sigma_i z_i

z_i \sim N(0, \Sigma)

This allows for acquiescent/disacquiescent response styles, for overuse of a particular response value, and for extreme response styles:

  • Acquiescent (disacquiescent) would be represented with a large positive (negative) location shift and a shrunken scale parameter.
  • Overuse of a particular response value would be represented by a location shift to that value with a shrunken scale parameter.
  • Extreme response styles would be represented with no location shift and a very large scale parameter, which would tend to put a lot of probability density into the two tails.
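The three response styles above can be sketched with a toy simulation. This is only an illustration under strong simplifying assumptions of my own: a 5-point scale, fixed equally spaced cutoffs, a common item mean of 3, and independent standard normal z (that is, Sigma set to the identity). It is not the authors' estimation procedure, and the names `CUTOFFS`, `discretize`, and `simulate_ratings` are hypothetical.

```python
import bisect
import random
from collections import Counter

# Interior cutoffs for a 5-point scale; c_0 = -inf and c_5 = +inf are implied.
CUTOFFS = [1.5, 2.5, 3.5, 4.5]


def discretize(y):
    """Map latent y to a rating k in 1..5, with c_{k-1} <= y < c_k."""
    return bisect.bisect_right(CUTOFFS, y) + 1


def simulate_ratings(tau, sigma, n_items, rng, mu=3.0):
    """Ratings from y = mu + tau + sigma * z, with z ~ N(0, 1) per item."""
    return [discretize(mu + tau + sigma * rng.gauss(0, 1)) for _ in range(n_items)]


def shares(ratings):
    """Proportion of responses falling in each category 1..5."""
    counts = Counter(ratings)
    return {k: counts.get(k, 0) / len(ratings) for k in range(1, 6)}


rng = random.Random(7)
# Acquiescent: large positive location shift, shrunken scale
acquiescent = shares(simulate_ratings(tau=2.0, sigma=0.3, n_items=2000, rng=rng))
# Midpoint overuse: no shift, shrunken scale
midpoint = shares(simulate_ratings(tau=0.0, sigma=0.3, n_items=2000, rng=rng))
# Extreme: no location shift, very large scale
extreme = shares(simulate_ratings(tau=0.0, sigma=3.0, n_items=2000, rng=rng))

for name, s in [("acquiescent", acquiescent), ("midpoint", midpoint), ("extreme", extreme)]:
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The acquiescent respondent piles up on 5, the midpoint respondent on 3, and the extreme respondent splits most of the mass between 1 and 5.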

The location and log scale parameters are modeled as bivariate normal, allowing them to be correlated with each other:

\begin{bmatrix} \tau_i\\ \textup{ln} \: \sigma_i \end{bmatrix} \sim N(\varphi, \Lambda)

You need to specify or model the cutoff values somehow. You could assume them to be known, say equally spaced between the actual values on the rating scale. This model specifies them in a way that allows for nonlinear spread, which you can imagine might be the case:

c_k = a + bk + ek^2
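Here is a small sketch of how that quadratic form generates cutoffs. The values of a, b, and e are arbitrary numbers I picked for illustration, and `quadratic_cutoffs` is a hypothetical helper. With e = 0 the cutoffs are equally spaced; a positive e makes the gaps widen toward the top of the scale.

```python
import math


def quadratic_cutoffs(K, a, b, e):
    """Return [c_0, ..., c_K] with c_0 = -inf, c_K = +inf,
    and interior cutoffs c_k = a + b*k + e*k**2 for k = 1..K-1."""
    interior = [a + b * k + e * k ** 2 for k in range(1, K)]
    # The cutoffs only make sense if they are strictly increasing.
    if any(hi <= lo for lo, hi in zip(interior, interior[1:])):
        raise ValueError("cutoffs must be strictly increasing")
    return [-math.inf] + interior + [math.inf]


# e = 0: equally spaced cutoffs halfway between the scale points of a 5-point scale
linear = quadratic_cutoffs(K=5, a=0.5, b=1.0, e=0.0)
# e > 0: gaps between successive cutoffs grow with k
curved = quadratic_cutoffs(K=5, a=0.5, b=1.0, e=0.1)

print("linear:", linear)
print("curved:", [round(c, 2) for c in curved])
```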

The authors go over a number of assumptions that force identification of the model. I get the need for this if not quite understanding why they did what they did, or what implications it has. Will come back to that at some point.

Then you need priors for mu, Sigma, phi, Lambda, and e. They use flat priors on the means and the “cutoff” parameter e and inverse-Wishart priors for the covariance matrices.

And from there it’s just an easy simulation problem. Ha, right!

Thankfully, I’m not on my own trying to understand and implement something like this because Rossi, Allenby, & McCulloch wrote a textbook that includes a case study dealing with it. There’s even software and data sets to go with it. But since Penrose doesn’t have the book, I have to wait to get it from the University of Northern Colorado, darn. Too bad, because it would have been fun to spend spring break sorting it all out.

Adjusting for response styles in cross-cultural research

I’m working on a cross-country study of math achievement scores related to liking-for-math and ran into some problems with the measure I’m using for liking-for-math. Some countries show extremely skewed distributions on the liking-for-math index I constructed.

Given the obvious differences in patterns of responses across countries, can I really make cross-country comparisons? I tried, and got statistically significant but weak results. But if my measure isn’t any good, I can’t trust those results.

Turns out this problem of “extreme response styles” is well-known in cross-cultural psychology. And there’s a related literature covering measurement invariance, asking the question whether you can compare psychometric results across cultures or other diverse groups of people.

I did a tiny bit of research this morning to see if there’s anything I can do to adjust for differences in response styles across students and countries on TIMSS math background items. I had already tried weeding out the countries that show extremely skewed response styles but then I didn’t have enough data to run the analysis. And anyway, throwing out a bunch of data isn’t a good solution.

Buckley (2006) suggests a Bayesian approach that estimates a posterior distribution for each student characterized by a location shift and a scale adjustment that represent how a student’s responses relate to his or her actual attitude. For example, a large positive location shift and a reduced scale would typify extreme acquiescence, as the student picked mostly “strongly agree” type items. Buckley also provides a quick-and-dirty linear regression tactic for estimating a student’s latent true score on the measure taking into account extreme or random response styles. I may give that a shot this morning — the class project is due next week so I have some time — and then later explore a Bayesian solution.

It’s so cool to see my two interests — cross-country psychological studies and Bayesian stats — colliding. Seems like a potential dissertation topic.

Reference

Buckley, J. (2006). Cross-national response styles in international educational assessments: Evidence from PISA 2006. Retrieved from https://edsurveys.rti.org/PISA/documents/Buckley_PISAresponsestyle.pdf

Attitude towards math vs. confidence, liking, usefulness of math in TIMSS

Kadijevich, D. (2006). Developing trustworthy TIMSS Background Measures: A case study on mathematics attitude. The Teaching of Mathematics IX(2), 41-51.

Abstract. This study, which used a sample of 197,707 students from 46 countries that participated in the TIMSS 2003 project in eighth grade, examined whether, for a large number of the TIMSS countries, trustworthy TIMSS measures of several dimensions of mathematics attitude can be developed. By focusing on self-confidence in learning mathematics, usefulness of mathematics, and liking mathematics, it was found that both factor validity and reliability of the measures of these three dimensions derived from the raw data was only attained for the students from the United States. However, when scores concerning the utilized attitudinal statements of all subjects were transformed into Guttman’s image form scores, the factor validity and reliability of the three measures utilizing such transformed data was attained for thirty-three countries (N = 137,346). It was found that for all these thirty-three countries mathematics attitude was mostly saturated by either usefulness of mathematics or self-confidence in learning mathematics. A higher mathematics achievement was found for countries where mathematics attitude was mostly saturated by self-confidence in learning mathematics.

It’s not mentioned in the abstract, but if you combine all the mathematics attitude items into one grand attitude-towards-math scale, you get decent internal reliability (alpha above .70) for almost all countries.
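For reference, the alpha in question is Cronbach's alpha, which compares the sum of the item variances to the variance of the total score. A minimal sketch follows; the response data are made up for illustration and `cronbach_alpha` is my own helper, not something from the article.

```python
import statistics


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    items = list(zip(*rows))  # transpose: one tuple of scores per item
    item_var_sum = sum(statistics.variance(item) for item in items)
    total_var = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical responses: 6 students x 4 attitude items on a 1-4 scale
responses = [
    [4, 4, 3, 4],
    [3, 3, 3, 2],
    [2, 2, 1, 2],
    [4, 3, 4, 4],
    [1, 2, 1, 1],
    [3, 4, 3, 3],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

When every item gives identical scores, the item variances sum to exactly 1/k of the total-score variance and alpha comes out to 1. Unrelated items push it toward 0.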

This makes me think maybe I ought to use an overall “attitude towards math” score in the next iteration of my model. Or try that Guttman transformation, which doesn’t make any sense to me, so will need to understand what’s going on with it first. How can it eliminate measurement error?

Also, interesting that attitude towards math is either saturated by self-confidence (something intrinsic) or usefulness (something extrinsic), and the intrinsic one predicts higher math achievement.

Puny results for my model of math achievement related to liking math

Here’s a description of the data analysis project I’m working on.

I was so excited to push the button on my hierarchical linear analysis, hoping hoping hoping to see a statistically significant effect of the kind I wanted. And I did!

Here’s what I found:

  1. Per capita GDP, secular-rational values, and survivalist values all predict higher math achievement scores.
  2. Self-expressive values (at the opposite pole from survivalist) predict lower “returns” to liking math. That is, countries that are higher on self-expressive values show lower slopes for a proposed linear relationship between liking math and math achievement.

That second part was what I was really interested in. I hypothesized that a culture that valued self-expression would provide a worse context for math achievement, especially at higher levels of liking math. Students with higher liking for math would be relatively more disadvantaged by self-expressive values. In the graph below, you can see how lower values on the SURVSELF dimension (representing lower self-expressive values, higher survivalist) result in higher mean math achievement as well as a higher slope for math achievement related to liking math.

So why am I disappointed? The “effect size” (the practical magnitude of the effect) was small, even though it was statistically significant. The slopes just aren’t that different at different levels of self-expressive values.

One way of measuring effect size in hierarchical linear models is to report “proportion variance explained” or how much of the variation is accounted for when you add in the predictor of interest.

For finding #2 above, the PVE was just 5%. So self expressive values at the country level don’t explain much of the difference of the slopes of math achievement related to liking math. To use the technical term, those are some seriously puny results.
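As I understand it, the calculation compares a variance component before and after adding the predictor. The sketch below uses placeholder variance values chosen only so the arithmetic lands on 5%; they are not the estimates from my models, and `proportion_variance_explained` is a hypothetical helper.

```python
def proportion_variance_explained(var_baseline, var_with_predictor):
    """PVE = (baseline variance - conditional variance) / baseline variance."""
    if var_baseline <= 0:
        raise ValueError("baseline variance must be positive")
    return (var_baseline - var_with_predictor) / var_baseline


# Placeholder values: between-country variance in the liking-math slope,
# first without and then with the country-level predictor in the model.
slope_var_baseline = 0.0200
slope_var_conditional = 0.0190

pve = proportion_variance_explained(slope_var_baseline, slope_var_conditional)
print(f"PVE = {pve:.1%}")
```

A PVE of 5% means the slope variance across countries barely shrinks once the predictor is in the model, which is another way of saying the slopes differ for reasons the predictor mostly doesn't capture.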

But still, it is a statistically significant effect in my model, and that at least, is a happy thing. I think there is something to what I’m exploring. The models I ran converged quickly, which my prof said is an indication of a highly informative model. GDP didn’t explain all the variance in mean math achievement — the two country-level value dimensions were statistically significant and practically significant also. That was a somewhat surprising result because both value dimensions are related to economic transition. That traditionalist vs. secular-rational values and survivalist vs. self-expressive values explain significant variance over and above GDP is important, and worthy of some more study.

Confirmatory factor analysis: The basics

My psychometrics prof did a quick intro to confirmatory factor analysis in class last night, and since next quarter I’m taking a class out of sequence that depends on it (latent growth curve modeling), thought I’d summarize here to consolidate what I learned.

EFA vs. CFA

You can use exploratory factor analysis (EFA) or confirmatory factor analysis (CFA) to investigate the construct validity of a psychometric instrument. With exploratory, you don’t specify the factor structure up front — the analysis finds factors and their loadings on items for you. With confirmatory, you specify the factors and how they relate to items from the instrument.

The “factors” you are looking for are also known as latent variables, things you can’t directly measure (that’s why they’re called “latent”). As an example, intrinsic motivation to study math is a latent variable or construct. There’s no way to measure it directly so you develop some measurement instrument  — usually a set of items asking about that construct. You might assess more than one construct at a time, for example intrinsic and extrinsic motivation, and so you want to see if the items relate to the underlying factors as you theorized.

You can do a confirmatory factor analysis with EFA techniques (e.g., with principal axis factoring and oblique or orthogonal factor rotation) but there are additional benefits to using CFA (from Gable & Wolf, 1993):

  • You get a unique factor solution with CFA
  • CFA assesses the degree of model fit
  • CFA output on individual model parameters suggests how to improve the model
  • You can test factorial invariance across groups

How to do a simple CFA

You express a factor analysis model using structural equation modeling (SEM) notation. A circle or oval indicates a latent variable (a.k.a. factor). A square or rectangle indicates an observed variable (a.k.a. indicator). A single-headed arrow shows causality, with factors causing indicators, not the other way around. A curved double-headed arrow indicates unanalyzed association.

Here’s an example of a CFA diagram that I made in Amos. This shows two factors — positive attitudes towards math (should be positive attitude toward math I think) and extrinsic motivation to learn math, each with four indicators.

Then you input correlations or covariances from your data set and run the analysis. You’ll get estimated factor loadings as well as a bunch of measures of goodness of fit of the model.

I won’t go over running the analysis here — I haven’t actually gotten that far myself — but here’s what to look for to see if you have good model fit:

  • Root mean squared error of approximation (RMSEA) should be less than .05.
  • Comparative Fit Index (CFI): excellent model if > .95, good if between .90 and .95, poor if less than .90

The prof also mentioned the chi square measure of fit, but admitted it is worthless. Note for the chi square test statistic you are looking for a nonsignificant chi-square not a significant one, in contrast to most statistical tests. For large samples the chi square will virtually always be significant, so some statisticians recommend dividing by degrees of freedom. Some say that a model with chi square / df less than three is good.
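All three of these fit measures can be computed from the model chi-square, its degrees of freedom, the sample size, and (for CFI) the chi-square of a baseline model. Here is a sketch using the standard textbook formulas; the input numbers are invented for illustration, and note that some software uses N rather than N - 1 in the RMSEA denominator.

```python
import math


def rmsea(chi2, df, n):
    """Root mean squared error of approximation (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative Fit Index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_max = max(chi2_model - df_model, chi2_baseline - df_baseline, 0.0)
    return 1.0 if d_max == 0 else 1.0 - d_model / d_max


def chi2_ratio(chi2, df):
    """Chi-square divided by its degrees of freedom."""
    return chi2 / df


# Invented example: model chi2 = 57 on 19 df, N = 501,
# baseline chi2 = 800 on 28 df.
fit_rmsea = rmsea(57.0, 19, 501)
fit_cfi = cfi(57.0, 19, 800.0, 28)
fit_ratio = chi2_ratio(57.0, 19)

print(f"RMSEA = {fit_rmsea:.3f}")
print(f"CFI = {fit_cfi:.3f}")
print(f"chi2/df = {fit_ratio:.1f}")
```

With these made-up inputs the CFI just clears .95 while the RMSEA sits above .05, a reminder that the rules of thumb don't always agree.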

Here’s a useful page summarizing more measures of goodness of fit of structural equation models.

Reference

Gable, R. K., & Wolf, M. B. (1993). Instrument development in the affective domain: Measuring attitudes and values in corporate and school settings. Evaluation in education and human services. Boston: Kluwer Academic.

Because I’m bored: A post about novelty seeking

Do you know anyone who is easily bored? Always looking for the novel, the exciting, the stimulating? You might see this trait manifest in different ways: the heli-skier, the intellectual omnivore, the golf-sensation/sex-addict, even the heroin abuser likely have in common a drive to avoid boredom and a twin drive to experience excitement in whatever form works best for them.

Some psychologists call this novelty seeking, sensation seeking, or stimulation seeking.* Here’s one definition of sensation seeking:

a trait defined by the seeking of varied, novel, complex, and intense sensations and experiences, and the willingness to take physical, social, legal, and financial risks for the sake of such experiences. (Zuckerman & Kuhlman, 2000)

I took the Zuckerman-Kuhlman Personality Questionnaire sensation seeking scale and scored 84%, High bordering on Very High. I’m easily bored. I’m always looking for the next excitement, usually intellectual but could be something else. My need for novelty makes it hard to ever reach equilibrium because I inevitably get distracted by sparkly objects passing by.

It’s a good thing

That definition makes sensation seeking sound like a mostly negative thing (all those risks!) but in my experience, it’s not. Because I’m easily bored, I’ve had a pretty exciting life, I think. In Penelope Trunk’s framing, I’ve prioritized having an interesting life over a happy one. In the past, I have sacrificed comfort and stability for the new and different, whether it was a new job or a new career or a new house or a new state.

But now I find myself leaning more towards trying to have a happy, stable life rather than an interesting and exciting one. Sensation seeking declines with age, and I think I might have reached a pretty optimal level for where I am in my life. I’m totally willing to take intellectual leaps and risks, where some people in their early 40s might be stuck with tired ideas. I wouldn’t rule out moving our family yet again if the right opportunity arose. I take risks like blogging about random stuff that enters my bored brain. And yet I’m settled and stable in many ways I couldn’t have imagined in my 20s: I am satisfied with my husband, my neighborhood, my house, my career path, my colleagues.

It’s in the genes

There’s evidence of a genetic basis for novelty seeking, and also evidence that novelty seeking may be a risk factor for drug dependence.

And, novelty seekers may be more intelligent on average. From a 2002 paper in the Journal of Personality and Social Psychology:

The prediction that high stimulation seeking 3-yr-olds would have higher IQs by 11 yrs old was tested in 1,795 children on whom behavioral measures of stimulation seeking were taken at 3 yrs, together with cognitive ability at 11 yrs. High 3-yr-old stimulation seekers scored 12 points higher on total IQ at age 11 compared with low stimulation seekers and also had superior scholastic and reading ability. Results replicated across independent samples and were found for all gender and ethnic groups. Effect sizes for the relationship between age 3 stimulation seeking and age 11 IQ ranged from 0.52 to 0.87. Findings appear to be the first to show a prospective link between stimulation seeking and intelligence. It is hypothesized that young stimulation seekers create for themselves an enriched environment that stimulates cognitive development. [emphasis added]

I think adult stimulation/sensation/novelty seekers can do the same thing.

Notes

*Are novelty-seeking and sensation-seeking the same thing? Maybe.

Measurement, comparison, and variability

Andrew Gelman:

the activity we call “statistics” exists in the middle of the Venn diagram formed by measurement, comparison, and variability. No two of the three is enough.

I’m dealing with the middle of that Venn diagram right now:

  • Measurement — am I measuring what I want to measure? I wanted to measure intrinsic motivation to learn math but I’ve got an index that’s likely better described as positive attitudes towards math (PATM). And I don’t even know what it’s really measuring.
  • Comparison — can I compare PATM across countries? If I use it as a regression predictor for math achievement, are results across countries comparable?
  • Variability — how do scores on PATM vary across countries? Certainly not enough to just know the mean. This provides some information to answer the question about comparison.

When I plot histograms of different countries’ scores on PATM, I see some strange results. Some countries (most English-speaking and Asian ones, for example) have roughly normal distributions of scores, with extra mass sometimes at the top or bottom since scores are bounded between -6 and 6. But most Middle Eastern, Latin American, and African countries have negatively skewed distributions, with half or more of the students hitting the very top score. And Eastern European countries tend to have positively skewed distributions, with more students saying they don’t like math than they do.

For example, here’s Japan:

(My histograms use IMLM not PATM because I made them when I was still calling the measure Intrinsic Motivation to Learn Math).

And here’s Jordan:

What’s the problem? I would expect positive attitudes towards math to be basically normally distributed throughout the population — most people would be neutral, some would dislike it, and some would really like it. If a lot of students say “I really really like it” (or, as in Eastern Europe, “I really really don’t like it”) then I wonder if my measurement instrument is measuring some additional thing besides what they feel about math. I don’t know that I can compare a Japanese student who scores 6 on PATM to a Jordanian student who scores the same.

What to do?

I’m not sure what to make of this or what to do with it. One thought I had is to just do my initial analysis with the countries that have roughly normal distributions of scores. For those cases, I feel like I can have some confidence that the PATM measure is actually getting at the actual distribution of positive attitudes towards math in the population vs. measuring something else or something in addition (optimist/pessimist mindset? lack of any rigorous math education which would separate the like-maths from the dont-like-maths?)

I guess I might also correlate skewness of the PATM distribution with other measures, for example, the country-level cultural measures I’m using to characterize the context in which students learn math or even just with math scores. Maybe that would give some insight into what’s going on here.
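A first step would be to boil each country's distribution down to a couple of numbers I can then correlate with other things. A rough sketch follows; the score lists are invented (not TIMSS data), the country labels are placeholders, and `skewness` and `ceiling_share` are my own helper names.

```python
import statistics


def skewness(xs):
    """Sample skewness: mean of cubed z-scores (population SD in the denominator)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def ceiling_share(xs, top=6):
    """Fraction of students at the maximum possible score."""
    return sum(1 for x in xs if x == top) / len(xs)


# Invented PATM-style scores on the -6..6 range for two hypothetical countries
scores_by_country = {
    "country_a": [-4, -2, -1, 0, 0, 0, 1, 2, 4],    # roughly symmetric
    "country_b": [6, 6, 6, 6, 6, 5, 4, 2, 0, -3],   # piled up at the top
}

summary = {
    name: (skewness(scores), ceiling_share(scores))
    for name, scores in scores_by_country.items()
}
for name, (skew, share) in summary.items():
    print(f"{name}: skewness = {skew:.2f}, share at top score = {share:.0%}")
```

With one skewness value per country, correlating it against the country-level cultural measures or mean math scores is a single extra step.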

This is another example of how useful it was for me to present at our research meeting on Tuesday. One of the other students asked if there was good variability on the scores. I said, “sure, all the countries vary across all the values.” But then I realized the point she was making: how do they vary across the score levels… now that is important.