Newbie here. (I'm a big SGU fan -- I've listened to all episodes from #1 through #242, plus the uncut bonus stuff -- but haven't paid much attention to the forums. Regular posting to Skeptoid's Skeptalk e-mail list is about all I make time for these days.)
Just some comments on Steve et al.'s discussion of Joe's e-mail about personality tests, which interested me because I have some background in personality research and psychometrics (via my graduate work, consulting projects, and professional interactions). I don't think anything they said was blatantly wrong, insofar as they made any specific claims, but they may have overlooked some key distinctions and missed opportunities to point out relevant science (perhaps due to lack of time or motivation).
1. TEST PURPOSE: It's probably best to evaluate the quality of an instrument (e.g., test, questionnaire, survey) with reference to its specific intended use. That is, instead of concluding that a given test is good or bad in general -- or assessing how good it is in general -- we'd often be better off assessing its psychometric properties when it's used to accomplish specific aims (e.g., selection/hiring/admission, diagnosis, certification, career guidance, self-improvement, personal growth, recreation) for particular populations of people. A given test could be great for one use (e.g., highly reliable, precise, valid, factorially invariant, "fair," etc.) but poor for another. So, for instance, Steve referred to the MMPI as the "real" personality inventory, but that was developed more for diagnosing maladaptive traits (e.g., psychopathology) at clinical levels, which may not be relevant to the context in which Joe was taking a personality test (e.g., training or development for subclinical employees). This relates to Steve's comment that the use of certain personality tests "significantly oversimplifies the process": What is this process, exactly?
2. TEST LENGTH: Steve seemed to suggest that longer tests are better (e.g., 200 items is "just not enough"). That's not necessarily true: How many items is enough depends on a lot of things, including what the test is used for (see #1) and properties of the items. Sure, all else being equal, longer tests tend to have a smaller standard error of measurement (i.e., greater precision) for estimating an examinee's "true" score, and they tend to sample the relevant domain (e.g., indicators of a particular personality trait) more thoroughly. But there are some unstated assumptions and other practical issues to consider. For example, items can be viewed as having certain properties, such as difficulty, discrimination, and dimensionality, which also influence the test's quality. It's possible to have, say, a 10-item test of some specific trait that's better than a 20-item test of the same trait, provided that the shorter test's items are better. Item response theory (IRT), in fact, is often used to design maximally reliable tests for particular purposes with as few items as feasible (e.g., the GRE, MCAT, and LSAT -- especially those components that are computer-adaptive). There are also practical issues to consider, such as that generating more and more items for a test -- a task usually done by humans -- may be challenging and result in poorer items (e.g., less relevant to the target trait), and some examinees may react negatively to longer tests (e.g., careless responding, acquiescence, fatigue).
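To put rough numbers on the length-versus-item-quality point, here's a minimal sketch using two classical test theory formulas that I didn't name above: the Spearman-Brown formula (projected reliability when a test is lengthened with comparable items) and the usual standard error of measurement. The function names and all of the reliabilities and the score SD are made up for illustration, and Spearman-Brown assumes the added items are as good as the existing ones, which is exactly the assumption that can fail in practice.

```python
import math

def spearman_brown(reliability, length_factor):
    """Projected reliability if the test is made length_factor times as long,
    assuming the added items are parallel to (as good as) the existing ones."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: the observed-score SD scaled by sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1 - reliability)

# Hypothetical numbers, purely for illustration.
mediocre_10_items = 0.70                                  # 10 so-so items
mediocre_20_items = spearman_brown(mediocre_10_items, 2)  # doubled to 20 items
strong_10_items = 0.85                                    # 10 well-written items

score_sd = 10.0  # assumed SD of observed scores on a common scale
for label, rel in [("10 mediocre items", mediocre_10_items),
                   ("20 mediocre items", mediocre_20_items),
                   ("10 strong items", strong_10_items)]:
    sem = standard_error_of_measurement(score_sd, rel)
    print(f"{label}: reliability={rel:.2f}, SEM={sem:.2f}")
```

With these made-up values, doubling the mediocre test raises its projected reliability from 0.70 to roughly 0.82, which still falls short of the 0.85 assumed for the shorter test with better items.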
3. PSYCHOMETRIC QUALITY: Psychometricians and their kin (e.g., survey statisticians) use formal criteria and techniques to evaluate a test's quality. Steve et al. seemed to focus on what might be called "face validity" -- basically, whether the test seems to measure what it purports to measure, judging from the apparent content of its items. Although that's not irrelevant, it's a pretty weak approach to evaluating a test. Items that seem reasonable can perform poorly in formal evaluations, or vice versa; of course, this depends a lot on the expertise of the person making the judgments, but folks who aren't experts in psychometrics or the domain being measured can easily make bad judgments. To be clear, I'm not defending the Enneagram or MBTI here; I don't know enough about research on them. I'm just saying it's unwise to evaluate a test on face validity alone. Other criteria include various types of reliability (e.g., internal consistency, inter-rater, test-retest), validity (e.g., predictive, concurrent, discriminant, convergent, content), and so on.
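As one concrete example of a formal criterion, here's a minimal sketch of Cronbach's alpha, a common index of internal consistency. I didn't single out alpha above, so treat it as just one illustration of the reliability family; the function name and the response data are hypothetical, and real evaluations would use far more respondents and look at other evidence too.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    n_items = len(responses[0])
    # Transpose so each tuple holds all respondents' scores on one item.
    item_columns = list(zip(*responses))
    sum_item_variances = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1 - sum_item_variances / total_variance)

# Hypothetical 1-5 ratings from six respondents on a four-item scale.
made_up_responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 4, 3, 3],
]
print(f"alpha = {cronbach_alpha(made_up_responses):.2f}")
```

Items that all look sensible on their face can still produce a low alpha if respondents don't answer them consistently, which is the sort of thing face validity alone won't reveal.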
4. FORCED-CHOICE ITEMS: One of Steve's criticisms of the Enneagram's items seemed to be that they use a forced-choice response format and are factorially complex (i.e., measure more than 1 variable). It's not obvious to me that these are as problematic as he seemed to think. I'm sure there's research on the pros and cons of forced-choice items, and this is one of many points where commenting on that research could've been informative. Similarly, just because an item's content covers more than one dimension (i.e., "loads" on more than one "factor") doesn't mean the item's necessarily worthless, though probably this isn't ideal and should be avoided when feasible.
5. PSYCHOMETRIC RESEARCH: Anyone interested in the science of measuring personality for various purposes might look into the massive research literature. Believe it or not, some serious scientists work in this area. Besides various relevant professional organizations (e.g., American Psychological Association -- especially Divisions 5, 8, and 14) and scholarly journals (e.g., Journal of Personality and Social Psychology, Personality and Social Psychology Bulletin, Journal of Research in Personality), there are countless Web resources. Of course, it's hard to judge these resources' scientific credibility; here's one that I'm pretty sure could be considered fairly authoritative: www.personality-project.org
Cheers!