Normalizing the Use of Single-Item Measures: Validation of the Single-Item Compendium for Organizational Psychology

The application of single-item measures has the potential to help applied researchers address conceptual, methodological, and empirical challenges. Based on a large-scale, evidence-based approach, we empirically examined the degree to which various constructs in the organizational sciences can be reliably and validly assessed with a single item. In Study 1, across 91 selected constructs, 71.4% of the single-item measures demonstrated strong, if not very strong, definitional correspondence (as a measure of content validity). In Study 2, based on a heterogeneous sample of working adults, we show that the majority of single-item measures examined presented little to no comprehension or usability concerns. Study 3 provides evidence for the reliability of the proposed single-item measures based on test–retest reliabilities across three temporal conditions (1 day, 2 weeks, 1 month). In Study 4, we examined issues of construct and criterion validity using a multi-trait, multi-method approach. Collectively, 75 of the 91 focal measures demonstrated very good or extensive validity, evidencing moderate to high content validity, no usability concerns, moderate to high test–retest reliability, and extensive criterion validity. Finally, in Study 5, we empirically examined the argument that only conceptually narrow constructs can be reliably and validly assessed with single-item measures. Results suggest that there is no relationship between subject matter expert evaluations of construct breadth and the reliability and validity evidence collected across the first four studies. Beyond providing an off-the-shelf compendium of validated single-item measures, we abstract our validation steps, providing a roadmap for others to replicate and build upon. Limitations and future directions are discussed.

Similar content being viewed by others

Reporting reliability, convergent and discriminant validity with structural equation modeling: A review and best-practice recommendations

Article Open access 30 January 2023

Four Research Designs and a Comprehensive Analysis Strategy for Investigating Common Method Variance with Self-Report Measures Using Latent Variables

Article 17 December 2015

Research Design Considerations in Human Science Research: Reconciling Conceptions of Science, Theories of Measurement and Research Methods

Chapter © 2018

Notes

We should note that definitional and terminological confusion surrounds the concepts of complexity and concreteness, with different authors approaching and defining terms in inconsistent ways. For example, referring to a construct as “broad” or “abstract” could be meant to refer to how large the construct space is (e.g., a “complex” personality trait), to denote that a construct is multi-dimensional in nature (e.g., job satisfaction), or to denote that a concept is not grounded in sensory-motor information (e.g., role ambiguity; Borghi et al. 2017).

We recognize that content validity is also often conceptualized in terms of the degree to which a construct is accurately captured by the item(s) included in a given measure. While we follow the example set forth by Colquitt et al. (2019) in terms of examining definitional correspondence as an indicator of content validity, in other literatures our approach might be evaluated as an examination of face validity. As discussed by Allen et al. (2022, p. 1), though, “just as for multi-item measures, it is critically important for single-item measures to demonstrate face validity,” wherein face validity can be defined as the “clarity or relevance” of an item or measure.

Colquitt et al. (2019) provide overall criteria with five levels (Table 5 in their study). A definitional correspondence estimate of .91 and above is considered very strong, .87 to .90 strong, .84 to .86 moderate, .60 to .83 weak, and .59 and below indicative of a lack of definitional correspondence. To be clear, we are not arguing that a definitional correspondence estimate of .60 to .83, which Colquitt et al. again define as weak, is necessarily acceptable. Consistent with our overarching argument for triangulating the validity of single-item measures, the goal is to ensure content validity is demonstrated first and foremost and then evaluated against other pieces of psychometric evidence relative to the needs of a specific program of research.
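
Expressed as a simple classification rule, the banding above can be applied directly to a correspondence estimate. The sketch below is our own illustration (the function name is hypothetical), not part of Colquitt et al.'s materials.

```python
def definitional_correspondence_band(estimate: float) -> str:
    """Map a definitional correspondence estimate onto the verbal bands
    reported by Colquitt et al. (2019, Table 5)."""
    if estimate >= 0.91:
        return "very strong"
    if estimate >= 0.87:
        return "strong"
    if estimate >= 0.84:
        return "moderate"
    if estimate >= 0.60:
        return "weak"
    return "lack of definitional correspondence"


# Example: an estimate of .88 falls in the "strong" band.
print(definitional_correspondence_band(0.88))
```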

Please note that presenting multiple examples in an item does not mean the item is necessarily double-barreled. Double-barreled questions ask about two distinct (i.e., divergent) “attitudinal” phenomena wherein respondents provide only one answer (Olson 2008). Using the word “and” in an item does not inherently make it double-barreled. However, in the spirit of Study 2, the use of multiple examples (conjuncts) may increase item complexity and affect the interpretability of single-item measures (Olson 2008).

Cicchetti (1994) suggests that ICC values greater than .74 indicate excellent reliability, between .60 and .74 indicate good reliability, between .40 and .59 indicate fair reliability, and below .40 indicate poor reliability.

To be clear, similar to Hypothesis 1, we are not advocating that single-item measures demonstrating “poor” construct validity (i.e., loadings of .32 to .44) are necessarily valid. We set this minimum based on accepted practices in the larger scale development literature. As noted by Allen et al. (2022), though, the standards applied to validating single-item measures may differ from those used for multi-item measures. Thus, depending on the construct under consideration, setting more stringent minimums might be prudent. It should also be recognized that using different multi-item measures of the same focal construct might result in different construct validity estimates for a given single-item measure. Single-item construct validity evidence must therefore be interpreted relative to other pieces of validity evidence, including content validity as well as the psychometric characteristics of the comparative multi-item measure.

These two multi-item measures included negatively worded items. Autonomy climate (χ²(5) = 83.04, p < .001, CFI = .83, SRMR = .07) and perceived contract breach (χ²(5) = 110.48, p < .001, CFI = .89, SRMR = .08) both continued to demonstrate poor fit when the respective single-item measure was excluded and the CFA model re-estimated; the poor fit thus appears to be a function of the multi-item measure rather than of the single-item measure.

Another way to evaluate construct validity is to examine the bivariate correlation between the two measurement approaches. As such, the bivariate correlations between the single-item measure and the multi-item reference measure for each construct are reported in Table 2. The average construct validity correlation across the 91 constructs was .58 (SD = .19). Interestingly, across the 91 constructs, single-item CFA factor loadings correlated at .97 (p < .001) with the observed bivariate correlations between the single-item and multi-item construct measures, suggesting that the two approaches to establishing construct validity are effectively equivalent.
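
For readers wishing to reproduce this check on their own data, the computation amounts to two correlations. The sketch below uses made-up numbers and hypothetical variable names, not data from the present studies.

```python
import numpy as np
import pandas as pd

# Hypothetical respondent-level data for one construct: the single-item
# score and the composite (mean) of the multi-item reference measure.
scores = pd.DataFrame({
    "single_item": [4, 3, 5, 2, 4, 5, 3, 1],
    "multi_item":  [4.2, 3.1, 4.8, 2.4, 3.9, 4.6, 2.8, 1.5],
})

# Per-construct construct-validity correlation (analogous to the values
# reported in Table 2).
r_construct = scores["single_item"].corr(scores["multi_item"])

# Across constructs, the same logic compares the vector of single-item CFA
# loadings with the vector of bivariate correlations (illustrative values).
loadings = np.array([0.72, 0.55, 0.81, 0.64])
bivariate_rs = np.array([0.66, 0.49, 0.77, 0.60])
r_methods = np.corrcoef(loadings, bivariate_rs)[0, 1]

print(round(r_construct, 2), round(r_methods, 2))
```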

Relatedly, in an effort to reduce the trade-off between scale length and scale quality, Cortina et al. (2020) developed a procedure that aims to optimize the quality (e.g., alpha reliability coefficient, part-whole correlations) of the resulting shortened scale by analyzing all possible sets of items drawn from the full scale.
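
A minimal sketch of that general exhaustive-search idea, using only coefficient alpha as the quality criterion, is shown below; it is our own simplified illustration, not Cortina et al.'s actual implementation, and the data are randomly generated placeholders.

```python
from itertools import combinations

import numpy as np


def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_respondents x n_items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)


def best_short_form(full_scale: np.ndarray, n_items: int):
    """Return the column indices of the n_items subset with the highest
    coefficient alpha, together with that alpha value."""
    best = max(
        combinations(range(full_scale.shape[1]), n_items),
        key=lambda idx: cronbach_alpha(full_scale[:, list(idx)]),
    )
    return best, cronbach_alpha(full_scale[:, list(best)])


# Demo with random data standing in for a 6-item scale shortened to 3 items.
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 6)).astype(float)
subset, alpha = best_short_form(responses, 3)
print(subset, round(alpha, 2))
```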

We established overall construct validity evaluations based on a point system; we assigned points based on content validity (i.e., 1 point: content validity ≤ .69, 5 points: content validity ≥ .90), the degree of usability concerns (i.e., 1 point: systematic usability concerns, 4 points: no meaningful usability concerns), average ICC test–retest reliability (i.e., 1 point: ICC < .40, 4 points: ICC > .74), construct validity (i.e., 0 points: CFA factor loading < .32, 5 points: CFA factor loading > .70), and criterion validity (i.e., 1 point: limited to no evidence of criterion validity, 5 points: systematic evidence of criterion validity). We then computed an average across these different pieces of reliability and validity evidence (scores ranged from 1.67 to 4.47). Constructs with scores greater than 4.00 were evaluated as having extensive construct validity, constructs between 3.00 and 3.99 as demonstrating very good construct validity, constructs between 2.70 and 2.99 as demonstrating good construct validity, constructs between 2.25 and 2.69 as demonstrating limited construct validity, and constructs scoring less than 2.25 as demonstrating no construct validity. Additional information is available upon request.
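
Because the footnote reports only the anchor point values and the final bands, the sketch below implements just the final aggregation step; the component points passed in are hypothetical, and the intermediate point assignments between the stated anchors are not reproduced here.

```python
def triangulated_validity_label(component_points: list[float]) -> str:
    """Average the points awarded for content validity, usability,
    test-retest reliability (ICC), construct validity, and criterion
    validity, then map the mean onto the labels used in Table 3."""
    mean = sum(component_points) / len(component_points)
    if mean > 4.00:
        return "extensive"
    if mean >= 3.00:
        return "very good"
    if mean >= 2.70:
        return "good"
    if mean >= 2.25:
        return "limited"
    return "no construct validity"


# Hypothetical example: a measure awarded 5, 4, 4, 4, and 4 points across
# the five components averages 4.2 and would be labeled "extensive".
print(triangulated_validity_label([5, 4, 4, 4, 4]))
```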

There was also no relationship between SME ratings of construct breadth and the final triangulated construct validity level reported in Table 3 (r = 0.15, p = 0.14).

References

Author information

Authors and Affiliations

  1. University of Alabama, 361 Stadium Drive, Box 870225, Tuscaloosa, AL, 35487, USA Russell A. Matthews & Yeong-Hyun Hong
  2. Wayne State University, 5057 Woodward Ave., Suite 8402.3, Detroit, MI, 48202, USA Laura Pineault