Context
In research on content validity using expert judges or study participants, a usual step is to quantify the results with one of various methods (for a summary of the methods, see Pedrosa, Suarez-Alvarez & Garcia-Cueto, 2013). Content validity is usually quantified as a single coefficient, ranging from 0.0 to 1.0 (e.g., Aiken, 1980; Hernandez-Nieto, 2002; Lynn, 1986; Osterlind, 1992) or from -1.0 to 1.0 (e.g., Lawshe, 1975; Rovinelli & Hambleton, 1977), and values close to 1.0 are generally interpreted as evidence of the strength of the validity of the measured attribute. This standardization applies regardless of the response metric; for example, a scaling from 1 to 3, or from 0 to 7. This type of transformation is very common because its interpretive framework bounds the values at the extremes, it is comparable to a percentage-based transformation (from 0.0% to 100%), and it helps ensure that the final consumers of the results will quickly understand the information (Bonett & Price, 2020). This transformation also makes the interpretation independent of the scaling frame in which the data were collected. For example, a coefficient V (Aiken, 1980) of .70 can be obtained from an average of 3.8 on a scale from 1 to 5, an average of 2.8 on a scale from 0 to 4, or an average of 7.3 on a scale from 1 to 10. However, it is neither the only possible transformation nor the only framework for understanding the results.
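As a compact statement of this scale independence, following Aiken's (1980) definition, the coefficient is simply the rescaled mean of the judges' ratings,

$$V = \frac{\bar{X} - l}{k},$$

where $\bar{X}$ is the mean rating, $l$ the lowest possible rating, and $k$ the range of the scale (highest point minus lowest point). The three averages above all map onto the same value: $(3.8 - 1)/4 = (2.8 - 0)/4 = (7.3 - 1)/9 = .70$.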
In a content validity study where the data are collected on a particular scale, for example, from 1 to 5, the results may also be reported in that same metric, requiring no further processing to communicate them. Expressing the results in the metric of the rating scale is useful because it avoids introducing another metric for interpretation, and it has the advantage of contextualizing the results in the same units in which the individual scores were produced. For example, the evidence of content representativeness can be scaled as 1 (completely unrepresentative), 2 (unrepresentative), 3 (moderately representative), 4 (acceptably representative), and 5 (fully representative); if a group of judges produces an average score of 4.1, this average suggests an acceptable representation of the construct, and it can also be said that the trend of the perceived validity sits just at level 4. Judging the accuracy of this kind of information requires other interpretable quantities, such as confidence intervals (CIs).
Because a point estimate does not itself convey its accuracy, a confidence interval indicates that accuracy through a range of variability around the estimated value. For the user, this can be interpreted as the degree of certainty of finding the estimated value in the reference population, which allows better decision making in research and adds information to the interpretation of the results (Escrig-Sos et al., 2007; Tellez et al., 2015). CIs for content validity coefficients seem to have barely been developed (Penfield & Giacobbi, 2004). Moreover, an informal review of the literature, performed repeatedly by the authors of this manuscript, found that, beyond Penfield's work (Miller & Penfield, 2005; Penfield & Giacobbi, 2004), no CI procedures for these types of coefficients have been disseminated or derived to date.
Method
To add information about the statistical accuracy of content validity coefficients, confidence intervals must be built around the estimated coefficient. However, the traditional method for creating confidence intervals (i.e., the Wald method) requires assuming a normal distribution for the calculated average (Penfield, 2003). Because the data obtained in content validity studies are usually discrete, scaled on five or fewer options, and skewed in distribution, and because the judges' samples are generally small, another method is required to estimate appropriate intervals (Miller & Penfield, 2005; Penfield & Miller, 2004).
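A small illustration with assumed values shows why the Wald interval can misbehave under these conditions. For a mean of $\bar{X} = 4.8$ on a 1 to 5 scale, with $s = 0.4$ and $n = 5$ judges, the 95% Wald interval is

$$\bar{X} \pm z_{.975}\,\frac{s}{\sqrt{n}} = 4.8 \pm 1.96\,\frac{0.4}{\sqrt{5}} \approx [4.45,\ 5.15],$$

whose upper limit falls outside the admissible range of the scale; an asymmetric method that respects the scale bounds avoids this problem.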
Based on the work of Penfield (2003), who adapted the Wilson (1927) score procedure to generate asymmetric confidence intervals, Penfield and Miller (2004) applied this adaptation to the data produced in content validity studies; specifically, scores based on ordinal metrics, small judges' samples (fewer than 10), a reduced number of response options, and skewed distributions. There are other methods for generating asymmetric confidence intervals (e.g., Willink, 2005), but Wilson's method appears to be efficient under the conditions in which content validity studies are conducted (Penfield, 2003; Penfield & Miller, 2004). Because manual calculation is prone to computational error, computer code is proposed for this purpose, using the Statistical Package for the Social Sciences (SPSS). In recent years, R has become a powerful free platform for software development, but SPSS is still used in different fields of methodological research (e.g., Duricki, Soleman, & Moon, 2016; Valeri & Vanderweele, 2013; Vanus, Kubicek, Gorjani, & Koziorek, 2019), and its ease of use and immediate comprehensibility, especially in the social sciences, has long been noted (Gogoi, 2020; Rivadeneira et al., 2020).
The description of the procedure and its rationale can be found directly in Penfield and Miller (2004), where readers can inspect the formulas. Further literature on this method, based on the score procedure (Wilson, 1927), can be found in Penfield (2003), Penfield and Giacobbi (2004), and Miller and Penfield (2005).
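For orientation only, one published form of this score interval is the one given by Penfield and Giacobbi (2004) for Aiken's V, that is, for the mean rescaled to the 0 to 1 range. With $n$ judges, scale range $k$, and $z$ the standard normal quantile for the chosen confidence level, the lower and upper limits are

$$L,\,U = \frac{2nkV + z^2 \mp z\sqrt{4nkV(1 - V) + z^2}}{2(nk + z^2)},$$

and they can be mapped back to the raw rating metric as $L \cdot k + l$ and $U \cdot k + l$. The exact expressions used for the mean of expert ratings should be checked against Penfield and Miller (2004).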
Program
To calculate the average of the content validation ratings and the asymmetric confidence intervals, an SPSS syntax was written, adapted from the SAS syntax of Miller and Penfield (2005). Even with the rise and popularity of free software (Haine, 2019; Muenchen, 2017), SPSS remains in persistent use and still holds the attention of researchers in different disciplines (Masuadi et al., 2021; Shaikh, 2016, 2017). Similar to a few ad hoc programs for obtaining content validity evidence (e.g., Merino-Soto, 2018; Merino-Soto & Livia-Segovia, 2009), the program presented here implements the method of Penfield and Miller (2004) for constructing confidence intervals around the mean of the scores, and reports the mean of the expert judges' ratings, its equivalent expression as a proportion of the scaling range, the asymmetric CI around this mean, and the width of the interval. The program is free of charge and can be requested from the corresponding author.
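Because the actual program must be requested from the corresponding author, the following is only a minimal sketch, in SPSS syntax, of the kind of computation involved; it assumes the score-interval form shown above (Penfield & Giacobbi, 2004), hypothetical item sums, and a design of 17 judges rating on a scale from 1 to 6.

* Minimal illustrative sketch, not the authors' program; item sums are hypothetical.
DATA LIST FREE / item s.
BEGIN DATA
1 70
2 77
END DATA.

* Assumed design: n = 17 judges, scale from 1 to 6 (range rng = 5), 95% CI.
COMPUTE n = 17.
COMPUTE rng = 5.
COMPUTE start = 1.
COMPUTE z = IDF.NORMAL(0.975, 0, 1).
COMPUTE m = s / n.
* Rescale the mean to the 0-1 range (Aiken's V).
COMPUTE v = (m - start) / rng.
* Score (Wilson) limits for V, following Penfield and Giacobbi (2004).
COMPUTE vlow = (2*n*rng*v + z**2 - z*SQRT(4*n*rng*v*(1 - v) + z**2)) / (2*(n*rng + z**2)).
COMPUTE vupp = (2*n*rng*v + z**2 + z*SQRT(4*n*rng*v*(1 - v) + z**2)) / (2*(n*rng + z**2)).
* Map the limits back to the raw rating metric and compute the interval width.
COMPUTE rawlow = vlow * rng + start.
COMPUTE rawupp = vupp * rng + start.
COMPUTE width = rawupp - rawlow.
EXECUTE.
LIST VARIABLES = item m v rawlow rawupp width.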
Application example
To illustrate the procedure, we used the results for the evidence of clarity of the items of the Atención Sostenida dimension reported by Moscoso and Merino-Soto (2017; see Table 2), in which the Inventario de Ecuanimidad y Mindfulness was validated. In the input specifications for the program, the user reports the mean (M) and the sum (S) of the judges' ratings for each item, the number of judges (17 in this example), the number of response options (k; here, 6 ordinal points, from no validity to full validity), the error rate (alpha; one of .10, .05, or .01), and the value (start) at which the scale begins, that is, 0 or 1 (see Table 1, Input heading). The Output heading of Table 1 shows the results after applying the program. In all items except item 1, the lower limit of the interval exceeds the value 4. If the researcher establishes a priori that the lower CI limit of each item must exceed 4, then this item does not seem to meet the criterion. However, given the difference of .02, the researcher must decide whether to apply the criterion strictly or flexibly.
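To make the decision rule concrete, consider a hypothetical item (not taken from Table 1) rated by the 17 judges with a mean of 4.6 on the 1 to 6 scale (range 5), at the 95% level. Rescaling gives $V = (4.6 - 1)/5 = .72$, and the score limits are

$$L = \frac{2(85)(.72) + 1.96^2 - 1.96\sqrt{4(85)(.72)(.28) + 1.96^2}}{2(85 + 1.96^2)} \approx .617, \qquad U \approx .804,$$

or approximately $[4.08,\ 5.02]$ in the raw metric; such an item would pass a lower-limit criterion of 4.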
Final comments
This manuscript presents a computer program, built in SPSS syntax, to quantify the degree to which judges agree on content validity, together with asymmetric confidence intervals. The fundamental difference between this method and others (e.g., Merino-Soto & Livia-Segovia, 2009) is that the results are presented in the same metric as the judges' responses. Because this raw-metric coefficient is linearly equivalent to a transformation onto the 0.0 to 1.0 range, the expected linear association between this method and methods based on standardized content validity coefficients is perfect or very high. Therefore, the user should focus on: a) how the evidence of content validity is expressed (i.e., in the metric of the judges' responses, or in a range between 0.0 and 1.0), rather than on the validity or accuracy of the procedures; b) the confidence level of the interval, which can generally be 90% when the number of judges is small (Merino-Soto & Livia-Segovia, 2009), although the 95% or 99% levels can also be chosen; and c) the existing computer programs for calculating complementary results (e.g., Merino-Soto, 2018; Merino-Soto & Livia-Segovia, 2009).