Introduction
Early detection of psychological problems is, doubtless, a goal of enormous interest for diagnosis and early intervention. But it is still more important in adolescent populations, as early intervention can slow down the stabilization and progress of more severe psychopathology in later stages of life. This need has encouraged the use of screening instruments easily administered to large samples and with sufficient sensitivity and specificity to make early detection and specialized intervention possible (Sheldrick et al., 2015).
One of the most widely used tests in the world is the 12-item version of the General Health Questionnaire (GHQ-12), originally proposed by Goldberg (1972). The 12-item questionnaire asks about such matters as the ability to concentrate, sleeping problems, decision-making problems, perceived stress, self-confidence or perceived happiness. It does not enable the underlying psychopathology to be established but explores perceptions of stress that could be manifestations of an underlying disorder. There are studies backing both its internal and external validity for detecting nonpsychotic psychopathology (Sánchez-López, & Dresch, 2008).
Although there is wide agreement about the ability of the test to detect such signs of distress, there is persistent controversy concerning what the test really measures. The main problem is that there is no agreement on the number of dimensions that configure the test, which is an indispensable requirement for interpreting the results of its application and comparing the results found. Among the studies published are those which propose that the GHQ-12 has a one, two or three-dimensional structure.
Confusion about the structure of the GHQ-12 inevitably leads to misuse and to studies that interpret its results in very different ways. Thus, some studies explore the "prevalence" of specific mental health problems in adolescents, such as anxiety or depression, even though the GHQ-12 is merely a screening test that is unlikely to measure specific disorders and cannot estimate their prevalence (Bansal, Goyal, & Srivastava, 2009; Mann et al., 2011), or report links between such supposed disorders and other mental health problems (Almeida et al., 2019; Ogawa et al., 2019; Ojio, Nishida, Shimodera, Togo, & Sasaki, 2016). Therefore, it is necessary to know the true structure of the questionnaire, the dimensions it is able to measure and, ultimately, what it really measures.
When the test has been applied in adult populations, several possible factor structures have been found (Shevlin, & Adamson, 2005), along with generally acceptable psychometric properties and diagnostic validity (Makowska et al., 2002). When the test has been used in samples of adolescents, two and three-factor solutions have predominated, as summarized in Table 1. Through the statistical methods used, differences have been found in very diverse ethnic groups (Bowe, 2017). It may be observed that there are two studies in Spanish adolescents, one of which found a two-factor solution, and the other, a three-factor structure.
One main reason for these different solutions has been suggested. There is empirical evidence that people do not respond to positive items (in the direction of distress) in the same way as to negative ones, which generates two artefactual factors, the first of which is the more consistent of the two, because negative items are ambiguous about the nonexistence of such symptoms (Hankins, 2008a). In addition, the methods generally used for structural analysis are inadequate for the nature of the data explored. The Kaiser criterion (eigenvalue > 1) and scree plots are inadequate to the extent that they assume that Likert scales yield continuous scores, and they tend to retain artefactual dimensions (Ruscio, & Roche, 2012). The same is true of methods such as principal component extraction, usually based on the Pearson correlation matrix. These methods tend to group responses according to the distribution of the variables rather than their content. Thus, the responses to the more ambiguous questions may simply be grouped by type of response, generating spurious factors. Nor is Cronbach’s alpha coefficient an adequate estimator of the internal consistency of ordinal scales such as Likert-type responses (Crutzen, & Peters, 2017; Yang, & Green, 2011). Several studies have shown that when these biases are controlled for and appropriate statistics are applied, the solution found is always one-factor (Hankins, 2008b; Rey et al., 2014; Romppel et al., 2013).
Another controversial question is whether the results of the GHQ-12 in adolescents should be interpreted by the same rules as for adults, that is, with the cutoff point placed at three or more affirmative answers. There are no studies exploring the predictive validity of the GHQ-12 in Spanish adolescent populations. When it has been applied to samples from other countries, the findings have been contradictory. Some authors suggest that the scores of adolescents should be interpreted in the same way as those of adults (French, & Tait, 2004), others use Likert scores and find a high prevalence of psychopathological problems (Bansal, Goyal, & Srivastava, 2009), and others propose cutoff points of 10 for males and 11 for females on continuous scores (Baksheev et al., 2011). The author of the GHQ-12 suggested using the sample mean as the cutoff point when no previous data are available (Goldberg, & Williams, 1998).
The purpose of this study was to explore the dimensionality of the GHQ-12 in a representative sample of adolescents enrolled in schools in a large Western city (Madrid, Spain), using the most appropriate methods, able to control for the biases mentioned. In addition, the differences between males and females in the sample were estimated according to the factor solution found.
Method
Participants and procedure
Study population
The study population comprised students enrolled in the 4th year of Obligatory Secondary Education in public, semi-private and private high schools in the city of Madrid during the 2016-2017 academic year.
Design
An observational, multicenter, cross-sectional descriptive study was designed for public, semi-private and private schools in the city of Madrid. The main objective of this study was to understand the relationships between the use and abuse of information and communication technologies and various health indicators, including the risk of poor mental health (Pedrero-Pérez et al., 2019). Stratified random sampling was applied by (a) city district, with the 21 administrative districts that make up the municipality of Madrid grouped into the four strata used in the Study of Health in the City of Madrid 2014 (Díaz-Olalla, & Benítez-Robredo, 2015), a grouping closely related to the level of development of the residential area; and (b) school funding, based on the complete list of schools in the city of Madrid, which includes their funding (public, subsidized or private), their location and the number of 4th year Obligatory Secondary Education students (ESO, usually 15-16-year-olds) per center, as provided by the education authority, the Council for Education and Research of the Community (Region) of Madrid.
The directors and guidance counselors of the 34 participating high schools were informed of the study details in advance and gave their consent. Fieldwork was done by professionals with previous experience in educational intervention who received training in the digital application of the questionnaire. The GHQ-12 questionnaire (among others) was uploaded to an online digital application (Google Form®), enabling simultaneous anonymous answers. The participants were also requested to enter their sex, age and school. The questionnaires were filled out in a computer lab in which each participant had a computer connected to the Internet. The instructors, both the school’s teachers and visiting professionals, remained present while the questionnaires were being filled out. Informed consent was requested beforehand from the parents, tutors or legal guardians of the participants. Only students who handed in this signed consent participated in the study (9% excluded). The data collection method guaranteed the anonymity of the participants. The study was approved by the General Direction of Nursery, Primary and Secondary Education of the Community of Madrid. Data were collected from December 2016 to March 2017: 12 schools in December 2016, nine schools in January 2017, 12 schools in February 2017 and one in March 2017.
Participants
A representative sample was taken of the total population of 4th year ESO students (n = 2,341) at the 34 schools selected, stratified by level of development of the neighborhoods and ownership: public, semi-private or private. For this study, and to keep data homogeneous, only questionnaires answered by 15- or 16-year-old students were included (n = 2,171; 50.2% males). According to population distribution, 34.3% went to a public school, 58.1% to a semi-private one and 7.6% to a private one.
Instrument
The Spanish version (Lobo, & Muñoz, 1996) of the 12-item questionnaire by Goldberg (1972) was used. It is a self-report questionnaire with multiple-choice answers, always with four choices. Six of the items are positive (psychological distress) and six are negative (no distress). When they are scored, all the items are interpreted in the direction of distress, so the higher the score, the greater the distress. Two ways of scoring the responses have been suggested: (a) GHQ-Likert scores, from 0 to 3, where higher scores indicate worse mental health, and (b) GHQ criterion scores, assigning values of 0-0-1-1 to the item choices. Criterion scores of more than two affirmative answers suggest risk of poor mental health (Makowska et al., 2002).
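The two scoring schemes described above can be sketched as follows. This is only an illustration, not the authors' procedure: the function names and the example respondent are hypothetical, and the sketch assumes that each of the 12 responses has already been coded from 0 to 3 in the direction of distress.

```python
# Hypothetical helpers; responses are assumed to be coded 0-3 with
# higher values always meaning more distress.

N_ITEMS = 12

def _check(responses):
    """Reject anything that is not 12 integer codes in the range 0-3."""
    if len(responses) != N_ITEMS or any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("expected 12 responses coded 0, 1, 2 or 3")

def likert_score(responses):
    """GHQ-Likert total: sum of the 0-3 item codes (range 0-36)."""
    _check(responses)
    return sum(responses)

def criterion_score(responses):
    """GHQ criterion total: recode 0-0-1-1, then sum (range 0-12)."""
    _check(responses)
    return sum(1 if r >= 2 else 0 for r in responses)

def at_risk(responses, cutoff=3):
    """True when there are more than two affirmative answers (>= cutoff)."""
    return criterion_score(responses) >= cutoff

answers = [0, 1, 2, 3, 1, 0, 2, 1, 0, 0, 3, 1]  # made-up respondent
print(likert_score(answers), criterion_score(answers), at_risk(answers))
```

The cutoff is left as a parameter because, as discussed in the Introduction, different cutoff points have been proposed for adolescents.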
Data analysis
First, outliers (participants who answered randomly or incongruently) were identified by means of the Mahalanobis distance (p<.001) and excluded. Descriptive statistics were then computed for the items. The multivariate normality hypothesis was checked with the Mardia test. If the multivariate normality criterion was not met, a polychoric (Likert scores) or tetrachoric (criterion scores) correlation matrix was constructed. Based on this matrix, an optimized parallel analysis was performed from the random generation of 500 submatrices (Timmerman, & Lorenzo-Seva, 2011). The suitability of the one-factor solution was studied with two indicators: Unidimensional Congruence (UniCo, appropriate when >.95) and Explained Common Variance (ECV, appropriate when >.85). The accuracy of the one-factor solution was studied using the Factor Determinacy Index (FDI, appropriate when >.90). Response bias was controlled for by bias-corrected and accelerated (BCa) bootstrapping, establishing a confidence interval for the factor loading of each item. The repeatability of the construct was studied using the generalized H index (G-H). All the above estimators were used following the suggestions of Ferrando and Lorenzo-Seva (2017). Internal consistency was studied using the Greatest Lower Bound (GLB), McDonald’s Omega (ω) and the standardized Cronbach’s Alpha (αs) (Trizano-Hermosilla, & Alvarado, 2016). The estimators of the model’s goodness of fit to the data were the Root Mean Square Error of Approximation (RMSEA, which should be <.07), the Goodness-of-Fit Index (GFI), the Adjusted Goodness-of-Fit Index (AGFI), the Non-Normed Fit Index (NNFI) and the Comparative Fit Index (CFI), all of them appropriate at >.95 (Hooper, Coughlan, & Mullen, 2008).
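The first step above, outlier screening by Mahalanobis distance, might look roughly like the sketch below. It is not the routine used in the study: all function names and the toy data are hypothetical, the covariance matrix is assumed to be invertible, and the chi-square tail probability uses a closed form that holds only for an even number of variables (which would include the 12 GHQ items).

```python
import math

def mean_vector(rows):
    """Column means of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows, mu):
    """Unbiased sample covariance matrix (denominator n - 1)."""
    n, p = len(rows), len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def mahalanobis_sq(row, mu, cov):
    """Squared Mahalanobis distance of one row from the centroid."""
    d = [x - m for x, m in zip(row, mu)]
    return sum(di * zi for di, zi in zip(d, solve(cov, d)))

def chi2_sf_even(x, df):
    """Upper-tail chi-square probability; exact closed form, even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive even df")
    h = x / 2
    return math.exp(-h) * sum(h ** i / math.factorial(i) for i in range(df // 2))

def flag_outliers(rows, alpha=0.001):
    """True for each row whose squared distance has p < alpha."""
    mu = mean_vector(rows)
    cov = covariance_matrix(rows, mu)
    return [chi2_sf_even(mahalanobis_sq(r, mu, cov), len(mu)) < alpha
            for r in rows]

# Toy two-variable data: 29 clustered points plus one extreme point.
cluster = [[i % 5, (3 * i) % 7] for i in range(29)]
flags = flag_outliers(cluster + [[100, 100]])
print(sum(flags), "row(s) flagged")
```

With real data a statistical library would normally supply the covariance inverse and the chi-square distribution; the hand-written versions here only keep the sketch free of dependencies.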
The FACTOR program, version 10.8.03, was used for all these analyses (Ferrando, & Lorenzo-Seva, 2013). Sample subgroups were compared by analysis of variance, with omega squared (ω2) as the estimator of effect size, interpreted as small when ω2=.01, medium when ω2=.06 and large when ω2=.14. Categorical variables were compared with the chi-squared (χ2) test and Cramér’s V, interpreted as small when V=.10, medium when V=.30 and large when V=.50. These analyses were performed using SPSS 22 statistical software (omega squared was calculated by hand from the ANOVA output provided by the program).
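Since omega squared was derived by hand from the ANOVA output, a small sketch of that calculation may be useful, together with Cramér’s V. The formulas are the standard ones for a one-way ANOVA and a contingency table; the function names and the input figures are hypothetical, and treating the reference values as lower bounds of each band is an assumption of this sketch.

```python
import math

def omega_squared(ss_between, ss_within, df_between, df_within):
    """Omega squared from the sums of squares of a one-way ANOVA table.

    Can come out slightly negative when the effect is tiny.
    """
    ms_within = ss_within / df_within
    ss_total = ss_between + ss_within
    return (ss_between - df_between * ms_within) / (ss_total + ms_within)

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V from a chi-squared statistic on an n_rows x n_cols table."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

def interpret(value, small, medium, large):
    """Label an effect size, treating the reference values as lower bounds."""
    if value >= large:
        return "large"
    if value >= medium:
        return "medium"
    if value >= small:
        return "small"
    return "negligible"

# Made-up ANOVA figures, for illustration only.
w2 = omega_squared(ss_between=30, ss_within=970, df_between=1, df_within=97)
print(round(w2, 4), interpret(w2, 0.01, 0.06, 0.14))

# Made-up chi-squared result on a 2 x 3 table.
v = cramers_v(chi2=25, n=2500, n_rows=2, n_cols=3)
print(round(v, 2))
```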
Results
The outlier detection study did not advise excluding any of the subjects from the sample (Mahalanobis distance p>.001 in all cases).
Likert-type scoring
The descriptive statistics for the items are shown in Table 2.
The Mardia test was applied to the Pearson correlation matrix and showed that the distribution of the items did not meet multivariate normality (p<.05), so a polychoric correlation matrix was constructed. Based on the new matrix, an optimized parallel analysis was performed, which suggested the one-dimensionality of the scale, with a single factor explaining 46% of accumulated variance (Table 3). The goodness-of-fit statistics for this one-factor solution were satisfactory: RMSEA=.07; NNFI=.97; CFI=.98; GFI=.98; AGFI=.98. The model repeatability index (Latent h=.91; Observed h=.81) was adequate, as was the suitability of the one-factor solution (UniCo=.97; ECV=.87). Internal consistency indicators were also appropriate: GLB=.93; ω=.89; αs=.89. Table 4 shows the characteristics of the one-factor solution found. The results did not vary when the same tests were done separately in males and females.
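As a reading aid for the internal consistency indicators reported above, the sketch below shows how McDonald’s ω and the standardized α relate to the loadings and inter-item correlations of a one-factor model. It uses the textbook formulas with hypothetical loadings, not the values in the tables, and it is not a reproduction of the FACTOR output.

```python
def mcdonald_omega(loadings):
    """Omega for a one-factor model with standardized loadings.

    Each item's unique variance is taken as 1 - loading ** 2.
    """
    common = sum(loadings) ** 2
    unique = sum(1 - l * l for l in loadings)
    return common / (common + unique)

def standardized_alpha(n_items, mean_r):
    """Standardized Cronbach's alpha from the mean inter-item correlation."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)

# Hypothetical case: 12 items, all loading .70 on the single factor,
# so every inter-item correlation equals .70 * .70 = .49.
print(round(mcdonald_omega([0.7] * 12), 3))
print(round(standardized_alpha(12, 0.49), 3))
```

When all loadings are equal, the two coefficients coincide; they diverge as the loadings become more unequal, which is one reason both are usually reported.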
Criterion Scoring
The descriptive statistics of the items are shown in Table 4. The Mardia test again showed that the distribution of the items did not meet multivariate normality (p<.05), so a tetrachoric correlation matrix was constructed. An optimized parallel analysis based on the new matrix suggested the unidimensionality of the scale, with a single factor explaining 60.2% of the total variance of the test (Table 5). The goodness-of-fit statistics for this one-factor solution were satisfactory: RMSEA=.04; NNFI=.99; CFI=.99; GFI=.99; AGFI=.99. The model’s repeatability index (Latent h=.95; Observed h=.70) and the suitability of the one-factor solution (UniCo=.99; ECV=.92) were also adequate. The internal consistency indicators were appropriate as well: GLB=.96; ω=.94; αs=.94. Table 5 shows the characteristics of the one-factor solution found. The results did not vary when the same tests were run separately for males and females.
Sex differences
Table 6 shows the differences in answers to the GHQ-12 between males and females. Females said they had more problems concentrating, more depressive symptoms, less self-confidence and less perceived ability to cope with problems, but especially, more feelings of anxiety and tension than males, although all of this with a very small effect size.
Scores
Following the author’s instructions, the cutoff point should be three affirmative answers as an indicator of risk of poor mental health. This risk would be moderate up to the mean plus the standard deviation (three, four or five affirmative answers) and severe for six or more affirmative answers. Thus, the sample could be classified in three categories: no risk, moderate risk and severe risk (Table 7).
Another way of classifying the participants in our sample would be to use the Likert scores, with which scores not over the sample mean plus the standard deviation show no risk of poor mental health, between the mean plus one and plus two standard deviations show moderate risk, and over the mean plus two standard deviations show severe risk (Table 7).
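Both classification rules can be written down compactly, as in the sketch below. It is an illustration under stated assumptions rather than the procedure behind Table 7: the function names and the sample totals are hypothetical, and the criterion rule assumes that a total of six or more falls in the severe band.

```python
import statistics

def classify_criterion(score):
    """Risk band from the criterion total (0-12 affirmative answers)."""
    if score >= 6:          # assumption: six or more counts as severe
        return "severe risk"
    if score >= 3:          # three, four or five affirmative answers
        return "moderate risk"
    return "no risk"

def classify_likert(score, mean, sd):
    """Risk band from the Likert total, relative to the sample mean and SD."""
    if score > mean + 2 * sd:
        return "severe risk"
    if score > mean + sd:
        return "moderate risk"
    return "no risk"

# Made-up Likert totals standing in for a sample.
likert_totals = [5, 8, 9, 10, 10, 11, 12, 13, 15, 27]
m = statistics.mean(likert_totals)
s = statistics.stdev(likert_totals)
for total in likert_totals:
    print(total, classify_likert(total, m, s))
```

Because the Likert thresholds depend on the sample mean and standard deviation, the same raw score can fall in different bands in different samples, whereas the criterion bands are fixed.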
Discussion
The results of this study clearly show that the GHQ-12 should be considered a unidimensional test of psychological distress, at least when applied to an adolescent population. Exploratory and semi-confirmatory (Lorenzo-Seva, & Ferrando, 2013) factor analyses were used, controlling for biases inherent in the item response method, in a randomized sample representative of the adolescent population (15-16 years old) of a large city (Madrid). Unlike previous studies, which usually relied on convenience sampling, obsolete methods inadequate for the nature of the data (Kaiser criterion, scree plot, Pearson correlation matrix, Cronbach’s alpha, etc.) were avoided, and more appropriate statistical indicators were used, both for the Likert-type response format and for the dichotomous scoring of the questionnaire. All of the statistics enable us to state that there is no doubt at all about the one-dimensionality of the test. These data are consistent with the proposals of the authors who attribute the appearance of two or three dimensions to the use of inadequate estimators based on inadequate assumptions, such that when all those biases are controlled for the solution is always unidimensional (Hankins, 2008).
As a unidimensional test, the GHQ-12 is a reliable instrument, both overall and item by item. What appear in other studies to be independent dimensions (social dysfunction, anxiety and loss of self-confidence) are simply components of a general construct of psychological distress, from which no diagnostic conclusions can be derived; the test provides only an initial screening, which must be followed up with specific psychodiagnosis.
When gender differences were studied, females scored higher on many of the symptoms and on the overall test score, as is usual. The real reasons for this tendency of females to report more distress and more somatic and psychic symptoms are unknown (Davis, Matthews, & Twamley, 1999), although a good number of possible reasons have been proposed: innate differences in somatic and visceral perception; differences in the labelling, description and reporting of symptoms; differences in education and socialization; sex differences in the incidence of abuse and violence; and gender prejudices in research as well as in clinical practice (Barsky, Peekna, & Borus, 2001). Whatever the reason, the data in our study suggest that adolescent girls are more vulnerable than their male peers.
Applying the usual criterion scoring, we found that 30% of male and 42% of female adolescents were at risk of developing mental health problems. Doubtless, one of the main characteristics of this period of life is emotional instability derived from the change in roles, the search for identity, the beginning of autonomy in decision-making, etc., which coincides with a neurologically critical period of synaptic pruning (Blakemore, & Choudhury, 2006). However, it seems exaggerated to attribute clinical significance to percentages this high. In any case, a screening instrument like the GHQ-12 can enable the early detection of problems which often form the basis for the establishment of psychopathological symptoms that persist through adulthood (Paus, Keshavan, & Giedd, 2008). The results found using the Likert scores (around 11% of males and 18% of females at risk of poor mental health) are closer to clinical reality, but it should not be forgotten that the GHQ-12 is a screening instrument and is not meant for diagnosis.
In conclusion, the GHQ-12 is a unidimensional screening test for psychological distress, with excellent psychometric properties for its application in an adolescent population. Future studies should replicate these results in other populations (e.g., rural, adults, elderly, etc.), in other cultural settings and in clinical populations. The use of adequate statistical methods can enable old controversies to be overcome and favor proper application and interpretation of the results provided by the test.