Introduction
The rise of digital personnel selection procedures has led to a growing interest in new assessment methods (Andrés et al., 2023), such as digital interviews and testing (Woods et al., 2020). Among these approaches, game-related assessments (GRAs) have recently attracted significant attention (Landers & Sanchez, 2022). Their popularity is mainly based on the idea that they reduce the risk of faking and improve applicants' reactions while maintaining predictive validity (Melchers & Basch, 2022; Wu et al., 2022). However, research on GRAs invites skepticism and calls for more evidence (Ramos-Villagrasa et al., 2022). This study extends prior literature by focusing on a type of GRA that has scarcely been researched until now (i.e., game-based assessments) and by considering four job performance dimensions (i.e., task performance, contextual performance, counterproductive work behaviors, and adaptive performance). It also analyzes two kinds of applicant reactions (i.e., perception of comfort and suitability) and investigates the influence of individual differences on scores and reactions to game-based assessments.
GRAs: Types and Characteristics
GRAs refer to the application of gamification science to assessment contexts. Gamification science allows the use of game elements in non-gaming contexts such as personnel selection or training (Landers et al., 2018). In personnel selection, the use of GRAs in real contexts has increased in the last few years for at least two reasons. First, Human Resources Management has become increasingly digitalized (Nikolaou, 2021), and GRAs are mainly developed through technology (Melchers & Basch, 2022). Second, GRAs seem to be a way to solve typical issues in personnel selection such as faking or negative applicant reactions (Landers & Sanchez, 2022).
Nonetheless, research on this topic is growing and providing a less optimistic perspective. The recent review of the literature performed by Ramos-Villagrasa et al. (2022) summarized the existing evidence with respect to GRAs. According to them, it appears that GRAs are a promising selection method, but the accumulated evidence is still scarce and yields mixed results. Focusing on the inconclusive results regarding GRAs, they suggest that differences may be due to the particularities of each type of GRA. Considering this, research should take into account which type of GRA is under study and avoid generalizing research outcomes from one type to another (Ohlms et al., 2023).
According to Ramos-Villagrasa et al. (2022), differences among GRAs are based on their degree of playfulness (the degree to which they are close to traditional assessment or to games designed for entertainment) and their purpose (whether the GRA is a serious game designed for assessment purposes, or is designed for fun and is being used for assessment). Figure 1 outlines this classification and the main characteristics of the different types of GRAs: soft-gamified assessments, hard-gamified assessments, game-based assessments, and playful games. Each type is described below.
Soft-gamified assessments incorporate elements from games to enhance the assessment experience, while retaining the fundamentals of traditional assessment methods. They are used to measure theoretically supported constructs (e.g., cognitive ability, personality, competencies). The assessment approach remains traditional in terms of evaluation (e.g., the use of Likert scales). The process used to develop the GRA is considered soft gamification because, if the game elements are removed, the assessment can still be performed. An example of this type of assessment is the gamified version of the Wisconsin Card Sorting Test (Hommel et al., 2022).
The second type of GRAs is hard-gamified assessments, also known as gamefully designed assessments (Ramos-Villagrasa et al., 2022). This type shares many characteristics with the soft one but is considered a “hard” version because the gamification process is deeply integrated into the assessment. Consequently, measuring the constructs of interest is impossible without the gamified elements. Similar to the soft type, it measures theoretically driven constructs. An example is Owiwi (Nikolaou et al., 2019), a situational judgment test in which item content can be understood only within the context of the game's narrative.
The third type of GRAs related to serious purposes is game-based assessments. This type is closely related to playful games that evaluate the constructs of interest using users' in-game behavior (e.g., reaction time, mouse clicks) instead of, or in addition to, traditional assessment methods such as answering questions or providing solutions to a set of predefined options. The constructs measured with game-based assessments can be theory-driven or data-driven. If they are data-driven, demonstrating the construct-related validity or predictive validity of the constructs measured by the game is needed. An example of this type of GRAs is Cognify (Auer et al., 2022).
The last type of GRAs is playful games. Playful games are designed for entertainment but are repurposed for assessment. Consequently, studies of both construct-related validity and predictive validity are needed. An example is the virtual reality game Job Simulator (Simons et al., 2023).
Ramos-Villagrasa et al. (2022) use this classification to clarify mixed results of prior research. They suggest that GRAs closely resembling traditional assessments (i.e., soft- and hard-gamified assessments) demonstrate better psychometric characteristics (reliability, construct validity, and predictive validity), whereas those close to actual games (i.e., game-based assessments and playful games) elicit more favorable applicant reactions. The only exception is the relationship with job performance, although this can be mitigated by providing explanations to the individuals being evaluated on this matter (Georgiou, 2021).
In any event, as the review itself suggests, further research is needed to fill the existing gaps for each type of GRA. Following this idea, we focus on serious games that are closer to playful games: game-based assessments.
Game-based Assessments
To the best of our knowledge, only five studies have provided empirical evidence of the use of game-based assessments (Ramos-Villagrasa et al., 2022). The GRAs used in these studies measure cognitive ability (Auer et al., 2022; Landers et al., 2021), competencies (Albadán et al., 2016; Wiernik et al., 2022), or personality (Wu et al., 2022). To summarize prior research, we are going to focus on construct validity, predictive validity, applicant reactions, and personal characteristics.
Concerning construct validity, Auer et al. (2022) investigated whether trace data modelling improves the prediction of cognitive ability in the game-based assessment Cognify and whether this approach also allows conscientiousness to be measured. Their study, performed with undergraduate students, found support only for cognitive ability. Similarly, Albadán et al. (2016) suggest the use of fuzzy logic to improve applicant scoring. Another study, performed by Wu et al. (2022), used two game-based assessments (i.e., Click Town and Word Find) to measure the Big Five, but they found that these games actually measure cognitive ability better than personality. Another relevant issue is discriminant validity: Wiernik et al. (2022) provided evidence that their game-based assessment developed for the U.S. Air Force, Virus Slayer, measures different competences relevant to cybersecurity positions. Overall, the results suggest that game-based assessments have more difficulty demonstrating construct validity than other types of GRAs, such as gamified assessments and gamefully designed assessments (e.g., Georgiou et al., 2019; Landers & Collmus, 2022). To solve this issue, Auer et al. (2022) highlighted the importance of defining the constructs that are going to be measured at the beginning of game design.
As for predictive validity, game-based assessments have demonstrated their ability to predict academic performance (Auer et al., 2022; Landers et al., 2021) and task performance (Landers et al., 2021). However, job performance is a multidimensional construct and other dimensions, namely contextual performance, counterproductive work behaviors, and adaptive performance, may be considered (Ramos-Villagrasa et al., 2019). The more dimensions of performance a GRA can predict, the greater its versatility for different organizations and job positions.
As far as applicant reactions are concerned, only one study has addressed this issue using game-based assessments: Landers et al. (2021). They compared applicants' reactions to a set of game-based assessments (i.e., Numbubbles, Resemble, Grid Lock, Proof It, Tally Up, Colour Pop, and Short Cuts) and to traditional assessment. They found better results for the games than for the traditional assessment, but the size of the improvement was small. This outcome is less optimistic than that of studies using other GRAs (e.g., Harman & Brown, 2022; Hommel et al., 2022; Landers & Collmus, 2022), which encourages further research to determine whether this is a characteristic of game-based assessments in general or of the games used by Landers et al.
Personal characteristics are another issue of interest in GRAs, because any variable that may affect GRA scores should be controlled. In that sense, Wiernik et al. (2022) found a positive effect of education on Virus Slayer scores. However, Landers et al. (2021) did not find any evidence of gender differences. Research on other types of GRAs has investigated the influence of other characteristics such as age or the use of technology and videogames. Their findings suggest that the influence of these characteristics is modest (Melchers & Basch, 2022). Nevertheless, empirical research must check if this is the case for game-based assessments.
Taking the results as a whole, we may draw some conclusions: (1) studies on game-based assessments are still scarce, so more primary research is needed to further investigate the differences between types of GRAs; (2) game-based assessments should be based on existing psychological theories and constructs to ensure construct validity; (3) the evidence on the predictive validity of game-based assessments can be extended by investigating their ability to predict other dimensions of job performance; (4) we still know little about applicant reactions to game-based assessments; and (5) the influence of certain personal characteristics like age, the use of technology, or experience with videogames should be investigated using game-based assessments. The aim of the present study is to advance research on GRAs in general and game-based assessments in particular.
The Present Study
The game-based assessment investigated in this study was Nawaiam (https://nawaiam.com), a commercial GRA used for selection, career counseling, and team management purposes. The study was conducted in collaboration with the company that owns the game, which provided access codes for Nawaiam to students and alumni from the authors' university. The company did not influence the development or results of the study.
Nawaiam is set in the near future where the polar ice caps have melted. The player must make difficult decisions that contribute to the survival of the human race (related to managing resources, helping people, etc.), and must also overcome some skill-based games that are not part of the evaluation (i.e., they are involved in the assessment, but as an additional part of the gaming experience). Sample images are shown in Figure 2. A description of the game following the taxonomy of Bedwell et al.'s (2012) game elements is shown in Table 1. Nawaiam is different from other game-based assessments because it emphasizes the environment, game fiction, and immersion.
Nawaiam is played on the applicant's mobile phone and lasts around 20 minutes. Subsequently, the applicant is described in terms of one of the game's several “behavioral profiles” (i.e., behavioral dispositions such as personality) inspired by the DISC personality model (Marston, 1928). According to Rodríguez (2004), the DISC personality model describes people according to four dimensions based on two axes: a friendly/unfriendly relationship with the environment and an active/passive behavior. Considering both axes, the dimensions are Dominance (D, unfriendly/active), which characterizes confident people focused on achieving outcomes; Influence (I, friendly/active), which describes people prone to persuading and cultivating connections with others; Steadiness (S, friendly/passive), which is associated with a disposition to cooperate and be honest with others; and Conscientiousness (C, unfriendly/passive), which emphasizes interest in quality and accuracy. Scales tend to measure DISC traits using forced-choice items. Owen et al. (2020) state that the DISC model has construct validity and reliability, and may be used in multiple industries, sectors, and cultures. Nevertheless, although the DISC test has been administered to more than 50 million people in professional practice, empirical research using this test is very limited (Gómez et al., 2021). Due to the small number of studies employing the DISC, its association with other personality models, such as the Big Five, remains unclear. Some studies connect negative emotionality with D and I; extraversion and open-mindedness with all DISC traits; agreeableness with D, S, and C; and conscientiousness with C (Gehrig & Bonnstetter, 2017; Jones & Hartley, 2013). Hence, it seems that the Big Five and DISC are different approaches that can be used in conjunction (Jones & Hartley, 2013).
In Nawaiam, the DISC scores are renamed Assertiveness (Dominance), Sociability (Influence), Tolerance (Steadiness), and Rules (Conscientiousness). Based on the scores and behavioural profile, Nawaiam generates two qualitative reports, one for the company and the other for the applicant, both with tips regarding which kinds of tasks and working conditions are better for the worker. The company also has access to a dashboard that allows matching competence profiles to specific worker characteristics (e.g., if a creative worker is needed, six competence profiles match, but if a creative and adaptable worker is needed, only two out of six profiles match).
Our study aims to advance research on game-based assessments and GRAs in general using Nawaiam to investigate: (1) its capacity to predict four different dimensions of job performance (i.e., task performance, contextual performance, counterproductive work behaviors, and adaptive performance) and its incremental value over traditional measures (i.e., 'Big Five' personality tests); (2) the reactions of applicants to the game, compared with their reactions to traditional assessments; and (3) the influence of personal characteristics on game scores and applicant reactions.
Method
Participants
Participants were students and alumni between 18 and 30 years old from a Spanish university. To obtain the study sample, the research team sent an e-mail about the study to the distribution lists of each faculty of the university. Two hundred and ninety-one persons agreed to participate. After receiving the information about the study's proposal, 254 were involved. Of them, 244 completed the first part of the study (i.e., the online questionnaire) but only 225 installed and played the GRA (92.2%). Of the 225 players, some showed an “inconsistent style” (i.e., the way they played did not allow ascribing them to a behavioral profile, see Variables and Instruments section below). They were invited to play a second time, but 38 either refused or again obtained an inconsistent style. Lastly, 5 participants were removed because they were older than 30 and our study is focused on young individuals. Thus, the final sample comprised 182 participants (74.6% of the 244 who completed the online questionnaire). Of them, 68.7% were women, 29.7% were men and the rest did not answer or chose “other gender options” (1.6%). The mean age was 21.68 years (SD = 2.72). Most participants were undergraduate students (n = 163, 89.6%) and used ICT several times a day (n = 138, 76.8%). Regarding videogames, 65 (35.7%) did not play videogames, 67 (36.8%) played monthly, and 50 (27.5%) played several times a week or more. Seventy-seven participants (36.8%) had a job, and 23 (12.6%) were actively searching for one. We decided to retain the whole sample in the analyses because Nawaiam was designed to assess people with or without job experience.
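The participant flow described above can be re-derived with simple arithmetic. The following minimal sketch uses only counts stated in this section; the variable and function names are illustrative and are not part of the study materials.

```python
# Participant flow, using only counts reported in the Participants section.
completed_questionnaire = 244  # finished the online questionnaire
played_game = 225              # installed and played the GRA
inconsistent_or_refused = 38   # no usable behavioral profile after the replay invitation
older_than_30 = 5              # outside the target age range

final_sample = played_game - inconsistent_or_refused - older_than_30


def pct(part, whole):
    """Percentage rounded to one decimal, as percentages are reported in the text."""
    return round(100 * part / whole, 1)


# Both rates are expressed relative to the questionnaire completers.
played_rate = pct(played_game, completed_questionnaire)
retention_rate = pct(final_sample, completed_questionnaire)

print(final_sample, played_rate, retention_rate)
```

Both percentages take the 244 questionnaire completers as the denominator, which is the base that reproduces the reported values.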
Variables and Instruments
Sociodemographic and Work Characteristics
We recorded gender, age, ICT use (5 levels, from weekly to several times a day), experience with videogames (3 levels, from none to several times a week), and job experience using an ad hoc survey.
Nawaiam Scores
We used the participant's scores in the dimensions that determine the behavioral profile (assertiveness, sociability, tolerance, and rules). Each dimension ranged between 0 and 100. As Nawaiam scores are estimated from the forced choices made in the game, and the scoring procedure is part of the intellectual property of the company that developed the game, we do not have the data needed to report reliability indices.
Personality
Big Five personality traits were measured using the Spanish version of the BFI-2 (Soto & John, 2017) short form. The instrument uses a five-point rating scale (1 = strongly disagree to 5 = strongly agree) to measure the following domains: negative emotionality (e.g., “[I am someone who...] is moody, has up and down mood swings”; observed α = .77); extraversion (e.g., “is outgoing, sociable”; observed α = .80); open-mindedness (e.g., “is curious about many different things”; observed α = .78); agreeableness (e.g., “is compassionate, has a soft heart”; observed α = .63); and conscientiousness (e.g., “is systematic, likes to keep things in order”; observed α = .77).
Performance
Task performance, contextual performance, and counterproductive work behaviors were measured using the Spanish version of the Individual Work Performance Questionnaire (Koopmans, 2015; adapted by Ramos-Villagrasa et al., 2019). This self-report instrument is rated on a Likert scale, ranging from 0 (seldom) to 4 (always) for task performance (e.g., “I knew how to set the right priorities,” 5 items, observed α = .83) and contextual performance (e.g., “I took on extra responsibilities,” 8 items, observed α = .83), and from 0 (never) to 4 (often) for counterproductive work behaviors (e.g., “I complained about unimportant matters at work,” 5 items, observed α = .68). Total scores are computed estimating the mean value of each dimension. Adaptive performance was measured with an 8-item scale (Marques-Quinteiro et al., 2015; adapted by Ramos-Villagrasa et al., 2020) ranging from 1 (totally ineffective) to 7 (totally effective). A sample item is “I find innovative ways to deal with unexpected events”. Observed reliability in this study was .83.
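As an illustration of the scoring and reliability figures reported for these scales, the sketch below computes a dimension score as the mean of its items and Cronbach's alpha from item-level ratings. The ratings are made up for the example and the function names are hypothetical; this is not the authors' analysis, which was run in Jamovi.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item ratings."""
    k = len(item_scores[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def scale_score(row):
    """Dimension score as the mean of its items."""
    return sum(row) / len(row)


# Made-up ratings for 5 respondents on a 3-item scale (0 to 4), for illustration only.
ratings = [
    [3, 4, 3],
    [2, 2, 3],
    [4, 4, 4],
    [1, 2, 1],
    [3, 3, 2],
]

alpha = cronbach_alpha(ratings)
scores = [scale_score(r) for r in ratings]
print(round(alpha, 3), scores)
```

Alpha rises when the items covary strongly relative to their individual variances, which is why short or heterogeneous subscales tend to show lower values.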
Applicants' Reactions
Two types of applicants' reactions were analyzed, both rated from 1 (strongly disagree) to 5 (strongly agree): perception of Comfort and perception of Suitability. Both were measured with the Employment Interview Perceptions Scale (EIPS; Alonso & Moscoso, 2018), adapted for its use with tests and GRAs as an assessment method. The Perception of Comfort subscale includes 5 items (observed α = .60 for traditional assessment and .54 for game-based assessment). Perception of Suitability has 6 items (observed α = .63 for traditional assessment and .72 for game-based assessment). Sample items are “I am satisfied with the interview” (comfort) and “The interview will allow me to be evaluated objectively” (suitability). The adapted items can be found in the Appendix.
Procedure
Participants (current and former students) were recruited with the collaboration of the faculties of the researchers' institution. An e-mail with information regarding the study and its purposes was sent, and people interested in participating had to answer the e-mail to be considered. The research team contacted each participant individually by e-mail, informing them about the research procedure, the anonymous treatment of data, and their rights according to the American Psychological Association (APA) standards. If they agreed to participate, a link to a questionnaire for assessing sociodemographic and work characteristics, personality, and applicant reactions to personality measures was sent. After they completed this assessment, they were requested to play the GRA in the following days, completing the questionnaire with applicants' reactions to the game immediately after playing. If their profile was inconclusive, they were asked to play again to receive the report. People who fulfilled all conditions participated in a raffle for one of four vouchers of €50 redeemable at the Amazon website (www.amazon.com).
Analyses
The analyses performed were descriptive statistics (mean, standard deviation), Cronbach's alpha, Spearman's correlations, hierarchical regression analyses, mean comparisons, and analysis of variance (ANOVA). Concerning regression analyses, predictors were entered hierarchically: sociodemographic variables first, then the Big Five traits, and GRA scores in Step 3. Criteria were the job performance dimensions. All analyses were performed with Jamovi 2.3.28 (https://www.jamovi.org/).
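The logic of a hierarchical regression, in which blocks of predictors are added step by step and the change in explained variance is inspected, can be sketched as follows. This is an illustration on synthetic data with one stand-in variable per block, not the study's data or its Jamovi procedure; the helper functions and variable names are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(predictors, y):
    """R squared of an OLS fit of y on the given predictor columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic data for illustration only; not the study's data.
rng = random.Random(0)
n = 182
age = [rng.gauss(0, 1) for _ in range(n)]    # stand-in sociodemographic variable
trait = [rng.gauss(0, 1) for _ in range(n)]  # stand-in Big Five trait
game = [rng.gauss(0, 1) for _ in range(n)]   # stand-in GRA score
criterion = [0.1 * a + 0.4 * t + 0.3 * g + rng.gauss(0, 1)
             for a, t, g in zip(age, trait, game)]

# Each step keeps the earlier predictors and adds one more block.
steps = [[age], [age, trait], [age, trait, game]]
r2 = [r_squared(s, criterion) for s in steps]
delta_r2 = [r2[0]] + [later - earlier for earlier, later in zip(r2, r2[1:])]

for i, (r, d) in enumerate(zip(r2, delta_r2), start=1):
    print(f"Step {i}: R2 = {r:.3f}, delta R2 = {d:.3f}")
```

Because each model nests the previous one, R squared cannot decrease from step to step; the increment at the last step is the incremental variance attributed to the block entered last.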
Results
Descriptive statistics are reported in Table 2. The variables displayed the values expected from previous literature (e.g., lower values in Counterproductive behaviors compared with task or contextual performance). Two variables exceeded the acceptable values for kurtosis: job experience (which has a value of 5.310) and Adaptive performance (whose value is 4.820). Thus, in subsequent analyses, non-parametric tests were used.
Note. N = 182; gender: 0 = men, 1 = women; CWB = counterproductive work behaviors; Traditional = assessment using test/questionnaires.
The associations between variables can be seen in Table 3. Being a woman was associated with less videogame use (r = -.39, p < .001). It is remarkable that ICT use was related to one of the reactions to the traditional assessment (i.e., Comfort, r = .17, p = .020) but not to any reaction to the GRA. Regarding the GRA scores, only Rules is associated with other dimensions of Nawaiam, specifically with Assertiveness (r = -.20, p = .006) and Tolerance (r = .22, p = .003). Regarding the relationship between Nawaiam and the Big Five, it is noteworthy that only Sociability and Extraversion displayed a significant association (r = .37, p < .001). Focusing on the relationship with the criteria, Task performance and Contextual performance are associated with Sociability (r = .17, p = .029 and r = .19, p = .013, respectively), and Adaptive performance with Assertiveness (r = .21, p = .004), Sociability (r = .16, p = .028), and Tolerance (r = -.17, p = .023). There is no relationship between counterproductive work behaviors and Nawaiam scores. On the other hand, Big Five personality traits are related to most dimensions of job performance (mean |r| = .29, ranging from -.39 to .54), with the exception of Open-mindedness with Task performance (r = .04, p = .585) and Counterproductive work behaviors (r = -.06, p = .467), and Agreeableness with Adaptive performance (r = .12, p = .104). To conclude with the correlations, it is worth highlighting the association between scores in Comfort and Suitability with the two assessment methods analyzed (Comfort: r = .45, p < .001; Suitability: r = .50, p < .001).
Note. N = 182. Bold values are significant associations (up to |.15| = .05; between |.16| and |.23| = .01; higher values = .001); gender: 0 = men; 1 = women; CWB = counterproductive work behaviors.
Concerning regression analyses, results can be seen in Table 4 and Table 5. Table 4 includes the predictive models of performance considering sociodemographic variables (step 1) and game scores (step 2). According to the predictive models, Nawaiam is involved in Contextual performance (6.6% of explained variance, with Sociability, β = .175, p = .026, as predictor), and Adaptive performance (11.9% of explained variance, with three Nawaiam traits as predictors: Assertiveness, β = .244, p = .001; Sociability, β = .171, p = .023; and Tolerance, β = -.154, p = .044).
Note. N = 182. Gender: 0 = men, 1 = women; bold values correspond to statistically significant predictors.
Note. N = 182. Gender: 0 = men, 1 = women. Bold values correspond to statistically significant predictors.
Continuing with regression analyses, Table 5 includes the predictive models considering Big Five personality traits as an intermediate step between sociodemographic variables and game scores. As we can see in Table 5, all models showed the Big Five personality traits to be significant predictors: Task performance (35.6% of explained variance) was determined by Conscientiousness (β = .408, p < .001) and Negative emotionality (β = -.266, p < .001); Contextual performance (38.1% of explained variance) was determined by Extraversion (β = .313, p < .001), Open-mindedness (β = .252, p < .001), and Conscientiousness (β = .250, p < .001); Counterproductive work behaviors (14.1%) were explained by Agreeableness (β = -.216, p = .008) and Negative emotionality (β = .201, p = .029); and Adaptive performance (12.5% of explained variance) was predicted by Negative emotionality (β = -.178, p = .047). With the incorporation of GRA scores into the models, only one shows incremental variance: Adaptive performance, in which the explained variance increases up to 23.2% (ΔR² = 10.7%) and has Negative emotionality (β = -.265, p = .003), Assertiveness (β = .255, p < .001) and Tolerance (β = -.213, p = .005) as predictors.
Continuing with the analyses, a comparison of applicants' reactions to traditional assessment and GRA was performed. As can be seen in Table 6, there were significant differences in favor of GRA in Comfort (t = -4.16, p < .001) and Suitability (t = -2.24, p = .026).
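A within-person comparison of this kind can be illustrated with a paired-samples t statistic. The sketch below is a minimal example on made-up ratings and assumes the difference is taken as traditional minus game-based, which would make a negative t favor the game; the text does not state the order of subtraction, so this is an assumption, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Paired-samples t statistic and degrees of freedom for (first - second)."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Made-up comfort ratings (1 to 5) from the same six people, for illustration only.
traditional = [3.0, 3.2, 2.8, 3.6, 3.0, 3.4]
game_based = [3.4, 3.2, 3.4, 3.8, 3.6, 3.6]

t, df = paired_t(traditional, game_based)
print(round(t, 2), df)
```

Because each person rates both methods, the test works on the per-person differences rather than on the two sets of ratings separately.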
Finally, we used correlations, mean comparisons, and ANOVA to determine possible differences in the game-based assessment scores or applicants' reactions related to personal characteristics: gender, age, studies (undergraduate/graduate), ICT use (up to once a day/more than once a day), videogame use (no/weekly/daily), and working status (not working/working). The analyses yielded significant differences only for: (1) working status, as the Assertiveness score of workers was higher than that of non-workers (t = 2.663, p = .008, Cohen's d = -0.387, 95% CI [-0.6880, -0.0854]); and (2) ICT use, as comfort scores regarding traditional assessment were higher for people who use ICT more than once a day (t = -2.307, p = .022, Cohen's d = -0.360, 95% CI [-0.6610, -0.0559]).
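For readers less familiar with the effect size reported here, the following sketch computes Cohen's d for two independent groups using the pooled standard deviation. The scores are made up, the function name is hypothetical, and the pooled-variance formula is an assumption about how d was obtained; the sign of d depends only on which group is subtracted from which.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, based on the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up Assertiveness scores (0 to 100), for illustration only.
non_workers = [50, 55, 60, 45, 50]
workers = [58, 62, 55, 65, 60]

d = cohens_d(non_workers, workers)
print(round(d, 2))
```

Swapping the two arguments flips the sign of d without changing its magnitude, so a negative d alongside a higher mean for workers is what results when the non-worker group is entered first.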
Discussion
Personnel selection is still searching for new assessment methods. In recent years, technology-based methods have become an opportunity. Nevertheless, any new method should maintain or improve the psychometric standards of existing methods (Landers et al., 2021; Salgado et al., 2017; Wiernik et al., 2022). GRAs have the potential to do so, but empirical research should prove their value. With the present study, we extend the literature on game-based assessments, a type of GRA that requires more empirical research with respect to its predictive validity, applicant reactions, and the influence of personal characteristics.
Regarding predictive validity, we investigated the relationship between Nawaiam scores and performance. When the GRA Nawaiam is the only selection method in predictive models, it can determine contextual performance and adaptive performance. When the Big Five are also considered, some interesting findings should be outlined, supporting the idea that the DISC model and the Big Five should be seen as complementary (Jones & Hartley, 2013). First, Nawaiam is not involved in the prediction of contextual performance if personality traits are included. Besides that, the explained variance was greater when we used only the Big Five (38.1%) than when we used Nawaiam (6.6%). Accordingly, we recommend relying solely on personality traits when dealing with contextual performance. With respect to adaptive performance, Nawaiam demonstrated incremental validity over the Big Five in predicting adaptive performance, with an approximately 10% increase in the explained variance. This is the first time that any type of GRA provides evidence of predicting this type of performance, which is increasingly in demand in the BANI (brittle, anxious, non-linear, and incomprehensible) work setting. Although the dimensions measured with Nawaiam are included in some of the other predictive models, there are no significant results that can be generalized to the population. This may be because Nawaiam is based on the DISC model, whose theoretical and psychometric bases are, to say the least, weak, and previous evidence suggests that the greater the support for the theoretical model behind GRA, the better its results (Ramos-Villagrasa et al., 2022).
Findings about applicants' reactions follow previous literature, obtaining better results for game-based assessments than for traditional assessments. As far as we know, this is the second study to use game-based assessments and the first to measure personality traits. In line with Landers et al. (2021), the more the GRAs look like a game instead of an evaluation, the better it is. It is noteworthy that a positive effect has been found regarding suitability, as similar research with other GRAs has yielded more favorable results for traditional assessments (Georgiou, 2020). Future research will help to determine whether this is a characteristic of Nawaiam or game-based assessments in general. These results support the idea that using this type of assessment sends signals to candidates that the organization is innovative, which can be a competitive advantage (McChesney et al., 2022). However, given the small effect size found in the present study and the previous study by Landers et al. (2021), it is necessary to consider the contexts in which the use of a game-based assessment can provide an advantage over traditional tests. An example could be in positions where there are very few candidates, such as in the IT sector (Aguado et al., 2019).
Regarding the influence of sociodemographic and work characteristics, we found only a few differences. As for sociodemographic characteristics, our data support the stereotype that women tend to be less involved in videogames; however, this does not have an impact on Nawaiam outcomes or applicant reactions. We found a small effect on applicant reactions, suggesting that people familiar with technology felt more comfortable with the traditional assessments. This may be due to the structure of a traditional assessment applied online, which resembles polls, quizzes, and other common internet activities. No differences were found in terms of age or videogame experience. Thus, our results are in line with previous research claiming that personal characteristics do not substantially impact GRA (Hommel et al., 2022; Melchers & Basch, 2022; Sanchez et al., 2022). Concerning work characteristics, workers tended to score higher than non-workers on assertiveness. This result, which has a small effect size, may reflect the effect of work experience on game decision-making, because workers have had more opportunities to experience circumstances relatively similar to those in Nawaiam. Future research may help to clarify this aspect.
Overall, the present study updates the study of game-based assessment in three ways: (1) it has demonstrated that game-based assessments are capable of predicting adaptive performance; (2) it offers additional evidence regarding positive applicant reactions to game-based assessments, while also noting that the improvement over traditional assessments is marginal; and (3) it highlights that only minor differences exist concerning personal characteristics, which may not have a substantial impact in a real-world context.
Our research also contributes to the study of GRAs in general. The present study is another piece of evidence showing that caution should be taken when using GRAs for personnel selection. As with any assessment method, the specific GRA should demonstrate reliability, construct validity, and predictive validity, and be free from bias before its application, regardless of how positive the applicants' reactions are. We must continue researching existing GRAs to draw more conclusions about what works and what does not work in these assessments, and to learn how to build them with improved psychometric characteristics while maintaining what appear to be their greatest strengths (positive reactions and the lack of influence of personal characteristics). The extent to which researchers and professionals can achieve these goals will determine whether the use of GRAs is consolidated as a selection method.
Limitations and Further Research
As with any study, the present research has limitations that should be acknowledged. The sample size was small, which limits the generalizability of the results. This is a common limitation in research on GRAs because research on this selection method is still in its early stages (Ramos-Villagrasa et al., 2022). We believe that, despite the small sample size, our research is relevant as it delves into one of the less investigated types of GRAs and involves a serious game used in a real setting. Along the same lines, the sample was composed only of young university students and graduates and not actual applicants. Further research could investigate how this game-based assessment performs in the personnel selection context.
Another limitation is the use of self-report measures of job performance, because such measures tend to yield more favorable results than other evaluations. However, there are certain scenarios in which self-reports may be useful, such as when it is difficult to collect other data or when the constructs and phenomena under study are still in their infancy (Koopmans et al., 2014). This is the case because some participants are still in the early stages of their careers, and the study of GRAs is just beginning. Further research should extend beyond student samples and self-reports and increase the number of studies involving workers and job applicants with different performance measures.
Considering the aforementioned issues, we must acknowledge threats to ecological validity as a limitation. There are considerable differences between the context of the evaluation performed in the present study and the actual personnel selection setting. Further research should take this into account and, if it cannot be conducted in actual selection processes, at least use a simulated context to increase fidelity to reality.
Concerning ideas for future research, more primary studies with different types of GRAs are required (Ohlms et al., 2023). GRA is an umbrella term for heterogeneous assessments, so we need to identify which specific game characteristics achieve better reactions and fewer biases while maintaining validity. Meta-analytical studies on GRAs can be conducted when sufficient evidence is available. Other issues that deserve investigation are faking in GRAs, the influence of the assessment situation (proctored/unproctored), given that GRAs can be applied remotely, and cross-cultural comparisons using the same GRAs. All of these will increase our knowledge of when and how to use games in personnel selection processes.