Introduction

According to theories of embodied and grounded cognition, language learning is rooted in bodily experiences that we collect while interacting with the world around us (Barsalou, 1999, 2008, 2020; Borghi, 2004; Jeannerod, 2006). One of the first steps in children’s language acquisition is the naming of objects that can be reached, touched, dropped, tasted, and smelled. When acquiring novel words, children do not memorize the sequence of phonemes provided by their caregivers per se, but rather connect the sequence with their personal sensorimotor experiences (Glenberg & Gallese, 2012; Willems & Hagoort, 2007). These variable sensorimotor experiences may lead to differing representations of words in the brain. For example, children living in the Alps might represent the word shark more in visual regions of the cortex, whereas the word trout could be grounded in regions underlying haptic interaction (Kiefer & Pulvermüller, 2012; Pulvermüller, 1999, 2005), taste (Barrós-Loscertales et al., 2012), or smell (González et al., 2006). In the current study, we assumed that adults might also benefit from sensorimotor experiences connected with novel words in a foreign language (L2). Furthermore, we asked whether grasping virtual objects might benefit learners and, if so, to what extent.

Neuroscientific and behavioral research on native language (L1) processing converges in demonstrating that how we understand words is closely intertwined with sensorimotor experience, and therefore refutes the idea that language is a system of abstract symbols (Fodor, 1979). Neuroscientific studies have found that the reading of motor-related words readily elicits motoric brain responses, even in the absence of overt movement (Buccino et al., 2009; Chao & Martin, 2000; Grafton et al., 1997), and these responses show a high fidelity to previous sensorimotor experiences, occurring at a level of specificity of individual limb movements (Hauk et al., 2004). Words referring to tools elicit enhanced motor activity in the brain relative to words that refer to less manipulable objects (Just et al., 2010; Rueschemeyer et al., 2010). Several behavioural lines of work have also examined the close relationship between word semantics and action. Affordance compatibility studies focus on how the perception of words and objects can influence actions executed by the hands and feet (Gibson, 1977, 1979). Response times to words in behavioural experiments are typically faster if the effector (hand or foot) is compatible with a word’s semantics or even with its location (Ambrosecchia et al., 2015; Marino et al., 2013). These findings suggest that motor-related words can prime associated movements. Thus, words and the objects to which they refer may share cognitive representations and access the same motor codes (Gough et al., 2012).

In a study on novel word learning (Gordon et al., 2019), participants grasped virtual objects with either their left or their right hand, and learned their names. The participants then completed a word-colour match task. Response times were shorter for words whose response hand was the same as the hand used to grasp the object earlier in the study. In a word learning study conducted by Madan and Singhal (2012), participants were asked to judge features of words such as their length and function during encoding. Concrete words that ranked highly in terms of motor manipulability such as camera were better memorized than words with reduced motor-related associations such as the word table. Taken together, these results refute the notion that words are merely abstract symbols disconnected from our everyday experience, as Fodor (1979) once suggested. Instead, words are represented through experience-related sensorimotor brain networks (Pulvermüller, 2005, 2018).

In theories of human evolution, actions are often described as the basis for language (Fischer, 2012; Rizzolatti & Arbib, 1998). Arbib (2008) proposes that movements that occurred while interacting with objects became more and more abstract and symbolic across stages of evolution and progressively evolved into gestures. Gestures, together with first vocalizations, may have formed the basis of a protolanguage. The shift from movement-based communication towards the use of spoken language could potentially have been driven by the increasing necessity to communicate abstract meanings that gestures could not sufficiently represent (Corballis, 2009a, b). Hence, from an evolutionary perspective, actions such as grasping and manipulating objects represent the underpinnings of how concepts are comprehended and words are acquired. Other theories suggest that language may have evolved not only as a means for immediate communication, but also as a tool for triggering memory (Allen & Saidel, 1998; Hockett, 1963; Paivio, 2007). In other words, language and sensorimotor experience belong together.

Grounding Foreign Language Learning in Sensorimotor Experiences Can Benefit Learning Outcomes

Compared to learning vocabulary in one’s native language, learning words in an L2 is often met with little success. Although a number of methods designed to facilitate L2 learning have been developed (Hald et al., 2016), in practice, students generally engage in listening and comprehension activities, as well as in the repetition of bilingual word lists until the meanings of L2 words have been memorized (Rasouli & Jafari, 2016). Visual-only and audiovisual strategies (e.g., reading) have been shown to be less than optimal: L2 words that are learned visually or audiovisually decay fast from memory (Barcroft, 2009; Yamamoto, 2014). One reason why the reading of written word lists may be such a popular learning strategy is that L2 instructional practice has traditionally closely followed principles of generative linguistics (Marino & Gervain, 2019). In generative linguistics, language is described as an amodal and symbolic phenomenon of the mind (Fodor, 1979). This view subscribes to the Cartesian dichotomy between body and mind, and does not consider how cognitive processes are intertwined with concurrent perceptual processing, body movements, and the physical and social environment (Barsalou, 2020). Recent work suggests that, to the contrary, integrating the body, and gestures in particular, into the learning experience can alter how vocabulary is remembered and represented relative to passive audiovisual-only learning (Andrä et al., 2020; Macedonia, 2019; Schmidt et al., 2019). The positive impact of performing congruent gestures and actions during learning on subsequent memory performance has been referred to as the enactment effect, the subject-performed task effect, the production effect, and the sensorimotor enrichment benefit (Cohen, 1981; Macedonia et al., 2011; Mathias et al., 2023; Mayer et al., 2015; Repetto et al., 2021).

Empirical research on the use of gestures in L2 learning began with Quinn-Allen’s (1995) seminal work showing that emblematic gestures, i.e., culture specific gestures that convey verbal meaning such as pointing the thumbs upward, support phrase learning. Since then, a number of studies have confirmed that performing gestures while learning L2 vocabulary facilitates the memorization of the words compared to learning the words audiovisually (for reviews see Macedonia, 2014; Macedonia & von Kriegstein, 2012). Neuroscientific studies attribute the enhanced memorization of gesture-enriched words to the creation of sensorimotor brain networks associated with novel phoneme sequences. In brain imaging experiments, these networks resonate upon stimulation in a similar manner as in L1 (Macedonia et al., 2011; Mayer et al., 2015, 2017). More generally, if individuals learn words with sensorimotor input and subsequently encounter the words, sensorimotor brain regions engaged during learning also respond during the subsequent encounter, even in the absence of sensorimotor input (Fischer & Zwaan, 2008; Pulvermüller et al., 2005; Tettamanti et al., 2005). Evidence from neurostimulation studies suggests that sensory and motor areas of the cortex are causally engaged in the processing of words that have previously been learned with sensorimotor input such as gestures. In two studies, participants learned L2 words that were accompanied by either gestures or pictures. After learning, transcranial magnetic stimulation was applied to motor-related areas of the cortex, which selectively disrupted the translation of L2 words learned with gestures (Mathias et al., 2021a; Mathias et al., 2021b). This implies that the motor cortices can critically support newly-acquired word representations (Repetto et al., 2013).

Several theories aim to account for the benefits of integrating sensorimotor experiences into learning (for review see Mathias & von Kriegstein, 2023). Paivio’s Dual Coding Theory (DCT), for example, describes a linguistic item as consisting of a verbal and a nonverbal code, and proposes that this dual code enhances memory for that item (Paivio & Csapo, 1969, 1973; Paivio, 1990, 2007; Sadoski & Paivio, 2012). Paivio and Desrochers (1980) apply the DCT to L2 learning specifically (see also Sadoski, 2012; Spada, 1997). Interestingly, recent theoretical approaches to embodied cognition also emphasize that the grounding of language learning can extend beyond overt physical movements. The grounding, i.e., the multiple components of information enriching a word, can also include the physical and social environment. This novel concept is referred to as 4E cognition, that is, cognition that is embodied, embedded, enactive, and extended (Barsalou, 2020; Jusslin et al., 2022).

Language Aptitude and Vocabulary Learning

Language aptitude (LA) has been described as the capacity for an individual to achieve a higher level of ultimate attainment in a language relative to other individuals within the same time period (Carroll, 1981; Robinson, 2012). Factors such as the duration, age of onset, teaching style, or learning style of L2 learning cannot alone explain variability in language learning outcomes. In fact, differential outcomes can even be found in rather homogeneous learner groups who receive the same instruction. This observation has been taken as evidence for individual differences between language learners. Several psychological models describe the phenomenon of LA. According to Carroll (1962), LA comprises phonetic coding ability, i.e. the capacity to perceive, associate and retain sounds, inductive learning ability (capacity to induce language structure rules), grammatical sensitivity (ability to infer grammatical functions), and associative memory. Skehan (2002) proposed as capacities for LA phonetic coding (input processing), grammatical analytic ability, and memory retrieval, emphasizing the influence of working memory on each of these components. Another influential model is the Aptitude Complex Hypothesis, developed by Robinson (2001). There, the author proposes that the primary abilities underlying language learning are working memory, pattern recognition, grammatical sensitivity, and speed of processing in phonological working memory. Secondary abilities are assumed to support language learning and include memory for contingent speech, deep semantic processing, and metalinguistic rule rehearsal (for a theoretical overview, see Ameringer et al., 2018; Turker et al., 2019).

Considered on its own, the learning of new vocabulary arguably relies more on memory than on other cognitive abilities. In fact, a recent study on the intentional learning of L2 Welsh vocabulary shows that short-term and working memory play a larger role in word learning than in auditory and phonological tasks (Bisson et al., 2021). Interestingly, language instructors have traditionally not taken individual differences in memory capacity into account. This is probably due to the influence of the theory of Universal Grammar (Chomsky, 1957, 1975), in which memory was not described as a core component underlying linguistic abilities. The view that memory is a critical skill needed for the acquisition of language did not arise from linguistics. Rather, this notion came from memory research itself. Baddeley (2003) described working memory as the basis of language acquisition, as language cannot be learned without the capacity to memorize phonemes, morphemes, grammatical structure, and vocabulary. Evidence that memory is fundamental to language learning comes from studies investigating the relationship between learners’ ages and their L2 learning success. The older learners are at the time of learning, the more limited are their memory capacities and their corresponding L2 learning outcomes (Hertzog et al., 2020; Whiting et al., 2011; Palmer & Havelka, 2010; Laumann Long Lisa, 2000).

Procedural and Declarative Memory for Language Learning

Traditionally, (audiovisual) word learning has been associated with capacities residing in declarative memory (Tulving & Madigan, 1970; Ullman, 2004; Brem et al., 2013). Declarative memory is typically engaged while reading, listening, or watching information (Eichenbaum, 2004). At a neural level, declarative memory is associated with hippocampal structures in the medial temporal lobe, which have been linked to word memorization (Cabeza & Moscovitch, 2013). Declarative memory capacities also correspond to increased gyrification, higher grey matter density, and greater cortical thickness in language areas, the hippocampus, and the angular gyrus, a region mediating multimodal integration (Kumar et al., 2021). Declarative memory for L2 words has also been linked to increased responses within the angular gyrus and extra-striatal cortices (Macedonia et al., 2010; Macedonia & Mueller, 2016).

There is evidence that sensorimotor-enriched L2 words rely on brain areas related not only to declarative memory such as those in the anterior temporal lobe, but also procedural memory located in the premotor and motor cortices, basal ganglia, and cerebellum, due to movement-related input that occurs during learning (Macedonia & Mueller, 2016; Pulvermüller, 2005). Procedural memory refers to memories that are encoded and retrieved in an implicit manner (Schacter, 1987). Individual differences in declarative and procedural memory circuits may underlie individual differences in storing information. For example, high-achieving verbal learners may rely primarily on declarative memory as opposed to procedural memory (Ullman & Lovelett, 2016). Nevertheless, previous work on the benefits of integrating complementary movements such as gestures into learning has not typically examined the relationship between individual differences in learning abilities and the magnitude of such benefits (Mathias & von Kriegstein, 2023).

The interplay between declarative and procedural memory in L2 learning may also depend on an individual’s LA. High language aptitude (HLA) learners might find success in traditional methods that heavily rely on declarative memory, while low language aptitude (LLA) learners could benefit from more sensorimotor-enriched methods that engage the procedural memory system (Macedonia et al., 2010). HLA learners often possess strong declarative memory skills, which are well-suited for traditional word-learning methods (Ullman, 2004). These learners are also adept at handling higher cognitive loads. Note that, according to Sweller (1988), cognitive load is essentially a measure of the working memory resources that a task requires. On the other hand, individuals with LLA typically face challenges in traditional L2 learning settings due to weaker declarative memory abilities. Macedonia and Mueller (2016) have shown that procedural memory systems are engaged by sensorimotor-enriched vocabulary learning tasks. It is conceivable that LLA learners might benefit from such strategies, as indicated in a study with LLA learners (Macedonia et al., 2010).

Vocabulary Learning in Virtual Reality (VR) Environments

Virtual reality (VR) technology in L2 instruction presents opportunities for innovation, including sensorimotor-enriched learning. Unlike traditional classroom or online methods, VR can provide an immersive experience with multiple sensory inputs (Macedonia et al., 2014). Additionally, VR can allow learners the flexibility for autonomous and self-directed study (Lindgren & Johnson-Glenberg, 2013; Repetto et al., 2016). Altogether, VR has been shown to be a useful tool to enact linguistic knowledge, as described by Tuena and colleagues (2019). The utility of VR for learning seems to hinge on the extent to which individuals feel “present” in the VR-mediated environment (Johnson-Glenberg et al., 2021; Mikropoulos & Natsis, 2011). Several language learning VR studies have used avatars whose body movements are controlled by participants as a way to promote learning engagement (Chen, 2016; Ibáñez et al., 2011; Wang et al., 2017). Legault and colleagues (2019) showed that the manipulation of objects in immersive VR environments aided vocabulary learning relative to non-VR traditional word-word associative learning. In another study, Fuhrman et al. (2020) also found that manipulating objects in VR improved vocabulary memory relative to the enactment of irrelevant movements. One limitation of these studies is that object interaction was generally conducted through button-press actions on a controller, rather than through genuine self-enactment. It is possible that sensorimotor interventions that involve VR could be more beneficial if participants are able to self-enact virtual movements. This idea aligns with an argument put forth by Johnson-Glenberg et al. (2014), who contend that the degree to which VR environments engage the motor system is a critical factor in their educational efficacy.

Aims and Hypotheses

The aim of the current study was to test whether the integration of grasping movements into language learning—a natural learning strategy present in first language acquisition—benefits adult L2 learning outcomes in LLA individuals compared to audiovisual learning. The L1 acquisition mechanism of grasping was simulated in a virtual reality (VR) cave. Adult participants were trained on L2 words in three conditions: In an audiovisual condition (AV), subjects viewed and heard L2 words; in an audiovisual and observation (AVO) condition participants viewed and heard L2 words and saw referent objects; finally, in the audiovisual, observation, and grasping (AVOG) condition, participants grasped virtual objects representing the words to memorize.

We postulated four hypotheses. First, we expected that integrating grasping movements into the learning process would, on average, benefit learning outcomes in all learners relative to audiovisual-only (AV and AVO) learning. Second, we predicted that LLA learners would benefit more from the integration of grasping movements than HLA learners. This hypothesis was based on the expectation that, if procedural learning is incorporated into the learning process, this would potentially support L2 vocabulary memorization by reducing LLA learners’ cognitive load during learning (Ullman & Lovelett, 2016). Third, we tested whether language aptitude was positively associated with vocabulary learning outcomes, and fourth, whether age predicted language aptitude based on prior research showing differences in aptitude between younger and older learners (Gómez, 2017).

Methods

Participants

Forty-six participants with German as an L1 took part in the experiment (M age = 36.6 years, SD = 15.6 years, range: 19–68 years, 27 females and 19 males). A mixed effects modeling power analysis (Judd et al., 2017) based on effects of sensorimotor enrichment on L2 vocabulary learning observed by Repetto et al. (2017), an alpha level of 0.05, and power level of 0.8, suggested a minimum sample size of N = 34 total participants. Participants were recruited from a Linz University database, as well as through advertisements placed at the University and at the Ars Electronica Center (AEC, www.aec.at) located in Linz, Austria. All of the participants indicated that they had knowledge of at least two late-learned foreign languages (starting after the age of 12). Self-rated L2 proficiencies ranged from low to high. No early bilinguals, individuals who regularly learned two languages before the age of 12 (Houwer, 2012), were included in the study. None of the participants reported any vision or hearing impairments, or history of neurological or psychiatric disorders. All participants provided written informed consent prior to testing. Participants received an AEC entry voucher worth €10 for their participation. The study was approved by the Ethics Committee at the University of Linz.

Materials

L1 and L2 vocabulary. The stimulus material comprised 18 words from the Vimmi language corpus, a corpus of artificial vocabulary created for studies on L2 learning in order to avoid associations with participants’ native or foreign languages (Macedonia et al., 2011). Half of the words had two syllables, and the other half had three syllables. The 18 Vimmi words were associated with 18 German language translations, whose number of syllables, overall word length in letters, and frequency of use in written German (https://wortschatz.uni-leipzig.de/de) were equally distributed across experimental conditions (syllable number: M = 2.4, SD = 0.9; word length: M = 7.4, SD = 2.6; frequency: M = 12.2, SD = 2.7). The German translations were all concrete nouns referring to manipulable, graspable objects. The initial and final phonemes of the Vimmi words and their German translations always differed (see Table 1).

Table 1 German and Vimmi words used in the experiment, and their English translations

Audio recordings. Vimmi and German words were recorded using a Rode NT55 microphone (Rode Microphones) in a sound-dampened chamber. An Italian native speaker recorded the Vimmi words with an Italian accent to highlight the L2 aspect of the stimuli for German-speaking participants. Vimmi audio stimuli ranged from 654 to 850 ms in length (M = 819.7 ms, SD = 47.3 ms). For more details on the audio files used in the current study, see Mayer et al. (2015).

Picture stimuli. Eighteen object pictures (Fig. 1a) corresponding to the meanings of the German words were used in the experiment. The object pictures were presented dynamically such that they “fell” from the ceiling of the VR cave (the so-called Deep Space) at the AEC into an underwater coral reef scene (Fig. 1b). The reef’s underwater environment offered a plausible context for objects to appear and disappear from the same position, as if being dropped from a boat overhead. All pictures were black and white in order to exclude the influence of colour on word memorization (Hoffmann & Engelkamp, 2017). The VR cave offers two projection spaces of 16 × 9 m each, one on the wall and another on the floor, with an ultra-high resolution of 8K for stereoscopic 3D visualizations. This corresponds to a resolution of 8,192 × 4,320 pixels on each of the two projection areas, totalling more than 70 million pixels. This ultra-high definition resolution is achieved by eight Christie Boxer 4k30 Mirage 120 Hz projectors, combined with two high performance computing workstations equivalent in processing power to 400 ordinary office computers. A 5.1 Surround Sound system with Kling & Freitag speakers is used to deliver audio. Due to these unique properties of the Deep Space cave, visitors can be completely immersed in cinematic, photographic, or virtual scenes. In order to experience these scenes, 3D glasses must be worn inside the Deep Space. For our experiment, a VR learning program was developed with Unity 5.4 software (Unity Technologies, San Francisco, USA) by programmers from Johannes Kepler University Linz, Ars Electronica Solutions (www.aec.at/solutions) and Ars Electronica Future Lab (www.aec.at/futurelab). Devised as an app, the program was started by the experimenter directly from the Deep Space computer system by selecting the app from the computer screen and by starting the program with an XBOX 360 wireless controller (Microsoft Corporation, Redmond, USA).

Fig. 1

Visual stimuli used in the experiment. (a) Virtual object plunging into underwater coral reef. (b) Coral reef virtual reality (VR) scene in which the virtual objects appeared

Design

The study utilized a 2 × 3 mixed design with the between-participant factor aptitude (low, high) and the within-participant factor learning condition (audiovisual, AV; audiovisual observation, AVO; and audiovisual observation and grasping, AVOG). The order in which the three training conditions were completed was counterbalanced across participants. The assignment of the Vimmi and German words to the learning conditions was counterbalanced across groups of participants, such that each Vimmi and German word was equally represented among the three conditions across participants.
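The text does not specify the exact counterbalancing scheme. As an illustration only, the following Python sketch assumes that the 18 words are divided into three sets of six, that the sets are rotated across the three learning conditions in a 3 × 3 Latin square, and that condition order is drawn from the six possible permutations. The function names and the placeholder item labels are hypothetical and are not taken from the study materials.

```python
from itertools import permutations

CONDITIONS = ["AV", "AVO", "AVOG"]


def condition_orders():
    """Return all 6 possible orders of the three learning conditions."""
    return [list(order) for order in permutations(CONDITIONS)]


def word_set_assignment(word_sets, group_index):
    """Assign three word sets to the three conditions for one participant group.

    Assumed scheme: a 3 x 3 Latin square, so that across three groups every
    word set is learned once in every condition.
    """
    n = len(CONDITIONS)
    return {CONDITIONS[i]: word_sets[(i + group_index) % n] for i in range(n)}


# Placeholder item labels (hypothetical; not the actual Vimmi words)
words = [f"word{i:02d}" for i in range(1, 19)]
word_sets = [words[0:6], words[6:12], words[12:18]]

for group in range(3):
    assignment = word_set_assignment(word_sets, group)
    print(group, {cond: items[0] + "..." for cond, items in assignment.items()})
```

Under these assumptions, every word appears equally often in each condition once the number of participant groups is a multiple of three.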

Procedure

Vocabulary training phase. During each training trial, a written L2 (Vimmi) word and its translation into L1 (German) were projected in large yellow font on the Deep Space walls in the center of the coral reef scene for a total of 5 s. In the audiovisual (AV) condition, an audio recording of the spoken Vimmi word was presented 1 s after the written words appeared. After an inter-trial interval (ITI) of 4 s during which an empty coral reef was shown on the screen, the next trial began. In the audiovisual and observation (AVO) condition, the written L1 and L2 and spoken L2 words were accompanied by a virtual picture of the object to which the words referred. The object was presented dynamically such that it “fell” from the ceiling of the Deep Space into the water shown in the coral reef scene. The object took 1 s to fall and land on the coral reef ground, where it remained motionless for 9 s and faded away prior to the 4-s ITI. In the AVO condition, participants were instructed to simply observe the objects and not interact with them. The AVOG condition was identical to the AVO condition, except that participants were instructed to grasp the virtual objects immediately after they had reached the ground and to remain grasping them until they faded away. To minimize fatigue, participants remained seated during the AV condition and stood during the AVO and AVOG conditions, in line with previous studies using similar training paradigms (e.g., Mayer et al., 2015).

Training trials were blocked by learning condition such that there were 3 total blocks. There were 72 trials in each block (6 L1-L2 word pairs × 12 repetitions). Trials were pseudo-randomly ordered within each block such that the same Vimmi word was never presented twice in a row. The AV condition lasted 10 min and the AVO and AVOG conditions each lasted 35 min due to the time that the object remained on the Deep Space walls. In a pilot experiment, the AV trials were made the same length as the AVO and AVOG trials. However, participants reported not being able to pay attention for 10-s trials during which no stimuli other than written and spoken words were presented. To facilitate attention, the long gap in stimulus presentation in the AV condition relative to the AVO and AVOG conditions was reduced. Three-min breaks occurred between each of the learning blocks during which participants were provided with water and snacks. A 5-min break followed the final training block. In total, the training lasted 80 min.
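The constraint that the same word never occurs twice in a row can be met in more than one way, and the procedure used in the experiment is not described in further detail. The Python sketch below shows one possible approach under a stricter, assumed constraint: each of the 12 repetitions is a freshly shuffled cycle of the six word pairs, and a cycle is reshuffled if it would start with the item that ended the previous cycle. The function name and item labels are hypothetical.

```python
import random


def pseudo_random_block(items, repetitions, rng):
    """Build a trial list in which no item is presented twice in a row.

    Assumed scheme: every repetition is a shuffled cycle of all items;
    a cycle is reshuffled if its first item equals the last item of the
    trial list built so far.
    """
    if len(set(items)) < 2 and repetitions > 1:
        raise ValueError("need at least two distinct items to avoid repeats")
    trials = []
    for _ in range(repetitions):
        cycle = list(items)
        rng.shuffle(cycle)
        while trials and cycle[0] == trials[-1]:
            rng.shuffle(cycle)
        trials.extend(cycle)
    return trials


rng = random.Random(0)  # fixed seed so the example is reproducible
pairs = [f"pair{i}" for i in range(1, 7)]  # 6 L1-L2 word pairs per block
order = pseudo_random_block(pairs, 12, rng)  # 6 pairs x 12 repetitions
print(len(order))
```

This cycle-based construction spaces repetitions more evenly than the stated constraint requires; a fully random order with only the no-immediate-repeat rule would be an equally valid reading of the text.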

The Deep Space walls are sufficiently large to allow for the training of up to six participants simultaneously with six different object projections. We therefore trained groups of six participants simultaneously. The six participants’ positions within the Deep Space were counterbalanced across learning conditions such that participants faced the front, left, or right of the Deep Space walls in different learning conditions.

Vocabulary test phase. After the training phase, participants’ memory for the vocabulary was tested in a separate room. Five vocabulary tests were administered by computer. First, in an L2 free recall test, participants were instructed to type all L2 words that they could retrieve from the training. Second, participants completed an L1 free recall test. Third, in a paired free recall test, participants were instructed to write down all word pairs (rather than individual L1 or L2 words). In a fourth test, a cued L2 recall test, participants were presented with all 18 L1 words, which they were asked to translate into L2 by writing down the correct translation. Finally, in a cued L1 recall test, L2 items were translated into L1. The three free recall tests always occurred before the translation tests, to avoid priming participants’ memory for the L1 and L2 words prior to completing the free recall tests. The order of L1 and L2 free recall tests was randomized across participants, and the order of the L1 and L2 cued recall tests was also randomized across participants. Participants were given a total of 5 min to complete each test. No participants exceeded the 5-min time limit for any of the tests.

Language aptitude tests. After the training phase of the experiment, participants completed parts B and D of the language-independent LLAMA Language Aptitude test (Rogers et al., 2017; Granena & Long, 2013; Meara, 2005). The LLAMA B is a vocabulary learning task in which participants are asked to memorize the names of twenty fantasy cartoon figures in 2 min. The names are based on a Mesoamerican native language. Following the 2-min learning phase, participants performed a memory task in which they selected the figures corresponding to the twenty written names displayed in random order on a computer screen. The LLAMA D test examines the capacity to identify, recognize, and memorize sound sequences, which is crucial for the aptitude to learn foreign language words. LLAMA D scores have been found to predict L2 vocabulary acquisition outcomes (Hummel & French, 2016). In the LLAMA D test, participants are given 2 min to familiarize themselves with sound sequences and are then asked to select the sound sequences that would be used to spell novel auditorily presented two-syllable words. Both the LLAMA B and D tests were administered via computer and were completed in a fixed order, with the LLAMA B test always occurring first.

Data Analysis

Vocabulary test scoring. Free recall and cued recall tests were scored by assigning a value of 1 for each correct response, and a value of 0 in case of incorrect response or lack of response. The total score for each test could range from 0 to 18. Scores on the five vocabulary tests (L1 free recall, L2 free recall, paired free recall, cued L1 recall, and cued L2 recall) were averaged for each participant and learning condition (AV, AVO, and AVOG), yielding a single composite test score for each participant and learning condition.
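The scoring rule described above can be sketched in Python as follows. The text does not state how spelling variants were treated, so the sketch assumes exact matching after trimming whitespace and ignoring letter case; the function names, test labels, and example values are hypothetical.

```python
TESTS = ["L1_free", "L2_free", "paired_free", "cued_L1", "cued_L2"]


def score_item(response, target):
    """Return 1 for a correct response, 0 for an incorrect or missing one.

    Assumption: a response is correct only if it matches the target exactly
    after trimming whitespace and ignoring letter case.
    """
    if response is None:
        return 0
    return 1 if response.strip().lower() == target.strip().lower() else 0


def composite_score(scores_by_test):
    """Average the five test scores of one participant in one condition."""
    return sum(scores_by_test[t] for t in TESTS) / len(TESTS)


# Hypothetical scores of one participant in one learning condition
example = {"L1_free": 5, "L2_free": 4, "paired_free": 3, "cued_L1": 4, "cued_L2": 4}
print(composite_score(example))
```

Applying `composite_score` once per participant and learning condition yields the three composite scores per participant that enter the analysis.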

Language aptitude test scoring. Scores on the LLAMA B and LLAMA D tests (% correct) were averaged for each participant to create a composite LLAMA test score. To group participants into LLA and HLA learners, we conducted a median split. Based on the median composite LLAMA test score (Median = 38%), participants were split into two groups: the LLA group (n = 21, M LLAMA score = 27%, SD = 7%) and the HLA group (n = 25, M LLAMA score = 52%, SD = 13%).
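A median split of this kind can be sketched in Python as shown below. The text does not state how scores equal to the median were handled; the sketch assumes that participants scoring below the median are assigned to the LLA group and all others to the HLA group. Function names, participant identifiers, and scores are hypothetical.

```python
from statistics import median


def composite_llama(llama_b, llama_d):
    """Average the LLAMA B and LLAMA D scores (both in % correct)."""
    return (llama_b + llama_d) / 2


def median_split(scores):
    """Split participants into LLA and HLA groups at the sample median.

    Assumption: scores below the median -> "LLA"; scores at or above -> "HLA".
    """
    cut = median(scores.values())
    groups = {pid: ("HLA" if s >= cut else "LLA") for pid, s in scores.items()}
    return cut, groups


# Hypothetical (LLAMA B, LLAMA D) scores for four participants
raw = {"p1": (20, 30), "p2": (40, 30), "p3": (50, 60), "p4": (70, 50)}
composites = {pid: composite_llama(b, d) for pid, (b, d) in raw.items()}
cut, groups = median_split(composites)
print(cut, groups)
```

Unequal group sizes, as reported for the present sample, arise under this rule whenever several participants score exactly at the median.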

Linear mixed effects modelling. We first inspected the composite test scores for outliers based on the interquartile range (IQR), as suggested by Hoaglin and colleagues (1986). No participants were classified as outliers according to this procedure. We used a linear mixed effects modelling (LMM) approach to test our hypothesis that benefits of grasping on memory for L1 and L2 words would depend on participants’ language learning aptitude. The model included fixed effects of aptitude (high, low) and learning condition (AV, AVO, AVOG) and a random intercept by participant. The aptitude factor was a binary between-subjects factor. The mixed effects model used the AV condition as the reference level for the learning condition factor. The model was generated in R (run in RStudio version 1.2.1335) using the ‘lme4’ package (Bates et al., 2015). Significance testing was performed using Satterthwaite’s method implemented in the ‘lmerTest’ package, with an alpha level of α = 0.05 (Kuznetsova et al., 2017). Post-hoc Tukey tests were conducted using the ‘emmeans’ package (Lenth et al., 2020). Cohen’s d was computed as a measure of effect size.
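The IQR-based outlier screening can be illustrated with the short Python sketch below. The multiplier applied to the IQR is not reported in the text; the sketch assumes the conventional value of 1.5 and exposes it as a parameter. The quartile method, the function name, and the example data are likewise assumptions made for illustration, and the sketch is not the R code used for the reported analyses.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR.

    Assumptions: k = 1.5 and quartiles computed with the 'inclusive' method
    of statistics.quantiles; neither detail is specified in the text.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical composite scores, one of which is far from the rest
scores = [10, 11, 12, 13, 14, 15, 100]
print(iqr_outliers(scores))
```

A larger multiplier makes the rule more conservative, so fewer observations are labelled as outliers.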

Correlation analyses. To test whether language learning aptitude was associated with vocabulary learning outcomes, we correlated the composite LLAMA test scores with average vocabulary test scores for each learning condition by participant. We also examined whether age predicted language learning aptitude by correlating participants’ ages with their composite LLAMA test scores.
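As an illustration of this analysis, the sketch below computes a correlation coefficient and the corresponding t statistic in Python. It assumes Pearson's product-moment correlation, which the text does not name explicitly, and the function names are hypothetical. Obtaining a p value would additionally require the t distribution with n − 2 degrees of freedom, which is omitted here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length (at least 3 values)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# Made-up aptitude (%) and vocabulary composite scores for five participants.
aptitude = [25, 30, 38, 45, 60]
vocabulary = [2.0, 2.4, 2.2, 3.1, 3.4]
r = pearson_r(aptitude, vocabulary)
t = t_statistic(r, len(aptitude))
```

Running the correlation separately for each learning condition, as described above, yields one coefficient per condition.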

Results

Grasping and Viewing Virtual Visual Objects During Learning Enhances Vocabulary Acquisition

We first tested the hypothesis that the performance of grasping movements in the AVOG condition would benefit learning relative to audiovisual learning that occurred in the AV and AVO conditions. The linear mixed effects model revealed that AVOG learning significantly enhanced vocabulary test performance relative to AV learning (b = 0.99; p < .001, d = 1.18), as shown in Fig. 2. AVO learning also significantly enhanced vocabulary test scores relative to learning in the AV condition (b = 0.89; p < .001, d = 1.06). For the full set of model results, see Table 2. Post-hoc tests indicated that, overall, vocabulary test performance following AVOG learning did not significantly exceed performance following AVO learning (t = 0.56, p = .84). Condition means and standard deviations are shown in Table 3.

Fig. 2

Vocabulary test performance by learning condition. Viewing and grasping virtual objects referred to by foreign language (L2) words during learning significantly enhanced L2 word retention compared to viewing and hearing the L2 words. The polygons represent density estimates of the data from each condition and extend to extreme values. White circles show the medians, box limits indicate the 25th and 75th percentiles, and whiskers extend 1.5 times the interquartile range from the 25th and 75th percentiles. AV = audiovisual learning; AVO = audiovisual observation learning; AVOG = audiovisual observation and grasping learning. ***p < .001

Table 2 Linear mixed effects model testing the effects of learning condition and language learning aptitude on vocabulary test scores. AV = audiovisual learning; AVO = audiovisual observation learning; AVOG = audiovisual observation and grasping learning. *p < .05, ***p < .001
Table 3 Vocabulary test scores by learning condition. AV = audiovisual learning; AVO = audiovisual observation learning; AVOG = audiovisual observation and grasping learning

Grasping Visual Objects During Learning Benefits LLA Learners More Than HLA Learners

Table 4 Vocabulary test scores by language aptitude and learning condition. LLA = low language aptitude; HLA = high language aptitude; AV = audiovisual learning; AVO = audiovisual observation learning; AVOG = audiovisual observation and grasping learning

Our second hypothesis was that integrating grasping movements into the learning process would be of greater benefit to LLA learners than to HLA learners. We therefore tested whether the aptitude factor modulated effects of learning condition on vocabulary test performance. The AVOG × Aptitude contrast was significant (b = -0.70; p = .048, d = 0.42), whereas the AVO × Aptitude contrast was not (b = -0.10; p = .77). This indicates that only words that had been learned in the AVOG condition (by performing grasping movements) were recalled differently depending on whether an individual participant’s language learning aptitude was low or high. Post-hoc comparisons revealed a significant difference between test scores in the AVOG and AV learning conditions for LLA learners (t = -5.08; p < .001, d = 1.60), but not for HLA learners (t = -2.64; p = .098), as shown in Fig. 3. Thus, performing grasping movements in the AVOG learning condition did not significantly benefit HLA learners, but did significantly benefit LLA learners. Both LLA and HLA learners significantly benefited from simply observing objects in the AVO learning condition relative to the AV learning condition (LLA learners: t = 3.57, p = .007, d = 1.12; HLA learners: t = 3.47, p = .01, d = 1.00). Finally, as expected, HLA learners scored significantly higher on the vocabulary tests than LLA learners (b = 1.14; p < .001, d = 1.19). Condition means and standard deviations are shown in Table 4.
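To make the interaction contrasts above easier to read, the fixed-effects structure they imply can be written out as a sketch. The notation is ours, and it assumes treatment (dummy) coding with AV and LLA as reference levels. The text confirms only AV as the reference level for learning condition; LLA as the reference level for aptitude is an assumption that fits the positive aptitude coefficient reported above.

```latex
% y_{ij}: composite vocabulary score of participant i in learning condition j
% AVO_{ij}, AVOG_{ij}: indicators for the learning condition (AV is the reference)
% HLA_i: indicator for high aptitude (LLA assumed as the reference)
\begin{aligned}
y_{ij} ={}& \beta_0 + \beta_1\,\mathrm{AVO}_{ij} + \beta_2\,\mathrm{AVOG}_{ij} + \beta_3\,\mathrm{HLA}_i \\
          & + \beta_4\,(\mathrm{AVO}_{ij} \times \mathrm{HLA}_i)
            + \beta_5\,(\mathrm{AVOG}_{ij} \times \mathrm{HLA}_i)
            + u_i + \varepsilon_{ij}, \\
u_i \sim{}& \mathcal{N}(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
```

Under this assumed coding, the AVOG × Aptitude coefficient (β₅) is the difference between HLA and LLA learners in the AVOG-minus-AV effect, so a negative estimate corresponds to a smaller grasping benefit for HLA learners. The random intercept uᵢ absorbs stable differences between participants across the three conditions.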

Fig. 3

Vocabulary test performance by learning condition and language learning aptitude. The integration of grasping movements into foreign language (L2) training significantly benefitted low aptitude learners, but not high aptitude learners, relative to simply viewing and hearing L2 words. The polygons represent density estimates of the data from each condition and extend to extreme values. White circles show the medians, box limits indicate the 25th and 75th percentiles, and whiskers extend 1.5 times the interquartile range from the 25th and 75th percentiles. AV = audiovisual learning; AVO = audiovisual observation learning; AVOG = audiovisual observation and grasping learning. *p < .05, **p < .01, ***p < .001

Language Learning Aptitude Correlates with Vocabulary Retention Following Audiovisual-Only Learning but Not Grasping-Based Learning

We next tested our third hypothesis, which concerned whether individual participants’ language learning aptitudes could predict their vocabulary test scores in the three learning conditions. LLAMA scores showed a significant positive correlation with vocabulary test scores in the AV learning condition (r(46) = 0.42, p = .003) and the AVO learning condition (r(46) = 0.35, p = .017), but not in the AVOG learning condition (r(46) = 0.26, p = .076), as shown in Fig. 4. Thus, the addition of grasping movements to the vocabulary learning task changed the relationship between language learning aptitude and vocabulary retention. The correlation between age and language learning aptitude did not reach significance (r(46) = −0.27, p = .070), although it was in the expected direction (higher aptitude scores for younger participants).

Fig. 4

Correlations of language aptitude scores and vocabulary test scores by learning condition. Language aptitude significantly correlated with vocabulary learning outcomes following audiovisual forms of learning, but not learning that involved grasping virtual objects. AV = audiovisual learning; AVO = audiovisual observation learning; AVOG = audiovisual observation and grasping learning. *p < .05, **p < .01

Discussion

The current study investigated whether benefits of sensorimotor-enriched L2 learning by means of grasping might differ across levels of word-learning aptitude, or whether learners of different aptitude respond similarly to sensorimotor learning interventions. Adult LLA and HLA word-learners were trained on novel L2 vocabulary by simply viewing the L2 words and their L1 translations (AV condition), viewing the L2 and L1 words along with a virtual visual object depicting the word’s referent (AVO condition), and viewing the words while grasping the virtual object (AVOG condition).

In line with our first hypothesis, grasping virtual objects during learning benefitted vocabulary acquisition relative to simply viewing L2 and L1 words. However, this benefit was specific to LLA learners. HLA learners did not benefit from integrating grasping movements into learning. This result confirms our second hypothesis. Interestingly, viewing objects referred to by L2 words during L2 learning enhanced retention relative to viewing the L2 words themselves, both in LLA and HLA learners. When participants grasped the virtual objects while learning, the relationship between language learning aptitude and vocabulary learning was altered such that aptitude could no longer positively predict learning outcomes. Finally, age was not found to predict language learning aptitude in this study, contrary to our fourth hypothesis.

We interpret these findings in terms of procedural and declarative memory. Procedural memory is likely engaged during movement-enriched learning, as demonstrated by Macedonia and Mueller (2016). We reason that HLA learners may inherently rely on their declarative memory for L2 learning, as long as the L2 input does not impose too high a cognitive load. Conversely, LLA learners’ word learning may be supported to a greater extent by the recruitment of procedural memory systems. In other words, HLA learners may effectively pick up new words without needing to involve the motor system, while LLA learners’ performance is improved by integrating sensorimotor elements into the learning experience.

Viewing Visual Objects During Vocabulary Learning Benefits Both Low and High Aptitude Learners

Viewing a virtual object during L2 learning enhanced subsequent memory performance for both low and high language aptitude learners. This finding is consistent with several recent studies that demonstrated beneficial effects of presenting complementary visual information along with written or spoken words during L2 word acquisition (Andrä et al., 2020; Mathias et al., 2022; Mayer et al., 2015). The enrichment of L2 learning with pictures has also been used as a teaching strategy in educational practice for decades, although such teaching methods (Riesenberg et al., 2009) have not been scientifically investigated until recently. The picture benefit observed here is consistent with cognitive and neural theories that attribute the benefits of enriched learning to multimodal interactions, such as the DCT (Paivio & Csapo, 1969, 1973) and multisensory learning theory (Mayer et al., 2015; Mathias & von Kriegstein, 2023).

Differential Effects of Grasping Objects for Low and High Aptitude Language Learners

Grasping virtual objects during word learning also enhanced word retention relative to baseline learning, but only in LLA learners and not in HLA learners. The grasping benefit demonstrated in LLA learners is consistent with numerous studies showing positive effects of sensorimotor enrichment of words and phrases by means of congruent gestures or movements (Bäckman & Nilsson, 1985; Engelkamp & Krumnacker, 1980; Engelkamp, 1980; Engelkamp et al., 1994; Engelkamp et al., 1995; Kormi-Nouri et al., 1994; Mimura et al., 1998; Zimmer, 1996; Zimmer & Saathoff, 1997). Although the current study did not involve the enactment of gestures, these findings extend the gesture results and demonstrate that object-directed movements performed while grasping virtual objects can also support L2 word retention.

The finding that grasping virtual objects can improve word retention in certain learners, even though the VR environment provided no tactile experience of actually touching the objects, has important implications for theories of embodiment. Specifically, it challenges the idea that physical touch or tactile feedback is an essential component for activating the sensorimotor systems that facilitate learning. Instead, it suggests that even simulated, non-tactile interactions can sufficiently engage these systems, broadening our understanding of what embodiment can entail. We therefore view this finding as supporting the grounded cognition view that cognitive processes extend beyond body movements and into the surrounding environment (Barsalou, 2020). The finding also informs our understanding of the efficacy of sensorimotor learning strategies in VR environments. If full tactile engagement is not always necessary to benefit from embodied learning strategies, the range of effective, low-cost educational VR interventions that can be developed expands accordingly. The findings thus open up new avenues for exploring the minimum requirements for effective sensorimotor enrichment in VR environments.

In general, learning outcomes are influenced by individual variations in learning aptitude (Dahlen & Caldwell-Harris, 2013). However, pinpointing the specific mechanisms involved has proven to be a complex task. Prior studies, such as one by Poschner (2018), indicate that LLA and HLA learners do not necessarily employ different cognitive strategies for language acquisition; for instance, both groups use sound associations between foreign language (L2) words and their native language (L1) translations. Further complicating the picture, Matusz and colleagues (2017) demonstrated that HLA learners excel in integrating multisensory events within the intraparietal cortex, a neural hub crucial for multisensory processing and for selective attention (Fiebelkorn & Kastner, 2020). Moreover, HLA learners display heightened activity in the left angular gyrus and right extra-striate cortex when recognizing gesture-associated words compared to LLA learners (Macedonia et al., 2010). These brain regions are also implicated in the integration of information across diverse sensory modalities (Binder et al., 2009; Seghier, 2012). Taken together, our findings suggest that LLA and HLA learners may differentially process multisensory and sensorimotor cues during their learning experiences. Note, however, that differences in accessing memory resources, i.e., the recruitment of procedural memory in LLA learners, do not occur intentionally. This recruitment seems to be an innate strategy that is applied in order to perform the task.

If HLA learners exhibit more efficient multisensory integration processes compared to LLA learners, one might expect that enrichment cues would be especially beneficial for the HLA group. This is not what we observed. Instead, the integration of grasping movements into the learning process was more beneficial for LLA than HLA learners. Why could this be so? We propose that LLA and HLA learners may differ in how declarative and procedural memory systems are deployed during vocabulary learning. The LLA and HLA learners in our study likely differed in terms of declarative memory ability, as assessed by the LLAMA B and D subtests. Declarative memory abilities have previously been associated with individual differences in phonological processing, which encompasses both phonological working memory and retrieval (Arthur et al., 2021; Baddeley, 2010) and was a core component of the current word-learning tasks. Thus, both the language aptitude tests and the audiovisual learning conditions likely engaged declarative memory. However, procedural memory was likely recruited by the grasping movements performed during learning (Macedonia & Mueller, 2016). It is possible that LLA learners relied to a greater extent than HLA learners on procedural memory systems when learning words by grasping their referent objects. This led to greater benefits of grasping for LLA learners than HLA learners relative to baseline audiovisual learning.

One possible explanation for the greater engagement of procedural memory among LLA learners compared to HLA learners could be that the LLA group leveraged procedural memory to mitigate cognitive load, economizing the working memory resources required for effective learning. Although working memory is limited in capacity (Miller, 1956; Cowan, 2001), there is some evidence that it can be improved by factors such as training, expertise, or even encoding strategy (Ericsson & Kintsch, 1995). If physical actions performed during learning are congruent with to-be-learned stimuli, then these actions typically enhance task performance (Cook et al., 2008; Skulmowski & Rey, 2018). Hence, actions may be able to reduce cognitive load and serve as a successful, “natural” strategy for LLA learners. This explanation is supported by Paas and Sweller’s (2012) biological evolution theory, which considers sensorimotor experiences to be sources of biologically primary information. While interacting with the world, individuals acquire knowledge schemas that are necessary in order to build up secondary biological information such as language. More importantly, by constructing schemas through sensorimotor experiences, individuals are able to save cognitive resources. The ability to gesture in order to reduce cognitive load and to externalize thoughts during speech production and language development has also been addressed by Goldin-Meadow (2001) and Ping and Goldin-Meadow (2010). We propose that the grasping movements performed in the current study saved cognitive resources via cognitive offloading to a greater extent in the LLA learners, who were defined in our study based on tests of short-term and working memory (cf. Risko & Gilbert, 2016). This cognitive offloading may, like the use of procedural memory systems, have improved the retention of sensorimotor-enriched words.

Potential Effects of Stimulus Timing and Study Limitations

It is worth noting that the length of trials in which participants grasped and viewed objects in the current study differed from the length of trials in which participants merely viewed and heard L2 words without seeing any objects (baseline learning trials). Object grasping and object viewing trials were roughly twice as long as trials in which no objects were presented. Despite this difference, HLA learners showed no learning advantage for grasping-enriched trials relative to baseline. Thus, the difference in trial timing between baseline and grasping conditions cannot explain the divergence between LLA and HLA learners in terms of grasping benefits. Previous L2 learning studies that shortened the length of baseline trials (e.g., Andrä et al., 2020) and studies that used equivalent trial lengths for all learning conditions (e.g., Macedonia et al., 2011; Mayer et al., 2015) did not observe any systematic relationship between trial lengths and vocabulary learning outcomes.

A limitation of the present study is that our findings apply specifically to the learning of concrete L2 words whose meanings are already well understood in the learner’s L1. It is conceivable that the benefits of sensorimotor-enriched learning could be even more pronounced when applied to vocabulary items that are unfamiliar in both the learner’s L1 and L2. In such cases, sensorimotor cues could offer crucial support for establishing entirely new semantic representations. An additional limitation concerns the nature of the movements involved. Participants engaged in simple grasping movements, without performing more complex, functionally relevant manipulations of the objects, such as inserting and turning a key or hammering a nail. It is conceivable that functional manipulations might engage the motor system more deeply, thereby enhancing an item’s distinctiveness in procedural memory. Future research could investigate the potentially larger benefits of performing more functionally meaningful movements in VR environments on vocabulary learning outcomes.

Conclusion and Pedagogical Outlook

A growing body of research has shown that the use of sensorimotor enrichment strategies during L2 word learning can enhance the memorization of those words. The present study has demonstrated that grasping virtual objects also benefits retention, particularly in LLA learners. We propose that grasping virtual objects during learning engaged LLA learners’ procedural memory. This in turn enhanced their vocabulary acquisition compared to HLA learners, who benefitted from higher declarative memory capacities. Although VR is not a new technology, research on the use of VR for language learning in pedagogical settings is rare. With the advent of VR devices that can be purchased at a reasonable price, vocabulary learning with VR objects could support LLA learners in an efficient way: VR would allow training to be provided anywhere and at any time, making it accessible to everyone and thereby facilitating multilingualism. The technology could at the same time allow personalized programs that take into account a learner’s aptitude and individual learning needs (Macedonia et al., 2014).