<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2021.778060</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Remarks on Multimodality: Grammatical Interactions in the Parallel Architecture</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Cohn</surname> <given-names>Neil</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/44152/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Schilperoord</surname> <given-names>Joost</given-names></name>
</contrib>
</contrib-group>
<aff><institution>Department of Communication and Cognition, Tilburg School of Humanities and Digital Sciences, Tilburg University</institution>, <addr-line>Tilburg</addr-line>, <country>Netherlands</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Anastasia Smirnova, San Francisco State University, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Tim Fernando, Trinity College Dublin, Ireland; Yao-Ying Lai, National Chengchi University, Taiwan</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Neil Cohn <email>neilcohn&#x00040;visuallanguagelab.com</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Language and Computation, a section of the journal Frontiers in Artificial Intelligence</p></fn></author-notes>
<pub-date pub-type="epub">
<day>04</day>
<month>01</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>4</volume>
<elocation-id>778060</elocation-id>
<history>
<date date-type="received">
<day>16</day>
<month>09</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>10</day>
<month>12</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Cohn and Schilperoord.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Cohn and Schilperoord</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract><p>Language is typically embedded in multimodal communication, yet models of linguistic competence do not often incorporate this complexity. Meanwhile, speech, gesture, and/or pictures are each considered as indivisible components of multimodal messages. Here, we argue that multimodality should not be characterized by whole interacting behaviors, but by interactions of similar substructures which permeate across expressive behaviors. These structures comprise a unified architecture and align within Jackendoff&#x00027;s Parallel Architecture: a modality, meaning, and grammar. Because this tripartite architecture persists across modalities, interactions can manifest within each of these substructures. Interactions between modalities alone create correspondences in time (ex. speech with gesture) or space (ex. writing with pictures) of the sensory signals, while multimodal meaning-making balances how modalities carry &#x0201C;semantic weight&#x0201D; for the gist of the whole expression. Here we focus primarily on interactions between grammars, which contrast across two variables: symmetry, related to the complexity of the grammars, and allocation, related to the relative independence of interacting grammars. While independent allocations keep grammars separate, substitutive allocation inserts expressions from one grammar into those of another. We show that substitution operates in interactions between all three natural modalities (vocal, bodily, graphic), and also in unimodal contexts within and between languages, as in codeswitching. Altogether, we argue that unimodal and multimodal expressions arise as emergent interactive states from a unified cognitive architecture, heralding a reconsideration of the &#x0201C;language faculty&#x0201D; itself.</p></abstract>
<kwd-group>
<kwd>multimodality</kwd>
<kwd>linguistic theory</kwd>
<kwd>parallel architecture</kwd>
<kwd>grammar</kwd>
<kwd>codeswitching</kwd>
</kwd-group>
<counts>
<fig-count count="9"/>
<table-count count="2"/>
<equation-count count="0"/>
<ref-count count="75"/>
<page-count count="21"/>
<word-count count="13145"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>Introduction</title>
<p>Natural human communication combines speech, bodily movements, and drawings into multimodal messages (McNeill, <xref ref-type="bibr" rid="B54">1992</xref>; Goldin-Meadow, <xref ref-type="bibr" rid="B31">2003a</xref>; Kress, <xref ref-type="bibr" rid="B47">2009</xref>; Bateman, <xref ref-type="bibr" rid="B1">2014</xref>; Bateman et al., <xref ref-type="bibr" rid="B2">2017</xref>). Speech and writing rarely appear in isolation; rather, we gesture when we talk and we combine pictures with text. Yet, models of language competence typically do not account for this diversity of expression. Consider the well-known phrase &#x0201C;I<inline-graphic xlink:href="frai-04-778060-i0001.tif"/> NY,&#x0201D; created by designer Milton Glaser, as seen on t-shirts, mugs, posters, and other paraphernalia. On first encountering it, one has to connect the heart image to the linguistic construction [<sub>S</sub> Subject&#x02014;Verb&#x02014;Object] in order to recognize that the heart is playing a role as a verb, specifically to mean LOVE. Now consider the following sentences, all taken from real-world contexts:</p>
<p>(1)</p>
<list list-type="simple">
<list-item><p>a) I<inline-graphic xlink:href="frai-04-778060-i0001.tif"/> making new friends (Twitter post)</p></list-item>
<list-item><p>b) Please drive slowly. We<inline-graphic xlink:href="frai-04-778060-i0001.tif"/> our children. (Street sign)</p></list-item>
<list-item><p>c) They<inline-graphic xlink:href="frai-04-778060-i0001.tif"/> weddings (Twitter post)</p></list-item>
<list-item><p>d) I<inline-graphic xlink:href="frai-04-778060-i0001.tif"/> transitive pictograph verbalizations (T-shirt)</p></list-item>
</list>
<p>Across these examples, the heart occupies the uninflected verb position and carries the consistent meaning (and possibly pronunciation) of LOVE. Repeated exposure to these kinds of expressions may lead one to generalize this role of the heart across different sentences, which creates a construction in the form of [<sub>S</sub> Subject&#x02014;<inline-graphic xlink:href="frai-04-778060-i0001.tif"/><sub>V</sub>&#x02013;Object], a pattern even self-referentially appearing in (1d). Thus, the heart&#x02014;a graphic sign&#x02014;has become a part of the written English lexicon.</p>
<p>Now consider the sentences in <xref ref-type="fig" rid="F1">Figure 1</xref>, all from t-shirts, which each use a picture in the verb position, but which do not carry as deterministic a meaning or pronunciation as the heart. Rather, the semantics of the pictures-as-verbs either maintain the meaning of &#x0201C;love&#x0201D; and/or invoke semantic relatedness to the Direct Object of the sentence. Following the original construction for New York, the pattern may involve an Object that is a place, but with a verb-picture related to that place, like for Tokyo with a sport played there (sumo) or a monster that destroys it (Godzilla). However, this pattern can be used beyond places. For example, &#x0201C;Nyuk&#x0201D; is an utterance typically made by Curly from The Three Stooges, whose face appears in the verb position of that sentence, while the skull-and-crossbones comes from an activist t-shirt reflecting displeasure with a former U.S. president.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>T-shirts all using a pattern of Subject&#x02014;Picture<sub>Verb</sub>&#x02013;Object.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-04-778060-g0001.tif"/>
</fig>
<p>These examples imply a further, more general construction of [<sub>S</sub> Subject&#x02014;Picture<sub>Verb</sub>&#x02013;Object] where the verb slot of the canonical sentence structure (N-V-N) must be filled by a picture, not a written word, that semantically connects to the Direct Object. This construction is more general than the heart-construction, since the open verb can be filled by any image associated with the Direct Object, not just a heart. This forms an abstract grammatical pattern, but with slots mandatorily filled by different modalities.</p>
<p>These patterns form a taxonomy of entrenched relationships from general (S-Picture<sub>V</sub>-O) to constrained (S-<inline-graphic xlink:href="frai-04-778060-i0001.tif"/><sub>V</sub>-O) to specific (I<inline-graphic xlink:href="frai-04-778060-i0001.tif"/> NY). People learn these patterns from encountering instances repeatedly, which then allow for abstraction to a generalized construction. Crucially, these are multimodal patterns, which all leave their traces in our lexicons, but involve more modalities than just speech/writing. In fact, many stored multimodal patterns involve both fixed patterns and constructional variables. Productive schemas have been identified in multimodal memes shared on social media (Dancygier and Vandelanotte, <xref ref-type="bibr" rid="B17">2017</xref>; Schilperoord and Cohn, <xref ref-type="bibr" rid="B69">2022</xref>), and emoji which are systematically integrated with written language across digital communication (Gawne and McCulloch, <xref ref-type="bibr" rid="B29">2019</xref>; Weissman, <xref ref-type="bibr" rid="B71">2019</xref>). In addition, gestures have long been recognized as integrated with speech in ways that question their separability (McNeill, <xref ref-type="bibr" rid="B54">1992</xref>; Goldin-Meadow, <xref ref-type="bibr" rid="B31">2003a</xref>), and indeed have been argued to have constructional properties (Lanwer, <xref ref-type="bibr" rid="B50">2017</xref>; Ladewig, <xref ref-type="bibr" rid="B49">2020</xref>).</p>
<p>Because these multimodal patterns entwine forms of spoken and written language with those of other modalities, accounting for these phenomena requires discussing them in terms of the language system. In fact, we contend that any <italic>complete</italic> theory of language must account for how elements from other modalities are integrated with the verbal lexicon. We may thus articulate these issues as fundamental questions involved in a theory of &#x0201C;knowledge of language,&#x0201D; expanding on those articulated by Jackendoff and Audring (<xref ref-type="bibr" rid="B41">2020</xref>) for unimodal language:</p>
<list list-type="order">
<list-item><p>What elements does a speaker/signer/drawer store in memory, and in what form?</p></list-item>
<list-item><p>How are these elements combined online to create novel (multimodal) utterances?</p></list-item>
<list-item><p>How are these elements acquired?</p></list-item>
</list>
<p>As demonstrated by the picture-substitution constructions described above, and attested by decades of research on co-speech gesture, multimodal expressions that involve the body or graphics cannot be separated from the linguistic system. Such interactions are not between &#x0201C;language&#x0201D; and other &#x0201C;external&#x0201D; systems, given that encoded lexical items themselves may integrate multiple modalities. Such integration heralds a single system&#x02014;an architecture of language&#x02014;that covers such multimodal expressions in full. We contend that in order to accurately characterize the natural manifestation of language, multimodality must be incorporated into the cognitive model of language.</p>
<p>In fact, such an architecture is already available in Jackendoff&#x00027;s Parallel Architecture (Jackendoff, <xref ref-type="bibr" rid="B37">2002</xref>; Jackendoff and Audring, <xref ref-type="bibr" rid="B41">2020</xref>), and a first attempt at extending it as a multimodal model was taken in Cohn (<xref ref-type="bibr" rid="B9">2016</xref>). We here clarify, refine, and elaborate on this approach. In the sections below we first provide an overview of our multimodal expansion of the Parallel Architecture. We then focus specifically on questions exemplified by our examples above: how do grammatical structures interact across and within modalities?</p>
</sec>
<sec id="s2">
<title>Multimodal Parallel Architecture</title>
<p>Many theories characterize multimodal expressions as built of indivisible &#x0201C;modalities&#x0201D;&#x02014;such as speech, gesture, or pictures&#x02014;which then interact (Fricke, <xref ref-type="bibr" rid="B27">2013</xref>; Bateman et al., <xref ref-type="bibr" rid="B2">2017</xref>; Forceville, <xref ref-type="bibr" rid="B26">2020</xref>). However, most linguistic models agree that language is composed of interacting mental structures&#x02014;phonology, syntax, semantics&#x02014;that give rise to a holistic experience of speech. Thus, to describe the holistic experience of a multimodal expression, we aim to first identify the mental structures that comprise those expressions, and then describe how these structures interact. These structures are not found in the culturally manifested expressions &#x0201C;out there&#x0201D; in the world, but instead in the minds of the people who construct and comprehend those expressions. Thus, the ingredients of language itself&#x02014;and of multimodal interactions&#x02014;are not the &#x0201C;features&#x0201D; or &#x0201C;characteristics&#x0201D; that one can describe about the messages, but rather are the mental structures that coalesce to allow those expressions.</p>
<p>Abstracting from the mental structures described for language&#x02014;phonology, semantics, and syntax&#x02014;we posit three components: Modality, Meaning, and Grammar. Following Jackendoff&#x00027;s Parallel Architecture (Jackendoff, <xref ref-type="bibr" rid="B37">2002</xref>; Jackendoff and Audring, <xref ref-type="bibr" rid="B41">2020</xref>), these components mutually interface with one another. In addition, each structure allows for combinatoriality using the operation of Unification, a principle of assembling schematic structures. We describe each of these structures in brief below, with the full architecture presented in <xref ref-type="fig" rid="F2">Figure 2A</xref>.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>The Parallel Architecture including multiple modalities. Expressive forms arise through emergent states within the full architecture <bold>(A)</bold>. These include single unit expressions <bold>(B,D,F)</bold> and potentially full languages using recursive grammars <bold>(C,E,G)</bold>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-04-778060-g0002.tif"/>
</fig>
<sec>
<title>Modality</title>
<p>A modality is the channel by which a message is expressed and conveyed. We take a modality to involve a cluster of substructures that include the sensory apparatus for producing and comprehending a signal, the cognitive structures that guide such signals, and the combinatorial principles that govern how those signals combine. For example, speech uses the vocal modality, which is produced through oral articulation, and perceived via the auditory system. It uses phonemics to codify signals, which are combined using phonological structures. Gesture and sign language would use the bodily modality, which is produced through articulation of the body (hands, torso, face, etc.) in different positions and movements, perceived through the visual and/or haptic system. Bodily signals are instantiated cognitively as the bodily equivalent of phonemics and phonology, which appear to be different from those guiding speech (Pa et al., <xref ref-type="bibr" rid="B59">2008</xref>). Finally, pictures manifest in the graphic modality, which is produced through bodily motions that leave traces to make marks, which are perceived through the visual system. These marks are decoded as graphemes, guided by combinatorial structures of a &#x0201C;graphology&#x0201D; (Willats, <xref ref-type="bibr" rid="B73">1997</xref>; Cohn, <xref ref-type="bibr" rid="B5">2012</xref>).</p>
<p>While each natural modality is optimized for particular types of messaging, cross-modal correspondences have also emerged. For example, <italic>writing</italic> maps the natural vocal modality (speech) into the natural graphic modality (drawing) to create an unnatural correspondence for graphic depictions of speech, which repurposes neural areas naturally associated with the visual system (Hervais-Adelman et al., <xref ref-type="bibr" rid="B33">2019</xref>). Attempts have also been made to instantiate sign languages graphically in their own writing systems (Sutton, <xref ref-type="bibr" rid="B70">1995</xref>), and indeed various gestures appear in graphic form in the emoji vocabulary (Gawne and McCulloch, <xref ref-type="bibr" rid="B29">2019</xref>), not to mention gesticulations drawn in pictures more widely (Fein and Kasher, <xref ref-type="bibr" rid="B23">1996</xref>). In <xref ref-type="fig" rid="F2">Figure 2A</xref>, we notate these cross-modal mappings with dotted lines.</p>
<p>Thus, the vocal, bodily, and graphic modalities constitute the three natural modalities that humans have available to express conceptual structures. In the original Parallel Architecture, only phonology addressed modalities, which was largely characterized in terms of the vocal-auditory modality. While the &#x0201C;phonology&#x0201D; of sign language was alluded to, it was left ambiguous as to whether &#x0201C;phonology&#x0201D; in the Parallel Architecture was conceived as a modality-specific construct (i.e., the auditory-vocal modality) or whether it was a modality-general construct with different sensory manifestations (i.e., &#x0201C;phonology&#x0201D; could serve as the broader class for all modalities). The multimodal Parallel Architecture makes it explicit that all modalities are present <italic>at once</italic>, as depicted in <xref ref-type="fig" rid="F2">Figure 2A</xref>. The important implication is that &#x0201C;language&#x0201D; is not an amodal representation that &#x0201C;flows out&#x0201D; of different modalities, but rather all modalities are present and persisting as part of a larger holistic communicative faculty, whether or not expressions in those modalities rise to the level of full languages (as in <xref ref-type="fig" rid="F2">Figures 2B&#x02013;G</xref>, and discussed further below).</p>
</sec>
<sec>
<title>Meaning</title>
<p>A second component of language and communicative systems is their capacity to convey meaning. We follow Jackendoff (<xref ref-type="bibr" rid="B35">1983</xref>, <xref ref-type="bibr" rid="B36">1987</xref>, <xref ref-type="bibr" rid="B37">2002</xref>) in calling this Conceptual Structure, a modality-independent &#x0201C;hub&#x0201D; of semantic memory which aggregates semantic information from across sensory and cognitive systems (Jackendoff, <xref ref-type="bibr" rid="B36">1987</xref>; Kutas and Federmeier, <xref ref-type="bibr" rid="B48">2011</xref>; Ralph et al., <xref ref-type="bibr" rid="B64">2016</xref>). Conceptual structure is fundamentally combinatorial and constitutes an independent level of structure, using intrinsically semantic units, like objects, paths, events, properties, and quantifiers. The specific ways that these structures may manifest depend on the modality, i.e., speech and graphics convey meaning in different ways, or on the representational systems within a modality, i.e., English and Swahili differ in how they convey meaning.</p>
<p>We here follow Jackendoff&#x00027;s (<xref ref-type="bibr" rid="B35">1983</xref>; <xref ref-type="bibr" rid="B36">1987</xref>; <xref ref-type="bibr" rid="B37">2002</xref>) model of Conceptual Semantics in articulating these conceptual structures, which we believe provides formalisms which can best express multimodal semantic relationships in explicit terms, including the emergent inferences that multimodality may evoke. We should note that the full treatment of meaning in the Parallel Architecture also includes a Spatial Structure which articulates an abstract geometric representation of meaning. In our full treatment of multimodal semantics in works to come, we include both systems of Conceptual Structure and Spatial Structure, but for simplicity we here omit Spatial Structure.</p>
</sec>
<sec>
<title>Grammar</title>
<p>The final component of languages is that of Grammar, the system that packages meaning in order to be expressed. While taxonomies of grammars have been posited for the vocal and bodily modalities (Chomsky, <xref ref-type="bibr" rid="B3">1956</xref>; Jackendoff and Wittenberg, <xref ref-type="bibr" rid="B42">2014</xref>), recent work has argued for comparable architectural principles to operate in the sequencing of graphics, particularly in visual narratives like comics (Cohn, <xref ref-type="bibr" rid="B8">2013c</xref>). Neurocognitive research has found similar neural responses to manipulations of picture sequences as those observed in sentence processing (Cohn, <xref ref-type="bibr" rid="B13">2020b</xref>), consistent with findings of shared resources for verbal syntax and music (Patel, <xref ref-type="bibr" rid="B61">2003</xref>; Koelsch, <xref ref-type="bibr" rid="B45">2011</xref>). Indeed, human neurocognition has been posited as allowing for a range of combinatorial sequencing (Dehaene et al., <xref ref-type="bibr" rid="B18">2015</xref>), which could thus manifest in different representations across modalities.</p>
<p>We here make a broad distinction between the complexity of types of grammatical expressions (Jackendoff and Wittenberg, <xref ref-type="bibr" rid="B42">2014</xref>, <xref ref-type="bibr" rid="B43">2017</xref>; Dehaene et al., <xref ref-type="bibr" rid="B18">2015</xref>). <italic><bold>Simple</bold></italic> <italic><bold>grammars</bold></italic> contribute little to the organizational structure of a sequence beyond the information provided from conceptual structures. They package information as a single unit, two units, or a linear sequence, whereby the meaning of the units alone motivates organization of the utterance. In contrast, <italic><bold>complex grammars</bold></italic> contribute structure to the message they organize, by assigning categorical roles to differentiate units and by segmenting sequences into constituents, possibly with recursive embedding. Because basic memory capacity is fairly limited for sequencing meaningful information on its own, representations of distinguishable types (categorical grammar) and segmentation (simple phrase grammar) are posited to facilitate more complex sequencing, and thus more complex meaningful expressions. We elaborate further on types of grammars below.</p>
</sec>
<sec>
<title>Unimodal Expressions</title>
<p>We contend that full languages instantiate the three components of Modality, Meaning, and Grammar in a balanced way. Jackendoff&#x00027;s (<xref ref-type="bibr" rid="B37">2002</xref>) Parallel Architecture accounted for these interactions for the vocal modality to describe the structures of spoken languages, and alluded to the bodily modality to describe sign languages. Our extension of the Parallel Architecture thus includes all three natural human modalities of the vocal, bodily, and graphic structures which persist in parallel within a single unified system. All expressive modalities then arise out of emergent interactions between these component parts of the Parallel Architecture. Thus, spoken languages involve an interaction of the vocal modality with a complex grammar and conceptual structures (<xref ref-type="fig" rid="F2">Figure 2C</xref>). Sign languages involve a similar emergent interaction with the bodily modality (<xref ref-type="fig" rid="F2">Figure 2E</xref>), again along with a complex grammar and conceptual structures.</p>
<p>Because the structures in the Parallel Architecture are independent yet mutually interfacing, these same components can yield expressions that may lack certain structures, not fully manifesting as languages with all three components (Cohn, <xref ref-type="bibr" rid="B7">2013b</xref>). For example, expressions of the modality alone in the vocal modality would yield nonsense vocables like <italic>sha-la-la-la-la</italic> or non-words like <italic>fwiggle</italic> and <italic>plord</italic>. In the bodily modality this would be non-meaningful bodily expressions, and in the graphic modality this would yield non-meaningful mark-making as found in abstract art.</p>
<p>We draw primary attention here to meaningful expressions that lack a complex grammar, remaining as single unit expressions or as unstructured linear sequences. For example, vocal expressions using simple grammars guided by their meaning alone include single unit expressions such as <italic>ouch, pow</italic>, or <italic>kablooey</italic>, which do not have grammatical categories allowing them to be combined into well-formed sentences (Jackendoff, <xref ref-type="bibr" rid="B37">2002</xref>; Jackendoff and Wittenberg, <xref ref-type="bibr" rid="B42">2014</xref>), as illustrated in <xref ref-type="fig" rid="F2">Figure 2B</xref>. Similarly, gestures in the bodily modality are typically single expressions (<xref ref-type="fig" rid="F2">Figure 2D</xref>) lacking a complex grammar, particularly emblems like <italic>thumbs up</italic> or gestural expletives, which may appear in isolation. When produced multimodally, gestures appear at a rate of once per spoken clause (McNeill, <xref ref-type="bibr" rid="B54">1992</xref>; Goldin-Meadow, <xref ref-type="bibr" rid="B31">2003a</xref>). These single bodily motions may also differ in the degree to which they are instantiated in memory, whether as novel gesticulations, entrenched emblems, or gestural constructions (McNeill, <xref ref-type="bibr" rid="B54">1992</xref>; Goldin-Meadow, <xref ref-type="bibr" rid="B31">2003a</xref>; Ladewig, <xref ref-type="bibr" rid="B49">2020</xref>). In all cases, these simple expressions lack the complex grammars that characterize spoken and sign languages.</p>
<p>While most research has focused on the vocal and bodily modalities, we further argue that these components also extend to the graphic modality. Many pictures are single-unit meaningful graphic expressions (<xref ref-type="fig" rid="F2">Figure 2F</xref>), which might range in internal complexity from depicting full scenes (such as drawings and paintings) to simpler signs (such as emoji or pictograms used in signage). More recent work has argued that sequential drawings in visual narrative sequences use a narrative grammar with categorical roles and recursive constituent structures (Cohn, <xref ref-type="bibr" rid="B7">2013b</xref>,<xref ref-type="bibr" rid="B8">c</xref>), and manipulation of this structure evokes neural responses similar to those evoked by the syntax of sentences (Cohn, <xref ref-type="bibr" rid="B13">2020b</xref>). Because these graphic systems again use all three components of a modality (graphics), meaning, and grammar (narrative), we argue that this constitutes a <italic>language</italic> in the graphic form as well: <italic><bold>visual language</bold></italic> (Cohn, <xref ref-type="bibr" rid="B7">2013b</xref>).</p>
<p>Thus, expressions across the vocal, bodily, and graphic modalities use a combination of a Modality, Meaning, and Grammar. Crucially, when those correspondences between a piece of Modality, Grammar and Meaning get fixed, they constitute lexical items, i.e., stored representations encoded in the interfaces between levels of structure (Jackendoff and Audring, <xref ref-type="bibr" rid="B41">2020</xref>). That is, we maintain that the lexicon is distributed across all structures of the Parallel Architecture (Modality, Grammar, Meaning), for lexical items of varying sizes and complexity, and such breakdown dissolves the boundaries between lexicon and grammar (i.e., because grammatical schemas are stored within and across the lexicon). This addresses our first question above, about what elements are stored in memory. The multimodal Parallel Architecture which we argue for here thus predicts that lexical items can appear in all modalities, including for the bodily modality in the lexicons of sign languages and gestural emblems, and in the extensive visual lexicons of drawings and graphic representations (Forceville, <xref ref-type="bibr" rid="B24">2011</xref>, <xref ref-type="bibr" rid="B25">2019</xref>; Cohn, <xref ref-type="bibr" rid="B5">2012</xref>, <xref ref-type="bibr" rid="B7">2013b</xref>; Schilperoord and Cohn, <xref ref-type="bibr" rid="B68">2021</xref>).</p>
<p>Consider the vocal word &#x0201C;heart,&#x0201D; for which we provide the lexical entry in <xref ref-type="fig" rid="F3">Figure 3</xref>. As a spoken word, it has a three-part structure of its modality, phonology: /h&#x00251;rt/, graphic spelling: /heart/, its grammar as a noun, and its meaning as an object HEART. The correspondences across levels of structure are marked by subscripted indices, here &#x0201C;1.&#x0201D;</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Lexical entries for <bold>(A)</bold> the word &#x0201C;heart,&#x0201D; <bold>(B)</bold> the heart shape, <bold>(C)</bold> a multimodal construction using the heart shape, and <bold>(D)</bold> the heart shape as a visual affix.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-04-778060-g0003.tif"/>
</fig>
<p>We can similarly specify a lexical encoding for the heart-shape, which in fact has multiple entries. First, a simple heart-shape in <xref ref-type="fig" rid="F3">Figure 3B</xref> has the modality of its pictorial form, which is grammatically a <italic>monomorph</italic>&#x02014;an isolatable visual form that can stand alone (Cohn, <xref ref-type="bibr" rid="B10">2018</xref>; Schilperoord and Cohn, <xref ref-type="bibr" rid="B68">2021</xref>)&#x02014;while its conceptual specification is also the object HEART. Because the word &#x0201C;heart&#x0201D; shares its meaning with and provides a label for the heart-shape, the word (<xref ref-type="fig" rid="F3">Figure 3A</xref>) and base image (<xref ref-type="fig" rid="F3">Figure 3B</xref>) can be thought of as sister schemas (Jackendoff and Audring, <xref ref-type="bibr" rid="B41">2020</xref>). This is expressed by the similar indices (&#x0201C;1&#x0201D;) shared across levels and modalities in <xref ref-type="fig" rid="F3">Figures 3A,B</xref>.</p>
<p>The heart-shape has at least two additional entries. First, as discussed in our introduction, it can be used as a verb in a construction, as in the sentence &#x0201C;I<inline-graphic xlink:href="frai-04-778060-i0001.tif"/> NY.&#x0201D; The heart-shape within this construction is characterized in <xref ref-type="fig" rid="F3">Figure 3C</xref>. Here, the heart-shape is pronounced with the phonology /l&#x0028C;v/ and plays the role of a verb in a canonical sentence schema, which has the conceptual structure of an event of LOVE with arguments corresponding to the noun phrases. This example shows how not only words are encoded in a lexicon, but also grammatical constructions with open slots that can be filled by other encoded lexical items. It also shows that lexical items can be multimodal, i.e., encoded across modalities.</p>
<p>An additional encoding of the heart-shape concerns its usage as a visual affix in graphic representations, as in <xref ref-type="fig" rid="F3">Figure 3D</xref>, such as an &#x0201C;upfix,&#x0201D; an object that floats &#x0201C;up&#x0201D; above a character&#x00027;s head to indicate a cognitive or emotional state (Cohn, <xref ref-type="bibr" rid="B7">2013b</xref>, <xref ref-type="bibr" rid="B10">2018</xref>). The graphic form of this usage places a heart-shape above or near a character&#x00027;s head, shorthanded here to &#x0201C;REGION&#x0201D; for a visual region of an image that would act as a visual variable. At the level of the visual grammar, the heart-shape corresponds to an affix, which cannot stand alone, attaching to the character which is a monomorph to then form a larger monomorph (Cohn, <xref ref-type="bibr" rid="B10">2018</xref>). This corresponds to two potential meanings: a transitive case when the heart reflects the event LOVE with arguments for its morphological stem and some other entity (i.e., the chicken loves something), or, alternatively, a state of an argument corresponding to the morphological stem as being IN-LOVE (i.e., the chicken is in love).</p>
<p>Expressions in any modality thus make use of, and can combine, these encoded lexical items which include information from all three components of the Parallel Architecture. Stored lexical items can range in size from pieces of form-meaning mappings (like affixes), to whole isolable forms (words, monomorphs) and to grammatical constructions. Because the range of complexity is accessible to all modalities, the combination of modalities within the model allows for multimodal constructions (as in <xref ref-type="fig" rid="F3">Figure 3C</xref>).</p>
<p>As described above, our model posits that all modalities persist within the Parallel Architecture simultaneously, making use of semantics and grammar with modality-specific affordances. There is no &#x0201C;flow&#x0201D; of an &#x0201C;amodal&#x0201D; language into one modality or another (aside from cross-modal correspondences like writing), because all modalities are co-present and functional as part of a holistic system. We thus posit that the determination of a system&#x00027;s complexity depends on how it may become nurtured across development. The correspondences between each natural modality (vocal, bodily, graphic) and conceptual structures persist as &#x0201C;resilient&#x0201D; features (Goldin-Meadow, <xref ref-type="bibr" rid="B32">2003b</xref>) of an innate, &#x0201C;core&#x0201D; meaning-making system. That is, humans innately have a capacity to create simple expressions (single units, linear sequences) of sounds, bodily motions, and drawings, which persist no matter the additional development. Modalities can further develop as full linguistic systems when also engaging substantial grammars and lexicons.</p>
<p>Thus, a person will develop a sign language if they receive the requisite exposure and practice with a system that provides them with a lexicon and grammar. Yet, even if a person does not learn a sign language, in typically-developing circumstances they retain their resilient ability to express meaning with gestures (Goldin-Meadow, <xref ref-type="bibr" rid="B32">2003b</xref>), just as fluent signers also retain the use of gestures (Marschark, <xref ref-type="bibr" rid="B52">1994</xref>; Emmorey, <xref ref-type="bibr" rid="B20">2001</xref>). Similarly, if a person does not learn a full visual language (often reflected in the statement of &#x0201C;I can&#x00027;t draw&#x0201D;), they retain the ability to create basic drawings (Cohn, <xref ref-type="bibr" rid="B5">2012</xref>). The complexity that each modality may develop into is thus determined by exposure to a representational system in one&#x00027;s environment. Nevertheless, no matter what level of complexity is achieved in development for each modality, all modalities persist as part of a holistic expressive system. These issues address the third question above about acquisition.</p>
</sec>
<sec>
<title>Multimodal Expressions</title>
<p>Since unimodal expressions across modalities can arise out of different activation patterns within the Parallel Architecture, multimodal expressions involve simultaneous emergent interactions of unimodal expressions. While there are numerous such potential emergent interactions, we provide three here as examples. First, consider someone saying the sentence <italic>I caught a tiny fish</italic> while simultaneously making a small pinching gesture. As diagrammed in <xref ref-type="fig" rid="F4">Figure 4A</xref>, this interaction would involve a grammatically complex sentence with a one-unit gesture. The modalities and grammars remain independent of each other, but they correspond to a common conceptual structure, reflecting their shared and/or constructed meaning. Such convergence into a common conceptual structure aligns with McNeill&#x00027;s (<xref ref-type="bibr" rid="B54">1992</xref>) notion of a &#x0201C;growth point,&#x0201D; the common origin of meaning across both speech and gesture.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Multimodal interactions arising in the Parallel Architecture. Emergent states here describe <bold>(A)</bold> co-speech gesture, <bold>(B)</bold> text-emoji relationships, and <bold>(C)</bold> a visual sequence using a narrative grammar alongside grammatical text. <italic>Savage Chickens</italic> is &#x000A9; 2021 Doug Savage.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-04-778060-g0004.tif"/>
</fig>
<p>Next, consider the interaction which occurs in <xref ref-type="fig" rid="F4">Figure 4B</xref>, a combination of a written sentence and an emoji. As with co-speech gestures, this interaction involves a grammatically complex sentence with another modality using a one-unit grammar, here an emoji instead of a gesture (Cohn et al., <xref ref-type="bibr" rid="B14">2019</xref>; Gawne and McCulloch, <xref ref-type="bibr" rid="B29">2019</xref>). Unlike the speech-gesture combination, though, both expressions here originate from the graphic modality. What differs is the cognitive representation of these expressions. The pizza emoji is a natural visual representation, here represented by the light gray highlighting in the &#x0201C;graphic&#x0201D; modality box, along with light gray highlighting across grammar and meaning. The text is signaled here with the dark gray highlighting. Here, the vocal modality interfaces with the graphic modality to create the cross-modal correspondence reflected in writing (see the vertical double-headed arrow), which then interfaces with grammar and meaning.</p>
<p>Finally, consider the comic strip in <xref ref-type="fig" rid="F4">Figure 4C</xref>. This representation again involves both pictorial and textual information in a shared graphic modality. However, both text and images use complex grammars. The written text uses syntactic structure (both uttered in, and &#x0201C;inherent&#x0201D; to, the world of the pictures), while the visual sequence involves a narrative structure which also uses recursive constituent structures (detailed further below, and in <xref ref-type="fig" rid="F5">Figure 5</xref>). While only this complex interaction is diagrammed here, the sequence also involves a simple grammar in the one-unit utterance &#x0201C;ummmmm&#x02026;&#x0201D; in the third panel. This sequence therefore uses two complex grammars in addition to the simple grammar, compared to the speech/gesture and text/emoji interactions which used a combination of complex and simple grammars.</p>
</sec>
</sec>
<sec id="s3">
<title>Relationships of Modality and Meaning</title>
<p>Within the Parallel Architecture, emergent interactions can give rise to unimodal and multimodal expressions. Given that this tripartite structure operates across all modalities of expression, each substructure internally has the possibility of interactions when mixing representations across or within modalities. That is, multimodality can involve interactions between modalities, between meanings expressed by modalities, and/or between the grammatical structures manifested in modalities. Interactions within each structure operate independently of the other structures, but coalesce in the broader expressivity of communication. For example, though all multimodal interactions in <xref ref-type="fig" rid="F4">Figure 4</xref> involve two modalities, the interface between speech and gestures differs in nature from the interface between writing and pictures. Similarly, the multimodal interactions between grammars in <xref ref-type="fig" rid="F4">Figures 4A,B</xref> combine a complex grammar with a simple single unit, while the example in <xref ref-type="fig" rid="F4">Figure 4C</xref> brings together complex grammars. These differences imply characterizable interactions between elements within each substructure of the Parallel Architecture. We here briefly discuss interactions between modalities and meanings, before progressing to more detail about grammars and their interactions.</p>
<sec>
<title>Modality Relations</title>
<p>Interactions between modalities are the primary way that we experience multimodality, since the sensory signals of modalities (light, sound) provide our only overt access to these messages (Jackendoff, <xref ref-type="bibr" rid="B38">2007</xref>). When these sensory signals allow for a singular experience, we follow Clark (<xref ref-type="bibr" rid="B4">1996</xref>) in characterizing them as <italic><bold>composite signals</bold></italic>. Composite signals are aggregations of multiple modalities together into a unified multimodal unit. This creation of unitized composite signals may not be a binary matter, but instead operates across a continuum. We here focus on two primary ways that composite signals may be made based on the affordances of modalities&#x00027; sensory signals.</p>
<p>A first way that modalities interact is through the coordination of the modalities themselves. These relations can be characterized as <italic>alignment</italic> or <italic>correspondence</italic> between the sensory signals in different modalities (Rasenberg et al., <xref ref-type="bibr" rid="B65">2020</xref>), which is constrained by the affordances of how different modalities convey information. For example, speech is produced vocally and received auditorily, while bodily motions are produced through the body and received visually (or haptically), yet both unfurl across a duration of time allowing their alignment. In such temporal correspondence, expressions may come with various degrees of synchrony to create a composite unit. Thus, a small pinching gesture depicting size might be predicted to align with the word &#x0201C;tiny&#x0201D; in <italic>I caught a tiny fish</italic>, not with the word &#x0201C;caught.&#x0201D;</p>
<p>Temporal correspondence between modalities can use the simultaneous production of modalities in time as a way to cue their relationship to each other. While the process of writing or drawing also unfurls in time (Willats, <xref ref-type="bibr" rid="B74">2005</xref>; Cohn, <xref ref-type="bibr" rid="B5">2012</xref>; Wilkins, <xref ref-type="bibr" rid="B72">2016</xref>), this temporality often disappears once the process is completed, after which only a static form persists. Without duration to align modalities, relationships between pictures and words therefore use a spatial correspondence, through the degree to which modalities share a common region and/or use cues to integrate them into a composite multimodal unit (Cohn, <xref ref-type="bibr" rid="B6">2013a</xref>). The least integrated type of multimodal interaction keeps text and pictures fully separate (as in most academic articles, including this one), while greater integration can be facilitated by devices like labels or speech balloons. For example, <xref ref-type="fig" rid="F4">Figure 4C</xref> uses the phrase (<italic>Our love is like&#x02026;</italic>) in two ways: it is interfaced to the images with the device of a speech balloon in the first panel, while the same phrase appears written on a piece of paper within the storyworld in the third panel. The text is the same in both cases, but it interfaces in two different ways with the pictures.</p>
</sec>
<sec>
<title>Meaning Relations</title>
<p>Most theories of multimodality focus on categorizing the ways that modalities meaningfully interact (see Bateman, <xref ref-type="bibr" rid="B1">2014</xref> for review). These categorizations often expand on balanced or imbalanced semantic relationships where information expressed in one modality may support, elaborate, or extend the information expressed in another modality (Martinec and Salway, <xref ref-type="bibr" rid="B53">2005</xref>; Royce, <xref ref-type="bibr" rid="B66">2007</xref>; Kress, <xref ref-type="bibr" rid="B47">2009</xref>; Painter et al., <xref ref-type="bibr" rid="B60">2012</xref>; Bateman, <xref ref-type="bibr" rid="B1">2014</xref>). We here characterize the global &#x0201C;balance&#x0201D; of meaning between modalities as the <italic><bold>semantic weight</bold></italic> of a multimodal utterance. When meaning is conveyed in one modality more than another, it carries more of the &#x0201C;weight&#x0201D; of the overall message. Below we characterize semantic weight as a binary distinction, but it is likely proportional along a scale (again in line with our &#x0201C;weight&#x0201D; metaphor).</p>
<p>Multimodal interactions that are <italic><bold>balanced</bold></italic> in their semantic weight involve multiple modalities with relatively equal contribution of meaning, while <italic><bold>imbalanced</bold></italic> semantic weight places the locus of meaning primarily in one modality. Consider the utterances in (2) as if they were sent as text messages:</p>
<p>(2)</p>
<p>a) Would you like to eat pizza<inline-graphic xlink:href="frai-04-778060-i0002.tif"/> (imbalanced)</p>
<p>b) Would you like to eat pizza<inline-graphic xlink:href="frai-04-778060-i0003.tif"/> (balanced)</p>
<p>Both of these messages would be diagrammed as in <xref ref-type="fig" rid="F4">Figure 4B</xref>, as sentences with a single emoji. In (2a) the sentence is followed by a pizza emoji, which is coreferential to the word &#x0201C;pizza&#x0201D; in the text. Deleting the pizza emoji would have little impact on the overall gist, suggesting that the writing is more informative and thus carries more semantic weight. Omitting the sentence, however, and leaving only the pizza emoji would certainly change the meaning of the message. With the writing carrying more semantic weight than the image, this implies an imbalanced relationship.</p>
<p>This differs from (2b) where the winking face emoji implies an innuendo or at least some added information not conveyed by the text. Omission of either the winking emoji or the text here would alter the overall expression&#x00027;s gist, implying that both modalities substantially contribute, and thus have a balanced relationship. This balanced semantic weight arises in part because the winking emoji here maintains no direct coreference to the units of the sentence, unlike the coreferential relationship between the word <italic>pizza</italic> and the pizza emoji in (2a). With no direct coreference, this multimodal relationship would then require further inferencing to resolve in the Conceptual Structures.</p>
</sec>
</sec>
<sec id="s4">
<title>Grammatical Complexity</title>
<p>Before progressing to describe relations between grammars, we first will elaborate on our broad categorization of grammars as either simple or complex by detailing the range of grammatical complexity, following Jackendoff and Wittenberg&#x00027;s (<xref ref-type="bibr" rid="B42">2014</xref>) hierarchy of grammars. In contrast to the idealization of grammatical structures in the classic Chomskyan hierarchy (Chomsky, <xref ref-type="bibr" rid="B3">1956</xref>), this hierarchy provides a more ecological characterization of the complexity of combinatorial principles used to map form to meaning.</p>
<p>Jackendoff and Wittenberg&#x00027;s (<xref ref-type="bibr" rid="B42">2014</xref>) hierarchy of grammars is shown in <xref ref-type="table" rid="T1">Table 1</xref>, together with their basic schemas. As will be demonstrated, this hierarchy can be applied to characterize the sequencing across all modalities, and thus we have modified the terminology to apply to this broader context. In this sense, the hierarchy of grammars can be viewed as a modality-independent capacity, and the manifestation of grammars in different modalities may vary in the representations that they use. Although a spoken word and a graphic picture obviously differ in how they convey meaning, both represent a single isolable <italic>utterance</italic>, and thus we argue both are characterizable by the types of grammar in the hierarchy. Similarly, the syntactic structures used in verbal languages and the narrative structures used in visual languages differ in their representations, but both employ the same combinatorial principles. The hierarchy of grammars thus characterizes the abstract means of combinatoriality, which may become manifest in the representational schemas encoded in memory for a given system.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>A hierarchy of grammars by Jackendoff and Wittenberg (<xref ref-type="bibr" rid="B42">2014</xref>), with modified terminology to apply across modalities.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Grammar complexity</bold></th>
<th valign="top" align="left"><bold>Grammar type</bold></th>
<th valign="top" align="left"><bold>Schemas</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Simple</td>
<td valign="top" align="left">One-unit grammar</td>
<td valign="top" align="left">[<sub>Utterance</sub> Unit]</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Two-unit grammar</td>
<td valign="top" align="left">[<sub>Utterance</sub> Unit&#x02014;Unit]</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Linear grammar</td>
<td valign="top" align="left">[<sub>Utterance</sub> Unit&#x02014;Unit&#x0002A;]</td>
</tr> <tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Complex</td>
<td valign="top" align="left">Simple phrase grammar</td>
<td valign="top" align="left"><break/> [<sub>Utterance</sub> Unit/Phrase&#x0002A;] <break/> [<sub>Phrase</sub> Unit&#x02014;Unit] (2-unit phrase) <break/> [<sub>Phrase</sub> Unit&#x0002A;] (unlimited phrase)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Categorical grammar</td>
<td valign="top" align="left">[<sub>Utterance</sub> Unit<sub>x</sub> &#x02013;Unit<inline-formula><mml:math id="M1"><mml:msubsup><mml:mrow></mml:mrow><mml:mrow><mml:mtext>y</mml:mtext></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>]</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Recursive grammar</td>
<td valign="top" align="left"><break/> [<sub>Utterance</sub> Unit/Phrase&#x0002A;] <break/> [<sub>Phrase</sub> Unit/Phrase&#x0002A;]</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As argued above, simple grammars characterize organizations where conceptual structures motivate the sequencing, with little contribution from the grammar itself. At their simplest, <italic><bold>one-unit grammars</bold></italic> of verbal expressions include words like <italic>abracadabra, gadzooks, ouch!</italic>, or <italic>ummmmm&#x02026;</italic> as in the third panel of <xref ref-type="fig" rid="F4">Figure 4C</xref>, which are not encoded with syntactic categories that would place them into sentence structures (Jackendoff, <xref ref-type="bibr" rid="B37">2002</xref>; Jackendoff and Wittenberg, <xref ref-type="bibr" rid="B42">2014</xref>). Ideophones, like <italic>pow</italic> or <italic>kablam</italic>, also use one-unit grammars, generally maintaining morphosyntactic independence from their sentence contexts (Dingemanse, <xref ref-type="bibr" rid="B19">2017</xref>). In the bodily modality, most gesticulations and gestural emblems remain as one-unit grammars that cannot be put into a coherent sequence (McNeill, <xref ref-type="bibr" rid="B54">1992</xref>; Goldin-Meadow, <xref ref-type="bibr" rid="B31">2003a</xref>), and individual pictures constitute single unit expressions in graphic form, whether complex compositions of whole scenes, like paintings, or simple units, like icons or emoji.</p>
<p><italic><bold>Two-unit grammars</bold></italic> are only slightly more complex. In speech, two-unit sequences appear in children&#x00027;s two-word stage of language learning and in pivot grammars (<italic>Lake X</italic> vs. <italic>X Lake</italic>) (Jackendoff and Wittenberg, <xref ref-type="bibr" rid="B42">2014</xref>). Two-unit grammars can also characterize compounds, which allow for a wide range of meaningful relations between units (Jackendoff, <xref ref-type="bibr" rid="B39">2009</xref>). In the visual modality, two-unit grammars are used across several constructions illustrating yes/no contrasts (as in signs showing which foods one can and cannot have in a classroom), or pairs of images denoting comparisons or before-after causal relations (Plug et al., <xref ref-type="bibr" rid="B62">2018</xref>; Schilperoord and Cohn, <xref ref-type="bibr" rid="B69">2022</xref>). Overall two-unit grammars allow a wide range of construals of relations between juxtaposed units which are not grammatically encoded.</p>
<p>Simple grammars without constraints on length may manifest as linear sequences with only meaningful associations, a <italic><bold>linear grammar</bold></italic>. In speech this occurs in contexts like lists, in the speech of some aphasics, and in languages that require only semantic heuristics to guide their sequencing (Jackendoff and Wittenberg, <xref ref-type="bibr" rid="B42">2014</xref>, <xref ref-type="bibr" rid="B43">2017</xref>). Visual sequences use these linear relations in visual lists, as in instructions about what one may or may not carry onto a plane, what you can do in a park, or what tools to use when assembling furniture (Cohn, <xref ref-type="bibr" rid="B12">2020a</xref>). Visual linear grammars also appear when people type numerous related emoji in an unstructured way, such as several emoji related to birthday parties (Cohn et al., <xref ref-type="bibr" rid="B14">2019</xref>), as in: <inline-graphic xlink:href="frai-04-778060-i0004.tif"/>.</p>
<p>In contrast to simple grammars, complex grammars contribute representational structure to their constituent parts beyond only semantic relations. <italic><bold>Simple phrase grammars</bold></italic> segment a sequence into constituent parts with one level of embedding. <italic><bold>Categorical grammars</bold></italic>, called &#x0201C;part-of-speech grammars&#x0201D; by Jackendoff and Wittenberg (<xref ref-type="bibr" rid="B42">2014</xref>), differentiate the units in a sequence with roles which may function with varying salience and distributions in a sequence. In spoken and signed modalities, such categories are typically nouns, verbs, and other syntactic classes. Visual information seems to be less optimized for expressing sentence level parts-of-speech (Cohn et al., <xref ref-type="bibr" rid="B14">2019</xref>), and manifests more naturally as narrative level categories (Cohn, <xref ref-type="bibr" rid="B8">2013c</xref>, <xref ref-type="bibr" rid="B13">2020b</xref>). Though we list simple phrase grammars and categorical grammars sequentially, they are alternative options at the same level of complexity.</p>
<p>Finally, <italic><bold>recursive grammars</bold></italic> allow units or constituents of one type to embed in constituents of that same type. In <xref ref-type="table" rid="T1">Table 1</xref>, we use the notation of Unit/Phrase to indicate a concatenation of either units or phrases, with a Kleene star that indicates that either units or phrases can extend to unlimited length, following the notation of Jackendoff and Wittenberg (<xref ref-type="bibr" rid="B42">2014</xref>). Recursive grammars are the most complex level of grammar and can manifest sentence structures, whether in spoken or sign language, as in <xref ref-type="fig" rid="F5">Figure 5A</xref>. Recursive grammars have also been shown to organize narrative structures in visual sequences (Cohn, <xref ref-type="bibr" rid="B13">2020b</xref>), which involve roles played by different units and recursive structures organizing units into hierarchic constituents. For example, the sequence in <xref ref-type="fig" rid="F5">Figure 5B</xref> uses a canonical narrative schema (Establisher-Initial-Prolongation-Peak-Release) embedded within another narrative structure.</p>
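<p>As an informal illustration only, the hierarchy of grammars can be sketched as a small data structure. The sketch below assumes that an utterance is given as a flat list of units; the names <italic>GrammarType</italic>, <italic>complexity</italic>, and <italic>smallest_simple_grammar</italic> are hypothetical and are not part of Jackendoff and Wittenberg&#x00027;s formalism. It groups the six grammar types into the simple and complex classes, and reports the least powerful simple schema that fits a flat sequence of a given length.</p>
<preformat>
```python
from enum import Enum


class GrammarType(Enum):
    # Hypothetical labels for the six grammar types discussed in the text.
    ONE_UNIT = "one-unit"
    TWO_UNIT = "two-unit"
    LINEAR = "linear"
    SIMPLE_PHRASE = "simple phrase"
    CATEGORICAL = "categorical"
    RECURSIVE = "recursive"


# The three grammar types that the hierarchy classes as simple.
SIMPLE_TYPES = {GrammarType.ONE_UNIT, GrammarType.TWO_UNIT, GrammarType.LINEAR}


def complexity(grammar):
    """Return 'simple' or 'complex' for a grammar type."""
    return "simple" if grammar in SIMPLE_TYPES else "complex"


def smallest_simple_grammar(units):
    """Return the least powerful simple grammar whose schema fits a flat sequence.

    Assumption: the sequence has no constituent structure or categories,
    so only its length is used to choose among the simple schemas.
    """
    if not units:
        raise ValueError("an utterance needs at least one unit")
    if len(units) == 1:
        return GrammarType.ONE_UNIT
    if len(units) == 2:
        return GrammarType.TWO_UNIT
    return GrammarType.LINEAR


if __name__ == "__main__":
    print(smallest_simple_grammar(["ouch"]).value)
    print(complexity(GrammarType.RECURSIVE))
```
</preformat>
<p>Length alone cannot identify a complex grammar, since complex grammars differ from simple ones in the representational structure they contribute rather than in the number of units; the sketch therefore leaves complex grammars as labels only.</p>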
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Grammatical structures in both <bold>(A)</bold> the syntactic structure of spoken languages and <bold>(B)</bold> the narrative structures of visual languages. <italic>Savage Chickens</italic> is &#x000A9; 2021 Doug Savage.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-04-778060-g0005.tif"/>
</fig>
</sec>
<sec id="s5">
<title>Grammatical Relationships</title>
<p>Multimodal (and unimodal) interactions may involve expressions using grammars at various levels of the hierarchy of grammatical types. We now turn to detailing how these grammars might interact. Here we posit two principles that characterize the ways that grammars of different expressions and various complexity interact.</p>
<p>The first principle of grammatical interactions is <italic><bold>symmetry</bold></italic>. Following the hierarchy of grammars, grammars broadly fall into classes of simple and complex types. Simple grammars provide little structure of their own, and merely facilitate a mapping between a modality and meaning without additional representations. Complex grammars further contribute representations of categorical roles and/or constituent structure. Given these two broad possibilities, we can characterize interactions between grammars as either <italic>symmetrical</italic> between grammars of the same type, i.e., simple with simple, or complex with complex, or <italic>asymmetrical</italic>, i.e., simple with complex. In the examples in <xref ref-type="fig" rid="F4">Figures 4A,B</xref>, the full sentences (complex) interact with single-unit expressions (simple), thus creating asymmetrical interactions, while the use of two recursive relationships in <xref ref-type="fig" rid="F4">Figure 4C</xref> exemplifies a symmetrical interaction.</p>
<p>The second principle concerns the <italic><bold>allocation</bold></italic> of grammars relative to each other. Allocation specifies the relative independence of one grammar to another grammar. In <italic>independent</italic> relations, the grammars remain fully formed with no direct connection to each other, thus operating in parallel. This is found in all the examples in <xref ref-type="fig" rid="F4">Figure 4</xref>. In <italic>substitutive</italic> relations on the other hand, the units or combinatorics of one grammar function within and are determined by another grammar. This occurs in all the examples of the [<sub>S</sub> Subject&#x02014;Picture<sub>Verb</sub>&#x02013;Object] construction in <xref ref-type="fig" rid="F1">Figure 1</xref>, where the verb-function of the images is determined by the grammar of the verbal sentences.</p>
<p>Below, we further elaborate on the formalisms of these types of interactions. We will then elaborate on their manifestations in the contexts of unimodal and multimodal interactions.</p>
<sec>
<title>Symmetry</title>
<p>As described above, symmetry is a principle of grammatical interactions (GI, from here on) that characterizes the relative complexity of interacting grammars. As our broad classes involve simple and complex grammars, symmetry describes the ways these classes interact, as detailed in <xref ref-type="table" rid="T2">Table 2</xref>. Crossing simple and complex grammars gives rise to symmetrical relations, which maintain the same complexity of grammars, and to asymmetrical relations, which arise when grammars of different complexity are used. We now turn to detailing the properties of these relations.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Possibilities for multimodal interactions between grammars of two modalities.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th style="border-right: thin solid #000000;"/>
<th valign="top" align="center" colspan="3" style="border-right: thin solid #000000;"><bold>Simple</bold></th>
<th valign="top" align="center" colspan="3"><bold>Complex</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td/>
<td style="border-right: thin solid #000000;"/>
<td valign="top" align="center"><bold>One-unit</bold></td>
<td valign="top" align="center"><bold>Two-unit</bold></td>
<td valign="top" align="center" style="border-right: thin solid #000000;"><bold>Linear</bold></td>
<td valign="top" align="center"><bold>Simple phrase</bold></td>
<td valign="top" align="center"><bold>Categorical</bold></td>
<td valign="top" align="center"><bold>Recursive</bold></td>
</tr> <tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Simple</td>
<td valign="top" align="left" style="border-right: thin solid #000000;">One-unit</td>
<td/>
<td/>
<td style="border-right: thin solid #000000;"/>
<td/>
<td/>
<td/>
</tr>
<tr>
<td/>
<td valign="top" align="left">Two-unit</td>
<td valign="top" align="center" style="border-right: thin solid #000000; border-left: thin solid #000000;" colspan="3">Symmetrical Simple</td>
<td valign="top" align="center" colspan="3">Asymmetrical</td>
</tr>
<tr>
<td/>
<td valign="top" align="left" style="border-right: thin solid #000000;">Linear</td>
<td/>
<td/>
<td style="border-right: thin solid #000000;"/>
<td/>
<td/>
<td/>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Complex</td>
<td valign="top" align="left" style="border-right: thin solid #000000;">Simple phrase</td>
<td/>
<td/>
<td style="border-right: thin solid #000000;"/>
<td/>
<td/>
<td/>
</tr>
<tr>
<td/>
<td valign="top" align="left" style="border-right: thin solid #000000;">Categorical</td>
<td valign="top" align="center" colspan="3" style="border-right: thin solid #000000;">Asymmetrical</td>
<td valign="top" align="center" colspan="3" style="border-right: thin solid #000000;">Symmetrical Complex</td>
</tr>
<tr>
<td/>
<td valign="top" align="left" style="border-right: thin solid #000000;">Recursive</td>
<td/>
<td/>
<td style="border-right: thin solid #000000;"/>
<td/>
<td/>
<td/>
</tr>
</tbody>
</table>
</table-wrap>
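<p>The crossing of classes in <xref ref-type="table" rid="T2">Table 2</xref> can be sketched as a simple lookup. This is a minimal illustration under the assumption that each interacting grammar is identified by a plain string label; the names <italic>grammar_class</italic> and <italic>symmetry</italic> are hypothetical and introduced here only for exposition.</p>
<preformat>
```python
# Hypothetical string labels for the grammar types, grouped by class.
SIMPLE = ("one-unit", "two-unit", "linear")
COMPLEX = ("simple phrase", "categorical", "recursive")


def grammar_class(grammar_type):
    """Map a grammar type label onto its broad class."""
    if grammar_type in SIMPLE:
        return "Simple"
    if grammar_type in COMPLEX:
        return "Complex"
    raise ValueError("unknown grammar type: " + repr(grammar_type))


def symmetry(first, second):
    """Classify the interaction between two grammars by their classes."""
    first_class = grammar_class(first)
    second_class = grammar_class(second)
    if first_class == second_class:
        # Same class on both sides: Symmetrical Simple or Symmetrical Complex.
        return "Symmetrical " + first_class
    return "Asymmetrical"


if __name__ == "__main__":
    # A full sentence with a single gesture or emoji.
    print(symmetry("recursive", "one-unit"))
    # A visual narrative sequence with sentences in text.
    print(symmetry("recursive", "recursive"))
```
</preformat>
<p>Because the classification depends only on class membership, any pairing within the same class receives the same label regardless of which specific grammar types are involved.</p>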
<sec>
<title>Symmetrical Simple</title>
<p>When multiple grammars are simple, we describe them as <italic>Symmetrical Simple</italic> interactions. We here collapse all of these simple types (i.e., one-unit, two-unit, and linear grammars) into a single formalism using optionality (notated with parentheses, and a Kleene star <sup>&#x0002A;</sup> for potential repetition) whereby two grammars are interacting in (GI-1):</p>
<p>(GI-1) Symmetrical Simple</p>
<p>GS<sub>1</sub>: [<sub>Utterance</sub> Unit &#x02013; (Unit<sup>&#x0002A;</sup>)]</p>
<p>GS<sub>2</sub>: [<sub>Utterance</sub> Unit &#x02013; (Unit<sup>&#x0002A;</sup>)]</p>
<p>Each of these schemas describes a simple utterance with at least one unit, possibly elaborated into two or a linear sequence. This type of interaction occurs when a single gesture comes with a single word (like making a deictic pointing gesture along with uttering &#x0201C;that&#x0201D;). It could also describe a single textual word along with a picture, such as a meme with an image and one word, or a single word along with an emoji (&#x0201C;Nice!<inline-graphic xlink:href="frai-04-778060-i0001.tif"/>&#x0201D;).</p>
</sec>
<sec>
<title>Symmetrical Complex</title>
<p>Symmetrical relationships can also persist between complex grammars. We again collapse all three complex grammars (categorical grammars, simple-phrase grammars, and recursive grammars) into a single formalism that attempts to capture these complexities. As before, simple-phrase and recursive grammars require two interacting structures, which allows the embedding of the phrasal level schema into the utterance level schema. Thus, we here divide these parts by a comma, as in (GI-2):</p>
<p>(GI-2) Symmetrical Complex</p>
<p>GS<sub>1</sub>: [<sub>Utterance</sub> Unit<sub>x</sub>/Phrase<inline-formula><mml:math id="M2"><mml:msubsup><mml:mrow></mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>], [<sub>Phrase</sub> Unit<sub>x</sub>/Phrase<sub>x</sub> <sup>&#x0002A;</sup>]</p>
<p>GS<sub>2</sub>: [<sub>Utterance</sub> Unit<sub>x</sub>/Phrase<inline-formula><mml:math id="M3"><mml:msubsup><mml:mrow></mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>], [<sub>Phrase</sub> Unit<sub>x</sub>/Phrase<sub>x</sub> <sup>&#x0002A;</sup>]</p>
<p>Within each grammatical structure, both schemas specify that a unit or phrase, potentially of a particular category (subscript x), forms an utterance or a phrase. An example of such an expression that involves two complex grammars would be a comic strip, as in <xref ref-type="fig" rid="F4">Figure 4C</xref>, with a complex visual narrative sequence interacting with sentences in text (such as in emergent carriers like balloons or captions). Another example of such an interaction might be the expression of a bimodal bilingual who both speaks and signs at the same time. In video, subtitles appearing while a person talks would also use a Symmetrical Complex interaction, with the degree of redundancy in meaning between text and speech depending on the quality of subtitling (when both are in the same language) or on the quality of translation (when they are in different languages).</p>
</sec>
<sec>
<title>Asymmetrical</title>
<p>Interactions between one simple grammar and one complex grammar are described as asymmetrical. Using the same formalisms as above, asymmetrical interactions are characterized as in (GI-3):</p>
<p>(GI-3) Asymmetrical</p>
<p>GS<sub>1</sub>: [<sub>Utterance</sub> Unit &#x02013; (Unit<sup>&#x0002A;</sup>)]</p>
<p>GS<sub>2</sub>: [<sub>Utterance</sub> Unit<sub>x</sub>/Phrase<inline-formula><mml:math id="M4"><mml:msubsup><mml:mrow></mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>], [<sub>Phrase</sub> Unit<sub>x</sub>/Phrase<inline-formula><mml:math id="M5"><mml:msubsup><mml:mrow></mml:mrow><mml:mrow><mml:mtext>x</mml:mtext></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>]</p>
<p>A typical example of an asymmetrical grammatical interaction is a gesticulation that runs concurrently with speech, often using a one-unit grammar along with a complex sentence grammar (McNeill, <xref ref-type="bibr" rid="B54">1992</xref>; Clark, <xref ref-type="bibr" rid="B4">1996</xref>). Similarly, a single emoji placed at the end of a typed sentence enters into the same relationship (Cohn et al., <xref ref-type="bibr" rid="B14">2019</xref>; Gawne and McCulloch, <xref ref-type="bibr" rid="B29">2019</xref>). These are the examples in <xref ref-type="fig" rid="F4">Figures 4A,B</xref>, which both use asymmetrical interactions. Conversely, a visual sequence with onomatopoeia, such as a fight scene with sound effects (like a sequence of one person punching another with the text &#x0201C;Pow!&#x0201D;), would have a complex narrative grammar along with the one-unit word. All of these examples are asymmetrical in the interactions between their grammars.</p>
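<p>The three symmetry types in (GI-1) to (GI-3) can be read as a small decision procedure over the complexity of the two contributing grammars. The following Python sketch is purely illustrative and is not part of the formalism itself: the grammar type labels follow the simple and complex types named above, while the names <monospace>SIMPLE</monospace>, <monospace>COMPLEX</monospace>, and <monospace>classify_symmetry</monospace> are hypothetical, as is the assumption that a typed sentence uses a recursive grammar.</p>
<preformat>
```python
# Illustrative sketch only. The type labels follow the simple and complex
# grammar types named in the text; all identifiers here are hypothetical.
SIMPLE = {"one-unit", "two-unit", "linear"}
COMPLEX = {"categorical", "simple-phrase", "recursive"}


def classify_symmetry(grammar_1, grammar_2):
    """Return the symmetry type of an interaction between two grammar types."""
    for grammar in (grammar_1, grammar_2):
        if grammar not in SIMPLE | COMPLEX:
            raise ValueError("unknown grammar type: " + grammar)
    if grammar_1 in SIMPLE and grammar_2 in SIMPLE:
        return "Symmetrical Simple"   # both grammars simple, as in (GI-1)
    if grammar_1 in COMPLEX and grammar_2 in COMPLEX:
        return "Symmetrical Complex"  # both grammars complex, as in (GI-2)
    return "Asymmetrical"             # one simple, one complex, as in (GI-3)


# A one-unit emoji following a typed sentence (assumed here to be recursive).
print(classify_symmetry("one-unit", "recursive"))  # prints "Asymmetrical"
```
</preformat>
<p>Because the classification depends only on whether each grammar is simple or complex, the order of the two arguments does not matter, which mirrors the converse onomatopoeia example above.</p>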
</sec>
</sec>
<sec>
<title>Allocation</title>
<p>While symmetry involves the relative complexity of the grammars involved, allocation relates to the way in which those grammars are distributed relative to each other. This distribution gives us two types: Independent and Substitutive. Independent allocation allows each grammar to exist on its own without any direct interaction, while Substitutive allocation places one grammar as a unit within another grammar. These notions have much in common with prior work such as Clark&#x00027;s (<xref ref-type="bibr" rid="B4">1996</xref>) description of &#x0201C;concurrent&#x0201D; and &#x0201C;component&#x0201D; co-speech gestures, here now elaborated across all modalities and operationalized to grammatical interactions specifically. In formal terms of our Parallel Architecture, allocation can be captured by how different grammars may be coindexed.</p>
<sec>
<title>Independent</title>
<p>We begin with Independent allocations. In this allocation, the units of both systems are independently distinguishable for whatever grammatical roles may be played (if any) within and across systems. Independent allocation occurs in all interactions in <xref ref-type="fig" rid="F4">Figure 4</xref>. The critical insight here is that the grammatical allocation is mediated by the interactions between the modalities. That is, the temporal or spatial correspondence between modalities themselves allows for the interfacing of grammars, but on their own, the grammars remain independent. Allocation between grammars here is imposed by the circumstances of the modality interfaces. Consider an interaction between text and an emoji like: &#x0201C;I love pizza<inline-graphic xlink:href="frai-04-778060-i0002.tif"/>,&#x0201D; formalized in <xref ref-type="fig" rid="F6">Figure 6A</xref>.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Independent grammatical interactions between <bold>(A)</bold> text and emoji and <bold>(B)</bold> speech and gesture.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-04-778060-g0006.tif"/>
</fig>
<p>Here, the pizza emoji is merely spatially integrated with the sentence (by following it). This spatial correspondence creates an interface between the adjacent word &#x0201C;pizza&#x0201D; and the emoji. Because both the word &#x0201C;pizza&#x0201D; and the pizza emoji are maintained, the expression warrants a coreferential relationship, leading to coindexation in the conceptual structure (notated as subscript &#x0201C;a&#x0201D;). The grammars themselves (one-unit emoji and a textual sentence) remain independent. A similar interaction appears between the spoken sentence &#x0201C;I caught a tiny fish&#x0201D; with an accompanying pinching gesture to reinforce the small magnitude (Woodin et al., <xref ref-type="bibr" rid="B75">2020</xref>), as in <xref ref-type="fig" rid="F6">Figure 6B</xref>.</p>
<p>Because these grammars remain separate, so do their individual semantics. This independence is what invites coreference between the meaningful elements, broadly as balanced or imbalanced semantic weight, and more specifically as expressed by a range of coreferential connections. In the example above, overall this interaction creates an imbalanced relationship because the text carries more semantic weight than the gesture. This creates an &#x0201C;included&#x0201D; coreferential relationship where one word semantically overlaps with the gesture, but the rest of the verbal utterance is not reflected in the bodily modality. In independent allocation, the grammars work to package the meaning of each modality separately, creating the need for multimodal meaning to emerge outside the context of grammatical constraints. That is, multimodal meaning in this case arises at the level of conceptual structure alone, given the separate but interacting contributions of each expression.</p>
<p>In these allocations, both the modalities and the meanings work to create connections between messages, while the grammars only contribute to their own expressions but not to the overall multimodal message. To reiterate, in these cases the modalities interface to create sensory alignment and/or integration in temporal or spatial correspondence. The grammars of these modalities work to package the message of each expression independent from each other. This independence puts greater demands on the conceptual structure to integrate the meanings of those separate messages, requiring the search to establish coreference between the semantics of each modality and the inferences necessary to resolve such coreference.</p>
</sec>
<sec>
<title>Substitutive</title>
<p>While independent allocation keeps each grammar separate, substitutive allocation incorporates the grammars together into one sequence. Substitution is here defined as when the grammar used by one expression is inserted as a unit within another grammar. We refer to the inserted expression as the &#x0201C;substitution&#x0201D; and to the grammar that receives the substitution as the &#x0201C;matrix grammar.&#x0201D; Thus, the grammatical role of the substitution may be determined by the top-down sequencing schema of the matrix grammar. For example, in the sentence &#x0201C;I love<inline-graphic xlink:href="frai-04-778060-i0002.tif"/>,&#x0201D; the pizza emoji is substituted for a noun in the matrix grammar, here as the Direct Object noun of a sentence. Unlike the heart in <xref ref-type="fig" rid="F3">Figure 3C</xref>, which is entrenched in the lexicon with the grammatical role of verb, here the pizza emoji is itself not encoded in the lexicon with the grammatical role of a noun. This poses a problem for unification at the level of grammar, because the pizza emoji is not encoded as a noun&#x02014;and cannot become one&#x02014;that can fill a noun slot in the syntactic construction. For example, the pizza emoji cannot express case, like regular nouns.</p>
<p>Below, we address this issue by assuming that the emoji&#x00027;s placement into a canonical sentence position following the transitive verb &#x0201C;love&#x0201D; allows it to fulfill both the semantic and grammatical argument structures. This is depicted in <xref ref-type="fig" rid="F7">Figure 7A</xref>.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Substitutive grammatical interactions between <bold>(A)</bold> a sentence of text and emoji, <bold>(B)</bold> a list of text and an emoji, and <bold>(C)</bold> speech and gesture.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-04-778060-g0007.tif"/>
</fig>
<p>Here, the modalities have no explicit interface, because the expressions of each modality become units within a single sequence, rather than co-occurring expressions. In the grammatical structure, the pizza emoji appears as an unmarked unit, while the text invokes a canonical transitive sentence construction. Substitution is thus characterized formally by an index to the entire substituted message on the outside of the utterance (here, subscript &#x0201C;a&#x0201D;), which coindexes to a single unit within the matrix grammar (here, the direct object noun). In conceptual structure, the verb &#x0201C;love&#x0201D; licenses a transitive event with an argument structure specifying both an agent (here &#x0201C;I&#x0201D;) and a patient. As the patient is not expressed overtly in the text, this argument is fulfilled with a binding operator (&#x003B1;) which then links to the conceptual structure of the pizza emoji, such that it fulfills the semantic argument of the event structure. Overall then, the substitution results in one modality fulfilling a grammatical role within, and determined by, the grammar of another modality, thereby coalescing their meaning.</p>
<p>Note that the formal challenge of substitution is how representations of one type of expression (e.g., &#x0201C;unit&#x0201D;) can unify with those of another (e.g., &#x0201C;noun&#x0201D;). Our proposal is that unification occurs solely within the conceptual structure, such that a conceptual category corresponding to Expression 1 (like the Object of PIZZA of a pizza emoji) is licensed to be unified with a conceptual category corresponding to Expression 2 (like the Object slot made available by the transitive event LOVE). Through the prototypical correspondences of that unified conceptual structure, the substituted unit can thus play a role within the matrix grammar (i.e., the unified Object prototypically corresponds to a noun, which can satisfy the grammatical constraints of the transitive verb <italic>love</italic>). We articulate this as a generalized correspondence schema in <xref ref-type="fig" rid="F8">Figure 8</xref>. To reiterate, the binding operator (&#x003B1;) reflects the unification of meaning of the substituted unit into the matrix expression&#x00027;s conceptual structure. This creates the possibility of the substituted unit&#x00027;s grammatical structure (whatever it may be) being inserted into a grammatical unit within the matrix grammar (coindex &#x0201C;a&#x0201D;).</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Abstracted correspondence schema for substitutive allocation.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-04-778060-g0008.tif"/>
</fig>
<p>Under this view, unification does not occur directly within the grammars, but is in a sense conceptually &#x0201C;coerced&#x0201D; within the grammar. To restate this more simply, the meaning of the substituted unit can satisfy the allowable meanings of the grammatical slot, and thereby can lead to acceptability of being a grammatical substitution. Note, however, that in some cases, this may lead to less satisfactory meaning unification. For example, if a substituted unit does not readily satisfy the constraints of the conceptual structure offered by the matrix grammar, this may lead to a less well-formed grammatical substitution. Consider substitutions like &#x0201C;I love<inline-graphic xlink:href="frai-04-778060-i0001.tif"/>&#x0201D; or &#x0201C;I<inline-graphic xlink:href="frai-04-778060-i0002.tif"/> eating lunch&#x0201D; where, despite the shared semantic fields of the substituted elements with the matrix grammar, the substitutions appear less felicitous. Indeed, substitutions of emoji meanings that align with their grammatical context are readily integrated into the grammar, but substituted emoji with less appropriate meanings create downstream processing costs (Cohn et al., <xref ref-type="bibr" rid="B16">2018</xref>; Scheffler et al., <xref ref-type="bibr" rid="B67">2021</xref>).</p>
<p>Given the orthogonal relationships of grammatical symmetry, substitutions can thus vary across all types of symmetry. Substitutive allocation in the context of Symmetrical Simple interactions can be formalized as:</p>
<p>(GI-4) Symmetrical Simple substitutive allocation</p>
<p>GS<sub>1</sub>: [<sub>Utterance</sub> Unit &#x02013; (Unit<sup>&#x0002A;</sup>)]<sub>1</sub></p>
<p>GS<sub>2</sub>: [<sub>Utterance</sub> [Unit] &#x02013; [Unit]<sub>1</sub> &#x02013; [Unit]<sup>&#x0002A;</sup>]</p>
<p>In this case, each grammatical structure is an utterance whereby the units have no prespecified categorical roles, and may be limited to one unit, or to sequences using two units or a linear grammar of an unlimited length (as suggested by the Kleene star <sup>&#x0002A;</sup>). For a sequence to allow for units to be substituted, the matrix grammar in a Symmetrical Simple substitutive allocation needs to use a linear grammar (here GS<sub>2</sub>), while the substitution can vary across levels of simple complexity (GS<sub>1</sub>). For example, imagine a list where one of the items is fulfilled by an image rather than a word: &#x0201C;beer, chips,<inline-graphic xlink:href="frai-04-778060-i0002.tif"/>, football.&#x0201D; We formalize this as in <xref ref-type="fig" rid="F7">Figure 7B</xref>.</p>
<p>Again, the pizza emoji is integrated in the modality by virtue of its sequencing with the text, and it remains a single utterance with a conceptual structure consistent with the substitution in <xref ref-type="fig" rid="F7">Figure 7A</xref>. This consistency reflects the integrity of the lexical entry of the pizza emoji as a unit. The textual list then uses a linear grammar whereby units lacking a grammatical category are ordered sequentially, and the whole utterance of the pizza emoji is coindexed to a single unit within that linear grammar (subscript &#x0201C;a&#x0201D;). The conceptual structure here just specifies a broader semantic field related to, say, recreation, where the semantics of the pizza emoji joins this broader category linked through the binding operator (&#x003B1;). Thus, in Symmetrical Simple substitutive allocation, units from one expression can be inserted into a matrix grammar of another, but no further grammatical role is fulfilled, because the linear grammar itself does not specify grammatical roles (unlike in <xref ref-type="fig" rid="F7">Figure 7A</xref>, where the pizza emoji plays the role of a noun).</p>
<p>Grammatical roles do become specified in Asymmetrical substitutive allocation. Here, the complex grammar of one expression uses grammatical roles, which the substitution using a simple grammar can inherit. This is what occurs in our example sentence &#x0201C;I love<inline-graphic xlink:href="frai-04-778060-i0002.tif"/>&#x0201D; in <xref ref-type="fig" rid="F7">Figure 7A</xref>, where the pizza emoji acts as a noun in the textual sentence. Generalized, asymmetrical substitutive allocations can be formalized as in (GI-5):</p>
<p>(GI-5) Asymmetrical substitutive allocation</p>
<p>GS<sub>1</sub>: [<sub>Utterance</sub> Unit &#x02013; (Unit<sup>&#x0002A;</sup>)]<sub>1</sub></p>
<p>GS<sub>2</sub>: [<sub>Utterance/Phrase</sub> [Unit<sub>x</sub>/Phrase<sub>x</sub>] &#x02013; [Unit<sub>y</sub>/Phrase<sub>y</sub>]<sub>1</sub> &#x02013; [Unit<sub>z</sub>/Phrase<sub>z</sub>]<sup>&#x0002A;</sup>]</p>
<p>Again, substitutions coindex the whole utterance of the substitution to a unit inside the utterance of the matrix grammar. While our formalized example in <xref ref-type="fig" rid="F7">Figure 7A</xref> shows an image inserted into a textual sentence, multimodal substitutions of &#x0201C;component&#x0201D; (Clark, <xref ref-type="bibr" rid="B4">1996</xref>) or &#x0201C;language-like&#x0201D; (Kendon, <xref ref-type="bibr" rid="B44">1988</xref>; McNeill, <xref ref-type="bibr" rid="B54">1992</xref>) gestures into the syntax of speech are also well attested (Kendon, <xref ref-type="bibr" rid="B44">1988</xref>; McNeill, <xref ref-type="bibr" rid="B54">1992</xref>; Clark, <xref ref-type="bibr" rid="B4">1996</xref>; Fricke, <xref ref-type="bibr" rid="B27">2013</xref>; Ladewig, <xref ref-type="bibr" rid="B49">2020</xref>). For example, this would occur when speaking &#x0201C;I caught a &#x0003C;small pinching gesture&#x0003E; fish,&#x0201D; where the pinching hand gesture fulfills the role of an adjective in the sentence corresponding to the notion of small magnitude (Woodin et al., <xref ref-type="bibr" rid="B75">2020</xref>). We diagram this scenario in <xref ref-type="fig" rid="F7">Figure 7C</xref>, which follows the same principles as our other substitutive examples in terms of co-indexation of grammars and alpha binding of conceptual structures. A reverse modality relationship occurs in bimodal bilinguals, who have proficiency in both a spoken language and sign language and have been observed to codeswitch (Emmorey et al., <xref ref-type="bibr" rid="B21">2008</xref>). This codeswitching is a substitution of spoken words into the sign language grammar.</p>
<p>Units of text can also be inserted into a visual sequence that uses a complex grammar, such as in <xref ref-type="fig" rid="F9">Figure 9A</xref>. Here, we first see one boxer reach back his arm while approaching another boxer, followed by the word &#x0201C;Pow&#x0201D; and then see a depiction of the first boxer standing over the second. We infer here that a punch occurred which must have knocked out the second boxer. However, we do not see this action: the climactic event of the visual narrative is replaced by an onomatopoeia, which sponsors inference of an event through the sound that it emits (Goldberg and Jackendoff, <xref ref-type="bibr" rid="B30">2004</xref>; Jackendoff, <xref ref-type="bibr" rid="B40">2010</xref>). We diagram this relationship in <xref ref-type="fig" rid="F9">Figure 9A</xref>.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Substitutive grammatical interactions between a visual narrative sequence and <bold>(A)</bold> an onomatopoeia and <bold>(B)</bold> a sentence.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-04-778060-g0009.tif"/>
</fig>
<p>In this example, the onomatopoeia of POW! substitutes for the Peak, i.e., the climax, in the visual narrative sequence. The matrix grammar of the visual sequence depicts the Initial and Release states of the narrative, which correspond semantically to the agent boxer reaching back to punch and to the agent&#x00027;s subsequent celebration at the knockout. Here we simplify the conceptual structure to focus only on the agent&#x00027;s actions, but a complete notation would also include an independent articulation of the patient&#x00027;s event structure which would coindex to the agent&#x00027;s events. The description of the event structure through the Preparation-Head-Coda schema is a concise notation to describe discrete events (Jackendoff, <xref ref-type="bibr" rid="B38">2007</xref>; Cohn et al., <xref ref-type="bibr" rid="B15">2017</xref>). The graphic structure here leaves the punching event unseen, but it instead is implied through the corresponding onomatopoeia which represents a sound emitted by an impact (here, inferred as the impact of the punch), again linked through the binding operator (&#x003B1;). Substitutions of onomatopoeia for narrative Peaks are both a frequent type of Asymmetrical substitutive allocation and an entrenched constructional pattern for sponsoring inferences in visual narratives (Cohn, <xref ref-type="bibr" rid="B11">2019</xref><xref ref-type="fn" rid="fn0001"><sup>1</sup></xref>).</p>
<p>Relationships between complex grammars can also use substitutive allocation. In these cases, the substitution utterance has a complex grammar of its own such that its internal parts have their own grammatical roles, but the utterance as a whole plays a role as a unit in the matrix grammar of the other expression. Formalized, this appears as in (GI-6):</p>
<p>(GI-6) Symmetrical Complex substitutive allocation</p>
<p>GS<sub>1</sub>: [<sub>Utterance/Phrase</sub> [Unit<sub>x</sub>/Phrase<sub>x</sub>] &#x02013; [Unit<sub>y</sub>/Phrase<sub>y</sub>] &#x02013; [Unit<sub>z</sub>/Phrase<sub>z</sub>]<sup>&#x0002A;</sup>]<sub>1</sub></p>
<p>GS<sub>2</sub>: [<sub>Utterance/Phrase</sub> [Unit<sub>x</sub>/Phrase<sub>x</sub>] &#x02013; [Unit<sub>y</sub>/Phrase<sub>y</sub>]<sub>1</sub> &#x02013; [Unit<sub>z</sub>/Phrase<sub>z</sub>]<sup>&#x0002A;</sup>]</p>
<p>Again, this relationship is characterized by the whole substituted utterance coindexed with a unit in the matrix grammar. An example of such Symmetrical Complex substitutive allocation would occur when a whole sentence replaces the Peak in a visual narrative sequence, as in <xref ref-type="fig" rid="F9">Figure 9B</xref>. As in the Asymmetrical example in <xref ref-type="fig" rid="F9">Figure 9A</xref>, this Symmetrical Complex substitution replaces the Peak event in a visual narrative with text, only here the text is a whole sentence rather than a single unit (Cohn, <xref ref-type="bibr" rid="B11">2019</xref>; Huff et al., <xref ref-type="bibr" rid="B34">2020</xref>). This substitution has its own complex grammar (here simplified in notation) that uses categorical roles and constituent structure, but as a whole plays a categorical narrative role determined by the matrix grammar. Again, this substitution is indicated by a coindex (&#x0201C;a&#x0201D;) of the sentence into the narrative structure, while the semantics are linked through the binding operator (&#x003B1;). We can also imagine the reverse situation, where an image sequence appears between words or clauses within a sentence.</p>
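<p>The coindexation shared by (GI-4) to (GI-6) can be sketched as a simple data structure operation: the whole substituted utterance fills one slot of the matrix grammar and inherits whatever role that slot specifies, if any. The Python sketch below is only an illustration under our own simplifying assumptions. The function <monospace>substitute</monospace>, the pair representation of slots, and the placeholder string <monospace>PIZZA-EMOJI</monospace> are hypothetical, and the sketch does not model conceptual structure or the binding operator.</p>
<preformat>
```python
# Hypothetical sketch of substitutive allocation. A matrix grammar is a list
# of (role, filler) slots; role is None when the grammar is linear and
# therefore specifies no grammatical roles.
def substitute(matrix_slots, slot_index, substitution, coindex="a"):
    """Insert a whole utterance as a single unit of a matrix grammar."""
    role, filler = matrix_slots[slot_index]
    if filler is not None:
        raise ValueError("slot is already filled by the matrix expression")
    result = list(matrix_slots)  # leave the original sequence untouched
    # The entire substituted utterance is coindexed to this one slot and
    # inherits the role of the slot, if the matrix grammar provides one.
    inserted = {"utterance": list(substitution), "coindex": coindex}
    result[slot_index] = (role, inserted)
    return result


# Asymmetrical case, as in Figure 7A: the emoji inherits the noun role.
sentence = [("noun", "I"), ("verb", "love"), ("noun", None)]
print(substitute(sentence, 2, ["PIZZA-EMOJI"])[2])

# Symmetrical Simple case, as in Figure 7B: a linear list assigns no role.
items = [(None, "beer"), (None, "chips"), (None, None), (None, "football")]
print(substitute(items, 2, ["PIZZA-EMOJI"])[2])
```
</preformat>
<p>In the Symmetrical Complex case, the substituted list would simply contain several units with roles of their own, while the matrix slot would still treat the utterance as a single unit.</p>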
<p>Overall, in these substitutive allocations, the modalities thus are not interfaced in a temporal or spatial correspondence, but rather there is a sensory &#x0201C;switching&#x0201D; between modalities such that one expression concludes, a different modality begins and ends, and the original expression continues. As a result, no interfacing arises at the modality level, because no composite signals can be made out of integrating separate expressions into a holistic unit. In other terms, in independent allocations, the modality interfaces provide cues (synchronicity of speech and gesture, spatial proximity or connection for text and images, etc.) which give rise to unification operations within the conceptual structures. Unlike independent allocation, in substitutive allocation it is the grammar that works to integrate these multimodal messages by inserting an expression of one modality as a unit into the dominant matrix grammar of another modality. The result is that the grammar facilitates the access to and unification of meaning from both modalities. This precludes co-referentiality between the semantics of the modalities, thus giving rise to the need for a binding operator (&#x003B1;) to link the meanings. That is, because the modalities themselves remain separate, they do not contribute independent expressions needed to connect. Instead, the grammar facilitates this meaning, whether or not it invites conceptual integration.</p>
<p>Because grammars mediate the meaningful connections in substitutive allocation, our descriptions of balanced and imbalanced semantic weight no longer apply. While in (im)balanced relationships meaning is negotiated through how modalities establish coreference to each other, in a substitutive relationship the modalities each contribute independent semantics, and no units explicitly coreference each other. Therefore, we need to introduce an additional semantic interaction that is characterized solely by the substitutive allocation. This we call a <italic><bold>compositional</bold></italic> multimodal meaning, where semantic interactions arise from the unification of meaning facilitated by the insertion of the meaning of one modality into the grammatical structure&#x02014;and thus the corresponding conceptual structure&#x02014;of another modality, while the problem of absent coreference persists despite the grammatical substitution.</p>
<p>These types of interactions characterize the relationships between grammars alone, in the abstract. Though we emphasize that substitutive allocation may occur through grammars in multimodal interactions, all allocations also occur <italic>within</italic> modalities. In the vocal modality, ideophones are a lexical class of typically one-unit words that are prevalent in many of the world&#x00027;s languages (Dingemanse, <xref ref-type="bibr" rid="B19">2017</xref>). These expressions show morphosyntactic independence&#x02014;often placed at the end of sentences&#x02014;yet they can also be inserted into sentences to take on grammatical roles (Dingemanse, <xref ref-type="bibr" rid="B19">2017</xref>). In the bodily modality, similar asymmetrical allocation occurs with gestures that accompany grammatical sign language (Marschark, <xref ref-type="bibr" rid="B52">1994</xref>; Emmorey, <xref ref-type="bibr" rid="B20">2001</xref>).</p>
<p>Unimodal substitution also arises in interactions between different representational systems, such as in codeswitching between two languages&#x02014;i.e., where the units, of varying sizes, of one language are inserted into the matrix grammar of another language (Kootstra, <xref ref-type="bibr" rid="B46">2015</xref>; Muysken, <xref ref-type="bibr" rid="B55">2020</xref>). Like multimodal substitutions, &#x0201C;insertional&#x0201D; codeswitching is motivated by cognates for substituted words or clauses between languages, and in many cases the morphosyntax of the utterance comes from only the matrix grammar (Myers-Scotton, <xref ref-type="bibr" rid="B56">1997</xref>, <xref ref-type="bibr" rid="B57">2002</xref>). As a result, codeswitches are more often content words (like nouns) than function words. Multimodal substitutions of emoji into sentences are consistent with this, as people more often substitute pictures for certain grammatical categories (nouns, adjectives) in sentences than for others (verbs, adverbs) (Cohn et al., <xref ref-type="bibr" rid="B14">2019</xref>). We might think of this &#x0201C;unimodal switching&#x0201D; between languages as a type of substitution, whereby the units come from different representational systems within the same vocal modality, rather than a &#x0201C;multimodal codeswitching&#x0201D; of substitution from different modalities. This aligns with the idea that a broader lexicon distinguishes lexical items with features for different languages (Jackendoff and Audring, <xref ref-type="bibr" rid="B41">2020</xref>), which here thus would extend to a lexicon across and between modalities. Thus, again, allocation characterizes the interactions between grammars, no matter the modality or representational origins of those grammars.</p>
<p>It is worth noting that psycholinguistic research supports the idea that substituted elements from one modality readily integrate into the grammar and meaning of the matrix modality, despite differences in processing the modalities themselves. For example, reading times for grammatically congruous substituted emoji were slower than for words in sentences, but viewing times for grammatically incongruous or homophonous rebus emoji were even slower (Cohn et al., <xref ref-type="bibr" rid="B16">2018</xref>; Scheffler et al., <xref ref-type="bibr" rid="B67">2021</xref>). In addition, viewing times for sentences substituted for images in visual sequences (symmetrical substitution, as in <xref ref-type="fig" rid="F9">Figure 9B</xref>) were also found to be slower than those for the pictures they replaced (Huff et al., <xref ref-type="bibr" rid="B34">2020</xref>). However, onomatopoeia in visual narratives were actually viewed faster than the pictures they replaced<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref>. Thus, while some work suggests that switching modalities may incur costs, this may either be due to the front-end change in type of representation (graphics to text, or vice versa), or may simply be a matter of relative complexity in the visual representation.</p>
<p>The integration of substitution and matrix grammar is further suggested by studies of grammatical and semantic processing. Pictures inserted into sentences are comprehended with accuracy comparable to all-verbal sentences (Potter et al., <xref ref-type="bibr" rid="B63">1986</xref>; Cohn et al., <xref ref-type="bibr" rid="B16">2018</xref>). Substituted emoji within sentences that better maintain the expectations of the written grammar incur no sustained costs (e.g., <italic>John loves eating</italic><inline-graphic xlink:href="frai-04-778060-i0005.tif"/>&#x02026;), while &#x0201C;ungrammatical&#x0201D; pictures (e.g., <italic>John</italic><inline-graphic xlink:href="frai-04-778060-i0005.tif"/> <italic>eating pizza&#x02026;</italic>) create spillover costs that persist after the substitution (Cohn et al., <xref ref-type="bibr" rid="B16">2018</xref>). Substituted emoji are also viewed faster than independent allocations of emoji placed at the end of sentences (Cohn et al., <xref ref-type="bibr" rid="B16">2018</xref>). Finally, neural responses indexing semantic processing (the N400, as measured by event-related potentials) are modulated by congruity or predictability of a substitution with the content of its matrix sequence, whether for images substituted into text (Nigam et al., <xref ref-type="bibr" rid="B58">1992</xref>; Ganis et al., <xref ref-type="bibr" rid="B28">1996</xref>; Federmeier and Kutas, <xref ref-type="bibr" rid="B22">2001</xref><xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>) or for text substituted into a visual narrative sequence (Manfredi et al., <xref ref-type="bibr" rid="B51">2017</xref>). Together, these findings imply that, while modalities themselves may incur front-end costs, substituted elements readily integrate with their matrix modality across both semantics and grammar.</p>
</sec>
</sec>
</sec>
<sec sec-type="conclusions" id="s6">
<title>Conclusion</title>
<p>We have presented an expansion of Jackendoff&#x00027;s Parallel Architecture that accounts for both unimodal and multimodal expressions as emergent interactions within a holistic system with primary structures of modality (vocal, bodily, graphic), grammatical structures, and conceptual structures. Interactions within each of these structures allow for a wide range of variation in expressions. We have primarily focused here on grammatical interactions, arriving at two dimensions of variability. Symmetry characterizes the overall relative complexity of contributing grammars, while allocation describes the ways that those grammars are distributed.</p>
<p>Allocation in many ways drives the overall multimodal interactions. When grammars remain separated in independent allocations, the modalities themselves remain separate. In these cases, modalities interface on their own (through temporal or spatial correspondence) and require coreference across signals. When grammars are integrated in substitutive allocation, the multimodal messages themselves become integrated. Thus, while the overt experience of multimodality occurs in the perception of the sensory modalities, and the understanding of their integration results through the conceptual structure, we may remain unaware of the covert interactions of the grammars that largely characterize how multimodal messages arise.</p>
<p>In addition, the ability of grammatical structures to substitute into each other has implications for the characteristics of the wider faculty of language. As substitution appears to apply unimodally within a language (ideophones, signers gesturing), unimodally across languages (codeswitching), and in multimodal interactions (text/image, speech/gesture), it implies that substitution does not merely occur between modalities, but between <italic>grammars</italic>, no matter their representational origin. Furthermore, given that substitution serves more broadly as a diagnostic in linguistic tests of complementary distribution, substitutive allocation can be taken as a defining distributional trait for inclusion into a broader linguistic faculty. One can substitute across expressive systems <italic><bold>because</bold></italic> these systems share their cognitive architecture. We claim that the Parallel Architecture accommodates this evidence of cross-modal substitution. Thus, substitution can be used as a diagnostic for testing the degree to which modalities or representational systems overlap within a broader linguistic faculty.</p>
<p>Altogether, we have argued for two fundamental observations about human language and multimodality. First, expressions across modalities are not indivisible, but rather are decomposable into similar substructures which have classifiable interactions. Second, in order to accurately account for the structure and cognition of human language, multimodality must be addressed. We contend that any accurate accounting for language and multimodality must address these issues. That is, to address multimodality, but not its decomposable interactions, does not do justice to the complexity of multimodal expressions. At the same time, unimodal linguistic models fail to characterize the full and accurate complexity of human language. As we have argued, the Parallel Architecture provides a model of human expressive capacities capable of accounting for the richness demanded of the natural competence for multimodality.</p>
</sec>
<sec sec-type="data-availability" id="s7">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.</p>
</sec>
<sec id="s8">
<title>Author Contributions</title>
<p>NC and JS contributed equally to the theorizing and creation of this paper. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s9">
<title>Funding</title>
<p>Funding for open access publication fees is provided by Tilburg University.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec> </body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bateman</surname> <given-names>J. A.</given-names></name></person-group> (<year>2014</year>). <source>Text and Image: A Critical Introduction to the Visual/Verbal Divide</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Routledge</publisher-name>. <pub-id pub-id-type="doi">10.4324/9781315773971</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bateman</surname> <given-names>J. A.</given-names></name> <name><surname>Wildfeuer</surname> <given-names>J.</given-names></name> <name><surname>Hiippala</surname> <given-names>T.</given-names></name></person-group> (<year>2017</year>). <source>Multimodality: Foundations, Research and Analysis&#x02013;A Problem-Oriented Introduction</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>Walter de Gruyter GmbH &#x00026; Co KG</publisher-name>. p. <fpage>488</fpage>. <pub-id pub-id-type="doi">10.1515/9783110479898</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chomsky</surname> <given-names>N.</given-names></name></person-group> (<year>1956</year>). <article-title>Three models for the description of language</article-title>. <source>IRE Trans. Inform. Theory</source> <volume>2</volume>, <fpage>113</fpage>&#x02013;<lpage>124</lpage>. <pub-id pub-id-type="doi">10.1109/TIT.1956.1056813</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Clark</surname> <given-names>H. H.</given-names></name></person-group> (<year>1996</year>). <source>Using Language</source>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>. <pub-id pub-id-type="doi">10.1017/CBO9780511620539</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name></person-group> (<year>2012</year>). <article-title>Explaining &#x0201C;I can&#x00027;t draw&#x0201D;: Parallels between the structure and development of language and drawing</article-title>. <source>Hum. Dev.</source> <volume>55</volume>, <fpage>167</fpage>&#x02013;<lpage>192</lpage>. <pub-id pub-id-type="doi">10.1159/000341842</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name></person-group> (<year>2013a</year>). <article-title>Beyond speech balloons and thought bubbles: the integration of text and image</article-title>. <source>Semiotica</source> <volume>2013</volume>, <fpage>35</fpage>&#x02013;<lpage>63</lpage>. <pub-id pub-id-type="doi">10.1515/sem-2013-0079</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name></person-group> (<year>2013b</year>). <source>The Visual Language of Comics: Introduction to the Structure and Cognition of Sequential Images</source>. <publisher-loc>London</publisher-loc>: <publisher-name>Bloomsbury</publisher-name>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name></person-group> (<year>2013c</year>). <article-title>Visual narrative structure</article-title>. <source>Cogn. Sci.</source> <volume>37</volume>, <fpage>413</fpage>&#x02013;<lpage>452</lpage>. <pub-id pub-id-type="doi">10.1111/cogs.12016</pub-id><pub-id pub-id-type="pmid">23163777</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name></person-group> (<year>2016</year>). <article-title>A multimodal parallel architecture: a cognitive framework for multimodal interactions</article-title>. <source>Cognition</source> <volume>146</volume>, <fpage>304</fpage>&#x02013;<lpage>323</lpage>. <pub-id pub-id-type="doi">10.1016/j.cognition.2015.10.007</pub-id><pub-id pub-id-type="pmid">26491835</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Combinatorial morphology in visual languages,&#x0201D;</article-title> in <source>The Construction of Words: Advances in Construction Morphology</source>, ed G. Booij (<publisher-loc>London</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>175</fpage>&#x02013;<lpage>199</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-74394-3_7</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name></person-group> (<year>2019</year>). <article-title>Being explicit about the implicit: inference generating techniques in visual narrative</article-title>. <source>Lang. Cogn.</source> <volume>11</volume>, <fpage>66</fpage>&#x02013;<lpage>97</lpage>. <pub-id pub-id-type="doi">10.1017/langcog.2019.6</pub-id><pub-id pub-id-type="pmid">30886898</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name></person-group> (<year>2020a</year>). <source>Who Understands Comics? Questioning the Universality of Visual Language Comprehension.</source> <publisher-loc>London</publisher-loc>: <publisher-name>Bloomsbury</publisher-name>. <pub-id pub-id-type="doi">10.5040/9781350156074</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name></person-group> (<year>2020b</year>). <article-title>Your brain on comics: a cognitive model of visual narrative comprehension</article-title>. <source>Top. Cogn. Sci.</source> <volume>12</volume>, <fpage>352</fpage>&#x02013;<lpage>386</lpage>. <pub-id pub-id-type="doi">10.1111/tops.12421</pub-id><pub-id pub-id-type="pmid">30963724</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name> <name><surname>Engelen</surname> <given-names>J.</given-names></name> <name><surname>Schilperoord</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>The grammar of emoji? Constraints on communicative pictorial sequencing</article-title>. <source>Cogn. Res.</source> <volume>4</volume>:<fpage>33</fpage>. <pub-id pub-id-type="doi">10.1186/s41235-019-0177-0</pub-id><pub-id pub-id-type="pmid">31471857</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name> <name><surname>Paczynski</surname> <given-names>M.</given-names></name> <name><surname>Kutas</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>Not so secret agents: event-related potentials to semantic roles in visual event comprehension</article-title>. <source>Brain Cogn.</source> <volume>119</volume>, <fpage>1</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1016/j.bandc.2017.09.001</pub-id><pub-id pub-id-type="pmid">28898720</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cohn</surname> <given-names>N.</given-names></name> <name><surname>Roijackers</surname> <given-names>T.</given-names></name> <name><surname>Schaap</surname> <given-names>R.</given-names></name> <name><surname>Engelen</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Are emoji a poor substitute for words? Sentence processing with emoji substitutions,&#x0201D;</article-title> in <source>40th Annual Conference of the Cognitive Science Society</source>, eds T. T. Rogers, M. Rau, X. Zhu, and C. W. Kalish (<publisher-loc>Austin, TX</publisher-loc>: <publisher-name>Cognitive Science Society</publisher-name>), <fpage>1524</fpage>&#x02013;<lpage>1529</lpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dancygier</surname> <given-names>B.</given-names></name> <name><surname>Vandelanotte</surname> <given-names>L.</given-names></name></person-group> (<year>2017</year>). <article-title>Internet memes as multimodal constructions</article-title>. <source>Cogn. Linguist.</source> <volume>28</volume>:<fpage>565</fpage>. <pub-id pub-id-type="doi">10.1515/cog-2017-0074</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dehaene</surname> <given-names>S.</given-names></name> <name><surname>Meyniel</surname> <given-names>F.</given-names></name> <name><surname>Wacongne</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Pallier</surname> <given-names>C.</given-names></name></person-group> (<year>2015</year>). <article-title>The neural representation of sequences: from transition probabilities to algebraic patterns and linguistic trees</article-title>. <source>Neuron</source> <volume>88</volume>, <fpage>2</fpage>&#x02013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2015.09.019</pub-id><pub-id pub-id-type="pmid">26447569</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dingemanse</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;On the margins of language: ideophones, interjections and dependencies in linguistic theory,&#x0201D;</article-title> in <source>Dependencies in Language: On the Causal Ontology of Linguistic Systems</source>, ed N. J. Enfield (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Language Science Press</publisher-name>), <fpage>195</fpage>&#x02013;<lpage>202</lpage>.</citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Emmorey</surname> <given-names>K.</given-names></name></person-group> (<year>2001</year>). <source>Language, Cognition, and the Brain: Insights From Sign Language Research</source>. <pub-id pub-id-type="doi">10.4324/9781410603982</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Emmorey</surname> <given-names>K.</given-names></name> <name><surname>Borinstein</surname> <given-names>H. B.</given-names></name> <name><surname>Thompson</surname> <given-names>R.</given-names></name> <name><surname>Gollan</surname> <given-names>T. H.</given-names></name></person-group> (<year>2008</year>). <article-title>Bimodal bilingualism</article-title>. <source>Bilingualism</source> <volume>11</volume>, <fpage>43</fpage>&#x02013;<lpage>61</lpage>. <pub-id pub-id-type="doi">10.1017/S1366728907003203</pub-id><pub-id pub-id-type="pmid">19079743</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Federmeier</surname> <given-names>K. D.</given-names></name> <name><surname>Kutas</surname> <given-names>M.</given-names></name></person-group> (<year>2001</year>). <article-title>Meaning and modality: influences of context, semantic memory organization, and perceptual predictability on picture processing</article-title>. <source>J. Exp. Psychol.</source> <volume>27</volume>, <fpage>202</fpage>&#x02013;<lpage>224</lpage>. <pub-id pub-id-type="doi">10.1037/0278-7393.27.1.202</pub-id><pub-id pub-id-type="pmid">11204098</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fein</surname> <given-names>O.</given-names></name> <name><surname>Kasher</surname> <given-names>A.</given-names></name></person-group> (<year>1996</year>). <article-title>How to do things with words and gestures in comics</article-title>. <source>J. Pragmat.</source> <volume>26</volume>, <fpage>793</fpage>&#x02013;<lpage>808</lpage>. <pub-id pub-id-type="doi">10.1016/S0378-2166(96)00023-9</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Forceville</surname> <given-names>C.</given-names></name></person-group> (<year>2011</year>). <article-title>Pictorial runes in Tintin and the Picaros</article-title>. <source>J. Pragmat.</source> <volume>43</volume>, <fpage>875</fpage>&#x02013;<lpage>890</lpage>. <pub-id pub-id-type="doi">10.1016/j.pragma.2010.07.014</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Forceville</surname> <given-names>C.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Reflections on the creative use of traffic signs&#x00027; &#x02018;Micro-Language&#x02019;,&#x0201D;</article-title> in <source>Perspectives on Visual Learning</source>, <volume>Vol. 3</volume>, eds A. Benedek and K. Ny&#x000ED;ri (<publisher-loc>Budapest</publisher-loc>: <publisher-name>Hungarian Academy of Sciences</publisher-name>), <fpage>103</fpage>&#x02013;<lpage>113</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Forceville</surname> <given-names>C.</given-names></name></person-group> (<year>2020</year>). <source>Visual and Multimodal Communication: Applying the Relevance Principle</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>. <pub-id pub-id-type="doi">10.1093/oso/9780190845230.001.0001</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fricke</surname> <given-names>E.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;Towards a unified grammar of gesture and speech: a multimodal approach,&#x0201D;</article-title> in <source>Body&#x02013;Language&#x02013;Communication. An International Handbook on Multimodality in Human Interaction</source>, eds C. M&#x000FC;ller, A. Cienki, E. Fricke, S. Ladewig, D. McNeill, and S. Tessendorf (<publisher-loc>Berlin</publisher-loc>: <publisher-name>De Gruyter Mouton</publisher-name>), <fpage>733</fpage>&#x02013;<lpage>754</lpage>. <pub-id pub-id-type="doi">10.1515/9783110261318.733</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ganis</surname> <given-names>G.</given-names></name> <name><surname>Kutas</surname> <given-names>M.</given-names></name> <name><surname>Sereno</surname> <given-names>M. I.</given-names></name></person-group> (<year>1996</year>). <article-title>The search for &#x0201C;common sense&#x0201D;: an electrophysiological study of the comprehension of words and pictures in reading</article-title>. <source>J. Cogn. Neurosci.</source> <volume>8</volume>, <fpage>89</fpage>&#x02013;<lpage>106</lpage>. <pub-id pub-id-type="doi">10.1162/jocn.1996.8.2.89</pub-id><pub-id pub-id-type="pmid">23971417</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Gawne</surname> <given-names>L.</given-names></name> <name><surname>McCulloch</surname> <given-names>G.</given-names></name></person-group> (<year>2019</year>). <source>Emoji as Digital Gestures. Language&#x00040;Internet, (urn:nbn:de:0009-7-48882)</source>. Available Online at: <ext-link ext-link-type="uri" xlink:href="https://www.languageatinternet.org/articles/2019/gawne">https://www.languageatinternet.org/articles/2019/gawne</ext-link></citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goldberg</surname> <given-names>A. E.</given-names></name> <name><surname>Jackendoff</surname> <given-names>R.</given-names></name></person-group> (<year>2004</year>). <article-title>The English resultative as a family of constructions</article-title>. <source>Language</source> <volume>80</volume>, <fpage>532</fpage>&#x02013;<lpage>568</lpage>. <pub-id pub-id-type="doi">10.1353/lan.2004.0129</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Goldin-Meadow</surname> <given-names>S.</given-names></name></person-group> (<year>2003a</year>). <source>Hearing Gesture: How our Hands Help us Think</source>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Harvard University Press</publisher-name>. <pub-id pub-id-type="doi">10.1037/e413812005-377</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Goldin-Meadow</surname> <given-names>S.</given-names></name></person-group> (<year>2003b</year>). <source>The Resilience of Language: What Gesture Creation in Deaf Children Can Tell Us About How All Children Learn Language</source>. <publisher-loc>New York, Hove, NY</publisher-loc>: <publisher-name>Psychology Press</publisher-name>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hervais-Adelman</surname> <given-names>A.</given-names></name> <name><surname>Kumar</surname> <given-names>U.</given-names></name> <name><surname>Mishra</surname> <given-names>R. K.</given-names></name> <name><surname>Tripathi</surname> <given-names>V. N.</given-names></name> <name><surname>Guleria</surname> <given-names>A.</given-names></name> <name><surname>Singh</surname> <given-names>J. P.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Learning to read recycles visual cortical networks without destruction</article-title>. <source>Sci. Adv.</source> <volume>5</volume>:<fpage>eaax0262</fpage>. <pub-id pub-id-type="doi">10.1126/sciadv.aax0262</pub-id><pub-id pub-id-type="pmid">31555732</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huff</surname> <given-names>M.</given-names></name> <name><surname>Rosenfelder</surname> <given-names>D.</given-names></name> <name><surname>Oberbeck</surname> <given-names>M.</given-names></name> <name><surname>Merkt</surname> <given-names>M.</given-names></name> <name><surname>Papenmeier</surname> <given-names>F.</given-names></name> <name><surname>Meitz</surname> <given-names>T. G. K.</given-names></name></person-group> (<year>2020</year>). <article-title>Cross-codal integration of bridging-event information in narrative understanding</article-title>. <source>Mem. Cognit</source>. <volume>44</volume>, <fpage>1064</fpage>&#x02013;<lpage>1075</lpage>. <pub-id pub-id-type="doi">10.3758/s13421-020-01039-z</pub-id><pub-id pub-id-type="pmid">32342288</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jackendoff</surname> <given-names>R.</given-names></name></person-group> (<year>1983</year>). <source>Semantics and Cognition</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>.</citation>
</ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jackendoff</surname> <given-names>R.</given-names></name></person-group> (<year>1987</year>). <source>Consciousness and the Computational Mind</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>.</citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jackendoff</surname> <given-names>R.</given-names></name></person-group> (<year>2002</year>). <source>Foundations of Language: Brain, Meaning, Grammar, Evolution</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>. <pub-id pub-id-type="doi">10.1093/acprof:oso/9780198270126.001.0001</pub-id><pub-id pub-id-type="pmid">15377127</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jackendoff</surname> <given-names>R.</given-names></name></person-group> (<year>2007</year>). <source>Language, Consciousness, Culture: Essays on Mental Structure (Jean Nicod Lectures)</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>. <pub-id pub-id-type="doi">10.7551/mitpress/4111.001.0001</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jackendoff</surname> <given-names>R.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;Compounding in the parallel architecture and conceptual semantics,&#x0201D;</article-title> in <source>Oxford Handbook of Compounding</source>, eds R. Lieber and P. Stekauer (<publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>), <fpage>105</fpage>&#x02013;<lpage>128</lpage>.</citation>
</ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jackendoff</surname> <given-names>R.</given-names></name></person-group> (<year>2010</year>). <source>Meaning and the Lexicon: The Parallel Architecture 1975-2010</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>.</citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jackendoff</surname> <given-names>R.</given-names></name> <name><surname>Audring</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <source>The Texture of the Lexicon: Relational Morphology and the Parallel Architecture</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>. <pub-id pub-id-type="doi">10.1093/oso/9780198827900.001.0001</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jackendoff</surname> <given-names>R.</given-names></name> <name><surname>Wittenberg</surname> <given-names>E.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;What you can say without syntax: a hierarchy of grammatical complexity,&#x0201D;</article-title> in <source>Measuring Linguistic Complexity</source>, eds F. Newmeyer and L. Preston (<publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>), <fpage>65</fpage>&#x02013;<lpage>82</lpage>. <pub-id pub-id-type="doi">10.1093/acprof:oso/9780199685301.003.0004</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jackendoff</surname> <given-names>R.</given-names></name> <name><surname>Wittenberg</surname> <given-names>E.</given-names></name></person-group> (<year>2017</year>). <article-title>Linear grammar as a possible stepping-stone in the evolution of language</article-title>. <source>Psychon. Bull. Rev.</source> <volume>24</volume>, <fpage>219</fpage>&#x02013;<lpage>224</lpage>. <pub-id pub-id-type="doi">10.3758/s13423-016-1073-y</pub-id><pub-id pub-id-type="pmid">27368633</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kendon</surname> <given-names>A.</given-names></name></person-group> (<year>1988</year>). <article-title>&#x0201C;How gestures can become like words,&#x0201D;</article-title> in <source>Cross-Cultural Perspectives in Nonverbal Communication</source> (<publisher-loc>Ashland, OH</publisher-loc>: <publisher-name>Hogrefe &#x00026; Huber Publishers</publisher-name>), <fpage>131</fpage>&#x02013;<lpage>141</lpage>.</citation>
</ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Koelsch</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <article-title>Toward a neural basis of music perception &#x02013; a review and updated model</article-title>. <source>Front. Psychol.</source> <volume>2</volume>:<fpage>110</fpage>. <pub-id pub-id-type="doi">10.3389/fpsyg.2011.00110</pub-id><pub-id pub-id-type="pmid">21713060</pub-id></citation></ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kootstra</surname> <given-names>G. J.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;A psycholinguistic perspective on code-switching: lexical, structural, and socio-interactive processes,&#x0201D;</article-title> in <source>Code-Switching Between Structural and Sociolinguistic Perspectives</source>, eds G. Stell and K. Yakpo (<publisher-loc>Berlin</publisher-loc>: <publisher-name>De Gruyter</publisher-name>), <fpage>39</fpage>&#x02013;<lpage>64</lpage>.</citation>
</ref>
<ref id="B47">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kress</surname> <given-names>G.</given-names></name></person-group> (<year>2009</year>). <source>Multimodality: A Social Semiotic Approach to Contemporary Communication</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Routledge</publisher-name>. <pub-id pub-id-type="doi">10.4324/9780203970034</pub-id></citation>
</ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kutas</surname> <given-names>M.</given-names></name> <name><surname>Federmeier</surname> <given-names>K. D.</given-names></name></person-group> (<year>2011</year>). <article-title>Thirty years and counting: finding meaning in the N400 component of the Event-Related Brain Potential (ERP)</article-title>. <source>Annu. Rev. Psychol.</source> <volume>62</volume>, <fpage>621</fpage>&#x02013;<lpage>647</lpage>. <pub-id pub-id-type="doi">10.1146/annurev.psych.093008.131123</pub-id><pub-id pub-id-type="pmid">20809790</pub-id></citation></ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ladewig</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <source>Integrating Gestures: The Dimension of Multimodality in Cognitive Grammar</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>Walter de Gruyter GmbH &#x00026; Co KG</publisher-name>. <pub-id pub-id-type="doi">10.1515/9783110668568</pub-id></citation>
</ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lanwer</surname> <given-names>J. P.</given-names></name></person-group> (<year>2017</year>). <article-title>Apposition: a multimodal construction? The multimodality of linguistic constructions in the light of usage-based theory</article-title>. <source>Linguistics Vanguard</source> <volume>3</volume>:<fpage>20160071</fpage>. <pub-id pub-id-type="doi">10.1515/lingvan-2016-0071</pub-id></citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Manfredi</surname> <given-names>M.</given-names></name> <name><surname>Cohn</surname> <given-names>N.</given-names></name> <name><surname>Kutas</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>When a hit sounds like a kiss: an electrophysiological exploration of semantic processing in visual narrative</article-title>. <source>Brain Lang.</source> <volume>169</volume>, <fpage>28</fpage>&#x02013;<lpage>38</lpage>. <pub-id pub-id-type="doi">10.1016/j.bandl.2017.02.001</pub-id><pub-id pub-id-type="pmid">28242517</pub-id></citation></ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Marschark</surname> <given-names>M.</given-names></name></person-group> (<year>1994</year>). <article-title>Gesture and sign</article-title>. <source>Appl. Psycholinguist.</source> <volume>15</volume>, <fpage>209</fpage>&#x02013;<lpage>236</lpage>. <pub-id pub-id-type="doi">10.1017/S0142716400005336</pub-id><pub-id pub-id-type="pmid">30886898</pub-id></citation></ref>
<ref id="B53">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Martinec</surname> <given-names>R.</given-names></name> <name><surname>Salway</surname> <given-names>A.</given-names></name></person-group> (<year>2005</year>). <article-title>A system for image&#x02013;text relations in new (and old) media</article-title>. <source>Visual Commun.</source> <volume>4</volume>, <fpage>337</fpage>&#x02013;<lpage>371</lpage>. <pub-id pub-id-type="doi">10.1177/1470357205055928</pub-id></citation>
</ref>
<ref id="B54">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>McNeill</surname> <given-names>D.</given-names></name></person-group> (<year>1992</year>). <source>Hand and Mind: What Gestures Reveal About Thought.</source> <publisher-loc>Chicago, IL</publisher-loc>: <publisher-name>University of Chicago Press</publisher-name>.<pub-id pub-id-type="pmid">17166576</pub-id></citation></ref>
<ref id="B55">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Muysken</surname> <given-names>P.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Code-switching and grammatical theory,&#x0201D;</article-title> in <source>The Bilingualism Reader</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Routledge</publisher-name>), <fpage>280</fpage>&#x02013;<lpage>297</lpage>. <pub-id pub-id-type="doi">10.4324/9781003060406-26</pub-id><pub-id pub-id-type="pmid">26733894</pub-id></citation></ref>
<ref id="B56">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Myers-Scotton</surname> <given-names>C.</given-names></name></person-group> (<year>1997</year>). <source>Duelling Languages: Grammatical Structure in Codeswitching</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>. p. <fpage>147</fpage>.</citation>
</ref>
<ref id="B57">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Myers-Scotton</surname> <given-names>C.</given-names></name></person-group> (<year>2002</year>). <source>Contact Linguistics: Bilingual Encounters and Grammatical Outcomes</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>. p. <fpage>256</fpage>. <pub-id pub-id-type="doi">10.1093/acprof:oso/9780198299530.001.0001</pub-id></citation>
</ref>
<ref id="B58">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nigam</surname> <given-names>A.</given-names></name> <name><surname>Hoffman</surname> <given-names>J.</given-names></name> <name><surname>Simons</surname> <given-names>R.</given-names></name></person-group> (<year>1992</year>). <article-title>N400 to semantically anomalous pictures and words</article-title>. <source>J. Cogn. Neurosci.</source> <volume>4</volume>, <fpage>15</fpage>&#x02013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1162/jocn.1992.4.1.15</pub-id><pub-id pub-id-type="pmid">23967854</pub-id></citation></ref>
<ref id="B59">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pa</surname> <given-names>J.</given-names></name> <name><surname>Wilson</surname> <given-names>S. M.</given-names></name> <name><surname>Pickell</surname> <given-names>H.</given-names></name> <name><surname>Bellugi</surname> <given-names>U.</given-names></name> <name><surname>Hickok</surname> <given-names>G.</given-names></name></person-group> (<year>2008</year>). <article-title>Neural organization of linguistic short-term memory is sensory modality&#x02013;dependent: evidence from signed and spoken language</article-title>. <source>J. Cogn. Neurosci.</source> <volume>20</volume>, <fpage>2198</fpage>&#x02013;<lpage>2210</lpage>. <pub-id pub-id-type="doi">10.1162/jocn.2008.20154</pub-id><pub-id pub-id-type="pmid">18457510</pub-id></citation></ref>
<ref id="B60">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Painter</surname> <given-names>C.</given-names></name> <name><surname>Martin</surname> <given-names>J. R.</given-names></name> <name><surname>Unsworth</surname> <given-names>L.</given-names></name></person-group> (<year>2012</year>). <source>Reading Visual Narratives: Image Analysis of Children&#x00027;s Picture Books</source>. <publisher-loc>London</publisher-loc>: <publisher-name>Equinox</publisher-name>.</citation>
</ref>
<ref id="B61">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Patel</surname> <given-names>A. D.</given-names></name></person-group> (<year>2003</year>). <article-title>Language, music, syntax and the brain</article-title>. <source>Nat. Neurosci.</source> <volume>6</volume>, <fpage>674</fpage>&#x02013;<lpage>681</lpage>. <pub-id pub-id-type="doi">10.1038/nn1082</pub-id><pub-id pub-id-type="pmid">12830158</pub-id></citation></ref>
<ref id="B62">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Plug</surname> <given-names>I.</given-names></name> <name><surname>Van den Bergh</surname> <given-names>M.</given-names></name> <name><surname>Schilperoord</surname> <given-names>J.</given-names></name> <name><surname>Cohn</surname> <given-names>N.</given-names></name> <name><surname>van Enschot</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <source>Butterflies and Bananas: An Experimental Study Into the Effects of (a)symmetry and Context on Topic Assignment in Juxtapositions (Research Master thesis). Tilburg University, Tilburg</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://theses.ubn.ru.nl/bitstream/handle/123456789/5511/Plug%2C_I._1.pdf?sequence=1">https://theses.ubn.ru.nl/bitstream/handle/123456789/5511/Plug%2C_I._1.pdf?sequence=1</ext-link></citation>
</ref>
<ref id="B63">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Potter</surname> <given-names>M. C.</given-names></name> <name><surname>Kroll</surname> <given-names>J. F.</given-names></name> <name><surname>Yachzel</surname> <given-names>B.</given-names></name> <name><surname>Carpenter</surname> <given-names>E.</given-names></name> <name><surname>Sherman</surname> <given-names>J.</given-names></name></person-group> (<year>1986</year>). <article-title>Pictures in sentences: understanding without words</article-title>. <source>J. Exp. Psychol.</source> <volume>115</volume>:<fpage>281</fpage>. <pub-id pub-id-type="doi">10.1037/0096-3445.115.3.281</pub-id></citation>
</ref>
<ref id="B64">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ralph</surname> <given-names>M. A. L.</given-names></name> <name><surname>Jefferies</surname> <given-names>E.</given-names></name> <name><surname>Patterson</surname> <given-names>K.</given-names></name> <name><surname>Rogers</surname> <given-names>T. T.</given-names></name></person-group> (<year>2016</year>). <article-title>The neural and computational bases of semantic cognition</article-title>. <source>Nat. Rev. Neurosci.</source> <volume>18</volume>, <fpage>42</fpage>&#x02013;<lpage>55</lpage>. <pub-id pub-id-type="doi">10.1038/nrn.2016.150</pub-id><pub-id pub-id-type="pmid">27881854</pub-id></citation></ref>
<ref id="B65">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rasenberg</surname> <given-names>M.</given-names></name> <name><surname>&#x000D6;zy&#x000FC;rek</surname> <given-names>A.</given-names></name> <name><surname>Dingemanse</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). <article-title>Alignment in multimodal interaction: an integrative framework</article-title>. <source>Cogn. Sci.</source> <volume>44</volume>:<fpage>e12911</fpage>. <pub-id pub-id-type="doi">10.1111/cogs.12911</pub-id><pub-id pub-id-type="pmid">33124090</pub-id></citation></ref>
<ref id="B66">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Royce</surname> <given-names>T. D.</given-names></name></person-group> (<year>2007</year>). <article-title>&#x0201C;Intersemiotic complementarity: a framework for multimodal discourse analysis,&#x0201D;</article-title> in <source>New Directions in the Analysis of Multimodal Discourse</source>, eds T. D. Royce and W. L. Bowcher (<publisher-loc>Mahwah, NJ</publisher-loc>: <publisher-name>Lawrence Erlbaum Associates</publisher-name>), <fpage>63</fpage>&#x02013;<lpage>109</lpage>.</citation>
</ref>
<ref id="B67">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Scheffler</surname> <given-names>T.</given-names></name> <name><surname>Brandt</surname> <given-names>L.</given-names></name> <name><surname>de la Fuente</surname> <given-names>M.</given-names></name> <name><surname>Nenchev</surname> <given-names>I.</given-names></name></person-group> (<year>2021</year>). <article-title>The processing of emoji-word substitutions: a self-paced-reading study</article-title>. <source>Comput. Human Behav.</source> <volume>127</volume>:<fpage>107076</fpage>. <pub-id pub-id-type="doi">10.1016/j.chb.2021.107076</pub-id></citation>
</ref>
<ref id="B68">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schilperoord</surname> <given-names>J.</given-names></name> <name><surname>Cohn</surname> <given-names>N.</given-names></name></person-group> (<year>2021</year>). <article-title>Let there be visual optimal innovations: making visual meaning through Michelangelo&#x00027;s The Creation of Adam</article-title>. <source>Visual Commun.</source> 14703572211004994. <pub-id pub-id-type="doi">10.1177/14703572211004994</pub-id></citation>
</ref>
<ref id="B69">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schilperoord</surname> <given-names>J.</given-names></name> <name><surname>Cohn</surname> <given-names>N.</given-names></name></person-group> (<year>2022</year>). <source>Before: Unimodal Linguistics, After: Multimodal Linguistics: An Exploration of the Before-After construction.</source> Cognitive Semantics.</citation>
</ref>
<ref id="B70">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>V.</given-names></name></person-group> (<year>1995</year>). <source>Lessons in Sign Writing</source>. <publisher-loc>La Jolla, CA</publisher-loc>: <publisher-name>The Deaf Action Committee</publisher-name>.</citation>
</ref>
<ref id="B71">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Weissman</surname> <given-names>B.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Emojis in sentence processing: an electrophysiological approach,&#x0201D;</article-title> in <source>Companion Proceedings of The 2019 World Wide Web Conference</source> (<publisher-loc>San Francisco, CA</publisher-loc>). <pub-id pub-id-type="doi">10.1145/3308560.3316544</pub-id></citation>
</ref>
<ref id="B72">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wilkins</surname> <given-names>D. P.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Alternative representations of space: Arrernte narratives in sand,&#x0201D;</article-title> in <source>The Visual Narrative Reader</source>, ed N. Cohn (<publisher-loc>London</publisher-loc>: <publisher-name>Bloomsbury</publisher-name>), <fpage>252</fpage>&#x02013;<lpage>281</lpage>. <pub-id pub-id-type="doi">10.5040/9781474283670.ch-010</pub-id></citation>
</ref>
<ref id="B73">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Willats</surname> <given-names>J.</given-names></name></person-group> (<year>1997</year>). <source>Art and Representation: New Principles in the Analysis of Pictures</source>. <publisher-loc>Princeton, NJ</publisher-loc>: <publisher-name>Princeton University Press</publisher-name>.</citation>
</ref>
<ref id="B74">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Willats</surname> <given-names>J.</given-names></name></person-group> (<year>2005</year>). <source>Making Sense of Children&#x00027;s Drawings</source>. <publisher-loc>Mahwah, NJ</publisher-loc>: <publisher-name>Lawrence Erlbaum</publisher-name>. <pub-id pub-id-type="doi">10.4324/9781410613561</pub-id></citation>
</ref>
<ref id="B75">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Woodin</surname> <given-names>G.</given-names></name> <name><surname>Winter</surname> <given-names>B.</given-names></name> <name><surname>Perlman</surname> <given-names>M.</given-names></name> <name><surname>Littlemore</surname> <given-names>J.</given-names></name> <name><surname>Matlock</surname> <given-names>T.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x00027;Tiny numbers&#x00027; are actually tiny: evidence from gestures in the TV News Archive</article-title>. <source>PLoS ONE</source> <volume>15</volume>:<fpage>e0242142</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0242142</pub-id><pub-id pub-id-type="pmid">33201907</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>Klomberg, B., and Cohn, N. (under review). <italic>Picture Perfect Peaks: Comprehension of Inferential Techniques in Visual Narratives</italic>.</p></fn>
<fn id="fn0002"><p><sup>2</sup>Weissman, B., Cohn, N., and Tanner, D. (in preparation). <italic>Predictable Words and Emoji: Neural Correlates of Verbal and Pictorial Lexical Prediction in Sentence Processing</italic>.</p></fn>
</fn-group>
</back>
</article>