<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Psychol.</journal-id>
<journal-title>Frontiers in Psychology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Psychol.</abbrev-journal-title>
<issn pub-type="epub">1664-1078</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpsyg.2023.1229697</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Psychology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Lexical diversity in kinship across languages and dialects</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Khalilia</surname> <given-names>Hadi</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2036471/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Bella</surname> <given-names>G&#x000E1;bor</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Freihat</surname> <given-names>Abed Alhakim</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Darma</surname> <given-names>Shandy</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2325965/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Giunchiglia</surname> <given-names>Fausto</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2564362/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Information Engineering and Computer Science, University of Trento</institution>, <addr-line>Trento</addr-line>, <country>Italy</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Computer Science, Palestine Technical University &#x02013; Kadoorie</institution>, <addr-line>Tulkarm</addr-line>, <country>Palestine</country></aff>
<aff id="aff3"><sup>3</sup><institution>Lab-STICC CNRS UMR 628, IMT Atlantique</institution>, <addr-line>Brest</addr-line>, <country>France</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Steven Moran, University of Neuch&#x000E2;tel, Switzerland</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Sam Passmore, Australian National University, Australia; Danielle Barth, Australian National University, Australia</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Hadi Khalilia <email>hadi.khalilia&#x00040;unitn.it</email>; <email>h.khalilia&#x00040;ptuk.edu.ps</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>20</day>
<month>11</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>14</volume>
<elocation-id>1229697</elocation-id>
<history>
<date date-type="received">
<day>26</day>
<month>05</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>24</day>
<month>10</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Khalilia, Bella, Freihat, Darma and Giunchiglia.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Khalilia, Bella, Freihat, Darma and Giunchiglia</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Languages are known to describe the world in diverse ways. Across lexicons, diversity is pervasive, appearing through phenomena such as lexical gaps and untranslatability. However, in computational resources, such as multilingual lexical databases, diversity is hardly ever represented. In this paper, we introduce a method to enrich computational lexicons with content relating to linguistic diversity. The method is verified through two large-scale case studies on kinship terminology, a domain known to be diverse across languages and cultures: one case study deals with seven Arabic dialects, while the other one with three Indonesian languages. Our results, made available as browseable and downloadable computational resources, extend prior linguistics research on kinship terminology, and provide insight into the extent of diversity even within linguistically and culturally close communities.</p></abstract>
<kwd-group>
<kwd>multilingual lexicon</kwd>
<kwd>dialect</kwd>
<kwd>language diversity</kwd>
<kwd>lexical gap</kwd>
<kwd>kinship</kwd>
<kwd>lexical typology</kwd>
</kwd-group>
<counts>
<fig-count count="12"/>
<table-count count="8"/>
<equation-count count="4"/>
<ref-count count="55"/>
<page-count count="21"/>
<word-count count="12425"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Psychology of Language</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>The culture and the social structure of a community are reflected in the language spoken by its members. One of the most salient examples of this phenomenon is the worldwide diversity of terms used to describe family structures and relationships. While, thanks to studies such as Murdock (<xref ref-type="bibr" rid="B39">1970</xref>), kin terms around the globe are generally well-documented, many local variations&#x02014;across dialects of a single language or across languages of a single country&#x02014;have not yet been fully described or understood. For example, the term &#x00645;&#x0064E;&#x00639;&#x00632;&#x00648;&#x00632;&#x0064A; <italic>maazoozi</italic> in the Algerian Arabic dialect, meaning <italic>younger brother</italic>, does not have any equivalent term in the Gulf Arabic dialect. In contrast, the Gulf word &#x00627;&#x00628;&#x00646; &#x00627;&#x00644;&#x00639;&#x0064F;&#x00648;&#x0062F; <italic>ibn alood</italic> meaning <italic>elder brother</italic> does not exist in Algerian, which instead uses the word &#x00633;&#x00650;&#x0064A;&#x0062F;&#x0064A; <italic>siedi</italic>.</p>
<p>Beyond linguistic or anthropological interest, the availability of digital resources on language diversity is also desirable from a computational perspective. Language processing applications need to be aware of such phenomena of diversity in order to provide high-quality results. A machine translation system, for instance, needs to tackle cases of lexical untranslatability where a word or expression in a source language has no equivalent in a given target language, and the choice of an approximate translation can change the meaning of an utterance. For example, for the English sentence <italic>his cousin gave birth to a twin</italic>, Google Translate provides the Arabic translation &#x00623;&#x00646;&#x0062C;&#x00628; &#x00627;&#x00628;&#x00646; &#x00639;&#x00645;&#x00647; &#x0062A;&#x00648;&#x00623;&#x00645;&#x00627; <italic>a&#x00027;njaba ibna a&#x00027;mihi tawaman</italic>, which means <italic>His father&#x00027;s brother&#x00027;s son gave birth to a twin</italic>. This syntactically correct output, with its unintended meaning of a male giving birth, is due to a <italic>lexical gap</italic>, i.e., the absence of an equivalent Arabic term for <italic>cousin</italic>. Such cases of <italic>techno-linguistic bias</italic>, where language technology provides better results <italic>by design</italic> in certain languages than in others, tend to remain hidden in monolingual resources but are revealed in multilingual settings (Bella et al., <xref ref-type="bibr" rid="B10">2022a</xref>, <xref ref-type="bibr" rid="B12">2023</xref>).</p>
<p>In recent years, a growing number of linguistic databases have covered large numbers of languages. These resources are usually aimed at quantitative studies for comparative linguistics, such as the classification of pain predicates (Reznikova et al., <xref ref-type="bibr" rid="B46">2012</xref>), a semantic map of motion verbs (W&#x000E4;lchli and Cysouw, <xref ref-type="bibr" rid="B53">2012</xref>), the modeling of color terminology (McCarthy et al., <xref ref-type="bibr" rid="B37">2019</xref>), the CLICS database of cross-linguistic colexifications (Rzymski et al., <xref ref-type="bibr" rid="B48">2020</xref>), DiACL (Diachronic Atlas of Comparative Linguistics), a database for the typology of ancient Indo-European languages spoken in Eurasia (Carling et al., <xref ref-type="bibr" rid="B15">2018</xref>), or the Cross-Linguistic Database of Phonetic Transcription Systems (Anderson et al., <xref ref-type="bibr" rid="B4">2018</xref>). Often, such databases use phonetic representations of lexical units or are limited to a few hundred or a few thousand core concepts, which limits their usability for the processing of contemporary written language. In our experience, most of the existing typology-informed NLP research is restricted to exploring language-specific morphosyntactic features and has ignored diversity within lexical resources (Batsuren et al., <xref ref-type="bibr" rid="B9">2022</xref>). A notable exception is the Universal Knowledge Core, a massively multilingual lexical database that explicitly represents linguistic diversity and that we reuse in our work.</p>
<p>Our research is part of the <italic>LiveLanguage</italic> initiative, the overarching objective of which is to create, publish, and manage language resources that are &#x0201C;diversity-aware&#x0201D;&#x02014;i.e., that reflect the viewpoints of multiple speaker communities&#x02014;and that can be reused by multiple communities: linguists, cognitive scientists, AI engineers, language teachers and students (Bella et al., <xref ref-type="bibr" rid="B12">2023</xref>). Contrary to mainstream exploitative practices, LiveLanguage aims to carry out its goals while empowering local speaker communities, giving them control over resources they help to produce (Helm et al., <xref ref-type="bibr" rid="B26">2023</xref>). Involving human contributors and deciders from speaker communities is therefore a crucial part of our methodology.</p>
<p>In particular, the present paper focuses on diversity where it is less expected to appear: across dialects of the same language and across languages of the same country. To this end, we describe a multidisciplinary study on the diversity of kin terms across seven Arabic dialects (Algerian, Egyptian, Tunisian, Gulf, Moroccan, Palestinian, and Syrian) and three languages from Indonesia (Indonesian, Javanese, and Banjarese). We consider kin terms a domain particularly well-suited both for research on the methodology of collecting and producing diversity-aware linguistic data, and for comparative studies on diversity across languages.</p>
<p>Our paper aims to provide four contributions: (1) a general method for collecting multilingual lexical data from native speakers for a given domain (in our case the domain of kin terms), in a diversity-aware manner; (2) 223 kin terms and 1,619 lexical gaps collected in seven Arabic dialects and three Indonesian languages; (3) a qualitative and quantitative discussion of our results regarding the diversity observed across the dialects and languages covered; and (4) the publication of our results as an open, computer-processable dataset, as well as its integration into the Universal Knowledge Core multilingual database. Our starting point is state-of-the-art datasets on worldwide kinship terminology from ethnography (Murdock, <xref ref-type="bibr" rid="B39">1970</xref>) and computational linguistics (Khishigsuren et al., <xref ref-type="bibr" rid="B29">2022</xref>). Our data collection method is based on collaborative input from native speakers and language experts. Our results extend the state-of-the-art resources above with kin terms in languages and dialects not yet covered, as well as with 22 new kinship concepts not yet associated with other languages within those resources.</p>
<p>The paper is organized as follows. In Section 2, we give an overview of lexical typology and of the phenomena of lexical untranslatability and lexical gaps, with particular attention to the domain of kinship. The Universal Knowledge Core resource is presented in Section 3. In Section 4, we describe our data collection method. Sections 5 and 6 introduce two case studies on Arabic dialects and Indonesian languages, respectively. Section 7 discusses previous studies related to our work. Finally, we provide conclusions in Section 8.</p></sec>
<sec id="s2">
<title>2 Untranslatability and lexical typology</title>
<p>Linguists understand translation from one language to another as a complex and multidimensional problem, ranging from multiple coexisting forms of meaning equivalence to untranslatability (Catford, <xref ref-type="bibr" rid="B16">1965</xref>; Bella et al., <xref ref-type="bibr" rid="B10">2022a</xref>). The diversity between cultures is a major cause of this problem, which appears on several lexical-semantic levels. Examples of such linguistic diversity include the rich vocabulary of Toaripi, a language spoken on the coast of Papua New Guinea, for motion verbs describing walking around the beach, such as (isai) meaning &#x0201C;<italic>go beachward</italic>&#x0201D; and (kavai) meaning &#x0201C;<italic>go inland with respect to the beach</italic>&#x0201D;; the lack of a word meaning &#x0201C;<italic>sailing</italic>&#x0201D; in Mongolian, the language of a landlocked country; and the Arabic word &#x0062A;&#x0064E;&#x00633;&#x00646;&#x0064E;&#x00651;&#x00645; meaning &#x0201C;<italic>to ascend a camel&#x00027;s hump</italic>&#x0201D;.</p>
<p>The domain of kinship terms, which is the subject of our paper, is known to be extremely varied across languages, due to the different ways family structures are organized around the world. Matriarchal societies may describe certain female relatives with more detail, while strongly patriarchal ones are more descriptive with respect to male relatives. Arabic dialects, for instance, distinguish paternal and maternal brothers but also blood brothers, full brothers, and breastfeeding brothers. Thus, not only are kinship-related vocabularies &#x0201C;richer&#x0201D; or &#x0201C;poorer&#x0201D; across languages, they are also structured in different manners.</p>
<p>In this research, we focus on lexical untranslatability, which manifests most clearly through the lexical gap phenomenon, where a word in a source language does not have a concise and precise translation in a given target language. Lexical gaps are often the linguistic manifestation of culturally or spatially defined specificities of a community of language speakers that cannot entirely be predicted or explained through systematic principles or recurrent patterns (Lehrer, <xref ref-type="bibr" rid="B32">1970</xref>). <xref ref-type="table" rid="T1">Table 1</xref> below presents this phenomenon for nine concepts representing sibling relationships from the kinship domain in eight languages.<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> One can observe that none of the eight languages has concise lexicalizations for all nine concepts, yet each concept is lexicalized in at least one language. Such variations in lexicalization pose a problem for both machine and human translation: for instance, substituting a specific term for a broader one may inject unintended meaning. In Javanese, at least four specific terms, (sedulur/<italic>sibling</italic>), (adhi/<italic>younger sibling</italic>), (kangmas/<italic>elder brother</italic>), and (mbakyu/<italic>elder sister</italic>), are used for expressing the sibling relationship. Accordingly, translating the sentence (<italic>my sister is ten years older than me</italic>) into Javanese through Google Translate gives the nonsensical sentence (<italic>adhiku luwih tuwa sepuluh taun tinimbang aku</italic>), meaning (<italic>my younger sibling is ten years older than me</italic>). This result arises because Javanese lacks a word meaning (sister) as well as a term meaning &#x0201C;<italic>younger sister</italic>&#x0201D;, so the machine translator uses (adhi), meaning &#x0201C;<italic>younger sibling</italic>,&#x0201D; which produces the semantically absurd output.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Lexicalizations of nine meanings around the concept of (sibling) in eight languages.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Meaning</bold></th>
<th valign="top" align="left"><bold>English</bold></th>
<th valign="top" align="left"><bold>Japanese</bold></th>
<th valign="top" align="left"><bold>Arabic</bold></th>
<th valign="top" align="left"><bold>Italian</bold></th>
<th valign="top" align="left"><bold>Indonesian</bold></th>
<th valign="top" align="left"><bold>Hindi</bold></th>
<th valign="top" align="left"><bold>Hungarian</bold></th>
<th valign="top" align="left"><bold>Javanese</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">sibling</td>
<td valign="top" align="left">sibling</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">saudara</td>
<td valign="top" align="left">&#x00938;&#x00939;&#x0094B;&#x00926;&#x00930;</td>
<td valign="top" align="left">testv&#x000E9;r</td>
<td valign="top" align="left">sedulur</td>
</tr>
<tr>
<td valign="top" align="left">elder sibling</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">kakak</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">nagytestv&#x000E9;r</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
</tr>
<tr>
<td valign="top" align="left">younger sibling</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">adik</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">kistestv&#x000E9;r</td>
<td valign="top" align="left">adhi</td>
</tr>
<tr>
<td valign="top" align="left">brother</td>
<td valign="top" align="left">brother</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">&#x00623;&#x0064E;&#x0062E;&#x00652;</td>
<td valign="top" align="left">fratello</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">&#x0092D;&#x00948;&#x0092F;&#x0093E;</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
</tr>
<tr>
<td valign="top" align="left">sister</td>
<td valign="top" align="left">sister</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">&#x00623;&#x0064F;&#x0062E;&#x00652;&#x0062A;</td>
<td valign="top" align="left">sorella</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">&#x0092C;&#x00939;&#x00928;</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
</tr>
<tr>
<td valign="top" align="left">elder brother</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">&#x03042;&#x0306B;</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">fratellone</td>
<td valign="top" align="left">abang</td>
<td valign="top" align="left">&#x0092D;&#x00948;&#x0092F;&#x0093E;</td>
<td valign="top" align="left">b&#x000E1;ty</td>
<td valign="top" align="left">kangmas</td>
</tr>
<tr>
<td valign="top" align="left">elder sister</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">&#x03042;&#x0306D;</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">sorellona</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">&#x00926;&#x00940;&#x00926;&#x00940;</td>
<td valign="top" align="left">n&#x00151;v&#x000E9;r</td>
<td valign="top" align="left">mbakyu</td>
</tr>
<tr>
<td valign="top" align="left">younger brother</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">&#x0304A;&#x03068;&#x03046;&#x03068;</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">fratellino</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">&#x0092D;&#x0093E;&#x00908;</td>
<td valign="top" align="left">&#x000F6;cs</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
</tr>
<tr>
<td valign="top" align="left">younger sister</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">&#x03044;&#x03082;&#x03046;&#x03068;</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">sorellina</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
<td valign="top" align="left">&#x0092C;&#x00939;&#x00928;</td>
<td valign="top" align="left">h&#x000FA;g</td>
<td valign="top" align="left" style="background-color:#dee1e1">GAP</td>
</tr>
</tbody>
</table>
</table-wrap>
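<p>The two observations drawn from <xref ref-type="table" rid="T1">Table 1</xref> (no language lexicalizes all nine meanings, yet every meaning is lexicalized somewhere) can be checked mechanically. The following is a minimal Python sketch, assuming the table is transcribed as a presence matrix; the encoding and the names <monospace>PRESENCE</monospace>, <monospace>gaps_per_language</monospace>, <monospace>every_language_has_gap</monospace>, and <monospace>every_concept_lexicalized</monospace> are illustrative assumptions and not part of any published resource.</p>
<preformat>
```python
LANGUAGES = ["English", "Japanese", "Arabic", "Italian",
             "Indonesian", "Hindi", "Hungarian", "Javanese"]

# One row per meaning, one flag per language in the order above:
# "1" = lexicalized, "0" = GAP (transcribed by hand from Table 1).
PRESENCE = {
    "sibling":         "10001111",
    "elder sibling":   "00001010",
    "younger sibling": "00001011",
    "brother":         "10110100",
    "sister":          "10110100",
    "elder brother":   "01011111",
    "elder sister":    "01010111",
    "younger brother": "01010110",
    "younger sister":  "01010110",
}


def gaps_per_language(presence):
    """Count the gaps found in each language column."""
    return {lang: sum(row[i] == "0" for row in presence.values())
            for i, lang in enumerate(LANGUAGES)}


def every_language_has_gap(presence):
    """True if no language lexicalizes all of the meanings."""
    return all(count != 0 for count in gaps_per_language(presence).values())


def every_concept_lexicalized(presence):
    """True if each meaning is lexicalized in at least one language."""
    return all("1" in row for row in presence.values())


print(every_language_has_gap(PRESENCE), every_concept_lexicalized(PRESENCE))
```
</preformat>
<p>Under this encoding, both checks hold, and counting gaps per language gives, for example, seven gaps for Arabic and two for Hungarian.</p>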
<p>Lexical typology is a field of linguistics that studies cross-lingual diversity in the structural features of languages with respect to specific semantic fields (Plungyan, <xref ref-type="bibr" rid="B45">2011</xref>). Various classical studies in this field have addressed grammar and phonology, such as VoxClamantis V1.0, a large-scale corpus for phonetic typology (Salesky et al., <xref ref-type="bibr" rid="B49">2020</xref>), and the structure of the semantic field of space, analyzed by identifying a set of semantic parameters and notions that depend on the grammatical information of the field&#x00027;s constituents (Levinson and Wilkins, <xref ref-type="bibr" rid="B33">2006</xref>). Other studies have addressed lexical-typological issues that appear across languages during translation, such as the presence or absence of lexicalizations. These studies focused on semantic fields that are rich in cross-lingual diversity: family relationships (Kemp and Regier, <xref ref-type="bibr" rid="B28">2012</xref>), colors (Roberson et al., <xref ref-type="bibr" rid="B47">2005</xref>), food (Bella et al., <xref ref-type="bibr" rid="B11">2022b</xref>), body parts (Wierzbicka, <xref ref-type="bibr" rid="B54">2007</xref>), putting and taking events (Kopecka and Narasimhan, <xref ref-type="bibr" rid="B31">2012</xref>), cutting and breaking events (Majid et al., <xref ref-type="bibr" rid="B36">2007</xref>), or cardinal direction terms (Arora et al., <xref ref-type="bibr" rid="B5">2021</xref>). However, as mentioned in the introduction, only a few open datasets have been published. These include the classification of kinship by Murdock (<xref ref-type="bibr" rid="B39">1970</xref>), which has been published in D-PLACE (Kirby et al., <xref ref-type="bibr" rid="B30">2016</xref>). 
Part of Kay and Cook (<xref ref-type="bibr" rid="B27">2016</xref>)&#x00027;s work on colors is published under the lexicon chapter of the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, <xref ref-type="bibr" rid="B17">2013</xref>). Additionally, a color categorization dataset by McCarthy et al. (<xref ref-type="bibr" rid="B37">2019</xref>) is available on GitHub<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>.</p>
<p>Digital lexicons have been increasingly used in lexical typology, enabling typologists to explore a broader range of languages and semantic domains. One noteworthy example is the KinDiv<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref> lexicon (Khishigsuren et al., <xref ref-type="bibr" rid="B29">2022</xref>), which encompasses 1,911 words and identifies 37,370 gaps within the domain of kinship, spanning 699 languages. In our current research, we extend our investigation into the kinship domain, specifically focusing on exploring linguistic diversity among Arabic dialects and Indonesian languages. Other examples include Viberg (<xref ref-type="bibr" rid="B52">1983</xref>)&#x00027;s seminal study, which was conducted on perceptual terminology in 50 languages and has been expanded upon by Georgakopoulos et al. (<xref ref-type="bibr" rid="B21">2022</xref>) to cover 1,220 languages. Furthermore, the Kinbank database, recently introduced by Passmore et al. (<xref ref-type="bibr" rid="B43">2023</xref>), serves as a comprehensive repository of kinship terminology, encompassing more than 1,173 languages and offering a broad coverage of various kinship subdomains.</p></sec>
<sec id="s3">
<title>3 Universal Knowledge Core</title>
<p>This section describes the Universal Knowledge Core (UKC)<xref ref-type="fn" rid="fn0004"><sup>4</sup></xref>, a large multilingual lexical database that we adopt for the production of diversity-aware datasets in this research (Giunchiglia et al., <xref ref-type="bibr" rid="B24">2017</xref>). The use of the UKC is motivated by its ability to represent linguistic unity and diversity explicitly: conceptualizations shared across languages, word senses appearing only in certain languages, shared lexicalizations (e.g., cognates), as well as lexical gaps. The theoretical underpinnings of the lexical model of the UKC have been described in Giunchiglia et al. (<xref ref-type="bibr" rid="B25">2018</xref>) and in Bella et al. (<xref ref-type="bibr" rid="B11">2022b</xref>), and are illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Structural elements in the UKC lexical database.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0001.tif"/>
</fig>
<p>The UKC is divided into a supra-lingual concept layer (as shown at the top of <xref ref-type="fig" rid="F1">Figure 1</xref>) and the layer of individual lexicons (at the bottom of <xref ref-type="fig" rid="F1">Figure 1</xref>). The concept layer includes hierarchies of concepts that represent lexical meaning shared across languages. Concepts are language-independent units and act as bridges across languages, and each one should be lexicalized by at least one language to be present in the concept layer. Supra-lingual concepts and their relations (e.g., hypernymy, meronymy) are in part derived from third-party resources such as Princeton WordNet (PWN) (Miller, <xref ref-type="bibr" rid="B38">1995</xref>), and are in part proper to the UKC. In particular, the UKC contains an extensive formal conceptualization of kinship domain terms computed from the KinDiv database, spanning about 200 distinct concepts.<xref ref-type="fn" rid="fn0005"><sup>5</sup></xref> KinDiv itself is based on ethnographic evidence from 699 languages (Khishigsuren et al., <xref ref-type="bibr" rid="B29">2022</xref>). While this existing hierarchy of kinship concepts does not fully cover all terms that appear in our study, it is the most complete one we are aware of, motivating our choice of the UKC as a platform for our research.</p>
<p>The lexicon layer consists of language-specific lexicons that provide lexicalizations for the concepts from the supra-lingual concept layer, while also asserting <italic>lexical gaps</italic> whenever lexicalizations are known not to exist. Lexicons also provide term definitions as well as lexical relationships specific to the language, such as derivations, metonymy, or antonymy relations. Lexicons can also contain <italic>language-specific concepts</italic> that do not appear in the supra-lingual concept layer. For example, in <xref ref-type="fig" rid="F1">Figure 1</xref>, the Arabic &#x00634;&#x0064E;&#x00642;&#x00650;&#x0064A;&#x00642;&#x00629;, meaning &#x0201C;<italic>a female person who has the same father, mother, or both parents as another person</italic>&#x0201D;, is represented as a language-specific concept. The dual mechanism of defining lexical concepts either on the supra-lingual or on the language-specific level allows for the representation of differing worldviews that would be hard or impossible to reconcile into a single global concept graph. The richness of its lexicon-level linguistic knowledge makes the UKC unique among multilingual lexical databases and particularly suitable for our study.</p>
<p>As mentioned in Section 2, a language has a lexical gap for a specific concept if it has no concise word expressing that concept. For example, neither English nor Arabic has a word meaning <italic>elder sibling</italic>; for such cases, the UKC provides evidence of the non-existence of a lexicalization, and thus of untranslatability, by representing lexical gaps inside lexicons, as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. This information can be used by the NLP community to signal the absence of equivalent words to downstream cross-lingual applications.</p>
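<p>To illustrate how a downstream application might consume such information, the following minimal Python sketch distinguishes three outcomes of a lookup: a lexicalization, an asserted lexical gap, and the absence of any information. The dictionary layout and the names <monospace>LEXICONS</monospace>, <monospace>Status</monospace>, and <monospace>lookup</monospace> are hypothetical and do not reflect the actual UKC data format or API; the sample entries are taken from <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<preformat>
```python
from enum import Enum


class Status(Enum):
    WORD = "word"        # a concise lexicalization exists
    GAP = "gap"          # a lexicalization is known not to exist
    UNKNOWN = "unknown"  # the lexicon says nothing about the concept


# Hypothetical in-memory lexicons keyed by ISO 639-3 code.
# This layout is an assumption for illustration, not the UKC format.
LEXICONS = {
    "eng": {
        "words": {"sibling": "sibling", "brother": "brother"},
        "gaps": {"elder sibling", "younger sibling"},
    },
    "ind": {
        "words": {"sibling": "saudara", "elder sibling": "kakak",
                  "younger sibling": "adik"},
        "gaps": {"brother", "sister"},
    },
}


def lookup(lang, concept):
    """Return (status, word) for a concept in the given language."""
    lexicon = LEXICONS.get(lang)
    if lexicon is None:
        return Status.UNKNOWN, None
    if concept in lexicon["words"]:
        return Status.WORD, lexicon["words"][concept]
    if concept in lexicon["gaps"]:
        return Status.GAP, None
    return Status.UNKNOWN, None


print(lookup("ind", "elder sibling"))
print(lookup("eng", "elder sibling"))
```
</preformat>
<p>Keeping an asserted gap distinct from missing data lets an application tell apart a word that is known not to exist from one that has simply not been recorded yet.</p>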
<p>Beyond providing lexical relations between shared word meanings as other multilingual lexical databases do, the UKC also represents a richer set of lexical-semantic connections between language units in a lexicon. For example, the <italic>antonym</italic> lexical relation expresses that two senses are opposite in meaning, the lexical-semantic relation <italic>similar-to</italic> connects two concepts with similar meanings, and <italic>hypernym-of</italic> connects a parent meaning with its child. For instance, in <xref ref-type="fig" rid="F1">Figure 1</xref>, the English (little brother) and (brother) are connected through a <italic>hypernym-of</italic> relationship. Such information, e.g., the position of a language-specific meaning within the hierarchy of a lexicon, can be used by the NLP community to indicate a concise, equivalent language-specific word meaning to downstream cross-lingual applications.</p>
<p>The UKC currently does not explicitly distinguish between languages and dialects: each vocabulary is a separate entity labeled with a standard three-letter ISO 639-3 code. When such a code is not available, the UKC uses a standard extension mechanism where three additional (not standardized) letters are added to the ISO code: e.g., for Syrian Arabic, the code <monospace>arb-syr</monospace> is used.</p></sec>
<sec id="s4">
<title>4 A methodology for building diversity-aware lexicons</title>
<p>This section presents the general method by which we collected and produced lexicalizations and gaps from native speakers and language experts. The same method was applied independently to each Arabic dialect and Indonesian language covered by our study. The contents of this section aim to serve as a tried and tested recipe for gathering lexical data in a diversity-aware manner, which we intend to reuse in future lexicon development projects.</p>
<p>We exploit the UKC to import language-independent concepts (e.g., kinship concepts) to be used as an input dataset to our method and use its data representation model to formalize our data. We reuse an already broad and well-formalized hierarchy of 184 kinship concepts from the KinDiv database, which includes kinship terms and gaps in 699 languages. Data in KinDiv is based on the well-known results of Murdock (<xref ref-type="bibr" rid="B39">1970</xref>), as well as on lexicalizations retrieved from Wiktionary, which we consider an overall good-quality resource. In Khishigsuren et al. (<xref ref-type="bibr" rid="B29">2022</xref>), the accuracy of KinDiv was evaluated to be above 96%. One language expert per language provided this percentage, which represents the proportion of the number of words (or gaps) validated as correct to the total number of collected words (or gaps).</p>
<p>Our work extends KinDiv data with new concepts, lexicalizations, and lexical gaps in languages and dialects that are either not present in KinDiv or are incompletely covered. A lexical-semantic expert generates a contribution task (kinship terms and gaps), and native speakers then collect contributions for each dialect (or local language). The collected contributions are validated in two steps: language experts evaluate the collected lexical units and gaps of a dialect, and a lexical-semantic expert evaluates the newly explored kinship concepts (those not existing in the UKC). Finally, the resulting data (gaps, words, and new concepts) is used to update and enrich the UKC: gaps and words are merged into the lexicons of the UKC, while new concepts are integrated into the (top) concept layer. A general view of the method is depicted in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Methodology macro-steps and data sources.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0002.tif"/>
</fig>
<p>Accordingly, the macro-steps of our methodology are as follows:</p>
<list list-type="order">
<list-item><p><italic>Contribution task generation</italic>: First, prepare the materials: the dataset of inputs to be examined and the architecture of the supra-lingual concept layer of each subdomain.</p></list-item>
<list-item><p><italic>Contribution collection</italic>: The actual contribution effort is carried out by a native speaker in a local language or dialect.</p></list-item>
<list-item><p><italic>Lexicon-level validation</italic>: Provided words and gaps are evaluated and corrected by a language expert.</p></list-item>
<list-item><p><italic>Concept-level validation</italic>: New concepts and unclear contributions (i.e., words on the borderline) are verified by a lexical-semantic expert.</p></list-item>
</list>
<sec>
<title>4.1 Contribution task generation</title>
<p>This section describes the material needed during the execution of the next steps of the methodology. Hence, two constituents must be prepared in this step as described below:</p>
<list list-type="order">
<list-item><p><italic>Dataset of inputs</italic>: Constructing the dataset of general word meanings is the first step in studying diversity across dialects; it provides the inputs of the contribution collection phase. In this context, the dataset is built from the UKC, which offers several facilities for retrieving categorized data from its interlingual shared meaning layer, as introduced in Section 3. Alternatively, typology datasets can be used, such as the kinship dataset from Murdock (<xref ref-type="bibr" rid="B39">1970</xref>), or data can be gathered from online dictionaries using automatic methods; e.g., KinDiv retrieves some of its kinship terms from Wiktionary. The constructed dataset is a spreadsheet containing language-independent meanings from one semantic field. Its content is distributed into subdomains (sheets) for usability and to simplify the design of a concept hierarchy for each subdomain, which is a helpful tool for lexical-gap exploration. One spreadsheet row is generated for each concept, containing the concept ID, the source concept definition in the standard language, another definition in English, as well as empty slots for inserting a lexical gap or a word with equivalent meaning, and the data provider&#x00027;s comments in a dialect or local language.</p></list-item>
<list-item><p><italic>Interlingual concept hierarchy</italic>: Modeling the interlingual shared meaning space is essential to explore lexical gaps systematically. In this task, the UKC concept hierarchy is exploited. UKC is the only resource introducing a hierarchy of shared meanings across languages for each semantic field, such as kinship, colors, or food. Furthermore, UKC uses a hybrid linguistic-conceptual approach in modeling each domain. This approach adopts actual domain ontology and linguistic data from typological literature. For example, a fragment of the brotherhood hierarchy in the top layer of the UKC is shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. A native speaker can compare each examined concept from the spreadsheet with the hierarchy of its domain to extract additional knowledge about its meaning based on a concept&#x00027;s position in the hierarchy, which helps to provide a concrete answer in terms of a gap or a lexical unit.</p></list-item>
</list>
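<p>The spreadsheet row described above can be sketched as a small data structure. The following Python fragment is only an illustration: the class name <monospace>ConceptRow</monospace>, its field names, the <monospace>GAP</monospace> marker, and the sample concept IDs and answers are assumptions made for this sketch and are not taken from the UKC or from the actual spreadsheets.</p>
<preformat>
```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical marker a contributor writes to assert a lexical gap.
GAP = "GAP"


@dataclass
class ConceptRow:
    """One spreadsheet row per concept (all field names are illustrative)."""
    concept_id: str                  # concept ID
    source_definition: str           # definition in the standard language
    english_definition: str          # definition in English
    answer: Optional[str] = None     # word with equivalent meaning, or GAP
    comment: str = ""                # data provider's comment

    def status(self) -> str:
        """Classify the row as unanswered, a gap, or a lexicalization."""
        if self.answer is None or not self.answer.strip():
            return "unanswered"
        if self.answer.strip() == GAP:
            return "gap"
        return "word"


# Made-up example rows for a siblings sheet (placeholders, not real data).
rows = [
    ConceptRow("sib-001", "(standard-language definition)", "elder sibling", answer=GAP),
    ConceptRow("sib-002", "(standard-language definition)", "brother", answer="WORD_FOR_BROTHER"),
    ConceptRow("sib-003", "(standard-language definition)", "younger sister"),
]
statuses = [row.status() for row in rows]
print(statuses)
```
</preformat>
<p>Keeping gaps as explicit answers, rather than as empty cells, separates a concept that the contributor has not yet examined from one that the contributor asserts to be unlexicalized.</p>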
</sec>
<sec>
<title>4.2 Contribution collection</title>
<p>Contributions from a local language or a dialect are provided by one native speaker who was born and educated (university level) within the speaker community. The following are the most notable instructions they are given:</p>
<list list-type="order">
<list-item><p>They are given the authority to skip concepts, stop contributions, or leave a comment when they deem the terms are becoming too culture-specific and consequently need an exact answer.</p></list-item>
<list-item><p>They are asked to provide a lexicalization in a local language (or dialect) that gives meaning equal to the concept&#x00027;s meaning.</p></list-item>
<list-item><p>They are asked explicitly to identify lexical gaps where no local (or dialect) lexicalization exists.</p></list-item>
<list-item><p>Within a local language (or dialect) and a subdomain (e.g., cousins), they are asked to provide new concepts that do not exist in the list of inputs imported from the UKC, by providing a word (lemma) and a clear description of its meaning.</p></list-item>
</list>
<p>The process of providing such contributions is depicted in two flowcharts. <xref ref-type="fig" rid="F3">Figure 3</xref> shows the flowchart for exploring candidate gaps (on the left-hand side of the figure) and candidate equivalent word meanings (on the right-hand side of the figure). The process starts by identifying a standard language and a local language (or dialect) and providing a native speaker with a spreadsheet that lists the subdomain concepts (inputs). The native speaker is then asked to find linguistic resources in the local language and to search them concept by concept in order to confirm lexicalizations and gaps. The resources are consulted in the following order: a well-known dictionary; then Wiktionary, a large multilingual online lexicon; then a typology dataset (if available); and finally Google search (based on the count of search hits). More details about these steps are given in Section 5. The native speaker can rely on the search results and the count of Google hits to give a more concrete answer on whether the concept in the standard language has a lexicalization or is a gap in the local language; such candidates are passed to the next phase, lexicon-level validation.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Flowchart of gap and equivalent word meaning identification.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0003.tif"/>
</fig>
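<p>The sequential search described above (dictionary, then Wiktionary, then a typology dataset, and finally Google hit counts) can be sketched as a simple cascade. This is a hedged illustration rather than an implementation used in the study, where the lookups are performed manually by the native speaker; the function name, the in-memory stand-ins for the resources, and the hit threshold are all hypothetical.</p>
<preformat>
```python
from typing import Callable, Sequence

# A resource lookup answers whether a term is attested in that resource.
Lookup = Callable[[str], bool]


def candidate_judgment(term: str,
                       resources: Sequence[Lookup],
                       hit_count: Callable[[str], int],
                       hit_threshold: int) -> str:
    """Return 'candidate word' or 'candidate gap' for a term.

    Resources are consulted in order; the first one that attests the term
    settles the question. Otherwise the search-hit count serves as a
    fallback indicator. The threshold is an assumption of this sketch.
    """
    for lookup in resources:
        if lookup(term):
            return "candidate word"
    if hit_count(term) >= hit_threshold:
        return "candidate word"
    return "candidate gap"


# Toy stand-ins for a dictionary, Wiktionary, and a typology dataset.
dictionary = {"term_a"}
wiktionary = {"term_b"}
typology = set()
hits = {"term_a": 10, "term_b": 20, "term_c": 5_000_000, "term_d": 3}

resources = [lambda t: t in dictionary,
             lambda t: t in wiktionary,
             lambda t: t in typology]
results = {
    term: candidate_judgment(term, resources, lambda t: hits.get(t, 0), 1_000_000)
    for term in ["term_a", "term_b", "term_c", "term_d"]
}
print(results)
```
</preformat>
<p>The output is deliberately labeled as a candidate: in the methodology, every such judgment is still passed on to lexicon-level validation.</p>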
<p>New concept collection is the third kind of contribution in this phase; the steps for exploring candidate new concepts in a local language can be seen in <xref ref-type="fig" rid="F4">Figure 4</xref>. A native speaker can examine the list of subdomain concepts and provide, together with their definitions, concepts that he/she believes are missing from the list. The same search steps as in gap identification can be followed in this task. As shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, all candidate new concepts are passed to the two subsequent validation phases: lexicon level and concept level.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Flowchart of a new concept collection.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0004.tif"/>
</fig>
</sec>
<sec>
<title>4.3 Lexicon-level validation</title>
<p>Our lexicon-level validation method formally and explicitly addresses individual gap identifications and their quality, as well as equivalent word meanings and new concepts. It allows a qualitative evaluation of the entire list of provided contributions, word by word and gap by gap, in a loop between a native speaker and a validator. A word, a gap, or a new concept does not pass this validation until the native speaker provides a correct answer for it, as shown in the flowcharts in <xref ref-type="fig" rid="F3">Figures 3</xref>, <xref ref-type="fig" rid="F4">4</xref>.</p>
<p>A language expert who is also a native speaker of the determined language (or dialect) will carry out this validation on a spreadsheet containing the data and results gathered in the previous step with two additional empty columns: the evaluation and lexicon-level validator&#x00027;s comment, producing the following information:</p>
<list list-type="order">
<list-item><p><italic>Equivalent word meanings</italic>: validate the correctness of all provided words in the local language (or dialect) by marking them up as correct, incorrect, or unclear for borderline cases and by providing correct words or indicating them as lexical gaps for incorrect ones.</p></list-item>
<list-item><p><italic>Lexical gaps</italic>: validate the word meanings marked as lexical gaps by the native speaker in the local language, either as confirmed gaps or as non-gaps due to an existing lexicalization in that language, which the validator needs to indicate.</p></list-item>
<list-item><p><italic>New concepts</italic>: validate all proposed new word meanings in each subdomain by marking them up as correct, correct but not new (in case the supposedly new concepts already existed in the list), or not accepted (in case another concept already existed in the list to express the meaning, or the validator does not consider it as a desirable suggestion for other reasons).</p></list-item>
</list>
<p>Correct equivalent word meanings and gaps are integrated into the local language lexicon on the fly, and correct new concepts are passed to the next step to be validated at the concept level before being merged into the supra-lingual shared meaning layer. If, instead, a word or a gap is evaluated as incorrect, or a new concept as not accepted, the validator returns it to the native speaker with a comment describing the reason, so that the native speaker can review and address the problem; once the revision is finished, the native speaker returns the new version of the contribution to the validator. This cycle (native speaker contribution, then lexicon-level validation) continues until the validator confirms the correctness of the contribution or skips it.</p>
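<p>The review cycle between the native speaker and the lexicon-level validator can be summarized as a small loop. In the sketch below, the outcomes mirror the text (a contribution is confirmed, returned with a comment, or skipped), while the function names, the callback signatures, and the <monospace>max_rounds</monospace> safeguard are assumptions introduced for illustration only.</p>
<preformat>
```python
from typing import Callable, Tuple

# The validator returns a verdict and a comment for a contribution.
# Verdict is one of "correct", "incorrect", or "skip".
Verdict = Tuple[str, str]


def validation_cycle(contribution: str,
                     validate: Callable[[str], Verdict],
                     revise: Callable[[str, str], str],
                     max_rounds: int = 10) -> Tuple[str, str]:
    """Loop between validator and native speaker.

    Returns (final_contribution, outcome), where outcome is 'accepted',
    'skipped', or 'unresolved' (the last only if max_rounds is exhausted;
    that bound is a safeguard of this sketch, not part of the method).
    """
    for _ in range(max_rounds):
        verdict, comment = validate(contribution)
        if verdict == "correct":
            return contribution, "accepted"
        if verdict == "skip":
            return contribution, "skipped"
        # Incorrect: returned with a comment; the native speaker revises.
        contribution = revise(contribution, comment)
    return contribution, "unresolved"


# Toy example: the validator accepts only the answer "word_b".
def toy_validate(answer: str) -> Verdict:
    if answer == "word_b":
        return "correct", ""
    return "incorrect", "use word_b instead"


def toy_revise(answer: str, comment: str) -> str:
    return "word_b"


result = validation_cycle("word_a", toy_validate, toy_revise)
print(result)
```
</preformat>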
</sec>
<sec>
<title>4.4 Concept-level validation</title>
<p>In this step, a lexical-semantic expert who is the manager of the UKC system verifies the new concepts and their quality, accepting or rejecting their addition to the supra-lingual concept layer, and also addresses unclear words and non-confirmed gaps/non-gaps that are borderline cases. This validation is based on a discussion session with the language expert responsible for lexicon-level validation, proceeding concept by concept and case by case. A spreadsheet containing all new concepts and the selected words and gaps to be examined is used. Its columns are the same as in the previous step, plus two additional empty ones: the evaluation and the concept-level validator&#x00027;s comment. The following tasks are carried out:</p>
<list list-type="order">
<list-item><p><italic>New concepts</italic>: Validate all proposed new concepts in each subdomain by marking them up as correct, correct but not new (in case the supposedly new concepts already existed in the UKC), or not accepted (in case another concept already existed in the UKC to express the meaning, or the validator does not consider the new concept as a desirable suggestion for any other reason).</p></list-item>
<list-item><p><italic>Unclear words</italic>: Validate the correctness of unclear word cases considered in the border-area by the lexicon-level validator by marking them as correct or incorrect and writing a comment.</p></list-item>
<list-item><p><italic>Non-confirmed gaps/non-gaps</italic>: Validate the word meanings that do not have confirmation as lexical gaps or non-gaps by providing a judgment with a comment.</p></list-item>
</list>
<p>Correct new concepts are imported into the UKC by merging them into the supra-lingual concept layer. In contrast, not-accepted concepts and those that are correct but not new are returned to the lexicon-level validator, who may in turn return them to the native speaker with a comment describing the reason, so that the problem can be addressed. In a new cycle, the new concepts modified by the native speaker are transferred back to this phase through the lexicon-level validator; the concept-level validator then reviews the updates and decides whether to finish the revision cycle by accepting or rejecting the new concepts, or to start a new cycle for further review, as shown in <xref ref-type="fig" rid="F4">Figure 4</xref>. In addition, confirmed words and gaps output from this step are integrated into the language lexicon in the UKC, as shown in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p></sec>
</sec>
<sec id="s5">
<title>5 Case study on diversity across Arabic dialects</title>
<p>This section demonstrates the use of the methodology described in Section 4 on kinship terminology from seven dialects of the Arabic language. Arabic is the official language of more than four hundred million native speakers in twenty-two countries in the Middle East and northern Africa. Classical Arabic or Modern Standard Arabic (MSA) refers to the standard form of the language used in academic writing, formal communication, classical poetry, and religious sermons (Elkateb et al., <xref ref-type="bibr" rid="B19">2006</xref>). Arabic dialects exhibit a surprising degree of lexical diversity, which is evident in our study of seven of the twenty dialects spoken worldwide. The selected dialects are Egyptian, Moroccan, Tunisian, Algerian, Gulf, and South Levantine (represented by two varieties: Palestinian and Syrian). Let us take the example of the Gulf word &#x00627;&#x00644;&#x0062E;&#x0064E;&#x00627;&#x00644; &#x00627;&#x00644;&#x00639;&#x00648;&#x0062F;&#x00652; meaning &#x0201C;<italic>mother&#x00027;s elder brother</italic>,&#x0201D; which has no equivalent in South Levantine or Moroccan; instead, these dialects use the more general word &#x00627;&#x00644;&#x0062E;&#x0064E;&#x00627;&#x00644; meaning &#x0201C;<italic>mother&#x00027;s brother</italic>,&#x0201D; which covers both &#x0201C;<italic>mother&#x00027;s younger brother</italic>&#x0201D; and &#x0201C;<italic>mother&#x00027;s elder brother</italic>&#x0201D;. In this paper, we perform an experiment on the Arabic dialects to capture their diversity in the kinship domain. The resulting dataset with dialect-specific kinship terms will be integrated into an instance of the Universal Knowledge Core for Arabic (Arabic UKC)<xref ref-type="fn" rid="fn0006"><sup>6</sup></xref>, an ongoing project that is so far the first diversity-aware lexical resource for Arabic dialects.</p>
<sec>
<title>5.1 Experiment setup</title>
<p>As mentioned in Section 3, the UKC is our data source for building the input dataset of language-independent kinship concepts and for formalizing both these concepts and the new word meanings (not existing in the inputs) explored in this experiment. For example, the brotherhood hierarchy is shown in the top layer of the UKC in <xref ref-type="fig" rid="F1">Figure 1</xref>. In this study, contributions are provided by seven native speakers (one per Arabic dialect). Regarding the contributors&#x00027; socio-linguistic background, each has at least a master&#x00027;s degree and was born and educated, at least up to high school level, within the native speaker community. The participants&#x00027; linguistic backgrounds are presented below:</p>
<list list-type="order">
<list-item><p><italic>Participant 1</italic>: a native Algerian speaker with good command of English.</p></list-item>
<list-item><p><italic>Participant 2</italic>: a native Egyptian speaker with good command of English.</p></list-item>
<list-item><p><italic>Participant 3</italic>: a native Tunisian speaker with good command of English and French.</p></list-item>
<list-item><p><italic>Participant 4</italic>: a native Gulf speaker with good command of English and Arabic-Palestinian.</p></list-item>
<list-item><p><italic>Participant 5</italic>: a native Moroccan speaker with good command of English and Italian.</p></list-item>
<list-item><p><italic>Participant 6</italic>: a native Palestinian speaker with good command of Arabic-Syrian and English.</p></list-item>
<list-item><p><italic>Participant 7</italic>: a native Syrian speaker with good command of English.</p></list-item>
</list>
<p>Seven experiments (one for each dialect) are performed to explore lexical units and gaps using our method. In each experiment, a spreadsheet of kinship concepts is imported from the UKC (where they originate from the KinDiv database) and serves as the input dataset to the contribution collection step. These kinship domain concepts are language-independent units representing lexical meaning shared across 699 languages and spanning 184 distinct concepts. The UKC categorizes kinship concepts into six groups called subdomains, each containing a distinct subset of concepts that share a common kinship type, for example, the sibling and cousin subdomains. The spreadsheet (the dataset) consists of six sheets, each representing one kinship subdomain. <xref ref-type="table" rid="T2">Table 2</xref> shows the subdomain names and the number of concepts per subdomain in the dataset.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>The count of concepts in the input dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Subdomains</bold></th>
<th valign="top" align="center"><bold>Count of concepts</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Grandparents</td>
<td valign="top" align="center">19</td>
</tr>
<tr>
<td valign="top" align="left">Grandchildren</td>
<td valign="top" align="center">27</td>
</tr>
<tr>
<td valign="top" align="left">Siblings</td>
<td valign="top" align="center">21</td>
</tr>
<tr>
<td valign="top" align="left">Uncle/aunt</td>
<td valign="top" align="center">27</td>
</tr>
<tr>
<td valign="top" align="left">Nephew/niece</td>
<td valign="top" align="center">33</td>
</tr>
<tr>
<td valign="top" align="left">Cousins</td>
<td valign="top" align="center">57</td>
</tr>
<tr>
<td valign="top" align="left">Total</td>
<td valign="top" align="center">184</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In the contribution collection step, the native speaker answers by filling in a lexical unit or a gap in the empty slot of the row corresponding to each concept. Linguistic resources and Google Search are used to provide answers that are as precise as possible. For example, the &#x00627;&#x00644;&#x00645;&#x00639;&#x00627;&#x00646;&#x0064A; Almaany dictionary<xref ref-type="fn" rid="fn0007"><sup>7</sup></xref>, Wiktionary<xref ref-type="fn" rid="fn0008"><sup>8</sup></xref>, and the <italic>Fiqh AlArabiyya</italic> typology book (Muttaqin, <xref ref-type="bibr" rid="B40">2009</xref>) are consulted in sequence to reach a judgment on cousin words in Syrian. Additionally, the number of hits returned by the Google search engine is another helpful indicator: a high count of hits suggests that the searched word is a lexical unit in Syrian (e.g., &#x00627;&#x00628;&#x00646; &#x00627;&#x00644;&#x00639;&#x00645;&#x00629; meaning &#x0201C;<italic>son of father&#x00027;s sister</italic>,&#x0201D; has 131.5 million hits). In contrast, a low count suggests a lexical gap; for example, &#x00627;&#x00644;&#x0062E;&#x00624;&#x00648;&#x00644;&#x00629; meaning &#x0201C;<italic>maternal cousin</italic>,&#x0201D; has 158 thousand hits. Google hits of other cousin terms are shown in <xref ref-type="table" rid="T3">Table 3</xref>. Since Arabic words can be written and read with or without diacritics (e.g., &#x0201C;<italic>fatha</italic>&#x0201D; above a letter or &#x0201C;<italic>kassra</italic>&#x0201D; under it), each word is searched in both forms, and the table reports the count for each form together with their sum. Note that the content of this table cannot be considered the only criterion for gap exploration, because the hits for a word may include occurrences of the same word in other Arabic dialects.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Count of Google search hits for cousin concepts in Arabic.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Concept</bold></th>
<th valign="top" align="left"><bold>With/Without diacritics</bold></th>
<th valign="top" align="center" colspan="2"><bold>Count of hits</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x00627;&#x00644;&#x00639;&#x00645;&#x00648;&#x00645;&#x00629; Paternal cousin</td>
<td valign="top" align="left">&#x00627;&#x00644;&#x00639;&#x0064F;&#x00645;&#x00648;&#x00645;&#x0064E;&#x00629;&#x0064F;</td>
<td valign="top" align="center">1.94 M</td>
<td valign="top" align="center">3.04 M</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">&#x00627;&#x00644;&#x00639;&#x00645;&#x00648;&#x00645;&#x00629;</td>
<td valign="top" align="center">1.1 M</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x00627;&#x00644;&#x0062E;&#x00624;&#x00648;&#x00644;&#x00629; Maternal cousin</td>
<td valign="top" align="left">&#x00627;&#x00644;&#x0062E;&#x0064F;&#x00624;&#x00648;&#x00644;&#x0064E;&#x00629;&#x0064C;</td>
<td valign="top" align="center">111 k</td>
<td valign="top" align="center">158 k</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">&#x00627;&#x00644;&#x0062E;&#x00624;&#x00648;&#x00644;&#x00629;</td>
<td valign="top" align="center">47 k</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x00627;&#x00628;&#x00646;&#x00627;&#x00644;&#x00639;&#x00645; Son of father&#x00027;s brother</td>
<td valign="top" align="left">&#x00627;&#x00650;&#x00628;&#x00652;&#x00646;&#x00627;&#x00644;&#x00639;&#x0064E;&#x00645;&#x00651;</td>
<td valign="top" align="center">84.8 M</td>
<td valign="top" align="center">93.96 M</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">&#x00627;&#x00628;&#x00646;&#x00627;&#x00644;&#x00639;&#x00645;</td>
<td valign="top" align="center">9.16 M</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x0062A;&#x00627;&#x00644;&#x00639;&#x00645; Daughter of father&#x00027;s brother</td>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x00652;&#x0062A;&#x00627;&#x00644;&#x00639;&#x0064E;&#x00645;&#x00651;</td>
<td valign="top" align="center">8.43 M</td>
<td valign="top" align="center">83.13 M</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x0062A;&#x00627;&#x00644;&#x00639;&#x00645;</td>
<td valign="top" align="center">74.7 M</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x00627;&#x00628;&#x00646;&#x00627;&#x00644;&#x00639;&#x00645;&#x00629; Son of father&#x00027;s sister</td>
<td valign="top" align="left">&#x00627;&#x00650;&#x00628;&#x00652;&#x00646;&#x00627;&#x00644;&#x00639;&#x0064E;&#x00645;&#x0064E;&#x00651;&#x00629;</td>
<td valign="top" align="center">12.5 M</td>
<td valign="top" align="center">131.5 M</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">&#x00627;&#x00628;&#x00646;&#x00627;&#x00644;&#x00639;&#x00645;&#x00629;</td>
<td valign="top" align="center">119 M</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x0062A;&#x00627;&#x00644;&#x00639;&#x00645;&#x00629; Daughter of father&#x00027;s sister</td>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x00652;&#x0062A;&#x00627;&#x00644;&#x00639;&#x0064E;&#x00645;&#x0064E;&#x00651;&#x00629;</td>
<td valign="top" align="center">9 M</td>
<td valign="top" align="center">30.4 M</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x0062A;&#x00627;&#x00644;&#x00639;&#x00645;&#x00629;</td>
<td valign="top" align="center">21.4 M</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x00627;&#x00628;&#x00646;&#x00627;&#x00644;&#x0062E;&#x00627;&#x00644; Son of mother&#x00027;s brother</td>
<td valign="top" align="left">&#x00627;&#x00650;&#x00628;&#x00652;&#x00646;&#x00627;&#x00644;&#x0062E;&#x0064E;&#x00627;&#x00644;</td>
<td valign="top" align="center">5.61 M</td>
<td valign="top" align="center">33.01 M</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">&#x00627;&#x00628;&#x00646;&#x00627;&#x00644;&#x0062E;&#x00627;&#x00644;</td>
<td valign="top" align="center">27.4 M</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x0062A;&#x00627;&#x00644;&#x0062E;&#x00627;&#x00644; Daughter of mother&#x00027;s brother</td>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x00652;&#x0062A;&#x00627;&#x00644;&#x0062E;&#x0064E;&#x00627;&#x00644;</td>
<td valign="top" align="center">3.99 M</td>
<td valign="top" align="center">30.69 M</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x0062A;&#x00627;&#x00644;&#x0062E;&#x00627;&#x00644;</td>
<td valign="top" align="center">26.7 M</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x00627;&#x00628;&#x00646;&#x00627;&#x00644;&#x0062E;&#x00627;&#x00644;&#x00629; Son of mother&#x00027;s sister</td>
<td valign="top" align="left">&#x00627;&#x00650;&#x00628;&#x00652;&#x00646;&#x00627;&#x00644;&#x0062E;&#x0064E;&#x00627;&#x00644;&#x0064E;&#x00629;</td>
<td valign="top" align="center">12.5 M</td>
<td valign="top" align="center">16.59 M</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">&#x00627;&#x00628;&#x00646;&#x00627;&#x00644;&#x0062E;&#x00627;&#x00644;&#x00629;</td>
<td valign="top" align="center">4.09 M</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x0062A;&#x00627;&#x00644;&#x0062E;&#x00627;&#x00644;&#x00629; Daughter of mother&#x00027;s sister</td>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x00652;&#x0062A;&#x00627;&#x00644;&#x0062E;&#x0064E;&#x00627;&#x00644;&#x0064E;&#x00629;</td>
<td valign="top" align="center">11 M</td>
<td valign="top" align="center">16.67 M</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">&#x00628;&#x00650;&#x00646;&#x0062A;&#x00627;&#x00644;&#x0062E;&#x00627;&#x00644;&#x00629;</td>
<td valign="top" align="center">5.67 M</td>
<td/>
</tr>
</tbody>
</table>
</table-wrap>
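<p>The totals in Table 3 are the sums of the hits obtained for the two written forms of each term (with and without diacritics). The following sketch recomputes a few of these totals from the table; the parsing helper and the dictionary keys (English glosses used in place of the Arabic terms) are assumptions of this illustration.</p>
<preformat>
```python
from decimal import Decimal

# Multipliers for the unit suffixes used in Table 3.
UNITS = {"k": Decimal(1_000), "M": Decimal(1_000_000)}


def parse_hits(text: str) -> Decimal:
    """Convert a count such as '1.94 M' or '111 k' into a number."""
    value, unit = text.split()
    return Decimal(value) * UNITS[unit]


# (hits with diacritics, hits without diacritics), copied from Table 3.
hits = {
    "paternal cousin": ("1.94 M", "1.1 M"),
    "maternal cousin": ("111 k", "47 k"),
    "son of father's sister": ("12.5 M", "119 M"),
}

totals = {term: parse_hits(a) + parse_hits(b) for term, (a, b) in hits.items()}
for term, total in totals.items():
    print(term, total)
```
</preformat>
<p>Decimal arithmetic is used so that sums such as 1.94 M + 1.1 M reproduce the reported 3.04 M exactly, without floating-point rounding.</p>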
</sec>
<sec>
<title>5.2 Experiment results</title>
<p>The overall contribution collection effort resulted in 180 words, 1,108 lexical gaps, and 19 new concepts identified, formalized, and collected. Detailed statistics about the collected gaps and words are shown in <xref ref-type="table" rid="T4">Table 4</xref>. New concepts were identified in three subdomains: siblings, cousins, and grandchildren. The total number of new concepts, 19, is lower than the sum of new concepts per language due to overlaps across languages: for example, &#x00623;&#x0064E;&#x0062E;&#x0064C; &#x00641;&#x0064A; &#x00627;&#x00644;&#x00631;&#x00636;&#x00627;&#x00639;&#x00629; meaning <italic>breastfeeding brother</italic> was found in all seven dialects, &#x00644;&#x00623;&#x00645;&#x00651; &#x00623;&#x0064E;&#x0062E;&#x0064C;&#x0062A; meaning <italic>maternal sister</italic> was found both in Syrian and in Egyptian, while &#x00623;&#x00628;&#x00652;&#x0064A;&#x00650;&#x00647; meaning <italic>elder cousin, son of mother&#x00027;s brother</italic> only exists in Egyptian.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>The count of the diversity items collected and identified in the Arabic dialects.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Dialects</bold></th>
<th valign="top" align="center"><bold>Words</bold></th>
<th valign="top" align="left"><bold>Gaps w/o new concepts</bold></th>
<th valign="top" align="center"><bold>New concepts</bold></th>
<th valign="top" align="left"><bold>Gaps considering new concepts</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Algerian</td>
<td valign="top" align="center">28</td>
<td valign="top" align="left">156</td>
<td valign="top" align="center">10</td>
<td valign="top" align="left">165</td>
</tr>
<tr>
<td valign="top" align="left">Egyptian</td>
<td valign="top" align="center">32</td>
<td valign="top" align="left">152</td>
<td valign="top" align="center">19</td>
<td valign="top" align="left">152</td>
</tr>
<tr>
<td valign="top" align="left">Moroccan</td>
<td valign="top" align="center">22</td>
<td valign="top" align="left">162</td>
<td valign="top" align="center">10</td>
<td valign="top" align="left">169</td>
</tr>
<tr>
<td valign="top" align="left">Palestinian</td>
<td valign="top" align="center">23</td>
<td valign="top" align="left">161</td>
<td valign="top" align="center">14</td>
<td valign="top" align="left">166</td>
</tr>
<tr>
<td valign="top" align="left">Syrian</td>
<td valign="top" align="center">24</td>
<td valign="top" align="left">160</td>
<td valign="top" align="center">10</td>
<td valign="top" align="left">169</td>
</tr>
<tr>
<td valign="top" align="left">Tunisian</td>
<td valign="top" align="center">23</td>
<td valign="top" align="left">161</td>
<td valign="top" align="center">2</td>
<td valign="top" align="left">178</td>
</tr>
<tr>
<td valign="top" align="left">Gulf</td>
<td valign="top" align="center">28</td>
<td valign="top" align="left">156</td>
<td valign="top" align="center">14</td>
<td valign="top" align="left">169</td>
</tr>
<tr>
<td valign="top" align="left">Total</td>
<td valign="top" align="center">180</td>
<td valign="top" align="left">1,108</td>
<td valign="top" align="center">19</td>
<td valign="top" align="left">1,168</td>
</tr></tbody>
</table>
</table-wrap>
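<p>The relationship between the per-dialect counts and the total of distinct new concepts can be illustrated with a small sketch. It uses only the three example concepts named above, so the resulting numbers are illustrative and are not the figures of Table 4; the variable names are our own assumptions.</p>
<preformat>
```python
# Illustrative only: the three example concepts named in the text,
# not the full set of 19 new concepts.
DIALECTS = ["Algerian", "Egyptian", "Moroccan", "Palestinian",
            "Syrian", "Tunisian", "Gulf"]

# "breastfeeding brother" was found in all seven dialects.
new_concepts = {dialect: {"breastfeeding brother"} for dialect in DIALECTS}
# "maternal sister" was found in Syrian and Egyptian.
new_concepts["Syrian"].add("maternal sister")
# The elder-cousin concept only exists in Egyptian.
new_concepts["Egyptian"].update(
    {"maternal sister", "elder cousin, son of mother's brother"}
)

# Summing per-dialect counts counts shared concepts several times ...
per_dialect_sum = sum(len(concepts) for concepts in new_concepts.values())
# ... whereas the union counts each concept once.
distinct = set().union(*new_concepts.values())

print(per_dialect_sum)  # 10
print(len(distinct))    # 3
```
</preformat>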
<p>Validation was carried out in two phases. In the first phase, words and gaps were validated at the lexicon level by the first author, a Ph.D. student in lexical semantics and a native speaker of Arabic, and the third author, an Arabic native speaker with linguistic-semantic experience and good knowledge of Arabic dialects. In the second phase, new concepts were verified and approved to be added to the concept layer of the UKC by the second author, a lexical-semantic expert and the UKC system manager.</p>
<p>Using the lexicon-level validation method, the first author evaluated the collected data in Palestinian and Syrian, while the third author validated the remaining five dialects. Results can be seen in <xref ref-type="table" rid="T5">Table 5</xref>, where correctness is defined as the number of words (or gaps) validated as correct divided by the total number of words (or gaps). In the case of an incorrect word, the validator either provides a correct word or marks the meaning as a lexical gap. For example, for the Algerian dialect, the correctness of gathered words is 85.71% and that of gaps is 98.08%. Four Algerian words were deemed incorrect: &#x00645;&#x00627;&#x00646;&#x00651;&#x0064A; for the meaning <italic>maternal grandmother</italic>, &#x00644;&#x00627;&#x00644;&#x00651;&#x00629; for the meaning <italic>paternal grandmother</italic>, &#x0062C;&#x0064E;&#x0062F;&#x00651; for the meaning <italic>grandfather</italic>, and &#x00627;&#x00644;&#x00634;&#x0064A;&#x0062E; &#x00628;&#x00627;&#x00628; for the meaning <italic>grandparent</italic>. The validator marked <italic>maternal grandmother, paternal grandmother</italic>, and <italic>grandparent</italic> as gaps, while he replaced the mistaken word &#x0062C;&#x0064E;&#x0062F;&#x00651; with the correct word &#x00627;&#x00644;&#x00634;&#x0064A;&#x0062E; &#x00628;&#x00627;&#x00628; for <italic>grandfather</italic>. For gap evaluation, the linguistic expert either confirms a lexical gap or rejects it because a word exists in the dialect, in which case he must provide that word. For instance, <italic>Participant 1</italic> identified the meanings <italic>elder sister, father&#x00027;s elder sister</italic> and <italic>mother&#x00027;s elder sister</italic> as gaps in Algerian, but the validator did not accept them and provided the polysemous word &#x00644;&#x00627;&#x00644;&#x00651;&#x00629; for each of them. Evidence for validation was obtained from the dictionary <italic>Dictionnaire arabe alg&#x000E9;rien</italic><xref ref-type="fn" rid="fn0009"><sup>9</sup></xref> and from usage attested in Algerian TV films. Upon discussion between the validator and the participants, the mistakes made by the latter can be explained by misunderstandings of the meanings of certain concepts provided in MSA and English. The validator made sure to exclude or fix the mistakes, bringing the correctness of the final dataset closer to 100%.</p>
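<p>As a quick check of the correctness measure defined above, the Algerian figures can be recomputed from the counts given in the text and in Table 4. The sketch below is ours: the function name <monospace>correctness</monospace> is hypothetical, and we assume that the three rejected gaps mentioned above are the only Algerian gaps the validator did not accept.</p>
<preformat>
```python
def correctness(total, incorrect):
    """Percentage of contributed items that the validator accepted as correct."""
    if total == 0:
        raise ValueError("total must be positive")
    return round(100 * (total - incorrect) / total, 2)


# Algerian: 28 contributed words, 4 of them deemed incorrect.
print(correctness(28, 4))   # 85.71
# Algerian: 156 contributed gaps, 3 of them rejected (assumption, see lead-in).
print(correctness(156, 3))  # 98.08
```
</preformat>
<p>Both values agree with the Algerian row of Table 5.</p>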
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Validator evaluation of words and lexical gaps by dialect.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left" rowspan="2"><bold>Dialects</bold></th>
<th valign="top" align="center" colspan="2"><bold>Correctness of native speaker contribution</bold></th>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="center"><bold>Words (%)</bold></th>
<th valign="top" align="center"><bold>Gaps (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Algerian</td>
<td valign="top" align="center">85.71</td>
<td valign="top" align="center">98.08</td>
</tr>
<tr>
<td valign="top" align="left">Egyptian</td>
<td valign="top" align="center">96.90</td>
<td valign="top" align="center">97.37</td>
</tr>
<tr>
<td valign="top" align="left">Moroccan</td>
<td valign="top" align="center">95.83</td>
<td valign="top" align="center">97.53</td>
</tr>
<tr>
<td valign="top" align="left">Palestinian</td>
<td valign="top" align="center">100</td>
<td valign="top" align="center">98.76</td>
</tr>
<tr>
<td valign="top" align="left">Syrian</td>
<td valign="top" align="center">91.67</td>
<td valign="top" align="center">95.00</td>
</tr>
<tr>
<td valign="top" align="left">Tunisian</td>
<td valign="top" align="center">95.65</td>
<td valign="top" align="center">98.14</td>
</tr>
<tr>
<td valign="top" align="left">Gulf</td>
<td valign="top" align="center">100</td>
<td valign="top" align="center">96.79</td>
</tr>
<tr>
<td valign="top" align="left">Average</td>
<td valign="top" align="center">95.11</td>
<td valign="top" align="center">97.38</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In this study, we use the UKC for creating the input dataset and the domain hierarchy and for storing and visualizing diversity data. Thus, the 19 new concepts were merged into the UKC by reconstructing the domain hierarchy at the supra-lingual concept layer. For example, the hierarchy of siblings was redesigned to contain five new brotherhood concepts and five new sisterhood concepts. In the Arabic-Egyptian lexicon, as shown in <xref ref-type="fig" rid="F5">Figure 5</xref>, &#x00623;&#x0064E;&#x0062E;&#x0064C; &#x00641;&#x0064A; &#x00627;&#x00644;&#x00631;&#x00636;&#x00627;&#x00639;&#x00629;, meaning &#x0201C;<italic>breastfeeding brother</italic>,&#x0201D; is set up as a sub-node of a newly created concept of brother, &#x0201C;<italic>a male person who has the same father, mother, or both parents as another person or has the same breastfeeding woman</italic>.&#x0201D; The figure also shows that &#x00623;&#x0064E;&#x0062E;&#x0064C;&#x0062A; &#x00644;&#x00623;&#x00628;, meaning &#x0201C;<italic>paternal brother</italic>,&#x0201D; and &#x00623;&#x0064E;&#x0062E;&#x0064C;&#x0062A; &#x00644;&#x00623;&#x00645;&#x00651;, meaning &#x0201C;<italic>maternal brother</italic>,&#x0201D; are inserted and connected to the half-brother concept. New concepts and lexicalizations are marked with white nodes and connected with blue lines.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Structural elements in the UKC database after merging new concepts.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0005.tif"/>
</fig>
<p>Additionally, the resulting lexical units and gaps were added to the UKC lexicons. The website of the UKC provides several services for system users, such as browsable online access to database contents, source materials, and data visualization tools. The interactive exploration of linguistic diversity data in lexicons is the central feature of the website. The user can browse: (1) all meanings, within a language, of a word typed in by the user; and (2) lexicalizations and gaps of a concept in all languages contained in the database.</p>
<p><xref ref-type="fig" rid="F6">Figure 6</xref> shows a screenshot of the concept exploration functionality for the concept &#x0062C;&#x0064E;&#x0062F;&#x00651;, meaning &#x0201C;<italic>parent&#x00027;s father</italic>&#x0201D;. On the left-hand side of the screenshot, details are provided on the lexicalization of the concept in Arabic, such as synonymous words, a definition, and a part of speech. The middle part of the screenshot shows an interactive clickable map of all lexicons that either contain the concept or, on the contrary, lack it because their languages are known not to lexicalize it. The color of each dot indicates the language family, while a black-circled dot represents a lexical gap. This map presents an instant global typological overview of the selected concept; for instance, from <xref ref-type="fig" rid="F6">Figure 6</xref>, one can see that most languages in Europe lexicalize the concept &#x0062C;&#x0064E;&#x0062F;&#x00651; while several languages in the United States of America do not. Finally, the right-hand side shows the concept &#x0062C;&#x0064E;&#x0062F;&#x00651; in the context of the concept hierarchy, depicted as an interactive graph: the concept, its parent and child concepts, and other lexical-semantic relations (such as metonymy and meronymy) are presented when they exist. Note that the graph only shows a part of the complete hierarchy for usability reasons. Nevertheless, it is navigable and allows the exploration of the whole concept graph in the selected language.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Exploring the concept of &#x0062C;&#x0064E;&#x0062F;&#x00651; as lexicalized in the Arabic language <bold>(left)</bold>, in the world <bold>(middle)</bold>, and as part of the shared concept hierarchy <bold>(right)</bold>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0006.tif"/>
</fig>
<p>As mentioned at the beginning of this section, the resulting Arabic dataset will be imported into the Arabic UKC, which is an instance of the UKC system: its top layer contains language-independent concepts, and its bottom layer contains twenty lexicons, one per Arabic dialect. A screenshot of the homepage of the Arabic UKC is shown in <xref ref-type="fig" rid="F7">Figure 7</xref>.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Homepage of the Arabic UKC ongoing project.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0007.tif"/>
</fig></sec>
<sec>
<title>5.3 Discussion</title>
<p>The lexical diversity we observed across the seven dialects was higher than our original expectations, with 19 new concepts identified. Ten of these concepts are lexicalized in MSA, such as &#x00623;&#x0062E;&#x0062A; &#x00641;&#x0064A; &#x00627;&#x00644;&#x00631;&#x00636;&#x00627;&#x00639;&#x00629; meaning &#x0201C;<italic>breastfeeding sister</italic>&#x0201D; and &#x00623;&#x0064E;&#x0062E;&#x0064C; &#x00644;&#x00623;&#x00628; meaning &#x0201C;<italic>paternal brother</italic>&#x0201D;. The other nine concepts are specific to the dialects, such as the Egyptian word &#x00623;&#x00628;&#x00652;&#x00644;&#x0064E;&#x00647; meaning &#x0201C;<italic>elder daughter of mother&#x00027;s sister</italic>&#x0201D;, which traces back to the Turkish word &#x0201C;<italic>kuzen</italic>&#x0201D;. Most of these Egyptian-specific concepts originate from Ottoman Turkish, which influenced the Egyptian dialect during the Ottoman occupation of Egypt (1517 AD to 1867 AD).</p>
<p>Several overlaps in shared meanings were found between dialect pairs. Likewise, intersections also exist between gaps. For a given domain <italic>d</italic> and languages <italic>l</italic><sub><italic>a</italic></sub>, &#x02026;, <italic>l</italic><sub><italic>n</italic></sub>, the formula below calculates the similarity of these languages in terms of the overlap of lexicalized concepts from that domain, where LexConcepts(<italic>d, l</italic>) stands for the set of domain concepts that are lexicalized by the language <italic>l</italic>.</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">overlap</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mtext class="textrm" mathvariant="normal">LexConcepts</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02229;</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>&#x02229;</mml:mo><mml:mtext class="textrm" mathvariant="normal">LexConcepts</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">max</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mtext class="textrm" mathvariant="normal">LexConcepts</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:mo>|</mml:mo><mml:mtext class="textrm" mathvariant="normal">LexConcepts</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p><xref ref-type="fig" rid="F8">Figure 8</xref> shows the overlaps between pairs of Arabic dialects over the kinship domain. For example, the Egyptian and Gulf dialects have a shared coverage of 74.5%, while the overlap across all seven dialects is 47.1%. In the former case, Egyptian has 51 lexicalizations and Gulf has 42, of which 38 are shared by both dialects; see the dataset uploaded to GitHub.<xref ref-type="fn" rid="fn0010"><sup>10</sup></xref> Applying Formula 1 to Egyptian and Gulf in the kinship domain (<italic>K</italic>) gives:</p>
<disp-formula id="E3"><mml:math id="M3"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">overlap</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mtext>Egyptian</mml:mtext><mml:mo>,</mml:mo><mml:mtext>Gulf</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mtext>LexConcepts</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mtext>Egyptian</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02229;</mml:mo><mml:mtext>LexConcepts</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mtext>Gulf</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">max</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mtext>LexConcepts</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mtext>Egyptian</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo><mml:mo>,</mml:mo><mml:mo>|</mml:mo><mml:mtext>LexConcepts</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mtext>Gulf</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E4"><mml:math id="M4"><mml:mrow><mml:mtext>overlap</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mtext>Egyptian</mml:mtext><mml:mo>,</mml:mo><mml:mtext>Gulf</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>38</mml:mn></mml:mrow><mml:mrow><mml:mtext>max</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>51</mml:mn><mml:mo>,</mml:mo><mml:mn>42</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>38</mml:mn></mml:mrow><mml:mrow><mml:mn>51</mml:mn></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>74</mml:mn><mml:mo>.</mml:mo><mml:mn>5</mml:mn><mml:mi>%</mml:mi></mml:mrow></mml:math></disp-formula>
<p>More detail about the analysis of shared coverage between the rest of the Arabic dialects can be found in the same dataset uploaded to GitHub.</p>
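<p>For readers who wish to reproduce such figures, Formula 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name <monospace>overlap</monospace> is our own, and the concept identifiers are placeholders chosen only to reproduce the reported counts (51 Egyptian lexicalizations, 42 Gulf lexicalizations, 38 shared), not the actual concepts of the dataset.</p>
<preformat>
```python
def overlap(*concept_sets):
    """Formula 1: size of the intersection of the sets of lexicalized
    concepts, divided by the size of the largest of these sets."""
    sets = [set(s) for s in concept_sets]
    if not sets or any(len(s) == 0 for s in sets):
        raise ValueError("need at least one non-empty set of concepts")
    shared = set.intersection(*sets)
    return len(shared) / max(len(s) for s in sets)


# Placeholder identifiers (assumption): c0..c50 stand for the 51 concepts
# lexicalized in Egyptian; Gulf shares c0..c37 and has 4 concepts of its own.
egyptian = {f"c{i}" for i in range(51)}
gulf = {f"c{i}" for i in range(38)} | {f"g{i}" for i in range(4)}

print(f"{overlap(egyptian, gulf):.1%}")  # 74.5%
```
</preformat>
<p>Because the function accepts any number of sets, the same sketch applies to the overlap across all seven dialects, provided the per-dialect sets of lexicalized concepts are loaded from the dataset.</p>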
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>The overlap (percentage of shared lexicalizations) for Arabic dialects.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0008.tif"/>
</fig>
<p>We find these overlaps (e.g., 59.5% between Gulf and Tunisian, or 47.1% among all seven dialects) lower than we initially expected for dialects of the same language. Arab dialectologists explain such differences by two major factors: linguistic and religious influence (Zaidan and Callison-Burch, <xref ref-type="bibr" rid="B55">2014</xref>). By linguistic influence, we refer to the historical interaction of language-speaker communities, which affects the lexicons. Examples are the Egyptian dialect, influenced by the Coptic language (historically spoken by the Copts, starting from the third century AD in Roman Egypt), and the Levantine dialect, influenced by the Western Aramaic, Canaanite, Turkish, and Greek languages. The Gulf dialect belongs to the Peninsular group, which was influenced by South Arabian languages. Secondly, the religion of the speaker community also affects the lexicon. Religion is a sociolinguistic variable that shapes how Arabic is spoken. Religion in Arab countries is a matter of group affiliation and is not usually considered an individual choice: one is born a Muslim, Christian, Jew, or Druze, and this becomes a bit like one&#x00027;s ethnicity. So, for example, within the Egyptian speech community, one can find a mixture of Islamic and Christian terms, and the same holds for the Levantine community, which consists of a mix of Muslims, Christians, Jews, and Druze. The Gulf communities, instead, mostly consist of Muslims (Al-Wer, <xref ref-type="bibr" rid="B3">2008</xref>).</p></sec>
</sec>
<sec id="s6">
<title>6 Case study on diversity across Indonesian languages</title>
<p>This section demonstrates the use of the methodology described in Section 4 on kinship terminology from three Austronesian languages from Indonesia: Indonesian, Javanese, and Banjarese. Contrary to the Arabic dialects in Section 5, these three languages are not mutually intelligible.</p>
<p>Indonesia is the fourth most populous country in the world, and it has more than 700 living languages (Eberhard et al., <xref ref-type="bibr" rid="B18">2022</xref>). The national language is Indonesian (Bahasa Indonesia), which was established as such by the historic Youth Pledge of October 28, 1928. However, many Indonesians speak more than one language. For example, out of the 198 million people who speak Indonesian, 84 million also speak Javanese (Aji et al., <xref ref-type="bibr" rid="B2">2022</xref>).</p>
<p>Despite the high number of speakers, the volume of natural language processing research on Indonesian languages is very low compared to other languages around the world. As of 2020, fewer papers had been published on the Indonesian language than on languages with fewer speakers, such as Polish and Dutch (Aji et al., <xref ref-type="bibr" rid="B2">2022</xref>). Not surprisingly, the amount of research on other languages of Indonesia (e.g., Banjarese and Javanese) is lower still. This motivates the present study, which explores the richness of linguistic diversity across three Indonesian languages: standard Indonesian, Banjarese, and Javanese. In one semantic field, kinship, we have found that diversity is manifested in these languages; for example, the meaning of the Javanese word <italic>ponakan jaler</italic>, &#x0201C;<italic>nephew</italic>&#x0201D;, is a lexical gap in Banjarese, and in the opposite direction, the meaning of the Banjarese <italic>gulu</italic>, &#x0201C;<italic>parent&#x00027;s second eldest sibling</italic>&#x0201D;, is a gap in Javanese.</p>
<sec>
<title>6.1 Experiment setup</title>
<p>As in the Arabic experiment, we use the UKC to create the language-independent input dataset of kinship concepts and to formalize both these concepts and the new concepts (absent from the input dataset) identified in this experiment, as shown in the top layer of the UKC in <xref ref-type="fig" rid="F1">Figure 1</xref> for the brotherhood categorization.</p>
<p>In this study, three native speakers (one per language), born and educated (high school level) within the speaker community, were recruited to contribute. The participants&#x00027; linguistic backgrounds are listed below:</p>
<list list-type="order">
<list-item><p><italic>Participant 1</italic>: a native Indonesian speaker with good command of English, Javanese, and Banjarese.</p></list-item>
<list-item><p><italic>Participant 2</italic>: a native Banjarese speaker with good command of Indonesian and English.</p></list-item>
<list-item><p><italic>Participant 3</italic>: a native Javanese speaker with good command of Indonesian and English.</p></list-item>
</list>
<p>For each language, an experiment was carried out to identify words and gaps associated with the same 184 kinship concepts as in the Arabic study (see <xref ref-type="table" rid="T2">Table 2</xref>). For example, in Banjarese, the dictionary <italic>Kamus Bahasa Banjar Dialek Hulu-Indonesia</italic> (Balai Bahasa Banjarmasin, <xref ref-type="bibr" rid="B7">2008</xref>) and Google Search hits were used in subsequent steps to provide a precise answer on each concept from the given list of inputs. The Banjarese native speaker followed the same search steps when judging the new concepts identified in the uncle/aunt subdomain. For instance, the Banjarese term <italic>gulu</italic>, expressing an uncle/aunt relationship with the meaning of <italic>a parent&#x00027;s second eldest sibling</italic> and attested by the dictionary above, did not previously exist in the UKC or in the KinDiv dataset, nor in Murdock (<xref ref-type="bibr" rid="B39">1970</xref>). The Indonesian and Javanese native speakers followed the same steps and used the dictionaries of Utomo (<xref ref-type="bibr" rid="B51">2015</xref>) and Badan Pengembangan dan Pembinaan Bahasa (<xref ref-type="bibr" rid="B6">2017</xref>) when judging the terms and gaps identified in Indonesian and Javanese, respectively.</p>
</sec>
<sec>
<title>6.2 Experiment results</title>
<p>The overall contribution collection effort resulted in 41 words and 517 lexical gaps. Three new, previously unattested word meanings were also found and formalized as new concepts. All three are used in Banjarese in the uncle/aunt subdomain:</p>
<list list-type="bullet">
<list-item><p><italic>julak</italic>, meaning <italic>parent&#x00027;s eldest sibling</italic>;</p></list-item>
<list-item><p><italic>gulu</italic>, meaning <italic>parent&#x00027;s second eldest sibling</italic>;</p></list-item>
<list-item><p><italic>angah</italic> or <italic>tangah</italic>, meaning <italic>parent&#x00027;s middle elder sibling</italic> (when the number of siblings is odd).</p></list-item>
</list>
<p>Statistics on the data collected for each language are shown in <xref ref-type="table" rid="T6">Table 6</xref>.</p>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>The count of the diversity items collected and identified in the Indonesian languages.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Languages</bold></th>
<th valign="top" align="left"><bold>Words</bold></th>
<th valign="top" align="left"><bold>Gaps w/o new concepts</bold></th>
<th valign="top" align="left"><bold>New concepts</bold></th>
<th valign="top" align="left"><bold>Gaps considering new concepts</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Indonesian</td>
<td valign="top" align="left">11</td>
<td valign="top" align="left">173</td>
<td valign="top" align="left">0</td>
<td valign="top" align="left">176</td>
</tr>
<tr>
<td valign="top" align="left">Javanese</td>
<td valign="top" align="left">17</td>
<td valign="top" align="left">167</td>
<td valign="top" align="left">0</td>
<td valign="top" align="left">170</td>
</tr>
<tr>
<td valign="top" align="left">Banjarese</td>
<td valign="top" align="left">12</td>
<td valign="top" align="left">172</td>
<td valign="top" align="left">3</td>
<td valign="top" align="left">172</td>
</tr>
<tr>
<td valign="top" align="left">Total</td>
<td valign="top" align="left">41</td>
<td valign="top" align="left">511</td>
<td valign="top" align="left">3</td>
<td valign="top" align="left">517</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As in the Arabic study, a two-step validation was carried out. In the first step, the words and gaps contributed by native speakers were validated by the fourth author, a native Indonesian speaker with a good command of all three languages. The second step operated at the concept level and was performed by the second author, a lexical-semantic expert and the UKC system manager: the new concepts were verified and approved to be added to the concept layer of the UKC.</p>
<p><xref ref-type="table" rid="T7">Table 7</xref> provides correctness results over native speaker contributions, provided by the validator. Upon discussion between the validator and the contributors, the mistakes made by the latter can be explained by misunderstandings of the meanings of certain concepts, provided in English. The validator made sure to exclude or fix the mistakes, bringing the correctness of the final dataset closer to 100%.</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Validator evaluation of words and lexical gaps by language.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left" rowspan="2"><bold>Languages</bold></th>
<th valign="top" align="left" colspan="2"><bold>Correctness of native speaker contribution</bold></th>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Words (%)</bold></th>
<th valign="top" align="left"><bold>Gaps (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Indonesian</td>
<td valign="top" align="left">90.91</td>
<td valign="top" align="left">98.27</td>
</tr>
<tr>
<td valign="top" align="left">Javanese</td>
<td valign="top" align="left">94.44</td>
<td valign="top" align="left">95.78</td>
</tr>
<tr>
<td valign="top" align="left">Banjarese</td>
<td valign="top" align="left">91.7</td>
<td valign="top" align="left">97.67</td>
</tr>
<tr>
<td valign="top" align="left">Average</td>
<td valign="top" align="left">92.35</td>
<td valign="top" align="left">97.24</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The kinship datasets produced in this experiment will be merged into the under-construction Indonesian UKC<xref ref-type="fn" rid="fn0011"><sup>11</sup></xref>, a diversity-aware lexicon for languages spoken in Indonesia, and will also be imported into the main UKC database.</p>
<p><xref ref-type="fig" rid="F9">Figure 9</xref> shows how the UKC presents information about a specific Indonesian word, here <italic>saudara</italic>, which means &#x0201C;<italic>sibling</italic>&#x0201D; in English. The left-hand side of the screenshot lists synonymous words (lemmas) and the definition of the typed word. The middle of the screenshot displays a map giving a global typological overview of the concept. Most languages do not lexicalize this concept, as marked by black-circled dots. Only a few languages lexicalize it, such as Indonesian, Swedish, Ainu, and Malayalam, marked by white-circled dots. The right-hand side shows the lexico-semantic relations of the concept.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Exploring the concept of <italic>saudara</italic> as lexicalized in the Indonesian language <bold>(left)</bold>, in the world <bold>(middle)</bold>, and as part of the shared concept hierarchy <bold>(right)</bold>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0009.tif"/>
</fig>
<p>The UKC lexicon is also equipped with several interactive visualization services that can be used to browse lexical units and gaps by domain in all supported languages. <xref ref-type="fig" rid="F10">Figure 10</xref> shows an example of using such services in visualizing the content of the grandparent subdomain in Indonesian.</p>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>Interactive browser tool showing lexical units and gaps for the grandparent subdomain in Indonesian.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0010.tif"/>
</fig>
</sec>
<sec>
<title>6.3 Discussion</title>
<p>More than 90% of our 184 initial kinship concepts were found to be gaps in each of the three Indonesian languages, as shown in <xref ref-type="table" rid="T6">Table 6</xref>. Using Formula 1, we calculated the overlaps between the Indonesian languages in terms of kinship lexicalizations, shown in <xref ref-type="fig" rid="F11">Figure 11</xref>. For more details, see the dataset uploaded to the GitHub repository<xref ref-type="fn" rid="fn0012"><sup>12</sup></xref>. The overlap among all three Indonesian languages studied is 35.3%. The Javanese&#x02013;Banjarese overlap is 52.9%, Javanese&#x02013;Indonesian is 60%, and Banjarese&#x02013;Indonesian is 41.2%. Although all three languages belong to the Malayo-Polynesian branch of the Austronesian language family, Indonesian and Banjarese are considered Malayic languages while Javanese is not, which is a first reason for this result. Furthermore, these languages are spoken on different islands of Indonesia: Javanese on Java, Banjarese in the southern part of Borneo, and Indonesian is based on Malay, which is spoken on Sumatra (Sneddon, <xref ref-type="bibr" rid="B50">2003</xref>). This geographical barrier restricts interactions between speakers, so each language has developed within its own speech community.</p>
<fig id="F11" position="float">
<label>Figure 11</label>
<caption><p>The number of words in the intersection of Indonesian languages according to shared meaning.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0011.tif"/>
</fig>
<p>Finally, using Formula 1, we computed the overlaps between Arabic dialects and Indonesian languages. <xref ref-type="fig" rid="F12">Figure 12</xref> shows that the ten languages together cover only 3.9% of the concepts, and the most similar language pair, namely Egyptian&#x02013;Indonesian, is merely 5.9% similar. For researchers in ethnography or comparative linguistics, the observation of such pronounced levels of cross-lingual and cross-cultural diversity may not come as a surprise, as major variations in kin patterns are well known in these domains. On the other hand, we believe that beyond these narrow fields of research, there is a general lack of understanding of the depth of diversity in how, through languages, people describe and interpret the world. Most computational linguists and engineers who build language processing systems, as well as the users who trust such systems for their daily activities, do not suspect the breadth of the mental divide across languages that language applications, such as machine translation systems, are meant to bridge. We think that quantified measures, such as our simple measure of overlap (Formula 1), can be useful to improve our qualitative grasp on diversity, which we consider a promising direction for future research.</p>
<fig id="F12" position="float">
<label>Figure 12</label>
<caption><p>The number of words in the intersection of Indonesian and Arabic languages according to shared meaning.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-14-1229697-g0012.tif"/>
</fig>
<p><xref ref-type="table" rid="T8">Table 8</xref> reports the number of collected words and identified gaps by domain across the Arabic dialects and Indonesian languages. The results show that only three words were identified in the cousin domain in the Indonesian languages, whereas Egyptian uses 16 words for cousin concepts.</p>
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption><p>The number of words and gaps collected and identified, by domain.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Domains</bold></th>
<th valign="top" align="left"><bold>Words</bold></th>
<th valign="top" align="left"><bold>Gaps</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Grandparents</td>
<td valign="top" align="left">21</td>
<td valign="top" align="left">169</td>
</tr>
<tr>
<td valign="top" align="left">Grandchildren</td>
<td valign="top" align="left">19</td>
<td valign="top" align="left">251</td>
</tr>
<tr>
<td valign="top" align="left">Siblings</td>
<td valign="top" align="left">37</td>
<td valign="top" align="left">173</td>
</tr>
<tr>
<td valign="top" align="left">Uncle/aunt</td>
<td valign="top" align="left">44</td>
<td valign="top" align="left">226</td>
</tr>
<tr>
<td valign="top" align="left">Nephew/niece</td>
<td valign="top" align="left">33</td>
<td valign="top" align="left">297</td>
</tr>
<tr>
<td valign="top" align="left">Cousins</td>
<td valign="top" align="left">67</td>
<td valign="top" align="left">503</td>
</tr>
<tr>
<td valign="top" align="left">Total</td>
<td valign="top" align="left">221</td>
<td valign="top" align="left">1,619</td>
</tr>
</tbody>
</table>
</table-wrap></sec>
</sec>
<sec id="s7">
<title>7 Related work</title>
<p>Ethnologists and linguists have long studied how family structures map to kinship terminology across languages and social groups. The most famous and comprehensive ethnographic study on kin term patterns is that of Murdock (<xref ref-type="bibr" rid="B39">1970</xref>), upon which our work also indirectly relied: our cross-lingual formalization of kin terms is based on the one provided by the KinDiv resource, itself in part derived from Murdock&#x00027;s data. KinDiv covers 699 languages and is a computer-processable database that can also be exploited for applications in computational linguistics. Our results provide linguistic evidence in seven Arabic dialects and three Indonesian languages that do not figure in these resources.</p>
<p>The exploration of kin terminology and the building of large-scale databases on the topic have also been the subject of more recent efforts&#x02014;we only cite two examples here. The AustKin project<xref ref-type="fn" rid="fn0013"><sup>13</sup></xref> has produced a large-scale database on kin terms in hundreds of indigenous Australian languages. The recent Kinbank database (Passmore et al., <xref ref-type="bibr" rid="B43">2023</xref>) is a comprehensive resource on kinship terminology, covering over 1,173 languages, with a broad coverage of kinship subdomains. As Kinbank was released after the initial submission of our paper, we did not rely on it for our work. We consider our research complementary to Kinbank: concentrating on a relatively low number of dialects and languages, our results could, in principle, be integrated into Kinbank in order to extend its coverage. Conversely, we see potential in using Kinbank data in order to cross-validate and possibly to extend the Indonesian terms we collected (as the three Indonesian languages of our study are also covered by Kinbank). There is, however, an important methodological difference between our and Kinbank&#x00027;s way of representing terms: Kinbank does not explicitly indicate lexical gaps. For example, our work considers the concept of <italic>son of father&#x00027;s brother as pronounced by a male speaker</italic> to be a lexical gap in Javanese, while Kinbank maps the Javanese term <italic>sedulur misan</italic>, simply meaning <italic>cousin</italic>, to this and 95 other meanings. Our work, instead, identifies the Javanese term as the general meaning of <italic>cousin</italic> and considers all other (more specific) cousin terms as lexical gaps. This distinction is useful in comparative linguistics and cross-lingual applications where the explicit indication of the lack of precise meaning equivalence can be exploited.</p>
<p>Concepticon (List et al., <xref ref-type="bibr" rid="B34">2016</xref>) is &#x02018;a resource for linking concept lists&#x00027; frequently used in comparative linguistics. The <italic>concept sets</italic> of Concepticon serve the same purpose as the supra-lingual concepts of the UKC in our study, namely to provide meaning-based mappings among lists of terms (aka <italic>concept lists</italic> in Concepticon) across languages. As of mid-2023, Concepticon consists of nearly 4,000 concept sets, principally targeting core vocabularies (basic-level categories) that are the main subject of study of historical and comparative linguistics. Concepticon is under continuous development and has more recently evolved from a flat list of meanings to a hierarchy with broader&#x02013;narrower relations. At the time of writing, the kinship domain seems to be partially represented in Concepticon: while sibling or grandparent relations are widely covered, fine-grained cousin relationships are mostly missing from it. The UKC, which contains over 100,000 supra-lingual concepts and a wide range of lexical and lexico-semantic relations, was a more suitable resource for our study due to its more complete coverage of the kinship domain and its explicit support for representing term untranslatability via lexical gaps.</p>
<p>As multilingual computational applications are at the core of our focus, we also review relevant resources from computational linguistics. For NLP applications, the most popular and widely known representation of lexico-semantic knowledge is that of <italic>wordnets</italic> that follow the general structure of the original English <italic>Princeton WordNet</italic> (Miller, <xref ref-type="bibr" rid="B38">1995</xref>). The <italic>wordnet expansion</italic> approach by Fellbaum and Vossen (<xref ref-type="bibr" rid="B20">2012</xref>)&#x02014;an expert-driven lexicon translation effort&#x02014;is frequently used to produce new wordnets for lower-resourced languages: this approach consists of &#x02018;translating&#x00027; (i.e., finding lexicalizations for) English WordNet concepts (&#x02018;synsets&#x00027; in wordnet terminology) into the target language. While this is a straightforward approach that produces resources that remain cross-lingually linked, its downside is that the translation approach cannot involve concepts and words specific to the target language and not present in the source language (which in most cases is English). In cases of diverse conceptualizations of the world, the translation approach often results in incorrect approximations. To take the example of Arabic, both versions of the Arabic Wordnet (Elkateb et al., <xref ref-type="bibr" rid="B19">2006</xref>; Abouenour et al., <xref ref-type="bibr" rid="B1">2013</xref>) map the English synset of <italic>uncle</italic> (&#x0201C;<italic>the brother of your father or mother; the husband of your aunt</italic>&#x0201D;) to the Arabic synset of &#x00639;&#x0064E;&#x00645;&#x00652;, which means &#x0201C;<italic>the brother of your father</italic>.&#x0201D;</p>
<p>A similar situation is observed for Indonesian. As far as we know, the only Indonesian Wordnet currently accessible is Bahasa Wordnet&#x02014;a bilingual Wordnet for standard Indonesian and Malay languages (Noor et al., <xref ref-type="bibr" rid="B41">2011</xref>). It was formed by merging three different wordnets (one in Indonesian and two in Malay) developed mainly by the same expansion approach from PWN. Due to this approach, many English words that have no equivalents in Indonesian are incorrectly mapped, resulting in meaning loss. For example, in Bahasa Wordnet, the English word <italic>sister</italic>, which means &#x0201C;<italic>a female person who has the same parents as another person</italic>,&#x0201D; was mapped to the Indonesian word <italic>kakak</italic> which means &#x0201C;<italic>elder sibling</italic>.&#x0201D;</p>
<p>Finally, we mention MultiWordNet as an early effort at improving the representation of linguistic diversity in multilingual lexical databases (Pianta et al., <xref ref-type="bibr" rid="B44">2002</xref>). It is a multilingual lexicon that was built using the <italic>merge</italic> method that, contrary to the translation-based expand approach presented above, maps together existing high-quality bilingual dictionaries. MultiWordNet explicitly represents lexical gaps in its Italian and Hebrew wordnets: about 1,000 in Italian and about 300 in Hebrew (Bentivogli and Pianta, <xref ref-type="bibr" rid="B14">2000</xref>; Ordan and Wintner, <xref ref-type="bibr" rid="B42">2007</xref>). MultiWordNet, however, is a discontinued effort that does not cover the kinship domain and was thus not suitable for our purposes.</p>
<p>The methodology we present in Section 4 follows neither the expansion nor the merge approach but a third one, better suited to diversity-aware lexicography: our starting point is a supra-lingual, diversity-aware conceptualization of the domain of study (kinship in our case). The task of <italic>contribution collection</italic> is performed by native speakers with respect to the supra-lingual concept hierarchy based on evidence from comparative linguistics and covering a wide range of languages. While there is no guarantee that our initial conceptualization is complete&#x02014;indeed, it was not the case in our study&#x02014;it is less biased toward the concepts of a single language and speaker community than the expansion approach.</p></sec>
<sec id="s8">
<title>8 Conclusions and future work</title>
<p>Our paper formally captures lexical diversity across languages and dialects by representing language- or dialect-specific concepts and linguistic gaps. It introduces a systematic, human-based method for producing such data from a single semantic domain rather than from general domains, in contrast to efforts that cover the WordNet domains (Magnini and Cavagli&#x000E0;, <xref ref-type="bibr" rid="B35">2000</xref>), as was done in building the Mongolian (Batsuren et al., <xref ref-type="bibr" rid="B8">2019</xref>), Unified Scottish Gaelic (Bella et al., <xref ref-type="bibr" rid="B13">2020</xref>), and MultiWordNet (Pianta et al., <xref ref-type="bibr" rid="B44">2002</xref>) wordnets.</p>
<p>The method is verified through two large-scale case studies on kinship terminology, a domain known to be diverse across languages and cultures: one case study deals with seven Arabic dialects, the other with three Indonesian languages. The experiments show that our method outperforms the existing methods in terms of the quantity of explored gaps and words and the quality of results. Overall, our efforts identified 1,619 gaps and 223 words in 10 languages and dialects. Moreover, 22 word meanings that are new with respect to the list of language-independent concepts imported from the UKC were identified in this research.</p>
<p>In future work, we plan to automate the method presented in this paper and apply it to new languages, such as the remaining Arabic dialects and Indonesian languages, as well as to new domains that are known to be diverse, such as body parts, food, color, or visual objects (Giunchiglia and Bagchi, <xref ref-type="bibr" rid="B23">2021</xref>; Giunchiglia et al., <xref ref-type="bibr" rid="B22">2023</xref>).</p>
<p>Finally, diversity-aware lexicons such as the UKC (which includes our produced datasets) provide essential information to cross-lingual applications, such as multilingual NLP tasks or cross-lingual language models. In the future, we plan to use this resource in implementing one such application, i.e., machine translation.</p></sec>
<sec sec-type="data-availability" id="s9">
<title>Data availability statement</title>
<p>The original contributions presented in the study are publicly available. This data can be found here: <ext-link ext-link-type="uri" xlink:href="https://github.com/HadiPTUK/kinship_dialect">https://github.com/HadiPTUK/kinship_dialect</ext-link>.</p></sec>
<sec sec-type="author-contributions" id="s10">
<title>Author contributions</title>
<p>FG and GB conceptualized and supervised the study. GB and HK imported and formatted the dataset of inputs. HK wrote the original manuscript draft and performed the Arabic experiments. AF and HK validated the collected Arabic data at the lexicon level. SD performed the Indonesian experiments and validated the results at the lexicon level. GB validated the identified diverse data at the concept level. FG, GB, AF, and HK analyzed the Arabic and Indonesian data. FG, GB, AF, SD, and HK reviewed and edited the manuscript. All authors contributed to the research and approved the submitted version.</p>
</sec>
</body>
<back>
<ack><p>We thank the University of Trento and Palestine Technical University&#x02013;Kadoori for their support.</p>
</ack>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec sec-type="supplementary-material" id="s12">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fpsyg.2023.1229697/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fpsyg.2023.1229697/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Table_1.XLSX" id="SM1" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/></sec>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>These nine concepts do not cover sibling terms exhaustively in all languages: for example, many Austronesian languages use different terms based on the gender of the speaker.</p></fn>
<fn id="fn0002"><p><sup>2</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/aryamccarthy/basic-color-terms">https://github.com/aryamccarthy/basic-color-terms</ext-link></p></fn>
<fn id="fn0003"><p><sup>3</sup><ext-link ext-link-type="uri" xlink:href="http://ukc.disi.unitn.it/index.php/kinship">http://ukc.disi.unitn.it/index.php/kinship</ext-link></p></fn>
<fn id="fn0004"><p><sup>4</sup><ext-link ext-link-type="uri" xlink:href="http://ukc.datascientia.eu">http://ukc.datascientia.eu</ext-link></p></fn>
<fn id="fn0005"><p><sup>5</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/kbatsuren/KinDiv">https://github.com/kbatsuren/KinDiv</ext-link></p></fn>
<fn id="fn0006"><p><sup>6</sup><ext-link ext-link-type="uri" xlink:href="http://arabic.ukc.datascientia.eu/concept">http://arabic.ukc.datascientia.eu/concept</ext-link></p></fn>
<fn id="fn0007"><p><sup>7</sup><ext-link ext-link-type="uri" xlink:href="http://www.almaany.com/thesaurus.php">http://www.almaany.com/thesaurus.php</ext-link></p></fn>
<fn id="fn0008"><p><sup>8</sup><ext-link ext-link-type="uri" xlink:href="http://ar.wiktionary.org">http://ar.wiktionary.org</ext-link></p></fn>
<fn id="fn0009"><p><sup>9</sup><ext-link ext-link-type="uri" xlink:href="https://www.lexilogos.com/arabe_algerien.htm">https://www.lexilogos.com/arabe_algerien.htm</ext-link></p></fn>
<fn id="fn0010"><p><sup>10</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/HadiPTUK/kinship_dialect">https://github.com/HadiPTUK/kinship_dialect</ext-link></p></fn>
<fn id="fn0011"><p><sup>11</sup><ext-link ext-link-type="uri" xlink:href="http://indonesia.ukc.datascientia.eu/">http://indonesia.ukc.datascientia.eu/</ext-link></p></fn>
<fn id="fn0012"><p><sup>12</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/HadiPTUK/kinship_dialect">https://github.com/HadiPTUK/kinship_dialect</ext-link></p></fn>
<fn id="fn0013"><p><sup>13</sup><ext-link ext-link-type="uri" xlink:href="http://austkin.net">http://austkin.net</ext-link></p></fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abouenour</surname> <given-names>L.</given-names></name> <name><surname>Bouzoubaa</surname> <given-names>K.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name></person-group> (<year>2013</year>). <article-title>On the evaluation and improvement of Arabic WordNet coverage and usability</article-title>. <source>Lang. Resour. Eval.</source> <volume>47</volume>, <fpage>891</fpage>&#x02013;<lpage>917</lpage>. <pub-id pub-id-type="doi">10.1007/s10579-013-9237-0</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Aji</surname> <given-names>A. F.</given-names></name> <name><surname>Winata</surname> <given-names>G. I.</given-names></name> <name><surname>Koto</surname> <given-names>F.</given-names></name> <name><surname>Cahyawijaya</surname> <given-names>S.</given-names></name> <name><surname>Romadhony</surname> <given-names>A.</given-names></name> <name><surname>Mahendra</surname> <given-names>R.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>&#x0201C;One country, 700&#x0002B; languages: NLP challenges for underrepresented languages and dialects in Indonesia,&#x0201D;</article-title> in <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</source> (<publisher-loc>Dublin</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>7226</fpage>&#x02013;<lpage>7249</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Al-Wer</surname> <given-names>E.</given-names></name></person-group> (<year>2008</year>). <article-title>&#x0201C;Arabic languages, variation in,&#x0201D;</article-title> in <source>Concise Encyclopedia of Languages of the World</source>, eds K. Brown and S. Ogilvie (Oxford: Elsevier Ltd.), <fpage>53</fpage>&#x02013;<lpage>56</lpage>.</citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Anderson</surname> <given-names>C.</given-names></name> <name><surname>Tresoldi</surname> <given-names>T.</given-names></name> <name><surname>Chacon</surname> <given-names>T.</given-names></name> <name><surname>Fehn</surname> <given-names>A.-M.</given-names></name> <name><surname>Walworth</surname> <given-names>M.</given-names></name> <name><surname>Forkel</surname> <given-names>R.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>&#x0201C;A cross-linguistic database of phonetic transcription systems,&#x0201D;</article-title> in <source>Yearbook of the Poznan Linguistic Meeting</source> (<publisher-loc>Pozna&#x00144;</publisher-loc>: <publisher-name>De Gruyter Open</publisher-name>), <fpage>21</fpage>&#x02013;<lpage>53</lpage>.</citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Arora</surname> <given-names>A.</given-names></name> <name><surname>Farris</surname> <given-names>A.</given-names></name> <name><surname>Gopalakrishnan</surname> <given-names>R.</given-names></name> <name><surname>Basu</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Bh&#x00101;&#x01E63;&#x00101;citra visualising the dialect geography of South Asia,&#x0201D;</article-title> in <source>Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021</source> (<publisher-loc>Association for Computational Linguistics</publisher-loc>), <fpage>51</fpage>&#x02013;<lpage>57</lpage>.</citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><collab>Badan Pengembangan dan Pembinaan Bahasa</collab></person-group> (<year>2017</year>). <source>Kamus Besar Bahasa Indonesia</source>. <publisher-loc>Jakarta</publisher-loc>: <publisher-name>Badan Pengembangan dan Pembinaan Bahasa, Kementerian Pendidikan dan Kebudayaan</publisher-name>.</citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><collab>Balai Bahasa Banjarmasin</collab></person-group> (<year>2008</year>). <source>Kamus Bahasa Banjar Dialek Hulu-Indonesia</source>. <publisher-loc>Banjarbaru</publisher-loc>: <publisher-name>Departemen Pendidikan Nasional, Pusat Bahasa, Balai Bahasa Banjarmasin</publisher-name>.</citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Batsuren</surname> <given-names>K.</given-names></name> <name><surname>Bella</surname> <given-names>G.</given-names></name> <name><surname>Giunchiglia</surname> <given-names>F.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;CogNet: a large-scale cognate database,&#x0201D;</article-title> in <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source> (<publisher-loc>Florence</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>3136</fpage>&#x02013;<lpage>3145</lpage>.</citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Batsuren</surname> <given-names>K.</given-names></name> <name><surname>Goldman</surname> <given-names>O.</given-names></name> <name><surname>Khalifa</surname> <given-names>S.</given-names></name> <name><surname>Habash</surname> <given-names>N.</given-names></name> <name><surname>Kiera&#x0015B;</surname> <given-names>W.</given-names></name> <name><surname>Bella</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>&#x0201C;UniMorph 4.0: universal morphology,&#x0201D;</article-title> in <source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source> (<publisher-loc>Marseille</publisher-loc>: <publisher-name>European Language Resources Association</publisher-name>), <fpage>840</fpage>&#x02013;<lpage>855</lpage>.</citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bella</surname> <given-names>G.</given-names></name> <name><surname>Batsuren</surname> <given-names>K.</given-names></name> <name><surname>Khishigsuren</surname> <given-names>T.</given-names></name> <name><surname>Giunchiglia</surname> <given-names>F.</given-names></name></person-group> (<year>2022a</year>). <article-title>&#x0201C;Linguistic diversity and bias in online dictionaries,&#x0201D;</article-title> in <source>Frontiers in African Digital Research</source>, ed K. Lena (Bayreuth: Institute of African Studies), <fpage>173</fpage>&#x02013;<lpage>186</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bella</surname> <given-names>G.</given-names></name> <name><surname>Byambadorj</surname> <given-names>E.</given-names></name> <name><surname>Chandrashekar</surname> <given-names>Y.</given-names></name> <name><surname>Batsuren</surname> <given-names>K.</given-names></name> <name><surname>Cheema</surname> <given-names>D.</given-names></name> <name><surname>Giunchiglia</surname> <given-names>F.</given-names></name></person-group> (<year>2022b</year>). <article-title>&#x0201C;Language diversity: visible to humans, exploitable by machines,&#x0201D;</article-title> in <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source> (<publisher-loc>Dublin</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>156</fpage>&#x02013;<lpage>165</lpage>.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bella</surname> <given-names>G.</given-names></name> <name><surname>Helm</surname> <given-names>P.</given-names></name> <name><surname>Koch</surname> <given-names>G.</given-names></name> <name><surname>Giunchiglia</surname> <given-names>F.</given-names></name></person-group> (<year>2023</year>). <article-title>Towards bridging the digital language divide</article-title>. <source>arXiv preprint arXiv:2307.13405</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2307.13405</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bella</surname> <given-names>G.</given-names></name> <name><surname>McNeill</surname> <given-names>F.</given-names></name> <name><surname>Gorman</surname> <given-names>R.</given-names></name> <name><surname>Donna&#x000ED;le</surname> <given-names>C. &#x000D3;.</given-names></name> <name><surname>MacDonald</surname> <given-names>K.</given-names></name> <name><surname>Chandrashekar</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;A major Wordnet for a minority language: Scottish Gaelic,&#x0201D;</article-title> in <source>Proceedings of the Twelfth Language Resources and Evaluation Conference</source> (<publisher-loc>Marseille</publisher-loc>: <publisher-name>European Language Resources Association</publisher-name>), <fpage>2812</fpage>&#x02013;<lpage>2818</lpage>.</citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bentivogli</surname> <given-names>L.</given-names></name> <name><surname>Pianta</surname> <given-names>E.</given-names></name></person-group> (<year>2000</year>). <article-title>&#x0201C;Looking for lexical gaps,&#x0201D;</article-title> in <source>Proceedings of the 9th EURALEX International Congress</source>, eds U. Heid and S. Evert (Stuttgart: Institut f&#x000FC;r Maschinelle Sprachverarbeitung), <fpage>663</fpage>&#x02013;<lpage>669</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carling</surname> <given-names>G.</given-names></name> <name><surname>Larsson</surname> <given-names>F.</given-names></name> <name><surname>Cathcart</surname> <given-names>C. A.</given-names></name> <name><surname>Johansson</surname> <given-names>N.</given-names></name> <name><surname>Holmer</surname> <given-names>A.</given-names></name> <name><surname>Round</surname> <given-names>E.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Diachronic Atlas of Comparative Linguistics (DiACL)&#x02013;a database for ancient language typology</article-title>. <source>PLoS ONE</source> <volume>13</volume>, <fpage>e0205313</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0205313</pub-id><pub-id pub-id-type="pmid">30307985</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Catford</surname> <given-names>J. C.</given-names></name></person-group> (<year>1965</year>). <source>A Linguistic Theory of Translation</source>. <publisher-loc>London</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>.</citation>
</ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="editor"><name><surname>Dryer</surname> <given-names>M. S.</given-names></name> <name><surname>Haspelmath</surname> <given-names>M.</given-names></name></person-group> (eds.). (<year>2013</year>). <source>WALS Online (v2020.3)</source>. Zenodo.</citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Eberhard</surname> <given-names>D.</given-names></name> <name><surname>Simons</surname> <given-names>G. F.</given-names></name> <name><surname>Fennig</surname> <given-names>C. D.</given-names></name></person-group> (<year>2022</year>). <source>Ethnologue: Languages of Africa and Europe</source>. <publisher-loc>Dallas, TX</publisher-loc>: <publisher-name>SIL International Publications</publisher-name>.</citation>
</ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Elkateb</surname> <given-names>S.</given-names></name> <name><surname>Black</surname> <given-names>W.</given-names></name> <name><surname>Rodr&#x000ED;guez</surname> <given-names>H.</given-names></name> <name><surname>Alkhalifa</surname> <given-names>M.</given-names></name> <name><surname>Vossen</surname> <given-names>P.</given-names></name> <name><surname>Pease</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2006</year>). <article-title>&#x0201C;Building a WordNet for Arabic,&#x0201D;</article-title> in <source>Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC&#x00027;06)</source> (<publisher-loc>Genoa</publisher-loc>: <publisher-name>European Language Resources Association</publisher-name>), <fpage>29</fpage>&#x02013;<lpage>34</lpage>.</citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fellbaum</surname> <given-names>C.</given-names></name> <name><surname>Vossen</surname> <given-names>P.</given-names></name></person-group> (<year>2012</year>). <article-title>Challenges for a multilingual WordNet</article-title>. <source>Lang. Resour. Eval.</source> <volume>46</volume>, <fpage>313</fpage>&#x02013;<lpage>326</lpage>. <pub-id pub-id-type="doi">10.1007/s10579-012-9186-z</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Georgakopoulos</surname> <given-names>T.</given-names></name> <name><surname>Grossman</surname> <given-names>E.</given-names></name> <name><surname>Nikolaev</surname> <given-names>D.</given-names></name> <name><surname>Polis</surname> <given-names>S.</given-names></name></person-group> (<year>2022</year>). <article-title>Universal and macro-areal patterns in the lexicon: a case-study in the perception-cognition domain</article-title>. <source>Linguist. Typol.</source> <volume>26</volume>, <fpage>439</fpage>&#x02013;<lpage>487</lpage>. <pub-id pub-id-type="doi">10.1515/lingty-2021-2088</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Giunchiglia</surname> <given-names>F.</given-names></name> <name><surname>Bagchi</surname> <given-names>M.</given-names></name> <name><surname>Diao</surname> <given-names>X.</given-names></name></person-group> (<year>2023</year>). <article-title>A semantics-driven methodology for high-quality image annotation</article-title>. <source>arXiv preprint arXiv:2307.14119</source>.</citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Giunchiglia</surname> <given-names>F.</given-names></name> <name><surname>Bagchi</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>Classifying concepts via visual properties</article-title>. <source>arXiv preprint arXiv:2105.09422</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2105.09422</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Giunchiglia</surname> <given-names>F.</given-names></name> <name><surname>Batsuren</surname> <given-names>K.</given-names></name> <name><surname>Bella</surname> <given-names>G.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Understanding and exploiting language diversity,&#x0201D;</article-title> in <source>Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17</source> (<publisher-loc>Melbourne, VIC</publisher-loc>), <fpage>4009</fpage>&#x02013;<lpage>4017</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Giunchiglia</surname> <given-names>F.</given-names></name> <name><surname>Batsuren</surname> <given-names>K.</given-names></name> <name><surname>Freihat</surname> <given-names>A. A.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;One world&#x02013;seven thousand languages,&#x0201D;</article-title> in <source>Proceedings of the 19th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2018</source>, ed A. Gelbukh (<publisher-loc>Hanoi</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>18</fpage>&#x02013;<lpage>24</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Helm</surname> <given-names>P.</given-names></name> <name><surname>Bella</surname> <given-names>G.</given-names></name> <name><surname>Koch</surname> <given-names>G.</given-names></name> <name><surname>Giunchiglia</surname> <given-names>F.</given-names></name></person-group> (<year>2023</year>). <article-title>Diversity and language technology: how techno-linguistic bias can cause epistemic injustice</article-title>. <source>arXiv preprint arXiv:2307.13714</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2307.13714</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kay</surname> <given-names>P.</given-names></name> <name><surname>Cook</surname> <given-names>R. S.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;World color survey,&#x0201D;</article-title> in <source>Encyclopedia of Color Science and Technology</source>, ed M. R. Luo (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>1265</fpage>&#x02013;<lpage>1271</lpage>.</citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kemp</surname> <given-names>C.</given-names></name> <name><surname>Regier</surname> <given-names>T.</given-names></name></person-group> (<year>2012</year>). <article-title>Kinship categories across languages reflect general communicative principles</article-title>. <source>Science</source> <volume>336</volume>, <fpage>1049</fpage>&#x02013;<lpage>1054</lpage>. <pub-id pub-id-type="doi">10.1126/science.1218811</pub-id><pub-id pub-id-type="pmid">22628658</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Khishigsuren</surname> <given-names>T.</given-names></name> <name><surname>Bella</surname> <given-names>G.</given-names></name> <name><surname>Batsuren</surname> <given-names>K.</given-names></name> <name><surname>Freihat</surname> <given-names>A. A.</given-names></name> <name><surname>Chandran Nair</surname> <given-names>N.</given-names></name> <name><surname>Ganbold</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>&#x0201C;Using linguistic typology to enrich multilingual lexicons: the case of lexical gaps in kinship,&#x0201D;</article-title> in <source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source> (<publisher-loc>Marseille</publisher-loc>: <publisher-name>European Language Resources Association</publisher-name>), <fpage>2798</fpage>&#x02013;<lpage>2807</lpage>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kirby</surname> <given-names>K. R.</given-names></name> <name><surname>Gray</surname> <given-names>R. D.</given-names></name> <name><surname>Greenhill</surname> <given-names>S. J.</given-names></name> <name><surname>Jordan</surname> <given-names>F. M.</given-names></name> <name><surname>Gomes-Ng</surname> <given-names>S.</given-names></name> <name><surname>Bibiko</surname> <given-names>H.-J.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>D-PLACE: a global database of cultural, linguistic and environmental diversity</article-title>. <source>PLoS ONE</source> <volume>11</volume>, <fpage>e0158391</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0158391</pub-id><pub-id pub-id-type="pmid">27391016</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kopecka</surname> <given-names>A.</given-names></name> <name><surname>Narasimhan</surname> <given-names>B.</given-names></name></person-group> (<year>2012</year>). <source>Events of Putting and Taking: A Crosslinguistic Perspective</source>. <publisher-loc>Amsterdam</publisher-loc>: <publisher-name>John Benjamins Publishing</publisher-name>.</citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lehrer</surname> <given-names>A.</given-names></name></person-group> (<year>1970</year>). <article-title>Notes on lexical gaps</article-title>. <source>J. Linguist.</source> <volume>6</volume>, <fpage>257</fpage>&#x02013;<lpage>261</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Levinson</surname> <given-names>S. C.</given-names></name> <name><surname>Wilkins</surname> <given-names>D. P.</given-names></name></person-group> (<year>2006</year>). <source>Grammars of Space: Explorations in Cognitive Diversity</source>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>.</citation>
</ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>List</surname> <given-names>J.-M.</given-names></name> <name><surname>Cysouw</surname> <given-names>M.</given-names></name> <name><surname>Forkel</surname> <given-names>R.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Concepticon: a resource for the linking of concept lists,&#x0201D;</article-title> in <source>Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC&#x00027;16)</source> (<publisher-loc>Portoro&#x0017E;</publisher-loc>: <publisher-name>European Language Resources Association</publisher-name>), <fpage>2393</fpage>&#x02013;<lpage>2400</lpage>.</citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Magnini</surname> <given-names>B.</given-names></name> <name><surname>Cavagli&#x000E0;</surname> <given-names>G.</given-names></name></person-group> (<year>2000</year>). <article-title>&#x0201C;Integrating subject field codes into WordNet,&#x0201D;</article-title> in <source>Proceedings of the Second International Conference on Language Resources and Evaluation (LREC&#x00027;00)</source> (<publisher-loc>Athens</publisher-loc>: <publisher-name>European Language Resources Association</publisher-name>).</citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Majid</surname> <given-names>A.</given-names></name> <name><surname>Bowerman</surname> <given-names>M.</given-names></name> <name><surname>van Staden</surname> <given-names>M.</given-names></name> <name><surname>Boster</surname> <given-names>J. S.</given-names></name></person-group> (<year>2007</year>). <article-title>The semantic categories of cutting and breaking events: a crosslinguistic perspective</article-title>. <source>Cogn. Linguist.</source> <volume>18</volume>, <fpage>133</fpage>&#x02013;<lpage>152</lpage>. <pub-id pub-id-type="doi">10.1515/COG.2007.005</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>McCarthy</surname> <given-names>A. D.</given-names></name> <name><surname>Wu</surname> <given-names>W.</given-names></name> <name><surname>Mueller</surname> <given-names>A.</given-names></name> <name><surname>Watson</surname> <given-names>B.</given-names></name> <name><surname>Yarowsky</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Modeling color terminology across thousands of languages,&#x0201D;</article-title> in <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source> (<publisher-loc>Hong Kong</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>2241</fpage>&#x02013;<lpage>2250</lpage>.</citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miller</surname> <given-names>G. A.</given-names></name></person-group> (<year>1995</year>). <article-title>WordNet: a lexical database for English</article-title>. <source>Commun. ACM</source> <volume>38</volume>, <fpage>39</fpage>&#x02013;<lpage>41</lpage>.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Murdock</surname> <given-names>G. P.</given-names></name></person-group> (<year>1970</year>). <article-title>Kin term patterns and their distribution</article-title>. <source>Ethnology</source> <volume>9</volume>, <fpage>165</fpage>&#x02013;<lpage>208</lpage>.</citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Muttaqin</surname> <given-names>Z.</given-names></name></person-group> (<year>2009</year>). <article-title>Fiqh lughah dalam literatur Arab klasik</article-title>. <source>Afaq &#x00027;Arabiyah: Jurnal Kebahasaaraban dan Pendidikan Bahasa Arab</source> <volume>4</volume>, <fpage>107</fpage>&#x02013;<lpage>122</lpage>.</citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Noor</surname> <given-names>N. H. B. M.</given-names></name> <name><surname>Sapuan</surname> <given-names>S.</given-names></name> <name><surname>Bond</surname> <given-names>F.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;Creating the open Wordnet Bahasa,&#x0201D;</article-title> in <source>Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation</source> (<publisher-loc>Tokyo</publisher-loc>: <publisher-name>Institute of Digital Enhancement of Cognitive Processing, Waseda University</publisher-name>), <fpage>255</fpage>&#x02013;<lpage>264</lpage>.</citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ordan</surname> <given-names>N.</given-names></name> <name><surname>Wintner</surname> <given-names>S.</given-names></name></person-group> (<year>2007</year>). <article-title>Hebrew WordNet: a test case of aligning lexical databases across languages</article-title>. <source>Int. J. Transl.</source> <volume>19</volume>, <fpage>39</fpage>&#x02013;<lpage>58</lpage>.</citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Passmore</surname> <given-names>S.</given-names></name> <name><surname>Barth</surname> <given-names>W.</given-names></name> <name><surname>Greenhill</surname> <given-names>S. J.</given-names></name> <name><surname>Quinn</surname> <given-names>K.</given-names></name> <name><surname>Sheard</surname> <given-names>C.</given-names></name> <name><surname>Argyriou</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>Kinbank: a global database of kinship terminology</article-title>. <source>PLoS ONE</source> <volume>18</volume>, <fpage>e0283218</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0283218</pub-id><pub-id pub-id-type="pmid">37224178</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pianta</surname> <given-names>E.</given-names></name> <name><surname>Bentivogli</surname> <given-names>L.</given-names></name> <name><surname>Girardi</surname> <given-names>C.</given-names></name></person-group> (<year>2002</year>). <article-title>&#x0201C;Developing an aligned multilingual database,&#x0201D;</article-title> in <source>Proceedings of the 1st International WordNet Conference</source> (<publisher-loc>Mysuru</publisher-loc>: <publisher-name>Global Wordnet Association</publisher-name>), <fpage>293</fpage>&#x02013;<lpage>302</lpage>.</citation>
</ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Plungyan</surname> <given-names>V.</given-names></name></person-group> (<year>2011</year>). <article-title>Modern linguistic typology</article-title>. <source>Herald Russian Acad. Sci.</source> <volume>81</volume>, <fpage>101</fpage>&#x02013;<lpage>113</lpage>. <pub-id pub-id-type="doi">10.1134/S1019331611020158</pub-id></citation>
</ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Reznikova</surname> <given-names>T.</given-names></name> <name><surname>Rakhilina</surname> <given-names>E.</given-names></name> <name><surname>Bonch-Osmolovskaya</surname> <given-names>A.</given-names></name></person-group> (<year>2012</year>). <article-title>Towards a typology of pain predicates</article-title>. <source>Linguistics</source> <volume>50</volume>, <fpage>421</fpage>&#x02013;<lpage>465</lpage>. <pub-id pub-id-type="doi">10.1515/ling-2012-0015</pub-id></citation>
</ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roberson</surname> <given-names>D.</given-names></name> <name><surname>Davidoff</surname> <given-names>J.</given-names></name> <name><surname>Davies</surname> <given-names>I. R.</given-names></name> <name><surname>Shapiro</surname> <given-names>L. R.</given-names></name></person-group> (<year>2005</year>). <article-title>Color categories: evidence for the cultural relativity hypothesis</article-title>. <source>Cogn. Psychol.</source> <volume>50</volume>, <fpage>378</fpage>&#x02013;<lpage>411</lpage>. <pub-id pub-id-type="doi">10.1016/j.cogpsych.2004.10.001</pub-id><pub-id pub-id-type="pmid">15893525</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rzymski</surname> <given-names>C.</given-names></name> <name><surname>Tresoldi</surname> <given-names>T.</given-names></name> <name><surname>Greenhill</surname> <given-names>S. J.</given-names></name> <name><surname>Wu</surname> <given-names>M.-S.</given-names></name> <name><surname>Schweikhard</surname> <given-names>N. E.</given-names></name> <name><surname>Koptjevskaja-Tamm</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>The database of cross-linguistic colexifications, reproducible analysis of cross-linguistic polysemies</article-title>. <source>Sci. Data</source> <volume>7</volume>, <fpage>1</fpage>&#x02013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1038/s41597-019-0341-x</pub-id><pub-id pub-id-type="pmid">31932593</pub-id></citation></ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Salesky</surname> <given-names>E.</given-names></name> <name><surname>Chodroff</surname> <given-names>E.</given-names></name> <name><surname>Pimentel</surname> <given-names>T.</given-names></name> <name><surname>Wiesner</surname> <given-names>M.</given-names></name> <name><surname>Cotterell</surname> <given-names>R.</given-names></name> <name><surname>Black</surname> <given-names>A. W.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;A corpus for large-scale phonetic typology,&#x0201D;</article-title> in <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source> (<publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>4526</fpage>&#x02013;<lpage>4546</lpage>.</citation>
</ref>
<ref id="B50">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sneddon</surname> <given-names>J.</given-names></name></person-group> (<year>2003</year>). <source>The Indonesian Language</source>. <publisher-loc>Sydney, NSW</publisher-loc>: <publisher-name>University of New South Wales Press Ltd</publisher-name>.</citation>
</ref>
<ref id="B51">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Utomo</surname> <given-names>S. S.</given-names></name></person-group> (<year>2015</year>). <source>Kamus Indonesia-Jawa</source>. <publisher-loc>Jakarta</publisher-loc>: <publisher-name>PT Gramedia Pustaka Utama</publisher-name>.</citation>
</ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Viberg</surname> <given-names>&#x000C5;.</given-names></name></person-group> (<year>1983</year>). <article-title>The verbs of perception: a typological study</article-title>. <source>Linguistics</source> <volume>21</volume>, <fpage>123</fpage>&#x02013;<lpage>162</lpage>.</citation>
</ref>
<ref id="B53">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>W&#x000E4;lchli</surname> <given-names>B.</given-names></name> <name><surname>Cysouw</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). <article-title>Lexical typology through similarity semantics: toward a semantic map of motion verbs</article-title>. <source>Linguistics</source> <volume>50</volume>, <fpage>671</fpage>&#x02013;<lpage>710</lpage>. <pub-id pub-id-type="doi">10.1515/ling-2012-0021</pub-id></citation>
</ref>
<ref id="B54">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wierzbicka</surname> <given-names>A.</given-names></name></person-group> (<year>2007</year>). <article-title>Bodies and their parts: an NSM approach to semantic typology</article-title>. <source>Lang. Sci.</source> <volume>29</volume>, <fpage>14</fpage>&#x02013;<lpage>65</lpage>. <pub-id pub-id-type="doi">10.1016/j.langsci.2006.07.002</pub-id></citation>
</ref>
<ref id="B55">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zaidan</surname> <given-names>O. F.</given-names></name> <name><surname>Callison-Burch</surname> <given-names>C.</given-names></name></person-group> (<year>2014</year>). <article-title>Arabic dialect identification</article-title>. <source>Comput. Linguist.</source> <volume>40</volume>, <fpage>171</fpage>&#x02013;<lpage>202</lpage>. <pub-id pub-id-type="doi">10.1162/COLI_a_00169</pub-id></citation>
</ref>
</ref-list>
</back>
</article>