<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Digit. Health</journal-id>
<journal-title>Frontiers in Digital Health</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Digit. Health</abbrev-journal-title>
<issn pub-type="epub">2673-253X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdgth.2021.778305</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Digital Health</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Development of a Lexicon for Pain</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Chaturvedi</surname> <given-names>Jaya</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1422850/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Mascio</surname> <given-names>Aurelie</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1341572/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Velupillai</surname> <given-names>Sumithra U.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/580008/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Roberts</surname> <given-names>Angus</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/896521/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neurosciences, King&#x00027;s College London</institution>, <addr-line>London</addr-line>, <country>United Kingdom</country></aff>
<aff id="aff2"><sup>2</sup><institution>Health Data Research UK</institution>, <addr-line>London</addr-line>, <country>United Kingdom</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Thomas Martin Deserno, Technische Universitat Braunschweig, Germany</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Xia Jing, Clemson University, United States; Vasiliki Foufi, Consultant, Geneva, Switzerland; Maike Krips, Technische Universitat Braunschweig, Germany</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Jaya Chaturvedi <email>jaya.1.chaturvedi&#x00040;kcl.ac.uk</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Health Informatics, a section of the journal Frontiers in Digital Health</p></fn></author-notes>
<pub-date pub-type="epub">
<day>13</day>
<month>12</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>3</volume>
<elocation-id>778305</elocation-id>
<history>
<date date-type="received">
<day>16</day>
<month>09</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>24</day>
<month>11</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2021 Chaturvedi, Mascio, Velupillai and Roberts.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Chaturvedi, Mascio, Velupillai and Roberts</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license></permissions>
<abstract><p>Pain has been an area of growing interest in the past decade and is known to be associated with mental health issues. Due to the ambiguous nature of how pain is described in text, it presents a unique natural language processing (NLP) challenge. Understanding how pain is described in text and utilizing this knowledge to improve NLP tasks would be of substantial clinical importance. Not much work has previously been done in this space. For this reason, and in order to develop an English lexicon for use in NLP applications, an exploration of pain concepts within free text was conducted. The exploratory text sources included two hospital databases, a social media platform (Twitter), and an online community (Reddit). This exploration helped select appropriate sources and inform the construction of a pain lexicon. The terms within the final lexicon were derived from three sources&#x02014;literature, ontologies, and word embedding models. This lexicon was validated by two clinicians as well as compared to an existing 26-term pain sub-ontology and MeSH (Medical Subject Headings) terms. The final validated lexicon consists of 382 terms and will be used in downstream NLP tasks by helping select appropriate pain-related documents from electronic health record (EHR) databases, as well as pre-annotating these words to help in development of an NLP application for classification of mentions of pain within the documents. The lexicon and the code used to generate the embedding models have been made publicly available.</p></abstract>
<kwd-group>
<kwd>lexicon</kwd>
<kwd>natural language processing</kwd>
<kwd>pain</kwd>
<kwd>electronic health records</kwd>
<kwd>mental health</kwd>
</kwd-group>
<counts>
<fig-count count="4"/>
<table-count count="8"/>
<equation-count count="0"/>
<ref-count count="50"/>
<page-count count="11"/>
<word-count count="7940"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>Introduction</title>
<p>Pain is known to have a strong relationship with emotions, which can lead to damaging consequences (<xref ref-type="bibr" rid="B1">1</xref>). This is worsened for people suffering with persistent pain. It can lead to long-term mental health effects such as &#x0201C;secondary pain effect&#x0201D; which encapsulates the strong feelings toward the long-term implications of suffering from pain (<xref ref-type="bibr" rid="B1">1</xref>). The Biopsychosocial framework of pain reiterates the multidimensionality of pain and explains the dynamic relationships of pain with biological, psychological, and social factors (<xref ref-type="bibr" rid="B2">2</xref>). Pain has been an active area of research, especially since the onset of the crisis of opioid use in the United States (<xref ref-type="bibr" rid="B3">3</xref>). Pain also has a significant impact on the healthcare system and society in terms of costs (<xref ref-type="bibr" rid="B4">4</xref>). Apart from research, it has also been of interest to the general population. <bold>Figure 1</bold> shows Google trends for the search term &#x0201C;pain&#x0201D; over time (2004 to present) compared with two other common symptoms (&#x0201C;fever&#x0201D; and &#x0201C;cough&#x0201D;) to investigate whether the trends reflect a general increase in searches or an actual increase in searches for the term. All three terms were selected as &#x0201C;medical terms&#x0201D; rather than &#x0201C;general search&#x0201D; terms to avoid any metaphorical mentions and make the words more accurately comparable. This was possible through use of a Google Trends feature which allows the user to choose the search category (the generic &#x0201C;Search term&#x0201D; category would include any search results for the word &#x0201C;pain,&#x0201D; whereas the &#x0201C;Medical condition&#x0201D;/&#x0201C;Disease&#x0201D; category would only include &#x0201C;pain&#x0201D; when searched as a medical condition or disease). Pain shows an incremental increase worldwide (<xref ref-type="fig" rid="F1">Figure 1</xref>) (<xref ref-type="bibr" rid="B5">5</xref>).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Google trends for medical condition search term &#x0201C;pain&#x0201D; compared to other common symptoms &#x0201C;fever&#x0201D; and &#x0201C;cough.&#x0201D; X-axis represents time in years. Y-axis numbers represent the search interest relative to the highest point on the chart (100 is the peak popularity for the term, 50 indicates the term is half as popular, and 0 means there was insufficient data for the term).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdgth-03-778305-g0001.tif"/>
</fig>
<p>Research is a growing secondary use of mental health electronic health records (EHRs), specifically the free-text fields (<xref ref-type="bibr" rid="B6">6</xref>). Free text has the potential to provide additional information on contextual factors around the patient (<xref ref-type="bibr" rid="B7">7</xref>). While it is beneficial to include clinical notes in research, extracting and understanding information from the free text can be challenging (<xref ref-type="bibr" rid="B8">8</xref>). Natural language processing (NLP) methods can help combat some of the issues inherent in clinical text, such as misspellings, abbreviations, and semantic ambiguities.</p>
<p>Another rich source of health-related textual data is social media as it provides a unique patient perspective into health (<xref ref-type="bibr" rid="B9">9</xref>). In recent years, there has been an increase in the use of social media platforms to share health information, receive and provide support, and look for advice from others suffering with similar ailments (<xref ref-type="bibr" rid="B9">9</xref>). Content from these platforms has also been increasingly used in health research. Examples include finding symptom clusters for breast cancer (<xref ref-type="bibr" rid="B10">10</xref>), understanding the relationships between e-cigarettes and mental illness (<xref ref-type="bibr" rid="B11">11</xref>), as well as understanding user generated discourse around obesity (<xref ref-type="bibr" rid="B12">12</xref>). The main platforms involved in these studies have been Reddit<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> and Twitter<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>. Reddit has been a good source for such textual research due to its wide usage as well as the ability to post anonymously (<xref ref-type="bibr" rid="B13">13</xref>). Reddit has more than 330 million active monthly users, and over 138K active communities (<xref ref-type="bibr" rid="B14">14</xref>). A key feature of Reddit is the subforum function which allows creation of subreddit communities dedicated to shared interests (<xref ref-type="bibr" rid="B9">9</xref>). Twitter has shorter text spans than Reddit, a maximum of 280 characters (<xref ref-type="bibr" rid="B15">15</xref>). Despite this limitation, Twitter is widely used in research around mental health and suicidality (<xref ref-type="bibr" rid="B16">16</xref>&#x02013;<xref ref-type="bibr" rid="B18">18</xref>).</p>
<p>The term &#x0201C;pain&#x0201D; presents a unique NLP problem, due to its subjective nature and ambiguous description. Pain can refer to physical distress, or existential suffering, and sometimes even legal punishment (<xref ref-type="bibr" rid="B19">19</xref>). However, within the clinical context, it will most likely be the former two. It also has metaphorical uses in phrases such as &#x0201C;for being a pain&#x0201D; (<xref ref-type="bibr" rid="B19">19</xref>). In order to better understand how pain is described in different textual sources, and to construct a lexicon of pain for use in NLP applications, this study conducts a preliminary exploration of mentions of pain. This exploration includes analysis of mentions of pain in four different sources with the objective of understanding how mentions of pain differ in these sources, and whether they cover common themes. These exploratory sources are a mental health hospital in the UK (CRIS, from the South London and Maudsley NHS Foundation Trust), the critical care units of a hospital in the United States (MIMIC-III), Reddit, and Twitter.</p>
<p>Gaining a good understanding of how pain is mentioned in text can be formalized by creation of a lexicon of pain terms. Lexicons are a valuable resource that can help develop NLP systems and improve extraction of concepts of interest from clinical text (<xref ref-type="bibr" rid="B20">20</xref>). Lexicons provide a wide range of terms and misspellings from relevant domains, which will be advantageous in future NLP tasks and will minimize the risk of missing important documents that contain these relevant terms. An existing ontology, The Experimental Factor Ontology (<xref ref-type="bibr" rid="B21">21</xref>), consists of a subsection of 26 pain related terms, but to our knowledge, no previous studies have explored how the concept of pain is used in different text sources, and used this to generate a new lexicon. While using terms generated by a domain expert has the benefit of being more precise, we believe that for an ambiguous term such as pain, our method of producing a lexicon semi-automatically, for domain expert review, will favor recall without damaging precision (i.e., sensitivity without loss of positive predictive value). The generation of this lexicon will involve a combination of terms related to pain from three sources&#x02014;literature, ontologies, and embedding models built using EHR data. Mentions from social media that were part of the exploratory sources are not included as lexicon sources since the primary purpose of the lexicon in this instance is for use on EHR data. Any relevant mentions from social media may be added to the lexicon at a later date.</p>
<p>The aim of this study was to conduct an exploration of how pain was mentioned within four different text sources. The purpose of this exploration was to understand what sources of textual information might be useful additions to the lexicon. The eventual goal of generating this lexicon is to be able to use it in downstream NLP tasks where it can be used to identify relevant pain-related documents from EHR databases.</p>
</sec>
<sec sec-type="materials and methods" id="s2">
<title>Materials and Methods</title>
<p>The final lexicon consists of relevant pain related terms from three key areas&#x02014;ontologies, literature, and embedding models. The lexicon was reviewed and validated by domain experts. In addition to this, the lexicon was also compared to another ontology that consists of 26 pain-related terms. This ontology is available as part of the Experimental Factor Ontology (version 1.4) (<xref ref-type="bibr" rid="B21">21</xref>) as a subsection for pain.</p>
<sec>
<title>Data Collection and Exploration/Source Comparison</title>
<p>Four different data sources were explored for mentions of pain within their textual components, and a comparison was conducted to understand the different contexts in which pain can be mentioned. Fifty randomly selected documents were extracted from each source. The number of documents was limited to 50 per text source for pragmatic reasons: manual review is a labor-intensive process. This decision should not impact the lexicon development, as these documents were used only for exploration. The lexicon terms themselves were supplemented by embeddings built on the whole of two sources (MIMIC and CRIS).</p>
<sec>
<title>Ethics and Data Access</title>
<p>While data from Reddit and Twitter are publicly available, applicable ethical research protocols proposed by Benton et al. were followed in this study (<xref ref-type="bibr" rid="B22">22</xref>). No identifiable user data or private accounts were used, and any sensitive direct quotes were paraphrased.</p>
<p>Data from Twitter is available through their API after approval of registration for access to this data, details of which can be found in their general guidelines and policies documentation (<xref ref-type="bibr" rid="B23">23</xref>). Data access information for CRIS (<xref ref-type="bibr" rid="B24">24</xref>) and MIMIC-III (<xref ref-type="bibr" rid="B25">25</xref>) is detailed on their respective websites.</p>
</sec>
<sec>
<title>CRIS</title>
<p>An anonymized version of EHR data from The South London and Maudsley NHS Foundation Trust (SLaM) is stored in the Clinical Record Interactive Search (CRIS) database (<xref ref-type="bibr" rid="B6">6</xref>). The infrastructure of CRIS has been described in detail (<xref ref-type="bibr" rid="B26">26</xref>) with an overview of the cohort profile. This project was approved by the CRIS oversight committee (Oxford C Research Ethics Committee, reference 18/SC/0372). Clinical Record Interactive Search consists of almost 30 million notes and correspondence letters, with an average of 90 documents per patient (<xref ref-type="bibr" rid="B7">7</xref>).</p>
<p>A SQL query was run on the most common source of clinical text (&#x0201C;attachments&#x0201D; table which consists of documents such as discharge and assessment documents, GP letters, review, and referral forms) within the CRIS database, and 50 randomly selected documents that contained the keyword &#x0201C;pain&#x0201D; (both upper and lower case) were extracted. This would include any instance of &#x0201C;pain&#x0201D; regardless of whether it refers to physical pain or emotional/mental pain. Other features of the documents, such as maximum and minimum length of documents were calculated, as well as common collocates for the term &#x0201C;pain.&#x0201D;</p>
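<p>The selection and length calculation described above can be sketched in Python. This is a minimal illustration rather than the SQL query that was run against CRIS: the function names <monospace>sample_pain_documents</monospace> and <monospace>length_summary</monospace> are hypothetical, the documents are assumed to be already loaded as a list of strings, and a plain case-insensitive substring match is assumed (so a word such as &#x0201C;painful&#x0201D; would also match).</p>
<preformat preformat-type="code"><![CDATA[
```python
import random
import re

# Assumption: case-insensitive substring match on "pain".
PAIN_PATTERN = re.compile(r"pain", re.IGNORECASE)


def sample_pain_documents(documents, k=50, seed=0):
    """Return up to k randomly chosen documents that mention 'pain' in any case."""
    matches = [doc for doc in documents if PAIN_PATTERN.search(doc)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(matches, min(k, len(matches)))


def length_summary(documents):
    """Average, minimum, and maximum document length in characters."""
    lengths = [len(doc) for doc in documents]
    return {
        "average": sum(lengths) / len(lengths),
        "minimum": min(lengths),
        "maximum": max(lengths),
    }


if __name__ == "__main__":
    docs = ["Chest PAIN on exertion", "No complaints", "pain free today"]
    sampled = sample_pain_documents(docs, k=50, seed=1)
    print(length_summary(sampled))
```
]]></preformat>
<p>A stricter word-boundary pattern could be substituted if derived forms should be excluded; the choice depends on how the keyword filter is meant to behave.</p>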
</sec>
<sec>
<title>MIMIC-III</title>
<p>Medical Information Mart for Intensive Care (MIMIC-III) is an EHR database which was developed by the Massachusetts Institute of Technology (MIT), available for researchers under a specified governance model (<xref ref-type="bibr" rid="B25">25</xref>). Medical Information Mart for Intensive Care consists of about 1.2 million clinical notes (<xref ref-type="bibr" rid="B27">27</xref>).</p>
<p>A SQL query was run on the &#x0201C;note-events&#x0201D; table which contains the majority of the clinical notes (such as nursing and physician notes, ECG reports, radiology reports, and discharge summaries) within the database, and 50 random documents containing the keyword &#x0201C;pain&#x0201D; (both upper and lower case) were extracted. Like the CRIS database, an analysis of the maximum and minimum length of documents was carried out, and common collocates for the term &#x0201C;pain&#x0201D; were explored.</p>
</sec>
<sec>
<title>Reddit</title>
<p>Reddit is an online community which supports unidentifiable accounts to allow users to post anonymously and provides sub communities for people to discuss topics of shared interest. The chronic pain subreddit (r/ChronicPain) community was used in this study. Other subreddits around pain included more specific communities, such as &#x0201C;back pain,&#x0201D; which would not serve our purpose of keeping it general. While this approach might miss mentions of other types of pain, this could not be avoided because there is no general pain subreddit. Data from Reddit was extracted using the Python package PRAW (<xref ref-type="bibr" rid="B28">28</xref>). No time filter was applied. Seven thousand seven hundred posts were extracted, out of which 50 posts were randomly selected.</p>
</sec>
<sec>
<title>Twitter</title>
<p>Twitter is an online micro-blogging platform with an enormous number of users who post short (280 characters or less) messages, referred to as &#x0201C;tweets,&#x0201D; on topics of interest. It is a good resource for textual data because of the volume of tweets posted on it and the public availability of this data (<xref ref-type="bibr" rid="B29">29</xref>). The Python package tweepy (<xref ref-type="bibr" rid="B30">30</xref>) was used to extract tweets using the search term &#x0201C;chronic pain.&#x0201D; As with Reddit, chronic pain was used instead of pain to help get more meaningful health-related results. This approach was not applied to the EHR text as the assumption was that metaphorical mentions would be more prevalent in social media. This does carry the risk of possibly missing out on mentions of pain that were not explicitly chronic. Since the Twitter API allows for extraction of tweets within a seven-day window, 7,707 tweets that contained the keywords &#x0201C;chronic pain&#x0201D; (case insensitive) were extracted within the time period 06/08/2020 to 11/08/2020. Out of these, 50 tweets were randomly selected for analysis.</p>
</sec>
</sec>
<sec>
<title>Lexicon Development</title>
<p>Concordances and analyses on data from the previous step were used to inform the appropriateness of the mentions of &#x0201C;pain&#x0201D; and whether they were meaningful mentions and thereby suitable for inclusion in building a lexicon of pain terms. The terms within the EHR text had more appropriate concordances (i.e., referring to actual pain rather than metaphorical mentions) and were therefore included in the lexicon while the social media ones were not. Embedding models built using Twitter (<xref ref-type="bibr" rid="B31">31</xref>) and Reddit (<xref ref-type="bibr" rid="B32">32</xref>) data were not used as their results returned words that did not seem relevant to the term &#x0201C;pain.&#x0201D; They generated terms such as brain, anger, patience, and habit with Twitter, and words such as apartment, principal, and goal with Reddit. In addition to this, a few publications and ontologies were explored as potential sources as well. The final lexicon was built by combining terms generated through three different sources.</p>
<sec>
<title>Literature-Based Terms</title>
<p>We harvested pain-related words from three publications:</p>
<list list-type="simple">
<list-item><p>(1) A list of symptom terms provided by a systematic review on application of NLP methods for symptom extraction from electronic patient-authored text (ePAT) (<xref ref-type="bibr" rid="B33">33</xref>). Some examples include pain, ache, sore, tenderness, head discomfort.</p></list-item>
<list-item><p>(2) Ten words most similar to pain generated in a survey of biomedical literature-based word embedding models (<xref ref-type="bibr" rid="B34">34</xref>). Some examples include discomfort, fatigue, pains, headache, backache.</p></list-item>
<list-item><p>(3) A list of sign and symptom strings generated using NLP to meaningfully depict experiences of pain in patients with metastatic prostate cancer, as well as identify novel pain phenotypes (<xref ref-type="bibr" rid="B1">1</xref>). In our literature search, this was the only paper on NLP-based extraction of pain terms that included a list of the terms used. Some examples include ache, abdomen pain, backpain, arthralgia, bellyache.</p></list-item>
</list>
<p>These lists were cleaned by lowercasing all terms, and only keeping terms made up of one or two tokens as these included most of the terms, and any terms with more than two tokens were less meaningful or repetitive of the two token terms. Terms with more than two tokens were only listed in one of the papers (<xref ref-type="bibr" rid="B1">1</xref>), and some examples of these were terms such as pain of jaw, right lower quadrant abdominal pain, upper chest pain, and so on, most of which were covered within the two token terms such as abdominal pain and chest pain.</p>
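<p>The cleaning step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name <monospace>clean_terms</monospace> is hypothetical, tokens are assumed to be separated by whitespace, and removing duplicates at this stage is an added convenience rather than a step reported for the literature-based lists.</p>
<preformat preformat-type="code"><![CDATA[
```python
def clean_terms(terms, max_tokens=2):
    """Lowercase terms and keep only those with at most max_tokens tokens."""
    cleaned = []
    seen = set()
    for term in terms:
        # Lowercase and collapse any repeated whitespace between tokens.
        normalized = " ".join(term.lower().split())
        if not normalized:
            continue
        # Assumption: a token is a whitespace-separated word.
        if len(normalized.split(" ")) > max_tokens:
            continue
        # Assumption: duplicates are dropped, keeping first-seen order.
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


if __name__ == "__main__":
    raw = ["Ache", "abdomen pain", "right lower quadrant abdominal pain",
           "upper chest pain", "Chest Pain", "ache"]
    print(clean_terms(raw))
```
]]></preformat>
<p>With the example input, the three-token and longer terms are dropped while their two-token counterparts, such as chest pain, are retained.</p>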
</sec>
<sec>
<title>Ontology-Based Terms</title>
<p>We incorporated synonyms for pain from three biomedical ontologies: The Unified Medical Language System (UMLS) (<xref ref-type="bibr" rid="B35">35</xref>), Systematized NOmenclature of MEDicine Clinical Terms (SNOMED-CT) (<xref ref-type="bibr" rid="B36">36</xref>), and International statistical Classification of Diseases and related health problems: tenth revision (ICD-10) (<xref ref-type="bibr" rid="B37">37</xref>). Unified Medical Language System contains concepts from SNOMED-CT and ICD-10, in addition to several other vocabularies. From each, we extracted terms of up to two tokens that either matched &#x0201C;pain<sup>&#x0002A;</sup>,&#x0201D; were synonyms of pain, or were described as child nodes of pain.</p>
</sec>
<sec>
<title>Embedding Models</title>
<p>Embedding models (<xref ref-type="bibr" rid="B38">38</xref>, <xref ref-type="bibr" rid="B39">39</xref>) using eight different parameters and four different text sources were used to generate additional words similar to &#x0201C;pain.&#x0201D; The elbow method (<xref ref-type="bibr" rid="B40">40</xref>) was used to determine the cut-off point in word similarity which helped determine the similarity threshold for each model. An advantage of using embedding models is their ability to capture misspellings. Any duplicates were removed, and the remaining terms were added to the lexicon.</p>
<p>Two of the embedding models [both described in Viani et al. (<xref ref-type="bibr" rid="B41">41</xref>)] were built using clinical text available within the MIMIC-II database (<xref ref-type="bibr" rid="B42">42</xref>). Four embedding models were built using clinical text available within MIMIC-III, of which three were built using the gensim implementation of word2vec (<xref ref-type="bibr" rid="B38">38</xref>) and one using FastText (<xref ref-type="bibr" rid="B43">43</xref>). One model was built using word2vec over a severe mental illness (SMI) cohort from CRIS. Finally, a publicly available model built on PubMed and PubMed Central (PMC) article texts was used (<xref ref-type="bibr" rid="B44">44</xref>). Only unigrams were included from all the models. The parameters for these are detailed in <bold>Table 6</bold>.</p>
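<p>One common way to apply an elbow criterion to a ranked list of similarity scores is to pick the point farthest from the straight line joining the first and last scores. The sketch below illustrates that idea only; it is not necessarily the exact procedure used for the models above. The function names <monospace>elbow_index</monospace> and <monospace>terms_above_elbow</monospace>, the example neighbors, and the choice to keep the elbow point itself are all assumptions made for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def elbow_index(similarities):
    """Index (in descending order) of the score farthest from the first-to-last line."""
    scores = sorted(similarities, reverse=True)
    n = len(scores)
    if n < 3:
        return n - 1  # too few points to define an elbow
    x1, y1 = 0.0, scores[0]
    x2, y2 = float(n - 1), scores[-1]
    norm = math.hypot(x2 - x1, y2 - y1)
    best_index, best_distance = 0, -1.0
    for i, y in enumerate(scores):
        # Perpendicular distance from point (i, y) to the line through the endpoints.
        distance = abs((y2 - y1) * i - (x2 - x1) * y + x2 * y1 - y2 * x1) / norm
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index


def terms_above_elbow(neighbors):
    """Keep (term, similarity) neighbors up to and including the elbow point."""
    ranked = sorted(neighbors, key=lambda pair: pair[1], reverse=True)
    cut = elbow_index([score for _, score in ranked])
    return [term for term, _ in ranked[: cut + 1]]


if __name__ == "__main__":
    # Hypothetical neighbors of "pain"; not output from any model in this study.
    neighbors = [("discomfort", 0.90), ("ache", 0.88), ("soreness", 0.86),
                 ("tenderness", 0.84), ("nausea", 0.50), ("fever", 0.48)]
    print(terms_above_elbow(neighbors))
```
]]></preformat>
<p>Whether the elbow point itself is kept or dropped is a design choice; here it is kept, which suits similarity curves that stay high before a sharp drop.</p>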
</sec>
</sec>
<sec>
<title>Validation</title>
<p>Upon collection of data from the four different sources, common themes were explored. The purpose was to understand the common contexts in which pain might be mentioned. In addition to common themes, length of the text containing mentions of pain was calculated, along with most frequent concordances and mutual information scores.</p>
<p>Validation of the terms for inclusion in the final lexicon was conducted using three methods: validation by two clinicians, comparison to an existing pain-related lexicon, and comparison to MeSH (Medical Subject Headings)<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref>.</p>
<p>A list of the terms generated through the three text sources was shared with two clinicians who marked each term as: relevant mention of pain, not relevant to pain, or too vague in relation to pain. In addition to this, they added a few new terms to the lexicon.</p>
<p>As an additional validation step, the final lexicon validated by the clinicians was compared to an existing ontology, The Experimental Factor Ontology (<xref ref-type="bibr" rid="B21">21</xref>), which consists of a sub-section of 26 pain-related terms. The final lexicon was also compared to 63 pain-related MeSH terms. Each MeSH term also consisted of a set of entry terms (a total of 941 pain-related terms). Entry terms refer to synonyms, alternate forms, and other terms that are closely related to the MeSH term (<xref ref-type="bibr" rid="B45">45</xref>). With both these comparisons, any terms that did not overlap were investigated to see why they might be missing from our lexicon.</p>
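<p>The overlap check in these comparisons amounts to set operations on normalized term lists. The sketch below is illustrative only: the function name <monospace>compare_term_sets</monospace> is hypothetical, matching is assumed to be exact after lowercasing and trimming, and the example terms are made up rather than taken from the lexicon, the Experimental Factor Ontology, or MeSH.</p>
<preformat preformat-type="code"><![CDATA[
```python
def compare_term_sets(lexicon, reference):
    """Split terms into shared, reference-only, and lexicon-only groups."""
    # Assumption: terms match only if identical after lowercasing and trimming.
    lex = {term.strip().lower() for term in lexicon}
    ref = {term.strip().lower() for term in reference}
    return {
        "shared": sorted(lex & ref),
        "only_in_reference": sorted(ref - lex),  # candidates to investigate
        "only_in_lexicon": sorted(lex - ref),
    }


if __name__ == "__main__":
    # Hypothetical term lists for illustration.
    lexicon = ["chest pain", "Ache", "backache"]
    reference = ["Chest Pain", "neuralgia"]
    print(compare_term_sets(lexicon, reference))
```
]]></preformat>
<p>The terms listed under <monospace>only_in_reference</monospace> are the ones that would be reviewed manually to see why they are absent from the lexicon.</p>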
<p>After generation and validation of the final lexicon, the core pain-related words were separated out from the multi-word terms (such as pain from leg pain and arm pain; sore from sore mouth and sore muscle) and these words were looked up within a cohort of SMI patients from the CRIS database. A frequency count was conducted to see which of them occur most frequently within this cohort of patients.</p>
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>Results</title>
<sec>
<title>Exploration of Pain</title>
<p>Three common pain terms were chosen to gain an understanding of how frequently they are mentioned in EHR documents. These terms were: pain, chronic pain, and words ending with -algia, a common suffix meaning pain. A more detailed search on other pain-related terms such as ache will be conducted at a later stage. A summary of frequencies of these terms within the two EHR-based sources is outlined in <xref ref-type="table" rid="T1">Table 1</xref>. As seen in the table, the term &#x0201C;pain&#x0201D; had the greatest number of mentions and was thus used for selecting documents from the databases for exploration (as described in the Materials and Methods section).</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Count of mentions of &#x0201C;pain&#x0201D;, &#x0201C;chronic pain,&#x0201D; and &#x0201C;-algia&#x0201D; per 10,000 tokens (counts for &#x0201C;pain&#x0201D; include &#x0201C;chronic pain&#x0201D; instances too).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Terms</bold></th>
<th valign="top" align="center"><bold>CRIS&#x02014;Attachments</bold></th>
<th valign="top" align="center"><bold>MIMIC-III</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Pain</td>
<td valign="top" align="center">29.59</td>
<td valign="top" align="center">44.13</td>
</tr>
<tr>
<td valign="top" align="left">Chronic pain</td>
<td valign="top" align="center">1.22</td>
<td valign="top" align="center">4.04</td>
</tr>
<tr>
<td valign="top" align="left">&#x0002A;Algia</td>
<td valign="top" align="center">1.14</td>
<td valign="top" align="center">1.44</td>
</tr>
</tbody>
</table>
</table-wrap>
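<p>The normalized counts in <xref ref-type="table" rid="T1">Table 1</xref> are rates per 10,000 tokens, i.e., 10,000 &#x000D7; (number of mentions) / (number of tokens). A minimal sketch of that arithmetic is given below. The function name <monospace>mentions_per_10k</monospace>, the whitespace tokenization, and the regular expressions are assumptions for illustration and may differ from how the reported figures were computed.</p>
<preformat preformat-type="code"><![CDATA[
```python
import re

# Assumed patterns; "pain" also counts the "pain" inside "chronic pain".
PATTERNS = {
    "pain": r"\bpain\b",
    "chronic pain": r"\bchronic pain\b",
    "*algia": r"\b\w+algia\b",
}


def mentions_per_10k(text, pattern):
    """Case-insensitive matches of pattern per 10,000 whitespace-separated tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return 10000 * hits / len(tokens)


if __name__ == "__main__":
    note = "Chronic pain and neuralgia noted with pain at night today"
    for label, pattern in PATTERNS.items():
        print(label, mentions_per_10k(note, pattern))
```
]]></preformat>
<p>Normalizing by token count makes the rates comparable between databases of very different sizes.</p>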
<p>Comparing the EHR text data to those from social media platforms Twitter and Reddit, the length of text containing the word &#x0201C;pain&#x0201D; was calculated to understand how much content might be available in each source (<xref ref-type="table" rid="T2">Table 2</xref>).</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Length of text within documents containing the word &#x0201C;pain&#x0201D; in the 4 text sources on a random set of 50 documents for each text source.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Source</bold></th>
<th valign="top" align="center"><bold>CRIS</bold></th>
<th valign="top" align="center"><bold>MIMIC</bold></th>
<th valign="top" align="center"><bold>Twitter</bold></th>
<th valign="top" align="center"><bold>Reddit</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Average length of text (charac.)</td>
<td valign="top" align="center">8,144</td>
<td valign="top" align="center">3,864</td>
<td valign="top" align="center">62</td>
<td valign="top" align="center">1,065</td>
</tr>
<tr>
<td valign="top" align="left">Minimum length of text (charac.)</td>
<td valign="top" align="center">1,155</td>
<td valign="top" align="center">165</td>
<td valign="top" align="center">11</td>
<td valign="top" align="center">139</td>
</tr>
<tr>
<td valign="top" align="left">Maximum length of text (charac.)</td>
<td valign="top" align="center">32,767</td>
<td valign="top" align="center">9,549</td>
<td valign="top" align="center">106</td>
<td valign="top" align="center">3,598</td>
</tr>
</tbody>
</table>
</table-wrap>
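<p>The figures in <xref ref-type="table" rid="T2">Table 2</xref> are plain character counts per document. A minimal sketch of such a calculation is given below, assuming each source is held as a list of strings; the function name <italic>length_summary</italic> and the sample documents are hypothetical placeholders and are not taken from the study data.</p>
<preformat><![CDATA[
```python
from statistics import mean


def length_summary(documents, keyword="pain"):
    """Return (average, minimum, maximum) character length of documents mentioning keyword."""
    lengths = [len(doc) for doc in documents if keyword in doc.lower()]
    if not lengths:
        raise ValueError("no document mentions the keyword")
    return mean(lengths), min(lengths), max(lengths)


# Hypothetical placeholder documents, not drawn from CRIS, MIMIC, Twitter or Reddit.
sample = ["Back pain again today", "No symptoms reported", "Pain relief helped."]
summary = length_summary(sample)
```
]]></preformat>
<p>Only documents that mention the keyword contribute to the summary, which mirrors the restriction to documents containing the word &#x0201C;pain.&#x0201D;</p>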
<p>During the comparison of these sources, four common themes emerged, as shown in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Common themes around &#x0201C;pain&#x0201D; in the 50 randomly selected documents from the four data sources.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Source</bold></th>
<th valign="top" align="left"><bold>Quality/Type of pain</bold></th>
<th valign="top" align="left"><bold>Feelings/Experiences associated with the pain</bold></th>
<th valign="top" align="left"><bold>Medication or other measures</bold></th>
<th valign="top" align="left"><bold>Related to body parts</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">CRIS</td>
<td valign="top" align="left">In constant pain <break/>Ongoing pain <break/>Pain was quite severe</td>
<td valign="top" align="left">Overwhelmed by chronic pain problems <break/>Fear of pain <break/>Pain causing distress <break/>Struggles with chronic pain</td>
<td valign="top" align="left">Drugs to numb the pain <break/>Pain relief medication not controlling the pain <break/>Side effects from pain relief medication <break/>No pain relief with NSAIDs</td>
<td valign="top" align="left">Chronic back pain <break/>Chest pain</td>
</tr>
<tr>
<td valign="top" align="left">MIMIC-III</td>
<td valign="top" align="left">Severe pain <break/>Atypical pain</td>
<td valign="top" align="left">&#x02013;</td>
<td valign="top" align="left">PO as needed for pain <break/>Taking narcotic pain medication <break/>Managed with IV pain medication <break/>Pain was controlled with oral analgesics</td>
<td valign="top" align="left">Chronic back pain <break/>Chest pain <break/>Abdominal pain <break/>Right leg pain <break/>Chronic lower back pain</td>
</tr>
<tr>
<td valign="top" align="left">Reddit</td>
<td valign="top" align="left">Sharp pain <break/>Widespread pain</td>
<td valign="top" align="left">Could be causing pain <break/>Painful trips to the kitchen <break/>In the same painful position as 3 months ago</td>
<td valign="top" align="left">Helped my back pain</td>
<td valign="top" align="left">Shoulder pain <break/>Back pain <break/>Chronic neck pain <break/>Chronic joint pain</td>
</tr>
<tr>
<td valign="top" align="left">Twitter</td>
<td valign="top" align="left">&#x02013;</td>
<td valign="top" align="left">To live pain-free</td>
<td valign="top" align="left">Muscle painbuster</td>
<td valign="top" align="left">Joint muscle pain <break/>Back pain</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>An analysis was conducted using LancsBox (<xref ref-type="bibr" rid="B46">46</xref>) to identify the collocates of the term &#x0201C;pain,&#x0201D; limited to words with a frequency of more than 10. The top five collocates from each source are listed in <xref ref-type="table" rid="T4">Table 4</xref>. Reddit and Twitter produced mostly generic terms that conveyed little about pain.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Collocates for &#x0201C;pain&#x0201D; with frequency &#x0003E; 10.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>CRIS</bold></th>
<th valign="top" align="left"><bold>MIMIC-III</bold></th>
<th valign="top" align="left"><bold>Reddit</bold></th>
<th valign="top" align="left"><bold>Twitter</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Chronic</td>
<td valign="top" align="left">Control</td>
<td valign="top" align="left">Pain</td>
<td valign="top" align="left">Agony</td>
</tr>
<tr>
<td valign="top" align="left">Back</td>
<td valign="top" align="left">Acute</td>
<td valign="top" align="left">About</td>
<td valign="top" align="left">Amazingly</td>
</tr>
<tr>
<td valign="top" align="left">Clinic</td>
<td valign="top" align="left">Chronic</td>
<td valign="top" align="left">Anyone</td>
<td valign="top" align="left">Achieved</td>
</tr>
<tr>
<td valign="top" align="left">Physical</td>
<td valign="top" align="left">Assessment</td>
<td valign="top" align="left">Back</td>
<td valign="top" align="left">American</td>
</tr>
<tr>
<td valign="top" align="left">Health</td>
<td valign="top" align="left">Plan</td>
<td valign="top" align="left">Anything</td>
<td valign="top" align="left">Body</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The collocation tool within LancsBox looks at five words on either side of the search term &#x0201C;pain.&#x0201D; This explains why &#x0201C;pain&#x0201D; is itself a collocate within the Reddit dataset, since some posts mention &#x0201C;pain&#x0201D; more than once within that window, as in these paraphrased examples: &#x0201C;I suffer from a condition which causes back pain and pain in legs&#x0201D;; &#x0201C;I have chronic pain. The pain is in my shoulder&#x02026;&#x0201D; The same window could also explain why generic words were selected, such as &#x0201C;anyone&#x0201D; (e.g., &#x0201C;I have tried opioids for back pain. Has anyone else seen an improvement with this&#x02026;&#x0201D;; &#x0201C;Has anyone used heat for pain&#x02026;&#x0201D;) and &#x0201C;anything&#x0201D; (e.g., &#x0201C;the meds are not doing anything for my pain&#x0201D;).</p>
<p><xref ref-type="table" rid="T5">Table 5</xref> lists the top five collocates for &#x0201C;pain&#x0201D; with a mutual information (MI) score &#x0003E;6. The MI score measures the amount of non-randomness present when two words co-occur (<xref ref-type="bibr" rid="B47">47</xref>), thereby giving a more accurate idea of the relationship between two words (<xref ref-type="bibr" rid="B48">48</xref>). It is recommended that an MI score greater than 3 be used (<xref ref-type="bibr" rid="B48">48</xref>) to get more meaningful results. An MI score of 5 and more was used in this instance since collocates with a lower MI score were generic and vague, including words such as &#x0201C;what,&#x0201D; &#x0201C;if,&#x0201D; and &#x0201C;with.&#x0201D; The letters in brackets indicate whether the collocate occurred to the right (R) or left (L) of the word &#x0201C;pain.&#x0201D; Reddit and Twitter data produced mostly generic results.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Collocates for &#x0201C;pain&#x0201D; with an MI score &#x0003E; 6.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>CRIS</bold></th>
<th valign="top" align="left"><bold>MIMIC-III</bold></th>
<th valign="top" align="left"><bold>Reddit</bold></th>
<th valign="top" align="left"><bold>Twitter</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Killers (R)</td>
<td valign="top" align="left">Chronic (R)</td>
<td valign="top" align="left">Board (R)</td>
<td valign="top" align="left">People (L)</td>
</tr>
<tr>
<td valign="top" align="left">Chronic (L)</td>
<td valign="top" align="left">Control (L)</td>
<td valign="top" align="left">Certified (L)</td>
<td valign="top" align="left">Amp (R)</td>
</tr>
<tr>
<td valign="top" align="left">Fibromyalgia (R)</td>
<td valign="top" align="left">Complains (L)</td>
<td valign="top" align="left">Suboxone (L)</td>
<td valign="top" align="left">Get (L)</td>
</tr>
<tr>
<td valign="top" align="left">Ongoing (R)</td>
<td valign="top" align="left">Incisional (L)</td>
<td valign="top" align="left">Chronic (L)</td>
<td valign="top" align="left">Medical (L)</td>
</tr>
<tr>
<td valign="top" align="left">Feet (R)</td>
<td valign="top" align="left">Acute (L)</td>
<td valign="top" align="left">Doctor (R)</td>
<td valign="top" align="left">Suffer (L)</td>
</tr>
</tbody>
</table>
</table-wrap>
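<p>To illustrate how a window-based collocate list with an MI score can be derived, the sketch below counts words within five tokens of &#x0201C;pain&#x0201D; and scores each by the base-2 logarithm of its observed over expected co-occurrence frequency. This is a simplified form of the MI statistic and is an assumption for illustration only: it ignores any window-size correction or other adjustment that LancsBox may apply, and the function name <italic>collocates</italic> and the example sentence handling are hypothetical.</p>
<preformat><![CDATA[
```python
import math
from collections import Counter


def collocates(tokens, node="pain", span=5, min_freq=1):
    """Score words within `span` tokens of `node` by a simplified MI = log2(observed / expected)."""
    total = len(tokens)
    word_freq = Counter(tokens)
    co_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Words to the left and right of this occurrence of the node word.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        co_freq.update(window)
    scores = {}
    for word, observed in co_freq.items():
        if observed < min_freq:
            continue
        # Expected co-occurrence if the two words were distributed independently.
        expected = word_freq[node] * word_freq[word] / total
        scores[word] = math.log2(observed / expected)
    return scores


# Paraphrased example from the text, lower-cased and stripped of punctuation.
tokens = "i have chronic pain and the pain is in my shoulder".split()
scores = collocates(tokens)
```
]]></preformat>
<p>Because the second mention of &#x0201C;pain&#x0201D; falls inside the window of the first, the node word appears among its own collocates, which is the effect described above for the Reddit data.</p>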
<p>Using the observations made during this preliminary exploration, a conceptual diagram (<xref ref-type="fig" rid="F2">Figure 2</xref>) of pain was created. The objective of constructing this conceptual diagram was to visualize what features were commonly found around the mention of pain.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Conceptual diagram of pain. Created using an online tool, Grafo (<xref ref-type="bibr" rid="B43">43</xref>).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdgth-03-778305-g0002.tif"/>
</fig>
</sec>
<sec>
<title>Building the Lexicon</title>
<p><xref ref-type="table" rid="T6">Table 6</xref> summarizes the number of words obtained from the three different sources. For the embedding models, the model parameters and elbow thresholds are also included.</p>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Number of words obtained from the different sources, and parameters/elbow threshold for the embedding models.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Source</bold></th>
<th valign="top" align="left"><bold>Parameters</bold></th>
<th valign="top" align="center"><bold>Elbow threshold</bold></th>
<th valign="top" align="center"><bold>No. of unigrams</bold></th>
<th valign="top" align="center"><bold>No. of bigrams</bold></th>
<th valign="top" align="center"><bold>Total no. of words</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Literature</td>
<td valign="top" align="left">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">71</td>
<td valign="top" align="center">170</td>
<td valign="top" align="center">241</td>
</tr>
<tr>
<td valign="top" align="left">Ontologies</td>
<td/>
<td/>
<td valign="top" align="center">83</td>
<td valign="top" align="center">440</td>
<td valign="top" align="center">523</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;UMLS</td>
<td valign="top" align="left">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">11</td>
<td valign="top" align="center">70</td>
<td valign="top" align="center">81</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;SNOMED-CT</td>
<td valign="top" align="left">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">67</td>
<td valign="top" align="center">368</td>
<td valign="top" align="center">435</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;ICD-10</td>
<td valign="top" align="left">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">5</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">7</td>
</tr>
<tr>
<td valign="top" align="left">Embedding models</td>
<td/>
<td/>
<td valign="top" align="center">171</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">171</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;MIMIC-II</td>
<td valign="top" align="left">w2v, size = 100, window = 5,<break/> min_count = 15, workers = 4</td>
<td valign="top" align="center">0.57</td>
<td valign="top" align="center">33</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">33</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;MIMIC-II</td>
<td valign="top" align="left">w2v, size = 400, window = 5,<break/> min_count = 15, workers = 4</td>
<td valign="top" align="center">0.47</td>
<td valign="top" align="center">40</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">40</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;MIMIC-III</td>
<td valign="top" align="left">w2v, size = 100, window = 5,<break/> min_count = 15, workers = 4</td>
<td valign="top" align="center">0.66</td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">4</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;MIMIC-III</td>
<td valign="top" align="left">w2v, size = 400, window = 5,<break/> min_count = 15, workers = 4</td>
<td valign="top" align="center">0.47</td>
<td valign="top" align="center">12</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">12</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;MIMIC-III</td>
<td valign="top" align="left">w2v, size = 300, window = 10,<break/> min_count = 5, workers = 16</td>
<td valign="top" align="center">0.44</td>
<td valign="top" align="center">26</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">26</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;MIMIC-III</td>
<td valign="top" align="left">FastText, size = 300, window = 10,<break/> min_count = 5</td>
<td valign="top" align="center">0.93</td>
<td valign="top" align="center">30</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">30</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;CRIS (SMI)</td>
<td valign="top" align="left">w2v, size = 300, window = 10,<break/> min_count = 5</td>
<td valign="top" align="center">0.69</td>
<td valign="top" align="center">16</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">16</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;PubMed</td>
<td valign="top" align="left">w2v, size = 200, window = 5</td>
<td valign="top" align="center">0.73</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">10</td>
</tr>
</tbody>
</table>
</table-wrap>
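<p>The elbow thresholds in <xref ref-type="table" rid="T6">Table 6</xref> act as similarity cut-offs for the words returned by each embedding model. The exact procedure used to locate each elbow is not described in this section, so the sketch below shows only one common heuristic, stated as an assumption: the scores are sorted in descending order and the elbow is taken as the point lying farthest below the straight line joining the first and last scores. The function name <italic>elbow_threshold</italic> and the similarity values are hypothetical.</p>
<preformat><![CDATA[
```python
def elbow_threshold(similarities):
    """Return the similarity at the point farthest below the chord joining the first and last scores."""
    scores = sorted(similarities, reverse=True)
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three scores to locate an elbow")
    slope = (scores[-1] - scores[0]) / (n - 1)
    # Vertical gap between the chord and each score; the largest gap marks the elbow.
    gaps = [scores[0] + slope * i - s for i, s in enumerate(scores)]
    return scores[gaps.index(max(gaps))]


# Hypothetical similarity scores for words near "pain"; not output from any model in Table 6.
neighbours = {
    "ache": 0.90, "soreness": 0.80, "discomfort": 0.72, "cramp": 0.50,
    "nausea": 0.48, "cough": 0.47, "fever": 0.46, "rash": 0.45,
}
threshold = elbow_threshold(neighbours.values())
# Assumption: words at or above the threshold are retained as lexicon candidates.
kept = sorted(word for word, score in neighbours.items() if score >= threshold)
```
]]></preformat>
<p>Whether the word sitting exactly on the threshold is retained is a design choice; here it is kept.</p>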
<p>After compiling the words from all these sources, the total size of the lexicon was 935 words (including duplicates and 57 misspellings), with 35% of them being unigrams and 65% bigrams. The most frequently occurring words in the final lexicon were pain (<italic>n</italic> = 46), discomfort (<italic>n</italic> = 10), headache (<italic>n</italic> = 8), soreness (<italic>n</italic> = 8), and pains/painful/ache/backache (<italic>n</italic> = 7 each). <xref ref-type="table" rid="T7">Table 7</xref> shows the coverage of the lexicon at this stage.</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Lexicon coverage.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Lexicon source</bold></th>
<th valign="top" align="center"><bold>No. of unique terms</bold></th>
<th valign="top" align="center"><bold>Total no. of terms</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Literature</td>
<td valign="top" align="center">218</td>
<td valign="top" align="center">241</td>
</tr>
<tr>
<td valign="top" align="left">Ontologies</td>
<td valign="top" align="center">291</td>
<td valign="top" align="center">523</td>
</tr>
<tr>
<td valign="top" align="left">Embeddings</td>
<td valign="top" align="center">68</td>
<td valign="top" align="center">171</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The Venn diagrams of the unique terms are shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. A total of six terms are shared by all three sources, and the largest pairwise overlap (54 terms) is between the literature and the ontologies. No term is shared by all three ontologies; the largest overlap (27 terms) is between SNOMED-CT and UMLS. ICD-10 and UMLS do not overlap because ICD-10 consists mostly of three-token terms, whereas the terms from all sources were limited to at most two tokens. For example, ICD-10 contains terms such as pain in limb, pain in throat, and pain in joints, rather than limb pain, throat pain, and joint pain. There was no overlap at all between the different embedding models. A comparison of the two MIMIC models (MIMIC-II and MIMIC-III) showed that they generated unique terms with minimal overlap, thereby justifying the use of both versions.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Venn diagram of unique terms generated from the different sources <bold>(A)</bold>, different ontologies <bold>(B)</bold>, and different embedding models <bold>(C)</bold>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdgth-03-778305-g0003.tif"/>
</fig>
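<p>The overlaps summarized in the Venn diagrams are set intersections over exact term strings. A minimal sketch is shown below; the helper names and the small term sets are hypothetical and chosen only to show why a three-token phrase such as pain in limb does not match limb pain.</p>
<preformat><![CDATA[
```python
from itertools import combinations


def pairwise_overlap(sources):
    """Return {(name_a, name_b): shared_terms} for every pair of named term sets."""
    return {
        (a, b): sources[a].intersection(sources[b])
        for a, b in combinations(sorted(sources), 2)
    }


def shared_by_all(sources):
    """Return the terms present in every source."""
    return set.intersection(*sources.values())


# Hypothetical miniature term sets, not the actual ontology extracts.
ontologies = {
    "ICD-10": {"pain in limb"},
    "UMLS": {"limb pain", "headache"},
    "SNOMED-CT": {"limb pain", "headache", "chest pain"},
}
pairs = pairwise_overlap(ontologies)
common = shared_by_all(ontologies)
```
]]></preformat>
<p>Exact string matching keeps the comparison simple, at the cost of treating reordered phrases as different terms.</p>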
<p>After post-processing to remove duplicates, punctuation and symbols, and words of fewer than four characters, the lexicon was validated by two clinicians, leading to a final size of 382 terms (<xref ref-type="fig" rid="F4">Figure 4</xref>).</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Distribution of terms within pain lexicon.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdgth-03-778305-g0004.tif"/>
</fig>
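<p>The post-processing steps (removal of duplicates, punctuation and symbols, and short words) can be sketched as follows. Several details are assumptions made for illustration rather than the released code: terms are lower-cased before comparison, the four-character minimum is applied to the whole term, and the function name <italic>clean_lexicon</italic> and the sample terms are hypothetical.</p>
<preformat><![CDATA[
```python
import string


def clean_lexicon(terms, min_chars=4):
    """Lower-case, strip punctuation, drop short terms and duplicates, keeping first-seen order."""
    strip_table = str.maketrans("", "", string.punctuation)
    seen, cleaned = set(), []
    for term in terms:
        # Remove punctuation, then collapse any repeated whitespace.
        normalized = " ".join(term.lower().translate(strip_table).split())
        if len(normalized) < min_chars or normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned


# Hypothetical raw candidate terms.
raw_terms = ["Pain", "pain", "ache!", "ow", "Chest  Pain."]
cleaned_terms = clean_lexicon(raw_terms)
```
]]></preformat>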
<p>The final pain lexicon and the code to generate the embedding models are openly available on GitHub<xref ref-type="fn" rid="fn0004"><sup>4</sup></xref> and will also be added to other ontology collections such as BioPortal<xref ref-type="fn" rid="fn0005"><sup>5</sup></xref>.</p>
<p>Some patterns were identified within the lexicon, which enabled the generation of a shorter list of pain terms that captured all the other terms within the patterns. For instance, the word &#x0201C;pain&#x0201D; captures &#x0201C;chest pain&#x0201D; and &#x0201C;burning pain,&#x0201D; and &#x0201C;ache&#x0201D; captures &#x0201C;headache,&#x0201D; &#x0201C;belly ache,&#x0201D; etc. Terms such as &#x0201C;chest pain,&#x0201D; &#x0201C;head discomfort,&#x0201D; and &#x0201C;aching muscles&#x0201D; follow a pattern of &#x0003C;anatomy&#x0003E; followed by &#x0003C;pain term&#x0003E; or vice versa; terms like &#x0201C;burning pain&#x0201D; and &#x0201C;chronic pain&#x0201D; follow a pattern of &#x0003C;quality term&#x0003E; &#x0003C;pain term&#x0003E;; and some combine quality and anatomy, such as &#x0201C;chronic back pain,&#x0201D; which follows a pattern of &#x0003C;quality term&#x0003E; &#x0003C;anatomy term&#x0003E; &#x0003C;pain term&#x0003E;.</p>
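<p>These patterns can be made concrete by labelling each token of a term against small word lists. The sketch below is illustrative only: the three word lists are hypothetical and contain just the examples mentioned above, and the function name <italic>term_pattern</italic> is not part of the released code.</p>
<preformat><![CDATA[
```python
# Hypothetical miniature word lists built from the examples in the text.
PAIN_TERMS = {"pain", "ache", "aching", "discomfort"}
ANATOMY_TERMS = {"chest", "head", "back", "muscles"}
QUALITY_TERMS = {"burning", "chronic"}


def term_pattern(term):
    """Label each token as anatomy, quality term, or pain term and return the resulting pattern."""
    labels = []
    for token in term.lower().split():
        if token in PAIN_TERMS:
            labels.append("pain term")
        elif token in ANATOMY_TERMS:
            labels.append("anatomy")
        elif token in QUALITY_TERMS:
            labels.append("quality term")
        else:
            labels.append("other")
    return " + ".join(labels)


patterns = {term: term_pattern(term) for term in ["chest pain", "aching muscles", "chronic back pain"]}
```
]]></preformat>
<p>Here &#x0201C;chest pain&#x0201D; is labelled anatomy followed by pain term, &#x0201C;aching muscles&#x0201D; shows the reverse order, and &#x0201C;chronic back pain&#x0201D; combines all three labels.</p>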
<p>A frequency count of some other common pain-related terms [using the wildcard character (%) to capture any words containing these terms] was conducted on a cohort of SMI patients within the CRIS EHR documents. The top 13 terms are listed in <xref ref-type="table" rid="T8">Table 8</xref>.</p>
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption><p>Top 13 common pain-related terms within a cohort of patients (<italic>n</italic> = 57,008) in the CRIS database.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Keyword</bold></th>
<th valign="top" align="center"><bold>Percentage of entire cohort (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">%ache%</td>
<td valign="top" align="center">54</td>
</tr>
<tr>
<td valign="top" align="left">%pain%</td>
<td valign="top" align="center">36</td>
</tr>
<tr>
<td valign="top" align="left">%burn%</td>
<td valign="top" align="center">7</td>
</tr>
<tr>
<td valign="top" align="left">%sore%</td>
<td valign="top" align="center">3</td>
</tr>
<tr>
<td valign="top" align="left">%algia%</td>
<td valign="top" align="center">&#x0003C; 1</td>
</tr>
<tr>
<td valign="top" align="left">%spasm%</td>
<td valign="top" align="center">&#x0003C; 1</td>
</tr>
<tr>
<td valign="top" align="left">%dynia%</td>
<td valign="top" align="center">&#x0003C; 1</td>
</tr>
<tr>
<td valign="top" align="left">%algesia</td>
<td valign="top" align="center">&#x0003C; 1</td>
</tr>
<tr>
<td valign="top" align="left">colic%</td>
<td valign="top" align="center">&#x0003C; 1</td>
</tr>
<tr>
<td valign="top" align="left">hurt%</td>
<td valign="top" align="center">&#x0003C; 1</td>
</tr>
<tr>
<td valign="top" align="left">sciatic%</td>
<td valign="top" align="center">&#x0003C; 1</td>
</tr>
<tr>
<td valign="top" align="left">tender%</td>
<td valign="top" align="center">&#x0003C; 1</td>
</tr>
<tr>
<td valign="top" align="left">cramp%</td>
<td valign="top" align="center">&#x0003C; 1</td>
</tr>
</tbody>
</table>
</table-wrap>
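<p>The percent sign in <xref ref-type="table" rid="T8">Table 8</xref> is the SQL LIKE wildcard, so a keyword such as %ache% matches any word containing &#x0201C;ache&#x0201D; while hurt% matches only words beginning with &#x0201C;hurt.&#x0201D; The sketch below reproduces this kind of count with an in-memory SQLite table. It rests on several assumptions: the schema, the tokenization, the definition of the percentage as the share of patients with one or more matching words, and the notes themselves are all hypothetical and do not reflect the CRIS database structure.</p>
<preformat><![CDATA[
```python
import re
import sqlite3


def keyword_percentages(notes, patterns):
    """Percentage of patients with one or more words matching each SQL LIKE pattern."""
    patients = {patient_id for patient_id, _ in notes}
    if not patients:
        raise ValueError("cohort is empty")
    # One row per (patient, word) so that LIKE patterns apply to single words.
    rows = [
        (patient_id, token)
        for patient_id, text in notes
        for token in re.findall(r"[A-Za-z]+", text)
    ]
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE tokens (patient_id INTEGER, token TEXT)")
        conn.executemany("INSERT INTO tokens VALUES (?, ?)", rows)
        query = "SELECT COUNT(DISTINCT patient_id) FROM tokens WHERE token LIKE ?"
        return {
            pattern: 100 * conn.execute(query, (pattern,)).fetchone()[0] / len(patients)
            for pattern in patterns
        }
    finally:
        conn.close()


# Hypothetical notes as (patient_id, text) pairs.
notes = [
    (1, "Complains of headache"),
    (2, "Hurting after fall; backache"),
    (3, "No complaints"),
]
percentages = keyword_percentages(notes, ["%ache%", "hurt%"])
```
]]></preformat>
<p>SQLite compares ASCII letters case-insensitively in LIKE by default, so &#x0201C;Hurting&#x0201D; matches hurt% without explicit lower-casing.</p>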
</sec>
<sec>
<title>Validation of the Lexicon</title>
<p>Two forms of validation were carried out on the lexicon&#x02014;validation by two clinicians, and validation against an existing ontology of pain terms.</p>
<p>Upon validation by the clinicians, 11 new terms were added to the lexicon and 39 terms were removed. Words were removed when they were too ambiguous or non-specific (such as fatigue and complaints) or did not indicate pain per se (such as itchiness, nausea, paresthesia, and tightness). Other examples of removed terms are algophobia and bloating. The terms added included acronyms (such as LBP for lower back pain), pain education, and antalgic gait.</p>
<p>The Experimental Factor Ontology (<xref ref-type="bibr" rid="B21">21</xref>) contains a pain sub-section consisting of 26 pain-related terms. Upon comparison with our lexicon, it was found that 18 (69%) of the terms within the Experimental Factor Ontology matched. Most of the terms that did not match had three tokens and would therefore have been excluded from our lexicon. The eight unmatched terms were limb pain, renal colic, pain in abdomen, multisite chronic pain, lower limb pain, episodic abdominal cramps, chronic widespread pain, and abdominal cramps. However, all the pain-related words (such as cramp, colic, ache, etc.) did match our lexicon, indicating that the synonyms of pain were captured.</p>
<p>The Medical Subject Headings (MeSH) contain 63 pain-related MeSH terms and 941 pain-related entry terms. Upon comparison with our lexicon, an overlap of 56 terms (89%) was found with the MeSH terms and 649 terms (69%) with the entry terms. The MeSH terms that did not match (11%, i.e., seven terms) were not explicitly related to pain, and included terms such as agnosia [a sensory disorder where a person is unable to process sensory information (<xref ref-type="bibr" rid="B49">49</xref>)], pramoxine (a topical anesthetic), and generic somatosensory disorders. The entry terms that did not match (33%, i.e., 307 terms) consisted of drug names (2% of total terms, 5% of non-matched terms) such as Pramocaine and Balsabit, disorders and syndromes (20% of total terms, 62% of non-matched terms) such as visual disorientation syndrome and Patellofemoral syndrome, generic terms (10% of total terms, 31% of non-matched terms) such as physical suffering, and tests (1% of total terms, 3% of non-matched terms) such as Formalin test. The pain-specific terms within this list were mainly pain (50% of total terms), -algia (8%), ache (7%), -dynia (1%), and -algesia (1%). Two new pain terms discovered within this list were &#x0201C;catch&#x0201D; and &#x0201C;twinge,&#x0201D; which might reference pain in the right context but could also lead to false positives when used in NLP tasks to identify mentions of pain.</p>
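<p>The coverage figures in this section are the share of reference terms that also appear in the lexicon, for example 18 of 26 terms (69%) for the Experimental Factor Ontology and 56 of 63 (89%) for the MeSH terms. A minimal sketch of such a comparison is shown below, assuming case-insensitive exact matching; the function name <italic>coverage</italic> and the term lists are hypothetical.</p>
<preformat><![CDATA[
```python
def coverage(reference_terms, lexicon):
    """Return (matched count, percentage of reference terms found in the lexicon)."""
    reference = {term.lower() for term in reference_terms}
    known = {term.lower() for term in lexicon}
    if not reference:
        raise ValueError("reference vocabulary is empty")
    matched = reference.intersection(known)
    return len(matched), round(100 * len(matched) / len(reference))


# Hypothetical miniature vocabularies.
reference_terms = ["Limb pain", "Colic", "Ache", "Cramp"]
lexicon_terms = ["ache", "colic", "cramp", "pain"]
matched_count, percent = coverage(reference_terms, lexicon_terms)
```
]]></preformat>
<p>In this toy example the two-token term &#x0201C;limb pain&#x0201D; is missed even though &#x0201C;pain&#x0201D; is in the lexicon, which is the same effect that leaves multi-token reference terms unmatched.</p>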
</sec>
</sec>
<sec id="s4">
<title>Discussion and Conclusion</title>
<p>When looking at how pain was mentioned in the different text sources, most mentions fell into similar themes, i.e., quality of pain, feelings/experiences associated with the pain, medications and other measures for pain relief, and mentions of different body parts associated with the pain. The mentions within MIMIC-III were geared more toward pain relief, which is likely due to the data being from critical care units. In contrast, CRIS covered the feelings and experiences associated with pain. It was hard to get a good sense of the Twitter mentions owing to the short length of the strings, while Reddit was a lot more detailed around patient experiences and pain relief remedies.</p>
<p>The information gained from this exploration helped decide the sources for the development of the pain lexicon. Embedding models built using MIMIC-II/III and CRIS databases were used. The final lexicon consisted of 382 pain-related terms. Embedding models built using Twitter (<xref ref-type="bibr" rid="B50">50</xref>) and Reddit (<xref ref-type="bibr" rid="B32">32</xref>) data were excluded from the final lexicon because the terms they generated were not very relevant to &#x0201C;pain&#x0201D;: the Twitter model produced terms such as brain, anger, patience, and habit, and the Reddit model produced words such as apartment, principal, and goal. The Venn diagrams demonstrated the benefits of including different sources, as each source provided unique terms, thereby enriching the lexicon for pain. CRIS and MIMIC contributed 68 unique terms that are used in &#x0201C;real-life&#x0201D; settings to the final lexicon. These mostly consisted of commonly used words like soreness, pain, and aches. Many of these mentions are potentially based on what patients have said, which could also explain why there are fewer of them. The literature and ontologies have a greater variety of words, as they either use more technical terms or enumerate every term and concept associated with pain. Apart from helping build the lexicon, this exploration will also help further planning for development of NLP applications and deciding on what attributes around pain might be of interest for general and clinical research purposes.</p>
<p>The final lexicon has been validated by two clinicians and compared with the Experimental Factor Ontology, which consisted of 26 pain-related terms, and with MeSH headings and terms (63 pain-related heading terms and 941 pain-related entry terms). The majority of the pain-related terms from both these sources matched those included within the lexicon. The terms that did not match were names of disorders/syndromes that may have pain as a symptom, and other more generic words that could lead to false positives if used in downstream NLP tasks.</p>
<p>This study has several limitations. Most importantly, only a small sample of documents was reviewed for the exploration step. Reviewing a larger sample might have been more representative of the text sources and might have revealed deeper insights. The process of exploration of pain concepts within different sources also highlighted the ambiguous nature of a word like pain, and the different contexts that could contain these mentions (metaphorical or clinical mentions). These factors are important to bear in mind when attempting to use such ambiguous terms in NLP tasks as they could lead to false positive results.</p>
<p>The final lexicon, and the code used to generate the embedding models, have been made openly available. This final lexicon will be used in downstream tasks such as building an NLP application to extract mentions of pain from clinical notes which will in turn help answer important research questions around pain and mental health. The approach followed for the development of this lexicon could be replicated for other clinical terms. Future work includes patient engagement in order to elicit feedback on the terms that have been included in the lexicon. In addition to this, the lexicon will be formalized for submission to portals, such as BioPortal, for wider use by the community.</p>
</sec>
<sec sec-type="data-availability" id="s5">
<title>Data Availability Statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: <ext-link ext-link-type="uri" xlink:href="https://github.com/jayachaturvedi/pain_lexicon">https://github.com/jayachaturvedi/pain_lexicon</ext-link>.</p>
</sec>
<sec id="s6">
<title>Author Contributions</title>
<p>The idea was conceived by JC, AR, and SV. JC conducted the data analysis and drafted the manuscript. AR and SV provided guidance in the design and interpretation of results. AM provided scripts and guidance on building some of the embedding models. All authors commented on drafts of the manuscript and approved the final version.</p>
</sec>
<sec sec-type="funding-information" id="s7">
<title>Funding</title>
<p>AR was funded by Health Data Research UK, an initiative funded by UK Research and Innovation, Department of Health and Social Care (England) and the devolved administrations, and leading medical research charities. AR receives salary support from the National Institute for Health Research (NIHR) Biomedical Research Center at South London and Maudsley NHS Foundation Trust and King&#x00027;s College London. JC was supported by the KCL funded Center for Doctoral Training (CDT) in Data-Driven Health. AM was funded by Takeda California, Inc. This paper represents independent research part funded by the National Institute for Health Research (NIHR) Biomedical Research Center at South London and Maudsley NHS Foundation Trust and King&#x00027;s College London. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. This study received funding from Health Data Research UK, KCL funded CDT in Data-Driven Health, and Takeda California, Inc. The funders were not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.</p>
</sec>
<sec id="s8">
<title>Author Disclaimer</title>
<p>The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ack><p>This work uses data provided by patients and collected by the NHS as part of their care and support. An application for access to the Clinical Record Interactive Search (CRIS) database for this project was submitted and approved by the CRIS Oversight Committee (Oxford C Research Ethics Committee, reference 18/SC/0372). The authors are also grateful to the two clinicians, Dr. Robert Stewart and Dr. Brendon Stubbs for taking the time to review the terms within the lexicon and providing valuable feedback, as well as Dr. Natalia Viani for providing access to some of her python scripts and embedding models.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<label>1.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Heintzelman</surname> <given-names>NH</given-names></name> <name><surname>Taylor</surname> <given-names>RJ</given-names></name> <name><surname>Simonsen</surname> <given-names>L</given-names></name> <name><surname>Lustig</surname> <given-names>R</given-names></name> <name><surname>Anderko</surname> <given-names>D</given-names></name> <name><surname>Haythornthwaite</surname> <given-names>JA</given-names></name> <etal/></person-group>. <article-title>Longitudinal analysis of pain in patients with metastatic prostate cancer using natural language processing of medical record text</article-title>. <source>J Am Med Inform Assoc.</source> (<year>2013</year>) <volume>20</volume>:<fpage>898</fpage>&#x02013;<lpage>905</lpage>. <pub-id pub-id-type="doi">10.1136/amiajnl-2012-001076</pub-id><pub-id pub-id-type="pmid">23144336</pub-id></citation></ref>
<ref id="B2">
<label>2.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Merlin</surname> <given-names>JS</given-names></name> <name><surname>Zinski</surname> <given-names>A</given-names></name> <name><surname>Norton</surname> <given-names>WE</given-names></name> <name><surname>Ritchie</surname> <given-names>CS</given-names></name> <name><surname>Saag</surname> <given-names>MS</given-names></name> <name><surname>Mugavero</surname> <given-names>MJ</given-names></name> <etal/></person-group>. <article-title>A conceptual framework for understanding chronic pain in patients with HIV</article-title>. <source>Pain Pract.</source> (<year>2014</year>) <volume>14</volume>:<fpage>207</fpage>&#x02013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1111/papr.12052</pub-id><pub-id pub-id-type="pmid">23551857</pub-id></citation></ref>
<ref id="B3">
<label>3.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Howard</surname> <given-names>R</given-names></name> <name><surname>Waljee</surname> <given-names>J</given-names></name> <name><surname>Brummett</surname> <given-names>C</given-names></name> <name><surname>Englesbe</surname> <given-names>M</given-names></name> <name><surname>Lee</surname> <given-names>J</given-names></name></person-group>. <article-title>Reduction in opioid prescribing through evidence-based prescribing guidelines</article-title>. <source>JAMA Surg.</source> (<year>2018</year>) <volume>153</volume>:<fpage>285</fpage>&#x02013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1001/jamasurg.2017.4436</pub-id><pub-id pub-id-type="pmid">29214318</pub-id></citation></ref>
<ref id="B4">
<label>4.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Groenewald</surname> <given-names>CB</given-names></name> <name><surname>Essner</surname> <given-names>BS</given-names></name> <name><surname>Wright</surname> <given-names>D</given-names></name> <name><surname>Fesinmeyer</surname> <given-names>MD</given-names></name> <name><surname>Palermo</surname> <given-names>TM</given-names></name></person-group>. <article-title>The economic costs of chronic pain among a cohort of treatment-seeking adolescents in the United States</article-title>. <source>J Pain.</source> (<year>2014</year>) <volume>15</volume>:<fpage>925</fpage>&#x02013;<lpage>33</lpage>. <pub-id pub-id-type="doi">10.1016/j.jpain.2014.06.002</pub-id><pub-id pub-id-type="pmid">24953887</pub-id></citation></ref>
<ref id="B5">
<label>5.</label>
<citation citation-type="web"><person-group person-group-type="author"><collab>Google</collab></person-group>. <source>Google Trends</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://trends.google.com/trends/explore?date=all&#x00026;q=%2Fm%2F062t2">https://trends.google.com/trends/explore?date=all&#x00026;q=%2Fm%2F062t2</ext-link> (accessed March 8, 2021).</citation>
</ref>
<ref id="B6">
<label>6.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stewart</surname> <given-names>R</given-names></name> <name><surname>Soremekun</surname> <given-names>M</given-names></name> <name><surname>Perera</surname> <given-names>G</given-names></name> <name><surname>Broadbent</surname> <given-names>M</given-names></name> <name><surname>Callard</surname> <given-names>F</given-names></name> <name><surname>Denis</surname> <given-names>M</given-names></name> <etal/></person-group>. <article-title>The South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLAM BRC) case register: development and descriptive data</article-title>. <source>BMC Psychiatry.</source> (<year>2009</year>) <volume>9</volume>:<fpage>51</fpage>&#x02013;<lpage>51</lpage>. <pub-id pub-id-type="doi">10.1186/1471-244X-9-51</pub-id><pub-id pub-id-type="pmid">19674459</pub-id></citation></ref>
<ref id="B7">
<label>7.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Velupillai</surname> <given-names>S</given-names></name> <name><surname>Suominen</surname> <given-names>H</given-names></name> <name><surname>Liakata</surname> <given-names>M</given-names></name> <name><surname>Roberts</surname> <given-names>A</given-names></name> <name><surname>Shah</surname> <given-names>AD</given-names></name> <name><surname>Morley</surname> <given-names>K</given-names></name> <etal/></person-group>. <article-title>Using clinical natural language processing for health outcomes research: overview and actionable suggestions for future advances</article-title>. <source>J Biomed Inform.</source> (<year>2018</year>) <volume>88</volume>:<fpage>11</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1016/j.jbi.2018.10.005</pub-id><pub-id pub-id-type="pmid">30368002</pub-id></citation></ref>
<ref id="B8">
<label>8.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mascio</surname> <given-names>A</given-names></name> <name><surname>Kraljevic</surname> <given-names>Z</given-names></name> <name><surname>Bean</surname> <given-names>D</given-names></name> <name><surname>Dobson</surname> <given-names>R</given-names></name> <name><surname>Stewart</surname> <given-names>R</given-names></name> <name><surname>Bendayan</surname> <given-names>R</given-names></name> <etal/></person-group>. <article-title>Comparative analysis of text classification approaches in electronic health records</article-title>. <source>arXiv.</source> (<year>2020</year>) arXiv:2005.06624. <pub-id pub-id-type="doi">10.18653/v1/2020.bionlp-1.9</pub-id></citation>
</ref>
<ref id="B9">
<label>9.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Foufi</surname> <given-names>V</given-names></name> <name><surname>Timakum</surname> <given-names>T</given-names></name> <name><surname>Gaudet-Blavignac</surname> <given-names>C</given-names></name> <name><surname>Lovis</surname> <given-names>C</given-names></name> <name><surname>Song</surname> <given-names>M</given-names></name></person-group>. <article-title>Mining of textual health information from Reddit: analysis of chronic diseases with extracted entities and their relations</article-title>. <source>J Med Internet Res.</source> (<year>2019</year>) <volume>21</volume>:<fpage>e12876</fpage>. <pub-id pub-id-type="doi">10.2196/12876</pub-id><pub-id pub-id-type="pmid">31199327</pub-id></citation></ref>
<ref id="B10">
<label>10.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Marshall</surname> <given-names>SA</given-names></name> <name><surname>Yang</surname> <given-names>CC</given-names></name> <name><surname>Ping</surname> <given-names>Q</given-names></name> <name><surname>Zhao</surname> <given-names>M</given-names></name> <name><surname>Avis</surname> <given-names>NE</given-names></name> <name><surname>Ip</surname> <given-names>EH</given-names></name></person-group>. <article-title>Symptom clusters in women with breast cancer: an analysis of data from social media and a research study</article-title>. <source>Qual Life Res.</source> (<year>2016</year>) <volume>25</volume>:<fpage>547</fpage>&#x02013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.1007/s11136-015-1156-7</pub-id><pub-id pub-id-type="pmid">26476836</pub-id></citation></ref>
<ref id="B11">
<label>11.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sharma</surname> <given-names>R</given-names></name> <name><surname>Wigginton</surname> <given-names>B</given-names></name> <name><surname>Meurk</surname> <given-names>C</given-names></name> <name><surname>Ford</surname> <given-names>P</given-names></name> <name><surname>Gartner</surname> <given-names>CE</given-names></name></person-group>. <article-title>Motivations and limitations associated with vaping among people with mental illness: a qualitative analysis of Reddit discussions</article-title>. <source>Int J Environ Res Public Health.</source> (<year>2017</year>) <volume>14</volume>:<fpage>7</fpage>. <pub-id pub-id-type="doi">10.3390/ijerph14010007</pub-id><pub-id pub-id-type="pmid">28025516</pub-id></citation></ref>
<ref id="B12">
<label>12.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chou</surname> <given-names>WY</given-names></name> <name><surname>Prestin</surname> <given-names>A</given-names></name> <name><surname>Kunath</surname> <given-names>S</given-names></name></person-group>. <article-title>Obesity in social media: a mixed methods analysis</article-title>. <source>Transl Behav Med.</source> (<year>2014</year>) <volume>4</volume>:<fpage>314</fpage>&#x02013;<lpage>23</lpage>. <pub-id pub-id-type="doi">10.1007/s13142-014-0256-1</pub-id><pub-id pub-id-type="pmid">25264470</pub-id></citation></ref>
<ref id="B13">
<label>13.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Johnson</surname> <given-names>GJ</given-names></name> <name><surname>Ambrose</surname> <given-names>PJ</given-names></name></person-group>. <article-title>Neo-tribes: the power and potential of online communities in health care</article-title>. <source>Commun ACM.</source> (<year>2006</year>) <volume>49</volume>:<fpage>107</fpage>&#x02013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1145/1107458.1107463</pub-id></citation>
</ref>
<ref id="B14">
<label>14.</label>
<citation citation-type="web"><person-group person-group-type="author"><collab>Reddit now has as many users as Twitter, and far higher engagement rates</collab></person-group>. <source>Social Media Today</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.socialmediatoday.com/news/reddit-now-has-as-many-users-as-twitter-and-far-higher-engagement-rates/521789/">https://www.socialmediatoday.com/news/reddit-now-has-as-many-users-as-twitter-and-far-higher-engagement-rates/521789/</ext-link> (accessed March 8, 2021).</citation>
</ref>
<ref id="B15">
<label>15.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boot</surname> <given-names>AB</given-names></name> <name><surname>Tjong Kim Sang</surname> <given-names>E</given-names></name> <name><surname>Dijkstra</surname> <given-names>K</given-names></name> <name><surname>Zwaan</surname> <given-names>RA</given-names></name></person-group>. <article-title>How character limit affects language usage in tweets</article-title>. <source>Palgrave Commun.</source> (<year>2019</year>) <volume>5</volume>:<fpage>1</fpage>&#x02013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1057/s41599-019-0280-3</pub-id></citation>
</ref>
<ref id="B16">
<label>16.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>De Choudhury</surname> <given-names>M</given-names></name> <name><surname>Gamon</surname> <given-names>M</given-names></name> <name><surname>Counts</surname> <given-names>S</given-names></name> <name><surname>Horvitz</surname> <given-names>E</given-names></name></person-group>. <article-title>Predicting depression via social media</article-title>. In: <source>Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media</source>. <publisher-loc>Cambridge, MA</publisher-loc> (<year>2013</year>) p. <fpage>1</fpage>&#x02013;<lpage>10</lpage>.</citation>
</ref>
<ref id="B17">
<label>17.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>De Choudhury</surname> <given-names>M</given-names></name> <name><surname>Kiciman</surname> <given-names>E</given-names></name> <name><surname>Dredze</surname> <given-names>M</given-names></name> <name><surname>Coppersmith</surname> <given-names>G</given-names></name> <name><surname>Kumar</surname> <given-names>M</given-names></name></person-group>. <article-title>Discovering shifts to suicidal ideation from mental health content in social media</article-title>. <source>Proc SIGCHI Conf Hum Factor Comput Syst.</source> (<year>2016</year>) <volume>2016</volume>:<fpage>2098</fpage>&#x02013;<lpage>110</lpage>. <pub-id pub-id-type="doi">10.1145/2858036.2858207</pub-id><pub-id pub-id-type="pmid">29082385</pub-id></citation></ref>
<ref id="B18">
<label>18.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Coppersmith</surname> <given-names>G</given-names></name> <name><surname>Leary</surname> <given-names>R</given-names></name> <name><surname>Wood</surname> <given-names>T</given-names></name> <name><surname>Whyne</surname> <given-names>E</given-names></name></person-group>. <article-title>Quantifying suicidal ideation via language usage on social media</article-title>. <source>Paper presented at: Joint Statistics Meetings Proceedings, Statistical Computing Section (JSM).</source> (<year>2015</year>) (<publisher-loc>Seattle, WA</publisher-loc>).</citation>
</ref>
<ref id="B19">
<label>19.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carlson</surname> <given-names>LA</given-names></name> <name><surname>Hooten</surname> <given-names>WM</given-names></name></person-group>. <article-title>Pain&#x02014;linguistics and natural language processing</article-title>. <source>Mayo Clin Proc Innov Qual Outcomes.</source> (<year>2020</year>) <volume>4</volume>:<fpage>346</fpage>&#x02013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1016/j.mayocpiqo.2020.01.005</pub-id><pub-id pub-id-type="pmid">32542226</pub-id></citation></ref>
<ref id="B20">
<label>20.</label>
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Velupillai</surname> <given-names>S</given-names></name> <name><surname>Mowery</surname> <given-names>DL</given-names></name> <name><surname>Conway</surname> <given-names>M</given-names></name> <name><surname>Hurdle</surname> <given-names>J</given-names></name> <name><surname>Kious</surname> <given-names>B</given-names></name></person-group>. <article-title>Vocabulary development to support information extraction of substance abuse from psychiatry notes</article-title>. In: <source>Proceedings of the 15th Workshop on Biomedical Natural Language Processing</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name> (<year>2016</year>). p. <fpage>92</fpage>&#x02013;<lpage>101</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.aclweb.org/anthology/W16-2912">https://www.aclweb.org/anthology/W16-2912</ext-link> (accessed February 5, 2021).</citation>
</ref>
<ref id="B21">
<label>21.</label>
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Koscielny</surname> <given-names>G</given-names></name> <name><surname>Ison</surname> <given-names>G</given-names></name> <name><surname>Jupp</surname> <given-names>S</given-names></name> <name><surname>Parkinson</surname> <given-names>H</given-names></name> <name><surname>Pendlington</surname> <given-names>ZM</given-names></name> <name><surname>Williams</surname> <given-names>E</given-names></name> <etal/></person-group>. <article-title>Experimental Factor Ontology.</article-title> Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.ebi.ac.uk/ols/ontologies/efo">https://www.ebi.ac.uk/ols/ontologies/efo</ext-link> (accessed September 8, 2021).</citation>
</ref>
<ref id="B22">
<label>22.</label>
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Benton</surname> <given-names>A</given-names></name> <name><surname>Coppersmith</surname> <given-names>G</given-names></name> <name><surname>Dredze</surname> <given-names>M</given-names></name></person-group>. <article-title>Ethical research protocols for social media health research</article-title>. In: <source>Proceedings of the First ACL Workshop on Ethics in Natural Language Processing</source>. <publisher-loc>Valencia</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name> (<year>2017</year>). p. <fpage>94</fpage>&#x02013;<lpage>102</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.aclweb.org/anthology/W17-1612">https://www.aclweb.org/anthology/W17-1612</ext-link> (accessed March 18, 2021).</citation>
</ref>
<ref id="B23">
<label>23.</label>
<citation citation-type="web"><source>About Twitter&#x00027;s APIs</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://help.twitter.com/en/rules-and-policies/twitter-api">https://help.twitter.com/en/rules-and-policies/twitter-api</ext-link> (accessed March 16, 2021).</citation>
</ref>
<ref id="B24">
<label>24.</label>
<citation citation-type="web"><person-group person-group-type="author"><collab>NIHR Biomedical Research Centre</collab></person-group>. <source>Clinical Record Interactive Search (CRIS).</source> (<year>2018</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.maudsleybrc.nihr.ac.uk/facilities/clinical-record-interactive-search-cris/">https://www.maudsleybrc.nihr.ac.uk/facilities/clinical-record-interactive-search-cris/</ext-link> (accessed January 11, 2021).</citation>
</ref>
<ref id="B25">
<label>25.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Johnson</surname> <given-names>AEW</given-names></name> <name><surname>Pollard</surname> <given-names>TJ</given-names></name> <name><surname>Shen</surname> <given-names>L</given-names></name> <name><surname>Lehman</surname> <given-names>LH</given-names></name> <name><surname>Feng</surname> <given-names>M</given-names></name> <name><surname>Ghassemi</surname> <given-names>M</given-names></name> <etal/></person-group>. <article-title>MIMIC-III, a freely accessible critical care database</article-title>. <source>Sci Data.</source> (<year>2016</year>) <volume>3</volume>:<fpage>160035</fpage>. <pub-id pub-id-type="doi">10.1038/sdata.2016.35</pub-id><pub-id pub-id-type="pmid">27219127</pub-id></citation></ref>
<ref id="B26">
<label>26.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Perera</surname> <given-names>G</given-names></name> <name><surname>Broadbent</surname> <given-names>M</given-names></name> <name><surname>Callard</surname> <given-names>F</given-names></name> <name><surname>Chang</surname> <given-names>C-K</given-names></name> <name><surname>Downs</surname> <given-names>J</given-names></name> <name><surname>Dutta</surname> <given-names>R</given-names></name> <etal/></person-group>. <article-title>Cohort profile of the South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLaM BRC) Case Register: current status and recent enhancement of an Electronic Mental Health Record-derived data resource</article-title>. <source>BMJ Open.</source> (<year>2016</year>) <volume>6</volume>:<fpage>e008721</fpage>. <pub-id pub-id-type="doi">10.1136/bmjopen-2015-008721</pub-id><pub-id pub-id-type="pmid">26932138</pub-id></citation></ref>
<ref id="B27">
<label>27.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nuthakki</surname> <given-names>S</given-names></name> <name><surname>Neela</surname> <given-names>S</given-names></name> <name><surname>Gichoya</surname> <given-names>JW</given-names></name> <name><surname>Purkayastha</surname> <given-names>S</given-names></name></person-group>. <article-title>Natural language processing of MIMIC-III clinical notes for identifying diagnosis and procedures with neural networks</article-title>. <source>arXiv</source>. (<year>2019</year>) arXiv:1912.12397.</citation>
</ref>
<ref id="B28">
<label>28.</label>
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Boe</surname> <given-names>B</given-names></name></person-group>. <article-title>PRAW: The Python Reddit API Wrapper.</article-title> (<year>2012</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://github.com/praw-dev/praw/">https://github.com/praw-dev/praw/</ext-link> (accessed March 8, 2021).</citation>
</ref>
<ref id="B29">
<label>29.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bian</surname> <given-names>J</given-names></name> <name><surname>Topaloglu</surname> <given-names>U</given-names></name> <name><surname>Yu</surname> <given-names>F</given-names></name></person-group>. <article-title>Towards large-scale Twitter mining for drug-related adverse events</article-title>. <source>SHB12.</source> (<year>2012</year>) <volume>2012</volume>:<fpage>25</fpage>&#x02013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.1145/2389707.2389713</pub-id><pub-id pub-id-type="pmid">28967001</pub-id></citation></ref>
<ref id="B30">
<label>30.</label>
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Roesslein</surname> <given-names>J</given-names></name></person-group>. <article-title>Tweepy: Twitter for Python!</article-title> (<year>2020</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://github.com/tweepy/tweepy">https://github.com/tweepy/tweepy</ext-link></citation>
</ref>
<ref id="B31">
<label>31.</label>
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Pennington</surname> <given-names>J</given-names></name> <name><surname>Socher</surname> <given-names>R</given-names></name> <name><surname>Manning</surname> <given-names>C</given-names></name></person-group>. <article-title>GloVe: global vectors for word representation</article-title>. In: <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>. <publisher-loc>Doha</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name> (<year>2014</year>). p. <fpage>1532</fpage>&#x02013;<lpage>43</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://aclweb.org/anthology/D14-1162">http://aclweb.org/anthology/D14-1162</ext-link> (accessed March 9, 2021).</citation>
</ref>
<ref id="B32">
<label>32.</label>
<citation citation-type="web"><source>Reddit Word Embeddings</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://kaggle.com/alaap29/reddit-word-embeddings">https://kaggle.com/alaap29/reddit-word-embeddings</ext-link> (accessed March 17, 2021).</citation>
</ref>
<ref id="B33">
<label>33.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dreisbach</surname> <given-names>C</given-names></name> <name><surname>Koleck</surname> <given-names>TA</given-names></name> <name><surname>Bourne</surname> <given-names>PE</given-names></name> <name><surname>Bakken</surname> <given-names>S</given-names></name></person-group>. <article-title>A systematic review of natural language processing and text mining of symptoms from electronic patient-authored text data</article-title>. <source>Int J Med Inform.</source> (<year>2019</year>) <volume>125</volume>:<fpage>37</fpage>&#x02013;<lpage>46</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijmedinf.2019.02.008</pub-id><pub-id pub-id-type="pmid">30914179</pub-id></citation></ref>
<ref id="B34">
<label>34.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khattak</surname> <given-names>FK</given-names></name> <name><surname>Jeblee</surname> <given-names>S</given-names></name> <name><surname>Pou-Prom</surname> <given-names>C</given-names></name> <name><surname>Abdalla</surname> <given-names>M</given-names></name> <name><surname>Meaney</surname> <given-names>C</given-names></name> <name><surname>Rudzicz</surname> <given-names>F</given-names></name> <etal/></person-group>. <article-title>A survey of word embeddings for clinical text</article-title>. <source>J Biomed Inform.</source> (<year>2019</year>) <volume>4</volume>:<fpage>100057</fpage>. <pub-id pub-id-type="doi">10.1016/j.yjbinx.2019.100057</pub-id><pub-id pub-id-type="pmid">34384583</pub-id></citation></ref>
<ref id="B35">
<label>35.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bodenreider</surname> <given-names>O</given-names></name></person-group>. <article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>. <source>Nucleic Acids Res.</source> (<year>2004</year>) <volume>32</volume>:<fpage>D267</fpage>&#x02013;<lpage>70</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkh061</pub-id><pub-id pub-id-type="pmid">14681409</pub-id></citation></ref>
<ref id="B36">
<label>36.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stearns</surname> <given-names>MQ</given-names></name> <name><surname>Price</surname> <given-names>C</given-names></name> <name><surname>Spackman</surname> <given-names>KA</given-names></name> <name><surname>Wang</surname> <given-names>AY</given-names></name></person-group>. <article-title>SNOMED clinical terms: overview of the development process and project status</article-title>. <source>Proc AMIA Symp.</source> (<year>2001</year>) <volume>2001</volume>:<fpage>662</fpage>&#x02013;<lpage>6</lpage>.<pub-id pub-id-type="pmid">11825268</pub-id></citation></ref>
<ref id="B37">
<label>37.</label>
<citation citation-type="journal"><person-group person-group-type="author"><collab>World Health Organization</collab></person-group>. <source>ICD-10: International Statistical Classification of Diseases and Related Health Problems: Tenth Revision. 2nd ed</source>. World Health Organization (<year>2004</year>). Spanish version, 1st edition published by PAHO as Publicaci&#x000F3;n Cient&#x000ED;fica 544.</citation>
</ref>
<ref id="B38">
<label>38.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mikolov</surname> <given-names>T</given-names></name> <name><surname>Chen</surname> <given-names>K</given-names></name> <name><surname>Corrado</surname> <given-names>G</given-names></name> <name><surname>Dean</surname> <given-names>J</given-names></name></person-group>. <article-title>Efficient estimation of word representations in vector space</article-title>. <source>arXiv</source>. (<year>2013</year>) arXiv:1301.3781.</citation>
</ref>
<ref id="B39">
<label>39.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Y</given-names></name> <name><surname>Liu</surname> <given-names>S</given-names></name> <name><surname>Afzal</surname> <given-names>N</given-names></name> <name><surname>Rastegar-Mojarad</surname> <given-names>M</given-names></name> <name><surname>Wang</surname> <given-names>L</given-names></name> <name><surname>Shen</surname> <given-names>F</given-names></name> <etal/></person-group>. <article-title>A comparison of word embeddings for the biomedical natural language processing</article-title>. <source>J Biomed Inform.</source> (<year>2018</year>) <volume>87</volume>:<fpage>12</fpage>&#x02013;<lpage>20</lpage>. <pub-id pub-id-type="doi">10.1016/j.jbi.2018.09.008</pub-id><pub-id pub-id-type="pmid">30217670</pub-id></citation></ref>
<ref id="B40">
<label>40.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ye</surname> <given-names>C</given-names></name> <name><surname>Fabbri</surname> <given-names>D</given-names></name></person-group>. <article-title>Extracting similar terms from multiple EMR-based semantic embeddings to support chart reviews</article-title>. <source>J Biomed Inform.</source> (<year>2018</year>) <volume>83</volume>:<fpage>63</fpage>&#x02013;<lpage>72</lpage>. <pub-id pub-id-type="doi">10.1016/j.jbi.2018.05.014</pub-id><pub-id pub-id-type="pmid">29793071</pub-id></citation></ref>
<ref id="B41">
<label>41.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Viani</surname> <given-names>N</given-names></name> <name><surname>Patel</surname> <given-names>R</given-names></name> <name><surname>Stewart</surname> <given-names>R</given-names></name> <name><surname>Velupillai</surname> <given-names>S</given-names></name></person-group>. <article-title>Generating positive psychosis symptom keywords from electronic health records.</article-title> In: Ria&#x000F1;o D, Wilk S, ten Teije A, editors. <source>Artificial Intelligence in Medicine</source> (Lecture Notes in Computer Science). <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name> (<year>2019</year>). p. <fpage>298</fpage>&#x02013;<lpage>303</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-21642-9_38</pub-id></citation>
</ref>
<ref id="B42">
<label>42.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Saeed</surname> <given-names>M</given-names></name> <name><surname>Villarroel</surname> <given-names>M</given-names></name> <name><surname>Reisner</surname> <given-names>AT</given-names></name> <name><surname>Clifford</surname> <given-names>G</given-names></name> <name><surname>Lehman</surname> <given-names>L-W</given-names></name> <name><surname>Moody</surname> <given-names>G</given-names></name> <etal/></person-group>. <article-title>Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database</article-title>. <source>Crit Care Med.</source> (<year>2011</year>) <volume>39</volume>:<fpage>952</fpage>&#x02013;<lpage>60</lpage>. <pub-id pub-id-type="doi">10.1097/CCM.0b013e31820a92c6</pub-id><pub-id pub-id-type="pmid">21283005</pub-id></citation></ref>
<ref id="B43">
<label>43.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bojanowski</surname> <given-names>P</given-names></name> <name><surname>Grave</surname> <given-names>E</given-names></name> <name><surname>Joulin</surname> <given-names>A</given-names></name> <name><surname>Mikolov</surname> <given-names>T</given-names></name></person-group>. <article-title>Enriching word vectors with subword information</article-title>. <source>Trans Assoc Comput Linguist.</source> (<year>2017</year>) <volume>5</volume>:<fpage>135</fpage>&#x02013;<lpage>46</lpage>. <pub-id pub-id-type="doi">10.1162/tacl_a_00051</pub-id></citation>
</ref>
<ref id="B44">
<label>44.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pyysalo</surname> <given-names>S</given-names></name> <name><surname>Ginter</surname> <given-names>F</given-names></name> <name><surname>Moen</surname> <given-names>H</given-names></name> <name><surname>Salakoski</surname> <given-names>T</given-names></name> <name><surname>Ananiadou</surname> <given-names>S</given-names></name></person-group>. <article-title>Distributional semantics resources for biomedical text processing</article-title>. In: <source>Proceedings of LBM 2013</source> (<publisher-loc>Tokyo</publisher-loc>).</citation>
</ref>
<ref id="B45">
<label>45.</label>
<citation citation-type="web"><person-group person-group-type="author"><collab>National Library of Medicine</collab></person-group>. <article-title>Use of MeSH in Online Retrieval. U.S. National Library of Medicine.</article-title> Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.nlm.nih.gov/mesh/intro_retrieval.html">https://www.nlm.nih.gov/mesh/intro_retrieval.html</ext-link> (accessed November 9, 2021).</citation>
</ref>
<ref id="B46">
<label>46.</label>
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Brezina</surname> <given-names>V</given-names></name> <name><surname>Weill-Tessier</surname> <given-names>P</given-names></name> <name><surname>McEnery</surname> <given-names>A</given-names></name></person-group>. <article-title>&#x00023;LancsBox [software].</article-title> <publisher-loc>Lancaster, UK</publisher-loc> (<year>2020</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="http://corpora.lancs.ac.uk/lancsbox">http://corpora.lancs.ac.uk/lancsbox</ext-link> (accessed February 26, 2021).</citation>
</ref>
<ref id="B47">
<label>47.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hunston</surname> <given-names>S</given-names></name></person-group>. <source>Corpora in Applied Linguistics</source>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name> (<year>2002</year>). <pub-id pub-id-type="doi">10.1017/CBO9781139524773</pub-id><pub-id pub-id-type="pmid">30886898</pub-id></citation></ref>
<ref id="B48">
<label>48.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Smyth</surname> <given-names>C</given-names></name></person-group>. <source>An Introduction to Corpus Linguistics</source>. <publisher-loc>London</publisher-loc>: <publisher-name>Routledge</publisher-name> (<year>2010</year>).</citation>
</ref>
<ref id="B49">
<label>49.</label>
<citation citation-type="web"><person-group person-group-type="author"><collab>Physiopedia</collab></person-group>. <article-title>Agnosia [Internet]</article-title>. <source>Physiopedia</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.physio-pedia.com/Agnosia">https://www.physio-pedia.com/Agnosia</ext-link> (accessed November 9, 2021).</citation>
</ref>
<ref id="B50">
<label>50.</label>
<citation citation-type="web"><source>GloVe: Global Vectors for Word Representation</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://nlp.stanford.edu/projects/glove/">https://nlp.stanford.edu/projects/glove/</ext-link> (accessed March 17, 2021).</citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup><ext-link ext-link-type="uri" xlink:href="https://www.reddit.com">https://www.reddit.com</ext-link></p></fn>
<fn id="fn0002"><p><sup>2</sup><ext-link ext-link-type="uri" xlink:href="https://twitter.com/?lang=en">https://twitter.com/?lang=en</ext-link></p></fn>
<fn id="fn0003"><p><sup>3</sup><ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/mesh/?term=pain">https://www.ncbi.nlm.nih.gov/mesh/?term=pain</ext-link></p></fn>
<fn id="fn0004"><p><sup>4</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/jayachaturvedi/pain_lexicon">https://github.com/jayachaturvedi/pain_lexicon</ext-link></p></fn>
<fn id="fn0005"><p><sup>5</sup><ext-link ext-link-type="uri" xlink:href="https://bioportal.bioontology.org/">https://bioportal.bioontology.org/</ext-link></p></fn>
</fn-group>
</back>
</article>