<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Digit. Humanit.</journal-id>
<journal-title>Frontiers in Digital Humanities</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Digit. Humanit.</abbrev-journal-title>
<issn pub-type="epub">2297-2668</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdigh.2017.00022</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Digital Humanities</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Using Semantic Linking to Understand Persons&#x02019; Networks Extracted from Text</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Palmero Aprosio</surname> <given-names>Alessio</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="cor1">&#x0002A;</xref>
<uri xlink:href="http://frontiersin.org/people/u/478796"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Tonelli</surname> <given-names>Sara</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://frontiersin.org/people/u/228516"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Menini</surname> <given-names>Stefano</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://frontiersin.org/people/u/492598"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Moretti</surname> <given-names>Giovanni</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://frontiersin.org/people/u/492693"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Digital Humanities Research Unit, Center for Information and Communication Technology, Fondazione Bruno Kessler</institution>, <addr-line>Trento</addr-line>, <country>Italy</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Information Engineering and Computer Science, University of Trento</institution>, <addr-line>Trento</addr-line>, <country>Italy</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Taha Yasseri, University of Oxford, United Kingdom</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Fabian Fl&#x000F6;ck, Leibniz Institut f&#x000FC;r Sozialwissenschaften (GESIS), Germany; Juan Juli&#x000E1;n Merelo, University of Granada, Spain; David Laniado, Eurecat (Spain), Spain</p></fn>
<corresp content-type="corresp" id="cor1">&#x0002A;Correspondence: Alessio Palmero Aprosio, <email>aprosio&#x00040;fbk.eu</email></corresp>
<fn fn-type="other" id="fn001"><p>Specialty section: This article was submitted to Big Data, a section of the journal Frontiers in Digital Humanities</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>16</day>
<month>11</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="collection">
<year>2017</year>
</pub-date>
<volume>4</volume>
<elocation-id>22</elocation-id>
<history>
<date date-type="received">
<day>28</day>
<month>07</month>
<year>2017</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>10</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2017 Palmero Aprosio, Tonelli, Menini and Moretti.</copyright-statement>
<copyright-year>2017</copyright-year>
<copyright-holder>Palmero Aprosio, Tonelli, Menini and Moretti</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>In this work, we describe a methodology to interpret large persons&#x02019; networks extracted from text by classifying cliques using the DBpedia ontology. The approach relies on a combination of NLP, Semantic web technologies, and network analysis. The classification methodology that first starts from single nodes and then generalizes to cliques is effective in terms of performance and is able to deal also with nodes that are not linked to Wikipedia. The gold standard manually developed for evaluation shows that groups of co-occurring entities share in most of the cases a category that can be automatically assigned. This holds for both languages considered in this study. The outcome of this work may be of interest to enhance the readability of large networks and to provide an additional semantic layer on top of cliques. This would greatly help humanities scholars when dealing with large amounts of textual data that need to be interpreted or categorized. Furthermore, it represents an unsupervised approach to automatically extend DBpedia starting from a corpus.</p>
</abstract>
<kwd-group>
<kwd>persons&#x02019; networks</kwd>
<kwd>semantic linking</kwd>
<kwd>DBpedia ontology</kwd>
<kwd>clique classification</kwd>
<kwd>natural language processing</kwd>
</kwd-group>
<counts>
<fig-count count="3"/>
<table-count count="3"/>
<equation-count count="15"/>
<ref-count count="35"/>
<page-count count="9"/>
<word-count count="6969"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="introduction">
<label>1</label> <title>Introduction</title>
<p>In recent years, humanities scholars have faced the challenge of introducing information technologies in their daily research activity to gain new insight from historical sources, literary collections, and other types of corpora, now available in digital format. However, to process large amounts of data and browse through the results in an intuitive way, new advanced tools are needed, specifically designed for researchers without a technical background. Scholars in the social sciences or contemporary history, in particular, need to interpret the content of an increasing flow of information (e.g., news, transcripts, and political debates) in a short time, so that they can quickly grasp the content of large amounts of data and then select the most interesting sources.</p>
<p>An effective way to highlight semantic connections emerging from documents, while summarizing their content, is a network. To analyze concepts and topics present in a corpus, several approaches have been successfully presented to model text corpora as networks, based on word co-occurrences, syntactic dependencies (Sudhahar et al., <xref ref-type="bibr" rid="B34">2015</xref>), or Latent Dirichlet Allocation (Henderson and Eliassi-Rad, <xref ref-type="bibr" rid="B13">2009</xref>). While these approaches focus mainly on concepts, other information could be effectively modeled in the form of networks, i.e., <italic>persons</italic>. Indeed, persons&#x02019; networks are the focus of several important research projects, for instance, Mapping the Republic of Letters,<xref ref-type="fn" rid="fn1"><sup>1</sup></xref> where connections between nodes have been manually encoded as metadata. However, when scholars need to manage large amounts of textual data, new challenges related to the creation of persons&#x02019; networks arise. Indeed, the process must be performed automatically, and since networks extracted from large amounts of data can include thousands of nodes and edges, the outcome may be difficult to read. While several software packages have been released to display and navigate networks, an overview of the content of large networks is difficult to achieve. Furthermore, this task also poses a series of technical challenges, for example, the need to find scalable solutions, and the fact that, although single components to extract persons&#x02019; networks from unstructured text may be available, they have never been integrated before in a single pipeline nor evaluated for the task.</p>
<p>In this work, we present an approach to extract persons&#x02019; networks from large amounts of text and to use Semantic Web technologies for classifying clusters of nodes. This classification relies on categories automatically leveraged from DBpedia, proving an effective interplay among Natural Language Processing, Semantic Web technologies, and network analysis. Through this process, interpretation of networks, the so-called <italic>distant reading</italic> (Moretti, <xref ref-type="bibr" rid="B24">2013</xref>), is made easier. We also analyze the impact of persons&#x02019; disambiguation and coreference resolution on the task. An evaluation is performed both on English and on Italian data, to assess whether there are differences depending on the language, on the domains covered by the two corpora, or on the different performance of NLP tools.</p>
<p>The article is structured as follows: in Section <xref ref-type="sec" rid="S2">2</xref>, we discuss past works related to our task, while in Section <xref ref-type="sec" rid="S3">3</xref>, we provide a description of the steps belonging to the proposed methodology. In Section <xref ref-type="sec" rid="S4">4</xref>, the experimental setup and the analyzed corpus are detailed, while in Section <xref ref-type="sec" rid="S5">5</xref>, an evaluation of node and clique classification is provided and discussed. In Section <xref ref-type="sec" rid="S6">6</xref>, we provide details on how to obtain the implemented system and the dataset, and finally we draw some conclusions and discuss future work in Section <xref ref-type="sec" rid="S7">7</xref>.</p>
</sec>
<sec id="S2">
<label>2</label> <title>Related Work</title>
<p>This work lies at the intersection of different disciplines. It takes advantage of studies on graphs, in particular research on the properties of cliques, i.e., groups of nodes with all possible ties among themselves. Cliques have been extensively studied in relation to social networks, where they usually represent social circles or communities (Grabowicz et al., <xref ref-type="bibr" rid="B10">2013</xref>; Jin et al., <xref ref-type="bibr" rid="B14">2013</xref>; Mcauley and Leskovec, <xref ref-type="bibr" rid="B20">2014</xref>). Although we use them to model co-occurrence in texts and not social relations, the assumption underlying this work is the same: the nodes belonging to the same clique share some common properties or categories, which we aim to identify automatically using Linked Open Data.</p>
<p>This work also relies on past research analyzing the impact of preprocessing, in particular coreference resolution and named entity disambiguation, on the extraction of networks from text. The work presented in Diesner and Carley (<xref ref-type="bibr" rid="B4">2009</xref>) shows that anaphora and coreference resolution both have an impact on deduplicating nodes and adjusting weights in networks extracted from news. The authors recommend applying both preprocessing steps to bring the network structure closer to the underlying social structure. This recommendation has been integrated in our processing pipeline, when possible.</p>
<p>The impact of named entity disambiguation on networks extracted from e-mail interactions is analyzed in Diesner et al. (<xref ref-type="bibr" rid="B5">2015</xref>). The authors argue that disambiguation is a precondition for testing hypotheses, answering graph-theoretical and substantive questions about networks, and advancing network theories. Building on these premises, our study introduces a mention normalization step that collapses different person mentions onto the same node if they refer to the same entity.</p>
<p>Kobilarov et al. (<xref ref-type="bibr" rid="B15">2009</xref>) describe how the BBC integrates data and links documents across entertainment and news domains by using Linked Open Data. Similarly, in &#x000D6;zg&#x000FC;r et al. (<xref ref-type="bibr" rid="B26">2008</xref>), Reuters News articles are connected in an entity graph at document level: people are represented as vertices, and two persons are connected if they co-occur in the same article. The authors investigate the importance of a person using various ranking algorithms, such as PageRank. In Hasegawa et al. (<xref ref-type="bibr" rid="B12">2004</xref>), a similar graph of people is created, showing that relations between individuals can also be inferred, with high precision and recall, by connecting entities at sentence level. In this work, we extract persons&#x02019; networks in a similar way, but we classify groups of highly connected nodes rather than relations.</p>
<p>In Koper (<xref ref-type="bibr" rid="B16">2004</xref>), the Semantic Web is used to obtain a representation of educational entities, to build self-organized learning networks, and to go beyond course- and curriculum-centric models. The <italic>Trusty</italic> algorithm (Kuter and Golbeck, <xref ref-type="bibr" rid="B17">2009</xref>) combines network analysis and the Semantic Web to compute social trust in a group of users using a particular service on the Web.</p>
</sec>
<sec id="S3">
<label>3</label> <title>Methodology</title>
<p>We propose and evaluate a methodology that takes a corpus in plain text as input and outputs a network, where each <italic>node</italic> corresponds to a person and an <italic>edge</italic> is set between two nodes if the two persons co-occur in the same sentence. Within the network, <italic>cliques</italic>, i.e., maximal groups of nodes that have all possible ties among themselves, are automatically labeled with a category extracted from DBpedia. In our case, cliques correspond to persons who tend to occur together in text and who, we assume, share some commonalities or are involved in the same events. The goal of this process is to provide a comprehensive overview of the persons mentioned in large amounts of documents and show dependencies, overlaps, outliers, and other features that would otherwise be hard to discern. A portion of a network with three highlighted cliques is shown in Figure <xref ref-type="fig" rid="F1">1</xref>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Visualization of a large persons&#x02019; network with examples of labeled cliques.</p></caption>
<graphic xlink:href="fdigh-04-00022-g001.tif"/>
</fig>
<p>The creation of a persons&#x02019; network from text can be designed to model different types of relations. In the case of novels, networks can capture dialog interactions and rely on the conversations between characters (Elson et al., <xref ref-type="bibr" rid="B7">2010</xref>). In the case of e-mail corpora (Diesner et al., <xref ref-type="bibr" rid="B5">2015</xref>), edges correspond to e-mails exchanged between sender and addressee. Each type of interaction must be recognized with an <italic>ad hoc</italic> approach, for instance, using a tool that identifies direct speech in literary texts. In contrast, our goal is to rely on a general-purpose methodology; therefore, our approach to network creation is based on simple co-occurrence, similar to existing approaches to the creation of concept networks (Veling and Van Der Weerd, <xref ref-type="bibr" rid="B35">1999</xref>). In the following subsections, we detail the steps that make up our approach, displayed in Figure <xref ref-type="fig" rid="F2">2</xref>.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Workflow of the whole system.</p></caption>
<graphic xlink:href="fdigh-04-00022-g002.tif"/>
</fig>
<sec id="S3-1">
<label>3.1</label> <title>Preprocessing</title>
<p>Each corpus is first processed with a pipeline of NLP tools. The goal is to detect persons&#x02019; names in the documents and link them to DBpedia. Since our approach supports both English and Italian, we adopt two different strategies, given that the NLP tools available for the two languages are very different and generally achieve better performance on English data. For English, we use the PIKES suite (Corcoglioniti et al., <xref ref-type="bibr" rid="B2">2016</xref>): it first launches the Stanford Named Entity Recognizer (Finkel et al., <xref ref-type="bibr" rid="B8">2005</xref>) to identify persons&#x02019; mentions in the documents (e.g., &#x0201C;J. F. Kennedy,&#x0201D; &#x0201C;Lady Gaga,&#x0201D; etc.), and then the Stanford Deterministic Coreference Resolution System (Manning et al., <xref ref-type="bibr" rid="B19">2014</xref>) to set coreferential chains within each document. For instance, the expressions &#x0201C;J. F. Kennedy,&#x0201D; &#x0201C;J. F. K.,&#x0201D; &#x0201C;John Kennedy,&#x0201D; and &#x0201C;he&#x0201D; may all be connected because they all refer to the same person. For Italian, instead, no tool for coreference resolution is available, therefore only NER is performed, using the Tint NLP suite (Palmero Aprosio and Moretti, <xref ref-type="bibr" rid="B31">2016</xref>).</p>
<p>Then, for both languages we run DBpedia Spotlight (Daiber et al., <xref ref-type="bibr" rid="B3">2013</xref>) and the Wiki Machine (Palmero Aprosio and Giuliano, <xref ref-type="bibr" rid="B28">2016</xref>) to link the entities in the text to the corresponding DBpedia pages.<xref ref-type="fn" rid="fn2"><sup>2</sup></xref> In particular, we consider only links that overlap with the NER annotation and belong to the <monospace>Person</monospace> category. We combine the output of the two tools, since past work has shown that such a combination outperforms single linking systems (Rizzo and Troncy, <xref ref-type="bibr" rid="B33">2012</xref>).</p>
<p>In case of mismatch between the output of the two linking annotations, the confidence values (between 0 and 1, provided by both systems) are compared, and only the more confident result is considered. At the end of preprocessing, we obtain for each document a list of (coreferring) persons&#x02019; mentions linked to DBpedia pages.</p>
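<p>As an illustration of the comparison described above, the following Python sketch selects a link when the two systems disagree. The <monospace>Link</monospace> structure, the function name, and the handling of the cases in which only one system returns a link or both systems agree are assumptions made for this example; they do not describe the actual pipeline.</p>
<preformat>
```python
from typing import NamedTuple, Optional


class Link(NamedTuple):
    """One linker's proposal for a mention (hypothetical structure)."""
    page: str          # identifier of the proposed DBpedia page
    confidence: float  # confidence value between 0 and 1


def merge_links(first: Optional[Link], second: Optional[Link]) -> Optional[Link]:
    """Combine the proposals of two linking systems for the same mention."""
    if first is None or second is None:
        # Assumption: if only one system proposes a link, keep that one.
        return first or second
    if first.page == second.page:
        # Assumption: on agreement, keep the page with the higher confidence.
        return Link(first.page, max(first.confidence, second.confidence))
    # Mismatch: only the more confident result is considered.
    return first if first.confidence >= second.confidence else second
```
</preformat>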
</sec>
<sec id="S3-2">
<label>3.2</label> <title>Linking Filter</title>
<p>To improve linking precision, a filtering step based on Semantic Web resources has been introduced. It is applied to <italic>highly ambiguous entities</italic>: since they are very likely to be linked to the wrong Wikipedia page, it may be preferable to ignore them during the linking process. An entity is ignored if the probability that it is linked to a Wikipedia page, calculated as described in Palmero Aprosio et al. (<xref ref-type="bibr" rid="B29">2013a</xref>), is below a certain threshold. For instance, the word <italic>Plato</italic> can be linked to the philosopher, but also to an actress, <italic>Dana Plato</italic>, a racing driver, <italic>Jason Plato</italic>, and a South African politician, <italic>Dan Plato</italic>. However, the probability that <italic>Plato</italic> is linked to the philosopher page is 0.93, i.e., the link to the philosopher is almost always right. That value is calculated from Wikipedia, considering both the number of links referring to that entity and the semantics of the context extracted from the text surrounding the linked entity. In some cases, probabilities are very low, especially for common combinations of name&#x02013;surname. For example, <italic>Dave Roberts</italic> can be linked to 15 different Wikipedia pages, all of them with similarly low probabilities (0.19 for the outfielder, 0.14 for the pitcher, 0.06 for the broadcaster, 0.04 for the Californian politician mentioned in Kennedy&#x02019;s speeches, etc.). We manually checked some linking probabilities and set the threshold value to 0.2, so that if every possible page that can be linked to a mention has a probability &#x0003C;0.2, the entity is not linked. The impact of this step on the general task is reported in Table <xref ref-type="table" rid="T2">2</xref>.</p>
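<p>The filtering rule can be sketched in Python as follows. The function name and the dictionary-based input are hypothetical; the probabilities are assumed to have been computed beforehand, and only the 0.2 threshold and the example values are taken from the text above.</p>
<preformat>
```python
from typing import Dict


def is_highly_ambiguous(page_probabilities: Dict[str, float],
                        threshold: float = 0.2) -> bool:
    """Return True if every candidate page falls below the threshold.

    page_probabilities maps each candidate Wikipedia page of a mention to
    its linking probability. A mention without candidates is also reported
    as not linkable (an assumption of this sketch).
    """
    return all(threshold > p for p in page_probabilities.values())


# Values quoted in the text; the page labels are illustrative only.
plato = {"Plato (philosopher)": 0.93}
dave_roberts = {"outfielder": 0.19, "pitcher": 0.14,
                "broadcaster": 0.06, "Californian politician": 0.04}
```
</preformat>
<p>With these inputs, the mention <italic>Plato</italic> is kept, whereas <italic>Dave Roberts</italic> is treated as highly ambiguous and left unlinked.</p>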
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Number of nodes and cliques in the networks with and without mention normalization (MN) and coreference resolution (COREF&#x02014;only for English).</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"/>
<th valign="top" align="center">Dataset</th>
<th valign="top" align="center">w/o MN</th>
<th valign="top" align="center">MN</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Number of nodes</td>
<td align="center" valign="top">NK</td>
<td align="center" valign="top">4,754</td>
<td align="center" valign="top">4,261</td>
</tr>
<tr>
<td align="left" valign="top">Number of nodes</td>
<td align="center" valign="top">Adige</td>
<td align="center" valign="top">28,644</td>
<td align="center" valign="top">19,133</td>
</tr>
<tr>
<td align="left" valign="top">Number of cliques</td>
<td align="center" valign="top"/>
<td align="center" valign="top"/>
<td align="center" valign="top"/>
</tr>
<tr>
<td align="left" valign="top">w/o COREF</td>
<td align="center" valign="top">NK</td>
<td align="center" valign="top">720 (4.62)</td>
<td align="center" valign="top">683 (4.60)</td>
</tr>
<tr>
<td align="left" valign="top">COREF</td>
<td align="center" valign="top">NK</td>
<td align="center" valign="top">1,005 (4.91)</td>
<td align="center" valign="top">869 (4.80)</td>
</tr>
<tr>
<td align="left" valign="top">w/o COREF</td>
<td align="center" valign="top">Adige</td>
<td align="center" valign="top">14,762 (5.23)</td>
<td align="center" valign="top">6,294 (5.12)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p><italic>In brackets, the average number of entities for each clique</italic>.</p></table-wrap-foot></table-wrap>
</sec>
<sec id="S3-3">
<label>3.3</label> <title>Network Creation</title>
<p>The goal of this step is to take as input the information extracted through preprocessing and filtering and produce a network representing person co-occurrences in the corpus. We assume that persons correspond to nodes and edges express co-occurrence; therefore, we build a person&#x02013;person matrix by setting an edge every time two persons are mentioned together in the same sentence.<xref ref-type="fn" rid="fn3"><sup>3</sup></xref></p>
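<p>A minimal Python sketch of this step is given below. It assumes that each sentence has already been reduced to the list of (normalized) persons mentioned in it; the function name is hypothetical, and counting how many sentences support each edge is an assumption of the example rather than a documented feature of the system.</p>
<preformat>
```python
from collections import Counter
from itertools import combinations


def build_network(sentences):
    """Build an undirected co-occurrence network.

    sentences: iterable of lists of person names, one list per sentence.
    Returns a Counter mapping each (person_a, person_b) pair, stored in
    alphabetical order, to the number of sentences in which both occur.
    """
    edges = Counter()
    for persons in sentences:
        # A person mentioned twice in a sentence is counted once;
        # sorting makes (a, b) and (b, a) the same edge.
        for a, b in combinations(sorted(set(persons)), 2):
            edges[(a, b)] += 1
    return edges
```
</preformat>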
<p>A known issue in network creation is name disambiguation, i.e., identifying whether a set of person mentions refers to one or more real-world persons. This task can be very difficult because it implies understanding whether spellings of seemingly similar names, such as &#x0201C;Smith, John&#x0201D; and &#x0201C;Smith, J.,&#x0201D; represent the same person or not. The problem becomes more complicated when people are referred to with diminutives (e.g., &#x0201C;Nick&#x0201D; instead of &#x0201C;Nicholas&#x0201D;) or acronyms (e.g., &#x0201C;J.F.K.&#x0201D;), or when their names are spelled inconsistently.</p>
<p>We tackle this problem with a <bold>mention normalization</bold> step based on a set of rules for English and Italian, dealing with both single- and multiple-token entities. Specifically, entities comprising more than one token (i.e., complex entities) are collapsed onto the same node if they share a certain number of tokens (e.g., &#x0201C;John F. Kennedy&#x0201D; and &#x0201C;John Kennedy&#x0201D;). The approach is similar to the <italic>first initial</italic> method, which reached 97% accuracy in past experiments (Milojevi&#x00107;, <xref ref-type="bibr" rid="B22">2013</xref>). As for simple entities (i.e., composed of only one token), they can be either first names or surnames. To assess which simple entity belongs to which category, two lists of first and family names are extracted from biographies in Wikipedia, along with their frequency: a token is considered a family name if it appears in the corresponding list and does not appear in the first name list. Tokens not classified as surnames are ignored and not included in the network. Tokens classified as surnames, instead, are merged with the node corresponding to the most frequent complex entity containing that surname. The extraction of name and surname lists is performed using information included in infoboxes: in the English Wikipedia, the name and surname of a person are correctly split in <monospace>DEFAULTSORT</monospace>; in Italian, that information is included in <monospace>Persondata</monospace>.</p>
<p>For example, the single mentions of &#x0201C;Kennedy&#x0201D; are all collapsed onto the &#x0201C;John Fitzgerald Kennedy&#x0201D; node, if it is more frequent in the corpus than any other node containing the same surname such as &#x0201C;Robert F. Kennedy,&#x0201D; &#x0201C;Ted Kennedy,&#x0201D; etc. Normalization is particularly effective in dealing with distant mentions of the same person in a document, because in such cases coreference tends to fail. It is also needed for documents in which ambiguous forms cannot be mapped to an extended version, for example, when only &#x0201C;Kennedy&#x0201D; is present. Finally, it is very effective on Italian, since for this language there is no coreference resolution tool. After mention normalization, the network has fewer nodes but is more connected than the original version without normalization (see Table <xref ref-type="table" rid="T1">1</xref>).</p>
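<p>The rule for single-token mentions can be sketched in Python as follows. This is only an approximation under stated assumptions: the function name and input structures are hypothetical, the name lists are assumed to be available as plain sets, and a surname for which no complex entity exists in the corpus is kept as its own node, a case that the rules above do not specify.</p>
<preformat>
```python
def normalize_single_tokens(mention_counts, first_names, family_names):
    """Map each single-token mention to the node it is merged with.

    mention_counts: dict (or Counter) from mention string to corpus frequency.
    first_names, family_names: sets of names (hypothetical inputs).
    A value of None means that the mention is left out of the network.
    """
    multi_token = [m for m in mention_counts if len(m.split()) > 1]
    mapping = {}
    for mention in mention_counts:
        if len(mention.split()) > 1:
            continue  # complex entities are handled by a separate rule
        if mention not in family_names or mention in first_names:
            mapping[mention] = None  # not classified as a surname: ignored
            continue
        candidates = [m for m in multi_token if mention in m.split()]
        # Merge with the most frequent complex entity containing the surname.
        # Assumption: with no such entity, the surname stays as its own node.
        mapping[mention] = max(candidates, key=mention_counts.get,
                               default=mention)
    return mapping
```
</preformat>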
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Evaluation of node and clique classification (HAE means &#x0201C;highly ambiguous entities&#x0201D;).</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Data</th>
<th valign="top" align="left">Experiment</th>
<th valign="top" align="center">P</th>
<th valign="top" align="center">R</th>
<th valign="top" align="center"><italic>F</italic><sub>1</sub></th>
<th valign="top" align="center">&#x00023; entities</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">NK</td>
<td align="left" valign="top">Baseline (Politician)</td>
<td align="center" valign="top">0.807</td>
<td align="center" valign="top">0.491</td>
<td align="center" valign="top">0.611</td>
<td align="center" valign="top">347/347</td>
</tr>
<tr>
<td align="left" valign="top">NK</td>
<td align="left" valign="top">Node classification</td>
<td align="center" valign="top">0.689</td>
<td align="center" valign="top">0.481</td>
<td align="center" valign="top">0.566</td>
<td align="center" valign="top">245/347</td>
</tr>
<tr>
<td align="left" valign="top">NK</td>
<td align="left" valign="top">Extension to non-linked</td>
<td align="center" valign="top">0.617</td>
<td align="center" valign="top">0.578</td>
<td align="center" valign="top">0.597</td>
<td align="center" valign="top">245/347</td>
</tr>
<tr>
<td align="left" valign="top">NK</td>
<td align="left" valign="top">Node classification, no HAE</td>
<td align="center" valign="top">0.870</td>
<td align="center" valign="top">0.460</td>
<td align="center" valign="top">0.602</td>
<td align="center" valign="top">176/347</td>
</tr>
<tr>
<td align="left" valign="top">NK</td>
<td align="left" valign="top">Extension to non-linked, no HAE</td>
<td align="center" valign="top">0.738</td>
<td align="center" valign="top">0.632</td>
<td align="center" valign="top">0.681</td>
<td align="center" valign="top">347/347</td>
</tr>
<tr>
<td align="left" valign="top">NK</td>
<td align="left" valign="top">Clique classification</td>
<td align="center" valign="top">0.677</td>
<td align="center" valign="top">0.768</td>
<td align="center" valign="top">0.720</td>
<td align="center" valign="top"/>
</tr>
<tr>
<td align="left" valign="top">Adige</td>
<td align="left" valign="top">Baseline (context)</td>
<td align="center" valign="top">0.802</td>
<td align="center" valign="top">0.488</td>
<td align="center" valign="top">0.607</td>
<td align="center" valign="top">486/486</td>
</tr>
<tr>
<td align="left" valign="top">Adige</td>
<td align="left" valign="top">Node classification</td>
<td align="center" valign="top">0.891</td>
<td align="center" valign="top">0.256</td>
<td align="center" valign="top">0.398</td>
<td align="center" valign="top">154/486</td>
</tr>
<tr>
<td align="left" valign="top">Adige</td>
<td align="left" valign="top">Extension to non-linked</td>
<td align="center" valign="top">0.911</td>
<td align="center" valign="top">0.625</td>
<td align="center" valign="top">0.742</td>
<td align="center" valign="top">486/486</td>
</tr>
<tr>
<td align="left" valign="top">Adige</td>
<td align="left" valign="top">Clique classification</td>
<td align="center" valign="top">0.930</td>
<td align="center" valign="top">0.485</td>
<td align="center" valign="top">0.637</td>
<td align="center" valign="top"/>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="S3-4">
<label>3.4</label> <title>Clique Identification and Labeling</title>
<p>The last steps of the process include the identification of cliques, i.e., clusters of nodes with all possible ties among themselves (see Figure <xref ref-type="fig" rid="F1">1</xref>), and their classification by assigning a semantic category covering all nodes included in the clique. In the case of small datasets, existing algorithms can quickly find all maximal cliques inside a network (a maximal clique is a clique that cannot be enlarged by adding a vertex). The most efficient one is the Bron&#x02013;Kerbosch clique detection algorithm (Bron and Kerbosch, <xref ref-type="bibr" rid="B1">1973</xref>). Unfortunately, the algorithm takes exponential time <italic>O</italic>(3<italic><sup>n</sup></italic><sup>/3</sup>) (where <italic>n</italic> is the number of vertices in the network), which means that it quickly becomes intractable as the size of the network increases. Since in our scenario we are not interested in listing <italic>every</italic> maximal clique, we can instead limit the size of the cliques to a fixed value <italic>k</italic> (which can be arbitrarily large, for example, 10); the execution time then drops to <italic>O</italic>(<italic>n<sup>k</sup>k</italic><sup>2</sup>), which is polynomial (Downey and Fellows, <xref ref-type="bibr" rid="B6">1995</xref>).</p>
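<p>For reference, the basic Bron&#x02013;Kerbosch recursion can be sketched in a few lines of Python. The version below is the plain variant without pivoting and does not implement the size bound <italic>k</italic> discussed above; the function name and the adjacency-dictionary input are assumptions of this example, not a description of the implemented system.</p>
<preformat>
```python
def maximal_cliques(adjacency):
    """Yield every maximal clique of an undirected graph as a frozenset.

    adjacency: dict mapping each node to the set of its neighbours
    (a node must not be listed among its own neighbours).
    Basic Bron-Kerbosch recursion, without pivoting.
    """
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            # No vertex can extend the clique any further: it is maximal.
            yield clique
            return
        for node in list(candidates):
            neighbours = adjacency[node]
            yield from expand(clique | {node},
                              candidates.intersection(neighbours),
                              excluded.intersection(neighbours))
            # The node has been fully explored: move it to the excluded set.
            candidates = candidates - {node}
            excluded = excluded | {node}

    yield from expand(frozenset(), set(adjacency), set())
```
</preformat>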
<p>Clique labeling is performed according to the following algorithm. Let <italic>C</italic> be the set of cliques to be labeled. For each clique <italic>c</italic>&#x02009;&#x02208;&#x02009;<italic>C</italic>, let <italic>c<sub>i</sub>, i</italic>&#x02009;&#x0003D;&#x02009;(1 &#x02026; <italic>k<sub>c</sub></italic>) be the nodes belonging to <italic>c</italic> (note that we extract cliques of different sizes, thus we denote by <italic>k<sub>c</sub></italic> the size of the clique <italic>c</italic>). For each node <italic>c<sub>i</sub></italic> previously linked to a Wikipedia page (see Sections <xref ref-type="sec" rid="S3-1">3.1</xref> and <xref ref-type="sec" rid="S5">5</xref>), we extract the corresponding DBpedia classes using Airpedia (Palmero Aprosio et al., <xref ref-type="bibr" rid="B30">2013b</xref>). This system was chosen because it extends DBpedia coverage, also classifying pages that do not contain an infobox and exploiting cross-lingual links in Wikipedia. This results in a deeper and broader coverage of pages with respect to DBpedia classes. Let class(<italic>c<sub>i</sub></italic>) be the set of DBpedia classes associated with an entity <italic>c<sub>i</sub></italic> &#x02208; <italic>c</italic>. Note that class(<italic>c<sub>i</sub></italic>)&#x02009;&#x0003D;&#x02009;&#x02205; for some <italic>c<sub>i</sub></italic>, as only around 50% of the entities can be successfully linked (see last column of Table <xref ref-type="table" rid="T2">2</xref>).</p>
<p>For each clique, we define the first frequency function <italic>F</italic>&#x02032; that maps each possible DBpedia class to the number of occurrences of that class in that clique. For example, the annotated clique
<disp-formula id="E1"><mml:math id="M1"><mml:mrow><mml:mtext>Gifford&#x02009;Pinchot</mml:mtext><mml:mo>&#x02192;</mml:mo><mml:mtext fontfamily="monospace">Governor</mml:mtext></mml:mrow></mml:math></disp-formula>
<disp-formula id="E2"><mml:math id="M2"><mml:mrow><mml:mtext>Theodore&#x02009;Roosevelt</mml:mtext><mml:mo>&#x02192;</mml:mo><mml:mtext fontfamily="monospace">President</mml:mtext></mml:mrow></mml:math></disp-formula>
<disp-formula id="E3"><mml:math id="M3"><mml:mrow><mml:mtext>Wendell&#x02009;Willkie</mml:mtext><mml:mo>&#x02192;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtext fontfamily="monospace">none</mml:mtext></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
<disp-formula id="E4"><mml:math id="M4"><mml:mrow><mml:mtext>Franklin&#x02009;Roosevelt</mml:mtext><mml:mo>&#x02192;</mml:mo><mml:mtext fontfamily="monospace">President</mml:mtext></mml:mrow></mml:math></disp-formula>
will result in
<disp-formula id="E5"><mml:math id="M5"><mml:mrow><mml:msup><mml:mi>F</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mtext fontfamily="monospace">Governor</mml:mtext><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></disp-formula>
<disp-formula id="E6"><mml:math id="M6"><mml:mrow><mml:msup><mml:mi>F</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mtext fontfamily="monospace">President</mml:mtext><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>2.</mml:mn></mml:mrow></mml:math></disp-formula></p>
<p>As DBpedia classes are hierarchical, we compute the final frequency function <italic>F</italic> from <italic>F</italic>&#x02032; by counting each occurrence of a class also for all of its ancestors. In our example, as <monospace>Governor</monospace> and <monospace>President</monospace> are both children of <monospace>Politician</monospace>, <italic>F</italic> will result in
<disp-formula id="E7"><mml:math id="M7"><mml:mrow><mml:mi>F</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mtext fontfamily="monospace">Governor</mml:mtext><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></disp-formula>
<disp-formula id="E8"><mml:math id="M8"><mml:mrow><mml:mi>F</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mtext fontfamily="monospace">President</mml:mtext><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:math></disp-formula>
<disp-formula id="E9"><mml:math id="M9"><mml:mrow><mml:mi>F</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mtext fontfamily="monospace">Politician</mml:mtext><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>3.</mml:mn></mml:mrow></mml:math></disp-formula></p>
<p>Since in our task we focus on persons, we only deal with the classes dominated by <monospace>Person</monospace> (we ignore the <monospace>Agent</monospace> class, along with <monospace>Person</monospace> itself). Finally, we pick the class that has the highest frequency and extend the annotation to the unknown entities. In the example, <italic>Wendell Willkie</italic> would be classified as <monospace>Politician</monospace>. The same class is also used to guess what the people in the clique have in common, i.e., a possible classification of the whole clique, to help the <italic>distant reading</italic> of the graph.</p>
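<p>The labeling procedure above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than our actual implementation: the <monospace>PARENT</monospace> map is a hand-written fragment of the class hierarchy covering only the running example, each node contributes at most once to the count of a class, and ties between equally frequent classes are broken arbitrarily.</p>
<preformat>
```python
from collections import Counter

# Hypothetical fragment of the class hierarchy (child -> parent).
PARENT = {
    "Governor": "Politician",
    "President": "Politician",
    "Politician": "Person",
    "Person": "Agent",
}
# Classes that are too generic to be used as labels.
IGNORED = {"Person", "Agent"}


def ancestors(cls):
    """Yield the strict ancestors of a class, nearest first."""
    while cls in PARENT:
        cls = PARENT[cls]
        yield cls


def class_frequencies(node_classes):
    """Frequency function F: class counts including ancestors."""
    freq = Counter()
    for classes in node_classes.values():
        expanded = set(classes)
        for cls in classes:
            expanded.update(ancestors(cls))
        # Assumption: a node counts at most once per class.
        freq.update(expanded - IGNORED)
    return freq


def label_clique(node_classes):
    """Return the clique label and the node classes extended with it."""
    freq = class_frequencies(node_classes)
    if not freq:
        return None, dict(node_classes)
    # Assumption: ties are broken arbitrarily.
    label = max(freq, key=freq.get)
    extended = {
        node: (set(classes) if classes else {label})
        for node, classes in node_classes.items()
    }
    return label, extended


clique = {
    "Gifford Pinchot": {"Governor"},
    "Theodore Roosevelt": {"President"},
    "Wendell Willkie": set(),  # not linked, so no class is available
    "Franklin Roosevelt": {"President"},
}
frequencies = class_frequencies(clique)
label, extended = label_clique(clique)
print(label, extended["Wendell Willkie"])
```
</preformat>
<p>On the example clique, the sketch reproduces the frequencies given above, selects <monospace>Politician</monospace> as the clique label, and propagates it to the unlinked node.</p>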
</sec>
</sec>
<sec id="S4">
<label>4</label> <title>Experimental Setup</title>
<sec id="S4-1">
<label>4.1</label> <title>Evaluation Methodology</title>
<p>We evaluate our approach on two corpora:
<list list-type="bullet">
<list-item><p>The corpus of political speeches uttered by Nixon and Kennedy (NK) during the 1960 presidential campaign.<xref ref-type="fn" rid="fn4"><sup>4</sup></xref> It contains around 1,650,000 tokens (830,000 by Nixon and 815,000 by Kennedy).</p></list-item>
<list-item><p>A corpus extracted from articles published in the Italian newspaper L&#x02019;Adige<xref ref-type="fn" rid="fn5"><sup>5</sup></xref> between 2011 and 2014, containing 9,786,625 tokens. To increase the variability of the news content and have a balanced dataset, we retrieve the documents from different news sections (e.g., Sports, Politics, and Events).</p></list-item>
</list></p>
<p>Each corpus is first pre-processed as described in Section <xref ref-type="sec" rid="S3-1">3.1</xref>. Then, the recognized entities are linked and mention normalization (MN) is performed. On the English data we also run coreference resolution (COREF) (see Section <xref ref-type="sec" rid="S3-1">3.1</xref>). We show in Table <xref ref-type="table" rid="T1">1</xref> the impact of these two processes on the network dimension and on the number of extracted cliques.</p>
<p>Clique identification is performed by applying the Bron&#x02013;Kerbosch clique detection algorithm (see Section <xref ref-type="sec" rid="S3-4">3.4</xref>), using the implementation available in the JGraphT package.<xref ref-type="fn" rid="fn6"><sup>6</sup></xref> After this extraction, we only work on cliques having at least 4 nodes, as smaller cliques would be too trivial to classify. Table <xref ref-type="table" rid="T3">3</xref> lists the number of cliques grouped by size.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Number of cliques grouped by size.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Dataset/size</th>
<th valign="top" align="center">4</th>
<th valign="top" align="center">5</th>
<th valign="top" align="center">6</th>
<th valign="top" align="center">7</th>
<th valign="top" align="center">8</th>
<th valign="top" align="center">9</th>
<th valign="top" align="center">10</th>
<th valign="top" align="center">11</th>
<th valign="top" align="center">12</th>
<th valign="top" align="center">14</th>
<th valign="top" align="center">15</th>
<th valign="top" align="center">16</th>
<th valign="top" align="center">17</th>
<th valign="top" align="center">19</th>
<th valign="top" align="center">20</th>
<th valign="top" align="center">23</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">NK</td>
<td align="center" valign="top">211</td>
<td align="center" valign="top">158</td>
<td align="center" valign="top">100</td>
<td align="center" valign="top">66</td>
<td align="center" valign="top">39</td>
<td align="center" valign="top">17</td>
<td align="center" valign="top">7</td>
<td align="center" valign="top">5</td>
<td align="center" valign="top">3</td>
<td align="center" valign="top">0</td>
<td align="center" valign="top">0</td>
<td align="center" valign="top">1</td>
<td align="center" valign="top">1</td>
<td align="center" valign="top">1</td>
<td align="center" valign="top">1</td>
<td align="center" valign="top">0</td>
</tr>
<tr>
<td align="left" valign="top">Adige</td>
<td align="center" valign="top">177</td>
<td align="center" valign="top">120</td>
<td align="center" valign="top">89</td>
<td align="center" valign="top">33</td>
<td align="center" valign="top">40</td>
<td align="center" valign="top">5</td>
<td align="center" valign="top">3</td>
<td align="center" valign="top">1</td>
<td align="center" valign="top">0</td>
<td align="center" valign="top">2</td>
<td align="center" valign="top">2</td>
<td align="center" valign="top">0</td>
<td align="center" valign="top">0</td>
<td align="center" valign="top">0</td>
<td align="center" valign="top">0</td>
<td align="center" valign="top">1</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Mention normalization reduces the number of nodes because it collapses different mentions onto the same node. Consequently, the number of cliques decreases (see Table <xref ref-type="table" rid="T1">1</xref>). Coreference resolution, instead, does not have any impact on the network dimension, but it increases the number of edges connecting nodes, resulting in an increase in the number of cliques and in their size. The evaluation presented in the remainder of this article on English data is based on a system configuration including both mention normalization and coreference resolution. For Italian, only mention normalization is performed.</p>
</sec>
<sec id="S4-2">
<label>4.2</label> <title>Gold Standard Creation</title>
<p>Since the goal of this work is to present and evaluate a methodology to assign categories to cliques and make large persons&#x02019; networks more readable, we first create a gold standard with two annotated layers, one at <italic>node</italic> and one at <italic>clique</italic> level. This data set includes 184 cliques randomly extracted from the clique list (see Section <xref ref-type="sec" rid="S3-4">3.4</xref>): 84 from the NK corpus and 100 from Adige.</p>
<p>First, each node in the clique is manually annotated with one or more classes from the DBpedia ontology (Lehmann et al., <xref ref-type="bibr" rid="B18">2015</xref>) expressing the social role of the person under consideration. For example, <italic>Henry Clay</italic> is annotated both as <monospace>Senator</monospace> and <monospace>Congressman</monospace>. For many political roles, the ontology does not contain any class (for instance, <italic>Secretary</italic>). In that case, the person is labeled with the closest more generic class (e.g., <monospace>Politician</monospace>). Then, for each clique, we identify the most specific class (or classes) of the ontology including every member of the group. The shared class is used as the label defining the category of the clique. For example, a clique can be annotated as follows:
<disp-formula id="E10"><mml:math id="M10"><mml:mrow><mml:mtext>John&#x02009;Swainson</mml:mtext><mml:mo>&#x02192;</mml:mo><mml:mtext fontfamily="monospace">Governor</mml:mtext></mml:mrow></mml:math></disp-formula>
<disp-formula id="E11"><mml:math id="M11"><mml:mrow><mml:mtext>G</mml:mtext><mml:mo>.</mml:mo><mml:mtext>Mennen&#x02009;Williams</mml:mtext><mml:mo>&#x02192;</mml:mo><mml:mtext fontfamily="monospace">Governor</mml:mtext></mml:mrow></mml:math></disp-formula>
<disp-formula id="E12"><mml:math id="M12"><mml:mrow><mml:mtext>Thaddeus&#x02009;Machrowicz</mml:mtext><mml:mo>&#x02192;</mml:mo><mml:mtext fontfamily="monospace">Congressman</mml:mtext></mml:mrow></mml:math></disp-formula>
<disp-formula id="E13"><mml:math id="M13"><mml:mrow><mml:mtext>Jim&#x02009;O&#x02019;Hara</mml:mtext><mml:mo>&#x02192;</mml:mo><mml:mtext fontfamily="monospace">Congressman</mml:mtext></mml:mrow></mml:math></disp-formula>
<disp-formula id="E14"><mml:math id="M14"><mml:mrow><mml:mtext>Pat&#x02009;McNamara</mml:mtext><mml:mo>&#x02192;</mml:mo><mml:mtext fontfamily="monospace">Senator</mml:mtext></mml:mrow></mml:math></disp-formula>
<disp-formula id="E15"><mml:math id="M15"><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:mtext mathvariant="italic">whole&#x000A0;clique</mml:mtext><mml:mo stretchy='false'>]</mml:mo><mml:mo>&#x02192;</mml:mo><mml:mtext fontfamily="monospace">Politician</mml:mtext><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>In case no category covering all nodes exists, the <monospace>Person</monospace> class is assigned. For instance, a clique containing 3 nodes labeled as <monospace>Journalist</monospace> and 2 nodes as <monospace>President</monospace> is assigned the <monospace>Person</monospace> class.</p>
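<p>The rule used to derive the clique label from the node annotations can be illustrated with a short Python sketch. It is only an approximation of the manual procedure: the <monospace>PARENT</monospace> map is a hypothetical fragment of the ontology limited to the classes mentioned in the examples, and when several equally specific classes cover the whole clique the sketch returns just one of them.</p>
<preformat>
```python
# Hypothetical fragment of the class hierarchy (child -> parent).
PARENT = {
    "Governor": "Politician",
    "Congressman": "Politician",
    "Senator": "Politician",
    "President": "Politician",
    "Politician": "Person",
    "Journalist": "Person",
    "Person": "Agent",
}


def lineage(cls):
    """Return the class followed by its ancestors, most specific first."""
    chain = [cls]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def covering_classes(classes):
    """All classes that cover a node: its own classes and their ancestors."""
    covered = set()
    for cls in classes:
        covered.update(lineage(cls))
    return covered


def gold_clique_label(node_classes):
    """Most specific class shared by every node, or Person if none exists."""
    common = set.intersection(
        *(covering_classes(classes) for classes in node_classes.values())
    )
    common.discard("Agent")
    if not common:
        return "Person"
    # The deepest class in the hierarchy has the longest lineage.
    return max(common, key=lambda cls: len(lineage(cls)))


politicians = {
    "John Swainson": {"Governor"},
    "G. Mennen Williams": {"Governor"},
    "Thaddeus Machrowicz": {"Congressman"},
    "Jim O'Hara": {"Congressman"},
    "Pat McNamara": {"Senator"},
}
# Hypothetical placeholder names for the mixed clique described in the text.
mixed = {
    "journalist 1": {"Journalist"},
    "journalist 2": {"Journalist"},
    "journalist 3": {"Journalist"},
    "president 1": {"President"},
    "president 2": {"President"},
}
print(gold_clique_label(politicians), gold_clique_label(mixed))
```
</preformat>
<p>With this fragment of the hierarchy, the first clique is labeled <monospace>Politician</monospace> and the mixed one falls back to <monospace>Person</monospace>, in line with the two examples above.</p>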
<p>The gold standard contains 833 persons overall (347 from NK, 486 from Adige) grouped into 184 cliques, only 27 of which are labeled with the <monospace>Person</monospace> category (13 from NK, 14 from Adige). This confirms our initial hypothesis that nodes sharing the same clique (i.e., persons who tend to be mentioned together in text) show a high degree of commonality. All entities in the gold standard are assigned at least one category. Since this task is performed by looking directly at the DBpedia ontology, persons who are not present in Wikipedia are also manually labeled. In case a node is ambiguous (e.g., six persons named <italic>Pat McNamara</italic> are listed in Wikipedia), the annotator looks at the textual context(s) in which the clique occurs to disambiguate the entity.</p>
</sec>
</sec>
<sec id="S5">
<label>5</label> <title>Results</title>
<p>In Table <xref ref-type="table" rid="T2">2</xref>, we report different stages of the evaluation performed by comparing the system output with the gold standards presented in Section <xref ref-type="sec" rid="S4-2">4.2</xref>. We also compare our performance with a competitive baseline: for NK, we assign to each clique the <monospace>Politician</monospace> category, given that this pertains to the domain of the corpus. For Adige, we select the most probable category by article affinity: <monospace>Athlete</monospace> for the sports section, <monospace>Artist</monospace> for cultural articles, <monospace>Politician</monospace> for the remaining ones.</p>
<p>We first evaluate the classification of the single nodes (&#x0201C;<italic>node classification</italic>&#x0201D;) by comparing the category assigned through linking with DBpedia Spotlight and the Wiki Machine to the class labels in the gold standard. Since our methodology assigns a category to a clique even if not all nodes are linked to a Wikipedia page, we also evaluate the effect of inheriting the clique class at node level (see row &#x0201C;<italic>Extending to non-linked entities</italic>&#x0201D;).</p>
<p>In addition, we assess the impact of &#x0201C;<italic>highly ambiguous entities</italic>&#x0201D; on node classification, and the effect of removing them from the nodes to be linked (&#x0201C;<italic>without highly ambiguous entities</italic>&#x0201D;). For instance, we removed from the data the node of &#x0201C;Bob Johnson,&#x0201D; which may refer to 21 different persons (see Section <xref ref-type="sec" rid="S3-2">3.2</xref> for details). Note that we report the results only for English, since this step had no effect on the Italian data, which contain no person mention with a relevance &#x0003C; 0.2. The last line for each dataset in Table <xref ref-type="table" rid="T2">2</xref> shows the performance of the system on guessing the shared class for the entire clique.</p>
<p>For each entity that needs to be classified, the evaluation is performed as proposed by Melamed and Resnik (<xref ref-type="bibr" rid="B21">2000</xref>) for a similar hierarchical categorization task. Figure <xref ref-type="fig" rid="F3">3</xref> shows an example of the evaluation. The system tries to classify the entity <italic>Dante Fascell</italic> and maps it to the ontology class <monospace>Governor</monospace>, while the correct classification is <monospace>Congressman</monospace>. The missing class (question mark) counts as a false negative (<italic>fn</italic>), the wrong class (cross) counts as a false positive (<italic>fp</italic>), and the correct class (tick) counts as a true positive (<italic>tp</italic>). As in this task we classify only people, we do not consider the true positives associated with the <monospace>Person</monospace> and <monospace>Agent</monospace> classes.<xref ref-type="fn" rid="fn7"><sup>7</sup></xref> In the example above, the classification of <italic>Dante Fascell</italic> contributes 1 <italic>tp</italic>, 1 <italic>fn</italic>, and 1 <italic>fp</italic> to the global counts. Once these counts are collected for all classifications, we calculate standard precision (<italic>p</italic>), recall (<italic>r</italic>), and <italic>F</italic><sub>1</sub>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Description of the evaluation.</p></caption>
<graphic xlink:href="fdigh-04-00022-g003.tif"/>
</fig>
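<p>The counting scheme illustrated in Figure <xref ref-type="fig" rid="F3">3</xref> can be sketched as follows. This is a minimal Python illustration, not the evaluation script used for Table <xref ref-type="table" rid="T2">2</xref>: the <monospace>PARENT</monospace> map is a hypothetical fragment of the hierarchy covering only the <italic>Dante Fascell</italic> example, and the function names are ours.</p>
<preformat>
```python
# Hypothetical fragment of the class hierarchy (child -> parent).
PARENT = {
    "Governor": "Politician",
    "Congressman": "Politician",
    "Politician": "Person",
    "Person": "Agent",
}
# Classes excluded from the counts because every entity is a person.
IGNORED = {"Person", "Agent"}


def expand(classes):
    """Add all ancestors to a set of classes and drop the ignored ones."""
    expanded = set()
    for cls in classes:
        expanded.add(cls)
        while cls in PARENT:
            cls = PARENT[cls]
            expanded.add(cls)
    return expanded - IGNORED


def hierarchical_counts(predicted, gold):
    """Return (tp, fp, fn) for one entity, counted over the hierarchy."""
    predicted_set = expand(predicted)
    gold_set = expand(gold)
    tp = len(predicted_set.intersection(gold_set))
    fp = len(predicted_set - gold_set)
    fn = len(gold_set - predicted_set)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from accumulated counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Dante Fascell: predicted Governor, gold standard Congressman.
counts = hierarchical_counts({"Governor"}, {"Congressman"})
print(counts, precision_recall_f1(*counts))
```
</preformat>
<p>For the single entity in the example, the shared ancestor <monospace>Politician</monospace> yields one true positive, while <monospace>Governor</monospace> and <monospace>Congressman</monospace> yield one false positive and one false negative, respectively. In a full evaluation, the counts would be summed over all entities before computing the scores.</p>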
<p>Results in Table <xref ref-type="table" rid="T2">2</xref> show some differences between the English and the Italian datasets. With NK, which deals with people who lived in the sixties, the performance of node classification suffers from missing links, which are due to the incomplete coverage of DBpedia Spotlight and the Wiki Machine, but also to the fact that some entities are not present in Wikipedia. However, this configuration achieves good precision. In terms of <italic>F</italic><sub>1</sub>, extending the class assigned to the clique also to non-linked entities yields a performance improvement, due to better recall. Removing highly ambiguous entities is extremely beneficial because it boosts precision as expected, especially in combination with the strategy to extend the clique class to all underlying nodes. The setting based on this combination is the best performing one, achieving an improvement with respect to basic node classification both in precision and in recall. On this corpus, the baseline assigning the <monospace>Politician</monospace> label to all nodes is very competitive because of the domain. Based on the best performing setting for node classification, we evaluated the resulting clique classification, with the goal of assigning a category to clusters of interconnected nodes and easing the network comprehension. Results show that the task achieves good results and, even if not directly comparable, classification performance is higher than on single nodes.</p>
<p>In the Adige corpus, precision is higher than in NK: most of the persons mentioned in this dataset are still living, and are therefore present in Wikipedia more often than the persons mentioned in NK in 1960. Recall, on the other hand, is lower. We investigated this issue and discovered that entities in DBpedia are often not classified with the most specific class. For example, Mattia Pellegrin is a cross-country skier and was annotated as <monospace>CrossCountrySkier</monospace> by our annotators. In DBpedia, however, the entity <monospace>Mattia_Pellegrin</monospace> is classified as <monospace>Athlete</monospace>, and this was therefore the label assigned by our system. Following the evaluation described in Figure <xref ref-type="fig" rid="F3">3</xref>, our system is penalized as it misses both <monospace>WinterSportPlayer</monospace> and <monospace>CrossCountrySkier</monospace>. For this reason, in classifying Mattia Pellegrin, the system gets 1 <italic>tp</italic> and 2 <italic>fn</italic>.</p>
<p>Since it can assign classes to cliques even if not all nodes are linked, our approach has high potential in terms of coverage. Indeed, it can cover entities that are not in Wikipedia (and in DBpedia), by guessing their class using DBpedia categories. In general terms, it may also be used to automatically extend DBpedia with new person entities. Specifically, we classified 171 new entities in NK (<italic>p</italic>&#x02009;&#x0003D;&#x02009;0.738 and <italic>F</italic><sub>1</sub>&#x02009;&#x0003D;&#x02009;0.681) and 332 entities in Adige (<italic>p</italic>&#x02009;&#x0003D;&#x02009;0.911 and <italic>F</italic><sub>1</sub>&#x02009;&#x0003D;&#x02009;0.742), for a total of 503. Given that the gold standard includes 833 entities, this means that about 60% of the entities in the two datasets (503 out of 833) are not present in Wikipedia (or are too ambiguous, see description of &#x0201C;highly ambiguous entities&#x0201D; in Section <xref ref-type="sec" rid="S3-2">3.2</xref>), and our system is capable of assigning them a DBpedia category. Our gold standard is relatively small, but if this step is launched on a large amount of data, it has the potential to significantly extend DBpedia with unseen entities, for example, persons who lived in the past and are not represented in the knowledge base. On the other hand, we are aware that the way Wikipedia is built and edited can affect the outcome of this work. In particular, Wikipedia&#x02019;s Western and English-language bias must be taken into account when using this kind of approach for studies in the digital humanities (e.g., cultural analytics), because certain persons&#x02019; categories and nationalities are more present than others.</p>
</sec>
<sec id="S6">
<label>6</label> <title>Datasets and Tool</title>
<p>The tool performing the workflow described in this paper is written in Java and released on GitHub<xref ref-type="fn" rid="fn8"><sup>8</sup></xref> under the GPL license, version 3. On our GitHub page one can find:
<list list-type="bullet">
<list-item><p>the dataset containing the original Nixon and Kennedy speech transcriptions (released under the NARA public domain license) along with the linguistic annotations applied in the preprocessing step (in NAF format (Fokkens et al., <xref ref-type="bibr" rid="B9">2014</xref>), see Section <xref ref-type="sec" rid="S3-1">3.1</xref>);</p></list-item>
<list-item><p>the annotated cliques for both datasets (NK and Adige).</p></list-item>
</list></p>
<p>Unfortunately, the Adige corpus is not publicly released, so we cannot make it available for download.</p>
</sec>
<sec id="S7">
<label>7</label> <title>Conclusion and Future Work</title>
<p>In this work, we presented an approach to extract persons&#x02019; networks from large amounts of textual data based on co-occurrence relations. Then, we introduced a methodology to identify cliques and assign them a category based on the DBpedia ontology. This additional information layer is meant to ease the interpretation of networks, especially when they are particularly large.</p>
<p>We discussed in detail several issues related to the task. First of all, dealing with textual data is challenging because persons&#x02019; mentions can be variable or inconsistent, and the proposed approach must be robust enough to tackle this problem. We rely on a well-known tool for coreference resolution and we perform mention normalization, so that all mentions referring to the same entity are recognized and assigned to the same node. We also introduced a filtering strategy based on information retrieved from Semantic Web resources, to deal with highly ambiguous entities.</p>
<p>Finally, we presented and evaluated a strategy to assign a category to the nodes in a clique and then, by generalization, to the whole clique. The approach yields good results, especially in terms of precision, both at node and at clique level. Furthermore, it is able to classify entities that are not present in Wikipedia/DBpedia and could also be used to enrich other knowledge bases, for example, Wikidata, without any supervision. The data manually annotated for the gold standards confirm the initial hypothesis that co-occurrence networks based on persons&#x02019; mentions can provide an interesting representation of the content of a document collection, and that cliques can effectively capture commonalities among co-occurring persons. To the best of our knowledge, this hypothesis has never been verified before, and the clique classification task based on the DBpedia ontology is an original contribution of this work.</p>
<p>In the future, we plan to integrate this methodology in the ALCIDE tool (Moretti et al., <xref ref-type="bibr" rid="B25">2016</xref>), which displays large persons&#x02019; networks extracted from text but suffers from low readability of the results. We also plan to improve and extend node and clique classification, for instance, by applying clique percolation (Palla et al., <xref ref-type="bibr" rid="B27">2005</xref>), a method used in social media analysis to discover relations between communities (Gregori et al., <xref ref-type="bibr" rid="B11">2011</xref>). Another research direction will deal with almost-cliques (Pei et al., <xref ref-type="bibr" rid="B32">2005</xref>) or node clusters with high (but not maximal) connectivity, so as to increase the coverage of our approach by including more entities. Finally, we would like to exploit the links connecting different Wikipedia biographies to cross-check the information automatically acquired from cliques and investigate whether this can be used to enrich the cliques with person-to-person relations.</p>
</sec>
<sec id="S8" sec-type="author-contributor">
<title>Author Contributions</title>
<p>APA and ST designed the work and wrote the paper. APA and GM implemented the software to run the experiments. SM and APA developed the datasets for evaluation.</p>
</sec>
<sec id="S9">
<title>Conflict of Interest Statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="financial-disclosure">
<p><bold>Funding.</bold> This work has been funded by Fondazione Bruno Kessler.</p></fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bron</surname> <given-names>C.</given-names></name> <name><surname>Kerbosch</surname> <given-names>J.</given-names></name></person-group> (<year>1973</year>). <article-title>Algorithm 457: finding all cliques of an undirected graph</article-title>. <source>Communications of the ACM</source> <volume>16</volume>: <fpage>575</fpage>&#x02013;<lpage>7</lpage>.<pub-id pub-id-type="doi">10.1145/362342.362367</pub-id></citation></ref>
<ref id="B2"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Corcoglioniti</surname> <given-names>F.</given-names></name> <name><surname>Rospocher</surname> <given-names>M.</given-names></name> <name><surname>Aprosio</surname> <given-names>A.P.</given-names></name></person-group> (<year>2016</year>). <article-title>A 2-phase frame-based knowledge extraction framework</article-title>. In <conf-name>Proc. of ACM Symposium on Applied Computing (SAC&#x02019;16)</conf-name>. <conf-loc>Pisa, Italy</conf-loc>.</citation></ref>
<ref id="B3"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Daiber</surname> <given-names>J.</given-names></name> <name><surname>Jakob</surname> <given-names>M.</given-names></name> <name><surname>Hokamp</surname> <given-names>C.</given-names></name> <name><surname>Mendes</surname> <given-names>P.N.</given-names></name></person-group> (<year>2013</year>). <article-title>Improving efficiency and accuracy in multilingual entity extraction</article-title>. In <conf-name>Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)</conf-name>, <conf-loc>Graz, Austria</conf-loc>.</citation></ref>
<ref id="B4"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Diesner</surname> <given-names>J.</given-names></name> <name><surname>Carley</surname> <given-names>K.</given-names></name></person-group> (<year>2009</year>). <article-title>He says, she says. Pat says, Tricia says. How much reference resolution matters for entity extraction, relation extraction, and social network analysis</article-title>. In <conf-name>Computational Intelligence for Security and Defense Applications, 2009. CISDA 2009. IEEE Symposium on</conf-name>, <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <conf-loc>Ottawa, Canada</conf-loc>.</citation></ref>
<ref id="B5"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Diesner</surname> <given-names>J.</given-names></name> <name><surname>Evans</surname> <given-names>C.S.</given-names></name> <name><surname>Kim</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Impact of entity disambiguation errors on social network properties</article-title>. In <conf-name>Proceedings of the Ninth International Conference on Web and Social Media, ICWSM 2015</conf-name>, <fpage>81</fpage>&#x02013;<lpage>90</lpage>. <conf-loc>Oxford, UK</conf-loc>: <conf-sponsor>University of Oxford</conf-sponsor>.</citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Downey</surname> <given-names>R.G.</given-names></name> <name><surname>Fellows</surname> <given-names>M.R.</given-names></name></person-group> (<year>1995</year>). <article-title>Fixed-parameter tractability and completeness II: on completeness for W[1]</article-title>. <source>Theoretical Computer Science</source>, <volume>141</volume>: <fpage>109</fpage>&#x02013;<lpage>31</lpage>.<pub-id pub-id-type="doi">10.1016/0304-3975(94)00097-3</pub-id></citation></ref>
<ref id="B7"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Elson</surname> <given-names>D.K.</given-names></name> <name><surname>Dames</surname> <given-names>N.</given-names></name> <name><surname>McKeown</surname> <given-names>K.R.</given-names></name></person-group> (<year>2010</year>). <article-title>Extracting social networks from literary fiction</article-title>. In <conf-name>Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL &#x02019;10</conf-name>, <fpage>138</fpage>&#x02013;<lpage>147</lpage>. <conf-loc>Stroudsburg, PA, USA</conf-loc>: <conf-sponsor>Association for Computational Linguistics</conf-sponsor>.</citation></ref>
<ref id="B8"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Finkel</surname> <given-names>J.R.</given-names></name> <name><surname>Grenager</surname> <given-names>T.</given-names></name> <name><surname>Manning</surname> <given-names>C.</given-names></name></person-group> (<year>2005</year>). <article-title>Incorporating non-local information into information extraction systems by Gibbs sampling</article-title>. In <conf-name>Proceedings of ACL &#x02019;05</conf-name>, <fpage>363</fpage>&#x02013;<lpage>370</lpage>. <conf-loc>Ann Arbor, USA</conf-loc>: <conf-sponsor>Association for Computational Linguistics</conf-sponsor>.</citation></ref>
<ref id="B9"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Fokkens</surname> <given-names>A.</given-names></name> <name><surname>Soroa</surname> <given-names>A.</given-names></name> <name><surname>Beloki</surname> <given-names>Z.</given-names></name> <name><surname>Ockeloen</surname> <given-names>N.</given-names></name> <name><surname>Rigau</surname> <given-names>G.</given-names></name> <name><surname>van Hage</surname> <given-names>W.R.</given-names></name> <etal/></person-group> (<year>2014</year>). <article-title>NAF and GAF: linking linguistic annotations</article-title>. In <conf-name>Proceedings 10th Joint ISO-ACL SIGSEM Workshop on Interoperable Semantic Annotation</conf-name>, <fpage>9</fpage>&#x02013;<lpage>16</lpage>. <conf-loc>Reykjavik, Iceland</conf-loc>.</citation></ref>
<ref id="B10"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Grabowicz</surname> <given-names>P.A.</given-names></name> <name><surname>Aiello</surname> <given-names>L.M.</given-names></name> <name><surname>Eguiluz</surname> <given-names>V.M.</given-names></name> <name><surname>Jaimes</surname> <given-names>A.</given-names></name></person-group> (<year>2013</year>). <article-title>Distinguishing topical and social groups based on common identity and bond theory</article-title>. In <conf-name>Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM &#x02019;13</conf-name>, <fpage>627</fpage>&#x02013;<lpage>636</lpage>. <conf-loc>New York, NY, USA</conf-loc>: <conf-sponsor>ACM</conf-sponsor>.</citation></ref>
<ref id="B11"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Gregori</surname> <given-names>E.</given-names></name> <name><surname>Lenzini</surname> <given-names>L.</given-names></name> <name><surname>Orsini</surname> <given-names>C.</given-names></name></person-group> (<year>2011</year>). <article-title>k-clique communities in the Internet AS-level topology graph</article-title>. In <conf-name>Distributed Computing Systems Workshops (ICDCSW), 2011 31st International Conference on</conf-name>, <fpage>134</fpage>&#x02013;<lpage>139</lpage>. <conf-loc>Minneapolis, USA</conf-loc>.</citation></ref>
<ref id="B12"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Hasegawa</surname> <given-names>T.</given-names></name> <name><surname>Sekine</surname> <given-names>S.</given-names></name> <name><surname>Grishman</surname> <given-names>R.</given-names></name></person-group> (<year>2004</year>). <article-title>Discovering relations among named entities from large corpora</article-title>. In <conf-name>Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL &#x02019;04</conf-name>, <conf-loc>Stroudsburg, PA, USA</conf-loc>: <conf-sponsor>Association for Computational Linguistics</conf-sponsor>.</citation></ref>
<ref id="B13"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Henderson</surname> <given-names>K.</given-names></name> <name><surname>Eliassi-Rad</surname> <given-names>T.</given-names></name></person-group> (<year>2009</year>). <article-title>Applying latent Dirichlet allocation to group discovery in large graphs</article-title>. In <conf-name>Proceedings of the 2009 ACM Symposium on Applied Computing, SAC &#x02019;09</conf-name>, <fpage>1456</fpage>&#x02013;<lpage>1461</lpage>. <conf-loc>New York, NY, USA</conf-loc>: <conf-sponsor>ACM</conf-sponsor>.</citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jin</surname> <given-names>L.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>T.</given-names></name> <name><surname>Hui</surname> <given-names>P.</given-names></name> <name><surname>Vasilakos</surname> <given-names>A.</given-names></name></person-group> (<year>2013</year>). <article-title>Understanding user behavior in online social networks: a survey</article-title>. <source>IEEE Communications Magazine</source> <volume>51</volume>: <fpage>144</fpage>&#x02013;<lpage>50</lpage>.<pub-id pub-id-type="doi">10.1109/MCOM.2013.6588663</pub-id></citation></ref>
<ref id="B15"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Kobilarov</surname> <given-names>G.</given-names></name> <name><surname>Scott</surname> <given-names>T.</given-names></name> <name><surname>Raimond</surname> <given-names>Y.</given-names></name> <name><surname>Oliver</surname> <given-names>S.</given-names></name> <name><surname>Sizemore</surname> <given-names>C.</given-names></name> <name><surname>Smethurst</surname> <given-names>M.</given-names></name> <etal/></person-group> (<year>2009</year>). <article-title>Media meets semantic web &#x02013; how the BBC uses DBpedia and linked data to make connections</article-title>. In <conf-name>The Semantic Web: Research and Applications: 6th European Semantic Web Conference, ESWC 2009</conf-name>, <fpage>723</fpage>&#x02013;<lpage>737</lpage>. <conf-loc>Heraklion, Crete, Greece</conf-loc>: <conf-sponsor>Springer Berlin Heidelberg</conf-sponsor>.</citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Koper</surname> <given-names>R.</given-names></name></person-group> (<year>2004</year>). <article-title>Use of the semantic web to solve some basic problems in education: increase flexible, distributed lifelong learning, decrease teacher&#x02019;s workload</article-title>. <source>Journal of Interactive Media in Education</source> <volume>2004</volume>, <fpage>1</fpage>&#x02013;<lpage>23</lpage>.<pub-id pub-id-type="doi">10.5334/2004-6-koper</pub-id></citation></ref>
<ref id="B17"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Kuter</surname> <given-names>U.</given-names></name> <name><surname>Golbeck</surname> <given-names>J.</given-names></name></person-group> (<year>2009</year>). <article-title>Semantic web service composition in social environments</article-title>. In <conf-name>The Semantic Web &#x02013; ISWC 2009: 8th International Semantic Web Conference, ISWC 2009</conf-name>, <fpage>344</fpage>&#x02013;<lpage>358</lpage>. <conf-loc>Chantilly, VA, USA</conf-loc>: <conf-sponsor>Springer Berlin Heidelberg</conf-sponsor>.</citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lehmann</surname> <given-names>J.</given-names></name> <name><surname>Isele</surname> <given-names>R.</given-names></name> <name><surname>Jakob</surname> <given-names>M.</given-names></name> <name><surname>Jentzsch</surname> <given-names>A.</given-names></name> <name><surname>Kontokostas</surname> <given-names>D.</given-names></name> <name><surname>Mendes</surname> <given-names>P.N.</given-names></name> <etal/></person-group> (<year>2015</year>). <article-title>DBpedia &#x02013; a large-scale, multilingual knowledge base extracted from Wikipedia</article-title>. <source>Semantic Web Journal</source> <volume>6</volume>: <fpage>167</fpage>&#x02013;<lpage>195</lpage>.<pub-id pub-id-type="doi">10.3233/SW-140134</pub-id></citation></ref>
<ref id="B19"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Manning</surname> <given-names>C.D.</given-names></name> <name><surname>Surdeanu</surname> <given-names>M.</given-names></name> <name><surname>Bauer</surname> <given-names>J.</given-names></name> <name><surname>Finkel</surname> <given-names>J.</given-names></name> <name><surname>Bethard</surname> <given-names>S.J.</given-names></name> <name><surname>McClosky</surname> <given-names>D.</given-names></name></person-group> (<year>2014</year>). <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>. In <source>Association for Computational Linguistics (ACL) System Demonstrations</source>, <fpage>55</fpage>&#x02013;<lpage>60</lpage>. <publisher-loc>Baltimore, USA</publisher-loc>.</citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>McAuley</surname> <given-names>J.</given-names></name> <name><surname>Leskovec</surname> <given-names>J.</given-names></name></person-group> (<year>2014</year>). <article-title>Discovering social circles in ego networks</article-title>. <source>ACM Transactions on Knowledge Discovery from Data</source> <volume>8</volume>: <fpage>28</fpage>.<pub-id pub-id-type="doi">10.1145/2556612</pub-id></citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Melamed</surname> <given-names>I.D.</given-names></name> <name><surname>Resnik</surname> <given-names>P.</given-names></name></person-group> (<year>2000</year>). <article-title>Tagger evaluation given hierarchical tag sets</article-title>. <source>Computers and the Humanities</source> <volume>34</volume>: <fpage>79</fpage>&#x02013;<lpage>84</lpage>.<pub-id pub-id-type="doi">10.1023/A:1002402902356</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Milojevi&#x00107;</surname> <given-names>S.</given-names></name></person-group> (<year>2013</year>). <article-title>Accuracy of simple, initials-based methods for author name disambiguation</article-title>. <source>Journal of Informetrics</source> <volume>7</volume>: <fpage>767</fpage>&#x02013;<lpage>73</lpage>.<pub-id pub-id-type="doi">10.1016/j.joi.2013.06.006</pub-id></citation></ref>
<ref id="B23"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Mitchell</surname> <given-names>A.</given-names></name> <name><surname>Strassel</surname> <given-names>S.</given-names></name> <name><surname>Przybocki</surname> <given-names>M.</given-names></name> <name><surname>Davis</surname> <given-names>J.</given-names></name> <name><surname>Doddington</surname> <given-names>G.</given-names></name> <name><surname>Grishman</surname> <given-names>R.</given-names></name> <etal/></person-group> (<year>2002</year>). <source>ACE-2 Version 1.0. LDC2003T11</source>. <publisher-loc>Philadelphia, USA</publisher-loc>: <publisher-name>Linguistic Data Consortium</publisher-name>.</citation></ref>
<ref id="B24"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Moretti</surname> <given-names>F.</given-names></name></person-group> (<year>2013</year>). <source>Distant Reading</source>. <publisher-loc>London</publisher-loc>: <publisher-name>Verso</publisher-name>.</citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moretti</surname> <given-names>G.</given-names></name> <name><surname>Sprugnoli</surname> <given-names>R.</given-names></name> <name><surname>Menini</surname> <given-names>S.</given-names></name> <name><surname>Tonelli</surname> <given-names>S.</given-names></name></person-group> (<year>2016</year>). <article-title>ALCIDE: extracting and visualising content from large document collections to support humanities studies</article-title>. <source>Knowledge Based Systems</source> <volume>111</volume>: <fpage>100</fpage>&#x02013;<lpage>12</lpage>.<pub-id pub-id-type="doi">10.1016/j.knosys.2016.08.003</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>&#x000D6;zg&#x000FC;r</surname> <given-names>A.</given-names></name> <name><surname>Cetin</surname> <given-names>B.</given-names></name> <name><surname>Bingol</surname> <given-names>H.</given-names></name></person-group> (<year>2008</year>). <article-title>Co-occurrence network of Reuters news</article-title>. <source>International Journal of Modern Physics C</source> <volume>19</volume>: <fpage>689</fpage>&#x02013;<lpage>702</lpage>.<pub-id pub-id-type="doi">10.1142/S0129183108012431</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Palla</surname> <given-names>G.</given-names></name> <name><surname>Der&#x000E9;nyi</surname> <given-names>I.</given-names></name> <name><surname>Farkas</surname> <given-names>I.</given-names></name> <name><surname>Vicsek</surname> <given-names>T.</given-names></name></person-group> (<year>2005</year>). <article-title>Uncovering the overlapping community structure of complex networks in nature and society</article-title>. <source>Nature</source> <volume>435</volume>: <fpage>814</fpage>&#x02013;<lpage>8</lpage>.<pub-id pub-id-type="doi">10.1038/nature03607</pub-id><pub-id pub-id-type="pmid">15944704</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Palmero Aprosio</surname> <given-names>A.</given-names></name> <name><surname>Giuliano</surname> <given-names>C.</given-names></name></person-group> (<year>2016</year>). <article-title>The Wiki Machine: an open source software for entity linking and enrichment</article-title>. <source>FBK Technical report</source>.</citation></ref>
<ref id="B29"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Palmero Aprosio</surname> <given-names>A.</given-names></name> <name><surname>Giuliano</surname> <given-names>C.</given-names></name> <name><surname>Lavelli</surname> <given-names>A.</given-names></name></person-group> (<year>2013a</year>). <article-title>Automatic expansion of DBpedia exploiting Wikipedia cross-language information</article-title>. In <source>ESWC, Volume 7882 of Lecture Notes in Computer Science</source>, Edited by <person-group person-group-type="editor"><name><surname>Cimiano</surname> <given-names>P.</given-names></name> <name><surname>Corcho</surname> <given-names>V.</given-names></name> <name><surname>Presutti</surname> <given-names>L.</given-names></name> <name><surname>Hollink</surname> <given-names>L.</given-names></name> <name><surname>Rudolph</surname> <given-names>S.</given-names></name></person-group>, <fpage>397</fpage>&#x02013;<lpage>411</lpage>. <publisher-loc>Berlin, Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation></ref>
<ref id="B30"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Palmero Aprosio</surname> <given-names>A.</given-names></name> <name><surname>Giuliano</surname> <given-names>C.</given-names></name> <name><surname>Lavelli</surname> <given-names>A.</given-names></name></person-group> (<year>2013b</year>). <article-title>Automatic mapping of Wikipedia templates for fast deployment of localised DBpedia datasets</article-title>. In <conf-name>Proceedings of the 13th International Conference on Knowledge Management and Knowledge Technologies, i-Know &#x02019;13</conf-name>, <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <conf-loc>New York, NY, USA</conf-loc>: <conf-sponsor>ACM</conf-sponsor>.</citation></ref>
<ref id="B31"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Palmero Aprosio</surname> <given-names>A.</given-names></name> <name><surname>Moretti</surname> <given-names>G.</given-names></name></person-group> (<year>2016</year>). <article-title>Italy goes to Stanford: a collection of CoreNLP modules for Italian</article-title>. <source>ArXiv e-prints</source>.</citation></ref>
<ref id="B32"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Pei</surname> <given-names>J.</given-names></name> <name><surname>Jiang</surname> <given-names>D.</given-names></name> <name><surname>Zhang</surname> <given-names>A.</given-names></name></person-group> (<year>2005</year>). <article-title>On mining cross-graph quasi-cliques</article-title>. In <conf-name>Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD &#x02019;05</conf-name>, <fpage>228</fpage>&#x02013;<lpage>238</lpage>. <conf-loc>New York, NY, USA</conf-loc>: <conf-sponsor>ACM</conf-sponsor>.</citation></ref>
<ref id="B33"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Rizzo</surname> <given-names>G.</given-names></name> <name><surname>Troncy</surname> <given-names>R.</given-names></name></person-group> (<year>2012</year>). <article-title>NERD: a framework for unifying named entity recognition and disambiguation extraction tools</article-title>. In <conf-name>Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL &#x02019;12</conf-name>, <fpage>73</fpage>&#x02013;<lpage>76</lpage>. <conf-loc>Stroudsburg, PA, USA</conf-loc>: <conf-sponsor>Association for Computational Linguistics</conf-sponsor>.</citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sudhahar</surname> <given-names>S.</given-names></name> <name><surname>Veltri</surname> <given-names>G.A.</given-names></name> <name><surname>Cristianini</surname> <given-names>N.</given-names></name></person-group> (<year>2015</year>). <article-title>Automated analysis of the US presidential elections using big data and network analysis</article-title>. <source>Big Data and Society</source> <volume>2</volume>:<fpage>1</fpage>&#x02013;<lpage>28</lpage>.<pub-id pub-id-type="doi">10.1177/2053951715572916</pub-id></citation></ref>
<ref id="B35"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Veling</surname> <given-names>A.</given-names></name> <name><surname>Van Der Weerd</surname> <given-names>P.</given-names></name></person-group> (<year>1999</year>). <article-title>Conceptual grouping in word co-occurrence networks</article-title>. In <conf-name>Proceedings of the 16th International Joint Conference on Artificial Intelligence &#x02013; Volume 2, IJCAI&#x02019;99</conf-name>, <fpage>694</fpage>&#x02013;<lpage>699</lpage>. <conf-loc>San Francisco, CA, USA</conf-loc>: <conf-sponsor>Morgan Kaufmann Publishers Inc</conf-sponsor>.</citation></ref>
</ref-list>
<fn-group>
<fn id="fn1"><p><sup>1</sup><uri xlink:href="http://republicofletters.stanford.edu/">http://republicofletters.stanford.edu/</uri>.</p></fn>
<fn id="fn2"><p><sup>2</sup>DBpedia Spotlight has a reported accuracy of 0.85 on English and 0.78 on Italian. For the Wiki Machine, the reference paper reports a precision of 0.78, a recall of 0.74, and an F1 of 0.76 on English (no evaluation is provided for Italian).</p></fn>
<fn id="fn3"><p><sup>3</sup>Although the choice of a sentence window is arbitrary, the same boundary is commonly adopted when manually annotating relations in benchmarks (Mitchell et al., <xref ref-type="bibr" rid="B23">2002</xref>; Hasegawa et al., <xref ref-type="bibr" rid="B12">2004</xref>).</p></fn>
<fn id="fn4"><p><sup>4</sup>The transcriptions of the speeches are made available online by John T. Woolley and Gerhard Peters, The American Presidency Project (<uri xlink:href="http://www.presidency.ucsb.edu/1960_election.php">http://www.presidency.ucsb.edu/1960_election.php</uri>).</p></fn>
<fn id="fn5"><p><sup>5</sup><uri xlink:href="http://www.ladige.it/">http://www.ladige.it/</uri>.</p></fn>
<fn id="fn6"><p><sup>6</sup><uri xlink:href="http://jgrapht.org/">http://jgrapht.org/</uri>.</p></fn>
<fn id="fn7"><p><sup>7</sup>See <uri xlink:href="http://mappings.dbpedia.org/server/ontology/classes/">http://mappings.dbpedia.org/server/ontology/classes/</uri> for a hierarchical representation of the DBpedia ontology classes.</p></fn>
<fn id="fn8"><p><sup>8</sup><uri xlink:href="https://github.com/dkmfbk/cliques">https://github.com/dkmfbk/cliques</uri>.</p></fn>
</fn-group>
</back>
</article>