<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Digit. Humanit.</journal-id>
<journal-title>Frontiers in Digital Humanities</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Digit. Humanit.</abbrev-journal-title>
<issn pub-type="epub">2297-2668</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdigh.2017.00002</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Digital Humanities</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Studying Linguistic Changes over 200 Years of Newspapers through Resilient Words Analysis</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Buntinx</surname> <given-names>Vincent</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="cor1">&#x0002A;</xref>
<uri xlink:href="http://frontiersin.org/people/u/391107"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Bornet</surname> <given-names>Cyril</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://frontiersin.org/people/u/200765"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Kaplan</surname> <given-names>Fr&#x000E9;d&#x000E9;ric</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://frontiersin.org/people/u/85"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Digital Humanities Laboratory (DHLAB), Swiss Federal Institute of Technology</institution>, <addr-line>Lausanne</addr-line>, <country>Switzerland</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Taha Yasseri, University of Oxford, UK</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Martin Gerlach, Northwestern University, USA; Tom Nicholls, University of Oxford, UK</p></fn>
<corresp content-type="corresp" id="cor1">&#x0002A;Correspondence: Vincent Buntinx, <email>vincent.buntinx&#x00040;epfl.ch</email></corresp>
<fn fn-type="other" id="fn002"><p>Specialty section: This article was submitted to Big Data, a section of the journal Frontiers in Digital Humanities</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>03</day>
<month>02</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="collection">
<year>2017</year>
</pub-date>
<volume>4</volume>
<elocation-id>2</elocation-id>
<history>
<date date-type="received">
<day>10</day>
<month>11</month>
<year>2016</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>01</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2017 Buntinx, Bornet and Kaplan.</copyright-statement>
<copyright-year>2017</copyright-year>
<copyright-holder>Buntinx, Bornet and Kaplan</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>This paper presents a methodology for analyzing linguistic changes in a given textual corpus that overcomes two common problems in corpus linguistics studies. One of these issues is the monotonic increase of the corpus size over time, and the other is the presence of noise in the textual data. In addition, our method makes it possible to better target the linguistic evolution of the corpus, rather than other aspects such as noise fluctuation or topic evolution. A corpus formed by two newspapers, &#x0201C;La Gazette de Lausanne&#x0201D; and &#x0201C;Le Journal de Gen&#x000E8;ve,&#x0201D; is used, providing 4 million articles from 200&#x02009;years of archives. We first perform some classical measurements on this corpus in order to provide indicators and visualizations of linguistic evolution. We then define the concepts of lexical kernel and word resilience to address the two challenges of noise and corpus size fluctuations. The paper ends with a discussion comparing the results of the linguistic change analyses and concludes with possible future work in this direction.</p>
</abstract>
<kwd-group>
<kwd>linguistic change</kwd>
<kwd>corpus studies</kwd>
<kwd>newspapers archives</kwd>
<kwd>textual distance</kwd>
<kwd>corpora kernel</kwd>
<kwd>word resilience</kwd>
</kwd-group>
<contract-num rid="cn01">149758</contract-num>
<contract-sponsor id="cn01">Schweizerischer Nationalfonds zur F&#x000F6;rderung der Wissenschaftlichen Forschung<named-content content-type="fundref-id">10.13039/501100001711</named-content></contract-sponsor>
<counts>
<fig-count count="10"/>
<table-count count="0"/>
<equation-count count="1"/>
<ref-count count="24"/>
<page-count count="10"/>
<word-count count="4998"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="introduction">
<label>1</label> <title>Introduction</title>
<p>This research investigates methods for studying linguistic evolution using a corpus of scanned newspapers, continuing the work presented in a conference paper (Buntinx et al., <xref ref-type="bibr" rid="B4">2016</xref>). Quantifying language change in large corpora is a problem that has been widely addressed since large textual databases became available. One commonly used method is to compute a distance measure between subsets of the corpora and to analyze the temporal evolution of this measure. In Bochkarev et al. (<xref ref-type="bibr" rid="B2">2014</xref>), the authors used the Kullback&#x02013;Leibler divergence in the form of a symmetrized relative entropy between two sets of word frequencies. They applied this measure to the Google Books N-Gram Corpus (Michel et al., <xref ref-type="bibr" rid="B15">2011</xref>) in order to compute lexical evolution for multiple languages. Other studies (Pechenick et al., <xref ref-type="bibr" rid="B17">2015a</xref>,<xref ref-type="bibr" rid="B18">b</xref>) have used the Google Books Corpus, computing the Kullback&#x02013;Leibler and Jensen&#x02013;Shannon divergences. They analyzed the specific contributions of the most frequent words to the distance in order to combine quantitative and qualitative analysis. Another work (Cocho et al., <xref ref-type="bibr" rid="B6">2015</xref>) used the frequency rank evolution of words and addressed linguistic change through the concept of the rank diversity of languages. In a recent work, physicists and mathematicians used a generalized entropy on symbolic sequences with heavy-tailed frequency distributions (Gerlach et al., <xref ref-type="bibr" rid="B8">2016</xref>). Their method is particularly well suited to the word distributions of textual corpora, which follow the well-known Zipf law (Zipf, <xref ref-type="bibr" rid="B24">1935</xref>; Piantadosi, <xref ref-type="bibr" rid="B19">2014</xref>). 
The corpus we used is composed of 4 million press articles, indirectly documenting the evolution of written language over about 200&#x02009;years of archives. It is made out of digitized facsimiles of Le Journal de Gen&#x000E8;ve (1826&#x02013;1997) and La Gazette de Lausanne (1804&#x02013;1997). For each newspaper, the daily scanned issues were algorithmically transcribed using an optical character recognition (OCR) system. The whole archive represents more than 20&#x02009;TB of scanned data (including text, metadata, PDF, and images) and contains about two billion words, placing its study beyond the capabilities of the usual analysis techniques available on regular desktop computers. This corpus has already been the subject of several studies (Buntinx and Kaplan, <xref ref-type="bibr" rid="B5">2015</xref>; Buntinx et al., <xref ref-type="bibr" rid="B4">2016</xref>; Rochat et al., <xref ref-type="bibr" rid="B20">2016</xref>). The corpus can easily be divided into subsets corresponding to the year of publication. However, the number of pages and their content fluctuate greatly depending on the year, ranging from 280,000 words per year in the early 19th century to about 18 million in the later years of the 20th century. Figure <xref ref-type="fig" rid="F1">1</xref> shows the relative size of each subset in terms of the number of words per year for Le Journal de Gen&#x000E8;ve (JDG) and La Gazette de Lausanne (GDL). The textual data contain OCR errors and present other potential perturbations due to the nature of some of the content (noise). For example, bus schedules, stock market listings, or cinema tables contain repeated words that serve their informative purpose but do not reflect linguistic evolution. This corpus must therefore be considered potentially noisy. Some periods, such as 1900 to 1915 for the JDG and 1965 to 1998 for both newspapers, present higher noise levels than others. 
It is usual to apply a frequency filter in order to manage this problem. The main contribution of this work is the design of a robust method for measuring linguistic changes that avoids possible misinterpretations due to noise fluctuations and corpus size variations.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p><bold>Corpus size versus years for GDL (top) and JDG (bottom)</bold>.</p></caption>
<graphic xlink:href="fdigh-04-00002-g001.tif"/>
</fig>
<p>Considering the lack of data for Le Journal de Gen&#x000E8;ve for the years 1837, 1917, 1918, and 1919, we left these years out of all further graphs and analyses. In addition, some years had to be removed because the scanning quality was too poor (1834, 1835, 1859, and 1860 for JDG and 1808 for GDL).</p>
</sec>
<sec id="S2">
<label>2</label> <title>Using Classical Distances to Study Linguistic Drift</title>
<p>A straightforward approach to the problem consists in computing a textual distance between subsets of the corpora. One could, for instance, easily compute the so-called Jaccard distance (Jaccard, <xref ref-type="bibr" rid="B9">1901</xref>, <xref ref-type="bibr" rid="B10">1912</xref>) between two lexical sets. Considering two different corpora <italic>C</italic><sub>1</sub> and <italic>C</italic><sub>2</sub>, and their lexica, i.e., the list of unique (non-lemmatized) words, <italic>L</italic>(<italic>C</italic><sub>1</sub>)&#x02009;&#x02261;&#x02009;<italic>L</italic><sub>1</sub> and <italic>L</italic>(<italic>C</italic><sub>2</sub>)&#x02009;&#x02261;&#x02009;<italic>L</italic><sub>2</sub>, the Jaccard distance <italic>d</italic>(<italic>L</italic><sub>1</sub>, <italic>L</italic><sub>2</sub>) is defined as follows:
<disp-formula id="E1"><mml:math id="M1"><mml:mrow><mml:mi>d</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x02229;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x0007C;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x0222A;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x02229;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x0007C;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x0007C;</mml:mo><mml:mo>+</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x0007C;</mml:mo><mml:mtext>&#x02009;</mml:mtext><mml:mo>&#x02212;</mml:mo><mml:mtext>&#x02009;</mml:mtext><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x02229;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula></p>
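<p>As an illustration, the Jaccard distance above can be computed directly from two word sets. The following is a minimal sketch; the toy lexica are hypothetical and serve only to show the computation:</p>

```python
def jaccard_distance(lexicon1, lexicon2):
    """1 - |L1 n L2| / |L1 u L2| over two lexica (sets of unique words)."""
    l1, l2 = set(lexicon1), set(lexicon2)
    # |L1 u L2| = |L1| + |L2| - |L1 n L2|, matching the second form of the formula
    return 1 - len(l1 & l2) / len(l1 | l2)

# Hypothetical toy lexica: only "de" is shared (1 word out of 7 in the union).
d = jaccard_distance({"le", "journal", "de", "geneve"},
                     {"la", "gazette", "de", "lausanne"})
print(round(d, 3))  # 0.857
```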
<p>In the same way, other distances could also be explored, such as those given by Kullback and Leibler (Kullback and Leibler, <xref ref-type="bibr" rid="B13">1951</xref>; Kullback, <xref ref-type="bibr" rid="B12">1987</xref>), Chi-squared distance (Sakoda, <xref ref-type="bibr" rid="B21">1981</xref>), or Cosine similarity (Singhal, <xref ref-type="bibr" rid="B22">2001</xref>).</p>
<p>The Jaccard distance is an intuitive measure that determines the similarity of two texts using the relative size of their common lexicon. This distance, which is complementary to the notion of lexical connexion (Muller, <xref ref-type="bibr" rid="B16">1980</xref>), is exclusively based on the presence/absence of words in the lexicon and ignores their frequency.</p>
<p>The Jaccard distance is a metric (Levandowsky and Winter, <xref ref-type="bibr" rid="B14">1971</xref>) satisfying the following classical distance properties:
<list list-type="bullet">
<list-item><p>Separation: <italic>d</italic>(<italic>L</italic><sub>1</sub>, <italic>L</italic><sub>2</sub>)&#x02009;&#x0003D;&#x02009;0&#x02009;&#x02261;&#x02009;<italic>L</italic><sub>1</sub>&#x02009;&#x0003D;&#x02009;<italic>L</italic><sub>2</sub>;</p></list-item>
<list-item><p>Symmetry: <italic>d</italic>(<italic>L</italic><sub>1</sub>, <italic>L</italic><sub>2</sub>)&#x02009;&#x0003D;&#x02009;<italic>d</italic>(<italic>L</italic><sub>2</sub>, <italic>L</italic><sub>1</sub>);</p></list-item>
<list-item><p>Triangular inequality: <italic>d</italic>(<italic>L</italic><sub>1</sub>, <italic>L</italic><sub>3</sub>)&#x02009;&#x02264;&#x02009;<italic>d</italic>(<italic>L</italic><sub>1</sub>, <italic>L</italic><sub>2</sub>)&#x02009;&#x0002B;&#x02009;<italic>d</italic>(<italic>L</italic><sub>2</sub>, <italic>L</italic><sub>3</sub>).</p></list-item>
</list></p>
<p>Since the Jaccard distance is based only on the presence/absence of words in the corpus subsets, noise can affect the measurement of linguistic evolution. In order to reduce this effect, <italic>L</italic>(<italic>C</italic><sub>1</sub>) and <italic>L</italic>(<italic>C</italic><sub>2</sub>) are filtered to keep only the words whose relative frequency is greater than 1/100,000. However, this frequency threshold is quite arbitrary, and the filtered data still present OCR errors and noise. The computation of the Jaccard distance between all subsets yields a symmetric <italic>M</italic>&#x02009;&#x000D7;&#x02009;<italic>M</italic> matrix, where <italic>M</italic> is the number of distinct years for a given newspaper. This matrix contains all distances between each pair of years <italic>L</italic>(<italic>C<sub>i</sub></italic>), <italic>L</italic>(<italic>C<sub>j</sub></italic>), normalized to the interval [0, 1]. The heatmaps of the Jaccard distance matrices of Le Journal de Gen&#x000E8;ve (JDG) and La Gazette de Lausanne (GDL) are given in Figure <xref ref-type="fig" rid="F2">2</xref>.</p>
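<p>The per-year frequency filtering and the construction of the distance matrix can be sketched as follows; <monospace>tokens_by_year</monospace> is a hypothetical mapping from a year to the token list of that yearly subset, not part of the original pipeline:</p>

```python
from collections import Counter

def filtered_lexicon(tokens, threshold=1e-5):
    """Keep only words whose relative frequency exceeds the 1/100,000 threshold."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w for w, c in counts.items() if c / total > threshold}

def jaccard(l1, l2):
    """Jaccard distance between two lexica."""
    return 1 - len(l1 & l2) / len(l1 | l2)

def jaccard_matrix(tokens_by_year, threshold=1e-5):
    """Symmetric M x M matrix of Jaccard distances between yearly lexica."""
    years = sorted(tokens_by_year)
    lexica = {y: filtered_lexicon(tokens_by_year[y], threshold) for y in years}
    return [[jaccard(lexica[a], lexica[b]) for b in years] for a in years]
```

The resulting matrix is zero on the diagonal and symmetric, as required by the separation and symmetry properties listed above.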
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p><bold>Heatmap of the Jaccard distance matrix of GDL (left) and JDG (right)</bold>.</p></caption>
<graphic xlink:href="fdigh-04-00002-g002.tif"/>
</fig>
<p>The values on the matrix&#x02019;s diagonal are equal to zero by definition (the separation property). We observe the expected behavior of the values outside the diagonal, which should be highly correlated with the difference between the compared years. In addition, the level lines of the heatmap suggest the hypothesis that the linguistic evolution is not linear but proceeds period by period. Indeed, in the case of a linear evolution, the level lines would be parallel to the diagonal of the matrix. The same data are presented in a more convenient form in Figure <xref ref-type="fig" rid="F3">3</xref>. We have plotted the matrix&#x02019;s values in a two-dimensional graph showing the distance values versus the time differences between subsets (blue) with the mean value over time (red). In this representation, we observe that the distances seem to be overall proportional to the number of years separating the two subsets. This observation immediately suggests that the linguistic drift exists and can be quantified by the Jaccard distance. The more time separates the textual corpora, the more distant the subsets are indeed considered to be. However, Figure <xref ref-type="fig" rid="F1">1</xref> shows that the corpus size is correlated with time and can play the role of a hidden variable, affecting the distance value more than the amount of time separating the subcorpora alone. Two windows of time are particularly sensitive in terms of size fluctuation: the period before 1870 (with very low data representativity) and the period after 1965 (showing a sudden increase in the corpus size).</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p><bold>Jaccard distance (blue) and mean of distances (red) versus the time difference (in number of years) between the compared subset from the GDL (left) and JDG (right)</bold>.</p></caption>
<graphic xlink:href="fdigh-04-00002-g003.tif"/>
</fig>
<p>If we restrict the Jaccard distance matrix to the data from the most stable years in terms of size and recompute the visualization of Figure <xref ref-type="fig" rid="F3">3</xref>, we observe the same evolution of the Jaccard distance. As shown in Figure <xref ref-type="fig" rid="F3">3</xref>, the mean of distances (red curve) is more sensitive to the first years of separation for the two newspapers. In order to measure the evolution of linguistic changes and to clarify whether these changes are accelerating, decelerating, or stable, we show a final visualization of the distance matrix by plotting only the distances between years <italic>y<sub>i</sub></italic> and <italic>y<sub>i&#x0002B;n</sub></italic> with <italic>n</italic> equal to 1, 20, 50, and 100 in Figure <xref ref-type="fig" rid="F4">4</xref>.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p><bold>Jaccard distance between the years <italic>y<sub>i</sub></italic> and <italic>y<sub>i</sub></italic><sub>&#x0002B;</sub><italic><sub>n</sub></italic> with <italic>n</italic>&#x02009;&#x0003D;&#x02009;1 (blue), <italic>n</italic>&#x02009;&#x0003D;&#x02009;20 (green), <italic>n</italic>&#x02009;&#x0003D;&#x02009;50 (purple), and <italic>n</italic>&#x02009;&#x0003D;&#x02009;100 (red)</bold>.</p></caption>
<graphic xlink:href="fdigh-04-00002-g004.tif"/>
</fig>
<p>For the distance <italic>d</italic>(<italic>y<sub>i</sub>, y<sub>i</sub></italic><sub>&#x0002B;1</sub>) shown in Figure <xref ref-type="fig" rid="F4">4</xref>, we observe that the 1-year distance decreases slowly before stabilizing from 1920 onward. This suggests the hypothesis that the language is more stable after 1920. We observe a brutal instability in the years after 1965, matching the noisy periods. For the distances <italic>d</italic>(<italic>y<sub>i</sub>, y<sub>i</sub></italic><sub>&#x0002B;</sub><italic><sub>n</sub></italic>) with <italic>n</italic>&#x02009;&#x0003D;&#x02009;20, 50, and 100, we observe that the <italic>n</italic>-year distance for dates separated by more than 20&#x02009;years slowly decreases before increasing significantly in more recent years, because of the &#x0201C;contamination&#x0201D; of the distance matrix by the data of the perturbed years. The same graphs computed without that noisy period do not show any increase in the distance value, so this increase cannot be interpreted as an acceleration of the linguistic evolution. The Jaccard distance matrix indicates an overall effect that could be caused by a linguistic drift, including the appearance of new words and the disappearance of some old ones. However, the Jaccard distance is known to be affected by large size differences (Muller, <xref ref-type="bibr" rid="B16">1980</xref>; Brunet, <xref ref-type="bibr" rid="B3">2003</xref>), and other distance definitions and characterizations have been designed to correct this unwanted property. An improved Jaccard distance is given in a study of text similarities (Brunet, <xref ref-type="bibr" rid="B3">2003</xref>) with the purpose of removing the sensitivity to size differences from the Jaccard distance. We computed this improved Jaccard distance, and it appears that it behaves in the same way as the classical Jaccard distance, albeit with a different normalization. In addition, OCR errors and noise can affect the Jaccard distance because of its binary nature and its lack of frequency information. Frequency filters can be used to decrease the influence of noise, but the applied threshold is quite arbitrary.</p>
</sec>
<sec id="S3">
<label>3</label> <title>Lexical Kernels and Word Resilience</title>
<sec id="S3-1">
<label>3.1</label> <title>Definition and Basic Measures</title>
<p>The uneven distribution of the size of corpus subsets (Figure <xref ref-type="fig" rid="F1">1</xref>) causes methodological difficulties for interpreting the distances defined in the previous section. Fluctuations in the lexicon size and noise cause an indirect increase of the linguistic drift as measured by the Jaccard formula. Under such conditions, it is difficult to untangle the effects of the unevenness of the distribution of corpus subsets from the actual appearance and disappearance of words.</p>
<p>These difficulties of interpretation motivate the exploration of another, possibly sounder approach to the same problem. We define the notion of the lexical kernel.</p>
<p><bold>Definition 1</bold>. <italic>The lexical kernel K<sub>x,y,C</sub> is the subset of unique words present in every yearly subset of a corpus C over the period starting in year x and finishing in year y</italic>.</p>
<p><italic>K</italic><sub>1804,1997,</sub><italic><sub>GDL</sub></italic> is, for instance, the subset of all words present in every yearly subcorpus of La Gazette de Lausanne. It contains 5,242 unique words that have remained in use for about 200&#x02009;years. The kernel <italic>K</italic><sub>1826,1997,</sub><italic><sub>JDG</sub></italic> contains 7,485 unique words, covering a period of about 170&#x02009;years. As the covered period is shorter, the time constraint is weaker, and the kernel is naturally larger.</p>
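<p>Under this definition, a kernel is simply the intersection of the yearly lexica. The following is a minimal sketch, where <monospace>yearly_lexica</monospace> is a hypothetical mapping from a year to its set of unique words:</p>

```python
def lexical_kernel(yearly_lexica, x, y):
    """K_{x,y,C}: words present in every yearly lexicon from year x to year y."""
    kernel = set(yearly_lexica[x])
    for year in range(x + 1, y + 1):
        if year in yearly_lexica:  # years removed from the corpus are skipped
            kernel &= yearly_lexica[year]
    return kernel

# Hypothetical three-year corpus: only "de" and "la" appear in every year.
lexica = {1804: {"de", "la", "gazette"},
          1805: {"de", "la", "ville"},
          1806: {"de", "la", "canton"}}
print(sorted(lexical_kernel(lexica, 1804, 1806)))  # ['de', 'la']
```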
<p>It is interesting to note that 4,464 words are common to the two kernels. Figure <xref ref-type="fig" rid="F5">5</xref> shows the statistical distribution of word typologies for both kernels.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p><bold>Distribution in terms of typologies of words contained in the kernel of <italic>K</italic><sub>1804,1997,</sub><italic><sub>GDL</sub></italic> (left) and <italic>K</italic><sub>1826,1997,</sub><italic><sub>JDG</sub></italic> (right)</bold>.</p></caption>
<graphic xlink:href="fdigh-04-00002-g005.tif"/>
</fig>
<p>Extending the notion of a kernel, it is rather easy to study the resilience of a given word.</p>
<p><bold>Definition 2</bold>. <italic>The resilience set R<sub>d,C</sub> is the union of all kernels K<sub>x,y,C</sub> corresponding to a duration of y&#x02009;&#x02212;&#x02009;x</italic>&#x02009;&#x02265;&#x02009;<italic>d years</italic>.</p>
<p>The definition of word resilience is naturally derived from the resilience set notion.</p>
<p><bold>Definition 3</bold>. <italic>The resilience r of a given word w in the corpus C is given by the following formula: r</italic>(<italic>w,C</italic>)&#x02009;&#x0003D;&#x02009;<italic>max</italic>{<italic>d</italic> &#x0007C; <italic>w</italic> &#x02282; <italic>R<sub>d,C</sub></italic>}.</p>
<p>For instance, <italic>R</italic><sub>100,</sub><italic><sub>GDL</sub></italic> contains all the words that are maintained in the corpus <italic>GDL</italic> for at least 100&#x02009;years. The <italic>R</italic> subsets are organized as concentric sets: <italic>R<sub>i</sub></italic><sub>&#x0002B;1,</sub><italic><sub>C</sub></italic> &#x02282; <italic>R<sub>i,C</sub></italic> &#x02282; &#x02026; &#x02282; <italic>R</italic><sub>2,</sub><italic><sub>C</sub></italic> &#x02282; <italic>R</italic><sub>1,</sub><italic><sub>C</sub></italic>. The relative proportion of each subset sheds light on both the stability and the dynamics of language change. Figure <xref ref-type="fig" rid="F6">6</xref> shows the distribution of word resilience for both newspapers.</p>
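<p>Since a kernel requires a word to be present in every year of its interval, the resilience of a word reduces to the duration of the longest interval of years that all contain it. A minimal sketch, assuming consecutive yearly lexica in a hypothetical <monospace>yearly_lexica</monospace> mapping (years removed from the corpus would need special handling):</p>

```python
def word_resilience(yearly_lexica, word):
    """r(w, C) = max{d | w in R_{d,C}}: the longest duration y - x such that
    w appears in every yearly lexicon from year x to year y."""
    best, run = 0, 0
    for year in sorted(yearly_lexica):
        run = run + 1 if word in yearly_lexica[year] else 0
        best = max(best, run)
    return max(best - 1, 0)  # a run of k consecutive years spans a duration of k - 1

# Hypothetical corpus: "de" is present 1804-1807, "gazette" only 1804-1805.
lexica = {y: ({"de"} | ({"gazette"} if y < 1806 else set()))
          for y in range(1804, 1808)}
print(word_resilience(lexica, "de"), word_resilience(lexica, "gazette"))  # 3 1
```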
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p><bold>Size of <italic>R<sub>d</sub></italic> versus the number of maintained years <italic>d</italic> (logarithmic scale) showing the word resilience distribution for JDG (green) and GDL (blue)</bold>.</p></caption>
<graphic xlink:href="fdigh-04-00002-g006.tif"/>
</fig>
<p>The GDL resilience curve in Figure <xref ref-type="fig" rid="F6">6</xref> is normalized (to the same time scale as JDG) in order to make the two curves comparable. This representation of <italic>R<sub>d</sub></italic> shows a similar overall word resilience trend for both JDG and GDL. However, we notice that the two curves intersect when considering the longest durations.</p>
<p>These definitions pave the way for a formulation of the study of linguistic change in terms of the algebra of sets. Instead of analyzing what is rapidly changing in the language, we study its most stable elements through the notions of kernel and word resilience. We can then apply a new definition of distance to the set of the most resilient words, i.e., the maximum-duration kernel. Indeed, reducing the analyzed set of words to the most resilient ones allows us to exclude noise efficiently. In addition, the sensitivity of the distance to the corpus size is reduced, and the method targets linguistic evolution more precisely, since a decrease in the use of resilient words can be the result of semantic evolution, punctual journalistic events, or linguistic diversity induced by the evolution of the newspaper layout. The number of words is the same for each year, but the corpus size influences the frequency of kernel words when the size is small. Indeed, the smaller the corpus size, the higher the frequency fluctuations. In order to reduce these effects, we defined a distance based on word ranks ordered by their frequencies.</p>
</sec>
<sec id="S3-2">
<label>3.2</label> <title>Distances Analysis Applied to Kernels</title>
<p>In order to compare the same kernel across two different years, let us consider its words ordered according to their frequency in each of those years. We may then define the distance between the two orderings as the computational cost of reordering one into the other. Again, we require a metric that satisfies the mathematical properties of a distance. One way to do so is to define the distance as the sum, over all kernel words, of the difference in position of each word in the two lists.</p>
<p><bold>Definition 4</bold>. <italic>Let I<sub>j</sub></italic>(<italic>w<sub>i</sub></italic>) <italic>be the index of the word w<sub>i</sub> in the list L<sub>j</sub></italic>. <italic>The kernel distance is given by</italic> <inline-formula><mml:math id="M2"><mml:mrow><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mi>K</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02282;</mml:mo><mml:mtext>&#x02009;</mml:mtext><mml:mi>K</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mtext>&#x02009;</mml:mtext><mml:mn>&#x0007C;</mml:mn><mml:msub><mml:mi>I</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x02009;</mml:mtext><mml:mo>&#x02212;</mml:mo><mml:mtext>&#x02009;</mml:mtext><mml:msub><mml:mi>I</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mn>&#x0007C;</mml:mn></mml:mrow></mml:math></inline-formula>.</p>
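<p>Definition 4 can be sketched as follows: each year&#x02019;s frequency table induces a ranking of the kernel words, and the distance sums the absolute rank displacements. The frequency mappings <monospace>freq1</monospace>/<monospace>freq2</monospace> are hypothetical, and the alphabetical tie-breaking is an assumption added for determinism; the normalization to [0, 1] mentioned later in the text is omitted here:</p>

```python
def kernel_distance(kernel, freq1, freq2):
    """Sum over kernel words of |I1(w) - I2(w)|, where I_j(w) is the rank of w
    in the kernel ordered by decreasing frequency in year j."""
    def ranks(freq):
        # Order kernel words by decreasing frequency (ties broken alphabetically).
        ordered = sorted(kernel, key=lambda w: (-freq.get(w, 0), w))
        return {w: i for i, w in enumerate(ordered)}
    r1, r2 = ranks(freq1), ranks(freq2)
    return sum(abs(r1[w] - r2[w]) for w in kernel)

kernel = {"de", "la", "et"}
f1 = {"de": 30, "la": 20, "et": 10}  # ranks: de=0, la=1, et=2
f2 = {"de": 30, "la": 10, "et": 20}  # ranks: de=0, et=1, la=2
print(kernel_distance(kernel, f1, f2))  # 2  ("la" and "et" each move by 1)
```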
<p>We applied this new distance definition to the list of kernel words, ordered by frequency for each year, and plotted the same analyses as for the Jaccard distance: Figure <xref ref-type="fig" rid="F7">7</xref> shows a representation of the distance matrix across the years, and Figure <xref ref-type="fig" rid="F8">8</xref> shows the distance between years <italic>y<sub>i</sub></italic> and <italic>y<sub>i</sub></italic><sub>&#x0002B;</sub><italic><sub>n</sub></italic> with <italic>n</italic> equal to 1, 20, 50, and 100.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p><bold>Kernel distance (blue) and mean of distances (red) versus the time difference (in number of years) between the compared subsets from the GDL (left) and JDG (right)</bold>.</p></caption>
<graphic xlink:href="fdigh-04-00002-g007.tif"/>
</fig>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p><bold>Kernel distance between the years <italic>y<sub>i</sub></italic> and <italic>y<sub>i</sub></italic><sub>&#x0002B;</sub><italic><sub>n</sub></italic> with <italic>n</italic>&#x02009;&#x0003D;&#x02009;1 (blue), <italic>n</italic>&#x02009;&#x0003D;&#x02009;20 (green), <italic>n</italic>&#x02009;&#x0003D;&#x02009;50 (purple), and <italic>n</italic>&#x02009;&#x0003D;&#x02009;100 (red)</bold>.</p></caption>
<graphic xlink:href="fdigh-04-00002-g008.tif"/>
</fig>
<p>The Jaccard distance represented in Figure <xref ref-type="fig" rid="F3">3</xref> and the kernel distance represented in Figure <xref ref-type="fig" rid="F7">7</xref> are normalized on the interval [0, 1]. By definition, they are based on different elements: one on the presence/absence of words in the lexica and the other on the frequency order of the kernel words. Both distances increase with increasing time difference, supporting the hypothesis that a linguistic drift exists. The Jaccard distance on the whole lexica ranges from a mean of 0.25, when the two lexica are separated by 1&#x02009;year, to 0.8, when they are separated by the maximum number of years. The kernel distance, applied by definition to only a very reduced set of resilient words, ranges from a mean of 0.1 to 0.4. The two distances share a common behavior on the two corpora of the JDG and the GDL. However, the kernel distance can be viewed as a lower bound of the linguistic drift, showing the evolution of the most stable words. It is remarkable that the plotted evolutions share the same behavior even though the distances are based on different types of information. Indeed, the Jaccard distance applied to the kernel would be equal to zero, whereas the kernel distance uses information about the frequency of a very reduced set of words.</p>
<p>When comparing the Jaccard distance and the kernel distance in Figures <xref ref-type="fig" rid="F4">4</xref> and <xref ref-type="fig" rid="F8">8</xref>, we observe that the kernel distances between subcorpora for the oldest years show the same fluctuations as the Jaccard distance and decrease continuously. However, this effect may be due to the low language representativity of the data before 1850 (small corpus size). In general, the kernel distance decreases slowly and continuously. There is no increase but rather a very stable phase when considering two subcorpora separated by more than 20&#x02009;years, and the kernel distance is also more stable in recent years. To assess the robustness of this measure in the presence of noise fluctuations, we performed a linear regression on the whole data and on the specifically noisy, unstable period (1965&#x02013;1998) for the two newspapers. We hypothesized that the nature of linguistic evolution excludes abrupt variations and randomness around a given trend; even with a simple linear model, we therefore expect the regression coefficient to be higher for the more robust measure of evolution. The two regressions for GDL and JDG on the whole data are represented in Figure <xref ref-type="fig" rid="F9">9</xref>. The kernel distance has better regression coefficients (0.8218 for GDL and 0.6196 for JDG) than the Jaccard distance (0.6294 for GDL and 0.3339 for JDG). The regressions for GDL and JDG on the noisy period are represented in Figure <xref ref-type="fig" rid="F10">10</xref>. The kernel distance also has better regression coefficients on this short unstable period (0.2790 for GDL and 0.1635 for JDG) than the Jaccard distance (0.0174 for GDL and 0.00002 for JDG). These results suggest that, although noise still affects it, the kernel distance is more robust to noise than the Jaccard distance.</p>
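The regression coefficient used here to compare robustness can be computed as the coefficient of determination of a simple linear fit, i.e., the squared Pearson correlation between years and distances. A minimal sketch (with made-up series, not the paper's data):

```python
def regression_coefficient(xs, ys):
    # Coefficient of determination (R^2) of a simple linear regression,
    # equal to the squared Pearson correlation between xs and ys.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)
```

A smoother distance series, as observed for the kernel distance, yields a higher coefficient than a noisier one around the same underlying trend, which is the comparison made in Figures 9 and 10.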
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p><bold>Jaccard distance (purple for GDL and green for JDG) and kernel distance (red for GDL and blue for JDG) versus years with their linear regressions and regression coefficients</bold>.</p></caption>
<graphic xlink:href="fdigh-04-00002-g009.tif"/>
</fig>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p><bold>Jaccard distance (purple for GDL and green for JDG) and kernel distance (red for GDL and blue for JDG) versus years with their linear regressions and regression coefficients, for the period from 1965 onward</bold>.</p></caption>
<graphic xlink:href="fdigh-04-00002-g010.tif"/>
</fig>
</sec>
</sec>
<sec id="S4" sec-type="discussion">
<label>4</label> <title>Discussion</title>
<p>Several distance definitions have been applied to the GDL and JDG corpora in order to quantify linguistic changes. We first used the Jaccard distance on the whole corpus with a frequency filter. Our observations from Figures <xref ref-type="fig" rid="F2">2</xref>&#x02013;<xref ref-type="fig" rid="F4">4</xref> support the hypothesis that linguistic changes exist and are quantifiable, even though we observe that the Jaccard distance is potentially sensitive to noise. In addition, the Jaccard distance is known to be sensitive to corpus-size fluctuations (Muller, <xref ref-type="bibr" rid="B16">1980</xref>; Brunet, <xref ref-type="bibr" rid="B3">2003</xref>), so we defined the concepts of kernel and word resilience in order to study the most stable part of the language.</p>
<p>We defined a kernel distance based on comparing the frequency ranks of kernel words between 2&#x02009;years. Surprisingly, Figures <xref ref-type="fig" rid="F7">7</xref> and <xref ref-type="fig" rid="F8">8</xref> show the same behavior as the Jaccard distance on the whole corpus. This supports the hypothesis that the linguistic distance information extracted from word presence/absence on the whole corpus can be retrieved using a reduced set of resilient words from the kernel together with the kernel distance. In addition, the kernel distance clearly overcomes the noise problems, canceling the effect of contamination in years with higher noise, such as the period 1900&#x02013;1915 for JDG and the period from 1965 onward for both newspapers. Like the Jaccard distance, this distance decreases over the period prior to 1870, a period unlikely to be representative of the language because of the small corpus size. After this period, the distance from one year to the next seems to decrease slowly but with more stability. Additionally, the distance from a given year to 20, 50, or 100&#x02009;years later remains stable.</p>
<p>From our experiments on the GDL and JDG corpora, we have made a series of observations that support the existence of a continuous and relatively constant linguistic drift. We tried several methods to quantify this linguistic change, succeeding in overcoming problems of noise and corpus-size fluctuation and in targeting linguistic change specifically, rather than other cumulative effects on the corpora&#x02019;s textual data such as topics, OCR quality, or noise evolution. While these measures provide a way to quantify the linguistic drift, we have no serious indicator or proof of a potential acceleration or deceleration of language change over the periods 1804&#x02013;1997 (GDL) and 1826&#x02013;1997 (JDG). However, these methods should be applied to a corpus with data available after 1997 in order to verify whether the observed stability is maintained during the period 1998&#x02013;2016, when many technologies mediating our language have potentially accelerated linguistic evolution (Kaplan, <xref ref-type="bibr" rid="B11">2014</xref>).</p>
</sec>
<sec id="S5">
<label>5</label> <title>Conclusion and Future Work</title>
<p>Large databases of scanned newspapers open new avenues for studying linguistic evolution (Westin and Geisler, <xref ref-type="bibr" rid="B23">2002</xref>; Fries and Lehmann, <xref ref-type="bibr" rid="B7">2006</xref>; Bamford et al., <xref ref-type="bibr" rid="B1">2013</xref>). However, these studies should be conducted with sound methodologies in order to avoid misinterpretation of artifacts. Common pitfalls include misinterpreting results linked to the size variation of the subsets or overgeneralizing results obtained from one particular newspaper corpus to general linguistic evolution.</p>
<p>In this paper, we introduced the notion of a kernel as a possible approach to studying linguistic changes under the lens of linguistic stability. Focusing on stable words and their relative distribution is likely to make interpretations more robust. Results were computed from two independent corpora, and it is striking that most of the results obtained from each of them are extremely similar. The kernels&#x02019; compositions in terms of grammatical word typologies are also very similar.</p>
<p>The kernel distance, applied to the kernel words in order to measure linguistic changes, has proved robust to OCR errors and noise. In addition, we observed that the study of kernel words allows the extraction of the same linguistic distance information as the Jaccard distance applied to the whole corpus. This suggests that our methods are indeed measuring general linguistic phenomena beyond the specificity of the corpora chosen for this study. Future work should address the case where the corpus kernel size is too small and implement a distance measuring linguistic change between subsets of resilient words that are not necessarily part of the kernel. In addition, our results still need to be confirmed by subsequent studies involving other corpora, such as non-journalistic texts and texts written in other languages.</p>
</sec>
<sec id="S6" sec-type="author-contributor">
<title>Author Contributions</title>
<p>The three authors contributed equally to the conception and design of this work through discussion of ideas and results. VB performed data acquisition, computation and analysis, and visualization of computed results, and wrote the article. CB provided visualizations of computed results and suggested reducing the analyzed set of words to those shared by all subcorpora. FK provided a deeper formalization of the developed concepts of kernels and word resilience. The three authors participated in reviewing the article&#x02019;s final version, ensuring its accuracy and integrity.</p>
</sec>
<sec id="S7">
<title>Conflict of Interest Statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The reviewer TN and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.</p>
</sec>
</body>
<back>
<ack>
<p>We thank the team of Le Temps newspaper and the BNS (Biblioth&#x000E8;que Nationale Suisse) for giving us the opportunity to work on those 200&#x02009;years of archives.</p>
</ack>
<sec id="S8">
<title>Funding</title>
<p>This study is funded by FNS and is part of the project &#x0201C;How algorithms shape language,&#x0201D; number 149758.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Bamford</surname> <given-names>J.</given-names></name> <name><surname>Cavalieri</surname> <given-names>S.</given-names></name> <name><surname>Diani</surname> <given-names>G.</given-names></name></person-group> (<year>2013</year>). <article-title>Variation and Change in Spoken and Written Discourse: Perspectives from Corpus Linguistics</article-title>. <source>Dialogue Studies</source> <volume>21</volume>.</citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bochkarev</surname> <given-names>V.</given-names></name> <name><surname>Solovyev</surname> <given-names>V.</given-names></name> <name><surname>Wichmann</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). <article-title>Universals versus historical contingencies in lexical evolution</article-title>. <source>Journal of the Royal Society Interface</source> <volume>11</volume>: <fpage>20140841</fpage>.<pub-id pub-id-type="doi">10.1098/rsif.2014.0841</pub-id><pub-id pub-id-type="pmid">25274040</pub-id></citation></ref>
<ref id="B3"><citation citation-type="web"><person-group person-group-type="author"><name><surname>Brunet</surname> <given-names>E.</given-names></name></person-group> (<year>2003</year>). <article-title>Peut-on mesurer la distance entre deux textes?</article-title> <source>Corpus</source>. Available at: <uri xlink:href="http://corpus.revues.org/index30.html">http://corpus.revues.org/index30.html</uri></citation></ref>
<ref id="B4"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Buntinx</surname> <given-names>V.</given-names></name> <name><surname>Bornet</surname> <given-names>C.</given-names></name> <name><surname>Kaplan</surname> <given-names>F.</given-names></name></person-group> (<year>2016</year>). <article-title>Studying linguistic changes on 200 years of newspapers</article-title>. In <source>Digital Humanities 2016</source>.</citation></ref>
<ref id="B5"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Buntinx</surname> <given-names>V.</given-names></name> <name><surname>Kaplan</surname> <given-names>F.</given-names></name></person-group> (<year>2015</year>). <article-title>Inversed N-gram viewer: searching the space of word temporal profiles</article-title>. In <source>Digital Humanities 2015</source>.</citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cocho</surname> <given-names>G.</given-names></name> <name><surname>Flores</surname> <given-names>J.</given-names></name> <name><surname>Gershenson</surname> <given-names>C.</given-names></name> <name><surname>Pineda</surname> <given-names>C.</given-names></name> <name><surname>S&#x000E1;nchez</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Rank diversity of languages: generic behavior in computational linguistics</article-title>. <source>PLoS ONE</source> <volume>10</volume>:<fpage>e0121898</fpage>.<pub-id pub-id-type="doi">10.1371/journal.pone.0121898</pub-id><pub-id pub-id-type="pmid">25849150</pub-id></citation></ref>
<ref id="B7"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Fries</surname> <given-names>U.</given-names></name> <name><surname>Lehmann</surname> <given-names>H.M.</given-names></name></person-group> (<year>2006</year>). <article-title>The style of 18th century English newspapers: lexical diversity</article-title>. In <source>News Discourse in Early Modern Britain</source>, <fpage>91</fpage>&#x02013;<lpage>104</lpage>.</citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gerlach</surname> <given-names>M.</given-names></name> <name><surname>Font-Clos</surname> <given-names>F.</given-names></name> <name><surname>Altmann</surname> <given-names>E.G.</given-names></name></person-group> (<year>2016</year>). <article-title>Similarity of symbol frequency distributions with heavy tails</article-title>. <source>Physical Review X</source> <volume>6</volume>: <fpage>021009</fpage>.<pub-id pub-id-type="doi">10.1103/PhysRevX.6.021009</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jaccard</surname> <given-names>P.</given-names></name></person-group> (<year>1901</year>). <article-title>&#x000C9;tude comparative de la distribution florale dans une portion des alpes et des jura</article-title>. <source>Bulletin del la Soci&#x000E9;t&#x000E9; Vaudoise des Sciences Naturelles</source> <volume>37</volume>: <fpage>547</fpage>&#x02013;<lpage>79</lpage>.</citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jaccard</surname> <given-names>P.</given-names></name></person-group> (<year>1912</year>). <article-title>The distribution of the flora in the alpine zone</article-title>. <source>New Phytologist</source> <volume>11</volume>: <fpage>37</fpage>&#x02013;<lpage>50</lpage>.<pub-id pub-id-type="doi">10.1111/j.1469-8137.1912.tb05611.x</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaplan</surname> <given-names>F.</given-names></name></person-group> (<year>2014</year>). <article-title>Linguistic capitalism and algorithmic mediation</article-title>. <source>Representations</source> <volume>127</volume>: <fpage>57</fpage>&#x02013;<lpage>63</lpage>.<pub-id pub-id-type="doi">10.1525/rep.2014.127.1.57</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kullback</surname> <given-names>S.</given-names></name></person-group> (<year>1987</year>). <article-title>Letters to the editor</article-title>. <source>The American Statistician</source> <volume>41</volume>: <fpage>338</fpage>&#x02013;<lpage>41</lpage>.<pub-id pub-id-type="doi">10.1080/00031305.1987.10475510</pub-id></citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kullback</surname> <given-names>S.</given-names></name> <name><surname>Leibler</surname> <given-names>R.A.</given-names></name></person-group> (<year>1951</year>). <article-title>On information and sufficiency</article-title>. <source>The Annals of Mathematical Statistics</source> <volume>22</volume>: <fpage>79</fpage>&#x02013;<lpage>86</lpage>.<pub-id pub-id-type="doi">10.1214/aoms/1177729694</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Levandowsky</surname> <given-names>M.</given-names></name> <name><surname>Winter</surname> <given-names>D.</given-names></name></person-group> (<year>1971</year>). <article-title>Distance between sets</article-title>. <source>Nature</source> <volume>234</volume>: <fpage>34</fpage>&#x02013;<lpage>5</lpage>.<pub-id pub-id-type="doi">10.1038/234034a0</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Michel</surname> <given-names>J.-B.</given-names></name> <name><surname>Shen</surname> <given-names>Y.K.</given-names></name> <name><surname>Aiden</surname> <given-names>A.P.</given-names></name> <name><surname>Veres</surname> <given-names>A.</given-names></name> <name><surname>Gray</surname> <given-names>M.K.</given-names></name> <collab>Google Books Team</collab> <etal/></person-group> (<year>2011</year>). <article-title>Quantitative analysis of culture using millions of digitized books</article-title>. <source>Science</source> <volume>331</volume>: <fpage>176</fpage>&#x02013;<lpage>82</lpage>.<pub-id pub-id-type="doi">10.1126/science.1199644</pub-id><pub-id pub-id-type="pmid">21163965</pub-id></citation></ref>
<ref id="B16"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Muller</surname> <given-names>C.</given-names></name></person-group> (<year>1980</year>). <source>Principes et m&#x000E9;thodes de statistique lexicale</source>. Vol. <volume>2</volume>. <publisher-name>Bulletin des biblioth&#x000E8;ques de France (BBF)</publisher-name>, <fpage>80</fpage>.</citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pechenick</surname> <given-names>E.A.</given-names></name> <name><surname>Danforth</surname> <given-names>C.M.</given-names></name> <name><surname>Dodds</surname> <given-names>P.S.</given-names></name></person-group> (<year>2015a</year>). <article-title>Characterizing the google books corpus: strong limits to inferences of socio-cultural and linguistic evolution</article-title>. <source>PLoS ONE</source> <volume>10</volume>:<fpage>e0137041</fpage>.<pub-id pub-id-type="doi">10.1371/journal.pone.0137041</pub-id></citation></ref>
<ref id="B18"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Pechenick</surname> <given-names>E.A.</given-names></name> <name><surname>Danforth</surname> <given-names>C.M.</given-names></name> <name><surname>Dodds</surname> <given-names>P.S.</given-names></name></person-group> (<year>2015b</year>). <article-title>Is language evolution grinding to a halt: exploring the life and death of words in English fiction</article-title>. In <source>CoRR</source>. <volume>arXiv</volume>: <issue>1503.03512v1</issue>, <fpage>1</fpage>&#x02013;<lpage>12</lpage>.</citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Piantadosi</surname> <given-names>S.T.</given-names></name></person-group> (<year>2014</year>). <article-title>Zipf&#x02019;s word frequency law in natural language: a critical review and future directions</article-title>. <source>Psychonomic Bulletin &#x00026; Review</source> <volume>21</volume>: <fpage>1112</fpage>&#x02013;<lpage>30</lpage>.<pub-id pub-id-type="doi">10.3758/s13423-014-0585-6</pub-id></citation></ref>
<ref id="B20"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Rochat</surname> <given-names>Y.</given-names></name> <name><surname>Ehrmann</surname> <given-names>M.</given-names></name> <name><surname>Buntinx</surname> <given-names>V.</given-names></name> <name><surname>Bornet</surname> <given-names>C.</given-names></name> <name><surname>Kaplan</surname> <given-names>F.</given-names></name></person-group> (<year>2016</year>). <article-title>Navigating through 200 years of historical newspapers</article-title>. In <conf-name>Proceedings of iPRES 2016</conf-name>, <fpage>186</fpage>&#x02013;<lpage>195</lpage>.</citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sakoda</surname> <given-names>J.M.</given-names></name></person-group> (<year>1981</year>). <article-title>A generalized index of dissimilarity</article-title>. <source>Demography</source> <volume>18</volume>: <fpage>245</fpage>&#x02013;<lpage>50</lpage>.<pub-id pub-id-type="doi">10.2307/2061096</pub-id><pub-id pub-id-type="pmid">7227588</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Singhal</surname> <given-names>A.</given-names></name></person-group> (<year>2001</year>). <article-title>Modern information retrieval: a brief overview</article-title>. <source>Bulletin of the IEEE Computer Society Technical Committee on Data Engineering</source> <volume>24</volume>: <fpage>35</fpage>&#x02013;<lpage>43</lpage>.</citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Westin</surname> <given-names>I.</given-names></name> <name><surname>Geisler</surname> <given-names>C.</given-names></name></person-group> (<year>2002</year>). <article-title>A multi-dimensional study of diachronic variation in British newspaper editorials</article-title>. <source>International Computer Archive of Modern and Medieval English</source> <volume>26</volume>: <fpage>133</fpage>&#x02013;<lpage>152</lpage>.</citation></ref>
<ref id="B24"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Zipf</surname> <given-names>G.</given-names></name></person-group> (<year>1935</year>). <source>The Psychobiology of Language: An Introduction to Dynamic Philology</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>M.I.T. Press</publisher-name>.</citation></ref>
</ref-list>
</back>
</article>