<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Plant Sci.</journal-id>
<journal-title>Frontiers in Plant Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Plant Sci.</abbrev-journal-title>
<issn pub-type="epub">1664-462X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpls.2023.1135918</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Plant Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A latent scale model to minimize subjectivity in the analysis of visual rating data for the National Turfgrass Evaluation Program</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Qu</surname>
<given-names>Yuanshuo</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1648373"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kne</surname>
<given-names>Len</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Graham</surname>
<given-names>Steve</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Watkins</surname>
<given-names>Eric</given-names>
</name>
<xref ref-type="aff" rid="aff4">
<sup>4</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Morris</surname>
<given-names>Kevin</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>National Turfgrass Evaluation Program</institution>, <addr-line>Beltsville, MD</addr-line>, <country>United States</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>U-Spatial, University of Minnesota</institution>, <addr-line>Minneapolis, MN</addr-line>, <country>United States</country>
</aff>
<aff id="aff3">
<sup>3</sup>
<institution>U-Spatial, University of Minnesota</institution>, <addr-line>Duluth, MN</addr-line>, <country>United States</country>
</aff>
<aff id="aff4">
<sup>4</sup>
<institution>Department of Horticultural Science, University of Minnesota</institution>, <addr-line>St. Paul, MN</addr-line>, <country>United States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>Edited by: Mehedi Masud, Taif University, Saudi Arabia</p>
</fn>
<fn fn-type="edited-by">
<p>Reviewed by: Jon Ahlinder, Forestry Research Institute of Sweden, Sweden; Mir Muhammad Nizamani, Guizhou University, China</p>
</fn>
<fn fn-type="corresp" id="fn001">
<p>*Correspondence: Yuanshuo Qu, <email xlink:href="mailto:henry.yqu@ntep.org">henry.yqu@ntep.org</email>
</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>06</day>
<month>07</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>14</volume>
<elocation-id>1135918</elocation-id>
<history>
<date date-type="received">
<day>02</day>
<month>01</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>31</day>
<month>05</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2023 Qu, Kne, Graham, Watkins and Morris</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Qu, Kne, Graham, Watkins and Morris</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<sec>
<title>Introduction</title>
<p>The traditional evaluation procedure in the National Turfgrass Evaluation Program (NTEP) relies on visually assessing replicated turf plots at multiple testing locations. This process yields ordinal data; however, statistical models that falsely assume these to be interval or ratio data have almost exclusively been applied in the subsequent analysis. This practice raises concerns about procedural subjectivity, preventing objective comparisons of cultivars across different test locations. It may also lead to serious errors, such as increased false alarms, failures to detect effects, and even inversions of differences among groups.</p>
</sec>
<sec>
<title>Methods</title>
<p>We reviewed this problem, identified sources of subjectivity, and presented a model-based approach to minimize subjectivity, allowing objective comparisons of cultivars across different locations and better monitoring of the evaluation procedure. We demonstrate how to fit the described model in a Bayesian framework with Stan, using datasets on overall turf quality ratings from the 2017 NTEP Kentucky bluegrass trials at seven testing locations.</p>
</sec>
<sec>
<title>Results</title>
<p>Compared with the existing method, ours allows the estimation of additional parameters, i.e., category thresholds, rating severity, and within-field spatial variations, and provides better separation of cultivar means and more realistic standard deviations.</p>
</sec>
<sec>
<title>Discussion</title>
<p>To implement the proposed model, additional information on rater identification, trial layout, and rating date is needed. Given the model assumptions, we recommend small trials to reduce rater fatigue. For large trials, ratings can be conducted for each replication on multiple occasions instead of all at once. To minimize subjectivity, multiple raters are required. We also proposed new ideas on temporal analysis, incorporating existing knowledge of turfgrass.</p>
</sec>
</abstract>
<kwd-group>
<kwd>NTEP</kwd>
<kwd>visual ratings</kwd>
<kwd>cultivar evaluation</kwd>
<kwd>subjectivity minimization</kwd>
<kwd>Bayesian model</kwd>
</kwd-group>
<counts>
<fig-count count="8"/>
<table-count count="2"/>
<equation-count count="5"/>
<ref-count count="26"/>
<page-count count="9"/>
<word-count count="4791"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-in-acceptance</meta-name>
<meta-value>Technical Advances in Plant Science</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1" sec-type="intro">
<label>1</label>
<title>Introduction</title>
<p>The National Turfgrass Evaluation Program (NTEP) is an internationally renowned turfgrass research program. Since 1981, NTEP has coordinated trials and collected data on a variety of turfgrass species at locations across the United States and Canada (<xref ref-type="bibr" rid="B26">Xie et&#xa0;al., 2022</xref>). At each testing location, replicated turf plots of different cultivars are established, maintained, and periodically evaluated by trained raters, who visually assess various traits of interest. Experienced raters usually mentor new raters following rating guidelines set by NTEP. Evaluated traits have traditionally included overall quality, color, density, resistance to diseases and insects, and tolerance to biotic or abiotic stresses, and have more recently been expanded to include drought and traffic tolerance. Over the years, NTEP has created a unique data repository, providing rich information for characterizing and selecting turfgrass cultivars for various applications.</p>
<p>NTEP adopted a 1-9 integer scale to assess traits of selected turfgrass cultivars (hereinafter referred to as the NTEP scale). The scale was originally used by turfgrass researchers in the northeastern region of the United States in the 1980s (personal communication with Dr. Bill Meyer of Rutgers University) and resembles the 9-point hedonic scale. Developed by David R. Peryam and his colleagues (<xref ref-type="bibr" rid="B20">Peryam and Girardot, 1952</xref>; <xref ref-type="bibr" rid="B21">Peryam and Pilgrim, 1957</xref>), the 9-point hedonic scale was originally used in the 1950s to measure the food preferences of soldiers in the U.S. Armed Forces, with the foods as the stimuli and the soldiers as the subjects. Since then, it has become the most widely used scale for testing consumer preferences and acceptability of foods and beverages (<xref ref-type="bibr" rid="B14">Lim et&#xa0;al., 2009</xref>). The original 9-point hedonic scale is a balanced bipolar scale centered around a neutral position, with four positive categories on one side and four negative categories on the other. The categories are labeled with phrases ranging from &#x201c;Dislike Extremely&#x201d; to &#x201c;Like Extremely&#x201d; (<xref ref-type="table" rid="T1">
<bold>Table&#xa0;1</bold>
</xref>), representing a continuum from dislikes to likes.</p>
<table-wrap id="T1" position="float">
<label>Table&#xa0;1</label>
<caption>
<p>Replication of the questionnaire designed for studying soldiers&#x2019; preferences in the field.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left"/>
<th valign="top" align="center">FOOD ITEM</th>
<th valign="top" colspan="4" align="center">LIKE</th>
<th valign="top" align="center">INDIFFERENT</th>
<th valign="top" colspan="4" align="center">DISLIKE</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Not Tried</td>
<td valign="top" align="left">Cream Gravy</td>
<td valign="top" align="left">Like Extremely</td>
<td valign="top" align="left">Like Very Much</td>
<td valign="top" align="left">Like Moderately</td>
<td valign="top" align="left">Like Slightly</td>
<td valign="top" align="left">Neither Like Nor Dislike</td>
<td valign="top" align="left">Dislike Slightly</td>
<td valign="top" align="left">Dislike Moderately</td>
<td valign="top" align="left">Dislike Very Much</td>
<td valign="top" align="left">Dislike Extremely</td>
</tr>
<tr>
<td valign="top" align="left">Not Tried</td>
<td valign="top" align="left">Bread Pudding</td>
<td valign="top" align="left">Like Extremely</td>
<td valign="top" align="left">Like Very Much</td>
<td valign="top" align="left">Like Moderately</td>
<td valign="top" align="left">Like Slightly</td>
<td valign="top" align="left">Neither Like Nor Dislike</td>
<td valign="top" align="left">Dislike Slightly</td>
<td valign="top" align="left">Dislike Moderately</td>
<td valign="top" align="left">Dislike Very Much</td>
<td valign="top" align="left">Dislike Extremely</td>
</tr>
<tr>
<td valign="top" align="left">Not Tried</td>
<td valign="top" align="left">Cheese</td>
<td valign="top" align="left">Like Extremely</td>
<td valign="top" align="left">Like Very Much</td>
<td valign="top" align="left">Like Moderately</td>
<td valign="top" align="left">Like Slightly</td>
<td valign="top" align="left">Neither Like Nor Dislike</td>
<td valign="top" align="left">Dislike Slightly</td>
<td valign="top" align="left">Dislike Moderately</td>
<td valign="top" align="left">Dislike Very Much</td>
<td valign="top" align="left">Dislike Extremely</td>
</tr>
<tr>
<td valign="top" align="left">Not Tried</td>
<td valign="top" align="left">French Fried Onions</td>
<td valign="top" align="left">Like Extremely</td>
<td valign="top" align="left">Like Very Much</td>
<td valign="top" align="left">Like Moderately</td>
<td valign="top" align="left">Like Slightly</td>
<td valign="top" align="left">Neither Like Nor Dislike</td>
<td valign="top" align="left">Dislike Slightly</td>
<td valign="top" align="left">Dislike Moderately</td>
<td valign="top" align="left">Dislike Very Much</td>
<td valign="top" align="left">Dislike Extremely</td>
</tr>
<tr>
<td valign="top" align="left">Not Tried</td>
<td valign="top" align="left">Lettuce Wedges</td>
<td valign="top" align="left">Like Extremely</td>
<td valign="top" align="left">Like Very Much</td>
<td valign="top" align="left">Like Moderately</td>
<td valign="top" align="left">Like Slightly</td>
<td valign="top" align="left">Neither Like Nor Dislike</td>
<td valign="top" align="left">Dislike Slightly</td>
<td valign="top" align="left">Dislike Moderately</td>
<td valign="top" align="left">Dislike Very Much</td>
<td valign="top" align="left">Dislike Extremely</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>A response on the 9-point hedonic scale is an ordinal variable, as its categories have a natural order (<xref ref-type="bibr" rid="B23">Seddon et&#xa0;al., 2001</xref>). In subsequent analysis, the categories are generally assigned numerical values from 1 to 9, respectively, such that parametric statistical models can be utilized. For the NTEP scale, a trained rater walks through all plots in serpentine order in each rating event, directly assigning an integer from 1 to 9 for a particular trait of interest, where 1 is typically the poorest/lowest and 9 is the best/highest. Similar to the analysis of responses on a 9-point hedonic scale, the analysis of NTEP rating data treats the ordinal variables as numerical values, which may lead to serious errors, such as increased false alarms, i.e., detecting non-existing effects, failures to detect effects, and even inversions of differences among groups (<xref ref-type="bibr" rid="B5">B&#xfc;rkner and Vuorre, 2019</xref>). There is abundant literature, e.g., <xref ref-type="bibr" rid="B14">Lim et&#xa0;al. (2009)</xref>, <xref ref-type="bibr" rid="B13">Liddell and Kruschke (2018)</xref>, on the reasons for these problems. Some important ones are summarized here.</p>
<list list-type="order">
<list-item>
<p>The categories in the 9-point hedonic scale are not equidistant, which was first discovered by the Psychometric Laboratory at the University of Chicago (<xref ref-type="bibr" rid="B10">Jones and Thurstone, 1955</xref>; <xref ref-type="bibr" rid="B9">Jones et&#xa0;al., 1955</xref>), and confirmed in later studies (<xref ref-type="bibr" rid="B15">Moskowitz, 1971</xref>; <xref ref-type="bibr" rid="B18">Moskowitz and Sidel, 1971</xref>; <xref ref-type="bibr" rid="B16">Moskowitz, 1977</xref>; <xref ref-type="bibr" rid="B17">Moskowitz, 1980</xref>).</p>
</list-item>
<list-item>
<p>The 9-point hedonic scale lacks an absolute zero point. While there is a neutral position (i.e., the INDIFFERENT category or the "5"), it varies from subject to subject, even across different measurements by the same subject.</p>
</list-item>
<list-item>
<p>The general tendency of subjects to avoid using the extreme categories (<xref ref-type="bibr" rid="B8">Hollingworth, 1910</xref>; <xref ref-type="bibr" rid="B24">Stevens and Galanter, 1957</xref>; <xref ref-type="bibr" rid="B19">Parducci and Wedell, 1986</xref>) makes the scale vulnerable to ceiling and floor effects. This truncates the 9-point scale, limits the scale&#x2019;s ability to identify extreme stimuli, and skews the response data.</p>
</list-item>
</list>
<p>As a derivative of the original 9-point hedonic scale, the NTEP scale also yields ordinal data. Such data only provide rudimentary information on the hedonic magnitude and cannot directly be used to compare hedonic perceptions across different raters. In the current evaluation process, a turf plot&#x2019;s rating for a specific trait, e.g., turf quality, depends on the rater&#x2019;s severity in the rating event. The same plot will likely score higher when the rater is lenient and lower when the rater is severe, giving rise to subjectivity. In other words, within a specific rater&#x2019;s turf quality ratings, we know a &#x201c;3&#x201d; plot has better turf quality than a &#x201c;2&#x201d; plot. But we cannot conclude how a &#x201c;3&#x201d; plot rated by rater A compares in turf quality with a &#x201c;3&#x201d; plot rated by rater B without adjusting for rater severity. Considering the temporal nature of the evaluation process, even for the same rater on the same trait, consistency is not guaranteed at different times of the year. Another source of subjectivity relates to the scale categories, which are neither equally spaced nor of equal width. To meaningfully aggregate data collected from different rating events across different testing sites, both sources of subjectivity need to be addressed. However, current methods, e.g., the additive main effect and multiplicative interaction (AMMI) method, analysis of variance (ANOVA) (<xref ref-type="bibr" rid="B6">Ebdon and Gauch Jr., 2002a</xref>; <xref ref-type="bibr" rid="B7">Ebdon and Gauch Jr., 2002b</xref>), and the linear mixed model (LMM), are not adequate and suffer from the same errors when they are applied to ordinal data directly. Inspired by the Rasch Rating Scale Model (<xref ref-type="bibr" rid="B1">Andrich, 1978</xref>), we propose a latent scale model to minimize subjectivity, hereinafter referred to as NTEP RSM (NTEP Rating Scale Model), allowing more objective comparisons of cultivars across different raters and research groups. 
We also demonstrate how to fit the described model in a Bayesian framework, using datasets on overall turf quality ratings in the 2017 NTEP Kentucky bluegrass trials. The model is programmed in Stan (<xref ref-type="bibr" rid="B11">Lee et&#xa0;al., 2017</xref>) <italic>via</italic> Python. Stan is a probabilistic programming language for statistical modeling, inference, and computation. Although demonstrations are done for overall turf quality rating, this approach works for other traits of interest evaluated using the 1-9 NTEP rating scale.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Model specifications</title>
<sec id="s2_1">
<label>2.1</label>
<title>NTEP RSM</title>
<p>We started by constructing a latent scale based on the probability distribution of raw ordinal data. The model predicts the decision between two adjacent categories using a threshold parameter on the latent scale. The 1-9 scale is re-indexed in the following sections as categories 0-8 for conciseness in mathematical notation. At a given test location, let <italic>Y<sub>ni</sub>
</italic> denote the rating assigned to plot <italic>n</italic> in rating event <italic>i</italic>. The logarithm of the ratio of the probability that plot <italic>n</italic> is assigned to category <italic>s</italic> to the probability that it is assigned to category <italic>s</italic>&#x2013;1 can be expressed by the following equation,</p>
<disp-formula>
<label>(1)</label>
<mml:math display="block" id="M1">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>&#x3b8;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>&#x3c4;</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where</p>
<list list-type="simple">
<list-item>
<p>
<italic>i</italic>=1,2,&#x2026;,<italic>I</italic> is the index for each rating event during the trial;</p>
</list-item>
<list-item>
<p>
<italic>n</italic>=1,2,&#x2026;,<italic>N</italic> is the index for each plot;</p>
</list-item>
<list-item>
<p>
<italic>s</italic>=1,2,&#x2026;,<italic>M</italic> is the index for category thresholds;</p>
</list-item>
<list-item>
<p>
<italic>M</italic>(<italic>M &#x2264;</italic> 8) is both the maximum rating score after reindexing and the number of thresholds;</p>
</list-item>
<list-item>
<p>
<italic>&#x3b8;<sub>n</sub>
</italic> is the perceived turf quality of plot <italic>n</italic> in a specific rating event;</p>
</list-item>
<list-item>
<p>
<italic>&#x3b2;<sub>i</sub>
</italic> measures rating severity in rating event <italic>i</italic>;</p>
</list-item>
<list-item>
<p>
<italic>&#x3c4;<sub>s</sub>
</italic> is the threshold at which <italic>Pr</italic>(<italic>Y</italic>=<italic>s</italic>&#x2013;1) = <italic>Pr</italic>(<italic>Y</italic>=<italic>s</italic>).</p>
</list-item>
</list>
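<p>As an illustration of Equation 1, the category probabilities implied by the adjacent-category log-odds can be obtained by accumulating the terms <italic>&#x3b8;<sub>n</sub></italic> &#x2013; <italic>&#x3b2;<sub>i</sub></italic> &#x2013; <italic>&#x3c4;<sub>s</sub></italic> and normalizing. The following is a minimal Python sketch under that reading; the function name <monospace>category_probabilities</monospace> and the numerical values are hypothetical and are not taken from the model code in the repository.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def category_probabilities(theta, beta, tau):
    """Return Pr(Y = s) for s = 0..M under the adjacent-category logit model.

    theta: latent quality of the plot (illustrative value)
    beta:  severity of the rating event (illustrative value)
    tau:   list of M thresholds, tau[0] being tau_1 in the text
    """
    # Category 0 has an empty sum; category s adds (theta - beta - tau_s).
    log_weights = [0.0]
    for t in tau:
        log_weights.append(log_weights[-1] + (theta - beta - t))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_weights)
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical thresholds: eight evenly spaced values over [-2, 2].
tau = [-2.0 + 4.0 * k / 7 for k in range(8)]
probs = category_probabilities(theta=0.5, beta=0.0, tau=tau)
```
]]></preformat>
<p>By construction, the log-ratio of any two adjacent entries of <monospace>probs</monospace> equals <italic>&#x3b8;</italic> &#x2013; <italic>&#x3b2;</italic> &#x2013; <italic>&#x3c4;<sub>s</sub></italic>, and the nine probabilities sum to one.</p>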
<p>Constraints were placed on <italic>&#x3b2;<sub>I</sub>
</italic> and <italic>&#x3c4;<sub>S</sub>
</italic> to add a meaningful zero to the scale. Both parameters were constrained to be the negative sum of the other parameters, respectively. We further assume <bold>
<italic>&#x3b8;</italic>
</bold>, <bold>
<italic>&#x3b2;</italic>
</bold>, and <bold><italic>&#x3c4;</italic></bold> are normally distributed. For an unbiased rater in a rating event (<italic>&#x3b2;=0</italic>), the probability density curves for each category are illustrated in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>. The vertical dashed lines indicate category thresholds located at the points where the probability of a cultivar being assigned to two adjacent categories is equal. Note that these thresholds are not necessarily equidistant. In <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>, if a cultivar is located in a category (i.e., between two adjacent thresholds), then the response in that category has the greatest probability. The x-axis represents the constructed latent scale. It is continuous and equidistant, with a zero indicating the average level of overall turf quality. While the average level in individual rating events might vary (<italic>&#x3b2;</italic>&#x2260;0), we assume the average levels for each research group at different test locations are the same, allowing scale matching across different testing locations. Once subjectivity effects, i.e., <bold>
<italic>&#x3b2;</italic>
</bold> and <bold>
<italic>&#x3c4;</italic>
</bold>, are estimated and removed, <bold>
<italic>&#x3b8;</italic>
</bold> can be further analyzed. In this study, we partitioned <bold>
<italic>&#x3b8;</italic>
</bold> into cultivar and plot location effects, that is,</p>
<fig id="f1" position="float">
<label>Figure&#xa0;1</label>
<caption>
<p>Hypothetical category probability curves for nine ordered categories as used in NTEP rating scale.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-14-1135918-g001.tif"/>
</fig>
<disp-formula>
<label>(2)</label>
<mml:math display="block" id="M2">
<mml:mrow>
<mml:mi mathvariant="bold">&#x3b8;</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold">&#x3b7;</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="script">L</mml:mi>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mi mathvariant="script">C</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <bold>
<italic>&#x3b7;</italic>
</bold> is the cultivar effect, reflecting the intrinsic quality of a cultivar, and <inline-formula>
<mml:math display="inline" id="im1">
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mi mathvariant="script">C</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the plot location effect due to spatial heterogeneity of the field. We further assume cultivar effects follow normal distributions with a mean of 0 and a variance of <italic>&#x3c3;</italic>
<sup>2</sup>. The plot location effect was modeled as a Gaussian process with a zero mean and covariance function <italic>K</italic>,</p>
<disp-formula>
<label>(3)</label>
<mml:math display="block" id="M3">
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
<mml:mi mathvariant="script">O</mml:mi>
<mml:mi mathvariant="script">C</mml:mi>
<mml:mfenced>
<mml:mo>&#xb7;</mml:mo>
</mml:mfenced>
<mml:mo>&#x223c;</mml:mo>
<mml:mi>N</mml:mi>
<mml:mfenced>
<mml:mrow>
<mml:mstyle mathvariant="bold">
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
</mml:mstyle>
<mml:mi>K</mml:mi>
<mml:mfenced>
<mml:mo>&#xb7;</mml:mo>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The covariance function <italic>K</italic>(&#xb7;) implemented here is an exponentiated quadratic function. For two plots <italic>i</italic> and <italic>j</italic> in the same trial at a specific testing location,</p>
<disp-formula>
<label>(4)</label>
<mml:math display="block" id="M4">
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo>&#xb7;</mml:mo>
<mml:mo>|</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>&#x3c1;</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>&#x3c3;</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mi>&#x3b1;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mi>exp</mml:mi>
<mml:mtext>&#xa0;</mml:mtext>
<mml:mfenced>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:msup>
<mml:mi>&#x3c1;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>&#x3b4;</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mi>&#x3c3;</mml:mi>
<mml:mi>e</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>&#x3b1;</italic>, <italic>&#x3c1;</italic>, and <italic>&#x3c3;<sub>e</sub>
</italic> are hyperparameters defining the covariance function; <italic>&#x3b4;<sub>ij</sub>
</italic> is the Kronecker delta function with value 1 if <italic>i</italic> = <italic>j</italic> and 0 otherwise; <italic>d<sub>ij</sub>
</italic> is the Euclidean distance between centers of the two plots. As this is a Bayesian model, priors for parameters and hyperparameters are required. We adopted weakly informative priors: <italic>t</italic>
<sub>3</sub>(0,1) for <italic>&#x3b1;</italic>, <italic>&#x3c3;</italic> and <italic>&#x3c3;<sub>e</sub>
</italic>; Inv&#x2013;Gamma(5,5) for <italic>&#x3c1;</italic>.</p>
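<p>To make the covariance structure concrete, the following minimal Python sketch builds the covariance matrix for a handful of plots. It assumes the standard exponentiated quadratic form, in which covariance decays with squared distance, plus an independent noise term on the diagonal. The function name <monospace>covariance_matrix</monospace>, the grid of plot centers, and the hyperparameter values are hypothetical and chosen for illustration only.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def covariance_matrix(coords, alpha, rho, sigma_e):
    """Exponentiated quadratic covariance with a noise term on the diagonal.

    coords:  list of (x, y) plot centers
    alpha:   marginal standard deviation of the spatial effect
    rho:     length scale
    sigma_e: standard deviation of the independent noise
    """
    n = len(coords)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between the two plot centers.
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            K[i][j] = alpha ** 2 * math.exp(-d2 / (2.0 * rho ** 2))
            if i == j:  # Kronecker delta term
                K[i][j] += sigma_e ** 2
    return K


# Hypothetical plot centers on a 2 x 3 grid with unit spacing.
coords = [(x, y) for y in range(2) for x in range(3)]
K = covariance_matrix(coords, alpha=0.15, rho=2.5, sigma_e=0.2)
```
]]></preformat>
<p>In this sketch, nearby plots share more of the spatial effect than distant ones, and the diagonal entries equal <italic>&#x3b1;</italic><sup>2</sup> + <italic>&#x3c3;<sub>e</sub></italic><sup>2</sup>.</p>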
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Parameter recovery with NTEP RSM</title>
<p>To ensure that model parameters are identifiable, the following parameter recovery test was performed to evaluate the model. We first generated a synthetic dataset from 3 replications of 10 cultivars rated monthly for 5 years by 5 raters. The entry effects are random draws from a normal distribution with a mean of 0 and a standard deviation of 0.7 (&#x3c3; = 0.7). Plot location effects are generated from a Gaussian process with an assigned mean vector and covariance matrix with <italic>&#x3b1;</italic> = 0.15, <italic>&#x3c1;</italic> = 2.5, <italic>&#x3c3;<sub>e</sub>
</italic> = 0.2. Rating severity is a vector of five evenly spaced numbers over [&#x2013;0.8,0.8], and the category thresholds are a vector of eight evenly spaced numbers over [&#x2013;2,2]. All parameters, functions, and simulated data can be found in the GitHub repository. The simulated data were fit to the NTEP RSM for parameter recovery.</p>
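<p>The data-generating step can be sketched in a few lines of Python. This is a simplified illustration rather than the simulation code in the repository: plot location effects and replications are omitted, one rating per cultivar and rater is drawn, and all function names and the random seed are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
import random


def linspace(lo, hi, n):
    """Return n evenly spaced numbers over [lo, hi]."""
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]


def category_probabilities(theta, beta, tau):
    """Pr(Y = s), s = 0..M, under the adjacent-category logit model."""
    log_weights = [0.0]
    for t in tau:
        log_weights.append(log_weights[-1] + (theta - beta - t))
    peak = max(log_weights)
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_rating(theta, beta, tau, rng):
    """Draw one rating and shift it back to the 1-9 NTEP scale."""
    probs = category_probabilities(theta, beta, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0] + 1


rng = random.Random(2017)                # hypothetical seed
severity = linspace(-0.8, 0.8, 5)        # five raters
thresholds = linspace(-2.0, 2.0, 8)      # eight category thresholds
entry_effects = [rng.gauss(0.0, 0.7) for _ in range(10)]  # ten cultivars

# Simplification: plot location effects are set to zero in this sketch.
ratings = [
    (c, r, simulate_rating(entry_effects[c], severity[r], thresholds, rng))
    for c in range(10)
    for r in range(5)
]
```
]]></preformat>
<p>Both evenly spaced vectors are symmetric about zero, so they sum to zero, which is consistent with the sum-to-zero constraints placed on rating severity and category thresholds.</p>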
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Linear mixed model</title>
<p>To compare with the existing method, we also implemented the following LMM for each testing location,</p>
<disp-formula>
<label>(5)</label>
<mml:math display="block" id="M5">
<mml:mrow>
<mml:mi mathvariant="bold">Y</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold">&#x3b7;</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="bold">u</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>&#x3f5;</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
<p>in which quality rating, <bold>Y</bold>, was treated as a continuous variable and partitioned into a fixed effect of cultivars, <bold>
<italic>&#x3b7;</italic>
</bold>, and a random effect of rating event, <bold>
<italic>u</italic>
</bold>. <bold>
<italic>&#x3f5;</italic>
</bold> denotes the residual that the model does not explain.</p>
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Model implementation</title>
<p>The NTEP RSM is implemented in Stan (version 2.29.1) with a Python interface (version 3.10.4). The same model was fitted to the data collected at each trial location, and posterior samples of the model parameters were generated by four Markov chain Monte Carlo chains, each with 1,000 iterations. The first 500 iterations were discarded to minimize the effect of initial values, and the rest were thinned by taking every other sample to reduce sample autocorrelation. The convergence of chains was confirmed <italic>via</italic> visual inspection and by examining the <inline-formula>
<mml:math display="inline" id="im2">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>R</mml:mi>
<mml:mo stretchy="true">^</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> values of all parameters and the log posteriors. Model code and output files can be found at <ext-link ext-link-type="uri" xlink:href="https://github.com/QhenryQ/ntep-rsm">https://github.com/QhenryQ/ntep-rsm</ext-link>. The LMM is implemented with the Python package Statsmodels (<xref ref-type="bibr" rid="B22">Seabold and Perktold, 2010</xref>).</p>
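<p>The post-processing of the chains can be illustrated with a short Python sketch. It discards the first 500 of 1,000 draws, keeps every other remaining draw, and computes a basic potential scale reduction factor. This is an assumption-laden illustration: the draws are simulated stand-ins for posterior samples, the helper names are hypothetical, and the statistic shown is the basic Gelman-Rubin form, which is not necessarily the exact variant reported by Stan.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random
import statistics


def keep_after_warmup(chain, warmup=500, thin=2):
    """Drop the warmup draws, then keep every `thin`-th remaining draw."""
    return chain[warmup::thin]


def r_hat(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average within-chain variance; B: between-chain variance times n.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    var_plus = (n - 1) / n * within + between / n
    return (var_plus / within) ** 0.5


# Simulated stand-ins for four chains of 1,000 draws of one parameter.
rng = random.Random(1)
raw_chains = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
kept = [keep_after_warmup(c) for c in raw_chains]
print(sum(len(c) for c in kept), round(r_hat(kept), 3))
```
]]></preformat>
<p>With these settings, each chain retains 250 draws, giving 1,000 retained draws in total. Values of the statistic close to 1 are usually taken to indicate that the chains agree with one another.</p>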
</sec>
</sec>
<sec id="s3" sec-type="results|discussions">
<label>3</label>
<title>Results and discussions</title>
<sec id="s3_1">
<label>3.1</label>
<title>Preliminary data analysis</title>
<p>Kentucky bluegrass is a cool-season turfgrass that grows best when temperatures are between 60 and 75&#xb0;F and goes dormant in hot, dry summers and cold winters. Given this behavior, turf quality data are only collected from May to October in northern trial locations, while in the southern trial locations, data are usually collected all year round. <xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2</bold>
</xref> presents monthly histograms for all the raw turf quality rating data. In most months, the quality ratings were roughly symmetric, with a central tendency around 5 or 6. In January and February, turf quality ratings were only available from Raleigh, NC, and Stillwater, OK. We noticed lower turf quality ratings and fewer categories used at both locations. For example, the February overall turf quality ratings at Stillwater, OK, were found to have a range of [3, 6], with a median of 4. This is presumably due to raters&#x2019; adjustment to the dormancy of Kentucky bluegrass. The significant reduction of turf quality in dormancy makes it difficult for raters to distinguish cultivars. Ceiling and floor effects were also observed at other locations, e.g., the overall turf quality ratings at East Lansing, MI, and Raleigh, NC, ranged from 2 to 9, while those at West Lafayette, IN, ranged from 2 to 8.</p>
<fig id="f2" position="float">
<label>Figure&#xa0;2</label>
<caption>
<p>Histogram of raw overall turf quality ratings for each month at seven test locations.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-14-1135918-g002.tif"/>
</fig>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>NTEP RSM results</title>
<sec id="s3_2_1">
<label>3.2.1</label>
<title>Category thresholds</title>
<p>&#x201c;How is Rater A&#x2019;s 5 different from Rater B&#x2019;s 5?&#x201d; This type of question is inevitable when cultivars are compared following the current NTEP procedure. However, such a question cannot be answered without proper definitions of the categories, which, in our model, are obtained by identifying category thresholds. These thresholds are points on the latent scale at which a rater is equally likely to select either of two adjacent response options (<xref ref-type="bibr" rid="B3">Andrich and Luo, 2003</xref>). We also assumed fixed distances among the category thresholds for raters within the same research group at the same location. This assumption is reasonable given that experienced raters of a research group usually train its newer raters. Estimating category thresholds from the data provides important feedback on category definitions and on how the scale is utilized by each research group, allowing us to check that raters are adequately differentiating cultivars. When adjacent thresholds are too far apart, a category becomes too wide and less informative; on the other hand, when adjacent thresholds are close, a category becomes too narrow, indicating underutilization of the scale (see Guidelines for Rating Scales and Andrich Thresholds). We examined the non-terminal categories used at seven testing locations (<xref ref-type="fig" rid="f3">
<bold>Figure&#xa0;3</bold>
</xref>). Their widths spanned the range of [0.07, 4.76] on the logit scale; e.g., category 2 at Adelphia, NJ, spanned only 0.59 logits, while category 8 at Stillwater, OK, spanned 3.54 logits. Category thresholds are generally required to be in ascending order, concordant with the category numbers, i.e., ordered thresholds (<xref ref-type="bibr" rid="B2">Andrich, 2011</xref>). Disordered thresholds imply that a higher rating may not be assigned as a turf cultivar advances along the scale. Such inconsistency among raters is usually the result of too many options and/or poor category definitions in scale development. The estimated category thresholds from all testing locations, ranging from -6.64 to 6.05, were in order. Large variations were observed in the range of category thresholds: those at East Lansing, MI, and Stillwater, OK, spanned more than 10 logits, while those at Adelphia, NJ, spanned only 4.5 logits.</p>
<fig id="f3" position="float">
<label>Figure&#xa0;3</label>
<caption>
<p>The latent scale partitioned by category thresholds into NTEP rating categories at seven test locations.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-14-1135918-g003.tif"/>
</fig>
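<p>The threshold definition above can be illustrated with a minimal sketch, assuming the standard Andrich rating scale formulation in which the probability of category k is proportional to the exponential of the summed differences between the latent quality and the first k thresholds. The function names and the threshold values below are hypothetical and are not estimates from this study.</p>
<preformat>
```python
import math


def category_probabilities(theta, thresholds):
    """Rating scale model: P(k) is proportional to
    exp(sum of (theta - tau_j) over the first k thresholds)."""
    log_weights = [0.0]
    for tau in thresholds:
        log_weights.append(log_weights[-1] + (theta - tau))
    peak = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def category_widths(thresholds):
    """Width of each non-terminal category: distance between adjacent thresholds."""
    return [upper - lower for lower, upper in zip(thresholds, thresholds[1:])]


# Hypothetical thresholds for a five-category scale (not values from the paper).
thresholds = [-2.0, -0.5, 0.4, 2.5]

# Evaluate the model exactly at the second threshold.
probs = category_probabilities(thresholds[1], thresholds)
widths = category_widths(thresholds)
is_ordered = all(w > 0 for w in widths)
```
</preformat>
<p>When the latent quality sits exactly on a threshold, the two adjacent categories receive equal probability, which is the property used to define the thresholds; the widths and the ordering check correspond to the diagnostics discussed in this section.</p>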
</sec>
<sec id="s3_2_2">
<label>3.2.2</label>
<title>Rating severity</title>
<p>Defining category thresholds is not sufficient to answer the question of rater variation. On the constructed latent scale, category thresholds can still slide left (indicating a lenient rating event) or right (indicating a severe rating event). In many fields, severity can be treated as a constant for a given rater; that is, the rater is equally severe every time a rating is conducted. However, this may not hold in the evaluation of turfgrass. New raters need time to achieve consistency, and some trained raters may adjust their severity to credit cultivars that perform well under harsh environmental conditions or at different times of the year (personal communications with NTEP raters). Historically, there have been two sets of rating criteria for reference standards in NTEP. One is based on an optimal growth environment (e.g., light, temperature, soil moisture) and management regime (e.g., mowing height, fertilization rate), while the other is based on the actual environment or management regime. Under either criterion, the rater must idealize a reference standard against which to compare all treatments and assign a quality score on a scale of 1 to 9. With the first criterion, we expect raters to be consistent regardless of the time of year, since the best plot is defined considering all possible growth environments and management regimes. With the second, raters could be either severe or lenient depending on the environment or management regime at the time of rating. We examined the consistency of the rating severity estimates of 10 raters who had performed more than three ratings across different months. For each rater, we fit a trend line for rating severity across the months of the year using locally weighted scatterplot smoothing (LOWESS). No strong trends were observed for raters in St. Paul, MN, West Lafayette, IN, and Adelphia, NJ, while strong seasonal patterns were seen for raters in the other four locations (<xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref>). One potential confounding factor in the current definition of rating severity is the seasonality of turfgrass quality. It is also worth noting that while the model focuses on point estimates of the average turf quality, the actual quality of cool-season turfgrass is not constant; instead, it varies over time with strong annual seasonality. Unfortunately, the current data do not provide sufficient information, e.g., the exact rating dates, to investigate how rating severity changes in response to the seasonality of turf quality. Standard deviations of rating severity per rater ranged from 0.13 to 0.97 on the logit scale. Considering the category widths, such variation in severity for a given rater could lead to changes in the assigned rating categories.</p>
<fig id="f4" position="float">
<label>Figure&#xa0;4</label>
<caption>
<p>Rating severity estimates and monthly trend lines of ten raters at seven test locations.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-14-1135918-g004.tif"/>
</fig>
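<p>To make the last point concrete, the following sketch shifts a set of thresholds by a severity value and reports the most likely category for a fixed latent quality. It assumes the same adjacent-category formulation as the rating scale model and treats severity as an additive shift of every threshold (positive values slide the thresholds right); the function name, thresholds, and quality value are hypothetical.</p>
<preformat>
```python
def most_likely_category(theta, thresholds, severity=0.0):
    """Return the modal category (1-based) after shifting every threshold by `severity`."""
    shifted = [tau + severity for tau in thresholds]
    log_weights = [0.0]
    for tau in shifted:
        log_weights.append(log_weights[-1] + (theta - tau))
    return log_weights.index(max(log_weights)) + 1


# Hypothetical thresholds and latent quality (not values from the paper).
thresholds = [-2.0, -0.5, 0.4, 2.5]
theta = 0.6

lenient = most_likely_category(theta, thresholds, severity=-0.5)
neutral = most_likely_category(theta, thresholds, severity=0.0)
severe = most_likely_category(theta, thresholds, severity=0.5)
```
</preformat>
<p>In this toy setting, a shift of half a logit toward severity moves the same plot into the next lower category, a shift that lies within the range of per-rater standard deviations reported above.</p>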
</sec>
<sec id="s3_2_3">
<label>3.2.3</label>
<title>Field spatial variation</title>
<p>We implemented a Gaussian process to estimate the spatial variation within a specific trial. The traditional cultivar comparison method based on ANOVA or LMM assumes uniform growth conditions within a trial, which is hardly achievable due to heterogeneity in soil texture, seeding depth, elevation gradient, etc. Thus, removing the field spatial effect is important for reliable cultivar comparison. <xref ref-type="fig" rid="f5">
<bold>Figure&#xa0;5</bold>
</xref> visualizes the spatial variation estimated by our model at seven testing locations, in which every pixel represents a plot as defined by its row and column number. The level of spatial heterogeneity varied from trial to trial; it was higher in some trials, e.g., at East Lansing, MI, and lower in others, e.g., at Adelphia, NJ. Notably, we observed large edge effects in the trial at Logan, UT, a diagonal division in the trial at St. Paul, MN, and localized hot spots in the trials at East Lansing, MI, and Raleigh, NC. The estimated field spatial variation provides turfgrass researchers with a high-level summary of their trials, which can help improve experimental design and allow better differentiation of cultivars.</p>
<fig id="f5" position="float">
<label>Figure&#xa0;5</label>
<caption>
<p>Field spatial variation at seven test locations.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-14-1135918-g005.tif"/>
</fig>
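<p>As a rough illustration of how a Gaussian process can encode spatial similarity between plots, the sketch below builds a covariance matrix over row and column positions. The squared-exponential kernel, its amplitude and length scale, and the 3-by-4 layout are assumptions made for this example only; this section does not state which kernel or hyperparameters the model uses.</p>
<preformat>
```python
import math


def squared_exponential(p, q, amplitude=1.0, length_scale=2.0):
    """Covariance between two plots, each given as a (row, column) pair.
    The kernel choice and default hyperparameters are illustrative assumptions."""
    squared_distance = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return amplitude ** 2 * math.exp(-squared_distance / (2.0 * length_scale ** 2))


# Hypothetical trial layout with 3 rows and 4 columns.
plots = [(row, col) for row in range(3) for col in range(4)]
covariance = [[squared_exponential(p, q) for q in plots] for p in plots]

neighbour = covariance[0][1]    # plots (0, 0) and (0, 1)
far_corner = covariance[0][11]  # plots (0, 0) and (2, 3)
```
</preformat>
<p>Under such a kernel, adjacent plots share more of the spatial effect than distant ones, which is what allows smooth field trends such as edge effects or hot spots to be separated from cultivar effects.</p>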
</sec>
<sec id="s3_2_4">
<label>3.2.4</label>
<title>Cultivar comparison across testing locations</title>
<p>Our model quantifies and removes confounding factors at each location, i.e., rating severity and the field spatial effect, allowing a more reliable and accurate cultivar comparison. An additional assumption is required for scale alignment in order to compare cultivars across different testing locations. We assume the average levels for a turfgrass cultivar, as perceived by raters at different NTEP testing locations, are roughly the same. In <xref ref-type="fig" rid="f6">
<bold>Figure&#xa0;6</bold>
</xref>, we compared the performance of two example cultivars by aligning the average levels at seven testing locations. Each angular axis represents the latent logit scale at the corresponding testing location, where zero indicates the average level. &#x2018;After Midnight&#x2019; performed above average at Adelphia, NJ, Stillwater, OK, and Raleigh, NC, and below average at St. Paul, MN, East Lansing, MI, Logan, UT, and West Lafayette, IN. &#x2018;Kenblue&#x2019; performed below average at all locations. When comparing the two, the distance between the logit values estimates how much better one cultivar is than the other at each location. After Midnight outperformed Kenblue at all testing locations except East Lansing, MI, and West Lafayette, IN. The comparison of all evaluated cultivars can be found in <xref ref-type="supplementary-material" rid="SM1">
<bold>Supplementary Materials</bold>
</xref> and the GitHub repository.</p>
<fig id="f6" position="float">
<label>Figure&#xa0;6</label>
<caption>
<p>Performance of After Midnight and Kenblue at seven test locations.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-14-1135918-g006.tif"/>
</fig>
</sec>
<sec id="s3_2_5">
<label>3.2.5</label>
<title>Effect sizes</title>
<p>Effect size quantifies the strength of relationships between variables and determines their practical importance in the study. One way to determine the effect size is to examine the percentage of variance that each effect explains. <xref ref-type="fig" rid="f7">
<bold>Figure&#xa0;7</bold>
</xref> illustrates the percentage of variance explained by the model&#x2019;s estimated parameters. At all locations except Logan, UT, the effect of field spatial variation is the smallest of the three. In contrast, the effect of rating severity is the largest at all locations except Adelphia, NJ. Notably, there are seven raters at Adelphia, NJ, compared with one to three raters at the other locations, highlighting the importance of gathering opinions from more raters during cultivar evaluation. The percentage of variance explained by the cultivar effect varied drastically, from merely 4% at Logan, UT, to as much as 79% at Adelphia, NJ. Quantifying and removing these confounding factors is thus essential when evaluating and comparing cultivars in field trials.</p>
<fig id="f7" position="float">
<label>Figure&#xa0;7</label>
<caption>
<p>Percentage of explained variance by different effects estimated by the model.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-14-1135918-g007.tif"/>
</fig>
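<p>One simple way to express such percentages is to divide the variance of each estimated effect by the sum of the variances of all effects. The sketch below follows that recipe under the assumption that the effects are additive on the logit scale; the exact computation behind Figure&#xa0;7 is not spelled out in this section, and the effect values are made up for illustration.</p>
<preformat>
```python
from statistics import pvariance


def variance_shares(effects):
    """Percentage of the summed effect variance attributed to each effect."""
    variances = {name: pvariance(values) for name, values in effects.items()}
    total = sum(variances.values())
    return {name: 100.0 * value / total for name, value in variances.items()}


# Hypothetical per-observation effect estimates on the logit scale.
effects = {
    "cultivar": [0.8, -0.4, 0.1, -0.5],
    "rating_severity": [1.2, -1.0, 0.3, -0.5],
    "field_spatial": [0.1, -0.1, 0.05, -0.05],
}
shares = variance_shares(effects)
```
</preformat>
<p>With these made-up numbers, rating severity accounts for the largest share and field spatial variation for the smallest, mirroring the pattern described for most locations.</p>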
</sec>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Comparison with LMM</title>
<p>The advantages of NTEP RSM over the currently adopted LMM are threefold. First, it allows the estimation of additional parameters, namely category thresholds, rating severity, and field spatial variation. All three parameters are essential for rater training, better utilization of the whole scale, and understanding of the field conditions. Second, NTEP RSM better separates the mean estimates of the evaluated cultivars. To name a few of the numerous examples, Blue Gem (NAI-13-9), MVS-130, Heartland (NAI-14-187), AKB3241, and RAD 553 all received the same mean estimate of -0.261 at East Lansing, MI, from LMM, while the mean estimates from NTEP RSM were 0.030, -0.020, -0.145, -0.268, and -0.580, respectively. Similar patterns were observed for DLFPS-340/3556, Paloma (PST-K13-139), DLFPS-340/3552, and J-1138 at St. Paul, MN; DLFPS-340/3556, A16-2, and NuRush (J-3510) at West Lafayette, IN; and DLFPS-340/3548, A16-17, Barvette HGT<sup>&#xae;</sup>, and NK-1 at Logan, UT. Detailed comparison for all cultivars can be found in Among the seven test locations, the largest discrepancies between the two models&#x2019; output were seen at Logan, UT, and the smallest at Stillwater, OK (<xref ref-type="table" rid="T2">
<bold>Table&#xa0;2</bold>
</xref>). It is important to highlight the robustness of the current LMM approach despite all the merits of NTEP RSM. Third, NTEP RSM provides more realistic standard deviation estimates, while the currently adopted LMM generates the same standard deviation for all cultivars at each location. Given the different genetic backgrounds of the cultivars, they are unlikely to have the same standard deviations.</p>
<table-wrap id="T2" position="float">
<label>Table&#xa0;2</label>
<caption>
<p>Correlation coefficients between cultivar mean estimates from LMM and NTEP RSM.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" rowspan="2" align="left">Location</th>
<th valign="top" colspan="2" align="center">Correlation coefficient between LMM and NTEP RSM</th>
</tr>
<tr>
<th valign="top" align="center">Pearson&#x2019;s</th>
<th valign="top" align="center">Spearman&#x2019;s rank</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">St. Paul, MN</td>
<td valign="top" align="center">0.973614</td>
<td valign="top" align="center">0.970781</td>
</tr>
<tr>
<td valign="top" align="left">East Lansing, MI</td>
<td valign="top" align="center">0.928411</td>
<td valign="top" align="center">0.929173</td>
</tr>
<tr>
<td valign="top" align="left">Logan, UT</td>
<td valign="top" align="center">0.800883</td>
<td valign="top" align="center">0.756775</td>
</tr>
<tr>
<td valign="top" align="left">West Lafayette, IN</td>
<td valign="top" align="center">0.969092</td>
<td valign="top" align="center">0.955572</td>
</tr>
<tr>
<td valign="top" align="left">Adelphia, NJ</td>
<td valign="top" align="center">0.997716</td>
<td valign="top" align="center">0.997600</td>
</tr>
<tr>
<td valign="top" align="left">Stillwater, OK</td>
<td valign="top" align="center">0.999583</td>
<td valign="top" align="center">0.999022</td>
</tr>
<tr>
<td valign="top" align="left">Raleigh, NC</td>
<td valign="top" align="center">0.944150</td>
<td valign="top" align="center">0.951401</td>
</tr>
</tbody>
</table>
</table-wrap>
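<p>The two coefficients reported in Table&#xa0;2 can be computed as in the sketch below, which uses only the Python standard library. Spearman&#x2019;s rank correlation is obtained as Pearson&#x2019;s correlation of the ranks, with tied values sharing their average rank, which matters here because LMM can assign identical estimates to several cultivars. The two input lists are hypothetical and are not the estimates from this study.</p>
<preformat>
```python
import math
from itertools import groupby


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    position = 0
    for _, group in groupby(order, key=lambda i: values[i]):
        tied = list(group)
        mean_rank = position + (len(tied) + 1) / 2.0
        for i in tied:
            ranks[i] = mean_rank
        position += len(tied)
    return ranks


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cultivar mean estimates; the first two LMM values are tied.
lmm = [-0.261, -0.261, 0.100, 0.350, -0.500]
rsm = [0.030, -0.145, 0.220, 0.410, -0.580]

r = pearson(lmm, rsm)
rho = spearman(lmm, rsm)
```
</preformat>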
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Parameter recovery with NTEP RSM</title>
<p>The highest value for <inline-formula>
<mml:math display="inline" id="im3">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>R</mml:mi>
<mml:mo stretchy="true">^</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> was 1.0 for all parameters and the log posterior, suggesting that all four chains have converged. As shown in <xref ref-type="fig" rid="f8">
<bold>Figure&#xa0;8</bold>
</xref>, all except three of the 95% credible intervals include zero, indicating the model&#x2019;s ability to recover the original values of the parameters.</p>
<fig id="f8" position="float">
<label>Figure&#xa0;8</label>
<caption>
<p>Mean estimates and 95% credible intervals for the differences between the estimated and original values of the parameters.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-14-1135918-g008.tif"/>
</fig>
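<p>The recovery check described above can be sketched as follows: for each parameter, subtract the original value from the posterior draws, take an equal-tailed 95% interval of the differences, and test whether it contains zero. The percentile rule, the function names, and the simulated draws are assumptions made for this illustration; they are not taken from the model code.</p>
<preformat>
```python
import random


def credible_interval(draws, level=0.95):
    """Equal-tailed interval read off the sorted draws at the nearest order statistics."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    lower_index = int(round(tail * (len(ordered) - 1)))
    upper_index = int(round((1.0 - tail) * (len(ordered) - 1)))
    return ordered[lower_index], ordered[upper_index]


def interval_covers_zero(estimate_draws, original_value, level=0.95):
    """True when the interval of (estimate - original) contains zero."""
    differences = [draw - original_value for draw in estimate_draws]
    lower, upper = credible_interval(differences, level)
    return upper >= 0.0 >= lower


# Simulated posterior draws for a parameter whose original value is 1.0.
rng = random.Random(0)
original_value = 1.0
close_draws = [rng.gauss(1.05, 0.1) for _ in range(4000)]
biased_draws = [rng.gauss(2.0, 0.1) for _ in range(4000)]

recovered = interval_covers_zero(close_draws, original_value)
missed = interval_covers_zero(biased_draws, original_value)
```
</preformat>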
</sec>
<sec id="s3_5" sec-type="discussion">
<label>3.5</label>
<title>Discussions</title>
<p>Despite the promising results, at least two major challenges lie ahead for the successful implementation of the proposed model. The first and foremost is the lack of data. While NTEP has done a remarkable job of gathering, cleaning, organizing, and storing historical data on cultivar evaluation, a significant amount of valuable data is left out in this process, including but not limited to rater identification, trial layout, rating dates, and field gradient. Fortunately, researchers generally record and preserve such information at each trial location, although additional work is required to incorporate it into the current NTEP database. Second, there are too few raters at some trial locations. The fundamental debiasing mechanism of the proposed model is to aggregate individuals&#x2019; opinions on the same cultivar into an objective, collective opinion. Multiple raters are required to ensure accurate estimation of the collective opinion on the tested cultivar.</p>
<p>As mentioned above, one limitation of the proposed model is the absence of a seasonality component. As a cool-season turfgrass, Kentucky bluegrass thrives during the fall and early spring, and its growth slows significantly during the hot summer months. The proposed model focuses on estimating the overall quality of a given cultivar over the entire testing period but cannot provide a quality estimate at a given time of the year. We tested year and month effects as independent Gaussian variables; however, as pointed out by one reviewer, it is unrealistic for months to have the same effect across different years. We agree with the reviewer and are exploring better ways to improve the proposed model. A potential approach is the multiple-output Gaussian process model (<xref ref-type="bibr" rid="B12">Li et&#xa0;al., 2021</xref>) that incorporates the seasonal growth pattern of Kentucky bluegrass as a prior distribution. This requires additional information on the rating dates. Once implemented, it will allow the analysis of the temporal variation of cultivars, which caters to needs such as mixing or blending cultivars based on spring green-up and comparing cultivars on growth potential at a given time of the year (<xref ref-type="bibr" rid="B25">Woods, 2013</xref>).</p>
<p>Because the model assumes raters are consistent across all rating events, we encourage small trial sizes at each testing location. Smaller trials reduce the risk of rater fatigue during rating, thus helping raters maintain better consistency. For trials with too many cultivars, we recommend that each replication be rated on a separate occasion instead of finishing all the plots at once. Regarding the rating scale, researchers should attempt to achieve a uniform distribution (<xref ref-type="bibr" rid="B4">Bond and Fox, 2013</xref>) of category thresholds. NTEP is currently working towards a data ingestion, analysis, and visualization pipeline, with the objectives of providing timely feedback to raters during the season, helping raters utilize the rating scale better, and serving a larger audience. NTEP also needs to set standards for the cultivar average, which represents the zero point on the scale, so that results of cultivar comparisons across time and location are accurate and reliable.</p>
</sec>
</sec>
<sec id="s4" sec-type="data-availability">
<title>Data availability statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: <ext-link ext-link-type="uri" xlink:href="https://github.com/QhenryQ/ntep-rsm/tree/main/model_data">https://github.com/QhenryQ/ntep-rsm/tree/main/model_data</ext-link>.</p>
</sec>
<sec id="s5" sec-type="author-contributions">
<title>Author contributions</title>
<p>YQ conceived the idea, developed the model, performed the analysis, and took the lead in writing the manuscript. All authors contributed to the article and approved the submitted version.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgments</title>
<p>The authors are grateful for the generous help from Dr. Cale A. Bigelow of Purdue University, Dr. Stacy A. Bonos of Rutgers University, Dr. Leah Brilman of DLF Pickseed, Dr. Andrea Payne Connally of Oklahoma State University, Ms. Christine Knisley of National Turfgrass Evaluation Program, Dr. Kevin W. Frank of Michigan State University, Mr. Paul Harris of Utah State University, Mr. Andrew Hollman of University of Minnesota - Twin Cities, Dr. Paul G. Johnson of Utah State University, Dr. Dennis L. Martin of Oklahoma State University, Dr. Grady L. Miller of North Carolina State University, and Dr. Phillip L. Vines of the University of Georgia. We also want to express our gratitude to two reviewers whose comments helped improve and clarify this manuscript.</p>
</ack>
<sec id="s6" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>YQ and KM are both employed by NTEP.</p>
<p>The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="s7" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s8" sec-type="supplementary-material">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fpls.2023.1135918/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fpls.2023.1135918/full#supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="DataSheet_1.csv" id="SM1" mimetype="text/csv"/>
<supplementary-material xlink:href="Image_1.png" id="SF1" mimetype="image/png"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Andrich</surname> <given-names>D.</given-names>
</name>
</person-group> (<year>1978</year>). <article-title>A rating formulation for ordered response categories</article-title>. <source>Psychometrika</source> <volume>43</volume>, <fpage>561</fpage>&#x2013;<lpage>573</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1007/BF02293814</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Andrich</surname> <given-names>D.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Rating scales and rasch measurement</article-title>. <source>Expert Rev. Pharmacoecon. Outcomes Res.</source> <volume>11</volume>, <fpage>571</fpage>&#x2013;<lpage>585</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1586/erp.11.59</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Andrich</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Luo</surname> <given-names>G.</given-names>
</name>
</person-group> (<year>2003</year>). <article-title>Conditional pairwise estimation in the rasch model for ordered response categories using principal components</article-title>. <source>J. Appl. Meas.</source> <volume>4</volume>, <fpage>205</fpage>&#x2013;<lpage>221</lpage>.</citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bond</surname> <given-names>T. G.</given-names>
</name>
<name>
<surname>Fox</surname> <given-names>C. M.</given-names>
</name>
</person-group> (<year>2013</year>). <source>Applying the rasch model: fundamental measurement in the human sciences</source> (<publisher-name>Psychology Press</publisher-name>).</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>B&#xfc;rkner</surname> <given-names>P.-C.</given-names>
</name>
<name>
<surname>Vuorre</surname> <given-names>M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Ordinal regression models in psychology: a tutorial</article-title>. <source>Adv. Methods Pract. psychol. Sci.</source> <volume>2</volume>, <fpage>77</fpage>&#x2013;<lpage>101</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1177/2515245918823199</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ebdon</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Gauch</surname> <given-names>H.</given-names>
<suffix>Jr.</suffix>
</name>
</person-group> (<year>2002</year>a). <article-title>Additive main effect and multiplicative interaction analysis of national turfgrass performance trials: I. Interpretation of genotype &#xd7; environment interaction</article-title>. <source>Crop Sci.</source> <volume>42</volume>, <fpage>489</fpage>&#x2013;<lpage>496</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.2135/cropsci2002.4890</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ebdon</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Gauch</surname> <given-names>H.</given-names>
<suffix>Jr.</suffix>
</name>
</person-group> (<year>2002</year>b). <article-title>Additive main effect and multiplicative interaction analysis of national turfgrass performance trials: II. Cultivar recommendations</article-title>. <source>Crop Sci.</source> <volume>42</volume>, <fpage>497</fpage>&#x2013;<lpage>506</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.2135/cropsci2002.4970</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hollingworth</surname> <given-names>H. L.</given-names>
</name>
</person-group> (<year>1910</year>). <article-title>The central tendency of judgment</article-title>. <source>J. Philosophy Psychol. Sci. Methods</source> <volume>7</volume>, <fpage>461</fpage>&#x2013;<lpage>469</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.2307/2012819</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jones</surname> <given-names>L. V.</given-names>
</name>
<name>
<surname>Peryam</surname> <given-names>D. R.</given-names>
</name>
<name>
<surname>Thurstone</surname> <given-names>L. L.</given-names>
</name>
</person-group> (<year>1955</year>). <article-title>Development of a scale for measuring soldiers&#x2019; food preferences</article-title>. <source>Food Res.</source> <volume>20</volume>, <fpage>512</fpage>&#x2013;<lpage>520</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1111/j.1365-2621.1955.tb16862.x</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jones</surname> <given-names>L. V.</given-names>
</name>
<name>
<surname>Thurstone</surname> <given-names>L. L.</given-names>
</name>
</person-group> (<year>1955</year>). <article-title>The psychophysics of semantics: an experimental investigation</article-title>. <source>J. Appl. Psychol.</source> <volume>39</volume>, <fpage>31</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1037/h0042184</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lee</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Carpenter</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Morris</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Betancourt</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Maverick</surname> <given-names>G.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <source>Stan-dev/stan: v2.17.1</source>. doi:&#xa0;<pub-id pub-id-type="doi">10.5281/zenodo.1101116</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Jones</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Banerjee</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Engelhardt</surname> <given-names>B. E.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Multi-group gaussian processes</article-title>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2110.08411</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liddell</surname> <given-names>T. M.</given-names>
</name>
<name>
<surname>Kruschke</surname> <given-names>J. K.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Analyzing ordinal data with metric models: what could possibly go wrong</article-title>? <source>J. Exp. Soc. Psychol.</source> <volume>79</volume>, <fpage>328</fpage>&#x2013;<lpage>348</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.jesp.2018.08.009</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lim</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Wood</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Green</surname> <given-names>B. G.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Derivation and evaluation of a labeled hedonic scale</article-title>. <source>Chem. Senses</source> <volume>34</volume>, <fpage>739</fpage>&#x2013;<lpage>751</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.foodqual.2011.05.008</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Moskowitz</surname> <given-names>H. R.</given-names>
</name>
</person-group> (<year>1971</year>). <article-title>The sweetness and pleasantness of sugars</article-title>. <source>Am. J. Psychol.</source> <volume>84</volume>, <fpage>387</fpage>&#x2013;<lpage>405</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.2307/1420470</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Moskowitz</surname> <given-names>H. R.</given-names>
</name>
</person-group> (<year>1977</year>). <article-title>Magnitude estimation: notes on what, how, when, and why to use it</article-title>. <source>J. Food Qual.</source> <volume>1</volume>, <fpage>195</fpage>&#x2013;<lpage>227</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1111/j.1745-4557.1977.tb00942.x</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Moskowitz</surname> <given-names>H. R.</given-names>
</name>
</person-group> (<year>1980</year>). <article-title>Psychometric evaluation of food preferences</article-title>. <source>Foodservice Res. Int.</source> <volume>1</volume>, <fpage>149</fpage>&#x2013;<lpage>167</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1111/j.1745-4506.1980.tb00252.x</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Moskowitz</surname> <given-names>H. R.</given-names>
</name>
<name>
<surname>Sidel</surname> <given-names>J. L.</given-names>
</name>
</person-group> (<year>1971</year>). <article-title>Magnitude and hedonic scales of food acceptability</article-title>. <source>J. Food Sci.</source> <volume>36</volume>, <fpage>677</fpage>&#x2013;<lpage>680</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1111/j.1365-2621.1971.tb15160.x</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Parducci</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Wedell</surname> <given-names>D. H.</given-names>
</name>
</person-group> (<year>1986</year>). <article-title>The category effect with rating scales: number of categories, number of stimuli, and method of presentation</article-title>. <source>J. Exp. Psychol.: Hum. Percept. Perform.</source> <volume>12</volume>, <elocation-id>496</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.1037/0096-1523.12.4.496</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peryam</surname> <given-names>D. R.</given-names>
</name>
<name>
<surname>Girardot</surname> <given-names>N. F.</given-names>
</name>
</person-group> (<year>1952</year>). <article-title>Advanced taste-test method</article-title>. <source>Food Eng.</source> <volume>24</volume>, <fpage>58</fpage>&#x2013;<lpage>61</lpage>.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peryam</surname> <given-names>D. R.</given-names>
</name>
<name>
<surname>Pilgrim</surname> <given-names>F. J.</given-names>
</name>
</person-group> (<year>1957</year>). <article-title>Hedonic scale method of measuring food preferences</article-title>. <source>Food Technol.</source> <volume>11</volume>, <fpage>9</fpage>&#x2013;<lpage>14</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1007/BF02638783</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Seabold</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Perktold</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2010</year>). &#x201c;<article-title>Statsmodels: econometric and statistical modeling with Python</article-title>,&#x201d; in <conf-name>9th Python in Science Conference</conf-name> (<publisher-loc>Austin, Texas</publisher-loc>: <publisher-name>SciPy</publisher-name>). doi:&#xa0;<pub-id pub-id-type="doi">10.25080/majora-92bf1922-011</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Seddon</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Marshall</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Campbell</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Roland</surname> <given-names>M.</given-names>
</name>
</person-group> (<year>2001</year>). <article-title>Systematic review of studies of quality of clinical care in general practice in the UK, Australia and New Zealand</article-title>. <source>BMJ Qual. Saf.</source> <volume>10</volume>, <fpage>152</fpage>&#x2013;<lpage>158</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1126/science.103.2684.677</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stevens</surname> <given-names>S. S.</given-names>
</name>
<name>
<surname>Galanter</surname> <given-names>E. H.</given-names>
</name>
</person-group> (<year>1957</year>). <article-title>Ratio scales and category scales for a dozen perceptual continua</article-title>. <source>J. Exp. Psychol.</source> <volume>54</volume>, <fpage>377</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1037/h0043680</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Woods</surname> <given-names>M.</given-names>
</name>
</person-group> (<year>2013</year>). <source>Using temperature to predict turfgrass growth potential (GP) and to estimate turfgrass nitrogen use</source> (<publisher-loc>Bangkok</publisher-loc>: <publisher-name>Asian Turfgrass Publication</publisher-name>).</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xie</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Farhadloo</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Guo</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Shekhar</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Watkins</surname> <given-names>E.</given-names>
</name>
<name>
<surname>Kne</surname> <given-names>L.</given-names>
</name>
<etal/>
</person-group>. (<year>2022</year>). <article-title>NTEP-DB 1.0: a relational database for the National Turfgrass Evaluation Program</article-title>. <source>Int. Turfgrass Soc. Res. J.</source> <volume>14</volume>, <fpage>316</fpage>&#x2013;<lpage>332</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1002/its2.76</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>