<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="systematic-review">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2022.850611</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Systematic Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A Survey of Data Quality Measurement and Monitoring Tools</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Ehrlinger</surname> <given-names>Lisa</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>W&#x000F6;&#x000DF;</surname> <given-names>Wolfram</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1725562/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Institute for Application-Oriented Knowledge Processing (FAW), Johannes Kepler University</institution>, <addr-line>Linz</addr-line>, <country>Austria</country></aff>
<aff id="aff2"><sup>2</sup><institution>Software Competence Center Hagenberg GmbH</institution>, <addr-line>Hagenberg</addr-line>, <country>Austria</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Tsuyoshi Ide, IBM Research, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Witold Suryn, &#x000C9;cole de technologie sup&#x000E9;rieure (&#x000C9;TS), Canada; Filipe Portela, University of Minho, Portugal</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Lisa Ehrlinger <email>lisa.ehrlinger&#x00040;scch.at</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data</p></fn></author-notes>
<pub-date pub-type="epub">
<day>31</day>
<month>03</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>5</volume>
<elocation-id>850611</elocation-id>
<history>
<date date-type="received">
<day>07</day>
<month>01</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>08</day>
<month>03</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Ehrlinger and W&#x000F6;&#x000DF;.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Ehrlinger and W&#x000F6;&#x000DF;</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>High-quality data is key to interpretable and trustworthy data analytics and the basis for meaningful data-driven decisions. In practical scenarios, data quality is typically associated with data preprocessing, profiling, and cleansing for subsequent tasks like data integration or data analytics. However, from a scientific perspective, a lot of research has been published about the measurement (i.e., the detection) of data quality issues and different generally applicable data quality dimensions and metrics have been discussed. In this work, we close the gap between data quality research and practical implementations with a detailed investigation on <italic>how data quality measurement and monitoring concepts are implemented in state-of-the-art tools</italic>. For the first time and in contrast to all existing data quality tool surveys, we conducted a systematic search, in which we identified 667 software tools dedicated to &#x0201C;data quality.&#x0201D; To evaluate the tools, we compiled a requirements catalog with three functionality areas: (1) data profiling, (2) data quality measurement in terms of metrics, and (3) automated data quality monitoring. Using a set of predefined exclusion criteria, we selected 13 tools (8 commercial and 5 open-source tools) that provide the investigated features and are not limited to a specific domain for detailed investigation. On the one hand, this survey allows a critical discussion of concepts that are widely accepted in research, but hardly implemented in any tool observed, for example, generally applicable data quality metrics. On the other hand, it reveals potential for functional enhancement of data quality tools and supports practitioners in the selection of appropriate tools for a given use case.</p></abstract>
<kwd-group>
<kwd>data quality</kwd>
<kwd>data quality tools</kwd>
<kwd>data quality measurement</kwd>
<kwd>data quality monitoring</kwd>
<kwd>data profiling</kwd>
<kwd>information quality</kwd>
</kwd-group>
<contract-sponsor id="cn001">&#x000D6;sterreichische Forschungsf&#x000F6;rderungsgesellschaft<named-content content-type="fundref-id">10.13039/501100004955</named-content></contract-sponsor>
<counts>
<fig-count count="2"/>
<table-count count="10"/>
<equation-count count="11"/>
<ref-count count="94"/>
<page-count count="30"/>
<word-count count="25502"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Data quality (DQ) measurement is a fundamental building block for estimating the relevance of data-driven decisions. Such decisions accompany our everyday life, for instance, machine-based decisions in ranking algorithms, industrial robots, and self-driving cars in the emerging field of artificial intelligence. The negative impact of poor data on the error rate of machine learning (ML) models has been shown by Sessions and Valtorta (<xref ref-type="bibr" rid="B84">2006</xref>) and Ehrlinger et al. (<xref ref-type="bibr" rid="B22">2019</xref>). Human-based decisions also rely on high-quality data; for example, the decision whether to promote or to suspend the production of a specific product is usually based on sales data. Despite the clear correlation between data and decision quality, 84 % of the CEOs in the US are concerned about their DQ (KPMG International, <xref ref-type="bibr" rid="B53">2016</xref>) and &#x0201C;organizations believe poor data quality to be responsible for an average of $15 million per year in losses&#x0201D; (Moore, <xref ref-type="bibr" rid="B63">2018</xref>). Thus, DQ is &#x0201C;no longer a question of &#x02018;hygiene&#x00027; [...], but rather has become critical for operational excellence&#x0201D; and is perceived as the greatest challenge in corporate data management (Otto and &#x000D6;sterle, <xref ref-type="bibr" rid="B67">2016</xref>).</p>
<p>To increase the trust in data-driven decisions, it is necessary to measure, know, and improve the quality of the employed data with appropriate tools (Ehrlinger et al., <xref ref-type="bibr" rid="B25">2018</xref>; Heinrich et al., <xref ref-type="bibr" rid="B38">2018</xref>). DQ measurement and DQ improvement (i.e., data cleansing), which builds on measurement, are both part of comprehensive DQ management. Most existing methodologies describe DQ management as a cyclic process, which is carried out continuously (cf. Redman, <xref ref-type="bibr" rid="B76">1997</xref>; Wang, <xref ref-type="bibr" rid="B91">1998</xref>; English, <xref ref-type="bibr" rid="B27">1999</xref>; Lee et al., <xref ref-type="bibr" rid="B56">2009</xref>; Sebastian-Coleman, <xref ref-type="bibr" rid="B82">2013</xref>). Yet, according to a German survey, 66 % of companies use Excel or Access solutions to validate their DQ and 63 % of the companies determine their DQ manually and <italic>ad hoc</italic> without any long-term DQ management strategy (Sch&#x000E4;ffer and Beckmann, <xref ref-type="bibr" rid="B81">2014</xref>). Considering such studies and the increasing amount of data to be processed, there is a clear need for intensive research to automate DQ management tasks. Sebastian-Coleman (<xref ref-type="bibr" rid="B82">2013</xref>) also states that &#x0201C;without <italic>automation</italic>, the speed and volume of data will quickly overwhelm even the most dedicated efforts to measure.&#x0201D;</p>
<p>Research on data quality has been conducted since the 1980s, and DQ has since been most often associated with the &#x0201C;fitness for use&#x0201D; principle (Chrisman, <xref ref-type="bibr" rid="B16">1983</xref>; Wang and Strong, <xref ref-type="bibr" rid="B92">1996</xref>), which refers to the subjectivity and context-dependency of this topic. Data quality is typically referred to as a multi-dimensional concept, where single aspects are described by DQ <italic>dimensions</italic> (e.g., accuracy, completeness, timeliness). The fulfillment of a DQ dimension can be quantified using one or several DQ <italic>metrics</italic> (Ehrlinger et al., <xref ref-type="bibr" rid="B25">2018</xref>). According to the IEEE standard (IEEE, <xref ref-type="bibr" rid="B41">1998</xref>), a metric is a formula that yields a numerical value. In parallel to the scientific background, a wide variety of commercial, open-source, and academic DQ tools with different foci have been developed to support the automation of DQ management. The range of functions offered by those tools varies widely, because the term &#x0201C;data quality&#x0201D; is context-dependent and not always used consistently. Despite the large number of publications, tools, and concepts on data quality, it is not always clear how to map the concepts from the theory (i.e., dimensions and metrics) to a practical implementation (i.e., tools). Therefore, the question of how to measure and monitor DQ automatically is still not sufficiently answered (Sebastian-Coleman, <xref ref-type="bibr" rid="B82">2013</xref>). In this survey, we contribute to this research question by providing a detailed investigation of DQ measurement and monitoring functionalities in state-of-the-art DQ tools.</p>
<p>Specifically, we conducted a systematic search, where we identified 667 software tools dedicated to &#x0201C;data quality.&#x0201D; According to predefined exclusion criteria, we selected 13 DQ tools (8 commercial and 5 open-source tools) for deeper investigation. To systematically evaluate the functional scope of the tools, we introduce a requirements catalog comprising three categories: (1) data profiling, (2) DQ measurement in terms of dimensions and metrics, and (3) continuous DQ monitoring. Since the focus of this survey is on the automation of DQ tasks, we specifically observe the measurement capabilities (i.e., how to detect and report DQ issues) and to what extent the tools support automated DQ monitoring, required to ensure high-quality data over time (Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B23">2017</xref>). We deliberately exclude tools that solely offer data cleansing and improvement functions, because an automated modification of the data (i.e., data cleansing) is usually not possible in productive information systems with critical content. Consequently, our main contributions can be summarized as follows:</p>
<list list-type="bullet">
<list-item><p>To the best of our knowledge, we conducted the first <italic>systematic search</italic> to identify DQ tools and thus give a comprehensive overview of the market.</p></list-item>
<list-item><p>We compiled a <italic>requirements catalog</italic> to investigate data profiling, DQ measurement, and automated DQ monitoring functionalities of DQ tools. This catalog summarizes and classifies tasks that are required for automated and continuous DQ measurement in a new way and supports follow-up studies, e.g., on domain-specific DQ tools.</p></list-item>
<list-item><p>Based on the <italic>detailed investigation</italic> of 13 DQ tools, we propose a new research direction for DQ measurement and highlight potential for enhancement in the DQ tools.</p></list-item>
</list>
<p>The results of this survey are not only relevant for DQ professionals to select the most appropriate tool for a given use case, but also highlight the current capabilities of state-of-the-art DQ tools. Especially since such a wide variety of DQ tools exist, it is often not clear which functional scope can be expected. The main findings of this article can be summarized as follows:</p>
<list list-type="bullet">
<list-item><p>Despite the presumption that the emerging market of DQ tools is still under development (cf. Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>), we found a vast number (667) of DQ tools through our systematic search, where most of them have never been included in one of the existing surveys.</p></list-item>
<list-item><p>Approximately half (50.82 %) of the DQ tools were domain specific, which means they were either dedicated to specific types of data or built to measure the DQ of a proprietary tool.</p></list-item>
<list-item><p>16.67 % of the DQ tools focused on data cleansing without a proper DQ measurement strategy (i.e., measurements are used to modify the data, but no comprehensive reports are provided).</p></list-item>
<list-item><p>Most surveyed tools supported data profiling to some extent, but considering the research state, there is potential for functional enhancement in data profiling, especially with respect to multi-column profiling and dependency discovery.</p></list-item>
<list-item><p>We did not find a tool that implements a wider range of DQ metrics for the most important DQ dimensions as proposed in research papers (cf. Piro, <xref ref-type="bibr" rid="B71">2014</xref>; Batini and Scannapieco, <xref ref-type="bibr" rid="B11">2016</xref>; Heinrich et al., <xref ref-type="bibr" rid="B38">2018</xref>). Identified metric implementations have several drawbacks: some are only applicable at the attribute level (e.g., no aggregation), some require a gold standard that might not exist, and some have implementation errors.</p></list-item>
<list-item><p>In general-purpose DQ tools, DQ monitoring is considered a premium feature, which is subject to a fee and only provided in professional versions. Exceptions are dedicated open-source DQ monitoring tools, like Apache Griffin or MobyDQ, which support the automation of rules, but lack predefined functions and data profiling capabilities.</p></list-item>
</list>
<p>This article is structured as follows: Section 2 summarizes related research concerning DQ management, measurement, and monitoring. Section 3 covers the applied methodology to conduct this research, including related surveys, our research questions, and the tool selection strategy. Based on the existing research from Section 2, we introduce a new requirements catalog to evaluate DQ tools and the accompanying evaluation strategy in Section 4. In Section 5, we describe the tools, which have been selected for investigation, and discuss the evaluation. The results and lessons learned are summarized in Section 6. We conclude in Section 7 with an outlook on future work.</p>
</sec>
<sec id="s2">
<title>2. Theoretical Background on Data Quality</title>
<p>Despite different existing interpretations, the term &#x0201C;data quality&#x0201D; is most frequently described as &#x0201C;fitness for use&#x0201D; (Chrisman, <xref ref-type="bibr" rid="B16">1983</xref>; Wang and Strong, <xref ref-type="bibr" rid="B92">1996</xref>), referring to the high subjectivity and context-dependency of this topic. Information quality is often used as a synonym for data quality and even though both terms can be clearly distinguished, because &#x0201C;data&#x0201D; refers to plain facts and &#x0201C;information&#x0201D; describes the extension of those facts with context and semantics, they are often used interchangeably in the DQ literature (Wang, <xref ref-type="bibr" rid="B91">1998</xref>; Zhu et al., <xref ref-type="bibr" rid="B94">2014</xref>). We use the term data quality because our focus is on processing objectively and automatically retrievable facts (i.e., intrinsic data characteristics). The term information serves as a synonym for data in the systematic search to achieve higher coverage.</p>
<sec>
<title>2.1. Data Quality Management</title>
<p>The Data Management Association (DAMA) defines &#x0201C;data quality management&#x0201D; as the analysis, improvement and assurance of data quality (Otto and &#x000D6;sterle, <xref ref-type="bibr" rid="B67">2016</xref>). Over the years, a number of different DQ methodologies (also referred to as &#x0201C;frameworks,&#x0201D; &#x0201C;programs,&#x0201D; or &#x0201C;methods&#x0201D;) have been proposed, for example, TDQM (Total Data Quality Management) by Wang (<xref ref-type="bibr" rid="B91">1998</xref>), AIMQ (A Methodology for Information Quality Assessment) by Lee et al. (<xref ref-type="bibr" rid="B57">2002</xref>), and the DQ assessment methods by Pipino et al. (<xref ref-type="bibr" rid="B70">2002</xref>) and Maydanchik (<xref ref-type="bibr" rid="B61">2007</xref>). Batini et al. (<xref ref-type="bibr" rid="B10">2009</xref>) conducted a comprehensive comparison of DQ methodologies, and Cichy and Rass (<xref ref-type="bibr" rid="B17">2019</xref>) provide a more recent overview of generally applicable DQ methodologies. Although these methodologies have different characteristics and emphases, it is possible to extract four core activities (cf. English, <xref ref-type="bibr" rid="B27">1999</xref>; Maydanchik, <xref ref-type="bibr" rid="B61">2007</xref>; Batini et al., <xref ref-type="bibr" rid="B10">2009</xref>; Cichy and Rass, <xref ref-type="bibr" rid="B17">2019</xref>): (1) state reconstruction, (2) DQ measurement or assessment, (3) data cleansing or improvement, and (4) the establishment of continuous DQ monitoring. Not all methodologies include all of these steps, for example, step (1) is omitted by Maydanchik (<xref ref-type="bibr" rid="B61">2007</xref>) and step (4) is omitted in the DQ methodology survey by Batini et al. (<xref ref-type="bibr" rid="B10">2009</xref>). Further, some methodologies include additional activities like monitoring of data integration interfaces (cf.
Maydanchik, <xref ref-type="bibr" rid="B61">2007</xref>), which we do not consider because of their specialization. In the following paragraphs, we describe the four core steps of a DQ methodology in detail to clarify the difference between DQ measurement, DQ monitoring, and data cleansing activities, where the latter ones are not included in the survey. Step (1), the state reconstruction, describes the collection of contextual information on the observed data, as well as on the organization where a DQ project is carried out (Batini et al., <xref ref-type="bibr" rid="B10">2009</xref>). Since the focus of this article is on DQ tool functionalities, we restrict step (1) in the following to the data part (i.e., data profiling) and do not describe gathering of contextual information on the organization in detail.</p>
<sec>
<title>2.1.1. Data Profiling</title>
<p>Data profiling is described as the process of analyzing a dataset to collect data about data (i.e., metadata) using a broad range of techniques (Naumann, <xref ref-type="bibr" rid="B65">2014</xref>; Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>, <xref ref-type="bibr" rid="B2">2019</xref>). Thus, it is an essential task prior to any DQ measurement or monitoring activity to get insight into a given dataset. Examples of information gathered during data profiling are the number of distinct or missing (i.e., null) values in a column, data types of attributes, or occurring patterns and their frequency (e.g., formatting of telephone numbers) (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>). We refer to Abedjan et al. (<xref ref-type="bibr" rid="B1">2015</xref>, <xref ref-type="bibr" rid="B2">2019</xref>) for a detailed discussion on data profiling techniques and tasks. According to Selvage et al. (<xref ref-type="bibr" rid="B83">2017</xref>) and the findings of our survey, most general-purpose DQ tools offer data profiling capabilities to some extent.</p>
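<p>As a minimal sketch of such single-column profiling, the following Python snippet counts missing and distinct values, collects the occurring data types, and derives value patterns with their frequencies. It assumes that a column is available as an in-memory list in which <monospace>None</monospace> marks a missing value; the function name <monospace>profile_column</monospace>, the pattern alphabet (9 for digits, A for letters), and the sample telephone numbers are illustrative assumptions and are not taken from any of the surveyed tools.</p>
<preformat preformat-type="code">
```python
import re
from collections import Counter


def profile_column(values):
    """Collect simple single-column metadata from a list of raw values.

    Hypothetical helper for illustration; None is assumed to mark a missing value.
    """
    non_null = [v for v in values if v is not None]
    # Generalize each value to a pattern: every digit becomes 9, every letter becomes A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", str(v))) for v in non_null
    )
    return {
        "count": len(values),
        "missing": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": sorted({type(v).__name__ for v in non_null}),
        "patterns": dict(patterns),
    }


# Invented sample column with telephone numbers in two different formats.
phones = ["0732-2468", "0732-1357", None, "07322468", "0732-2468"]
profile = profile_column(phones)
print(profile)
```
</preformat>
<p>For the five sample values, the profile reports one missing value, three distinct values, and two patterns (the hyphenated format three times and the unformatted one once), which hints at inconsistent formatting within the column.</p>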
</sec>
<sec>
<title>2.1.2. Data Quality Measurement</title>
<p>According to Sebastian-Coleman (<xref ref-type="bibr" rid="B82">2013</xref>), one of the biggest challenges for DQ practitioners is to answer the question of how data quality should actually be measured. Ge and Helfert (<xref ref-type="bibr" rid="B33">2007</xref>) indicate that this is also true for the synonymously used term <italic>assessment</italic> by stating that one of the major questions in DQ research is &#x0201C;How to assess data quality?&#x0201D; The term <italic>measure</italic> describes &#x0201C;to ascertain the size, amount, or degree of something by using an instrument or device marked in standard units or by comparing it with an object of known size&#x0201D; (McKean, <xref ref-type="bibr" rid="B62">2005</xref>). Although the term <italic>assessment</italic> is often used as a synonym for measurement, especially in DQ literature there is a clear distinction between both terms. Assessment is the &#x0201C;evaluation or estimation of the nature, ability, or quality of something&#x0201D; and extends the concept of measurement by evaluating the measurement results and drawing a conclusion about the object of assessment (McKean, <xref ref-type="bibr" rid="B62">2005</xref>; Sebastian-Coleman, <xref ref-type="bibr" rid="B82">2013</xref>). DQ assessment is also described as the detection and initial estimation of data quality as well as the impact analysis of occurring DQ problems (English, <xref ref-type="bibr" rid="B27">1999</xref>; Apel et al., <xref ref-type="bibr" rid="B5">2015</xref>). In this survey, we use the term <italic>measurement</italic> since the focus is on measurement capabilities of DQ tools, independently of the interpretation of the results by a user.</p>
<p>In addition to scientific publications, standards should represent the consensus of practitioners and researchers likewise. In terms of data quality, there has been considerable work done by the ISO/IEC JTC 1 (&#x0201C;Information technology&#x0201D;) subcommittee 7 on &#x0201C;software and systems engineering.&#x0201D; SC 7&#x00027;s working group 06 published (<xref ref-type="bibr" rid="B45">ISO/IEC 25012:2008</xref>, <xref ref-type="bibr" rid="B45">2008</xref>; <xref ref-type="bibr" rid="B47">ISO/IEC 25040:2011</xref>, <xref ref-type="bibr" rid="B47">2011</xref>; <xref ref-type="bibr" rid="B46">ISO/IEC 25024:2015(E</xref>), <xref ref-type="bibr" rid="B46">2015</xref>). In parallel, subcommittee SC 4 &#x0201C;Industrial data&#x0201D; of the technical committee ISO/TC 184 (&#x0201C;Industrial automation systems and integration&#x0201D;) published (<xref ref-type="bibr" rid="B44">ISO 8000-8:2015(E</xref>), <xref ref-type="bibr" rid="B44">2015</xref>). While (<xref ref-type="bibr" rid="B44">ISO 8000-8:2015(E</xref>), <xref ref-type="bibr" rid="B44">2015</xref>) defines prerequisites for the measurement and reporting of information and data quality on a very general level, (<xref ref-type="bibr" rid="B45">ISO/IEC 25012:2008</xref>, <xref ref-type="bibr" rid="B45">2008</xref>) provides more concrete DQ measures as well as an explanation of how to apply them. According to <xref ref-type="bibr" rid="B44">ISO 8000-8:2015(E</xref>) (<xref ref-type="bibr" rid="B44">2015</xref>), data can be measured on a very general level according to (1) <italic>syntactic quality</italic> that describes the degree to which data conforms to a specified syntax, (2) <italic>semantic quality</italic>, i.e., the degree to which data corresponds to its real representation, or (3) <italic>pragmatic quality</italic>, i.e., the degree to which data is suitable for a specific purpose.
<xref ref-type="bibr" rid="B45">ISO/IEC 25012:2008</xref> (<xref ref-type="bibr" rid="B45">2008</xref>) defines the &#x0201C;measurement&#x0201D; (of data quality) as &#x0201C;set of operations having the object of determining a value of a measure&#x0201D; and defines a set of normalized quality measures (between 0 and 1).</p>
<p>The partition of data quality into a set of DQ dimensions, which can be measured with metrics, is widely accepted in DQ research (cf. Wang and Strong, <xref ref-type="bibr" rid="B92">1996</xref>; Lee et al., <xref ref-type="bibr" rid="B56">2009</xref>; Batini and Scannapieco, <xref ref-type="bibr" rid="B11">2016</xref>). For example, Lee et al. (<xref ref-type="bibr" rid="B56">2009</xref>) state that &#x0201C;DQ assessment requires assessments along a number of dimensions.&#x0201D; The quality measures provided by <xref ref-type="bibr" rid="B45">ISO/IEC 25012:2008</xref> (<xref ref-type="bibr" rid="B45">2008</xref>) correspond to the most popular metrics in literature (e.g., accuracy, completeness, consistency). Despite the wide agreement on DQ dimensions and metrics (i.e., measures) in general and a lot of research over the last decades, there is still no consensus on a standardized list of dimensions and metrics for DQ measurement (Sebastian-Coleman, <xref ref-type="bibr" rid="B82">2013</xref>; Myers, <xref ref-type="bibr" rid="B64">2017</xref>). Thus, we observe existing DQ dimensions and metrics and justify their inclusion in our requirements catalog in Section 2.2.</p>
</sec>
<sec>
<title>2.1.3. Data Cleansing</title>
<p>Data cleansing describes the process of correcting erroneous data or data glitches (Dasu and Johnson, <xref ref-type="bibr" rid="B20">2003</xref>). In practice, automatable cleansing tasks include customer data standardization, de-duplication, and matching. Other efforts to improve DQ are usually performed manually. While automated data cleansing methods are very valuable for large amounts of data, they risk inserting new errors that are rarely well understood (Maydanchik, <xref ref-type="bibr" rid="B61">2007</xref>). We intentionally did not observe data cleansing functionalities in this survey, since the focus is on the detection of DQ problems. However, data cleansing algorithms are usually based on DQ measurement, since it is initially necessary to detect DQ problems to increase the quality of a given dataset.</p>
</sec>
<sec>
<title>2.1.4. Data Quality Monitoring</title>
<p>The term &#x0201C;DQ monitoring&#x0201D; is mainly used implicitly in literature without an established definition and common understanding. This leads to different interpretations when the term is mentioned in scientific publications or by companies promoting and describing their DQ tool. There is a difference between &#x0201C;data monitoring,&#x0201D; which describes continuous checking of rules, and &#x0201C;DQ monitoring,&#x0201D; which is ongoing measurement of DQ (Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B23">2017</xref>). The aim of this survey is to observe not only the functionalities of current DQ tools in terms of data profiling and measurement, but also in terms of true DQ monitoring. Pushkarev et al. (<xref ref-type="bibr" rid="B74">2010</xref>) and a follow-up study (Pulla et al., <xref ref-type="bibr" rid="B73">2016</xref>) point out that none of the tools observed had any monitoring functionality. We nevertheless include this criterion in our requirements catalog, since there is evidence on several DQ tool websites that these tools do offer monitoring functionalities, but were not covered by Pushkarev et al. (<xref ref-type="bibr" rid="B74">2010</xref>) and Pulla et al. (<xref ref-type="bibr" rid="B73">2016</xref>). According to the ISO standard 8000 (<xref ref-type="bibr" rid="B44">ISO 8000-8:2015(E</xref>), <xref ref-type="bibr" rid="B44">2015</xref>), pragmatic data quality measurement requires interaction with the respective users who validate the data. Consequently, fully automated DQ monitoring is restricted to syntactic and semantic DQ aspects.</p>
</sec>
</sec>
<sec>
<title>2.2. Data Quality Dimensions and Metrics</title>
<p>Data quality is often described as a concept with multiple dimensions, so that every DQ dimension refers to a specific aspect of the quality of data (Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B24">2019</xref>). Over the years, a wide variety of dimensions and dimension classifications have been proposed (Ballou and Pazer, <xref ref-type="bibr" rid="B8">1985</xref>; Wand and Wang, <xref ref-type="bibr" rid="B90">1996</xref>; Wang and Strong, <xref ref-type="bibr" rid="B92">1996</xref>; Pipino et al., <xref ref-type="bibr" rid="B70">2002</xref>; Ge and Helfert, <xref ref-type="bibr" rid="B33">2007</xref>; Batini and Scannapieco, <xref ref-type="bibr" rid="B11">2016</xref>). An overview of possible dimensions and classifications is provided by Laranjeiro et al. (<xref ref-type="bibr" rid="B55">2015</xref>) and Scannapieco and Catarci (<xref ref-type="bibr" rid="B80">2002</xref>). Despite intensive research and an ongoing discussion on DQ dimensions, there is still no consensus on which dimensions are essential for DQ measurement (Sebastian-Coleman, <xref ref-type="bibr" rid="B82">2013</xref>). Our evaluation framework covers the four most frequently used dimensions: accuracy, completeness, consistency, and timeliness (Wand and Wang, <xref ref-type="bibr" rid="B90">1996</xref>; Scannapieco and Catarci, <xref ref-type="bibr" rid="B80">2002</xref>; Hildebrand et al., <xref ref-type="bibr" rid="B39">2015</xref>).</p>
<p>Piro (<xref ref-type="bibr" rid="B71">2014</xref>) distinguishes between &#x0201C;hard dimensions&#x0201D; (including accuracy, completeness, and timeliness, amongst others), which can be measured objectively using check routines, and &#x0201C;soft dimensions,&#x0201D; which can only be assessed using subjective evaluation. However, also objective check routines require a preceding subjective and domain-specific definition of the data objects to be measured, in order to consequently follow the &#x0201C;fitness for use&#x0201D; approach (Piro, <xref ref-type="bibr" rid="B71">2014</xref>).</p>
<p>In conjunction with the discussion of DQ dimensions, it is often mentioned that the definition of specific DQ metrics is required to apply those dimensions in practice. A metric is a function that maps a quality dimension to a numerical value, which allows an interpretation of a dimension&#x00027;s fulfillment (IEEE, <xref ref-type="bibr" rid="B41">1998</xref>). Such a DQ metric can be measured on different aggregation levels: on value-level, column or attribute-level, tuple or record-level, table or relation-level, as well as database (DB)-level (Hildebrand et al., <xref ref-type="bibr" rid="B39">2015</xref>). The aggregation could, for example, be performed with the weighted arithmetic mean of the calculated metric results from the previous level (e.g., results of the record-level to calculate the table-level metric) (Hinrichs, <xref ref-type="bibr" rid="B40">2002</xref>). Heinrich et al. (<xref ref-type="bibr" rid="B38">2018</xref>) proposed five requirements for DQ metrics to ensure reliable decision-making: &#x0201C;the existence of minimum and maximum metric values (R1), the interval scaling of the metric values (R2), the quality of the configuration parameters and the determination of the metric values (R3), the sound aggregation of the metric values (R4), and the economic efficiency of the metric (R5).&#x0201D; However, other researchers claim &#x0201C;that a more general approach is required&#x0201D; (Bronselaer et al., <xref ref-type="bibr" rid="B13">2018</xref>) to assess the usefulness and validity of a DQ metric. In the following, we describe four prominent DQ dimensions along with common metrics for their calculation. The list of metrics is not exhaustive, but should give an impression of the research conducted in this area, since we observe the existence of such or similar metrics in our DQ tool evaluation.</p>
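<p>The aggregation with a weighted arithmetic mean can be illustrated with a short Python sketch. It assumes that the metric results of the lower level (here, record-level results) are already normalized to the interval [0, 1]; the function name <monospace>aggregate</monospace>, the sample scores, and the weights are hypothetical and only serve as an example of the general idea, not as the implementation of any specific methodology or tool.</p>
<preformat preformat-type="code">
```python
def aggregate(results, weights=None):
    """Weighted arithmetic mean of metric results from the previous level.

    Hypothetical helper for illustration; without weights, all results count equally.
    """
    if not results:
        raise ValueError("no metric results to aggregate")
    if weights is None:
        weights = [1.0] * len(results)
    if len(weights) != len(results):
        raise ValueError("results and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * r for w, r in zip(weights, results)) / total


# Invented record-level results; the first record is weighted twice as heavily.
record_scores = [1.0, 0.5, 0.0, 1.0]
record_weights = [2.0, 1.0, 1.0, 1.0]

table_score = aggregate(record_scores, record_weights)
unweighted_score = aggregate(record_scores)
print(table_score, unweighted_score)
```
</preformat>
<p>With these sample values, the weighted table-level result is 3.5 / 5 = 0.7, whereas the unweighted mean is 2.5 / 4 = 0.625. The choice of weights is therefore part of the subjective, domain-specific configuration of a metric.</p>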
<sec>
<title>2.2.1. Accuracy</title>
<p>Although accuracy is sometimes described as the most important data quality dimension, a number of different definitions exist (Wand and Wang, <xref ref-type="bibr" rid="B90">1996</xref>; Haegemans et al., <xref ref-type="bibr" rid="B35">2016</xref>). In DQ literature, accuracy can be described as the closeness between an information system and the part of the real world it is supposed to model (Batini and Scannapieco, <xref ref-type="bibr" rid="B11">2016</xref>). From the natural sciences perspective, accuracy is usually defined as the &#x0201C;magnitude of an error&#x0201D; (Haegemans et al., <xref ref-type="bibr" rid="B35">2016</xref>). We refer to Haegemans et al. (<xref ref-type="bibr" rid="B35">2016</xref>) for a detailed discussion of the definitions of accuracy and a comprehensive list of metrics related to accuracy. Here, we provide a few exemplary metrics. Redman (<xref ref-type="bibr" rid="B77">2005</xref>) defines field- and record-level accuracy as follows:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext>field&#x000A0;level&#x000A0;accuracy&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>number&#x000A0;of&#x000A0;fields&#x000A0;judged&#x000A0;</mml:mtext><mml:mo>&#x00022;</mml:mo><mml:mtext>correct</mml:mtext><mml:mo>&#x00022;</mml:mo></mml:mrow><mml:mrow><mml:mtext>number&#x000A0;of&#x000A0;fields&#x000A0;tested</mml:mtext></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>(Redman, <xref ref-type="bibr" rid="B77">2005</xref>),</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtext>record&#x000A0;level&#x000A0;accuracy&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext>number&#x000A0;of&#x000A0;records&#x000A0;judged&#x000A0;</mml:mtext><mml:mo>&#x00022;</mml:mo><mml:mtext>completely&#x000A0;correct</mml:mtext><mml:mo>&#x00022;</mml:mo></mml:mrow><mml:mrow><mml:mtext>number&#x000A0;of&#x000A0;records&#x000A0;tested</mml:mtext></mml:mrow></mml:mfrac></mml:math></disp-formula>
<p>This metric is also reused by the DAMA UK (Askham et al., <xref ref-type="bibr" rid="B7">2013</xref>) by generalizing &#x0201C;fields&#x0201D; and &#x0201C;records&#x0201D; to &#x0201C;objects.&#x0201D; Lee et al. (<xref ref-type="bibr" rid="B56">2009</xref>) use the inverse metric (<inline-formula><mml:math id="M3"><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">Number of data units in error</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">Total number of data units</mml:mtext></mml:mstyle></mml:mrow></mml:mfrac></mml:math></inline-formula>) and Fisher et al. (<xref ref-type="bibr" rid="B30">2009</xref>) additionally take into account the randomness of the occurrence of an error <italic>ROE</italic> and the probability distribution of the occurrence of an error <italic>PDOE</italic>:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">accuracy</mml:mtext><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mtext class="textrm" mathvariant="normal">NrOfCorrectValues</mml:mtext></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">TotalNrOfValues</mml:mtext></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mtext class="textrm" mathvariant="normal">ROE</mml:mtext><mml:mo>,</mml:mo><mml:mtext class="textrm" mathvariant="normal">PDOE</mml:mtext></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>(Fisher et al., <xref ref-type="bibr" rid="B30">2009</xref>).</p>
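<p>As a minimal sketch, the ratio metrics in Equations (1) and (2) could be computed as follows in Python. The function names, the example records, and the judging function are hypothetical and not taken from Redman (<xref ref-type="bibr" rid="B77">2005</xref>); how a field is judged &#x0201C;correct&#x0201D; is application-specific and is assumed here to be supplied by the user.</p>
<preformat><![CDATA[
```python
from typing import Callable, Mapping, Sequence

Record = Mapping[str, object]
# Hypothetical judge: returns True if a field value is judged "correct".
Judge = Callable[[str, object], bool]


def field_level_accuracy(records: Sequence[Record], judge: Judge) -> float:
    """Equation (1): fields judged correct divided by fields tested."""
    verdicts = [judge(name, value) for record in records for name, value in record.items()]
    if not verdicts:
        raise ValueError("no fields were tested")
    return sum(verdicts) / len(verdicts)


def record_level_accuracy(records: Sequence[Record], judge: Judge) -> float:
    """Equation (2): records judged completely correct divided by records tested."""
    if not records:
        raise ValueError("no records were tested")
    completely_correct = sum(
        all(judge(name, value) for name, value in record.items()) for record in records
    )
    return completely_correct / len(records)


# Illustrative data and judge (both invented for this sketch).
VALID_CITIES = {"Linz", "Hagenberg"}


def example_judge(name: str, value: object) -> bool:
    if name == "city":
        return value in VALID_CITIES
    return isinstance(value, str) and value.isdigit()


records = [
    {"city": "Linz", "zip": "4040"},
    {"city": "Lnz", "zip": "4040"},
]
print(field_level_accuracy(records, example_judge))   # 3 of 4 fields judged correct
print(record_level_accuracy(records, example_judge))  # 1 of 2 records completely correct
```
]]></preformat>
<p>The inverse formulation by Lee et al. (<xref ref-type="bibr" rid="B56">2009</xref>) yields the same value whenever every tested data unit is judged either correct or in error.</p>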
<p>Hinrichs (<xref ref-type="bibr" rid="B40">2002</xref>) proposed the accuracy metric in Equation (4), which can be aggregated on different levels. On attribute-value-level, the metric <italic>Q</italic><sub><italic>Gen</italic></sub> for accuracy (<italic>Gen</italic> is &#x0201C;Genauigkeit&#x0201D; in German, which means &#x0201C;accuracy&#x0201D; in English) is defined by the ratio between a value&#x00027;s arity and its optimal arity for numeric values. For a numeric attribute <italic>A</italic>, <italic>s</italic><sub><italic>opt</italic></sub>(<italic>A</italic>) is the optimal number of digits and decimals for <italic>A</italic>, <italic>w</italic> is a value of <italic>A</italic>, and <italic>s</italic>(<italic>w</italic>) is the actual number of digits and decimals for <italic>w</italic> in attribute <italic>A</italic>. Since <italic>s</italic><sub><italic>opt</italic></sub>(<italic>A</italic>) is not necessarily maximal, the metric needs to be normalized to [0, 1] (Hinrichs, <xref ref-type="bibr" rid="B40">2002</xref>).</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>G</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>p</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>(Hinrichs, <xref ref-type="bibr" rid="B40">2002</xref>).</p>
<p>For non-numeric attributes, Hinrichs (<xref ref-type="bibr" rid="B40">2002</xref>) suggests assigning <italic>w</italic> to plane <italic>i</italic> within a classification <italic>K</italic> with <italic>n</italic> planes (<italic>K</italic><sub>1</sub>, ..., <italic>K</italic><sub><italic>n</italic></sub>), replacing <italic>s</italic>(<italic>w</italic>) with <italic>i</italic>, and selecting <italic>s</italic><sub><italic>opt</italic></sub>(<italic>A</italic>) from <italic>K</italic> with <italic>s</italic><sub><italic>opt</italic></sub>(<italic>A</italic>) &#x02264; <italic>n</italic>. For a tuple <italic>t</italic>, accuracy <italic>Q</italic><sub><italic>Gen</italic></sub> is measured according to:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>G</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>G</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>.</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>(Hinrichs, <xref ref-type="bibr" rid="B40">2002</xref>),</p>
<p>where <italic>t</italic>.<italic>A</italic><sub>1</sub>, ..., <italic>t</italic>.<italic>A</italic><sub><italic>n</italic></sub> are the attribute values for attributes <italic>A</italic><sub>1</sub>, ..., <italic>A</italic><sub><italic>n</italic></sub> that specify the observed tuple <italic>t</italic>. Factor <italic>g</italic><sub><italic>j</italic></sub> is the relative importance of <italic>A</italic><sub><italic>j</italic></sub> with respect to the total tuple and is an expert-defined weight (Hinrichs, <xref ref-type="bibr" rid="B40">2002</xref>). The accuracy on table-level is then calculated as the arithmetic mean of the tuple accuracy measurements, and the accuracy on DB-level is the arithmetic mean of the table-level accuracy measurements. For a more detailed discussion on the metric, we refer to Hinrichs (<xref ref-type="bibr" rid="B40">2002</xref>).</p>
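<p>The aggregation in Equations (4) and (5) can be sketched as follows. This is an illustration under stated assumptions, not the implementation by Hinrichs (<xref ref-type="bibr" rid="B40">2002</xref>): <italic>s</italic>(<italic>w</italic>) is approximated by counting the digit characters in the textual representation of a numeric value, and all function names, attribute names, and weights are hypothetical.</p>
<preformat><![CDATA[
```python
from typing import Mapping, Sequence


def digit_count(value: str) -> int:
    """s(w): here simply the number of digit characters in the text of w (assumption)."""
    return sum(ch.isdigit() for ch in value)


def q_gen_value(value: str, s_opt: int) -> float:
    """Equation (4): min(s(w) / s_opt(A), 1)."""
    if s_opt <= 0:
        raise ValueError("s_opt must be positive")
    return min(digit_count(value) / s_opt, 1.0)


def q_gen_tuple(
    values: Mapping[str, str],
    s_opt: Mapping[str, int],
    weights: Mapping[str, float],
) -> float:
    """Equation (5): weighted arithmetic mean of the value-level results."""
    total_weight = sum(weights[name] for name in values)
    weighted_sum = sum(
        q_gen_value(value, s_opt[name]) * weights[name] for name, value in values.items()
    )
    return weighted_sum / total_weight


def q_gen_table(
    rows: Sequence[Mapping[str, str]],
    s_opt: Mapping[str, int],
    weights: Mapping[str, float],
) -> float:
    """Table level: arithmetic mean of the tuple-level results."""
    if not rows:
        raise ValueError("table has no rows")
    return sum(q_gen_tuple(row, s_opt, weights) for row in rows) / len(rows)


# Values are kept as strings so that trailing decimals are not lost.
s_opt = {"price": 4, "length": 3}          # optimal number of digits per attribute (invented)
weights = {"price": 3.0, "length": 1.0}    # expert-defined weights g_j (invented)
row = {"price": "12.5", "length": "3.25"}

print(q_gen_value("12.5", s_opt["price"]))  # 3 digits out of an optimum of 4
print(q_gen_tuple(row, s_opt, weights))
```
]]></preformat>
<p>Passing the values as text is a design choice of this sketch: a floating-point representation would not preserve the number of decimals that were actually recorded.</p>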
</sec>
<sec>
<title>2.2.2. Completeness</title>
<p>Completeness is very generally described as the &#x0201C;breadth, depth, and scope of information contained in the data&#x0201D; (Wang and Strong, <xref ref-type="bibr" rid="B92">1996</xref>; Batini and Scannapieco, <xref ref-type="bibr" rid="B11">2016</xref>) and covers the condition for data to exist. Considering related work (cf. Redman, <xref ref-type="bibr" rid="B76">1997</xref>; Hinrichs, <xref ref-type="bibr" rid="B40">2002</xref>; Lee et al., <xref ref-type="bibr" rid="B56">2009</xref>; Ehrlinger et al., <xref ref-type="bibr" rid="B25">2018</xref>), the most generic metric for completeness can be defined as:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Completeness</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>e</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where |<italic>e</italic><sub><italic>c</italic></sub>| is the number of complete elements and |<italic>e</italic>| is the total number of elements. Here, the generic term &#x0201C;element&#x0201D; can refer to any data unit, e.g., an attribute, a record, or a table. Lee et al. (<xref ref-type="bibr" rid="B56">2009</xref>) use the inverse metric (<inline-formula><mml:math id="M8"><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">Number of incomplete elements</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">Total number of elements</mml:mtext></mml:mstyle></mml:mrow></mml:mfrac></mml:math></inline-formula>), and Batini and Scannapieco (<xref ref-type="bibr" rid="B11">2016</xref>) suggest comparing the number of complete elements to the (total) number of elements in a perfect reference dataset. A more detailed specification on how to calculate completeness is provided by Hinrichs (<xref ref-type="bibr" rid="B40">2002</xref>), who assigns 0.0 to a field value that is <monospace>null</monospace> or equivalent and 1.0 otherwise. Based on this assumption, completeness can be calculated analogously to the accuracy metric on different aggregation levels with the weighted arithmetic mean. For example, the completeness <italic>Q</italic><sub><italic>Voll</italic></sub> (<italic>Voll</italic> is &#x0201C;Vollst&#x000E4;ndigkeit&#x0201D; in German, which means &#x0201C;completeness&#x0201D; in English) on table-level is defined as:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>T</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:msubsup></mml:mstyle><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>T</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>(Hinrichs, <xref ref-type="bibr" rid="B40">2002</xref>),</p>
<p>where |<italic>T</italic>| is the number of records in table <italic>T</italic> and <italic>Q</italic><sub><italic>Voll</italic></sub>(<italic>t</italic><sub><italic>i</italic></sub>) is the completeness of record <italic>t</italic><sub><italic>i</italic></sub>. We want to point out that in addition to the assumption by Hinrichs, who counts true missing values (i.e., <monospace>null</monospace>), it is also possible to approach completeness in a more rigorous way by considering default values or textual entries stating &#x0201C;NaN&#x0201D; (i.e., not a number) as incomplete values.</p>
<p>Although Hinrichs does not propose a completeness metric per attribute (i.e., column) and other related work such as Askham et al. (<xref ref-type="bibr" rid="B7">2013</xref>) describes attribute-level completeness only textually, such a metric can be derived from the description and Equation (6) as follows:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>v</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where |<italic>v</italic>| is the total number of values within a column and |<italic>v</italic><sub><italic>c</italic></sub>| is the number of complete values that are not <monospace>null</monospace>.</p>
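<p>The completeness metrics in Equations (7) and (8) can be sketched in Python as follows. The function names and the optional <monospace>placeholders</monospace> parameter are hypothetical; the parameter illustrates the more rigorous reading in which textual entries such as &#x0201C;NaN&#x0201D; also count as incomplete. Record-level completeness is assumed here to be the unweighted mean over the attributes of a record.</p>
<preformat><![CDATA[
```python
from typing import Mapping, Sequence

NO_PLACEHOLDERS: frozenset = frozenset()


def value_completeness(value: object, placeholders: frozenset = NO_PLACEHOLDERS) -> float:
    """0.0 for null (None) or a listed placeholder string, 1.0 otherwise."""
    if value is None:
        return 0.0
    if isinstance(value, str) and value.strip() in placeholders:
        return 0.0
    return 1.0


def attribute_completeness(column: Sequence[object], placeholders: frozenset = NO_PLACEHOLDERS) -> float:
    """Equation (8): complete values divided by all values of a column."""
    if not column:
        raise ValueError("column has no values")
    return sum(value_completeness(v, placeholders) for v in column) / len(column)


def record_completeness(record: Mapping[str, object], placeholders: frozenset = NO_PLACEHOLDERS) -> float:
    """Record level: unweighted mean over the attribute values (equal weights assumed)."""
    return attribute_completeness(list(record.values()), placeholders)


def table_completeness(rows: Sequence[Mapping[str, object]], placeholders: frozenset = NO_PLACEHOLDERS) -> float:
    """Equation (7): arithmetic mean of the record-level completeness."""
    if not rows:
        raise ValueError("table has no rows")
    return sum(record_completeness(row, placeholders) for row in rows) / len(rows)


# Invented example table with one null and one textual "NaN" entry.
rows = [
    {"name": "Ada", "phone": None},
    {"name": "NaN", "phone": "123"},
]
names = [row["name"] for row in rows]

print(attribute_completeness(names))                      # only null counts as missing
print(attribute_completeness(names, frozenset({"NaN"})))  # "NaN" also counts as missing
print(table_completeness(rows))
```
]]></preformat>
<p>Which placeholder strings or default values should be treated as missing is a domain decision and has to be configured per data source.</p>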
</sec>
<sec>
<title>2.2.3. Consistency</title>
<p>There are also different definitions for the consistency dimension. According to Batini and Scannapieco (<xref ref-type="bibr" rid="B11">2016</xref>), &#x0201C;consistency captures the violation of semantic rules defined over data items, where items can be tuples of relational tables or records in a file.&#x0201D; Examples of such rules are integrity constraints from relational theory. Hinrichs (<xref ref-type="bibr" rid="B40">2002</xref>) assumes for his proposed consistency metric that domain knowledge is encoded into rules and excludes contradictions within the rules and fuzzy or probabilistic assumptions. Consequently, the consistency <italic>Q</italic><sub><italic>Kon</italic></sub> (<italic>Kon</italic> is &#x0201C;Konsistenz&#x0201D; in German, which means &#x0201C;consistency&#x0201D; in English) of an attribute value <italic>w</italic> is defined as</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mi>Q</mml:mi><mml:mrow><mml:mi>K</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>w</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>w</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>(Hinrichs, <xref ref-type="bibr" rid="B40">2002</xref>),</p>
<p>where <italic>g</italic><sub><italic>j</italic></sub> is the degree of severity of <italic>r</italic><sub><italic>j</italic></sub>(<italic>w</italic>), and <italic>r</italic><sub><italic>j</italic></sub>(<italic>w</italic>) is the violation of consistency rule <italic>r</italic><sub><italic>j</italic></sub> (within a set of <italic>n</italic> consistency rules), applied to the attribute value <italic>w</italic>, and defined as</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>w</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mn>0</mml:mn></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mi>w</mml:mi><mml:mtext>&#x000A0;satisfies&#x000A0;</mml:mtext><mml:msub><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mn>1</mml:mn></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>otherwise</mml:mtext><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>(Hinrichs, <xref ref-type="bibr" rid="B40">2002</xref>).</p>
<p>Consistency rules can be defined not only on attribute-value-level, but also on tuple-level. In line with the accuracy and completeness metrics, the consistency on table- or database-level is calculated as the arithmetic mean of the tuple-level consistency (Hinrichs, <xref ref-type="bibr" rid="B40">2002</xref>).</p>
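<p>A minimal Python sketch of the value-level consistency metric in Equations (9) and (10) is given below. It assumes that each rule is supplied as a predicate together with its severity weight <italic>g</italic><sub><italic>j</italic></sub>; the names <monospace>q_kon</monospace> and <monospace>age_rules</monospace> as well as the example rules are hypothetical.</p>
<preformat><![CDATA[
```python
from typing import Callable, Sequence, Tuple

# A rule is a pair (predicate, severity g_j); the predicate returns True
# when the value satisfies the rule.
Rule = Tuple[Callable[[object], bool], float]


def q_kon(value: object, rules: Sequence[Rule]) -> float:
    """Equation (9): 1 / (sum of r_j(w) * g_j over all rules + 1)."""
    penalty = 0.0
    for satisfies, severity in rules:
        violated = 0 if satisfies(value) else 1  # r_j(w) as in Equation (10)
        penalty += violated * severity
    return 1.0 / (penalty + 1.0)


# Invented rules for an "age" attribute: a mild and a severe constraint.
age_rules = [
    (lambda age: age >= 0, 1.0),
    (lambda age: age <= 120, 3.0),
]

print(q_kon(35, age_rules))   # no rule violated
print(q_kon(-5, age_rules))   # the rule with severity 1.0 is violated
print(q_kon(130, age_rules))  # the rule with severity 3.0 is violated
```
]]></preformat>
<p>Because the weighted sum of violations appears in the denominator, a value that violates no rule scores 1, and the score approaches 0 as more or more severe rules are violated.</p>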
<p>Sebastian-Coleman (<xref ref-type="bibr" rid="B82">2013</xref>) suggests measuring consistency over time by comparing the &#x0201C;record count distribution of values (column profile) to past instances of data populating the same field.&#x0201D;</p>
</sec>
<sec>
<title>2.2.4. Timeliness</title>
<p>Timeliness describes &#x0201C;how current the data are for the task at hand&#x0201D; (Batini and Scannapieco, <xref ref-type="bibr" rid="B11">2016</xref>) and is closely connected to the notions of <italic>currency</italic> (update frequency of data) and <italic>volatility</italic> (how fast data becomes irrelevant). A different definition states that &#x0201C;timeliness can be interpreted as the probability that an attribute value is still up-to-date&#x0201D; (Heinrich et al., <xref ref-type="bibr" rid="B36">2007</xref>). A list of different metrics to calculate timeliness is provided by Heinrich and Klier (<xref ref-type="bibr" rid="B37">2009</xref>), where the authors suggest calculating timeliness based on the definition by Heinrich et al. (<xref ref-type="bibr" rid="B36">2007</xref>) according to:</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mo>.</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003C9;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mtext class="textrm" mathvariant="normal">exp</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>-</mml:mo><mml:mtext class="textit" mathvariant="italic">decline</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>(Heinrich et al., <xref ref-type="bibr" rid="B36">2007</xref>),</p>
<p>where &#x003C9; is the considered attribute value and <italic>decline</italic>(<italic>A</italic>) is the decline rate, which specifies the average number of attribute values that become outdated within the time period <italic>t</italic> (Heinrich et al., <xref ref-type="bibr" rid="B36">2007</xref>).</p>
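<p>Equation (11) translates directly into code. The sketch below assumes that <italic>t</italic> is given as the time elapsed since the attribute value was recorded, in the same unit as the decline rate; this reading, the function name, and the example numbers are assumptions made for illustration.</p>
<preformat><![CDATA[
```python
import math


def timeliness(decline_rate: float, elapsed_time: float) -> float:
    """Equation (11): exp(-decline(A) * t), a value in (0, 1]."""
    if decline_rate < 0 or elapsed_time < 0:
        raise ValueError("decline rate and elapsed time must be non-negative")
    return math.exp(-decline_rate * elapsed_time)


# Invented example: a decline rate of 0.2 per year.
print(timeliness(0.2, 0.0))  # a freshly recorded value
print(timeliness(0.2, 5.0))  # the same value five years later
```
]]></preformat>
<p>Under these assumptions, the result can be read as the probability that the attribute value is still up-to-date, in line with the definition by Heinrich et al. (<xref ref-type="bibr" rid="B36">2007</xref>).</p>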
<p>This list of metrics for the DQ dimensions accuracy, completeness, timeliness, and consistency, is by no means exhaustive, but a comprehensive discussion would be out of scope for this article. We conclude that literature offers a number of specifically formulated metrics to measure DQ dimensions and this survey observes their implementation in state-of-the-art DQ tools.</p>
</sec>
</sec>
<sec>
<title>2.3. Requirements for Data Quality Tools</title>
<p>In general, there are very few scientific papers that study the functional scope of DQ tools and even fewer papers that propose a dedicated requirements catalog for their evaluation. How our DQ tool survey differs from existing ones (and consequently from their requirements) is explained in detail in Section 3.1. In summary, the previously proposed requirements were either not detailed enough or had a different functional focus.</p>
<p>In addition to existing surveys, Goasdou&#x000E9; et al. (<xref ref-type="bibr" rid="B34">2007</xref>) explicitly proposed an evaluation framework for DQ tools without publishing the results of their evaluation. The proposed requirements were adapted to the context of the company for which they performed the DQ tool evaluation, &#x000C9;lectricit&#x000E9; de France (EDF), a French electric utility company, and more precisely to its CRM (customer relationship management) environments. Thus, the main differences to our requirements catalog are a more detailed evaluation of address normalization, duplicate detection, and reporting capabilities, but less detail in data profiling and no coverage of DQ monitoring functionality.</p>
<p>In addition to requirements defined by researchers, there are several practitioner- and vendor-focused surveys by Gartner Inc. (cf. Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>; Chien and Jain, <xref ref-type="bibr" rid="B15">2019</xref>), which observe DQ tools by means of the following DQ capabilities: connectivity, data profiling, measurement and visualization, monitoring, parsing, standardization and cleaning, matching, linking and merging, multi-domain support, address validation/geocoding, data curation and enrichment, issue resolution and workflow, metadata management, DevOps environment, deployment environment, architecture and integration, and usability. Similarly, Loshin (<xref ref-type="bibr" rid="B59">2010</xref>) defines the following eight requirements a DQ tool must offer: &#x0201C;data profiling, parsing, standardization, identity resolution, record linkage and merging, data cleansing, data enhancement, and data inspection and monitoring.&#x0201D; Such lists of requirements were too coarse-grained for our aim to specifically observe data profiling functionality, DQ measurement, and DQ monitoring functionality. While general features like connectivity and usability of the tools are not necessary to answer our research questions, we added a short textual description to each tool we observed.</p>
</sec>
</sec>
<sec id="s3">
<title>3. Survey Methodology</title>
<p>A systematic survey is usually started by defining a &#x0201C;protocol that specifies the research questions being addressed and the methods that will be used&#x0201D; (Kitchenham, <xref ref-type="bibr" rid="B50">2004</xref>). This section describes the protocol we developed to systematically conduct our survey. The structure of the protocol has been derived from the methodology for systematic reviews in computer science by Kitchenham (<xref ref-type="bibr" rid="B50">2004</xref>). Since the focus in Kitchenham (<xref ref-type="bibr" rid="B50">2004</xref>) is on the evaluation of primary research papers and not on specific implementations, we omit steps 5, 6, and 7 of the suggested planning information, including quality assessment, a data extraction strategy, and the synthesis of the extracted data from the original research papers.</p>
<sec>
<title>3.1. Related Surveys</title>
<p>Although a lot of DQ methods and tools have been published, there are few scientific studies about the functional scope of DQ tools. Gartner Inc. (cf. Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>; Chien and Jain, <xref ref-type="bibr" rid="B15">2019</xref>) lists the strengths and cautions of vendors of commercial DQ tools in their &#x0201C;Magic Quadrant for Data Quality Tools&#x0201D; 2016 (17 vendors), 2017 (16 vendors), and 2019 (15 vendors). They include vendors that offer software tools or cloud-based services, which deliver general-purpose DQ functionalities, including at least profiling, parsing, standardization/cleansing, matching, and monitoring (Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>). The study is vendor-focused and does not provide a detailed comparison of the respective data quality tools in terms of functionality (e.g., measurement and monitoring capabilities). However, the &#x0201C;Magic Quadrant for Data Quality Tools&#x0201D; contains a representative selection of commercial DQ tools, which is a valuable complement to our survey. The closest survey to our work in terms of tool comparison structure has been published by Fraunhofer IAO in German language (Kokem&#x000FC;ller and Haupt, <xref ref-type="bibr" rid="B52">2012</xref>). While Kokem&#x000FC;ller and Haupt (<xref ref-type="bibr" rid="B52">2012</xref>) focus on tools popular in Germany, we aim at a scientific approach to observe the availability of DQ tools from a general perspective by also justifying the tool selection.</p>
<p>Woodall et al. (<xref ref-type="bibr" rid="B93">2014</xref>) categorize different methods to assess and improve DQ. They understand DQ methods as automatically executable algorithms to detect or correct a DQ problem, e.g., column analysis, data verification, or data standardization. As a basis for their classification, they reviewed the list of DQ tools included in the &#x0201C;Magic Quadrant for Data Quality Tools 2012&#x0201D; by Gartner and extracted a list of DQ methods that tackle specific DQ problems. Woodall et al. (<xref ref-type="bibr" rid="B93">2014</xref>) do not provide an in-depth comparison of which method is contained in which tool since their focus is on the method classification.</p>
<p>Barateiro and Galhardas (<xref ref-type="bibr" rid="B9">2005</xref>) compared 9 academic and 28 commercial DQ tools in a scientific survey. This article does not cover state-of-the-art tools and the survey was not conducted in a systematic way, which means that it is unclear how the list of DQ tools was selected. In addition, the authors state that DQ tools aim at detecting and correcting data problems, which is why they observe functionalities for both DQ measurement and data cleansing, with an emphasis on the latter. In contrast, we focus on the measurement of data quality issues only, with special consideration of long-term monitoring functionality.</p>
<p>Pushkarev et al. (<xref ref-type="bibr" rid="B74">2010</xref>) proposed an overview of 7 open-source or freely available DQ tools. They described each tool briefly and compared the functionalities of the tools by means of <italic>performance criteria</italic> (including 6 usability features like data source connectivity, report creation, or the graphical user interface [GUI]) and <italic>core functionality</italic>. The core functionality consists of 4 groups, which are further subdivided into specific features that are observed: data profiling (e.g., data pattern discovery), data integration (e.g., ETL), data cleansing (e.g., parsing and standardization), and data monitoring. Due to the limited number of pages, Pushkarev et al. do not provide detailed insights into the implementation of specific criteria, and mainly distinguish between the availability of a feature (Y) or its absence (N). For example, the authors list 9 usability criteria for the GUI, but in the evaluation they only distinguish between (g) representing &#x0201C;not user friendly GUI&#x0201D; and (G) for &#x0201C;user-friendly GUI&#x0201D; with drag and drop functionality. Pulla et al. (<xref ref-type="bibr" rid="B73">2016</xref>) published a revised version of the tool overview, which is very similar to the original work in terms of structure and methodology. They used the same criteria structure as Pushkarev et al. (<xref ref-type="bibr" rid="B74">2010</xref>), but omitted the data monitoring group [since it is not provided by any of the tools according to Pulla et al. (<xref ref-type="bibr" rid="B73">2016</xref>)] and 4 other sub-features without further justification. The list of investigated DQ tools was extended from 7 to 10. 
Our survey differs notably from these two papers since we conducted a systematic search to select DQ tools and also investigated commercial tools, while Pushkarev et al. (<xref ref-type="bibr" rid="B74">2010</xref>) and Pulla et al. (<xref ref-type="bibr" rid="B73">2016</xref>) presented a predefined selection of free or open-source tools without publishing their selection strategy. Moreover, we focus on data profiling, DQ measurement, and DQ monitoring and evaluate these feature groups with a more detailed and comprehensive criteria catalog than provided by the other published surveys mentioned above.</p>
<p>Another study by Gao et al. (<xref ref-type="bibr" rid="B32">2016</xref>) focuses on big data quality assurance. However, the authors did not clarify the methodology, that is, the selection of the investigated tools and evaluation criteria. In contrast to our survey, where the focus is on the actual DQ measurement functionalities, the comparison in Gao et al. (<xref ref-type="bibr" rid="B32">2016</xref>) includes mainly technical features like the supported operating system and data sources, as well as a limited list of 4 basic data validation functions.</p>
<p><xref ref-type="table" rid="T1">Table 1</xref> provides an overview of related DQ tool surveys and compares them to our work. It can be seen that no other survey exists that (1) conducted a systematic search to select the DQ tools for investigation, (2) addresses both practitioners and researchers, and (3) investigates data profiling, DQ measurement, DQ monitoring, as well as the vendors in terms of customer support. In contrast to other surveys that focus mainly on commercial or open-source tools, we provide a good digest of the market by investigating a total of 13 DQ tools, of which five are open-source and eight are commercial.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Comparison of related data quality tool surveys.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th/>
<th/>
<th/>
<th valign="top" align="center" colspan="3"><bold>No. of DQ Tools</bold></th>
<th valign="top" align="center" colspan="6"><bold>Scope of Investigation</bold></th>
</tr>
<tr>
<th valign="top" align="left"><bold>Survey Authors and Year</bold></th>
<th valign="top" align="left"><bold>Target Group</bold></th>
<th valign="top" align="left"><bold>Survey Focus</bold></th>
<th valign="top" align="left"><bold>Selection Strategy</bold></th>
<th valign="top" align="left"><bold>Total number of tools</bold></th>
<th valign="top" align="left"><bold>Open-source tools</bold></th>
<th valign="top" align="left"><bold>Commercial tools</bold></th>
<th valign="top" align="left"><bold>DQ tool vendors</bold></th>
<th valign="top" align="left"><bold>Data profiling</bold></th>
<th valign="top" align="left"><bold>DQ measurement</bold></th>
<th valign="top" align="left"><bold>DQ monitoring</bold></th>
<th valign="top" align="left"><bold>Data cleansing</bold></th>
<th valign="top" align="left"><bold>Technical features</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Gartner Inc. &#x02013; (Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>)</td>
<td valign="top" align="left">Practitioners</td>
<td valign="top" align="left">DQ tool vendors</td>
<td valign="top" align="left">By the authors</td>
<td valign="top" align="left">46</td>
<td valign="top" align="center" colspan="2">(1 / 45)</td>
<td valign="top" align="left">x</td>
<td/>
<td/>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">Gartner Inc. &#x02013; (Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>)</td>
<td valign="top" align="left">Practitioners</td>
<td valign="top" align="left">DQ tool vendors</td>
<td valign="top" align="left">By the authors</td>
<td valign="top" align="left">39</td>
<td valign="top" align="center" colspan="2">(1 / 38)</td>
<td valign="top" align="left">x</td>
<td/>
<td/>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">Gartner Inc. &#x02013; (Chien and Jain, <xref ref-type="bibr" rid="B15">2019</xref>)</td>
<td valign="top" align="left">Practitioners</td>
<td valign="top" align="left">DQ tool vendors</td>
<td valign="top" align="left">By the authors</td>
<td valign="top" align="left">26</td>
<td valign="top" align="center" colspan="2">(1 / 25)</td>
<td valign="top" align="left">x</td>
<td/>
<td/>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">Kokem&#x000FC;ller and Haupt (<xref ref-type="bibr" rid="B52">2012</xref>)</td>
<td valign="top" align="left">Practitioners</td>
<td valign="top" align="left">German DQ tools</td>
<td valign="top" align="left">By the authors</td>
<td valign="top" align="left">17</td>
<td valign="top" align="center" colspan="2">(0 / 17)</td>
<td valign="top" align="left">x</td>
<td/>
<td/>
<td/>
<td/>
<td valign="top" align="left">x</td>
</tr>
<tr>
<td valign="top" align="left">Woodall et al. (<xref ref-type="bibr" rid="B93">2014</xref>)</td>
<td valign="top" align="left">Both</td>
<td valign="top" align="left">DQ methods</td>
<td valign="top" align="left">By Friedman (<xref ref-type="bibr" rid="B31">2012</xref>)</td>
<td valign="top" align="left">16</td>
<td valign="top" align="center" colspan="2">(1 / 15)</td>
<td/>
<td/>
<td valign="top" align="left">x</td>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">Barateiro and Galhardas (<xref ref-type="bibr" rid="B9">2005</xref>)</td>
<td valign="top" align="left">Researchers</td>
<td valign="top" align="left">DQ tools</td>
<td valign="top" align="left">By the authors</td>
<td valign="top" align="left">37</td>
<td valign="top" align="center" colspan="2">(9 / 28)</td>
<td/>
<td/>
<td valign="top" align="left">x</td>
<td/>
<td valign="top" align="left">x</td>
<td valign="top" align="left">x</td>
</tr>
<tr>
<td valign="top" align="left">Pushkarev et al. (<xref ref-type="bibr" rid="B74">2010</xref>)</td>
<td valign="top" align="left">Researchers</td>
<td valign="top" align="left">Open-source tools</td>
<td valign="top" align="left">By the authors</td>
<td valign="top" align="left">6</td>
<td valign="top" align="center" colspan="2">(6 / 0)</td>
<td/>
<td valign="top" align="left">x</td>
<td/>
<td valign="top" align="left">x</td>
<td valign="top" align="left">x</td>
<td valign="top" align="left">x</td>
</tr>
<tr>
<td valign="top" align="left">Pulla et al. (<xref ref-type="bibr" rid="B73">2016</xref>)</td>
<td valign="top" align="left">Researchers</td>
<td valign="top" align="left">Open-source tools</td>
<td valign="top" align="left">By the authors</td>
<td valign="top" align="left">10</td>
<td valign="top" align="center" colspan="2">(10 / 0)</td>
<td/>
<td valign="top" align="left">x</td>
<td/>
<td valign="top" align="left">x</td>
<td valign="top" align="left">x</td>
<td valign="top" align="left">x</td>
</tr>
<tr>
<td valign="top" align="left">Gao et al. (<xref ref-type="bibr" rid="B32">2016</xref>)</td>
<td valign="top" align="left">Researchers</td>
<td valign="top" align="left">Big data DQ tools</td>
<td valign="top" align="left">By the authors</td>
<td valign="top" align="left">11</td>
<td valign="top" align="center" colspan="2">(4 / 7)</td>
<td/>
<td/>
<td/>
<td/>
<td valign="top" align="left">x</td>
<td valign="top" align="left">x</td>
</tr>
<tr>
<td valign="top" align="left">Ehrlinger and W&#x000F6;&#x000DF; (This work)</td>
<td valign="top" align="left">Both</td>
<td valign="top" align="left">DQ tools</td>
<td valign="top" align="left">Systematic search</td>
<td valign="top" align="left">13</td>
<td valign="top" align="center" colspan="2">(5 / 8)</td>
<td valign="top" align="left">x</td>
<td valign="top" align="left">x</td>
<td valign="top" align="left">x</td>
<td valign="top" align="left">x</td>
<td/>
<td/>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>3.2. Research Questions</title>
<p>The aim of this survey is to evaluate and compare existing DQ tools with respect to their DQ measurement and monitoring functionalities in order to answer the research question <italic>how DQ measurement and monitoring concepts are implemented in state-of-the-art DQ tools</italic>. This research question can be refined into three sub-questions, whose theoretical background is discussed in Section 2. In Section 4.1, we present our requirements catalog, in which each sub-question is assigned to specific technical requirements.</p>
<list list-type="order">
<list-item><p>Which data profiling capabilities are supported by current DQ tools?</p></list-item>
<list-item><p>Which data quality dimensions and metrics can be measured with current DQ tools?</p></list-item>
<list-item><p>Do DQ tools allow automated data quality monitoring over time?</p></list-item>
</list>
</sec>
<sec>
<title>3.3. DQ Tool Search Strategy</title>
<p>To establish a comprehensive list of existing DQ tools, we developed a three-fold strategy. First, we included as candidate tools all tools observed in the previous surveys by Barateiro and Galhardas (<xref ref-type="bibr" rid="B9">2005</xref>), Kokem&#x000FC;ller and Haupt (<xref ref-type="bibr" rid="B52">2012</xref>), Gao et al. (<xref ref-type="bibr" rid="B32">2016</xref>), Selvage et al. (<xref ref-type="bibr" rid="B83">2017</xref>), Pulla et al. (<xref ref-type="bibr" rid="B73">2016</xref>), and Pushkarev et al. (<xref ref-type="bibr" rid="B74">2010</xref>). Second, we conducted a systematic search to find research papers that introduce, investigate, or mention DQ tools. Third, we conducted a random Google search using the same search term combinations as for the systematic search. In contrast to the systematic search, the random search does not aim at a comprehensive review of all search results, which would be infeasible for Google search results. However, it allows us to identify non-research tools that have not been described in scientific papers, and we therefore consider it an enrichment to achieve the best possible coverage of candidate tools. The remainder of this section is dedicated to the systematic search.</p>
<p>We identified the following search terms to conduct the systematic search: <italic>data quality, information quality</italic>, and <italic>tool</italic>. Since &#x0201C;information quality&#x0201D; is considered a synonym for &#x0201C;data quality&#x0201D; (Zhu et al., <xref ref-type="bibr" rid="B94">2014</xref>), we applied both search terms to achieve higher coverage. We decided not to add the terms &#x0201C;assessment&#x0201D; and &#x0201C;monitoring&#x0201D; to the search, as this would automatically exclude tools that do not specifically use these keywords. Consequently, the following search expression was applied:</p>
<disp-quote><p>(&#x0201C;<italic>data quality</italic>&#x0201D; &#x02228; &#x0201C;<italic>information quality</italic>&#x0201D;) &#x02227; <italic>tool</italic></p></disp-quote>
<p>The search expression was then applied to the digital libraries listed in <xref ref-type="table" rid="T2">Table 2</xref>. We also included the software development platform GitHub, because the purpose of this search is to select concrete tools. The original aim was to search all titles and abstracts from the computer science domain. However, since each digital library offers different search functionalities, we selected the search-engine-specific settings that most closely reflect this aim. <xref ref-type="table" rid="T2">Table 2</xref> documents the deviations for each conducted search along with the search expression ultimately used, which is already formatted according to the guidelines of the respective search engine. For the GitHub search, we additionally omitted the search term <italic>tool</italic>, because most GitHub results are tools anyway (except for empty repositories, code samples, or documentation).</p>
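<p>The logic of the search expression can be illustrated with a small filter. The following is a minimal sketch only: the function name and the sample titles are hypothetical, and simple case-insensitive substring matching is assumed, which is not necessarily how any of the listed search engines evaluates a query.</p>
<preformat preformat-type="code" position="float">
```python
def matches_search_expression(text: str) -> bool:
    """Check ("data quality" OR "information quality") AND tool on one text."""
    lowered = text.lower()
    has_quality_term = (
        "data quality" in lowered or "information quality" in lowered
    )
    has_tool_term = "tool" in lowered
    return has_quality_term and has_tool_term


# Hypothetical titles, invented for illustration only.
sample_titles = [
    "A Tool for Data Quality Assessment",
    "Information Quality in Web Portals",
    "An open-source tool for schema matching",
]

hits = [title for title in sample_titles if matches_search_expression(title)]
print(hits)
```
</preformat>
<p>Because the sketch uses substring matching, the term &#x0201C;tool&#x0201D; also matches &#x0201C;tools&#x0201D; or &#x0201C;toolkit&#x0201D;; each digital library applies its own tokenization and field restrictions, so actual result sets will differ.</p>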
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Systematic search.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Source</bold></th>
<th valign="top" align="left"><bold>Search expression</bold></th>
<th valign="top" align="left"><bold>Scope</bold></th>
<th valign="top" align="left"><bold>Restrictions</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">ACM Digital Library<xref ref-type="table-fn" rid="TN1"><sup>a</sup></xref></td>
<td valign="top" align="left">acmdlTitle:(&#x0002B;(&#x0201C;data quality&#x0201D; &#x0201C;information quality&#x0201D;) &#x0002B;tool) OR recordAbstract:(&#x0002B;(&#x0201C;data quality&#x0201D; &#x0201C;information quality&#x0201D;) &#x0002B;tool)</td>
<td valign="top" align="left">Title, abstract</td>
<td valign="top" align="left">-</td>
</tr>
<tr>
<td valign="top" align="left">GitHub<xref ref-type="table-fn" rid="TN2"><sup>b</sup></xref></td>
<td valign="top" align="left">&#x0201C;data quality&#x0201D; OR &#x0201C;information quality&#x0201D;</td>
<td valign="top" align="left">Full text</td>
<td valign="top" align="left">-</td>
</tr>
<tr>
<td valign="top" align="left">Google Scholar<xref ref-type="table-fn" rid="TN3"><sup>c</sup></xref></td>
<td valign="top" align="left">allintitle: (&#x0201C;data quality&#x0201D; OR &#x0201C;information quality&#x0201D;) AND tool</td>
<td valign="top" align="left">Title</td>
<td valign="top" align="left">Exclude citations and patents</td>
</tr>
<tr>
<td valign="top" align="left">IEEE Xplore Digital Library<xref ref-type="table-fn" rid="TN4"><sup>d</sup></xref></td>
<td valign="top" align="left">(((&#x0201C;data quality&#x0201D;) OR &#x0201C;information quality&#x0201D;) AND tool)</td>
<td valign="top" align="left">Title, abstract, indexing terms</td>
<td valign="top" align="left">-</td>
</tr>
<tr>
<td valign="top" align="left">Science Direct<xref ref-type="table-fn" rid="TN5"><sup>e</sup></xref></td>
<td valign="top" align="left">TITLE-ABSTR-KEY(&#x0201C;data quality&#x0201D; OR &#x0201C;information quality&#x0201D;) and TITLE-ABSTR-KEY(tool)[All Sources(Computer Science)].</td>
<td valign="top" align="left">Title, abstract, keywords</td>
<td valign="top" align="left">Computer science only</td>
</tr>
<tr>
<td valign="top" align="left">Springer Link<xref ref-type="table-fn" rid="TN6"><sup>f</sup></xref></td>
<td valign="top" align="left">tool NEAR (&#x0201C;data quality&#x0201D; OR &#x0201C;information quality&#x0201D;)</td>
<td valign="top" align="left">Full text</td>
<td valign="top" align="left">Computer science only</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TN1"><label>a</label><p><italic><ext-link ext-link-type="uri" xlink:href="http://dl.acm.org/advsearch.cfm">http://dl.acm.org/advsearch.cfm</ext-link> (January, 2022)</italic>.</p></fn>
<fn id="TN2"><label>b</label><p><italic><ext-link ext-link-type="uri" xlink:href="https://github.com">https://github.com</ext-link> (January, 2022)</italic>.</p></fn>
<fn id="TN3"><label>c</label><p><italic><ext-link ext-link-type="uri" xlink:href="https://scholar.google.at">https://scholar.google.at</ext-link> (January, 2022)</italic>.</p></fn>
<fn id="TN4"><label>d</label><p><italic><ext-link ext-link-type="uri" xlink:href="http://ieeexplore.ieee.org/search/advsearch.jsp">http://ieeexplore.ieee.org/search/advsearch.jsp</ext-link> (January, 2022)</italic>.</p></fn>
<fn id="TN5"><label>e</label><p><italic><ext-link ext-link-type="uri" xlink:href="http://www.sciencedirect.com">http://www.sciencedirect.com</ext-link> (January, 2022)</italic>.</p></fn>
<fn id="TN6"><label>f</label><p><italic><ext-link ext-link-type="uri" xlink:href="https://link.springer.com/advanced-search">https://link.springer.com/advanced-search</ext-link> (January, 2022)</italic>.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>For each search result, we assessed the title and abstract to determine whether the paper actually presents a candidate DQ tool. Where the title and abstract were not explicit enough, or where they indicated the presentation of a tool (so that the article could not be directly classified as irrelevant), we investigated the content of the article in more detail to record the name and purpose of the tool in a first step. In the GitHub search, we directly excluded all tools that did not offer any kind of description and used the others as candidates. <xref ref-type="fig" rid="F1">Figure 1</xref> illustrates the number of investigated research papers and the resulting tools. The next section describes the subsequent investigation of all candidate tools according to defined exclusion criteria (EC).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Systematic search.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-850611-g0001.tif"/>
</fig>
</sec>
<sec>
<title>3.4. DQ Tool Selection</title>
<p>In accordance with our general search strategy, we defined three inclusion criteria. Each tool that was selected as a candidate tool had to satisfy at least one of the following three criteria.</p>
<list list-type="order">
<list-item><p>The tool was included in one of the previous surveys (cf. Barateiro and Galhardas, <xref ref-type="bibr" rid="B9">2005</xref>; Pushkarev et al., <xref ref-type="bibr" rid="B74">2010</xref>; Kokem&#x000FC;ller and Haupt, <xref ref-type="bibr" rid="B52">2012</xref>; Gao et al., <xref ref-type="bibr" rid="B32">2016</xref>; Pulla et al., <xref ref-type="bibr" rid="B73">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>).</p></list-item>
<list-item><p>The tool was identified in our systematic search.</p></list-item>
<list-item><p>The tool was identified in our random search.</p></list-item>
</list>
<p><xref ref-type="fig" rid="F1">Figure 1</xref> shows the number of scientific papers (&#x00023;Papers), which we found in the systematic search per source, as well as the number of tools (&#x00023;Tools) that were mentioned in these papers. It can be seen that some papers mention several DQ tools (e.g., other DQ tool surveys), while some use the term in their title or abstract, but do not refer to a concrete tool directly. In total, 1,298 papers have been discovered through the systematic search, which refer to 567 DQ tools (this number includes duplicates). In the related surveys we located 110 tools (including duplicates) and added 43 additional tools from the random Google search. In the next step, all 720 tools were merged into one file to remove duplicates. This resulted in a total of 667 identified distinct DQ tools. After establishing the list of candidate tools, we conducted a review to exclude all tools from the survey that met at least one of the following exclusion criteria.</p>
<list list-type="simple">
<list-item><p>(EC1) The tool is domain-specific (e.g., for web data or specific implementations only).</p></list-item>
<list-item><p>(EC2) The tool is dedicated to specific data management tasks without explicitly offering DQ measurement.</p>
<list list-type="simple">
<list-item><p>(a) The tool is dedicated to data cleansing.</p></list-item>
<list-item><p>(b) The tool is dedicated to data integration (including on-the-fly DQ checks).</p></list-item>
<list-item><p>(c) The tool is dedicated to other data management tasks (e.g., data visualization).</p></list-item>
</list>
</list-item>
<list-item><p>(EC3) The tool is not publicly available (e.g., the tool is only described in a research paper).</p></list-item>
<list-item><p>(EC4) The tool is considered deprecated (i.e., the vendor does not exist any more or the tool was found on GitHub and the last commit was before January 1<sup>st</sup>, 2016).</p></list-item>
<list-item><p>(EC5) The tool was found on GitHub without any further information available.</p></list-item>
<list-item><p>(EC6) The tool requires a fee and no free trial is offered upon request.</p></list-item>
</list>
<p>The table in <xref ref-type="fig" rid="F1">Figure 1</xref> shows how many tools were excluded per criterion (multiple selection was possible). Most of the tools were excluded because they are domain-specific (EC1) and/or focus on specific data management tasks (EC2). The 267 tools excluded due to EC2 are divided between the three subcriteria as follows: 111 tools were excluded by EC2(a), 46 tools by EC2(b), and 110 tools by EC2(c).</p>
<p>For the search process and selection, we used Microsoft Excel to collect the identified scientific papers from the search engine results, to assemble a uniform list of identified DQ tools, and to remove duplicate tools. We tracked the exclusion of the tools according to our six criteria in a separate Excel file. Seventeen DQ tools were selected for deeper investigation, of which 13 could be evaluated: three were based on SAP, for which no installation was available, and one (IBM InfoSphere Information Server) could not be installed successfully during the project, despite great effort on our part and with little support from IBM.</p>
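<p>The tool counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the variable names are illustrative, and the number of removed duplicates is derived here rather than reported.</p>
<preformat preformat-type="code" position="float">
```python
# Candidate tools per source, as reported in Section 3.4 (duplicates included).
tools_from_systematic_search = 567
tools_from_related_surveys = 110
tools_from_random_search = 43

candidates_with_duplicates = (
    tools_from_systematic_search
    + tools_from_related_surveys
    + tools_from_random_search
)
distinct_tools = 667
# Derived value, not stated in the text.
duplicates_removed = candidates_with_duplicates - distinct_tools

# Breakdown of the exclusions due to EC2.
ec2_breakdown = {"EC2(a)": 111, "EC2(b)": 46, "EC2(c)": 110}
ec2_total = sum(ec2_breakdown.values())

# Tools selected for deeper investigation versus tools actually evaluated.
selected_tools = 17
not_evaluable = {"SAP-based": 3, "IBM InfoSphere Information Server": 1}
evaluated_tools = selected_tools - sum(not_evaluable.values())

print(candidates_with_duplicates, duplicates_removed, ec2_total, evaluated_tools)
```
</preformat>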
</sec>
<sec>
<title>3.5. Limitations of This Study</title>
<p>As pointed out by Pateli and Giaglis (<xref ref-type="bibr" rid="B68">2004</xref>), &#x0201C;the selection phase is critical, since decisions made at this stage undoubtedly have a considerable impact on the validity of the literature review results.&#x0201D; For our survey, we consider the conduct of the selection process and consequently its inherent limitations (cf. Kitchenham et al., <xref ref-type="bibr" rid="B51">2009</xref>) as the main threats to its validity. In this section, we specifically discuss the comprehensiveness of our tool search strategy and the stringency of the exclusion criteria.</p>
<p>The following measures mitigated the risk of missing an important research paper and subsequently a DQ tool: (1) we used the online search engine Google Scholar in addition to the main publisher websites, (2) we specifically observed references from existing DQ tool surveys, and (3) we included a manual Google search in parallel to the systematic search.</p>
<p>Considering the ratio between the number of DQ tools selected for deeper investigation and the total number of identified DQ tools (17/667), the exclusion criteria might seem very stringent. We argue that they have been selected adequately for this survey for the following reasons. First, a large number of the identified DQ tools (especially a subset of the 296 found on GitHub) are only simple scripts to clean specific data sets. Two examples are SQL-Utils<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref>, which consists of five SQL scripts for cleaning data and performing simple DQ checks, and DescribeCol<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>, which consists of one Python function that implements DQ tests by describing and visualizing a Pandas DataFrame. Although the dedicated investigation of domain-specific DQ tools (EC1) is interesting future work, a further restriction of these tools would be required to compile meaningful results for such a study. Second, we deliberately excluded DQ tools that are restricted to specific data management tasks (e.g., data cleansing), because they do not help answer our general research question <italic>how DQ measurement and monitoring concepts are implemented in state-of-the-art DQ tools</italic>. Third, the time invested was about one person-month per tool. This was due, on the one hand, to the detailed requirements catalog (cf. <xref ref-type="table" rid="T3">Table 3</xref>) and, on the other hand, to the fact that for some tools the installation or the negotiation with customer support (e.g., to receive a fully functional trial license) was already very time-consuming. Given this effort, investigating all 667 DQ tools, or even only the 339 domain-specific tools, would be out of scope for answering our research question. Fourth, the number of selected DQ tools seems reasonable compared to related surveys. 
Investigating a considerably larger number of DQ tools would require the refinement of the entire evaluation process.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>DQ tool requirements catalog.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Category</bold></th>
<th valign="top" align="left"><bold>Sub-category</bold></th>
<th valign="top" align="left"><bold>Requirement</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Data Profiling (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>, <xref ref-type="bibr" rid="B2">2019</xref>)</td>
<td valign="top" align="left">SC &#x02013; Cardinalities</td>
<td valign="top" align="left">(1) &#x0201C;Number of rows&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(2) Number of null values (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(3) &#x0201C;Percentage of null values&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(4) &#x0201C;Number of distinct values; sometimes called &#x02018;cardinality&#x00027; &#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(5) &#x0201C;Number of distinct values divided by the number of rows&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">SC &#x02013; Value distributions</td>
<td valign="top" align="left">(6) &#x0201C;Frequency histograms (equi-width, equi-depth, etc.)&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(7) &#x0201C;Minimum and maximum values in a numeric column&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(8) &#x0201C;Constancy: frequency of most frequent value divided by number of rows&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(9) &#x0201C;Quartiles: 3 points that divide the (numeric) values into 4 equal groups&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(10) &#x0201C;Distribution of first digit in numeric values; to check Benford&#x00027;s law&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">SC &#x02013; Patterns, data types, and domains</td>
<td valign="top" align="left">(11) &#x0201C;Basic type (numeric, alphanumeric, date, time, etc.)&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(12) &#x0201C;DBMS-specific data type (varchar, timestamp, etc.)&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(13) Measurement of value length (minimum, maximum, average, and median) (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(14) &#x0201C;Maximum number of digits in numeric values&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(15) &#x0201C;Maximum number of decimals in numeric values&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(16) &#x0201C;Histogram of value patterns (Aa9...)&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(17) &#x0201C;Generic semantic data type&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>) [e.g., &#x0201C;code, date/time, quantity, identifier&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)]</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(18) &#x0201C;Semantic domain&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>) (e.g., credit card, first name, city) (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Dependencies</td>
<td valign="top" align="left">(19) &#x0201C;Unique column combinations&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>) (key discovery)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(20) &#x0201C;Relaxed unique column combinations&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(21) &#x0201C;Inclusion dependencies&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>) (foreign key discovery)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(22) &#x0201C;Relaxed inclusion dependencies&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(23) &#x0201C;Functional dependencies&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(24) &#x0201C;Relaxed functional dependencies&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Advanced MC profiling</td>
<td valign="top" align="left">(25) Correlation analysis (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(26) Association rule mining (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(27) Cluster analysis (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(28) Outlier detection (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(29) Exact duplicate tuple detection</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(30) Relaxed duplicate tuple detection</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Data Quality Measurement</td>
<td valign="top" align="left">DQ Dimensions</td>
<td valign="top" align="left">(31) Metric to measure accuracy</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(32) Metric to measure completeness</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(33) Metric to measure consistency</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(34) Metric to measure timeliness</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(35) Metrics to measure other DQ dimensions</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Rule-based checks</td>
<td valign="top" align="left">(36) Creation of business rules</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(37) Availability of generally applicable integrity rules</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(38) Verification of data against business rules</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Automated Data Quality Monitoring (Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B23">2017</xref>)</td>
<td/>
<td valign="top" align="left">(39) Scheduling a DQ metric or data profiling task in user-defined periods (Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B23">2017</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(40) Storage of DQ measurements and data profiling results (Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B23">2017</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(41) Retrieval of DQ measurements or data profiling results (Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B23">2017</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(42) Comparison between several DQ measurements or data profiling results (Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B23">2017</xref>)</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">(43) Visualization of DQ measurements / data profiling results over time (Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B23">2017</xref>)</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s4">
<title>4. Design of the Evaluation Process</title>
<p>As outlined in Section 2.3, existing requirement frameworks for DQ tools did not adequately answer our research question. Thus, we developed a new catalog of requirements for the evaluation of DQ measurement and monitoring tools, which is discussed in the following subsection. The aim is to rate the fulfillment of each requirement with three categories: (&#x02713;) for fulfilled, (&#x02212;) for not fulfilled, and (<italic>p</italic>) for partially fulfilled. In Section 4.2, we discuss the database used for the evaluation of the requirements and in Section 4.3 we list the predefined test cases to compare specific results between the investigated DQ tools.</p>
<sec>
<title>4.1. Evaluation Requirements Catalog</title>
<p>Our requirements catalog in <xref ref-type="table" rid="T3">Table 3</xref> consists of three main categories: data profiling (DP), data quality measurement (DQM), and continuous data quality monitoring (CDQM). The requirements for data profiling are based on the classification of DP tasks by Abedjan et al., which was originally published in Abedjan et al. (<xref ref-type="bibr" rid="B1">2015</xref>) and updated in Abedjan et al. (<xref ref-type="bibr" rid="B2">2019</xref>). Since we started our survey prior to the classification update, our requirements catalog constitutes a tradeoff between the two versions. Since both versions contain the two sub-categories &#x0201C;single columns (SC) profiling&#x0201D; and &#x0201C;dependency detection,&#x0201D; we adhere here to the newer version by Abedjan et al. (<xref ref-type="bibr" rid="B2">2019</xref>). In the SC sub-category, we split the null values task (i.e., number or percentage of null values) into two separate requirements, (DP-2) number of null values and (DP-3) percentage of null values, to separate the results. The newer version (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>) contains an additional sub-category &#x0201C;metadata for non-relational data,&#x0201D; which is not included in our survey, because the evaluation of some tools with a fixed-period trial version was already completed at the time of the update. However, the original version (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>) included a category &#x0201C;multi-column (MC) profiling,&#x0201D; which has been removed by Abedjan et al. (<xref ref-type="bibr" rid="B2">2019</xref>). We renamed this category to &#x0201C;advanced MC profiling&#x0201D; and added it along with two additional requirements (exact and relaxed duplicate tuple detection) to the end of the DP category. One reason for the exclusion of the MC sub-category from the data profiling task taxonomy by Abedjan et al. 
(<xref ref-type="bibr" rid="B2">2019</xref>) might be the strong overlap of these tasks with the field of data mining. Abedjan et al. (<xref ref-type="bibr" rid="B2">2019</xref>) point out that there exists no clearly defined and widely accepted distinction between the two research fields. Thus, although a separate category for those requirements could be argued, we decided to include it in the data profiling category, because data mining is not in the focus of our survey.</p>
<p>The category for DQ measurement contains requirements to provide metrics for specific DQ dimensions and business rule management capabilities. While we explicitly listed metrics for the DQ dimensions accuracy, completeness, consistency, and timeliness as described in Section 2.2, we investigated the existence of additional metrics during our evaluation by means of (DQM-34). Since DQ dimensions such as consistency are often measured with a set of rules (cf. Section 2.2.3) and the development of business rules is regarded as the basis for DQ measurement in some methodologies (cf. Sebastian-Coleman, <xref ref-type="bibr" rid="B82">2013</xref>), we have expanded our catalog to include (DQM-35) to (DQM-37). We distinguish between the (DQM-35) creation of domain-specific business rules and the (DQM-36) availability of general integrity rules, for example, a birth date cannot be in the future or a temperature value can never fall below -273.15 &#x000B0;C (absolute zero). It should also be possible to verify those rules (DQM-37).</p>
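<p>To make the distinction between rule availability (DQM-36) and rule verification (DQM-37) more concrete, the following minimal Python sketch encodes the two example integrity rules as named predicates and reports the share of conforming records per rule. It is an illustration under our own assumptions and not the implementation of any surveyed tool: the names <monospace>Rule</monospace>, <monospace>general_integrity_rules</monospace>, and <monospace>verify</monospace>, the record field names, and the sample records are hypothetical, and absolute zero (-273.15 &#x000B0;C) is assumed as the lower temperature bound.</p>
<preformat>
```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List

ABSOLUTE_ZERO_CELSIUS = -273.15  # assumed physical lower bound for temperatures


@dataclass(frozen=True)
class Rule:
    """A named predicate; check(record) is True when the record conforms."""
    name: str
    check: Callable[[dict], bool]


def general_integrity_rules(today: date) -> List[Rule]:
    """Two general integrity rules in the spirit of (DQM-36)."""
    return [
        Rule("birth_date_not_in_future",
             lambda r: today >= r["birth_date"]),
        Rule("temperature_not_below_absolute_zero",
             lambda r: r["temperature_c"] >= ABSOLUTE_ZERO_CELSIUS),
    ]


def verify(records: List[dict], rules: List[Rule]) -> Dict[str, float]:
    """Share of conforming records per rule, in the spirit of (DQM-37)."""
    if not records:
        # An empty data set violates no rule.
        return {rule.name: 1.0 for rule in rules}
    return {
        rule.name: sum(1 for r in records if rule.check(r)) / len(records)
        for rule in rules
    }


# Hypothetical sample records; the field names are illustrative only.
records = [
    {"birth_date": date(1990, 5, 17), "temperature_c": 21.5},
    {"birth_date": date(2999, 1, 1), "temperature_c": 19.0},    # future birth date
    {"birth_date": date(1975, 3, 2), "temperature_c": -300.0},  # below absolute zero
    {"birth_date": date(2001, 9, 9), "temperature_c": -40.0},
]
print(verify(records, general_integrity_rules(today=date(2022, 1, 1))))
```
</preformat>
<p>In this sketch, each rule is violated by exactly one of the four sample records, so both rules are reported with a conformance share of 0.75. A domain-specific business rule in the sense of (DQM-35) would be added as a further <monospace>Rule</monospace> instance.</p>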
<p>The requirements for CDQM are based on the findings from our previous research published by Ehrlinger and W&#x000F6;&#x000DF; (<xref ref-type="bibr" rid="B23">2017</xref>) and summarize key tasks to ensure automated DQ monitoring over time. The continuous measurement, storage, and usage of the collected metadata should be possible for both data profiling results and DQ measurements.</p>
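<p>The three CDQM tasks named above (continuous measurement, storage, and usage of the collected metadata) can be sketched as follows. This is a minimal illustration under our own assumptions and does not reproduce the architecture of any surveyed tool: the table layout, the function names <monospace>open_store</monospace>, <monospace>record</monospace>, <monospace>history</monospace>, and <monospace>drifted</monospace>, the metric name, and the measured values are all hypothetical. The sketch uses the SQLite module from the Python standard library as a metadata store.</p>
<preformat>
```python
import sqlite3
from datetime import datetime, timedelta, timezone


def open_store(path=":memory:"):
    """Storage: one row per measurement (timestamp, target, metric, value)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurement ("
        " measured_at TEXT NOT NULL,"
        " target TEXT NOT NULL,"
        " metric TEXT NOT NULL,"
        " value REAL NOT NULL)"
    )
    return conn


def record(conn, target, metric, value, measured_at=None):
    """Measurement: persist one profiling or DQ metric result with its time."""
    moment = measured_at or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO measurement VALUES (?, ?, ?, ?)",
        (moment.isoformat(timespec="microseconds"), target, metric, value),
    )
    conn.commit()


def history(conn, target, metric):
    """Usage: the time series of one metric, oldest measurement first."""
    return conn.execute(
        "SELECT measured_at, value FROM measurement"
        " WHERE target = ? AND metric = ? ORDER BY measured_at",
        (target, metric),
    ).fetchall()


def drifted(conn, target, metric, tolerance):
    """Usage: True if the latest value differs from the previous one by more than tolerance."""
    values = [value for _, value in history(conn, target, metric)]
    if 2 > len(values):
        return False
    return abs(values[-1] - values[-2]) > tolerance


# Hypothetical daily measurements of a null percentage (values are made up).
store = open_store()
start = datetime(2019, 1, 1, tzinfo=timezone.utc)
for day, value in enumerate([10.0, 12.0, 40.0]):
    record(store, "Supplier.Fax", "null_percentage", value, start + timedelta(days=day))
print(drifted(store, "Supplier.Fax", "null_percentage", tolerance=5.0))
```
</preformat>
<p>Storing the timestamp with every result is the essential design choice here: it turns isolated profiling results and DQ measurements into a time series that can be compared, plotted, or checked against thresholds later on.</p>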
</sec>
<sec>
<title>4.2. Evaluation Database</title>
<p>For the evaluation of the requirements from <xref ref-type="table" rid="T3">Table 3</xref>, we used a modernized version of the well-known Northwind DB published by dofactory<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref>. <xref ref-type="fig" rid="F2">Figure 2</xref> illustrates the schema of the database with five tables as a UML (Unified Modeling Language) class diagram. Foreign key relationships and their cardinalities are represented in UML notation.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Schema of the Northwind evaluation DB.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-850611-g0002.tif"/>
</fig>
</sec>
<sec>
<title>4.3. Data Profiling Test Cases</title>
<p>To compare the results of the requirements between the DQ tools, we defined a test case for each requirement from the data profiling category. We did not define such fine-grained test cases for the DQ measurement category since the DQ metric implementations were too diverse to compare their results directly. Also, the requirements of the DQ monitoring category do not yield a comparable result (e.g., in the form of numbers), and hence there are no test cases. The following list comprises all test cases we performed for the DP category, where the numbering corresponds to the DP requirements from <xref ref-type="table" rid="T3">Table 3</xref>:</p>
<list list-type="order">
<list-item><p>Number of rows in table <monospace>Product</monospace>.</p></list-item>
<list-item><p>Number of <monospace>null</monospace> values in column <monospace>Supplier.Fax</monospace>.</p></list-item>
<list-item><p>Percentage of <monospace>null</monospace> values in column <monospace>Supplier.Fax</monospace>.</p></list-item>
<list-item><p>Number of distinct values in column <monospace>Customer.Country</monospace>.</p></list-item>
<list-item><p>Number of distinct values divided by the number of rows for <monospace>Customer.Country</monospace>.</p></list-item>
<list-item><p>Frequency histograms for <monospace>Customer.Country</monospace>.</p></list-item>
<list-item><p>Minimum and maximum values in <monospace>OrderItem.UnitPrice</monospace>.</p></list-item>
<list-item><p>Constancy for column <monospace>Customer.Country</monospace>.</p></list-item>
<list-item><p>Quartiles in column <monospace>OrderItem.UnitPrice</monospace>.</p></list-item>
<list-item><p>Distribution of first digit = 1 in column <monospace>UnitPrice</monospace>, table <monospace>OrderItem</monospace>.</p></list-item>
<list-item><p>Basic types for <monospace>ProductName</monospace>, <monospace>UnitPrice</monospace>, and <monospace>isDiscontinued</monospace> in table <monospace>Product</monospace>.</p></list-item>
<list-item><p>DBMS-specific data types for <monospace>ProductName</monospace>, <monospace>UnitPrice</monospace>, and <monospace>isDiscontinued</monospace> in table <monospace>Product</monospace>.</p></list-item>
<list-item><p>Minimum, maximum, average, and median value length of column <monospace>Product.ProductName</monospace>.</p></list-item>
<list-item><p>Maximum number of digits in column <monospace>Product.UnitPrice</monospace>.</p></list-item>
<list-item><p>Maximum number of decimals in column <monospace>Product.UnitPrice</monospace>.</p></list-item>
<list-item><p>Count of pattern &#x0201C;AA&#x0201D; in <monospace>Customer.Country</monospace>, derived from histogram.</p></list-item>
<list-item><p>Semantic data types for <monospace>ProductName</monospace>, <monospace>UnitPrice</monospace>, and <monospace>isDiscontinued</monospace> in table <monospace>Product</monospace>.</p></list-item>
<list-item><p>Semantic domains for <monospace>ProductName</monospace>, <monospace>UnitPrice</monospace>, and <monospace>isDiscontinued</monospace> in table <monospace>Product</monospace>.</p></list-item>
<list-item><p>All 100 % conforming UCCs in <monospace>Order</monospace>.</p></list-item>
<list-item><p>All 98 % conforming UCCs in <monospace>Order</monospace>.</p></list-item>
<list-item><p>All 100 % conforming INDs between <monospace>Order.CustomerId</monospace> and <monospace>Customer.Id</monospace>.</p></list-item>
<list-item><p>All 93 % conforming INDs between <monospace>Order.CustomerId</monospace> and <monospace>Customer.Id</monospace>.</p></list-item>
<list-item><p>All 100 % conforming FDs in <monospace>Order</monospace>.</p></list-item>
<list-item><p>All 93 % conforming FDs in <monospace>Order</monospace>.</p></list-item>
<list-item><p>Correlation between <monospace>OrderItem.UnitPrice</monospace> and <monospace>OrderItem.Quantity</monospace>.</p></list-item>
<list-item><p>All possible association rules within <monospace>Product</monospace>.</p></list-item>
<list-item><p>Clustering the values in <monospace>Product.UnitPrice</monospace>.</p></list-item>
<list-item><p>All &#x0201C;very high values&#x0201D; in <monospace>Order.TotalAmount</monospace>.</p></list-item>
<list-item><p>All exact duplicates in <monospace>Customer</monospace>, considering <monospace>FirstName</monospace> and <monospace>LastName</monospace> only.</p></list-item>
<list-item><p>All relaxed duplicates in <monospace>Customer</monospace>, considering <monospace>FirstName</monospace> and <monospace>LastName</monospace> only.</p></list-item>
</list><p>All test cases were conducted by two researchers (one of whom is the lead author of this article), who verified each other&#x00027;s results.</p>
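<p>Several of the single-column test cases above, as well as the notion of a partially conforming IND, can be expressed in a few lines of code. The following Python sketch is an illustration under our own assumptions and not the procedure used in the evaluation: columns are modeled as plain lists with <monospace>None</monospace> for null values and are assumed to be non-empty, the function names and sample values are hypothetical, the quartiles use the default method of the Python <monospace>statistics</monospace> module, and the pattern function does not distinguish upper-case from lower-case letters. Individual DQ tools may define these measures differently.</p>
<preformat>
```python
from collections import Counter
from statistics import quantiles


def null_stats(values):
    """Test cases 2 and 3: number and percentage of null (None) values."""
    nulls = sum(1 for v in values if v is None)
    return nulls, 100.0 * nulls / len(values)


def distinct_and_uniqueness(values):
    """Test cases 4 and 5: distinct count, and distinct count / number of rows."""
    distinct = len({v for v in values if v is not None})
    return distinct, distinct / len(values)


def histogram(values):
    """Test case 6: frequency of each non-null value."""
    return Counter(v for v in values if v is not None)


def constancy(values):
    """Test case 8: frequency of the most frequent value / number of rows."""
    return max(histogram(values).values()) / len(values)


def quartiles(values):
    """Test case 9: the three quartile cut points of the non-null values."""
    return quantiles([v for v in values if v is not None], n=4)


def first_digit_share(values, digit="1"):
    """Test case 10: share of non-null numbers whose first non-zero digit is digit."""
    numbers = [v for v in values if v is not None]
    leading = [str(abs(v)).lstrip("0.")[:1] for v in numbers]
    return sum(1 for d in leading if d == digit) / len(numbers)


def pattern_counts(values):
    """Test case 16: map letters to A and digits to 9, then count the patterns."""
    def pattern(text):
        return "".join(
            "A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in text
        )
    return Counter(pattern(v) for v in values if v is not None)


def ind_conformance(dependent, referenced):
    """Test cases 21 and 22: share of non-null dependent values found in referenced."""
    known = set(referenced)
    present = [v for v in dependent if v is not None]
    return sum(1 for v in present if v in known) / len(present)


# Hypothetical sample columns (not taken from the evaluation database).
fax = ["030-0076545", None, None, "(5) 555-3745"]
country = ["UK", "USA", "UK", "Germany"]
print(null_stats(fax), distinct_and_uniqueness(country), pattern_counts(country)["AA"])
```
</preformat>
<p>With these definitions, an IND is 100 % conforming if <monospace>ind_conformance</monospace> returns 1.0, and it is, for example, 93 % conforming if the function returns at least 0.93.</p>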
</sec>
</sec>
<sec id="s5">
<title>5. Data Quality Tool Evaluation</title>
<p>In this section, we first describe the DQ tools that we selected for the evaluation. We then investigate the selected tools with respect to our evaluation framework and discuss the requirements.</p>
<sec>
<title>5.1. Selected Data Quality Tools</title>
<p>In total, we selected 17 DQ tools for detailed evaluation. Three of them were based on SAP (SAP Information Steward, DQ solution by ISO Professional Services, and dspCompose by BackOffice Associates GmbH). Since we had no access to an SAP installation, we did not include these tools in our survey, but described them textually. To achieve a comparable overview of the investigated DQ tools, we formulated the following seven questions.</p>
<list list-type="bullet">
<list-item><p>Which exact version did we evaluate? (DQ tool name and version).</p></list-item>
<list-item><p>Who is the vendor or creator of the tool?</p></list-item>
<list-item><p>Is the tool open-source?</p></list-item>
<list-item><p>How did we perceive the user interface? (1&#x02013;5 rating, 5 is best).</p></list-item>
<list-item><p>How did we perceive customer support? (1&#x02013;5 rating, 5 is best).</p></list-item>
<list-item><p>How was the investigated DQ tool provided? (e.g., freely available on GitHub/SourceForge or trial license).</p></list-item>
<list-item><p>In which scientific paper or on which online platform was the DQ tool found?</p></list-item>
</list>
<p>An overview of the answers to the questions is given in <xref ref-type="table" rid="T4">Table 4</xref> and a detailed discussion is provided in the following subsections (DQ tools listed in alphabetical order). Since the focus of this survey is on the measurement functionality of DQ tools, technical details such as the deployment model (i.e., on-premises vs. SaaS) were not relevant for answering our research questions. We refer to related surveys for more technical details, especially the Gartner Magic Quadrant (cf. Chien and Jain, <xref ref-type="bibr" rid="B15">2019</xref>) and Fraunhofer IAO (cf. Kokem&#x000FC;ller and Haupt, <xref ref-type="bibr" rid="B52">2012</xref>).</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Summary of investigated DQ tools.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>DQ Tool Version</bold></th>
<th valign="top" align="left"><bold>Vendor / Creator</bold></th>
<th valign="top" align="left"><bold>Open</bold><break/><bold>-source</bold></th>
<th valign="top" align="left"><bold>User</bold><break/><bold>Interface</bold></th>
<th valign="top" align="left"><bold>Customer</bold><break/><bold>Support</bold></th>
<th valign="top" align="left"><bold>Provided</bold></th>
<th valign="top" align="left"><bold>Found With</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Aggregate Profiler 6.2.4</td>
<td valign="top" align="left">Arrah Technology</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">2 (out of 5)</td>
<td valign="top" align="left">Not consulted</td>
<td valign="top" align="left">Free on SourceForge</td>
<td valign="top" align="left">Dai et al., <xref ref-type="bibr" rid="B19">2016</xref></td>
</tr>
<tr>
<td valign="top" align="left">Apache Griffin 0.2.0</td>
<td valign="top" align="left">Apache Foundation</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">4 (out of 5)</td>
<td valign="top" align="left">Not available</td>
<td valign="top" align="left">Free on GitHub</td>
<td valign="top" align="left">GitHub</td>
</tr>
<tr>
<td valign="top" align="left">Ataccama ONE profiler</td>
<td valign="top" align="left">Ataccama</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">5 (out of 5)</td>
<td valign="top" align="left">1 (out of 5)</td>
<td valign="top" align="left">Free online version</td>
<td valign="top" align="left">Pushkarev et al., <xref ref-type="bibr" rid="B74">2010</xref>; Kokem&#x000FC;ller and Haupt, <xref ref-type="bibr" rid="B52">2012</xref>; Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>; Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Pulla et al., <xref ref-type="bibr" rid="B73">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref></td>
</tr>
<tr>
<td valign="top" align="left">DataCleaner Enterprise Edition 6.3.0</td>
<td valign="top" align="left">Human Inference</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">5 (out of 5)</td>
<td valign="top" align="left">5 (out of 5)</td>
<td valign="top" align="left">Full trial (3 months)</td>
<td valign="top" align="left">Pushkarev et al., <xref ref-type="bibr" rid="B74">2010</xref>; Gao et al., <xref ref-type="bibr" rid="B32">2016</xref>; Pulla et al., <xref ref-type="bibr" rid="B73">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref></td>
</tr>
<tr>
<td valign="top" align="left">Datamartist 1.7.9</td>
<td valign="top" align="left">nModal Solutions</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">2 (out of 5)</td>
<td valign="top" align="left">Not consulted</td>
<td valign="top" align="left">Full trial (30 days)</td>
<td valign="top" align="left">Pulla et al., <xref ref-type="bibr" rid="B73">2016</xref></td>
</tr>
<tr>
<td valign="top" align="left">Pandora 5.9.0</td>
<td valign="top" align="left">Experian</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">4 (out of 5)</td>
<td valign="top" align="left">5 (out of 5)</td>
<td valign="top" align="left">Full trial (30 days)</td>
<td valign="top" align="left">Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref></td>
</tr>
<tr>
<td valign="top" align="left">Informatica Data Quality 10.2.0</td>
<td valign="top" align="left">Informatica</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">4 (out of 5)</td>
<td valign="top" align="left">5 (out of 5)</td>
<td valign="top" align="left">Full trial (2 x 30 days)</td>
<td valign="top" align="left">Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>; Gao et al., <xref ref-type="bibr" rid="B32">2016</xref>; Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref></td>
</tr>
<tr>
<td valign="top" align="left">InfoSphere Information Server for Data Quality 11.7</td>
<td valign="top" align="left">IBM</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Not evaluated</td>
<td valign="top" align="left">1 (out of 5)</td>
<td valign="top" align="left">Full trial (3 months)</td>
<td valign="top" align="left">Kokem&#x000FC;ller and Haupt, <xref ref-type="bibr" rid="B52">2012</xref>; Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>; Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref></td>
</tr>
<tr>
<td valign="top" align="left">InfoZoom Desktop Professional 2018 release 9.20 and IZDQ 2018.03</td>
<td valign="top" align="left">humanIT Software GmbH</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">4 (out of 5)</td>
<td valign="top" align="left">4 (out of 5)</td>
<td valign="top" align="left">Full trial (6 months)</td>
<td valign="top" align="left">Kokem&#x000FC;ller and Haupt, <xref ref-type="bibr" rid="B52">2012</xref></td>
</tr>
<tr>
<td valign="top" align="left">MobyDQ, pulled 05/21/19</td>
<td valign="top" align="left">Alexis Rolland</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">3 (out of 5)</td>
<td valign="top" align="left">4 (out of 5)</td>
<td valign="top" align="left">Free on GitHub</td>
<td valign="top" align="left">GitHub</td>
</tr>
<tr>
<td valign="top" align="left">OpenRefine version 3.0 and MetricDoc extension (pulled on Feb. 14th 2019)</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">2 (out of 5)</td>
<td valign="top" align="left">Not available</td>
<td valign="top" align="left">Free on GitHub</td>
<td valign="top" align="left">Tsiflidou and Manouselis, <xref ref-type="bibr" rid="B89">2013</xref>; Kusumasari et al., <xref ref-type="bibr" rid="B54">2016</xref>, GitHub</td>
</tr>
<tr>
<td valign="top" align="left">Enterprise Data Quality pre-built Virtual Machine 12.2.1</td>
<td valign="top" align="left">Oracle</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">3 (out of 5)</td>
<td valign="top" align="left">Not consulted</td>
<td valign="top" align="left">Free on Oracle website</td>
<td valign="top" align="left">Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>; Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref></td>
</tr>
<tr>
<td valign="top" align="left">Open Studio for Data Quality 6.5.1</td>
<td valign="top" align="left">Talend</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">4 (out of 5)</td>
<td valign="top" align="left">2 (out of 5)</td>
<td valign="top" align="left">Free on GitHub / Talend website</td>
<td valign="top" align="left">Pushkarev et al., <xref ref-type="bibr" rid="B74">2010</xref>; Gao et al., <xref ref-type="bibr" rid="B32">2016</xref>; Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Pulla et al., <xref ref-type="bibr" rid="B73">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>, GitHub</td>
</tr>
<tr>
<td valign="top" align="left">SAS Base 9.4 and SAS Data Quality Desktop 2.7</td>
<td valign="top" align="left">SAS</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">3 (out of 5)</td>
<td valign="top" align="left">3 (out of 5)</td>
<td valign="top" align="left">Full trial (60 days)</td>
<td valign="top" align="left">Barateiro and Galhardas, <xref ref-type="bibr" rid="B9">2005</xref>; Maletic and Marcus, <xref ref-type="bibr" rid="B60">2009</xref>; Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref></td>
</tr>
</tbody>
</table>
</table-wrap>
<sec>
<title>5.1.1. Aggregate Profiler</title>
<p>Aggregate Profiler (AP) is a freely available DQ tool, which is dedicated to data profiling. The tool was discovered twice in our systematic search: once because it was mentioned by Dai et al. (<xref ref-type="bibr" rid="B19">2016</xref>) in the Springer search results, and once in the Google search results, since it is also published on SourceForge as &#x0201C;Open Source Data Quality and Profiling,&#x0201D;<xref ref-type="fn" rid="fn0004"><sup>4</sup></xref> developed by <italic>arrah</italic> and <italic>arunwizz</italic>. In addition to its data profiling capabilities, like statistical analysis and pattern matching, Aggregate Profiler can also be used for data preparation and cleansing activities, like address correction or duplicate removal. Moreover, business rules can be defined and scheduled in user-defined periods. We perceived the user interface (UI) as inferior compared to other tools, since the navigation and the application of DP functions were not intuitive.</p>
</sec>
<sec>
<title>5.1.2. Apache Griffin</title>
<p>Apache Griffin<xref ref-type="fn" rid="fn0005"><sup>5</sup></xref> (AG) differs significantly from the other tools in this survey, because it does not offer any data profiling functionality and is not a comprehensive DQ solution. However, since part of the evaluation is to observe the extent to which current tools support CDQM, we included Apache Griffin, which is dedicated to continuously measuring the quality of Big Data, both batch-based and streaming data. We installed Apache Griffin 0.2.0, which was still in Apache incubator status at the time of the installation, on Ubuntu 18.04. The tool requires the following dependencies, some of which were (at the time of the installation) still in incubating status as well: JDK (1.8&#x0002B;), MySQL DB, npm, Hadoop (2.6.0&#x0002B;), Spark (2.2.1&#x0002B;), Hive (2.2.0), Livy, and ElasticSearch. Due to these dependencies, the installation was very cumbersome in contrast to other tools. In our case, two experienced computer scientists needed over a week to complete the full installation. Once installed, the UI is intuitive and supports the domain-specific definition of accuracy metrics as well as the scheduling and monitoring of those metrics. Other DQ metrics, like completeness, are planned to be integrated in future versions.</p>
</sec>
<sec>
<title>5.1.3. Ataccama ONE</title>
<p>The company Ataccama, with its headquarters in Canada, offers several DQ products, which we found through different sources in our search: Data Quality Center and Master Data Center have been previously investigated by Kokem&#x000FC;ller and Haupt (<xref ref-type="bibr" rid="B52">2012</xref>); DQ Analyzer has been included in Pushkarev et al. (<xref ref-type="bibr" rid="B74">2010</xref>), Pulla et al. (<xref ref-type="bibr" rid="B73">2016</xref>), and Abedjan et al. (<xref ref-type="bibr" rid="B1">2015</xref>). Gartner additionally mentioned the DQ Issue Tracker and the DQ Dashboard in 2016 (Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>). However, since 2017, Ataccama has consolidated its separate DQ solutions into &#x0201C;Ataccama ONE&#x0201D; (A-ONE). While the license of the full DQ solution is subject to costs, the data profiling module of Ataccama ONE can be accessed freely. Unfortunately, Ataccama customer support did not provide us with a trial license of the complete ONE solution. Thus, we were only able to investigate the free &#x0201C;Ataccama ONE profiler,&#x0201D;<xref ref-type="fn" rid="fn0006"><sup>6</sup></xref> which focuses on data profiling and does not provide monitoring functionality. We performed the evaluation of the online-available tool during October 2018. According to Gartner (cf. Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>) and Ataccama customer support, the full solution would provide a much richer scope of functions, including DQ monitoring, but we were not able to investigate it. The data profiling module was very intuitive and easy to use, also for business users. In terms of customer support (from Prague), we experienced very long response times to our license requests. Additionally, we were promised training as a prerequisite for testing the full Ataccama ONE solution, which never took place due to the workload on the side of Ataccama.</p>
</sec>
<sec>
<title>5.1.4. DataCleaner by Human Inference</title>
<p>The DQ products &#x0201C;DataCleaner&#x0201D; (DC) and &#x0201C;DataHub&#x0201D; were originally developed by Human Inference, which was incorporated into Neopost in 2012, later into Quadient, and since 2019 into the EDM Media Group, where it is again promoted under its original name &#x0201C;Human Inference.&#x0201D; DataCleaner offers dedicated and independent DQ measurement functionality, although pure data cleansing functions might be expected due to its name. Our customer contact declared that the professional version of DataCleaner (in contrast to the community edition that is freely available on GitHub) offers the same DQ measurement functionalities as DataHub and differs only with respect to the convenience of use, the UI, and the data integration features. Thus, we evaluated a full trial of DataCleaner Enterprise Edition, which is aimed at people with a technical background. In addition, we were able to observe the functionalities of DataHub in an interactive web session. Human Inference places emphasis on customer data, which is reflected in special algorithms for duplicate detection, address matching, and data cleansing. Under the vendor Quadient, DataCleaner was mentioned by Selvage et al. (<xref ref-type="bibr" rid="B83">2017</xref>) (Gartner Inc.), but excluded from the follow-up survey by Chien and Jain (<xref ref-type="bibr" rid="B15">2019</xref>) due to strategic changes. DataCleaner was previously observed by Pushkarev et al. (<xref ref-type="bibr" rid="B74">2010</xref>), Gao et al. (<xref ref-type="bibr" rid="B32">2016</xref>), and Pulla et al. (<xref ref-type="bibr" rid="B73">2016</xref>), but under different vendors. Although DataCleaner is built for technical users, we perceived the UI as very intuitive. DataHub (with its vision of a single customer view) offers, in addition to the administrator&#x00027;s view, a data steward view, which is specifically dedicated to business users, for example, to resolve ambiguous duplicates. We also want to highlight [in conformance with Selvage et al. (<xref ref-type="bibr" rid="B83">2017</xref>)] the very helpful and friendly customer support that provided us with the trial license and more insight into DataHub.</p>
</sec>
<sec>
<title>5.1.5. Datamartist by nModal Solutions Inc.</title>
<p>The commercial tool Datamartist<xref ref-type="fn" rid="fn0007"><sup>7</sup></xref> (DM) by nModal Solutions Inc. requires the operating system Microsoft Windows and the .NET framework 2.0 to be installed. Datamartist is dedicated to data profiling and data transformation. The investigated 30-day trial offers all Pro edition features. Since the trial could be downloaded from the website directly, we did not consult any customer support. We perceived the UI of Datamartist as slightly inferior compared to other commercial tools, since the command line was required for some tasks (e.g., exporting data profiling results).</p>
</sec>
<sec>
<title>5.1.6. Experian Pandora</title>
<p>The company Experian, with its headquarters in Ireland, offers two commercial DQ solutions: Cleanse and Pandora (EP). During the conduct of our survey, they introduced the new product Aperture Data Studio, which is going to replace Pandora in the future. While Cleanse is dedicated to one-time data cleansing, we investigated the more comprehensive tool Pandora. In accordance with the findings by Selvage et al. (<xref ref-type="bibr" rid="B83">2017</xref>) (Gartner Inc.), we perceived the tool as easy to install and use and want to highlight the comprehensive data profiling capabilities in general, and the cross-table profiling capabilities in particular. In addition, Pandora provides a rich ability to extend the existing feature palette with customized functions. In summary, Pandora achieved one of the best overall assessments in our survey. We perceived the UI as good, though more dedicated to technical users, and had a very good experience with the technical customer support, who supported us in a timely and target-oriented fashion.</p>
</sec>
<sec>
<title>5.1.7. Informatica Data Quality</title>
<p>Informatica Data Quality (IDQ) is one module of the commercial data management solution by Informatica, which, according to Gartner (cf. Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>; Chien and Jain, <xref ref-type="bibr" rid="B15">2019</xref>), has been a leader in the Magic Quadrant of Data Quality Tools for several years. We were provided with two 30-day trial licenses. The trial included the Informatica Developer (the desktop installation for developers), Informatica Analyst (the web-based platform for business users), and Informatica Administrator (for task scheduling), where all three user interfaces access the same server-side backend of Informatica DQ version 10.2.0. In our systematic search, we found five different tools offered by the company Informatica, four of which were excluded from the evaluation. For example, the &#x0201C;Master Data Management&#x0201D; solution was excluded due to its focus on master data management. Informatica Data Quality was found through the Springer Link search (cf. Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>), and because it was previously investigated by Judah et al. (<xref ref-type="bibr" rid="B49">2016</xref>), Selvage et al. (<xref ref-type="bibr" rid="B83">2017</xref>), and Gao et al. (<xref ref-type="bibr" rid="B32">2016</xref>). Informatica has its origin in the field of data integration and, in addition to the features we evaluated, offers data cleansing and matching functionalities. In terms of DQ measurement, it most probably offers the closest implementation to the DQ dimension and metric view promoted in the research community. We perceived the UI of Informatica Analyst as easy to use, also for business users, but with less comprehensive functionality than the Informatica Developer, which is more powerful and dedicated to trained and technical users. In accordance with the findings by Gartner customers (cf. Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>), we can confirm the very helpful sales support, which was one of the best we experienced. During the evaluation, we had regular web conferences to ask questions and review the results, and short intermediate requests were answered in a timely manner.</p>
</sec>
<sec>
<title>5.1.8. IBM InfoSphere Information Server for Data Quality</title>
<p>The product &#x0201C;InfoSphere Information Server for Data Quality&#x0201D; (IBM ISDQ) by IBM was found through the studies by Gartner (cf. Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>) and Fraunhofer IAO (cf. Kokem&#x000FC;ller and Haupt, <xref ref-type="bibr" rid="B52">2012</xref>). Other products (or product components) from IBM have also been previously mentioned in the following research papers: IBM Informix (previously called &#x0201C;DataBlade&#x0201D;) by Barateiro and Galhardas (<xref ref-type="bibr" rid="B9">2005</xref>), IBM InfoSphere Information Analyzer by Abedjan et al. (<xref ref-type="bibr" rid="B1">2015</xref>), IBM QuerySurge by Gao et al. (<xref ref-type="bibr" rid="B32">2016</xref>), IBM Data Integrator by Chen et al. (<xref ref-type="bibr" rid="B14">2012</xref>), IBM InfoSphere MDM Server by Pawluk (<xref ref-type="bibr" rid="B69">2010</xref>), and IBM Quality Stage by Prasad et al. (<xref ref-type="bibr" rid="B72">2011</xref>). For our survey, the IBM partner solvistas GmbH, located in Austria, provided us with the installation files of IBM InfoSphere Information Server for Data Quality version 11.7 for a three-month trial. Unfortunately, we were not able to evaluate the tool due to an early error in the installation process stating that a required file was not found. Despite intensive research of the documentation<xref ref-type="fn" rid="fn0008"><sup>8</sup></xref>, it was not possible to resolve the issue within the timeframe of the project, since neither support by IBM nor any specific installation instructions for the received files were provided. We also contacted Fraunhofer IAO, who included IBM ISDQ in their survey (Kokem&#x000FC;ller and Haupt, <xref ref-type="bibr" rid="B52">2012</xref>). However, they did not install the tool, but based their statements on contact with the IBM support and the documentation. solvistas GmbH also stated that, so far, they had never installed the IBM DQ product line. This experience aligns with the statement by Gartner that reference customers rate the technical support and documentation of IBM below average (Chien and Jain, <xref ref-type="bibr" rid="B15">2019</xref>).</p>
</sec>
<sec>
<title>5.1.9. InfoZoom by humanIT Software GmbH</title>
<p>InfoZoom is a commercial DQ tool by the German vendor humanIT Software GmbH<xref ref-type="fn" rid="fn0009"><sup>9</sup></xref> and is dedicated to data profiling using in-memory analytics. It was previously surveyed by Kokem&#x000FC;ller and Haupt (<xref ref-type="bibr" rid="B52">2012</xref>). We investigated InfoZoom Desktop Professional with the IZDQ (InfoZoom Data Quality) extension under a 6-month license granted to us by the customer support. While InfoZoom Desktop is dedicated to data profiling and data investigation, the IZDQ extension allows a user to define rules and jobs for comprehensive DQ management. Generally, InfoZoom aims at observing and understanding the data but does not support any cleansing activities, which aligns well with the observations performed in this survey. We perceived the UI of InfoZoom Desktop as easy to use, also for business users, whereas the IZDQ extension requires technical knowledge, like the ability to write SQL statements, or at least intensive training to be used by non-technical users. The customer support was very friendly and helpful and provided us in a timely manner with a relatively long trial license in comparison to other commercial DQ tools.</p>
</sec>
<sec>
<title>5.1.10. MobyDQ</title>
<p>MobyDQ<xref ref-type="fn" rid="fn0010"><sup>10</sup></xref>, which was previously termed &#x0201C;Data Quality Framework,&#x0201D; by Alexis Rolland is a free and open-source DQ solution that aims to automate DQ checks during data processing, store DQ measurements and metric results, and trigger alerts in case of anomalies. The tool was inspired by an internal DQ project at Ubisoft Entertainment, which differs from the open-source version with respect to software dependencies and its mature but context-dependent configuration. We found MobyDQ through our GitHub search and evaluated the version downloaded on May 21st, 2019. Similar to the commercial tools we observed, the framework can be used to access different data sources. In contrast to Apache Griffin, MobyDQ could be installed quickly and straightforwardly, based on the detailed documentation provided on GitHub. MobyDQ does not provide any DP functionality, because its focus is on the creation, application, and automation of DQ checks. The creator Alexis Rolland was very helpful in demonstrating the productive installation at Ubisoft Entertainment to us, which clearly demonstrates the potential of the tool when applied in practice.</p>
</sec>
<sec>
<title>5.1.11. OpenRefine and MetricDoc</title>
<p>OpenRefine<xref ref-type="fn" rid="fn0011"><sup>11</sup></xref> (formerly Google Refine, abbrev. OR) is a free and open-source DQ tool dedicated to data cleansing and data transformation and was discovered through (Kusumasari et al., <xref ref-type="bibr" rid="B54">2016</xref>) in the IEEE search results, and (Tsiflidou and Manouselis, <xref ref-type="bibr" rid="B89">2013</xref>) in the Springer Link search results as well as on GitHub<xref ref-type="fn" rid="fn0012"><sup>12</sup></xref>. While the original functionality of the tool does not primarily align with the focus of our survey, its extension MetricDoc specifically aims at assessing DQ with &#x0201C;customizable, reusable quality metrics in combination with immediate visual feedback&#x0201D; (Bors et al., <xref ref-type="bibr" rid="B12">2018</xref>). Apart from the mention by Tsiflidou and Manouselis (<xref ref-type="bibr" rid="B89">2013</xref>) and Kusumasari et al. (<xref ref-type="bibr" rid="B54">2016</xref>), OpenRefine was not evaluated in any of the previous DQ tool surveys, although it is open source. We installed the tool from GitHub and evaluated OpenRefine version 3.0 with the MetricDoc extension (for which no version was provided), downloaded on February 14th, 2019. We perceived the usability of OpenRefine as average; especially in the MetricDoc extension, the usability of several functions reflected its status as a very recent research project.</p>
</sec>
<sec>
<title>5.1.12. Oracle Enterprise Data Quality</title>
<p>The commercial tool Oracle Enterprise Data Quality (EDQ) was previously mentioned by Gartner (cf. Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>) and also found in the Springer Link search results (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>). We investigated the pre-built virtual machine that is freely available on the Oracle website<xref ref-type="fn" rid="fn0013"><sup>13</sup></xref>. In addition to classical data profiling capabilities, EDQ offers data cleansing (parsing, standardization, match and merge, address verification), as well as DQ monitoring to some extent. The GUI was perceived as average, with the major drawback being the inflexible data source connection to DBs and files. In comparison to other DQ tools, where a connection can be directly accessed and reused, Oracle EDQ requires a &#x0201C;snapshot&#x0201D; of the actual data connection to be created prior to any profiling or DQ measurement task. This approach prevents an automatic update of the data source. We did not need to contact customer support, and the installation documentation and user guide were up to date and very intuitive to use.</p>
</sec>
<sec>
<title>5.1.13. Talend Open Studio for Data Quality</title>
<p>The company Talend offers two DQ products: Talend Open Studio (TOS) for Data Quality (a free version) and Talend Data Management Platform (requires subscription). Gartner upgraded Talend in their Magic Quadrant of Data Quality Tools from being &#x0201C;visionary&#x0201D; in 2016 to &#x0201C;leader&#x0201D; in 2017 (cf. Judah et al., <xref ref-type="bibr" rid="B49">2016</xref>; Selvage et al., <xref ref-type="bibr" rid="B83">2017</xref>). Talend Open Studio for Data Quality is one of the most frequently cited DQ tools that we discovered in our systematic search: it was found through Springer Link and GitHub<xref ref-type="fn" rid="fn0014"><sup>14</sup></xref> and was previously investigated by Pushkarev et al. (<xref ref-type="bibr" rid="B74">2010</xref>), Gao et al. (<xref ref-type="bibr" rid="B32">2016</xref>), and Pulla et al. (<xref ref-type="bibr" rid="B73">2016</xref>). Both products (Open Studio and Enterprise) offer good support for Big Data analysis frameworks such as Spark or Hadoop and a variety of data profiling and cleansing functionalities. We evaluated version 6.5.1 of TOS for Data Quality, which can keep up with several fee-based commercial DQ tools in terms of data profiling capabilities, business rule management, and UI experience. However, the free version does not support DQ monitoring capabilities, which are an exclusive feature of the Enterprise edition. It was not possible to receive a free trial of the Talend Data Management Platform, because, according to our customer contact, it is unlikely that someone would purchase the Enterprise edition because of this feature.</p>
</sec>
<sec>
<title>5.1.14. SAS Data Quality</title>
<p>The US company SAS<xref ref-type="fn" rid="fn0015"><sup>15</sup></xref> (Statistical Analysis System) offers three commercial DQ products: SAS Data Management, SAS Data Quality, and SAS Data Quality Desktop (Chien and Jain, <xref ref-type="bibr" rid="B15">2019</xref>). Since the traditional focus of SAS is on data analysis, their DQ product is based on the acquired company DataFlux. The product &#x0201C;dfPower&#x0201D; by DataFlux has previously been surveyed by Barateiro and Galhardas (<xref ref-type="bibr" rid="B9">2005</xref>) and is mentioned by Maletic and Marcus (<xref ref-type="bibr" rid="B60">2009</xref>), which was discovered through our systematic search. In our evaluation, we did not find powerful machine learning (ML) capabilities (a core strength of SAS) in DQ measurement, which was also mentioned by Selvage et al. (<xref ref-type="bibr" rid="B83">2017</xref>). According to our customer contact, and as also mentioned by Chien and Jain (<xref ref-type="bibr" rid="B15">2019</xref>), SAS&#x00027; overall strategic focus is on migrating all product lines into the cloud-based SAS Viya platform to increase usability and to better integrate ML and DQ. In the evaluated tool SAS Data Quality Desktop 2.7, we found that the overall usability was below average when compared to other DQ tools. The customer support was friendly, but hardly any question could be answered directly.</p>
</sec>
<sec>
<title>5.1.15. Data Quality Solutions Dedicated to SAP</title>
<p>SAP (German abbreviation for &#x0201C;Systeme, Anwendungen und Produkte in der Datenverarbeitung,&#x0201D; i.e., &#x0201C;Systems, Applications and Products in Data Processing&#x0201D;) is a globally operating enterprise application software company headquartered in Germany. Since SAP is a market leader in the data processing domain, there are several DQ tools that are specifically built to operate on top of an existing SAP installation. During this survey, we had no access to such an installation and thus were not able to include those tools in our evaluation. However, due to the practical relevance of DQ measurement in SAP, we describe the most relevant tools dedicated to SAP, which we found through our systematic search.</p>
<sec>
<title>5.1.15.1. SAP Information Steward</title>
<p>SAP Information Steward was found through our systematic search and previously mentioned by Chien and Jain (<xref ref-type="bibr" rid="B15">2019</xref>) and Abedjan et al. (<xref ref-type="bibr" rid="B1">2015</xref>, <xref ref-type="bibr" rid="B2">2019</xref>). According to the documentation, the tool offers different data profiling functionalities (like simple statistics, histograms, data types, and dependencies) and allows users to define and execute business rules as well as to monitor DQ with scorecards. Its strength is the wide range of out-of-the-box functions for specific domains like customers, supply chains, and products; however, customers often state that the costs for the product are too high and that the interface needs some modernization for business users (Chien and Jain, <xref ref-type="bibr" rid="B15">2019</xref>).</p>
</sec>
<sec>
<title>5.1.15.2. Data Quality Solution by ISO Professional Services</title>
<p>The German company ISO Professional Services offers a data governance solution, which is implemented directly in SAP and reuses user-defined business rules from the SAP environment. A few years ago, ISO acquired the company Scarus Software GmbH with the DQ tool DataGovernanceSuite, which was discovered through our search and was previously evaluated by Kokem&#x000FC;ller and Haupt (<xref ref-type="bibr" rid="B52">2012</xref>). The Scarus Data Quality (SDQ) Server constitutes the core DQ component by ISO, which has a separate memory but no DB. SDQ interoperates with SAP transparently, by offering functions like data profiling, duplicate detection, and address validation, which are directly executed within SAP. In contrast to its competing product SAP Information Steward, which aims at large enterprises, the tool by ISO is optimized for small to medium-sized companies. Reference customers of this size preferred the tool by ISO Professional Services due to its adjusted functional scope and cheaper pricing.</p>
</sec>
<sec>
<title>5.1.15.3. dspCompose by BackOffice Associates GmbH</title>
<p>The German company BackOffice Associates GmbH offers a DQ suite prefixed with &#x0201C;dsp&#x0201D; (data stewardship platform), which is dedicated to master data management. Their primary DQ products are dspMonitor (for data profiling, monitoring, and DQ checks), which is a competing product to SAP Information Steward, and dspCompose (for data cleansing and DQ workflow management), which acts as an add-on for dspMonitor or SAP Information Steward. Further DQ-related products are dspMigrate, an end-to-end data migration tool, dspConduct, an SAP MDE tool, and dspArchive for data archiving in SAP environments. Although BackOffice Associates offer their DQ products to customers without SAP, they have developed a strong SAP focus in recent years. According to our customer contact, they see the greatest potential in offering dspCompose in combination with SAP.</p>
</sec>
</sec>
</sec>
<sec>
<title>5.2. Comparison of Data Profiling, DQ Measurement, and Monitoring Capabilities</title>
<p>In this section, we investigate the DQ tools with regard to our catalog of requirements from <bold>Table 3</bold>. For each requirement, three ratings are possible: (&#x02713;) the requirement is fulfilled, (&#x02212;) the requirement is not fulfilled, or (<italic>p</italic>) the requirement is partially fulfilled. The coverage of each requirement is described in textual form with a focus on the justification of partial fulfillments.</p>
<sec>
<title>5.2.1. Data Profiling Capabilities</title>
<p><xref ref-type="table" rid="T5">Table 5</xref> shows the fulfillment of data profiling capabilities for each tool. We excluded Apache Griffin and MobyDQ from this table, because neither tool offers any data profiling functionality. In summary, basic single-column data profiling like cardinalities (DP-1 to DP-5) is covered by most tools, but more sophisticated functionalities, like dependency discovery and multi-column profiling, are offered only in isolated cases.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Data profiling capabilities.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>DP</bold></th>
<th valign="top" align="left"><bold>Capability</bold></th>
<th valign="top" align="left"><bold>Aggregate Profiler</bold></th>
<th valign="top" align="left"><bold>Ataccama ONE</bold></th>
<th valign="top" align="left"><bold>DataCleaner</bold></th>
<th valign="top" align="left"><bold>Datamartist</bold></th>
<th valign="top" align="left"><bold>Experian Pandora</bold></th>
<th valign="top" align="left"><bold>Informatica DQ</bold></th>
<th valign="top" align="left"><bold>InfoZoom &#x00026; IZDQ</bold></th>
<th valign="top" align="left"><bold>OpenRefine &#x00026; MetricDoc</bold></th>
<th valign="top" align="left"><bold>Oracle EDQ</bold></th>
<th valign="top" align="left"><bold>SAS Data Quality</bold></th>
<th valign="top" align="left"><bold>Talend Open Studio</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="left">Number of rows</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="left">Number of nulls</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="left">Percentage of nulls</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">4</td>
<td valign="top" align="left">Number of distinct values</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="left">Percentage of distinct values</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">6</td>
<td valign="top" align="left">Frequency histograms</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
</tr>
<tr>
<td valign="top" align="left">7</td>
<td valign="top" align="left">Minimum and maximum values</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">8</td>
<td valign="top" align="left">Constancy</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">9</td>
<td valign="top" align="left">Quartiles</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">10</td>
<td valign="top" align="left">Distribution of first digit</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">11</td>
<td valign="top" align="left">Basic types</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">12</td>
<td valign="top" align="left">DBMS-specific data type</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">13</td>
<td valign="top" align="left">Value length</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
</tr>
<tr>
<td valign="top" align="left">14</td>
<td valign="top" align="left">Number of digits</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">15</td>
<td valign="top" align="left">Number of decimals</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">16</td>
<td valign="top" align="left">Histogram of value patterns</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">17</td>
<td valign="top" align="left">Generic semantic data type</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">18</td>
<td valign="top" align="left">Semantic domain</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">19</td>
<td valign="top" align="left">UCCs (key discovery)</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">20</td>
<td valign="top" align="left">Relaxed UCCs</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">21</td>
<td valign="top" align="left">INDs (foreign key discovery)</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">22</td>
<td valign="top" align="left">Relaxed INDs</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">23</td>
<td valign="top" align="left">FDs</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
</tr>
<tr>
<td valign="top" align="left">24</td>
<td valign="top" align="left">Relaxed FDs</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">25</td>
<td valign="top" align="left">Correlation analysis</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
</tr>
<tr>
<td valign="top" align="left">26</td>
<td valign="top" align="left">Association rule mining</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">27</td>
<td valign="top" align="left">Cluster analysis</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">28</td>
<td valign="top" align="left">Outlier detection</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">29</td>
<td valign="top" align="left">Exact duplicate detection</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">30</td>
<td valign="top" align="left">Relaxed duplicate detection</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec>
<title>5.2.1.1. Single Column&#x02014;Cardinalities</title>
<p>While simple counts of values (i.e., cardinalities), such as the number of rows, <monospace>null</monospace> values, or distinct values, are covered by all DQ tools that support data profiling in general, the major distinction is the out-of-the-box availability of percentage values. The percentage of <monospace>null</monospace> values (DP-3) or distinct values (DP-5) is not supported by all investigated tools. The test case results also reveal different precision in the calculation of the percentages. For example, the percentage of <monospace>null</monospace> values in column <monospace>Supplier</monospace>.<monospace>Fax</monospace> was 55 % with Datamartist, 55.2 % with Oracle EDQ and SAS DataFlux, and 55.17 % in all other tools. The test case for DP-5 yielded 23 % with Datamartist, 23.07 % with Informatica, and 23.08 % with the other tools.</p>
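<p>The effect of differing precision can be illustrated with a minimal Python sketch. The counts used below (16 <monospace>null</monospace> values in 29 rows, 3 distinct values in 13 rows) are hypothetical and were chosen only because they produce percentages of the reported magnitude; the helper names are assumptions as well. Rounding versus truncating is one way in which a difference such as 23.08 % versus 23.07 % can arise, and the sketch makes no claim about how the individual tools actually compute these values.</p>
<preformat>
```python
import math


def rounded_percentage(part, total, digits):
    """Return part/total as a percentage, rounded to `digits` decimals."""
    return round(100.0 * part / total, digits)


def truncated_percentage(part, total, digits):
    """Return part/total as a percentage, cut off after `digits` decimals."""
    factor = 10 ** digits
    return math.floor(100.0 * part / total * factor) / factor


# Hypothetical counts, not taken from the test database.
null_count, row_count = 16, 29
for digits in (0, 1, 2):
    print(digits, rounded_percentage(null_count, row_count, digits))

distinct_count, value_count = 3, 13
print(rounded_percentage(distinct_count, value_count, 2))
print(truncated_percentage(distinct_count, value_count, 2))
```
</preformat>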
</sec>
<sec>
<title>5.2.1.2. Single Column&#x02014;Value Distributions</title>
<p>Value distributions can be described as cardinalities of value groups (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>). While histograms to visualize value distributions are available in most tools in the form of equi-width histograms (which &#x0201C;span value ranges of same length&#x0201D; Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>), we did not find any tool that supports equi-depth or equi-height histograms (where each bucket represents &#x0201C;the same number of value occurrences&#x0201D;). Thus, we rated all tools that support histograms with &#x0201C;partially&#x0201D; for DP-6. Ataccama allows frequency analysis but no visualization with histograms, and Aggregate Profiler visualizes the distributions only in the form of a pie chart. The majority of tools also support minimum and maximum values (DP-7), as well as constancy (DP-8), which is defined as &#x0201C;the ratio of the frequency of the most frequent value (possibly a predefined default value) and the overall number of values&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>). &#x0201C;Benford&#x00027;s law&#x0201D; (DP-10), which is particularly interesting in the area of fraud detection, was only available in Talend OS.</p>
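<p>The difference between equi-width and equi-depth histograms, as well as the constancy measure, can be sketched in a few lines of Python. The function names and the sample values are assumptions made for illustration; they do not reproduce the implementation of any surveyed tool.</p>
<preformat>
```python
from collections import Counter


def equi_width_edges(values, buckets):
    """Bucket boundaries that span value ranges of the same length."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / buckets
    return [lowest + i * width for i in range(buckets + 1)]


def equi_depth_buckets(values, buckets):
    """Split the sorted values into buckets holding (nearly) the same number of values."""
    ordered = sorted(values)
    count = len(ordered)
    bounds = [round(i * count / buckets) for i in range(buckets + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(buckets)]


def constancy(values):
    """Frequency of the most frequent value divided by the number of values."""
    most_frequent_count = Counter(values).most_common(1)[0][1]
    return most_frequent_count / len(values)


# Hypothetical, skewed sample column.
sample = [1, 1, 1, 2, 3, 4, 10, 100]
print(equi_width_edges(sample, 3))
print(equi_depth_buckets(sample, 4))
print(constancy(sample))
```
</preformat>
<p>On skewed data such as this sample, almost all values fall into the first equi-width bucket, whereas the equi-depth buckets remain equally populated.</p>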
<p>Quantiles are a statistical measure to divide a value distribution into equidistant percentage points (Sheskin, <xref ref-type="bibr" rid="B85">2003</xref>). The most common type of quantiles, which we observed in our study, are &#x0201C;quartiles&#x0201D; (DP-9), where the value distribution is divided by three points into four blocks. The division points are a multiple of 25 %, denoted as lower quartile or Q1 (25 %), median or Q2 (50 %), and upper quartile or Q3 (75 %), respectively. Other examples of quantiles are &#x0201C;percentiles,&#x0201D; which divide the distribution into 100 blocks (i.e., each block comprises a proportion of 1 %), or &#x0201C;deciles,&#x0201D; which divide the distribution into 10 blocks, each comprising 10 % of the distribution (Sheskin, <xref ref-type="bibr" rid="B85">2003</xref>). While only three tools explicitly support quartiles, we discovered the availability of other types of quantiles too (in our survey rated as <italic>p</italic>). <xref ref-type="table" rid="T6">Table 6</xref> shows the results for the DP-9 test case where quartiles or other types of quantiles are calculated for the column <monospace>OrderItem.UnitPrice</monospace>.</p>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Data profiling&#x02014;test case quartiles.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="left"><bold>Q1 (25 %)</bold></th>
<th valign="top" align="left"><bold>Q2 (50 %)</bold></th>
<th valign="top" align="left"><bold>Q3 (75 %)</bold></th>
<th valign="top" align="left"><bold>Type of Quantile</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Aggregate Profiler</td>
<td valign="top" align="left">12</td>
<td valign="top" align="left">18.4</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">Quartiles (4 blocks)</td>
</tr>
<tr>
<td valign="top" align="left">DataCleaner</td>
<td valign="top" align="left">12</td>
<td valign="top" align="left">18.4</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">Quartiles (4 blocks)</td>
</tr>
<tr>
<td valign="top" align="left">Talend Open Studio</td>
<td valign="top" align="left">12</td>
<td valign="top" align="left">18.4</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">Quartiles (4 blocks)</td>
</tr>
<tr>
<td valign="top" align="left">SAS Data Quality</td>
<td valign="top" align="left">12.9375</td>
<td valign="top" align="left">19.475</td>
<td valign="top" align="left">33.4375</td>
<td valign="top" align="left">Demi-deciles (20 blocks)</td>
</tr>
<tr>
<td valign="top" align="left">InfoZoom &#x00026; IZDQ</td>
<td valign="top" align="left">12 (25.48 %)</td>
<td valign="top" align="left">18.4 (50.63 %)</td>
<td valign="top" align="left">32 (75.36 %)</td>
<td valign="top" align="left">Inverse function (2,155 blocks)</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Ataccama ONE</td>
<td valign="top" align="left" colspan="3">0 %: 2, 10 %: 7.45, 20 %: 10, 30 %: 13.25, 40 %: 16, 50 %: 18.4, 60 %: 21.5, 70 %: 30, 80 %: 35.1, 90 %: 46, 100 %: 263.5</td>
<td valign="top" align="left">Deciles (10 blocks)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Ataccama ONE supports deciles, which are displayed separately in the last row since they cannot be directly mapped to quartiles. Although SAS Data Quality provides 20 blocks, that is, demi-deciles, the functionality is described in the SAS UI as &#x0201C;percentiles,&#x0201D; which would refer to the 100-partition quantiles. In <xref ref-type="table" rid="T6">Table 6</xref>, we picked only the values for the quartile blocks out of the 20 blocks in total. In InfoZoom, the inverse function to quantiles is chosen: instead of merging values into blocks, the percentage value of the distribution is displayed for each value [denoted as &#x0201C;Cumulative Distribution Function&#x0201D; (Dasu and Johnson, <xref ref-type="bibr" rid="B20">2003</xref>)], leading to a total of 2,155 blocks, where each block contains exactly one value. <xref ref-type="table" rid="T6">Table 6</xref> displays the percentage value of the distribution that refers to Q1, Q2, and Q3, respectively. In summary, the determination of quantiles is interpreted differently by the individual DQ tools with respect to the notation (&#x0201C;Q1&#x0201D; vs. &#x0201C;lower quartile&#x0201D; vs. 25 %) as well as the type of quantile.</p>
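<p>The different quantile types can be reproduced with the Python standard library, as the following minimal sketch shows. The price list is hypothetical, and <monospace>cumulative_share</monospace> is an illustrative helper for the inverse (cumulative distribution) view. The two interpolation methods of <monospace>statistics.quantiles</monospace> are shown only to demonstrate that quartiles of the same data can differ between implementations; the sketch makes no claim about which method any of the surveyed tools uses.</p>
<preformat>
```python
import statistics
from bisect import bisect_right

# Hypothetical unit prices, not taken from the test database.
prices = [2, 7, 10, 12, 14, 18, 21, 30, 35, 46, 263]

# Quartiles: three cut points that divide the distribution into four blocks.
quartiles_exclusive = statistics.quantiles(prices, n=4, method="exclusive")
quartiles_inclusive = statistics.quantiles(prices, n=4, method="inclusive")

# Deciles: nine cut points that divide the distribution into ten blocks.
deciles = statistics.quantiles(prices, n=10, method="inclusive")


def cumulative_share(values, value):
    """Share of entries that are not larger than `value` (empirical CDF)."""
    ordered = sorted(values)
    return bisect_right(ordered, value) / len(ordered)


print(quartiles_exclusive)
print(quartiles_inclusive)
print(deciles)
print(cumulative_share(prices, 18))
```
</preformat>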
</sec>
<sec>
<title>5.2.1.3. Single Column&#x02014;Patterns, Data Types, and Domains</title>
<p>In this category, the support of the different requirements varies widely and there is definite potential for improvement with respect to out-of-the-box pattern and domain discovery. Even the discovery of basic types (DP-11) is not always supported. For example, DataCleaner recognizes the difference between string, boolean, and number and uses this information for further internal processing, but does not explicitly display it per attribute. While the test cases for the DBMS-specific data types (DP-12) yielded uniform results (&#x0201C;varchar&#x0201D; for <monospace>ProductName</monospace>, &#x0201C;decimal&#x0201D; for <monospace>UnitPrice</monospace>, and &#x0201C;bit&#x0201D; for <monospace>isDiscontinued</monospace>), the variety in terminology and classification for the basic types is outlined in <xref ref-type="table" rid="T7">Table 7</xref>. In SAS, we had problems accessing a table containing the &#x0201C;decimal&#x0201D; data type and thus converted <monospace>Product.UnitPrice</monospace> to &#x0201C;long.&#x0201D; &#x0201C;Alphanumeric&#x0201D; in Experian Pandora is abbreviated as &#x0201C;Alphanum.&#x0201D;</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Data profiling&#x02014;test cases basic types.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="left"><bold>A-ONE</bold></th>
<th valign="top" align="left"><bold>DM</bold></th>
<th valign="top" align="left"><bold>EP</bold></th>
<th valign="top" align="left"><bold>IDQ</bold></th>
<th valign="top" align="left"><bold>IZDQ</bold></th>
<th valign="top" align="left"><bold>OR</bold></th>
<th valign="top" align="left"><bold>O-EDQ</bold></th>
<th valign="top" align="left"><bold>SAS</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><monospace>ProductName</monospace></td>
<td valign="top" align="left">String</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">Alphanumeric</td>
<td valign="top" align="left">String(32)</td>
<td valign="top" align="left">String</td>
<td valign="top" align="left">String</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">String</td>
</tr>
<tr>
<td valign="top" align="left"><monospace>UnitPrice</monospace></td>
<td valign="top" align="left">String</td>
<td valign="top" align="left">Number</td>
<td valign="top" align="left">Decimal</td>
<td valign="top" align="left">Decimal(5,2)</td>
<td valign="top" align="left">&#x00023;&#x00023;&#x00023;&#x00023;.&#x00023;&#x00023;</td>
<td valign="top" align="left">String/numeric</td>
<td valign="top" align="left">Numeric</td>
<td valign="top" align="left">Long</td>
</tr>
<tr>
<td valign="top" align="left"><monospace>isDiscontinued</monospace></td>
<td valign="top" align="left">Integer</td>
<td valign="top" align="left">Number</td>
<td valign="top" align="left">Integer</td>
<td valign="top" align="left">Integer(1)</td>
<td valign="top" align="left">&#x00023;&#x00023;&#x00023;&#x00023;</td>
<td valign="top" align="left">Numeric</td>
<td valign="top" align="left">Numeric</td>
<td valign="top" align="left">Bit</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For the measurement of the value length (DP-13), the minimum (min.) and maximum (max.) values are usually provided, but not always an average (avg.) value length. The median (med.) value length is only provided by Ataccama ONE. We rated this requirement as fulfilled if at least the minimum, maximum, and average value length were provided, considering the median as optional. <xref ref-type="table" rid="T8">Table 8</xref> shows the exact results delivered by the individual tools, which justify the fulfillment ratings and indicate differences in the accuracy of the average values. InfoZoom provides only the maximum value length, while SAS and Talend OS restrict this feature to string values.</p>
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption><p>Data profiling&#x02014;test case value length.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="left"><bold>AP</bold></th>
<th valign="top" align="left"><bold>A-ONE</bold></th>
<th valign="top" align="left"><bold>DC</bold></th>
<th valign="top" align="left"><bold>EP</bold></th>
<th valign="top" align="left"><bold>IDQ</bold></th>
<th valign="top" align="left"><bold>IZDQ</bold></th>
<th valign="top" align="left"><bold>OR</bold></th>
<th valign="top" align="left"><bold>O-EDQ</bold></th>
<th valign="top" align="left"><bold>SAS</bold></th>
<th valign="top" align="left"><bold>TOS</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><monospace>ProductName</monospace> (min.)</td>
<td valign="top" align="left">4</td>
<td valign="top" align="left">4</td>
<td valign="top" align="left">4</td>
<td valign="top" align="left">4</td>
<td valign="top" align="left">4</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">4</td>
<td valign="top" align="left">4</td>
<td valign="top" align="left">4</td>
<td valign="top" align="left">4</td>
</tr>
<tr>
<td valign="top" align="left"><monospace>ProductName</monospace> (max.)</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">32</td>
</tr>
<tr>
<td valign="top" align="left"><monospace>ProductName</monospace> (avg.)</td>
<td valign="top" align="left">16.269</td>
<td valign="top" align="left">16.27</td>
<td valign="top" align="left">16.269</td>
<td valign="top" align="left">16.32</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">16.269</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">16.32</td>
</tr>
<tr>
<td valign="top" align="left"><monospace>ProductName</monospace> (med.)</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">15</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
</tr>
</tbody>
</table>
</table-wrap>
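<p>The four value-length measures of DP-13 can be computed as in the following minimal Python sketch. The function name, the rounding to three decimals, the decision to skip <monospace>null</monospace> values, and the sample names are assumptions made for illustration only.</p>
<preformat>
```python
import statistics


def length_profile(values):
    """Minimum, maximum, average, and median length of the non-null values."""
    lengths = [len(value) for value in values if value is not None]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "avg": round(statistics.mean(lengths), 3),
        "med": statistics.median(lengths),
    }


# Hypothetical sample column with one null value.
names = ["Chai", "Chang", "Aniseed Syrup", None]
print(length_profile(names))
```
</preformat>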
<p>For the number of digits and decimals, the DQ tools usually use the values documented by the DBMS, e.g., 12 digits and 2 decimals for attribute <monospace>UnitPrice</monospace> in table <monospace>Product</monospace>, compared to a maximum of 5 digits and 2 decimals in the real data. Value patterns and their visualization as a histogram (DP-16) are supported by most DQ tools. SAS supports pie charts only.</p>
<p>Generic semantic data types (DP-17), such as code, indicators, date/time, quantity, or identifier are also denoted as &#x0201C;data class&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>) and are defined by generic patterns. A semantic domain (DP-18), &#x0201C;such as a credit card, first name, city, [or] phenotype&#x0201D; (Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>), is more concrete than a generic semantic data type and usually associated with a specific application context. The DQ tools that fulfill these requirements offer a number of patterns, which are associated with the respective generic data type or semantic domain. By applying these patterns to the data values, it can be verified to what extent an attribute contains values that are of a specific type. Thus, the two requirements DP-17 and DP-18 are usually not distinguished within the DQ tools we evaluated. The number of available patterns varies between approximately 10&#x02013;50 patterns (Pandora, DataCleaner, SAS), 50&#x02013;100 patterns (Talend), and 100&#x02013;300 (Informatica, Oracle). While most tools display the matching patterns per attribute (e.g., <monospace>Product.UnitPrice</monospace> conforms 98.72 % to the domain &#x0201C;Geocode_Longitude&#x0201D; using Informatica DQ), SAS displays the matching attribute per pattern (e.g., &#x0201C;Country&#x0201D; matches the attribute <monospace>Customer.Country</monospace> to 100 %). Talend OS is the only tool that displays the matching rows instead of the percentage of matching rows per attribute. For Ataccama ONE, we rated DP-17 and DP-18 as partially fulfilled, since specific attributes are classified (e.g., <monospace>Customer.FirstName</monospace> as &#x0201C;first name&#x0201D;), but those terms are part of the Ataccama business glossary, which we were unable to access during our evaluation; we therefore had no further information about their origin.</p>
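<p>How a pattern can be applied to the values of an attribute to obtain a match percentage is sketched below in Python. The regular expression and the range check are a hypothetical stand-in for a longitude domain and are not the pattern shipped with any surveyed tool; the sample values are assumptions as well. The sketch also indicates why a price column can largely conform to a longitude domain: most prices are simply numbers within the range of valid longitudes.</p>
<preformat>
```python
import re

# Hypothetical pattern: optional sign, up to three integer digits, optional decimals.
NUMBER = re.compile(r"-?\d{1,3}(\.\d+)?")


def looks_like_longitude(text):
    """Return True if `text` is a number within the range -180 to 180."""
    if NUMBER.fullmatch(text) is None:
        return False
    return 180 >= abs(float(text))


def match_percentage(values, check):
    """Percentage of values for which `check` returns True."""
    return 100.0 * sum(1 for value in values if check(value)) / len(values)


# Hypothetical unit prices stored as text.
prices = ["18.00", "19.00", "10.00", "263.50"]
print(match_percentage(prices, looks_like_longitude))
```
</preformat>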
</sec>
<sec>
<title>5.2.1.4. Dependencies</title>
<p>The dependency section has the lowest coverage of the data profiling category and is best supported by Experian Pandora and Informatica DQ (in Developer edition only). Although we introduce each concept briefly, we refer to Abedjan et al. (<xref ref-type="bibr" rid="B2">2019</xref>) for details about dependency discovery and its implementation. In the following, <inline-formula><mml:math id="M14"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">R</mml:mi></mml:mrow></mml:math></inline-formula> denotes a relational schema (defining a set of attributes) with <italic>r</italic> being an instance of <inline-formula><mml:math id="M15"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">R</mml:mi></mml:mrow></mml:math></inline-formula> (defining a set of records). Sets of attributes are denoted by &#x003B1; and &#x003B2;.</p>
<p>A unique column combination (UCC) is an attribute set <inline-formula><mml:math id="M16"><mml:mi>&#x003B1;</mml:mi><mml:mo>&#x02286;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">R</mml:mi></mml:mrow></mml:math></inline-formula> whose projection contains no duplicate entries in <italic>r</italic> (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>). In other words, a UCC is a (possibly composite) candidate key that functionally determines <inline-formula><mml:math id="M17"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">R</mml:mi></mml:mrow></mml:math></inline-formula>. While Experian Pandora allows the detection of single column keys (thus <italic>p</italic>), Informatica DQ offers full UCC detection. Both tools allow the user to set a threshold for relaxed UCC detection (DP-20) and to identify violating records <italic>via</italic> drill-down. With Informatica DQ, we discovered five UCCs in table <monospace>Order</monospace> of our test DB, using a threshold of 98 %: <monospace>Id</monospace> (100 %), <monospace>OrderNumber</monospace> (100 %), <monospace>OrderDate</monospace> &#x0002B; <monospace>TotalAmount</monospace> (100 %), <monospace>CustomerId</monospace> &#x0002B; <monospace>TotalAmount</monospace> (99.88 %), and <monospace>CustomerId</monospace> &#x0002B; <monospace>OrderDate</monospace> (99.16 %). With Experian Pandora, only the two single column keys <monospace>Id</monospace> (100 %) and <monospace>OrderNumber</monospace> (100 %) were detected. SAS Data Quality indicates 100 % unique attributes as primary key candidates.</p>
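<p>Relaxed UCC detection with a threshold can be sketched in Python as follows. The uniqueness measure (number of distinct projected value combinations divided by the number of records), the restriction to minimal combinations of at most two columns, and the sample records are assumptions made for illustration; the surveyed tools may define their percentages differently.</p>
<preformat>
```python
from itertools import combinations


def uniqueness(rows, columns):
    """Share of distinct value combinations in the projection on `columns`."""
    projected = [tuple(row[column] for column in columns) for row in rows]
    return len(set(projected)) / len(projected)


def relaxed_uccs(rows, threshold, max_size=2):
    """Minimal column combinations whose uniqueness reaches `threshold`."""
    names = list(rows[0])
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(names, size):
            # Skip supersets of an already discovered UCC (not minimal).
            if any(set(ucc).issubset(combo) for ucc, _ in found):
                continue
            ratio = uniqueness(rows, combo)
            if ratio >= threshold:
                found.append((combo, ratio))
    return found


# Hypothetical records, not taken from the test database.
orders = [
    {"Id": 1, "CustomerId": "A", "OrderDate": "2014-01-01"},
    {"Id": 2, "CustomerId": "A", "OrderDate": "2014-01-02"},
    {"Id": 3, "CustomerId": "B", "OrderDate": "2014-01-01"},
    {"Id": 4, "CustomerId": "B", "OrderDate": "2014-01-01"},
]
print(relaxed_uccs(orders, threshold=1.0))
print(relaxed_uccs(orders, threshold=0.75))
```
</preformat>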
<p>An inclusion dependency (IND) over the relational schemata <italic>R</italic><sub><italic>i</italic></sub> and <italic>R</italic><sub><italic>j</italic></sub> states that all values in attribute set &#x003B1; also occur in &#x003B2;, that is <italic>R</italic><sub><italic>i</italic></sub>[&#x003B1;]&#x02286;<italic>R</italic><sub><italic>j</italic></sub>[&#x003B2;] (Abedjan et al., <xref ref-type="bibr" rid="B2">2019</xref>). The detection of INDs (DP-21 and DP-22), also referred to as foreign key discovery, is not widely supported. Experian Pandora delivers the best automation for this requirement: initially, the primary keys (UCCs) and foreign key relations are inferred, and based on this information, INDs are displayed graphically as a Venn diagram. In addition, it is possible to drill down to records that violate those INDs in a spreadsheet. Informatica DQ and SAS Data Quality support IND discovery only partially, since the user is required to select the respective primary key (UCC) and assign it to possible foreign key candidates, which are then tested for compliance. DataCleaner can only be used to check if two tables can possibly be joined, without information on the respective columns or the join quality (i.e., violating rows).</p>
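<p>Testing a single IND candidate amounts to a set inclusion check, as the following minimal Python sketch shows. The function name and the sample key values are hypothetical.</p>
<preformat>
```python
def ind_violations(dependent_values, referenced_values):
    """Values of the dependent attribute that do not occur in the referenced attribute."""
    referenced = set(referenced_values)
    return [value for value in dependent_values if value not in referenced]


# Hypothetical foreign key and primary key values.
order_customer_ids = [1, 2, 2, 5]
customer_ids = [1, 2, 3]

violations = ind_violations(order_customer_ids, customer_ids)
print(violations)       # values that violate the IND
print(not violations)   # True only if the IND holds exactly
```
</preformat>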
<p>A functional dependency (FD) &#x003B1; &#x02192; &#x003B2; asserts that all pairs of records with the same attribute values in &#x003B1; must also have the same attribute values in &#x003B2;. Thus, the &#x003B1;-values functionally determine the &#x003B2;-values (Codd, <xref ref-type="bibr" rid="B18">1970</xref>). Again, we used table <monospace>Order</monospace> to verify exact (DP-23) and relaxed (DP-24) FD detection. With Experian Pandora and Informatica DQ, we found eight exact FDs in total, {<monospace>Id</monospace>} &#x02192; {<monospace>OrderDate</monospace>, <monospace>OrderNumber</monospace>, <monospace>CustomerId</monospace>, <monospace>TotalAmount</monospace>} and {<monospace>OrderNumber</monospace>} &#x02192; {<monospace>Id</monospace>, <monospace>OrderDate</monospace>, <monospace>CustomerId</monospace>, <monospace>TotalAmount</monospace>}, and two more FDs when relaxing the threshold to 93 %: {<monospace>TotalAmount</monospace>} &#x02192; {<monospace>CustomerId</monospace>, <monospace>OrderDate</monospace>}. Talend OS fulfills FD discovery only partially, because it requires user interaction to specify the attribute sets &#x003B1; and &#x003B2;, given that the number and type of columns are equal. Although specific FDs can be tested with this functionality (e.g., to which extent <monospace>TotalAmount</monospace> determines <monospace>CustomerId</monospace>), we do not perceive it as true automated FD discovery. For example, when performing the test case and specifying all attributes of <monospace>Order</monospace> as &#x003B1; and &#x003B2;, respectively, the result is five FDs, where each attribute is discovered to functionally determine itself. This case should ideally be excluded during the detection. All three tools printed the identified FDs in table format, one row for each attribute pair along with the match percentage, but with slightly differing terminology. Thus, &#x003B1; (the left side) is denoted as &#x0201C;A column set,&#x0201D; &#x0201C;identity column&#x0201D; or &#x0201C;determinant column,&#x0201D; and &#x003B2; (the right side) is denoted as &#x0201C;B column set,&#x0201D; &#x0201C;identified columns&#x0201D; or &#x0201C;dependent column.&#x0201D; For relaxed FDs, Experian displayed the violating rows with a count (50 in this case), Informatica listed the respective rows, and Talend did not provide violating rows at all.</p>
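<p>The match percentage of a candidate FD can be computed as in the following minimal Python sketch. The measure used here (the share of records that remain when, for each &#x003B1;-value, only the most frequent &#x003B2;-value is kept) is one common choice and an assumption on our part; the surveyed tools do not necessarily use the same definition. Function name and sample records are hypothetical.</p>
<preformat>
```python
from collections import Counter, defaultdict


def fd_support(rows, lhs, rhs):
    """Share of rows that are consistent with the candidate FD lhs -> rhs."""
    groups = defaultdict(Counter)
    for row in rows:
        left = tuple(row[column] for column in lhs)
        right = tuple(row[column] for column in rhs)
        groups[left][right] += 1
    # Per left-hand value, keep only the rows with the most frequent right-hand value.
    kept = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return kept / len(rows)


# Hypothetical records, not taken from the test database.
orders = [
    {"Id": 1, "TotalAmount": 100, "CustomerId": "A"},
    {"Id": 2, "TotalAmount": 100, "CustomerId": "A"},
    {"Id": 3, "TotalAmount": 100, "CustomerId": "B"},
    {"Id": 4, "TotalAmount": 200, "CustomerId": "C"},
]
print(fd_support(orders, ["Id"], ["CustomerId"]))           # exact FD
print(fd_support(orders, ["TotalAmount"], ["CustomerId"]))  # relaxed FD
```
</preformat>
<p>A relaxed FD is then reported whenever the support reaches the user-defined threshold, and the rows that were not kept are the violating records.</p>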
</sec>
<sec>
<title>5.2.1.5. Advanced Multi-Column Profiling</title>
<p>Apart from duplicate detection, which is a widely supported feature, advanced multi-column features are rarely supported satisfactorily. No single tool offers association rule mining (DP-26) as mentioned by Abedjan et al. (<xref ref-type="bibr" rid="B1">2015</xref>). Note that we specifically tested the DQ tools described in Section 5.1, and did not consider related tools that are often installed together. For example, SAS Enterprise Guide, which was shipped with our DQ installation, is dedicated to data analysis and therefore provides a rich function palette that overlaps with the multi-column profiling section, e.g., a selection of correlation coefficients, hierarchical and k-means clustering. Since the aim of this survey is to investigate DQ tools, we did not consider such related tools.</p>
<p>Correlations (DP-25) are a statistical measure between -1.0 and 1.0 that indicates the relationship between two numerical attributes (Sheskin, <xref ref-type="bibr" rid="B85">2003</xref>; Abedjan et al., <xref ref-type="bibr" rid="B1">2015</xref>). The most commonly used coefficients are the Pearson correlation coefficient and the rank-based Spearman&#x00027;s or Kendall&#x00027;s tau correlation coefficients (Sheskin, <xref ref-type="bibr" rid="B85">2003</xref>). In our survey, only Aggregate Profiler is able to compute Pearson correlations. However, our test case for DP-25 (Pearson correlation between <monospace>OrderItem.UnitPrice</monospace> and <monospace>OrderItem.Quantity</monospace>) yielded -0.045350608 with Aggregate Profiler, which did not conform to our cross-check using SAS Enterprise Guide (0.00737) and the Python package numpy (0.00736647). Talend distinguishes between &#x0201C;numerical,&#x0201D; &#x0201C;time,&#x0201D; and &#x0201C;nominal&#x0201D; correlation analysis and displays the respective correlations in bubble charts. We rated this as partial fulfillment <italic>p</italic>, since no correlation coefficient is calculated and the calculation is restricted to single columns with specific data types. Thus, it is not possible to calculate the correlation between two interval data types.</p>
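<p>For reference, the Pearson correlation coefficient can be computed directly from its definition, as in the following minimal Python sketch. The function name and the sample values are hypothetical; the sketch is not the cross-check script used in our evaluation.</p>
<preformat>
```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical unit prices and quantities.
unit_prices = [10.0, 20.0, 30.0, 40.0]
quantities = [8, 6, 4, 2]
print(pearson(unit_prices, quantities))
```
</preformat>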
<p>During our investigation, we found that the concepts of clustering (DP-27), outlier detection (DP-28), and duplicate detection (DP-29 and DP-30) are not always clearly distinguishable in practice. Abedjan et al. (<xref ref-type="bibr" rid="B1">2015</xref>) also state that clustering can be used either to detect outliers in a single column or to detect similar or duplicate records within a table. Thus, we describe the three concepts briefly along with the condition we applied to verify (partial) fulfillment of the respective requirement.</p>
<p>Clustering (DP-27) is a type of unsupervised machine learning, where data items (e.g., records or attribute values) are classified into groups (clusters) (Jain et al., <xref ref-type="bibr" rid="B48">2000</xref>). A comprehensive overview of existing clustering algorithms is provided by Jain et al. (<xref ref-type="bibr" rid="B48">2000</xref>). In some DQ tools, clustering is only available in the context of duplicate detection. For example, in OpenRefine clustering is used to detect duplicate string values (cf. Stephens, <xref ref-type="bibr" rid="B86">2018</xref>), in Informatica DQ the grouped duplicates are referred to as &#x0201C;clusters,&#x0201D; SAS Data Quality requires a &#x0201C;clustering&#x0201D; component to group records based on their match codes (SAS, <xref ref-type="bibr" rid="B79">2019</xref>), and Oracle EDQ uses clustering as a preprocessing step of the matching component to increase runtime efficiency by preventing unnecessary comparisons between records (Oracle, <xref ref-type="bibr" rid="B66">2018</xref>). To completely fulfill requirement DP-27, we required the availability of one of the common clustering algorithms (like <italic>k</italic>-means or hierarchical clustering) as an independent function. Datamartist supports <italic>k</italic>-means clustering and allows the user to select the number of clusters <italic>k</italic> from five predefined values (5, 10, 25, 50, 100) and to restrict the observed value range. Aggregate Profiler supports <italic>k</italic>-means clustering without any configuration options (e.g., choosing <italic>k</italic>), as well as a second type of clustering for numeric values, where the number of clusters can be defined. No further information about this clustering algorithm is provided. No tool offers hierarchical clustering or other partitional clustering algorithms except <italic>k</italic>-means, for example, graph theoretic approaches or expectation maximization (Jain et al., <xref ref-type="bibr" rid="B48">2000</xref>).</p>
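<p>To make the notion of <italic>k</italic>-means as an independent function concrete, the following minimal Python sketch clusters the values of a single numeric column. The deterministic initialization, the fixed number of iterations, and the sample values are simplifying assumptions; the sketch does not reproduce the implementation of Datamartist or Aggregate Profiler.</p>
<preformat>
```python
def kmeans_1d(values, k, iterations=20):
    """Cluster numeric values into k groups (k must be at least 2)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers over the sorted values.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical column values.
centers, clusters = kmeans_1d([1, 2, 3, 100, 101, 102], k=2)
print(centers)
print(clusters)
```
</preformat>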
<p>Outlier detection deals with data points that are considered abnormalities, deviants, or discordant when compared to the remaining data (Aggarwal, <xref ref-type="bibr" rid="B3">2017</xref>). A comprehensive overview on different algorithms to detect outliers is provided by Aggarwal (<xref ref-type="bibr" rid="B3">2017</xref>). Our investigation showed that outlier detection is implemented in the tools very differently, and compared to the current state of research, only simple methods are used. We have not found a tool that supports multivariate outlier detection or one of the more sophisticated approaches like z-score, linear regression models, or probabilistic models as mentioned by Aggarwal (<xref ref-type="bibr" rid="B3">2017</xref>). In the following, we describe the implementation of outlier detection and the result that our test case yielded to &#x0201C;find very high values&#x0201D; in <monospace>Order.TotalAmount</monospace>:</p>
<list list-type="bullet">
<list-item><p>Aggregate Profiler, Ataccama ONE, Datamartist, and InfoZoom provide outlier detection for numerical values only visually, either in a quantile plot (Ataccama), in a bar chart (Datamartist), or in the form of a box plot. Aggregate Profiler and Ataccama ONE do not allow drill-down to the actual outlying values, and in InfoZoom the visualization of the single values in the plot is not readable. In these three tools, it is not possible to modify the plot settings or to get details about the used settings. The bar chart in Datamartist is based on k-means clustering with the same modification options as described in the previous paragraph. When using the standard settings (100 bars), one outlier is detected for our test case of finding &#x0201C;very high values&#x0201D; in column <monospace>Order.TotalAmount</monospace>: 17250.0. This extreme value is detected correctly by all tools, although other methods yield more outlying values.</p></list-item>
<list-item><p>Experian Pandora offers a number of different types of outlier checks, where some require one of the two parameters that can be specified by a user: &#x0201C;Rarity Threshold&#x0201D; (default: 1000) and &#x0201C;Standard Deviation Tolerance&#x0201D; (default: 3.3) (Experian, <xref ref-type="bibr" rid="B28">2018</xref>). The rarity threshold is used to detect rare values, which occur less frequently than one time in &#x0003C;threshold&#x0003E;, and is applied in the checks &#x0201C;rare values,&#x0201D; &#x0201C;is a key,&#x0201D; and &#x0201C;unusually missing values&#x0201D; (Experian, <xref ref-type="bibr" rid="B28">2018</xref>). The standard deviation tolerance specifies the number of standard deviations that a value is allowed to deviate from the norm. It is used for low/high amounts, short/long values, rare/frequent values, and rare formats (Experian, <xref ref-type="bibr" rid="B28">2018</xref>). By using the standard settings, we found 18 outlying values for our test case (17250.0, 16321.9, 15810.0, 12281.2, 11493.2, 11490.7, 11380.0, 11283.2, 10835.24, 10741.6, 10588.5, 10495.6, 10191.7, 10164.8, 8902.5, 8891.0, 8623.45, 8267.4).</p></list-item>
<list-item><p>Informatica DQ distinguishes between &#x0201C;pattern outliers,&#x0201D; which refer to unusual patterns in the data, and &#x0201C;value frequency outliers&#x0201D; (Informatica, <xref ref-type="bibr" rid="B43">2018</xref>), where values with unusual occurring frequency are displayed. With this functionality, it was not possible to perform our test case, because the characteristic of being an outlier depends on the frequency instead of the actual value.</p></list-item>
<list-item><p>SAS provides an &#x0201C;outliers&#x0201D; tab for columns of different data types, where a fixed number of five minimum and maximum values are outlined without any modification possibility. For our test case, the following maximum values have been detected: 17250.0, 16321.0, 15810.0, 12281.2, and 11493.2, which correspond to the five highest results detected with Pandora.</p></list-item>
</list>
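<p>Two of the simple strategies described in the list above, a tolerance expressed in standard deviations and a fixed number of highest values, can be sketched in Python as follows. The functions are illustrative assumptions and not the algorithms implemented by Experian Pandora or SAS; only the tolerance of 3.3 is taken from the default setting mentioned above, and the sample amounts are hypothetical.</p>
<preformat>
```python
import statistics


def stddev_outliers(values, tolerance):
    """Values that deviate from the mean by more than `tolerance` standard deviations."""
    mean = statistics.mean(values)
    deviation = statistics.pstdev(values)
    return [value for value in values if abs(value - mean) > tolerance * deviation]


def highest_values(values, count=5):
    """The `count` largest values, regardless of how unusual they are."""
    return sorted(values, reverse=True)[:count]


# Hypothetical order amounts with one extreme value.
amounts = [10] * 20 + [1000]
print(stddev_outliers(amounts, tolerance=3.3))
print(highest_values(amounts, count=5))
```
</preformat>
<p>The second function always returns the requested number of values, which illustrates why a fixed-size list of maxima is not an outlier test in the statistical sense.</p>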
<p>Duplicate detection &#x0201C;aims to identify records [...] that refer to the same real-world entity&#x0201D; (Elmagarmid et al., <xref ref-type="bibr" rid="B26">2006</xref>). It is a widely researched field, which is also referred to as record matching, record linkage, data merging, or redundancy detection (Elmagarmid et al., <xref ref-type="bibr" rid="B26">2006</xref>). In contrast to clustering and outlier detection, the understanding and implementation of duplicate detection is very similar across all tools we investigated. In principle, the user (1) selects the columns that should be considered for comparison, (2) optionally applies a transformation to those columns (e.g., pruning a string to the first three characters), and finally (3) selects an appropriate distance function and algorithm. The major difference in the implementations is the selection of distance functions for the attribute values. The following distances are supported:</p>
<list list-type="bullet">
<list-item><p>Aggregate Profiler: exact match, similar-any word (if any word is similar for this column), similar-all words (if all words are similar for this column), begin char match, and end char match (Arrah Technology, <xref ref-type="bibr" rid="B6">2019</xref>). No information about the used similarity function was provided.</p></list-item>
<list-item><p>DataCleaner: n-grams, first 5, last 5, sorted acronym, Metaphone, common integer, Fingerprint, near integer (for pre-selection phase); exact, is empty, normalized affine gap, and cosine similarity (for scoring phase). The two phases are explained in the following paragraph.</p></list-item>
<list-item><p>Experian Pandora: Edit distance, exact, exact (ignore cases), Jaro distance, Jaro-Winkler distance, regular expression, and Soundex.</p></list-item>
<list-item><p>Informatica DQ: Bigram, Edit, Hamming, reverse Hamming, and Jaro distance.</p></list-item>
<list-item><p>InfoZoom: Soundex and Cologne phonetics.</p></list-item>
<list-item><p>OpenRefine: Fingerprint, n-gram Fingerprint, Metaphone3, or Cologne phonetics (with key collision method); Levenshtein or PPM (with nearest neighbor method); cf. Stephens (<xref ref-type="bibr" rid="B86">2018</xref>) for details.</p></list-item>
<list-item><p>Oracle EDQ: (transformations) absolute value, first/last n characters/words, lower case, Metaphone, normalize whitespace, round, Soundex.</p></list-item>
<list-item><p>Talend OS: exact, exact (ignore case), Soundex, Soundex FR, Levenshtein, Metaphone, Double Metaphone, Jaro, Fingerprint key, Jaro-Winkler, q-grams, Hamming, and custom.</p></list-item>
</list>
<p>SAS Data Quality does not offer string distances, but matches based on match codes (SAS, <xref ref-type="bibr" rid="B79">2019</xref>), which are generated from an input variable, a &#x0201C;definition&#x0201D; (type of transformation for the input variable), and a &#x0201C;sensitivity&#x0201D; (threshold). Records with the same match code are then grouped together into the same cluster. The list of match definitions depends on the Quality Knowledge Base (QKB) that is used. Talend OS offers two different algorithms to define the record merge strategy: simple VSR Matcher (default) or T-Swoosh. We refer to the documentation (Talend, <xref ref-type="bibr" rid="B88">2017</xref>) for details.</p>
<p>DataCleaner implements an ML-based approach that distinguishes between two modes: untrained detection (considered experimental) and a training mode plus duplicate detection using the trained ML model (Quadient, <xref ref-type="bibr" rid="B75">2008</xref>). The training mode is divided into three phases: (1) pre-selection, (2) scoring using a random forest classifier and the distance functions mentioned above, and (3) the outcome, which highlights duplicate pairs with a probability between 0 and 1.</p>
<p>Although duplicate detection is typically attributed to data cleansing (data integration or data matching) and not considered to be part of data profiling in the implementations, most DQ tools allow this functionality to be used also for detection purposes. We rated Datamartist and Aggregate Profiler as supporting this requirement partially since the function is dedicated to direct cleansing (deletion or replacement of records) and because of the very limited configuration options compared to all other tools. Datamartist does not support DP-30 at all.</p>
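<p>The three-step workflow for duplicate detection described above (select columns, optionally transform them, and compare with a distance function) can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the names <monospace>find_duplicates</monospace>, <monospace>transform</monospace>, and <monospace>max_distance</monospace> are hypothetical, the Levenshtein distance is chosen as one example of the distances listed above, and all record pairs are compared naively without any pre-selection (blocking) phase.</p>
<preformat>
```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def find_duplicates(records, columns,
                    transform=lambda s: s.strip().lower(),
                    max_distance=1):
    """Return index pairs of records whose selected columns all lie within max_distance.

    Step 1: `columns` selects the attributes to compare.
    Step 2: `transform` normalizes each value before comparison.
    Step 3: the Levenshtein distance is compared against a threshold.
    """
    pairs = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        if all(max_distance >= levenshtein(transform(r1[c]), transform(r2[c]))
               for c in columns):
            pairs.append((i, j))
    return pairs


# Hypothetical customer records; the first two refer to the same person.
customers = [
    {"name": "Jon Smith", "city": "Linz"},
    {"name": "John Smith", "city": "linz "},
    {"name": "Maria Huber", "city": "Wien"},
]
print(find_duplicates(customers, ["name", "city"]))
```
</preformat>
<p>Comparing all pairs grows quadratically with the number of records, which is one motivation for a separate pre-selection phase such as the one used by DataCleaner.</p>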
</sec>
</sec>
<sec>
<title>5.2.2. Data Quality Measurement Capabilities</title>
<p><xref ref-type="table" rid="T9">Table 9</xref> summarizes the fulfillment of the DQM category, where the first part is dedicated to DQ dimensions, and the second one to business rules.</p>
<table-wrap position="float" id="T9">
<label>Table 9</label>
<caption><p>Data quality measurement capabilities.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left" colspan="2"/>
<th valign="top" align="left"><bold>Aggregate Profiler</bold></th>
<th valign="top" align="left"><bold>Apache Griffin</bold></th>
<th valign="top" align="left"><bold>Ataccama ONE</bold></th>
<th valign="top" align="left"><bold>DataCleaner</bold></th>
<th valign="top" align="left"><bold>Datamartist</bold></th>
<th valign="top" align="left"><bold>Experian Pandora</bold></th>
<th valign="top" align="left"><bold>Informatica DQ</bold></th>
<th valign="top" align="left"><bold>InfoZoom &#x00026; IZDQ</bold></th>
<th valign="top" align="left"><bold>MobyDQ</bold></th>
<th valign="top" align="left"><bold>OpenRefine &#x00026; MetricDoc</bold></th>
<th valign="top" align="left"><bold>Oracle EDQ</bold></th>
<th valign="top" align="left"><bold>SAS Data Quality</bold></th>
<th valign="top" align="left"><bold>Talend Open Studio</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">31</td>
<td valign="top" align="left">Accuracy metric</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">32</td>
<td valign="top" align="left">Completeness metric</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">33</td>
<td valign="top" align="left">Consistency metric</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">34</td>
<td valign="top" align="left">Timeliness metric</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">35</td>
<td valign="top" align="left">Other DQ metrics</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left"><italic>p</italic></td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">36</td>
<td valign="top" align="left">Creation of business rules</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">37</td>
<td valign="top" align="left">General-applicable rules</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">38</td>
<td valign="top" align="left">Application of business rules</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec>
<title>5.2.2.1. Accuracy</title>
<p>An accuracy metric (on table-level) is only provided by Apache Griffin, where the user needs to select a source and a target table and accuracy is calculated according to <inline-formula><mml:math id="M18"><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>r</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac><mml:mo>*</mml:mo><mml:mn>100</mml:mn><mml:mi>%</mml:mi></mml:math></inline-formula>, where |<italic>r</italic>| is the total number of records in the source table, and |<italic>r</italic><sub><italic>a</italic></sub>| the number of (accurate) records in the target table that can be directly matched to a record in the source table (Apache Foundation, <xref ref-type="bibr" rid="B4">2019</xref>). This metric corresponds to the accuracy metric proposed by Redman (<xref ref-type="bibr" rid="B77">2005</xref>), which is outlined in Equation (2).</p>
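<p>The table-level accuracy formula above can be illustrated with a few lines of Python. The sketch below is not Apache Griffin code; it assumes that records are hashable tuples, that &#x0201C;directly matched&#x0201D; means exact equality on all attributes, and that the hypothetical function name <monospace>table_accuracy</monospace> stands for the calculation of <italic>A</italic><sub><italic>tab</italic></sub>.</p>
<preformat>
```python
def table_accuracy(source, target):
    """A_tab = |r_a| / |r| * 100, with |r| the number of source records.

    Assumptions: records are hashable tuples and a target record is
    "accurate" if an identical record exists in the source table.
    """
    if not source:
        raise ValueError("source table must not be empty")
    source_set = set(source)
    matched = sum(1 for record in target if record in source_set)
    return matched / len(source) * 100


# Hypothetical tables: two of the four source records reappear unchanged.
source = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
target = [(1, "a"), (2, "b"), (3, "x")]
print(table_accuracy(source, target))
```
</preformat>
<p>With these example tables, two of the four source records are matched, so the function returns 50.0.</p>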
</sec>
<sec>
<title>5.2.2.2. Completeness</title>
<p>The metric for the completeness on attribute-level (<inline-formula><mml:math id="M19"><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>v</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></inline-formula>) introduced in Equation (8) is closely related to DP-3, the percentage of <monospace>null</monospace> values, which yields the missingness <inline-formula><mml:math id="M20"><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>v</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, where |<italic>v</italic><sub><italic>n</italic></sub>| is the number of <monospace>null</monospace> values in one column. Ataccama, DataCleaner, Datamartist, Experian, and InfoZoom provide the completeness calculation on attribute-level according to Equation (8), without the possibility for aggregation on higher levels. Informatica DQ allows an aggregation on table-level (as the arithmetic mean of all attribute-level completeness values), but not higher. 
Note that this aggregated metric differs from the table-level completeness proposed by Hinrichs (<xref ref-type="bibr" rid="B40">2002</xref>) and described in Equation (7), which calculates the mean of all completeness values at the record-level. Although MetricDoc offers a metric that is denoted &#x0201C;completeness&#x0201D; in the GUI and is also described in its scientific documentation (Bors et al., <xref ref-type="bibr" rid="B12">2018</xref>), it calculates the missingness <italic>M</italic><sub><italic>att</italic></sub> on attribute-level and thus does not fulfill requirement DQM-32. MobyDQ computes the completeness between two data sources (source and target) according to <inline-formula><mml:math id="M21"><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:math></inline-formula>, where <italic>C</italic><sub><italic>t</italic></sub> is the completeness measure of the target data source and <italic>C</italic><sub><italic>s</italic></sub> the measure of the source. If <italic>C</italic><sub><italic>s</italic></sub> is considered to be the reference dataset, this metric corresponds to the completeness calculation proposed by Batini and Scannapieco (<xref ref-type="bibr" rid="B11">2016</xref>), which is discussed in Section 2.2.2.</p>
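<p>The relationship between the completeness variants discussed above can be made explicit with a short Python sketch. It is a hedged illustration only: all function names are hypothetical, <monospace>None</monospace> stands in for <monospace>null</monospace>, a table is assumed to be a dictionary of equally long columns, and the source-target comparison simply assumes that both measures are already available as numbers (e.g., record counts).</p>
<preformat>
```python
def attribute_completeness(column):
    """C_att = |v_c| / |v|: share of non-null values in one column."""
    return sum(v is not None for v in column) / len(column)


def missingness(column):
    """M_att = |v_n| / |v| = 1 - C_att: share of null values in one column."""
    return 1 - attribute_completeness(column)


def table_completeness_mean(table):
    """Table-level aggregate as the arithmetic mean of all attribute-level values."""
    columns = list(table.values())
    return sum(attribute_completeness(c) for c in columns) / len(columns)


def source_target_difference(c_source, c_target):
    """Relative difference (C_t - C_s) / C_s between two completeness measures."""
    return (c_target - c_source) / c_source


# Hypothetical table with one fully populated and one half-empty column.
table = {
    "id": [1, 2, 3, 4],
    "email": ["a@example.org", None, None, "d@example.org"],
}
print(attribute_completeness(table["email"]))
print(missingness(table["email"]))
print(table_completeness_mean(table))
print(source_target_difference(c_source=100, c_target=90))
```
</preformat>
<p>Note that the relative difference is zero when source and target agree and negative when the target holds less than the source, so it is not a completeness value in the interval [0,1] without further transformation.</p>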
</sec>
<sec>
<title>5.2.2.3. Consistency</title>
<p>The consistency dimension is mentioned in the Informatica DQ methodology (Informatica, <xref ref-type="bibr" rid="B42">2010</xref>), and SAS implements no single metric, but a set of rules that are grouped under this dimension, e.g., checks whether an attribute contains numbers, contains non-numbers, is alphabetic, or is all lower case. We did not rate this as fulfilled because no aggregate metric was provided to calculate &#x0201C;consistency&#x0201D; and because most DQ tools supply these rules as part of their generally applicable business rules (DQM-37). However, we would like to point out that the understanding of Informatica and SAS corresponds to the consistency metrics proposed in research (cf. Section 2.2.3), since both approaches are rule-based. Instead of providing a predefined metric, Informatica and SAS assume that the user creates the metrics manually.</p>
</sec>
<sec>
<title>5.2.2.4. Timeliness</title>
<p>We did not find an implementation for the timeliness dimension as discussed in Section 2.2.4. However, with respect to other time-related dimensions, MetricDoc offers two different <italic>time interval metrics</italic>, where one checks if the interval between two timestamps &#x0201C;is smaller than, larger than, or equal to a given duration value&#x0201D; (Bors et al., <xref ref-type="bibr" rid="B12">2018</xref>), and the second one performs outlier detection on interval length. MobyDQ offers metrics for <italic>freshness</italic> and <italic>latency</italic>, but refers to those dimensions as DQ &#x0201C;indicators&#x0201D; (Rolland, <xref ref-type="bibr" rid="B78">2019</xref>). Freshness is implemented as <italic>ts</italic><sub><italic>cur</italic></sub> &#x02212; <italic>ts</italic><sub><italic>t</italic></sub>, where <italic>ts</italic><sub><italic>cur</italic></sub> is the current timestamp and <italic>ts</italic><sub><italic>t</italic></sub> the last updated timestamp from the target request, and latency as <italic>ts</italic><sub><italic>s</italic></sub> &#x02212; <italic>ts</italic><sub><italic>t</italic></sub>, where <italic>ts</italic><sub><italic>s</italic></sub> is the last updated timestamp from the source request (Rolland, <xref ref-type="bibr" rid="B78">2019</xref>). These indicators are not specifically dedicated to DQ dimensions and do not fulfill the requirement, stated by Heinrich et al. (<xref ref-type="bibr" rid="B38">2018</xref>), that DQ metrics be normalized to the interval [0,1].</p>
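<p>The time-related indicators described above reduce to simple timestamp arithmetic, as the following Python sketch shows. It is not taken from MobyDQ or MetricDoc: the function names are hypothetical, timestamps are assumed to be <monospace>datetime</monospace> objects in the same time zone, and the interval check covers only the &#x0201C;not larger than&#x0201D; case.</p>
<preformat>
```python
from datetime import datetime, timedelta


def freshness(ts_target, ts_current):
    """Freshness as ts_cur - ts_t (time since the target was last updated)."""
    return ts_current - ts_target


def latency(ts_source, ts_target):
    """Latency as ts_s - ts_t (offset between source and target updates)."""
    return ts_source - ts_target


def interval_within(start, end, max_duration):
    """Time-interval check: is the interval end - start no larger than max_duration?"""
    return max_duration >= end - start


# Hypothetical timestamps.
ts_s = datetime(2022, 1, 1, 12, 0)
ts_t = datetime(2022, 1, 1, 11, 30)
now = datetime(2022, 1, 1, 13, 0)

print(freshness(ts_t, now))
print(latency(ts_s, ts_t))
print(interval_within(ts_t, ts_s, timedelta(hours=1)))
```
</preformat>
<p>Both indicators yield durations rather than ratios, which illustrates why they are not normalized to the interval [0,1] without an additional, application-specific scaling step.</p>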
</sec>
<sec>
<title>5.2.2.5. Other DQ Metrics</title>
<p>With respect to other non-time-related DQ metrics (DQM-35), the <italic>uniqueness</italic> dimension is most often implemented according to <inline-formula><mml:math id="M22"><mml:msub><mml:mrow><mml:mi>U</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>v</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></inline-formula>, where |<italic>v</italic><sub><italic>u</italic></sub>| refers to the number of unique values within a column. DataCleaner, Datamartist, Experian, SAS Data Quality, and Talend OS implement uniqueness on attribute-level only, which corresponds to the requirement DP-5. Informatica DQ allows an aggregation on table-level but not higher. MetricDoc implements a dimension referred to as &#x0201C;uniqueness,&#x0201D; but actually calculates the <italic>redundancy</italic> <inline-formula><mml:math id="M23"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>r</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></inline-formula> on table-level, where |<italic>r</italic><sub><italic>black</italic></sub>| is the number of records with at least one duplicate entry in the table. The user needs to select more than one attribute within a table in order to calculate the metric, and it cannot be aggregated on higher levels. 
In addition, MetricDoc offers metrics for the DQ dimensions <italic>validity</italic> and <italic>plausibility</italic>. Validity is calculated on attribute-level as <inline-formula><mml:math id="M24"><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>v</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></inline-formula>, where |<italic>v</italic><sub><italic>i</italic></sub>| is the number of attribute values that do not comply with the column data type. Plausibility is also calculated on attribute-level as <inline-formula><mml:math id="M25"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>v</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></inline-formula>, where |<italic>v</italic><sub><italic>j</italic></sub>| is the number of attribute values that are outliers according to a nonrobust or robust statistical measure (mean with standard deviation, or median with interquartile range estimator, respectively) (Bors et al., <xref ref-type="bibr" rid="B12">2018</xref>). MobyDQ also provides a <italic>validity</italic> indicator, which connects to one single target data source and compares the values with defined thresholds (Rolland, <xref ref-type="bibr" rid="B78">2019</xref>). SAS does not provide predefined metrics, but uses DQ dimensions as an abstraction layer to group business rules.</p>
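<p>The ratio-based metrics described above share the same structure (a count of qualifying values divided by the total number of values) and can be sketched in Python as follows. The sketch rests on several assumptions of ours and is not code from any surveyed tool: &#x0201C;unique values&#x0201D; is interpreted as distinct values, records are hashable tuples, type compliance is checked with <monospace>isinstance</monospace>, and outliers are detected with the nonrobust variant (mean with standard deviation) using a hypothetical factor <monospace>k</monospace>.</p>
<preformat>
```python
from collections import Counter
from statistics import mean, pstdev


def uniqueness(column):
    """U_att = |v_u| / |v|, interpreting |v_u| as the number of distinct values."""
    return len(set(column)) / len(column)


def redundancy(records):
    """R_tab = |r_black| / |r|: share of records that have at least one duplicate."""
    counts = Counter(records)
    return sum(n for n in counts.values() if n > 1) / len(records)


def type_violation_ratio(column, expected_type):
    """V_att = |v_i| / |v|: share of values that do not comply with the data type."""
    return sum(not isinstance(v, expected_type) for v in column) / len(column)


def outlier_ratio(column, k=2.0):
    """P_att = |v_j| / |v|: share of values more than k standard deviations from the mean."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return 0.0
    return sum(abs(v - mu) > k * sigma for v in column) / len(column)


# Hypothetical example data.
print(uniqueness(["a", "b", "a", "c"]))
print(redundancy([(1, "x"), (1, "x"), (2, "y"), (3, "z")]))
print(type_violation_ratio([1, 2, "three", 4], int))
print(outlier_ratio([10] * 9 + [100]))
```
</preformat>
<p>Note that, as defined in the text, the redundancy, validity, and plausibility ratios count the problematic values, so a higher result indicates lower quality, in contrast to the uniqueness ratio.</p>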
<p>A number of additional DQ dimensions are mentioned in the documentation or on the websites of DQ tool vendors without being implemented as metrics. In order to provide a structured overview, these additionally mentioned DQ dimensions are summarized together with all &#x0201C;other&#x0201D; DQ dimensions and their implementations (explained in this section) in the following list:</p>
<list list-type="bullet">
<list-item><p><italic>Conformance</italic>: mentioned by Informatica (Loshin, <xref ref-type="bibr" rid="B58">2006</xref>).</p></list-item>
<list-item><p><italic>Conformity</italic>: mentioned by Informatica (<xref ref-type="bibr" rid="B42">2010</xref>).</p></list-item>
<list-item><p><italic>Correctness</italic>: mentioned in DataCleaner documentation (Quadient, <xref ref-type="bibr" rid="B75">2008</xref>).</p></list-item>
<list-item><p><italic>Currency</italic>: mentioned by Loshin (<xref ref-type="bibr" rid="B58">2006</xref>).</p></list-item>
<list-item><p><italic>Duplicates</italic>: mentioned by Informatica (Informatica, <xref ref-type="bibr" rid="B42">2010</xref>).</p></list-item>
<list-item><p><italic>Duplication</italic>: mentioned in DataCleaner documentation (Quadient, <xref ref-type="bibr" rid="B75">2008</xref>).</p></list-item>
<list-item><p><italic>Freshness</italic>: implemented by MobyDQ as <italic>ts</italic><sub><italic>cur</italic></sub> &#x02212; <italic>ts</italic><sub><italic>t</italic></sub> (Rolland, <xref ref-type="bibr" rid="B78">2019</xref>).</p></list-item>
<list-item><p><italic>Integrity</italic>: mentioned in Informatica (<xref ref-type="bibr" rid="B42">2010</xref>), Talend (<xref ref-type="bibr" rid="B88">2017</xref>), and SAS (<xref ref-type="bibr" rid="B79">2019</xref>).</p></list-item>
<list-item><p><italic>Latency</italic>: implemented by MobyDQ as <italic>ts</italic><sub><italic>s</italic></sub> &#x02212; <italic>ts</italic><sub><italic>t</italic></sub> (Rolland, <xref ref-type="bibr" rid="B78">2019</xref>).</p></list-item>
<list-item><p><italic>Plausibility</italic>: implemented by MetricDoc for OpenRefine as <inline-formula><mml:math id="M26"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>v</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></inline-formula> (Bors et al., <xref ref-type="bibr" rid="B12">2018</xref>).</p></list-item>
<list-item><p><italic>Referential Integrity</italic>: mentioned by Informatica (Loshin, <xref ref-type="bibr" rid="B58">2006</xref>).</p></list-item>
<list-item><p><italic>Structure</italic>: mentioned by SAS (<xref ref-type="bibr" rid="B79">2019</xref>).</p></list-item>
<list-item><p><italic>Uniformedness</italic>: mentioned in DataCleaner documentation (Quadient, <xref ref-type="bibr" rid="B75">2008</xref>).</p></list-item>
<list-item><p><italic>Uniqueness</italic>: implemented by DataCleaner, Datamartist, Experian, SAS Data Quality, and Talend OS on attribute-level as <inline-formula><mml:math id="M27"><mml:msub><mml:mrow><mml:mi>U</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>v</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></inline-formula> and by Informatica also on table-level. Implemented by MetricDoc for OpenRefine as <inline-formula><mml:math id="M28"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>r</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></inline-formula>. Also mentioned in Loshin (<xref ref-type="bibr" rid="B58">2006</xref>); Datamartist (<xref ref-type="bibr" rid="B21">2017</xref>); SAS (<xref ref-type="bibr" rid="B79">2019</xref>); Experian (<xref ref-type="bibr" rid="B29">2020</xref>).</p></list-item>
<list-item><p><italic>Validity</italic>: implemented by MetricDoc for OpenRefine as <inline-formula><mml:math id="M29"><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>v</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></inline-formula> (Bors et al., <xref ref-type="bibr" rid="B12">2018</xref>). Also mentioned in SAS (<xref ref-type="bibr" rid="B79">2019</xref>) and Experian (<xref ref-type="bibr" rid="B29">2020</xref>).</p></list-item>
</list>
</sec>
<sec>
<title>5.2.2.6. Business Rules</title>
<p>While the creation and application of business rules (DQM-36 and DQM-38) is supported by most DQ tools, few tools also offer predefined generally applicable business rules. Widely supported examples are rules for address validation (e.g., zip codes, cities, states) that tackle the prevalent problem of failed mail deliveries due to incorrect addresses, also described by Apel et al. (<xref ref-type="bibr" rid="B5">2015</xref>). Despite the good performance of DataCleaner in terms of data profiling and CDQM, it does not support business rules at all. We rated DQM-37 (the availability of generally applicable rules) as only partly fulfilled for InfoZoom, because the provided rules have been created for a given demo DB schema and would need to be modified to apply to other schemas (e.g., with other column names).</p>
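<p>To illustrate the distinction between creating a business rule and applying it, the following Python sketch defines one rule as a predicate and a generic function that applies any rule to a set of records. It is a minimal example under our own assumptions: the five-digit zip code format, the column name <monospace>zip</monospace>, and the function names <monospace>zip_rule</monospace> and <monospace>apply_rule</monospace> are hypothetical and do not reflect the rule language of any surveyed tool.</p>
<preformat>
```python
import re

# Assumption: a valid zip code consists of exactly five digits.
ZIP_PATTERN = re.compile(r"\d{5}")


def zip_rule(record):
    """Business rule: the 'zip' attribute must be present and match the pattern."""
    return bool(ZIP_PATTERN.fullmatch(str(record.get("zip", ""))))


def apply_rule(records, rule):
    """Apply a rule to all records and return the indices of the violating ones."""
    return [i for i, record in enumerate(records) if not rule(record)]


# Hypothetical address records.
addresses = [
    {"zip": "12345", "city": "Springfield"},
    {"zip": "1234A", "city": "Shelbyville"},
    {"city": "Capital City"},
]
print(apply_rule(addresses, zip_rule))
```
</preformat>
<p>Because the rule refers to a specific column name, it would need to be adapted for other schemas, which mirrors the limitation noted above for rules that are tied to a demo DB schema.</p>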
</sec>
</sec>
<sec>
<title>5.2.3. Data Quality Monitoring Capabilities</title>
<p>The results of the CDQM evaluation are shown in <xref ref-type="table" rid="T10">Table 10</xref>. We want to point out that for two DQ tools (Ataccama ONE and Talend OS), more advanced versions are available that support CDQM according to their vendors, but we did not investigate them.</p>
<table-wrap position="float" id="T10">
<label>Table 10</label>
<caption><p>Data quality monitoring capabilities.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left" colspan="2"/>
<th valign="top" align="left"><bold>Aggregate Profiler</bold></th>
<th valign="top" align="left"><bold>Apache Griffin</bold></th>
<th valign="top" align="left"><bold>Ataccama ONE</bold></th>
<th valign="top" align="left"><bold>DataCleaner</bold></th>
<th valign="top" align="left"><bold>Datamartist</bold></th>
<th valign="top" align="left"><bold>Experian Pandora</bold></th>
<th valign="top" align="left"><bold>Informatica DQ</bold></th>
<th valign="top" align="left"><bold>InfoZoom &#x00026; IZDQ</bold></th>
<th valign="top" align="left"><bold>MobyDQ</bold></th>
<th valign="top" align="left"><bold>OpenRefine &#x00026; MetricDoc</bold></th>
<th valign="top" align="left"><bold>Oracle EDQ</bold></th>
<th valign="top" align="left"><bold>SAS Data Quality</bold></th>
<th valign="top" align="left"><bold>Talend Open Studio</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">39</td>
<td valign="top" align="left">Task scheduling</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">40</td>
<td valign="top" align="left">Storage of results</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">41</td>
<td valign="top" align="left">Retrieval of results</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
</tr>
<tr>
<td valign="top" align="left">42</td>
<td valign="top" align="left">Comparison</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
</tr>
<tr>
<td valign="top" align="left">43</td>
<td valign="top" align="left">Visualization over time</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left">&#x02212;</td>
<td valign="top" align="left"><italic>p</italic></td>
<td valign="top" align="left">&#x02212;</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The storage (CDQM-40) of DP or DQM results is possible in all tools. The majority of DQ tools (except MobyDQ and OpenRefine) also support data export <italic>via</italic> a GUI. Datamartist allows only very basic data profiles to be exported. The most comprehensive enterprise solutions for CDQM-40 and 41 are provided by Informatica DQ and SAS Data Quality, which enable the export of full DP procedures. During import, all settings and required data sources are reloaded as they were at the time of the analysis.</p>
<p>Task scheduling (CDQM-39) is also widely supported. Aggregate Profiler fulfills this requirement only partially, since only business rules can be scheduled, but no other kinds of tasks (e.g., data profiling tasks). With Datamartist, InfoZoom, and SAS Data Quality, task scheduling is cumbersome for business users, since it requires writing batch files for the command line. With Datamartist, the tool additionally needs to be closed before the batch file can be executed.</p>
<p>To visualize the continuously performed DQ checks (be it DP tasks, user-defined rules, or DQ metrics), Informatica DQ relies on so-called &#x0201C;scorecards,&#x0201D; which can be customized to display the respective information. Apache Griffin, Experian Pandora, and SAS Data Quality also allow alerts to be defined that are raised when specific errors occur or when a defined rule is violated. MobyDQ does not offer any visualization (which is considered future work), but relies on external libraries in its implementation at Ubisoft. SAS fulfills both requirements CDQM-42 and 43 only partially, since its &#x0201C;dashboards&#x0201D; contain solely the number or percentage of triggers per date, source or user, but no specific values (e.g., 80 % completeness) can be plotted. Among the general-purpose DQ tools, Informatica DQ and DataCleaner by Human Inference provide the most comprehensive solutions for CDQM. Among the open-source tools, only Apache Griffin and the commercial version of MobyDQ, which is deployed at Ubisoft, provide comprehensive CDQM support.</p>
</sec>
</sec>
</sec>
<sec id="s6">
<title>6. Survey Discussion and Lessons Learned</title>
<p>The results of our survey on DQ measurement and monitoring tools revealed interesting characteristics of DQ tools and allow conclusions to be drawn about the future direction of automated and continuous data quality measurement. While the following paragraphs provide a general overview of the market of DQ tools, which was a side result of this survey, each sub-question and the overall research question of this survey are discussed separately per subsection.</p>
<p>One of the greatest challenges we faced during the conduct of this survey was the constant change and development of the DQ tools, especially the open-source tools. Nevertheless, it is of great value to reflect on the current state of the market for two reasons: (1) to create a uniform vision for the future of DQ research, and (2) to identify the potential for functional enhancement across the tools.</p>
<p>The fact that we found 667 tools attributed to &#x0201C;data quality&#x0201D; in our systematic search indicates the growing awareness of the topic. However, approximately half (50.82 %) of the DQ tools that we found were domain specific, which means they were either dedicated to specific types of data or built to measure the DQ of a proprietary tool. This amount underlines the &#x0201C;fitness for use&#x0201D; principle of DQ, which states that the quality of data is dictated by the user and type of usage. 40 % of the DQ tools were dedicated to a specific data management task, for example, data cleansing, data integration, or data visualization, which reflects the complexity of the topic &#x0201C;data quality.&#x0201D; Although those tasks are often not clearly distinguished in practice, we required explicit DQ measurement, that is, making statements about the DQ without modifying the observed data.</p>
<p>Our selection of DQ tools provides a good digest of the market, since we included eight commercial and closed-source tools as well as five free and open-source tools, of which four (all except Talend) are not mentioned by Gartner (cf. Chien and Jain, <xref ref-type="bibr" rid="B15">2019</xref>). The vendors of four tools have been named &#x0201C;leader&#x0201D; in the Magic Quadrant of Data Quality Tools 2019 (Informatica, SAS, Talend, Oracle) and two of them are among the four vendors currently controlling the market [which are SAP, Informatica, Experian, and Syncsort (Chien and Jain, <xref ref-type="bibr" rid="B15">2019</xref>); however, no trial was granted for either SAP or Syncsort].</p>
<p>Overall and according to our requirements, we experienced Informatica DQ as the most mature DQ tool. The best support for data profiling is provided by Experian Pandora, which allows profiling across an entire DB and even across multiple connected data sources. All other tools allow data profiling only for selected columns or within specific tables. Although their vendors are classified as leaders by Gartner, we perceived Oracle EDQ, Talend OS, and SAS Data Quality as having less support for data profiling and/or DQ monitoring. Although Quadient (with DataCleaner) was removed from the Gartner study in 2019 due to their focus on customer data, our evaluation showed good support for data profiling and strong support for DQ monitoring. When comparing the two general-purpose and freely available DQ tools Talend OS and Aggregate Profiler, the former stood out for its intuitive user interface and good overall performance. Aggregate Profiler, on the other hand, has richer support for advanced multi-column profiling and data cleansing, but it is not always clear which algorithms are used to perform data modifications, and the documentation is not up to date.</p>
<p>Three open-source tools (Apache Griffin, MobyDQ, and OpenRefine) were installed from GitHub and thus required technical knowledge for the setup. While OpenRefine cannot keep up with comparable tools like Talend OS or Aggregate Profiler in terms of data profiling, MobyDQ and Apache Griffin clearly have a different focus, namely CDQM. IBM ISDQ demonstrated that commercial tools can also be very arduous and time-consuming to install due to the increasing complexity of the single modules and the dependencies between them.</p>
<sec>
<title>6.1. Data Profiling Capabilities in Current DQ Tools</title>
<p>In order to answer the first sub-research-question &#x0201C;<italic>Which data profiling capabilities are supported by current DQ tools?</italic>&#x0201D; we compiled 30 requirements that are mainly based on the classification of data profiling by Abedjan et al. (<xref ref-type="bibr" rid="B2">2019</xref>).</p>
<p>In summary, 11 of the 13 tools examined (all except Apache Griffin and MobyDQ) supported data profiling at least partially. The details on the data profiling capabilities per DQ tool are discussed in Section 5.2.1. Our evaluation revealed that single-column data profiling features in particular, such as cardinalities (DP 1&#x02013;5), were supported by all 11 tools. However, considering the state of the art in research, there is potential for functional enhancement with respect to multi-column profiling (DP 25&#x02013;30) and dependency discovery (DP 19&#x02013;24). For example, dependency discovery is supported in a comprehensive way by only 2 tools. While in the group of multi-column profiling, exact and approximate duplicate detection is a very common feature (supported at least partially by 10 tools in total), correlation analysis is supported completely by only one tool (Aggregate Profiler) and partially by a second tool (Talend Open Studio). Association-rule mining is not supported by any tool at all, and none of the tools observed fully supports clustering. According to our customer contacts and reference customers, those functionalities are not considered to be part of data profiling and are usually implemented in analytics tools (e.g., SAS Enterprise Guide). This might be a reason why most DQ tools in our evaluation lack a wide range of features in this category: customers and vendors simply do not consider them part of data profiling and data quality. This observation can be explained by the unclear distinction between the terms &#x0201C;data profiling&#x0201D; and &#x0201C;data mining.&#x0201D; Abedjan et al. (<xref ref-type="bibr" rid="B2">2019</xref>) distinguish the two topics by the object of analysis (focus on <italic>columns</italic> in data profiling vs. <italic>rows</italic> in data mining) and by the goal of the task (gathering technical metadata by data profiling vs. 
gathering new insights by data mining). While this distinction is still fuzzy, we go one step further and claim that there is also no clear distinction between data mining and data analytics with respect to the techniques used [e.g., regression analysis is discussed in both topics (Dasu and Johnson, <xref ref-type="bibr" rid="B20">2003</xref>)].</p>
<p>In recent years, numerous research initiatives concerning data profiling have been carried out that also use ML-based methods. Current general-purpose DQ tools do not take full advantage of these methods. Although several vendors claim to implement ML-based methods, we found no or only limited documentation of concrete algorithms (cf. Quadient, <xref ref-type="bibr" rid="B75">2008</xref>). Note that in the case of DataCleaner for duplicate detection, we received more detailed documentation upon request. We think that, especially given the hype around artificial intelligence and the enhancement of DQ error detection with ML methods, it is necessary to focus on the desirable core characteristics for DQ and data mining (Dasu and Johnson, <xref ref-type="bibr" rid="B20">2003</xref>): the methods should be widely applicable, easy to use, interpret, store and deploy, and should have short response times. Neural networks are a counterexample: they are increasingly applied in recent research initiatives, but need to be handled with care for DQ measurement, because they are black boxes and hard to interpret. For measuring the quality of data (to ensure reliable and trustworthy data analysis), easy and clearly interpretable statistics and algorithms are required to prevent a user from drawing wrong conclusions from the results.</p>
<p>Apart from functional enhancements, we want to point out the desire for more automation in data profiling. Current DQ tools allow users to select data profiling features or to define rules, which are then applied to single attributes or tables. This does not meet today&#x00027;s requirements to master big data problems, where typically, multiple information systems need to be monitored at the same time (Stonebraker and Ilyas, <xref ref-type="bibr" rid="B87">2018</xref>). To ease the high initialization effort for large information system infrastructures, more automated, initial and still meaningful out-of-the-box profiling would be required.</p>
</sec>
<sec>
<title>6.2. Data Quality Measurement Capabilities</title>
<p>The second sub-research-question &#x0201C;<italic>Which data quality dimensions and metrics can be measured with current DQ tools?</italic>&#x0201D; was inspired by the number of DQ dimensions and metrics proposed by researchers (cf. Piro, <xref ref-type="bibr" rid="B71">2014</xref>; Batini and Scannapieco, <xref ref-type="bibr" rid="B11">2016</xref>; Heinrich et al., <xref ref-type="bibr" rid="B38">2018</xref> and the detailed outline in Section 2.2). In our survey, we did not find a tool that implements a wider range of DQ metrics for the most important DQ dimensions as proposed in research papers, and we also did not find another survey that investigates the existence of DQ metrics in tools. The DQ metric implementations we identified have several drawbacks: some are only applicable at the attribute level (e.g., no aggregation possibility), some require a gold standard that might not exist, and some have implementation errors.</p>
<sec>
<title>6.2.1. DQ Metrics for Accuracy and the Problem of Gold Standards</title>
<p>The two open-source tools that implement metrics for the DQ dimensions accuracy (Apache Griffin) and completeness between two tables (MobyDQ) relied on a reference data set (i.e., gold standard) provided by the user. Apache Griffin bases its metric on the definition by DAMA UK, who state that accuracy is &#x0201C;the degree to which data correctly describes the &#x02018;real world&#x00027; object or event being described&#x0201D; (Askham et al., <xref ref-type="bibr" rid="B7">2013</xref>); the data representing this &#x0201C;real world&#x0201D; reference needs to be selected for the calculation. MobyDQ specifically aims at automating DQ checks in data pipelines, that is, computing the difference between a source and a target data source, where the gold standard is clearly defined. However, in scenarios where the quality of a single data source should be assessed, such metrics are not suitable since a reference or gold standard is often not available (Ehrlinger et al., <xref ref-type="bibr" rid="B25">2018</xref>). This fact is also reflected by the limited prevalence of such gold-standard-dependent DQ metrics in commercial and general-purpose DQ tools.</p>
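<p>To make the dependence on a gold standard concrete, the following minimal Python sketch computes a reference-based accuracy ratio. It is an illustration under simplifying assumptions and not the implementation used by Apache Griffin or MobyDQ: the function name accuracy_against_reference, the key-based record matching, and the sample records are all hypothetical.</p>
<preformat>
```python
def accuracy_against_reference(records, reference, key):
    """Share of records that are identical to the reference record with the same key.

    Hypothetical helper for illustration only. Records without a
    counterpart in the reference count as inaccurate.
    """
    if not records:
        raise ValueError("cannot compute accuracy for an empty data set")
    reference_by_key = {row[key]: row for row in reference}
    matches = sum(1 for row in records if reference_by_key.get(row[key]) == row)
    return matches / len(records)


# Observed data source (made-up sample values).
observed = [
    {"id": 1, "city": "Linz"},
    {"id": 2, "city": "Wien"},
    {"id": 3, "city": "Graz"},
    {"id": 4, "city": "Salzburg"},
]
# User-provided reference data set (gold standard); record 4 is missing.
gold_standard = [
    {"id": 1, "city": "Linz"},
    {"id": 2, "city": "Vienna"},
    {"id": 3, "city": "Graz"},
]

accuracy = accuracy_against_reference(observed, gold_standard, "id")
print(accuracy)
```
</preformat>
<p>Without the reference list, the ratio in this sketch cannot be computed at all, which mirrors the limitation described above for scenarios in which only a single data source is available.</p>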
</sec>
<sec>
<title>6.2.2. DQ Metrics for Completeness and Uniqueness</title>
<p>The other investigated tools mainly implement two very basic metrics: completeness (indicating the missing data problem) and uniqueness (indicating duplicate data values or records). It is noteworthy that while completeness is one of the most widely used DQ dimensions (cf. Batini and Scannapieco, <xref ref-type="bibr" rid="B11">2016</xref>; Myers, <xref ref-type="bibr" rid="B64">2017</xref>; Heinrich et al., <xref ref-type="bibr" rid="B38">2018</xref>), the aspect of uniqueness is often neglected in DQ research (Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B24">2019</xref>). For example, Piro (<xref ref-type="bibr" rid="B71">2014</xref>) perceives duplicate detection as a <italic>symptom</italic> of data quality, but not as a DQ dimension. Neither Myers (<xref ref-type="bibr" rid="B64">2017</xref>) in his &#x0201C;List of Conformed Dimensions of Data Quality,&#x0201D; nor the ISO/IEC 25024:2015 standard on DQ (<xref ref-type="bibr" rid="B46">ISO/IEC 25024:2015(E)</xref>, <xref ref-type="bibr" rid="B46">2015</xref>) refers to a DQ dimension that describes the aspect of uniqueness or non-redundancy (Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B24">2019</xref>). Despite this difference, both DQ dimensions have a common characteristic: they can be calculated without necessarily requiring a gold standard. Nevertheless, these implementations lack two aspects: (1) the aggregation of DQ dimensions and (2) schema-level DQ dimensions that are clearly part of the DQ topic (Batini and Scannapieco, <xref ref-type="bibr" rid="B11">2016</xref>). The aggregation of DQ dimensions from value-level to attribute-, record-, table-, DB- or cross-data-source-level as presented by Hinrichs (<xref ref-type="bibr" rid="B40">2002</xref>) and Piro (<xref ref-type="bibr" rid="B71">2014</xref>) was not provided out of the box by any tool. Informatica DQ is the only tool that allows column-level metrics to be aggregated at table level, but not higher. 
We did not count a manual implementation in tools with strong rule support as availability of such aggregation functions.</p>
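<p>As an illustration of why completeness and uniqueness do not need a gold standard, and of what an aggregation from attribute level to table level could look like, consider the following minimal Python sketch. The function names, the treatment of None and empty strings as missing values, the definition of uniqueness as the number of distinct present values divided by the number of present values, and the unweighted mean used for aggregation are assumptions made for this example. They are not taken from any of the surveyed tools, and the aggregation schemes proposed in the literature may differ.</p>
<preformat>
```python
from statistics import mean


def _present(values):
    """Values that are neither None nor an empty string (assumed notion of 'missing')."""
    return [v for v in values if v is not None and v != ""]


def completeness(values):
    """Attribute-level completeness: share of present values."""
    if not values:
        raise ValueError("cannot compute completeness for an empty attribute")
    return len(_present(values)) / len(values)


def uniqueness(values):
    """Attribute-level uniqueness: distinct present values divided by present values."""
    present = _present(values)
    if not present:
        return 1.0  # assumption: an attribute without values contains no duplicates
    return len(set(present)) / len(present)


def table_level(rows, columns, metric):
    """Aggregate an attribute-level metric to table level with an unweighted mean."""
    return mean(metric([row.get(col) for row in rows]) for col in columns)


# Made-up sample table with one missing and one empty e-mail address.
rows = [
    {"name": "Anna", "email": "anna@example.org"},
    {"name": "Ben", "email": None},
    {"name": "Anna", "email": "anna@example.org"},
    {"name": "Cleo", "email": ""},
]
columns = ["name", "email"]

table_completeness = table_level(rows, columns, completeness)
table_uniqueness = table_level(rows, columns, uniqueness)
print(table_completeness, table_uniqueness)
```
</preformat>
<p>Both metrics are derived from the observed data alone. The unweighted mean is only one of several conceivable aggregation choices; a weighted scheme could, for instance, give more influence to attributes that matter more for the intended use.</p>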
</sec>
<sec>
<title>6.2.3. DQ Measurement Methodologies</title>
<p>Despite the lack of prefabricated DQ metrics, most tools refer to a set of DQ dimensions in their user guide or defined methodology. For example, Informatica and SAS rely on whitepapers influenced by David Loshin (cf. Informatica, <xref ref-type="bibr" rid="B42">2010</xref>; SAS, <xref ref-type="bibr" rid="B79">2019</xref>), and Talend promotes the existence of such metrics on its website<xref ref-type="fn" rid="fn0016"><sup>16</sup></xref>. In Section 5.2.2, we showed that the DQ dimensions and metrics referenced by the DQ vendors are far from uniform. Further inquiry on the metrics yielded two different responses by our customer contacts: while some explicitly stated that they do not offer generally applicable DQ metrics, others could not answer the question of how specific metrics are implemented.</p>
<p>In the case of Talend, we asked our customer contact and the Talend Community<xref ref-type="fn" rid="fn0017"><sup>17</sup></xref> where the metrics promoted on the website can be found. Unfortunately, we received no satisfactory answer, only references to the data profiling perspective in TOS and its documentation. This experience underlines the statement by Sebastian-Coleman (<xref ref-type="bibr" rid="B82">2013</xref>) that &#x0201C;people can often not say how to measure completeness or accuracy,&#x0201D; which also leads to different interpretations and implementations.</p>
<p>Other vendors justified the absence of generally applicable DQ metrics with two reasons: such metrics are not feasible in practice, and customers do not request them. Several DQ strategies also indicate that DQ metrics should be created by the user and adjusted to the data (cf. Informatica, <xref ref-type="bibr" rid="B42">2010</xref>; Apache Foundation, <xref ref-type="bibr" rid="B4">2019</xref>; SAS, <xref ref-type="bibr" rid="B79">2019</xref>). This understanding follows the &#x0201C;fitness for use&#x0201D; principle, which highlights the subjectivity of DQ. Piro (<xref ref-type="bibr" rid="B71">2014</xref>) also states that objectively measurable DQ dimensions first require manual configuration by a user. An example of this is Apache Griffin, whose user guide states that &#x0201C;Data scientists/analyst <italic>define</italic> their DQ requirements such as accuracy, completeness, timeliness, and profiling&#x0201D; (Apache Foundation, <xref ref-type="bibr" rid="B4">2019</xref>). Sebastian-Coleman (<xref ref-type="bibr" rid="B82">2013</xref>) points out that it is important to understand the DQ dimensions, but these do not immediately lend themselves to enabling specific measurements. The main foundation of DQ measurement, including the set of DQ dimensions and metrics, was originally proposed in the course of the Total Data Quality Management (TDQM) program of MIT<xref ref-type="fn" rid="fn0018"><sup>18</sup></xref> in the 1980s. Dasu and Johnson (<xref ref-type="bibr" rid="B20">2003</xref>) state that DQ dimensions, as originally proposed by the TDQM, are not practically implementable and it is often not clear what they mean. The results of our survey underline this statement with a scientific foundation, because each DQ tool implements the dimensions differently, and in part far removed from the complex metrics proposed in research (e.g., no aggregation, often no gold standard). 
Apart from completeness and uniqueness on attribute-level, no DQ dimension finds widespread agreement in its implementation and definition in practice. This is especially noteworthy for the frequently mentioned accuracy dimension, which, however, requires a reference data set that is often not available in practice.</p>
</sec>
<sec>
<title>6.2.4. The Meaning of DQ Dimensions and Metrics for DQ Measurement</title>
<p>We conclude that there is a strong need to question the current use of DQ dimensions and metrics. Research efforts to measure DQ dimensions directly with a single, generally applicable DQ metric have little practical relevance and can hardly be found in DQ tools. In practice, DQ dimensions are used to group domain-specific DQ rules (sometimes referred to as metrics) on a higher level. Since researchers and practitioners have failed for decades to create a common understanding of DQ dimensions and their measurement, a complementary and more practice-oriented approach should be developed. Several DQ tools show that DQ measurement is possible without referring to the dimensions at all. Since our focus is the automation of DQ measurement, a practical approach is required that does not depend on DQ dimensions, but focuses on the core aspects (like missing data and duplicate detection) that can actually be measured automatically.</p>
</sec>
</sec>
<sec>
<title>6.3. Data Quality Monitoring</title>
<p>The third sub-research question addresses &#x0201C;<italic>whether DQ tools enable automated monitoring of data quality over time</italic>.&#x0201D; In contrast to Pushkarev et al. (<xref ref-type="bibr" rid="B74">2010</xref>), who did not find any tool that supports DQ monitoring, we identified the existence of this feature, as shown in <xref ref-type="table" rid="T10">Table 10</xref>.</p>
<p>In general-purpose DQ tools (e.g., DataCleaner, Informatica DQ, InfoZoom &#x00026; IZDQ), DQ monitoring is considered a premium feature, which is subject to a fee and only provided in professional versions. This is also the reason why DQ monitoring has not been studied so far in related work that focused on open-source DQ tools (cf. Pushkarev et al., <xref ref-type="bibr" rid="B74">2010</xref>). An exception to this observation is the dedicated open-source DQ monitoring tool Apache Griffin, which supports the automation of DQ metrics, but lacks predefined functions and data profiling capabilities. The remaining open question with respect to DQ monitoring is which aspects of the data should actually be measured (discussed in Section 6.2).</p>
</sec>
</sec>
<sec id="s7">
<title>7. Conclusion and Outlook</title>
<p>In this survey, we conducted a systematic search in which we identified 667 software tools dedicated to the topic &#x0201C;data quality.&#x0201D; With six predefined exclusion criteria, we extracted 17 tools for deeper investigation. We evaluated 13 of the 17 tools with regard to our catalog of 43 requirements divided into the three categories (1) data profiling, (2) DQ measurement, and (3) continuous DQ monitoring. Although the market of DQ tools is continuously changing, this survey gives a comprehensive overview of the state of the art of DQ tools and of how DQ measurement is currently perceived in practice by companies in contrast to DQ research.</p>
<p>So far, there are only a few surveys on DQ tools in general, and in particular no survey that investigated the existence of generic DQ metrics. There is also no survey that identified the existence of DQ monitoring capabilities in tools. We attempted to close this gap with our survey and provide the results regarding the available DQ metrics and DQ monitoring capabilities for the tools analyzed.</p>
<p>While we identified the need for more <italic>automation</italic> in data profiling and DQ measurement (with respect to initialization as well as continuous DQ monitoring), at the same time, a clear <italic>declaration and explanation</italic> of the performed calculations and algorithms is essential. In several tools (e.g., Aggregate Profiler, InfoZoom), plots were generated or outliers were detected without a clear declaration of the threshold or distance function used. In alignment with the requirement for interpretability of data profiling results, we highlight the need for clear declaration of the parameters used.</p>
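<p>What such a declaration of parameters could look like is sketched in the following minimal Python example, which reports the method, the multiplier, and the resulting thresholds together with the detected outliers. The choice of interquartile-range fences, the default multiplier of 1.5, the function name detect_outliers, and the sample values are assumptions made for illustration; the sketch does not reproduce the behavior of any surveyed tool.</p>
<preformat>
```python
from statistics import quantiles


def detect_outliers(values, k=1.5):
    """Flag values outside the interquartile-range fences and declare all parameters used."""
    # Quartiles via linear interpolation, treating the data as the full population.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {
        "method": "interquartile-range fences",
        "k": k,
        "quartile_method": "inclusive",
        "lower_fence": lower,
        "upper_fence": upper,
        "outliers": [v for v in values if lower > v or v > upper],
    }


# Made-up sample values with one obvious outlier.
report = detect_outliers([10, 12, 11, 13, 12, 11, 95])
print(report)
```
</preformat>
<p>Because the fences and the multiplier are part of the result, a user can see why a value was flagged and can rerun the check with a different multiplier if the default does not fit the data.</p>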
<p>In our ongoing and future work, we will introduce a practical DQ methodology that focuses on directly measurable aspects of DQ in contrast to abstract dimensions with no common understanding. We also think that it is worth investigating the potential for automated out-of-the-box data profiling along with a clear declaration of the parameters used, which might be modified after the initial run. Part of our ongoing research is to exploit time-series analytics for further investigation of DQ monitoring results in order to predict trends and sudden changes in the DQ (as suggested in Ehrlinger and W&#x000F6;&#x000DF;, <xref ref-type="bibr" rid="B23">2017</xref>). Since a deep investigation of single DP requirements was out of scope for this survey, it would also be worth further investigating specific implementations and their proper functionality, for example, which aspects yield floating point differences. Last but not least, further investigation of the 339 excluded domain-specific DQ tools with regard to their domains and their scope would be interesting.</p>
<p>The top vendors of DQ tools worldwide have 7,200 (Experian), 5,000 (Informatica), and 2,700 (SAS) customers for their DQ product lines (Chien and Jain, <xref ref-type="bibr" rid="B15">2019</xref>). Compared to the hype around AI and ML, these low numbers indicate considerable catch-up demand for DQ tool applications in general.</p>
</sec>
<sec sec-type="data-availability" id="s8">
<title>Data Availability Statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="s9">
<title>Author Contributions</title>
<p>LE designed and conducted the survey. LE and WW wrote this article. Both authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s10">
<title>Funding</title>
<p>The research reported in this paper has been supported by the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry for Digital and Economic Affairs, and the State of Upper Austria in the frame of the COMET center SCCH.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ack><p>The authors would like to thank all contact persons who provided us with trial licenses and support, in particular, Thomas Bodenm&#x000FC;ller-Dodek and Dagmar Hillmeister-M&#x000FC;ller from Informatica, David Zydron from Experian, Alexis Rolland from Ubisoft, Marc Kliffen from Human Inference, Ingo Lenzen from InfoZoom, Loredana Locci from SAS, and Rudolf Plank from solvistas GmbH. We specifically thank Elisa Rusz for her support with the systematic search and the DQ tool evaluation, as well as Alexander Gindlhumer for his support with Apache Griffin.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abedjan</surname> <given-names>Z.</given-names></name> <name><surname>Golab</surname> <given-names>L.</given-names></name> <name><surname>Naumann</surname> <given-names>F.</given-names></name></person-group> (<year>2015</year>). <article-title>Profiling relational data: a survey</article-title>. <source>VLDB J.</source> <volume>24</volume>, <fpage>557</fpage>&#x02013;<lpage>581</lpage>. <pub-id pub-id-type="doi">10.1007/s00778-015-0389-y</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Abedjan</surname> <given-names>Z.</given-names></name> <name><surname>Golab</surname> <given-names>L.</given-names></name> <name><surname>Naumann</surname> <given-names>F.</given-names></name> <name><surname>Papenbrock</surname> <given-names>T.</given-names></name></person-group> (<year>2019</year>). <article-title>Data profiling</article-title>, in <source>Synthesis Lectures on Data Management</source>, <volume>vol. 10</volume> (<publisher-loc>San Rafael, CA</publisher-loc>: <publisher-name>Morgan &#x00026; Claypool Publishers</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>154</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Aggarwal</surname> <given-names>C. C.</given-names></name></person-group> (<year>2017</year>). <source>Outlier Analysis</source>. <edition>2nd Edn.</edition> <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>.</citation>
</ref>
<ref id="B4">
<citation citation-type="web"><person-group person-group-type="author"><collab>Apache Foundation</collab></person-group> (<year>2019</year>). <source>Apache Griffin User Guide</source>. Technical report, <publisher-name>Apache Foundation</publisher-name>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://github.com/apache/griffin/blob/master/griffin-doc/ui/user-guide.md">https://github.com/apache/griffin/blob/master/griffin-doc/ui/user-guide.md</ext-link> (January 2022).</citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Apel</surname> <given-names>D.</given-names></name> <name><surname>Behme</surname> <given-names>W.</given-names></name> <name><surname>Eberlein</surname> <given-names>R.</given-names></name> <name><surname>Merighi</surname> <given-names>C.</given-names></name></person-group> (<year>2015</year>). <source>Datenqualit&#x000E4;t erfolgreich steuern: Praxisl&#x000F6;sungen f&#x000FC;r Business-Intelligence-Projekte [Successfully Governing Data Quality: Practical Solutions for Business-Intelligence Projects]</source>. Edition TDWI. <publisher-loc>Heidelberg</publisher-loc>: <publisher-name>dpunkt.verlag GmbH</publisher-name>.</citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><collab>Arrah Technology</collab></person-group> (<year>2019</year>). <source>Aggregate Profile User Guide Version 6.1.8</source>. Technical Report.</citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Askham</surname> <given-names>N.</given-names></name> <name><surname>Cook</surname> <given-names>D.</given-names></name> <name><surname>Doyle</surname> <given-names>M.</given-names></name> <name><surname>Fereday</surname> <given-names>H.</given-names></name> <name><surname>Gibson</surname> <given-names>M.</given-names></name> <name><surname>Landbeck</surname> <given-names>U.</given-names></name> <etal/></person-group>. (<year>2013</year>). <source>The six primary dimensions for data quality assessment</source>. Technical Report, <publisher-loc>DAMA United Kingdom</publisher-loc>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ballou</surname> <given-names>D. P.</given-names></name> <name><surname>Pazer</surname> <given-names>H. L.</given-names></name></person-group> (<year>1985</year>). <article-title>Modeling data and process quality in multi-input, multi-output information systems</article-title>. <source>Manag. Sci.</source> <volume>31</volume>, <fpage>150</fpage>&#x02013;<lpage>162</lpage>. <pub-id pub-id-type="doi">10.1287/mnsc.31.2.150</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barateiro</surname> <given-names>J.</given-names></name> <name><surname>Galhardas</surname> <given-names>H.</given-names></name></person-group> (<year>2005</year>). <article-title>A survey of data quality tools</article-title>. <source>Datenbank-Spektrum</source> <volume>14</volume>, <fpage>15</fpage>&#x02013;<lpage>21</lpage>.</citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Batini</surname> <given-names>C.</given-names></name> <name><surname>Cappiello</surname> <given-names>C.</given-names></name> <name><surname>Francalanci</surname> <given-names>C.</given-names></name> <name><surname>Maurino</surname> <given-names>A.</given-names></name></person-group> (<year>2009</year>). <article-title>Methodologies for data quality assessment and improvement</article-title>. <source>ACM Comput. Surveys (CSUR)</source> <volume>41</volume>, <fpage>16:1</fpage>&#x02013;<lpage>16:52</lpage>. <pub-id pub-id-type="doi">10.1145/1541880.1541883</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Batini</surname> <given-names>C.</given-names></name> <name><surname>Scannapieco</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <source>Data and Information Quality: Concepts, Methodologies and Techniques</source>. <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bors</surname> <given-names>C.</given-names></name> <name><surname>Gschwandtner</surname> <given-names>T.</given-names></name> <name><surname>Kriglstein</surname> <given-names>S.</given-names></name> <name><surname>Miksch</surname> <given-names>S.</given-names></name> <name><surname>Pohl</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>Visual interactive creation, customization, and analysis of data quality metrics</article-title>. <source>J. Data Inf. Qual.</source> <volume>10</volume>, <fpage>3:1</fpage>&#x02013;<lpage>3:26</lpage>. <pub-id pub-id-type="doi">10.1145/3190578</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bronselaer</surname> <given-names>A.</given-names></name> <name><surname>De Mol</surname> <given-names>R.</given-names></name> <name><surname>De Tr&#x000E9;</surname> <given-names>G.</given-names></name></person-group> (<year>2018</year>). <article-title>A measure-theoretic foundation for data quality</article-title>. <source>IEEE Trans. Fuzzy Syst.</source> <volume>26</volume>, <fpage>627</fpage>&#x02013;<lpage>639</lpage>. <pub-id pub-id-type="doi">10.1109/TFUZZ.2017.2686807</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>M.</given-names></name> <name><surname>Song</surname> <given-names>M.</given-names></name> <name><surname>Han</surname> <given-names>J.</given-names></name> <name><surname>Haihong</surname> <given-names>E.</given-names></name></person-group> (<year>2012</year>). <article-title>Survey on data quality</article-title>, in <source>2012 World Congress on Information and Communication Technologies (WICT)</source> (<publisher-loc>Trivandrum</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1009</fpage>&#x02013;<lpage>1013</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chien</surname> <given-names>M.</given-names></name> <name><surname>Jain</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <source>Magic Quadrant for Data Quality Tools</source>. Technical Report, <publisher-name>Gartner, Inc.</publisher-name></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chrisman</surname> <given-names>N. R.</given-names></name></person-group> (<year>1983</year>). <article-title>The role of quality information in the long-term functioning of a geographic information system</article-title>. <source>Cartographica Int. J. Geograph. Inf. Geovisual.</source> <volume>21</volume>, <fpage>79</fpage>&#x02013;<lpage>88</lpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cichy</surname> <given-names>C.</given-names></name> <name><surname>Rass</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>An overview of data quality frameworks</article-title>. <source>IEEE Access</source> <volume>7</volume>, <fpage>24634</fpage>&#x02013;<lpage>24648</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2899751</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Codd</surname> <given-names>E. F.</given-names></name></person-group> (<year>1970</year>). <article-title>A relational model of data for large shared data banks</article-title>. <source>Commun. ACM</source> <volume>13</volume>, <fpage>377</fpage>&#x02013;<lpage>387</lpage>. <pub-id pub-id-type="pmid">9617087</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dai</surname> <given-names>W.</given-names></name> <name><surname>Wardlaw</surname> <given-names>I.</given-names></name> <name><surname>Cui</surname> <given-names>Y.</given-names></name> <name><surname>Mehdi</surname> <given-names>K.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Long</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Data profiling technology of data governance regarding big data: review and rethinking</article-title>, in <source>Information Technology: New Generations</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>439</fpage>&#x02013;<lpage>450</lpage>.</citation>
</ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dasu</surname> <given-names>T.</given-names></name> <name><surname>Johnson</surname> <given-names>T.</given-names></name></person-group> (<year>2003</year>). <source>Exploratory Data Mining and Data Cleaning</source>, <volume>Vol. 479</volume>. <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>John Wiley &#x00026; Sons</publisher-name>.</citation>
</ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><collab>Datamartist</collab></person-group> (<year>2017</year>). <source>Automating Data Profiling (Pro Only)</source>. Technical Report, <publisher-name>Datamartist</publisher-name>.</citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ehrlinger</surname> <given-names>L.</given-names></name> <name><surname>Haunschmid</surname> <given-names>V.</given-names></name> <name><surname>Palazzini</surname> <given-names>D.</given-names></name> <name><surname>Lettner</surname> <given-names>C.</given-names></name></person-group> (<year>2019</year>). <article-title>A DaQL to monitor the quality of machine data</article-title>, in <source>Proceedings of the International Conference on Database and Expert Systems Applications (DEXA), volume 11706 of Lecture Notes in Computer Science</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>227</fpage>&#x02013;<lpage>237</lpage>.</citation>
</ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ehrlinger</surname> <given-names>L.</given-names></name> <name><surname>W&#x000F6;&#x000DF;</surname> <given-names>W.</given-names></name></person-group> (<year>2017</year>). <article-title>Automated data quality monitoring</article-title>, in <source>Proceedings of the 22nd MIT International Conference on Information Quality (ICIQ 2017)</source>, ed <person-group person-group-type="editor"><name><surname>Talburt</surname> <given-names>J. R.</given-names></name></person-group> (<publisher-loc>Little Rock, AR</publisher-loc>), <fpage>15.1</fpage>&#x02013;<lpage>15.9</lpage>. <pub-id pub-id-type="pmid">31438180</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ehrlinger</surname> <given-names>L.</given-names></name> <name><surname>W&#x000F6;&#x000DF;</surname> <given-names>W.</given-names></name></person-group> (<year>2019</year>). <article-title>A novel data quality metric for minimality</article-title>, in <source>Data Quality and Trust in Big Data</source>, eds <person-group person-group-type="editor"><name><surname>Hacid</surname> <given-names>H.</given-names></name> <name><surname>Sheng</surname> <given-names>Q. Z.</given-names></name> <name><surname>Yoshida</surname> <given-names>T.</given-names></name> <name><surname>Sarkheyli</surname> <given-names>A.</given-names></name> <name><surname>Zhou</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>15</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ehrlinger</surname> <given-names>L.</given-names></name> <name><surname>Werth</surname> <given-names>B.</given-names></name> <name><surname>W&#x000F6;&#x000DF;</surname> <given-names>W.</given-names></name></person-group> (<year>2018</year>). <article-title>Automated continuous data quality measurement with QuaIIe</article-title>. <source>Int. J. Adv. Softw.</source> <volume>11</volume>, <fpage>400</fpage>&#x02013;<lpage>417</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Elmagarmid</surname> <given-names>A. K.</given-names></name> <name><surname>Ipeirotis</surname> <given-names>P. G.</given-names></name> <name><surname>Verykios</surname> <given-names>V. S.</given-names></name></person-group> (<year>2006</year>). <article-title>Duplicate record detection: a survey</article-title>. <source>IEEE Trans. Knowl. Data Eng.</source> <volume>19</volume>, <fpage>1</fpage>&#x02013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1109/TKDE.2007.250581</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>English</surname> <given-names>L. P.</given-names></name></person-group> (<year>1999</year>). <source>Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>John Wiley &#x00026; Sons, Inc</publisher-name>.</citation>
</ref>
<ref id="B28">
<citation citation-type="web"><person-group person-group-type="author"><collab>Experian</collab></person-group> (<year>2018</year>). <source>User Manual Version 5.9. Technical Report, Experian</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.edq.com/globalassets/documentation/pandora/pandora_manual_590.pdf">https://www.edq.com/globalassets/documentation/pandora/pandora_manual_590.pdf</ext-link> (January 2022).</citation>
</ref>
<ref id="B29">
<citation citation-type="web"><person-group person-group-type="author"><collab>Experian</collab></person-group> (<year>2020</year>). <source>What is a Data Quality Dimension?</source> Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.experian.co.uk/business/glossary/data-quality-dimensions">https://www.experian.co.uk/business/glossary/data-quality-dimensions</ext-link> (January 2022).</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fisher</surname> <given-names>C. W.</given-names></name> <name><surname>Lauria</surname> <given-names>E. J. M.</given-names></name> <name><surname>Matheus</surname> <given-names>C. C.</given-names></name></person-group> (<year>2009</year>). <article-title>An accuracy metric: percentages, randomness, and probabilities</article-title>. <source>J. Data Inf. Qual.</source> <volume>1</volume>, <fpage>16:1</fpage>&#x02013;<lpage>16:21</lpage>. <pub-id pub-id-type="doi">10.1145/1659225.1659229</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Friedman</surname> <given-names>T.</given-names></name></person-group> (<year>2012</year>). <source>Magic Quadrant for Data Quality Tools</source>. Technical Report, <publisher-name>Gartner, Inc.</publisher-name></citation>
</ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>J.</given-names></name> <name><surname>Xie</surname> <given-names>C.</given-names></name> <name><surname>Tao</surname> <given-names>C.</given-names></name></person-group> (<year>2016</year>). <article-title>Big data validation and quality assurance &#x02013; issuses, challenges, and needs</article-title>, in <source>Proceedings of the 2016 IEEE Symposium on Service-Oriented System Engineering (SOSE)</source> (<publisher-loc>Oxford</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>433</fpage>&#x02013;<lpage>441</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ge</surname> <given-names>M.</given-names></name> <name><surname>Helfert</surname> <given-names>M.</given-names></name></person-group> (<year>2007</year>). <article-title>A review of information quality research</article-title>, in <source>Proceedings of the 12th International Conference on Information Quality (ICIQ)</source> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT</publisher-name>), <fpage>76</fpage>&#x02013;<lpage>91</lpage>. <pub-id pub-id-type="pmid">28766742</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Goasdou&#x000E9;</surname> <given-names>V.</given-names></name> <name><surname>Nugier</surname> <given-names>S.</given-names></name> <name><surname>Duquennoy</surname> <given-names>D.</given-names></name> <name><surname>Laboisse</surname> <given-names>B.</given-names></name></person-group> (<year>2007</year>). <article-title>An evaluation framework for data quality tools</article-title>, in <source>Proceedings of the 12th International Conference on Information Quality (ICIQ)</source> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT</publisher-name>), <fpage>280</fpage>&#x02013;<lpage>294</lpage>.</citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Haegemans</surname> <given-names>T.</given-names></name> <name><surname>Snoeck</surname> <given-names>M.</given-names></name> <name><surname>Lemahieu</surname> <given-names>W.</given-names></name></person-group> (<year>2016</year>). <article-title>Towards a precise definition of data accuracy and a justification for its measure</article-title>, in <source>Proceedings of the International Conference on Information Quality (ICIQ 2016)</source> (<publisher-loc>Ciudad Real</publisher-loc>: <publisher-name>Alarcos Research Group (UCLM)</publisher-name>), <fpage>16.1</fpage>&#x02013;<lpage>16.13</lpage>.</citation>
</ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Heinrich</surname> <given-names>B.</given-names></name> <name><surname>Kaiser</surname> <given-names>M.</given-names></name> <name><surname>Klier</surname> <given-names>M.</given-names></name></person-group> (<year>2007</year>). <article-title>How to measure data quality? a metric-based approach</article-title>, in <source>Proceedings of the 28th International Conference on Information Systems (ICIS)</source>, eds <person-group person-group-type="editor"><name><surname>Rivard</surname> <given-names>S.</given-names></name> <name><surname>Webster</surname> <given-names>J.</given-names></name></person-group> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>Association for Information Systems 2007</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>15</lpage>.</citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Heinrich</surname> <given-names>B.</given-names></name> <name><surname>Klier</surname> <given-names>M.</given-names></name></person-group> (<year>2009</year>). <article-title>A novel data quality metric for timeliness considering supplemental data</article-title>, in <source>Proceedings of the 17th European Conference on Information Systems</source> (<publisher-loc>Verona</publisher-loc>: <publisher-name>Universit&#x000E0; di Verona, Facolt&#x000E0; di Economia, Dipartimento di Economia Aziendale</publisher-name>), <fpage>2701</fpage>&#x02013;<lpage>2713</lpage>.</citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Heinrich</surname> <given-names>B.</given-names></name> <name><surname>Hristova</surname> <given-names>D.</given-names></name> <name><surname>Klier</surname> <given-names>M.</given-names></name> <name><surname>Schiller</surname> <given-names>A.</given-names></name> <name><surname>Szubartowicz</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>Requirements for data quality metrics</article-title>. <source>J. Data Inf. Qual.</source> <volume>9</volume>, <fpage>12:1</fpage>&#x02013;<lpage>12:32</lpage>. <pub-id pub-id-type="doi">10.1145/3148238</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hildebrand</surname> <given-names>K.</given-names></name> <name><surname>Gebauer</surname> <given-names>M.</given-names></name> <name><surname>Hinrichs</surname> <given-names>H.</given-names></name> <name><surname>Mielke</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <source>Daten- und Informationsqualit&#x000E4;t [Data and Information Quality]</source>, <volume>Vol. 3</volume>. <publisher-loc>Wiesbaden</publisher-loc>: <publisher-name>Springer Vieweg</publisher-name>.</citation>
</ref>
<ref id="B40">
<citation citation-type="thesis"><person-group person-group-type="author"><name><surname>Hinrichs</surname> <given-names>H.</given-names></name></person-group> (<year>2002</year>). <source>Datenqualit&#x000E4;tsmanagement in Data Warehouse-Systemen [Data Quality Management in Data Warehouse Systems]</source>. Ph.D. thesis, <publisher-name>Universit&#x000E4;t Oldenburg</publisher-name>.</citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><collab>IEEE</collab></person-group> (<year>1998</year>). <source>Standard for a Software Quality Metrics Methodology</source>. Technical Report 1061-1998, <publisher-name>Institute of Electrical and Electronics Engineers</publisher-name>.</citation>
</ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><collab>Informatica</collab></person-group> (<year>2010</year>). <source>The Informatica Data Quality Methodology</source>. Technical Report, <publisher-name>Informatica</publisher-name>.</citation>
</ref>
<ref id="B43">
<citation citation-type="web"><person-group person-group-type="author"><collab>Informatica</collab></person-group> (<year>2018</year>). <source>Profile Guide &#x02013; 10.2 HotFix 1</source>. Technical Report, <publisher-name>Informatica</publisher-name>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://docs.informatica.com/content/dam/source/GUID-2/GUID-2257EE21-27D6-4053-B1DE-E656DA0A15C8/11/en/IN_101_ProfileGuide_en.pdf">https://docs.informatica.com/content/dam/source/GUID-2/GUID-2257EE21-27D6-4053-B1DE-E656DA0A15C8/11/en/IN_101_ProfileGuide_en.pdf</ext-link> (January 2022).</citation>
</ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><collab>ISO 8000-8:2015(E)</collab></person-group> (<year>2015</year>). <source>Data Quality &#x02013; Part 8: Information and Data Quality Concepts and Measuring</source>. <publisher-name>Standard, International Organization for Standardization</publisher-name>, <publisher-loc>Geneva</publisher-loc>.</citation>
</ref>
<ref id="B45">
<citation citation-type="book"><person-group person-group-type="author"><collab>ISO/IEC 25012:2008</collab></person-group> (<year>2008</year>). <source>Systems and Software Engineering &#x02013; Systems and Software Quality Requirements and Evaluation (SQuaRE) &#x02013; Data Quality Model</source>. <publisher-name>Standard, International Organization for Standardization</publisher-name>, <publisher-loc>Geneva</publisher-loc>.</citation>
</ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><collab>ISO/IEC 25024:2015(E)</collab></person-group> (<year>2015</year>). <source>Systems and Software Engineering &#x02013; Systems and Software Quality Requirements and Evaluation (SQuaRE) &#x02013; Measurement of Data Quality</source>. <publisher-name>Standard, International Organization for Standardization</publisher-name>, <publisher-loc>Geneva</publisher-loc>.</citation>
</ref>
<ref id="B47">
<citation citation-type="book"><person-group person-group-type="author"><collab>ISO/IEC 25040:2011</collab></person-group> (<year>2011</year>). <source>Systems and Software Engineering &#x02013; Systems and Software Quality Requirements and Evaluation (SQuaRE) &#x02013; Evaluation Process</source>. <publisher-name>Standard, International Organization for Standardization</publisher-name>, <publisher-loc>Geneva</publisher-loc>.</citation>
</ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jain</surname> <given-names>A. K.</given-names></name> <name><surname>Murty</surname> <given-names>M. N.</given-names></name> <name><surname>Flynn</surname> <given-names>P. J.</given-names></name></person-group> (<year>2000</year>). <article-title>Data clustering: a review</article-title>. <source>ACM Comput. Surveys (CSUR)</source> <volume>31</volume>, <fpage>264</fpage>&#x02013;<lpage>323</lpage>. <pub-id pub-id-type="doi">10.1145/331499.331504</pub-id></citation>
</ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Judah</surname> <given-names>S.</given-names></name> <name><surname>Selvage</surname> <given-names>M. Y.</given-names></name> <name><surname>Jain</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <source>Magic Quadrant for Data Quality Tools</source>. Technical Report, <publisher-name>Gartner, Inc.</publisher-name></citation>
</ref>
<ref id="B50">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kitchenham</surname> <given-names>B.</given-names></name></person-group> (<year>2004</year>). <source>Procedures for Performing Systematic Reviews</source>. Technical Report, <publisher-name>Keele University TR/SE-0401 and NICTA 0400011T.1</publisher-name>.</citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kitchenham</surname> <given-names>B.</given-names></name> <name><surname>Brereton</surname> <given-names>O. P.</given-names></name> <name><surname>Budgen</surname> <given-names>D.</given-names></name> <name><surname>Turner</surname> <given-names>M.</given-names></name> <name><surname>Bailey</surname> <given-names>J.</given-names></name> <name><surname>Linkman</surname> <given-names>S.</given-names></name></person-group> (<year>2009</year>). <article-title>Systematic literature reviews in software engineering &#x02013; a systematic literature review</article-title>. <source>Inf. Softw. Technol.</source> <volume>51</volume>, <fpage>7</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1016/j.infsof.2008.09.009</pub-id></citation>
</ref>
<ref id="B52">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kokem&#x000FC;ller</surname> <given-names>J.</given-names></name> <name><surname>Haupt</surname> <given-names>F.</given-names></name></person-group> (<year>2012</year>). <source>Datenqualit&#x000E4;tswerkzeuge 2012 &#x02013; Werkzeuge zur Bewertung und Erh&#x000F6;hung von Datenqualit&#x000E4;t [Data Quality Tools 2012 - Tools for the Assessment and Improvement of Data Quality]</source>. Technical Report, <publisher-name>Fraunhofer IAO</publisher-name>.</citation>
</ref>
<ref id="B53">
<citation citation-type="web"><person-group person-group-type="author"><collab>KPMG International</collab></person-group> (<year>2016</year>). <source>Now or Never: 2016 Global CEO Outlook</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://home.kpmg/content/dam/kpmg/pdf/2016/06/2016-global-ceo-outlook.pdf">https://home.kpmg/content/dam/kpmg/pdf/2016/06/2016-global-ceo-outlook.pdf</ext-link> (January 2022).</citation>
</ref>
<ref id="B54">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kusumasari</surname> <given-names>T. F.</given-names></name></person-group> (<year>2016</year>). <article-title>Data profiling for data quality improvement with OpenRefine</article-title>, in <source>2016 International Conference on Information Technology Systems and Innovation (ICITSI)</source> (<publisher-loc>Bali</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>. <pub-id pub-id-type="pmid">9383671</pub-id></citation></ref>
<ref id="B55">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Laranjeiro</surname> <given-names>N.</given-names></name> <name><surname>Soydemir</surname> <given-names>S. N.</given-names></name> <name><surname>Bernardino</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>A survey on data quality: classifying poor data</article-title>, in <source>Proceedings of the 21st Pacific Rim International Symposium on Dependable Computing (PRDC)</source> (<publisher-loc>Zhangjiajie</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>179</fpage>&#x02013;<lpage>188</lpage>.</citation>
</ref>
<ref id="B56">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>Y. W.</given-names></name> <name><surname>Pipino</surname> <given-names>L. L.</given-names></name> <name><surname>Funk</surname> <given-names>J. D.</given-names></name> <name><surname>Wang</surname> <given-names>R. Y.</given-names></name></person-group> (<year>2009</year>). <source>Journey to Data Quality</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>The MIT Press</publisher-name>.</citation>
</ref>
<ref id="B57">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>Y. W.</given-names></name> <name><surname>Strong</surname> <given-names>D. M.</given-names></name> <name><surname>Kahn</surname> <given-names>B. K.</given-names></name> <name><surname>Wang</surname> <given-names>R. Y.</given-names></name></person-group> (<year>2002</year>). <article-title>AIMQ: a methodology for information quality assessment</article-title>. <source>Inf. Manag.</source> <volume>40</volume>, <fpage>133</fpage>&#x02013;<lpage>146</lpage>. <pub-id pub-id-type="doi">10.1016/S0378-7206(02)00043-5</pub-id></citation>
</ref>
<ref id="B58">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Loshin</surname> <given-names>D.</given-names></name></person-group> (<year>2006</year>). <source>Monitoring Data Quality Performance Using Data Quality Metrics</source>. Technical Report, <publisher-name>Informatica</publisher-name>. <pub-id pub-id-type="pmid">26062714</pub-id></citation></ref>
<ref id="B59">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Loshin</surname> <given-names>D.</given-names></name></person-group> (<year>2010</year>). <source>The Practitioner&#x00027;s Guide to Data Quality Improvement</source>. <edition>1st Edn.</edition> <publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>Morgan Kaufmann Publishers Inc</publisher-name>. <pub-id pub-id-type="pmid">32520766</pub-id></citation></ref>
<ref id="B60">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Maletic</surname> <given-names>J. I.</given-names></name> <name><surname>Marcus</surname> <given-names>A.</given-names></name></person-group> (<year>2009</year>). <article-title>Data cleansing: a prelude to knowledge discovery</article-title>, in <source>Data Mining and Knowledge Discovery Handbook</source>, ed <person-group person-group-type="editor"><name><surname>Maimon</surname> <given-names>O.</given-names></name></person-group> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>19</fpage>&#x02013;<lpage>32</lpage>.</citation>
</ref>
<ref id="B61">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Maydanchik</surname> <given-names>A.</given-names></name></person-group> (<year>2007</year>). <source>Data Quality Assessment</source>. <publisher-loc>Bradley Beach, NJ</publisher-loc>: <publisher-name>Technics Publications, LLC</publisher-name>.</citation>
</ref>
<ref id="B62">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>McKean</surname> <given-names>E.</given-names></name></person-group> (<year>2005</year>). <source>The New Oxford American Dictionary</source>, <volume>Vol. 2</volume>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>. <pub-id pub-id-type="pmid">23304423</pub-id></citation></ref>
<ref id="B63">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Moore</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <source>How to Create a Business Case for Data Quality Improvement</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.gartner.com/smarterwithgartner/how-to-create-a-business-case-for-data-quality-improvement">https://www.gartner.com/smarterwithgartner/how-to-create-a-business-case-for-data-quality-improvement</ext-link> (January 2022).</citation>
</ref>
<ref id="B64">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Myers</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <source>About the Dimensions of Data Quality</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://dimensionsofdataquality.com/about_dims">http://dimensionsofdataquality.com/about_dims</ext-link> (January 2022).</citation>
</ref>
<ref id="B65">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Naumann</surname> <given-names>F.</given-names></name></person-group> (<year>2014</year>). <article-title>Data profiling revisited</article-title>. <source>ACM SIGMOD Rec.</source> <volume>42</volume>, <fpage>40</fpage>&#x02013;<lpage>49</lpage>. <pub-id pub-id-type="doi">10.1145/2590989.2590995</pub-id></citation>
</ref>
<ref id="B66">
<citation citation-type="web"><person-group person-group-type="author"><collab>Oracle</collab></person-group> (<year>2018</year>). <source>Enterprise Data Quality Help version 9.0. Technical Report, Oracle</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.oracle.com/webfolder/technetwork/data-quality/edqhelp/index.htm">https://www.oracle.com/webfolder/technetwork/data-quality/edqhelp/index.htm</ext-link> (January 2022).</citation>
</ref>
<ref id="B67">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Otto</surname> <given-names>B.</given-names></name> <name><surname>&#x000D6;sterle</surname> <given-names>H.</given-names></name></person-group> (<year>2016</year>). <source>Corporate Data Quality: Prerequisite for Successful Business Models</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>Gabler</publisher-name>.</citation>
</ref>
<ref id="B68">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pateli</surname> <given-names>A. G.</given-names></name> <name><surname>Giaglis</surname> <given-names>G. M.</given-names></name></person-group> (<year>2004</year>). <article-title>A research framework for analysing ebusiness models</article-title>. <source>Eur. J. Inf. Syst.</source> <volume>13</volume>, <fpage>302</fpage>&#x02013;<lpage>314</lpage>. <pub-id pub-id-type="doi">10.1057/palgrave.ejis.3000513</pub-id></citation>
</ref>
<ref id="B69">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pawluk</surname> <given-names>P.</given-names></name></person-group> (<year>2010</year>). <article-title>Trusted data in IBM&#x00027;s MDM: accuracy dimension</article-title>, in <source>Proceedings of the 2010 International Multiconference on Computer Science and Information Technology (IMCSIT)</source> (<publisher-loc>Wisla</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>577</fpage>&#x02013;<lpage>584</lpage>.</citation>
</ref>
<ref id="B70">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pipino</surname> <given-names>L. L.</given-names></name> <name><surname>Lee</surname> <given-names>Y. W.</given-names></name> <name><surname>Wang</surname> <given-names>R. Y.</given-names></name></person-group> (<year>2002</year>). <article-title>Data quality assessment</article-title>. <source>Commun. ACM</source> <volume>45</volume>, <fpage>211</fpage>&#x02013;<lpage>218</lpage>. <pub-id pub-id-type="doi">10.1145/505248.506010</pub-id></citation>
</ref>
<ref id="B71">
<citation citation-type="book"><person-group person-group-type="editor"><name><surname>Piro</surname> <given-names>A.</given-names></name></person-group> editor (<year>2014</year>). <source>Informationsqualit&#x000E4;t bewerten &#x02013; Grundlagen, Methoden, Praxisbeispiele [Assessing Information Quality &#x02013; Foundations, Methods, and Practical Examples]</source>. <edition>1st Edn.</edition> <publisher-loc>D&#x000FC;sseldorf</publisher-loc>: <publisher-name>Symposion Publishing GmbH</publisher-name>.</citation>
</ref>
<ref id="B72">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Prasad</surname> <given-names>K. H.</given-names></name> <name><surname>Faruquie</surname> <given-names>T. A.</given-names></name> <name><surname>Joshi</surname> <given-names>S.</given-names></name> <name><surname>Chaturvedi</surname> <given-names>S.</given-names></name> <name><surname>Subramaniam</surname> <given-names>L. V.</given-names></name> <name><surname>Mohania</surname> <given-names>M.</given-names></name></person-group> (<year>2011</year>). <article-title>Data cleansing techniques for large enterprise datasets</article-title>, in <source>2011 Annual SRII Global Conference</source> (<publisher-loc>San Jose, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>135</fpage>&#x02013;<lpage>144</lpage>.</citation>
</ref>
<ref id="B73">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pulla</surname> <given-names>V. S. V.</given-names></name> <name><surname>Varol</surname> <given-names>C.</given-names></name> <name><surname>Al</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Open source data quality tools: revisited</article-title>, in <source>Information Technology: New Generations: 13th International Conference on Information Technology</source>, ed <person-group person-group-type="editor"><name><surname>Latifi</surname> <given-names>S.</given-names></name></person-group> (<publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>893</fpage>&#x02013;<lpage>902</lpage>. <pub-id pub-id-type="pmid">26767969</pub-id></citation></ref>
<ref id="B74">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pushkarev</surname> <given-names>V.</given-names></name> <name><surname>Neumann</surname> <given-names>H.</given-names></name> <name><surname>Varol</surname> <given-names>C.</given-names></name> <name><surname>Talburt</surname> <given-names>J. R.</given-names></name></person-group> (<year>2010</year>). <article-title>An overview of open source data quality tools</article-title>, in <source>Proceedings of the 2010 International Conference on Information &#x00026; Knowledge Engineering, IKE 2010, July 12&#x02013;15, 2010</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>CSREA Press</publisher-name>), <fpage>370</fpage>&#x02013;<lpage>376</lpage>.</citation>
</ref>
<ref id="B75">
<citation citation-type="web"><person-group person-group-type="author"><collab>Quadient</collab></person-group> (<year>2008</year>). <source>DataCleaner Reference Documentation 5.2. Technical Report</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://datacleaner.org/docs/5.2/html">https://datacleaner.org/docs/5.2/html</ext-link> (January 2022).</citation>
</ref>
<ref id="B76">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Redman</surname> <given-names>T. C.</given-names></name></person-group> (<year>1997</year>). <source>Data Quality for the Information Age</source>. <edition>1st Edn.</edition> <publisher-loc>Norwood, MA</publisher-loc>: <publisher-name>Artech House, Inc.</publisher-name></citation>
</ref>
<ref id="B77">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Redman</surname> <given-names>T. C.</given-names></name></person-group> (<year>2005</year>). <article-title>Measuring data accuracy: a framework and review</article-title>, in <source>Information Quality</source>, Ch. 2 (<publisher-loc>Armonk, NY</publisher-loc>: <publisher-name>M.E. Sharpe</publisher-name>), <fpage>21</fpage>&#x02013;<lpage>36</lpage>.</citation>
</ref>
<ref id="B78">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Rolland</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <source>mobyDQ. Technical Report, The Data Tourists</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://ubisoft.github.io/mobydq">https://ubisoft.github.io/mobydq</ext-link> (January 2022).</citation>
</ref>
<ref id="B79">
<citation citation-type="web"><person-group person-group-type="author"><collab>SAS</collab></person-group> (<year>2019</year>). <source>DataFlux Data Management Studio 2.7: User Guide</source>. Technical Report, <publisher-name>SAS</publisher-name>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://support.sas.com/documentation/onlinedoc/dfdmstudio/2.7/dmpdmsug/dfUnity.html">http://support.sas.com/documentation/onlinedoc/dfdmstudio/2.7/dmpdmsug/dfUnity.html</ext-link> (January 2022).</citation>
</ref>
<ref id="B80">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Scannapieco</surname> <given-names>M.</given-names></name> <name><surname>Catarci</surname> <given-names>T.</given-names></name></person-group> (<year>2002</year>). <article-title>Data quality under the computer science perspective</article-title>. <source>Arch. Comput.</source> <volume>2</volume>, <fpage>1</fpage>&#x02013;<lpage>15</lpage>.</citation>
</ref>
<ref id="B81">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sch&#x000E4;ffer</surname> <given-names>T.</given-names></name> <name><surname>Beckmann</surname> <given-names>H.</given-names></name></person-group> (<year>2014</year>). <source>Trendstudie Stammdatenqualit&#x000E4;t 2013: Erhebung der aktuellen Situation zur Stammdatenqualit&#x000E4;t in Unternehmen und daraus abgeleitete Trends [Trend Study Master Data Quality 2013: Inquiry of the Current Situation of Master Data Quality in Companies and Derived Trends]</source>. Technical Report, <publisher-name>Hochschule Heilbronn</publisher-name>.</citation>
</ref>
<ref id="B82">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sebastian-Coleman</surname> <given-names>L.</given-names></name></person-group> (<year>2013</year>). <source>Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework</source>. <publisher-loc>Waltham, MA</publisher-loc>: <publisher-name>Elsevier</publisher-name>.</citation>
</ref>
<ref id="B83">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Selvage</surname> <given-names>M. Y.</given-names></name> <name><surname>Judah</surname> <given-names>S.</given-names></name> <name><surname>Jain</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <source>Magic Quadrant for Data Quality Tools</source>. Technical Report, <publisher-name>Gartner, Inc.</publisher-name></citation>
</ref>
<ref id="B84">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sessions</surname> <given-names>V.</given-names></name> <name><surname>Valtorta</surname> <given-names>M.</given-names></name></person-group> (<year>2006</year>). <article-title>The effects of data quality on machine learning algorithms</article-title>, in <source>Proceedings of the 11th International Conference on Information Quality (ICIQ 2006)</source>, <volume>Vol. 6</volume> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT</publisher-name>), <fpage>485</fpage>&#x02013;<lpage>498</lpage>.</citation>
</ref>
<ref id="B85">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sheskin</surname> <given-names>D. J.</given-names></name></person-group> (<year>2003</year>). <source>Handbook of Parametric and Nonparametric Statistical Procedures</source>. <edition>3rd Edn.</edition> <publisher-loc>Boca Raton, FL</publisher-loc>: <publisher-name>CRC Press</publisher-name>.</citation>
</ref>
<ref id="B86">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Stephens</surname> <given-names>O.</given-names></name></person-group> (<year>2018</year>). <source>Methods and Theory behind the Clustering Functionality in OpenRefine</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth">https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth</ext-link> (January 2022).</citation>
</ref>
<ref id="B87">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stonebraker</surname> <given-names>M.</given-names></name> <name><surname>Ilyas</surname> <given-names>I. F.</given-names></name></person-group> (<year>2018</year>). <article-title>Data integration: the current status and the way forward</article-title>. <source>Bull. IEEE Comput. Soc. Techn. Committee Data Eng.</source> <volume>41</volume>, <fpage>3</fpage>&#x02013;<lpage>9</lpage>.</citation>
</ref>
<ref id="B88">
<citation citation-type="web"><person-group person-group-type="author"><collab>Talend</collab></person-group> (<year>2017</year>). <source>Talend Open Studio for Data Quality &#x02013; User Guide 7.0.1M2</source>. Technical Report, <publisher-name>Talend</publisher-name>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://download-mirror1.talend.com/top/user-guide-download/V552/TalendOpenStudio_DQ_UG_5.5.2_EN.pdf">http://download-mirror1.talend.com/top/user-guide-download/V552/TalendOpenStudio_DQ_UG_5.5.2_EN.pdf</ext-link> (January 2022).</citation>
</ref>
<ref id="B89">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tsiflidou</surname> <given-names>E.</given-names></name> <name><surname>Manouselis</surname> <given-names>N.</given-names></name></person-group> (<year>2013</year>). <article-title>Tools and techniques for assessing metadata quality</article-title>, in <source>Research Conference on Metadata and Semantic Research</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>99</fpage>&#x02013;<lpage>110</lpage>.</citation>
</ref>
<ref id="B90">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wand</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>R. Y.</given-names></name></person-group> (<year>1996</year>). <article-title>Anchoring data quality dimensions in ontological foundations</article-title>. <source>Commun. ACM</source> <volume>39</volume>, <fpage>86</fpage>&#x02013;<lpage>95</lpage>.</citation>
</ref>
<ref id="B91">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>R. Y.</given-names></name></person-group> (<year>1998</year>). <article-title>A product perspective on total data quality management</article-title>. <source>Commun. ACM</source> <volume>41</volume>, <fpage>58</fpage>&#x02013;<lpage>65</lpage>.</citation>
</ref>
<ref id="B92">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>R. Y.</given-names></name> <name><surname>Strong</surname> <given-names>D. M.</given-names></name></person-group> (<year>1996</year>). <article-title>Beyond accuracy: what data quality means to data consumers</article-title>. <source>J. Manag. Inf. Syst.</source> <volume>12</volume>, <fpage>5</fpage>&#x02013;<lpage>33</lpage>.</citation>
</ref>
<ref id="B93">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Woodall</surname> <given-names>P.</given-names></name> <name><surname>Oberhofer</surname> <given-names>M.</given-names></name> <name><surname>Borek</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>A classification of data quality assessment and improvement methods</article-title>. <source>Int. J. Inf. Qual.</source> <volume>3</volume>, <fpage>298</fpage>&#x02013;<lpage>321</lpage>. <pub-id pub-id-type="doi">10.1504/IJIQ.2014.068656</pub-id><pub-id pub-id-type="pmid">22617804</pub-id></citation></ref>
<ref id="B94">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>H.</given-names></name> <name><surname>Madnick</surname> <given-names>S.</given-names></name> <name><surname>Lee</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>R. Y.</given-names></name></person-group> (<year>2014</year>). <article-title>Data and information quality research: its evolution and future</article-title>, in <source>Computing Handbook: Information Systems and Information Technology</source> (<publisher-loc>London</publisher-loc>: <publisher-name>Chapman and Hall/CRC</publisher-name>), <fpage>16.1</fpage>&#x02013;<lpage>16.20</lpage>.</citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/CanburakTumer/SQL-Utils">https://github.com/CanburakTumer/SQL-Utils</ext-link> (January, 2022).</p></fn>
<fn id="fn0002"><p><sup>2</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/rogelj/DescribeCol">https://github.com/rogelj/DescribeCol</ext-link> (January, 2022).</p></fn>
<fn id="fn0003"><p><sup>3</sup><ext-link ext-link-type="uri" xlink:href="http://www.dofactory.com/sql/sample-database">http://www.dofactory.com/sql/sample-database</ext-link> (January, 2022).</p></fn>
<fn id="fn0004"><p><sup>4</sup><ext-link ext-link-type="uri" xlink:href="https://sourceforge.net/projects/dataquality">https://sourceforge.net/projects/dataquality</ext-link> (January, 2022).</p></fn>
<fn id="fn0005"><p><sup>5</sup><ext-link ext-link-type="uri" xlink:href="https://griffin.apache.org">https://griffin.apache.org</ext-link> (January, 2022).</p></fn>
<fn id="fn0006"><p><sup>6</sup><ext-link ext-link-type="uri" xlink:href="https://one.ataccama.com">https://one.ataccama.com</ext-link> (January, 2022).</p></fn>
<fn id="fn0007"><p><sup>7</sup><ext-link ext-link-type="uri" xlink:href="http://www.datamartist.com">http://www.datamartist.com</ext-link> (January, 2022).</p></fn>
<fn id="fn0008"><p><sup>8</sup><ext-link ext-link-type="uri" xlink:href="https://www.ibm.com/support/knowledgecenter/en/SSZJPZ_11.7.0/com.ibm.swg.im.iis.productization.iisinfsv.install.doc/topics/cont_iisinfsrv_install.html">https://www.ibm.com/support/knowledgecenter/en/SSZJPZ_11.7.0/com.ibm.swg.im.iis.productization.iisinfsv.install.doc/topics/cont_iisinfsrv_install.html</ext-link> (January, 2022).</p></fn>
<fn id="fn0009"><p><sup>9</sup><ext-link ext-link-type="uri" xlink:href="http://www.humanit.de">http://www.humanit.de</ext-link> (January, 2022).</p></fn>
<fn id="fn0010"><p><sup>10</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/mobydq/mobydq">https://github.com/mobydq/mobydq</ext-link> (January, 2022).</p></fn>
<fn id="fn0011"><p><sup>11</sup><ext-link ext-link-type="uri" xlink:href="http://openrefine.org">http://openrefine.org</ext-link> (January, 2022).</p></fn>
<fn id="fn0012"><p><sup>12</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/OpenRefine/OpenRefine">https://github.com/OpenRefine/OpenRefine</ext-link> (January, 2022).</p></fn>
<fn id="fn0013"><p><sup>13</sup><ext-link ext-link-type="uri" xlink:href="http://www.oracle.com/technetwork/middleware/oedq/downloads/edq-vm-download-2424092.html">http://www.oracle.com/technetwork/middleware/oedq/downloads/edq-vm-download-2424092.html</ext-link> (January, 2022).</p></fn>
<fn id="fn0014"><p><sup>14</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/Talend/tdq-studio-se">https://github.com/Talend/tdq-studio-se</ext-link> (January, 2022).</p></fn>
<fn id="fn0015"><p><sup>15</sup><ext-link ext-link-type="uri" xlink:href="http://www.sas.com">http://www.sas.com</ext-link> (January, 2022).</p></fn>
<fn id="fn0016"><p><sup>16</sup><ext-link ext-link-type="uri" xlink:href="https://info.talend.com/vrstosdq_150602.html">https://info.talend.com/vrstosdq_150602.html</ext-link> (January, 2022).</p></fn>
<fn id="fn0017"><p><sup>17</sup><ext-link ext-link-type="uri" xlink:href="https://community.talend.com">https://community.talend.com</ext-link> (January, 2022).</p></fn>
<fn id="fn0018"><p><sup>18</sup><ext-link ext-link-type="uri" xlink:href="http://web.mit.edu/tdqm">http://web.mit.edu/tdqm</ext-link> (January, 2022).</p></fn>
</fn-group>
</back>
</article>