<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="systematic-review">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fcomp.2025.1504725</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Computer Science</subject>
<subj-group>
<subject>Systematic Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Plagiarism types and detection methods: a systematic survey of algorithms in text analysis</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Amirzhanov</surname> <given-names>Altynbek</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Turan</surname> <given-names>Cemil</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2857438/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Makhmutova</surname> <given-names>Alfira</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2995860/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Computer Science, SDU University</institution>, <addr-line>Kaskelen</addr-line>, <country>Kazakhstan</country></aff>
<aff id="aff2"><sup>2</sup><institution>General Education, New Uzbekistan University</institution>, <addr-line>Tashkent</addr-line>, <country>Uzbekistan</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Kamel Barkaoui, Conservatoire National des Arts et M&#x000E9;tiers (CNAM), France</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Rocco Zaccagnino, University of Salerno, Italy</p>
<p>Gerardo Sierra, National Autonomous University of Mexico, Mexico</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Cemil Turan <email>cemil.turan&#x00040;sdu.edu.kz</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>17</day>
<month>03</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2025</year>
</pub-date>
<volume>7</volume>
<elocation-id>1504725</elocation-id>
<history>
<date date-type="received">
<day>01</day>
<month>10</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>24</day>
<month>02</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2025 Amirzhanov, Turan and Makhmutova.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Amirzhanov, Turan and Makhmutova</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Plagiarism in academic and creative writing continues to be a significant challenge, driven by the exponential growth of digital content. This paper presents a systematic survey of various types of plagiarism and the detection algorithms employed in text analysis. We categorize plagiarism into distinct types, including verbatim, paraphrasing, translation, and idea-based plagiarism, discussing the nuances that make detection complex. This survey critically evaluates existing literature, contrasting traditional methods like string-matching with advanced machine learning, natural language processing, and deep learning approaches. We highlight notable works focusing on cross-language plagiarism detection, source code plagiarism, and intrinsic detection techniques, identifying their contributions and limitations. Additionally, this paper explores emerging challenges such as detecting cross-language plagiarism and AI-generated content. By synthesizing the current landscape and emphasizing recent advancements, we aim to guide future research directions and enhance the robustness of plagiarism detection systems across various domains.</p></abstract>
<kwd-group>
<kwd>plagiarism detection</kwd>
<kwd>text analysis</kwd>
<kwd>natural language processing</kwd>
<kwd>plagiarism types</kwd>
<kwd>machine learning</kwd>
<kwd>AI-generated content</kwd>
</kwd-group>
<counts>
<fig-count count="14"/>
<table-count count="11"/>
<equation-count count="4"/>
<ref-count count="44"/>
<page-count count="20"/>
<word-count count="8963"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Theoretical Computer Science</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>Plagiarism, often defined as the uncredited replication or close imitation of someone else&#x00027;s work, remains a persistent threat to academic integrity across various disciplines. The <italic>Office of Research Integrity (ORI)</italic> and foundational studies (Roig, <xref ref-type="bibr" rid="B32">2006</xref>) define plagiarism as the act of using another person&#x00027;s intellectual output without proper acknowledgment, which directly undermines the principles of originality and academic honesty. As digital content continues to expand, the challenge of detecting and preventing plagiarism has become increasingly complex (Gandhi et al., <xref ref-type="bibr" rid="B18">2024</xref>).</p>
<p>Early detection methods, such as <italic>string-matching algorithms</italic>, were effective for identifying verbatim plagiarism. Tools like <italic>Turnitin</italic> and <italic>CopyCatch</italic> employ <italic>Rabin-Karp</italic> and <italic>Knuth-Morris-Pratt string-matching</italic> techniques to efficiently compare text segments and detect direct text overlap. These approaches, widely adopted in educational institutions and publishing platforms, provide high accuracy in detecting exact text matches. However, plagiarism has evolved beyond simple copy-pasting to include <italic>paraphrasing, translation, idea-based plagiarism, and AI-generated content</italic>, making traditional methods increasingly inadequate.</p>
<p>In response, advancements in <italic>machine learning (ML) and natural language processing (NLP)</italic> have significantly enhanced plagiarism detection by incorporating <italic>semantic similarity models, deep learning architectures, and citation-based techniques</italic>. Emerging challenges, such as plagiarism in programming code and <italic>cross-lingual plagiarism</italic>, further complicate detection efforts. For instance, in programming plagiarism, even minor syntax changes (e.g., variable name alterations or logic restructuring) can obscure copied code. Specialized tools like <italic>Measure of Software Similarity (MOSS)</italic> and <italic>Program Dependence Graphs (PDG)</italic> exemplify approaches tailored to detect such obfuscation. Meanwhile, AI-generated content detection introduces a new frontier, requiring models capable of identifying machine-generated text with high accuracy.</p>
<p>This paper presents a <italic>systematic survey</italic> of plagiarism types and detection algorithms, integrating findings from previous research and highlighting recent advancements in AI-based detection techniques. By categorizing plagiarism into <italic>verbatim, paraphrased, translation-based, conceptual plagiarism, and programming code plagiarism</italic>, this study provides a <italic>comprehensive overview</italic> of the current landscape of plagiarism detection. Additionally, we examine emerging challenges such as <italic>cross-lingual plagiarism</italic> and <italic>AI-generated content detection</italic>, providing insights into future research directions.</p>
</sec>
<sec id="s2">
<title>2 Research objectives and questions</title>
<p>Plagiarism detection remains a complex challenge due to the increasing sophistication of textual obfuscation techniques. Traditional approaches, including <italic>string-matching and syntactic analysis</italic>, struggle with advanced forms of plagiarism, necessitating the development of more robust AI-driven solutions. To provide a <italic>structured and comprehensive</italic> analysis of plagiarism detection methodologies, this study is guided by the following objectives and research questions:</p>
<sec>
<title>2.1 Research objectives</title>
<p>This paper aims to:</p>
<list list-type="bullet">
<list-item><p><bold>Categorize and analyze</bold> the different types of plagiarism, highlighting their detection complexities.</p></list-item>
<list-item><p><bold>Critically evaluate</bold> the methodologies and algorithms currently used in plagiarism detection, comparing traditional approaches with <italic>ML, NLP, and deep learning techniques</italic>.</p></list-item>
<list-item><p><bold>Identify emerging challenges</bold>, including <italic>AI-generated plagiarism and cross-lingual detection</italic>, and propose <italic>future research directions</italic> to enhance detection systems.</p></list-item>
</list>
</sec>
<sec>
<title>2.2 Research questions</title>
<p>This study seeks to answer the following key questions:</p>
<list list-type="order">
<list-item><p><bold>What are the distinct types of plagiarism, and how do they differ in terms of detection complexity?</bold></p></list-item>
<list-item><p><bold>What are the strengths and limitations of existing plagiarism detection methods?</bold></p></list-item>
<list-item><p><bold>How do advanced ML, NLP, and deep learning techniques enhance plagiarism detection?</bold></p></list-item>
<list-item><p><bold>What are the emerging trends and challenges in detecting AI-generated content and cross-language plagiarism?</bold></p></list-item>
</list>
<p>By addressing these questions, this study aims to provide <italic>a comprehensive overview</italic> of current detection methodologies while offering <italic>insights into future advancements</italic> in plagiarism detection research.</p>
</sec>
</sec>
<sec id="s3">
<title>3 Paper selection methodology</title>
<p>To ensure <italic>methodological rigor</italic> and <italic>clarity</italic>, this study follows the <bold>PICOS framework</bold>, which defines the <italic>Population, Intervention, Comparison, Outcomes, and Study Design</italic> of this systematic review.</p>
<list list-type="bullet">
<list-item><p><bold>Population (P):</bold> Academic, educational, and creative communities affected by plagiarism challenges, including researchers, educators, journal editors, and plagiarism detection system developers.</p></list-item>
<list-item><p><bold>Intervention (I):</bold> Various plagiarism detection techniques, including:</p>
<list list-type="simple">
<list-item><p>&#x02013; Traditional methods (e.g., string-matching, syntactic similarity).</p></list-item>
<list-item><p>&#x02013; Semantic similarity models (e.g., word embeddings, deep learning).</p></list-item>
<list-item><p>&#x02013; Machine learning and NLP-based methods (e.g., transformers, BERT-based models).</p></list-item>
<list-item><p>&#x02013; Citation-based approaches and structural analysis.</p></list-item>
</list>
</list-item>
<list-item><p><bold>Comparison (C):</bold> A critical evaluation of:</p>
<list list-type="simple">
<list-item><p>&#x02013; Rule-based and string-matching approaches vs. AI-driven methods.</p></list-item>
<list-item><p>&#x02013; Traditional textual similarity techniques vs. deep learning architectures.</p></list-item>
<list-item><p>&#x02013; Monolingual plagiarism detection vs. cross-lingual plagiarism detection methods.</p></list-item>
</list>
</list-item>
<list-item><p><bold>Outcomes (O):</bold> Identification of the most effective strategies for plagiarism detection, insights into emerging challenges such as AI-generated content, semantic plagiarism, and cross-lingual text transformation, and evaluation of the role of deep learning, NLP, and citation-based methods in plagiarism detection.</p></list-item>
<list-item><p><bold>Study Design (S):</bold> A <italic>systematic survey</italic> of <italic>peer-reviewed studies</italic> from <italic>high-quality journals (2014&#x02013;2024)</italic>, focusing on both theoretical advancements and real-world applications of plagiarism detection.</p></list-item>
</list>
<p>By following the PICOS framework, this study provides a <italic>structured and transparent</italic> review, ensuring reproducibility and guiding future research in plagiarism detection. To keep the literature review systematic and comprehensive, we additionally followed the PRISMA guidelines for identifying, screening, and including relevant papers. Overall, the methodology comprised the following steps.</p>
<sec>
<title>3.1 Database selection and search strategy</title>
<p>We selected <italic>Scopus</italic> as the primary database for retrieving papers due to its extensive coverage of high-quality peer-reviewed journals. The following search query was used to identify relevant studies:</p>
<list list-type="simple">
<list-item><p>&#x0201C;<italic>text AND plagiarism AND detection</italic>&#x0201D;</p></list-item>
</list>
<p>This search query was designed to target research specifically focused on textual plagiarism detection techniques.</p>
<p>To refine the search results and focus on relevant studies, we applied the following filters:</p>
<list list-type="bullet">
<list-item><p><bold>Year range</bold>: 2014&#x02013;2024, ensuring the inclusion of recent advancements.</p></list-item>
<list-item><p><bold>Subject area</bold>: computer science, aligning with technological developments in plagiarism detection.</p></list-item>
<list-item><p><bold>Document type</bold>: only full-length peer-reviewed journal articles.</p></list-item>
<list-item><p><bold>Source type</bold>: journals, prioritizing high-quality research.</p></list-item>
<list-item><p><bold>Language</bold>: only English-language papers were considered for consistency.</p></list-item>
</list>
<p>After applying the filters, the search yielded a total of 104 papers. To further enhance the quality of the review, we restricted our selection to papers published in <bold>Q1 quartile journals</bold>, which reduced the number of papers to 47. Q1 journals are recognized as leading in their fields and ensure high-impact research that meets rigorous peer-review standards, so this selection criterion aligns with our objective of synthesizing advanced and reliable methodologies in plagiarism detection. Following the initial query and filtering, we carefully reviewed the abstracts of the remaining papers to assess their relevance to the scope of this review. Six more papers were excluded based on the following criteria:</p>
<list list-type="bullet">
<list-item><p>Focus on plagiarism detection in non-textual domains, such as images or audio.</p></list-item>
<list-item><p>Lack of empirical validation or practical application of proposed methods.</p></list-item>
<list-item><p>Redundancy with other included studies, offering no additional insights.</p></list-item>
<list-item><p>Methodological limitations, such as insufficient sample sizes or incomplete datasets.</p></list-item>
<list-item><p>Language mismatch (abstracts or full text not available in English).</p></list-item>
<list-item><p>Inaccessibility or incomplete publication details.</p></list-item>
</list>
<p>As a result, 41 papers were included in the final dataset for this review, providing a solid foundation for analyzing plagiarism detection techniques. The methodology followed is summarized in <xref ref-type="fig" rid="F1">Figure 1</xref>, which outlines the step-by-step paper selection process.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Paper selection methodology: flow of search, filters, and selection process.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0001.tif"/>
</fig>
</sec>
</sec>
<sec id="s4">
<title>4 Quantitative analysis of reviewed literature on plagiarism detection</title>
<p>This section presents a statistical overview of the papers included in our systematic review, offering insights into the publication trends, disciplinary focus, and methodological evolution of plagiarism detection research.</p>
<sec>
<title>4.1 Publication trends in reviewed papers (2014&#x02013;2024)</title>
<p><xref ref-type="fig" rid="F2">Figure 2</xref> presents a temporal analysis of the papers reviewed in this study, revealing fluctuating research activity in plagiarism detection. Key observations include:</p>
<list list-type="bullet">
<list-item><p>A notable peak in 2020, reflecting an increased focus on AI-driven detection techniques and the rising concern over AI-generated plagiarism.</p></list-item>
<list-item><p>Stable publication activity between 2015 and 2017, indicating sustained interest in refining plagiarism detection methodologies.</p></list-item>
<list-item><p>A gradual decline in recent years, potentially due to:</p>
<list list-type="simple">
<list-item><p>&#x02013; The maturity of existing plagiarism detection techniques.</p></list-item>
<list-item><p>&#x02013; A shift toward integrating plagiarism detection within broader NLP and AI applications.</p></list-item>
</list>
</list-item>
</list>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Publication trends in reviewed papers (2014&#x02013;2024).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0002.tif"/>
</fig>
<p>These patterns indicate waves of research focus, often aligned with advancements in machine learning (ML), deep learning (DL), and natural language processing (NLP)-based approaches.</p>
</sec>
<sec>
<title>4.2 Disciplinary distribution of reviewed research</title>
<p><xref ref-type="fig" rid="F3">Figure 3</xref> categorizes the reviewed papers by subject area, highlighting the disciplinary focus within plagiarism detection research:</p>
<list list-type="bullet">
<list-item><p><bold>Computer science (41.4%)</bold> remains the dominant field, reflecting its central role in developing text-matching, NLP, and AI-based plagiarism detection algorithms.</p></list-item>
<list-item><p><bold>Engineering (19.2%)</bold> accounts for a significant share, likely due to the development of software tools and algorithmic optimizations.</p></list-item>
<list-item><p><bold>Social sciences (16.2%) and decision sciences (13.1%)</bold> emphasize the increasing interdisciplinary interest in plagiarism detection, particularly in academic integrity, ethics, and policy frameworks.</p></list-item>
<list-item><p>Smaller contributions from <bold>business management (4.0%)</bold>, <bold>mathematics (2.0%)</bold>, and <bold>neuroscience (1.0%)</bold> highlight the adoption of plagiarism detection methods beyond technical disciplines.</p></list-item>
</list>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Disciplinary distribution of reviewed research papers.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0003.tif"/>
</fig>
<p>This disciplinary spread reinforces that while plagiarism detection is primarily a computational challenge, there is growing cross-disciplinary engagement, particularly in areas like education, publishing ethics, and AI-driven academic misconduct detection.</p>
</sec>
<sec>
<title>4.3 Evolution of detection methodologies in reviewed papers</title>
<p>Our literature analysis reveals distinct shifts in research focus across different periods:</p>
<list list-type="bullet">
<list-item><p><bold>Pre-2018</bold>: emphasis on traditional string-matching, n-gram, and citation-based approaches, widely used in early detection tools.</p></list-item>
<list-item><p><bold>Post-2018</bold>: a significant shift toward AI-powered detection, driven by:</p>
<list list-type="simple">
<list-item><p>&#x02013; The rise of deep learning (CNNs, LSTMs, transformers like BERT/GPT).</p></list-item>
<list-item><p>&#x02013; A growing need for cross-language plagiarism detection.</p></list-item>
<list-item><p>&#x02013; Concerns over AI-generated content and its detection.</p></list-item>
</list>
</list-item>
</list>
<p>This transition from surface-level text similarity to deeper semantic analysis highlights the increasing complexity of modern plagiarism cases, requiring more sophisticated detection models.</p>
</sec>
<sec>
<title>4.4 Aligning research trends with emerging needs</title>
<p>The reviewed literature reflects the evolving challenges in plagiarism detection:</p>
<list list-type="bullet">
<list-item><p>The rise of cross-language detection techniques aligns with the globalization of academic publishing.</p></list-item>
<list-item><p>The peak in 2020 corresponds with increased awareness of AI-generated text (e.g., GPT-3, Bard), emphasizing the importance of machine-learning-based plagiarism detection.</p></list-item>
<list-item><p>The shift to deep learning methods suggests a growing need for adaptive, context-aware plagiarism detection systems.</p></list-item>
</list>
<p>Overall, the quantitative insights from our reviewed papers provide a broader context for our systematic survey, demonstrating the evolution of research priorities in plagiarism detection. The statistical trends validate the transition from traditional similarity-based approaches to AI-driven, semantic plagiarism detection, highlighting the need for scalable and adaptive detection methodologies.</p>
</sec>
</sec>
<sec id="s5">
<title>5 Background and types of plagiarism</title>
<p>Plagiarism is a widespread issue that undermines academic integrity, intellectual honesty, and innovation. With the rapid growth of digital content and access to online information, plagiarism has become increasingly sophisticated, requiring equally advanced methods for detection.</p>
<sec>
<title>5.1 Types of plagiarism</title>
<p>Plagiarism manifests in various sophisticated forms, each posing unique challenges to detection and prevention in academic and research contexts. Understanding these types is crucial for developing effective detection strategies and maintaining academic integrity. <xref ref-type="supplementary-material" rid="SM1">Supplementary Image 1</xref> shows the plagiarism types in a simple diagram; each type is briefly described below:</p>
<list list-type="bullet">
<list-item><p><bold>Verbatim plagiarism:</bold> direct copying of text without changes or attribution.</p></list-item>
<list-item><p><bold>Paraphrased plagiarism:</bold> rewriting the original text while retaining the core meaning.</p></list-item>
<list-item><p><bold>Idea-based plagiarism:</bold> appropriating someone else&#x00027;s ideas or arguments without acknowledgment.</p></list-item>
<list-item><p><bold>Translation plagiarism:</bold> translating content from one language to another without citation.</p></list-item>
<list-item><p><bold>Code plagiarism:</bold> reusing source code or program logic with minimal alterations.</p></list-item>
<list-item><p><bold>AI-generated content:</bold> using AI tools like GPT or Bard to generate content without proper disclosure.</p></list-item>
</list>
<p>Plagiarism can also occur in more subtle and advanced forms:</p>
<list list-type="bullet">
<list-item><p><bold>Obfuscated plagiarism:</bold> modifying text structure or replacing key terms while retaining the original meaning (Alzahrani et al., <xref ref-type="bibr" rid="B6">2015</xref>; Gharavi et al., <xref ref-type="bibr" rid="B19">2019</xref>).</p></list-item>
<list-item><p><bold>Cross-language plagiarism:</bold> translating content across languages without credit, making detection more complex (Alzahrani and Aljuaid, <xref ref-type="bibr" rid="B5">2022</xref>; Franco-Salvador et al., <xref ref-type="bibr" rid="B16">2016a</xref>).</p></list-item>
<list-item><p><bold>Multilingual and language-independent plagiarism:</bold> extending plagiarism detection across different languages and linguistic structures (Gharavi et al., <xref ref-type="bibr" rid="B19">2019</xref>).</p></list-item>
<list-item><p><bold>Duplicate and redundant publications:</bold> republishing existing work with minor modifications to increase publication count (Benos et al., <xref ref-type="bibr" rid="B8">2005</xref>; Errami et al., <xref ref-type="bibr" rid="B15">2008</xref>; Lariviere and Gingras, <xref ref-type="bibr" rid="B25">2010</xref>).</p></list-item>
</list>
</sec>
<sec>
<title>5.2 Prevalence and impact</title>
<p>Plagiarism is a significant problem across educational institutions and professional settings. Studies suggest:</p>
<list list-type="bullet">
<list-item><p>A 2023 survey found that up to 58% of university students admitted to engaging in some form of plagiarism during their academic careers.</p></list-item>
<list-item><p>An estimated 1.5% of all published papers involve duplicate content or unethical reuse (Errami et al., <xref ref-type="bibr" rid="B15">2008</xref>).</p></list-item>
<list-item><p>In software development, code plagiarism accounts for nearly 20% of all academic misconduct cases reported by universities (Liu et al., <xref ref-type="bibr" rid="B26">2015</xref>).</p></list-item>
</list>
<p>The consequences include devaluation of academic credentials, intellectual theft, and reputational damage to institutions.</p>
</sec>
</sec>
<sec id="s6">
<title>6 Plagiarism detection methods</title>
<p>Plagiarism detection is a crucial task in academic, professional, and digital environments, safeguarding the integrity of intellectual property. Various methods have been developed to identify plagiarism types, ranging from verbatim copying to complex paraphrasing and idea plagiarism. These methods have evolved from traditional <italic>string-matching techniques</italic> to modern <italic>ML- and NLP-based</italic> approaches. The reviewed papers propose a wide range of detection methods, which we have categorized into six primary approaches:</p>
<list list-type="bullet">
<list-item><p><bold>Textual similarity-based</bold>,</p></list-item>
<list-item><p><bold>Semantic similarity-based</bold>,</p></list-item>
<list-item><p><bold>Cross-language detection</bold>,</p></list-item>
<list-item><p><bold>Machine learning and deep learning models</bold>,</p></list-item>
<list-item><p><bold>Citation and structural-based approaches</bold>,</p></list-item>
<list-item><p><bold>Code-based detection</bold>.</p></list-item>
</list>
<p>While conventional methods excel in detecting verbatim plagiarism, they often struggle with paraphrased and conceptual plagiarism. AI-driven techniques, such as <italic>deep learning and citation-based approaches</italic>, are promising but require high computational resources. By understanding these trends, this paper highlights the importance of adopting diverse detection methods tailored to different plagiarism forms and the evolving landscape of digital content creation.</p>
<sec>
<title>6.1 Textual similarity-based approaches</title>
<p>Textual similarity-based methods focus on detecting overlaps in surface-level textual features. These methods include:</p>
<sec>
<title>6.1.1 Shingle/substring matching</title>
<p>Shingle-based approaches compare overlapping subsequences of text (e.g., n-grams, q-grams) to detect similarities. Several reviewed works employ these methods, including Chekhovich and Khazov (<xref ref-type="bibr" rid="B9">2022</xref>), Turrado Garc&#x000ED;a et al. (<xref ref-type="bibr" rid="B40">2018</xref>), and Al-Thwaib et al. (<xref ref-type="bibr" rid="B3">2020</xref>). Vel&#x000E1;squez et al. (<xref ref-type="bibr" rid="B44">2016</xref>) and Malandrino et al. (<xref ref-type="bibr" rid="B27">2022</xref>) also use n-gram analysis to measure document similarity. A widely used measure is Jaccard similarity:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Jaccard Similarity</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mi>A</mml:mi><mml:mo>&#x02229;</mml:mo><mml:mi>B</mml:mi><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>A</mml:mi><mml:mo>&#x0222A;</mml:mo><mml:mi>B</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>A</italic> and <italic>B</italic> represent sets of n-grams from the compared texts. <xref ref-type="fig" rid="F4">Figure 4</xref> illustrates the division of texts into bigrams for comparison using Jaccard similarity. <xref ref-type="table" rid="T1">Table 1</xref> summarizes the key studies that use shingle/substring matching methods, including details on datasets and accuracy metrics.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Shingle/substring matching: example of dividing text into bigrams and comparing using Jaccard similarity.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0004.tif"/>
</fig>
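<p>As a rough illustration of Equation 1, the following Python sketch splits two short texts into bigrams and computes their Jaccard similarity. It assumes word-level bigrams and simple lowercase whitespace tokenization. The function names <italic>ngrams</italic> and <italic>jaccard_similarity</italic> and the example sentences are hypothetical and are not taken from any of the reviewed studies, which may tokenize and normalize text differently.</p>
<preformat>
```python
def ngrams(text, n=2):
    """Return the set of word-level n-grams (shingles) found in text.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    union = a.union(b)
    if not union:
        # Convention chosen for this sketch: two empty sets score 0.0.
        return 0.0
    return len(a.intersection(b)) / len(union)


# Hypothetical example texts that differ in a single word.
source = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over the lazy dog"

score = jaccard_similarity(ngrams(source), ngrams(suspect))
print(round(score, 3))  # 6 shared bigrams out of 10 distinct ones: 0.6
```
</preformat>
<p>Replacing one word changes two of the eight bigrams in each text, so the score falls from 1.0 to 0.6. This sensitivity to local edits is consistent with the weakness noted for shingle methods on heavily paraphrased text.</p>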
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Shingle/substring matching.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Plagiarism type</bold></th>
<th valign="top" align="left"><bold>Scope of study</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Dataset used</bold></th>
<th valign="top" align="left"><bold>Accuracy</bold></th>
<th valign="top" align="left"><bold>Applications or use cases</bold></th>
<th valign="top" align="left"><bold>Computational complexity</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Weaknesses</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Chekhovich and Khazov (<xref ref-type="bibr" rid="B9">2022</xref>)</td>
<td valign="top" align="left">Duplicate publication, text recycling</td>
<td valign="top" align="left">Russian scientific publications</td>
<td valign="top" align="left">Shingle index, Antiplagiat</td>
<td valign="top" align="left">eLIBRARY.RU</td>
<td valign="top" align="left">F1, threshold 0.66</td>
<td valign="top" align="left">Detecting duplicate scientific publications in Russian journals</td>
<td valign="top" align="left">Likely <italic>O</italic>(<italic>n</italic>log<italic>n</italic>) due to shingle index structures</td>
<td valign="top" align="left">Efficient for large-scale document comparison; widely used in Russian scientific domain</td>
<td valign="top" align="left">Shingle methods can fail with highly paraphrased text</td>
</tr> <tr>
<td valign="top" align="left">Turrado Garc&#x000ED;a et al. (<xref ref-type="bibr" rid="B40">2018</xref>)</td>
<td valign="top" align="left">Misspelled names, deduplication</td>
<td valign="top" align="left">Names datasets</td>
<td valign="top" align="left">LSH, Damerau-Levenshtein, Jaccard</td>
<td valign="top" align="left">Synthetic dataset</td>
<td valign="top" align="left">Pairwise comparisons</td>
<td valign="top" align="left">Name deduplication, detecting misspellings in databases</td>
<td valign="top" align="left">LSH reduces complexity to sublinear time; Damerau-Levenshtein is <italic>O</italic>(<italic>nm</italic>)</td>
<td valign="top" align="left">Effective for detecting name misspellings and deduplication</td>
<td valign="top" align="left">Fails in cases where names are completely altered or context is missing</td>
</tr> <tr>
<td valign="top" align="left">Al-Thwaib et al. (<xref ref-type="bibr" rid="B3">2020</xref>)</td>
<td valign="top" align="left">Verbatim, paraphrasing</td>
<td valign="top" align="left">Academic dissertations</td>
<td valign="top" align="left">N-grams, NLP</td>
<td valign="top" align="left">JUPlag corpus (2,312 dissertations)</td>
<td valign="top" align="left">No accuracy provided</td>
<td valign="top" align="left">Detecting verbatim and paraphrased plagiarism in academic writing</td>
<td valign="top" align="left">N-gram comparison typically runs in <italic>O</italic>(<italic>n</italic>), i.e., linearly in document size</td>
<td valign="top" align="left">Handles verbatim and paraphrased plagiarism well with NLP integration</td>
<td valign="top" align="left">N-gram approaches struggle with deeply obfuscated plagiarism</td>
</tr> <tr>
<td valign="top" align="left">Vel&#x000E1;squez et al. (<xref ref-type="bibr" rid="B44">2016</xref>)</td>
<td valign="top" align="left">External and intrinsic plagiarism</td>
<td valign="top" align="left">Spanish academic documents</td>
<td valign="top" align="left">Information fusion, n-grams, writing style</td>
<td valign="top" align="left">Spanish corpus, PAN-PC 2010, 2011</td>
<td valign="top" align="left">Precision 85.59%, Recall 55.6%, F1 48.24%</td>
<td valign="top" align="left">Plagiarism detection in Spanish academic documents</td>
<td valign="top" align="left">Information fusion and n-grams run in polynomial time, estimated <italic>O</italic>(<italic>n</italic><sup>2</sup>)</td>
<td valign="top" align="left">Combines multiple features for better accuracy in Spanish documents</td>
<td valign="top" align="left">Struggles with short-text plagiarism detection; recall is low</td>
</tr>
<tr>
<td valign="top" align="left">Malandrino et al. (<xref ref-type="bibr" rid="B27">2022</xref>)</td>
<td valign="top" align="left">Music plagiarism detection</td>
<td valign="top" align="left">Famous legal cases (MusicXML)</td>
<td valign="top" align="left">Meta-heuristic, clustering</td>
<td valign="top" align="left">George Washington &#x00026; Columbia Law dataset</td>
<td valign="top" align="left">Spectral clustering 97% accuracy</td>
<td valign="top" align="left">Detecting music plagiarism in legal cases</td>
<td valign="top" align="left">Meta-heuristic clustering runs in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>) for typical cases</td>
<td valign="top" align="left">High accuracy for music plagiarism detection; adaptive clustering approach</td>
<td valign="top" align="left">Method is domain-specific and may not generalize well to textual plagiarism</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>6.1.2 Syntax-based approaches</title>
<p>Syntax-based methods analyze grammatical structures to detect plagiarism, even when sentences are restructured. The methods proposed by Manzoor et al. (<xref ref-type="bibr" rid="B28">2023</xref>) and Vani and Gupta (<xref ref-type="bibr" rid="B41">2017a</xref>) parse text into syntactic components using POS tagging and then apply algorithms such as the Longest Common Subsequence (LCS):</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>L</mml:mi><mml:mi>C</mml:mi><mml:mi>S</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:mi>Y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo class="qopname">max</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mi>L</mml:mi><mml:mi>C</mml:mi><mml:mi>S</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mtext class="textrm" mathvariant="normal">if&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo class="qopname">max</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>L</mml:mi><mml:mi>C</mml:mi><mml:mi>S</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>L</mml:mi><mml:mi>C</mml:mi><mml:mi>S</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mtext class="textrm" mathvariant="normal">otherwise</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>X</italic> and <italic>Y</italic> are sequences (e.g., sentences or phrases) from two documents, and the subscripts <italic>i</italic> and <italic>j</italic> index positions within them. <xref ref-type="fig" rid="F5">Figure 5</xref> illustrates how sentences are parsed into POS tags and analyzed using the LCS algorithm to detect structural similarities. Key studies on syntax-based plagiarism detection methods are summarized in <xref ref-type="table" rid="T2">Table 2</xref>, highlighting datasets and accuracy results.</p>
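<p>The recurrence in Equation 2 can be sketched with a standard dynamic-programming table. In the Python snippet below, the POS tags are hand-assigned Penn Treebank-style labels for two hypothetical sentences, and normalizing by the longer sequence is one possible choice, not necessarily the one used in the cited studies.</p>
<preformat>
```python
def lcs_length(x, y):
    """Length of the longest common subsequence of sequences x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    # table[i][j] holds the LCS length of the prefixes x[:i] and y[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


# Hand-assigned tags (an assumption; a real pipeline would use a POS tagger).
tags_a = ["DT", "NN", "VBD", "DT", "NN"]                # "The cat chased the mouse"
tags_b = ["DT", "NN", "VBD", "VBN", "IN", "DT", "NN"]   # "The mouse was chased by the cat"

shared = lcs_length(tags_a, tags_b)
structural_similarity = shared / max(len(tags_a), len(tags_b))
print(shared, round(structural_similarity, 3))
```
</preformat>
<p>The table has one cell per pair of prefixes, so time and memory both grow with the product of the two sequence lengths.</p>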
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Syntax-based approaches: example of POS tagging and LCS matching.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0005.tif"/>
</fig>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Syntax-based approaches.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Plagiarism type</bold></th>
<th valign="top" align="left"><bold>Scope of study</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Dataset used</bold></th>
<th valign="top" align="left"><bold>Accuracy</bold></th>
<th valign="top" align="left"><bold>Applications or use cases</bold></th>
<th valign="top" align="left"><bold>Computational complexity</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Weaknesses</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Manzoor et al. (<xref ref-type="bibr" rid="B28">2023</xref>)</td>
<td valign="top" align="left">Intrinsic plagiarism detection</td>
<td valign="top" align="left">Literary, academic texts</td>
<td valign="top" align="left">Lexical, syntactic, semantic analysis, ML</td>
<td valign="top" align="left">PAN, Corpus of English Novels, Wikipedia</td>
<td valign="top" align="left">F1-score, Precision, Recall</td>
<td valign="top" align="left">Academic and literary intrinsic plagiarism detection</td>
<td valign="top" align="left">Resource requirements discussed; computational complexity not explicitly provided</td>
<td valign="top" align="left">Diverse methods including ML and deep learning improve detection robustness</td>
<td valign="top" align="left">Lack of a reference collection limits robustness and benchmarking</td>
</tr>
<tr>
<td valign="top" align="left">Vani and Gupta (<xref ref-type="bibr" rid="B42">2017b</xref>)</td>
<td valign="top" align="left">Text plagiarism</td>
<td valign="top" align="left">Academic short answers</td>
<td valign="top" align="left">POS tagging, chunking, feature selection</td>
<td valign="top" align="left">PAN12-14, PSA</td>
<td valign="top" align="left">Accuracy 97.89%, F1 0.979</td>
<td valign="top" align="left">Detection in academic short answers</td>
<td valign="top" align="left">POS tagging and chunking are typically <italic>O</italic>(<italic>n</italic>); feature selection adds extra overhead</td>
<td valign="top" align="left">High accuracy in academic short-answer plagiarism detection</td>
<td valign="top" align="left">Performance depends on feature engineering; domain-specific</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec>
<title>6.2 Semantic similarity-based approaches</title>
<p>Semantic similarity methods detect plagiarism by analyzing the meaning behind words, going beyond surface-level text comparison. These methods are crucial for detecting paraphrased content and are divided into concept-based approaches and word embedding models.</p>
<sec>
<title>6.2.1 Concept-based approaches</title>
<p>Concept-based methods use semantic role labeling (SRL), named entity recognition (NER), and linguistic knowledge to detect idea plagiarism (Taufiq et al., <xref ref-type="bibr" rid="B39">2023</xref>; Vani and Gupta, <xref ref-type="bibr" rid="B42">2017b</xref>; Abdi et al., <xref ref-type="bibr" rid="B1">2015</xref>). These approaches combine semantic and syntactic similarity to detect deeper textual relationships. A common metric used is Wu-Palmer Similarity:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Wu-Palmer Similarity</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mtext class="textrm" mathvariant="normal">depth</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext class="textrm" mathvariant="normal">LCS</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">depth</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mtext class="textrm" mathvariant="normal">depth</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>w</italic><sub>1</sub> and <italic>w</italic><sub>2</sub> are the two words being compared, depth denotes the depth of a node in the semantic hierarchy, and LCS refers to their least common subsumer, i.e., the deepest ancestor that both words share (not to be confused with the Longest Common Subsequence of Equation 2). <xref ref-type="fig" rid="F6">Figure 6</xref> demonstrates the comparison of two concepts through Wu-Palmer similarity, including the identification of their least common subsumer. <xref ref-type="table" rid="T3">Table 3</xref> summarizes the major studies on concept-based plagiarism detection approaches, with attention to the datasets and reported accuracy metrics.</p>
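<p>To make Equation 3 concrete, the following Python sketch evaluates Wu-Palmer similarity on a small hand-built taxonomy. The hierarchy, the node names, and the convention that the root has depth 1 are assumptions made for illustration; lexical resources such as WordNet are far larger and allow multiple parents per concept.</p>
<preformat>
```python
# Toy is-a hierarchy (hypothetical): each node maps to its single parent.
PARENT = {
    "entity": None,
    "animal": "entity",
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
}


def path_to_root(node):
    """Nodes from the given node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1 (assumed convention)."""
    return len(path_to_root(node))


def least_common_subsumer(w1, w2):
    """Deepest ancestor shared by both nodes."""
    ancestors_w2 = set(path_to_root(w2))
    for node in path_to_root(w1):
        if node in ancestors_w2:
            return node
    raise ValueError("nodes do not share a root")


def wu_palmer(w1, w2):
    """Wu-Palmer similarity as defined in Equation 3."""
    subsumer = least_common_subsumer(w1, w2)
    return 2 * depth(subsumer) / (depth(w1) + depth(w2))


print(wu_palmer("dog", "cat"))      # 0.75 (subsumer "mammal", depth 3)
print(wu_palmer("dog", "sparrow"))  # 0.5  (subsumer "animal", depth 2)
```
</preformat>
<p>Concepts that meet lower in the hierarchy score higher, which is why the two mammals are rated closer to each other than the dog is to the sparrow.</p>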
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Concept-based approaches: Wu-Palmer similarity between two concepts in a semantic hierarchy.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0006.tif"/>
</fig>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Concept-based approaches.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Plagiarism type</bold></th>
<th valign="top" align="left"><bold>Scope of study</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Dataset used</bold></th>
<th valign="top" align="left"><bold>Accuracy</bold></th>
<th valign="top" align="left"><bold>Applications or use cases</bold></th>
<th valign="top" align="left"><bold>Computational complexity</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Weaknesses</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Manzoor et al. (<xref ref-type="bibr" rid="B28">2023</xref>)</td>
<td valign="top" align="left">Intrinsic plagiarism detection</td>
<td valign="top" align="left">Literary, academic texts</td>
<td valign="top" align="left">Lexical, syntactic, semantic analysis, ML</td>
<td valign="top" align="left">PAN, Corpus of English Novels, Wikipedia</td>
<td valign="top" align="left">F1-score, Precision, Recall</td>
<td valign="top" align="left">Academic and literary intrinsic plagiarism detection</td>
<td valign="top" align="left">Resource requirements discussed; computational complexity not explicitly provided</td>
<td valign="top" align="left">Diverse methods including ML and deep learning improve detection robustness</td>
<td valign="top" align="left">Lack of a reference collection limits robustness and benchmarking</td>
</tr>
<tr>
<td valign="top" align="left">Vani and Gupta (<xref ref-type="bibr" rid="B42">2017b</xref>)</td>
<td valign="top" align="left">Text plagiarism</td>
<td valign="top" align="left">Academic short answers</td>
<td valign="top" align="left">POS tagging, chunking, feature selection</td>
<td valign="top" align="left">PAN12-14, PSA</td>
<td valign="top" align="left">Accuracy 97.89%, F1 0.979</td>
<td valign="top" align="left">Detection in academic short answers</td>
<td valign="top" align="left">POS tagging and chunking are typically <italic>O</italic>(<italic>n</italic>); feature selection adds extra overhead</td>
<td valign="top" align="left">High accuracy in academic short-answer plagiarism detection</td>
<td valign="top" align="left">Performance depends on feature engineering; domain-specific</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>6.2.2 Word embedding models</title>
<p>Word embedding models, such as Word2Vec, BERT, and GPT-based transformers, allow for nuanced detection by capturing semantic and, in the case of transformer models, contextual meaning (Mehak et al., <xref ref-type="bibr" rid="B29">2023</xref>; Alzahrani et al., <xref ref-type="bibr" rid="B6">2015</xref>; Darwish et al., <xref ref-type="bibr" rid="B10">2023</xref>). Cosine similarity is often used to measure similarity in an embedding space:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Cosine Similarity</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:msqrt><mml:mo>&#x000D7;</mml:mo><mml:msqrt><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>A</italic> and <italic>B</italic> are the embedding vectors of the two texts being compared, <italic>A</italic><sub><italic>i</italic></sub> and <italic>B</italic><sub><italic>i</italic></sub> are their <italic>i</italic>-th components, and <italic>n</italic> is the dimensionality of the embedding space. <xref ref-type="fig" rid="F7">Figure 7</xref> demonstrates how cosine similarity is calculated for word embedding vectors to capture semantic relationships. A summary of key studies utilizing word embedding models for plagiarism detection, along with their datasets and accuracy metrics, is provided in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
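<p>Equation 4 translates directly into a few lines of Python. The sketch below uses plain lists in place of real embedding vectors; the function name and the choice to return 0 for a zero vector are assumptions, since the cosine is undefined in that case.</p>
<preformat>
```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (Equation 4)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical low-dimensional vectors, for illustration only.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```
</preformat>
<p>Because both vectors are normalized by their lengths, the score reflects direction only, so two texts of very different lengths can still be rated as similar.</p>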
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Word embedding models: cosine similarity between word vectors in a high-dimensional space.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0007.tif"/>
</fig>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Word embedding models.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Plagiarism type</bold></th>
<th valign="top" align="left"><bold>Scope of study</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Dataset used</bold></th>
<th valign="top" align="left"><bold>Accuracy</bold></th>
<th valign="top" align="left"><bold>Applications or use cases</bold></th>
<th valign="top" align="left"><bold>Computational complexity</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Weaknesses</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Mehak et al. (<xref ref-type="bibr" rid="B29">2023</xref>)</td>
<td valign="top" align="left">Text Reuse (Phrasal)</td>
<td valign="top" align="left">Urdu language content</td>
<td valign="top" align="left">Sentence Transformer, N-gram, embeddings</td>
<td valign="top" align="left">UTRD-Phr-23</td>
<td valign="top" align="left">F1-score &#x0007E;0.63</td>
<td valign="top" align="left">Detecting text reuse in Urdu content</td>
<td valign="top" align="left">Sentence Transformer runs in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>), N-grams in <italic>O</italic>(<italic>n</italic>)</td>
<td valign="top" align="left">Adapts well to Urdu language-specific text reuse detection</td>
<td valign="top" align="left">Lower F1-score suggests room for improvement in embedding effectiveness</td>
</tr> <tr>
<td valign="top" align="left">Alzahrani et al. (<xref ref-type="bibr" rid="B6">2015</xref>)</td>
<td valign="top" align="left">Obfuscated plagiarism</td>
<td valign="top" align="left">Academic and web texts</td>
<td valign="top" align="left">Fuzzy semantic similarity, WordNet</td>
<td valign="top" align="left">PAN-PC-09, PAN-PC-10, Microsoft Paraphrase</td>
<td valign="top" align="left">Precision 0.9178, Recall 0.6933</td>
<td valign="top" align="left">Detecting obfuscated plagiarism across different text sources</td>
<td valign="top" align="left">Fuzzy semantic similarity and WordNet traversal run in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>)</td>
<td valign="top" align="left">Highly effective for uncovering obfuscated plagiarism</td>
<td valign="top" align="left">WordNet-based approaches depend on lexicon availability and coverage</td>
</tr> <tr>
<td valign="top" align="left">Darwish et al. (<xref ref-type="bibr" rid="B10">2023</xref>)</td>
<td valign="top" align="left">Semantic plagiarism</td>
<td valign="top" align="left">Summary obfuscation</td>
<td valign="top" align="left">Quantum genetic algorithm, WordNet</td>
<td valign="top" align="left">PAN13-14 dataset</td>
<td valign="top" align="left">F-score improved by 10%</td>
<td valign="top" align="left">Handling summary obfuscation in academic plagiarism detection</td>
<td valign="top" align="left">Quantum genetic algorithm has high computational overhead (<italic>O</italic>(<italic>n</italic><sup>2</sup>))</td>
<td valign="top" align="left">Shows improvement in handling summary obfuscation cases</td>
<td valign="top" align="left">Quantum methods may require specialized resources for scalability</td>
</tr> <tr>
<td valign="top" align="left">Sahi and Gupta (<xref ref-type="bibr" rid="B35">2017</xref>)</td>
<td valign="top" align="left">Verbatim, paraphrasing</td>
<td valign="top" align="left">Academic papers</td>
<td valign="top" align="left">Semantic-syntactic analysis</td>
<td valign="top" align="left">PAN-PC-11</td>
<td valign="top" align="left">F1 0.837, Plagdet 0.836</td>
<td valign="top" align="left">Detecting verbatim and paraphrased plagiarism in academic texts</td>
<td valign="top" align="left">Semantic-syntactic analysis runs in <italic>O</italic>(<italic>n</italic><sup>2</sup>) for pairwise comparisons</td>
<td valign="top" align="left">Good balance of semantic and syntactic analysis for plagiarism detection</td>
<td valign="top" align="left">Computationally expensive for large datasets</td>
</tr> <tr>
<td valign="top" align="left">Alvi et al. (<xref ref-type="bibr" rid="B4">2021</xref>)</td>
<td valign="top" align="left">Paraphrase plagiarism</td>
<td valign="top" align="left">Academic short answers</td>
<td valign="top" align="left">Context matching, embeddings, Smith-Waterman</td>
<td valign="top" align="left">Corpus of plagiarized short answers</td>
<td valign="top" align="left">F1 0.905, 0.802</td>
<td valign="top" align="left">Paraphrased plagiarism detection in short academic answers</td>
<td valign="top" align="left">Smith-Waterman algorithm runs in <italic>O</italic>(<italic>nm</italic>), embeddings in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>)</td>
<td valign="top" align="left">Performs well on paraphrased plagiarism detection</td>
<td valign="top" align="left">Accuracy depends on the quality of paraphrase embeddings</td>
</tr>
<tr>
<td valign="top" align="left">Gharavi et al. (<xref ref-type="bibr" rid="B19">2019</xref>)</td>
<td valign="top" align="left">Obfuscation types</td>
<td valign="top" align="left">Multilingual plagiarism</td>
<td valign="top" align="left">Embedding-based, cosine, Jaccard</td>
<td valign="top" align="left">PAN-PC-2013, PersianPlagDet2016, custom Arabic</td>
<td valign="top" align="left">Plagdet &#x0007E;79.9%</td>
<td valign="top" align="left">Multilingual plagiarism detection with embedding-based methods</td>
<td valign="top" align="left">Embedding-based similarity runs in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>), Jaccard in <italic>O</italic>(<italic>n</italic>)</td>
<td valign="top" align="left">Handles multilingual plagiarism effectively</td>
<td valign="top" align="left">Scalability remains a challenge with larger datasets</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec>
<title>6.3 Cross-language and multilingual approaches</title>
<p>Cross-language plagiarism detection methods address the challenge of identifying plagiarism in translated texts. These include:</p>
<list list-type="bullet">
<list-item><p><bold>Multilingual embedding models</bold>: these models map words from different languages into a shared vector space so that they can be compared directly. This approach is seen in the works by Glava&#x00161; et al. (<xref ref-type="bibr" rid="B21">2018</xref>), Alzahrani and Aljuaid (<xref ref-type="bibr" rid="B5">2022</xref>), and Roostaee et al. (<xref ref-type="bibr" rid="B34">2020</xref>), where BERT and other cross-lingual models help to detect plagiarism between language pairs such as Spanish and English. As shown in <xref ref-type="fig" rid="F8">Figure 8</xref>, words from different languages, including Spanish and English, are placed in the same space and compared using cosine similarity. <xref ref-type="table" rid="T5">Table 5</xref> summarizes key studies using multilingual embedding models for cross-lingual plagiarism detection, with information on methods, datasets, and accuracy.</p></list-item>
<list-item><p><bold>Cross-language detection</bold>: methods by Ehsan and Shakery (<xref ref-type="bibr" rid="B11">2016</xref>) and Ehsan et al. (<xref ref-type="bibr" rid="B12">2018</xref>) use translation-based approaches and dynamic text alignment to detect plagiarism between different languages, such as German-English and Spanish-English pairs. <xref ref-type="fig" rid="F9">Figure 9</xref> depicts the procedure for cross-language plagiarism detection using cosine similarity between Spanish and English texts. <xref ref-type="table" rid="T6">Table 6</xref> gives an overview of the methods, datasets, and performance metrics from various studies on cross-language plagiarism detection.</p></list-item>
<list-item><p><bold>Knowledge graphs and embedding models</bold>: these hybrid methods combine structured semantic networks (knowledge graphs) with the flexibility of embedding models to detect plagiarism across languages. Knowledge graphs provide a structural representation of concepts and their relationships, while embeddings map those concepts into a continuous vector space in which texts in different languages can be compared. This combination allows the methods to handle paraphrasing and translation-based plagiarism. Franco-Salvador et al. (<xref ref-type="bibr" rid="B16">2016a</xref>,<xref ref-type="bibr" rid="B17">b</xref>) pioneered this hybrid approach with methods such as KBSim (Knowledge-Based Similarity) and XCNN for detecting plagiarism between Spanish-English and German-English texts. <xref ref-type="table" rid="T7">Table 7</xref> gives a detailed overview of key studies using knowledge graphs and embedding models for cross-lingual plagiarism detection, covering methods, datasets, and accuracy metrics.</p></list-item>
</list>
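<p>The shared-space idea behind the multilingual embedding models listed above can be sketched in Python as follows. The three-dimensional vectors are invented stand-ins for a trained cross-lingual model, and averaging word vectors into a sentence vector is a simple baseline assumed here, not the method of any cited study.</p>
<preformat>
```python
import math

# Hypothetical vectors: translation pairs are placed close together,
# as a trained cross-lingual embedding model would aim to do.
SHARED_SPACE = {
    "cat": [1.0, 0.0, 0.0], "gato": [0.9, 0.1, 0.0],
    "house": [0.0, 1.0, 0.0], "casa": [0.1, 0.9, 0.0],
    "eats": [0.0, 0.0, 1.0], "come": [0.0, 0.1, 0.9],
}


def sentence_vector(sentence):
    """Average the vectors of the known words; unknown words are skipped."""
    words = sentence.lower().split()
    vectors = [SHARED_SPACE[w] for w in words if w in SHARED_SPACE]
    if not vectors:
        raise ValueError("no known words in sentence")
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


english = sentence_vector("the cat eats")
translated = cosine_similarity(english, sentence_vector("el gato come"))
unrelated = cosine_similarity(english, sentence_vector("la casa"))
print(round(translated, 3), round(unrelated, 3))
```
</preformat>
<p>In this toy setting the Spanish translation scores far higher than the unrelated phrase even though the two sentences share no surface tokens, which is the property that cross-lingual detection relies on.</p>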
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Multilingual embedding models: mapping words from different languages into a shared vector space for similarity comparison.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0008.tif"/>
</fig>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Multilingual embedding models.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Plagiarism type</bold></th>
<th valign="top" align="left"><bold>Scope of study</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Dataset used</bold></th>
<th valign="top" align="left"><bold>Accuracy</bold></th>
<th valign="top" align="left"><bold>Applications or use cases</bold></th>
<th valign="top" align="left"><bold>Computational complexity</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Weaknesses</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Glava&#x00161; et al. (<xref ref-type="bibr" rid="B21">2018</xref>)</td>
<td valign="top" align="left">Cross-lingual plagiarism detection</td>
<td valign="top" align="left">Spanish-English academic papers</td>
<td valign="top" align="left">Cross-lingual word embeddings</td>
<td valign="top" align="left">PAN-PC-11</td>
<td valign="top" align="left">R&#x00040;1 = 89.5%, R&#x00040;10 = 94%</td>
<td valign="top" align="left">Detection of cross-lingual plagiarism in academic texts</td>
<td valign="top" align="left">Cross-lingual word embeddings run in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>)</td>
<td valign="top" align="left">High recall for cross-lingual plagiarism detection</td>
<td valign="top" align="left">Limited to language pairs seen during training</td>
</tr> <tr>
<td valign="top" align="left">Alzahrani and Aljuaid (<xref ref-type="bibr" rid="B5">2022</xref>)</td>
<td valign="top" align="left">Cross-lingual plagiarism detection</td>
<td valign="top" align="left">Arabic-English academic texts</td>
<td valign="top" align="left">Deep learning, ML</td>
<td valign="top" align="left">Custom corpus</td>
<td valign="top" align="left">Accuracy &#x0007E;97%</td>
<td valign="top" align="left">Identifying cross-lingual plagiarism in Arabic-English texts</td>
<td valign="top" align="left">Deep learning models run in <italic>O</italic>(<italic>n</italic><sup>2</sup>) for training, <italic>O</italic>(<italic>n</italic>) for inference</td>
<td valign="top" align="left">Achieves high accuracy for Arabic-English cross-lingual plagiarism detection</td>
<td valign="top" align="left">Requires large annotated cross-lingual datasets for effective performance</td>
</tr>
<tr>
<td valign="top" align="left">Roostaee et al. (<xref ref-type="bibr" rid="B34">2020</xref>)</td>
<td valign="top" align="left">Cross-lingual plagiarism detection</td>
<td valign="top" align="left">Multilingual academic texts</td>
<td valign="top" align="left">Vector space models, embeddings</td>
<td valign="top" align="left">PAN-PC-11, PAN-PC-12, SemEval</td>
<td valign="top" align="left">Plagdet 0.720, 0.769</td>
<td valign="top" align="left">Multilingual plagiarism detection across academic datasets</td>
<td valign="top" align="left">Vector space models and embeddings operate in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>)</td>
<td valign="top" align="left">Effective across multiple languages and academic text domains</td>
<td valign="top" align="left">Plagdet scores indicate potential room for improvement in precision</td>
</tr></tbody>
</table>
</table-wrap>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Cross-language detection: translating text between languages and detecting plagiarism using similarity methods.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0009.tif"/>
</fig>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Cross-language detection.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Plagiarism type</bold></th>
<th valign="top" align="left"><bold>Scope of study</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Dataset used</bold></th>
<th valign="top" align="left"><bold>Accuracy</bold></th>
<th valign="top" align="left"><bold>Applications or use cases</bold></th>
<th valign="top" align="left"><bold>Computational complexity</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Weaknesses</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Ehsan and Shakery (<xref ref-type="bibr" rid="B11">2016</xref>)</td>
<td valign="top" align="left">Cross-lingual plagiarism detection</td>
<td valign="top" align="left">Cross-lingual documents</td>
<td valign="top" align="left">Proximity-based retrieval, topic segmentation</td>
<td valign="top" align="left">PAN-PC-12</td>
<td valign="top" align="left">F2 score 0.6703</td>
<td valign="top" align="left">Detecting cross-lingual plagiarism in textual data</td>
<td valign="top" align="left">Proximity-based retrieval runs in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>); topic segmentation depends on document length</td>
<td valign="top" align="left">Effective for detecting cross-lingual similarity using topic segmentation</td>
<td valign="top" align="left">Performance depends on language-specific topic segmentation accuracy</td>
</tr>
<tr>
<td valign="top" align="left">Ehsan et al. (<xref ref-type="bibr" rid="B12">2018</xref>)</td>
<td valign="top" align="left">Cross-lingual plagiarism detection</td>
<td valign="top" align="left">Academic papers</td>
<td valign="top" align="left">Dictionary-based, dynamic alignment</td>
<td valign="top" align="left">PAN-PC-12</td>
<td valign="top" align="left">Plagdet 0.863</td>
<td valign="top" align="left">Identifying cross-lingual plagiarism in academic literature</td>
<td valign="top" align="left">Dictionary-based methods operate in <italic>O</italic>(<italic>n</italic>), dynamic alignment runs in <italic>O</italic>(<italic>n</italic><sup>2</sup>)</td>
<td valign="top" align="left">High accuracy in academic cross-lingual plagiarism detection</td>
<td valign="top" align="left">Relies on the availability of quality bilingual dictionaries</td>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Knowledge graphs and embedding models.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Plagiarism type</bold></th>
<th valign="top" align="left"><bold>Scope of study</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Dataset used</bold></th>
<th valign="top" align="left"><bold>Accuracy</bold></th>
<th valign="top" align="left"><bold>Applications or use cases</bold></th>
<th valign="top" align="left"><bold>Computational complexity</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Weaknesses</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Franco-Salvador et al. (<xref ref-type="bibr" rid="B16">2016a</xref>)</td>
<td valign="top" align="left">Cross-language plagiarism detection</td>
<td valign="top" align="left">Cross-lingual academic texts</td>
<td valign="top" align="left">Hybrid models, knowledge graphs (KBSim, XCNN)</td>
<td valign="top" align="left">PAN-PC-11 (Spanish-English, German-English)</td>
<td valign="top" align="left">Plagdet &#x0007E;0.64</td>
<td valign="top" align="left">Detecting cross-lingual plagiarism in academic texts</td>
<td valign="top" align="left">Knowledge graph similarity is <italic>O</italic>(<italic>n</italic><sup>2</sup>); hybrid models improve efficiency</td>
<td valign="top" align="left">Integrates deep learning with structured knowledge for better detection</td>
<td valign="top" align="left">Performance depends on the completeness of the knowledge graph</td>
</tr>
<tr>
<td valign="top" align="left">Franco-Salvador et al. (<xref ref-type="bibr" rid="B17">2016b</xref>)</td>
<td valign="top" align="left">Cross-language plagiarism (paraphrasing)</td>
<td valign="top" align="left">Cross-lingual academic texts</td>
<td valign="top" align="left">Cross-language Knowledge Graph Analysis</td>
<td valign="top" align="left">PAN-PC-10, PAN-PC-11</td>
<td valign="top" align="left">Plagdet &#x0007E;0.663</td>
<td valign="top" align="left">Detecting paraphrased plagiarism across languages</td>
<td valign="top" align="left">Knowledge graph analysis operates in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>) for retrieval, <italic>O</italic>(<italic>n</italic><sup>2</sup>) for entity linking</td>
<td valign="top" align="left">Effective for capturing semantic relationships across languages</td>
<td valign="top" align="left">Requires extensive multilingual knowledge bases for high accuracy</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>6.4 Machine learning and deep learning approaches</title>
<p><bold>Traditional machine learning models</bold> such as SVMs and Random Forest classifiers have been widely used for text classification (Hussain and Suryani, <xref ref-type="bibr" rid="B23">2015</xref>; Polydouri et al., <xref ref-type="bibr" rid="B31">2018</xref>; El-Rashidy et al., <xref ref-type="bibr" rid="B13">2022</xref>). <xref ref-type="fig" rid="F10">Figure 10</xref> depicts the typical workflow: features are extracted from the text and passed to an SVM classifier, which produces the final plagiarism decision. <xref ref-type="table" rid="T8">Table 8</xref> summarizes key studies that apply traditional machine learning models to plagiarism detection, together with their methods and performance metrics.</p>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>Traditional machine learning: feature extraction and classification using an SVM.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0010.tif"/>
</fig>
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption><p>Traditional machine learning approaches.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Plagiarism type</bold></th>
<th valign="top" align="left"><bold>Scope of study</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Dataset used</bold></th>
<th valign="top" align="left"><bold>Accuracy</bold></th>
<th valign="top" align="left"><bold>Applications or use cases</bold></th>
<th valign="top" align="left"><bold>Computational complexity</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Weaknesses</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Hussain and Suryani (<xref ref-type="bibr" rid="B23">2015</xref>)</td>
<td valign="top" align="left">Intelligent plagiarism</td>
<td valign="top" align="left">Academic papers</td>
<td valign="top" align="left">&#x003C7;-Sim, SVM</td>
<td valign="top" align="left">Custom dataset</td>
<td valign="top" align="left">PI 48.23%</td>
<td valign="top" align="left">Detecting text similarity in academic settings</td>
<td valign="top" align="left">SVM runs in <italic>O</italic>(<italic>n</italic><sup>2</sup>) for training, <italic>O</italic>(<italic>n</italic>) for inference</td>
<td valign="top" align="left">Efficient for detecting text similarity in academic settings</td>
<td valign="top" align="left">Performance is limited by feature extraction quality</td>
</tr> <tr>
<td valign="top" align="left">Polydouri et al. (<xref ref-type="bibr" rid="B31">2018</xref>)</td>
<td valign="top" align="left">Intrinsic plagiarism</td>
<td valign="top" align="left">Academic papers</td>
<td valign="top" align="left">Supervised ML, stylometric features</td>
<td valign="top" align="left">PAN 2009, 2011, 2016</td>
<td valign="top" align="left">F1-score 0.43 (Random Forest)</td>
<td valign="top" align="left">Intrinsic plagiarism detection based on writing style</td>
<td valign="top" align="left">Random Forest has <italic>O</italic>(<italic>n</italic>log<italic>n</italic>) training complexity</td>
<td valign="top" align="left">Can detect intrinsic plagiarism using writing style analysis</td>
<td valign="top" align="left">Accuracy depends on sufficient stylistic variation in text</td>
</tr>
<tr>
<td valign="top" align="left">El-Rashidy et al. (<xref ref-type="bibr" rid="B13">2022</xref>)</td>
<td valign="top" align="left">Lexical, syntactic, semantic plagiarism</td>
<td valign="top" align="left">Academic texts</td>
<td valign="top" align="left">SVM, Chi-square feature selection</td>
<td valign="top" align="left">PAN 2012, 2013, 2014</td>
<td valign="top" align="left">F1 89.34%, 92.95%</td>
<td valign="top" align="left">Plagiarism detection using supervised ML techniques</td>
<td valign="top" align="left">Chi-square selection is <italic>O</italic>(<italic>n</italic><sup>2</sup>), SVM training is <italic>O</italic>(<italic>n</italic><sup>2</sup>), inference is <italic>O</italic>(<italic>n</italic>)</td>
<td valign="top" align="left">High accuracy across multiple datasets; effective for multiple plagiarism types</td>
<td valign="top" align="left">Computational overhead can be high for large-scale data</td>
</tr></tbody>
</table>
</table-wrap>
<p><bold>Deep learning models</bold>, including LSTMs, CNNs, and Transformers, offer powerful tools for detecting subtler forms of plagiarism, such as paraphrasing. Shahmohammadi et al. (<xref ref-type="bibr" rid="B36">2020</xref>), Hayawi et al. (<xref ref-type="bibr" rid="B22">2023</xref>), Suman et al. (<xref ref-type="bibr" rid="B38">2021</xref>), Agarwal et al. (<xref ref-type="bibr" rid="B2">2018</xref>), Romanov et al. (<xref ref-type="bibr" rid="B33">2021</xref>), El-Rashidy et al. (<xref ref-type="bibr" rid="B14">2024</xref>), Shakeel et al. (<xref ref-type="bibr" rid="B37">2020</xref>), and Iqbal et al. (<xref ref-type="bibr" rid="B24">2024</xref>) apply these neural networks to plagiarism detection and closely related tasks. <xref ref-type="fig" rid="F11">Figure 11</xref> depicts an LSTM network processing an input sequence, such as a sentence, to predict plagiarism. <xref ref-type="table" rid="T9">Table 9</xref> summarizes key studies that employ deep learning techniques, including their methods, datasets, and performance metrics.</p>
<fig id="F11" position="float">
<label>Figure 11</label>
<caption><p>Deep learning: LSTM network processing an input sequence and predicting plagiarism.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0011.tif"/>
</fig>
<table-wrap position="float" id="T9">
<label>Table 9</label>
<caption><p>Deep learning approaches.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Plagiarism type</bold></th>
<th valign="top" align="left"><bold>Scope of study</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Dataset used</bold></th>
<th valign="top" align="left"><bold>Accuracy</bold></th>
<th valign="top" align="left"><bold>Applications or use cases</bold></th>
<th valign="top" align="left"><bold>Computational complexity</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Weaknesses</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Shahmohammadi et al. (<xref ref-type="bibr" rid="B36">2020</xref>)</td>
<td valign="top" align="left">Paraphrase detection</td>
<td valign="top" align="left">Paraphrase in question pairs (NLP)</td>
<td valign="top" align="left">Bi-LSTM, handcrafted features</td>
<td valign="top" align="left">MSRP, Quora</td>
<td valign="top" align="left">Accuracy 79.2%, F1 85.4%</td>
<td valign="top" align="left">Detecting paraphrase similarity in NLP tasks</td>
<td valign="top" align="left">Bi-LSTM runs in <italic>O</italic>(<italic>n</italic><sup>2</sup>); handcrafted feature extraction adds processing overhead</td>
<td valign="top" align="left">Effective in handling paraphrased text with deep learning</td>
<td valign="top" align="left">Handcrafted features require extensive domain expertise</td>
</tr> <tr>
<td valign="top" align="left">Hayawi et al. (<xref ref-type="bibr" rid="B22">2023</xref>)</td>
<td valign="top" align="left">AI-generated text</td>
<td valign="top" align="left">Human/AI-generated essays, code</td>
<td valign="top" align="left">Random Forest, SVM, LSTM</td>
<td valign="top" align="left">GPT, BARD texts</td>
<td valign="top" align="left">Accuracy 95.74%</td>
<td valign="top" align="left">Detecting AI-generated text in academic writing and coding</td>
<td valign="top" align="left">LSTM runs in <italic>O</italic>(<italic>n</italic><sup>2</sup>); SVM and RF scale with dataset size</td>
<td valign="top" align="left">High accuracy in distinguishing AI-generated text</td>
<td valign="top" align="left">Performance varies with emerging AI text generators</td>
</tr> <tr>
<td valign="top" align="left">Suman et al. (<xref ref-type="bibr" rid="B38">2021</xref>)</td>
<td valign="top" align="left">Author profiling</td>
<td valign="top" align="left">Twitter data (text and images)</td>
<td valign="top" align="left">BERT, EfficientNet</td>
<td valign="top" align="left">PAN-2018</td>
<td valign="top" align="left">Accuracy 89.53%</td>
<td valign="top" align="left">Social media analytics and multimodal author profiling</td>
<td valign="top" align="left">No detailed discussion on computational efficiency, but BERT typically runs in <italic>O</italic>(<italic>n</italic><sup>2</sup>), EfficientNet in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>)</td>
<td valign="top" align="left">Use of BERT and EfficientNet enables effective multimodal profiling</td>
<td valign="top" align="left">Does not address limitations in handling diverse user behaviors</td>
</tr> <tr>
<td valign="top" align="left">Agarwal et al. (<xref ref-type="bibr" rid="B2">2018</xref>)</td>
<td valign="top" align="left">Paraphrase detection</td>
<td valign="top" align="left">User-generated texts</td>
<td valign="top" align="left">CNN &#x0002B; RNN</td>
<td valign="top" align="left">Microsoft Paraphrase Corpus</td>
<td valign="top" align="left">F1-score 84.5%</td>
<td valign="top" align="left">Identifying paraphrased content in online discussions</td>
<td valign="top" align="left">CNN in <italic>O</italic>(<italic>n</italic>), RNN in <italic>O</italic>(<italic>n</italic><sup>2</sup>)</td>
<td valign="top" align="left">Effective combination of CNN and RNN for paraphrase detection</td>
<td valign="top" align="left">Training requires significant computational power</td>
</tr> <tr>
<td valign="top" align="left">Romanov et al. (<xref ref-type="bibr" rid="B33">2021</xref>)</td>
<td valign="top" align="left">Authorship identification</td>
<td valign="top" align="left">Russian literary texts</td>
<td valign="top" align="left">SVM, LSTM, CNN, Transformer</td>
<td valign="top" align="left">Moshkov library</td>
<td valign="top" align="left">96% (SVM), 94% (CNN), 87% (LSTM), 93% (Transformer)</td>
<td valign="top" align="left">Forensic linguistics and author verification</td>
<td valign="top" align="left">Transformer operates in <italic>O</italic>(<italic>n</italic><sup>2</sup>), CNN in <italic>O</italic>(<italic>n</italic>), LSTM in <italic>O</italic>(<italic>n</italic><sup>2</sup>)</td>
<td valign="top" align="left">High accuracy in authorship identification</td>
<td valign="top" align="left">Transformer-based models require extensive training data</td>
</tr> <tr>
<td valign="top" align="left">El-Rashidy et al. (<xref ref-type="bibr" rid="B14">2024</xref>)</td>
<td valign="top" align="left">Lexical, syntactic, semantic plagiarism</td>
<td valign="top" align="left">Academic texts</td>
<td valign="top" align="left">LSTM, DenseNet</td>
<td valign="top" align="left">PAN 2013, PAN 2014</td>
<td valign="top" align="left">Plagdet 89.81%, 93.92%</td>
<td valign="top" align="left">Detecting different types of plagiarism in academic texts</td>
<td valign="top" align="left">LSTM runs in <italic>O</italic>(<italic>n</italic><sup>2</sup>), DenseNet operates in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>)</td>
<td valign="top" align="left">Effective for detecting multiple types of plagiarism</td>
<td valign="top" align="left">Computationally intensive for large datasets</td>
</tr> <tr>
<td valign="top" align="left">Shakeel et al. (<xref ref-type="bibr" rid="B37">2020</xref>)</td>
<td valign="top" align="left">Paraphrase detection</td>
<td valign="top" align="left">Short text paraphrase</td>
<td valign="top" align="left">CNN, LSTM, data augmentation</td>
<td valign="top" align="left">Quora, MSRP, SemEval</td>
<td valign="top" align="left">F1-score 75.4%, 84.8%</td>
<td valign="top" align="left">Improving paraphrase detection using data augmentation</td>
<td valign="top" align="left">CNN in <italic>O</italic>(<italic>n</italic>), LSTM in <italic>O</italic>(<italic>n</italic><sup>2</sup>)</td>
<td valign="top" align="left">Improved accuracy with data augmentation techniques</td>
<td valign="top" align="left">Requires a large amount of augmented training data</td>
</tr>
<tr>
<td valign="top" align="left">Iqbal et al. (<xref ref-type="bibr" rid="B24">2024</xref>)</td>
<td valign="top" align="left">Paraphrase detection</td>
<td valign="top" align="left">Urdu texts</td>
<td valign="top" align="left">DNN (D-TRAPPD, WENGO)</td>
<td valign="top" align="left">SUSPC, UPPC, USTRC</td>
<td valign="top" align="left">F1-score 96.80%, 87.85%</td>
<td valign="top" align="left">Detecting Urdu text reuse and plagiarism</td>
<td valign="top" align="left">DNN runs in <italic>O</italic>(<italic>n</italic><sup>2</sup>) complexity</td>
<td valign="top" align="left">High accuracy in Urdu paraphrase detection</td>
<td valign="top" align="left">Deep models require large labeled datasets</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>6.5 Structural and citation-based approaches</title>
<p>Structural and citation-based approaches focus on how documents are organized or how citations are reused. These methods are particularly effective in academic and research-based plagiarism detection. Gipp et al. (<xref ref-type="bibr" rid="B20">2014</xref>), Pertile et al. (<xref ref-type="bibr" rid="B30">2015</xref>), and Vani and Gupta (<xref ref-type="bibr" rid="B43">2018</xref>) employ citation pattern analysis to track citation reuse and bibliographic coupling, while structural methods look at document organization to detect anomalies. As shown in <xref ref-type="fig" rid="F12">Figure 12</xref>, bibliographic coupling highlights shared references between two documents, helping to visualize the overlap in citation patterns. The key studies that utilize citation-based approaches for plagiarism detection, along with their corresponding datasets and accuracy metrics, are summarized in <xref ref-type="table" rid="T10">Table 10</xref>.</p>
<fig id="F12" position="float">
<label>Figure 12</label>
<caption><p>Citation pattern analysis: bibliographic coupling between two documents based on shared references.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0012.tif"/>
</fig>
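<p>As a minimal sketch of the bibliographic coupling idea, assuming the reference list of each document has already been extracted and normalized to comparable identifiers, the shared references and a Jaccard-style coupling score can be computed as follows. The function name <monospace>coupling_strength</monospace> and the reference identifiers are hypothetical placeholders; this is an illustration only and not the CbPD implementation of Gipp et al. (<xref ref-type="bibr" rid="B20">2014</xref>).</p>
<preformat>
```python
def coupling_strength(refs_a, refs_b):
    """Return the shared references and a Jaccard-style coupling score.

    refs_a, refs_b: iterables of normalized reference identifiers
    (hypothetical placeholders in this sketch).
    """
    set_a, set_b = set(refs_a), set(refs_b)
    shared = set_a.intersection(set_b)
    union = set_a.union(set_b)
    # Score is 0.0 when neither document cites anything.
    score = len(shared) / len(union) if union else 0.0
    return sorted(shared), score


doc_a = ["ref1", "ref2", "ref3", "ref4"]
doc_b = ["ref2", "ref3", "ref5"]
shared, score = coupling_strength(doc_a, doc_b)
print(shared, score)  # ['ref2', 'ref3'] 0.4
```
</preformat>
<p>A set-based score of this kind ignores the order in which citations appear, so it can only flag candidate pairs for closer inspection.</p>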
<table-wrap position="float" id="T10">
<label>Table 10</label>
<caption><p>Citation-based approaches.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Plagiarism type</bold></th>
<th valign="top" align="left"><bold>Scope of study</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Dataset used</bold></th>
<th valign="top" align="left"><bold>Accuracy</bold></th>
<th valign="top" align="left"><bold>Applications or use cases</bold></th>
<th valign="top" align="left"><bold>Computational complexity</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Weaknesses</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Gipp et al. (<xref ref-type="bibr" rid="B20">2014</xref>)</td>
<td valign="top" align="left">Citation-based plagiarism</td>
<td valign="top" align="left">Scientific papers</td>
<td valign="top" align="left">Citation pattern analysis (CbPD), Greedy Tiling</td>
<td valign="top" align="left">PMC OAS corpus (PubMed)</td>
<td valign="top" align="left">Fleiss&#x00027;s kappa 0.65</td>
<td valign="top" align="left">Detecting disguised plagiarism using citation patterns</td>
<td valign="top" align="left">Greedy Tiling runs in <italic>O</italic>(<italic>n</italic><sup>2</sup>); citation pattern analysis in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>)</td>
<td valign="top" align="left">Effective for detecting disguised plagiarism using citation patterns</td>
<td valign="top" align="left">Requires high-quality citation data for accuracy</td>
</tr> <tr>
<td valign="top" align="left">Pertile et al. (<xref ref-type="bibr" rid="B30">2015</xref>)</td>
<td valign="top" align="left">Verbatim, paraphrased, citation plagiarism</td>
<td valign="top" align="left">Scientific publications</td>
<td valign="top" align="left">Content-based, citation-based analysis</td>
<td valign="top" align="left">ACL, PubMed</td>
<td valign="top" align="left">Precision 0.76, 0.61</td>
<td valign="top" align="left">Identifying different forms of scientific text plagiarism</td>
<td valign="top" align="left">Citation-based analysis runs in <italic>O</italic>(<italic>n</italic>log<italic>n</italic>); content-based varies by method</td>
<td valign="top" align="left">Strong results for verbatim and paraphrased plagiarism</td>
<td valign="top" align="left">Citation structure may not always reflect textual similarity</td>
</tr>
<tr>
<td valign="top" align="left">Vani and Gupta (<xref ref-type="bibr" rid="B43">2018</xref>)</td>
<td valign="top" align="left">Paraphrasing, structural plagiarism</td>
<td valign="top" align="left">Academic papers</td>
<td valign="top" align="left">POS tagging, WordNet similarity</td>
<td valign="top" align="left">PAN, PSA</td>
<td valign="top" align="left">No accuracy provided</td>
<td valign="top" align="left">Detecting structural plagiarism in academic writing</td>
<td valign="top" align="left">POS tagging operates in <italic>O</italic>(<italic>n</italic>); WordNet similarity in <italic>O</italic>(<italic>n</italic><sup>2</sup>)</td>
<td valign="top" align="left">Useful for detecting structural plagiarism</td>
<td valign="top" align="left">Lacks accuracy benchmarks; WordNet dependency limits scalability</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>6.6 Code-based plagiarism detection</title>
<p>Code-based plagiarism detection is designed to handle the unique challenges of source code plagiarism, where syntactic and structural changes can mask copied code. Liu et al. (<xref ref-type="bibr" rid="B26">2015</xref>) apply an improved longest common subsequence (LCS) algorithm together with code standardization, while Bartoszuk and Gagolewski (<xref ref-type="bibr" rid="B7">2021</xref>) combine Program Dependence Graphs (PDGs), Levenshtein distance, and q-grams to detect code similarities even when the code is restructured or altered in non-obvious ways. <xref ref-type="fig" rid="F13">Figure 13</xref> shows how q-grams are extracted from tokenized code sequences and compared to assess similarity. <xref ref-type="fig" rid="F14">Figure 14</xref> illustrates the comparison of Program Dependence Graphs, in which structural elements such as operations and conditions are matched between two programs. <xref ref-type="table" rid="T11">Table 11</xref> summarizes studies focused on code-based plagiarism detection, including the methods used and their performance metrics.</p>
<fig id="F13" position="float">
<label>Figure 13</label>
<caption><p>Token-based approaches: tokenizing source code into q-grams and comparing them for plagiarism detection.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0013.tif"/>
</fig>
<fig id="F14" position="float">
<label>Figure 14</label>
<caption><p>Program dependence graph-based approaches: matching PDGs of two programs to detect plagiarism.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1504725-g0014.tif"/>
</fig>
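<p>The token-based q-gram comparison can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of Bartoszuk and Gagolewski (<xref ref-type="bibr" rid="B7">2021</xref>): the tokenizer is a crude regular expression, every identifier is mapped to a single placeholder token, and similarity is a Dice coefficient over q-gram counts. All function names and the keyword list are hypothetical.</p>
<preformat>
```python
import re
from collections import Counter

# Small, hypothetical keyword list; a real lexer would use the language grammar.
KEYWORDS = {"def", "return", "for", "in", "if", "else", "while"}


def tokenize(code):
    # Identifiers, integer literals, and single non-space characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def normalize(tokens):
    # Replace every identifier with "ID" so that renaming does not hide reuse.
    return [
        "ID" if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS else tok
        for tok in tokens
    ]


def qgrams(tokens, q=3):
    # Multiset of all contiguous token windows of length q.
    return Counter(tuple(tokens[i:i + q]) for i in range(len(tokens) - q + 1))


def qgram_similarity(code_a, code_b, q=3):
    grams_a = qgrams(normalize(tokenize(code_a)), q)
    grams_b = qgrams(normalize(tokenize(code_b)), q)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    # Shared q-grams, counted with multiplicity.
    overlap = sum(min(count, grams_b[gram]) for gram, count in grams_a.items())
    return 2 * overlap / total  # Dice coefficient


original = "def add(a, b): return a + b"
renamed = "def plus(x, y): return x + y"
changed = "def mul(a, b): return a * b"
print(qgram_similarity(original, renamed))  # 1.0
print(qgram_similarity(original, changed))  # 0.8
```
</preformat>
<p>Because identifiers are normalized before the q-grams are built, renaming variables leaves the score unchanged, whereas changing an operator lowers it.</p>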
<table-wrap position="float" id="T11">
<label>Table 11</label>
<caption><p>Code-based plagiarism detection.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Plagiarism type</bold></th>
<th valign="top" align="left"><bold>Scope of study</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Dataset used</bold></th>
<th valign="top" align="left"><bold>Accuracy</bold></th>
<th valign="top" align="left"><bold>Applications or use cases</bold></th>
<th valign="top" align="left"><bold>Computational complexity</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Weaknesses</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Liu et al. (<xref ref-type="bibr" rid="B26">2015</xref>)</td>
<td valign="top" align="left">Code plagiarism</td>
<td valign="top" align="left">Programming assignments</td>
<td valign="top" align="left">Improved LCS, code standardization</td>
<td valign="top" align="left">Xiangtan University dataset</td>
<td valign="top" align="left">False alarm 0%, omission 5%</td>
<td valign="top" align="left">Detecting exact and near-duplicate code plagiarism</td>
<td valign="top" align="left">LCS operates in <italic>O</italic>(<italic>nm</italic>) complexity</td>
<td valign="top" align="left">High precision in detecting exact and near-duplicate code</td>
<td valign="top" align="left">May struggle with highly obfuscated code variations</td>
</tr>
<tr>
<td valign="top" align="left">Bartoszuk and Gagolewski (<xref ref-type="bibr" rid="B7">2021</xref>)</td>
<td valign="top" align="left">Source code plagiarism detection</td>
<td valign="top" align="left">Source code similarity, clone detection</td>
<td valign="top" align="left">PDG, Levenshtein, q-grams</td>
<td valign="top" align="left">Simulated R functions</td>
<td valign="top" align="left">F1-score 0.967</td>
<td valign="top" align="left">Detecting similarity in code clones and programming assignments</td>
<td valign="top" align="left">PDG operates in <italic>O</italic>(<italic>n</italic><sup>2</sup>), Levenshtein in <italic>O</italic>(<italic>nm</italic>), q-grams in <italic>O</italic>(<italic>n</italic>)</td>
<td valign="top" align="left">Robust for detecting code similarity across different structures</td>
<td valign="top" align="left">Computational overhead increases for large-scale codebases</td>
</tr></tbody>
</table>
</table-wrap>
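<p>Table 11 lists LCS-based comparison with <italic>O</italic>(<italic>nm</italic>) cost. The textbook dynamic program behind that bound can be sketched as follows; it is the standard algorithm rather than the improved variant of Liu et al. (<xref ref-type="bibr" rid="B26">2015</xref>), and the function names and the normalization by the longer sequence are assumptions made for illustration.</p>
<preformat>
```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences.

    Runs in O(n * m) time and O(m) extra space, where n = len(a), m = len(b).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    # Normalize by the longer sequence; two empty inputs count as identical.
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


tokens_a = "x = a + b ; print ( x )".split()
tokens_b = "y = a + b ; log ( y )".split()
print(lcs_length(tokens_a, tokens_b))      # 7
print(lcs_similarity(tokens_a, tokens_b))  # 0.7
```
</preformat>
<p>Operating on token lists rather than raw characters makes the comparison less sensitive to whitespace and formatting changes.</p>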
<p>While this section has outlined various plagiarism detection techniques and their operational mechanisms, the next section critically evaluates these approaches, highlighting their trade-offs in computational efficiency, detection accuracy, and real-world application.</p>
</sec>
</sec>
<sec id="s7">
<title>7 Critical assessment of detection methods</title>
<p>Building on the previous section&#x00027;s discussion of plagiarism detection techniques, this section critically evaluates their effectiveness, computational efficiency, and scalability. Each method has distinct advantages and limitations, so a comparative analysis is needed to determine where each is applicable.</p>
<sec>
<title>7.1 Trade-offs between computational efficiency and detection accuracy</title>
<p>Different detection methods present trade-offs between computational efficiency and detection accuracy. Traditional methods such as n-grams and string matching are computationally efficient but struggle to detect paraphrased plagiarism. More advanced deep learning methods, while highly effective in semantic analysis, require substantial computational resources and training data.</p>
<sec>
<title>7.1.1 Comparison of detection methods</title>
<p><bold>Computationally efficient but less accurate methods</bold></p>
<list list-type="bullet">
<list-item><p>Textual and lexical approaches: traditional methods like string matching and n-gram models are computationally efficient because they are simple and need few resources. For example, Liu et al. (<xref ref-type="bibr" rid="B26">2015</xref>) reported a 0% false alarm rate for source code detection with an improved LCS algorithm, listed in <xref ref-type="table" rid="T11">Table 11</xref> with <italic>O</italic>(<italic>nm</italic>) complexity. However, these methods struggle with paraphrased or obfuscated plagiarism, which limits their robustness.</p></list-item>
<list-item><p>Citation-based approaches: techniques analyzing citation patterns (e.g., Vel&#x000E1;squez et al., <xref ref-type="bibr" rid="B44">2016</xref>) are computationally efficient but lack the ability to detect nuanced text-level transformations.</p></list-item>
<list-item><p>Lexical and string matching techniques: fast but weak against paraphrased content.</p></list-item>
</list>
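<p>The weakness of lexical matching against paraphrase can be illustrated with a minimal sketch. The Python snippet below is not taken from any of the cited systems; the function names <italic>word_ngrams</italic> and <italic>jaccard</italic>, the trigram size, and the example sentences are assumptions chosen for illustration. It builds word trigram sets in a single pass over each text and compares them with the Jaccard coefficient.</p>
<preformat>
```python
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams (shingles) of a lower-cased text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both sets are empty."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


# Illustrative sentences (assumed, not from a benchmark corpus).
source = "The quick brown fox jumps over the lazy dog near the river bank"
verbatim = "Yesterday the quick brown fox jumps over the lazy dog near the river bank"
paraphrase = "A fast dark-coloured fox leaps above a sleepy hound close to the water"

verbatim_score = jaccard(word_ngrams(source), word_ngrams(verbatim))
paraphrase_score = jaccard(word_ngrams(source), word_ngrams(paraphrase))
print(round(verbatim_score, 2), round(paraphrase_score, 2))
```
</preformat>
<p>In this toy case the near-verbatim copy shares 11 of its 12 trigrams with the source, whereas the paraphrase shares none even though it conveys the same idea. The comparison is cheap, but it offers no signal once the wording changes.</p>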
<p><bold>Methods prioritizing accuracy over efficiency</bold></p>
<list list-type="bullet">
<list-item><p>Deep learning models: advanced methods, such as LSTM-based approaches (El-Rashidy et al., <xref ref-type="bibr" rid="B13">2022</xref>), achieve PlagDet scores that surpass those of competing systems. However, their high computational costs and long training times make them resource-intensive.</p></list-item>
<list-item><p>Knowledge graph-based detection: Franco-Salvador et al. (<xref ref-type="bibr" rid="B17">2016b</xref>) introduced knowledge graph approaches for cross-language plagiarism, achieving high accuracy but at the cost of significant computational overhead.</p></list-item>
<list-item><p>Syntax-based methods: effective for detecting restructured sentences but computationally expensive.</p></list-item>
<list-item><p>Word embeddings and semantic models: capture deeper meaning but require large-scale training.</p></list-item>
</list>
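<p>For readers unfamiliar with the PlagDet measure mentioned above, the following sketch shows how it is commonly computed in the PAN evaluation setting: the F1 score of precision and recall divided by the base-2 logarithm of one plus granularity. The sketch assumes this standard definition; the cited studies may compute precision, recall, and granularity at different levels of detail, and the numbers used here are hypothetical.</p>
<preformat>
```python
import math


def plagdet(precision, recall, granularity):
    """Combine precision, recall and granularity into one PlagDet score."""
    if not granularity >= 1:
        raise ValueError("granularity must be at least 1")
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    # Granularity above 1 means one plagiarized passage was reported in
    # several fragments, which lowers the score.
    return f1 / math.log2(1 + granularity)


# Hypothetical numbers, not results from any cited paper.
print(plagdet(0.9, 0.8, 1.0))
print(plagdet(0.9, 0.8, 1.5))
```
</preformat>
<p>With a granularity of 1 the score equals F1; any fragmentation of detections reduces it, so two systems with identical precision and recall can still differ in PlagDet.</p>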
</sec>
</sec>
<sec>
<title>7.2 Scalability challenges and real-world applications</title>
<p>While deep learning methods offer state-of-the-art accuracy, their practical deployment in large-scale systems presents challenges. Institutions and publishers handling vast repositories of documents need hybrid approaches combining efficiency and semantic robustness. Cloud-based parallel processing and selective document screening strategies are potential solutions to balance computational cost with detection performance.</p>
<sec>
<title>7.2.1 Scalability challenges</title>
<p>Resource-intensive methods face scalability issues, particularly in real-world applications involving large datasets or real-time detection requirements. For instance, Hussain and Suryani (<xref ref-type="bibr" rid="B23">2015</xref>) reported exponential increases in training times as dataset sizes grew from 1,000 to 10,000 documents. Similarly, cross-language detection methods relying on extensive semantic analysis often require substantial memory and processing power.</p>
<p><bold>Practical solutions</bold></p>
<list list-type="order">
<list-item><p>Selective processing: pre-screening techniques, such as text standardization, can reduce the computational load by narrowing the dataset requiring detailed analysis.</p></list-item>
<list-item><p>Distributed computing: leveraging cloud-based systems or parallel processing can improve the scalability of advanced methods.</p></list-item>
<list-item><p>Hybrid techniques: combining traditional methods with advanced semantic approaches provides a balance between efficiency and accuracy. For example, Sahi (2017) integrated syntactic and semantic analysis, achieving scalability and robust detection.</p></list-item>
</list>
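<p>The selective-processing and hybrid ideas listed above can be sketched as a two-stage pipeline: a cheap lexical screen discards clearly unrelated documents, and only the survivors reach a costlier comparison. The snippet below is a minimal illustration under stated assumptions, not the method of any cited work. The function names, the 0.5 screening threshold, and the tiny corpus are hypothetical, and the standard-library <italic>difflib.SequenceMatcher</italic> merely stands in for an expensive semantic model.</p>
<preformat>
```python
from difflib import SequenceMatcher


def tokens(text):
    """Lower-case whitespace tokenization."""
    return text.lower().split()


def cheap_overlap(query, candidate):
    """Stage 1: fraction of the query's distinct words found in the candidate."""
    q = set(tokens(query))
    return len(q.intersection(tokens(candidate))) / len(q) if q else 0.0


def detailed_similarity(query, candidate):
    """Stage 2 stand-in: costlier token-level alignment ratio."""
    matcher = SequenceMatcher(None, tokens(query), tokens(candidate), autojunk=False)
    return matcher.ratio()


def screen_then_compare(query, corpus, screen_threshold=0.5):
    """Run stage 2 only on documents that pass the stage 1 screen."""
    results = {}
    for doc_id, text in corpus.items():
        if cheap_overlap(query, text) >= screen_threshold:
            results[doc_id] = detailed_similarity(query, text)
    return results


# Hypothetical three-document corpus.
corpus = {
    "d1": "plagiarism detection systems compare a suspicious document against many sources",
    "d2": "the recipe needs flour sugar butter and two eggs",
    "d3": "detection systems compare every suspicious document against known sources",
}
query = "modern plagiarism detection systems compare a suspicious document against sources"
results = screen_then_compare(query, corpus)
print(sorted(results))
```
</preformat>
<p>Here the unrelated document d2 never reaches the second stage. In a real deployment the threshold would need tuning, since a screen that is too strict discards paraphrased sources before the semantic stage can examine them.</p>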
<p><bold>Practical implications</bold></p>
<list list-type="bullet">
<list-item><p>Real-time applications: systems for educational and publishing environments must prioritize lightweight methods or pre-processing to ensure timely results.</p></list-item>
<list-item><p>Large-scale databases: distributed computing and hybrid approaches are essential for managing millions of documents effectively.</p></list-item>
</list>
<p>By critically assessing the trade-offs between computational efficiency and detection accuracy, this section underscores the need for adaptive and scalable plagiarism detection methods. Future research should focus on hybrid approaches and optimization techniques to achieve a balance suited to diverse real-world applications.</p>
</sec>
</sec>
</sec>
<sec id="s8">
<title>8 Insights, challenges, and future directions</title>
<p>The landscape of plagiarism detection is rapidly evolving due to the increasing complexity of academic writing and the diverse forms of content reuse. Several key insights and challenges have emerged from recent research, providing a foundation for improving plagiarism detection systems.</p>
<sec>
<title>8.1 Actionable insights</title>
<p>Recent advancements in plagiarism detection have identified key areas for improvement:</p>
<list list-type="order">
<list-item><p><bold>Enhanced linguistic models</bold>: incorporating advanced linguistic features such as <italic>semantic role labeling (SRL)</italic> and <italic>dependency parsing</italic> can improve the detection of paraphrased and idea-based plagiarism. Research by Shakeel et al. (<xref ref-type="bibr" rid="B37">2020</xref>) demonstrates that fine-tuning large language models (LLMs) like <italic>BERT</italic> and <italic>GPT</italic> for plagiarism detection captures subtle textual variations more effectively.</p></list-item>
<list-item><p><bold>Cross-lingual plagiarism detection</bold>: multilingual embedding models and <italic>cross-language knowledge graphs</italic> should be further developed to tackle translated plagiarism. Studies such as Franco-Salvador et al. (<xref ref-type="bibr" rid="B16">2016a</xref>) show that <italic>word sense disambiguation</italic> enhances semantic equivalence detection across languages, improving multilingual plagiarism detection.</p></list-item>
<list-item><p><bold>Real-time detection systems</bold>: integrating plagiarism detection tools within writing software can help prevent misconduct at the source. Efficient algorithms are needed for real-time feedback without compromising accuracy (Pertile et al., <xref ref-type="bibr" rid="B30">2015</xref>).</p></list-item>
<list-item><p><bold>AI-generated content fingerprinting</bold>: the rise of AI-generated content from models like <italic>GPT-4</italic> and <italic>Bard</italic> necessitates the development of classifiers tailored to detect AI-generated text. Hayawi et al. (<xref ref-type="bibr" rid="B22">2023</xref>) discuss how AI writing models exhibit unique linguistic &#x0201C;fingerprints&#x0201D; that classifiers can leverage for detection.</p></list-item>
</list>
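<p>As a rough illustration of what a linguistic fingerprint might be built from, the sketch below extracts three simple stylometric features from a text. The feature set and the function name <italic>stylometric_features</italic> are assumptions made for this example and are not the features used by Hayawi et al.; in practice such a vector would be one input among many to a trained classifier, which is not shown.</p>
<preformat>
```python
import re


def stylometric_features(text):
    """Compute a few simple style features; purely illustrative."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        return {
            "mean_sentence_length": 0.0,
            "type_token_ratio": 0.0,
            "mean_word_length": 0.0,
        }
    return {
        # Average number of words per sentence.
        "mean_sentence_length": len(words) / len(sentences),
        # Vocabulary diversity: distinct words over total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Average number of characters per word.
        "mean_word_length": sum(len(w) for w in words) / len(words),
    }


features = stylometric_features("The cat sat. The cat ran!")
print(features)
```
</preformat>
<p>Features this shallow are unlikely to separate human and machine text on their own; the sketch only shows the kind of measurable signal a fingerprinting classifier could start from.</p>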
</sec>
<sec>
<title>8.2 Challenges</title>
<p>Despite these advancements, plagiarism detection systems face several challenges:</p>
<list list-type="order">
<list-item><p><bold>Linguistic and discursive variability</bold>: variations in <italic>writing style, tone, and cultural expression</italic> make it difficult to detect plagiarism, especially in multilingual contexts. Detecting <italic>idea-based plagiarism</italic> requires models that understand discourse-level semantics (Hussain and Suryani, <xref ref-type="bibr" rid="B23">2015</xref>).</p></list-item>
<list-item><p><bold>Scalability</bold>: many state-of-the-art detection models require significant computational resources, limiting accessibility for smaller institutions. Developing <italic>scalable models</italic> that maintain high accuracy remains an open research problem.</p></list-item>
<list-item><p><bold>Adaptability to emerging techniques</bold>: plagiarists increasingly use advanced obfuscation methods such as <italic>automated paraphrasing tools</italic> and <italic>neural translation models</italic>. Detection systems must incorporate <italic>adaptive learning mechanisms</italic> to evolve with these threats (El-Rashidy et al., <xref ref-type="bibr" rid="B14">2024</xref>).</p></list-item>
</list>
</sec>
<sec>
<title>8.3 Recommendations</title>
<p>To address these challenges, future research should focus on:</p>
<list list-type="order">
<list-item><p><bold>Integration of discourse analysis</bold>: plagiarism detection systems should incorporate <italic>discourse analysis techniques</italic> to capture nuanced semantic relationships between sentences and paragraphs. This is particularly useful for detecting <italic>idea-based and cross-lingual plagiarism</italic>.</p></list-item>
<list-item><p><bold>Publicly available benchmarks</bold>: establishing <italic>multilingual datasets</italic> for benchmarking will facilitate consistent evaluation and comparison of plagiarism detection methods. Collaborative initiatives can ensure diverse linguistic and cultural coverage.</p></list-item>
<list-item><p><bold>Interdisciplinary collaboration</bold>: researchers in <italic>computational linguistics, education, and ethics</italic> should work together to develop holistic solutions that address both the <italic>technical and ethical dimensions</italic> of plagiarism detection.</p></list-item>
<list-item><p><bold>Education and awareness</bold>: while technological advancements play a crucial role, promoting <italic>academic integrity through education</italic> is equally essential. Institutions should prioritize awareness campaigns alongside detection tool deployment.</p></list-item>
</list>
</sec>
<sec>
<title>8.4 Future research directions</title>
<p>Looking ahead, several opportunities exist to address these challenges:</p>
<list list-type="bullet">
<list-item><p>Developing <bold>hybrid models</bold> that combine linguistic analysis with deep learning techniques for greater robustness.</p></list-item>
<list-item><p>Exploring <bold>multimodal approaches</bold> that integrate text, images, and other data types for comprehensive content analysis (Agarwal et al., <xref ref-type="bibr" rid="B2">2018</xref>).</p></list-item>
<list-item><p>Designing <bold>adaptive algorithms</bold> capable of detecting <italic>emerging plagiarism techniques</italic>, including AI-generated content.</p></list-item>
<list-item><p>Investigating <bold>user-centric detection tools</bold> that integrate seamlessly into existing workflows, providing <italic>non-intrusive yet effective plagiarism prevention mechanisms</italic>.</p></list-item>
</list>
<p>By addressing these challenges and pursuing these recommendations, future research can significantly enhance the <italic>effectiveness, fairness, and scalability</italic> of plagiarism detection systems. This will help maintain <bold>academic integrity</bold> in an increasingly complex digital landscape.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="s9">
<title>9 Conclusion</title>
<p>In this survey, we systematically analyzed various types of plagiarism and the corresponding detection methods, ranging from traditional string-matching techniques to advanced AI-driven approaches. While lexical and shingle-based methods remain effective for detecting verbatim plagiarism, they struggle with more complex cases such as paraphrased, cross-lingual, and AI-generated plagiarism. Recent advancements in deep learning, particularly semantic similarity models and multilingual embeddings, have significantly improved detection accuracy. However, the computational cost and scalability of these approaches remain key challenges.</p>
<p>To enhance plagiarism detection systems, future research should focus on refining cross-language detection using knowledge graphs and multilingual embeddings. The rise of AI-generated content necessitates new techniques, such as linguistic fingerprinting, to differentiate between human and machine-generated text. Additionally, balancing detection accuracy with computational efficiency is crucial for integrating these systems into real-time applications. Hybrid models that combine traditional rule-based methods with AI-driven approaches could offer a scalable solution. Furthermore, developing large-scale, standardized datasets will facilitate better benchmarking and model generalizability, ultimately ensuring more robust and fair plagiarism detection frameworks.</p>
<p>By addressing these challenges, plagiarism detection systems can evolve to meet the growing complexities of academic and digital content integrity. Future efforts should prioritize integrating detection tools into educational and publishing platforms, enabling real-time feedback mechanisms that help prevent plagiarism at its source. With continued advancements in AI and linguistic analysis, the field is well-positioned to develop more sophisticated, adaptable, and ethical solutions to combat plagiarism in an increasingly digital world.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s10">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec sec-type="author-contributions" id="s11">
<title>Author contributions</title>
<p>AA: Investigation, Methodology, Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing. CT: Conceptualization, Methodology, Supervision, Writing &#x02013; review &#x00026; editing. AM: Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing.</p>
</sec>
<sec sec-type="funding-information" id="s12">
<title>Funding</title>
<p>The author(s) declare that financial support was received for the research and/or publication of this article. This research was funded by the Ministry of Science and Higher Education of the Republic of Kazakhstan within the framework of project AP23487777.</p>
</sec>
<ack><p>The authors would like to express their sincere gratitude to Dr. Shirali Kadyrov for his invaluable assistance and guidance in the writing of this paper. This work is part of the PhD research of the first author conducted at SDU University.</p>
</ack>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="s13">
<title>Generative AI statement</title>
<p>The author(s) declare that Gen AI was used in the creation of this manuscript. We used generative AI tools to enhance the language and assist with proofreading in this paper.</p>
</sec>
<sec sec-type="disclaimer" id="s14">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec sec-type="supplementary-material" id="s15">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fcomp.2025.1504725/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1504725/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Image_1.jpeg" id="SM1" mimetype="image/jpeg" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abdi</surname> <given-names>A.</given-names></name> <name><surname>Idris</surname> <given-names>N.</given-names></name> <name><surname>Alguliyev</surname> <given-names>R. M.</given-names></name> <name><surname>Aliguliyev</surname> <given-names>R. M.</given-names></name></person-group> (<year>2015</year>). <article-title>PDLK: plagiarism detection using linguistic knowledge</article-title>. <source>Expert Syst. Appl</source>. <volume>42</volume>, <fpage>8936</fpage>&#x02013;<lpage>8946</lpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2015.07.048</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Agarwal</surname> <given-names>B.</given-names></name> <name><surname>Ramampiaro</surname> <given-names>H.</given-names></name> <name><surname>Langseth</surname> <given-names>H.</given-names></name> <name><surname>Ruocco</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>A deep network model for paraphrase detection in short text messages</article-title>. <source>Inf. Process. Manage</source>. <volume>54</volume>, <fpage>922</fpage>&#x02013;<lpage>937</lpage>. <pub-id pub-id-type="doi">10.1016/j.ipm.2018.06.005</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Al-Thwaib</surname> <given-names>E.</given-names></name> <name><surname>Hammo</surname> <given-names>B. H.</given-names></name> <name><surname>Yagi</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>An academic Arabic corpus for plagiarism detection: design, construction and experimentation</article-title>. <source>Int. J. Educ. Technol. High. Educ</source>. <volume>17</volume>, <fpage>1</fpage>&#x02013;<lpage>26</lpage>. <pub-id pub-id-type="doi">10.1186/s41239-019-0174-x</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alvi</surname> <given-names>F.</given-names></name> <name><surname>Stevenson</surname> <given-names>M.</given-names></name> <name><surname>Clough</surname> <given-names>P.</given-names></name></person-group> (<year>2021</year>). <article-title>Paraphrase type identification for plagiarism detection using contexts and word embeddings</article-title>. <source>Int. J. Educ. Technol. High. Educ.</source> <volume>18</volume>:<fpage>42</fpage>. <pub-id pub-id-type="doi">10.1186/s41239-021-00277-8</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alzahrani</surname> <given-names>S.</given-names></name> <name><surname>Aljuaid</surname> <given-names>H.</given-names></name></person-group> (<year>2022</year>). <article-title>Identifying cross-lingual plagiarism using rich semantic features and deep neural networks: A study on Arabic-English plagiarism cases</article-title>. <source>J. King Saud Univ.-Comput. Inform. Sci</source>. <volume>34</volume>, <fpage>1110</fpage>&#x02013;<lpage>1123</lpage>. <pub-id pub-id-type="doi">10.1016/j.jksuci.2020.04.009</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alzahrani</surname> <given-names>S. M.</given-names></name> <name><surname>Salim</surname> <given-names>N.</given-names></name> <name><surname>Palade</surname> <given-names>V.</given-names></name></person-group> (<year>2015</year>). <article-title>Uncovering highly obfuscated plagiarism cases using fuzzy semantic-based similarity model</article-title>. <source>J. King Saud Univ.-Comput. Inform. Sci</source>. <volume>27</volume>, <fpage>248</fpage>&#x02013;<lpage>268</lpage>. <pub-id pub-id-type="doi">10.1016/j.jksuci.2014.12.001</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bartoszuk</surname> <given-names>M.</given-names></name> <name><surname>Gagolewski</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>T-norms or t-conorms? How to aggregate similarity degrees for plagiarism detection</article-title>. <source>Knowl.-Based Syst</source>. <volume>231</volume>:<fpage>107427</fpage>. <pub-id pub-id-type="doi">10.1016/j.knosys.2021.107427</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Benos</surname> <given-names>D. J.</given-names></name> <name><surname>Fabres</surname> <given-names>J.</given-names></name> <name><surname>Farmer</surname> <given-names>J.</given-names></name> <name><surname>Gutierrez</surname> <given-names>J. P.</given-names></name> <name><surname>Hennessy</surname> <given-names>K.</given-names></name> <name><surname>Kosek</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2005</year>). <article-title>Ethics and scientific publication</article-title>. <source>Adv. Physiol. Educ</source>. <volume>29</volume>, <fpage>59</fpage>&#x02013;<lpage>74</lpage>. <pub-id pub-id-type="doi">10.1152/advan.00056.2004</pub-id><pub-id pub-id-type="pmid">15905149</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chekhovich</surname> <given-names>Y. V.</given-names></name> <name><surname>Khazov</surname> <given-names>A. V.</given-names></name></person-group> (<year>2022</year>). <article-title>Analysis of duplicated publications in Russian journals</article-title>. <source>J. Informetr</source>. <volume>16</volume>:<fpage>101246</fpage>. <pub-id pub-id-type="doi">10.1016/j.joi.2021.101246</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Darwish</surname> <given-names>S. M.</given-names></name> <name><surname>Mhaimeed</surname> <given-names>I. A.</given-names></name> <name><surname>Elzoghabi</surname> <given-names>A. A.</given-names></name></person-group> (<year>2023</year>). <article-title>A quantum genetic algorithm for building a semantic textual similarity estimation framework for plagiarism detection applications</article-title>. <source>Entropy</source> <volume>25</volume>:<fpage>1271</fpage>. <pub-id pub-id-type="doi">10.3390/e25091271</pub-id><pub-id pub-id-type="pmid">37761570</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ehsan</surname> <given-names>N.</given-names></name> <name><surname>Shakery</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information</article-title>. <source>Inf. Process. Manage</source>. <volume>52</volume>, <fpage>1004</fpage>&#x02013;<lpage>1017</lpage>. <pub-id pub-id-type="doi">10.1016/j.ipm.2016.04.006</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ehsan</surname> <given-names>N.</given-names></name> <name><surname>Shakery</surname> <given-names>A.</given-names></name> <name><surname>Tompa</surname> <given-names>F. W.</given-names></name></person-group> (<year>2018</year>). <article-title>Cross-lingual text alignment for fine-grained plagiarism detection</article-title>. <source>J. Inform. Sci</source>. <volume>45</volume>, <fpage>443</fpage>&#x02013;<lpage>459</lpage>. <pub-id pub-id-type="doi">10.1177/0165551518787696</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>El-Rashidy</surname> <given-names>M. A.</given-names></name> <name><surname>Mohamed</surname> <given-names>R. G.</given-names></name> <name><surname>El-Fishawy</surname> <given-names>N. A.</given-names></name> <name><surname>Shouman</surname> <given-names>M. A.</given-names></name></person-group> (<year>2022</year>). <article-title>Reliable plagiarism detection system based on deep learning approaches</article-title>. <source>Neural Comput. Appl</source>. <volume>34</volume>, <fpage>18837</fpage>&#x02013;<lpage>18858</lpage>. <pub-id pub-id-type="doi">10.1007/s00521-022-07486-w</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>El-Rashidy</surname> <given-names>M. A.</given-names></name> <name><surname>Mohamed</surname> <given-names>R. G.</given-names></name> <name><surname>El-Fishawy</surname> <given-names>N. A.</given-names></name> <name><surname>Shouman</surname> <given-names>M. A.</given-names></name></person-group> (<year>2024</year>). <article-title>An effective text plagiarism detection system based on feature selection and SVM techniques</article-title>. <source>Multimed. Tools Appl</source>. <volume>83</volume>, <fpage>2609</fpage>&#x02013;<lpage>2646</lpage>. <pub-id pub-id-type="doi">10.1007/s11042-023-15703-4</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Errami</surname> <given-names>M.</given-names></name> <name><surname>Hicks</surname> <given-names>J. M.</given-names></name> <name><surname>Fisher</surname> <given-names>W.</given-names></name> <name><surname>Trusty</surname> <given-names>D.</given-names></name> <name><surname>Wren</surname> <given-names>J. D.</given-names></name> <name><surname>Long</surname> <given-names>T. C.</given-names></name> <etal/></person-group>. (<year>2008</year>). <article-title>D&#x000E9;j&#x000E0; vu&#x02014;a study of duplicate citations in Medline</article-title>. <source>Bioinformatics</source> <volume>24</volume>, <fpage>243</fpage>&#x02013;<lpage>249</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btm574</pub-id><pub-id pub-id-type="pmid">18056062</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Franco-Salvador</surname> <given-names>M.</given-names></name> <name><surname>Gupta</surname> <given-names>P.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name> <name><surname>Banchs</surname> <given-names>R. E.</given-names></name></person-group> (<year>2016a</year>). <article-title>Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language</article-title>. <source>Knowl.-Based Syst</source>. <volume>111</volume>, <fpage>87</fpage>&#x02013;<lpage>99</lpage>. <pub-id pub-id-type="doi">10.1016/j.knosys.2016.08.004</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Franco-Salvador</surname> <given-names>M.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name> <name><surname>Montes-y-G&#x000F3;mez</surname> <given-names>M.</given-names></name></person-group> (<year>2016b</year>). <article-title>A systematic study of knowledge graph analysis for cross-language plagiarism detection</article-title>. <source>Inf. Process. Manage</source>. <volume>52</volume>, <fpage>550</fpage>&#x02013;<lpage>570</lpage>. <pub-id pub-id-type="doi">10.1016/j.ipm.2015.12.004</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gandhi</surname> <given-names>N.</given-names></name> <name><surname>Gopalan</surname> <given-names>K.</given-names></name> <name><surname>Prasad</surname> <given-names>P.</given-names></name></person-group> (<year>2024</year>). <article-title>A support vector machine based approach for plagiarism detection in Python code submissions in undergraduate settings</article-title>. <source>Front. Comput. Sci</source>. <volume>6</volume>:<fpage>1393723</fpage>. <pub-id pub-id-type="doi">10.3389/fcomp.2024.1393723</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gharavi</surname> <given-names>E.</given-names></name> <name><surname>Veisi</surname> <given-names>H.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name></person-group> (<year>2019</year>). <article-title>Scalable and language-independent embedding-based approach for plagiarism detection considering obfuscation type: no training phase</article-title>. <source>Neural Comput. Appl</source>. <volume>32</volume>, <fpage>10593</fpage>&#x02013;<lpage>10607</lpage>. <pub-id pub-id-type="doi">10.1007/s00521-019-04594-y</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gipp</surname> <given-names>B.</given-names></name> <name><surname>Meuschke</surname> <given-names>N.</given-names></name> <name><surname>Breitinger</surname> <given-names>C.</given-names></name></person-group> (<year>2014</year>). <article-title>Citation-based plagiarism detection: Practicability on a large-scale scientific corpus</article-title>. <source>J. Assoc. Inform. Sci. Technol</source>. <volume>65</volume>, <fpage>1527</fpage>&#x02013;<lpage>1540</lpage>. <pub-id pub-id-type="doi">10.1002/asi.23228</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Glava&#x00161;</surname> <given-names>G.</given-names></name> <name><surname>Franco-Salvador</surname> <given-names>M.</given-names></name> <name><surname>Ponzetto</surname> <given-names>S. P.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name></person-group> (<year>2018</year>). <article-title>A resource-light method for cross-lingual semantic textual similarity</article-title>. <source>arXiv</source> [Preprint]. arXiv:1801.06436. <pub-id pub-id-type="doi">10.48550/arXiv.1801.06436</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hayawi</surname> <given-names>K.</given-names></name> <name><surname>Shahriar</surname> <given-names>S.</given-names></name> <name><surname>Mathew</surname> <given-names>S. S.</given-names></name></person-group> (<year>2023</year>). <article-title>The imitation game: detecting human and AI-generated texts in the era of ChatGPT and Bard</article-title>. <source>arXiv</source> [Preprint]. arXiv:2307.12166. <pub-id pub-id-type="doi">10.48550/arXiv.2307.12166</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hussain</surname> <given-names>S. F.</given-names></name> <name><surname>Suryani</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>On retrieving intelligently plagiarized documents using semantic similarity</article-title>. <source>Eng. Appl. Artif. Intell</source>. <volume>45</volume>, <fpage>246</fpage>&#x02013;<lpage>258</lpage>. <pub-id pub-id-type="doi">10.1016/j.engappai.2015.07.011</pub-id><pub-id pub-id-type="pmid">35951673</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Iqbal</surname> <given-names>H. R.</given-names></name> <name><surname>Maqsood</surname> <given-names>R.</given-names></name> <name><surname>Raza</surname> <given-names>A. A.</given-names></name> <name><surname>Hassan</surname> <given-names>S.-U.</given-names></name></person-group> (<year>2024</year>). <article-title>Urdu paraphrase detection: a novel DNN-based implementation using a semi-automatically generated corpus</article-title>. <source>Nat. Lang. Eng</source>. <volume>30</volume>, <fpage>354</fpage>&#x02013;<lpage>384</lpage>. <pub-id pub-id-type="doi">10.1017/S1351324923000189</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lariviere</surname> <given-names>V.</given-names></name> <name><surname>Gingras</surname> <given-names>Y.</given-names></name></person-group> (<year>2010</year>). <article-title>On the prevalence and scientific impact of duplicate publications in different scientific fields (1980-2007)</article-title>. <source>J. Document</source>. <volume>66</volume>, <fpage>179</fpage>&#x02013;<lpage>190</lpage>. <pub-id pub-id-type="doi">10.1108/00220411011023607</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Xu</surname> <given-names>C.</given-names></name> <name><surname>Ouyang</surname> <given-names>B.</given-names></name></person-group> (<year>2015</year>). <article-title>Plagiarism detection algorithm for source code in computer science education</article-title>. <source>Int. J. Dist. Educ. Technol</source>. <volume>13</volume>, <fpage>29</fpage>&#x02013;<lpage>39</lpage>. <pub-id pub-id-type="doi">10.4018/IJDET.2015100102</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Malandrino</surname> <given-names>D.</given-names></name> <name><surname>De Prisco</surname> <given-names>R.</given-names></name> <name><surname>Ianulardo</surname> <given-names>M.</given-names></name> <name><surname>Zaccagnino</surname> <given-names>R.</given-names></name></person-group> (<year>2022</year>). <article-title>An adaptive meta-heuristic for music plagiarism detection based on text similarity and clustering</article-title>. <source>Data Min. Knowl. Discov</source>. <volume>36</volume>, <fpage>1301</fpage>&#x02013;<lpage>1334</lpage>. <pub-id pub-id-type="doi">10.1007/s10618-022-00835-2</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Manzoor</surname> <given-names>M. F.</given-names></name> <name><surname>Farooq</surname> <given-names>M. S.</given-names></name> <name><surname>Haseeb</surname> <given-names>M.</given-names></name> <name><surname>Farooq</surname> <given-names>U.</given-names></name> <name><surname>Khalid</surname> <given-names>S.</given-names></name> <name><surname>Abid</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>Exploring the landscape of intrinsic plagiarism detection: benchmarks, techniques, evolution, and challenges</article-title>. <source>IEEE Access</source> <volume>11</volume>, <fpage>14706</fpage>&#x02013;<lpage>14729</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2023.3338855</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mehak</surname> <given-names>G.</given-names></name> <name><surname>Muneer</surname> <given-names>I.</given-names></name> <name><surname>Nawab</surname> <given-names>R. M. A.</given-names></name></person-group> (<year>2023</year>). <article-title>Urdu text reuse detection at phrasal level using sentence transformer-based approach</article-title>. <source>Expert Syst. Appl</source>. <volume>234</volume>:<fpage>121063</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2023.121063</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pertile</surname> <given-names>S. L.</given-names></name> <name><surname>Moreira</surname> <given-names>V. P.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name></person-group> (<year>2015</year>). <article-title>Comparing and combining content- and citation-based approaches for plagiarism detection</article-title>. <source>J. Assoc. Inf. Sci. Technol</source>. <volume>66</volume>, <fpage>1976</fpage>&#x02013;<lpage>1991</lpage>. <pub-id pub-id-type="doi">10.1002/asi.23593</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Polydouri</surname> <given-names>A.</given-names></name> <name><surname>Vathi</surname> <given-names>E.</given-names></name> <name><surname>Siolas</surname> <given-names>G.</given-names></name> <name><surname>Stafylopatis</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection</article-title>. <source>Evol. Syst</source>. <volume>11</volume>, <fpage>503</fpage>&#x02013;<lpage>515</lpage>. <pub-id pub-id-type="doi">10.1007/s12530-018-9232-1</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Roig</surname> <given-names>M.</given-names></name></person-group> (<year>2006</year>). <source>Avoiding plagiarism, self-plagiarism, and other questionable writing practices: A guide to ethical writing</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://ori.hhs.gov/sites/default/files/plagiarism.pdf">https://ori.hhs.gov/sites/default/files/plagiarism.pdf</ext-link></citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Romanov</surname> <given-names>A.</given-names></name> <name><surname>Kurtukova</surname> <given-names>A.</given-names></name> <name><surname>Shelupanov</surname> <given-names>A.</given-names></name> <name><surname>Fedotova</surname> <given-names>A.</given-names></name> <name><surname>Goncharov</surname> <given-names>V.</given-names></name></person-group> (<year>2021</year>). <article-title>Authorship identification of a Russian-language text using support vector machine and deep neural networks</article-title>. <source>Future Internet</source> <volume>13</volume>:<fpage>3</fpage>. <pub-id pub-id-type="doi">10.3390/fi13010003</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roostaee</surname> <given-names>M.</given-names></name> <name><surname>Fakhrahmad</surname> <given-names>S. M.</given-names></name> <name><surname>Sadreddini</surname> <given-names>M. H.</given-names></name></person-group> (<year>2020</year>). <article-title>Cross-language text alignment: a proposed two-level matching scheme for plagiarism detection</article-title>. <source>Expert Syst. Appl</source>. <volume>160</volume>:<fpage>113718</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2020.113718</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sahi</surname> <given-names>M.</given-names></name> <name><surname>Gupta</surname> <given-names>V.</given-names></name></person-group> (<year>2017</year>). <article-title>A novel technique for detecting plagiarism in documents exploiting information sources</article-title>. <source>Cogn. Comput</source>. <volume>9</volume>, <fpage>852</fpage>&#x02013;<lpage>867</lpage>. <pub-id pub-id-type="doi">10.1007/s12559-017-9502-4</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shahmohammadi</surname> <given-names>H.</given-names></name> <name><surname>Dezfoulian</surname> <given-names>M.</given-names></name> <name><surname>Mansoorizadeh</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). <article-title>Paraphrase detection using LSTM networks and handcrafted features</article-title>. <source>Multimed. Tools Appl</source>. <volume>80</volume>, <fpage>24137</fpage>&#x02013;<lpage>24155</lpage>. <pub-id pub-id-type="doi">10.1007/s11042-020-09996-y</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shakeel</surname> <given-names>M. H.</given-names></name> <name><surname>Karim</surname> <given-names>A.</given-names></name> <name><surname>Khan</surname> <given-names>I.</given-names></name></person-group> (<year>2020</year>). <article-title>A multi-cascaded model with data augmentation for enhanced paraphrase detection in short texts</article-title>. <source>Inf. Process. Manage</source>. <volume>57</volume>:<fpage>102204</fpage>. <pub-id pub-id-type="doi">10.1016/j.ipm.2020.102204</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Suman</surname> <given-names>C.</given-names></name> <name><surname>Naman</surname> <given-names>A.</given-names></name> <name><surname>Saha</surname> <given-names>S.</given-names></name> <name><surname>Bhattacharyya</surname> <given-names>P.</given-names></name></person-group> (<year>2021</year>). <article-title>A multimodal author profiling system for tweets</article-title>. <source>IEEE Trans. Comput. Soc. Syst</source>. <volume>8</volume>, <fpage>1407</fpage>&#x02013;<lpage>1416</lpage>. <pub-id pub-id-type="doi">10.1109/TCSS.2021.3082942</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Taufiq</surname> <given-names>U.</given-names></name> <name><surname>Pulungan</surname> <given-names>R.</given-names></name> <name><surname>Suyanto</surname> <given-names>Y.</given-names></name></person-group> (<year>2023</year>). <article-title>Named entity recognition and dependency parsing for better concept extraction in summary obfuscation detection</article-title>. <source>Expert Syst. Appl</source>. <volume>217</volume>:<fpage>119579</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2023.119579</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Turrado Garc&#x000ED;a</surname> <given-names>F.</given-names></name> <name><surname>Garc&#x000ED;a Villalba</surname> <given-names>L. J.</given-names></name> <name><surname>Sandoval Orozco</surname> <given-names>A. L.</given-names></name> <name><surname>Aranda Ruiz</surname> <given-names>F. D.</given-names></name> <name><surname>Aguirre Ju&#x000E1;rez</surname> <given-names>A.</given-names></name> <name><surname>Kim</surname> <given-names>T.-H.</given-names></name></person-group> (<year>2018</year>). <article-title>Locating similar names through locality sensitive hashing and graph theory</article-title>. <source>Multimed. Tools Appl</source>. <volume>78</volume>, <fpage>19965</fpage>&#x02013;<lpage>19985</lpage>. <pub-id pub-id-type="doi">10.1007/s11042-018-6375-9</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vani</surname> <given-names>K.</given-names></name> <name><surname>Gupta</surname> <given-names>D.</given-names></name></person-group> (<year>2017a</year>). <article-title>Detection of idea plagiarism using syntax-semantic concept extractions with genetic algorithm</article-title>. <source>Expert Syst. Appl</source>. <volume>71</volume>, <fpage>106</fpage>&#x02013;<lpage>122</lpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2016.12.022</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vani</surname> <given-names>K.</given-names></name> <name><surname>Gupta</surname> <given-names>D.</given-names></name></person-group> (<year>2017b</year>). <article-title>Text plagiarism classification using syntax based linguistic features</article-title>. <source>Expert Syst. Appl</source>. <volume>94</volume>, <fpage>29</fpage>&#x02013;<lpage>43</lpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2017.07.006</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vani</surname> <given-names>K.</given-names></name> <name><surname>Gupta</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>Integrating syntax-semantic-based text analysis with structural and citation information for scientific plagiarism detection</article-title>. <source>J. Assoc. Inf. Sci. Technol</source>. <volume>69</volume>, <fpage>1186</fpage>&#x02013;<lpage>1203</lpage>. <pub-id pub-id-type="doi">10.1002/asi.24027</pub-id></citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vel&#x000E1;squez</surname> <given-names>J. D.</given-names></name> <name><surname>Covacevich</surname> <given-names>Y.</given-names></name> <name><surname>Molina</surname> <given-names>F.</given-names></name> <name><surname>Marrese-Taylor</surname> <given-names>E.</given-names></name> <name><surname>Rodr&#x000ED;guez</surname> <given-names>C.</given-names></name> <name><surname>Bravo-Marquez</surname> <given-names>F.</given-names></name></person-group> (<year>2016</year>). <article-title>DOCODE 3.0 (DOcument COpy DEtector): a system for plagiarism detection by applying an information fusion process from multiple documental data sources</article-title>. <source>Inf. Fusion</source> <volume>27</volume>, <fpage>64</fpage>&#x02013;<lpage>75</lpage>. <pub-id pub-id-type="doi">10.1016/j.inffus.2015.05.006</pub-id></citation>
</ref>
</ref-list>
</back>
</article>