<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3-mathml3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="review-article" dtd-version="1.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title-group>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fcomp.2025.1626641</article-id>
<article-version article-version-type="Version of Record" vocab="NISO-RP-8-2008"/>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Mini Review</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>From shades to vibrance: a comprehensive review of modern image colorization techniques</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Geenath</surname> <given-names>Oshen</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Formal analysis" vocab-term-identifier="https://credit.niso.org/contributor-roles/formal-analysis/">Formal analysis</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="investigation" vocab-term-identifier="https://credit.niso.org/contributor-roles/investigation/">Investigation</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="validation" vocab-term-identifier="https://credit.niso.org/contributor-roles/validation/">Validation</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; original draft" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-original-draft/">Writing &#x2013; original draft</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; review &amp; editing" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing &#x2013; review &#x00026; editing</role>
<uri xlink:href="https://loop.frontiersin.org/people/3035698"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Priyadarshana</surname> <given-names>Y. H. P. P.</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/2705123"/>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="supervision" vocab-term-identifier="https://credit.niso.org/contributor-roles/supervision/">Supervision</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; review &amp; editing" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing &#x2013; review &#x00026; editing</role>
</contrib>
</contrib-group>
<aff id="aff1"><label>1</label><institution>School of Computing, Robert Gordon University</institution>, <city>Aberdeen</city>, <country country="gb">United Kingdom</country></aff>
<aff id="aff2"><label>2</label><institution>Kyoto University of Advanced Science (KUAS)</institution>, <city>Kyoto</city>, <country country="jp">Japan</country></aff>
<author-notes>
<corresp id="c001"><label>&#x0002A;</label>Correspondence: Oshen Geenath, <email xlink:href="mailto:o.arul-jeganathan@rgu.ac.uk">o.arul-jeganathan@rgu.ac.uk</email></corresp>
</author-notes>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2025-09-18">
<day>18</day>
<month>09</month>
<year>2025</year>
</pub-date>
<pub-date publication-format="electronic" date-type="collection">
<year>2025</year>
</pub-date>
<volume>7</volume>
<elocation-id>1626641</elocation-id>
<history>
<date date-type="received">
<day>11</day>
<month>05</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>26</day>
<month>08</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2025 Geenath and Priyadarshana.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Geenath and Priyadarshana</copyright-holder>
<license>
<ali:license_ref start_date="2025-09-18">https://creativecommons.org/licenses/by/4.0/</ali:license_ref>
<license-p>This is an open-access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License (CC BY)</ext-link>. The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</license-p>
</license>
</permissions>
<abstract>
<p>Image colorization has become a significant task in computer vision, addressing the challenge of transforming grayscale images into realistic, vibrant color outputs. Recent advancements leverage deep learning techniques, ranging from generative adversarial networks (GANs) to diffusion models, and integrate semantic understanding, multi-scale features, and user-guided controls. This review explores state-of-the-art methodologies, highlighting innovative components such as semantic class distribution learning, bidirectional temporal fusion, and instance-aware frameworks. Evaluation metrics, including PSNR, FID, and task-specific measures, ensure a comprehensive assessment of performance. Despite remarkable progress, challenges like multimodal uncertainty, computational cost, and generalization remain. This paper provides a thorough analysis of existing approaches, offering insights into their contributions, limitations, and future directions in automated image colorization.</p></abstract>
<kwd-group>
<kwd>image colorization</kwd>
<kwd>real-time colorization</kwd>
<kwd>black-and-white colorization</kwd>
<kwd>user-guided colorization</kwd>
<kwd>interactive image colorization</kwd>
</kwd-group>
<funding-group>
<funding-statement>The author(s) declare that no financial support was received for the research and/or publication of this article.</funding-statement>
</funding-group>
<counts>
<fig-count count="3"/>
<table-count count="2"/>
<equation-count count="5"/>
<ref-count count="45"/>
<page-count count="11"/>
<word-count count="7714"/>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Computer Vision</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<label>1</label>
<title>Introduction</title>
<p>Image colorization, a significant task in computer vision, involves converting grayscale images into realistic and semantically consistent color outputs (<xref ref-type="bibr" rid="B36">Welsh et al., 2002</xref>). This technology has broad applications in historical photo restoration, film and content enhancement, digital art, and interactive media creation (<xref ref-type="bibr" rid="B7">Cheng et al., 2015</xref>). Despite considerable advancements, colorization remains inherently ambiguous&#x02014;grayscale images may have multiple plausible colorizations depending on object semantics, scene context, and user intent (<xref ref-type="bibr" rid="B41">Zhang et al., 2016</xref>). Producing visually convincing results requires models to reason over both local textures and global semantic cues while maintaining computational efficiency and adaptability.</p>
<p>This review provides a comprehensive analysis of recent developments in deep learning-based image colorization. A systematic selection of research papers was conducted across major academic sources including Google Scholar, IEEE Xplore, ACM Digital Library, arXiv, and SpringerLink. Using search terms such as &#x0201C;image colorization,&#x0201D; &#x0201C;automatic colorization,&#x0201D; &#x0201C;semantic colorization,&#x0201D; &#x0201C;user-guided colorization,&#x0201D; and &#x0201C;text-to-image colorization,&#x0201D; we identified 46 relevant publications. After excluding unrelated works on sketch-based colorization, underwater image enhancement, and extremely low-resolution inputs (<xref ref-type="bibr" rid="B28">Pramanick et al., 2024</xref>; <xref ref-type="bibr" rid="B30">Sangkloy et al., 2017</xref>; <xref ref-type="bibr" rid="B24">Liu et al., 2024</xref>; <xref ref-type="bibr" rid="B20">Lee et al., 2020</xref>; <xref ref-type="bibr" rid="B15">Isola et al., 2017</xref>; <xref ref-type="bibr" rid="B9">Fei et al., 2023</xref>; <xref ref-type="bibr" rid="B10">Gao et al., 2023</xref>; <xref ref-type="bibr" rid="B29">Saharia et al., 2022</xref>; <xref ref-type="bibr" rid="B18">Kumar et al., 2021</xref>; <xref ref-type="bibr" rid="B31">Shafiq et al., 2025</xref>; <xref ref-type="bibr" rid="B19">Larsson et al., 2016</xref>; <xref ref-type="bibr" rid="B21">Li et al., 2023</xref>), 21 influential papers published between 2015 and 2025 were selected for in-depth review.</p>
<p>This paper categorizes and evaluates state-of-the-art methodologies across seven core areas: classification-based models, adversarial networks, diffusion models, transformer and dual-decoder architectures, exemplar-based and temporal colorization, multimodal and text-guided systems, and semantic fusion-based frameworks. Each method is discussed in terms of its architectural design, innovation, quantitative performance, and limitations. In addition, a summary of benchmark datasets and widely used evaluation metrics&#x02014;including PSNR, SSIM, LPIPS, FID, and CLIP Score&#x02014;is provided.</p>
<p>The remainder of this paper is structured as follows. Section 2 reviews existing methods grouped by model type and design strategy. Section 3 outlines the key challenges facing colorization models, including color imbalance, semantic ambiguity, and computational cost. Section 4 discusses the evaluation metrics used to assess fidelity, diversity, and perceptual quality. Section 5 highlights emerging trends and future research directions, including interactive frameworks, hybrid modeling, and lightweight architectures. Section 6 concludes with a summary of progress and recommendations.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Existing approaches</title>
<p>Recent advances in image colorization have led to a diverse array of deep learning-based models that vary significantly in architectural design, learning objectives, and user controllability. This section categorizes and reviews state-of-the-art techniques into key methodological families, each offering distinct advantages and trade-offs. We organize these approaches into discretized classification models, adversarial networks, diffusion-based frameworks, transformer and dual-decoder architectures, exemplar and temporal methods, text-guided and multimodal systems, and semantic fusion models.</p>
<p>To enhance accessibility, we also provide a visual summary of the distribution of surveyed models by architecture type in <xref ref-type="fig" rid="F1">Figure 1</xref>, highlighting how the field has evolved in terms of complexity, controllability, and realism over the past decade. This high-level overview contextualizes the detailed analysis in the subsequent subsections.</p>
<fig position="float" id="F1">
<label>Figure 1</label>
<caption><p>Distribution of surveyed image colorization models by architecture type (2015&#x02013;2025).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1626641-g0001.tif">
<alt-text content-type="machine-generated">Pie chart showing percentages of different approaches. GAN-based: 8%, Diffusion-based: 28%, Transformer-based: 12%, Classification-based: 12%, Text-guided: 20%, Others: 20%. Each segment has a distinct color and is labeled.</alt-text>
</graphic>
</fig>
<p>As illustrated in <xref ref-type="fig" rid="F1">Figures 1</xref> and <xref ref-type="fig" rid="F2">2</xref>, GAN-based and classification-based models have historically dominated the field, while diffusion and text-guided methods have gained significant traction in recent years due to their controllability and realism. Transformer-based and multimodal approaches are also emerging, reflecting a growing emphasis on semantic alignment and user interactivity. In the following subsections, we explore each category in detail, analyzing architectural innovations, quantitative results, use cases, and limitations. A comparative summary of key image colorization models, their reported metrics, strengths, and limitations is presented in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<fig position="float" id="F2">
<label>Figure 2</label>
<caption><p>Timeline of surveyed model categories from 2015 to 2025.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1626641-g0002.tif">
<alt-text content-type="machine-generated">Bar chart displaying the number of academic papers by publication period from 2015 to 2025, categorized by GAN-based, Diffusion-based, Transformer-based, Text-guided, and Others. The count generally increases over time, with the highest in the 2024-2025 period.</alt-text>
</graphic>
</fig>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Overview of key image colorization models with reported metrics, strengths, and limitations.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Paper</bold></th>
<th valign="top" align="left"><bold>Model type</bold></th>
<th valign="top" align="left"><bold>Dataset(s)</bold></th>
<th valign="top" align="left"><bold>Metrics reported</bold></th>
<th valign="top" align="left"><bold>Strengths</bold></th>
<th valign="top" align="left"><bold>Limitations</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">L-CAD <xref ref-type="bibr" rid="B39">Weng et al., 2023</xref></td>
<td valign="top" align="left">Diffusion</td>
<td valign="top" align="left">COCO-Stuff, ImageNet</td>
<td valign="top" align="left">PSNR, SSIM</td>
<td valign="top" align="left">Handles fine text prompts</td>
<td valign="top" align="left">Prompt-sensitive, slow</td>
</tr>
<tr>
<td valign="top" align="left">SS-CycleGAN <xref ref-type="bibr" rid="B21">Li et al., 2023</xref></td>
<td valign="top" align="left">GAN &#x0002B; attention</td>
<td valign="top" align="left">COCO</td>
<td valign="top" align="left">PSNR, SSIM</td>
<td valign="top" align="left">Spatial consistency</td>
<td valign="top" align="left">No FID/LPIPS, heavy</td>
</tr>
<tr>
<td valign="top" align="left">DDColor <xref ref-type="bibr" rid="B16">Kang et al., 2023</xref></td>
<td valign="top" align="left">Transformer &#x0002B; CNN</td>
<td valign="top" align="left">ImageNet</td>
<td valign="top" align="left">FID, PSNR</td>
<td valign="top" align="left">Semantic color separation</td>
<td valign="top" align="left">Fails on translucent regions</td>
</tr>
<tr>
<td valign="top" align="left">L-CoIns <xref ref-type="bibr" rid="B6">Chang et al., 2023</xref></td>
<td valign="top" align="left">Transformer-based, text-guided</td>
<td valign="top" align="left">Extended COCO-Stuff</td>
<td valign="top" align="left">PSNR, SSIM, LPIPS</td>
<td valign="top" align="left">Instance-aware without external priors</td>
<td valign="top" align="left">Struggles with small-object grounding in long captions</td>
</tr>
<tr>
<td valign="top" align="left">L-CoDer <xref ref-type="bibr" rid="B5">Chang et al., 2022</xref></td>
<td valign="top" align="left">Transformer-based, text-guided</td>
<td valign="top" align="left">Extended COCO-Stuff</td>
<td valign="top" align="left">PSNR, SSIM, LPIPS</td>
<td valign="top" align="left">Handles color-object mismatch with decoupling</td>
<td valign="top" align="left">High GPU/memory cost for high-resolution</td>
</tr>
<tr>
<td valign="top" align="left">L-CoDe <xref ref-type="bibr" rid="B38">Weng et al., 2022b</xref></td>
<td valign="top" align="left">GAN-based, text-guided</td>
<td valign="top" align="left">COCO-Stuff</td>
<td valign="top" align="left">PSNR, SSIM, LPIPS</td>
<td valign="top" align="left">High subjective realism</td>
<td valign="top" align="left">Color bleeding on fine boundaries</td>
</tr>
<tr>
<td valign="top" align="left">CT2 <xref ref-type="bibr" rid="B37">Weng et al., 2022a</xref></td>
<td valign="top" align="left">Transformer-based, classification</td>
<td valign="top" align="left">ImageNet</td>
<td valign="top" align="left">PSNR, SSIM, LPIPS</td>
<td valign="top" align="left">Color tokens enable semantic consistency</td>
<td valign="top" align="left">Sensitive to biased training data</td>
</tr>
<tr>
<td valign="top" align="left">ParaColorizer <xref ref-type="bibr" rid="B17">Kumar et al., 2024</xref></td>
<td valign="top" align="left">Dual GANs</td>
<td valign="top" align="left">Oxford flowers</td>
<td valign="top" align="left">FID, SSIM</td>
<td valign="top" align="left">Fast inference</td>
<td valign="top" align="left">Needs more training data</td>
</tr>
<tr>
<td valign="top" align="left">GAN Colorization <xref ref-type="bibr" rid="B27">Nazeri et al., 2018</xref></td>
<td valign="top" align="left">Conditional GAN</td>
<td valign="top" align="left">Various</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">Structured training, vivid colors</td>
<td valign="top" align="left">Texture miscoloring</td>
</tr>
<tr>
<td valign="top" align="left">User-Guided <xref ref-type="bibr" rid="B43">Zhang et al., 2017</xref></td>
<td valign="top" align="left">CNN &#x0002B; Hints</td>
<td valign="top" align="left">COCO</td>
<td valign="top" align="left">User study</td>
<td valign="top" align="left">Interactive and intuitive</td>
<td valign="top" align="left">Over-optimistic coloring</td>
</tr>
<tr>
<td valign="top" align="left">TextIR <xref ref-type="bibr" rid="B3">Bai et al., 2025</xref></td>
<td valign="top" align="left">GAN &#x0002B; CLIP</td>
<td valign="top" align="left">CelebA, COCO</td>
<td valign="top" align="left">FID, SSIM, CLIP</td>
<td valign="top" align="left">Text-based edits</td>
<td valign="top" align="left">CLIP mismatch possible</td>
</tr>
<tr>
<td valign="top" align="left">Let There Be Color <xref ref-type="bibr" rid="B36">Welsh et al., 2002</xref></td>
<td valign="top" align="left">CNN</td>
<td valign="top" align="left">Classic scenes</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">Simple, no user input needed</td>
<td valign="top" align="left">Fails on unseen domains</td>
</tr>
<tr>
<td valign="top" align="left">Palette <xref ref-type="bibr" rid="B29">Saharia et al., 2022</xref></td>
<td valign="top" align="left">Diffusion</td>
<td valign="top" align="left">ImageNet</td>
<td valign="top" align="left">FID</td>
<td valign="top" align="left">General purpose, realistic</td>
<td valign="top" align="left">Slower than GANs</td>
</tr>
<tr>
<td valign="top" align="left">BiSTNet <xref ref-type="bibr" rid="B40">Yang et al., 2024</xref></td>
<td valign="top" align="left">Video colorization (fusion)</td>
<td valign="top" align="left">DAVIS, Videvo</td>
<td valign="top" align="left">PSNR, CDC</td>
<td valign="top" align="left">Video accuracy, temporal logic</td>
<td valign="top" align="left">Heavy modules (SAM, RAFT)</td>
</tr>
<tr>
<td valign="top" align="left">Deep Colorization <xref ref-type="bibr" rid="B7">Cheng et al., 2015</xref></td>
<td valign="top" align="left">DNN &#x0002B; semantic features</td>
<td valign="top" align="left">SUN</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">Few artifacts</td>
<td valign="top" align="left">Needs large training set</td>
</tr>
<tr>
<td valign="top" align="left">Instance-Aware <xref ref-type="bibr" rid="B32">Su et al., 2020</xref></td>
<td valign="top" align="left">GAN &#x0002B; segmentation</td>
<td valign="top" align="left">Custom</td>
<td valign="top" align="left">FID</td>
<td valign="top" align="left">Good for multiple objects</td>
<td valign="top" align="left">Detection accuracy critical</td>
</tr>
<tr>
<td valign="top" align="left">ChromaGAN <xref ref-type="bibr" rid="B34">Vitoria et al., 2020</xref></td>
<td valign="top" align="left">GAN &#x0002B; semantic estimation</td>
<td valign="top" align="left">ImageNet</td>
<td valign="top" align="left">PSNR</td>
<td valign="top" align="left">Vivid color, semantic realism</td>
<td valign="top" align="left">Needs labeled classes</td>
</tr></tbody>
</table>
</table-wrap>
<sec>
<label>2.1</label>
<title>Discretized classification models</title>
<p>Regression-based colorization often results in desaturated or averaged outputs, particularly in regions with multiple plausible colors. To address this, classification-based models predict a probability distribution over discretized color classes, which preserves color diversity and rare colors rather than averaging competing modes into a single muted value.</p>
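<p>To make the contrast with regression concrete, the following Python sketch quantizes a chrominance plane into discrete classes and compares two ways of decoding an ambiguous pixel. It is an illustrative toy rather than the implementation of any surveyed model: the bin width of 10, the value range of [-110, 110), the two example colors, and the helper names <monospace>ab_to_class</monospace>, <monospace>class_to_ab</monospace>, and <monospace>chroma</monospace> are all assumptions made for this example.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import numpy as np

# Assumed quantization of the chrominance plane: bins of width 10 over [-110, 110).
BIN = 10
LO, HI = -110, 110
SIDE = (HI - LO) // BIN  # 22 bins per axis, 484 classes in total


def ab_to_class(a, b):
    """Map an (a, b) chrominance pair to a single class index."""
    ia = int(np.clip((a - LO) // BIN, 0, SIDE - 1))
    ib = int(np.clip((b - LO) // BIN, 0, SIDE - 1))
    return ia * SIDE + ib


def class_to_ab(idx):
    """Return the bin-centre (a, b) value of a class index."""
    ia, ib = divmod(idx, SIDE)
    return (LO + ia * BIN + BIN / 2, LO + ib * BIN + BIN / 2)


def chroma(ab):
    """Chroma magnitude; values near zero look gray."""
    return float(np.hypot(ab[0], ab[1]))


# A hypothetical pixel that is equally likely to be reddish or greenish.
red, green = (65.0, 5.0), (-65.0, 5.0)
probs = np.zeros(SIDE * SIDE)
probs[ab_to_class(*red)] = 0.5
probs[ab_to_class(*green)] = 0.5

# A point estimate trained with an L2 loss is pulled toward the mean of the modes.
mean_ab = tuple(0.5 * np.add(red, green))

# A class distribution keeps both modes; decoding selects one of them.
mode_ab = class_to_ab(int(np.argmax(probs)))

print("mean chroma:", chroma(mean_ab))
print("mode chroma:", chroma(mode_ab))
```
]]></code>
<p>With two equally likely modes, the averaged estimate lands close to the neutral axis, whereas the decoded class remains saturated. This is the behavior that motivates predicting distributions rather than point estimates.</p>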
<p>Deep Colorization (<xref ref-type="bibr" rid="B7">Cheng et al., 2015</xref>) adopts a fully connected neural network to classify each pixel using low-, mid-, and high-level features (grayscale patches, DAISY descriptors, and semantic segmentation). While it delivers strong PSNR (up to 33 dB) and avoids CNN overhead, the lack of spatial feature reuse limits its scalability to high-resolution or texture-rich images.</p>
<p><xref ref-type="bibr" rid="B33">Tassin et al. (2025)</xref> introduce Crayon, a U-Net-based model that approaches colorization from a compression perspective using a discretized color grid. Instead of predicting full color, it reconstructs chrominance from sparse color patches retained at fixed intervals (e.g., every <italic>n</italic><sup><italic>th</italic></sup> pixel). This structured sampling aligns with discretized classification principles, learning color mappings from partial ground truth. Crayon performs competitively in PSNR and CSIM across varying grid sizes (<italic>n</italic> &#x0003D; 6&#x02013;100), with optimal trade-offs at <italic>n</italic> &#x0003D; 15&#x02013;20. While lightweight and compression-efficient, its performance degrades at extreme sparsity levels due to color loss and grid artifacts.</p>
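<p>The grid-based sampling can be illustrated with a short Python sketch. It covers only the sampling step, not the U-Net reconstruction, and it rests on assumptions that are not taken from the Crayon paper: the same interval <italic>n</italic> is applied along both image axes, unsampled positions are filled with zeros, and the function name <monospace>sample_color_grid</monospace> and the random stand-in chrominance array are hypothetical.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import numpy as np


def sample_color_grid(ab, n):
    """Keep chrominance only at every n-th pixel along both axes.

    ab: array of shape (H, W, 2) holding the two chrominance channels.
    Returns the sparse hints (zeros elsewhere) and a boolean mask.
    """
    mask = np.zeros(ab.shape[:2], dtype=bool)
    mask[::n, ::n] = True
    hints = np.where(mask[..., None], ab, 0.0)
    return hints, mask


rng = np.random.default_rng(0)
ab = rng.uniform(-1.0, 1.0, size=(60, 90, 2))  # stand-in chrominance data

for n in (6, 15, 20):
    hints, mask = sample_color_grid(ab, n)
    kept = mask.mean()
    print(f"n={n:3d}  kept pixels={int(mask.sum()):4d}  fraction={kept:.4f}")
```
]]></code>
<p>The retained fraction falls roughly with the square of <italic>n</italic> under these assumptions, which is consistent with the degradation reported at extreme sparsity levels.</p>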
<p>In summary, classification-based models offer a structured way to encode color diversity and handle multimodal color spaces. They are effective for vibrant and data-driven colorization but face challenges in scalability and generalization to complex scenes due to discretization and post-processing dependencies.</p>
</sec>
<sec>
<label>2.2</label>
<title>Adversarial colorization networks</title>
<p>Generative Adversarial Networks (GANs; <xref ref-type="bibr" rid="B9">Fei et al., 2023</xref>) have become a cornerstone of modern image colorization, capable of producing vivid and realistic outputs by learning from natural color distributions. Unlike regression-based models, GANs use a discriminator to guide the generator toward perceptually convincing results. Recent approaches enhance this setup with semantic priors, instance awareness, and spatial refinement to boost realism and structure.</p>
<p>ChromaGAN (<xref ref-type="bibr" rid="B34">Vitoria et al., 2020</xref>) introduces a dual-branch generator: one predicts chrominance channels, the other estimates semantic class distributions, supervised by KL divergence against VGG-16 predictions. This improves contextual accuracy and color diversity. However, its reliance on fixed-size inputs (due to VGG-16 constraints) and pretrained semantic classifiers limits its adaptability across different domains, resolutions, and tasks where pretrained priors may not align with target data distributions.</p>
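<p>The semantic supervision term can be sketched as a KL divergence between two class distributions. The Python example below is a minimal illustration under stated assumptions, not ChromaGAN's code: the four-class logits are invented, the reference distribution stands in for a pretrained classifier's output, and the direction KL(reference || prediction) is assumed.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions on the last axis."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)


# Hypothetical logits over four semantic classes.
teacher = softmax(np.array([4.0, 1.0, 0.5, 0.1]))   # stand-in for classifier output
aligned = softmax(np.array([3.5, 1.2, 0.4, 0.2]))   # generator branch, similar belief
confused = softmax(np.array([0.1, 0.5, 1.0, 4.0]))  # generator branch, wrong class

print("KL(teacher || aligned): ", kl_divergence(teacher, aligned))
print("KL(teacher || confused):", kl_divergence(teacher, confused))
```
]]></code>
<p>The penalty is small when the generator's class branch agrees with the reference distribution and grows when it assigns its mass to a different class, which is how such a term can steer color predictions toward the recognized scene content.</p>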
<p>ParaColorizer (<xref ref-type="bibr" rid="B17">Kumar et al., 2024</xref>) tackles foreground-background confusion using two parallel GANs for foreground (self-attention ResUNet) and background, fused via a DenseFuse network. This enhances object separation and color clarity, achieving top FID and colorfulness scores on COCO and ImageNet. Its trade-off is increased complexity and inference time, along with dependency on instance segmentation.</p>
<p>SS-CycleGAN extends CycleGAN (<xref ref-type="bibr" rid="B21">Li et al., 2023</xref>) with Multi-Scale Cascaded Dilated Convolution (MCDC) and a self-attention patch discriminator, improving semantic focus and edge fidelity. It boosts PSNR and SSIM, but the model was not evaluated on perceptual realism metrics such as FID or LPIPS, limiting direct comparability with diffusion or multimodal models. Furthermore, it lacks user guidance features, reducing controllability in interactive settings.</p>
<p>L-CoDe (Language-based Colorization with Decoupled Conditions; <xref ref-type="bibr" rid="B38">Weng et al., 2022b</xref>) integrates adversarial learning with semantic decoupling by separating caption tokens into object (noun) and color (adjective) vectors, addressing color-object mismatch and coupling. A novel Attention Transfer Module (ATM) maps object references in the image to corresponding color tokens, while a Soft-gated Injection Module (SIM) ensures that only mentioned regions receive injected color guidance. The model is trained with both perceptual and binary cross-entropy losses, achieving strong performance in PSNR, SSIM, and LPIPS on the COCO-Stuff dataset. Although not evaluated on FID, L-CoDe&#x00027;s user studies demonstrate strong subjective realism and controllability, positioning it as a semantically guided adversarial model that bridges linguistic cues and visual fidelity.</p>
<p>Instance-Aware GANs (<xref ref-type="bibr" rid="B32">Su et al., 2020</xref>) colorize detected objects individually and merge them with global features via a fusion module, reducing color mixing between objects and backgrounds. While effective in dense scenes, the approach is highly dependent on segmentation accuracy and incurs considerable computational cost due to per-instance forward passes.</p>
<p>In summary, adversarial colorization networks push the boundary of realism through semantic fusion and structural refinement. Their key limitations include training complexity, runtime cost, and sensitivity to external dependencies such as detection quality and pretrained priors.</p>
</sec>
<sec>
<label>2.3</label>
<title>Diffusion-based colorization models</title>
<p>Diffusion-based models have emerged as a powerful solution for high-fidelity colorization by iteratively denoising noisy samples conditioned on grayscale input or auxiliary signals. Compared to GANs, they offer more stable training and generate diverse, semantically coherent outputs, though they remain computationally expensive and slower to infer due to their iterative nature.</p>
<p>Palette (<xref ref-type="bibr" rid="B29">Saharia et al., 2022</xref>) is a general-purpose diffusion model trained on multiple image-to-image tasks, including colorization. It uses a U-Net with global self-attention and requires no task-specific tuning. Achieving FID = 15.78 and a 47.8% human fooling rate on ImageNet, Palette outperforms earlier models such as the transformer-based ColTran. However, its universal design slightly compromises colorization-specific precision, and its multi-step generation makes it unsuitable for real-time use.</p>
<p>L-CAD (<xref ref-type="bibr" rid="B39">Weng et al., 2023</xref>) offers text-conditioned colorization using Stable Diffusion, integrating LIC, CEC, and ISS modules for structure preservation, semantic alignment, and object-aware control. It performs well on COCO-Stuff and ImageNet (PSNR = 26.3, SSIM = 0.911) and supports prompts of varying detail. However, its effectiveness relies on the clarity and precision of user prompts, making it vulnerable to ambiguous or sparse descriptions.</p>
<p>In summary, diffusion models offer high-quality, controllable colorization across modalities, but their sampling cost makes them better suited to offline or batch processing than to real-time tasks. Future work should focus on faster sampling strategies and task-specific tuning to unlock their full potential in practical settings.</p>
</sec>
<sec>
<label>2.4</label>
<title>Transformer and dual-decoder architectures</title>
<p>Transformer-based and dual-decoder models have recently advanced colorization by decoupling spatial detail from semantic reasoning. This architectural split allows networks to simultaneously handle texture reconstruction and context-aware color prediction, improving accuracy in complex scenes. However, these designs often come with high training costs and memory demands.</p>
<p>DDColor (<xref ref-type="bibr" rid="B16">Kang et al., 2023</xref>) exemplifies this trend with a ConvNeXt backbone and two decoders: a pixel decoder for spatial fidelity and a transformer-based color decoder for semantic-aware color queries. Their fusion via attention mechanisms enables high-resolution, vibrant outputs. The architectural overview of DDColor is shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. DDColor achieves strong performance on ImageNet, COCO-Stuff, and ADE20K (FID = 3.92 on ImageNet), aided by a colorfulness loss. However, its dual-path design increases latency and memory usage, limiting real-time usability.</p>
<fig position="float" id="F3">
<label>Figure 3</label>
<caption><p>DDColor architecture: a grayscale image is encoded via a ViT encoder. Optional user hints are processed by a separate encoder and fused via an adaptive mask. A transformer-based diffusion decoder generates the final colorized image.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-07-1626641-g0003.tif">
<alt-text content-type="machine-generated">Flowchart illustrating image colorization. A grayscale input image undergoes a ViT-based encoder process. Simultaneously, a scribble or hint input is processed by a hint encoder. Both outputs feed into an adaptive mask generator. This connects to a transformer-based diffusion decoder, resulting in a colorized output image.</alt-text>
</graphic>
</fig>
<p>CT2 (<xref ref-type="bibr" rid="B37">Weng et al., 2022a</xref>) further expands transformer-based colorization by introducing color tokens and treating colorization as a classification problem in quantized color space. The model features a ViT-based encoder and a transformer-based decoder, enhanced by two novel modules: (1) a luminance-selecting module that dynamically restricts valid color candidates based on luminance levels, and (2) a color attention mechanism that injects color tokens into grayscale image features. These innovations address common issues like semantic color errors and undersaturation, leading to visually rich, plausible outputs without relying on external priors. CT2 achieves state-of-the-art performance across multiple benchmarks, including ImageNet, with superior FID (5.51), PSNR (23.50), SSIM (0.92), and colorfulness metrics. Despite its strengths, the model depends on accurate empirical color distributions and may underperform on highly biased or limited datasets.</p>
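<p>The idea of restricting color candidates by luminance can be sketched as a masked softmax. The Python example below is a simplified illustration and should not be read as CT2's actual module: the three luminance levels, the five color classes, the validity table <monospace>valid_by_luma</monospace>, and the logits are all hypothetical.</p>
<code language="python" xml:space="preserve"><![CDATA[
```python
import numpy as np


def masked_softmax(logits, valid):
    """Softmax restricted to the classes flagged True in `valid`."""
    masked = np.where(valid, logits, -np.inf)
    z = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Hypothetical lookup: for each of 3 luminance levels, which of 5 color
# classes were observed in training data (True = valid candidate).
valid_by_luma = np.array([
    [True,  True,  False, False, False],   # dark pixels
    [True,  True,  True,  True,  False],   # mid-tones
    [False, False, True,  True,  True],    # bright pixels
])

logits = np.array([2.0, 0.5, 1.0, 3.0, 0.2])  # raw class scores for one pixel
luma_level = 0                                 # this pixel is dark

probs = masked_softmax(logits, valid_by_luma[luma_level])
print(probs)
```
]]></code>
<p>In this toy case the highest raw score belongs to a class that never occurs at the pixel's luminance level, so the mask removes it and the prediction falls back to a valid candidate. The sketch also shows why such a scheme depends on accurate empirical color statistics.</p>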
<p>Overall, these architectures demonstrate that semantic disentanglement improves interpretability and realism in colorization. Their main limitations are computational efficiency and generalization, which remain key areas for further refinement.</p>
</sec>
<sec>
<label>2.5</label>
<title>Exemplar and temporal colorization</title>
<p>Video colorization poses challenges like temporal consistency, color propagation, and scene coherence, which static models do not face. To overcome these, exemplar-based and temporal models use reference frames, semantic priors, and feature-level alignment to maintain consistency across sequences.</p>
<p>L-CoDer (Language-Based Colorization with Color-Object Decoupling Transformer; <xref ref-type="bibr" rid="B5">Chang et al., 2022</xref>) introduces a language-guided approach that unifies grayscale images and textual captions in a shared token-based representation using transformers. Unlike temporal or exemplar-based models, L-CoDer targets the modality alignment problem by decoupling the caption into noun (object) and adjective (color) tokens and processing them alongside image patches. The model employs a decoupling transformer with bidirectional attention, enabling each modality to refine the other from coarse to fine. A learned Object-Color Correspondence Matrix (OCCM) ensures correct color-object associations, addressing issues such as color-object mismatch and coupling. L-CoDer achieves state-of-the-art performance on the COCO-Stuff dataset across PSNR, SSIM, and LPIPS metrics. However, the model&#x00027;s transformer backbone leads to high memory demands, posing challenges for scaling to high-resolution or real-time applications. Nonetheless, it represents a strong advancement in semantically controllable colorization.</p>
<p>BiSTNet (<xref ref-type="bibr" rid="B40">Yang et al., 2024</xref>) colorizes entire video sequences using only two reference frames, employing a Bidirectional Temporal Fusion Block (BTFB) to blend forward and backward predictions based on temporal distance. It further refines the output using a Mixed Expert Block (MEB), which combines segmentation and edge features, and a Multi-Scale Refinement Block (MSRB). It achieved top scores in the NTIRE 2023 Video Colorization Challenge, with strong PSNR and CDC metrics. However, its dependency on external modules (e.g., SAM, RAFT) and its heavy computation limit real-time use, and its success hinges on high-quality references.</p>
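<p>Blending by temporal distance can be sketched as follows. This is only an illustration: it assumes a fixed linear weighting between the two reference frames, whereas the fusion block in BiSTNet is learned, and the function names are hypothetical.</p>
<code language="python"><![CDATA[
```python
def fusion_weights(t, first_ref, last_ref):
    """Linear weights for the forward and backward predictions at frame t."""
    if first_ref == last_ref or not (first_ref <= t <= last_ref):
        raise ValueError("t must lie between two distinct reference frames")
    span = last_ref - first_ref
    w_forward = (last_ref - t) / span    # larger near the first reference
    w_backward = (t - first_ref) / span  # larger near the last reference
    return w_forward, w_backward

def fuse(forward_pred, backward_pred, t, first_ref, last_ref):
    """Blend two per-pixel chroma predictions given as flat lists of floats."""
    w_f, w_b = fusion_weights(t, first_ref, last_ref)
    return [w_f * f + w_b * b for f, b in zip(forward_pred, backward_pred)]

# Frame 2 of a clip whose reference frames sit at indices 0 and 8.
blended = fuse([0.8, 0.4], [0.0, 0.4], t=2, first_ref=0, last_ref=8)
```
]]></code>
<p>Frame 2 is closer to the first reference, so the forward prediction receives a weight of 0.75 and the backward prediction 0.25.</p>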
<p>DeepExemplar (from ParaColorizer; <xref ref-type="bibr" rid="B17">Kumar et al., 2024</xref>) uses a dual-GAN strategy to colorize foreground and background separately, extending it to videos through semantic alignment and temporal fusion. It preserves color consistency in repeating or structured elements and uses instance-aware segmentation for identity tracking. Despite improved visual coherence, the model remains computationally intensive, and performance degrades when semantic matching fails in dynamic or complex scenes.</p>
<p>In summary, these models represent a shift toward sequence-aware colorization, offering robust performance through semantic fusion and temporal logic. Their major limitations lie in runtime overhead, reference dependency, and scalability, especially in real-time or unconstrained settings.</p>
</sec>
<sec>
<label>2.6</label>
<title>Text-guided and multimodal colorization</title>
<p>Modern colorization models increasingly support multimodal interaction, enabling users to guide outputs via text prompts, strokes, or exemplars. These systems incorporate semantic understanding and visual alignment, offering both global scene-level control and localized refinement. This shift toward human-in-the-loop generation enhances creativity and personalization, but also introduces challenges in precision and usability.</p>
<p>TextIR (<xref ref-type="bibr" rid="B3">Bai et al., 2025</xref>) uses CLIP-based embeddings and a StyleConv generator to enable prompt-driven colorization, inpainting, and super-resolution. A feature fusion module blends semantic cues with structural details for fine-grained edits (e.g., &#x0201C;a red umbrella and green boots&#x0201D;). It outperforms prior models like L-CoDe and L-CoIns on FID, SSIM, and CLIP Score. However, its sensitivity to prompt quality can cause mismatches in complex or ambiguous scenarios, and its alignment is less precise than pixel-based control.</p>
<p>L-CAD (<xref ref-type="bibr" rid="B39">Weng et al., 2023</xref>) builds on Stable Diffusion, introducing modules such as LIC, CEC, and ISS to align text inputs with grayscale structure and instance features. It performs well on COCO-Stuff and ImageNet, maintaining high PSNR and SSIM, and handles both general and detailed prompts. Still, its performance heavily depends on prompt clarity, especially in multi-object scenes where ambiguity can degrade results.</p>
<p>Language-based Image Colorization (<xref ref-type="bibr" rid="B22">Li et al., 2025</xref>) introduces Color-Turbo, a distilled diffusion model for text-guided colorization that achieves 14&#x000D7; faster inference and high CLIP alignment. The authors benchmark from-scratch and pre-trained models and propose a hue-invariant FID (hFID) metric for fairer evaluation. While efficient and generalizable, Color-Turbo lacks fine-grained control and may produce hue inconsistencies for complex prompts. Their curated dataset standardizes evaluation across language-based colorization models.</p>
<p>Controllable Image Colorization with Instance-aware Texts and Masks (<xref ref-type="bibr" rid="B1">An et al., 2025</xref>) extends text-based control with segmentation masks for instance-aware colorization. It combines a transformer-guided diffusion model with a novel GPT-generated dataset (GPT-Color) to enable fine-grained, object-level control. This achieves strong performance in user studies and CLIP alignment, but its reliance on accurate instance masks and multi-modal inputs increases system complexity and limits scalability for casual users.</p>
<p>L-CoIns (<xref ref-type="bibr" rid="B6">Chang et al., 2023</xref>) introduces a framework that leverages both language and instance-level cues for object-aware colorization. The model uses CLIP-based embeddings to encode text prompts and aligns them with grayscale image regions through object detection and instance segmentation. This enables controllable, region-specific colorization (e.g., &#x0201C;make the apple green and the car red&#x0201D;). Experimental results show that L-CoIns achieves better semantic alignment and diversity than earlier text-based methods. However, the model&#x00027;s effectiveness depends on accurate instance segmentation, and it is less responsive when object boundaries are unclear or ambiguous.</p>
<p>In summary, text-guided and multimodal models offer flexible, user-controllable colorization, blending visual reasoning with language and manual input. Their limitations stem from prompt sensitivity, runtime demands, and precision trade-offs, but they represent a crucial step toward interactive and expressive colorization.</p>
</sec>
<sec>
<label>2.7</label>
<title>Semantic fusion and context-aware models</title>
<p>Semantic fusion models combine global scene understanding with local spatial features to guide colorization more effectively, especially in cluttered or ambiguous scenes. Through classification, segmentation, or feature alignment, these models bridge low-level texture and high-level context, resulting in more coherent and object-aware outputs.</p>
<p>Iizuka et al. introduced a dual-branch network with a scene classification head and a mid-level feature extractor, fused to guide per-pixel color prediction. It achieves strong perceptual realism, but lacks multimodal control and produces less diverse colors in ambiguous scenes.</p>
<p>ChromaGAN (<xref ref-type="bibr" rid="B34">Vitoria et al., 2020</xref>) operates within a GAN framework, combining color prediction with semantic class distribution estimation, regularized via KL divergence against VGG-16 outputs. This enhances realism and alignment, though the reliance on pretrained classification priors limits adaptability to unseen domains or tasks.</p>
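<p>The KL-divergence regularizer mentioned above can be illustrated with a small sketch. It assumes discrete class distributions given as plain lists and takes the pretrained classifier&#x00027;s output as the reference distribution. The direction of the divergence and all names here are assumptions for illustration, not details confirmed by the ChromaGAN implementation.</p>
<code language="python"><![CDATA[
```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; eps guards against log(0) in q.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

reference = [0.7, 0.2, 0.1]    # stand-in for a pretrained classifier's output
matched = [0.7, 0.2, 0.1]      # predicted distribution that agrees
mismatched = [0.1, 0.2, 0.7]   # predicted distribution that disagrees

loss_matched = kl_divergence(reference, matched)
loss_mismatched = kl_divergence(reference, mismatched)
```
]]></code>
<p>The penalty is zero when the predicted class distribution matches the reference and grows as the two diverge, which is what pushes the generator toward semantically consistent predictions.</p>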
<p>Instance-Aware Colorization (<xref ref-type="bibr" rid="B32">Su et al., 2020</xref>) improves fusion by separating object-level and global features, using Mask R-CNN (<xref ref-type="bibr" rid="B12">He et al., 2017</xref>) for instance detection and a fusion module to merge them. This approach excels in multi-object scenes but depends heavily on detection accuracy and incurs high computational cost when many instances are present.</p>
<p>BiSTNet (<xref ref-type="bibr" rid="B40">Yang et al., 2024</xref>), while designed for video, incorporates semantic fusion through a Mixed Expert Block (MEB) that combines segmentation and edge cues to guide color blending across frames. It achieves top performance (CDC, PSNR) but suffers from high latency due to reliance on external modules like RAFT and SAM.</p>
<p>In summary, semantic fusion models boost colorization accuracy by aligning structural and contextual information. Their key challenges lie in external dependencies and complexity, suggesting a need for more lightweight, integrated solutions for broader applicability.</p>
</sec>
<sec>
<label>2.8</label>
<title>Benchmark datasets used in colorization research</title>
<p>A wide range of datasets have been employed in colorization research to evaluate model performance across domains such as natural scenes, objects, faces, and videos. These datasets vary in scale, diversity, annotation detail, and complexity, enabling benchmarking on both qualitative and quantitative metrics like PSNR, SSIM, LPIPS, FID, and perceptual user studies.</p>
<p>ImageNet (ILSVRC2012 / val5k; <xref ref-type="bibr" rid="B8">Deng et al., 2009</xref>) is a large-scale dataset containing over 1.2 million labeled images across 1,000 categories. It is widely used for both training and evaluation in automatic colorization due to its semantic richness and variety of scenes. The val5k subset is a common benchmark for computing FID, PSNR, and SSIM, particularly in general-purpose and diffusion-based colorization models.</p>
<p>The COCO-Stuff and COCO-2017 datasets (<xref ref-type="bibr" rid="B23">Lin et al., 2014</xref>) provide densely annotated scenes with instance-level and semantic segmentation, making them suitable for testing models like L-CAD (<xref ref-type="bibr" rid="B39">Weng et al., 2023</xref>).</p>
<p>Places205 and Places365 (<xref ref-type="bibr" rid="B45">Zhou et al., 2017</xref>) are scene-centric datasets with millions of labeled images across a wide range of indoor and outdoor settings. These datasets are used to support global semantic understanding, especially in models such as Iizuka et al., which incorporate scene classification into the colorization pipeline for improved contextual coherence.</p>
<p>CelebA and CelebA-HQ (<xref ref-type="bibr" rid="B44">Zhang et al., 2020</xref>) are high-quality facial datasets with attribute annotations and aligned facial landmarks, often used for portrait colorization and identity preservation. These datasets serve as testbeds for frameworks like TextIR (<xref ref-type="bibr" rid="B3">Bai et al., 2025</xref>) and BiSTNet (<xref ref-type="bibr" rid="B40">Yang et al., 2024</xref>) that require localized control or fine-grained detail in human subjects.</p>
<p>DAVIS and Videvo are video datasets commonly used to benchmark temporal colorization models such as BiSTNet (<xref ref-type="bibr" rid="B40">Yang et al., 2024</xref>). Their annotated sequences and high visual fidelity make them ideal for evaluating flicker reduction, temporal consistency, and long-range coherence in video-based colorization tasks.</p>
<p>Oxford 102 Flower is frequently used in models like SS-CycleGAN (<xref ref-type="bibr" rid="B21">Li et al., 2023</xref>) and ParaColorizer (<xref ref-type="bibr" rid="B17">Kumar et al., 2024</xref>) to test colorization in fine-grained textures and natural object structures. The dataset&#x00027;s high intra-class variance and boundary complexity help assess a model&#x00027;s ability to retain detail.</p>
<p>Lastly, the SUN dataset, though smaller, is historically significant for early deep learning colorization models like Deep Colorization (<xref ref-type="bibr" rid="B7">Cheng et al., 2015</xref>), providing a diverse but manageable benchmark for scene understanding and category-driven colorization tasks.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Challenges</title>
<p>Despite significant progress in deep learning-based image colorization, several persistent challenges hinder model robustness, generalization, and deployment efficiency. These limitations often stem from architectural design decisions, training constraints, and dataset biases. Understanding the technical causes behind these issues is essential for improving current models and designing future systems.</p>
<p>One major challenge is feature imbalance in the color distribution, where dominant tones&#x02014;such as grays, browns, and skin-like hues&#x02014;are overrepresented in training data. This skews model predictions toward frequent colors, resulting in desaturated or uniform outputs, particularly in underrepresented regions. Classification-based models attempt to mitigate this using class reweighting and color frequency adjustment, assigning higher loss weights to rare colors. While effective, these techniques rely heavily on empirical tuning of hyperparameters, such as balancing weights and color bin definitions, which can limit generalization across datasets with different statistical properties.</p>
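<p>As a minimal sketch of class reweighting, the snippet below derives per-bin loss weights from an empirical color histogram so that rare bins receive larger weights. The mixing parameter <monospace>mix</monospace>, which blends the empirical distribution with a uniform one, and the toy histogram are assumptions chosen for illustration. They stand in for the empirically tuned hyperparameters discussed above.</p>
<code language="python"><![CDATA[
```python
def rebalancing_weights(bin_frequencies, mix=0.5):
    """Loss weights that grow as a quantized color bin becomes rarer.

    bin_frequencies: empirical frequency of each color bin.
    mix: blend between the empirical distribution (0) and a uniform one (1).
    """
    total = sum(bin_frequencies)
    p = [f / total for f in bin_frequencies]
    num_bins = len(p)
    raw = [1.0 / ((1.0 - mix) * pi + mix / num_bins) for pi in p]
    # Normalize so the expected weight under the empirical distribution is 1.
    expected = sum(pi * w for pi, w in zip(p, raw))
    return [w / expected for w in raw]

# Toy histogram: two common, desaturated bins and two rare, vivid ones.
frequencies = [0.6, 0.3, 0.07, 0.03]
weights = rebalancing_weights(frequencies)
```
]]></code>
<p>Setting <monospace>mix</monospace> to 1 yields uniform weights, while values near 0 weight bins almost inversely to their frequency. This is the sensitivity to tuning that limits transfer across datasets.</p>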
<p>Generative Adversarial Networks (GANs) present another well-known challenge: mode collapse, where the generator learns to produce a narrow set of colorizations regardless of the input diversity. This often arises from imbalanced adversarial training, where the discriminator becomes too strong and overfits to a small set of outputs, preventing the generator from exploring diverse mappings. Architectural solutions such as Wasserstein loss, spectral normalization, gradient penalties, and mini-batch discrimination have been proposed to stabilize training and encourage output diversity (<xref ref-type="bibr" rid="B2">Arjovsky et al., 2017</xref>; <xref ref-type="bibr" rid="B11">Goodfellow et al., 2020</xref>). However, these strategies often come with high computational overhead and are sensitive to training dynamics and architecture-specific constraints, making them difficult to generalize across domains or models without extensive tuning.</p>
<p>Semantic and spatial inconsistencies pose a significant problem, particularly in cluttered scenes with overlapping objects or ambiguous visual cues. For example, Conditional CycleGAN employs cycle-consistency loss to enforce structure preservation, but its deterministic one-to-one mapping cannot account for multi-modal color possibilities, such as a shirt that could plausibly be red or blue. As a result, these models often default to the most statistically probable color, reducing realism. Models like SS-CycleGAN improve upon this with Multi-Scale Cascaded Dilated Convolutions (MCDC) and self-attention, which expand receptive fields and allow the model to align features across spatial hierarchies (<xref ref-type="bibr" rid="B21">Li et al., 2023</xref>). Still, without a probabilistic mechanism, these models remain brittle in scenes with semantic ambiguity. In contrast, VAEs and diffusion models incorporate stochastic sampling and latent-variable conditioning, making them better suited for uncertainty modeling and diverse color prediction&#x02014;but often at the expense of inference speed and simplicity.</p>
<p>Another widespread issue is structural distortion, including edge noise, color bleeding, and boundary mismatch. These problems are especially evident in models without strong instance-awareness or edge supervision. Recent models like CtrlColor integrate SAM-based segmentation and edge-aware loss functions to preserve object boundaries. While these methods improve sharpness and local consistency, they often rely on external modules (e.g., SAM or RAFT) and high-resolution computation, which increase runtime complexity and limit real-time applicability on low-power devices.</p>
<p>In summary, image colorization remains a multi-dimensional optimization problem. Models must balance color diversity, semantic fidelity, spatial structure, and computational efficiency. Each class of architecture addresses some of these goals but introduces new trade-offs. The path forward lies in hybrid designs that combine deterministic structure preservation with probabilistic color reasoning, along with lightweight, end-to-end architectures that minimize external dependencies while supporting interactive and real-time applications.</p>
<p>To better understand how recent colorization models perform in real-world settings, we provide a comparative analysis in <xref ref-type="table" rid="T2">Table 2</xref>. This table summarizes the practical usability of key models based on real-time capability, inference time, and hardware requirements. Such comparisons are essential for selecting appropriate models for deployment on edge devices, real-time systems, or cloud platforms.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Inference performance and real-time capability of key image colorization models.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="left"><bold>Control mode</bold></th>
<th valign="top" align="left"><bold>Inference time</bold></th>
<th valign="top" align="left"><bold>Hardware used</bold></th>
<th valign="top" align="left"><bold>Real-time</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">DDColor <xref ref-type="bibr" rid="B16">Kang et al., 2023</xref></td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">4 Tesla V100</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">ParaColorizer <xref ref-type="bibr" rid="B17">Kumar et al., 2024</xref></td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">&#x0007E;0.24 ms</td>
<td valign="top" align="left">2 Tesla V100</td>
<td valign="top" align="left">Yes</td>
</tr>
<tr>
<td valign="top" align="left">TextIR <xref ref-type="bibr" rid="B3">Bai et al., 2025</xref></td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">2 Tesla V100</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">L-CoIns <xref ref-type="bibr" rid="B6">Chang et al., 2023</xref></td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">8 RTX 3090</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">L-CoDer <xref ref-type="bibr" rid="B5">Chang et al., 2022</xref></td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">4 NVIDIA TITAN RTX</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">L-CoDe <xref ref-type="bibr" rid="B38">Weng et al., 2022b</xref></td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">2 GTX 1080Ti</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">CT2 <xref ref-type="bibr" rid="B37">Weng et al., 2022a</xref></td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">8 RTX 3090</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">Palette <xref ref-type="bibr" rid="B29">Saharia et al., 2022</xref></td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">&#x0007E;0.8 s</td>
<td valign="top" align="left">TPU v3</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">SS-CycleGAN <xref ref-type="bibr" rid="B21">Li et al., 2023</xref></td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">Tesla T4</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">L-CAD <xref ref-type="bibr" rid="B39">Weng et al., 2023</xref></td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">2 RTX 3090Ti</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">Instance-Aware GAN <xref ref-type="bibr" rid="B32">Su et al., 2020</xref></td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">&#x0007E;0.187 s</td>
<td valign="top" align="left">RTX 2080Ti</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">ChromaGAN <xref ref-type="bibr" rid="B34">Vitoria et al., 2020</xref></td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">&#x0007E;4.4 ms</td>
<td valign="top" align="left">Quadro P6000</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">GAN Colorization <xref ref-type="bibr" rid="B27">Nazeri et al., 2018</xref></td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">User-Guided <xref ref-type="bibr" rid="B43">Zhang et al., 2017</xref></td>
<td valign="top" align="left">User Hint</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">Yes</td>
</tr>
<tr>
<td valign="top" align="left">Let There Be Color <xref ref-type="bibr" rid="B36">Welsh et al., 2002</xref></td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">CPU</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">Deep Colorization <xref ref-type="bibr" rid="B7">Cheng et al., 2015</xref></td>
<td valign="top" align="left">None</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">Tesla K40</td>
<td valign="top" align="left">No</td>
</tr>
<tr>
<td valign="top" align="left">BiSTNet <xref ref-type="bibr" rid="B40">Yang et al., 2024</xref></td>
<td valign="top" align="left">Reference frames</td>
<td valign="top" align="left">N/A</td>
<td valign="top" align="left">4 RTX A6000</td>
<td valign="top" align="left">No</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
<sec id="s4">
<label>4</label>
<title>Evaluation metrics</title>
<p>Evaluation metrics are essential for assessing the performance and quality of image colorization methods. They provide quantitative and qualitative insights into how well a model performs, ensuring comprehensive evaluation from multiple perspectives. The metrics used in image colorization are broadly categorized into pixel-wise accuracy, structural and perceptual similarity, generative quality, and task-specific measures.</p>
<sec>
<label>4.1</label>
<title>Pixel-wise accuracy</title>
<p>Pixel-level metrics, such as Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR), are commonly employed to evaluate the fidelity of generated images against the ground truth (<xref ref-type="bibr" rid="B14">Hore and Ziou, 2010</xref>). MSE averages the squared pixel-wise differences between the two images, so lower values indicate a more accurate reconstruction. The formula for MSE is given as:</p>
<disp-formula id="E1"><mml:math id="M1"><mml:mrow><mml:mtext class="textrm" mathvariant="normal">MSE</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></disp-formula>
<p>where <italic>H</italic> and <italic>W</italic> are the height and width of the image, and <italic>X</italic><sub><italic>ij</italic></sub> and <inline-formula><mml:math id="M2"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are the ground-truth and generated pixel values, respectively. PSNR, on the other hand, reflects the reconstruction quality and is computed as:</p>
<disp-formula id="E2"><mml:math id="M3"><mml:mrow><mml:mtext class="textrm" mathvariant="normal">PSNR</mml:mtext><mml:mo>=</mml:mo><mml:mn>20</mml:mn><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mo class="qopname">log</mml:mo></mml:mrow><mml:mrow><mml:mn>10</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mtext class="textrm" mathvariant="normal">MAX</mml:mtext></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mtext class="textrm" mathvariant="normal">MSE</mml:mtext></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where MAX is the maximum possible pixel value (e.g., 255 for 8-bit images). These metrics are extensively used in methods like L-CAD and DDColor to evaluate the accuracy of chrominance predictions (<xref ref-type="bibr" rid="B39">Weng et al., 2023</xref>; <xref ref-type="bibr" rid="B16">Kang et al., 2023</xref>). While effective, these metrics may not fully capture the perceptual quality of colorization outputs, especially in multimodal tasks.</p>
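<p>The two formulas above can be written out directly. The following is a minimal sketch for single-channel images stored as nested lists. The function names and the 8-bit default for <monospace>max_value</monospace> are assumptions, and practical pipelines would typically operate on arrays.</p>
<code language="python"><![CDATA[
```python
import math

def mse(truth, generated):
    """Mean squared error between two H x W images given as nested lists."""
    height, width = len(truth), len(truth[0])
    squared = sum(
        (truth[i][j] - generated[i][j]) ** 2
        for i in range(height)
        for j in range(width)
    )
    return squared / (height * width)

def psnr(truth, generated, max_value=255.0):
    """PSNR in dB; infinite when the two images are identical."""
    error = mse(truth, generated)
    if error == 0:
        return math.inf
    return 20.0 * math.log10(max_value / math.sqrt(error))

truth = [[100, 150], [200, 250]]
generated = [[110, 150], [200, 240]]
error = mse(truth, generated)      # (100 + 0 + 0 + 100) / 4 = 50.0
quality = psnr(truth, generated)   # roughly 31.1 dB
```
]]></code>
<p>Because PSNR is a monotone function of MSE, the two metrics always rank a set of outputs identically. PSNR mainly expresses the same error on a logarithmic scale.</p>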
</sec>
<sec>
<label>4.2</label>
<title>Structural and perceptual similarity</title>
<p>Structural and perceptual similarity metrics assess how well the generated image preserves the structure and visual appearance of the ground-truth image (<xref ref-type="bibr" rid="B35">Wang et al., 2004</xref>). The Structural Similarity Index (SSIM) compares luminance, contrast, and structure using the following equation:</p>
<disp-formula id="E3"><mml:math id="M4"><mml:mrow><mml:mtext class="textrm" mathvariant="normal">SSIM</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:msub><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>where &#x003BC;<sub><italic>x</italic></sub>, &#x003BC;<sub><italic>y</italic></sub> are the means, <inline-formula><mml:math id="M5"><mml:msubsup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> are the variances, and &#x003C3;<sub><italic>xy</italic></sub> is the covariance of the two images, with <italic>C</italic><sub>1</sub> and <italic>C</italic><sub>2</sub> being constants. Learned Perceptual Image Patch Similarity (LPIPS) evaluates perceptual similarity by comparing deep feature representations (<xref ref-type="bibr" rid="B42">Zhang et al., 2018</xref>), as follows:</p>
<disp-formula id="E4"><mml:math id="M6"><mml:mrow><mml:mtext class="textrm" mathvariant="normal">LPIPS</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mo stretchy="false">|</mml:mo><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:msubsup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:math></disp-formula>
<p>where &#x003D5;<sub><italic>l</italic></sub>(<italic>x</italic>) and &#x003D5;<sub><italic>l</italic></sub>(<italic>y</italic>) are the feature maps extracted at layer <italic>l</italic>, and <italic>H</italic><sub><italic>l</italic></sub>, <italic>W</italic><sub><italic>l</italic></sub> are the height and width of that feature map; lower values indicate greater perceptual similarity. SSIM and LPIPS are widely reported for methods such as SS-CycleGAN, ParaColorizer, and BiSTNet to assess structural coherence and perceptual quality (<xref ref-type="bibr" rid="B35">Wang et al., 2004</xref>; <xref ref-type="bibr" rid="B42">Zhang et al., 2018</xref>; <xref ref-type="bibr" rid="B21">Li et al., 2023</xref>; <xref ref-type="bibr" rid="B17">Kumar et al., 2024</xref>; <xref ref-type="bibr" rid="B40">Yang et al., 2024</xref>).</p>
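<p>As an illustrative sketch only, the layer-wise distance in the LPIPS expression above can be written in a few lines of NumPy. The names <monospace>feature_distance</monospace> and <monospace>toy_features</monospace> are assumptions introduced here: the helper merely subsamples the image to stand in for the pretrained network activations that LPIPS relies on, and the channel normalization and learned per-channel weights of the published metric are omitted.</p>
<code language="python">
```python
import numpy as np


def feature_distance(feats_x, feats_y):
    """Sum over layers of the spatially averaged squared L2 feature difference.

    feats_x, feats_y: sequences of arrays shaped (H_l, W_l, C_l), one per layer.
    """
    total = 0.0
    for fx, fy in zip(feats_x, feats_y):
        if fx.shape != fy.shape:
            raise ValueError("feature maps must have matching shapes per layer")
        h, w = fx.shape[:2]
        # Squared L2 norm across channels at every spatial position (i, j).
        sq_norm = np.sum((fx - fy) ** 2, axis=-1)
        # Average over the H_l x W_l positions, then accumulate over layers.
        total += float(sq_norm.sum() / (h * w))
    return total


def toy_features(image):
    """Hypothetical stand-in for pretrained activations: the image at two scales."""
    image = np.asarray(image, dtype=np.float64)
    return [image, image[::2, ::2]]  # full resolution and naive 2x subsampling


x = np.zeros((4, 4, 3))
y = np.ones((4, 4, 3))
print(feature_distance(toy_features(x), toy_features(y)))  # 6.0
```
</code>
<p>For the all-zero and all-one toy images, each of the two stand-in layers contributes 3 (one unit of squared difference per channel), which gives the total of 6.0.</p>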
</sec>
<sec>
<label>4.3</label>
<title>Generative quality</title>
<p>Generative quality metrics, such as Fr&#x000E9;chet Inception Distance (FID) and Inception Score (IS), measure the realism and diversity of generated images (<xref ref-type="bibr" rid="B13">Heusel et al., 2017</xref>; <xref ref-type="bibr" rid="B4">Barratt and Sharma, 2018</xref>). FID measures the distance between the feature distributions of real and generated images, with lower values indicating a closer match, and is given by:</p>
<disp-formula id="E5"><mml:math id="M7"><mml:mrow><mml:mtext class="textrm" mathvariant="normal">FID</mml:mtext><mml:mo>=</mml:mo><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msup><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x0002B;</mml:mo><mml:mtext class="textrm" mathvariant="normal">Tr</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x003A3;</mml:mo></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003A3;</mml:mo></mml:mrow><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mn>2</mml:mn><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x003A3;</mml:mo></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo>&#x003A3;</mml:mo></mml:mrow><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where &#x003BC;<sub><italic>r</italic></sub>, &#x003BC;<sub><italic>g</italic></sub> are the means and &#x003A3;<sub><italic>r</italic></sub>, &#x003A3;<sub><italic>g</italic></sub> are the covariances of real and generated image features. IS evaluates diversity and quality by analyzing the entropy of the class predictions made by a pre-trained classification model. Generative metrics like FID are commonly reported for methods such as Palette and DDColor to assess whether generated outputs are both realistic and diverse (<xref ref-type="bibr" rid="B29">Saharia et al., 2022</xref>; <xref ref-type="bibr" rid="B16">Kang et al., 2023</xref>). Additionally, the Colorfulness Metric assesses the vibrancy and richness of colors in generated images, reflecting the vividness of the results.</p>
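<p>The FID expression above can be evaluated directly once feature vectors are available. The following NumPy sketch is a minimal illustration, not a reference implementation: the function name <monospace>frechet_distance</monospace> is an assumption, the random features stand in for the Inception activations used in practice, and the trace of the matrix square root is obtained from the eigenvalues of the covariance product rather than from an explicit matrix square root.</p>
<code language="python">
```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_real, feats_gen: arrays shaped (n_samples, n_features).
    """
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Squared Euclidean distance between the two feature means.
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    # Tr((sigma_r sigma_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma_r @ sigma_g, which are real and non-negative for
    # covariance matrices; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g).real
    sqrt_trace = float(np.sum(np.sqrt(np.clip(eigvals, 0.0, None))))
    trace_term = float(np.trace(sigma_r) + np.trace(sigma_g)) - 2.0 * sqrt_trace
    return mean_term + trace_term


# Synthetic stand-in features (assumption): 500 samples with 4 dimensions.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
shifted = real + 3.0  # same covariance, every feature mean moved by 3

print(frechet_distance(real, real))     # approximately 0
print(frechet_distance(real, shifted))  # approximately 4 * 3**2 = 36
```
</code>
<p>Because the shifted set has the same covariance as the original, the trace term vanishes up to round-off and the distance reduces to the squared difference of the means.</p>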
</sec>
<sec>
<label>4.4</label>
<title>Qualitative and perceptual evaluations</title>
<p>In addition to quantitative measures, qualitative evaluations and user studies play a vital role in assessing the perceptual realism of colorized images. Perceptual studies, as conducted in methods like ChromaGAN and Real-Time User-Guided Colorization, involve measuring fooling rates and user preferences (<xref ref-type="bibr" rid="B34">Vitoria et al., 2020</xref>; <xref ref-type="bibr" rid="B43">Zhang et al., 2017</xref>). These evaluations complement traditional metrics by capturing subjective qualities such as naturalness and believability, particularly in multimodal and visually ambiguous scenarios.</p>
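<p>As a small worked example, taking a fooling rate to be the fraction of trials in which participants judged a colorized image to be real, the rate and an approximate confidence interval can be computed as sketched below. The function name <monospace>fooling_rate</monospace>, the trial counts, and the use of a normal-approximation (Wald) interval are assumptions made for illustration and are not taken from the cited studies.</p>
<code language="python">
```python
import math


def fooling_rate(n_fooled, n_trials, z=1.96):
    """Return (rate, lower, upper) for the share of trials judged "real".

    The interval is a normal-approximation (Wald) interval; z=1.96 gives
    roughly 95% coverage when n_trials is reasonably large.
    """
    if not (n_trials > 0 and n_trials >= n_fooled >= 0):
        raise ValueError("need n_trials > 0 and n_fooled between 0 and n_trials")
    rate = n_fooled / n_trials
    half_width = z * math.sqrt(rate * (1.0 - rate) / n_trials)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)


# Invented counts for illustration: 32 of 100 colorized images judged real.
rate, low, high = fooling_rate(32, 100)
print(f"fooling rate = {rate:.2f}, approx. 95% CI = [{low:.3f}, {high:.3f}]")
```
</code>
<p>In a forced choice between one real and one colorized image, a rate near 50% would suggest that participants could not tell the two apart, so reporting the interval alongside the rate helps show how far a result is from that level.</p>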
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Emerging trends and future directions</title>
<sec>
<label>5.1</label>
<title>Emerging trends in image colorization</title>
<sec>
<label>5.1.1</label>
<title>Diffusion models as the new backbone</title>
<p>The success of models like Palette (<xref ref-type="bibr" rid="B29">Saharia et al., 2022</xref>) has led to a shift from GANs to diffusion models for high-fidelity, controllable colorization. Diffusion models provide greater color diversity and support iterative refinement, making them better suited to creative tasks. However, high inference latency remains a bottleneck that limits real-time use.</p>
</sec>
<sec>
<label>5.1.2</label>
<title>Prompt-based and multimodal interaction</title>
<p>Prompt-guided models like L-CAD and TextIR (<xref ref-type="bibr" rid="B39">Weng et al., 2023</xref>; <xref ref-type="bibr" rid="B3">Bai et al., 2025</xref>) illustrate how text can guide colorization flexibly, even at the region level. With increasing adoption of CLIP and similar models, the future may lean toward foundation model-guided colorization, allowing zero-shot or few-shot customization using natural language.</p>
</sec>
<sec>
<label>5.1.3</label>
<title>Real-time and lightweight inference</title>
<p>Models like ParaColorizer (<xref ref-type="bibr" rid="B17">Kumar et al., 2024</xref>) and User-Guided Colorization (<xref ref-type="bibr" rid="B43">Zhang et al., 2017</xref>) reflect an increasing demand for real-time colorization, particularly for mobile and AR/VR applications. Future systems will likely prioritize architectures that sacrifice only a small amount of quality in exchange for fast, efficient deployment on edge devices.</p>
</sec>
<sec>
<label>5.1.4</label>
<title>Ethics, bias mitigation, and explainability</title>
<p>As image colorization systems move beyond artistic applications and into sensitive domains&#x02014;such as historical restoration, forensic analysis, and medical imaging&#x02014;the need for ethical safeguards has become increasingly urgent. A primary ethical challenge is the risk of color misinterpretation. When models hallucinate colors without ground truth references, they may inadvertently introduce misleading or historically inaccurate information. For example, assigning skin tones or fabric colors in archival photographs could distort cultural or racial identity, leading to unintentional misrepresentation of the past.</p>
<p>Another major concern is dataset bias. Popular datasets such as COCO or ImageNet often reflect implicit social and cultural biases, which can propagate into generated outputs. This may result in systematically skewed colorizations&#x02014;for instance, consistently rendering certain demographics with particular tones&#x02014;thereby reinforcing stereotypes or marginalizing underrepresented groups.</p>
<p>In high-stakes domains like journalism or forensics, hallucinated colorizations may be mistaken for fact, particularly when presented without proper disclaimers. In evidentiary settings, such misinterpretations could even carry legal implications. This underscores the importance of embedding uncertainty visualization, provenance tracking, and clear disclaimers to differentiate generated content from original data.</p>
<p>Explainability also remains limited. While some recent systems integrate user hints, segmentation cues, or attention mechanisms to guide outputs, most colorization pipelines remain opaque to end users. This lack of transparency hinders trust and accountability, especially in workflows where factual accuracy is paramount.</p>
<p>To mitigate these concerns, future research should emphasize transparency mechanisms such as attribution maps, error bounds, and dataset audits. Additionally, incorporating controllable generation frameworks with provenance logging can empower users to better understand and guide the colorization process&#x02014;promoting both ethical integrity and user trust.</p>
</sec>
</sec>
<sec>
<label>5.2</label>
<title>Future research directions</title>
<p>Emerging trends in image colorization highlight the shift toward hybrid transformer-convolutional architectures, structure-aware learning, prompt-driven multimodal control, and real-time interactivity. While current models achieve photorealism and semantic richness, challenges remain in scalability, boundary preservation, and global reasoning.</p>
<p>A notable direction is the move from traditional CNN backbones (e.g., VGG-16 in ChromaGAN) to hybrid architectures combining ConvNeXt and Transformers (<xref ref-type="bibr" rid="B34">Vitoria et al., 2020</xref>; <xref ref-type="bibr" rid="B16">Kang et al., 2023</xref>; <xref ref-type="bibr" rid="B25">Liu et al., 2021</xref>). Models like DDColor show how ConvNeXt can preserve textures while transformers enhance contextual reasoning (<xref ref-type="bibr" rid="B26">Liu et al., 2022</xref>). A hybrid multi-scale architecture with structured attention fusion could improve local-global feature integration, addressing issues like over-smoothing and poor generalization.</p>
<p>Color bleeding remains a challenge in GAN-based models (<xref ref-type="bibr" rid="B21">Li et al., 2023</xref>; <xref ref-type="bibr" rid="B17">Kumar et al., 2024</xref>; <xref ref-type="bibr" rid="B27">Nazeri et al., 2018</xref>). Recent solutions propose edge-conditioned discriminators (e.g., using Canny or HED maps) and boundary-aware generators with edge-guided attention. A dual-weighted loss that balances perceptual smoothness with structural sharpness could further improve fidelity and boundary accuracy.</p>
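<p>One possible reading of such a dual-weighted loss is sketched below under stated assumptions. The names <monospace>dual_weighted_loss</monospace>, <monospace>w_smooth</monospace>, and <monospace>w_edge</monospace> are hypothetical; a pixel-wise L1 term stands in for the perceptual component, and finite-difference gradients stand in for the Canny or HED edge maps mentioned above. It is meant to show the weighting idea, not a formulation from the cited works.</p>
<code language="python">
```python
import numpy as np


def gradient_maps(img):
    """Horizontal and vertical finite differences of an (H, W, C) image."""
    return np.diff(img, axis=1), np.diff(img, axis=0)


def dual_weighted_loss(pred, target, w_smooth=1.0, w_edge=1.0):
    """Weighted sum of a pixel-wise L1 term and a gradient-matching L1 term."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Pixel-wise term: rewards overall agreement of color values.
    smooth_term = np.mean(np.abs(pred - target))
    # Gradient term: penalizes blurred, missing, or shifted color boundaries.
    pdx, pdy = gradient_maps(pred)
    tdx, tdy = gradient_maps(target)
    edge_term = np.mean(np.abs(pdx - tdx)) + np.mean(np.abs(pdy - tdy))
    return float(w_smooth * smooth_term + w_edge * edge_term)


# Toy target with a sharp vertical boundary; the prediction is a flat average.
target = np.zeros((4, 4, 1))
target[:, 2:, :] = 1.0
blurred = np.full((4, 4, 1), 0.5)

print(dual_weighted_loss(blurred, target, w_edge=0.0))  # pixel term only
print(dual_weighted_loss(blurred, target, w_edge=1.0))  # adds the edge penalty
```
</code>
<p>Raising the hypothetical <monospace>w_edge</monospace> weight increases the penalty on the flat prediction, which has the same pixel-wise error everywhere but no boundary at all.</p>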
<p>Multimodal frameworks such as L-CAD and TextIR reflect another key trend, enabling prompt-guided and user-controllable colorization via text or exemplars (<xref ref-type="bibr" rid="B39">Weng et al., 2023</xref>; <xref ref-type="bibr" rid="B3">Bai et al., 2025</xref>). These systems offer customization and interactivity, paving the way for integration into creative tools and restoration pipelines.</p>
<p>In summary, the future of colorization lies in developing interactive, scalable, and semantically aligned systems. Through architectural innovation and user-focused design, next-generation models will support applications in digital media, heritage restoration, and augmented creativity.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="s6">
<label>6</label>
<title>Conclusion</title>
<p>Image colorization has advanced significantly, driven by deep learning, semantic understanding, and generative models. This review explored innovations such as semantic class distributions, multimodal fusion, and user-guided controls, addressing challenges like multimodal uncertainty and object-level consistency. Despite these advancements, limitations such as high computational costs, dataset dependencies, and poor performance on unseen scenarios remain. Future work should focus on lightweight models, enhanced generalization, and interactive frameworks to balance automation with creative flexibility. Transforming grayscale to vibrant color continues to be an exciting frontier in computer vision.</p>
</sec>
</body>
<back>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>OG: Formal analysis, Investigation, Validation, Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing. YP: Supervision, Writing &#x02013; review &#x00026; editing.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="s9">
<title>Generative AI statement</title>
<p>The author(s) declare that no Gen AI was used in the creation of this manuscript.</p>
<p>Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>An</surname> <given-names>Y.</given-names></name> <name><surname>Gui</surname> <given-names>L.</given-names></name> <name><surname>Hu</surname> <given-names>Q.</given-names></name> <name><surname>Cai</surname> <given-names>C.</given-names></name> <name><surname>Ye</surname> <given-names>T.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2025</year>). <article-title>Controllable image colorization with instance-aware texts and masks</article-title>. <source>arXiv preprint arXiv:2505.08705</source>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2505.08705</pub-id></mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Arjovsky</surname> <given-names>M.</given-names></name> <name><surname>Chintala</surname> <given-names>S.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Wasserstein generative adversarial networks,&#x0201D;</article-title> in <source>International Conference on Machine Learning (ICML)</source> (<publisher-loc>Sydney, NSW</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>214</fpage>&#x02013;<lpage>223</lpage>.</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Bai</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Xie</surname> <given-names>S.</given-names></name> <name><surname>Dong</surname> <given-names>C.</given-names></name> <name><surname>Yuan</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name></person-group> (<year>2025</year>). <article-title>&#x0201C;TextIR: a simple framework for text-based editable image restoration,&#x0201D;</article-title> in <source>IEEE Transactions on Visualization and Computer Graphics</source> (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>). doi: <pub-id pub-id-type="doi">10.1109/TVCG.2025.3550844</pub-id><pub-id pub-id-type="pmid">40072855</pub-id></mixed-citation></ref>
<ref id="B4">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Barratt</surname> <given-names>S.</given-names></name> <name><surname>Sharma</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <article-title>A note on the inception score</article-title>. <source>arXiv preprint arXiv:1801.01973</source>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1801.01973</pub-id></mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>Z.</given-names></name> <name><surname>Weng</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Shi</surname> <given-names>B.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;L-CoDer: language-based colorization with color-object decoupling transformer,&#x0201D;</article-title> in <source>European Conference on Computer Vision</source> (<publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>360</fpage>&#x02013;<lpage>375</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-031-19797-0_21</pub-id></mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>Z.</given-names></name> <name><surname>Weng</surname> <given-names>S.</given-names></name> <name><surname>Zhang</surname> <given-names>P.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Shi</surname> <given-names>B.</given-names></name></person-group> (<year>2023</year>). <article-title>&#x0201C;L-CoIns: language-based colorization with instance awareness,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Vancouver, BC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>19221</fpage>&#x02013;<lpage>19230</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR52729.2023.01842</pub-id></mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Cheng</surname> <given-names>Z.</given-names></name> <name><surname>Yang</surname> <given-names>Q.</given-names></name> <name><surname>Sheng</surname> <given-names>B.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Deep colorization,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Santiago</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>415</fpage>&#x02013;<lpage>423</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ICCV.2015.55</pub-id></mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Dong</surname> <given-names>W.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Li</surname> <given-names>L.-J.</given-names></name> <name><surname>Li</surname> <given-names>K.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;ImageNet: a large-scale hierarchical image database,&#x0201D;</article-title> in <source>2009 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Miami</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>248</fpage>&#x02013;<lpage>255</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR.2009.5206848</pub-id><pub-id pub-id-type="pmid">26886976</pub-id></mixed-citation></ref>
<ref id="B9">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Fei</surname> <given-names>B.</given-names></name> <name><surname>Lyu</surname> <given-names>Z.</given-names></name> <name><surname>Pan</surname> <given-names>L.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Yang</surname> <given-names>W.</given-names></name> <name><surname>Luo</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>&#x0201C;Generative diffusion prior for unified image restoration and enhancement,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Vancouver, BC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>9935</fpage>&#x02013;<lpage>9946</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR52729.2023.00958</pub-id></mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>X.</given-names></name> <name><surname>Mou</surname> <given-names>J.</given-names></name> <name><surname>Banerjee</surname> <given-names>S.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name></person-group> (<year>2023</year>). <article-title>Color-gray multi-image hybrid compression-encryption scheme based on bp neural network and knight tour</article-title>. <source>IEEE Trans. Cybern</source>. <volume>53</volume>, <fpage>5037</fpage>&#x02013;<lpage>5047</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TCYB.2023.3267785</pub-id><pub-id pub-id-type="pmid">37130254</pub-id></mixed-citation></ref>
<ref id="B11">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Pouget-Abadie</surname> <given-names>J.</given-names></name> <name><surname>Mirza</surname> <given-names>M.</given-names></name> <name><surname>Xu</surname> <given-names>B.</given-names></name> <name><surname>Warde-Farley</surname> <given-names>D.</given-names></name> <name><surname>Ozair</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Generative adversarial networks</article-title>. <source>Commun. ACM</source> <volume>63</volume>, <fpage>139</fpage>&#x02013;<lpage>144</lpage>. doi: <pub-id pub-id-type="doi">10.1145/3422622</pub-id></mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Gkioxari</surname> <given-names>G.</given-names></name> <name><surname>Doll&#x000E1;r</surname> <given-names>P.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Mask R-CNN,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Venice</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2961</fpage>&#x02013;<lpage>2969</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ICCV.2017.322</pub-id></mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Heusel</surname> <given-names>M.</given-names></name> <name><surname>Ramsauer</surname> <given-names>H.</given-names></name> <name><surname>Unterthiner</surname> <given-names>T.</given-names></name> <name><surname>Nessler</surname> <given-names>B.</given-names></name> <name><surname>Hochreiter</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>GANs trained by a two time-scale update rule converge to a local Nash equilibrium</article-title>. <source>Adv. Neural. Inf. Process Syst</source>. <volume>30</volume>, <fpage>6626</fpage>&#x02013;<lpage>6637</lpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1706.08500</pub-id></mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Hore</surname> <given-names>A.</given-names></name> <name><surname>Ziou</surname> <given-names>D.</given-names></name></person-group> (<year>2010</year>). <article-title>&#x0201C;Image quality metrics: PSNR vs. SSIM,&#x0201D;</article-title> in <source>2010 20th International Conference on Pattern Recognition (ICPR)</source> (<publisher-loc>Istanbul</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2366</fpage>&#x02013;<lpage>2369</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ICPR.2010.579</pub-id></mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Isola</surname> <given-names>P.</given-names></name> <name><surname>Zhu</surname> <given-names>J.-Y.</given-names></name> <name><surname>Zhou</surname> <given-names>T.</given-names></name> <name><surname>Efros</surname> <given-names>A. A.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Image-to-image translation with conditional adversarial networks,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1125</fpage>&#x02013;<lpage>1134</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR.2017.632</pub-id></mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Kang</surname> <given-names>X.</given-names></name> <name><surname>Yang</surname> <given-names>T.</given-names></name> <name><surname>Ouyang</surname> <given-names>W.</given-names></name> <name><surname>Ren</surname> <given-names>P.</given-names></name> <name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Xie</surname> <given-names>X.</given-names></name></person-group> (<year>2023</year>). <article-title>&#x0201C;DDColor: towards photo-realistic image colorization via dual decoders,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Paris</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>328</fpage>&#x02013;<lpage>338</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ICCV51070.2023.00037</pub-id></mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kumar</surname> <given-names>H.</given-names></name> <name><surname>Banerjee</surname> <given-names>A.</given-names></name> <name><surname>Saurav</surname> <given-names>S.</given-names></name> <name><surname>Singh</surname> <given-names>S.</given-names></name></person-group> (<year>2024</year>). <article-title>ParaColorizer-realistic image colorization using parallel generative networks</article-title>. <source>Vis. Comput</source>. <volume>40</volume>, <fpage>4039</fpage>&#x02013;<lpage>4054</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s00371-023-03067-7</pub-id></mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kumar</surname> <given-names>M.</given-names></name> <name><surname>Weissenborn</surname> <given-names>D.</given-names></name> <name><surname>Kalchbrenner</surname> <given-names>N.</given-names></name></person-group> (<year>2021</year>). <article-title>Colorization transformer</article-title>. <source>arXiv preprint arXiv:2102.04432</source>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2102.04432</pub-id></mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Larsson</surname> <given-names>G.</given-names></name> <name><surname>Maire</surname> <given-names>M.</given-names></name> <name><surname>Shakhnarovich</surname> <given-names>G.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Learning representations for automatic colorization,&#x0201D;</article-title> in <source>Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11&#x02013;14, 2016, Proceedings, Part IV 14</source> (<publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>577</fpage>&#x02013;<lpage>593</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-319-46493-0_35</pub-id></mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>J.</given-names></name> <name><surname>Kim</surname> <given-names>E.</given-names></name> <name><surname>Lee</surname> <given-names>Y.</given-names></name> <name><surname>Kim</surname> <given-names>D.</given-names></name> <name><surname>Chang</surname> <given-names>J.</given-names></name> <name><surname>Choo</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Reference-based sketch image colorization using augmented-self reference and dense semantic correspondence,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Seattle, WA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5801</fpage>&#x02013;<lpage>5810</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00584</pub-id></mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Lu</surname> <given-names>Y.</given-names></name> <name><surname>Pang</surname> <given-names>W.</given-names></name> <name><surname>Xu</surname> <given-names>H.</given-names></name></person-group> (<year>2023</year>). <article-title>Image colorization using CycleGAN with semantic and spatial rationality</article-title>. <source>Multimed. Tools Appl</source>. <volume>82</volume>, <fpage>21641</fpage>&#x02013;<lpage>21655</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11042-023-14675-9</pub-id></mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Yang</surname> <given-names>S.</given-names></name> <name><surname>Liu</surname> <given-names>J.</given-names></name></person-group> (<year>2025</year>). <article-title>Language-based image colorization: a benchmark and beyond</article-title>. <source>arXiv preprint arXiv:2503.14974</source>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2503.14974</pub-id></mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T.-Y.</given-names></name> <name><surname>Maire</surname> <given-names>M.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name> <name><surname>Hays</surname> <given-names>J.</given-names></name> <name><surname>Perona</surname> <given-names>P.</given-names></name> <name><surname>Ramanan</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>&#x0201C;Microsoft COCO: common objects in context,&#x0201D;</article-title> in <source>Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6&#x02013;12, 2014, Proceedings, Part V 13</source> (<publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>740</fpage>&#x02013;<lpage>755</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-319-10602-1_48</pub-id></mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Zhao</surname> <given-names>H.</given-names></name> <name><surname>Chan</surname> <given-names>K. C.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Loy</surname> <given-names>C. C.</given-names></name> <name><surname>Qiao</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Temporally consistent video colorization with deep feature propagation and self-regularization learning</article-title>. <source>Comput. Vis. Media</source> <volume>10</volume>, <fpage>375</fpage>&#x02013;<lpage>395</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s41095-023-0342-8</pub-id></mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Lin</surname> <given-names>Y.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Hu</surname> <given-names>H.</given-names></name> <name><surname>Wei</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;Swin transformer: Hierarchical vision transformer using shifted windows,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>10012</fpage>&#x02013;<lpage>10022</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00986</pub-id></mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Mao</surname> <given-names>H.</given-names></name> <name><surname>Wu</surname> <given-names>C.-Y.</given-names></name> <name><surname>Feichtenhofer</surname> <given-names>C.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name> <name><surname>Xie</surname> <given-names>S.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;A convnet for the 2020s,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>New Orleans, LA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>11976</fpage>&#x02013;<lpage>11986</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01167</pub-id></mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Nazeri</surname> <given-names>K.</given-names></name> <name><surname>Ng</surname> <given-names>E.</given-names></name> <name><surname>Ebrahimi</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Image colorization using generative adversarial networks,&#x0201D;</article-title> in <source>Articulated Motion and Deformable Objects: 10th International Conference, AMDO 2018, Palma de Mallorca, Spain, July 12-13, 2018, Proceedings 10</source> (<publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>85</fpage>&#x02013;<lpage>94</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-319-94544-6_9</pub-id></mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Pramanick</surname> <given-names>A.</given-names></name> <name><surname>Sarma</surname> <given-names>S.</given-names></name> <name><surname>Sur</surname> <given-names>A.</given-names></name></person-group> (<year>2024</year>). <article-title>&#x0201C;X-caunet: Cross-color channel attention with underwater image-enhancing transformer,&#x0201D;</article-title> in <source>ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing</source> (<publisher-loc>Seoul</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3550</fpage>&#x02013;<lpage>3554</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ICASSP48485.2024.10445832</pub-id></mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Saharia</surname> <given-names>C.</given-names></name> <name><surname>Chan</surname> <given-names>W.</given-names></name> <name><surname>Chang</surname> <given-names>H.</given-names></name> <name><surname>Lee</surname> <given-names>C.</given-names></name> <name><surname>Ho</surname> <given-names>J.</given-names></name> <name><surname>Salimans</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>&#x0201C;Palette: image-to-image diffusion models,&#x0201D;</article-title> in <source>ACM SIGGRAPH 2022 Conference Proceedings</source> (<publisher-loc>Vancouver, BC</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>10</lpage>. doi: <pub-id pub-id-type="doi">10.1145/3528233.3530757</pub-id><pub-id pub-id-type="pmid">38400307</pub-id></mixed-citation></ref>
<ref id="B30">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Sangkloy</surname> <given-names>P.</given-names></name> <name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Fang</surname> <given-names>C.</given-names></name> <name><surname>Yu</surname> <given-names>F.</given-names></name> <name><surname>Hays</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Scribbler: controlling deep image synthesis with sketch and color,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5400</fpage>&#x02013;<lpage>5409</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR.2017.723</pub-id></mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Shafiq</surname> <given-names>H.</given-names></name> <name><surname>Nguyen</surname> <given-names>T.</given-names></name> <name><surname>Lee</surname> <given-names>B.</given-names></name></person-group> (<year>2025</year>). <article-title>Colorformer: a novel colorization method based on a transformer</article-title>. <source>Neurocomputing</source> <volume>649</volume>:<fpage>130743</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.neucom.2025.130743</pub-id></mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Su</surname> <given-names>J.-W.</given-names></name> <name><surname>Chu</surname> <given-names>H.-K.</given-names></name> <name><surname>Huang</surname> <given-names>J.-B.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Instance-aware image colorization,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Seattle, WA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7968</fpage>&#x02013;<lpage>7977</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00799</pub-id><pub-id pub-id-type="pmid">38064958</pub-id></mixed-citation></ref>
<ref id="B33">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tassin</surname> <given-names>I.</given-names></name> <name><surname>Goebel</surname> <given-names>K.</given-names></name> <name><surname>Lasher</surname> <given-names>B.</given-names></name></person-group> (<year>2025</year>). <article-title>Convolutional deep colorization for image compression: a color grid based approach</article-title>. <source>arXiv preprint arXiv:2502.05402</source>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2502.05402</pub-id></mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Vitoria</surname> <given-names>P.</given-names></name> <name><surname>Raad</surname> <given-names>L.</given-names></name> <name><surname>Ballester</surname> <given-names>C.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Chromagan: adversarial picture colorization with semantic class distribution,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source> (<publisher-loc>Snowmass Village, CO</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2445</fpage>&#x02013;<lpage>2454</lpage>. doi: <pub-id pub-id-type="doi">10.1109/WACV45572.2020.9093389</pub-id></mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Bovik</surname> <given-names>A. C.</given-names></name> <name><surname>Sheikh</surname> <given-names>H. R.</given-names></name> <name><surname>Simoncelli</surname> <given-names>E. P.</given-names></name></person-group> (<year>2004</year>). <article-title>Image quality assessment: from error visibility to structural similarity</article-title>. <source>IEEE Trans. Image Process</source>. <volume>13</volume>, <fpage>600</fpage>&#x02013;<lpage>612</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TIP.2003.819861</pub-id><pub-id pub-id-type="pmid">15376593</pub-id></mixed-citation></ref>
<ref id="B36">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Welsh</surname> <given-names>T.</given-names></name> <name><surname>Ashikhmin</surname> <given-names>M.</given-names></name> <name><surname>Mueller</surname> <given-names>K.</given-names></name></person-group> (<year>2002</year>). <article-title>&#x0201C;Transferring color to greyscale images,&#x0201D;</article-title> in <source>Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques</source> (<publisher-loc>San Antonio, TX</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>277</fpage>&#x02013;<lpage>280</lpage>. doi: <pub-id pub-id-type="doi">10.1145/566570.566576</pub-id></mixed-citation>
</ref>
<ref id="B37">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Weng</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Shi</surname> <given-names>B.</given-names></name></person-group> (<year>2022a</year>). <article-title>&#x0201C;CT2: colorization transformer via color tokens,&#x0201D;</article-title> in <source>European Conference on Computer Vision</source> (<publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>16</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-031-20071-7_1</pub-id></mixed-citation>
</ref>
<ref id="B38">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Weng</surname> <given-names>S.</given-names></name> <name><surname>Wu</surname> <given-names>H.</given-names></name> <name><surname>Chang</surname> <given-names>Z.</given-names></name> <name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Shi</surname> <given-names>B.</given-names></name></person-group> (<year>2022b</year>). <article-title>L-code: language-based colorization using color-object decoupled conditions</article-title>. <source>Proc. AAAI Conf. Artif. Intell</source>. <volume>36</volume>, <fpage>2677</fpage>&#x02013;<lpage>2684</lpage>. doi: <pub-id pub-id-type="doi">10.1609/aaai.v36i3.20170</pub-id></mixed-citation>
</ref>
<ref id="B39">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Weng</surname> <given-names>S.</given-names></name> <name><surname>Zhang</surname> <given-names>P.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Shi</surname> <given-names>B.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>L-cad: language-based colorization with any-level descriptions using diffusion priors</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>36</volume>, <fpage>77174</fpage>&#x02013;<lpage>77186</lpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2310.14191</pub-id></mixed-citation>
</ref>
<ref id="B40">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Pan</surname> <given-names>J.</given-names></name> <name><surname>Peng</surname> <given-names>Z.</given-names></name> <name><surname>Du</surname> <given-names>X.</given-names></name> <name><surname>Tao</surname> <given-names>Z.</given-names></name> <name><surname>Tang</surname> <given-names>J.</given-names></name></person-group> (<year>2024</year>). <article-title>Bistnet: semantic image prior guided bidirectional temporal feature fusion for deep exemplar-based video colorization</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>46</volume>, <fpage>5612</fpage>&#x02013;<lpage>5624</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TPAMI.2024.3370920</pub-id><pub-id pub-id-type="pmid">38416607</pub-id></mixed-citation></ref>
<ref id="B41">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>R.</given-names></name> <name><surname>Isola</surname> <given-names>P.</given-names></name> <name><surname>Efros</surname> <given-names>A. A.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Colorful image colorization,&#x0201D;</article-title> in <source>Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14</source> (<publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>649</fpage>&#x02013;<lpage>666</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-319-46487-9_40</pub-id></mixed-citation>
</ref>
<ref id="B42">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>R.</given-names></name> <name><surname>Isola</surname> <given-names>P.</given-names></name> <name><surname>Efros</surname> <given-names>A. A.</given-names></name> <name><surname>Shechtman</surname> <given-names>E.</given-names></name> <name><surname>Wang</surname> <given-names>O.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;The unreasonable effectiveness of deep features as a perceptual metric,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>586</fpage>&#x02013;<lpage>595</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR.2018.00068</pub-id></mixed-citation>
</ref>
<ref id="B43">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>R.</given-names></name> <name><surname>Zhu</surname> <given-names>J.-Y.</given-names></name> <name><surname>Isola</surname> <given-names>P.</given-names></name> <name><surname>Geng</surname> <given-names>X.</given-names></name> <name><surname>Lin</surname> <given-names>A. S.</given-names></name> <name><surname>Yu</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Real-time user-guided image colorization with learned deep priors</article-title>. <source>arXiv preprint arXiv:1705.02999</source>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1705.02999</pub-id></mixed-citation>
</ref>
<ref id="B44">
<mixed-citation publication-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Yin</surname> <given-names>Z.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Yin</surname> <given-names>G.</given-names></name> <name><surname>Yan</surname> <given-names>J.</given-names></name> <name><surname>Shao</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;Celeba-spoof: large-scale face anti-spoofing dataset with rich annotations,&#x0201D;</article-title> in <source>Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XII 16</source> (<publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>70</fpage>&#x02013;<lpage>85</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-030-58610-2_5</pub-id></mixed-citation>
</ref>
<ref id="B45">
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>B.</given-names></name> <name><surname>Lapedriza</surname> <given-names>A.</given-names></name> <name><surname>Khosla</surname> <given-names>A.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name> <name><surname>Torralba</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Places: a 10 million image database for scene recognition</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>40</volume>, <fpage>1452</fpage>&#x02013;<lpage>1464</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TPAMI.2017.2723009</pub-id><pub-id pub-id-type="pmid">28692961</pub-id></mixed-citation></ref>
</ref-list>
<fn-group>
<fn fn-type="custom" custom-type="edited-by" id="fn0001">
<p>Edited by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2591698/overview">Hang Cheng</ext-link>, Fuzhou University, China</p>
</fn>
<fn fn-type="custom" custom-type="reviewed-by" id="fn0002">
<p>Reviewed by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1658315/overview">Sinem Aslan</ext-link>, University of Milan, Italy</p>
</fn>
</fn-group>
</back>
</article>