<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3-mathml3.dtd">
<article article-type="research-article" dtd-version="1.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Robot. AI</journal-id>
<journal-title-group>
<journal-title>Frontiers in Robotics and AI</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Robot. AI</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">2296-9144</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1737238</article-id>
<article-id pub-id-type="doi">10.3389/frobt.2025.1737238</article-id>
<article-version article-version-type="Version of Record" vocab="NISO-RP-8-2008"/>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Original Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>LG-H-PPO: offline hierarchical PPO for robot path planning on a latent graph</article-title>
<alt-title alt-title-type="left-running-head">Han</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frobt.2025.1737238">10.3389/frobt.2025.1737238</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Han</surname>
<given-names>Xiang</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/3262794"/>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; original draft" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-original-draft/">Writing &#x2013; original draft</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; review &#x26; editing" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing &#x2013; review and editing</role>
</contrib>
</contrib-group>
<aff id="aff">
<institution>China University of Petroleum (East China)</institution>, <city>Qingdao</city>, <country country="CN">China</country>
</aff>
<author-notes>
<corresp id="c001">
<label>&#x2a;</label>Correspondence: Xiang Han, <email xlink:href="mailto:2473495989@qq.com">2473495989@qq.com</email>
</corresp>
</author-notes>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2026-01-07">
<day>07</day>
<month>01</month>
<year>2026</year>
</pub-date>
<pub-date publication-format="electronic" date-type="collection">
<year>2025</year>
</pub-date>
<volume>12</volume>
<elocation-id>1737238</elocation-id>
<history>
<date date-type="received">
<day>01</day>
<month>11</month>
<year>2025</year>
</date>
<date date-type="rev-recd">
<day>05</day>
<month>12</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>08</day>
<month>12</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2026 Han.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Han</copyright-holder>
<license>
<ali:license_ref start_date="2026-01-07">https://creativecommons.org/licenses/by/4.0/</ali:license_ref>
<license-p>This is an open-access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License (CC BY)</ext-link>. The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</license-p>
</license>
</permissions>
<abstract>
<p>The path planning capability of autonomous robots in complex environments is crucial for their widespread application in the real world. However, long-horizon decision-making and sparse reward signals pose significant challenges to traditional reinforcement learning (RL) algorithms. Offline hierarchical reinforcement learning (HRL) offers an effective approach by decomposing tasks into two stages: high-level subgoal generation and low-level subgoal attainment. Advanced offline HRL methods, such as Guider and HIQL, typically introduce latent spaces in high-level policies to represent subgoals, thereby handling high-dimensional states and enhancing generalization. However, these approaches require the high-level policy to search for and generate subgoals within a continuous latent space. This remains complex and sample-inefficient for policy optimization algorithms, particularly policy-gradient methods such as PPO, and often leads to unstable training and slow convergence. To address this core limitation, this paper proposes a novel offline hierarchical PPO framework, LG-H-PPO (Latent Graph-based Hierarchical PPO). The core innovation of LG-H-PPO lies in discretizing the continuous latent space into a structured &#x201c;latent graph.&#x201d; By transforming high-level planning from challenging &#x201c;continuous creation&#x201d; into simple &#x201c;discrete selection,&#x201d; LG-H-PPO substantially reduces the learning difficulty for the high-level policy. Preliminary experiments on standard D4RL offline navigation benchmarks demonstrate that LG-H-PPO achieves significant advantages over advanced baselines such as Guider and HIQL in both convergence speed and final task success rate. The main contribution of this paper is the introduction of graph structures into latent-variable HRL planning. This effectively simplifies the action space of the high-level policy, enhancing the training efficiency and stability of offline HRL algorithms on long-horizon navigation tasks, and lays the foundation for future offline HRL research combining latent-variable representations with explicit graph planning.</p>
</abstract>
<kwd-group>
<kwd>latent graph</kwd>
<kwd>offline hierarchical PPO</kwd>
<kwd>offline reinforcement learning</kwd>
<kwd>robot path planning</kwd>
<kwd>sparse reward</kwd>
</kwd-group>
<funding-group>
<funding-statement>The author(s) declared that financial support was not received for this work and/or its publication.</funding-statement>
</funding-group>
<counts>
<fig-count count="3"/>
<table-count count="2"/>
<equation-count count="5"/>
<ref-count count="16"/>
<page-count count="00"/>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Robot Learning and Evolution</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<label>1</label>
<title>Introduction</title>
<p>With the rapid advancement of robotics, endowing robots with the ability to autonomously navigate in unknown or complex environments has become one of the core challenges in the fields of artificial intelligence and robotics (<xref ref-type="bibr" rid="B11">Martinez-Baselga et al., 2023</xref>). Whether for household service robots, warehouse logistics AGVs, or planetary rovers, efficient and safe path planning forms the foundation for accomplishing their tasks. However, real-world navigation tasks often involve long-horizon decision making, in which robots must execute a long sequence of actions to reach their destination, and sparse rewards, in which clear positive feedback is obtained only when the robot ultimately reaches the goal or completes specific subtasks. These two characteristics pose significant challenges to traditional supervised learning and model-based planning methods. Reinforcement learning (RL), particularly deep reinforcement learning (DRL), is considered a powerful tool for addressing such problems due to its ability to learn optimal policies through trial and error (<xref ref-type="bibr" rid="B3">Barto, 2021</xref>).</p>
<p>Standard online RL algorithms, such as Proximal Policy Optimization (PPO) (<xref ref-type="bibr" rid="B15">Schulman et al., 2017</xref>), have achieved success in many domains. However, their &#x201c;learn-while-exploring&#x201d; paradigm requires extensive interactions with the environment to gather sufficient effective experience. This is often costly, time-consuming, and even hazardous in real robotic systems (<xref ref-type="bibr" rid="B4">Chen et al., 2025</xref>).</p>
<p>To overcome the limitations of online RL, offline reinforcement learning (Offline RL) (<xref ref-type="bibr" rid="B10">Levine et al., 2020</xref>) emerged. Offline RL aims to learn policies using only pre-collected, fixed datasets, completely avoiding online interactions with the environment. This enables the utilization of large-scale, diverse historical data. However, Offline RL faces its own unique challenge: the out-of-distribution (OOD) action problem. Learned policies may select actions not present in the dataset, and their value estimates are often inaccurate, leading to a sharp decline in performance (<xref ref-type="bibr" rid="B9">Kumar et al., 2020</xref>). For long-horizon, sparse-reward problems, hierarchical reinforcement learning (HRL) offers an effective solution (<xref ref-type="bibr" rid="B8">Kulkarni et al., 2016</xref>). HRL decomposes complex tasks into multiple hierarchical sub-tasks. In a typical two-layer architecture, the high-level policy formulates a sequence of subgoals, while the low-level policy executes primitive actions to achieve the current subgoal. This decomposition not only reduces the temporal scale a single policy must handle but also facilitates credit assignment.</p>
<p>In recent years, Offline HRL&#x2014;the integration of Offline RL and HRL&#x2014;has emerged as a research hotspot, regarded as a promising direction for tackling complex robotic tasks. Guider (<xref ref-type="bibr" rid="B16">Shin and Kim, 2023</xref>) and HIQL (<xref ref-type="bibr" rid="B12">Park et al., 2023</xref>) share the common contribution of successfully leveraging latent spaces to handle high-dimensional states (e.g., images) and promote subgoal generalization, while improving sample efficiency through offline learning frameworks. However, they also share a core limitation: high-level policies <inline-formula id="inf1">
<mml:math id="m1">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> must plan and make decisions within a continuous latent space that may be high-dimensional (though lower than the original state space). Concurrently, another research approach attempts to explicitly represent the state connectivity of the environment using graph structures, transforming high-level planning problems into graph search problems. For instance, some online HRL methods (<xref ref-type="bibr" rid="B5">Eysenbach et al., 2019</xref>) construct topological maps or empirical maps of the environment during exploration. Recently, GAS (<xref ref-type="bibr" rid="B2">Baek et al., 2025</xref>) introduced graph structures into Offline HRL. GAS first learns a Temporal Distance Representation (TDR) space, then constructs a graph within this space based on TDR distances and a Time Efficiency (TE) metric. It employs graph search algorithms (e.g., Dijkstra) to directly select subgoal sequences, replacing explicit high-level policy learning. GAS has achieved significant success on tasks requiring extensive trajectory stitching (<xref ref-type="bibr" rid="B2">Baek et al., 2025</xref>). While GAS demonstrates impressive capabilities in trajectory stitching through explicit graph search (e.g., Dijkstra), it relies heavily on the correctness of the constructed graph&#x2019;s connectivity and the precision of temporal distance estimation. Deterministic search algorithms can be brittle; if the graph contains noisy edges or incorrect connections, the planner may fail to find a feasible path. In contrast, LG-H-PPO learns a stochastic high-level policy on the graph rather than executing a deterministic search. This distinction is crucial:</p>
<p><bold>Robustness:</bold> the value function learned by PPO allows the agent to identify and avoid edges that appear feasible in the graph structure but are unreliable for actual traversal, offering greater robustness against imperfect graph construction.</p>
<p><bold>Generalization:</bold> a learned policy can better handle states that do not perfectly align with graph nodes, enabling smoother control through probabilistic selection, which is difficult for rigid graph search methods to achieve.</p>
<p>To this end, we propose the LG-H-PPO (Latent Graph-based Hierarchical PPO) framework. Our core idea is to transform the challenging continuous latent space used by Guider (<xref ref-type="bibr" rid="B16">Shin and Kim, 2023</xref>) and HIQL (<xref ref-type="bibr" rid="B12">Park et al., 2023</xref>) into a discrete, easily manageable latent graph, and then let the high-level PPO plan on this graph. Simplifying the high-level policy&#x2019;s action space from a continuous latent space to node selection on a discrete latent graph makes high-level learning far more tractable, and our preliminary experiments on D4RL benchmarks such as Antmaze confirm this: LG-H-PPO demonstrates significant improvements over Guider (<xref ref-type="bibr" rid="B16">Shin and Kim, 2023</xref>) and HIQL (<xref ref-type="bibr" rid="B12">Park et al., 2023</xref>) in both convergence speed and final success rate. The main contribution of this paper is the introduction of a new paradigm for offline HRL that combines latent-variable representations with explicit graph structures. Theoretically, discretizing the continuous latent space significantly mitigates the complexity of the credit assignment problem in hierarchical policy gradients. In continuous latent space methods (such as Guider), the high-level policy must learn a mapping from states to exact latent vectors, where slight deviations in the output can lead to vastly different low-level trajectories, causing high variance in gradient estimation. By restricting the high-level policy&#x2019;s output to a finite set of graph nodes (transforming &#x201c;creation&#x201d; into &#x201c;selection&#x201d;), LG-H-PPO drastically reduces the variance of the policy gradient. This stabilization of the training process allows for more accurate value estimation and significantly improves sample efficiency. By discretizing the high-level action space, we effectively resolve the planning challenges faced by existing methods. This work lays the foundation for future exploration of more efficient and robust graph- and latent-variable-based offline HRL algorithms.</p>
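<p>To make the &#x201c;discrete selection&#x201d; idea concrete, the following minimal Python sketch shows a single high-level step that scores only the current node&#x2019;s neighbors on the latent graph and samples one as the next subgoal. The toy graph and the scoring function are illustrative stand-ins, not the paper&#x2019;s implementation.</p>

```python
import numpy as np

def select_subgoal(node, graph, score, rng):
    """Stochastic high-level step as 'discrete selection': score the
    current node's neighbors and sample one as the next subgoal.
    `score` stands in for the learned high-level policy network."""
    neighbors = graph[node]                       # adjacency list of the latent graph
    logits = np.array([score(node, n) for n in neighbors], dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over a finite neighbor set
    return neighbors[rng.choice(len(neighbors), p=probs)], probs

# Toy latent graph: 4 nodes, bidirectional reachability edges.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
rng = np.random.default_rng(0)
subgoal, probs = select_subgoal(0, graph, lambda u, v: float(v), rng)
```

Because the policy only ever chooses among a handful of neighbors, the categorical distribution above replaces the continuous latent output that Guider/HIQL-style high-level policies must produce.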
</sec>
<sec id="s2">
<label>2</label>
<title>The proposed methods</title>
<sec id="s2-1">
<label>2.1</label>
<title>LG-H-PPO algorithm</title>
<p>In this section, we present our proposed LG-H-PPO (Latent Graph-based Hierarchical PPO) framework in detail. The core objective of this framework is to significantly reduce the complexity of long-horizon planning for high-level policies in offline hierarchical reinforcement learning (Offline HRL), particularly within PPO-based frameworks, by introducing a latent variable graph structure. The overall architecture of LG-H-PPO is illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>, comprising three tightly integrated stages: latent variable encoder pre-training, latent variable graph construction, and graph-based hierarchical PPO training.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Overall architecture style of LG-H-PPO.</p>
</caption>
<graphic xlink:href="frobt-12-1737238-g001.tif">
<alt-text content-type="machine-generated">Flowchart illustrating a three-stage process for VAE pre-training, latent graph construction, and LG-H-PPO training and execution. Stage 1 involves inputting an offline dataset into a VAE model to produce a trained encoder and prior. Stage 2 uses the trained encoder and offline data for encoding states, K-means clustering, and graph edge construction, resulting in a latent graph and K-means model. Stage 3 employs these results for high-level and low-level policy execution and environment interaction. Each stage is depicted with relevant inputs, processes, and outputs.</alt-text>
</graphic>
</fig>
<p>LG-H-PPO follows the fundamental paradigm of HRL by decomposing complex navigation tasks into two levels: high-level (subgoal selection) and low-level (subgoal attainment). The key innovation of LG-H-PPO lies in constructing a discrete latent variable graph <inline-formula id="inf2">
<mml:math id="m2">
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. The task of the high-level policy <inline-formula id="inf3">
<mml:math id="m3">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is no longer to &#x201c;create&#x201d; a latent subgoal <inline-formula id="inf4">
<mml:math id="m4">
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> in a high-dimensional continuous space but is simplified to &#x201c;select&#x201d; an adjacent node as the next subgoal on this pre-constructed graph <inline-formula id="inf5">
<mml:math id="m5">
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. The low-level policy <inline-formula id="inf6">
<mml:math id="m6">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> executes atomic actions, facilitating transitions between latent variable states represented by graph nodes. The entire framework is designed around training using a fixed offline dataset <inline-formula id="inf7">
<mml:math id="m7">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> to enhance sample efficiency and accommodate scenarios where extensive online interactions are impractical. We apply K-Means clustering to the latent representations <inline-formula id="inf8">
<mml:math id="m8">
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> of all states in the offline dataset <inline-formula id="inf9">
<mml:math id="m9">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> to identify <inline-formula id="inf10">
<mml:math id="m10">
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> representative cluster centroids, which form the nodes <inline-formula id="inf11">
<mml:math id="m11">
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> of our graph. Edges <inline-formula id="inf12">
<mml:math id="m12">
<mml:mrow>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> are then established between nodes that are temporally adjacent in the dataset. This entire process is visualized in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>
<bold>(a)</bold> The raw Antmaze environment with latent states extracted from the offline dataset. <bold>(b)</bold> The constructed discrete latent graph <inline-formula id="inf13">
<mml:math id="m13">
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, where nodes (subgoals) are formed by clustering latent states, and edges represent learned reachability.</p>
</caption>
<graphic xlink:href="frobt-12-1737238-g002.tif">
<alt-text content-type="machine-generated">Maze illustrations labeled (a) and (b), with (a) showing two mazes featuring paths, obstacles, and goals marked by stars and arrows. (b) displays a network graph superimposed on mazes with paths and connections between nodes. A legend shows a red flag for the final goal and green circles for subgoals.</alt-text>
</graphic>
</fig>
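<p>The node and edge construction just described can be sketched in a few lines of Python. The K-Means variant below (farthest-point initialization followed by Lloyd iterations) and the synthetic trajectory are illustrative assumptions, not the paper&#x2019;s exact implementation; edges are added between clusters that contain temporally consecutive latent codes.</p>

```python
import numpy as np

def build_latent_graph(z, n_nodes, iters=20):
    """Cluster time-ordered latent codes z of shape (N, d_z) into n_nodes
    graph nodes via plain K-Means, then add a directed edge whenever two
    temporally consecutive codes fall in different clusters."""
    centroids = [z[0]]
    for _ in range(n_nodes - 1):                  # farthest-point initialisation
        dist = np.min([np.linalg.norm(z - c, axis=1) for c in centroids], axis=0)
        centroids.append(z[np.argmax(dist)])
    centroids = np.array(centroids)
    for _ in range(iters):                        # Lloyd iterations
        d = np.linalg.norm(z[:, None] - centroids[None], axis=-1)
        assign = d.argmin(axis=1)                 # nearest-centroid assignment
        for k in range(n_nodes):
            if (assign == k).any():
                centroids[k] = z[assign == k].mean(axis=0)
    edges = {(int(a), int(b)) for a, b in zip(assign[:-1], assign[1:]) if a != b}
    return centroids, assign, edges
```

With multiple trajectories in the offline dataset, the same edge rule would be applied within each trajectory separately, so that edges never span a discontinuity between episodes.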
</sec>
<sec id="s2-2">
<label>2.2</label>
<title>Training process</title>
<p>The first stage involves pretraining a latent variable encoder. The objective is to learn a high-quality, low-dimensional state representation <inline-formula id="inf14">
<mml:math id="m14">
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> from offline data, which should capture key state information and reachability relationships between states. We aim to obtain an encoder <inline-formula id="inf15">
<mml:math id="m15">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. This model primarily consists of three key components: Encoder (Posterior Network) <inline-formula id="inf16">
<mml:math id="m16">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>: Inputs the current state <inline-formula id="inf17">
<mml:math id="m17">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> (mapped to the target space, e.g., coordinates) and the future state <inline-formula id="inf18">
<mml:math id="m18">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> after <inline-formula id="inf19">
<mml:math id="m19">
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> steps, outputting the posterior distribution parameters (mean <inline-formula id="inf20">
<mml:math id="m20">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3bc;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">post</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and variance <inline-formula id="inf21">
<mml:math id="m21">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3c3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">post</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>) of the latent variable <inline-formula id="inf22">
<mml:math id="m22">
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. Decoder Network <inline-formula id="inf23">
<mml:math id="m23">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>: Inputs the current state <inline-formula id="inf24">
<mml:math id="m24">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> and the latent variable <inline-formula id="inf25">
<mml:math id="m25">
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo>&#x223c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> sampled from the posterior distribution, attempting to reconstruct the state <inline-formula id="inf26">
<mml:math id="m26">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> after <inline-formula id="inf27">
<mml:math id="m27">
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> steps. Prior Network <inline-formula id="inf28">
<mml:math id="m28">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c1;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>: Inputs only the current state <inline-formula id="inf29">
<mml:math id="m29">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> and outputs the prior distribution parameters of the latent variable <inline-formula id="inf30">
<mml:math id="m30">
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (mean <inline-formula id="inf31">
<mml:math id="m31">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3bc;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">prior</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and variance <inline-formula id="inf32">
<mml:math id="m32">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3c3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">prior</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>). All three networks are implemented using Multi-Layer Perceptrons (MLPs) with ReLU or GELU activation functions (<xref ref-type="bibr" rid="B7">Hafner et al., 2020</xref>). The training objective maximizes the Evidence Lower Bound (ELBO), combining reconstruction loss and KL divergence regularization terms:<disp-formula id="equ1">
<mml:math id="m33">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">VAE</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="double-struck">E</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo>&#x223c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3b2;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c1;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
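<p>Under the assumption of diagonal Gaussian posterior and prior distributions and a Gaussian decoder, the negative ELBO above reduces to a squared reconstruction error plus a &#x3b2;-weighted KL term, as in the following minimal numerical sketch (function names and the value of &#x3b2; are illustrative):</p>

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    i.e. between the encoder posterior and the prior network's output."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def vae_loss(recon, target, mu_post, var_post, mu_prior, var_prior, beta=0.1):
    """Negative ELBO: Gaussian reconstruction error (up to an additive
    constant) plus the beta-weighted KL(posterior || prior) term."""
    recon_err = 0.5 * np.sum((recon - target) ** 2)
    return recon_err + beta * gaussian_kl(mu_post, var_post, mu_prior, var_prior)
```

Minimizing this loss trains the encoder, decoder, and prior jointly, exactly as the ELBO objective above prescribes.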
<p>Next comes the core innovation: we construct a latent variable graph, discretizing the high-dimensional, continuous latent variable space <inline-formula id="inf33">
<mml:math id="m34">
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> learned in Phase 1 into a structured directed graph <inline-formula id="inf34">
<mml:math id="m35">
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. Nodes <inline-formula id="inf35">
<mml:math id="m36">
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> represent representative states (or regions) within the latent space, while edges <inline-formula id="inf36">
<mml:math id="m37">
<mml:mrow>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> denote the latent reachability between these states (based on observations in the offline dataset). We use the trained encoder <inline-formula id="inf37">
<mml:math id="m38">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> to encode all states <inline-formula id="inf38">
<mml:math id="m39">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> in the offline dataset <inline-formula id="inf39">
<mml:math id="m40">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, yielding the corresponding latent variable set <inline-formula id="inf40">
<mml:math id="m41">
<mml:mrow>
<mml:mi>Z</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2282;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf41">
<mml:math id="m42">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the latent variable dimension. We select an appropriate number of nodes K (a key hyperparameter, e.g., K &#x3d; 100, 200, or 500) and apply the K-Means algorithm to cluster the latent variable set <inline-formula id="inf42">
<mml:math id="m43">
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, aiming to minimize the sum of squared errors within clusters:<disp-formula id="equ2">
<mml:math id="m44">
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mtext>argmin</mml:mtext>
<mml:mtext>&#x2003;</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munder>
</mml:mstyle>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
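The clustering step above can be sketched with a minimal NumPy implementation of Lloyd's algorithm; the latent dimension and K used here are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def kmeans(Z, K, iters=50, seed=0):
    """Cluster latent vectors Z of shape (N, d_z) into K centroids by
    minimizing the within-cluster sum of squared errors (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), size=K, replace=False)].copy()
    for _ in range(iters):
        # Assign each latent z_i to its nearest centroid c_k.
        d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its assigned latents.
        for k in range(K):
            members = Z[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    # Final assignment so labels match the returned centroids exactly.
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)
    return centroids, labels
```

The returned centroids play the role of the graph nodes V = {c_1, ..., c_K}, and the labels give each latent's region assignment.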
<p>The K cluster centroids <inline-formula id="inf43">
<mml:math id="m45">
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> are defined as the K nodes of the latent variable graph <inline-formula id="inf44">
<mml:math id="m46">
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. Each node <inline-formula id="inf45">
<mml:math id="m47">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represents an abstract region within the latent space.</p>
<p>To reflect dynamic reachability between states, we utilize trajectory information from the dataset <inline-formula id="inf46">
<mml:math id="m48">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> to construct graph edges. We traverse each trajectory segment <inline-formula id="inf47">
<mml:math id="m49">
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> in <inline-formula id="inf48">
<mml:math id="m50">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (where <inline-formula id="inf49">
<mml:math id="m51">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> can be a fixed number of steps, such as <inline-formula id="inf50">
<mml:math id="m52">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mspace width="0.3333em"/>
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (<xref ref-type="bibr" rid="B2">Baek et al., 2025</xref>), or the entire trajectory). Let <inline-formula id="inf51">
<mml:math id="m53">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf52">
<mml:math id="m54">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> denote the initial and final latent variables of the segment, respectively. We use the K-Means model to find their nearest node indices: <inline-formula id="inf53">
<mml:math id="m55">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>arg min</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf54">
<mml:math id="m56">
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>arg min</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>. If <inline-formula id="inf55">
<mml:math id="m57">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2260;</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, this indicates the presence of observations in the dataset moving from node <inline-formula id="inf56">
<mml:math id="m58">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>&#x2019;s region to node <inline-formula id="inf57">
<mml:math id="m59">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>&#x2019;s region. We then add a directed edge <inline-formula id="inf58">
<mml:math id="m60">
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> to the graph <inline-formula id="inf59">
<mml:math id="m61">
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>.<disp-formula id="equ3">
<mml:math id="m62">
<mml:mtable class="align" columnalign="left">
<mml:mtr>
<mml:mtd columnalign="right"/>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfenced open="{" close="">
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mo>&#x2203;</mml:mo>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>arg min</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="right"/>
<mml:mtd columnalign="right">
<mml:mspace width="0.7em"/>
<mml:mrow>
<mml:mfenced open="" close="}">
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>arg min</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2260;</mml:mo>
<mml:mi>j</mml:mi>
</mml:mfenced>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
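This edge-construction pass can be outlined as follows; this is a sketch in which the list of state trajectories, the `encode` function standing in for the trained encoder, and the centroid array are illustrative assumptions:

```python
import numpy as np

def nearest_node(z, centroids):
    """Index of the centroid c_l closest to latent z (squared Euclidean)."""
    return int(((centroids - z) ** 2).sum(axis=1).argmin())

def build_edges(trajectories, encode, centroids, k=10):
    """Add a directed edge (v_i, v_j) whenever some k-step segment in the
    dataset moves from node v_i's latent region to a different node v_j's."""
    edges = set()
    for states in trajectories:                # one trajectory = list of states
        for t in range(len(states) - k):
            i = nearest_node(encode(states[t]), centroids)
            j = nearest_node(encode(states[t + k]), centroids)
            if i != j:                         # only cross-region transitions
                edges.add((i, j))
    return edges
```

Because the scan is directed (from step t to step t + k), reachability that exists only in one direction in the data produces only the corresponding one-way edge.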
<p>Next, we train the hierarchical PPO policy <inline-formula id="inf60">
<mml:math id="m63">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> on the constructed latent variable graph <inline-formula id="inf61">
<mml:math id="m64">
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> to enable it to effectively utilize the graph structure for long-term planning and navigation. The task of the lower-level policy <inline-formula id="inf62">
<mml:math id="m65">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> is to learn <inline-formula id="inf63">
<mml:math id="m66">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, enabling the agent to drive itself from the current state <inline-formula id="inf64">
<mml:math id="m67">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> to the latent variable region represented by the target node <inline-formula id="inf65">
<mml:math id="m68">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. We adopt an offline Soft Actor-Critic (SAC) (<xref ref-type="bibr" rid="B6">Haarnoja et al., 2018</xref>) training paradigm based on Advantage-Weighted Actor-Critic/Advantage-Weighted Regression (AWAC/AWR) (<xref ref-type="bibr" rid="B13">Peng et al., 2019</xref>). At the high level (<inline-formula id="inf66">
<mml:math id="m69">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> - PPO), our task is to learn <inline-formula id="inf67">
<mml:math id="m70">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">final</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, which selects the optimal next neighbor node <inline-formula id="inf68">
<mml:math id="m71">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> as a sub-goal on the graph <inline-formula id="inf69">
<mml:math id="m72">
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> based on the current graph node <inline-formula id="inf70">
<mml:math id="m73">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and the final objective (encoded as <inline-formula id="inf71">
<mml:math id="m74">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">final</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>). The high-level state consists of the current encoded and mapped graph node <inline-formula id="inf72">
<mml:math id="m75">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (or its latent variable <inline-formula id="inf73">
<mml:math id="m76">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>) and the latent variable of the final goal <inline-formula id="inf74">
<mml:math id="m77">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">final</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. The discrete action space <inline-formula id="inf75">
<mml:math id="m78">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> represents the set of all neighboring nodes reached via outgoing edges from a node <inline-formula id="inf76">
<mml:math id="m79">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> in graph <inline-formula id="inf77">
<mml:math id="m80">
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. During PPO training, the network architecture <inline-formula id="inf78">
<mml:math id="m81">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> comprises Actor and Critic networks, both implemented as MLPs. The Actor outputs K logits, one per graph node; a mask then sets the logits of all non-neighboring nodes <inline-formula id="inf79">
<mml:math id="m82">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2209;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> to <inline-formula id="inf80">
<mml:math id="m83">
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x221e;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. Subsequently, a Softmax function is applied to the logits of the remaining neighboring nodes to obtain a probability distribution, from which the next node is sampled: <inline-formula id="inf81">
<mml:math id="m84">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x223c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">final</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. The high-level PPO relies on rollout transitions <inline-formula id="inf82">
<mml:math id="m85">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, which we collect through simulated interactions. Using the collected high-level rollout data, we compute the Generalized Advantage Estimate (GAE) (<xref ref-type="bibr" rid="B14">Schulman et al., 2015</xref>):<disp-formula id="equ4">
<mml:math id="m86">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msup>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b4;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mspace width="1em"/>
<mml:mtext>where&#x2009;</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b4;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b3;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">final</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">final</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</disp-formula>Specifically, to stabilize the high-level policy updates and prevent catastrophic policy collapse, we employ the standard clipped surrogate objective function of PPO. The optimization objective <inline-formula id="inf83">
<mml:math id="m87">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">CLIP</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> for the high-level policy network <inline-formula id="inf84">
<mml:math id="m88">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is defined as:<disp-formula id="equ5">
<mml:math id="m89">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">CLIP</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="double-struck">E</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi>min</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext>clip</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3f5;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3f5;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</disp-formula>where <inline-formula id="inf85">
<mml:math id="m90">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">final</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">old</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">final</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula> denotes the probability ratio between the new and old policies, and <inline-formula id="inf86">
<mml:math id="m91">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the Generalized Advantage Estimate (GAE) computed via Equation 4. <inline-formula id="inf87">
<mml:math id="m92">
<mml:mrow>
<mml:mi>&#x3f5;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the clipping hyperparameter (set to 0.2 in our experiments). This objective ensures that the updated policy does not deviate excessively from the old policy that generated the rollouts, promoting stable, approximately monotonic improvement during training. Subsequently, we update the Actor and Critic networks over multiple epochs and mini-batches of data, minimizing the joint PPO loss. This ultimately yields the trained low-level policy <inline-formula id="inf88">
<mml:math id="m93">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and high-level PPO policy <inline-formula id="inf89">
<mml:math id="m94">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Experiment and result analysis</title>
<p>This section aims to evaluate the effectiveness of our proposed LG-H-PPO framework through a series of rigorous experiments. We compare LG-H-PPO with current offline hierarchical and non-hierarchical reinforcement learning algorithms on the challenging D4RL Antmaze navigation benchmark. The experimental design focuses on validating LG-H-PPO&#x2019;s performance advantages on long-horizon, sparse-reward problems, particularly in convergence speed, final performance, and training stability. Furthermore, we conduct ablation studies and qualitative analyses to examine the critical role of the latent-variable graph structure and the internal workings of the framework.</p>
<sec id="s3-1">
<label>3.1</label>
<title>Experimental design</title>
<p>We evaluate on the Antmaze navigation benchmark, focusing on antmaze-medium-diverse-v2 and antmaze-large-diverse-v2. These environments simulate navigation tasks for quadrupedal robots in medium and large mazes, characterized by a high-dimensional state space (29 dimensions), a continuous action space (8 dimensions), a limited field of view, sparse rewards (&#x2b;1 only upon reaching the goal), and long episodes (up to 1,000 steps). They serve as an ideal platform for testing long-term planning and offline learning capabilities. Training uses the antmaze-medium-diverse-v2 and antmaze-large-diverse-v2 offline datasets. The diverse datasets contain a large number of suboptimal trajectories generated by medium-level policy exploration, offering broad coverage but few successful trajectories. This places high demands on the algorithm&#x2019;s trajectory stitching capabilities and its ability to learn optimal policies from suboptimal data (<xref ref-type="bibr" rid="B2">Baek et al., 2025</xref>; <xref ref-type="bibr" rid="B16">Shin and Kim, 2023</xref>). Each dataset contains approximately one million transition samples.</p>
<p>Baselines: To comprehensively evaluate LG-H-PPO&#x2019;s performance, we selected the following representative baselines for comparison:<list list-type="order">
<list-item>
<label>1.</label>
<p>Guider (<xref ref-type="bibr" rid="B16">Shin and Kim, 2023</xref>): A state-of-the-art offline HRL algorithm based on VAE latent variables and continuous high-level actions. It serves as a crucial foundation and benchmark for our approach.</p>
</list-item>
<list-item>
<label>2.</label>
<p>HIQL (<xref ref-type="bibr" rid="B12">Park et al., 2023</xref>): A state-of-the-art offline HRL algorithm based on implicit Q-learning and latent variable states as high-level actions. It represents another value-learning-based technical approach to HRL.</p>
</list-item>
<list-item>
<label>3.</label>
<p>GAS (<xref ref-type="bibr" rid="B2">Baek et al., 2025</xref>): The latest state-of-the-art offline HRL algorithm based on graph structures and graph search, which does not learn explicit high-level policies and excels at trajectory stitching.</p>
</list-item>
<list-item>
<label>4.</label>
<p>CQL &#x2b; HER (<xref ref-type="bibr" rid="B9">Kumar et al., 2020</xref>): A state-of-the-art non-hierarchical offline RL algorithm, combined with Hindsight Experience Replay (<xref ref-type="bibr" rid="B1">Andrychowicz et al., 2017</xref>) to handle sparse rewards, and used to demonstrate the advantages of hierarchical structures.</p>
</list-item>
<list-item>
<label>5.</label>
<p>H-PPO (Continuous): Our baseline implementation adopts the same PPO algorithmic framework as LG-H-PPO, but its high-level policy directly generates actions in the continuous latent variable <inline-formula id="inf90">
<mml:math id="m95">
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> space. This serves as the most direct ablation subject for LG-H-PPO, used to validate the improvements brought by discrete graph structures.</p>
</list-item>
</list>
</p>
<p>We use normalized scores as the primary performance metric. This score is linearly scaled from the environment&#x2019;s raw rewards, where 0 corresponds to the performance of a random policy and 100 to that of an expert policy. We run each algorithm in each environment with five different random seeds, reporting the final policy&#x2019;s average normalized score, standard deviation, maximum, and minimum over 100 evaluation rounds. Additionally, we plot the average normalized score curve (learning curve) during training to compare the convergence speed and training stability of the different algorithms. We implement the LG-H-PPO framework using PyTorch. Clustering Implementation Details: For the latent graph construction, we utilize the KMeans module from the Scikit-learn library. To ensure high-quality cluster center initialization and faster convergence, we explicitly employ the &#x2018;k-means&#x2b;&#x2b;&#x2019; initialization strategy rather than random initialization. This is critical for generating representative graph nodes in the complex high-dimensional latent space. Discussion on Node Count K: The number of graph nodes <inline-formula id="inf91">
<mml:math id="m96">
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is a key hyperparameter balancing &#x2018;planning resolution&#x2019; and &#x2018;computational complexity&#x2019;. A smaller <inline-formula id="inf92">
<mml:math id="m97">
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> may result in nodes representing overly large regions, losing local details, while an excessively large <inline-formula id="inf93">
<mml:math id="m98">
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> significantly increases the action space for the high-level policy, complicating learning. In our experiments, <inline-formula id="inf94">
<mml:math id="m99">
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>200</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> was empirically selected as the optimal value for the Antmaze tasks. Detailed hyperparameter settings are shown in <xref ref-type="table" rid="T1">Table 1</xref>. All experiments are conducted on a server equipped with an NVIDIA RTX 4090 GPU.</p>
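<p>The graph-node construction described above can be sketched as follows. This is a simplified illustration that substitutes random vectors for the VAE-encoded latents and uses a small K for speed; the real pipeline clusters the encoded dataset latents with K &#x3d; 200.</p>

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
latents = rng.normal(size=(1000, 16))  # stand-in for encoded latent states z (d_z = 16)

K = 20  # illustrative only; we use K = 200 on Antmaze
kmeans = KMeans(n_clusters=K, init="k-means++", n_init=10, random_state=0)
node_of_latent = kmeans.fit_predict(latents)  # graph-node index for each latent sample
graph_nodes = kmeans.cluster_centers_         # the K node representatives in latent space
```

<p>Each cluster center becomes one node of the latent graph, and the high-level PPO policy&#x2019;s discrete action space is exactly this set of K node indices.</p>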
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>LG-H-PPO hyperparameters.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Component</th>
<th align="left">Hyperparameter</th>
<th align="left">Value</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">VAE</td>
<td align="left">Network (encoder)</td>
<td align="left">MLP (512, 512, 512)</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Network (decoder)</td>
<td align="left">MLP (512, 512, 512)</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Network (prior)</td>
<td align="left">MLP (512, 512, 512)</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Latent dimension <inline-formula id="inf95">
<mml:math id="m100">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">16</td>
</tr>
<tr>
<td align="left"/>
<td align="left">KL weight <inline-formula id="inf96">
<mml:math id="m101">
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">0.1</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Subgoal period <inline-formula id="inf97">
<mml:math id="m102">
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">25</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Learning rate</td>
<td align="left">
<inline-formula id="inf98">
<mml:math id="m103">
<mml:mrow>
<mml:mn>3</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Batch size</td>
<td align="left">100</td>
</tr>
<tr>
<td align="left">Graph construction</td>
<td align="left">Number of nodes <inline-formula id="inf99">
<mml:math id="m104">
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">200</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Clustering algorithm</td>
<td align="left">K-means (scikit-learn, k-means&#x2b;&#x2b;)</td>
</tr>
<tr>
<td align="left">Low-level <inline-formula id="inf100">
<mml:math id="m105">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Policy type</td>
<td align="left">SAC &#x2b; AWAC</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Network (actor)</td>
<td align="left">MLP (256, 256)</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Network (critic)</td>
<td align="left">MLP (256, 256)</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Learning rate</td>
<td align="left">
<inline-formula id="inf101">
<mml:math id="m106">
<mml:mrow>
<mml:mn>3</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="left"/>
<td align="left">AWAC temperature <inline-formula id="inf102">
<mml:math id="m107">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">1.0</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Batch size</td>
<td align="left">512</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Discount factor <inline-formula id="inf103">
<mml:math id="m108">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">0.98</td>
</tr>
<tr>
<td align="left">High-level <inline-formula id="inf104">
<mml:math id="m109">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Policy type</td>
<td align="left">PPO</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Network (actor)</td>
<td align="left">MLP (256, 256, 256)</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Network (critic)</td>
<td align="left">MLP (256, 256, 256)</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Discount factor <inline-formula id="inf105">
<mml:math id="m110">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">0.99</td>
</tr>
<tr>
<td align="left"/>
<td align="left">GAE <inline-formula id="inf106">
<mml:math id="m111">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">0.95</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Clip <inline-formula id="inf107">
<mml:math id="m112">
<mml:mrow>
<mml:mi>&#x3f5;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">0.2</td>
</tr>
<tr>
<td align="left"/>
<td align="left">VF coefficient <inline-formula id="inf108">
<mml:math id="m113">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">0.5</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Entropy coefficient <inline-formula id="inf109">
<mml:math id="m114">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left"/>
<td align="left">PPO epochs</td>
<td align="left">10</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Minibatch size</td>
<td align="left">64</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Learning rate</td>
<td align="left">
<inline-formula id="inf110">
<mml:math id="m115">
<mml:mrow>
<mml:mn>3</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Rollout length (high-level steps)</td>
<td align="left">2048</td>
</tr>
<tr>
<td align="left"/>
<td align="left">High-level decision frequency</td>
<td align="left">
<inline-formula id="inf111">
<mml:math id="m116">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>25</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s3-2">
<label>3.2</label>
<title>Results and analysis</title>
<p>We summarize the final performance of LG-H-PPO and various baseline algorithms on the D4RL Antmaze task in <xref ref-type="table" rid="T2">Table 2</xref>. To present the results more comprehensively, we report the average normalized score, standard deviation, and maximum and minimum scores across 100 evaluation rounds for five random seeds.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Performance comparison on D4RL antmaze tasks.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Environment</th>
<th align="left">Metric</th>
<th align="center">LG-H-PPO (Ours)</th>
<th align="center">Guider (<xref ref-type="bibr" rid="B16">Shin and Kim, 2023</xref>)</th>
<th align="center">HIQL (<xref ref-type="bibr" rid="B12">Park et al., 2023</xref>)</th>
<th align="center">GAS (<xref ref-type="bibr" rid="B2">Baek et al., 2025</xref>)</th>
<th align="center">CQL &#x2b; HER (<xref ref-type="bibr" rid="B9">Kumar et al., 2020</xref>)</th>
<th align="center">H-PPO (Cont.)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Antmaze-medium-diverse-v2</td>
<td align="left">Mean score</td>
<td align="center">90.5</td>
<td align="center">87.3</td>
<td align="center">89.9</td>
<td align="center">96.3</td>
<td align="center">28.3</td>
<td align="center">75.2</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Std. Dev</td>
<td align="center">
<inline-formula id="inf112">
<mml:math id="m117">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 3.1</td>
<td align="center">
<inline-formula id="inf113">
<mml:math id="m118">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 0.4</td>
<td align="center">
<inline-formula id="inf114">
<mml:math id="m119">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 4.5</td>
<td align="center">
<inline-formula id="inf115">
<mml:math id="m120">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 1.3</td>
<td align="center">
<inline-formula id="inf116">
<mml:math id="m121">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 5.3</td>
<td align="center">
<inline-formula id="inf117">
<mml:math id="m122">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 4.5</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Max</td>
<td align="center">94.8</td>
<td align="center">88.1</td>
<td align="center">95.0</td>
<td align="center">98.0</td>
<td align="center">35.5</td>
<td align="center">81.0</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Min</td>
<td align="center">85.2</td>
<td align="center">86.5</td>
<td align="center">85.0</td>
<td align="center">94.8</td>
<td align="center">19.8</td>
<td align="center">69.5</td>
</tr>
<tr>
<td align="left">Antmaze-large-diverse-v2</td>
<td align="left">Mean score</td>
<td align="center">85.6</td>
<td align="center">80.8</td>
<td align="center">88.2</td>
<td align="center">93.2</td>
<td align="center">11.3</td>
<td align="center">68.1</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Std. Dev</td>
<td align="center">
<inline-formula id="inf118">
<mml:math id="m123">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 4.2</td>
<td align="center">
<inline-formula id="inf119">
<mml:math id="m124">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 4.6</td>
<td align="center">
<inline-formula id="inf120">
<mml:math id="m125">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 3.0</td>
<td align="center">
<inline-formula id="inf121">
<mml:math id="m126">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 0.8</td>
<td align="center">
<inline-formula id="inf122">
<mml:math id="m127">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 8.2</td>
<td align="center">
<inline-formula id="inf123">
<mml:math id="m128">
<mml:mrow>
<mml:mo>&#xb1;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> 5.8</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Max</td>
<td align="center">91.5</td>
<td align="center">87.0</td>
<td align="center">92.0</td>
<td align="center">94.5</td>
<td align="center">22.0</td>
<td align="center">75.3</td>
</tr>
<tr>
<td align="left"/>
<td align="left">Min</td>
<td align="center">79.8</td>
<td align="center">75.1</td>
<td align="center">84.0</td>
<td align="center">92.1</td>
<td align="center">0.5</td>
<td align="center">60.2</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>
<xref ref-type="table" rid="T2">Table 2</xref> clearly demonstrates the effectiveness of LG-H-PPO. In the antmaze-medium environment, the average scores of all hierarchical methods significantly outperform the non-hierarchical CQL &#x2b; HER, highlighting the inherent advantage of hierarchical structures in handling long-horizon problems. LG-H-PPO achieves 90.5 on this task, on par with HIQL (89.9) and competitive with the graph-search method GAS, while significantly outperforming the continuous-latent-space planners Guider and H-PPO (Cont.). This advantage is further amplified in the more challenging antmaze-large environment. Confronted with longer paths and sparser rewards, the non-hierarchical CQL &#x2b; HER drops steeply to 11.3, and Guider and H-PPO (Cont.) fall to 80.8 and 68.1 respectively, indicating that long-term planning in a continuous latent space becomes markedly harder. In contrast, LG-H-PPO, leveraging its planning capability on the discrete latent variable graph, maintains a high score of 85.6, substantially outperforming both Guider and H-PPO (Cont.) and coming close to HIQL. This strongly suggests that the latent variable graph structure is key to overcoming the bottleneck of long-horizon planning in offline HRL: by discretizing the action space of the high-level PPO policy, policy gradient methods can more effectively learn long-range dependencies and select optimal subgoal sequences. Concurrently, LG-H-PPO exhibits a relatively small standard deviation, and the narrow gap between its maximum and minimum scores indicates stable performance across different random seeds.</p>
<p>To provide a more intuitive understanding of LG-H-PPO&#x2019;s decision-making process, we visualize a planned trajectory in the Antmaze-Large environment in <xref ref-type="fig" rid="F3">Figure 3</xref>. The visualization highlights the two-level hierarchical structure. As shown in <xref ref-type="fig" rid="F3">Figure 3a</xref>, the high-level PPO policy <inline-formula id="inf124">
<mml:math id="m129">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, operating on its discrete action space (the graph nodes), selects an efficient sequence of latent subgoals (yellow stars) from the start state (S) to the final goal. This demonstrates its long-term planning capability. <xref ref-type="fig" rid="F3">Figure 3b</xref> shows the low-level policy <inline-formula id="inf125">
<mml:math id="m130">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> in action, executing primitive actions to successfully navigate to each of the discrete subgoals provided by the high level. This qualitative result visually confirms that our framework effectively decomposes the complex, long-horizon task into a series of simpler, short-horizon navigation problems, validating the efficacy of our latent graph-based approach.</p>
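<p>The two-level execution pattern described above can be summarized as the following control-loop sketch. The <monospace>high_policy</monospace> and <monospace>low_policy</monospace> callables are hypothetical stand-ins for the trained high-level and low-level policies; consistent with Table 1, the high level re-plans every c &#x3d; 25 low-level steps.</p>

```python
def hierarchical_rollout(env_step, high_policy, low_policy, state, horizon=1000, c=25):
    """One episode: every c steps the high level picks a discrete graph node as
    the subgoal; in between, the low level issues primitive actions toward it."""
    trajectory = []
    subgoal = None
    for t in range(horizon):
        if t % c == 0:                       # high-level decision point
            subgoal = high_policy(state)     # index of a latent graph node
        action = low_policy(state, subgoal)  # primitive action toward the subgoal
        state, done = env_step(state, action)
        trajectory.append((state, subgoal, action))
        if done:
            break
    return trajectory
```

<p>The loop makes explicit how the long-horizon task is decomposed: the high level faces only horizon/c decisions over a discrete node set, while the low level solves short-horizon goal-reaching problems.</p>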
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Visualization of LG-H-PPO&#x2019;s hierarchical planning and execution. <bold>(a)</bold> The high-level PPO policy selects a sequence of discrete graph nodes (yellow stars) as subgoals. <bold>(b)</bold> The low-level policy executes trajectories (red dotted line) to reach each sequential subgoal, successfully navigating from the initial state to the final goal.</p>
</caption>
<graphic xlink:href="frobt-12-1737238-g003.tif">
<alt-text content-type="machine-generated">Maze diagrams showing high-level and low-level policy paths. Part (a) depicts a maze with stars and flags marking subgoals and final goals. Green and blue lines represent high-level plans with nodes and edges. Part (b) illustrates progressive steps of a low-level policy, denoted by dotted red lines, achieving subgoals marked by stars. A legend identifies icons and line colors for states and trajectories.</alt-text>
</graphic>
</fig>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Conclusion and future work</title>
<p>The main contribution of this paper is the proposal and validation of a novel offline HRL paradigm (LG-H-PPO) that integrates latent variable representation learning with explicit graph structure planning. By discretizing the action space of high-level PPO, we effectively overcome the bottlenecks of existing offline HRL methods based on policy gradients, which suffer from low planning efficiency and poor stability in continuous latent spaces. This work lays a solid foundation for future exploration of more efficient and robust offline HRL algorithms that integrate the abstractive capabilities of latent variables with the advantages of explicit structured planning, particularly in robotic applications requiring long-term reasoning and suboptimal data utilization.</p>
<p>Future research directions hold great promise. First, exploring the learning of edge weights in the graph&#x2014;such as adopting time efficiency metrics from GAS (<xref ref-type="bibr" rid="B2">Baek et al., 2025</xref>) or directly learning edge reachability probabilities/transition costs&#x2014;and integrating this information into the decision-making process of high-level PPO or as reward shaping signals for low-level policies could enable smarter path selection. Second, online dynamic graph expansion mechanisms can be investigated, allowing agents to dynamically add or modify graph nodes and edges based on new experiences during (limited) online interactions or deployment. This enables the discovery of optimal paths potentially missing in offline data, endowing the algorithm with lifelong learning capabilities. Finally, extending the LG-H-PPO framework to navigation tasks based on high-dimensional observations (e.g., images) represents a significant direction. This requires investigating more robust visual encoders and exploring how to effectively construct and utilize graph structures within visual latent spaces.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s5">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.</p>
</sec>
<sec sec-type="author-contributions" id="s6">
<title>Author contributions</title>
<p>XH: Writing &#x2013; original draft, Writing &#x2013; review and editing.</p>
</sec>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of interest</title>
<p>The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="s9">
<title>Generative AI statement</title>
<p>The author(s) declared that generative AI was used in the creation of this manuscript. Generative AI was used to assist in content summarization and minor text refinement within the Abstract section.</p>
<p>Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<fn-group>
<fn fn-type="custom" custom-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2699136/overview">Jun Ma</ext-link>, Hong Kong University of Science and Technology, Hong Kong SAR, China</p>
</fn>
<fn fn-type="custom" custom-type="reviewed-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2843743/overview">Pengqin Wang</ext-link>, Hong Kong University of Science and Technology, Hong Kong SAR, China</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/3268439/overview">Mengwei Zhang</ext-link>, Tsinghua University, China</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Andrychowicz</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wolski</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Ray</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Fong</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Welinder</surname>
<given-names>P.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>Hindsight experience replay</article-title>. <source>Adv. Neural Information Processing Systems</source> <volume>30</volume>.</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baek</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Oh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2025</year>). <article-title>Graph-assisted stitching for offline hierarchical reinforcement learning</article-title>. <source>arXiv Preprint arXiv:2506.07744</source>.</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Barto</surname>
<given-names>A. G.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Reinforcement learning: an introduction. By Richard S. Sutton</article-title>. <source>SIAM Rev.</source> <volume>6</volume> (<issue>2</issue>), <fpage>423</fpage>.</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Jia</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Rakhlin</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2025</year>). <article-title>Outcome-based online reinforcement learning: Algorithms and fundamental limits</article-title>. <source>arXiv Preprint arXiv:2505.20268</source>.</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Eysenbach</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Salakhutdinov</surname>
<given-names>R. R.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Search on the replay buffer: bridging planning and reinforcement learning</article-title>. <source>Adv. Neural Information Processing Systems</source> <volume>32</volume>.</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Haarnoja</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor</article-title>,&#x201d; in <source>International conference on machine learning</source> (<publisher-loc>Stockholm, Sweden</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>1861</fpage>&#x2013;<lpage>1870</lpage>.</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hafner</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Lillicrap</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Norouzi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ba</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Mastering atari with discrete world models</article-title>. <source>arXiv Preprint arXiv:2010.02193</source>.</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kulkarni</surname>
<given-names>T. D.</given-names>
</name>
<name>
<surname>Narasimhan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Saeedi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Tenenbaum</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation</article-title>. <source>Adv. Neural Information Processing Systems</source> <volume>29</volume>.</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kumar</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Tucker</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Conservative q-learning for offline reinforcement learning</article-title>. <source>Adv. Neural Information Processing Systems</source> <volume>33</volume>, <fpage>1179</fpage>&#x2013;<lpage>1191</lpage>.</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Tucker</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Offline reinforcement learning: tutorial, review, and perspectives on open problems</article-title>. <source>arXiv Preprint arXiv:2005.01643</source>.</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Martinez-Baselga</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Riazuelo</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Montano</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Improving robot navigation in crowded environments using intrinsic rewards</article-title>. <source>2023 IEEE International Conference on Robotics and Automation (ICRA)</source>, <fpage>9428</fpage>&#x2013;<lpage>9434</lpage>. <pub-id pub-id-type="doi">10.1109/icra48891.2023.10160876</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Park</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Ghosh</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Eysenbach</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2023</year>). &#x201c;<article-title>Offline goal-conditioned RL with latent states as actions</article-title>,&#x201d; in <source>ICML workshop on new frontiers in learning, control, and dynamical systems</source>.</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peng</surname>
<given-names>X. B.</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Advantage-weighted regression: simple and scalable off-policy reinforcement learning</article-title>. <source>arXiv Preprint arXiv:1910.00177</source>.</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schulman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Moritz</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>High-dimensional continuous control using generalized advantage estimation</article-title>. <source>arXiv Preprint arXiv:1506.02438</source>.</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schulman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wolski</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Dhariwal</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Radford</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Klimov</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Proximal policy optimization algorithms</article-title>. <source>arXiv Preprint arXiv:1707.06347</source>.</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shin</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Guide to control: offline hierarchical reinforcement learning using subgoal generation for long-horizon and sparse-reward tasks</article-title>. <source>IJCAI</source>, <fpage>4217</fpage>&#x2013;<lpage>4225</lpage>. <pub-id pub-id-type="doi">10.24963/ijcai.2023/469</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</article>