Embodied AI-Enhanced IoMT Edge Computing: UAV Trajectory Optimization and Task Offloading with Mobility Prediction

Siqi Mu, Shuo Wen, Yang Lu, Ruihong Jiang, and Bo Ai

S. Mu and S. Wen are with the School of Sports Engineering, Beijing Sport University, Beijing 100084, China (e-mail: [email protected]). Y. Lu is with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]). R. Jiang is with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: [email protected]). B. Ai is with the School of Electronics and Information Engineering, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]). Corresponding author: Yang Lu.
Abstract

Due to their inherent flexibility and autonomous operation, unmanned aerial vehicles (UAVs) have been widely used in the Internet of Medical Things (IoMT) to provide real-time biomedical edge computing services for wireless body area network (WBAN) users. In this paper, considering the time-varying task criticality of diverse WBAN users and the dual mobility of WBAN users and the UAV, we investigate the dynamic task offloading and UAV flight trajectory optimization problem to minimize the weighted average task completion time of all the WBAN users, under the constraint of UAV energy consumption. To tackle the problem, an embodied AI-enhanced IoMT edge computing framework is established. Specifically, we propose a novel hierarchical multi-scale Transformer-based user trajectory prediction model based on the users’ historical trajectory traces captured by the embodied AI agent (i.e., the UAV). Afterwards, a prediction-enhanced deep reinforcement learning (DRL) algorithm that integrates the predicted user mobility information is designed for intelligently optimizing the UAV flight trajectory and task offloading decisions. Real-world movement traces and simulation results demonstrate the superiority of the proposed methods in comparison with the existing benchmarks.

I Introduction

I-A Background and Prior Works

Recent advancements in the Internet of Medical Things (IoMT) and artificial intelligence (AI) have made a significant contribution to sustainable digital health. Combining traditional medical equipment with IoT, IoMT provides ubiquitous in-home healthcare, greatly alleviates public medical burdens and saves healthcare resources[philip2021internet]. As key components of IoMT, wireless body area networks (WBANs) deploy various low-power biosensors on large numbers of people. These heterogeneous biosensors sense various types of physiological data, including electrocardiogram (ECG), electroencephalogram (EEG), blood pressure (BP) and body temperature, and transmit the data to an on-body sink node for further processing[movassaghi2014wireless, mu2025aoi]. WBANs have significantly facilitated pervasive health monitoring services and promoted real-time health assessment.

Despite these promising developments, rapid population growth, especially among the elderly, and the associated medical tasks still overload the healthcare infrastructure and limit the development of IoMT[ning2020mobile]. Local sink nodes, such as mobile phones and laptops, cannot satisfy the latency requirements of time-sensitive medical information analysis tasks. Mobile edge computing (MEC) is regarded as a promising paradigm for tackling such challenges[lu2025agentic]. By providing computation resources in proximity for the offloaded medical analysis tasks, MEC alleviates the burden on local devices and augments the capability of IoMT. The integration of IoMT and MEC has been envisioned as an effective approach for real-time healthcare service provision, especially during the COVID-19 pandemic[zhu2022iomt, rahman2021internet].

With the continuous increase in WBAN users, cellular infrastructure-based MEC, in which edge servers are deployed at terrestrial base stations, struggles to provide seamless connectivity and reliable computation[mao2024uav]. Benefiting from its high mobility, flexible deployment and strong scalability, UAV-enabled MEC has gained widespread attention and become a research hotspot. In particular, computation offloading and UAV flight trajectory in an MEC system have been jointly optimized to minimize the overall task delay of users[hu2018joint, guo2019joint] or the energy consumption of the UAV[sun2021joint]. By enabling the UAV to act simultaneously as a relay and an MEC server, [liu2023maximizing] and [hu2019uav] optimized the task offloading, bandwidth allocation, computation resource scheduling and UAV trajectory using successive convex approximation, so as to maximize the energy efficiency of the users and the UAV. Later, in [bai2022delay], a UAV-enabled edge-cloud computing system was considered to augment the computation capability of the UAV, and a delay minimization problem for edge-cloud cooperative offloading was investigated. However, these earlier studies considered static scenarios and did not account for user mobility, which is unrealistic since user locations may change dynamically over time in practice. In addition, user mobility directly impacts UAV path planning and edge computing performance.

To tackle this problem, several studies focused on mobile users, where the user mobility model follows the random waypoint mobility model[amer2020mobility], reference point group mobility model[wang2023joint, yan2023joint], or the Gauss-Markov mobility model[omoniwa2022optimizing, yang2021dynamic]. These models have been applied to derive the UAV coverage probability[amer2020mobility] or develop optimization algorithms on user association, power allocation, subchannel assignment, UAV positioning or trajectory design under various system objectives, such as maximizing system throughput[yan2023joint, wang2023joint] or energy efficiency[omoniwa2022optimizing, yang2021dynamic]. However, in these ideal mobility models, the direction of user’s movement tends to be uniformly distributed among left, right, forward and backward, which does not fully reflect the complexities and nuances of the real-world user movement [liu2019trajectory]. Additionally, although user mobility patterns were considered, the algorithm designs in these works were based on the assumption that precise user locations are accessed by UAV in real time. Such an assumption is difficult to fulfill in practice, especially in urban environments or areas with significant obstructions. Even worse, the user location information reported to UAV may be outdated due to the fast movement of users, leading to suboptimal task offloading strategy and UAV path planning[wu2024deep].

In view of these issues, a few researchers have made efforts to capture the time-varying uncertainty of user mobility with prediction models to improve the service quality of edge computing. [ma2020leveraging] proposed an LSTM-based mobility prediction model, based on which a predictive service placement algorithm was designed to balance the latency performance and handover cost. In [wu2023mobility], the authors developed a seq2seq user trajectory prediction model, alongside a deep reinforcement learning (DRL) algorithm for supporting offloading decisions and resource allocation in MEC, to minimize the average task latency of users. In spite of these innovative attempts, several challenges still remain. First, the existing studies on MEC with mobility prediction mainly focus on the communication connections between terrestrial base stations and users. How to improve the edge computing performance in an air-ground system with dual mobility of the UAV and ground users needs to be further investigated. Second, inaccuracies in the existing trajectory prediction methods can result in suboptimal offloading decisions and UAV trajectories, which will increase task completion time and UAV energy consumption. It is therefore imperative to develop a more robust predictive framework. Finally, few works on UAV-assisted MEC systems consider the time-varying criticality of computation tasks, which is a significant and indispensable feature for WBAN users. Intelligently making offloading decisions based on the time-varying task characteristics is non-trivial.

I-B Contributions

Motivated by these challenges and inspired by advanced AI techniques, we propose an embodied AI-enhanced UAV edge computing framework in this work. Embodied AI, which emphasizes that physical objects embedded with intelligent systems actively interact with and learn from their physical surroundings, has been shown to be a promising solution for dealing with such highly complex and dynamic scenarios[zhang2025embodied, zhang2024generative]. Specifically, the proposed framework comprises two core modules, i.e., a hierarchical Transformer-enabled user mobility prediction module and a DRL-enabled UAV trajectory optimization and task offloading module. The embodied AI system embedded within the UAV enables accurate mobility prediction and real-time strategy adaptation to dynamic environments. Equipped with the designed AI algorithms, the UAV embodied AI agent predicts user mobility based on the perceived historical information, autonomously optimizes its flight trajectory and makes intelligent task offloading decisions. The contributions of this work are summarized as follows:

  1) Considering the time-varying task criticality of diverse WBAN users, we formulate a dynamic multi-stage task offloading and UAV flight trajectory optimization problem, aiming at minimizing the weighted average task completion time of all the WBAN users, subject to a constraint on the total energy consumption of the UAV.

  2) To facilitate the optimization of flight trajectory and task offloading decisions for the UAV embodied AI agent, we propose a novel user trajectory prediction model based on a hierarchical multi-scale Transformer framework. Through the design of trajectory slice partitioning, embedding representation and the attention mechanism, the proposed model can capture the temporal dependencies of the historical user trajectory on various time scales.

  3) The original dynamic optimization problem is transformed into a Markov decision process (MDP) problem. Based on the designed state, action and reward function, a prediction-enhanced DRL algorithm that integrates the predicted user mobility information is developed for intelligent UAV trajectory optimization and task offloading.

  4) We evaluate the performance of the proposed mobility prediction model and DRL algorithm. Real-world traces and simulation results demonstrate that the proposed methods outperform the existing benchmarks both in mobility prediction and in the optimization of task offloading and UAV flight trajectory.

The organization of this paper is as follows. In Section II, the system model is introduced and the multi-stage optimization problem is formulated. Section III presents the hierarchical multi-scale Transformer framework for mobility prediction. Section IV provides the prediction-enhanced UAV trajectory optimization and task offloading algorithm. Performance evaluation results are shown in Section V. Finally, Section VI concludes our work and points out possible future work.

II System Model

Figure 1: System model

We consider a UAV-enabled WBAN edge computing system as illustrated in Fig. 1, where a UAV equipped with an edge server is tasked with providing proximate computation services for mobile WBAN users. The set of WBAN users is denoted as $\mathcal{U}=\{1,\ldots,U\}$, and each user has $N$ heterogeneous computation tasks from its equipped biosensors, indexed by $n\in\mathcal{N}=\{1,\ldots,N\}$. The mission period of the UAV is denoted as $T^{\max}$, which is divided into $T$ time slots, indexed by $t\in\mathcal{T}=\{1,\ldots,T\}$, each of duration $\tau$. Considering a 3-D Cartesian coordinate system, user $u\in\mathcal{U}$ has zero altitude and its horizontal location at time slot $t$ is denoted as $\bm{p}_{u}[t]=(x_{u}[t],y_{u}[t])$. We assume the UAV flies at a fixed altitude $H$, and the initial horizontal location of the UAV is preset as $\bm{p}_{v}[1]=(x_{I},y_{I})$. At the beginning of each time slot, WBAN users generate tasks and the UAV moves to the next location based on the observed locations and predicted user mobility. After the UAV has relocated, WBAN users offload a portion of their tasks via wireless links, and the UAV assists the execution of these computation tasks. At time slot $t$, the horizontal location of the UAV is denoted as $\bm{p}_{v}[t]=(x[t],y[t])$. It is assumed that the UAV flies with a constant speed $v[t]$ during time slot $t$ and that the direction of flight is represented by $\sigma[t]$. Then, the movement of the UAV from the previous hover location to the new location can be expressed as

$$\bm{p}_{v}[t+1]=\big[x[t]+v[t]\,t^{\text{fly}}\cos\sigma[t],\; y[t]+v[t]\,t^{\text{fly}}\sin\sigma[t]\big], \tag{1}$$

where $t^{\text{fly}}$ is the UAV flying time in each time slot.
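As a minimal sketch of the position update in (1), the following Python function advances the UAV by one time slot. The function name and the numerical values are illustrative assumptions, not taken from the paper; the heading is assumed to be given in radians.

```python
import math

def next_uav_position(x, y, speed, heading, t_fly):
    """Eq. (1): advance the UAV's horizontal position by one time slot.

    heading is sigma[t] in radians; t_fly is the flying time within the slot.
    """
    return (x + speed * t_fly * math.cos(heading),
            y + speed * t_fly * math.sin(heading))

# Illustrative values only (not from the paper): 10 m/s for 2 s, heading pi/2.
x1, y1 = next_uav_position(0.0, 0.0, 10.0, math.pi / 2, 2.0)
```

With these values the UAV moves 20 m along the positive y axis, up to floating-point rounding.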

II-A Computation Task Model

Because the physiological state sensed by each biosensor varies, the tasks of WBAN users change dynamically across time slots. For WBAN user $u$, its computation task $n$ at time slot $t$ is represented as a tuple $\Theta_{u,n}[t]=\langle I_{u,n}[t],D_{u,n}[t],C_{u,n}[t]\rangle$, where $I_{u,n}[t]$ is the current criticality index of task $n$, $D_{u,n}[t]$ is the data load of task $n$, and $C_{u,n}[t]$ is the computation amount of task $n$.

Specifically, the criticality index $I_{u,n}[t]$ comprises three parts, $\phi_{u}$, $\rho_{u,n}$ and $\alpha_{u,n}[t]$, where $\phi_{u}$ and $\rho_{u,n}$ model the criticality of WBAN user $u$ and its biosensor $n$, respectively, and $\alpha_{u,n}[t]$ represents the importance of the sensing data of biosensor $n$[askari2021energy]. The larger the values of $\phi_{u}$ and $\rho_{u,n}$, the higher the data criticality of the corresponding user and its biosensor. For example, the data of heart disease patients have a higher criticality than those of healthy users, and the value of $\rho_{u,n}$ for an ECG biosensor used to monitor the heart is greater than that of an EMG biosensor used to monitor muscle activity. Besides, the data of the same biosensor can be divided into normal data and emergency abnormal data. The data importance $\alpha_{u,n}[t]$ indicates the urgency of the sensed data value $\theta_{u,n}[t]$. The predefined normal value range is $[\breve{\theta}_{u,n},\hat{\theta}_{u,n}]$. Without loss of generality, $\alpha_{u,n}[t]$ is classified into two levels, i.e., low and high. If $\theta_{u,n}[t]$ is within the predefined normal value range, it indicates low urgency. Otherwise, it represents an abnormal state with high urgency. Thus, $\alpha_{u,n}[t]$ can be expressed as

$$\alpha_{u,n}[t]=\begin{cases}\text{low}, & \text{if } \theta_{u,n}[t]\in[\breve{\theta}_{u,n},\hat{\theta}_{u,n}],\\ \text{high}, & \text{if } \theta_{u,n}[t]\in(-\infty,\breve{\theta}_{u,n})\cup(\hat{\theta}_{u,n},+\infty).\end{cases} \tag{4}$$

By jointly considering the user category, biosensor category and data importance, the criticality index $I_{u,n}[t]$ of biosensor $n$ for WBAN user $u$ is defined as a function of the three factors, written as $I_{u,n}[t]=\mathcal{F}(\phi_{u},\rho_{u,n},\alpha_{u,n}[t])$.
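The classification in (4) can be sketched as follows. Note that the text does not specify the form of $\mathcal{F}$; the product form and the numeric urgency weights below are purely hypothetical placeholders chosen for illustration.

```python
def data_importance(theta, theta_low, theta_high):
    """Eq. (4): 'low' inside the normal range [theta_low, theta_high], 'high' otherwise."""
    return "low" if theta_low <= theta <= theta_high else "high"

# ASSUMPTION: the paper leaves F unspecified. A product with a numeric weight
# per urgency level is used here only as a placeholder.
ALPHA_WEIGHT = {"low": 1.0, "high": 2.0}  # hypothetical weights

def criticality_index(phi_u, rho_un, alpha):
    """Hypothetical instance of I = F(phi_u, rho_un, alpha)."""
    return phi_u * rho_un * ALPHA_WEIGHT[alpha]
```

Any other monotone combination of the three factors would fit the description equally well; only the low/high thresholding is taken from the text.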

Each task is atomic (indivisible) and can be either processed locally or offloaded to the UAV for computing. Let $z_{u,n}[t]\in\{0,1\}$ denote the offloading decision for task $\Theta_{u,n}[t]$: $z_{u,n}[t]=0$ signifies that task $\Theta_{u,n}[t]$ is computed at the local hub node (e.g., a mobile device), and $z_{u,n}[t]=1$ signifies that it is offloaded to the UAV. Multiple tasks dispatched to the local hub node can be executed in parallel. Considering the distinct criticality of these locally processed tasks, the local computation capability allocated to a task is proportional to the criticality index of the task. Let $V_{u}$ denote the local computation capability of WBAN user $u$. The computation resource allocated to task $\Theta_{u,n}[t]$ is then

$$f_{u,n}^{\text{loc}}[t]=\frac{I_{u,n}[t]V_{u}}{\sum\limits_{n\in\mathcal{N}}(1-z_{u,n}[t])I_{u,n}[t]}. \tag{5}$$

Hence, the latency of local computing for task $\Theta_{u,n}[t]$ is

$$T^{\text{loc}}_{u,n}[t]=\frac{C_{u,n}[t]}{f_{u,n}^{\text{loc}}[t]}. \tag{6}$$
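A minimal sketch of the criticality-proportional local allocation in (5) and the resulting latency in (6) for one user in one time slot is given below. The function name, the list-based interface and the example values are assumptions made for illustration.

```python
def local_latencies(crit, cycles, offload, v_u):
    """Eqs. (5)-(6) for one user in one time slot.

    crit[n]    : criticality index I_{u,n}[t]
    cycles[n]  : computation amount C_{u,n}[t] (CPU cycles)
    offload[n] : z_{u,n}[t], 0 = local, 1 = offloaded
    v_u        : local computation capability V_u (cycles/s)
    Returns {n: T^loc_{u,n}[t]} for the locally processed tasks only.
    """
    total = sum(i for i, z in zip(crit, offload) if z == 0)
    latencies = {}
    for n, (i, c, z) in enumerate(zip(crit, cycles, offload)):
        if z == 0:
            f_loc = i * v_u / total      # Eq. (5)
            latencies[n] = c / f_loc     # Eq. (6)
    return latencies

# Illustrative values: tasks 0 and 1 run locally, task 2 is offloaded.
lat = local_latencies([3, 1, 2], [3e8, 1e8, 4e8], [0, 0, 1], 1e9)
```

In this example task 0 receives three quarters of the local capability and task 1 one quarter, so both finish in 0.4 s.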

II-B Task Offloading Model

For task offloading, the effects of both line-of-sight (LoS) and non-line-of-sight (NLoS) propagation on the wireless channel are taken into account in this work. Specifically, the probabilistic LoS model is adopted to model the large-scale attenuation between the UAV and WBAN users[zeng2019energy]. The probability of geometrical LoS between the UAV and each WBAN user depends on the statistical parameters related to the environment and on the elevation angle. At time slot $t$, the LoS probability for user $u$ is denoted as

$$\mathbb{P}^{\text{LoS}}(\beta_{u}[t])=\frac{1}{1+a\exp(-b(\beta_{u}[t]-a))}, \tag{7}$$

where $a$ and $b$ are environment-related parameters, and $\beta_{u}[t]$ is the elevation angle in degrees, represented as

$$\beta_{u}[t]=\frac{180}{\pi}\arctan\left(\frac{H}{\|\bm{p}_{u}[t]-\bm{p}_{v}[t]\|}\right). \tag{8}$$

Then, the NLoS channel probability is $\mathbb{P}^{\text{NLoS}}(\beta_{u}[t])=1-\mathbb{P}^{\text{LoS}}(\beta_{u}[t])$. Therefore, the expected channel gain is

$$g_{u}[t]=\frac{\mathbb{P}^{\text{LoS}}(\beta_{u}[t])\,g_{0}}{d_{u}^{\varsigma}[t]}+\frac{\big(1-\mathbb{P}^{\text{LoS}}(\beta_{u}[t])\big)\kappa g_{0}}{d_{u}^{\varsigma}[t]}, \tag{9}$$

where $d_{u}[t]=\sqrt{H^{2}+\|\bm{p}_{u}[t]-\bm{p}_{v}[t]\|^{2}}$ is the distance between WBAN user $u$ and the UAV at time slot $t$, $\kappa$ is the additional NLoS attenuation factor, $g_{0}$ is the channel gain at the reference distance $d_{0}$, and $\varsigma$ is the path loss exponent.
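The channel model in (7)-(9) can be sketched in a few lines of Python. The function names and the parameter values used in the example are illustrative assumptions; `math.atan2` is used so that the elevation angle is well defined (90 degrees) when the UAV hovers directly above the user.

```python
import math

def los_probability(elev_deg, a, b):
    """Eq. (7): LoS probability as a function of the elevation angle in degrees."""
    return 1.0 / (1.0 + a * math.exp(-b * (elev_deg - a)))

def expected_channel_gain(p_u, p_v, H, a, b, g0, kappa, path_loss_exp):
    """Eqs. (7)-(9): expected gain between a ground user at p_u and a UAV at (p_v, H)."""
    horiz = math.hypot(p_u[0] - p_v[0], p_u[1] - p_v[1])
    elev_deg = math.degrees(math.atan2(H, horiz))   # Eq. (8)
    dist = math.hypot(H, horiz)                     # d_u[t]
    p_los = los_probability(elev_deg, a, b)
    # Eq. (9): LoS term plus NLoS term with extra attenuation kappa.
    return (p_los + (1.0 - p_los) * kappa) * g0 / dist ** path_loss_exp

# Illustrative parameters only (not from the paper).
g_near = expected_channel_gain((0, 0), (0, 0), 100.0, 9.61, 0.16, 1e-4, 0.2, 2.0)
g_far = expected_channel_gain((500, 0), (0, 0), 100.0, 9.61, 0.16, 1e-4, 0.2, 2.0)
```

As expected, a user directly below the UAV sees a larger gain than a user 500 m away, because both the LoS probability and the distance term favor the closer user.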

To prevent signal interference among WBAN users, the frequency bands are orthogonally allocated to users. The wireless bandwidth available for user $u$ is $W_{u}$ Hz. When delivering tasks to the UAV for edge execution at time slot $t$, WBAN user $u$ further divides its bandwidth and transmission power $P_{u}[t]$ among its offloaded tasks according to the task criticality index. Let $N_{0}$ be the noise power at the UAV. The transmission rate for offloading task $\Theta_{u,n}[t]$ is then

$$R_{u,n}[t]=\frac{I_{u,n}[t]W_{u}}{\sum\limits_{n\in\mathcal{N}}z_{u,n}[t]I_{u,n}[t]}\log_{2}\left(1+\frac{I_{u,n}[t]P_{u}[t]g_{u}[t]}{N_{0}\sum\limits_{n\in\mathcal{N}}z_{u,n}[t]I_{u,n}[t]}\right). \tag{10}$$

The data transmission time for $\Theta_{u,n}[t]$ is expressed as

$$T^{\text{trans}}_{u,n}[t]=\frac{D_{u,n}[t]}{R_{u,n}[t]}. \tag{11}$$
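A minimal sketch of the criticality-proportional rate in (10) and the transmission time in (11) for one user follows. The function names and example values are illustrative assumptions.

```python
import math

def offload_rates(crit, offload, w_u, p_u, g_u, n0):
    """Eq. (10): per-task rate in bit/s for the offloaded tasks of one user.

    Bandwidth w_u and power p_u are both split in proportion to the
    criticality index among the tasks with offload[n] == 1.
    """
    total = sum(i for i, z in zip(crit, offload) if z == 1)
    rates = {}
    for n, (i, z) in enumerate(zip(crit, offload)):
        if z == 1:
            share = i / total
            rates[n] = share * w_u * math.log2(1.0 + share * p_u * g_u / n0)
    return rates

def transmission_time(data_bits, rate):
    """Eq. (11): time to upload data_bits at the given rate."""
    return data_bits / rate

# Illustrative values: only task 1 is offloaded, SNR = 3, bandwidth 1 MHz.
rates = offload_rates([2, 5], [0, 1], 1e6, 0.3, 1e-6, 1e-7)
```

With a single offloaded task the whole bandwidth and power go to that task, so the rate is $10^{6}\log_{2}(1+3)=2\times10^{6}$ bit/s.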

II-C UAV Energy Consumption Model

During flight, the energy consumption of the UAV mainly comprises the propulsion energy consumption and the computation energy consumption. According to the existing analytical model for helicopter dynamics[hu2019uav], its propulsion energy consumption at time slot $t$ can be denoted as

$$E^{\text{fly}}[t]=\left(\gamma_{1}v^{3}[t]+\frac{\gamma_{2}}{v[t]}\right)t^{\text{fly}}, \tag{12}$$

where $\gamma_{1}$ and $\gamma_{2}$ are parameters related to the weight, wing area and wing span efficiency of the UAV, the air density, etc.

To improve the computation energy efficiency for offloaded tasks, a dynamic voltage and frequency scaling (DVFS) technique is leveraged by the UAV. By adjusting the CPU frequency of the UAV during each time slot, its computation power can be adaptively controlled. Let $F_{v}$ denote the total CPU frequency of the UAV. We assume that the offloaded tasks of all WBAN users are executed concurrently by the UAV, and that the computation capability allocated to an offloaded task is determined by the proportion of its criticality index to the total criticality index of all the offloaded tasks. Thus, the CPU frequency allocated to task $\Theta_{u,n}[t]$ is

$$f_{u,n}^{\text{uav}}[t]=\frac{I_{u,n}[t]F_{v}}{\sum\limits_{u\in\mathcal{U}}\sum\limits_{n\in\mathcal{N}}z_{u,n}[t]I_{u,n}[t]}. \tag{13}$$

Then, the computation time of task $\Theta_{u,n}[t]$ can be obtained as

$$T^{\text{comp}}_{u,n}[t]=\frac{C_{u,n}[t]}{f_{u,n}^{\text{uav}}[t]}=\frac{C_{u,n}[t]\sum\limits_{u\in\mathcal{U}}\sum\limits_{n\in\mathcal{N}}z_{u,n}[t]I_{u,n}[t]}{I_{u,n}[t]F_{v}}. \tag{14}$$

According to [xu2017online], the power consumption for computing task $\Theta_{u,n}[t]$ is $\eta\big(f_{u,n}^{\text{uav}}[t]\big)^{3}$, where $\eta$ is the effective capacitance coefficient of the UAV, which depends on its processor chip architecture. Multiplying this power by the computation time in (14), the energy consumption for computing task $\Theta_{u,n}[t]$ is

$$E^{\text{comp}}_{u,n}[t]=\eta\big(f_{u,n}^{\text{uav}}[t]\big)^{2}z_{u,n}[t]C_{u,n}[t]=\frac{\eta z_{u,n}[t]C_{u,n}[t]I_{u,n}^{2}[t]F_{v}^{2}}{\left(\sum\limits_{u\in\mathcal{U}}\sum\limits_{n\in\mathcal{N}}z_{u,n}[t]I_{u,n}[t]\right)^{2}}. \tag{15}$$
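The energy model in (12)-(15) can be sketched as follows. The function names, the flat task list (all users' tasks concatenated) and the numerical values are assumptions for illustration only.

```python
def propulsion_energy(v, t_fly, gamma1, gamma2):
    """Eq. (12); v must be positive because of the gamma2 / v term."""
    return (gamma1 * v ** 3 + gamma2 / v) * t_fly

def uav_compute(crit, cycles, offload, f_v, eta):
    """Eqs. (13)-(15) over a flat list of all users' tasks in one time slot.

    Returns {k: (T_comp, E_comp)} for the offloaded tasks only.
    """
    total = sum(i for i, z in zip(crit, offload) if z == 1)
    result = {}
    for k, (i, c, z) in enumerate(zip(crit, cycles, offload)):
        if z == 1:
            f = i * f_v / total                     # Eq. (13)
            result[k] = (c / f, eta * f ** 2 * c)   # Eqs. (14) and (15)
    return result

# Illustrative values only (not from the paper).
e_fly = propulsion_energy(10.0, 2.0, 0.001, 100.0)
comp = uav_compute([1, 3], [1e9, 3e9], [1, 1], 4e9, 1e-28)
```

In the example, the task with three times the criticality receives three times the CPU frequency, so both tasks finish in 1 s, while the energy grows with the square of the allocated frequency.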

II-D Problem Formulation

To comprehensively measure the completion time of tasks with different criticality in each time slot, we define a weighted task completion time for each task based on its criticality index, as follows:

$$\Psi_{u,n}[t]=\frac{I_{u,n}[t]}{\sum\limits_{u\in\mathcal{U}}\sum\limits_{n\in\mathcal{N}}I_{u,n}[t]}T^{\text{total}}_{u,n}[t], \tag{16}$$

where $T^{\text{total}}_{u,n}[t]$ is the overall completion latency of task $\Theta_{u,n}[t]$, obtained as

$$T^{\text{total}}_{u,n}[t]=(1-z_{u,n}[t])T^{\text{loc}}_{u,n}[t]+z_{u,n}[t]\big(T^{\text{trans}}_{u,n}[t]+T^{\text{comp}}_{u,n}[t]\big). \tag{17}$$

Note that each task must be completed within one time slot duration. That is, the constraint $T^{\text{total}}_{u,n}[t]\leq\tau$ holds.
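A minimal sketch of (16)-(17) together with the per-slot deadline check is shown below. The function names and example values are illustrative assumptions; latencies of the unused branch (e.g., the local latency of an offloaded task) are simply passed as 0.

```python
def total_latency(z, t_loc, t_trans, t_comp):
    """Eq. (17): local latency if z == 0, upload plus UAV computing if z == 1."""
    return (1 - z) * t_loc + z * (t_trans + t_comp)

def weighted_completion_times(crit, latencies):
    """Eq. (16): weight each task's latency by its share of the total criticality."""
    total = sum(crit)
    return [i / total * t for i, t in zip(crit, latencies)]

def meets_deadline(latencies, tau):
    """Per-slot constraint T_total <= tau for every task."""
    return all(0.0 <= t <= tau for t in latencies)

# Illustrative values: task 0 runs locally, task 1 is offloaded.
lats = [total_latency(0, 0.4, 0.0, 0.0), total_latency(1, 0.0, 0.1, 0.1)]
psi = weighted_completion_times([1, 3], lats)
```

Summing `psi` over all tasks in a slot gives that slot's contribution to the weighted average completion time.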

In this paper, we aim to minimize the weighted average task completion time of all the WBAN users during the UAV’s mission period, subject to a constraint on the total energy consumption of the UAV. By jointly optimizing the UAV flying trajectory and the task offloading decisions, the problem is formulated as

\begin{align}
\min_{v[t],\sigma[t],z_{u,n}[t]}\quad & \frac{1}{T}\sum\limits_{t=1}^{T}\sum\limits_{u=1}^{U}\sum\limits_{n=1}^{N}\Psi_{u,n}[t] \tag{18a}\\
\text{s.t.}\quad & \sum\limits_{t=1}^{T}\left(E^{\text{fly}}[t]+\sum\limits_{u=1}^{U}\sum\limits_{n=1}^{N}E^{\text{comp}}_{u,n}[t]\right)\leq E^{\text{uav}}, \tag{18b}\\
& 0\leq T^{\text{total}}_{u,n}[t]\leq\tau,\ \forall u\in\mathcal{U},n\in\mathcal{N},t\in\mathcal{T}, \tag{18c}\\
& \|\bm{p}_{v}[t+1]-\bm{p}_{v}[t]\|\leq V^{\max}\tau,\ \forall t\in\mathcal{T}, \tag{18d}\\
& \sigma[t]\in[0,2\pi],\ \forall t\in\mathcal{T}, \tag{18e}\\
& z_{u,n}[t]\in\{0,1\},\ \forall u\in\mathcal{U},n\in\mathcal{N},t\in\mathcal{T}, \tag{18f}
\end{align}

where constraint (18b) indicates that the total energy consumption of the UAV for flying and task computation is limited by its battery energy $E^{\text{uav}}$. Constraint (18c) guarantees that task $\Theta_{u,n}[t]$ is completed within one time slot duration. The flying speed of the UAV $v[t]\in[0,V^{\max}]$ is guaranteed by constraint (18d), where $V^{\max}$ is the maximum flight speed. Constraint (18e) imposes limits on the UAV’s heading angle, and constraint (18f) specifies the binary task offloading decision variables. Problem (18) is a multi-stage dynamic optimization problem. Its non-convexity and complex time-coupled constraints present significant challenges for problem solving. Traditional optimization algorithms often suffer from the curse of dimensionality and are hard to adapt to rapid changes in network states. To address these issues, we propose an embodied AI framework that integrates mobility prediction and DRL, as detailed in the following sections.

III Hierarchical Transformer Trajectory Prediction Model

Figure 2: Hierarchical Transformer Trajectory Prediction Model

In this section, we propose a user trajectory prediction model based on a hierarchical multi-scale Transformer framework, to capture the temporal dependencies of user mobility on various time scales. The traditional Transformer model [vaswani2017attention] has been shown to effectively capture the long-range dependencies between words within a sentence in the context of natural language processing. Its strong sequence modeling ability helps capture the contextual information of user mobility, which in turn improves trajectory prediction performance. However, user trajectories typically exhibit multiple patterns associated with different human activities, characterized by significant variations and fluctuations across different temporal scales. A traditional Transformer prediction model often analyzes these patterns at a single time scale, which can lead to inaccurate learning of mobility patterns. Hence, we develop a hierarchical Transformer framework for learning multi-scale time series features of mobility patterns. In the following, a detailed overview of the hierarchical Transformer trajectory prediction model is provided.

The hierarchical Transformer trajectory prediction model is composed of four main modules: a trajectory slice partitioning module, an embedding representation module, an encoder network module and an output module. The overall structure of the prediction model is illustrated in Fig. 2.

1) Trajectory Slice Partitioning: Suppose that the historical trajectory of user $u$ is represented as $\mathcal{P}_{u}=\{\bm{p}_{u}[1],\ldots,\bm{p}_{u}[T_{h}]\}$, with $T_{h}$ as the length of the historical observation window. The whole hierarchical Transformer framework is divided into $M$ stages that produce different feature maps of the historical trajectory for each user. To this end, a temporal slice partitioning strategy is designed to vary the time scale of the user trajectory at different stages. Specifically, a window slicing operation is leveraged to aggregate successive neighboring location data, with the window slice size determining the time scale of the input. Let the user mobility trajectory sequence at stage $m\in\{1,\ldots,M\}$ be denoted by $\bm{S}_{m}=[s_{m,1},s_{m,2},\ldots,s_{m,n}]$, containing $n$ elements, each of dimension $j_{m}$, where $n$ and $j_{m}$ are positive integers. In particular, $\bm{S}_{1}=\mathcal{P}_{u}$ with $j_{1}=2$ denotes the raw trajectory sequence of user $u$. The window slicing size at stage $m$ is $w_{m}$, which means that every $w_{m}$ consecutive elements are grouped into a new temporal trajectory slice. In this way, the trajectory sequence $\bm{S}_{m}$ input to stage $m$ is partitioned into a set of trajectory slices. The number of trajectory slices $K_{m}$ and the size of each slice $G_{m}$ are respectively given by

$$K_{m}=\frac{|\bm{S}_{m}|}{w_{m}}, \tag{19}$$

$$G_{m}=w_{m}\times j_{m}, \tag{20}$$

where $|\bm{S}_{m}|$ is the number of elements in $\bm{S}_{m}$. In other words, the transformed trajectory sequence has length $K_{m}$, and each of its elements has feature dimension $G_{m}$.
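The window slicing operation in (19)-(20) can be sketched in plain Python as follows. The function name is hypothetical, and the sketch assumes that the sequence length is an exact multiple of the window size; the text does not say how a remainder would be handled.

```python
def partition_slices(seq, w):
    """Group every w consecutive elements of seq into one flattened slice.

    seq is a list of feature vectors of equal dimension j. The result has
    K = len(seq) // w slices (Eq. (19)), each of size G = w * j (Eq. (20)).
    ASSUMPTION: len(seq) is a multiple of w; padding or truncation of a
    remainder is not specified in the text.
    """
    if len(seq) % w != 0:
        raise ValueError("sequence length must be a multiple of the window size")
    return [[v for point in seq[k:k + w] for v in point]
            for k in range(0, len(seq), w)]

# Illustrative raw trajectory with T_h = 8 points of dimension j_1 = 2.
traj = [(float(t), float(2 * t)) for t in range(8)]
slices = partition_slices(traj, 4)
```

Here the eight 2-D points become $K_{1}=2$ slices of size $G_{1}=8$, so a larger window yields a shorter but wider sequence, i.e., a coarser time scale.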

2) Embedding Representation: Through the embedding layer, the trajectory slices $\bm{S}^{\prime}_{m}\in\mathbb{R}^{K_{m}\times G_{m}}$ are projected into a higher-dimensional space of dimension $d_{\text{model}}$. The embedded history feature $\bm{Z}_{m}\in\mathbb{R}^{K_{m}\times d_{\text{model}}}$ is given by

$$\bm{Z}_{m}=\text{ReLU}(\bm{S}^{\prime}_{m}\bm{W}^{e}_{m}+\bm{b}^{e}_{m}), \tag{21}$$

where $\text{ReLU}(\cdot)$[goodfellow2016deep] denotes the activation function, $\bm{W}^{e}_{m}\in\mathbb{R}^{G_{m}\times d_{\text{model}}}$ is the embedding weight matrix and $\bm{b}^{e}_{m}\in\mathbb{R}^{K_{m}\times d_{\text{model}}}$ is the bias term. To make the model aware of the order of the trajectory sequence, locational (positional) encoding is also adopted to encode the relative location of each data point within $\bm{Z}_{m}$. Following [vaswani2017attention], sine and cosine functions of different frequencies are used to implement the locational encoding, as shown below:

$$\begin{cases}\text{PE}_{\text{pos},2i}=\sin\left(\dfrac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right),\\[6pt] \text{PE}_{\text{pos},2i+1}=\cos\left(\dfrac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right),\end{cases} \tag{24}$$

where $\text{pos}\in\{1,2,\ldots,K_{m}\}$ is the location within the sequence and $i\in\{0,1,\ldots,d_{\text{model}}/2-1\}$ indexes the pairs of feature dimensions.
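A minimal sketch of the sinusoidal encoding in (24) is given below. It assumes an even $d_{\text{model}}$, locations counted from 1 as in the text, and a zero-based dimension index as in [vaswani2017attention]; the function name is hypothetical.

```python
import math

def locational_encoding(K, d_model):
    """Eq. (24): K x d_model table of sinusoidal encodings.

    Row pos - 1 holds the encoding of location pos in {1, ..., K}.
    ASSUMPTION: d_model is even and the dimension index is zero-based.
    """
    pe = [[0.0] * d_model for _ in range(K)]
    for pos in range(1, K + 1):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos - 1][2 * i] = math.sin(angle)
            pe[pos - 1][2 * i + 1] = math.cos(angle)
    return pe

pe = locational_encoding(4, 6)  # illustrative sizes only
```

The table is typically added element-wise to the embedded features before they enter the encoder.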

3) Encoder Network: The output $\bm{O}_{m}\in\mathbb{R}^{K_{m}\times d_{\text{model}}}$ of the locational encoding is then processed by an encoder network. Specifically, $\bm{O}_{m}$ is first transformed into a query, a key and a value through different linear projections, which serve as the inputs of the multi-head self-attention sublayer. The query can be regarded as a transformed matrix comprising the feature vectors of each trajectory point in $\bm{O}_{m}$, which are compared with the feature vectors of every other trajectory point in the key matrix. The relevance between a query and a key is computed by their dot product. Define the query, the key and the value of head $i$ as $\bm{Q}_{m,i}$, $\bm{K}_{m,i}$ and $\bm{V}_{m,i}$, respectively. We have

$$\bm{Q}_{m,i}=\bm{O}_{m}\bm{W}_{m,i}^{Q},\quad \bm{K}_{m,i}=\bm{O}_{m}\bm{W}_{m,i}^{K},\quad \bm{V}_{m,i}=\bm{O}_{m}\bm{W}_{m,i}^{V}, \tag{25}$$

where $\bm{W}_{m,i}^{Q}\in\mathbb{R}^{d_{\text{model}}\times d_{i}}$, $\bm{W}_{m,i}^{K}\in\mathbb{R}^{d_{\text{model}}\times d_{i}}$ and $\bm{W}_{m,i}^{V}\in\mathbb{R}^{d_{\text{model}}\times d_{i}}$ are learnable projection parameters for head $i$ at stage $m$. Here, $d_{i}=d_{\text{model}}/h$ is the dimension of the feature vector of head $i$, with $h$ denoting the number of heads at stage $m$.

Then the output of a single head $i$ at stage $m$ is defined as

$$\bm{A}_{m,i}=\text{softmax}\left(\frac{\bm{Q}_{m,i}\bm{K}_{m,i}^{T}}{\sqrt{d_{i}}}\right)\bm{V}_{m,i}. \tag{26}$$

Distinct attention heads are computed in parallel. Their outputs are then concatenated and projected to dimension $d_{\text{model}}$. The final output $\bm{F}_{m}\in\mathbb{R}^{K_{m}\times d_{\text{model}}}$ of the multi-head attention at stage $m$ is

$$\bm{F}_{m}=[\bm{A}_{m,1},\ldots,\bm{A}_{m,h}]\bm{W}^{O}_{m}, \tag{27}$$

where $\bm{W}^{O}_{m}\in\mathbb{R}^{d_{\text{model}}\times d_{\text{model}}}$ is the weight matrix of the linear projection at stage $m$.
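The attention computation in (25)-(27) can be sketched with plain Python lists as follows. This is only an illustration of the arithmetic under the stated shapes; the helper names are hypothetical, and a practical implementation would use a tensor library.

```python
import math

def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def attention_head(o, w_q, w_k, w_v):
    """Eqs. (25)-(26): scaled dot-product attention for one head."""
    q, k, v = matmul(o, w_q), matmul(o, w_k), matmul(o, w_v)
    d_i = len(w_q[0])
    k_t = [list(c) for c in zip(*k)]
    scores = [[s / math.sqrt(d_i) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

def multi_head(o, heads, w_o):
    """Eq. (27): concatenate head outputs row-wise and project with w_o.

    heads is a list of (w_q, w_k, w_v) tuples, one per head.
    """
    outs = [attention_head(o, *h) for h in heads]
    concat = [sum((out[r] for out in outs), []) for r in range(len(o))]
    return matmul(concat, w_o)

# Sanity example: zero query/key weights give uniform attention, so every
# output row equals the column-wise mean of the value matrix.
o = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
f = multi_head(o, [(zeros, zeros, eye)], eye)
```

In the sanity example each of the three output rows equals $[2/3, 2/3]$, the mean of the three input rows.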

The output of the self-attention sublayer is subsequently fed into a feed-forward neural network (FFN) sublayer, which consists of two linear transformations with a ReLU activation in between. With $\bm{W}^{1}_{m}$, $\bm{W}^{2}_{m}$, $\bm{b}^{1}_{m}$, and $\bm{b}^{2}_{m}$ denoting the learnable weights and biases of the FFN at stage $m$, the process is described as:

$$\bm{F}^{\prime}_{m}=\text{ReLU}(\bm{F}_{m}\bm{W}^{1}_{m}+\bm{b}^{1}_{m})\bm{W}^{2}_{m}+\bm{b}^{2}_{m}. \tag{28}$$

Similar to the vanilla Transformer encoder, these two sublayers are then enclosed within residual connections to form an encoder layer, which mitigates the vanishing gradient problem. The overall encoder network comprises successive encoder layers, further enhancing the model's ability to learn complex patterns and dependencies in the user trajectory. The output $\bm{F}^{\prime}_{m}$ of the $m$-th stage then serves as the input trajectory sequence $\bm{S}_{m+1}$ for the $(m+1)$-th stage.
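As a rough illustration of how the FFN sublayer in (28) and the residual connections fit together, the sketch below stacks six encoder layers. The inner FFN width `d_ff`, the uniform `mean_mixing` stand-in for self-attention, the weight sharing across layers, and the omission of layer normalization are all simplifying assumptions made for brevity, not details taken from the paper.

```python
import numpy as np

def ffn(F, W1, b1, W2, b2):
    """Feed-forward sublayer of (28): two linear maps with a ReLU in between."""
    return np.maximum(F @ W1 + b1, 0.0) @ W2 + b2

def encoder_layer(O, attention_fn, ffn_params):
    """Wrap the attention and FFN sublayers in residual connections."""
    F = O + attention_fn(O)             # residual around self-attention
    return F + ffn(F, *ffn_params)      # residual around the FFN

def mean_mixing(X):
    # Hypothetical stand-in for multi-head self-attention:
    # every trajectory point attends uniformly to all points.
    return np.repeat(X.mean(axis=0, keepdims=True), X.shape[0], axis=0)

rng = np.random.default_rng(0)
K_m, d_model, d_ff = 8, 64, 256         # d_ff is an assumed inner width
params = (
    rng.standard_normal((d_model, d_ff)) * 0.05, np.zeros(d_ff),
    rng.standard_normal((d_ff, d_model)) * 0.05, np.zeros(d_model),
)
X0 = rng.standard_normal((K_m, d_model))
X = X0
for _ in range(6):                      # six encoder layers, as listed in TABLE II
    X = encoder_layer(X, mean_mixing, params)  # weights shared here only for brevity
```

Because every sublayer preserves the $(K_m, d_{\text{model}})$ shape, the layers can be stacked without any reshaping.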

4) Output Representation: After the $M$ stages, the output hidden representation $\bm{F}^{\prime}_{M}\in\mathbb{R}^{K_{M}\times d_{\text{model}}}$ of the final stage of the encoder network is flattened, and the predicted trajectory $\hat{\bm{Y}}_{u}\in\mathbb{R}^{T_{p}\times 2}$ of user $u$ is then obtained through a linear projection:

$$\hat{\bm{Y}}_{u}=[\text{Flatten}(\bm{F}^{\prime}_{M})\bm{W}^{x}+\bm{b}^{x},\ \text{Flatten}(\bm{F}^{\prime}_{M})\bm{W}^{y}+\bm{b}^{y}], \tag{29}$$

where $\bm{W}^{x},\bm{W}^{y}\in\mathbb{R}^{K_{M}d_{\text{model}}\times T_{p}}$ and $\bm{b}^{x},\bm{b}^{y}\in\mathbb{R}^{1\times T_{p}}$ denote the output weight matrices and bias vectors for the 2-D coordinates of $\hat{\bm{Y}}_{u}$, respectively, and $T_{p}$ is the prediction horizon.

To evaluate the accuracy of the predicted trajectories generated by the hierarchical multi-scale Transformer, the root-mean-squared error (RMSE) is employed as the evaluation metric. RMSE is the square root of the mean squared Euclidean distance between the predicted and ground-truth locations across the prediction horizon. It is defined as

$$\mathcal{L}=\sqrt{\frac{1}{T_{p}}\sum_{t=1}^{T_{p}}\left\|\hat{\bm{Y}}_{u}[t]-\bm{Y}_{u}[t]\right\|^{2}}, \tag{30}$$

where $\hat{\bm{Y}}_{u}[t]$ and $\bm{Y}_{u}[t]$ denote the predicted and ground-truth locations of user $u$ at the $t$-th step of the prediction horizon.

In the training phase, the hierarchical Transformer trajectory prediction model is trained by minimizing the above RMSE.
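For concreteness, the RMSE metric can be sketched in a few lines of Python. The sketch assumes the usual RMSE convention (squared Euclidean errors averaged over the horizon before taking the square root); the function name and the toy coordinates are hypothetical.

```python
import math

def trajectory_rmse(predicted, actual):
    """RMSE over a prediction horizon; both arguments are lists of (x, y) points."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        (px - ax) ** 2 + (py - ay) ** 2
        for (px, py), (ax, ay) in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy horizon of three steps: only the first step is off, by a distance of 5.
predicted = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
actual = [(3.0, 4.0), (1.0, 1.0), (2.0, 2.0)]
rmse = trajectory_rmse(predicted, actual)   # sqrt(25 / 3)
```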

IV Prediction-enhanced UAV Trajectory Optimization and Task Offloading Algorithm

In this section, the UAV agent decides on its flight action and task offloading at each time slot using the observed and predicted trajectory information of WBAN users. The Markov decision process (MDP) framework is first used to model problem (18) for the UAV edge computing network, and then a prediction-enhanced DRL algorithm for UAV trajectory optimization and task offloading is proposed.

IV-A MDP Elements Formulation

In this work, the key components of MDP are designed as follows.

1) State Space: The state space captures the environment's information at each time slot $t$. For our framework, the state encompasses the currently observable parameters and the predicted future user trajectory information. We define the state at time slot $t$ as

$$s[t]=\{\bm{I}[t],\bm{P}[t],\bm{p}_{v}[t],E^{\text{remain}}[t]\}. \tag{31}$$

Here, $\bm{I}[t]=\{I_{u,n}[t]\}_{U\times N}$ is the set of task criticality indices, $\bm{P}[t]=\{p_{u}[t]\}_{U\times 1}$ is the set of current user locations, $\bm{p}_{v}[t]$ is the UAV location, and $E^{\text{remain}}[t]=E^{\text{remain}}[t-1]-\left(E^{\text{fly}}[t]+\sum_{u=1}^{U}\sum_{n=1}^{N}E^{\text{comp}}_{u,n}[t]\right)$ is the remaining energy of the UAV at time slot $t$. In particular, at $t=1$, $E^{\text{remain}}[t]=E^{\text{uav}}$. Thus, the dimension of the state space is $U(N+1)+3$.

2) Action Space: The action space represents the decisions made by the embodied UAV agent given a state. The selection of actions is based on the agent's policy, which is gradually optimized throughout the learning process. Specifically, the actions include the flying speed of the UAV, the flying angle of the UAV, and the task offloading decisions. The action taken by the agent at time slot $t$ can be represented as

$$a[t]=\{v[t],\sigma[t],z_{u,n}[t]\}. \tag{32}$$

Note that $v[t]\in[0,v^{\max}]$ and $\sigma[t]\in[0,2\pi]$ must be satisfied. Each task offloading decision is rounded to the nearest integer in $\{0,1\}$. That is, if $z_{u,n}[t]\in[0,0.5)$, we set $z_{u,n}[t]=0$; if $z_{u,n}[t]\in[0.5,1]$, we set $z_{u,n}[t]=1$. The dimension of the action space is $UN+2$.
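The mapping from a raw actor output to the action in (32) can be sketched as follows. The sketch assumes that the actor emits $UN+2$ values in $[0,1]$ and that the speed and angle are obtained by linear scaling; this scaling and the function name `decode_action` are assumptions for illustration, while the 0.5 rounding threshold follows the rule stated above.

```python
import math

def decode_action(raw, v_max, num_users, num_tasks):
    """Map raw outputs in [0, 1] to (speed, angle, offloading matrix)."""
    expected = num_users * num_tasks + 2
    if len(raw) != expected:
        raise ValueError(f"expected {expected} values, got {len(raw)}")
    if any(not 0.0 <= x <= 1.0 for x in raw):
        raise ValueError("raw outputs must lie in [0, 1]")
    speed = raw[0] * v_max                    # v[t] in [0, v_max]
    angle = raw[1] * 2.0 * math.pi            # sigma[t] in [0, 2*pi]
    # Round each offloading decision: [0, 0.5) -> 0, [0.5, 1] -> 1.
    flat = [1 if x >= 0.5 else 0 for x in raw[2:]]
    z = [flat[u * num_tasks:(u + 1) * num_tasks] for u in range(num_users)]
    return speed, angle, z

# Two users with two tasks each, v_max = 50 m/s as in TABLE I.
speed, angle, z = decode_action([0.5, 0.25, 0.49, 0.5, 1.0, 0.0], 50.0, 2, 2)
```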

3) Reward: The reward $r[t]$ evaluates the utility of the agent's action $a[t]$ in the given state $s[t]$. In our algorithm, the reward function is designed to guide the UAV agent toward optimal actions by maximizing the remaining time $\tau-T^{\text{total}}_{u,n}[t]$ after task completion, while accounting for the constraints on UAV energy consumption and task completion time. It is defined as

$$r[t]=\sum_{n\in\mathcal{N}}I_{u,n}[t]\left(\tau-T^{\text{total}}_{u,n}[t]\right)\times\Omega^{\text{uav}}\times\Omega^{\text{time}}, \tag{33}$$

where $\Omega^{\text{uav}}$ and $\Omega^{\text{time}}$ are binary penalty terms that ensure the fulfillment of constraints (18b) and (18c), respectively. These penalty variables are defined as follows:

$$\Omega^{\text{uav}}=\begin{cases}1, & \text{if}\ E^{\text{remain}}[t]\geq 0,\\ 0, & \text{if}\ E^{\text{remain}}[t]<0,\end{cases}\qquad \Omega^{\text{time}}=\begin{cases}1, & \text{if}\ T^{\text{total}}_{u,n}[t]\leq\tau,\\ 0, & \text{if}\ T^{\text{total}}_{u,n}[t]>\tau.\end{cases} \tag{38}$$

The reward function is strictly positive only when all constraints in problem (18) are satisfied.
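A small sketch may help to see how the penalty terms gate the reward in (33). It assumes that $\Omega^{\text{time}}$ is evaluated jointly over the tasks being summed (it is 1 only if every task meets the deadline $\tau$); this joint reading and the function name `reward` are assumptions for illustration.

```python
def reward(criticality, completion_time, tau, energy_remaining):
    """Criticality-weighted remaining time, gated by the two binary penalties."""
    if len(criticality) != len(completion_time):
        raise ValueError("one completion time is required per task")
    omega_uav = 1 if energy_remaining >= 0 else 0
    # Assumption: the time penalty is 1 only if all tasks meet the deadline.
    omega_time = 1 if all(t <= tau for t in completion_time) else 0
    margin = sum(i * (tau - t) for i, t in zip(criticality, completion_time))
    return margin * omega_uav * omega_time

# Both tasks finish within tau = 1.0 and the UAV still has energy.
r_ok = reward([0.5, 1.0], [0.4, 0.6], tau=1.0, energy_remaining=10.0)
# One task misses the deadline, so the reward collapses to zero.
r_late = reward([0.5, 1.0], [0.4, 1.2], tau=1.0, energy_remaining=10.0)
# The energy budget is exhausted, so the reward is zero as well.
r_empty = reward([0.5, 1.0], [0.4, 0.6], tau=1.0, energy_remaining=-1.0)
```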

IV-B Algorithm Design

In this work, proximal policy optimization (PPO), a representative reinforcement learning algorithm that has shown stable performance in various environments [schulman2017proximal], is utilized to implement the prediction-enhanced UAV trajectory optimization and task offloading strategy. PPO is a policy gradient method within the actor-critic framework, where the actor network (parameterized by $\delta_{A}$) defines the flight and offloading policy, and the critic network (parameterized by $\delta_{C}$) estimates the value function. It improves on the trust region policy optimization (TRPO) algorithm. TRPO constrains the distance between policies by the Kullback-Leibler (KL) divergence, preventing an excessively large policy change in a single update. However, its computations, which rely on Taylor expansion approximations or conjugate gradients, are overly complex. PPO, on the other hand, directly constrains the distance between the old policy $\pi_{\delta_{A}^{\text{old}}}$ and the new policy $\pi_{\delta_{A}}$ within the objective function through clipping. The clipped objective function is as follows:

$$\mathrm{L}^{\text{CLIP}}(\delta_{A})=\mathbb{E}_{t}\left[\min\left(\varphi_{t}(\delta_{A})\hat{A}[t],\ \text{clip}(\varphi_{t}(\delta_{A}),1-\varepsilon,1+\varepsilon)\hat{A}[t]\right)\right], \tag{39}$$

where $\varepsilon$ is a hyper-parameter that limits the update magnitude, $\varphi_{t}(\delta_{A})=\pi_{\delta_{A}}(a[t]|s[t])/\pi_{\delta_{A}^{\text{old}}}(a[t]|s[t])$ is the probability ratio between the new and old policies, and $\hat{A}[t]$ is the generalized advantage estimate at time $t$, which is calculated as

$$\hat{A}[t]=\sum_{l=0}^{\infty}(\gamma\lambda)^{l}\left(r[t+l]+\gamma V(s[t+l+1])-V(s[t+l])\right), \tag{40}$$

where $V(s[t])=\sum_{l=0}^{\infty}\gamma^{l}r[t+l]$ is the cumulative discounted reward, which also represents the state-value function. The loss function of the critic network is then defined as

$$\mathrm{L}^{\text{C}}(\delta_{C})=\frac{1}{2}\mathbb{E}_{t}\left[\left(V_{\delta_{C}}(s[t])-V_{\text{tar}}(s[t])\right)^{2}\right], \tag{41}$$

where $V_{\delta_{C}}(s[t])$ is the value calculated by the critic network with parameter set $\delta_{C}$, and $V_{\text{tar}}(s[t])$ is the target value, i.e.,

$$V_{\text{tar}}(s[t])=r[t]+\gamma V_{\delta_{C}}(s[t+1]). \tag{42}$$

Consequently, the actor and critic can be updated according to (39) and (41), respectively.
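The two quantities that drive the actor update, the advantage estimate in (40) and the clipped term in (39), can be sketched in plain Python. The infinite sum in (40) is truncated at the end of a finite rollout with a bootstrap value, which is a common practical assumption; the value $\lambda=0.95$ in the usage example is also assumed, since it is not reported in the paper.

```python
def generalized_advantage(rewards, values, gamma, lam):
    """Finite-horizon GAE; `values` holds one more entry than `rewards` (bootstrap)."""
    if len(values) != len(rewards) + 1:
        raise ValueError("values must contain len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Temporal-difference residual of step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def clipped_surrogate(ratio, advantage, eps):
    """Per-sample term inside the expectation of the clipped objective."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# gamma = 0.98 and eps = 0.2 follow TABLE II; lam = 0.95 is an assumed value.
adv = generalized_advantage([1.0, 0.5, 2.0], [0.2, 0.1, 0.3, 0.0], gamma=0.98, lam=0.95)
term = clipped_surrogate(ratio=1.5, advantage=adv[0], eps=0.2)
```

For a positive advantage, a ratio above $1+\varepsilon$ gains nothing beyond the clipped value, which removes the incentive for an excessively large policy step.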

It is noted that the actions in this work are typically continuous and bounded. Conventional action sampling from a Gaussian distribution by the actor network unavoidably introduces an estimation bias into the policy gradient, since boundary effects are imposed by clipping the values of out-of-bound actions. To tackle this problem, we adopt the Beta distribution instead of the Gaussian distribution for the parameter learning of the actor network, which has the following form

$$f(x;\alpha_{0},\beta_{0})=\frac{\Gamma(\alpha_{0}+\beta_{0})}{\Gamma(\alpha_{0})\Gamma(\beta_{0})}x^{\alpha_{0}-1}(1-x)^{\beta_{0}-1},\quad x\in[0,1], \tag{43}$$

where $\alpha_{0}$ and $\beta_{0}$ are the parameters of the Beta distribution. Since (43) has a bounded domain, it is appropriate for sampling bounded actions. For a clear representation of the interaction between the embodied UAV agent and the environment in our proposed method, we provide the detailed workflow in Fig. 3.

Figure 3: The workflow of the interaction between the embodied UAV agent and the environment
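To make the role of the Beta distribution in (43) concrete, the following sketch evaluates its density and draws bounded flying speeds with Python's standard library. The shape parameters, the linear rescaling to $[0, v^{\max}]$, and the function names are illustrative assumptions; in the actual method the shape parameters would be produced by the actor network.

```python
import math
import random

def beta_pdf(x, alpha0, beta0):
    """Density of the Beta distribution, evaluated on the open interval (0, 1)."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    coefficient = math.gamma(alpha0 + beta0) / (math.gamma(alpha0) * math.gamma(beta0))
    return coefficient * x ** (alpha0 - 1.0) * (1.0 - x) ** (beta0 - 1.0)

def sample_bounded(alpha0, beta0, low, high, rng):
    """Draw a Beta sample and rescale it linearly to [low, high]."""
    return low + (high - low) * rng.betavariate(alpha0, beta0)

rng = random.Random(0)
# Assumed shape parameters (2, 2); every sample stays inside [0, 50] m/s by construction.
speeds = [sample_bounded(2.0, 2.0, 0.0, 50.0, rng) for _ in range(1000)]
density_at_half = beta_pdf(0.5, 2.0, 2.0)   # 6 * 0.5 * 0.5 = 1.5
```

No clipping step is needed after sampling, which is the property that avoids the boundary bias discussed above.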

We then propose the UAV trajectory optimization and task offloading algorithm that integrates user mobility information, as shown in Algorithm 1. In the training process of the algorithm, the UAV agent applies the trained hierarchical multi-scale Transformer model in Section III to predict the future mobility of WBAN users. Specifically, the UAV agent observes the current state $s[t]$ from the environment at time slot $t$. Based on the historical location information $\mathcal{P}_{u}[t]=\{\bm{p}_{u}[t-T_{h}+1],\ldots,\bm{p}_{u}[t]\}$, the user mobility prediction algorithm is invoked to predict the user location information $\{\bm{p}_{u}[t+1],\ldots,\bm{p}_{u}[t+T_{p}]\}$ for each user $u$. This information is concatenated with the environment state, and the integrated state $\tilde{s}[t]$ is used for algorithm training. Then, actions are selected by the actor network. The next state $s[t+1]$ is also integrated with the predicted location information to generate $\tilde{s}[t+1]$. Finally, the experience samples are stored in the replay buffer together with the reward feedback from the environment, and the actor and critic networks are updated by gradient-based optimization to continuously improve the UAV trajectory and task offloading policy.

1: Initialization: set the maximum number of episodes $E^{\max}$, the maximum number of steps per episode $T^{\max}$, discount factor $\gamma$, hyper-parameter $\varepsilon$, learning rate, actor network $\delta_{A}$, critic network $\delta_{C}$, and create the environment;
2: for $e=1:E^{\max}$ do
3:   Reset the environment and obtain the initial state;
4:   for $t=1:T^{\max}$ do
5:     Observe the current state $s[t]=\left\{\bm{I}[t],\bm{P}[t],\bm{p}_{v}[t],E^{\text{remain}}[t]\right\}$;
6:     for each WBAN user $u\in\mathcal{U}$ do
7:       Predict user locations $\mathcal{P}_{u}[t]=\{\bm{p}_{u}[t+1],\ldots,\bm{p}_{u}[t+T_{p}]\}$ based on the proposed mobility prediction model;
8:     end for
9:     Concatenate $s[t]$ with $\{\mathcal{P}_{u}[t]\}_{U}$ to obtain the integrated state $\tilde{s}[t]$;
10:    Select action $a[t]$ using the actor network based on policy $\pi_{\delta_{A}}$ and get reward $r[t]$;
11:    Based on $s[t+1]$, predict user locations $\mathcal{P}_{u}[t+1]=\{\bm{p}_{u}[t+2],\ldots,\bm{p}_{u}[t+T_{p}+1]\}$ for each user and obtain the integrated state $\tilde{s}[t+1]$;
12:    Record the transition tuple $(\tilde{s}[t],a[t],r[t],\tilde{s}[t+1])$ into the experience replay buffer;
13:    Sample a random mini-batch of transitions from the experience replay buffer;
14:    Update actor network $\delta_{A}$ by maximizing $\mathrm{L}^{\text{CLIP}}(\delta_{A})$;
15:    Update critic network $\delta_{C}$ by minimizing $\mathrm{L}^{\text{C}}(\delta_{C})$;
16:    Update the state of the environment.
17:  end for
18: end for
Algorithm 1 Prediction-enhanced UAV Trajectory Optimization and Task Offloading Algorithm
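The state integration step of Algorithm 1 (lines 6-9) can be sketched as below. The constant-velocity predictor is only a hypothetical stand-in for the trained hierarchical Transformer, and the flat-list state layout and toy numbers are assumptions chosen to keep the example self-contained.

```python
def constant_velocity_predict(history, horizon):
    """Hypothetical stand-in predictor: extrapolate the last observed displacement."""
    if len(history) < 2:
        raise ValueError("at least two historical points are required")
    (x0, y0), (x1, y1) = history[-2], history[-1]
    dx, dy = x1 - x0, y1 - y0
    return [(x1 + k * dx, y1 + k * dy) for k in range(1, horizon + 1)]

def integrate_state(state, predictions):
    """Append the predicted (x, y) points of every user to the flat state vector."""
    integrated = list(state)
    for trajectory in predictions:
        for x, y in trajectory:
            integrated.extend((x, y))
    return integrated

# Toy setting: two users, prediction horizon T_p = 2 time slots.
T_p = 2
histories = [[(0.0, 0.0), (1.0, 2.0)], [(5.0, 5.0), (5.0, 4.0)]]
state = [0.5, 0.7, 1.0, 2.0, 5.0, 4.0, 3.0, 3.0, 500.0]   # toy s[t]
predictions = [constant_velocity_predict(h, T_p) for h in histories]
s_tilde = integrate_state(state, predictions)
```

With this layout, the integrated state grows by $2UT_{p}$ entries relative to $s[t]$, so the input layers of the actor and critic networks must be sized accordingly.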

V Performance Evaluation

This section presents a comprehensive evaluation of the proposed hierarchical Transformer trajectory prediction model and the PETO algorithm. Specifically, we introduce the experimental dataset and simulation settings in subsection V-A. Then, we test and validate the performance of the trajectory prediction model in subsection V-B, while subsection V-C presents the simulation results for the PETO algorithm in terms of weighted task completion time compared with baseline methods.

V-A Experiment Setting

TABLE I: Environmental Parameter Settings
| Description | Value |
| --- | --- |
| Altitude of UAV | 100 m [tang2025deep] |
| Mission period $T^{\max}$ | 100 s [liu2023energy] |
| Data load $D_{u,n}[t]$ | [1, 2] MB |
| Computation amount $C_{u,n}[t]$ | [1, 2] Gigacycles |
| Local computation capability $V_{u}$ of user $u$ | 1 Gigacycles/s [liu2023energy] |
| Edge computation capability $F_{v}$ of UAV | 10 Gigacycles/s [liu2023energy] |
| Channel bandwidth $W_{u}$ for user $u$ | 1 MHz |
| Parameters of LOS channel $a$, $b$ | 10, 0.6 [zeng2019energy] |
| NLOS attenuation $\kappa$ | 0.2 [yang2022online] |
| Path loss exponent $\varsigma$ | 2.3 [yang2022online] |
| Channel gain at reference distance $g_{0}$ | 1.42e-4 [wang2021deep] |
| Initial energy $E^{\text{UAV}}$ of UAV | 500 kJ [chen2025computation] |
| Effective capacitance coefficient $\eta$ | 1e-27 [liu2023energy] |
| Maximum flying speed $V^{\max}$ | 50 m/s [chen2025computation] |
| Transmission power $P_{u}[t]$ | 100 mW [ren2024resource] |
| Noise power $N_{0}$ | -60 dBm [hu2019uav] |

To evaluate the performance of the proposed trajectory prediction model, a real-world public human trajectory dataset is used for training and testing. The trajectory dataset was collected in the GeoLife project by Microsoft Research Asia [zheng2010geolife]. The dataset contains time-stamped GPS records with precise latitude and longitude information on the consecutive locations of 182 users. We adopt a sampling interval of 1 s. For trajectory data with intermittent missing values, we apply linear interpolation between adjacent observations. Each user trajectory is split into multiple segments of duration $T_{h}+T_{p}$ seconds by a sliding window, and the data of the first $T_{h}$ seconds of each segment are used to predict the user's trajectory in the following $T_{p}$ seconds. The dataset is divided into training, validation, and test sets in a 7:2:1 ratio.
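The preprocessing described above (linear interpolation of gaps, then sliding-window segmentation) can be sketched as follows. The sketch assumes a track sampled once per second with `None` marking missing samples, observed endpoints, and a window stride of one sample; the stride and the function names are assumptions, since the paper does not report them.

```python
def fill_gaps(track):
    """Linearly interpolate runs of None between observed (x, y) samples."""
    filled = list(track)
    i = 0
    while i < len(filled):
        if filled[i] is not None:
            i += 1
            continue
        j = i
        while j < len(filled) and filled[j] is None:
            j += 1
        if i == 0 or j == len(filled):
            raise ValueError("cannot interpolate a gap at the track boundary")
        (x0, y0), (x1, y1) = filled[i - 1], filled[j]
        span = j - (i - 1)
        for k in range(i, j):
            w = (k - (i - 1)) / span          # fractional position inside the gap
            filled[k] = (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
        i = j
    return filled

def sliding_windows(track, t_h, t_p, stride=1):
    """Split a track into (history, future) pairs of lengths t_h and t_p."""
    size = t_h + t_p
    return [
        (track[s:s + t_h], track[s + t_h:s + size])
        for s in range(0, len(track) - size + 1, stride)
    ]

track = fill_gaps([(0.0, 0.0), None, None, (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)])
samples = sliding_windows(track, t_h=3, t_p=2)
```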

TABLE II: Parameters in the Proposed Methods
| Description | Value |
| --- | --- |
| Length of historical trajectory $T_{h}$ | 60 s |
| Length of predicted trajectory $T_{p}$ | 10 s |
| Number of stages $M$ | 3 |
| Trajectory slice size $w_{m}$ | 2 |
| Hidden size of the output $d_{\text{model}}$ | 64 |
| Number of encoders | 6 |
| Batch size | 64 |
| Learning rate | 1e-3 |
| Clipping parameter $\varepsilon$ | 0.2 |
| Hidden size of the actor/critic network | 128 |
| Discount factor $\gamma$ | 0.98 |
| Replay buffer size | 1e6 |

For the performance evaluation of the PETO algorithm, we consider an environment containing 10 mobile WBAN users that generate tasks with different criticality. According to the IEEE 802.15.6 standard [kwak2010overview], the services provided by the WBAN can be divided into four categories: non-medical services, low-priority medical services, general health services, and high-priority medical services. Each WBAN user is equipped with five different physiological sensors to monitor data with different criticality, including background information, voice, ECG, body temperature, and movement, and generates the corresponding data analysis tasks. Therefore, after normalization, $\phi_{u}\in\{0.25,0.5,0.75,1\}$ and $\rho_{u,n}\in\{0.2,0.4,0.6,0.8,1\}$. The urgency of the perceived data is divided into normal data and emergency abnormal data, i.e., $\alpha_{u,n}\in\{0.5,1\}$. The criticality index $I_{u,n}[t]$ is defined as $(\phi_{u}+\rho_{u,n}+\alpha_{u,n})/3$. The criticality of the data at different time slots follows the Markov property, and the corresponding state transition probability matrix is $[0.7,0.3;0.3,0.7]$ [yuan2018performance]. To facilitate the distance computation between the UAV and users, user locations are converted from World Geodetic System coordinates to the Cartesian coordinate system by the Haversine formula. Each time slot is divided into 1 s of movement time and 1 s of offloading and computation time [hu2024drl]. Unless otherwise stated, the environmental parameters are listed in TABLE I. The default parameters of our hierarchical Transformer trajectory prediction model and the PETO algorithm are listed in TABLE II. The Adam optimizer and standard normalization are employed for model training in both methods. To mitigate overfitting, we employ early stopping during training. Specifically, we monitor the validation loss and halt the training process if it does not decrease for 10 consecutive epochs, restoring the model parameters that achieved the best validation performance. In addition, all the experiments are implemented with PyTorch 1.13 [paszke2019pytorch] and Python 3.8, and trained on an NVIDIA T4 GPU with 2560 CUDA cores.
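The task criticality model used in the simulation can be sketched as follows. The sketch assumes that the two-state Markov chain acts on the urgency level $\alpha_{u,n}$ (normal versus emergency), which matches the $2\times 2$ transition matrix but is an interpretation rather than a detail confirmed by the text; the function names are hypothetical.

```python
import random

def criticality_index(phi, rho, alpha):
    """Average of service priority, sensor priority, and data urgency."""
    return (phi + rho + alpha) / 3.0

def next_urgency(alpha, rng, stay_probability=0.7):
    """Two-state Markov step: keep the current urgency with probability 0.7."""
    other = 1.0 if alpha == 0.5 else 0.5
    return alpha if rng.random() < stay_probability else other

rng = random.Random(0)
alpha = 0.5                      # start from normal data
stays = 0
steps = 10000
for _ in range(steps):
    new_alpha = next_urgency(alpha, rng)
    stays += new_alpha == alpha
    alpha = new_alpha
stay_fraction = stays / steps    # empirical estimate of the diagonal entry

lowest = criticality_index(0.25, 0.2, 0.5)
highest = criticality_index(1.0, 1.0, 1.0)
```

Under these settings the criticality index ranges from about 0.32 to 1, so no task ever receives a zero weight in the reward.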

V-B Evaluation of Trajectory Prediction Model

Figure 4: Convergence behavior of mobility prediction models

To verify the effectiveness of the proposed trajectory prediction model, two prediction methods from existing work are also implemented for comparative analysis. Specifically, LSTM-based user trajectory prediction [ma2020leveraging] serves as the baseline algorithm. In this model, an LSTM cell of size 128 is used, followed by two fully connected layers with ReLU activation functions. The vanilla Transformer is also utilized to predict the user trajectory [najjar2024pre], in which the numbers of encoders and attention heads are the same as in the proposed hierarchical Transformer trajectory prediction model. Each algorithm is run 10 times with different seeds, and the mean RMSE calculated using (30) over the 10 runs is reported.

Fig. 4 shows the convergence behavior of the three trajectory prediction models, in which their training losses are plotted against epochs. As the number of epochs increases, the training losses of the three trajectory prediction models gradually decrease and approximately converge to 0. It can be observed that the proposed hierarchical Transformer trajectory prediction model converges more rapidly than the other models, reaching a stable state at approximately 20 epochs, with early stopping triggered at epoch 42 to prevent overfitting.

Figure 5: RMSE comparison for user trajectory prediction models

To verify the prediction performance of different user trajectory prediction models, Fig. 5 presents the RMSE of their predicted trajectories against the actual ones for historical observation lengths from 20 s to 70 s. The prediction horizon is set to 10 s. As shown in the figure, a longer historical observation window contributes to more accurate predictions, at the cost of higher complexity. The improvement becomes smaller as the length of historical observations increases. For the proposed hierarchical Transformer trajectory prediction model, historical observations of 60 s are enough for a good balance between accuracy and complexity. Compared to the LSTM model, both Transformer-based models achieve lower RMSE, particularly when the historical observation is long. Notably, the proposed hierarchical Transformer trajectory prediction model reduces RMSE by an average of 67.86% relative to the LSTM model, with the most pronounced drop of 80.42% at 60 s. This suggests that the multi-head self-attention mechanism in the vanilla Transformer model and the proposed model can better capture long-range temporal dependencies in the user trajectory, which yields better mobility prediction. Besides, the proposed trajectory prediction model outperforms the vanilla Transformer model with fixed-scale feature representation, showing a consistent RMSE reduction with an average of 46.82%. This demonstrates that hierarchical feature extraction, from small-scale fine-grained temporal trajectory features to large-scale coarse-grained trajectory features, is more effective for prediction.

In addition to quantitative metrics, the trajectory prediction results of different models are visualized in Fig. 6, which illustrates the prediction results for a representative user (ID 20). The x-axis represents the latitude and the y-axis represents the longitude, both in decimal degrees of World Geodetic System coordinates. The trajectory obtained by the proposed hierarchical Transformer prediction model is closely aligned with the actual trajectory, which demonstrates its superior fitting accuracy. This result highlights the model's ability to capture complex temporal dependencies of user movements.

Figure 6: Trajectory prediction results of different models

V-C Evaluation of PETO Algorithm

In this subsection, we validate the performance of the proposed PETO algorithm. The convergence results and the optimized UAV trajectory are provided first. Then, we compare the PETO algorithm with the following benchmark and state-of-the-art methods:

1) Random UAV trajectory and edge computing (RUEC): At each time slot, the UAV agent randomly chooses the flying speed and angle, and each WBAN user offloads all of its tasks to the UAV. Each task $\Theta_{u,n}[t]$ is allocated computation resources according to its criticality index $I_{u,n}[t]$.

2) PPO-based UAV trajectory optimization and task offloading algorithm without prediction (PAWP) [wang2025joint]: At each time slot, the UAV agent observes the current system state, based on which it optimizes its trajectory and makes task offloading decisions by the PPO reinforcement learning algorithm.

3) DDPG-based UAV trajectory optimization and task offloading algorithm with prediction (DAWP) [wang2021computation]: At each time slot, the UAV agent utilizes the proposed trajectory prediction model to predict the future user trajectories. According to the current state and the predicted partial state, the UAV agent optimizes its trajectory and makes task offloading decisions by the DDPG reinforcement learning algorithm.
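As an illustration of the simplest benchmark, the sketch below draws a random flight action and splits the UAV computation capability among offloaded tasks. The proportional split is an assumed reading of "according to its criticality index", and the function names are hypothetical.

```python
import math
import random

def random_flight_action(v_max, rng):
    """Uniformly random flying speed in [0, v_max] and angle in [0, 2*pi]."""
    return rng.uniform(0.0, v_max), rng.uniform(0.0, 2.0 * math.pi)

def allocate_by_criticality(criticality, f_total):
    """Split the UAV computation capability proportionally to task criticality."""
    total = sum(criticality)
    if total <= 0:
        raise ValueError("criticality indices must have a positive sum")
    return [f_total * c / total for c in criticality]

rng = random.Random(0)
speed, angle = random_flight_action(50.0, rng)          # v_max = 50 m/s, as in TABLE I
# F_v = 10 Gigacycles/s from TABLE I, shared by three offloaded tasks.
shares = allocate_by_criticality([0.2, 0.3, 0.5], 10.0)
```

Because every task is offloaded under RUEC, each share shrinks as the number of users grows, which lengthens the completion time of the offloaded tasks.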

Figure 7: Convergence behavior of the proposed PETO algorithm
Figure 8: UAV trajectory of the proposed PETO algorithm
Figure 9: Weighted average task completion time versus number of WBAN users

Fig. 7 shows the convergence behavior of our PETO algorithm, in which the weighted average task completion time and the UAV remaining energy in each episode are plotted against the number of episodes. In the early training stage (e.g., episodes 0-1000), the PETO algorithm maintains a high exploration rate, causing the UAV agent to frequently try sub-optimal actions. As training proceeds, the weighted average task completion time declines and the UAV remaining energy grows, as better trajectory and offloading strategies are learned from the environment dynamics. The results indicate that our PETO method converges stably and is reliable in this dynamic network environment. In addition, Fig. 8 shows both the users' trajectories and the UAV trajectory designed by the proposed PETO algorithm. The initial position of the UAV is set to the geometric center of the user locations. It can be observed that the UAV agent follows a learned trajectory in response to the movements of WBAN users, thereby maintaining the quality of edge computing services.

In Fig. 9, the weighted average task completion time of the different solutions is compared as the number of WBAN users varies. As the number of WBAN users increases, more users compete for the limited computation resources of the UAV, and the computation resources allocated to each task decrease. As a result, the weighted average task completion time grows in all four methods. Note that the increase in weighted average task completion time gradually becomes smaller for the PAWP, DAWP, and proposed PETO methods. This is because the computational demands of tasks from more WBAN users (e.g., 20) exceed the computational capability of the UAV, which results in most of the tasks being processed locally. The proposed PETO method consistently achieves superior performance compared to the other methods. Specifically, it achieves a 57.95% decrease in weighted average task completion time over RUEC on average. The performance gain comes from better flight trajectory and offloading decisions through the interactive learning of the embodied UAV agent. Without parallel computing at the local devices and the UAV, too many tasks from WBAN users overload the UAV in the RUEC method, leading to a longer average task completion time. Compared to the PAWP method, the proposed PETO method reduces the weighted average task completion time by approximately 21.76%. This improvement is attributed to the advanced prediction of user mobility by the proposed hierarchical Transformer framework, which helps the UAV relocate to a better position at each time slot to provide the edge computing service. Compared to DAWP, the PETO method also performs better, because the generalized advantage estimation used in PETO provides more accurate advantage estimates and more effective action learning. These results demonstrate the scalability of our PETO method, as it maintains superior performance even as the network size grows.

VI Conclusions

In this paper, we have proposed an embodied AI-enhanced IoMT edge computing framework, in which the UAV embodied AI agent serves as the edge server for mobile WBAN users with time-varying task criticality. By integrating the designed hierarchical Transformer model for user mobility prediction and DRL for UAV flight trajectory and task offloading decision-making, the proposed method can reduce task completion time and improve adaptability in dynamic IoMT environments. Real-world traces have demonstrated that our proposed user mobility prediction method consistently outperforms traditional methods in terms of convergence speed and prediction accuracy. In addition, simulation results have shown that the proposed DRL algorithm greatly improves the task completion time performance through prediction enhancement, validating the effectiveness of using the embodied AI framework in IoMT edge computing scenarios.