Encrypted Traffic Detection in Resource Constrained IoT Networks: A Diffusion Model and LLM Integrated Framework
Abstract
The proliferation of Internet-of-things (IoT) infrastructures and the widespread adoption of traffic encryption present significant challenges for network traffic detection, particularly in environments characterized by dynamic traffic patterns, constrained computational capabilities, and strict latency requirements. In this paper, we propose DMLITE, a diffusion model and large language model (LLM) integrated traffic embedding framework for network traffic detection within resource-limited IoT environments. DMLITE addresses these challenges through a three-phase architecture comprising traffic visual preprocessing, diffusion-based multi-level feature extraction, and LLM-guided feature optimization. Specifically, the framework utilizes self-supervised diffusion models to capture both fine-grained and abstract patterns in encrypted traffic through multi-level feature fusion and contrastive learning with representative sample selection, thus enabling rapid adaptation to new traffic patterns with minimal labeled data. Furthermore, DMLITE incorporates LLMs to dynamically adjust particle swarm optimization parameters for intelligent feature selection, using a dual objective function that minimizes both classification error and variance across data distributions. Comprehensive experimental validation on benchmark datasets confirms the effectiveness of DMLITE, which achieves classification accuracies of 98.87%, 92.61%, and 99.83% on the USTC-TFC, ISCX-VPN, and Edge-IIoTset datasets, respectively. Compared with the representative deep learning model, DMLITE improves classification accuracy by an average of 3.7% and reduces training time by an average of 41.9%.
Index Terms:
IoT network traffic classification, diffusion model, feature extraction, large language model, and feature selection.
I Introduction
The rapid proliferation of Internet-of-things (IoT) technology is profoundly reshaping contemporary network ecosystems and digital infrastructure [TranDang2020, Li2025Aerial]. With connected IoT devices projected to exceed 75 billion by 2025 [Alao2025, Li2024c], network traffic detection has emerged as a critical component in ensuring the security and operational stability of these complex environments [Sheng2025]. Effective traffic detection not only identifies anomalous behaviors and malicious attacks but also supports optimized allocation of network resources and quality of service guarantees [Wu2021, Liu2024]. In increasingly sophisticated threat landscapes, particularly when facing zero-day attacks and encrypted traffic challenges [Stellios2018, Papadogiannaki2022], precise traffic classification and anomaly detection mechanisms have become essential pillars for maintaining the integrity and reliability of IoT ecosystems [Dai2023].
Despite significant advances in network traffic detection research [Nascita2024], existing methods face mounting challenges as IoT environments become more complex and traffic encryption technologies become increasingly sophisticated and widespread [Lin2021]. Traditional traffic classification methods, including port-based approaches and deep packet inspection (DPI) [Blaise2020, Yan2019], have become unsuitable for modern IoT network environments due to the prevalence of port disguising techniques and encrypted traffic [Zhu2023]. To address these limitations, traffic detection methods based on machine learning (ML) [Ede2020] and deep learning (DL) were subsequently introduced [Zhang2019, Zhang2025c]. ML methods aim to identify network traffic through manually designed features and trained classifiers [Gaurav2023], offering improvements in pattern recognition capabilities. Concurrently, DL methods employ neural networks to automatically learn representations from raw traffic data, processing traffic in an end-to-end fashion by transforming it into image or sequence formats [Zhou2017, Xiao2022].
However, both ML and DL approaches present significant limitations in IoT contexts [Xu2024, Zhang2024b]. ML-based methods such as ensemble classifiers and support vector machine (SVM) heavily rely on manually engineered features and struggle to adapt to dynamically changing network environments, thereby limiting their effectiveness against evolving threats and traffic patterns in IoT networks [Rezaei2019]. Moreover, DL-based methods require large volumes of high-quality labeled training data [Wang2021, Nakip2024], which are difficult to obtain in rapidly changing IoT environments [Tong2024], and suffer from high computational complexity and limited representation capabilities for encrypted traffic [Wei2022, Sadeghzadeh2021]. These shortcomings collectively result in low deployment efficiency in the resource-constrained environments typical of many IoT applications [Liang2022]. In addition, while some works have employed self-supervised learning methods such as masked autoencoders for traffic classification to reduce labeled data requirements [Xu2024a], these approaches typically focus on reconstruction-based learning objectives and fail to leverage the powerful representation learning capabilities of generative models [De2022, Xie2025], which could potentially overcome many of these limitations through more sophisticated representation learning and reduced dependence on labeled data [Xiang2023].
Motivated by the limitations above, we seek a framework that integrates the advantages of self-supervised learning and generative artificial intelligence (AI) to address the scarcity of labeled data, the difficulty of extracting meaningful features from encrypted traffic, and deployment constraints in resource-limited IoT environments. However, implementing such a framework presents several significant technical challenges. First, extracting discriminative features from network traffic visual representations requires overcoming the noisy characteristics of network data, particularly in encrypted traffic scenarios where patterns are obscured [Abbasi2021, Li2024b]. Second, achieving optimal feature selection that balances classification accuracy with computational efficiency necessitates navigating a complex, high-dimensional search space that traditional optimization methods struggle to explore effectively [Song2024]. Finally, ensuring the adaptability of the framework across diverse IoT environments with varying computational resources and traffic patterns demands solutions that can maintain performance under resource constraints while adapting to shifting data distributions [Azab2024]. Conventional approaches fail to address these interrelated challenges comprehensively, often sacrificing either performance or efficiency and lacking the adaptability required for heterogeneous IoT deployments.
Accordingly, we propose DMLITE, i.e., diffusion model and large language model (LLM) integrated traffic embedding, which is a novel solution that directly addresses these challenges through an integrated approach combining diffusion models and LLMs. The main contributions of this paper are summarized as follows:
• Generative AI-Powered Traffic Detection Architecture: We design and implement DMLITE, a novel framework that leverages the complementary strengths of diffusion models and LLMs for IoT traffic detection. This architecture transforms raw network traffic into visual representations and employs the denoising diffusion probabilistic model (DDPM) to extract discriminative features even from encrypted traffic. To the best of our knowledge, this is the first work to integrate diffusion models and LLM-guided optimization for network traffic classification, establishing a new paradigm that moves beyond traditional discriminative approaches to network security.
• Diffusion-based Multi-level Feature Extraction: We develop an innovative self-supervised feature extraction approach using denoising diffusion models that captures both fine-grained and abstract traffic patterns through multi-level feature fusion. Our method combines contrastive learning with representative sample selection to enable efficient fine-tuning on minimal labeled data. This approach overcomes the limitations of traditional feature extraction techniques by effectively modeling the complex distribution of network traffic data and extracting more discriminative representations, particularly valuable for encrypted traffic where subtle patterns determine classification accuracy.
• LLM-guided Adaptive Feature Selection: We introduce a novel optimization framework that employs the DeepSeek LLM to dynamically tune particle swarm optimization (PSO) parameters for intelligent feature selection. Our dual objective function minimizes both the maximum classification error and the variance across different data distributions, ensuring robust performance across diverse deployment scenarios. This approach significantly reduces computational requirements while maintaining high classification accuracy, making the system viable for resource-constrained IoT environments where existing methods often fail to balance performance with efficiency.
• Comprehensive Performance Evaluation and Analysis: Through extensive experiments on multiple real-world IoT traffic datasets, we demonstrate that DMLITE achieves significant improvements over the best baseline model, with an average increase of 3.7% in classification accuracy. Moreover, ablation results further reveal important insights about the effectiveness of generative models for encrypted traffic analysis.
The remainder of this paper is organized as follows. Section II reviews related work in network traffic detection, diffusion models, and LLMs. Section III presents the detailed architecture and components of our DMLITE framework. Section IV presents and analyzes the experimental results. Finally, Section V concludes the paper.
II Related Work
In this section, we present a comprehensive review of existing research related to network traffic detection, focusing on the evolution from traditional methods to advanced DL and generative approaches. We categorize the literature into four main areas, which are traditional and ML-based traffic detection, DL approaches for network traffic analysis, diffusion models for representation learning, and LLMs for optimization tasks.
II-A Traditional and ML-based Network Traffic Detection
Traditional network traffic detection methods have evolved significantly over the past decades, from simple rule-based systems to sophisticated ML approaches [Nguyen2008, Finsterbusch2014]. Initially, the port-based classification approach was used for traffic analysis, where application recognition depended on the standardized port mappings established by the Internet assigned numbers authority (IANA) [Cotton2011]. For example, the authors in [Schneider1996] proposed a TCP/IP traffic classification method based on port number correlation with applications. Similarly, the authors in [Yoon2009] introduced an application traffic classification method using fixed IP-port information automatically collected from application behavior analysis, which enables fast and accurate real-time traffic classification through simple packet header matching. Moreover, the authors in [Moore2005] and [Madhukar2006] evaluated port-based network application classification approaches. For further improving the classification accuracy of UDP traffic, the authors in [Zhang2014] investigated a component-based method, where connected half-tuples are grouped into subgraphs and classified according to the most frequently used port numbers within each group. Although port-based traffic classification methods have advantages such as faster identification speed and lower computational resources [Doroud2018], they have become increasingly inadequate for modern network environments due to protocol camouflage and dynamic port utilization [Donato2014, Liu2021]. Specifically, some applications exploit legitimate port numbers to transmit unauthorized traffic data, such as malware concealed within HTTP streams. In addition, several applications can eschew standardized port assignments in favor of randomly selected or ephemeral ports for service delivery, as commonly observed in Voice over Internet Protocol implementations [Azab2012].
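To make the port-based idea concrete, the following minimal sketch maps a flow to an application label by table lookup. The table is a small excerpt of IANA-registered ports, and the function name `classify_by_port` is a hypothetical helper introduced here for illustration; it is not taken from any of the cited works.

```python
# Small excerpt of IANA-registered service ports (illustrative, not exhaustive).
WELL_KNOWN_PORTS = {
    22: "ssh",
    53: "dns",
    80: "http",
    443: "https",
    1883: "mqtt",
}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return the label of the first port found in the table, else 'unknown'."""
    # Check the destination port first, since servers usually listen on the
    # registered port while clients use ephemeral source ports.
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get(port)
        if label is not None:
            return label
    return "unknown"


if __name__ == "__main__":
    print(classify_by_port(51514, 443))    # registered destination port
    print(classify_by_port(51514, 40000))  # two ephemeral ports
```

The second call illustrates the weakness discussed above: a service running on a random or ephemeral port cannot be labeled, and a malicious flow that reuses port 443 would be labeled as ordinary HTTPS.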
Subsequently, DPI, also known as signature-based detection, was developed for traffic classification [Dainotti2012]. This type of method compares packets against predefined signature databases, thereby enabling accurate traffic classification through payload examination rather than relying on potentially misleading port associations [Azab2024]. For instance, the authors in [Sen2004] introduced a novel approach for peer-to-peer traffic identification by using application-layer signatures, which achieves accurate detection without relying on port numbers. Based on this, the authors in [Aceto2010] proposed PortLoad, a hybrid traffic classification approach that combines the efficiency and reduced privacy invasiveness of port-based methods with the accuracy of DPI techniques, achieving a better balance between classification performance and computational overhead. To achieve faster DPI-based traffic monitoring, the authors in [AlHisnawi2016] introduced a quotient filter-based DPI approach for signature-based packet payload matching. Furthermore, the authors in [Deri2014] proposed an open-source high-speed DPI library and verified its protocol detection accuracy and efficiency in several monitoring projects. To tackle the growing challenge of encrypted communications, the authors in [Sherry2015] designed a system that performs DPI operations over encrypted traffic without compromising data privacy by using specialized cryptographic techniques. However, these approaches rely heavily on static rules and signatures, which makes them inadequate for detecting zero-day attacks and emerging threats in the dynamic IoT ecosystem. Moreover, they become inoperative when laws and privacy policies restrict payload access or when applications employ obfuscation and encapsulation strategies [Pacheco2019].
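As a rough illustration of signature matching, the sketch below checks the first payload bytes against a handful of well-known protocol prefixes. The signature list and the helper name `classify_by_signature` are assumptions made for this example; production DPI engines rely on far larger rule sets and more efficient matching structures.

```python
# Illustrative payload prefixes; a real signature database is much larger.
SIGNATURES = [
    ("bittorrent", b"\x13BitTorrent protocol"),  # BitTorrent handshake prefix
    ("http", b"GET "),                           # HTTP request methods
    ("http", b"POST "),
    ("http", b"HTTP/1."),                        # start of an HTTP/1.x response
    ("tls", b"\x16\x03"),                        # TLS handshake record, major version 3
    ("ssh", b"SSH-"),                            # SSH identification string
]


def classify_by_signature(payload: bytes) -> str:
    """Return the label of the first signature that prefixes the payload."""
    for label, prefix in SIGNATURES:
        if payload.startswith(prefix):
            return label
    return "unknown"


if __name__ == "__main__":
    print(classify_by_signature(b"GET /index.html HTTP/1.1\r\n"))
    print(classify_by_signature(b"\x16\x03\x01\x02\x00\x01"))
```

Note that for a TLS flow the matcher can only report that the traffic is TLS; the application carried inside the encrypted records stays invisible, which is the limitation that motivates learning-based methods.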
To overcome these limitations, researchers have applied various ML techniques to network traffic classification [Nguyen2012]. The authors in [Afuwape2021] evaluated ensemble ML classifiers, including random forest and gradient boosting, for virtual private network (VPN) and non-VPN traffic classification, and demonstrated that ensemble methods significantly outperform single classifiers like k-nearest neighbors (KNN). Likewise, the authors in [Kumar2022] conducted a comprehensive experimental analysis of various ML algorithms for IoT network traffic classification, comparing them in terms of accuracy, speed, and training time. Recent work such as [Das2022] implemented a detection framework that combines ML approaches with feature selection for network intrusion detection, and the experimental results demonstrated that carefully selected features can significantly improve detection accuracy. Moreover, the author in [Dong2021] designed a cost-sensitive multi-class support vector machine method with active learning for network traffic classification, tackling the class imbalance problem by dynamically assigning weights to applications. Additionally, ML-based approaches for encrypted traffic classification have gained traction. The authors in [Zaki2022] proposed a granular multi-label classification framework that uses classifier chains to classify encrypted network traffic at three levels of granularity. Similarly, the authors in [Elmaghraby2024] developed three ML approaches combining neural networks and bidirectional long short-term memory (LSTM) with ensemble voting techniques for encrypted network traffic classification, achieving up to 96.8% accuracy in classifying applications such as browsing, VoIP, file transfer, and video streaming without inspecting packet contents directly.
Despite these advances, ML-based approaches are hampered by the insufficient volume of labeled network traffic samples [Shahraki2022]. Furthermore, most of them still rely heavily on domain expertise for feature engineering and struggle to adapt to the ever-changing nature of network traffic patterns [Xu2024].
II-B DL Approaches for Network Traffic Analysis
The limitations of traditional ML approaches have led researchers to explore DL techniques for network traffic analysis, which can automatically learn feature representations from raw data [Aceto2018, Kalwar2024]. The authors in [Wang2017] pioneered the application of convolutional neural networks (CNNs) for malware traffic classification, and this method utilized representation learning that treats raw traffic data as images, thus achieving practical accuracy requirements without requiring hand-designed features. Extending this, the authors in [Lotfollahi2020] designed Deep Packet, a DL-based framework using CNN and stacked autoencoders that integrates feature extraction and classification for both application types and traffic categories. Similarly, the authors in [Lan2022] presented a cascaded neural framework that combines a one-dimensional CNN and a bidirectional long short-term memory with self-attention to distinguish various darknet applications and protocols. For traffic analysis in IoT security, the authors in [Zhang2024] developed an automatic and efficient DL method specifically designed for resource-constrained IoT environments that maintains high detection accuracy while reducing computational overhead. Likewise, the authors in [Li2025] presented a hypergraph convolution-based framework for detecting malicious encrypted traffic in non-terrestrial IoT networks spanning satellites, unmanned aerial vehicles, and base stations. Recently, the authors in [Xiao2025] introduced RBLJAN that employs byte-label joint attention mechanisms and adversarial training to capture long-range dependencies in traffic patterns, thus enabling efficient encrypted network traffic classification. However, these approaches typically require large volumes of labeled training data, which is often scarce in rapidly evolving IoT environments, and their complex architectures demand substantial computational resources that may exceed the capabilities of many IoT devices [Xu2024a].
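The image-style representation mentioned above can be sketched in a few lines. The example below truncates or zero-pads the leading bytes of a flow and reshapes them into a square grayscale array. The 28×28 size and the helper name `bytes_to_image` are assumptions chosen for illustration, not details confirmed by the text.

```python
import numpy as np


def bytes_to_image(raw: bytes, side: int = 28) -> np.ndarray:
    """Map the leading bytes of a flow to a (side, side) uint8 grayscale image.

    Flows longer than side*side bytes are truncated; shorter ones are
    zero-padded, so every flow yields an array of the same shape.
    """
    n = side * side
    buf = np.frombuffer(raw[:n], dtype=np.uint8)
    padded = np.zeros(n, dtype=np.uint8)
    padded[: buf.size] = buf  # copy the available bytes, leave the rest at zero
    return padded.reshape(side, side)


if __name__ == "__main__":
    flow = bytes(range(256)) * 4  # 1024 synthetic bytes standing in for a flow
    image = bytes_to_image(flow)
    print(image.shape, image.dtype)
```

Each byte becomes one pixel intensity in the range 0 to 255, so the resulting arrays can be passed to standard vision models without hand-designed features.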
To address the challenge of labeled data scarcity, several researchers have explored semi-supervised and self-supervised learning approaches [AbdelBasset2021, Horowicz2024]. The authors in [Dong2021a] introduced a novel semi-supervised deep reinforcement learning method that adaptively optimizes detection strategies through environmental feedback. Likewise, the authors in [Ning2022] and [Zhao2022] designed the semi-supervised learning-based ConvLaddernet and flow transformer frameworks, respectively, aimed at achieving improved classification performance when only a small quantity of annotated data is available. Recently, the authors in [Wang2024] developed a federated semi-supervised learning framework using autoencoder-based models for privacy-preserving network traffic analysis in smart home environments. In addition, the authors in [Lin2024] proposed a contrastive pre-training approach combined with semi-supervised learning to achieve robust traffic representation and mitigate classification bias. For reducing labeling dependency in traffic classification, the authors in [Zhao2023] and [Zhao2024] leveraged a masked autoencoder-based traffic transformer with multi-level flow representation to capture both local and global traffic patterns. Similarly, the authors in [Xu2024a] employed a masked autoencoder architecture and transformer-based backbone, efficiently extracting features from non-redundant traffic data. Moreover, the authors in [Xiao2024] introduced a federated self-supervised generative adversarial network-based approach that enables the recognition of traffic originating from unidentified services and produces artificial samples mimicking the distributional properties of unknown traffic patterns. In addition, the authors in [Zheng2025] developed a self-supervised learning-based multi-feature fusion framework that introduces random subset selection for data augmentation and a novel fusion mechanism to extract temporal features from traffic tables. 
Nevertheless, even these advanced DL approaches struggle with the dual challenges of computational efficiency and representational capacity for encrypted traffic. Specifically, most of them fail to capture the subtle patterns that distinguish different types of encrypted traffic while maintaining the computational efficiency required for deployment in resource-constrained IoT environments.
II-C Diffusion Model Applications
Diffusion models have recently emerged as powerful generative approaches that can be combined with other learning-based algorithms for control and optimization [Sun2025b, Fang2024]. For example, the authors in [Zhang2024a] explored diffusion model-based policy networks in deep reinforcement learning frameworks for energy management optimization. Similarly, the authors in [Zhao2024a] integrated discrete diffusion models with hierarchical multi-agent deep reinforcement learning for enhancing goal-conditioned navigation and sensing. Furthermore, the authors in [Zheng2025a] demonstrated how diffusion model-based approaches can enhance traditional optimization methods to achieve superior performance in complex network resource allocation and coordination problems. Likewise, the authors in [Liu2025a] explored attention-enhanced diffusion models for edge service optimization. Recent works such as [Zhang2025a] explored diffusion-enhanced reinforcement learning methods that employ the generative potential of diffusion models while integrating the powerful decision-formulation mechanisms of reinforcement learning. Additionally, the authors in [Zhang2025d] investigated the application of generative diffusion models combined with deep learning approaches for multi-objective optimization problems. Likewise, the authors in [Wang2025b] leveraged diffusion models as prediction strategies and presented a dynamic multi-objective evolutionary approach.
Moreover, some works have investigated applications of diffusion models within network security [Sun2025d]. For example, the authors in [Liang2025] explored the applications of diffusion models as network optimizers and demonstrated that they can achieve flexible and efficient network performance optimization. Recent works like [Wang2025a] and [Yang2025] designed diffusion model-based approaches for Wi-Fi data processing to defend against membership inference attacks and to enhance indoor localization accuracy, respectively. Furthermore, the authors in [Zhang2025] introduced a generative diffusion model-enabled approach for secure beamforming optimization in intelligent reflecting surface-assisted IoT communications. Likewise, the authors in [Liang2025a] developed an improved twin delayed deep deterministic policy gradient algorithm based on the diffusion model to enhance unmanned aerial vehicle-enabled secure data collection in IoT networks. In addition, the authors in [Wang2025] developed a novel secure sensing system based on diffusion models that leverages a discrete conditional diffusion model for graph generation to optimize link activation and continuous conditional diffusion models to generate safeguarding signals, protecting users against unauthorized monitoring. To tackle data scarcity challenges in IoT malware detection, the authors in [Camerota2024] leveraged the DDPM for synthetic network traffic image generation.
Several studies have proposed exploiting the unique ability of diffusion models to learn complex data distributions and extract meaningful representations [Yun2024]. The authors in [Chen2024] systematically deconstructed modern diffusion models to identify their essential components for effective representation learning. Furthermore, the authors in [Xiang2023] investigated the inherent representation learning capabilities of denoising diffusion autoencoders (DDAEs) and demonstrated that these models can obtain high-quality discriminative representations within their intermediate layers through generative pre-training alone. Based on this, the authors in [Xiang2025] further proposed DDAE++, which introduces a self-conditioning mechanism that leverages the rich semantic information within diffusion networks to simultaneously improve both generative quality and discriminative performance. The authors in [Hao2024] leveraged pre-trained diffusion models for unsupervised concept extraction. Similarly, the authors in [Bandara2025] and [Sadia2024] explored the use of diffusion models as feature extractors in sensing change detection and medical imaging, respectively. To learn complex unknown noise distributions, the authors in [Zhao2024b] applied diffusion models to signal detection in near-field communication systems and presented a maximum likelihood estimation diffusion detector.
Despite these advances, diffusion models have seen limited application in network security contexts, with most research focusing on their generative capabilities rather than their potential for discriminative feature extraction in complex domains like encrypted network traffic. Moreover, current applications of diffusion models typically utilize only single-layer representations, failing to leverage the rich multi-level features formed during the diffusion process that could potentially capture both fine-grained and abstract patterns in network traffic data.
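For readers less familiar with DDPMs, the forward noising step that a denoising network learns to invert can be sketched as follows. In the standard formulation, a clean sample x_0 is corrupted to x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε with Gaussian noise ε. The linear schedule values and the helper names `make_alpha_bar` and `q_sample` are assumptions of this sketch rather than settings reported in this paper.

```python
import numpy as np


def make_alpha_bar(num_steps: int = 1000,
                   beta_start: float = 1e-4,
                   beta_end: float = 0.02) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(x0: np.ndarray, t: int, alpha_bar: np.ndarray,
             rng: np.random.Generator):
    """Draw x_t from q(x_t | x_0) and return it with the noise that was added."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = make_alpha_bar()
    x0 = rng.uniform(-1.0, 1.0, size=(28, 28))  # stand-in for a traffic image
    for t in (0, 499, 999):
        xt, _ = q_sample(x0, t, alpha_bar, rng)
        print(t, float(np.corrcoef(x0.ravel(), xt.ravel())[0, 1]))
```

A denoiser trained to predict `eps` from `xt` at many noise levels must encode structure at several scales, which is the intuition behind reading features from more than one intermediate layer or timestep.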
| Reference | Technique |
|---|---|
| [Afuwape2021] | Ensemble classifier |
| [Dong2021] | Cost-sensitive SVM |
| [Das2022] | Ensemble feature selection |
| [Zaki2022] | Classifier chain |
| [Elmaghraby2024] | LSTM and ensemble voting |
| [Wang2017] | CNN |
| [Lotfollahi2020] | CNN and stacked autoencoders |
| [Zhang2024] | NASP |
| [Li2025] | Hypergraph neural networks |
| [Xiao2025] | CNN with joint attention, GAN |
| [Ning2022] | CNN |
| [Wang2024] | Federated learning |
| [Lin2024] | Transformer |
| [Zhao2023] | Masked autoencoder |
| [Xu2024a] | Masked autoencoder |
| [Xiao2024] | GAN and federated learning |
| [Lan2022] | CNN and LSTM |
| [Ginige2024] | GPT-2 |
| [zhou2024enhancing] | GPT-3.5-turbo |
| [chen2024merlot] | GPT-2-base |
| This Work | DDPM and DeepSeek |
II-D LLM Applications in Optimization and Parameter Tuning
LLMs have demonstrated remarkable capabilities across various domains, including optimization and parameter tuning tasks, due to their powerful reasoning capabilities [Li2024, OBETrans]. For example, the authors in [Zhang2024b] presented a model-centric optimization approach for democratizing LLM deployment on mobile edge networks. In addition, the authors in [Jiang2024] proposed a framework where LLMs guide the optimization process for complex engineering problems by iteratively refining solution strategies based on performance feedback. Building on this, the authors in [Du2025] introduced a reinforcement learning with LLMs interaction framework that leverages LLM-empowered generative agents to provide real-time subjective quality of experience feedback for optimizing resource allocation in distributed diffusion model services. Similarly, the authors in [Yan2025] deployed a hybrid approach based on LLM for simultaneously optimizing vehicle-to-infrastructure communication and autonomous driving policies, where LLMs optimize driving decisions through experience learning while collaborating with the double deep Q-learning algorithm for communication optimization. Furthermore, the authors in [Sun2025a] introduced an inventive paradigm exploiting LLM-enabled graphs for dynamic network optimization, demonstrating how LLMs can effectively optimize UAV trajectory planning and communication resource allocation.
Moreover, some works have focused on utilizing LLMs for parameter adjustment [Custode2024, Kochnev2025]. For example, the authors in [Zhang2023] and [Liu2025b] explored the use of LLMs for hyperparameter optimization in ML models, showing that language models can effectively reason about parameter relationships and suggest promising configurations based on previous performance data. Building on this, the authors in [Li2025a] utilized DeepSeek to facilitate flexible hyperparameter adjustment in deep reinforcement learning, evidencing the ability of LLMs to strengthen algorithmic outcomes through contextually-driven parameter adaptation. Similarly, the authors in [Miyake2023] proposed an LLM-enabled parameter tuning framework for robotic systems, where LLMs interpret user preferences and dynamically adjust operational parameters to optimize human-robot interaction performance in physical care tasks. Recent work like [Chen2025] introduced an LLM-based hyperparameter adaptation framework for evolutionary reinforcement learning, where the Qwen model dynamically adjusts learning rate and discount factor based on fitness. Likewise, the authors in [Mahammadli2024] proposed a novel LLM-based hyperparameter optimization framework aimed at improving parameter space exploitation. Additionally, the authors in [Li2024a] proposed a novel system aimed at employing LLMs for automatically designing dense reward functions within reinforcement learning environments, significantly improving agent learning efficiency in complex tasks through iterative reward function refinement.
In evolutionary algorithms, several researchers have investigated strategies for parameter adaptation [Phan2020, Melin2013]. For instance, the authors in [Tanabe2013] designed a success-history based parameter adaptation mechanism for the differential evolution algorithm that dynamically adjusts control parameters by leveraging historical information about successful parameters. Building on this, the authors in [Viktorin2019] and [Ghosh2022] further enhanced this approach by introducing distance-based and nearest spatial neighborhood-based improvements, respectively. Likewise, the authors in [Zhou2022] developed a parameter adaptation-enhanced ant colony optimization method with a dynamic hybrid mechanism that analyzes historical search information to adjust key parameters during optimization. Moreover, the authors in [Sun2021] employed the policy gradient method to acquire parameter control strategies from optimization experiences, which enables algorithms to autonomously adjust their behavior based on the characteristics of the problem being solved. Recent works like [Chen2024a] integrated multiple information sources to evaluate optimization progress and adjust algorithm parameters accordingly. Nevertheless, these approaches rely on predefined adaptation rules or heuristics rather than leveraging the reasoning capabilities and contextual understanding offered by LLMs, thus limiting their flexibility and generalization across diverse problem domains. Furthermore, existing optimization approaches typically focus on single-objective functions, neglecting the importance of performance robustness across different data distributions, which is a critical consideration for IoT traffic detection systems that must operate effectively across heterogeneous deployment environments.
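The contrast between rule-based adaptation and a robustness-aware objective can be illustrated with a minimal PSO sketch. The update below is the standard velocity and position rule, and `rule_based_params` implements a classic linearly decaying inertia weight as an example of a predefined adaptation rule. The function names, the weight `lam`, and the additive form of `robust_fitness` are assumptions of this sketch, not the formulation used by any cited work.

```python
import numpy as np


def robust_fitness(errors_per_split, lam: float = 1.0) -> float:
    """Worst-case error plus a variance penalty across data splits.

    The additive form and the weight lam are illustrative assumptions.
    """
    e = np.asarray(errors_per_split, dtype=float)
    return float(e.max() + lam * e.var())


def rule_based_params(iteration: int, max_iter: int) -> dict:
    """Predefined rule: inertia weight decays linearly from 0.9 to 0.4."""
    w = 0.9 - 0.5 * iteration / max(1, max_iter - 1)
    return {"w": w, "c1": 2.0, "c2": 2.0}


def pso_step(pos, vel, pbest, gbest, params, rng):
    """One standard PSO update; params holds w (inertia) and c1, c2."""
    w, c1, c2 = params["w"], params["c1"], params["c2"]
    r1 = rng.random(pos.shape)  # per-dimension random factors in [0, 1)
    r2 = rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + vel, vel


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.uniform(-1.0, 1.0, size=(5, 3))  # 5 particles, 3 dimensions
    vel = np.zeros_like(pos)
    pbest, gbest = pos.copy(), pos[0].copy()
    for it in range(10):
        pos, vel = pso_step(pos, vel, pbest, gbest,
                            rule_based_params(it, 10), rng)
    print(robust_fitness([0.05, 0.08, 0.20]))
```

Any function that returns the same dictionary of `w`, `c1`, and `c2` could replace `rule_based_params`, including one that queries an external adviser using the search history; that substitution is the design space the works above explore with fixed heuristics.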
Different from these approaches, the proposed DMLITE framework aims to combine the representational power of diffusion models with the reasoning capabilities of LLMs to create a comprehensive solution for IoT traffic detection that overcomes the limitations identified in the literature. In the following, we present the details of the DMLITE framework.
III DMLITE Framework: Diffusion Model and LLM Integrated Traffic Embedding
In this section, we introduce the DMLITE framework for IoT network traffic classification. We begin with an overview of the architecture, then describe each component in detail, and conclude with a computational complexity analysis.
III-A Overview of DMLITE Framework
As mentioned above, network traffic classification faces unique challenges, including rapidly changing communication patterns, limited labeled data for emerging threats, and resource constraints in edge computing environments [Zhang2015, Xiong2014]. Traditional classification methods often struggle with these challenges due to their reliance on manual feature engineering or their inability to adapt to new traffic patterns without extensive retraining [Zhang2013]. To address these challenges, we propose an integrated approach that leverages both diffusion models and LLMs to extract and optimize discriminative features from network traffic.
Specifically, the designed DMLITE framework addresses these limitations through a three-stage pipeline that combines the representation power of diffusion models with the optimization capabilities of LLMs, and it consists of the following key components:
• Traffic Visual Preprocessing: We convert raw network traffic into visual representations suitable for DL models. This transformation enables the application of powerful computer vision techniques to the domain of network traffic analysis.
• Diffusion-based Feature Extraction: We leverage DDPMs to extract discriminative traffic features. This approach capitalizes on the ability of diffusion models to learn complex data distributions and capture multi-scale features.
• LLM-guided Feature Selection Optimization: We employ LLMs to optimize the feature selection process, reducing dimensionality while maintaining classification accuracy across diverse network environments.
The DMLITE framework integrates these components to enhance classification accuracy, generalization capability, and computational efficiency in IoT traffic classification tasks. In the following subsections, we detail each component and explain how they collectively address the challenges of IoT traffic classification.
III-B Traffic Visual Preprocessing
IoT network traffic presents unique challenges for classification due to its heterogeneous nature, protocol diversity, and evolving patterns [Booij2022]. Traditional feature extraction methods often rely on domain expertise and manual feature engineering, which can be time-consuming and may not generalize well across different IoT environments [Ren2021, Kumar2022]. Visual representations offer a promising alternative, as they can preserve structural information while enabling the application of advanced DL techniques [Wang2017].
As such, we transform raw network traffic into visual representations, enabling the application of powerful computer vision techniques to the traffic classification domain [Xu2024, Ding2022]. Specifically, as shown in Fig. 1, our preprocessing module employs the USTC-TK2016 toolkit [Wang2017] to convert raw network traffic data in PCAP format into standardized image representations. In particular, the transformation process consists of four critical steps:
1. Traffic Splitting: We divide the continuous network traffic capture into discrete traffic units based on flow granularity. Each flow is defined as a sequence of packets sharing the same 5-tuple (source IP, destination IP, source port, destination port, protocol), which is given by
(1)
2. Traffic Sanitization: To eliminate network-specific bias and ensure privacy, we perform traffic anonymization by randomizing MAC and IP addresses. Additionally, empty and duplicate samples are removed to prevent model bias, i.e.,
(2)
3. Uniform Transformation: Each traffic unit is trimmed or padded to a fixed length (784 bytes in our implementation) to ensure consistency in processing. This standardization is essential for the subsequent visual transformation, which is given by
(3)
4. Image Generation: The byte sequence is converted into a grayscale image, where each pixel corresponds to a byte value, i.e.,
(4)
Note that this preprocessing approach transforms network traffic into a visual domain that preserves the inherent patterns and byte-level relationships while enabling the application of DL techniques for subsequent feature extraction and classification. The visual representation captures both local byte-level patterns and global flow-level structures, providing a rich foundation for our diffusion-based feature extraction process.
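The four preprocessing steps can be illustrated with a minimal, standard-library sketch. It is not the USTC-TK2016 toolkit itself: the packet representation (a 5-tuple paired with payload bytes), the function names, zero-padding, and the 28 x 28 layout (784 = 28 x 28) are assumptions made for illustration, and address randomization is omitted.

```python
from collections import defaultdict

FLOW_BYTES = 784  # fixed length per traffic unit (assumed layout: 28 x 28)
SIDE = 28

def split_flows(packets):
    """Group packets by 5-tuple; each packet is a hypothetical (five_tuple, payload) pair."""
    flows = defaultdict(bytearray)
    for five_tuple, payload in packets:
        flows[five_tuple] += payload
    return {key: bytes(data) for key, data in flows.items()}

def sanitize(flows):
    """Drop empty and duplicate flows (MAC/IP randomization is not shown here)."""
    seen, kept = set(), []
    for data in flows.values():
        if data and data not in seen:
            seen.add(data)
            kept.append(data)
    return kept

def to_fixed_length(data, n=FLOW_BYTES):
    """Trim to n bytes, or pad with 0x00 up to n bytes (zero-padding is an assumption)."""
    return data[:n] + b"\x00" * max(0, n - len(data))

def to_image(data, side=SIDE):
    """Arrange the bytes row by row into a side x side grid of gray levels (0-255)."""
    return [list(data[r * side:(r + 1) * side]) for r in range(side)]

# Toy capture: two packets of one flow plus one empty flow.
packets = [
    (("10.0.0.1", "10.0.0.2", 1234, 80, "TCP"), b"\x16\x03\x01" * 10),
    (("10.0.0.1", "10.0.0.2", 1234, 80, "TCP"), b"\xff" * 900),
    (("10.0.0.3", "10.0.0.2", 5555, 443, "TCP"), b""),
]
images = [to_image(to_fixed_length(flow)) for flow in sanitize(split_flows(packets))]
```

In this toy capture the empty flow is discarded and the remaining 930-byte flow is trimmed to 784 bytes, so a single 28 x 28 image is produced.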
III-C Diffusion-based Feature Extraction
In general, IoT traffic classification faces challenges related to feature representation, such as the presence of noise in raw traffic data, and the lack of discriminative features for distinguishing between similar traffic types. Traditional feature extraction methods often rely on handcrafted features or general-purpose neural networks that may not effectively capture the unique characteristics of network traffic [Zhang2020].
In this case, diffusion models have demonstrated remarkable capability in learning complex data distributions and extracting meaningful representations in various domains such as computer vision and natural language processing [Croitoru2023, Liu2024a]. Moreover, diffusion models have proven effective in self-supervised representation learning scenarios [Chen2024], which makes them particularly suitable for limited labeled network traffic data in IoT environments. In particular, diffusion models demonstrate superior efficiency in modeling complex training distributions compared to other generative approaches, and have been demonstrated to extract high-quality discriminative features from their intermediate layers during the denoising process [Xiang2023].
Thus, we introduce a novel application of DDPMs [Ho2020] for network traffic feature extraction, leveraging their ability to progressively denoise data and capture multi-scale features that are particularly valuable for distinguishing subtle differences in traffic patterns. Fig. 2 illustrates the framework of our diffusion-based feature extraction approach. This approach consists of four key stages, which are DDPM model training, optimal feature layer identification, representative sample selection and fine-tuning, and multi-level feature fusion. The details are as follows.
III-C1 DDPM Model Training
We first train a DDPM based on a U-Net architecture to learn the underlying distribution of network traffic images. The diffusion process follows a forward process that gradually adds Gaussian noise to the data and a reverse process that learns to denoise [He2025]. In particular, the forward process is given by [Ho2020]
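$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\right) \tag{5}$$
Moreover, the reverse process is given by
$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\right) \tag{6}$$
where $\beta_t$ is the noise schedule, and $\boldsymbol{\mu}_\theta$ and $\boldsymbol{\Sigma}_\theta$ are learned by the neural network.
Following this, the model is trained using the variational lower bound objective as follows:
$$\mathcal{L} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right\|^2\right] \tag{7}$$
where $\boldsymbol{\epsilon}$ is the random noise and $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ is the noise predicted by the model.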
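To make the forward process concrete, the following sketch builds a cosine noise schedule and draws a noisy sample in closed form. It uses the standard DDPM identity that a sample at step t can be drawn directly from the clean input as sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The offset s = 0.008, the clipping value 0.999, and all function names are assumptions for illustration, and the noise predictor (the U-Net) is deliberately left out.

```python
import math
import random

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal level alpha_bar[t], t = 0..T, under a cosine schedule."""
    f = [math.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2 for t in range(T + 1)]
    return [v / f[0] for v in f]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Per-step variances beta_t = 1 - alpha_bar[t] / alpha_bar[t-1], clipped for stability."""
    return [min(1 - alpha_bar[t] / alpha_bar[t - 1], max_beta)
            for t in range(1, len(alpha_bar))]

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t given x_0 in one shot; also return the noise that was added."""
    a = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between the true noise and a model's predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred)) / len(eps)

T = 500                                             # diffusion timesteps
alpha_bar = cosine_alpha_bar(T)
rng = random.Random(0)
x0 = [2 * b / 255 - 1 for b in (0, 22, 255, 128)]   # byte values rescaled to [-1, 1]
x_t, eps = q_sample(x0, t=50, alpha_bar=alpha_bar, rng=rng)
```

During training, the network would receive `x_t` and `t` and be optimized so that its output minimizes `noise_prediction_loss` against `eps`.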
III-C2 Optimal Feature Layer Identification
Then, we evaluate different layers of the trained denoising network to identify the optimal feature extraction layer that provides the most discriminative representations. This process is important because different layers in the network capture features at varying levels of abstraction. For each layer , we compute a classification performance metric using a simple classifier as follows:
| (8) |
where is the optimal layer.
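The layer search can be sketched as a simple loop: score the features of every candidate layer with a lightweight classifier and keep the layer with the highest score. The sketch below assumes a leave-one-out 1-nearest-neighbour classifier as the scorer and hypothetical layer names; the number of neighbours and the evaluation protocol used in the experiments may differ.

```python
def knn1_accuracy(features, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier (squared Euclidean distance)."""
    correct = 0
    for i, fi in enumerate(features):
        best_j, best_d = None, float("inf")
        for j, fj in enumerate(features):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(fi, fj))
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(features)

def select_optimal_layer(layer_features, labels, score=knn1_accuracy):
    """Return (best_layer, scores), where scores maps each layer name to its accuracy."""
    scores = {layer: score(feats, labels) for layer, feats in layer_features.items()}
    return max(scores, key=scores.get), scores

# Toy example: the "mid" layer separates the two classes, "down1" does not.
labels = [0, 0, 1, 1]
layer_features = {
    "down1": [[0.0], [1.0], [0.1], [1.1]],
    "mid": [[0.0], [0.1], [1.0], [1.1]],
}
best_layer, scores = select_optimal_layer(layer_features, labels)
```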
III-C3 Representative Sample Selection and Fine-tuning
In this process, we select a representative subset of the training data to fine-tune the optimal feature extraction layer by using K-means clustering [Shi2010]. This approach aims to reduce computational requirements while maintaining performance by focusing on the most informative samples, which can be given by
| (9) |
where represents the feature extraction at the optimal layer , is the training dataset, and is the selected subset.
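A possible realization of the selection step is sketched below with a plain K-means implementation. The text does not state how samples are drawn from each cluster, so the sketch assumes that the samples closest to each centroid are kept until roughly the target ratio is reached; the function names and the default ratio of 0.05 are illustrative.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]

    def nearest(p):
        return min(range(k), key=lambda c: sq_dist(p, centroids[c]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(p) for p in points]

def select_representatives(points, k, ratio=0.05, seed=0):
    """Indices of the samples closest to each centroid, about ratio * len(points) in total."""
    centroids, assign = kmeans(points, k, seed=seed)
    budget = max(k, round(ratio * len(points)))
    per_cluster = max(1, budget // k)
    chosen = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        idx.sort(key=lambda i: sq_dist(points[i], centroids[c]))
        chosen.extend(idx[:per_cluster])
    return sorted(chosen)

# Toy data: two well-separated groups of 20 one-dimensional-like points.
points = [[i * 0.01, 0.0] for i in range(20)] + [[10 + i * 0.01, 0.0] for i in range(20)]
chosen = select_representatives(points, k=2, ratio=0.1)
```

In practice `points` would hold the features produced at the optimal layer, and the returned indices would define the fine-tuning subset.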
Note that the fine-tuning process employs contrastive learning loss for enhanced feature discrimination. The contrastive learning loss based on InfoNCE is defined as follows [He2020]:
| (10) |
where is the normalized feature vector, is a temperature parameter controlling the sharpness of the distribution, and is the number of samples. This contrastive loss encourages the model to learn representations where samples from the same class are close to each other while samples from different classes are pushed apart in the feature space.
Following this, the optimization objective is given by
| (11) |
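The behavior of an InfoNCE-style loss can be checked on toy features. The sketch below is one common label-aware variant and is an assumption rather than the exact loss used here: every other sample with the same label is treated as a positive, all remaining samples form the denominator, and the temperature default of 0.1 is illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def info_nce(features, labels, tau=0.1):
    """Average of -log softmax similarity over all (anchor, same-label positive) pairs."""
    z = [normalize(f) for f in features]
    total, count = 0.0, 0
    for i, zi in enumerate(z):
        sims = {j: sum(a * b for a, b in zip(zi, zj)) / tau
                for j, zj in enumerate(z) if j != i}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        for j, s in sims.items():
            if labels[j] == labels[i]:
                total += log_denom - s
                count += 1
    return total / count if count else 0.0

labels = [0, 0, 1, 1]
clustered = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]   # classes well separated
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [0.01, 1.0]]       # classes interleaved
loss_clustered = info_nce(clustered, labels)
loss_mixed = info_nce(mixed, labels)
```

The loss is small when same-class samples are close in the normalized feature space and large when they are interleaved with other classes, which is the behavior the fine-tuning stage relies on.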
III-C4 Multi-level Feature Fusion
To capture features at different levels of abstraction, we concatenate features from the optimal layer with its adjacent layers (above and below) to form a comprehensive feature representation [Yang2003]. This multi-level fusion enhances the robustness of the extracted features by incorporating both fine-grained and abstract patterns, which is given by
| (12) |
where denotes the concatenation operation.
As can be seen, this diffusion-based feature extraction approach captures the intricate patterns in network traffic while providing robust representations that are less sensitive to noise and domain shifts. By leveraging the hierarchical nature of the U-Net architecture and the denoising capabilities of diffusion models, our method extracts features that effectively distinguish between different types of IoT traffic, even when the differences are subtle or when limited labeled data is available.
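The fusion step amounts to a per-sample concatenation. The following sketch assumes that each layer's feature map has already been pooled or flattened into a vector per sample, and that a neighbour is simply skipped when the optimal layer sits at either end of the network; the layer names are hypothetical.

```python
def fuse_adjacent_layers(layer_features, layer_order, optimal):
    """Concatenate, per sample, the features of the optimal layer and its existing neighbours."""
    i = layer_order.index(optimal)
    picked = [layer_order[j] for j in (i - 1, i, i + 1) if 0 <= j < len(layer_order)]
    n_samples = len(layer_features[optimal])
    return [sum((list(layer_features[name][s]) for name in picked), [])
            for s in range(n_samples)]

# Two samples, three layers with feature sizes 1, 2, and 1.
layer_features = {"down2": [[1], [2]], "mid": [[3, 4], [5, 6]], "up1": [[7], [8]]}
fused = fuse_adjacent_layers(layer_features, ["down2", "mid", "up1"], "mid")
```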
III-D LLM-guided Feature Selection Optimization
Due to the high dimensionality of extracted features, we aim to conduct feature selection to avoid the increased computational complexity and potential overfitting. Conventional feature selection methods often rely on fixed heuristics or require extensive manual tuning of hyperparameters [Diao2015], limiting their effectiveness across diverse IoT environments. However, the dynamic nature of IoT traffic patterns necessitates adaptive feature selection strategies that can account for varying data distributions across different network settings.
In this case, we integrate the LLMs into the feature selection process to develop a more intelligent and adaptive feature selection strategy. We select PSO as the foundation for feature selection due to several key advantages. Specifically, PSO demonstrates excellent convergence properties for binary optimization problems like feature selection [Zhang2014a]. Additionally, PSO offers superior computational efficiency with fewer parameters to adjust compared to genetic algorithms and other swarm intelligence algorithms [Li2025c]. Furthermore, the social learning mechanism of PSO balances global and local search through particle information exchange while maintaining population diversity throughout the optimization process, helping prevent premature convergence to suboptimal feature subsets [Abbasi2022]. By guiding the parameter optimization of the PSO algorithm, LLMs can help identify the most discriminative features while maintaining computational efficiency. Fig. 3 shows the proposed LLM-guided PSO-based feature selection algorithm. In the following, we first introduce the optimization objective during the feature selection process, then present the PSO parameter optimization, and finally detail the iterative parameter search and evaluation method.
III-D1 Dual Optimization Objective
To ensure both high performance and robustness across different data distributions, we define a dual optimization objective that aims to minimize both the maximum classification error rate across multiple data subsets and the variance of error rates between these subsets, which is given by
| (13) | ||||
where is the classification error rate on subset with feature selection parameters , and is a weighting factor balancing the two components. This dual objective encourages the selection of features that perform well across all data subsets, rather than features that excel on some subsets but perform poorly on others.
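As a numerical illustration, the sketch below scores a set of per-subset error rates by adding the worst-case error to a weighted variance term. The additive form, the use of the population variance, and the default weight of 0.5 are assumptions; the exact weighting in the objective may differ.

```python
def dual_objective(error_rates, lam=0.5):
    """Worst-case error plus lam times the population variance of the per-subset errors."""
    mean = sum(error_rates) / len(error_rates)
    variance = sum((e - mean) ** 2 for e in error_rates) / len(error_rates)
    return max(error_rates) + lam * variance

uniform = dual_objective([0.2, 0.2, 0.2, 0.2])   # same worst case, no spread
uneven = dual_objective([0.0, 0.2, 0.0, 0.2])    # same worst case, larger spread
```

With an identical worst-case error, the feature subset whose errors vary more across data subsets receives the higher (worse) score, which is the robustness preference described above.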
III-D2 PSO Parameter Optimization
Following this, we utilize the DeepSeek LLM to tune three critical parameters of the PSO algorithm, namely the inertia weight $w$, the cognitive learning factor $c_1$, and the social learning factor $c_2$, while optimizing the objective in Eq. (13). In general, the standard PSO update equations are given by [Abbasi2022]
$$v_i^{t+1} = w\, v_i^{t} + c_1 r_1 \left(p_i - x_i^{t}\right) + c_2 r_2 \left(g - x_i^{t}\right) \tag{14}$$
$$x_i^{t+1} = x_i^{t} + v_i^{t+1} \tag{15}$$
where $v_i^{t}$ and $x_i^{t}$ are the velocity and position of particle $i$ at iteration $t$, $p_i$ is the best position found by particle $i$, $g$ is the global best position, and $r_1$ and $r_2$ are random numbers in [0,1].
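A single PSO iteration can be written in a few lines. In the sketch below the inertia weight and the cognitive and social learning factors are named `w`, `c1`, and `c2`; the velocity clamp `v_max` and the thresholding of positions into a feature mask are common additions that are assumed here for illustration and are not taken from the text.

```python
import random

def pso_step(positions, velocities, pbest, gbest, w, c1, c2, rng, v_max=4.0):
    """Update all particles in place: new velocity from inertia, personal and global pull."""
    for i, (x, v) in enumerate(zip(positions, velocities)):
        for d in range(len(x)):
            r1, r2 = rng.random(), rng.random()
            v[d] = w * v[d] + c1 * r1 * (pbest[i][d] - x[d]) + c2 * r2 * (gbest[d] - x[d])
            v[d] = max(-v_max, min(v_max, v[d]))  # assumed velocity clamp
            x[d] += v[d]

def to_feature_mask(position, threshold=0.5):
    """Interpret a continuous position as a keep/drop decision per feature (assumed encoding)."""
    return [p > threshold for p in position]

rng = random.Random(0)
positions = [[0.0, 0.0]]
velocities = [[0.0, 0.0]]
pso_step(positions, velocities, pbest=[[0.0, 0.0]], gbest=[1.0, 1.0],
         w=0.0, c1=0.0, c2=1.0, rng=rng)
mask = to_feature_mask([0.2, 0.8])
```

With `w = 0` and `c1 = 0`, the particle moves a random fraction of the way toward the global best, which isolates the effect of the social term.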
The PSO convergence analysis demonstrates that parameter configuration critically determines exploration-exploitation balance [Clerc2002, Trelea2003]. Different from static parameter configurations or linear parameter adjustment methods, we use the LLMs to optimize these parameters based on historical performance data and domain knowledge, which is given by
| (16) |
where represents historical performance data and is the validation dataset. Algorithm 1 details the interaction between PSO and DeepSeek LLM. Specifically, the PSO-DeepSeek interaction is described as follows:
Step 1: Prompt Construction. Designing a structured natural language query that includes the current PSO parameters (inertia weight, cognitive learning factor, and social learning factor), recent performance metrics, such as fitness value, classification accuracy, and feature counts over the past 6 iterations, and explicit instructions requesting specific numerical parameter adjustments to improve convergence behavior.
Step 2: DeepSeek API Query. Invoking the LLM with the constructed prompt.
Step 3: Response Parsing. Extracting numerical parameter changes from the natural language response of DeepSeek using regex patterns that recognize common directive formats such as "increase by 0.1" or "set to 1.8".
Step 4: Validated Parameter Update. Applying the suggested changes while enforcing theoretically sound boundary constraints.
Step 5: Fallback Mechanism. Reverting to previous parameter values if parsing fails or suggested values violate bounds, ensuring algorithmic robustness against ambiguous or invalid LLM responses.
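Steps 1 and 3 to 5 can be sketched without any network access. In the code below the prompt wording, the regular expressions, the parameter names `w`, `c1`, and `c2`, and the numerical bounds are all illustrative assumptions; the actual API call of Step 2 is omitted, and `response_text` stands in for whatever text the LLM returns.

```python
import re

# Illustrative bounds only; the constraints used in the experiments are not given here.
BOUNDS = {"w": (0.1, 1.2), "c1": (0.5, 2.5), "c2": (0.5, 2.5)}

_DELTA = re.compile(r"\b(increase|decrease)\s+(w|c1|c2)\s+by\s+(\d+(?:\.\d+)?)", re.IGNORECASE)
_SET = re.compile(r"\bset\s+(w|c1|c2)\s+to\s+(\d+(?:\.\d+)?)", re.IGNORECASE)

def build_prompt(params, history):
    """Step 1: current parameters, the last 6 performance records, and an explicit request."""
    lines = [f"Current PSO parameters: w={params['w']}, c1={params['c1']}, c2={params['c2']}."]
    for h in history[-6:]:
        lines.append(f"iter {h['iter']}: fitness={h['fitness']:.4f}, "
                     f"accuracy={h['accuracy']:.4f}, features={h['features']}")
    lines.append("Reply with directives such as 'increase w by 0.1' or 'set c1 to 1.8'.")
    return "\n".join(lines)

def parse_response(text, params):
    """Step 3: apply every recognized directive to a copy; None if nothing was recognized."""
    proposed, found = dict(params), False
    for verb, name, value in _DELTA.findall(text):
        sign = 1.0 if verb.lower() == "increase" else -1.0
        proposed[name.lower()] += sign * float(value)
        found = True
    for name, value in _SET.findall(text):
        proposed[name.lower()] = float(value)
        found = True
    return proposed if found else None

def update_params(params, response_text):
    """Steps 4 and 5: accept in-bounds suggestions, otherwise keep the previous values."""
    proposed = parse_response(response_text, params)
    if proposed is None:
        return dict(params)
    if any(not (lo <= proposed[k] <= hi) for k, (lo, hi) in BOUNDS.items()):
        return dict(params)
    return proposed

params = {"w": 0.7, "c1": 1.5, "c2": 1.5}
updated = update_params(params, "I suggest you increase w by 0.1 and set c1 to 1.8.")
```

Rejecting the whole suggestion when any value is out of bounds is one conservative design choice; clamping each value to its interval would be an alternative.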
III-D3 Iterative Parameter Search and Evaluation
In what follows, the LLM iteratively refines the PSO parameters based on performance feedback, creating a feedback loop that continuously improves the feature selection process. In each iteration, the PSO algorithm with the LLM-suggested parameters selects a feature subset, which is then evaluated using the dual optimization objective, i.e.,
| (17) |
Then, the results are fed back to the LLM, which generates improved parameter suggestions:
| (18) |
where represents the performance metrics of the selected feature subset.
Algorithm 2 shows the pseudocode of the LLM-guided feature selection optimization strategy. By integrating the reasoning capabilities of LLMs [Sun2025] with the exploration efficiency of PSO, this approach intelligently navigates the feature space to identify the most discriminative features for IoT traffic classification.
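The feedback loop can be summarized by the following skeleton, which is a simplified sketch and not a transcription of Algorithm 2. The callables `run_pso` and `suggest` are hypothetical placeholders: the first runs PSO with the given parameters and returns a feature subset with its objective value (lower is better), and the second stands in for the LLM query that proposes the next parameters.

```python
def refine_parameters(run_pso, suggest, params, rounds=5):
    """Alternate PSO runs and parameter suggestions; keep the best feature subset seen."""
    history, best = [], None
    for r in range(rounds):
        subset, fitness = run_pso(params)          # select and score a feature subset
        history.append({"round": r, "params": dict(params), "fitness": fitness})
        if best is None or fitness < best[1]:
            best = (subset, fitness, dict(params))
        params = suggest(params, history)          # feedback produces the next parameters
    return best, history

# Toy stand-ins: fitness is lowest at w = 0.5, and the "advisor" lowers w by 0.1 each round.
def toy_run_pso(p):
    return ["f0", "f3"], abs(p["w"] - 0.5)

def toy_suggest(p, history):
    return {**p, "w": p["w"] - 0.1}

best, history = refine_parameters(toy_run_pso, toy_suggest, {"w": 0.9}, rounds=5)
```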
III-E Computational Complexity Analysis
The practical deployment of IoT traffic classification systems often requires careful consideration of computational resources, especially in edge computing environments. Thus, we analyze the computational complexity of each component of the DMLITE framework and discuss optimizations for resource-constrained environments. Specifically, the computational complexity of the DMLITE framework can be analyzed for each of its major components:
1. Traffic Visual Preprocessing: The complexity is , where is the number of traffic flows and is the maximum flow length (capped at 784 bytes in our implementation). This step is relatively lightweight and can be efficiently implemented on edge devices.
2. Diffusion-based Feature Extraction:
  • DDPM Model Training: , where is the number of diffusion steps, is the number of epochs, is the number of training samples, and is the dimensionality of the data. This is the most computationally intensive phase, but it is performed offline during the development stage.
  • Optimal Layer Identification: , where is the number of layers evaluated. This is also performed offline during model development.
  • Representative Sample Selection: , where is the number of clusters, is the dataset size, is the feature dimension, and is the number of iterations for K-means. Our approach reduces this cost by operating on only 5% of the training data.
  • Fine-tuning: , where is the number of fine-tuning epochs and is the size of the representative subset (5% of ). This phase is significantly more efficient than full model training due to the reduced dataset size.
3. LLM-guided Feature Selection:
  • PSO Optimization: , where is the number of particles, is the number of iterations, is the feature dimension, and is the cost of evaluating the fitness of each particle. The LLM guidance helps reduce the number of iterations required to reach an optimal solution.
  • LLM Parameter Refinement: , where is the number of refinement rounds and is the inference cost of the LLM. This step is performed on a server rather than edge devices.
The overall computational complexity of the DMLITE framework is dominated by the DDPM model training phase. However, once the model is trained, the feature extraction and selection phases are relatively efficient, especially since we only use 5% of the training data for fine-tuning. The LLM-guided optimization introduces additional computational cost, but this is offset by the improved efficiency of the resulting feature selection, leading to better classification performance with fewer features.
| Dataset | Classes | Total samples | Category | Detailed class distribution |
| USTC-TFC | 20 | 272708 | Malware | Cridex: 15000, Geodo: 6000, Htbot: 13549, Miuref: 17178, Neris: 14456, Nsis-ay: 14984, Shifu: 12000, Tinba: 12653, Virut: 12982, Zeus: 15761 |
| | | | Benign | Bittorrent: 16335, FTP: 13804, Facetime: 12657, Gmail: 10051, MySQL: 17008, Outlook: 11115, SMB: 15772, Skype: 16213, Weibo: 13158, WoW: 12032 |
| ISCX-VPN | 12 | 340339 | non-VPN | audio: 206226, chat: 11326, file: 73347, mail: 8948, streaming: 2667, voip: 4033 |
| | | | VPN | vpn-audio: 20742, vpn-chat: 8065, vpn-file: 1949, vpn-mail: 598, vpn-streaming: 666, vpn-voip: 1832 |
| Edge-IIoTset | 24 | 570860 | Normal traffic | Distance: 32653, Flame_Sensor (FS): 32687, Heart_Rate (HR): 20842, IR_Receiver (IR): 32730, Modbus: 1138, phValue: 35522, Soil_Moisture (SM): 34584, Sound_Sensor (SS): 32695, Temperature_and_Humidity (TH): 33282, Water_Level (WL): 34330 |
| | | | Attack traffic | Backdoor: 1693, DDoS_HTTP: 16760, DDoS_ICMP: 60000, DDoS_TCP: 60000, DDoS_UDP: 60000, MITM: 102, OS_Fingerprinting (OS): 290, Password: 28565, Port_Scanning (PS): 20132, Ransomware: 1211, SQL_injection (SQL): 8790, Uploading: 15183, Vulnerability_scanner (VS): 5274, XSS: 2397 |
For deployment in resource-constrained IoT environments, the trained model can be further optimized through techniques such as model quantization, pruning, and knowledge distillation, reducing the overall computational requirements while maintaining classification accuracy. Additionally, the modular nature of our framework allows for flexible deployment configurations, where more computationally intensive components can be offloaded to cloud servers while lightweight inference can be performed at the edge.
Compared to traditional approaches that require extensive feature engineering or complex DL models that need to process raw traffic data directly, our framework offers a more balanced approach to the accuracy-efficiency trade-off, making it particularly suitable for real-world IoT security applications.
IV Experimental Results and Analysis
In this section, we present a comprehensive evaluation of the proposed DMLITE framework. We first describe the datasets and implementation details, followed by performance comparison with state-of-the-art methods, and conclude with ablation studies to validate the contribution of different components.
| Parameter | Description | Value |
| | Training epochs | 100 |
| batch_size | Training batch size | 64 |
| | Number of diffusion timesteps | 500 |
| | Model embedding dimension | 64 |
| | Feature extraction time step | 50 |
| Fine-tuning stage | | |
| | The ratio of data for fine-tuning | 0.05 |
| | Fine-tuning epochs | 10 |
| | Learning rate | 1e-6 |
| | Weight decay | 1e-2 |
| Method | Parameter | Value |
| 2D-CNN [Wang2017] | Batch size | 64 |
| | Epochs | 20 |
| | Learning rate | |
| | Weight decay | 0.05 |
| | Convolution kernels | 5 |
| DP-CNN [Lotfollahi2020] | Batch size | 64 |
| | Epochs | 100 |
| | Learning rate | |
| | Dropout | 0.05 |
| | Convolution kernels | 4 |
| DP-SAE [Lotfollahi2020] | Batch size | 64 |
| | Epochs | 100 |
| | Learning rate | |
| | Dropout | 0.05 |
| RBLJAN [Xiao2025] | Batch size | 64 |
| | Epochs | 100 |
| | Learning rate | |
| | Dropout | 0.5 |
| | Convolution kernels | 8 |
| MTC-MAE [Xu2024a] Pre-training | Batch size | 64 |
| | Epochs | 300 |
| | Learning rate | |
| | Weight decay | 0.05 |
| | The masking ratio | 0.6 |
| MTC-MAE [Xu2024a] Fine-tuning | Batch size | 64 |
| | Epochs | 50 |
| | Learning rate | |
| | Weight decay | 0.05 |
| YaTC [Zhao2023] Pre-training | Batch size | 64 |
| | Total step | 150000 |
| | Learning rate | |
| | Weight decay | 0.05 |
| | The masking ratio | 0.9 |
| YaTC [Zhao2023] Fine-tuning | Batch size | 64 |
| | Epochs | 200 |
| | Learning rate | |
| | Weight decay | 0.05 |
IV-A Dataset and Implementation
To evaluate the performance of the DMLITE framework, we conduct experiments on three widely-used public network traffic datasets, i.e., USTC-TFC [Wang2017], ISCX-VPN [DraperGil2016], and Edge-IIoTset [Ferrag2022]. Table II summarizes the characteristics of datasets, including the number of classes, total samples, and class distribution of samples. Specifically, the USTC-TFC dataset contains network traffic data for both malicious and benign applications. It includes 10 types of malware traffic (such as Cridex, Geodo, Htbot, Zeus, etc.) and 10 types of normal traffic activities, providing a balanced environment for evaluating traffic classification performance. Moreover, the ISCX-VPN dataset focuses on encrypted traffic classification with 12 traffic categories (6 VPN and 6 non-VPN), including audio, chat, file transfer, email service, streaming, and VoIP. This dataset is particularly valuable for evaluating performance on encrypted traffic since traditional pattern-matching techniques often fail. The Edge-IIoTset dataset comprises network traffic data collected from various edge computing and IoT devices. It includes traffic from multiple IoT device categories (such as temperature, humidity, and ultrasonic sensors) with both benign operational patterns and simulated attack traffic vectors, including DDoS, information gathering, man-in-the-middle, injection attacks, and malware. This comprehensive dataset captures the unique characteristics of resource-constrained IoT communications at network edges, which makes it particularly valuable for evaluating traffic classification algorithms in edge computing environments where processing capabilities and bandwidth are limited.
The proposed framework is implemented using PyTorch 2.7.1 along with CUDA 12.6 and conducted on a server equipped with NVIDIA GeForce RTX 3060 GPU. For the diffusion model component, we employ a U-Net architecture, and the diffusion process follows a cosine noise schedule, in which we use the AdamW optimizer with a learning rate of 0.001 and weight decay of 0.05. For the LLM component, we utilize a pre-trained DeepSeek model, in which the population size , the maximum iteration count , and the number of data subsets is 4. Each data subset is selected through the K-means clustering algorithm and contains approximately 5% of the total sample. Moreover, we utilize a KNN classifier with as the performance evaluator for comparing feature representations extracted from different intermediate layers of the diffusion model, and the light gradient boosting machine (LightGBM) as the primary classifier for final performance evaluation and comparative experiments. Table III presents the other training parameters used in the DMLITE framework.
For evaluation, we employ multiple complementary metrics that capture different aspects of classification performance, namely accuracy (AC), precision (PR), recall (RC), and weighted F1-score (F1). Additionally, we conduct per-class F1-score analysis to assess the effectiveness of the framework across individual traffic categories, enabling a more granular understanding of classification capabilities for specific traffic types.
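For reference, the four metrics and the per-class F1-scores can be computed from label lists as in the sketch below. It assumes support-weighted averaging for precision and recall as well as for F1; under that assumption the weighted recall is mathematically equal to the accuracy, which matches the identical AC and RC columns in the results table. The function name and dictionary keys are illustrative.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1, with per-class F1."""
    n = len(y_true)
    support = Counter(y_true)          # true samples per class
    predicted = Counter(y_pred)        # predictions per class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = f1 = 0.0
    per_class_f1 = {}
    for c, s in support.items():
        p_c = tp[c] / predicted[c] if predicted[c] else 0.0
        r_c = tp[c] / s
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        per_class_f1[c] = f_c
        precision += s / n * p_c
        recall += s / n * r_c
        f1 += s / n * f_c
    return {"AC": sum(tp.values()) / n, "PR": precision, "RC": recall,
            "F1": f1, "per_class_f1": per_class_f1}

metrics = weighted_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```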
| Model | USTC-TFC | ISCX-VPN | Edge-IIoTset | |||||||||
| AC | PR | RC | F1 | AC | PR | RC | F1 | AC | PR | RC | F1 | |
| 2D-CNN | 88.05% | 88.69% | 88.05% | 88.10% | 92.43% | 92.90% | 92.43% | 92.57% | 79.95% | 82.76% | 79.95% | 80.86% |
| DP-CNN | 68.22% | 72.52% | 68.22% | 65.02% | 84.29% | 84.59% | 84.29% | 83.23% | 39.15% | 44.17% | 39.15% | 34.41% |
| DP-SAE | 37.33% | 41.68% | 37.33% | 34.62% | 53.17% | 62.24% | 53.17% | 51.05% | 11.68% | 25.89% | 11.68% | 10.66% |
| RBLJAN | 86.74% | 87.60% | 86.74% | 85.73% | 89.63% | 89.83% | 89.63% | 89.49% | 50.71% | 65.21% | 50.71% | 45.60% |
| YaTC | 96.33% | 96.44% | 96.33% | 96.33% | 87.68% | 87.53% | 87.68% | 87.60% | 97.06% | 97.99% | 97.06% | 96.91% |
| MTC-MAE | 92.10% | 94.77% | 92.10% | 89.43% | 66.67% | 77.78% | 66.67% | 53.34% | 89.97% | 92.22% | 89.97% | 86.01% |
| DMLITE | 98.87% | 98.88% | 98.87% | 98.87% | 92.61% | 92.65% | 92.61% | 92.63% | 99.83% | 99.83% | 99.83% | 99.83% |
| Model | USTC-TFC | ISCX-VPN | Edge-IIoTset |
| 2D-CNN | 2611.75 | 2131.06 | 4059.90 |
| DP-CNN | 4674.83 | 4251.73 | 8261.12 |
| DP-SAE | 2324.84 | 2022.44 | 4463.43 |
| RBLJAN | 37809.12 | 24804.05 | 43403.27 |
| YaTC | 11039.82 | 10622.12 | 34145.02 |
| MTC-MAE | 22797.50 | 17957.72 | 43968.08 |
| DMLITE | 4339.35 | 5418.86 | 28660.60 |
IV-B Comparison with Baselines
To verify the performance of the proposed DMLITE framework, we conduct comprehensive experiments comparing it with several state-of-the-art baselines, including supervised learning methods (i.e., 2D-CNN [Wang2017], Deep Packet [Lotfollahi2020], and RBLJAN [Xiao2025]), and self-supervised learning approaches (i.e., YaTC [Zhao2023] and MTC-MAE [Xu2024a]). Note that DP-CNN and DP-SAE are Deep Packet with the one-dimensional CNN and the stacked autoencoder, respectively, and the key parameter settings of these baselines are shown in Table IV.
| Model | Cridex | Geodo | Htbot | Miuref | Neris | Nsis-ay | Shifu | Tinba | Virut | Zeus |
| 2D-CNN | 85.63% | 81.63% | 79.27% | 81.30% | 90.01% | 88.00% | 87.55% | 92.77% | 89.95% | 96.43% |
| DP-CNN | 59.35% | 52.65% | 57.44% | 15.14% | 71.19% | 66.51% | 27.72% | 19.68% | 69.29% | 42.91% |
| DP-SAE | 44.42% | 37.05% | 49.20% | 4.85% | 16.84% | 40.82% | 8.94% | 4.91% | 41.46% | 16.42% |
| RBLJAN | 69.32% | 27.53% | 78.47% | 41.71% | 92.12% | 99.47% | 92.74% | 60.56% | 89.00% | 96.71% |
| YaTC | 100.00% | 99.39% | 91.67% | 98.11% | 73.44% | 100.00% | 99.36% | 100.00% | 77.78% | 100.00% |
| MTC-MAE | 99.97% | 99.58% | 98.58% | 98.74% | 71.00% | 96.05% | 98.16% | 99.35% | 0.00% | 99.74% |
| DMLITE | 99.97% | 99.96% | 99.80% | 99.85% | 91.54% | 99.37% | 99.84% | 99.94% | 90.22% | 99.83% |
| Model | Bittorrent | FTP | Facetime | Gmail | MySQL | Outlook | SMB | Skype | WoW | VAE | |
| 2D-CNN | 88.01% | 93.93% | 88.23% | 89.00% | 85.17% | 89.39% | 88.28% | 88.74% | 88.90% | 90.59% | 88.01% |
| DP-CNN | 42.47% | 80.01% | 98.35% | 73.14% | 56.80% | 9.93% | 91.47% | 66.56% | 99.93% | 93.15% | 57.92% |
| DP-SAE | 61.17% | 72.44% | 44.60% | 21.88% | 12.18% | 24.76% | 30.64% | 6.20% | 47.90% | 39.85% | 30.88% |
| RBLJAN | 100.00% | 99.99% | 99.98% | 99.99% | 100.00% | 99.99% | 99.98% | 99.95% | 99.98% | 99.97% | 86.71% |
| YaTC | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 99.37% | 100.00% | 96.80% |
| MTC-MAE | 99.28% | 0.00% | 99.85% | 87.80% | 99.38% | 98.90% | 93.40% | 99.19% | 99.59% | 99.48% | 86.24% |
| DMLITE | 100.00% | 100.00% | 99.96% | 99.91% | 100.00% | 99.90% | 100.00% | 100.00% | 99.96% | 100.00% | 98.95% |
| Model | Audio | Chat | File | ST | Voip | Vpn-audio | Vpn-chat | Vpn-file | Vpn-mail | Vpn-ST | Vpn-voip | VAE | |
| 2D-CNN | 95.18% | 78.45% | 89.01% | 71.42% | 80.30% | 84.11% | 99.08% | 96.17% | 85.48% | 75.00% | 86.57% | 94.35% | 86.26% |
| DP-CNN | 92.96% | 83.59% | 93.79% | 54.10% | 91.13% | 46.10% | 95.83% | 86.78% | 90.06% | 75.79% | 98.20% | 68.96% | 81.44% |
| DP-SAE | 48.53% | 44.51% | 47.47% | 10.74% | 62.70% | 18.48% | 78.77% | 61.44% | 47.04% | 54.61% | 98.95% | 20.01% | 49.44% |
| RBLJAN | 98.02% | 88.39% | 99.22% | 69.47% | 99.68% | 64.20% | 98.62% | 94.64% | 95.02% | 96.21% | 99.84% | 64.61% | 88.99% |
| YaTC | 89.48% | 89.00% | 70.95% | 90.48% | 93.49% | 94.07% | 100.00% | 99.32% | 96.92% | 98.33% | 97.78% | 99.12% | 93.24% |
| MTC-MAE | 80.17% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 78.33% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 13.21% |
| DMLITE | 94.37% | 93.37% | 84.57% | 92.81% | 94.07% | 95.04% | 99.93% | 97.89% | 92.54% | 94.83% | 93.02% | 99.73% | 94.35% |
Table V shows that DMLITE achieves the highest accuracy on all three datasets. The reason may be that the diffusion-based multi-level feature extraction mechanism effectively captures meaningful features through the U-Net architecture during the progressive denoising process. Moreover, on the USTC-TFC and Edge-IIoTset datasets, DMLITE and the other self-supervised approaches, i.e., YaTC and MTC-MAE, outperform the supervised methods, which supports the efficacy of self-supervised learning paradigms in network traffic classification tasks. Meanwhile, the self-supervised learning paradigm of DMLITE mitigates the critical challenge of scarce labeled data in rapidly evolving IoT environments. Additionally, the well-balanced precision and recall scores obtained by DMLITE underscore its effectiveness in reducing both false positives and false negatives, which is crucial for practical traffic classification systems that focus on classification stability. However, the relatively lower accuracy on the ISCX-VPN dataset reveals potential challenges when handling certain types of complex encrypted VPN traffic patterns. The reason may be that VPN encryption introduces additional layers of obfuscation that make traffic patterns more homogeneous, thereby reducing the discriminative power of extracted features.
The computational overhead comparison is illustrated in Table VI, showing training duration requirements for all three datasets. Although 2D-CNN exhibits superior computational efficiency, its classification accuracy proves insufficient when handling intricate network traffic patterns, as evidenced by our performance assessments. DMLITE strikes an optimal balance between computational demands and classification quality, requiring reasonable training durations while substantially outperforming self-supervised alternatives such as YaTC and MTC-MAE in terms of efficiency.
The detailed category-specific performance analysis is illustrated in Tables VII, VIII, and IX, which provide an exhaustive evaluation comparing DMLITE against baseline approaches across three distinct datasets. The experimental results indicate that baseline methods exhibit significant performance fluctuations, failing entirely to identify certain traffic categories, particularly when dealing with encrypted network traffic or complex network patterns. In contrast, the proposed DMLITE framework achieves remarkably stable and superior performance across all traffic categories throughout the three experimental datasets. Notably, self-supervised approaches such as MTC-MAE exhibit substantially diminished detection capabilities for multiple traffic categories where DMLITE demonstrates exceptional proficiency, thereby validating the superior feature extraction capabilities. Furthermore, these experimental results show the superiority of DMLITE in preserving classification effectiveness across heterogeneous network environments, especially for security-critical deployment scenarios where uniform cross-category performance reliability is paramount.
IV-C Ablation Study
To evaluate the effectiveness of each component in the proposed DMLITE framework, we conduct an ablation study by incrementally adding key components to the baseline model. Specifically, we compare and analyze four variants of our approach, which are as follows:
• DMLITE-1: Utilizes only the optimal feature extraction layer from the diffusion model without any enhancements.
• DMLITE-2: Extends DMLITE-1 by incorporating the contrastive learning-based fine-tuning strategy.
• DMLITE-3: Builds upon DMLITE-2 by adding the multi-layer feature fusion mechanism.
• DMLITE: Our complete framework that enhances DMLITE-3 with the LLM-guided feature selection optimization.
Fig. 4 presents the comparative performance of these variants on the USTC-TFC, ISCX-VPN, and Edge-IIoTset datasets across four evaluation metrics: accuracy, precision, recall, and F1-score. Although the additional components increase deployment and maintenance complexity, the experimental results show that each component delivers performance improvements across all three datasets, which validates the necessity and effectiveness of this multi-stage design.
The variant DMLITE-1 shows remarkable classification performance on both USTC-TFC and Edge-IIoTset datasets, which validates that diffusion models possess capabilities for extracting meaningful representations from complex network traffic patterns. The reason can be explained by the diffusion model’s unique denoising methodology, which progressively learns to reconstruct traffic data from noise, thereby acquiring a deep understanding of the underlying data distribution characteristics that effectively distinguish between different traffic categories.
The incorporation of the contrastive learning-based fine-tuning mechanism produces the most significant performance boost among all proposed enhancements. This improvement is particularly pronounced on the challenging ISCX-VPN dataset, where we observe a 2.27% accuracy improvement over the DMLITE-1 configuration. This enhancement can be attributed to the contrastive learning framework, which learns discriminative embeddings by maximizing similarity between positive pairs while minimizing similarity between negative samples, thus refining the organization of the feature space and sharpening classification boundaries.
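The pull-together and push-apart behavior described above can be illustrated with an InfoNCE-style loss. This is a minimal sketch, assuming cosine similarity, a temperature of 0.1, and in-batch negatives; the function name `info_nce_loss` and these settings are illustrative choices, and the exact contrastive objective used by DMLITE may differ.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (illustrative, not necessarily the paper's objective).

    Row i of `positives` is the positive sample for row i of `anchors`;
    every other row in the batch acts as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability assigned to each true positive.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                       # hypothetical embeddings
aligned = info_nce_loss(z, z + 0.01 * rng.normal(size=z.shape))
mismatched = info_nce_loss(z, np.roll(z, 1, axis=0))
print(f"aligned pairs: {aligned:.4f}, mismatched pairs: {mismatched:.4f}")
```

The loss is small when each anchor is most similar to its own positive and grows when positives are mismatched, which is the property that encourages well-separated class clusters in the feature space.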
Finally, LLM-guided feature selection optimization is integrated into the model as a final enhancement. Although this component introduces additional computational overhead, it achieves consistent performance refinements across all evaluation metrics and experimental datasets. This component leverages the reasoning capabilities of LLMs to identify and prioritize the most informative features, resulting in a more robust classification framework. The comprehensive ablation analysis demonstrates that our proposed multi-component architecture achieves optimal performance through the synergistic combination of diffusion-based representation learning, contrastive fine-tuning, and intelligent feature optimization strategies.
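The dual objective mentioned in the abstract, which penalizes both classification error and its variance across data distributions, can be sketched as a fitness function over a binary feature mask. This is a hedged illustration only: the nearest-centroid classifier, the weight `lam`, and the use of several train/test splits as stand-ins for different data distributions are assumptions made for the example, and the particle swarm search and the LLM-driven parameter adjustment are not shown.

```python
import numpy as np

def nearest_centroid_error(X_train, y_train, X_test, y_test):
    """Error rate of a nearest-centroid classifier (stand-in for the real model)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float(np.mean(pred != y_test))

def dual_objective(mask, splits, lam=1.0):
    """Mean error plus lam times the variance of the error across splits.

    `mask` is a binary feature-selection vector; `splits` is a list of
    (X_train, y_train, X_test, y_test) tuples. `lam` is a hypothetical weight.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return float("inf")                        # selecting nothing is invalid
    errors = [
        nearest_centroid_error(Xtr[:, mask], ytr, Xte[:, mask], yte)
        for Xtr, ytr, Xte, yte in splits
    ]
    return float(np.mean(errors) + lam * np.var(errors))

def make_split(rng, n=200, d=4):
    """Synthetic data: only feature 0 carries class information."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    X[:, 0] += 4.0 * y
    half = n // 2
    return X[:half], y[:half], X[half:], y[half:]

rng = np.random.default_rng(0)
splits = [make_split(rng) for _ in range(3)]
print("informative feature only:", dual_objective([1, 0, 0, 0], splits))
print("noise features only:     ", dual_objective([0, 1, 1, 1], splits))
```

A search procedure such as particle swarm optimization would minimize this value over candidate masks; the variance term favors feature subsets whose error stays stable from one split to the next.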
| Model | Distance | FS | HR | IR | Modbus | phValue | SM | SS | TH | WL | Backdoor | DDoS-HTTP |
| 2D-CNN | 77.20% | 81.58% | 75.77% | 86.06% | 26.72% | 83.63% | 72.06% | 83.71% | 82.93% | 76.95% | 29.09% | 77.40% |
| DP-CNN | 3.26% | 5.54% | 4.01% | 2.07% | 30.02% | 13.84% | 7.46% | 3.40% | 66.95% | 3.43% | 35.67% | 98.74% |
| DP-SAE | 3.35% | 3.22% | 2.76% | 1.86% | 22.90% | 3.04% | 3.04% | 3.19% | 38.95% | 3.16% | 8.85% | 12.92% |
| RBLJAN | 25.87% | 30.68% | 5.44% | 5.06% | 58.32% | 10.22% | 7.30% | 15.09% | 69.34% | 12.47% | 99.72% | 99.49% |
| YaTC | 68.43% | 100.00% | 99.95% | 99.95% | 92.17% | 99.87% | 99.97% | 99.98% | 80.61% | 99.96% | 97.06% | 99.85% |
| MTC-MAE | 99.74% | 96.90% | 96.24% | 99.03% | 0.00% | 79.10% | 93.82% | 99.17% | 95.35% | 94.62% | 0.00% | 0.00% |
| DMLITE | 99.95% | 99.94% | 99.93% | 99.97% | 89.57% | 99.92% | 99.83% | 99.91% | 99.56% | 99.77% | 98.19% | 99.79% |
| Model | DDoS-ICMP | DDoS-TCP | DDoS-UDP | MITM | OS | Password | Port | Ransomware | SQL | Uploading | VS | XSS | Average |
| 2D-CNN | 82.57% | 83.36% | 92.61% | 10.42% | 19.66% | 83.07% | 78.20% | 26.79% | 79.63% | 61.20% | 59.85% | 64.61% | 66.46% |
| DP-CNN | 0.00% | 0.00% | 0.00% | 0.00% | 12.87% | 84.28% | 5.68% | 96.93% | 37.81% | 85.78% | 97.30% | 27.74% | 30.12% |
| DP-SAE | 0.00% | 0.00% | 0.00% | 2.60% | 4.48% | 9.43% | 1.14% | 16.25% | 16.86% | 7.54% | 26.30% | 4.80% | 8.19% |
| RBLJAN | 0.00% | 30.81% | 0.00% | 64.13% | 96.51% | 98.48% | 55.87% | 93.08% | 99.39% | 98.73% | 99.93% | 96.85% | 53.03% |
| YaTC | 99.96% | 100.00% | 100.00% | 34.62% | 98.25% | 99.83% | 99.80% | 97.07% | 99.66% | 99.84% | 98.77% | 99.16% | 94.37% |
| MTC-MAE | 92.73% | 99.04% | 99.72% | 0.00% | 0.00% | 80.30% | 95.78% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 55.06% |
| DMLITE | 99.98% | 99.97% | 99.98% | 70.59% | 96.67% | 99.72% | 99.63% | 96.67% | 99.83% | 99.93% | 98.96% | 99.16% | 97.81% |
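As a quick consistency check on the table above, the final column matches the unweighted mean of the 24 per-category values in each row. The sketch below recomputes it for the DMLITE and 2D-CNN rows using the values copied from the table; the variable names are illustrative.

```python
# Per-category values (%) copied from the two halves of the table above.
rows = {
    "DMLITE": [
        99.95, 99.94, 99.93, 99.97, 89.57, 99.92, 99.83, 99.91, 99.56, 99.77,
        98.19, 99.79, 99.98, 99.97, 99.98, 70.59, 96.67, 99.72, 99.63, 96.67,
        99.83, 99.93, 98.96, 99.16,
    ],
    "2D-CNN": [
        77.20, 81.58, 75.77, 86.06, 26.72, 83.63, 72.06, 83.71, 82.93, 76.95,
        29.09, 77.40, 82.57, 83.36, 92.61, 10.42, 19.66, 83.07, 78.20, 26.79,
        79.63, 61.20, 59.85, 64.61,
    ],
}

# Unweighted (macro) mean over categories, rounded as in the table.
macro_avg = {name: round(sum(vals) / len(vals), 2) for name, vals in rows.items()}
for name, avg in macro_avg.items():
    print(f"{name}: {avg:.2f}%")   # DMLITE: 97.81%, 2D-CNN: 66.46%
```

Because the mean is unweighted, a single weak category such as MITM lowers the summary value regardless of how many samples that category contains.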



IV-D Feature Extraction Analysis
The diffusion-based feature extraction process represents a critical component of the DMLITE framework. Thus, we conduct experiments to investigate how the number of training epochs affects the quality of the features extracted by the diffusion-based feature extraction module. Specifically, we evaluate four training configurations with 50, 100, 150, and 200 epochs, respectively, while keeping all other hyperparameters constant. For each configuration, we extract features from the optimal layer and evaluate their classification accuracy on all three benchmark datasets.
As shown in Fig. 5, classification accuracy improves as training epochs increase from 50 to 100, with accuracy gains of 0.30%, 0.86%, and 0.03% on USTC-TFC, ISCX-VPN, and Edge-IIoTset, respectively. This improvement can be attributed to the progressive learning of the underlying traffic data distribution by the diffusion model, where additional training iterations allow the denoising network to better capture fine-grained features through the U-Net architecture. However, further increasing epochs from 100 to 150 shows mixed results. Specifically, while ISCX-VPN achieves a marginal improvement of 0.23%, both USTC-TFC and Edge-IIoTset experience slight performance degradation of 0.12% and 0.13%, respectively. When extending training to 200 epochs, we observe consistent performance degradation of approximately 0.01%-0.09% across all datasets, which suggests potential overfitting where the model begins to memorize training-specific patterns rather than generalizing to discriminative traffic features. Although ISCX-VPN achieves its peak performance at 150 epochs, the marginal 0.23% improvement comes at the cost of nearly 50% additional training time. Thus, we select 100 epochs as the default configuration in the DMLITE framework, as it provides an optimal balance between feature extraction quality and computational efficiency across diverse datasets.
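The trade-off behind the choice of 100 epochs can be made explicit with a short calculation on the accuracy changes quoted above. The sketch treats the reported changes as percentage points and assumes that training time grows linearly with the number of epochs; both are simplifying assumptions for illustration.

```python
# Accuracy changes quoted in the text, treated here as percentage points.
deltas = {
    (50, 100): {"USTC-TFC": 0.30, "ISCX-VPN": 0.86, "Edge-IIoTset": 0.03},
    (100, 150): {"USTC-TFC": -0.12, "ISCX-VPN": 0.23, "Edge-IIoTset": -0.13},
}

def summarize(deltas):
    """Return {(start, end): (mean accuracy change, relative extra epochs)}."""
    rows = {}
    for (start, end), per_dataset in deltas.items():
        mean_gain = sum(per_dataset.values()) / len(per_dataset)
        extra_cost = (end - start) / start         # assumes time scales with epochs
        rows[(start, end)] = (mean_gain, extra_cost)
    return rows

summary = summarize(deltas)
for (start, end), (gain, cost) in summary.items():
    print(f"{start}->{end} epochs: mean change {gain:+.3f} pp, extra epochs {cost:.0%}")
```

Averaged over the three datasets, the step from 50 to 100 epochs yields a clear gain, whereas the step from 100 to 150 epochs is slightly negative on average despite requiring 50% more epochs.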
Moreover, the extraction timestep parameter ($t$) directly influences the quality and characteristics of the extracted features. This parameter determines at which point in the diffusion process features are captured, thereby affecting the balance between low-level details and high-level semantic information. Specifically, following the diffusion formulation, the noised version of $x_0$ at timestep $t$ is defined as $x_t = \alpha_t x_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\alpha_t$ and $\sigma_t$ control how much original semantic information is preserved versus how much stochastic noise is introduced [Xiang2023]. In this case, different timesteps activate distinct levels of semantic information within the denoising network. Moreover, a well-chosen timestep creates an implicit information bottleneck that compresses high-level semantics into compact, linearly separable features, with alignment and uniformity metrics demonstrating that features extracted at such timesteps achieve balanced semantic consistency and discriminative specificity [Xiang2025]. Thus, to systematically evaluate the impact of this parameter on classification performance, we conducted experiments with three different extraction timestep values.
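To make the role of the timestep concrete, the sketch below noises a sample at a chosen timestep and reports how the signal and noise scales change. It assumes a DDPM-style linear beta schedule with 1000 steps, in which the signal scale is the square root of the cumulative product of $(1 - \beta_s)$ and the noise scale is its variance-preserving complement. The schedule, the 28x28 input shape, and the probed timesteps are illustrative assumptions and are not the configuration reported in this paper.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed); returns per-timestep signal and noise scales."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    return np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)

def noise_sample(x0, t, signal_scale, noise_scale, rng):
    """Forward-diffuse x0 to timestep t: scaled signal plus scaled Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    return signal_scale[t] * x0 + noise_scale[t] * eps

rng = np.random.default_rng(0)
signal_scale, noise_scale = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(28, 28))         # hypothetical traffic image

for t in (10, 100, 500):                           # illustrative timesteps only
    xt = noise_sample(x0, t, signal_scale, noise_scale, rng)
    print(f"t={t}: signal scale {signal_scale[t]:.3f}, noise scale {noise_scale[t]:.3f}")
```

In a feature-extraction setting, the noised sample would be passed through the denoising network and intermediate activations read out; small timesteps keep the input close to the original, while larger ones force the network to rely on coarser structure.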
Fig. 6 presents the classification accuracy achieved across the three datasets using different extraction timesteps. The experimental results show that increasing the extraction timestep generally leads to improved classification performance across all datasets. In particular, the enhancement is most pronounced on the more challenging encrypted ISCX-VPN dataset, which suggests that higher extraction timesteps enable the diffusion model to capture more abstract and discriminative features that are particularly valuable for distinguishing between complex encrypted traffic patterns.
These findings indicate that features extracted at later stages of the diffusion process contain richer semantic information beneficial for classification tasks, especially for complex traffic patterns. The progressive improvement with increasing extraction timesteps can be attributed to the diffusion model’s ability to gradually refine its understanding of the underlying data distribution through iterative denoising steps. However, it is worth noting that the performance gains begin to plateau, particularly for the USTC-TFC and Edge-IIoTset datasets, suggesting an optimal range for the extraction timestep parameter rather than an indefinite improvement with increasing values.
V Conclusion
This paper proposes the DMLITE framework, which combines diffusion models with an LLM (DeepSeek) to tackle network traffic classification challenges in computationally restricted IoT environments. DMLITE first introduces a self-supervised diffusion-based hierarchical feature extraction method that identifies multiscale traffic characteristics within encrypted communication channels without extensive labeled datasets. It then applies an LLM-guided adaptive feature selection method that dynamically optimizes the feature space while preserving computational efficiency through intelligent parameter tuning. Extensive experimental validation across diverse benchmark datasets confirms that DMLITE delivers substantial performance enhancements compared to the baselines and achieves high classification accuracy across different traffic categories while reducing training time relative to comparable self-supervised baselines. However, the framework incurs higher computational complexity during the DDPM training phase and introduces deployment complexity due to its multi-component architecture, which requires careful coordination between different modules.
Future research directions include investigating lightweight diffusion architectures tailored for resource-constrained environments through neural architecture search or knowledge distillation techniques to further reduce computational overhead while maintaining feature quality. Additionally, developing automatic timestep selection mechanisms based on dataset characteristics would enhance the generalizability of the method across diverse traffic distributions without manual hyperparameter tuning. Finally, extending the framework to handle continual learning scenarios where new traffic patterns emerge dynamically would address the evolving nature of IoT network environments and enhance long-term deployment viability in real-world applications.