Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation

Hang Xu1  Linjiang Huang2   Feng Zhao1
1MoE Key Lab of BIPC, USTC 2Beihang University
Corresponding author
Abstract

Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and reward models, while the impact of the stochastic characteristics of noise on TTS performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new format of randomness for TTS: text embedding perturbation, which couples with existing randomness such as SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these two formats of randomness and their impact on generation, and find that they exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the potential limitations of spatial noise randomness in high-frequency manipulation. In addition, text embeddings demonstrate varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation; and (2) adapting the perturbation intensity selectively based on frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at https://github.com/xuhang07/TEP-Diffusion.

1 Introduction

Figure 1: Top: Comparison of text embedding perturbation with previous randomness. Bottom: The corresponding generated images of the Top. Our method is plug-and-play.

Diffusion models start from random noise and have demonstrated impressive generative capabilities in text-to-image (T2I) generation. However, due to the inconsistency between their training and inference paradigms, where multiple noise-to-data mappings are learned during training but only a single noise sample is used during inference, the full potential of diffusion models in generation remains untapped. Inspired by test-time scaling (TTS) techniques in LLMs [muennighoff2025s1, liu2025inference], many researchers aim to enhance the generation quality of diffusion models by scaling computation at inference time [ma2025inference, singhal2025general]. Specifically, these TTS methods rely on the sampling randomness of diffusion models (such as the initial noise) to generate multiple candidate samples, evaluate them using reward models, and then employ search strategies to select and further refine the candidates. The core components of TTS methods are therefore randomness, search strategies, and reward models.

Research on search strategies and reward models has dominated TTS methods for T2I diffusion models, while randomness and its impact on these methods remain unexplored. Notably, randomness directly affects the size of the search space in TTS methods [zhang2025inference]. However, most existing works rely solely on spatial random noise introduced in latent space (i.e., SDE), which may not provide a sufficiently large search space. A constrained search space means repeated sampling tends to converge on similar and redundant candidates, leading to ineffective use of computational resources [puri2025probabilistic]. Therefore, it is meaningful to explore a new format of randomness that can both enhance generative diversity and complement existing spatial noise randomness.

In this paper, we explore a new format of randomness, text embedding perturbation, for TTS methods in T2I diffusion models. While recent studies have used text embedding perturbation to generate more diverse images [sadat2023cads], they struggle to maintain visual quality and text faithfulness, making them unsuitable for TTS methods (see Fig. 2). Our experimental analysis attributes this limitation to two key factors: (1) poor complementarity between text embedding perturbation and existing spatial noise randomness, and (2) perturbation that is applied too uniformly, ignoring the different responses of diffusion model components. First, from a frequency-domain perspective, we reveal a complementary relationship between text embedding perturbation and spatial noise randomness: while spatial noise randomness primarily affects low-frequency components, text embedding perturbation enhances high-frequency details (Fig. 4). This complementarity extends to their joint impact on image quality throughout denoising (Figs. 3(a) and 5(b)). However, prior work does not focus on this synergy, which can degrade both visual quality and text alignment. Second, as shown in Fig. 5, text embeddings demonstrate distinct requirements for and tolerance of perturbation depending on the time step, the specific components of the embedding, and the depth of the Cross-Attention layer. Consequently, a discriminative perturbation strategy tailored to these dimensions is essential, yet this aspect has not been explored in existing TTS research.

Therefore, to fully unlock the potential of text embedding perturbation as a new format of randomness, we propose the Text Embedding Perturbation (TEP) framework (see Fig. 7(a)), which integrates text embedding perturbation into existing TTS methods. Our framework features two key designs: (1) We introduce step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation to further enhance the complementarity between text embedding perturbation and spatial noise randomness. (2) We apply a mild perturbation to the conditional text embedding and deeper layers, and a stronger perturbation to the unconditional text embedding and shallower layers, better aligning with their distinct responses to perturbation. Moreover, to preserve core textual semantics while inducing randomness, we employ token-wise adaptive perturbation, accounting for the differential importance of semantic components during generation. Notably, our TEP framework offers a "free lunch": it seamlessly integrates with all existing TTS methods at negligible additional computational cost (Tab. 3), while significantly boosting their performance ceiling (Fig. 6(a)).

We summarize our contributions as follows: (1) We introduce a novel format of randomness, text embedding perturbation, for TTS, and systematically analyze how it integrates with spatial noise randomness within TTS methods. (2) We propose the TEP framework, which incorporates temporal scheduling of perturbation strength based on a frequency-based mechanism, and spatially distinctive perturbation through branch-wise response modulation. (3) We show the effectiveness and plug-and-play nature of our framework across existing TTS methods with negligible overhead.

2 Related Works

Reward Alignment in Diffusion Models

Reward alignment in diffusion models optimizes generation to maximize reward model evaluations. Current approaches fall into two categories: fine-tuning-based methods [wallace2024diffusion, liu2025flow], which adapt models using preference data or reward gradients (computationally costly and inflexible), and more flexible TTS techniques, which require no retraining. The latter can be further divided into gradient-based [song2020score, bansal2023universal] and sampling-based methods [ma2025inference, singhal2025general]. Sampling-based approaches offer a particular advantage: they do not require differentiable reward models, which makes them flexible to use.

Figure 2: In (a), we demonstrate that text embedding perturbation consistently enhances generation diversity across all CFG scales, highlighting its potential as a novel format of randomness. However, in (b), we reveal that existing approaches (CADS) that incorporate text embedding perturbation for better generative diversity are incompatible with TTS in T2I diffusion models, as they may not maintain image quality. We use SD3.5 with ImageReward for evaluations, and results for more backbones are in the Appendix.

Sampling-Based TTS Methods

Sampling-based TTS methods operate through two core mechanisms: the diffusion model's inherent stochastic sampling and reward-guided filtering. These approaches can be categorized by their randomness sources into three types: (1) ODE-based methods [ma2025inference], which rely solely on initial noise and directly denoise it for final selection; (2) particle sampling methods [singhal2025general, singh2025code], which inject additional SDE noise during denoising to explore diverse trajectories and employ either best-of-N or importance sampling strategies to select candidates with high reward potential; and (3) resampling-based methods [ma2025inference], which combine ODE processes with additional resampling operations to regenerate candidates with high reward potential.

Text Embedding

The text embedding serves as a bridge between text and images in T2I diffusion models and has become an indispensable component of modern models [yu2024uncovering]. Some studies have demonstrated that perturbing text embeddings can enhance the diversity of image generation. For example, CADS [sadat2023cads] adds Gaussian noise to the condition and gradually anneals it during generation, resulting in improved diversity.

3 Motivation

Figure 3: In (a), we switch from SDE to ODE at specific steps. We observe that the noise injected by the SDE in the early steps helps select better-quality images, while in the later steps it has a negative impact. In (b), we attenuate the SDE-injected noise in the frequency domain at specific steps to analyze the influence of its high- and low-frequency components on generation. We find that the low-frequency components play a crucial role throughout the entire process, whereas suppressing the high-frequency components improves image quality in TTS. We use SD3.5 with ImageReward for evaluations, and results for more backbones are in the Appendix.

Unless otherwise specified, we use SD3.5 as the backbone and ImageReward as the evaluation metric. The corresponding analyses for other mainstream diffusion models, which exhibit similar results, are provided in the Appendix.

3.1 Preliminaries: Generative Diversity and Quality in Diffusion Models

T2I diffusion models rely on CFG to generate high-quality outputs. CFG blends the predictions of a conditional model and an unconditional model with a CFG scale $w$:

$$\hat{\epsilon}_{t}=\epsilon_{\theta}(x_{t},t,E(y_{\emptyset}))+w\left(\epsilon_{\theta}(x_{t},t,E(y_{c}))-\epsilon_{\theta}(x_{t},t,E(y_{\emptyset}))\right) \tag{1}$$

where $y_{\emptyset}$ and $y_{c}$ represent the null text prompt and the text prompt, respectively, and $E$ represents the text encoder.
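
As a minimal sketch of the CFG blend in Eq. (1), the snippet below combines two already-computed noise predictions. The function name `cfg_combine` and the toy arrays are illustrative assumptions, not part of the released code; in practice the two inputs would come from the denoiser evaluated with the null and the actual prompt embeddings.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, w):
    """Eq. (1): eps_hat = eps_uncond + w * (eps_cond - eps_uncond).

    eps_uncond / eps_cond stand in for the denoiser outputs under the null
    prompt and the text prompt (hypothetical placeholders here).
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + w * (eps_cond - eps_uncond)


# Toy predictions, only to exercise the formula.
eps_u = np.array([0.0, 1.0])
eps_c = np.array([1.0, 3.0])

print(cfg_combine(eps_u, eps_c, w=0.0))  # w = 0 returns the unconditional prediction
print(cfg_combine(eps_u, eps_c, w=1.0))  # w = 1 returns the conditional prediction
print(cfg_combine(eps_u, eps_c, w=2.0))  # w > 1 extrapolates past the conditional one
```

Larger values of `w` push the prediction further along the direction from the unconditional to the conditional output, which is the mechanism that the diversity discussion in this section refers to.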

Figure 4: We demonstrate the differences and complementarity between text embedding perturbation and spatial noise randomness in TTS. In (a), we introduce randomness only at fixed timesteps while keeping the other steps unchanged, and measure the MSE between images generated with and without randomness. Spatial noise randomness has a greater impact during the low-frequency generation phase (early steps), while text embedding perturbation plays a more significant role in the high-frequency phase (late steps). Furthermore, when both formats of randomness are combined, the variation in images increases more markedly. In (b), we provide visual results to support this discussion. We use SD3.5 with ImageReward for evaluations, and results for more backbones are in the Appendix.

Numerous studies have demonstrated that CFG can impair the generative diversity of models [moufad2025conditional, koulischer2025feedback], as we also illustrate in Fig. 2(a). Here, we follow [cideron2024diversity] and use the cosine similarity between embeddings of samples generated under the same condition to quantify the model's generative diversity (a larger $s_{D}$ means lower similarity and hence higher diversity):

$$D(\theta)=\mathbb{E}_{\mathbf{y}\sim Y}\left[\mathbb{E}_{x_{1},x_{2}\sim p_{\theta}(\cdot|\mathbf{y})}\left[s_{D}(x_{1},x_{2})\right]\right] \tag{2}$$
$$s_{D}(x_{1},x_{2})=1-\frac{|E(x_{1})\cdot E(x_{2})|}{|E(x_{1})||E(x_{2})|} \tag{3}$$

where $p_{\theta}(\cdot|\mathbf{y})$ represents the distribution of generated samples given a text prompt $\mathbf{y}$, and $E$ here represents the embedding model. We use CLIP as the embedding model and calculate the average diversity value over 1k prompts on SD3.5 [esser2024scaling]. As shown in Fig. 2(a), introducing extra randomness, such as SDE-injected noise, benefits generation diversity. This has motivated some works to introduce other forms of randomness.
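
The diversity measure in Eqs. (2) and (3) can be sketched as follows. This is an illustrative Monte Carlo estimate under stated assumptions: the embeddings are taken as given NumPy vectors (in the paper they come from CLIP), the inner expectation is approximated by averaging over all distinct pairs of samples for one prompt, and the function names are hypothetical.

```python
import numpy as np


def pairwise_diversity(e1, e2):
    """Eq. (3): s_D = 1 - |e1 . e2| / (|e1| |e2|) for two image embeddings."""
    e1 = np.asarray(e1, dtype=float)
    e2 = np.asarray(e2, dtype=float)
    cos = abs(np.dot(e1, e2)) / (np.linalg.norm(e1) * np.linalg.norm(e2))
    return 1.0 - cos


def prompt_diversity(embeddings):
    """Average s_D over all unordered pairs of samples for a single prompt."""
    n = len(embeddings)
    scores = [
        pairwise_diversity(embeddings[i], embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return float(np.mean(scores))


def dataset_diversity(per_prompt_embeddings):
    """Eq. (2): average the per-prompt diversity over a set of prompts."""
    return float(np.mean([prompt_diversity(e) for e in per_prompt_embeddings]))


# Toy embeddings: two identical samples and one orthogonal sample.
samples = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(prompt_diversity(samples))
```

Identical embeddings contribute zero diversity and orthogonal ones contribute one, so the score rises as repeated samples for the same prompt spread out in embedding space.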

Recent studies have demonstrated that perturbing text embeddings can enhance the diversity of generation in diffusion models. These methods add substantial perturbations to the entire text embedding initially and gradually anneal them during the denoising process [sadat2023cads]. Since the composition of generated images is primarily shaped by both the text embedding and the initial noise in the early steps [yi2024towards], applying stronger perturbations to the text embedding during these early stages can enrich image composition, thereby improving generative diversity, as shown in Fig. 2(a). However, such perturbations severely disrupt semantic information, harming text-image alignment and, to some extent, degrading compositional quality, as illustrated in Fig. 2(b). This trade-off is unacceptable for TTS methods. Therefore, we aim to identify an appropriate way to introduce text embedding perturbation into TTS methods by specifically analyzing its impact on generation quality and its complementarity with existing randomness (i.e., SDE).

Figure 5: (a) shows that text embedding perturbation should be applied discriminatively across three dimensions: timesteps, the specific components of the embedding, and the depth of Cross-Attention layers, which are detailed in (b), (c), and (d), respectively. We perform perturbation across each of the three dimensions and evaluate rewards with BoN. Specifically: (b) illustrates that perturbation at early timesteps yields minimal improvement, while perturbation at later timesteps provides a significant boost. (c) indicates that the unconditional text embedding exhibits greater tolerance for perturbation than the conditional text embedding. (d) shows that shallower layers demonstrate higher tolerance for perturbation than deeper layers. We use SD3.5 with ImageReward for evaluations, and results for more backbones are in the Appendix.

3.2 Frequency-Domain Analysis of Randomness

In TTS methods, the randomness stems from three sources: initial noise, SDE-injected noise, and resampling noise. We call them spatial noise randomness, which acts in latent space and adheres to noise schedules. SDE-injected noise is taken as an example to analyze its impact on generation.

Spatial Noise Randomness Itself.

We switch the sampling process from SDE to ODE at specific steps and employ the BoN method to select the highest-quality output, in order to observe the impact of spatial noise randomness in TTS methods. As shown in Fig. 3(a), in early denoising steps this format of randomness effectively raises the upper bound of generation quality, but in later steps it significantly degrades generation quality. We further investigate the influence of different frequency components of the SDE-injected noise on generation quality. The results in Fig. 3(b) show that low-frequency components consistently have a positive effect, while partial removal of high-frequency components can enhance generation quality. From a frequency-domain perspective, this reveals the shifting role of spatial noise randomness throughout the generation process.

Figure 6: Text embedding perturbation effectively guarantees visual quality and text alignment in generated images. As shown in (a), introducing controlled randomness through discriminative perturbation expands the performance range, yielding higher aesthetic peaks while reducing subpar outputs, thus enabling more efficient high-reward potential selection. Examples in (b) further demonstrate its capability to enhance visual quality (AestheticScore) and text alignment (CLIPScore) simultaneously. Here we use SD3.5 to generate examples and conduct analysis.

Spatial Noise Randomness and Text Embedding Perturbation.

We specifically compare the impact of these two formats of randomness on generation by introducing only one type of randomness at the current step while keeping all other steps deterministic. As shown in Fig. 4(a), we measure the difference (via MSE) between images generated with and without the randomness, and find that spatial noise randomness primarily affects low-frequency components in early steps, while text embedding perturbation has a stronger influence on high-frequency details in later steps (also illustrated in Fig. 5(b)). Fig. 4(b) further shows that spatial noise randomness significantly shapes low-frequency structural elements (e.g., composition) in early stages but has minimal effect on fine details in later steps. In contrast, text embedding perturbation induces substantial variations in high-frequency details, leading to richer diversity in fine-grained features. This also explains why CADS is not suitable for TTS methods: the excessive perturbation it introduces in the early stages does not align with the preferred behavior of text embedding perturbation.

Figure 7: Our proposed TEP framework. (a) illustrates the two main designs. Spatially, we apply specific perturbations to text embeddings according to their tolerance of perturbations, with stronger perturbations introduced to the unconditional text embedding and text embeddings in shallower layers. Meanwhile, we filter high-frequency components from the SDE-injected noise and restore it to standard Gaussian noise. Temporally, we progressively intensify the text embedding perturbation throughout the denoising process, while simultaneously filtering more high-frequency content from the SDE-injected noise to enhance their complementarity. (b) demonstrates the application of our TEP framework across three different categories of TTS methods. In general, these methods introduce randomness or filter samples at certain timesteps, and we additionally inject text embedding perturbation at the same locations.

It is worth noting that these two types of randomness exhibit strong complementarity throughout the denoising process: spatial noise randomness mainly influences the early stages of generation, while text embedding perturbation primarily affects the later stages. As a result, they can be effectively combined to further enhance both the variability (Fig. 4(a)) and diversity (Fig. 4(b)) of the generated outputs.

3.3 Discriminative Perturbation of Text Embeddings in Diffusion Models

As text embedding perturbation effectively enhances generative diversity and complements spatial noise randomness, the critical focus of our work is determining the specific format of perturbation required to maintain generative quality. Through experimentation, we discover the differential characteristics of text embedding perturbation across three dimensions, as shown in Fig. 5(a).

Better Performance in the Later Stages of Generation.

As illustrated in Fig. 5(b), we continually introduce perturbation up to a specific step and employ BoN. Text embedding perturbation shows minimal improvement in rewards during the initial stages, but yields a significant increase in the reward during the later generation stages. This observation suggests that this type of perturbation is more effective when applied with higher intensity in the later steps.

Greater Perturbation Tolerance of Unconditional Text Embeddings.

As depicted in Fig. 5(c), a substantial difference exists in the perturbation tolerance between the conditional and unconditional text embeddings. We perturb each of them independently and track the corresponding reward trend. As perturbation intensity steadily increases, the reward for the conditional text embedding declines early, whereas the reward for the unconditional text embedding only begins to decrease after the perturbation reaches a very large magnitude. This finding aligns with intuition: the conditional text embedding represents the target generation region, while the unconditional one defines the generation region to be avoided, and the target region is inherently much smaller than the region to be avoided.

Greater Perturbation Tolerance of Shallower Layers in the Same Step.

As shown in Fig. 5(d), we divide the denoising network into two halves (deep layers and shallow layers) and perturb them independently to observe the reward trend. Perturbing the text embeddings within the cross-attention modules of the deep layers rarely yields a significant increase in the reward, while perturbation in the shallow layers quickly results in a rise in the reward. This indicates that the deep features are utilized for semantic restoration; consequently, corrupting the semantics leads to a degradation in denoising performance, a phenomenon previously demonstrated in numerous studies [loos2025latent, wang2025dynamic, park2024explaining].

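Table 1: Results of test-time scaling methods w/ and w/o our framework on SDXL.

| Sampling Method | w/ TEP | HPSv2 ↑ | ImageReward ↑ | CLIPScore ↑ | AestheticScore ↑ | GenEval ↑ |
|---|---|---|---|---|---|---|
| None (SDXL) | - | 0.269 | 0.221 | 1.012 | 5.823 | 0.53 |
| ODE-Based Methods | | | | | | |
| BoN [ma2025inference] | no | 0.284 | 0.943 | 1.034 | 6.392 | 0.59 |
| BoN [ma2025inference] | yes | 0.294 | 1.105 | 1.047 | 6.525 | 0.63 |
| ZeroOrder [ma2025inference] | no | 0.283 | 0.939 | 1.030 | 6.321 | 0.55 |
| ZeroOrder [ma2025inference] | yes | 0.293 | 0.977 | 1.044 | 6.448 | 0.62 |
| Particle Sampling Methods | | | | | | |
| CoDe [singh2025code] | no | 0.282 | 0.987 | 1.033 | 6.414 | 0.64 |
| CoDe [singh2025code] | yes | 0.301 | 1.328 | 1.068 | 6.704 | 0.76 |
| SVDD [li2024derivative] | no | 0.284 | 0.974 | 1.032 | 6.398 | 0.65 |
| SVDD [li2024derivative] | yes | 0.299 | 1.303 | 1.063 | 6.722 | 0.78 |
| DAS [kim2025test] | no | 0.285 | 1.002 | 1.045 | 6.553 | 0.65 |
| DAS [kim2025test] | yes | 0.302 | 1.379 | 1.084 | 6.836 | 0.75 |
| Resampling-Based Methods | | | | | | |
| SoP [ma2025inference] | no | 0.280 | 0.948 | 1.037 | 6.402 | 0.61 |
| SoP [ma2025inference] | yes | 0.288 | 1.032 | 1.070 | 6.536 | 0.66 |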

Unified Discriminative Perturbation.

Combining the above three points of analysis, we simultaneously perform discriminative perturbation across the three dimensions (larger perturbation for shallower layers, later steps, and the unconditional text embedding). We generate multiple images and present the range distribution of their rewards and corresponding visualizations. As shown in Fig. 6(a), this form of perturbation maintains image generation quality effectively and leads to a higher reward ceiling, which is a highly desirable characteristic for TTS tasks. Furthermore, some images exhibit better detail and visual quality after text embedding perturbation, as illustrated in Fig. 6(b). These results indicate that this discriminative scheme is an appropriate text embedding perturbation strategy for TTS.

4 Text Embedding Perturbation Framework

In this section, we fully present our Text Embedding Perturbation (TEP) framework. We elaborate in detail on our two design points in Fig. 7(a) and explain the position and timing of incorporating text embedding perturbation in Fig. 7(b).

Spatial Discriminative Adjustment for Both Formats of Randomness (Blue Color in Fig. 7(a)).

For the randomness brought by text embedding perturbation, we implement spatial discriminative perturbation by ① applying smaller perturbations to conditional text embeddings and larger perturbations to unconditional text embeddings:

$$\hat{\epsilon}_{t}=\epsilon_{\theta}(x_{t},t,\hat{E}(y_{\emptyset}))+w\left(\epsilon_{\theta}(x_{t},t,\hat{E}(y_{c}))-\epsilon_{\theta}(x_{t},t,\hat{E}(y_{\emptyset}))\right) \tag{4}$$
$$\hat{E}(y_{\emptyset})=E(y_{\emptyset})+w_{1}\epsilon_{1} \tag{5}$$
$$\hat{E}(y_{c})=E(y_{c})+w_{2}\epsilon_{2} \tag{6}$$

where $\epsilon_{1},\epsilon_{2}\sim N(0,I)$ and $w_{1}\gg w_{2}$. In this way, the model's output retains sufficient distinctiveness for high-reward potential selection even in the later denoising stages, while avoiding excessive disruption to semantics and guidance that would degrade visual quality and text faithfulness. Additionally, we implement finer-grained perturbation for conditional text embeddings: semantic embeddings (tokens before [EOS]) receive minimal perturbation to preserve semantic integrity, while padding embeddings (tokens after [EOS]) undergo stronger perturbation (still weaker than $w_{1}$) to ensure sufficient diversity. We further implement it by ② applying smaller perturbations to deeper layers and larger perturbations to shallower layers. Specifically, the interaction between the noisy latent and the text embedding takes place in the Cross Attention:

$$\text{output}=\operatorname{CrossAttn}_{i}(x_{t},E(y)) \tag{7}$$

where $\operatorname{CrossAttn}_{i}$ represents the Cross Attention of the $i$-th block. For stability, we simply set a threshold $k$ and consider layers before the $k$-th layer as shallow layers, and the remaining ones as deep layers. We then assign different relative coefficients for the perturbation applied to different layers:

$$\text{output}=\operatorname{CrossAttn}_{i}(x_{t},E(y)+s_{i}w\epsilon) \tag{8}$$
$$s_{i}=\begin{cases}1.5&\text{if }i<k\\ 0.5&\text{if }i\geq k\end{cases} \tag{9}$$

where we use the same $s_{i}$ for both the conditional and the unconditional text embedding. For spatial noise randomness, we attenuate certain high-frequency components. Specifically, we first convert the noise to the frequency domain via the Fast Fourier Transform (FFT). We then apply low-pass filtering with threshold $p$. Finally, we convert it back to the spatial domain:

$$\epsilon_{low}=\operatorname{IFFT}(\operatorname{F_{low}}(\operatorname{FFT}(\epsilon),p)) \tag{10}$$

where $\operatorname{F_{low}}$ denotes a low-pass filter with a constant threshold $p$. To ensure that the result follows a standard Gaussian distribution, we renormalize it to obtain the final injected noise:

$$\epsilon_{SDE}=(\epsilon_{low}-\operatorname{mean}(\epsilon_{low}))/\operatorname{std}(\epsilon_{low}) \tag{11}$$
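
The filtering and renormalization in Eqs. (10) and (11) can be sketched as below. This is a minimal illustration, not the released implementation: it handles a single 2D noise map, uses an ideal (hard) circular mask, and interprets the threshold `p` as the fraction of the per-axis Nyquist frequency that is kept. All three choices are assumptions, since the text does not specify the filter shape or the units of $p$.

```python
import numpy as np


def low_pass_noise(eps, p):
    """Eqs. (10)-(11): FFT -> low-pass mask -> IFFT -> renormalize.

    eps: 2D array of Gaussian noise (one latent channel).
    p:   assumed cutoff in (0, 1], as a fraction of the per-axis Nyquist
         frequency; frequencies with normalized radius > p are zeroed.
    """
    h, w = eps.shape
    fy = np.fft.fftfreq(h)[:, None]  # cycles per sample, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2) / 0.5  # 1.0 at the per-axis Nyquist
    mask = radius <= p

    spectrum = np.fft.fft2(eps)
    eps_low = np.fft.ifft2(spectrum * mask).real  # Eq. (10)

    # Eq. (11): restore zero mean and unit standard deviation.
    return (eps_low - eps_low.mean()) / eps_low.std()


rng = np.random.default_rng(0)
eps = rng.standard_normal((32, 32))
eps_sde = low_pass_noise(eps, p=0.5)
print(eps_sde.mean(), eps_sde.std())
```

A very small `p` would keep only the constant component and make the standard deviation zero, so a practical version would need a lower bound on the cutoff. Note also that the renormalization fixes the first two moments only; the filtered noise is spatially correlated.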
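
Table 2: Results of test-time scaling methods w/ and w/o our framework on flow models (SD3.5).

| Sampling Method | w/ TEP | HPSv2 ↑ | ImageReward ↑ | CLIPScore ↑ | AestheticScore ↑ | GenEval ↑ |
|---|---|---|---|---|---|---|
| None (SD3.5) | - | 0.283 | 0.547 | 1.031 | 5.602 | 0.63 |
| ODE-Based Methods | | | | | | |
| BoN [ma2025inference] | no | 0.292 | 0.992 | 1.035 | 6.271 | 0.69 |
| BoN [ma2025inference] | yes | 0.301 | 1.102 | 1.041 | 6.357 | 0.71 |
| ZeroOrder [ma2025inference] | no | 0.299 | 0.941 | 1.078 | 6.288 | 0.67 |
| ZeroOrder [ma2025inference] | yes | 0.302 | 1.108 | 1.095 | 6.520 | 0.71 |
| Particle Sampling Methods | | | | | | |
| CoDe [singh2025code] | no | 0.296 | 1.038 | 1.077 | 6.319 | 0.75 |
| CoDe [singh2025code] | yes | 0.310 | 1.411 | 1.106 | 6.646 | 0.80 |
| SVDD [li2024derivative] | no | 0.301 | 1.358 | 1.106 | 6.608 | 0.69 |
| SVDD [li2024derivative] | yes | 0.316 | 1.582 | 1.138 | 6.982 | 0.75 |
| DAS [kim2025test] | no | 0.294 | 0.907 | 1.107 | 6.281 | 0.76 |
| DAS [kim2025test] | yes | 0.307 | 1.270 | 1.116 | 6.403 | 0.81 |
| Resampling-Based Methods | | | | | | |
| SoP [ma2025inference] | no | 0.293 | 0.984 | 1.040 | 6.010 | 0.69 |
| SoP [ma2025inference] | yes | 0.301 | 1.077 | 1.056 | 6.185 | 0.72 |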

Temporal Scheduling for Better Complementarity of Both Formats of Randomness (Red Color in Fig. 7(a)).

First, considering that text embedding perturbation has a greater impact on high-frequency details in later stages and contributes more to high-frequency refinement, we progressively intensify this perturbation throughout the denoising process. Specifically, we parameterize the perturbation weights $w_{1}$ and $w_{2}$ as monotonically increasing functions of $t$:

$$\hat{E}(y_{\emptyset})=E(y_{\emptyset})+s_{i}w_{1}(t)\epsilon_{1} \tag{12}$$
$$\hat{E}(y_{c})=E(y_{c})+s_{i}w_{2}(t)\epsilon_{2} \tag{13}$$
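
To make Eqs. (9), (12), and (13) concrete, the sketch below perturbs a token-level embedding with a layer-dependent scale, a time-dependent weight, and per-token strengths. Several parts are assumptions for illustration only: the linear ramp used for $w(t)$, the `progress` variable in [0, 1] standing for how far denoising has advanced, the example weight values, and the choice to treat the [EOS] position itself as padding. None of these are specified in the text.

```python
import numpy as np


def layer_scale(i, k):
    """Eq. (9): 1.5 for shallow blocks (i < k), 0.5 for deep blocks (i >= k)."""
    return 1.5 if i < k else 0.5


def perturb_embedding(emb, token_w_max, i, k, progress, rng):
    """Eqs. (12)-(13) with a hypothetical linear schedule w(t) = w_max * progress.

    emb:         (num_tokens, dim) text embedding.
    token_w_max: (num_tokens,) maximum perturbation weight per token.
    progress:    denoising progress in [0, 1]; 0 = first step, 1 = last step.
    """
    eps = rng.standard_normal(emb.shape)
    w_t = np.asarray(token_w_max, dtype=float)[:, None] * float(progress)
    return emb + layer_scale(i, k) * w_t * eps


num_tokens, dim, eos_index = 8, 64, 4
cond = np.ones((num_tokens, dim))
uncond = np.zeros((num_tokens, dim))

# Illustrative weights only: w1 (unconditional) >> padding weight > semantic weight.
w_uncond = np.full(num_tokens, 1.0)
w_cond = np.where(np.arange(num_tokens) < eos_index, 0.01, 0.1)

rng = np.random.default_rng(0)
cond_hat = perturb_embedding(cond, w_cond, i=2, k=6, progress=0.8, rng=rng)
uncond_hat = perturb_embedding(uncond, w_uncond, i=2, k=6, progress=0.8, rng=rng)

semantic_dev = np.abs(cond_hat[:eos_index] - cond[:eos_index]).mean()
padding_dev = np.abs(cond_hat[eos_index:] - cond[eos_index:]).mean()
print(semantic_dev, padding_dev, np.abs(uncond_hat - uncond).mean())
```

With these settings the unconditional embedding receives the largest deviation, padding tokens of the conditional embedding a moderate one, and semantic tokens the smallest, matching the ordering described in the text.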

For spatial noise randomness, considering its negative impact on generation quality in later stages (primarily caused by its high-frequency components), we progressively increase the attenuation of these high-frequency elements throughout the denoising process:

$$\epsilon_{low}=\operatorname{IFFT}(\operatorname{F_{low}}(\operatorname{FFT}(\epsilon),p(t))) \tag{14}$$

The Position and Timing of Incorporating Text Embedding Perturbation.

Essentially, in TTS, randomness is introduced for sampling and selection. Simply put, for existing sampling-based TTS methods, we add perturbations to the text embedding before their sampling steps. Specifically, as shown in Fig. 7(b), we categorize these methods and discuss the integration accordingly. For ODE-based methods, which sample the initial noise only at the start and filter after denoising is complete, we perturb the text embedding only during the initial stage. For particle sampling methods, which rely on the randomness introduced by the SDE process, we perturb the text embedding again before each SDE sampling step. For resampling-based methods, we perturb the text embedding again at each resampling step.
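
For the ODE-based case, the integration point described above can be sketched as a best-of-N loop in which each candidate draws both an initial noise and a perturbed text embedding before denoising. This is a toy illustration under stated assumptions: `denoise` and `reward` are hypothetical callables standing in for a full ODE sampler and a reward model, a single scalar `sigma` replaces the discriminative schedule, and drawing one perturbation per candidate is only one reading of "perturb the text embedding only during the initial stage".

```python
import numpy as np


def best_of_n_with_tep(denoise, reward, text_emb, n, sigma, latent_shape, seed=0):
    """Best-of-N selection with text embedding perturbation (toy sketch).

    denoise(x_T, emb) -> image   : hypothetical deterministic (ODE) sampler.
    reward(image)     -> float   : hypothetical reward model.
    """
    rng = np.random.default_rng(seed)
    best_img, best_score = None, -np.inf
    for _ in range(n):
        x_T = rng.standard_normal(latent_shape)  # spatial noise randomness
        emb = text_emb + sigma * rng.standard_normal(text_emb.shape)  # TEP
        img = denoise(x_T, emb)
        score = reward(img)
        if score > best_score:
            best_img, best_score = img, score
    return best_img, best_score


# Toy stand-ins so the loop can run without a diffusion model.
def toy_denoise(x_T, emb):
    return float(x_T.mean() + emb.mean())


def toy_reward(img):
    return -abs(img)


text_emb = np.zeros((8, 16))
_, score_1 = best_of_n_with_tep(toy_denoise, toy_reward, text_emb, 1, 0.1, (4, 4))
_, score_16 = best_of_n_with_tep(toy_denoise, toy_reward, text_emb, 16, 0.1, (4, 4))
print(score_1, score_16)
```

For particle sampling and resampling-based methods, the same perturbation line would instead sit inside the denoising loop, immediately before each SDE or resampling step.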

5 Experiment

Here, we demonstrate the applications of our TEP framework on T2I generation. More extended applications (i.e., T2V generation) and computation are shown in Appendix.

Figure 8: Visual results of baselines w/ and w/o our methods.

5.1 Experimental settings

Baselines. We integrate TEP with existing TTS methods, including ODE-based methods (BoN, ZeroOrder [ma2025inference]), particle sampling methods (SVDD [li2024derivative], CoDe [singh2025code], DAS [kim2025test]), and resampling-based methods (SoP [ma2025inference]).

Backbones and their Text Encoders. We evaluate a diverse range of T2I diffusion models with different text encoders: SD2.1 [Rombach_2022_CVPR] (CLIP), SDXL [podell2023sdxl] (dualCLIP), SD3.5 [esser2024scaling] (dualCLIP+T5), Flux [flux2024] (dualCLIP+T5), and show-o2 [xie2025show] (VLM). We present the results for the most widely adopted models, SDXL and SD3.5, in the main text, with others’ results fully documented in Appendix.

Evaluations. We conduct experiments on Open-Image-Pref-v1 datasets with 7k+ prompts and GenEval benchmarks. We ensure that the number of simultaneously denoised latents at each step is fixed at 16.

Reward Models. We test our method on ImageReward [xu2023imagereward], HPSv2 [wu2023human], CLIPScore [hessel2021clipscore], and AestheticScore [Schuhmann:aesthetics]. These reward models serve both as intermediate verifiers and held-out rewards (final rewards). We additionally apply GenEval [ghosh2023geneval] as the held-out reward for ImageReward verifier. More information is shown in Appendix.

Table 3: Ablations of TEP.

| Perturbation | IR ↑ | HPSv2 ↑ | Time |
|---|---|---|---|
| None | 0.987 | 0.282 | 15.488s |
| +Text Embedding Perturbation | 1.253 | 0.297 | 15.645s |
| +Spatial Noise Adjustment | 1.148 | 0.288 | 15.703s |
| Ours | 1.328 | 0.301 | 15.732s |

5.2 Main results

Results on UNet-Based T2I Diffusion Models.

In this task, we use the same reward model as verifiers and held-out rewards. Results on GenEval are added for object-focused evaluation. As shown in Tab. 1, our method demonstrates significant improvements when seamlessly integrated with existing TTS methods. Among these, the enhancement is most pronounced for particle sampling methods, given that they not only conduct thorough searches of sampling paths but also extensively explore noise directions for generation.

Results on Flow-Based T2I Diffusion Models.

For flow models, we replace the default ODE solver with an SDE process following [liu2025flow], then apply these TTS methods together with our framework. As shown in Tab. 2, our framework also performs strongly on flow-based models, improving over all baselines. Complete tables and additional results on flow-based diffusion models are given in the Appendix.

5.3 Ablation study

Ablations on Components and their Computation.

Our framework involves both text embedding perturbation and spatial noise randomness, so we ablate each component. We perform experiments using CoDe on SD 2.1 and evaluate with ImageReward and HPSv2. As shown in Tab. 3, each component improves generation quality while adding only negligible computation. More ablations on framework design and hyperparameter settings are presented in the Appendix.
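To make "negligible additional computation" concrete, the short script below recomputes the gains and the relative time overhead of each row against the unperturbed baseline. The numbers are copied from Tab. 3; the `deltas` helper and the dictionary layout are illustrative choices, not part of the paper's code.

```python
# Values copied from Tab. 3: (ImageReward, HPSv2, wall-clock time in seconds).
table3 = {
    "None": (0.987, 0.282, 15.488),
    "+Text Embedding Perturbation": (1.253, 0.297, 15.645),
    "+Spatial Noise Adjustment": (1.148, 0.288, 15.703),
    "Ours": (1.328, 0.301, 15.732),
}


def deltas(name, baseline="None"):
    """Return (IR gain, HPSv2 gain, time overhead in percent) versus the baseline row."""
    ir0, hps0, t0 = table3[baseline]
    ir, hps, t = table3[name]
    return ir - ir0, hps - hps0, 100.0 * (t - t0) / t0


for name in table3:
    if name == "None":
        continue
    d_ir, d_hps, overhead = deltas(name)
    print(f"{name}: IR {d_ir:+.3f}, HPSv2 {d_hps:+.3f}, time {overhead:+.2f}%")
```

For the full method this works out to an ImageReward gain of 0.341 at a time overhead of roughly 1.6%.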

Table 4: Generalization of different text encoders.

| Model | Text Encoder | w/o TEP | w/ TEP |
| --- | --- | --- | --- |
| SD2.1 | CLIP | 0.945 | 1.158 |
| SDXL | dualCLIP | 0.987 | 1.328 |
| SD3.5 | dualCLIP+T5 | 1.038 | 1.411 |
| Show-o2 | VLM | 1.215 | 1.448 |
Figure 9: ImageReward as a function of (a) NDFE and (b) NRFE. In (a), our method helps the model reach a higher upper limit and maintains an upward trend even as NDFE grows. Notably, although CoDe and BoN have the same NDFE, CoDe introduces additional randomness and evaluations, resulting in better performance. We therefore use NRFE in (b) to show the benefit of more evaluations during generation, where the advantage of our framework becomes increasingly evident as NRFE grows.

Number of NDFEs and NRFEs for Generation.

In Fig. 9(a), we show how NDFE (number of denoising function evaluations) influences the generation process. NDFE depends on the number of sampling steps, the number of initial noises, and the number of intermediate sampling particles. Among these, increasing the number of sampling steps has only a minor impact on the generation metrics, whereas increasing the number of initial noises or particles raises the metrics rapidly until they converge to a stable level. However, particle sampling methods often outperform ODE-based methods even when their NDFE values are the same.

We therefore introduce NRFE (number of reward function evaluations), which also reflects the number of filtering steps performed during TTS. In Fig. 9(b), we observe that as NRFE increases, the performance of the baseline gradually improves and then converges to a stable level, which corresponds to a transition from ODE-based methods to particle sampling methods. Interestingly, after incorporating text embedding perturbation, the generation metrics continue to rise and reach a higher upper limit. The reason is that NRFE also reflects the number of perturbations we introduce: these perturbations give the model a larger search space with a higher upper bound, and as NRFE increases, the search becomes more thorough and approaches this upper bound more closely.
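The observation that two methods can share the same NDFE while differing in NRFE can be sketched with a small budget counter. This is a simplified accounting under stated assumptions, not the paper's exact definition: every latent is denoised once per step, a BoN-like method scores each latent only once at the end, and a particle-style method scores all latents after every `filter_every`-th step and at the final step. The 50-step setting, `count_budget`, and `filter_every` are hypothetical; only the 16 simultaneous latents come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Budget:
    ndfe: int  # number of denoising function evaluations
    nrfe: int  # number of reward function evaluations


def count_budget(num_latents: int, num_steps: int,
                 filter_every: Optional[int] = None) -> Budget:
    """Count evaluations under the simplified accounting described above.

    filter_every=None models a BoN-like method (one scoring pass at the end).
    Otherwise all latents are scored after every `filter_every`-th step and
    once more at the final step if it is not already a scoring step.
    """
    if filter_every is not None and filter_every <= 0:
        raise ValueError("filter_every must be a positive integer")
    ndfe = num_latents * num_steps
    if filter_every is None:
        scoring_steps = 1
    else:
        steps = {s for s in range(1, num_steps + 1) if s % filter_every == 0}
        steps.add(num_steps)
        scoring_steps = len(steps)
    return Budget(ndfe=ndfe, nrfe=num_latents * scoring_steps)


bon_like = count_budget(num_latents=16, num_steps=50)
particle_like = count_budget(num_latents=16, num_steps=50, filter_every=10)
print(bon_like, particle_like)
```

Under these assumptions both settings spend the same denoising budget, while the particle-style setting performs five times as many reward evaluations, which is the axis plotted in Fig. 9(b).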

Generalizations on Text Encoders.

Our approach generalizes well because applying a reasonable perturbation to the text embedding produced by any text encoder consistently improves the performance of diffusion models in TTS tasks, as shown in Tab. 4. We run CoDe with and without TEP and use ImageReward as the evaluation metric.
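Because the perturbation acts on the embedding rather than on the encoder, it can be written without reference to any particular encoder. The sketch below is a minimal illustration under loud assumptions: the Gaussian noise model, the linear ramp in `sigma_at`, the value of `sigma_max`, and the toy embedding are all hypothetical placeholders, and they do not reproduce the paper's frequency-guided schedule or its selective intensity adaptation.

```python
import random
from typing import List

Embedding = List[List[float]]  # tokens x channels, from any text encoder


def sigma_at(step: int, num_steps: int, sigma_max: float = 0.05) -> float:
    """Hypothetical linear ramp: zero at the first step, sigma_max at the last."""
    if num_steps < 2:
        return sigma_max
    return sigma_max * step / (num_steps - 1)


def perturb(embedding: Embedding, sigma: float, rng: random.Random) -> Embedding:
    """Add i.i.d. Gaussian noise with standard deviation sigma to every entry."""
    return [[x + rng.gauss(0.0, sigma) for x in token] for token in embedding]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
emb = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]  # toy 2-token, 3-channel embedding
num_steps = 4

for step in range(num_steps):
    noisy = perturb(emb, sigma_at(step, num_steps), rng)
    # `noisy` would condition the denoiser at this step; `emb` stays untouched.
    print(step, round(sigma_at(step, num_steps), 4), noisy[0])
```

The function only assumes a tokens-by-channels layout, so the same call applies to embeddings of any length or width, which mirrors the encoder-agnostic behavior reported in Tab. 4.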

6 Conclusion

We introduce a novel format of randomness for TTS in T2I diffusion models, text embedding perturbation, which significantly enhances both the visual quality and the textual fidelity of generated images. While T2I diffusion models continue to advance with remarkable generative potential, current TTS methods may struggle to fully exploit these capabilities. We highlight the importance of further exploring TTS for T2I diffusion models to unlock this potential.