Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling

Mingzhuo Li1  Guang Li1  Jiafeng Mao2  Linfeng Ye3  Takahiro Ogawa1  Miki Haseyama1

1Hokkaido University  2The University of Tokyo  3University of Toronto
Correspondence to: Guang Li (guang@lmd.ist.hokudai.ac.jp)
Abstract

To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve performance comparable to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to better meet the requirements of the target task. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks. The code is available at https://github.com/SumomoTaku/DiffGuideSamp.

1 Introduction

With the rapid advancement of deep learning, deep neural networks have gained significant attention due to their extensive applications across various domains, particularly in computer vision [10]. However, these networks typically rely on large-scale datasets to obtain high performance, which results in extended training times that often span several hours or even days, and substantial demands on computational resources [29]. Moreover, the storage and management of massive datasets involve considerable time and financial costs. Dataset distillation [39] has emerged as a promising solution to mitigate these challenges by distilling the original dataset into a compact and high-quality synthetic dataset, which can train models to achieve performance comparable to that obtained using the original dataset.

Figure 1: The workflow of the proposed method, with the overall linear process marked in green. The generation of the image pool is indicated in blue, with optimization strategies aligning the original and distilled distributions. The sampling of the distilled dataset is indicated in orange, using the difficulty distribution derived from the target task. The two components focus on different aspects of the Information Bottleneck optimization objective and are expected to function complementarily to enhance overall performance.

Since its introduction, dataset distillation has attracted significant attention, with a growing number of studies contributing to its rapid advancement [19, 27]. Current dataset distillation methods can be broadly categorized into non-generative and generative ones. Traditional non-generative methods aim to optimize a fixed set of synthetic images, with the size determined by image-per-class (IPC). The optimization is achieved by aligning specific training targets with those derived from the original dataset, under the assumption that models with similar alignment behavior will achieve comparable performance on downstream tasks. Various alignment targets have given rise to different methods, including gradient/trajectory matching [43, 1, 20, 22], distribution/feature matching [45, 35, 26, 5], and kernel-based methods [30, 4].

In contrast, generative dataset distillation methods utilize generative models [44, 2, 23, 24] to produce high-quality synthetic images, which is made feasible by embedding knowledge of the dataset into the model. This modification offers the flexibility to generate datasets of any size on demand, effectively removing the constraint of IPC and reducing time costs, which is particularly beneficial for scenarios such as continual learning [28, 13], federated learning [21, 15], privacy preservation [17, 18, 46], and neural architecture search [8]. Among these models, diffusion models, such as Imagen [34] and Stable Diffusion [33], have shown exceptional promise for their robustness and adaptability, promoting increasing interest in leveraging them for effective dataset distillation [12, 37, 38].

While current generative dataset distillation methods have demonstrated promising performance, the approaches primarily focus on guiding the model with knowledge extracted from the original dataset, overlooking the information specific to the downstream task [12, 37, 25]. This discrepancy between the training objective and the target task may lead to incomplete information during training, limiting the model’s optimal performance. To address this issue, we propose leveraging the relevance of the distilled dataset concerning the downstream task, aiming to generate datasets with superior performance on the target task.

In this paper, we focus on the downstream task of classification and introduce a difficulty-guided sampling strategy to enhance the performance of generative dataset distillation. An image pool of generated images is first obtained using a generative dataset distillation method whose optimization objective aligns the diversity and representativeness of the original and distilled datasets. The final distilled dataset is then selected by aligning the difficulty distribution of the image pool with that of the original dataset. As generative models tend to produce samples biased toward lower difficulty (i.e., easier samples), a logarithmic transformation is introduced as a pre-processing step for distributional correction. Extensive experiments on various downstream models and datasets demonstrate the effectiveness of the proposed method. The contributions of this paper can be summarized as follows:

  • We propose a difficulty-guided sampling strategy that utilizes extra information related to the classification task, achieving task-specific dataset distillation.

  • We conduct sampling on an image pool following the difficulty distribution of the original dataset, and propose a logarithmic transformation to eliminate the bias of the image pool towards easy samples.

2 Dataset Distillation with Difficulty-Guided Sampling

This section is organized as follows. We begin by reviewing a widely adopted generative dataset distillation pipeline, which is based on aligning the distribution between the distilled and original datasets. We then present the detailed implementation of difficulty-guided sampling, supported by theoretical analysis. Finally, we illustrate the logarithmic transformation, which is designed to obtain effective sample selection. The workflow of the proposed method is shown in Fig. 1.

2.1 Preliminary

Latent diffusion models [32] operate in the latent space rather than directly in the pixel space, showing enhanced ability to capture abstract features. Given an image $\bm{x}$ from the original dataset $D$, it is first encoded into a latent vector $\bm{z}_0$ by the VAE encoder. A noisy latent $\bm{z}_t$ is then obtained by sequentially adding Gaussian noise $\epsilon \sim \mathcal{N}(\bm{0},\bm{I})$ to $\bm{z}_0$ over $t$ steps as follows:

$\bm{z}_t = \sqrt{\overline{\alpha}_t}\,\bm{z}_0 + \sqrt{1-\overline{\alpha}_t}\,\epsilon,$ (1)

where $\overline{\alpha}_t$ denotes a hyper-parameter known as the variance schedule. The diffusion model parameterized by $\theta$ is trained to predict the added noise $\epsilon$, conditioned on class information $\bm{c}$ obtained via a class encoder. The training objective minimizes the discrepancy between the predicted noise $\epsilon_{\theta}(\bm{z}_t, t, \bm{c})$ and the ground truth $\epsilon$ as follows:

$\mathcal{L}_{\text{diffusion}} = \arg\min_{\theta} \|\epsilon_{\theta}(\bm{z}_t, t, \bm{c}) - \epsilon\|_2^2.$ (2)

Once trained, the model is capable of generating images by iteratively denoising random noise, thereby achieving high-quality image synthesis.
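To make the forward process and training objective concrete, the snippet below gives a minimal PyTorch sketch of Eq. (1) and Eq. (2). The noise-prediction network `eps_model(z_t, t, c)` and the precomputed `alpha_bar` schedule are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the forward noising step (Eq. 1) and the
# noise-prediction loss (Eq. 2); names are illustrative assumptions.
import torch

def forward_noise(z0: torch.Tensor, alpha_bar_t: torch.Tensor):
    """Add Gaussian noise to the clean latent z0 at timestep t."""
    eps = torch.randn_like(z0)
    z_t = torch.sqrt(alpha_bar_t) * z0 + torch.sqrt(1.0 - alpha_bar_t) * eps
    return z_t, eps

def diffusion_loss(eps_model, z0, t, c, alpha_bar):
    """Mean squared error between predicted and ground-truth noise."""
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)       # broadcast over latent dims
    z_t, eps = forward_noise(z0, a_bar)
    eps_pred = eps_model(z_t, t, c)              # conditioned on class vector c
    return ((eps_pred - eps) ** 2).flatten(1).sum(-1).mean()
```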

To leverage diffusion models for dataset distillation, Minimax [12] introduces an approach that aims to maximize both the representativeness and diversity of the distilled dataset. Two auxiliary memory sets are constructed to facilitate the calculation: a representativeness memory $\mathcal{M}_r$ containing real images and a diversity memory $\mathcal{M}_d$ containing generated images. Representativeness is defined as the similarity between the generated and original datasets, leading to the following optimization objective:

$\mathcal{L}_{\text{repre}} = \arg\max_{\theta} \min_{\bm{z}_r \in \mathcal{M}_r} \sigma(\hat{\bm{z}}_{\theta}(\bm{z}_t, \bm{c}), \bm{z}_r),$ (3)

where $\sigma(\cdot, \cdot)$ denotes the cosine similarity and $\hat{\bm{z}}_{\theta}(\bm{z}_t, \bm{c})$ is the latent predicted by the diffusion model $f_{\theta}$ from the input latent $\bm{z}_t$ conditioned on the class vector $\bm{c}$. Similarly, diversity is defined based on the dissimilarity among the generated images, with the following optimization objective:

$\mathcal{L}_{\text{div}} = \arg\min_{\theta} \max_{\hat{\bm{z}}_g \in \mathcal{M}_d} \sigma(\hat{\bm{z}}_{\theta}(\bm{z}_t, \bm{c}), \hat{\bm{z}}_g).$ (4)

By combining $\mathcal{L}_{\text{div}}$ and $\mathcal{L}_{\text{repre}}$ with the diffusion loss $\mathcal{L}_{\text{diffusion}}$, the model is guided to produce distilled datasets of higher quality, thereby improving performance on downstream tasks.
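A minimal sketch of how the two terms in Eqs. (3) and (4) could be computed is given below, assuming the predicted latents and both memory sets have been flattened to vectors; the function names, the flattening, and the weighted combination in the trailing comment are illustrative assumptions rather than the official Minimax code.

```python
# Sketch of the representativeness and diversity terms (Eqs. 3-4).
# z_hat: (B, D) predicted latents; mem_r / mem_d: (M, D) memory latents.
import torch
import torch.nn.functional as F

def minimax_terms(z_hat, mem_r, mem_d):
    # Cosine similarity of each prediction against every memory entry.
    sim_r = F.cosine_similarity(z_hat.unsqueeze(1), mem_r.unsqueeze(0), dim=-1)
    sim_d = F.cosine_similarity(z_hat.unsqueeze(1), mem_d.unsqueeze(0), dim=-1)
    # Representativeness: raise the *lowest* similarity to the real memory.
    loss_repre = -sim_r.min(dim=1).values.mean()
    # Diversity: lower the *highest* similarity to already generated samples.
    loss_div = sim_d.max(dim=1).values.mean()
    return loss_repre, loss_div

# total = loss_diffusion + lambda_r * loss_repre + lambda_d * loss_div  (assumed weighting)
```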

Although this method effectively leverages features from the original dataset, it overlooks information specific to the downstream task. This omission can lead to a mismatch between the optimization objective during training and the target downstream task, such as classification, limiting the achievable performance.

2.2 Difficulty-Guided Sampling

From the perspective of the Information Bottleneck (IB) principle, the objective of dataset distillation can be redefined as follows. For the original dataset $X$ and the target downstream task $Y$, the goal is to find a compressed dataset $T$ that discards irrelevant details from $X$ while retaining the information relevant to $Y$. Since the original dataset is no longer used during the downstream application, $Y$ is conditionally independent of $X$ given $T$, resulting in the Markov chain $X \to T \to Y$. This satisfies the Markov assumption required by IB, leading to the following objective:

$\mathcal{L}_{\text{IB}} = \min_{T} I(X;T) - \beta\, I(T;Y),$ (5)

where $I(X;T)$ and $I(T;Y)$ denote the mutual information between $X$ and $T$, and between $T$ and $Y$, respectively, and $\beta$ is a Lagrange multiplier. The former term improves the level of compression, while the latter enhances the predictability of the target. Balancing these two objectives helps construct distilled datasets that are both compact and effective.

Recent generative dataset distillation methods primarily focus on optimizing the distribution of the distilled dataset with respect to the original dataset. For example, the aforementioned Minimax enhances diversity and representativeness, while MGD3 [3] guides the denoising process toward desired distributional regions. These approaches can be broadly categorized as efforts to extract more features from the original dataset, making the distilled dataset resemble the original distribution. Since the original dataset inherently contains rich information, including labels for classification, such efforts implicitly benefit various downstream tasks. In other words, these approaches implicitly improve $I(T;Y)$ by explicitly improving $I(X;T)$, and the overall performance depends on balancing the two factors. In the absence of task-specific considerations, the enhancement of $I(T;Y)$ is limited to the information inherent in the original dataset, which may result in suboptimal performance on the specific downstream task.

To address this issue, we propose incorporating task-specific information to leverage the relevance between the distilled dataset $T$ and the target task $Y$, explicitly improving $I(T;Y)$ for better performance. Inspired by the findings of Wang et al. [40], which demonstrate the effectiveness of controlling sample difficulty for dataset enhancement, we introduce difficulty as a proxy to quantify the information content relevant to the classification task.

The difficulty $\mathcal{D}_x$ of an image $x$ is defined as one minus the confidence $P$ assigned to the correct class $y_{true}$ by a pre-trained classification model $f_{\theta}$ as follows:

𝒟x=1Pfθ(ytrue|x).subscript𝒟𝑥1subscript𝑃subscript𝑓𝜃conditionalsubscript𝑦𝑡𝑟𝑢𝑒𝑥\mathcal{D}_{x}=1-P_{f_{\theta}}(y_{true}|x).caligraphic_D start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = 1 - italic_P start_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_t italic_r italic_u italic_e end_POSTSUBSCRIPT | italic_x ) . (6)

As illustrated in Fig. 1, an image pool with a total size of $n \times \text{IPC}$ is first constructed by collecting distilled images generated by the distillation pipeline of Minimax. The difficulty of each image in the pool is then computed to serve as additional task-specific information. Sampling is subsequently performed over the image pool following a specific sampling distribution.

Assuming the original dataset represents the ground truth for optimal performance, we hypothesize that a distilled dataset exhibiting a similar difficulty distribution to the original one may yield improved performance. Consequently, the sampling distribution is obtained by scaling the difficulty distribution of the original dataset to match the IPC. The effectiveness of this scaling-based sampling is supported by our experiments in Section 3.3, where we compare its performance with several pre-defined sampling distributions.
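The per-class sampling step can be sketched as follows: bin the difficulty scores of the original dataset, scale the histogram so that it sums to IPC, and draw that many pool images from each bin. The bin count, random seed, and handling of under-populated bins are assumptions of this sketch, not the released implementation.

```python
# Difficulty-guided sampling for one class (illustrative sketch).
import numpy as np

def sample_by_difficulty(pool_scores, orig_scores, ipc, n_bins=10, rng=None):
    """Return indices into the image pool whose difficulty histogram
    matches the original dataset's histogram scaled to IPC images."""
    rng = rng if rng is not None else np.random.default_rng(0)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    orig_hist, _ = np.histogram(orig_scores, bins=bins)
    # Scale the original difficulty distribution so the quotas sum to IPC.
    quota = np.floor(orig_hist / orig_hist.sum() * ipc).astype(int)
    quota[np.argmax(orig_hist)] += ipc - quota.sum()   # assign rounding remainder
    pool_bin = np.digitize(pool_scores, bins[1:-1])    # bin index for each pool image
    chosen = []
    for b in range(n_bins):
        idx = np.where(pool_bin == b)[0]
        take = min(quota[b], len(idx))                 # a bin may be under-populated
        if take > 0:
            chosen.extend(rng.choice(idx, size=take, replace=False).tolist())
    return np.array(chosen)
```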

Figure 2: The difficulty distributions of different datasets, illustrated with two classes of ImageWoof. The x-axis represents difficulty intervals, while the y-axis indicates the number of images per interval. The average difficulty of each dataset is annotated in the title. The lower and upper rows show the sampling process with and without the logarithmic transformation, respectively.

2.3 Pre-processing of Logarithmic Function

However, a notable bias exists between the difficulty distributions of the original and generated datasets [40]. Distilled datasets, particularly those produced by generative models, are also subject to this bias and tend to contain a higher proportion of easy samples, as illustrated in Fig. 2. This imbalance leaves certain regions of the sampling distribution uncovered, distorting the difficulty distribution of the distilled dataset and necessitating additional corrective steps.

To address this issue, a logarithmic transformation is applied to align the difficulty distributions of the original dataset and the image pool, enabling better sampling. The uniform distribution is selected as the transformation target, following the idea that classification models benefit from balanced data. However, because many images cluster around similar difficulty values, particularly at the lower and upper extremes, directly applying the logarithmic function may amplify the influence of extreme values and affect overall stability.

Hence, thresholding at both the start and end of the original difficulty distribution $P_X(n)$ is introduced to stabilize the transformation and prevent the dominance of extreme values. The clipped distribution $P^{\prime}_X(n)$ is obtained as follows:

PX(n)=H(nb)PX(n)H(Nnt)+ϵ,subscriptsuperscript𝑃𝑋𝑛𝐻𝑛𝑏subscript𝑃𝑋𝑛𝐻𝑁𝑛𝑡italic-ϵP^{\prime}_{X}(n)=H(n-b)\ P_{X}(n)\ H(N-n-t)+\epsilon,italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_n ) = italic_H ( italic_n - italic_b ) italic_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_n ) italic_H ( italic_N - italic_n - italic_t ) + italic_ϵ , (7)

where b𝑏bitalic_b and t𝑡titalic_t denote the bottom and top thresholds, respectively. N𝑁Nitalic_N is the size of PX(n)subscript𝑃𝑋𝑛P_{X}(n)italic_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_n ), H(n)𝐻𝑛H(n)italic_H ( italic_n ) is the Heaviside step function and ϵitalic-ϵ\epsilonitalic_ϵ is a small value to avoid mathematic error. To keep the range between 00 and 1111, the logarithmic transformation f𝑓fitalic_f is defined as follows:

$f(P_X, b, t) = \dfrac{\ln\!\left(P^{\prime}_X(n) / \min(P^{\prime}_X(n))\right)}{\ln\!\left(\max(P^{\prime}_X(n)) / \min(P^{\prime}_X(n))\right)}.$ (8)
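A minimal NumPy sketch of the clipping in Eq. (7) and the normalized logarithmic transform in Eq. (8) is given below; interpreting $b$ and $t$ as the numbers of bins zeroed at the bottom and top of the histogram is an assumption of this sketch.

```python
# Clipped, normalized logarithmic transform of a difficulty histogram.
import numpy as np

def log_transform(p_x, b, t, eps=1e-8):
    """Zero the lowest b and highest t bins, then apply the normalized log map."""
    p = np.asarray(p_x, dtype=float).copy()
    if b > 0:
        p[:b] = 0.0            # corresponds to H(n - b) in Eq. (7)
    if t > 0:
        p[-t:] = 0.0           # corresponds to H(N - n - t) in Eq. (7)
    p = p + eps                # avoid log(0)
    log_p = np.log(p / p.min())
    denom = np.log(p.max() / p.min())
    return log_p / denom if denom > 0 else np.zeros_like(p)   # values in [0, 1]
```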
Table 1: Comparison of downstream validation accuracy with other SOTA methods on ImageWoof. The results are obtained with ConvNet-6, ResNetAP-10, and ResNet-18 as test models. The best results are marked in bold.
IPC (Ratio) | Test Model | Random | K-Center [36] | Herding [41] | DiT [31] | DM [45] | IDC-1 [16] | Minimax [12] | Ours | Full Dataset
10 (0.8%) | ConvNet-6 | 24.3±1.1 | 19.4±0.9 | 26.7±0.5 | 34.2±1.1 | 26.9±1.2 | 33.3±1.1 | 34.1±0.4 | 35.1±0.5 | 86.4±0.2
10 (0.8%) | ResNetAP-10 | 29.4±0.8 | 22.1±0.1 | 32.0±0.3 | 34.7±0.5 | 30.3±1.2 | 37.3±0.4 | 35.7±0.3 | 37.4±0.3 | 87.5±0.5
10 (0.8%) | ResNet-18 | 27.7±0.9 | 21.1±0.4 | 30.2±1.2 | 34.7±0.4 | 33.4±0.7 | 36.9±0.4 | 35.3±0.4 | 35.9±0.6 | 89.3±1.2
20 (1.6%) | ConvNet-6 | 29.1±0.7 | 21.5±0.8 | 29.5±0.3 | 36.1±0.8 | 29.9±1.0 | 35.5±0.8 | 36.9±1.2 | 38.1±0.2 | 86.4±0.2
20 (1.6%) | ResNetAP-10 | 32.7±0.4 | 25.1±0.7 | 34.9±0.1 | 41.1±0.8 | 35.2±0.6 | 42.0±0.4 | 43.3±0.3 | 45.5±0.4 | 87.5±0.5
20 (1.6%) | ResNet-18 | 29.7±0.5 | 23.6±0.3 | 32.2±0.6 | 40.5±0.5 | 29.8±1.7 | 38.6±0.2 | 40.9±0.6 | 43.4±1.0 | 89.3±1.2
50 (3.8%) | ConvNet-6 | 41.3±0.6 | 36.5±1.0 | 40.3±0.7 | 46.5±0.8 | 44.4±1.0 | 43.9±1.2 | 51.4±0.4 | 52.0±0.6 | 86.4±0.2
50 (3.8%) | ResNetAP-10 | 47.2±1.3 | 40.6±0.4 | 49.1±0.7 | 49.3±0.2 | 47.1±1.1 | 48.3±1.0 | 54.4±0.6 | 57.1±0.9 | 87.5±0.5
50 (3.8%) | ResNet-18 | 47.9±1.8 | 39.6±1.0 | 48.3±1.2 | 50.1±0.5 | 46.2±0.6 | 48.3±0.8 | 53.9±0.6 | 54.9±0.1 | 89.3±1.2

While the introduction of thresholds helps produce a more balanced difficulty distribution, it also introduces distortion by artificially modifying some values. The Kullback-Leibler (KL) divergence is therefore introduced to measure distribution-level differences and to assist in determining the appropriate clipping level. Targeting similarity to both the uniform distribution $\mathcal{U}$ and the original difficulty distribution $P_X(n)$, the optimal threshold values are determined as follows:

b,t=argminb,t(λDKL(f(PX,b,t)||PX)+(1λ)DKL(f(PX,b,t)||𝒰)),\displaystyle\begin{split}b^{*},t^{*}=\arg\min_{b,t}(\lambda\ D_{\text{KL}}(f(% P_{X},b,t)||P_{X})\\ +(1-\lambda)\ D_{\text{KL}}(f(P_{X},b,t)||\mathcal{U})),\end{split}start_ROW start_CELL italic_b start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_t start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_arg roman_min start_POSTSUBSCRIPT italic_b , italic_t end_POSTSUBSCRIPT ( italic_λ italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_f ( italic_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT , italic_b , italic_t ) | | italic_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ) end_CELL end_ROW start_ROW start_CELL + ( 1 - italic_λ ) italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_f ( italic_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT , italic_b , italic_t ) | | caligraphic_U ) ) , end_CELL end_ROW (9)

where $D_{\text{KL}}(P\,\|\,Q)$ denotes the KL divergence of $P$ from $Q$, and $\lambda \in [0,1]$ is a weighting factor controlling the trade-off between uniformity and similarity to the original distribution.
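The threshold search in Eq. (9) can be sketched as a simple grid search over candidate $(b, t)$ pairs, reusing the `log_transform` helper from the previous sketch; the candidate range `max_clip` and the value of $\lambda$ are illustrative assumptions.

```python
# Grid search for the clipping thresholds that minimize the weighted KL objective.
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL divergence between two histograms after normalization."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def search_thresholds(p_x, max_clip=5, lam=0.5):
    """Return the (b, t) pair minimizing Eq. (9) over a small grid."""
    p_x = np.asarray(p_x, dtype=float)
    uniform = np.full(len(p_x), 1.0 / len(p_x))
    best, best_score = (0, 0), float("inf")
    for b in range(max_clip + 1):
        for t in range(max_clip + 1):
            f_p = log_transform(p_x, b, t)   # helper from the previous sketch
            score = lam * kl_div(f_p, p_x) + (1 - lam) * kl_div(f_p, uniform)
            if score < best_score:
                best, best_score = (b, t), score
    return best
```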

Through the above procedure, we obtain a distilled dataset that matches the difficulty distribution of the original dataset. In the specific downstream task of classification, it incorporates additional task-relevant information and is therefore expected to have improved performance compared to existing approaches.

3 Experiments

3.1 Datasets and Evaluation

To validate the effectiveness of the proposed method, extensive experiments are conducted on three 10-class subsets of the full-sized ImageNet [6] dataset: ImageWoof [9], ImageNette [9], and ImageIDC [16]. These subsets differ in classification difficulty, with ImageWoof being the most challenging, consisting of 10 specific dog breeds. ImageNette consists of 10 classes that are easy to classify, and ImageIDC contains 10 classes randomly selected from ImageNet. We evaluate the classification accuracy of the proposed method and compare it with several SOTA methods, including dataset selection methods such as Random, K-Center [36], and Herding [41], non-generative dataset distillation methods such as DM [45] and IDC-1 [16], and generative dataset distillation methods such as DiT [31] and Minimax [12]. The models used for validation are ConvNet-6 [11], ResNet-18 [14], and ResNet-10 with average pooling (ResNetAP-10) [14], trained with a learning rate of 0.01; top-1 accuracy is reported.

Table 2: Comparison of downstream validation accuracy with other SOTA methods on different ImageNet subsets. The results are obtained with ResNetAP-10. The best results are marked in bold.
Dataset | IPC | Random | DiT [31] | Minimax [12] | Ours
ImageNette | 10 | 54.2±1.6 | 59.1±0.7 | 59.8±0.3 | 61.5±0.9
ImageNette | 20 | 63.5±0.5 | 64.8±1.2 | 66.3±0.4 | 66.9±0.5
ImageNette | 50 | 76.1±1.1 | 73.3±0.9 | 75.2±0.2 | 76.8±0.7
ImageIDC | 10 | 48.1±0.8 | 54.1±0.4 | 60.3±1.0 | 61.6±0.7
ImageIDC | 20 | 52.5±0.9 | 58.9±0.2 | 63.9±0.4 | 64.3±0.5
ImageIDC | 50 | 68.1±0.7 | 64.3±0.6 | 74.1±0.2 | 74.2±0.7

The image pool is created using the Minimax pipeline with its default parameters and settings. The diffusion model is a pre-trained DiT [31] fine-tuned with DiffFit [42], with a VAE [7] as the encoder. Input images are randomly arranged and transformed to 256×256 pixels. The number of denoising steps in the sampling process is 50. The distillation process lasts for 8 epochs with a mini-batch size of 8. AdamW with a learning rate of 1e-3 is adopted as the optimizer. A ResNet-50 trained on the full ImageNet dataset is used as the pre-trained model for obtaining difficulty scores. Each experiment is repeated 3 times, and the mean and standard deviation are reported.

3.2 Benchmark Results

Figure 3: Visualization of images from the original and distilled datasets with their difficulty scores.

First, we compare the proposed method on ImageWoof with different classification models and various IPC settings to show its cross-architecture effectiveness. As shown in Table 1, our method achieves the best accuracy in most settings, especially under higher IPC, demonstrating its ability to enhance the task-specific performance of dataset distillation.

We then verify the generalization performance of the proposed method by conducting experiments on additional datasets. As shown in Table 2, the performance trends observed on ImageNette and ImageIDC generally correspond with those on ImageWoof, with our method achieving the best performance in most experiments.

To provide an intuitive understanding of our method, we illustrate the difficulty distributions of different datasets during the sampling process in Fig. 2, using two example classes, “Shih-Tzu” (n02086240) and “English Foxhound” (n02089973), from ImageWoof. As shown in the figure, the image pool obtained by generative models exhibits a strong bias toward easy samples, failing to reflect the difficulty characteristics of the original dataset. As a result, when sampling directly follows the original distribution, many difficulty intervals remain unrepresented in the distilled dataset, reducing the effect of sampling. By contrast, the logarithmic transformation flattens the difficulty distribution of the image pool, facilitating the sampling of images that match the target distribution. However, it also alters the difficulty distribution of the original dataset after transformation, highlighting the need for further discussion of this impact and potential remedies, such as adjusting the transformation parameters.

We also visualize the images of the aforementioned two classes in both the original and distilled datasets in Fig. 3, along with their corresponding difficulty scores. The comparison reveals that the distilled dataset contains images of various difficulties and visual characteristics, indicating good sample diversity. Additionally, images of the same difficulty share some common features, suggesting the potential factors that contribute to the difficulty.

3.3 Sampling Distribution

When obtaining the sampling distribution, we hypothesize that a distribution similar to that of the original dataset contributes to enhanced performance. We validate this hypothesis in Table 3, where we compare the downstream performance of various pre-defined sampling distributions, with “scale” denoting the strategy of scaling the difficulty distribution of the original dataset. As illustrated in Fig. 4, the four pre-defined distributions “hill”, “ground”, “slope”, and “cliff” are named after their shapes and contain increasing proportions of easy samples.

Table 3: Comparison of downstream validation accuracy for different sampling distributions. The label “scale” refers to the strategy of scaling the difficulty distribution of the original dataset. The results are obtained with ResNetAP-10 on ImageWoof. The best results are marked in bold.
Distribution | IPC = 10 | IPC = 20 | IPC = 50
Hill | 35.8±0.2 | 41.7±0.3 | 56.9±0.5
Ground | 36.7±0.7 | 42.7±0.8 | 55.0±0.6
Slope | 37.8±0.4 | 42.7±0.7 | 56.1±0.6
Cliff | 37.4±0.6 | 44.3±0.3 | 56.6±0.8
Scale | 37.4±0.3 | 45.5±0.4 | 57.1±0.9
Figure 4: Visualization of pre-defined sampling distributions under different IPC settings, with increasing proportions of easy samples from left to right. The x-axis represents difficulty intervals, while the y-axis indicates the desired number of selected images per interval.

Experimental results show that distilled datasets sampled using the “scale” distribution achieve the best performance at IPC = 20 and 50, likely due to class-wise differences in difficulty distributions within the dataset. Further comparisons among the pre-defined sampling distributions reveal that smaller sampled datasets benefit from a higher proportion of easier samples, whereas larger sampled datasets benefit from a higher proportion of more difficult samples. This finding suggests that adjusting the proportion of easy and difficult samples according to the IPC setting may lead to further improvement.

3.4 Size of Image Pool

Since the final distilled dataset is sampled from the image pool, the pool size can influence overall performance, necessitating efforts to determine an appropriate size. To this end, we construct image pools of varying sizes $n \times \text{IPC}$ and conduct experiments to identify a suitable value of $n$ for practical implementation.

As shown in Table 4, the classification accuracy varies with the size of the image pool. Based mainly on the results under higher IPC settings, a size of $5 \times \text{IPC}$ yields the best overall performance and is therefore adopted in subsequent experiments. This behavior can be attributed to a trade-off: while a larger image pool increases diversity, it also introduces redundancy, especially given the concentrated difficulty distributions shown in Fig. 2. Moreover, the pool size affects the selection of the threshold parameters in the logarithmic transformation, which are also applied to the original dataset, resulting in different sampling distributions.

Table 4: Comparison of downstream validation accuracy for different sizes of image pool. The results are obtained with ResNetAP-10 on ImageWoof. The best results are marked in bold.
Size | IPC = 10 | IPC = 20 | IPC = 50
2 × IPC | 37.9±0.4 | 44.3±0.3 | 55.8±0.6
3 × IPC | 35.9±0.8 | 41.5±0.5 | 56.7±0.4
4 × IPC | 38.4±0.6 | 43.7±0.5 | 55.4±0.9
5 × IPC | 37.4±0.3 | 45.5±0.4 | 57.1±0.9
6 × IPC | 34.5±0.8 | 42.7±1.1 | 54.9±1.3

4 Conclusion

In this paper, we have proposed a difficulty-guided sampling method to improve the task-specific performance of dataset distillation. Unlike previous methods that focus only on the information shared between the distilled and original datasets, we evaluate the information relevant to the target downstream task by using the difficulty distribution to guide sampling, offering complementary optimization under the Information Bottleneck principle. The proposed method achieves state-of-the-art performance on the specific task of classification in most experiments, verifying the effectiveness of difficulty-guided sampling. Moreover, the results support the value of task-specific information, suggesting its potential for enhancing performance on other downstream tasks.

References

  • Cazenavette et al. [2022] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. In Proc. CVPR, pages 10718–10727, 2022.
  • Cazenavette et al. [2023] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, and Jun-Yan Zhu. Generalizing dataset distillation via deep generative prior. In Proc. CVPR, pages 3739–3748, 2023.
  • Chan-Santiago et al. [2025] Jeffrey A. Chan-Santiago, Praveen Tirupattur, Gaurav Kumar Nayak, Gaowen Liu, and Mubarak Shah. MGD3: Mode-guided dataset distillation using diffusion models. In Proc. ICML, 2025.
  • Chen et al. [2024] Yilan Chen, Wei Huang, and Tsui-Wei Weng. Provable and efficient dataset distillation for kernel ridge regression. In Proc. NeurIPS, 2024.
  • Cui et al. [2025] Xiao Cui, Yulei Qin, Wengang Zhou, Hongsheng Li, and Houqiang Li. OPTICAL: Leveraging optimal transport for contribution allocation in dataset distillation. In Proc. CVPR, 2025.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, pages 248–255, 2009.
  • Kingma and Welling [2013] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, pages 1–14, 2013.
  • Ding et al. [2024] Mucong Ding, Yuancheng Xu, Tahseen Rabbani, Xiaoyu Liu, Brian Gravelle, Teresa Ranadive, Tai-Ching Tuan, and Furong Huang. Calibrated dataset condensation for faster hyperparameter search. arXiv preprint arXiv:2405.17535, 2024.
  • Fastai [2019] Fastai. imagenette. https://github.com/fastai/imagenette, 2019.
  • Menghani [2023] Gaurav Menghani. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Computing Surveys, 55(12):1–37, 2023.
  • Gidaris and Komodakis [2018] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proc. CVPR, pages 4367–4375, 2018.
  • Gu et al. [2024a] Jianyang Gu, Saeed Vahidian, Vyacheslav Kungurtsev, Haonan Wang, Wei Jiang, Yang You, and Yiran Chen. Efficient dataset distillation via minimax diffusion. In Proc. CVPR, pages 15793–15803, 2024a.
  • Gu et al. [2024b] Jianyang Gu, Kai Wang, Wei Jiang, and Yang You. Summarizing stream data for memory-restricted online continual learning. In Proc. AAAI, 2024b.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, pages 770–778, 2016.
  • Jia et al. [2024] Yuqi Jia, Saeed Vahidian, Jingwei Sun, Jianyi Zhang, Vyacheslav Kungurtsev, Neil Zhenqiang Gong, and Yiran Chen. Unlocking the potential of federated learning: The symphony of dataset distillation via deep generative latents. In Proc. ECCV, 2024.
  • Kim et al. [2022] Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, and Hyun Oh Song. Dataset condensation via efficient synthetic-data parameterization. In Proc. ICML, pages 11102–11118, 2022.
  • Li et al. [2020] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Soft-label anonymous gastric x-ray image distillation. In Proc. ICIP, pages 305–309, 2020.
  • Li et al. [2022a] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Compressed gastric image generation based on soft-label dataset distillation for medical data sharing. Computer Methods and Programs in Biomedicine, 227:107189, 2022a.
  • Li et al. [2022b] Guang Li, Bo Zhao, and Tongzhou Wang. Awesome dataset distillation. https://github.com/Guang000/Awesome-Dataset-Distillation, 2022b.
  • Li et al. [2023a] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Dataset distillation using parameter pruning. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2023a.
  • Li et al. [2023b] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Dataset distillation for medical dataset sharing. In Proc. AAAI Workshop, pages 1–6, 2023b.
  • Li et al. [2024a] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Importance-aware adaptive dataset distillation. Neural Networks, 2024a.
  • Li et al. [2024b] Longzhen Li, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation: Balancing global structure and local details. In Proc. CVPR Workshop, pages 7664–7671, 2024b.
  • Li et al. [2025a] Longzhen Li, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation based on self-knowledge distillation. In Proc. ICASSP, 2025a.
  • Li et al. [2025b] Mingzhuo Li, Guang Li, Jiafeng Mao, Takahiro Ogawa, and Miki Haseyama. Diversity-driven generative dataset distillation based on diffusion model with self-adaptive memory. In Proc. ICIP, 2025b.
  • Li et al. [2025c] Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Hyperbolic dataset distillation. arXiv preprint arXiv:2505.24623, 2025c.
  • Liu and Du [2025] Ping Liu and Jiawei Du. The evolution of dataset distillation: Toward scalable and generalizable solutions. arXiv preprint arXiv:2502.05673, 2025.
  • Masarczyk and Tautkute [2020] Wojciech Masarczyk and Ivona Tautkute. Reducing catastrophic forgetting with learning on synthetic data. In Proc. CVPR Workshop, pages 4321–4326, 2020.
  • Taye [2023] Mohammad Mustafa Taye. Understanding of machine learning with deep learning: architectures, workflow, applications and future directions. Computers, 12(5):1–27, 2023.
  • Nguyen et al. [2021] Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel ridge-regression. In Proc. ICLR, 2021.
  • Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, pages 1–25, 2023.
  • Rombach et al. [2022a] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, pages 10684–10695, 2022a.
  • Rombach et al. [2022b] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, pages 10684–10695, 2022b.
  • Saharia et al. [2023] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713–4726, 2023.
  • Sajedi et al. [2023] Ahmad Sajedi, Samir Khaki, Ehsan Amjadian, Lucy Z. Liu, Yuri A. Lawryshyn, and Konstantinos N. Plataniotis. DataDAM: Efficient dataset distillation with attention matching. In Proc. ICCV, pages 17097–17107, 2023.
  • Sener and Savarese [2017] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, pages 1–13, 2017.
  • Su et al. [2024a] Duo Su, Junjie Hou, Guang Li, Ren Togo, Rui Song, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation based on diffusion model. In Proc. ECCV Workshop, 2024a.
  • Su et al. [2024b] Duo Su, Junjie Hou, Guang Li, Ren Togo, Rui Song, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation based on diffusion model. In Proc. ECCV Workshop, pages 1–12, 2024b.
  • Wang et al. [2018] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A. Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, pages 1–14, 2018.
  • Wang et al. [2024] Zerun Wang, Jiafeng Mao, Xueting Wang, and Toshihiko Yamasaki. Training data synthesis with difficulty controlled diffusion model. arXiv preprint arXiv:2411.18109, pages 1–10, 2024.
  • Welling [2009] Max Welling. Herding dynamical weights to learn. In Proc. ICML, pages 1121–1128, 2009.
  • Xie et al. [2023] Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, and Zhenguo Li. DiffFit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. In Proc. ICCV, pages 4207–4216, 2023.
  • Zhao and Bilen [2021] Bo Zhao and Hakan Bilen. Dataset condensation with gradient matching. In Proc. ICLR, pages 1–20, 2021.
  • Zhao and Bilen [2022] Bo Zhao and Hakan Bilen. Synthesizing informative training samples with gan. In Proc. NeurIPS Workshop, 2022.
  • Zhao and Bilen [2023] Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In Proc. WACV, pages 6514–6523, 2023.
  • Zheng and Li [2024] Tianhang Zheng and Baochun Li. Differentially private dataset condensation. In Proc. NDSS Workshop, 2024.