Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling

Mingzhuo Li1  Guang Li1  Jiafeng Mao2  Linfeng Ye3  Takahiro Ogawa1  Miki Haseyama1

1Hokkaido University  2The University of Tokyo  3University of Toronto
Correspondence to: Guang Li (guang@lmd.ist.hokudai.ac.jp)
Abstract

To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve performance comparable to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to better meet the requirements of the target task. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks. The code is available at https://github.com/SumomoTaku/DiffGuideSamp.

1 Introduction

With the rapid advancement of deep learning, deep neural networks have gained significant attention due to their extensive applications across various domains, particularly in computer vision [10]. However, these networks typically rely on large-scale datasets to obtain high performance, which results in extended training times that often span several hours or even days, and substantial demands on computational resources [29]. Moreover, the storage and management of massive datasets involve considerable time and financial costs. Dataset distillation [39] has emerged as a promising solution to mitigate these challenges by distilling the original dataset into a compact and high-quality synthetic dataset, which can train models to achieve performance comparable to that obtained using the original dataset.

Figure 1: The workflow of the proposed method, with the overall linear process marked in green. The generation of the image pool is indicated in blue, with optimization strategies aligning the original and distilled distributions. The sampling of the distilled dataset is indicated in orange, using the difficulty distribution derived from the target task. The two components focus on different aspects of the Information Bottleneck optimization objective and are expected to function complementarily to enhance overall performance.

Since its introduction, dataset distillation has attracted significant attention, with a growing number of studies contributing to its rapid advancement [19, 27]. Current dataset distillation methods can be broadly categorized into non-generative and generative ones. Traditional non-generative methods aim to optimize a fixed set of synthetic images, with the size determined by image-per-class (IPC). The optimization is achieved by aligning specific training targets with those derived from the original dataset, under the assumption that models with similar alignment behavior will achieve comparable performance on downstream tasks. Various alignment targets have given rise to different methods, including gradient/trajectory matching [43, 1, 20, 22], distribution/feature matching [45, 35, 26, 5], and kernel-based methods [30, 4].

In contrast, generative dataset distillation methods utilize generative models [44, 2, 23, 24] to produce high-quality synthetic images, which is made feasible by embedding knowledge of the dataset into the model. This modification offers the flexibility to generate datasets of any size on demand, effectively removing the constraint of IPC and reducing time costs, which is particularly beneficial for scenarios such as continual learning [28, 13], federated learning [21, 15], privacy preservation [17, 18, 46], and neural architecture search [8]. Among these models, diffusion models, such as Imagen [34] and Stable Diffusion [33], have shown exceptional promise for their robustness and adaptability, promoting increasing interest in leveraging them for effective dataset distillation [12, 37, 38].

While current generative dataset distillation methods have demonstrated promising performance, the approaches primarily focus on guiding the model with knowledge extracted from the original dataset, overlooking the information specific to the downstream task [12, 37, 25]. This discrepancy between the training objective and the target task may lead to incomplete information during training, limiting the model’s optimal performance. To address this issue, we propose leveraging the relevance of the distilled dataset concerning the downstream task, aiming to generate datasets with superior performance on the target task.

In this paper, we focus on the downstream task of classification and introduce a difficulty-guided sampling strategy to enhance the performance of generative dataset distillation. An image pool of generated images is first obtained using a generative dataset distillation method whose optimization objective aligns the diversity and representativeness of the original and distilled datasets. The final distilled dataset is then selected by aligning the difficulty distribution of the image pool with that of the original dataset. As generative models tend to produce samples biased toward lower difficulty (i.e., easier samples), a logarithmic transformation is introduced as a pre-processing step for distributional correction. Extensive experiments on various downstream models and datasets demonstrate the effectiveness of the proposed method. The contributions of this paper can be summarized as follows:

  • We propose a difficulty-guided sampling strategy that utilizes extra information related to the classification task, achieving task-specific dataset distillation.

  • We conduct sampling on an image pool following the difficulty distribution of the original dataset, and propose a logarithmic transformation to eliminate the bias of the image pool towards easy samples.

2 Dataset Distillation with Difficulty-Guided Sampling

This section is organized as follows. We begin by reviewing a widely adopted generative dataset distillation pipeline, which is based on aligning the distribution between the distilled and original datasets. We then present the detailed implementation of difficulty-guided sampling, supported by theoretical analysis. Finally, we illustrate the logarithmic transformation, which is designed to obtain effective sample selection. The workflow of the proposed method is shown in Fig. 1.

2.1 Preliminary

Latent diffusion models [32] operate in the latent space rather than directly in the pixel space, showing enhanced ability to capture abstract features. Given an image $\bm{x}$ from the original dataset $D$, it is first encoded into a latent vector $\bm{z}_0$ by the VAE encoder. A noisy latent $\bm{z}_t$ is then obtained by sequentially adding Gaussian noise $\epsilon \sim \mathcal{N}(\bm{0},\bm{I})$ to $\bm{z}_0$ over $t$ steps as follows:

$\bm{z}_t = \sqrt{\overline{\alpha}_t}\,\bm{z}_0 + \sqrt{1-\overline{\alpha}_t}\,\epsilon,$ (1)

where $\overline{\alpha}_t$ denotes a hyper-parameter known as the variance schedule. The diffusion model parameterized by $\theta$ is trained to predict the added noise $\epsilon$, conditioned on class information $\bm{c}$ obtained via a class encoder. The training objective minimizes the discrepancy between the predicted noise $\epsilon_{\theta}(\bm{z}_t, t, \bm{c})$ and the ground truth $\epsilon$ as follows:

$\mathcal{L}_{\text{diffusion}} = \arg\min_{\theta} \|\epsilon_{\theta}(\bm{z}_t, t, \bm{c}) - \epsilon\|_2^2.$ (2)

Once trained, the model is capable of generating images by iteratively denoising random noise, thereby achieving high-quality image synthesis.
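To make the forward process and training objective concrete, the snippet below gives a minimal PyTorch sketch of Eq. (1) and Eq. (2). The noise-prediction network `eps_model(z_t, t, c)` and the precomputed `alpha_bar` schedule are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the forward noising step (Eq. 1) and the
# noise-prediction loss (Eq. 2); names are illustrative assumptions.
import torch

def forward_noise(z0: torch.Tensor, alpha_bar_t: torch.Tensor):
    """Add Gaussian noise to the clean latent z0 at timestep t."""
    eps = torch.randn_like(z0)
    z_t = torch.sqrt(alpha_bar_t) * z0 + torch.sqrt(1.0 - alpha_bar_t) * eps
    return z_t, eps

def diffusion_loss(eps_model, z0, t, c, alpha_bar):
    """Mean squared error between predicted and ground-truth noise."""
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)       # broadcast over latent dims
    z_t, eps = forward_noise(z0, a_bar)
    eps_pred = eps_model(z_t, t, c)              # conditioned on class vector c
    return ((eps_pred - eps) ** 2).flatten(1).sum(-1).mean()
```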

To leverage diffusion models for dataset distillation, Minimax [12] introduces an approach that aims to maximize both the representativeness and diversity of the distilled dataset. Two auxiliary memory sets are constructed to facilitate the calculation: a representativeness memory $\mathcal{M}_r$ containing real images and a diversity memory $\mathcal{M}_d$ containing generated images. Representativeness is defined as the similarity between the generated and original datasets, leading to the following optimization objective:

$\mathcal{L}_{\text{repre}} = \arg\max_{\theta} \min_{\bm{z}_r \in \mathcal{M}_r} \sigma(\hat{\bm{z}}_{\theta}(\bm{z}_t, \bm{c}), \bm{z}_r),$ (3)

where $\sigma(\cdot, \cdot)$ denotes the cosine similarity and $\hat{\bm{z}}_{\theta}(\bm{z}_t, \bm{c})$ is the latent predicted by the diffusion model $f_{\theta}$ from the input latent $\bm{z}_t$ conditioned on the class vector $\bm{c}$. Similarly, diversity is defined based on the dissimilarity among the generated images, with the following optimization objective:

$\mathcal{L}_{\text{div}} = \arg\min_{\theta} \max_{\hat{\bm{z}}_g \in \mathcal{M}_d} \sigma(\hat{\bm{z}}_{\theta}(\bm{z}_t, \bm{c}), \hat{\bm{z}}_g).$ (4)

By combining $\mathcal{L}_{\text{div}}$ and $\mathcal{L}_{\text{repre}}$ with the diffusion loss $\mathcal{L}_{\text{diffusion}}$, the model is guided to produce distilled datasets of higher quality, thereby improving performance on downstream tasks.
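A minimal sketch of how the two terms in Eqs. (3) and (4) could be computed is given below, assuming the predicted latents and both memory sets have been flattened to vectors; the function names, the flattening, and the weighted combination in the trailing comment are illustrative assumptions rather than the official Minimax code.

```python
# Sketch of the representativeness and diversity terms (Eqs. 3-4).
# z_hat: (B, D) predicted latents; mem_r / mem_d: (M, D) memory latents.
import torch
import torch.nn.functional as F

def minimax_terms(z_hat, mem_r, mem_d):
    # Cosine similarity of each prediction against every memory entry.
    sim_r = F.cosine_similarity(z_hat.unsqueeze(1), mem_r.unsqueeze(0), dim=-1)
    sim_d = F.cosine_similarity(z_hat.unsqueeze(1), mem_d.unsqueeze(0), dim=-1)
    # Representativeness: raise the *lowest* similarity to the real memory.
    loss_repre = -sim_r.min(dim=1).values.mean()
    # Diversity: lower the *highest* similarity to already generated samples.
    loss_div = sim_d.max(dim=1).values.mean()
    return loss_repre, loss_div

# total = loss_diffusion + lambda_r * loss_repre + lambda_d * loss_div  (assumed weighting)
```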

Although this method effectively leverages features from the original dataset, it overlooks information specific to the downstream task. This omission can lead to a mismatch between the optimization objective during training and the target downstream task, such as classification, limiting the achievable performance.

2.2 Difficulty-Guided Sampling

From the perspective of the Information Bottleneck (IB) principle, the objective of dataset distillation can be redefined as follows. For the original dataset $X$ and the target downstream task $Y$, the goal is to find a compressed dataset $T$ that discards irrelevant details from $X$ while retaining the information relevant to $Y$. Since the original dataset is no longer used during the downstream application, $Y$ is conditionally independent of $X$ given $T$, resulting in the Markov chain $X \to T \to Y$. This satisfies the Markov assumption required by IB, leading to the following objective:

$\mathcal{L}_{\text{IB}} = \min_{T} I(X;T) - \beta\, I(T;Y),$ (5)

where $I(X;T)$ and $I(T;Y)$ denote the mutual information between $X$ and $T$, and between $T$ and $Y$, respectively, and $\beta$ is a Lagrange multiplier. The former term improves the level of compression, while the latter enhances the predictability of the target. Balancing these two objectives helps construct distilled datasets that are both compact and effective.

Recent generative dataset distillation methods primarily focus on optimizing the distribution of the distilled dataset with respect to the original dataset. For example, the aforementioned Minimax enhances diversity and representativeness, while MGD3 [3] guides the denoising process toward desired distributional regions. These approaches can be broadly categorized as efforts to extract more features from the original dataset, making the distilled dataset resemble the original distribution. Since the original dataset inherently contains rich information, including labels for classification, such efforts implicitly benefit various downstream tasks. In other words, these approaches implicitly improve $I(T;Y)$ by explicitly improving $I(X;T)$, and the overall performance depends on balancing the two factors. In the absence of task-specific considerations, the enhancement of $I(T;Y)$ is limited to the information inherent in the original dataset, which may result in suboptimal performance on the specific downstream task.

To address this issue, we propose incorporating task-specific information to leverage the relevance between the distilled dataset $T$ and the target task $Y$, explicitly improving $I(T;Y)$ for better performance. Inspired by the findings of Wang et al. [40], which demonstrate the effectiveness of controlling sample difficulty for dataset enhancement, we introduce difficulty as a proxy to quantify the information content relevant to the classification task.

The difficulty $\mathcal{D}_x$ of an image $x$ is defined as one minus the confidence $P$ assigned to the correct class $y_{true}$ by a pre-trained classification model $f_{\theta}$ as follows:

𝒟x=1Pfθ(ytrue|x).subscript𝒟𝑥1subscript𝑃subscript𝑓𝜃conditionalsubscript𝑦𝑡𝑟𝑢𝑒𝑥\mathcal{D}_{x}=1-P_{f_{\theta}}(y_{true}|x).caligraphic_D start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = 1 - italic_P start_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_t italic_r italic_u italic_e end_POSTSUBSCRIPT | italic_x ) . (6)

As illustrated in Fig. 1, an image pool with a total size of $n \times \text{IPC}$ is first constructed by collecting distilled images generated by the distillation pipeline of Minimax. The difficulty of each image in the pool is then computed to serve as additional task-specific information. Sampling is subsequently performed over the image pool following a specific sampling distribution.

Assuming the original dataset represents the ground truth for optimal performance, we hypothesize that a distilled dataset exhibiting a similar difficulty distribution to the original one may yield improved performance. Consequently, the sampling distribution is obtained by scaling the difficulty distribution of the original dataset to match the IPC. The effectiveness of this scaling-based sampling is supported by our experiments in Section 3.3, where we compare its performance with several pre-defined sampling distributions.
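The per-class sampling step can be sketched as follows: bin the difficulty scores of the original dataset, scale the histogram so that it sums to IPC, and draw that many pool images from each bin. The bin count, random seed, and handling of under-populated bins are assumptions of this sketch, not the released implementation.

```python
# Difficulty-guided sampling for one class (illustrative sketch).
import numpy as np

def sample_by_difficulty(pool_scores, orig_scores, ipc, n_bins=10, rng=None):
    """Return indices into the image pool whose difficulty histogram
    matches the original dataset's histogram scaled to IPC images."""
    rng = rng if rng is not None else np.random.default_rng(0)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    orig_hist, _ = np.histogram(orig_scores, bins=bins)
    # Scale the original difficulty distribution so the quotas sum to IPC.
    quota = np.floor(orig_hist / orig_hist.sum() * ipc).astype(int)
    quota[np.argmax(orig_hist)] += ipc - quota.sum()   # assign rounding remainder
    pool_bin = np.digitize(pool_scores, bins[1:-1])    # bin index for each pool image
    chosen = []
    for b in range(n_bins):
        idx = np.where(pool_bin == b)[0]
        take = min(quota[b], len(idx))                 # a bin may be under-populated
        if take > 0:
            chosen.extend(rng.choice(idx, size=take, replace=False).tolist())
    return np.array(chosen)
```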

Figure 2: The difficulty distributions of different datasets, illustrated with two classes of ImageWoof. The x-axis represents difficulty intervals, while the y-axis indicates the number of images per interval. The average difficulty of each dataset is annotated in the title. The lower and upper rows show the sampling process with and without the logarithmic transformation, respectively.

2.3 Pre-processing of Logarithmic Function

However, a notable bias exists between the difficulty distributions of the original and generated datasets [40]. Distilled datasets, particularly those produced by generative models, are also subject to this bias and tend to contain a higher proportion of easy samples, as illustrated in Fig. 2. This imbalance leaves certain regions of the sampling distribution uncovered, distorting the difficulty distribution of the distilled dataset and necessitating additional corrective steps.

To address this issue, a logarithmic transformation is applied to align the difficulty distributions of the original dataset and the image pool, enabling better sampling. The uniform distribution is selected as the transformation target, following the idea that classification models benefit from balanced data. However, because many images cluster around similar difficulty values, particularly at the lower and upper extremes, directly applying the logarithmic function may amplify the influence of extreme values and affect overall stability.

Hence, thresholding at both the start and end of the original difficulty distribution $P_X(n)$ is introduced to stabilize the transformation and prevent the dominance of extreme values. The clipped distribution $P^{\prime}_X(n)$ is obtained as follows:

PX(n)=H(nb)PX(n)H(Nnt)+ϵ,subscriptsuperscript𝑃𝑋𝑛𝐻𝑛𝑏subscript𝑃𝑋𝑛𝐻𝑁𝑛𝑡italic-ϵP^{\prime}_{X}(n)=H(n-b)\ P_{X}(n)\ H(N-n-t)+\epsilon,italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_n ) = italic_H ( italic_n - italic_b ) italic_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_n ) italic_H ( italic_N - italic_n - italic_t ) + italic_ϵ , (7)

where b𝑏bitalic_b and t𝑡titalic_t denote the bottom and top thresholds, respectively. N𝑁Nitalic_N is the size of PX(n)subscript𝑃𝑋𝑛P_{X}(n)italic_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_n ), H(n)𝐻𝑛H(n)italic_H ( italic_n ) is the Heaviside step function and ϵitalic-ϵ\epsilonitalic_ϵ is a small value to avoid mathematic error. To keep the range between 00 and 1111, the logarithmic transformation f𝑓fitalic_f is defined as follows:

$f(P_X, b, t) = \dfrac{\ln\!\left(P^{\prime}_X(n) / \min(P^{\prime}_X(n))\right)}{\ln\!\left(\max(P^{\prime}_X(n)) / \min(P^{\prime}_X(n))\right)}.$ (8)
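A minimal NumPy sketch of the clipping in Eq. (7) and the normalized logarithmic transform in Eq. (8) is given below; interpreting $b$ and $t$ as the numbers of bins zeroed at the bottom and top of the histogram is an assumption of this sketch.

```python
# Clipped, normalized logarithmic transform of a difficulty histogram.
import numpy as np

def log_transform(p_x, b, t, eps=1e-8):
    """Zero the lowest b and highest t bins, then apply the normalized log map."""
    p = np.asarray(p_x, dtype=float).copy()
    if b > 0:
        p[:b] = 0.0            # corresponds to H(n - b) in Eq. (7)
    if t > 0:
        p[-t:] = 0.0           # corresponds to H(N - n - t) in Eq. (7)
    p = p + eps                # avoid log(0)
    log_p = np.log(p / p.min())
    denom = np.log(p.max() / p.min())
    return log_p / denom if denom > 0 else np.zeros_like(p)   # values in [0, 1]
```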
Table 1: Comparison of downstream validation accuracy with other SOTA methods on ImageWoof. The results are obtained with ConvNet-6, ResNetAP-10, and ResNet-18 as test models. The best results are marked in bold.
IPC (Ratio) | Test Model | Random | K-Center [36] | Herding [41] | DiT [31] | DM [45] | IDC-1 [16] | Minimax [12] | Ours | Full Dataset
10 (0.8%) | ConvNet-6 | 24.3±1.1 | 19.4±0.9 | 26.7±0.5 | 34.2±1.1 | 26.9±1.2 | 33.3±1.1 | 34.1±0.4 | 35.1±0.5 | 86.4±0.2
10 (0.8%) | ResNetAP-10 | 29.4±0.8 | 22.1±0.1 | 32.0±0.3 | 34.7±0.5 | 30.3±1.2 | 37.3±0.4 | 35.7±0.3 | 37.4±0.3 | 87.5±0.5
10 (0.8%) | ResNet-18 | 27.7±0.9 | 21.1±0.4 | 30.2±1.2 | 34.7±0.4 | 33.4±0.7 | 36.9±0.4 | 35.3±0.4 | 35.9±0.6 | 89.3±1.2
20 (1.6%) | ConvNet-6 | 29.1±0.7 | 21.5±0.8 | 29.5±0.3 | 36.1±0.8 | 29.9±1.0 | 35.5±0.8 | 36.9±1.2 | 38.1±0.2 | 86.4±0.2
20 (1.6%) | ResNetAP-10 | 32.7±0.4 | 25.1±0.7 | 34.9±0.1 | 41.1±0.8 | 35.2±0.6 | 42.0±0.4 | 43.3±0.3 | 45.5±0.4 | 87.5±0.5
20 (1.6%) | ResNet-18 | 29.7±0.5 | 23.6±0.3 | 32.2±0.6 | 40.5±0.5 | 29.8±1.7 | 38.6±0.2 | 40.9±0.6 | 43.4±1.0 | 89.3±1.2
50 (3.8%) | ConvNet-6 | 41.3±0.6 | 36.5±1.0 | 40.3±0.7 | 46.5±0.8 | 44.4±1.0 | 43.9±1.2 | 51.4±0.4 | 52.0±0.6 | 86.4±0.2
50 (3.8%) | ResNetAP-10 | 47.2±1.3 | 40.6±0.4 | 49.1±0.7 | 49.3±0.2 | 47.1±1.1 | 48.3±1.0 | 54.4±0.6 | 57.1±0.9 | 87.5±0.5
50 (3.8%) | ResNet-18 | 47.9±1.8 | 39.6±1.0 | 48.3±1.2 | 50.1±0.5 | 46.2±0.6 | 48.3±0.8 | 53.9±0.6 | 54.9±0.1 | 89.3±1.2

While the introduction of thresholds helps produce a more balanced difficulty distribution, it also introduces distortion by artificially modifying some values. The Kullback-Leibler (KL) divergence is therefore introduced to measure distribution-level differences and to assist in determining the appropriate clipping level. Targeting similarity to both the uniform distribution $\mathcal{U}$ and the original difficulty distribution $P_X(n)$, the optimal threshold values are determined as follows:

b,t=argminb,t(λDKL(f(PX,b,t)||PX)+(1λ)DKL(f(PX,b,t)||𝒰)),\displaystyle\begin{split}b^{*},t^{*}=\arg\min_{b,t}(\lambda\ D_{\text{KL}}(f(% P_{X},b,t)||P_{X})\\ +(1-\lambda)\ D_{\text{KL}}(f(P_{X},b,t)||\mathcal{U})),\end{split}start_ROW start_CELL italic_b start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_t start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_arg roman_min start_POSTSUBSCRIPT italic_b , italic_t end_POSTSUBSCRIPT ( italic_λ italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_f ( italic_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT , italic_b , italic_t ) | | italic_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ) end_CELL end_ROW start_ROW start_CELL + ( 1 - italic_λ ) italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_f ( italic_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT , italic_b , italic_t ) | | caligraphic_U ) ) , end_CELL end_ROW (9)

where $D_{\text{KL}}(P\,\|\,Q)$ denotes the KL divergence of $P$ from $Q$, and $\lambda \in [0,1]$ is a weighting factor controlling the trade-off between uniformity and similarity to the original distribution.
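The threshold search in Eq. (9) can be sketched as a simple grid search over candidate $(b, t)$ pairs, reusing the `log_transform` helper from the previous sketch; the candidate range `max_clip` and the value of $\lambda$ are illustrative assumptions.

```python
# Grid search for the clipping thresholds that minimize the weighted KL objective.
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL divergence between two histograms after normalization."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def search_thresholds(p_x, max_clip=5, lam=0.5):
    """Return the (b, t) pair minimizing Eq. (9) over a small grid."""
    p_x = np.asarray(p_x, dtype=float)
    uniform = np.full(len(p_x), 1.0 / len(p_x))
    best, best_score = (0, 0), float("inf")
    for b in range(max_clip + 1):
        for t in range(max_clip + 1):
            f_p = log_transform(p_x, b, t)   # helper from the previous sketch
            score = lam * kl_div(f_p, p_x) + (1 - lam) * kl_div(f_p, uniform)
            if score < best_score:
                best, best_score = (b, t), score
    return best
```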

Through the above procedure, we obtain a distilled dataset that matches the difficulty distribution of the original dataset. In the specific downstream task of classification, it incorporates additional task-relevant information and is therefore expected to have improved performance compared to existing approaches.

3 Experiments

3.1 Datasets and Evaluation

To validate the effectiveness of the proposed method, extensive experiments are conducted on three 10-class subsets of the full-sized ImageNet [6] dataset: ImageWoof [9], ImageNette [9], and ImageIDC [16]. These subsets differ in classification difficulty, with ImageWoof being the most challenging, consisting of 10 specific dog breeds. ImageNette consists of 10 classes that are easy to classify, and ImageIDC contains 10 classes randomly selected from ImageNet. We evaluate the classification accuracy of the proposed method and compare it with several SOTA methods, including dataset selection methods such as Random, K-Center [36], and Herding [41], non-generative dataset distillation methods such as DM [45] and IDC-1 [16], and generative dataset distillation methods such as DiT [31] and Minimax [12]. The models used for validation are ConvNet-6 [11], ResNet-18 [14], and ResNet-10 with average pooling (ResNetAP-10) [14], trained with a learning rate of 0.01; top-1 accuracy is reported.

Table 2: Comparison of downstream validation accuracy with other SOTA methods on different ImageNet subsets. The results are obtained with ResNetAP-10. The best results are marked in bold.
Dataset | IPC | Random | DiT [31] | Minimax [12] | Ours
ImageNette | 10 | 54.2±1.6 | 59.1±0.7 | 59.8±0.3 | 61.5±0.9
ImageNette | 20 | 63.5±0.5 | 64.8±1.2 | 66.3±0.4 | 66.9±0.5
ImageNette | 50 | 76.1±1.1 | 73.3±0.9 | 75.2±0.2 | 76.8±0.7
ImageIDC | 10 | 48.1±0.8 | 54.1±0.4 | 60.3±1.0 | 61.6±0.7
ImageIDC | 20 | 52.5±0.9 | 58.9±0.2 | 63.9±0.4 | 64.3±0.5
ImageIDC | 50 | 68.1±0.7 | 64.3±0.6 | 74.1±0.2 | 74.2±0.7

The image pool is created using the Minimax pipeline with its default parameters and settings. The diffusion model is a pre-trained DiT [31] fine-tuned with DiffFit [42], with a VAE [7] as the encoder. Input images are randomly arranged and transformed to 256×256 pixels. The number of denoising steps in the sampling process is 50. The distillation process lasts for 8 epochs with a mini-batch size of 8. AdamW with a learning rate of 1e-3 is adopted as the optimizer. A ResNet-50 trained on the full ImageNet dataset is used as the pre-trained model for obtaining difficulty scores. Each experiment is repeated 3 times, and the mean and standard deviation are reported.

3.2 Benchmark Results

Figure 3: Visualization of images from the original and distilled datasets with their difficulty scores.

First, we compare the proposed method on ImageWoof with different classification models and various IPC settings to show its cross-architecture effectiveness. As shown in Table 1, our method achieves the best accuracy in most settings, especially under higher IPC, demonstrating its ability to enhance the task-specific performance of dataset distillation.

We then verify the generalization performance of the proposed method by conducting experiments on additional datasets. As shown in Table 2, the performance trends observed on ImageNette and ImageIDC generally correspond with those on ImageWoof, with our method achieving the best performance in most experiments.

To provide an intuitive understanding of our method, we illustrate the difficulty distributions of different datasets during the sampling process in Fig. 2, using two example classes, “Shih-Tzu” (n02086240) and “English Foxhound” (n02089973), from ImageWoof. As shown in the figure, the image pool obtained by generative models exhibits a strong bias toward easy samples, failing to reflect the difficulty characteristics of the original dataset. As a result, when sampling directly follows the original distribution, many difficulty intervals remain unrepresented in the distilled dataset, reducing the effect of sampling. By contrast, the logarithmic transformation flattens the difficulty distribution of the image pool, facilitating the sampling of images that match the target distribution. However, it also alters the difficulty distribution of the original dataset after transformation, highlighting the need for further discussion of this impact and potential remedies, such as adjusting the transformation parameters.

We also visualize the images of the aforementioned two classes in both the original and distilled datasets in Fig. 3, along with their corresponding difficulty scores. The comparison reveals that the distilled dataset contains images of various difficulties and visual characteristics, indicating good sample diversity. Additionally, images of the same difficulty share some common features, suggesting the potential factors that contribute to the difficulty.

3.3 Sampling Distribution

When obtaining the sampling distribution, we hypothesize that a distribution similar to that of the original dataset contributes to enhanced performance. We validate this hypothesis in Table 3, where we compare the downstream performance of various pre-defined sampling distributions, with “scale” denoting the strategy of scaling the difficulty distribution of the original dataset. As illustrated in Fig. 4, the four pre-defined distributions “hill”, “ground”, “slope”, and “cliff” are named after their shapes and contain increasing proportions of easy samples.

Table 3: Comparison of downstream validation accuracy for different sampling distributions. The label “scale” refers to the strategy of scaling the difficulty distribution of the original dataset. The results are obtained with ResNetAP-10 on ImageWoof. The best results are marked in bold.
Distribution | IPC = 10 | IPC = 20 | IPC = 50
Hill | 35.8±0.2 | 41.7±0.3 | 56.9±0.5
Ground | 36.7±0.7 | 42.7±0.8 | 55.0±0.6
Slope | 37.8±0.4 | 42.7±0.7 | 56.1±0.6
Cliff | 37.4±0.6 | 44.3±0.3 | 56.6±0.8
Scale | 37.4±0.3 | 45.5±0.4 | 57.1±0.9
Figure 4: Visualization of pre-defined sampling distributions under different IPC settings, with increasing proportions of easy samples from left to right. The x-axis represents difficulty intervals, while the y-axis indicates the desired number of selected images per interval.

Experimental results show that distilled datasets sampled using the “scale” distribution achieve the best performance at IPC = 20 and 50, likely due to class-wise differences in difficulty distributions within the dataset. Further comparisons among the pre-defined sampling distributions reveal that smaller sampled datasets benefit from a higher proportion of easier samples, whereas larger sampled datasets benefit from a higher proportion of more difficult samples. This finding suggests that adjusting the proportion of easy and difficult samples according to the IPC setting may lead to further improvement.

3.4 Size of Image Pool

Since the final distilled dataset is sampled from the image pool, the pool size can influence overall performance, necessitating efforts to determine an appropriate size. To this end, we construct image pools of varying sizes $n \times \text{IPC}$ and conduct experiments to identify a suitable value of $n$ for practical implementation.

As shown in Table 4, the classification accuracy varies with the size of the image pool. Based mainly on the results under higher IPC settings, a size of $5 \times \text{IPC}$ yields the best overall performance and is therefore adopted in subsequent experiments. This behavior can be attributed to a trade-off: while a larger image pool increases diversity, it also introduces redundancy, especially given the concentrated difficulty distributions shown in Fig. 2. Moreover, the pool size affects the selection of the threshold parameters in the logarithmic transformation, which are also applied to the original dataset, resulting in different sampling distributions.

Table 4: Comparison of downstream validation accuracy for different sizes of image pool. The results are obtained with ResNetAP-10 on ImageWoof. The best results are marked in bold.
Size | IPC = 10 | IPC = 20 | IPC = 50
2 × IPC | 37.9±0.4 | 44.3±0.3 | 55.8±0.6
3 × IPC | 35.9±0.8 | 41.5±0.5 | 56.7±0.4
4 × IPC | 38.4±0.6 | 43.7±0.5 | 55.4±0.9
5 × IPC | 37.4±0.3 | 45.5±0.4 | 57.1±0.9
6 × IPC | 34.5±0.8 | 42.7±1.1 | 54.9±1.3

4 Conclusion

In this paper, we have proposed a difficulty-guided sampling method to improve the task-specific performance of dataset distillation. Unlike previous methods that focus only on the information shared between the distilled and original datasets, we evaluate the information relevant to the target downstream task by using the difficulty distribution to guide sampling, offering complementary optimization under the Information Bottleneck principle. The proposed method achieves state-of-the-art performance on the specific task of classification in most experiments, verifying the effectiveness of difficulty-guided sampling. Moreover, the results support the value of task-specific information, suggesting its potential for enhancing performance on other downstream tasks.

References

  • Cazenavette et al. [2022] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. In Proc. CVPR, pages 10718–10727, 2022.
  • Cazenavette et al. [2023] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, and Jun-Yan Zhu. Generalizing dataset distillation via deep generative prior. In Proc. CVPR, pages 3739–3748, 2023.
  • Chan-Santiago et al. [2025] Jeffrey A. Chan-Santiago, Praveen Tirupattur, Gaurav Kumar Nayak, Gaowen Liu, and Mubarak Shah. MGD3: Mode-guided dataset distillation using diffusion models. In Proc. ICML, 2025.
  • Chen et al. [2024] Yilan Chen, Wei Huang, and Tsui-Wei Weng. Provable and efficient dataset distillation for kernel ridge regression. In Proc. NeurIPS, 2024.
  • Cui et al. [2025] Xiao Cui, Yulei Qin, Wengang Zhou, Hongsheng Li, and Houqiang Li. OPTICAL: Leveraging optimal transport for contribution allocation in dataset distillation. In Proc. CVPR, 2025.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, pages 248–255, 2009.
  • Kingma and Welling [2013] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, pages 1–14, 2013.
  • Ding et al. [2024] Mucong Ding, Yuancheng Xu, Tahseen Rabbani, Xiaoyu Liu, Brian Gravelle, Teresa Ranadive, Tai-Ching Tuan, and Furong Huang. Calibrated dataset condensation for faster hyperparameter search. arXiv preprint arXiv:2405.17535, 2024.
  • Fastai [2019] Fastai. imagenette. https://github.com/fastai/imagenette, 2019.
  • Menghani [2023] Gaurav Menghani. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Computing Surveys, 55(12):1–37, 2023.
  • Gidaris and Komodakis [2018] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proc. CVPR, pages 4367–4375, 2018.
  • Gu et al. [2024a] Jianyang Gu, Saeed Vahidian, Vyacheslav Kungurtsev, Haonan Wang, Wei Jiang, Yang You, and Yiran Chen. Efficient dataset distillation via minimax diffusion. In Proc. CVPR, pages 15793–15803, 2024a.
  • Gu et al. [2024b] Jianyang Gu, Kai Wang, Wei Jiang, and Yang You. Summarizing stream data for memory-restricted online continual learning. In Proc. AAAI, 2024b.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, pages 770–778, 2016.
  • Jia et al. [2024] Yuqi Jia, Saeed Vahidian, Jingwei Sun, Jianyi Zhang, Vyacheslav Kungurtsev, Neil Zhenqiang Gong, and Yiran Chen. Unlocking the potential of federated learning: The symphony of dataset distillation via deep generative latents. In Proc. ECCV, 2024.
  • Kim et al. [2022] Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, and Hyun Oh Song. Dataset condensation via efficient synthetic-data parameterization. In Proc. ICML, pages 11102–11118, 2022.
  • Li et al. [2020] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Soft-label anonymous gastric x-ray image distillation. In Proc. ICIP, pages 305–309, 2020.
  • Li et al. [2022a] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Compressed gastric image generation based on soft-label dataset distillation for medical data sharing. Computer Methods and Programs in Biomedicine, 227:107189, 2022a.
  • Li et al. [2022b] Guang Li, Bo Zhao, and Tongzhou Wang. Awesome dataset distillation. https://github.com/Guang000/Awesome-Dataset-Distillation, 2022b.
  • Li et al. [2023a] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Dataset distillation using parameter pruning. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2023a.
  • Li et al. [2023b] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Dataset distillation for medical dataset sharing. In Proc. AAAI Workshop, pages 1–6, 2023b.
  • Li et al. [2024a] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Importance-aware adaptive dataset distillation. Neural Networks, 2024a.
  • Li et al. [2024b] Longzhen Li, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation: Balancing global structure and local details. In Proc. CVPR Workshop, pages 7664–7671, 2024b.
  • Li et al. [2025a] Longzhen Li, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation based on self-knowledge distillation. In Proc. ICASSP, 2025a.
  • Li et al. [2025b] Mingzhuo Li, Guang Li, Jiafeng Mao, Takahiro Ogawa, and Miki Haseyama. Diversity-driven generative dataset distillation based on diffusion model with self-adaptive memory. In Proc. ICIP, 2025b.
  • Li et al. [2025c] Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Hyperbolic dataset distillation. arXiv preprint arXiv:2505.24623, 2025c.
  • Liu and Du [2025] Ping Liu and Jiawei Du. The evolution of dataset distillation: Toward scalable and generalizable solutions. arXiv preprint arXiv:2502.05673, 2025.
  • Masarczyk and Tautkute [2020] Wojciech Masarczyk and Ivona Tautkute. Reducing catastrophic forgetting with learning on synthetic data. In Proc. CVPR Workshop, pages 4321–4326, 2020.
  • Taye [2023] Mohammad Mustafa Taye. Understanding of machine learning with deep learning: architectures, workflow, applications and future directions. Computers, 12(5):1–27, 2023.
  • Nguyen et al. [2021] Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel ridge-regression. In Proc. ICLR, 2021.
  • Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, pages 1–25, 2023.
  • Rombach et al. [2022a] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, pages 10684–10695, 2022a.
  • Rombach et al. [2022b] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, pages 10684–10695, 2022b.
  • Saharia et al. [2023] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713–4726, 2023.
  • Sajedi et al. [2023] Ahmad Sajedi, Samir Khaki, Ehsan Amjadian, Lucy Z. Liu, Yuri A. Lawryshyn, and Konstantinos N. Plataniotis. DataDAM: Efficient dataset distillation with attention matching. In Proc. ICCV, pages 17097–17107, 2023.
  • Sener and Savarese [2017] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, pages 1–13, 2017.
  • Su et al. [2024a] Duo Su, Junjie Hou, Guang Li, Ren Togo, Rui Song, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation based on diffusion model. In Proc. ECCV Workshop, 2024a.
  • Su et al. [2024b] Duo Su, Junjie Hou, Guang Li, Ren Togo, Rui Song, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation based on diffusion model. In Proc. ECCV Workshop, pages 1–12, 2024b.
  • Wang et al. [2018] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A. Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, pages 1–14, 2018.
  • Wang et al. [2024] Zerun Wang, Jiafeng Mao, Xueting Wang, and Toshihiko Yamasaki. Training data synthesis with difficulty controlled diffusion model. arXiv preprint arXiv:2411.18109, pages 1–10, 2024.
  • Welling [2009] Max Welling. Herding dynamical weights to learn. In Proc. ICML, pages 1121–1128, 2009.
  • Xie et al. [2023] Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, and Zhenguo Li. DiffFit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. In Proc. ICCV, pages 4207–4216, 2023.
  • Zhao and Bilen [2021] Bo Zhao and Hakan Bilen. Dataset condensation with gradient matching. In Proc. ICLR, pages 1–20, 2021.
  • Zhao and Bilen [2022] Bo Zhao and Hakan Bilen. Synthesizing informative training samples with gan. In Proc. NeurIPS Workshop, 2022.
  • Zhao and Bilen [2023] Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In Proc. WACV, pages 6514–6523, 2023.
  • Zheng and Li [2024] Tianhang Zheng and Baochun Li. Differentially private dataset condensation. In Proc. NDSS Workshop, 2024.