FFT-MIL: Fourier Transform Multiple Instance Learning for Whole Slide Image Classification

1Institute of Artificial Intelligence (IAI), 2Department of Computer Science, 3College of Medicine, University of Central Florida

Abstract

Whole Slide Image (WSI) classification relies on Multiple Instance Learning (MIL) with spatial patch features, yet existing methods struggle to capture global dependencies due to the immense size of WSIs and the local nature of patch embeddings. This limitation hinders the modeling of coarse structures essential for robust diagnostic prediction.

We propose Fourier Transform Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a frequency-domain branch to provide compact global context. Low-frequency crops are extracted from WSIs via the Fast Fourier Transform and processed through a modular FFT-Block composed of convolutional layers and Min-Max normalization to mitigate the high variance of frequency data. The learned global frequency feature is fused with spatial patch features through lightweight integration strategies, enabling compatibility with diverse MIL architectures.

FFT-MIL was evaluated across six state-of-the-art MIL methods on three public datasets (BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across architectures and datasets. These results establish frequency-domain learning as an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.

Architecture

[Figure: FFT-MIL architecture]
The proposed Fourier Transform Multiple Instance Learning (FFT-MIL) framework augments existing MIL methods with a frequency-domain branch to improve global context modeling in WSI classification. The FFT-Block extracts a global frequency feature from a given WSI, which is fused with the output of CLAM’s attention backbone via addition, introducing global context at the stage where patch-level information has already been aggregated. While illustrated with CLAM, the FFT-Block is modular and can be integrated into other MIL methods in the same fashion.
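
A minimal PyTorch sketch of this fusion pattern follows. The layer sizes, channel counts, embedding dimension, and the mil_backbone / classifier names are illustrative assumptions, not the released implementation.

# Sketch of an FFT-Block and additive fusion with a MIL backbone's
# slide-level feature (illustrative; sizes and names are assumptions).
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Maps a low-frequency crop of a WSI spectrum to one global feature vector."""
    def __init__(self, in_channels: int = 2, embed_dim: int = 512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    @staticmethod
    def min_max_normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # Min-Max normalization per sample to tame the high variance
        # of frequency-domain magnitudes.
        x_min = x.amin(dim=(-2, -1), keepdim=True)
        x_max = x.amax(dim=(-2, -1), keepdim=True)
        return (x - x_min) / (x_max - x_min + eps)

    def forward(self, freq_crop: torch.Tensor) -> torch.Tensor:
        # freq_crop: (B, 2, H, W) real/imaginary (or magnitude/phase) channels.
        x = self.min_max_normalize(freq_crop)
        x = self.convs(x).flatten(1)   # (B, 64)
        return self.proj(x)            # (B, embed_dim)

# Fusion by addition with the slide-level feature from the MIL backbone
# (e.g. CLAM's attention-pooled embedding), before the final classifier:
#   slide_feat  = mil_backbone(patch_features)   # (B, embed_dim)
#   global_feat = fft_block(freq_crop)           # (B, embed_dim)
#   logits      = classifier(slide_feat + global_feat)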

Preprocessing

[Figure: FFT-MIL preprocessing pipeline]
The preprocessing pipeline generates low-frequency representations of whole slide images (WSIs) to provide compact global context for downstream analysis. A tissue segmentation branch first identifies relevant regions at low magnification for efficient patch extraction, while the FFT-MIL branch operates on a downsampled WSI, applying the Fast Fourier Transform, frequency shifting, and center cropping. This process retains the dominant low-frequency components that capture large-scale structural patterns while filtering out high-frequency noise. A reconstruction pathway is included for visualization, illustrating how the extracted frequency information preserves global structure and diagnostically relevant features at a far lower input resolution.
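
A minimal NumPy sketch of the low-frequency crop extraction described above follows; the grayscale thumbnail input and the crop size are illustrative assumptions, not the released pipeline.

# Sketch of low-frequency crop extraction from a downsampled WSI
# (illustrative re-implementation of the idea described above).
import numpy as np

def low_frequency_crop(wsi_thumbnail: np.ndarray, crop: int = 256):
    """wsi_thumbnail: (H, W) grayscale downsampled WSI.
    Returns a centered low-frequency crop of its 2D spectrum and an
    optional reconstruction for visualization."""
    # 2D FFT, then shift the zero-frequency (DC) component to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(wsi_thumbnail))

    # Center crop keeps the dominant low-frequency components that encode
    # large-scale tissue structure and discards high-frequency detail/noise.
    h, w = spectrum.shape
    cy, cx = h // 2, w // 2
    half = crop // 2
    low_freq = spectrum[cy - half:cy + half, cx - half:cx + half]

    # Reconstruction pathway (visualization only): zero-pad the crop back to
    # full size and invert the FFT to see what structure is retained.
    padded = np.zeros_like(spectrum)
    padded[cy - half:cy + half, cx - half:cx + half] = low_freq
    reconstruction = np.abs(np.fft.ifft2(np.fft.ifftshift(padded)))

    return low_freq, reconstruction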

Results

[Figure: FFT-MIL results]
FFT-MIL consistently improves the performance of existing Multiple Instance Learning (MIL) methods by integrating frequency-derived global features with traditional spatial representations. The framework was evaluated across six MIL architectures (CLAM, MIL, ABMIL, ACMIL, IBMIL, and ILRA) on the BRACS, LUAD, and IMP datasets. FFT-MIL improved macro F1 scores by an average of 3.51% and AUC by 1.51%. These improvements demonstrate its effectiveness as a general plug-in mechanism for enhancing diagnostic accuracy in whole slide image classification.

Attention Heatmap

[Figure: FFT-MIL attention heatmaps]
The heatmaps illustrate the spatial impact of frequency-domain integration on attention behavior in WSI classification. Attention maps from the baseline CLAM model and the proposed FFT-MIL framework are compared on a representative BRACS slide. Both highlight similar diagnostic regions, while a pixel-wise difference map identifies areas of divergence. The baseline CLAM shows more diffuse attention, which indicates limited spatial precision and weaker global context. FFT-MIL produces sharper and more selective focus regions, supported by a 16% reduction in entropy and a 23% increase in standard deviation, which reflect improved concentration of attention. These findings show that FFT-MIL maintains alignment with the main semantic regions identified by CLAM while achieving greater spatial selectivity and precision in its attention distributions.
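
The concentration statistics cited above can be computed directly from the per-patch attention weights. The sketch below shows one plausible way to do so; the function names and normalization are assumptions, not the paper's evaluation code.

# Sketch of attention-concentration statistics (entropy, standard deviation)
# and a patch-wise difference map between two attention heatmaps.
import numpy as np

def attention_stats(attn: np.ndarray, eps: float = 1e-12):
    """attn: 1D array of per-patch attention scores for one slide."""
    p = attn / (attn.sum() + eps)            # normalize to a distribution
    entropy = -np.sum(p * np.log(p + eps))   # lower entropy = more concentrated
    return entropy, p.std()

def difference_map(attn_baseline: np.ndarray, attn_fft: np.ndarray) -> np.ndarray:
    """Patch-wise divergence between two attention maps on the same slide grid."""
    return np.abs(attn_fft - attn_baseline)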

BibTeX


@misc{bilic2025fouriertransformmultipleinstance,
  title={Fourier Transform Multiple Instance Learning for Whole Slide Image Classification},
  author={Anthony Bilic and Guangyu Sun and Ming Li and Md Sanzid Bin Hossain and Yu Tian and Wei Zhang and Laura Brattain and Dexter Hadley and Chen Chen},
  year={2025},
  eprint={2510.15138},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.15138}
}