FFTW FAQ - Section 4 
Internals of FFTW
The innovation (if it can be so called) in FFTW consists in having a
variety of composable solvers, representing different FFT algorithms and implementation strategies, whose combination into a
particular plan for a given size can be determined at runtime according to the characteristics of your machine/compiler. 
This peculiar software architecture allows FFTW to adapt itself to
almost any machine.  
For more details (albeit somewhat outdated), see the paper "FFTW:
An Adaptive Software Architecture for the FFT", by M. Frigo and
S. G. Johnson, Proc. ICASSP 3, 1381 (1998), also available at the FFTW web page.  
This is a complex question, and there is no simple answer.  In fact,
the authors do not fully know the answer, either.  In addition to many
small performance hacks throughout FFTW, there are three general
reasons for FFTW's speed.  
- 	FFTW uses a variety of FFT algorithms and implementation styles
that can be arbitrarily composed to adapt itself to
a machine.  See Q4.1 `How does FFTW work?'.  
- 	FFTW uses a code generator to produce highly-optimized
routines for computing small transforms. 
- 	FFTW uses explicit divide-and-conquer to take advantage
of the memory hierarchy.  
For more details (albeit somewhat outdated), see the paper "FFTW:
An Adaptive Software Architecture for the FFT", by M. Frigo and
S. G. Johnson, Proc. ICASSP 3, 1381 (1998), available along with other references at
the FFTW web page.
Next: Known bugs.
Back: Using FFTW.
Return to contents.
Matteo Frigo and Steven G. Johnson / fftw@fftw.org
- 24 May 2018
Extracted from FFTW Frequently Asked Questions with Answers,
Copyright © 2018 Matteo Frigo and Massachusetts Institute of Technology.