Hi Rasmus,
FMM performance is quite a complex topic. First of all, consider a simple single-layer Helmholtz or Laplace operator. Then one matvec requires one FMM pass. While the FMM scales close to linearly, the algorithm itself is quite complex. So even for a few ten thousand elements, as soon as you have the matrix available as dense assembly the matvec may be faster in dense mode due to the much simpler data transport. Obviously, there will be a crossover point and you will soon run out of memory anyway if you continue dense.
The next issue is the number of passes for a single operator. The conjugate double layer also only requires one pass, while the double layer operator requires three passess and the Helmholtz hypersingular requires six passes (Laplace hypersingular needs three passes). This further slows things down.
We use FMM for some very large problems, which cannot be represented in dense mode (think hundred thousands to millions of unknowns). But in personal experience, whenever a dense representation is possible and initial assembly is not far too slow, then that is preferable to FMM.