# At which grid size does FMM become 'worth it'?

Hi everyone. Thanks a lot for all the help so far.

I’m (still) doing Helmholtz scattering on a rigid object. I’m using somewhat large grids, and my simulations take a long time to compute.

```python
import time
import bempp.api
import numpy as np

grid = bempp.api.import_grid('some_grid.stl')

k = 100  # wave number
space = bempp.api.function_space(grid, "P", 1)  # piecewise linear needed due to the hypersingular operator

I = bempp.api.operators.boundary.sparse.identity(space, space, space)
M = bempp.api.operators.boundary.helmholtz.double_layer(space, space, space, k, assembler="fmm")  # double layer potential
E = bempp.api.operators.boundary.helmholtz.hypersingular(space, space, space, k, assembler="fmm")
lhs = 0.5 * I - M - 1j / k * E

# point source far away from the boundary
r = 2
theta = 45 * np.pi / 180
phi = 20 * np.pi / 180
r_s = r * np.array([np.cos(theta) * np.cos(phi), np.sin(theta) * np.cos(phi), np.sin(phi)])

@bempp.api.complex_callable
def f(x, n, domain_index, result):
    dist = np.sqrt((r_s[0] - x[0])**2 + (r_s[1] - x[1])**2 + (r_s[2] - x[2])**2)
    result[0] = np.exp(1j * k * dist) / dist

rhs = bempp.api.GridFunction(space, fun=f)

t1 = time.time()
print('Solving equation system')
sol, info = bempp.api.linalg.gmres(lhs, rhs, tol=1E-4)

print('solve time')
print(time.time() - t1)
print('Solver returned with')
print(info)
print('\n\n')
```

I’m currently running the above on a cloud computer with 4 vCPUs and 32 GB of RAM.
For a grid with ~20,000 elements the system logged a solve time of

```
solve time
6649.893636465073
Solver returned with
0
```

So a little under two hours.

Now if I instead don’t specify the assembler, like

```python
M = bempp.api.operators.boundary.helmholtz.double_layer(space, space, space, k)
E = bempp.api.operators.boundary.helmholtz.hypersingular(space, space, space, k)
```

then the program finished in about half the time

```
solve time
3013.474852323532
Solver returned with
0
```

Which leads me to believe that either I’m using the FMM wrong, or that it is only worth it for huge grids, i.e. its overhead is still greater than the cost of dense assembly even for grids with over 20,000 elements (the first option being much more likely).
In theory FMM should scale as O(N log N) instead of O(N^2), right?
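For intuition, here is a back-of-the-envelope sketch of why the crossover can sit surprisingly high (the constants below are invented for illustration, not measured): a dense matvec costs roughly c_dense·N², an FMM matvec roughly c_fmm·N·log N with a much larger constant, so where the crossover lands depends entirely on the ratio of the two constants.

```python
import math

# Hypothetical per-element costs; only their ratio matters here.
c_dense = 1.0    # cost unit per entry of a dense matvec
c_fmm = 2000.0   # FMM carries a much larger constant per element

def dense_cost(n):
    return c_dense * n**2

def fmm_cost(n):
    return c_fmm * n * math.log(n)

# Scan in steps of 1000 for the first N where the FMM matvec wins.
crossover = next(n for n in range(1000, 10**7, 1000)
                 if fmm_cost(n) < dense_cost(n))
print(crossover)
```

With these made-up constants the crossover lands in the tens of thousands of elements; the point is only that a large constant factor can push it well past "medium-sized" grids.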

I don’t know the significance of this, but if I interrupt the computation the last function call is almost always

```
  File "/home/rasmus/anaconda3/lib/python3.8/site-packages/Bempp_cl-0.2.2-py3.8.egg/bempp/api/fmm/exafmm.py", line 160, in evaluate
    result = self._module.evaluate(self._tree, self._fmm)
```

Am I doing something stupid in my code? Even if there is no solution, I would still like to know when the FMM is useful and when it isn’t, so I can write better simulations in the future.

Best regards
Rasmus

Hi Rasmus,
I will try to help you through some lines of investigation. You are right: twice the solve time with the FMM assembler for 20k elements does seem somewhat “weird”.

First of all, I am wondering whether the number of elements is sufficient. In other words, what is your mesh density criterion in elements per wavelength? In addition, is it possible for you to get a reference solution with the dense assembler?

In any case, I would first check the convergence of the solution by displaying the residuals for both assemblers; normally they are quite similar. If the solution struggles to converge with the FMM, you will most likely have to increase the expansion order, which is 5 by default (a very important parameter for a reliable expansion of the FMM functions). The drawback is that this increases your memory requirements.

Best Regards.

Xavier V.

Hi Xavier,
Thanks a lot for taking the time.

I’m aiming for ~8-10 elements per wavelength, and I will eventually need a finer mesh than the one I tested the above with (50-60k elements). But I would prefer to get the program running a bit faster before feeding it larger meshes. The 10-elements-per-wavelength criterion should, however, be satisfied for the tested wavenumber k = 100.

I will have a look at it. What would I do if the solutions are dissimilar?

The compute cluster at my university is very ‘generous’ with memory, so my current virtual machine has about 64 GB of RAM, of which only about 30% gets utilized. Increasing memory usage is therefore not a problem if it improves convergence.

I tried looking through the Bempp documentation, but I can’t find how to increase the expansion order of the FMM. How do I change this parameter?

So I have done some investigating, and have turned to logging everything to a log file.

What I have found for meshes around 20k elements is that the convergence of GMRES is very slow, for both dense assembly and FMM. Generally, the number of iterations and the residual values are exactly the same. For high values of k the iteration count gets well into the hundreds. I came back this morning after having a test script run overnight, and the iteration counter was over 500.

I have played a little around with changing the expansion order by using

```python
bempp.api.GLOBAL_PARAMETERS.fmm.expansion_order = X
```

At the default value of 5, each FMM call takes about 5-6 seconds:

```
TIMING:bempp:Start operation: Evaluating Fmm.
TIMING:bempp:Start operation: Calling ExaFMM.
TIMING:bempp:Finished Operation: Calling ExaFMM.: 5.457153558731079s
TIMING:bempp:Start operation: Singular Corrections Evaluator
TIMING:bempp:Finished Operation: Singular Corrections Evaluator: 0.18956685066223145s
TIMING:bempp:Finished Operation: Evaluating Fmm.: 5.652909994125366s
```

For one GMRES iteration, 12 of these calls to the FMM were logged.

By increasing it to, say, 6, my memory consumption increases a little, but so does the time per ExaFMM call:

```
TIMING:bempp:Start operation: Evaluating Fmm.
TIMING:bempp:Start operation: Calling ExaFMM.
TIMING:bempp:Finished Operation: Calling ExaFMM.: 6.3357932567596436s
TIMING:bempp:Start operation: Singular Corrections Evaluator
TIMING:bempp:Finished Operation: Singular Corrections Evaluator: 0.22159862518310547s
TIMING:bempp:Finished Operation: Evaluating Fmm.: 6.563088893890381s
```

Which is slower per call, but now only 9 calls were made, so the total time should be similar.
The residuals also seem to converge at the same rate, here for k ≈ 18, which is the low end of my frequency range of interest: (0.96, 0.87, 0.77, 0.64, 0.56, 0.52, 0.48, ...)

Extrapolating the curve, assuming exponential decay, it seems it will take about 40 iterations to reach a tolerance of 1e-2 and around 70 for 1e-3.
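For what it’s worth, that kind of extrapolation can be done with a straight-line fit to the log-residuals (using the residual values quoted above; the exact iteration counts depend on the fit):

```python
import numpy as np

# Residuals from the first 7 GMRES iterations at k ~ 18, as quoted above.
res = np.array([0.96, 0.87, 0.77, 0.64, 0.56, 0.52, 0.48])
its = np.arange(1, len(res) + 1)

# Fit log10(residual) ~ a * iteration + b, i.e. assume exponential decay.
a, b = np.polyfit(its, np.log10(res), 1)

def iters_to(tol):
    """Predicted iteration at which the fitted residual drops below tol."""
    return int(np.ceil((np.log10(tol) - b) / a))

print(iters_to(1e-2), iters_to(1e-3))
```

This simple fit predicts roughly 40 iterations to 1e-2 and 55-60 to 1e-3, in the same ballpark as the eyeballed extrapolation.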

Without the FMM, GMRES goes a lot quicker:

```
INFO:bempp:GMRES Iteration 1 with residual 0.9608518350984073
INFO:bempp:GMRES Iteration 2 with residual 0.8774929392680748
INFO:bempp:GMRES Iteration 3 with residual 0.7720222395249918
INFO:bempp:GMRES Iteration 4 with residual 0.6491716894707288
...
INFO:bempp:GMRES Iteration 55 with residual 0.01023279292849304
INFO:bempp:GMRES Iteration 56 with residual 0.009307566374188364
INFO:bempp:GMRES finished in 56 iterations and took 5.87E+01 sec.
```

So in this example the non-FMM GMRES finishes in one minute, whereas the FMM GMRES takes around one minute for a single iteration.

Meanwhile, the one place where the FMM actually is faster than dense assembly is in setting up the operators:

```
TIMING:bempp:assemble_sparse : 3.238e+00s
INFO:bempp:OpenCL Device set to: pthread-AMD EPYC Processor (with IBPB)
TIMING:bempp:Start operation: Initialising Exafmm.
TIMING:bempp:Finished Operation: Initialising Exafmm.: 2.3232381343841553s
TIMING:bempp:Start operation: Singular assembler:helmholtz_double_layer_boundary:opencl
TIMING:bempp:Finished Operation: Singular assembler:helmholtz_double_layer_boundary:opencl: 5.6644837856292725s
DEBUG:bempp:Using cached Fmm Interface.
TIMING:bempp:Start operation: Singular assembler:helmholtz_hypersingular_boundary:opencl
TIMING:bempp:Finished Operation: Singular assembler:helmholtz_hypersingular_boundary:opencl: 4.836869716644287s
INFO:bempp:Starting GMRES iteration
```

Compared to

```
TIMING:bempp:Start operation: Regular assembler:helmholtz_double_layer_boundary:opencl
INFO:bempp:OpenCL Device set to: pthread-AMD EPYC Processor (with IBPB)
TIMING:bempp:Finished Operation: Regular assembler:helmholtz_double_layer_boundary:opencl: 279.1030373573303s
TIMING:bempp:Start operation: Singular assembler:helmholtz_double_layer_boundary:opencl
TIMING:bempp:Finished Operation: Singular assembler:helmholtz_double_layer_boundary:opencl: 5.6845927238464355s
TIMING:bempp:Start operation: Regular assembler:helmholtz_hypersingular_boundary:opencl
TIMING:bempp:Finished Operation: Regular assembler:helmholtz_hypersingular_boundary:opencl: 271.2179608345032s
TIMING:bempp:Start operation: Singular assembler:helmholtz_hypersingular_boundary:opencl
TIMING:bempp:Finished Operation: Singular assembler:helmholtz_hypersingular_boundary:opencl: 4.796561241149902s
INFO:bempp:Starting GMRES iteration
```

So it is clearly the GMRES solve that slows down when using the FMM, but I don’t understand why it should be over 50 times slower for similar convergence.

So my conclusion is that I’m probably using FMM/ExaFMM wrong.

I know that the variables

```python
bempp.api.GLOBAL_PARAMETERS.fmm.dense_evaluation
bempp.api.GLOBAL_PARAMETERS.fmm.ncrit
bempp.api.GLOBAL_PARAMETERS.fmm.depth
```

are used in the ExaFMM calls, but I don’t know enough about the FMM to know what values they should take.

Any of you know what is going on here?

Hi Rasmus,

FMM performance is quite a complex topic. First of all, consider a simple single-layer Helmholtz or Laplace operator: one matvec requires one FMM pass. While the FMM scales close to linearly, the algorithm itself is quite complex, so even for a few tens of thousands of elements, once the matrix is available from dense assembly, the dense matvec may be faster due to its much simpler data movement. Obviously there is a crossover point, and you will soon run out of memory anyway if you stay dense.

The next issue is the number of passes for a single operator. The adjoint double layer also requires only one pass, but the double layer operator requires three passes and the Helmholtz hypersingular requires six (the Laplace hypersingular needs three). This slows things down further.
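To make that concrete for the operator in this thread: the left-hand side combines a double layer and a hypersingular operator, so each GMRES matvec needs roughly 3 + 6 = 9 FMM passes. A quick sanity check against the timings above (the ~5.5 s per pass is taken from your logs; the 12 calls you counted suggest a few extra evaluations on top):

```python
# FMM passes per matvec for each Helmholtz boundary operator,
# per the counts given above.
passes = {"single_layer": 1, "adjoint_double_layer": 1,
          "double_layer": 3, "hypersingular": 6}

t_pass = 5.5  # seconds per ExaFMM evaluation, from the logs above

# The identity operator is sparse and essentially free, so the combined
# operator 0.5*I - M - (1j/k)*E costs roughly:
passes_per_matvec = passes["double_layer"] + passes["hypersingular"]
t_matvec = passes_per_matvec * t_pass
print(passes_per_matvec, t_matvec)  # 9 passes, ~50 s per GMRES iteration
```

That back-of-the-envelope figure is consistent with the roughly one minute per FMM iteration you observed.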

We use the FMM for some very large problems that cannot be represented in dense mode (think hundreds of thousands to millions of unknowns). But in my personal experience, whenever a dense representation is possible and the initial assembly is not far too slow, it is preferable to the FMM.