I was benchmarking the blocked (DORMQR) and the non-blocked (DORM2R) LAPACK routines that apply Q from a QR factorization to another matrix C, i.e. C = Q * C.
My expectation was that the blocked version would be faster. Even if its flop count might be higher, it should benefit from better cache utilization, higher-level BLAS routines, and so on. But the numbers I see are quite the opposite. Any idea why?
I was using square matrices of size BlockSize × BlockSize. Here are my numbers for 500 repetitions:
| BlockSize | Method | Time | Factor (xORMQR / xORM2R) |
|-----------|--------|------|--------------------------|
| 32 | xORM2R | 0.000169992 | 1.19355 |
| 32 | xORMQR | 0.000202894 | |
| 64 | xORM2R | 0.000250101 | 249.983 |
| 64 | xORMQR | 0.062521 | |
| 128 | xORM2R | 0.000559092 | 151.6319 |
| 128 | xORMQR | 0.0847762 | |
| 256 | xORM2R | 0.00103307 | 151.2085 |
| 256 | xORMQR | 0.156209 | |
| 512 | xORM2R | 0.00210094 | 309.8199 |
| 512 | xORMQR | 0.650913 | |
| 1024 | xORM2R | 0.00397611 | 1258.325 |
| 1024 | xORMQR | 5.00324 | |
| 2048 | xORM2R | 0.00847983 | 2600.63 |
| 2048 | xORMQR | 22.0529 | |
| 4096 | xORM2R | 0.0168719 | 10778.93 |
| 4096 | xORMQR | 181.861 | |
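
One sanity check that may help narrow this down is to look at how each routine's time grows when the matrix size doubles. Applying k Householder reflectors from the left to an m × n matrix costs roughly 4mnk − 2nk² flops to leading order (about 2n³ when m = n = k), so each doubling should cost about 8 times as much. The sketch below only re-uses the timings from the table above. The helper names `growth_ratios` and `apply_q_flops` are made up for illustration, and the flop formula is a leading-order estimate, not an exact count for either routine.

```python
# Timings copied from the table above: size -> (xORM2R time, xORMQR time).
timings = {
    32: (0.000169992, 0.000202894),
    64: (0.000250101, 0.062521),
    128: (0.000559092, 0.0847762),
    256: (0.00103307, 0.156209),
    512: (0.00210094, 0.650913),
    1024: (0.00397611, 5.00324),
    2048: (0.00847983, 22.0529),
    4096: (0.0168719, 181.861),
}


def growth_ratios(times_by_size, column):
    """Return {n: t(n) / t(n/2)} for every size whose half is also listed."""
    ratios = {}
    for n, row in times_by_size.items():
        half = n // 2
        if half in times_by_size:
            ratios[n] = row[column] / times_by_size[half][column]
    return ratios


def apply_q_flops(m, n, k):
    """Leading-order flops for applying k reflectors from the left
    to an m x n matrix (hypothetical helper, estimate only)."""
    return 4 * m * n * k - 2 * n * k * k


orm2r_growth = growth_ratios(timings, 0)  # non-blocked routine
ormqr_growth = growth_ratios(timings, 1)  # blocked routine

# Expected growth per doubling for square problems (m = n = k).
expected = apply_q_flops(64, 64, 64) / apply_q_flops(32, 32, 32)
print(f"expected growth per doubling: x{expected:.1f}")

for n in sorted(orm2r_growth):
    print(f"{n:5d}  xORM2R x{orm2r_growth[n]:7.2f}   xORMQR x{ormqr_growth[n]:7.2f}")
```

With these numbers, the xORM2R time only about doubles each time the size doubles, which is far below the factor of 8 that cubic work would predict. That might mean the xORM2R call in my benchmark is not doing the full amount of work (for example, it could be returning early with a nonzero INFO, or be called with a smaller K than intended), but that is only a guess on my part.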
I first ran this with my own LAPACK installation and afterwards with Intel MKL's version of LAPACK, and got the same numbers. Am I missing something?