Threaded MKL for Paddle
Created by: wanglovesyang
I read cblas.cmake in Paddle and found that Paddle links against libmkl_sequential.so, which means all matrix operations on the CPU are done by a single core (within one trainer). This is reasonable on common server nodes (128 GB + 12 cores). However, I am currently running on an Intel Xeon Phi CPU, which has 256 cores. 128 GB of memory cannot hold 256 trainers if I want to make use of all the computing resources.
Hence, I switched to libmkl_intel_thread.so (by changing the cmake file) to parallelize Paddle's GEMM operations, so that I can reach 100% CPU usage while running only 10 trainers. Unfortunately, training this way (about 1 h/pass at 100% CPU) is much slower than using libmkl_sequential.so with 10 trainers (0.5 h/pass at 5% CPU). This result makes no sense to me. Could anyone help me look into this problem?
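For reference, here is a minimal sketch of the kind of change I made in cblas.cmake. The variable names (`MKL_ROOT`, `CBLAS_LIBRARIES`, `MKL_THREAD_LIB`) are illustrative, not Paddle's exact code; the library names follow Intel's standard MKL link line, where the threaded layer additionally requires Intel's OpenMP runtime (libiomp5):

```cmake
# Illustrative sketch only -- Paddle's real cblas.cmake differs.
# Standard Intel MKL link line: interface layer + threading layer + core.
set(MKL_LIB_DIR ${MKL_ROOT}/lib/intel64)  # MKL_ROOT assumed to point at the MKL install

find_library(MKL_INTEL_LIB NAMES mkl_intel_lp64 PATHS ${MKL_LIB_DIR})
find_library(MKL_CORE_LIB  NAMES mkl_core       PATHS ${MKL_LIB_DIR})

# Before: sequential threading layer, one core per trainer process
# find_library(MKL_THREAD_LIB NAMES mkl_sequential PATHS ${MKL_LIB_DIR})

# After: OpenMP threading layer, plus the Intel OpenMP runtime it depends on
find_library(MKL_THREAD_LIB NAMES mkl_intel_thread PATHS ${MKL_LIB_DIR})
find_library(IOMP5_LIB      NAMES iomp5            PATHS ${MKL_ROOT}/../lib/intel64)

set(CBLAS_LIBRARIES ${MKL_INTEL_LIB} ${MKL_THREAD_LIB} ${MKL_CORE_LIB} ${IOMP5_LIB})
```

Note that with the threaded layer, each trainer process gets its own MKL thread pool, whose size defaults to the number of available cores unless capped via MKL_NUM_THREADS or OMP_NUM_THREADS.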