Created by: hong19860320
- Fix the dims of parent idx in the ARM kernel of the beam_search op
- elementwise_mul supports int64_t data type with broadcasting
- Add print op and kernel for debugging
- Support throwing an exception when an internal error occurs
- Refine the kernels of the while and conditional_block ops
- Support the graph optimization on subblocks
- Pass program_desc and block_idx into the kernels of the control flow ops (while/conditional_block/subgraph) and create the RuntimeProgram online, which makes it possible to call the control flow ops recursively
- Add a unit test for the masked transformer model
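
To illustrate the control flow change listed above, here is a minimal, self-contained sketch of the idea: a control flow kernel receives the program description and a block index, and builds the runtime program for its sub-block only when it executes, so nested control flow ops fall out naturally. All types and names below (`OpKind`, `OpDesc`, `BlockDesc`, `ProgramDesc`, this toy `RuntimeProgram`, `RunNestedExample`) are hypothetical stand-ins for illustration and are not the actual classes or signatures in this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the real descriptors.
enum class OpKind { kIncrement, kConditionalBlock };

struct OpDesc {
  OpKind kind;
  int sub_block_idx;  // Only meaningful for control flow ops; -1 otherwise.
};

struct BlockDesc {
  std::vector<OpDesc> ops;
};

struct ProgramDesc {
  std::vector<BlockDesc> blocks;
};

// Toy runtime program built on demand from (program_desc, block_idx).
class RuntimeProgram {
 public:
  RuntimeProgram(const ProgramDesc& program, int block_idx)
      : program_(program), block_idx_(block_idx) {
    assert(block_idx >= 0 &&
           static_cast<std::size_t>(block_idx) < program.blocks.size());
  }

  void Run(int* counter) const {
    for (const OpDesc& op : program_.blocks[block_idx_].ops) {
      if (op.kind == OpKind::kIncrement) {
        ++*counter;
      } else {
        // The control flow "kernel": it holds the program description and
        // the index of its sub-block, and creates the nested runtime
        // program at execution time. Because the nested program is built
        // the same way, its own control flow ops recurse without special
        // handling.
        RuntimeProgram sub_program(program_, op.sub_block_idx);
        sub_program.Run(counter);
      }
    }
  }

 private:
  const ProgramDesc& program_;
  int block_idx_;
};

// Builds a three-level nesting: block 0 -> block 1 -> block 2.
// Each block increments the counter once.
int RunNestedExample() {
  ProgramDesc program;
  program.blocks.resize(3);
  program.blocks[0].ops = {{OpKind::kIncrement, -1},
                           {OpKind::kConditionalBlock, 1}};
  program.blocks[1].ops = {{OpKind::kIncrement, -1},
                           {OpKind::kConditionalBlock, 2}};
  program.blocks[2].ops = {{OpKind::kIncrement, -1}};

  int counter = 0;
  RuntimeProgram(program, 0).Run(&counter);
  return counter;
}
```

In this sketch, `RunNestedExample()` visits all three blocks through two levels of nested control flow ops. The design point it tries to capture is that the kernel needs only the shared program description plus an index, rather than a pre-built sub-program handed to it ahead of time.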