Set expected place in child thread for dataloader to avoid costing cuda memory on other card (#30338) * set expected place in child thread for dataloader * set device id when set tensor from numpy * revert tensor_py change * add compile guard * fix ci * fix bug