alinux: mm, memcg: fix possible soft lockup in try_charge
When events such as direct reclaim and OOM occur intensively, a soft
lockup is very likely to happen on instances with 1 vCPU and kernel
preemption disabled. An example soft lockup is as follows.

[ 160.555984] watchdog: BUG: soft lockup - CPU#0 stuck for 112s! [malloc:2188]
[ 160.557975] Modules linked in: button
[ 160.559495] CPU: 0 PID: 2188 Comm: malloc Not tainted 4.19.57-15.457.al7.x86_64 #1
[ 160.561546] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
[ 160.563707] RIP: 0010:shrink_node+0x1ae/0x450
[ 160.565391] Code: 00 00 00 49 8b 4f 20 ba 01 00 00 00 4c 8b 74 24 10 4d 8b 47 28 49 8b 77 10 48 2b 4c 24 08 41 8b 7f 1c 4d8
[ 160.570747] RSP: 0000:ffff9d0ec07a3b58 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[ 160.572889] RAX: ffff982ab6014330 RBX: ffff982ab6014000 RCX: 0000000000000000
[ 160.574992] RDX: 0000000000000001 RSI: ffff982ab6014000 RDI: ffff982ab6014000
[ 160.577106] RBP: ffff982afffb6000 R08: 0000000000000000 R09: ffff982ab6014000
[ 160.579219] R10: 0000000000000004 R11: 0000000000aaaaaa R12: 0000000000000000
[ 160.581326] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9d0ec07a3c50
[ 160.583450] FS: 00007f8b414f7740(0000) GS:ffff982afda00000(0000) knlGS:0000000000000000
[ 160.585704] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 160.587662] CR2: 00007f8adb800010 CR3: 000000007ac9e001 CR4: 00000000003606b0
[ 160.589835] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 160.591971] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 160.594133] Call Trace:
[ 160.595602] do_try_to_free_pages+0xcc/0x390
[ 160.597356] try_to_free_mem_cgroup_pages+0xf9/0x1d0
[ 160.599198] ? out_of_memory+0xb5/0x4a0
[ 160.600882] try_charge+0x244/0x750
[ 160.602522] ? __pagevec_lru_add_fn+0x1d0/0x330
[ 160.604310] mem_cgroup_try_charge+0xb4/0x1d0
[ 160.606085] mem_cgroup_try_charge_delay+0x1c/0x40
[ 160.607892] do_anonymous_page+0xf7/0x540
[ 160.609574] __handle_mm_fault+0x665/0xa00
[ 160.611233] ? __switch_to_asm+0x35/0x70
[ 160.612838] handle_mm_fault+0x122/0x1e0
[ 160.614407] __do_page_fault+0x1b7/0x470
[ 160.615962] do_page_fault+0x32/0x140
[ 160.617474] ? async_page_fault+0x8/0x30
[ 160.619012] async_page_fault+0x1e/0x30
[ 160.620526] RIP: 0033:0x40068e

Fix it by adding cond_resched() in try_charge(), just before goto retry
after OOM_SUCCESS, in order to let OOM free some memory first.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
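
The effect of the fix can be illustrated with a small userspace model.
This is only a sketch and not the kernel code: struct sim and every
sim_* function below are hypothetical names invented for the
illustration. The model assumes one CPU without preemption, reclaim
that never frees anything, and an OOM kill that only marks the victim,
so the victim's memory is released only once the charging task yields.

```c
#include <stdbool.h>

/* Toy model (hypothetical, not kernel code): one CPU, no preemption. */
struct sim {
    long usage;          /* pages currently charged to the cgroup */
    long limit;          /* cgroup limit, in pages */
    long victim_pages;   /* pages still held by the OOM victim */
    bool victim_killed;  /* victim was selected but has not run yet */
};

/* Stand-in for cond_resched(): the killed victim finally gets the CPU,
 * runs its exit path and gives its pages back. */
static void sim_cond_resched(struct sim *s)
{
    if (s->victim_killed && s->victim_pages > 0) {
        s->usage -= s->victim_pages;
        s->victim_pages = 0;
    }
}

/* Stand-in for the OOM killer reporting success: it only marks the
 * victim; nothing is freed until the victim is scheduled. */
static bool sim_oom_kill(struct sim *s)
{
    s->victim_killed = true;
    return true;
}

/* Model of the retry loop. Returns the number of iterations needed to
 * charge nr_pages, or -1 if max_spins iterations were not enough
 * (the modelled soft lockup). */
static int sim_try_charge(struct sim *s, long nr_pages,
                          bool yield_after_oom, int max_spins)
{
    for (int spins = 1; spins <= max_spins; spins++) {
        if (s->usage + nr_pages <= s->limit) {
            s->usage += nr_pages;
            return spins;
        }
        /* Reclaim is assumed to free nothing, so fall through to OOM. */
        if (sim_oom_kill(s) && yield_after_oom)
            sim_cond_resched(s);  /* the step this commit adds */
        /* Without the yield, the loop retries while the victim
         * never gets a chance to run. */
    }
    return -1;
}
```

With usage equal to the limit, calling sim_try_charge() with
yield_after_oom set to true succeeds on the second iteration, while the
same call with false exhausts max_spins, which is the spinning pattern
the watchdog reports above.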