Unverified commit 9e0e7c27, authored by Jialun, committed by GitHub

Fix OOM after cluster reset when gp_vmem_protect_limit > 16GB (#6862)

The function VmemTracker_ShmemInit initializes chunkSizeInBits, which
determines the chunk size, according to gp_vmem_protect_limit.
The base value of chunkSizeInBits is 20 (1MB chunks). If
gp_vmem_protect_limit is larger than 16GB, the value is increased to
adapt to large-memory environments. It should not change once it has
been initialized, but if the function is called more than once, the
increments to chunkSizeInBits accumulate.
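
The accumulation can be illustrated with a small model. This is a sketch, not the Greenplum source: the halving loop and the 16 * 1024 threshold are assumptions about how the scaling works, and init_keeping_old_value and init_from_base are hypothetical names. Only chunkSizeInBits, BITS_IN_MB and gp_vmem_protect_limit come from this commit.

```c
#include <stdio.h>

#define BITS_IN_MB 20                 /* 2^20 bytes = 1MB */
#define MAX_QUOTA_CHUNKS (16 * 1024)  /* assumed threshold: 16K chunks */

/* Stand-in for the variable that keeps its value across a shared memory reset. */
static int chunkSizeInBits = BITS_IN_MB;

/*
 * Hypothetical model of an init that starts from whatever value
 * chunkSizeInBits already holds. limit_mb plays the role of
 * gp_vmem_protect_limit (in MB).
 */
static int init_keeping_old_value(int limit_mb)
{
    int quota = limit_mb;

    /* Assumed scaling: double the chunk size until the quota fits. */
    while (quota > MAX_QUOTA_CHUNKS)
    {
        chunkSizeInBits++;
        quota >>= 1;
    }
    return chunkSizeInBits;
}

/* Hypothetical model of an init that always starts from the base value. */
static int init_from_base(int limit_mb)
{
    chunkSizeInBits = BITS_IN_MB;
    return init_keeping_old_value(limit_mb);
}
```

Under these assumptions, with a limit of 65535 MB every call to init_keeping_old_value adds 2 to chunkSizeInBits (22, 24, 26, and so on), while init_from_base returns 22 on every call.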

Consider this scenario: the QD crashes, and the postmaster reaps the
QD process and resets shared memory. This causes VmemTracker_ShmemInit
to be called again, so chunkSizeInBits increases after every crash
when gp_vmem_protect_limit is larger than 16GB. Eventually the chunk
size becomes so large that the number of newly reserved chunks is
always zero or a very small value. The memory limit mechanism then has
no effect, and the system runs out of memory when it cannot actually
allocate new memory.
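
To see why an oversized chunk defeats the limit, here is a minimal sketch of a bytes-to-chunks conversion. It assumes the tracker counts chunks by shifting a byte count right by chunkSizeInBits; bytes_to_chunks is a hypothetical helper, not a function taken from the Greenplum source.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of whole chunks covered by `bytes`
 * when one chunk is 2^chunk_size_in_bits bytes.
 */
static uint64_t bytes_to_chunks(uint64_t bytes, int chunk_size_in_bits)
{
    /* Shifting a 64-bit value by 64 or more is undefined in C. */
    if (chunk_size_in_bits >= 64)
        return 0;
    return bytes >> chunk_size_in_bits;
}
```

In this model an 8GB request counts as 8192 chunks at 20 bits and 2048 chunks at 22 bits, but as 0 chunks once the value has drifted up to 40 bits, so a quota expressed in chunks would never be reached.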

So we now set chunkSizeInBits to BITS_IN_MB in VmemTracker_ShmemInit
every time, instead of only asserting that it has that value.

Why is there no new test case in this commit?
- We only change an Assert into an assignment; there are no logic changes.
- It is very difficult to add a crash case to the current isolation test
  framework, because the connection is lost when the crash happens.

We verified the case manually in our dev environment by setting
gp_vmem_protect_limit to 65535 and running kill -9 on the QD process.
We saw chunkSizeInBits increase every time. Eventually we got the
error message "ERROR:  Canceling query because of high VMEM usage."
Parent 7281a162
......
@@ -106,7 +106,7 @@ VmemTracker_ShmemInit()
 	if(!IsUnderPostmaster)
 	{
-		Assert(chunkSizeInBits == BITS_IN_MB);
+		chunkSizeInBits = BITS_IN_MB;
 		vmemChunksQuota = gp_vmem_protect_limit;
 		/*
......