Unverified · Commit d69bd061 · Author: Ankit Jain · Committer: GitHub

[wasm] EmccCompile: Improve AOT time by better utilizing the cores (#67195)

* [wasm] EmccCompile: Improve AOT time by better utilizing the cores

Problem:

The `EmccCompile` task compiles `.bc` files to `.o` files, and uses
`Parallel.ForEach` to run `emcc` for these in parallel.

The problem manifests when `EmccCompile` is compiling a lot of files.
- To start with, the intended number of cores is being used,
- but at some point (in my case after ~150 out of 180 files), the number
  of cores being utilized goes down to 1.
- The reason is that `Parallel.ForEach` partitions the list of
  files (jobs), and each partition executes only the jobs assigned to it.

From: https://github.com/dotnet/runtime/issues/46146#issuecomment-754021690

Stephen Toub:
    "As such, by default ForEach works on a scheme whereby each
    thread takes one item each time it goes back to the enumerator,
    and then after a few times of this upgrades to taking two items
    each time it goes back to the enumerator, and then four, and
    then eight, and so on. This ammortizes the cost of taking and
    releasing the lock across multiple items, while still enabling
    parallelization for enumerables containing just a few items. It
    does, however, mean that if you've got a case where the body
    takes a really long time and the work for every item is
    heterogeneous, you can end up with an imbalance."

The above means that with wildly different times taken by each job, we
can end up in this imbalance, leading to some cores being idle, while
others get reduced to running jobs sequentially.

Instead, we want to use work-stealing so jobs can be run by any partition.
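
As a rough illustration of the idea (not the code from this change), the
sketch below runs jobs of very different durations through
`Parallel.ForEach` over a partitioner created with
`EnumerablePartitionerOptions.NoBuffering`. The `NoBufferingDemo` class,
the `RunAll` helper, and the sleep durations are all hypothetical stand-ins
for the real `emcc` invocations. With `NoBuffering`, each worker takes one
item at a time from the shared enumerator, so no worker is left holding a
chunk of queued jobs while the others sit idle.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class NoBufferingDemo
{
    // Hypothetical helper: runs every job and returns how many completed.
    // Each int is a made-up job duration in milliseconds.
    public static int RunAll(IEnumerable<int> jobDurationsMs, int maxParallelism)
    {
        int completed = 0;

        // NoBuffering: workers pull exactly one item per trip to the
        // enumerator, instead of growing chunks of 1, 2, 4, 8, ... items.
        OrderablePartitioner<int> partitioner =
            Partitioner.Create(jobDurationsMs, EnumerablePartitionerOptions.NoBuffering);

        Parallel.ForEach(
            partitioner,
            new ParallelOptions { MaxDegreeOfParallelism = maxParallelism },
            durationMs =>
            {
                Thread.Sleep(durationMs); // stand-in for compiling one file
                Interlocked.Increment(ref completed);
            });

        return completed;
    }

    public static void Main()
    {
        // One slow job among many fast ones, to mimic heterogeneous compile times.
        int[] jobs = { 200, 5, 5, 5, 5, 5, 5, 5 };
        int done = RunAll(jobs, maxParallelism: 2);
        Console.WriteLine($"completed {done} of {jobs.Length} jobs");
    }
}
```

The trade-off is more synchronization on the shared enumerator, which
should be negligible when each job takes seconds, as the compile jobs here do.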

In my highly unscientific testing, with AOT for `System.Buffers.Tests`,
the total time to run `EmccCompile` for 181 assemblies goes from 5.7 min
to 4.0 min.

* MonoAOTCompiler.cs: Ensure that the parallel jobs get scheduled with
  work-stealing, instead of being partitioned.
Parent aded3141
......@@ -419,8 +419,34 @@ private bool ExecuteInternal()
 if (BuildEngine is IBuildEngine9 be9)
 allowedParallelism = be9.RequestCores(allowedParallelism);
+/*
+From: https://github.com/dotnet/runtime/issues/46146#issuecomment-754021690
+Stephen Toub:
+"As such, by default ForEach works on a scheme whereby each
+thread takes one item each time it goes back to the enumerator,
+and then after a few times of this upgrades to taking two items
+each time it goes back to the enumerator, and then four, and
+then eight, and so on. This ammortizes the cost of taking and
+releasing the lock across multiple items, while still enabling
+parallelization for enumerables containing just a few items. It
+does, however, mean that if you've got a case where the body
+takes a really long time and the work for every item is
+heterogeneous, you can end up with an imbalance."
+The time taken by individual compile jobs here can vary a
+lot, depending on various factors like file size. This can
+create an imbalance, like mentioned above, and we can end up
+in a situation where one of the partitions has a job that
+takes very long to execute, by which time other partitions
+have completed, so some cores are idle. But the idle
+ones won't get any of the remaining jobs, because they are
+all assigned to that one partition.
+Instead, we want to use work-stealing so jobs can be run by any partition.
+*/
 ParallelLoopResult result = Parallel.ForEach(
-argsList,
+Partitioner.Create(argsList, EnumerablePartitionerOptions.NoBuffering),
 new ParallelOptions { MaxDegreeOfParallelism = allowedParallelism },
 (args, state) => PrecompileLibraryParallel(args, state));
......
......@@ -129,7 +129,34 @@ private bool ExecuteActual()
 if (BuildEngine is IBuildEngine9 be9)
 allowedParallelism = be9.RequestCores(allowedParallelism);
-ParallelLoopResult result = Parallel.ForEach(filesToCompile,
+/*
+From: https://github.com/dotnet/runtime/issues/46146#issuecomment-754021690
+Stephen Toub:
+"As such, by default ForEach works on a scheme whereby each
+thread takes one item each time it goes back to the enumerator,
+and then after a few times of this upgrades to taking two items
+each time it goes back to the enumerator, and then four, and
+then eight, and so on. This ammortizes the cost of taking and
+releasing the lock across multiple items, while still enabling
+parallelization for enumerables containing just a few items. It
+does, however, mean that if you've got a case where the body
+takes a really long time and the work for every item is
+heterogeneous, you can end up with an imbalance."
+The time taken by individual compile jobs here can vary a
+lot, depending on various factors like file size. This can
+create an imbalance, like mentioned above, and we can end up
+in a situation where one of the partitions has a job that
+takes very long to execute, by which time other partitions
+have completed, so some cores are idle. But the idle
+ones won't get any of the remaining jobs, because they are
+all assigned to that one partition.
+Instead, we want to use work-stealing so jobs can be run by any partition.
+*/
+ParallelLoopResult result = Parallel.ForEach(
+Partitioner.Create(filesToCompile, EnumerablePartitionerOptions.NoBuffering),
 new ParallelOptions { MaxDegreeOfParallelism = allowedParallelism },
 (toCompile, state) =>
 {
......