Unverified commit 313074f4 authored by Juan Hoyos, committed by GitHub

Add documentation for merge on red, build analysis, and v1/v2 tests (#84982)

Co-authored-by: Carlos Sánchez López <1175054+carlossanlop@users.noreply.github.com>
Co-authored-by: Jeremy Koritzinsky <jkoritzinsky@gmail.com>
Parent 7ddc45a4
......@@ -10,7 +10,7 @@ body:
id: background
attributes:
label: Error Blob
description: Please identify a clear error string that can help identify future instances of this issue. For more information on how to fill this, check https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/KnownIssues.md#filling-out-known-issues-json-blob
description: Please identify a clear error string that can help identify future instances of this issue. For more information on how to fill this, check our issue triage guidelines at [Failure Analysis](/dotnet/runtime/blob/main/docs/workflow/ci/failure-analysis.md#what-to-do-if-you-determine-the-failure-is-unrelated)
value: |
```json
{
......
......@@ -61,7 +61,7 @@ The best way to create a minimal reproduction is gradually removing code and dep
Project maintainers will merge changes that improve the product significantly.
The [Pull Request Guide](docs/pr-guide.md) and [Copyright](docs/project/copyright.md) docs define additional guidance.
The [Pull Request Guide](docs/workflow/pr-guide.md) and [Copyright](docs/project/copyright.md) docs define additional guidance.
### DOs and DON'Ts
......@@ -113,7 +113,7 @@ We use and recommend the following workflow:
- Make sure that the tests are all passing, including your new tests.
7. Create a pull request (PR) against the dotnet/runtime repository's **main** branch.
- State in the description what issue or improvement your change is addressing.
- Check if all the Continuous Integration checks are passing.
- Check if all the Continuous Integration checks are passing. Refer to [triaging failures in CI](docs/workflow/ci/failure-analysis.md) to check if any outstanding errors are known.
8. Wait for feedback or approval of your changes from the [area owners](docs/area-owners.md).
- Details about the pull request [review procedure](docs/pr-guide.md).
9. When area owners have signed off, and all checks are green, your PR will be merged.
......@@ -165,9 +165,9 @@ The following file header is the used for files in this repo. Please use it for
### PR - CI Process
The [dotnet continuous integration](https://dev.azure.com/dnceng/public/) (CI) system will automatically perform the required builds and run tests (including the ones you are expected to run) for PRs. Builds and test runs must be clean.
The [dotnet continuous integration](https://dev.azure.com/dnceng-public/public/_build) (CI) system will automatically perform the required builds and run tests (including the ones you are expected to run) for PRs. Builds and test runs must be clean or have bugs properly filed against flaky/unexpected failures that are unrelated to your change.
If the CI build fails for any reason, the PR issue will be updated with a link that can be used to determine the cause of the failure.
If the CI build fails for any reason, the PR issue will link to the `Azure DevOps` build with further information on the failure.
### PR Feedback
......
## PR Builds
When submitting a PR to the `dotnet/runtime` repository various builds will run validation in many areas to ensure we keep productivity and quality high.
The `dotnet/runtime` validation system can become overwhelming, as we need to cover a lot of build scenarios and test on all the platforms that we support. To make validation more reliable while spending the least amount of time testing what each PR's changes actually need, we have various pipelines, required and optional, that are covered in this document.
Most of the repository pipelines use a custom mechanism to evaluate paths based on the changes contained in the PR to try and build/test the least that we can without compromising quality. This is the initial step on every pipeline that depends on this infrastructure, called "Evaluate Paths". In this step you can see the result of the evaluation for each subset of the repository. For more details on which subsets we have based on paths, see [here](https://github.com/dotnet/runtime/blob/513fe2863ad5ec6dc453d223d4b60f787a0ffa78/eng/pipelines/common/evaluate-default-paths.yml). Also, to understand how this mechanism works, you can read this [comment](https://github.com/dotnet/runtime/blob/513fe2863ad5ec6dc453d223d4b60f787a0ffa78/eng/pipelines/evaluate-changed-paths.sh#L3-L12).
### Runtime pipeline
This is the "main" pipeline for the runtime product. In this pipeline we include the most critical tests and platforms where we have enough test resources in order to deliver test results in a reasonable amount of time. The tests executed in this pipeline for runtime and libraries are considered innerloop, are the tests that are executed locally when one runs tests locally.
For mobile platforms and wasm we run some smoke tests that aim to protect the quality of these platforms. We had to move to a smoke test approach given the hardware and time limitations that we encountered and contributors were affected by this with unstability and long wait times for their PRs to finish validation.
### Runtime-dev-innerloop pipeline
This pipeline is also required, and its intent is to cover developer innerloop scenarios that could be affected by any change, like running a specific build command or running tests inside Visual Studio, etc.
### Dotnet-linker-tests
This is also a required pipeline. The purpose of this pipeline is to test that the libraries code is ILLink friendly, meaning that when we trim our libraries using the ILLink, we don't introduce trimming bugs, such as a method required in a specific scenario being trimmed away by accident.
### Runtime-staging
This pipeline runs on every change, however it behaves a little differently than the other pipelines. This pipeline will not fail if there are test failures, however it will fail if there is a timeout or a build failure. The reason we fail on build failures is that we want to protect the developer innerloop (building the repository) for this platform.
The tests will not fail because the intent of this pipeline is to stage new platforms where the test infrastructure is new and we need to test whether we have enough capacity to include that new platform on the "main" runtime pipeline without causing flakiness. Once we analyze data and a platform is stable when running on PRs in this pipeline for at least a week, it can be promoted either to the `runtime-extra-platforms` pipeline or to the `runtime` pipeline.
### Runtime-extra-platforms
This pipeline does not run by default as it is not required for a PR, but it runs twice a day, and it can also be invoked in specific PRs by commenting `/azp run runtime-extra-platforms`. However, this pipeline is still an important part of our testing.
This pipeline runs innerloop tests on platforms where we don't have enough hardware capacity to run tests (mobile, browser) or on platforms where we believe tests should organically pass based on the coverage we have in the "main" runtime pipeline. For example, in the "main" pipeline we run tests on Ubuntu 21.10, but since we also support Ubuntu 18.04, which is an LTS release, we run tests on Ubuntu 18.04 in this pipeline just to make sure we have healthy tests on the platforms for which we are releasing a product.
Another concrete scenario is windows arm64 libraries tests, where we don't have enough hardware. The JIT is the most important piece to test there, as it is what generates the native code to run on that platform, so we run JIT tests on arm64 in the "main" pipeline, but our libraries tests are only run on the `runtime-extra-platforms` pipeline.
### Outerloop pipelines
We have various pipelines whose names contain `Outerloop`. These pipelines will not run by default on every PR; they can be invoked using the `/azp run` comment and will run on a daily basis to analyze test results.
These pipelines run tests that take very long, that are not very stable (e.g. some networking tests), or that modify machine state. Such tests are called `Outerloop` tests rather than `innerloop`.
## Analyzing Failures
The PR Build Analysis tab has a summary of all failures, including matches with the list of known issues. This tab should be your first stop for analyzing the PR failures.
Validation may fail for several reasons:
### Option 1: You have a defect in your PR
* Simply push the fix to your PR branch, and validation will start over.
### Option 2: There is a flaky test that is not related to your PR
* Your assumption should be that a failed test indicates a problem in your PR. (If we don't operate this way, chaos ensues.) If the test fails when run again, it is almost surely a failure caused by your PR. However, there are occasions where unrelated failures occur. Here are some ways to know:
* Perhaps you see the same failure in CI results for unrelated active PRs.
* It's a known issue listed in our [big tracking issue](https://github.com/dotnet/runtime/issues/702) or tagged `blocking-clean-ci` [(query here)](https://github.com/dotnet/runtime/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Ablocking-clean-ci+)
* It's otherwise beyond any reasonable doubt that your code changes could not have caused this.
* If the tests pass on rerun, that may suggest it's not related.
* In this situation, you want to re-run but not necessarily rebase on main.
* To rerun just the failed leg(s):
* Click on any leg. Navigate through the Azure DevOps UI, find the "..." button and choose "Retry failed legs"
* Or, on the GitHub Checks tab choose "re-run failed checks". This will not rebase your change.
* To rerun all validation:
* Add a comment `/azp run runtime`
* Or, click on "re-run all checks" in the GitHub Checks tab
* Or, simply close and reopen the PR.
* If you have established that it is an unrelated failure, please ensure we have an active issue for it. See the [unrelated failure](#what-to-do-if-you-determine-the-failure-is-unrelated) section below.
* Whoever merges the PR should be satisfied that the failure is unrelated, is not introduced by the change, and that we are appropriately tracking it.
### Option 3: The state of the main branch HEAD is bad.
* This is the very rare case where there was a build break in main, and you got unlucky. Hopefully the break has been fixed, and you want CI to rebase your change and rerun validation.
* To rebase and rerun all validation:
* Add a comment `/azp run runtime`
* Or, click on "re-run all checks" in the GitHub Checks tab
* Or, simply close and reopen the PR.
* Or, amend your commit with `--amend --no-edit` and force push to your branch.
### Additional information:
* You can list the available pipelines by adding a comment like `/azp list` or get the available commands by adding a comment like `/azp help`.
* In the rare case the license/cla check fails to register a response, it can be rerun by issuing a GET request to `https://cla.dotnetfoundation.org/check/dotnet/runtime?pullRequest={pr_number}`. A successful response may be a redirect to `https://github.com`.
* Reach out to the infrastructure team for assistance on [Teams channel](https://teams.microsoft.com/l/channel/19%3ab27b36ecd10a46398da76b02f0411de7%40thread.skype/Infrastructure?groupId=014ca51d-be57-47fa-9628-a15efcc3c376&tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47) (for corpnet users) or on [Gitter](https://gitter.im/dotnet/community) in other cases.
## What to do if you determine the failure is unrelated
If you have determined the failure is definitely not caused by changes in your PR, please do this:
* If the failure is identified as a known issue in the Build Analysis tab, no further action is required.
* Search for an [existing issue](https://github.com/dotnet/runtime/issues). Usually the test method name or (if a crash/hang) the test assembly name are good search parameters.
* If there's an existing issue, add a comment with
* a) the link to the build
* b) the affected configuration (e.g. `net6.0-windows-Release-x64-Windows.81.Amd64.Open`)
* c) all console output including the error message and stack trace from the Azure DevOps tab (This is necessary as retention policies are in place that recycle old builds.)
* d) if there's a dump file (see Attachments tab in Azure DevOps) include that
* If the issue is already closed, reopen it and update the labels to reflect the current failure state.
* If possible, please [update the failure signature](https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/KnownIssues.md#how-to-fill-out-a-known-issue-error-message-section) to have it automatically identified by the Build Analysis as known issue next time.
* If there's no existing issue, create an issue with the same information listed above.
* Update the original pull request with a comment linking to the new or existing issue.
* If the failure is occurring frequently, please disable the failing test(s) with the corresponding issue link tracking the disabling in a follow-up Pull Request
* Update the tracking issue with the label `disabled-test`.
* For libraries tests add a [`[ActiveIssue(link)]`](https://github.com/dotnet/arcade/blob/master/src/Microsoft.DotNet.XUnitExtensions/src/Attributes/ActiveIssueAttribute.cs) attribute on the test method. You can narrow the disabling down to runtime variant, flavor, and platform. For an example see [File_AppendAllLinesAsync_Encoded](https://github.com/dotnet/runtime/blob/cf49643711ad8aa4685a8054286c1348cef6e1d8/src/libraries/System.IO.FileSystem/tests/File/AppendAsync.cs#L74)
* For runtime tests found under `src/tests`, please edit [`issues.targets`](https://github.com/dotnet/runtime/blob/main/src/tests/issues.targets). There are several groups for different types of disable (mono vs. coreclr, different platforms, different scenarios). Add the folder containing the test and issue mimicking any of the samples in the file.
There are plenty of possible bugs, e.g. race conditions, where a failure might highlight a real problem and it won't manifest again on a retry. Therefore these steps should be followed for every iteration of the PR build, e.g. before retrying/rebuilding.
# Workflow Guide
* [Build Requirements](#build-requirements)
* [Getting Yourself Started](#getting-yourself-started)
* [Configurations and Subsets](#configurations-and-subsets)
* [What does this mean for me?](#what-does-this-mean-for-me)
* [Full Instructions on Building and Testing the Runtime Repo](#full-instructions-on-building-and-testing-the-runtime-repo)
- [Build Requirements](#build-requirements)
- [Getting Yourself Started](#getting-yourself-started)
- [Configurations and Subsets](#configurations-and-subsets)
- [What does this mean for me?](#what-does-this-mean-for-me)
- [Full Instructions on Building and Testing the Runtime Repo](#full-instructions-on-building-and-testing-the-runtime-repo)
- [Warnings as Errors](#warnings-as-errors)
- [Submitting a PR](#submitting-a-pr)
- [Triaging errors in CI](#triaging-errors-in-ci)
The repo can be built for the following platforms, using the provided setup and the following instructions. Before attempting to clone or build, please check the requirements that match your machine, and ensure you install and prepare all as necessary.
......@@ -91,3 +94,11 @@ And how to measure performance:
## Warnings as Errors
The repo build treats warnings as errors. Dealing with warnings when you're in the middle of making changes can be annoying (e.g. unused variable that you plan to use later). To disable treating warnings as errors, set the `TreatWarningsAsErrors` environment variable to `false` before building. This variable will be respected by both the `build.sh`/`build.cmd` root build scripts and builds done with `dotnet build` or Visual Studio. Some people may prefer setting this environment variable globally in their machine settings.
## Submitting a PR
Before submitting a PR, make sure to review the [contribution guidelines](../../CONTRIBUTING.md). After you get familiarized with them, please read the [PR guide](ci/pr-guide.md) to find more information about tips and conventions around creating a PR, getting it reviewed, and understanding the CI results.
## Triaging errors in CI
Given the size of the runtime repository, flaky tests are expected to some degree. There are a few mechanisms we use to help with the discoverability of widely impacting issues. We also have a regular procedure that ensures issues get properly tracked and prioritized. You can find more information on [triaging failures in CI](ci/failure-analysis.md).
This diff was suppressed by .gitattributes.
This diff was suppressed by .gitattributes.
This diff was suppressed by .gitattributes.
# Analyzing Failures with Build Analysis and Known Issues
* [Triaging errors seen in CI](#triaging-errors-seen-in-ci)
* [Option 1: You have a defect in your PR](#option-1-you-have-a-defect-in-your-pr)
* [Option 2: There is a flaky test that is not related to your PR](#option-2-there-is-a-flaky-test-that-is-not-related-to-your-pr)
* [Option 3: The state of the main branch HEAD is bad.](#option-3-the-state-of-the-main-branch-head-is-bad)
* [Additional information:](#additional-information)
* [What to do if you determine the failure is unrelated](#what-to-do-if-you-determine-the-failure-is-unrelated)
* [Examples of Build Analysis](#examples-of-build-analysis)
* [Good usage examples](#good-usage-examples)
* [Bad usage examples](#bad-usage-examples)
## Triaging errors seen in CI
In case of failure, any PR on the runtime will have a failed GitHub check - PR Build Analysis - which has a summary of all failures, including a list of matching known issues as well as any regressions introduced to the build or the tests. This tab should be your first stop for analyzing the PR failures.
![Build analysis check](analysis-check.png)
This check tries to bubble up as much useful information as possible about all failures for any given PR and the pipelines it runs. It tracks both build and test failures and provides quick links to the build/test legs, the logs, and other supplemental information that `Azure DevOps` may provide. The idea is to minimize the number of links to follow and to surface well-known issues that have already been identified. It also adds a link to the `Helix Artifacts` tab of a failed test, as it often contains more detailed logs of the execution or a dump that's been collected at fault time.
Validation may fail for several reasons, and for each one we have a different recommended action:
### Option 1: You have a defect in your PR
* Simply push the fix to your PR branch, and validation will start over.
### Option 2: There is a flaky test that is not related to your PR
* Your assumption should be that a failed test indicates a problem in your PR. (If we don't operate this way, chaos ensues.) However, there are often subtle regressions and flaky bugs that might have slipped into the target branch.
* Reruns might help, but we tend to be conservative with them as they tend to spike our resource usage. Opt to use them only if there is no known issue that can be correlated to the failures and it's not clear whether the errors could be related to your change. Try to rerun only the particular legs if possible, by navigating to the GitHub Checks tab and clicking on `Re-run failed checks`.
* There's the possibility someone else has already investigated the issue. In that case, the build analysis tab should report the issue like so:
![known issue example](known-issue-example.png)
There's no additional work required here - the bug is getting tracked and appropriate data is being collected.
* If the error is not getting reported as a known issue and you believe it's unrelated, see the [unrelated failure](#what-to-do-if-you-determine-the-failure-is-unrelated) section for next steps.
### Option 3: The state of the main branch HEAD is bad.
* This is the very rare case where there was a build break in main, and you got unlucky. Hopefully the break has been fixed, and you want CI to rebase your change and rerun validation.
* To rebase and rerun all validation:
* Add a comment `/azp run runtime`
* Or, click on "re-run all checks" in the GitHub Checks tab
* Or, simply close and reopen the PR.
* Or, amend your commit with `--amend --no-edit` and force push to your branch.
### Additional information:
* In the rare case the license/cla check fails to register a response, close and reopen the PR or push an empty commit.
* Reach out to the infrastructure team for assistance on [Teams channel](https://teams.microsoft.com/l/channel/19%3ab27b36ecd10a46398da76b02f0411de7%40thread.skype/Infrastructure?groupId=014ca51d-be57-47fa-9628-a15efcc3c376&tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47) (for corpnet users) or on [Gitter](https://gitter.im/dotnet/community) in other cases.
## What to do if you determine the failure is unrelated
An issue that has not been reported before will look like this in the `Build Analysis` check tab:
![failed test](failed-test.png)
You can use the console log, any potential attached dumps in the artifacts section, or any other piece of information printed to help you decide if it's a regression caused by the change. Similarly, for runtime tests we will try to print the crashing stacks to aid in the investigation.
If you have considered all the diagnostic artifacts and determined the failure is definitely not caused by changes in your PR, please do this:
1. Identify a string from the logs that uniquely identifies the issue at hand. A good example of this is the string `The system cannot open the device or file specified. : 'NuGet-Migrations'` for issue https://github.com/dotnet/runtime/issues/80619.
2. On the test failure in the tab, you can select `Report repository issue`. This will prepopulate an issue with the appropriate tags and with a body similar to:
````
Build Information
Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=242380
Build error leg or test failing: Build / linux-arm64 Release AllSubsets_Mono_Minijit_RuntimeTests minijit / Build Tests
Pull request: https://github.com/dotnet/runtime/pull/84716
<!-- Error message template -->
## Error Message
Fill the error message using [known issues guidance](https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/KnownIssues.md#how-to-fill-out-a-known-issue-error-section).
```json
{
"ErrorMessage": "",
"BuildRetry": false,
"ErrorPattern": "",
"ExcludeConsoleLog": false
}
```
````
It already contains most of the essential information, but *it is very important that you fill out the json blob*.
- You can add into the `ErrorMessage` field the string that you found uniquely identifies the issue. In case you need to use a regex, use the `ErrorPattern` field instead. This is limited to a single-line, non-backtracking regex as described [here](https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/KnownIssues.md#regex-matching). This regex also needs to be appropriately escaped. Check the [arcade known issues](https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/KnownIssues.md#filling-out-known-issues-json-blob) documentation for a good guide on proper regex and JSON escaping. (A sketch for checking a pattern locally follows this list.)
- The field `ExcludeConsoleLog` describes whether the execution logs should be considered on top of the individual test results. **For most cases, this should be set to `true` as the failure will happen within a single test**. Setting it to `false` will mean all failures within an xUnit set of tests will also get attributed to this particular error, since there's one log describing all the problems. Due to limitations in Known Issues around rate limiting and xUnit resiliency, setting `ExcludeConsoleLog=false` is necessary in two scenarios:
+ Nested tests as reported to Azure DevOps. Essentially this means theory failures, which look like this when reported in Azure DevOps: ![xUnit theory seen in azure devops](theory-azdo.png).
Adding support for this requires too many API calls, so using the console log here is necessary.
+ Native crashes in libraries also require using the console log. This is needed as the crash corrupts the test results to be reported to Azure DevOps, so only the console logs are left.
- Optionally you can add specifics as needed like leg, configuration parameters, available dump links.
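Before filing, it can help to sanity-check a candidate `ErrorPattern` locally. The following is a minimal C# sketch, not part of the Build Analysis tooling; it assumes .NET 7+ for `RegexOptions.NonBacktracking`, which approximates the single-line, non-backtracking matching described above (the service's exact engine options are an assumption here):

```csharp
// Minimal local sanity check for a candidate ErrorPattern (illustrative only).
// Assumes .NET 7+ for RegexOptions.NonBacktracking.
using System;
using System.Text.RegularExpressions;

class ErrorPatternCheck
{
    static void Main()
    {
        // Sample log line from https://github.com/dotnet/runtime/issues/80619.
        string logLine =
            "The system cannot open the device or file specified. : 'NuGet-Migrations'";

        // Candidate pattern; remember it must additionally be JSON-escaped
        // when pasted into the known-issue blob.
        var pattern = new Regex(
            @"The system cannot open the device or file specified\. : ('|&#39;)NuGet-Migrations('|&#39;)",
            RegexOptions.NonBacktracking);

        Console.WriteLine(pattern.IsMatch(logLine) ? "Pattern matches" : "No match");
    }
}
```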
Once the issue is open, feel free to rerun the `Build Analysis` check; the issue should be recognized as known if everything was filed correctly, and you are ready to merge once all unrelated issues are marked as known. However, there are some known limitations to the system as previously described. Additionally, the system only looks at the error message and stacktrace fields of an Azure DevOps test result, and the console log in the helix queue. If rerunning the check doesn't pick up the known issue and you feel it should, feel free to tag @dotnet/runtime-infrastructure to request help from the infrastructure team.
After you do this, if the failure is occurring frequently as per the data captured in the recently opened issue, please disable the failing test(s) with the corresponding tracking issue link in a follow-up Pull Request.
* Update the tracking issue with the `disabled-test` label and remove the blocking tags.
* For libraries tests add a [`[ActiveIssue(link)]`](https://github.com/dotnet/arcade/blob/master/src/Microsoft.DotNet.XUnitExtensions/src/Attributes/ActiveIssueAttribute.cs) attribute on the test method. You can narrow the disabling down to runtime variant, flavor, and platform. For an example see [File_AppendAllLinesAsync_Encoded](https://github.com/dotnet/runtime/blob/cf49643711ad8aa4685a8054286c1348cef6e1d8/src/libraries/System.IO.FileSystem/tests/File/AppendAsync.cs#L74), or the sketch after this list.
* For runtime tests found under `src/tests`, please edit [`issues.targets`](https://github.com/dotnet/runtime/blob/main/src/tests/issues.targets). There are several groups for different types of disable (mono vs. coreclr, different platforms, different scenarios). Add the folder containing the test and issue mimicking any of the samples in the file.
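As a sketch of the libraries case, disabling with `[ActiveIssue]` looks roughly like this (the test names and issue URL below are hypothetical; the optional `TestPlatforms` argument narrows the disabling to a platform):

```csharp
using System.Threading.Tasks;
using Xunit;

public class ExampleTests
{
    // Hypothetical test disabled everywhere until the linked issue is resolved.
    [Fact]
    [ActiveIssue("https://github.com/dotnet/runtime/issues/00000")]
    public async Task SomeFlakyOperationAsync()
    {
        await Task.Yield();
    }

    // Hypothetical test disabled only on Windows; other platforms keep running it.
    [Fact]
    [ActiveIssue("https://github.com/dotnet/runtime/issues/00000", TestPlatforms.Windows)]
    public void SomeWindowsOnlyFailure()
    {
        Assert.True(true);
    }
}
```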
There are plenty of intermittent failures that won't manifest again on a retry. Therefore these steps should be followed for every iteration of the PR build, e.g. before retrying/rebuilding.
### Examples of Build Analysis
#### Good usage examples
- Sufficiently specific strings. Ex: issue https://github.com/dotnet/runtime/issues/80619
```json
{
"ErrorPattern": "The system cannot open the device or file specified. : (&#39;|')NuGet-Migrations(&#39;|')",
"BuildRetry": false,
"ExcludeConsoleLog": false
}
```
This is a case where the issue is tied to the machine the workitem lands on. Everything would fail in that test group, so `ExcludeConsoleLog: false` isn't harmful, and the string is specific to the issue. The proper usage of this provides useful insight, such as an accurate count of the impact of the issue, without blocking other devs:
![issue impact with data for investigation](issue-impact.png)
#### Bad usage examples
- Overly generic short strings. For example, "dlbigleakthd": just referring to the test name is likely to match the build log whenever there's a build failure, since the log will list the file getting built. In that case a better choice is the name of the scripts (sh/cmd) or part of the dump that caused the crash.
This diff was suppressed by .gitattributes.
This diff was suppressed by .gitattributes.
# Pipelines overview - Architecture and different available pipelines
* [Pipelines used in dotnet/runtime](#pipelines-used-in-dotnetruntime)
* [Runtime pipeline](#runtime-pipeline)
* [Runtime-dev-inner loop pipeline](#runtime-dev-inner-loop-pipeline)
* [Dotnet-linker-tests](#dotnet-linker-tests)
* [Runtime-staging](#runtime-staging)
* [Runtime-extra-platforms](#runtime-extra-platforms)
* [Outer loop pipelines](#outer-loop-pipelines)
* [Running of different runtime-level tests and their orchestration in Helix](#running-of-different-runtime-level-tests-and-their-orchestration-in-helix)
* [Legacy tests](#legacy-tests)
* [SourceGen Orchestrated tests](#sourcegen-orchestrated-tests)
The runtime repository has a large number of validation pipelines to help assess product quality across different scenarios. Some of them run automatically, and some run per request to accommodate hardware availability and other resource constraints. However, the overall orchestration remains largely the same.
```mermaid
gitGraph
commit
commit
branch feature/utf8
checkout feature/utf8
commit
commit
checkout main
commit
merge feature/utf8 type: REVERSE
commit
commit
```
Say there's a PR from `feature/utf8` to `main`. The `Azure DevOps Pipeline` plugin will take the merge commit to `main` and queue all default pipelines and any other requested pipelines to `Azure DevOps`.
```mermaid
gantt
title Execution of a PR in our CI
dateFormat DDDD
axisFormat %j
section GH PR
Send PR using AZDO Plugin : prCreate, 001, 1d
Workitem Analysis : analysis, after lookup, 1d
Merge Step : after analysis, 1d
section Azure DevOps
Build Runtimes and Libs : build, after prCreate, 1d
Build Tests : buildTest, after build, 1d
section Helix
Run Tests : test, after buildTest, 1d
Report Tests to Azure DevOps : testReport, after test, 1d
section Known Issues Infrastructure
Lookup known strings in issues : lookup, after testReport, 1d
```
Each pipeline will create its own build of the different runtimes, the tests, and will eventually run the tests. We usually run our tests in a separate environment called Helix. This system allows for distribution of the large number of tests across the wide array of platforms supported. Once each worker machine processes its own results, these get reported back to `Azure DevOps` and they become available in the tests tab of the build.
## Pipelines used in dotnet/runtime
This repository contains several runtimes and a wide range of supported libraries and platforms. This complexity makes it hard to balance resource usage, testing coverage, and developer productivity. In order to try to make build efforts more reliable and spend the least amount of time testing what the PR changes need, we have various pipelines - some required, some optional. You can list the available pipelines by adding a comment like `/azp list` on a PR or get the available commands by adding a comment like `/azp help`.
Most of the repository pipelines use a custom mechanism to evaluate paths based on the changes contained in the PR to try and build/test the least that we can without compromising quality. This is the initial step on every pipeline that depends on this infrastructure, called "Evaluate Paths". In this step you can see the result of the evaluation for each subset of the repository. For more details on which subsets we have based on paths, see [here](/eng/pipelines/common/evaluate-default-paths.yml). Also, to understand how this mechanism works, you can read this [comment](/eng/pipelines/evaluate-changed-paths.sh#L3-L12).
### Runtime pipeline
This is the "main" pipeline for the runtime product. In this pipeline we include the most critical tests and platforms where we have enough test resources in order to deliver results in a reasonable amount of time. The tests executed in this pipeline for runtime and libraries are considered inner loop. These are the same tests that are executed locally when one runs tests locally.
For mobile platforms and wasm we run some smoke tests that aim to protect the quality of these platforms. We had to move to a smoke test approach given the hardware and time limitations that we encountered, and contributors were affected by this with instability and long wait times for their PRs to finish validation.
### Runtime-dev-inner loop pipeline
This pipeline is also required, and its intent is to cover developer inner loop scenarios that could be affected by any change, like running a specific build command or running tests inside Visual Studio, etc.
### Dotnet-linker-tests
This is also a required pipeline. The purpose of this pipeline is to test that the libraries code is ILLink friendly, meaning that when we trim our libraries using the ILLink, we don't introduce trimming bugs, such as a method required in a specific scenario being trimmed away by accident.
### Runtime-staging
This pipeline runs on every change; however it behaves a little differently than the other pipelines. This pipeline will not fail if there are test failures, however it will fail if there is a timeout or a build failure. We fail on build failures because we want to protect the developer inner loop (building the repository) for this platform.
The tests will not fail because this pipeline is for staging new platforms where the test infrastructure is new, and we need to test if we have enough capacity to include that new platform on the "main" runtime pipeline without causing flakiness. Once we analyze data and a platform is stable when running on PRs in this pipeline for at least a week, it can be promoted either to the `runtime-extra-platforms` pipeline or to the `runtime` pipeline.
### Runtime-extra-platforms
This pipeline does not run by default as it is not required for a PR, but it runs twice a day, and it can also be invoked in specific PRs by commenting `/azp run runtime-extra-platforms`. However, this pipeline is still an important part of our testing.
This pipeline runs inner loop tests on platforms where we don't have enough hardware capacity to run tests (mobile, browser) or on platforms where we believe tests should organically pass based on the coverage we have in the "main" runtime pipeline. For example, in the "main" pipeline we run tests on Ubuntu 21.10. Since we also support Ubuntu 18.04, which is an LTS release, we run tests on Ubuntu 18.04 in this pipeline to make sure we have healthy tests on the platforms for which we are releasing a product.
This pipeline also runs tests for platforms that are generally stable but we don't have enough hardware to put into the regular runtime pipeline. For example, we run the libraries tests for windows arm64 in this pipeline. We don't have enough hardware to run the JIT tests and libraries tests for windows arm64 on every PR. The JIT is the most important piece to test here, as that is what generates the native code to run on that platform. So, we run JIT tests on arm64 in the "main" pipeline, while our libraries tests are only run on the `runtime-extra-platforms` pipeline.
### Outer loop pipelines
We have various pipelines whose names contain `Outerloop`. These pipelines will not run by default on every PR; they can be invoked using the `/azp run` comment and will run on a daily basis to analyze test results.
These pipelines run tests that are long-running, are not very stable (e.g. some networking tests), or that modify machine state.
## Running of different runtime-level tests and their orchestration in Helix
### Legacy tests
In older runtime tests, the classic xUnit console runner runs a generated set of xUnit facts. Each fact invokes a shell/batch script that sets up the environment, then starts the console apps that make up the runtime test bed. The wrapper is also responsible for harvesting all output from the processes that get started. The main advantage of this method is that each test runs in process isolation. This allows xUnit and its child process to have decoupled runtimes, hardening the test harness against native crashes. However, this is extremely expensive since startup costs and process start costs are paid per test. The usual flow for a Helix workitem of this type is as follows:
```mermaid
sequenceDiagram
title Legacy Tests in Helix
participant E as Helix Entrypoint
participant W as Test Wrapper
participant T as Test N
participant R as Helix Reporter
activate E
E->>+W: Launch xUnit test wrapper hosted in LKG runtime
W->>+T: Launch each test in process isolation
T->>-W: Tests report success with exit code 100.
W->>-E: -
E->>+R: Report test results to Azure DevOps
R->>-E: -
deactivate E
```
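For illustration, a generated wrapper fact of this style might look roughly like the following sketch (the test name and script path are hypothetical; the real wrappers are generated at build time):

```csharp
using System.Diagnostics;
using Xunit;

public class GeneratedTestWrapper
{
    // Hypothetical generated fact: the test runs in its own child process,
    // decoupling the xUnit host runtime from the runtime under test.
    [Fact]
    public static void JIT_Directed_SomeTest()
    {
        var psi = new ProcessStartInfo
        {
            FileName = "SomeTest.sh",      // per-test script that sets up the environment
            RedirectStandardOutput = true, // harvest all output from the child process
            UseShellExecute = false,
        };

        using Process proc = Process.Start(psi)!;
        string output = proc.StandardOutput.ReadToEnd();
        proc.WaitForExit();

        // Runtime tests conventionally report success with exit code 100.
        Assert.True(proc.ExitCode == 100, output);
    }
}
```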
### SourceGen Orchestrated tests
Consolidated runtime tests generate an entry point assembly during build. The source generation globs the tests that will run and generates a `Main` method that runs each test in a `try`/`catch` block, while capturing all the necessary output. There are a few tests that require isolation, and instead of calling into them in-proc, the call starts another process as appropriate. The main advantage of this method is that it relies less heavily on process isolation, making testing more cost-efficient. However, this also means the first native or managed unhandled exception will pause all testing - much like what happens with library tests. The merged runner that invokes the tests sequentially is hosted under a watchdog to handle hangs, and there's a log fixer that runs afterwards to try to fix up the corrupted logs in case of a crash, so that Helix can report the workitem progress as much as possible. The usual flow for a Helix workitem of this type is as follows:
```mermaid
sequenceDiagram
title Merged Tests in Helix
participant E as Helix Entrypoint
participant W as Watchdog
participant M as Merged Runner
participant L as Log Fixer
participant R as Helix Reporter
activate E
E->>+W: Launch watchdog
W->>+M: Launch tests with filters
M->>-W: Tests finish or crash
W->>-E: .
E->>+L: Fix log and symbolize crashes
L->>-E: -
E->>+R: Report test results to Azure DevOps
R->>-E: -
deactivate E
```
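As a rough sketch of the idea, and not the actual generator output, the source-generated entry point behaves something like this (test names and the reporting format are illustrative):

```csharp
using System;

// Illustrative stand-in for the source-generated entry point.
public static class MergedTestRunner
{
    // The real generator globs test methods at build time; this list is a stand-in.
    static readonly (string Name, Action Run)[] s_tests =
    {
        ("Test_A", () => { /* test body */ }),
        ("Test_B", () => throw new InvalidOperationException("simulated failure")),
    };

    public static int Main()
    {
        int failed = 0;
        foreach (var (name, run) in s_tests)
        {
            try
            {
                run();
                Console.WriteLine($"PASSED: {name}");
            }
            catch (Exception ex) // a *native* crash would still take down the whole run
            {
                failed++;
                Console.WriteLine($"FAILED: {name}");
                Console.WriteLine(ex);
            }
        }

        // Runtime tests conventionally use exit code 100 for success.
        return failed == 0 ? 100 : 1;
    }
}
```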
......@@ -19,19 +19,23 @@ Every pull request will have automatically a single `area-*` label assigned. The
If during the code review process a merge conflict occurs, the area owner is responsible for its resolution. Pull requests should not be on hold due to the author's unwillingness to resolve code conflicts. GitHub makes this easier by allowing simple conflict resolution using the [conflict-editor](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/resolving-a-merge-conflict-on-github).
## Pull Request Builds
When submitting a PR to the `dotnet/runtime` repository, various builds will run validating many areas to ensure we keep developer productivity and product quality high. For a high level overview of the build process and the different pipelines that might run against your PR, please check [pipelines overview](pipelines-overview.md).
## Merging Pull Requests
Anyone with write access can merge a pull request manually or by setting the [auto-merge](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/automatically-merging-a-pull-request) label when it satisfies all of the following conditions:
Anyone with write access can merge a pull request manually when the following conditions have been met:
* The PR has been approved by at least one reviewer and any other objections are addressed.
* You can request another review from the original reviewer.
* The PR successfully builds and passes all tests in the Continuous Integration (CI) system. For more information, please refer to our [PR Builds](pr-builds.md) doc.
* The PR successfully builds and passes all tests in the Continuous Integration (CI) system. In case of failures, refer to the [analyzing build failures](failure-analysis.md) doc.
Typically, PRs are merged as one commit. It creates a simpler history than a Merge Commit. "Special circumstances" are rare, and typically mean that there are a series of cleanly separated changes that will be too hard to understand if squashed together, or for some reason we want to preserve the ability to bisect them.
Typically, PRs are merged as one commit (squash merges). It creates a simpler history than a Merge Commit. "Special circumstances" are rare, and typically mean that there are a series of cleanly separated changes that will be too hard to understand if squashed together, or for some reason we want to preserve the ability to bisect them.
## Blocking Pull Request Merging
If for whatever reason you would like to move your pull request back to an in-progress status to avoid merging it in the current form, you can do that by adding a [WIP] prefix to the pull request title.
If for whatever reason you would like to move your pull request back to an in-progress status to avoid merging it in the current form, you can turn the PR into a draft PR by selecting the option under the reviewers section. Alternatively, you can do that by adding a [WIP] prefix to the pull request title.
## Old Pull Request Policy
......
This diff was suppressed by .gitattributes.