    drm/i915: Use time based guilty context banning · 7f4127c4
    Committed by Chris Wilson
    Currently, we accumulate a score each time a context hangs the GPU,
    offset against the number of requests it submits, and if that score
    exceeds a certain threshold, we ban that context from submitting any
    more requests (cancelling any work in flight). In contrast, we use a
    simple timer on the file: if we see more than 9 hangs less than 60s
    apart in total across all of its contexts, we ban the client from
    creating any more contexts. This leads to a confusing situation where
    the file may be banned before the context, so let's use a simple timer
    scheme for each.
    
    If the context submits 3 hanging requests within a 120s period, declare
    it forbidden to ever send more requests.
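
A minimal userspace sketch of such a rule, assuming timestamps in seconds from a monotonic clock and a strict "less than 120s" comparison. The names `ctx_hang_state` and `ctx_record_hang` are hypothetical and are not taken from the driver; only the limits (3 hangs, 120s) come from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define HANG_LIMIT      3    /* hangs needed to trigger a ban (from the text) */
#define HANG_WINDOW_SEC 120  /* window in which they must occur (from the text) */

/* Hypothetical per-context state; not the driver's actual structure. */
struct ctx_hang_state {
	uint64_t stamp[HANG_LIMIT - 1]; /* times of earlier hangs, oldest first */
	unsigned int count;             /* valid entries in stamp[] */
	bool banned;
};

/*
 * Record a hang at time `now` (seconds, assumed monotonic).
 * Returns true if the context is banned after this hang.
 */
static bool ctx_record_hang(struct ctx_hang_state *s, uint64_t now)
{
	unsigned int i;

	if (s->banned)
		return true; /* a ban is permanent */

	/* This is the Nth hang; ban if the oldest remembered one is recent. */
	if (s->count == HANG_LIMIT - 1 && now - s->stamp[0] < HANG_WINDOW_SEC) {
		s->banned = true;
		return true;
	}

	if (s->count == HANG_LIMIT - 1) {
		/* History full: drop the oldest timestamp, append the new one. */
		for (i = 0; i + 1 < s->count; i++)
			s->stamp[i] = s->stamp[i + 1];
		s->stamp[s->count - 1] = now;
	} else {
		s->stamp[s->count++] = now;
	}
	return false;
}
```

Only the last two timestamps need to be kept: the third hang is compared against the oldest of them, so submitting harmless requests in between cannot dilute the history.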
    
    This has the advantage that a context cannot easily repair its
    standing by simply sending empty requests, but the disadvantage that
    an idle context is forgiven. However, an idle context is not
    disrupting the system, whereas a hog can evade the request counting
    and cause much more severe disruption to the system.
    
    Updating ban_score from request retirement is dubious as the retirement
    is purposely not in sync with request submission (i.e. we try and batch
    retirement to reduce overhead and avoid latency on submission), which
    leads to surprising situations where we can forgive a hang immediately
    due to a backlog of requests from before the hang being retired
    afterwards.
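
To illustrate the problem, here is a rough sketch of a score-based scheme in which each hang adds a penalty and each retired request removes one point. The constants and function names are hypothetical, chosen only for the example, and the code is not the driver's implementation.

```c
#include <stdbool.h>

#define SCORE_PER_HANG  10 /* hypothetical penalty per hang */
#define SCORE_BAN_LIMIT 40 /* hypothetical ban threshold */

/* Hypothetical per-context score state. */
struct ctx_score {
	int ban_score;
};

static void score_on_hang(struct ctx_score *c)
{
	c->ban_score += SCORE_PER_HANG;
}

/* Each retired request forgives one point, never going below zero. */
static void score_on_retire(struct ctx_score *c)
{
	if (c->ban_score > 0)
		c->ban_score--;
}

static bool score_is_banned(const struct ctx_score *c)
{
	return c->ban_score >= SCORE_BAN_LIMIT;
}

/*
 * One hang, followed by the batched retirement of `n_backlog` requests
 * that were submitted before the hang. Returns the remaining score.
 */
static int score_after_hang_and_backlog(int n_backlog)
{
	struct ctx_score c = { 0 };
	int i;

	score_on_hang(&c);
	for (i = 0; i < n_backlog; i++)
		score_on_retire(&c);
	return c.ban_score;
}
```

With these example values, a backlog of ten or more older requests retired after the hang brings the score back to zero, so repeated hangs need never reach the threshold.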
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190219122215.8941-2-chris@chris-wilson.co.uk
i915_gpu_error.c 48.5 KB