1. 27 December 2019, 32 commits
    • sched: loadavg: make calc_load_n() public · 3bf93774
      Johannes Weiner committed
      commit 5c54f5b9edb1aa2eabbb1091c458f1b6776a1896 upstream.
      
      It's going to be used in a later patch. Keep the churn separate.
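      
      For context, calc_load_n() folds n missed update intervals into a
      single step by raising the per-interval decay factor to the n-th
      power. A rough sketch of the upstream helper (the shape may differ
      slightly in this tree):
      
          /* a(n) = a(0) * e^n + active * (1 - e^n), in FIXED_1 fixed-point */
          unsigned long calc_load_n(unsigned long load, unsigned long exp,
                                    unsigned long active, unsigned int n)
          {
                  return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
          }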
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-6-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD · 4ca637b4
      Johannes Weiner committed
      commit 8508cf3ffad4defa202b303e5b6379efc4cd9054 upstream.
      
      There are several definitions of those functions/macros in places that
      mess with fixed-point load averages.  Provide an official version.
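      
      For reference, the consolidated helpers are the classic fixed-point
      conversions (as in include/linux/sched/loadavg.h; shown here for
      illustration):
      
          #define FSHIFT          11              /* nr of bits of precision */
          #define FIXED_1         (1 << FSHIFT)   /* 1.0 as fixed-point */
          #define LOAD_INT(x)     ((x) >> FSHIFT)
          #define LOAD_FRAC(x)    LOAD_INT(((x) & (FIXED_1 - 1)) * 100)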
      
      [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
      Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [Joseph: use stat.mean instead of stat->rqs.mean to solve conflict]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      
      Conflicts:
          block/blk-iolatency.c
    • delayacct: track delays from thrashing cache pages · 2dde6f77
      Johannes Weiner committed
      commit b1d29ba82cf2bc784f4c963ddd6a2cf29e229b33 upstream.
      
      Delay accounting already measures the time a task spends in direct reclaim
      and waiting for swapin, but in low memory situations tasks can spend
      a significant amount of their time waiting on thrashing page cache.  This
      isn't tracked right now.
      
      To know the full impact of memory contention on an individual task,
      measure the delay when waiting for a recently evicted active cache page to
      read back into memory.
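      
      A simplified sketch of where the new hooks sit (adapted from the
      patch's change to wait_on_page_bit_common() in mm/filemap.c; details
      abbreviated):
      
          bool thrashing = false;
      
          if (bit_nr == PG_locked && !PageUptodate(page) && PageWorkingset(page)) {
                  delayacct_thrashing_start();    /* hot page got evicted, now refaulting */
                  thrashing = true;
          }
      
          /* ... wait for the page to be read back and unlocked ... */
      
          if (thrashing)
                  delayacct_thrashing_end();      /* charge the wait as a thrashing delay */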
      
      Also update tools/accounting/getdelays.c:
      
           [hannes@computer accounting]$ sudo ./getdelays -d -p 1
           print delayacct stats ON
           PID     1
      
           CPU             count     real total  virtual total    delay total  delay average
                           50318      745000000      847346785      400533713          0.008ms
           IO              count    delay total  delay average
                             435      122601218              0ms
           SWAP            count    delay total  delay average
                               0              0              0ms
           RECLAIM         count    delay total  delay average
                               0              0              0ms
           THRASHING       count    delay total  delay average
                              19       12621439              0ms
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • mm: workingset: tell cache transitions from workingset thrashing · e2d3e3cb
      Johannes Weiner committed
      commit 1899ad18c6072d689896badafb81267b0a1092a4 upstream.
      
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on
      system performance, as well as the ability to balance pressure more
      intelligently between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bona fide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
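      
      A sketch of the idea (simplified from mm/workingset.c; exact helper
      signatures may differ in this tree):
      
          /* eviction: pack whether the page was ever active into the shadow */
          shadow = pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
      
          /* refault: a set workingset bit means bona fide thrashing,
           * a clear one means a workingset transition */
          unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
          if (workingset)
                  SetPageWorkingset(page);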
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and regain the 6
      or 7 sparsemem section bits.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • mm: workingset: don't drop refault information prematurely · b027d193
      Johannes Weiner committed
      commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928 upstream.
      
      Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
      
      		Overview
      
      PSI reports the overall wallclock time in which the tasks in a system (or
      cgroup) wait for (contended) hardware resources.
      
      This helps users understand the resource pressure their workloads are
      under, which allows them to root-cause and fix throughput and latency
      problems caused by overcommitting, underprovisioning, or suboptimal job
      placement in a grid, as well as anticipate major disruptions like OOM.
      
      		Real-world applications
      
      We're using the data collected by PSI (and its previous incarnation,
      memdelay) quite extensively at Facebook, and with several success stories.
      
      One usecase is avoiding OOM hangs/livelocks.  These happen because
      the OOM killer is triggered by reclaim not being able to free
      pages, but with fast flash devices there is *always* some clean and
      uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
      spend 90% of the time thrashing the cache pages of their own executables.
      There is no situation where this ever makes sense in practice.  We wrote a
      <100 line POC python script to monitor memory pressure and kill stuff way
      before such pathological thrashing leads to full system losses that would
      require forcible hard resets.
      
      We've since extended and deployed this code into other places to guarantee
      latency and throughput SLAs, since they're usually violated way before the
      kernel OOM killer would ever kick in.
      
      It is available here: https://github.com/facebookincubator/oomd
      
      Eventually we probably want to trigger the in-kernel OOM killer based on
      extreme sustained pressure as well, so that Linux can avoid memory
      livelocks - which technically aren't deadlocks, but to the user
      indistinguishable from them - out of the box.  We'd continue using OOMD as
      the first line of defense to ensure workload health and implement complex
      kill policies that are beyond the scope of the kernel.
      
      We also use PSI memory pressure for loadshedding.  Our batch job
      infrastructure used to use heuristics based on various VM stats to
      anticipate OOM situations, with lackluster success.  We switched it to PSI
      and managed to anticipate and avoid OOM kills and lockups fairly reliably.
      The reduction of OOM outages in the worker pool raised the pool's
      aggregate productivity, and we were able to switch that service to smaller
      machines.
      
      Lastly, we use cgroups to isolate a machine's main workload from
      maintenance crap like package upgrades, logging, configuration, as well as
      to prevent multiple workloads on a machine from stepping on each others'
      toes.  We were not able to configure this properly without the pressure
      metrics; we would see latency or bandwidth drops, but it would often be
      hard or impossible to root-cause post-mortem.
      
      We now log and graph pressure for the containers in our fleet and can
      trivially link latency spikes and throughput drops to shortages of
      specific resources after the fact, and fix the job config/scheduling.
      
      PSI has also received testing, feedback, and feature requests from Android
      and EndlessOS for the purpose of low-latency OOM killing, to intervene in
      pressure situations before the UI starts hanging.
      
      		How do you use this feature?
      
      A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
      files: cpu, memory, and io.  If using cgroup2, cgroups will also have
      cpu.pressure, memory.pressure and io.pressure files, which simply
      aggregate task stalls at the cgroup level instead of system-wide.
      
      The cpu file contains one line:
      
      	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
      
      The averages give the percentage of walltime in which one or more tasks
      are delayed on the runqueue while another task has the CPU.  They're
      recent averages over 10s, 1m, 5m windows, so you can tell short term
      trends from long term ones, similarly to the load average.
      
      The total= value gives the absolute stall time in microseconds.  This
      allows detecting latency spikes that might be too short to sway the
      running averages.  It also allows custom time averaging in case the
      10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
      future hardware).
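      
      As an illustration, a tiny userspace sampler (hypothetical, not part
      of this patch set) that derives a custom 1-second average from the
      total= field described above:
      
          #include <stdio.h>
          #include <unistd.h>
      
          static unsigned long long cpu_some_total(void)
          {
                  unsigned long long total = 0;
                  FILE *f = fopen("/proc/pressure/cpu", "r");
      
                  if (f) {
                          fscanf(f, "some avg10=%*f avg60=%*f avg300=%*f total=%llu",
                                 &total);
                          fclose(f);
                  }
                  return total;
          }
      
          int main(void)
          {
                  unsigned long long t0 = cpu_some_total();
      
                  sleep(1);       /* custom 1s window */
                  /* total= is in microseconds: delta / 10000 = percent of 1s */
                  printf("cpu some: %.2f%%\n", (cpu_some_total() - t0) / 10000.0);
                  return 0;
          }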
      
      What to make of this "some" metric?  If CPU utilization is at 100% and CPU
      pressure is 0, it means the system is perfectly utilized, with one
      runnable thread per CPU and nobody waiting.  At two or more runnable tasks
      per CPU, the system is 100% overcommitted and the pressure average will
      indicate as much.  From a utilization perspective this is a great state of
      course: no CPU cycles are being wasted, even when 50% of the threads were
      to go idle (as most workloads do vary).  From the perspective of the
      individual job it's not great, however, and they would do better with more
      resources.  Depending on what your priority and options are, raised "some"
      numbers may or may not require action.
      
      The memory file contains two lines:
      
      	some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
      	full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
      
      The some line is the same as for cpu, the time in which at least one task
      is stalled on the resource.  In the case of memory, this includes waiting
      on swap-in, page cache refaults and page reclaim.
      
      The full line, however, indicates time in which *nobody* is using the CPU
      productively due to pressure: all non-idle tasks are waiting for memory in
      one form or another.  Significant time spent in there is a good trigger
      for killing things, moving jobs to other machines, or dropping incoming
      requests, since neither the jobs nor the machine overall are making too
      much headway.
      
      The io file is similar to memory.  Because the block layer doesn't have a
      concept of hardware contention right now (how much longer is my IO request
      taking due to other tasks?), it reports CPU potential lost on all IO
      delays, not just the potential lost due to competition.
      
      		FAQ
      
      Q: How is PSI's CPU component different from the load average?
      
      A: There are several quirks in the load average that make it hard or
         impossible to tell how overcommitted the CPU really is.
      
         1. The load average is reported as a raw number of active tasks.
            You need to know how many CPUs there are in the system, how many
            CPUs the workload is allowed to use, then think about what the
            proportion between load and the number of CPUs mean for the
            tasks trying to run.
      
            PSI reports the percentage of wallclock time in which tasks are
            waiting for a CPU to run on. It doesn't matter how many CPUs are
            present or usable. The number always tells the quality of life
            of tasks in the system or in a particular cgroup.
      
         2. The shortest averaging window is 1m, which is extremely coarse,
            and it's sampled in 5s intervals. A *lot* can happen on a CPU in
            5 seconds. This *may* be able to identify persistent long-term
            trends and very clear and obvious overloads, but it's unusable
            for latency spikes and more subtle overutilization.
      
            PSI's shortest window is 10s. It also exports the cumulative
            stall times (in microseconds) of synchronously recorded events.
      
         3. On Linux, the load average for historical reasons includes all
            TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
            busy the system is, but on the flipside it doesn't distinguish
            whether tasks are likely to contend over the CPU or IO - which
            obviously requires very different interventions from a sys admin
            or a job scheduler.
      
            PSI reports independent metrics for CPU and IO. You can tell
            which resource is making the tasks wait, but in conjunction
            still see how overloaded the system is overall.
      
      Q: What's the cost / performance impact of this feature?
      
      A: PSI's primary cost is in the scheduler, in particular task wakeups
         and sleeps.
      
         I benchmarked this code using Facebook's two most scheduling
         sensitive workloads: memcache and webserver. They handle a ton of
         small requests - lots of wakeups and sleeps with little actual work
         in between - so they tend to be canaries for scheduler regressions.
      
         In the tests, the boxes were handling live traffic over the course
         of several hours. Half the machines, the control, ran with
         CONFIG_PSI=n.
      
         For memcache I used eight machines total. They're 2-socket, 14
         core, 56 thread boxes. The test runs for half the test period,
         flips the test and control kernels on the hardware to rule out HW
         factors, DC location etc., then runs the other half of the test.
      
         For the webservers, I used 32 machines total. They're single
         socket, 16 core, 32 thread machines.
      
         During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
         the first half and nopsi=77.52% psi=78.25% in the second half, so
         PSI added between 0.7 and 0.9 percentage points to the CPU load, a
         difference of about 1%.
      
         UPDATE: I re-ran this test with the v3 version of this patch set
         and the CPU utilization was equivalent between test and control.
      
         UPDATE: v4 is on par with v3.
      
         As far as end-to-end request latency from the client perspective
         goes, we don't sample those finely enough to capture the requests
         going to those particular machines during the test, but we know the
         p50 turnaround time in this workload is 54us, and perf bench sched
         pipe on those machines show nopsi=5.232666 us/op and psi=5.587347
         us/op, so this doesn't add much here either.
      
         The profile for the pipe benchmark shows:
      
              0.87%  sched-pipe  [kernel.vmlinux]    [k] psi_group_change
              0.83%  perf.real   [kernel.vmlinux]    [k] psi_group_change
              0.82%  perf.real   [kernel.vmlinux]    [k] psi_task_change
              0.58%  sched-pipe  [kernel.vmlinux]    [k] psi_task_change
      
         The webserver load is running inside 4 nested cgroup levels. The
         CPU load with both nopsi and psi kernels was indistinguishable at
         81%.
      
         For comparison, we had to disable the cgroup cpu controller on the
         webservers because it added 4 percentage points to the CPU% during
         this same exact test.
      
         Versions of this accounting code now run on 80% of our fleet. None
         of our workloads have reported regressions during the rollout.
      
      Daniel Drake said:
      
      : I just retested the latest version at
      : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
      : are great.
      :
      : Test setup:
      : Endless OS
      : GeminiLake N4200 low end laptop
      : 2GB RAM
      : swap (and zram swap) disabled
      :
      : Baseline test: open a handful of large-ish apps and several website
      : tabs in Google Chrome.
      :
      : Results: after a couple of minutes, system is excessively thrashing, mouse
      : cursor can barely be moved, UI is not responding to mouse clicks, so it's
      : impractical to recover from this situation as an ordinary user
      :
      : Add my simple killer:
      : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
      :
      : Results: when the thrashing causes the UI to become sluggish, the killer
      : steps in and kills something (usually a chrome tab), and the system
      : remains usable.  I repeatedly opened more apps and more websites over a 15
      : minute period but I wasn't able to get the system to a point of UI
      : unresponsiveness.
      
      Suren said:
      
      : Backported to 4.9 and retested on an ARMv8 8-core system running Android.
      : Signals behave as expected reacting to memory pressure, no jumps in
      : "total" counters that would indicate overflow/underflow issues.  Nicely
      : done!
      
      This patch (of 9):
      
      If we keep just enough refault information to match the *current* page
      cache during reclaim time, we could lose a lot of events when there is
      only a temporary spike in non-cache memory consumption that pushes out all
      the cache.  Once cache comes back, we won't see those refaults.  They
      might not be actionable for LRU aging, but we want to know about them for
      measuring memory pressure.
      
      [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
        Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <jweiner@fb.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • alinux: net/tcp: Support tunable tcp timeout value in TIME-WAIT state · 0e7b6b7f
      George Zhang committed
      By default the tcp_tw_timeout value is 60 seconds. The minimum is
      1 second and the maximum is 600. This setting is useful on systems under
      heavy tcp load.
      
      NOTE: setting tcp_tw_timeout below 60 seconds violates the "quiet time"
      restriction, and puts your system at risk of having some old data
      accepted as new, or new data rejected as old duplicates, by some
      receivers.
      
      Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793
      Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • PCI: Fix "try" semantics of bus and slot reset · e69730a1
      Alex Williamson committed
      commit ddefc033eecf23f1e8b81d0663c5db965adf5516 upstream
      
      The commit referenced below introduced device locking around save and
      restore of state for each device during a PCI bus "try" reset, making
      it decidedly non-"try" and prone to deadlock in the event that a device
      is already locked.  Restore __pci_reset_bus() and __pci_reset_slot()
      to their advertised locking semantics by pushing the save and restore
      functions into the branch where the entire tree is already locked.
      Extend the helper function names with "_locked" and update the comment
      to reflect this calling requirement.
      
      Fixes: b014e96d ("PCI: Protect pci_error_handlers->reset_notify() usage with device_lock()")
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Zhiyuan Hou <zhiyuan2048@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • alinux: net/hookers: fix link error with ipv6 disabled · 817f2cbf
      kbuild test robot committed
      lkp-build bot reported the following link error with ipv6 disabled:
      
      ld: net/hookers/hookers.o:(.data+0x40): undefined reference to `ipv6_specific'
      ld: net/hookers/hookers.o:(.data+0x78): undefined reference to `ipv6_mapped'
      ld: net/hookers/hookers.o:(.data+0xe8): undefined reference to `inet6_stream_ops'
      
      Fix this issue by adding an IS_ENABLED(CONFIG_IPV6) check.
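      
      The pattern, for illustration (IS_ENABLED(CONFIG_IPV6) is true for
      both =y and =m, so the ipv6-only symbols are only referenced when
      they exist; the hook table below is illustrative, not the actual
      net/hookers.c layout):
      
          static const struct proto_ops *hooked_ops[] = {
                  &inet_stream_ops,
          #if IS_ENABLED(CONFIG_IPV6)
                  &inet6_stream_ops,      /* exists only when IPv6 is built */
          #endif
          };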
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: writeback: memcg_blkcg_tree_lock can be static · 2307342e
      kbuild test robot committed
      Fixes: 60448d43 ("writeback: add memcg_blkcg_link tree")
      Signed-off-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • alinux: net/hookers: only enable on x86 platform · 2c0162e8
      Caspar Zhang committed
      read/write_cr0() are used in net/hookers.c, but they are only available
      on the x86 platform. Add a "depends on" line in Kconfig to disable
      this feature on other platforms.
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: fs/writeback: wrap cgroup writeback v1 logic · 5e06ec32
      Joseph Qi committed
      Wrap cgroup writeback v1 logic to prevent build errors without
      CONFIG_CGROUPS or CONFIG_CGROUP_WRITEBACK.
      Reported-by: kbuild test robot <lkp@intel.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • alinux: writeback: introduce cgwb_v1 boot param · 6f7afdf3
      Jiufei Xue committed
      So far writeback control is supported for the cgroup v1 interface.
      However, it also has some restrictions, so introduce a new kernel boot
      parameter to control the behavior, which is disabled by default. Users
      can enable writeback control for cgroup v1 with the command line
      parameter "cgwb_v1".
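      
      A minimal sketch of such a boot-parameter switch (the actual variable
      and handler names in this tree may differ):
      
          static bool cgwb_v1;
      
          static int __init enable_cgwb_v1(char *s)
          {
                  cgwb_v1 = true;
                  return 1;
          }
          __setup("cgwb_v1", enable_cgwb_v1);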
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: fs/writeback: Attach inode's wb to root if needed · 67dc261c
      luanshi committed
      There might be tons of files queued in writeback, awaiting to be
      written back. Unfortunately, the writeback's cgroup may already be
      dead. In this case, we reassociate the inode with another writeback,
      but we possibly can't, because the writeback associated with the dead
      cgroup is the only valid one. In that case, a new writeback is
      allocated, initialized and associated with the inode in a non-stopping
      fashion until all data resident in the inode's page cache are flushed
      to disk. This causes unnecessarily high system load.
      
      Fix the issue by forcibly moving the inode to the root cgroup when the
      previously bound cgroup becomes dead. With this, no more unnecessary
      writebacks are created and populated, and the system load decreased by
      about 6x in the test case we carried out:
          Without the patch: 30% system load
          With the patch:    5%  system load
      Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: fs/writeback: fix double free of blkcg_css · b8016b41
      Jiufei Xue committed
      We have gotten a WARNING when releasing blkcg_css:
      
      [332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0
      [332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200)
      ......
      [332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G
      [332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A
      10/11/2016
      [332489.685061] Workqueue: cgroup_destroy css_release_work_fn
      [332489.685654]  ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78
      0000000000000000
      [332489.686269]  ffffc9001d92bd68 ffffffff81088f8b 0000003800000000
      ffff883e6b94d4a0
      [332489.686867]  ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400
      ffff883e6b94d4a0
      [332489.687479] Call Trace:
      [332489.688078]  [<ffffffff81380042>] dump_stack+0x63/0x81
      [332489.688681]  [<ffffffff81088f8b>] __warn+0xcb/0xf0
      [332489.689276]  [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80
      [332489.689877]  [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0
      [332489.690481]  [<ffffffff81125552>] css_release_work_fn+0x42/0x140
      [332489.691090]  [<ffffffff810a2db9>] process_one_work+0x189/0x420
      [332489.691693]  [<ffffffff810a309e>] worker_thread+0x4e/0x4b0
      [332489.692293]  [<ffffffff810a3050>] ? process_one_work+0x420/0x420
      [332489.692905]  [<ffffffff810a9616>] kthread+0xe6/0x100
      [332489.693504]  [<ffffffff810a9530>] ? kthread_park+0x60/0x60
      [332489.694099]  [<ffffffff817184e1>] ret_from_fork+0x41/0x50
      [332489.694722] ---[ end trace 0cf869c4a5cfba87 ]---
      ......
      
      This is caused by calling css_get after the css is killed by another
      thread described below:
      
                 Thread 1                       Thread 2
      cgroup_rmdir
        -> kill_css
          -> percpu_ref_kill_and_confirm
            -> css_killed_ref_fn
      
      css_killed_work_fn
        -> css_put
          -> css_release
                                              wb_get_create
      					  -> find_blkcg_css
      					    -> css_get
      					  -> css_put
      					    -> css_release (double free)
          -> css_release_workfn
            -> css_free_work_fn
             -> blkcg_css_free
      
      When a double free happens, it may free memory still in use by
      other threads and cause a kernel panic.
      
      Fix this by using css_tryget_online in find_blkcg_css, which will
      return false if the css is killed.
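      
      The shape of the fix, for illustration (css_tryget_online() fails on
      a dying css, where a bare css_get() would blindly bump the refcount;
      loop context assumed):
      
          if (!css_tryget_online(blkcg_css))
                  continue;       /* css is being destroyed, skip it */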
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: writeback: add memcg_blkcg_link tree · 7396367d
      Jiufei Xue committed
      Here we add a global radix tree linking the memcg and blkcg that the
      user attaches tasks to when using cgroup v1, which is used for cgroup
      writeback.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: net: kernel hookers service for toa module · e209e3cb
      George Zhang committed
      LVS fullnat will replace network traffic's source ip with its local ip,
      and thus the backend servers cannot obtain the real client ip.
      
      To solve this, LVS has introduced the tcp option address (TOA) to store
      the essential ip address information in the last tcp ack packet of the
      3-way handshake, and the backend servers need to retrieve it from the
      packet header.
      
      In this patch, we have introduced the sk_toa_data member in the sock
      structure to hold the TOA information. There used to be an in-tree
      module for TOA management, whereas it is now maintained as a
      standalone module.
      
      In this case, the toa module should register its hook function(s) using
      the provided interfaces in the hookers module.
      
      TOA in sock structure:
      
      	__be32 sk_toa_data[16];
      
      The hookers module only provides the sk_toa_data placeholder, and the
      toa module can use this variable through the layout it needs.
      
      Hook interfaces:
      
      The hookers module replaces the kernel's syn_recv_sock and getname
      handler with a stub that chains the toa module's hook function(s) to the
      original handling function. The hookers module allows hook functions to
      be installed and uninstalled in any order.
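      
      A hypothetical sketch of the registration flow (the struct and
      function names here are illustrative, not the actual net/hookers.c
      API):
      
          static struct hooker toa_hooker = {
                  .func = toa_syn_recv_sock,  /* toa's wrapper, chained to the original */
          };
      
          hooker_install(&ipv4_specific.syn_recv_sock, &toa_hooker);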
      
      toa module:
      
      The external toa module will be provided in a separate RPM package.
      
      [xuyu@linux.alibaba.com: amend commit log]
      Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
    • virtio_blk: add discard and write zeroes support · 0eea9fdf
      Changpeng Liu committed
      commit 1f23816b8eb8fdc39990abe166c10a18c16f6b21 upstream.
      
      In commit 88c85538, "virtio-blk: add discard and write zeroes features
      to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
      block specification has been extended to add VIRTIO_BLK_T_DISCARD and
      VIRTIO_BLK_T_WRITE_ZEROES commands.  This patch enables support for
      discard and write zeroes in the virtio-blk driver when the device
      advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
      VIRTIO_BLK_F_WRITE_ZEROES.
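      
      A sketch of the feature probing (simplified; queue-limit setup
      details are omitted, and VIRTIO_BLK_F_WRITE_ZEROES is handled
      analogously):
      
          if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
                  u32 max_discard;
      
                  virtio_cread(vdev, struct virtio_blk_config,
                               max_discard_sectors, &max_discard);
                  blk_queue_max_discard_sectors(q, max_discard);
          }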
      Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
      Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: kconfig: Disable x86 clocksource watchdog · deb14433
      Jiufei Xue committed
      An unstable tsc will trigger the clocksource watchdog and disable
      itself; as a result, another clocksource will be elected as the current
      clocksource, which will result in performance issues on our servers.
      
      RHEL7 also disabled this feature for some issues, see changelog:
      [x86] disable clocksource watchdog (Prarit Bhargava) [914709]
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: Revert "x86/tsc: Prepare warp test for TSC adjustment" · 0fda568f
      Jiufei Xue committed
      This reverts commit 76d3b851.
      
      The returned value for check_tsc_warp() is useless now, remove it.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: Revert "x86/tsc: Try to adjust TSC if sync test fails" · a7d3a215
      Jiufei Xue committed
      This reverts commit cc4db268.
      
      When we hot-add and enable a vCPU, the time inside the VM jumps and
      then the VM gets stuck.
      The dmesg looks like this:
      [   48.402948] CPU2 has been hot-added
      [   48.413774] smpboot: Booting Node 0 Processor 2 APIC 0x2
      [   48.415155] kvm-clock: cpu 2, msr 6b615081, secondary cpu clock
      [   48.453690] TSC ADJUST compensate: CPU2 observed 139318776350 warp.  Adjust: 139318776350
      [  102.060874] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
      [  102.060874] clocksource:                       'kvm-clock' wd_now: 1cb1cfc4bf8 wd_last: 1be9588f1fe mask: ffffffffffffffff
      [  102.060874] clocksource:                       'tsc' cs_now: 207d794f7e cs_last: 205a32697a mask: ffffffffffffffff
      [  102.060874] tsc: Marking TSC unstable due to clocksource watchdog
      [  102.070188] KVM setup async PF for cpu 2
      [  102.071461] kvm-stealtime: cpu 2, msr 13ba95000
      [  102.074530] Will online and init hotplugged CPU: 2
      
      This is because the TSC for the newly added vCPU is initialized to 0
      while the others are ahead. The guest will do the TSC ADJUST
      compensation and cause the time jump.
      
      Commit bd8fab39 ("KVM: x86: fix maintaining of kvm_clock stability
      on guest CPU hotplug") can fix this problem.  However, the host kernel
      version may be older, so do not adjust the TSC if the sync test fails,
      just mark it unstable.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: block-throttle: enable hierarchical throttling even on traditional hierarchy · 0430bf1d
      Joseph Qi committed
      ECI may have a use case where each device mapper disk's throttling
      policy is configured just under the root blkio cgroup, while the disks
      are actually used in different containers.
      Since hierarchical throttling is now only supported on cgroup v2 and
      ECI uses cgroup v1, we have to enable hierarchical throttling on
      cgroup v1.
      This is ported from redhat 7u, and a year ago Jiufei already ported it
      to alikernel 4.9 as well. So I think this change should be acceptable.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • alinux: drivers/virtio: add vring_force_dma_api boot param · 9c4a8e9d
      Eryu Guan committed
      Prior to xdragon platform 20181230 release (e.g. 0930 release),
      vring_use_dma_api() is required to return 'true' unconditionally.
      
      Introduce a new kernel boot parameter called "vring_force_dma_api" to
      control the behavior. Boot the xdragon host with "vring_force_dma_api"
      on the command line to make ENI hotplug work, while normal ECS hosts
      keep the original behavior.
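      
      A minimal sketch of the parameter wiring (assumed shape; the actual
      code in drivers/virtio may differ, and vring_use_dma_api_orig() below
      stands in for the existing checks):
      
          static bool vring_force_dma_api;
      
          static int __init vring_dma_api_setup(char *str)
          {
                  vring_force_dma_api = true;
                  return 0;
          }
          early_param("vring_force_dma_api", vring_dma_api_setup);
      
          static bool vring_use_dma_api(struct virtio_device *vdev)
          {
                  if (vring_force_dma_api)
                          return true;    /* pre-20181230 xdragon hosts need this */
      
                  return vring_use_dma_api_orig(vdev);    /* original checks */
          }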
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
    • boot: give rdrand some credit · 957ba62a
      Arjan van de Ven committed
      Cherry-pick from clear-linux patches:
      https://github.com/clearlinux-pkgs/linux-kvm/0104-give-rdrand-some-credit.patch
      
      Try to credit rdrand/rdseed with some entropy.
      
      In VMs, but even on modern hardware, we're super starved for entropy,
      and while we can and do wear a tin foil hat, it's very hard to argue
      that rdrand and rdtsc add zero entropy.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • NEMU: Compile in evged always · f4747532
      Arjan van de Ven committed
      Cherry-pick from kata-container patches:
      https://github.com/kata-containers/packaging/tree/master/kernel/patches/0002-Compile-in-evged-always.patch
      
      We need evged for NEMU (and in general for hw reduced)
      
      The config option cannot be set normally since it breaks all
      regular systems, and hardware reduced is really a runtime choice.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • ext4: fix reserved cluster accounting at page invalidation time · c94af26e
      Eric Whitney committed
      commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream.
      
      Add new code to count canceled pending cluster reservations on bigalloc
      file systems and to reduce the cluster reservation count on all file
      systems using delayed allocation.  This replaces old code in
      ext4_da_page_release_reservations that was incorrect.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • ext4: adjust reserved cluster count when removing extents · 60a5bf34
      Eric Whitney committed
      commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream.
      
      Modify ext4_ext_remove_space() and the code it calls to correct the
      reserved cluster count for pending reservations (delayed allocated
      clusters shared with allocated blocks) when a block range is removed
      from the extent tree.  Pending reservations may be found for the clusters
      at the ends of written or unwritten extents when a block range is removed.
      If a physical cluster at the end of an extent is freed, it's necessary
      to increment the reserved cluster count to maintain correct accounting
      if the corresponding logical cluster is shared with at least one
      delayed and unwritten extent as found in the extents status tree.
      
      Add a new function, ext4_rereserve_cluster(), to reapply a reservation
      on a delayed allocated cluster sharing blocks with a freed allocated
      cluster.  To avoid ENOSPC on reservation, a flag is applied to
      ext4_free_blocks() to briefly defer updating the freeclusters counter
      when an allocated cluster is freed.  This prevents another thread
      from allocating the freed block before the reservation can be reapplied.
      
      Redefine the partial cluster object as a struct to carry more state
      information and to clarify the code using it.
      
      Adjust the conditional code structure in ext4_ext_remove_space to
      reduce the indentation level in the main body of the code to improve
      readability.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • ext4: reduce reserved cluster count by number of allocated clusters · d49d8b35
      Eric Whitney committed
      commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream.
      
      Ext4 does not always reduce the reserved cluster count by the number
      of clusters allocated when mapping a delayed extent.  It sometimes
      adds back one or more clusters after allocation if delalloc blocks
      adjacent to the range allocated by ext4_ext_map_blocks() share the
      clusters newly allocated for that range.  However, this overcounts
      the number of clusters needed to satisfy future mapping requests
      (holding one or more reservations for clusters that have already been
      allocated) and premature ENOSPC and quota failures, etc., result.
      
      Ext4 also does not reduce the reserved cluster count when allocating
      clusters for non-delayed allocated writes that have previously been
      reserved for delayed writes.  This also results in overcounts.
      
      To make it possible to handle reserved cluster accounting for
      fallocated regions in the same manner as used for other non-delayed
      writes, do the reserved cluster accounting for them at the time of
      allocation.  In the current code, this is only done later when a
      delayed extent sharing the fallocated region is finally mapped.
      
      Address comment correcting handling of unsigned long long constant
      from Jan Kara's review of RFC version of this patch.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • ext4: fix reserved cluster accounting at delayed write time · f683c7e6
      Eric Whitney committed
      commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream.
      
      The code in ext4_da_map_blocks sometimes reserves space for more
      delayed allocated clusters than it should, resulting in premature
      ENOSPC, exceeded quota, and inaccurate free space reporting.
      
      Fix this by checking for written and unwritten blocks shared in the
      same cluster with the newly delayed allocated block.  A cluster
      reservation should not be made for a cluster for which physical space
      has already been allocated.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • ext4: add new pending reservation mechanism · a993adbe
      Eric Whitney committed
      commit 1dc0aa46e74a3366e12f426b7caaca477853e9c3 upstream.
      
      Add new pending reservation mechanism to help manage reserved cluster
      accounting.  Its primary function is to avoid the need to read extents
      from the disk when invalidating pages as a result of a truncate, punch
      hole, or collapse range operation.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
    • ext4: generalize extents status tree search functions · fccb6f6e
      Eric Whitney committed
      commit ad431025aecda85d3ebef5e4a3aca5c1c681d0c7 upstream.
      
      Ext4 contains a few functions that are used to search for delayed
      extents or blocks in the extents status tree.  Rather than duplicate
      code to add new functions to search for extents with different status
      values, such as written or a combination of delayed and unwritten,
      generalize the existing code to search for caller-specified extents
      status values.  Also, move this code into extents_status.c where it
      is better associated with the data structures it operates upon, and
      where it can be more readily used to implement new extents status tree
      functions that might want a broader scope for i_es_lock.
      
      Three missing static specifiers in RFC version of patch reported and
      fixed by Fengguang Wu <fengguang.wu@intel.com>.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
  2. 21 December 2019, 8 commits
    • Linux 4.19.91 · 672481c2
      Greg Kroah-Hartman committed
    • xhci: fix USB3 device initiated resume race with roothub autosuspend · 6c18dd40
      Mathias Nyman committed
      commit 057d476fff778f1d3b9f861fdb5437ea1a3cfc99 upstream.
      
      A race in xhci USB3 remote wake handling may force a device back to
      suspend after it initiated resume signaling, causing a missed resume
      event or a warm reset of the device.
      
      When a USB3 link completes resume signaling and goes to the enabled
      (U0) state, an interrupt is issued and the interrupt handler will
      clear the bus_state->port_remote_wakeup resume flag, allowing bus
      suspend.
      
      If the USB3 roothub thread just finished reading port status before
      the interrupt, finding ports still in suspended (U3) state, but hasn't
      yet started suspending the hub, then the xhci interrupt handler will clear
      the flag that prevented roothub suspend and allow the bus to suspend,
      forcing
      all port links back to suspended (U3) state.
      
      Example case:
      usb_runtime_suspend() # because all ports still show suspended U3
        usb_suspend_both()
          hub_suspend();   # successful as hub->wakeup_bits not set yet
      ==> INTERRUPT
      xhci_irq()
        handle_port_status()
          clear bus_state->port_remote_wakeup
          usb_wakeup_notification()
            sets hub->wakeup_bits;
              kick_hub_wq()
      <== END INTERRUPT
            hcd_bus_suspend()
              xhci_bus_suspend() # success as port_remote_wakeup bits cleared
      
      Fix this by increasing the roothub usage count during port resume to
      prevent roothub autosuspend, and by making sure the
      bus_state->port_remote_wakeup flag is only cleared after resume
      completion is visible, i.e. after the xhci roothub returned U0 or
      another non-U3 link state on a get port status request.
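      
      Sketched with the usbcore helpers involved (call sites simplified;
      the real change spans xhci-ring.c and xhci-hub.c, and hcd_portnum is
      assumed context):
      
          /* on the resume event: block roothub autosuspend */
          usb_hcd_start_port_resume(&hcd->self, hcd_portnum);
      
          /* ... later, once a GetPortStatus read sees the link out of U3 ... */
          bus_state->port_remote_wakeup &= ~(1 << hcd_portnum);
          usb_hcd_end_port_resume(&hcd->self, hcd_portnum);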
      
      Issue root-caused by Chiasheng Lee
      
      Cc: <stable@vger.kernel.org>
      Cc: Lee, Hou-hsun <hou-hsun.lee@intel.com>
      Reported-by: Lee, Chiasheng <chiasheng.lee@intel.com>
      Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
      Link: https://lore.kernel.org/r/20191211142007.8847-3-mathias.nyman@linux.intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drm/radeon: fix r1xx/r2xx register checker for POT textures · 33c1d3bc
      Alex Deucher committed
      commit 008037d4d972c9c47b273e40e52ae34f9d9e33e7 upstream.
      
      Shift and mask were reversed.  Noticed by chance.
      Tested-by: Meelis Roos <mroos@linux.ee>
      Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • scsi: qla2xxx: Change discovery state before PLOGI · 5dc3f408
      Roman Bolshakov committed
      commit 58e39a2ce4be08162c0368030cdc405f7fd849aa upstream.
      
      When a port sends PLOGI, the discovery state should be changed to
      login pending, otherwise the RELOGIN_NEEDED bit is set in
      qla24xx_handle_plogi_done_event(). RELOGIN_NEEDED triggers another
      PLOGI, and it never exits the loop until the login timer expires.
      
      Fixes: 8777e431 ("scsi: qla2xxx: Migrate NVME N2N handling into state machine")
      Fixes: 8b5292bcfcacf ("scsi: qla2xxx: Fix Relogin to prevent modifying scan_state flag")
      Cc: Quinn Tran <qutran@marvell.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20191125165702.1013-6-r.bolshakov@yadro.com
      Acked-by: Himanshu Madhani <hmadhani@marvell.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Tested-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Roman Bolshakov <r.bolshakov@yadro.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • scsi: iscsi: Fix a potential deadlock in the timeout handler · 86aa5f87
      Bart Van Assche committed
      commit 5480e299b5ae57956af01d4839c9fc88a465eeab upstream.
      
      Some time ago the block layer was modified such that timeout handlers are
      called from thread context instead of interrupt context. Make it safe to
      run the iSCSI timeout handler in thread context. This patch fixes the
      following lockdep complaint:
      
      ================================
      WARNING: inconsistent lock state
      5.5.1-dbg+ #11 Not tainted
      --------------------------------
      inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
      kworker/7:1H/206 [HC0[0]:SC0[0]:HE1:SE1] takes:
      ffff88802d9827e8 (&(&session->frwd_lock)->rlock){+.?.}, at: iscsi_eh_cmd_timed_out+0xa6/0x6d0 [libiscsi]
      {IN-SOFTIRQ-W} state was registered at:
        lock_acquire+0x106/0x240
        _raw_spin_lock+0x38/0x50
        iscsi_check_transport_timeouts+0x3e/0x210 [libiscsi]
        call_timer_fn+0x132/0x470
        __run_timers.part.0+0x39f/0x5b0
        run_timer_softirq+0x63/0xc0
        __do_softirq+0x12d/0x5fd
        irq_exit+0xb3/0x110
        smp_apic_timer_interrupt+0x131/0x3d0
        apic_timer_interrupt+0xf/0x20
        default_idle+0x31/0x230
        arch_cpu_idle+0x13/0x20
        default_idle_call+0x53/0x60
        do_idle+0x38a/0x3f0
        cpu_startup_entry+0x24/0x30
        start_secondary+0x222/0x290
        secondary_startup_64+0xa4/0xb0
      irq event stamp: 1383705
      hardirqs last  enabled at (1383705): [<ffffffff81aace5c>] _raw_spin_unlock_irq+0x2c/0x50
      hardirqs last disabled at (1383704): [<ffffffff81aacb98>] _raw_spin_lock_irq+0x18/0x50
      softirqs last  enabled at (1383690): [<ffffffffa0e2efea>] iscsi_queuecommand+0x76a/0xa20 [libiscsi]
      softirqs last disabled at (1383682): [<ffffffffa0e2e998>] iscsi_queuecommand+0x118/0xa20 [libiscsi]
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&session->frwd_lock)->rlock);
        <Interrupt>
          lock(&(&session->frwd_lock)->rlock);
      
       *** DEADLOCK ***
      
      2 locks held by kworker/7:1H/206:
       #0: ffff8880d57bf928 ((wq_completion)kblockd){+.+.}, at: process_one_work+0x472/0xab0
       #1: ffff88802b9c7de8 ((work_completion)(&q->timeout_work)){+.+.}, at: process_one_work+0x476/0xab0
      
      stack backtrace:
      CPU: 7 PID: 206 Comm: kworker/7:1H Not tainted 5.5.1-dbg+ #11
      Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      Workqueue: kblockd blk_mq_timeout_work
      Call Trace:
       dump_stack+0xa5/0xe6
       print_usage_bug.cold+0x232/0x23b
       mark_lock+0x8dc/0xa70
       __lock_acquire+0xcea/0x2af0
       lock_acquire+0x106/0x240
       _raw_spin_lock+0x38/0x50
       iscsi_eh_cmd_timed_out+0xa6/0x6d0 [libiscsi]
       scsi_times_out+0xf4/0x440 [scsi_mod]
       scsi_timeout+0x1d/0x20 [scsi_mod]
       blk_mq_check_expired+0x365/0x3a0
       bt_iter+0xd6/0xf0
       blk_mq_queue_tag_busy_iter+0x3de/0x650
       blk_mq_timeout_work+0x1af/0x380
       process_one_work+0x56d/0xab0
       worker_thread+0x7a/0x5d0
       kthread+0x1bc/0x210
       ret_from_fork+0x24/0x30
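      
      The fix itself is small: take session->frwd_lock with bottom halves
      disabled now that the handler runs in thread context (a sketch, not
      the full diff):
      
          spin_lock_bh(&session->frwd_lock);
          /* ... iscsi_eh_cmd_timed_out() handling ... */
          spin_unlock_bh(&session->frwd_lock);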
      
      Fixes: 287922eb ("block: defer timeouts to a workqueue")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20191209173457.187370-1-bvanassche@acm.org
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Lee Duncan <lduncan@suse.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dm btree: increase rebalance threshold in __rebalance2() · 6970c159
      Hou Tao committed
      commit 474e559567fa631dea8fb8407ab1b6090c903755 upstream.
      
      We got the following warnings from thin_check during thin-pool setup:
      
        $ thin_check /dev/vdb
        examining superblock
        examining devices tree
          missing devices: [1, 84]
            too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126)
        examining mapping tree
      
      The phenomenon is that the number of entries in one node of the
      details_info tree is less than (max_entries / 3). It can easily be
      reproduced by the following procedure:
      
        $ create a new thin pool
        $ presume the max entries of the details_info tree is 126
        $ create 127 thin devices (e.g. 1~127) to make the root node full
          and then split
        $ remove the first 43 (e.g. 1~43) thin devices to make the children
          rebalance repeatedly
        $ stop the thin pool
        $ thin_check
      
      The root cause is that the B-tree removal procedure in __rebalance2()
      doesn't guarantee the invariant: the minimal number of entries in a
      non-root node should be >= (max_entries / 3).
      
      Simply fix the problem by increasing the rebalance threshold to
      make sure the number of entries in each child will be greater
      than or equal to (max_entries / 3 + 1), so no matter which
      child is used for removal, the number will still be valid.
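      
      The arithmetic, sketched (helper naming follows the commit text, not
      necessarily the dm-btree-remove.c source):
      
          /*
           * merge_threshold() is max_entries / 3. Each child must end up
           * with >= max_entries / 3 + 1 entries, so redistribution is only
           * safe when the two children together hold at least
           * 2 * (max_entries / 3 + 1) entries; below that, merge them.
           */
          unsigned threshold = 2 * (merge_threshold(left) + 1);
      
          if (nr_left + nr_right < threshold) {
                  /* merge both children into one node */
          } else {
                  /* redistribute entries between the two children */
          }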
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dm mpath: remove harmful bio-based optimization · 980b632f
      Mike Snitzer committed
      commit dbaf971c9cdf10843071a60dcafc1aaab3162354 upstream.
      
      Removes the branching for the edge case where no SCSI device handler
      exists.  The __map_bio_fast() method was far too limited, by only
      selecting a new pathgroup or path IFF there was a path failure; fix this
      by eliminating it in favor of __map_bio().  __map_bio()'s extra SCSI
      device handler specific MPATHF_PG_INIT_REQUIRED test is not in the fast
      path anyway.
      
      This change restores full path selector functionality for bio-based
      configurations that don't have a SCSI device handler.  But it should be
      noted that the path selectors do have an impact on performance for
      certain networks that are extremely fast (and don't require frequent
      switching).
      
      Fixes: 8d47e659 ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks")
      Cc: stable@vger.kernel.org
      Reported-by: Drew Hastings <dhastings@crucialwebhost.com>
      Suggested-by: Martin Wilck <mwilck@suse.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drm: meson: venc: cvbs: fix CVBS mode matching · a14083fe
      Martin Blumenstingl committed
      commit 43cb86799ff03e9819c07f37f72f80f8246ad7ed upstream.
      
      With commit 222ec161 ("drm: Add aspect ratio parsing in DRM
      layer") the drm core started honoring the picture_aspect_ratio field
      when comparing two drm_display_modes. Prior to that it was ignored.
      When the CVBS encoder driver was initially submitted there was no aspect
      ratio check.
      
      Switch from drm_mode_equal() to drm_mode_match() without
      DRM_MODE_MATCH_ASPECT_RATIO to fix "kmscube" and X.org output using the
      CVBS connector. When (for example) kmscube sets the output mode when
      using the CVBS connector it passes HDMI_PICTURE_ASPECT_NONE, making
      drm_mode_equal() fail as it includes the aspect ratio in the comparison.
      
      Prior to this patch kmscube reported:
        failed to set mode: Invalid argument
      
      The CVBS mode checking in the sun4i (drivers/gpu/drm/sun4i/sun4i_tv.c
      sun4i_tv_mode_to_drm_mode) and ZTE (drivers/gpu/drm/zte/zx_tvenc.c
      tvenc_mode_{pal,ntsc}) drivers don't set the "picture_aspect_ratio" at
      all. The Meson VPU driver does not rely on the aspect ratio for the CVBS
      output so we can safely decouple it from the hdmi_picture_aspect
      setting.
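      
      The resulting comparison, roughly (flag set per the description above;
      the exact call site in the meson venc code may differ):
      
          if (drm_mode_match(mode, &meson_cvbs_modes[i].mode,
                             DRM_MODE_MATCH_TIMINGS |
                             DRM_MODE_MATCH_CLOCK |
                             DRM_MODE_MATCH_FLAGS |
                             DRM_MODE_MATCH_3D_FLAGS))
                  return &meson_cvbs_modes[i];    /* aspect ratio ignored */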
      
      Cc: <stable@vger.kernel.org>
      Fixes: 222ec161 ("drm: Add aspect ratio parsing in DRM layer")
      Fixes: bbbe775e ("drm: Add support for Amlogic Meson Graphic Controller")
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Acked-by: Neil Armstrong <narmstrong@baylibre.com>
      [narmstrong: squashed with drm: meson: venc: cvbs: deduplicate the meson_cvbs_mode lookup code]
      Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20191208171832.1064772-3-martin.blumenstingl@googlemail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>