  1. 08 Aug 2019 (9 commits)
    • sched/fair: Expose newidle_balance() · 5ba553ef
      Committed by Peter Zijlstra
      For pick_next_task_fair() it is the newidle balance that requires
      dropping the rq->lock; provided we do put_prev_task() early, we can
      also detect the condition for doing newidle early.
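
      As a rough illustration, here is a simplified sketch of the arrangement
      this enables (not the actual fair.c code; pick_cfs_task() is a
      hypothetical placeholder for the normal entity-selection path):

      static struct task_struct *
      pick_next_task_fair_sketch(struct rq *rq, struct rq_flags *rf)
      {
      	struct task_struct *p;

      again:
      	if (rq->cfs.nr_running) {
      		p = pick_cfs_task(rq);	/* hypothetical: normal CFS pick */
      		return p;
      	}

      	/*
      	 * Nothing runnable in CFS and prev has already been put, so it
      	 * is safe for newidle_balance() to drop and re-take rq->lock
      	 * while trying to pull work from other CPUs.
      	 */
      	if (newidle_balance(rq, rf) > 0)
      		goto again;		/* pulled something, pick again */

      	return NULL;			/* still idle */
      }
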
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
      Link: https://lkml.kernel.org/r/9e3eb1859b946f03d7e500453a885725b68957ba.1559129225.git.vpillai@digitalocean.com
    • sched: Add task_struct pointer to sched_class::set_curr_task · 03b7fad1
      Committed by Peter Zijlstra
      In preparation for further separating pick_next_task() and
      set_curr_task(), we have to pass the actual task into the latter;
      while there, rename it to pair better with put_prev_task().
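
      A minimal sketch of the interface change being described (simplified;
      the real struct sched_class has many more hooks):

      struct sched_class {
      	/* ... */
      	void (*put_prev_task)(struct rq *rq, struct task_struct *p);

      	/* Before: operated implicitly on rq->curr. */
      	/* void (*set_curr_task)(struct rq *rq); */

      	/* After: takes the task explicitly, pairing with put_prev_task(). */
      	void (*set_next_task)(struct rq *rq, struct task_struct *p);
      	/* ... */
      };
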
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
      Link: https://lkml.kernel.org/r/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com
    • sched: Rework CPU hotplug task selection · 10e7071b
      Committed by Peter Zijlstra
      The CPU hotplug task selection is the only place where we used
      put_prev_task() on a task that is not current. While looking at that,
      it occurred to me that we can simplify all that by using a custom
      pick loop.
      
      Since we don't need to put current, we can do away with the fake task
      too.
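
      A sketch in the spirit of that pick loop (simplified; not a verbatim
      copy of the helper this commit adds):

      static struct task_struct *__pick_migrate_task(struct rq *rq)
      {
      	const struct sched_class *class;
      	struct task_struct *next;

      	for_each_class(class) {
      		next = class->pick_next_task(rq, NULL, NULL);
      		if (next) {
      			/* The task is only being migrated, not run here. */
      			next->sched_class->put_prev_task(rq, next);
      			return next;
      		}
      	}

      	/* The idle class should always have a runnable task. */
      	BUG();
      }
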
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
    • sched/{rt,deadline}: Fix set_next_task vs pick_next_task · f95d4eae
      Committed by Peter Zijlstra
      Because pick_next_task() implies set_curr_task() and some of the
      details haven't mattered too much, some of what _should_ be in
      set_curr_task() ended up in pick_next_task(); correct this.
      
      This prepares the way for a pick_next_task() variant that does not
      affect the current state; allowing remote picking.
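
      A rough sketch of the split being enforced (simplified, with the
      (prev, rf) arguments omitted; not the verbatim rt.c code):

      static void set_next_task_rt(struct rq *rq, struct task_struct *p)
      {
      	p->se.exec_start = rq_clock_task(rq);	/* 'p is now current' bookkeeping */
      	dequeue_pushable_task(rq, p);		/* the running task is never pushable */
      }

      static struct task_struct *pick_next_task_rt(struct rq *rq)
      {
      	struct task_struct *p = _pick_next_task_rt(rq);	/* selection only */

      	set_next_task_rt(rq, p);	/* current-state changes kept separate */
      	return p;
      }
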
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
      Link: https://lkml.kernel.org/r/38c61d5240553e043c27c5e00b9dd0d184dd6081.1559129225.git.vpillai@digitalocean.com
    • sched: Fix kerneldoc comment for ia64_set_curr_task · 5feeb783
      Committed by Peter Zijlstra
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
      Link: https://lkml.kernel.org/r/fde3a65ea3091ec6b84dac3c19639f85f452c5d1.1559129225.git.vpillai@digitalocean.com
    • stop_machine: Fix stop_cpus_in_progress ordering · 99d84bf8
      Committed by Peter Zijlstra
      Make sure the entire for loop has stop_cpus_in_progress set.
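
      A sketch of the intended ordering (simplified from the stop_cpus
      queueing path; illustrative, not a verbatim copy of the patch):

      static void queue_stop_cpus_work_sketch(const struct cpumask *cpumask)
      {
      	struct cpu_stop_work *work;
      	unsigned int cpu;

      	preempt_disable();
      	stop_cpus_in_progress = true;
      	barrier();			/* visible before the first queueing... */
      	for_each_cpu(cpu, cpumask) {
      		work = &per_cpu(cpu_stopper.stop_work, cpu);
      		/* work->fn, work->arg, work->done are filled in here */
      		cpu_stop_queue_work(cpu, work);
      	}
      	barrier();			/* ...and kept set until after the last one */
      	stop_cpus_in_progress = false;
      	preempt_enable();
      }
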
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lwe@gmail.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: mingo@kernel.org
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Julien Desfossez <jdesfossez@digitalocean.com>
      Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
      Link: https://lkml.kernel.org/r/0fd8fd4b99b9b9aa88d8b2dff897f7fd0d88f72c.1559129225.git.vpillai@digitalocean.com
    • sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · de53fd7a
      Committed by Dave Chiluk
      It has been observed that highly-threaded, non-cpu-bound applications
      running under cpu.cfs_quota_us constraints can hit a high percentage of
      throttled periods while simultaneously not consuming the allocated
      amount of quota. This use case is typical of user-interactive,
      non-cpu-bound applications, such as those running in kubernetes or
      mesos on multiple cpu cores.
      
      This has been root-caused to per-cpu run queues being allocated cpu
      bandwidth slices and then not fully using those slices within the
      period, at which point the slice and its quota expire. This expiration
      of unused slices results in applications not being able to utilize the
      quota they are allocated.
      
      The non-expiration of per-cpu slices was recently fixed by
      'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition")'. Prior to that it appears that this had been broken since
      at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
      added the following conditional which resulted in slices never being
      expired.
      
      if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
      	/* extend local deadline, drift is bounded above by 2 ticks */
      	cfs_rq->runtime_expires += TICK_NSEC;
      }
      
      Because this was broken for nearly 5 years, has only recently been
      fixed, and is now being noticed by many users running kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577), it is my
      opinion that the mechanisms around expiring runtime should be removed
      altogether.
      
      This allows quota already allocated to per-cpu run-queues to live
      longer than the period boundary, so threads on runqueues that do not
      use much CPU can continue to use their remaining slice over a longer
      period of time than cpu.cfs_period_us. In turn, this helps prevent the
      above condition of hitting throttling while also not fully utilizing
      the allocated cpu quota.
      
      This theoretically allows a machine to use slightly more than its
      allotted quota in some periods. This overflow is bounded by the
      remaining quota left on each per-cpu runqueue, which is typically no
      more than min_cfs_rq_runtime=1ms per cpu. For cpu-bound tasks this will
      change nothing, as they should theoretically fully utilize all of their
      quota in each period. For the user-interactive tasks described above,
      this provides a much better user/application experience, as their cpu
      utilization will more closely match the amount they requested when they
      hit throttling. This means that cpu limits no longer strictly apply per
      period for non-cpu-bound applications, but they remain accurate
      over longer timeframes.
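
      As a rough worked bound for the 80-CPU testcase mentioned below,
      assuming every cpu carries a full unused slice across a period
      boundary:

      \[ \text{overflow per period} \le n_{\text{cpu}} \times \texttt{min\_cfs\_rq\_runtime} = 80 \times 1\,\text{ms} = 80\,\text{ms} \]
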
      
      This greatly improves performance of high-thread-count, non-cpu bound
      applications with low cfs_quota_us allocation on high-core-count
      machines. In the case of an artificial testcase (10ms/100ms of quota on
      an 80-CPU machine), this commit resulted in almost a 30x performance
      improvement, while still maintaining correct cpu quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
      Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
    • sched: Clean up active_mm reference counting · 139d025c
      Committed by Peter Zijlstra
      The current active_mm reference counting is confusing and sub-optimal.
      
      Rewrite the code to explicitly consider the 4 separate cases:
      
          user -> user
      
      	When switching between two user tasks, all we need to consider
      	is switch_mm().
      
          user -> kernel
      
      	When switching from a user task to a kernel task (which
      	doesn't have an associated mm) we retain the last mm in our
      	active_mm. Increment a reference count on active_mm.
      
          kernel -> kernel
      
      	When switching between kernel threads, all we need to do is
      	pass along the active_mm reference.
      
          kernel -> user
      
      	When switching from a kernel task to a user task, we must switch
      	from the last active_mm to the next mm, hoping of course that
      	these are the same. Decrement a reference on the active_mm.
      
      The code keeps a different order, because as you'll note, both 'to
      user' cases require switch_mm().
      
      And where the old code would increment/decrement for the 'kernel ->
      kernel' case, the new code observes this is a neutral operation and
      avoids touching the reference count.
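
      A condensed sketch of the resulting context_switch() logic (simplified;
      the real code also handles lazy TLB and defers the mmdrop() to
      finish_task_switch()):

      if (!next->mm) {				/* to kernel */
      	next->active_mm = prev->active_mm;
      	if (prev->mm)				/* from user */
      		mmgrab(prev->active_mm);	/* user -> kernel: take a reference */
      	else
      		prev->active_mm = NULL;		/* kernel -> kernel: hand the reference over */
      } else {					/* to user */
      	switch_mm(prev->active_mm, next->mm, next);
      	if (!prev->mm) {			/* from kernel */
      		mmdrop(prev->active_mm);	/* kernel -> user: drop the reference */
      		prev->active_mm = NULL;
      	}
      	/* user -> user: switch_mm() alone is enough */
      }
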
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: luto@kernel.org
    • rcu/tree: Fix SCHED_FIFO params · 130d9c33
      Committed by Peter Zijlstra
      A rather embarrassing mistake had us call sched_setscheduler() before
      initializing the parameters passed to it.
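
      In other words (an illustrative sketch, not the exact rcu code; t is
      the grace-period kthread and kthread_prio the requested priority):

      struct sched_param sp;

      /* Broken: sp.sched_priority is still uninitialized at this point. */
      /* sched_setscheduler_nocheck(t, SCHED_FIFO, &sp); */

      /* Correct: fill in the parameters first, then apply them. */
      sp.sched_priority = kthread_prio;
      sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
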
      
      Fixes: 1a763fd7 ("rcu/tree: Call setschedule() gp ktread to SCHED_FIFO outside of atomic region")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
  2. 25 Jul 2019 (23 commits)
  3. 23 Jul 2019 (5 commits)
  4. 22 Jul 2019 (3 commits)
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 83768245
      Committed by Linus Torvalds
      Pull networking fixes from David Miller:
      
       1) Several netfilter fixes including a nfnetlink deadlock fix from
          Florian Westphal and fix for dropping VRF packets from Miaohe Lin.
      
       2) Flow offload fixes from Pablo Neira Ayuso including a fix to restore
          proper block sharing.
      
       3) Fix r8169 PHY init from Thomas Voegtle.
      
       4) Fix memory leak in mac80211, from Lorenzo Bianconi.
      
       5) Missing NULL check on object allocation in cxgb4, from Navid
          Emamdoost.
      
       6) Fix scaling of RX power in sfp phy driver, from Andrew Lunn.
      
       7) Check that there is actually an ip header to access in skb->data in
          VRF, from Peter Kosyh.
      
       8) Remove spurious rcu unlock in hv_netvsc, from Haiyang Zhang.
      
       9) One more tweak to the TCP fragmentation memory limit changes, to be
          less harmful to applications setting small SO_SNDBUF values. From
          Eric Dumazet.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits)
        tcp: be more careful in tcp_fragment()
        hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
        vrf: make sure skb->data contains ip header to make routing
        connector: remove redundant input callback from cn_dev
        qed: Prefer pcie_capability_read_word()
        igc: Prefer pcie_capability_read_word()
        cxgb4: Prefer pcie_capability_read_word()
        be2net: Synchronize be_update_queues with dev_watchdog
        bnx2x: Prevent load reordering in tx completion processing
        net: phy: sfp: hwmon: Fix scaling of RX power
        net: sched: verify that q!=NULL before setting q->flags
        chelsio: Fix a typo in a function name
        allocate_flower_entry: should check for null deref
        net: hns3: typo in the name of a constant
        kbuild: add net/netfilter/nf_tables_offload.h to header-test blacklist.
        tipc: Fix a typo
        mac80211: don't warn about CW params when not using them
        mac80211: fix possible memory leak in ieee80211_assign_beacon
        nl80211: fix NL80211_HE_MAX_CAPABILITY_LEN
        nl80211: fix VENDOR_CMD_RAW_DATA
        ...
    • pidfd: fix a poll race when setting exit_state · b191d649
      Committed by Suren Baghdasaryan
      There is a race between reading task->exit_state in pidfd_poll and
      writing it after do_notify_parent calls do_notify_pidfd. The expected
      sequence of events is:
      
      CPU 0                            CPU 1
      ------------------------------------------------
      exit_notify
        do_notify_parent
          do_notify_pidfd
        tsk->exit_state = EXIT_DEAD
                                        pidfd_poll
                                           if (tsk->exit_state)
      
      However, nothing prevents the following sequence:
      
      CPU 0                            CPU 1
      ------------------------------------------------
      exit_notify
        do_notify_parent
          do_notify_pidfd
                                         pidfd_poll
                                            if (tsk->exit_state)
        tsk->exit_state = EXIT_DEAD
      
      This causes a polling task to wait forever, since poll blocks because
      exit_state is 0 and the waiting task is not notified again. A stress
      test continuously doing pidfd poll and process exits uncovered this bug.
      
      To fix it, we make sure that the task's exit_state is always set before
      calling do_notify_pidfd.
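
      A minimal sketch of that ordering (illustrative only, not the actual
      exit_notify() code; EXIT_ZOMBIE stands in for whichever exit state
      applies):

      /*
       * Publish the exit state before the notification, so that a
       * concurrent pidfd_poll() woken by do_notify_pidfd() is guaranteed
       * to observe a non-zero ->exit_state and does not block forever.
       */
      tsk->exit_state = EXIT_ZOMBIE;

      do_notify_parent(tsk, tsk->exit_signal);	/* calls do_notify_pidfd() */
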
      
      Fixes: b53b0b9d ("pidfd: add polling support")
      Cc: kernel-team@android.com
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
      [christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
      Signed-off-by: Christian Brauner <christian@brauner.io>
    • tcp: be more careful in tcp_fragment() · b617158d
      Committed by Eric Dumazet
      Some applications set tiny SO_SNDBUF values and expect
      TCP to just work. Recent patches to address CVE-2019-11478
      broke them in case of losses, since retransmits might
      be prevented.
      
      We should allow these flows to make progress.
      
      This patch allows the first and last skb in retransmit queue
      to be split even if memory limits are hit.
      
      It also adds some room due to the fact that tcp_sendmsg()
      and tcp_sendpage() might overshoot sk_wmem_queued by about one full
      TSO skb (64KB size). Note that this allowance was already present
      in stable backports for kernels < 4.15.
      
      Note for < 4.15 backports: tcp_rtx_queue_tail() will probably look like:
      
      static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
      {
      	struct sk_buff *skb = tcp_send_head(sk);
      
      	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
      }
      
      Fixes: f070ef2a ("tcp: tcp_fragment() should apply sane memory limits")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrew Prout <aprout@ll.mit.edu>
      Tested-by: Andrew Prout <aprout@ll.mit.edu>
      Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Tested-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Christoph Paasch <cpaasch@apple.com>
      Cc: Jonathan Looney <jtl@netflix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>