1. 06 Apr, 2016 (2 commits)
  2. 05 Apr, 2016 (14 commits)
  3. 03 Apr, 2016 (2 commits)
    • stmmac: add new DT platform entries for GMAC4 · ee2ae1ed
      Committed by Alexandre TORGUE
      This is to support the snps,dwmac-4.00 and snps,dwmac-4.10a
      and related features on the platform driver.
      See binding doc for further details.
      Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove cwnd moderation after recovery · 23492623
      Committed by Yuchung Cheng
      For non-SACK connections, cwnd is lowered to inflight plus 3 packets
      when the recovery ends. This is an optional feature in the NewReno
      RFC 2582 to reduce the potential burst when cwnd is "re-opened"
      after recovery and inflight is low.
      
      This feature is questionably effective because of PRR: when
      the recovery ends (i.e., snd_una == high_seq) NewReno holds the
      CA_Recovery state for another round trip to prevent false fast
      retransmits. But if the inflight is low, PRR will overwrite the
      moderated cwnd in tcp_cwnd_reduction() later regardless. So if a
      receiver responds with bogus ACKs (i.e., acking future data) to speed
      up transfer after recovery, it can only induce a burst of up to a
      window's worth of data packets by acking up to SND.NXT. A restart from
      (short) idle or receiving stretched ACKs can both cause such bursts as
      well.
      
      On the other hand, if the recovery ends because the sender
      detects the losses were spurious (e.g., reordering), this feature
      unconditionally lowers a reverted cwnd even though nothing
      was lost.
      
      In principle, the loss recovery module should not update cwnd.
      Furthermore, pacing is much more effective at reducing bursts. Hence
      this patch removes the cwnd moderation feature.
      
      v2 changes: revised commit message on bogus ACKs and burst, and
                  missing signature
      Signed-off-by: Matt Mathis <mattmathis@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
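The moderation that the tcp commit above removes can be sketched roughly as follows. This is a user-space model only: `moderate_cwnd`, `keep_cwnd` and `MAX_BURST` are hypothetical names chosen for illustration, not kernel symbols.

```c
#include <stdint.h>

/* Illustrative model; these names are assumptions, not kernel API. */
enum { MAX_BURST = 3 };  /* the "plus 3 packets" from the commit message */

/* Old behaviour: clamp cwnd to inflight + 3 when recovery ends. */
uint32_t moderate_cwnd(uint32_t cwnd, uint32_t inflight)
{
    uint32_t limit = inflight + MAX_BURST;

    return cwnd < limit ? cwnd : limit;
}

/* Behaviour after the patch: leaving recovery does not touch cwnd. */
uint32_t keep_cwnd(uint32_t cwnd, uint32_t inflight)
{
    (void)inflight;  /* inflight no longer influences the result */
    return cwnd;
}
```

With a reverted cwnd of 40 and 2 packets in flight, the old clamp would cut the window to 5 even though nothing was lost, which is the case the commit message objects to.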
  4. 02 Apr, 2016 (5 commits)
    • mm/page_isolation: fix tracepoint to mirror check function behavior · bbe3de25
      Committed by Lucas Stach
      Page isolation has not failed if the fin pfn extends beyond the end pfn
      and test_pages_isolated checks this correctly.  Fix the tracepoint to
      report the same result as the actual check function.
      Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
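A minimal sketch of the relationship described in the page_isolation commit above, assuming the check fails only when the scan stops short of the end pfn. The function names are hypothetical; only the comparison itself is taken from the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Check result: failure only if the scan stopped before end_pfn. */
int isolation_result(unsigned long fin_pfn, unsigned long end_pfn)
{
    return fin_pfn < end_pfn ? -EBUSY : 0;
}

/* What the tracepoint should report so that it agrees with the check. */
bool trace_reports_success(unsigned long fin_pfn, unsigned long end_pfn)
{
    return end_pfn <= fin_pfn;
}
```

A tracepoint that required strict equality of the two values would report a failure for a fin pfn past the end pfn, while the check itself would succeed.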
    • include/linux/huge_mm.h: return NULL instead of false for pmd_trans_huge_lock() · 969e8d7e
      Committed by Chen Gang
      The return value of pmd_trans_huge_lock() is a pointer, not a boolean
      value, so use NULL instead of false as the return value.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • stmmac: fix MDIO settings · a7657f12
      Committed by Giuseppe CAVALLARO
      Initially the phy_bus_name was added to manipulate the
      driver name, but recently it was only used to manage the
      fixed-link and then to take some decisions at run-time.
      So the patch uses is_pseudo_fixed_link and removes
      the phy_bus_name variable, which is no longer necessary.
      
      The driver can manage the mdio registration by using phy-handle,
      dwmac-mdio and its own parameter, e.g. snps,phy-addr.
      This patch takes care of all these possible configurations
      and fixes the mdio registration in case there is a real
      transceiver or a switch (that needs to be managed by using
      fixed-link).
      Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Reviewed-by: Andreas Färber <afaerber@suse.de>
      Tested-by: Frank Schäfer <fschaefer.oss@googlemail.com>
      Cc: Gabriel Fernandez <gabriel.fernandez@linaro.org>
      Cc: Dinh Nguyen <dinh.linux@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Phil Reid <preid@electromag.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "stmmac: Fix 'eth0: No PHY found' regression" · d7e944c8
      Committed by Giuseppe CAVALLARO
      This reverts commit 88f8b1bb due to problems on the GeekBox and
      Banana Pi M1 boards when connected to a real transceiver instead
      of a switch via fixed-link.
      Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Gabriel Fernandez <gabriel.fernandez@linaro.org>
      Cc: Andreas Färber <afaerber@suse.de>
      Cc: Frank Schäfer <fschaefer.oss@googlemail.com>
      Cc: Dinh Nguyen <dinh.linux@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter · 5a5abb1f
      Committed by Daniel Borkmann
      Sasha Levin reported a suspicious rcu_dereference_protected() warning
      found while fuzzing with trinity that is similar to this one:
      
        [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
        [   52.765688] other info that might help us debug this:
        [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
        [   52.765701] 1 lock held by a.out/1525:
        [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
        [   52.765721] stack backtrace:
        [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
        [...]
        [   52.765768] Call Trace:
        [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
        [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
        [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
        [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
        [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
        [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
        [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
        [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
        [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
        [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
        [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
        [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25
      
      Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
      from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
      DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.
      
      Since the fix in f91ff5b9 ("net: sk_{detach|attach}_filter() rcu
      fixes") sk_attach_filter()/sk_detach_filter() now dereference the
      filter with rcu_dereference_protected(), checking whether the socket
      lock is held in the control path.
      
      Since its introduction in 99405162 ("tun: socket filter support"),
      tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
      sock_owned_by_user(sk) doesn't apply in this specific case and therefore
      triggers the false positive.
      
      Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
      that is used by tap filters and pass in lockdep_rtnl_is_held() for the
      rcu_dereference_protected() checks instead.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
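The idea behind the tun/bpf fix above, letting each caller say which lock protects the filter pointer, might look like this in a user-space model. Everything here (`sock_model`, `deref_protected`, `detach_filter_locked`, the two wrappers, the warning counter) is a hypothetical stand-in; the kernel's real helpers are only named in the commit message.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct filter { int id; };

struct sock_model {
    struct filter *filter;
    bool owned_by_user;      /* stands in for sock_owned_by_user(sk) */
};

bool rtnl_held;              /* stands in for lockdep_rtnl_is_held() */
int lockdep_warnings;        /* counts "suspicious usage" reports */

/* Mimics rcu_dereference_protected(): complain if the condition is false. */
struct filter *deref_protected(struct filter *p, bool locked)
{
    if (!locked)
        lockdep_warnings++;
    return p;
}

/* Inner helper: the caller passes in the lock condition that applies. */
int detach_filter_locked(struct sock_model *sk, bool locked)
{
    struct filter *f = deref_protected(sk->filter, locked);

    if (!f)
        return -ENOENT;
    sk->filter = NULL;
    return 0;
}

/* Socket path: protected by the socket lock. */
int socket_detach_filter(struct sock_model *sk)
{
    return detach_filter_locked(sk, sk->owned_by_user);
}

/* Tap path: protected by RTNL, so the socket-lock check does not apply. */
int tap_detach_filter(struct sock_model *sk)
{
    return detach_filter_locked(sk, rtnl_held);
}
```

In this model the tap path raises no warning while RTNL is held, whereas routing it through the socket-lock variant would trip the check even though the access is safe.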
  5. 31 Mar, 2016 (1 commit)
    • bpf: make padding in bpf_tunnel_key explicit · c0e760c9
      Committed by Daniel Borkmann
      Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl
      and tunnel_label members explicit. No issue has been observed, and
      gcc/llvm does padding for the old struct already, where tunnel_label
      was not yet present, so the current code works, but since it's part
      of uapi, make sure we don't introduce holes in structs.
      
      Therefore, add tunnel_ext that we can use generically in future
      (f.e. to flag OAM messages for backends, etc). Also add the offset
      to the compat tests to be sure should some compilers not pad the
      tail of the old version of bpf_tunnel_key.
      
      Fixes: 4018ab18 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
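The effect of making the hole explicit can be checked with `offsetof`. The two structs below are simplified stand-ins for the uapi struct: the member types and the plain array used for the remote address are assumptions, and the asserts presume an ABI where `uint32_t` is 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout with the hole left to the compiler (assumed field types). */
struct tunnel_key_implicit {
    uint32_t tunnel_id;
    uint32_t remote_ipv6[4];
    uint8_t  tunnel_tos;
    uint8_t  tunnel_ttl;
    /* 2 bytes of compiler-inserted padding on typical ABIs */
    uint32_t tunnel_label;
};

/* Layout with the hole turned into a named member. */
struct tunnel_key_explicit {
    uint32_t tunnel_id;
    uint32_t remote_ipv6[4];
    uint8_t  tunnel_tos;
    uint8_t  tunnel_ttl;
    uint16_t tunnel_ext;     /* the former hole */
    uint32_t tunnel_label;
};

/* Naming the padding must not move any existing member. */
_Static_assert(offsetof(struct tunnel_key_implicit, tunnel_label) ==
               offsetof(struct tunnel_key_explicit, tunnel_label),
               "tunnel_label offset changed");
_Static_assert(sizeof(struct tunnel_key_implicit) ==
               sizeof(struct tunnel_key_explicit),
               "struct size changed");
_Static_assert(offsetof(struct tunnel_key_explicit, tunnel_ext) ==
               offsetof(struct tunnel_key_explicit, tunnel_ttl) + 1,
               "hole remains before tunnel_ext");
```

On an ABI that aligns 32-bit integers to 2 bytes, the first assert would fail for the implicit layout, which is the kind of divergence the commit guards against.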
  6. 29 Mar, 2016 (1 commit)
  7. 28 Mar, 2016 (1 commit)
    • netfilter: ipset: fix race condition in ipset save, swap and delete · 596cf3fe
      Committed by Vishwanath Pai
      This fix adds a new reference counter (ref_netlink) for the struct ip_set.
      The other reference counter (ref) can be swapped out by ip_set_swap and we
      need a separate counter to keep track of references for netlink events
      like dump. Using the same ref counter for dump causes a race condition
      which can be demonstrated by the following script:
      
      ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \
      counters
      ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \
      counters
      ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \
      counters
      
      ipset save &
      
      ipset swap hash_ip3 hash_ip2
      ipset destroy hash_ip3  # will crash the machine
      
      Swap will exchange the values of ref so destroy will see ref = 0 instead of
      ref = 1. With this fix in place swap will not succeed because ipset save
      still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink).
      
      Both delete and swap will error out if ref_netlink != 0 on the set.
      
      Note: The changes to the *_head functions are because previously we
      would increment ref whenever we called these functions; we don't do
      that anymore.
      Reviewed-by: Joshua Hunt <johunt@akamai.com>
      Signed-off-by: Vishwanath Pai <vpai@akamai.com>
      Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
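A simplified model of the two counters from the ipset commit above. The struct and function names are hypothetical, and real ipset swaps considerably more state than a single integer; the sketch only shows why a counter that is never swapped keeps a dump in progress visible.

```c
#include <errno.h>
#include <stddef.h>

struct set_model {
    const char  *name;
    unsigned int ref;          /* exchanged by swap */
    unsigned int ref_netlink;  /* held by netlink dumps, never swapped */
};

/* Swap refuses to run while either set is being dumped. */
int set_swap(struct set_model *a, struct set_model *b)
{
    unsigned int tmp;

    if (a->ref_netlink || b->ref_netlink)
        return -EBUSY;
    tmp = a->ref;
    a->ref = b->ref;
    b->ref = tmp;
    return 0;
}

/* Destroy refuses to run while any reference of either kind remains. */
int set_destroy(struct set_model *s)
{
    if (s->ref || s->ref_netlink)
        return -EBUSY;
    s->name = NULL;  /* stands in for freeing the set */
    return 0;
}
```

While `ipset save` holds `ref_netlink` on a set, both the swap and the destroy from the script return `-EBUSY` in this model instead of freeing a set that is still being dumped.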
  8. 26 Mar, 2016 (14 commits)
    • mm, kasan: stackdepot implementation. Enable stackdepot for SLAB · cd11016e
      Committed by Alexander Potapenko
      Implement the stack depot and provide CONFIG_STACKDEPOT.  Stack depot
      will allow KASAN to store allocation/deallocation stack traces for memory
      chunks.  The stack traces are stored in a hash table and referenced by
      handles which reside in the kasan_alloc_meta and kasan_free_meta
      structures in the allocated memory chunks.
      
      IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
      duplication.
      
      Right now stackdepot support is only enabled in SLAB allocator.  Once
      KASAN features in SLAB are on par with those in SLUB we can switch SLUB
      to stackdepot as well, thus removing the dependency on SLUB stack
      bookkeeping, which wastes a lot of memory.
      
      This patch is based on the "mm: kasan: stack depots" patch originally
      prepared by Dmitry Chernenkov.
      
      Joonsoo has said that he plans to reuse the stackdepot code for the
      mm/page_owner.c debugging facility.
      
      [akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
      [aryabinin@virtuozzo.com: comment style fixes]
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
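The core idea of the stack depot commit above, storing each distinct stack trace once in a hash table and handing out a small handle, can be sketched as below. This is not the kernel implementation: the fixed-size open-addressing table, the FNV-1a hash, and all names are assumptions made to keep the example short.

```c
#include <stdint.h>
#include <string.h>

#define DEPOT_SLOTS 64   /* illustrative table size */
#define MAX_FRAMES  16   /* illustrative trace depth limit */

typedef uint32_t depot_handle_t;   /* 0 means "no trace stored" */

struct depot_entry {
    int          used;
    uint32_t     hash;
    unsigned int nr;
    uintptr_t    frames[MAX_FRAMES];
};

static struct depot_entry depot[DEPOT_SLOTS];

/* FNV-1a over the bytes of every frame address. */
static uint32_t hash_frames(const uintptr_t *frames, unsigned int nr)
{
    uint32_t h = 2166136261u;

    for (unsigned int i = 0; i < nr; i++) {
        uintptr_t v = frames[i];

        for (size_t b = 0; b < sizeof(v); b++) {
            h ^= (uint32_t)(v & 0xff);
            h *= 16777619u;
            v >>= 8;
        }
    }
    return h;
}

/* Store a trace, or find the identical one stored earlier. */
depot_handle_t depot_save(const uintptr_t *frames, unsigned int nr)
{
    uint32_t h;

    if (nr == 0 || nr > MAX_FRAMES)
        return 0;
    h = hash_frames(frames, nr);

    for (unsigned int probe = 0; probe < DEPOT_SLOTS; probe++) {
        unsigned int slot = (h + probe) % DEPOT_SLOTS;
        struct depot_entry *e = &depot[slot];

        if (!e->used) {
            /* First time this trace is seen: record it. */
            e->used = 1;
            e->hash = h;
            e->nr = nr;
            memcpy(e->frames, frames, nr * sizeof(*frames));
            return slot + 1;
        }
        if (e->hash == h && e->nr == nr &&
            memcmp(e->frames, frames, nr * sizeof(*frames)) == 0)
            return slot + 1;   /* duplicate: reuse the stored copy */
    }
    return 0;   /* table full */
}
```

An object's metadata then only needs to hold a `depot_handle_t` per recorded trace instead of a full copy of the frames.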
    • arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections · be7635e7
      Committed by Alexander Potapenko
      KASAN needs to know whether the allocation happens in an IRQ handler.
      This lets us strip everything below the IRQ entry point to reduce the
      number of unique stack traces needed to be stored.
      
      Move the definition of __irq_entry to <linux/interrupt.h> so that the
      users don't need to pull in <linux/ftrace.h>.  Also introduce the
      __softirq_entry macro which is similar to __irq_entry, but puts the
      corresponding functions into the .softirqentry.text section.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, kasan: add GFP flags to KASAN API · 505f5dcb
      Committed by Alexander Potapenko
      Add GFP flags to KASAN hooks for future patches to use.
      
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, kasan: SLAB support · 7ed2f9e6
      Committed by Alexander Potapenko
      Add KASAN hooks to SLAB allocator.
      
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/oom.h: remove undefined oom_kills_count()/note_oom_kill() · aaf4fb71
      Committed by Tetsuo Handa
      A leftover from commit c32b3cbe ("oom, PM: make OOM detection in the
      freezer path raceless").
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom, oom_reaper: protect oom_reaper_list using simpler way · bb29902a
      Committed by Tetsuo Handa
      "oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried
      to protect oom_reaper_list using MMF_OOM_KILLED flag.  But we can do it
      by simply checking tsk->oom_reaper_list != NULL.
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom: make oom_reaper_list single linked · 29c696e1
      Committed by Vladimir Davydov
      Entries are only added/removed from oom_reaper_list at head so we can
      use a single linked list and hence save a word in task_struct.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
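Why head-only insertion and removal allows a single link can be seen in a small model. `task_model`, `reaper_push` and `reaper_pop` are hypothetical names, and the locking the kernel needs around the list is left out.

```c
#include <stddef.h>

/* A doubly linked node costs two pointers per task. */
struct list_head_model {
    struct list_head_model *next, *prev;
};

struct task_model {
    int pid;
    struct task_model *reaper_next;   /* one pointer is enough */
};

struct task_model *reaper_list;       /* head of the queue */

/* Add at the head. */
void reaper_push(struct task_model *t)
{
    t->reaper_next = reaper_list;
    reaper_list = t;
}

/* Remove from the head; returns NULL when the list is empty. */
struct task_model *reaper_pop(void)
{
    struct task_model *t = reaper_list;

    if (t) {
        reaper_list = t->reaper_next;
        t->reaper_next = NULL;
    }
    return t;
}
```

Neither operation ever needs a back pointer, so the `prev` word of a doubly linked node would be dead weight in every task.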
    • oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task · 855b0183
      Committed by Michal Hocko
      Tetsuo has reported that oom_kill_allocating_task=1 will cause
      oom_reaper_list corruption because oom_kill_process doesn't follow
      standard OOM exclusion (i.e. it ignores TIF_MEMDIE) and allows the
      same task to be enqueued multiple times, e.g. by sacrificing the
      same child multiple times.
      
      This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag
      which is set in oom_kill_process atomically and oom reaper is disabled
      if the flag was already set.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
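The "set the flag atomically and skip the reaper if it was already set" step can be sketched with C11 atomics. The bit value, `mm_model`, `mark_oom_killed` and `kill_process` are assumptions for illustration; only the flag name comes from the commit message.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MMF_OOM_KILLED_BIT 0x1u   /* illustrative value, not the kernel's */

struct mm_model {
    atomic_uint flags;
};

int enqueued;   /* counts how often the reaper would have been woken */

/* Atomic test-and-set: true only for the first caller. */
bool mark_oom_killed(struct mm_model *mm)
{
    unsigned int old = atomic_fetch_or(&mm->flags, MMF_OOM_KILLED_BIT);

    return !(old & MMF_OOM_KILLED_BIT);
}

/* Models the kill path: queue the victim at most once. */
void kill_process(struct mm_model *mm)
{
    if (mark_oom_killed(mm))
        enqueued++;   /* stands in for waking the oom reaper */
}
```

Because the old flag value is read and the new one written in a single atomic step, two kills of the same victim cannot both see the flag as clear.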
    • mm, oom_reaper: implement OOM victims queuing · 03049269
      Committed by Michal Hocko
      wake_oom_reaper has allowed only 1 oom victim to be queued.  The main
      reason for that was the simplicity as other solutions would require some
      way of queuing.  The current approach is racy and that was deemed
      sufficient as the oom_reaper is considered a best effort approach to
      help with oom handling when the OOM victim cannot terminate in a
      reasonable time.  The race could lead to missing an oom victim which can
      get stuck:
      
      out_of_memory
        wake_oom_reaper
          cmpxchg // OK
          			oom_reaper
      			  oom_reap_task
      			    __oom_reap_task
      oom_victim terminates
      			      atomic_inc_not_zero // fail
      out_of_memory
        wake_oom_reaper
          cmpxchg // fails
      			  task_to_reap = NULL
      
      This race requires 2 OOM invocations in a short time period which is not
      very likely but certainly not impossible.  E.g.  the original victim
      might have not released a lot of memory for some reason.
      
      The situation would improve considerably if wake_oom_reaper used a more
      robust queuing.  This is what this patch implements.  This means adding
      oom_reaper_list list_head into task_struct (eat a hole before embedded
      thread_struct for that purpose) and an oom_reaper_lock spinlock for
      queuing synchronization.  wake_oom_reaper will then add the task on the
      queue and oom_reaper will dequeue it.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space · 36324a99
      Committed by Michal Hocko
      When oom_reaper manages to unmap all the eligible vmas there shouldn't
      be much of the freeable memory held by the oom victim left anymore so it
      makes sense to clear the TIF_MEMDIE flag for the victim and allow the
      OOM killer to select another task.
      
      The lack of TIF_MEMDIE also means that the victim cannot access memory
      reserves anymore but that shouldn't be a problem because it would get
      the access again if it needs to allocate and hits the OOM killer again
      due to the fatal_signal_pending resp.  PF_EXITING check.  We can safely
      hide the task from the OOM killer because it is clearly not a good
      candidate anymore as everything reclaimable has been torn down already.
      
      This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
      and thus hold off further global OOM killer actions granted the oom
      reaper is able to take mmap_sem for the associated mm struct.  This is
      not guaranteed now but further steps should make sure that mmap_sem for
      write should be blocked killable which will help to reduce such a lock
      contention.  This is not done by this patch.
      
      Note that exit_oom_victim might be called on a remote task from
      __oom_reap_task now so we have to check and clear the flag atomically
      otherwise we might race and underflow oom_victims or wake up waiters too
      early.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: introduce oom reaper · aac45363
      Committed by Michal Hocko
      This patch (of 5):
      
      This is based on the idea from Mel Gorman discussed during LSFMM 2015
      and independently brought up by Oleg Nesterov.
      
      The OOM killer currently allows to kill only a single task in a good
      hope that the task will terminate in a reasonable time and frees up its
      memory.  Such a task (oom victim) will get an access to memory reserves
      via mark_oom_victim to allow a forward progress should there be a need
      for additional memory during exit path.
      
      It has been shown (e.g.  by Tetsuo Handa) that it is not that hard to
      construct workloads which break the core assumption mentioned above and
      the OOM victim might take unbounded amount of time to exit because it
      might be blocked in the uninterruptible state waiting for an event (e.g.
      lock) which is blocked by another task looping in the page allocator.
      
      This patch reduces the probability of such a lockup by introducing a
      specialized kernel thread (oom_reaper) which tries to reclaim additional
      memory by preemptively reaping the anonymous or swapped out memory owned
      by the oom victim under an assumption that such a memory won't be needed
      when its owner is killed and kicked from the userspace anyway.  There is
      one notable exception to this, though, if the OOM victim was in the
      process of coredumping the result would be incomplete.  This is
      considered a reasonable constraint because the overall system health is
      more important than debuggability of a particular application.
      
      A kernel thread has been chosen because we need a reliable way of
      invocation so workqueue context is not appropriate because all the
      workers might be busy (e.g.  allocating memory).  Kswapd which sounds
      like another good fit is not appropriate as well because it might get
      blocked on locks during reclaim as well.
      
      oom_reaper has to take mmap_sem on the target task for reading so the
      solution is not 100% because the semaphore might be held or blocked for
      write but the probability is reduced considerably wrt.  basically any
      lock blocking forward progress as described above.  In order to prevent
      from blocking on the lock without any forward progress we are using only
      a trylock and retry 10 times with a short sleep in between.  Users of
      mmap_sem which need it for write should be carefully reviewed to use
      _killable waiting as much as possible and reduce allocations requests
      done with the lock held to absolute minimum to reduce the risk even
      further.
      
      The API between oom killer and oom reaper is quite trivial.
      wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only
      NULL->mm transition and oom_reaper clears this atomically once it is done
      with the work.  This means that only a single mm_struct can be reaped at
      the time.  As the operation is potentially disruptive we are trying to
      limit it to the necessary minimum and the reaper blocks any updates while
      it operates on an mm.  mm_struct is pinned by mm_count to allow parallel
      exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Suggested-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
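Two mechanisms from the oom reaper commit above, the single-slot handoff via compare-and-exchange and the bounded trylock loop, can be modelled in user space as follows. The struct, the function names and the use of `atomic_flag` in place of mmap_sem are all assumptions; the sleep between attempts is only indicated by a comment.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REAP_RETRIES 10   /* "retry 10 times" from the commit message */

struct mm_model {
    atomic_flag mmap_lock;    /* stands in for mmap_sem; set means held */
    bool reaped;
};

_Atomic(struct mm_model *) mm_to_reap;   /* the single handoff slot */

/* Only a NULL -> mm transition succeeds, so one mm is queued at a time. */
bool wake_reaper(struct mm_model *mm)
{
    struct mm_model *expected = NULL;

    return atomic_compare_exchange_strong(&mm_to_reap, &expected, mm);
}

/* Try the lock a bounded number of times instead of blocking on it. */
bool reap_with_retries(struct mm_model *mm)
{
    for (int attempt = 0; attempt < MAX_REAP_RETRIES; attempt++) {
        if (!atomic_flag_test_and_set(&mm->mmap_lock)) {
            mm->reaped = true;   /* the kernel would unmap memory here */
            atomic_flag_clear(&mm->mmap_lock);
            return true;
        }
        /* the kernel sleeps briefly here before the next attempt */
    }
    return false;
}

/* One reaper pass: take the queued mm, try to reap it, free the slot. */
bool reaper_run_once(void)
{
    struct mm_model *mm = atomic_load(&mm_to_reap);
    bool ok;

    if (!mm)
        return false;
    ok = reap_with_retries(mm);
    atomic_store(&mm_to_reap, NULL);
    return ok;
}
```

A second `wake_reaper()` call made before the slot is cleared simply fails, which is the single-victim limitation that the later queuing commit addresses.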
    • sched: add schedule_timeout_idle() · 69b27baf
      Committed by Andrew Morton
      This will be needed in the patch "mm, oom: introduce oom reaper".
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ceph: fix security xattr deadlock · 315f2408
      Committed by Yan, Zheng
      When security is enabled, security module can call filesystem's
      getxattr/setxattr callbacks during d_instantiate(). For cephfs,
      d_instantiate() is usually called by MDS' dispatch thread, while
      handling MDS reply. If the MDS reply does not include xattrs and
      corresponding caps, getxattr/setxattr need to send a new request
      to MDS and wait for the reply. This makes MDS' dispatch thread
      sleep, so nobody handles later MDS replies.
      
      The fix is to make sure the lookup/atomic_open reply includes xattrs
      and corresponding caps, so getxattr can be handled by cached xattrs.
      This requires some modification to both MDS and request message.
      (Client tells MDS what caps it wants; MDS encodes proper caps in
      the reply)
      
      Smack security module may call setxattr during d_instantiate().
      Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL
      to us. So just make setxattr return error when called by MDS'
      dispatch thread.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • libceph: add helper that duplicates last extent operation · 2c63f49a
      Committed by Yan, Zheng
      This helper duplicates last extent operation in OSD request, then
      adjusts the new extent operation's offset and length. The helper
      is for scattered page writeback, which adds nonconsecutive dirty
      pages to single OSD request.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
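A rough model of "duplicate the last extent operation, then adjust the copy's offset and length". The struct layout, the name `dup_last_extent`, and the exact adjustment (advance the offset by `offset_inc` and shrink the length by the same amount) are assumptions made for illustration; the commit message does not spell them out.

```c
#include <stdint.h>

#define MAX_OPS 16   /* illustrative limit */

struct extent_op {
    uint64_t offset;
    uint64_t length;
};

struct request_model {
    unsigned int     num_ops;
    struct extent_op ops[MAX_OPS];
};

/*
 * Append a copy of the last extent op, then move the copy forward by
 * offset_inc bytes and shrink it by the same amount.
 * Returns the index of the new op, or -1 if it cannot be added.
 */
int dup_last_extent(struct request_model *req, uint64_t offset_inc)
{
    struct extent_op *prev, *op;

    if (req->num_ops == 0 || req->num_ops >= MAX_OPS)
        return -1;
    prev = &req->ops[req->num_ops - 1];
    if (offset_inc > prev->length)
        return -1;

    op = &req->ops[req->num_ops];
    *op = *prev;
    op->offset += offset_inc;
    op->length -= offset_inc;
    return (int)req->num_ops++;
}
```

Each call adds one more extent to the same request, which is how nonconsecutive dirty pages could share a single OSD request in this model.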