1. 29 8月, 2017 1 次提交
  2. 26 8月, 2017 1 次提交
    • J
      futex: Remove duplicated code and fix undefined behaviour · 30d6e0a4
      Jiri Slaby 提交于
      There is code duplicated over all architecture's headers for
      futex_atomic_op_inuser. Namely op decoding, access_ok check for uaddr,
      and comparison of the result.
      
      Remove this duplication and leave up to the arches only the needed
      assembly which is now in arch_futex_atomic_op_inuser.
      
      This effectively distributes the Will Deacon's arm64 fix for undefined
      behaviour reported by UBSAN to all architectures. The fix was done in
      commit 5f16a046 (arm64: futex: Fix undefined behaviour with
      FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.
      
      And as suggested by Thomas, check for negative oparg too, because it was
      also reported to cause undefined behaviour report.
      
      Note that s390 removed access_ok check in d12a2970 ("s390/uaccess:
      remove pointless access_ok() checks") as access_ok there returns true.
      We introduce it back to the helper for the sake of simplicity (it gets
      optimized away anyway).
      Signed-off-by: NJiri Slaby <jslaby@suse.cz>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390]
      Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
      Reviewed-by: NDarren Hart (VMware) <dvhart@infradead.org>
      Reviewed-by: Will Deacon <will.deacon@arm.com> [core/arm64]
      Cc: linux-mips@linux-mips.org
      Cc: Rich Felker <dalias@libc.org>
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: peterz@infradead.org
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: sparclinux@vger.kernel.org
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: linux-s390@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: linux-hexagon@vger.kernel.org
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: linux-snps-arc@lists.infradead.org
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-xtensa@linux-xtensa.org
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: openrisc@lists.librecores.org
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-parisc@vger.kernel.org
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: linux-alpha@vger.kernel.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Link: http://lkml.kernel.org/r/20170824073105.3901-1-jslaby@suse.cz
      30d6e0a4
  3. 25 8月, 2017 4 次提交
    • P
      locking/lockdep: Fix workqueue crossrelease annotation · e6f3faa7
      Peter Zijlstra 提交于
      The new completion/crossrelease annotations interact unfavourable with
      the extant flush_work()/flush_workqueue() annotations.
      
      The problem is that when a single work class does:
      
        wait_for_completion(&C)
      
      and
      
        complete(&C)
      
      in different executions, we'll build dependencies like:
      
        lock_map_acquire(W)
        complete_acquire(C)
      
      and
      
        lock_map_acquire(W)
        complete_release(C)
      
      which results in the dependency chain: W->C->W, which lockdep thinks
      spells deadlock, even though there is no deadlock potential since
      works are ran concurrently.
      
      One possibility would be to change the work 'lock' to recursive-read,
      but that would mean hitting a lockdep limitation on recursive locks.
      Also, unconditinoally switching to recursive-read here would fail to
      detect the actual deadlock on single-threaded workqueues, which do
      have a problem with this.
      
      For now, forcefully disregard these locks for crossrelease.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: boqun.feng@gmail.com
      Cc: byungchul.park@lge.com
      Cc: david@fromorbit.com
      Cc: johannes@sipsolutions.net
      Cc: oleg@redhat.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      e6f3faa7
    • P
      mm, locking/barriers: Clarify tlb_flush_pending() barriers · 0e709703
      Peter Zijlstra 提交于
      Better document the ordering around tlb_flush_pending().
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      0e709703
    • E
      pty: Repair TIOCGPTPEER · 311fc65c
      Eric W. Biederman 提交于
      The implementation of TIOCGPTPEER has two issues.
      
      When /dev/ptmx (as opposed to /dev/pts/ptmx) is opened the wrong
      vfsmount is passed to dentry_open.  Which results in the kernel displaying
      the wrong pathname for the peer.
      
      The second is simply by caching the vfsmount and dentry of the peer it leaves
      them open, in a way they were not previously Which because of the inreased
      reference counts can cause unnecessary behaviour differences resulting in
      regressions.
      
      To fix these move the ioctl into tty_io.c at a generic level allowing
      the ioctl to have access to the struct file on which the ioctl is
      being called.  This allows the path of the slave to be derived when
      opening the slave through TIOCGPTPEER instead of requiring the path to
      the slave be cached.  Thus removing the need for caching the path.
      
      A new function devpts_ptmx_path is factored out of devpts_acquire and
      used to implement a function devpts_mntget.   The new function devpts_mntget
      takes a filp to perform the lookup on and fsi so that it can confirm
      that the superblock that is found by devpts_ptmx_path is the proper superblock.
      
      v2: Lots of fixes to make the code actually work
      v3: Suggestions by Linus
          - Removed the unnecessary initialization of filp in ptm_open_peer
          - Simplified devpts_ptmx_path as gotos are no longer required
      
      [ This is the fix for the issue that was reverted in commit
        143c97cc, but this time without breaking 'pbuilder' due to
        increased reference counts   - Linus ]
      
      Fixes: 54ebbfb1 ("tty: add TIOCGPTPEER ioctl")
      Reported-by: NChristian Brauner <christian.brauner@canonical.com>
      Reported-and-tested-by: NStefan Lippers-Hollmann <s.l-h@gmx.de>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      311fc65c
    • N
      IB/core: Avoid accessing non-allocated memory when inferring port type · 498ca3c8
      Noa Osherovich 提交于
      Commit 44c58487 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
      introduced the concept of type in ah_attr:
       * During ib_register_device, each port is checked for its type which
         is stored in ib_device's port_immutable array.
       * During uverbs' modify_qp, the type is inferred using the port number
         in ib_uverbs_qp_dest struct (address vector) by accessing the
         relevant port_immutable array and the type is passed on to
         providers.
      
      IB spec (version 1.3) enforces a valid port value only in Reset to
      Init. During Init to RTR, the address vector must be valid but port
      number is not mentioned as a field in the address vector, so its
      value is not validated, which leads to accesses to a non-allocated
      memory when inferring the port type.
      
      Save the real port number in ib_qp during modify to Init (when the
      comp_mask indicates that the port number is valid) and use this value
      to infer the port type.
      
      Avoid copying the address vector fields if the matching bit is not set
      in the attr_mask. Address vector can't be modified before the port, so
      no valid flow is affected.
      
      Fixes: 44c58487 ('IB/core: Define 'ib' and 'roce' rdma_ah_attr types')
      Signed-off-by: NNoa Osherovich <noaos@mellanox.com>
      Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      498ca3c8
  4. 24 8月, 2017 1 次提交
    • L
      Revert "pty: fix the cached path of the pty slave file descriptor in the master" · 143c97cc
      Linus Torvalds 提交于
      This reverts commit c8c03f18.
      
      It turns out that while fixing the ptmx file descriptor to have the
      correct 'struct path' to the associated slave pty is a really good
      thing, it breaks some user space tools for a very annoying reason.
      
      The problem is that /dev/ptmx and its associated slave pty (/dev/pts/X)
      are on different mounts.  That was what caused us to have the wrong path
      in the first place (we would mix up the vfsmount of the 'ptmx' node,
      with the dentry of the pty slave node), but it also means that now while
      we use the right vfsmount, having the pty master open also keeps the pts
      mount busy.
      
      And it turn sout that that makes 'pbuilder' very unhappy, as noted by
      Stefan Lippers-Hollmann:
      
       "This patch introduces a regression for me when using pbuilder
        0.228.7[2] (a helper to build Debian packages in a chroot and to
        create and update its chroots) when trying to umount /dev/ptmx (inside
        the chroot) on Debian/ unstable (full log and pbuilder configuration
        file[3] attached).
      
        [...]
        Setting up build-essential (12.3) ...
        Processing triggers for libc-bin (2.24-15) ...
        I: unmounting dev/ptmx filesystem
        W: Could not unmount dev/ptmx: umount: /var/cache/pbuilder/build/1340/dev/ptmx: target is busy
                (In some cases useful info about processes that
                 use the device is found by lsof(8) or fuser(1).)"
      
      apparently pbuilder tries to unmount the /dev/pts filesystem while still
      holding at least one master node open, which is arguably not very nice,
      but we don't break user space even when fixing other bugs.
      
      So this commit has to be reverted.
      
      I'll try to figure out a way to avoid caching the path to the slave pty
      in the master pty.  The only thing that actually wants that slave pty
      path is the "TIOCGPTPEER" ioctl, and I think we could just recreate the
      path at that time.
      Reported-by: NStefan Lippers-Hollmann <s.l-h@gmx.de>
      Cc: Eric W Biederman <ebiederm@xmission.com>
      Cc: Christian Brauner <christian.brauner@canonical.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      143c97cc
  5. 22 8月, 2017 1 次提交
    • O
      pids: make task_tgid_nr_ns() safe · dd1c1f2f
      Oleg Nesterov 提交于
      This was reported many times, and this was even mentioned in commit
      52ee2dfd ("pids: refactor vnr/nr_ns helpers to make them safe") but
      somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
      not safe because task->group_leader points to nowhere after the exiting
      task passes exit_notify(), rcu_read_lock() can not help.
      
      We really need to change __unhash_process() to nullify group_leader,
      parent, and real_parent, but this needs some cleanups.  Until then we
      can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
      fix the problem.
      Reported-by: NTroy Kensinger <tkensinger@google.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd1c1f2f
  6. 21 8月, 2017 1 次提交
  7. 19 8月, 2017 5 次提交
    • M
      mm, oom: fix potential data corruption when oom_reaper races with writer · 6b31d595
      Michal Hocko 提交于
      Wenwei Tao has noticed that our current assumption that the oom victim
      is dying and never doing any visible changes after it dies, and so the
      oom_reaper can tear it down, is not entirely true.
      
      __task_will_free_mem consider a task dying when SIGNAL_GROUP_EXIT is set
      but do_group_exit sends SIGKILL to all threads _after_ the flag is set.
      So there is a race window when some threads won't have
      fatal_signal_pending while the oom_reaper could start unmapping the
      address space.  Moreover some paths might not check for fatal signals
      before each PF/g-u-p/copy_from_user.
      
      We already have a protection for oom_reaper vs.  PF races by checking
      MMF_UNSTABLE.  This has been, however, checked only for kernel threads
      (use_mm users) which can outlive the oom victim.  A simple fix would be
      to extend the current check in handle_mm_fault for all tasks but that
      wouldn't be sufficient because the current check assumes that a kernel
      thread would bail out after EFAULT from get_user*/copy_from_user and
      never re-read the same address which would succeed because the PF path
      has established page tables already.  This seems to be the case for the
      only existing use_mm user currently (virtio driver) but it is rather
      fragile in general.
      
      This is even more fragile in general for more complex paths such as
      generic_perform_write which can re-read the same address more times
      (e.g.  iov_iter_copy_from_user_atomic to fail and then
      iov_iter_fault_in_readable on retry).
      
      Therefore we have to implement MMF_UNSTABLE protection in a robust way
      and never make a potentially corrupted content visible.  That requires
      to hook deeper into the PF path and check for the flag _every time_
      before a pte for anonymous memory is established (that means all
      !VM_SHARED mappings).
      
      The corruption can be triggered artificially
      (http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp)
      but there doesn't seem to be any real life bug report.  The race window
      should be quite tight to trigger most of the time.
      
      Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NWenwei Tao <wenwei.tww@alibaba-inc.com>
      Tested-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b31d595
    • P
      mm: discard memblock data later · 3010f876
      Pavel Tatashin 提交于
      There is existing use after free bug when deferred struct pages are
      enabled:
      
      The memblock_add() allocates memory for the memory array if more than
      128 entries are needed.  See comment in e820__memblock_setup():
      
        * The bootstrap memblock region count maximum is 128 entries
        * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
        * than that - so allow memblock resizing.
      
      This memblock memory is freed here:
              free_low_memory_core_early()
      
      We access the freed memblock.memory later in boot when deferred pages
      are initialized in this path:
      
              deferred_init_memmap()
                      for_each_mem_pfn_range()
                        __next_mem_pfn_range()
                          type = &memblock.memory;
      
      One possible explanation for why this use-after-free hasn't been hit
      before is that the limit of INIT_MEMBLOCK_REGIONS has never been
      exceeded at least on systems where deferred struct pages were enabled.
      
      Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128,
      and verifying in qemu that this code is getting excuted and that the
      freed pages are sane.
      
      Link: http://lkml.kernel.org/r/1502485554-318703-2-git-send-email-pasha.tatashin@oracle.com
      Fixes: 7e18adb4 ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
      Signed-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: NBob Picco <bob.picco@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3010f876
    • L
      wait: add wait_event_killable_timeout() · 8ada9279
      Luis R. Rodriguez 提交于
      These are the few pending fixes I have queued up for v4.13-final.  One
      is a a generic regression fix for recursive loops on kmod and the other
      one is a trivial print out correction.
      
      During the v4.13 development we assumed that recursive kmod loops were
      no longer possible.  Clearly that is not true.  The regression fix makes
      use of a new killable wait.  We use a killable wait to be paranoid in
      how signals might be sent to modprobe and only accept a proper SIGKILL.
      The signal will only be available to userspace to issue *iff* a thread
      has already entered a wait state, and that happens only if we've already
      throttled after 50 kmod threads have been hit.
      
      Note that although it may seem excessive to trigger a failure afer 5
      seconds if all kmod thread remain busy, prior to the series of changes
      that went into v4.13 we would actually *always* fatally fail any request
      which came in if the limit was already reached.  The new waiting
      implemented in v4.13 actually gives us *more* breathing room -- the wait
      for 5 seconds is a wait for *any* kmod thread to finish.  We give up and
      fail *iff* no kmod thread has finished and they're *all* running
      straight for 5 consecutive seconds.  If 50 kmod threads are running
      consecutively for 5 seconds something else must be really bad.
      
      Recursive loops with kmod are bad but they're also hard to implement
      properly as a selftest without currently fooling current userspace tools
      like kmod [1].  For instance kmod will complain when you run depmod if
      it finds a recursive loop with symbol dependency between modules as such
      this type of recursive loop cannot go upstream as the modules_install
      target will fail after running depmod.
      
      These tests already exist on userspace kmod upstream though (refer to
      the testsuite/module-playground/mod-loop-*.c files).  The same is not
      true if request_module() is used though, or worst if aliases are used.
      
      Likewise the issue with 64-bit kernels booting 32-bit userspace without
      a binfmt handler built-in is also currently not detected and proactively
      avoided by userspace kmod tools, or kconfig for all architectures.
      Although we could complain in the kernel when some of these individual
      recursive issues creep up, proactively avoiding these situations in
      userspace at build time is what we should keep striving for.
      
      Lastly, since recursive loops could happen with kmod it may mean
      recursive loops may also be possible with other kernel usermode helpers,
      this should be investigated and long term if we can come up with a more
      sensible generic solution even better!
      
      [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20170809-kmod-for-v4.13-final
      [1] https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git
      
      This patch (of 3):
      
      This wait is similar to wait_event_interruptible_timeout() but only
      accepts SIGKILL interrupt signal.  Other signals are ignored.
      
      Link: http://lkml.kernel.org/r/20170809234635.13443-2-mcgrof@kernel.orgSigned-off-by: NLuis R. Rodriguez <mcgrof@kernel.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Cc: David Binderman <dcb314@hotmail.com>
      Cc: Matt Redfearn <matt.redfearn@imgetc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ada9279
    • J
      mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() · 739f79fc
      Johannes Weiner 提交于
      Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
      to update the memcg stats:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
          IP: test_clear_page_writeback+0x12e/0x2c0
          [...]
          RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
          Call Trace:
           <IRQ>
           end_page_writeback+0x47/0x70
           f2fs_write_end_io+0x76/0x180 [f2fs]
           bio_endio+0x9f/0x120
           blk_update_request+0xa8/0x2f0
           scsi_end_request+0x39/0x1d0
           scsi_io_completion+0x211/0x690
           scsi_finish_command+0xd9/0x120
           scsi_softirq_done+0x127/0x150
           __blk_mq_complete_request_remote+0x13/0x20
           flush_smp_call_function_queue+0x56/0x110
           generic_smp_call_function_single_interrupt+0x13/0x30
           smp_call_function_single_interrupt+0x27/0x40
           call_function_single_interrupt+0x89/0x90
          RIP: 0010:native_safe_halt+0x6/0x10
      
          (gdb) l *(test_clear_page_writeback+0x12e)
          0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
          614		mod_node_page_state(page_pgdat(page), idx, val);
          615		if (mem_cgroup_disabled() || !page->mem_cgroup)
          616			return;
          617		mod_memcg_state(page->mem_cgroup, idx, val);
          618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
          619		this_cpu_add(pn->lruvec_stat->count[idx], val);
          620	}
          621
          622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
          623							gfp_t gfp_mask,
      
      The issue is that writeback doesn't hold a page reference and the page
      might get freed after PG_writeback is cleared (and the mapping is
      unlocked) in test_clear_page_writeback().  The stat functions looking up
      the page's node or zone are safe, as those attributes are static across
      allocation and free cycles.  But page->mem_cgroup is not, and it will
      get cleared if we race with truncation or migration.
      
      It appears this race window has been around for a while, but less likely
      to trigger when the memcg stats were updated first thing after
      PG_writeback is cleared.  Recent changes reshuffled this code to update
      the global node stats before the memcg ones, though, stretching the race
      window out to an extent where people can reproduce the problem.
      
      Update test_clear_page_writeback() to look up and pin page->mem_cgroup
      before clearing PG_writeback, then not use that pointer afterward.  It
      is a partial revert of 62cccb8c ("mm: simplify lock_page_memcg()")
      but leaves the pageref-holding callsites that aren't affected alone.
      
      Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
      Fixes: 62cccb8c ("mm: simplify lock_page_memcg()")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Tested-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Reported-by: NBradley Bolen <bradleybolen@gmail.com>
      Tested-by: NBrad Bolen <bradleybolen@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.6+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      739f79fc
    • M
      datagram: When peeking datagrams with offset < 0 don't skip empty skbs · a0917e0b
      Matthew Dawson 提交于
      Due to commit e6afc8ac ("udp: remove
      headers from UDP packets before queueing"), when udp packets are being
      peeked the requested extra offset is always 0 as there is no need to skip
      the udp header.  However, when the offset is 0 and the next skb is
      of length 0, it is only returned once.  The behaviour can be seen with
      the following python script:
      
      from socket import *;
      f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
      g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
      f.bind(('::', 0));
      addr=('::1', f.getsockname()[1]);
      g.sendto(b'', addr)
      g.sendto(b'b', addr)
      print(f.recvfrom(10, MSG_PEEK));
      print(f.recvfrom(10, MSG_PEEK));
      
      Where the expected output should be the empty string twice.
      
      Instead, make sk_peek_offset return negative values, and pass those values
      to __skb_try_recv_datagram/__skb_try_recv_from_queue.  If the passed offset
      to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
      __skb_try_recv_from_queue will then ensure the offset is reset back to 0
      if a peek is requested without an offset, unless no packets are found.
      
      Also simplify the if condition in __skb_try_recv_from_queue.  If _off is
      greater then 0, and off is greater then or equal to skb->len, then
      (_off || skb->len) must always be true assuming skb->len >= 0 is always
      true.
      
      Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
      as it double checked if MSG_PEEK was set in the flags.
      
      V2:
       - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
      redundant checks
       - Fix peeking in udp{,v6}_recvmsg to report the right value when the
      offset is 0
      
      V3:
       - Marked new branch in __skb_try_recv_from_queue as unlikely.
      Signed-off-by: NMatthew Dawson <matthew@mjdsystems.ca>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a0917e0b
  8. 18 8月, 2017 2 次提交
    • T
      kernel/watchdog: Prevent false positives with turbo modes · 7edaeb68
      Thomas Gleixner 提交于
      The hardlockup detector on x86 uses a performance counter based on unhalted
      CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
      performance counter period, so the hrtimer should fire 2-3 times before the
      performance counter NMI fires. The NMI code checks whether the hrtimer
      fired since the last invocation. If not, it assumess a hard lockup.
      
      The calculation of those periods is based on the nominal CPU
      frequency. Turbo modes increase the CPU clock frequency and therefore
      shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
      nominal frequency) the perf/NMI period is shorter than the hrtimer period
      which leads to false positives.
      
      A simple fix would be to shorten the hrtimer period, but that comes with
      the side effect of more frequent hrtimer and softlockup thread wakeups,
      which is not desired.
      
      Implement a low pass filter, which checks the perf/NMI period against
      kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
      elapsed then the event is ignored and postponed to the next perf/NMI.
      
      That solves the problem and avoids the overhead of shorter hrtimer periods
      and more frequent softlockup thread wakeups.
      
      Fixes: 58687acb ("lockup_detector: Combine nmi_watchdog and softlockup detector")
      Reported-and-tested-by: NKan Liang <Kan.liang@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: dzickus@redhat.com
      Cc: prarit@redhat.com
      Cc: ak@linux.intel.com
      Cc: babu.moger@oracle.com
      Cc: peterz@infradead.org
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: stable@vger.kernel.org
      Cc: atomlin@redhat.com
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
      7edaeb68
    • L
      pty: fix the cached path of the pty slave file descriptor in the master · c8c03f18
      Linus Torvalds 提交于
      Christian Brauner reported that if you use the TIOCGPTPEER ioctl() to
      get a slave pty file descriptor, the resulting file descriptor doesn't
      look right in /proc/<pid>/fd/<fd>.  In particular, he wanted to use
      readlink() on /proc/self/fd/<fd> to get the pathname of the slave pty
      (basically implementing "ptsname{_r}()").
      
      The reason for that was that we had generated the wrong 'struct path'
      when we create the pty in ptmx_open().
      
      In particular, the dentry was correct, but the vfsmount pointed to the
      mount of the ptmx node. That _can_ be correct - in case you use
      "/dev/pts/ptmx" to open the master - but usually is not.  The normal
      case is to use /dev/ptmx, which then looks up the pts/ directory, and
      then the vfsmount of the ptmx node is obviously the /dev directory, not
      the /dev/pts/ directory.
      
      We actually did have the right vfsmount available, but in the wrong
      place (it gets looked up in 'devpts_acquire()' when we get a reference
      to the pts filesystem), and so ptmx_open() used the wrong mnt pointer.
      
      The end result of this confusion was that the pty worked fine, but when
      if you did TIOCGPTPEER to get the slave side of the pty, end end result
      would also work, but have that dodgy 'struct path'.
      
      And then when doing "d_path()" on to get the pathname, the vfsmount
      would not match the root of the pts directory, and d_path() would return
      an empty pathname thinking that the entry had escaped a bind mount into
      another mount.
      
      This fixes the problem by making devpts_acquire() return the vfsmount
      for the pts filesystem, allowing ptmx_open() to trivially just use the
      right mount for the pts dentry, and create the proper 'struct path'.
      Reported-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: NEric Biederman <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c8c03f18
  9. 17 8月, 2017 6 次提交
    • B
      locking/lockdep: Explicitly initialize wq_barrier::done::map · 52fa5bc5
      Boqun Feng 提交于
      With the new lockdep crossrelease feature, which checks completions usage,
      a false positive is reported in the workqueue code:
      
      > Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released
      > Task   B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released
      > Task   C : acquired of lock#3 -> wait for completion of barr->done
      > (Task C is in lru_add_drain_all_cpuslocked())
      > Worker D : wait for wfc.work to be released -> will complete barr->done
      
      Such a dead lock can not happen because Task C's barr->done and Worker D's
      barr->done can not be the same instance.
      
      The reason of this false positive is we initialize all wq_barrier::done
      at insert_wq_barrier() via init_completion(), which makes them belong to
      the same lock class, therefore, impossible circles are reported.
      
      To fix this, explicitly initialize the lockdep map for wq_barrier::done
      in insert_wq_barrier(), so that the lock class key of wq_barrier::done
      is a subkey of the corresponding work_struct, as a result we won't build
      a dependency between a wq_barrier with a unrelated work, and we can
      differ wq barriers based on the related works, so the false positive
      above is avoided.
      
      Also define the empty lockdep_init_map_crosslock() for !CROSSRELEASE
      to make the code simple and away from unnecessary #ifdefs.
      Reported-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NBoqun Feng <boqun.feng@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      52fa5bc5
    • B
      locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS · ea3f2c0f
      Byungchul Park 提交于
      'complete' is an adjective and LOCKDEP_COMPLETE sounds like 'lockdep is complete',
      so pick a better name that uses a noun.
      Signed-off-by: NByungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team@lge.com
      Link: http://lkml.kernel.org/r/1502960261-16206-3-git-send-email-byungchul.park@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ea3f2c0f
    • K
      locking/refcounts, x86/asm: Implement fast refcount overflow protection · 7a46ec0e
      Kees Cook 提交于
      This implements refcount_t overflow protection on x86 without a noticeable
      performance impact, though without the fuller checking of REFCOUNT_FULL.
      
      This is done by duplicating the existing atomic_t refcount implementation
      but with normally a single instruction added to detect if the refcount
      has gone negative (e.g. wrapped past INT_MAX or below zero). When detected,
      the handler saturates the refcount_t to INT_MIN / 2. With this overflow
      protection, the erroneous reference release that would follow a wrap back
      to zero is blocked from happening, avoiding the class of refcount-overflow
      use-after-free vulnerabilities entirely.
      
      Only the overflow case of refcounting can be perfectly protected, since
      it can be detected and stopped before the reference is freed and left to
      be abused by an attacker. There isn't a way to block early decrements,
      and while REFCOUNT_FULL stops increment-from-zero cases (which would
      be the state _after_ an early decrement and stops potential double-free
      conditions), this fast implementation does not, since it would require
      the more expensive cmpxchg loops. Since the overflow case is much more
      common (e.g. missing a "put" during an error path), this protection
      provides real-world protection. For example, the two public refcount
      overflow use-after-free exploits published in 2016 would have been
      rendered unexploitable:
      
        http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/
      
        http://cyseclabs.com/page?n=02012016
      
      This implementation does, however, notice an unchecked decrement to zero
      (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
      resulted in a zero). Decrements under zero are noticed (since they will
      have resulted in a negative value), though this only indicates that a
      use-after-free may have already happened. Such notifications are likely
      avoidable by an attacker that has already exploited a use-after-free
      vulnerability, but it's better to have them reported than allow such
      conditions to remain universally silent.
      
      On first overflow detection, the refcount value is reset to INT_MIN / 2
      (which serves as a saturation value) and a report and stack trace are
      produced. When operations detect only negative value results (such as
      changing an already saturated value), saturation still happens but no
      notification is performed (since the value was already saturated).
      
      On the matter of races, since the entire range beyond INT_MAX but before
      0 is negative, every operation at INT_MIN / 2 will trap, leaving no
      overflow-only race condition.
      
      As for performance, this implementation adds a single "js" instruction
      to the regular execution flow of a copy of the standard atomic_t refcount
      operations. (The non-"and_test" refcount_dec() function, which is uncommon
      in regular refcount design patterns, has an additional "jz" instruction
      to detect reaching exactly zero.) Since this is a forward jump, it is by
      default the non-predicted path, which will be reinforced by dynamic branch
      prediction. The result is this protection having virtually no measurable
      change in performance over standard atomic_t operations. The error path,
      located in .text.unlikely, saves the refcount location and then uses UD0
      to fire a refcount exception handler, which resets the refcount, handles
      reporting, and returns to regular execution. This keeps the changes to
      .text size minimal, avoiding return jumps and open-coded calls to the
      error reporting routine.
      
      Example assembly comparison:
      
      refcount_inc() before:
      
        .text:
        ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)
      
      refcount_inc() after:
      
        .text:
        ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)
        ffffffff8154614d:       0f 88 80 d5 17 00       js     ffffffff816c36d3
        ...
        .text.unlikely:
        ffffffff816c36d3:       48 8d 4d f4             lea    -0xc(%rbp),%rcx
        ffffffff816c36d7:       0f ff                   (bad)
      
      These are the cycle counts comparing a loop of refcount_inc() from 1
      to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
      unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
      (refcount_t-full), and this overflow-protected refcount (refcount_t-fast):
      
        2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
      		    cycles		protections
        atomic_t           82249267387	none
        refcount_t-fast    82211446892	overflow, untested dec-to-zero
        refcount_t-full   144814735193	overflow, untested dec-to-zero, inc-from-zero
      
      This code is a modified version of the x86 PAX_REFCOUNT atomic_t
      overflow defense from the last public patch of PaX/grsecurity, based
      on my understanding of the code. Changes or omissions from the original
      code are mine and don't reflect the original grsecurity/PaX code. Thanks
      to PaX Team for various suggestions for improvement for repurposing this
      code to be a refcount-only protection.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Hans Liljestrand <ishkamiel@gmail.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Serge E. Hallyn <serge@hallyn.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arozansk@redhat.com
      Cc: axboe@kernel.dk
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-arch <linux-arch@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20170815161924.GA133115@beastSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7a46ec0e
    • D
      scsi: sd_zbc: Write unlock zone from sd_uninit_cmnd() · 70e42fd0
      Damien Le Moal 提交于
      Releasing a zone write lock only when the write commnand that acquired
      the lock completes can cause deadlocks due to potential command
      reordering if the lock owning request is requeued and not executed. This
      problem exists only with the scsi-mq path as, unlike the legacy path,
      requests are moved out of the dispatch queue before being prepared and
      so before locking a zone for a write command.
      
      Since sd_uninit_cmnd() is now always called when a request is requeued,
      call sd_zbc_write_unlock_zone() from that function for write requests
      that acquired a zone lock instead of from sd_done(). Acquisition of a zone
      lock by a write command is indicated using the new command
      flag SCMD_ZONE_WRITE_LOCK.
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      70e42fd0
    • E
      ipv4: better IP_MAX_MTU enforcement · c780a049
      Eric Dumazet 提交于
      While working on yet another syzkaller report, I found
      that our IP_MAX_MTU enforcements were not properly done.
      
      gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
      final result can be bigger than IP_MAX_MTU :/
      
      This is a problem because device mtu can be changed on other cpus or
      threads.
      
      While this patch does not fix the issue I am working on, it is
      probably worth addressing it.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c780a049
    • E
      ptr_ring: use kmalloc_array() · 81fbfe8a
      Eric Dumazet 提交于
      As found by syzkaller, malicious users can set whatever tx_queue_len
      on a tun device and eventually crash the kernel.
      
      Lets remove the ALIGN(XXX, SMP_CACHE_BYTES) thing since a small
      ring buffer is not fast anyway.
      
      Fixes: 2e0ab8ca ("ptr_ring: array based FIFO for pointers")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      81fbfe8a
  10. 16 8月, 2017 2 次提交
    • T
      b3dc8f77
    • E
      ipv6: fix NULL dereference in ip6_route_dev_notify() · 12d94a80
      Eric Dumazet 提交于
      Based on a syzkaller report [1], I found that a per cpu allocation
      failure in snmp6_alloc_dev() would then lead to NULL dereference in
      ip6_route_dev_notify().
      
      It seems this is a very old bug, thus no Fixes tag in this submission.
      
      Let's add in6_dev_put_clear() helper, as we will probably use
      it elsewhere (once available/present in net-next)
      
      [1]
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 1 PID: 17294 Comm: syz-executor6 Not tainted 4.13.0-rc2+ #10
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      task: ffff88019f456680 task.stack: ffff8801c6e58000
      RIP: 0010:__read_once_size include/linux/compiler.h:250 [inline]
      RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline]
      RIP: 0010:refcount_sub_and_test+0x7d/0x1b0 lib/refcount.c:178
      RSP: 0018:ffff8801c6e5f1b0 EFLAGS: 00010202
      RAX: 0000000000000037 RBX: dffffc0000000000 RCX: ffffc90005d25000
      RDX: ffff8801c6e5f218 RSI: ffffffff82342bbf RDI: 0000000000000001
      RBP: ffff8801c6e5f240 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff10038dcbe37
      R13: 0000000000000006 R14: 0000000000000001 R15: 00000000000001b8
      FS:  00007f21e0429700(0000) GS:ffff8801dc100000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001ddbc22000 CR3: 00000001d632b000 CR4: 00000000001426e0
      DR0: 0000000020000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
      Call Trace:
       refcount_dec_and_test+0x1a/0x20 lib/refcount.c:211
       in6_dev_put include/net/addrconf.h:335 [inline]
       ip6_route_dev_notify+0x1c9/0x4a0 net/ipv6/route.c:3732
       notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93
       __raw_notifier_call_chain kernel/notifier.c:394 [inline]
       raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
       call_netdevice_notifiers_info+0x51/0x90 net/core/dev.c:1678
       call_netdevice_notifiers net/core/dev.c:1694 [inline]
       rollback_registered_many+0x91c/0xe80 net/core/dev.c:7107
       rollback_registered+0x1be/0x3c0 net/core/dev.c:7149
       register_netdevice+0xbcd/0xee0 net/core/dev.c:7587
       register_netdev+0x1a/0x30 net/core/dev.c:7669
       loopback_net_init+0x76/0x160 drivers/net/loopback.c:214
       ops_init+0x10a/0x570 net/core/net_namespace.c:118
       setup_net+0x313/0x710 net/core/net_namespace.c:294
       copy_net_ns+0x27c/0x580 net/core/net_namespace.c:418
       create_new_namespaces+0x425/0x880 kernel/nsproxy.c:107
       unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:206
       SYSC_unshare kernel/fork.c:2347 [inline]
       SyS_unshare+0x653/0xfa0 kernel/fork.c:2297
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x4512c9
      RSP: 002b:00007f21e0428c08 EFLAGS: 00000216 ORIG_RAX: 0000000000000110
      RAX: ffffffffffffffda RBX: 0000000000718150 RCX: 00000000004512c9
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000062020200
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000216 R12: 00000000004b973d
      R13: 00000000ffffffff R14: 000000002001d000 R15: 00000000000002dd
      Code: 50 2b 34 82 c7 00 f1 f1 f1 f1 c7 40 04 04 f2 f2 f2 c7 40 08 f3 f3
      f3 f3 e8 a1 43 39 ff 4c 89 f8 48 8b 95 70 ff ff ff 48 c1 e8 03 <0f> b6
      0c 18 4c 89 f8 83 e0 07 83 c0 03 38 c8 7c 08 84 c9 0f 85
      RIP: __read_once_size include/linux/compiler.h:250 [inline] RSP:
      ffff8801c6e5f1b0
      RIP: atomic_read arch/x86/include/asm/atomic.h:26 [inline] RSP:
      ffff8801c6e5f1b0
      RIP: refcount_sub_and_test+0x7d/0x1b0 lib/refcount.c:178 RSP:
      ffff8801c6e5f1b0
      ---[ end trace e441d046c6410d31 ]---
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      12d94a80
  11. 15 8月, 2017 2 次提交
  12. 12 8月, 2017 3 次提交
    • E
      udp: harden copy_linear_skb() · fd851ba9
      Eric Dumazet 提交于
      syzkaller got crashes with CONFIG_HARDENED_USERCOPY=y configs.
      
      Issue here is that recvfrom() can be used with user buffer of Z bytes,
      and SO_PEEK_OFF of X bytes, from a skb with Y bytes, and following
      condition :
      
      Z < X < Y
      
      kernel BUG at mm/usercopy.c:72!
      invalid opcode: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 2917 Comm: syzkaller842281 Not tainted 4.13.0-rc3+ #16
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      task: ffff8801d2fa40c0 task.stack: ffff8801d1fe8000
      RIP: 0010:report_usercopy mm/usercopy.c:64 [inline]
      RIP: 0010:__check_object_size+0x3ad/0x500 mm/usercopy.c:264
      RSP: 0018:ffff8801d1fef8a8 EFLAGS: 00010286
      RAX: 0000000000000078 RBX: ffffffff847102c0 RCX: 0000000000000000
      RDX: 0000000000000078 RSI: 1ffff1003a3fded5 RDI: ffffed003a3fdf09
      RBP: ffff8801d1fef998 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d1ea480e
      R13: fffffffffffffffa R14: ffffffff84710280 R15: dffffc0000000000
      FS:  0000000001360880(0000) GS:ffff8801dc000000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000202ecfe4 CR3: 00000001d1ff8000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       check_object_size include/linux/thread_info.h:108 [inline]
       check_copy_size include/linux/thread_info.h:139 [inline]
       copy_to_iter include/linux/uio.h:105 [inline]
       copy_linear_skb include/net/udp.h:371 [inline]
       udpv6_recvmsg+0x1040/0x1af0 net/ipv6/udp.c:395
       inet_recvmsg+0x14c/0x5f0 net/ipv4/af_inet.c:793
       sock_recvmsg_nosec net/socket.c:792 [inline]
       sock_recvmsg+0xc9/0x110 net/socket.c:799
       SYSC_recvfrom+0x2d6/0x570 net/socket.c:1788
       SyS_recvfrom+0x40/0x50 net/socket.c:1760
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Fixes: b65ac446 ("udp: try to avoid 2 cache miss on dequeue")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd851ba9
    • D
      net: fix compilation when busy poll is not enabled · e4dde412
      Daniel Borkmann 提交于
      MIN_NAPI_ID is used in various places outside of
      CONFIG_NET_RX_BUSY_POLL wrapping, so when it's not set
      we run into build errors such as:
      
        net/core/dev.c: In function 'dev_get_by_napi_id':
        net/core/dev.c:886:16: error: ‘MIN_NAPI_ID’ undeclared (first use in this function)
          if (napi_id < MIN_NAPI_ID)
                        ^~~~~~~~~~~
      
      Thus, have MIN_NAPI_ID always defined to fix these errors.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4dde412
    • A
      bonding: require speed/duplex only for 802.3ad, alb and tlb · ad729bc9
      Andreas Born 提交于
      The patch c4adfc82 ("bonding: make speed, duplex setting consistent
      with link state") puts the link state to down if
      bond_update_speed_duplex() cannot retrieve speed and duplex settings.
      Assumably the patch was written with 802.3ad mode in mind which relies
      on link speed/duplex settings. For other modes like active-backup these
      settings are not required. Thus, only for these other modes, this patch
      reintroduces support for slaves that do not support reporting speed or
      duplex such as wireless devices. This fixes the regression reported in
      bug 196547 (https://bugzilla.kernel.org/show_bug.cgi?id=196547).
      
      Fixes: c4adfc82 ("bonding: make speed, duplex setting consistent
      with link state")
      Signed-off-by: NAndreas Born <futur.andy@googlemail.com>
      Acked-by: NMahesh Bandewar <maheshb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad729bc9
  13. 11 8月, 2017 5 次提交
    • M
      mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem · 99baac21
      Minchan Kim 提交于
      Nadav reported parallel MADV_DONTNEED on same range has a stale TLB
      problem and Mel fixed it[1] and found same problem on MADV_FREE[2].
      
      Quote from Mel Gorman:
       "The race in question is CPU 0 running madv_free and updating some PTEs
        while CPU 1 is also running madv_free and looking at the same PTEs.
        CPU 1 may have writable TLB entries for a page but fail the pte_dirty
        check (because CPU 0 has updated it already) and potentially fail to
        flush.
      
        Hence, when madv_free on CPU 1 returns, there are still potentially
        writable TLB entries and the underlying PTE is still present so that a
        subsequent write does not necessarily propagate the dirty bit to the
        underlying PTE any more. Reclaim at some unknown time at the future
        may then see that the PTE is still clean and discard the page even
        though a write has happened in the meantime. I think this is possible
        but I could have missed some protection in madv_free that prevents it
        happening."
      
      This patch aims for solving both problems all at once and is ready for
      other problem with KSM, MADV_FREE and soft-dirty story[3].
      
      TLB batch API(tlb_[gather|finish]_mmu] uses [inc|dec]_tlb_flush_pending
      and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can
      catch there are parallel threads going on.  In that case, forcefully,
      flush TLB to prevent for user to access memory via stale TLB entry
      although it fail to gather page table entry.
      
      I confirmed this patch works with [4] test program Nadav gave so this
      patch supersedes "mm: Always flush VMA ranges affected by zap_page_range
      v2" in current mmotm.
      
      NOTE:
      
      This patch modifies arch-specific TLB gathering interface(x86, ia64,
      s390, sh, um).  It seems most of architecture are straightforward but
      s390 need to be careful because tlb_flush_mmu works only if
      mm->context.flush_mm is set to non-zero which happens only a pte entry
      really is cleared by ptep_get_and_clear and friends.  However, this
      problem never changes the pte entries but need to flush to prevent
      memory access from stale tlb.
      
      [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net
      [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de
      [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      [4] https://patchwork.kernel.org/patch/9861621/
      
      [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu]
        Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox
      Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.comSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NNadav Amit <namit@vmware.com>
      Reported-by: NNadav Amit <namit@vmware.com>
      Reported-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99baac21
    • M
      mm: make tlb_flush_pending global · 0a2dd266
      Minchan Kim 提交于
      Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
      COMPACTION] but upcoming patches to solve subtle TLB flush batching
      problem will use it regardless of compaction/NUMA so this patch doesn't
      remove the dependency.
      
      [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement]
      Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.comSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NNadav Amit <namit@vmware.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a2dd266
    • M
      mm: refactor TLB gathering API · 56236a59
      Minchan Kim 提交于
      This patch is a preparatory patch for solving race problems caused by
      TLB batch.  For that, we will increase/decrease TLB flush pending count
      of mm_struct whenever tlb_[gather|finish]_mmu is called.
      
      Before making it simple, this patch separates architecture specific part
      and rename it to arch_tlb_[gather|finish]_mmu and generic part just
      calls it.
      
      It shouldn't change any behavior.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-5-namit@vmware.comSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NNadav Amit <namit@vmware.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56236a59
    • N
      mm: migrate: fix barriers around tlb_flush_pending · 0a2c4048
      Nadav Amit 提交于
      Reading tlb_flush_pending while the page-table lock is taken does not
      require a barrier, since the lock/unlock already acts as a barrier.
      Removing the barrier in mm_tlb_flush_pending() to address this issue.
      
      However, migrate_misplaced_transhuge_page() calls mm_tlb_flush_pending()
      while the page-table lock is already released, which may present a
      problem on architectures with weak memory model (PPC).  To deal with
      this case, a new parameter is added to mm_tlb_flush_pending() to
      indicate if it is read without the page-table lock taken, and calling
      smp_mb__after_unlock_lock() in this case.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-3-namit@vmware.comSigned-off-by: NNadav Amit <namit@vmware.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a2c4048
    • N
      mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Nadav Amit 提交于
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that Linux TLB batching mechanism suffers from various
      races.  Races that are caused due to batching during reclamation were
      recently handled by Mel and this patch-set deals with others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
      An actual data corruption was not observed, yet the race was was
      confirmed by adding assertion to check tlb_flush_pending is not set by
      two threads, adding artificial latency in change_protection_range() and
      using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and
      change_protection_range")
      Signed-off-by: NNadav Amit <namit@vmware.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16af97dc
  14. 10 8月, 2017 6 次提交
    • B
      locking/lockdep: Apply crossrelease to completions · cd8084f9
      Byungchul Park 提交于
      Although wait_for_completion() and its family can cause deadlock, the
      lock correctness validator could not be applied to them until now,
      because things like complete() are usually called in a different context
      from the waiting context, which violates lockdep's assumption.
      
      Thanks to CONFIG_LOCKDEP_CROSSRELEASE, we can now apply the lockdep
      detector to those completion operations. Applied it.
      Signed-off-by: NByungchul Park <byungchul.park@lge.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-10-git-send-email-byungchul.park@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cd8084f9
    • B
      locking/lockdep: Handle non(or multi)-acquisition of a crosslock · 28a903f6
      Byungchul Park 提交于
      No acquisition might be in progress on commit of a crosslock. Completion
      operations enabling crossrelease are the case like:
      
         CONTEXT X                         CONTEXT Y
         ---------                         ---------
         trigger completion context
                                           complete AX
                                              commit AX
         wait_for_complete AX
            acquire AX
            wait
      
         where AX is a crosslock.
      
      When no acquisition is in progress, we should not perform commit because
      the lock does not exist, which might cause incorrect memory access. So
      we have to track the number of acquisitions of a crosslock to handle it.
      
      Moreover, in case that more than one acquisition of a crosslock are
      overlapped like:
      
         CONTEXT W        CONTEXT X        CONTEXT Y        CONTEXT Z
         ---------        ---------        ---------        ---------
         acquire AX (gen_id: 1)
                                           acquire A
                          acquire AX (gen_id: 10)
                                           acquire B
                                           commit AX
                                                            acquire C
                                                            commit AX
      
         where A, B and C are typical locks and AX is a crosslock.
      
      Current crossrelease code performs commits in Y and Z with gen_id = 10.
      However, we can use gen_id = 1 to do it, since not only 'acquire AX in X'
      but 'acquire AX in W' also depends on each acquisition in Y and Z until
      their commits. So make it use gen_id = 1 instead of 10 on their commits,
      which adds an additional dependency 'AX -> A' in the example above.
      Signed-off-by: NByungchul Park <byungchul.park@lge.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-8-git-send-email-byungchul.park@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      28a903f6
    • B
      locking/lockdep: Detect and handle hist_lock ring buffer overwrite · 23f873d8
      Byungchul Park 提交于
      The ring buffer can be overwritten by hardirq/softirq/work contexts.
      That cases must be considered on rollback or commit. For example,
      
                |<------ hist_lock ring buffer size ----->|
                ppppppppppppiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
      wrapped > iiiiiiiiiiiiiiiiiiiiiii....................
      
                where 'p' represents an acquisition in process context,
                'i' represents an acquisition in irq context.
      
      On irq exit, crossrelease tries to rollback idx to original position,
      but it should not because the entry already has been invalid by
      overwriting 'i'. Avoid rollback or commit for entries overwritten.
      Signed-off-by: NByungchul Park <byungchul.park@lge.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-7-git-send-email-byungchul.park@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      23f873d8
    • B
      locking/lockdep: Implement the 'crossrelease' feature · b09be676
      Byungchul Park 提交于
      Lockdep is a runtime locking correctness validator that detects and
      reports a deadlock or its possibility by checking dependencies between
      locks. It's useful since it does not report just an actual deadlock but
      also the possibility of a deadlock that has not actually happened yet.
      That enables problems to be fixed before they affect real systems.
      
      However, this facility is only applicable to typical locks, such as
      spinlocks and mutexes, which are normally released within the context in
      which they were acquired. However, synchronization primitives like page
      locks or completions, which are allowed to be released in any context,
      also create dependencies and can cause a deadlock.
      
      So lockdep should track these locks to do a better job. The 'crossrelease'
      implementation makes these primitives also be tracked.
      Signed-off-by: NByungchul Park <byungchul.park@lge.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-6-git-send-email-byungchul.park@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b09be676
    • P
      locking/lockdep: Rework FS_RECLAIM annotation · d92a8cfc
      Peter Zijlstra 提交于
      A while ago someone, and I cannot find the email just now, asked if we
      could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
      like we use for other things like workqueues etc. I think this should
      be possible which allows reducing the 'irq' states and will reduce the
      amount of __bfs() lookups we do.
      
      Removing the 1 IRQ state results in 4 less __bfs() walks per
      dependency, improving lockdep performance. And by moving this
      annotation out of the lockdep code it becomes easier for the mm people
      to extend.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: iamjoonsoo.kim@lge.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      d92a8cfc
    • P
      locking: Remove smp_mb__before_spinlock() · a9668cd6
      Peter Zijlstra 提交于
      Now that there are no users of smp_mb__before_spinlock() left, remove
      it entirely.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      a9668cd6