1. 29 August 2017 (3 commits)
  2. 28 August 2017 (1 commit)
    • Clarify (and fix) MAX_LFS_FILESIZE macros · 0cc3b0ec
      Committed by Linus Torvalds
      We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
      filesystems (and other IO targets) that know they are 64-bit clean and
      don't have any 32-bit limits in their IO path.
      
      It turns out that our 32-bit value for that limit was bogus.  On 32-bit,
      the VM layer is limited by the page cache to only 32-bit index values,
      but our logic for that was confusing and actually wrong.  We used to
      define that value to
      
      	(((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
      
      which is actually odd in several ways: it limits the index to 31 bits,
      and then it limits files so that they can't have data in that last byte
      of a page that has the highest 31-bit index (ie page index 0x7fffffff).
      
      Neither of those limitations make sense.  The index is actually the full
      32 bit unsigned value, and we can use that whole full page.  So the
      maximum size of the file would logically be "PAGE_SIZE << BITS_PER_LONG".
      
However, we do want to avoid the maximum index, because we have code
      that iterates over the page indexes, and we don't want that code to
      overflow.  So the maximum size of a file on a 32-bit host should
      actually be one page less than the full 32-bit index.
      
      So the actual limit is ULONG_MAX << PAGE_SHIFT.  That means that we will
      not actually be using the page of that last index (ULONG_MAX), but we
      can grow a file up to that limit.
      
      The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
      Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
      volume.  It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
      byte less), but the actual true VM limit is one page less than 16TiB.
      
      This was invisible until commit c2a9737f ("vfs,mm: fix a dead loop
      in truncate_inode_pages_range()"), which started applying that
      MAX_LFS_FILESIZE limit to block devices too.
      
      NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
      actually just the offset type itself (loff_t), which is signed.  But for
      clarity, on 64-bit, just use the maximum signed value, and don't make
      people have to count the number of 'f' characters in the hex constant.
      
      So just use LLONG_MAX for the 64-bit case.  That was what the value had
      been before too, just written out as a hex constant.
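
Putting the pieces above together, the corrected definitions look roughly like
this (a sketch based on the values stated in this message, not a verbatim copy
of the patch):

	#if BITS_PER_LONG == 32
	/* one page less than the full 32-bit index, as explained above */
	#define MAX_LFS_FILESIZE	((loff_t)ULONG_MAX << PAGE_SHIFT)
	#elif BITS_PER_LONG == 64
	/* on 64-bit the limit is simply the maximum signed offset */
	#define MAX_LFS_FILESIZE	((loff_t)LLONG_MAX)
	#endif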
      
      Fixes: c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
Reported-and-tested-by: Doug Nazar <nazard@nazar.ca>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cc3b0ec
  3. 25 August 2017 (1 commit)
    • pty: Repair TIOCGPTPEER · 311fc65c
      Committed by Eric W. Biederman
      The implementation of TIOCGPTPEER has two issues.
      
When /dev/ptmx (as opposed to /dev/pts/ptmx) is opened, the wrong
vfsmount is passed to dentry_open, which results in the kernel displaying
the wrong pathname for the peer.
      
The second is that simply by caching the vfsmount and dentry of the peer it
leaves them open in a way they were not previously, which, because of the
increased reference counts, can cause unnecessary behaviour differences
resulting in regressions.
      
      To fix these move the ioctl into tty_io.c at a generic level allowing
      the ioctl to have access to the struct file on which the ioctl is
      being called.  This allows the path of the slave to be derived when
      opening the slave through TIOCGPTPEER instead of requiring the path to
      the slave be cached.  Thus removing the need for caching the path.
      
      A new function devpts_ptmx_path is factored out of devpts_acquire and
      used to implement a function devpts_mntget.   The new function devpts_mntget
      takes a filp to perform the lookup on and fsi so that it can confirm
      that the superblock that is found by devpts_ptmx_path is the proper superblock.
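
As described, the new helper's shape is roughly the following (a sketch
inferred from this message; treat the parameter types as assumptions):

	/* Look up the devpts mount reachable from @filp and verify that it
	 * belongs to the superblock identified by @fsi before returning it. */
	struct vfsmount *devpts_mntget(struct file *filp, struct pts_fs_info *fsi);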
      
      v2: Lots of fixes to make the code actually work
      v3: Suggestions by Linus
          - Removed the unnecessary initialization of filp in ptm_open_peer
          - Simplified devpts_ptmx_path as gotos are no longer required
      
      [ This is the fix for the issue that was reverted in commit
        143c97cc, but this time without breaking 'pbuilder' due to
        increased reference counts   - Linus ]
      
      Fixes: 54ebbfb1 ("tty: add TIOCGPTPEER ioctl")
Reported-by: Christian Brauner <christian.brauner@canonical.com>
Reported-and-tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      311fc65c
  4. 24 August 2017 (2 commits)
    • bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer · 50b4d485
      Committed by Benjamin Block
Since we split the scsi_request out of struct request, bsg fails to
provide a reply-buffer for the drivers. This was done via the pointer
for sense-data, which is not preallocated anymore.
      
      Failing to allocate/assign it results in illegal dereferences because
      LLDs use this pointer unquestioned.
      
      An example panic on s390x, using the zFCP driver, looks like this (I had
      debugging on, otherwise NULL-pointer dereferences wouldn't even panic on
      s390x):
      
      Unable to handle kernel pointer dereference in virtual kernel address space
      Failing address: 6b6b6b6b6b6b6000 TEID: 6b6b6b6b6b6b6403
      Fault in home space mode while using kernel ASCE.
      AS:0000000001590007 R3:0000000000000024
      Oops: 0038 ilc:2 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Modules linked in: <Long List>
      CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.12.0-bsg-regression+ #3
      Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
      task: 0000000065cb0100 task.stack: 0000000065cb4000
      Krnl PSW : 0704e00180000000 000003ff801e4156 (zfcp_fc_ct_els_job_handler+0x16/0x58 [zfcp])
                 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
      Krnl GPRS: 0000000000000001 000000005fa9d0d0 000000005fa9d078 0000000000e16866
                 000003ff00000290 6b6b6b6b6b6b6b6b 0000000059f78f00 000000000000000f
                 00000000593a0958 00000000593a0958 0000000060d88800 000000005ddd4c38
                 0000000058b50100 07000000659cba08 000003ff801e8556 00000000659cb9a8
      Krnl Code: 000003ff801e4146: e31020500004        lg      %r1,80(%r2)
                 000003ff801e414c: 58402040           l       %r4,64(%r2)
                #000003ff801e4150: e35020200004       lg      %r5,32(%r2)
                >000003ff801e4156: 50405004           st      %r4,4(%r5)
                 000003ff801e415a: e54c50080000       mvhi    8(%r5),0
                 000003ff801e4160: e33010280012       lt      %r3,40(%r1)
                 000003ff801e4166: a718fffb           lhi     %r1,-5
                 000003ff801e416a: 1803               lr      %r0,%r3
      Call Trace:
      ([<000003ff801e8556>] zfcp_fsf_req_complete+0x726/0x768 [zfcp])
       [<000003ff801ea82a>] zfcp_fsf_reqid_check+0x102/0x180 [zfcp]
       [<000003ff801eb980>] zfcp_qdio_int_resp+0x230/0x278 [zfcp]
       [<00000000009b91b6>] qdio_kick_handler+0x2ae/0x2c8
       [<00000000009b9e3e>] __tiqdio_inbound_processing+0x406/0xc10
       [<00000000001684c2>] tasklet_action+0x15a/0x1d8
       [<0000000000bd28ec>] __do_softirq+0x3ec/0x848
       [<00000000001675a4>] irq_exit+0x74/0xf8
       [<000000000010dd6a>] do_IRQ+0xba/0xf0
       [<0000000000bd19e8>] io_int_handler+0x104/0x2d4
       [<00000000001033b6>] enabled_wait+0xb6/0x188
      ([<000000000010339e>] enabled_wait+0x9e/0x188)
       [<000000000010396a>] arch_cpu_idle+0x32/0x50
       [<0000000000bd0112>] default_idle_call+0x52/0x68
       [<00000000001cd0fa>] do_idle+0x102/0x188
       [<00000000001cd41e>] cpu_startup_entry+0x3e/0x48
       [<0000000000118c64>] smp_start_secondary+0x11c/0x130
       [<0000000000bd2016>] restart_int_handler+0x62/0x78
       [<0000000000000000>]           (null)
      INFO: lockdep is turned off.
      Last Breaking-Event-Address:
       [<000003ff801e41d6>] zfcp_fc_ct_job_handler+0x3e/0x48 [zfcp]
      
      Kernel panic - not syncing: Fatal exception in interrupt
      
      This patch moves bsg-lib to allocate and setup struct bsg_job ahead of
      time, including the allocation of a buffer for the reply-data.
      
      This means, struct bsg_job is not allocated separately anymore, but as part
      of struct request allocation - similar to struct scsi_cmd. Reflect this in
      the function names that used to handle creation/destruction of struct
      bsg_job.
Reported-by: Steffen Maier <maier@linux.vnet.ibm.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
      Fixes: 82ed4db4 ("block: split scsi_request out of struct request")
      Cc: <stable@vger.kernel.org> #4.11+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      50b4d485
    • Revert "pty: fix the cached path of the pty slave file descriptor in the master" · 143c97cc
      Committed by Linus Torvalds
      This reverts commit c8c03f18.
      
      It turns out that while fixing the ptmx file descriptor to have the
      correct 'struct path' to the associated slave pty is a really good
      thing, it breaks some user space tools for a very annoying reason.
      
      The problem is that /dev/ptmx and its associated slave pty (/dev/pts/X)
      are on different mounts.  That was what caused us to have the wrong path
      in the first place (we would mix up the vfsmount of the 'ptmx' node,
      with the dentry of the pty slave node), but it also means that now while
      we use the right vfsmount, having the pty master open also keeps the pts
      mount busy.
      
And it turns out that that makes 'pbuilder' very unhappy, as noted by
      Stefan Lippers-Hollmann:
      
       "This patch introduces a regression for me when using pbuilder
        0.228.7[2] (a helper to build Debian packages in a chroot and to
        create and update its chroots) when trying to umount /dev/ptmx (inside
        the chroot) on Debian/ unstable (full log and pbuilder configuration
        file[3] attached).
      
        [...]
        Setting up build-essential (12.3) ...
        Processing triggers for libc-bin (2.24-15) ...
        I: unmounting dev/ptmx filesystem
        W: Could not unmount dev/ptmx: umount: /var/cache/pbuilder/build/1340/dev/ptmx: target is busy
                (In some cases useful info about processes that
                 use the device is found by lsof(8) or fuser(1).)"
      
      apparently pbuilder tries to unmount the /dev/pts filesystem while still
      holding at least one master node open, which is arguably not very nice,
      but we don't break user space even when fixing other bugs.
      
      So this commit has to be reverted.
      
      I'll try to figure out a way to avoid caching the path to the slave pty
      in the master pty.  The only thing that actually wants that slave pty
      path is the "TIOCGPTPEER" ioctl, and I think we could just recreate the
      path at that time.
Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
      Cc: Eric W Biederman <ebiederm@xmission.com>
      Cc: Christian Brauner <christian.brauner@canonical.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      143c97cc
  5. 22 August 2017 (1 commit)
    • pids: make task_tgid_nr_ns() safe · dd1c1f2f
      Committed by Oleg Nesterov
      This was reported many times, and this was even mentioned in commit
      52ee2dfd ("pids: refactor vnr/nr_ns helpers to make them safe") but
      somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
not safe because task->group_leader points to nowhere after the exiting
task passes exit_notify(); rcu_read_lock() can not help.
      
      We really need to change __unhash_process() to nullify group_leader,
      parent, and real_parent, but this needs some cleanups.  Until then we
      can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
      fix the problem.
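
A minimal sketch of what "another user of __task_pid_nr_ns()" means here (the
exact pid-type argument is an assumption, not a quote of the patch):

	static inline pid_t task_tgid_nr_ns(struct task_struct *tsk,
					    struct pid_namespace *ns)
	{
		/* avoid dereferencing tsk->group_leader after exit_notify() */
		return __task_pid_nr_ns(tsk, __PIDTYPE_TGID, ns);
	}
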
Reported-by: Troy Kensinger <tkensinger@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd1c1f2f
  6. 20 August 2017 (1 commit)
    • iio: Fix some documentation warnings · f00fd7ae
      Committed by Jonathan Corbet
      The kerneldoc description for the trig_readonly field of struct iio_dev
      lacked a colon, leading to this doc build warning:
      
        ./include/linux/iio/iio.h:603: warning: No description found for parameter 'trig_readonly'
      
      A similar issue for iio_trigger_set_immutable() in trigger.h yielded:
      
        ./include/linux/iio/trigger.h:151: warning: No description found for parameter 'indio_dev'
        ./include/linux/iio/trigger.h:151: warning: No description found for parameter 'trig'
      
      Fix the formatting and silence the warnings.
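
For reference, kerneldoc wants a colon after each field name; the fixed struct
comment looks roughly like this (description paraphrased):

	/**
	 * struct iio_dev - industrial I/O device
	 * ...
	 * @trig_readonly:	mark the current trigger immutable
	 */
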
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      f00fd7ae
  7. 19 August 2017 (4 commits)
    • mm, oom: fix potential data corruption when oom_reaper races with writer · 6b31d595
      Committed by Michal Hocko
      Wenwei Tao has noticed that our current assumption that the oom victim
      is dying and never doing any visible changes after it dies, and so the
      oom_reaper can tear it down, is not entirely true.
      
__task_will_free_mem considers a task dying when SIGNAL_GROUP_EXIT is set
but do_group_exit sends SIGKILL to all threads _after_ the flag is set.
      So there is a race window when some threads won't have
      fatal_signal_pending while the oom_reaper could start unmapping the
      address space.  Moreover some paths might not check for fatal signals
      before each PF/g-u-p/copy_from_user.
      
      We already have a protection for oom_reaper vs.  PF races by checking
      MMF_UNSTABLE.  This has been, however, checked only for kernel threads
      (use_mm users) which can outlive the oom victim.  A simple fix would be
      to extend the current check in handle_mm_fault for all tasks but that
      wouldn't be sufficient because the current check assumes that a kernel
      thread would bail out after EFAULT from get_user*/copy_from_user and
      never re-read the same address which would succeed because the PF path
      has established page tables already.  This seems to be the case for the
      only existing use_mm user currently (virtio driver) but it is rather
      fragile in general.
      
      This is even more fragile in general for more complex paths such as
      generic_perform_write which can re-read the same address more times
      (e.g.  iov_iter_copy_from_user_atomic to fail and then
      iov_iter_fault_in_readable on retry).
      
      Therefore we have to implement MMF_UNSTABLE protection in a robust way
      and never make a potentially corrupted content visible.  That requires
      to hook deeper into the PF path and check for the flag _every time_
      before a pte for anonymous memory is established (that means all
      !VM_SHARED mappings).
      
      The corruption can be triggered artificially
      (http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp)
      but there doesn't seem to be any real life bug report.  The race window
      should be quite tight to trigger most of the time.
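
The check this implies in the fault path is roughly the following (helper name
and return value are an illustrative sketch, not the literal patch):

	/* Never establish new anonymous ptes once the oom_reaper may have
	 * started unmapping parts of the address space. */
	static inline int check_stable_address_space(struct mm_struct *mm)
	{
		if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
			return VM_FAULT_SIGBUS;
		return 0;
	}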
      
      Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b31d595
    • mm: discard memblock data later · 3010f876
      Committed by Pavel Tatashin
      There is existing use after free bug when deferred struct pages are
      enabled:
      
      The memblock_add() allocates memory for the memory array if more than
      128 entries are needed.  See comment in e820__memblock_setup():
      
        * The bootstrap memblock region count maximum is 128 entries
        * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
        * than that - so allow memblock resizing.
      
      This memblock memory is freed here:
              free_low_memory_core_early()
      
      We access the freed memblock.memory later in boot when deferred pages
      are initialized in this path:
      
              deferred_init_memmap()
                      for_each_mem_pfn_range()
                        __next_mem_pfn_range()
                          type = &memblock.memory;
      
      One possible explanation for why this use-after-free hasn't been hit
      before is that the limit of INIT_MEMBLOCK_REGIONS has never been
      exceeded at least on systems where deferred struct pages were enabled.
      
      Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128,
and verifying in qemu that this code is getting executed and that the
      freed pages are sane.
      
      Link: http://lkml.kernel.org/r/1502485554-318703-2-git-send-email-pasha.tatashin@oracle.com
      Fixes: 7e18adb4 ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3010f876
    • wait: add wait_event_killable_timeout() · 8ada9279
      Committed by Luis R. Rodriguez
      These are the few pending fixes I have queued up for v4.13-final.  One
is a generic regression fix for recursive loops on kmod and the other
      one is a trivial print out correction.
      
      During the v4.13 development we assumed that recursive kmod loops were
      no longer possible.  Clearly that is not true.  The regression fix makes
      use of a new killable wait.  We use a killable wait to be paranoid in
      how signals might be sent to modprobe and only accept a proper SIGKILL.
      The signal will only be available to userspace to issue *iff* a thread
      has already entered a wait state, and that happens only if we've already
      throttled after 50 kmod threads have been hit.
      
Note that although it may seem excessive to trigger a failure after 5
seconds if all kmod threads remain busy, prior to the series of changes
      that went into v4.13 we would actually *always* fatally fail any request
      which came in if the limit was already reached.  The new waiting
      implemented in v4.13 actually gives us *more* breathing room -- the wait
      for 5 seconds is a wait for *any* kmod thread to finish.  We give up and
      fail *iff* no kmod thread has finished and they're *all* running
      straight for 5 consecutive seconds.  If 50 kmod threads are running
      consecutively for 5 seconds something else must be really bad.
      
Recursive loops with kmod are bad but they're also hard to implement
properly as a selftest without fooling current userspace tools like
kmod [1].  For instance kmod will complain when you run depmod if it
finds a recursive loop with symbol dependency between modules; as such,
this type of recursive loop cannot go upstream as the modules_install
target will fail after running depmod.
      
      These tests already exist on userspace kmod upstream though (refer to
      the testsuite/module-playground/mod-loop-*.c files).  The same is not
true if request_module() is used though, or worse if aliases are used.
      
      Likewise the issue with 64-bit kernels booting 32-bit userspace without
      a binfmt handler built-in is also currently not detected and proactively
      avoided by userspace kmod tools, or kconfig for all architectures.
      Although we could complain in the kernel when some of these individual
      recursive issues creep up, proactively avoiding these situations in
      userspace at build time is what we should keep striving for.
      
      Lastly, since recursive loops could happen with kmod it may mean
      recursive loops may also be possible with other kernel usermode helpers,
      this should be investigated and long term if we can come up with a more
      sensible generic solution even better!
      
      [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20170809-kmod-for-v4.13-final
      [1] https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git
      
      This patch (of 3):
      
      This wait is similar to wait_event_interruptible_timeout() but only
      accepts SIGKILL interrupt signal.  Other signals are ignored.
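
Usage mirrors the other wait_event_*_timeout() helpers; an illustrative
caller (the wait queue and condition names here are hypothetical):

	/* returns >0 if the condition became true, 0 on timeout, and
	 * -ERESTARTSYS only if the task received SIGKILL */
	long ret = wait_event_killable_timeout(my_waitqueue,
					       atomic_read(&pending_work) == 0,
					       5 * HZ);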
      
Link: http://lkml.kernel.org/r/20170809234635.13443-2-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Cc: David Binderman <dcb314@hotmail.com>
      Cc: Matt Redfearn <matt.redfearn@imgetc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ada9279
    • mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() · 739f79fc
      Committed by Johannes Weiner
      Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
      to update the memcg stats:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
          IP: test_clear_page_writeback+0x12e/0x2c0
          [...]
          RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
          Call Trace:
           <IRQ>
           end_page_writeback+0x47/0x70
           f2fs_write_end_io+0x76/0x180 [f2fs]
           bio_endio+0x9f/0x120
           blk_update_request+0xa8/0x2f0
           scsi_end_request+0x39/0x1d0
           scsi_io_completion+0x211/0x690
           scsi_finish_command+0xd9/0x120
           scsi_softirq_done+0x127/0x150
           __blk_mq_complete_request_remote+0x13/0x20
           flush_smp_call_function_queue+0x56/0x110
           generic_smp_call_function_single_interrupt+0x13/0x30
           smp_call_function_single_interrupt+0x27/0x40
           call_function_single_interrupt+0x89/0x90
          RIP: 0010:native_safe_halt+0x6/0x10
      
          (gdb) l *(test_clear_page_writeback+0x12e)
          0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
          614		mod_node_page_state(page_pgdat(page), idx, val);
          615		if (mem_cgroup_disabled() || !page->mem_cgroup)
          616			return;
          617		mod_memcg_state(page->mem_cgroup, idx, val);
          618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
          619		this_cpu_add(pn->lruvec_stat->count[idx], val);
          620	}
          621
          622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
          623							gfp_t gfp_mask,
      
      The issue is that writeback doesn't hold a page reference and the page
      might get freed after PG_writeback is cleared (and the mapping is
      unlocked) in test_clear_page_writeback().  The stat functions looking up
      the page's node or zone are safe, as those attributes are static across
      allocation and free cycles.  But page->mem_cgroup is not, and it will
      get cleared if we race with truncation or migration.
      
      It appears this race window has been around for a while, but less likely
      to trigger when the memcg stats were updated first thing after
      PG_writeback is cleared.  Recent changes reshuffled this code to update
      the global node stats before the memcg ones, though, stretching the race
      window out to an extent where people can reproduce the problem.
      
      Update test_clear_page_writeback() to look up and pin page->mem_cgroup
      before clearing PG_writeback, then not use that pointer afterward.  It
      is a partial revert of 62cccb8c ("mm: simplify lock_page_memcg()")
      but leaves the pageref-holding callsites that aren't affected alone.
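
Conceptually the fixed ordering looks like this (a sketch of the idea, not the
literal diff):

	struct mem_cgroup *memcg;

	memcg = lock_page_memcg(page);		/* pin page->mem_cgroup */
	ret = TestClearPageWriteback(page);	/* page may be freed from here on */
	/* ... update stats using the pinned memcg, not page->mem_cgroup ... */
	__unlock_page_memcg(memcg);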
      
      Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
      Fixes: 62cccb8c ("mm: simplify lock_page_memcg()")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Jaegeuk Kim <jaegeuk@kernel.org>
Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reported-by: Bradley Bolen <bradleybolen@gmail.com>
Tested-by: Brad Bolen <bradleybolen@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      739f79fc
  8. 18 August 2017 (2 commits)
    • kernel/watchdog: Prevent false positives with turbo modes · 7edaeb68
      Committed by Thomas Gleixner
      The hardlockup detector on x86 uses a performance counter based on unhalted
      CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
      performance counter period, so the hrtimer should fire 2-3 times before the
      performance counter NMI fires. The NMI code checks whether the hrtimer
fired since the last invocation. If not, it assumes a hard lockup.
      
      The calculation of those periods is based on the nominal CPU
      frequency. Turbo modes increase the CPU clock frequency and therefore
      shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
      nominal frequency) the perf/NMI period is shorter than the hrtimer period
      which leads to false positives.
      
      A simple fix would be to shorten the hrtimer period, but that comes with
      the side effect of more frequent hrtimer and softlockup thread wakeups,
      which is not desired.
      
      Implement a low pass filter, which checks the perf/NMI period against
      kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
      elapsed then the event is ignored and postponed to the next perf/NMI.
      
      That solves the problem and avoids the overhead of shorter hrtimer periods
      and more frequent softlockup thread wakeups.
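
The filter boils down to a timestamp check in the NMI path, roughly as follows
(names and threshold handling are an illustrative sketch, not the exact patch):

	static DEFINE_PER_CPU(ktime_t, last_timestamp);

	static bool watchdog_check_timestamp(void)
	{
		ktime_t delta, now = ktime_get_mono_fast_ns();

		delta = now - __this_cpu_read(last_timestamp);
		if (delta < watchdog_hrtimer_sample_threshold) {
			/* perf/NMI fired early (turbo); ignore this event */
			return false;
		}
		__this_cpu_write(last_timestamp, now);
		return true;
	}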
      
      Fixes: 58687acb ("lockup_detector: Combine nmi_watchdog and softlockup detector")
Reported-and-tested-by: Kan Liang <Kan.liang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: dzickus@redhat.com
      Cc: prarit@redhat.com
      Cc: ak@linux.intel.com
      Cc: babu.moger@oracle.com
      Cc: peterz@infradead.org
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: stable@vger.kernel.org
      Cc: atomlin@redhat.com
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
      7edaeb68
    • pty: fix the cached path of the pty slave file descriptor in the master · c8c03f18
      Committed by Linus Torvalds
      Christian Brauner reported that if you use the TIOCGPTPEER ioctl() to
      get a slave pty file descriptor, the resulting file descriptor doesn't
      look right in /proc/<pid>/fd/<fd>.  In particular, he wanted to use
      readlink() on /proc/self/fd/<fd> to get the pathname of the slave pty
      (basically implementing "ptsname{_r}()").
      
      The reason for that was that we had generated the wrong 'struct path'
      when we create the pty in ptmx_open().
      
      In particular, the dentry was correct, but the vfsmount pointed to the
      mount of the ptmx node. That _can_ be correct - in case you use
      "/dev/pts/ptmx" to open the master - but usually is not.  The normal
      case is to use /dev/ptmx, which then looks up the pts/ directory, and
      then the vfsmount of the ptmx node is obviously the /dev directory, not
      the /dev/pts/ directory.
      
      We actually did have the right vfsmount available, but in the wrong
      place (it gets looked up in 'devpts_acquire()' when we get a reference
      to the pts filesystem), and so ptmx_open() used the wrong mnt pointer.
      
The end result of this confusion was that the pty worked fine, but if
you did TIOCGPTPEER to get the slave side of the pty, the end result
would also work, but have that dodgy 'struct path'.
      
And then when doing "d_path()" on it to get the pathname, the vfsmount
      would not match the root of the pts directory, and d_path() would return
      an empty pathname thinking that the entry had escaped a bind mount into
      another mount.
      
      This fixes the problem by making devpts_acquire() return the vfsmount
      for the pts filesystem, allowing ptmx_open() to trivially just use the
      right mount for the pts dentry, and create the proper 'struct path'.
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8c03f18
  9. 17 August 2017 (1 commit)
  10. 16 August 2017 (1 commit)
  11. 15 August 2017 (2 commits)
  12. 11 August 2017 (5 commits)
    • mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem · 99baac21
      Committed by Minchan Kim
Nadav reported that parallel MADV_DONTNEED on the same range has a stale
TLB problem, and Mel fixed it [1] and found the same problem with MADV_FREE [2].
      
      Quote from Mel Gorman:
       "The race in question is CPU 0 running madv_free and updating some PTEs
        while CPU 1 is also running madv_free and looking at the same PTEs.
        CPU 1 may have writable TLB entries for a page but fail the pte_dirty
        check (because CPU 0 has updated it already) and potentially fail to
        flush.
      
        Hence, when madv_free on CPU 1 returns, there are still potentially
        writable TLB entries and the underlying PTE is still present so that a
        subsequent write does not necessarily propagate the dirty bit to the
        underlying PTE any more. Reclaim at some unknown time at the future
        may then see that the PTE is still clean and discard the page even
        though a write has happened in the meantime. I think this is possible
        but I could have missed some protection in madv_free that prevents it
        happening."
      
This patch aims to solve both problems at once and is also ready for the
other problem with the KSM, MADV_FREE and soft-dirty story [3].
      
The TLB batch API (tlb_[gather|finish]_mmu) uses [inc|dec]_tlb_flush_pending
and mm_tlb_flush_pending so that when tlb_finish_mmu is called, we can
detect that parallel threads are operating on the same mm.  In that case,
forcefully flush the TLB to prevent userspace from accessing memory via a
stale TLB entry even though the batch failed to gather any page table entries.
      
I confirmed this patch works with the test program [4] that Nadav gave, so
this patch supersedes "mm: Always flush VMA ranges affected by zap_page_range
v2" in current mmotm.
      
      NOTE:
      
This patch modifies the arch-specific TLB gathering interface (x86, ia64,
s390, sh, um).  Most architectures are straightforward, but s390 needs
care because tlb_flush_mmu works only if mm->context.flush_mm is set to
non-zero, which happens only when a pte entry really is cleared by
ptep_get_and_clear and friends.  However, this problem never changes the
pte entries, yet we still need to flush to prevent memory access via a
stale tlb.
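
In effect, the generic teardown path becomes something like this (a simplified
sketch; the name of the "parallel threads" check is an assumption here):

	void tlb_finish_mmu(struct mmu_gather *tlb,
			    unsigned long start, unsigned long end)
	{
		/* another thread batched flushes on this mm concurrently,
		 * so flush even if this gather collected nothing itself */
		bool force = mm_tlb_flush_nested(tlb->mm);

		arch_tlb_finish_mmu(tlb, start, end, force);
		dec_tlb_flush_pending(tlb->mm);
	}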
      
      [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net
      [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de
      [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      [4] https://patchwork.kernel.org/patch/9861621/
      
      [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu]
        Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox
Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Reported-by: Nadav Amit <namit@vmware.com>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99baac21
    • mm: make tlb_flush_pending global · 0a2dd266
      Committed by Minchan Kim
Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
COMPACTION], but upcoming patches to solve the subtle TLB flush batching
problem will use it regardless of compaction/NUMA, so this patch removes
that dependency.
      
      [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement]
Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a2dd266
    • mm: refactor TLB gathering API · 56236a59
      Committed by Minchan Kim
      This patch is a preparatory patch for solving race problems caused by
      TLB batch.  For that, we will increase/decrease TLB flush pending count
      of mm_struct whenever tlb_[gather|finish]_mmu is called.
      
Before making it simple, this patch separates the architecture-specific
part, renames it to arch_tlb_[gather|finish]_mmu, and has the generic part
just call it.
      
      It shouldn't change any behavior.
      
Link: http://lkml.kernel.org/r/20170802000818.4760-5-namit@vmware.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56236a59
    • mm: migrate: fix barriers around tlb_flush_pending · 0a2c4048
      Committed by Nadav Amit
      Reading tlb_flush_pending while the page-table lock is taken does not
      require a barrier, since the lock/unlock already acts as a barrier.
      Removing the barrier in mm_tlb_flush_pending() to address this issue.
      
      However, migrate_misplaced_transhuge_page() calls mm_tlb_flush_pending()
      while the page-table lock is already released, which may present a
      problem on architectures with weak memory model (PPC).  To deal with
      this case, a new parameter is added to mm_tlb_flush_pending() to
      indicate if it is read without the page-table lock taken, and calling
      smp_mb__after_unlock_lock() in this case.
      
Link: http://lkml.kernel.org/r/20170802000818.4760-3-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a2c4048
    • mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Committed by Nadav Amit
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that Linux TLB batching mechanism suffers from various
      races.  Races that are caused due to batching during reclamation were
      recently handled by Mel and this patch-set deals with others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
An actual data corruption was not observed, yet the race was confirmed by
adding an assertion to check that tlb_flush_pending is not set by two
threads, adding artificial latency in change_protection_range() and using
sysctl to reduce kernel.numa_balancing_scan_delay_ms.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
Fixes: 20841405 ("mm: fix TLB flush race between migration, and change_protection_range")
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16af97dc
  13. 10 August 2017 (2 commits)
    • perf/x86: Fix RDPMC vs. mm_struct tracking · bfe33492
      Committed by Peter Zijlstra
      Vince reported the following rdpmc() testcase failure:
      
       > Failing test case:
       >
       >	fd=perf_event_open();
       >	addr=mmap(fd);
       >	exec()  // without closing or unmapping the event
       >	fd=perf_event_open();
       >	addr=mmap(fd);
       >	rdpmc()	// GPFs due to rdpmc being disabled
      
      The problem is of course that exec() plays tricks with what is
      current->mm, only destroying the old mappings after having
      installed the new mm.
      
      Fix this confusion by passing along vma->vm_mm instead of relying on
      current->mm.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: 1e0fb9ec ("perf: Add pmu callbacks to track event mapping and unmapping")
      Link: http://lkml.kernel.org/r/20170802173930.cstykcqefmqt7jau@hirez.programming.kicks-ass.net
      [ Minor cleanups. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bfe33492
    • nvmet_fc: add defer_req callback for deferment of cmd buffer return · 0fb228d3
      Committed by James Smart
      At queue creation, the transport allocates a local job struct
      (struct nvmet_fc_fcp_iod) for each possible element of the queue.
When a new CMD is received from the wire, a job struct is allocated
      from the queue and then used for the duration of the command.
      The job struct contains buffer space for the wire command iu. Thus,
      upon allocation of the job struct, the cmd iu buffer is copied to
      the job struct and the LLDD may immediately free/reuse the CMD IU
      buffer passed in the call.
      
However, due to the packetized nature of FC and the API of the FC LLDD,
the LLDD may in some circumstances issue a hw command to send the wire
response but not yet have received the hw completion (and thus not yet
have upcalled the nvmet_fc layer) when a new command is asynchronously
received on the wire. In other words, it's possible for the initiator to
get the response from the wire, thus believe a command slot is free, and
send a new command iu. The new command iu may be received by the LLDD and
passed to the transport before the LLDD has serviced the hw completion
and made the teardown calls for the original job struct. As such, there
is no job struct available for the new io. E.g. it appears as if the host
sent more queue elements than the queue size. It didn't, based on its
understanding.
      
Rather than treat this as a hard connection failure, queue the new
request until a job struct frees up. As the buffer isn't copied (there is
no job struct yet), a special return value must be returned to the LLDD
to signify that it should hold off on recycling the cmd iu buffer.
Later, when a job struct is allocated and the buffer copied, a new LLDD
callback is invoked to notify the LLDD and allow it to recycle its
command iu buffer.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
      0fb228d3
  14. 07 August 2017 (1 commit)
  15. 05 August 2017 (1 commit)
  16. 03 August 2017 (6 commits)
    • mm: allow page_cache_get_speculative in interrupt context · 1ee1c3f5
      Committed by Kan Liang
The kernel panics when calling the IRQ-safe __get_user_pages_fast in an NMI
      handler.
      
      The bug was introduced by commit 2947ba05 ("x86/mm/gup: Switch GUP
      to the generic get_user_page_fast() implementation").
      
      The original x86 __get_user_page_fast used plain get_page() or
      page_ref_add().  However, the generic __get_user_page_fast uses
      page_cache_get_speculative(), which has VM_BUG_ON(in_interrupt()).
      
There is no reason to prevent page_cache_get_speculative from being used
in interrupt context.  According to the author, putting a BUG_ON there is
just because the code is not verifying correctness of interrupt races.  I
did some tests in interrupt context; no issue was found.
      
      Removing VM_BUG_ON(in_interrupt()) for page_cache_get_speculative().
      
      Link: http://lkml.kernel.org/r/1501609146-59730-1-git-send-email-kan.liang@intel.com
      Fixes: 2947ba05 ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation")
Signed-off-by: Kan Liang <kan.liang@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ee1c3f5
    • cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5
      Committed by Dima Zavin
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial.  The loop reads mems_allowed_seq value (via
      read_mems_allowed_begin), performs the defrag operation, and then
      verifies the consistency of mem_allowed via the read_mems_allowed_retry
      and the cookie returned by xxx_begin.
      
      The issue here is that both begin and retry first check if cpusets are
      enabled via cpusets_enabled() static branch.  This branch can be
rewritten dynamically (via cpuset_inc) if a new cpuset is created.  The
      x86 jump label code fully synchronizes across all CPUs for every entry
      it rewrites.  If it rewrites only one of the callsites (specifically the
      one in read_mems_allowed_retry) and then waits for the
      smp_call_function(do_sync_core) to complete while a CPU is inside the
      begin/retry section with IRQs off and the mems_allowed value is changed,
      we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
      first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
always returns a valid seqcount if we are enabling cpusets.  Similarly, when
      disabling cpusets via cpuset_dec(), we first ensure that callers of
      cpuset_mems_allowed_retry() will start ignoring the seqcount value
      before we let cpuset_mems_allowed_begin() return 0.
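
A sketch of the two-key arrangement described above (simplified; an
illustration of the ordering rather than the exact patch):

	static inline unsigned int read_mems_allowed_begin(void)
	{
		/* keyed on the pre-enable key: flipped first on enable */
		if (!static_branch_unlikely(&cpusets_pre_enable_key))
			return 0;
		return read_seqcount_begin(&current->mems_allowed_seq);
	}

	static inline bool read_mems_allowed_retry(unsigned int seq)
	{
		/* keyed on the enable key: flipped second on enable */
		if (!static_branch_unlikely(&cpusets_enabled_key))
			return false;
		return read_seqcount_retry(&current->mems_allowed_seq, seq);
	}

	static inline void cpuset_inc(void)
	{
		static_branch_inc(&cpusets_pre_enable_key);	/* begin() goes live first */
		static_branch_inc(&cpusets_enabled_key);	/* then retry() starts checking */
	}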
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
Reported-by: Cliff Spradlin <cspradlin@waymo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89affbf5
    • kthread: fix documentation build warning · d16977f3
      Committed by Jonathan Corbet
      The kerneldoc comment for kthread_create() had an incorrect argument
      name, leading to a warning in the docs build.
      
      Correct it, and make one more small step toward a warning-free build.
      
Link: http://lkml.kernel.org/r/20170724135916.7f486c6f@lwn.net
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d16977f3
    • mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries · 3ea27719
      Committed by Mel Gorman
Nadav Amit identified a theoretical race between page reclaim and
      mprotect due to TLB flushes being batched outside of the PTL being held.
      
      He described the race as follows:
      
              CPU0                            CPU1
              ----                            ----
                                              user accesses memory using RW PTE
                                              [PTE now cached in TLB]
              try_to_unmap_one()
              ==> ptep_get_and_clear()
              ==> set_tlb_ubc_flush_pending()
                                              mprotect(addr, PROT_READ)
                                              ==> change_pte_range()
                                              ==> [ PTE non-present - no flush ]
      
                                              user writes using cached RW PTE
              ...
      
              try_to_unmap_flush()
      
      The same type of race exists for reads when protecting for PROT_NONE and
      also exists for operations that can leave an old TLB entry behind such
      as munmap, mremap and madvise.
      
      For some operations like mprotect, it's not necessarily a data integrity
      issue but it is a correctness issue as there is a window where an
      mprotect that limits access still allows access.  For munmap, it's
      potentially a data integrity issue although the race is massive as an
      munmap, mmap and return to userspace must all complete between the
      window when reclaim drops the PTL and flushes the TLB.  However, it's
theoretically possible so handle this issue by flushing the mm if
      reclaim is potentially currently batching TLB flushes.
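
The guard this adds boils down to something like the following (a simplified
sketch of the idea; the actual patch also guards against compiler reordering):

	/* Called under the PTL before touching present ptes in mprotect,
	 * munmap, madvise etc.: if reclaim batched a TLB flush for this mm
	 * and has not executed it yet, execute it now. */
	void flush_tlb_batched_pending(struct mm_struct *mm)
	{
		if (mm->tlb_flush_batched) {
			flush_tlb_mm(mm);
			mm->tlb_flush_batched = false;
		}
	}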
      
      Other instances where a flush is required for a present pte should be ok
      as either the page lock is held preventing parallel reclaim or a page
      reference count is elevated preventing a parallel free leading to
      corruption.  In the case of page_mkclean there isn't an obvious path
      that userspace could take advantage of without using the operations that
are guarded by this patch.  Other users, such as gup racing with
reclaim, look just at PTEs.  huge page variants should be ok as they
      don't race with reclaim.  mincore only looks at PTEs.  userfault also
      should be ok as if a parallel reclaim takes place, it will either fault
      the page back in or read some of the data before the flush occurs
      triggering a fault.
      
      Note that a variant of this patch was acked by Andy Lutomirski but this
      was for the x86 parts on top of his PCID work which didn't make the 4.13
      merge window as expected.  His ack is dropped from this version and
      there will be a follow-on patch on top of PCID that will include his
      ack.
      
      [akpm@linux-foundation.org: tweak comments]
      [akpm@linux-foundation.org: fix spello]
Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: <stable@vger.kernel.org>	[v4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ea27719
    • KVM: avoid using rcu_dereference_protected · 3898da94
      Committed by Paolo Bonzini
      During teardown, accesses to memslots and buses are using
      rcu_dereference_protected with an always-true condition because
      these accesses are done outside the usual mutexes.  This
      is because the last reference is gone and there cannot be any
      concurrent modifications, but rcu_dereference_protected is
      ugly and unobvious.
      
      Instead, check the refcount in kvm_get_bus and __kvm_memslots.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      3898da94
    • net/mlx4_en: Fix wrong indication of Wake-on-LAN (WoL) support · c994f778
      Committed by Inbar Karmy
      Currently when WoL is supported but disabled, ethtool reports:
      "Supports Wake-on: d".
Fix the indication of WoL support, so that the indication
      remains "g" all the time if the NIC supports WoL.
      
      Tested:
As expected, when the NIC supports WoL, ethtool reports:
      	Supports Wake-on: g
      	Wake-on: d
when the NIC doesn't support WoL, ethtool reports:
              Supports Wake-on: d
              Wake-on: d
      
      Fixes: 14c07b13 ("mlx4: Wake on LAN support")
Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      c994f778
  17. 02 August 2017 (4 commits)
    • mtd: nand: Declare tBERS, tR and tPROG as u64 to avoid integer overflow · 6d292310
      Committed by Boris Brezillon
      All timings in nand_sdr_timings are expressed in picoseconds but some
      of them may not fit in an u32.
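
For scale: a u32 tops out at about 4.29e9 ps, i.e. roughly 4.3 ms, while block
erase times (tBERS) of several milliseconds are common, so such values overflow
unless they are stored as u64.
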
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Fixes: 204e7ecd ("mtd: nand: Add a few more timings to nand_sdr_timings")
Reported-by: Alexander Dahl <ada@thorsis.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Alexander Dahl <ada@thorsis.com>
Tested-by: Alexander Dahl <ada@thorsis.com>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
      6d292310
    • PCI: Add pci_reset_function_locked() · a477b9cd
      Committed by Marc Zyngier
      The implementation of PCI workarounds may require that the device is reset
      from its probe function.  This implies that the PCI device lock is already
      held, and makes calling pci_reset_function() impossible (since it will
      itself try to take that lock).
      
      Add pci_reset_function_locked(), which is the equivalent of
      pci_reset_function(), except that it requires the PCI device lock to be
      already held by the caller.
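
An illustrative caller (the driver name is hypothetical): a probe or quirk
routine, which already runs with the device lock held, can now reset the
function without deadlocking on that lock:

	static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		/* probe() is entered with the PCI device lock held, so calling
		 * plain pci_reset_function() here would deadlock */
		return pci_reset_function_locked(pdev);
	}
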
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      [bhelgaas: folded in fix for conflict with 52354b9d ("PCI: Remove
      __pci_dev_reset() and pci_dev_reset()")]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: stable@vger.kernel.org	# 4.11: 52354b9d: PCI: Remove __pci_dev_reset() and pci_dev_reset()
      Cc: stable@vger.kernel.org	# 4.11
      a477b9cd
    • ptp: introduce ptp auxiliary worker · d9535cb7
      Committed by Grygorii Strashko
Many PTP drivers need to perform some asynchronous or periodic work, like
periodically handling PHC counter overflow or handling delayed timestamps
for RX/TX network packets. In most of the cases, such work is implemented
using workqueues. Unfortunately, kernel workqueues might introduce
significant delay in work scheduling under high system load and on -RT,
which could cause misbehavior of PTP drivers due to internal counter
overflow, for example, and there is no way to tune their execution policy
and priority manually.

Hence, kthread_worker can be used instead of workqueues, as it creates a
separate named kthread for each worker, and its execution policy and
priority can be configured using the chrt tool.

This problem was reported for two drivers, TI CPSW CPTS and dp83640, so
instead of modifying each of these drivers it was proposed to add a PTP
auxiliary worker to the PHC subsystem.
      
      The patch adds PTP auxiliary worker in PHC subsystem using kthread_worker
      and kthread_delayed_work and introduces two new PHC subsystem APIs:
      
- long (*do_aux_work)(struct ptp_clock_info *ptp) callback in the
ptp_clock_info structure, which the driver should assign if it needs to
perform asynchronous or periodic work. The driver should return the delay
until the next auxiliary work scheduling time (>= 0), or a negative value
if no further scheduling is required.
      
- int ptp_schedule_worker(struct ptp_clock *ptp, unsigned long delay), which
allows scheduling the PTP auxiliary work.
      
The name of the kthread_worker thread corresponds to the PTP PHC device name "ptp%d".
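
An illustrative driver hookup (the foo_* names are hypothetical; the callback
and ptp_schedule_worker() signatures are the ones described above):

	static long foo_ptp_aux_work(struct ptp_clock_info *info)
	{
		struct foo_priv *priv = container_of(info, struct foo_priv, ptp_info);

		foo_check_counter_overflow(priv);	/* the periodic work itself */

		return msecs_to_jiffies(500);		/* run again in ~500 ms */
	}

	/* at init time */
	priv->ptp_info.do_aux_work = foo_ptp_aux_work;
	priv->ptp_clock = ptp_clock_register(&priv->ptp_info, parent_dev);
	ptp_schedule_worker(priv->ptp_clock, 0);	/* kick off the first run */
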
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      d9535cb7
    • NFSv4: Fix EXCHANGE_ID corrupt verifier issue · fd40559c
      Committed by Trond Myklebust
      The verifier is allocated on the stack, but the EXCHANGE_ID RPC call was
changed to be asynchronous by commit 8d89bd70. If we interrupt
      the call to rpc_wait_for_completion_task(), we can therefore end up
      transmitting random stack contents in lieu of the verifier.
      
      Fixes: 8d89bd70 ("NFS setup async exchange_id")
      Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      fd40559c
  18. 01 August 2017 (2 commits)