1. 20 Mar, 2018 · 1 commit
    • RCU, workqueue: Implement rcu_work · 05f0fe6b
      Committed by Tejun Heo
      There are cases where an RCU callback needs to be bounced to a sleepable
      context.  This is currently done by the RCU callback queueing a work
      item, which can be cumbersome to write and confusing to read.
      
      This patch introduces rcu_work, a workqueue work variant which gets
      executed after an RCU grace period, and converts the open-coded
      bouncing in fs/aio and kernel/cgroup.
      
      v3: Dropped queue_rcu_work_on().  Documented rcu grace period behavior
          after queue_rcu_work().
      
      v2: Use rcu_barrier() instead of synchronize_rcu() to wait for
          completion of previously queued rcu callback as per Paul.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
  2. 16 Mar, 2018 · 1 commit
    • fs: Teach path_connected to handle nfs filesystems with multiple roots. · 95dd7758
      Committed by Eric W. Biederman
      On nfsv2 and nfsv3 the nfs server can export subsets of the same
      filesystem and report the same filesystem identifier, so that the nfs
      client can know they are the same filesystem.  The subsets can be from
      disjoint directory trees.  The nfsv2 and nfsv3 filesystems provide no
      way to find the common root of all directory trees exported from the
      server with the same filesystem identifier.
      
      The practical result is that for nfs the s_root of the superblock is
      not necessarily the root of the filesystem.  The nfs mount code sets
      s_root to the root of the first subset of the nfs filesystem that the
      kernel mounts.
      
      This affects the dcache invalidation code in generic_shutdown_super,
      currently called shrink_dcache_for_umount, and for years that code
      has gone through an additional list of dentries that might be dentry
      trees that need to be freed, in order to accommodate nfs.
      
      When I wrote path_connected I did not realize nfs was so special, and
      its heuristic for avoiding calling is_subdir can fail.
      
      The practical case where this fails is when there is a move of a
      directory from the subtree exposed by one nfs mount to the subtree
      exposed by another nfs mount.  This move can happen either locally or
      remotely.  The remote case requires that the moved directory be cached
      before the move, and that after the move someone walks the path
      to where the moved directory now exists and in so doing causes the
      already cached directory to be moved in the dcache through the magic
      of d_splice_alias.
      
      If someone whose working directory is in the moved directory or a
      subdirectory now starts walking .. from the initial mount of nfs
      (where s_root == mnt_root), then path_connected as a heuristic will
      not bother with the is_subdir check.  As s_root really is not the root
      of the nfs filesystem this heuristic is wrong, and the path may
      actually not be connected and path_connected can fail.
      
      The is_subdir function might be cheap enough that we can call it
      unconditionally.  Verifying that will take some benchmarking and
      the result may not be the same on all kernels this fix needs
      to be backported to.  So I am avoiding that for now.
      
      Filesystems with snapshots such as nilfs and btrfs do something
      similar.  But as the directory trees of the snapshots are disjoint
      from one another and from the main directory tree, rename won't move
      things between them and this problem will not occur.
      
      Cc: stable@vger.kernel.org
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 15 Mar, 2018 · 1 commit
    • KVM: arm/arm64: vgic: Don't populate multiple LRs with the same vintid · 16ca6a60
      Committed by Marc Zyngier
      The vgic code is trying to be clever when injecting GICv2 SGIs,
      and will happily populate LRs with the same interrupt number if
      they come from multiple vcpus (after all, they are distinct
      interrupt sources).
      
      Unfortunately, this is against the letter of the architecture,
      and the GICv2 architecture spec says "Each valid interrupt stored
      in the List registers must have a unique VirtualID for that
      virtual CPU interface.". GICv3 has similar (although slightly
      ambiguous) restrictions.
      
      This results in guests locking up when using GICv2-on-GICv3, for
      example. The obvious fix is to stop trying so hard, and inject
      a single vcpu per SGI per guest entry. After all, pending SGIs
      with multiple source vcpus are pretty rare, and are mostly seen
      in scenarios where the physical CPUs are severely overcommitted.
      
      But as we now only inject a single instance of a multi-source SGI per
      vcpu entry, we may delay those interrupts for longer than strictly
      necessary, and run the risk of injecting lower priority interrupts
      in the meantime.
      
      In order to address this, we adopt a three stage strategy:
      - If we encounter a multi-source SGI in the AP list while computing
        its depth, we force the list to be sorted
      - When populating the LRs, we prevent the injection of any interrupt
        of lower priority than that of the first multi-source SGI we've
        injected.
      - Finally, the injection of a multi-source SGI triggers the request
        of a maintenance interrupt when there will be no pending interrupt
        in the LRs (HCR_NPIE).
      
      At the point where the last pending interrupt in the LRs switches
      from Pending to Active, the maintenance interrupt will be delivered,
      allowing us to add the remaining SGIs using the same process.
      
      Cc: stable@vger.kernel.org
      Fixes: 0919e84c ("KVM: arm/arm64: vgic-new: Add IRQ sync/flush framework")
      Acked-by: Christoffer Dall <cdall@kernel.org>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
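The three-stage strategy in the vgic commit above can be sketched in user-space C. This is a toy model under stated assumptions, not the kernel's vgic code: `toy_irq`, `toy_lr`, `toy_flush_lrs()` and `TOY_NR_LRS` are hypothetical names, the AP list is assumed to be already sorted by priority, and the `npie` field only stands in for requesting the HCR_NPIE maintenance interrupt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_LRS 4	/* hypothetical number of list registers */

struct toy_irq {
	uint32_t vintid;	/* virtual interrupt ID */
	uint8_t  priority;	/* lower value = higher priority */
	uint8_t  sources;	/* bitmask of source vcpus (SGIs only), 0 otherwise */
};

struct toy_lr {
	uint32_t vintid;
	int      source;	/* source vcpu carried by this LR, -1 if none */
};

struct toy_flush_result {
	int  used;	/* number of LRs populated */
	bool npie;	/* a "no pending interrupt" maintenance IRQ is wanted */
};

/*
 * Populate LRs from a priority-sorted AP list.  Each vintid is placed at
 * most once per call; for a multi-source SGI only one source is taken
 * and the remaining sources stay pending for a later guest entry.
 */
static inline struct toy_flush_result
toy_flush_lrs(struct toy_irq *ap_list, int n, struct toy_lr *lrs)
{
	struct toy_flush_result res = { 0, false };
	bool multi_sgi = false;
	uint8_t multi_prio = 0;

	assert(ap_list && lrs);

	for (int i = 0; i < n && res.used < TOY_NR_LRS; i++) {
		struct toy_irq *irq = &ap_list[i];

		/* Inject nothing of lower priority than the first multi-source SGI. */
		if (multi_sgi && irq->priority > multi_prio)
			break;

		lrs[res.used].vintid = irq->vintid;
		lrs[res.used].source = -1;

		if (irq->sources) {
			int src = 0;

			/* Pick the lowest-numbered pending source vcpu. */
			while (!(irq->sources & (1u << src)))
				src++;
			lrs[res.used].source = src;
			irq->sources &= (uint8_t)~(1u << src);

			/* More sources remain: remember it and ask for NPIE. */
			if (irq->sources && !multi_sgi) {
				multi_sgi = true;
				multi_prio = irq->priority;
				res.npie = true;
			}
		}
		res.used++;
	}
	return res;
}
```

With an SGI pending from two vcpus followed by a lower-priority interrupt, one call fills a single LR for the SGI, leaves the second source pending, and stops before the lower-priority interrupt.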
  4. 07 Mar, 2018 · 1 commit
    • usb: quirks: add control message delay for 1b1c:1b20 · cb88a058
      Committed by Danilo Krummrich
      Corsair Strafe RGB keyboard does not respond to usb control messages
      sometimes and hence generates timeouts.
      
      Commit de3af5bf ("usb: quirks: add delay init quirk for Corsair
      Strafe RGB keyboard") tried to fix those timeouts by adding
      USB_QUIRK_DELAY_INIT.
      
      Unfortunately, even with this quirk timeouts of usb_control_msg()
      can still be seen, but with a lower frequency (approx. 1 out of 15):
      
      [   29.103520] usb 1-8: string descriptor 0 read error: -110
      [   34.363097] usb 1-8: can't set config #1, error -110
      
      Adding further delays to different locations where usb control
      messages are issued just moves the timeouts to other locations,
      e.g.:
      
      [   35.400533] usbhid 1-8:1.0: can't add hid device: -110
      [   35.401014] usbhid: probe of 1-8:1.0 failed with error -110
      
      The only way to reliably avoid those issues is having a pause after
      each usb control message. In approx. 200 boot cycles no more timeouts
      were seen.
      
      Additionally, keep USB_QUIRK_DELAY_INIT as it turned out to be necessary
      to have the delay in hub_port_connect() after hub_port_init().
      
      The overall boot time seems not to be influenced by these additional
      delays, even on fast machines and lightweight distributions.
      
      Fixes: de3af5bf ("usb: quirks: add delay init quirk for Corsair Strafe RGB keyboard")
      Cc: stable@vger.kernel.org
      Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 06 Mar, 2018 · 2 commits
  6. 05 Mar, 2018 · 2 commits
  7. 03 Mar, 2018 · 1 commit
    • signals: Move put_compat_sigset to compat.h to silence hardened usercopy · fde9fc76
      Committed by Matt Redfearn
      Since commit afcc90f8 ("usercopy: WARN() on slab cache usercopy
      region violations"), MIPS systems booting with a compat root filesystem
      emit a warning when copying compat siginfo to userspace:
      
      WARNING: CPU: 0 PID: 953 at mm/usercopy.c:81 usercopy_warn+0x98/0xe8
      Bad or missing usercopy whitelist? Kernel memory exposure attempt
      detected from SLAB object 'task_struct' (offset 1432, size 16)!
      Modules linked in:
      CPU: 0 PID: 953 Comm: S01logging Not tainted 4.16.0-rc2 #10
      Stack : ffffffff808c0000 0000000000000000 0000000000000001 65ac85163f3bdc4a
      	65ac85163f3bdc4a 0000000000000000 90000000ff667ab8 ffffffff808c0000
      	00000000000003f8 ffffffff808d0000 00000000000000d1 0000000000000000
      	000000000000003c 0000000000000000 ffffffff808c8ca8 ffffffff808d0000
      	ffffffff808d0000 ffffffff80810000 fffffc0000000000 ffffffff80785c30
      	0000000000000009 0000000000000051 90000000ff667eb0 90000000ff667db0
      	000000007fe0d938 0000000000000018 ffffffff80449958 0000000020052798
      	ffffffff808c0000 90000000ff664000 90000000ff667ab0 00000000100c0000
      	ffffffff80698810 0000000000000000 0000000000000000 0000000000000000
      	0000000000000000 0000000000000000 ffffffff8010d02c 65ac85163f3bdc4a
      	...
      Call Trace:
      [<ffffffff8010d02c>] show_stack+0x9c/0x130
      [<ffffffff80698810>] dump_stack+0x90/0xd0
      [<ffffffff80137b78>] __warn+0x100/0x118
      [<ffffffff80137bdc>] warn_slowpath_fmt+0x4c/0x70
      [<ffffffff8021e4a8>] usercopy_warn+0x98/0xe8
      [<ffffffff8021e68c>] __check_object_size+0xfc/0x250
      [<ffffffff801bbfb8>] put_compat_sigset+0x30/0x88
      [<ffffffff8011af24>] setup_rt_frame_n32+0xc4/0x160
      [<ffffffff8010b8b4>] do_signal+0x19c/0x230
      [<ffffffff8010c408>] do_notify_resume+0x60/0x78
      [<ffffffff80106f50>] work_notifysig+0x10/0x18
      ---[ end trace 88fffbf69147f48a ]---
      
      Commit 5905429a ("fork: Provide usercopy whitelisting for
      task_struct") noted that:
      
      "While the blocked and saved_sigmask fields of task_struct are copied to
      userspace (via sigmask_to_save() and setup_rt_frame()), it is always
      copied with a static length (i.e. sizeof(sigset_t))."
      
      However, this is not true in the case of compat signals, whose sigset
      is copied by put_compat_sigset and receives size as an argument.
      
      At most call sites, put_compat_sigset is copying a sigset from the
      current task_struct. This triggers a warning when
      CONFIG_HARDENED_USERCOPY is active. However, by marking this function as
      static inline, the warning can be avoided because in all of these cases
      the size is constant at compile time, which is allowed. The only site
      where this is not the case is handling the rt_sigpending syscall, but
      there the copy is being made from a stack local variable so does not
      trigger the warning.
      
      Move put_compat_sigset to compat.h, and mark it static inline. This
      fixes the WARN on MIPS.
      
      Fixes: afcc90f8 ("usercopy: WARN() on slab cache usercopy region violations")
      Signed-off-by: Matt Redfearn <matt.redfearn@mips.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: "Dmitry V . Levin" <ldv@altlinux.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/18639/
      Signed-off-by: James Hogan <jhogan@kernel.org>
  8. 01 Mar, 2018 · 1 commit
  9. 28 Feb, 2018 · 2 commits
    • tty: make n_tty_read() always abort if hangup is in progress · 28b0f8a6
      Committed by Tejun Heo
      A tty is hung up by __tty_hangup() setting file->f_op to
      hung_up_tty_fops, which is skipped on ttys whose write operation isn't
      tty_write().  This means that, for example, /dev/console whose write
      op is redirected_tty_write() is never actually marked hung up.
      
      Because n_tty_read() uses the hung up status to decide whether to
      abort the waiting readers, the lack of hung-up marking can lead to the
      following scenario.
      
       1. A session contains two processes.  The leader and its child.  The
          child ignores SIGHUP.
      
       2. The leader exits and starts disassociating from the controlling
          terminal (/dev/console).
      
       3. __tty_hangup() skips setting f_op to hung_up_tty_fops.
      
       4. SIGHUP is delivered and ignored.
      
       5. tty_ldisc_hangup() is invoked.  It wakes up the waits which should
          clear the read lockers of tty->ldisc_sem.
      
       6. The reader wakes up but because tty_hung_up_p() is false, it
          doesn't abort and goes back to sleep while read-holding
          tty->ldisc_sem.
      
       7. The leader progresses to tty_ldisc_lock() in tty_ldisc_hangup()
          and is now stuck in D sleep indefinitely waiting for
          tty->ldisc_sem.
      
      The following is Alan's explanation on why some ttys aren't hung up.
      
       http://lkml.kernel.org/r/20171101170908.6ad08580@alans-desktop
      
       1. It broke the serial consoles because they would hang up and close
          down the hardware. With tty_port that *should* be fixable properly
          for any cases remaining.
      
       2. The console layer was (and still is) completely broken and doesn't
          refcount properly. So if you turn on console hangups it breaks (as
          indeed does freeing consoles and half a dozen other things).
      
      As neither can be fixed quickly, this patch works around the problem
      by introducing a new flag, TTY_HUPPING, which is used solely to tell
      n_tty_read() that hang-up is in progress for the console and the
      readers should be aborted regardless of the hung-up status of the
      device.
      
      The following is a sample hung task warning caused by this issue.
      
        INFO: task agetty:2662 blocked for more than 120 seconds.
              Not tainted 4.11.3-dbg-tty-lockup-02478-gfd6c7ee-dirty #28
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            0  2662      1 0x00000086
        Call Trace:
         __schedule+0x267/0x890
         schedule+0x36/0x80
         schedule_timeout+0x23c/0x2e0
         ldsem_down_write+0xce/0x1f6
         tty_ldisc_lock+0x16/0x30
         tty_ldisc_hangup+0xb3/0x1b0
         __tty_hangup+0x300/0x410
         disassociate_ctty+0x6c/0x290
         do_exit+0x7ef/0xb00
         do_group_exit+0x3f/0xa0
         get_signal+0x1b3/0x5d0
         do_signal+0x28/0x660
         exit_to_usermode_loop+0x46/0x86
         do_syscall_64+0x9c/0xb0
         entry_SYSCALL64_slow_path+0x25/0x25
      
      The following is the repro.  Run "$PROG /dev/console".  The parent
      process hangs in D state.
      
        #include <sys/types.h>
        #include <sys/stat.h>
        #include <sys/wait.h>
        #include <sys/ioctl.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <errno.h>
        #include <signal.h>
        #include <time.h>
        #include <termios.h>
      
        int main(int argc, char **argv)
        {
      	  struct sigaction sact = { .sa_handler = SIG_IGN };
      	  struct timespec ts1s = { .tv_sec = 1 };
      	  pid_t pid;
      	  int fd;
      
      	  if (argc < 2) {
      		  fprintf(stderr, "test-hung-tty /dev/$TTY\n");
      		  return 1;
      	  }
      
      	  /* fork a child to ensure that it isn't already the session leader */
      	  pid = fork();
      	  if (pid < 0) {
      		  perror("fork");
      		  return 1;
      	  }
      
      	  if (pid > 0) {
      		  /* top parent, wait for everyone */
      		  while (waitpid(-1, NULL, 0) >= 0)
      			  ;
      		  if (errno != ECHILD)
      			  perror("waitpid");
      		  return 0;
      	  }
      
      	  /* new session, start a new session and set the controlling tty */
      	  if (setsid() < 0) {
      		  perror("setsid");
      		  return 1;
      	  }
      
      	  fd = open(argv[1], O_RDWR);
      	  if (fd < 0) {
      		  perror("open");
      		  return 1;
      	  }
      
      	  if (ioctl(fd, TIOCSCTTY, 1) < 0) {
      		  perror("ioctl");
      		  return 1;
      	  }
      
      	  /* fork a child, sleep a bit and exit */
      	  pid = fork();
      	  if (pid < 0) {
      		  perror("fork");
      		  return 1;
      	  }
      
      	  if (pid > 0) {
      		  nanosleep(&ts1s, NULL);
      		  printf("Session leader exiting\n");
      		  exit(0);
      	  }
      
      	  /*
      	   * The child ignores SIGHUP and keeps reading from the controlling
      	   * tty.  Because SIGHUP is ignored, the child doesn't get killed on
      	   * parent exit and the bug in n_tty makes the read(2) block the
      	   * parent's control terminal hangup attempt.  The parent ends up in
      	   * D sleep until the child is explicitly killed.
      	   */
      	  sigaction(SIGHUP, &sact, NULL);
      	  printf("Child reading tty\n");
      	  while (1) {
      		  char buf[1024];
      
      		  if (read(fd, buf, sizeof(buf)) < 0) {
      			  perror("read");
      			  return 1;
      		  }
      	  }
      
      	  return 0;
        }
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Alan Cox <alan@llwyncelyn.cymru>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
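The effect of the TTY_HUPPING flag described in the tty commit above can be reduced to a small decision function. This is a minimal user-space sketch, assuming a hypothetical `toy_tty` struct whose two booleans stand in for `tty_hung_up_p()` and the new flag; it does not reproduce the n_tty_read() wait loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two pieces of state a reader can consult. */
struct toy_tty {
	bool file_hung_up;	/* stands in for tty_hung_up_p(file) */
	bool hupping;		/* stands in for the TTY_HUPPING flag */
};

/* As described for the old behavior: only the hung-up status counts. */
static inline bool reader_should_abort_old(const struct toy_tty *t)
{
	return t->file_hung_up;
}

/* With the flag: a hang-up in progress also aborts the reader. */
static inline bool reader_should_abort_new(const struct toy_tty *t)
{
	return t->file_hung_up || t->hupping;
}

/* The /dev/console case: f_op was never switched, but a hang-up is running. */
static inline void toy_console_case(void)
{
	struct toy_tty console = { .file_hung_up = false, .hupping = true };

	assert(!reader_should_abort_old(&console));	/* reader sleeps again */
	assert(reader_should_abort_new(&console));	/* reader aborts */
}
```

In the console case the old check lets the reader go back to sleep while read-holding the semaphore, which is the deadlock in step 7 of the scenario; the new check lets it bail out.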
    • net: phy: Restore phy_resume() locking assumption · 9c2c2e62
      Committed by Andrew Lunn
      Commit f5e64032 ("net: phy: fix resume handling") changes the
      locking semantics for phy_resume() such that the caller now needs to
      hold the phy mutex. Not all call sites were adapted to these new
      semantics, resulting in warnings from the added
      WARN_ON(!mutex_is_locked(&phydev->lock)).  Rather than change the
      semantics, add a __phy_resume() and restore the old behavior of
      phy_resume().
      
      Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
      Fixes: f5e64032 ("net: phy: fix resume handling")
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
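The locked/unlocked split in the phy_resume() commit above follows a common pattern: a helper that expects the lock to be held, plus a wrapper that takes the lock itself. A minimal user-space sketch, assuming a hypothetical `toy_phy` struct whose `locked` flag stands in for `phydev->lock` (a real mutex is deliberately not used, and the kernel's `__` prefix is replaced by a `_locked` suffix because that prefix is reserved in user-space C):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy device: 'locked' models the phy mutex, 'suspended' the PHY state. */
struct toy_phy {
	bool locked;
	bool suspended;
};

static inline void toy_lock(struct toy_phy *p)
{
	assert(!p->locked);
	p->locked = true;
}

static inline void toy_unlock(struct toy_phy *p)
{
	assert(p->locked);
	p->locked = false;
}

/* Counterpart of __phy_resume(): the caller must already hold the lock. */
static inline int toy_phy_resume_locked(struct toy_phy *p)
{
	assert(p->locked);	/* same idea as WARN_ON(!mutex_is_locked(...)) */
	p->suspended = false;
	return 0;
}

/* Counterpart of phy_resume(): takes the lock, so old callers keep working. */
static inline int toy_phy_resume(struct toy_phy *p)
{
	int ret;

	toy_lock(p);
	ret = toy_phy_resume_locked(p);
	toy_unlock(p);
	return ret;
}
```

Callers that already hold the lock use the `_locked` variant; everyone else calls the plain function and never sees the locking requirement.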
  10. 27 Feb, 2018 · 4 commits
    • dax: fix vma_is_fsdax() helper · 230f5a89
      Committed by Dan Williams
      Gerd reports that ->i_mode may contain other bits besides S_IFCHR. Use
      S_ISCHR() instead. Otherwise, get_user_pages_longterm() may fail on
      device-dax instances when those are meant to be explicitly allowed.
      
      Fixes: 2bb6d283 ("mm: introduce get_user_pages_longterm")
      Cc: <stable@vger.kernel.org>
      Reported-by: Gerd Rausch <gerd.rausch@oracle.com>
      Acked-by: Jane Chu <jane.chu@oracle.com>
      Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • genhd: Fix BUG in blkdev_open() · 56c0908c
      Committed by Jan Kara
      When two blkdev_open() calls for a partition race with device removal
      and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
      blkdev_open(). The race can happen as follows:
      
      CPU0				CPU1			CPU2
      							del_gendisk()
      							  bdev_unhash_inode(part1);
      
      blkdev_open(part1, O_EXCL)	blkdev_open(part1, O_EXCL)
        bdev = bd_acquire()		  bdev = bd_acquire()
        blkdev_get(bdev)
          bd_start_claiming(bdev)
            - finds old inode 'whole'
            bd_prepare_to_claim() -> 0
      							  bdev_unhash_inode(whole);
      							<device removed>
      							<new device under same
      							 number created>
      				  blkdev_get(bdev);
      				    bd_start_claiming(bdev)
      				      - finds new inode 'whole'
      				      bd_prepare_to_claim()
      					- this also succeeds as we have
      					  different 'whole' here...
      					- bad things happen now as we
      					  have two exclusive openers of
      					  the same bdev
      
      The problem here is that block device opens can see various intermediate
      states while gendisk is shutting down and then being recreated.
      
      We fix the problem by introducing new lookup_sem in gendisk that
      synchronizes gendisk deletion with get_gendisk() and furthermore by
      making sure that get_gendisk() does not return gendisk that is being (or
      has been) deleted. This makes sure that once we ever manage to look up
      newly created bdev inode, we are also guaranteed that following
      get_gendisk() will either return failure (and we fail open) or it
      returns gendisk for the new device and following bdget_disk() will
      return new bdev inode (i.e., blkdev_open() follows the path as if it is
      completely run after new device is created).
      
      Reported-and-analyzed-by: Hou Tao <houtao1@huawei.com>
      Tested-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • genhd: Add helper put_disk_and_module() · 9df6c299
      Committed by Jan Kara
      Add put_disk_and_module(), a proper counterpart to
      get_disk_and_module(). Currently it is open-coded in several places.
      
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • genhd: Rename get_disk() to get_disk_and_module() · 3079c22e
      Committed by Jan Kara
      Rename get_disk() to get_disk_and_module() to make clear what the
      function does. It's not a great name but at least it is now clear that
      put_disk() is not its counterpart.
      
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 24 Feb, 2018 · 2 commits
  12. 23 Feb, 2018 · 2 commits
  13. 22 Feb, 2018 · 5 commits
    • bug.h: work around GCC PR82365 in BUG() · 173a3efd
      Committed by Arnd Bergmann
      Looking at functions with large stack frames across all architectures
      led me to discover that BUG() suffers from the same problem as
      fortify_panic(), which I've added a workaround for already.
      
      In short, variables that go out of scope by calling a noreturn function
      or __builtin_unreachable() keep using stack space in functions
      afterwards.
      
      A workaround that was identified is to insert an empty assembler
      statement just before calling the function that doesn't return.  I'm
      adding a macro "barrier_before_unreachable()" to document this, and
      insert calls to that in all instances of BUG() that currently suffer
      from this problem.
      
      The files that saw the largest change from this had these frame sizes
      before, and much less with my patch:
      
        fs/ext4/inode.c:82:1: warning: the frame size of 1672 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/ext4/namei.c:434:1: warning: the frame size of 904 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/ext4/super.c:2279:1: warning: the frame size of 1160 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/ext4/xattr.c:146:1: warning: the frame size of 1168 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/f2fs/inode.c:152:1: warning: the frame size of 1424 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_core.c:1195:1: warning: the frame size of 1068 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_core.c:395:1: warning: the frame size of 1084 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_ftp.c:298:1: warning: the frame size of 928 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_ftp.c:418:1: warning: the frame size of 908 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_lblcr.c:718:1: warning: the frame size of 960 bytes is larger than 800 bytes [-Wframe-larger-than=]
        drivers/net/xen-netback/netback.c:1500:1: warning: the frame size of 1088 bytes is larger than 800 bytes [-Wframe-larger-than=]
      
      In case of ARC and CRIS, it turns out that the BUG() implementation
      actually does return (or at least the compiler thinks it does),
      resulting in lots of warnings about uninitialized variable use and
      leaving noreturn functions, such as:
      
        block/cfq-iosched.c: In function 'cfq_async_queue_prio':
        block/cfq-iosched.c:3804:1: error: control reaches end of non-void function [-Werror=return-type]
        include/linux/dmaengine.h: In function 'dma_maxpq':
        include/linux/dmaengine.h:1123:1: error: control reaches end of non-void function [-Werror=return-type]
      
      This makes them call __builtin_trap() instead, which should normally
      dump the stack and kill the current process, like some of the other
      architectures already do.
      
      I tried adding barrier_before_unreachable() to panic() and
      fortify_panic() as well, but that had very little effect, so I'm not
      submitting that patch.
      
      Vineet said:
      
      : For ARC, it is double win.
      :
      : 1. Fixes 3 -Wreturn-type warnings
      :
      : | ../net/core/ethtool.c:311:1: warning: control reaches end of non-void function
      : [-Wreturn-type]
      : | ../kernel/sched/core.c:3246:1: warning: control reaches end of non-void function
      : [-Wreturn-type]
      : | ../include/linux/sunrpc/svc_xprt.h:180:1: warning: control reaches end of
      : non-void function [-Wreturn-type]
      :
      : 2.  bloat-o-meter reports code size improvements as gcc elides the
      :    generated code for stack return.
      
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82365
      Link: http://lkml.kernel.org/r/20171219114112.939391-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>	[arch/arc]
      Tested-by: Vineet Gupta <vgupta@synopsys.com>	[arch/arc]
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Christopher Li <sparse@chrisli.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
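The workaround described in the bug.h commit above, an empty assembler statement placed just before the call that does not return, can be sketched in user space as follows. The macro and function names are hypothetical (`toy_` prefixed), the `__asm__` syntax assumes GCC or Clang, and the sketch only shows where the barrier sits; it does not demonstrate the stack-frame reduction itself.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Empty asm statement: emits no instructions, but the compiler cannot
 * reason across it (GCC/Clang extension). */
#define toy_barrier_before_unreachable() __asm__ volatile("")

/* Toy BUG(): report, place the barrier, then call a noreturn function. */
#define TOY_BUG() do {						\
	fprintf(stderr, "BUG at %s:%d\n", __FILE__, __LINE__);	\
	toy_barrier_before_unreachable();			\
	abort();						\
} while (0)

/* Example user: every valid class returns, anything else is a bug. */
static inline int toy_class_to_index(int ioprio_class)
{
	switch (ioprio_class) {
	case 1:
		return 0;
	case 2:
		return 1;
	case 3:
		return 2;
	default:
		TOY_BUG();
	}
}

static inline void toy_bug_demo(void)
{
	assert(toy_class_to_index(2) == 1);	/* valid input never hits TOY_BUG() */
}
```

Because `abort()` is declared noreturn, the compiler accepts the function without a trailing return, which is the property the ARC and CRIS part of the commit is about.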
    • mm, mlock, vmscan: no more skipping pagevecs · 9c4e6b1a
      Committed by Shakeel Butt
      When a thread mlocks an address space backed either by file pages which
      are currently not present in memory or swapped out anon pages (not in
      swapcache), a new page is allocated and added to the local pagevec
      (lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
      On I/O completion, the thread can wake on a different CPU, the mlock
      syscall will then set the PageMlocked() bit of the page but will not be
      able to put that page in the unevictable LRU as the page is on the pagevec
      of a different CPU.  Even on drain, that page will go to the evictable LRU
      because the PageMlocked() bit is not checked on pagevec drain.
      
      The page will eventually go to the right LRU on reclaim but the LRU stats
      will remain skewed for a long time.
      
      This patch puts all the pages, even unevictable ones, on the pagevecs and on
      the drain, the pages will be added to their LRUs correctly by checking
      their evictability.  This resolves the issue of mlocked pages sitting on
      the pagevecs of other CPUs, because when those pagevecs are drained, the
      mlocked file pages will go to the unevictable LRU.  It also makes the race
      with munlock easier to resolve because the pagevec drains happen under the
      LRU lock.
      
      However there is still one place which makes a page evictable and does a
      PageLRU check on that page without the LRU lock, and it needs special
      attention: TestClearPageMlocked() and isolate_lru_page() in
      clear_page_mlock().
      
      	#0: __pagevec_lru_add_fn	#1: clear_page_mlock
      
      	SetPageLRU()			if (!TestClearPageMlocked())
      					  return
      	smp_mb() // <--required
      					// inside does PageLRU
      	if (!PageMlocked())		if (isolate_lru_page())
      	  move to evictable LRU		  putback_lru_page()
      	else
      	  move to unevictable LRU
      
      In '#1', TestClearPageMlocked() provides full memory barrier semantics
      and thus the PageLRU check (inside isolate_lru_page) can not be
      reordered before it.
      
      In '#0', without an explicit memory barrier, the PageMlocked() check can be
      reordered before SetPageLRU().  If that happens, '#0' can put a page in the
      unevictable LRU while '#1' has just cleared the Mlocked bit of that
      page but fails to isolate it, because the PageLRU check fails as '#0' still
      hasn't set the PageLRU bit of that page.  That page will be stranded on the
      unevictable LRU.
      
      There is one (good) side effect though.  Without this patch, the pages
      allocated for System V shared memory segment are added to evictable LRUs
      even after shmctl(SHM_LOCK) on that segment.  This patch will correctly
      put such pages to unevictable LRU.
      
      Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats · c3cc3911
      Committed by Johannes Weiner
      After commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting"), we observed slowly upward creeping NR_WRITEBACK
      counts over the course of several days, both the per-memcg stats as well
      as the system counter in e.g.  /proc/meminfo.
      
      The conversion from full per-cpu stat counts to per-cpu cached atomic
      stat counts introduced an irq-unsafe RMW operation into the updates.
      
      Most stat updates come from process context, but one notable exception
      is the NR_WRITEBACK counter.  While writebacks are issued from process
      context, they are retired from (soft)irq context.
      
      When writeback completions interrupt the RMW counter updates of new
      writebacks being issued, the decrements from the completions are lost.
      
      Since the global updates are routed through the joint lruvec API, both
      the memcg counters and the system counters are affected.
      
      This patch makes the joint stat and event API irq safe.
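
The lost decrement described above can be reproduced deterministically in user space. The following is a minimal sketch, not the kernel code: every `demo_*` name is hypothetical, interrupts are simulated with two flags, and the "interrupt" is raised by hand in the window between the load and the store of the read-modify-write.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one cached stat counter (e.g. NR_WRITEBACK). */
static long demo_nr_writeback;

/* Simulated interrupt state for a single CPU (assumption, not kernel API). */
static bool demo_irqs_disabled;
static bool demo_irq_pending;

/* Simulated (soft)irq handler: a writeback completes, so decrement. */
static void demo_completion_handler(void)
{
    demo_nr_writeback -= 1;
}

/* Raise the simulated interrupt: it runs at once unless interrupts are masked. */
static void demo_raise_irq(void)
{
    if (demo_irqs_disabled)
        demo_irq_pending = true;
    else
        demo_completion_handler();
}

/* Mask simulated interrupts. */
static void demo_irq_save(void)
{
    demo_irqs_disabled = true;
}

/* Unmask simulated interrupts and deliver anything that was deferred. */
static void demo_irq_restore(void)
{
    assert(demo_irqs_disabled);
    demo_irqs_disabled = false;
    if (demo_irq_pending) {
        demo_irq_pending = false;
        demo_completion_handler();
    }
}

/* irq-unsafe RMW: the interrupt lands between the load and the store. */
static void demo_mod_stat_unsafe(long delta)
{
    long tmp = demo_nr_writeback;    /* read */
    demo_raise_irq();                /* completion decrements the counter */
    demo_nr_writeback = tmp + delta; /* write overwrites that decrement */
}

/* irq-safe RMW: the interrupt is deferred until the update is complete. */
static void demo_mod_stat_safe(long delta)
{
    demo_irq_save();
    long tmp = demo_nr_writeback;    /* read */
    demo_raise_irq();                /* deferred: interrupts are masked */
    demo_nr_writeback = tmp + delta; /* write */
    demo_irq_restore();              /* deferred decrement is applied here */
}

/* One writeback in flight; issue one more while the first one completes. */
long demo_run(bool irq_safe)
{
    demo_nr_writeback = 1;
    demo_irqs_disabled = false;
    demo_irq_pending = false;
    if (irq_safe)
        demo_mod_stat_safe(1);
    else
        demo_mod_stat_unsafe(1);
    return demo_nr_writeback;
}
```

One writeback is outstanding after the sequence, so the correct count is 1. `demo_run(false)` returns 2 because the decrement is overwritten, which is the slow upward creep in miniature; `demo_run(true)` returns 1.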
      
      Link: http://lkml.kernel.org/r/20180203082353.17284-1-hannes@cmpxchg.org
      Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Debugged-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3cc3911
    • A
      Kbuild: always define endianess in kconfig.h · 101110f6
      Arnd Bergmann committed
      Build testing with LTO found a couple of files that get compiled
      differently depending on whether asm/byteorder.h gets included early
      enough or not.  In particular, include/asm-generic/qrwlock_types.h is
      affected by this, but there are probably others as well.
      
      The symptom is a series of LTO link time warnings, including these:
      
          net/netlabel/netlabel_unlabeled.h:223: error: type of 'netlbl_unlhsh_add' does not match original declaration [-Werror=lto-type-mismatch]
           int netlbl_unlhsh_add(struct net *net,
          net/netlabel/netlabel_unlabeled.c:377: note: 'netlbl_unlhsh_add' was previously declared here
      
          include/net/ipv6.h:360: error: type of 'ipv6_renew_options_kern' does not match original declaration [-Werror=lto-type-mismatch]
           ipv6_renew_options_kern(struct sock *sk,
          net/ipv6/exthdrs.c:1162: note: 'ipv6_renew_options_kern' was previously declared here
      
          net/core/dev.c:761: note: 'dev_get_by_name_rcu' was previously declared here
           struct net_device *dev_get_by_name_rcu(struct net *net, const char *name)
          net/core/dev.c:761: note: code may be misoptimized unless -fno-strict-aliasing is used
      
          drivers/gpu/drm/i915/i915_drv.h:3377: error: type of 'i915_gem_object_set_to_wc_domain' does not match original declaration [-Werror=lto-type-mismatch]
           i915_gem_object_set_to_wc_domain(struct drm_i915_gem_object *obj, bool write);
          drivers/gpu/drm/i915/i915_gem.c:3639: note: 'i915_gem_object_set_to_wc_domain' was previously declared here
      
          include/linux/debugfs.h:92:9: error: type of 'debugfs_attr_read' does not match original declaration [-Werror=lto-type-mismatch]
           ssize_t debugfs_attr_read(struct file *file, char __user *buf,
          fs/debugfs/file.c:318: note: 'debugfs_attr_read' was previously declared here
      
          include/linux/rwlock_api_smp.h:30: error: type of '_raw_read_unlock' does not match original declaration [-Werror=lto-type-mismatch]
           void __lockfunc _raw_read_unlock(rwlock_t *lock) __releases(lock);
          kernel/locking/spinlock.c:246:26: note: '_raw_read_unlock' was previously declared here
      
          include/linux/fs.h:3308:5: error: type of 'simple_attr_open' does not match original declaration [-Werror=lto-type-mismatch]
           int simple_attr_open(struct inode *inode, struct file *file,
          fs/libfs.c:795: note: 'simple_attr_open' was previously declared here
      
      All of the above are caused by include/asm-generic/qrwlock_types.h
      failing to include asm/byteorder.h after commit e0d02285
      ("locking/qrwlock: Use 'struct qrwlock' instead of 'struct __qrwlock'")
      in linux-4.15.
      
      Similar bugs may or may not exist in older kernels as well, but there is
      no easy way to test those with link-time optimizations, and kernels
      before 4.14 are harder to fix because they don't have Babu's patch
      series.
      
      We had similar issues with CONFIG_ symbols in the past and ended up
      always including the configuration headers through linux/kconfig.h.  This
      works around the issue through that same file, defining either
      __BIG_ENDIAN or __LITTLE_ENDIAN depending on CONFIG_CPU_BIG_ENDIAN,
      which is now always set on all architectures since commit 4c97a0c8
      ("arch: define CPU_BIG_ENDIAN for all fixed big endian archs").
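
As a rough illustration of why this matters, the sketch below derives a byte-order macro from the configuration symbol alone and uses it to pick a structure layout. The `DEMO_*` macros and `struct demo_qrwlock` are hypothetical stand-ins (the commit itself defines `__BIG_ENDIAN` or `__LITTLE_ENDIAN`), the numeric values simply follow the conventional 4321/1234 encoding, and the layout is only loosely modeled on a lock word whose field order depends on byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Derive the byte-order macro from the configuration symbol, so the result
 * does not depend on which headers happen to be included first.
 * DEMO_BIG_ENDIAN / DEMO_LITTLE_ENDIAN are hypothetical names.
 */
#ifdef CONFIG_CPU_BIG_ENDIAN
#define DEMO_BIG_ENDIAN 4321
#else
#define DEMO_LITTLE_ENDIAN 1234
#endif

/* A 32-bit lock word whose byte-sized field moves with the byte order. */
struct demo_qrwlock {
    union {
        uint32_t cnts;
        struct {
#ifdef DEMO_LITTLE_ENDIAN
            uint8_t wlocked;   /* low byte of cnts on little endian */
            uint8_t lstate[3];
#else
            uint8_t lstate[3];
            uint8_t wlocked;   /* low byte of cnts on big endian */
#endif
        };
    };
};

/* The layout choice must never change the overall size. */
static_assert(sizeof(struct demo_qrwlock) == sizeof(uint32_t),
              "lock word must stay 32 bits");

/* Byte offset of wlocked as seen by this translation unit. */
size_t demo_wlocked_offset(void)
{
    return offsetof(struct demo_qrwlock, wlocked);
}
```

If one translation unit saw the macro and another did not, the two would disagree on this offset (0 versus 3) for the same type, which is the kind of mismatch the LTO diagnostics above report. Tying the macro to the always-included configuration header removes the dependence on include order.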
      
      Link: http://lkml.kernel.org/r/20180202154104.1522809-2-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Babu Moger <babu.moger@amd.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      101110f6
    • A
      include/linux/sched/mm.h: re-inline mmdrop() · d34bc48f
      Andrew Morton committed
      As Peter points out, doing a CALL+RET for just the decrement is a bit silly.
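
The general pattern can be sketched in user space as follows: the reference-count decrement stays in a `static inline` fast path, and only the rare release work lives in an out-of-line function. This is an illustration under assumptions, not the kernel's implementation; `struct demo_mm` and the `demo_*` functions are hypothetical, and C11 atomics stand in for the kernel's atomic helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted object. */
struct demo_mm {
    atomic_int count;
    bool freed;
};

/* Out-of-line slow path: reached only when the last reference is dropped. */
void demo_mm_release(struct demo_mm *mm)
{
    mm->freed = true;
}

/* Inline fast path: the common case is a single atomic decrement, no call. */
static inline void demo_mmdrop(struct demo_mm *mm)
{
    /* atomic_fetch_sub returns the old value; 1 means this was the last ref. */
    if (atomic_fetch_sub(&mm->count, 1) == 1)
        demo_mm_release(mm);
}

/* Drop `drops` of `initial` references and report whether release ran. */
bool demo_drop_n(int initial, int drops)
{
    struct demo_mm mm = { .freed = false };

    assert(drops <= initial);
    atomic_init(&mm.count, initial);
    for (int i = 0; i < drops; i++)
        demo_mmdrop(&mm);
    return mm.freed;
}
```

With two references held, `demo_drop_n(2, 1)` leaves the object alive and `demo_drop_n(2, 2)` triggers the release. Only the final drop pays for a function call.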
      
      Fixes: d70f2a14 ("include/linux/sched/mm.h: uninline mmdrop_async(), etc")
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d34bc48f
  14. 21 Feb, 2018 3 commits
  15. 20 Feb, 2018 6 commits
  16. 17 Feb, 2018 5 commits
  17. 16 Feb, 2018 1 commit