1. 06 Nov, 2015 17 commits
  2. 02 Nov, 2015 1 commit
    • mm: get rid of 'vmalloc_info' from /proc/meminfo · a5ad88ce
      Linus Torvalds committed
      It turns out that at least some versions of glibc end up reading
      /proc/meminfo at every single startup, because glibc wants to know the
      amount of memory the machine has.  And while that's arguably insane,
      it's just how things are.
      
      And it turns out that it's not all that expensive most of the time, but
      the vmalloc information statistics (amount of virtual memory used in the
      vmalloc space, and the biggest remaining chunk) can be rather expensive
      to compute.
      
      The 'get_vmalloc_info()' function actually showed up on my profiles as
      4% of the CPU usage of "make test" in the git source repository, because
      the git tests are lots of very short-lived shell-scripts etc.
      
      It turns out that apparently this same silly vmalloc info gathering
      shows up on the facebook servers too, according to Dave Jones.  So it's
      not just "make test" for git.
      
      We had two patches to just cache the information (one by me, one by
      Ingo) to mitigate this issue, but the whole vmalloc information is of
      rather dubious value to begin with, and people who *actually* want to
      know what the situation is wrt the vmalloc area should just look at the
      much more complete /proc/vmallocinfo instead.
      
      In fact, according to my testing - and perhaps more importantly,
      according to that big search engine in the sky: Google - there is
      nothing out there that actually cares about those two expensive fields:
      VmallocUsed and VmallocChunk.
      
      So let's try to just remove them entirely.  Actually, this just removes
      the computation and reports the numbers as zero for now, just to try to
      be minimally intrusive.
      
      If this breaks anything, we'll obviously have to re-introduce the code
      to compute this all and add the caching patches on top.  But if given
      the option, I'd really prefer to just remove this bad idea entirely
      rather than add even more code to work around our historical mistake
      that likely nobody really cares about.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 01 Nov, 2015 2 commits
    • vfs: conditionally clear close-on-exec flag · fc90888d
      Linus Torvalds committed
      We clear the close-on-exec flag when opening and closing files, and the
      bit was almost always already clear before.  Avoid dirtying the
      cacheline if the clearing isn't necessary.  That avoids unnecessary
      cacheline dirtying and bouncing in multi-socket environments.
      
      Eric Dumazet has a file descriptor benchmark that goes 4% faster from
      this on his two-socket machine.  It's probably partly superlinear
      improvement due to getting slightly less spinlock contention on the
      file_lock spinlock due to less work in the critical section.
      Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
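The test-before-clear idea in the commit above can be sketched in userspace C. This is a minimal illustration, not the kernel's code: `bit_is_set()` and `clear_bit_if_set()` are hypothetical helpers, and the bitmap is a plain array rather than the kernel's fd table.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: report whether bit `nr` is set in `map`. */
static bool bit_is_set(const unsigned long *map, size_t nr)
{
    assert(map != NULL);
    return (map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

/*
 * Hypothetical helper: clear bit `nr` only if it is currently set.
 * Returns true when a store was actually performed. When the bit is
 * already clear, the word is only read, so the cache line holding it
 * is not written to.
 */
static bool clear_bit_if_set(unsigned long *map, size_t nr)
{
    if (!bit_is_set(map, nr))
        return false;
    map[nr / BITS_PER_WORD] &= ~(1UL << (nr % BITS_PER_WORD));
    return true;
}
```

The extra load is cheap compared with an unconditional store when, as the commit notes, the bit is almost always clear already.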
    • vfs: Fix pathological performance case for __alloc_fd() · f3f86e33
      Linus Torvalds committed
      Al Viro points out that:
      > >     * [Linux-specific aside] our __alloc_fd() can degrade quite badly
      > > with some use patterns.  The cacheline pingpong in the bitmap is probably
      > > inevitable, unless we accept considerably heavier memory footprint,
      > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
      > > to trigger - close(3);open(...); will have the next open() after that
      > > scanning the entire in-use bitmap.
      
      And Eric Dumazet has a somewhat realistic multithreaded microbenchmark
      that opens and closes a lot of sockets with minimal work per socket.
      
      This patch largely fixes it.  We keep a 2nd-level bitmap of the open
      file bitmaps, showing which words are already full.  So then we can
      traverse that second-level bitmap to efficiently skip already allocated
      file descriptors.
      
      On his benchmark, this improves performance by up to an order of
      magnitude, by avoiding the excessive open file bitmap scanning.
      Tested-and-acked-by: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
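The second-level bitmap described in the commit above can be sketched as follows. This is a simplified userspace model under stated assumptions: the capacity is fixed at 8 words, the second level fits in one word, and `struct fd_bitmap`, `find_free_fd()`, `mark_fd_used()` and `mark_fd_free()` are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS    8                      /* illustrative capacity */
#define MAX_FDS   (NWORDS * WORD_BITS)

struct fd_bitmap {
    unsigned long open_fds[NWORDS];  /* bit set: descriptor is in use */
    unsigned long full_words;        /* bit i set: open_fds[i] is all ones */
};

/* Return the lowest free descriptor >= start, or MAX_FDS if none. */
static size_t find_free_fd(const struct fd_bitmap *m, size_t start)
{
    assert(m != NULL);
    size_t fd = start;

    while (fd < MAX_FDS) {
        size_t w = fd / WORD_BITS;

        if (m->full_words & (1UL << w)) {
            /* Whole word is allocated: skip it in one step. */
            fd = (w + 1) * WORD_BITS;
            continue;
        }
        if (!(m->open_fds[w] & (1UL << (fd % WORD_BITS))))
            return fd;
        fd++;
    }
    return MAX_FDS;
}

/* Mark a descriptor as used and keep the second level in sync. */
static void mark_fd_used(struct fd_bitmap *m, size_t fd)
{
    assert(m != NULL && fd < MAX_FDS);
    size_t w = fd / WORD_BITS;

    m->open_fds[w] |= 1UL << (fd % WORD_BITS);
    if (m->open_fds[w] == ~0UL)
        m->full_words |= 1UL << w;
}

/* Mark a descriptor as free; its word can no longer be full. */
static void mark_fd_free(struct fd_bitmap *m, size_t fd)
{
    assert(m != NULL && fd < MAX_FDS);
    size_t w = fd / WORD_BITS;

    m->open_fds[w] &= ~(1UL << (fd % WORD_BITS));
    m->full_words &= ~(1UL << w);
}
```

With many low descriptors in use, the search skips each full word after a single test of `full_words` instead of scanning every bit in it.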
  4. 28 Oct, 2015 1 commit
    • fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs() · b33e18f6
      Tejun Heo committed
      bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
      to walk @bdi->wb_list.  To set up the initial iteration
      condition, it uses list_entry_rcu() to calculate the entry
      pointer corresponding to the list head; however, this isn't an
      actual RCU dereference and using list_entry_rcu() for it ended
      up breaking a proposed list_entry_rcu() change because it was
      feeding a non-lvalue pointer into the macro.
      
      Don't use the RCU variant for simple pointer offsetting.  Use
      list_entry() instead.
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Patrick Marlier <patrick.marlier@gmail.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pranith kumar <bobby.prani@gmail.com>
      Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
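The "simple pointer offsetting" the commit above refers to can be illustrated with a userspace stand-in for list_entry(). The macro below mirrors the usual container_of arithmetic; `struct wb` and `wb_from_node()` are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/*
 * Userspace stand-in for list_entry(): subtract the member's offset
 * from the member's address to get the containing structure. No
 * memory is dereferenced, so no RCU primitive is involved.
 */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical entry type with an embedded list node. */
struct wb {
    int id;
    struct list_head node;
};

static struct wb *wb_from_node(struct list_head *n)
{
    assert(n != NULL);
    return list_entry(n, struct wb, node);
}
```

Because the computation only adjusts an address, it works on any pointer expression, including the address of a list head used to seed an iteration.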
  5. 23 Oct, 2015 1 commit
  6. 22 Oct, 2015 2 commits
    • btrfs: fix possible leak in btrfs_ioctl_balance() · 0f89abf5
      Christian Engelmayer committed
      Commit 8eb93459 ("btrfs: check unsupported filters in balance
      arguments") adds a jump to exit label out_bargs in case the argument
      check fails. At this point in addition to the bargs memory, the
      memory for struct btrfs_balance_control has already been allocated.
      Ownership of bctl is passed to btrfs_balance() in the good case,
      thus the memory is not freed due to the introduced jump. Make sure
      that the memory gets freed in any case as necessary. Detected by
      Coverity CID 1328378.
      Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
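The ownership problem in the commit above follows a common cleanup-label pattern, sketched here in userspace C. All names (`tracked_alloc()`, `run_balance()`, `balance_ioctl()`) and sizes are hypothetical; the counter exists only so the sketch can show that neither path leaks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int live_allocs;   /* illustrative leak counter */

static void *tracked_alloc(size_t n)
{
    void *p = malloc(n);
    if (p)
        live_allocs++;
    return p;
}

static void tracked_free(void *p)
{
    if (p) {
        assert(live_allocs > 0);
        live_allocs--;
        free(p);
    }
}

/* Hypothetical consumer: takes ownership of ctl and frees it. */
static int run_balance(void *ctl)
{
    tracked_free(ctl);
    return 0;
}

static int balance_ioctl(bool args_valid)
{
    int ret;
    void *ctl;
    void *args = tracked_alloc(64);

    if (!args)
        return -1;

    ctl = tracked_alloc(128);
    if (!ctl) {
        ret = -1;
        goto out_args;
    }

    if (!args_valid) {
        /* Both buffers exist here, so both must be released. */
        ret = -2;
        goto out_ctl;
    }

    ret = run_balance(ctl);   /* good case: ownership of ctl passes on */
    goto out_args;

out_ctl:
    tracked_free(ctl);
out_args:
    tracked_free(args);
    return ret;
}
```

Jumping straight to `out_args` from the validation failure would reproduce the kind of leak the commit describes, since `ctl` is only freed by its consumer in the good case.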
    • block: Inline blk_integrity in struct gendisk · 25520d55
      Martin K. Petersen committed
      Up until now the integrity profile has been dynamically allocated and
      attached to struct gendisk after the disk has been made active.
      
      This causes problems because NVMe devices need to register the profile
      prior to the partition table being read due to a mandatory metadata
      buffer requirement. In addition, DM goes through hoops to deal with
      preallocating, but not initializing integrity profiles.
      
      Since the integrity profile is small (4 bytes + a pointer), Christoph
      suggested moving it to struct gendisk proper. This requires several
      changes:
      
       - Moving the blk_integrity definition to genhd.h.
      
       - Inlining blk_integrity in struct gendisk.
      
       - Removing the dynamic allocation code.
      
       - Adding helper functions which allow gendisk to set up and tear down
         the integrity sysfs dir when a disk is added/deleted.
      
       - Adding a blk_integrity_revalidate() callback for updating the stable
         pages bdi setting.
      
       - The calls that depend on whether a device has an integrity profile or
         not now key off of the bi->profile pointer.
      
       - Simplifying the integrity support routines in DM (Mike Snitzer).
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reported-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
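The shape of the change described above (a small profile structure embedded in the disk, with "has integrity" decided by a pointer test) might look roughly like this. Everything here is a hypothetical model: the `demo_*` types, field names and helpers are assumptions for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical profile descriptor. */
struct demo_profile {
    const char *name;
};

/* Small fixed-size state: one pointer plus four one-byte fields. */
struct demo_integrity {
    const struct demo_profile *profile;  /* NULL: nothing registered */
    unsigned char flags;
    unsigned char tuple_size;
    unsigned char interval_exp;
    unsigned char tag_size;
};

/* The integrity state lives inside the disk; no separate allocation. */
struct demo_disk {
    int minors;
    struct demo_integrity integrity;
};

/* Callers key off the profile pointer instead of a NULL allocation. */
static struct demo_integrity *demo_get_integrity(struct demo_disk *disk)
{
    assert(disk != NULL);
    return disk->integrity.profile ? &disk->integrity : NULL;
}

/* Registration can happen at any time, since the storage always exists. */
static void demo_register_integrity(struct demo_disk *disk,
                                    const struct demo_profile *p,
                                    unsigned char tuple_size)
{
    assert(disk != NULL && p != NULL);
    disk->integrity.profile = p;
    disk->integrity.tuple_size = tuple_size;
}
```

Since the storage is part of the disk structure, a driver can fill it in before the disk becomes active, which is the ordering problem the commit message raises for NVMe.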
  7. 21 Oct, 2015 1 commit
    • btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size · 0f6925fa
      Qu Wenruo committed
      Current code will always truncate tailing page if its alloc_start is
      smaller than inode size.
      
      For example, the file extent layout is like:
      0	4K	8K	16K	32K
      |<-----Extent A---------------->|
      |<--Inode size: 18K---------->|
      
      But calling fallocate even for the range [0,4K) causes btrfs to
      re-truncate the range [16K,32K), triggering COW and a new extent.
      
      0	4K	8K	16K	32K
      |///////|	<- Fallocate call range
      |<-----Extent A-------->|<--B-->|
      
      The cause is simple: a careless btrfs_truncate_inode() call in an
      else branch with no additional check.
      Fix it by checking whether the fallocate range extends beyond isize.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
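The added check can be modelled as a small predicate. This is a sketch under assumptions: `should_truncate_tail_page()` is a hypothetical name, offsets are plain byte counts, and the case where the range starts beyond the inode size is treated as out of scope rather than modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: does a fallocate of [offset, offset + len)
 * need to touch the tail page of a file whose size is isize?
 * Only ranges that extend past isize qualify.
 */
static bool should_truncate_tail_page(uint64_t offset, uint64_t len,
                                      uint64_t isize)
{
    assert(len > 0);
    if (offset > isize)
        return false;   /* starts past EOF: a different branch, not modelled */
    return offset + len > isize;
}
```

With the 18K inode from the example, a call for [0,4K) returns false, so the tail page is left alone and no new extent is created.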
  8. 19 Oct, 2015 1 commit
  9. 18 Oct, 2015 4 commits
  10. 17 Oct, 2015 2 commits
    • mm, dax: fix DAX deadlocks · 0f90cc66
      Ross Zwisler committed
      The following two locking commits in the DAX code:
      
      commit 84317297 ("dax: fix race between simultaneous faults")
      commit 46c043ed ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")
      
      introduced a number of deadlocks and other issues which need to be fixed
      for the v4.3 kernel.  The list of issues in DAX after these commits
      (some newly introduced by the commits, some preexisting) can be found
      here:
      
        https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault").
      
      This undoes most of the changes introduced by those two commits,
      essentially returning us to the DAX locking scheme that was used in
      v4.2.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, fs: obey gfp_mapping for add_to_page_cache() · 063d99b4
      Michal Hocko committed
      Commit 6afdb859 ("mm: do not ignore mapping_gfp_mask in page cache
      allocation paths") has caught some users of hardcoded GFP_KERNEL used in
      the page cache allocation paths.  This, however, wasn't complete and
      there were others which went unnoticed.
      
      Dave Chinner has reported the following deadlock for xfs on loop device:
      : With the recent merge of the loop device changes, I'm now seeing
      : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
      :
      : The deadlock is as follows:
      :
      : kloopd1: loop_queue_read_work
      :       xfs_file_iter_read
      :       lock XFS inode XFS_IOLOCK_SHARED (on image file)
      :       page cache read (GFP_KERNEL)
      :       radix tree alloc
      :       memory reclaim
      :       reclaim XFS inodes
      :       log force to unpin inodes
      :       <wait for log IO completion>
      :
      : xfs-cil/loop1: <does log force IO work>
      :       xlog_cil_push
      :       xlog_write
      :       <loop issuing log writes>
      :               xlog_state_get_iclog_space()
      :               <blocks due to all log buffers under write io>
      :               <waits for IO completion>
      :
      : kloopd1: loop_queue_write_work
      :       xfs_file_write_iter
      :       lock XFS inode XFS_IOLOCK_EXCL (on image file)
      :       <wait for inode to be unlocked>
      :
      : i.e. the kloopd, with its split read and write work queues, has
      : introduced a dependency through memory reclaim. i.e. that writes
      : need to be able to progress for reads to make progress.
      :
      : The problem, fundamentally, is that mpage_readpages() does a
      : GFP_KERNEL allocation, rather than paying attention to the inode's
      : mapping gfp mask, which is set to GFP_NOFS.
      :
      : This didn't use to happen, because the loop device used to issue
      : reads through the splice path and that does:
      :
      :       error = add_to_page_cache_lru(page, mapping, index,
      :                       GFP_KERNEL & mapping_gfp_mask(mapping));
      
      This was changed by commit aa4d8616 ("block: loop: switch to VFS
      ITER_BVEC").
      
      This patch changes mpage_readpage{s} to follow gfp mask set for the
      mapping.  There are, however, other places which are doing basically the
      same.
      
      lustre:ll_dir_filler is doing GFP_KERNEL from the function which
      apparently uses GFP_NOFS for other allocations so let's make this
      consistent.
      
      cifs:readpages_get_pages is called from cifs_readpages and
      __cifs_readpages_from_fscache called from the same path obeys mapping
      gfp.
      
      ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well,
      even though it uses mapping_gfp_mask for the page allocation.
      
      ext4_mpage_readpages is called from the page cache allocation path,
      the same as read_pages and read_cache_pages.
      
      As I noted in my previous post, I cannot say I would be happy about
      sprinkling mapping_gfp_mask all over the place, and it sounds like we
      should drop the gfp_mask argument altogether and use it internally in
      __add_to_page_cache_locked.  That would require all the filesystems to
      use mapping gfp consistently, which I am not sure is the case here.
      From a quick glance it seems that some file systems use it all the
      time while others are selective.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
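The `GFP_KERNEL & mapping_gfp_mask(mapping)` idiom quoted in the commit above is a plain bitwise intersection: the caller asks for its usual flags, and the mapping strips any it cannot tolerate. The sketch below models this in userspace; the `DEMO_*` flag values and `struct demo_mapping` are illustrative assumptions and do not match the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int gfp_t;

/* Illustrative flag bits only; not the kernel's values. */
#define DEMO_GFP_IO     0x1u   /* may start I/O during reclaim */
#define DEMO_GFP_FS     0x2u   /* may call back into the filesystem */
#define DEMO_GFP_WAIT   0x4u   /* may sleep */
#define DEMO_GFP_KERNEL (DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_GFP_NOFS   (DEMO_GFP_WAIT | DEMO_GFP_IO)

/* Hypothetical stand-in for an address space with its own gfp mask. */
struct demo_mapping {
    gfp_t gfp_mask;
};

/* Keep only the flags that both the caller and the mapping allow. */
static gfp_t constrain_gfp(const struct demo_mapping *mapping, gfp_t wanted)
{
    assert(mapping != NULL);
    return wanted & mapping->gfp_mask;
}
```

For a mapping restricted to the no-filesystem mask, the result drops the filesystem bit, so reclaim triggered by the allocation cannot recurse into the filesystem that holds the lock.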
  11. 14 Oct, 2015 3 commits
  12. 13 Oct, 2015 2 commits
    • writeback: bdi_writeback iteration must not skip dying ones · b817525a
      Tejun Heo committed
      bdi_for_each_wb() is used in several places to wake up or issue
      writeback work items to all wb's (bdi_writeback's) on a given bdi.
      The iteration is performed by walking bdi->cgwb_tree; however, the
      tree only indexes wb's which are currently active.
      
      For example, when a memcg gets associated with a different blkcg, the
      old wb is removed from the tree so that the new one can be indexed.
      The old wb starts dying from then on but will linger till all its
      inodes are drained.  As these dying wb's may still host dirty inodes,
      writeback operations which affect all wb's must include them.
      bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
      failing to sync the inodes belonging to those wb's.
      
      This patch adds an RCU protected @bdi->wb_list which lists all wb's
      belonging to that bdi.  wb's are added on creation and removed on
      release rather than on the start of destruction.  bdi_for_each_wb()
      usages are replaced with list_for_each[_continue]_rcu() iterations
      over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
      
      v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
          fixed and unnecessary list head severing in cgwb_bdi_destroy()
          removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
      Fixes: ebe41ab0 ("writeback: implement bdi_for_each_wb()")
      Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() · 6fdf860f
      Tejun Heo committed
      wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
      unfortunately, it was always waking up bdi->wb instead of the wb being
      walked.  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 001fe6f6 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
      Reviewed-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  13. 12 Oct, 2015 3 commits
    • ovl: free lower_mnt array in ovl_put_super · 5ffdbe8b
      Konstantin Khlebnikov committed
      This fixes a memory leak after umount.
      
      Kmemleak report:
      
      unreferenced object 0xffff8800ba791010 (size 8):
        comm "mount", pid 2394, jiffies 4294996294 (age 53.920s)
        hex dump (first 8 bytes):
          20 1c 13 02 00 88 ff ff                           .......
        backtrace:
          [<ffffffff811f8cd4>] create_object+0x124/0x2c0
          [<ffffffff817a059b>] kmemleak_alloc+0x7b/0xc0
          [<ffffffff811dffe6>] __kmalloc+0x106/0x340
          [<ffffffffa0152bfc>] ovl_fill_super+0x55c/0x9b0 [overlay]
          [<ffffffff81200ac4>] mount_nodev+0x54/0xa0
          [<ffffffffa0152118>] ovl_mount+0x18/0x20 [overlay]
          [<ffffffff81201ab3>] mount_fs+0x43/0x170
          [<ffffffff81220d34>] vfs_kern_mount+0x74/0x170
          [<ffffffff812233ad>] do_mount+0x22d/0xdf0
          [<ffffffff812242cb>] SyS_mount+0x7b/0xc0
          [<ffffffff817b6bee>] entry_SYSCALL_64_fastpath+0x12/0x76
          [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Fixes: dd662667 ("ovl: add mutli-layer infrastructure")
      Cc: <stable@vger.kernel.org> # v4.0+
    • ovl: free stack of paths in ovl_fill_super · 0f95502a
      Konstantin Khlebnikov committed
      This fixes a small memory leak after mount.
      
      Kmemleak report:
      
      unreferenced object 0xffff88003683fe00 (size 16):
        comm "mount", pid 2029, jiffies 4294909563 (age 33.380s)
        hex dump (first 16 bytes):
          20 27 1f bb 00 88 ff ff 40 4b 0f 36 02 88 ff ff   '......@K.6....
        backtrace:
          [<ffffffff811f8cd4>] create_object+0x124/0x2c0
          [<ffffffff817a059b>] kmemleak_alloc+0x7b/0xc0
          [<ffffffff811dffe6>] __kmalloc+0x106/0x340
          [<ffffffffa01b7a29>] ovl_fill_super+0x389/0x9a0 [overlay]
          [<ffffffff81200ac4>] mount_nodev+0x54/0xa0
          [<ffffffffa01b7118>] ovl_mount+0x18/0x20 [overlay]
          [<ffffffff81201ab3>] mount_fs+0x43/0x170
          [<ffffffff81220d34>] vfs_kern_mount+0x74/0x170
          [<ffffffff812233ad>] do_mount+0x22d/0xdf0
          [<ffffffff812242cb>] SyS_mount+0x7b/0xc0
          [<ffffffff817b6bee>] entry_SYSCALL_64_fastpath+0x12/0x76
          [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Fixes: a78d9f0d ("ovl: support multiple lower layers")
      Cc: <stable@vger.kernel.org> # v4.0+
    • ovl: fix open in stacked overlay · 1c8a47df
      Miklos Szeredi committed
      If two overlayfs filesystems are stacked on top of each other, then we need
      recursion in ovl_d_select_inode().
      
      I guess d_backing_inode() is supposed to do that.  But currently it doesn't
      and that functionality is open coded in vfs_open().  This is now copied
      into ovl_d_select_inode() to fix this regression.
      Reported-by: Alban Crequy <alban.crequy@gmail.com>
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Fixes: 4bacc9c9 ("overlayfs: Make f_path always point to the overlay...")
      Cc: David Howells <dhowells@redhat.com>
      Cc: <stable@vger.kernel.org> # v4.2+