1. 02 6月, 2015 1 次提交
    • T
      bdi: make inode_to_bdi() inline · a212b105
      Tejun Heo 提交于
      Now that bdi definitions are moved to backing-dev-defs.h,
      backing-dev.h can include blkdev.h and inline inode_to_bdi() without
      worrying about introducing circular include dependency.  The function
      gets called from hot paths and fairly trivial.
      
      This patch makes inode_to_bdi() and sb_is_blkdev_sb() that the
      function calls inline.  blockdev_superblock and noop_backing_dev_info
      are EXPORT_GPL'd to allow the inline functions to be used from
      modules.
      
      While at it, make sb_is_blkdev_sb() return bool instead of int.
      
      v2: Fixed typo in description as suggested by Jan.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a212b105
  2. 20 5月, 2015 1 次提交
    • J
      block: export blkdev_reread_part() and __blkdev_reread_part() · be324177
      Jarod Wilson 提交于
      This patch exports blkdev_reread_part() for block drivers, also
      introduce __blkdev_reread_part().
      
      For some drivers, such as loop, reread of partitions can be run
      from the release path, and bd_mutex may already be held prior to
      calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
      __blkdev_reread_part for use in such cases.
      
      CC: Christoph Hellwig <hch@lst.de>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Tejun Heo <tj@kernel.org>
      CC: Alexander Viro <viro@zeniv.linux.org.uk>
      CC: Markus Pargmann <mpa@pengutronix.de>
      CC: Stefan Weinhuber <wein@de.ibm.com>
      CC: Stefan Haberland <stefan.haberland@de.ibm.com>
      CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
      CC: Fabian Frederick <fabf@skynet.be>
      CC: Ming Lei <ming.lei@canonical.com>
      CC: David Herrmann <dh.herrmann@gmail.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: nbd-general@lists.sourceforge.net
      CC: linux-s390@vger.kernel.org
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJarod Wilson <jarod@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      be324177
  3. 25 4月, 2015 2 次提交
    • E
      fix I_DIO_WAKEUP definition · ac74d8d6
      Eric Sandeen 提交于
      I_DIO_WAKEUP is never directly used, but fix it up anyway.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ac74d8d6
    • J
      direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Jens Axboe 提交于
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fe0f07d0
  4. 17 4月, 2015 2 次提交
    • A
      proc: show locks in /proc/pid/fdinfo/X · 6c8c9031
      Andrey Vagin 提交于
      Let's show locks which are associated with a file descriptor in
      its fdinfo file.
      
      Currently we don't have a reliable way to determine who holds a lock.  We
      can find some information in /proc/locks, but PID which is reported there
      can be wrong.  For example, a process takes a lock, then forks a child and
      dies.  In this case /proc/locks contains the parent pid, which can be
      reused by another process.
      
      $ cat /proc/locks
      ...
      6: FLOCK  ADVISORY  WRITE 324 00:13:13431 0 EOF
      ...
      
      $ ps -C rpcbind
        PID TTY          TIME CMD
        332 ?        00:00:00 rpcbind
      
      $ cat /proc/332/fdinfo/4
      pos:	0
      flags:	0100000
      mnt_id:	22
      lock:	1: FLOCK  ADVISORY  WRITE 324 00:13:13431 0 EOF
      
      $ ls -l /proc/332/fd/4
      lr-x------ 1 root root 64 Mar  5 14:43 /proc/332/fd/4 -> /run/rpcbind.lock
      
      $ ls -l /proc/324/fd/
      total 0
      lrwx------ 1 root root 64 Feb 27 14:50 0 -> /dev/pts/0
      lrwx------ 1 root root 64 Feb 27 14:50 1 -> /dev/pts/0
      lrwx------ 1 root root 64 Feb 27 14:49 2 -> /dev/pts/0
      
      You can see that the process with the 324 pid doesn't hold the lock.
      
      This information is required for proper dumping and restoring file
      locks.
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Acked-by: NJeff Layton <jlayton@poochiereds.net>
      Acked-by: N"J. Bruce Fields" <bfields@fieldses.org>
      Acked-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c8c9031
    • K
      mm: rcu-protected get_mm_exe_file() · 90f31d0e
      Konstantin Khlebnikov 提交于
      This patch removes mm->mmap_sem from mm->exe_file read side.
      Also it kills dup_mm_exe_file() and moves exe_file duplication into
      dup_mmap() where both mmap_sems are locked.
      
      [akpm@linux-foundation.org: fix comment typo]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90f31d0e
  5. 16 4月, 2015 2 次提交
  6. 12 4月, 2015 11 次提交
  7. 07 4月, 2015 1 次提交
    • A
      fix mremap() vs. ioctx_kill() race · b2edffdd
      Al Viro 提交于
      teach ->mremap() method to return an error and have it fail for
      aio mappings in process of being killed
      
      Note that in case of ->mremap() failure we need to undo move_page_tables()
      we'd already done; we could call ->mremap() first, but then the failure of
      move_page_tables() would require undoing whatever _successful_ ->mremap()
      has done, which would be a lot more headache in general.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b2edffdd
  8. 03 4月, 2015 1 次提交
    • J
      locks: change lm_get_owner and lm_put_owner prototypes · cae80b30
      Jeff Layton 提交于
      The current prototypes for these operations are somewhat awkward as they
      deal with fl_owners but take struct file_lock arguments. In the future,
      we'll want to be able to take references without necessarily dealing
      with a struct file_lock.
      
      Change them to take fl_owner_t arguments instead and have the callers
      deal with assigning the values to the file_lock structs.
      Signed-off-by: NJeff Layton <jlayton@primarydata.com>
      cae80b30
  9. 26 3月, 2015 1 次提交
  10. 18 3月, 2015 1 次提交
    • T
      fs: make sure the timestamps for lazytime inodes eventually get written · a2f48706
      Theodore Ts'o 提交于
      Jan Kara pointed out that if there is an inode which is constantly
      getting dirtied with I_DIRTY_PAGES, an inode with an updated timestamp
      will never be written since inode->dirtied_when is constantly getting
      updated.  We fix this by adding an extra field to the inode,
      dirtied_time_when, so inodes with a stale dirtytime can get detected
      and handled.
      
      In addition, if we have a dirtytime inode caused by an atime update,
      and there is no write activity on the file system, we need to have a
      secondary system to make sure these inodes get written out.  We do
      this by setting up a second delayed work structure which wakes up the
      CPU much more rarely compared to writeback_expire_centisecs.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      a2f48706
  11. 17 2月, 2015 9 次提交
  12. 13 2月, 2015 1 次提交
    • V
      fs: consolidate {nr,free}_cached_objects args in shrink_control · 4101b624
      Vladimir Davydov 提交于
      We are going to make FS shrinkers memcg-aware.  To achieve that, we will
      have to pass the memcg to scan to the nr_cached_objects and
      free_cached_objects VFS methods, which currently take only the NUMA node
      to scan.  Since the shrink_control structure already holds the node, and
      the memcg to scan will be added to it when we introduce memcg-aware
      vmscan, let us consolidate the methods' arguments in this structure to
      keep things clean.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Suggested-by: NDave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4101b624
  13. 11 2月, 2015 3 次提交
    • K
      rmap: drop support of non-linear mappings · 27ba0644
      Kirill A. Shutemov 提交于
      We don't create non-linear mappings anymore.  Let's drop code which
      handles them in rmap.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27ba0644
    • K
      mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub · d83a08db
      Kirill A. Shutemov 提交于
      Nobody uses it anymore.
      
      [akpm@linux-foundation.org: fix filemap_xip.c]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d83a08db
    • K
      mm: replace remap_file_pages() syscall with emulation · c8d78c18
      Kirill A. Shutemov 提交于
      remap_file_pages(2) was invented to be able efficiently map parts of
      huge file into limited 32-bit virtual address space such as in database
      workloads.
      
      Nonlinear mappings are pain to support and it seems there's no
      legitimate use-cases nowadays since 64-bit systems are widely available.
      
      Let's drop it and get rid of all these special-cased code.
      
      The patch replaces the syscall with emulation which creates new VMA on
      each remap_file_pages(), unless they it can be merged with an adjacent
      one.
      
      I didn't find *any* real code that uses remap_file_pages(2) to test
      emulation impact on.  I've checked Debian code search and source of all
      packages in ALT Linux.  No real users: libc wrappers, mentions in
      strace, gdb, valgrind and this kind of stuff.
      
      There are few basic tests in LTP for the syscall.  They work just fine
      with emulation.
      
      To test performance impact, I've written small test case which
      demonstrate pretty much worst case scenario: map 4G shmfs file, write to
      begin of every page pgoff of the page, remap pages in reverse order,
      read every page.
      
      The test creates 1 million of VMAs if emulation is in use, so I had to
      set vm.max_map_count to 1100000 to avoid -ENOMEM.
      
      Before:		23.3 ( +-  4.31% ) seconds
      After:		43.9 ( +-  0.85% ) seconds
      Slowdown:	1.88x
      
      I believe we can live with that.
      
      Test case:
      
              #define _GNU_SOURCE
              #include <assert.h>
              #include <stdlib.h>
              #include <stdio.h>
              #include <sys/mman.h>
      
              #define MB	(1024UL * 1024)
              #define SIZE	(4096 * MB)
      
              int main(int argc, char **argv)
              {
                      unsigned long *p;
                      long i, pass;
      
                      for (pass = 0; pass < 10; pass++) {
                              p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                              if (p == MAP_FAILED) {
                                      perror("mmap");
                                      return -1;
                              }
      
                              for (i = 0; i < SIZE / 4096; i++)
                                      p[i * 4096 / sizeof(*p)] = i;
      
                              for (i = 0; i < SIZE / 4096; i++) {
                                      if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                                      0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                              perror("remap_file_pages");
                                              return -1;
                                      }
                              }
      
                              for (i = SIZE / 4096 - 1; i >= 0; i--)
                                      assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
      
                              munmap(p, SIZE);
                      }
      
                      return 0;
              }
      
      [akpm@linux-foundation.org: fix spello]
      [sasha.levin@oracle.com: initialize populate before usage]
      [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
      Signed-off-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Armin Rigo <arigo@tunes.org>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c8d78c18
  14. 05 2月, 2015 2 次提交
    • T
      vfs: add find_inode_nowait() function · fe032c42
      Theodore Ts'o 提交于
      Add a new function find_inode_nowait() which is an even more general
      version of ilookup5_nowait().  It is designed for callers which need
      very fine grained control over when the function is allowed to block
      or increment the inode's reference count.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fe032c42
    • T
      vfs: add support for a lazytime mount option · 0ae45f63
      Theodore Ts'o 提交于
      Add a new mount option which enables a new "lazytime" mode.  This mode
      causes atime, mtime, and ctime updates to only be made to the
      in-memory version of the inode.  The on-disk times will only get
      updated when (a) if the inode needs to be updated for some non-time
      related change, (b) if userspace calls fsync(), syncfs() or sync(), or
      (c) just before an undeleted inode is evicted from memory.
      
      This is OK according to POSIX because there are no guarantees after a
      crash unless userspace explicitly requests via a fsync(2) call.
      
      For workloads which feature a large number of random write to a
      preallocated file, the lazytime mount option significantly reduces
      writes to the inode table.  The repeated 4k writes to a single block
      will result in undesirable stress on flash devices and SMR disk
      drives.  Even on conventional HDD's, the repeated writes to the inode
      table block will trigger Adjacent Track Interference (ATI) remediation
      latencies, which very negatively impact long tail latencies --- which
      is a very big deal for web serving tiers (for example).
      
      Google-Bug-Id: 18297052
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0ae45f63
  15. 03 2月, 2015 2 次提交
    • C
      fs: add FL_LAYOUT lease type · 11afe9f7
      Christoph Hellwig 提交于
      This (ab-)uses the file locking code to allow filesystems to recall
      outstanding pNFS layouts on a file.  This new lease type is similar but
      not quite the same as FL_DELEG.  A FL_LAYOUT lease can always be granted,
      an a per-filesystem lock (XFS iolock for the initial implementation)
      ensures not FL_LAYOUT leases granted when we would need to recall them.
      
      Also included are changes that allow multiple outstanding read
      leases of different types on the same file as long as they have a
      differnt owner.  This wasn't a problem until now as nfsd never set
      FL_LEASE leases, and no one else used FL_DELEG leases, but given that
      nfsd will also issues FL_LAYOUT leases we will have to handle it now.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      11afe9f7
    • A
      Make super_blocks and sb_lock static · 15d0f5ea
      Al Viro 提交于
      The only user outside of fs/super.c is gone now
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      15d0f5ea