1. 23 Jan, 2016 1 commit
    • dax: support dirty DAX entries in radix tree · f9fe48be
      Ross Zwisler committed
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
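
      The idea of an exceptional entry that doubles as a dirty marker and a
      writeback address can be sketched in user-space C.  This is only an
      illustration: the bit layout and every sketch_* name below are
      assumptions made for the example, not the kernel's actual encoding.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      /*
       * Hypothetical slot layout (illustration only):
       *   bit 1     set   -> the slot holds an exceptional entry, not a pointer
       *   bit 2     set   -> the entry describes a PMD-sized range, else a PTE
       *   bits 3+         -> sector to flush at fsync/msync time
       */
      #define SKETCH_EXCEPTIONAL ((uintptr_t)1 << 1)
      #define SKETCH_PMD         ((uintptr_t)1 << 2)
      #define SKETCH_SHIFT       3

      /* Pack a writeback sector and the PTE/PMD type into one slot value. */
      static inline uintptr_t sketch_make_entry(uint64_t sector, bool pmd)
      {
              uintptr_t entry = ((uintptr_t)sector << SKETCH_SHIFT) | SKETCH_EXCEPTIONAL;

              if (pmd)
                      entry |= SKETCH_PMD;
              return entry;
      }

      /* Aligned pointers always have bit 1 clear, so the tag is unambiguous. */
      static inline bool sketch_is_exceptional(uintptr_t slot)
      {
              return (slot & SKETCH_EXCEPTIONAL) != 0;
      }

      static inline bool sketch_is_pmd(uintptr_t slot)
      {
              return (slot & SKETCH_PMD) != 0;
      }

      /* Recover the sector that must be written back. */
      static inline uint64_t sketch_sector(uintptr_t slot)
      {
              return (uint64_t)(slot >> SKETCH_SHIFT);
      }
      ```

      Because only one kind of exceptional entry may live in a given tree, a
      single tag bit is enough to tell such an entry from a struct page
      pointer in this sketch.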
  2. 16 Jan, 2016 2 commits
  3. 15 Jan, 2016 2 commits
    • fs/block_dev.c:bdev_write_page(): use blk_queue_enter(..., GFP_NOIO) · b832861c
      Andrew Morton committed
      bdev_write_page() is used by swapout and by writepage where we cannot
      use __GFP_FS or __GFP_IO.  So it is misleading to mention GFP_KERNEL
      here.
      
      blk_queue_enter() only actually looks at __GFP_DIRECT_RECLAIM, so no
      bugs were harmed in the making of this patch.
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov committed
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to the "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 14 Jan, 2016 1 commit
  5. 09 Jan, 2016 2 commits
  6. 07 Jan, 2016 1 commit
  7. 05 Dec, 2015 1 commit
    • block: detach bdev inode from its wb in __blkdev_put() · 43d1c0eb
      Ilya Dryomov committed
      Since 52ebea74 ("writeback: make backing_dev_info host
      cgroup-specific bdi_writebacks"), an inode gets attached to a wb
      (struct bdi_writeback) at some point in its lifetime.  Detaching
      happens on evict, in inode_detach_wb() called from __destroy_inode(),
      and involves updating wb.
      
      However, detaching an internal bdev inode from its wb in
      __destroy_inode() is too late.  Its bdi and by extension root wb are
      embedded into struct request_queue, which has different lifetime rules
      and can be freed long before the final bdput() is called (which can be
      from __fput() of a corresponding /dev inode, through dput() - evict() -
      bd_forget()).  bdevs hold onto the underlying disk/queue pair only while
      opened; as soon as bdev is closed all bets are off.  In fact,
      disk/queue can be gone before __blkdev_put() even returns:
      
      static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
      {
      ...
              if (bdev->bd_contains == bdev) {
                      if (disk->fops->release)
                              disk->fops->release(disk, mode);

      [ Driver puts its references to disk/queue ]

              }
              if (!bdev->bd_openers) {
                      struct module *owner = disk->fops->owner;

                      disk_put_part(bdev->bd_part);
                      bdev->bd_part = NULL;
                      bdev->bd_disk = NULL;
                      if (bdev != bdev->bd_contains)
                              victim = bdev->bd_contains;
                      bdev->bd_contains = NULL;

                      put_disk(disk);

      [ We put ours, the queue is gone
        The last bdput() would result in a write to invalid memory ]

                      module_put(owner);
      ...
      }
      
      Since bdev inodes are special anyway, detach them in __blkdev_put()
      after clearing inode's dirty bits, turning the problematic
      inode_detach_wb() in __destroy_inode() into a noop.
      
      add_disk() grabs its disk->queue since 523e1d39 ("block: make
      gendisk hold a reference to its queue"), so the old ->release comment
      is removed in favor of the new inode_detach_wb() comment.
      
      Cc: stable@vger.kernel.org # 4.2+, needs backporting
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
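
      The ordering described above can be modelled in a few lines of
      user-space C.  This is a rough sketch under stated assumptions: all
      sketch_* types and functions are invented for illustration, and the
      "freed" flag merely stands in for the queue's memory going away.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Stand-in for struct bdi_writeback. */
      struct sketch_wb {
              int attached_inodes;
      };

      /* The root wb is embedded in the queue, so it dies with the queue. */
      struct sketch_queue {
              struct sketch_wb root_wb;
              bool freed;
      };

      struct sketch_bdev_inode {
              struct sketch_wb *wb;   /* NULL once detached */
      };

      /* Detaching updates the wb, so the wb must still be valid here. */
      static inline void sketch_detach_wb(struct sketch_bdev_inode *inode)
      {
              if (!inode->wb)
                      return;         /* already detached: nothing to touch */
              inode->wb->attached_inodes--;
              inode->wb = NULL;
      }

      /* Close path: detach first, then drop the queue. */
      static inline void sketch_blkdev_put(struct sketch_bdev_inode *inode,
                                           struct sketch_queue *q)
      {
              sketch_detach_wb(inode);
              q->freed = true;
      }

      /* Eviction path: by now the detach is a no-op for bdev inodes. */
      static inline void sketch_destroy_inode(struct sketch_bdev_inode *inode)
      {
              sketch_detach_wb(inode);
      }
      ```

      With the detach moved into the close path, the later call from the
      destroy path finds no wb pointer and never dereferences memory owned
      by the freed queue.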
  8. 02 Dec, 2015 1 commit
  9. 20 Nov, 2015 1 commit
    • block: protect rw_page against device teardown · 2e6edc95
      Dan Williams committed
      Fix use-after-free crashes like the following:
      
       general protection fault: 0000 [#1] SMP
       Call Trace:
        [<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
        [<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem]
        [<ffffffff8128fd90>] bdev_read_page+0x50/0x60
        [<ffffffff812972f0>] do_mpage_readpage+0x510/0x770
        [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
        [<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50
        [<ffffffff81297657>] mpage_readpages+0x107/0x170
        [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
        [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
        [<ffffffff8129058d>] blkdev_readpages+0x1d/0x20
        [<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310
        [<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310
        [<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0
        [<ffffffff811c76f6>] filemap_fault+0x396/0x530
        [<ffffffff811f816e>] __do_fault+0x4e/0xf0
        [<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50
      
      Cc: <stable@vger.kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Matthew Wilcox <willy@linux.intel.com>
      [willy: symmetry fixups]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  10. 12 Nov, 2015 1 commit
  11. 22 Oct, 2015 1 commit
    • block: Inline blk_integrity in struct gendisk · 25520d55
      Martin K. Petersen committed
      Up until now the integrity profile has been dynamically allocated and
      attached to struct gendisk after the disk has been made active.
      
      This causes problems because NVMe devices need to register the profile
      prior to the partition table being read due to a mandatory metadata
      buffer requirement. In addition, DM goes through hoops to deal with
      preallocating, but not initializing integrity profiles.
      
      Since the integrity profile is small (4 bytes + a pointer), Christoph
      suggested moving it to struct gendisk proper. This requires several
      changes:
      
       - Moving the blk_integrity definition to genhd.h.
      
       - Inlining blk_integrity in struct gendisk.
      
       - Removing the dynamic allocation code.
      
       - Adding helper functions which allow gendisk to set up and tear down
         the integrity sysfs dir when a disk is added/deleted.
      
       - Adding a blk_integrity_revalidate() callback for updating the stable
         pages bdi setting.
      
       - The calls that depend on whether a device has an integrity profile or
         not now key off of the bi->profile pointer.
      
       - Simplifying the integrity support routines in DM (Mike Snitzer).

      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reported-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
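
      A minimal user-space sketch of the embedding approach follows.  The
      struct and function names are hypothetical and the field list is only
      an assumption chosen to match the "4 bytes + a pointer" description;
      the point is that the profile pointer, not an allocation, signals
      whether integrity is configured.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>
      #include <string.h>

      struct sketch_profile {
              const char *name;
      };

      /* Small enough to embed: four single-byte fields plus one pointer. */
      struct sketch_integrity {
              const struct sketch_profile *profile;   /* NULL: no profile set */
              unsigned char flags;
              unsigned char tuple_size;
              unsigned char interval_exp;
              unsigned char tag_size;
      };

      struct sketch_disk {
              int minors;
              struct sketch_integrity integrity;      /* embedded, no allocation */
      };

      /* Callers key off the profile pointer instead of a separate object. */
      static inline bool sketch_has_integrity(const struct sketch_disk *disk)
      {
              return disk->integrity.profile != NULL;
      }

      /* Can run before the disk is made active, since storage already exists. */
      static inline void sketch_integrity_register(struct sketch_disk *disk,
                                                   const struct sketch_profile *p,
                                                   unsigned char tuple_size)
      {
              disk->integrity.profile = p;
              disk->integrity.tuple_size = tuple_size;
      }

      static inline void sketch_integrity_unregister(struct sketch_disk *disk)
      {
              memset(&disk->integrity, 0, sizeof(disk->integrity));
      }
      ```

      Since the storage is part of the disk structure, registration cannot
      fail for lack of memory and can happen before the partition table is
      read.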
  12. 16 Sep, 2015 1 commit
  13. 09 Sep, 2015 1 commit
    • dax: move DAX-related functions to a new header · c94c2acf
      Matthew Wilcox committed
      In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, the inlined
      dax_pmd_fault() needs to return VM_FAULT_FALLBACK, which is defined in
      <linux/mm.h>.  Given that we don't want to include <linux/mm.h>
      in <linux/fs.h>, the easiest solution is to move the DAX-related
      functions to a new header, <linux/dax.h>.  We could also have moved
      VM_FAULT_* definitions to a new header, or a different header that isn't
      quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
      the best option.
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 28 Aug, 2015 1 commit
  15. 21 Aug, 2015 1 commit
  16. 18 Aug, 2015 1 commit
  17. 05 Jul, 2015 2 commits
  18. 28 Jun, 2015 1 commit
  19. 26 Jun, 2015 1 commit
  20. 02 Jun, 2015 2 commits
    • bdi: make inode_to_bdi() inline · a212b105
      Tejun Heo committed
      Now that bdi definitions are moved to backing-dev-defs.h,
      backing-dev.h can include blkdev.h and inline inode_to_bdi() without
      worrying about introducing circular include dependency.  The function
      gets called from hot paths and is fairly trivial.

      This patch makes inode_to_bdi(), and the sb_is_blkdev_sb() that it
      calls, inline.  blockdev_superblock and noop_backing_dev_info
      are EXPORT_GPL'd to allow the inline functions to be used from
      modules.
      
      While at it, make sb_is_blkdev_sb() return bool instead of int.
      
      v2: Fixed typo in description as suggested by Jan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Tejun Heo committed
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  c files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
      
      v2: fs/fat build failure fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  21. 25 Apr, 2015 1 commit
    • direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Jens Axboe committed
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
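
      The effect of such a flag can be sketched with C11 atomics in user
      space.  The flag value and the sketch_* names below are assumptions
      for illustration only; they show a caller opting out of the shared
      counter, not the actual do_blockdev_direct_IO() code.

      ```c
      #include <assert.h>
      #include <stdatomic.h>

      /* Hypothetical flag value; only the opt-out idea matters here. */
      #define SKETCH_DIO_SKIP_COUNT 0x1

      struct sketch_inode {
              atomic_int dio_count;   /* shared by every thread doing direct I/O */
      };

      /* Start of one I/O: touch the shared counter only if the caller needs it. */
      static inline void sketch_dio_begin(struct sketch_inode *inode, int flags)
      {
              if (!(flags & SKETCH_DIO_SKIP_COUNT))
                      atomic_fetch_add(&inode->dio_count, 1);
      }

      /* End of one I/O: mirror of sketch_dio_begin(). */
      static inline void sketch_dio_end(struct sketch_inode *inode, int flags)
      {
              if (!(flags & SKETCH_DIO_SKIP_COUNT))
                      atomic_fetch_sub(&inode->dio_count, 1);
      }
      ```

      A file system caller would pass 0 and keep the truncate protection; a
      block device caller would pass the skip flag and avoid bouncing the
      counter's cache line between CPUs.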
  22. 16 Apr, 2015 1 commit
  23. 12 Apr, 2015 5 commits
  24. 26 Mar, 2015 1 commit
  25. 21 Jan, 2015 2 commits
  26. 14 Jan, 2015 1 commit
  27. 17 Nov, 2014 1 commit
    • fs: add freeze_super/thaw_super fs hooks · 48b6bca6
      Benjamin Marzinski committed
      Currently, freezing a filesystem involves calling freeze_super, which locks
      sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it
      hard for gfs2 (and potentially other cluster filesystems) to use the vfs
      freezing code to do freezes on all the cluster nodes.
      
      In order to communicate that a freeze has been requested, and to make sure
      that only one node is trying to freeze at a time, gfs2 uses a glock
      (sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire
      this lock before calling freeze_super. This means that two nodes can
      attempt to freeze the filesystem by both calling freeze_super, acquiring
      the sb->s_umount lock, and then attempting to grab the cluster glock
      sd_freeze_gl. Only one will succeed, and the other will be stuck in
      freeze_super, making it impossible to finish freezing the node.
      
      To solve this problem, this patch adds the freeze_super and thaw_super
      hooks.  If a filesystem implements these hooks, they are called instead of
      the vfs freeze_super and thaw_super functions. This means that every
      filesystem that implements these hooks must call the vfs freeze_super and
      thaw_super functions itself within the hook function to make use of the vfs
      freezing code.

      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
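
      The dispatch rule described above (use the filesystem's hook if it
      exists, otherwise the generic path) can be sketched as follows.  All
      sketch_* names are hypothetical, and the cluster lock is reduced to a
      flag; this illustrates the call ordering, not the gfs2 implementation.

      ```c
      #include <assert.h>
      #include <stddef.h>

      struct sketch_sb;

      struct sketch_ops {
              int (*freeze_super)(struct sketch_sb *sb);      /* optional hook */
      };

      struct sketch_sb {
              const struct sketch_ops *ops;
              int frozen;
              int cluster_lock_held;
      };

      /* Stand-in for the generic VFS freeze path. */
      static inline int sketch_vfs_freeze_super(struct sketch_sb *sb)
      {
              sb->frozen = 1;
              return 0;
      }

      /* Entry point: prefer the filesystem's hook when one is provided. */
      static inline int sketch_freeze(struct sketch_sb *sb)
      {
              if (sb->ops && sb->ops->freeze_super)
                      return sb->ops->freeze_super(sb);
              return sketch_vfs_freeze_super(sb);
      }

      /*
       * A cluster filesystem's hook: take the cluster-wide lock first, then
       * call the generic path itself, as the hook contract requires.
       */
      static inline int sketch_cluster_freeze_super(struct sketch_sb *sb)
      {
              sb->cluster_lock_held = 1;
              return sketch_vfs_freeze_super(sb);
      }
      ```

      Taking the cluster lock before entering the generic path is what
      removes the window in which two nodes both hold their local lock and
      then contend for the cluster lock.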
  28. 31 Oct, 2014 1 commit
    • Return short read or 0 at end of a raw device, not EIO · b2de525f
      David Jeffery committed
      Author: David Jeffery <djeffery@redhat.com>
      Changes to the basic direct I/O code have broken the raw driver when reading
      to the end of a raw device.  Instead of returning a short read for a read that
      extends partially beyond the device's end or 0 when at the end of the device,
      these reads now return EIO.
      
      The raw driver needs the same end of device handling as was added for normal
      block devices.  Using blkdev_read_iter, which has the needed size checks,
      prevents the EIO conditions at the end of the device.

      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
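
      The end-of-device behaviour described above amounts to clamping the
      request to the bytes that remain.  The helper below is a hypothetical
      user-space sketch of that size check, not the code in
      blkdev_read_iter().

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdint.h>

      /*
       * Return how many bytes a read starting at pos may transfer:
       * the full count, a short count near the end, or 0 at or past the end.
       */
      static inline size_t sketch_clamp_read(uint64_t dev_size, uint64_t pos,
                                             size_t count)
      {
              uint64_t left;

              if (pos >= dev_size)
                      return 0;               /* at or beyond the end: EOF */

              left = dev_size - pos;
              return count < left ? count : (size_t)left;
      }
      ```

      For a 1024-byte device, a 512-byte read at offset 768 is shortened to
      256 bytes, and a read at offset 1024 returns 0 rather than an error.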
  29. 10 Oct, 2014 1 commit
    • block_dev: implement readpages() to optimize sequential read · 447f05bb
      Akinobu Mita committed
      Sequential read from a block device is expected to be as fast as or faster
      than from a file on a filesystem.  But this does not hold, because the
      address space operations for block devices lack an effective readpages().
      
      This implements readpages() operation for block device by using
      mpage_readpages() which can create multipage BIOs instead of BIOs for each
      page and reduce system CPU time consumption.
      
      Install 1GB of RAM disk storage:
      
      	# modprobe scsi_debug dev_size_mb=1024 delay=0
      
      Sequential read from file on a filesystem:
      
      	# mkfs.ext4 /dev/$DEV
      	# mount /dev/$DEV /mnt
      	# fio --name=t --size=512m --rw=read --filename=/mnt/file
      	...
      	  read : io=524288KB, bw=2133.4MB/s, iops=546133, runt=   240msec
      
      Sequential read from a block device:
      	# fio --name=t --size=512m --rw=read --filename=/dev/$DEV
      	...
      (Without this commit)
      	  read : io=524288KB, bw=1700.2MB/s, iops=435455, runt=   301msec
      
      (With this commit)
      	  read : io=524288KB, bw=2160.4MB/s, iops=553046, runt=   237msec

      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  30. 09 Sep, 2014 1 commit
    • bdi: reimplement bdev_inode_switch_bdi() · 018a17bd
      Tejun Heo committed
      A block_device may be attached to different gendisks and thus
      different bdis over time.  bdev_inode_switch_bdi() is used to switch
      the associated bdi.  The function assumes that the inode could be
      dirty and transfers it between bdis if so.  This is a bit nasty in
      that it reaches into bdi internals.
      
      This patch reimplements the function so that it writes out the inode
      if dirty.  This is a lot simpler and can be implemented without
      exposing bdi internals.

      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
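
      The "write out if dirty, then switch" approach can be sketched in
      user-space C.  The types and helpers below are invented for the
      example, and locking and error handling are deliberately omitted, so
      this shows only the order of operations.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      struct sketch_bdi {
              int id;
      };

      struct sketch_bdev_inode {
              struct sketch_bdi *bdi;
              bool dirty;
              int writeouts;          /* counts simulated write-outs */
      };

      /* Stand-in for synchronously writing the inode to its current bdi. */
      static inline void sketch_write_inode_now(struct sketch_bdev_inode *inode)
      {
              inode->writeouts++;
              inode->dirty = false;
      }

      /*
       * Switch the inode to dst.  A dirty inode is cleaned first, so nothing
       * has to be moved between the two bdis' internal lists.
       */
      static inline void sketch_switch_bdi(struct sketch_bdev_inode *inode,
                                           struct sketch_bdi *dst)
      {
              if (inode->dirty)
                      sketch_write_inode_now(inode);
              inode->bdi = dst;
      }
      ```

      The trade-off in this sketch is one extra write-out for a dirty inode
      in exchange for not reaching into either bdi's internals.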