1. 23 2月, 2016 7 次提交
    • J
      mbcache2: rename to mbcache · 7a2508e1
      Jan Kara 提交于
      Since old mbcache code is gone, let's rename new code to mbcache since
      number 2 is now meaningless. This is just a mechanical replacement.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      7a2508e1
    • J
      mbcache2: Use referenced bit instead of LRU · f0c8b462
      Jan Kara 提交于
      Currently we maintain perfect LRU list by moving entry to the tail of
      the list when it gets used. However these operations on cache-global
      list are relatively expensive.
      
      In this patch we switch to lazy updates of LRU list. Whenever entry gets
      used, we set a referenced bit in it. When reclaiming entries, we give
      referenced entries another round in the LRU. Since the list is not a
      real LRU anymore, rename it to just 'list'.
      
      In my testing this logic gives about 30% boost to workloads with mostly
      unique xattr blocks (e.g. xattr-bench with 10 files and 10000 unique
      xattr values).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      f0c8b462
    • J
      mbcache2: limit cache size · c2f3140f
      Jan Kara 提交于
      So far number of entries in mbcache is limited only by the pressure from
      the shrinker. Since too many entries degrade the hash table and
      generally we expect that caching more entries has diminishing returns,
      limit number of entries the same way as in the old mbcache to 16 * hash
      table size.
      
      Once we exceed the desired maximum number of entries, we schedule a
      backround work to reclaim entries. If the background work cannot keep up
      and the number of entries exceeds two times the desired maximum, we
      reclaim some entries directly when allocating a new entry.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      c2f3140f
    • J
      mbcache: remove mbcache · ecd1e644
      Jan Kara 提交于
      Both ext2 and ext4 are now converted to mbcache2. Remove the old mbcache
      code.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ecd1e644
    • J
      ext2: convert to mbcache2 · be0726d3
      Jan Kara 提交于
      The conversion is generally straightforward. We convert filesystem from
      a global cache to per-fs one. Similarly to ext4 the tricky part is that
      xattr block corresponding to found mbcache entry can get freed before we
      get buffer lock for that block. So we have to check whether the entry is
      still valid after getting the buffer lock.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      be0726d3
    • J
      ext4: convert to mbcache2 · 82939d79
      Jan Kara 提交于
      The conversion is generally straightforward. The only tricky part is
      that xattr block corresponding to found mbcache entry can get freed
      before we get buffer lock for that block. So we have to check whether
      the entry is still valid after getting buffer lock.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      82939d79
    • J
      mbcache2: reimplement mbcache · f9a61eb4
      Jan Kara 提交于
      Original mbcache was designed to have more features than what ext?
      filesystems ended up using. It supported entry being in more hashes, it
      had a home-grown rwlocking of each entry, and one cache could cache
      entries from multiple filesystems. This genericity also resulted in more
      complex locking, larger cache entries, and generally more code
      complexity.
      
      This is reimplementation of the mbcache functionality to exactly fit the
      purpose ext? filesystems use it for. Cache entries are now considerably
      smaller (7 instead of 13 longs), the code is considerably smaller as
      well (414 vs 913 lines of code), and IMO also simpler. The new code is
      also much more lightweight.
      
      I have measured the speed using artificial xattr-bench benchmark, which
      spawns P processes, each process sets xattr for F different files, and
      the value of xattr is randomly chosen from a pool of V values. Averages
      of runtimes for 5 runs for various combinations of parameters are below.
      The first value in each cell is old mbache, the second value is the new
      mbcache.
      
      V=10
      F\P	1		2		4		8		16		32		64
      10	0.158,0.157	0.208,0.196	0.500,0.277	0.798,0.400	3.258,0.584	13.807,1.047	61.339,2.803
      100	0.172,0.167	0.279,0.222	0.520,0.275	0.825,0.341	2.981,0.505	12.022,1.202	44.641,2.943
      1000	0.185,0.174	0.297,0.239	0.445,0.283	0.767,0.340	2.329,0.480	6.342,1.198	16.440,3.888
      
      V=100
      F\P	1		2		4		8		16		32		64
      10	0.162,0.153	0.200,0.186	0.362,0.257	0.671,0.496	1.433,0.943	3.801,1.345	7.938,2.501
      100	0.153,0.160	0.221,0.199	0.404,0.264	0.945,0.379	1.556,0.485	3.761,1.156	7.901,2.484
      1000	0.215,0.191	0.303,0.246	0.471,0.288	0.960,0.347	1.647,0.479	3.916,1.176	8.058,3.160
      
      V=1000
      F\P	1		2		4		8		16		32		64
      10	0.151,0.129	0.210,0.163	0.326,0.245	0.685,0.521	1.284,0.859	3.087,2.251	6.451,4.801
      100	0.154,0.153	0.211,0.191	0.276,0.282	0.687,0.506	1.202,0.877	3.259,1.954	8.738,2.887
      1000	0.145,0.179	0.202,0.222	0.449,0.319	0.899,0.333	1.577,0.524	4.221,1.240	9.782,3.579
      
      V=10000
      F\P	1		2		4		8		16		32		64
      10	0.161,0.154	0.198,0.190	0.296,0.256	0.662,0.480	1.192,0.818	2.989,2.200	6.362,4.746
      100	0.176,0.174	0.236,0.203	0.326,0.255	0.696,0.511	1.183,0.855	4.205,3.444	19.510,17.760
      1000	0.199,0.183	0.240,0.227	1.159,1.014	2.286,2.154	6.023,6.039	---,10.933	---,36.620
      
      V=100000
      F\P	1		2		4		8		16		32		64
      10	0.171,0.162	0.204,0.198	0.285,0.230	0.692,0.500	1.225,0.881	2.990,2.243	6.379,4.771
      100	0.151,0.171	0.220,0.210	0.295,0.255	0.720,0.518	1.226,0.844	3.423,2.831	19.234,17.544
      1000	0.192,0.189	0.249,0.225	1.162,1.043	2.257,2.093	5.853,4.997	---,10.399	---,32.198
      
      We see that the new code is faster in pretty much all the cases and
      starting from 4 processes there are significant gains with the new code
      resulting in upto 20-times shorter runtimes. Also for large numbers of
      cached entries all values for the old code could not be measured as the
      kernel started hitting softlockups and died before the test completed.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      f9a61eb4
  2. 22 2月, 2016 2 次提交
  3. 19 2月, 2016 2 次提交
    • J
      ext4: fix crashes in dioread_nolock mode · 74dae427
      Jan Kara 提交于
      Competing overwrite DIO in dioread_nolock mode will just overwrite
      pointer to io_end in the inode. This may result in data corruption or
      extent conversion happening from IO completion interrupt because we
      don't properly set buffer_defer_completion() when unlocked DIO races
      with locked DIO to unwritten extent.
      
      Since unlocked DIO doesn't need io_end for anything, just avoid
      allocating it and corrupting pointer from inode for locked DIO.
      A cleaner fix would be to avoid these games with io_end pointer from the
      inode but that requires more intrusive changes so we leave that for
      later.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      74dae427
    • J
      ext4: fix bh->b_state corruption · ed8ad838
      Jan Kara 提交于
      ext4 can update bh->b_state non-atomically in _ext4_get_block() and
      ext4_da_get_block_prep(). Usually this is fine since bh is just a
      temporary storage for mapping information on stack but in some cases it
      can be fully living bh attached to a page. In such case non-atomic
      update of bh->b_state can race with an atomic update which then gets
      lost. Usually when we are mapping bh and thus updating bh->b_state
      non-atomically, nobody else touches the bh and so things work out fine
      but there is one case to especially worry about: ext4_finish_bio() uses
      BH_Uptodate_Lock on the first bh in the page to synchronize handling of
      PageWriteback state. So when blocksize < pagesize, we can be atomically
      modifying bh->b_state of a buffer that actually isn't under IO and thus
      can race e.g. with delalloc trying to map that buffer. The result is
      that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
      corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.
      
      Fix the problem by always updating bh->b_state bits atomically.
      
      CC: stable@vger.kernel.org
      Reported-by: NNikolay Borisov <kernel@kyup.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ed8ad838
  4. 16 2月, 2016 1 次提交
  5. 12 2月, 2016 6 次提交
    • E
      ext4: remove unused parameter "newblock" in convert_initialized_extent() · 56263b4c
      Eryu Guan 提交于
      The "newblock" parameter is not used in convert_initialized_extent(),
      remove it.
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      56263b4c
    • E
      ext4: don't read blocks from disk after extents being swapped · bcff2488
      Eryu Guan 提交于
      I notice ext4/307 fails occasionally on ppc64 host, reporting md5
      checksum mismatch after moving data from original file to donor file.
      
      The reason is that move_extent_per_page() calls __block_write_begin()
      and block_commit_write() to write saved data from original inode blocks
      to donor inode blocks, but __block_write_begin() not only maps buffer
      heads but also reads block content from disk if the size is not block
      size aligned.  At this time the physical block number in mapped buffer
      head is pointing to the donor file not the original file, and that
      results in reading wrong data to page, which get written to disk in
      following block_commit_write call.
      
      This also can be reproduced by the following script on 1k block size ext4
      on x86_64 host:
      
          mnt=/mnt/ext4
          donorfile=$mnt/donor
          testfile=$mnt/testfile
          e4compact=~/xfstests/src/e4compact
      
          rm -f $donorfile $testfile
      
          # reserve space for donor file, written by 0xaa and sync to disk to
          # avoid EBUSY on EXT4_IOC_MOVE_EXT
          xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile
      
          # create test file written by 0xbb
          xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile
      
          # compute initial md5sum
          md5sum $testfile | tee md5sum.txt
          # drop cache, force e4compact to read data from disk
          echo 3 > /proc/sys/vm/drop_caches
      
          # test defrag
          echo "$testfile" | $e4compact -i -v -f $donorfile
          # check md5sum
          md5sum -c md5sum.txt
      
      Fix it by creating & mapping buffer heads only but not reading blocks
      from disk, because all the data in page is guaranteed to be up-to-date
      in mext_page_mkuptodate().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      bcff2488
    • I
      ext4: fix potential integer overflow · 46901760
      Insu Yun 提交于
      Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data),
      integer overflow could be happened.
      Therefore, need to fix integer overflow sanitization.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NInsu Yun <wuninsu@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      46901760
    • H
      ext4: add a line break for proc mb_groups display · 802cf1f9
      Huaitong Han 提交于
      This patch adds a line break for proc mb_groups display.
      Signed-off-by: NHuaitong Han <huaitong.han@intel.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      802cf1f9
    • A
      ext4: ioctl: fix erroneous return value · fdde368e
      Anton Protopopov 提交于
      The ext4_ioctl_setflags() function which is used in the ioctls
      EXT4_IOC_SETFLAGS and EXT4_IOC_FSSETXATTR may return the positive value
      EPERM instead of -EPERM in case of error. This bug was introduced by a
      recent commit 9b7365fc.
      
      The following program can be used to illustrate the wrong behavior:
      
          #include <sys/types.h>
          #include <sys/ioctl.h>
          #include <sys/stat.h>
          #include <fcntl.h>
          #include <err.h>
      
          #define FS_IOC_GETFLAGS _IOR('f', 1, long)
          #define FS_IOC_SETFLAGS _IOW('f', 2, long)
          #define FS_IMMUTABLE_FL 0x00000010
      
          int main(void)
          {
              int fd;
              long flags;
      
              fd = open("file", O_RDWR|O_CREAT, 0600);
              if (fd < 0)
                  err(1, "open");
      
              if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
                  err(1, "ioctl: FS_IOC_GETFLAGS");
      
              flags |= FS_IMMUTABLE_FL;
      
              if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
                  err(1, "ioctl: FS_IOC_SETFLAGS");
      
              warnx("ioctl returned no error");
      
              return 0;
          }
      
      Running it gives the following result:
      
          $ strace -e ioctl ./test
          ioctl(3, FS_IOC_GETFLAGS, 0x7ffdbd8bfd38) = 0
          ioctl(3, FS_IOC_SETFLAGS, 0x7ffdbd8bfd38) = 1
          test: ioctl returned no error
          +++ exited with 0 +++
      
      Running the program on a kernel with the bug fixed gives the proper result:
      
          $ strace -e ioctl ./test
          ioctl(3, FS_IOC_GETFLAGS, 0x7ffdd2768258) = 0
          ioctl(3, FS_IOC_SETFLAGS, 0x7ffdd2768258) = -1 EPERM (Operation not permitted)
          test: ioctl: FS_IOC_SETFLAGS: Operation not permitted
          +++ exited with 1 +++
      Signed-off-by: NAnton Protopopov <a.s.protopopov@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      fdde368e
    • J
      ext4: fix scheduling in atomic on group checksum failure · 05145bd7
      Jan Kara 提交于
      When block group checksum is wrong, we call ext4_error() while holding
      group spinlock from ext4_init_block_bitmap() or
      ext4_init_inode_bitmap() which results in scheduling while in atomic.
      Fix the issue by calling ext4_error() later after dropping the spinlock.
      
      CC: stable@vger.kernel.org
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      05145bd7
  6. 08 2月, 2016 2 次提交
  7. 30 1月, 2016 1 次提交
  8. 27 1月, 2016 2 次提交
  9. 26 1月, 2016 3 次提交
    • D
      Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()" · 80ad623e
      David Sterba 提交于
      This reverts commit 69624913. The
      cleaner thread can block freezing when there's a snapshot cleaning in
      progress and the other threads get suspended first. From the logs
      provided by Martin we're waiting for reading extent pages:
      
      kernel: PM: Syncing filesystems ... done.
      kernel: Freezing user space processes ... (elapsed 0.015 seconds) done.
      kernel: Freezing remaining freezable tasks ...
      kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
      kernel: btrfs-cleaner   D ffff88033dd13bc0     0   152      2 0x00000000
      kernel: ffff88032ebc2e00 ffff88032e750000 ffff88032e74fa50 7fffffffffffffff
      kernel: ffffffff814a58df 0000000000000002 ffffea000934d580 ffffffff814a5451
      kernel: 7fffffffffffffff ffffffff814a6e8f 0000000000000000 0000000000000020
      kernel: Call Trace:
      kernel: [<ffffffff814a58df>] ? bit_wait+0x2c/0x2c
      kernel: [<ffffffff814a5451>] ? schedule+0x6f/0x7c
      kernel: [<ffffffff814a6e8f>] ? schedule_timeout+0x2f/0xd8
      kernel: [<ffffffff81076f94>] ? timekeeping_get_ns+0xa/0x2e
      kernel: [<ffffffff81077603>] ? ktime_get+0x36/0x44
      kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
      kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
      kernel: [<ffffffff814a590b>] ? bit_wait_io+0x2c/0x30
      kernel: [<ffffffff814a5694>] ? __wait_on_bit+0x41/0x73
      kernel: [<ffffffff8109eba8>] ? wait_on_page_bit+0x6d/0x72
      kernel: [<ffffffff8105d718>] ? autoremove_wake_function+0x2a/0x2a
      kernel: [<ffffffff811a02d7>] ? read_extent_buffer_pages+0x1bd/0x203
      kernel: [<ffffffff8117d9e9>] ? free_root_pointers+0x4c/0x4c
      kernel: [<ffffffff8117e831>] ? btree_read_extent_buffer_pages.constprop.57+0x5a/0xe9
      kernel: [<ffffffff8117f4f3>] ? read_tree_block+0x2d/0x45
      kernel: [<ffffffff8116782a>] ? read_block_for_search.isra.34+0x22a/0x26b
      kernel: [<ffffffff811656c3>] ? btrfs_set_path_blocking+0x1e/0x4a
      kernel: [<ffffffff8116919b>] ? btrfs_search_slot+0x648/0x736
      kernel: [<ffffffff81170559>] ? btrfs_lookup_extent_info+0xb7/0x2c7
      kernel: [<ffffffff81170ee5>] ? walk_down_proc+0x9c/0x1ae
      kernel: [<ffffffff81171c9d>] ? walk_down_tree+0x40/0xa4
      kernel: [<ffffffff8117375f>] ? btrfs_drop_snapshot+0x2da/0x664
      kernel: [<ffffffff8104ff21>] ? finish_task_switch+0x126/0x167
      kernel: [<ffffffff811850f8>] ? btrfs_clean_one_deleted_snapshot+0xa6/0xb0
      kernel: [<ffffffff8117eaba>] ? cleaner_kthread+0x13e/0x17b
      kernel: [<ffffffff8117e97c>] ? btrfs_item_end+0x33/0x33
      kernel: [<ffffffff8104d256>] ? kthread+0x95/0x9d
      kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
      kernel: [<ffffffff814a7b5f>] ? ret_from_fork+0x3f/0x70
      kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
      
      As this affects a released kernel (4.4) we need a minimal fix for
      stable kernels.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108361Reported-by: NMartin Ziegler <ziegler@uni-freiburg.de>
      CC: stable@vger.kernel.org # 4.4
      CC: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      80ad623e
    • Q
      btrfs: async-thread: Fix a use-after-free error for trace · 0a95b851
      Qu Wenruo 提交于
      Parameter of trace_btrfs_work_queued() can be freed in its workqueue.
      So no one use use that pointer after queue_work().
      
      Fix the user-after-free bug by move the trace line before queue_work().
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0a95b851
    • F
      Btrfs: fix race between fsync and lockless direct IO writes · de0ee0ed
      Filipe Manana 提交于
      An fsync, using the fast path, can race with a concurrent lockless direct
      IO write and end up logging a file extent item that points to an extent
      that wasn't written to yet. This is because the fast fsync path collects
      ordered extents into a local list and then collects all the new extent
      maps to log file extent items based on them, while the direct IO write
      path creates the new extent map before it creates the corresponding
      ordered extent (and submitting the respective bio(s)).
      
      So fix this by making the direct IO write path create ordered extents
      before the extent maps and make the fast fsync path collect any new
      ordered extents after it collects the extent maps.
      Note that making the fsync handler call inode_dio_wait() (after acquiring
      the inode's i_mutex) would not work and lead to a deadlock when doing
      AIO, as through AIO we end up in a path where the fsync handler is called
      (through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
      before the inode's dio counter is decremented (inode_dio_wait() waits
      for this counter to have a value of zero).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      de0ee0ed
  10. 25 1月, 2016 2 次提交
  11. 23 1月, 2016 12 次提交
    • D
      vfs: abort dedupe loop if fatal signals are pending · e62e560f
      Darrick J. Wong 提交于
      If the program running dedupe receives a fatal signal during the
      dedupe loop, we should bail out to avoid tying up the system.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e62e560f
    • T
      tree wide: use kvfree() than conditional kfree()/vfree() · 1d5cfdb0
      Tetsuo Handa 提交于
      There are many locations that do
      
        if (memory_was_allocated_by_vmalloc)
          vfree(ptr);
        else
          kfree(ptr);
      
      but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
      using is_vmalloc_addr().  Unless callers have special reasons, we can
      replace this branch with kvfree().  Please check and reply if you found
      problems.
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJan Kara <jack@suse.com>
      Acked-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Reviewed-by: NAndreas Dilger <andreas.dilger@intel.com>
      Acked-by: N"Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Boris Petkov <bp@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d5cfdb0
    • R
      dax: never rely on bh.b_dev being set by get_block() · eab95db6
      Ross Zwisler 提交于
      Previously in DAX we assumed that calls to get_block() would set
      bh.b_bdev, and we would then use that value even in error cases for
      debugging.  This caused a NULL pointer dereference in __dax_dbg() which
      was fixed by a previous commit, but that commit only changed the one
      place where we were hitting an error.
      
      Instead, update dax.c so that we always initialize bh.b_bdev as best we
      can based on the information that DAX has.  get_block() may or may not
      update to a new value, but this at least lets us get something helpful
      from bh.b_bdev for error messages and not have to worry about whether it
      was set by get_block() or not.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eab95db6
    • R
      xfs: call dax_pfn_mkwrite() for DAX fsync/msync · 5eb88dca
      Ross Zwisler 提交于
      To properly support the new DAX fsync/msync infrastructure filesystems
      need to call dax_pfn_mkwrite() so that DAX can track when user pages are
      dirtied.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5eb88dca
    • R
      ext4: call dax_pfn_mkwrite() for DAX fsync/msync · d5be7a03
      Ross Zwisler 提交于
      To properly support the new DAX fsync/msync infrastructure filesystems
      need to call dax_pfn_mkwrite() so that DAX can track when user pages are
      dirtied.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5be7a03
    • R
      ext2: call dax_pfn_mkwrite() for DAX fsync/msync · 80b4adca
      Ross Zwisler 提交于
      To properly support the new DAX fsync/msync infrastructure filesystems
      need to call dax_pfn_mkwrite() so that DAX can track when user pages are
      dirtied.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80b4adca
    • R
      dax: add support for fsync/sync · 9973c98e
      Ross Zwisler 提交于
      To properly handle fsync/msync in an efficient way DAX needs to track
      dirty pages so it is able to flush them durably to media on demand.
      
      The tracking of dirty pages is done via the radix tree in struct
      address_space.  This radix tree is already used by the page writeback
      infrastructure for tracking dirty pages associated with an open file,
      and it already has support for exceptional (non struct page*) entries.
      We build upon these features to add exceptional entries to the radix
      tree for DAX dirty PMD or PTE pages at fault time.
      
      [dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9973c98e
    • R
      dax: support dirty DAX entries in radix tree · f9fe48be
      Ross Zwisler 提交于
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9fe48be
    • R
      dax: fix conversion of holes to PMDs · de14b9cb
      Ross Zwisler 提交于
      When we get a DAX PMD fault for a write it is possible that there could
      be some number of 4k zero pages already present for the same range that
      were inserted to service reads from a hole.  These 4k zero pages need to
      be unmapped from the VMAs and removed from the struct address_space
      radix tree before the real DAX PMD entry can be inserted.
      
      For PTE faults this same use case also exists and is handled by a
      combination of unmap_mapping_range() to unmap the VMAs and
      delete_from_page_cache() to remove the page from the address_space radix
      tree.
      
      For PMD faults we do have a call to unmap_mapping_range() (protected by
      a buffer_new() check), but nothing clears out the radix tree entry.  The
      buffer_new() check is also incorrect as the current ext4 and XFS
      filesystem code will never return a buffer_head with BH_New set, even
      when allocating new blocks over a hole.  Instead the filesystem will
      zero the blocks manually and return a buffer_head with only BH_Mapped
      set.
      
      Fix this situation by removing the buffer_new() check and adding a call
      to truncate_inode_pages_range() to clear out the radix tree entries
      before we insert the DAX PMD.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: NDan Williams <dan.j.williams@intel.com>
      Tested-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de14b9cb
    • R
      dax: fix NULL pointer dereference in __dax_dbg() · d4bbe706
      Ross Zwisler 提交于
      In __dax_pmd_fault() we currently assume that get_block() will always
      set bh.b_bdev and we unconditionally dereference it in __dax_dbg().
      
      This assumption isn't always true - when called for reads of holes
      ext4_dax_mmap_get_block() returns a buffer head where bh->b_bdev is
      never set.  I hit this BUG while testing the DAX PMD fault path.
      
      Instead, initialize bh.b_bdev before passing bh into get_block().  It is
      possible that the filesystem's get_block() will update bh.b_bdev, and
      this is fine - we just want to initialize bh.b_bdev to something
      reasonable so that the calls to __dax_dbg() work and print something
      useful.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d4bbe706
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro 提交于
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5955102c
    • D
      btrfs: tweak free space tree bitmap allocation · 79b134a2
      David Sterba 提交于
      The requested bitmap size varies, observed numbers were < 4K up to 16K.
      Using vmalloc unconditionally would be too heavy, we'll try contiguous
      allocations first and fall back to vmalloc if there's no contig memory.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      79b134a2