1. 01 Oct, 2014 1 commit
    • nfsd4: fix corruption of NFSv4 read data · 15b23ef5
      J. Bruce Fields committed
      The calculation of page_ptr here is wrong when the read doesn't
      start at an offset that is a multiple of the page size.
      
      The result is that nfs4svc_encode_compoundres sets rq_next_page to a
      value one too small, and then the loop in svc_free_res_pages may
      incorrectly fail to clear a page pointer in rq_respages[].
      
      Pages left in rq_respages[] are available for the next rpc request to
      use, so xdr data may be written to that page, which may hold data still
      waiting to be transmitted to the client or data in the page cache.
      
      The observed result was silent data corruption seen on an NFSv4 client.
      
      We tag this as "fixing" 05638dc7 because that commit exposed this
      bug, though the incorrect calculation predates it.
      
      Particular thanks to Andrea Arcangeli and David Gilbert for analysis and
      testing.
      
      Fixes: 05638dc7 "nfsd4: simplify server xdr->next_page use"
      Cc: stable@vger.kernel.org
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
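As a rough illustration of the off-by-one described in the commit above, the following sketch counts how many pages a read touches. It is not the nfsd code: `SKETCH_PAGE_SIZE`, `pages_ignoring_offset` and `pages_spanned` are hypothetical names, and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/* Page count that ignores where in the first page the data starts. */
size_t pages_ignoring_offset(size_t len)
{
    return (len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Page count that accounts for the starting offset within the first page. */
size_t pages_spanned(size_t page_base, size_t len)
{
    assert(page_base < SKETCH_PAGE_SIZE);
    return (page_base + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

For a 4096-byte read that starts 100 bytes into a page, the first function yields 1 and the second yields 2, which is the kind of one-too-small result the message describes.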
  2. 28 Sep, 2014 2 commits
    • vfs: Don't exchange "short" filenames unconditionally. · d2fa4a84
      Mikhail Efremov committed
      Only exchange source and destination filenames
      if flags contain RENAME_EXCHANGE.
      If a running executable file is replaced by another file,
      /proc/PID/exe should still show the correct file name,
      not the old name of the file that replaced it.
      
      The scenario when this bug manifests itself was like this:
      * ALT Linux uses rpm and start-stop-daemon;
      * during a package upgrade rpm creates a temporary file
        for an executable to rename it upon successful unpacking;
      * start-stop-daemon is run subsequently and it obtains
        the (nonexistent) temporary filename via /proc/PID/exe
        thus failing to identify the running process.
      
      Note that "long" filenames (> DNAME_INLINE_LEN) are still
      exchanged without RENAME_EXCHANGE, and this behaviour has existed
      for a long time (apparently it should be fixed too).
      So this patch is just an interim workaround that restores the
      behavior for "short" names as it was before the changes
      introduced by commit da1ce067 ("vfs: add cross-rename").
      
      See https://lkml.org/lkml/2014/9/7/6 for details.
      
      AV: the comments about being more careful with ->d_name.hash
      than with ->d_name.name are from back in 2.3.40s; they
      became obsolete by 2.3.60s, when we started to unhash the
      target instead of swapping hash chain positions followed
      by d_delete() as we used to do when dcache was first
      introduced.
      Acked-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: da1ce067 "vfs: add cross-rename"
      Signed-off-by: Mikhail Efremov <sem@altlinux.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fold swapping ->d_name.hash into switch_names() · a28ddb87
      Linus Torvalds committed
      and do it along with ->d_name.len there
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 27 Sep, 2014 8 commits
  4. 26 Sep, 2014 5 commits
  5. 22 Sep, 2014 1 commit
    • Fix nasty 32-bit overflow bug in buffer i/o code. · f2d5a944
      Anton Altaparmakov committed
      On 32-bit architectures, the legacy buffer_head functions are not always
      handling the sector number with the proper 64-bit types, and will thus
      fail on 4TB+ disks.
      
      Any code that uses __getblk() (and thus bread(), breadahead(),
      sb_bread(), sb_breadahead(), sb_getblk()), and calls it using a 64-bit
      block on a 32-bit arch (where "long" is 32-bit) causes an infinite loop
      in __getblk_slow() with an infinite stream of errors logged to dmesg
      like this:
      
        __find_get_block_slow() failed. block=6740375944, b_blocknr=2445408648
        b_state=0x00000020, b_size=512
        device sda1 blocksize: 512
      
      Note how in hex block is 0x191C1F988 and b_blocknr is 0x91C1F988 i.e. the
      top 32-bits are missing (in this case the 0x1 at the top).
      
      This is because grow_dev_page() is broken and has a 32-bit overflow: it
      left-shifts the page index value (a pgoff_t, which is just 32 bits on
      32-bit architectures) to compute the block number.  The top bits get
      lost because the pgoff_t is not type cast to sector_t / 64-bit before
      the shift.
      
      This patch fixes this issue by type casting "index" to sector_t before
      doing the left shift.
      
      Note this is not a theoretical bug but has been seen in the field on a
      4TiB hard drive with logical sector size 512 bytes.
      
      This patch has been verified to fix the infinite loop problem on 3.17-rc5
      kernel using a 4TB disk image mounted using "-o loop".  Without this patch
      doing a "find /nt" where /nt is an NTFS volume causes the infinite loop
      100% reproducibly whilst with the patch it works fine as expected.
      Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
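The lost top bits can be reproduced in user space. The sketch below is only an illustration, not the grow_dev_page() code: the function names are hypothetical, and a shift of 3 is assumed (512-byte blocks in a 4096-byte page).

```c
#include <assert.h>
#include <stdint.h>

/* Shift performed in 32 bits: the high bits are lost before widening. */
uint64_t block_from_index_32(uint32_t index, unsigned sizebits)
{
    assert(sizebits < 32);
    return (uint32_t)(index << sizebits);
}

/* Widen to 64 bits first, then shift: no bits are lost. */
uint64_t block_from_index_64(uint32_t index, unsigned sizebits)
{
    assert(sizebits < 32);
    return (uint64_t)index << sizebits;
}
```

With index 842546993 and a shift of 3, the first function returns 2445408648 (0x91C1F988) and the second returns 6740375944 (0x191C1F988), the same pair of values that appears in the error message quoted in the commit.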
  6. 19 Sep, 2014 2 commits
  7. 18 Sep, 2014 4 commits
    • Revert "Btrfs: device_list_add() should not update list when mounted" · 0f23ae74
      Chris Mason committed
      This reverts commit b96de000.
      
      This commit is triggering failures to mount by subvolume id in some
      configurations.  The main problem is how many different ways this
      scanning function is used, both for scanning while mounted and
      unmounted.  A proper cleanup is too big for late rcs.
      
      For now, just revert the commit and we'll put a better fix into a later
      merge window.
      Signed-off-by: Chris Mason <clm@fb.com>
    • CacheFiles: Handle rename2 · e2cf1f1c
      David Howells committed
      Not all filesystems now provide the rename i_op - ext4 for one - but rather
      provide the rename2 i_op.  CacheFiles checks that the filesystem has rename
      and so will reject ext4 now with EPERM:
      
      	CacheFiles: Failed to register: -1
      
      Fix this by checking for rename2 as an alternative.  The call to vfs_rename()
      actually handles selection of the appropriate function, so we needn't worry
      about that.
      
      Turning on debugging shows:
      
      	[cachef] ==> cachefiles_get_directory(,,cache)
      	[cachef] subdir -> ffff88000b22b778 positive
      	[cachef] <== cachefiles_get_directory() = -1 [check]
      
      where -1 is EPERM.
      Signed-off-by: David Howells <dhowells@redhat.com>
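The check described in the commit above amounts to accepting either operation. A minimal sketch, assuming a simplified operations table; `inode_ops_sketch`, `dummy_rename_op` and `has_rename_support` are hypothetical names and do not mirror the kernel's `struct inode_operations` signatures:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two rename operations a filesystem may provide. */
struct inode_ops_sketch {
    int (*rename)(void);
    int (*rename2)(void);
};

/* Placeholder operation used only to populate the table in examples. */
int dummy_rename_op(void)
{
    return 0;
}

/* Accept a filesystem that offers either operation, not just the older one. */
int has_rename_support(const struct inode_ops_sketch *ops)
{
    assert(ops != NULL);
    return ops->rename != NULL || ops->rename2 != NULL;
}
```

A table that only fills in `rename2`, as the message says ext4 does, passes this check, while a table with neither operation is still rejected.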
    • cachefiles: remove two unused pagevecs. · 696382f9
      NeilBrown committed
      
      These two have been unused since
      
      commit c4d6d8db
          CacheFiles: Fix the marking of cached pages
      
      in 3.8.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: David Howells <dhowells@redhat.com>
    • FS-Cache: refcount becomes corrupt under vma pressure. · 3e1199dc
      Milosz Tanski committed
      In rare cases under heavy VMA pressure the ref count for a fscache cookie
      becomes corrupt. In this case we decrement ref count even if we fail before
      incrementing the refcount.
      
      FS-Cache: Assertion failed bnode-eca5f9c6/syslog
      0 > 0 is false
      ------------[ cut here ]------------
      kernel BUG at fs/fscache/cookie.c:519!
      invalid opcode: 0000 [#1] SMP
      Call Trace:
      [<ffffffffa01ba060>] __fscache_relinquish_cookie+0x50/0x220 [fscache]
      [<ffffffffa02d64ce>] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph]
      [<ffffffffa02ae1d3>] ceph_destroy_inode+0x33/0x200 [ceph]
      [<ffffffff811cf67e>] ? __fsnotify_inode_delete+0xe/0x10
      [<ffffffff811a9e0c>] destroy_inode+0x3c/0x70
      [<ffffffff811a9f51>] evict+0x111/0x180
      [<ffffffff811aa763>] iput+0x103/0x190
      [<ffffffff811a5de8>] __dentry_kill+0x1c8/0x220
      [<ffffffff811a5f31>] shrink_dentry_list+0xf1/0x250
      [<ffffffff811a762c>] prune_dcache_sb+0x4c/0x60
      [<ffffffff811930af>] super_cache_scan+0xff/0x170
      [<ffffffff8113d7a0>] shrink_slab_node+0x140/0x2c0
      [<ffffffff8113f2da>] shrink_slab+0x8a/0x130
      [<ffffffff81142572>] balance_pgdat+0x3e2/0x5d0
      [<ffffffff811428ca>] kswapd+0x16a/0x4a0
      [<ffffffff810a43f0>] ? __wake_up_sync+0x20/0x20
      [<ffffffff81142760>] ? balance_pgdat+0x5d0/0x5d0
      [<ffffffff81083e09>] kthread+0xc9/0xe0
      [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90
      [<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0
      [<ffffffff8159f63c>] ret_from_fork+0x7c/0xb0
      [<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0
      RIP [<ffffffffa01b984b>] __fscache_disable_cookie+0x1db/0x210 [fscache]
      RSP <ffff8803bc85f9b8>
      ---[ end trace 254d0d7c74a01f25 ]---
      Signed-off-by: Milosz Tanski <milosz@adfin.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
  8. 17 Sep, 2014 1 commit
    • Btrfs: set inode's logged_trans/last_log_commit after ranged fsync · 125c4cf9
      Filipe Manana committed
      When a ranged fsync finishes and there are still extent maps in the modified
      list, still set the inode's logged_trans and last_log_commit. This is important
      in case an inode is fsync'ed and unlinked in the same transaction, to ensure its
      inode ref gets deleted from the log and the respective dentries in its parent
      are deleted too from the log (if the parent directory was fsync'ed in the same
      transaction).
      
      Instead make btrfs_inode_in_log() return false if the list of modified extent
      maps isn't empty.
      
      This is an incremental on top of the v4 version of the patch:
      
          "Btrfs: fix fsync data loss after a ranged fsync"
      
      which was added to its v5, but didn't make it on time.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  9. 16 Sep, 2014 6 commits
  10. 15 Sep, 2014 5 commits
    • [SMB3] Fix oops when creating symlinks on smb3 · da80659d
      Steve French committed
      We were not checking for symlink support properly for SMB2/SMB3
      mounts, so we could oops when trying to create a symlink on an
      SMB2/SMB3 mount that was mounted with mfsymlinks.
      Signed-off-by: Steve French <smfrench@gmail.com>
      Cc: <stable@vger.kernel.org> # 3.14+
      CC: Sachin Prabhu <sprabhu@redhat.com>
    • vfs: avoid non-forwarding large load after small store in path lookup · 9226b5b4
      Linus Torvalds committed
      The performance regression that Josef Bacik reported in the pathname
      lookup (see commit 99d263d4 "vfs: fix bad hashing of dentries") made
      me look at performance stability of the dcache code, just to verify that
      the problem was actually fixed.  That turned up a few other problems in
      this area.
      
      There are a few cases where we exit RCU lookup mode and go to the slow
      serializing case when we shouldn't; Al has fixed those and they'll come
      in with the next VFS pull.
      
      But my performance verification also shows that link_path_walk() turns
      out to have a very unfortunate 32-bit store of the length and hash of
      the name we look up, followed by a 64-bit read of the combined hash_len
      field.  That screws up the processor store to load forwarding, causing
      an unnecessary hiccup in this critical routine.
      
      It's caused by the ugly calling convention for the "hash_name()"
      function, and easily fixed by just making hash_name() fill in the whole
      'struct qstr' rather than passing it a pointer to just the hash value.
      
      With that, the profile for this function looks much smoother.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • [CIFS] Fix setting time before epoch (negative time values) · 2ae83bf9
      Steve French committed
      xfstest generic/258 sets the time on a file to a negative value
      (before 1970), which fails since do_div cannot handle negative
      numbers.  In addition, 'normal' division of 64-bit values does
      not build on 32-bit architectures, so we have to work around this
      by special-casing negative values in cifs_NTtimeToUnix.
      
      The Samba server also has a bug with this (see Samba bugzilla 7771),
      but it works against a Windows server.
      Signed-off-by: Steve French <smfrench@gmail.com>
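A user-space sketch of special-casing negative values might look like the following. It is not cifs_NTtimeToUnix itself: `ts_sketch` and `nt_time_to_unix_sketch` are hypothetical names, plain C division stands in for do_div, and the constant is the number of 100 ns intervals between 1601-01-01 and 1970-01-01.

```c
#include <assert.h>
#include <stdint.h>

/* 100 ns intervals between the NT epoch (1601) and the Unix epoch (1970). */
#define NT_EPOCH_OFFSET_100NS 116444736000000000LL

struct ts_sketch {
    int64_t sec;
    long nsec;
};

struct ts_sketch nt_time_to_unix_sketch(int64_t ntutc)
{
    struct ts_sketch ts;
    int64_t t;

    /* Guard against signed overflow in the subtraction and negation below. */
    assert(ntutc > INT64_MIN + NT_EPOCH_OFFSET_100NS);
    t = ntutc - NT_EPOCH_OFFSET_100NS;

    if (t < 0) {
        /* Divide the magnitude as an unsigned value, then restore the sign. */
        uint64_t abs_t = (uint64_t)(-t);
        ts.nsec = -(long)((abs_t % 10000000u) * 100);
        ts.sec = -(int64_t)(abs_t / 10000000u);
    } else {
        ts.nsec = (long)((t % 10000000) * 100);
        ts.sec = t / 10000000;
    }
    return ts;
}
```

In this sketch a time before 1970 yields non-positive seconds and nanoseconds; whether a caller would want the nanosecond part normalized to a non-negative value is left open.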
    • be careful with nd->inode in path_init() and follow_dotdot_rcu() · 4023bfc9
      Al Viro committed
      in the former we simply check if dentry is still valid after picking
      its ->d_inode; in the latter we fetch ->d_inode in the same places
      where we fetch dentry and its ->d_seq, under the same checks.
      
      Cc: stable@vger.kernel.org # 2.6.38+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu() · 7bd88377
      Al Viro committed
      return the value instead, and have path_init() do the assignment.  Broken by
      "vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
      which was Cc-stable with 2.6.38+ as destination.  This one should go where
      it went.
      
      To avoid dummy value returned in case when root is already set (it would do
      no harm, actually, since the only caller that doesn't ignore the return value
      is guaranteed to have nd->root *not* set, but it's more obvious that way),
      lift the check into callers.  And do the same to set_root(), to keep them
      in sync.
      
      Cc: stable@vger.kernel.org # 2.6.38+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  11. 14 Sep, 2014 3 commits
    • fix bogus read_seqretry() checks introduced in b37199e6 · f5be3e29
      Al Viro committed
      read_seqretry() returns true on mismatch, not on match...
      
      Cc: stable@vger.kernel.org # 3.15+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
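The return-value convention noted in the commit above can be modelled with a single-threaded toy. This is a sketch only: the `_sketch` names are hypothetical, and the real seqlock primitives also involve memory barriers and waiting for an in-progress writer, both omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy sequence counter; a writer would bump it around each update. */
struct seqcount_sketch {
    unsigned sequence;
};

/* Sample the counter before reading the protected data. */
unsigned read_seqbegin_sketch(const struct seqcount_sketch *s)
{
    assert(s != NULL);
    return s->sequence;
}

/* Nonzero means the counter changed, i.e. the reader must retry. */
int read_seqretry_sketch(const struct seqcount_sketch *s, unsigned start)
{
    assert(s != NULL);
    return s->sequence != start;
}

/* The data read in between is usable only when no retry is needed. */
int snapshot_valid_sketch(const struct seqcount_sketch *s, unsigned start)
{
    return !read_seqretry_sketch(s, start);
}
```

A check written as "proceed if the retry function returns nonzero" therefore accepts exactly the snapshots that should have been discarded.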
    • move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon) · 6f18493e
      Al Viro committed
      and lock the right list there
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: fix bad hashing of dentries · 99d263d4
      Linus Torvalds committed
      Josef Bacik found a performance regression between 3.2 and 3.10 and
      narrowed it down to commit bfcfaa77 ("vfs: use 'unsigned long'
      accesses for dcache name comparison and hashing"). He reports:
      
       "The test case is essentially
      
            for (i = 0; i < 1000000; i++)
                    mkdir("a$i");
      
        On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
        dir/sec with 3.10.  This is because we spend waaaaay more time in
        __d_lookup on 3.10 than in 3.2.
      
        The new hashing function for strings is suboptimal for <
        sizeof(unsigned long) string names (and hell even > sizeof(unsigned
        long) string names that I've tested).  I broke out the old hashing
        function and the new one into a userspace helper to get real numbers
        and this is what I'm getting:
      
            Old hash table had 1000000 entries, 0 dupes, 0 max dupes
            New hash table had 12628 entries, 987372 dupes, 900 max dupes
            We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
      
        My test does the hash, and then does the d_hash into a integer pointer
        array the same size as the dentry hash table on my system, and then
        just increments the value at the address we got to see how many
        entries we overlap with.
      
        As you can see the old hash function ended up with all 1 million
        entries in their own bucket, whereas the new one they are only
        distributed among ~12.5k buckets, which is why we're using so much
        more CPU in __d_lookup".
      
      The reason for this hash regression is two-fold:
      
       - On 64-bit architectures the down-mixing of the original 64-bit
         word-at-a-time hash into the final 32-bit hash value is very
         simplistic and suboptimal, and just adds the two 32-bit parts
         together.
      
         In particular, because there is no bit shuffling and the mixing
         boundary is also a byte boundary, similar character patterns in the
         low and high word easily end up just canceling each other out.
      
       - the old byte-at-a-time hash mixed each byte into the final hash as it
         hashed the path component name, resulting in the low bits of the hash
         generally being a good source of hash data.  That is not true for the
         word-at-a-time case, and the hash data is distributed among all the
         bits.
      
      The fix is the same in both cases: do a better job of mixing the bits up
      and using as much of the hash data as possible.  We already have the
      "hash_32|64()" functions to do that.
      Reported-by: Josef Bacik <jbacik@fb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
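To make the down-mixing point in the commit above concrete, the sketch below compares adding the two 32-bit halves with a multiplicative fold. It is an illustration under stated assumptions: the function names are hypothetical, and `SKETCH_MULTIPLIER` is a commonly used golden-ratio constant picked for this example, not necessarily the one behind the kernel's hash_32()/hash_64().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 64-bit multiplier (golden-ratio based); chosen for this sketch only. */
#define SKETCH_MULTIPLIER 0x9E3779B97F4A7C15ULL

/* Simplistic fold: add the two 32-bit halves together. */
uint32_t fold_by_adding(uint64_t hash)
{
    return (uint32_t)hash + (uint32_t)(hash >> 32);
}

/* Multiplicative fold: mix all bits, then keep the top `bits` bits. */
uint32_t fold_by_multiplying(uint64_t hash, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    return (uint32_t)((hash * SKETCH_MULTIPLIER) >> (64 - bits));
}
```

The inputs 0x0000000100000002 and 0x0000000200000001 have their halves swapped, so the additive fold maps both to 3, while the multiplicative fold with 32 output bits keeps them apart. This shows one way a plain add discards information; it does not reproduce the exact collisions measured in the report.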
  12. 13 Sep, 2014 2 commits