1. 31 Aug 2013, 1 commit
  2. 10 Aug 2013, 11 commits
    • btrfs: don't loop on large offsets in readdir · db62efbb
      Zach Brown authored
      When btrfs readdir() hits the last entry it sets the readdir offset to a
      huge value to stop buggy apps from breaking when the same name is
      returned by readdir() with concurrent rename()s.
      
      But unconditionally setting the offset to INT_MAX causes readdir() to
      loop returning any entries with offsets past INT_MAX.  It only takes a
      few hours of constant file creation and removal to create entries past
      INT_MAX.
      
      So let's set the huge offset to LLONG_MAX if the last entry has already
      overflowed 32-bit loff_t.  Without large offsets behaviour is identical.
      With large offsets, 64-bit apps will work and 32-bit apps will be no more
      broken than they currently are if they see large offsets.
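      
      A minimal userspace sketch of the offset choice described above; the
      helper name and structure are illustrative, not the kernel code:
      
      #include <limits.h>
      
      /*
       * Pick the "end of directory" sentinel: stay at INT_MAX while all real
       * entry offsets fit in 32 bits, fall back to LLONG_MAX once the last
       * entry has already overflowed a 32-bit loff_t.
       */
      static long long readdir_eof_offset(long long last_entry_offset)
      {
          if (last_entry_offset < INT_MAX)
              return INT_MAX;     /* old behaviour, safe for 32-bit apps */
          return LLONG_MAX;       /* offsets are already past INT_MAX anyway */
      }
      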
      Signed-off-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: check to see if root_list is empty before adding it to dead roots · cfad392b
      Josef Bacik authored
      A user reported a panic when running with autodefrag and deleting snapshots.
      This is because we could end up trying to add the root to the dead roots list
      twice.  To fix this, check to see if our root_list is empty before adding ourselves to the
      dead roots list.  Thanks,
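      
      The guard described above, modelled as a self-contained C snippet; the
      list type is a stand-in for the kernel's struct list_head, not the
      actual btrfs code:
      
      struct list_node { struct list_node *prev, *next; };
      
      /* An initialised node that is not on any list points at itself. */
      static int list_empty(const struct list_node *n) { return n->next == n; }
      
      static void list_add_tail(struct list_node *n, struct list_node *head)
      {
          n->prev = head->prev;
          n->next = head;
          head->prev->next = n;
          head->prev = n;
      }
      
      /* Queue the root at most once, even if two paths race to kill it. */
      static void add_dead_root(struct list_node *root_list, struct list_node *dead_roots)
      {
          if (list_empty(root_list))      /* not already queued */
              list_add_tail(root_list, dead_roots);
      }
      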
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: release both paths before logging dir/changed extents · f3b15ccd
      Josef Bacik authored
      The Ceph guys tripped over this bug where we were still holding onto the
      original path that we used to copy the inode with when logging.  This is based
      on Chris's fix, which was reported to fix the problem.  We need to drop the paths
      in two cases anyway so just move the drop up so that we don't have duplicate
      code.  Thanks,
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: allow splitting of hole em's when dropping extent cache · ee20a983
      Josef Bacik authored
      I noticed while running multi-threaded fsync tests that sometimes fsck would
      complain about an improper gap.  This happens because we fail to add a hole
      extent to the file, which was happening when we'd split a hole EM because
      btrfs_drop_extent_cache was just discarding the whole em instead of splitting
      it.  So this patch fixes this by allowing us to split a hole em properly, which
      means that added holes actually get logged properly and we no longer see this
      fsck error.  Thankfully we're tolerant of this sort of problem so a user would
      not see any adverse effects of this bug, other than fsck complaining.  Thanks,
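      
      What "splitting a hole em" amounts to can be shown with plain interval
      arithmetic; a self-contained sketch, not the btrfs_drop_extent_cache
      code itself:
      
      struct em { unsigned long long start, len; };
      
      /*
       * Split extent map m against the dropped range [drop_start, drop_end).
       * Keep the pieces in front of and behind the dropped range instead of
       * discarding the whole map; returns how many pieces survive (0-2).
       */
      static int split_em(struct em m, unsigned long long drop_start,
                          unsigned long long drop_end, struct em out[2])
      {
          unsigned long long end = m.start + m.len;
          int n = 0;
      
          if (m.start < drop_start) {       /* front piece survives */
              out[n].start = m.start;
              out[n].len = drop_start - m.start;
              n++;
          }
          if (end > drop_end) {             /* back piece survives */
              out[n].start = drop_end;
              out[n].len = end - drop_end;
              n++;
          }
          return n;
      }
      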
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: make sure the backref walker catches all refs to our extent · ed8c4913
      Josef Bacik authored
      Because we don't mess with the offset into the extent for compressed extents,
      we will properly find both extents for this case
      
      [extent a][extent b][rest of extent a]
      
      but because we already added a ref for the front half we won't add the inode
      information for the second half.  This causes us to leak that memory and not
      print out the other offset when we do logical-resolve.  So fix this by calling
      ulist_add_merge and then add our eie to the existing entry if there is one.
      With this patch we get both offsets out of logical-resolve.  With this and the
      other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo
      set.  Thanks,
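      
      The merge-on-duplicate idea can be modelled like this (a toy table, not
      the kernel's ulist API): keep one entry per extent byte number and chain
      further inode references onto the existing entry instead of dropping them.
      
      #include <stddef.h>
      
      struct inode_elem {                   /* models the chained inode/offset info */
          unsigned long long inum, offset;
          struct inode_elem *next;
      };
      
      struct ref_entry {                    /* one entry per extent bytenr */
          unsigned long long bytenr;
          struct inode_elem *inodes;
      };
      
      static void add_merge(struct ref_entry *tab, size_t *n, size_t cap,
                            unsigned long long bytenr, struct inode_elem *elem)
      {
          for (size_t i = 0; i < *n; i++) {
              if (tab[i].bytenr == bytenr) {
                  elem->next = tab[i].inodes;   /* merge into the existing entry */
                  tab[i].inodes = elem;
                  return;
              }
          }
          if (*n < cap) {                   /* otherwise add a fresh entry */
              tab[*n].bytenr = bytenr;
              tab[*n].inodes = elem;
              (*n)++;
          }
      }
      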
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: fix backref walking when we hit a compressed extent · 8ca15e05
      Josef Bacik authored
      If you do btrfs inspect-internal logical-resolve on a compressed extent that has
      been partly overwritten it won't find anything.  This is because we try and
      match the extent offset we've searched for based on the extent offset in the
      data extent entry.  However this doesn't work for compressed extents because the
      offsets are for the uncompressed size, not the compressed size.  So instead only
      do this check if we are not compressed, that way we can get an actual entry for
      the physical offset rather than nothing for compressed.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: do not offset physical if we're compressed · b76bb701
      Josef Bacik authored
      xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
      offsetting the physical based on the extent offset.  This is perfectly fine with
      uncompressed extents, however the extent offset is into the uncompressed area,
      not the compressed.  So we can return a physical value that isn't at all within
      the area we have allocated on disk.  Fix this by returning the start of the
      extent if it is compressed no matter what the offset.  Thanks,
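      
      A sketch of the rule described above (illustrative only, not the fiemap
      code): for compressed extents the extent offset counts uncompressed
      bytes, so it must not be added to the on-disk address.
      
      static unsigned long long fiemap_physical(unsigned long long disk_bytenr,
                                                unsigned long long extent_offset,
                                                int compressed)
      {
          if (compressed)
              return disk_bytenr;                 /* start of the compressed extent */
          return disk_bytenr + extent_offset;     /* uncompressed: offset is on-disk bytes */
      }
      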
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: fix extent buffer leak after backref walking · b5b9b5b3
      Liu Bo authored
      commit 47fb091f (Btrfs: fix unlock after free on rewinded tree blocks)
      takes an extra increment on the reference of the allocated dummy extent
      buffer, so now we cannot free this dummy one and end up with an extent
      buffer leak.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents · e68afa49
      Liu Bo authored
      For partial extents, snapshot-aware defrag does not work as expected,
      since
      a) we use the wrong logical offset to search for parents, which should be
         disk_bytenr + extent_offset, not just disk_bytenr,
      b) 'offset' returned by the backref walking just refers to key.offset, not
         the 'offset' stored in btrfs_extent_data_ref which is
         (key.offset - extent_offset).
      
      The reproducer:
      $ mkfs.btrfs sda
      $ mount sda /mnt
      $ btrfs sub create /mnt/sub
      $ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
      $ btrfs sub snap /mnt/sub /mnt/snap1
      $ btrfs sub snap /mnt/sub /mnt/snap2
      $ sync; btrfs filesystem defrag /mnt/sub/foo;
      $ umount /mnt
      $ btrfs-debug-tree sda  (Here we can check whether the defrag operation is snapshot-aware.)
      
      This addresses the above two problems; a short sketch of the corrected offset arithmetic follows below.
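      
      Illustrative arithmetic for the two points, with made-up field names
      (not the defrag code):
      
      struct file_extent {
          unsigned long long disk_bytenr;     /* start of the on-disk extent */
          unsigned long long extent_offset;   /* offset of this item into that extent */
          unsigned long long key_offset;      /* file offset (key.offset) of the item */
      };
      
      /* a) logical address to search backrefs for */
      static unsigned long long backref_search_logical(const struct file_extent *fe)
      {
          return fe->disk_bytenr + fe->extent_offset;
      }
      
      /* b) 'offset' as stored in btrfs_extent_data_ref */
      static unsigned long long data_ref_offset(const struct file_extent *fe)
      {
          return fe->key_offset - fe->extent_offset;
      }
      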
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified · 7cddc193
      Jie Liu authored
      Create a small file and fallocate it to a big size with the
      FALLOC_FL_KEEP_SIZE option, then truncate it back to the small
      size again; the disk free space is not given back in this case.
      I.e.:
      
      Before the fallocate:
      total 4
      -rw-r--r-- 1 root root 512 Jun 28 11:35 test
      
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G   56K  7.2G   1% /mnt
      
      After fallocate with FALLOC_FL_KEEP_SIZE (file size unchanged):
      -rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test
      
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G  5.1G  2.2G  70% /mnt
      
      After truncating back to the small size (space is not freed):
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G  5.1G  2.2G  70% /mnt
      
      With this fix, the truncated space is given back:
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G   56K  7.2G   1% /mnt
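      
      A userspace reproducer along the lines described above (path and sizes
      are illustrative):
      
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <unistd.h>
      #include <stdio.h>
      
      int main(void)
      {
          int fd = open("/mnt/test", O_CREAT | O_RDWR, 0644);
          if (fd < 0) { perror("open"); return 1; }
      
          write(fd, "x", 1);                  /* small file */
      
          /* preallocate ~5G but keep i_size unchanged */
          if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 5ULL << 30) < 0)
              perror("fallocate");
      
          /* truncate back to the small size; with the fix the
             preallocated blocks are released back to the fs */
          if (ftruncate(fd, 1) < 0)
              perror("ftruncate");
      
          close(fd);
          return 0;
      }
      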
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • dlm: kill the unnecessary and wrong device_close()->recalc_sigpending() · 201d3dfa
      Oleg Nesterov authored
      device_close()->recalc_sigpending() is not needed, sigprocmask() takes
      care of TIF_SIGPENDING correctly.
      
      And without ->siglock it is racy and wrong, it can wrongly clear
      TIF_SIGPENDING and miss a signal.
      
      But even with this patch device_close() is still buggy:
      
        1. sigprocmask() should not be used, we have set_task_blocked(),
           but this is minor.
      
        2. We should never block SIGKILL or SIGSTOP, and this is what
           the code tries to do.
      
        3. This can't protect against SIGKILL or SIGSTOP anyway. Another
           thread can do signal_wake_up(), say, do_signal_stop() or
           complete_signal() or debugger.
      
        4. sigprocmask(SIG_BLOCK, allsigs) doesn't necessarily clear
           TIF_SIGPENDING, say, freezing() or ->jobctl.
      
        5. device_write() looks equally wrong for the same reason.
      
      It looks like this tries to protect some wait_event_interruptible() logic
      from signals; it should be turned into an uninterruptible wait.  Or we need
      to implement something like signals_stop/start for such a use-case.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 08 Aug 2013, 6 commits
  4. 06 Aug 2013, 1 commit
    • LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs · 9a1b6bf8
      Trond Myklebust authored
      Firstly, nlmclnt_setlockargs can be called from a reclaimer thread, in
      which case we're in entirely the wrong namespace.
      
      Secondly, commit 8aac6270 (move
      exit_task_namespaces() outside of exit_notify()) now means that
      exit_task_work() is called after exit_task_namespaces(), which
      triggers an Oops when we're freeing up the locks.
      
      Fix this by ensuring that we initialise the nlm_host's rpc_client at mount
      time, so that the cl_nodename field is initialised to the value of
      utsname()->nodename that the net namespace uses. Then replace the
      lockd callers of utsname()->nodename.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Toralf Förster <toralf.foerster@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Nix <nix@esperi.org.uk>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: stable@vger.kernel.org # 3.10.x
  5. 05 Aug 2013, 4 commits
    • vfs: add missing check for __O_TMPFILE in fcntl_init() · 3d62c45b
      Zheng Liu authored
      As the comment in include/uapi/asm-generic/fcntl.h describes, when
      introducing new O_* bits we need to check their uniqueness in
      fcntl_init(), but the __O_TMPFILE bit is missing.  So fix it.
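      
      The kind of build-time uniqueness check being referred to can be
      sketched like this; the flag values are stand-ins, and the real
      fcntl_init() uses a BUILD_BUG_ON rather than this sum-vs-OR identity:
      
      #include <assert.h>
      
      #define F_A 0x01
      #define F_B 0x02
      #define F_C 0x04
      #define F_D 0x08
      
      /* The OR equals the sum only if no two flags share a bit. */
      static_assert((F_A | F_B | F_C | F_D) == (F_A + F_B + F_C + F_D),
                    "flag bits overlap");
      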
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink · bb2314b4
      Andy Lutomirski authored
      Every now and then someone proposes a new flink syscall, and this spawns
      a long discussion of whether it would be a security problem.  I think
      that this is missing the point: flink is *already* allowed without
      privilege as long as /proc is mounted -- it's called AT_SYMLINK_FOLLOW.
      
      Now that O_TMPFILE is here, the ability to create a file with O_TMPFILE,
      write it, and link it in is very convenient.  The only problem is that
      it requires that /proc be mounted so that you can do:
      
      linkat(AT_FDCWD, "/proc/self/fd/<tmpfd>", dfd, path, AT_SYMLINK_FOLLOW)
      
      This sucks -- it's much nicer to do:
      
      linkat(tmpfd, "", dfd, path, AT_EMPTY_PATH)
      
      Let's allow it.
      
      If this turns out to be excessively scary, we could instead require
      that the inode in question be I_LINKABLE, but this seems pointless given
      the /proc situation.
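      
      Put together, the O_TMPFILE plus AT_EMPTY_PATH flow looks roughly like
      this (directory and target paths are illustrative):
      
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <unistd.h>
      #include <stdio.h>
      
      int main(void)
      {
          /* unnamed file in /tmp; it becomes visible only once linked in */
          int fd = open("/tmp", O_TMPFILE | O_WRONLY, 0600);
          if (fd < 0) { perror("open(O_TMPFILE)"); return 1; }
      
          if (write(fd, "hello\n", 6) != 6)
              perror("write");
      
          /* link the written file into the namespace without going via /proc */
          if (linkat(fd, "", AT_FDCWD, "/tmp/now-visible", AT_EMPTY_PATH) < 0)
              perror("linkat(AT_EMPTY_PATH)");
      
          close(fd);
          return 0;
      }
      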
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: Fix file mode for O_TMPFILE · e305f48b
      Andy Lutomirski authored
      O_TMPFILE, like O_CREAT, should respect the requested mode and should
      create regular files.
      
      This fixes two bugs: O_TMPFILE required privilege (because the mode
      ended up as 000) and it produced bogus inodes with no type.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • reiserfs: fix deadlock in umount · 672fe15d
      Al Viro authored
      Since remove_proc_entry() started to wait for IO in progress (i.e.
      since 2007 or so), the locking in fs/reiserfs/proc.c became wrong;
      if procfs read happens between the moment when umount() locks the
      victim superblock and removal of /proc/fs/reiserfs/<device>/*,
      we'll get a deadlock - read will wait for s_umount (in sget(),
      called by r_start()), while umount will wait in remove_proc_entry()
      for that read to finish, holding s_umount all along.
      
      Fortunately, the same change allows a much simpler race avoidance -
      all we need to do is remove the procfs entries in the very beginning
      of reiserfs ->kill_sb(); that'll guarantee that the pointer to the
      superblock will remain valid for the duration of procfs IO, so we don't
      need sget() to keep the sucker alive.  As a matter of fact, we can
      get rid of the home-grown iterator completely, and use single_open()
      instead.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  6. 01 Aug 2013, 2 commits
  7. 30 Jul 2013, 2 commits
  8. 27 Jul 2013, 3 commits
  9. 25 Jul 2013, 1 commit
    • xfs: di_flushiter considered harmful · e1b4271a
      Dave Chinner authored
      When we made all inode updates transactional, we no longer needed
      the log recovery detection for inodes being newer on disk than the
      transaction being replayed - it was redundant as replay of the log
      would always result in the latest version of the inode being on
      disk. It was redundant, but left in place because it wasn't
      considered to be a problem.
      
      However, with the new "don't read inodes on create" optimisation,
      flushiter has come back to bite us. Essentially, the optimisation
      always initialises flushiter to zero in the create transaction,
      and so if we then crash and run recovery and the inode already on
      disk has a non-zero flushiter it will skip recovery of that inode.
      As a result, log recovery does the wrong thing and we end up with a
      corrupt filesystem.
      
      Because we have to support old kernel to new kernel upgrades, we
      can't just get rid of the flushiter support in log recovery as we
      might be upgrading from a kernel that doesn't have fully transactional
      inode updates.  Unfortunately, for v4 superblocks there is no way to
      guarantee that log recovery knows about this fact.
      
      We cannot add a new inode format flag to say it's a "special inode
      create" because it won't be understood by older kernels and so
      recovery could do the wrong thing on downgrade. We cannot specially
      detect the combination of zero mode/non-zero flushiter on disk to
      non-zero mode, zero flushiter in the log item during recovery
      because wrapping of the flushiter can result in false detection.
      
      Hence that makes this "don't use flushiter" optimisation limited to
      a disk format that guarantees that we don't need it. And that means
      the only fix here is to limit the "no read IO on create"
      optimisation to version 5 superblocks....
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      
      (cherry picked from commit e60896d8)
  10. 24 Jul 2013, 3 commits
    • NFSv4: Fix brainfart in attribute length calculation · 4f3cc480
      Trond Myklebust authored
      The calculation of the attribute length was 4 bytes off.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Tested-by: Andre Heider <a.heider@gmail.com>
      Reported-and-tested-by: Henrik Rydberg <rydberg@euromail.se>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • nfsd: nfs4_file_get_access: need to be more careful with O_RDWR · df66e753
      Harshula Jayasuriya authored
      If fi_fds = {non-NULL, NULL, non-NULL} and oflag = O_WRONLY
      the WARN_ON_ONCE(!(fp->fi_fds[oflag] || fp->fi_fds[O_RDWR]))
      doesn't trigger when it should.
      Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • nfsd: nfsd_open: when dentry_open returns an error do not propagate as struct file · e4daf1ff
      Harshula Jayasuriya authored
      The following call chain:
      ------------------------------------------------------------
      nfs4_get_vfs_file
      - nfsd_open
        - dentry_open
          - do_dentry_open
            - __get_file_write_access
              - get_write_access
                - return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
      ------------------------------------------------------------
      
      can result in the following state:
      ------------------------------------------------------------
      struct nfs4_file {
      ...
        fi_fds = {0xffff880c1fa65c80, 0xffffffffffffffe6, 0x0},
        fi_access = {{
            counter = 0x1
          }, {
            counter = 0x0
          }},
      ...
      ------------------------------------------------------------
      
      1) First time around, in nfs4_get_vfs_file() fp->fi_fds[O_WRONLY] is
      NULL, hence nfsd_open() is called where we get status set to an error
      and fp->fi_fds[O_WRONLY] to -ETXTBSY. Thus we do not reach
      nfs4_file_get_access() and fi_access[O_WRONLY] is not incremented.
      
      2) Second time around, in nfs4_get_vfs_file() fp->fi_fds[O_WRONLY] is
      NOT NULL (-ETXTBSY), so nfsd_open() is NOT called, but
      nfs4_file_get_access() IS called and fi_access[O_WRONLY] is incremented.
      Thus we leave a landmine in the form of the nfs4_file data structure in
      an incorrect state.
      
      3) Eventually, when __nfs4_file_put_access() is called it finds
      fi_access[O_WRONLY] being non-zero, it decrements it and calls
      nfs4_file_put_fd() which tries to fput -ETXTBSY.
      ------------------------------------------------------------
      ...
           [exception RIP: fput+0x9]
           RIP: ffffffff81177fa9  RSP: ffff88062e365c90  RFLAGS: 00010282
           RAX: ffff880c2b3d99cc  RBX: ffff880c2b3d9978  RCX: 0000000000000002
           RDX: dead000000100101  RSI: 0000000000000001  RDI: ffffffffffffffe6
           RBP: ffff88062e365c90   R8: ffff88041fe797d8   R9: ffff88062e365d58
           R10: 0000000000000008  R11: 0000000000000000  R12: 0000000000000001
           R13: 0000000000000007  R14: 0000000000000000  R15: 0000000000000000
           ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
        #9 [ffff88062e365c98] __nfs4_file_put_access at ffffffffa0562334 [nfsd]
       #10 [ffff88062e365cc8] nfs4_file_put_access at ffffffffa05623ab [nfsd]
       #11 [ffff88062e365ce8] free_generic_stateid at ffffffffa056634d [nfsd]
       #12 [ffff88062e365d18] release_open_stateid at ffffffffa0566e4b [nfsd]
       #13 [ffff88062e365d38] nfsd4_close at ffffffffa0567401 [nfsd]
       #14 [ffff88062e365d88] nfsd4_proc_compound at ffffffffa0557f28 [nfsd]
       #15 [ffff88062e365dd8] nfsd_dispatch at ffffffffa054543e [nfsd]
       #16 [ffff88062e365e18] svc_process_common at ffffffffa04ba5a4 [sunrpc]
       #17 [ffff88062e365e98] svc_process at ffffffffa04babe0 [sunrpc]
       #18 [ffff88062e365eb8] nfsd at ffffffffa0545b62 [nfsd]
       #19 [ffff88062e365ee8] kthread at ffffffff81090886
       #20 [ffff88062e365f48] kernel_thread at ffffffff8100c14a
      ------------------------------------------------------------
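      
      The underlying discipline is the usual ERR_PTR/IS_ERR pattern: an
      error-encoding pointer must be unwrapped at the call site, never stored
      where a real struct file pointer is expected.  A self-contained model
      of that pattern (userspace stand-ins, not the nfsd code):
      
      #include <errno.h>
      
      #define MAX_ERRNO 4095
      
      static void *ERR_PTR(long err) { return (void *)err; }
      static int IS_ERR(const void *p)
      {
          return (unsigned long)p >= (unsigned long)-MAX_ERRNO;
      }
      static long PTR_ERR(const void *p) { return (long)p; }
      
      struct file { int dummy; };
      
      static struct file *dentry_open_model(int fail)
      {
          static struct file f;
          return fail ? ERR_PTR(-ETXTBSY) : &f;   /* -ETXTBSY is the 0x...ffe6 above */
      }
      
      /* unwrap the error here instead of propagating the ERR_PTR as a file */
      static int nfsd_open_model(struct file **filp, int fail)
      {
          struct file *f = dentry_open_model(fail);
      
          if (IS_ERR(f)) {
              *filp = NULL;               /* never store the error pointer */
              return (int)PTR_ERR(f);     /* hand back -ETXTBSY as a status */
          }
          *filp = f;
          return 0;
      }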
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  11. 21 Jul 2013, 2 commits
    • ext3: fix a BUG when opening a file with O_TMPFILE flag · dda5690d
      Zheng Liu authored
      When we try to open a file with O_TMPFILE flag, we will trigger a bug.
      The root cause is that in ext3_orphan_add() we check ->i_nlink == 0 and
      this check always fails because we set ->i_nlink = 1 in
      inode_init_always().  We can use the following program to trigger it:
      
      int main(int argc, char *argv[])
      {
      	int fd;
      
      	fd = open(argv[1], O_TMPFILE, 0666);
      	if (fd < 0) {
      		perror("open ");
      		return -1;
      	}
      	close(fd);
      	return 0;
      }
      
      The oops message looks like this:
      
      kernel: kernel BUG at fs/ext3/namei.c:1992!
      kernel: invalid opcode: 0000 [#1] SMP
      kernel: Modules linked in: ext4 jbd2 crc16 cpufreq_ondemand ipv6 dm_mirror dm_region_hash dm_log dm_mod parport_pc parport serio_raw sg dcdbas pcspkr i2c_i801 ehci_pci ehci_hcd button acpi_cpufreq mperf e1000e ptp pps_core ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core ext3 jbd sd_mod ahci libahci libata scsi_mod uhci_hcd
      kernel: CPU: 0 PID: 2882 Comm: tst_tmpfile Not tainted 3.11.0-rc1+ #4
      kernel: Hardware name: Dell Inc. OptiPlex 780 /0V4W66, BIOS A05 08/11/2010
      kernel: task: ffff880112d30050 ti: ffff8801124d4000 task.ti: ffff8801124d4000
      kernel: RIP: 0010:[<ffffffffa00db5ae>] [<ffffffffa00db5ae>] ext3_orphan_add+0x6a/0x1eb [ext3]
      kernel: RSP: 0018:ffff8801124d5cc8  EFLAGS: 00010202
      kernel: RAX: 0000000000000000 RBX: ffff880111510128 RCX: ffff8801114683a0
      kernel: RDX: 0000000000000000 RSI: ffff880111510128 RDI: ffff88010fcf65a8
      kernel: RBP: ffff8801124d5d18 R08: 0080000000000000 R09: ffffffffa00d3b7f
      kernel: R10: ffff8801114683a0 R11: ffff8801032a2558 R12: 0000000000000000
      kernel: R13: ffff88010fcf6800 R14: ffff8801032a2558 R15: ffff8801115100d8
      kernel: FS:  00007f5d172b5700(0000) GS:ffff880117c00000(0000) knlGS:0000000000000000
      kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      kernel: CR2: 00007f5d16df15d0 CR3: 0000000110b1d000 CR4: 00000000000407f0
      kernel: Stack:
      kernel: 000000000000000c ffff8801048a7dc8 ffff8801114685a8 ffffffffa00b80d7
      kernel: ffff8801124d5e38 ffff8801032a2558 ffff88010ce24d68 0000000000000000
      kernel: ffff88011146b300 ffff8801124d5d44 ffff8801124d5d78 ffffffffa00db7e1
      kernel: Call Trace:
      kernel: [<ffffffffa00b80d7>] ? journal_start+0x8c/0xbd [jbd]
      kernel: [<ffffffffa00db7e1>] ext3_tmpfile+0xb2/0x13b [ext3]
      kernel: [<ffffffff821076f8>] path_openat+0x11f/0x5e7
      kernel: [<ffffffff821c86b4>] ? list_del+0x11/0x30
      kernel: [<ffffffff82065fa2>] ?  __dequeue_entity+0x33/0x38
      kernel: [<ffffffff82107cd5>] do_filp_open+0x3f/0x8d
      kernel: [<ffffffff82112532>] ? __alloc_fd+0x50/0x102
      kernel: [<ffffffff820f9296>] do_sys_open+0x13b/0x1cd
      kernel: [<ffffffff820f935c>] SyS_open+0x1e/0x20
      kernel: [<ffffffff82398c02>] system_call_fastpath+0x16/0x1b
      kernel: Code: 39 c7 0f 85 67 01 00 00 0f b7 03 25 00 f0 00 00 3d 00 40 00 00 74 18 3d 00 80 00 00 74 11 3d 00 a0 00 00 74 0a 83 7b 48 00 74 04 <0f> 0b eb fe 49 8b 85 50 03 00 00 4c 89 f6 48 c7 c7 c0 99 0e a0
      kernel: RIP  [<ffffffffa00db5ae>] ext3_orphan_add+0x6a/0x1eb [ext3]
      kernel: RSP <ffff8801124d5cc8>
      
      Here we couldn't call clear_nlink() directly because in d_tmpfile() we
      will call inode_dec_link_count() to decrease ->i_nlink.  So this commit
      tries to call d_tmpfile() before ext3_orphan_add() to fix this problem.
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
    • ext4: fix a BUG when opening a file with O_TMPFILE flag · e94bd349
      Zheng Liu authored
      When we try to open a file with O_TMPFILE flag, we will trigger a bug.
      The root cause is that in ext4_orphan_add() we check ->i_nlink == 0 and
      this check always fails because we set ->i_nlink = 1 in
      inode_init_always().  We can use the following program to trigger it:
      
      int main(int argc, char *argv[])
      {
      	int fd;
      
      	fd = open(argv[1], O_TMPFILE, 0666);
      	if (fd < 0) {
      		perror("open ");
      		return -1;
      	}
      	close(fd);
      	return 0;
      }
      
      The oops message looks like this:
      
      kernel BUG at fs/ext4/namei.c:2572!
      invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Modules linked in: dlci bridge stp hidp cmtp kernelcapi l2tp_ppp l2tp_netlink l2tp_core sctp libcrc32c rfcomm tun fuse nfnetli
      nk can_raw ipt_ULOG can_bcm x25 scsi_transport_iscsi ipx p8023 p8022 appletalk phonet psnap vmw_vsock_vmci_transport af_key vmw_vmci rose vsock atm can netrom ax25 af_rxrpc ir
      da pppoe pppox ppp_generic slhc bluetooth nfc rfkill rds caif_socket caif crc_ccitt af_802154 llc2 llc snd_hda_codec_realtek snd_hda_intel snd_hda_codec serio_raw snd_pcm pcsp
      kr edac_core snd_page_alloc snd_timer snd soundcore r8169 mii sr_mod cdrom pata_atiixp radeon backlight drm_kms_helper ttm
      CPU: 1 PID: 1812571 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #12
      Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
      task: ffff88007dfe69a0 ti: ffff88010f7b6000 task.ti: ffff88010f7b6000
      RIP: 0010:[<ffffffff8125ce69>]  [<ffffffff8125ce69>] ext4_orphan_add+0x299/0x2b0
      RSP: 0018:ffff88010f7b7cf8  EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffff8800966d3020 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff88007dfe70b8 RDI: 0000000000000001
      RBP: ffff88010f7b7d40 R08: ffff880126a3c4e0 R09: ffff88010f7b7ca0
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801271fd668
      R13: ffff8800966d2f78 R14: ffff88011d7089f0 R15: ffff88007dfe69a0
      FS:  00007f70441a3740(0000) GS:ffff88012a800000(0000) knlGS:00000000f77c96c0
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000002834000 CR3: 0000000107964000 CR4: 00000000000007e0
      DR0: 0000000000780000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
      Stack:
       0000000000002000 00000020810b6dde 0000000000000000 ffff88011d46db00
       ffff8800966d3020 ffff88011d7089f0 ffff88009c7f4c10 ffff88010f7b7f2c
       ffff88007dfe69a0 ffff88010f7b7da8 ffffffff8125cfac ffff880100000004
      Call Trace:
       [<ffffffff8125cfac>] ext4_tmpfile+0x12c/0x180
       [<ffffffff811cba78>] path_openat+0x238/0x700
       [<ffffffff8100afc4>] ? native_sched_clock+0x24/0x80
       [<ffffffff811cc647>] do_filp_open+0x47/0xa0
       [<ffffffff811db73f>] ? __alloc_fd+0xaf/0x200
       [<ffffffff811ba2e4>] do_sys_open+0x124/0x210
       [<ffffffff81010725>] ? syscall_trace_enter+0x25/0x290
       [<ffffffff811ba3ee>] SyS_open+0x1e/0x20
       [<ffffffff816ca8d4>] tracesys+0xdd/0xe2
       [<ffffffff81001001>] ? start_thread_common.constprop.6+0x1/0xa0
      Code: 04 00 00 00 89 04 24 31 c0 e8 c4 77 04 00 e9 43 fe ff ff 66 25 00 d0 66 3d 00 80 0f 84 0e fe ff ff 83 7b 48 00 0f 84 04 fe ff ff <0f> 0b 49 8b 8c 24 50 07 00 00 e9 88 fe ff ff 0f 1f 84 00 00 00
      
      Here we couldn't call clear_nlink() directly because in d_tmpfile() we
      will call inode_dec_link_count() to decrease ->i_nlink.  So this commit
      tries to call d_tmpfile() before ext4_orphan_add() to fix this problem.
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Dave Jones <davej@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
  12. 20 Jul 2013, 4 commits
    • livelock avoidance in sget() · acfec9a5
      Al Viro authored
      Eric Sandeen has found a nasty livelock in sget() - take a mount(2) about
      to fail.  The superblock is on ->fs_supers, ->s_umount is held exclusive,
      ->s_active is 1.  Along come two more processes, trying to mount the same
      thing; sget() in each is picking that superblock, bumping ->s_count and
      trying to grab ->s_umount.  ->s_active is 3 now.  Original mount(2)
      finally gets to deactivate_locked_super() on failure; ->s_active is 2,
      superblock is still on ->fs_supers because shutdown will *not* happen until
      ->s_active hits 0.  ->s_umount is dropped and now we have two processes
      chasing each other:
      s_active = 2, A acquired ->s_umount, B blocked
      A sees that the damn thing is stillborn, does deactivate_locked_super()
      s_active = 1, A drops ->s_umount, B gets it
      A restarts the search and finds the same superblock.  And bumps its ->s_active.
      s_active = 2, B holds ->s_umount, A blocked on trying to get it
      ... and we are in the earlier situation with A and B switched places.
      
      The root cause, of course, is that ->s_active should not grow until we'd
      got MS_BORN.  Then failing ->mount() will have deactivate_locked_super()
      shut the damn thing down.  Fortunately, it's easy to do - the key point
      is that grab_super() is called only for superblocks currently on ->fs_supers,
      so it can bump ->s_count and grab ->s_umount first, then check MS_BORN and
      bump ->s_active; we must never increment ->s_count for superblocks past
      ->kill_sb(), but grab_super() is never called for those.
      
      The bug is pretty old; we would've caught it by now, if not for accidental
      exclusion between sget() for block filesystems; the things like cgroup or
      e.g. mtd-based filesystems don't have anything of that sort, so they get
      bitten.  The right way to deal with that is obviously to fix sget()...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • allow O_TMPFILE to work with O_WRONLY · ba57ea64
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Btrfs: fix wrong write offset when replacing a device · 115930cb
      Stefan Behrens authored
      Miao Xie reported the following issue:
      
      The filesystem was corrupted after we did a device replace.
      
      Steps to reproduce:
       # mkfs.btrfs -f -m single -d raid10 <device0>..<device3>
       # mount <device0> <mnt>
       # btrfs replace start -rfB 1 <device4> <mnt>
       # umount <mnt>
       # btrfsck <device4>
      
      The reason for the issue is that we changed the write offset by mistake,
      introduced by commit 625f1c8d.
      
      We read the data from the source device at first, and then write the
      data into the corresponding place of the new device. In order to
      implement the "-r" option, the source location is remapped using
      btrfs_map_block(). The read takes place on the mapped location, and
      the write needs to take place on the unmapped location. Currently
      the write is using the mapped location, and this commit changes it
      back by undoing the change to the write address that the aforementioned
      commit added by mistake.
      Reported-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org> # 3.10+
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: re-add root to dead root list if we stop dropping it · d29a9f62
      Josef Bacik authored
      If we stop dropping a root for whatever reason we need to add it back to the
      dead root list so that we will re-start the dropping next transaction commit.
      The other case this happens is if we recover a drop because we will add a root
      without adding it to the fs radix tree, so we can leak its root and commit root
      extent buffers; adding this to the dead root list makes this cleanup happen.
      Thanks,
      
      Cc: stable@vger.kernel.org
      Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>