1. 24 6月, 2014 9 次提交
    • X
      ocfs2/dlm: do not purge lockres that is queued for assert master · ac4fef4d
      Xue jiufei 提交于
      When workqueue is delayed, it may occur that a lockres is purged while it
      is still queued for master assert.  it may trigger BUG() as follows.
      
      N1                                         N2
      dlm_get_lockres()
      ->dlm_do_master_requery
                                        is the master of lockres,
                                        so queue assert_master work
      
                                        dlm_thread() start running
                                        and purge the lockres
      
                                        dlm_assert_master_worker()
                                        send assert master message
                                        to other nodes
      receiving the assert_master
      message, set master to N2
      
      dlmlock_remote() send create_lock message to N2, but receive DLM_IVLOCKID,
      if it is RECOVERY lockres, it triggers the BUG().
      
      Another BUG() is triggered when N3 become the new master and send
      assert_master to N1, N1 will trigger the BUG() because owner doesn't
      match.  So we should not purge lockres when it is queued for assert
      master.
      Signed-off-by: Njoyce.xue <xuejiufei@huawei.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac4fef4d
    • J
      ocfs2: do not return DLM_MIGRATE_RESPONSE_MASTERY_REF to avoid endless,loop during umount · b9aaac5a
      jiangyiwen 提交于
      The following case may lead to endless loop during umount.
      
      node A         node B               node C       node D
      umount volume,
      migrate lockres1
      to B
                                                       want to lock lockres1,
                                                       send
                                                       MASTER_REQUEST_MSG
                                                       to C
                                          init block mle
                     send
                     MIGRATE_REQUEST_MSG
                     to C
                                          find a block
                                          mle, and then
                                          return
                                          DLM_MIGRATE_RESPONSE_MASTERY_REF
                                          to B
                     set C in refmap
                                          umount successfully
                     try to umount, endless
                     loop occurs when migrate
                     lockres1 since C is in
                     refmap
      
      So we can fix this endless loop case by only returning
      DLM_MIGRATE_RESPONSE_MASTERY_REF if it has a mastery mle when receiving
      MIGRATE_REQUEST_MSG.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Njiangyiwen <jiangyiwen@huawei.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Xue jiufei <xuejiufei@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9aaac5a
    • J
      ocfs2: manually do the iput once ocfs2_add_entry failed in ocfs2_symlink and ocfs2_mknod · 595297a8
      jiangyiwen 提交于
      When the call to ocfs2_add_entry() failed in ocfs2_symlink() and
      ocfs2_mknod(), iput() will not be called during dput(dentry) because no
      d_instantiate(), and this will lead to umount hung.
      Signed-off-by: Njiangyiwen <jiangyiwen@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      595297a8
    • Y
      ocfs2: fix a tiny race when running dirop_fileop_racer · f7a14f32
      Yiwen Jiang 提交于
      When running dirop_fileop_racer we found a dead lock case.
      
      2 nodes, say Node A and Node B, mount the same ocfs2 volume.  Create
      /race/16/1 in the filesystem, and let the inode number of dir 16 is less
      than the inode number of dir race.
      
      Node A                            Node B
      mv /race/16/1 /race/
                                        right after Node A has got the
                                        EX mode of /race/16/, and tries to
                                        get EX mode of /race
                                        ls /race/16/
      
      In this case, Node A has got the EX mode of /race/16/, and wants to get EX
      mode of /race/.  Node B has got the PR mode of /race/, and wants to get
      the PR mode of /race/16/.  Since EX and PR are mutually exclusive, dead
      lock happens.
      
      This patch fixes this case by locking in ancestor order before trying
      inode number order.
      Signed-off-by: NYiwen Jiang <jiangyiwen@huawei.com>
      Signed-off-by: NJoseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7a14f32
    • X
      ocfs2/dlm: fix misuse of list_move_tail() in dlm_run_purge_list() · a270c6d3
      Xue jiufei 提交于
      When a lockres in purge list but is still in use, it should be moved to
      the tail of purge list.  dlm_thread will continue to check next lockres in
      purge list.  However, code list_move_tail(&dlm->purge_list,
      &lockres->purge) will do *no* movements, so dlm_thread will purge the same
      lockres in this loop again and again.  If it is in use for a long time,
      other lockres will not be processed.
      Signed-off-by: NYiwen Jiang <jiangyiwen@huawei.com>
      Signed-off-by: Njoyce.xue <xuejiufei@huawei.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a270c6d3
    • W
      ocfs2: refcount: take rw_lock in ocfs2_reflink · 8a8ad1c2
      Wengang Wang 提交于
      This patch tries to fix this crash:
      
       #5 [ffff88003c1cd690] do_invalid_op at ffffffff810166d5
       #6 [ffff88003c1cd730] invalid_op at ffffffff8159b2de
          [exception RIP: ocfs2_direct_IO_get_blocks+359]
          RIP: ffffffffa05dfa27  RSP: ffff88003c1cd7e8  RFLAGS: 00010202
          RAX: 0000000000000000  RBX: ffff88003c1cdaa8  RCX: 0000000000000000
          RDX: 000000000000000c  RSI: ffff880027a95000  RDI: ffff88003c79b540
          RBP: ffff88003c1cd858   R8: 0000000000000000   R9: ffffffff815f6ba0
          R10: 00000000000001c9  R11: 00000000000001c9  R12: ffff88002d271500
          R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000001000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffff88003c1cd860] do_direct_IO at ffffffff811cd31b
       #8 [ffff88003c1cd950] direct_IO_iovec at ffffffff811cde9c
       #9 [ffff88003c1cd9b0] do_blockdev_direct_IO at ffffffff811ce764
      #10 [ffff88003c1cdb80] __blockdev_direct_IO at ffffffff811ce7cc
      #11 [ffff88003c1cdbb0] ocfs2_direct_IO at ffffffffa05df756 [ocfs2]
      #12 [ffff88003c1cdbe0] generic_file_direct_write_iter at ffffffff8112f935
      #13 [ffff88003c1cdc40] ocfs2_file_write_iter at ffffffffa0600ccc [ocfs2]
      #14 [ffff88003c1cdd50] do_aio_write at ffffffff8119126c
      #15 [ffff88003c1cddc0] aio_rw_vect_retry at ffffffff811d9bb4
      #16 [ffff88003c1cddf0] aio_run_iocb at ffffffff811db880
      #17 [ffff88003c1cde30] io_submit_one at ffffffff811dc238
      #18 [ffff88003c1cde80] do_io_submit at ffffffff811dc437
      #19 [ffff88003c1cdf70] sys_io_submit at ffffffff811dc530
      #20 [ffff88003c1cdf80] system_call_fastpath at ffffffff8159a159
      
      It crashes at
              BUG_ON(create && (ext_flags & OCFS2_EXT_REFCOUNTED));
      in ocfs2_direct_IO_get_blocks.
      
      ocfs2_direct_IO_get_blocks is expecting the OCFS2_EXT_REFCOUNTED be removed in
      ocfs2_prepare_inode_for_write() if it was there. But no cluster lock is taken
      during the time before (or inside) ocfs2_prepare_inode_for_write() and after
      ocfs2_direct_IO_get_blocks().
      
      It can happen in this case:
      
      Node A(which crashes)				Node B
      ------------------------                 ---------------------------
      ocfs2_file_aio_write
        ocfs2_prepare_inode_for_write
          ocfs2_inode_lock
          ...
          ocfs2_inode_unlock
        #no refcount found
      ....					ocfs2_reflink
                                                ocfs2_inode_lock
                                                ...
                                                ocfs2_inode_unlock
                                                #now, refcount flag set on extent
      
                                              ...
                                              flush change to disk
      
      ocfs2_direct_IO_get_blocks
        ocfs2_get_clusters
          #extent map miss
          #buffer_head miss
          read extents from disk
        found refcount flag on extent
        crash..
      
      Fix:
      Take rw_lock in ocfs2_reflink path
      Signed-off-by: NWengang Wang <wen.gang.wang@oracle.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a8ad1c2
    • X
      ocfs2: revert "ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously" · b253bfd8
      Xue jiufei 提交于
      75f82eaa ("ocfs2: fix NULL pointer dereference when dismount and
      ocfs2rec simultaneously") may cause umount hang while shutting down
      truncate log.
      
      The situation is as followes:
      ocfs2_dismout_volume
      -> ocfs2_recovery_exit
        -> free osb->recovery_map
      -> ocfs2_truncate_shutdown
        -> lock global bitmap inode
          -> ocfs2_wait_for_recovery
                -> check whether osb->recovery_map->rm_used is zero
      
      Because osb->recovery_map is already freed, rm_used can be any other
      values, so it may yield umount hang.
      Signed-off-by: Njoyce.xue <xuejiufei@huawei.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b253bfd8
    • T
      ocfs2: fix deadlock when two nodes are converting same lock from PR to EX and... · 27bf6305
      Tariq Saeed 提交于
      ocfs2: fix deadlock when two nodes are converting same lock from PR to EX and idletimeout closes conn
      
      Orabug: 18639535
      
      Two node cluster and both nodes hold a lock at PR level and both want to
      convert to EX at the same time.  Master node 1 has sent BAST and then
      closes the connection due to idletime out.  Node 0 receives BAST, sends
      unlock req with cancel flag but gets error -ENOTCONN.  The problem is
      this error is ignored in dlm_send_remote_unlock_request() on the
      **incorrect** assumption that the master is dead.  See NOTE in comment
      why it returns DLM_NORMAL.  Upon getting DLM_NORMAL, node 0 proceeds to
      sends convert (without cancel flg) which fails with -ENOTCONN.  waits 5
      sec and resends.
      
      This time gets DLM_IVLOCKID from the master since lock not found in
      grant, it had been moved to converting queue in response to conv PR->EX
      req.  No way out.
      
      Node 1 (master)				Node 0
      ==============				======
      
        lock mode PR				PR
      
        convert PR -> EX
        mv grant -> convert and que BAST
        ...
                           <-------- convert PR -> EX
        convert que looks like this: ((node 1, PR -> EX) (node 0, PR -> EX))
        ...
                              BAST (want PR -> NL)
                           ------------------>
        ...
        idle timout, conn closed
                                      ...
                                      In response to BAST,
                                      sends unlock with cancel convert flag
                                      gets -ENOTCONN. Ignores and
                                      sends remote convert request
                                      gets -ENOTCONN, waits 5 Sec, retries
        ...
        reconnects
                         <----------------- convert req goes through on next try
        does not find lock on grant que
                         status DLM_IVLOCKID
                         ------------------>
        ...
      
      No way out.  Fix is to keep retrying unlock with cancel flag until it
      succeeds or the master dies.
      Signed-off-by: NTariq Saeed <tariq.x.saeed@oracle.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27bf6305
    • A
      ocfs2: should add inode into orphan dir after updating entry in ocfs2_rename() · 5fb1beb0
      alex chen 提交于
      There are two files a and b in dir /mnt/ocfs2.
      
          node A                           node B
      
        mv a b
        In ocfs2_rename(), after calling
        ocfs2_orphan_add(), the inode of
        file b will be added into orphan
        dir.
      
        If ocfs2_update_entry() fails,
        ocfs2_rename return error and mv
        operation fails. But file b still
        exists in the parent dir.
      
        ocfs2_queue_orphan_scan
         -> ocfs2_queue_recovery_completion
         -> ocfs2_complete_recovery
         -> ocfs2_recover_orphans
        The inode of the file b will be
        put with iput().
      
        ocfs2_evict_inode
         -> ocfs2_delete_inode
         -> ocfs2_wipe_inode
         -> ocfs2_remove_inode
        OCFS2_VALID_FL in the inode
        i_flags will be cleared.
      
                                         The file b still can be accessed
                                         on node B.
                                         ls /mnt/ocfs2
                                         When first read the file b with
                                         ocfs2_read_inode_block(). It will
                                         validate the inode using
                                         ocfs2_validate_inode_block().
                                         Because OCFS2_VALID_FL not set in
                                         the inode i_flags, so the file
                                         system will be readonly.
      
      So we should add inode into orphan dir after updating entry in
      ocfs2_rename().
      Signed-off-by: Nalex.chen <alex.chen@huawei.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5fb1beb0
  2. 20 6月, 2014 9 次提交
    • M
      Btrfs: fix wrong error handle when the device is missing or is not writeable · 8408c716
      Miao Xie 提交于
      The original bio might be submitted, so we shoud increase bi_remaining to
      account for it when we deal with the error that the device is missing or
      is not writeable, or we would skip the endio handle.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8408c716
    • M
      Btrfs: fix deadlock when mounting a degraded fs · c55f1396
      Miao Xie 提交于
      The deadlock happened when we mount degraded filesystem, the reproduced
      steps are following:
       # mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1>
       # echo 1 > /sys/block/`basename <dev0>`/device/delete
       # mount -o degraded <dev1> <mnt>
      
      The reason was that the counter -- bi_remaining was wrong. If the missing
      or unwriteable device was the last device in the mapping array, we would
      not submit the original bio, so we shouldn't increase bi_remaining of it
      in btrfs_end_bio(), or we would skip the final endio handle.
      
      Fix this problem by adding a flag into btrfs bio structure. If we submit
      the original bio, we will set the flag, and we increase bi_remaining counter,
      or we don't.
      
      Though there is another way to fix it -- decrease bi_remaining counter of the
      original bio when we make sure the original bio is not submitted, this method
      need add more check and is easy to make mistake.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c55f1396
    • M
      Btrfs: use bio_endio_nodec instead of open code · e990f167
      Miao Xie 提交于
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e990f167
    • W
      Btrfs: fix NULL pointer crash when running balance and scrub concurrently · 298a8f9c
      Wang Shilong 提交于
      While running balance, scrub, fsstress concurrently we hit the
      following kernel crash:
      
      [56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132
      [56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
      [56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
      [56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0
      [56561.524325] Oops: 0000 [#1] SMP
      [....]
      [56561.527237] Call Trace:
      [56561.527309]  [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs]
      [56561.527392]  [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0
      [56561.527476]  [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs]
      [56561.527561]  [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs]
      [56561.527639]  [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0
      [56561.527712]  [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60
      [56561.527788]  [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0
      [56561.527870]  [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0
      [56561.527941]  [<ffffffff815707f7>] tracesys+0xdd/0xe2
      [...]
      [56561.528304] RIP  [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
      [56561.528395]  RSP <ffff88004c0f5be8>
      [56561.528454] CR2: 0000000000000078
      
      This is because in btrfs_relocate_chunk(), we will free @bdev directly while
      scrub may still hold extent mapping, and may access freed memory.
      
      Fix this problem by wrapping freeing @bdev work into free_extent_map() which
      is based on reference count.
      Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      298a8f9c
    • Q
      btrfs: Skip scrubbing removed chunks to avoid -ENOENT. · ced96edc
      Qu Wenruo 提交于
      When run scrub with balance, sometimes -ENOENT will be returned, since
      in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but
      btrfs_lookup_block_group() will search block group in *MEMORY*, so if a
      chunk is removed but not committed, -ENOENT will be returned.
      
      However, there is no need to stop scrubbing since other chunks may be
      scrubbed without problem.
      
      So this patch changes the behavior to skip removed chunks and continue
      to scrub the rest.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ced96edc
    • M
      Btrfs: fix broken free space cache after the system crashed · e570fd27
      Miao Xie 提交于
      When we mounted the filesystem after the crash, we got the following
      message:
        BTRFS error (device xxx): block group xxxx has wrong amount of free space
        BTRFS error (device xxx): failed to load free space cache for block group xxx
      
      It is because we didn't update the metadata of the allocated space (in extent
      tree) until the file data was written into the disk. During this time, there was
      no information about the allocated spaces in either the extent tree nor the
      free space cache. when we wrote out the free space cache at this time (commit
      transaction), those spaces were lost. In fact, only the free space that is
      used to store the file data had this problem, the others didn't because
      the metadata of them is updated in the same transaction context.
      
      There are many methods which can fix the above problem
      - track the allocated space, and write it out when we write out the free
        space cache
      - account the size of the allocated space that is used to store the file
        data, if the size is not zero, don't write out the free space cache.
      
      The first one is complex and may make the performance drop down.
      This patch chose the second method, we use a per-block-group variant to
      account the size of that allocated space. Besides that, we also introduce
      a per-block-group read-write semaphore to avoid the race between
      the allocation and the free space cache write out.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e570fd27
    • M
      Btrfs: make free space cache write out functions more readable · 5349d6c3
      Miao Xie 提交于
      This patch makes the free space cache write out functions more readable,
      and beisdes that, it also reduces the stack space that the function --
      __btrfs_write_out_cache uses from 194bytes to 144bytes.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5349d6c3
    • F
      Btrfs: remove unused wait queue in struct extent_buffer · 46fefe41
      Filipe Manana 提交于
      The lock_wq wait queue is not used anywhere, therefore just remove it.
      On a x86_64 system, this reduced sizeof(struct extent_buffer) from 320
      bytes down to 296 bytes, which means a 4Kb page can now be used for
      13 extent buffers instead of 12.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      46fefe41
    • C
      Btrfs: fix deadlocks with trylock on tree nodes · ea4ebde0
      Chris Mason 提交于
      The Btrfs tree trylock function is poorly named.  It always takes
      the spinlock and backs off if the blocking lock is held.  This
      can lead to surprising lockups because people expect it to really be a
      trylock.
      
      This commit makes it a pure trylock, both for the spinlock and the
      blocking lock.  It also reworks the nested lock handling slightly to
      avoid taking the read lock while a spinning write lock might be held.
      Signed-off-by: NChris Mason <clm@fb.com>
      ea4ebde0
  3. 18 6月, 2014 2 次提交
    • K
      NFSD: fix bug for readdir of pseudofs · f41c5ad2
      Kinglong Mee 提交于
      Commit 561f0ed4 (nfsd4: allow large readdirs) introduces a bug
      about readdir the root of pseudofs.
      
      Call xdr_truncate_encode() revert encoded name when skipping.
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      f41c5ad2
    • N
      NFSD: Don't hand out delegations for 30 seconds after recalling them. · 6282cd56
      NeilBrown 提交于
      If nfsd needs to recall a delegation for some reason it implies that there is
      contention on the file, so further delegations should not be handed out.
      
      The current code fails to do so, and the result is effectively a
      live-lock under some workloads: a client attempting a conflicting
      operation on a read-delegated file receives NFS4ERR_DELAY and retries
      the operation, but by the time it retries the server may already have
      given out another delegation.
      
      We could simply avoid delegations for (say) 30 seconds after any recall, but
      this is probably too heavy handed.
      
      We could keep a list of inodes (or inode numbers or filehandles) for recalled
      delegations, but that requires memory allocation and searching.
      
      The approach taken here is to use a bloom filter to record the filehandles
      which are currently blocked from delegation, and to accept the cost of a few
      false positives.
      
      We have 2 bloom filters, each of which is valid for 30 seconds.   When a
      delegation is recalled the filehandle is added to one filter and will remain
      disabled for between 30 and 60 seconds.
      
      We keep a count of the number of filehandles that have been added, so when
      that count is zero we can bypass all other tests.
      
      The bloom filters have 256 bits and 3 hash functions.  This should allow a
      couple of dozen blocked  filehandles with minimal false positives.  If many
      more filehandles are all blocked at once, behaviour will degrade towards
      rejecting all delegations for between 30 and 60 seconds, then resetting and
      allowing new delegations.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      6282cd56
  4. 17 6月, 2014 1 次提交
  5. 14 6月, 2014 7 次提交
    • E
      btrfs: fix error handling in create_pending_snapshot · 47a306a7
      Eric Sandeen 提交于
      fcebe456 cut and pasted some code to a later point
      in create_pending_snapshot(), but didn't switch
      to the appropriate error handling for this stage
      of the function.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      47a306a7
    • E
      btrfs: fix use of uninit "ret" in end_extent_writepage() · 3e2426bd
      Eric Sandeen 提交于
      If this condition in end_extent_writepage() is false:
      
      	if (tree->ops && tree->ops->writepage_end_io_hook)
      
      we will then test an uninitialized "ret" at:
      
      	ret = ret < 0 ? ret : -EIO;
      
      The test for ret is for the case where ->writepage_end_io_hook
      failed, and we'd choose that ret as the error; but if
      there is no ->writepage_end_io_hook, nothing sets ret.
      
      Initializing ret to 0 should be sufficient; if
      writepage_end_io_hook wasn't set, (!uptodate) means
      non-zero err was passed in, so we choose -EIO in that case.
      Signed-of-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3e2426bd
    • E
      btrfs: free ulist in qgroup_shared_accounting() error path · d7372780
      Eric Sandeen 提交于
      If tmp = ulist_alloc(GFP_NOFS) fails, we return without
      freeing the previously allocated qgroups = ulist_alloc(GFP_NOFS)
      and cause a memory leak.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d7372780
    • F
      Btrfs: fix qgroups sanity test crash or hang · b050f9f6
      Filipe Manana 提交于
      Often when running the qgroups sanity test, a crash or a hang happened.
      This is because the extent buffer the test uses for the root node doesn't
      have an header level explicitly set, making it have a random level value.
      This is a problem when it's not zero for the btrfs_search_slot() calls
      the test ends up doing, resulting in crashes or hangs such as the following:
      
      [ 6454.127192] Btrfs loaded, debug=on, assert=on, integrity-checker=on
      (...)
      [ 6454.127760] BTRFS: selftest: Running qgroup tests
      [ 6454.127964] BTRFS: selftest: Running test_test_no_shared_qgroup
      [ 6454.127966] BTRFS: selftest: Qgroup basic add
      [ 6480.152005] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:5383]
      [ 6480.152005] Modules linked in: btrfs(+) xor raid6_pq binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc i2c_piix4 i2c_core pcspkr evbug psmouse serio_raw e1000 [last unloaded: btrfs]
      [ 6480.152005] irq event stamp: 188448
      [ 6480.152005] hardirqs last  enabled at (188447): [<ffffffff8168ef5c>] restore_args+0x0/0x30
      [ 6480.152005] hardirqs last disabled at (188448): [<ffffffff81698e6a>] apic_timer_interrupt+0x6a/0x80
      [ 6480.152005] softirqs last  enabled at (188446): [<ffffffff810516cf>] __do_softirq+0x1cf/0x450
      [ 6480.152005] softirqs last disabled at (188441): [<ffffffff81051c25>] irq_exit+0xb5/0xc0
      [ 6480.152005] CPU: 0 PID: 5383 Comm: modprobe Not tainted 3.15.0-rc8-fdm-btrfs-next-33+ #4
      [ 6480.152005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [ 6480.152005] task: ffff8802146125a0 ti: ffff8800d0d00000 task.ti: ffff8800d0d00000
      [ 6480.152005] RIP: 0010:[<ffffffff81349a63>]  [<ffffffff81349a63>] __write_lock_failed+0x13/0x20
      [ 6480.152005] RSP: 0018:ffff8800d0d038e8  EFLAGS: 00000287
      [ 6480.152005] RAX: 0000000000000000 RBX: ffffffff8168ef5c RCX: 000005deb8525852
      [ 6480.152005] RDX: 0000000000000000 RSI: 0000000000001d45 RDI: ffff8802105000b8
      [ 6480.152005] RBP: ffff8800d0d038e8 R08: fffffe12710f63db R09: ffffffffa03196fb
      [ 6480.152005] R10: ffff8802146125a0 R11: ffff880214612e28 R12: ffff8800d0d03858
      [ 6480.152005] R13: 0000000000000000 R14: ffff8800d0d00000 R15: ffff8802146125a0
      [ 6480.152005] FS:  00007f14ff804700(0000) GS:ffff880215e00000(0000) knlGS:0000000000000000
      [ 6480.152005] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 6480.152005] CR2: 00007fff4df0dac8 CR3: 00000000d1796000 CR4: 00000000000006f0
      [ 6480.152005] Stack:
      [ 6480.152005]  ffff8800d0d03908 ffffffff810ae967 0000000000000001 ffff8802105000b8
      [ 6480.152005]  ffff8800d0d03938 ffffffff8168e57e ffffffffa0319c16 0000000000000007
      [ 6480.152005]  ffff880210500000 ffff880210500100 ffff8800d0d039b8 ffffffffa0319c16
      [ 6480.152005] Call Trace:
      [ 6480.152005]  [<ffffffff810ae967>] do_raw_write_lock+0x47/0xa0
      [ 6480.152005]  [<ffffffff8168e57e>] _raw_write_lock+0x5e/0x80
      [ 6480.152005]  [<ffffffffa0319c16>] ? btrfs_tree_lock+0x116/0x270 [btrfs]
      [ 6480.152005]  [<ffffffffa0319c16>] btrfs_tree_lock+0x116/0x270 [btrfs]
      [ 6480.152005]  [<ffffffffa02b2acb>] btrfs_lock_root_node+0x3b/0x50 [btrfs]
      [ 6480.152005]  [<ffffffffa02b81a6>] btrfs_search_slot+0x916/0xa20 [btrfs]
      [ 6480.152005]  [<ffffffff811a727f>] ? create_object+0x23f/0x300
      [ 6480.152005]  [<ffffffffa02b9958>] btrfs_insert_empty_items+0x78/0xd0 [btrfs]
      [ 6480.152005]  [<ffffffffa036041a>] insert_normal_tree_ref.constprop.4+0xa2/0x19a [btrfs]
      [ 6480.152005]  [<ffffffffa03605c3>] test_no_shared_qgroup+0xb1/0x1ca [btrfs]
      [ 6480.152005]  [<ffffffff8108cad6>] ? local_clock+0x16/0x30
      [ 6480.152005]  [<ffffffffa035ef8e>] btrfs_test_qgroups+0x1ae/0x1d7 [btrfs]
      [ 6480.152005]  [<ffffffffa03a69d2>] ? ftrace_define_fields_btrfs_space_reservation+0xfd/0xfd [btrfs]
      [ 6480.152005]  [<ffffffffa03a6a86>] init_btrfs_fs+0xb4/0x153 [btrfs]
      [ 6480.152005]  [<ffffffff81000352>] do_one_initcall+0x102/0x150
      [ 6480.152005]  [<ffffffff8103d223>] ? set_memory_nx+0x43/0x50
      [ 6480.152005]  [<ffffffff81682668>] ? set_section_ro_nx+0x6d/0x74
      [ 6480.152005]  [<ffffffff810d91cc>] load_module+0x1cdc/0x2630
      (...)
      
      Therefore initialize the extent buffer as an empty leaf (level 0).
      
      Issue easy to reproduce when btrfs is built as a module via:
      
          $ for ((i = 1; i <= 1000000; i++)); do rmmod btrfs; modprobe btrfs; done
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b050f9f6
    • S
      btrfs: prevent RCU warning when dereferencing radix tree slot · f1e3c289
      Sasha Levin 提交于
      Mark the dereference as protected by lock. Not doing so triggers
      an RCU warning since the radix tree assumed that RCU is in use.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f1e3c289
    • W
      Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting · 5fbc7c59
      Wang Shilong 提交于
      Steps to reproduce:
      
       # mkfs.btrfs -f /dev/sd[b-f] -m raid5 -d raid5
       # mkfs.ext4 /dev/sdc --->corrupt one of btrfs device
       # mount /dev/sdb /mnt -o degraded
       # btrfs scrub start -BRd /mnt
      
      This is because readahead would skip missing device, this is not true
      for RAID5/6, because REQ_GET_READ_MIRRORS return 1 for RAID5/6 block
      mapping. If expected data locates in missing device, readahead thread
      would not call __readahead_hook() which makes event @rc->elems=0
      wait forever.
      
      Fix this problem by checking return value of btrfs_map_block(),we
      can only skip missing device safely if there are several mirrors.
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5fbc7c59
    • G
      btrfs: new ioctl TREE_SEARCH_V2 · cc68a8a5
      Gerhard Heift 提交于
      This new ioctl call allows the user to supply a buffer of varying size in which
      a tree search can store its results. This is much more flexible if you want to
      receive items which are larger than the current fixed buffer of 3992 bytes or
      if you want to fetch more items at once. Items larger than this buffer are for
      example some of the type EXTENT_CSUM.
      Signed-off-by: NGerhard Heift <Gerhard@Heift.Name>
      Signed-off-by: NChris Mason <clm@fb.com>
      Acked-by: NDavid Sterba <dsterba@suse.cz>
      cc68a8a5
  6. 13 6月, 2014 6 次提交
  7. 12 6月, 2014 6 次提交