1. 15 2月, 2019 1 次提交
  2. 12 12月, 2018 1 次提交
  3. 20 10月, 2018 1 次提交
  4. 09 10月, 2018 1 次提交
    • T
      GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads · 1eb8d738
      Tim Smith 提交于
      Flushing the workqueue can cause operations to happen which might
      call gfs2_log_reserve(), or get stuck waiting for locks taken by such
      operations.  gfs2_log_reserve() can io_schedule(). If this happens, it
      will never wake because the only thing which can wake it is gfs2_logd()
      which was already stopped.
      
      This causes umount of a gfs2 filesystem to wedge permanently if, for
      example, the umount immediately follows a large delete operation.
      
      When this occured, the following stack trace was obtained from the
      umount command
      
      [<ffffffff81087968>] flush_workqueue+0x1c8/0x520
      [<ffffffffa0666e29>] gfs2_make_fs_ro+0x69/0x160 [gfs2]
      [<ffffffffa0667279>] gfs2_put_super+0xa9/0x1c0 [gfs2]
      [<ffffffff811b7edf>] generic_shutdown_super+0x6f/0x100
      [<ffffffff811b7ff7>] kill_block_super+0x27/0x70
      [<ffffffffa0656a71>] gfs2_kill_sb+0x71/0x80 [gfs2]
      [<ffffffff811b792b>] deactivate_locked_super+0x3b/0x70
      [<ffffffff811b79b9>] deactivate_super+0x59/0x60
      [<ffffffff811d2998>] cleanup_mnt+0x58/0x80
      [<ffffffff811d2a12>] __cleanup_mnt+0x12/0x20
      [<ffffffff8108c87d>] task_work_run+0x7d/0xa0
      [<ffffffff8106d7d9>] exit_to_usermode_loop+0x73/0x98
      [<ffffffff81003961>] syscall_return_slowpath+0x41/0x50
      [<ffffffff815a594c>] int_ret_from_sys_call+0x25/0x8f
      [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: NTim Smith <tim.smith@citrix.com>
      Signed-off-by: NMark Syms <mark.syms@citrix.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      1eb8d738
  5. 05 7月, 2018 1 次提交
  6. 13 6月, 2018 1 次提交
    • K
      treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook 提交于
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a * b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6da2ec56
  7. 28 3月, 2018 1 次提交
  8. 31 1月, 2018 1 次提交
  9. 23 1月, 2018 2 次提交
  10. 28 11月, 2017 1 次提交
    • L
      Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds 提交于
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
      Requested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1751e8a6
  11. 31 10月, 2017 1 次提交
  12. 25 8月, 2017 1 次提交
    • B
      GFS2: Withdraw for IO errors writing to the journal or statfs · 942b0cdd
      Bob Peterson 提交于
      Before this patch, if GFS2 encountered IO errors while writing to
      the journal, it would not report the problem, so they would go
      unnoticed, sometimes for many hours. Sometimes this would only be
      noticed later, when recovery tried to do journal replay and failed
      due to invalid metadata at the blocks that resulted in IO errors.
      
      This patch makes GFS2's log daemon check for IO errors. If it
      encounters one, it withdraws from the file system and reports
      why in dmesg. A similar action is taken when IO errors occur when
      writing to the system statfs file.
      
      These errors are also reported back to any callers of fsync, since
      that requires the journal to be flushed. Therefore, any IO errors
      that would previously go unnoticed are now noticed and the file
      system is withdrawn as early as possible, thus preventing further
      file system damage.
      
      Also note that this reintroduces superblock variable sd_log_error,
      which Christoph removed with commit f729b66f.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      942b0cdd
  13. 10 8月, 2017 2 次提交
    • A
      gfs2: Defer deleting inodes under memory pressure · 6a1c8f6d
      Andreas Gruenbacher 提交于
      When under memory pressure and an inode's link count has dropped to
      zero, defer deleting the inode to the delete workqueue.  This avoids
      calling into DLM under memory pressure, which can deadlock.
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      6a1c8f6d
    • A
      gfs2: gfs2_evict_inode: Put glocks asynchronously · 71c1b213
      Andreas Gruenbacher 提交于
      gfs2_evict_inode is called to free inodes under memory pressure.  The
      function calls into DLM when an inode's last cluster-wide reference goes
      away (remote unlink) and to release the glock and associated DLM lock
      before finally destroying the inode.  However, if DLM is blocked on
      memory to become available, calling into DLM again will deadlock.
      
      Avoid that by decoupling releasing glocks from destroying inodes in that
      case: with gfs2_glock_queue_put, glocks will be dequeued asynchronously
      in work queue context, when the associated inodes have likely already
      been destroyed.
      
      With this change, inodes can end up being unlinked, remote-unlink can be
      triggered, and then the inode can be reallocated before all
      remote-unlink callbacks are processed.  To detect that, revalidate the
      link count in gfs2_evict_inode to make sure we're not deleting an
      allocated, referenced inode.
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      71c1b213
  14. 09 8月, 2017 3 次提交
  15. 21 7月, 2017 1 次提交
  16. 17 7月, 2017 1 次提交
    • D
      VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) · bc98a42c
      David Howells 提交于
      Firstly by applying the following with coccinelle's spatch:
      
      	@@ expression SB; @@
      	-SB->s_flags & MS_RDONLY
      	+sb_rdonly(SB)
      
      to effect the conversion to sb_rdonly(sb), then by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(!sb_rdonly(SB)) && A
      	+!sb_rdonly(SB) && A
      	|
      	-A != (sb_rdonly(SB))
      	+A != sb_rdonly(SB)
      	|
      	-A == (sb_rdonly(SB))
      	+A == sb_rdonly(SB)
      	|
      	-!(sb_rdonly(SB))
      	+!sb_rdonly(SB)
      	|
      	-A && (sb_rdonly(SB))
      	+A && sb_rdonly(SB)
      	|
      	-A || (sb_rdonly(SB))
      	+A || sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) != A
      	+sb_rdonly(SB) != A
      	|
      	-(sb_rdonly(SB)) == A
      	+sb_rdonly(SB) == A
      	|
      	-(sb_rdonly(SB)) && A
      	+sb_rdonly(SB) && A
      	|
      	-(sb_rdonly(SB)) || A
      	+sb_rdonly(SB) || A
      	)
      
      	@@ expression A, B, SB; @@
      	(
      	-(sb_rdonly(SB)) ? 1 : 0
      	+sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) ? A : B
      	+sb_rdonly(SB) ? A : B
      	)
      
      to remove left over excess bracketage and finally by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(A & MS_RDONLY) != sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
      	|
      	-(A & MS_RDONLY) == sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
      	)
      
      to make comparisons against the result of sb_rdonly() (which is a bool)
      work correctly.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      bc98a42c
  17. 05 7月, 2017 3 次提交
  18. 03 4月, 2017 1 次提交
    • A
      Revert "GFS2: Wait for iopen glock dequeues" · d4da3198
      Andreas Gruenbacher 提交于
      Revert commit 86d067a7: it turns out
      that waiting for iopen glock dequeues here isn't needed anymore because
      the bugs that commit was meant to fix have been fixed otherwise.
      
      In addition, we want to avoid waiting on glocks in gfs2_evict_inode in
      shrinker context because the shrinker may be invoked on behalf of DLM,
      in which case calling into DLM again would deadlock.  This commit makes
      the described scenario less likely without completely avoiding it; it's
      still a step in the right direction, though.
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      d4da3198
  19. 16 3月, 2017 1 次提交
    • B
      GFS2: Prevent BUG from occurring when normal Withdraws occur · 0d1c7ae9
      Bob Peterson 提交于
      When the GFS2 file system withdraws due to metadata corruption, it
      often has outstanding transactions in the journal and delayed work
      queued for its glocks. This patch adds some new checks for a
      withdrawn file system before proceeding with operations that would
      obviously cause a BUG() to be triggered. That allows GFS2 to be
      safely unmounted rather than cause the system to go down.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      0d1c7ae9
  20. 02 3月, 2017 1 次提交
  21. 03 8月, 2016 1 次提交
  22. 27 6月, 2016 1 次提交
    • A
      gfs2: Lock holder cleanup · 6df9f9a2
      Andreas Gruenbacher 提交于
      Make the code more readable by cleaning up the different ways of
      initializing lock holders and checking for initialized lock holders:
      mark lock holders as uninitialized by setting the holder's glock to NULL
      (gfs2_holder_mark_uninitialized) instead of zeroing out the entire
      object or using a separate flag.  Recognize initialized holders by their
      non-NULL glock (gfs2_holder_initialized).  Don't zero out holder objects
      which are immeditiately initialized via gfs2_holder_init or
      gfs2_glock_nq_init.
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      6df9f9a2
  23. 11 4月, 2016 1 次提交
  24. 14 1月, 2016 1 次提交
  25. 19 12月, 2015 2 次提交
  26. 15 12月, 2015 4 次提交
    • B
      gfs2: clear journal live bit in gfs2_log_flush · 400ac52e
      Benjamin Marzinski 提交于
      When gfs2 was unmounting filesystems or changing them to read-only it
      was clearing the SDF_JOURNAL_LIVE bit before the final log flush.  This
      caused a race.  If an inode glock got demoted in the gap between
      clearing the bit and the shutdown flush, it would be unable to reserve
      log space to clear out the active items list in inode_go_sync, causing an
      error in inode_go_inval because the glock was still dirty.
      
      To solve this, the SDF_JOURNAL_LIVE bit is now cleared inside the
      shutdown log flush.  This means that, because of the locking on the log
      blocks, either inode_go_sync will be able to reserve space to clean the
      glock before the shutdown flush, or the shutdown flush will clean the
      glock itself, before inode_go_sync fails to reserve the space. Either
      way, the glock will be clean before inode_go_inval.
      Signed-off-by: NBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      400ac52e
    • B
      gfs2: change gfs2 readdir cookie · 471f3db2
      Benjamin Marzinski 提交于
      gfs2 currently returns 31 bits of filename hash as a cookie that readdir
      uses for an offset into the directory.  When there are a large number of
      directory entries, the likelihood of a collision goes up way too
      quickly.  GFS2 will now return cookies that are guaranteed unique for a
      while, and then fail back to using 30 bits of filename hash.
      Specifically, the directory leaf blocks are divided up into chunks based
      on the minimum size of a gfs2 directory entry (48 bytes). Each entry's
      cookie is based off the chunk where it starts, in the linked list of
      leaf blocks that it hashes to (there are 131072 hash buckets). Directory
      entries will have unique names until they take reach chunk 8192.
      Assuming the largest filenames possible, and the least efficient spacing
      possible, this new method will still be able to return unique names when
      the previous method has statistically more than a 99% chance of a
      collision.  The non-unique names it fails back to are guaranteed to not
      collide with the unique names.
      
      unique cookies will be in this format:
      - 1 bit "0" to make sure the the returned cookie is positive
      - 17 bits for the hash table index
      - 1 bit for the mode "0"
      - 13 bits for the offset
      
      non-unique cookies will be in this format:
      - 1 bit "0" to make sure the the returned cookie is positive
      - 17 bits for the hash table index
      - 1 bit for the mode "1"
      - 13 more bits of the name hash
      
      Another benefit of location based cookies, is that once a directory's
      exhash table is fully extended (so that multiple hash table indexs do
      not use the same leaf blocks), gfs2 can skip sorting the directory
      entries until it reaches the non-unique ones, and then it only needs to
      sort these. This provides a significant speed up for directory reads of
      very large directories.
      
      The only issue is that for these cookies to continue to point to the
      correct entry as files are added and removed from the directory, gfs2
      must keep the entries at the same offset in the leaf block when they are
      split (see my previous patch). This means that until all the nodes in a
      cluster are running with code that will split the directory leaf blocks
      this way, none of the nodes can use the new cookie code. To deal with
      this, gfs2 now has the mount option loccookie, which, if set, will make
      it return these new location based cookies.  This option must not be set
      until all nodes in the cluster are at least running this version of the
      kernel code, and you have guaranteed that there are no outstanding
      cookies required by other software, such as NFS.
      
      gfs2 uses some of the extra space at the end of the gfs2_dirent
      structure to store the calculated readdir cookies. This keeps us from
      needing to allocate a seperate array to hold these values.  gfs2
      recomputes the cookie stored in de_cookie for every readdir call.  The
      time it takes to do so is small, and if gfs2 expected this value to be
      saved on disk, the new code wouldn't work correctly on filesystems
      created with an earlier version of gfs2.
      
      One issue with adding de_cookie to the union in the gfs2_dirent
      structure is that it caused the union to align itself to a 4 byte
      boundary, instead of its previous 2 byte boundary. This changed the
      offset of de_rahead. To solve that, I pulled de_rahead out of the union,
      since it does not need to be there.
      Signed-off-by: NBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      471f3db2
    • B
      GFS2: Update master statfs buffer with sd_statfs_spin locked · 901c6c66
      Bob Peterson 提交于
      Before this patch, function update_statfs called gfs2_statfs_change_out
      to update the master statfs buffer without the sd_statfs_spin held.
      In theory, another process could call gfs2_statfs_sync, which takes
      the sd_statfs_spin lock and re-reads m_sc from the buffer. So there's
      a theoretical timing window in which one process could write the
      master statfs buffer, then another comes along and re-reads it, wiping
      out the changes.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      901c6c66
    • B
      GFS2: Make rgrp reservations part of the gfs2_inode structure · a097dc7e
      Bob Peterson 提交于
      Before this patch, multi-block reservation structures were allocated
      from a special slab. This patch folds the structure into the gfs2_inode
      structure. The disadvantage is that the gfs2_inode needs more memory,
      even when a file is opened read-only. The advantages are: (a) we don't
      need the special slab and the extra time it takes to allocate and
      deallocate from it. (b) we no longer need to worry that the structure
      exists for things like quota management. (c) This also allows us to
      remove the calls to get_write_access and put_write_access since we
      know the structure will exist.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      a097dc7e
  27. 24 11月, 2015 1 次提交
    • B
      GFS2: Extract quota data from reservations structure (revert 5407e242) · b54e9a0b
      Bob Peterson 提交于
      This patch basically reverts the majority of patch 5407e242.
      That patch eliminated the gfs2_qadata structure in favor of just
      using the reservations structure. The problem with doing that is that
      it increases the size of the reservations structure. That is not an
      issue until it comes time to fold the reservations structure into the
      inode in memory so we know it's always there. By separating out the
      quota structure again, we aren't punishing the non-quota users by
      making all the inodes bigger, requiring more slab space. This patch
      creates a new slab area to allocate the quota stuff so it's managed
      a little more sanely.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      b54e9a0b
  28. 17 11月, 2015 1 次提交
  29. 05 9月, 2015 1 次提交
    • K
      fs: create and use seq_show_option for escaping · a068acf2
      Kees Cook 提交于
      Many file systems that implement the show_options hook fail to correctly
      escape their output which could lead to unescaped characters (e.g.  new
      lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
      could lead to confusion, spoofed entries (resulting in things like
      systemd issuing false d-bus "mount" notifications), and who knows what
      else.  This looks like it would only be the root user stepping on
      themselves, but it's possible weird things could happen in containers or
      in other situations with delegated mount privileges.
      
      Here's an example using overlay with setuid fusermount trusting the
      contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
      of "sudo" is something more sneaky:
      
        $ BASE="ovl"
        $ MNT="$BASE/mnt"
        $ LOW="$BASE/lower"
        $ UP="$BASE/upper"
        $ WORK="$BASE/work/ 0 0
        none /proc fuse.pwn user_id=1000"
        $ mkdir -p "$LOW" "$UP" "$WORK"
        $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
        $ cat /proc/mounts
        none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
        none /proc fuse.pwn user_id=1000 0 0
        $ fusermount -u /proc
        $ cat /proc/mounts
        cat: /proc/mounts: No such file or directory
      
      This fixes the problem by adding new seq_show_option and
      seq_show_option_n helpers, and updating the vulnerable show_option
      handlers to use them as needed.  Some, like SELinux, need to be open
      coded due to unusual existing escape mechanisms.
      
      [akpm@linux-foundation.org: add lost chunk, per Kees]
      [keescook@chromium.org: seq_show_option should be using const parameters]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: NJan Kara <jack@suse.com>
      Acked-by: NPaul Moore <paul@paul-moore.com>
      Cc: J. R. Okajima <hooanon05g@gmail.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a068acf2
  30. 02 6月, 2015 1 次提交
    • T
      writeback: move bandwidth related fields from backing_dev_info into bdi_writeback · a88a341a
      Tejun Heo 提交于
      Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bandwidth related fields from backing_dev_info into
      bdi_writeback.
      
      * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
        write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
        balanced_dirty_ratelimit, completions and dirty_exceeded.
      
      * writeback_chunk_size() and over_bground_thresh() now take @wb
        instead of @bdi.
      
      * bdi_writeout_fraction(bdi, ...)	-> wb_writeout_fraction(wb, ...)
        bdi_dirty_limit(bdi, ...)		-> wb_dirty_limit(wb, ...)
        bdi_position_ration(bdi, ...)		-> wb_position_ratio(wb, ...)
        bdi_update_writebandwidth(bdi, ...)	-> wb_update_write_bandwidth(wb, ...)
        [__]bdi_update_bandwidth(bdi, ...)	-> [__]wb_update_bandwidth(wb, ...)
        bdi_{max|min}_pause(bdi, ...)		-> wb_{max|min}_pause(wb, ...)
        bdi_dirty_limits(bdi, ...)		-> wb_dirty_limits(wb, ...)
      
      * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
        respectively.  Note that explicit zeroing is dropped in the process
        as wb's are cleared in entirety anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
      
      v2: Typo in description fixed as suggested by Jan.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a88a341a