1. 06 10月, 2018 1 次提交
  2. 29 8月, 2018 2 次提交
    • B
      gfs2: Don't set GFS2_RDF_UPTODATE when the lvb is updated · 4f36cb36
      Bob Peterson 提交于
      The GFS2_RDF_UPTODATE flag in the rgrp is used to determine when
      a rgrp buffer is valid. It's cleared when the glock is invalidated,
      signifying that the buffer data is now invalid. But before this
      patch, function update_rgrp_lvb was setting the flag when it
      determined it had a valid lvb. But that's an invalid assumption:
      just because you have a valid lvb doesn't mean you have valid
      buffers. After all, another node may have made the lvb valid,
      and this node just fetched it from the glock via dlm.
      
      Consider this scenario:
      1. The file system is mounted with RGRPLVB option.
      2. In gfs2_inplace_reserve it locks the rgrp glock EX, but thanks
         to GL_SKIP, it skips the gfs2_rgrp_bh_get.
      3. Since loops == 0 and the allocation target (ap->target) is
         bigger than the largest known chunk of blocks in the rgrp
         (rs->rs_rbm.rgd->rd_extfail_pt) it skips that rgrp and bypasses
         the call to gfs2_rgrp_bh_get there as well.
      4. update_rgrp_lvb sees the lvb MAGIC number is valid, so bypasses
         gfs2_rgrp_bh_get, but it still sets sets GFS2_RDF_UPTODATE due
         to this invalid assumption.
      5. The next time update_rgrp_lvb is called, it sees the bit is set
         and just returns 0, assuming both the lvb and rgrp are both
         uptodate. But since this is a smaller allocation, or space has
         been freed by another node, thus adjusting the lvb values,
         it decides to use the rgrp for allocations, with invalid rd_free
         due to the fact it was never updated.
      
      This patch changes update_rgrp_lvb so it doesn't set the UPTODATE
      flag anymore. That way, it has no choice but to fetch the latest
      values.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      4f36cb36
    • B
      gfs2: improve debug information when lvb mismatches are found · 72244b6b
      Bob Peterson 提交于
      Before this patch, gfs2_rgrp_bh_get would check for lvb mismatches,
      but it wouldn't tell you what was actually wrong. This patch adds
      more information to help us debug it. It also makes rgrp consistency
      checks dump any bad rgrps, and the rgrp dump code dump any lvbs
      as well as the rgrp itself.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Acked-by: NSteven Whitehouse <swhiteho@redhat.com>
      72244b6b
  3. 08 8月, 2018 1 次提交
  4. 07 8月, 2018 1 次提交
    • B
      gfs2: Fix gfs2_testbit to use clone bitmaps · dffe12a8
      Bob Peterson 提交于
      Function gfs2_testbit is called in three places. Two of those places,
      gfs2_alloc_extent and gfs2_unaligned_extlen, should be using the clone
      bitmaps, not the "real" bitmaps. Function gfs2_unaligned_extlen is used
      by the block reservations scheme to determine the length of an extent of
      free blocks. Before this patch, it wasn't using the clone bitmap, which
      means recently-freed blocks were treated as free blocks for the purposes
      of an allocation.
      
      This patch adds a new parameter to gfs2_testbit to indicate whether or
      not the clone bitmaps should be used (if available).
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Reviewed-by: NAndreas Gruenbacher <agruenba@redhat.com>
      dffe12a8
  5. 27 7月, 2018 1 次提交
  6. 25 7月, 2018 2 次提交
    • B
      GFS2: rgrp free blocks used incorrectly · f6753df3
      Bob Peterson 提交于
      Before this patch, several functions in rgrp.c checked the value of
      rgd->rd_free_clone. That does not take into account blocks that were
      reserved by a multi-block reservation. This causes a problem when
      space gets tight in the file system. For example, when function
      gfs2_inplace_reserve checks to see if a rgrp has enough blocks to
      satisfy the request, it can accept a rgrp that it should reject
      because, although there are enough blocks to satisfy the request
      _now_, those blocks may be reserved for another running process.
      
      A second problem with this occurs when we've reserved the remaining
      blocks in an rgrp: function rg_mblk_search() can reject an rgrp
      improperly because it calculates:
      
         u32 free_blocks = rgd->rd_free_clone - rgd->rd_reserved;
      
      But rd_reserved includes blocks that the current process just
      reserved in its own call to inplace_reserve. For example, it can
      reserve the last 128 blocks of an rgrp, then reject that same rgrp
      because the above calculates out to free_blocks = 0;
      
      Consequences include, but are not limited to, (1) leaving holes,
      and thus increasing file system fragmentation, and (2) reporting
      file system is full long before it actually is.
      
      This patch introduces a new function, rgd_free, which returns the
      number of clone-free blocks (blocks that are truly free as opposed
      to blocks that are still being used because an unlinked file is
      still open) minus the number of blocks reserved by processes, but
      not counting the blocks we ourselves reserved (because obviously
      we need to allocate them).
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      f6753df3
    • B
      gfs2: Don't reject a supposedly full bitmap if we have blocks reserved · e79e0e14
      Bob Peterson 提交于
      Before this patch, you could get into situations like this:
      
      1. Process 1 searches for X free blocks, finds them, makes a reservation
      2. Process 2 searches for free blocks in the same rgrp, but now the
         bitmap is full because process 1's reservation is skipped over.
         So it marks the bitmap as GBF_FULL.
      3. Process 1 tries to allocate blocks from its own reservation, but
         since the GBF_FULL bit is set, it skips over the rgrp and searches
         elsewhere, thus not using its own reservation.
      
      This patch adds an additional check to allow processes to use their
      own reservations.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      e79e0e14
  7. 05 7月, 2018 2 次提交
  8. 21 6月, 2018 1 次提交
  9. 13 6月, 2018 1 次提交
    • K
      treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook 提交于
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a * b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6da2ec56
  10. 04 6月, 2018 1 次提交
    • B
      GFS2: gfs2_free_extlen can return an extent that is too long · dc8fbb03
      Bob Peterson 提交于
      Function gfs2_free_extlen calculates the length of an extent of
      free blocks that may be reserved. The end pointer was calculated as
      end = start + bh->b_size but b_size is incorrect because the
      bitmap usually stops prior to the end of the buffer data on
      the last bitmap.
      
      What this means is that when you do a write, you can reserve a
      chunk of blocks that runs off the end of the last bitmap. For
      example, I've got a file system where there is only one bitmap
      for each rgrp, so ri_length==1. I saw cases in which iozone
      tried to do a big write, grabbed a large block reservation,
      chose rgrp 5464152, which has ri_data0 5464153 and ri_data 8188.
      So 5464153 + 8188 = 5472341 which is the end of the rgrp.
      
      When it grabbed a reservation it got back: 5470936, length 7229.
      But 5470936 + 7229 = 5478165. So the reservation starts inside
      the rgrp but runs 5824 blocks past the end of the bitmap.
      
      This patch fixes the calculation so it won't exceed the last
      bitmap. It also adds a BUG_ON to guard against overflows in the
      future.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      dc8fbb03
  11. 31 1月, 2018 1 次提交
  12. 23 1月, 2018 2 次提交
  13. 17 1月, 2018 1 次提交
  14. 13 12月, 2017 3 次提交
    • A
      gfs2: Add a crc field to resource group headers · 850d2d91
      Andrew Price 提交于
      Add the rg_crc field to store a crc32 of the gfs2_rgrp structure. This
      allows us to check resource group headers' integrity and removes the
      requirement to check them against the rindex entries in fsck. If this
      field is found to be zero, it should be ignored (or updated with an
      accurate value).
      Signed-off-by: NAndrew Price <anprice@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      850d2d91
    • A
      gfs2: Add rindex fields to rgrp headers · 166725d9
      Andrew Price 提交于
      Add rg_data0, rg_data and rg_bitbytes to struct gfs2_rgrp. The fields
      are identical to their counterparts in struct gfs2_rindex and are
      intended to reduce the use of the rindex. For now the fields are only
      written back as the in-memory equivalents in struct gfs2_rgrpd are set
      using values from the rindex. However, they are needed at this point so
      that userspace can make use of them, allowing a migration away from the
      rindex over time.
      
      The new fields take up previously reserved space which was explicitly
      zeroed on write so, in clusters with mixed kernels, these fields could
      get zeroed after being set and this should not be treated as an error.
      Signed-off-by: NAndrew Price <anprice@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      166725d9
    • A
      gfs2: Add a next-resource-group pointer to resource groups · 65adc273
      Andrew Price 提交于
      Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
      rg_skip field has the following meaning:
      
      - If rg_skip is zero, it is considered unset and not useful.
      - If rg_skip is non-zero, its value will be the number of blocks between
        this rgrp's address and the next rgrp's address. This can be used as a
        hint by fsck.gfs2 when rebuilding a bad rindex, for example.
      
      This will provide less dependency on the rindex in future, and allow
      tools such as fsck.gfs2 to iterate the resource groups without keeping
      the rindex around.
      
      The field is updated in gfs2_rgrp_out() so that existing file systems
      will have it set. This means that any resource groups that aren't ever
      written will not be updated. The final rgrp is a special case as there
      is no next rgrp, so it will always have a rg_skip of 0 (unless the fs is
      extended).
      
      Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
      the rg_skip field can get set back to 0 in cases where nodes with and
      without this patch are mixed in a cluster. In some cases, the field may
      bounce between being set by one node and then zeroed by another which
      may harm performance slightly, e.g. when two nodes create many small
      files. In testing this situation is rare but it becomes more likely as
      the filesystem fills up and there are fewer resource groups to choose
      from. The problem goes away when all nodes are running with this patch.
      Dipping into the space currently occupied by the rg_reserved field would
      have resulted in the same problem as it is also explicitly zeroed, so
      unfortunately there is no other way around it.
      Signed-off-by: NAndrew Price <anprice@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      65adc273
  15. 28 11月, 2017 1 次提交
    • B
      GFS2: Combine gfs2_free_di with gfs2_free_uninit_di · a18c78c5
      Bob Peterson 提交于
      Before this patch, function gfs2_free_di was 4 lines of code, and
      one of those lines was to call gfs2_free_uninit_di. Although
      unlikely, if function gfs2_free_uninit_di encountered an error
      finding the block to be freed, the error was silently ignored by the
      caller, which went ahead and improperly did a quota-change operation
      and meta_wipe despite the error. This patch combines the two
      functions into one to make the code more readable and fixes the bug
      by returning from the combined function before it takes those next
      incorrect steps.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      a18c78c5
  16. 30 8月, 2017 1 次提交
  17. 09 8月, 2017 1 次提交
  18. 05 7月, 2017 1 次提交
  19. 19 4月, 2017 1 次提交
    • B
      GFS2: Non-recursive delete · d552a2b9
      Bob Peterson 提交于
      Implement truncate/delete as a non-recursive algorithm. The older
      algorithm was implemented with recursion to strip off each layer
      at a time (going by height, starting with the maximum height.
      This version tries to do the same thing but without recursion,
      and without needing to allocate new structures or lists in memory.
      
      For example, say you want to truncate a very large file to 1 byte,
      and its end-of-file metapath is: 0.505.463.428. The starting
      metapath would be 0.0.0.0. Since it's a truncate to non-zero, it
      needs to preserve that byte, and all metadata pointing to it.
      So it would start at 0.0.0.0, look up all its metadata buffers,
      then free all data blocks pointed to at the highest level.
      After that buffer is "swept", it moves on to 0.0.0.1, then
      0.0.0.2, etc., reading in buffers and sweeping them clean.
      When it gets to the end of the 0.0.0 metadata buffer (for 4K
      blocks the last valid one is 0.0.0.508), it backs up to the
      previous height and starts working on 0.0.1.0, then 0.0.1.1,
      and so forth. After it reaches the end and sweeps 0.0.1.508,
      it continues with 0.0.2.0, and so on. When that height is
      exhausted, and it reaches 0.0.508.508 it backs up another level,
      to 0.1.0.0, then 0.1.0.1, through 0.1.0.508. So it has to keep
      marching backwards and forwards through the metadata until it's
      all swept clean. Once it has all the data blocks freed, it
      lowers the strip height, and begins the process all over again,
      but with one less height. This time it sweeps 0.0.0 through
      0.505.463. When that's clean, it lowers the strip height again
      and works to free 0.505. Eventually it strips the lowest height, 0.
      For a delete or truncate to 0, all metadata for all heights of
      0.0.0.0 would be freed. For a truncate to 1 byte, 0.0.0.0 would
      be preserved.
      
      This isn't much different from normal integer incrementing,
      where an integer gets incremented from 0000 (0.0.0.0) to 3021
      (3.0.2.1). So 0000 gets increments to 0001, 0002, up to 0009,
      then on to 0010, 0011 up to 0099, then 0100 and so forth. It's
      just that each "digit" goes from 0 to 508 (for a total of 509
      pointers) rather than from 0 to 9.
      
      Note that the dinode will only have 483 pointers due to the
      dinode structure itself.
      
      Also note: this is just an example. These numbers (509 and 483)
      are based on a standard 4K block size. Smaller block sizes will
      yield smaller numbers of indirect pointers accordingly.
      
      The truncation process is accomplished with the help of two
      major functions and a few helper functions.
      
      Functions do_strip and recursive_scan are obsolete, so removed.
      
      New function sweep_bh_for_rgrps cleans a buffer_head pointed to
      by the given metapath and height. By cleaning, I mean it frees
      all blocks starting at the offset passed in metapath. It starts
      at the first block in the buffer pointed to by the metapath and
      identifies its resource group (rgrp). From there it frees all
      subsequent block pointers that lie within that rgrp. If it's
      already inside a transaction, it stays within it as long as it
      can. In other words, it doesn't close a transaction until it knows
      it's freed what it can from the resource group. In this way,
      multiple buffers may be cleaned in a single transaction, as long
      as those blocks in the buffer all lie within the same rgrp.
      
      If it's not in a transaction, it starts one. If the buffer_head
      has references to blocks within multiple rgrps, it frees all the
      blocks inside the first rgrp it finds, then closes the
      transaction. Then it repeats the cycle: identifies the next
      unfreed block, uses it to find its rgrp, then starts a new
      transaction for that set. It repeats this process repeatedly
      until the buffer_head contains no more references to any blocks
      past the given metapath.
      
      Function trunc_dealloc has been reworked into a finite state
      automaton. It has basically 3 active states:
      DEALLOC_MP_FULL, DEALLOC_MP_LOWER, and DEALLOC_FILL_MP:
      
      The DEALLOC_MP_FULL state implies the metapath has a full set
      of buffers out to the "shrink height", and therefore, it can
      call function sweep_bh_for_rgrps to free the blocks within the
      highest height of the metapath. If it's just swept the lowest
      level (or an error has occurred) the state machine is ended.
      Otherwise it proceeds to the DEALLOC_MP_LOWER state.
      
      The DEALLOC_MP_LOWER state implies we are finished with a given
      buffer_head, which may now be released, and therefore we are
      then missing some buffer information from the metapath. So we
      need to find more buffers to read in. In most cases, this is
      just a matter of releasing the buffer_head and moving to the
      next pointer from the previous height, so it may be read in and
      swept as well. If it can't find another non-null pointer to
      process, it checks whether it's reached the end of a height
      and needs to lower the strip height, or whether it still needs
      move forward through the previous height's metadata. In this
      state, all zero-pointers are skipped. From this state, it can
      only loop around (once more backing up another height) or,
      once a valid metapath is found (one that has non-zero
      pointers), proceed to state DEALLOC_FILL_MP.
      
      The DEALLOC_FILL_MP state implies that we have a metapath
      but not all its buffers are read in. So we must proceed to read
      in buffer_heads until the metapath has a valid buffer for every
      height. If the previous state backed us up 3 heights, we may
      need to read in a buffer, increment the height, then repeat the
      process until buffers have been read in for all required heights.
      If it's successful reading a buffer, and it's at the highest
      height we need, it proceeds back to the DEALLOC_MP_FULL state.
      If it's unable to fill in a buffer, (encounters a hole, etc.)
      it tries to find another non-zero block pointer. If they're all
      zero, it lowers the height and returns to the DEALLOC_MP_LOWER
      state. If it finds a good non-null pointer, it loops around and
      reads it in, while keeping the metapath in lock-step with the
      pointers it examines.
      
      The state machine runs until the truncation request is
      satisfied. Then any transactions are ended, the quota and
      statfs data are updated, and the function is complete.
      
      Helper function metaptr1 was introduced to be an easy way to
      determine the start of a buffer_head's indirect pointers.
      
      Helper function lookup_mp_height was introduced to find a
      metapath index and read in the buffer that corresponds to it.
      In this way, function lookup_metapath becomes a simple loop to
      call it for every height.
      
      Helper function fillup_metapath is similar to lookup_metapath
      except it can do partial lookups. If the state machine
      backed up multiple levels (like 2999 wrapping to 3000) it
      needs to find out the next starting point and start issuing
      metadata reads at that point.
      
      Helper function hptrs is a shortcut to determine how many
      pointers should be expected in a buffer. Height 0 is the dinode
      which has fewer pointers than the others.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      d552a2b9
  20. 13 7月, 2016 1 次提交
    • B
      GFS2: Check rs_free with rd_rsspin protection · 44f52122
      Bob Peterson 提交于
      For the last process to close a file opened for write, function
      gfs2_rsqa_delete was deleting the file's inode's block reservation
      out of the rgrp reservations tree. Then it was checking to make sure
      rs_free was 0, but it was performing the check outside the protection
      of rd_rsspin spin_lock. The rd_rsspin spin_lock protection is needed
      to prevent a race between the process freeing the reservation and
      another who is allocating a new set of blocks inside the same rgrp
      for the same inode, thus changing its value.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      44f52122
  21. 27 6月, 2016 1 次提交
    • A
      gfs2: Lock holder cleanup · 6df9f9a2
      Andreas Gruenbacher 提交于
      Make the code more readable by cleaning up the different ways of
      initializing lock holders and checking for initialized lock holders:
      mark lock holders as uninitialized by setting the holder's glock to NULL
      (gfs2_holder_mark_uninitialized) instead of zeroing out the entire
      object or using a separate flag.  Recognize initialized holders by their
      non-NULL glock (gfs2_holder_initialized).  Don't zero out holder objects
      which are immeditiately initialized via gfs2_holder_init or
      gfs2_glock_nq_init.
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      6df9f9a2
  22. 10 6月, 2016 1 次提交
    • B
      GFS2: don't set rgrp gl_object until it's inserted into rgrp tree · 36e4ad03
      Bob Peterson 提交于
      Before this patch, function read_rindex_entry would set a rgrp
      glock's gl_object pointer to itself before inserting the rgrp into
      the rgrp rbtree. The problem is: if another process was also reading
      the rgrp in, and had already inserted its newly created rgrp, then
      the second call to read_rindex_entry would overwrite that value,
      then return a bad return code to the caller. Later, other functions
      would reference the now-freed rgrp memory by way of gl_object.
      In some cases, that could result in gfs2_rgrp_brelse being called
      twice for the same rgrp: once for the failed attempt and once for
      the "real" rgrp release. Eventually the kernel would panic.
      There are also a number of other things that could go wrong when
      a kernel module is accessing freed storage. For example, this could
      result in rgrp corruption because the fake rgrp would point to a
      fake bitmap in memory too, causing gfs2_inplace_reserve to search
      some random memory for free blocks, and find some, since we were
      never setting rgd->rd_bits to NULL before freeing it.
      
      This patch fixes the problem by not setting gl_object until we
      have successfully inserted the rgrp into the rbtree. Also, it sets
      rd_bits to NULL as it frees them, which will ensure any accidental
      access to the wrong rgrp will result in a kernel panic rather than
      file system corruption, which is preferred.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      36e4ad03
  23. 02 5月, 2016 1 次提交
  24. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  25. 19 12月, 2015 1 次提交
    • B
      GFS2: Always use iopen glock for gl_deletes · 5ea31bc0
      Bob Peterson 提交于
      Before this patch, when function try_rgrp_unlink queued a glock for
      delete_work to reclaim the space, it used the inode glock to do so.
      That's different from the iopen callback which uses the iopen glock
      for the same purpose. We should be consistent and always use the
      iopen glock. This may also save us reference counting problems with
      the inode glock, since clear_glock does an extra glock_put() for the
      inode glock.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      5ea31bc0
  26. 15 12月, 2015 1 次提交
    • B
      GFS2: Make rgrp reservations part of the gfs2_inode structure · a097dc7e
      Bob Peterson 提交于
      Before this patch, multi-block reservation structures were allocated
      from a special slab. This patch folds the structure into the gfs2_inode
      structure. The disadvantage is that the gfs2_inode needs more memory,
      even when a file is opened read-only. The advantages are: (a) we don't
      need the special slab and the extra time it takes to allocate and
      deallocate from it. (b) we no longer need to worry that the structure
      exists for things like quota management. (c) This also allows us to
      remove the calls to get_write_access and put_write_access since we
      know the structure will exist.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      a097dc7e
  27. 24 11月, 2015 1 次提交
    • B
      GFS2: Extract quota data from reservations structure (revert 5407e242) · b54e9a0b
      Bob Peterson 提交于
      This patch basically reverts the majority of patch 5407e242.
      That patch eliminated the gfs2_qadata structure in favor of just
      using the reservations structure. The problem with doing that is that
      it increases the size of the reservations structure. That is not an
      issue until it comes time to fold the reservations structure into the
      inode in memory so we know it's always there. By separating out the
      quota structure again, we aren't punishing the non-quota users by
      making all the inodes bigger, requiring more slab space. This patch
      creates a new slab area to allocate the quota stuff so it's managed
      a little more sanely.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      b54e9a0b
  28. 17 11月, 2015 1 次提交
  29. 09 11月, 2015 1 次提交
  30. 30 10月, 2015 1 次提交
  31. 04 9月, 2015 2 次提交
  32. 19 6月, 2015 1 次提交
  33. 19 5月, 2015 1 次提交