1. 06 9月, 2013 1 次提交
  2. 05 9月, 2013 2 次提交
    • J
      f2fs: optimize gc for better performance · a26b7c8a
      Jin Xu 提交于
      This patch improves the gc efficiency by optimizing the victim
      selection policy. With this optimization, the random re-write
      performance could increase up to 20%.
      
      For f2fs, when disk is in shortage of free spaces, gc will selects
      dirty segments and moves valid blocks around for making more space
      available. The gc cost of a segment is determined by the valid blocks
      in the segment. The less the valid blocks, the higher the efficiency.
      The ideal victim segment is the one that has the most garbage blocks.
      
      Currently, it searches up to 20 dirty segments for a victim segment.
      The selected victim is not likely the best victim for gc when there
      are much more dirty segments. Why not searching more dirty segments
      for a better victim? The cost of searching dirty segments is
      negligible in comparison to moving blocks.
      
      In this patch, it enlarges the MAX_VICTIM_SEARCH to 4096 to make
      the search more aggressively for a possible better victim. Since
      it also applies to victim selection for SSR, it will likely improve
      the SSR efficiency as well.
      
      The test case is simple. It creates as many files until the disk full.
      The size for each file is 32KB. Then it writes as many as 100000
      records of 4KB size to random offsets of random files in sync mode.
      The testing was done on a 2GB partition of a SDHC card. Let's see the
      test result of f2fs without and with the patch.
      
      ---------------------------------------
      2GB partition, SDHC
      create 52023 files of size 32768 bytes
      random re-write 100000 records of 4KB
      ---------------------------------------
      | file creation (s) | rewrite time (s) | gc count | gc garbage blocks |
      [no patch]  341         4227             1174          174840
      [patched]   324         2958             645           106682
      
      It's obvious that, with the patch, f2fs finishes the test in 20+% less
      time than without the patch. And internally it does much less gc with
      higher efficiency than before.
      
      Since the performance improvement is related to gc, it might not be so
      obvious for other tests that do not trigger gc as often as this one (
      This is because f2fs selects dirty segments for SSR use most of the
      time when free space is in shortage). The well-known iozone test tool
      was not used for benchmarking the patch becuase it seems do not have
      a test case that performs random re-write on a full disk.
      
      This patch is the revised version based on the suggestion from
      Jaegeuk Kim.
      Signed-off-by: NJin Xu <jinuxstyle@gmail.com>
      [Jaegeuk Kim: suggested simpler solution]
      Reviewed-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      a26b7c8a
    • J
      f2fs: merge more bios of node block writes · 423e95cc
      Jaegeuk Kim 提交于
      Previously, we experience bio traces as follows when running simple sequential
      write test.
      
       f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500104928, size = 4K
       f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499922208, size = 368K
       f2fs_do_submit_bio: type = NODE, io = no sync, sector = 499914752, size = 140K
      
       -> total 512K
      
      The first one is to write an indirect node block, and the others are to write
      direct node blocks.
      
      The reason why there are two separate bios for direct node blocks is:
      0. initial state
      ------------------    ------------------
      |                |    |xxxxxxxx        |
      ------------------    ------------------
      
      1. write 368K
      ------------------    ------------------
      |                |    |xxxxxxxxWWWWWWWW|
      ------------------    ------------------
      
      2. write 140K
      ------------------    ------------------
      |WWWWWWW         |    |xxxxxxxxWWWWWWWW|
      ------------------    ------------------
      
      This is because f2fs_write_node_pages tries to write just 512K totally, so that
      we can lose the chance to merge more bios nicely.
      
      After this patch is applied, we can get the following bio traces.
      
        f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500103168, size = 8K
        f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500111368, size = 4K
        f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500107272, size = 512K
        f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500108296, size = 512K
        f2fs_do_submit_bio: type = NODE, io = no sync, sector = 500109320, size = 500K
      
      And finally, we can improve the sequential write performance,
          from 458.775 MB/s to 479.945 MB/s on SSD.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      423e95cc
  3. 04 9月, 2013 10 次提交
  4. 03 9月, 2013 4 次提交
    • J
      f2fs: avoid an overflow during utilization calculation · 222cbdc4
      Jaegeuk Kim 提交于
      The current f2fs uses all the block counts with 32 bit numbers, which is able to
      cover about 15TB volume.
      
      But in calculation of utilization, f2fs multiplies the count by 100 which can
      induce overflow.
      This patch fixes this.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      222cbdc4
    • J
      f2fs: trigger GC when there are prefree segments · c34e333f
      Jaegeuk Kim 提交于
      Previously, f2fs conducts SSR when free_sections() < overprovision_sections.
      But, even though there are a lot of prefree segments, it can consider SSR only.
      So, let's consider the number of prefree segments too for triggering SSR.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      c34e333f
    • L
      vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock() · 15570086
      Linus Torvalds 提交于
      This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c
      and re-implements it using the lockref infrastructure instead.  It also
      adds a lot of comments about what is actually going on, because turning
      a dentry that was looked up using RCU into a long-lived reference
      counted entry is one of the more subtle parts of the rcu walk.
      
      We also used to be _particularly_ subtle in unlazy_walk() where we
      re-validate both the dentry and its parent using the same sequence
      count.  We used to do it by nesting the locks and then verifying the
      sequence count just once.
      
      That was silly, because nested locking is expensive, but the sequence
      count check is not.  So this just re-validates the dentry and the parent
      separately, avoiding the nested locking, and making the lockref lookup
      possible.
      Acked-by: NWaiman Long <waiman.long@hp.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15570086
    • W
      vfs: use lockref_get_not_zero() for optimistic lockless dget_parent() · df3d0bbc
      Waiman Long 提交于
      A valid parent pointer is always going to have a non-zero reference
      count, but if we look up the parent optimistically without locking, we
      have to protect against the (very unlikely) race against renaming
      changing the parent from under us.
      
      We do that by using lockref_get_not_zero(), and then re-checking the
      parent pointer after getting a valid reference.
      
      [ This is a re-implementation of a chunk from the original patch by
        Waiman Long: "dcache: Enable lockless update of dentry's refcount".
        I've completely rewritten the patch-series and split it up, but I'm
        attributing this part to Waiman as it's close enough to his earlier
        patch  - Linus ]
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df3d0bbc
  5. 31 8月, 2013 1 次提交
  6. 29 8月, 2013 14 次提交
    • G
      fs/ocfs2/super.c: Use bigger nodestr to accomodate 32-bit node numbers · 49fa8140
      Goldwyn Rodrigues 提交于
      While using pacemaker/corosync, the node numbers are generated using IP
      address as opposed to serial node number generation.  This may not fit
      in a 8-byte string.  Use a bigger string to print the complete node
      number.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49fa8140
    • W
      vfs: make the dentry cache use the lockref infrastructure · 98474236
      Waiman Long 提交于
      This just replaces the dentry count/lock combination with the lockref
      structure that contains both a count and a spinlock, and does the
      mechanical conversion to use the lockref infrastructure.
      
      There are no semantic changes here, it's purely syntactic.  The
      reference lockref implementation uses the spinlock exactly the same way
      that the old dcache code did, and the bulk of this patch is just
      expanding the internal "d_count" use in the dcache code to use
      "d_lockref.count" instead.
      
      This is purely preparation for the real change to make the reference
      count updates be lockless during the 3.12 merge window.
      
      [ As with the previous commit, this is a rewritten version of a concept
        originally from Waiman, so credit goes to him, blame for any errors
        goes to me.
      
        Waiman's patch had some semantic differences for taking advantage of
        the lockless update in dget_parent(), while this patch is
        intentionally a pure search-and-replace change with no semantic
        changes.     - Linus ]
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98474236
    • E
      ext4: allow specifying external journal by pathname mount option · ad4eec61
      Eric Sandeen 提交于
      It's always been a hassle that if an external journal's
      device number changes, the filesystem won't mount.
      And since boot-time enumeration can change, device number
      changes aren't unusual.
      
      The current mechanism to update the journal location is by
      passing in a mount option w/ a new devnum, but that's a hassle;
      it's a manual approach, fixing things after the fact.
      
      Adding a mount option, "-o journal_path=/dev/$DEVICE" would
      help, since then we can do i.e.
      
      # mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...
      
      and it'll mount even if the devnum has changed, as shown here:
      
      # losetup /dev/loop0 journalfile
      # mke2fs -L mylabel-journal -O journal_dev /dev/loop0 
      # mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1
      
      Change the journal device number:
      
      # losetup -d /dev/loop0
      # losetup /dev/loop1 journalfile 
      
      And today it will fail:
      
      # mount /dev/sdb1 /mnt/test
      mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
             missing codepage or helper program, or other error
             In some cases useful info is found in syslog - try
             dmesg | tail  or so
      
      # dmesg | tail -n 1
      [17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal
      
      But with this new mount option, we can specify the new path:
      
      # mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
      #
      
      (which does update the encoded device number, incidentally):
      
      # umount /dev/sdb1
      # dumpe2fs -h /dev/sdb1 | grep "Journal device"
      dumpe2fs 1.41.12 (17-May-2010)
      Journal device:	          0x0701
      
      But best of all we can just always mount by journal-path, and
      it'll always work:
      
      # mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
      #
      
      So the journal_path option can be specified in fstab, and as long as
      the disk is available somewhere, and findable by label (or by UUID),
      we can mount.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      ad4eec61
    • D
      ext4: mark group corrupt on group descriptor checksum · bdfb6ff4
      Darrick J. Wong 提交于
      If the group descriptor fails validation, mark the whole blockgroup
      corrupt so that the inode/block allocators skip this group.  The
      previous approach takes the risk of writing to a damaged group
      descriptor; hopefully it was never the case that the [ib]bitmap fields
      pointed to another valid block and got dirtied, since the memset would
      fill the page with 1s.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bdfb6ff4
    • D
      ext4: mark block group as corrupt on inode bitmap error · 87a39389
      Darrick J. Wong 提交于
      If we detect either a discrepancy between the inode bitmap and the
      inode counts or the inode bitmap fails to pass validation checks, mark
      the block group corrupt and refuse to allocate or deallocate inodes
      from the group.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      87a39389
    • D
      ext4: mark block group as corrupt on block bitmap error · 163a203d
      Darrick J. Wong 提交于
      When we notice a block-bitmap corruption (because of device failure or
      something else), we should mark this group as corrupt and prevent
      further block allocations/deallocations from it. Currently, we end up
      generating one error message for every block in the bitmap. This
      potentially could make the system unstable as noticed in some
      bugs. With this patch, the error will be printed only the first time
      and mark the entire block group as corrupted. This prevents future
      access allocations/deallocations from it.
      
      Also tested by corrupting the block
      bitmap and forcefully introducing the mb_free_blocks error:
      (1) create a largefile (2Gb)
      $ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200
      (2) umount filesystem. use dumpe2fs to see which block-bitmaps
      are in use by largefile and note their block numbers
      (3) use dd to zero-out the used block bitmaps
      $ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct
      (4) mount the FS and delete the largefile.
      (5) recreate the largefile. verify that the new largefile does not
      get any blocks from the groups marked as bad.
      Without the patch, we will see mb_free_blocks error for each bit in
      each zero'ed out bitmap at (4). With the patch, we only see the error
      once per blockgroup:
      [  309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      
      Google-Bug-Id: 7258357
      
      [darrick.wong@oracle.com]
      Further modifications (by Darrick) to make more obvious that this corruption
      bit applies to blocks only.  Set the corruption flag if the block group bitmap
      verification fails.
      
      Original-author: Aditya Kali <adityakali@google.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      163a203d
    • D
      ext4: fix type declaration of ext4_validate_block_bitmap · dbde0abe
      Darrick J. Wong 提交于
      The block_group parameter to ext4_validate_block_bitmap is both used
      as a ext4_group_t inside the function and the same type is passed in
      by all callers.  We might as well use the typedef consistently instead
      of open-coding the 'unsigned int'.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      dbde0abe
    • D
      ext4: error out if verifying the block bitmap fails · 48d9eb97
      Darrick J. Wong 提交于
      The block bitmap verification code assumes that calling ext4_error()
      either panics the system or makes the fs readonly.  However, this is
      not always true: when 'errors=continue' is specified, an error is
      printed but we don't return any indication of error to the caller,
      which is (probably) the block allocator, which pretends that the crud
      we read in off the disk is a usable bitmap.  Yuck.
      
      A block bitmap that fails the check should at least return no bitmap
      to the caller.  The block allocator should be told to go look in a
      different group, but that's a separate issue.
      
      The easiest way to reproduce this is to modify bg_block_bitmap (on a
      ^flex_bg fs) to point to a block outside the block group; or you can
      create a metadata_csum filesystem and zero out the block bitmaps.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      48d9eb97
    • D
      jbd2: Fix endian mixing problems in the checksumming code · 18a6ea1e
      Darrick J. Wong 提交于
      In the jbd2 checksumming code, explicitly declare separate variables with
      endianness information so that we don't get confused and screw things up again.
      Also fixes sparse warnings.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      18a6ea1e
    • Z
      ext4: isolate ext4_extents.h file · d7b2a00c
      Zheng Liu 提交于
      After applied the commit (4a092d73), we have reduced the number of
      source files that need to #include ext4_extents.h.  But we can do
      better.
      
      This commit defines ext4_zeroout_es() in extents.c and move
      EXT_MAX_BLOCKS into ext4.h in order not to include ext4_extents.h in
      indirect.c and ioctl.c.  Meanwhile we just need to include this file in
      extent_status.c when ES_AGGRESSIVE_TEST is defined.  Otherwise, this
      commit removes a duplicated declaration in trace/events/ext4.h.
      
      After applied this patch, we just need to include ext4_extents.h file
      in {super,migrate,move_extents,extents}.c, and it is easy for us to
      define a new extent disk layout.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d7b2a00c
    • A
      70261f56
    • D
      ext4: convert write_begin methods to stable_page_writes semantics · 7afe5aa5
      Dmitry Monakhov 提交于
      Use wait_for_stable_page() instead of wait_on_page_writeback()
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      7afe5aa5
    • A
      ext4: fix use of potentially uninitialized variables in debugging code · 27b1b228
      Andi Shyti 提交于
      If ext_debugging is enabled and path[depth].p_ext is NULL, len
      and lblock are printed non initialized
      Signed-off-by: NAndi Shyti <andi@etezian.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      27b1b228
    • L
      Revert "fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink" · f0cc6ffb
      Linus Torvalds 提交于
      This reverts commit bb2314b4.
      
      It wasn't necessarily wrong per se, but we're still busily discussing
      the exact details of this all, so I'm going to revert it for now.
      
      It's true that you can already do flink() through /proc and that flink()
      isn't new.  But as Brad Spengler points out, some secure environments do
      not mount proc, and flink adds a new interface that can avoid path
      lookup of the source for those kinds of environments.
      
      We may re-do this (and even mark it for stable backporting back in 3.11
      and possibly earlier) once the whole discussion about the interface is done.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0cc6ffb
  7. 27 8月, 2013 2 次提交
  8. 26 8月, 2013 6 次提交
    • J
      f2fs: support the inline xattrs · 65985d93
      Jaegeuk Kim 提交于
      0. modified inode structure
      --------------------------------------
      metadata (e.g., i_mtime, i_ctime, etc)
      --------------------------------------
      direct pointers [0 ~ 873]
      
      inline xattrs (200 bytes by default)
      
      indirect pointers [0 ~ 4]
      --------------------------------------
      node footer
      --------------------------------------
      
      1. setxattr flow
       - read_all_xattrs copies all the xattrs from inline and xattr node block.
       - handle xattr entries
       - write_all_xattrs copies modified xattrs into inline and xattr node block.
      
      2. getxattr flow
       - read_all_xattrs copies all the xattrs from inline and xattr node block.
       - check target entries
      
      3. Usage
       # mount -t f2fs -o inline_xattr $DEV $MNT
      
       Once mounted with the inline_xattr option, f2fs marks all the newly created
       files to reserve an amount of inline xattr space explicitly inside the inode
       block. Without the mount option, f2fs will not touch any existing files and
       newly created files as well.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      65985d93
    • J
      f2fs: add the truncate_xattr_node function · 4f16fb0f
      Jaegeuk Kim 提交于
      The truncate_xattr_node function will be used by inline xattr.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      4f16fb0f
    • J
      f2fs: introduce __find_xattr for readability · dd9cfe23
      Jaegeuk Kim 提交于
      The __find_xattr is to search the wanted xattr entry starting from the
      base_addr.
      
      If not found, the returned entry is the last empty xattr entry that can be
      allocated newly.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      dd9cfe23
    • J
      f2fs: reserve the xattr space dynamically · de93653f
      Jaegeuk Kim 提交于
      This patch enables the number of direct pointers inside on-disk inode block to
      be changed dynamically according to the size of inline xattr space.
      
      The number of direct pointers, ADDRS_PER_INODE, can be changed only if the file
      has inline xattr flag.
      
      The number of direct pointers that will be used by inline xattrs is defined as
      F2FS_INLINE_XATTR_ADDRS.
      Current patch assigns F2FS_INLINE_XATTR_ADDRS to 0 temporarily.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      de93653f
    • J
      f2fs: add flags for inline xattrs · 444c580f
      Jaegeuk Kim 提交于
      This patch adds basic inode flags for inline xattrs, F2FS_INLINE_XATTR,
      and add a mount option, inline_xattr, which is enabled when xattr is set.
      
      If the mount option is enabled, all the files are marked with the inline_xattrs
      flag.
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      444c580f
    • W
      f2fs: fix error return code in init_f2fs_fs() · 6e6b978c
      Wei Yongjun 提交于
      Fix to return -ENOMEM in the kset create and add error handling
      case instead of 0, as done elsewhere in this function.
      
      Introduced by commit b59d0bae.
      (f2fs: add sysfs support for controlling the gc_thread)
      Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Acked-by: NNamjae Jeon <namjae.jeon@samsung.com>
      [Jaegeuk Kim: merge the patch with previous modification]
      Signed-off-by: NJaegeuk Kim <jaegeuk.kim@samsung.com>
      6e6b978c