1. 19 Dec, 2015 1 commit
  2. 15 Dec, 2015 7 commits
    • gfs2: clear journal live bit in gfs2_log_flush · 400ac52e
      Benjamin Marzinski committed
      When gfs2 was unmounting filesystems or changing them to read-only it
      was clearing the SDF_JOURNAL_LIVE bit before the final log flush.  This
      caused a race.  If an inode glock got demoted in the gap between
      clearing the bit and the shutdown flush, it would be unable to reserve
      log space to clear out the active items list in inode_go_sync, causing an
      error in inode_go_inval because the glock was still dirty.
      
      To solve this, the SDF_JOURNAL_LIVE bit is now cleared inside the
      shutdown log flush.  This means that, because of the locking on the log
      blocks, either inode_go_sync will be able to reserve space to clean the
      glock before the shutdown flush, or the shutdown flush will clean the
      glock itself, before inode_go_sync fails to reserve the space. Either
      way, the glock will be clean before inode_go_inval.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
    • gfs2: change gfs2 readdir cookie · 471f3db2
      Benjamin Marzinski committed
      gfs2 currently returns 31 bits of filename hash as a cookie that readdir
      uses for an offset into the directory.  When there are a large number of
      directory entries, the likelihood of a collision rises too quickly.
      GFS2 will now return cookies that are guaranteed unique for a
      while, and then fall back to using 30 bits of filename hash.
      Specifically, the directory leaf blocks are divided up into chunks based
      on the minimum size of a gfs2 directory entry (48 bytes). Each entry's
      cookie is based off the chunk where it starts, in the linked list of
      leaf blocks that it hashes to (there are 131072 hash buckets). Directory
      entries will have unique cookies until they reach chunk 8192.
      Assuming the largest filenames possible, and the least efficient spacing
      possible, this new method will still be able to return unique cookies when
      the previous method has statistically more than a 99% chance of a
      collision.  The non-unique cookies it falls back to are guaranteed to not
      collide with the unique cookies.
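
For a sense of scale, the collision risk of a pure hash cookie can be estimated with the standard birthday approximation. The sketch below is illustrative Python, not gfs2 code; it assumes the 31 hash bits are uniformly distributed, and both helper names are hypothetical.

```python
import math


def collision_probability(entries: int, hash_bits: int) -> float:
    """Approximate chance that at least two of `entries` hashes coincide."""
    space = 2 ** hash_bits
    # Birthday approximation: 1 - exp(-n(n-1) / (2 * space)).
    return -math.expm1(-entries * (entries - 1) / (2.0 * space))


def entries_for_probability(p: float, hash_bits: int) -> int:
    """Approximate directory size at which the collision chance reaches p."""
    space = 2 ** hash_bits
    return math.ceil(math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p))))


if __name__ == "__main__":
    n = entries_for_probability(0.99, 31)
    print(n, round(collision_probability(n, 31), 4))
```

Under that uniformity assumption, a 31-bit hash reaches a 99% collision chance at roughly 140,000 entries, which is why location-based cookies matter for very large directories.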
      
      Unique cookies will be in this format:
      - 1 bit "0" to make sure the returned cookie is positive
      - 17 bits for the hash table index
      - 1 bit for the mode "0"
      - 13 bits for the offset
      
      Non-unique cookies will be in this format:
      - 1 bit "0" to make sure the returned cookie is positive
      - 17 bits for the hash table index
      - 1 bit for the mode "1"
      - 13 more bits of the name hash
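
The two layouts can be sketched as a 32-bit packing. This is an illustrative Python model, not the kernel code; it assumes the fields are packed most-significant first in the order listed, and the function and constant names are hypothetical.

```python
HASH_INDEX_BITS = 17  # hash table index field
LOW_BITS = 13         # chunk offset (mode 0) or extra name-hash bits (mode 1)
MODE_SHIFT = LOW_BITS
INDEX_SHIFT = LOW_BITS + 1


def make_cookie(hash_index: int, mode: int, low: int) -> int:
    """Pack the fields; the top (sign) bit is always left at 0."""
    if not 0 <= hash_index < (1 << HASH_INDEX_BITS):
        raise ValueError("hash_index must fit in 17 bits")
    if mode not in (0, 1):
        raise ValueError("mode must be 0 (unique) or 1 (non-unique)")
    if not 0 <= low < (1 << LOW_BITS):
        raise ValueError("low field must fit in 13 bits")
    return (hash_index << INDEX_SHIFT) | (mode << MODE_SHIFT) | low


def split_cookie(cookie: int):
    """Unpack a cookie into (hash_index, mode, low)."""
    low = cookie & ((1 << LOW_BITS) - 1)
    mode = (cookie >> MODE_SHIFT) & 1
    hash_index = (cookie >> INDEX_SHIFT) & ((1 << HASH_INDEX_BITS) - 1)
    return hash_index, mode, low
```

Because the mode bit sits above the low 13 bits, a unique cookie can never equal a non-unique one for the same hash table index, and within one index every unique cookie sorts before every non-unique one.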
      
      Another benefit of location-based cookies is that once a directory's
      exhash table is fully extended (so that multiple hash table indexes do
      not use the same leaf blocks), gfs2 can skip sorting the directory
      entries until it reaches the non-unique ones, and then it only needs to
      sort these. This provides a significant speed up for directory reads of
      very large directories.
      
      The only issue is that for these cookies to continue to point to the
      correct entry as files are added and removed from the directory, gfs2
      must keep the entries at the same offset in the leaf block when they are
      split (see my previous patch). This means that until all the nodes in a
      cluster are running with code that will split the directory leaf blocks
      this way, none of the nodes can use the new cookie code. To deal with
      this, gfs2 now has the mount option loccookie, which, if set, will make
      it return these new location based cookies.  This option must not be set
      until all nodes in the cluster are at least running this version of the
      kernel code, and you have guaranteed that there are no outstanding
      cookies required by other software, such as NFS.
      
      gfs2 uses some of the extra space at the end of the gfs2_dirent
      structure to store the calculated readdir cookies. This keeps us from
      needing to allocate a separate array to hold these values.  gfs2
      recomputes the cookie stored in de_cookie for every readdir call.  The
      time it takes to do so is small, and if gfs2 expected this value to be
      saved on disk, the new code wouldn't work correctly on filesystems
      created with an earlier version of gfs2.
      
      One issue with adding de_cookie to the union in the gfs2_dirent
      structure is that it caused the union to align itself to a 4 byte
      boundary, instead of its previous 2 byte boundary. This changed the
      offset of de_rahead. To solve that, I pulled de_rahead out of the union,
      since it does not need to be there.
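
The alignment effect can be reproduced with Python's ctypes, which follows the platform's native C layout rules. The structures below are simplified stand-ins with hypothetical field sizes, not the real gfs2_dirent definition.

```python
import ctypes

# Leading fields shared by every variant (sizes are illustrative).
HEAD = [("inum", ctypes.c_uint64 * 2),
        ("hash", ctypes.c_uint32),
        ("rec_len", ctypes.c_uint16),
        ("name_len", ctypes.c_uint16),
        ("type", ctypes.c_uint16)]


class OldTail(ctypes.Union):
    # Widest member is 16 bits, so the union aligns to 2 bytes.
    _fields_ = [("pad", ctypes.c_uint8 * 14),
                ("rahead", ctypes.c_uint16)]


class NewTail(ctypes.Union):
    # Adding a 32-bit member raises the union's alignment to 4 bytes.
    _fields_ = [("pad", ctypes.c_uint8 * 14),
                ("rahead", ctypes.c_uint16),
                ("cookie", ctypes.c_uint32)]


class CookieTail(ctypes.Union):
    _fields_ = [("pad", ctypes.c_uint8 * 12),
                ("cookie", ctypes.c_uint32)]


class OldDirent(ctypes.Structure):
    _fields_ = HEAD + [("tail", OldTail)]


class NewDirent(ctypes.Structure):
    _fields_ = HEAD + [("tail", NewTail)]


class FixedDirent(ctypes.Structure):
    # rahead pulled out of the union, so it keeps its original offset.
    _fields_ = HEAD + [("rahead", ctypes.c_uint16), ("tail", CookieTail)]


if __name__ == "__main__":
    print(OldDirent.tail.offset, NewDirent.tail.offset,
          FixedDirent.rahead.offset)
```

In this model the union, and therefore rahead, moves when the 32-bit member is added, and moving rahead out of the union restores its original offset without changing the overall structure size.
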
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
    • gfs2: keep offset when splitting dir leaf blocks · 34017472
      Benjamin Marzinski committed
      Currently, when gfs2 splits a directory leaf block, the dirents that
      need to be copied to the new leaf block are packed into the start of it.
      This is good for space efficiency. However, if gfs2 were to copy those
      dirents into the exact same offset in the new leaf block as they had in
      the old block, it would be able to generate a readdir cookie based on
      the dirent location, that would be guaranteed to be unique up well past
      where the current code is statistically almost guaranteed to have
      collisions. So, gfs2 now keeps the dirent's offset in the block the
      same when it copies it to the new leaf block.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
    • GFS2: Reintroduce a timeout in function gfs2_gl_hash_clear · 2aba1b5b
      Bob Peterson committed
      At some point in the past, we used to have a timeout when GFS2 was
      unmounting, trying to clear out its glocks. If the timeout expired,
      it would dump the remaining glocks to the kernel messages so that
      developers could debug the problem. That timeout was eliminated,
      probably by accident. This patch reintroduces it.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
    • GFS2: Update master statfs buffer with sd_statfs_spin locked · 901c6c66
      Bob Peterson committed
      Before this patch, function update_statfs called gfs2_statfs_change_out
      to update the master statfs buffer without the sd_statfs_spin held.
      In theory, another process could call gfs2_statfs_sync, which takes
      the sd_statfs_spin lock and re-reads m_sc from the buffer. So there's
      a theoretical timing window in which one process could write the
      master statfs buffer, then another comes along and re-reads it, wiping
      out the changes.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
    • GFS2: Reduce size of incore inode · b58bf407
      Bob Peterson committed
      This patch makes no functional changes. Its goal is to reduce the
      size of the gfs2 inode in memory by rearranging structures and
      changing the size of some variables within the structure.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
    • GFS2: Make rgrp reservations part of the gfs2_inode structure · a097dc7e
      Bob Peterson committed
      Before this patch, multi-block reservation structures were allocated
      from a special slab. This patch folds the structure into the gfs2_inode
      structure. The disadvantage is that the gfs2_inode needs more memory,
      even when a file is opened read-only. The advantages are: (a) we don't
      need the special slab and the extra time it takes to allocate and
      deallocate from it. (b) we no longer need to worry that the structure
      exists for things like quota management. (c) This also allows us to
      remove the calls to get_write_access and put_write_access since we
      know the structure will exist.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
  3. 24 Nov, 2015 1 commit
    • GFS2: Extract quota data from reservations structure (revert 5407e242) · b54e9a0b
      Bob Peterson committed
      This patch basically reverts the majority of patch 5407e242.
      That patch eliminated the gfs2_qadata structure in favor of just
      using the reservations structure. The problem with doing that is that
      it increases the size of the reservations structure. That is not an
      issue until it comes time to fold the reservations structure into the
      inode in memory so we know it's always there. By separating out the
      quota structure again, we aren't punishing the non-quota users by
      making all the inodes bigger, requiring more slab space. This patch
      creates a new slab area to allocate the quota stuff so it's managed
      a little more sanely.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
  4. 19 Nov, 2015 1 commit
  5. 17 Nov, 2015 3 commits
  6. 11 Nov, 2015 1 commit
  7. 10 Nov, 2015 1 commit
    • remove abs64() · 79211c8e
      Andrew Morton committed
      Switch everything to the new and more capable implementation of abs().
      Mainly to give the new abs() a bit of a workout.
      
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 09 Nov, 2015 1 commit
  9. 05 Nov, 2015 1 commit
  10. 30 Oct, 2015 1 commit
  11. 23 Oct, 2015 1 commit
  12. 02 Oct, 2015 1 commit
    • gfs2: Add missing else in trans_add_meta/data · 491e94f7
      Bob Peterson committed
      This patch fixes a timing window that causes a segfault.
      The problem is that bd can remain NULL throughout the function
      and then reference that NULL pointer if the bh->b_private starts
      out NULL, then someone sets it to non-NULL inside the locking.
      In that case, bd still needs to be set.
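
The shape of the bug can be modelled as a check, lock, re-check sequence. The Python sketch below is a simplified analogy with hypothetical names (get_bufdata, alloc, a single module-level lock), not the kernel function itself.

```python
import threading

_lock = threading.Lock()


class BufferHead:
    """Stand-in for a buffer head; b_private holds the attached bufdata."""

    def __init__(self):
        self.b_private = None


def get_bufdata(bh, alloc):
    """Return the bufdata attached to bh, attaching a new one if needed."""
    bd = bh.b_private              # unlocked peek, may be stale
    if bd is None:
        with _lock:
            if bh.b_private is None:
                bh.b_private = alloc(bh)
                bd = bh.b_private
            else:
                # Another thread attached bufdata between the peek and the
                # lock. Without this branch bd would stay None and the
                # caller would dereference it.
                bd = bh.b_private
    return bd
```

The else branch is the whole fix in this model: every path out of the locked region leaves bd pointing at whatever is attached to the buffer.
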
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
  13. 23 Sep, 2015 1 commit
    • GFS2: Set s_mode before parsing mount options · 6de20eb0
      Andrew Price committed
      In the generic mount_bdev() function, deactivate_locked_super() is
      called after the fill_super() call fails, at which point s_mode has been
      set. kill_block_super() expects this and dumps a warning when
      FMODE_EXCL is not set in s_mode.
      
      In gfs2_mount() we call deactivate_locked_super() on failure of
      gfs2_mount_args(), at which point s_mode has not yet been set. This
      causes kill_block_super() to dump a stack trace when gfs2 fails to mount
      with invalid options. Set s_mode earlier in gfs2_mount() to avoid that.
      Signed-off-by: Andrew Price <anprice@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
  14. 22 Sep, 2015 1 commit
  15. 05 Sep, 2015 1 commit
    • fs: create and use seq_show_option for escaping · a068acf2
      Kees Cook committed
      Many file systems that implement the show_options hook fail to correctly
      escape their output which could lead to unescaped characters (e.g.  new
      lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
      could lead to confusion, spoofed entries (resulting in things like
      systemd issuing false d-bus "mount" notifications), and who knows what
      else.  This looks like it would only be the root user stepping on
      themselves, but it's possible weird things could happen in containers or
      in other situations with delegated mount privileges.
      
      Here's an example using overlay with setuid fusermount trusting the
      contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
      of "sudo" is something more sneaky:
      
        $ BASE="ovl"
        $ MNT="$BASE/mnt"
        $ LOW="$BASE/lower"
        $ UP="$BASE/upper"
        $ WORK="$BASE/work/ 0 0
        none /proc fuse.pwn user_id=1000"
        $ mkdir -p "$LOW" "$UP" "$WORK"
        $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
        $ cat /proc/mounts
        none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
        none /proc fuse.pwn user_id=1000 0 0
        $ fusermount -u /proc
        $ cat /proc/mounts
        cat: /proc/mounts: No such file or directory
      
      This fixes the problem by adding new seq_show_option and
      seq_show_option_n helpers, and updating the vulnerable show_option
      handlers to use them as needed.  Some, like SELinux, need to be open
      coded due to unusual existing escape mechanisms.
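
The effect of the helper can be modelled in a few lines of Python. This is a sketch of the idea only: it assumes the escaped set is comma, equals sign, space, tab, newline and backslash, each written as a three-digit octal escape, and the function names are hypothetical.

```python
# Characters that would let an option value break out of its field.
ESCAPED = ",= \t\n\\"


def escape_option(text: str) -> str:
    """Replace each risky character with a backslash and 3 octal digits."""
    return "".join(f"\\{ord(ch):03o}" if ch in ESCAPED else ch for ch in text)


def show_option(name: str, value=None) -> str:
    """Render one ',name[=value]' fragment of a mount options string."""
    out = "," + escape_option(name)
    if value is not None:
        out += "=" + escape_option(value)
    return out


if __name__ == "__main__":
    evil = "ovl/work/ 0 0\nnone /proc fuse.pwn user_id=1000"
    print(show_option("workdir", evil))
```

With the newline and spaces escaped, the crafted workdir value stays on one line inside its own field, so it can no longer masquerade as a second mount entry.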
      
      [akpm@linux-foundation.org: add lost chunk, per Kees]
      [keescook@chromium.org: seq_show_option should be using const parameters]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Acked-by: Jan Kara <jack@suse.com>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Cc: J. R. Okajima <hooanon05g@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 04 Sep, 2015 6 commits
  17. 14 Aug, 2015 1 commit
  18. 29 Jul, 2015 1 commit
    • block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig committed
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not being persistent
      when bios are queued up, and of not being passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  19. 19 Jun, 2015 2 commits
    • GFS2: Don't brelse rgrp buffer_heads every allocation · 39b0f1e9
      Bob Peterson committed
      This patch allows the block allocation code to retain the buffers
      for the resource groups so they don't need to be re-read from buffer
      cache with every request. This is a performance improvement that's
      especially noticeable when resource groups are very large. For
      example, with 2GB resource groups and 4K blocks, there can be 33
      blocks for every resource group. This patch allows those 33 buffers
      to be kept around and not read in and thrown away with every
      operation. The buffers are released when the resource group is
      either synced or invalidated.
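
The figure of 33 blocks can be checked with a little arithmetic. The Python sketch below assumes two bitmap bits per filesystem block, a 128-byte resource group header in the first block and a 24-byte metadata header in each later block; those sizes and the function name are assumptions made for illustration.

```python
def rgrp_bitmap_blocks(rgrp_bytes, block_size,
                       rgrp_header=128, meta_header=24, bits_per_block=2):
    """Estimate how many blocks hold the header and bitmap of one rgrp."""
    blocks = rgrp_bytes // block_size
    bitmap_bytes = (blocks * bits_per_block + 7) // 8
    # The first block loses space to the (assumed) resource group header.
    first_capacity = block_size - rgrp_header
    if bitmap_bytes <= first_capacity:
        return 1
    remaining = bitmap_bytes - first_capacity
    # Later blocks lose space to the (assumed) generic metadata header.
    per_block = block_size - meta_header
    return 1 + (remaining + per_block - 1) // per_block


if __name__ == "__main__":
    print(rgrp_bitmap_blocks(2 * 1024**3, 4096))
```

With these assumptions a 2GB resource group on 4K blocks needs 128KB of bitmap, which is 32 blocks of raw bitmap plus one more to absorb the per-block headers.
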
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
      Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
    • GFS2: Don't add all glocks to the lru · e7ccaf5f
      Bob Peterson committed
      The glocks used for resource groups often come and go hundreds of
      thousands of times per second. Adding them to the lru list just
      adds unnecessary contention for the lru_lock spin_lock, especially
      considering we're almost certainly going to re-use the glock and
      take it back off the lru microseconds later. We never want the
      glock shrinker to cull them anyway. This patch adds a new bit in
      the glops that determines which glock types get put onto the lru
      list and which ones don't.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
  20. 09 Jun, 2015 2 commits
  21. 03 Jun, 2015 2 commits
    • gfs2: limit quota log messages · 9cde2898
      Abhi Das committed
      This patch makes the quota subsystem only report once that a
      particular user/group has exceeded their allotted quota.
      
      Previously, it was possible for a program to continuously try
      exceeding quota (despite receiving EDQUOT) and in turn trigger
      gfs2 to issue a kernel log message each time the quota was exceeded. In theory,
      this could get out of hand and flood the log and the filesystem
      hosting the log files.
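
The report-once behaviour amounts to a latch on the quota entry. The sketch below is a hypothetical Python model (QuotaState, try_charge and the warned flag are invented names); it does not show when, or whether, the real code re-arms the message.

```python
class QuotaState:
    """Per user/group quota bookkeeping for the sketch."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0
        self.warned = False  # hypothetical "already logged" latch


def try_charge(quota, blocks, log):
    """Return True on success, False where the kernel would return EDQUOT."""
    if quota.used + blocks > quota.limit:
        if not quota.warned:
            log("quota exceeded")
            quota.warned = True  # later failures stay silent
        return False
    quota.used += blocks
    return True
```

A program that keeps retrying after EDQUOT still gets the error every time, but only the first failure reaches the log.
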
      Signed-off-by: Abhi Das <adas@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
    • gfs2: fix quota updates on block boundaries · 39a72580
      Abhi Das committed
      For smaller block sizes (512B, 1K, 2K), some quotas straddle block
      boundaries such that the usage value is on one block and the rest
      of the quota is on the previous block. In such cases, the value
      does not get updated correctly. This patch fixes that by addressing
      the boundary conditions correctly.
      
      This patch also adds a (s64) cast that was missing in a call to
      gfs2_quota_change() in inode.c
      Signed-off-by: Abhi Das <adas@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
  22. 02 Jun, 2015 1 commit
    • writeback: move bandwidth related fields from backing_dev_info into bdi_writeback · a88a341a
      Tejun Heo committed
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bandwidth related fields from backing_dev_info into
      bdi_writeback.
      
      * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
        write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
        balanced_dirty_ratelimit, completions and dirty_exceeded.
      
      * writeback_chunk_size() and over_bground_thresh() now take @wb
        instead of @bdi.
      
      * bdi_writeout_fraction(bdi, ...)	-> wb_writeout_fraction(wb, ...)
        bdi_dirty_limit(bdi, ...)		-> wb_dirty_limit(wb, ...)
        bdi_position_ratio(bdi, ...)		-> wb_position_ratio(wb, ...)
        bdi_update_write_bandwidth(bdi, ...)	-> wb_update_write_bandwidth(wb, ...)
        [__]bdi_update_bandwidth(bdi, ...)	-> [__]wb_update_bandwidth(wb, ...)
        bdi_{max|min}_pause(bdi, ...)		-> wb_{max|min}_pause(wb, ...)
        bdi_dirty_limits(bdi, ...)		-> wb_dirty_limits(wb, ...)
      
      * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
        respectively.  Note that explicit zeroing is dropped in the process
        as wb's are cleared in entirety anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
      
      v2: Typo in description fixed as suggested by Jan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  23. 19 May, 2015 1 commit
  24. 11 May, 2015 1 commit