1. 11 Dec, 2014 (1 commit)
  2. 04 Dec, 2014 (1 commit)
  3. 02 Dec, 2014 (1 commit)
    • jbd2: fix regression where we fail to initialize checksum seed when loading · 32f38691
      Committed by Darrick J. Wong
      When we're enabling journal features, we cannot use the predicate
      jbd2_journal_has_csum_v2or3() because we haven't yet set the sb
      feature flag fields!  Moreover, we just finished loading the shash
      driver, so the test is unnecessary; calculate the seed always.
      
      Without this patch, we fail to initialize the checksum seed the first
      time we turn on journal_checksum, which means that all journal blocks
      written during that first mount are corrupt.  Transactions written
      after the second mount will be fine, since the feature flag will be
      set in the journal superblock.  xfstests generic/{034,321,322} are the
      regression tests.
      
      (This is important for 3.18.)
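      
      A minimal sketch of the shape of the fix in
      jbd2_journal_set_features() (fs/jbd2/journal.c; the surrounding
      context is abbreviated and assumed here): once the crc32c shash
      driver is loaded, compute the seed unconditionally instead of
      gating it on jbd2_journal_has_csum_v2or3():
      
      	if (journal->j_chksum_driver == NULL) {
      		journal->j_chksum_driver = crypto_alloc_shash("crc32c", 0, 0);
      		if (IS_ERR(journal->j_chksum_driver)) {
      			printk(KERN_ERR "JBD2: Cannot load crc32c driver.\n");
      			journal->j_chksum_driver = NULL;
      			return 0;
      		}
      		/* Precompute checksum seed for all metadata */
      		journal->j_csum_seed = jbd2_chksum(journal, ~0, sb->s_uuid,
      						   sizeof(sb->s_uuid));
      	}
      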
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reported-by: Eric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  4. 01 Dec, 2014 (1 commit)
  5. 20 Nov, 2014 (12 commits)
  6. 18 Nov, 2014 (3 commits)
    • fs: Do not include mpx.h in exec.c · abe1e395
      Committed by Dave Hansen
      We no longer need mpx.h in exec.c; leaving the include in place
      would also break the build for non-x86 architectures.  We get the
      MPX includes that we need from mmu_context.h now.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141118003608.837015B3@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86, mpx: On-demand kernel allocation of bounds tables · fe3d197f
      Committed by Dave Hansen
      This is really the meat of the MPX patch set.  If there is one patch to
      review in the entire series, this is the one.  There is a new ABI here
      and this kernel code also interacts with userspace memory in a
      relatively unusual manner (small FAQ below).
      
      Long Description:
      
      This patch adds two prctl() commands to enable or disable kernel
      management of bounds tables, including on-demand kernel allocation
      (see the patch "on-demand kernel allocation of bounds tables") and
      cleanup (see the patch "cleanup unused bound tables").  Applications
      do not strictly need the kernel to manage bounds tables, and we
      expect some applications to use MPX without taking advantage of this
      kernel support.  This means the kernel cannot simply infer whether
      an application needs bounds table management from the MPX registers.
      The prctl() is an explicit signal from userspace.
      
      PR_MPX_ENABLE_MANAGEMENT is a signal from userspace requesting the
      kernel's help in managing bounds tables.
      
      PR_MPX_DISABLE_MANAGEMENT is the opposite: userspace no longer wants
      the kernel's help.  With PR_MPX_DISABLE_MANAGEMENT, the kernel won't
      allocate or free bounds tables even if the CPU supports MPX.
      
      PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
      directory out of a userspace register (bndcfgu) and then cache it into
      a new field (->bd_addr) in the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
      will set "bd_addr" to an invalid address.  Using this scheme, we can
      use "bd_addr" to determine whether kernel management of bounds tables
      is enabled.
      
      Also, the only way to access that bndcfgu register is via an xsaves,
      which can be expensive.  Caching "bd_addr" like this also helps reduce
      the cost of those xsaves when doing table cleanup at munmap() time.
      Unfortunately, we cannot apply this optimization to #BR fault time
      because we need an xsave to get the value of BNDSTATUS.
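      
      A minimal userspace sketch of the new ABI (the fallback constants
      below mirror the uapi values added by this series; treat them as
      assumptions if your libc headers predate MPX):
      
      	#include <stdio.h>
      	#include <sys/prctl.h>
      
      	#ifndef PR_MPX_ENABLE_MANAGEMENT
      	# define PR_MPX_ENABLE_MANAGEMENT	43
      	# define PR_MPX_DISABLE_MANAGEMENT	44
      	#endif
      
      	int main(void)
      	{
      		/* ask the kernel to allocate/free bounds tables for us */
      		if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0))
      			perror("PR_MPX_ENABLE_MANAGEMENT");
      
      		/* ... run MPX-instrumented code ... */
      
      		/* tell the kernel to stop managing bounds tables */
      		if (prctl(PR_MPX_DISABLE_MANAGEMENT, 0, 0, 0, 0))
      			perror("PR_MPX_DISABLE_MANAGEMENT");
      		return 0;
      	}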
      
      ==== Why does the hardware even have these Bounds Tables? ====
      
      MPX only has 4 hardware registers for storing bounds information.
      If MPX-enabled code needs more than these 4 registers, it needs to
      spill them somewhere. It has two special instructions for this
      which allow the bounds to be moved between the bounds registers
      and some new "bounds tables".
      
      The spill instructions raise #BR exceptions, which are conceptually
      similar to page faults: the MPX hardware raises them both on bounds
      violations and when the tables are not present.  This patch handles
      the #BR exceptions for not-present tables by carving the space out
      of the normal process's address space (essentially calling the new
      mmap() interface introduced earlier in this patch set) and then
      pointing the bounds directory at it.
      
      The tables *need* to be accessed and controlled by userspace because
      the instructions for moving bounds in and out of them are extremely
      frequent. They potentially happen every time a register pointing to
      memory is dereferenced. Any direct kernel involvement (like a syscall)
      to access the tables would obviously destroy performance.
      
      ==== Why not do this in userspace? ====
      
      This patch is obviously doing this allocation in the kernel.
      However, MPX does not strictly *require* anything in the kernel.
      It can theoretically be done completely from userspace. Here are
      a few ways this *could* be done. I don't think any of them are
      practical in the real-world, but here they are.
      
      Q: Can virtual space simply be reserved for the bounds tables so
         that we never have to allocate them?
      A: As noted earlier, these tables are *HUGE*. An X-GB virtual
         area needs 4*X GB of virtual space, plus 2GB for the bounds
         directory. If we were to preallocate them for the 128TB of
         user virtual address space, we would need to reserve 512TB+2GB,
         which is larger than the entire virtual address space today.
         This means they cannot be reserved ahead of time.  Also, a
         single process's pre-populated bounds directory consumes 2GB
         of virtual *AND* physical memory.  IOW, it's completely
         infeasible to prepopulate bounds directories.
      
      Q: Can we preallocate bounds table space at the same time memory
         is allocated which might contain pointers that might eventually
         need bounds tables?
      A: This would work if we could hook the site of each and every
         memory allocation syscall. This can be done for small,
         constrained applications. But, it isn't practical at a larger
         scale since a given app has no way of controlling how all the
         parts of the app might allocate memory (think libraries). The
         kernel is really the only place to intercept these calls.
      
      Q: Could a bounds fault be handed to userspace and the tables
         allocated there in a signal handler instead of in the kernel?
      A: (thanks to tglx) mmap() is not on the list of safe async
         handler functions and even if mmap() would work it still
         requires locking or nasty tricks to keep track of the
         allocation state there.
      
      Having ruled out all of the userspace-only approaches for managing
      bounds tables that we could think of, we create them on demand in
      the kernel.
      Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific · 4aae7e43
      Committed by Qiaowei Ren
      MPX-enabled applications using large swaths of memory can
      potentially have large numbers of bounds tables in process
      address space to save bounds information. These tables can take
      up huge swaths of memory (as much as 80% of the memory on the
      system) even if we clean them up aggressively. In the worst-case
      scenario, the tables can be 4x the size of the data structure
      being tracked. IOW, a 1-page structure can require 4 bounds-table
      pages.
      
      Since the tables are this huge, we expect that folks using MPX will
      be keen to figure out how much memory is being dedicated to them.
      So we need a way to track memory use for MPX.
      
      If we want to specifically track MPX VMAs we need to be able to
      distinguish them from normal VMAs, and keep them from getting
      merged with normal VMAs. A new VM_ flag set only on MPX VMAs does
      both of those things. With this flag, MPX bounds-table VMAs can
      be distinguished from other VMAs, and userspace can also walk
      /proc/$pid/smaps to get memory usage for MPX.
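      
      For example, userspace could total the bounds-table mappings by
      scanning smaps for the "mp" VmFlags code.  A rough sketch (the
      two-letter code and the standard smaps layout are assumptions here;
      error handling elided):
      
      	#include <stdio.h>
      	#include <string.h>
      
      	int main(int argc, char **argv)
      	{
      		const char *path = argc > 1 ? argv[1] : "/proc/self/smaps";
      		FILE *f = fopen(path, "r");
      		char line[512];
      		long cur = 0, total = 0;
      
      		if (!f)
      			return 1;
      		while (fgets(line, sizeof(line), f)) {
      			/* "Size:" precedes "VmFlags:" within each mapping */
      			sscanf(line, "Size: %ld kB", &cur);
      			if (!strncmp(line, "VmFlags:", 8) && strstr(line, " mp "))
      				total += cur;
      		}
      		fclose(f);
      		printf("MPX bounds-table memory: %ld kB\n", total);
      		return 0;
      	}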
      
      In addition to this flag, we also introduce a special ->vm_ops
      specific to MPX VMAs (see the patch "add MPX specific mmap
      interface"), but currently different ->vm_ops do not by
      themselves prevent VMA merging, so we still need this flag.
      
      We understand that VM_ flags are scarce and are open to other
      options.
      Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  7. 14 Nov, 2014 (2 commits)
  8. 13 Nov, 2014 (11 commits)
  9. 07 Nov, 2014 (7 commits)
    • xfs: track bulkstat progress by agino · 00275899
      Committed by Dave Chinner
      The bulkstat main loop tracks progress with the "lastino" variable,
      which is a full 64-bit inode number.  However, the loop actually
      works on agno/agino pairs, so there's a significant disconnect
      between the rest of the loop and the main cursor.  Convert this to
      use the agino, and pass the agino into the chunk formatting function
      and convert it too.
      
      This gets rid of the inconsistency in the loop processing, and
      finally makes it simple for us to skip inodes at any point in the
      loop simply by incrementing the agino cursor.
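      
      For illustration, a 64-bit XFS inode number is the AG number
      concatenated with the per-AG inode number, so the cursor split looks
      like this toy sketch (the real shift comes from filesystem geometry
      via XFS_INO_TO_AGNO()/XFS_INO_TO_AGINO(); the 32-bit split below is
      only an assumption for the example):
      
      	#include <stdint.h>
      	#include <stdio.h>
      
      	#define AGSHIFT 32	/* illustrative; XFS derives this from geometry */
      
      	static uint64_t agino_to_ino(uint32_t agno, uint32_t agino)
      	{
      		return ((uint64_t)agno << AGSHIFT) | agino;
      	}
      
      	int main(void)
      	{
      		uint64_t lastino = agino_to_ino(3, 128);	/* resume cookie */
      		uint32_t agno  = lastino >> AGSHIFT;
      		uint32_t agino = (uint32_t)lastino;
      
      		/* skipping an inode is now just an agino increment */
      		agino++;
      		printf("next cookie: 0x%llx\n",
      		       (unsigned long long)agino_to_ino(agno, agino));
      		return 0;
      	}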
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: bulkstat error handling is broken · febe3cbe
      Committed by Dave Chinner
      The error propagation is a horror: xfs_bulkstat() returns an rval
      variable which is only set if there are formatter errors.  Any sort
      of btree walk error or corruption will cause the bulkstat walk to
      terminate, but will not pass an error back to userspace.  Worse,
      formatter errors are also ignored if any inodes were correctly
      formatted into the user buffer.
      
      Hence bulkstat can fail badly yet still report success to userspace.
      This causes significant issues with xfsdump not dumping everything
      in the filesystem yet reporting success.  It's not until a restore
      fails that there is any indication that the dump was bad and that
      bulkstat failed.  With this patch, xfsdump now fails on bulkstat
      errors rather than silently missing files in the dump.
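      
      A toy sketch of the bug shape (names are hypothetical, not the
      actual xfs_bulkstat() locals): the walk error used to be dropped
      whenever some inodes had already been formatted.
      
      	#include <stdio.h>
      
      	static int walk_inode_chunks(int *formatted)
      	{
      		*formatted = 42;	/* some inodes copied out... */
      		return -5;		/* ...then the btree walk hits -EIO */
      	}
      
      	int main(void)
      	{
      		int formatted = 0;
      		int error = walk_inode_chunks(&formatted);
      
      		/* old behaviour: if (formatted) error = 0; -- hid the failure */
      		if (error)
      			fprintf(stderr, "bulkstat: error %d after %d inodes\n",
      				error, formatted);
      		return error ? 1 : 0;
      	}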
      
      This now causes bulkstat to fail when the lastino cookie does not
      fall inside an existing inode chunk. The pre-3.17 code tolerated
      that error by allowing the code to move to the next inode chunk
      as the agino target is guaranteed to fall into the next btree
      record.
      
      With the fixes up to this point in the series, xfsdump now passes on
      the troublesome filesystem image that exposes all these bugs.
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: bulkstat main loop logic is a mess · 6e57c542
      Committed by Dave Chinner
      There are a bunch of variables that are more widely scoped than they
      need to be, obfuscated user buffer checks and tortured "next inode"
      tracking.  This all needs cleaning up to expose the real issues that
      need fixing.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: bulkstat chunk-formatter has issues · 2b831ac6
      Committed by Dave Chinner
      The loop construct has issues:
      	- clustidx is completely unused, so remove it.
      	- the loop tries to be smart by terminating when the
      	  "freecount" tells it that all inodes are free. Just drop
      	  it as in most cases we have to scan all inodes in the
      	  chunk anyway.
      	- move the "user buffer left" condition check to the only
      	  point where we consume space int eh user buffer.
      	- move the initialisation of agino out of the loop, leaving
      	  just simple loop control logic using the cluster index.
      
      Also, double handling of the user buffer variables leads to problems
      tracking the current state: use the cursor variables directly
      rather than keeping local copies and then having to update the
      cursor before returning.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: bulkstat chunk formatting cursor is broken · bf4a5af2
      Committed by Dave Chinner
      The xfs_bulkstat_agichunk formatting cursor takes buffer values from
      the main loop, passes them via the structure to the chunk formatter,
      and then writes the changed values back into the main loop's local
      variables.  Unfortunately, this complex dance is full of corner
      cases that aren't handled correctly.
      
      The biggest problem is that it is double handling the information in
      both the main loop and the chunk formatting function, leading to
      inconsistent updates and endless loops where progress is not made.
      
      To fix this, push the struct xfs_bulkstat_agichunk outwards to be
      the primary holder of user buffer information.  This removes the
      double handling in the main loop.
      
      Also, pass the last inode processed by the chunk formatter as a
      separate parameter as it is purely an output variable and is not
      related to the user buffer consumption cursor.
      
      Finally, the chunk formatting code is not shared by anyone, so make
      it local to xfs_itable.c.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: bulkstat btree walk doesn't terminate · afa947cb
      Committed by Dave Chinner
      The bulkstat code has several different ways of detecting the end of
      an AG when doing a walk. They are not consistently detected, and the
      code that checks for the end of AG conditions is not consistently
      coded.  Hence there are conditions where the walk code can get stuck in
      an endless loop making no progress and not triggering any
      termination conditions.
      
      Convert all the "tmp/i" status return codes from btree operations
      to a common name (stat) and apply end-of-ag detection to these
      operations consistently.
      
      cc: <stable@vger.kernel.org> # 3.17
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • aio: fix incorrect dirty page accounting when truncating the AIO ring buffer · 835f252c
      Committed by Gu Zheng
      https://bugzilla.kernel.org/show_bug.cgi?id=86831
      
      Markus reported that shutting down mysqld (with AIO support, on an
      ext3-formatted hard drive) leads to a negative number of dirty pages
      (the counter underruns).  The negative number drastically reduces
      write performance because the page cache is not used: the kernel
      thinks there are still 2^32 dirty pages outstanding.
      
      Adding a warning in __dec_zone_state catches this easily:
      
      static inline void __dec_zone_state(struct zone *zone, enum
      	zone_stat_item item)
      {
           atomic_long_dec(&zone->vm_stat[item]);
      +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
      	atomic_long_read(&zone->vm_stat[item]) < 0);
           atomic_long_dec(&vm_stat[item]);
      }
      
      [   21.341632] ------------[ cut here ]------------
      [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
      cancel_dirty_page+0x164/0x224()
      [   21.355296] Modules linked in: wutbox_cp sata_mv
      [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
      [   21.366793] Workqueue: events free_ioctx
      [   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
      (show_stack+0x20/0x24)
      [   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
      (dump_stack+0x24/0x28)
      [   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
      (warn_slowpath_common+0x84/0x9c)
      [   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
      (warn_slowpath_null+0x2c/0x34)
      [   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
      (cancel_dirty_page+0x164/0x224)
      [   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
      (truncate_inode_page+0x8c/0x158)
      [   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
      (truncate_inode_pages_range+0x11c/0x53c)
      [   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
      [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
      [   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
      (truncate_setsize+0x5c/0x74)
      [   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
      (put_aio_ring_file.isra.14+0x34/0x90)
      [   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
      [<c013b424>] (aio_free_ring+0x20/0xcc)
      [   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
      (free_ioctx+0x24/0x44)
      [   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
      (process_one_work+0x134/0x47c)
      [   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
      (worker_thread+0x130/0x414)
      [   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
      (kthread+0xd4/0xec)
      [   21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
      (ret_from_fork+0x14/0x20)
      [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
      
      The cause is that at init time we mark the aio ring file pages
      *DIRTY* via SetPageDirty (which bypasses the VFS dirty-page
      increment), and the aio fs uses *default_backing_dev_info* as its
      backing dev, which does not disable dirty-page accounting.  So
      truncating the aio ring file decrements a VFS dirty-page count that
      was never incremented, and the counter underruns.
      
      The original goal of marking the pages dirty was to keep them in
      memory (neither reclaimable nor swappable) for their lifetime.  But
      the pages are already pinned via an elevated refcount, which
      achieves that goal on its own, so the SetPageDirty is unnecessary.
      
      To fix the issue, use __set_page_dirty_no_writeback instead of the
      nop .set_page_dirty, and drop the SetPageDirty (don't manually set
      the dirty flag, don't disable set_page_dirty(); rely on the default
      behaviour).
      
      With the above change, dirty-page accounting works correctly.  But
      since the aio fs is an anonymous one that should never cause any
      real writeback, we can skip dirty-page (writeback) accounting
      altogether by disabling that capability.  So we introduce an
      aio-private backing dev info (with the ACCT_DIRTY/WRITEBACK/ACCT_WB
      capabilities disabled) to replace the default one.
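      
      A sketch of the resulting shape (abbreviated; the exact field layout
      of this era's struct backing_dev_info and the unchanged aops entries
      are assumed from the description above):
      
      	static const struct address_space_operations aio_ctx_aops = {
      		.set_page_dirty	= __set_page_dirty_no_writeback,
      		/* ... remaining entries unchanged ... */
      	};
      
      	/* private bdi: no dirty/writeback accounting for the anon aio fs */
      	static struct backing_dev_info aio_fs_backing_dev_info = {
      		.name		= "aiofs",
      		.state		= 0,
      		.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
      	};
      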
      Reported-by: Markus Königshaus <m.koenigshaus@wut.de>
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: stable <stable@vger.kernel.org>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
  10. 06 Nov, 2014 (1 commit)