1. 15 1月, 2016 1 次提交
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov 提交于
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  2. 04 1月, 2016 1 次提交
  3. 10 11月, 2015 1 次提交
    • C
      xfs: give all workqueues rescuer threads · 7a29ac47
      Chris Mason 提交于
      We're consistently hitting deadlocks here with XFS on recent kernels.
      After some digging through the crash files, it looks like everyone in
      the system is waiting for XFS to reclaim memory.
      
      Something like this:
      
      PID: 2733434  TASK: ffff8808cd242800  CPU: 19  COMMAND: "java"
       #0 [ffff880019c53588] __schedule at ffffffff818c4df2
       #1 [ffff880019c535d8] schedule at ffffffff818c5517
       #2 [ffff880019c535f8] _xfs_log_force_lsn at ffffffff81316348
       #3 [ffff880019c53688] xfs_log_force_lsn at ffffffff813164fb
       #4 [ffff880019c536b8] xfs_iunpin_wait at ffffffff8130835e
       #5 [ffff880019c53728] xfs_reclaim_inode at ffffffff812fd453
       #6 [ffff880019c53778] xfs_reclaim_inodes_ag at ffffffff812fd8c7
       #7 [ffff880019c53928] xfs_reclaim_inodes_nr at ffffffff812fe433
       #8 [ffff880019c53958] xfs_fs_free_cached_objects at ffffffff8130d3b9
       #9 [ffff880019c53968] super_cache_scan at ffffffff811a6f73
      #10 [ffff880019c539c8] shrink_slab at ffffffff811460e6
      #11 [ffff880019c53aa8] shrink_zone at ffffffff8114a53f
      #12 [ffff880019c53b48] do_try_to_free_pages at ffffffff8114a8ba
      #13 [ffff880019c53be8] try_to_free_pages at ffffffff8114ad5a
      #14 [ffff880019c53c78] __alloc_pages_nodemask at ffffffff8113e1b8
      #15 [ffff880019c53d88] alloc_kmem_pages_node at ffffffff8113e671
      #16 [ffff880019c53dd8] copy_process at ffffffff8104f781
      #17 [ffff880019c53ec8] do_fork at ffffffff8105129c
      #18 [ffff880019c53f38] sys_clone at ffffffff810515b6
      #19 [ffff880019c53f48] stub_clone at ffffffff818c8e4d
      
      xfs_log_force_lsn is waiting for logs to get cleaned, which is waiting
      for IO, which is waiting for workers to complete the IO which is waiting
      for worker threads that don't exist yet:
      
      PID: 2752451  TASK: ffff880bd6bdda00  CPU: 37  COMMAND: "kworker/37:1"
       #0 [ffff8808d20abbb0] __schedule at ffffffff818c4df2
       #1 [ffff8808d20abc00] schedule at ffffffff818c5517
       #2 [ffff8808d20abc20] schedule_timeout at ffffffff818c7c6c
       #3 [ffff8808d20abcc0] wait_for_completion_killable at ffffffff818c6495
       #4 [ffff8808d20abd30] kthread_create_on_node at ffffffff8106ec82
       #5 [ffff8808d20abdf0] create_worker at ffffffff8106752f
       #6 [ffff8808d20abe40] worker_thread at ffffffff810699be
       #7 [ffff8808d20abec0] kthread at ffffffff8106ef59
       #8 [ffff8808d20abf50] ret_from_fork at ffffffff818c8ac8
      
      I think we should be using WQ_MEM_RECLAIM to make sure this thread
      pool makes progress when we're not able to allocate new workers.
      
      [dchinner: make all workqueues WQ_MEM_RECLAIM]
      Signed-off-by: NChris Mason <clm@fb.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      7a29ac47
  4. 03 11月, 2015 1 次提交
  5. 19 10月, 2015 1 次提交
  6. 12 10月, 2015 4 次提交
    • B
      xfs: per-filesystem stats counter implementation · ff6d6af2
      Bill O'Donnell 提交于
      This patch modifies the stats counting macros and the callers
      to those macros to properly increment, decrement, and add-to
      the xfs stats counts. The counts for global and per-fs stats
      are correctly advanced, and cleared by writing a "1" to the
      corresponding clear file.
      
      global counts: /sys/fs/xfs/stats/stats
      per-fs counts: /sys/fs/xfs/sda*/stats/stats
      
      global clear:  /sys/fs/xfs/stats/stats_clear
      per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear
      
      [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
       stats structures and macros. ]
      Signed-off-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ff6d6af2
    • B
      xfs: per-filesystem stats in sysfs · 225e4635
      Bill O'Donnell 提交于
      This patch implements per-filesystem stats objects in sysfs. It
      depends on the application of the previous patch series that
      develops the infrastructure to support both xfs global stats and
      xfs per-fs stats in sysfs.
      
      Stats objects are instantiated when an xfs filesystem is mounted
      and deleted on unmount. With this patch, the stats directory is
      created and populated with the familiar stats and stats_clear files.
      Example:
              /sys/fs/xfs/sda9/stats/stats
              /sys/fs/xfs/sda9/stats/stats_clear
      
      With this patch, the individual counts within the new per-fs
      stats file(s) remain at zero. Functions that use the the macros
      to increment, decrement, and add-to the per-fs stats counts will
      be covered in a separate new patch to follow this one. Note that
      the counts within the global stats file (/sys/fs/xfs/stats/stats)
      advance normally and can be cleared as it was prior to this patch.
      
      [dchinner: move setup/teardown to xfs_fs_{fill|put}_super() so
      it is down before/after any path that uses the per-mount stats. ]
      Signed-off-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      225e4635
    • B
      xfs: pass xfsstats structures to handlers and macros · 80529c45
      Bill O'Donnell 提交于
      This patch is the next step toward per-fs xfs stats. The patch makes
      the show and clear routines able to handle any stats structure
      associated with a kobject.
      
      Instead of a single global xfsstats structure, add kobject and a pointer
      to a per-cpu struct xfsstats. Modify the macros that manipulate the stats
      accordingly: XFS_STATS_INC, XFS_STATS_DEC, and XFS_STATS_ADD now access
      xfsstats->xs_stats.
      
      The sysfs functions need to get from the kobject back to the xfsstats
      structure which contains it, and pass the pointer to the ->xs_stats
      percpu structure into the show & clear routines.
      Signed-off-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      80529c45
    • B
      xfs: create global stats and stats_clear in sysfs · bb230c12
      Bill O'Donnell 提交于
      Currently, xfs global stats are in procfs. This patch introduces
      (replicates) the global stats in sysfs. Additionally a stats_clear file
      is introduced in sysfs.
      Signed-off-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      bb230c12
  7. 05 9月, 2015 1 次提交
    • K
      fs: create and use seq_show_option for escaping · a068acf2
      Kees Cook 提交于
      Many file systems that implement the show_options hook fail to correctly
      escape their output which could lead to unescaped characters (e.g.  new
      lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
      could lead to confusion, spoofed entries (resulting in things like
      systemd issuing false d-bus "mount" notifications), and who knows what
      else.  This looks like it would only be the root user stepping on
      themselves, but it's possible weird things could happen in containers or
      in other situations with delegated mount privileges.
      
      Here's an example using overlay with setuid fusermount trusting the
      contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
      of "sudo" is something more sneaky:
      
        $ BASE="ovl"
        $ MNT="$BASE/mnt"
        $ LOW="$BASE/lower"
        $ UP="$BASE/upper"
        $ WORK="$BASE/work/ 0 0
        none /proc fuse.pwn user_id=1000"
        $ mkdir -p "$LOW" "$UP" "$WORK"
        $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
        $ cat /proc/mounts
        none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
        none /proc fuse.pwn user_id=1000 0 0
        $ fusermount -u /proc
        $ cat /proc/mounts
        cat: /proc/mounts: No such file or directory
      
      This fixes the problem by adding new seq_show_option and
      seq_show_option_n helpers, and updating the vulnerable show_option
      handlers to use them as needed.  Some, like SELinux, need to be open
      coded due to unusual existing escape mechanisms.
      
      [akpm@linux-foundation.org: add lost chunk, per Kees]
      [keescook@chromium.org: seq_show_option should be using const parameters]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: NJan Kara <jack@suse.com>
      Acked-by: NPaul Moore <paul@paul-moore.com>
      Cc: J. R. Okajima <hooanon05g@gmail.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a068acf2
  8. 25 8月, 2015 1 次提交
  9. 19 8月, 2015 1 次提交
    • B
      xfs: relocate sparse inode mount warning · 1b867d3a
      Brian Foster 提交于
      The sparse inodes feature is currently considered experimental. We warn
      at mount time from xfs_mount_validate_sb(). This function is part of the
      superblock verifier codepath, however, which means it could be invoked
      repeatedly on superblock reads or writes. This is currently only
      noticeable from userspace, where mkfs produces multiple warnings at
      format time.
      
      As mkfs warnings were not the intent of this change, relocate the mount
      time warning to xfs_fs_fill_super(), which is only invoked once and only
      in kernel space.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      1b867d3a
  10. 04 6月, 2015 1 次提交
  11. 16 4月, 2015 1 次提交
  12. 13 4月, 2015 1 次提交
  13. 25 3月, 2015 1 次提交
  14. 24 2月, 2015 2 次提交
  15. 23 2月, 2015 5 次提交
    • D
      xfs: introduce mmap/truncate lock · 653c60b6
      Dave Chinner 提交于
      Right now we cannot serialise mmap against truncate or hole punch
      sanely. ->page_mkwrite is not able to take locks that the read IO
      path normally takes (i.e. the inode iolock) because that could
      result in lock inversions (read - iolock - page fault - page_mkwrite
      - iolock) and so we cannot use an IO path lock to serialise page
      write faults against truncate operations.
      
      Instead, introduce a new lock that is used *only* in the
      ->page_mkwrite path that is the equivalent of the iolock. The lock
      ordering in a page fault is i_mmaplock -> page lock -> i_ilock,
      and so in truncate we can i_iolock -> i_mmaplock and so lock out
      new write faults during the process of truncation.
      
      Because i_mmap_lock is outside the page lock, we can hold it across
      all the same operations we hold the i_iolock for. The only
      difference is that we never hold the i_mmaplock in the normal IO
      path and so do not ever have the possibility that we can page fault
      inside it. Hence there are no recursion issues on the i_mmap_lock
      and so we can use it to serialise page fault IO against inode
      modification operations that affect the IO path.
      
      This patch introduces the i_mmaplock infrastructure, lockdep
      annotations and initialisation/destruction code. Use of the new lock
      will be in subsequent patches.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      653c60b6
    • D
      xfs: Remove icsb infrastructure · 5681ca40
      Dave Chinner 提交于
      Now that the in-core superblock infrastructure has been replaced with
      generic per-cpu counters, we don't need it anymore. Nuke it from
      orbit so we are sure that it won't haunt us again...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      5681ca40
    • D
      xfs: use generic percpu counters for free block counter · 0d485ada
      Dave Chinner 提交于
      XFS has hand-rolled per-cpu counters for the superblock since before
      there was any generic implementation. The free block counter is
      special in that it is used for ENOSPC detection outside transaction
      contexts for for delayed allocation. This means that the counter
      needs to be accurate at zero. The current per-cpu counter code jumps
      through lots of hoops to ensure we never run past zero, but we don't
      need to make all those jumps with the generic counter
      implementation.
      
      The generic counter implementation allows us to pass a "batch"
      threshold at which the addition/subtraction to the counter value
      will be folded back into global value under lock. We can use this
      feature to reduce the batch size as we approach 0 in a very similar
      manner to the existing counters and their rebalance algorithm. If we
      use a batch size of 1 as we approach 0, then every addition and
      subtraction will be done against the global value and hence allow
      accurate detection of zero threshold crossing.
      
      Hence we can replace the handrolled, accurate-at-zero counters with
      generic percpu counters.
      
      Note: this removes just enough of the icsb infrastructure to compile
      without warnings. The rest will go in subsequent commits.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      0d485ada
    • D
      xfs: use generic percpu counters for free inode counter · e88b64ea
      Dave Chinner 提交于
      XFS has hand-rolled per-cpu counters for the superblock since before
      there was any generic implementation. The free inode counter is not
      used for any limit enforcement - the per-AG free inode counters are
      used during allocation to determine if there are inode available for
      allocation.
      
      Hence we don't need any of the complexity of the hand-rolled
      counters and we can simply replace them with generic per-cpu
      counters similar to the inode counter.
      
      This version introduces a xfs_mod_ifree() helper function from
      Christoph Hellwig.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      e88b64ea
    • D
      xfs: use generic percpu counters for inode counter · 501ab323
      Dave Chinner 提交于
      XFS has hand-rolled per-cpu counters for the superblock since before
      there was any generic implementation. There are some warts around
      the  use of them for the inode counter as the hand rolled counter is
      designed to be accurate at zero, but has no specific accurracy at
      any other value. This design causes problems for the maximum inode
      count threshold enforcement, as there is no trigger that balances
      the counters as they get close tothe maximum threshold.
      
      Instead of designing new triggers for balancing, just replace the
      handrolled per-cpu counter with a generic counter.  This enables us
      to update the counter through the normal superblock modification
      funtions, but rather than do that we add a xfs_mod_icount() helper
      function (from Christoph Hellwig) and keep the percpu counter
      outside the superblock in the struct xfs_mount.
      
      This means we still need to initialise the per-cpu counter
      specifically when we read the superblock, and vice versa when we
      log/write it, but it does mean that we don't need to change any
      other code.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      501ab323
  16. 13 2月, 2015 1 次提交
    • V
      fs: consolidate {nr,free}_cached_objects args in shrink_control · 4101b624
      Vladimir Davydov 提交于
      We are going to make FS shrinkers memcg-aware.  To achieve that, we will
      have to pass the memcg to scan to the nr_cached_objects and
      free_cached_objects VFS methods, which currently take only the NUMA node
      to scan.  Since the shrink_control structure already holds the node, and
      the memcg to scan will be added to it when we introduce memcg-aware
      vmscan, let us consolidate the methods' arguments in this structure to
      keep things clean.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Suggested-by: NDave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4101b624
  17. 06 2月, 2015 1 次提交
  18. 22 1月, 2015 2 次提交
    • D
      xfs: consolidate superblock logging functions · 61e63ecb
      Dave Chinner 提交于
      We now have several superblock loggin functions that are identical
      except for the transaction reservation and whether it shoul dbe a
      synchronous transaction or not. Consolidate these all into a single
      function, a single reserveration and a sync flag and call it
      xfs_sync_sb().
      
      Also, xfs_mod_sb() is not really a modification function - it's the
      operation of logging the superblock buffer. hence change the name of
      it to reflect this.
      
      Note that we have to change the mp->m_update_flags that are passed
      around at mount time to a boolean simply to indicate a superblock
      update is needed.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      61e63ecb
    • D
      xfs: remove bitfield based superblock updates · 4d11a402
      Dave Chinner 提交于
      When we log changes to the superblock, we first have to write them
      to the on-disk buffer, and then log that. Right now we have a
      complex bitfield based arrangement to only write the modified field
      to the buffer before we log it.
      
      This used to be necessary as a performance optimisation because we
      logged the superblock buffer in every extent or inode allocation or
      freeing, and so performance was extremely important. We haven't done
      this for years, however, ever since the lazy superblock counters
      pulled the superblock logging out of the transaction commit
      fast path.
      
      Hence we have a bunch of complexity that is not necessary that makes
      writing the in-core superblock to disk much more complex than it
      needs to be. We only need to log the superblock now during
      management operations (e.g. during mount, unmount or quota control
      operations) so it is not a performance critical path anymore.
      
      As such, remove the complex field based logging mechanism and
      replace it with a simple conversion function similar to what we use
      for all other on-disk structures.
      
      This means we always log the entirity of the superblock, but again
      because we rarely modify the superblock this is not an issue for log
      bandwidth or CPU time. Indeed, if we do log the superblock
      frequently, delayed logging will minimise the impact of this
      overhead.
      
      [Fixed gquota/pquota inode sharing regression noticed by bfoster.]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      4d11a402
  19. 24 12月, 2014 1 次提交
  20. 04 12月, 2014 2 次提交
  21. 01 12月, 2014 1 次提交
  22. 28 11月, 2014 4 次提交
  23. 10 11月, 2014 1 次提交
  24. 29 9月, 2014 1 次提交
  25. 09 9月, 2014 2 次提交
    • B
      xfs: add debug sysfs attribute set · 65b65735
      Brian Foster 提交于
      Create a top-level debug directory for global debug sysfs attributes.
      This directory is added and removed on XFS module initialization and
      removal respectively for DEBUG mode kernels only. It typically resides
      at /sys/fs/xfs/debug. It is located at the top level of the xfs sysfs
      hierarchy as attributes might define global behavior or behavior that
      must be configured before an xfs mount is available (e.g., log recovery
      behavior).
      
      Define the global debug kobject that represents the debug sysfs
      directory and add generic attribute show/store helpers to support future
      attributes. No debug attributes are exported as of yet.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      65b65735
    • B
      xfs: mark all internal workqueues as freezable · 8018ec08
      Brian Foster 提交于
      Workqueues must be explicitly set as freezable to ensure they are frozen
      in the assocated part of the hibernation/suspend sequence. Freezing of
      workqueues and kernel threads is important to ensure that modifications
      are not made on-disk after the hibernation image has been created.
      Otherwise, the in-memory state can become inconsistent with what is on
      disk and eventually lead to filesystem corruption. We have reports of
      free space btree corruptions that occur immediately after restore from
      hibernate that suggest the xfs-eofblocks workqueue could be causing
      such problems if it races with hibernation.
      
      Mark all of the internal XFS workqueues as freezable to ensure nothing
      changes on-disk once the freezer infrastructure freezes kernel threads
      and creates the hibernation image.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reported-by: NCarlos E. R. <carlos.e.r@opensuse.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8018ec08
  26. 30 7月, 2014 1 次提交