1. 07 Nov, 2015 1 commit
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman committed
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
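
The flag split described above can be modelled in user space. This is a minimal sketch, not the kernel header: the bit values and every `MODEL_`/`model_` name are made up for illustration, and the composition of the atomic mask (high priority, atomic, kswapd wakeup) is an assumption about how GFP_ATOMIC is built. It shows why testing for the redefined __GFP_WAIT can give a false positive while a check on the direct reclaim bit alone does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for gfp_t; bit values are illustrative only. */
typedef unsigned int model_gfp_t;

#define MODEL_BIT_HIGH            0x01u  /* may use the high priority reserve */
#define MODEL_BIT_ATOMIC          0x02u  /* cannot sleep and has no fallback */
#define MODEL_BIT_DIRECT_RECLAIM  0x04u  /* willing to sleep in direct reclaim */
#define MODEL_BIT_KSWAPD_RECLAIM  0x08u  /* wants kswapd woken */
#define MODEL_BIT_IO              0x10u
#define MODEL_BIT_FS              0x20u

/* __GFP_WAIT as redefined: direct reclaim plus kswapd wakeup. */
#define MODEL_BIT_WAIT (MODEL_BIT_DIRECT_RECLAIM | MODEL_BIT_KSWAPD_RECLAIM)

/* Assumed compositions of the two common masks. */
#define MODEL_GFP_ATOMIC \
	(MODEL_BIT_HIGH | MODEL_BIT_ATOMIC | MODEL_BIT_KSWAPD_RECLAIM)
#define MODEL_GFP_KERNEL (MODEL_BIT_WAIT | MODEL_BIT_IO | MODEL_BIT_FS)

/* Models the helper: only the direct reclaim bit means "may block". */
static bool model_allow_blocking(model_gfp_t flags)
{
	return (flags & MODEL_BIT_DIRECT_RECLAIM) != 0;
}

/* The historical style of check; any bit of the wider mask matches. */
static bool model_legacy_wait_check(model_gfp_t flags)
{
	return (flags & MODEL_BIT_WAIT) != 0;
}

/* A mempool-backed caller: never sleep, but still let kswapd be woken. */
static model_gfp_t model_strip_direct_reclaim(model_gfp_t flags)
{
	model_gfp_t out = flags & ~MODEL_BIT_DIRECT_RECLAIM;

	assert(!model_allow_blocking(out));
	return out;
}
```

With these definitions the legacy check reports the atomic mask as "may wait" because it shares the kswapd bit, whereas `model_allow_blocking()` does not.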
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 06 Nov, 2015 20 commits
  3. 04 Nov, 2015 1 commit
  4. 03 Nov, 2015 1 commit
  5. 02 Nov, 2015 1 commit
    • L
      mm: get rid of 'vmalloc_info' from /proc/meminfo · a5ad88ce
      Linus Torvalds committed
      It turns out that at least some versions of glibc end up reading
      /proc/meminfo at every single startup, because glibc wants to know the
      amount of memory the machine has.  And while that's arguably insane,
      it's just how things are.
      
      And it turns out that it's not all that expensive most of the time, but
      the vmalloc information statistics (amount of virtual memory used in the
      vmalloc space, and the biggest remaining chunk) can be rather expensive
      to compute.
      
      The 'get_vmalloc_info()' function actually showed up on my profiles as
      4% of the CPU usage of "make test" in the git source repository, because
      the git tests are lots of very short-lived shell-scripts etc.
      
      It turns out that apparently this same silly vmalloc info gathering
      shows up on the facebook servers too, according to Dave Jones.  So it's
      not just "make test" for git.
      
      We had two patches to just cache the information (one by me, one by
      Ingo) to mitigate this issue, but the whole vmalloc information is of
      rather dubious value to begin with, and people who *actually* want to
      know what the situation is wrt the vmalloc area should just look at the
      much more complete /proc/vmallocinfo instead.
      
      In fact, according to my testing - and perhaps more importantly,
      according to that big search engine in the sky: Google - there is
      nothing out there that actually cares about those two expensive fields:
      VmallocUsed and VmallocChunk.
      
      So let's try to just remove them entirely.  Actually, this just removes
      the computation and reports the numbers as zero for now, just to try to
      be minimally intrusive.
      
      If this breaks anything, we'll obviously have to re-introduce the code
      to compute this all and add the caching patches on top.  But if given
      the option, I'd really prefer to just remove this bad idea entirely
      rather than add even more code to work around our historical mistake
      that likely nobody really cares about.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 01 Nov, 2015 2 commits
    • L
      vfs: conditionally clear close-on-exec flag · fc90888d
      Linus Torvalds committed
      We clear the close-on-exec flag when opening and closing files, and the
      bit was almost always already clear before.  Avoid dirtying the
      cacheline if the clearing isn't necessary.  That avoids unnecessary
      cacheline dirtying and bouncing in multi-socket environments.
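
The idea of testing before clearing can be sketched in plain C. This is a user-space model, not the kernel code: the `model_` names are hypothetical and the bitmap is an ordinary array rather than the close-on-exec bitmap of a file descriptor table. The point is only that a read leaves the cacheline shared, while an unconditional write would dirty it even when the bit was already clear.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BITS_PER_LONG ((size_t)(sizeof(unsigned long) * CHAR_BIT))

/*
 * Clear a bit only if it is currently set.
 * Returns true if a store was actually performed.
 */
static bool model_clear_bit_if_set(unsigned long *map, size_t bit)
{
	unsigned long *word;
	unsigned long mask;

	assert(map != NULL);
	word = &map[bit / MODEL_BITS_PER_LONG];
	mask = 1UL << (bit % MODEL_BITS_PER_LONG);

	if (!(*word & mask))
		return false;	/* common case: read only, no write */

	*word &= ~mask;		/* rare case: the bit was set, so write */
	return true;
}
```

Because the bit is almost always clear already, the common path becomes a load and a branch with no store.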
      
      Eric Dumazet has a file descriptor benchmark that goes 4% faster from
      this on his two-socket machine.  It's probably partly superlinear
      improvement due to getting slightly less spinlock contention on the
      file_lock spinlock due to less work in the critical section.
      Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      vfs: Fix pathological performance case for __alloc_fd() · f3f86e33
      Linus Torvalds committed
      Al Viro points out that:
      > >     * [Linux-specific aside] our __alloc_fd() can degrade quite badly
      > > with some use patterns.  The cacheline pingpong in the bitmap is probably
      > > inevitable, unless we accept considerably heavier memory footprint,
      > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
      > > to trigger - close(3);open(...); will have the next open() after that
      > > scanning the entire in-use bitmap.
      
      And Eric Dumazet has a somewhat realistic multithreaded microbenchmark
      that opens and closes a lot of sockets with minimal work per socket.
      
      This patch largely fixes it.  We keep a 2nd-level bitmap of the open
      file bitmaps, showing which words are already full.  So then we can
      traverse that second-level bitmap to efficiently skip already allocated
      file descriptors.
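
A two-level bitmap of this kind can be sketched as follows. This is a simplified user-space model under stated assumptions: the table is fixed at eight words, the summary fits in a single `unsigned long`, and all `model_` names are hypothetical rather than the kernel's. The kernel version scans the summary bitmap itself a word at a time; the sketch only shows how a set summary bit lets the search skip a full word without reading it.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MODEL_WORD_BITS ((size_t)(sizeof(unsigned long) * CHAR_BIT))
#define MODEL_NWORDS    ((size_t)8)	/* must not exceed MODEL_WORD_BITS */
#define MODEL_MAX_FDS   (MODEL_NWORDS * MODEL_WORD_BITS)

struct model_fdtable {
	unsigned long open_fds[8];	/* first level: one bit per fd */
	unsigned long full_words;	/* second level: bit i set means
					 * open_fds[i] has no free bit */
};

static void model_set_open(struct model_fdtable *t, size_t fd)
{
	size_t w = fd / MODEL_WORD_BITS;

	assert(fd < MODEL_MAX_FDS);
	t->open_fds[w] |= 1UL << (fd % MODEL_WORD_BITS);
	if (t->open_fds[w] == ~0UL)
		t->full_words |= 1UL << w;	/* word just became full */
}

static void model_clear_open(struct model_fdtable *t, size_t fd)
{
	size_t w = fd / MODEL_WORD_BITS;

	assert(fd < MODEL_MAX_FDS);
	t->open_fds[w] &= ~(1UL << (fd % MODEL_WORD_BITS));
	t->full_words &= ~(1UL << w);		/* word has a free bit again */
}

/* Return the lowest free fd >= start, or MODEL_MAX_FDS if there is none. */
static size_t model_find_next_fd(const struct model_fdtable *t, size_t start)
{
	size_t w = start / MODEL_WORD_BITS;
	size_t bit = start % MODEL_WORD_BITS;

	for (; w < MODEL_NWORDS; w++, bit = 0) {
		if (t->full_words & (1UL << w))
			continue;	/* full word: skip without scanning it */
		for (; bit < MODEL_WORD_BITS; bit++)
			if (!(t->open_fds[w] & (1UL << bit)))
				return w * MODEL_WORD_BITS + bit;
	}
	return MODEL_MAX_FDS;
}
```

In the close(3); open(...) pattern quoted above, the next search after descriptor 3 is reused consults the summary bits first and steps over every full word in one test each.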
      
      On his benchmark, this improves performance by up to an order of
      magnitude, by avoiding the excessive open file bitmap scanning.
      Tested-and-acked-by: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 28 Oct, 2015 2 commits
    • T
      fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs() · b33e18f6
      Tejun Heo committed
      bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
      to walk @bdi->wb_list.  To set up the initial iteration
      condition, it uses list_entry_rcu() to calculate the entry
      pointer corresponding to the list head; however, this isn't an
      actual RCU dereference and using list_entry_rcu() for it ended
      up breaking a proposed list_entry_rcu() change because it was
      feeding a non-lvalue pointer into the macro.
      
      Don't use the RCU variant for simple pointer offsetting.  Use
      list_entry() instead.
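
The distinction can be illustrated with a user-space model of an intrusive list. Everything here is a sketch: the `model_` types and macro are hypothetical stand-ins, there is no RCU at all, and the loop is a hand-written equivalent of a "continue" style walk. It shows that turning the list head into an entry pointer is plain pointer offsetting, so no dereference-style accessor is needed for that step.

```c
#include <assert.h>
#include <stddef.h>

struct model_list_head {
	struct model_list_head *next, *prev;
};

/* Pure pointer offsetting: member address back to containing struct. */
#define model_list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct model_wb {
	int id;
	struct model_list_head node;
};

struct model_bdi {
	struct model_list_head wb_list;
};

static void model_list_init(struct model_list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void model_list_add_tail(struct model_list_head *item,
				struct model_list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Count entries, starting the walk from a cursor derived from the head. */
static int model_count_wbs(struct model_bdi *bdi)
{
	struct model_wb *wb;
	int n = 0;

	assert(bdi != NULL);
	/*
	 * The cursor is not a real model_wb; only its node member, which
	 * aliases the list head, is ever touched through it.
	 */
	wb = model_list_entry(&bdi->wb_list, struct model_wb, node);
	for (;;) {
		wb = model_list_entry(wb->node.next, struct model_wb, node);
		if (&wb->node == &bdi->wb_list)
			break;	/* back at the head: walk is complete */
		n++;
	}
	return n;
}
```

The initial `model_list_entry()` call takes the address of the head, which is not an lvalue, and that is the kind of argument the commit says the RCU variant was not meant to receive.
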
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Patrick Marlier <patrick.marlier@gmail.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pranith kumar <bobby.prani@gmail.com>
      Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • D
      namei: permit linking with CAP_FOWNER in userns · f2ca3796
      Dirk Steinmetz committed
      Attempting to hardlink to an unsafe file (e.g. a setuid binary) from
      within an unprivileged user namespace fails, even if CAP_FOWNER is held
      within the namespace. This may cause various failures, such as a Gentoo
      installation within an LXC container failing to build and install specific
      packages.
      
      This change permits hardlinking of files owned by mapped uids, if
      CAP_FOWNER is held for that namespace. Furthermore, it improves consistency
      by using the existing inode_owner_or_capable(), which is aware of
      namespaced capabilities as of 23adbe12 ("fs,userns: Change
      inode_capable to capable_wrt_inode_uidgid").
      Signed-off-by: Dirk Steinmetz <public@rsjtdrjgfuzkfg.com>
      
      This is hitting us in Ubuntu during some dpkg upgrades in containers.
      When upgrading a file dpkg creates a hard link to the old file to back
      it up before overwriting it. When packages upgrade suid files owned by a
      non-root user the link isn't permitted, and the package upgrade fails.
      This patch fixes our problem.
      Tested-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  8. 23 Oct, 2015 8 commits
  9. 22 Oct, 2015 4 commits