1. 10 8月, 2016 1 次提交
    • V
      mm: memcontrol: only mark charged pages with PageKmemcg · c4159a75
      Vladimir Davydov 提交于
      To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
      which sets page->_mapcount to -512.  Currently, we set/clear PageKmemcg
      in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
      with __GFP_ACCOUNT, including those that aren't actually charged to any
      cgroup, i.e. allocated from the root cgroup context.  To avoid overhead
      in case cgroups are not used, we only do that if memcg_kmem_enabled() is
      true.  The latter is set iff there are kmem-enabled memory cgroups
      (online or offline).  The root cgroup is not considered kmem-enabled.
      
      As a result, if a page is allocated with __GFP_ACCOUNT for the root
      cgroup when there are kmem-enabled memory cgroups and is freed after all
      kmem-enabled memory cgroups were removed, e.g.
      
        # no memory cgroups has been created yet, create one
        mkdir /sys/fs/cgroup/memory/test
        # run something allocating pages with __GFP_ACCOUNT, e.g.
        # a program using pipe
        dmesg | tail
        # remove the memory cgroup
        rmdir /sys/fs/cgroup/memory/test
      
      we'll get bad page state bug complaining about page->_mapcount != -1:
      
        BUG: Bad page state in process swapper/0  pfn:1fd945c
        page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
        flags: 0x1000000000000000()
      
      To avoid that, let's mark with PageKmemcg only those pages that are
      actually charged to and hence pin a non-root memory cgroup.
      
      Fixes: 4949148a ("mm: charge/uncharge kmemcg from generic page allocator paths")
      Reported-and-tested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4159a75
  2. 27 7月, 2016 1 次提交
    • V
      pipe: account to kmemcg · d86133bd
      Vladimir Davydov 提交于
      Pipes can consume a significant amount of system memory, hence they
      should be accounted to kmemcg.
      
      This patch marks pipe_inode_info and anonymous pipe buffer page
      allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
      Note, since a pipe buffer page can be "stolen" and get reused for other
      purposes, including mapping to userspace, we clear PageKmemcg thus
      resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
      is introduced by this patch.
      
      A note regarding anon_pipe_buf_steal implementation.  We allow to steal
      the page if its ref count equals 1.  It looks racy, but it is correct
      for anonymous pipe buffer pages, because:
      
       - We lock out all other pipe users, because ->steal is called with
         pipe_lock held, so the page can't be spliced to another pipe from
         under us.
      
       - The page is not on LRU and it never was.
      
       - Thus a parallel thread can access it only by PFN. Although this is
         quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
         this is not dangerous, because all such functions do is increase page
         ref count, check if the page is the one they are looking for, and
         decrease ref count if it isn't. Since our page is clean except for
         PageKmemcg mark, which doesn't conflict with other _mapcount users,
         the worst that can happen is we see page_count > 2 due to a transient
         ref, in which case we false-positively abort ->steal, which is still
         fine, because ->steal is not guaranteed to succeed.
      
      Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanzaSigned-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d86133bd
  3. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  4. 20 1月, 2016 1 次提交
    • W
      pipe: limit the per-user amount of pages allocated in pipes · 759c0114
      Willy Tarreau 提交于
      On no-so-small systems, it is possible for a single process to cause an
      OOM condition by filling large pipes with data that are never read. A
      typical process filling 4000 pipes with 1 MB of data will use 4 GB of
      memory. On small systems it may be tricky to set the pipe max size to
      prevent this from happening.
      
      This patch makes it possible to enforce a per-user soft limit above
      which new pipes will be limited to a single page, effectively limiting
      them to 4 kB each, as well as a hard limit above which no new pipes may
      be created for this user. This has the effect of protecting the system
      against memory abuse without hurting other users, and still allowing
      pipes to work correctly though with less data at once.
      
      The limit are controlled by two new sysctls : pipe-user-pages-soft, and
      pipe-user-pages-hard. Both may be disabled by setting them to zero. The
      default soft limit allows the default number of FDs per process (1024)
      to create pipes of the default size (64kB), thus reaching a limit of 64MB
      before starting to create only smaller pipes. With 256 processes limited
      to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
      1084 MB of memory allocated for a user. The hard limit is disabled by
      default to avoid breaking existing applications that make intensive use
      of pipes (eg: for splicing).
      
      Reported-by: socketpair@gmail.com
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Mitigates: CVE-2013-4312 (Linux 2.0+)
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      759c0114
  5. 11 11月, 2015 2 次提交
    • E
      fs/pipe.c: return error code rather than 0 in pipe_write() · 6ae08069
      Eric Biggers 提交于
      pipe_write() would return 0 if it failed to merge the beginning of the
      data to write with the last, partially filled pipe buffer.  It should
      return an error code instead.  Userspace programs could be confused by
      write() returning 0 when called with a nonzero 'count'.
      
      The EFAULT error case was a regression from f0d1bec9 ("new helper:
      copy_page_from_iter()"), while the ops->confirm() error case was a much
      older bug.
      
      Test program:
      
      	#include <assert.h>
      	#include <errno.h>
      	#include <unistd.h>
      
      	int main(void)
      	{
      		int fd[2];
      		char data[1] = {0};
      
      		assert(0 == pipe(fd));
      		assert(1 == write(fd[1], data, 1));
      
      		/* prior to this patch, write() returned 0 here  */
      		assert(-1 == write(fd[1], NULL, 1));
      		assert(errno == EFAULT);
      	}
      
      Cc: stable@vger.kernel.org # at least v3.15+
      Signed-off-by: NEric Biggers <ebiggers3@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6ae08069
    • E
      fs/pipe.c: preserve alloc_file() error code · e9bb1f9b
      Eric Biggers 提交于
      If sys_pipe() was unable to allocate a 'struct file', it always failed
      with ENFILE, which means "The number of simultaneously open files in the
      system would exceed a system-imposed limit." However, alloc_file()
      actually returns an ERR_PTR value and might fail with other error codes.
      Currently, in addition to ENFILE, it can fail with ENOMEM, potentially
      when there are few open files in the system.  Update sys_pipe() to
      preserve this error code.
      
      In a prior submission of a similar patch (1) some concern was raised
      about introducing a new error code for sys_pipe().  However, for most
      system calls, programs cannot assume that new error codes will never be
      introduced.  In addition, ENOMEM was, in fact, already a possible error
      code for sys_pipe(), in the case where the file descriptor table could
      not be expanded due to insufficient memory.
      
      	(1) http://comments.gmane.org/gmane.linux.kernel/1357942Signed-off-by: NEric Biggers <ebiggers3@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e9bb1f9b
  6. 16 4月, 2015 1 次提交
  7. 12 4月, 2015 1 次提交
  8. 26 3月, 2015 1 次提交
  9. 07 5月, 2014 3 次提交
  10. 02 4月, 2014 2 次提交
  11. 24 1月, 2014 1 次提交
  12. 03 12月, 2013 1 次提交
    • L
      vfs: fix subtle use-after-free of pipe_inode_info · b0d8d229
      Linus Torvalds 提交于
      The pipe code was trying (and failing) to be very careful about freeing
      the pipe info only after the last access, with a pattern like:
      
              spin_lock(&inode->i_lock);
              if (!--pipe->files) {
                      inode->i_pipe = NULL;
                      kill = 1;
              }
              spin_unlock(&inode->i_lock);
              __pipe_unlock(pipe);
              if (kill)
                      free_pipe_info(pipe);
      
      where the final freeing is done last.
      
      HOWEVER.  The above is actually broken, because while the freeing is
      done at the end, if we have two racing processes releasing the pipe
      inode info, the one that *doesn't* free it will decrement the ->files
      count, and unlock the inode i_lock, but then still use the
      "pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)".
      
      This is *very* hard to trigger in practice, since the race window is
      very small, and adding debug options seems to just hide it by slowing
      things down.
      
      Simon originally reported this way back in July as an Oops in
      kmem_cache_allocate due to a single bit corruption (due to the final
      "spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different
      allocation that had re-used the free'd pipe-info), it's taken this long
      to figure out.
      
      Since the 'pipe->files' accesses aren't even protected by the pipe lock
      (we very much use the inode lock for that), the simple solution is to
      just drop the pipe lock early.  And since there were two users of this
      pattern, create a helper function for it.
      
      Introduced commit ba5bb147 ("pipe: take allocation and freeing of
      pipe_inode_info out of ->i_mutex").
      Reported-by: NSimon Kirby <sim@hostway.ca>
      Reported-by: NIan Applegate <ia@cloudflare.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@kernel.org   # v3.10+
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b0d8d229
  13. 08 5月, 2013 1 次提交
  14. 10 4月, 2013 11 次提交
  15. 12 3月, 2013 1 次提交
    • A
      vfs: fix pipe counter breakage · a930d879
      Al Viro 提交于
      If you open a pipe for neither read nor write, the pipe code will not
      add any usage counters to the pipe, causing the 'struct pipe_inode_info"
      to be potentially released early.
      
      That doesn't normally matter, since you cannot actually use the pipe,
      but the pipe release code - particularly fasync handling - still expects
      the actual pipe infrastructure to all be there.  And rather than adding
      NULL pointer checks, let's just disallow this case, the same way we
      already do for the named pipe ("fifo") case.
      
      This is ancient going back to pre-2.4 days, and until trinity, nobody
      naver noticed.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a930d879
  16. 23 2月, 2013 2 次提交
  17. 27 9月, 2012 1 次提交
  18. 30 7月, 2012 1 次提交
  19. 24 7月, 2012 1 次提交
  20. 02 6月, 2012 1 次提交
    • J
      fs: introduce inode operation ->update_time · c3b2da31
      Josef Bacik 提交于
      Btrfs has to make sure we have space to allocate new blocks in order to modify
      the inode, so updating time can fail.  We've gotten around this by having our
      own file_update_time but this is kind of a pain, and Christoph has indicated he
      would like to make xfs do something different with atime updates.  So introduce
      ->update_time, where we will deal with i_version an a/m/c time updates and
      indicate which changes need to be made.  The normal version just does what it
      has always done, updates the time and marks the inode dirty, and then
      filesystems can choose to do something different.
      
      I've gone through all of the users of file_update_time and made them check for
      errors with the exception of the fault code since it's complicated and I wasn't
      quite sure what to do there, also Jan is going to be pushing the file time
      updates into page_mkwrite for those who have it so that should satisfy btrfs and
      make it not a big deal to check the file_update_time() return code in the
      generic fault path. Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      c3b2da31
  21. 01 6月, 2012 1 次提交
  22. 31 5月, 2012 1 次提交
    • W
      pipe: return -ENOIOCTLCMD instead of -EINVAL on unknown ioctl command · 46ce341b
      Will Deacon 提交于
      As described in commit 07d106d0 ("vfs: fix up ENOIOCTLCMD error
      handling"), drivers should return -ENOIOCTLCMD if they receive an ioctl
      command which they don't understand. Doing so will result in -ENOTTY
      being returned to userspace, which matches the behaviour of the compat
      layer if it fails to translate an ioctl command.
      
      This patch fixes the pipe ioctl to return -ENOIOCTLCMD instead of
      -EINVAL when passed an unknown ioctl command.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      46ce341b
  23. 30 4月, 2012 1 次提交
    • L
      pipes: add a "packetized pipe" mode for writing · 9883035a
      Linus Torvalds 提交于
      The actual internal pipe implementation is already really about
      individual packets (called "pipe buffers"), and this simply exposes that
      as a special packetized mode.
      
      When we are in the packetized mode (marked by O_DIRECT as suggested by
      Alan Cox), a write() on a pipe will not merge the new data with previous
      writes, so each write will get a pipe buffer of its own.  The pipe
      buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
      will tell the reader side to break the read at that boundary (and throw
      away any partial packet contents that do not fit in the read buffer).
      
      End result: as long as you do writes less than PIPE_BUF in size (so that
      the pipe doesn't have to split them up), you can now treat the pipe as a
      packet interface, where each read() system call will read one packet at
      a time.  You can just use a sufficiently big read buffer (PIPE_BUF is
      sufficient, since bigger than that doesn't guarantee atomicity anyway),
      and the return value of the read() will naturally give you the size of
      the packet.
      
      NOTE! We do not support zero-sized packets, and zero-sized reads and
      writes to a pipe continue to be no-ops.  Also note that big packets will
      currently be split at write time, but that the size at which that
      happens is not really specified (except that it's bigger than PIPE_BUF).
      Currently that limit is the system page size, but we might want to
      explicitly support bigger packets some day.
      
      The main user for this is going to be the autofs packet interface,
      allowing us to stop having to care so deeply about exact packet sizes
      (which have had bugs with 32/64-bit compatibility modes).  But user
      space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
      fail with an EINVAL on kernels that do not support this interface.
      Tested-by: NMichael Tokarev <mjt@tls.msk.ru>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Miller <davem@davemloft.net>
      Cc: Ian Kent <raven@themaw.net>
      Cc: Thomas Meyer <thomas@m3y3r.de>
      Cc: stable@kernel.org  # needed for systemd/autofs interaction fix
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9883035a
  24. 24 3月, 2012 1 次提交
  25. 20 3月, 2012 1 次提交