1. 09 8月, 2014 14 次提交
  2. 07 8月, 2014 2 次提交
    • P
      mm: softdirty: respect VM_SOFTDIRTY in PTE holes · 68b5a652
      Peter Feiner 提交于
      After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap
      should report that the VMA's virtual pages are soft-dirty until
      VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to
      /proc/pid/clear_refs).  However, pagemap ignores the VM_SOFTDIRTY flag
      for virtual addresses that fall in PTE holes (i.e., virtual addresses
      that don't have a PMD, PUD, or PGD allocated yet).
      
      To observe this bug, use mmap to create a VMA large enough such that
      there's a good chance that the VMA will occupy an unused PMD, then test
      the soft-dirty bit on its pages.  In practice, I found that a VMA that
      covered a PMD's worth of address space was big enough.
      
      This patch adds the necessary VMA lookup to the PTE hole callback in
      /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs'
      VM_SOFTDIRTY flag.
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Acked-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68b5a652
    • R
      mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces · cc7452b6
      Rafael Aquini 提交于
      Historically, we exported shared pages to userspace via sysinfo(2)
      sharedram and /proc/meminfo's "MemShared" fields.  With the advent of
      tmpfs, from kernel v2.4 onward, that old way for accounting shared mem
      was deemed inaccurate and we started to export a hard-coded 0 for
      sysinfo.sharedram.  Later on, during the 2.6 timeframe, "MemShared" got
      re-introduced to /proc/meminfo re-branded as "Shmem", but we're still
      reporting sysinfo.sharedmem as that old hard-coded zero, which makes the
      "shared memory" report inconsistent across interfaces.
      
      This patch leverages the addition of explicit accounting for pages used
      by shmem/tmpfs -- "4b02108a mm: oom analysis: add shmem vmstat" -- in
      order to make the users of sysinfo(2) and si_meminfo*() friends aware of
      that vmstat entry and make them report it consistently across the
      interfaces, as well to make sysinfo(2) returned data consistent with our
      current API documentation states.
      Signed-off-by: NRafael Aquini <aquini@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc7452b6
  3. 24 7月, 2014 2 次提交
    • E
      CAPABILITIES: remove undefined caps from all processes · 7d8b6c63
      Eric Paris 提交于
      This is effectively a revert of 7b9a7ec5
      plus fixing it a different way...
      
      We found, when trying to run an application from an application which
      had dropped privs that the kernel does security checks on undefined
      capability bits.  This was ESPECIALLY difficult to debug as those
      undefined bits are hidden from /proc/$PID/status.
      
      Consider a root application which drops all capabilities from ALL 4
      capability sets.  We assume, since the application is going to set
      eff/perm/inh from an array that it will clear not only the defined caps
      less than CAP_LAST_CAP, but also the higher 28ish bits which are
      undefined future capabilities.
      
      The BSET gets cleared differently.  Instead it is cleared one bit at a
      time.  The problem here is that in security/commoncap.c::cap_task_prctl()
      we actually check the validity of a capability being read.  So any task
      which attempts to 'read all things set in bset' followed by 'unset all
      things set in bset' will not even attempt to unset the undefined bits
      higher than CAP_LAST_CAP.
      
      So the 'parent' will look something like:
      CapInh:	0000000000000000
      CapPrm:	0000000000000000
      CapEff:	0000000000000000
      CapBnd:	ffffffc000000000
      
      All of this 'should' be fine.  Given that these are undefined bits that
      aren't supposed to have anything to do with permissions.  But they do...
      
      So lets now consider a task which cleared the eff/perm/inh completely
      and cleared all of the valid caps in the bset (but not the invalid caps
      it couldn't read out of the kernel).  We know that this is exactly what
      the libcap-ng library does and what the go capabilities library does.
      They both leave you in that above situation if you try to clear all of
      you capapabilities from all 4 sets.  If that root task calls execve()
      the child task will pick up all caps not blocked by the bset.  The bset
      however does not block bits higher than CAP_LAST_CAP.  So now the child
      task has bits in eff which are not in the parent.  These are
      'meaningless' undefined bits, but still bits which the parent doesn't
      have.
      
      The problem is now in cred_cap_issubset() (or any operation which does a
      subset test) as the child, while a subset for valid cap bits, is not a
      subset for invalid cap bits!  So now we set durring commit creds that
      the child is not dumpable.  Given it is 'more priv' than its parent.  It
      also means the parent cannot ptrace the child and other stupidity.
      
      The solution here:
      1) stop hiding capability bits in status
      	This makes debugging easier!
      
      2) stop giving any task undefined capability bits.  it's simple, it you
      don't put those invalid bits in CAP_FULL_SET you won't get them in init
      and you won't get them in any other task either.
      	This fixes the cap_issubset() tests and resulting fallout (which
      	made the init task in a docker container untraceable among other
      	things)
      
      3) mask out undefined bits when sys_capset() is called as it might use
      ~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
      	This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.
      
      4) mask out undefined bit when we read a file capability off of disk as
      again likely all bits are set in the xattr for forward/backward
      compatibility.
      	This lets 'setcap all+pe /bin/bash; /bin/bash' run
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Andrew G. Morgan <morgan@kernel.org>
      Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Steve Grubb <sgrubb@redhat.com>
      Cc: Dan Walsh <dwalsh@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJames Morris <james.l.morris@oracle.com>
      7d8b6c63
    • T
      sched: Make task->real_start_time nanoseconds based · 57e0be04
      Thomas Gleixner 提交于
      Simplify the only user of this data by removing the timespec
      conversion.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      57e0be04
  4. 04 7月, 2014 1 次提交
    • H
      /proc/stat: convert to single_open_size() · f74373a5
      Heiko Carstens 提交于
      These two patches are supposed to "fix" failed order-4 memory
      allocations which have been observed when reading /proc/stat.  The
      problem has been observed on s390 as well as on x86.
      
      To address the problem change the seq_file memory allocations to
      fallback to use vmalloc, so that allocations also work if memory is
      fragmented.
      
      This approach seems to be simpler and less intrusive than changing
      /proc/stat to use an interator.  Also it "fixes" other users as well,
      which use seq_file's single_open() interface.
      
      This patch (of 2):
      
      Use seq_file's single_open_size() to preallocate a buffer that is large
      enough to hold the whole output, instead of open coding it.  Also
      calculate the requested size using the number of online cpus instead of
      possible cpus, since the size of the output only depends on the number
      of online cpus.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Ian Kent <raven@themaw.net>
      Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
      Cc: Andrea Righi <andrea@betterlinux.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Stefan Bader <stefan.bader@canonical.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f74373a5
  5. 07 6月, 2014 3 次提交
  6. 05 6月, 2014 1 次提交
  7. 21 5月, 2014 1 次提交
  8. 08 4月, 2014 10 次提交
  9. 04 4月, 2014 1 次提交
    • J
      mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner 提交于
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91b0abe3
  10. 02 4月, 2014 1 次提交
  11. 20 3月, 2014 1 次提交
  12. 13 3月, 2014 2 次提交
    • F
      cputime: Default implementation of nsecs -> cputime conversion · bfc3f028
      Frederic Weisbecker 提交于
      The architectures that override cputime_t (s390, ppc) don't provide
      any version of nsecs_to_cputime(). Indeed this cputime_t implementation
      by backend only happens when CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y under
      which the core code doesn't make any use of nsecs_to_cputime().
      
      At least for now.
      
      We are going to make a broader use of it so lets provide a default
      version with a per usecs granularity. It should be good enough for most
      usecases.
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      bfc3f028
    • T
      fs: push sync_filesystem() down to the file system's remount_fs() · 02b9984d
      Theodore Ts'o 提交于
      Previously, the no-op "mount -o mount /dev/xxx" operation when the
      file system is already mounted read-write causes an implied,
      unconditional syncfs().  This seems pretty stupid, and it's certainly
      documented or guaraunteed to do this, nor is it particularly useful,
      except in the case where the file system was mounted rw and is getting
      remounted read-only.
      
      However, it's possible that there might be some file systems that are
      actually depending on this behavior.  In most file systems, it's
      probably fine to only call sync_filesystem() when transitioning from
      read-write to read-only, and there are some file systems where this is
      not needed at all (for example, for a pseudo-filesystem or something
      like romfs).
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Evgeniy Dushistov <dushistov@mail.ru>
      Cc: Jan Kara <jack@suse.cz>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Anders Larsen <al@alarsen.net>
      Cc: Phillip Lougher <phillip@squashfs.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Cc: Petr Vandrovec <petr@vandrovec.name>
      Cc: xfs@oss.sgi.com
      Cc: linux-btrfs@vger.kernel.org
      Cc: linux-cifs@vger.kernel.org
      Cc: samba-technical@lists.samba.org
      Cc: codalist@coda.cs.cmu.edu
      Cc: linux-ext4@vger.kernel.org
      Cc: linux-f2fs-devel@lists.sourceforge.net
      Cc: fuse-devel@lists.sourceforge.net
      Cc: cluster-devel@redhat.com
      Cc: linux-mtd@lists.infradead.org
      Cc: jfs-discussion@lists.sourceforge.net
      Cc: linux-nfs@vger.kernel.org
      Cc: linux-nilfs@vger.kernel.org
      Cc: linux-ntfs-dev@lists.sourceforge.net
      Cc: ocfs2-devel@oss.oracle.com
      Cc: reiserfs-devel@vger.kernel.org
      02b9984d
  13. 12 3月, 2014 1 次提交
    • G
      of: remove /proc/device-tree · 8357041a
      Grant Likely 提交于
      The same data is now available in sysfs, so we can remove the code
      that exports it in /proc and replace it with a symlink to the sysfs
      version.
      
      Tested on versatile qemu model and mpc5200 eval board. More testing
      would be appreciated.
      
      v5: Fixed up conflicts with mainline changes
      Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>
      Cc: Rob Herring <rob.herring@calxeda.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Pantelis Antoniou <panto@antoniou-consulting.com>
      8357041a