1. 09 Aug, 2014 1 commit
  2. 07 Aug, 2014 2 commits
    • mm: softdirty: respect VM_SOFTDIRTY in PTE holes · 68b5a652
      Peter Feiner authored
      After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap
      should report that the VMA's virtual pages are soft-dirty until
      VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to
      /proc/pid/clear_refs).  However, pagemap ignores the VM_SOFTDIRTY flag
      for virtual addresses that fall in PTE holes (i.e., virtual addresses
      that don't have a PMD, PUD, or PGD allocated yet).
      
      To observe this bug, use mmap to create a VMA large enough such that
      there's a good chance that the VMA will occupy an unused PMD, then test
      the soft-dirty bit on its pages.  In practice, I found that a VMA that
      covered a PMD's worth of address space was big enough.
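
      For illustration, a minimal userspace sketch of such a check (not part
      of the patch; the 2 MB length assumes an x86-64 PMD, and bit 55 is the
      soft-dirty bit documented in Documentation/vm/pagemap.txt):

        /* Map a PMD-sized anonymous region, never touch it, and read the
         * pagemap entry of its first page; with VM_SOFTDIRTY honoured in
         * PTE holes, the soft-dirty bit (bit 55) should read back as 1. */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            uint64_t entry;
            void *p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            int fd = open("/proc/self/pagemap", O_RDONLY);
            off_t off = ((uintptr_t)p / sysconf(_SC_PAGESIZE)) * sizeof(entry);

            if (p == MAP_FAILED || fd < 0 ||
                pread(fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry)) {
                perror("pagemap check");
                return 1;
            }
            printf("soft-dirty: %d\n", (int)((entry >> 55) & 1));
            return 0;
        }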
      
      This patch adds the necessary VMA lookup to the PTE hole callback in
      /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs'
      VM_SOFTDIRTY flag.
      Signed-off-by: Peter Feiner <pfeiner@google.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces · cc7452b6
      Rafael Aquini authored
      Historically, we exported shared pages to userspace via sysinfo(2)
      sharedram and /proc/meminfo's "MemShared" fields.  With the advent of
      tmpfs, from kernel v2.4 onward, that old way for accounting shared mem
      was deemed inaccurate and we started to export a hard-coded 0 for
      sysinfo.sharedram.  Later on, during the 2.6 timeframe, "MemShared" got
      re-introduced to /proc/meminfo, re-branded as "Shmem", but we're still
      reporting sysinfo.sharedram as that old hard-coded zero, which makes the
      "shared memory" report inconsistent across interfaces.
      
      This patch leverages the addition of explicit accounting for pages used
      by shmem/tmpfs -- "4b02108a mm: oom analysis: add shmem vmstat" -- in
      order to make the users of sysinfo(2) and the si_meminfo*() friends aware
      of that vmstat entry and make them report it consistently across the
      interfaces, as well as to make the data returned by sysinfo(2) consistent
      with what our current API documentation states.
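
      An illustrative userspace check of the new behaviour (not from the
      patch): with this change, sysinfo(2)'s sharedram should line up with
      the "Shmem:" line in /proc/meminfo.

        /* Print shared memory as reported by sysinfo(2), in kilobytes,
         * for comparison against the Shmem field of /proc/meminfo. */
        #include <stdio.h>
        #include <sys/sysinfo.h>

        int main(void)
        {
            struct sysinfo si;

            if (sysinfo(&si) != 0) {
                perror("sysinfo");
                return 1;
            }
            printf("sharedram: %llu kB\n",
                   (unsigned long long)si.sharedram * si.mem_unit / 1024);
            return 0;
        }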
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 24 Jul, 2014 2 commits
    • CAPABILITIES: remove undefined caps from all processes · 7d8b6c63
      Eric Paris authored
      This is effectively a revert of 7b9a7ec5
      plus fixing it a different way...
      
      We found, when trying to run an application from another application
      which had dropped privs, that the kernel does security checks on
      undefined capability bits.  This was ESPECIALLY difficult to debug as
      those undefined bits are hidden from /proc/$PID/status.
      
      Consider a root application which drops all capabilities from ALL 4
      capability sets.  We assume that, since the application is going to set
      eff/perm/inh from an array, it will clear not only the defined caps less
      than CAP_LAST_CAP, but also the higher 28-ish bits which are undefined
      future capabilities.
      
      The BSET gets cleared differently.  Instead it is cleared one bit at a
      time.  The problem here is that in security/commoncap.c::cap_task_prctl()
      we actually check the validity of a capability being read.  So any task
      which attempts to 'read all things set in bset' followed by 'unset all
      things set in bset' will not even attempt to unset the undefined bits
      higher than CAP_LAST_CAP.
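
      For illustration, the bounding-set clearing loop described above looks
      roughly like this (a sketch of what libcap-ng effectively does, not code
      from this patch); bits above CAP_LAST_CAP are never dropped because the
      kernel never reports them as readable:

        #include <linux/capability.h>
        #include <sys/prctl.h>

        static void drop_bounding_set(void)
        {
            int cap;

            /* Walk only the capabilities the kernel says are valid;
             * undefined bits above CAP_LAST_CAP are left untouched. */
            for (cap = 0; cap <= CAP_LAST_CAP; cap++) {
                if (prctl(PR_CAPBSET_READ, cap, 0, 0, 0) == 1)
                    prctl(PR_CAPBSET_DROP, cap, 0, 0, 0);
            }
        }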
      
      So the 'parent' will look something like:
      CapInh:	0000000000000000
      CapPrm:	0000000000000000
      CapEff:	0000000000000000
      CapBnd:	ffffffc000000000
      
      All of this 'should' be fine, given that these are undefined bits that
      aren't supposed to have anything to do with permissions.  But they do...
      
      So let's now consider a task which cleared the eff/perm/inh completely
      and cleared all of the valid caps in the bset (but not the invalid caps
      it couldn't read out of the kernel).  We know that this is exactly what
      the libcap-ng library does and what the go capabilities library does.
      They both leave you in that above situation if you try to clear all of
      your capabilities from all 4 sets.  If that root task calls execve()
      the child task will pick up all caps not blocked by the bset.  The bset
      however does not block bits higher than CAP_LAST_CAP.  So now the child
      task has bits in eff which are not in the parent.  These are
      'meaningless' undefined bits, but still bits which the parent doesn't
      have.
      
      The problem is now in cred_cap_issubset() (or any operation which does a
      subset test), as the child, while a subset for valid cap bits, is not a
      subset for invalid cap bits!  So now, during commit_creds, we mark the
      child as not dumpable, given that it is 'more priv' than its parent.  It
      also means the parent cannot ptrace the child, among other stupidity.
      
      The solution here:
      1) stop hiding capability bits in status
      	This makes debugging easier!
      
      2) stop giving any task undefined capability bits.  it's simple: if you
      don't put those invalid bits in CAP_FULL_SET you won't get them in init
      and you won't get them in any other task either.
      	This fixes the cap_issubset() tests and resulting fallout (which
      	made the init task in a docker container untraceable among other
      	things)
      
      3) mask out undefined bits when sys_capset() is called as it might use
      ~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
      	This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.
      
      4) mask out undefined bit when we read a file capability off of disk as
      again likely all bits are set in the xattr for forward/backward
      compatibility.
      	This lets 'setcap all+pe /bin/bash; /bin/bash' run
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Andrew G. Morgan <morgan@kernel.org>
      Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Steve Grubb <sgrubb@redhat.com>
      Cc: Dan Walsh <dwalsh@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: James Morris <james.l.morris@oracle.com>
    • sched: Make task->real_start_time nanoseconds based · 57e0be04
      Thomas Gleixner authored
      Simplify the only user of this data by removing the timespec
      conversion.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
  4. 04 Jul, 2014 1 commit
    • /proc/stat: convert to single_open_size() · f74373a5
      Heiko Carstens authored
      These two patches are supposed to "fix" failed order-4 memory
      allocations which have been observed when reading /proc/stat.  The
      problem has been observed on s390 as well as on x86.
      
      To address the problem change the seq_file memory allocations to
      fallback to use vmalloc, so that allocations also work if memory is
      fragmented.
      
      This approach seems to be simpler and less intrusive than changing
      /proc/stat to use an iterator.  It also "fixes" other users of
      seq_file's single_open() interface.
      
      This patch (of 2):
      
      Use seq_file's single_open_size() to preallocate a buffer that is large
      enough to hold the whole output, instead of open coding it.  Also
      calculate the requested size using the number of online cpus instead of
      possible cpus, since the size of the output only depends on the number
      of online cpus.
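
      A rough sketch of the resulting open routine (sizing constants are
      illustrative; show_stat is assumed to be the existing seq_file show
      callback in fs/proc/stat.c, and single_open_size() takes the same
      arguments as single_open() plus the buffer size):

        #include <linux/cpumask.h>
        #include <linux/fs.h>
        #include <linux/irqnr.h>
        #include <linux/proc_fs.h>
        #include <linux/seq_file.h>

        static int stat_open(struct inode *inode, struct file *file)
        {
            /* Size the buffer from the online CPU count so the whole
             * report fits in one preallocated buffer. */
            size_t size = 1024 + 128 * num_online_cpus();

            /* leave room for the per-interrupt counters as well */
            size += 2 * nr_irqs;

            return single_open_size(file, show_stat, NULL, size);
        }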
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Ian Kent <raven@themaw.net>
      Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
      Cc: Andrea Righi <andrea@betterlinux.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Stefan Bader <stefan.bader@canonical.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 07 Jun, 2014 3 commits
  6. 05 Jun, 2014 1 commit
  7. 21 May, 2014 1 commit
  8. 08 Apr, 2014 10 commits
  9. 04 Apr, 2014 1 commit
    • mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner authored
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
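
      As a sketch of how the flag is meant to be used (helper names assumed
      for illustration), the inode freeing code would mark the mapping before
      the final truncate and reclaim would test it before installing a shadow
      entry:

        #include <linux/bitops.h>
        #include <linux/fs.h>
        #include <linux/pagemap.h>

        /* Illustrative helpers for the AS_EXITING address_space flag. */
        static inline void mapping_set_exiting(struct address_space *mapping)
        {
            set_bit(AS_EXITING, &mapping->flags);
        }

        static inline int mapping_exiting(struct address_space *mapping)
        {
            return test_bit(AS_EXITING, &mapping->flags);
        }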
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 02 Apr, 2014 1 commit
  11. 20 Mar, 2014 1 commit
  12. 13 Mar, 2014 2 commits
    • cputime: Default implementation of nsecs -> cputime conversion · bfc3f028
      Frederic Weisbecker authored
      The architectures that override cputime_t (s390, ppc) don't provide
      any version of nsecs_to_cputime().  Indeed, this per-backend cputime_t
      implementation is only in effect when CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y,
      under which the core code doesn't make any use of nsecs_to_cputime().
      
      At least for now.
      
      We are going to make broader use of it, so let's provide a default
      version with per-microsecond granularity.  It should be good enough for
      most use cases.
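
      A minimal sketch of such a fallback (exact placement and spelling
      assumed; it would live next to the other cputime definitions, where
      usecs_to_cputime() and NSEC_PER_USEC are already available, and simply
      defines nsecs_to_cputime() when no architecture provides its own):

        /* Default nsecs -> cputime conversion, microsecond granularity. */
        #ifndef nsecs_to_cputime
        # define nsecs_to_cputime(__nsecs)  \
            usecs_to_cputime((__nsecs) / NSEC_PER_USEC)
        #endif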
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    • fs: push sync_filesystem() down to the file system's remount_fs() · 02b9984d
      Theodore Ts'o authored
      Previously, the no-op "mount -o remount /dev/xxx" operation when the
      file system is already mounted read-write causes an implied,
      unconditional syncfs().  This seems pretty stupid, and it's certainly
      not documented or guaranteed to do this, nor is it particularly useful,
      except in the case where the file system was mounted rw and is getting
      remounted read-only.
      
      However, it's possible that there might be some file systems that are
      actually depending on this behavior.  In most file systems, it's
      probably fine to only call sync_filesystem() when transitioning from
      read-write to read-only, and there are some file systems where this is
      not needed at all (for example, for a pseudo-filesystem or something
      like romfs).
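
      For a file system that does want the old behaviour, the change amounts
      to calling sync_filesystem() at the top of its own remount handler; a
      hedged sketch for a hypothetical "examplefs":

        #include <linux/fs.h>

        /* Hypothetical remount handler: keep the previously implicit sync
         * by doing it explicitly, then apply the new mount options. */
        static int examplefs_remount(struct super_block *sb, int *flags,
                                     char *data)
        {
            sync_filesystem(sb);

            /* ... parse options in 'data' and update in-core state ... */
            return 0;
        }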
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Evgeniy Dushistov <dushistov@mail.ru>
      Cc: Jan Kara <jack@suse.cz>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Anders Larsen <al@alarsen.net>
      Cc: Phillip Lougher <phillip@squashfs.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Cc: Petr Vandrovec <petr@vandrovec.name>
      Cc: xfs@oss.sgi.com
      Cc: linux-btrfs@vger.kernel.org
      Cc: linux-cifs@vger.kernel.org
      Cc: samba-technical@lists.samba.org
      Cc: codalist@coda.cs.cmu.edu
      Cc: linux-ext4@vger.kernel.org
      Cc: linux-f2fs-devel@lists.sourceforge.net
      Cc: fuse-devel@lists.sourceforge.net
      Cc: cluster-devel@redhat.com
      Cc: linux-mtd@lists.infradead.org
      Cc: jfs-discussion@lists.sourceforge.net
      Cc: linux-nfs@vger.kernel.org
      Cc: linux-nilfs@vger.kernel.org
      Cc: linux-ntfs-dev@lists.sourceforge.net
      Cc: ocfs2-devel@oss.oracle.com
      Cc: reiserfs-devel@vger.kernel.org
  13. 12 Mar, 2014 1 commit
    • of: remove /proc/device-tree · 8357041a
      Grant Likely authored
      The same data is now available in sysfs, so we can remove the code
      that exports it in /proc and replace it with a symlink to the sysfs
      version.
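
      The replacement boils down to creating a procfs symlink that points at
      the sysfs copy of the tree; a sketch (function name assumed, the sysfs
      path is the flattened device tree base):

        #include <linux/init.h>
        #include <linux/proc_fs.h>

        /* Compatibility link: /proc/device-tree -> the sysfs device tree. */
        static void __init proc_device_tree_link(void)
        {
            proc_symlink("device-tree", NULL,
                         "/sys/firmware/devicetree/base");
        }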
      
      Tested on versatile qemu model and mpc5200 eval board. More testing
      would be appreciated.
      
      v5: Fixed up conflicts with mainline changes
      Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
      Cc: Rob Herring <rob.herring@calxeda.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Pantelis Antoniou <panto@antoniou-consulting.com>
  14. 11 Mar, 2014 1 commit
  15. 04 Mar, 2014 1 commit
    • mm: close PageTail race · 668f9abb
      David Rientjes authored
      Commit bf6bddf1 ("mm: introduce compaction and migration for
      ballooned pages") introduces page_count(page) into memory compaction
      which dereferences page->first_page if PageTail(page).
      
      This results in a very rare NULL pointer dereference on the
      aforementioned page_count(page).  Indeed, anything that does
      compound_head(), including page_count() is susceptible to racing with
      prep_compound_page() and seeing a NULL or dangling page->first_page
      pointer.
      
      This patch uses Andrea's implementation of compound_trans_head() that
      deals with such a race and makes it the default compound_head()
      implementation.  This includes a read memory barrier that ensures that
      if PageTail(head) is true that we return a head page that is neither
      NULL nor dangling.  The patch then adds a store memory barrier to
      prep_compound_page() to ensure page->first_page is set.
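
      A sketch of the barrier pairing being described (simplified, with the
      helper renamed so it does not clash with the real compound_head();
      field names as they existed at the time):

        #include <linux/mm.h>

        /* Reader side: only trust page->first_page if the page is still a
         * tail page after the pointer has been read. */
        static inline struct page *compound_head_example(struct page *page)
        {
            if (unlikely(PageTail(page))) {
                struct page *head = page->first_page;

                /* Pairs with the store barrier in prep_compound_page()
                 * that publishes ->first_page before PG_tail is visible. */
                smp_rmb();
                if (likely(PageTail(page)))
                    return head;
            }
            return page;
        }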
      
      This is the safest way to ensure we see the head page that we are
      expecting; PageTail(page) is already in the unlikely() path, and the
      memory barriers are unfortunately required.
      
      Hugetlbfs is the exception: we don't enforce a store memory barrier
      during init since no race is possible.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 11 Feb, 2014 1 commit
    • vmcore: prevent PT_NOTE p_memsz overflow during header update · 38dfac84
      Greg Pearson authored
      Currently, update_note_header_size_elf64() and
      update_note_header_size_elf32() will add the size of a PT_NOTE entry to
      real_sz even if that causes real_sz to exceed max_sz.  This patch
      corrects the while loop logic in those routines to ensure that this does
      not happen, and prints a warning if a PT_NOTE entry is dropped.  If zero
      PT_NOTE entries are found, or this condition is encountered because the
      only entry was dropped, a warning is printed and an error is returned.
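
      The corrected accumulation is roughly of the following shape (an
      illustrative helper, not the actual routine, using the 64-bit names
      from the message):

        #include <linux/elf.h>
        #include <linux/kernel.h>
        #include <linux/printk.h>

        /* Illustrative bounds-checked accumulation of PT_NOTE entry sizes. */
        static u64 sum_note_sizes(Elf64_Nhdr *nhdr_ptr, u64 max_sz)
        {
            u64 real_sz = 0;

            while (nhdr_ptr->n_namesz != 0) {
                u64 sz = sizeof(Elf64_Nhdr) +
                         roundup(nhdr_ptr->n_namesz, 4) +
                         roundup(nhdr_ptr->n_descsz, 4);

                /* Dropping the entry beats overflowing the buffer. */
                if (real_sz + sz > max_sz) {
                    pr_warn_once("Dropping PT_NOTE entry exceeding buffer\n");
                    break;
                }
                real_sz += sz;
                nhdr_ptr = (Elf64_Nhdr *)((char *)nhdr_ptr + sz);
            }
            return real_sz;
        }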
      
      One possible negative side effect of exceeding the max_sz limit is an
      allocation failure in merge_note_headers_elf64() or
      merge_note_headers_elf32() which would produce console output such as
      the following while booting the crash kernel.
      
        vmalloc: allocation failure: 14076997632 bytes
        swapper/0: page allocation failure: order:0, mode:0x80d2
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-gbp1 #7
        Call Trace:
          dump_stack+0x19/0x1b
          warn_alloc_failed+0xf0/0x160
          __vmalloc_node_range+0x19e/0x250
          vmalloc_user+0x4c/0x70
          merge_note_headers_elf64.constprop.9+0x116/0x24a
          vmcore_init+0x2d4/0x76c
          do_one_initcall+0xe2/0x190
          kernel_init_freeable+0x17c/0x207
          kernel_init+0xe/0x180
          ret_from_fork+0x7c/0xb0
      
        Kdump: vmcore not initialized
      
        kdump: dump target is /dev/sda4
        kdump: saving to /sysroot//var/crash/127.0.0.1-2014.01.28-13:58:52/
        kdump: saving vmcore-dmesg.txt
        Cannot open /proc/vmcore: No such file or directory
        kdump: saving vmcore-dmesg.txt failed
        kdump: saving vmcore
        kdump: saving vmcore failed
      
      This type of failure has been seen on a four socket prototype system
      with certain memory configurations.  Most PT_NOTE sections have a single
      entry similar to:
      
        n_namesz = 0x5
        n_descsz = 0x150
        n_type   = 0x1
      
      Occasionally, a second entry is encountered with very large n_namesz and
      n_descsz sizes:
      
        n_namesz = 0x80000008
        n_descsz = 0x510ae163
        n_type   = 0x80000008
      
      We are not yet sure of the source of these extra entries; they seem
      bogus, but they shouldn't cause the crash dump to fail.
      Signed-off-by: Greg Pearson <greg.pearson@hp.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 24 Jan, 2014 10 commits