1. 25 Feb, 2013 6 commits
    • L
      Merge tag 'mfd-3.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6 · ab782659
      Committed by Linus Torvalds
      Pull MFD updates from Samuel Ortiz:
       "This is the MFD pull request for the 3.9 merge window.
      
        No new drivers this time, but a bunch of fairly big cleanups:
      
         - Roger Quadros worked on an OMAP USBHS and TLL platform data
           consolidation, OMAP5 support and clock management code cleanup.
      
         - The first step of a major sync for the ab8500 driver from Lee
           Jones.  In particular, the debugfs and the sysct interfaces got
           extended and improved.
      
         - Peter Ujfalusi sent a nice patchset for cleaning and fixing the
           twl-core driver, with a much needed module id lookup code
           improvement.
      
         - The regular wm5102 and arizona cleanups and fixes from Mark Brown.
      
         - Laxman Dewangan extended the palmas APIs in order to implement the
           palmas GPIO and rt drivers.
      
         - Laxman also added DT support for the tps65090 driver.
      
         - The Intel SCH and ICH drivers got a couple fixes from Aaron Sierra
           and Darren Hart.
      
         - Linus Walleij's patchset for the ab8500 driver allowed ab8500 and
           ab9540 based devices to switch to the new abx500 pin-ctrl driver.
      
         - The max8925 now has device tree and irqdomain support thanks to
           Qing Xu.
      
         - The recently added rtsx driver got a few cleanups and fixes for a
           better card detection code path and now also supports the RTS5227
           chipset, thanks to Wei Wang and Roger Tseng."
      
      * tag 'mfd-3.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (109 commits)
        mfd: lpc_ich: Use devres API to allocate private data
        mfd: lpc_ich: Add Device IDs for Intel Wellsburg PCH
        mfd: lpc_sch: Accomodate partial population of the MFD devices
        mfd: da9052-i2c: Staticize da9052_i2c_fix()
        mfd: syscon: Fix sparse warning
        mfd: twl-core: Fix kernel panic on boot
        mfd: rtsx: Fix issue that booting OS with SD card inserted
        mfd: ab8500: Fix compile error
        mfd: Add missing GENERIC_HARDIRQS dependecies
        Documentation: Add docs for max8925 dt
        mfd: max8925: Add dts
        mfd: max8925: Support dt for backlight
        mfd: max8925: Fix onkey driver irq base
        mfd: max8925: Fix mfd device register failure
        mfd: max8925: Add irqdomain for dt
        mfd: vexpress: Allow vexpress-sysreg to self-initialise
        mfd: rtsx: Support RTS5227
        mfd: rtsx: Implement driving adjustment to device-dependent callbacks
        mfd: vexpress: Add pseudo-GPIO based LEDs
        mfd: ab8500: Rename ab8500 to abx500 for hwmon driver
        ...
    • L
      Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 21fbd580
      Committed by Linus Torvalds
      Pull media updates from Mauro Carvalho Chehab:
      
       - Some cleanups to the V4L2 documentation
      
       - new drivers: ts2020 frontend, ov9650 sensor, s5c73m3 sensor,
         sh-mobile veu mem2mem driver, radio-ma901, davinci_vpfe staging
         driver
      
       - Lots of missing MAINTAINERS entries added
      
       - several em28xx driver improvements, including its conversion to
         videobuf2
      
       - several fixups on drivers to make them comply better with the API
      
       - DVB core: add support for DVBv5 stats, allowing the implementation of
         statistics for new standards like ISDB
      
       - mb86a20s: add statistics to the driver
      
       - lots of new board additions, cleanups, and driver improvements.
      
      * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (596 commits)
        [media] media: Add 0x3009 USB PID to ttusb2 driver (fixed diff)
        [media] rtl28xxu: Add USB IDs for Compro VideoMate U620F
        [media] em28xx: add usb id for terratec h5 rev. 3
        [media] media: rc: gpio-ir-recv: add support for device tree parsing
        [media] mceusb: move check earlier to make smatch happy
        [media] radio-si470x doc: add info about v4l2-ctl and sox+alsa
        [media] staging: media: Remove unnecessary OOM messages
        [media] sh_vou: Use vou_dev instead of vou_file wherever possible
        [media] sh_vou: Use video_drvdata()
        [media] drivers/media/platform/soc_camera/pxa_camera.c: use devm_ functions
        [media] mt9t112: mt9t111 format set up differs from mt9t112
        [media] sh-mobile-ceu-camera: fix SHARPNESS control default
        Revert "[media] fc0011: Return early, if the frequency is already tuned"
        [media] cx18/ivtv: fix regression: remove __init from a non-init function
        [media] em28xx: fix analog streaming with USB bulk transfers
        [media] stv0900: remove unnecessary null pointer check
        [media] fc0011: Return early, if the frequency is already tuned
        [media] fc0011: Add some sanity checks and cleanups
        [media] fc0011: Fix xin value clamping
        Revert "[media] [PATH,1/2] mxl5007 move reset to attach"
        ...
    • L
      Merge tag 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev · d9978ec5
      Committed by Linus Torvalds
      Pull libata updates from Jeff Garzik:
      
      1) apply, and then revert, the sysfs export of ATA host controller
         number.  Discussion was continuing after patch application, trying to
         figure out how to best mesh exported data with the installers,
         boot-time agents and other parties that want this info.
      
      2) Merge Zero-Power Optical Device Driver (ZPODD) support, bringing the
         wonderfulness of sane power management to your CD/DVD device.
      
         Includes one SCSI-subsystem patch (with appropriate ACKs), adding
         runtime PM support to the 'sr' driver.  Those are the ZPODD interaction
         bits.
      
         Patchset went through some 13 revisions before it got here; kudos to
         Intel for persistence.
      
      3) pata_samsung_cf: use devm_clk_get()
      
      4) more ata_piix, ahci PCI IDs
      
      5) Add SATA driver for R-Car SoC
      
      6) Convert libata to use devm_ioremap_resource (Note: I think Greg sent
         this to you, also)
      
      7) Set proper Sense Key (SK) in the SCSI simulator when ATA passthrough
         indicates check condition.  Google and specification hawks everywhere
         shall rejoice.
      
      * tag 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev: (22 commits)
        [libata] fix smatch warning for zpodd_wake_dev
        [libata] Set proper SK when CK_COND is set.
        [libata] Convert to devm_ioremap_resource()
        libata: add R-Car SATA driver
        ahci: Add Device IDs for Intel Wellsburg PCH
        ata_piix: Add Device IDs for Intel Wellsburg PCH
        [SCSI] remove can_power_off flag from scsi_device
        [libata] scsi: no poll when ODD is powered off
        [SCSI] sr: support runtime pm
        ahci: AHCI-mode SATA patch for Intel Avoton DeviceIDs
        ata_piix: IDE-mode SATA patch for Intel Avoton DeviceIDs
        [libata] PM code cleanup for ata port
        [libata] pm: differentiate system and runtime pm for ata port
        Revert "libata: export host controller number thru /sys"
        libata: do not suspend port if normal ODD is attached
        libata: expose pm qos flags for ata device
        libata: handle power transition of ODD
        libata: check zero power ready status for ZPODD
        libata: move acpi notification code to zpodd
        libata: identify and init ZPODD devices
        ...
    • N
      tty vt: fix character insertion overflow · a883b70d
      Committed by Nicolas Pitre
      Commit 81732c3b ("tty vt: Fix line garbage in virtual console on
      command line edition") broke insert_char() in multiple ways.  Then
      commit b1a925f4 ("tty vt: Fix a regression in command line edition")
      partially fixed it.  However, the buffer being moved is still too large
      and overflowing beyond the end of the current line, corrupting existing
      characters on the next line.
      
      Example test case:
      
      echo -e "abc\nde\x1b[A\x1b[4h \x1b[4l\x1b[B"
      
      Expected result:
      
      ab c
      de
      
      Current result:
      
      ab c
       e
      
      Needless to say, this is very annoying when inserting words in the
      middle of paragraphs with certain text editors.
      Signed-off-by: Nicolas Pitre <nico@linaro.org>
      Cc: Jean-François Moine <moinejf@free.fr>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Merge tag 'stable/for-linus-3.9-rc0-tag' of... · 77be36de
      Committed by Linus Torvalds
      Merge tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
      
      Pull Xen update from Konrad Rzeszutek Wilk:
       "This has two new ACPI drivers for Xen - a physical CPU offline/online
        and a memory hotplug.  The way this works is that ACPI kicks the
        drivers and they make the appropriate hypercall to the hypervisor to
        tell it that there is a new CPU or memory.  There are also some changes
        to the Xen ARM ABIs and a couple of fixes.  One particularly nasty bug
        in the Xen PV spinlock code was fixed by Stefan Bader - and has been
        there since 2.6.32!
      
        Features:
         - Xen ACPI memory and CPU hotplug drivers - allowing Xen hypervisor
           to be aware of new CPU and new DIMMs
         - Cleanups
        Bug-fixes:
         - Fixes a long-standing bug in the PV spinlock wherein we did not
           kick VCPUs that were in a tight loop.
         - Fixes in the error paths for the event channel machinery"
      
      Fix up a few semantic conflicts with the ACPI interface changes in
      drivers/xen/xen-acpi-{cpu,mem}hotplug.c.
      
      * tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
        xen: event channel arrays are xen_ulong_t and not unsigned long
        xen: Send spinlock IPI to all waiters
        xen: introduce xen_remap, use it instead of ioremap
        xen: close evtchn port if binding to irq fails
        xen-evtchn: correct comment and error output
        xen/tmem: Add missing %s in the printk statement.
        xen/acpi: move xen_acpi_get_pxm under CONFIG_XEN_DOM0
        xen/acpi: ACPI cpu hotplug
        xen/acpi: Move xen_acpi_get_pxm to Xen's acpi.h
        xen/stub: driver for CPU hotplug
        xen/acpi: ACPI memory hotplug
        xen/stub: driver for memory hotplug
        xen: implement updated XENMEM_add_to_physmap_range ABI
        xen/smp: Move the common CPU init code a bit to prep for PVH patch.
    • L
      Merge tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 89f88337
      Committed by Linus Torvalds
      Pull KVM updates from Marcelo Tosatti:
       "KVM updates for the 3.9 merge window, including x86 real mode
        emulation fixes, stronger memory slot interface restrictions, mmu_lock
        spinlock hold time reduction, improved handling of large page faults
        on shadow, initial APICv HW acceleration support, s390 channel IO
        based virtio, amongst others"
      
      * tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
        Revert "KVM: MMU: lazily drop large spte"
        x86: pvclock kvm: align allocation size to page size
        KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr
        x86 emulator: fix parity calculation for AAD instruction
        KVM: PPC: BookE: Handle alignment interrupts
        booke: Added DBCR4 SPR number
        KVM: PPC: booke: Allow multiple exception types
        KVM: PPC: booke: use vcpu reference from thread_struct
        KVM: Remove user_alloc from struct kvm_memory_slot
        KVM: VMX: disable apicv by default
        KVM: s390: Fix handling of iscs.
        KVM: MMU: cleanup __direct_map
        KVM: MMU: remove pt_access in mmu_set_spte
        KVM: MMU: cleanup mapping-level
        KVM: MMU: lazily drop large spte
        KVM: VMX: cleanup vmx_set_cr0().
        KVM: VMX: add missing exit names to VMX_EXIT_REASONS array
        KVM: VMX: disable SMEP feature when guest is in non-paging mode
        KVM: Remove duplicate text in api.txt
        Revert "KVM: MMU: split kvm_mmu_free_page"
        ...
  2. 24 Feb, 2013 34 commits
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal · 9e2d59ad
      Committed by Linus Torvalds
      Pull signal handling cleanups from Al Viro:
       "This is the first pile; another one will come a bit later and will
        contain SYSCALL_DEFINE-related patches.
      
         - a bunch of signal-related syscalls (both native and compat)
           unified.
      
         - a bunch of compat syscalls switched to COMPAT_SYSCALL_DEFINE
           (fixing several potential problems with missing argument
           validation, while we are at it)
      
         - a lot of now-pointless wrappers killed
      
         - a couple of architectures (cris and hexagon) forgot to save
           altstack settings into sigframe, even though they used the
           (uninitialized) values in sigreturn; fixed.
      
         - microblaze fixes for delivery of multiple signals arriving at once
      
         - saner set of helpers for signal delivery introduced, several
           architectures switched to using those."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (143 commits)
        x86: convert to ksignal
        sparc: convert to ksignal
        arm: switch to struct ksignal * passing
        alpha: pass k_sigaction and siginfo_t using ksignal pointer
        burying unused conditionals
        make do_sigaltstack() static
        arm64: switch to generic old sigaction() (compat-only)
        arm64: switch to generic compat rt_sigaction()
        arm64: switch compat to generic old sigsuspend
        arm64: switch to generic compat rt_sigqueueinfo()
        arm64: switch to generic compat rt_sigpending()
        arm64: switch to generic compat rt_sigprocmask()
        arm64: switch to generic sigaltstack
        sparc: switch to generic old sigsuspend
        sparc: COMPAT_SYSCALL_DEFINE does all sign-extension as well as SYSCALL_DEFINE
        sparc: kill sign-extending wrappers for native syscalls
        kill sparc32_open()
        sparc: switch to use of generic old sigaction
        sparc: switch sys_compat_rt_sigaction() to COMPAT_SYSCALL_DEFINE
        mips: switch to generic sys_fork() and sys_clone()
        ...
    • L
      Merge branch 'akpm' (more incoming from Andrew) · 5ce1a70e
      Committed by Linus Torvalds
      Merge second patch-bomb from Andrew Morton:
      
       - A little DM fix
      
       - the MM queue
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (154 commits)
        ksm: allocate roots when needed
        mm: cleanup "swapcache" in do_swap_page
        mm,ksm: swapoff might need to copy
        mm,ksm: FOLL_MIGRATION do migration_entry_wait
        ksm: shrink 32-bit rmap_item back to 32 bytes
        ksm: treat unstable nid like in stable tree
        ksm: add some comments
        tmpfs: fix mempolicy object leaks
        tmpfs: fix use-after-free of mempolicy object
        mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
        mm: export mmu notifier invalidates
        mm: accelerate mm_populate() treatment of THP pages
        mm: use long type for page counts in mm_populate() and get_user_pages()
        mm: accurately document nr_free_*_pages functions with code comments
        HWPOISON: change order of error_states[]'s elements
        HWPOISON: fix misjudgement of page_action() for errors on mlocked pages
        memcg: stop warning on memcg_propagate_kmem
        net: change type of virtio_chan->p9_max_pages
        vmscan: change type of vm_total_pages to unsigned long
        fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used
        ...
    • H
      ksm: allocate roots when needed · ef53d16c
      Committed by Hugh Dickins
      It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
      allocated, particularly when very few users will ever actually tune
      merge_across_nodes to 0 to use more than 1+1 of those trees.  Not a big
      deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.
      
      Start off with 1+1 statically allocated, then if merge_across_nodes is
      ever tuned, allocate for nr_node_ids+nr_node_ids.  Do not attempt to
      free up the extra if it's tuned back, that would be a waste of effort.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      mm: cleanup "swapcache" in do_swap_page · 56f31801
      Committed by Hugh Dickins
      I dislike the way in which "swapcache" gets used in do_swap_page():
      there is always a page from swapcache there (even if maybe uncached by
      the time we lock it), but tests are made according to "swapcache".
      Rework that with "page != swapcache", as has been done in unuse_pte().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      mm,ksm: swapoff might need to copy · 9e16b7fb
      Committed by Hugh Dickins
      Before establishing that KSM page migration was the cause of my
      WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
      lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in
      many respects is equivalent to faulting in a page.
      
      In fact I've never caught that as the cause: but in theory it does at
      least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to
      avoid bringing a KSM page back in when it's not supposed to be.
      
      I intended to copy how it's done in do_swap_page(), but have a strong
      aversion to how "swapcache" ends up being used there: rework it with
      "page != swapcache".
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      mm,ksm: FOLL_MIGRATION do migration_entry_wait · 5117b3b8
      Committed by Hugh Dickins
      In "ksm: remove old stable nodes more thoroughly" I said that I'd never
      seen its WARN_ON_ONCE(page_mapped(page)).  True at the time of writing,
      but it soon appeared once I tried fuller tests on the whole series.
      
      It turned out to be due to the KSM page migration itself:
      unmerge_and_remove_all_rmap_items() failed to locate and replace all the
      KSM pages,
      because of that hiatus in page migration when old pte has been replaced
      by migration entry, but not yet by new pte.  follow_page() finds no page
      at that instant, but a KSM page reappears shortly after, without a
      fault.
      
      Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
      for KSM's break_cow().  I'd have preferred to avoid another flag, and do
      it every time, in case someone else makes the same easy mistake; but did
      not find another transgressor (the common get_user_pages() is of course
      safe), and cannot be sure that every follow_page() caller is prepared to
      sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
      already sleep there, since anon_vma locking was changed to mutex, but
      maybe that's somehow excluded.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      ksm: shrink 32-bit rmap_item back to 32 bytes · bc56620b
      Committed by Hugh Dickins
      Think of struct rmap_item as an extension of struct page (restricted to
      MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
      small, especially on 32-bit architectures of limited lowmem.
      
      Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
      making no change to its 64-byte struct rmap_item; but bloats the 32-bit
      struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
      rounds up to 40 bytes once allocated from slab.  We'd better avoid that.
      
      Hey, I only just remembered that the anon_vma pointer in struct
      rmap_item has no purpose until the rmap_item is hung from a stable tree
      node (which has its own nid field); and rmap_item's nid field has no
      purpose other than to say which tree root to tell rb_erase() when
      unlinking from an unstable tree.
      
      Double them up in a union.  There's just one place where we set anon_vma
      early (when we already hold mmap_sem): now we must remove tree_rmap_item
      from its unstable tree there, before overwriting nid.  No need to
      spatter BUG()s around: we'd be seeing oopses if this were wrong.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      ksm: treat unstable nid like in stable tree · b599cbdf
      Committed by Hugh Dickins
      An inconsistency emerged in reviewing the NUMA node changes to KSM: when
      meeting a page from the wrong NUMA node in a stable tree, we say that
      it's okay for comparisons, but not as a leaf for merging; whereas when
      meeting a page from the wrong NUMA node in an unstable tree, we bail out
      immediately.
      
      Now, it might be that a wrong NUMA node in an unstable tree is more
      likely to correlate with instability (different content, with rbnode
      now misplaced) than page migration; but even so, we are accustomed to
      instability in the unstable tree.
      
      Without strong evidence for which strategy is generally better, I'd
      rather be consistent with what's done in the stable tree: accept a page
      from the wrong NUMA node for comparison, but not as a leaf for merging.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      ksm: add some comments · 8fdb3dbf
      Committed by Hugh Dickins
      Added slightly more detail to the Documentation of merge_across_nodes, a
      few comments in areas indicated by review, and renamed get_ksm_page()'s
      argument from "locked" to "lock_it".  No functional change.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      tmpfs: fix mempolicy object leaks · 49cd0a5c
      Committed by Greg Thelen
      Fix several mempolicy leaks in the tmpfs mount logic.  These leaks are
      slow - on the order of one object leaked per mount attempt.
      
      Leak 1 (umount doesn't free mpol allocated in mount):
          while true; do
              mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
              umount /mnt
          done
      
      Leak 2 (errors parsing remount options will leak mpol):
          mount -t tmpfs -o size=100M nodev /mnt
          while true; do
              mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
          done
          umount /mnt
      
      Leak 3 (multiple mpol per mount leak mpol):
          while true; do
              mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
              umount /mnt
          done
      
      This patch fixes all of the above.  I could have broken the patch into
      three pieces but it seemed easier to review as one.
      
      [akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh]
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      tmpfs: fix use-after-free of mempolicy object · 5f00110f
      Committed by Greg Thelen
      The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
      option is not specified in the remount request.  A new policy can be
      specified if mpol=M is given.
      
      Before this patch, remounting an mpol-bound tmpfs without specifying the
      mpol= mount option in the remount request would set the filesystem's
      mempolicy object to a freed mempolicy object.
      
      To reproduce the problem, boot a DEBUG_PAGEALLOC kernel and run:
          # mkdir /tmp/x
      
          # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0
      
          # mount -o remount,size=200M nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
              # note ? garbage in mpol=... output above
      
          # dd if=/dev/zero of=/tmp/x/f count=1
              # panic here
      
      Panic:
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<          (null)>]           (null)
          [...]
          Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
          Call Trace:
            mpol_shared_policy_init+0xa5/0x160
            shmem_get_inode+0x209/0x270
            shmem_mknod+0x3e/0xf0
            shmem_create+0x18/0x20
            vfs_create+0xb5/0x130
            do_last+0x9a1/0xea0
            path_openat+0xb3/0x4d0
            do_filp_open+0x42/0xa0
            do_sys_open+0xfe/0x1e0
            compat_sys_open+0x1b/0x20
            cstar_dispatch+0x7/0x1f
      
      Non-debug kernels will not crash immediately because referencing the
      dangling mpol will not cause a fault.  Instead the filesystem will
      reference a freed mempolicy object, which will cause unpredictable
      behavior.
      
      The problem boils down to a dropped mpol reference below if
      shmem_parse_options() does not allocate a new mpol:
      
          config = *sbinfo
          shmem_parse_options(data, &config, true)
          mpol_put(sbinfo->mpol)
          sbinfo->mpol = config.mpol  /* BUG: saves unreferenced mpol */
      
      This patch avoids the crash by not releasing the mempolicy if
      shmem_parse_options() doesn't create a new mpol.
      
      How far back does this issue go? I see it in both 2.6.36 and 3.3.  I did
      not look back further.
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages · 67d46b29
      Committed by Mel Gorman
      Rob van der Heij reported the following (paraphrased) in private mail.
      
      	The scenario is that I want to avoid backups to fill up the page
      	cache and purge stuff that is more likely to be used again (this is
      	with s390x Linux on z/VM, so I don't give it so much memory that
      	we don't care anymore). So I have something with LD_PRELOAD that
      	intercepts the close() call (from tar, in this case) and issues
      	a posix_fadvise() just before closing the file.
      
      	This mostly works, except for small files (less than 14 pages)
      	that remain in page cache after the fact.
      
      Unfortunately Rob has not had a chance to test this exact patch but the
      test program below should be reproducing the problem he described.
      
      The issue is the per-cpu pagevecs for LRU additions.  If the pages are
      added by one CPU but fadvise() is called on another then the pages
      remain resident as the invalidate_mapping_pages() only drains the local
      pagevecs via its call to pagevec_release().  The user-visible effect is
      that a program that uses fadvise() properly is not obeyed.
      
      A possible fix for this is to put the necessary smarts into
      invalidate_mapping_pages() to globally drain the LRU pagevecs if a
      pagevec page could not be discarded.  The downside with this is that an
      inode cache shrink would send a global IPI and memory pressure
      potentially causing global IPI storms is very undesirable.
      
      Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
      check if invalidate_mapping_pages() discarded all the requested pages.
      If only a subset of the pages is discarded, it drains the LRU pagevecs and tries
      again.  If the second attempt fails, it assumes it is due to the pages
      being mapped, locked or dirty and does not care.  With this patch, an
      application using fadvise() correctly will be obeyed but there is a
      downside that a malicious application can force the kernel to send
      global IPIs and increase overhead.
      
      If accepted, I would like this to be considered as a -stable candidate.
      It's not an urgent issue but it's a system call that is not working as
      advertised which is weak.
      
      The following test program demonstrates the problem.  It should never
      report that pages are still resident but will without this patch.  It
      assumes that CPU 0 and 1 exist.
      
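      #define _GNU_SOURCE	/* for sched_setaffinity() and the CPU_* macros */
      #include <errno.h>
      #include <fcntl.h>
      #include <sched.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* Assumed value: the definition is not present in this message. */
      #define FILESIZE_PAGES 5

      int main(void) {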
      	int fd;
      	int pagesize = getpagesize();
      	ssize_t written = 0, expected;
      	char *buf;
      	unsigned char *vec;
      	int resident, i;
      	cpu_set_t set;
      
      	/* Prepare a buffer for writing */
      	expected = FILESIZE_PAGES * pagesize;
      	buf = malloc(expected + 1);
      	if (buf == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      	buf[expected] = 0;
      	memset(buf, 'a', expected);
      
      	/* Prepare the mincore vec */
      	vec = malloc(FILESIZE_PAGES);
      	if (vec == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Bind ourselves to CPU 0 */
      	CPU_ZERO(&set);
      	CPU_SET(0, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* open file, unlink and write buffer */
      	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR, 0600);
      	if (fd == -1) {
      		perror("open");
      		exit(EXIT_FAILURE);
      	}
      	unlink("fadvise-test-file");
      	while (written < expected) {
      		ssize_t this_write;
      		this_write = write(fd, buf + written, expected - written);
      
      		if (this_write == -1) {
      			perror("write");
      			exit(EXIT_FAILURE);
      		}
      
      		written += this_write;
      	}
      	free(buf);
      
      	/*
      	 * Force ourselves to another CPU. If fadvise only flushes the local
      	 * CPUs pagevecs then the fadvise will fail to discard all file pages
      	 */
      	CPU_ZERO(&set);
      	CPU_SET(1, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* sync and fadvise to discard the page cache */
      	fsync(fd);
      	/* posix_fadvise() returns the error number instead of setting errno */
      	if ((errno = posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED)) != 0) {
      		perror("posix_fadvise");
      		exit(EXIT_FAILURE);
      	}
      
      	/* map the file and use mincore to see which parts of it are resident */
      	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
      	if (buf == MAP_FAILED) {
      		perror("mmap");
      		exit(EXIT_FAILURE);
      	}
      	if (mincore(buf, expected, vec) == -1) {
      		perror("mincore");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Check residency */
      	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
      		if (vec[i])
      			resident++;
      	}
      	if (resident != 0) {
      		printf("Nr unexpected pages resident: %d\n", resident);
      		exit(EXIT_FAILURE);
      	}
      
      	munmap(buf, expected);
      	close(fd);
      	free(vec);
      	exit(EXIT_SUCCESS);
      }
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Rob van der Heij <rvdheij@gmail.com>
      Tested-by: Rob van der Heij <rvdheij@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67d46b29
    • C
      mm: export mmu notifier invalidates · fa794199
      Cliff Wickman committed
      We at SGI have a need to address some very high physical address ranges
      with our GRU (global reference unit), sometimes across partitioned
      machine boundaries and sometimes with larger addresses than the cpu
      supports.  We do this with the aid of our own 'extended vma' module
      which mimics the vma.  When something (either unmap or exit) frees an
      'extended vma' we use the mmu notifiers to clean them up.
      
      We had been able to mimic the functions
      __mmu_notifier_invalidate_range_start() and
      __mmu_notifier_invalidate_range_end() by locking the per-mm lock and
      walking the per-mm notifier list.  But with the change to a global srcu
      lock (static in mmu_notifier.c) we can no longer do that.  Our module has
      no access to that lock.
      
      So we request that these two functions be exported.
      Signed-off-by: Cliff Wickman <cpw@sgi.com>
      Acked-by: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa794199
    • M
      mm: accelerate mm_populate() treatment of THP pages · 240aadee
      Michel Lespinasse committed
      This change adds a follow_page_mask function which is equivalent to
      follow_page, but with an extra page_mask argument.
      
      follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
      a THP page, and to 0 in other cases.
      
      __get_user_pages() makes use of this in order to accelerate populating
      THP ranges - that is, when both the pages and vmas arrays are NULL, we
      don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
      we also avoid taking mm->page_table_lock that many times).
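
      The saving can be sketched in userspace as follows.  This is an
      illustration under stated assumptions (4 KiB base pages, 512 base
      pages per THP); page_stride() and iterations() are hypothetical
      helper names, not functions added by this change.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumed 4 KiB base pages */

/* Number of base pages to advance from `start`, given the page_mask
 * reported for the page found there (HPAGE_PMD_NR - 1 for a THP page,
 * 0 otherwise), capped at the number of pages still to be covered. */
static unsigned long page_stride(unsigned long start, unsigned long page_mask,
				 unsigned long nr_pages)
{
	unsigned long step = 1 + (~(start >> PAGE_SHIFT) & page_mask);

	return step > nr_pages ? nr_pages : step;
}

/* Count the loop iterations needed to cover nr_pages from `start`,
 * assuming every page in the range reports the same page_mask. */
static unsigned long iterations(unsigned long start, unsigned long nr_pages,
				unsigned long page_mask)
{
	unsigned long n = 0;

	assert(nr_pages > 0);
	while (nr_pages) {
		unsigned long step = page_stride(start, page_mask, nr_pages);

		start += step << PAGE_SHIFT;
		nr_pages -= step;
		n++;
	}
	return n;
}
```

      With a mask of 0, covering 1024 base pages takes 1024 iterations;
      with a mask of 511 and an aligned start it takes 2.  A start
      address in the middle of a THP first steps to the end of that THP.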
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      240aadee
    • M
      mm: use long type for page counts in mm_populate() and get_user_pages() · 28a35716
      Michel Lespinasse committed
      Use long type for page counts in mm_populate() so as to avoid integer
      overflow when running the following test code:
      
      int main(void) {
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
      }
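
      The arithmetic behind the overflow can be checked with a short
      sketch.  It assumes 4 KiB pages and uses an unsigned 32-bit counter
      so that the wrap-around is well defined in C; pages_wide() and
      pages_narrow() are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumed 4 KiB pages */

/* Page count of a mapping of `len` bytes, kept in a 64-bit type. */
static uint64_t pages_wide(uint64_t len)
{
	return len >> PAGE_SHIFT;
}

/* The same count squeezed through a 32-bit counter; the upper bits
 * are silently dropped. */
static uint32_t pages_narrow(uint64_t len)
{
	return (uint32_t)(len >> PAGE_SHIFT);
}
```

      For the 0x100000000000-byte mapping in the test code, the wide
      count is 2^32 pages, which a 32-bit counter truncates to 0.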
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28a35716
    • Z
      mm: accurately document nr_free_*_pages functions with code comments · e0fb5815
      Zhang Yanfei committed
      nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
      are horribly badly named, so accurately document them with code comments
      to guard against their misuse.
      
      [akpm@linux-foundation.org: tweak comments]
      Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0fb5815
    • N
      HWPOISON: change order of error_states[]'s elements · 5f4b9fc5
      Naoya Horiguchi committed
      error_states[] has two separate states "unevictable LRU page" and
      "mlocked LRU page", and the former one has the higher priority now.  But
      because of that the latter one is rarely chosen, since pages with
      PageMlocked very likely also have PG_unevictable set.  On the other hand,
      PG_unevictable without PageMlocked is common for ramfs or SHM_LOCKed
      shared memory, so reversing the priority of these two states helps us
      clearly distinguish them.
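
      Why the order matters can be shown with a first-match table.  The
      sketch below is a simplified, hypothetical model: the flag bits,
      the enum and the four-entry table are illustrative and do not
      reproduce the kernel's error_states[] contents.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits standing in for the kernel's page flags. */
#define F_LRU         (1u << 0)
#define F_UNEVICTABLE (1u << 1)
#define F_MLOCKED     (1u << 2)

enum page_state { ST_MLOCKED, ST_UNEVICTABLE, ST_LRU, ST_UNKNOWN };

struct state {
	unsigned int mask;     /* bits to inspect */
	unsigned int res;      /* required value of those bits */
	enum page_state id;
};

/* The mlocked entry is listed before the unevictable one, so a page
 * with both bits set is classified as mlocked. */
static const struct state states[] = {
	{ F_MLOCKED,     F_MLOCKED,     ST_MLOCKED },
	{ F_UNEVICTABLE, F_UNEVICTABLE, ST_UNEVICTABLE },
	{ F_LRU,         F_LRU,         ST_LRU },
	{ 0,             0,             ST_UNKNOWN },   /* catch-all */
};

/* Return the state of the first entry whose masked bits match. */
static enum page_state classify(unsigned int flags)
{
	for (size_t i = 0; i < sizeof(states) / sizeof(states[0]); i++)
		if ((flags & states[i].mask) == states[i].res)
			return states[i].id;
	assert(!"unreachable: the catch-all entry always matches");
	return ST_UNKNOWN;
}
```

      With the two entries in the opposite order, a page carrying both
      bits would always match the unevictable entry first and the
      mlocked entry would never be reached for it.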
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f4b9fc5
    • N
      HWPOISON: fix misjudgement of page_action() for errors on mlocked pages · 524fca1e
      Naoya Horiguchi committed
      memory_failure() can't handle memory errors on mlocked pages correctly,
      because page_action() judges such errors as ones on "unknown pages"
      instead of ones on "unevictable LRU page" or "mlocked LRU page".  To
      determine the page state, page_action() checks the page flags at the
      time of the judgement, but those flags are not the same as the ones
      just after memory_failure() is called, because memory_failure() unmaps
      the error pages before calling page_action().  This unmapping
      changes the page state, especially page_remove_rmap() (called from
      try_to_unmap_one()) clears PG_mlocked, so page_action() can't catch
      mlocked pages after that.
      
      With this patch, we store the page flag of the error page before doing
      unmap, and (only) if the first check with page flags at the time decided
      the error page is unknown, we do the second check with the stored page
      flag.  This implementation doesn't change error handling for the page
      types for which the first check can determine the page state correctly.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      524fca1e
    • H
      memcg: stop warning on memcg_propagate_kmem · 6d043990
      Hugh Dickins committed
      Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
      I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
      "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
      used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d043990
    • Z
      net: change type of virtio_chan->p9_max_pages · 7293bfba
      Zhang Yanfei committed
      This member of struct virtio_chan is calculated from nr_free_buffer_pages
      so change its type to unsigned long to avoid overflow.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7293bfba
    • Z
      vmscan: change type of vm_total_pages to unsigned long · b21e0b90
      Zhang Yanfei committed
      This variable is calculated from nr_free_pagecache_pages so
      change its type to unsigned long.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b21e0b90
    • Z
      fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used · 697ce9be
      Zhang Yanfei committed
      The three variables are calculated from nr_free_buffer_pages so change
      their types to unsigned long to avoid overflow.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      697ce9be
    • Z
      fs/buffer.c: change type of max_buffer_heads to unsigned long · 43be594a
      Zhang Yanfei committed
      max_buffer_heads is calculated from nr_free_buffer_pages(), so change
      its type to unsigned long to avoid overflow.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43be594a
    • Z
      ia64: use %ld to print pages calculated in nr_free_buffer_pages · 6434b94a
      Zhang Yanfei committed
      Now the function nr_free_buffer_pages returns unsigned long, so use %ld
      to print its return value.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6434b94a
    • Z
      mm: fix return type for functions nr_free_*_pages · ebec3862
      Zhang Yanfei committed
      Currently, the amount of RAM that functions nr_free_*_pages return is
      held in unsigned int.  But in machines with big memory (exceeding 16TB),
      the amount may be incorrect because of overflow, so fix it.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebec3862
    • M
      memcg: cleanup mem_cgroup_init comment · 1081312f
      Michal Hocko committed
      We should encourage all memcg controller initialization that is
      independent of a specific mem_cgroup to be done here, rather than
      exploiting the css_alloc callback and assuming that nothing happens
      before the root cgroup is created.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1081312f
    • M
      memcg: move memcg_stock initialization to mem_cgroup_init · e4777496
      Michal Hocko committed
      memcg_stock are currently initialized during the root cgroup allocation
      which is OK but it pointlessly pollutes memcg allocation code with
      something that can be called when the memcg subsystem is initialized by
      mem_cgroup_init along with other controller specific parts.
      
      This patch wraps the current memcg_stock initialization code into a
      helper and calls it from the controller subsystem initialization code.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4777496
    • M
      memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init · 8787a1df
      Michal Hocko committed
      Per-node-zone soft limit tree is currently initialized when the root
      cgroup is created which is OK but it pointlessly pollutes memcg
      allocation code with something that can be called when the memcg
      subsystem is initialized by mem_cgroup_init along with other controller
      specific parts.
      
      While we are at it, let's make mem_cgroup_soft_limit_tree_init void:
      reporting an allocation failure makes little sense, because if we fail
      to allocate memory that early during boot we are screwed anyway (this
      saves some code).
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8787a1df
    • M
      mm: use up free swap space before reaching OOM kill · 0e50ce3b
      Minchan Kim committed
      Recently, Luigi reported that there is a lot of free swap space when OOM
      happens.  It is easily reproduced on zram-over-swap, where many instances
      of memory hogs are running and laptop_mode is enabled.  He said there
      was no problem when he disabled laptop_mode.  My investigation found
      the following.

      Assumption for ease of explanation: there are no page cache pages in the
      system because they have all been reclaimed already.
      
      1. try_to_free_pages disables may_writepage when laptop_mode is enabled.
      2. shrink_inactive_list isolates victim pages from the inactive anon lru list.
      3. shrink_page_list adds them to swapcache via add_to_swap but it doesn't
         pageout because sc->may_writepage is 0, so the page is rotated back into
         the inactive anon lru list. add_to_swap made the page dirty via SetPageDirty.
      4. Step 3 couldn't reclaim any pages, so do_try_to_free_pages increases the
         priority and retries reclaim with higher priority.
      5. shrink_inactive_list tries to isolate victim pages from the inactive anon
         lru list but fails because it isolates pages in ISOLATE_CLEAN mode and the
         inactive anon lru list is full of the dirty pages from step 3, so it just
         returns without any reclaim progress.
      6. do_try_to_free_pages doesn't set may_writepage because total_scanned is
         zero: sc->nr_scanned is increased by shrink_page_list, but we don't call
         shrink_page_list in step 5 because no pages were isolated.

      The above loop continues until OOM happens.
      
      The problem didn't happen before [1] was merged because the old logic's
      isolation in shrink_inactive_list succeeded and shrink_page_list was
      called to page the pages out.  It still failed to page them out because
      of may_writepage, but the important point is that sc->nr_scanned was
      increased even though we couldn't swap them out, so do_try_to_free_pages
      could set may_writepage.

      Since commit f80c0673 ("mm: zone_reclaim: make isolate_lru_page()
      filter-aware") was introduced, it's no longer a good idea to depend
      only on the number of scanned pages for setting may_writepage.  So this
      patch adds a new trigger point for setting may_writepage: a priority
      below DEF_PRIORITY - 2, which is used to indicate significant memory
      pressure in the VM.  That is a good fit for our purpose: it is better to
      lose power saving or put up with clickety than to OOM kill.
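
      The effect of the new trigger can be sketched with a small
      simulation of the priority loop.  This is a simplified model, not
      the reclaim code: writepage_enabled_at() and its parameters are
      hypothetical, and only DEF_PRIORITY and the "below DEF_PRIORITY - 2"
      condition are taken from the description above.

```c
#include <stdbool.h>

#define DEF_PRIORITY 12   /* value used by the kernel's reclaim code */

/* Walk reclaim priorities from DEF_PRIORITY down to 0, scanning
 * `scanned_per_pass` pages on each pass.  Returns the priority at
 * which writeback gets enabled, or -1 if it never is.
 * `use_priority_trigger` selects the behaviour added by this patch;
 * `writeback_threshold` stands in for the scanned-pages threshold. */
static int writepage_enabled_at(bool use_priority_trigger,
				unsigned long scanned_per_pass,
				unsigned long writeback_threshold)
{
	unsigned long total_scanned = 0;
	bool may_writepage = false;   /* laptop_mode: starts disabled */

	for (int priority = DEF_PRIORITY; priority >= 0; priority--) {
		total_scanned += scanned_per_pass;

		/* New trigger: significant memory pressure. */
		if (use_priority_trigger && priority < DEF_PRIORITY - 2)
			may_writepage = true;
		/* Existing trigger: enough pages were scanned. */
		if (total_scanned > writeback_threshold)
			may_writepage = true;
		if (may_writepage)
			return priority;
	}
	return -1;
}
```

      In the stuck case, where nothing is scanned on any pass, the
      scanned-pages trigger alone never fires in this model, while the
      priority trigger enables writeback once the priority drops to 9.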
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Luigi Semenzato <semenzato@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e50ce3b
    • D
      mm: use NUMA_NO_NODE · 00ef2d2f
      David Rientjes committed
      Make a sweep through mm/ and convert code that uses -1 directly to using
      the more appropriate NUMA_NO_NODE.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00ef2d2f
    • R
      mmu_notifier_unregister NULL Pointer deref and multiple ->release() callouts · 751efd86
      Robin Holt committed
      There is a race condition between mmu_notifier_unregister() and
      __mmu_notifier_release().
      
      Assume two tasks, one calling mmu_notifier_unregister() as a result of a
      filp_close() ->flush() callout (task A), and the other calling
      mmu_notifier_release() from an mmput() (task B).
      
                      A                               B
      t1                                              srcu_read_lock()
      t2              if (!hlist_unhashed())
      t3                                              srcu_read_unlock()
      t4              srcu_read_lock()
      t5                                              hlist_del_init_rcu()
      t6                                              synchronize_srcu()
      t7              srcu_read_unlock()
      t8              hlist_del_rcu()  <--- NULL pointer deref.
      
      Additionally, the list traversal in __mmu_notifier_release() is not
      protected by the mmu_notifier_mm->hlist_lock which can result in
      callouts to the ->release() notifier from both mmu_notifier_unregister()
      and __mmu_notifier_release().
      
      -stable suggestions:
      
      The stable trees prior to 3.7.y need commits 21a92735 and
      70400303 cherry-picked in that order prior to cherry-picking this
      commit.  The 3.7.y tree already has those two commits.
      Signed-off-by: Robin Holt <holt@sgi.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Sagi Grimberg <sagig@mellanox.co.il>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      751efd86
    • C
      mm/memory_hotplug: use pgdat_end_pfn() instead of open coding the same. · c1f19495
      Cody P Schafer committed
      Replace open coded pgdat_end_pfn() with helper function.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1f19495
    • C
      mm/memory_hotplug: use ensure_zone_is_initialized() · 64dd1b29
      Cody P Schafer committed
      Remove open coding of ensure_zone_is_initialized().
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64dd1b29
    • C
      mm: add helper ensure_zone_is_initialized() · f6bbb78e
      Cody P Schafer committed
      ensure_zone_is_initialized() checks if a zone is in an empty & not
      initialized state (typically occurring after it is created in memory
      hotplugging), and, if so, calls init_currently_empty_zone() to
      initialize the zone.
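
      The check-then-initialize pattern can be sketched in userspace as
      below.  The reduced struct zone, its fields and both function
      bodies are hypothetical stand-ins for illustration; they do not
      reproduce the kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for the kernel's struct zone. */
struct zone {
	bool initialized;
	unsigned long start_pfn;
	unsigned long nr_pages;
	unsigned int init_calls;   /* counts initializations, for illustration */
};

/* Stand-in for init_currently_empty_zone(); returns 0 on success. */
static int init_empty_zone(struct zone *z, unsigned long start_pfn,
			   unsigned long nr_pages)
{
	z->start_pfn = start_pfn;
	z->nr_pages = nr_pages;
	z->initialized = true;
	z->init_calls++;
	return 0;
}

/* Initialize the zone only if it has not been initialized yet. */
static int ensure_zone_initialized(struct zone *z, unsigned long start_pfn,
				   unsigned long nr_pages)
{
	if (z->initialized)
		return 0;
	return init_empty_zone(z, start_pfn, nr_pages);
}
```

      Calling the helper twice on the same zone initializes it once; the
      second call returns 0 without touching the zone.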
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6bbb78e