1. 14 9月, 2012 1 次提交
  2. 22 8月, 2012 3 次提交
    • A
      introduce kref_put_mutex() · 8ad5db8a
      Al Viro 提交于
      equivalent of
      	mutex_lock(mutex);
      	if (!kref_put(kref, release))
      		mutex_unlock(mutex);
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8ad5db8a
    • M
      mm: compaction: Abort async compaction if locks are contended or taking too long · c67fe375
      Mel Gorman 提交于
      Jim Schutt reported a problem that pointed at compaction contending
      heavily on locks.  The workload is straight-forward and in his own words;
      
      	The systems in question have 24 SAS drives spread across 3 HBAs,
      	running 24 Ceph OSD instances, one per drive.  FWIW these servers
      	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
      	Ceph Linux clients doing dd simultaneously to a Ceph file system
      	backed by 12 of these servers.
      
      Early in the test everything looks fine
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
        27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
        28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
         6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
        22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0
      
      and then it goes to pot
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
        207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
        123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
        123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
        622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
        223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0
      
      Note that system CPU usage is very high blocks being written out has
      dropped by 42%. He analysed this with perf and found
      
        perf record -g -a sleep 10
        perf report --sort symbol --call-graph fractal,5
          34.63%  [k] _raw_spin_lock_irqsave
                  |
                  |--97.30%-- isolate_freepages
                  |          compaction_alloc
                  |          unmap_and_move
                  |          migrate_pages
                  |          compact_zone
                  |          compact_zone_order
                  |          try_to_compact_pages
                  |          __alloc_pages_direct_compact
                  |          __alloc_pages_slowpath
                  |          __alloc_pages_nodemask
                  |          alloc_pages_vma
                  |          do_huge_pmd_anonymous_page
                  |          handle_mm_fault
                  |          do_page_fault
                  |          page_fault
                  |          |
                  |          |--87.39%-- skb_copy_datagram_iovec
                  |          |          tcp_recvmsg
                  |          |          inet_recvmsg
                  |          |          sock_recvmsg
                  |          |          sys_recvfrom
                  |          |          system_call
                  |          |          __recv
                  |          |          |
                  |          |           --100.00%-- (nil)
                  |          |
                  |           --12.61%-- memcpy
                   --2.70%-- [...]
      
      There was other data but primarily it is all showing that compaction is
      contended heavily on the zone->lock and zone->lru_lock.
      
      commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
      while isolating pages for migration] noted that it was possible for
      migration to hold the lru_lock for an excessive amount of time. Very
      broadly speaking this patch expands the concept.
      
      This patch introduces compact_checklock_irqsave() to check if a lock
      is contended or the process needs to be scheduled. If either condition
      is true then async compaction is aborted and the caller is informed.
      The page allocator will fail a THP allocation if compaction failed due
      to contention. This patch also introduces compact_trylock_irqsave()
      which will acquire the lock only if it is not contended and the process
      does not need to schedule.
      Reported-by: NJim Schutt <jaschut@sandia.gov>
      Tested-by: NJim Schutt <jaschut@sandia.gov>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c67fe375
    • W
      string: do not export memweight() to userspace · c3a5ce04
      WANG Cong 提交于
      Fix the following warning:
      
        usr/include/linux/string.h:8: userspace cannot reference function or variable defined in the kernel
      Signed-off-by: NWANG Cong <xiyou.wangcong@gmail.com>
      Acked-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3a5ce04
  3. 20 8月, 2012 2 次提交
  4. 17 8月, 2012 2 次提交
  5. 15 8月, 2012 8 次提交
  6. 10 8月, 2012 2 次提交
    • K
      Yama: higher restrictions should block PTRACE_TRACEME · 9d8dad74
      Kees Cook 提交于
      The higher ptrace restriction levels should be blocking even
      PTRACE_TRACEME requests. The comments in the LSM documentation are
      misleading about when the checks happen (the parent does not go through
      security_ptrace_access_check() on a PTRACE_TRACEME call).
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org # 3.5.x and later
      Signed-off-by: NJames Morris <james.l.morris@oracle.com>
      9d8dad74
    • P
      netfilter: nf_ct_sip: fix IPv6 address parsing · 02b69cbd
      Patrick McHardy 提交于
      Within SIP messages IPv6 addresses are enclosed in square brackets in most
      cases, with the exception of the "received=" header parameter. Currently
      the helper fails to parse enclosed addresses.
      
      This patch:
      
      - changes the SIP address parsing function to enforce square brackets
        when required, and accept them when not required but present, as
        recommended by RFC 5118.
      
      - adds a new SDP address parsing function that never accepts square
        brackets since SDP doesn't use them.
      
      With these changes, the SIP helper correctly parses all test messages
      from RFC 5118 (Session Initiation Protocol (SIP) Torture Test Messages
      for Internet Protocol Version 6 (IPv6)).
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      02b69cbd
  7. 09 8月, 2012 3 次提交
    • A
      Input: eeti_ts: pass gpio value instead of IRQ · 4eef6cbf
      Arnd Bergmann 提交于
      The EETI touchscreen asserts its IRQ line as soon as it has data in its
      internal buffers. The line is automatically deasserted once all data has
      been read via I2C. Hence, the driver has to monitor the GPIO line and
      cannot simply rely on the interrupt handler reception.
      
      In the current implementation of the driver, irq_to_gpio() is used to
      determine the GPIO number from the i2c_client's IRQ value.
      
      As irq_to_gpio() is not available on all platforms, this patch changes
      this and makes the driver ignore the passed in IRQ. Instead, a GPIO is
      added to the platform_data struct and gpio_to_irq is used to derive the
      IRQ from that GPIO. If this fails, bail out. The driver is only able to
      work in environments where the touchscreen GPIO can be mapped to an
      IRQ.
      
      Without this patch, building raumfeld_defconfig results in:
      
      drivers/input/touchscreen/eeti_ts.c: In function 'eeti_ts_irq_active':
      drivers/input/touchscreen/eeti_ts.c:65:2: error: implicit declaration of function 'irq_to_gpio' [-Werror=implicit-function-declaration]
      Signed-off-by: NDaniel Mack <zonque@gmail.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: stable@vger.kernel.org (v3.2+)
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Sven Neumann <s.neumann@raumfeld.com>
      Cc: linux-input@vger.kernel.org
      Cc: Haojian Zhuang <haojian.zhuang@gmail.com>
      4eef6cbf
    • A
      ARM: pxa: remove irq_to_gpio from ezx-pcap driver · 59ee93a5
      Arnd Bergmann 提交于
      The irq_to_gpio function was removed from the pxa platform
      in linux-3.2, and this driver has been broken since.
      
      There is actually no in-tree user of this driver that adds
      this platform device, but the driver can and does get enabled
      on some platforms.
      
      Without this patch, building ezx_defconfig results in:
      
      drivers/mfd/ezx-pcap.c: In function 'pcap_isr_work':
      drivers/mfd/ezx-pcap.c:205:2: error: implicit declaration of function 'irq_to_gpio' [-Werror=implicit-function-declaration]
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NHaojian Zhuang <haojian.zhuang@gmail.com>
      Cc: stable@vger.kernel.org (v3.2+)
      Cc: Samuel Ortiz <sameo@linux.intel.com>
      Cc: Daniel Ribeiro <drwyrm@gmail.com>
      59ee93a5
    • R
      Revert "NMI watchdog: fix for lockup detector breakage on resume" · 300d3739
      Rafael J. Wysocki 提交于
      Revert commit 45226e94 (NMI watchdog: fix for lockup detector breakage
      on resume) which breaks resume from system suspend on my SH7372
      Mackerel board (by causing a NULL pointer dereference to happen) and
      is generally wrong, because it abuses the CPU hotplug functionality
      in a shamelessly blatant way.
      
      The original issue should be addressed through appropriate syscore
      resume callback instead.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      300d3739
  8. 08 8月, 2012 1 次提交
  9. 07 8月, 2012 2 次提交
    • O
      canfd: remove redundant CAN FD flag · 035534ed
      Oliver Hartkopp 提交于
      The first idea of the CAN FD implementation started with a new struct
      canfd_frame to be used for both CAN FD frames and legacy CAN frames.
      The now mainlined implementation supports both CAN frame types simultaneously
      and distinguishes them only by their required sizes: CAN_MTU and CANFD_MTU.
      
      Only the struct canfd_frame contains a flags element which is needed for the
      additional CAN FD information. As CAN FD implicitly means that the 'Extened
      Data Length' mode is enabled the formerly defined CANFD_EDL bit became
      redundant and also confusing as an unset bit would be an error and would
      always need to be tested.
      
      This patch removes the obsolete CANFD_EDL bit and clarifies the documentation
      for the use of struct canfd_frame and the CAN FD relevant flags.
      Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>
      035534ed
    • E
      net: ipv6: fix TCP early demux · 5d299f3d
      Eric Dumazet 提交于
      IPv6 needs a cookie in dst_check() call.
      
      We need to add rx_dst_cookie and provide a family independent
      sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5d299f3d
  10. 06 8月, 2012 1 次提交
    • T
      ext4: make sure the journal sb is written in ext4_clear_journal_err() · d796c52e
      Theodore Ts'o 提交于
      After we transfer set the EXT4_ERROR_FS bit in the file system
      superblock, it's not enough to call jbd2_journal_clear_err() to clear
      the error indication from journal superblock --- we need to call
      jbd2_journal_update_sb_errno() as well.  Otherwise, when the root file
      system is mounted read-only, the journal is replayed, and the error
      indicator is transferred to the superblock --- but the s_errno field
      in the jbd2 superblock is left set (since although we cleared it in
      memory, we never flushed it out to disk).
      
      This can end up confusing e2fsck.  We should make e2fsck more robust
      in this case, but the kernel shouldn't be leaving things in this
      confused state, either.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
      
      d796c52e
  11. 04 8月, 2012 2 次提交
    • A
      vfs: nuke pdflush from comments · 0d5c3eba
      Artem Bityutskiy 提交于
      The pdflush thread is long gone, so this patch removes references to pdflush
      from vfs comments.
      Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0d5c3eba
    • A
      vfs: kill write_super and sync_supers · f0cd2dbb
      Artem Bityutskiy 提交于
      Finally we can kill the 'sync_supers' kernel thread along with the
      '->write_super()' superblock operation because all the users are gone.
      Now every file-system is supposed to self-manage own superblock and
      its dirty state.
      
      The nice thing about killing this thread is that it improves power management.
      Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
      every 5 seconds no matter what - even if there were no dirty superblocks and
      even if there were no file-systems using this service (e.g., btrfs and
      journalled ext4 do not need it). So it was wasting power most of the time. And
      because the thread was in the core of the kernel, all systems had to have it.
      So I am quite happy to make it go away.
      
      Interestingly, this thread is a left-over from the pdflush kernel thread which
      was a self-forking kernel thread responsible for all the write-back in old
      Linux kernels. It was turned into per-block device BDI threads, and
      'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.
      Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f0cd2dbb
  12. 03 8月, 2012 5 次提交
  13. 02 8月, 2012 2 次提交
    • B
      net: Allow driver to limit number of GSO segments per skb · 30b678d8
      Ben Hutchings 提交于
      A peer (or local user) may cause TCP to use a nominal MSS of as little
      as 88 (actual MSS of 76 with timestamps).  Given that we have a
      sufficiently prodigious local sender and the peer ACKs quickly enough,
      it is nevertheless possible to grow the window for such a connection
      to the point that we will try to send just under 64K at once.  This
      results in a single skb that expands to 861 segments.
      
      In some drivers with TSO support, such an skb will require hundreds of
      DMA descriptors; a substantial fraction of a TX ring or even more than
      a full ring.  The TX queue selected for the skb may stall and trigger
      the TX watchdog repeatedly (since the problem skb will be retried
      after the TX reset).  This particularly affects sfc, for which the
      issue is designated as CVE-2012-3412.
      
      Therefore:
      1. Add the field net_device::gso_max_segs holding the device-specific
         limit.
      2. In netif_skb_features(), if the number of segments is too high then
         mask out GSO features to force fall back to software GSO.
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30b678d8
    • J
      locks: remove unused lm_release_private · 068535f1
      J. Bruce Fields 提交于
      In commit 3b6e2723 ("locks: prevent side-effects of
      locks_release_private before file_lock is initialized") we removed the
      last user of lm_release_private without removing the field itself.
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      068535f1
  14. 01 8月, 2012 6 次提交
    • V
      block: add partition resize function to blkpg ioctl · c83f6bf9
      Vivek Goyal 提交于
      Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
      allows altering the size of an existing partition, even if it is currently
      in use.
      
      This patch converts hd_struct->nr_sects into sequence counter because
      One might extend a partition while IO is happening to it and update of
      nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
      can lead to issues like reading inconsistent size of a partition. Sequence
      counter have been used so that readers don't have to take bdev mutex lock
      as we call sector_in_part() very frequently.
      
      Now all the access to hd_struct->nr_sects should happen using sequence
      counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
      There is one exception though, set_capacity()/get_capacity(). I think
      theoritically race should exist there too but this patch does not
      modify set_capacity()/get_capacity() due to sheer number of call sites
      and I am afraid that change might break something. I have left that as a
      TODO item. We can handle it later if need be. This patch does not introduce
      any new races as such w.r.t set_capacity()/get_capacity().
      
      v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NPhillip Susi <psusi@ubuntu.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c83f6bf9
    • G
      dmaengine: shdma: restore partial transfer calculation · 4f46f8ac
      Guennadi Liakhovetski 提交于
      The recent shdma driver split has mistakenly removed support for partial
      DMA transfer size calculation on forced termination. This patch restores
      it.
      Signed-off-by: NGuennadi Liakhovetski <g.liakhovetski@gmx.de>
      Acked-by: NVinod Koul <vinod.koul@linux.intel.com>
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      4f46f8ac
    • A
      Platform: OLPC: turn EC driver into a platform_driver · ac250415
      Andres Salomon 提交于
      The 1.75-based OLPC EC driver already does this; let's do it for all EC
      drivers.  This gives us nice suspend/resume hooks, amongst other things.
      
      We want to run the EC's suspend hooks later than other drivers (which may
      be setting wakeup masks or be running EC commands).  We also want to run
      the EC's resume hooks earlier than other drivers (which may want to run EC
      commands).
      Signed-off-by: NAndres Salomon <dilinger@queued.net>
      Acked-by: NPaul Fox <pgf@laptop.org>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      ac250415
    • A
      Platform: OLPC: allow EC cmd to be overridden, and create a workqueue to call it · 3d26c20b
      Andres Salomon 提交于
      This provides a new API allows different OLPC architectures to override the
      EC driver.  x86 and ARM OLPC machines use completely different EC backends.
      
      The olpc_ec_cmd is synchronous, and waits for the workqueue to send the
      command to the EC.  Multiple callers can run olpc_ec_cmd() at once, and
      they will by serialized and sleep while only one executes on the EC at a time.
      
      We don't provide an unregister function, as that doesn't make sense within
      the context of OLPC machines - there's only ever 1 EC, it's critical to
      functionality, and it certainly not hotpluggable.
      Signed-off-by: NAndres Salomon <dilinger@queued.net>
      Acked-by: NPaul Fox <pgf@laptop.org>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      3d26c20b
    • A
      Platform: OLPC: add a stub to drivers/platform/ for the OLPC EC driver · 392a325c
      Andres Salomon 提交于
      The OLPC EC driver has outgrown arch/x86/platform/.  It's time to both
      share common code amongst different architectures, as well as move it out
      of arch/x86/.  The XO-1.75 is ARM-based, and the EC driver shares a lot of
      code with the x86 code.
      Signed-off-by: NAndres Salomon <dilinger@queued.net>
      Acked-by: NPaul Fox <pgf@laptop.org>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      392a325c
    • M
      mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables · d833352a
      Mel Gorman 提交于
      If a process creates a large hugetlbfs mapping that is eligible for page
      table sharing and forks heavily with children some of whom fault and
      others which destroy the mapping then it is possible for page tables to
      get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
      output a message to the kernel log.  The final teardown will trigger a
      BUG_ON in mm/filemap.c.
      
      This was reproduced in 3.4 but is known to have existed for a long time
      and goes back at least as far as 2.6.37.  It was probably was introduced
      in 2.6.20 by [39dde65c: shared page table for hugetlb page].  The messages
      look like this;
      
      [  ..........] Lots of bad pmd messages followed by this
      [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
      [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
      [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
      [  127.186778] ------------[ cut here ]------------
      [  127.186781] kernel BUG at mm/filemap.c:134!
      [  127.186782] invalid opcode: 0000 [#1] SMP
      [  127.186783] CPU 7
      [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
      [  127.186801]
      [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
      [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
      [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
      [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
      [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
      [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
      [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
      [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
      [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
      [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
      [  127.186821] Stack:
      [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
      [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
      [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
      [  127.186827] Call Trace:
      [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
      [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
      [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
      [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
      [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
      [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
      [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
      [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
      [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
      [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
      [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
      [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
      [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
      [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
      [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
      [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
      [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186870]  RSP <ffff8804144b5c08>
      [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
      
      The bug is a race and not always easy to reproduce.  To reproduce it I was
      doing the following on a single socket I7-based machine with 16G of RAM.
      
      $ hugeadm --pool-pages-max DEFAULT:13G
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
      $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
      
      On my particular machine, it usually triggers within 10 minutes but
      enabling debug options can change the timing such that it never hits.
      Once the bug is triggered, the machine is in trouble and needs to be
      rebooted.  The machine will respond but processes accessing proc like "ps
      aux" will hang due to the BUG_ON.  shutdown will also hang and needs a
      hard reset or a sysrq-b.
      
      The basic problem is a race between page table sharing and teardown.  For
      the most part page table sharing depends on i_mmap_mutex.  In some cases,
      it is also taking the mm->page_table_lock for the PTE updates but with
      shared page tables, it is the i_mmap_mutex that is more important.
      
      Unfortunately it appears to be also insufficient. Consider the following
      situation
      
      Process A					Process B
      ---------					---------
      hugetlb_fault					shmdt
        						LockWrite(mmap_sem)
          						  do_munmap
      						    unmap_region
      						      unmap_vmas
      						        unmap_single_vma
      						          unmap_hugepage_range
            						            Lock(i_mmap_mutex)
      							    Lock(mm->page_table_lock)
      							    huge_pmd_unshare/unmap tables <--- (1)
      							    Unlock(mm->page_table_lock)
            						            Unlock(i_mmap_mutex)
        huge_pte_alloc				      ...
          Lock(i_mmap_mutex)				      ...
          vma_prio_walk, find svma, spte		      ...
          Lock(mm->page_table_lock)			      ...
          share spte					      ...
          Unlock(mm->page_table_lock)			      ...
          Unlock(i_mmap_mutex)			      ...
        hugetlb_no_page									  <--- (2)
      						      free_pgtables
      						        unlink_file_vma
      							hugetlb_free_pgd_range
      						    remove_vma_list
      
      In this scenario, it is possible for Process A to share page tables with
      Process B that is trying to tear them down.  The i_mmap_mutex on its own
      does not prevent Process A walking Process B's page tables.  At (1) above,
      the page tables are not shared yet so it unmaps the PMDs.  Process A sets
      up page table sharing and at (2) faults a new entry.  Process B then trips
      up on it in free_pgtables.
      
      This patch fixes the problem by adding a new function
      __unmap_hugepage_range_final that is only called when the VMA is about to
      be destroyed.  This function clears VM_MAYSHARE during
      unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
      ineligible for sharing and avoids the race.  Superficially this looks like
      it would then be vunerable to truncate and madvise issues but hugetlbfs
      has its own truncate handlers so does not use unmap_mapping_range() and
      does not support madvise(DONTNEED).
      
      This should be treated as a -stable candidate if it is merged.
      
      Test program is as follows. The test case was mostly written by Michal
      Hocko with a few minor changes to reproduce this bug.
      
      ==== CUT HERE ====
      
      static size_t huge_page_size = (2UL << 20);
      static size_t nr_huge_page_A = 512;
      static size_t nr_huge_page_B = 5632;
      
      unsigned int get_random(unsigned int max)
      {
      	struct timeval tv;
      
      	gettimeofday(&tv, NULL);
      	srandom(tv.tv_usec);
      	return random() % max;
      }
      
      static void play(void *addr, size_t size)
      {
      	unsigned char *start = addr,
      		      *end = start + size,
      		      *a;
      	start += get_random(size/2);
      
      	/* we could itterate on huge pages but let's give it more time. */
      	for (a = start; a < end; a += 4096)
      		*a = 0;
      }
      
      int main(int argc, char **argv)
      {
      	key_t key = IPC_PRIVATE;
      	size_t sizeA = nr_huge_page_A * huge_page_size;
      	size_t sizeB = nr_huge_page_B * huge_page_size;
      	int shmidA, shmidB;
      	void *addrA = NULL, *addrB = NULL;
      	int nr_children = 300, n = 0;
      
      	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      
      fork_child:
      	switch(fork()) {
      		case 0:
      			switch (n%3) {
      			case 0:
      				play(addrA, sizeA);
      				break;
      			case 1:
      				play(addrB, sizeB);
      				break;
      			case 2:
      				break;
      			}
      			break;
      		case -1:
      			perror("fork:");
      			break;
      		default:
      			if (++n < nr_children)
      				goto fork_child;
      			play(addrA, sizeA);
      			break;
      	}
      	shmdt(addrA);
      	shmdt(addrB);
      	do {
      		wait(NULL);
      	} while (--n > 0);
      	shmctl(shmidA, IPC_RMID, NULL);
      	shmctl(shmidB, IPC_RMID, NULL);
      	return 0;
      }
      
      [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d833352a