1. 29 9月, 2017 3 次提交
  2. 15 9月, 2017 1 次提交
  3. 14 9月, 2017 1 次提交
    • M
      mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4
      Michal Hocko 提交于
      GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
      and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  It's
      primary motivation was to allow users to tell that an allocation is
      short lived and so the allocator can try to place such allocations close
      together and prevent long term fragmentation.  As much as this sounds
      like a reasonable semantic it becomes much less clear when to use the
      highlevel GFP_TEMPORARY allocation flag.  How long is temporary? Can the
      context holding that memory sleep? Can it take locks? It seems there is
      no good answer for those questions.
      
      The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
      __GFP_RECLAIMABLE which in itself is tricky because basically none of
      the existing caller provide a way to reclaim the allocated memory.  So
      this is rather misleading and hard to evaluate for any benefits.
      
      I have checked some random users and none of them has added the flag
      with a specific justification.  I suspect most of them just copied from
      other existing users and others just thought it might be a good idea to
      use without any measuring.  This suggests that GFP_TEMPORARY just
      motivates for cargo cult usage without any reasoning.
      
      I believe that our gfp flags are quite complex already and especially
      those with highlevel semantic should be clearly defined to prevent from
      confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
      replace all existing users to simply use GFP_KERNEL.  Please note that
      SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
      so they will be placed properly for memory fragmentation prevention.
      
      I can see reasons we might want some gfp flag to reflect shorterm
      allocations but I propose starting from a clear semantic definition and
      only then add users with proper justification.
      
      This was been brought up before LSF this year by Matthew [1] and it
      turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
      seems to be a heuristic without any measured advantage for most (if not
      all) its current users.  The follow up discussion has revealed that
      opinions on what might be temporary allocation differ a lot between
      developers.  So rather than trying to tweak existing users into a
      semantic which they haven't expected I propose to simply remove the flag
      and start from scratch if we really need a semantic for short term
      allocations.
      
      [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: drm/i915: fix up]
        Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ee931c4
  4. 12 9月, 2017 1 次提交
    • J
      xdp: implement xdp_redirect_map for generic XDP · 96c5508e
      Jesper Dangaard Brouer 提交于
      Using bpf_redirect_map is allowed for generic XDP programs, but the
      appropriate map lookup was never performed in xdp_do_generic_redirect().
      
      Instead the map-index is directly used as the ifindex.  For the
      xdp_redirect_map sample in SKB-mode '-S', this resulted in trying
      sending on ifindex 0 which isn't valid, resulting in getting SKB
      packets dropped.  Thus, the reported performance numbers are wrong in
      commit 24251c26 ("samples/bpf: add option for native and skb mode
      for redirect apps") for the 'xdp_redirect_map -S' case.
      
      Before commit 109980b8 ("bpf: don't select potentially stale
      ri->map from buggy xdp progs") it could crash the kernel.  Like this
      commit also check that the map_owner owner is correct before
      dereferencing the map pointer.  But make sure that this API misusage
      can be caught by a tracepoint. Thus, allowing userspace via
      tracepoints to detect misbehaving bpf_progs.
      
      Fixes: 6103aa96 ("net: implement XDP_REDIRECT for xdp generic")
      Fixes: 24251c26 ("samples/bpf: add option for native and skb mode for redirect apps")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96c5508e
  5. 11 9月, 2017 1 次提交
    • G
      block: tolerate tracing of NULL bio · f8e9ec16
      Greg Thelen 提交于
      __get_request() can call trace_block_getrq() with bio=NULL which causes
      block_get_rq::TP_fast_assign() to deref a NULL pointer and panic.
      
      Syzkaller fuzzer panics with
      linux-next (1d53d908b79d7870d89063062584eead4cf83448):
        kasan: GPF could be caused by NULL-ptr deref or user memory access
        general protection fault: 0000 [#1] SMP KASAN
        Modules linked in:
        CPU: 0 PID: 2983 Comm: syzkaller401111 Not tainted 4.13.0-rc7-next-20170901+ #13
        task: ffff8801cf1da000 task.stack: ffff8801ce440000
        RIP: 0010:perf_trace_block_get_rq+0x697/0x970 include/trace/events/block.h:384
        RSP: 0018:ffff8801ce4473f0 EFLAGS: 00010246
        RAX: ffff8801cf1da000 RBX: 1ffff10039c88e84 RCX: 1ffffd1ffff84d27
        RDX: dffffc0000000001 RSI: 1ffff1003b643e7a RDI: ffffe8ffffc26938
        RBP: ffff8801ce447530 R08: 1ffff1003b643e6c R09: ffffe8ffffc26964
        R10: 0000000000000002 R11: fffff91ffff84d2d R12: ffffe8ffffc1f890
        R13: ffffe8ffffc26930 R14: ffffffff85cad9e0 R15: 0000000000000000
        FS:  0000000002641880(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000000000043e670 CR3: 00000001d1d7a000 CR4: 00000000001406f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          trace_block_getrq include/trace/events/block.h:423 [inline]
          __get_request block/blk-core.c:1283 [inline]
          get_request+0x1518/0x23b0 block/blk-core.c:1355
          blk_old_get_request block/blk-core.c:1402 [inline]
          blk_get_request+0x1d8/0x3c0 block/blk-core.c:1427
          sg_scsi_ioctl+0x117/0x750 block/scsi_ioctl.c:451
          sg_ioctl+0x192d/0x2ed0 drivers/scsi/sg.c:1070
          vfs_ioctl fs/ioctl.c:45 [inline]
          do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
          SYSC_ioctl fs/ioctl.c:700 [inline]
          SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
          entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      block_get_rq::TP_fast_assign() has multiple redundant ->dev assignments.
      Only one of them is NULL tolerant.  Favor the NULL tolerant one.
      
      Fixes: 74d46992 ("block: replace bi_bdev with a gendisk pointer and partitions index")
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f8e9ec16
  6. 07 9月, 2017 2 次提交
    • R
      mm,fork: introduce MADV_WIPEONFORK · d2cd9ede
      Rik van Riel 提交于
      Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
      in the child process after fork.  This differs from MADV_DONTFORK in one
      important way.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      MADV_WIPEONFORK only works on private, anonymous VMAs.
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-inialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
      
      [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
      Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.comSigned-off-by: NRik van Riel <riel@redhat.com>
      Reported-by: NFlorian Weimer <fweimer@redhat.com>
      Reported-by: NColm MacCártaigh <colm@allcosts.net>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2cd9ede
    • R
      dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Ross Zwisler 提交于
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
          This distinction is made by via vm_normal_page(). vm_normal_page()
          returns false for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91d25ba8
  7. 01 9月, 2017 1 次提交
  8. 31 8月, 2017 2 次提交
  9. 30 8月, 2017 7 次提交
  10. 25 8月, 2017 2 次提交
  11. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  12. 22 8月, 2017 1 次提交
  13. 19 8月, 2017 1 次提交
    • J
      xdp: adjust xdp redirect tracepoint to include return error code · 4c03bdd7
      Jesper Dangaard Brouer 提交于
      The return error code need to be included in the tracepoint
      xdp:xdp_redirect, else its not possible to distinguish successful or
      failed XDP_REDIRECT transmits.
      
      XDP have no queuing mechanism. Thus, it is fairly easily to overrun a
      NIC transmit queue.  The eBPF program invoking helpers (bpf_redirect
      or bpf_redirect_map) to redirect a packet doesn't get any feedback
      whether the packet was actually transmitted.
      
      Info on failed transmits in the tracepoint xdp:xdp_redirect, is
      interesting as this opens for providing a feedback-loop to the
      receiving XDP program.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c03bdd7
  14. 18 8月, 2017 1 次提交
  15. 17 8月, 2017 1 次提交
    • J
      qdisc: add tracepoint qdisc:qdisc_dequeue for dequeued SKBs · e543002f
      Jesper Dangaard Brouer 提交于
      The main purpose of this tracepoint is to monitor bulk dequeue
      in the network qdisc layer, as it cannot be deducted from the
      existing qdisc stats.
      
      The txq_state can be used for determining the reason for zero packet
      dequeues, see enum netdev_queue_state_t.
      
      Notice all packets doesn't necessary activate this tracepoint. As
      qdiscs with flag TCQ_F_CAN_BYPASS, can directly invoke
      sch_direct_xmit() when qdisc_qlen is zero.
      
      Remember that perf record supports filters like:
      
       perf record -e qdisc:qdisc_dequeue \
        --filter 'ifindex == 4 && (packets > 1 || txq_state > 0)'
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e543002f
  16. 16 8月, 2017 4 次提交
  17. 31 7月, 2017 1 次提交
  18. 29 7月, 2017 1 次提交
  19. 25 7月, 2017 1 次提交
    • P
      rcutorture: Place event-traced strings into trace buffer · b3c98314
      Paul E. McKenney 提交于
      Strings used in event tracing need to be specially handled, for example,
      being copied to the trace buffer instead of being pointed to by the trace
      buffer.  Although the TPS() macro can be used to "launder" pointed-to
      strings, this might not be all that effective within a loadable module.
      This commit therefore copies rcutorture's strings to the trace buffer.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      b3c98314
  20. 18 7月, 2017 1 次提交
  21. 13 7月, 2017 1 次提交
    • M
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko 提交于
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
      the page allocator.  This has been true but only for allocations
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator for ever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most light weight mode which even
         doesn't kick the background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
  22. 11 7月, 2017 2 次提交
    • R
      mm/oom_kill.c: add tracepoints for oom reaper-related events · 422580c3
      Roman Gushchin 提交于
      During the debugging of the problem described in
      https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in
      https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug
      output is not really useful to understand issues related to the oom
      reaper.
      
      So, I assume, that adding some tracepoints might help with debugging of
      similar issues.
      
      Trace the following events:
       1) a process is marked as an oom victim,
       2) a process is added to the oom reaper list,
       3) the oom reaper starts reaping process's mm,
       4) the oom reaper finished reaping,
       5) the oom reaper skips reaping.
      
      How it works in practice? Below is an example which show how the problem
      mentioned above can be found: one process is added twice to the
      oom_reaper list:
      
        $ cd /sys/kernel/debug/tracing
        $ echo "oom:mark_victim" > set_event
        $ echo "oom:wake_reaper" >> set_event
        $ echo "oom:skip_task_reaping" >> set_event
        $ echo "oom:start_task_reaping" >> set_event
        $ echo "oom:finish_task_reaping" >> set_event
        $ cat trace_pipe
                allocate-502   [001] ....    91.836405: mark_victim: pid=502
                allocate-502   [001] .N..    91.837356: wake_reaper: pid=502
                allocate-502   [000] .N..    91.871149: wake_reaper: pid=502
              oom_reaper-23    [000] ....    91.871177: start_task_reaping: pid=502
              oom_reaper-23    [000] .N..    91.879511: finish_task_reaping: pid=502
              oom_reaper-23    [000] ....    91.879580: skip_task_reaping: pid=502
      
      Link: http://lkml.kernel.org/r/20170530185231.GA13412@castleSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      422580c3
    • S
      oom, trace: remove ENUM evaluation of COMPACTION_FEEDBACK · 7ab0e50a
      Steven Rostedt (VMware) 提交于
      After enabling CONFIG_TRACE_ENUM_MAP_FILE (which will soon be renamed to
      CONFIG_TRACE_EVAL_MAP_FILE), I am able to examine the enums that have
      been evaluated:
      
       # cat /sys/kernel/debug/tracing/enum_map
      
      (which will soon be renamed to eval_map)
      
      And it showed some interesting results:
      
        [..]
        ZONE_MOVABLE 3 (oom)
        ZONE_NORMAL 2 (oom)
        ZONE_DMA32 1 (oom)
        ZONE_DMA 0 (oom)
        3 3 (oom)
        2 2 (oom)
        1 1 (oom)
        COMPACT_PRIO_ASYNC 2 (oom)
        COMPACT_PRIO_SYNC_LIGHT 1 (oom)
        COMPACT_PRIO_SYNC_FULL 0 (oom)
        [..]
        ZONE_DMA 0 (vmscan)
        3 3 (vmscan)
        2 2 (vmscan)
        1 1 (vmscan)
        COMPACT_PRIO_ASYNC 2 (vmscan)
        [..]
        ZONE_DMA 0 (kmem)
        3 3 (kmem)
        2 2 (kmem)
        1 1 (kmem)
        COMPACT_PRIO_ASYNC 2 (kmem)
        [..]
        ZONE_DMA 0 (compaction)
        3 3 (compaction)
        2 2 (compaction)
        1 1 (compaction)
        COMPACT_PRIO_ASYNC 2 (compaction)
        [..]
      
      The name within the parenthesis are the trace systems that the enum/eval
      maps are associated with. When there's a number evaluated to another
      number, that tells me that the TRACE_DEFINE_ENUM() was used on a #define
      and not an enum. As #defines get converted normally, they are not needed
      to be evaluated.
      
      Each of the above trace systems with the number to number evaluation
      included the file include/trace/events/mmflags.h which has:
      
       /* High-level compaction status feedback */
       #define COMPACTION_FAILED       1
       #define COMPACTION_WITHDRAWN    2
       #define COMPACTION_PROGRESS     3
      
      [..]
      
       #define COMPACTION_FEEDBACK             \
              EM(COMPACTION_FAILED,           "failed")       \
              EM(COMPACTION_WITHDRAWN,        "withdrawn")    \
              EMe(COMPACTION_PROGRESS,        "progress")
      
      Which is still needed for the __print_symbolic() usage in the
      trace_event.  But it is not needed to be evaluated.
      
      Removing the evaluation part removes the unnecessary evaluations of
      numbers to numbers.
      
      Link: http://lkml.kernel.org/r/20170615074944.7be9a647@gandalf.local.homeSigned-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ab0e50a
  23. 06 7月, 2017 1 次提交
    • J
      fs: new infrastructure for writeback error handling and reporting · 5660e13d
      Jeff Layton 提交于
      Most filesystems currently use mapping_set_error and
      filemap_check_errors for setting and reporting/clearing writeback errors
      at the mapping level. filemap_check_errors is indirectly called from
      most of the filemap_fdatawait_* functions and from
      filemap_write_and_wait*. These functions are called from all sorts of
      contexts to wait on writeback to finish -- e.g. mostly in fsync, but
      also in truncate calls, getattr, etc.
      
      The non-fsync callers are problematic. We should be reporting writeback
      errors during fsync, but many places spread over the tree clear out
      errors before they can be properly reported, or report errors at
      nonsensical times.
      
      If I get -EIO on a stat() call, there is no reason for me to assume that
      it is because some previous writeback failed. The fact that it also
      clears out the error such that a subsequent fsync returns 0 is a bug,
      and a nasty one since that's potentially silent data corruption.
      
      This patch adds a small bit of new infrastructure for setting and
      reporting errors during address_space writeback. While the above was my
      original impetus for adding this, I think it's also the case that
      current fsync semantics are just problematic for userland. Most
      applications that call fsync do so to ensure that the data they wrote
      has hit the backing store.
      
      In the case where there are multiple writers to the file at the same
      time, this is really hard to determine. The first one to call fsync will
      see any stored error, and the rest get back 0. The processes with open
      fds may not be associated with one another in any way. They could even
      be in different containers, so ensuring coordination between all fsync
      callers is not really an option.
      
      One way to remedy this would be to track what file descriptor was used
      to dirty the file, but that's rather cumbersome and would likely be
      slow. However, there is a simpler way to improve the semantics here
      without incurring too much overhead.
      
      This set adds an errseq_t to struct address_space, and a corresponding
      one is added to struct file. Writeback errors are recorded in the
      mapping's errseq_t, and the one in struct file is used as the "since"
      value.
      
      This changes the semantics of the Linux fsync implementation such that
      applications can now use it to determine whether there were any
      writeback errors since fsync(fd) was last called (or since the file was
      opened in the case of fsync having never been called).
      
      Note that those writeback errors may have occurred when writing data
      that was dirtied via an entirely different fd, but that's the case now
      with the current mapping_set_error/filemap_check_error infrastructure.
      This will at least prevent you from getting a false report of success.
      
      The new behavior is still consistent with the POSIX spec, and is more
      reliable for application developers. This patch just adds some basic
      infrastructure for doing this, and ensures that the f_wb_err "cursor"
      is properly set when a file is opened. Later patches will change the
      existing code to use this new infrastructure for reporting errors at
      fsync time.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      5660e13d
  24. 21 6月, 2017 1 次提交
  25. 20 6月, 2017 1 次提交