1. 30 September 2017 (2 commits)
    • sched: Implement interface for cgroup unified hierarchy · 0d593634
      Committed by Tejun Heo
      There are a couple of interface issues which can be addressed in the
      cgroup2 interface.
      
      * Stats from cpuacct being reported separately from the cpu stats.
      
      * Use of different time units.  Writable control knobs use
        microseconds, some stat fields use nanoseconds while other cpuacct
        stat fields use centiseconds.
      
      * Control knobs which can't be used in the root cgroup still show up
        in the root.
      
      * Control knob names and semantics aren't consistent with other
        controllers.
      
      This patchset implements the cpu controller's interface on cgroup2,
      which adheres to the controller file conventions described in
      Documentation/cgroups/cgroup-v2.txt.  Overall, the following changes
      are made.
      
      * cpuacct is implicitly enabled and disabled by cpu and its information
        is reported through "cpu.stat", which now uses microseconds for all
        time durations.  All time duration fields now have "_usec" appended
        to them for clarity.
      
        Note that cpuacct.usage_percpu is currently not included in
        "cpu.stat".  If this information is actually called for, it will be
        added later.
      
      * "cpu.shares" is replaced with "cpu.weight" and operates on the
        standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
        The weight is scaled to scheduler weight so that 100 maps to 1024
        and the ratio relationship is preserved - if weight is W and its
        scaled value is S, W / 100 == S / 1024.  While the mapped range is a
        bit smaller than the orignal scheduler weight range, the dead zones
        on both sides are relatively small and covers wider range than the
        nice value mappings.  This file doesn't make sense in the root
        cgroup and isn't created on root.
      
      * "cpu.weight.nice" is added. When read, it reads back the nice value
        which is closest to the current "cpu.weight".  When written, it sets
        "cpu.weight" to the weight value which matches the nice value.  This
        makes it easy to configure cgroups when they're competing against
        threads in threaded subtrees.
      
      * "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
        which contains both quota and period.
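
      As an illustration of the weight mapping above, a minimal userspace
      sketch (the helper name is made up; the kernel's conversion may
      differ in rounding and clamping):

        #include <stdio.h>

        /* Constants as described above. */
        #define CGROUP_WEIGHT_MIN       1UL
        #define CGROUP_WEIGHT_DFL       100UL
        #define CGROUP_WEIGHT_MAX       10000UL
        #define SCHED_WEIGHT_DFL        1024UL  /* scheduler weight of nice 0 */

        /* Preserve the ratio W / 100 == S / 1024 described above. */
        static unsigned long weight_to_sched_weight(unsigned long w)
        {
                return w * SCHED_WEIGHT_DFL / CGROUP_WEIGHT_DFL;
        }

        int main(void)
        {
                /* prints: 1 -> 10, 100 -> 1024, 10000 -> 102400 */
                printf("%lu %lu %lu\n",
                       weight_to_sched_weight(CGROUP_WEIGHT_MIN),
                       weight_to_sched_weight(CGROUP_WEIGHT_DFL),
                       weight_to_sched_weight(CGROUP_WEIGHT_MAX));
                return 0;
        }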
      
      v4: - Use cgroup2 basic usage stat as the information source instead
            of cpuacct.
      
      v3: - Added "cpu.weight.nice" to allow using nice values when
            configuring the weight.  The feature was requested by PeterZ.
          - Merged the patch to enable threaded support on cpu and cpuacct.
          - Dropped the bits about getting rid of cpuacct from patch
            description as there is a pretty strong case for making cpuacct
            an implicit controller so that basic cpu usage stats are always
            available.
          - Documentation updated accordingly.  "cpu.rt.max" section is
            dropped for now.
      
      v2: - cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED
            for CFS bandwidth stats and also using raw division for u64.
            Use CONFIG_CFS_BANDWIDTH and do_div() instead.  "cpu.rt.max" is
            not included yet.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
    • sched: Misc preps for cgroup unified hierarchy interface · a1f7164c
      Committed by Tejun Heo
      Make the following changes in preparation for the cpu controller
      interface implementation for cgroup2.  This patch doesn't cause any
      functional differences.
      
      * s/cpu_stats_show()/cpu_cfs_stat_show()/
      
      * s/cpu_files/cpu_legacy_files/
      
      v2: Dropped the cpuacct changes as they won't be used by the cpu
          controller interface anymore.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
  2. 26 September 2017 (2 commits)
    • sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE · 8157a7fa
      Committed by Tejun Heo
      cfb766da ("sched/cputime: Expose cputime_adjust()") made
      cputime_adjust() public for cgroup basic cpu stat support; however,
      the commit forgot to add a dummy implementaiton for
      CONFIG_VIRT_CPU_ACCOUNTING_NATIVE leading to compiler errors on some
      s390 configurations.
      
      Fix it by adding the missing dummy implementation.
      Reported-by: "kbuild-all@01.org" <kbuild-all@01.org>
      Fixes: cfb766da ("sched/cputime: Expose cputime_adjust()")
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: statically initialize init_css_set->dfl_cgrp · 38683148
      Committed by Tejun Heo
      Like other csets, init_css_set's dfl_cgrp is initialized when the cset
      gets linked.  init_css_set gets linked in cgroup_init().  This has
      been fine till now, but the recently added basic CPU usage accounting
      may end up accessing dfl_cgrp of init before cgroup_init(), leading to
      the following oops.
      
        SELinux:  Initializing.
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
        IP: account_system_index_time+0x60/0x90
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc2-00003-g041cd640 #10
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
        task: ffffffff81e10480 task.stack: ffffffff81e00000
        RIP: 0010:account_system_index_time+0x60/0x90
        RSP: 0000:ffff880011e03cb8 EFLAGS: 00010002
        RAX: ffffffff81ef8800 RBX: ffffffff81e10480 RCX: 0000000000000003
        RDX: 0000000000000000 RSI: 00000000000f4240 RDI: 0000000000000000
        RBP: ffff880011e03cc0 R08: 0000000000010000 R09: 0000000000000000
        R10: 0000000000000020 R11: 0000003b9aca0000 R12: 000000000001c100
        R13: 0000000000000000 R14: ffffffff81e10480 R15: ffffffff81e03cd8
        FS:  0000000000000000(0000) GS:ffff880011e00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000b0 CR3: 0000000001e09000 CR4: 00000000000006b0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         account_system_time+0x45/0x60
         account_process_tick+0x5a/0x140
         update_process_times+0x22/0x60
         tick_periodic+0x2b/0x90
         tick_handle_periodic+0x25/0x70
         timer_interrupt+0x15/0x20
         __handle_irq_event_percpu+0x7e/0x1b0
         handle_irq_event_percpu+0x23/0x60
         handle_irq_event+0x42/0x70
         handle_level_irq+0x83/0x100
         handle_irq+0x6f/0x110
         do_IRQ+0x46/0xd0
         common_interrupt+0x9d/0x9d
      
      Fix it by statically initializing init_css_set.dfl_cgrp so that init's
      default cgroup is accessible from the get-go.
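
      The shape of such a fix is a one-line static initializer; a hedged
      sketch based on this description (not the verbatim diff):

        /* kernel/cgroup/cgroup.c (sketch): give init_css_set a valid
         * default cgroup at compile time instead of waiting for
         * cgroup_init(), so early accounting can dereference dfl_cgrp. */
        struct css_set init_css_set = {
                /* ... existing static initializers ... */
                .dfl_cgrp       = &cgrp_dfl_root.cgrp,
        };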
      
      Fixes: 041cd640 ("cgroup: Implement cgroup2 basic CPU usage accounting")
      Reported-by: "kbuild-all@01.org" <kbuild-all@01.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  3. 25 September 2017 (3 commits)
    • cgroup: Implement cgroup2 basic CPU usage accounting · 041cd640
      Committed by Tejun Heo
      In cgroup1, while cpuacct isn't actually controlling any resources, it
      is a separate controller due to a combination of two factors:
      1. enabling the cpu controller has significant side effects, and 2. we
      have to pick one of the hierarchies to account CPU usage on.  The
      cpuacct controller is effectively used to designate a hierarchy to
      track CPU usage on.
      
      cgroup2's unified hierarchy removes the second reason and we can
      account basic CPU usage by default.  While we could use cpuacct for
      this purpose, both its interface and implementation leave a lot to be
      desired - it collects and exposes two sources of truth which don't
      agree with each other, and some of the exposed statistics don't make
      much sense.  Also, it propagates all the way up the hierarchy on each
      accounting event, which is unnecessary.
      
      This patch adds a basic resource accounting mechanism to cgroup2's
      unified hierarchy and accounts CPU usage using it.
      
      * All accounting is done per-cpu and doesn't propagate immediately.
        It just bumps the per-cgroup per-cpu counters and links the cgroup
        to the parent's updated list if it isn't already on it.
      
      * On a read, the per-cpu counters are collected into the global ones
        and then propagated upwards.  Only the per-cpu counters which have
        changed since the last read are propagated.
      
      * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
        prefix.  Total usage is collected from scheduling events.  User/sys
        breakdown is sourced from tick sampling and adjusted to the usage
        using cputime_adjust().
      
      This keeps the accounting-side hot path O(1) and per-cpu, and the
      read side O(nr_updated_since_last_read).
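
      A toy userspace model of the scheme described above (all names are
      hypothetical; the kernel's data structures and locking are more
      involved):

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        #define NR_CPUS 4

        struct toy_cgroup {
                uint64_t pcpu_usage[NR_CPUS];   /* bumped on the hot path */
                uint64_t pcpu_seen[NR_CPUS];    /* snapshot at last read */
                uint64_t total_usage;           /* subtree total, read side */
                bool on_updated_list;
                struct toy_cgroup *parent;
                struct toy_cgroup *updated_next;     /* sibling link */
                struct toy_cgroup *updated_children; /* changed children */
        };

        /* Hot path, O(1): bump a per-cpu counter and link this cgroup
         * (and any unlinked ancestors) onto the parent's updated list. */
        static void toy_account(struct toy_cgroup *cg, int cpu, uint64_t d)
        {
                cg->pcpu_usage[cpu] += d;
                while (cg->parent && !cg->on_updated_list) {
                        cg->on_updated_list = true;
                        cg->updated_next = cg->parent->updated_children;
                        cg->parent->updated_children = cg;
                        cg = cg->parent;
                }
        }

        /* Read path: visit only cgroups that changed since the last read
         * and fold their per-cpu deltas upwards. */
        static uint64_t toy_flush(struct toy_cgroup *cg)
        {
                uint64_t delta = 0;
                struct toy_cgroup *c = cg->updated_children;

                for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                        delta += cg->pcpu_usage[cpu] - cg->pcpu_seen[cpu];
                        cg->pcpu_seen[cpu] = cg->pcpu_usage[cpu];
                }

                cg->updated_children = NULL;
                while (c) {
                        struct toy_cgroup *next = c->updated_next;

                        c->updated_next = NULL;
                        c->on_updated_list = false;
                        delta += toy_flush(c);
                        c = next;
                }

                cg->total_usage += delta;       /* stat covers the subtree */
                return delta;
        }

        int main(void)
        {
                struct toy_cgroup root = { 0 };
                struct toy_cgroup child = { .parent = &root };

                toy_account(&child, 0, 500);    /* e.g. 500us on cpu0 */
                toy_account(&child, 1, 250);
                toy_flush(&root);               /* e.g. on a "cpu.stat" read */
                /* now root.total_usage == child.total_usage == 750 */
                return 0;
        }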
      
      v2: Minor changes and documentation updates as suggested by Waiman and
          Roman.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
    • cpuacct: Introduce cgroup_account_cputime[_field]() · d2cc5ed6
      Committed by Tejun Heo
      Introduce cgroup_account_cputime[_field]() which wrap cpuacct_charge()
      and cgroup_account_field().  This doesn't introduce any functional
      changes and will be used to add cgroup basic resource accounting.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
    • sched/cputime: Expose cputime_adjust() · cfb766da
      Committed by Tejun Heo
      This will be used by basic cgroup resource stat reporting later.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
  4. 21 September 2017 (1 commit)
    • bpf: one perf event close won't free bpf program attached by another perf event · ec9dd352
      Committed by Yonghong Song
      This patch fixes a bug exhibited by the following scenario:
        1. fd1 = perf_event_open with attr.config = ID1
        2. attach bpf program prog1 to fd1
        3. fd2 = perf_event_open with attr.config = ID1
           <this will be successful>
        4. user program closes fd2 and prog1 is detached from the tracepoint.
  5. the user program with fd1 no longer works properly, as the
     tracepoint produces no more output.
      
      The issue happens at step 4.  Multiple perf_event_open calls can
      succeed, but there is only one bpf prog pointer in the tp_event.  In
      the current logic, any fd release for the same tp_event will free
      tp_event->prog.
      
      The fix is to free tp_event->prog only when the closing fd
      corresponds to the one which registered the program.
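
      A condensed model of the ownership rule (the stub types below are
      made up for illustration; the actual patch records the registering
      event in the kernel's own structures):

        #include <stddef.h>

        struct bpf_prog;                        /* opaque here */

        struct tp_event_stub {
                struct bpf_prog *prog;          /* one slot shared by all fds */
        };

        struct perf_event_stub {
                struct tp_event_stub *tp_event;
                struct bpf_prog *attached_prog; /* set only by the attaching fd */
        };

        /* Called on fd close.  Before the fix, tp_event->prog was freed
         * for any closing fd; after the fix, only the fd which registered
         * the program detaches it. */
        static void event_release(struct perf_event_stub *event)
        {
                if (event->attached_prog &&
                    event->tp_event->prog == event->attached_prog) {
                        /* bpf_prog_put(event->tp_event->prog) in the kernel */
                        event->tp_event->prog = NULL;
                }
                event->attached_prog = NULL;
        }
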
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 20 September 2017 (5 commits)
    • bpf: fix ri->map_owner pointer on bpf_prog_realloc · 7c300131
      Committed by Daniel Borkmann
      Commit 109980b8 ("bpf: don't select potentially stale
      ri->map from buggy xdp progs") passed the pointer to the prog
      itself, loaded into r4 prior to the bpf_redirect_map() helper
      call, so that we can store the owner into ri->map_owner out of
      the helper.
      
      The issue with that is that the actual address of the prog is still
      subject to change when subsequent rewrites occur that require the
      slow path in bpf_prog_realloc() to alloc more memory, e.g. from
      patching in inlined helper functions or constant blinding.  Thus,
      we really need to take prog->aux as the address we're holding,
      which also works with prog clones as they share the same aux
      object.
      
      Instead of then fetching aux->prog at runtime, which could
      potentially incur cache misses due to false sharing, we are
      going to just use aux for the comparison on the map owner.  This
      will also keep the patchlet the same size, and the later check
      in xdp_map_invalid() only accesses the read-only aux pointer from
      the prog; it's also in the same cacheline already from the prior
      access when calling bpf_func.
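
      A hedged condensation of the comparison (xdp_map_invalid() is named
      in this message; the stub types are made up):

        #include <stdbool.h>

        struct bpf_prog_aux_stub { int dummy; };
        struct bpf_prog_stub {
                /* aux stays put across bpf_prog_realloc() and is shared
                 * by clones, unlike the prog pointer itself */
                struct bpf_prog_aux_stub *aux;
        };

        /* Compare the recorded owner cookie (prog->aux loaded into r4 by
         * the patchlet) against the running prog's aux pointer. */
        static bool xdp_map_invalid(const struct bpf_prog_stub *xdp_prog,
                                    unsigned long aux)
        {
                return (unsigned long)xdp_prog->aux != aux;
        }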
      
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: do not disable/enable BH in bpf_map_free_id() · 930651a7
      Committed by Eric Dumazet
      syzkaller reported the following splat [1].
      
      Since hard irqs are disabled by the caller, bpf_map_free_id()
      should not try to enable/disable BH.
      
      Another solution would be to change htab_map_delete_elem() to
      defer the free_htab_elem() call until after
      raw_spin_unlock_irqrestore(&b->lock, flags), but this might not be
      enough to cover other code paths.
      
      [1]
      WARNING: CPU: 1 PID: 8052 at kernel/softirq.c:161 __local_bh_enable_ip+0x1e/0x160 kernel/softirq.c:161
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 8052 Comm: syz-executor1 Not tainted 4.13.0-next-20170915+ #23
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       panic+0x1e4/0x417 kernel/panic.c:181
       __warn+0x1c4/0x1d9 kernel/panic.c:542
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
       do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
       do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
      RIP: 0010:__local_bh_enable_ip+0x1e/0x160 kernel/softirq.c:161
      RSP: 0018:ffff8801cdcd7748 EFLAGS: 00010046
      RAX: 0000000000000082 RBX: 0000000000000201 RCX: 0000000000000000
      RDX: 1ffffffff0b5933c RSI: 0000000000000201 RDI: ffffffff85ac99e0
      RBP: ffff8801cdcd7758 R08: ffffffff85b87158 R09: 1ffff10039b9aec6
      R10: ffff8801c99f24c0 R11: 0000000000000002 R12: ffffffff817b0b47
      R13: dffffc0000000000 R14: ffff8801cdcd77e8 R15: 0000000000000001
       __raw_spin_unlock_bh include/linux/spinlock_api_smp.h:176 [inline]
       _raw_spin_unlock_bh+0x30/0x40 kernel/locking/spinlock.c:207
       spin_unlock_bh include/linux/spinlock.h:361 [inline]
       bpf_map_free_id kernel/bpf/syscall.c:197 [inline]
       __bpf_map_put+0x267/0x320 kernel/bpf/syscall.c:227
       bpf_map_put+0x1a/0x20 kernel/bpf/syscall.c:235
       bpf_map_fd_put_ptr+0x15/0x20 kernel/bpf/map_in_map.c:96
       free_htab_elem+0xc3/0x1b0 kernel/bpf/hashtab.c:658
       htab_map_delete_elem+0x74d/0x970 kernel/bpf/hashtab.c:1063
       map_delete_elem kernel/bpf/syscall.c:633 [inline]
       SYSC_bpf kernel/bpf/syscall.c:1479 [inline]
       SyS_bpf+0x2188/0x46a0 kernel/bpf/syscall.c:1451
       entry_SYSCALL_64_fastpath+0x1f/0xbe
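
      A hedged sketch of the resulting locking pattern, assuming the BH
      variant is simply replaced with an irqsave one (not the verbatim
      diff):

        /* bpf_map_free_id() can run with hard IRQs already disabled
         * (e.g. under the raw b->lock in the htab path above), where the
         * local_bh_enable() inside spin_unlock_bh() is illegal.  The
         * irqsave variant is safe in any context. */
        static void bpf_map_free_id(struct bpf_map *map)
        {
                unsigned long flags;

                if (!map->id)
                        return;

                spin_lock_irqsave(&map_idr_lock, flags); /* was: spin_lock_bh() */
                idr_remove(&map_idr, map->id);
                map->id = 0;
                spin_unlock_irqrestore(&map_idr_lock, flags);
        }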
      
      Fixes: f3f1c054 ("bpf: Introduce bpf_map ID")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tracing: Fix trace_pipe behavior for instance traces · 75df6e68
      Committed by Tahsin Erdogan
      When reading data from trace_pipe, tracing_wait_pipe() performs a
      check to see if tracing has been turned off after some data was read.
      Currently, this check always looks at the global trace state, but it
      should be checking the trace instance where trace_pipe is located.
      
      Because of this bug, "cat instances/i1/trace_pipe" in the following
      script will exit immediately instead of waiting for data:
      
      cd /sys/kernel/debug/tracing
      echo 0 > tracing_on
      mkdir -p instances/i1
      echo 1 > instances/i1/tracing_on
      echo 1 > instances/i1/events/sched/sched_process_exec/enable
      cat instances/i1/trace_pipe
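
      A hedged sketch of the shape of such a fix (identifiers follow the
      kernel/trace/trace.c conventions of the time; illustrative, not the
      verbatim diff):

        /* In tracing_wait_pipe(): consult the instance this trace_pipe
         * belongs to instead of the global top-level state. */
        if (!tracer_tracing_is_on(iter->tr) && iter->pos)
                break;          /* was: if (!tracing_is_on() && iter->pos) */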
      
      Link: http://lkml.kernel.org/r/20170917102348.1615-1-tahsin@google.com
      
      Cc: stable@vger.kernel.org
      Fixes: 10246fa3 ("tracing: give easy way to clear trace buffer")
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Ignore mmiotrace from kernel commandline · c7b3ae0b
      Committed by Ziqian SUN (Zamir)
      The mmiotrace tracer cannot be enabled with ftrace=mmiotrace on the
      kernel command line.  With this patch, noboot is added to the tracer
      struct, and when the system boots with a tracer that has noboot set,
      it will print a warning message and continue booting.
      
      Link: http://lkml.kernel.org/r/1505111195-31942-1-git-send-email-zsun@redhat.com
      Signed-off-by: Ziqian SUN (Zamir) <zsun@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Erase irqsoff trace with empty write · 8dd33bcb
      Committed by Bo Yan
      One convenient way to erase a trace is "echo > trace".  However, this
      is currently broken if the current tracer is the irqsoff tracer,
      because the irqsoff tracer uses max_buffer as its default trace
      buffer.
      
      Set max_buffer as the one to be cleared when it is the trace
      buffer currently in use.
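
      A hedged sketch of the buffer selection (field and helper names
      follow the tracing code of that era; illustrative only, with tr
      being the trace_array in scope):

        /* When erasing via "echo > trace", pick the buffer the current
         * tracer actually records into; latency tracers such as irqsoff
         * use max_buffer rather than the default trace_buffer. */
        struct trace_buffer *buf = &tr->trace_buffer;

        #ifdef CONFIG_TRACER_MAX_TRACE
        if (tr->current_trace->print_max)
                buf = &tr->max_buffer;
        #endif

        tracing_reset_online_cpus(buf);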
      
      Link: http://lkml.kernel.org/r/1505754215-29411-1-git-send-email-byan@nvidia.com
      
      Cc: <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 4acd4d00 ("tracing: give easy way to clear trace buffer")
      Signed-off-by: Bo Yan <byan@nvidia.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  6. 19 September 2017 (1 commit)
  7. 17 September 2017 (1 commit)
    • genirq: Fix cpumask check in __irq_startup_managed() · 9cb067ef
      Committed by Thomas Gleixner
      The result of cpumask_any_and() is invalid when it is greater than or
      equal to nr_cpu_ids.  The current check only tests for greater than.
      Fix it.
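
      For reference, the invariant being checked (cpumask_any_and() is the
      real helper; the surrounding context is simplified and aff stands in
      for the irq's affinity mask):

        /* cpumask_any_and() returns a CPU number >= nr_cpu_ids when the
         * intersection of the two masks is empty, so "no usable CPU"
         * must be tested with >=, not >. */
        unsigned int cpu = cpumask_any_and(aff, cpu_online_mask);

        if (cpu >= nr_cpu_ids) {
                /* managed irq has no online CPU to start on */
        }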
      
      Fixes: 761ea388 ("genirq: Handle managed irqs gracefully in irq_startup()")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Yu <yu.c.chen@intel.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: stable@vger.kernel.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rui Zhang <rui.zhang@intel.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Link: http://lkml.kernel.org/r/20170913213152.272283444@linutronix.de
  8. 16 September 2017 (1 commit)
  9. 15 September 2017 (2 commits)
    • sched/wait: Introduce wakeup bookmark in wake_up_page_bit · 11a19c7b
      Committed by Tim Chen
      Now that we have added breaks in the wait queue scan and allow a
      bookmark on the scan position, we put this logic in the
      wake_up_page_bit() function.
      
      We can have very long page wait lists on large systems, where multiple
      pages share the same wait list.  We break up the wake-up walk here to
      give other cpus a chance to access the list, and to avoid disabling
      interrupts while traversing the list for too long.  This reduces the
      interrupt and rescheduling latency, and excessive page wait queue
      lock hold time.
      
      [ v2: Remove bookmark_wake_function ]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sched/wait: Break up long wake list walk · 2554db91
      Committed by Tim Chen
      We encountered workloads that have very long wake-up lists on large
      systems.  A waker takes a long time to traverse the entire wake list
      and execute all the wake functions.
      
      We saw page wait lists that were up to 3700+ entries long in tests of
      large 4- and 8-socket systems.  It took 0.8 sec to traverse such a
      list during wake-up.  Any other CPU that contends for the list spin
      lock will spin for a long time.  It is a result of the NUMA balancing
      migration of hot pages that are shared by many threads.
      
      Multiple CPUs waking are queued up behind the lock, and the last one
      queued has to wait until all the other CPUs have completed their
      wakeups.
      
      The page wait list is traversed with interrupts disabled, which caused
      various problems.  This was the original cause that triggered the NMI
      watchdog timer in: https://patchwork.kernel.org/patch/9800303/ .  Only
      extending the NMI watchdog timer there helped.
      
      This patch bookmarks the waker's scan position in the wake list and
      breaks up the wake-up walk, to allow access to the list before the
      waker resumes its walk down the rest of the wait list.  It lowers the
      interrupt and rescheduling latency.
      
      This patch also provides a performance boost when combined with the
      next patch, which breaks up the page wakeup list walk.  We saw a 22%
      improvement in the will-it-scale file pread2 test on a Xeon Phi
      system running 256 threads.
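
      A toy model of the bookmarked walk (hypothetical names; the kernel
      version operates on wait queue entries under the wait queue lock):

        #include <stdbool.h>
        #include <stddef.h>

        #define WALK_BREAK_CNT 64       /* wake at most this many per pass */

        struct waiter {
                struct waiter *prev, *next;
                bool is_bookmark;       /* placeholder entry, never woken */
        };

        static void insert_before(struct waiter *pos, struct waiter *n)
        {
                n->prev = pos->prev;
                n->next = pos;
                pos->prev->next = n;
                pos->prev = n;
        }

        static void unlink_entry(struct waiter *n)
        {
                n->prev->next = n->next;
                n->next->prev = n->prev;
                n->prev = n->next = NULL;
        }

        /* One bounded pass over a circular list with sentinel head: wake
         * up to WALK_BREAK_CNT entries, then splice the bookmark in at
         * the stopping point and return true.  Because the bookmark is a
         * real list member, concurrent removals cannot invalidate the
         * resume position.  In this toy, wake() must not unlink entries;
         * the walk never revisits them anyway. */
        static bool wake_pass(struct waiter *head, struct waiter *start,
                              struct waiter *bookmark,
                              void (*wake)(struct waiter *))
        {
                struct waiter *pos = start;
                int cnt = 0;

                while (pos != head) {
                        struct waiter *next = pos->next;

                        if (!pos->is_bookmark) {
                                if (++cnt > WALK_BREAK_CNT) {
                                        insert_before(pos, bookmark);
                                        return true;
                                }
                                wake(pos);
                        }
                        pos = next;
                }
                return false;           /* walk finished */
        }

      The caller loops on wake_pass(): while it returns true, it drops the
      list lock (letting other CPUs in), re-takes it, saves
      start = bookmark.next, calls unlink_entry(&bookmark), and passes
      start back in to resume the walk.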
      
      [ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
        simply access to flags. ]
      Reported-by: Kan Liang <kan.liang@intel.com>
      Tested-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 14 September 2017 (1 commit)
    • mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4
      Committed by Michal Hocko
      GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
      and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  Its
      primary motivation was to allow users to indicate that an allocation is
      short-lived, so the allocator can try to place such allocations close
      together and prevent long-term fragmentation.  As much as this sounds
      like a reasonable semantic, it becomes much less clear when to use the
      high-level GFP_TEMPORARY allocation flag.  How long is temporary?  Can
      the context holding that memory sleep?  Can it take locks?  It seems
      there is no good answer to those questions.
      
      The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
      __GFP_RECLAIMABLE, which in itself is tricky because basically none of
      the existing callers provide a way to reclaim the allocated memory.  So
      this is rather misleading and hard to evaluate for any benefit.
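
      For reference, the definition being removed amounts to the following
      (reconstructed from the description above; see include/linux/gfp.h of
      that era for the exact form):

        /* GFP_KERNEL is (__GFP_RECLAIM | __GFP_IO | __GFP_FS), so
         * GFP_TEMPORARY is effectively GFP_KERNEL plus the reclaimable
         * placement hint: */
        #define GFP_TEMPORARY   (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
                                 __GFP_RECLAIMABLE)

        /* The treewide conversion then reduces to: */
        ptr = kmalloc(size, GFP_KERNEL);        /* was: GFP_TEMPORARY */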
      
      I have checked some random users and none of them has added the flag
      with a specific justification.  I suspect most of them just copied it
      from other existing users, and others just thought it might be a good
      idea to use without measuring anything.  This suggests that
      GFP_TEMPORARY just invites cargo-cult usage without any reasoning.
      
      I believe that our gfp flags are quite complex already, and especially
      those with high-level semantics should be clearly defined to prevent
      confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
      replacing all existing users with plain GFP_KERNEL.  Please note that
      SLAB users with shrinkers will still get the __GFP_RECLAIMABLE
      heuristic and so will be placed properly for memory fragmentation
      prevention.
      
      I can see reasons we might want some gfp flag to reflect short-term
      allocations, but I propose starting from a clear semantic definition
      and only then adding users with proper justification.
      
      This was brought up before LSF this year by Matthew [1] and it
      turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
      seems to be a heuristic without any measured advantage for most (if not
      all) of its current users.  The follow-up discussion revealed that
      opinions on what might be a temporary allocation differ a lot between
      developers.  So rather than trying to tweak existing users into a
      semantic which they haven't expected, I propose to simply remove the
      flag and start from scratch if we really need a semantic for short-term
      allocations.
      
      [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: drm/i915: fix up]
        Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 12 September 2017 (5 commits)
  12. 11 September 2017 (1 commit)
  13. 09 September 2017 (15 commits)