1. 27 12月, 2019 11 次提交
  2. 05 9月, 2018 1 次提交
  3. 23 8月, 2018 3 次提交
  4. 18 8月, 2018 1 次提交
    • S
      fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Shakeel Butt 提交于
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      The kernel memory can only be charged to the memcg of the process in
      whose context kernel memory was allocated.  However there are cases
      where the allocated kernel memory should be charged to the memcg
      different from the current processes's memcg.  This patch series
      contains two such concrete use-cases i.e.  fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if there is either no or slow listener.  The events
      are allocated in the context of the event producer.  However they should
      be charged to the event consumer.  Similarly the buffer_head objects can
      be allocated in a memcg different from the memcg of the page for which
      buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces mechanism to charge
      kernel memory to a given memcg.  In case of fsnotify events, the memcg
      of the consumer can be used for charging and for buffer_head, the memcg
      of the page can be charged.  For directed charging, the caller can use
      the scope API memalloc_[un]use_memcg() to specify the memcg to charge
      for all the __GFP_ACCOUNT allocations within the scope.
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for the huge or
      unlimited queues if there is either no or slow listener.  This can cause
      system level memory pressure or OOMs.  So, it's better to account the
      fsnotify kmem caches to the memcg of the listener.
      
      However the listener can be in a different memcg than the memcg of the
      producer and these allocations happen in the context of the event
      producer.  This patch introduces remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches and among them allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happens in the context of syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
      
      The objects from fsnotify_mark_connector_cachep are not accounted as
      they are small compared to the notification mark or events and it is
      unclear whom to account connector to since it is shared by all events
      attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch has also moved the members of fsnotify_group to keep the size
      same, at least for 64 bit build, even with additional member by filling
      the holes.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d46eb14b
  5. 10 8月, 2018 1 次提交
    • E
      signal: Don't restart fork when signals come in. · c3ad2c3b
      Eric W. Biederman 提交于
      Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
      report that a periodic signal received during fork can cause fork to
      continually restart preventing an application from making progress.
      
      The code was being overly pessimistic.  Fork needs to guarantee that a
      signal sent to multiple processes is logically delivered before the
      fork and just to the forking process or logically delivered after the
      fork to both the forking process and it's newly spawned child.  For
      signals like periodic timers that are always delivered to a single
      process fork can safely complete and let them appear to logically
      delivered after the fork().
      
      While examining this issue I also discovered that fork today will miss
      signals delivered to multiple processes during the fork and handled by
      another thread.  Similarly the current code will also miss blocked
      signals that are delivered to multiple process, as those signals will
      not appear pending during fork.
      
      Add a list of each thread that is currently forking, and keep on that
      list a signal set that records all of the signals sent to multiple
      processes.  When fork completes initialize the new processes
      shared_pending signal set with it.  The calculate_sigpending function
      will see those signals and set TIF_SIGPENDING causing the new task to
      take the slow path to userspace to handle those signals.  Making it
      appear as if those signals were received immediately after the fork.
      
      It is not possible to send real time signals to multiple processes and
      exceptions don't go to multiple processes, which means that that are
      no signals sent to multiple processes that require siginfo.  This
      means it is safe to not bother collecting siginfo on signals sent
      during fork.
      
      The sigaction of a child of fork is initially the same as the
      sigaction of the parent process.  So a signal the parent ignores the
      child will also initially ignore.  Therefore it is safe to ignore
      signals sent to multiple processes and ignored by the forking process.
      
      Signals sent to only a single process or only a single thread and delivered
      during fork are treated as if they are received after the fork, and generally
      not dealt with.  They won't cause any problems.
      
      V2: Added removal from the multiprocess list on failure.
      V3: Use -ERESTARTNOINTR directly
      V4: - Don't queue both SIGCONT and SIGSTOP
          - Initialize signal_struct.multiprocess in init_task
          - Move setting of shared_pending to before the new task
            is visible to signals.  This prevents signals from comming
            in before shared_pending.signal is set to delayed.signal
            and being lost.
      V5: - rework list add and delete to account for idle threads
      v6: - Use sigdelsetmask when removing stop signals
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
      Reported-by: Wen Yang <wen.yang99@zte.com.cn> and
      Reported-by: Nmajiang <ma.jiang@zte.com.cn>
      Fixes: 4a2c7a78 ("[PATCH] make fork() atomic wrt pgrp/session signals")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      c3ad2c3b
  6. 04 8月, 2018 1 次提交
    • E
      fork: Have new threads join on-going signal group stops · 924de3b8
      Eric W. Biederman 提交于
      There are only two signals that are delivered to every member of a
      signal group: SIGSTOP and SIGKILL.  Signal delivery requires every
      signal appear to be delivered either before or after a clone syscall.
      SIGKILL terminates the clone so does not need to be considered.  Which
      leaves only SIGSTOP that needs to be considered when creating new
      threads.
      
      Today in the event of a group stop TIF_SIGPENDING will get set and the
      fork will restart ensuring the fork syscall participates in the group
      stop.
      
      A fork (especially of a process with a lot of memory) is one of the
      most expensive system so we really only want to restart a fork when
      necessary.
      
      It is easy so check to see if a SIGSTOP is ongoing and have the new
      thread join it immediate after the clone completes.  Making it appear
      the clone completed happened just before the SIGSTOP.
      
      The calculate_sigpending function will see the bits set in jobctl and
      set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.
      
      V2: The call to task_join_group_stop was moved before the new task is
          added to the thread group list.  This should not matter as
          sighand->siglock is held over both the addition of the threads,
          the call to task_join_group_stop and do_signal_stop.  But the change
          is trivial and it is one less thing to worry about when reading
          the code.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      924de3b8
  7. 01 8月, 2018 1 次提交
  8. 27 7月, 2018 1 次提交
  9. 23 7月, 2018 2 次提交
    • E
      fork: Unconditionally exit if a fatal signal is pending · 7673bf55
      Eric W. Biederman 提交于
      In practice this does not change anything as testing for fatal_signal_pending
      and exiting for with an error code duplicates the work of the next clause
      which recalculates pending signals and then exits fork if any are pending.
      In both cases the pending signal will trigger the slow path when existing
      to userspace, and the fatal signal will cause do_exit to be called.
      
      The advantage of making this a separate test is that it makes it clear
      processing the fatal signal will terminate the fork, and it allows the
      rest of the signal logic to be updated without fear that this important
      case will be lost.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      7673bf55
    • E
      fork: Move and describe why the code examines PIDNS_ADDING · 4ca1d3ee
      Eric W. Biederman 提交于
      Normally this would be something that would be handled by handling
      signals that are sent to a group of processes but in this case the
      forking process is not a member of the group being signaled.  Thus
      special code is needed to prevent a race with pid namespaces exiting,
      and fork adding new processes within them.
      
      Move this test up before the signal restart just in case signals are
      also pending.  Fatal conditions should take presedence over restarts.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      4ca1d3ee
  10. 22 7月, 2018 3 次提交
    • L
      mm: make vm_area_alloc() initialize core fields · 490fc053
      Linus Torvalds 提交于
      Like vm_area_dup(), it initializes the anon_vma_chain head, and the
      basic mm pointer.
      
      The rest of the fields end up being different for different users,
      although the plan is to also initialize the 'vm_ops' field to a dummy
      entry.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      490fc053
    • L
      mm: make vm_area_dup() actually copy the old vma data · 95faf699
      Linus Torvalds 提交于
      .. and re-initialize th eanon_vma_chain head.
      
      This removes some boiler-plate from the users, and also makes it clear
      why it didn't need use the 'zalloc()' version.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95faf699
    • L
      mm: use helper functions for allocating and freeing vm_area structs · 3928d4f5
      Linus Torvalds 提交于
      The vm_area_struct is one of the most fundamental memory management
      objects, but the management of it is entirely open-coded evertwhere,
      ranging from allocation and freeing (using kmem_cache_[z]alloc and
      kmem_cache_free) to initializing all the fields.
      
      We want to unify this in order to end up having some unified
      initialization of the vmas, and the first step to this is to at least
      have basic allocation functions.
      
      Right now those functions are literally just wrappers around the
      kmem_cache_*() calls.  This is a purely mechanical conversion:
      
          # new vma:
          kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
      
          # copy old vma
          kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
      
          # free vma
          kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
      
      to the point where the old vma passed in to the vm_area_dup() function
      isn't even used yet (because I've left all the old manual initialization
      alone).
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3928d4f5
  11. 21 7月, 2018 2 次提交
    • E
      pid: Implement PIDTYPE_TGID · 6883f81a
      Eric W. Biederman 提交于
      Everywhere except in the pid array we distinguish between a tasks pid and
      a tasks tgid (thread group id).  Even in the enumeration we want that
      distinction sometimes so we have added __PIDTYPE_TGID.  With leader_pid
      we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
      
      Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
      into the pids array.  Then remove the __PIDTYPE_TGID special case and the
      leader_pid in signal_struct.
      
      The net size increase is just an extra pointer added to struct pid and
      an extra pair of pointers of an hlist_node added to task_struct.
      
      The effect on code maintenance is the removal of a number of special
      cases today and the potential to remove many more special cases as
      PIDTYPE_TGID gets used to it's fullest.  The long term potential
      is allowing zombie thread group leaders to exit, which will remove
      a lot more special cases in the code.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      6883f81a
    • E
      pids: Move the pgrp and session pid pointers from task_struct to signal_struct · 2c470475
      Eric W. Biederman 提交于
      To access these fields the code always has to go to group leader so
      going to signal struct is no loss and is actually a fundamental simplification.
      
      This saves a little bit of memory by only allocating the pid pointer array
      once instead of once for every thread, and even better this removes a
      few potential races caused by the fact that group_leader can be changed
      by de_thread, while signal_struct can not.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      2c470475
  12. 17 7月, 2018 1 次提交
  13. 15 6月, 2018 1 次提交
    • T
      mm: check for SIGKILL inside dup_mmap() loop · 655c79bb
      Tetsuo Handa 提交于
      As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas
      can loop while potentially allocating memory, with mm->mmap_sem held for
      write by current thread.  This is bad if current thread was selected as
      an OOM victim, for current thread will continue allocations using memory
      reserves while OOM reaper is unable to reclaim memory.
      
      As an actually observable problem, it is not difficult to make OOM
      reaper unable to reclaim memory if the OOM victim is blocked at
      i_mmap_lock_write() in this loop.  Unfortunately, since nobody can
      explain whether it is safe to use killable wait there, let's check for
      SIGKILL before trying to allocate memory.  Even without an OOM event,
      there is no point with continuing the loop from the beginning if current
      thread is killed.
      
      I tested with debug printk().  This patch should be safe because we
      already fail if security_vm_enough_memory_mm() or
      kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it.
      
         ***** Aborting dup_mmap() due to SIGKILL *****
         ***** Aborting dup_mmap() due to SIGKILL *****
         ***** Aborting dup_mmap() due to SIGKILL *****
         ***** Aborting dup_mmap() due to SIGKILL *****
         ***** Aborting exit_mmap() due to NULL mmap *****
      
      [akpm@linux-foundation.org: add comment]
      Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      655c79bb
  14. 14 6月, 2018 1 次提交
    • L
      Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables · 050e9baa
      Linus Torvalds 提交于
      The changes to automatically test for working stack protector compiler
      support in the Kconfig files removed the special STACKPROTECTOR_AUTO
      option that picked the strongest stack protector that the compiler
      supported.
      
      That was all a nice cleanup - it makes no sense to have the AUTO case
      now that the Kconfig phase can just determine the compiler support
      directly.
      
      HOWEVER.
      
      It also meant that doing "make oldconfig" would now _disable_ the strong
      stackprotector if you had AUTO enabled, because in a legacy config file,
      the sane stack protector configuration would look like
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        # CONFIG_CC_STACKPROTECTOR_NONE is not set
        # CONFIG_CC_STACKPROTECTOR_REGULAR is not set
        # CONFIG_CC_STACKPROTECTOR_STRONG is not set
        CONFIG_CC_STACKPROTECTOR_AUTO=y
      
      and when you ran this through "make oldconfig" with the Kbuild changes,
      it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had
      been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just
      CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version
      used to be disabled (because it was really enabled by AUTO), and would
      disable it in the new config, resulting in:
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
        CONFIG_CC_STACKPROTECTOR=y
        # CONFIG_CC_STACKPROTECTOR_STRONG is not set
        CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
      
      That's dangerously subtle - people could suddenly find themselves with
      the weaker stack protector setup without even realizing.
      
      The solution here is to just rename not just the old RECULAR stack
      protector option, but also the strong one.  This does that by just
      removing the CC_ prefix entirely for the user choices, because it really
      is not about the compiler support (the compiler support now instead
      automatially impacts _visibility_ of the options to users).
      
      This results in "make oldconfig" actually asking the user for their
      choice, so that we don't have any silent subtle security model changes.
      The end result would generally look like this:
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
        CONFIG_STACKPROTECTOR=y
        CONFIG_STACKPROTECTOR_STRONG=y
        CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
      
      where the "CC_" versions really are about internal compiler
      infrastructure, not the user selections.
      Acked-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      050e9baa
  15. 08 6月, 2018 1 次提交
    • Y
      mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct · 88aa7cc6
      Yang Shi 提交于
      mmap_sem is on the hot path of kernel, and it very contended, but it is
      abused too.  It is used to protect arg_start|end and evn_start|end when
      reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
      sense since those proc files just expect to read 4 values atomically and
      not related to VM, they could be set to arbitrary values by C/R.
      
      And, the mmap_sem contention may cause unexpected issue like below:
      
      INFO: task ps:14018 blocked for more than 120 seconds.
             Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
      message.
       ps              D    0 14018      1 0x00000004
       Call Trace:
         schedule+0x36/0x80
         rwsem_down_read_failed+0xf0/0x150
         call_rwsem_down_read_failed+0x18/0x30
         down_read+0x20/0x40
         proc_pid_cmdline_read+0xd9/0x4e0
         __vfs_read+0x37/0x150
         vfs_read+0x96/0x130
         SyS_read+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xc5
      
      Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
      for them to mitigate the abuse of mmap_sem.
      
      So, introduce a new spinlock in mm_struct to protect the concurrent
      access to arg_start|end, env_start|end and others, as well as replace
      write map_sem to read to protect the race condition between prctl and
      sys_brk which might break check_data_rlimit(), and makes prctl more
      friendly to other VM operations.
      
      This patch just eliminates the abuse of mmap_sem, but it can't resolve
      the above hung task warning completely since the later
      access_remote_vm() call needs acquire mmap_sem.  The mmap_sem
      scalability issue will be solved in the future.
      
      [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
        Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88aa7cc6
  16. 06 6月, 2018 1 次提交
    • M
      rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers 提交于
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartables sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up do date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdfSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
  17. 15 5月, 2018 1 次提交
  18. 21 4月, 2018 1 次提交
    • K
      fork: unconditionally clear stack on fork · e01e8063
      Kees Cook 提交于
      One of the classes of kernel stack content leaks[1] is exposing the
      contents of prior heap or stack contents when a new process stack is
      allocated.  Normally, those stacks are not zeroed, and the old contents
      remain in place.  In the face of stack content exposure flaws, those
      contents can leak to userspace.
      
      Fixing this will make the kernel no longer vulnerable to these flaws, as
      the stack will be wiped each time a stack is assigned to a new process.
      There's not a meaningful change in runtime performance; it almost looks
      like it provides a benefit.
      
      Performing back-to-back kernel builds before:
      	Run times: 157.86 157.09 158.90 160.94 160.80
      	Mean: 159.12
      	Std Dev: 1.54
      
      and after:
      	Run times: 159.31 157.34 156.71 158.15 160.81
      	Mean: 158.46
      	Std Dev: 1.46
      
      Instead of making this a build or runtime config, Andy Lutomirski
      recommended this just be enabled by default.
      
      [1] A noisy search for many kinds of stack content leaks can be seen here:
      https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak
      
      I did some more with perf and cycle counts on running 100,000 execs of
      /bin/true.
      
      before:
      Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841
      Mean:  221015379122.60
      Std Dev: 4662486552.47
      
      after:
      Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348
      Mean:  217745009865.40
      Std Dev: 5935559279.99
      
      It continues to look like it's faster, though the deviation is rather
      wide, but I'm not sure what I could do that would be less noisy.  I'm
      open to ideas!
      
      Link: http://lkml.kernel.org/r/20180221021659.GA37073@beastSigned-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e01e8063
  19. 06 4月, 2018 1 次提交
  20. 03 4月, 2018 2 次提交
  21. 22 2月, 2018 1 次提交
  22. 07 2月, 2018 2 次提交