1. 23 Aug 2018, 7 commits
  2. 18 Aug 2018, 5 commits
    • kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous() · d834c5ab
      Authored by Marek Szyprowski
      The CMA memory allocator doesn't support standard gfp flags for memory
      allocation, so there is no point having one as a parameter of the
      dma_alloc_from_contiguous() function.  Replace it with a boolean no_warn
      argument, which covers everything the underlying cma_alloc() function
      supports.
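      As a rough sketch, the prototype change looks like this (assuming the
      pre-patch signature took a gfp_t as its last argument):

          -  struct page *dma_alloc_from_contiguous(struct device *dev, size_t count,
          -                                         unsigned int align, gfp_t gfp_mask);
          +  struct page *dma_alloc_from_contiguous(struct device *dev, size_t count,
          +                                         unsigned int align, bool no_warn);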
      
      This helps avoid giving the false impression that this function supports
      standard gfp flags and that callers can pass __GFP_ZERO to get a zeroed
      buffer, which has already been an issue: see commit dd65a941 ("arm64:
      dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").
      
      Link: http://lkml.kernel.org/r/20180709122020eucas1p21a71b092975cb4a3b9954ffc63f699d1~-sqUFoa-h2939329393eucas1p2Y@eucas1p2.samsung.com
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: Michał Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d834c5ab
    • mm/cma: remove unsupported gfp_mask parameter from cma_alloc() · 65182029
      Authored by Marek Szyprowski
      cma_alloc() doesn't really support gfp flags other than __GFP_NOWARN, so
      convert the gfp_mask parameter into a boolean no_warn parameter.
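      A sketch of the corresponding prototype change (again assuming the
      pre-patch signature took a gfp_t as its last argument):

          -  struct page *cma_alloc(struct cma *cma, size_t count,
          -                         unsigned int align, gfp_t gfp_mask);
          +  struct page *cma_alloc(struct cma *cma, size_t count,
          +                         unsigned int align, bool no_warn);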
      
      This helps avoid giving the false impression that this function supports
      standard gfp flags and that callers can pass __GFP_ZERO to get a zeroed
      buffer, which has already been an issue: see commit dd65a941 ("arm64:
      dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").
      
      Link: http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michał Nazarewicz <mina86@mina86.com>
      Acked-by: Laura Abbott <labbott@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65182029
    • kernel/memremap, kasan: make ZONE_DEVICE work with KASAN · 0207df4f
      Authored by Andrey Ryabinin
      KASAN learns about hotadded memory via the memory hotplug notifier.
      devm_memremap_pages() intentionally skips calling memory hotplug
      notifiers.  So KASAN doesn't know anything about new memory added by
      devm_memremap_pages().  This causes a crash when KASAN tries to access
      non-existent shadow memory:
      
       BUG: unable to handle kernel paging request at ffffed0078000000
       RIP: 0010:check_memory_region+0x82/0x1e0
       Call Trace:
        memcpy+0x1f/0x50
        pmem_do_bvec+0x163/0x720
        pmem_make_request+0x305/0xac0
        generic_make_request+0x54f/0xcf0
        submit_bio+0x9c/0x370
        submit_bh_wbc+0x4c7/0x700
        block_read_full_page+0x5ef/0x870
        do_read_cache_page+0x2b8/0xb30
        read_dev_sector+0xbd/0x3f0
        read_lba.isra.0+0x277/0x670
        efi_partition+0x41a/0x18f0
        check_partition+0x30d/0x5e9
        rescan_partitions+0x18c/0x840
        __blkdev_get+0x859/0x1060
        blkdev_get+0x23f/0x810
        __device_add_disk+0x9c8/0xde0
        pmem_attach_disk+0x9a8/0xf50
        nvdimm_bus_probe+0xf3/0x3c0
        driver_probe_device+0x493/0xbd0
        bus_for_each_drv+0x118/0x1b0
        __device_attach+0x1cd/0x2b0
        bus_probe_device+0x1ac/0x260
        device_add+0x90d/0x1380
        nd_async_device_register+0xe/0x50
        async_run_entry_fn+0xc3/0x5d0
        process_one_work+0xa0a/0x1810
        worker_thread+0x87/0xe80
        kthread+0x2d7/0x390
        ret_from_fork+0x3a/0x50
      
      Add kasan_add_zero_shadow()/kasan_remove_zero_shadow(), a post-mm_init()
      interface to map/unmap kasan_zero_page at requested virtual addresses,
      and use it to add/remove the shadow memory for hotplugged/unplugged
      device memory.
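      The new interface is roughly of this shape (a sketch; the exact header
      placement of the declarations is not shown here):

          int kasan_add_zero_shadow(void *start, unsigned long size);
          void kasan_remove_zero_shadow(void *start, unsigned long size);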
      
      Link: http://lkml.kernel.org/r/20180629164932.740-1-aryabinin@virtuozzo.com
      Fixes: 41e94a85 ("add devm_memremap_pages")
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0207df4f
    • fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Authored by Shakeel Butt
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      Kernel memory can only be charged to the memcg of the process in whose
      context the memory was allocated.  However, there are cases where the
      allocated kernel memory should be charged to a memcg different from the
      current process's memcg.  This patch series contains two such concrete
      use-cases, i.e. fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if there is either no listener or a slow one.  The
      events are allocated in the context of the event producer.  However,
      they should be charged to the event consumer.  Similarly, the
      buffer_head objects can be allocated in a memcg different from the memcg
      of the page for which the buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces a mechanism to charge
      kernel memory to a given memcg.  In the case of fsnotify events, the
      memcg of the consumer can be used for charging, and for buffer_head, the
      memcg of the page can be charged.  For directed charging, the caller can
      use the scope API memalloc_[un]use_memcg() to specify the memcg to
      charge for all __GFP_ACCOUNT allocations within the scope.
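      A minimal sketch of a producer-side allocation using this scope API
      (the cache and the group->memcg field shown here are illustrative):

          memalloc_use_memcg(group->memcg);
          event = kmem_cache_alloc(fanotify_event_cachep, GFP_KERNEL_ACCOUNT);
          memalloc_unuse_memcg();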
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for huge or
      unlimited queues if there is either no listener or a slow one.  This can
      cause system-level memory pressure or OOMs.  So, it's better to account
      the fsnotify kmem caches to the memcg of the listener.
      
      However, the listener can be in a different memcg than the memcg of the
      producer, and these allocations happen in the context of the event
      producer.  This patch introduces a remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches, and among them, allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happen in the context of a syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
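      For these caches the change amounts to adding SLAB_ACCOUNT at cache
      creation time, roughly (the SLAB_PANIC flag shown here is assumed from
      the existing setup):

          dnotify_struct_cache = KMEM_CACHE(dnotify_struct,
                                            SLAB_PANIC|SLAB_ACCOUNT);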
      
      The objects from fsnotify_mark_connector_cachep are not accounted, as
      they are small compared to the notification marks or events, and it is
      unclear whom to account the connector to since it is shared by all
      events attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch also moves the members of fsnotify_group around, filling the
      holes, to keep its size the same (at least for 64-bit builds) even with
      the additional member.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d46eb14b
    • bpf: fix redirect to map under tail calls · f6069b9a
      Authored by Daniel Borkmann
      Commits 109980b8 ("bpf: don't select potentially stale ri->map
      from buggy xdp progs") and 7c300131 ("bpf: fix ri->map_owner
      pointer on bpf_prog_realloc") tried to make sure that buggy programs
      using the bpf_redirect_map() helper call do not leave stale maps
      behind.  The idea was to add a map_owner cookie into the per-CPU
      struct redirect_info, set to prog->aux by the prog making the helper
      call, as proof that the map is not stale since the prog is implicitly
      holding a reference to it.  This owner cookie could later be compared
      with the program calling into BPF to check whether they match, and if
      so the redirect could safely proceed with processing the map.
      
      In (obvious) hindsight, this approach breaks down when tail calls are
      involved, since the original caller's prog->aux pointer does not have
      to match the one from a prog in the tail call chain, and therefore the
      xdp buffer will be dropped instead of redirected.  A way around that is
      to fix the issue differently (which also allows removing the related
      work from the fast path at the same time): once the lifetime of a
      redirect map has come to its end, we use its map free callback, where
      we wait on synchronize_rcu() for currently outstanding xdp buffers and
      remove such a map pointer from the redirect info if found to be
      present.  At that time no program is using this map anymore, so we
      simply invalidate the map pointers to NULL iff they previously pointed
      to that instance, while making sure that the redirect path only reads
      out the map once.
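      A rough sketch of that invalidation in the map free path (the per-CPU
      structure and field names here are assumed for illustration):

          synchronize_rcu();
          for_each_online_cpu(cpu) {
                  struct bpf_redirect_info *ri = per_cpu_ptr(&bpf_redirect_info, cpu);

                  if (READ_ONCE(ri->map) == map)
                          cmpxchg(&ri->map, map, NULL);
          }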
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f6069b9a
  3. 17 Aug 2018, 8 commits
    • tracing: Fix SPDX format headers to use C++ style comments · bb730b58
      Authored by Steven Rostedt (VMware)
      The Linux kernel adopted the SPDX License format headers to ease license
      compliance management, and uses the C++ '//' style comments for the SPDX
      header tags. Some files in the tracing directory used the C style /* */
      comments for them. To be consistent across all files, replace the /* */
      C style SPDX tags with the C++ // SPDX tags.
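      For example, a tag in this style:

          /* SPDX-License-Identifier: GPL-2.0 */

      becomes:

          // SPDX-License-Identifier: GPL-2.0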
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      bb730b58
    • tracing: Add SPDX License format tags to tracing files · bcea3f96
      Authored by Steven Rostedt (VMware)
      Add the SPDX License header to ease license compliance management.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      bcea3f96
    • tracing: Add SPDX License format to bpf_trace.c · 179a0cc4
      Authored by Steven Rostedt (VMware)
      Add the SPDX License header to ease license compliance management.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      179a0cc4
    • bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexist · 585f5a62
      Authored by Daniel Borkmann
      The current code in sock_map_ctx_update_elem() allows the BPF_EXIST
      and BPF_NOEXIST map update flags.  While this is rather uncommon on
      array-like maps (e.g. bpf_fd_array_map_update_elem() and others enforce
      the map update flags to be BPF_ANY such that xchg() can be used
      directly), the current implementation in sock map does not guarantee
      that such an operation with BPF_EXIST / BPF_NOEXIST is atomic.
      
      The initial test does a READ_ONCE(stab->sock_map[i]) to fetch the
      socket from the slot, which is then tested for NULL / non-NULL.  However,
      later, after __sock_map_ctx_update_elem(), the actual update is done
      through osock = xchg(&stab->sock_map[i], sock).  The problem is that in
      the meantime a different CPU could have updated / deleted the socket
      on that specific slot, and thus the flag constraints won't hold anymore.
      
      I've been thinking about whether it would be best to just break UAPI
      and enforce BPF_ANY to check whether someone actually complains;
      however, the trouble is that the BPF kselftests already use BPF_NOEXIST
      for the map update, and therefore it might have been copied into
      applications already.  The fix that keeps the current behavior intact
      is to add a map lock, similar to the sock hash bucket lock, only
      covering the whole map.
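      Conceptually, the update path then looks like this (a sketch only; the
      lock and variable names are assumed):

          raw_spin_lock_bh(&stab->lock);
          osock = stab->sock_map[i];
          if ((flags == BPF_NOEXIST && osock) ||
              (flags == BPF_EXIST && !osock)) {
                  err = osock ? -EEXIST : -ENOENT;
          } else {
                  xchg(&stab->sock_map[i], sock);
          }
          raw_spin_unlock_bh(&stab->lock);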
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      585f5a62
    • bpf, sockmap: fix map elem deletion race with smap_stop_sock · 166ab6f0
      Authored by Daniel Borkmann
      smap_start_sock() and smap_stop_sock() are each protected under the
      sock->sk_callback_lock at their call-sites, except in the case of
      sock_map_delete_elem() where we drop the old socket from the map slot.
      This is racy because the same sock could be part of multiple sock maps,
      so we may run smap_stop_sock() in parallel, and given that at that point
      psock->strp_enabled might be true on both CPUs, we might, for example,
      wrongly restore sk->sk_data_ready / sk->sk_write_space.  Therefore, hold
      the sock->sk_callback_lock on delete as well.  It looks like
      2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add
      multi-map support") had this right, but later e9db4ef6 ("bpf:
      sockhash fix omitted bucket lock in sock_close") removed it again from
      delete, leaving this smap_stop_sock() instance unprotected.
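      In other words, the delete path gains the same protection as the other
      call-sites, roughly (a sketch; smap_list_map_remove() is mentioned in
      the next commit, the release helper name is assumed):

          write_lock_bh(&sock->sk_callback_lock);
          smap_list_map_remove(psock, &stab->sock_map[k]);
          smap_release_sock(psock, sock);
          write_unlock_bh(&sock->sk_callback_lock);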
      
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      166ab6f0
    • bpf, sockmap: fix leakage of smap_psock_map_entry · d40b0116
      Authored by Daniel Borkmann
      While working on sockmap I noticed that we do not always kfree the
      struct smap_psock_map_entry list elements which track psocks attached
      to maps. In the case of sock_hash_ctx_update_elem(), these map entries
      are allocated outside of __sock_map_ctx_update_elem() with their
      linkage to the socket hash table filled. In the case of sock array,
      the map entries are allocated inside of __sock_map_ctx_update_elem()
      and added with their linkage to the psock->maps.  Both additions are
      done under psock->maps_lock.
      
      Now, we drop these elements from their psock->maps list on a few
      occasions: i) in sock array via smap_list_map_remove() when an entry
      is either deleted from the map from user space, or updated via
      user space or BPF program where we drop the old socket at that map
      slot, or the sock array is freed via sock_map_free() and drops all
      its elements; ii) for sock hash via smap_list_hash_remove() in exactly
      the same occasions as just described for sock array; iii) in the
      bpf_tcp_close() where we remove the elements from the list via
      psock_map_pop() and iterate over them dropping themselves from either
      sock array or sock hash; and last but not least iv) once again in
      smap_gc_work(), which is a callback for deferring the work once the
      psock refcount has hit zero and thus the socket is being destroyed.
      
      The problem is that the only case where we kfree() the list entry is
      case iv), which at that point should have an empty list in normal
      cases.  So in cases i) to iii) we unlink the elements without freeing
      them, and they go out of our reach.  Hence the fix is to properly
      kfree() them as well to stop the leakage.  Given these are all handled
      under psock->maps_lock, there is no need for deferred RCU freeing.
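      The shape of the fix in the unlink helpers, as a sketch (the list
      member names are assumed):

          list_for_each_entry_safe(e, tmp, &psock->maps, list) {
                  if (e->entry == entry) {
                          list_del(&e->list);
                          kfree(e);       /* previously only unlinked, never freed */
                  }
          }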
      
      I later also ran with the kmemleak detector, and it confirmed the
      finding as well: in the state before the fix the object goes
      unreferenced, while after the patch no kmemleak report related to BPF
      showed up.
      
        [...]
        unreferenced object 0xffff880378eadae0 (size 64):
          comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
          hex dump (first 32 bytes):
            00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
            50 4d 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00  PMu]............
          backtrace:
            [<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
            [<0000000045dd6d3c>] bpf_sock_map_update+0x29/0x60
            [<00000000877723aa>] ___bpf_prog_run+0x1e1f/0x4960
            [<000000002ef89e83>] 0xffffffffffffffff
        unreferenced object 0xffff880378ead240 (size 64):
          comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s)
          hex dump (first 32 bytes):
            00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
            00 44 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00  .Du]............
          backtrace:
            [<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210
            [<0000000030e37a3a>] sock_map_update_elem+0x125/0x240
            [<000000002e5ce36e>] map_update_elem+0x4eb/0x7b0
            [<00000000db453cc9>] __x64_sys_bpf+0x1f9/0x360
            [<0000000000763660>] do_syscall_64+0x9a/0x300
            [<00000000422a2bb2>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
            [<000000002ef89e83>] 0xffffffffffffffff
        [...]
      
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Fixes: 54fedb42 ("bpf: sockmap, fix smap_list_map_remove when psock is in many maps")
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d40b0116
    • bpf: fix a rcu usage warning in bpf_prog_array_copy_core() · 965931e3
      Authored by Yonghong Song
      Commit 394e40a2 ("bpf: extend bpf_prog_array to store pointers
      to the cgroup storage") refactored bpf_prog_array_copy_core() to
      accommodate the new structure bpf_prog_array_item, which contains
      the bpf_prog itself.
      
      In the old code, we had
         perf_event_query_prog_array():
           mutex_lock(...)
           bpf_prog_array_copy_call():
             prog = rcu_dereference_check(array, 1)->progs
             bpf_prog_array_copy_core(prog, ...)
           mutex_unlock(...)
      
      With the above commit, we had
         perf_event_query_prog_array():
           mutex_lock(...)
           bpf_prog_array_copy_call():
             bpf_prog_array_copy_core(array, ...):
               item = rcu_dereference(array)->items;
               ...
           mutex_unlock(...)
      
      The new code will trigger a lockdep rcu checking warning.
      The fix is to change rcu_dereference() to rcu_dereference_check()
      to prevent such a warning.
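      The fix mirrors the check the old code already used, roughly:

          -  item = rcu_dereference(array)->items;
          +  item = rcu_dereference_check(array, 1)->items;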
      
      Reported-by: syzbot+6e72317008eef84a216b@syzkaller.appspotmail.com
      Fixes: 394e40a2 ("bpf: extend bpf_prog_array to store pointers to the cgroup storage")
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      965931e3
    • blktrace: Add SPDX License format header · 91c1e6ba
      Authored by Steven Rostedt (VMware)
      Add the SPDX License header to ease license compliance management.
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      91c1e6ba
  4. 16 Aug 2018, 1 commit
  5. 15 Aug 2018, 1 commit
  6. 14 Aug 2018, 2 commits
  7. 13 Aug 2018, 3 commits
    • parisc: Drop architecture-specific ENOTSUP define · 93cb8e20
      Authored by Helge Deller
      parisc is the only Linux architecture which has defined a value for ENOTSUP.
      All other architectures #define ENOTSUP as EOPNOTSUPP in their libc headers.
      
      Having its own value for ENOTSUP, different from EOPNOTSUPP, often
      causes problems with userspace programs which expect both to be the
      same.  One such example is a build error in the libuv package, as can be
      seen in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900237.
      
      Since we dropped HP-UX support, there is no real benefit in keeping a
      separate value for ENOTSUP.  This patch drops the parisc value for
      ENOTSUP from the kernel sources.  glibc needs no patch; it reuses the
      exported headers.
      Signed-off-by: Helge Deller <deller@gmx.de>
      93cb8e20
    • bpf: decouple btf from seq bpf fs dump and enable more maps · e8d2bec0
      Authored by Daniel Borkmann
      Commits a26ca7c9 ("bpf: btf: Add pretty print support to the
      basic arraymap") and 699c86d6 ("bpf: btf: add pretty print for
      hash/lru_hash maps") enabled support for BTF and dumping via BPF fs
      for array and hash/lru maps.  However, the two can be decoupled from
      each other, such that regular BPF maps can be supported for attaching
      BTF key/value information, while not all maps necessarily need to dump
      via the map_seq_show_elem() callback.
      
      The basic sanity check, which is a prerequisite for all maps, is that
      the key/value sizes have to match in any case, and some maps can have
      extra checks via the map_check_btf() callback, e.g. probing certain
      types or indicating no support in general.  With that we can also
      enable retrieving BTF info for per-cpu map types and lpm.
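      The common prerequisite check amounts to something like this (a sketch;
      btf_key_size/btf_value_size stand for the sizes resolved from BTF):

          if (btf_key_size != map->key_size ||
              btf_value_size != map->value_size)
                  return -EINVAL;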
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      e8d2bec0
    • init: rename and re-order boot_cpu_state_init() · b5b1404d
      Authored by Linus Torvalds
      This is purely a preparatory patch for upcoming changes during the 4.19
      merge window.
      
      We have a function called "boot_cpu_state_init()" that isn't really
      about the bootup cpu state: that is done much earlier by the similarly
      named "boot_cpu_init()" (note lack of "state" in name).
      
      This function initializes some hotplug CPU state, and needs to run after
      the percpu data has been properly initialized.  It even has a comment to
      that effect.
      
      Except it _doesn't_ actually run after the percpu data has been properly
      initialized.  On x86 it happens to do that, but on at least arm and
      arm64, the percpu base pointers are initialized by the arch-specific
      'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init().
      
      This had some unexpected results, and in particular we have a patch
      pending for the merge window that did the obvious cleanup of using
      'this_cpu_write()' in the cpu hotplug init code:
      
        -       per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
        +       this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
      
      which is obviously the right thing to do.  Except because of the
      ordering issue, it actually failed miserably and unexpectedly on arm64.
      
      So this just fixes the ordering, and changes the name of the function to
      be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu
      hotplug state, because the core CPU state was supposed to have already
      been done earlier.
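      Roughly, the ordering in start_kernel() becomes (a sketch of the
      intent, not the exact diff):

          boot_cpu_init();
          ...
          setup_per_cpu_areas();
          smp_prepare_boot_cpu();
          boot_cpu_hotplug_init();   /* was boot_cpu_state_init(), previously
                                        called right after boot_cpu_init() */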
      
      Marked for stable, since the (not yet merged) patch that will show this
      problem is marked for stable.
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com>
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5b1404d
  8. 11 Aug 2018, 11 commits
    • bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      Authored by Martin KaFai Lau
      This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non-SK_FILTER/CGROUP_SKB programs, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At SO_REUSEPORT sk lookup time, we are in the middle of transiting from
      a lower layer (ipv4/ipv6) to an upper layer (udp/tcp).  At this point,
      it is not always clear where the bpf context could be appended in the
      skb->cb[48] to avoid saving-and-restoring cb[].  Even putting aside the
      differences between ipv4-vs-ipv6 and udp-vs-tcp, it is not clear whether
      the lower layer will only ever be ipv4 and ipv6 in the future, and
      whether it will not touch the cb[] again before transiting to the upper
      layer.
      
      For example, udp_gro_receive() uses the 48-byte NAPI_GRO_CB instead of
      IP[6]CB, and it may still modify the cb[] after calling
      udp[46]_lib_lookup_skb().  Because of the above, if skb->cb is used for
      the bpf ctx, saving-and-restoring is needed, and likely the whole
      48-byte cb[] has to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts to
      create a new "struct sk_reuseport_kern" and set the needed values in
      there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no
      protocol-specific usage at this point, and it is also in line with the
      current sock_reuseport.c implementation (i.e. no protocol-specific
      requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantic similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bind-ed to a local
      INANY address which cannot be learned from skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport"-enabled sk is
      being added to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked in
      run time in "sk_select_reuseport()".  Doing the check in
      verification time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. reuseport_id
      not match).  The bpf prog can decide if it wants to do SK_DROP as its
      discretion.
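      A minimal sketch of such a program, in the 2018-era selftest style
      (the map definition, section name, helper spelling
      bpf_sk_select_reuseport() and the fixed index are illustrative):

          #include <linux/bpf.h>
          #include "bpf_helpers.h"

          struct bpf_map_def SEC("maps") reuseport_array = {
                  .type        = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
                  .key_size    = sizeof(__u32),
                  .value_size  = sizeof(__u32),
                  .max_entries = 16,
          };

          SEC("sk_reuseport")
          int _select_sk(struct sk_reuseport_md *reuse_md)
          {
                  __u32 index = 0;        /* slot chosen by application logic */

                  if (bpf_sk_select_reuseport(reuse_md, &reuseport_array,
                                              &index, 0))
                          return SK_DROP;
                  return SK_PASS;
          }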
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it has, it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2dbb9b9e
    • bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY · 5dc4c4b7
      Authored by Martin KaFai Lau
      This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
      
      To unleash the full potential of a bpf prog, it is essential for
      userspace to be capable of directly setting up a bpf map which can then
      be consumed by the bpf prog to make a decision.  In this case, the
      decision is which SO_REUSEPORT sk should serve the incoming request.
      
      By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
      and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
      The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
      the bpf prog can directly select a sk from the bpf map.  That will
      raise the programmability of the bpf prog attached to a reuseport
      group (a group of sk serving the same IP:PORT).
      
      For example, in UDP, the bpf prog can peek into the payload (e.g.
      through the "data" pointer introduced in the later patch) to learn
      the application level's connection information and then decide which sk
      to pick from a bpf map.  The userspace can tightly couple the sk's location
      in a bpf map with the application logic in generating the UDP payload's
      connection information.  This connection info contract/API stays within
      userspace.
      
      Also, when used with map-in-map, the userspace can switch the
      old-server-process's inner map to a new-server-process's inner map
      in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
      The bpf prog will then direct incoming requests to the new process instead
      of the old process.  The old process can finish draining the pending
      requests (e.g. by "accept()") before closing the old fds.  [Note that
      deleting a fd from a bpf map does not necessarily mean the fd is closed.]
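      From the userspace side, populating a slot is an ordinary map update
      (a sketch using the libbpf wrapper; names are illustrative):

          /* map_fd: BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, sock_fd: bound SO_REUSEPORT sk */
          static int add_reuseport_sk(int map_fd, int sock_fd)
          {
                  __u32 index = 0;                /* illustrative slot */
                  __u32 fd_val = sock_fd;         /* value_size == 4 path */

                  return bpf_map_update_elem(map_fd, &index, &fd_val, BPF_ANY);
          }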
      
      During map_update_elem(), only a SO_REUSEPORT sk (i.e. one which has
      already been added to a reuse->socks[]) can be used.  That means a
      SO_REUSEPORT sk that is "bind()"-ed for UDP or "bind()+listen()"-ed for
      TCP.  These conditions are ensured in "reuseport_array_update_check()".
      
      A SO_REUSEPORT sk can only be added once to a map (i.e. the
      same sk cannot be added twice even to the same map).  SO_REUSEPORT
      already allows another sk to be created for the same IP:PORT.
      There is no need to re-create a similar usage on the BPF side.
      
      When a SO_REUSEPORT sk is deleted from the "reuse->socks[]" (e.g. on
      "close()"), it will notify the bpf map to remove it from the map as
      well.  This is done through "bpf_sk_reuseport_detach()", and it will
      only be called if >=1 of the "reuse->socks[]" entries has ever been
      added to a bpf map.
      
      The map_update()/map_delete() operations have to be in sync with the
      "reuse->socks[]".  Hence, the same "reuseport_lock" used by
      "reuse->socks[]" has to be used here as well.  Care has been taken to
      ensure the lock is only acquired when the sk being added passes some
      strict tests, and freeing the map does not require the reuseport_lock.
      
      The reuseport_array will also support lookup from the syscall
      side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
      is on-demand (i.e. a sk's cookie is not generated until the very
      first map_lookup_elem()).
      
      The lookup cookie is 64 bits, but it goes against the logical userspace
      expectation of a 32-bit sizeof(fd) (which other fd-based bpf maps also
      follow).  It may catch the user by surprise if we enforce value_size=8
      while userspace still passes a 32-bit fd during update.  Supporting
      different value_size between lookup and update also seems unintuitive.
      
      We also need to consider what happens if other existing fd-based maps
      want to return a 64-bit value from the syscall's lookup in the future.
      Hence, reuseport_array supports both value_size 4 and 8, assuming the
      user will usually use value_size=4.  The syscall's lookup will return
      ENOSPC for value_size=4.  It will only return the 64-bit value from
      sock_gen_cookie() when the user consciously chooses value_size=8 (as a
      signal that lookup is desired), which then requires a 64-bit value in
      both lookup and update.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      5dc4c4b7
    • tracepoints: Free early tracepoints after RCU is initialized · f8a79d5c
      Authored by Steven Rostedt (VMware)
      When enabling trace events via the kernel command line, I hit this warning:
      
      WARNING: CPU: 0 PID: 13 at kernel/rcu/srcutree.c:236 check_init_srcu_struct+0xe/0x61
      Modules linked in:
      CPU: 0 PID: 13 Comm: watchdog/0 Not tainted 4.18.0-rc6-test+ #6
      Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
      RIP: 0010:check_init_srcu_struct+0xe/0x61
      Code: 48 c7 c6 ec 8a 65 b4 e8 ff 79 fe ff 48 89 df 31 f6 e8 f2 fa ff ff 5a
      5b 41 5c 5d c3 0f 1f 44 00 00 83 3d 68 94 b8 01 01 75 02 <0f> 0b 48 8b 87 f0
      0a 00 00 a8 03 74 45 55 48 89 e5 41 55 41 54 4c
      RSP: 0000:ffff96eb9ea03e68 EFLAGS: 00010246
      RAX: ffff96eb962b5b01 RBX: ffffffffb4a87420 RCX: 0000000000000001
      RDX: ffffffffb3107969 RSI: ffff96eb962b5b40 RDI: ffffffffb4a87420
      RBP: ffff96eb9ea03eb0 R08: ffffabbd00cd7f48 R09: 0000000000000000
      R10: ffff96eb9ea03e68 R11: ffffffffb4a6eec0 R12: ffff96eb962b5b40
      R13: ffff96eb9ea03ef8 R14: ffffffffb3107969 R15: ffffffffb3107948
      FS:  0000000000000000(0000) GS:ffff96eb9ea00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff96eb13ab2000 CR3: 0000000192a1e001 CR4: 00000000001606f0
      Call Trace:
       <IRQ>
       ? __call_srcu+0x2d/0x290
       ? rcu_process_callbacks+0x26e/0x448
       ? allocate_probes+0x2b/0x2b
       call_srcu+0x13/0x15
       rcu_free_old_probes+0x1f/0x21
       rcu_process_callbacks+0x2ed/0x448
       __do_softirq+0x172/0x336
       irq_exit+0x62/0xb2
       smp_apic_timer_interrupt+0x161/0x19e
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      
      The problem is that enabling trace events before RCU is set up causes
      SRCU to give this warning.  To avoid this, add a list to store probes
      that need to be freed until after RCU is initialized, and free them at
      that point.
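      The mechanism is essentially a deferred free list (a sketch; the symbol
      names are assumed):

          static LIST_HEAD(early_probes);      /* releases queued before RCU is up */
          static bool ok_to_free_tracepoints;  /* flipped from an early_initcall() */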
      
      Link: http://lkml.kernel.org/r/20180810113554.1df28050@gandalf.local.home
      Link: http://lkml.kernel.org/r/20180810123517.5e9714ad@gandalf.local.home
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Fixes: e6753f23 ("tracepoint: Make rcuidle tracepoint callers use SRCU")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      f8a79d5c
    • uprobes: Use synchronize_rcu() not synchronize_sched() · 016f8ffc
      Authored by Steven Rostedt (VMware)
      While debugging another bug, I was looking at all the synchronize*()
      functions being used in kernel/trace, and noticed that trace_uprobes was
      using synchronize_sched(), with a comment to synchronize with
      {u,ret}_probe_trace_func(). When looking at those functions, the data is
      protected with "rcu_read_lock()" and not with "rcu_read_lock_sched()". This
      is using the wrong synchronize_*() function.
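      The fix itself is essentially the one-line switch in the uprobe trace
      code:

          -  synchronize_sched();
          +  synchronize_rcu();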
      
      Link: http://lkml.kernel.org/r/20180809160553.469e1e32@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Fixes: 70ed91c6 ("tracing/uprobes: Support ftrace_event_file base multibuffer")
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      016f8ffc
    • tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister() · e0a568dc
      Authored by Steven Rostedt (VMware)
      Now that some trace events can be protected by srcu_read_lock(tracepoint_srcu),
      we need to make sure all locations that depend on this are also
      protected.  There were many places that did a synchronize_sched(),
      thinking that it was enough to protect against access to trace events.
      This used to be the case, but now that we use SRCU for _rcuidle() trace
      events, they may not be protected by synchronize_sched(), as they may be
      called in paths where RCU is not watching, even with preemption
      disabled.
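      The replacement helper is expected to wait for both grace periods,
      roughly (a sketch, assuming the SRCU-aware variant of the helper):

          static inline void tracepoint_synchronize_unregister(void)
          {
                  synchronize_srcu(&tracepoint_srcu);
                  synchronize_sched();
          }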
      
      Fixes: e6753f23 ("tracepoint: Make rcuidle tracepoint callers use SRCU")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      e0a568dc
    • ftrace: Remove unused pointer ftrace_swapper_pid · b207de3e
      Authored by Colin Ian King
      The pointer ftrace_swapper_pid is defined but never used, hence it is
      redundant and can be removed.  The use of this variable was removed in
      commit 345ddcc8 ("ftrace: Have set_ftrace_pid use the bitmap
      like events do").
      
      Cleans up clang warning:
      warning: 'ftrace_swapper_pid' defined but not used [-Wunused-const-variable=]
      
      Link: http://lkml.kernel.org/r/20180809125609.13142-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      b207de3e
    • tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage" · 3f1756dc
      Authored by Steven Rostedt (VMware)
      Joel Fernandes created a nice patch that cleaned up the duplicate hooks
      used by lockdep and the irqsoff latency tracer.  It made both use
      tracepoints.  But the latency tracer triggers warnings when tracepoints
      are used to call into its routines.  Mainly, they can be called from NMI
      context.  If that happens, then SRCU may not work properly, because on
      some architectures SRCU is not safe to be called in both NMI and non-NMI
      context.
      
      This is a partial revert of the clean up patch c3bc8fd6 ("tracing:
      Centralize preemptirq tracepoints and unify their usage") that adds back the
      direct calls into the latency tracer. It also only calls the trace events
      when not in NMI.
      
      Link: http://lkml.kernel.org/r/20180809210654.622445925@goodmis.org
      Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Fixes: c3bc8fd6 ("tracing: Centralize preemptirq tracepoints and unify their usage")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      3f1756dc
    • tracing/irqsoff: Handle preempt_count for different configs · f27107fa
      Authored by Steven Rostedt (VMware)
      I was hitting the following warning:
      
      WARNING: CPU: 0 PID: 1 at kernel/trace/trace_irqsoff.c:631 tracer_hardirqs_off+0x15/0x2a
      
      Modules linked in:
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.18.0-rc6-test+ #13
      Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
      EIP: tracer_hardirqs_off+0x15/0x2a
      Code: ff 85 c0 74 0e 8b 45 00 8b 50 04 8b 45 04 e8 35 ff ff ff 5d c3 55 64 a1 cc 37 51 c1 a9 ff ff ff 7f 89 e5 53 89 d3 89 ca 75 02 <0f> 0b e8 90 fc ff ff 85 c0 74 07 89 d8 e8 0c ff ff ff 5b 5d c3 55
      EAX: 80000000 EBX: c04337f0 ECX: c04338e3 EDX: c04338e3
      ESI: c04337f0 EDI: c04338e3 EBP: f2aa1d68 ESP: f2aa1d64
      DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210046
      CR0: 80050033 CR2: 00000000 CR3: 01668000 CR4: 001406f0
      Call Trace:
       trace_irq_disable_rcuidle+0x63/0x6c
       trace_hardirqs_off+0x26/0x30
       default_send_IPI_mask_allbutself_logical+0x31/0x93
       default_send_IPI_allbutself+0x37/0x48
       native_send_call_func_ipi+0x4d/0x6a
       smp_call_function_many+0x165/0x19d
       ? add_nops+0x34/0x34
       ? trace_hardirqs_on_caller+0x2d/0x2d
       ? add_nops+0x34/0x34
       smp_call_function+0x1f/0x23
       on_each_cpu+0x15/0x43
       ? trace_hardirqs_on_caller+0x2d/0x2d
       ? trace_hardirqs_on_caller+0x2d/0x2d
       ? trace_irq_disable_rcuidle+0x1/0x6c
       text_poke_bp+0xa0/0xc2
       ? trace_hardirqs_on_caller+0x2d/0x2d
       arch_jump_label_transform+0xa7/0xcb
       ? trace_irq_disable_rcuidle+0x5/0x6c
       __jump_label_update+0x3e/0x6d
       jump_label_update+0x7d/0x81
       static_key_slow_inc_cpuslocked+0x58/0x6d
       static_key_slow_inc+0x19/0x20
       tracepoint_probe_register_prio+0x19e/0x1d1
       ? start_critical_timings+0x1c/0x1c
       tracepoint_probe_register+0xf/0x11
       irqsoff_tracer_init+0x21/0xf2
       tracer_init+0x16/0x1a
       trace_selftest_startup_irqsoff+0x25/0xc4
       run_tracer_selftest+0xca/0x131
       register_tracer+0xd5/0x172
       ? trace_event_define_fields_preemptirq_template+0x45/0x45
       init_irqsoff_tracer+0xd/0x11
       do_one_initcall+0xab/0x1e8
       ? rcu_read_lock_sched_held+0x3d/0x44
       ? trace_initcall_level+0x52/0x86
       kernel_init_freeable+0x195/0x21a
       ? rest_init+0xb4/0xb4
       kernel_init+0xd/0xe4
       ret_from_fork+0x2e/0x38
      
      It is due to running a CONFIG_PREEMPT_VOLUNTARY kernel, which would trigger
      this warning every time:
      
      	WARN_ON_ONCE(!preempt_count());
      
      Because on CONFIG_PREEMPT_VOLUNTARY, preempt_count() is always zero.
      
      This warning is there to make sure preempt_count is set, because
      tracer_hardirqs_on() does a preempt_enable_notrace() to make
      preempt_trace() work properly: being called by a trace event, the trace
      event code disables preemption, and the tracer wants to know what the
      preempt count was before it was called.
      
      Instead of enabling preemption like this, just record the preempt_count,
      subtract PREEMPT_DISABLE_OFFSET from it (which is zero with !CONFIG_PREEMPT
      set), and pass that value to the necessary functions, which should use the
      passed-in parameter instead of calling preempt_count() directly.
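      Concretely, the callers compute the count once and hand it down, along
      these lines (a sketch; the receiving functions in trace_irqsoff.c are
      not shown):

          /* zero minus zero when preempt counting is not configured */
          int pc = preempt_count() - PREEMPT_DISABLE_OFFSET;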
      
      Fixes: da5b3ebb ("tracing: irqsoff: Account for additional preempt_disable")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      f27107fa
    • tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage" · bff1b208
      Authored by Steven Rostedt (VMware)
      Joel Fernandes created a nice patch that cleaned up the duplicate hooks
      used by lockdep and the irqsoff latency tracer.  It made both use
      tracepoints.  But it caused lockdep to trigger several false positives.
      We have not figured out why yet, but removing lockdep from the trace
      event hooks and just calling its helper functions directly (like it used
      to) makes the problem go away.
      
      This is a partial revert of the clean up patch c3bc8fd6 ("tracing:
      Centralize preemptirq tracepoints and unify their usage") that adds direct
      calls for lockdep, but also keeps most of the clean up done to get rid of
      the horrible preprocessor if statements.
      
      Link: http://lkml.kernel.org/r/20180806155058.5ee875f4@gandalf.local.home
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Fixes: c3bc8fd6 ("tracing: Centralize preemptirq tracepoints and unify their usage")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      bff1b208
    • bpf: btf: add pretty print for hash/lru_hash maps · 699c86d6
      Authored by Yonghong Song
      Commit a26ca7c9 ("bpf: btf: Add pretty print support to
      the basic arraymap") added pretty print support to array map.
      This patch adds pretty print for hash and lru_hash maps.
      The following example shows the pretty-print result of
      a pinned hashmap:
      
          struct map_value {
                  int count_a;
                  int count_b;
          };
      
          cat /sys/fs/bpf/pinned_hash_map:
      
          87907: {87907,87908}
          57354: {37354,57355}
          76625: {76625,76626}
          ...
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      699c86d6
    • bpf: fix bpffs non-array map seq_show issue · dc1508a5
      Authored by Yonghong Song
      In the function map_seq_next() of kernel/bpf/inode.c, the first key
      will be "0" regardless of the map type.  This works for arrays.  But
      for the hash type, if it happens that key "0" is in the map, the bpffs
      map show will miss some items if key "0" is not the first element of
      the first bucket.
      
      This patch fixes the issue by guaranteeing to get the first element,
      when the seq_show has just started, by passing a NULL key pointer to
      the map_get_next_key() callback.  This way, no elements will be missed
      in the bpffs hash table show even if key "0" is in the map.
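      The shape of the fix in map_seq_next(), as a sketch (the variable
      names are assumed):

          if (!*pos)      /* very first iteration of the show */
                  err = map->ops->map_get_next_key(map, NULL, key);
          else
                  err = map->ops->map_get_next_key(map, prev_key, key);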
      
      Fixes: a26ca7c9 ("bpf: btf: Add pretty print support to the basic arraymap")
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      dc1508a5
  9. 10 Aug 2018, 2 commits