1. 29 May 2018 (3 commits)
  2. 26 May 2018 (7 commits)
    • mm/memory_hotplug: fix leftover use of struct page during hotplug · a2155861
      Jonathan Cameron authored
      The case of a new NUMA node was missed when we stopped using the node
      info from struct page during hotplug.  In this path we have a call to
      register_mem_sect_under_node (which allows us to specify that this is
      hotplug, so don't change the node) via link_mem_sections, which
      unfortunately does not pass that information through.
      
      The fix is to pass check_nid through link_mem_sections as well, and to
      disable it in the new NUMA node path, as sketched below.
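
      A hedged sketch of the fix's shape (simplified, not the verbatim
      kernel code): link_mem_sections() grows a check_nid flag and forwards
      it to register_mem_sect_under_node() for every present section, and
      the new-node path in add_memory_resource() passes false.

        int link_mem_sections(int nid, unsigned long start_pfn,
                              unsigned long nr_pages, bool check_nid)
        {
                unsigned long end_pfn = start_pfn + nr_pages;
                unsigned long pfn;
                int err = 0;

                for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
                        unsigned long section_nr = pfn_to_section_nr(pfn);
                        struct memory_block *mem_blk;

                        if (!present_section_nr(section_nr))
                                continue;
                        mem_blk = find_memory_block(__nr_to_section(section_nr));
                        err = register_mem_sect_under_node(mem_blk, nid,
                                                           check_nid);
                        if (err)
                                break;
                }
                return err;
        }

        /* add_memory_resource(), new-node path: the node was empty before,
         * so struct page contents cannot be trusted; skip the nid check. */
        link_mem_sections(nid, start_pfn, nr_pages, false);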
      
      Note the bug only 'sometimes' manifests, depending on what happens to
      be in the struct page structures: there are lots of them, and only one
      needs to match.
      
      The result of the bug is that (with a new memory-only node) we never
      successfully call register_mem_sect_under_node, so the memory doesn't
      get associated with the node in sysfs, and meminfo for the node doesn't
      report it.
      
      It came up whilst testing some arm64 hotplug patches, but appears to be
      universal.  Whilst I'm triggering it by removing then reinserting memory
      on a node with no other elements (thus making the node disappear and
      then appear again), it would also happen when hotplugging memory where
      there was none before, and it doesn't seem to be related to the arm64
      patches.
      
      These patches call __add_pages (where most of the issue was fixed by
      Pavel's patch).  If there is a node at the time of the __add_pages call
      then all is well, as register_mem_sect_under_node is called from there
      with check_nid set to false.  Without a node, that function returns
      without having done the sysfs-related work, as there is no node to use.
      This is expected, but it is the resulting path that fails...
      
      Exact path to the problem is as follows:
      
       mm/memory_hotplug.c: add_memory_resource()
      
         The node is not online, so we enter both 'if (new_node)' blocks; in
         the second one there is a call to link_mem_sections, which calls
         into

        drivers/base/node.c: link_mem_sections(), which calls

        drivers/base/node.c: register_mem_sect_under_node(), which calls
           get_nid_for_pfn and keeps trying until the output of that matches
           the expected node (passed all the way down from
           add_memory_resource)
      
      It is effectively the same fix as the one referred to in the Fixes tag,
      just in the code path for a new node, where the comments point out that
      we have to rerun the link creation because it will have failed in
      register_new_memory (as there was no node at the time).  (That comment
      is now stale: we no longer have register_new_memory, as it was renamed
      to hotplug_memory_register in Pavel's patch.)
      
      Link: http://lkml.kernel.org/r/20180504085311.1240-1-Jonathan.Cameron@huawei.com
      Fixes: fc44f7f9 ("mm/memory_hotplug: don't read nid from struct page during hotplug")
      Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2155861
    • mm: do not warn on offline nodes unless the specific node is explicitly requested · 8addc2d0
      Michal Hocko authored
      Oscar has noticed that we splat
      
         WARNING: CPU: 0 PID: 64 at ./include/linux/gfp.h:467 vmemmap_alloc_block+0x4e/0xc9
         [...]
         CPU: 0 PID: 64 Comm: kworker/u4:1 Tainted: G        W   E     4.17.0-rc5-next-20180517-1-default+ #66
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
         Workqueue: kacpi_hotplug acpi_hotplug_work_fn
         Call Trace:
          vmemmap_populate+0xf2/0x2ae
          sparse_mem_map_populate+0x28/0x35
          sparse_add_one_section+0x4c/0x187
          __add_pages+0xe7/0x1a0
          add_pages+0x16/0x70
          add_memory_resource+0xa3/0x1d0
          add_memory+0xe4/0x110
          acpi_memory_device_add+0x134/0x2e0
          acpi_bus_attach+0xd9/0x190
          acpi_bus_scan+0x37/0x70
          acpi_device_hotplug+0x389/0x4e0
          acpi_hotplug_work_fn+0x1a/0x30
          process_one_work+0x146/0x340
          worker_thread+0x47/0x3e0
          kthread+0xf5/0x130
          ret_from_fork+0x35/0x40
      
      when adding memory to a node that is currently offline.
      
      The VM_WARN_ON is just too loud without a good reason.  In this
      particular case we are doing
      
      	alloc_pages_node(node, GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN, order)
      
      so we do not insist on allocating from the given node (it is more of a
      hint) and can fall back to any other populated node; moreover, we
      explicitly ask not to warn on allocation failure.
      
      Soften the warning to fire only when somebody asks for the given node
      explicitly via __GFP_THISNODE, as sketched below.
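
      A hedged sketch of the softened check (shape only, not necessarily
      the verbatim hunk): the warning in __alloc_pages_node() now triggers
      only if the caller insisted on the node.

        /* include/linux/gfp.h (sketch) */
        static inline struct page *
        __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
        {
                VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
                /* warn only when the node was explicitly requested */
                VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));

                return __alloc_pages(gfp_mask, order, nid);
        }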
      
      Link: http://lkml.kernel.org/r/20180523125555.30039-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Oscar Salvador <osalvador@techadventures.net>
      Tested-by: Oscar Salvador <osalvador@techadventures.net>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8addc2d0
    • net/mlx5: Use order-0 allocations for all WQ types · 3a2f7033
      Tariq Toukan authored
      Complete the transition of all WQ types to use fragmented
      order-0 coherent memory instead of high-order allocations.
      
      CQ-WQ already uses order-0.
      Here we do the same for cyclic and linked-list WQs.
      
      This allows the driver to load cleanly on systems with highly
      fragmented coherent memory; a hedged sketch of the idea follows.
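
      A hedged illustration of the approach (not the mlx5 code itself): back
      a work queue with an array of order-0 DMA-coherent fragments instead
      of one contiguous high-order allocation.  The struct and MAX_WQ_FRAGS
      are illustrative.

        #define MAX_WQ_FRAGS 64 /* illustrative bound */

        struct wq_frag_buf {
                int        npages;
                void      *frags[MAX_WQ_FRAGS];
                dma_addr_t dma[MAX_WQ_FRAGS];
        };

        static int wq_frag_buf_alloc(struct device *dev,
                                     struct wq_frag_buf *buf, size_t size)
        {
                int i;

                buf->npages = DIV_ROUND_UP(size, PAGE_SIZE);
                for (i = 0; i < buf->npages; i++) {
                        /* order-0: each piece is a single page, so this
                         * succeeds even on fragmented coherent memory */
                        buf->frags[i] = dma_alloc_coherent(dev, PAGE_SIZE,
                                                           &buf->dma[i],
                                                           GFP_KERNEL);
                        if (!buf->frags[i])
                                return -ENOMEM; /* caller unwinds */
                }
                return 0;
        }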
      
      Performance tests:
      ConnectX-5 100Gbps, CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Packet rate of 64B packets, single transmit ring, size 8K.
      
      No performance degradation was observed.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      3a2f7033
    • net/mlx5: Add cap bits for flow table destination in FDB table · b4563002
      Chris Mi authored
      If set, the FDB table supports the forward action with a
      destination list that includes a flow table.
      Signed-off-by: Chris Mi <chrism@mellanox.com>
      Reviewed-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      b4563002
    • qed*: Support drop action classification · 608e00d0
      Manish Chopra authored
      With this patch, the user can configure supported flows to be dropped.
      A "gft_filter_drop" statistic, counting the dropped flows, is also
      exposed via ethtool.
      
      For example -
      
      ethtool -N p5p1 flow-type udp4 dst-port 8000 action -1
      ethtool -N p5p1 flow-type tcp4 src-ip 192.168.8.1 action -1
      Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
      Signed-off-by: Shahed Shaikh <shahed.shaikh@cavium.com>
      Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      608e00d0
    • qed*: Support other classification modes. · 3893fc62
      Manish Chopra authored
      Currently, the driver supports flow classification to PF receive
      queues based only on the TCP/UDP 4-tuple [src_ip, dst_ip, src_port,
      dst_port].

      This patch makes it possible to configure different flow profiles
      [for example, only UDP dest port or src_ip based] on the adapter, so
      that classification can be done according to just those fields as
      well.  However, only one type of flow configuration is supported at a
      time, due to the limited number of flow profiles available on the
      device.
      
      For example -
      
      ethtool -N enp7s0f0 flow-type udp4 dst-port 45762 action 2
      ethtool -N enp7s0f0 flow-type tcp4 src-ip 192.16.4.10 action 1
      ethtool -N enp7s0f0 flow-type udp6 dst-port 45762 action 3
      Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
      Signed-off-by: Shahed Shaikh <shahed.shaikh@cavium.com>
      Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3893fc62
    • net: bridge: add support for port isolation · 7d850abd
      Nikolay Aleksandrov authored
      This patch adds support for a new port flag, BR_ISOLATED.  If it is
      set, isolated ports cannot communicate with each other, but they can
      still communicate with non-isolated ports.  The same can be achieved
      via ACLs, but those don't scale to a large number of ports, and the
      complexity of the rules grows.  This feature can also be used to
      achieve isolated-vlan functionality (similar to pvlan), though
      currently it is port-wide (for all vlans on the port).  The new test
      in should_deliver uses data that is already cache-hot, and the new
      boolean is used to avoid an additional source-port test in
      should_deliver (a sketch follows below).
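
      A hedged sketch of the delivery check (shape only): delivery is
      refused only when both the ingress and the egress port are isolated;
      src_port_isolated is the cached boolean mentioned above.

        static inline bool br_skb_isolated(const struct net_bridge_port *to,
                                           const struct sk_buff *skb)
        {
                /* block isolated-to-isolated traffic only */
                return BR_INPUT_SKB_CB(skb)->src_port_isolated &&
                       (to->flags & BR_ISOLATED);
        }

      With a sufficiently recent iproute2, the flag would typically be
      toggled with something like 'bridge link set dev swp1 isolated on'
      (command form hedged; swp1 is an example device).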
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d850abd
  3. 25 May 2018 (10 commits)
  4. 24 May 2018 (6 commits)
    • bpf: properly enforce index mask to prevent out-of-bounds speculation · c93552c4
      Daniel Borkmann authored
      While reviewing the verifier code, I recently noticed that the
      following two program variants in relation to tail calls can be
      loaded.
      
      Variant 1:
      
        # bpftool p d x i 15
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:5]
          3: (05) goto pc+2
          4: (18) r2 = map[id:6]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0xa0 goto pc+2
          8: (54) (u32) r3 &= (u32) 255
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 5
          5: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
        # bpftool m s i 6
          6: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
      
      Variant 2:
      
        # bpftool p d x i 20
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:8]
          3: (05) goto pc+2
          4: (18) r2 = map[id:7]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0x4 goto pc+2
          8: (54) (u32) r3 &= (u32) 3
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 8
          8: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
        # bpftool m s i 7
          7: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
      
      In both cases, the index masking inserted by the verifier to control
      out-of-bounds speculation by the CPU, via b2157399 ("bpf: prevent
      out-of-bounds speculation"), seems to be incorrect in what it is
      enforcing.  In the 1st variant, the mask is taken from the map with
      the significantly larger number of entries, so we would allow a
      certain degree of out-of-bounds speculation for the smaller map; in
      the 2nd variant, where the mask is taken from the map with the
      smaller number of entries, we get buggy behavior since we truncate
      the index of the larger map.
      
      The original intent of commit b2157399 is to reject cases where two
      or more different tail call maps are used in the same tail call
      helper invocation.  However, the check on BPF_MAP_PTR_POISON is
      never hit, since we never poisoned the saved pointer in the first
      place!  We do this explicitly for map lookups, but in the case of
      tail calls we basically used the tail call map in insn_aux_data
      that was processed in the most recent path the verifier walked.
      Thus any prior path that stored a pointer in insn_aux_data at the
      helper location was always overridden.
      
      Fix it by moving the map pointer poison logic into a small helper
      that covers both BPF helpers with the same logic (a hedged sketch
      follows below).  After that, the poison check in fixup_bpf_calls()
      is hit for tail calls and the program is rejected.  The latter only
      happens in the unprivileged case, since this is the *only* occasion
      where a rewrite needs to happen, and where such a rewrite is
      specific to the map (max_entries, index_mask).  In the privileged
      case the rewrite is generic for the insn->imm / insn->code update,
      so multiple maps from different paths can be handled just fine,
      since all the remaining logic happens in the instruction processing
      itself.  This is similar to the case of map lookups: if there is a
      collision of maps in fixup_bpf_calls(), we must skip the inlined
      rewrite, since it would turn the generic instruction sequence into
      a non-generic one.  Thus patch_call_imm will simply update the
      insn->imm location, and bpf_map_lookup_elem() will later take care
      of the dispatch.  Given we need this 'poison' state as a check, the
      information of whether a map is an unpriv_array gets lost, so
      enforcing it prior to that needs an additional state.  In general
      this check is needed, since there are some complex and tail-call
      intensive BPF programs out there where LLVM tends to generate such
      code occasionally.  We therefore convert map_ptr into map_state to
      store all this without extra memory overhead, and the bit saying
      whether one of the maps involved in the collision was from an
      unpriv_array is retained there as well.
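
      A hedged sketch of the poison logic (simplified; the real patch packs
      this state into insn_aux_data's map_state together with the unpriv
      bit): remember the first map seen at a helper call site and poison
      the slot if a different verifier path brings a different map, so
      that fixup_bpf_calls() can reject the program.

        static void record_map_ptr(struct bpf_insn_aux_data *aux,
                                   struct bpf_map *map)
        {
                if (aux->map_ptr == BPF_MAP_PTR_POISON)
                        return;                 /* already poisoned */
                if (!aux->map_ptr)
                        aux->map_ptr = map;     /* first path to get here */
                else if (aux->map_ptr != map)
                        aux->map_ptr = BPF_MAP_PTR_POISON; /* collision */
        }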
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c93552c4
    • ipv6: sr: Add seg6local action End.BPF · 004d4b27
      Mathieu Xhonneux authored
      This patch adds the End.BPF action to the LWT seg6local infrastructure.
      This action works like any other seg6local End action, meaning that an
      IPv6 header with an SRH is needed, whose DA has to be equal to the SID
      of the action.  It also advances the SRH to the next segment; the BPF
      program does not have to take care of this.
      
      Since the BPF program must not become a source of instability in the
      kernel, it is important to ensure that the integrity of the packet is
      maintained before yielding it back to the IPv6 layer.  The hook hence
      keeps track of whether the SRH has been altered through the helpers,
      and re-validates its content if needed with seg6_validate_srh.  The
      state kept for validation is stored in a per-CPU buffer.  The BPF
      program is not allowed to write directly into the packet; only some
      fields of the SRH can be altered, through the helper
      bpf_lwt_seg6_store_bytes.
      
      Performance profiling has shown that the SRH re-validation does not
      induce a significant overhead.  If the altered SRH is deemed invalid,
      the packet is dropped.
      
      This validation is also done before executing any action through
      bpf_lwt_seg6_action, and will not be performed again if the SRH is not
      modified after calling the action.
      
      The BPF program may return three types of return codes (a minimal
      program skeleton follows the list):
          - BPF_OK: the End.BPF action will look up the next destination
            through seg6_lookup_nexthop.
          - BPF_REDIRECT: if an action has been executed through the
            bpf_lwt_seg6_action helper, the BPF program should return this
            value, as the skb's destination is already set and the default
            lookup should not be performed.
          - BPF_DROP: the packet will be dropped.
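
      A hedged, minimal End.BPF program skeleton (include paths and section
      name depend on the toolchain; this assumes libbpf-style headers and
      an illustrative program name):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("lwt_seg6local")
        int end_bpf_prog(struct __sk_buff *skb)
        {
                /* SRH fields could be rewritten here with
                 * bpf_lwt_seg6_store_bytes(), or an action applied with
                 * bpf_lwt_seg6_action() followed by returning
                 * BPF_REDIRECT instead. */
                return BPF_OK; /* default seg6_lookup_nexthop() */
        }

        char _license[] SEC("license") = "GPL";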
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      004d4b27
    • bpf: Split lwt inout verifier structures · cd3092c7
      Mathieu Xhonneux authored
      The new bpf_lwt_push_encap helper should only be accessible from
      within the LWT BPF IN hook, and not the OUT one, as using it there
      may lead to a kernel panic while handling the skb.
      
      At the moment, both LWT BPF IN and OUT share the same list of helpers,
      whose calls are authorized by the verifier.  This patch separates the
      verifier ops for the IN and OUT hooks, and allows the IN hook to call
      the bpf_lwt_push_encap helper.

      This patch is also an occasion to group all the lwt_*_func_proto
      functions together for clarity; at the moment, sock_ops_func_proto
      sits in the middle of lwt_inout_func_proto and lwt_xmit_func_proto.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      cd3092c7
    • gso: limit udp gso to egress-only virtual devices · 8eea1ca8
      Willem de Bruijn authored
      Until the udp receive stack supports large packets (UDP GRO), GSO
      packets must not loop from the egress to the ingress path.
      
      Revert the change that added NETIF_F_GSO_UDP_L4 to various virtual
      devices through NETIF_F_GSO_ENCAP_ALL as this included devices that
      may loop packets, such as veth and macvlan.
      
      Instead, add it to specific devices that forward to another device's
      egress path: bonding and team.
      
      Fixes: 83aa025f ("udp: add gso support to virtual devices")
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8eea1ca8
    • net: add skeleton of bpfilter kernel module · d2ba09c1
      Alexei Starovoitov authored
      bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
      and user-mode helper (umh) code that is embedded into the module.
      
      The steps to build bpfilter.ko are the following:
      - main.c is compiled by HOSTCC into the bpfilter_umh ELF executable
      - with quite a bit of objcopy and Makefile magic, the bpfilter_umh ELF
        file is converted into the bpfilter_umh.o object file
        with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
        Example:
        $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
        0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
        0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
        0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
      - bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko
      
      bpfilter_kern.c is normal kernel module code that calls the
      fork_usermode_blob() helper to execute part of its own data as a
      user-mode process (a hedged sketch of this follows).
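
      A hedged sketch of how the module's __init might start the embedded
      blob (simplified, not the verbatim bpfilter code; the symbol names
      follow the objcopy convention shown above):

        extern char _binary_net_bpfilter_bpfilter_umh_start[];
        extern char _binary_net_bpfilter_bpfilter_umh_end[];

        static struct umh_info info;

        static int __init load_umh(void)
        {
                int err;

                /* spawn the embedded blob as a user-mode process */
                err = fork_usermode_blob(_binary_net_bpfilter_bpfilter_umh_start,
                                         _binary_net_bpfilter_bpfilter_umh_end -
                                         _binary_net_bpfilter_bpfilter_umh_start,
                                         &info);
                if (err)
                        return err;

                /* the first request/reply handshake over info.pipe_to_umh
                 * and info.pipe_from_umh would go here */
                return 0;
        }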
      
      Notice that the _binary_net_bpfilter_bpfilter_umh_start..end region is
      placed into the .init.rodata section, so it is freed as soon as the
      __init function of bpfilter.ko has finished.  As part of __init,
      bpfilter.ko performs a first request/reply exchange via the two unix
      pipes provided by the fork_usermode_blob() helper to make sure the umh
      is healthy.  If not, it will kill it via its pid.
      
      Later, bpfilter_process_sockopt() will be called from the bpfilter
      hooks in get/setsockopt() to pass iptables commands into the umh via
      bpfilter.ko.

      If the admin does 'rmmod bpfilter', the __exit code of bpfilter.ko
      will kill the umh as well.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d2ba09c1
    • umh: introduce fork_usermode_blob() helper · 449325b5
      Alexei Starovoitov authored
      Introduce helper:
      int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
      struct umh_info {
             struct file *pipe_to_umh;
             struct file *pipe_from_umh;
             pid_t pid;
      };
      
      that GPL-ed kernel modules (signed or unsigned) can use to execute
      part of their own data as a swappable user-mode process.
      
      The kernel will:
      - allocate a unique file in tmpfs
      - populate that file with the [data, data + len] bytes
      - the user-mode-helper code will do_execve that file and, before the
        process starts, the kernel will create two unix pipes for
        bidirectional communication between the kernel module and the umh
      - close the tmpfs file, effectively deleting it
      - fork_usermode_blob() returns zero on success and populates
        'struct umh_info' with the two unix pipes and the pid of the user
        process
      
      As the first step in the development of the bpfilter project, the
      fork_usermode_blob() helper is introduced to allow user-mode code to
      be invoked from a kernel module.  The idea is that user-mode code
      plus normal kernel module code are built as part of the kernel build
      and installed as a traditional kernel module into the distro-specified
      location, such that from a distribution point of view there is no
      difference between regular kernel modules and kernel modules + umh
      code.  Such modules can be signed, modprobed, rmmod'ed, etc.  The use
      of this new helper by a kernel module doesn't make it any more special
      from a kernel and user-space tooling point of view.
      
      Such an approach enables the kernel to delegate functionality
      traditionally done by kernel modules to user-space processes (either
      root or !root) and reduces the security attack surface of the new
      code.  Buggy umh code would crash the user process, but not the
      kernel.  Another advantage is that the umh code of a kernel module
      can be debugged and tested out of user space (e.g. opening the
      possibility to run clang sanitizers, fuzzers or user-space test
      suites on the umh code).  In the case of the bpfilter project, such
      an architecture allows the complex control plane to live in user
      space while the bpf-based data plane stays in the kernel.
      
      Since a umh can crash, be OOM-killed by the kernel, or be killed by
      the admin, the kernel module that uses it (like bpfilter) needs to
      manage the umh's lifetime on its own via the two unix pipes and the
      pid of the umh.

      The exit code of such a kernel module should kill the umh it started,
      so that rmmod of the kernel module cleans up the corresponding umh;
      just as a kernel module that does kmalloc() should kfree() in its
      exit code.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      449325b5
  5. 23 May 2018 (10 commits)
  6. 21 May 2018 (3 commits)
  7. 20 May 2018 (1 commit)
    • bpf: Prevent memory disambiguation attack · af86ca4e
      Alexei Starovoitov authored
      Detect code patterns where a malicious 'speculative store bypass' can
      be used, and sanitize such patterns.
      
       39: (bf) r3 = r10
       40: (07) r3 += -216
       41: (79) r8 = *(u64 *)(r7 +0)   // slow read
       42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
       43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
       44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
       45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                       // is now sanitized
      
      The above code after the x86 JIT becomes:
       e5: mov    %rbp,%rdx
       e8: add    $0xffffffffffffff28,%rdx
       ef: mov    0x0(%r13),%r14
       f3: movq   $0x0,-0x48(%rbp)
       fb: mov    %rdx,0x0(%r14)
       ff: mov    0x0(%rbx),%rdi
      103: movzbq 0x0(%rdi),%rsi
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      af86ca4e