1. 26 March 2018 (2 commits)
    • net: Drop NETDEV_UNREGISTER_FINAL · 070f2d7e
      By Kirill Tkhai
      The last user is gone after bdf5bd7f "rds: tcp: remove
      register_netdevice_notifier infrastructure.", so we can
      remove this netdevice command. This allows us to delete the
      rtnl_lock() in netdev_run_todo(), which is a hot path for
      net namespace unregistration.
      
      dev_change_net_namespace() and netdev_wait_allrefs()
      have an rcu_barrier() before the NETDEV_UNREGISTER_FINAL call,
      and the source commits say they were introduced to
      delimit the call from NETDEV_UNREGISTER, but this patch
      leaves them in place, since they require additional
      analysis of whether we need them for something else.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      070f2d7e
    • net: Make NETDEV_XXX commands enum { } · ede2762d
      By Kirill Tkhai
      This patch is preparation for dropping NETDEV_UNREGISTER_FINAL.
      Since the cmd is used in usnic_ib_netdev_event_to_string()
      to get the cmd name, simply removing NETDEV_UNREGISTER_FINAL
      from everywhere would leave holes in event2str[] in this
      function.
      
      Instead of that, let's make the NETDEV_XXX command names
      available to everyone, and define netdev_cmd_to_name()
      in such a way that we won't have to shuffle names after their
      numbers change.
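      
      As a rough sketch of the idea (the exact macro in the patch may differ;
      the command subset and fallback string below are illustrative), both the
      enum and the strings can be generated from one list, so removing or
      reordering a command cannot leave holes in a hand-maintained table:
      
        enum netdev_cmd {
                NETDEV_UP = 1,
                NETDEV_DOWN,
                NETDEV_CHANGE,
                /* ... */
        };
        
        const char *netdev_cmd_to_name(enum netdev_cmd cmd)
        {
        #define N(val) case NETDEV_##val: return "NETDEV_" #val;
                switch (cmd) {
                N(UP) N(DOWN) N(CHANGE)
                }
        #undef N
                return "UNKNOWN_NETDEV_EVENT";
        }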
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ede2762d
  2. 24 March 2018 (7 commits)
  3. 23 March 2018 (8 commits)
    • Revert "mm: page_alloc: skip over regions of invalid pfns where possible" · f59f1caf
      By Daniel Vacek
      This reverts commit b92df1de ("mm: page_alloc: skip over regions of
      invalid pfns where possible").  The commit was meant as a boot init
      speed-up, skipping the loop in memmap_init_zone() for invalid pfns.
      
      But given some specific memory mapping on x86_64 (or more generally,
      theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID), the
      implementation also skips valid pfns, which is plain wrong and causes
      'kernel BUG at mm/page_alloc.c:1389!'
      
        crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
        kernel BUG at mm/page_alloc.c:1389!
        invalid opcode: 0000 [#1] SMP
        --
        RIP: 0010: move_freepages+0x15e/0x160
        --
        Call Trace:
          move_freepages_block+0x73/0x80
          __rmqueue+0x263/0x460
          get_page_from_freelist+0x7e1/0x9e0
          __alloc_pages_nodemask+0x176/0x420
        --
      
        crash> page_init_bug -v | grep RAM
        <struct resource 0xffff88067fffd2f8>          1000 -        9bfff       System RAM (620.00 KiB)
        <struct resource 0xffff88067fffd3a0>        100000 -     430bffff       System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
        <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff       System RAM ( 14.83 MiB = 15188.00 KiB)
        <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff       System RAM (391.02 MiB = 400408.00 KiB)
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct resource 0xffff88067fffd640>     100000000 -    67fffffff       System RAM ( 22.00 GiB)
      
        crash> page_init_bug | head -6
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        <struct page 0xffffea0001ede200>       505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
        <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
        <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        BUG, zones differ!
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001ed7fc0  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  0 0       <<<<
        ffffea0001ede1c0  7b787000                0        0  0 0
        ffffea0001ede200  7b788000                0        0  1 1fffff00000000
      
      Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Signed-off-by: Daniel Vacek <neelx@redhat.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f59f1caf
    • mm/vmalloc: add interfaces to free unmapped page table · b6bdb751
      By Toshi Kani
      On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
      create pud/pmd mappings.  A kernel panic was observed on arm64 systems
      with Cortex-A75 in the following steps, as described by Hanjun Guo:
      
       1. ioremap a 4K size; a valid page table is built.
       2. iounmap it; pte0 is set to 0.
       3. ioremap the same address with 2M size; pgd/pmd is unchanged,
          then a new value is set for the pmd.
       4. pte0 is leaked.
       5. The CPU may hit an exception because the old pmd is still in the
          TLB, which leads to a kernel panic.
      
      This panic is not reproducible on x86.  INVLPG, called from iounmap,
      purges all levels of entries associated with the purged address on x86.
      x86 still has the memory leak, though.
      
      The patch changes the ioremap path to free unmapped page table(s) since
      doing so in the unmap path has the following issues:
      
       - The iounmap() path is shared with vunmap(). Since vmap() only
         supports pte mappings, making vunmap() free a pte page is an
         overhead for regular vmap users, as they do not need a pte page
         freed up.
      
       - Checking if all entries in a pte page are cleared in the unmap path
         is racy, and serializing this check is expensive.
      
       - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
         Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
         purge.
      
      Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
      clear a given pud/pmd entry and free up a page for the lower level
      entries.
      
      This patch implements their stub functions on x86 and arm64, which
      work as a workaround.
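      
      A minimal sketch of what such workaround stubs can look like (whether the
      real stubs are exactly this is an assumption; the idea is to report
      success only when the entry is already clear, so ioremap() falls back to
      pte mappings rather than overwrite a live table):
      
        /* succeed only if there is no pmd page hanging off this pud */
        int pud_free_pmd_page(pud_t *pud)
        {
                return pud_none(*pud);
        }
        
        /* succeed only if there is no pte page hanging off this pmd */
        int pmd_free_pte_page(pmd_t *pmd)
        {
                return pmd_none(*pmd);
        }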
      
      [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
      Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Reported-by: Lei Li <lious.lilei@hisilicon.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6bdb751
    • net: Replace ip_ra_lock with per-net mutex · d9ff3049
      By Kirill Tkhai
      Since ra_chain is per-net, we may use a per-net mutex
      to protect it in ip_ra_control(). This improves
      scalability.
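      
      A hedged sketch of the resulting locking pattern (the mutex field name
      and its exact placement in struct net are assumptions for illustration):
      
        int ip_ra_control(struct sock *sk, unsigned char on,
                          void (*destructor)(struct sock *))
        {
                struct net *net = sock_net(sk);
        
                mutex_lock(&net->ipv4.ra_mutex);
                /* walk and modify the per-net net->ipv4.ra_chain here,
                 * instead of serializing on a single global ip_ra_lock */
                mutex_unlock(&net->ipv4.ra_mutex);
                return 0;
        }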
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d9ff3049
    • net: Make ip_ra_chain per struct net · 5796ef75
      By Kirill Tkhai
      This is an optimization, which makes ip_call_ra_chain()
      iterate over fewer sockets to find the ones it is looking for.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5796ef75
    • net: qualcomm: rmnet: Export mux_id and flags to netlink · 14452ca3
      By Subash Abhinov Kasiviswanathan
      Define new netlink attributes for the rmnet mux_id and flags. These
      flags / mux_id were earlier reusing the vlan flags / id respectively.
      The flag bits are also moved to uapi and are renamed with the
      prefix RMNET_FLAG_*.
      
      Also add the rmnet policy to handle the new netlink attributes.
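      
      A hedged sketch of what the new attributes and policy could look like
      (attribute names, types, and the flags struct are assumptions based on
      the description above, not necessarily the exact uapi):
      
        struct ifla_rmnet_flags {       /* assumed uapi struct */
                __u32 flags;
                __u32 mask;
        };
        
        enum {
                IFLA_RMNET_UNSPEC,
                IFLA_RMNET_MUX_ID,
                IFLA_RMNET_FLAGS,
                __IFLA_RMNET_MAX,
        };
        
        static const struct nla_policy rmnet_policy[__IFLA_RMNET_MAX] = {
                [IFLA_RMNET_MUX_ID] = { .type = NLA_U16 },
                [IFLA_RMNET_FLAGS]  = { .len = sizeof(struct ifla_rmnet_flags) },
        };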
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14452ca3
    • tipc: step sk->sk_drops when rcv buffer is full · 872619d8
      By GhantaKrishnamurthy MohanKrishna
      Currently, when tipc is unable to queue a received message on a
      socket, the message is rejected back to the sender with the error
      TIPC_ERR_OVERLOAD. However, the application on this socket
      has no knowledge of these discards.
      
      In this commit, we step the sk_drops counter when tipc
      is unable to queue a received message, and export sk_drops
      using tipc socket diagnostics.
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      872619d8
    • tipc: implement socket diagnostics for AF_TIPC · c30b70de
      By GhantaKrishnamurthy MohanKrishna
      This commit adds socket diagnostics capability for AF_TIPC in the netlink
      family NETLINK_SOCK_DIAG, in a new kernel module (diag.ko).
      
      The following are key design considerations:
      - config TIPC_DIAG defaults to y, like INET_DIAG.
      - only requests with the flag NLM_F_DUMP are supported (dump all).
      - a tipc_sock_diag_req message is introduced to send filter parameters.
      - the response attributes are TLVs, some of them nested.
      
      To avoid exposing data structures between diag and tipc modules and
      avoid code duplication, the following additions are required:
      - export tipc_nl_sk_walk function to reuse socket iterator.
      - export tipc_sk_fill_sock_diag to fill the tipc diag attributes.
      - create a sock_diag response message in __tipc_add_sock_diag defined
        in diag.c and use the above exported tipc_sk_fill_sock_diag
        to fill response.
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c30b70de
    • devlink: Remove top_hierarchy arg to devlink_resource_register · 14530746
      By David Ahern
      The top_hierarchy arg can be determined by comparing parent_resource_id to
      DEVLINK_RESOURCE_ID_PARENT_TOP, so it does not need to be a separate
      argument.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14530746
  4. 22 March 2018 (2 commits)
  5. 21 March 2018 (1 commit)
  6. 20 March 2018 (9 commits)
    • bpf: sk_msg program helper bpf_sk_msg_pull_data · 015632bb
      By John Fastabend
      Currently, if a bpf sk msg program is run, the program
      can only parse data that the (start,end) pointers have already
      consumed. For sendmsg hooks this is likely the first
      scatterlist element. For sendpage this will be the range
      (0,0), because the data is shared with userspace and by
      default we want to avoid allowing userspace to modify
      data while (or after) the BPF verdict is being decided.
      
      To support pulling in additional bytes for parsing, use
      the new helper bpf_sk_msg_pull(start, end, flags), which
      works similarly to the tc cls logic. This helper attempts
      to point the data start pointer at 'start' bytes offset
      into the msg and the data end pointer at 'end' bytes offset
      into the message.
      
      After basic sanity checks to ensure 'start' <= 'end' and
      'end' <= msg_length there are a few cases we need to
      handle.
      
      First, the sendmsg hook has already copied the data from
      userspace and has exclusive access to it. Therefore, it
      is not necessary to copy the data. However, it may
      be required. After finding the scatterlist element with
      the 'start' offset byte in it, there are two cases. In one,
      the range (start, end) is entirely contained in the sg element
      and is already linear. All that is needed is to update the
      data pointers; no allocation/copy is needed. In the other case,
      (start, end) crosses sg element boundaries. Here we
      allocate a block of size 'end - start' and copy
      the data to linearize it.
      
      Next, the sendpage hook has not copied any data in its initial
      state, so the data pointers are (0,0). In this case we
      handle it similarly to the above sendmsg case, except that the
      allocation/copy must always happen. Then, when sending
      the data, we possibly have three memory regions that
      need to be sent: (0, start - 1), (start, end), and
      (end + 1, msg_length). This is required to ensure any
      writes by the BPF program are correctly transmitted.
      
      Lastly, this operation invalidates any previous data checks,
      so BPF programs will have to revalidate pointers after
      making this BPF call.
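      
      A minimal sk_msg program sketch of the intended usage (the text above
      names the helper bpf_sk_msg_pull(); here it is assumed to be exposed to
      programs as bpf_msg_pull_data(), with SEC() coming from bpf_helpers.h):
      
        #include <linux/bpf.h>
        #include "bpf_helpers.h"
        
        SEC("sk_msg")
        int msg_parser(struct sk_msg_md *msg)
        {
                void *data, *data_end;
        
                /* ask for the first 8 bytes to be made available and linear */
                if (bpf_msg_pull_data(msg, 0, 8, 0))
                        return SK_DROP;
        
                /* the pull invalidates old pointers: re-read and re-check them */
                data = msg->data;
                data_end = msg->data_end;
                if (data + 8 > data_end)
                        return SK_DROP;
        
                return SK_PASS;
        }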
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      015632bb
    • bpf: sockmap, add msg_cork_bytes() helper · 91843d54
      By John Fastabend
      In the case where we need a specific number of bytes before a
      verdict can be assigned, even if the data spans multiple sendmsg
      or sendfile calls, the BPF program may use msg_cork_bytes().
      
      The extreme case is that a user can call sendmsg repeatedly with
      1-byte msg segments. Obviously, this is bad for performance but
      is still valid. If the BPF program needs N bytes to validate
      a header, it can use msg_cork_bytes to specify N bytes and the
      BPF program will not be called again until N bytes have been
      accumulated. The infrastructure will attempt to coalesce data
      if possible, so in many cases (most of my use cases at least) the
      data will be in a single scatterlist element with data pointers
      pointing to the start/end of the element. However, this is dependent
      on available memory, so it is not guaranteed. BPF programs must
      therefore validate data pointer ranges, but this is the case anyway,
      to convince the verifier that the accesses are valid.
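      
      A hedged usage sketch (assuming the program-facing helper is exposed as
      bpf_msg_cork_bytes()):
      
        #include <linux/bpf.h>
        #include "bpf_helpers.h"
        
        SEC("sk_msg")
        int cork_until_header(struct sk_msg_md *msg)
        {
                /* need 16 header bytes before a verdict makes sense; do not
                 * run this program again until they have accumulated, even
                 * if they arrive as many 1-byte sendmsg() calls */
                if (msg->data + 16 > msg->data_end)
                        bpf_msg_cork_bytes(msg, 16);
        
                return SK_PASS;
        }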
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      91843d54
    • bpf: sockmap, add bpf_msg_apply_bytes() helper · 2a100317
      By John Fastabend
      A single sendmsg or sendfile system call can contain multiple logical
      messages that a BPF program may want to read and apply a verdict to. But
      without an apply_bytes helper, any verdict on the data applies to all
      bytes in the sendmsg/sendfile. Alternatively, a BPF program may only
      care to read the first N bytes of a msg. If the payload is large, say
      MBs or even GBs, setting up and calling the BPF program repeatedly for
      all bytes, even though the verdict is already known, creates
      unnecessary overhead.
      
      To allow BPF programs to control how many bytes a given verdict
      applies to we implement a bpf_msg_apply_bytes() helper. When called
      from within a BPF program this sets a counter, internal to the
      BPF infrastructure, that applies the last verdict to the next N
      bytes. If N is smaller than the current data being processed
      from a sendmsg/sendfile call, the first N bytes will be sent and
      the BPF program will be re-run with start_data pointing to the N+1
      byte. If N is larger than the current data being processed the
      BPF verdict will be applied to multiple sendmsg/sendfile calls
      until N bytes are consumed.
      
      Note: if a socket closes with the apply_bytes counter non-zero, this
      is not a problem, because data is not being buffered for N bytes;
      it is sent as it is received.
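      
      A hedged usage sketch (assuming the program-facing helper is exposed as
      bpf_msg_apply_bytes()):
      
        #include <linux/bpf.h>
        #include "bpf_helpers.h"
        
        SEC("sk_msg")
        int verdict_first_bytes(struct sk_msg_md *msg)
        {
                /* apply this verdict to the next 4096 bytes without
                 * re-running the program for every sendmsg/sendfile chunk */
                bpf_msg_apply_bytes(msg, 4096);
                return SK_PASS;
        }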
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2a100317
    • bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb
      By John Fastabend
      This implements a BPF ULP layer to allow policy enforcement and
      monitoring at the socket layer. In order to support this a new
      program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
      the sendmsg/sendpage hook. To attach the policy to sockets a
      sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
      
      As with previous sockmap usage, when a sock is added to a sockmap
      via a map update and the map has a BPF_SK_MSG_VERDICT program
      attached, the BPF ULP layer is created on the socket and the
      attached BPF_PROG_TYPE_SK_MSG program is run for every msg in the
      sendmsg case and for every page/offset in the sendpage case.
      
      BPF_PROG_TYPE_SK_MSG Semantics/API:
      
      BPF_PROG_TYPE_SK_MSG supports only two return codes, SK_PASS and
      SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
      case and leaves the data untouched in the sendpage case. Both cases
      return -EACCES to the user. Returning SK_PASS will allow the msg to
      be sent.
      
      In the sendmsg case data is copied into kernel space buffers before
      running the BPF program. The kernel space buffers are stored in a
      scatterlist object where each element is a kernel memory buffer.
      Some effort is made to coalesce data from the sendmsg call here.
      For example a sendmsg call with many one byte iov entries will
      likely be pushed into a single entry. The BPF program is run with
      data pointers (start/end) pointing to the first sg element.
      
      In the sendpage case data is not copied. We opt not to copy the
      data by default here, because the BPF infrastructure does not
      know what bytes will be needed nor when they will be needed. So
      copying all bytes may be wasteful. Because of this the initial
      start/end data pointers are (0,0). Meaning no data can be read or
      written. This avoids reading data that may be modified by the
      user. A new helper is added later in this series if reading and
      writing the data is needed. The helper call will do a copy by
      default so that the page is exclusively owned by the BPF call.
      
      The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
      in the sendmsg() case and the entire page/offset in the sendpage case.
      This avoids ambiguity on how to handle mixed return codes in the
      sendmsg case. Again a helper is added later in the series if
      a verdict needs to apply to multiple system calls and/or only
      a subpart of the currently being processed message.
      
      The helper msg_redirect_map() can be used to select the socket to
      send the data on. This is used similar to existing redirect use
      cases. This allows policy to redirect msgs.
      
      Pseudo code simple example:
      
      The basic logic to attach a program to a socket is as follows,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
      		&obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
      Adding sockets to the map is done in the normal way,
      
        // Add a socket 'fd' to sockmap at location 'i'
        bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);
      
      After the above any socket attached to "my_sock_map", in this case
      'fd', will run the BPF msg verdict program (msg_prog) on every
      sendmsg and sendpage system call.
      
      For a complete example see BPF selftests or sockmap samples.
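      
      As a complement to the pseudo code above, a minimal kernel-side verdict
      program could look like the hedged sketch below (the old-style map
      definition and the msg_redirect_map() helper mentioned above, assumed to
      be exposed to programs as bpf_msg_redirect_map(), are illustrative):
      
        #include <linux/bpf.h>
        #include "bpf_helpers.h"
        
        struct bpf_map_def SEC("maps") my_sock_map = {
                .type        = BPF_MAP_TYPE_SOCKMAP,
                .key_size    = sizeof(int),
                .value_size  = sizeof(int),
                .max_entries = 20,
        };
        
        SEC("sk_msg")
        int msg_prog(struct sk_msg_md *msg)
        {
                /* pass everything, redirecting it to the socket stored at
                 * index 0 of my_sock_map; key and flags are illustrative */
                return bpf_msg_redirect_map(msg, &my_sock_map, 0, 0);
        }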
      
      Implementation notes:
      
      It seemed simplest, to me at least, to use a refcnt to ensure
      psock is not lost across the sendmsg copy into the sg, the bpf program
      running on the data in sg_data, and the final pass to the TCP stack.
      Some performance testing may show a better method to do this and avoid
      the refcnt cost, but for now use the simpler method.
      
      Another item that will come after basic support is in place is
      support for the MSG_MORE flag. At the moment we call sendpages even if
      the MSG_MORE flag is set. An enhancement would be to collect the
      pages into a larger scatterlist and pass it down the stack. Notice that
      bpf_tcp_sendmsg() could support this with some additional state saved
      across sendmsg calls. I built the code to support this without having
      to do refactoring work. Other features TBD include ZEROCOPY and
      TCP_RECV_QUEUE/TCP_NO_QUEUE support. These will follow the initial
      series shortly.
      
      Future work could improve size limits on the scatterlist rings used
      here. Currently, we use MAX_SKB_FRAGS simply because this was being
      used already in the TLS case. Future work could extend the kernel sk
      APIs to tune this depending on workload. This is a trade-off
      between memory usage and throughput performance.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4f738adb
    • net: generalize sk_alloc_sg to work with scatterlist rings · 8c05dbf0
      By John Fastabend
      The current implementation of sk_alloc_sg expects the scatterlist to always
      start at entry 0 and complete at entry MAX_SKB_FRAGS.
      
      Future patches will want to support starting at an arbitrary offset into the
      scatterlist, so add an additional sg_start parameter and then default
      to the current values in the TLS code paths.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      8c05dbf0
    • net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG · 312fc2b4
      By John Fastabend
      When calling do_tcp_sendpages() from within the kernel, where we know the
      data has no references from the user side, we can omit the SKBTX_SHARED_FRAG
      flag. This patch adds an internal flag, NO_SKBTX_SHARED_FRAG, that can be
      used to omit setting SKBTX_SHARED_FRAG.
      
      The flag is not exposed to userspace because the sendpage call from
      the splice logic masks out all bits except MSG_MORE.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      312fc2b4
    • sock: make static tls function alloc_sg generic sock helper · 2c3682f0
      By John Fastabend
      The TLS ULP module builds scatterlists from a sock using
      page_frag_refill(). This is going to be useful for other ULPs,
      so move it into the sock file for more general use.
      
      In the process, remove a useless goto at the end of the while loop.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2c3682f0
    • RDMA/verbs: Remove restrack entry from XRCD structure · 80cf79ae
      By Leon Romanovsky
      The XRCD object is not implemented in restrack, so let's remove it.
      
      Fixes: 02d8883f ("RDMA/restrack: Add general infrastructure to track RDMA resources")
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      80cf79ae
    • percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods · b3a5d111
      By Tejun Heo
      percpu_ref internally uses sched-RCU to implement the percpu -> atomic
      mode switching, and the documentation suggested that this could be
      depended upon.  This doesn't seem like a good idea.
      
      * percpu_ref uses sched-RCU, which has different grace periods than
        regular RCU.  Users may combine percpu_ref with regular RCU usage and
        incorrectly believe that regular RCU grace periods are performed by
        percpu_ref.  This can lead to, for example, use-after-free due to
        premature freeing.
      
      * percpu_ref has a grace period when switching from percpu to atomic
        mode.  It doesn't have one between the last put and release.  This
        distinction is subtle and can lead to surprising bugs.
      
      * percpu_ref allows starting in and switching to atomic mode manually
        for debugging and other purposes.  This means that there may not be
        any grace periods from kill to release.
      
      This patch makes it clear that the grace periods are percpu_ref's
      internal implementation detail and can't be depended upon by the
      users.
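      
      A hedged sketch of the anti-pattern the updated documentation warns
      against (struct and function names are illustrative):
      
        #include <linux/percpu-refcount.h>
        
        struct my_obj {
                struct percpu_ref ref;
                /* ... payload ... */
        };
        
        static void my_obj_shutdown(struct my_obj *obj)
        {
                percpu_ref_kill(&obj->ref);
                /*
                 * WRONG to assume a regular RCU grace period separates this
                 * point from ->release(): percpu_ref uses sched-RCU, and only
                 * for the percpu -> atomic switch.  If readers under
                 * rcu_read_lock() can still see 'obj', an explicit
                 * synchronize_rcu()/call_rcu() is needed before freeing it.
                 */
        }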
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b3a5d111
  7. 18 March 2018 (2 commits)
    • sctp: use proc_remove_subtree() · d47d08c8
      By Al Viro
      Use proc_remove_subtree() for subtree removal, both on setup failure
      halfway through and on teardown.  No need to make simple things
      complex...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d47d08c8
    • tipc: obsolete TIPC_ZONE_SCOPE · 928df188
      By Jon Maloy
      Publications for TIPC_CLUSTER_SCOPE and TIPC_ZONE_SCOPE are in all
      aspects handled the same way, both on the publishing node and on the
      receiving nodes.
      
      Despite previous ambitions to the contrary, this is never going to change,
      so we take the consequence of this and obsolete TIPC_ZONE_SCOPE and related
      macros/functions. Whenever a user attempts a bind() or a sendmsg() call
      using ZONE_SCOPE, we translate this internally to CLUSTER_SCOPE, while we
      remain compatible with users and remote nodes still using ZONE_SCOPE.
      
      Furthermore, the non-formalized scope value 0 has always been permitted
      for use during lookup, with the same meaning as ZONE_SCOPE/CLUSTER_SCOPE.
      We now permit it even as binding scope, but for compatibility reasons we
      choose to not change the value of TIPC_CLUSTER_SCOPE.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      928df188
  8. 17 March 2018 (3 commits)
  9. 16 March 2018 (4 commits)
    • net/ipv6: Change address check to always take a device argument · 232378e8
      By David Ahern
      ipv6_chk_addr_and_flags determines if an address is a local address and
      optionally if it is an address on a specific device. For example, it is
      called by ip6_route_info_create to determine if a given gateway address
      is a local address. The address check currently does not consider L3
      domains and as a result does not allow a route to be added in one VRF
      if the nexthop points to an address in a second VRF. e.g.,
      
          $ ip route add 2001:db8:1::/64 vrf r2 via 2001:db8:102::23
          Error: Invalid gateway address.
      
      where 2001:db8:102::23 is an address on an interface in vrf r1.
      
      ipv6_chk_addr_and_flags needs to allow callers to always pass in a device,
      with a separate argument saying that the address should not be limited to
      that specific device. The device is used to determine the L3 domain of
      interest.
      
      To that end add an argument to skip the device check and update callers
      to always pass a device where possible and use the new argument to mean
      any address in the domain.
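      
      A hedged sketch of the extended prototype (the exact argument order is an
      assumption; the point is the additional skip-device flag):
      
        /* With skip_dev_check set, 'dev' only selects the L3 domain and any
         * address within that domain matches; otherwise the address must
         * belong to 'dev' itself. */
        int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
                                    const struct net_device *dev,
                                    bool skip_dev_check, int strict,
                                    u32 banned_flags);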
      
      Update a handful of users of ipv6_chk_addr with a NULL dev argument. This
      patch handles the change to these callers without adding the domain check.
      
      ip6_validate_gw needs to handle 2 cases - one where the device is given
      as part of the nexthop spec and the other where the device is resolved.
      There is at least 1 VRF case where deferring the check to only after
      the route lookup has resolved the device fails with an unintuitive error
      "RTNETLINK answers: No route to host" as opposed to the preferred
      "Error: Gateway can not be a local address." The 'no route to host'
      error is because of the fallback to a full lookup. The check is done
      twice to avoid this error.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      232378e8
    • vlan: Fix out of order vlan headers with reorder header off · cbe7128c
      By Toshiaki Makita
      With reorder header off, received packets are untagged in skb_vlan_untag()
      called from within __netif_receive_skb_core(), and later the tag will be
      inserted back in vlan_do_receive().
      
      This caused out of order vlan headers when we create a vlan device on top
      of another vlan device, because vlan_do_receive() inserts a tag as the
      outermost vlan tag. E.g. the outer tag is first removed in skb_vlan_untag()
      and inserted back in vlan_do_receive(), then the inner tag is next removed
      and inserted back as the outermost tag.
      
      This patch fixes the behaviour by inserting the inner tag at the right
      position.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cbe7128c
    • net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off · 4bbb3e0e
      By Toshiaki Makita
      When we have a bridge with vlan_filtering on and a vlan device on top of
      it, packets would be corrupted in skb_vlan_untag() called from
      br_dev_xmit().
      
      The problem sits in skb_reorder_vlan_header() used in skb_vlan_untag(),
      which makes use of skb->mac_len. In this function mac_len is meant for
      handling rx path with vlan devices with reorder_header disabled, but in
      tx path mac_len is typically 0 and cannot be used, which is the problem
      in this case.
      
      In fact, the current code does not even properly handle the rx path
      (skb_vlan_untag() called from __netif_receive_skb_core()) with
      reorder_header off.
      
      In rx path single tag case, it works as follows:
      
      - Before skb_reorder_vlan_header()
      
       mac_header                                data
         v                                        v
         +-------------------+-------------+------+----
         |        ETH        |    VLAN     | ETH  |
         |       ADDRS       | TPID | TCI  | TYPE |
         +-------------------+-------------+------+----
         <-------- mac_len --------->
                             <------------->
                              to be removed
      
      - After skb_reorder_vlan_header()
      
                  mac_header                     data
                       v                          v
                       +-------------------+------+----
                       |        ETH        | ETH  |
                       |       ADDRS       | TYPE |
                       +-------------------+------+----
                       <-------- mac_len --------->
      
      This is ok, but in rx double tag case, it corrupts packets:
      
      - Before skb_reorder_vlan_header()
      
       mac_header                                              data
         v                                                      v
         +-------------------+-------------+-------------+------+----
         |        ETH        |    VLAN     |    VLAN     | ETH  |
         |       ADDRS       | TPID | TCI  | TPID | TCI  | TYPE |
         +-------------------+-------------+-------------+------+----
         <--------------- mac_len ---------------->
                                           <------------->
                                          should be removed
                             <--------------------------->
                               actually will be removed
      
      - After skb_reorder_vlan_header()
      
                  mac_header                                   data
                       v                                        v
                                     +-------------------+------+----
                                     |        ETH        | ETH  |
                                     |       ADDRS       | TYPE |
                                     +-------------------+------+----
                       <--------------- mac_len ---------------->
      
      So both vlan tags are removed, while only the inner one should be,
      and mac_header (and mac_len) is broken.
      
      skb_vlan_untag() is meant for removing the vlan header at (skb->data - 2),
      so use skb->data and skb->mac_header to calculate the right offset.
      Reported-by: Brandon Carpenter <brandon.carpenter@cypherpath.com>
      Fixes: a6e18ff1 ("vlan: Fix untag operations of stacked vlans with REORDER_HEADER off")
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bbb3e0e
    • fs: Teach path_connected to handle nfs filesystems with multiple roots. · 95dd7758
      By Eric W. Biederman
      On nfsv2 and nfsv3 the nfs server can export subsets of the same
      filesystem and report the same filesystem identifier, so that the nfs
      client can know they are the same filesystem.  The subsets can be from
      disjoint directory trees.  The nfsv2 and nfsv3 protocols provide no
      way to find the common root of all directory trees exported from the
      server with the same filesystem identifier.
      
      The practical result is that, for nfs, s_root in struct super_block is
      not necessarily the root of the filesystem.  The nfs mount code sets
      s_root to the root of the first subset of the nfs filesystem that the
      kernel mounts.
      
      This affects the dcache invalidation code in generic_shutdown_super,
      currently called shrink_dcache_for_umount, and that code for years
      has gone through an additional list of dentries that might be dentry
      trees needing to be freed to accommodate nfs.
      
      When I wrote path_connected I did not realize nfs was so special, and
      its heuristic for avoiding calls to is_subdir can fail.
      
      The practical case where this fails is when there is a move of a
      directory from the subtree exposed by one nfs mount to the subtree
      exposed by another nfs mount.  This move can happen either locally or
      remotely.  The remote case requires that the moved directory be cached
      before the move, and that after the move someone walks the path
      to where the moved directory now exists, in so doing causing the
      already cached directory to be moved in the dcache through the magic
      of d_splice_alias.
      
      If someone whose working directory is in the moved directory or a
      subdirectory now starts walking .. from the initial mount of nfs
      (where s_root == mnt_root), then path_connected as a heuristic will
      not bother with the is_subdir check.  As s_root really is not the root
      of the nfs filesystem, this heuristic is wrong, the path may
      actually not be connected, and path_connected can fail.
      
      The is_subdir function might be cheap enough that we can call it
      unconditionally.  Verifying that will take some benchmarking and
      the result may not be the same on all kernels this fix needs
      to be backported to.  So I am avoiding that for now.
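      
      For reference, a simplified sketch of the heuristic in question (the real
      function takes a struct path and more context):
      
        /* Skip is_subdir() when the mount appears to cover the whole
         * filesystem.  For nfs, s_root may not be the real root, which is
         * exactly where this shortcut goes wrong. */
        static bool path_connected(struct vfsmount *mnt, struct dentry *dentry)
        {
                struct super_block *sb = mnt->mnt_sb;
        
                if (mnt->mnt_root == sb->s_root)
                        return true;
        
                return is_subdir(dentry, mnt->mnt_root);
        }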
      
      Filesystems with snapshots, such as nilfs and btrfs, do something
      similar.  But as the directory trees of the snapshots are disjoint
      from one another and from the main directory tree, rename won't move
      things between them and this problem will not occur.
      
      Cc: stable@vger.kernel.org
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      95dd7758
  10. 15 March 2018 (2 commits)
    • mmc: core: Fix tracepoint print of blk_addr and blksz · c658dc58
      By Adrian Hunter
      Swap the positions of blk_addr and blksz in the tracepoint print arguments
      so that they match the print format.
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Fixes: d2f82254 ("mmc: core: Add members to mmc_request and mmc_data for CQE's")
      Cc: <stable@vger.kernel.org> # 4.14+
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      c658dc58
    • bpf: extend stackmap to save binary_build_id+offset instead of address · 615755a7
      By Song Liu
      Currently, the bpf stackmap stores an address for each entry in the call
      trace. To map these addresses to user space files, it is necessary to
      maintain the mapping from these virtual addresses to symbols in the
      binary. Usually, the user space profiler (such as perf) has to scan
      /proc/pid/maps at the
      beginning of profiling, and monitor mmap2() calls afterwards. Given the
      cost of maintaining the address map, this solution is not practical for
      system wide profiling that is always on.
      
      This patch tries to solve this problem with a variation of stackmap. This
      variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing
      addresses, the variation stores ELF file build_id + offset.
      
      Build ID is a 20-byte unique identifier for ELF files. The following
      command shows the Build ID of /bin/bash:
      
        [user@]$ readelf -n /bin/bash
        ...
          Build ID: XXXXXXXXXX
        ...
      
      With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID
      for each entry in the call trace, and translate it into the following
      struct:
      
        struct bpf_stack_build_id_offset {
                __s32           status;
                unsigned char   build_id[BPF_BUILD_ID_SIZE];
                union {
                        __u64   offset;
                        __u64   ip;
                };
        };
      
      The search for the build_id is limited to the first page of the file, and
      this page should be in the page cache. Otherwise, we fall back to storing
      ip for this entry (the ip field in struct bpf_stack_build_id_offset). This
      requires the
      build_id to be stored in the first page. A quick survey of binary and
      dynamic library files in a few different systems shows that almost all
      binary and dynamic library files have build_id in the first page.
      
      Build_id is only meaningful for user stacks. If a kernel stack is added to
      a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fall back to
      only storing ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if the
      build_id lookup fails for some reason, it will also fall back to storing ip.
      
      User space can access struct bpf_stack_build_id_offset with bpf
      syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to
      maintain mapping from build id to binary files. This mostly static
      mapping is much easier to maintain than per process address maps.
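      
      A hedged user-space sketch of creating such a map and reading one trace
      back (the older bpf_create_map() wrapper, the depth constant, and reusing
      the per-entry struct shown above are assumptions for brevity):
      
        #include <bpf/bpf.h>
        #include <linux/bpf.h>
        
        #define MAX_DEPTH 127
        
        struct bpf_stack_build_id_offset {          /* as introduced above */
                __s32           status;
                unsigned char   build_id[BPF_BUILD_ID_SIZE];
                union {
                        __u64   offset;
                        __u64   ip;
                };
        };
        
        struct bpf_stack_build_id_offset stack[MAX_DEPTH];
        
        int main(void)
        {
                int fd = bpf_create_map(BPF_MAP_TYPE_STACK_TRACE, sizeof(__u32),
                                        sizeof(stack), 10000,
                                        BPF_F_STACK_BUILD_ID);
                __u32 stack_id = 0;   /* e.g. a value from bpf_get_stackid() */
        
                /* each entry carries build_id + offset when status is
                 * BPF_STACK_BUILD_ID_VALID, or a raw ip on fallback */
                bpf_map_lookup_elem(fd, &stack_id, stack);
                return 0;
        }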
      
      Note: Stackmap with build_id only works in non-nmi context at this time.
      This is because we need to take mm->mmap_sem for find_vma(). If this
      changes, we would like to allow build_id lookup in nmi context.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      615755a7