1. 28 Dec 2017, 5 commits
  2. 27 Dec 2017, 1 commit
    • rtnetlink: Replace implementation of ASSERT_RTNL() macro with WARN_ONCE() · 66364bdf
      Leon Romanovsky committed
      The ASSERT_RTNL() macro is really an open-coded variant of WARN_ONCE(),
      with two exceptions. First, it prints the stack on every hit, not only
      once as WARN_ONCE() does. Second, the user can disable WARN_ONCE()
      prints by setting CONFIG_BUG to N.
      
      The repeated stack dumps are not actually needed, because calls made
      without the rtnl lock are programming errors, and the user can't do
      anything about them except complain to the mailing list after the
      first occurrence of such a failure.
      
      A user who disabled BUG/WARN prints did so explicitly, because this
      option is enabled by default in the upstream kernel and in
      distributions. Such a user doesn't want to see prints about missing
      locks either.
      
      This patch replaces the open-coded variant with the already existing
      macro and changes the error prints to appear only once.
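
      For reference, a minimal sketch of the shape of the change, assuming
      the upstream rtnl_is_locked() helper (simplified from the actual
      patch):

        /* Before: open-coded check that dumps the stack on every hit. */
        #define ASSERT_RTNL() do { \
                if (unlikely(!rtnl_is_locked())) { \
                        printk(KERN_ERR "RTNL: assertion failed at %s (%d)\n", \
                               __FILE__, __LINE__); \
                        dump_stack(); \
                } \
        } while (0)

        /* After: reuse WARN_ONCE(), which warns (and dumps stack) once. */
        #define ASSERT_RTNL() \
                WARN_ONCE(!rtnl_is_locked(), \
                          "RTNL: assertion failed at %s (%d)\n", \
                          __FILE__, __LINE__)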
      Reviewed-by: Mark Bloch <markb@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 22 Dec 2017, 11 commits
  4. 21 Dec 2017, 10 commits
    • xfrm: wrap xfrmdev_ops with offload config · 9cb0d21d
      Shannon Nelson committed
      There's no reason to define netdev->xfrmdev_ops if
      the offload facility is not CONFIG'd in.
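
      A sketch of the resulting layout, assuming CONFIG_XFRM_OFFLOAD is the
      guard (surrounding fields elided):

        struct net_device {
                /* ... */
        #ifdef CONFIG_XFRM_OFFLOAD
                const struct xfrmdev_ops *xfrmdev_ops;
        #endif
                /* ... */
        };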
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
    • xfrm: check for xdo_dev_state_free · 7f05b467
      Shannon Nelson committed
      The current XFRM code assumes that we've implemented the
      xdo_dev_state_free() callback, even if it is meaningless to the driver.
      This patch adds a check for it before calling, as done in other APIs,
      to prevent a NULL function pointer kernel crash.
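
      A hedged sketch of the guarded call (simplified from the inline
      helper; refcount handling and cleanup elided):

        static inline void xfrm_dev_state_free(struct xfrm_state *x)
        {
                struct xfrm_state_offload *xso = &x->xso;
                struct net_device *dev = xso->dev;

                /* only call the callback if the driver implemented it */
                if (dev && dev->xfrmdev_ops &&
                    dev->xfrmdev_ops->xdo_dev_state_free)
                        dev->xfrmdev_ops->xdo_dev_state_free(x);
                /* ... */
        }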
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
    • bpf: allow for correlation of maps and helpers in dump · 7105e828
      Daniel Borkmann committed
      Currently a dump of an xlated prog (post verifier stage) correlates
      neither the helpers used nor the maps. The prog info lists the
      involved map ids, but as of today there is no correlation of where in
      the program they are used. Likewise, bpftool does not correlate
      helper calls with the target functions.
      
      The latter can be done without any kernel changes through kallsyms,
      and it also has the advantage that it works with inlined helpers and
      BPF calls.
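
      Conceptually, the userspace side can resolve a "call #imm" target by
      adding the immediate to the address of __bpf_call_base and looking
      the sum up in /proc/kallsyms. A rough sketch (ksym_addr() and
      ksym_name() are hypothetical lookup helpers, not the exact bpftool
      code):

        /* illustrative only: map a call immediate back to a symbol name */
        static void print_call(const struct bpf_insn *insn)
        {
                unsigned long base = ksym_addr("__bpf_call_base");
                const char *fn = ksym_name(base + insn->imm);

                printf("call %s#%d\n", fn ? fn : "unknown", insn->imm);
        }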
      
      Example, via interpreter:
      
        # tc filter show dev foo ingress
        filter protocol all pref 49152 bpf chain 0
        filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
                            direct-action not_in_hw id 1 tag c74773051b364165   <-- prog id:1
      
        * Output before patch (calls/maps remain unclear):
      
        # bpftool prog dump xlated id 1             <-- dump prog id:1
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = 0xffff95c47a8d4800
         6: (85) call unknown#73040
         7: (15) if r0 == 0x0 goto pc+18
         8: (bf) r2 = r10
         9: (07) r2 += -4
        10: (bf) r1 = r0
        11: (85) call unknown#73040
        12: (15) if r0 == 0x0 goto pc+23
        [...]
      
        * Output after patch:
      
        # bpftool prog dump xlated id 1
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]                     <-- map id:2
         6: (85) call bpf_map_lookup_elem#73424     <-- helper call
         7: (15) if r0 == 0x0 goto pc+18
         8: (bf) r2 = r10
         9: (07) r2 += -4
        10: (bf) r1 = r0
        11: (85) call bpf_map_lookup_elem#73424
        12: (15) if r0 == 0x0 goto pc+23
        [...]
      
        # bpftool map show id 2                     <-- show/dump/etc map id:2
        2: hash_of_maps  flags 0x0
              key 4B  value 4B  max_entries 3  memlock 4096B
      
      Example, JITed, same prog:
      
        # tc filter show dev foo ingress
        filter protocol all pref 49152 bpf chain 0
        filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
                        direct-action not_in_hw id 3 tag c74773051b364165 jited
      
        # bpftool prog show id 3
        3: sched_cls  tag c74773051b364165
              loaded_at Dec 19/13:48  uid 0
              xlated 384B  jited 257B  memlock 4096B  map_ids 2
      
        # bpftool prog dump xlated id 3
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]                      <-- map id:2
         6: (85) call __htab_map_lookup_elem#77408   <-+ inlined rewrite
         7: (15) if r0 == 0x0 goto pc+2                |
         8: (07) r0 += 56                              |
         9: (79) r0 = *(u64 *)(r0 +0)                <-+
        10: (15) if r0 == 0x0 goto pc+24
        11: (bf) r2 = r10
        12: (07) r2 += -4
        [...]
      
      Example, same prog, but with kallsyms disabled (in that case we are
      also not allowed to expose any relative offsets, etc, so the prog
      becomes pointer-sanitized on dump):
      
        # sysctl kernel.kptr_restrict=2
        kernel.kptr_restrict = 2
      
        # bpftool prog dump xlated id 3
         0: (b7) r1 = 2
         1: (63) *(u32 *)(r10 -4) = r1
         2: (bf) r2 = r10
         3: (07) r2 += -4
         4: (18) r1 = map[id:2]
         6: (85) call bpf_unspec#0
         7: (15) if r0 == 0x0 goto pc+2
        [...]
      
      Example, BPF calls via interpreter:
      
        # bpftool prog dump xlated id 1
         0: (85) call pc+2#__bpf_prog_run_args32
         1: (b7) r0 = 1
         2: (95) exit
         3: (b7) r0 = 2
         4: (95) exit
      
      Example, BPF calls via JIT:
      
        # sysctl net.core.bpf_jit_enable=1
        net.core.bpf_jit_enable = 1
        # sysctl net.core.bpf_jit_kallsyms=1
        net.core.bpf_jit_kallsyms = 1
      
        # bpftool prog dump xlated id 1
         0: (85) call pc+2#bpf_prog_3b185187f1855c4c_F
         1: (b7) r0 = 1
         2: (95) exit
         3: (b7) r0 = 2
         4: (95) exit
      
      And finally, an example of tail calls, where the correlation now
      works as well:
      
        # bpftool prog dump xlated id 2
        [...]
        10: (b7) r2 = 8
        11: (85) call bpf_trace_printk#-41312
        12: (bf) r1 = r6
        13: (18) r2 = map[id:1]
        15: (b7) r3 = 0
        16: (85) call bpf_tail_call#12
        17: (b7) r1 = 42
        18: (6b) *(u16 *)(r6 +46) = r1
        19: (b7) r0 = 0
        20: (95) exit
      
        # bpftool map show id 1
        1: prog_array  flags 0x0
              key 4B  value 4B  max_entries 1  memlock 4096B
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix integer overflows · bb7f0f98
      Alexei Starovoitov committed
      There were various issues related to the limited size of integers used in
      the verifier:
       - `off + size` overflow in __check_map_access()
       - `off + reg->off` overflow in check_mem_access()
       - `off + reg->var_off.value` overflow or 32-bit truncation of
         `reg->var_off.value` in check_mem_access()
       - 32-bit truncation in check_stack_boundary()
      
      Make sure that any integer math cannot overflow by not allowing
      pointer math with large values.
      
      Also reduce the scope of "scalar op scalar" tracking.
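
      A sketch of the bounding idea (the constant matches the upstream fix;
      the helper shown here is an illustrative simplification of its
      check_reg_sane_offset()):

        /* Keep offsets well inside the s32 range so that subsequent
         * additions in the verifier cannot wrap. */
        #define BPF_MAX_VAR_OFF (1 << 29)

        static bool offset_is_sane(s64 off)
        {
                return off > -BPF_MAX_VAR_OFF && off < BPF_MAX_VAR_OFF;
        }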
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • block: unalign call_single_data in struct request · 4ccafe03
      Jens Axboe committed
      A previous change blindly added massive alignment to the
      call_single_data structure in struct request. This ballooned it in size
      from 296 to 320 bytes on my setup, for no valid reason at all.
      
      Use the unaligned struct __call_single_data variant instead.
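
      The change is essentially a one-liner in struct request (sketch,
      surrounding fields elided):

        struct request {
                /* ... */
                struct __call_single_data csd;  /* was the cacheline-aligned
                                                 * call_single_data_t */
                /* ... */
        };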
      
      Fixes: 966a9671 ("smp: Avoid using two cache lines for struct call_single_data")
      Cc: stable@vger.kernel.org # v4.14
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • net: sock: replace sk_state_load with inet_sk_state_load and remove sk_state_store · 986ffdfd
      Yafang Shao committed
      sk_state_load is only used by AF_INET/AF_INET6, so rename it to
      inet_sk_state_load and move it into inet_sock.h.
      
      sk_state_store is removed as it is no longer used.
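
      A sketch of the renamed helper, assuming it keeps the same
      smp_load_acquire() body as the old sk_state_load():

        /* include/net/inet_sock.h */
        static inline int inet_sk_state_load(const struct sock *sk)
        {
                /* pairs with the release in the state-store path */
                return smp_load_acquire(&sk->sk_state);
        }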
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tracepoint: replace tcp_set_state tracepoint with inet_sock_set_state tracepoint · 563e0bb0
      Yafang Shao committed
      Since sk_state is a common field of struct sock, the state-transition
      tracepoint should not be a TCP-specific feature. Currently it traces
      all AF_INET state transitions, so rename the tracepoint to
      inet_sock_set_state, with some minor changes, and move it into
      trace/events/sock.h. We don't need to create a file named
      trace/events/inet_sock.h for this one single tracepoint.
      
      Two helpers are introduced to trace sk_state transitions:
          - void inet_sk_state_store(struct sock *sk, int newstate);
          - void inet_sk_set_state(struct sock *sk, int state);
      Since trace headers should not be included in other header files,
      they are defined in sock.c.

      Protocols such as SCTP may be compiled as modules, hence export
      inet_sk_set_state().
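
      A hedged sketch of the two helpers (simplified; per the text above
      they live in sock.c so that trace headers stay out of header files):

        void inet_sk_state_store(struct sock *sk, int newstate)
        {
                trace_inet_sock_set_state(sk, sk->sk_state, newstate);
                smp_store_release(&sk->sk_state, newstate);
        }

        void inet_sk_set_state(struct sock *sk, int state)
        {
                trace_inet_sock_set_state(sk, sk->sk_state, state);
                sk->sk_state = state;
        }
        EXPORT_SYMBOL(inet_sk_set_state);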
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Export to userspace the TCP state names for the trace events · d7b850a7
      Steven Rostedt (VMware) committed
      The TCP trace events (specifically tcp_set_state) map enums to symbol
      names via __print_symbolic(). But this only works for reading trace
      events from the tracefs trace files. If perf or trace-cmd were to
      record these events, the event format file does not convert the enum
      names into numbers, and you get something like:
      
      __print_symbolic(REC->oldstate,
          { TCP_ESTABLISHED, "TCP_ESTABLISHED" },
          { TCP_SYN_SENT, "TCP_SYN_SENT" },
          { TCP_SYN_RECV, "TCP_SYN_RECV" },
          { TCP_FIN_WAIT1, "TCP_FIN_WAIT1" },
          { TCP_FIN_WAIT2, "TCP_FIN_WAIT2" },
          { TCP_TIME_WAIT, "TCP_TIME_WAIT" },
          { TCP_CLOSE, "TCP_CLOSE" },
          { TCP_CLOSE_WAIT, "TCP_CLOSE_WAIT" },
          { TCP_LAST_ACK, "TCP_LAST_ACK" },
          { TCP_LISTEN, "TCP_LISTEN" },
          { TCP_CLOSING, "TCP_CLOSING" },
          { TCP_NEW_SYN_RECV, "TCP_NEW_SYN_RECV" })
      
      where trace-cmd and perf do not know the values of those enums.
      
      Use the TRACE_DEFINE_ENUM() macro so that the trace events convert
      the enum names into their values at system boot. This allows perf and
      trace-cmd to see actual numbers and not enums:
      
      __print_symbolic(REC->oldstate,
          { 1, "TCP_ESTABLISHED" },
          { 2, "TCP_SYN_SENT" },
          { 3, "TCP_SYN_RECV" },
          { 4, "TCP_FIN_WAIT1" },
          { 5, "TCP_FIN_WAIT2" },
          { 6, "TCP_TIME_WAIT" },
          { 7, "TCP_CLOSE" },
          { 8, "TCP_CLOSE_WAIT" },
          { 9, "TCP_LAST_ACK" },
          { 10, "TCP_LISTEN" },
          { 11, "TCP_CLOSING" },
          { 12, "TCP_NEW_SYN_RECV" })
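
      The fix amounts to declaring each state once so the tracing core can
      resolve it at boot; a sketch:

        TRACE_DEFINE_ENUM(TCP_ESTABLISHED);
        TRACE_DEFINE_ENUM(TCP_SYN_SENT);
        TRACE_DEFINE_ENUM(TCP_SYN_RECV);
        /* ... one line per TCP state ... */
        TRACE_DEFINE_ENUM(TCP_NEW_SYN_RECV);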
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • block-throttle: avoid double charge · 111be883
      Shaohua Li committed
      If a bio is throttled and split after throttling, the bio could be
      resubmitted and enter throttling again. This causes part of the bio
      to be charged multiple times. If the cgroup has an IO limit, the
      double charge significantly harms performance. Bio splits have become
      quite common since the arbitrary-bio-size change.
      
      To fix this, we always set the BIO_THROTTLED flag if a bio is
      throttled. If the bio is cloned/split, we copy the flag to the new
      bio too to avoid a double charge. However, a cloned bio could be
      directed to a new disk, in which case keeping the flag is a problem.
      The observation is that we always set a new disk for the bio in this
      case, so we can clear the flag in bio_set_dev().
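
      A sketch of clearing the flag when the bio is retargeted (close to
      the bio_set_dev() behavior described above, but simplified):

        #define bio_set_dev(bio, bdev)                          \
        do {                                                    \
                if ((bio)->bi_disk != (bdev)->bd_disk)          \
                        bio_clear_flag(bio, BIO_THROTTLED);     \
                (bio)->bi_disk   = (bdev)->bd_disk;             \
                (bio)->bi_partno = (bdev)->bd_partno;           \
        } while (0)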
      
      This issue has existed for a long time; the arbitrary-bio-size change
      just makes it worse, so this should go into stable, at least back to
      v4.2.
      
      V1 -> V2: don't add an extra field to the bio, based on discussion
      with Tejun
      
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: stable@vger.kernel.org
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • cls_bpf: fix offload assumptions after callback conversion · 102740bd
      Jakub Kicinski committed
      cls_bpf used to take care of tracking what offload state a filter is
      in, i.e. it would track whether the offload request succeeded or not.
      This information would then be used to issue correct requests to the
      driver, e.g. requesting statistics only for offloaded filters,
      removing only filters which were offloaded, using add instead of
      replace if the previous filter was not added, etc.
      
      This tracking of offload state no longer functions with the new
      callback infrastructure.  There could be multiple entities trying
      to offload the same filter.
      
      Throw out all the tracking and corresponding commands, and simply
      pass both the old and the new bpf program to the drivers. Drivers
      will have to deal with offload state tracking by themselves.
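
      A hedged sketch of the offload descriptor after the change (field
      names per the upstream patch, surrounding fields elided):

        struct tc_cls_bpf_offload {
                /* ... */
                struct bpf_prog *prog;     /* program to install, if any */
                struct bpf_prog *oldprog;  /* previously offloaded program */
                /* ... */
        };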
      
      Fixes: 3f7889c4 ("net: sched: cls_bpf: call block callbacks for offload")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 20 Dec 2017, 6 commits
  6. 19 Dec 2017, 7 commits
    • net: Introduce NETIF_F_GRO_HW. · fb1f5f79
      Michael Chan committed
      Introduce the NETIF_F_GRO_HW feature flag for NICs that support
      hardware GRO.  With this flag, we can now independently turn hardware
      GRO on or off when GRO is on.  Previously, drivers were using
      NETIF_F_GRO to control hardware GRO, so it could not be turned on or
      off independently without affecting GRO.
      
      Hardware GRO (just like GRO) guarantees that packets can be
      re-segmented by TSO/GSO to reconstruct the original packet stream.
      Logically, GRO_HW should depend on GRO since it is a subset, but we
      will let individual drivers enforce this dependency as they see fit.
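
      For example, a driver could enforce the dependency in its
      .ndo_fix_features hook; a hypothetical sketch (foo_fix_features is an
      illustrative name, not from the patch):

        static netdev_features_t foo_fix_features(struct net_device *dev,
                                                  netdev_features_t features)
        {
                /* GRO_HW is a subset of GRO: drop it when GRO is off. */
                if (!(features & NETIF_F_GRO))
                        features &= ~NETIF_F_GRO_HW;
                return features;
        }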
      
      Since NETIF_F_GRO is not propagated between upper and lower devices,
      NETIF_F_GRO_HW should follow suit, since it is a subset of GRO.  In
      other words, a lower device can independently have GRO/GRO_HW enabled
      or disabled, and no feature propagation is required.  This preserves
      the current GRO behavior.  It can be changed later if we decide to
      propagate GRO/GRO_HW/RXCSUM from upper to lower devices.
      
      Cc: Ariel Elior <Ariel.Elior@cavium.com>
      Cc: everest-linux-l2@cavium.com
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: Hide unused variable when !CONFIG_PROC_FS. · 398b841e
      Tonghao Zhang committed
      When CONFIG_PROC_FS is disabled, we do not use the prot_inuse
      counter. This adds an #ifdef to hide the variable definition in that
      case. This is not a bugfix, but we can save bytes when there are many
      network namespaces.
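
      A sketch of the guarded member, assuming the counter lives in struct
      netns_core (per the neighbouring patches in this series):

        struct netns_core {
                /* ... */
        #ifdef CONFIG_PROC_FS
                struct prot_inuse __percpu *prot_inuse;
        #endif
        };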
      
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com>
      Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: Move the socket inuse to namespace. · 648845ab
      Tonghao Zhang committed
      In some cases, we want to know how many sockets are in use in
      different _net_ namespaces. It's a key resource metric.

      This patch adds a member to struct netns_core: a counter for
      sockets in use in the _net_ namespace. The patch increments and
      decrements the counter in sk_alloc, sk_clone_lock and __sk_free.
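
      A hedged sketch of the counting helper (simplified from the patch):

        /* bump the per-net, per-cpu socket-inuse counter */
        static void sock_inuse_add(struct net *net, int val)
        {
                this_cpu_add(*net->core.sock_inuse, val);
        }

        /* called with val = 1 from sk_alloc()/sk_clone_lock() for
         * userspace sockets, and with val = -1 from __sk_free() */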
      
      This patch does not count sockets created in the kernel, since it's
      not very useful for userspace to know how many kernel sockets we
      have created.
      
      The main reasons for doing this are:

      1. When linux calls 'do_exit' for a process to exit, the functions
      'exit_task_namespaces' and 'exit_task_work' are called sequentially.
      'exit_task_namespaces' may have destroyed the _net_ namespace, but
      'sock_release' called in 'exit_task_work' would use the _net_
      namespace if we counted the socket-inuse in sock_release.

      2. socket and sock come in pairs. More importantly, sock holds the
      _net_ namespace. We count the socket-inuse in sock to avoid holding
      the _net_ namespace again in socket. It's an easier way to maintain
      the code.
      Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com>
      Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: Change the netns_core member name. · 08fc7f81
      Tonghao Zhang committed
      Changing the member name makes the code more readable. This change
      will be used in the next patch.
      Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com>
      Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nl80211: Remove obsolete kerneldoc line · 958a1b5a
      Jonathan Corbet committed
      Commit ca986ad9 ("nl80211: allow multiple active scheduled scan
      requests") removed WIPHY_FLAG_SUPPORTS_SCHED_SCAN but left the
      kerneldoc description in place, leading to this docs-build warning:
      
         ./include/net/cfg80211.h:3278: warning: Excess enum value
                 'WIPHY_FLAG_SUPPORTS_SCHED_SCAN' description in 'wiphy_flags'
      
      Remove the line and gain a bit of peace.
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • bpf/cgroup: fix a verification error for a CGROUP_DEVICE type prog · 06ef0ccb
      Yonghong Song committed
      The tools/testing/selftests/bpf test program test_dev_cgroup fails
      with the following error when compiled with llvm 6.0. (I did not try
      earlier versions.)
      
        libbpf: load bpf program failed: Permission denied
        libbpf: -- BEGIN DUMP LOG ---
        libbpf:
        0: (61) r2 = *(u32 *)(r1 +4)
        1: (b7) r0 = 0
        2: (55) if r2 != 0x1 goto pc+8
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R2=inv1 R10=fp0
        3: (69) r2 = *(u16 *)(r1 +0)
        invalid bpf_context access off=0 size=2
        ...
      
      The culprit is the following statement in dev_cgroup.c:
        short type = ctx->access_type & 0xFFFF;
      This code is typical, as ctx->access_type is assigned
      as below in kernel/bpf/cgroup.c:
        struct bpf_cgroup_dev_ctx ctx = {
              .access_type = (access << 16) | dev_type,
              .major = major,
              .minor = minor,
        };
      
      The compiler converts it to a u16 access, while the verifier's
      cgroup_dev_is_valid_access() rejects any non-u32 access.
      
      This patch permits the access_type field to be accessed as u16 and
      u8 as well.
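
      A hedged sketch of the relaxed verifier check (simplified from the
      upstream helper):

        static bool cgroup_dev_is_valid_access(int off, int size,
                                               enum bpf_access_type type,
                                               struct bpf_insn_access_aux *info)
        {
                const int size_default = sizeof(__u32);

                if (type == BPF_WRITE || off < 0 ||
                    off + size > sizeof(struct bpf_cgroup_dev_ctx))
                        return false;

                switch (off) {
                case bpf_ctx_range(struct bpf_cgroup_dev_ctx, access_type):
                        /* allow narrower (u8/u16) loads of access_type */
                        bpf_ctx_record_field_size(info, size_default);
                        if (!bpf_ctx_narrow_access_ok(off, size, size_default))
                                return false;
                        break;
                default:
                        if (size != size_default)
                                return false;
                }
                return true;
        }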
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Tested-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • block: fix blk_rq_append_bio · 0abc2a10
      Jens Axboe committed
      Commit caa4b024 ("blk-map: call blk_queue_bounce from
      blk_rq_append_bio") moved blk_queue_bounce() into
      blk_rq_append_bio(), but didn't consider the fact that the bounced
      bio becomes invisible to the caller, since the parameter type is
      'struct bio *'. Make it a pointer to a pointer to a bio, so the
      caller also sees the right bio after a bounce.
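
      A sketch of the new calling convention (simplified; callers now pass
      &bio):

        int blk_rq_append_bio(struct request *rq, struct bio **bio)
        {
                blk_queue_bounce(rq->q, bio);   /* may replace *bio */
                /* ... append *bio to the request ... */
                return 0;
        }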
      
      Fixes: caa4b024 ("blk-map: call blk_queue_bounce from blk_rq_append_bio")
      Cc: Christoph Hellwig <hch@lst.de>
      Reported-by: Michele Ballabio <barra_cuda@katamail.com>
      (handling failure of blk_rq_append_bio(), only call bio_get() after
      blk_rq_append_bio() returns OK)
      Tested-by: Michele Ballabio <barra_cuda@katamail.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>