1. 08 Jul, 2018 9 commits
  2. 07 Jul, 2018 3 commits
  3. 04 Jul, 2018 6 commits
    • net/sched: Make etf report drops on error_queue · 4b15c707
      Committed by Jesus Sanchez-Palencia
      Use the socket error queue for reporting dropped packets if the
      socket has enabled that feature through the SO_TXTIME API.
      
      Packets are dropped either on enqueue() if they aren't accepted by the
      qdisc or on dequeue() if the system misses their deadline. Those are
      reported as different errors so applications can react accordingly.
      
      Userspace can retrieve the errors through the socket error queue and the
      corresponding cmsg interfaces. A struct sock_extended_err* is used for
      returning the error data, and the packet's timestamp can be rebuilt by
      combining the ee_data (upper 32 bits) and ee_info (lower 32 bits) fields, e.g.:
      
          ((__u64) serr->ee_data << 32) + serr->ee_info
      
      This feature is disabled by default and must be explicitly enabled by
      applications. Enabling it can bring some overhead for the Tx cycles
      of the application.
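
The timestamp reconstruction quoted above can be sketched as a small helper. This is only an illustration: the function name `txtime_from_ee` is hypothetical, and in a real program the two arguments would be the `ee_data` and `ee_info` members of a `struct sock_extended_err` read from the socket error queue.

```c
#include <stdint.h>

/* Rebuild the 64-bit txtime of a dropped packet from the two 32-bit
 * halves carried in the extended error: ee_data holds the upper half,
 * ee_info the lower half. The name txtime_from_ee is hypothetical. */
uint64_t txtime_from_ee(uint32_t ee_data, uint32_t ee_info)
{
    /* Widen before shifting so the upper half is not truncated. */
    return ((uint64_t)ee_data << 32) + ee_info;
}
```

The cast matters: shifting a 32-bit value by 32 without widening it first would be undefined behavior in C.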
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: Add HW offloading capability to ETF · 88cab771
      Committed by Jesus Sanchez-Palencia
      Add infra so etf qdisc supports HW offload of time-based transmission.
      
      For hw offload, the time sorted list is still used, so packets are
      always dequeued in order of txtime.
      
      Example:
      
      $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
                 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
      
      $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \
      	   clockid CLOCK_REALTIME
      
      In this example, the Qdisc will use HW offload for the control of the
      transmission time through the network adapter. The hrtimer used for
      packet scheduling inside the qdisc will use the clockid CLOCK_REALTIME
      as reference, and packets leave the Qdisc "delta" (100000) nanoseconds
      before their transmission time. Because this uses HW offload and the
      hrtimer does not support dynamic clocks, the system clock and the PHC
      clock must be synchronized for this mode to behave as expected.
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: Introduce the ETF Qdisc · 25db26a9
      Committed by Vinicius Costa Gomes
      The ETF (Earliest TxTime First) qdisc uses the information added
      earlier in this series (the socket option SO_TXTIME and the new
      role of sk_buff->tstamp) to schedule packet transmission based
      on absolute time.
      
      For some workloads, just bandwidth enforcement is not enough, and
      precise control of the transmission of packets is necessary.
      
      Example:
      
      $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
                 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
      
      $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \
                 clockid CLOCK_TAI
      
      In this example, the Qdisc will provide SW best-effort control of the
      transmission time to the network adapter. The time stamp in the
      socket will be in reference to the clockid CLOCK_TAI, and packets
      will leave the qdisc "delta" (100000) nanoseconds before their
      transmission time.
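
The "delta" arithmetic in this example can be sketched as follows. The helper name `etf_release_time_ns` is hypothetical, and clamping at zero for very small txtimes is an assumption of this sketch, not something the commit message states.

```c
#include <stdint.h>

/* Time at which the qdisc would hand a packet on: "delta" nanoseconds
 * ahead of its txtime. The clamp at 0 only avoids unsigned wrap-around
 * and is an assumption of this sketch. */
uint64_t etf_release_time_ns(uint64_t txtime_ns, uint32_t delta_ns)
{
    return txtime_ns > delta_ns ? txtime_ns - delta_ns : 0;
}
```

With the delta of 100000 ns used above, a packet stamped with txtime T would leave the qdisc at T - 100000 ns on the same clock.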
      
      The ETF qdisc will buffer packets sorted by their txtime. It will drop
      packets on enqueue() if their skbuff clockid does not match the clock
      reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
      if its txtime expires while it is waiting in the queue.
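
The buffering and drop rules just described can be modeled with a toy userspace queue. This is a sketch under stated assumptions, not the kernel implementation: all `toy_*` names, the fixed capacity, the drop on a full queue, and the plain `int` clock identifier are hypothetical, and the model ignores timers and the "delta" offset.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_CAP 16                   /* arbitrary capacity for the sketch */

struct toy_pkt { uint64_t txtime_ns; int clockid; };

struct toy_etf {
    struct toy_pkt q[TOY_CAP];       /* kept sorted by ascending txtime */
    size_t len;
    int clockid;                     /* clock reference of the queue */
    unsigned drops_enqueue;
    unsigned drops_dequeue;
};

/* Returns 0 if the packet was accepted, -1 if it was dropped because
 * its clockid does not match (or, as an assumption, the queue is full). */
int toy_enqueue(struct toy_etf *e, struct toy_pkt p)
{
    if (p.clockid != e->clockid || e->len == TOY_CAP) {
        e->drops_enqueue++;
        return -1;
    }
    size_t i = e->len;
    /* Insertion sort step: shift later packets up by one slot. */
    while (i > 0 && e->q[i - 1].txtime_ns > p.txtime_ns) {
        e->q[i] = e->q[i - 1];
        i--;
    }
    e->q[i] = p;
    e->len++;
    return 0;
}

/* Drops every packet whose txtime is already in the past, then pops the
 * earliest remaining one. Returns 1 and fills *out, or 0 if none is left. */
int toy_dequeue(struct toy_etf *e, uint64_t now_ns, struct toy_pkt *out)
{
    size_t head = 0;
    while (head < e->len && e->q[head].txtime_ns < now_ns) {
        e->drops_dequeue++;          /* missed its deadline */
        head++;
    }
    int found = head < e->len;
    if (found)
        *out = e->q[head++];
    /* Compact the array so the next packet sits at index 0 again. */
    for (size_t i = head; i < e->len; i++)
        e->q[i - head] = e->q[i];
    e->len -= head;
    return found;
}
```

The two drop counters are kept separate to mirror the idea that enqueue-time and dequeue-time drops are distinct events an application may want to tell apart.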
      
      The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
      will dequeue a packet as soon as possible and change the skb timestamp
      to 'now' during etf_dequeue().
      
      Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
      to be configured, but it's been decided that usage of CLOCK_TAI should
      be enforced until we decide to allow for other clockids to be used.
      The rationale here is that PTP times are usually in the TAI scale, thus
      no other clocks should be necessary. For now, the qdisc will return
      EINVAL if any clocks other than CLOCK_TAI are used.
      Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: Allow creating a Qdisc watchdog with other clocks · 860b642b
      Committed by Vinicius Costa Gomes
      This adds 'qdisc_watchdog_init_clockid()', which allows a clockid to be
      passed. This allows other time references to be used when scheduling
      the Qdisc to run.
      Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_pedit: fix possible memory leak in tcf_pedit_init() · 30e99ed6
      Committed by Wei Yongjun
      'keys_ex' is allocated by tcf_pedit_keys_ex_parse() in tcf_pedit_init(),
      but not all of the error handling paths free it, which may cause a
      memory leak. This patch fixes it.
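
The general pattern behind this kind of fix, routing every failure after the allocation through one cleanup label, can be sketched in plain C. The code below is not taken from act_pedit: `toy_action`, `toy_action_init`, and the `flags < 0` failure condition are hypothetical stand-ins.

```c
#include <errno.h>
#include <stdlib.h>

struct toy_action { int *keys_ex; int nkeys; };

/* Hypothetical init routine: allocates an auxiliary array, then runs a
 * validation step that may fail. Every failure after the allocation
 * exits through out_free, so the array cannot leak. */
int toy_action_init(struct toy_action *a, int nkeys, int flags)
{
    int *keys_ex;
    int err;

    if (nkeys <= 0)
        return -EINVAL;              /* nothing allocated yet */

    keys_ex = calloc((size_t)nkeys, sizeof(*keys_ex));
    if (!keys_ex)
        return -ENOMEM;

    if (flags < 0) {                 /* stand-in for a later failure */
        err = -EINVAL;
        goto out_free;
    }

    a->keys_ex = keys_ex;            /* success: the action owns the array */
    a->nkeys = nkeys;
    return 0;

out_free:
    free(keys_ex);
    return err;
}
```

Keeping a single exit label makes it harder for a newly added error path to skip the free.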
      
      Fixes: 71d0ed70 ("net/act_pedit: Support using offset relative to the conventional network headers")
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net:sched: add action inheritdsfield to skbedit · e7e3728b
      Committed by Qiaobin Fu
      The new action inheritdsfield copies the DS field of
      IPv4 and IPv6 packets into skb->priority. This enables
      later classification of packets based on the DS field.
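
To make the DS field concrete, here is a userspace sketch that extracts it from raw IPv4 or IPv6 header bytes. The function name `dscp_from_ip_header` is hypothetical, and returning only the six DSCP bits (shifting out the two ECN bits) is an assumption of this sketch; the commit message does not say how the value is stored in skb->priority.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the 6-bit DSCP value found in an IPv4 or IPv6 header, or -1
 * if the buffer is too short or the version nibble is neither 4 nor 6. */
int dscp_from_ip_header(const uint8_t *hdr, size_t len)
{
    uint8_t version;
    uint8_t dsfield;

    if (len < 2)
        return -1;

    version = hdr[0] >> 4;
    if (version == 4)
        dsfield = hdr[1];            /* IPv4: second byte (former TOS) */
    else if (version == 6)
        /* IPv6: Traffic Class straddles the first two bytes. */
        dsfield = (uint8_t)((hdr[0] << 4) | (hdr[1] >> 4));
    else
        return -1;

    return dsfield >> 2;             /* drop the two ECN bits */
}
```

For example, a header carrying Expedited Forwarding (DSCP 46) has the byte value 0xB8 in the IPv4 TOS position or in the IPv6 Traffic Class.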
      
      v5:
      * Update the drop counter for TC_ACT_SHOT.

      v4:
      * Do not allow setting flags other than the expected ones.
      * Allow dumping the pure flags.

      v3:
      * Use optional flags, so that it won't break old versions of tc.
      * Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags.

      v2:
      * Fix the style issue.
      * Move the code from skbmod to skbedit.
      
      Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
      Reviewed-by: Michel Machado <michel@digirati.com.br>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 02 Jul, 2018 1 commit
  5. 29 Jun, 2018 3 commits
  6. 28 Jun, 2018 6 commits
  7. 26 Jun, 2018 6 commits
  8. 24 Jun, 2018 1 commit
  9. 23 Jun, 2018 1 commit
  10. 22 Jun, 2018 1 commit
    • cls_flower: fix use after free in flower S/W path · 44a5cd43
      Committed by Paolo Abeni
      If a flower filter is created without the skip_sw flag, fl_mask_put()
      can race with fl_classify() and we can destroy the mask rhashtable
      while a lookup operation is accessing it.
      
       BUG: unable to handle kernel paging request at 00000000000911d1
       PGD 0 P4D 0
       SMP PTI
       CPU: 3 PID: 5582 Comm: vhost-5541 Not tainted 4.18.0-rc1.vanilla+ #1950
       Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
       RIP: 0010:rht_bucket_nested+0x20/0x60
       Code: 31 c8 c1 c1 18 29 c8 c3 66 90 8b 4f 04 ba 01 00 00 00 8b 07 48 8b bf 80 00 00 0
       RSP: 0018:ffffafc5cfbb7a48 EFLAGS: 00010206
       RAX: 0000000000001978 RBX: ffff9f12dff88a00 RCX: 00000000ffff9f12
       RDX: 00000000000911d1 RSI: 0000000000000148 RDI: 0000000000000001
       RBP: ffff9f12dff88a00 R08: 000000005f1cc119 R09: 00000000a715fae2
       R10: ffffafc5cfbb7aa8 R11: ffff9f1cb4be804e R12: ffff9f1265e13000
       R13: 0000000000000000 R14: ffffafc5cfbb7b48 R15: ffff9f12dff88b68
       FS:  0000000000000000(0000) GS:ffff9f1d3f0c0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00000000000911d1 CR3: 0000001575a94006 CR4: 00000000001626e0
       Call Trace:
        fl_lookup+0x134/0x140 [cls_flower]
        fl_classify+0xf3/0x180 [cls_flower]
        tcf_classify+0x78/0x150
        __netif_receive_skb_core+0x69e/0xa50
        netif_receive_skb_internal+0x42/0xf0
        tun_get_user+0xdd5/0xfd0 [tun]
        tun_sendmsg+0x52/0x70 [tun]
        handle_tx+0x2b3/0x5f0 [vhost_net]
        vhost_worker+0xab/0x100 [vhost]
        kthread+0xf8/0x130
        ret_from_fork+0x35/0x40
       Modules linked in: act_mirred act_gact cls_flower vhost_net vhost tap sch_ingress
       CR2: 00000000000911d1
      
      Fix the above by waiting for an RCU grace period before destroying the
      rhashtable. We need to use tcf_queue_work(), as rhashtable_destroy()
      must run in process context, as pointed out by Cong Wang.
      
      v1 -> v2: use tcf_queue_work to run rhashtable_destroy().
      
      Fixes: 05cd271f ("cls_flower: Support multiple masks per priority")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 20 Jun, 2018 2 commits
    • net/sched: act_ife: preserve the action control in case of error · cbf56c29
      Committed by Davide Caratti
      In the following script
      
       # tc actions add action ife encode allow prio pass index 42
       # tc actions replace action ife encode allow tcindex drop index 42
      
      the action control should remain equal to 'pass' if the kernel fails
      to replace the TC action. Postpone the assignment of the action control
      to ensure it is not overwritten in the error path of tcf_ife_init().
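
The "validate first, commit last" idea behind this change can be sketched in a few lines of userspace C. Nothing here is taken from act_ife: `toy_ife`, `toy_ife_replace`, and the negative-metadata failure condition are hypothetical.

```c
#include <errno.h>

struct toy_ife { int action; int metadata; };

/* Hypothetical replace routine: every step that can fail runs before
 * any field of *ife is written, so a failed replace leaves the old
 * action control in place. */
int toy_ife_replace(struct toy_ife *ife, int new_action, int new_metadata)
{
    if (new_metadata < 0)
        return -ENOENT;              /* error path: *ife is untouched */

    /* Nothing below can fail, so it is safe to commit the new state. */
    ife->metadata = new_metadata;
    ife->action = new_action;
    return 0;
}
```

Assigning the action control as the last step means no error path has to restore the previous value.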
      
      Fixes: ef6980b6 ("introduce IFE action")
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: act_ife: fix recursive lock and idr leak · 0a889b94
      Committed by Davide Caratti
      A recursive lock warning [1] can be observed with the following script,
      
       # $TC actions add action ife encode allow prio pass index 42
       IFE type 0xED3E
       # $TC actions replace action ife encode allow tcindex pass index 42
      
      in case the kernel is unable to run the last command (e.g. because
      'act_meta_skbtcindex' cannot be loaded). For a similar reason,
      the kernel can leak an idr entry in the error path of tcf_ife_init(),
      because tcf_idr_release() is not called after a successful idr reservation:
      
       # $TC actions add action ife encode allow tcindex index 47
       IFE type 0xED3E
       RTNETLINK answers: No such file or directory
       We have an error talking to the kernel
       # $TC actions add action ife encode allow tcindex index 47
       IFE type 0xED3E
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       # $TC actions add action ife encode use mark 7 type 0xfefe pass index 47
       IFE type 0xFEFE
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
      
      Since tcfa_lock is already taken when the action is being edited, a call
      to tcf_idr_release() wrongly makes tcf_idr_cleanup() take the same lock
      again. On the other hand, tcf_idr_release() needs to be called in the
      error path of tcf_ife_init(), to undo the last tcf_idr_create() invocation.
      Fix both problems in tcf_ife_init().
      Since the cleanup() routine can now be called when ife->params is NULL,
      also add a NULL pointer check to avoid calling kfree_rcu(NULL, rcu).
      
       [1]
       ============================================
       WARNING: possible recursive locking detected
       4.17.0-rc4.kasan+ #417 Tainted: G            E
       --------------------------------------------
       tc/3932 is trying to acquire lock:
       000000005097c9a6 (&(&p->tcfa_lock)->rlock){+...}, at: tcf_ife_cleanup+0x19/0x80 [act_ife]
      
       but task is already holding lock:
       000000005097c9a6 (&(&p->tcfa_lock)->rlock){+...}, at: tcf_ife_init+0xf6d/0x13c0 [act_ife]
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&(&p->tcfa_lock)->rlock);
         lock(&(&p->tcfa_lock)->rlock);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       2 locks held by tc/3932:
        #0: 000000007ca8e990 (rtnl_mutex){+.+.}, at: tcf_ife_init+0xf61/0x13c0 [act_ife]
        #1: 000000005097c9a6 (&(&p->tcfa_lock)->rlock){+...}, at: tcf_ife_init+0xf6d/0x13c0 [act_ife]
      
       stack backtrace:
       CPU: 3 PID: 3932 Comm: tc Tainted: G            E     4.17.0-rc4.kasan+ #417
       Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
       Call Trace:
        dump_stack+0x9a/0xeb
        __lock_acquire+0xf43/0x34a0
        ? debug_check_no_locks_freed+0x2b0/0x2b0
        ? debug_check_no_locks_freed+0x2b0/0x2b0
        ? debug_check_no_locks_freed+0x2b0/0x2b0
        ? __mutex_lock+0x62f/0x1240
        ? kvm_sched_clock_read+0x1a/0x30
        ? sched_clock+0x5/0x10
        ? sched_clock_cpu+0x18/0x170
        ? find_held_lock+0x39/0x1d0
        ? lock_acquire+0x10b/0x330
        lock_acquire+0x10b/0x330
        ? tcf_ife_cleanup+0x19/0x80 [act_ife]
        _raw_spin_lock_bh+0x38/0x70
        ? tcf_ife_cleanup+0x19/0x80 [act_ife]
        tcf_ife_cleanup+0x19/0x80 [act_ife]
        __tcf_idr_release+0xff/0x350
        tcf_ife_init+0xdde/0x13c0 [act_ife]
        ? ife_exit_net+0x290/0x290 [act_ife]
        ? __lock_is_held+0xb4/0x140
        tcf_action_init_1+0x67b/0xad0
        ? tcf_action_dump_old+0xa0/0xa0
        ? sched_clock+0x5/0x10
        ? sched_clock_cpu+0x18/0x170
        ? kvm_sched_clock_read+0x1a/0x30
        ? sched_clock+0x5/0x10
        ? sched_clock_cpu+0x18/0x170
        ? memset+0x1f/0x40
        tcf_action_init+0x30f/0x590
        ? tcf_action_init_1+0xad0/0xad0
        ? memset+0x1f/0x40
        tc_ctl_action+0x48e/0x5e0
        ? mutex_lock_io_nested+0x1160/0x1160
        ? tca_action_gd+0x990/0x990
        ? sched_clock+0x5/0x10
        ? find_held_lock+0x39/0x1d0
        rtnetlink_rcv_msg+0x4da/0x990
        ? validate_linkmsg+0x680/0x680
        ? sched_clock_cpu+0x18/0x170
        ? find_held_lock+0x39/0x1d0
        netlink_rcv_skb+0x127/0x350
        ? validate_linkmsg+0x680/0x680
        ? netlink_ack+0x970/0x970
        ? __kmalloc_node_track_caller+0x304/0x3a0
        netlink_unicast+0x40f/0x5d0
        ? netlink_attachskb+0x580/0x580
        ? _copy_from_iter_full+0x187/0x760
        ? import_iovec+0x90/0x390
        netlink_sendmsg+0x67f/0xb50
        ? netlink_unicast+0x5d0/0x5d0
        ? copy_msghdr_from_user+0x206/0x340
        ? netlink_unicast+0x5d0/0x5d0
        sock_sendmsg+0xb3/0xf0
        ___sys_sendmsg+0x60a/0x8b0
        ? copy_msghdr_from_user+0x340/0x340
        ? lock_downgrade+0x5e0/0x5e0
        ? tty_write_lock+0x18/0x50
        ? kvm_sched_clock_read+0x1a/0x30
        ? sched_clock+0x5/0x10
        ? sched_clock_cpu+0x18/0x170
        ? find_held_lock+0x39/0x1d0
        ? lock_downgrade+0x5e0/0x5e0
        ? lock_acquire+0x10b/0x330
        ? __audit_syscall_entry+0x316/0x690
        ? current_kernel_time64+0x6b/0xd0
        ? __fget_light+0x55/0x1f0
        ? __sys_sendmsg+0xd2/0x170
        __sys_sendmsg+0xd2/0x170
        ? __ia32_sys_shutdown+0x70/0x70
        ? syscall_trace_enter+0x57a/0xd60
        ? rcu_read_lock_sched_held+0xdc/0x110
        ? __bpf_trace_sys_enter+0x10/0x10
        ? do_syscall_64+0x22/0x480
        do_syscall_64+0xa5/0x480
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fd646988ba0
       RSP: 002b:00007fffc9fab3c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       RAX: ffffffffffffffda RBX: 00007fffc9fab4f0 RCX: 00007fd646988ba0
       RDX: 0000000000000000 RSI: 00007fffc9fab440 RDI: 0000000000000003
       RBP: 000000005b28c8b3 R08: 0000000000000002 R09: 0000000000000000
       R10: 00007fffc9faae20 R11: 0000000000000246 R12: 0000000000000000
       R13: 00007fffc9fab504 R14: 0000000000000001 R15: 000000000066c100
      
      Fixes: 4e8c8615 ("net sched: net sched: ife action fix late binding")
      Fixes: ef6980b6 ("introduce IFE action")
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 17 Jun, 2018 1 commit