1. 20 2月, 2018 1 次提交
  2. 13 2月, 2018 1 次提交
  3. 30 1月, 2018 2 次提交
    • E
      ipv6: addrconf: break critical section in addrconf_verify_rtnl() · e64e469b
      Eric Dumazet 提交于
      Heiner reported a lockdep splat [1]
      
      This is caused by attempting GFP_KERNEL allocation while RCU lock is
      held and BH blocked.
      
      We believe that addrconf_verify_rtnl() could run for a long period,
      so instead of using GFP_ATOMIC here as Ido suggested, we should break
      the critical section and restart it after the allocation.
      
      [1]
      [86220.125562] =============================
      [86220.125586] WARNING: suspicious RCU usage
      [86220.125612] 4.15.0-rc7-next-20180110+ #7 Not tainted
      [86220.125641] -----------------------------
      [86220.125666] kernel/sched/core.c:6026 Illegal context switch in RCU-bh read-side critical section!
      [86220.125711]
                     other info that might help us debug this:
      
      [86220.125755]
                     rcu_scheduler_active = 2, debug_locks = 1
      [86220.125792] 4 locks held by kworker/0:2/1003:
      [86220.125817]  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.125895]  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.125959]  #2:  (rtnl_mutex){+.+.}, at: [<00000000b06d9510>] rtnl_lock+0x12/0x20
      [86220.126017]  #3:  (rcu_read_lock_bh){....}, at: [<00000000aef52299>] addrconf_verify_rtnl+0x1e/0x510 [ipv6]
      [86220.126111]
                     stack backtrace:
      [86220.126142] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7
      [86220.126185] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015
      [86220.126250] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
      [86220.126288] Call Trace:
      [86220.126312]  dump_stack+0x70/0x9e
      [86220.126337]  lockdep_rcu_suspicious+0xce/0xf0
      [86220.126365]  ___might_sleep+0x1d3/0x240
      [86220.126390]  __might_sleep+0x45/0x80
      [86220.126416]  kmem_cache_alloc_trace+0x53/0x250
      [86220.126458]  ? ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.126498]  ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.126538]  ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.126580]  ? ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.126623]  addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.126664]  ? addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.126708]  addrconf_verify_work+0xe/0x20 [ipv6]
      [86220.126738]  process_one_work+0x258/0x680
      [86220.126765]  worker_thread+0x35/0x3f0
      [86220.126790]  kthread+0x124/0x140
      [86220.126813]  ? process_one_work+0x680/0x680
      [86220.126839]  ? kthread_create_worker_on_cpu+0x40/0x40
      [86220.126869]  ? umh_complete+0x40/0x40
      [86220.126893]  ? call_usermodehelper_exec_async+0x12a/0x160
      [86220.126926]  ret_from_fork+0x4b/0x60
      [86220.126999] BUG: sleeping function called from invalid context at mm/slab.h:420
      [86220.127041] in_atomic(): 1, irqs_disabled(): 0, pid: 1003, name: kworker/0:2
      [86220.127082] 4 locks held by kworker/0:2/1003:
      [86220.127107]  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.127179]  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.127242]  #2:  (rtnl_mutex){+.+.}, at: [<00000000b06d9510>] rtnl_lock+0x12/0x20
      [86220.127300]  #3:  (rcu_read_lock_bh){....}, at: [<00000000aef52299>] addrconf_verify_rtnl+0x1e/0x510 [ipv6]
      [86220.127414] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7
      [86220.127463] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015
      [86220.127528] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
      [86220.127568] Call Trace:
      [86220.127591]  dump_stack+0x70/0x9e
      [86220.127616]  ___might_sleep+0x14d/0x240
      [86220.127644]  __might_sleep+0x45/0x80
      [86220.127672]  kmem_cache_alloc_trace+0x53/0x250
      [86220.127717]  ? ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.127762]  ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.127807]  ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.127854]  ? ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.127903]  addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.127950]  ? addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.127998]  addrconf_verify_work+0xe/0x20 [ipv6]
      [86220.128032]  process_one_work+0x258/0x680
      [86220.128063]  worker_thread+0x35/0x3f0
      [86220.128091]  kthread+0x124/0x140
      [86220.128117]  ? process_one_work+0x680/0x680
      [86220.128146]  ? kthread_create_worker_on_cpu+0x40/0x40
      [86220.128180]  ? umh_complete+0x40/0x40
      [86220.128207]  ? call_usermodehelper_exec_async+0x12a/0x160
      [86220.128243]  ret_from_fork+0x4b/0x60
      
      Fixes: f3d9832e ("ipv6: addrconf: cleanup locking in ipv6_add_addr")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e64e469b
    • D
      net: ipv6: send unsolicited NA after DAD · c76fe2d9
      David Ahern 提交于
      Unsolicited IPv6 neighbor advertisements should be sent after DAD
      completes. Update ndisc_send_unsol_na to skip tentative, non-optimistic
      addresses and have those sent by addrconf_dad_completed after DAD.
      
      Fixes: 4a6e3c5d ("net: ipv6: send unsolicited NA on admin up")
      Reported-by: NVivek Venkatraman <vivek@cumulusnetworks.com>
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c76fe2d9
  4. 17 1月, 2018 1 次提交
    • A
      net: delete /proc THIS_MODULE references · 96890d62
      Alexey Dobriyan 提交于
      /proc has been ignoring struct file_operations::owner field for 10 years.
      Specifically, it started with commit 786d7e16
      ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
      inode->i_fop is initialized with proxy struct file_operations for
      regular files:
      
      	-               if (de->proc_fops)
      	-                       inode->i_fop = de->proc_fops;
      	+               if (de->proc_fops) {
      	+                       if (S_ISREG(inode->i_mode))
      	+                               inode->i_fop = &proc_reg_file_ops;
      	+                       else
      	+                               inode->i_fop = de->proc_fops;
      	+               }
      
      VFS stopped pinning module at this point.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96890d62
  5. 08 1月, 2018 3 次提交
  6. 05 12月, 2017 2 次提交
  7. 22 11月, 2017 1 次提交
    • K
      treewide: setup_timer() -> timer_setup() · e99e88a9
      Kees Cook 提交于
      This converts all remaining cases of the old setup_timer() API into using
      timer_setup(), where the callback argument is the structure already
      holding the struct timer_list. These should have no behavioral changes,
      since they just change which pointer is passed into the callback with
      the same available pointers after conversion. It handles the following
      examples, in addition to some other variations.
      
      Casting from unsigned long:
      
          void my_callback(unsigned long data)
          {
              struct something *ptr = (struct something *)data;
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, ptr);
      
      and forced object casts:
      
          void my_callback(struct something *ptr)
          {
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);
      
      become:
      
          void my_callback(struct timer_list *t)
          {
              struct something *ptr = from_timer(ptr, t, my_timer);
          ...
          }
          ...
          timer_setup(&ptr->my_timer, my_callback, 0);
      
      Direct function assignments:
      
          void my_callback(unsigned long data)
          {
              struct something *ptr = (struct something *)data;
          ...
          }
          ...
          ptr->my_timer.function = my_callback;
      
      have a temporary cast added, along with converting the args:
      
          void my_callback(struct timer_list *t)
          {
              struct something *ptr = from_timer(ptr, t, my_timer);
          ...
          }
          ...
          ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;
      
      And finally, callbacks without a data assignment:
      
          void my_callback(unsigned long data)
          {
          ...
          }
          ...
          setup_timer(&ptr->my_timer, my_callback, 0);
      
      have their argument renamed to verify they're unused during conversion:
      
          void my_callback(struct timer_list *unused)
          {
          ...
          }
          ...
          timer_setup(&ptr->my_timer, my_callback, 0);
      
      The conversion is done with the following Coccinelle script:
      
      spatch --very-quiet --all-includes --include-headers \
      	-I ./arch/x86/include -I ./arch/x86/include/generated \
      	-I ./include -I ./arch/x86/include/uapi \
      	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
      	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
      	--dir . \
      	--cocci-file ~/src/data/timer_setup.cocci
      
      @fix_address_of@
      expression e;
      @@
      
       setup_timer(
      -&(e)
      +&e
       , ...)
      
      // Update any raw setup_timer() usages that have a NULL callback, but
      // would otherwise match change_timer_function_usage, since the latter
      // will update all function assignments done in the face of a NULL
      // function initialization in setup_timer().
      @change_timer_function_usage_NULL@
      expression _E;
      identifier _timer;
      type _cast_data;
      @@
      
      (
      -setup_timer(&_E->_timer, NULL, _E);
      +timer_setup(&_E->_timer, NULL, 0);
      |
      -setup_timer(&_E->_timer, NULL, (_cast_data)_E);
      +timer_setup(&_E->_timer, NULL, 0);
      |
      -setup_timer(&_E._timer, NULL, &_E);
      +timer_setup(&_E._timer, NULL, 0);
      |
      -setup_timer(&_E._timer, NULL, (_cast_data)&_E);
      +timer_setup(&_E._timer, NULL, 0);
      )
      
      @change_timer_function_usage@
      expression _E;
      identifier _timer;
      struct timer_list _stl;
      identifier _callback;
      type _cast_func, _cast_data;
      @@
      
      (
      -setup_timer(&_E->_timer, _callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, &_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, &_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
      +timer_setup(&_E._timer, _callback, 0);
      |
       _E->_timer@_stl.function = _callback;
      |
       _E->_timer@_stl.function = &_callback;
      |
       _E->_timer@_stl.function = (_cast_func)_callback;
      |
       _E->_timer@_stl.function = (_cast_func)&_callback;
      |
       _E._timer@_stl.function = _callback;
      |
       _E._timer@_stl.function = &_callback;
      |
       _E._timer@_stl.function = (_cast_func)_callback;
      |
       _E._timer@_stl.function = (_cast_func)&_callback;
      )
      
      // callback(unsigned long arg)
      @change_callback_handle_cast
       depends on change_timer_function_usage@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _origtype;
      identifier _origarg;
      type _handletype;
      identifier _handle;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *t
       )
       {
      (
      	... when != _origarg
      	_handletype *_handle =
      -(_handletype *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle =
      -(void *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle;
      	... when != _handle
      	_handle =
      -(_handletype *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle;
      	... when != _handle
      	_handle =
      -(void *)_origarg;
      +from_timer(_handle, t, _timer);
      	... when != _origarg
      )
       }
      
      // callback(unsigned long arg) without existing variable
      @change_callback_handle_cast_no_arg
       depends on change_timer_function_usage &&
                           !change_callback_handle_cast@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _origtype;
      identifier _origarg;
      type _handletype;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *t
       )
       {
      +	_handletype *_origarg = from_timer(_origarg, t, _timer);
      +
      	... when != _origarg
      -	(_handletype *)_origarg
      +	_origarg
      	... when != _origarg
       }
      
      // Avoid already converted callbacks.
      @match_callback_converted
       depends on change_timer_function_usage &&
                  !change_callback_handle_cast &&
      	    !change_callback_handle_cast_no_arg@
      identifier change_timer_function_usage._callback;
      identifier t;
      @@
      
       void _callback(struct timer_list *t)
       { ... }
      
      // callback(struct something *handle)
      @change_callback_handle_arg
       depends on change_timer_function_usage &&
      	    !match_callback_converted &&
                  !change_callback_handle_cast &&
                  !change_callback_handle_cast_no_arg@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _handletype;
      identifier _handle;
      @@
      
       void _callback(
      -_handletype *_handle
      +struct timer_list *t
       )
       {
      +	_handletype *_handle = from_timer(_handle, t, _timer);
      	...
       }
      
      // If change_callback_handle_arg ran on an empty function, remove
      // the added handler.
      @unchange_callback_handle_arg
       depends on change_timer_function_usage &&
      	    change_callback_handle_arg@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._timer;
      type _handletype;
      identifier _handle;
      identifier t;
      @@
      
       void _callback(struct timer_list *t)
       {
      -	_handletype *_handle = from_timer(_handle, t, _timer);
       }
      
      // We only want to refactor the setup_timer() data argument if we've found
      // the matching callback. This undoes changes in change_timer_function_usage.
      @unchange_timer_function_usage
       depends on change_timer_function_usage &&
                  !change_callback_handle_cast &&
                  !change_callback_handle_cast_no_arg &&
      	    !change_callback_handle_arg@
      expression change_timer_function_usage._E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type change_timer_function_usage._cast_data;
      @@
      
      (
      -timer_setup(&_E->_timer, _callback, 0);
      +setup_timer(&_E->_timer, _callback, (_cast_data)_E);
      |
      -timer_setup(&_E._timer, _callback, 0);
      +setup_timer(&_E._timer, _callback, (_cast_data)&_E);
      )
      
      // If we fixed a callback from a .function assignment, fix the
      // assignment cast now.
      @change_timer_function_assignment
       depends on change_timer_function_usage &&
                  (change_callback_handle_cast ||
                   change_callback_handle_cast_no_arg ||
                   change_callback_handle_arg)@
      expression change_timer_function_usage._E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type _cast_func;
      typedef TIMER_FUNC_TYPE;
      @@
      
      (
       _E->_timer.function =
      -_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -(_cast_func)_callback;
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_timer.function =
      -(_cast_func)&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -&_callback;
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -(_cast_func)_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._timer.function =
      -(_cast_func)&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      )
      
      // Sometimes timer functions are called directly. Replace matched args.
      @change_timer_function_calls
       depends on change_timer_function_usage &&
                  (change_callback_handle_cast ||
                   change_callback_handle_cast_no_arg ||
                   change_callback_handle_arg)@
      expression _E;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type _cast_data;
      @@
      
       _callback(
      (
      -(_cast_data)_E
      +&_E->_timer
      |
      -(_cast_data)&_E
      +&_E._timer
      |
      -_E
      +&_E->_timer
      )
       )
      
      // If a timer has been configured without a data argument, it can be
      // converted without regard to the callback argument, since it is unused.
      @match_timer_function_unused_data@
      expression _E;
      identifier _timer;
      identifier _callback;
      @@
      
      (
      -setup_timer(&_E->_timer, _callback, 0);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, 0L);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E->_timer, _callback, 0UL);
      +timer_setup(&_E->_timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0L);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_E._timer, _callback, 0UL);
      +timer_setup(&_E._timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0L);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(&_timer, _callback, 0UL);
      +timer_setup(&_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0);
      +timer_setup(_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0L);
      +timer_setup(_timer, _callback, 0);
      |
      -setup_timer(_timer, _callback, 0UL);
      +timer_setup(_timer, _callback, 0);
      )
      
      @change_callback_unused_data
       depends on match_timer_function_unused_data@
      identifier match_timer_function_unused_data._callback;
      type _origtype;
      identifier _origarg;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *unused
       )
       {
      	... when != _origarg
       }
      Signed-off-by: NKees Cook <keescook@chromium.org>
      e99e88a9
  8. 15 11月, 2017 1 次提交
    • N
      ipv6: set all.accept_dad to 0 by default · 09400953
      Nicolas Dichtel 提交于
      With commits 35e015e1 and a2d3f3e3, the global 'accept_dad' flag
      is also taken into account (default value is 1). If either global or
      per-interface flag is non-zero, DAD will be enabled on a given interface.
      
      This is not backward compatible: before those patches, the user could
      disable DAD just by setting the per-interface flag to 0. Now, the
      user instead needs to set both flags to 0 to actually disable DAD.
      
      Restore the previous behaviour by setting the default for the global
      'accept_dad' flag to 0. This way, DAD is still enabled by default,
      as per-interface flags are set to 1 on device creation, but setting
      them to 0 is enough to disable DAD on a given interface.
      
      - Before 35e015e1f57a7 and a2d3f3e3:
                global    per-interface    DAD enabled
      [default]   1             1              yes
                  X             0              no
                  X             1              yes
      
      - After 35e015e1 and a2d3f3e3:
                global    per-interface    DAD enabled
      [default]   1             1              yes
                  0             0              no
                  0             1              yes
                  1             0              yes
      
      - After this fix:
                global    per-interface    DAD enabled
                  1             1              yes
                  0             0              no
      [default]   0             1              yes
                  1             0              yes
      
      Fixes: 35e015e1 ("ipv6: fix net.ipv6.conf.all interface DAD handlers")
      Fixes: a2d3f3e3 ("ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real")
      CC: Stefano Brivio <sbrivio@redhat.com>
      CC: Matteo Croce <mcroce@redhat.com>
      CC: Erik Kline <ek@google.com>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      09400953
  9. 11 11月, 2017 1 次提交
    • M
      net: ipv6: sysctl to specify IPv6 ND traffic class · 2210d6b2
      Maciej Żenczykowski 提交于
      Add a per-device sysctl to specify the default traffic class to use for
      kernel originated IPv6 Neighbour Discovery packets.
      
      Currently this includes:
      
        - Router Solicitation (ICMPv6 type 133)
          ndisc_send_rs() -> ndisc_send_skb() -> ip6_nd_hdr()
      
        - Neighbour Solicitation (ICMPv6 type 135)
          ndisc_send_ns() -> ndisc_send_skb() -> ip6_nd_hdr()
      
        - Neighbour Advertisement (ICMPv6 type 136)
          ndisc_send_na() -> ndisc_send_skb() -> ip6_nd_hdr()
      
        - Redirect (ICMPv6 type 137)
          ndisc_send_redirect() -> ndisc_send_skb() -> ip6_nd_hdr()
      
      and if the kernel ever gets around to generating RA's,
      it would presumably also include:
      
        - Router Advertisement (ICMPv6 type 134)
          (radvd daemon could pick up on the kernel setting and use it)
      
      Interface drivers may examine the Traffic Class value and translate
      the DiffServ Code Point into a link-layer appropriate traffic
      prioritization scheme.  An example of mapping IETF DSCP values to
      IEEE 802.11 User Priority values can be found here:
      
          https://tools.ietf.org/html/draft-ietf-tsvwg-ieee-802-11
      
      The expected primary use case is to properly prioritize ND over wifi.
      
      Testing:
        jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        0
        jzem22:~# echo -1 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        -bash: echo: write error: Invalid argument
        jzem22:~# echo 256 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        -bash: echo: write error: Invalid argument
        jzem22:~# echo 0 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# echo 255 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        255
        jzem22:~# echo 34 > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# cat /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        34
      
        jzem22:~# echo $[0xDC] > /proc/sys/net/ipv6/conf/eth0/ndisc_tclass
        jzem22:~# tcpdump -v -i eth0 icmp6 and src host jzem22.pgc and dst host fe80::1
        tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
        IP6 (class 0xdc, hlim 255, next-header ICMPv6 (58) payload length: 24)
        jzem22.pgc > fe80::1: [icmp6 sum ok] ICMP6, neighbor advertisement,
        length 24, tgt is jzem22.pgc, Flags [solicited]
      
      (based on original change written by Erik Kline, with minor changes)
      
      v2: fix 'suspicious rcu_dereference_check() usage'
          by explicitly grabbing the rcu_read_lock.
      
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NErik Kline <ek@google.com>
      Signed-off-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2210d6b2
  10. 07 11月, 2017 1 次提交
    • E
      ipv6: addrconf: fix a lockdep splat · fffcefe9
      Eric Dumazet 提交于
      Fixes a case where GFP_ATOMIC allocation must be used instead of
      GFP_KERNEL one.
      
      [   54.891146]  lock_acquire+0xb3/0x2f0
      [   54.891153]  ? fs_reclaim_acquire.part.60+0x5/0x30
      [   54.891165]  fs_reclaim_acquire.part.60+0x29/0x30
      [   54.891170]  ? fs_reclaim_acquire.part.60+0x5/0x30
      [   54.891178]  kmem_cache_alloc_trace+0x3f/0x500
      [   54.891186]  ? cyc2ns_read_end+0x1e/0x30
      [   54.891196]  ipv6_add_addr+0x15a/0xc30
      [   54.891217]  ? ipv6_create_tempaddr+0x2ea/0x5d0
      [   54.891223]  ipv6_create_tempaddr+0x2ea/0x5d0
      [   54.891238]  ? manage_tempaddrs+0x195/0x220
      [   54.891249]  ? addrconf_prefix_rcv_add_addr+0x1c0/0x4f0
      [   54.891255]  addrconf_prefix_rcv_add_addr+0x1c0/0x4f0
      [   54.891268]  addrconf_prefix_rcv+0x2e5/0x9b0
      [   54.891279]  ? neigh_update+0x446/0xb90
      [   54.891298]  ? ndisc_router_discovery+0x5ab/0xf00
      [   54.891303]  ndisc_router_discovery+0x5ab/0xf00
      [   54.891311]  ? retint_kernel+0x2d/0x2d
      [   54.891331]  ndisc_rcv+0x1b6/0x270
      [   54.891340]  icmpv6_rcv+0x6aa/0x9f0
      [   54.891345]  ? ipv6_chk_mcast_addr+0x176/0x530
      [   54.891351]  ? do_csum+0x17b/0x260
      [   54.891360]  ip6_input_finish+0x194/0xb20
      [   54.891372]  ip6_input+0x5b/0x2c0
      [   54.891380]  ? ip6_rcv_finish+0x320/0x320
      [   54.891389]  ip6_mc_input+0x15a/0x250
      [   54.891396]  ipv6_rcv+0x772/0x1050
      [   54.891403]  ? consume_skb+0xbe/0x2d0
      [   54.891412]  ? ip6_make_skb+0x2a0/0x2a0
      [   54.891418]  ? ip6_input+0x2c0/0x2c0
      [   54.891425]  __netif_receive_skb_core+0xa0f/0x1600
      [   54.891436]  ? process_backlog+0xac/0x400
      [   54.891441]  process_backlog+0xfa/0x400
      [   54.891448]  ? net_rx_action+0x145/0x1130
      [   54.891456]  net_rx_action+0x310/0x1130
      [   54.891524]  ? RTUSBBulkReceive+0x11d/0x190 [mt7610u_sta]
      [   54.891538]  __do_softirq+0x140/0xaba
      [   54.891553]  irq_exit+0x10b/0x160
      [   54.891561]  do_IRQ+0xbb/0x1b0
      
      Fixes: f3d9832e ("ipv6: addrconf: cleanup locking in ipv6_add_addr")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NValdis Kletnieks <valdis.kletnieks@vt.edu>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Tested-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fffcefe9
  11. 05 11月, 2017 1 次提交
  12. 01 11月, 2017 2 次提交
    • E
      ipv6: addrconf: increment ifp refcount before ipv6_del_addr() · e669b869
      Eric Dumazet 提交于
      In the (unlikely) event fixup_permanent_addr() returns a failure,
      addrconf_permanent_addr() calls ipv6_del_addr() without the
      mandatory call to in6_ifa_hold(), leading to a refcount error,
      spotted by syzkaller :
      
      WARNING: CPU: 1 PID: 3142 at lib/refcount.c:227 refcount_dec+0x4c/0x50
      lib/refcount.c:227
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 3142 Comm: ip Not tainted 4.14.0-rc4-next-20171009+ #33
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       panic+0x1e4/0x41c kernel/panic.c:181
       __warn+0x1c4/0x1e0 kernel/panic.c:544
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
       do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
       do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
      RIP: 0010:refcount_dec+0x4c/0x50 lib/refcount.c:227
      RSP: 0018:ffff8801ca49e680 EFLAGS: 00010286
      RAX: 000000000000002c RBX: ffff8801d07cfcdc RCX: 0000000000000000
      RDX: 000000000000002c RSI: 1ffff10039493c90 RDI: ffffed0039493cc4
      RBP: ffff8801ca49e688 R08: ffff8801ca49dd70 R09: 0000000000000000
      R10: ffff8801ca49df58 R11: 0000000000000000 R12: 1ffff10039493cd9
      R13: ffff8801ca49e6e8 R14: ffff8801ca49e7e8 R15: ffff8801d07cfcdc
       __in6_ifa_put include/net/addrconf.h:369 [inline]
       ipv6_del_addr+0x42b/0xb60 net/ipv6/addrconf.c:1208
       addrconf_permanent_addr net/ipv6/addrconf.c:3327 [inline]
       addrconf_notify+0x1c66/0x2190 net/ipv6/addrconf.c:3393
       notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93
       __raw_notifier_call_chain kernel/notifier.c:394 [inline]
       raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
       call_netdevice_notifiers_info+0x32/0x60 net/core/dev.c:1697
       call_netdevice_notifiers net/core/dev.c:1715 [inline]
       __dev_notify_flags+0x15d/0x430 net/core/dev.c:6843
       dev_change_flags+0xf5/0x140 net/core/dev.c:6879
       do_setlink+0xa1b/0x38e0 net/core/rtnetlink.c:2113
       rtnl_newlink+0xf0d/0x1a40 net/core/rtnetlink.c:2661
       rtnetlink_rcv_msg+0x733/0x1090 net/core/rtnetlink.c:4301
       netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2408
       rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4313
       netlink_unicast_kernel net/netlink/af_netlink.c:1273 [inline]
       netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1299
       netlink_sendmsg+0xa4a/0xe70 net/netlink/af_netlink.c:1862
       sock_sendmsg_nosec net/socket.c:633 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:643
       ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2049
       __sys_sendmsg+0xe5/0x210 net/socket.c:2083
       SYSC_sendmsg net/socket.c:2094 [inline]
       SyS_sendmsg+0x2d/0x50 net/socket.c:2090
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x7fa9174d3320
      RSP: 002b:00007ffe302ae9e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00007ffe302b2ae0 RCX: 00007fa9174d3320
      RDX: 0000000000000000 RSI: 00007ffe302aea20 RDI: 0000000000000016
      RBP: 0000000000000082 R08: 0000000000000000 R09: 000000000000000f
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe302b32a0
      R13: 0000000000000000 R14: 00007ffe302b2ab8 R15: 00007ffe302b32b8
      
      Fixes: f1705ec1 ("net: ipv6: Make address flushing on ifdown optional")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: David Ahern <dsahern@gmail.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e669b869
    • V
      net: display hw address of source machine during ipv6 DAD failure · da13c59b
      Vishwanath Pai 提交于
      This patch updates the error messages displayed in kernel log to include
      hwaddress of the source machine that caused ipv6 duplicate address
      detection failures.
      
      Examples:
      
      a) When we receive a NA packet from another machine advertising our
      address:
      
      ICMPv6: NA: 34:ab:cd:56:11:e8 advertised our address 2001:db8:: on eth0!
      
      b) When we detect DAD failure during address assignment to an interface:
      
      IPv6: eth0: IPv6 duplicate address 2001:db8:: used by 34:ab:cd:56:11:e8
      detected!
      
      v2:
          Changed %pI6 to %pI6c in ndisc_recv_na()
          Chaged the v6 address in the commit message to 2001:db8::
      Suggested-by: NIgor Lubashev <ilubashe@akamai.com>
      Signed-off-by: NVishwanath Pai <vpai@akamai.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da13c59b
  13. 24 10月, 2017 7 次提交
  14. 20 10月, 2017 3 次提交
    • D
      net: Add extack to validator_info structs used for address notifier · de95e047
      David Ahern 提交于
      Add extack to in_validator_info and in6_validator_info. Update the one
      user of each, ipvlan, to return an error message for failures.
      
      Only manual configuration of an address is plumbed in the IPv6 code path.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de95e047
    • D
      net: ipv6: Make inet6addr_validator a blocking notifier · ff7883ea
      David Ahern 提交于
      inet6addr_validator chain was added by commit 3ad7d246 ("Ipvlan
      should return an error when an address is already in use") to allow
      address validation before changes are committed and to be able to
      fail the address change with an error back to the user. The address
      validation is not done for addresses received from router
      advertisements.
      
      Handling RAs in softirq context is the only reason for the notifier
      chain to be atomic versus blocking. Since the only current user, ipvlan,
      of the validator chain ignores softirq context, the notifier can be made
      blocking and simply not invoked for softirq path.
      
      The blocking option is needed by spectrum for example to validate
      resources for an adding an address to an interface.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff7883ea
    • D
      ipv6: addrconf: cleanup locking in ipv6_add_addr · f3d9832e
      David Ahern 提交于
      ipv6_add_addr is called in process context with rtnl lock held
      (e.g., manual config of an address) or during softirq processing
      (e.g., autoconf and address from a router advertisement).
      
      Currently, ipv6_add_addr calls rcu_read_lock_bh shortly after entry
      and does not call unlock until exit, minus the call around the address
      validator notifier. Similarly, addrconf_hash_lock is taken after the
      validator notifier and held until exit. This forces the allocation of
      inet6_ifaddr to always be atomic.
      
      Refactor ipv6_add_addr as follows:
      1. add an input boolean to discriminate the call path (process context
         or softirq). This new flag controls whether the alloc can be done
         with GFP_KERNEL or GFP_ATOMIC.
      
      2. Move the rcu_read_lock_bh and unlock calls only around functions that
         do rcu updates.
      
      3. Remove the in6_dev_hold and put added by 3ad7d246 ("Ipvlan should
         return an error when an address is already in use."). This was done
         presumably because rcu_read_unlock_bh needs to be called before calling
         the validator. Since rcu_read_lock is not needed before the validator
         runs revert the hold and put added by 3ad7d246 and only do the
         hold when setting ifp->idev.
      
      4. move duplicate address check and insertion of new address in the global
         address hash into a helper. The helper is called after an ifa is
         allocated and filled in.
      
      This allows the ifa for manually configured addresses to be done with
      GFP_KERNEL and reduces the overall amount of time with rcu_read_lock held
      and hash table spinlock held.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f3d9832e
  15. 12 10月, 2017 2 次提交
  16. 09 10月, 2017 6 次提交
  17. 08 10月, 2017 5 次提交
    • M
      ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real · a2d3f3e3
      Matteo Croce 提交于
      Commit 35e015e1 ("ipv6: fix net.ipv6.conf.all interface DAD handlers")
      was intended to affect accept_dad flag handling in such a way that
      DAD operation and mode on a given interface would be selected
      according to the maximum value of conf/{all,interface}/accept_dad.
      
      However, addrconf_dad_begin() checks for particular cases in which we
      need to skip DAD, and this check was modified in the wrong way.
      
      Namely, it was modified so that, if the accept_dad flag is 0 for the
      given interface *or* for all interfaces, DAD would be skipped.
      
      We have instead to skip DAD if accept_dad is 0 for the given interface
      *and* for all interfaces.
      
      Fixes: 35e015e1 ("ipv6: fix net.ipv6.conf.all interface DAD handlers")
      Acked-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NMatteo Croce <mcroce@redhat.com>
      Reported-by: NErik Kline <ek@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2d3f3e3
    • W
      ipv6: replace rwlock with rcu and spinlock in fib6_table · 66f5d6ce
      Wei Wang 提交于
      With all the preparation work before, we are now ready to replace rwlock
      with rcu and spinlock in fib6_table.
      That means now all fib6_node in fib6_table are protected by rcu. And
      when freeing fib6_node, call_rcu() is used to wait for the rcu grace
      period before releasing the memory.
      When accessing fib6_node, corresponding rcu APIs need to be used.
      And all previous sessions protected by the write lock will now be
      protected by the spin lock per table.
      All previous sessions protected by read lock will now be protected by
      rcu_read_lock().
      
      A couple of things to note here:
      1. As part of the work of replacing rwlock with rcu, the linked list of
      fn->leaf now has to be rcu protected as well. So both fn->leaf and
      rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are
      used when manipulating them.
      
      2. For fn->rr_ptr, first of all, it also needs to be rcu protected now
      and is tagged with __rcu and rcu APIs are used in corresponding places.
      Secondly, fn->rr_ptr is changed in rt6_select() which is a reader
      thread. This makes the issue a bit complicated. We think a valid
      solution for it is to let rt6_select() grab the tb6_lock if it decides
      to change it. As it is not in the normal operation and only happens when
      there is no valid neighbor cache for the route, we think the performance
      impact should be low.
      
      3. fib6_walk_continue() has to be called with tb6_lock held even in the
      route dumping related functions, e.g. inet6_dump_fib(),
      fib6_tables_dump() and ipv6_route_seq_ops. It is because
      fib6_walk_continue() makes modifications to the walker structure, and so
      are fib6_repair_tree() and fib6_del_route(). In order to do proper
      syncing between them, we need to let fib6_walk_continue() hold the lock.
      We may be able to do further improvement on the way we do the tree walk
      to get rid of the need for holding the spin lock. But not for now.
      
      4. When fib6_del_route() removes a route from the tree, we no longer
      mark rt->dst.rt6_next to NULL to make simultaneous reader be able to
      further traverse the list with rcu. However, rt->dst.rt6_next is only
      valid within this same rcu period. No one should access it later.
      
      5. All the operation of atomic_inc(rt->rt6i_ref) is changed to be
      performed before we publish this route (either by linking it to fn->leaf
      or insert it in the list pointed by fn->leaf) just to be safe because as
      soon as we publish the route, some read thread will be able to access it.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      66f5d6ce
    • W
      ipv6: replace dst_hold() with dst_hold_safe() in routing code · d3843fe5
      Wei Wang 提交于
      With rwlock, it is safe to call dst_hold() in the read thread because
      read thread is guaranteed to be separated from write thread.
      However, after we replace rwlock with rcu, it is no longer safe to use
      dst_hold(). A dst might already have been deleted but is waiting for the
      rcu grace period to pass before freeing the memory when a read thread is
      trying to do dst_hold(). This could potentially cause double free issue.
      
      So this commit replaces all dst_hold() with dst_hold_safe() in all read
      thread to avoid this double free issue.
      And in order to make the code more compact, a new function ip6_hold_safe()
      is introduced. It calls dst_hold_safe() first, and if that fails, it will
      either fall back to hold and return net->ipv6.ip6_null_entry or set rt to
      NULL according to the caller's need.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3843fe5
    • W
      ipv6: hook up exception table to store dst cache · 2b760fcf
      Wei Wang 提交于
      This commit makes use of the exception hash table implementation to
      store dst caches created by pmtu discovery and ip redirect into the hash
      table under the rt_info and no longer inserts these routes into fib6
      tree.
      This makes the fib6 tree only contain static configured routes and could
      now be protected by rcu instead of a rw lock.
      With this change, in the route lookup related functions, after finding
      the rt6_info with the longest prefix, we also need to search for the
      exception table before doing backtracking.
      In the route delete function, if the route being deleted is not a dst
      cache, deletion of this route also need to flush the whole hash table
      under it. If it is a dst cache, then only delete the cached dst in the
      hash table.
      
      Note: for fib6_walk_continue() function, w->root now is always pointing
      to a root node considering that fib6_prune_clones() is removed from the
      code. So we add a WARN_ON() msg to make sure w->root always points to a
      root node and also removed the update of w->root in fib6_repair_tree().
      This is a prerequisite for later patch because we don't need to make
      w->root as rcu protected when replacing rwlock with RCU.
      Also, we remove all prune related variables as it is no longer used.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b760fcf
    • W
      ipv6: prepare fib6_locate() for exception table · 38fbeeee
      Wei Wang 提交于
      fib6_locate() is used to find the fib6_node according to the passed in
      prefix address key. It currently tries to find the fib6_node with the
      exact match of the passed in key. However, when we move cached routes
      into the exception table, fib6_locate() will fail to find the fib6_node
      for it as the cached routes will be stored in the exception table under
      the fib6_node with the longest prefix match of the cache's dst addr key.
      This commit adds a new parameter to let the caller specify if it needs
      exact match or longest prefix match.
      Right now, all callers still does exact match when calling
      fib6_locate(). It will be changed in later commit where exception table
      is hooked up to store cached routes.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38fbeeee