1. 11 2月, 2017 3 次提交
  2. 25 12月, 2016 1 次提交
  3. 06 12月, 2016 2 次提交
  4. 04 12月, 2016 3 次提交
    • I
      ipv4: fib: Replay events when registering FIB notifier · c3852ef7
      Ido Schimmel 提交于
      Commit b90eb754 ("fib: introduce FIB notification infrastructure")
      introduced a new notification chain to notify listeners (f.e., switchdev
      drivers) about addition and deletion of routes.
      
      However, upon registration to the chain the FIB tables can already be
      populated, which means potential listeners will have an incomplete view
      of the tables.
      
      Solve that by dumping the FIB tables and replaying the events to the
      passed notification block. The dump itself is done using RCU in order
      not to starve consumers that need RTNL to make progress.
      
      The integrity of the dump is ensured by reading the FIB change sequence
      counter before and after the dump under RTNL. This allows us to avoid
      the problematic situation in which the dumping process sends a ENTRY_ADD
      notification following ENTRY_DEL generated by another process holding
      RTNL.
      
      Callers of the registration function may pass a callback that is
      executed in case the dump was inconsistent with current FIB tables.
      
      The number of retries until a consistent dump is achieved is set to a
      fixed number to prevent callers from looping for long periods of time.
      In case current limit proves to be problematic in the future, it can be
      easily converted to be configurable using a sysctl.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3852ef7
    • I
      ipv4: fib: Allow for consistent FIB dumping · cacaad11
      Ido Schimmel 提交于
      The next patch will enable listeners of the FIB notification chain to
      request a dump of the FIB tables. However, since RTNL isn't taken during
      the dump, it's possible for the FIB tables to change mid-dump, which
      will result in inconsistency between the listener's table and the
      kernel's.
      
      Allow listeners to know about changes that occurred mid-dump, by adding
      a change sequence counter to each net namespace. The counter is
      incremented just before a notification is sent in the FIB chain.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cacaad11
    • I
      ipv4: fib: Convert FIB notification chain to be atomic · d3f706f6
      Ido Schimmel 提交于
      In order not to hold RTNL for long periods of time we're going to dump
      the FIB tables using RCU.
      
      Convert the FIB notification chain to be atomic, as we can't block in
      RCU critical sections.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3f706f6
  5. 17 11月, 2016 2 次提交
  6. 08 11月, 2016 1 次提交
    • A
      fib_trie: Correct /proc/net/route off by one error · fd0285a3
      Alexander Duyck 提交于
      The display of /proc/net/route has had a couple issues due to the fact that
      when I originally rewrote most of fib_trie I made it so that the iterator
      was tracking the next value to use instead of the current.
      
      In addition it had an off by 1 error where I was tracking the first piece
      of data as position 0, even though in reality that belonged to the
      SEQ_START_TOKEN.
      
      This patch updates the code so the iterator tracks the last reported
      position and key instead of the next expected position and key.  In
      addition it shifts things so that all of the leaves start at 1 instead of
      trying to report leaves starting with offset 0 as being valid.  With these
      two issues addressed this should resolve any off by one errors that were
      present in the display of /proc/net/route.
      
      Fixes: 25b97c01 ("ipv4: off-by-one in continuation handling in /proc/net/route")
      Cc: Andy Whitcroft <apw@canonical.com>
      Reported-by: NJason Baron <jbaron@akamai.com>
      Tested-by: NJason Baron <jbaron@akamai.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd0285a3
  7. 28 9月, 2016 2 次提交
  8. 10 9月, 2016 1 次提交
    • G
      ipv4: fix value of ->nlmsg_flags reported in RTM_NEWROUTE events · b93e1fa7
      Guillaume Nault 提交于
      fib_table_insert() inconsistently fills the nlmsg_flags field in its
      notification messages.
      
      Since commit b8f55831 ("[RTNETLINK]: Fix sending netlink message
      when replace route."), the netlink message has its nlmsg_flags set to
      NLM_F_REPLACE if the route replaced a preexisting one.
      
      Then commit a2bb6d7d ("ipv4: include NLM_F_APPEND flag in append
      route notifications") started setting nlmsg_flags to NLM_F_APPEND if
      the route matched a preexisting one but was appended.
      
      In other cases (exclusive creation or prepend), nlmsg_flags is 0.
      
      This patch sets ->nlmsg_flags in all situations, preserving the
      semantic of the NLM_F_* bits:
      
        * NLM_F_CREATE: a new fib entry has been created for this route.
        * NLM_F_EXCL: no other fib entry existed for this route.
        * NLM_F_REPLACE: this route has overwritten a preexisting fib entry.
        * NLM_F_APPEND: the new fib entry was added after other entries for
          the same route.
      
      As a result, the possible flag combination can now be reported
      (iproute2's terminology into parentheses):
      
        * NLM_F_CREATE | NLM_F_EXCL: route didn't exist, exclusive creation
          ("add").
        * NLM_F_CREATE | NLM_F_APPEND: route did already exist, new route
          added after preexisting ones ("append").
        * NLM_F_CREATE: route did already exist, new route added before
          preexisting ones ("prepend").
        * NLM_F_REPLACE: route did already exist, new route replaced the
          first preexisting one ("change").
      Signed-off-by: NGuillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b93e1fa7
  9. 19 8月, 2016 1 次提交
  10. 06 8月, 2016 1 次提交
    • D
      ipv4: panic in leaf_walk_rcu due to stale node pointer · 94d9f1c5
      David Forster 提交于
      Panic occurs when issuing "cat /proc/net/route" whilst
      populating FIB with > 1M routes.
      
      Use of cached node pointer in fib_route_get_idx is unsafe.
      
       BUG: unable to handle kernel paging request at ffffc90001630024
       IP: [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
       PGD 11b08d067 PUD 11b08e067 PMD dac4b067 PTE 0
       Oops: 0000 [#1] SMP
       Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscac
       snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep virti
       acpi_cpufreq button parport_pc ppdev lp parport autofs4 ext4 crc16 mbcache jbd
      tio_ring virtio floppy uhci_hcd ehci_hcd usbcore usb_common libata scsi_mod
       CPU: 1 PID: 785 Comm: cat Not tainted 4.2.0-rc8+ #4
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
       task: ffff8800da1c0bc0 ti: ffff88011a05c000 task.ti: ffff88011a05c000
       RIP: 0010:[<ffffffff814cf6a0>]  [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
       RSP: 0018:ffff88011a05fda0  EFLAGS: 00010202
       RAX: ffff8800d8a40c00 RBX: ffff8800da4af940 RCX: ffff88011a05ff20
       RDX: ffffc90001630020 RSI: 0000000001013531 RDI: ffff8800da4af950
       RBP: 0000000000000000 R08: ffff8800da1f9a00 R09: 0000000000000000
       R10: ffff8800db45b7e4 R11: 0000000000000246 R12: ffff8800da4af950
       R13: ffff8800d97a74c0 R14: 0000000000000000 R15: ffff8800d97a7480
       FS:  00007fd3970e0700(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: ffffc90001630024 CR3: 000000011a7e4000 CR4: 00000000000006e0
       Stack:
        ffffffff814d00d3 0000000000000000 ffff88011a05ff20 ffff8800da1f9a00
        ffffffff811dd8b9 0000000000000800 0000000000020000 00007fd396f35000
        ffffffff811f8714 0000000000003431 ffffffff8138dce0 0000000000000f80
       Call Trace:
        [<ffffffff814d00d3>] ? fib_route_seq_start+0x93/0xc0
        [<ffffffff811dd8b9>] ? seq_read+0x149/0x380
        [<ffffffff811f8714>] ? fsnotify+0x3b4/0x500
        [<ffffffff8138dce0>] ? process_echoes+0x70/0x70
        [<ffffffff8121cfa7>] ? proc_reg_read+0x47/0x70
        [<ffffffff811bb823>] ? __vfs_read+0x23/0xd0
        [<ffffffff811bbd42>] ? rw_verify_area+0x52/0xf0
        [<ffffffff811bbe61>] ? vfs_read+0x81/0x120
        [<ffffffff811bcbc2>] ? SyS_read+0x42/0xa0
        [<ffffffff81549ab2>] ? entry_SYSCALL_64_fastpath+0x16/0x75
       Code: 48 85 c0 75 d8 f3 c3 31 c0 c3 f3 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00
      a 04 89 f0 33 02 44 89 c9 48 d3 e8 0f b6 4a 05 49 89
       RIP  [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0
        RSP <ffff88011a05fda0>
       CR2: ffffc90001630024
      Signed-off-by: NDave Forster <dforster@brocade.com>
      Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      94d9f1c5
  11. 30 1月, 2016 1 次提交
  12. 23 1月, 2016 1 次提交
  13. 28 10月, 2015 1 次提交
  14. 18 9月, 2015 1 次提交
  15. 30 8月, 2015 1 次提交
  16. 14 8月, 2015 2 次提交
  17. 28 7月, 2015 1 次提交
  18. 25 7月, 2015 1 次提交
    • J
      ipv4: consider TOS in fib_select_default · 2392debc
      Julian Anastasov 提交于
      fib_select_default considers alternative routes only when
      res->fi is for the first alias in res->fa_head. In the
      common case this can happen only when the initial lookup
      matches the first alias with highest TOS value. This
      prevents the alternative routes to require specific TOS.
      
      This patch solves the problem as follows:
      
      - routes that require specific TOS should be returned by
      fib_select_default only when TOS matches, as already done
      in fib_table_lookup. This rule implies that depending on the
      TOS we can have many different lists of alternative gateways
      and we have to keep the last used gateway (fa_default) in first
      alias for the TOS instead of using single tb_default value.
      
      - as the aliases are ordered by many keys (TOS desc,
      fib_priority asc), we restrict the possible results to
      routes with matching TOS and lowest metric (fib_priority)
      and routes that match any TOS, again with lowest metric.
      
      For example, packet with TOS 8 can not use gw3 (not lowest
      metric), gw4 (different TOS) and gw6 (not lowest metric),
      all other gateways can be used:
      
      tos 8 via gw1 metric 2 <--- res->fa_head and res->fi
      tos 8 via gw2 metric 2
      tos 8 via gw3 metric 3
      tos 4 via gw4
      tos 0 via gw5
      tos 0 via gw6 metric 1
      Reported-by: NHagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2392debc
  19. 24 6月, 2015 1 次提交
    • A
      net: ipv4 sysctl option to ignore routes when nexthop link is down · 0eeb075f
      Andy Gospodarek 提交于
      This feature is only enabled with the new per-interface or ipv4 global
      sysctls called 'ignore_routes_with_linkdown'.
      
      net.ipv4.conf.all.ignore_routes_with_linkdown = 0
      net.ipv4.conf.default.ignore_routes_with_linkdown = 0
      net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, will report to userspace that a route is
      dead and will no longer resolve to this nexthop when performing a fib
      lookup.  This will signal to userspace that the route will not be
      selected.  The signalling of a RTNH_F_DEAD is only passed to userspace
      if the sysctl is enabled and link is down.  This was done as without it
      the netlink listeners would have no idea whether or not a nexthop would
      be selected.   The kernel only sets RTNH_F_DEAD internally if the
      interface has IFF_UP cleared.
      
      With the new sysctl set, the following behavior can be observed
      (interface p8p1 is link-down):
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
          cache
      
      While the route does remain in the table (so it can be modified if
      needed rather than being wiped away as it would be if IFF_UP was
      cleared), the proper next-hop is chosen automatically when the link is
      down.  Now interface p8p1 is linked-up:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
      90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      
      and the output changes to what one would expect.
      
      If the sysctl is not set, the following output would be expected when
      p8p1 is down:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      
      Since the dead flag does not appear, there should be no expectation that
      the kernel would skip using this route due to link being down.
      
      v2: Split kernel changes into 2 patches, this actually makes a
      behavioral change if the sysctl is set.  Also took suggestion from Alex
      to simplify code by only checking sysctl during fib lookup and
      suggestion from Scott to add a per-interface sysctl.
      
      v3: Code clean-ups to make it more readable and efficient as well as a
      reverse path check fix.
      
      v4: Drop binary sysctl
      
      v5: Whitespace fixups from Dave
      
      v6: Style changes from Dave and checkpatch suggestions
      
      v7: One more checkpatch fixup
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0eeb075f
  20. 22 6月, 2015 1 次提交
  21. 08 6月, 2015 1 次提交
    • F
      fib_trie: coding style: Use pointer after check · f38b24c9
      Firo Yang 提交于
      As Alexander Duyck pointed out that:
      struct tnode {
              ...
              struct key_vector kv[1];
      }
      The kv[1] member of struct tnode is an arry that refernced by
      a null pointer will not crash the system, like this:
      struct tnode *p = NULL;
      struct key_vector *kv = p->kv;
      As such p->kv doesn't actually dereference anything, it is simply a
      means for getting the offset to the array from the pointer p.
      
      This patch make the code more regular to avoid making people feel
      odd when they look at the code.
      Signed-off-by: NFiro Yang <firogm@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f38b24c9
  22. 27 5月, 2015 1 次提交
    • D
      ipv4: Fix fib_trie.c build, missing linux/vmalloc.h include. · ffa915d0
      David S. Miller 提交于
      We used to get this indirectly I supposed, but no longer do.
      
      Either way, an explicit include should have been done in the
      first place.
      
         net/ipv4/fib_trie.c: In function '__node_free_rcu':
      >> net/ipv4/fib_trie.c:293:3: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
            vfree(n);
            ^
         net/ipv4/fib_trie.c: In function 'tnode_alloc':
      >> net/ipv4/fib_trie.c:312:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
            return vzalloc(size);
            ^
      >> net/ipv4/fib_trie.c:312:3: warning: return makes pointer from integer without a cast
         cc1: some warnings being treated as errors
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ffa915d0
  23. 23 5月, 2015 1 次提交
  24. 15 5月, 2015 1 次提交
  25. 13 5月, 2015 1 次提交
  26. 04 4月, 2015 2 次提交
  27. 24 3月, 2015 1 次提交
  28. 13 3月, 2015 1 次提交
    • A
      fib_trie: Provide a deterministic order for fib_alias w/ tables merged · 0b65bd97
      Alexander Duyck 提交于
      This change makes it so that we should always have a deterministic ordering
      for the main and local aliases within the merged table when two leaves
      overlap.
      
      So for example if we have a leaf with a key of 192.168.254.0.  If we
      previously added two aliases with a prefix length of 24 from both local and
      main the first entry would be first and the second would be second.  When I
      was coding this I had added a WARN_ON should such a situation occur as I
      wasn't sure how likely it would be.  However this WARN_ON has been
      triggered so this is something that should be addressed.
      
      With this patch the ordering of the aliases is as follows.  First they are
      sorted on prefix length, then on their table ID, then tos, and finally
      priority.  This way what we end up doing is essentially interleaving the
      two tables on what used to be leaf_info structure boundaries.
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0b65bd97
  29. 12 3月, 2015 2 次提交
    • A
      fib_trie: Only display main table in /proc/net/route · 654eff45
      Alexander Duyck 提交于
      When we merged the tries for local and main I had overlooked the iterator
      for /proc/net/route.  As a result it was outputting both local and main
      when the two tries were merged.
      
      This patch resolves that by only providing output for aliases that are
      actually in the main trie.  As a result we should go back to the original
      behavior which I assume will be necessary to maintain legacy support.
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      654eff45
    • A
      ipv4: FIB Local/MAIN table collapse · 0ddcf43d
      Alexander Duyck 提交于
      This patch is meant to collapse local and main into one by converting
      tb_data from an array to a pointer.  Doing this allows us to point the
      local table into the main while maintaining the same variables in the
      table.
      
      As such the tb_data was converted from an array to a pointer, and a new
      array called data is added in order to still provide an object for tb_data
      to point to.
      
      In order to track the origin of the fib aliases a tb_id value was added in
      a hole that existed on 64b systems.  Using this we can also reverse the
      merge in the event that custom FIB rules are enabled.
      
      With this patch I am seeing an improvement of 20ns to 30ns for routing
      lookups as long as custom rules are not enabled, with custom rules enabled
      we fall back to split tables and the original behavior.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ddcf43d
  30. 11 3月, 2015 1 次提交