Merge branch 'nexthop-preparations-for-resilient-next-hop-groups'

Petr Machata says:

====================
nexthop: Preparations for resilient next-hop groups

At this moment, there is only one type of next-hop group: an mpath group.
Mpath groups implement the hash-threshold algorithm, described in RFC
2992[1].

To select a next hop, hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus:

            +-------+-------+-------+-------+-------+
            |   1   |   2   |   3   |   4   |   5   |
            +-------+-+-----+---+---+-----+-+-------+
            |    1    |    2    |    4    |    5    |
            +---------+---------+---------+---------+

             Before and after deletion of next hop 3
             under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment causes an issue that packets from a single
flow suddenly end up arriving at a server that does not expect them, which
may lead to TCP reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way causes the packets to arrive in wrong order.

Resilient hashing is a technique to address the above problem. Resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation to choose a hash bucket, and then reads
the next hop that this bucket contains, and forwards traffic there.

This indirection brings an important feature.
In the hash-threshold algorithm, the range of hashes associated with a
next hop must be continuous. With a hash table, mapping between the hash
table buckets and the individual next hops is arbitrary. Therefore when a
next hop is deleted the buckets that held it are simply reassigned to other
next hops:

            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                             v v v v
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

             Before and after deletion of next hop 3
             under the resilient hashing algorithm.

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows are ideally
kept being forwarded to the same endpoints through the same paths as before
the next-hop group change.

This patchset prepares the next-hop code for eventual introduction of
resilient hashing groups.

- Patches #1-#4 carry otherwise disjoint changes that just remove certain
  assumptions in the next-hop code.

- Patches #5-#6 extend the in-kernel next-hop notifiers to support more
  next-hop group types.

- Patches #7-#12 refactor RTNL message handlers. Resilient next-hop groups
  will introduce a new logical object, a hash table bucket. It turns out
  that handling bucket-related messages is similar to how next-hop messages
  are handled. These patches extract the commonalities into reusable
  components.
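
The two selection schemes described above can be sketched in plain userspace C. This is only an illustration of the idea, not the kernel implementation: the names `ht_entry`, `ht_select`, `res_select` and `res_remove` are hypothetical, and the round-robin choice of which surviving next hop inherits a freed bucket is an assumption made here to reproduce the second diagram.

```c
#include <stddef.h>
#include <stdint.h>

/* Hash-threshold: each next hop owns one contiguous range of the hash
 * space, described by its inclusive upper bound. Entries are sorted by
 * upper bound. (Hypothetical userspace types, not kernel structures.)
 */
struct ht_entry {
	int nh_id;
	uint32_t upper_bound;
};

/* Return the first next hop whose range contains the hash, or -1. */
static int ht_select(const struct ht_entry *e, size_t n, uint32_t hash)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (hash <= e[i].upper_bound)
			return e[i].nh_id;
	return -1;
}

/* Resilient: a table of buckets, each holding a next-hop ID. Selection
 * is a modulo operation followed by a table read.
 */
static int res_select(const int *buckets, size_t n, uint32_t hash)
{
	return buckets[hash % n];
}

/* Remove a next hop: only buckets that pointed at it are rewritten.
 * Assumption: freed buckets are handed to the survivors round-robin.
 */
static void res_remove(int *buckets, size_t n, int removed,
		       const int *survivors, size_t n_surv)
{
	size_t i, next = 0;

	if (!n_surv)
		return;

	for (i = 0; i < n; i++) {
		if (buckets[i] != removed)
			continue;
		buckets[i] = survivors[next];
		next = (next + 1) % n_surv;
	}
}
```

With the 20-bucket table from the diagram and survivors 1, 2, 4 and 5, `res_remove()` rewrites only buckets 8 through 11; every hash that mapped to another bucket keeps resolving to the same next hop.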

The plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next hop groups (this patchset)
3) Implementation of resilient next hop group
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and
[3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1
====================

Link: https://lore.kernel.org/r/cover.1611836479.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

67d25ce8 · Jakub Kicinski · 4915a404 · 0bccf8ed
4 changed files
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -4309,11 +4309,18 @@ static int mlxsw_sp_nexthop_obj_validate(struct mlxsw_sp *mlxsw_sp,
 	if (event != NEXTHOP_EVENT_REPLACE)
 		return 0;

-	if (!info->is_grp)
+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
 		return mlxsw_sp_nexthop_obj_single_validate(mlxsw_sp, info->nh,
 							    info->extack);
-	return mlxsw_sp_nexthop_obj_group_validate(mlxsw_sp, info->nh_grp,
-						   info->extack);
+	case NH_NOTIFIER_INFO_TYPE_GRP:
+		return mlxsw_sp_nexthop_obj_group_validate(mlxsw_sp,
+							   info->nh_grp,
+							   info->extack);
+	default:
+		NL_SET_ERR_MSG_MOD(info->extack, "Unsupported nexthop type");
+		return -EOPNOTSUPP;
+	}
 }

 static bool mlxsw_sp_nexthop_obj_is_gateway(struct mlxsw_sp *mlxsw_sp,
@@ -4321,13 +4328,17 @@ static bool mlxsw_sp_nexthop_obj_is_gateway(struct mlxsw_sp *mlxsw_sp,
 {
 	const struct net_device *dev;

-	if (info->is_grp)
+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
+		dev = info->nh->dev;
+		return info->nh->gw_family || info->nh->is_reject ||
+		       mlxsw_sp_netdev_ipip_type(mlxsw_sp, dev, NULL);
+	case NH_NOTIFIER_INFO_TYPE_GRP:
 		/* Already validated earlier. */
 		return true;
-
-	dev = info->nh->dev;
-	return info->nh->gw_family || info->nh->is_reject ||
-	       mlxsw_sp_netdev_ipip_type(mlxsw_sp, dev, NULL);
+	default:
+		return false;
+	}
 }

 static void mlxsw_sp_nexthop_obj_blackhole_init(struct mlxsw_sp *mlxsw_sp,
@@ -4410,11 +4421,22 @@ mlxsw_sp_nexthop_obj_group_info_init(struct mlxsw_sp *mlxsw_sp,
 				     struct mlxsw_sp_nexthop_group *nh_grp,
 				     struct nh_notifier_info *info)
 {
-	unsigned int nhs = info->is_grp ? info->nh_grp->num_nh : 1;
 	struct mlxsw_sp_nexthop_group_info *nhgi;
 	struct mlxsw_sp_nexthop *nh;
+	unsigned int nhs;
 	int err, i;

+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
+		nhs = 1;
+		break;
+	case NH_NOTIFIER_INFO_TYPE_GRP:
+		nhs = info->nh_grp->num_nh;
+		break;
+	default:
+		return -EINVAL;
+	}
+
 	nhgi = kzalloc(struct_size(nhgi, nexthops, nhs), GFP_KERNEL);
 	if (!nhgi)
 		return -ENOMEM;
@@ -4427,12 +4449,18 @@ mlxsw_sp_nexthop_obj_group_info_init(struct mlxsw_sp *mlxsw_sp,
 		int weight;

 		nh = &nhgi->nexthops[i];
-		if (info->is_grp) {
-			nh_obj = &info->nh_grp->nh_entries[i].nh;
-			weight = info->nh_grp->nh_entries[i].weight;
-		} else {
+		switch (info->type) {
+		case NH_NOTIFIER_INFO_TYPE_SINGLE:
 			nh_obj = info->nh;
 			weight = 1;
+			break;
+		case NH_NOTIFIER_INFO_TYPE_GRP:
+			nh_obj = &info->nh_grp->nh_entries[i].nh;
+			weight = info->nh_grp->nh_entries[i].weight;
+			break;
+		default:
+			err = -EINVAL;
+			goto err_nexthop_obj_init;
 		}
 		err = mlxsw_sp_nexthop_obj_init(mlxsw_sp, nh_grp, nh, nh_obj,
 						weight);

--- a/drivers/net/netdevsim/fib.c
+++ b/drivers/net/netdevsim/fib.c
@@ -860,7 +860,7 @@ static struct nsim_nexthop *nsim_nexthop_create(struct nsim_fib_data *data,

 	nexthop = kzalloc(sizeof(*nexthop), GFP_KERNEL);
 	if (!nexthop)
-		return NULL;
+		return ERR_PTR(-ENOMEM);

 	nexthop->id = info->id;

@@ -868,15 +868,20 @@ static struct nsim_nexthop *nsim_nexthop_create(struct nsim_fib_data *data,
 	 * occupy.
 	 */

-	if (!info->is_grp) {
+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
 		occ = 1;
-		goto out;
+		break;
+	case NH_NOTIFIER_INFO_TYPE_GRP:
+		for (i = 0; i < info->nh_grp->num_nh; i++)
+			occ += info->nh_grp->nh_entries[i].weight;
+		break;
+	default:
+		NL_SET_ERR_MSG_MOD(info->extack, "Unsupported nexthop type");
+		kfree(nexthop);
+		return ERR_PTR(-EOPNOTSUPP);
 	}

-	for (i = 0; i < info->nh_grp->num_nh; i++)
-		occ += info->nh_grp->nh_entries[i].weight;
-
-out:
 	nexthop->occ = occ;
 	return nexthop;
 }
@@ -972,8 +977,8 @@ static int nsim_nexthop_insert(struct nsim_fib_data *data,
 	int err;

 	nexthop = nsim_nexthop_create(data, info);
-	if (!nexthop)
-		return -ENOMEM;
+	if (IS_ERR(nexthop))
+		return PTR_ERR(nexthop);

 	nexthop_old = rhashtable_lookup_fast(&data->nexthop_ht, &info->id,
 					     nsim_nexthop_ht_params);
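
The netdevsim hunks above switch `nsim_nexthop_create()` from returning NULL to returning an error encoded in the pointer, so the caller can tell -ENOMEM from -EOPNOTSUPP. A minimal userspace sketch of that pattern follows. The helpers `err_ptr`, `ptr_err` and `is_err` are hypothetical re-implementations written for this example (the kernel's own versions live in its headers and are not reproduced here), and `obj_create`/`obj_insert` are stand-ins, not netdevsim code.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (userspace stand-in). */
static void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
static long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-(intptr_t)MAX_ERRNO);
}

enum obj_type { OBJ_SINGLE, OBJ_GRP, OBJ_OTHER };

struct obj {
	unsigned int occ;
};

/* Allocate an object; report failures as distinct error pointers. */
static struct obj *obj_create(enum obj_type type, unsigned int weight_sum)
{
	struct obj *o = calloc(1, sizeof(*o));

	if (!o)
		return err_ptr(-ENOMEM);

	switch (type) {
	case OBJ_SINGLE:
		o->occ = 1;
		break;
	case OBJ_GRP:
		o->occ = weight_sum;
		break;
	default:
		free(o);
		return err_ptr(-EOPNOTSUPP);
	}
	return o;
}

/* Caller propagates whichever error the constructor reported. */
static int obj_insert(enum obj_type type, unsigned int weight_sum,
		      unsigned int *occ_out)
{
	struct obj *o = obj_create(type, weight_sum);

	if (is_err(o))
		return (int)ptr_err(o);

	*occ_out = o->occ;
	free(o);
	return 0;
}
```

The point of the design is that the default branch of the switch can fail for a reason other than allocation, which a bare NULL return could not express.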

--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -66,7 +66,12 @@ struct nh_info {
 struct nh_grp_entry {
 	struct nexthop	*nh;
 	u8		weight;
-	atomic_t	upper_bound;
+
+	union {
+		struct {
+			atomic_t	upper_bound;
+		} mpath;
+	};

 	struct list_head nh_list;
 	struct nexthop	*nh_parent;  /* nexthop of group with this entry */
@@ -109,6 +114,11 @@ enum nexthop_event_type {
 	NEXTHOP_EVENT_REPLACE,
 };

+enum nh_notifier_info_type {
+	NH_NOTIFIER_INFO_TYPE_SINGLE,
+	NH_NOTIFIER_INFO_TYPE_GRP,
+};
+
 struct nh_notifier_single_info {
 	struct net_device *dev;
 	u8 gw_family;
@@ -137,7 +147,7 @@ struct nh_notifier_info {
 	struct net *net;
 	struct netlink_ext_ack *extack;
 	u32 id;
-	bool is_grp;
+	enum nh_notifier_info_type type;
 	union {
 		struct nh_notifier_single_info *nh;
 		struct nh_notifier_grp_info *nh_grp;

--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -71,6 +71,7 @@ __nh_notifier_single_info_init(struct nh_notifier_single_info *nh_info,
 static int nh_notifier_single_info_init(struct nh_notifier_info *info,
 					const struct nexthop *nh)
 {
+	info->type = NH_NOTIFIER_INFO_TYPE_SINGLE;
 	info->nh = kzalloc(sizeof(*info->nh), GFP_KERNEL);
 	if (!info->nh)
 		return -ENOMEM;
@@ -85,13 +86,13 @@ static void nh_notifier_single_info_fini(struct nh_notifier_info *info)
 	kfree(info->nh);
 }

-static int nh_notifier_grp_info_init(struct nh_notifier_info *info,
-				     const struct nexthop *nh)
+static int nh_notifier_mp_info_init(struct nh_notifier_info *info,
+				    struct nh_group *nhg)
 {
-	struct nh_group *nhg = rtnl_dereference(nh->nh_grp);
 	u16 num_nh = nhg->num_nh;
 	int i;

+	info->type = NH_NOTIFIER_INFO_TYPE_GRP;
 	info->nh_grp = kzalloc(struct_size(info->nh_grp, nh_entries, num_nh),
 			       GFP_KERNEL);
 	if (!info->nh_grp)
@@ -112,27 +113,41 @@ static int nh_notifier_grp_info_init(struct nh_notifier_info *info,
 	return 0;
 }

-static void nh_notifier_grp_info_fini(struct nh_notifier_info *info)
+static int nh_notifier_grp_info_init(struct nh_notifier_info *info,
+				     const struct nexthop *nh)
+{
+	struct nh_group *nhg = rtnl_dereference(nh->nh_grp);
+
+	if (nhg->mpath)
+		return nh_notifier_mp_info_init(info, nhg);
+	return -EINVAL;
+}
+
+static void nh_notifier_grp_info_fini(struct nh_notifier_info *info,
+				      const struct nexthop *nh)
 {
-	kfree(info->nh_grp);
+	struct nh_group *nhg = rtnl_dereference(nh->nh_grp);
+
+	if (nhg->mpath)
+		kfree(info->nh_grp);
 }

 static int nh_notifier_info_init(struct nh_notifier_info *info,
 				 const struct nexthop *nh)
 {
 	info->id = nh->id;
-	info->is_grp = nh->is_group;

-	if (info->is_grp)
+	if (nh->is_group)
 		return nh_notifier_grp_info_init(info, nh);
 	else
 		return nh_notifier_single_info_init(info, nh);
 }

-static void nh_notifier_info_fini(struct nh_notifier_info *info)
+static void nh_notifier_info_fini(struct nh_notifier_info *info,
+				  const struct nexthop *nh)
 {
-	if (info->is_grp)
-		nh_notifier_grp_info_fini(info);
+	if (nh->is_group)
+		nh_notifier_grp_info_fini(info, nh);
 	else
 		nh_notifier_single_info_fini(info);
 }
@@ -161,7 +176,7 @@ static int call_nexthop_notifiers(struct net *net,

 	err = blocking_notifier_call_chain(&net->nexthop.notifier_chain,
 					   event_type, &info);
-	nh_notifier_info_fini(&info);
+	nh_notifier_info_fini(&info, nh);

 	return notifier_to_errno(err);
 }
@@ -182,7 +197,7 @@ static int call_nexthop_notifier(struct notifier_block *nb, struct net *net,
 		return err;

 	err = nb->notifier_call(nb, event_type, &info);
-	nh_notifier_info_fini(&info);
+	nh_notifier_info_fini(&info, nh);

 	return notifier_to_errno(err);
 }
@@ -209,7 +224,7 @@ static void nexthop_devhash_add(struct net *net, struct nh_info *nhi)
 	hlist_add_head(&nhi->dev_hash, head);
 }

-static void nexthop_free_mpath(struct nexthop *nh)
+static void nexthop_free_group(struct nexthop *nh)
 {
 	struct nh_group *nhg;
 	int i;
@@ -249,7 +264,7 @@ void nexthop_free_rcu(struct rcu_head *head)
 	struct nexthop *nh = container_of(head, struct nexthop, rcu);

 	if (nh->is_group)
-		nexthop_free_mpath(nh);
+		nexthop_free_group(nh);
 	else
 		nexthop_free_single(nh);

@@ -680,21 +695,16 @@ static bool ipv4_good_nh(const struct fib_nh *nh)
 	return !!(state & NUD_VALID);
 }

-struct nexthop *nexthop_select_path(struct nexthop *nh, int hash)
+static struct nexthop *nexthop_select_path_mp(struct nh_group *nhg, int hash)
 {
 	struct nexthop *rc = NULL;
-	struct nh_group *nhg;
 	int i;

-	if (!nh->is_group)
-		return nh;
-
-	nhg = rcu_dereference(nh->nh_grp);
 	for (i = 0; i < nhg->num_nh; ++i) {
 		struct nh_grp_entry *nhge = &nhg->nh_entries[i];
 		struct nh_info *nhi;

-		if (hash > atomic_read(&nhge->upper_bound))
+		if (hash > atomic_read(&nhge->mpath.upper_bound))
 			continue;

 		nhi = rcu_dereference(nhge->nh->nh_info);
@@ -721,6 +731,21 @@ struct nexthop *nexthop_select_path(struct nexthop *nh, int hash)

 	return rc;
 }
+
+struct nexthop *nexthop_select_path(struct nexthop *nh, int hash)
+{
+	struct nh_group *nhg;
+
+	if (!nh->is_group)
+		return nh;
+
+	nhg = rcu_dereference(nh->nh_grp);
+	if (nhg->mpath)
+		return nexthop_select_path_mp(nhg, hash);
+
+	/* Unreachable. */
+	return NULL;
+}
 EXPORT_SYMBOL_GPL(nexthop_select_path);

 int nexthop_for_each_fib6_nh(struct nexthop *nh,
@@ -914,7 +939,7 @@ static void nh_group_rebalance(struct nh_group *nhg)

 		w += nhge->weight;
 		upper_bound = DIV_ROUND_CLOSEST_ULL((u64)w << 31, total) - 1;
-		atomic_set(&nhge->upper_bound, upper_bound);
+		atomic_set(&nhge->mpath.upper_bound, upper_bound);
 	}
 }

@@ -1456,10 +1481,13 @@ static struct nexthop *nexthop_create_group(struct net *net,
 		nhg->nh_entries[i].nh_parent = nh;
 	}

-	if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_MPATH) {
+	if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_MPATH)
 		nhg->mpath = 1;
+
+	WARN_ON_ONCE(nhg->mpath != 1);
+
+	if (nhg->mpath)
 		nh_group_rebalance(nhg);
-	}

 	if (cfg->nh_fdb)
 		nhg->fdb_nh = 1;
@@ -1844,37 +1872,44 @@ static int rtm_new_nexthop(struct sk_buff *skb, struct nlmsghdr *nlh,
 	return err;
 }

-static int nh_valid_get_del_req(struct nlmsghdr *nlh, u32 *id,
-				struct netlink_ext_ack *extack)
+static int __nh_valid_get_del_req(const struct nlmsghdr *nlh,
+				  struct nlattr **tb, u32 *id,
+				  struct netlink_ext_ack *extack)
 {
 	struct nhmsg *nhm = nlmsg_data(nlh);
-	struct nlattr *tb[ARRAY_SIZE(rtm_nh_policy_get)];
-	int err;

-	err = nlmsg_parse(nlh, sizeof(*nhm), tb,
-			  ARRAY_SIZE(rtm_nh_policy_get) - 1,
-			  rtm_nh_policy_get, extack);
-	if (err < 0)
-		return err;
-
-	err = -EINVAL;
 	if (nhm->nh_protocol || nhm->resvd || nhm->nh_scope || nhm->nh_flags) {
 		NL_SET_ERR_MSG(extack, "Invalid values in header");
-		goto out;
+		return -EINVAL;
 	}

 	if (!tb[NHA_ID]) {
 		NL_SET_ERR_MSG(extack, "Nexthop id is missing");
-		goto out;
+		return -EINVAL;
 	}

 	*id = nla_get_u32(tb[NHA_ID]);
-	if (!(*id))
+	if (!(*id)) {
 		NL_SET_ERR_MSG(extack, "Invalid nexthop id");
-	else
-		err = 0;
-out:
-	return err;
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int nh_valid_get_del_req(const struct nlmsghdr *nlh, u32 *id,
+				struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[ARRAY_SIZE(rtm_nh_policy_get)];
+	int err;
+
+	err = nlmsg_parse(nlh, sizeof(struct nhmsg), tb,
+			  ARRAY_SIZE(rtm_nh_policy_get) - 1,
+			  rtm_nh_policy_get, extack);
+	if (err < 0)
+		return err;
+
+	return __nh_valid_get_del_req(nlh, tb, id, extack);
 }

 /* rtnl */
@@ -1943,16 +1978,23 @@ static int rtm_get_nexthop(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 	goto out;
 }

-static bool nh_dump_filtered(struct nexthop *nh, int dev_idx, int master_idx,
-			     bool group_filter, u8 family)
+struct nh_dump_filter {
+	int dev_idx;
+	int master_idx;
+	bool group_filter;
+	bool fdb_filter;
+};
+
+static bool nh_dump_filtered(struct nexthop *nh,
+			     struct nh_dump_filter *filter, u8 family)
 {
 	const struct net_device *dev;
 	const struct nh_info *nhi;

-	if (group_filter && !nh->is_group)
+	if (filter->group_filter && !nh->is_group)
 		return true;

-	if (!dev_idx && !master_idx && !family)
+	if (!filter->dev_idx && !filter->master_idx && !family)
 		return false;

 	if (nh->is_group)
@@ -1963,46 +2005,37 @@ static bool nh_dump_filtered(struct nexthop *nh, int dev_idx, int master_idx,
 		return true;

 	dev = nhi->fib_nhc.nhc_dev;
-	if (dev_idx && (!dev || dev->ifindex != dev_idx))
+	if (filter->dev_idx && (!dev || dev->ifindex != filter->dev_idx))
 		return true;

-	if (master_idx) {
+	if (filter->master_idx) {
 		struct net_device *master;

 		if (!dev)
 			return true;

 		master = netdev_master_upper_dev_get((struct net_device *)dev);
-		if (!master || master->ifindex != master_idx)
+		if (!master || master->ifindex != filter->master_idx)
 			return true;
 	}

 	return false;
 }

-static int nh_valid_dump_req(const struct nlmsghdr *nlh, int *dev_idx,
-			     int *master_idx, bool *group_filter,
-			     bool *fdb_filter, struct netlink_callback *cb)
+static int __nh_valid_dump_req(const struct nlmsghdr *nlh, struct nlattr **tb,
+			       struct nh_dump_filter *filter,
+			       struct netlink_ext_ack *extack)
 {
-	struct netlink_ext_ack *extack = cb->extack;
-	struct nlattr *tb[ARRAY_SIZE(rtm_nh_policy_dump)];
 	struct nhmsg *nhm;
-	int err;
 	u32 idx;

-	err = nlmsg_parse(nlh, sizeof(*nhm), tb,
-			  ARRAY_SIZE(rtm_nh_policy_dump) - 1,
-			  rtm_nh_policy_dump, NULL);
-	if (err < 0)
-		return err;
-
 	if (tb[NHA_OIF]) {
 		idx = nla_get_u32(tb[NHA_OIF]);
 		if (idx > INT_MAX) {
 			NL_SET_ERR_MSG(extack, "Invalid device index");
 			return -EINVAL;
 		}
-		*dev_idx = idx;
+		filter->dev_idx = idx;
 	}
 	if (tb[NHA_MASTER]) {
 		idx = nla_get_u32(tb[NHA_MASTER]);
@@ -2010,10 +2043,10 @@ static int nh_valid_dump_req(const struct nlmsghdr *nlh, int *dev_idx,
 			NL_SET_ERR_MSG(extack, "Invalid master device index");
 			return -EINVAL;
 		}
-		*master_idx = idx;
+		filter->master_idx = idx;
 	}
-	*group_filter = nla_get_flag(tb[NHA_GROUPS]);
-	*fdb_filter = nla_get_flag(tb[NHA_FDB]);
+	filter->group_filter = nla_get_flag(tb[NHA_GROUPS]);
+	filter->fdb_filter = nla_get_flag(tb[NHA_FDB]);

 	nhm = nlmsg_data(nlh);
 	if (nhm->nh_protocol || nhm->resvd || nhm->nh_scope || nhm->nh_flags) {
@@ -2024,24 +2057,49 @@ static int nh_valid_dump_req(const struct nlmsghdr *nlh, int *dev_idx,
 	return 0;
 }

-/* rtnl */
-static int rtm_dump_nexthop(struct sk_buff *skb, struct netlink_callback *cb)
+static int nh_valid_dump_req(const struct nlmsghdr *nlh,
+			     struct nh_dump_filter *filter,
+			     struct netlink_callback *cb)
 {
-	bool group_filter = false, fdb_filter = false;
-	struct nhmsg *nhm = nlmsg_data(cb->nlh);
-	int dev_filter_idx = 0, master_idx = 0;
-	struct net *net = sock_net(skb->sk);
-	struct rb_root *root = &net->nexthop.rb_root;
-	struct rb_node *node;
-	int idx = 0, s_idx;
+	struct nlattr *tb[ARRAY_SIZE(rtm_nh_policy_dump)];
 	int err;

-	err = nh_valid_dump_req(cb->nlh, &dev_filter_idx, &master_idx,
-				&group_filter, &fdb_filter, cb);
+	err = nlmsg_parse(nlh, sizeof(struct nhmsg), tb,
+			  ARRAY_SIZE(rtm_nh_policy_dump) - 1,
+			  rtm_nh_policy_dump, cb->extack);
 	if (err < 0)
 		return err;

-	s_idx = cb->args[0];
+	return __nh_valid_dump_req(nlh, tb, filter, cb->extack);
+}
+
+struct rtm_dump_nh_ctx {
+	u32 idx;
+};
+
+static struct rtm_dump_nh_ctx *
+rtm_dump_nh_ctx(struct netlink_callback *cb)
+{
+	struct rtm_dump_nh_ctx *ctx = (void *)cb->ctx;
+
+	BUILD_BUG_ON(sizeof(*ctx) > sizeof(cb->ctx));
+	return ctx;
+}
+
+static int rtm_dump_walk_nexthops(struct sk_buff *skb,
+				  struct netlink_callback *cb,
+				  struct rb_root *root,
+				  struct rtm_dump_nh_ctx *ctx,
+				  int (*nh_cb)(struct sk_buff *skb,
+					       struct netlink_callback *cb,
+					       struct nexthop *nh, void *data),
+				  void *data)
+{
+	struct rb_node *node;
+	int idx = 0, s_idx;
+	int err;
+
+	s_idx = ctx->idx;
 	for (node = rb_first(root); node; node = rb_next(node)) {
 		struct nexthop *nh;

@@ -2049,30 +2107,58 @@ static int rtm_dump_nexthop(struct sk_buff *skb, struct netlink_callback *cb)
 			goto cont;

 		nh = rb_entry(node, struct nexthop, rb_node);
-		if (nh_dump_filtered(nh, dev_filter_idx, master_idx,
-				     group_filter, nhm->nh_family))
-			goto cont;
-
-		err = nh_fill_node(skb, nh, RTM_NEWNEXTHOP,
-				   NETLINK_CB(cb->skb).portid,
-				   cb->nlh->nlmsg_seq, NLM_F_MULTI);
-		if (err < 0) {
-			if (likely(skb->len))
-				goto out;
-
-			goto out_err;
-		}
+		ctx->idx = idx;
+		err = nh_cb(skb, cb, nh, data);
+		if (err)
+			return err;
 cont:
 		idx++;
 	}

+	ctx->idx = idx;
+	return 0;
+}
+
+static int rtm_dump_nexthop_cb(struct sk_buff *skb, struct netlink_callback *cb,
+			       struct nexthop *nh, void *data)
+{
+	struct nhmsg *nhm = nlmsg_data(cb->nlh);
+	struct nh_dump_filter *filter = data;
+
+	if (nh_dump_filtered(nh, filter, nhm->nh_family))
+		return 0;
+
+	return nh_fill_node(skb, nh, RTM_NEWNEXTHOP,
+			    NETLINK_CB(cb->skb).portid,
+			    cb->nlh->nlmsg_seq, NLM_F_MULTI);
+}
+
+/* rtnl */
+static int rtm_dump_nexthop(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct rtm_dump_nh_ctx *ctx = rtm_dump_nh_ctx(cb);
+	struct net *net = sock_net(skb->sk);
+	struct rb_root *root = &net->nexthop.rb_root;
+	struct nh_dump_filter filter = {};
+	int err;
+
+	err = nh_valid_dump_req(cb->nlh, &filter, cb);
+	if (err < 0)
+		return err;
+
+	err = rtm_dump_walk_nexthops(skb, cb, root, ctx,
+				     &rtm_dump_nexthop_cb, &filter);
+	if (err < 0) {
+		if (likely(skb->len))
+			goto out;
+		goto out_err;
+	}
+
 out:
 	err = skb->len;
 out_err:
-	cb->args[0] = idx;
 	cb->seq = net->nexthop.seq;
 	nl_dump_check_consistent(cb, nlmsg_hdr(skb));
-
 	return err;
 }
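
The last hunks split the dump loop into a generic walker that takes a per-item callback and a small context recording where to resume. A rough userspace analogue, walking an array instead of an rb-tree, might look as follows. All names here (`dump_ctx`, `walk_items`, `out_buf`, `emit_cb`) are hypothetical, and the fixed-capacity buffer merely stands in for an skb that runs out of room.

```c
#include <errno.h>
#include <stddef.h>

/* Resume position, kept across calls (stand-in for the dump context). */
struct dump_ctx {
	size_t idx;
};

typedef int (*item_cb)(int item, void *data);

/* Walk items starting at ctx->idx; stop at the first callback error.
 * The index is stored before each callback so that an item that did
 * not fit is retried on the next call.
 */
static int walk_items(const int *items, size_t n, struct dump_ctx *ctx,
		      item_cb cb, void *data)
{
	size_t idx, s_idx = ctx->idx;
	int err;

	for (idx = 0; idx < n; idx++) {
		if (idx < s_idx)
			continue;

		ctx->idx = idx;
		err = cb(items[idx], data);
		if (err)
			return err;
	}

	ctx->idx = idx;
	return 0;
}

/* Output buffer with limited room. */
struct out_buf {
	int vals[4];
	size_t len;
	size_t cap;
};

/* Append one item, or report that the buffer is full. */
static int emit_cb(int item, void *data)
{
	struct out_buf *b = data;

	if (b->len == b->cap)
		return -EMSGSIZE;

	b->vals[b->len++] = item;
	return 0;
}
```

Calling `walk_items()` a second time with the same context and a fresh buffer continues from the item that failed, which is the property that lets a different callback reuse the same walker.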