提交 · b1e7a8fd854d2f895730e82137400012b509650e · openeuler / Kernel

01 7月, 2006 14 次提交

[PATCH] zoned vm counters: conversion of nr_dirty to per zone counter · b1e7a8fd

由 Christoph Lameter 提交于 6月 30, 2006

This makes nr_dirty a per zone counter.  Looping over all processors is
avoided during writeback state determination.

The counter aggregation for nr_dirty had to be undone in the NFS layer since
we summed up the page counts from multiple zones.  Someone more familiar with
NFS should probably review what I have done.

[akpm@osdl.org: bugfix]
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

b1e7a8fd

[PATCH] zoned vm counters: conversion of nr_pagetables to per zone counter · df849a15

由 Christoph Lameter 提交于 6月 30, 2006

Conversion of nr_page_table_pages to a per zone counter

[akpm@osdl.org: bugfix]
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

df849a15

[PATCH] zoned vm counters: conversion of nr_slab to per zone counter · 9a865ffa

由 Christoph Lameter 提交于 6月 30, 2006

- Allows reclaim to access counter without looping over processor counts.

- Allows accurate statistics on how many pages are used in a zone by
  the slab. This may become useful to balance slab allocations over
  various zones.

[akpm@osdl.org: bugfix]
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

9a865ffa

[PATCH] zoned vm counters: zone_reclaim: remove /proc/sys/vm/zone_reclaim_interval · 34aa1330

由 Christoph Lameter 提交于 6月 30, 2006

The zone_reclaim_interval was necessary because we were not able to determine
how many unmapped pages exist in a zone.  Therefore we had to scan in
intervals to figure out if any pages were unmapped.

With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
pages and the number of mapped pages in a zone.  So we can simply skip the
reclaim if there is an insufficient number of unmapped pages.  We use
SWAP_CLUSTER_MAX as the boundary.

Drop all support for /proc/sys/vm/zone_reclaim_interval.
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

34aa1330

[PATCH] zoned vm counters: split NR_ANON_PAGES off from NR_FILE_MAPPED · f3dbd344

由 Christoph Lameter 提交于 6月 30, 2006

The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
calculation as the number of mapped pagecache pages.  However, that is not
true.  NR_FILE_MAPPED includes the mapped anonymous pages.  This patch
separates those and therefore allows an accurate tracking of the anonymous
pages per zone.

It then becomes possible to determine the number of unmapped pages per zone
and we can avoid scanning for unmapped pages if there are none.

Also it may now be possible to determine the mapped/unmapped ratio in
get_dirty_limit.  Isnt the number of anonymous pages irrelevant in that
calculation?

Note that this will change the meaning of the number of mapped pages reported
in /proc/vmstat /proc/meminfo and in the per node statistics.  This may affect
user space tools that monitor these counters!  NR_FILE_MAPPED works like
NR_FILE_DIRTY.  It is only valid for pagecache pages.
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

f3dbd344

[PATCH] zoned vm counters: remove NR_FILE_MAPPED from scan control structure · bf02cf4b

由 Christoph Lameter 提交于 6月 30, 2006

We can now access the number of pages in a mapped state in an inexpensive way
in shrink_active_list.  So drop the nr_mapped field from scan_control.

[akpm@osdl.org: bugfix]
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

bf02cf4b

[PATCH] zoned vm counters: conversion of nr_pagecache to per zone counter · 347ce434

由 Christoph Lameter 提交于 6月 30, 2006

Currently a single atomic variable is used to establish the size of the page
cache in the whole machine.  The zoned VM counters have the same method of
implementation as the nr_pagecache code but also allow the determination of
the pagecache size per zone.

Remove the special implementation for nr_pagecache and make it a zoned counter
named NR_FILE_PAGES.

Updates of the page cache counters are always performed with interrupts off.
We can therefore use the __ variant here.
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

347ce434

[PATCH] zoned vm counters: convert nr_mapped to per zone counter · 65ba55f5

由 Christoph Lameter 提交于 6月 30, 2006

nr_mapped is important because it allows a determination of how many pages of
a zone are not mapped, which would allow a more efficient means of determining
when we need to reclaim memory in a zone.

We take the nr_mapped field out of the page state structure and define a new
per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
from NR_MAPPED in the next patch).

We replace the use of nr_mapped in various kernel locations.  This avoids the
looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
zone reclaim).

[akpm@osdl.org: bugfix]
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

65ba55f5

[PATCH] zoned vm counters: basic ZVC (zoned vm counter) implementation · 2244b95a

由 Christoph Lameter 提交于 6月 30, 2006

Per zone counter infrastructure

The counters that we currently have for the VM are split per processor.  The
processor however has not much to do with the zone these pages belong to.  We
cannot tell f.e.  how many ZONE_DMA pages are dirty.

So we are blind to potentially inbalances in the usage of memory in various
zones.  F.e.  in a NUMA system we cannot tell how many pages are dirty on a
particular node.  If we knew then we could put measures into the VM to balance
the use of memory between different zones and different nodes in a NUMA
system.  For example it would be possible to limit the dirty pages per node so
that fast local memory is kept available even if a process is dirtying huge
amounts of pages.

Another example is zone reclaim.  We do not know how many unmapped pages exist
per zone.  So we just have to try to reclaim.  If it is not working then we
pause and try again later.  It would be better if we knew when it makes sense
to reclaim unmapped pages from a zone.  This patchset allows the determination
of the number of unmapped pages per zone.  We can remove the zone reclaim
interval with the counters introduced here.

Futhermore the ability to have various usage statistics available will allow
the development of new NUMA balancing algorithms that may be able to improve
the decision making in the scheduler of when to move a process to another node
and hopefully will also enable automatic page migration through a user space
program that can analyse the memory load distribution and then rebalance
memory use in order to increase performance.

The counter framework here implements differential counters for each processor
in struct zone.  The differential counters are consolidated when a threshold
is exceeded (like done in the current implementation for nr_pageache), when
slab reaping occurs or when a consolidation function is called.

Consolidation uses atomic operations and accumulates counters per zone in the
zone structure and also globally in the vm_stat array.  VM functions can
access the counts by simply indexing a global or zone specific array.

The arrangement of counters in an array also simplifies processing when output
has to be generated for /proc/*.

Counters can be updated by calling inc/dec_zone_page_state or
_inc/dec_zone_page_state analogous to *_page_state.  The second group of
functions can be called if it is known that interrupts are disabled.

Special optimized increment and decrement functions are provided.  These can
avoid certain checks and use increment or decrement instructions that an
architecture may provide.

We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
do DMA to all memory and therefore ZONE_NORMAL will not be populated.  This is
only currently set for IA64 SGI SN2 and currently only affects
node_page_state().  In the best case node_page_state can be reduced to
retrieving a single counter for the one zone on the node.

[akpm@osdl.org: cleanups]
[akpm@osdl.org: export vm_stat[] for filesystems]
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

2244b95a

[PATCH] zoned vm counters: create vmstat.c/.h from page_alloc.c/.h · f6ac2354

由 Christoph Lameter 提交于 6月 30, 2006

NOTE: ZVC are *not* the lightweight event counters.  ZVCs are reliable whereas
event counters do not need to be.

Zone based VM statistics are necessary to be able to determine what the state
of memory in one zone is.  In a NUMA system this can be helpful for local
reclaim and other memory optimizations that may be able to shift VM load in
order to get more balanced memory use.

It is also useful to know how the computing load affects the memory
allocations on various zones.  This patchset allows the retrieval of that data
from userspace.

The patchset introduces a framework for counters that is a cross between the
existing page_stats --which are simply global counters split per cpu-- and the
approach of deferred incremental updates implemented for nr_pagecache.

Small per cpu 8 bit counters are added to struct zone.  If the counter exceeds
certain thresholds then the counters are accumulated in an array of
atomic_long in the zone and in a global array that sums up all zone values.
The small 8 bit counters are next to the per cpu page pointers and so they
will be in high in the cpu cache when pages are allocated and freed.

Access to VM counter information for a zone and for the whole machine is then
possible by simply indexing an array (Thanks to Nick Piggin for pointing out
that approach).  The access to the total number of pages of various types does
no longer require the summing up of all per cpu counters.

Benefits of this patchset right now:

- Ability for UP and SMP configuration to determine how memory
  is balanced between the DMA, NORMAL and HIGHMEM zones.

- loops over all processors are avoided in writeback and
  reclaim paths. We can avoid caching the writeback information
  because the needed information is directly accessible.

- Special handling for nr_pagecache removed.

- zone_reclaim_interval vanishes since VM stats can now determine
  when it is worth to do local reclaim.

- Fast inline per node page state determination.

- Accurate counters in /sys/devices/system/node/node*/meminfo. Current
  counters are counting simply which processor allocated a page somewhere
  and guestimate based on that. So the counters were not useful to show
  the actual distribution of page use on a specific zone.

- The swap_prefetch patch requires per node statistics in order to
  figure out when processors of a node can prefetch. This patch provides
  some of the needed numbers.

- Detailed VM counters available in more /proc and /sys status files.

References to earlier discussions:
V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2

Performance tests with AIM7 did not show any regressions.  Seems to be a tad
faster even.  Tested on ia64/NUMA.  Builds fine on i386, SMP / UP.  Includes
fixes for s390/arm/uml arch code.

This patch:

Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.

Create vmstat.c/vmstat.h by separating the counter code and the proc
functions.

Move the vm_stat_text array before zoneinfo_show.

[akpm@osdl.org: s390 build fix]
[akpm@osdl.org: HOTPLUG_CPU build fix]
Signed-off-by: NChristoph Lameter <clameter@sgi.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

f6ac2354

[PATCH] fix ISTALLION=y · 672b2714

由 Adrian Bunk 提交于 6月 30, 2006

drivers/char/istallion.c: In function âstli_initbrdsâ:
drivers/char/istallion.c:4150: error: implicit declaration of function âstli_parsebrdâ
drivers/char/istallion.c:4150: error: âstli_brdspâ undeclared (first use in this function)
drivers/char/istallion.c:4150: error: (Each undeclared identifier is reported only once
drivers/char/istallion.c:4150: error: for each function it appears in.)
drivers/char/istallion.c:4164: error: implicit declaration of function âstli_argbrdsâ

While I was at it, I also removed the #ifdef MODULE around the initialation
code to allow it to perhaps work when built into the kernel and made a
needlessly global function static.
Signed-off-by: NAdrian Bunk <bunk@stusta.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

672b2714

[PATCH] msr.c: use register_hotcpu_notifier() · e09793bb

由 Andrew Morton 提交于 6月 30, 2006

register_cpu_notifier() cannot do anything in a module, in a
!CONFIG_HOTPLUG_CPU kernel.

Cc: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

e09793bb

[PATCH] fix platform_device_put/del mishaps · 1017f6af

由 Ingo Molnar 提交于 6月 30, 2006

This fixes drivers/char/pc8736x_gpio.c and drivers/char/scx200_gpio.c to
use the platform_device_del/put ops correctly.
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

1017f6af

[PATCH] fix drivers/video/imacfb.c compilation · 491d525f

由 Ingo Molnar 提交于 6月 30, 2006

Fix build error on x86_64.  There's nothing even remotely close to
imacmp_seg in the kernel, so I removed the whole line.
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Cc: Edgar Hucek <hostmaster@ed-soft.at>
Cc: Antonino Daplas <adaplas@pol.net>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

491d525f

30 6月, 2006 26 次提交

Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 · 501b7c77

由 Linus Torvalds 提交于 6月 29, 2006

* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  ocfs2: remove redundant NULL checks in ocfs2_direct_IO_get_blocks()
  ocfs2: clean up some osb fields
  ocfs2: fix init of uuid_net_key
  ocfs2: silence a debug print
  ocfs2: silence ENOENT during lookup of broken links
  ocfs2: Cleanup message prints
  ocfs2: silence -EEXIST from ocfs2_extent_map_insert/lookup
  [PATCH] fs/ocfs2/dlm/dlmrecovery.c: make dlm_lockres_master_requery() static
  ocfs2: warn the user on a dead timeout mismatch
  ocfs2: OCFS2_FS must depend on SYSFS
  ocfs2: Compile-time disabling of ocfs2 debugging output.
  configfs: Clear up a few extra spaces where there should be TABs.
  configfs: Release memory in configfs_example.

501b7c77

Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 · 74e651f0

由 Linus Torvalds 提交于 6月 29, 2006

* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (30 commits)
  [TIPC]: Initial activation message now includes TIPC version number
  [TIPC]: Improve response to requests for node/link information
  [TIPC]: Fixed skb_under_panic caused by tipc_link_bundle_buf
  [IrDA]: Fix the AU1000 FIR dependencies
  [IrDA]: Fix RCU lock pairing on error path
  [XFRM]: unexport xfrm_state_mtu
  [NET]: make skb_release_data() static
  [NETFILTE] ipv4: Fix typo (Bugzilla #6753)
  [IrDA]: MCS7780 usb_driver struct should be static
  [BNX2]: Turn off link during shutdown
  [BNX2]: Use dev_kfree_skb() instead of the _irq version
  [ATM]: basic sysfs support for ATM devices
  [ATM]: [suni] change suni_init to __devinit
  [ATM]: [iphase] should be __devinit not __init
  [ATM]: [idt77105] should be __devinit not __init
  [BNX2]: Add NETIF_F_TSO_ECN
  [NET]: Add ECN support for TSO
  [AF_UNIX]: Datagram getpeersec
  [NET]: Fix logical error in skb_gso_ok
  [PKT_SCHED]: PSCHED_TADD() and PSCHED_TADD2() can result,tv_usec >= 1000000
  ...

74e651f0

[TIPC]: Initial activation message now includes TIPC version number · 0702056f

由 Allan Stephens 提交于 6月 29, 2006

Signed-off-by: NAllan Stephens <allan.stephens@windriver.com>
Signed-off-by: NPer Liden <per.liden@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0702056f

[TIPC]: Improve response to requests for node/link information · ea13847b

由 Allan Stephens 提交于 6月 29, 2006

Now allocates reply space for "get links" request based on number of actual
links, not number of potential links.  Also, limits reply to "get links" and
"get nodes" requests to 32KB to match capabilities of tipc-config utility
that issued request.
Signed-off-by: NAllan Stephens <allan.stephens@windriver.com>
Signed-off-by: NPer Liden <per.liden@ericsson.com>

ea13847b

[TIPC]: Fixed skb_under_panic caused by tipc_link_bundle_buf · e49060c7

由 Allan Stephens 提交于 6月 29, 2006

Now determines tailroom of bundle buffer by directly inspection of buffer.
Previously, buffer was assumed to have a max capacity equal to the link MTU,
but the addition of link MTU negotiation means that the link MTU can increase
after the bundle buffer is allocated.
Signed-off-by: NAllan Stephens <allan.stephens@windriver.com>
Signed-off-by: NPer Liden <per.liden@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e49060c7

[IrDA]: Fix the AU1000 FIR dependencies · caf430f3

由 Adrian Bunk 提交于 6月 29, 2006

AU1000 FIR is broken, it should depend on SOC_AU1000.

Spotted by Jean-Luc Leger.
Signed-off-by: NAdrian Bunk <bunk@stusta.de>
Signed-off-by: NSamuel Ortiz <samuel@sortiz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

caf430f3

[IrDA]: Fix RCU lock pairing on error path · 1bc17311

由 Josh Triplett 提交于 6月 29, 2006

irlan_client_discovery_indication calls rcu_read_lock and rcu_read_unlock, but
returns without unlocking in an error case.  Fix that by replacing the return
with a goto so that the rcu_read_unlock always gets executed.
Signed-off-by: NJosh Triplett <josh@freedesktop.org>
Acked-by: NPaul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Samuel Ortiz samuel@sortiz.org <>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1bc17311

[XFRM]: unexport xfrm_state_mtu · 244055fd

由 Adrian Bunk 提交于 6月 29, 2006

This patch removes the unused EXPORT_SYMBOL(xfrm_state_mtu).
Signed-off-by: NAdrian Bunk <bunk@stusta.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

244055fd

[NET]: make skb_release_data() static · 5bba1712

由 Adrian Bunk 提交于 6月 29, 2006

skb_release_data() no longer has any users in other files.
Signed-off-by: NAdrian Bunk <bunk@stusta.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5bba1712

[NETFILTE] ipv4: Fix typo (Bugzilla #6753) · c22751b7

由 Matt LaPlante 提交于 6月 29, 2006

This patch fixes bugzilla #6753, a typo in the netfilter Kconfig
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c22751b7

[IrDA]: MCS7780 usb_driver struct should be static · 7263ade1

由 Adrian Bunk 提交于 6月 29, 2006

This patch makes a needlessly global struct static.
Signed-off-by: NAdrian Bunk <bunk@stusta.de>
Signed-off-by: NSamuel Ortiz <samuel@sortiz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7263ade1

[BNX2]: Turn off link during shutdown · 6c4f095e

由 Michael Chan 提交于 6月 29, 2006

Minor change in shutdown logic to effect a link down.

Update version to 1.4.43.
Signed-off-by: NMichael Chan <mchan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6c4f095e

[BNX2]: Use dev_kfree_skb() instead of the _irq version · 745720e5

由 Michael Chan 提交于 6月 29, 2006

Change all dev_kfree_skb_irq() and dev_kfree_skb_any() to
dev_kfree_skb().  These calls are never used in irq context.
Signed-off-by: NMichael Chan <mchan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

745720e5

[ATM]: basic sysfs support for ATM devices · 656d98b0

由 Roman Kagan 提交于 6月 29, 2006

Signed-off-by: NChas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

656d98b0

[ATM]: [suni] change suni_init to __devinit · d17f0865

由 Chas Williams 提交于 6月 29, 2006

Signed-off-by: NChas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d17f0865

[ATM]: [iphase] should be __devinit not __init · 249c14b5

由 Chas Williams 提交于 6月 29, 2006

Signed-off-by: NChas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

249c14b5

[ATM]: [idt77105] should be __devinit not __init · b47eb0eb

由 Chas Williams 提交于 6月 29, 2006

Signed-off-by: NChas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b47eb0eb

[BNX2]: Add NETIF_F_TSO_ECN · b11d6213

由 Michael Chan 提交于 6月 29, 2006

Add NETIF_F_TSO_ECN feature for all bnx2 hardware.
Signed-off-by: NMichael Chan <mchan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b11d6213

[NET]: Add ECN support for TSO · b0da8537

由 Michael Chan 提交于 6月 29, 2006

In the current TSO implementation, NETIF_F_TSO and ECN cannot be
turned on together in a TCP connection.  The problem is that most
hardware that supports TSO does not handle CWR correctly if it is set
in the TSO packet.  Correct handling requires CWR to be set in the
first packet only if it is set in the TSO header.

This patch adds the ability to turn on NETIF_F_TSO and ECN using
GSO if necessary to handle TSO packets with CWR set.  Hardware
that handles CWR correctly can turn on NETIF_F_TSO_ECN in the dev->
features flag.

All TSO packets with CWR set will have the SKB_GSO_TCPV4_ECN set.  If
the output device does not have the NETIF_F_TSO_ECN feature set, GSO
will split the packet up correctly with CWR only set in the first
segment.

With help from Herbert Xu <herbert@gondor.apana.org.au>.

Since ECN can always be enabled with TSO, the SOCK_NO_LARGESEND sock
flag is completely removed.
Signed-off-by: NMichael Chan <mchan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b0da8537

[AF_UNIX]: Datagram getpeersec · 877ce7c1

由 Catherine Zhang 提交于 6月 29, 2006

This patch implements an API whereby an application can determine the
label of its peer's Unix datagram sockets via the auxiliary data mechanism of
recvmsg.

Patch purpose:

This patch enables a security-aware application to retrieve the
security context of the peer of a Unix datagram socket.  The application
can then use this security context to determine the security context for
processing on behalf of the peer who sent the packet.

Patch design and implementation:

The design and implementation is very similar to the UDP case for INET
sockets.  Basically we build upon the existing Unix domain socket API for
retrieving user credentials.  Linux offers the API for obtaining user
credentials via ancillary messages (i.e., out of band/control messages
that are bundled together with a normal message).  To retrieve the security
context, the application first indicates to the kernel such desire by
setting the SO_PASSSEC option via getsockopt.  Then the application
retrieves the security context using the auxiliary data mechanism.

An example server application for Unix datagram socket should look like this:

toggle = 1;
toggle_len = sizeof(toggle);

setsockopt(sockfd, SOL_SOCKET, SO_PASSSEC, &toggle, &toggle_len);
recvmsg(sockfd, &msg_hdr, 0);
if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
    cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
    if (cmsg_hdr->cmsg_len <= CMSG_LEN(sizeof(scontext)) &&
        cmsg_hdr->cmsg_level == SOL_SOCKET &&
        cmsg_hdr->cmsg_type == SCM_SECURITY) {
        memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
    }
}

sock_setsockopt is enhanced with a new socket option SOCK_PASSSEC to allow
a server socket to receive security context of the peer.

Testing:

We have tested the patch by setting up Unix datagram client and server
applications.  We verified that the server can retrieve the security context
using the auxiliary data mechanism of recvmsg.
Signed-off-by: NCatherine Zhang <cxzhang@watson.ibm.com>
Acked-by: NAcked-by: James Morris <jmorris@namei.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

877ce7c1

[NET]: Fix logical error in skb_gso_ok · d6b4991a

由 Herbert Xu 提交于 6月 29, 2006

The test in skb_gso_ok is backwards.  Noticed by Michael Chan
<mchan@broadcom.com>.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Acked-by: NMichael Chan <mchan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d6b4991a

S
[PKT_SCHED]: PSCHED_TADD() and PSCHED_TADD2() can result,tv_usec >= 1000000 · 4ee303df
由 Shuya MAEDA 提交于 6月 28, 2006
```
Signed-off-by: NShuya MAEDA <maeda-sxb@necst.nec.co.jp>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
4ee303df

[NET]: Make illegal_highdma more anal · 3d3a8533

由 Herbert Xu 提交于 6月 27, 2006

Rather than having illegal_highdma as a macro when HIGHMEM is off, we
can turn it into an inline function that returns zero.  This will catch
callers that give it bad arguments.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3d3a8533

[TCP]: Export accept queue len of a TCP listening socket via rx_queue · 47da8ee6

由 Sridhar Samudrala 提交于 6月 27, 2006

While debugging a TCP server hang issue, we noticed that currently there is
no way for a user to get the acceptq backlog value for a TCP listen socket.

All the standard networking utilities that display socket info like netstat,
ss and /proc/net/tcp have 2 fields called rx_queue and tx_queue. These
fields do not mean much for listening sockets. This patch uses one of these
unused fields(rx_queue) to export the accept queue len for listening sockets.
Signed-off-by: NSridhar Samudrala <sri@us.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

47da8ee6

[NETLINK]: Encapsulate eff_cap usage within security framework. · c7bdb545

由 Darrel Goeddel 提交于 6月 27, 2006

This patch encapsulates the usage of eff_cap (in netlink_skb_params) within
the security framework by extending security_netlink_recv to include a required
capability parameter and converting all direct usage of eff_caps outside
of the lsm modules to use the interface. It also updates the SELinux
implementation of the security_netlink_send and security_netlink_recv
hooks to take advantage of the sid in the netlink_skb_params struct.
This also enables SELinux to perform auditing of netlink capability checks.
Please apply, for 2.6.18 if possible.
Signed-off-by: NDarrel Goeddel <dgoeddel@trustedcs.com>
Signed-off-by: NStephen Smalley <sds@tycho.nsa.gov>
Acked-by: NJames Morris <jmorris@namei.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c7bdb545

[NET]: Added GSO header verification · 576a30eb

由 Herbert Xu 提交于 6月 27, 2006

When GSO packets come from an untrusted source (e.g., a Xen guest domain),
we need to verify the header integrity before passing it to the hardware.

Since the first step in GSO is to verify the header, we can reuse that
code by adding a new bit to gso_type: SKB_GSO_DODGY. Packets with this
bit set can only be fed directly to devices with the corresponding bit
NETIF_F_GSO_ROBUST. If the device doesn't have that bit, then the skb
is fed to the GSO engine which will allow the packet to be sent to the
hardware if it passes the header check.

This patch changes the sg flag to a full features flag. The same method
can be used to implement TSO ECN support. We simply have to mark packets
with CWR set with SKB_GSO_ECN so that only hardware with a corresponding
NETIF_F_TSO_ECN can accept them. The GSO engine can either fully segment
the packet, or segment the first MTU and pass the rest to the hardware for
further segmentation.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

576a30eb

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功