提交 · bb3ab596832b920c703d1aea1ce76d69c0f71fb7 · OpenHarmony / kernel_linux

16 12月, 2009 19 次提交

vmscan: stop kswapd waiting on congestion when the min watermark is not being met · bb3ab596

由 KOSAKI Motohiro 提交于 12月 14, 2009

If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion.
However, if the zone is below the min watermark then kswapd needs to
continue working without delay as there is a danger of an increased rate
of GFP_ATOMIC allocation failure.

This patch changes the conditions under which kswapd waits on congestion
by only going to sleep if the min watermarks are being met.

[mel@csn.ul.ie: add stats to track how relevant the logic is]
[mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Reviewed-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bb3ab596

vmscan: have kswapd sleep for a short interval and double check it should be asleep · f50de2d3

由 Mel Gorman 提交于 12月 14, 2009

After kswapd balances all zones in a pgdat, it goes to sleep.  In the
event of no IO congestion, kswapd can go to sleep very shortly after the
high watermark was reached.  If there are a constant stream of allocations
from parallel processes, it can mean that kswapd went to sleep too quickly
and the high watermark is not being maintained for sufficient length time.

This patch makes kswapd go to sleep as a two-stage process.  It first
tries to sleep for HZ/10.  If it is woken up by another process or the
high watermark is no longer met, it's considered a premature sleep and
kswapd continues work.  Otherwise it goes fully to sleep.

This adds more counters to distinguish between fast and slow breaches of
watermarks.  A "fast" premature sleep is one where the low watermark was
hit in a very short time after kswapd going to sleep.  A "slow" premature
sleep indicates that the high watermark was breached after a very short
interval.
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Cc: Frans Pop <elendil@planet.nl>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f50de2d3

swap: rework map_swap_page() again · d4906e1a

由 Lee Schermerhorn 提交于 12月 14, 2009

Seems that page_io.c doesn't really need to know that page_private(page)
is the swp_entry 'val'.  Rework map_swap_page() to do what its name says
and map a page to a page offset in the swap space.

The only other caller of map_swap_page() is internal to mm/swapfile.c and
it does want to map a swap entry to the 'sector'.  So rename
map_swap_page() to map_swap_entry(), make it 'static' and and implement
map_swap_page() as a wrapper around that.
Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d4906e1a

swap_info: reorder its fields · 7509765a

由 Hugh Dickins 提交于 12月 14, 2009

Reorder (and comment) the fields of swap_info_struct, to make better
use of its cachelines: it's good for swap_duplicate() in particular
if unsigned int max and swap_map are near the start.
Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7509765a

swap_info: note SWAP_MAP_SHMEM · aaa46865

由 Hugh Dickins 提交于 12月 14, 2009

While we're fiddling with the swap_map values, let's assign a particular
value to shmem/tmpfs swap pages: their swap counts are never incremented,
and it helps swapoff's try_to_unuse() a little if it can immediately
distinguish those pages from process pages.

Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED,
we might as well use that 0xbf value for SWAP_MAP_SHMEM.
Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

aaa46865

swap_info: swap count continuations · 570a335b

由 Hugh Dickins 提交于 12月 14, 2009

Swap is duplicated (reference count incremented by one) whenever the same
swap page is inserted into another mm (when forking finds a swap entry in
place of a pte, or when reclaim unmaps a pte to insert the swap entry).

swap_info_struct's vmalloc'ed swap_map is the array of these reference
counts: but what happens when the unsigned short (or unsigned char since
the preceding patch) is full? (and its high bit is kept for a cache flag)

We then lose track of it, never freeing, leaving it in use until swapoff:
at which point we _hope_ that a single pass will have found all instances,
assume there are no more, and will lose user data if we're wrong.

Swapping of KSM pages has not yet been enabled; but it is implemented,
and makes it very easy for a user to overflow the maximum swap count:
possible with ordinary process pages, but unlikely, even when pid_max
has been raised from PID_MAX_DEFAULT.

This patch implements swap count continuations: when the count overflows,
a continuation page is allocated and linked to the original vmalloc'ed
map page, and this used to hold the continuation counts for that entry
and its neighbours. These continuation pages are seldom referenced:
the common paths all work on the original swap_map, only referring to
a continuation page when the low "digit" of a count is incremented or
decremented through SWAP_MAP_MAX.
Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

570a335b

swap_info: swap_map of chars not shorts · 8d69aaee

由 Hugh Dickins 提交于 12月 14, 2009

Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned
chars: it's still very unusual to reach a swap count of 126, and the
next patch allows it to be extended indefinitely.
Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8d69aaee

swap_info: SWAP_HAS_CACHE cleanups · 253d553b

由 Hugh Dickins 提交于 12月 14, 2009

Though swap_count() is useful, I'm finding that swap_has_cache() and
encode_swapmap() obscure what happens in the swap_map entry, just at
those points where I need to understand it.  Remove them, and pass
more usable "usage" values to scan_swap_map(), swap_entry_free() and
__swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.
Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

253d553b

swap_info: include first_swap_extent · 9625a5f2

由 Hugh Dickins 提交于 12月 14, 2009

Make better use of the space by folding first swap_extent into its
swap_info_struct, instead of just the list_head: swap partitions need
only that one, and for others it's used as a circular list anyway.

[jirislaby@gmail.com: fix crash on double swapon]
Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: NJiri Slaby <jirislaby@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9625a5f2

swap_info: change to array of pointers · efa90a98

由 Hugh Dickins 提交于 12月 14, 2009

The swap_info_struct is only 76 or 104 bytes, but it does seem wrong
to reserve an array of about 30 of them in bss, when most people will
want only one.  Change swap_info[] to an array of pointers.

That does need a "type" field in the structure: pack it as a char with
next type and short prio (aha, char is unsigned by default on PowerPC).
Use the (admittedly peculiar) name "type" throughout for this index.

/proc/swaps does not take swap_lock: I wouldn't want it to, but do take
care with barriers when adding a new item to the array (never removed).
Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

efa90a98

swap_info: private to swapfile.c · f29ad6a9

由 Hugh Dickins 提交于 12月 14, 2009

The swap_info_struct is mostly private to mm/swapfile.c, with only
one other in-tree user: get_swap_bio().  Adjust its interface to
map_swap_page(), so that we can then remove get_swap_info_struct().

But there is a popular user out-of-tree, TuxOnIce: so leave the
declaration of swap_info_struct in linux/swap.h.
Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Nigel Cunningham <ncunningham@crca.org.au>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f29ad6a9

mm: add gfp flags for NODEMASK_ALLOC slab allocations · bad44b5b

由 David Rientjes 提交于 12月 14, 2009

Objects passed to NODEMASK_ALLOC() are relatively small in size and are
backed by slab caches that are not of large order, traditionally never
greater than PAGE_ALLOC_COSTLY_ORDER.

Thus, using GFP_KERNEL for these allocations on large machines when
CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
the allocation attempt, each time invoking both direct reclaim or the oom
killer.

This is of particular interest when using NODEMASK_ALLOC() from a
mempolicy context (either directly in mm/mempolicy.c or the mempolicy
constrained hugetlb allocations) since the oom killer always kills current
when allocations are constrained by mempolicies.  So for all present use
cases in the kernel, current would end up being oom killed when direct
reclaim fails.  That would allow the NODEMASK_ALLOC() to succeed but
current would have sacrificed itself upon returning.

This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on
CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations.
All current use cases either directly from hugetlb code or indirectly via
NODEMASK_SCRATCH() union __GFP_NORETRY to avoid direct reclaim and the oom
killer when the slab allocator needs to allocate additional pages.

The side-effect of this change is that all current use cases of either
NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
when the allocation fails (never for CONFIG_NODES_SHIFT <= 8).  All
current use cases were audited and do have appropriate error handling at
this time.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bad44b5b

hugetlb: offload per node attribute registrations · 39da08cb

由 Lee Schermerhorn 提交于 12月 14, 2009

Offload the registration and unregistration of per node hstate sysfs
attributes to a worker thread rather than attempt the
allocation/attachment or detachment/freeing of the attributes in the
context of the memory hotplug handler.

I don't know that this is absolutely required, but the registration can
sleep in allocations and other mem hot plug handlers do it this way.  If
it turns out this is NOT required, we can drop this patch.

N.B.,  Only tested build, boot, libhugetlbfs regression.
       i.e., no memory hotplug testing.
Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: NAndi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

39da08cb

mm: clear node in N_HIGH_MEMORY and stop kswapd when all memory is offlined · 8fe23e05

由 David Rientjes 提交于 12月 14, 2009

When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
there are no present pages left.

In such a situation, kswapd must also be stopped since it has nothing left
to do.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8fe23e05

hugetlb: add per node hstate attributes · 9a305230

由 Lee Schermerhorn 提交于 12月 14, 2009

Add the per huge page size control/query attributes to the per node
sysdevs:

/sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
	nr_hugepages       - r/w
	free_huge_pages    - r/o
	surplus_huge_pages - r/o

The patch attempts to re-use/share as much of the existing global hstate
attribute initialization and handling, and the "nodes_allowed" constraint
processing as possible.

Calling set_max_huge_pages() with no node indicates a change to global
hstate parameters.  In this case, any non-default task mempolicy will be
used to generate the nodes_allowed mask.  A valid node id indicates an
update to that node's hstate parameters, and the count argument specifies
the target count for the specified node.  From this info, we compute the
target global count for the hstate and construct a nodes_allowed node mask
contain only the specified node.

Setting the node specific nr_hugepages via the per node attribute
effectively ignores any task mempolicy or cpuset constraints.

With this patch:

(me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
./  ../  free_hugepages  nr_hugepages  surplus_hugepages

Starting from:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     0
Node 2 HugePages_Free:      0
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 0

Allocate 16 persistent huge pages on node 2:
(me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

[Note that this is equivalent to:
	numactl -m 2 hugeadmin --pool-pages-min 2M:+16
]

Yields:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:    16
Node 2 HugePages_Free:     16
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 16

Global controls work as expected--reduce pool to 8 persistent huge pages:
(me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     8
Node 2 HugePages_Free:      8
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Reviewed-by: NAndi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9a305230

hugetlb: add generic definition of NUMA_NO_NODE · 4e25b257

由 Lee Schermerhorn 提交于 12月 14, 2009

Move definition of NUMA_NO_NODE from ia64 and x86_64 arch specific headers
to generic header 'linux/numa.h' for use in generic code.  NUMA_NO_NODE
replaces bare '-1' where it's used in this series to indicate "no node id
specified".  Ultimately, it can be used to replace the -1 elsewhere where
it is used similarly.
Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Reviewed-by: NAndi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4e25b257

hugetlb: derive huge pages nodes allowed from task mempolicy · 06808b08

由 Lee Schermerhorn 提交于 12月 14, 2009

This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy".  The nodes_allowed mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
  is produced.  This will cause the hugetlb subsystem to use
  node_online_map as the "nodes_allowed".  This preserves the
  behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
  a nodemask with the single preferred node will be produced.
  "local" policy will NOT track any internode migrations of the
  task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
  will be used.
* Other than to inform the construction of the nodes_allowed node
  mask, the actual mempolicy mode is ignored.  That is, all modes
  behave like interleave over the resulting nodes_allowed mask
  with no "fallback".

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

	Node 0 HugePages_Total:     0
	Node 1 HugePages_Total:     0
	Node 2 HugePages_Total:     0
	Node 3 HugePages_Total:     0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

	sysctl vm.nr_hugepages[_mempolicy]=32

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:     8
	Node 3 HugePages_Total:     8

Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes.  So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:

	numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

This yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

	numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

yields:

	Node 0 HugePages_Total:     4
	Node 1 HugePages_Total:     4
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The 8 huge pages freed were balanced over nodes 0 and 1.

[rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Reviewed-by: NAndi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

06808b08

hugetlb: factor init_nodemask_of_node() · c1e6c8d0

由 Lee Schermerhorn 提交于 12月 14, 2009

Factor init_nodemask_of_node() out of the nodemask_of_node() macro.

This will be used to populate the huge pages "nodes_allowed" nodemask for
a single node when basing nodes_allowed on a preferred/local mempolicy or
when a persistent huge page pool page count is modified via a per node
sysfs attribute.
Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Reviewed-by: NAndi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c1e6c8d0

nodemask: make NODEMASK_ALLOC more general · 4e7b8a6c

由 David Rientjes 提交于 12月 14, 2009

This is a series of patches to provide control over the location of the
allocation and freeing of persistent huge pages on a NUMA platform.
Please consider for merging into mmotm.

This series uses two mechanisms to constrain the nodes from which
persistent huge pages are allocated: 1) the task NUMA mempolicy of the
task modifying a new sysctl "nr_hugepages_mempolicy", based on a
suggestion by Mel Gorman; and 2) a subset of the hugepages hstate sysfs
attributes have been added [in V4] to each node system device under:

	/sys/devices/node/node[0-9]*/hugepages

The per node attibutes allow direct assignment of a huge page count on a
specific node, regardless of the task's mempolicy or cpuset constraints.

This patch:

NODEMASK_ALLOC(x, m) assumes x is a type of struct, which is unnecessary.
It's perfectly reasonable to use this macro to allocate a nodemask_t,
which is anonymous, either dynamically or on the stack depending on
NODES_SHIFT.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4e7b8a6c

15 12月, 2009 7 次提交

i2c: Get rid of I2C_CLIENT_MODULE_PARM · 7f508118

由 Jean Delvare 提交于 12月 14, 2009

There is no user left of I2C_CLIENT_MODULE_PARM, so we can finally
get rid of this ugly macro.
Signed-off-by: NJean Delvare <khali@linux-fr.org>
Tested-by: NWolfram Sang <w.sang@pengutronix.de>

7f508118

i2c: Drop I2C_CLIENT_INSMOD_2 to 8 · e5e9f44c

由 Jean Delvare 提交于 12月 14, 2009

These macros simply declare an enum, so drivers might as well declare
it themselves. This puts an end to the arbitrary limit of 8 chip types
per i2c driver.
Signed-off-by: NJean Delvare <khali@linux-fr.org>
Tested-by: NWolfram Sang <w.sang@pengutronix.de>

e5e9f44c

i2c: Drop I2C_CLIENT_INSMOD_1 · 1f86df49

由 Jean Delvare 提交于 12月 14, 2009

This macro simply declares an enum, so drivers might as well declare
it themselves.
Signed-off-by: NJean Delvare <khali@linux-fr.org>
Tested-by: NWolfram Sang <w.sang@pengutronix.de>

1f86df49

i2c: Get rid of struct i2c_client_address_data · c3813d6a

由 Jean Delvare 提交于 12月 14, 2009

Struct i2c_client_address_data only contains one field at this point,
which makes its usefulness questionable. Get rid of it and pass simple
address lists around instead.
Signed-off-by: NJean Delvare <khali@linux-fr.org>
Tested-by: NWolfram Sang <w.sang@pengutronix.de>

c3813d6a

i2c: Drop the kind parameter from detect callbacks · 310ec792

由 Jean Delvare 提交于 12月 14, 2009

The "kind" parameter always has value -1, and nobody is using it any
longer, so we can remove it.
Signed-off-by: NJean Delvare <khali@linux-fr.org>
Tested-by: NWolfram Sang <w.sang@pengutronix.de>

310ec792

PCI: Global variable decls must match the defs in section attributes · 491424c0

由 David Howells 提交于 12月 14, 2009

Global variable declarations must match the definitions in section attributes
as the compiler is at liberty to vary the method it uses to access a variable,
depending on the section it is in.

When building the FRV arch, I now see:

drivers/built-in.o: In function `pci_apply_final_quirks':
drivers/pci/quirks.c:2606: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o
drivers/pci/quirks.c:2623: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o
drivers/pci/quirks.c:2630: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o

because the declaration of pci_dfl_cache_line_size in linux/pci.h does not
match the definition in drivers/pci/pci.c.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

491424c0

FRV: Fix no-hardware-breakpoint case · 5185fb06

由 David Howells 提交于 12月 14, 2009

If there is no hardware breakpoint support, modify_user_hw_breakpoint()
tries to return a NULL pointer through as an 'int' return value:

  In file included from kernel/exit.c:53:
  include/linux/hw_breakpoint.h: In function 'modify_user_hw_breakpoint':
  include/linux/hw_breakpoint.h:96: warning: return makes integer from pointer without a cast

Return 0 instead.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5185fb06

14 12月, 2009 14 次提交

N
md: remove sparse warning:symbol XXX was not declared. · 7820f9e1
由 NeilBrown 提交于 12月 14, 2009
```
Signed-off-by: NNeilBrown <neilb@suse.de>
```
7820f9e1

mfd: Add twl6030 regulator subdevices · 9da66539

由 Rajendra Nayak 提交于 12月 13, 2009

This patch adds initial support for creating twl6030 PMIC
specific voltage regulators in the twl mfd driver.

Board specific regulator configurations will have to be passed from
respective board files.
Signed-off-by: NRajendra Nayak <rnayak@ti.com>
Signed-off-by: NBalaji T K <balajitk@ti.com>
Acked-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Reviewed-by: NTony Lindgren <tony@atomide.com>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

9da66539

regulator: Add support for twl6030 regulators · 441a4505

由 Rajendra Nayak 提交于 12月 13, 2009

This patch updates the regulator driver to add support
for TWL6030 PMIC specific LDO regulators.
SMPS resources are not yet supported for TWL6030 and
also .set_mode and .get_status for LDO's are yet to
be implemented for TWL6030.
Signed-off-by: NRajendra Nayak <rnayak@ti.com>
Signed-off-by: NBalaji T K <balajitk@ti.com>
Acked-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Reviewed-by: NTony Lindgren <tony@atomide.com>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

441a4505

mfd: Add support for twl6030 irq framework · e8deb28c

由 Balaji T K 提交于 12月 14, 2009

This patch adds support for phoenix interrupt framework. New iInterrupt
status register A, B, C are introduced in Phoenix and are cleared on write.
Due to the differences in interrupt handling with respect to TWL4030,
twl6030-irq.c is created for TWL6030 PMIC
Signed-off-by: NRajendra Nayak <rnayak@ti.com>
Signed-off-by: NBalaji T K <balajitk@ti.com>
Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
Reviewed-by: NTony Lindgren <tony@atomide.com>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

e8deb28c

mfd: Rename all twl4030_i2c* · fc7b92fc

由 Balaji T K 提交于 12月 13, 2009

This patch renames function names like twl4030_i2c_write_u8,
twl4030_i2c_read_u8 to twl_i2c_write_u8, twl_i2c_read_u8
and also common variable in twl-core.c
Signed-off-by: NRajendra Nayak <rnayak@ti.com>
Signed-off-by: NBalaji T K <balajitk@ti.com>
Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
Acked-by: NKevin Hilman <khilman@deeprootsystems.com>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

fc7b92fc

mfd: Rename twl4030* driver files to enable re-use · b07682b6

由 Santosh Shilimkar 提交于 12月 13, 2009

The upcoming TWL6030 is companion chip for OMAP4 like the current TWL4030
for OMAP3. The common modules like RTC, Regulator creates opportunity
to re-use the most of the code from twl4030.

This patch renames few common drivers twl4030* files to twl* to enable
the code re-use.
Signed-off-by: NRajendra Nayak <rnayak@ti.com>
Signed-off-by: NBalaji T K <balajitk@ti.com>
Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
Acked-by: NKevin Hilman <khilman@deeprootsystems.com>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

b07682b6

mfd: Add all twl4030 regulators to the twl4030 mfd driver · ab4abe05

由 Juha Keski-Saari 提交于 12月 11, 2009

Add all twl4030 regulators to the twl4030 mfd driver and
twl4030_platform_data
Signed-off-by: NJuha Keski-Saari <ext-juha.1.keski-saari@nokia.com>
Acked-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

ab4abe05

mfd: Remove ezx-pcap defines for custom led gpio encoding · ed89a755

由 Antonio Ospite 提交于 11月 29, 2009

We used these, in a first version of leds-pcap driver, in order to encode gpio
enabling and gpio inversion for a led inside the variable used for the gpio
number. In the new leds-pcap driver we rely on gpio_is_valid() to derive if a
led is gpio enabled and we have a dedicated flag to tell if the gpio value has
to be inverted.
Signed-off-by: NAntonio Ospite <ospite@studenti.unina.it>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

ed89a755

mfd: Near complete mc13783 rewrite · 9e272677

由 Uwe Kleine-König 提交于 11月 30, 2009

This fixes several things while still providing the old API:

 - simplify and fix locking
 - better error handling
 - don't ack all irqs making it impossible to detect a reset of the
   rtc
 - use a timeout variant to wait for completion of ADC conversion
 - provide platform-data to regulator subdevice (This allows making
   struct mc13783 opaque for other drivers after the regulator driver is
   updated to use its platform_data.)
 - expose all interrupts
 - use threaded irq

After all users in mainline are converted to the new API, some things
(e.g. mc13783-private.h) can go away.
Signed-off-by: NUwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

9e272677

mfd: Move WM831x to generic IRQ · 5fb4d38b

由 Mark Brown 提交于 11月 11, 2009

Replace the wm831x-local IRQ infrastructure with genirq, allowing access
to the diagnostic infrastructure of genirq and allowing us to implement
interrupt support for the GPIOs.  The switchover is done within the
wm831x specific IRQ API, further patches will convert the individual
drivers to use genirq directly.
Signed-off-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

5fb4d38b

mfd: Initial support for twl5031 · 1920a61e

由 Ilkka Koskinen 提交于 11月 10, 2009

TWL5031 introduces two new interrupts in PIH. Moreover, BCI
has changed remarkably and, thus, it's disabled when TWL5031
is in use.
Signed-off-by: NIlkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

1920a61e

mfd: Convert wm8350 IRQ handlers to irq_handler_t · 5a65edbc

由 Mark Brown 提交于 11月 04, 2009

This is done as simple code transformation, the semantics of the
IRQ API provided by the core are are still very different to those
of genirq (mainly with regard to masking).
Signed-off-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

5a65edbc

mfd: Allow configuration of VDCDC2 for tps65010 · b45440c3

由 Ben Dooks 提交于 11月 02, 2009

Add function to allow the configuation fo the VDCDC2 register by
external users, to allow changing of the standard and low-power
running modes.

This is needed, for example, for the Simtec IM2440D20 where we need
to use the low-power mode to shutdown the LDO/DCDC that are not needed
during suspend (saving substantial power) and the runtime use of the
low-power mode to change VCore.
Signed-off-by: NBen Dooks <ben@simtec.co.uk>
Signed-off-by: NSimtec Linux Team <linux@simtec.co.uk>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

b45440c3

mfd: Enable twl4030 32kHz oscillator low-power mode · 38a68496

由 Ilkka Koskinen 提交于 10月 22, 2009

Allows TWL's 32kHz oscillator to go in low-power mode when
main battery voltage is running low.
Signed-off-by: NIlkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>

38a68496

OpenHarmony / kernel_linux 上一次同步 3 年多

OpenHarmony / kernel_linux
上一次同步 3 年多