Commits · 92afb286d744511f51a05f8acb6c111d05737617 · openeuler / raspberrypi-kernel

20 Sep, 2011 7 commits

regmap: Allow drivers to control cache_only flag · 92afb286

Authored by Mark Brown on Sep 19, 2011

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

92afb286

regmap: Prototype regcache_sync() · 39a58439

Authored by Mark Brown on Sep 19, 2011

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

39a58439

regmap: Add the regcache_sync trace event · 59360089

Authored by Dimitris Papastamos on Sep 19, 2011

Signed-off-by: Dimitris Papastamos <dp@opensource.wolfsonmicro.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

59360089

regmap: Add the LZO cache support · 2cbbb579

Authored by Dimitris Papastamos on Sep 19, 2011

This patch adds support for LZO compression when storing the register
cache.

For a typical device whose register map would normally occupy 25kB or 50kB,
LZO compression can bring that down to ~5-7kB. There might be a performance
penalty on each individual read/write due to decompressing/compressing the
underlying cache, but it should not be noticeable. These memory benefits
depend on whether the target architecture can get rid of the memory occupied
by the original register defaults cache, which is marked as __devinitconst.
Nevertheless there will be some memory gain even if the target architecture
can't get rid of the original register map: usage should be around ~30-32kB
instead of 50kB.
Signed-off-by: Dimitris Papastamos <dp@opensource.wolfsonmicro.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

2cbbb579

regmap: Add the rbtree cache support · 28644c80

Authored by Dimitris Papastamos on Sep 19, 2011

This patch adds support for the rbtree cache compression type.

Each rbnode manages a variable length block of registers. There can be no
two nodes with overlapping blocks. Each block has a base register and a
current top register; all the other registers, if any, lie between these
two in ascending order.

The reasoning behind the construction of this rbtree is simple. In the
snd_soc_rbtree_cache_init() function, we iterate over the register defaults
provided by the regcache core. For each register value that is non-zero we
insert it in the rbtree. In order to determine in which rbnode we need
to add the register, we first check whether there is another register already
added that is adjacent to the one we are about to add. If that is the case
we append it to that rbnode block, otherwise we create a new rbnode
with a single register in its block and add it to the tree.

There are various optimizations across the implementation to speed up lookups
by caching the most recently used rbnode.
Signed-off-by: Dimitris Papastamos <dp@opensource.wolfsonmicro.com>
Tested-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

28644c80
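The block rule described in this commit message can be modelled in user-space Python. This is only a sketch under stated assumptions: a sorted list stands in for the red-black tree, and `RegBlock` and `BlockCache` are hypothetical names that do not come from the kernel source.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class RegBlock:
    """A contiguous run of registers: base, base + 1, ..., top."""
    base: int
    values: list = field(default_factory=list)

    @property
    def top(self):
        return self.base + len(self.values) - 1


class BlockCache:
    """Sorted list of non-overlapping blocks (a stand-in for the rbtree)."""

    def __init__(self, defaults):
        self.blocks = []
        for reg, val in sorted(defaults.items()):
            if val != 0:              # only non-zero defaults are inserted
                self.write(reg, val)

    def _find(self, reg):
        """Index of the last block whose base is <= reg, or -1."""
        return bisect_right([b.base for b in self.blocks], reg) - 1

    def write(self, reg, val):
        i = self._find(reg)
        if i >= 0 and self.blocks[i].base <= reg <= self.blocks[i].top:
            self.blocks[i].values[reg - self.blocks[i].base] = val
        elif i >= 0 and reg == self.blocks[i].top + 1:
            self.blocks[i].values.append(val)        # adjacent above: extend
        elif i + 1 < len(self.blocks) and reg == self.blocks[i + 1].base - 1:
            nxt = self.blocks[i + 1]                 # adjacent below: extend
            nxt.base = reg
            nxt.values.insert(0, val)
        else:
            self.blocks.insert(i + 1, RegBlock(reg, [val]))  # new one-register block

    def read(self, reg):
        i = self._find(reg)
        if i >= 0 and reg <= self.blocks[i].top:
            return self.blocks[i].values[reg - self.blocks[i].base]
        return 0                                     # not stored: assumed default 0
```

Two blocks that become adjacent after an extension are left unmerged in this sketch; they still never overlap, which is the only invariant the message states.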

regmap: Add the indexed cache support · 195af65c

Authored by Dimitris Papastamos on Sep 19, 2011

This is the simplest form of a cache available in regcache. Any
registers whose default value is 0 are ignored. If any of those
registers are modified in the future, they will be placed in the
cache on demand. The cache layout essentially uses the register
defaults provided by the regcache core directly and does not re-map
them to another representation.
Signed-off-by: Dimitris Papastamos <dp@opensource.wolfsonmicro.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

195af65c
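A rough Python model of the behaviour this message describes, assuming a dictionary keyed by register number stands in for the kernel's array of register defaults. `IndexedCache` and its methods are hypothetical names used only for illustration.

```python
class IndexedCache:
    """Sparse register cache: zero defaults are skipped, writes add entries on demand."""

    def __init__(self, defaults):
        # Keep only registers whose default value is non-zero.
        self.entries = {reg: val for reg, val in defaults.items() if val != 0}

    def read(self, reg):
        # A register that was never stored is taken to hold 0.
        return self.entries.get(reg, 0)

    def write(self, reg, val):
        # A previously ignored register enters the cache once it is written.
        self.entries[reg] = val
```

For example, `IndexedCache({0x00: 0, 0x01: 0x1f})` stores a single entry, and a later `write(0x00, 7)` adds the second one.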

regmap: Introduce caching support · 9fabe24e

Authored by Dimitris Papastamos on Sep 19, 2011

This patch introduces caching support for regmap. The regcache API
has evolved essentially out of ASoC soc-cache so most of the actual
caching types (except LZO) have been tested in the past.

The purpose of regcache is to optimize in time and space the handling
of register caches. Time optimization is achieved by not having to go
over a slow bus like I2C to read the value of a register; instead it is
cached locally in memory and can be retrieved faster. Regarding space
optimization, some of the cache types are better at packing the caches,
e.g. the rbtree and the LZO caches. Even with the time sacrificed on
packing, this still wins over doing I2C transactions.
Signed-off-by: Dimitris Papastamos <dp@opensource.wolfsonmicro.com>
Tested-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

9fabe24e
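The time saving described in this message can be illustrated with a small model. `SlowBus` and `CachedMap` are hypothetical names, and the write-through policy is an assumption made for the sketch rather than a statement about regcache.

```python
class SlowBus:
    """Stand-in for an I2C-like bus; counts how many reads reach the hardware."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.reads = 0

    def read(self, reg):
        self.reads += 1
        return self.registers[reg]

    def write(self, reg, val):
        self.registers[reg] = val


class CachedMap:
    """Write-through register cache in front of a slow bus."""

    def __init__(self, bus):
        self.bus = bus
        self.cache = {}

    def read(self, reg):
        if reg not in self.cache:        # miss: one bus transaction
            self.cache[reg] = self.bus.read(reg)
        return self.cache[reg]           # hit: served from memory

    def write(self, reg, val):
        self.bus.write(reg, val)
        self.cache[reg] = val
```

Reading the same register three times through `CachedMap` costs one bus read; the remaining two are served from memory.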

10 Sep, 2011 1 commit

regmap: Provide device read and write map interface for merging · 069af897

Authored by Mark Brown on Sep 05, 2011

Add the externally visible interface introduced by Lars-Peter's commit
6f3064 (regmap: Add support for device specific write and read flag
masks) separately in order to allow merging into other subsystems for
integration with drivers. Drivers relying on this feature will not be
functional until they are merged with the implementation.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

069af897

06 Sep, 2011 2 commits

regmap: Add support for device specific write and read flag masks. · 6f306441

Authored by Lars-Peter Clausen on Sep 05, 2011

Some buses like SPI have no standard notion of read or write operations.
The general scheme here is to set or clear specific bits in the register
address to indicate whether the operation is a read or write. We already
support having a read flag mask per bus, but as there is no standard,
the bits which need to be set or cleared differ between devices and vendors;
thus we need a mechanism to specify them per device.

This patch adds two new entries to the regmap_config struct, read_flag_mask and
write_flag_mask. These will be ORed onto the top byte when doing a read or
write operation. If both masks are empty the device will fall back to the
regmap_bus masks.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

6f306441
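The mask selection and the OR onto the top byte can be sketched as follows. The function name `frame_address`, its parameters and the big-endian address layout are assumptions for illustration; only the two mask fields and the fallback rule come from the commit message.

```python
def frame_address(reg, reg_bytes, is_read, cfg_read_mask=0, cfg_write_mask=0,
                  bus_read_mask=0, bus_write_mask=0):
    """Return the register address as big-endian bytes with the flag mask applied."""
    if cfg_read_mask or cfg_write_mask:
        read_mask, write_mask = cfg_read_mask, cfg_write_mask
    else:
        # Neither per-device mask is set: fall back to the bus-wide masks.
        read_mask, write_mask = bus_read_mask, bus_write_mask
    buf = bytearray(reg.to_bytes(reg_bytes, "big"))
    buf[0] |= read_mask if is_read else write_mask   # OR onto the top byte
    return bytes(buf)
```

With a per-device read mask of 0x80, a read of register 0x12 is framed as the byte 0x92, while a write to the same register stays 0x12.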

regmap: Remove redundant owner field from the bus type struct · 5b457e39

Authored by Mark Brown on Sep 05, 2011

No longer used as users link directly with the bus types so the core
module infrastructure does refcounting for us.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

5b457e39

22 Aug, 2011 3 commits

mfd: Convert WM8400 to regmap API · 50eeef5d

Authored by Mark Brown on Aug 09, 2011

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

50eeef5d

mfd: Convert WM8994 to use new register map API · d6c645fc

Authored by Mark Brown on Jun 17, 2011

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Acked-by: Samuel Ortiz <sameo@linux.intel.com>

d6c645fc

mfd: Convert WM831x to use regmap API · 1df5981b

Authored by Mark Brown on Jun 10, 2011

Factor out the register read/write code to use the register map API.  We
still need some wm831x specific code and locking in place to check that
the user key is handled correctly, but only on the write side; reads are
not affected by the key.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Acked-by: Samuel Ortiz <sameo@linux.intel.com>

1df5981b

21 Aug, 2011 1 commit

regmap: Allow drivers to specify register defaults · bd20eb54

Authored by Mark Brown on Aug 19, 2011

It is useful for the register cache code to be able to specify the
default values for the device registers. The major use is when restoring
the register cache after suspend: knowing the register defaults allows
us to skip registers that are at their default values when we resume, which
can be a substantial win on larger modern devices. For some devices
(mostly older ones) the hardware does not support readback so the only way we
can know the values is from code, and so initializing the cache with default
values makes it much easier for drivers to work with read/modify/write
updates.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

bd20eb54
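The resume optimisation mentioned in this message amounts to a filter over the cache. A minimal sketch, assuming the device comes back from suspend holding its default values; `registers_to_restore` is a hypothetical helper, not a regmap function.

```python
def registers_to_restore(cache, defaults):
    """Return (reg, value) pairs that must be rewritten after the device lost power.

    Assumes the hardware comes back holding `defaults`, so only cached values
    that differ from the default need a bus write.
    """
    return [(reg, val) for reg, val in sorted(cache.items())
            if defaults.get(reg) != val]
```

A register with no known default is always written back, since there is nothing to compare it against.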

09 Aug, 2011 3 commits

regmap: Remove unused type and list fields from bus interface · 790923e5

Authored by Mark Brown on Aug 09, 2011

We no longer enumerate the bus types; we rely on the driver telling us
this on init.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

790923e5

regmap: Use int rather than size_t for lengths when logging blocks · 3d9ead7c

Authored by Mark Brown on Aug 09, 2011

x86_64 warns as size_t is not an int.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>

3d9ead7c

regmap: Fix kerneldoc errors for regmap · 3566cc9d

Authored by Mark Brown on Aug 09, 2011

Field names didn't match between the documentation and the code.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

3566cc9d

08 Aug, 2011 5 commits

regmap: Add basic tracepoints · fb2736bb

Authored by Mark Brown on Jul 24, 2011

Trace single register reads and writes, plus start/stop tracepoints for
the actual I/O to see where we're spending time. This makes it easy to
have always on logging without overwhelming the logs and also lets us take
advantage of all the context and time information that the trace subsystem
collects for us.

We don't currently trace register values for bulk operations as this would
add complexity and overhead parsing the cooked data that's being worked
with.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

fb2736bb

regmap: Add precious registers to the driver interface · 18694886

Authored by Mark Brown on Aug 08, 2011

Some devices are sensitive to reads on their registers, especially for
things like clear on read interrupt status registers. Avoid creating
problems with these with things like debugfs by allowing drivers to tell
the core about them. If a register is marked as precious then the core
will not internally generate any reads of it.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

18694886
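A debugfs-style dump that respects precious registers might look like the sketch below. The function name, the callback signatures and the "XX" placeholder are assumptions for illustration and do not reproduce the kernel's debugfs output format.

```python
def dump_registers(read_reg, max_register, is_precious):
    """Dump registers 0..max_register without ever reading a precious one."""
    lines = []
    for reg in range(max_register + 1):
        if is_precious(reg):
            lines.append(f"{reg:02x}: XX")       # skipped: reading has side effects
        else:
            lines.append(f"{reg:02x}: {read_reg(reg):02x}")
    return lines
```

Because the read callback is never invoked for a precious register, a clear-on-read interrupt status register keeps its pending bits while the dump runs.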

regmap: Allow devices to specify which registers are accessible · 2e2ae66d

Authored by Mark Brown on Jul 20, 2011

This is currently unused but we need to know which registers exist and
their properties in order to implement diagnostics like register map
dumps and the cache features.

We use callbacks partly because properties can vary at runtime (eg, through
access locks on registers) and partly because big switch statements are a
good compromise between readable code and small data size for providing
information on big register maps.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

2e2ae66d

regmap: Add kerneldoc for struct regmap_config · dd898b20

Authored by Mark Brown on Jul 20, 2011

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>

dd898b20

fix rcu annotations noise in cred.h · 32955148

Authored by Al Viro on Aug 07, 2011

task->cred is declared as __rcu, and access to other tasks' ->cred is,
indeed, protected. Access to current->cred does not need rcu_dereference()
at all, since only the task itself can change its ->cred. sparse, of
course, has no way of knowing that...

Add force-cast in current_cred(), make current_fsuid() et al. use it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

32955148

07 Aug, 2011 6 commits

vfs: optimize inode cache access patterns · 3ddcd056

Authored by Linus Torvalds on Aug 06, 2011

The inode structure layout is largely random, and some of the vfs paths
really do care.  The path lookup in particular is already quite D$
intensive, and profiles show that accessing the 'inode->i_op->xyz'
fields is quite costly.

We already optimized the dcache to not unnecessarily load the d_op
structure for members that are often NULL using the DCACHE_OP_xyz bits
in dentry->d_flags, and this does something very similar for the inode
ops that are used during pathname lookup.

It also re-orders the fields so that the fields accessed by 'stat' are
together at the beginning of the inode structure, and roughly in the
order accessed.

The effect of this seems to be in the 1-2% range for an empty kernel
"make -j" run (which is fairly kernel-intensive, mostly in filename
lookup), so it's visible.  The numbers are fairly noisy, though, and
likely depend a lot on exact microarchitecture.  So there's more tuning
to be done.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3ddcd056

vfs: renumber DCACHE_xyz flags, remove some stale ones · 830c0f0e

Authored by Linus Torvalds on Aug 06, 2011

Gcc tends to generate better code with small integers, including the
DCACHE_xyz flag tests - so move the common ones to be first in the list.
Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
DCACHE_AUTOFS_PENDING values; their users no longer exist in the source
tree.

And add an "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
common case to be a nice straight-line fall-through.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

830c0f0e

exofs: Rename raid engine from exofs/ios.c => ore · 8ff660ab

Authored by Boaz Harrosh on Aug 06, 2011

ORE stands for "Objects Raid Engine"

This patch is a mechanical rename of everything that was in ios.c
and its API declaration to an ore.c and an osd_ore.h header. The ore
engine will later be used by the pnfs objects layout driver.

* File ios.c => ore.c

* Declaration of types and API are moved from exofs.h to a new
  osd_ore.h

* All used types are prefixed by ore_ from their exofs_ name.

* Shift includes from exofs.h to osd_ore.h so osd_ore.h is
  independent, include it from exofs.h.

Other than a pure rename there are no other changes. Next patch
will move the ore into its own module and will export the API
to be used by exofs and later the layout driver.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

8ff660ab

net: Compute protocol sequence numbers and fragment IDs using MD5. · 6e5714ea

Authored by David S. Miller on Aug 03, 2011

Computers have become a lot faster since we compromised on the
partial MD4 hash which we use currently for performance reasons.

MD5 is a much safer choice, and is in line with both RFC1948 and
other ISS generators (OpenBSD, Solaris, etc.)

Furthermore, only having 24-bits of the sequence number be truly
unpredictable is a very serious limitation.  So the periodic
regeneration and 8-bit counter have been removed.  We compute and
use a full 32-bit sequence number.

For ipv6, DCCP was found to use a 32-bit truncated initial sequence
number (it needs 43-bits) and that is fixed here as well.
Reported-by: Dan Kaminsky <dan@doxpara.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>

6e5714ea
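The RFC 1948 scheme this message refers to computes the initial sequence number as a clock value plus a keyed hash of the connection 4-tuple. The sketch below uses the complete `hashlib` MD5 digest over an input layout chosen for illustration; the kernel's own input layout and its use of `md5_transform` are not reproduced, and `initial_sequence_number` is a hypothetical name.

```python
import hashlib
import struct


def initial_sequence_number(saddr, daddr, sport, dport, secret, clock_ticks):
    """RFC 1948 style ISN: clock + F(connection 4-tuple, secret), modulo 2**32.

    saddr and daddr are 4-byte IPv4 addresses, secret is a bytes key known
    only to the host, clock_ticks is a steadily increasing counter.
    """
    digest = hashlib.md5(
        saddr + daddr + struct.pack("!HH", sport, dport) + secret
    ).digest()
    offset = struct.unpack("<I", digest[:4])[0]     # first 32 bits of the hash
    return (offset + clock_ticks) & 0xFFFFFFFF      # full 32-bit result
```

All 32 bits of the hash feed the result, matching the message's point that the sequence number should be unpredictable across its whole width rather than in 24 bits only.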

crypto: Move md5_transform to lib/md5.c · bc0b96b5

Authored by David S. Miller on Aug 03, 2011

We are going to use this for TCP/IP sequence number and fragment ID
generation.
Signed-off-by: David S. Miller <davem@davemloft.net>

bc0b96b5

lib/sha1: use the git implementation of SHA-1 · 1eb19a12

Authored by Mandeep Singh Baines on Aug 05, 2011

For ChromiumOS, we use SHA-1 to verify the integrity of the root
filesystem.  The speed of the kernel sha-1 implementation has a major
impact on our boot performance.

To improve boot performance, we investigated using the heavily optimized
sha-1 implementation used in git.  With the git sha-1 implementation, we
see an 11.7% improvement in boot time.

10 reboots, remove slowest/fastest.

Before:

  Mean: 6.58 seconds Stdev: 0.14

After (with git sha-1, this patch):

  Mean: 5.89 seconds Stdev: 0.07

The other cool thing about the git SHA-1 implementation is that it only
needs 64 bytes of stack for the workspace while the original kernel
implementation needed 320 bytes.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Cc: Nicolas Pitre <nico@cam.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1eb19a12
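The quoted 11.7% can be checked against the two means in this message. It matches the speed-up measured relative to the new boot time; measured against the original time, the reduction comes out at roughly 10.5%. `percent_gain` is a hypothetical helper written for this check.

```python
def percent_gain(before, after):
    """Return (speed-up relative to the new time, time saved relative to the old time)."""
    saved = before - after
    return 100 * saved / after, 100 * saved / before


speedup, reduction = percent_gain(6.58, 5.89)
print(f"{speedup:.1f}% faster, {reduction:.1f}% less boot time")
```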

06 Aug, 2011 1 commit

Add KEY_MICMUTE and enable it on Lenovo X220 · 33009557

Authored by Andy Lutomirski on May 24, 2011

I suspect that this works on T410.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Signed-off-by: Matthew Garrett <mjg@redhat.com>

33009557

05 Aug, 2011 1 commit

nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4 · 655b1612

Authored by Boaz Harrosh on May 29, 2011

The exofs file system wants to use the pnfs_osd_xdr.h file instead of
redefining pnfs-objects types in its private "pnfs.h" header.

Before we do the switch we must make sure pnfs_osd_xdr.h is
compilable also under NFS versions smaller than 4.1, since it is now
needed regardless of version by the exofs code.

nfs4_string is not the only nfs4 type out in the global scope.
Ack-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

655b1612

04 Aug, 2011 10 commits

drm: Separate EDID Header Check from EDID Block Check · 051963d4

Authored by Thomas Reim on Jul 29, 2011

Provides function drm_edid_header_is_valid() for EDID header check
and replaces EDID header check part of function drm_edid_block_valid()
by a call of drm_edid_header_is_valid().
This is a prerequisite to extend DDC probing, e.g. in function
radeon_ddc_probe() for Radeon devices, by a central EDID header check.

Tested for kernel 2.6.35, 2.6.38 and 3.0

Cc: <stable@kernel.org>
Signed-off-by: Thomas Reim <reimth@gmail.com>
Reviewed-by: Alex Deucher <alexdeucher@gmail.com>
Acked-by: Stephen Michaels <Stephen.Micheals@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

051963d4
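The split between a header check and a block check can be modelled as below. The fixed 8-byte EDID header and the one-byte checksum are part of the EDID format; the function names, the score-style return value and the strict "all 8 bytes must match" rule for a base block are assumptions of this sketch rather than a copy of the DRM code.

```python
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])


def edid_header_score(block):
    """Count how many of the first 8 bytes match the fixed EDID header (0..8)."""
    return sum(1 for got, want in zip(block[:8], EDID_HEADER) if got == want)


def edid_base_block_valid(block):
    """Header check followed by the checksum over a 128-byte base block."""
    if len(block) != 128 or edid_header_score(block) != 8:
        return False
    return sum(block) % 256 == 0      # all 128 bytes must sum to 0 modulo 256
```

Keeping the header test in its own function lets a probe routine reject a non-EDID response after reading only 8 bytes.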

Revert "dt: add of_alias_scan and of_alias_get_id" · fe55c184

Authored by Grant Likely on Aug 04, 2011

This reverts commit 750f463a.

of_alias_* still needs work to be generalized for 'promtree' dt
platforms, and to not implicitly create entries for available ids.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>

fe55c184

drm: track CEA version number if present · ebec9a7b

Authored by Jesse Barnes on Aug 03, 2011

Drivers need to know the CEA version number in addition to other display
info (like whether the display is an HDMI sink) before enabling certain
features.  So track the CEA version number in the display info
structure.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Keith Packard <keithp@keithp.com>

ebec9a7b

tmpfs radix_tree: locate_item to speed up swapoff · e504f3fd

Authored by Hugh Dickins on Aug 03, 2011

We have already acknowledged that swapoff of a tmpfs file is slower than
it was before conversion to the generic radix_tree: a little slower
there will be acceptable, if the hotter paths are faster.

But it was a shock to find swapoff of a 500MB file 20 times slower on my
laptop, taking 10 minutes; and at that rate it significantly slows down
my testing.

Now, most of that turned out to be overhead from PROVE_LOCKING and
PROVE_RCU: without those it was only 4 times slower than before; and
more realistic tests on other machines don't fare as badly.

I've tried a number of things to improve it, including tagging the swap
entries, then doing lookup by tag: I'd expected that to halve the time,
but in practice it's erratic, and often counter-productive.

The only change I've so far found to make a consistent improvement, is
to short-circuit the way we go back and forth, gang lookup packing
entries into the array supplied, then shmem scanning that array for the
target entry. Scanning in place doubles the speed, so it's now only
twice as slow as before (or three times slower when the PROVEs are on).

So, add radix_tree_locate_item() as an expedient, once-off,
single-caller hack to do the lookup directly in place. #ifdef it on
CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
applicability as save space in other configurations. And, sadly,
#include sched.h for cond_resched().
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e504f3fd

tmpfs: use kmemdup for short symlinks · 69f07ec9

Authored by Hugh Dickins on Aug 03, 2011

But we've not yet removed the old swp_entry_t i_direct[16] from
shmem_inode_info. That's because it was still being shared with the
inline symlink. Remove it now (saving 64 or 128 bytes from shmem inode
size), and use kmemdup() for short symlinks, say, those up to 128 bytes.

I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
rather than shmem_evict_inode(), where we usually do such freeing? I
guess it doesn't matter, and I'm not into NUMA mpol testing right now.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

69f07ec9

tmpfs: convert mem_cgroup shmem to radix-swap · aa3b1895

Authored by Hugh Dickins on Aug 03, 2011

Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
had to move swappage to filecache with GFP_NOWAIT.

Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
moving its call out from shmem_add_to_page_cache() to two of its three
callers.  But leave it doing mem_cgroup_uncharge_cache_page() on error:
although asymmetrical, it's easier for all 3 callers to handle.

These two changes would also be appropriate if anyone were to start
using shmem_read_mapping_page_gfp() with GFP_NOWAIT.

Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
radix_tree_exceptional_entry() to get what it needs for itself.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

aa3b1895

tmpfs: miscellaneous trivial cleanups · 41ffe5d5

Authored by Hugh Dickins on Aug 03, 2011

While it's at its least, make a number of boring nitpicky cleanups to
shmem.c, mostly for consistency of variable naming.  Things like "swap"
instead of "entry", "pgoff_t index" instead of "unsigned long idx".

And since everything else here is prefixed "shmem_", better change
init_tmpfs() to shmem_init().
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

41ffe5d5

tmpfs: demolish old swap vector support · 285b2c4f

Authored by Hugh Dickins on Aug 03, 2011

The maximum size of a shmem/tmpfs file has been limited by the maximum
size of its triple-indirect swap vector. With 4kB page size, maximum
filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
that on a 64-bit kernel. (With 8kB page size, maximum filesize was just
over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)

It's a shame that tmpfs should be more restrictive than ramfs, and this
limitation has now been noticed. Add another level to the swap vector?
No, it became obscure and hard to maintain, once I complicated it to
make use of highmem pages nine years ago: better choose another way.

Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
tmpfs would never have invented its own peculiar radix tree: we would
have fitted swap entries into the common radix tree instead, in much the
same way as we fit swap entries into page tables.

And why should each file have a separate radix tree for its pages and
for its swap entries? The swap entries are required precisely where and
when the pages are not. We want to put them together in a single radix
tree: which can then avoid much of the locking which was needed to
prevent them from being exchanged underneath us.

This also avoids the waste of memory devoted to swap vectors, first in
the shmem_inode itself, then at least two more pages once a file grew
beyond 16 data pages (pages accounted by df and du, but not by memcg).
Allocated upfront, to avoid allocation when under swapping pressure, but
pure waste when CONFIG_SWAP is not set - I have never spattered around
the ifdefs to prevent that, preferring this move to sharing the common
radix tree instead.

There are three downsides to sharing the radix tree. One, that it binds
tmpfs more tightly to the rest of mm, either requiring knowledge of swap
entries in radix tree there, or duplication of its code here in shmem.c.
I believe that the simplifications and memory savings (and probable higher
performance, not yet measured) justify that.

Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
nodes that cannot be freed under memory pressure - whereas before it was
the less precious highmem swap vector pages that could not be freed.
I'm hoping that 64-bit has now been accessible for long enough, that the
highmem argument has grown much less persuasive.

Three, that swapoff is slower than it used to be on tmpfs files, since
it's using a simple generic mechanism not tailored to it: I find this
noticeable, and shall want to improve, but maybe nobody else will
notice.

So... now remove most of the old swap vector code from shmem.c. But,
for the moment, keep the simple i_direct vector of 16 pages, with simple
accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
to help mark where swap needs to be handled in subsequent patches.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

285b2c4f
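The file size limits quoted at the start of this message can be reproduced with a short calculation. The index formula and the 32-bit MAX_LFS_FILESIZE cap used here are assumptions reconstructed to match the quoted figures, not values taken from the message, and `swap_vector_limit` is a hypothetical name.

```python
def swap_vector_limit(page_size, long_bytes):
    """Maximum file size under the old triple-indirect swap vector.

    Assumed layout: 16 direct entries plus
    (entries_per_page**2 / 2) * (entries_per_page + 1) indirect ones,
    capped by MAX_LFS_FILESIZE on 32-bit kernels.
    """
    entries_per_page = page_size // long_bytes
    max_index = 16 + (entries_per_page ** 2 // 2) * (entries_per_page + 1)
    limit = max_index * page_size
    if long_bytes == 4:
        # Assumed 32-bit MAX_LFS_FILESIZE: page_size << 31, less one byte.
        limit = min(limit, (page_size << 31) - 1)
    return limit


TB = 1 << 40
for page_size, long_bytes in [(4096, 4), (4096, 8), (8192, 8), (8192, 4)]:
    size = swap_vector_limit(page_size, long_bytes)
    print(f"{page_size // 1024}kB pages, {long_bytes * 8}-bit: {size / TB:.2f} TB")
```

Under these assumptions a 64-bit kernel gets one eighth of the 32-bit limit because each vector page holds half as many entries, and the entry count enters the formula roughly cubed.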

mm: let swap use exceptional entries · a2c16d6c

Authored by Hugh Dickins on Aug 03, 2011

If swap entries are to be stored along with struct page pointers in a
radix tree, they need to be distinguished as exceptional entries.

Most of the handling of swap entries in radix tree will be contained in
shmem.c, but a few functions in filemap.c's common code need to check
for their appearance: find_get_page(), find_lock_page(),
find_get_pages() and find_get_pages_contig().

So as not to slow their fast paths, tuck those checks inside the
existing checks for unlikely radix_tree_deref_slot(); except for
find_lock_page(), where it is an added test. And make it a BUG in
find_get_pages_tag(), which is not applied to tmpfs files.

A part of the reason for eliminating shmem_readpage() earlier, was to
minimize the places where common code would need to allow for swap
entries.

The swp_entry_t known to swapfile.c must be massaged into a slightly
different form when stored in the radix tree, just as it gets massaged
into a pte_t when stored in page tables.

In an i386 kernel this limits its information (type and page offset) to
30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
swapfile size of 128GB. Which is less than the 512GB we previously
allowed with X86_PAE (where the swap entry can occupy the entire upper
32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
there's not a new limitation on 64-bit (where swap filesize is already
limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is
probably still enough swap for a 64GB 32-bit machine.

Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
enforce filesize limit in read_swap_header(), just as for ptes.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a2c16d6c
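The conversion named at the end of this message and the 128GB figure can be sketched with plain integers. The bit values below (exceptional flag in bit 1, payload shifted up by two bits) follow the description in the patch series as I understand it and should be read as assumptions; `max_swapfile_bytes` is a hypothetical helper.

```python
RADIX_TREE_EXCEPTIONAL_ENTRY = 2   # second-lowest bit marks an exceptional entry
RADIX_TREE_EXCEPTIONAL_SHIFT = 2   # payload sits above the two tag bits


def swp_to_radix_entry(swp_val):
    """Pack a swap entry value into a radix tree slot value."""
    return (swp_val << RADIX_TREE_EXCEPTIONAL_SHIFT) | RADIX_TREE_EXCEPTIONAL_ENTRY


def radix_to_swp_entry(slot):
    """Recover the swap entry value from a radix tree slot value."""
    return slot >> RADIX_TREE_EXCEPTIONAL_SHIFT


def max_swapfile_bytes(word_bits=32, type_bits=5, page_size=4096):
    """Largest swapfile addressable once two bits of the word are reserved."""
    offset_bits = word_bits - RADIX_TREE_EXCEPTIONAL_SHIFT - type_bits
    return (1 << offset_bits) * page_size
```

On a 32-bit word, reserving two bits leaves 30; five of those select one of 32 swapfile types, so 25 bits of page offset at 4kB per page give 2^37 bytes, the 128GB quoted in the message.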

radix_tree: exceptional entries and indices · 6328650b

Authored by Hugh Dickins on Aug 03, 2011

A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
peculiar swap vector, instead keeping a file's swap entries in the same
radix tree as its struct page pointers: thus saving memory, and
simplifying its code and locking.

This patch:

The radix_tree is used by several subsystems for different purposes.  A
major use is to store the struct page pointers of a file's pagecache for
memory management.  But what if mm wanted to store something other than
page pointers there too?

The low bit of a radix_tree entry is already used to denote an indirect
pointer, for internal use, and the unlikely radix_tree_deref_retry()
case.

Define the next bit as denoting an exceptional entry, and supply inline
functions radix_tree_exception() to return non-0 in either unlikely
case, and radix_tree_exceptional_entry() to return non-0 in the second
case.

If a subsystem already uses radix_tree with that bit set, no problem: it
does not affect internal workings at all, but is defined for the
convenience of those storing well-aligned pointers in the radix_tree.

The radix_tree_gang_lookups have an implicit assumption that the caller
can deduce the offset of each entry returned e.g.  by the page->index of
a struct page.  But that may not be feasible for some kinds of item to
be stored there.

radix_tree_gang_lookup_slot() allows for an optional indices argument,
output array in which to return those offsets.  The same could be added
to other radix_tree_gang_lookups, but for now keep it to the only one
for which we need it.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6328650b
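The two tests described in this message reduce to bit masks on the slot value. A minimal model with integers in place of pointers; the helper names mirror the ones in the message, while the constant names are assumptions of this sketch.

```python
RADIX_TREE_INDIRECT_PTR = 1        # low bit: internal indirect pointer
RADIX_TREE_EXCEPTIONAL_ENTRY = 2   # next bit: exceptional entry
RADIX_TREE_EXCEPTION_MASK = RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY


def radix_tree_exception(entry):
    """Non-zero if either unlikely case applies (indirect or exceptional)."""
    return entry & RADIX_TREE_EXCEPTION_MASK


def radix_tree_exceptional_entry(entry):
    """Non-zero only for an exceptional entry."""
    return entry & RADIX_TREE_EXCEPTIONAL_ENTRY
```

A pointer aligned to at least four bytes has both low bits clear, which is why well-aligned pointers can share a tree with tagged values without ambiguity.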