提交 · c290f8358acaeffd8e0c551ddcc24d1206143376 · openeuler / raspberrypi-kernel

19 10月, 2011 1 次提交

tty: Support compat_ioctl get/set termios_locked · 8193c429

由 Thomas Meyer 提交于 10月 05, 2011

When running a Fedora 15 (x86) on an x86_64 kernel, in the boot process
plymouthd complains about those two missing ioctls:
[    2.581783] ioctl32(plymouthd:186): Unknown cmd fd(10) cmd(00005457){t:'T';sz:0} arg(ffb6a5d0) on /dev/tty1
[    2.581803] ioctl32(plymouthd:186): Unknown cmd fd(10) cmd(00005456){t:'T';sz:0} arg(ffb6a680) on /dev/tty1

both ioctl functions work on the 'struct termios' resp. 'struct termios2',
which has the same size (36 bytes resp. 44 bytes) on x86 and x86_64,
so it's just a matter of converting the pointer from userland.
Signed-off-by: NThomas Meyer <thomas@m3y3r.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: NAndrew Morton <akpm@google.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

8193c429

23 9月, 2011 1 次提交

serial: Support the EFR-register of XR1715x uarts. · 06315348

由 Søren Holm 提交于 9月 02, 2011

The EFR (Enhenced-Features-Register) is located at a different offset
than the other devices supporting UART_CAP_EFR. This change add a special
setup quick to set UPF_EXAR_EFR on the port. UPF_EXAR_EFR is then used to
the port type to PORT_XR17D15X since it is for sure a XR17D15X uart.
Signed-off-by: NSøren Holm <sgh@sgh.dk>
Acked-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

06315348

26 8月, 2011 1 次提交

TTY: define tty_wait_until_sent_from_close · a57a7bf3

由 Jiri Slaby 提交于 8月 25, 2011

We need this helper to fix system stalls. The issue is that the rest
of the system TTYs wait for us to finish waiting. This wasn't an issue
with BKL. BKL used to unlock implicitly.

This is based on the Arnd suggestion.
Signed-off-by: NJiri Slaby <jslaby@suse.cz>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

a57a7bf3

25 8月, 2011 2 次提交

atmel_serial: RS485: receiving enabled when sending data · 83cac9f3

由 Bernhard Roth 提交于 8月 24, 2011

By default the atmel_serial driver in RS485 mode disables receiving data until
all data in the send buffer has been sent. This flag allows to receive data
even whilst sending data.
Signed-off-by: NBernhard Roth <br@pwrnet.de>
Signed-off-by: NClaudio Scordino <claudio@evidence.eu.com>
Acked-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

83cac9f3

Revert "tty: serial8250: add helpers for the DesignWare 8250" · 9a234349

由 Greg Kroah-Hartman 提交于 8月 24, 2011

This reverts commit 6b1a98d1.

It causes a build error that needs to be resolved differently.

Cc: Alan Cox <alan@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jamie Iles <jamie@jamieiles.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

9a234349

24 8月, 2011 6 次提交

tty: serial8250: add helpers for the DesignWare 8250 · 6b1a98d1

由 Jamie Iles 提交于 8月 16, 2011

The Synopsys DesignWare 8250 is an 8250 that has an extra interrupt that
gets raised when writing to the LCR when busy.  To handle this we need
special serial_out, serial_in and handle_irq methods.  Add a new
function serial8250_use_designware_io() that configures a uart_port with
these accessors.

Cc: Alan Cox <alan@linux.intel.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NJamie Iles <jamie@jamieiles.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

6b1a98d1

tty: serial8250: remove UPIO_DWAPB{,32} · 4834d028

由 Jamie Iles 提交于 8月 15, 2011

Now that platforms can override the port IRQ handler and the only user
of these UPIO modes has been converted over, kill off UPIO_DWAPB and
UPIO_DWAPB32.
Acked-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NJamie Iles <jamie@jamieiles.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

4834d028

tty: serial8250: allow platforms to override irq handler · 583d28e9

由 Jamie Iles 提交于 8月 15, 2011

Some ports (e.g. Synopsys DesignWare 8250) have special requirements for
handling the interrupts.  Allow these platforms to specify their own
interrupt handler that will override the default.
serial8250_handle_irq() is provided so that platforms can extend the IRQ
handler rather than completely replacing it.
Signed-off-by: NJamie Iles <jamie@jamieiles.com>
Acked-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

583d28e9

tty: serial: allow ports to override the irq handler · a74036f5

由 Jamie Iles 提交于 8月 15, 2011

Some serial ports may have unusal requirements for interrupt handling
(e.g. the Synopsys DesignWare 8250-alike port and it's busy detect
interrupt).  Add a .handle_irq callback that can be used for platforms
to override the interrupt behaviour in a similar fashion to the
.serial_out and .serial_in callbacks.
Signed-off-by: NJamie Iles <jamie@jamieiles.com>
Acked-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

a74036f5

TTY: remove tty_locked · 906cbe13

由 Jiri Slaby 提交于 7月 14, 2011

We used it really only serial and ami_serial. The rest of the
callsites were BUG/WARN_ONs to check if BTM is held. Now that we
pruned tty_locked from both of the real users, we can get rid of
tty_lock along with __big_tty_mutex_owner.
Signed-off-by: NJiri Slaby <jslaby@suse.cz>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Cc: Alan Cox <alan@linux.intel.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

906cbe13

TTY: serial, remove tasklet for tty_wakeup · 6a3e492b

由 Jiri Slaby 提交于 7月 14, 2011

tty_wakeup can be called from any context. So there is no need to have
an extra tasklet for calling that. Hence save some space and remove
the tasklet completely.
Signed-off-by: NJiri Slaby <jslaby@suse.cz>
Cc: Alan Cox <alan@linux.intel.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

6a3e492b

18 8月, 2011 2 次提交

mm: fix __page_to_pfn for a const struct page argument · aa462abe

由 Ian Campbell 提交于 8月 17, 2011

This allows the cast in lowmem_page_address (introduced as a warning
fixup to 33dd4e0e "mm: make some struct page's const") to be
removed.

Propagate const'ness to page_to_section() as well since it is required
by __page_to_pfn.
Signed-off-by: NIan Campbell <ian.campbell@citrix.com>
Acked-by: NRik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

aa462abe

mm: make HASHED_PAGE_VIRTUAL page_address' struct page argument const. · f9918794

由 Ian Campbell 提交于 8月 17, 2011

Followup to 33dd4e0e "mm: make some struct page's const" which missed the
HASHED_PAGE_VIRTUAL case.
Signed-off-by: NIan Campbell <ian.campbell@citrix.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f9918794

16 8月, 2011 1 次提交

block: fix flush machinery for stacking drivers with differring flush flags · 4853abaa

由 Jeff Moyer 提交于 8月 15, 2011

Commit ae1b1539, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA).  The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec.  It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:

static inline struct request *__elv_next_request(struct request_queue *q)
{
        struct request *rq;

        while (1) {
-               while (!list_empty(&q->queue_head)) {
+               if (!list_empty(&q->queue_head)) {
                        rq = list_entry_rq(q->queue_head.next);
-                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-                           (rq->cmd_flags & REQ_FLUSH_SEQ))
-                               return rq;
-                       rq = blk_do_flush(q, rq);
-                       if (rq)
-                               return rq;
+                       return rq;
                }

Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:

struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
        unsigned int fflags = q->flush_flags; /* may change, cache it */
        bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
        bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
        bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
        REQ_FUA);
        unsigned skip = 0;
...
        if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                rq->cmd_flags &= ~REQ_FLUSH;
		if (!has_fua)
			rq->cmd_flags &= ~REQ_FUA;
	        return rq;
	}

So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).

Now, however, we don't get into the flush machinery at all.  Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.

The agreed upon approach is to fix the flush machinery to allow
stacking.  While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.

In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io).  Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers.  So, I didn't see a way around the additional field.

I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance.  Comments and other testers, as always, are
appreciated.

Cheers,
Jeff
Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

4853abaa

14 8月, 2011 2 次提交

PM / Domains: Fix build for CONFIG_PM_RUNTIME unset · 17f2ae7f

由 Rafael J. Wysocki 提交于 8月 14, 2011

Function genpd_queue_power_off_work() is not defined for
CONFIG_PM_RUNTIME, so pm_genpd_poweroff_unused() causes a build
error to happen in that case.  Fix the problem by making
pm_genpd_poweroff_unused() depend on CONFIG_PM_RUNTIME too.
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>

17f2ae7f

mmc: remove unused "ddr" parameter in struct mmc_ios · 7fd781e8

由 Jaehoon Chung 提交于 8月 08, 2011

"mmc: dw_mmc: Fix DDR mode support" removed the last user.
Signed-off-by: NJaehoon Chung <jh80.chung@samsung.com>
Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: NChris Ball <cjb@laptop.org>

7fd781e8

12 8月, 2011 1 次提交

move RLIMIT_NPROC check from set_user() to do_execve_common() · 72fa5997

由 Vasiliy Kulikov 提交于 8月 08, 2011

The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
check in set_user() to check for NPROC exceeding via setuid() and
similar functions.

Before the check there was a possibility to greatly exceed the allowed
number of processes by an unprivileged user if the program relied on
rlimit only.  But the check created new security threat: many poorly
written programs simply don't check setuid() return code and believe it
cannot fail if executed with root privileges.  So, the check is removed
in this patch because of too often privilege escalations related to
buggy programs.

The NPROC can still be enforced in the common code flow of daemons
spawning user processes.  Most of daemons do fork()+setuid()+execve().
The check introduced in execve() (1) enforces the same limit as in
setuid() and (2) doesn't create similar security issues.

Neil Brown suggested to track what specific process has exceeded the
limit by setting PF_NPROC_EXCEEDED process flag.  With the change only
this process would fail on execve(), and other processes' execve()
behaviour is not changed.

Solar Designer suggested to re-check whether NPROC limit is still
exceeded at the moment of execve().  If the process was sleeping for
days between set*uid() and execve(), and the NPROC counter step down
under the limit, the defered execve() failure because NPROC limit was
exceeded days ago would be unexpected.  If the limit is not exceeded
anymore, we clear the flag on successful calls to execve() and fork().

The flag is also cleared on successful calls to set_user() as the limit
was exceeded for the previous user, not the current one.

Similar check was introduced in -ow patches (without the process flag).

v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
Reviewed-by: NJames Morris <jmorris@namei.org>
Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
Acked-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

72fa5997

11 8月, 2011 2 次提交

blktrace: add FLUSH/FUA support · c09c47ca

由 Namhyung Kim 提交于 8月 11, 2011

Add FLUSH/FUA support to blktrace. As FLUSH precedes WRITE and/or
FUA follows WRITE, use the same 'F' flag for both cases and
distinguish them by their (relative) position. The end results
look like (other flags might be shown also):

 - WRITE:            W
 - WRITE_FLUSH:      FW
 - WRITE_FUA:        WF
 - WRITE_FLUSH_FUA:  FWF

Note that we reuse TC_BARRIER due to lack of bit space of act_mask
so that the older versions of blktrace tools will report flush
requests as barriers from now on.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

c09c47ca

Move some REQ flags to the common bio/request area · 8e4bf844

由 Matthew Wilcox 提交于 8月 11, 2011

REQ_SECURE, REQ_FLUSH and REQ_FUA may all be set on a bio as well as
on a request, so relocate them to the shared part of the enum.
Signed-off-by: NMatthew Wilcox <willy@linux.intel.com>
Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

8e4bf844

10 8月, 2011 1 次提交

dt: add empty of_get_property for non-dt · 89272b8c

由 Stephen Warren 提交于 8月 05, 2011

The patch adds empty function of_get_property for non-dt build, so that
drivers migrating to dt can save some '#ifdef CONFIG_OF'.

This also fixes the current Tegra compile problem in linux-next.
Signed-off-by: NStephen Warren <swarren@nvidia.com>
Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>

89272b8c

09 8月, 2011 3 次提交

mm: Fix fixup_user_fault() for MMU=n · 5c723ba5

由 Peter Zijlstra 提交于 7月 27, 2011

In commit 2efaca92 ("mm/futex: fix futex writes on archs with SW
tracking of dirty & young") we forgot about MMU=n.  This patch fixes
that.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: NDavid Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/1311761831.24752.413.camel@twinsSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5c723ba5

cred: use 'const' in get_current_{user,groups} · 638a8439

由 Linus Torvalds 提交于 8月 08, 2011

Avoid annoying warnings from these functions ("discards qualifiers")
because they assign 'current_cred()' to a non-const pointer.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

638a8439

CRED: Restore const to current_cred() · 27e4e436

由 David Howells 提交于 8月 08, 2011

Commit 32955148 ("fix rcu annotations noise in cred.h") accidentally
dropped the const of current->cred inside current_cred() by the
insertion of a cast to deal with an RCU annotation loss warning from
sparce.

Use an appropriate RCU wrapper instead so as not to lose the const.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

27e4e436

08 8月, 2011 2 次提交

net: Make userland include of netlink.h more sane. · 6602a4ba

由 David S. Miller 提交于 8月 07, 2011

Currently userland will barf when including linux/netlink.h unless it
precisely includes sys/socket.h first.  The issue is where the
definition of "sa_family_t" comes from.

We've been back and forth on how to fix this issue in the past, see:

http://thread.gmane.org/gmane.linux.debian.devel.bugs.general/622621
http://thread.gmane.org/gmane.linux.network/143380

Ben Hutchings suggested we take a hint from how we handle the
sockaddr_storage type.  First we define a "__kernel_sa_family_t"
to linux/socket.h that is always defined.

Then if __KERNEL__ is defined, we also define "sa_family_t" as
equal to "__kernel_sa_family_t".

Then in places like linux/netlink.h we use __kernel_sa_family_t
in user visible datastructures.
Reported-by: NMichel Machado <michel@digirati.com.br>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6602a4ba

fix rcu annotations noise in cred.h · 32955148

由 Al Viro 提交于 8月 07, 2011

task->cred is declared as __rcu, and access to other tasks' ->cred is,
indeed, protected. Access to current->cred does not need rcu_dereference()
at all, since only the task itself can change its ->cred. sparse, of
course, has no way of knowing that...

Add force-cast in current_cred(), make current_fsuid() et.al. use it.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

32955148

07 8月, 2011 5 次提交

vfs: optimize inode cache access patterns · 3ddcd056

由 Linus Torvalds 提交于 8月 06, 2011

The inode structure layout is largely random, and some of the vfs paths
really do care.  The path lookup in particular is already quite D$
intensive, and profiles show that accessing the 'inode->i_op->xyz'
fields is quite costly.

We already optimized the dcache to not unnecessarily load the d_op
structure for members that are often NULL using the DCACHE_OP_xyz bits
in dentry->d_flags, and this does something very similar for the inode
ops that are used during pathname lookup.

It also re-orders the fields so that the fields accessed by 'stat' are
together at the beginning of the inode structure, and roughly in the
order accessed.

The effect of this seems to be in the 1-2% range for an empty kernel
"make -j" run (which is fairly kernel-intensive, mostly in filename
lookup), so it's visible.  The numbers are fairly noisy, though, and
likely depend a lot on exact microarchitecture.  So there's more tuning
to be done.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3ddcd056

vfs: renumber DCACHE_xyz flags, remove some stale ones · 830c0f0e

由 Linus Torvalds 提交于 8月 06, 2011

Gcc tends to generate better code with small integers, including the
DCACHE_xyz flag tests - so move the common ones to be first in the list.
Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
DCACHE_AUTOFS_PENDING values, their users no longer exists in the source
tree.

And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
common case to be a nice straight-line fall-through.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

830c0f0e

net: Compute protocol sequence numbers and fragment IDs using MD5. · 6e5714ea

由 David S. Miller 提交于 8月 03, 2011

Computers have become a lot faster since we compromised on the
partial MD4 hash which we use currently for performance reasons.

MD5 is a much safer choice, and is inline with both RFC1948 and
other ISS generators (OpenBSD, Solaris, etc.)

Furthermore, only having 24-bits of the sequence number be truly
unpredictable is a very serious limitation.  So the periodic
regeneration and 8-bit counter have been removed.  We compute and
use a full 32-bit sequence number.

For ipv6, DCCP was found to use a 32-bit truncated initial sequence
number (it needs 43-bits) and that is fixed here as well.
Reported-by: NDan Kaminsky <dan@doxpara.com>
Tested-by: NWilly Tarreau <w@1wt.eu>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6e5714ea

crypto: Move md5_transform to lib/md5.c · bc0b96b5

由 David S. Miller 提交于 8月 03, 2011

We are going to use this for TCP/IP sequence number and fragment ID
generation.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bc0b96b5

lib/sha1: use the git implementation of SHA-1 · 1eb19a12

由 Mandeep Singh Baines 提交于 8月 05, 2011

For ChromiumOS, we use SHA-1 to verify the integrity of the root
filesystem.  The speed of the kernel sha-1 implementation has a major
impact on our boot performance.

To improve boot performance, we investigated using the heavily optimized
sha-1 implementation used in git.  With the git sha-1 implementation, we
see a 11.7% improvement in boot time.

10 reboots, remove slowest/fastest.

Before:

  Mean: 6.58 seconds Stdev: 0.14

After (with git sha-1, this patch):

  Mean: 5.89 seconds Stdev: 0.07

The other cool thing about the git SHA-1 implementation is that it only
needs 64 bytes of stack for the workspace while the original kernel
implementation needed 320 bytes.
Signed-off-by: NMandeep Singh Baines <msb@chromium.org>
Cc: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Cc: Nicolas Pitre <nico@cam.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-crypto@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1eb19a12

06 8月, 2011 1 次提交

Add KEY_MICMUTE and enable it on Lenovo X220 · 33009557

由 Andy Lutomirski 提交于 5月 24, 2011

I suspect that this works on T410.
Signed-off-by: NAndy Lutomirski <luto@mit.edu>
Signed-off-by: NMatthew Garrett <mjg@redhat.com>

33009557

05 8月, 2011 1 次提交

nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4 · 655b1612

由 Boaz Harrosh 提交于 5月 29, 2011

exofs file system wants to use pnfs_osd_xdr.h file instead of
redefining pnfs-objects types in it's private "pnfs.h" headr.

Before we do the switch we must make sure pnfs_osd_xdr.h is
compilable also under NFS versions smaller than 4.1. Since now
it is needed regardless of version, by the exofs code.

nfs4_string is not the only nfs4 type out in the global scope.
Ack-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>

655b1612

04 8月, 2011 8 次提交

Revert "dt: add of_alias_scan and of_alias_get_id" · fe55c184

由 Grant Likely 提交于 8月 04, 2011

This reverts commit 750f463a.

of_alias_* still needs work to be generalized for 'promtree' dt
platforms, and to no implicitly create entries for available ids.
Signed-off-by: NGrant Likely <grant.likely@secretlab.ca>

fe55c184

tmpfs radix_tree: locate_item to speed up swapoff · e504f3fd

由 Hugh Dickins 提交于 8月 03, 2011

We have already acknowledged that swapoff of a tmpfs file is slower than
it was before conversion to the generic radix_tree: a little slower
there will be acceptable, if the hotter paths are faster.

But it was a shock to find swapoff of a 500MB file 20 times slower on my
laptop, taking 10 minutes; and at that rate it significantly slows down
my testing.

Now, most of that turned out to be overhead from PROVE_LOCKING and
PROVE_RCU: without those it was only 4 times slower than before; and
more realistic tests on other machines don't fare as badly.

I've tried a number of things to improve it, including tagging the swap
entries, then doing lookup by tag: I'd expected that to halve the time,
but in practice it's erratic, and often counter-productive.

The only change I've so far found to make a consistent improvement, is
to short-circuit the way we go back and forth, gang lookup packing
entries into the array supplied, then shmem scanning that array for the
target entry. Scanning in place doubles the speed, so it's now only
twice as slow as before (or three times slower when the PROVEs are on).

So, add radix_tree_locate_item() as an expedient, once-off,
single-caller hack to do the lookup directly in place. #ifdef it on
CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
applicability as save space in other configurations. And, sadly,
#include sched.h for cond_resched().
Signed-off-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e504f3fd

tmpfs: use kmemdup for short symlinks · 69f07ec9

由 Hugh Dickins 提交于 8月 03, 2011

But we've not yet removed the old swp_entry_t i_direct[16] from
shmem_inode_info. That's because it was still being shared with the
inline symlink. Remove it now (saving 64 or 128 bytes from shmem inode
size), and use kmemdup() for short symlinks, say, those up to 128 bytes.

I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
rather than shmem_evict_inode(), where we usually do such freeing? I
guess it doesn't matter, and I'm not into NUMA mpol testing right now.
Signed-off-by: NHugh Dickins <hughd@google.com>
Acked-by: NRik van Riel <riel@redhat.com>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

69f07ec9

tmpfs: convert mem_cgroup shmem to radix-swap · aa3b1895

由 Hugh Dickins 提交于 8月 03, 2011

Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
had to move swappage to filecache with GFP_NOWAIT.

Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
moving its call out from shmem_add_to_page_cache() to two of thats three
callers.  But leave it doing mem_cgroup_uncharge_cache_page() on error:
although asymmetrical, it's easier for all 3 callers to handle.

These two changes would also be appropriate if anyone were to start
using shmem_read_mapping_page_gfp() with GFP_NOWAIT.

Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
radix_tree_exceptional_entry() to get what it needs for itself.
Signed-off-by: NHugh Dickins <hughd@google.com>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

aa3b1895

tmpfs: miscellaneous trivial cleanups · 41ffe5d5

由 Hugh Dickins 提交于 8月 03, 2011

While it's at its least, make a number of boring nitpicky cleanups to
shmem.c, mostly for consistency of variable naming.  Things like "swap"
instead of "entry", "pgoff_t index" instead of "unsigned long idx".

And since everything else here is prefixed "shmem_", better change
init_tmpfs() to shmem_init().
Signed-off-by: NHugh Dickins <hughd@google.com>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

41ffe5d5

tmpfs: demolish old swap vector support · 285b2c4f

由 Hugh Dickins 提交于 8月 03, 2011

The maximum size of a shmem/tmpfs file has been limited by the maximum
size of its triple-indirect swap vector. With 4kB page size, maximum
filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
that on a 64-bit kernel. (With 8kB page size, maximum filesize was just
over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)

It's a shame that tmpfs should be more restrictive than ramfs, and this
limitation has now been noticed. Add another level to the swap vector?
No, it became obscure and hard to maintain, once I complicated it to
make use of highmem pages nine years ago: better choose another way.

Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
tmpfs would never have invented its own peculiar radix tree: we would
have fitted swap entries into the common radix tree instead, in much the
same way as we fit swap entries into page tables.

And why should each file have a separate radix tree for its pages and
for its swap entries? The swap entries are required precisely where and
when the pages are not. We want to put them together in a single radix
tree: which can then avoid much of the locking which was needed to
prevent them from being exchanged underneath us.

This also avoids the waste of memory devoted to swap vectors, first in
the shmem_inode itself, then at least two more pages once a file grew
beyond 16 data pages (pages accounted by df and du, but not by memcg).
Allocated upfront, to avoid allocation when under swapping pressure, but
pure waste when CONFIG_SWAP is not set - I have never spattered around
the ifdefs to prevent that, preferring this move to sharing the common
radix tree instead.

There are three downsides to sharing the radix tree. One, that it binds
tmpfs more tightly to the rest of mm, either requiring knowledge of swap
entries in radix tree there, or duplication of its code here in shmem.c.
I believe that the simplications and memory savings (and probable higher
performance, not yet measured) justify that.

Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
nodes that cannot be freed under memory pressure - whereas before it was
the less precious highmem swap vector pages that could not be freed.
I'm hoping that 64-bit has now been accessible for long enough, that the
highmem argument has grown much less persuasive.

Three, that swapoff is slower than it used to be on tmpfs files, since
it's using a simple generic mechanism not tailored to it: I find this
noticeable, and shall want to improve, but maybe nobody else will
notice.

So... now remove most of the old swap vector code from shmem.c. But,
for the moment, keep the simple i_direct vector of 16 pages, with simple
accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
to help mark where swap needs to be handled in subsequent patches.
Signed-off-by: NHugh Dickins <hughd@google.com>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

285b2c4f

mm: let swap use exceptional entries · a2c16d6c

由 Hugh Dickins 提交于 8月 03, 2011

If swap entries are to be stored along with struct page pointers in a
radix tree, they need to be distinguished as exceptional entries.

Most of the handling of swap entries in radix tree will be contained in
shmem.c, but a few functions in filemap.c's common code need to check
for their appearance: find_get_page(), find_lock_page(),
find_get_pages() and find_get_pages_contig().

So as not to slow their fast paths, tuck those checks inside the
existing checks for unlikely radix_tree_deref_slot(); except for
find_lock_page(), where it is an added test. And make it a BUG in
find_get_pages_tag(), which is not applied to tmpfs files.

A part of the reason for eliminating shmem_readpage() earlier, was to
minimize the places where common code would need to allow for swap
entries.

The swp_entry_t known to swapfile.c must be massaged into a slightly
different form when stored in the radix tree, just as it gets massaged
into a pte_t when stored in page tables.

In an i386 kernel this limits its information (type and page offset) to
30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
swapfile size of 128GB. Which is less than the 512GB we previously
allowed with X86_PAE (where the swap entry can occupy the entire upper
32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
there's not a new limitation on 64-bit (where swap filesize is already
limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is
probably still enough swap for a 64GB 32-bit machine.

Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
enforce filesize limit in read_swap_header(), just as for ptes.
Signed-off-by: NHugh Dickins <hughd@google.com>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a2c16d6c

radix_tree: exceptional entries and indices · 6328650b

由 Hugh Dickins 提交于 8月 03, 2011

A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
peculiar swap vector, instead keeping a file's swap entries in the same
radix tree as its struct page pointers: thus saving memory, and
simplifying its code and locking.

This patch:

The radix_tree is used by several subsystems for different purposes.  A
major use is to store the struct page pointers of a file's pagecache for
memory management.  But what if mm wanted to store something other than
page pointers there too?

The low bit of a radix_tree entry is already used to denote an indirect
pointer, for internal use, and the unlikely radix_tree_deref_retry()
case.

Define the next bit as denoting an exceptional entry, and supply inline
functions radix_tree_exception() to return non-0 in either unlikely
case, and radix_tree_exceptional_entry() to return non-0 in the second
case.

If a subsystem already uses radix_tree with that bit set, no problem: it
does not affect internal workings at all, but is defined for the
convenience of those storing well-aligned pointers in the radix_tree.

The radix_tree_gang_lookups have an implicit assumption that the caller
can deduce the offset of each entry returned e.g.  by the page->index of
a struct page.  But that may not be feasible for some kinds of item to
be stored there.

radix_tree_gang_lookup_slot() allow for an optional indices argument,
output array in which to return those offsets.  The same could be added
to other radix_tree_gang_lookups, but for now keep it to the only one
for which we need it.
Signed-off-by: NHugh Dickins <hughd@google.com>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6328650b