Commits · e185dda89d69cde142b48059413a03561f41f78a · openeuler / Kernel

08 Jun, 2011 4 commits

writeback: avoid extra sync work at enqueue time · e185dda8

Committed by Wu Fengguang on Apr 23, 2011

This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
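
The enqueue-time cutoff described in this message can be sketched in user space as follows. This is an illustration only, not the kernel code: `struct toy_inode` and `toy_queue_io()` are hypothetical names, and plain integer comparison stands in for the kernel's wrap-safe jiffies comparison.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a dirty inode. */
struct toy_inode {
	unsigned long dirtied_when;	/* time at which the inode was dirtied */
};

/*
 * Count the inodes that would be queued for I/O when only inodes dirtied
 * at or before *older_than_this qualify.  A NULL cutoff means "no limit".
 */
size_t toy_queue_io(const struct toy_inode *inodes, size_t n,
		    const unsigned long *older_than_this)
{
	size_t queued = 0;

	for (size_t i = 0; i < n; i++) {
		/* Dirtied after the sync started: never enqueue it. */
		if (older_than_this &&
		    inodes[i].dirtied_when > *older_than_this)
			continue;
		queued++;
	}
	return queued;
}
```

With inodes dirtied at times 10, 20 and 30 and a cutoff of 20, only the first two are queued, so inodes dirtied while the sync is running cannot keep it busy forever.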

writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab

Committed by Christoph Hellwig on Apr 21, 2011

Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads.  It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.

Based on earlier patches from Nick Piggin and Dave Chinner.

It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
                      inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
                &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                ------------------------
                &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

writeback: introduce writeback_control.inodes_written · cb9bd115

Committed by Wu Fengguang on Jul 21, 2010

The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happens to be metadata-only dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but is also misinterpreted as "no progress".

So introduce writeback_control.inodes_written to count the inodes that get
cleaned from the VFS point of view.  A non-zero value means there is some
progress on writeback, in which case more writeback can be tried.
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
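
A minimal sketch of the progress test this message argues for, assuming a simplified control structure. `struct toy_wbc` and `toy_made_progress()` are hypothetical names; only the two field names come from the message.

```c
#include <stdbool.h>

/* Hypothetical, cut-down version of the writeback control block. */
struct toy_wbc {
	long nr_to_write;	/* remaining page budget */
	long inodes_written;	/* inodes cleaned during this batch */
};

/*
 * A batch made progress if it wrote pages (the budget shrank) or if it
 * cleaned at least one inode, even when no page was written.
 */
bool toy_made_progress(const struct toy_wbc *wbc, long budget_before)
{
	if (wbc->nr_to_write < budget_before)
		return true;
	return wbc->inodes_written > 0;
}
```

A batch of metadata-only inodes leaves `nr_to_write` untouched but sets `inodes_written`, so the caller keeps going instead of quitting early.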

writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6

Committed by Wu Fengguang on Jun 06, 2010

sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.

Jan's commit f446daae ("mm: implement writeback livelock avoidance
using page tagging") is a partial fix in that it only fixed the
WB_SYNC_ALL phase livelock.

Although ext4 is tested to no longer livelock with commit f446daae,
it may be due to some "redirty_tail() after pages_skipped" effect which
is by no means a guarantee for _all_ the file systems.

Note that writeback_inodes_sb() is called not only by sync(); all callers
are treated the same because the other callers also need livelock prevention.

Impact:  It changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
until it has finished with the current inode.
Acked-by: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

04 Jun, 2011 1 commit

Revert "tty: make receive_buf() return the amout of bytes received" · 55db4c64

Committed by Linus Torvalds on Jun 04, 2011

This reverts commit b1c43f82.

It was broken in so many ways, and results in random odd pty issues.

It re-introduced the buggy schedule_work() in flush_to_ldisc() that can
cause endless work-loops (see commit a5660b41: "tty: fix endless
work loop when the buffer fills up").

It also used an "unsigned int" return value for the ->receive_buf()
function, but then made multiple functions return a negative error code,
and didn't actually check for the error in the caller.

And it didn't actually work at all.  BenH bisected down odd tty behavior
to it:
  "It looks like the patch is causing some major malfunctions of the X
   server for me, possibly related to PTYs.  For example, cat'ing a
   large file in a gnome terminal hangs the kernel for -minutes- in a
   loop of what looks like flush_to_ldisc/workqueue code, (some ftrace
   data in the quoted bits further down).

   ...

   Some more data: It -looks- like what happens is that the
   flush_to_ldisc work queue entry constantly re-queues itself (because
   the PTY is full ?) and the workqueue thread will basically loop
   forver calling it without ever scheduling, thus starving the consumer
   process that could have emptied the PTY."

which is pretty much exactly the problem we fixed in a5660b41.

Milton Miller pointed out the 'unsigned int' issue.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: Milton Miller <miltonm@bga.com>
Cc: Stefan Bigler <stefan.bigler@keymile.com>
Cc: Toby Gray <toby.gray@realvnc.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
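
The 'unsigned int' problem mentioned above can be shown in isolation. This is a sketch with hypothetical function names, not the tty code: a negative error code returned through an unsigned type turns into a very large byte count, so a caller can no longer tell it from success.

```c
#include <errno.h>

/* Hypothetical callback with the mistake: unsigned type, negative value. */
unsigned int toy_receive_buf_unsigned(int fail)
{
	/* -EIO converted to unsigned becomes a huge positive number. */
	return fail ? (unsigned int)-EIO : 16u;
}

/* Same logic with a signed return type: the error stays negative. */
int toy_receive_buf_signed(int fail)
{
	return fail ? -EIO : 16;
}
```

A caller that treats the unsigned result as "bytes consumed" would believe that billions of bytes were accepted; with the signed variant a plain `ret < 0` check works.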

03 Jun, 2011 2 commits

net: tracepoint of net_dev_xmit sees freed skb and causes panic · ec764bf0

Committed by Koki Sanagi on May 30, 2011

Because the skb may be freed with kfree_skb() and zeroed after
ndo_start_xmit, we should not look at the contents of the skb, such as
skb->len and skb->dev->name, after ndo_start_xmit. But trace_net_dev_xmit
does that and causes a panic by NULL pointer dereference.
This patch fixes trace_net_dev_xmit so that it does not look at the
contents of the skb directly.

To reproduce this panic:

1. Turn the net_dev_xmit tracepoint on
2. Create 2 guests on KVM
3. Make the 2 guests use virtio_net
4. Run netperf from one guest to the other for a long time as network load
5. The host will panic (it takes about 30 minutes)
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
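
The general pattern behind this kind of fix can be sketched in user space: copy the fields the trace hook needs before ownership of the buffer moves to the driver. All names here (`toy_skb`, `toy_start_xmit`, `toy_xmit_and_trace`) are hypothetical, and the actual patch may pass different values to the tracepoint.

```c
#include <stdlib.h>

struct toy_skb {
	unsigned int len;
};

/* Values handed to the (imaginary) trace hook. */
struct toy_trace {
	unsigned int len;
	int rc;
};

/* Hypothetical driver hook: takes ownership and frees the buffer. */
int toy_start_xmit(struct toy_skb *skb)
{
	free(skb);
	return 0;
}

struct toy_trace toy_xmit_and_trace(struct toy_skb *skb)
{
	unsigned int len = skb->len;	/* snapshot before handing over */
	int rc = toy_start_xmit(skb);	/* skb must not be touched again */
	struct toy_trace t = { .len = len, .rc = rc };

	return t;
}
```

Reading `skb->len` after `toy_start_xmit()` returned would be a use-after-free; the snapshot keeps the trace data valid regardless of what the driver did with the buffer.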

asm-generic/unistd.h: support sendmmsg syscall · b36a9689

Committed by Chris Metcalf on Jun 02, 2011

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>

02 Jun, 2011 2 commits

af-packet: Add flag to distinguish VID 0 from no-vlan. · a3bcc23e

Committed by Ben Greear on Jun 01, 2011

Currently, user-space cannot determine if a 0 tp_vlan_tci
means there is no VLAN tag or the VLAN ID was zero.

Add flag to make this explicit.  User-space can check for
TP_STATUS_VLAN_VALID || tp_vlan_tci > 0, which will be backwards
compatible. Older code would have just checked for tp_vlan_tci,
so it will work no worse than before.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
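
The backwards-compatible check suggested in the message could look roughly like this in user space. To stay self-contained the sketch uses a hypothetical header struct and a stand-in constant `TOY_TP_STATUS_VLAN_VALID`; real code would take the flag and the frame header layout from `<linux/if_packet.h>`, and the bit value below is only illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the status bit; the value here is an assumption. */
#define TOY_TP_STATUS_VLAN_VALID (1u << 4)

/* Hypothetical subset of a packet-ring frame header. */
struct toy_tpacket_hdr {
	uint32_t tp_status;
	uint16_t tp_vlan_tci;
};

/* True if the frame carried a VLAN tag, including a tag with VID 0. */
bool toy_has_vlan_tag(const struct toy_tpacket_hdr *h)
{
	return (h->tp_status & TOY_TP_STATUS_VLAN_VALID) ||
	       h->tp_vlan_tci > 0;
}
```

On a kernel that sets the flag, a tag with VID 0 is now detected; on an older kernel the flag is never set and the check degrades to the old `tp_vlan_tci` test.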

cfg80211: don't drop p2p probe responses · 333ba732

Committed by Eliad Peller on May 29, 2011

Commit 0a35d36d ("cfg80211: Use capability info to detect mesh beacons")
assumed that probe response with both ESS and IBSS bits cleared
means that the frame was sent by a mesh sta.

However, these capabilities are also being used in the p2p_find phase,
and the mesh-validation broke it.

Rename the WLAN_CAPABILITY_IS_MBSS macro, and verify that mesh ies
exist before assuming this frame was sent by a mesh sta.
Signed-off-by: Eliad Peller <eliad@wizery.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

01 Jun, 2011 3 commits

intel-iommu: Enable super page (2MiB, 1GiB, etc.) support · 6dd9a7c7

Committed by Youquan Song on May 25, 2011

There are no externally-visible changes with this. In the loop in the
internal __domain_mapping() function, we simply detect if we are mapping:
  - size >= 2MiB, and
  - virtual address aligned to 2MiB, and
  - physical address aligned to 2MiB, and
  - on hardware that supports superpages.

(and likewise for larger superpages).

We automatically use a superpage for such mappings. We never have to
worry about *breaking* superpages, since we trust that we will always
*unmap* the same range that was mapped. So all we need to do is ensure
that dma_pte_clear_range() will also cope with superpages.

Adjust pfn_to_dma_pte() to take a superpage 'level' as an argument, so
it can return a PTE at the appropriate level rather than always
extending the page tables all the way down to level 1. Again, this is
simplified by the fact that we should never encounter existing small
pages when we're creating a mapping; any old mapping that used the same
virtual range will have been entirely removed and its obsolete page
tables freed.

Provide an 'intel_iommu=sp_off' argument on the command line as a
chicken bit. Not that it should ever be required.

==

The original commit seen in the iommu-2.6.git was Youquan's
implementation (and completion) of my own half-baked code which I'd
typed into an email. Followed by half a dozen subsequent 'fixes'.

I've taken the unusual step of rewriting history and collapsing the
original commits in order to keep the main history simpler, and make
life easier for the people who are going to have to backport this to
older kernels. And also so I can give it a more coherent commit comment
which (hopefully) gives a better explanation of what's going on.

The original sequence of commits leading to identical code was:

Youquan Song (3):
      intel-iommu: super page support
      intel-iommu: Fix superpage alignment calculation error
      intel-iommu: Fix superpage level calculation error in dma_pfn_level_pte()

David Woodhouse (4):
      intel-iommu: Precalculate superpage support for dmar_domain
      intel-iommu: Fix hardware_largepage_caps()
      intel-iommu: Fix inappropriate use of superpages in __domain_mapping()
      intel-iommu: Fix phys_pfn in __domain_mapping for sglist pages
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
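
The four conditions listed in the message for using a 2MiB superpage can be written as a single predicate. This is a sketch with hypothetical names (`toy_can_use_2m_superpage`, `TOY_SZ_2M`), not the code in __domain_mapping(); larger superpage sizes would follow the same pattern with a different constant.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_SZ_2M (UINT64_C(2) << 20)

/*
 * A mapping qualifies for a 2MiB superpage when it is at least 2MiB long,
 * both addresses are 2MiB aligned, and the hardware supports superpages.
 */
bool toy_can_use_2m_superpage(uint64_t iova, uint64_t phys,
			      uint64_t size, bool hw_superpage)
{
	const uint64_t mask = TOY_SZ_2M - 1;

	return hw_superpage &&
	       size >= TOY_SZ_2M &&
	       (iova & mask) == 0 &&
	       (phys & mask) == 0;
}
```

Any mapping that fails one of the tests simply falls back to ordinary small pages.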

mtd: fix physmap.h warnings · 63da0290

Committed by Randy Dunlap on May 23, 2011

Fix build warnings in physmap.h:

include/linux/mtd/physmap.h:25: warning: 'struct platform_device' declared inside parameter list
include/linux/mtd/physmap.h:25: warning: its scope is only this definition or declaration, which is probably not what you want
include/linux/mtd/physmap.h:26: warning: 'struct platform_device' declared inside parameter list
include/linux/mtd/physmap.h:27: warning: 'struct platform_device' declared inside parameter list
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>

sctp: stop pending timers and purge queues when peer restart asoc · a000c01e

Committed by Wei Yongjun on May 29, 2011

If the peer restarts the asoc, we should not only fail any unsent/unacked
data, but also stop the T3-rtx, SACK, T4-rto timers, and tear down ASCONF
queues.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

31 May, 2011 1 commit

block: remove unwanted semicolons · ea9d6553

Committed by Namhyung Kim on May 31, 2011

Since those defined functions require an additional semicolon
from the caller, they could cause potential syntax errors
when used in if-else statements.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
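
One reading of this message is that function-like stub definitions ended in a semicolon of their own, so the semicolon written by the caller produced an extra empty statement. The sketch below uses a hypothetical macro `toy_stub()` to show why that matters in an if-else; the broken form appears only in a comment because it does not compile.

```c
/*
 * Broken form (does not compile when used as below):
 *
 *   #define toy_stub(x) do { (void)(x); } while (0);
 *
 *   if (cond)
 *           toy_stub(cond);   expands to "...while (0);;"
 *   else                      "else" no longer follows the "if"
 *           result = 1;
 */

/* Fixed form: no trailing semicolon, the caller supplies it. */
#define toy_stub(x) do { (void)(x); } while (0)

int toy_pick(int cond)
{
	int result = 0;

	if (cond)
		toy_stub(cond);
	else
		result = 1;
	return result;
}
```

With the trailing semicolon removed, the macro behaves like an ordinary function call statement in every context.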

30 May, 2011 11 commits

Revert "block: Remove extra discard_alignment from hd_struct." · a1706ac4

Committed by Jens Axboe on May 30, 2011

It was not a good idea to start dereferencing disk->queue from
the fs sysfs strategy for displaying discard alignment. We ran
into first a NULL pointer deref, and after fixing that we sometimes
see invalid disk->queue pointer values.

Since discard is the only one of the bunch actually looking into
the queue, just revert the change.

This reverts commit 23ceb5b7.

Conflicts:
	fs/partitions/check.c

virtio: add api for delayed callbacks · 7ab358c2

Committed by Michael S. Tsirkin on May 20, 2011

Add an API that tells the other side that callbacks
should be delayed until a lot of work has been done.
Implement using the new event_idx feature.

Note: it might seem advantageous to let the drivers
ask for a callback after a specific capacity has
been reached. However, as a single head can
free many entries in the descriptor table,
we don't really have a clue about capacity
until get_buf is called. The API is the simplest
to implement at the moment; we'll see what kind of
hints drivers can pass when there's more than one
user of the feature.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

virtio ring: inline function to check for events · bf7035bf

Committed by Michael S. Tsirkin on May 20, 2011

With the new used_event and avail_event features, both
host and guest need similar logic to check whether events are
enabled, so it helps to put the common code in the header.

Note that Xen has similar logic for notification hold-off
in include/xen/interface/io/ring.h with req_event and req_prod
corresponding to event_idx + 1 and new_idx respectively.
+1 comes from the fact that req_event and req_prod in Xen start at 1,
while event index in virtio starts at 0.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
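
A common way to write such a check with free-running 16-bit ring indices is a wrap-safe window comparison. The sketch below is an illustration under that assumption, with a hypothetical name `toy_need_event()`; it is not claimed to be the exact helper this commit adds.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * event_idx: index after which the other side asked to be notified
 * new_idx:   ring index after the latest update
 * old_idx:   ring index before the latest update
 *
 * Returns true if the update moved the index past event_idx, that is,
 * event_idx lies in the window [old_idx, new_idx).  Unsigned 16-bit
 * arithmetic keeps the test correct across index wrap-around.
 */
bool toy_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) <
	       (uint16_t)(new_idx - old_idx);
}
```

For example, moving from index 5 to 6 with `event_idx` 5 requires a notification, while moving from 4 to 5 does not; the same holds when the indices wrap from 65535 to 0.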

virtio: event index interface · 770b31a8

Committed by Michael S. Tsirkin on May 20, 2011

Define a new feature bit for the guest and host to utilize
an event index (like Xen) instead of a flag bit to enable/disable
interrupts and kicks.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

virtio: add full three-clause BSD text to headers. · a1b38387

Committed by Rusty Russell on May 30, 2011

It's unclear to me if it's important, but it's obviously causing my
technical colleagues some headaches and I'd hate such imprecision to
slow virtio adoption.

I've emailed this to all non-trivial contributors for approval, too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Ryan Harper <ryanh@us.ibm.com>
Acked-by: Anthony Liguori <aliguori@us.ibm.com>
Acked-by: Eric Van Hensbergen <ericvh@gmail.com>
Acked-by: john cooper <john.cooper@redhat.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>

mm: Fix boot crash in mm_alloc() · 6345d24d

Committed by Linus Torvalds on May 29, 2011

Thomas Gleixner reports that we now have a boot crash triggered by
CONFIG_CPUMASK_OFFSTACK=y:

    BUG: unable to handle kernel NULL pointer dereference at   (null)
    IP: [<c11ae035>] find_next_bit+0x55/0xb0
    Call Trace:
     [<c11addda>] cpumask_any_but+0x2a/0x70
     [<c102396b>] flush_tlb_mm+0x2b/0x80
     [<c1022705>] pud_populate+0x35/0x50
     [<c10227ba>] pgd_alloc+0x9a/0xf0
     [<c103a3fc>] mm_init+0xec/0x120
     [<c103a7a3>] mm_alloc+0x53/0xd0

which was introduced by commit de03c72c ("mm: convert
mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
mm_init() vs mm_init_cpumask.

Thomas wrote a patch to just fix the ordering of initialization, but I
hate the new double allocation in the fork path, so I ended up instead
doing some more radical surgery to clean it all up.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Ingo Molnar <mingo@elte.hu>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

NFSv4.1: change pg_test return type to bool · 18ad0a9f

Committed by Benny Halevy on May 25, 2011

Signed-off-by: Benny Halevy <bhalevy@panasas.com>

pnfs: layoutreturn · cbe82603

Committed by Benny Halevy on May 22, 2011

NFSv4.1 LAYOUTRETURN implementation

Currently, does not support layout-type payload encoding.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
[call pnfs_return_layout right before pnfs_destroy_layout]
[remove assert_spin_locked from pnfs_clear_lseg_list]
[remove wait parameter from the layoutreturn path.]
[remove return_type field from nfs4_layoutreturn_args]
[remove range from nfs4_layoutreturn_args]
[no need to send layoutcommit from _pnfs_return_layout]
[don't wait on sync layoutreturn]
[fix layout stateid in layoutreturn args]
[fixed NULL deref in _pnfs_return_layout]
[removed reclaim member of nfs4_layoutreturn_args]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>

pnfs: support for non-rpc layout drivers · d20581aa

Committed by Benny Halevy on May 22, 2011

Non-rpc layout drivers, such as those for objects and blocks,
implement their own I/O path and error handling logic.
Therefore bypass NFS-based error handling for these layout drivers.

[fix lseg ref-count bugs, and null de-refs]
[Fall out from: non-rpc layout drivers]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[get rid of PNFS_USE_RPC_CODE]
[get rid of __nfs4_write_done_cb]
[revert useless change in nfs4_write_done_cb]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>

pnfs-obj: pnfs_osd XDR definitions · 38b7c401

Committed by Benny Halevy on May 22, 2011

* Add the pnfs_osd_xdr.h header

* Define the pnfs_osd_layout structure including all its
  sub-types and constants.
* Declare the pnfs_osd_xdr_decode_layout API + all needed
  inline helpers.

* Define the pnfs_osd_deviceaddr structure and all its subtypes and
  constants.
* Declare API for decoding of a pnfs_osd_deviceaddr from XDR stream.

* Define the pnfs_osd_ioerr structure, its substructures and constants.
* Declare API for encoding of a pnfs_osd_ioerr into XDR stream.

* Define the pnfs_osd_layoutupdate structure and its substructures.
* Declare API for encoding of a pnfs_osd_layoutupdate into XDR stream.

[Remove server definitions]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>

SUNRPC: introduce xdr_init_decode_pages · f7da7a12

Committed by Benny Halevy on May 19, 2011

Initialize xdr_stream and xdr_buf using an array of page pointers
and length of buffer.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>

29 May, 2011 10 commits

dm kcopyd: return client directly and not through a pointer · fa34ce73

Committed by Mikulas Patocka on May 29, 2011

Return client directly from dm_kcopyd_client_create, not through a
parameter, making it consistent with dm_io_client_create.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm kcopyd: reserve fewer pages · 5f43ba29

Committed by Mikulas Patocka on May 29, 2011

Reserve just the minimum of pages needed to process one job.

Because we allocate pages from page allocator, we don't need to reserve
a large number of pages.  The maximum job size is SUB_JOB_SIZE and we
calculate the number of reserved pages based on this.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
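
The calculation described here amounts to rounding one job's size up to whole pages. The sketch below assumes that SUB_JOB_SIZE is expressed in 512-byte sectors and uses illustrative values (128 sectors, 4096-byte pages); none of these numbers are stated in the message, and `toy_reserve_pages()` is a hypothetical name.

```c
#define TOY_SECTOR_SHIFT 9	/* assumed: 512-byte sectors */

/*
 * Number of pages needed to hold one sub-job of the given size,
 * rounded up to a whole page.
 */
unsigned int toy_reserve_pages(unsigned int sub_job_sectors,
			       unsigned int page_size)
{
	unsigned int bytes = sub_job_sectors << TOY_SECTOR_SHIFT;

	return (bytes + page_size - 1) / page_size;
}
```

Under these assumptions a 128-sector job on 4096-byte pages needs 16 reserved pages, far fewer than a reserve sized for many jobs at once.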

dm io: use fixed initial mempool size · bda8efec

Committed by Mikulas Patocka on May 29, 2011

Replace the arbitrary calculation of an initial io struct mempool size
with a constant.

The code calculated the number of reserved structures based on the request
size and used a "magic" multiplication constant of 4.  This patch changes
it to reserve a fixed number - itself still chosen quite arbitrarily.
Further testing might show if there is a better number to choose.

Note that if there is no memory pressure, we can still allocate an
arbitrary number of "struct io" structures.  One structure is enough to
process the whole request.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

dm table: allow targets to support discards internally · 4c259327

Committed by Mike Snitzer on May 29, 2011

Permit a target to support discards regardless of whether or not all its
underlying devices do.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

[S390] mm: fix storage key handling · a43a9d93

Committed by Heiko Carstens on May 29, 2011

page_get_storage_key() and page_set_storage_key() expect a page address
and not its page frame number. This got inconsistent with 2d42552d
"[S390] merge page_test_dirty and page_clear_dirty".

Result is that we read/write storage keys from random pages and do not
have a working dirty bit tracking at all.
E.g. SetPageUptodate() doesn't clear the dirty bit of requested pages, which
for example ext4 doesn't like very much and panics after a while.

Unable to handle kernel paging request at virtual user address (null)
Oops: 0004 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 1 Not tainted 2.6.39-07551-g139f37f5-dirty #152
Process flush-94:0 (pid: 1576, task: 000000003eb34538, ksp: 000000003c287b70)
Krnl PSW : 0704c00180000000 0000000000316b12 (jbd2_journal_file_inode+0x10e/0x138)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
Krnl GPRS: 0000000000000000 0000000000000000 0000000000000000 0700000000000000
           0000000000316a62 000000003eb34cd0 0000000000000025 000000003c287b88
           0000000000000001 000000003c287a70 000000003f1ec678 000000003f1ec000
           0000000000000000 000000003e66ec00 0000000000316a62 000000003c287988
Krnl Code: 0000000000316b04: f0a0000407f4       srp     4(11,%r0),2036,0
           0000000000316b0a: b9020022           ltgr    %r2,%r2
           0000000000316b0e: a7740015           brc     7,316b38
          >0000000000316b12: e3d0c0000024       stg     %r13,0(%r12)
           0000000000316b18: 4120c010           la      %r2,16(%r12)
           0000000000316b1c: 4130d060           la      %r3,96(%r13)
           0000000000316b20: e340d0600004       lg      %r4,96(%r13)
           0000000000316b26: c0e50002b567       brasl   %r14,36d5f4
Call Trace:
([<0000000000316a62>] jbd2_journal_file_inode+0x5e/0x138)
 [<00000000002da13c>] mpage_da_map_and_submit+0x2e8/0x42c
 [<00000000002daac2>] ext4_da_writepages+0x2da/0x504
 [<00000000002597e8>] writeback_single_inode+0xf8/0x268
 [<0000000000259f06>] writeback_sb_inodes+0xd2/0x18c
 [<000000000025a700>] writeback_inodes_wb+0x80/0x168
 [<000000000025aa92>] wb_writeback+0x2aa/0x324
 [<000000000025abde>] wb_do_writeback+0xd2/0x274
 [<000000000025ae3a>] bdi_writeback_thread+0xba/0x1c4
 [<00000000001737be>] kthread+0xa6/0xb0
 [<000000000056c1da>] kernel_thread_starter+0x6/0xc
 [<000000000056c1d4>] kernel_thread_starter+0x0/0xc
INFO: lockdep is turned off.
Last Breaking-Event-Address:
 [<0000000000316a8a>] jbd2_journal_file_inode+0x86/0x138
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
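
The address-versus-frame-number mix-up can be modelled with a toy key table. Everything below is hypothetical (`toy_keys`, `toy_set_storage_key()`, a 4KiB page size); it only illustrates how passing a page frame number to a helper that expects an address ends up touching the wrong page.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumed 4KiB pages */
#define TOY_NR_PAGES 16

static uint8_t toy_keys[TOY_NR_PAGES];	/* one key per page frame */

/* Both helpers expect a page ADDRESS, not a page frame number. */
void toy_set_storage_key(unsigned long addr, uint8_t key)
{
	toy_keys[(addr >> TOY_PAGE_SHIFT) % TOY_NR_PAGES] = key;
}

uint8_t toy_get_storage_key(unsigned long addr)
{
	return toy_keys[(addr >> TOY_PAGE_SHIFT) % TOY_NR_PAGES];
}

/* Convert a page frame number into the address the helpers expect. */
unsigned long toy_pfn_to_addr(unsigned long pfn)
{
	return pfn << TOY_PAGE_SHIFT;
}
```

Calling `toy_get_storage_key(3)` with a raw frame number reads the key of page 0, whereas `toy_get_storage_key(toy_pfn_to_addr(3))` reads the key of page 3.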

ACPI: Add D3 cold state · 28c2103d

Committed by Lin Ming on May 04, 2011

_SxW returns an Integer containing the lowest D-state supported in state
Sx. If OSPM has not indicated that it supports _PR3, then the value “3”
corresponds to D3.  If it has indicated _PR3 support, the value “3”
represents D3hot and the value “4” represents D3cold.

Linux does set _OSC._PR3, so we should fix it to expect that _SxW can
return 4.
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Len Brown <len.brown@intel.com>

ACPI: processor: fix processor_physically_present in UP kernel · 932df741

Committed by Lin Ming on May 16, 2011

Usually, there are multiple processors defined in ACPI table, for
example

    Scope (_PR)
    {
        Processor (CPU0, 0x00, 0x00000410, 0x06) {}
        Processor (CPU1, 0x01, 0x00000410, 0x06) {}
        Processor (CPU2, 0x02, 0x00000410, 0x06) {}
        Processor (CPU3, 0x03, 0x00000410, 0x06) {}
    }

processor_physically_present(...) will be called to check whether those
processors are physically present.

Currently we have the following code in processor_physically_present:

cpuid = acpi_get_cpuid(...);
if ((cpuid == -1) && (num_possible_cpus() > 1))
        return false;
return true;

In a UP kernel, acpi_get_cpuid(...) always returns -1 and
num_possible_cpus() always returns 1, so
processor_physically_present(...) always returns true for all passed in
processor handles.

This is wrong for a UP processor or an SMP processor running a UP kernel.

This patch removes the !SMP version of acpi_get_cpuid(), so both UP and
SMP kernels use the same acpi_get_cpuid function.

And for a UP kernel, only processor 0 is valid.

https://bugzilla.kernel.org/show_bug.cgi?id=16548
https://bugzilla.kernel.org/show_bug.cgi?id=16357
Tested-by: Anton Kochkov <anton.kochkov@gmail.com>
Tested-by: Ambroz Bizjak <ambrop7@gmail.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

idle governor: Avoid lock acquisition to read pm_qos before entering idle · 333c5ae9

Committed by Tim Chen on Feb 11, 2011

Thanks to the reviews and comments by Rafael, James, Mark and Andi.
Here's version 2 of the patch incorporating your comments and also some
update to my previous patch comments.

I noticed that before entering idle state, the menu idle governor will
look up the current pm_qos target value according to the list of qos
requests received.  This look up currently needs the acquisition of a
lock to access the list of qos requests to find the qos target value,
slowing down the entrance into idle state due to contention by multiple
cpus to access this list.  The contention is severe when there are a lot
of cpus waking and going into idle.  For example, for a simple workload
that has 32 pairs of processes ping ponging messages to each other, where
64 cpu cores are active in test system, I see the following profile with
37.82% of cpu cycles spent in contention of pm_qos_lock:

-     37.82%          swapper  [kernel.kallsyms]          [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 95.65% pm_qos_request
           menu_select
           cpuidle_idle_call
         - cpu_idle
              99.98% start_secondary

A better approach will be to cache the updated pm_qos target value so
reading it does not require lock acquisition as in the patch below.
With this patch the contention for pm_qos_lock is removed and I saw a
2.2X increase in throughput for my message passing workload.

cc: stable@kernel.org
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: James Bottomley <James.Bottomley@suse.de>
Acked-by: mark gross <markgross@thegnar.org>
Signed-off-by: Len Brown <len.brown@intel.com>
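
The caching approach described above can be sketched with C11 atomics: writers recompute the target under a lock and publish it, while readers load the published value without locking. The names (`toy_qos`, `toy_qos_update()`, `toy_qos_read()`) are hypothetical, a spinning `atomic_flag` stands in for the kernel spinlock, and the request list itself is left out.

```c
#include <stdatomic.h>

struct toy_qos {
	atomic_flag lock;		/* protects the (omitted) request list */
	atomic_int cached_target;	/* published value, read lock-free */
};

void toy_qos_init(struct toy_qos *q, int target)
{
	atomic_flag_clear(&q->lock);
	atomic_init(&q->cached_target, target);
}

/* Slow path: called when a request is added, changed or removed. */
void toy_qos_update(struct toy_qos *q, int new_target)
{
	while (atomic_flag_test_and_set_explicit(&q->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
	/* A real implementation would recompute the target from the list. */
	atomic_store_explicit(&q->cached_target, new_target,
			      memory_order_release);
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Fast path: called on every idle entry, never takes the lock. */
int toy_qos_read(struct toy_qos *q)
{
	return atomic_load_explicit(&q->cached_target, memory_order_acquire);
}
```

The design trades a little extra work on the rare update path for a read path that many CPUs can execute concurrently without contending on a shared lock.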

ns: Wire up the setns system call · 7b21fddd

Committed by Eric W. Biederman on May 27, 2011

32bit and 64bit on x86 are tested and working.  The rest I have looked
at closely and I can't find any problems.

setns is an easy system call to wire up.  It just takes two ints so I
don't expect any weird architecture porting problems.

While doing this I have noticed that we have some architectures that are
very slow to get new system calls.  cris seems to be the slowest where
the last system calls wired up were preadv and pwritev.  avr32 is weird
in that recvmmsg was wired up but never declared in unistd.h.  frv is
behind with perf_event_open being the last syscall wired up.  On h8300
the last system call wired up was epoll_wait.  On m32r the last system
call wired up was fallocate.  mn10300 has recvmmsg as the last system
call wired up.  The rest seem to at least have syncfs wired up which was
new in 2.6.39.

v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
v4: Moved wiring up of the system call to another patch
v5: ported to v2.6.39-rc6
v6: rebased onto parisc-next and net-next to avoid syscall  conflicts.
v7: ported to Linus's latest post 2.6.39 tree.

>  arch/blackfin/include/asm/unistd.h     |    3 ++-
>  arch/blackfin/mach-common/entry.S      |    1 +
Acked-by: Mike Frysinger <vapier@gentoo.org>

Oh - ia64 wiring looks good.
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Cache xattr security drop check for write v2 · 69b45732

Committed by Andi Kleen on May 28, 2011

Some recent benchmarking on btrfs showed that a major scaling bottleneck
on large systems on btrfs is currently the xattr lookup on every write.

Why xattr lookup on every write I hear you ask?

write wants to drop suid and security related xattrs that could set
capabilities for executables.  To do that it currently looks up
security.capability on EVERY write (even for non executables) to decide
whether to drop it or not.

In btrfs this causes an additional tree walk, hitting some per file system
locks and quite bad scalability. In a simple read workload on a 8S
system I saw over 90% CPU time in spinlocks related to that.

Chris Mason tells me this is also a problem in ext4, where it hits
the global mbcache lock.

This patch adds a simple per-inode flag to avoid this problem.  We do
the lookup only once per file and then, if there is no xattr, cache
the decision. All xattr changes clear the flag.

I also used the same flag to avoid the suid check, although
that one is pretty cheap.

A file system can also set this flag when it creates the inode,
if it has a cheap way to do so.  This is done for some common file systems
in follow-on patches.
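
The caching scheme described above can be sketched as a small user-space model. Everything here is hypothetical (the `toy_inode` structure, the `nosec_cached` field, and the function names are not kernel API); the sketch only shows the logic: look up once, cache a negative result, and clear the cache on any xattr change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an inode; none of these names are kernel API. */
struct toy_inode {
	bool has_security_xattr;  /* ground truth, expensive to query */
	bool nosec_cached;        /* hypothetical "nothing to drop" flag */
	unsigned long lookups;    /* counts expensive lookups for the demo */
};

/* Models the costly xattr lookup (tree walk, file system locks). */
static bool slow_xattr_lookup(struct toy_inode *inode)
{
	inode->lookups++;
	return inode->has_security_xattr;
}

/* Called on every write; returns true if a security xattr was dropped. */
bool toy_write(struct toy_inode *inode)
{
	assert(inode != NULL);

	if (inode->nosec_cached)
		return false;                 /* fast path: no lookup */

	if (slow_xattr_lookup(inode)) {
		inode->has_security_xattr = false;  /* drop it */
		return true;
	}

	inode->nosec_cached = true;           /* cache the negative result */
	return false;
}

/* Any xattr change must clear the cached decision. */
void toy_set_security_xattr(struct toy_inode *inode)
{
	assert(inode != NULL);
	inode->has_security_xattr = true;
	inode->nosec_cached = false;
}
```

In this model, repeated writes to a file without a security xattr pay for exactly one lookup; only a later xattr change forces another one.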

With this patch a major part of the lock contention disappears
for btrfs. Some testing on smaller systems didn't show significant
performance changes, but at least it helps the larger systems
and is generally more efficient.

v2: Rename is_sgid. Add file system helper.
Cc: chris.mason@oracle.com
Cc: josef@redhat.com
Cc: viro@zeniv.linux.org.uk
Cc: agruen@linbit.com
Cc: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

69b45732

28 May, 2011 6 commits

atomic: Add atomic_or() · 55c2945a

Committed by Paul E. McKenney on May 11, 2011

An atomic_or() function is needed by TREE_RCU to avoid deadlock, so
add a generic version.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
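
For readers unfamiliar with the operation, the following is a user-space sketch of one common way to build an atomic OR from a compare-and-exchange loop, written with C11 `<stdatomic.h>`. It is not the kernel's code: the name `sketch_atomic_or` is hypothetical, and the argument order is chosen only to resemble the kernel's atomic helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space analogue: atomically set the bits of `mask` in *v. */
void sketch_atomic_or(int mask, atomic_int *v)
{
	int old;

	assert(v != NULL);

	old = atomic_load(v);

	/*
	 * Try to replace `old` with `old | mask`.  If another thread
	 * changed *v in the meantime, the exchange fails, `old` is
	 * refreshed with the current value, and the loop retries.
	 */
	while (!atomic_compare_exchange_weak(v, &old, old | mask))
		;
}
```

In portable C11 code the standard `atomic_fetch_or()` has the same effect; the explicit loop is shown only to make the retry logic visible.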

55c2945a

cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed · 1e1b6c51

Committed by KOSAKI Motohiro on May 19, 2011

The rule is, we have to update tsk->rt.nr_cpus_allowed if we change
tsk->cpus_allowed. Otherwise the RT scheduler may get confused.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

1e1b6c51

gpio: make gpio_{request,free}_array gpio array parameter const · 7c295975

Committed by Lars-Peter Clausen on May 25, 2011

gpio_{request,free}_array should not (and do not) modify the passed gpio
array, so make the parameter const.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Acked-by: Eric Miao <eric.y.miao@gmail.com>
Acked-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>

7c295975

SUNRPC: Support for RPC over AF_LOCAL transports · 176e21ee

Committed by Chuck Lever on May 09, 2011

TI-RPC introduces the capability of performing RPC over AF_LOCAL
sockets. It uses this mainly for registering and unregistering
local RPC services securely with the local rpcbind, but we could
also conceivably use it as a generic upcall mechanism.

This patch provides a client-side only implementation for the moment.
We might also consider a server-side implementation to provide
AF_LOCAL access to NLM (for statd downcalls, and such like).

Autobinding is not supported on kernel AF_LOCAL transports at this
time. Kernel ULPs must specify the pathname of the remote endpoint
when an AF_LOCAL transport is created. rpcbind supports registering
services available via AF_LOCAL, so the kernel could handle it with
some adjustment to ->rpcbind and ->set_port. But we don't need this
feature for doing upcalls via well-known named sockets.
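
To make "specify the pathname of the remote endpoint" concrete, here is a user-space sketch of how an AF_LOCAL address is built from a pathname. It uses only the standard sockets API and is not the kernel transport code from this patch; the helper name `local_endpoint` is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper: fill `addr` for the AF_LOCAL endpoint named by
 * `path`.  Returns 0 on success, or -1 with errno = ENAMETOOLONG if the
 * pathname does not fit in sun_path.
 */
int local_endpoint(struct sockaddr_un *addr, const char *path)
{
	size_t len;

	assert(addr != NULL && path != NULL);

	len = strlen(path);
	if (len >= sizeof(addr->sun_path)) {
		errno = ENAMETOOLONG;
		return -1;
	}

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_LOCAL;
	memcpy(addr->sun_path, path, len + 1);  /* include the NUL */
	return 0;
}
```

A user-space client would then create a socket with `socket(AF_LOCAL, SOCK_STREAM, 0)` and pass the filled address to `connect()`. There is no port number to discover, which is why the pathname has to be known in advance.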

This has not been tested with ULPs that move a substantial amount of
data. Thus, I can't attest to how robust the write_space and
congestion management logic is.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

176e21ee

net: Kill ratelimit.h dependency in linux/net.h · c5c177b4

Committed by David S. Miller on May 27, 2011

Ingo Molnar noticed that we have this unnecessary ratelimit.h
dependency in linux/net.h, which hid compilation problems from
people doing builds only with CONFIG_NET enabled.

Move this stuff out to a separate net/net_ratelimit.h file and
include that in the only two places where this thing is needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>

c5c177b4

net: Add linux/sysctl.h includes where needed. · bee95250

Committed by David S. Miller on May 26, 2011

Several networking headers were depending upon the implicit
linux/sysctl.h include they get when including linux/net.h.

Add explicit includes.
Signed-off-by: David S. Miller <davem@davemloft.net>

bee95250
