提交 · 4d94aa1b1d437f9513ddc89974d8bd214b8304f6 · xiphi1978 / linux

10 10月, 2010 1 次提交

ocfs2/cluster: Bump up dlm protocol to version 1.1 · 4d94aa1b

由 Sunil Mushran 提交于 10月 09, 2010

dlm protocol 1.1. activates messages DLM_QUERY_REGION and DLM_QUERY_NODEINFO
that are a must for global heartbeat.

It also activates o2hb_global_heartbeat_active().
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

4d94aa1b

07 10月, 2010 4 次提交

· 43695d09

由 Sunil Mushran 提交于 10月 06, 2010

ocfs2/cluster: Show per region heartbeat elapsed time

This patch adds a per region debugfs file that shows the elapsed time
since the time the o2hb timer was last armed.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

43695d09

· d6aa1c7c

由 Sunil Mushran 提交于 10月 06, 2010

ocfs2/cluster: Add mlogs for heartbeat up/down events

This patch adds mlogs for o2hb up and down events.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

d6aa1c7c

· 1f285305

由 Sunil Mushran 提交于 10月 06, 2010

ocfs2/cluster: Create debugfs dir/files for each region

This patch creates debugfs directory for each o2hb region and creates
files to expose the region number and the per region live node bitmap.
This information will be useful in debugging cluster issues.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

1f285305

· a6de0136

由 Sunil Mushran 提交于 10月 06, 2010

ocfs2/cluster: Create debugfs files for live, quorum and failed region bitmaps

This patch prints the bitmaps of live, quorum and failed regions. This
information will be useful in debugging cluster issues.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

a6de0136

08 10月, 2010 1 次提交

· b1c5ebfb

由 Sunil Mushran 提交于 10月 07, 2010

ocfs2/cluster: Maintain bitmap of failed regions

In global heartbeat mode, we track the bitmap of regions that have seen
heartbeat timeouts. We fence if the number of such regions is greater than
or equal to half the number of quorum regions.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

b1c5ebfb

07 10月, 2010 2 次提交

· 43182d2a

由 Sunil Mushran 提交于 10月 06, 2010

ocfs2/cluster: Maintain bitmap of quorum regions

o2hb allows online adding of regions. However, a newly added region is not
used in quorum calculations unless it has been added on all nodes. This patch
tracks a bitmap of such quorum regions.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

43182d2a

· e7d656ba

由 Sunil Mushran 提交于 10月 06, 2010

ocfs2/cluster: Track bitmap of live heartbeat regions

A heartbeat region becomes live (or active) after a fixed number of (steady)
iterations.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

e7d656ba

08 10月, 2010 1 次提交

· 536f0741

由 Sunil Mushran 提交于 10月 07, 2010

ocfs2/cluster: Track number of global heartbeat regions

In global heartbeat mode, we have a upper limit for the number of active regions.
This patch adds the facility to track the number of active global heartbeat
regions and fails to start heartbeat if the number exceeds the maximum.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

536f0741

07 10月, 2010 1 次提交

· 823a637a

由 Sunil Mushran 提交于 10月 06, 2010

ocfs2/cluster: Maintain live node bitmap per heartbeat region

Currently we track a global livenode bitmap that keeps track of all nodes
that are heartbeating in all regions.

This patch adds the ability to track the livenode bitmap on a per region basis.
We will use this facility in a later patch to allow us to withstand the loss of
a minority number of regions.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

823a637a

08 10月, 2010 3 次提交

· 8ca8b0bb

由 Sunil Mushran 提交于 10月 07, 2010

ocfs2/cluster: Reorganize o2hb debugfs init

o2hb debugfs handling is reorganized to allow for easy expansion.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

8ca8b0bb

· 0e105d37

由 Sunil Mushran 提交于 10月 07, 2010

ocfs2/cluster: Check slots for unconfigured live nodes

o2hb currently checks slots for configured nodes only. This patch makes
it check the slots for the live nodes too to take care of a race in which
a node is removed from the configuration but not from the live map.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

0e105d37

S
ocfs2/cluster: Print messages when adding/removing nodes · 39a29856
由 Sunil Mushran 提交于 10月 07, 2010
```
Prints messages when the user adds or removes nodes.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>
```
39a29856

07 10月, 2010 1 次提交

· 18c50cb0

由 Sunil Mushran 提交于 10月 06, 2010

ocfs2/cluster: Print messages when adding/removing heartbeat regions

Prints messages when the user adds or removes heartbeat regions in global
heartbeat mode. These messages are useful when debugging cluster related issues.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

18c50cb0

08 10月, 2010 1 次提交

· 18cfdf1b

由 Sunil Mushran 提交于 10月 07, 2010

ocfs2/dlm: Add message DLM_QUERY_NODEINFO

Adds new dlm message DLM_QUERY_NODEINFO that sends the attributes of all
registered nodes. This message is sent if the negotiated dlm protocol is
1.1 or higher. If the information of the joining node does not match
that of any existing nodes, the join domain request is rejected.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

18cfdf1b

07 10月, 2010 1 次提交

· 5f3c6d9c

由 Sunil Mushran 提交于 10月 06, 2010

ocfs2: Print message if user mounts without starting global heartbeat

In global heartbeat mode, the heartbeat is started by the user. This patch
prints an error if the user attempts to mount a volume without starting the
heartbeat.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

5f3c6d9c

10 10月, 2010 1 次提交

· ea203441

由 Sunil Mushran 提交于 10月 09, 2010

ocfs2/dlm: Add message DLM_QUERY_REGION

Adds new dlm message DLM_QUERY_REGION that sends the names of all active
heartbeat regions. This message is only sent in the global heartbeat
mode. If the regions in the joining node do not fully match the ones in
the active nodes, the join domain request is rejected.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

ea203441

08 10月, 2010 1 次提交

· b3c85c4c

由 Sunil Mushran 提交于 10月 07, 2010

ocfs2/cluster: Get all heartbeat regions

Export function in o2hb to get a list of heartbeat regions. It also adds an
upper limit to the length of the heartbeat region name.

o2hb_global_heartbeat_active() currently disables global heartbeat. It will
be enabled in a later patch after all the code is added.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

b3c85c4c

07 10月, 2010 1 次提交

· b1365d0b

由 Sunil Mushran 提交于 10月 06, 2010

ocfs2/dlm: Expose dlm_protocol in dlm_state

Add dlm_protocol to the list of info shown by the debugfs file, dlm_state.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

b1365d0b

08 10月, 2010 1 次提交

· 2c442719

由 Sunil Mushran 提交于 10月 07, 2010

ocfs2: Add support for heartbeat=global mount option

Adds support for heartbeat=global mount option. It ensures that the heartbeat
mode passed matches the one enabled on disk.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

2c442719

10 10月, 2010 1 次提交

· 98f486f2

由 Sunil Mushran 提交于 10月 09, 2010

ocfs2: Add an incompat feature flag OCFS2_FEATURE_INCOMPAT_CLUSTERINFO

OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for
both userspace and o2cb cluster stacks. It also allows us to extend cluster
info to include stack flags.

This patch also adds stackflags to sb->s_clusterinfo. It also introduces a
clusterinfo flag OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT to denote the enabled
global heartbeat mode.

This incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The
clusterinfo flag is set/cleared using tunefs.ocfs2 --update-cluster-stack.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

98f486f2

08 10月, 2010 1 次提交

· 54b5187b

由 Sunil Mushran 提交于 10月 07, 2010

ocfs2/cluster: Add heartbeat mode configfs parameter

Add heartbeat mode parameter to the configfs tree. This will be used
to set/show the heartbeat mode. The user is free to toggle the mode
between local and global as long as there is no active heartbeat region.
Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>

54b5187b

04 10月, 2010 2 次提交

writeback: always use sb->s_bdi for writeback purposes · aaead25b

由 Christoph Hellwig 提交于 10月 04, 2010

We currently use struct backing_dev_info for various different purposes.
Originally it was introduced to describe a backing device which includes
an unplug and congestion function and various bits of readahead information
and VM-relevant flags. We're also using for tracking dirty inodes for
writeback.

To make writeback properly find all inodes we need to only access the
per-filesystem backing_device pointed to by the superblock in ->s_bdi
inside the writeback code, and not the instances pointeded to by
inode->i_mapping->backing_dev which can be overriden by special devices
or might not be set at all by some filesystems.

Long term we should split out the writeback-relevant bits of struct
backing_device_info (which includes more than the current bdi_writeback)
and only point to it from the superblock while leaving the traditional
backing device as a separate structure that can be overriden by devices.

The one exception for now is the block device filesystem which really
wants different writeback contexts for it's different (internal) inodes
to handle the writeout more efficiently. For now we do this with
a hack in fs-writeback.c because we're so late in the cycle, but in
the future I plan to replace this with a superblock method that allows
for multiple writeback contexts per filesystem.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

aaead25b

fuse: Initialize total_len in fuse_retrieve() · 0157443c

由 Geert Uytterhoeven 提交于 9月 30, 2010

fs/fuse/dev.c:1357: warning: ‘total_len’ may be used uninitialized in this
function

Initialize total_len to zero, else its value will be undefined.
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

0157443c

02 10月, 2010 4 次提交

reiserfs: fix unwanted reiserfs lock recursion · 9d8117e7

由 Frederic Weisbecker 提交于 9月 30, 2010

Prevent from recursively locking the reiserfs lock in reiserfs_unpack()
because we may call journal_begin() that requires the lock to be taken
only once, otherwise it won't be able to release the lock while taking
other mutexes, ending up in inverted dependencies between the journal
mutex and the reiserfs lock for example.

This fixes:

  =======================================================
  [ INFO: possible circular locking dependency detected ]
  2.6.35.4.4a #3
  -------------------------------------------------------
  lilo/1620 is trying to acquire lock:
   (&journal->j_mutex){+.+...}, at: [<d0325bff>] do_journal_begin_r+0x7f/0x340 [reiserfs]

  but task is already holding lock:
   (&REISERFS_SB(s)->lock){+.+.+.}, at: [<d032a278>] reiserfs_write_lock+0x28/0x40 [reiserfs]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
         [<c10562b7>] lock_acquire+0x67/0x80
         [<c12facad>] __mutex_lock_common+0x4d/0x410
         [<c12fb0c8>] mutex_lock_nested+0x18/0x20
         [<d032a278>] reiserfs_write_lock+0x28/0x40 [reiserfs]
         [<d0325c06>] do_journal_begin_r+0x86/0x340 [reiserfs]
         [<d0325f77>] journal_begin+0x77/0x140 [reiserfs]
         [<d0315be4>] reiserfs_remount+0x224/0x530 [reiserfs]
         [<c10b6a20>] do_remount_sb+0x60/0x110
         [<c10cee25>] do_mount+0x625/0x790
         [<c10cf014>] sys_mount+0x84/0xb0
         [<c12fca3d>] syscall_call+0x7/0xb

  -> #0 (&journal->j_mutex){+.+...}:
         [<c10560f6>] __lock_acquire+0x1026/0x1180
         [<c10562b7>] lock_acquire+0x67/0x80
         [<c12facad>] __mutex_lock_common+0x4d/0x410
         [<c12fb0c8>] mutex_lock_nested+0x18/0x20
         [<d0325bff>] do_journal_begin_r+0x7f/0x340 [reiserfs]
         [<d0325f77>] journal_begin+0x77/0x140 [reiserfs]
         [<d0326271>] reiserfs_persistent_transaction+0x41/0x90 [reiserfs]
         [<d030d06c>] reiserfs_get_block+0x22c/0x1530 [reiserfs]
         [<c10db9db>] __block_prepare_write+0x1bb/0x3a0
         [<c10dbbe6>] block_prepare_write+0x26/0x40
         [<d030b738>] reiserfs_prepare_write+0x88/0x170 [reiserfs]
         [<d03294d6>] reiserfs_unpack+0xe6/0x120 [reiserfs]
         [<d0329782>] reiserfs_ioctl+0x272/0x320 [reiserfs]
         [<c10c3188>] vfs_ioctl+0x28/0xa0
         [<c10c3bbd>] do_vfs_ioctl+0x32d/0x5c0
         [<c10c3eb3>] sys_ioctl+0x63/0x70
         [<c12fca3d>] syscall_call+0x7/0xb

  other info that might help us debug this:

  2 locks held by lilo/1620:
   #0:  (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [<d032945a>] reiserfs_unpack+0x6a/0x120 [reiserfs]
   #1:  (&REISERFS_SB(s)->lock){+.+.+.}, at: [<d032a278>] reiserfs_write_lock+0x28/0x40 [reiserfs]

  stack backtrace:
  Pid: 1620, comm: lilo Not tainted 2.6.35.4.4a #3
  Call Trace:
   [<c10560f6>] __lock_acquire+0x1026/0x1180
   [<c10562b7>] lock_acquire+0x67/0x80
   [<c12facad>] __mutex_lock_common+0x4d/0x410
   [<c12fb0c8>] mutex_lock_nested+0x18/0x20
   [<d0325bff>] do_journal_begin_r+0x7f/0x340 [reiserfs]
   [<d0325f77>] journal_begin+0x77/0x140 [reiserfs]
   [<d0326271>] reiserfs_persistent_transaction+0x41/0x90 [reiserfs]
   [<d030d06c>] reiserfs_get_block+0x22c/0x1530 [reiserfs]
   [<c10db9db>] __block_prepare_write+0x1bb/0x3a0
   [<c10dbbe6>] block_prepare_write+0x26/0x40
   [<d030b738>] reiserfs_prepare_write+0x88/0x170 [reiserfs]
   [<d03294d6>] reiserfs_unpack+0xe6/0x120 [reiserfs]
   [<d0329782>] reiserfs_ioctl+0x272/0x320 [reiserfs]
   [<c10c3188>] vfs_ioctl+0x28/0xa0
   [<c10c3bbd>] do_vfs_ioctl+0x32d/0x5c0
   [<c10c3eb3>] sys_ioctl+0x63/0x70
   [<c12fca3d>] syscall_call+0x7/0xb
Reported-by: NJarek Poplawski <jarkao2@gmail.com>
Tested-by: NJarek Poplawski <jarkao2@gmail.com>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: All since 2.6.32 <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9d8117e7

reiserfs: fix dependency inversion between inode and reiserfs mutexes · 3f259d09

由 Frederic Weisbecker 提交于 9月 30, 2010

The reiserfs mutex already depends on the inode mutex, so we can't lock
the inode mutex in reiserfs_unpack() without using the safe locking API,
because reiserfs_unpack() is always called with the reiserfs mutex locked.

This fixes:

  =======================================================
  [ INFO: possible circular locking dependency detected ]
  2.6.35c #13
  -------------------------------------------------------
  lilo/1606 is trying to acquire lock:
   (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [<d0329450>] reiserfs_unpack+0x60/0x110 [reiserfs]

  but task is already holding lock:
   (&REISERFS_SB(s)->lock){+.+.+.}, at: [<d032a268>] reiserfs_write_lock+0x28/0x40 [reiserfs]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
         [<c1056347>] lock_acquire+0x67/0x80
         [<c12f083d>] __mutex_lock_common+0x4d/0x410
         [<c12f0c58>] mutex_lock_nested+0x18/0x20
         [<d032a268>] reiserfs_write_lock+0x28/0x40 [reiserfs]
         [<d0329e9a>] reiserfs_lookup_privroot+0x2a/0x90 [reiserfs]
         [<d0316b81>] reiserfs_fill_super+0x941/0xe60 [reiserfs]
         [<c10b7d17>] get_sb_bdev+0x117/0x170
         [<d0313e21>] get_super_block+0x21/0x30 [reiserfs]
         [<c10b74ba>] vfs_kern_mount+0x6a/0x1b0
         [<c10b7659>] do_kern_mount+0x39/0xe0
         [<c10cebe0>] do_mount+0x340/0x790
         [<c10cf0b4>] sys_mount+0x84/0xb0
         [<c12f25cd>] syscall_call+0x7/0xb

  -> #0 (&sb->s_type->i_mutex_key#8){+.+.+.}:
         [<c1056186>] __lock_acquire+0x1026/0x1180
         [<c1056347>] lock_acquire+0x67/0x80
         [<c12f083d>] __mutex_lock_common+0x4d/0x410
         [<c12f0c58>] mutex_lock_nested+0x18/0x20
         [<d0329450>] reiserfs_unpack+0x60/0x110 [reiserfs]
         [<d0329772>] reiserfs_ioctl+0x272/0x320 [reiserfs]
         [<c10c3228>] vfs_ioctl+0x28/0xa0
         [<c10c3c5d>] do_vfs_ioctl+0x32d/0x5c0
         [<c10c3f53>] sys_ioctl+0x63/0x70
         [<c12f25cd>] syscall_call+0x7/0xb

  other info that might help us debug this:

  1 lock held by lilo/1606:
   #0:  (&REISERFS_SB(s)->lock){+.+.+.}, at: [<d032a268>] reiserfs_write_lock+0x28/0x40 [reiserfs]

  stack backtrace:
  Pid: 1606, comm: lilo Not tainted 2.6.35c #13
  Call Trace:
   [<c1056186>] __lock_acquire+0x1026/0x1180
   [<c1056347>] lock_acquire+0x67/0x80
   [<c12f083d>] __mutex_lock_common+0x4d/0x410
   [<c12f0c58>] mutex_lock_nested+0x18/0x20
   [<d0329450>] reiserfs_unpack+0x60/0x110 [reiserfs]
   [<d0329772>] reiserfs_ioctl+0x272/0x320 [reiserfs]
   [<c10c3228>] vfs_ioctl+0x28/0xa0
   [<c10c3c5d>] do_vfs_ioctl+0x32d/0x5c0
   [<c10c3f53>] sys_ioctl+0x63/0x70
   [<c12f25cd>] syscall_call+0x7/0xb
Reported-by: NJarek Poplawski <jarkao2@gmail.com>
Tested-by: NJarek Poplawski <jarkao2@gmail.com>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: <stable@kernel.org>		[2.6.32 and later]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3f259d09

proc: make /proc/pid/limits world readable · 3036e7b4

由 Jiri Olsa 提交于 9月 30, 2010

Having the limits file world readable will ease the task of system
management on systems where root privileges might be restricted.

Having admin restricted with root priviledges, he/she could not check
other users process' limits.

Also it'd align with most of the /proc stat files.
Signed-off-by: NJiri Olsa <jolsa@redhat.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Cc: Eugene Teo <eugene@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3036e7b4

cifs: prevent infinite recursion in cifs_reconnect_tcon · f569599a

由 Jeff Layton 提交于 9月 29, 2010

cifs_reconnect_tcon is called from smb_init. After a successful
reconnect, cifs_reconnect_tcon will call reset_cifs_unix_caps. That
function will, in turn call CIFSSMBQFSUnixInfo and CIFSSMBSetFSUnixInfo.
Those functions also call smb_init.

It's possible for the session and tcon reconnect to succeed, and then
for another cifs_reconnect to occur before CIFSSMBQFSUnixInfo or
CIFSSMBSetFSUnixInfo to be called. That'll cause those functions to call
smb_init and cifs_reconnect_tcon again, ad infinitum...

Break the infinite recursion by having those functions use a new
smb_init variant that doesn't attempt to perform a reconnect.
Reported-and-Tested-by: NMichal Suchanek <hramrach@centrum.cz>
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NSteve French <sfrench@us.ibm.com>

f569599a

30 9月, 2010 2 次提交

ocfs2: Don't walk off the end of fast symlinks. · 1fc8a117

由 Joel Becker 提交于 9月 29, 2010

ocfs2 fast symlinks are NUL terminated strings stored inline in the
inode data area.  However, disk corruption or a local attacker could, in
theory, remove that NUL.  Because we're using strlen() (my fault,
introduced in a731d1 when removing vfs_follow_link()), we could walk off
the end of that string.
Signed-off-by: NJoel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org

1fc8a117

cifs: set backing_dev_info on new S_ISREG inodes · 522440ed

由 Jeff Layton 提交于 9月 29, 2010

Testing on very recent kernel (2.6.36-rc6) made this warning pop:

    WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x65/0x70()
    Hardware name:
    Dirtiable inode bdi default != sb bdi cifs

...the following patch fixes it and seems to be the obviously correct
thing to do for cifs.

Cc: stable@kernel.org
Acked-by: NDave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NSteve French <sfrench@us.ibm.com>

522440ed

29 9月, 2010 1 次提交

xfs: force background CIL push under sustained load · 80168676

由 Dave Chinner 提交于 9月 24, 2010

I have been seeing occasional pauses in transaction throughput up to
30s long under heavy parallel workloads. The only notable thing was
that the xfsaild was trying to be active during the pauses, but
making no progress. It was running exactly 20 times a second (on the
50ms no-progress backoff), and the number of pushbuf events was
constant across this time as well.  IOWs, the xfsaild appeared to be
stuck on buffers that it could not push out.

Further investigation indicated that it was trying to push out inode
buffers that were pinned and/or locked. The xfsbufd was also getting
woken at the same frequency (by the xfsaild, no doubt) to push out
delayed write buffers. The xfsbufd was not making any progress
because all the buffers in the delwri queue were pinned. This scan-
and-make-no-progress dance went one in the trace for some seconds,
before the xfssyncd came along an issued a log force, and then
things started going again.

However, I noticed something strange about the log force - there
were way too many IO's issued. 516 log buffers were written, to be
exact. That added up to 129MB of log IO, which got me very
interested because it's almost exactly 25% of the size of the log.
He delayed logging code is suppose to aggregate the minimum of 25%
of the log or 8MB worth of changes before flushing. That's what
really puzzled me - why did a log force write 129MB instead of only
8MB?

Essentially what has happened is that no CIL pushes had occurred
since the previous tail push which cleared out 25% of the log space.
That caused all the new transactions to block because there wasn't
log space for them, but they kick the xfsaild to push the tail.
However, the xfsaild was not making progress because there were
buffers it could not lock and flush, and the xfsbufd could not flush
them because they were pinned. As a result, both the xfsaild and the
xfsbufd could not move the tail of the log forward without the CIL
first committing.

The cause of the problem was that the background CIL push, which
should happen when 8MB of aggregated changes have been committed, is
being held off by the concurrent transaction commit load. The
background push does a down_write_trylock() which will fail if there
is a concurrent transaction commit holding the push lock in read
mode. With 8 CPUs all doing transactions as fast as they can, there
was enough concurrent transaction commits to hold off the background
push until tail-pushing could no longer free log space, and the halt
would occur.

It should be noted that there is no reason why it would halt at 25%
of log space used by a single CIL checkpoint. This bug could
definitely violate the "no transaction should be larger than half
the log" requirement and hence result in corruption if the system
crashed under heavy load. This sort of bug is exactly the reason why
delayed logging was tagged as experimental....

The fix is to start blocking background pushes once the threshold
has been exceeded. Rework the threshold calculations to keep the
amount of log space a CIL checkpoint can use to below that of the
AIL push threshold to avoid the problem completely.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NAlex Elder <aelder@sgi.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

80168676

24 9月, 2010 5 次提交

o2dlm: force free mles during dlm exit · 5dad6c39

由 Srinivas Eeda 提交于 9月 21, 2010

While umounting, a block mle doesn't get freed if dlm is shutdown after
master request is received but before assert master. This results in unclean
shutdown of dlm domain.

This patch frees all mles that lie around after other nodes were notified about
exiting the dlm and marking dlm state as leaving. Only block mles are expected
to be around, so we log ERROR for other mles but still free them.
Signed-off-by: NSrinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: NJoel Becker <joel.becker@oracle.com>

5dad6c39

ocfs2: Sync inode flags with ext2. · 0000b862

由 Tao Ma 提交于 9月 19, 2010

We sync our inode flags with ext2 and define them by hex
values. But actually in commit 36695673(4 years ago), all
these values are moved to include/linux/fs.h. So we'd
better also use them as what ext2 did. So sync our inode
flags with ext2 by using FS_*.
Signed-off-by: NTao Ma <tao.ma@oracle.com>
Signed-off-by: NJoel Becker <joel.becker@oracle.com>

0000b862

ocfs2: Move 'wanted' into parens of ocfs2_resmap_resv_bits. · 4a452de4

由 Tao Ma 提交于 9月 19, 2010

The first time I read the function ocfs2_resmap_resv_bits, I consider
about what 'wanted' will be used and consider about the comments.
Then I find it is only used if the reservation is empty. ;)

So we'd better move it to the parens so that it make the code more
readable, what's more, ocfs2_resmap_resv_bits is used so frequently
and we should save some cpus.
Acked-by: NMark Fasheh <mfasheh@suse.com>
Signed-off-by: NTao Ma <tao.ma@oracle.com>
Signed-off-by: NJoel Becker <joel.becker@oracle.com>

4a452de4

ocfs2: Use cpu_to_le16 for e_leaf_clusters in ocfs2_bg_discontig_add_extent. · 47dea423

由 Tao Ma 提交于 9月 13, 2010

e_leaf_clusters is a le16, so use cpu_to_le16 instead
of cpu_to_le32.

What's more, we change 'clusters' to unsigned int to
signify that the size of 'clusters' isn't important here.
Signed-off-by: NTao Ma <tao.ma@oracle.com>
Signed-off-by: NJoel Becker <joel.becker@oracle.com>

47dea423

ocfs2: update ctime when changing the file's permission by setfacl · 12828061

由 Tao Ma 提交于 9月 13, 2010

In commit 30e2bab2, ext3 fixed it. So change it accordingly in ocfs2.

Steps to reproduce:
# touch aaa
# stat -c %Z aaa
1283760364
# setfacl -m  'u::x,g::x,o::x' aaa
# stat -c %Z aaa
1283760364
Signed-off-by: NTao Ma <tao.ma@oracle.com>
Signed-off-by: NJoel Becker <joel.becker@oracle.com>

12828061

23 9月, 2010 4 次提交

/proc/pid/smaps: fix dirty pages accounting · 1c2499ae

由 KOSAKI Motohiro 提交于 9月 22, 2010

Currently, /proc/<pid>/smaps has wrong dirty pages accounting.
Shared_Dirty and Private_Dirty output only pte dirty pages and ignore
PG_dirty page flag.  It is difference against documentation, but also
inconsistent against Referenced field.  (Referenced checks both pte and
page flags)

This patch fixes it.

Test program:

 large-array.c
 ---------------------------------------------------
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>

 char array[1*1024*1024*1024L];

 int main(void)
 {
         memset(array, 1, sizeof(array));
         pause();

         return 0;
 }
 ---------------------------------------------------

Test case:
 1. run ./large-array
 2. cat /proc/`pidof large-array`/smaps
 3. swapoff -a
 4. cat /proc/`pidof large-array`/smaps again

Test result:
 <before patch>

00601000-40601000 rw-p 00000000 00:00 0
Size:            1048576 kB
Rss:             1048576 kB
Pss:             1048576 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:    218992 kB   <-- showed pages as clean incorrectly
Private_Dirty:    829584 kB
Referenced:       388364 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB

 <after patch>

00601000-40601000 rw-p 00000000 00:00 0
Size:            1048576 kB
Rss:             1048576 kB
Pss:             1048576 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:   1048576 kB  <-- fixed
Referenced:       388480 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1c2499ae

aio: do not return ERESTARTSYS as a result of AIO · a0c42bac

由 Jan Kara 提交于 9月 22, 2010

OCFS2 can return ERESTARTSYS from its write function when the process is
signalled while waiting for a cluster lock (and the filesystem is mounted
with intr mount option).  Generally, it seems reasonable to allow
filesystems to return this error code from its IO functions.  As we must
not leak ERESTARTSYS (and similar error codes) to userspace as a result of
an AIO operation, we have to properly convert it to EINTR inside AIO code
(restarting the syscall isn't really an option because other AIO could
have been already submitted by the same io_submit syscall).
Signed-off-by: NJan Kara <jack@suse.cz>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a0c42bac

/proc/vmcore: fix seeking · c227e690

由 Arnd Bergmann 提交于 9月 22, 2010

Commit 73296bc6 ("procfs: Use generic_file_llseek in /proc/vmcore")
broke seeking on /proc/vmcore.  This changes it back to use default_llseek
in order to restore the original behaviour.

The problem with generic_file_llseek is that it only allows seeks up to
inode->i_sb->s_maxbytes, which is zero on procfs and some other virtual
file systems.  We should merge generic_file_llseek and default_llseek some
day and clean this up in a proper way, but for 2.6.35/36, reverting vmcore
is the safer solution.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: NCAI Qian <caiqian@redhat.com>
Tested-by: NCAI Qian <caiqian@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c227e690

Prevent freeing uninitialized pointer in compat_do_readv_writev · 767b68e9

由 Dan Rosenberg 提交于 9月 22, 2010

In 32-bit compatibility mode, the error handling for
compat_do_readv_writev() may free an uninitialized pointer, potentially
leading to all sorts of ugly memory corruption.  This is reliably
triggerable by unprivileged users by invoking the readv()/writev()
syscalls with an invalid iovec pointer.  The below patch fixes this to
emulate the non-compat version.

Introduced by commit b8373363 ("compat: factor out
compat_rw_copy_check_uvector from compat_do_readv_writev")
Signed-off-by: NDan Rosenberg <dan.j.rosenberg@gmail.com>
Cc: stable@kernel.org (2.6.35)
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

767b68e9