Commits · fe2fd9ed5bf184f797412be8b86f4589d1b77bb8 · openeuler / raspberrypi-kernel

26 Oct, 2010 2 commits

exofs: Remove inode->i_count manipulation in exofs_new_inode · fe2fd9ed

Committed by Boaz Harrosh on Oct 16, 2010

exofs_new_inode() was incrementing the inode->i_count and
decrementing it in create_done(), in a bad attempt to make sure
the inode will still be there when the asynchronous create_done()
finally arrives. This was very stupid because iput() was not called,
and if it was actually needed, it would leak the inode.

However all this is not needed, because at exofs_evict_inode()
we already wait for create_done() by waiting for the
object_created event. Therefore remove the superfluous ref counting
and just thicken the comment at exofs_evict_inode() a bit.

While at it change places that open coded wait_obj_created()
to call the already available wrapper.

CC: Dave Chinner <dchinner@redhat.com>
CC: Christoph Hellwig <hch@lst.de>
CC: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

fe2fd9ed

fs/exofs: typo fix of faild to failed · 571f7f46

Committed by Joe Perches on Oct 21, 2010

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

571f7f46

19 Oct, 2010 2 commits

exofs: Set i_mapping->backing_dev_info anyway · 115e19c5

Committed by Boaz Harrosh on Oct 07, 2010

Though it has been promised that inode->i_mapping->backing_dev_info
is not used and that the supporting code is fine, until the pointer
defaults to NULL I'd rather it point to the correct thing
regardless.

At least for future infrastructure coders it is a clear indication
of where the key points are at which inodes are initialized.
I know because it took me time to find this out.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

115e19c5

exofs: Cleanup read path in regard with read_for_write · 7aebf410

Committed by Boaz Harrosh on Oct 13, 2010

Last BUG fix added a flag to the page_collect structure
to communicate with readpage_strip. This calls for a cleanup
removing that flag's reincarnations in the read functions'
parameters.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

7aebf410

08 Oct, 2010 1 commit

exofs: Fix double page_unlock BUG in write_begin/end · f17b1f9f

Committed by Boaz Harrosh on Oct 07, 2010

This BUG has been there since the first submission of the code, but was only
triggered in the latest kernel. It's timing related, due to the asynchronous
object-creation behaviour of exofs. (Which should be investigated further.)

The bug is obvious, hence the fix.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

f17b1f9f

10 Aug, 2010 3 commits

convert exofs to ->evict_inode() · 4ec70c9b

Committed by Al Viro on Jun 07, 2010

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4ec70c9b

exofs: New truncate sequence · 2f246fd0

Committed by Boaz Harrosh on Jun 09, 2010

These changes are crafted based on the similar
conversion done to ext2 by Nick Piggin.

* Remove the deprecated ->truncate vector. Let exofs_setattr
  take care of on-disk size updates.
* Call truncate_pagecache on the unused pages if
  write_begin/end fails.
* Clean up exofs_delete_inode, which did stupid inode
  writes and updates on an inode that will be
  removed.
* And finally get rid of exofs_get_block. We never
  had any blocks; it was all for calling nobh_truncate_page.
  nobh_truncate_page is not actually needed in exofs since
  the last page is complete and gone, just like all the other
  pages. There are no partial blocks in exofs.

I've tested with this patch, and there are no apparent
failures, so far.

CC: Nick Piggin <npiggin@suse.de>
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2f246fd0

remove inode_setattr · 1025774c

Committed by Christoph Hellwig on Jun 04, 2010

Replace inode_setattr with opencoded variants of it in all callers.  This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.

In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:

 spufs: explicitly checks for ATTR_SIZE earlier
 btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
 ufs: contains an opencoded simple_setattr + truncate that sets the filesize just above

In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed trimming down the opencoded variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1025774c

08 Aug, 2010 1 commit

block: unify flags for struct bio and struct request · 7b6d91da

Committed by Christoph Hellwig on Aug 07, 2010

Remove the current bio flags and reuse the request flags for the bio, too.
This makes it easier to trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superfluous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

7b6d91da

04 Aug, 2010 4 commits

exofs: Fix groups code when num_devices is not divisible by group_width · 5002dd18

Committed by Boaz Harrosh on Aug 02, 2010

There is a bug when num_devices is not divisible by group_width * mirrors.
We would not return to the proper device and offset when looping on to the
next group.

The fix actually makes the code simpler.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

5002dd18

exofs: Remove useless optimization · 6e31609b

Committed by Boaz Harrosh on Jul 29, 2010

We used to compact all used devices in an IO to the beginning
of the device array in an io_state. And keep a last device used
so in later loops we don't iterate on all device slots. This
does not prevent us from checking if slots are empty since in
reads we only read from a single mirror and jump to the next
mirror-set.

This optimization is marginal, and needlessly complicates the
code, especially when we will later want to support raid/456
with the same abstract code. So remove the distinction between
"dev" and "comp". Only "dev" is used, both as the device used
and as the index (component) in the device array.

[Note that now the io_state->dev member is redundant but I
 keep it because I might want to optimize by only IOing a
 single group, thus keeping group_width*mirrors devices
 in io_state; we now keep num-devices in each io_state]

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

6e31609b

exofs: exofs_file_fsync and exofs_file_flush correctness · b2848349

Committed by Boaz Harrosh on May 31, 2010

As per Christoph's advice: no need to call filemap_write_and_wait().
In exofs all metadata is at the inode so just writing the inode is
all that is needed. ->fsync implies this must be done synchronously.

But now exofs_file_fsync cannot be used by exofs_file_flush.
vfs_fsync() should do that job correctly.

FIXME: remove the sb_sync and fix that sb_update better.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

b2848349

exofs: Remove superfluous dependency on buffer_head and writeback · 85dc7878

Committed by Boaz Harrosh on May 31, 2010

exofs_releasepage && exofs_invalidatepage are never called.
Leave the WARN_ONs but remove any code. Remove the

cleanup other stale #includes.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

85dc7878

28 May, 2010 1 commit

drop unused dentry argument to ->fsync · 7ea80859

Committed by Christoph Hellwig on May 26, 2010

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7ea80859

22 May, 2010 1 commit

exofs: replace inode uid,gid,mode initialization with helper function · e00117f1

Committed by Dmitry Monakhov on Mar 04, 2010

Ack-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e00117f1

17 May, 2010 2 commits

exofs: confusion between kmap() and kmap_atomic() api · ddf08f4b

Committed by Dan Carpenter on May 07, 2010

For kmap_atomic() we call kunmap_atomic() on the returned pointer.
That's different from kmap() and kunmap() and so it's easy to get them
backwards.

Cc: Stable <stable@kernel.org>
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

ddf08f4b

exofs: Add default address_space_operations · 200b0700

Committed by Boaz Harrosh on Mar 22, 2010

All vectors of address_space_operations should be initialized
by the filesystem. Add the missing parts.

This is actually an optimization, by using
__set_page_dirty_nobuffers. The default, in case of NULL,
would be __set_page_dirty_buffers which has these extra if(s).

.releasepage && .invalidatepage should both not be called
because page_private() is NULL in exofs. Put a WARN_ON if
they are called, to indicate the Kernel has changed in this
regard, if and when it does.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

200b0700

30 Apr, 2010 1 commit

exofs: Fix "add bdi backing to mount session" fall out · 3c2023dd

Committed by Boaz Harrosh on Apr 29, 2010

The patch "add bdi backing to mount session" (b3d0ab7e)
has a bug in the placement of the bdi member at
struct exofs_sb_info. The layout member must be kept
last.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

3c2023dd

29 Apr, 2010 1 commit

exofs: Fix "add bdi backing to mount session" fall out · a36fed12

Committed by Boaz Harrosh on Apr 29, 2010

Commit b3d0ab7e ("exofs: add bdi backing
to mount session") has a bug in the placement of the bdi member at
struct exofs_sb_info.  The layout member must be kept last.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a36fed12

22 Apr, 2010 1 commit

exofs: add bdi backing to mount session · b3d0ab7e

Committed by Jens Axboe on Apr 22, 2010

This ensures that dirty data gets flushed properly.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

b3d0ab7e

30 Mar, 2010 1 commit

include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6

Committed by Tejun Heo on Mar 24, 2010

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and tries to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   widely available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

5a0e3ad6

06 Mar, 2010 1 commit

pass writeback_control to ->write_inode · a9185b41

Committed by Christoph Hellwig on Mar 05, 2010

This gives the filesystem more information about the writeback that
is happening.  Trond requested this for the NFS unstable write handling,
and other filesystems might benefit from this too by being able to
distinguish between the different callers in more detail.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a9185b41

28 Feb, 2010 11 commits

exofs: groups support · 50a76fd3

Committed by Boaz Harrosh on Feb 11, 2010

* _calc_stripe_info() changes to accommodate for grouping
  calculations. Returns additional information

* old _prepare_pages() becomes _prepare_one_group()
  which stores pages belonging to one device group.

* New _prepare_for_striping iterates on all groups calling
  _prepare_one_group().

* Enable mounting of groups data_maps (group_width != 0)

[QUESTION]
Which is faster, A or B?
A.	x += stride;
	x = x % width + first_x;

B.	x += stride;
	if (x > last_x)
		x = first_x;
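
To make the two variants in the question concrete, here is a minimal sketch that wraps an index inside the window [first_x, first_x + width). The function names and the window convention are assumptions for illustration and are not taken from the patch; variant A is normalized by subtracting first_x before the modulo, and variant B assumes the wrap should restart at first_x.

```c
#include <assert.h>

/* Variant A: modulo arithmetic, keeps the overshoot when wrapping. */
unsigned wrap_mod(unsigned x, unsigned stride,
		  unsigned first_x, unsigned width)
{
	return (x - first_x + stride) % width + first_x;
}

/* Variant B: compare and reset, always restarts at first_x. */
unsigned wrap_cmp(unsigned x, unsigned stride,
		  unsigned first_x, unsigned width)
{
	unsigned last_x = first_x + width - 1;

	x += stride;
	if (x > last_x)
		x = first_x;
	return x;
}
```

The two agree whenever the step lands exactly one past the window (for example with stride 1), and differ when the step overshoots by more than that, so which one is "faster" only matters once it is clear which behaviour is wanted.
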

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

50a76fd3

exofs: Prepare for groups · b367e78b

Committed by Boaz Harrosh on Feb 07, 2010

* Rename _offset_dev_unit_off() to _calc_stripe_info()
  and receive a struct for the output params

* In _prepare_for_striping we only need to call
  _calc_stripe_info() once. The other components
  are easy to calculate from that. This code
  was inspired by what's done in truncate.

* Some code shifts that make sense now but will make
  more sense when group support is added.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

b367e78b

exofs: Error recovery if object is missing from storage · 96391e2b

Committed by Boaz Harrosh on Feb 09, 2010

If an object is referenced by a directory but does not
exist on a target, it is a very serious corruption that
means:
1. Either a power failure with very slim chance of it
  happening. Because the directory update is always submitted
  much after object creation, but if a directory is written
  to one device and the object creation to another it might
  theoretically happen.
2. It only ever happened to me while developing with BUGs
  causing file corruption. Crashes could also cause it but
  they are more like case 1.

Either way the object does not exist, so data is surely lost.
If there is a mix-up in the obj-id or data-map, then lost objects
can be salvaged by off-line fsck. The only recoverable information
is the directory name. By letting it appear as a regular empty file,
with date==0 (1970 Jan 1st) and ownership set to root, we enable recovery
of the only useful information, and also enable deletion or over-write.
I can see how this can hurt.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

96391e2b

exofs: convert io_state to use pages array instead of bio at input · 86093aaf

Committed by Boaz Harrosh on Jan 28, 2010

* inode.c operations are full-pages based, and not actually
  true scatter-gather
* Lets us use more pages at once, up to 512 (from 249) in 64 bit
* Brings us much much closer to being able to use exofs's io_state engine
  from the objlayout driver. (Once I decide where to put the common code)

After the RAID0 patch the outer (input) bio was never used as a bio, but
was simply a page carrier into the raid engine. Even in the simple
mirror/single-dev arrangement pages info was copied into a second bio.
It is now easier to just pass a pages array into the io_state and prepare
bio(s) once.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

86093aaf

exofs: RAID0 support · 5d952b83

Committed by Boaz Harrosh on Feb 01, 2010

We now support striping over mirror devices. Including variable sized
stripe_unit.

Some limits:
* stripe_unit must be a multiple of PAGE_SIZE
* stripe_unit * stripe_count must fit in 32 bits (at most 4GB)
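
The two limits can be expressed as a small validity check. This is only a sketch: the function name, the fixed 4096-byte page size, and the rejection of zero values are assumptions made for illustration, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, illustration only */

/* Returns true when a layout satisfies both limits listed above. */
bool layout_limits_ok(uint64_t stripe_unit, uint64_t stripe_count)
{
	if (stripe_unit == 0 || stripe_count == 0)
		return false;
	if (stripe_unit % SKETCH_PAGE_SIZE != 0)
		return false;
	/* Dividing instead of multiplying avoids overflowing the product. */
	if (stripe_unit > UINT32_MAX / stripe_count)
		return false;
	return true;
}
```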

Tested RAID0 over mirrors, RAID0 only, mirrors only. All check.

Design notes:
* I'm not using a vectored raid-engine mechanism yet. Following the
  pnfs-objects-layout data-map structure, "Mirror" is just a special
  case of "group_width" == 1, and RAID0 is a special case of
  "Mirrors" == 1. The performance loss of the general case over the
  particular special-case optimization is totally negligible, also
  considering the extra code size.

* In general I added a prepare_stripes() stage that divides the
  to-be-io pages among the participating devices. The previous
  exofs_ios_write/read now becomes _write/read_mirrors, and a new
  write/read upper layer loops on all devices calling
  _write/read_mirrors. Effectively the prepare_stripes stage is the whole
  secret.
  Also truncate needs fixing to accommodate striping.

* In a RAID0 arrangement, in a regular usage scenario, if all inode
  layouts will start at the same device, the small files fill up the
  first device and the later devices stay empty, the farther the device
  the emptier it is.

  To fix that, each inode will start at a different stripe_unit,
  according to its obj_id modulo number-of-stripe-units. And
  will then span all stripe-units in the same incrementing order
  wrapping back to the beginning of the device table. We call it
  a stripe-units moving window.

  Special consideration was taken to keep all devices in a mirror
  arrangement identical. So a broken osd-device could just be cloned
  from one of the mirrors and no FS scrubbing is needed. (We do that
  by rotating stripe-unit at a time and not a single device at a time.)
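
The moving window described in the last note can be sketched as follows. The function name, the parameter names, and the convention that `mirrors` counts every copy of a stripe unit (so 1 means no mirroring) are assumptions for illustration; the real device-table layout may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Device-table index of the first device in the mirror set that holds
 * stripe unit number `unit_no` of object `obj_id`.
 * group_width: number of mirror sets (stripe units) in the table.
 * mirrors:     devices per mirror set.
 */
unsigned first_dev_of_unit(uint64_t obj_id, uint64_t unit_no,
			   unsigned group_width, unsigned mirrors)
{
	/* The window start depends only on the object id. */
	uint64_t start = obj_id % group_width;
	/* Advance one whole mirror set per stripe unit, wrapping around. */
	uint64_t set = (start + unit_no) % group_width;

	return (unsigned)(set * mirrors);
}
```

Because the rotation moves by a whole mirror set rather than by a single device, every device inside a set sees the same sequence of stripe units, which is what keeps mirrors identical.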

TODO:
 We no longer verify object_length == inode->i_size in exofs_iget.
 (since i_size is striped on multiple objects now).
 I should introduce a multiple-device attribute reading, and use
 it in exofs_iget.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

5d952b83

exofs: Define on-disk per-inode optional layout attribute · d9c740d2

Committed by Boaz Harrosh on Jan 28, 2010

* Layouts describe the way a file is spread on multiple devices.
  The layout information is stored in the objects attribute introduced
  in this patch.

* There can be multiple generating functions for the layout.
  Currently defined:
    - No attribute present - use below moving-window on global
      device table, all devices.
      (This is the only one currently used in exofs)
    - an obj_id generated moving window - the obj_id is a randomizing
      factor in the otherwise global map layout.
    - An explicit layout stored, including a data_map and a device
      index list.
    - More might be defined in future ...

* There are two attributes defined of the same structure:
  A-data-files-layout - This layout is used by data-files. If present
                        at a directory, all files of that directory will
                        be created with this layout.
  A-meta-data-layout - This layout is used by a directory and other
                       meta-data information. Also inherited at creation
                       of subdirectories.

* At creation time inodes are created with the layout specified above.
  A usermode utility may change the creation layout on a given directory
  or file, which in the case of directories will also apply to newly
  created files/subdirectories, children of that directory.
  In the simple unaltered case of a newly created exofs, no layout
  attributes are present, and all layouts adhere to the layout specified
  at the device-table.

* In case of a future file system loaded in an old exofs-driver:
  At iget(), the generating_function is inspected and if not supported
  will return an IO error to the application and the inode will not
  be loaded, so as not to damage any data.
  Note: After this patch we do not yet support any type of layout;
        only the RAID0 patch that enables striping at the super-block
        level will add support for RAID0 layouts above. This way we
        are past and future compatible and fully bisectable.

* Access to the device table is done by an accessor since
  it will change according to above information.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

d9c740d2

exofs: unindent exofs_sbi_read · 46f4d973

Committed by Boaz Harrosh on Feb 01, 2010

The original idea was that a mirror read can be sub-divided
to multiple devices. But this has very little gain and only
at very large I/Os, so it's not going to be implemented soon.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

46f4d973

exofs: Move layout related members to a layout structure · 45d3abcb

Committed by Boaz Harrosh on Jan 28, 2010

* Abstract away those members in exofs_sb_info that are related/needed
  by a layout into a new exofs_layout structure. Embed it in exofs_sb_info.

* At exofs_io_state receive/keep a pointer to an exofs_layout. No need for
  an exofs_sb_info pointer, all we need is at exofs_layout.

* Change any usage of above exofs_sb_info members to their new name.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

45d3abcb

exofs: Recover in the case of read-past-end-of-file · 22ddc556

Committed by Boaz Harrosh on Jan 19, 2010

In check_io, implement the case of reading past the end of
file, by clearing the pages and recovering with no error. In
a raid arrangement this can become a legitimate situation
in case of holes in the file.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

22ddc556

exofs: Micro-optimize exofs_i_info · 518f167a

Committed by Boaz Harrosh on Jan 21, 2010

Optimize the exofs_i_info struct usage by moving the embedded
vfs_inode to be first. A compiler might optimize away an "add"
operation with constant zero. (Which it cannot with other constants.)
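
A small userspace sketch of why the member position matters: a container_of-style conversion subtracts the member offset, and that offset is zero only when the embedded struct comes first. All type and macro names below are hypothetical stand-ins, not the exofs definitions.

```c
#include <assert.h>
#include <stddef.h>

struct inode_sketch { long i_ino; };	/* stand-in for struct inode */

struct info_first {			/* embedded inode placed first */
	struct inode_sketch vfs_inode;
	unsigned long flags;
};

struct info_last {			/* embedded inode placed last */
	unsigned long flags;
	struct inode_sketch vfs_inode;
};

/* container_of-style conversion: subtract the member offset. */
#define SKETCH_CONTAINER_OF(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct info_first *to_info_first(struct inode_sketch *inode)
{
	/* offsetof() is 0 here, so the subtraction can be dropped. */
	return SKETCH_CONTAINER_OF(inode, struct info_first, vfs_inode);
}

struct info_last *to_info_last(struct inode_sketch *inode)
{
	/* Non-zero offset: a real pointer adjustment is required. */
	return SKETCH_CONTAINER_OF(inode, struct info_last, vfs_inode);
}
```
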

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

518f167a

exofs: debug print even less · 34ce4e7c

Committed by Boaz Harrosh on Dec 15, 2009

* Last debug trimming left in some stupid prints; remove them.
  Fix up some other prints
* Shift printing from inode.c to ios.c
* Add a couple of prints when memory allocation fails.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

34ce4e7c

05 Jan, 2010 2 commits

exofs: simple_write_end does not mark_inode_dirty · efd124b9

Committed by Boaz Harrosh on Dec 27, 2009

exofs uses simple_write_end() for its .write_end handler. But
it is not enough because simple_write_end() does not call
mark_inode_dirty() when it extends i_size. So even if we do
call mark_inode_dirty at beginning of write out, with a very
long IO and a saturated system we might get the .write_inode()
called while still extend-writing to file and miss out on the last
i_size updates.

So override .write_end, call simple_write_end(), and afterwards if
i_size was changed call mark_inode_dirty().

It stands to reason that since simple_write_end() was the one extending
i_size it should also call mark_inode_dirty(). But it looks like all
users of simple_write_end() are memory-bound pseudo filesystems, which
couldn't care less about mark_inode_dirty(). I might submit a
warning-comment patch to simple_write_end() in future.
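
The override pattern described above can be modelled in plain userspace C. This is a sketch under stated assumptions: `struct inode_model`, both function names, and the `dirty` flag are hypothetical stand-ins for the kernel's inode, simple_write_end() and mark_inode_dirty(); the real handlers take more arguments.

```c
#include <assert.h>
#include <stdbool.h>

struct inode_model {		/* hypothetical stand-in for struct inode */
	long long i_size;
	bool dirty;
};

/* Models simple_write_end(): may extend i_size, never marks dirty. */
int model_simple_write_end(struct inode_model *inode,
			   long long pos, int copied)
{
	long long last_pos = pos + copied;

	if (last_pos > inode->i_size)
		inode->i_size = last_pos;
	return copied;
}

/* Wrapper in the spirit of the fix: mark dirty only if i_size changed. */
int model_write_end(struct inode_model *inode, long long pos, int copied)
{
	long long old_size = inode->i_size;
	int ret = model_simple_write_end(inode, pos, copied);

	if (inode->i_size != old_size)
		inode->dirty = true;	/* stands in for mark_inode_dirty() */
	return ret;
}
```

Comparing the size before and after the inner call keeps the wrapper independent of how the inner helper decides to extend the file.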

CC: Stable <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

efd124b9

exofs: fix pnfs_osd re-definitions in pre-pnfs trees · 89be5030

Committed by Boaz Harrosh on Dec 21, 2009

Some on-disk exofs constants and types are defined in the pnfs_osd_xdr.h
file. Since we needed these types before the pnfs-objects code was
accepted to mainline we duplicated the minimal needed definitions into
an exofs local header. The definitions were conditionally included
depending on !CONFIG_PNFS defined. So if PNFS was present in the tree
definitions are taken from there and if not they are defined locally.

That was all good, but CONFIG_PNFS is planned to be included upstream
before pnfs-objects is also included. (The first pnfs batch might be
pnfs-files only.)

So condition exofs local definitions on the absence of pnfs_osd_xdr.h
inclusion (__PNFS_OSD_XDR_H__ not defined). User code must make sure
that in future pnfs_osd_xdr.h will be included before fs/exofs/pnfs.h,
which happens to be so in current code.
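
The include-guard condition described above follows a common preprocessor pattern, sketched below. Only the guard macro name __PNFS_OSD_XDR_H__ comes from the commit message; the enum, its value, and the SKETCH_ macro are hypothetical placeholders, and this translation unit deliberately does not include pnfs_osd_xdr.h.

```c
#include <assert.h>

/* pnfs_osd_xdr.h was not included here, so its guard is undefined. */
#ifndef __PNFS_OSD_XDR_H__
/* Local fallback definitions; names and values are hypothetical. */
enum sketch_raid_algorithm {
	SKETCH_RAID_0 = 1,
};
#define SKETCH_USING_LOCAL_DEFS 1
#else
/* The real header came first, so its definitions are used as-is. */
#define SKETCH_USING_LOCAL_DEFS 0
#endif

int using_local_defs(void)
{
	return SKETCH_USING_LOCAL_DEFS;
}
```

The pattern only works when the real header, if present, is included before the local one, which is the ordering requirement the message spells out.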

Once pnfs-objects hits mainline, exofs's local header will be removed.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

89be5030

10 Dec, 2009 5 commits

exofs: Multi-device mirror support · 04dc1e88

Committed by Boaz Harrosh on Nov 16, 2009

This patch changes on-disk format, it is accompanied with a parallel
patch to mkfs.exofs that enables multi-device capabilities.

After this patch, old exofs will refuse to mount a new formatted FS and
new exofs will refuse an old format. This is done by moving the magic
field offset inside the FSCB. A new FSCB *version* field was added. In
the future, exofs will refuse to mount unmatched FSCB version. To
up-grade or down-grade an exofs one must use mkfs.exofs --upgrade option
before mounting.

Introduced a new object that contains a *device-table*. This object
contains the default *data-map* and a linear array of device
information, which identifies the devices used in the filesystem. This
object is only written to offline by mkfs.exofs. This is why it is kept
separate from the FSCB, since the latter is written to while mounted.

Same partition number, same object number is used on all devices only
the device varies.

* Define the new format, then load the device table at mount time and make
  sure everything is supported.

* Change the I/O engine to now support Mirror IO, i.e. write the same data
  to multiple devices, read from a random device to spread the
  read-load from multiple clients (TODO: stripe read)

Implementation notes:
 A few points introduced in previous patch should be mentioned here:

* Special care was taken so that absolutely all operations that have any
  chance of failing are done before any osd-request is executed. This is to
  minimize the need for a data consistency recovery, to only real IO
  errors.

* Each IO state has a kref. It starts at 1, any osd-request executed
  will increment the kref, finally when all are executed the first ref
  is dropped. At IO-done, each request completion decrements the kref,
  the last one to return executes the internal _last_io() routine.
  _last_io() will call the registered io_state_done. On sync mode a
  caller does not supply a done method, indicating a synchronous
  request, the caller is put to sleep and a special io_state_done is
  registered that will awaken the caller. Though also in sync mode all
  operations are executed in parallel.
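
The reference-counting scheme in the last note can be modelled in userspace with C11 atomics. This is a sketch, not the exofs code: `struct io_state_model` and the ios_* helpers are hypothetical names, an atomic counter stands in for the kref, and a call counter stands in for _last_io() invoking io_state_done.

```c
#include <assert.h>
#include <stdatomic.h>

struct io_state_model {
	atomic_int ref;		/* plays the role of the kref */
	int done_calls;		/* counts how often the done hook ran */
};

void ios_init(struct io_state_model *ios)
{
	atomic_init(&ios->ref, 1);	/* the submitter's initial reference */
	ios->done_calls = 0;
}

/* Take one reference for each request before it is executed. */
void ios_get(struct io_state_model *ios)
{
	atomic_fetch_add(&ios->ref, 1);
}

/* Drop one reference; whoever drops the last one runs the done hook. */
void ios_put(struct io_state_model *ios)
{
	if (atomic_fetch_sub(&ios->ref, 1) == 1)
		ios->done_calls++;	/* stands in for _last_io() */
}
```

Starting the count at 1 means the done hook cannot fire while requests are still being submitted, even if early requests complete immediately; it fires only after the submitter drops its own reference and every completion has arrived.
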

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

04dc1e88

exofs: Move all operations to an io_engine · 06886a5a

Committed by Boaz Harrosh on Nov 08, 2009

In anticipation for multi-device operations, we separate osd operations
into an abstract I/O API. Currently only one device is used but later
when adding more devices, we will drive all devices in parallel according
to a "data_map" that describes how data is arranged on multiple devices.
The file system level operates, like before, as if there is one object
(inode-number) and an i_size. The io engine will split this to the same
object-number but on multiple devices.

At first we introduce Mirror (raid 1) layout. But at the final outcome
we intend to fully implement the pNFS-Objects data-map, including
raid 0,4,5,6 over mirrored devices, over multiple device-groups. And
more. See: http://tools.ietf.org/html/draft-ietf-nfsv4-pnfs-obj-12

* Define an io_state based API for accessing osd storage devices
  in an abstract way.
  Usage:
	First a caller allocates an io state with:
		exofs_get_io_state(struct exofs_sb_info *sbi,
				   struct exofs_io_state** ios);

	Then calls one of:
		exofs_sbi_create(struct exofs_io_state *ios);
		exofs_sbi_remove(struct exofs_io_state *ios);
		exofs_sbi_write(struct exofs_io_state *ios);
		exofs_sbi_read(struct exofs_io_state *ios);
		exofs_oi_truncate(struct exofs_i_info *oi, u64 new_len);

	And when done
		exofs_put_io_state(struct exofs_io_state *ios);

* Convert all source files to use this new API
* Convert from bio_alloc to bio_kmalloc
* In io engine we make use of the now fixed osd_req_decode_sense

There are no functional changes or on disk additions after this patch.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

06886a5a

exofs: move osd.c to ios.c · 8ce9bdd1

Committed by Boaz Harrosh on Nov 03, 2009

If I do a "git mv" together with a massive code change
and commit in one patch, git loses the rename and
records a delete/new instead. This is bad because I want
a rename recorded so later rebased/cherry-picked patches
to the old name will work. Also the --follow is lost.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

8ce9bdd1

exofs: statfs blocks is sectors not FS blocks · cae012d8

Committed by Boaz Harrosh on Nov 02, 2009

Even though exofs has a 4k block size, statfs blocks
is in sectors (512 bytes).

Also if the target returns 0 for capacity then make it
ULLONG_MAX. df does not like zero-size filesystems.
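
A minimal sketch of the conversion described above, assuming the capacity arrives in bytes and that the zero check happens before the division; the function and macro names are hypothetical and the actual statfs code may order things differently.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_SECTOR_SIZE 512ULL	/* statfs block unit described above */

/* Convert a capacity in bytes to the block count reported by statfs. */
unsigned long long statfs_blocks(unsigned long long capacity_bytes)
{
	if (capacity_bytes == 0)
		capacity_bytes = ULLONG_MAX;	/* target reported no capacity */
	return capacity_bytes / SKETCH_SECTOR_SIZE;
}
```
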

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

cae012d8

exofs: Prints on mount and unmount · 19fe294f

Committed by Boaz Harrosh on Sep 03, 2009

It is important to print in the logs when a filesystem was
mounted and eventually unmounted.

Print the osd-device's osd_name and pid the FS was
mounted/unmounted on.

TODO: How to also print the namespace path the filesystem was
      mounted on?

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>

19fe294f