Commit 80f21213, authored by Christoph Hellwig, committed by Martin K. Petersen

scsi: fs: remove exofs

This was an example for using the SCSI OSD protocol, which we're trying
to remove.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Parent 972248e9
===============================================================================
WHAT IS EXOFS?
===============================================================================
exofs is a file system that uses an OSD and exports the API of a normal Linux
file system. Users access exofs like any other local file system, and exofs
will in turn issue commands to the local OSD initiator.
OSD is a new T10 command set that views storage devices not as a large/flat
array of sectors but as a container of objects, each having a length, quota,
time attributes and more. Each object is addressed by a 64bit ID, and is
contained in a 64bit ID partition. Each object has associated attributes
attached to it, which are an integral part of the object and provide metadata about
the object. The standard defines some common obligatory attributes, but user
attributes can be added as needed.
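As a rough illustration of the addressing scheme described above, the sketch below models an object address as a (partition ID, object ID) pair and applies the mapping noted in exofs's common.h (inode# = object ID - EXOFS_OBJ_OFF). `struct obj_addr` and the helper names are hypothetical and exist only for this example; only the EXOFS_OBJ_OFF value is taken from the removed header.

```c
#include <assert.h>
#include <stdint.h>

/* Value copied from the removed common.h. */
#define EXOFS_OBJ_OFF 0x10000ULL

/* Hypothetical userspace model of an OSD object address:
 * a 64-bit partition ID plus a 64-bit object ID. */
struct obj_addr {
	uint64_t partition;
	uint64_t id;
};

/* inode number -> object ID */
static inline uint64_t ino_to_oid(uint64_t ino)
{
	return ino + EXOFS_OBJ_OFF;
}

/* object ID -> inode number (inode# = object ID - EXOFS_OBJ_OFF) */
static inline uint64_t oid_to_ino(uint64_t oid)
{
	assert(oid >= EXOFS_OBJ_OFF);	/* smaller IDs have no inode number */
	return oid - EXOFS_OBJ_OFF;
}

/* Build the address of the object backing inode `ino` in partition `pid`. */
static inline struct obj_addr addr_for_inode(uint64_t pid, uint64_t ino)
{
	struct obj_addr a = { .partition = pid, .id = ino_to_oid(ino) };
	return a;
}
```

Under this mapping the superblock object (EXOFS_SUPER_ID, 0x10000) corresponds to inode number 0 and the root directory object (EXOFS_ROOT_ID, 0x10002) to inode number 2.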
===============================================================================
ENVIRONMENT
===============================================================================
To use this file system, you need to have an object store to run it on. You
may download a target from:
http://open-osd.org
See Documentation/scsi/osd.txt for how to set up a working OSD environment.
===============================================================================
USAGE
===============================================================================
1. Download and compile exofs and open-osd initiator:
You need an external Kernel source tree or kernel headers from your
distribution (anything based on 2.6.26 or later).
a. download open-osd including exofs source using:
[parent-directory]$ git clone git://git.open-osd.org/open-osd.git
b. Build the library module like this:
[parent-directory]$ make -C KSRC=$(KER_DIR) open-osd
This will build both the open-osd initiator as well as the exofs kernel
module. Use whatever parameters you compiled your Kernel with and
$(KER_DIR) above pointing to the Kernel you compile against. See the file
open-osd/top-level-Makefile for an example.
2. Get the OSD initiator and target set up properly, and login to the target.
See Documentation/scsi/osd.txt for further instructions. Also see ./do-osd
for an example script that does all these steps.
3. Insmod the exofs.ko module:
[exofs]$ insmod exofs.ko
4. Make sure the directory where you want to mount exists. If not, create it.
(For example, mkdir /mnt/exofs)
5. At first run you will need to invoke the mkfs.exofs application.
As an example, this will create the file system on:
/dev/osd0 partition ID 65536
mkfs.exofs --pid=65536 --format /dev/osd0
The --format is optional. If not specified, no OSD_FORMAT will be
performed and a clean file system will be created in the specified pid,
in the available space of the target. (Use --format=size_in_meg to limit
the total LUN space available)
If pid already exists, it will be deleted and a new one will be created in
its place. Be careful.
An exofs lives inside a single OSD partition. You can create multiple exofs
filesystems on the same device using multiple pids.
(Run mkfs.exofs without any parameters for a usage help message.)
6. Mount the file system.
For example, to mount /dev/osd0, partition ID 0x10000 on /mnt/exofs:
mount -t exofs -o pid=65536 /dev/osd0 /mnt/exofs/
7. For reference (See do-exofs example script):
do-exofs start - an example of how to perform the above steps.
do-exofs stop - an example of how to unmount the file system.
do-exofs format - an example of how to format and mkfs a new exofs.
8. Extra compilation flags (uncomment in fs/exofs/Kbuild):
CONFIG_EXOFS_DEBUG - for debug messages and extra checks.
===============================================================================
exofs mount options
===============================================================================
Similar to any mount command:
mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory
Where:
-t exofs: specifies the exofs file system
/dev/osdX: X is a decimal number. /dev/osdX was created after a successful
login into an OSD target.
mount_exofs_directory: The directory to mount the file system on
exofs specific options: Options are separated by commas (,)
pid=<integer> - The partition number to mount/create as
container of the filesystem.
This option is mandatory. The integer can be given in
hex by prepending 0x to the number.
osdname=<id> - Mount by a device's osdname.
osdname is usually a 36 character uuid of the
form "d2683732-c906-4ee1-9dbd-c10c27bb40df".
It is one of the device's uuids specified in the
mkfs.exofs format command.
If this option is specified then the /dev/osdX
above can be empty and is ignored.
to=<integer> - Timeout in ticks for a single command.
default is (60 * HZ) [for debugging only]
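To make the pid option format concrete, here is a minimal userspace sketch of parsing a single "pid=<integer>" token that accepts both decimal and 0x-prefixed hex. It is not the kernel's option parser: `parse_pid_opt` is a hypothetical helper, and rejecting values below EXOFS_MIN_PID is an assumption based on the "Smallest partition ID" comment in common.h.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Value copied from the removed common.h. */
#define EXOFS_MIN_PID 0x10000ULL

/*
 * Parse one "pid=<integer>" token (hypothetical helper).
 * Returns 0 and stores the partition ID on success, -1 otherwise.
 */
static int parse_pid_opt(const char *opt, uint64_t *pid)
{
	unsigned long long v;
	char *end;

	assert(opt && pid);
	if (strncmp(opt, "pid=", 4) != 0)
		return -1;
	opt += 4;
	/* Require a leading digit so that "", "-1" and " 5" are rejected. */
	if (!isdigit((unsigned char)*opt))
		return -1;
	errno = 0;
	v = strtoull(opt, &end, 0);	/* base 0: "0x" prefix selects hex */
	if (errno != 0 || *end != '\0')
		return -1;
	if (v < EXOFS_MIN_PID)		/* assumption: enforce the minimum */
		return -1;
	*pid = v;
	return 0;
}
```

With this sketch "pid=65536" and "pid=0x10000" yield the same partition ID, matching the mkfs and mount examples in the USAGE section. Note that base 0 also treats a leading 0 as octal, which may or may not match the kernel's behaviour.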
===============================================================================
DESIGN
===============================================================================
* The file system control block (AKA on-disk superblock) resides in an object
with a special ID (defined in common.h).
Information included in the file system control block is used to fill the
in-memory superblock structure at mount time. This object is created before
the file system is used by mkexofs.c. It contains information such as:
- The file system's magic number
- The next inode number to be allocated
* Each file resides in its own object and contains the data (and it will be
possible to extend the file over multiple objects, though this has not been
implemented yet).
* A directory is treated as a file, and essentially contains a list of <file
name, inode #> pairs for files that are found in that directory. The object
IDs correspond to the files' inode numbers and will be allocated according to
a bitmap (stored in a separate object). Currently they are allocated using a
counter.
* Each file's control block (AKA on-disk inode) is stored in its object's
attributes. This applies to both regular files and other types (directories,
device files, symlinks, etc.).
* Credentials are generated per object (inode and superblock) when they are
created in memory (read from disk or created). The credential works for all
operations and is used as long as the object remains in memory.
* Async OSD operations are used whenever possible, but the target may execute
them out of order. The operations that concern us are create, delete,
readpage, writepage, update_inode, and truncate. The following pairs of
operations should execute in the order written, and we need to prevent them
from executing in reverse order:
- The following are handled with the OBJ_CREATED and OBJ_2BCREATED
flags. OBJ_CREATED is set when we know the object exists on the OSD -
in create's callback function, and when we successfully do a
read_inode.
OBJ_2BCREATED is set in the beginning of the create function, so we
know that we should wait.
- create/delete: delete should wait until the object is created
on the OSD.
- create/readpage: readpage should be able to return a page
full of zeroes in this case. If there was a write already
en-route (i.e. create, writepage, readpage) then the page
would be locked, and so it would really be the same as
create/writepage.
- create/writepage: if writepage is called for a sync write, it
should wait until the object is created on the OSD.
Otherwise, it should just return.
- create/truncate: truncate should wait until the object is
created on the OSD.
- create/update_inode: update_inode should wait until the
object is created on the OSD.
- Handled by VFS locks:
- readpage/delete: shouldn't happen because of page lock.
- writepage/delete: shouldn't happen because of page lock.
- readpage/writepage: shouldn't happen because of page lock.
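The create-ordering rules above can be summarised as a small decision table. The following is a userspace model only, not the removed kernel code: the flag values, `enum op`, `enum action` and `decide()` are all hypothetical names chosen for this sketch, and real waiting and locking are reduced to returning an action code.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values are illustrative; the kernel may encode them differently. */
enum {
	OBJ_2BCREATED = 1u << 0,	/* create has been started */
	OBJ_CREATED   = 1u << 1,	/* object is known to exist on the OSD */
};

/* Operations that must not overtake a pending create. */
enum op { OP_DELETE, OP_READPAGE, OP_WRITEPAGE, OP_TRUNCATE, OP_UPDATE_INODE };

enum action {
	ACT_ISSUE,	/* object exists: send the command now */
	ACT_WAIT,	/* block until create's callback sets OBJ_CREATED */
	ACT_SKIP,	/* async writepage: just return */
	ACT_ZERO_PAGE,	/* readpage: hand back a page full of zeroes */
};

static enum action decide(enum op op, unsigned flags, bool sync_write)
{
	/* Model assumption: an in-memory object always carries a flag. */
	assert(flags & (OBJ_2BCREATED | OBJ_CREATED));

	if (flags & OBJ_CREATED)
		return ACT_ISSUE;

	/* Create is still in flight. */
	switch (op) {
	case OP_READPAGE:
		return ACT_ZERO_PAGE;
	case OP_WRITEPAGE:
		return sync_write ? ACT_WAIT : ACT_SKIP;
	default:	/* delete, truncate, update_inode */
		return ACT_WAIT;
	}
}
```

The readpage/delete, writepage/delete and readpage/writepage pairs are deliberately absent from the model, since the text states that page locks in the VFS already keep them ordered.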
===============================================================================
LICENSE/COPYRIGHT
===============================================================================
The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel
version 2.6.10). All files include the original copyrights, and the license
is GPL version 2 (only version 2, as is true for the Linux kernel). The
Linux kernel can be downloaded from www.kernel.org.
......@@ -24,11 +24,6 @@ osd-uld:
platform, both for the in-kernel initiator as well as connected targets. It
currently has no useful user-mode API, though it could have if need be.
exofs:
Is an OSD based Linux file system. It uses the osd-initiator and osd-uld,
to export a usable file system for users.
See Documentation/filesystems/exofs.txt for more details
osd target:
There are no current plans for an OSD target implementation in kernel. For all
needs, a user-mode target that is based on the scsi tgt target framework is
......
......@@ -11390,7 +11390,6 @@ M: Boaz Harrosh <ooo@electrozaur.com>
S: Maintained
F: drivers/scsi/osd/
F: include/scsi/osd_*
F: fs/exofs/
OV2659 OMNIVISION SENSOR DRIVER
M: "Lad, Prabhakar" <prabhakar.csengg@gmail.com>
......
......@@ -254,12 +254,9 @@ source "fs/romfs/Kconfig"
source "fs/pstore/Kconfig"
source "fs/sysv/Kconfig"
source "fs/ufs/Kconfig"
source "fs/exofs/Kconfig"
endif # MISC_FILESYSTEMS
source "fs/exofs/Kconfig.ore"
menuconfig NETWORK_FILESYSTEMS
bool "Network File Systems"
default y
......
......@@ -124,7 +124,6 @@ obj-$(CONFIG_OCFS2_FS) += ocfs2/
obj-$(CONFIG_BTRFS_FS) += btrfs/
obj-$(CONFIG_GFS2_FS) += gfs2/
obj-$(CONFIG_F2FS_FS) += f2fs/
obj-y += exofs/ # Multiple modules
obj-$(CONFIG_CEPH_FS) += ceph/
obj-$(CONFIG_PSTORE) += pstore/
obj-$(CONFIG_EFIVAR_FS) += efivarfs/
- Out-of-space may cause a severe problem if the object (and directory entry)
were written, but the inode attributes failed. Then if the filesystem was
unmounted and mounted the kernel can get into an endless loop doing a readdir.
#
# Kbuild for the EXOFS module
#
# Copyright (C) 2008 Panasas Inc. All rights reserved.
#
# Authors:
# Boaz Harrosh <ooo@electrozaur.com>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2
#
# Kbuild - Gets included from the Kernels Makefile and build system
#
# ore module library
libore-y := ore.o ore_raid.o
obj-$(CONFIG_ORE) += libore.o
exofs-y := inode.o file.o namei.o dir.o super.o sys.o
obj-$(CONFIG_EXOFS_FS) += exofs.o
config EXOFS_FS
tristate "exofs: OSD based file system support"
depends on SCSI_OSD_ULD
help
EXOFS is a file system that uses an OSD storage device,
as its backing storage.
# Debugging-related stuff
config EXOFS_DEBUG
bool "Enable debugging"
depends on EXOFS_FS
help
This option enables EXOFS debug prints.
# ORE - Objects Raid Engine (libore.ko)
#
# Note ORE needs to "select ASYNC_XOR". So Not to force multiple selects
# for every ORE user we do it like this. Any user should add itself here
# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
# selected here, and we default to "ON". So in effect it is like been
# selected by any of the users.
config ORE
tristate
depends on EXOFS_FS || PNFS_OBJLAYOUT
select ASYNC_XOR
select RAID6_PQ
select ASYNC_PQ
default SCSI_OSD_ULD
/*
* common.h - Common definitions for both Kernel and user-mode utilities
*
* Copyright (C) 2005, 2006
* Avishay Traeger (avishay@gmail.com)
* Copyright (C) 2008, 2009
* Boaz Harrosh <ooo@electrozaur.com>
*
* Copyrights for code taken from ext2:
* Copyright (C) 1992, 1993, 1994, 1995
* Remy Card (card@masi.ibp.fr)
* Laboratoire MASI - Institut Blaise Pascal
* Universite Pierre et Marie Curie (Paris VI)
* from
* linux/fs/minix/inode.c
* Copyright (C) 1991, 1992 Linus Torvalds
*
* This file is part of exofs.
*
* exofs is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation. Since it is based on ext2, and the only
* valid version of GPL for the Linux kernel is version 2, the only valid
* version of GPL for exofs is version 2.
*
* exofs is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with exofs; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#ifndef __EXOFS_COM_H__
#define __EXOFS_COM_H__
#include <linux/types.h>
#include <scsi/osd_attributes.h>
#include <scsi/osd_initiator.h>
#include <scsi/osd_sec.h>
/****************************************************************************
* Object ID related defines
* NOTE: inode# = object ID - EXOFS_OBJ_OFF
****************************************************************************/
#define EXOFS_MIN_PID 0x10000 /* Smallest partition ID */
#define EXOFS_OBJ_OFF 0x10000 /* offset for objects */
#define EXOFS_SUPER_ID 0x10000 /* object ID for on-disk superblock */
#define EXOFS_DEVTABLE_ID 0x10001 /* object ID for on-disk device table */
#define EXOFS_ROOT_ID 0x10002 /* object ID for root directory */
/* exofs Application specific page/attribute */
/* Inode attrs */
# define EXOFS_APAGE_FS_DATA (OSD_APAGE_APP_DEFINED_FIRST + 3)
# define EXOFS_ATTR_INODE_DATA 1
# define EXOFS_ATTR_INODE_FILE_LAYOUT 2
# define EXOFS_ATTR_INODE_DIR_LAYOUT 3
/* Partition attrs */
# define EXOFS_APAGE_SB_DATA (0xF0000000U + 3)
# define EXOFS_ATTR_SB_STATS 1
/*
* The maximum number of files we can have is limited by the size of the
* inode number. This is the largest object ID that the file system supports.
* Object IDs 0, 1, and 2 are always in use (see above defines).
*/
enum {
EXOFS_MAX_INO_ID = (sizeof(ino_t) * 8 == 64) ? ULLONG_MAX :
(1ULL << (sizeof(ino_t) * 8ULL - 1ULL)),
EXOFS_MAX_ID = (EXOFS_MAX_INO_ID - 1 - EXOFS_OBJ_OFF),
};
/****************************************************************************
* Misc.
****************************************************************************/
#define EXOFS_BLKSHIFT 12
#define EXOFS_BLKSIZE (1UL << EXOFS_BLKSHIFT)
/****************************************************************************
* superblock-related things
****************************************************************************/
#define EXOFS_SUPER_MAGIC 0x5DF5
/*
* The file system control block - stored in object EXOFS_SUPER_ID's data.
* This is where the in-memory superblock is stored on disk.
*/
enum {EXOFS_FSCB_VER = 1, EXOFS_DT_VER = 1};
struct exofs_fscb {
__le64 s_nextid; /* Only used after mkfs */
__le64 s_numfiles; /* Only used after mkfs */
__le32 s_version; /* == EXOFS_FSCB_VER */
__le16 s_magic; /* Magic signature */
__le16 s_newfs; /* Non-zero if this is a new fs */
/* From here on it's a static part, only written by mkexofs */
__le64 s_dev_table_oid; /* Reserved, not used */
__le64 s_dev_table_count; /* == 0 means no dev_table */
} __packed;
/*
* This struct is set on the FS partition's attributes.
* [EXOFS_APAGE_SB_DATA, EXOFS_ATTR_SB_STATS] and is written together
* with the create command, to atomically persist the sb writeable information.
*/
struct exofs_sb_stats {
__le64 s_nextid; /* Highest object ID used */
__le64 s_numfiles; /* Number of files on fs */
} __packed;
/*
* Describes the raid used in the FS. It is part of the device table.
* This here is taken from the pNFS-objects definition. In exofs we
* use one raid policy throughout the filesystem. (NOTE: the funny
* alignment at the beginning; we take care of it in exofs_device_table.)
*/
struct exofs_dt_data_map {
__le32 cb_num_comps;
__le64 cb_stripe_unit;
__le32 cb_group_width;
__le32 cb_group_depth;
__le32 cb_mirror_cnt;
__le32 cb_raid_algorithm;
} __packed;
/*
* This is an osd device information descriptor. It is a single entry in
* the exofs device table. It describes an osd target lun which
* contains data belonging to this FS. (Same partition_id on all devices)
*/
struct exofs_dt_device_info {
__le32 systemid_len;
u8 systemid[OSD_SYSTEMID_LEN];
__le64 long_name_offset; /* If !0 then offset-in-file */
__le32 osdname_len; /* */
u8 osdname[44]; /* Embedded, usually an ASCII uuid */
} __packed;
/*
* The EXOFS device table - stored in object EXOFS_DEVTABLE_ID's data.
* It contains the raid used for this multi-device FS and an array of
* participating devices.
*/
struct exofs_device_table {
__le32 dt_version; /* == EXOFS_DT_VER */
struct exofs_dt_data_map dt_data_map; /* Raid policy to use */
/* Reserved space for future use. Total including this:
* (8 * sizeof(le64))
*/
__le64 __Resurved[4];
__le64 dt_num_devices; /* Array size */
struct exofs_dt_device_info dt_dev_table[]; /* Array of devices */
} __packed;
/****************************************************************************
* inode-related things
****************************************************************************/
#define EXOFS_IDATA 5
/*
* The file control block - stored in an object's attributes. This is where
* the in-memory inode is stored on disk.
*/
struct exofs_fcb {
__le64 i_size; /* Size of the file */
__le16 i_mode; /* File mode */
__le16 i_links_count; /* Links count */
__le32 i_uid; /* Owner Uid */
__le32 i_gid; /* Group Id */
__le32 i_atime; /* Access time */
__le32 i_ctime; /* Creation time */
__le32 i_mtime; /* Modification time */
__le32 i_flags; /* File flags (unused for now)*/
__le32 i_generation; /* File version (for NFS) */
__le32 i_data[EXOFS_IDATA]; /* Short symlink names and device #s */
};
#define EXOFS_INO_ATTR_SIZE sizeof(struct exofs_fcb)
/* This is the Attribute the fcb is stored in */
static const struct __weak osd_attr g_attr_inode_data = ATTR_DEF(
EXOFS_APAGE_FS_DATA,
EXOFS_ATTR_INODE_DATA,
EXOFS_INO_ATTR_SIZE);
/****************************************************************************
* dentry-related things
****************************************************************************/
#define EXOFS_NAME_LEN 255
/*
* The on-disk directory entry
*/
struct exofs_dir_entry {
__le64 inode_no; /* inode number */
__le16 rec_len; /* directory entry length */
u8 name_len; /* name length */
u8 file_type; /* umm...file type */
char name[EXOFS_NAME_LEN]; /* file name */
};
enum {
EXOFS_FT_UNKNOWN,
EXOFS_FT_REG_FILE,
EXOFS_FT_DIR,
EXOFS_FT_CHRDEV,
EXOFS_FT_BLKDEV,
EXOFS_FT_FIFO,
EXOFS_FT_SOCK,
EXOFS_FT_SYMLINK,
EXOFS_FT_MAX
};
#define EXOFS_DIR_PAD 4
#define EXOFS_DIR_ROUND (EXOFS_DIR_PAD - 1)
#define EXOFS_DIR_REC_LEN(name_len) \
(((name_len) + offsetof(struct exofs_dir_entry, name) + \
EXOFS_DIR_ROUND) & ~EXOFS_DIR_ROUND)
/*
* The on-disk (optional) layout structure.
* sits in an EXOFS_ATTR_INODE_FILE_LAYOUT or EXOFS_ATTR_INODE_DIR_LAYOUT
* attribute, attached to any inode, usually to a directory.
*/
enum exofs_inode_layout_gen_functions {
LAYOUT_MOVING_WINDOW = 0,
LAYOUT_IMPLICT = 1,
};
struct exofs_on_disk_inode_layout {
__le16 gen_func; /* One of enum exofs_inode_layout_gen_functions */
__le16 pad;
union {
/* gen_func == LAYOUT_MOVING_WINDOW (default) */
struct exofs_layout_sliding_window {
__le32 num_devices; /* first n devices in global-table*/
} sliding_window __packed;
/* gen_func == LAYOUT_IMPLICT */
struct exofs_layout_implict_list {
struct exofs_dt_data_map data_map;
/* Variable array of size data_map.cb_num_comps. These
* are device indexes of the devices in the global table
*/
__le32 dev_indexes[];
} implict __packed;
};
} __packed;
static inline size_t exofs_on_disk_inode_layout_size(unsigned max_devs)
{
return sizeof(struct exofs_on_disk_inode_layout) +
max_devs * sizeof(__le32);
}
#endif /*ifndef __EXOFS_COM_H__*/
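To show what the EXOFS_DIR_REC_LEN rounding in the header above works out to, here is a small userspace sketch. `struct dir_entry` and `dir_rec_len()` are stand-ins written for this example, with fixed-width types replacing the kernel's __le64/__le16/u8; the field order and the rounding expression follow the removed header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN 255	/* mirrors EXOFS_NAME_LEN */

/* Userspace stand-in for struct exofs_dir_entry (same field order). */
struct dir_entry {
	uint64_t inode_no;	/* inode number */
	uint16_t rec_len;	/* directory entry length */
	uint8_t  name_len;	/* name length */
	uint8_t  file_type;	/* file type */
	char     name[NAME_LEN];/* file name */
};

#define DIR_PAD   4
#define DIR_ROUND (DIR_PAD - 1)

/* Fixed header plus the name, rounded up to a multiple of DIR_PAD. */
static inline size_t dir_rec_len(size_t name_len)
{
	assert(name_len <= NAME_LEN);
	return (name_len + offsetof(struct dir_entry, name) + DIR_ROUND) &
	       ~(size_t)DIR_ROUND;
}
```

The fixed part of an entry is 12 bytes, so names of 1 to 4 characters all occupy a 16-byte record and a 5-character name needs 20. This is also why the directory code uses EXOFS_DIR_REC_LEN(1) as the smallest valid record length.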
/*
* Copyright (C) 2005, 2006
* Avishay Traeger (avishay@gmail.com)
* Copyright (C) 2008, 2009
* Boaz Harrosh <ooo@electrozaur.com>
*
* Copyrights for code taken from ext2:
* Copyright (C) 1992, 1993, 1994, 1995
* Remy Card (card@masi.ibp.fr)
* Laboratoire MASI - Institut Blaise Pascal
* Universite Pierre et Marie Curie (Paris VI)
* from
* linux/fs/minix/inode.c
* Copyright (C) 1991, 1992 Linus Torvalds
*
* This file is part of exofs.
*
* exofs is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation. Since it is based on ext2, and the only
* valid version of GPL for the Linux kernel is version 2, the only valid
* version of GPL for exofs is version 2.
*
* exofs is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with exofs; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include <linux/iversion.h>
#include "exofs.h"
static inline unsigned exofs_chunk_size(struct inode *inode)
{
return inode->i_sb->s_blocksize;
}
static inline void exofs_put_page(struct page *page)
{
kunmap(page);
put_page(page);
}
static unsigned exofs_last_byte(struct inode *inode, unsigned long page_nr)
{
loff_t last_byte = inode->i_size;
last_byte -= page_nr << PAGE_SHIFT;
if (last_byte > PAGE_SIZE)
last_byte = PAGE_SIZE;
return last_byte;
}
static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
{
struct address_space *mapping = page->mapping;
struct inode *dir = mapping->host;
int err = 0;
inode_inc_iversion(dir);
if (!PageUptodate(page))
SetPageUptodate(page);
if (pos+len > dir->i_size) {
i_size_write(dir, pos+len);
mark_inode_dirty(dir);
}
set_page_dirty(page);
if (IS_DIRSYNC(dir))
err = write_one_page(page);
else
unlock_page(page);
return err;
}
static bool exofs_check_page(struct page *page)
{
struct inode *dir = page->mapping->host;
unsigned chunk_size = exofs_chunk_size(dir);
char *kaddr = page_address(page);
unsigned offs, rec_len;
unsigned limit = PAGE_SIZE;
struct exofs_dir_entry *p;
char *error;
/* if the page is the last one in the directory */
if ((dir->i_size >> PAGE_SHIFT) == page->index) {
limit = dir->i_size & ~PAGE_MASK;
if (limit & (chunk_size - 1))
goto Ebadsize;
if (!limit)
goto out;
}
for (offs = 0; offs <= limit - EXOFS_DIR_REC_LEN(1); offs += rec_len) {
p = (struct exofs_dir_entry *)(kaddr + offs);
rec_len = le16_to_cpu(p->rec_len);
if (rec_len < EXOFS_DIR_REC_LEN(1))
goto Eshort;
if (rec_len & 3)
goto Ealign;
if (rec_len < EXOFS_DIR_REC_LEN(p->name_len))
goto Enamelen;
if (((offs + rec_len - 1) ^ offs) & ~(chunk_size-1))
goto Espan;
}
if (offs != limit)
goto Eend;
out:
SetPageChecked(page);
return true;
Ebadsize:
EXOFS_ERR("ERROR [exofs_check_page]: "
"size of directory(0x%lx) is not a multiple of chunk size\n",
dir->i_ino
);
goto fail;
Eshort:
error = "rec_len is smaller than minimal";
goto bad_entry;
Ealign:
error = "unaligned directory entry";
goto bad_entry;
Enamelen:
error = "rec_len is too small for name_len";
goto bad_entry;
Espan:
error = "directory entry across blocks";
goto bad_entry;
bad_entry:
EXOFS_ERR(
"ERROR [exofs_check_page]: bad entry in directory(0x%lx): %s - "
"offset=%lu, inode=0x%llx, rec_len=%d, name_len=%d\n",
dir->i_ino, error, (page->index<<PAGE_SHIFT)+offs,
_LLU(le64_to_cpu(p->inode_no)),
rec_len, p->name_len);
goto fail;
Eend:
p = (struct exofs_dir_entry *)(kaddr + offs);
EXOFS_ERR("ERROR [exofs_check_page]: "
"entry in directory(0x%lx) spans the page boundary, "
"offset=%lu, inode=0x%llx\n",
dir->i_ino, (page->index<<PAGE_SHIFT)+offs,
_LLU(le64_to_cpu(p->inode_no)));
fail:
SetPageError(page);
return false;
}
static struct page *exofs_get_page(struct inode *dir, unsigned long n)
{
struct address_space *mapping = dir->i_mapping;
struct page *page = read_mapping_page(mapping, n, NULL);
if (!IS_ERR(page)) {
kmap(page);
if (unlikely(!PageChecked(page))) {
if (PageError(page) || !exofs_check_page(page))
goto fail;
}
}
return page;
fail:
exofs_put_page(page);
return ERR_PTR(-EIO);
}
static inline int exofs_match(int len, const unsigned char *name,
struct exofs_dir_entry *de)
{
if (len != de->name_len)
return 0;
if (!de->inode_no)
return 0;
return !memcmp(name, de->name, len);
}
static inline
struct exofs_dir_entry *exofs_next_entry(struct exofs_dir_entry *p)
{
return (struct exofs_dir_entry *)((char *)p + le16_to_cpu(p->rec_len));
}
static inline unsigned
exofs_validate_entry(char *base, unsigned offset, unsigned mask)
{
struct exofs_dir_entry *de = (struct exofs_dir_entry *)(base + offset);
struct exofs_dir_entry *p =
(struct exofs_dir_entry *)(base + (offset&mask));
while ((char *)p < (char *)de) {
if (p->rec_len == 0)
break;
p = exofs_next_entry(p);
}
return (char *)p - base;
}
static unsigned char exofs_filetype_table[EXOFS_FT_MAX] = {
[EXOFS_FT_UNKNOWN] = DT_UNKNOWN,
[EXOFS_FT_REG_FILE] = DT_REG,
[EXOFS_FT_DIR] = DT_DIR,
[EXOFS_FT_CHRDEV] = DT_CHR,
[EXOFS_FT_BLKDEV] = DT_BLK,
[EXOFS_FT_FIFO] = DT_FIFO,
[EXOFS_FT_SOCK] = DT_SOCK,
[EXOFS_FT_SYMLINK] = DT_LNK,
};
#define S_SHIFT 12
static unsigned char exofs_type_by_mode[S_IFMT >> S_SHIFT] = {
[S_IFREG >> S_SHIFT] = EXOFS_FT_REG_FILE,
[S_IFDIR >> S_SHIFT] = EXOFS_FT_DIR,
[S_IFCHR >> S_SHIFT] = EXOFS_FT_CHRDEV,
[S_IFBLK >> S_SHIFT] = EXOFS_FT_BLKDEV,
[S_IFIFO >> S_SHIFT] = EXOFS_FT_FIFO,
[S_IFSOCK >> S_SHIFT] = EXOFS_FT_SOCK,
[S_IFLNK >> S_SHIFT] = EXOFS_FT_SYMLINK,
};
static inline
void exofs_set_de_type(struct exofs_dir_entry *de, struct inode *inode)
{
umode_t mode = inode->i_mode;
de->file_type = exofs_type_by_mode[(mode & S_IFMT) >> S_SHIFT];
}
static int
exofs_readdir(struct file *file, struct dir_context *ctx)
{
loff_t pos = ctx->pos;
struct inode *inode = file_inode(file);
unsigned int offset = pos & ~PAGE_MASK;
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
bool need_revalidate = !inode_eq_iversion(inode, file->f_version);
if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
return 0;
for ( ; n < npages; n++, offset = 0) {
char *kaddr, *limit;
struct exofs_dir_entry *de;
struct page *page = exofs_get_page(inode, n);
if (IS_ERR(page)) {
EXOFS_ERR("ERROR: bad page in directory(0x%lx)\n",
inode->i_ino);
ctx->pos += PAGE_SIZE - offset;
return PTR_ERR(page);
}
kaddr = page_address(page);
if (unlikely(need_revalidate)) {
if (offset) {
offset = exofs_validate_entry(kaddr, offset,
chunk_mask);
ctx->pos = (n<<PAGE_SHIFT) + offset;
}
file->f_version = inode_query_iversion(inode);
need_revalidate = false;
}
de = (struct exofs_dir_entry *)(kaddr + offset);
limit = kaddr + exofs_last_byte(inode, n) -
EXOFS_DIR_REC_LEN(1);
for (; (char *)de <= limit; de = exofs_next_entry(de)) {
if (de->rec_len == 0) {
EXOFS_ERR("ERROR: "
"zero-length entry in directory(0x%lx)\n",
inode->i_ino);
exofs_put_page(page);
return -EIO;
}
if (de->inode_no) {
unsigned char t;
if (de->file_type < EXOFS_FT_MAX)
t = exofs_filetype_table[de->file_type];
else
t = DT_UNKNOWN;
if (!dir_emit(ctx, de->name, de->name_len,
le64_to_cpu(de->inode_no),
t)) {
exofs_put_page(page);
return 0;
}
}
ctx->pos += le16_to_cpu(de->rec_len);
}
exofs_put_page(page);
}
return 0;
}
struct exofs_dir_entry *exofs_find_entry(struct inode *dir,
struct dentry *dentry, struct page **res_page)
{
const unsigned char *name = dentry->d_name.name;
int namelen = dentry->d_name.len;
unsigned reclen = EXOFS_DIR_REC_LEN(namelen);
unsigned long start, n;
unsigned long npages = dir_pages(dir);
struct page *page = NULL;
struct exofs_i_info *oi = exofs_i(dir);
struct exofs_dir_entry *de;
if (npages == 0)
goto out;
*res_page = NULL;
start = oi->i_dir_start_lookup;
if (start >= npages)
start = 0;
n = start;
do {
char *kaddr;
page = exofs_get_page(dir, n);
if (!IS_ERR(page)) {
kaddr = page_address(page);
de = (struct exofs_dir_entry *) kaddr;
kaddr += exofs_last_byte(dir, n) - reclen;
while ((char *) de <= kaddr) {
if (de->rec_len == 0) {
EXOFS_ERR("ERROR: zero-length entry in "
"directory(0x%lx)\n",
dir->i_ino);
exofs_put_page(page);
goto out;
}
if (exofs_match(namelen, name, de))
goto found;
de = exofs_next_entry(de);
}
exofs_put_page(page);
}
if (++n >= npages)
n = 0;
} while (n != start);
out:
return NULL;
found:
*res_page = page;
oi->i_dir_start_lookup = n;
return de;
}
struct exofs_dir_entry *exofs_dotdot(struct inode *dir, struct page **p)
{
struct page *page = exofs_get_page(dir, 0);
struct exofs_dir_entry *de = NULL;
if (!IS_ERR(page)) {
de = exofs_next_entry(
(struct exofs_dir_entry *)page_address(page));
*p = page;
}
return de;
}
ino_t exofs_parent_ino(struct dentry *child)
{
struct page *page;
struct exofs_dir_entry *de;
ino_t ino;
de = exofs_dotdot(d_inode(child), &page);
if (!de)
return 0;
ino = le64_to_cpu(de->inode_no);
exofs_put_page(page);
return ino;
}
ino_t exofs_inode_by_name(struct inode *dir, struct dentry *dentry)
{
ino_t res = 0;
struct exofs_dir_entry *de;
struct page *page;
de = exofs_find_entry(dir, dentry, &page);
if (de) {
res = le64_to_cpu(de->inode_no);
exofs_put_page(page);
}
return res;
}
int exofs_set_link(struct inode *dir, struct exofs_dir_entry *de,
struct page *page, struct inode *inode)
{
loff_t pos = page_offset(page) +
(char *) de - (char *) page_address(page);
unsigned len = le16_to_cpu(de->rec_len);
int err;
lock_page(page);
err = exofs_write_begin(NULL, page->mapping, pos, len, 0, &page, NULL);
if (err)
EXOFS_ERR("exofs_set_link: exofs_write_begin FAILED => %d\n",
err);
de->inode_no = cpu_to_le64(inode->i_ino);
exofs_set_de_type(de, inode);
if (likely(!err))
err = exofs_commit_chunk(page, pos, len);
exofs_put_page(page);
dir->i_mtime = dir->i_ctime = current_time(dir);
mark_inode_dirty(dir);
return err;
}
int exofs_add_link(struct dentry *dentry, struct inode *inode)
{
struct inode *dir = d_inode(dentry->d_parent);
const unsigned char *name = dentry->d_name.name;
int namelen = dentry->d_name.len;
unsigned chunk_size = exofs_chunk_size(dir);
unsigned reclen = EXOFS_DIR_REC_LEN(namelen);
unsigned short rec_len, name_len;
struct page *page = NULL;
struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
struct exofs_dir_entry *de;
unsigned long npages = dir_pages(dir);
unsigned long n;
char *kaddr;
loff_t pos;
int err;
for (n = 0; n <= npages; n++) {
char *dir_end;
page = exofs_get_page(dir, n);
err = PTR_ERR(page);
if (IS_ERR(page))
goto out;
lock_page(page);
kaddr = page_address(page);
dir_end = kaddr + exofs_last_byte(dir, n);
de = (struct exofs_dir_entry *)kaddr;
kaddr += PAGE_SIZE - reclen;
while ((char *)de <= kaddr) {
if ((char *)de == dir_end) {
name_len = 0;
rec_len = chunk_size;
de->rec_len = cpu_to_le16(chunk_size);
de->inode_no = 0;
goto got_it;
}
if (de->rec_len == 0) {
EXOFS_ERR("ERROR: exofs_add_link: "
"zero-length entry in directory(0x%lx)\n",
inode->i_ino);
err = -EIO;
goto out_unlock;
}
err = -EEXIST;
if (exofs_match(namelen, name, de))
goto out_unlock;
name_len = EXOFS_DIR_REC_LEN(de->name_len);
rec_len = le16_to_cpu(de->rec_len);
if (!de->inode_no && rec_len >= reclen)
goto got_it;
if (rec_len >= name_len + reclen)
goto got_it;
de = (struct exofs_dir_entry *) ((char *) de + rec_len);
}
unlock_page(page);
exofs_put_page(page);
}
EXOFS_ERR("exofs_add_link: BAD dentry=%p or inode=0x%lx\n",
dentry, inode->i_ino);
return -EINVAL;
got_it:
pos = page_offset(page) +
(char *)de - (char *)page_address(page);
err = exofs_write_begin(NULL, page->mapping, pos, rec_len, 0,
&page, NULL);
if (err)
goto out_unlock;
if (de->inode_no) {
struct exofs_dir_entry *de1 =
(struct exofs_dir_entry *)((char *)de + name_len);
de1->rec_len = cpu_to_le16(rec_len - name_len);
de->rec_len = cpu_to_le16(name_len);
de = de1;
}
de->name_len = namelen;
memcpy(de->name, name, namelen);
de->inode_no = cpu_to_le64(inode->i_ino);
exofs_set_de_type(de, inode);
err = exofs_commit_chunk(page, pos, rec_len);
dir->i_mtime = dir->i_ctime = current_time(dir);
mark_inode_dirty(dir);
sbi->s_numfiles++;
out_put:
exofs_put_page(page);
out:
return err;
out_unlock:
unlock_page(page);
goto out_put;
}
int exofs_delete_entry(struct exofs_dir_entry *dir, struct page *page)
{
struct address_space *mapping = page->mapping;
struct inode *inode = mapping->host;
struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
char *kaddr = page_address(page);
unsigned from = ((char *)dir - kaddr) & ~(exofs_chunk_size(inode)-1);
unsigned to = ((char *)dir - kaddr) + le16_to_cpu(dir->rec_len);
loff_t pos;
struct exofs_dir_entry *pde = NULL;
struct exofs_dir_entry *de = (struct exofs_dir_entry *) (kaddr + from);
int err;
while (de < dir) {
if (de->rec_len == 0) {
EXOFS_ERR("ERROR: exofs_delete_entry: "
"zero-length entry in directory(0x%lx)\n",
inode->i_ino);
err = -EIO;
goto out;
}
pde = de;
de = exofs_next_entry(de);
}
if (pde)
from = (char *)pde - (char *)page_address(page);
pos = page_offset(page) + from;
lock_page(page);
err = exofs_write_begin(NULL, page->mapping, pos, to - from, 0,
&page, NULL);
if (err)
EXOFS_ERR("exofs_delete_entry: exofs_write_begin FAILED => %d\n",
err);
if (pde)
pde->rec_len = cpu_to_le16(to - from);
dir->inode_no = 0;
if (likely(!err))
err = exofs_commit_chunk(page, pos, to - from);
inode->i_ctime = inode->i_mtime = current_time(inode);
mark_inode_dirty(inode);
sbi->s_numfiles--;
out:
exofs_put_page(page);
return err;
}
/* kept aligned on 4 bytes */
#define THIS_DIR ".\0\0"
#define PARENT_DIR "..\0"
int exofs_make_empty(struct inode *inode, struct inode *parent)
{
struct address_space *mapping = inode->i_mapping;
struct page *page = grab_cache_page(mapping, 0);
unsigned chunk_size = exofs_chunk_size(inode);
struct exofs_dir_entry *de;
int err;
void *kaddr;
if (!page)
return -ENOMEM;
err = exofs_write_begin(NULL, page->mapping, 0, chunk_size, 0,
&page, NULL);
if (err) {
unlock_page(page);
goto fail;
}
kaddr = kmap_atomic(page);
de = (struct exofs_dir_entry *)kaddr;
de->name_len = 1;
de->rec_len = cpu_to_le16(EXOFS_DIR_REC_LEN(1));
memcpy(de->name, THIS_DIR, sizeof(THIS_DIR));
de->inode_no = cpu_to_le64(inode->i_ino);
exofs_set_de_type(de, inode);
de = (struct exofs_dir_entry *)(kaddr + EXOFS_DIR_REC_LEN(1));
de->name_len = 2;
de->rec_len = cpu_to_le16(chunk_size - EXOFS_DIR_REC_LEN(1));
de->inode_no = cpu_to_le64(parent->i_ino);
memcpy(de->name, PARENT_DIR, sizeof(PARENT_DIR));
exofs_set_de_type(de, inode);
kunmap_atomic(kaddr);
err = exofs_commit_chunk(page, 0, chunk_size);
fail:
put_page(page);
return err;
}
int exofs_empty_dir(struct inode *inode)
{
struct page *page = NULL;
unsigned long i, npages = dir_pages(inode);
for (i = 0; i < npages; i++) {
char *kaddr;
struct exofs_dir_entry *de;
page = exofs_get_page(inode, i);
if (IS_ERR(page))
continue;
kaddr = page_address(page);
de = (struct exofs_dir_entry *)kaddr;
kaddr += exofs_last_byte(inode, i) - EXOFS_DIR_REC_LEN(1);
while ((char *)de <= kaddr) {
if (de->rec_len == 0) {
EXOFS_ERR("ERROR: exofs_empty_dir: "
"zero-length directory entry "
"kaddr=%p, de=%p\n", kaddr, de);
goto not_empty;
}
if (de->inode_no != 0) {
/* check for . and .. */
if (de->name[0] != '.')
goto not_empty;
if (de->name_len > 2)
goto not_empty;
if (de->name_len < 2) {
if (le64_to_cpu(de->inode_no) !=
inode->i_ino)
goto not_empty;
} else if (de->name[1] != '.')
goto not_empty;
}
de = exofs_next_entry(de);
}
exofs_put_page(page);
}
return 1;
not_empty:
exofs_put_page(page);
return 0;
}
const struct file_operations exofs_dir_operations = {
.llseek = generic_file_llseek,
.read = generic_read_dir,
.iterate_shared = exofs_readdir,
};
/*
* Copyright (C) 2005, 2006
* Avishay Traeger (avishay@gmail.com)
* Copyright (C) 2008, 2009
* Boaz Harrosh <ooo@electrozaur.com>
*
* Copyrights for code taken from ext2:
* Copyright (C) 1992, 1993, 1994, 1995
* Remy Card (card@masi.ibp.fr)
* Laboratoire MASI - Institut Blaise Pascal
* Universite Pierre et Marie Curie (Paris VI)
* from
* linux/fs/minix/inode.c
* Copyright (C) 1991, 1992 Linus Torvalds
*
* This file is part of exofs.
*
* exofs is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation. Since it is based on ext2, and the only
* valid version of GPL for the Linux kernel is version 2, the only valid
* version of GPL for exofs is version 2.
*
* exofs is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with exofs; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#ifndef __EXOFS_H__
#define __EXOFS_H__
#include <linux/fs.h>
#include <linux/time.h>
#include <linux/backing-dev.h>
#include <scsi/osd_ore.h>
#include "common.h"
#define EXOFS_ERR(fmt, a...) printk(KERN_ERR "exofs: " fmt, ##a)
#ifdef CONFIG_EXOFS_DEBUG
#define EXOFS_DBGMSG(fmt, a...) \
printk(KERN_NOTICE "exofs @%s:%d: " fmt, __func__, __LINE__, ##a)
#else
#define EXOFS_DBGMSG(fmt, a...) \
do { if (0) printk(fmt, ##a); } while (0)
#endif
/* u64 has problems with printk; this casts it to unsigned long long */
#define _LLU(x) (unsigned long long)(x)
struct exofs_dev {
struct ore_dev ored;
unsigned did;
unsigned urilen;
uint8_t *uri;
struct kobject ed_kobj;
};
/*
* our extension to the in-memory superblock
*/
struct exofs_sb_info {
struct exofs_sb_stats s_ess; /* Written often, pre-allocate*/
int s_timeout; /* timeout for OSD operations */
uint64_t s_nextid; /* highest object ID used */
uint32_t s_numfiles; /* number of files on fs */
spinlock_t s_next_gen_lock; /* spinlock for gen # update */
u32 s_next_generation; /* next gen # to use */
atomic_t s_curr_pending; /* number of pending commands */
struct ore_layout layout; /* Default files layout */
struct ore_comp one_comp; /* id & cred of partition id=0*/
struct ore_components oc; /* comps for the partition */
struct kobject s_kobj; /* holds per-sbi kobject */
};
/*
* our extension to the in-memory inode
*/
struct exofs_i_info {
struct inode vfs_inode; /* normal in-memory inode */
wait_queue_head_t i_wq; /* wait queue for inode */
unsigned long i_flags; /* various atomic flags */
uint32_t i_data[EXOFS_IDATA];/*short symlink names and device #s*/
uint32_t i_dir_start_lookup; /* which page to start lookup */
uint64_t i_commit_size; /* the object's written length */
struct ore_comp one_comp; /* same component for all devices */
struct ore_components oc; /* inode view of the device table */
};
static inline osd_id exofs_oi_objno(struct exofs_i_info *oi)
{
return oi->vfs_inode.i_ino + EXOFS_OBJ_OFF;
}
/*
* our inode flags
*/
#define OBJ_2BCREATED 0 /* object will be created soon*/
#define OBJ_CREATED 1 /* object has been created on the osd*/
static inline int obj_2bcreated(struct exofs_i_info *oi)
{
return test_bit(OBJ_2BCREATED, &oi->i_flags);
}
static inline void set_obj_2bcreated(struct exofs_i_info *oi)
{
set_bit(OBJ_2BCREATED, &oi->i_flags);
}
static inline int obj_created(struct exofs_i_info *oi)
{
return test_bit(OBJ_CREATED, &oi->i_flags);
}
static inline void set_obj_created(struct exofs_i_info *oi)
{
set_bit(OBJ_CREATED, &oi->i_flags);
}
int __exofs_wait_obj_created(struct exofs_i_info *oi);
static inline int wait_obj_created(struct exofs_i_info *oi)
{
if (likely(obj_created(oi)))
return 0;
return __exofs_wait_obj_created(oi);
}
/*
* get to our inode from the vfs inode
*/
static inline struct exofs_i_info *exofs_i(struct inode *inode)
{
return container_of(inode, struct exofs_i_info, vfs_inode);
}
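/*
 * Illustration only, not part of exofs: a userspace sketch of the
 * container_of() pattern used by exofs_i() above. The struct layout and
 * all sketch_* names are hypothetical; the macro here is a simplified
 * stand-in for the kernel's container_of(), without its type checking.
 */

```c
#include <assert.h>
#include <stddef.h>

struct sketch_inode {
	unsigned long i_ino;
};

/* The generic object is embedded in the filesystem-private one. */
struct sketch_i_info {
	unsigned long flags;
	struct sketch_inode vfs_inode;
};

/* Step back from a member pointer to the start of the enclosing struct. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct sketch_i_info *sketch_i(struct sketch_inode *inode)
{
	return sketch_container_of(inode, struct sketch_i_info, vfs_inode);
}
```

/*
 * Because the generic inode lives inside the private structure, no extra
 * pointer or lookup is needed to get from one to the other.
 */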
/*
* Maximum count of links to a file
*/
#define EXOFS_LINK_MAX 32000
/*************************
* function declarations *
*************************/
/* inode.c */
unsigned exofs_max_io_pages(struct ore_layout *layout,
unsigned expected_pages);
int exofs_setattr(struct dentry *, struct iattr *);
int exofs_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
extern struct inode *exofs_iget(struct super_block *, unsigned long);
struct inode *exofs_new_inode(struct inode *, umode_t);
extern int exofs_write_inode(struct inode *, struct writeback_control *wbc);
extern void exofs_evict_inode(struct inode *);
/* dir.c: */
int exofs_add_link(struct dentry *, struct inode *);
ino_t exofs_inode_by_name(struct inode *, struct dentry *);
int exofs_delete_entry(struct exofs_dir_entry *, struct page *);
int exofs_make_empty(struct inode *, struct inode *);
struct exofs_dir_entry *exofs_find_entry(struct inode *, struct dentry *,
struct page **);
int exofs_empty_dir(struct inode *);
struct exofs_dir_entry *exofs_dotdot(struct inode *, struct page **);
ino_t exofs_parent_ino(struct dentry *child);
int exofs_set_link(struct inode *, struct exofs_dir_entry *, struct page *,
struct inode *);
/* super.c */
void exofs_make_credential(u8 cred_a[OSD_CAP_LEN],
const struct osd_obj_id *obj);
int exofs_sbi_write_stats(struct exofs_sb_info *sbi);
/* sys.c */
int exofs_sysfs_init(void);
void exofs_sysfs_uninit(void);
int exofs_sysfs_sb_add(struct exofs_sb_info *sbi,
struct exofs_dt_device_info *dt_dev);
void exofs_sysfs_sb_del(struct exofs_sb_info *sbi);
int exofs_sysfs_odev_add(struct exofs_dev *edev,
struct exofs_sb_info *sbi);
void exofs_sysfs_dbg_print(void);
/*********************
* operation vectors *
*********************/
/* dir.c: */
extern const struct file_operations exofs_dir_operations;
/* file.c */
extern const struct inode_operations exofs_file_inode_operations;
extern const struct file_operations exofs_file_operations;
/* inode.c */
extern const struct address_space_operations exofs_aops;
/* namei.c */
extern const struct inode_operations exofs_dir_inode_operations;
extern const struct inode_operations exofs_special_inode_operations;
/* exofs_init_comps will initialize an ore_components device array
* pointing to a single ore_comp struct, and a round-robin view
* of the device table.
* The first device of each inode is [(oid * mirrors_p1) % num_devices];
* the remaining devices follow sequentially, wrapping around so that
* the first device of the table comes after the last.
* It is assumed that the global device array at @sbi is twice as
* large, holding the device table two times back to back.
* See: exofs_read_lookup_dev_table()
*/
static inline void exofs_init_comps(struct ore_components *oc,
struct ore_comp *one_comp,
struct exofs_sb_info *sbi, osd_id oid)
{
unsigned dev_mod = (unsigned)oid, first_dev;
one_comp->obj.partition = sbi->one_comp.obj.partition;
one_comp->obj.id = oid;
exofs_make_credential(one_comp->cred, &one_comp->obj);
oc->first_dev = 0;
oc->numdevs = sbi->layout.group_width * sbi->layout.mirrors_p1 *
sbi->layout.group_count;
oc->single_comp = EC_SINGLE_COMP;
oc->comps = one_comp;
/* Round robin device view of the table */
first_dev = (dev_mod * sbi->layout.mirrors_p1) % sbi->oc.numdevs;
oc->ods = &sbi->oc.ods[first_dev];
}
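/*
 * Illustration only, not part of exofs: a userspace sketch of the
 * round-robin view computed by exofs_init_comps() above. It assumes, as
 * the comment there states, that the device table is stored twice back
 * to back, so a window of numdevs entries starting at first_dev never
 * runs off the end. All sketch_* names and the integer "devices" are
 * hypothetical.
 */

```c
#include <assert.h>

/* Index of the first device for an object, following the same
 * (oid * mirrors_p1) % numdevs arithmetic as the inline helper. */
static unsigned sketch_first_dev(unsigned long long oid, unsigned mirrors_p1,
				 unsigned numdevs)
{
	unsigned dev_mod = (unsigned)oid;	/* truncated like the original */

	return (dev_mod * mirrors_p1) % numdevs;
}

/* Device seen at position i of the per-object view. "doubled" must hold
 * the numdevs-entry table two times in a row. */
static int sketch_view_dev(const int *doubled, unsigned first_dev, unsigned i)
{
	return doubled[first_dev + i];
}
```

/*
 * With four devices and mirrors_p1 == 2, object 3 starts at index 2, and
 * its view wraps around to the beginning of the table without any modulo
 * on each access.
 */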
#endif
/*
* Copyright (C) 2005, 2006
* Avishay Traeger (avishay@gmail.com)
* Copyright (C) 2008, 2009
* Boaz Harrosh <ooo@electrozaur.com>
*
* Copyrights for code taken from ext2:
* Copyright (C) 1992, 1993, 1994, 1995
* Remy Card (card@masi.ibp.fr)
* Laboratoire MASI - Institut Blaise Pascal
* Universite Pierre et Marie Curie (Paris VI)
* from
* linux/fs/minix/inode.c
* Copyright (C) 1991, 1992 Linus Torvalds
*
* This file is part of exofs.
*
* exofs is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation. Since it is based on ext2, and the only
* valid version of GPL for the Linux kernel is version 2, the only valid
* version of GPL for exofs is version 2.
*
* exofs is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with exofs; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include "exofs.h"
static int exofs_release_file(struct inode *inode, struct file *filp)
{
return 0;
}
/* exofs_file_fsync - flush the inode to disk
*
* Note, in exofs all metadata is written as part of inode, regardless.
* The writeout is synchronous
*/
static int exofs_file_fsync(struct file *filp, loff_t start, loff_t end,
int datasync)
{
struct inode *inode = filp->f_mapping->host;
int ret;
ret = file_write_and_wait_range(filp, start, end);
if (ret)
return ret;
inode_lock(inode);
ret = sync_inode_metadata(filp->f_mapping->host, 1);
inode_unlock(inode);
return ret;
}
static int exofs_flush(struct file *file, fl_owner_t id)
{
int ret = vfs_fsync(file, 0);
/* TODO: Flush the OSD target */
return ret;
}
const struct file_operations exofs_file_operations = {
.llseek = generic_file_llseek,
.read_iter = generic_file_read_iter,
.write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.open = generic_file_open,
.release = exofs_release_file,
.fsync = exofs_file_fsync,
.flush = exofs_flush,
.splice_read = generic_file_splice_read,
.splice_write = iter_file_splice_write,
};
const struct inode_operations exofs_file_inode_operations = {
.setattr = exofs_setattr,
};
/*
* Copyright (C) 2005, 2006
* Avishay Traeger (avishay@gmail.com)
* Copyright (C) 2008, 2009
* Boaz Harrosh <ooo@electrozaur.com>
*
* Copyrights for code taken from ext2:
* Copyright (C) 1992, 1993, 1994, 1995
* Remy Card (card@masi.ibp.fr)
* Laboratoire MASI - Institut Blaise Pascal
* Universite Pierre et Marie Curie (Paris VI)
* from
* linux/fs/minix/inode.c
* Copyright (C) 1991, 1992 Linus Torvalds
*
* This file is part of exofs.
*
* exofs is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation. Since it is based on ext2, and the only
* valid version of GPL for the Linux kernel is version 2, the only valid
* version of GPL for exofs is version 2.
*
* exofs is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with exofs; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include <linux/slab.h>
#include "exofs.h"
#define EXOFS_DBGMSG2(M...) do {} while (0)
unsigned exofs_max_io_pages(struct ore_layout *layout,
unsigned expected_pages)
{
unsigned pages = min_t(unsigned, expected_pages,
layout->max_io_length / PAGE_SIZE);
return pages;
}
struct page_collect {
struct exofs_sb_info *sbi;
struct inode *inode;
unsigned expected_pages;
struct ore_io_state *ios;
struct page **pages;
unsigned alloc_pages;
unsigned nr_pages;
unsigned long length;
loff_t pg_first; /* keep 64bit also in 32-arches */
bool read_4_write; /* This means two things: the read is synchronous,
* and the pages should not be unlocked.
*/
struct page *that_locked_page;
};
static void _pcol_init(struct page_collect *pcol, unsigned expected_pages,
struct inode *inode)
{
struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
pcol->sbi = sbi;
pcol->inode = inode;
pcol->expected_pages = expected_pages;
pcol->ios = NULL;
pcol->pages = NULL;
pcol->alloc_pages = 0;
pcol->nr_pages = 0;
pcol->length = 0;
pcol->pg_first = -1;
pcol->read_4_write = false;
pcol->that_locked_page = NULL;
}
static void _pcol_reset(struct page_collect *pcol)
{
pcol->expected_pages -= min(pcol->nr_pages, pcol->expected_pages);
pcol->pages = NULL;
pcol->alloc_pages = 0;
pcol->nr_pages = 0;
pcol->length = 0;
pcol->pg_first = -1;
pcol->ios = NULL;
pcol->that_locked_page = NULL;
/* This is probably the end of the loop, but in writes
* it might not end here; don't be left with nothing.
*/
if (!pcol->expected_pages)
pcol->expected_pages =
exofs_max_io_pages(&pcol->sbi->layout, ~0);
}
static int pcol_try_alloc(struct page_collect *pcol)
{
unsigned pages;
/* TODO: easily support bio chaining */
pages = exofs_max_io_pages(&pcol->sbi->layout, pcol->expected_pages);
for (; pages; pages >>= 1) {
pcol->pages = kmalloc_array(pages, sizeof(struct page *),
GFP_KERNEL);
if (likely(pcol->pages)) {
pcol->alloc_pages = pages;
return 0;
}
}
EXOFS_ERR("Failed to kmalloc expected_pages=%u\n",
pcol->expected_pages);
return -ENOMEM;
}
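/*
 * Illustration only, not part of exofs: a userspace sketch of the
 * "halve and retry" allocation used by pcol_try_alloc() above. The cap
 * parameter is a hypothetical stand-in for memory pressure so that the
 * fallback path can be exercised; all sketch_* names are made up for
 * this example.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Pretend allocator: refuses any request of more than cap elements. */
static void *sketch_try_alloc(size_t n, size_t elem, size_t cap)
{
	if (n > cap)
		return NULL;
	return calloc(n, elem);
}

/* Ask for "want" elements; on failure keep halving the request until an
 * allocation succeeds or the count reaches zero. *got receives the number
 * of elements actually allocated. */
static void *sketch_alloc_halving(size_t want, size_t elem, size_t cap,
				  size_t *got)
{
	size_t n;

	for (n = want; n; n >>= 1) {
		void *p = sketch_try_alloc(n, elem, cap);

		if (p) {
			*got = n;
			return p;
		}
	}
	*got = 0;
	return NULL;
}
```

/*
 * A smaller array only means that the I/O is split into more segments, so
 * degrading the request size is preferable to failing outright.
 */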
static void pcol_free(struct page_collect *pcol)
{
kfree(pcol->pages);
pcol->pages = NULL;
if (pcol->ios) {
ore_put_io_state(pcol->ios);
pcol->ios = NULL;
}
}
static int pcol_add_page(struct page_collect *pcol, struct page *page,
unsigned len)
{
if (unlikely(pcol->nr_pages >= pcol->alloc_pages))
return -ENOMEM;
pcol->pages[pcol->nr_pages++] = page;
pcol->length += len;
return 0;
}
enum {PAGE_WAS_NOT_IN_IO = 17};
static int update_read_page(struct page *page, int ret)
{
switch (ret) {
case 0:
/* Everything is OK */
SetPageUptodate(page);
if (PageError(page))
ClearPageError(page);
break;
case -EFAULT:
/* In this case we were trying to read something that wasn't on
* disk yet - return a page full of zeroes. This should be OK,
* because the object should be empty (if there was a write
* before this read, the read would be waiting with the page
* locked). */
clear_highpage(page);
SetPageUptodate(page);
if (PageError(page))
ClearPageError(page);
EXOFS_DBGMSG("recovered read error\n");
/* fall through */
case PAGE_WAS_NOT_IN_IO:
ret = 0; /* recovered error */
break;
default:
SetPageError(page);
}
return ret;
}
static void update_write_page(struct page *page, int ret)
{
if (unlikely(ret == PAGE_WAS_NOT_IN_IO))
return; /* don't pass start don't collect $200 */
if (ret) {
mapping_set_error(page->mapping, ret);
SetPageError(page);
}
end_page_writeback(page);
}
/* Called at the end of reads, to optionally unlock pages and update their
* status.
*/
static int __readpages_done(struct page_collect *pcol)
{
int i;
u64 good_bytes;
u64 length = 0;
int ret = ore_check_io(pcol->ios, NULL);
if (likely(!ret)) {
good_bytes = pcol->length;
ret = PAGE_WAS_NOT_IN_IO;
} else {
good_bytes = 0;
}
EXOFS_DBGMSG2("readpages_done(0x%lx) good_bytes=0x%llx"
" length=0x%lx nr_pages=%u\n",
pcol->inode->i_ino, _LLU(good_bytes), pcol->length,
pcol->nr_pages);
for (i = 0; i < pcol->nr_pages; i++) {
struct page *page = pcol->pages[i];
struct inode *inode = page->mapping->host;
int page_stat;
if (inode != pcol->inode)
continue; /* osd might add more pages at end */
if (likely(length < good_bytes))
page_stat = 0;
else
page_stat = ret;
EXOFS_DBGMSG2(" readpages_done(0x%lx, 0x%lx) %s\n",
inode->i_ino, page->index,
page_stat ? "bad_bytes" : "good_bytes");
ret = update_read_page(page, page_stat);
if (!pcol->read_4_write)
unlock_page(page);
length += PAGE_SIZE;
}
pcol_free(pcol);
EXOFS_DBGMSG2("readpages_done END\n");
return ret;
}
/* callback of async reads */
static void readpages_done(struct ore_io_state *ios, void *p)
{
struct page_collect *pcol = p;
__readpages_done(pcol);
atomic_dec(&pcol->sbi->s_curr_pending);
kfree(pcol);
}
static void _unlock_pcol_pages(struct page_collect *pcol, int ret, int rw)
{
int i;
for (i = 0; i < pcol->nr_pages; i++) {
struct page *page = pcol->pages[i];
if (rw == READ)
update_read_page(page, ret);
else
update_write_page(page, ret);
unlock_page(page);
}
}
static int _maybe_not_all_in_one_io(struct ore_io_state *ios,
struct page_collect *pcol_src, struct page_collect *pcol)
{
/* length was wrong or offset was not page aligned */
BUG_ON(pcol_src->nr_pages < ios->nr_pages);
if (pcol_src->nr_pages > ios->nr_pages) {
struct page **src_page;
unsigned pages_less = pcol_src->nr_pages - ios->nr_pages;
unsigned long len_less = pcol_src->length - ios->length;
unsigned i;
int ret;
/* This IO was trimmed */
pcol_src->nr_pages = ios->nr_pages;
pcol_src->length = ios->length;
/* Left over pages are passed to the next io */
pcol->expected_pages += pages_less;
pcol->nr_pages = pages_less;
pcol->length = len_less;
src_page = pcol_src->pages + pcol_src->nr_pages;
pcol->pg_first = (*src_page)->index;
ret = pcol_try_alloc(pcol);
if (unlikely(ret))
return ret;
for (i = 0; i < pages_less; ++i)
pcol->pages[i] = *src_page++;
EXOFS_DBGMSG("Length was adjusted nr_pages=0x%x "
"pages_less=0x%x expected_pages=0x%x "
"next_offset=0x%llx next_len=0x%lx\n",
pcol_src->nr_pages, pages_less, pcol->expected_pages,
pcol->pg_first * PAGE_SIZE, pcol->length);
}
return 0;
}
static int read_exec(struct page_collect *pcol)
{
struct exofs_i_info *oi = exofs_i(pcol->inode);
struct ore_io_state *ios;
struct page_collect *pcol_copy = NULL;
int ret;
if (!pcol->pages)
return 0;
if (!pcol->ios) {
ret = ore_get_rw_state(&pcol->sbi->layout, &oi->oc, true,
pcol->pg_first << PAGE_SHIFT,
pcol->length, &pcol->ios);
if (ret)
return ret;
}
ios = pcol->ios;
ios->pages = pcol->pages;
if (pcol->read_4_write) {
ore_read(pcol->ios);
return __readpages_done(pcol);
}
pcol_copy = kmalloc(sizeof(*pcol_copy), GFP_KERNEL);
if (!pcol_copy) {
ret = -ENOMEM;
goto err;
}
*pcol_copy = *pcol;
ios->done = readpages_done;
ios->private = pcol_copy;
/* pages ownership was passed to pcol_copy */
_pcol_reset(pcol);
ret = _maybe_not_all_in_one_io(ios, pcol_copy, pcol);
if (unlikely(ret))
goto err;
EXOFS_DBGMSG2("read_exec(0x%lx) offset=0x%llx length=0x%llx\n",
pcol->inode->i_ino, _LLU(ios->offset), _LLU(ios->length));
ret = ore_read(ios);
if (unlikely(ret))
goto err;
atomic_inc(&pcol->sbi->s_curr_pending);
return 0;
err:
if (!pcol_copy) /* Failed before ownership transfer */
pcol_copy = pcol;
_unlock_pcol_pages(pcol_copy, ret, READ);
pcol_free(pcol_copy);
kfree(pcol_copy);
return ret;
}
/* readpage_strip is called either directly from readpage() or by the VFS from
* within read_cache_pages(), to add one more page to be read. It will try to
* collect as many contiguous pages as possible. If a discontinuity is
* encountered, or it runs out of resources, it will submit the previous segment
* and will start a new collection. Eventually caller must submit the last
* segment if present.
*/
static int readpage_strip(void *data, struct page *page)
{
struct page_collect *pcol = data;
struct inode *inode = pcol->inode;
struct exofs_i_info *oi = exofs_i(inode);
loff_t i_size = i_size_read(inode);
pgoff_t end_index = i_size >> PAGE_SHIFT;
size_t len;
int ret;
BUG_ON(!PageLocked(page));
/* FIXME: Just for debugging, will be removed */
if (PageUptodate(page))
EXOFS_ERR("PageUptodate(0x%lx, 0x%lx)\n", pcol->inode->i_ino,
page->index);
pcol->that_locked_page = page;
if (page->index < end_index)
len = PAGE_SIZE;
else if (page->index == end_index)
len = i_size & ~PAGE_MASK;
else
len = 0;
if (!len || !obj_created(oi)) {
/* this will be out of bounds, or doesn't exist yet.
* Current page is cleared and the request is split
*/
clear_highpage(page);
SetPageUptodate(page);
if (PageError(page))
ClearPageError(page);
if (!pcol->read_4_write)
unlock_page(page);
EXOFS_DBGMSG("readpage_strip(0x%lx) empty page len=%zx "
"read_4_write=%d index=0x%lx end_index=0x%lx "
"splitting\n", inode->i_ino, len,
pcol->read_4_write, page->index, end_index);
return read_exec(pcol);
}
try_again:
if (unlikely(pcol->pg_first == -1)) {
pcol->pg_first = page->index;
} else if (unlikely((pcol->pg_first + pcol->nr_pages) !=
page->index)) {
/* Discontinuity detected, split the request */
ret = read_exec(pcol);
if (unlikely(ret))
goto fail;
goto try_again;
}
if (!pcol->pages) {
ret = pcol_try_alloc(pcol);
if (unlikely(ret))
goto fail;
}
if (len != PAGE_SIZE)
zero_user(page, len, PAGE_SIZE - len);
EXOFS_DBGMSG2(" readpage_strip(0x%lx, 0x%lx) len=0x%zx\n",
inode->i_ino, page->index, len);
ret = pcol_add_page(pcol, page, len);
if (ret) {
EXOFS_DBGMSG2("Failed pcol_add_page pages[i]=%p "
"this_len=0x%zx nr_pages=%u length=0x%lx\n",
page, len, pcol->nr_pages, pcol->length);
/* split the request, and start again with current page */
ret = read_exec(pcol);
if (unlikely(ret))
goto fail;
goto try_again;
}
return 0;
fail:
/* SetPageError(page); ??? */
unlock_page(page);
return ret;
}
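/*
 * Illustration only, not part of exofs: a userspace sketch of how
 * readpage_strip() above decides how many bytes of a page hold file data.
 * A 4096-byte page is assumed; the sketch_* names are hypothetical. Note
 * that in the kernel PAGE_MASK is ~(PAGE_SIZE - 1), so "i_size & ~PAGE_MASK"
 * is the offset of i_size within its page.
 */

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Number of valid data bytes in page "index" of a file of i_size bytes. */
static size_t sketch_page_len(unsigned long index, long long i_size)
{
	unsigned long end_index = (unsigned long)(i_size >> SKETCH_PAGE_SHIFT);

	if (index < end_index)
		return SKETCH_PAGE_SIZE;	/* fully inside the file */
	if (index == end_index)			/* partial last page */
		return (size_t)(i_size & (long long)(SKETCH_PAGE_SIZE - 1));
	return 0;				/* beyond end of file */
}
```

/*
 * A length of zero is the case where the page is simply cleared and marked
 * up to date instead of being read from the object store.
 */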
static int exofs_readpages(struct file *file, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages)
{
struct page_collect pcol;
int ret;
_pcol_init(&pcol, nr_pages, mapping->host);
ret = read_cache_pages(mapping, pages, readpage_strip, &pcol);
if (ret) {
EXOFS_ERR("read_cache_pages => %d\n", ret);
return ret;
}
ret = read_exec(&pcol);
if (unlikely(ret))
return ret;
return read_exec(&pcol);
}
static int _readpage(struct page *page, bool read_4_write)
{
struct page_collect pcol;
int ret;
_pcol_init(&pcol, 1, page->mapping->host);
pcol.read_4_write = read_4_write;
ret = readpage_strip(&pcol, page);
if (ret) {
EXOFS_ERR("_readpage => %d\n", ret);
return ret;
}
return read_exec(&pcol);
}
/*
* We don't need the file
*/
static int exofs_readpage(struct file *file, struct page *page)
{
return _readpage(page, false);
}
/* Completion callback for ore_write(). All writes are asynchronous */
static void writepages_done(struct ore_io_state *ios, void *p)
{
struct page_collect *pcol = p;
int i;
u64 good_bytes;
u64 length = 0;
int ret = ore_check_io(ios, NULL);
atomic_dec(&pcol->sbi->s_curr_pending);
if (likely(!ret)) {
good_bytes = pcol->length;
ret = PAGE_WAS_NOT_IN_IO;
} else {
good_bytes = 0;
}
EXOFS_DBGMSG2("writepages_done(0x%lx) good_bytes=0x%llx"
" length=0x%lx nr_pages=%u\n",
pcol->inode->i_ino, _LLU(good_bytes), pcol->length,
pcol->nr_pages);
for (i = 0; i < pcol->nr_pages; i++) {
struct page *page = pcol->pages[i];
struct inode *inode = page->mapping->host;
int page_stat;
if (inode != pcol->inode)
continue; /* osd might add more pages to a bio */
if (likely(length < good_bytes))
page_stat = 0;
else
page_stat = ret;
update_write_page(page, page_stat);
unlock_page(page);
EXOFS_DBGMSG2(" writepages_done(0x%lx, 0x%lx) status=%d\n",
inode->i_ino, page->index, page_stat);
length += PAGE_SIZE;
}
pcol_free(pcol);
kfree(pcol);
EXOFS_DBGMSG2("writepages_done END\n");
}
static struct page *__r4w_get_page(void *priv, u64 offset, bool *uptodate)
{
struct page_collect *pcol = priv;
pgoff_t index = offset / PAGE_SIZE;
if (!pcol->that_locked_page ||
(pcol->that_locked_page->index != index)) {
struct page *page;
loff_t i_size = i_size_read(pcol->inode);
if (offset >= i_size) {
*uptodate = true;
EXOFS_DBGMSG2("offset >= i_size index=0x%lx\n", index);
return ZERO_PAGE(0);
}
page = find_get_page(pcol->inode->i_mapping, index);
if (!page) {
page = find_or_create_page(pcol->inode->i_mapping,
index, GFP_NOFS);
if (unlikely(!page)) {
EXOFS_DBGMSG("find_or_create_page Failed "
"index=0x%llx\n", _LLU(index));
return NULL;
}
unlock_page(page);
}
*uptodate = PageUptodate(page);
EXOFS_DBGMSG2("index=0x%lx uptodate=%d\n", index, *uptodate);
return page;
} else {
EXOFS_DBGMSG2("YES that_locked_page index=0x%lx\n",
pcol->that_locked_page->index);
*uptodate = true;
return pcol->that_locked_page;
}
}
static void __r4w_put_page(void *priv, struct page *page)
{
struct page_collect *pcol = priv;
if ((pcol->that_locked_page != page) && (ZERO_PAGE(0) != page)) {
EXOFS_DBGMSG2("index=0x%lx\n", page->index);
put_page(page);
return;
}
EXOFS_DBGMSG2("that_locked_page index=0x%lx\n",
ZERO_PAGE(0) == page ? -1 : page->index);
}
static const struct _ore_r4w_op _r4w_op = {
.get_page = &__r4w_get_page,
.put_page = &__r4w_put_page,
};
static int write_exec(struct page_collect *pcol)
{
struct exofs_i_info *oi = exofs_i(pcol->inode);
struct ore_io_state *ios;
struct page_collect *pcol_copy = NULL;
int ret;
if (!pcol->pages)
return 0;
BUG_ON(pcol->ios);
ret = ore_get_rw_state(&pcol->sbi->layout, &oi->oc, false,
pcol->pg_first << PAGE_SHIFT,
pcol->length, &pcol->ios);
if (unlikely(ret))
goto err;
pcol_copy = kmalloc(sizeof(*pcol_copy), GFP_KERNEL);
if (!pcol_copy) {
EXOFS_ERR("write_exec: Failed to kmalloc(pcol)\n");
ret = -ENOMEM;
goto err;
}
*pcol_copy = *pcol;
ios = pcol->ios;
ios->pages = pcol_copy->pages;
ios->done = writepages_done;
ios->r4w = &_r4w_op;
ios->private = pcol_copy;
/* pages ownership was passed to pcol_copy */
_pcol_reset(pcol);
ret = _maybe_not_all_in_one_io(ios, pcol_copy, pcol);
if (unlikely(ret))
goto err;
EXOFS_DBGMSG2("write_exec(0x%lx) offset=0x%llx length=0x%llx\n",
pcol->inode->i_ino, _LLU(ios->offset), _LLU(ios->length));
ret = ore_write(ios);
if (unlikely(ret)) {
EXOFS_ERR("write_exec: ore_write() Failed\n");
goto err;
}
atomic_inc(&pcol->sbi->s_curr_pending);
return 0;
err:
if (!pcol_copy) /* Failed before ownership transfer */
pcol_copy = pcol;
_unlock_pcol_pages(pcol_copy, ret, WRITE);
pcol_free(pcol_copy);
kfree(pcol_copy);
return ret;
}
/* writepage_strip is called either directly from writepage() or by the VFS from
* within write_cache_pages(), to add one more page to be written to storage.
* It will try to collect as many contiguous pages as possible. If a
* discontinuity is encountered or it runs out of resources it will submit the
* previous segment and will start a new collection.
* Eventually caller must submit the last segment if present.
*/
static int writepage_strip(struct page *page,
struct writeback_control *wbc_unused, void *data)
{
struct page_collect *pcol = data;
struct inode *inode = pcol->inode;
struct exofs_i_info *oi = exofs_i(inode);
loff_t i_size = i_size_read(inode);
pgoff_t end_index = i_size >> PAGE_SHIFT;
size_t len;
int ret;
BUG_ON(!PageLocked(page));
ret = wait_obj_created(oi);
if (unlikely(ret))
goto fail;
if (page->index < end_index)
/* in this case, the page is within the limits of the file */
len = PAGE_SIZE;
else {
len = i_size & ~PAGE_MASK;
if (page->index > end_index || !len) {
/* in this case, the page is outside the limits
* (truncate in progress)
*/
ret = write_exec(pcol);
if (unlikely(ret))
goto fail;
if (PageError(page))
ClearPageError(page);
unlock_page(page);
EXOFS_DBGMSG("writepage_strip(0x%lx, 0x%lx) "
"outside the limits\n",
inode->i_ino, page->index);
return 0;
}
}
try_again:
if (unlikely(pcol->pg_first == -1)) {
pcol->pg_first = page->index;
} else if (unlikely((pcol->pg_first + pcol->nr_pages) !=
page->index)) {
/* Discontinuity detected, split the request */
ret = write_exec(pcol);
if (unlikely(ret))
goto fail;
EXOFS_DBGMSG("writepage_strip(0x%lx, 0x%lx) Discontinuity\n",
inode->i_ino, page->index);
goto try_again;
}
if (!pcol->pages) {
ret = pcol_try_alloc(pcol);
if (unlikely(ret))
goto fail;
}
EXOFS_DBGMSG2(" writepage_strip(0x%lx, 0x%lx) len=0x%zx\n",
inode->i_ino, page->index, len);
ret = pcol_add_page(pcol, page, len);
if (unlikely(ret)) {
EXOFS_DBGMSG2("Failed pcol_add_page "
"nr_pages=%u total_length=0x%lx\n",
pcol->nr_pages, pcol->length);
/* split the request, next loop will start again */
ret = write_exec(pcol);
if (unlikely(ret)) {
EXOFS_DBGMSG("write_exec failed => %d\n", ret);
goto fail;
}
goto try_again;
}
BUG_ON(PageWriteback(page));
set_page_writeback(page);
return 0;
fail:
EXOFS_DBGMSG("Error: writepage_strip(0x%lx, 0x%lx)=>%d\n",
inode->i_ino, page->index, ret);
mapping_set_error(page->mapping, -EIO);
unlock_page(page);
return ret;
}
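/*
 * Illustration only, not part of exofs: a userspace sketch of the
 * collection policy described for readpage_strip() and writepage_strip().
 * Pages are gathered while their indices are contiguous and the array has
 * room; a gap or a full array submits the current segment and starts a new
 * one. The function only counts submissions; its name and the cap
 * parameter (array capacity, assumed to be at least 1) are hypothetical.
 */

```c
#include <assert.h>
#include <stddef.h>

static size_t sketch_count_submissions(const unsigned long *idx, size_t n,
				       size_t cap)
{
	size_t runs = 0, in_run = 0, i;
	unsigned long first = 0;

	for (i = 0; i < n; i++) {
		/* Discontinuity or full array: submit what was collected. */
		if (in_run && (idx[i] != first + in_run || in_run == cap)) {
			runs++;
			in_run = 0;
		}
		if (!in_run)
			first = idx[i];	/* start a new segment here */
		in_run++;
	}
	if (in_run)
		runs++;	/* the caller submits the last segment */
	return runs;
}
```

/*
 * For the page indices 0, 1, 2, 5, 6, 9 this gives three submissions when
 * the array can hold eight pages and four when it can hold only two.
 */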
static int exofs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
struct page_collect pcol;
long start, end, expected_pages;
int ret;
start = wbc->range_start >> PAGE_SHIFT;
end = (wbc->range_end == LLONG_MAX) ?
start + mapping->nrpages :
wbc->range_end >> PAGE_SHIFT;
if (start || end)
expected_pages = end - start + 1;
else
expected_pages = mapping->nrpages;
if (expected_pages < 32L)
expected_pages = 32L;
EXOFS_DBGMSG2("inode(0x%lx) wbc->start=0x%llx wbc->end=0x%llx "
"nrpages=%lu start=0x%lx end=0x%lx expected_pages=%ld\n",
mapping->host->i_ino, wbc->range_start, wbc->range_end,
mapping->nrpages, start, end, expected_pages);
_pcol_init(&pcol, expected_pages, mapping->host);
ret = write_cache_pages(mapping, wbc, writepage_strip, &pcol);
if (unlikely(ret)) {
EXOFS_ERR("write_cache_pages => %d\n", ret);
return ret;
}
ret = write_exec(&pcol);
if (unlikely(ret))
return ret;
if (wbc->sync_mode == WB_SYNC_ALL) {
return write_exec(&pcol); /* pump the last remainder */
} else if (pcol.nr_pages) {
/* not SYNC: let the remainder join the next writeout */
unsigned i;
for (i = 0; i < pcol.nr_pages; i++) {
struct page *page = pcol.pages[i];
end_page_writeback(page);
set_page_dirty(page);
unlock_page(page);
}
}
return 0;
}
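/*
 * Illustration only, not part of exofs: a minimal user-space sketch of how
 * exofs_writepages() above sizes its page collection from the writeback
 * range. SKETCH_PAGE_SHIFT and sketch_expected_pages() are hypothetical
 * names, and the 4 KiB page size is an assumption.
 */

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Mirrors the sizing logic of exofs_writepages(); the name is hypothetical. */
static long sketch_expected_pages(long long range_start, long long range_end,
				  unsigned long nrpages)
{
	long start = (long)(range_start >> SKETCH_PAGE_SHIFT);
	/* An open-ended range (LLONG_MAX) is bounded by the cached page count. */
	long end = (range_end == LLONG_MAX) ?
			start + (long)nrpages :
			(long)(range_end >> SKETCH_PAGE_SHIFT);
	long expected;

	assert(range_start >= 0 && range_end >= range_start);

	if (start || end)
		expected = end - start + 1;
	else
		expected = (long)nrpages;

	/* Never size the collection below 32 pages. */
	if (expected < 32L)
		expected = 32L;
	return expected;
}
```

/*
 * Under these assumptions a whole-file writeback of a 100-page mapping asks
 * for 101 slots, while any small range is rounded up to the 32-page floor.
 */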
/*
static int exofs_writepage(struct page *page, struct writeback_control *wbc)
{
struct page_collect pcol;
int ret;
_pcol_init(&pcol, 1, page->mapping->host);
ret = writepage_strip(page, NULL, &pcol);
if (ret) {
EXOFS_ERR("exofs_writepage => %d\n", ret);
return ret;
}
return write_exec(&pcol);
}
*/
/* i_mutex is held, so inode->i_size is used directly */
static void _write_failed(struct inode *inode, loff_t to)
{
if (to > inode->i_size)
truncate_pagecache(inode, inode->i_size);
}
int exofs_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
int ret = 0;
struct page *page;
page = *pagep;
if (page == NULL) {
page = grab_cache_page_write_begin(mapping, pos >> PAGE_SHIFT,
flags);
if (!page) {
EXOFS_DBGMSG("grab_cache_page_write_begin failed\n");
return -ENOMEM;
}
*pagep = page;
}
/* read modify write */
if (!PageUptodate(page) && (len != PAGE_SIZE)) {
loff_t i_size = i_size_read(mapping->host);
pgoff_t end_index = i_size >> PAGE_SHIFT;
if (page->index > end_index) {
clear_highpage(page);
SetPageUptodate(page);
} else {
ret = _readpage(page, true);
if (ret) {
unlock_page(page);
EXOFS_DBGMSG("_readpage failed\n");
}
}
}
return ret;
}
static int exofs_write_begin_export(struct file *file,
struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
*pagep = NULL;
return exofs_write_begin(file, mapping, pos, len, flags, pagep,
fsdata);
}
static int exofs_write_end(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
{
struct inode *inode = mapping->host;
loff_t last_pos = pos + copied;
if (!PageUptodate(page)) {
if (copied < len) {
_write_failed(inode, pos + len);
copied = 0;
goto out;
}
SetPageUptodate(page);
}
if (last_pos > inode->i_size) {
i_size_write(inode, last_pos);
mark_inode_dirty(inode);
}
set_page_dirty(page);
out:
unlock_page(page);
put_page(page);
return copied;
}
static int exofs_releasepage(struct page *page, gfp_t gfp)
{
EXOFS_DBGMSG("page 0x%lx\n", page->index);
WARN_ON(1);
return 0;
}
static void exofs_invalidatepage(struct page *page, unsigned int offset,
unsigned int length)
{
EXOFS_DBGMSG("page 0x%lx offset 0x%x length 0x%x\n",
page->index, offset, length);
WARN_ON(1);
}
/* TODO: Should be easy enough to do properly */
static ssize_t exofs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
return 0;
}
const struct address_space_operations exofs_aops = {
.readpage = exofs_readpage,
.readpages = exofs_readpages,
.writepage = NULL,
.writepages = exofs_writepages,
.write_begin = exofs_write_begin_export,
.write_end = exofs_write_end,
.releasepage = exofs_releasepage,
.set_page_dirty = __set_page_dirty_nobuffers,
.invalidatepage = exofs_invalidatepage,
/* Not implemented Yet */
.bmap = NULL, /* TODO: use osd's OSD_ACT_READ_MAP */
.direct_IO = exofs_direct_IO,
/* For these, NULL has a special meaning or the default is not exported */
.migratepage = NULL,
.launder_page = NULL,
.is_partially_uptodate = NULL,
.error_remove_page = NULL,
};
/******************************************************************************
* INODE OPERATIONS
*****************************************************************************/
/*
* Test whether an inode is a fast symlink.
*/
static inline int exofs_inode_is_fast_symlink(struct inode *inode)
{
struct exofs_i_info *oi = exofs_i(inode);
return S_ISLNK(inode->i_mode) && (oi->i_data[0] != 0);
}
static int _do_truncate(struct inode *inode, loff_t newsize)
{
struct exofs_i_info *oi = exofs_i(inode);
struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
int ret;
inode->i_mtime = inode->i_ctime = current_time(inode);
ret = ore_truncate(&sbi->layout, &oi->oc, (u64)newsize);
if (likely(!ret))
truncate_setsize(inode, newsize);
EXOFS_DBGMSG2("(0x%lx) size=0x%llx ret=>%d\n",
inode->i_ino, newsize, ret);
return ret;
}
/*
* Set inode attributes - update size attribute on OSD if needed,
* otherwise just call generic functions.
*/
int exofs_setattr(struct dentry *dentry, struct iattr *iattr)
{
struct inode *inode = d_inode(dentry);
int error;
/* if we are about to modify an object, and it hasn't been
* created yet, wait
*/
error = wait_obj_created(exofs_i(inode));
if (unlikely(error))
return error;
error = setattr_prepare(dentry, iattr);
if (unlikely(error))
return error;
if ((iattr->ia_valid & ATTR_SIZE) &&
iattr->ia_size != i_size_read(inode)) {
error = _do_truncate(inode, iattr->ia_size);
if (unlikely(error))
return error;
}
setattr_copy(inode, iattr);
mark_inode_dirty(inode);
return 0;
}
static const struct osd_attr g_attr_inode_file_layout = ATTR_DEF(
EXOFS_APAGE_FS_DATA,
EXOFS_ATTR_INODE_FILE_LAYOUT,
0);
static const struct osd_attr g_attr_inode_dir_layout = ATTR_DEF(
EXOFS_APAGE_FS_DATA,
EXOFS_ATTR_INODE_DIR_LAYOUT,
0);
/*
* Read the Linux inode info from the OSD, and return it as is. In exofs the
* inode info is in an application specific page/attribute of the osd-object.
*/
static int exofs_get_inode(struct super_block *sb, struct exofs_i_info *oi,
struct exofs_fcb *inode)
{
struct exofs_sb_info *sbi = sb->s_fs_info;
struct osd_attr attrs[] = {
[0] = g_attr_inode_data,
[1] = g_attr_inode_file_layout,
[2] = g_attr_inode_dir_layout,
};
struct ore_io_state *ios;
struct exofs_on_disk_inode_layout *layout;
int ret;
ret = ore_get_io_state(&sbi->layout, &oi->oc, &ios);
if (unlikely(ret)) {
EXOFS_ERR("%s: ore_get_io_state failed.\n", __func__);
return ret;
}
attrs[1].len = exofs_on_disk_inode_layout_size(sbi->oc.numdevs);
attrs[2].len = exofs_on_disk_inode_layout_size(sbi->oc.numdevs);
ios->in_attr = attrs;
ios->in_attr_len = ARRAY_SIZE(attrs);
ret = ore_read(ios);
if (unlikely(ret)) {
EXOFS_ERR("object(0x%llx) corrupted, return empty file=>%d\n",
_LLU(oi->one_comp.obj.id), ret);
memset(inode, 0, sizeof(*inode));
inode->i_mode = 0040000 | (0777 & ~022);
/* If the object is lost on the target we might as well allow its
* deletion.
*/
ret = 0;
goto out;
}
ret = extract_attr_from_ios(ios, &attrs[0]);
if (ret) {
EXOFS_ERR("%s: extract_attr 0 of inode failed\n", __func__);
goto out;
}
WARN_ON(attrs[0].len != EXOFS_INO_ATTR_SIZE);
memcpy(inode, attrs[0].val_ptr, EXOFS_INO_ATTR_SIZE);
ret = extract_attr_from_ios(ios, &attrs[1]);
if (ret) {
EXOFS_ERR("%s: extract_attr 1 of inode failed\n", __func__);
goto out;
}
if (attrs[1].len) {
layout = attrs[1].val_ptr;
if (layout->gen_func != cpu_to_le16(LAYOUT_MOVING_WINDOW)) {
EXOFS_ERR("%s: unsupported files layout %d\n",
__func__, layout->gen_func);
ret = -ENOTSUPP;
goto out;
}
}
ret = extract_attr_from_ios(ios, &attrs[2]);
if (ret) {
EXOFS_ERR("%s: extract_attr 2 of inode failed\n", __func__);
goto out;
}
if (attrs[2].len) {
layout = attrs[2].val_ptr;
if (layout->gen_func != cpu_to_le16(LAYOUT_MOVING_WINDOW)) {
EXOFS_ERR("%s: unsupported meta-data layout %d\n",
__func__, layout->gen_func);
ret = -ENOTSUPP;
goto out;
}
}
out:
ore_put_io_state(ios);
return ret;
}
static void __oi_init(struct exofs_i_info *oi)
{
init_waitqueue_head(&oi->i_wq);
oi->i_flags = 0;
}
/*
* Fill in an inode read from the OSD and set it up for use
*/
struct inode *exofs_iget(struct super_block *sb, unsigned long ino)
{
struct exofs_i_info *oi;
struct exofs_fcb fcb;
struct inode *inode;
int ret;
inode = iget_locked(sb, ino);
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW))
return inode;
oi = exofs_i(inode);
__oi_init(oi);
exofs_init_comps(&oi->oc, &oi->one_comp, sb->s_fs_info,
exofs_oi_objno(oi));
/* read the inode from the osd */
ret = exofs_get_inode(sb, oi, &fcb);
if (ret)
goto bad_inode;
set_obj_created(oi);
/* copy stuff from on-disk struct to in-memory struct */
inode->i_mode = le16_to_cpu(fcb.i_mode);
i_uid_write(inode, le32_to_cpu(fcb.i_uid));
i_gid_write(inode, le32_to_cpu(fcb.i_gid));
set_nlink(inode, le16_to_cpu(fcb.i_links_count));
inode->i_ctime.tv_sec = (signed)le32_to_cpu(fcb.i_ctime);
inode->i_atime.tv_sec = (signed)le32_to_cpu(fcb.i_atime);
inode->i_mtime.tv_sec = (signed)le32_to_cpu(fcb.i_mtime);
inode->i_ctime.tv_nsec =
inode->i_atime.tv_nsec = inode->i_mtime.tv_nsec = 0;
oi->i_commit_size = le64_to_cpu(fcb.i_size);
i_size_write(inode, oi->i_commit_size);
inode->i_blkbits = EXOFS_BLKSHIFT;
inode->i_generation = le32_to_cpu(fcb.i_generation);
oi->i_dir_start_lookup = 0;
if ((inode->i_nlink == 0) && (inode->i_mode == 0)) {
ret = -ESTALE;
goto bad_inode;
}
if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
if (fcb.i_data[0])
inode->i_rdev =
old_decode_dev(le32_to_cpu(fcb.i_data[0]));
else
inode->i_rdev =
new_decode_dev(le32_to_cpu(fcb.i_data[1]));
} else {
memcpy(oi->i_data, fcb.i_data, sizeof(fcb.i_data));
}
if (S_ISREG(inode->i_mode)) {
inode->i_op = &exofs_file_inode_operations;
inode->i_fop = &exofs_file_operations;
inode->i_mapping->a_ops = &exofs_aops;
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &exofs_dir_inode_operations;
inode->i_fop = &exofs_dir_operations;
inode->i_mapping->a_ops = &exofs_aops;
} else if (S_ISLNK(inode->i_mode)) {
if (exofs_inode_is_fast_symlink(inode)) {
inode->i_op = &simple_symlink_inode_operations;
inode->i_link = (char *)oi->i_data;
} else {
inode->i_op = &page_symlink_inode_operations;
inode_nohighmem(inode);
inode->i_mapping->a_ops = &exofs_aops;
}
} else {
inode->i_op = &exofs_special_inode_operations;
if (fcb.i_data[0])
init_special_inode(inode, inode->i_mode,
old_decode_dev(le32_to_cpu(fcb.i_data[0])));
else
init_special_inode(inode, inode->i_mode,
new_decode_dev(le32_to_cpu(fcb.i_data[1])));
}
unlock_new_inode(inode);
return inode;
bad_inode:
iget_failed(inode);
return ERR_PTR(ret);
}
int __exofs_wait_obj_created(struct exofs_i_info *oi)
{
if (!obj_created(oi)) {
EXOFS_DBGMSG("!obj_created\n");
BUG_ON(!obj_2bcreated(oi));
wait_event(oi->i_wq, obj_created(oi));
EXOFS_DBGMSG("wait_event done\n");
}
return unlikely(is_bad_inode(&oi->vfs_inode)) ? -EIO : 0;
}
/*
* Callback function from exofs_new_inode(). The important thing is that we
* set the obj_created flag so that other methods know that the object exists on
* the OSD.
*/
static void create_done(struct ore_io_state *ios, void *p)
{
struct inode *inode = p;
struct exofs_i_info *oi = exofs_i(inode);
struct exofs_sb_info *sbi = inode->i_sb->s_fs_info;
int ret;
ret = ore_check_io(ios, NULL);
ore_put_io_state(ios);
atomic_dec(&sbi->s_curr_pending);
if (unlikely(ret)) {
EXOFS_ERR("object=0x%llx creation failed in pid=0x%llx\n",
_LLU(exofs_oi_objno(oi)),
_LLU(oi->one_comp.obj.partition));
/* TODO: When the FS is corrupted, creation can fail because the
* object already exists. Get rid of this asynchronous creation:
* if the object exists, increment the object counter and try the
* next object until we succeed. All these dangling objects will
* be made into lost files by chkfs.exofs.
*/
}
set_obj_created(oi);
wake_up(&oi->i_wq);
}
/*
* Set up a new inode and create an object for it on the OSD
*/
struct inode *exofs_new_inode(struct inode *dir, umode_t mode)
{
struct super_block *sb = dir->i_sb;
struct exofs_sb_info *sbi = sb->s_fs_info;
struct inode *inode;
struct exofs_i_info *oi;
struct ore_io_state *ios;
int ret;
inode = new_inode(sb);
if (!inode)
return ERR_PTR(-ENOMEM);
oi = exofs_i(inode);
__oi_init(oi);
set_obj_2bcreated(oi);
inode_init_owner(inode, dir, mode);
inode->i_ino = sbi->s_nextid++;
inode->i_blkbits = EXOFS_BLKSHIFT;
inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
oi->i_commit_size = inode->i_size = 0;
spin_lock(&sbi->s_next_gen_lock);
inode->i_generation = sbi->s_next_generation++;
spin_unlock(&sbi->s_next_gen_lock);
insert_inode_hash(inode);
exofs_init_comps(&oi->oc, &oi->one_comp, sb->s_fs_info,
exofs_oi_objno(oi));
exofs_sbi_write_stats(sbi); /* Make sure new sbi->s_nextid is on disk */
mark_inode_dirty(inode);
ret = ore_get_io_state(&sbi->layout, &oi->oc, &ios);
if (unlikely(ret)) {
EXOFS_ERR("exofs_new_inode: ore_get_io_state failed\n");
return ERR_PTR(ret);
}
ios->done = create_done;
ios->private = inode;
ret = ore_create(ios);
if (ret) {
ore_put_io_state(ios);
return ERR_PTR(ret);
}
atomic_inc(&sbi->s_curr_pending);
return inode;
}
/*
* struct to pass two arguments to update_inode's callback
*/
struct updatei_args {
struct exofs_sb_info *sbi;
struct exofs_fcb fcb;
};
/*
* Callback function from exofs_update_inode().
*/
static void updatei_done(struct ore_io_state *ios, void *p)
{
struct updatei_args *args = p;
ore_put_io_state(ios);
atomic_dec(&args->sbi->s_curr_pending);
kfree(args);
}
/*
* Write the inode to the OSD. Just fill up the struct, and set the attribute
* synchronously or asynchronously depending on the do_sync flag.
*/
static int exofs_update_inode(struct inode *inode, int do_sync)
{
struct exofs_i_info *oi = exofs_i(inode);
struct super_block *sb = inode->i_sb;
struct exofs_sb_info *sbi = sb->s_fs_info;
struct ore_io_state *ios;
struct osd_attr attr;
struct exofs_fcb *fcb;
struct updatei_args *args;
int ret;
args = kzalloc(sizeof(*args), GFP_KERNEL);
if (!args) {
EXOFS_DBGMSG("Failed kzalloc of args\n");
return -ENOMEM;
}
fcb = &args->fcb;
fcb->i_mode = cpu_to_le16(inode->i_mode);
fcb->i_uid = cpu_to_le32(i_uid_read(inode));
fcb->i_gid = cpu_to_le32(i_gid_read(inode));
fcb->i_links_count = cpu_to_le16(inode->i_nlink);
fcb->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
fcb->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
fcb->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
oi->i_commit_size = i_size_read(inode);
fcb->i_size = cpu_to_le64(oi->i_commit_size);
fcb->i_generation = cpu_to_le32(inode->i_generation);
if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
if (old_valid_dev(inode->i_rdev)) {
fcb->i_data[0] =
cpu_to_le32(old_encode_dev(inode->i_rdev));
fcb->i_data[1] = 0;
} else {
fcb->i_data[0] = 0;
fcb->i_data[1] =
cpu_to_le32(new_encode_dev(inode->i_rdev));
fcb->i_data[2] = 0;
}
} else
memcpy(fcb->i_data, oi->i_data, sizeof(fcb->i_data));
ret = ore_get_io_state(&sbi->layout, &oi->oc, &ios);
if (unlikely(ret)) {
EXOFS_ERR("%s: ore_get_io_state failed.\n", __func__);
goto free_args;
}
attr = g_attr_inode_data;
attr.val_ptr = fcb;
ios->out_attr_len = 1;
ios->out_attr = &attr;
wait_obj_created(oi);
if (!do_sync) {
args->sbi = sbi;
ios->done = updatei_done;
ios->private = args;
}
ret = ore_write(ios);
if (!do_sync && !ret) {
atomic_inc(&sbi->s_curr_pending);
goto out; /* deallocation in updatei_done */
}
ore_put_io_state(ios);
free_args:
kfree(args);
out:
EXOFS_DBGMSG("(0x%lx) do_sync=%d ret=>%d\n",
inode->i_ino, do_sync, ret);
return ret;
}
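/*
 * Illustration only, not part of exofs: a user-space sketch of how a device
 * number round-trips through i_data[0]/i_data[1] in exofs_update_inode() and
 * exofs_iget(). The bit layouts are modelled on the kernel's old/new
 * encode/decode helpers as I understand them; treat them as an assumption.
 * All sketch_* names are hypothetical, and the little-endian conversions are
 * left out. The new form assumes a 12-bit major and a 20-bit minor.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mirrors old_valid_dev(); both halves must fit in 8 bits. */
static int sketch_old_valid(unsigned major, unsigned minor)
{
	return major < 256 && minor < 256;
}

/* Pack a device number into two words, old form first when it fits. */
static void sketch_pack_rdev(unsigned major, unsigned minor, uint32_t data[2])
{
	assert(data);
	if (sketch_old_valid(major, minor)) {
		data[0] = (major << 8) | minor;		/* old 16-bit form */
		data[1] = 0;
	} else {
		data[0] = 0;
		data[1] = (minor & 0xff) | (major << 8) |
			  ((minor & ~0xffu) << 12);	/* new 32-bit form */
	}
}

/* Unpack: a non-zero word 0 means the old form was stored. */
static void sketch_unpack_rdev(const uint32_t data[2],
			       unsigned *major, unsigned *minor)
{
	assert(data && major && minor);
	if (data[0]) {
		*major = (data[0] >> 8) & 0xff;
		*minor = data[0] & 0xff;
	} else {
		*major = (data[1] & 0xfff00) >> 8;
		*minor = (data[1] & 0xff) | ((data[1] >> 12) & 0xfff00);
	}
}
```

/*
 * Device 0:0 packs to two zero words and still decodes to 0:0 through the
 * new-form branch, so testing word 0 for non-zero is enough to tell the two
 * forms apart.
 */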
int exofs_write_inode(struct inode *inode, struct writeback_control *wbc)
{
/* FIXME: fix fsync and use wbc->sync_mode == WB_SYNC_ALL */
return exofs_update_inode(inode, 1);
}
/*
* Callback function from exofs_delete_inode() - don't have much cleaning up to
* do.
*/
static void delete_done(struct ore_io_state *ios, void *p)
{
struct exofs_sb_info *sbi = p;
ore_put_io_state(ios);
atomic_dec(&sbi->s_curr_pending);
}
/*
* Called when the refcount of an inode reaches zero. We remove the object
* from the OSD here. We make sure the object was created before we try and
* delete it.
*/
void exofs_evict_inode(struct inode *inode)
{
struct exofs_i_info *oi = exofs_i(inode);
struct super_block *sb = inode->i_sb;
struct exofs_sb_info *sbi = sb->s_fs_info;
struct ore_io_state *ios;
int ret;
truncate_inode_pages_final(&inode->i_data);
/* TODO: should do better here */
if (inode->i_nlink || is_bad_inode(inode))
goto no_delete;
inode->i_size = 0;
clear_inode(inode);
/* if we are deleting an obj that hasn't been created yet, wait.
* This also makes sure that create_done cannot be called with an
* already evicted inode.
*/
wait_obj_created(oi);
/* ignore the error, attempt a remove anyway */
/* Now Remove the OSD objects */
ret = ore_get_io_state(&sbi->layout, &oi->oc, &ios);
if (unlikely(ret)) {
EXOFS_ERR("%s: ore_get_io_state failed\n", __func__);
return;
}
ios->done = delete_done;
ios->private = sbi;
ret = ore_remove(ios);
if (ret) {
EXOFS_ERR("%s: ore_remove failed\n", __func__);
ore_put_io_state(ios);
return;
}
atomic_inc(&sbi->s_curr_pending);
return;
no_delete:
clear_inode(inode);
}
/*
* Copyright (C) 2005, 2006
* Avishay Traeger (avishay@gmail.com)
* Copyright (C) 2008, 2009
* Boaz Harrosh <ooo@electrozaur.com>
*
* Copyrights for code taken from ext2:
* Copyright (C) 1992, 1993, 1994, 1995
* Remy Card (card@masi.ibp.fr)
* Laboratoire MASI - Institut Blaise Pascal
* Universite Pierre et Marie Curie (Paris VI)
* from
* linux/fs/minix/inode.c
* Copyright (C) 1991, 1992 Linus Torvalds
*
* This file is part of exofs.
*
* exofs is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation. Since it is based on ext2, and the only
* valid version of GPL for the Linux kernel is version 2, the only valid
* version of GPL for exofs is version 2.
*
* exofs is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with exofs; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include "exofs.h"
static inline int exofs_add_nondir(struct dentry *dentry, struct inode *inode)
{
int err = exofs_add_link(dentry, inode);
if (!err) {
d_instantiate(dentry, inode);
return 0;
}
inode_dec_link_count(inode);
iput(inode);
return err;
}
static struct dentry *exofs_lookup(struct inode *dir, struct dentry *dentry,
unsigned int flags)
{
struct inode *inode;
ino_t ino;
if (dentry->d_name.len > EXOFS_NAME_LEN)
return ERR_PTR(-ENAMETOOLONG);
ino = exofs_inode_by_name(dir, dentry);
inode = ino ? exofs_iget(dir->i_sb, ino) : NULL;
return d_splice_alias(inode, dentry);
}
static int exofs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
bool excl)
{
struct inode *inode = exofs_new_inode(dir, mode);
int err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &exofs_file_inode_operations;
inode->i_fop = &exofs_file_operations;
inode->i_mapping->a_ops = &exofs_aops;
mark_inode_dirty(inode);
err = exofs_add_nondir(dentry, inode);
}
return err;
}
static int exofs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
dev_t rdev)
{
struct inode *inode;
int err;
inode = exofs_new_inode(dir, mode);
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
init_special_inode(inode, inode->i_mode, rdev);
mark_inode_dirty(inode);
err = exofs_add_nondir(dentry, inode);
}
return err;
}
static int exofs_symlink(struct inode *dir, struct dentry *dentry,
const char *symname)
{
struct super_block *sb = dir->i_sb;
int err = -ENAMETOOLONG;
unsigned l = strlen(symname)+1;
struct inode *inode;
struct exofs_i_info *oi;
if (l > sb->s_blocksize)
goto out;
inode = exofs_new_inode(dir, S_IFLNK | S_IRWXUGO);
err = PTR_ERR(inode);
if (IS_ERR(inode))
goto out;
oi = exofs_i(inode);
if (l > sizeof(oi->i_data)) {
/* slow symlink */
inode->i_op = &page_symlink_inode_operations;
inode_nohighmem(inode);
inode->i_mapping->a_ops = &exofs_aops;
memset(oi->i_data, 0, sizeof(oi->i_data));
err = page_symlink(inode, symname, l);
if (err)
goto out_fail;
} else {
/* fast symlink */
inode->i_op = &simple_symlink_inode_operations;
inode->i_link = (char *)oi->i_data;
memcpy(oi->i_data, symname, l);
inode->i_size = l-1;
}
mark_inode_dirty(inode);
err = exofs_add_nondir(dentry, inode);
out:
return err;
out_fail:
inode_dec_link_count(inode);
iput(inode);
goto out;
}
static int exofs_link(struct dentry *old_dentry, struct inode *dir,
struct dentry *dentry)
{
struct inode *inode = d_inode(old_dentry);
inode->i_ctime = current_time(inode);
inode_inc_link_count(inode);
ihold(inode);
return exofs_add_nondir(dentry, inode);
}
static int exofs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
{
struct inode *inode;
int err;
inode_inc_link_count(dir);
inode = exofs_new_inode(dir, S_IFDIR | mode);
err = PTR_ERR(inode);
if (IS_ERR(inode))
goto out_dir;
inode->i_op = &exofs_dir_inode_operations;
inode->i_fop = &exofs_dir_operations;
inode->i_mapping->a_ops = &exofs_aops;
inode_inc_link_count(inode);
err = exofs_make_empty(inode, dir);
if (err)
goto out_fail;
err = exofs_add_link(dentry, inode);
if (err)
goto out_fail;
d_instantiate(dentry, inode);
out:
return err;
out_fail:
inode_dec_link_count(inode);
inode_dec_link_count(inode);
iput(inode);
out_dir:
inode_dec_link_count(dir);
goto out;
}
static int exofs_unlink(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = d_inode(dentry);
struct exofs_dir_entry *de;
struct page *page;
int err = -ENOENT;
de = exofs_find_entry(dir, dentry, &page);
if (!de)
goto out;
err = exofs_delete_entry(de, page);
if (err)
goto out;
inode->i_ctime = dir->i_ctime;
inode_dec_link_count(inode);
err = 0;
out:
return err;
}
static int exofs_rmdir(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = d_inode(dentry);
int err = -ENOTEMPTY;
if (exofs_empty_dir(inode)) {
err = exofs_unlink(dir, dentry);
if (!err) {
inode->i_size = 0;
inode_dec_link_count(inode);
inode_dec_link_count(dir);
}
}
return err;
}
static int exofs_rename(struct inode *old_dir, struct dentry *old_dentry,
struct inode *new_dir, struct dentry *new_dentry,
unsigned int flags)
{
struct inode *old_inode = d_inode(old_dentry);
struct inode *new_inode = d_inode(new_dentry);
struct page *dir_page = NULL;
struct exofs_dir_entry *dir_de = NULL;
struct page *old_page;
struct exofs_dir_entry *old_de;
int err = -ENOENT;
if (flags & ~RENAME_NOREPLACE)
return -EINVAL;
old_de = exofs_find_entry(old_dir, old_dentry, &old_page);
if (!old_de)
goto out;
if (S_ISDIR(old_inode->i_mode)) {
err = -EIO;
dir_de = exofs_dotdot(old_inode, &dir_page);
if (!dir_de)
goto out_old;
}
if (new_inode) {
struct page *new_page;
struct exofs_dir_entry *new_de;
err = -ENOTEMPTY;
if (dir_de && !exofs_empty_dir(new_inode))
goto out_dir;
err = -ENOENT;
new_de = exofs_find_entry(new_dir, new_dentry, &new_page);
if (!new_de)
goto out_dir;
err = exofs_set_link(new_dir, new_de, new_page, old_inode);
new_inode->i_ctime = current_time(new_inode);
if (dir_de)
drop_nlink(new_inode);
inode_dec_link_count(new_inode);
if (err)
goto out_dir;
} else {
err = exofs_add_link(new_dentry, old_inode);
if (err)
goto out_dir;
if (dir_de)
inode_inc_link_count(new_dir);
}
old_inode->i_ctime = current_time(old_inode);
exofs_delete_entry(old_de, old_page);
mark_inode_dirty(old_inode);
if (dir_de) {
err = exofs_set_link(old_inode, dir_de, dir_page, new_dir);
inode_dec_link_count(old_dir);
if (err)
goto out_dir;
}
return 0;
out_dir:
if (dir_de) {
kunmap(dir_page);
put_page(dir_page);
}
out_old:
kunmap(old_page);
put_page(old_page);
out:
return err;
}
const struct inode_operations exofs_dir_inode_operations = {
.create = exofs_create,
.lookup = exofs_lookup,
.link = exofs_link,
.unlink = exofs_unlink,
.symlink = exofs_symlink,
.mkdir = exofs_mkdir,
.rmdir = exofs_rmdir,
.mknod = exofs_mknod,
.rename = exofs_rename,
.setattr = exofs_setattr,
};
const struct inode_operations exofs_special_inode_operations = {
.setattr = exofs_setattr,
};
/*
* Copyright (C) 2005, 2006
* Avishay Traeger (avishay@gmail.com)
* Copyright (C) 2008, 2009
* Boaz Harrosh <ooo@electrozaur.com>
*
* This file is part of exofs.
*
* exofs is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation. Since it is based on ext2, and the only
* valid version of GPL for the Linux kernel is version 2, the only valid
* version of GPL for exofs is version 2.
*
* exofs is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with exofs; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include <linux/slab.h>
#include <linux/module.h>
#include <asm/div64.h>
#include <linux/lcm.h>
#include "ore_raid.h"
MODULE_AUTHOR("Boaz Harrosh <ooo@electrozaur.com>");
MODULE_DESCRIPTION("Objects Raid Engine ore.ko");
MODULE_LICENSE("GPL");
/* ore_verify_layout does a couple of things:
* 1. Given a minimum set of required parameters, fixes up the rest of the
* members so that they are operational for the ore. The required parameters
* are those defined by the pnfs-objects layout STD.
* 2. Checks whether the current ore code actually supports these parameters;
* for example, stripe_unit must be a multiple of the system PAGE_SIZE.
* 3. Caches some heavily used calculations that will be needed by users.
*/
enum { BIO_MAX_PAGES_KMALLOC =
(PAGE_SIZE - sizeof(struct bio)) / sizeof(struct bio_vec),};
int ore_verify_layout(unsigned total_comps, struct ore_layout *layout)
{
u64 stripe_length;
switch (layout->raid_algorithm) {
case PNFS_OSD_RAID_0:
layout->parity = 0;
break;
case PNFS_OSD_RAID_5:
layout->parity = 1;
break;
case PNFS_OSD_RAID_PQ:
layout->parity = 2;
break;
case PNFS_OSD_RAID_4:
default:
ORE_ERR("Only RAID_0/5/6 for now, received-enum=%d\n",
layout->raid_algorithm);
return -EINVAL;
}
if (0 != (layout->stripe_unit & ~PAGE_MASK)) {
ORE_ERR("Stripe Unit(0x%llx)"
" must be a multiple of PAGE_SIZE(0x%lx)\n",
_LLU(layout->stripe_unit), PAGE_SIZE);
return -EINVAL;
}
if (layout->group_width) {
if (!layout->group_depth) {
ORE_ERR("group_depth == 0 && group_width != 0\n");
return -EINVAL;
}
if (total_comps < (layout->group_width * layout->mirrors_p1)) {
ORE_ERR("Data Map wrong, "
"numdevs=%d < group_width=%d * mirrors=%d\n",
total_comps, layout->group_width,
layout->mirrors_p1);
return -EINVAL;
}
layout->group_count = total_comps / layout->mirrors_p1 /
layout->group_width;
} else {
if (layout->group_depth) {
printk(KERN_NOTICE "Warning: group_depth ignored "
"group_width == 0 && group_depth == %llu\n",
_LLU(layout->group_depth));
}
layout->group_width = total_comps / layout->mirrors_p1;
layout->group_depth = -1;
layout->group_count = 1;
}
stripe_length = (u64)layout->group_width * layout->stripe_unit;
if (stripe_length >= (1ULL << 32)) {
ORE_ERR("Stripe_length(0x%llx) >= 32bit is not supported\n",
_LLU(stripe_length));
return -EINVAL;
}
layout->max_io_length =
(BIO_MAX_PAGES_KMALLOC * PAGE_SIZE - layout->stripe_unit) *
(layout->group_width - layout->parity);
if (layout->parity) {
unsigned stripe_length =
(layout->group_width - layout->parity) *
layout->stripe_unit;
layout->max_io_length /= stripe_length;
layout->max_io_length *= stripe_length;
}
ORE_DBGMSG("max_io_length=0x%lx\n", layout->max_io_length);
return 0;
}
EXPORT_SYMBOL(ore_verify_layout);
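/*
 * Illustration only, not part of ore: a user-space sketch of the derived
 * values that ore_verify_layout() caches. struct sketch_layout, the sketch_*
 * names, the 4 KiB page size and the 200-page bio budget (standing in for
 * BIO_MAX_PAGES_KMALLOC) are assumptions. The group_depth handling is left
 * out, max_io_length is widened to 64 bits, and the guards marked
 * "sketch-only" do not exist in the original.
 */

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumption */
#define SKETCH_BIO_MAX_PAGES	200u	/* assumption */

struct sketch_layout {
	uint64_t stripe_unit;
	unsigned group_width;
	unsigned mirrors_p1;
	unsigned parity;
	unsigned group_count;
	uint64_t max_io_length;
};

/* Returns 0 on success, -1 if the layout would be rejected. */
static int sketch_verify_layout(unsigned total_comps, struct sketch_layout *l)
{
	uint64_t stripe_length, budget;

	assert(l && l->mirrors_p1 > 0);

	/* Page-multiple check; the non-zero test is a sketch-only guard. */
	if (!l->stripe_unit || l->stripe_unit % SKETCH_PAGE_SIZE)
		return -1;

	if (l->group_width) {
		if (total_comps < l->group_width * l->mirrors_p1)
			return -1;
		l->group_count = total_comps / l->mirrors_p1 / l->group_width;
	} else {
		/* No grouping requested: one group spanning every device. */
		l->group_width = total_comps / l->mirrors_p1;
		l->group_count = 1;
	}

	stripe_length = (uint64_t)l->group_width * l->stripe_unit;
	if (stripe_length >= (1ULL << 32))
		return -1;

	/* Sketch-only guards against underflow and division by zero. */
	budget = (uint64_t)SKETCH_BIO_MAX_PAGES * SKETCH_PAGE_SIZE;
	if (l->group_width <= l->parity || l->stripe_unit >= budget)
		return -1;

	l->max_io_length = (budget - l->stripe_unit) *
			   (l->group_width - l->parity);
	if (l->parity) {
		/* With parity, round down to a whole number of data stripes. */
		uint64_t data_stripe =
			(uint64_t)(l->group_width - l->parity) * l->stripe_unit;

		l->max_io_length -= l->max_io_length % data_stripe;
	}
	return 0;
}
```

/*
 * With eight devices, a 64 KiB stripe unit and one parity device, the budget
 * of 5275648 bytes is rounded down to 11 full data stripes of 458752 bytes.
 */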
static u8 *_ios_cred(struct ore_io_state *ios, unsigned index)
{
return ios->oc->comps[index & ios->oc->single_comp].cred;
}
static struct osd_obj_id *_ios_obj(struct ore_io_state *ios, unsigned index)
{
return &ios->oc->comps[index & ios->oc->single_comp].obj;
}
static struct osd_dev *_ios_od(struct ore_io_state *ios, unsigned index)
{
ORE_DBGMSG2("oc->first_dev=%d oc->numdevs=%d i=%d oc->ods=%p\n",
ios->oc->first_dev, ios->oc->numdevs, index,
ios->oc->ods);
return ore_comp_dev(ios->oc, index);
}
int _ore_get_io_state(struct ore_layout *layout,
struct ore_components *oc, unsigned numdevs,
unsigned sgs_per_dev, unsigned num_par_pages,
struct ore_io_state **pios)
{
struct ore_io_state *ios;
size_t size_ios, size_extra, size_total;
void *ios_extra;
/*
* The desired layout looks like this, with the extra_allocation
* items pointed at from fields within ios or per_dev:
struct __alloc_all_io_state {
struct ore_io_state ios;
struct ore_per_dev_state per_dev[numdevs];
union {
struct osd_sg_entry sglist[sgs_per_dev * numdevs];
struct page *pages[num_par_pages];
} extra_allocation;
} whole_allocation;
*/
/* This should never happen, so abort early if it ever does. */
if (sgs_per_dev && num_par_pages) {
ORE_DBGMSG("Tried to use both pages and sglist\n");
*pios = NULL;
return -EINVAL;
}
if (numdevs > (INT_MAX - sizeof(*ios)) /
sizeof(struct ore_per_dev_state))
return -ENOMEM;
size_ios = sizeof(*ios) + sizeof(struct ore_per_dev_state) * numdevs;
if (sgs_per_dev * numdevs > INT_MAX / sizeof(struct osd_sg_entry))
return -ENOMEM;
if (num_par_pages > INT_MAX / sizeof(struct page *))
return -ENOMEM;
size_extra = max(sizeof(struct osd_sg_entry) * (sgs_per_dev * numdevs),
sizeof(struct page *) * num_par_pages);
size_total = size_ios + size_extra;
if (likely(size_total <= PAGE_SIZE)) {
ios = kzalloc(size_total, GFP_KERNEL);
if (unlikely(!ios)) {
ORE_DBGMSG("Failed kzalloc bytes=%zu\n", size_total);
*pios = NULL;
return -ENOMEM;
}
ios_extra = (char *)ios + size_ios;
} else {
ios = kzalloc(size_ios, GFP_KERNEL);
if (unlikely(!ios)) {
ORE_DBGMSG("Failed alloc first part bytes=%zu\n",
size_ios);
*pios = NULL;
return -ENOMEM;
}
ios_extra = kzalloc(size_extra, GFP_KERNEL);
if (unlikely(!ios_extra)) {
ORE_DBGMSG("Failed alloc second part bytes=%zu\n",
size_extra);
kfree(ios);
*pios = NULL;
return -ENOMEM;
}
/* In this case per_dev[0].sglist holds the pointer to
* be freed
*/
ios->extra_part_alloc = true;
}
if (num_par_pages) {
ios->parity_pages = ios_extra;
ios->max_par_pages = num_par_pages;
}
if (sgs_per_dev) {
struct osd_sg_entry *sgilist = ios_extra;
unsigned d;
for (d = 0; d < numdevs; ++d) {
ios->per_dev[d].sglist = sgilist;
sgilist += sgs_per_dev;
}
ios->sgs_per_dev = sgs_per_dev;
}
ios->layout = layout;
ios->oc = oc;
*pios = ios;
return 0;
}
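/*
 * Illustration only, not part of ore: a user-space sketch of the sizing
 * decision in _ore_get_io_state() above. The header and its per-device array
 * form one part, the larger of the sglist array and the parity page array
 * forms the extra part, and both share one allocation when they fit in a
 * page. The sketch_* names, the 4 KiB page size and passing the element
 * sizes as parameters are assumptions.
 */

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption */

struct sketch_sizes {
	size_t size_ios;	/* header plus per-device array */
	size_t size_extra;	/* larger of sglist array and parity page array */
	int one_chunk;		/* 1 when both parts fit a single allocation */
};

static struct sketch_sizes sketch_io_state_sizes(size_t hdr, size_t per_dev,
						 size_t sg_entry,
						 size_t page_ptr,
						 unsigned numdevs,
						 unsigned sgs_per_dev,
						 unsigned num_par_pages)
{
	struct sketch_sizes s;
	size_t sg_bytes = sg_entry * ((size_t)sgs_per_dev * numdevs);
	size_t par_bytes = page_ptr * num_par_pages;

	/* The original rejects requests that ask for both arrays at once. */
	assert(!(sgs_per_dev && num_par_pages));

	s.size_ios = hdr + per_dev * numdevs;
	s.size_extra = sg_bytes > par_bytes ? sg_bytes : par_bytes;
	s.one_chunk = (s.size_ios + s.size_extra) <= SKETCH_PAGE_SIZE;
	return s;
}
```

/*
 * The sketch only reports the decision. In the original, the two-allocation
 * path additionally sets ios->extra_part_alloc so the extra part can be
 * freed separately.
 */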
/* Allocate an io_state for only a single group of devices
*
* If a user needs to call ore_read/write() this version must be used because it
* allocates extra stuff for striping and raid.
* The ore might decide to IO fewer than @length bytes due to alignment
* and constraints, as follows:
* - The IO cannot cross a group boundary.
* - In raid5/6 the end of the IO must align with the end of a stripe, e.g.
* (@offset + @length) % stripe_size == 0, or the complete range is within a
* single stripe.
* - Memory conditions only permitted a shorter IO. (A user can use @length=~0
* and check the returned ios->length for max_io_size.)
*
* The caller must check the returned ios->length (and/or ios->nr_pages) and
* re-issue those pages that fall outside of ios->length.
*/
int ore_get_rw_state(struct ore_layout *layout, struct ore_components *oc,
bool is_reading, u64 offset, u64 length,
struct ore_io_state **pios)
{
struct ore_io_state *ios;
unsigned numdevs = layout->group_width * layout->mirrors_p1;
unsigned sgs_per_dev = 0, max_par_pages = 0;
int ret;
if (layout->parity && length) {
unsigned data_devs = layout->group_width - layout->parity;
unsigned stripe_size = layout->stripe_unit * data_devs;
unsigned pages_in_unit = layout->stripe_unit / PAGE_SIZE;
u32 remainder;
u64 num_stripes;
u64 num_raid_units;
num_stripes = div_u64_rem(length, stripe_size, &remainder);
if (remainder)
++num_stripes;
num_raid_units = num_stripes * layout->parity;
if (is_reading) {
/* For reads add per_dev sglist array */
/* TODO: For RAID 6 we need twice as many. Actually:
* num_stripes / LCMdP(W,P);
* if (W%P != 0) num_stripes *= parity;
*/
/* first/last seg is split */
num_raid_units += layout->group_width;
sgs_per_dev = div_u64(num_raid_units, data_devs) + 2;
} else {
/* For Writes add parity pages array. */
max_par_pages = num_raid_units * pages_in_unit *
sizeof(struct page *);
}
}
ret = _ore_get_io_state(layout, oc, numdevs, sgs_per_dev, max_par_pages,
pios);
if (unlikely(ret))
return ret;
ios = *pios;
ios->reading = is_reading;
ios->offset = offset;
if (length) {
ore_calc_stripe_info(layout, offset, length, &ios->si);
ios->length = ios->si.length;
ios->nr_pages = ((ios->offset & (PAGE_SIZE - 1)) +
ios->length + PAGE_SIZE - 1) / PAGE_SIZE;
if (layout->parity)
_ore_post_alloc_raid_stuff(ios);
}
return 0;
}
EXPORT_SYMBOL(ore_get_rw_state);
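/*
 * Illustration only, not part of ore: a user-space sketch of three small
 * calculations in ore_get_rw_state() above: rounding a length up to whole
 * stripes, sizing the per-device sglist for RAID reads, and counting the
 * pages an unaligned byte range touches. The sketch_* names and the 4 KiB
 * page size are assumptions.
 */

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption */

/* Number of stripes needed to cover length bytes, rounding up. */
static uint64_t sketch_num_stripes(uint64_t length, uint32_t stripe_size)
{
	assert(stripe_size > 0);
	return length / stripe_size + (length % stripe_size ? 1 : 0);
}

/* sglist entries per device for a RAID read, following the original. */
static unsigned sketch_read_sgs_per_dev(uint64_t length, unsigned group_width,
					unsigned parity, uint32_t stripe_unit)
{
	unsigned data_devs = group_width - parity;
	uint64_t units;

	assert(group_width > parity && stripe_unit > 0);

	units = sketch_num_stripes(length, stripe_unit * data_devs) * parity;
	units += group_width;	/* first/last segment is split */
	return (unsigned)(units / data_devs) + 2;
}

/* Pages touched by [offset, offset + length), counting partial pages. */
static uint64_t sketch_nr_pages(uint64_t offset, uint64_t length)
{
	return ((offset & (SKETCH_PAGE_SIZE - 1)) + length +
		SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

/*
 * A two-byte range starting at the last byte of a page touches two pages,
 * which is why the in-page offset is added before rounding up.
 */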
/* Allocate an io_state for all the devices in the comps array
*
* This version of io_state allocation is used mostly by create/remove
* and trunc where we currently need all the devices. The only wasteful
* bit is the read/write_attributes with no IO. Those sites should
* be converted to use ore_get_rw_state() with length=0
*/
int ore_get_io_state(struct ore_layout *layout, struct ore_components *oc,
struct ore_io_state **pios)
{
return _ore_get_io_state(layout, oc, oc->numdevs, 0, 0, pios);
}
EXPORT_SYMBOL(ore_get_io_state);
void ore_put_io_state(struct ore_io_state *ios)
{
if (ios) {
unsigned i;
for (i = 0; i < ios->numdevs; i++) {
struct ore_per_dev_state *per_dev = &ios->per_dev[i];
if (per_dev->or)
osd_end_request(per_dev->or);
if (per_dev->bio)
bio_put(per_dev->bio);
}
_ore_free_raid_stuff(ios);
kfree(ios);
}
}
EXPORT_SYMBOL(ore_put_io_state);
static void _sync_done(struct ore_io_state *ios, void *p)
{
struct completion *waiting = p;
complete(waiting);
}
static void _last_io(struct kref *kref)
{
struct ore_io_state *ios = container_of(
kref, struct ore_io_state, kref);
ios->done(ios, ios->private);
}
static void _done_io(struct osd_request *or, void *p)
{
struct ore_io_state *ios = p;
kref_put(&ios->kref, _last_io);
}
int ore_io_execute(struct ore_io_state *ios)
{
DECLARE_COMPLETION_ONSTACK(wait);
bool sync = (ios->done == NULL);
int i, ret;
if (sync) {
ios->done = _sync_done;
ios->private = &wait;
}
for (i = 0; i < ios->numdevs; i++) {
struct osd_request *or = ios->per_dev[i].or;
if (unlikely(!or))
continue;
ret = osd_finalize_request(or, 0, _ios_cred(ios, i), NULL);
if (unlikely(ret)) {
ORE_DBGMSG("Failed to osd_finalize_request() => %d\n",
ret);
return ret;
}
}
kref_init(&ios->kref);
for (i = 0; i < ios->numdevs; i++) {
struct osd_request *or = ios->per_dev[i].or;
if (unlikely(!or))
continue;
kref_get(&ios->kref);
osd_execute_request_async(or, _done_io, ios);
}
kref_put(&ios->kref, _last_io);
ret = 0;
if (sync) {
wait_for_completion(&wait);
ret = ore_check_io(ios, NULL);
}
return ret;
}
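ore_io_execute() holds one extra reference while it submits the per-device requests, so the completion callback cannot run before every request has been issued. The userspace sketch below illustrates that counting pattern with C11 atomics instead of the kernel's kref; all sketch_* names are hypothetical and done_calls stands in for invoking ios->done.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the kref embedded in struct ore_io_state. */
struct sketch_batch {
	atomic_int refs;
	int done_calls;		/* how many times the "done" hook fired */
};

static void sketch_batch_init(struct sketch_batch *b)
{
	atomic_init(&b->refs, 1);	/* the submitter's own reference */
	b->done_calls = 0;
}

/* Taken once per request, just before it is submitted. */
static void sketch_batch_get(struct sketch_batch *b)
{
	atomic_fetch_add(&b->refs, 1);
}

/* Dropped by each completion, and once by the submitter when it is done. */
static void sketch_batch_put(struct sketch_batch *b)
{
	int old = atomic_fetch_sub(&b->refs, 1);

	assert(old > 0);
	if (old == 1)
		b->done_calls++;	/* last reference: run the hook */
}
```

Even if all requests complete before submission finishes, the hook only fires when the submitter drops its initial reference, which mirrors the final kref_put() after the submission loop.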
static void _clear_bio(struct bio *bio)
{
struct bio_vec *bv;
unsigned i;
bio_for_each_segment_all(bv, bio, i) {
unsigned this_count = bv->bv_len;
if (likely(PAGE_SIZE == this_count))
clear_highpage(bv->bv_page);
else
zero_user(bv->bv_page, bv->bv_offset, this_count);
}
}
int ore_check_io(struct ore_io_state *ios, ore_on_dev_error on_dev_error)
{
enum osd_err_priority acumulated_osd_err = 0;
int acumulated_lin_err = 0;
int i;
for (i = 0; i < ios->numdevs; i++) {
struct osd_sense_info osi;
struct ore_per_dev_state *per_dev = &ios->per_dev[i];
struct osd_request *or = per_dev->or;
int ret;
if (unlikely(!or))
continue;
ret = osd_req_decode_sense(or, &osi);
if (likely(!ret))
continue;
if ((OSD_ERR_PRI_CLEAR_PAGES == osi.osd_err_pri) &&
per_dev->bio) {
/* The read started past the end of file.
* Note: if we do not have a bio it means read-attributes.
* In this case we should return an error to the caller.
*/
_clear_bio(per_dev->bio);
ORE_DBGMSG("start read offset passed end of file "
"offset=0x%llx, length=0x%llx\n",
_LLU(per_dev->offset),
_LLU(per_dev->length));
continue; /* we recovered */
}
if (on_dev_error) {
u64 residual = ios->reading ?
or->in.residual : or->out.residual;
u64 offset = (ios->offset + ios->length) - residual;
unsigned dev = per_dev->dev - ios->oc->first_dev;
struct ore_dev *od = ios->oc->ods[dev];
on_dev_error(ios, od, dev, osi.osd_err_pri,
offset, residual);
}
if (osi.osd_err_pri >= acumulated_osd_err) {
acumulated_osd_err = osi.osd_err_pri;
acumulated_lin_err = ret;
}
}
return acumulated_lin_err;
}
EXPORT_SYMBOL(ore_check_io);
/*
* L - logical offset into the file
*
* D - number of Data devices
* D = group_width - parity
*
* U - The number of bytes in a stripe within a group
* U = stripe_unit * D
*
* T - The number of bytes striped within a group of component objects
* (before advancing to the next group)
* T = U * group_depth
*
* S - The number of bytes striped across all component objects
* before the pattern repeats
* S = T * group_count
*
* M - The "major" (i.e., across all components) cycle number
* M = L / S
*
* G - Counts the groups from the beginning of the major cycle
* G = (L - (M * S)) / T [or (L % S) / T]
*
* H - The byte offset within the group
* H = (L - (M * S)) % T [or (L % S) % T]
*
* N - The "minor" (i.e., across the group) stripe number
* N = H / U
*
* C - The component index corresponding to L
*
* C = (H - (N * U)) / stripe_unit + G * group_width
* [or (H % U) / stripe_unit + G * group_width]
*
* O - The component offset corresponding to L
* O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit
*
* LCMdP - Parity cycle: Lowest Common Multiple of group_width and parity,
* divided by parity
* LCMdP = lcm(group_width, parity) / parity
*
* R - The parity Rotation stripe
* (Note parity cycle always starts at a group's boundary)
* R = N % LCMdP
*
* I - The first parity device index
* I = (group_width + group_width - R*parity - parity) % group_width
*
* Craid - The component index Rotated
* Craid = (group_width + C - R*parity) % group_width
* (We add the group_width to avoid negative numbers modulo math)
*/
void ore_calc_stripe_info(struct ore_layout *layout, u64 file_offset,
u64 length, struct ore_striping_info *si)
{
u32 stripe_unit = layout->stripe_unit;
u32 group_width = layout->group_width;
u64 group_depth = layout->group_depth;
u32 parity = layout->parity;
u32 D = group_width - parity;
u32 U = D * stripe_unit;
u64 T = U * group_depth;
u64 S = T * layout->group_count;
u64 M = div64_u64(file_offset, S);
/*
G = (L - (M * S)) / T
H = (L - (M * S)) % T
*/
u64 LmodS = file_offset - M * S;
u32 G = div64_u64(LmodS, T);
u64 H = LmodS - G * T;
u32 N = div_u64(H, U);
u32 Nlast;
/* "H - (N * U)" is just "H % U" so it's bound to u32 */
u32 C = (u32)(H - (N * U)) / stripe_unit + G * group_width;
u32 first_dev = C - C % group_width;
div_u64_rem(file_offset, stripe_unit, &si->unit_off);
si->obj_offset = si->unit_off + (N * stripe_unit) +
(M * group_depth * stripe_unit);
si->cur_comp = C - first_dev;
si->cur_pg = si->unit_off / PAGE_SIZE;
if (parity) {
u32 LCMdP = lcm(group_width, parity) / parity;
/* R = N % LCMdP; */
u32 RxP = (N % LCMdP) * parity;
si->par_dev = (group_width + group_width - parity - RxP) %
group_width + first_dev;
si->dev = (group_width + group_width + C - RxP) %
group_width + first_dev;
si->bytes_in_stripe = U;
si->first_stripe_start = M * S + G * T + N * U;
} else {
/* Make the math correct; see _prepare_one_group */
si->par_dev = group_width;
si->dev = C;
}
si->dev *= layout->mirrors_p1;
si->par_dev *= layout->mirrors_p1;
si->offset = file_offset;
si->length = T - H;
if (si->length > length)
si->length = length;
Nlast = div_u64(H + si->length + U - 1, U);
si->maxdevUnits = Nlast - N;
si->M = M;
}
EXPORT_SYMBOL(ore_calc_stripe_info);
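The striping formulas in the comment above ore_calc_stripe_info() are easier to follow with concrete numbers. The sketch below is a simplified userspace re-statement of the same arithmetic, not the kernel function: the sketch_* names and structs are assumptions, plain 64-bit division replaces div_u64()/div64_u64(), the mirrors_p1 scaling is left out, and the component index uses G * group_width as the function body does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the layout fields the formulas use. */
struct sketch_layout {
	uint32_t stripe_unit, group_width, parity;
	uint64_t group_depth, group_count;
};

struct sketch_pos {
	uint64_t M, H, obj_offset;
	uint32_t G, N, C, unit_off, dev, par_dev;
};

static uint32_t sketch_gcd(uint32_t a, uint32_t b)
{
	while (b) {
		uint32_t t = a % b;
		a = b;
		b = t;
	}
	return a;
}

static struct sketch_pos sketch_calc(const struct sketch_layout *l, uint64_t L)
{
	struct sketch_pos p;
	uint32_t gw = l->group_width, D, first_dev;
	uint64_t U, T, S, LmodS;

	assert(gw > l->parity && l->stripe_unit);
	assert(l->group_depth && l->group_count);

	D = gw - l->parity;			/* data devices */
	U = (uint64_t)D * l->stripe_unit;	/* bytes per stripe */
	T = U * l->group_depth;			/* bytes per group */
	S = T * l->group_count;			/* bytes per major cycle */

	p.M = L / S;
	LmodS = L % S;
	p.G = (uint32_t)(LmodS / T);
	p.H = LmodS % T;
	p.N = (uint32_t)(p.H / U);
	p.C = (uint32_t)((p.H % U) / l->stripe_unit) + p.G * gw;
	p.unit_off = (uint32_t)(L % l->stripe_unit);
	p.obj_offset = p.unit_off + (uint64_t)p.N * l->stripe_unit +
		       p.M * l->group_depth * l->stripe_unit;

	first_dev = p.C - p.C % gw;
	if (l->parity) {
		uint32_t lcm = gw / sketch_gcd(gw, l->parity) * l->parity;
		uint32_t RxP = (p.N % (lcm / l->parity)) * l->parity;

		/* parity rotates backwards by "parity" devices per stripe */
		p.par_dev = (gw + gw - l->parity - RxP) % gw + first_dev;
		p.dev = (gw + gw + p.C - RxP) % gw + first_dev;
	} else {
		p.par_dev = gw;
		p.dev = p.C;
	}
	return p;
}
```

For example, with stripe_unit=4096, group_width=4, parity=1, group_depth=4 and group_count=1, logical offset 17288 falls in stripe N=1 at component C=1. The rotation places the parity for that stripe on device 2 and maps the data unit to device 0, at object offset 5000.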
int _ore_add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
unsigned pgbase, struct page **pages,
struct ore_per_dev_state *per_dev, int cur_len)
{
unsigned pg = *cur_pg;
struct request_queue *q =
osd_request_queue(_ios_od(ios, per_dev->dev));
unsigned len = cur_len;
int ret;
if (per_dev->bio == NULL) {
unsigned bio_size;
if (!ios->reading) {
bio_size = ios->si.maxdevUnits;
} else {
bio_size = (ios->si.maxdevUnits + 1) *
(ios->layout->group_width - ios->layout->parity) /
ios->layout->group_width;
}
bio_size *= (ios->layout->stripe_unit / PAGE_SIZE);
per_dev->bio = bio_kmalloc(GFP_KERNEL, bio_size);
if (unlikely(!per_dev->bio)) {
ORE_DBGMSG("Failed to allocate BIO size=%u\n",
bio_size);
ret = -ENOMEM;
goto out;
}
}
while (cur_len > 0) {
unsigned pglen = min_t(unsigned, PAGE_SIZE - pgbase, cur_len);
unsigned added_len;
cur_len -= pglen;
added_len = bio_add_pc_page(q, per_dev->bio, pages[pg],
pglen, pgbase);
if (unlikely(pglen != added_len)) {
/* If bi_vcnt == bi_max then this is a SW BUG */
ORE_DBGMSG("Failed bio_add_pc_page bi_vcnt=0x%x "
"bi_max=0x%x BIO_MAX=0x%x cur_len=0x%x\n",
per_dev->bio->bi_vcnt,
per_dev->bio->bi_max_vecs,
BIO_MAX_PAGES_KMALLOC, cur_len);
ret = -ENOMEM;
goto out;
}
_add_stripe_page(ios->sp2d, &ios->si, pages[pg]);
pgbase = 0;
++pg;
}
BUG_ON(cur_len);
per_dev->length += len;
*cur_pg = pg;
ret = 0;
out: /* We fail the complete unit on an error, i.e. we don't advance
* per_dev->length and cur_pg. This means that we might have a bigger
* bio than the CDB requested length (per_dev->length). That's fine;
* only the opposite is fatal.
*/
return ret;
}
static int _add_parity_units(struct ore_io_state *ios,
struct ore_striping_info *si,
unsigned dev, unsigned first_dev,
unsigned mirrors_p1, unsigned devs_in_group,
unsigned cur_len)
{
unsigned do_parity;
int ret = 0;
for (do_parity = ios->layout->parity; do_parity; --do_parity) {
struct ore_per_dev_state *per_dev;
per_dev = &ios->per_dev[dev - first_dev];
if (!per_dev->length && !per_dev->offset) {
/* Only/always the parity unit of the first
* stripe will be empty. So this is a chance to
* initialize the per_dev info.
*/
per_dev->dev = dev;
per_dev->offset = si->obj_offset - si->unit_off;
}
ret = _ore_add_parity_unit(ios, si, per_dev, cur_len,
do_parity == 1);
if (unlikely(ret))
break;
if (do_parity != 1) {
dev = ((dev + mirrors_p1) % devs_in_group) + first_dev;
si->cur_comp = (si->cur_comp + 1) %
ios->layout->group_width;
}
}
return ret;
}
static int _prepare_for_striping(struct ore_io_state *ios)
{
struct ore_striping_info *si = &ios->si;
unsigned stripe_unit = ios->layout->stripe_unit;
unsigned mirrors_p1 = ios->layout->mirrors_p1;
unsigned group_width = ios->layout->group_width;
unsigned devs_in_group = group_width * mirrors_p1;
unsigned dev = si->dev;
unsigned first_dev = dev - (dev % devs_in_group);
unsigned cur_pg = ios->pages_consumed;
u64 length = ios->length;
int ret = 0;
if (!ios->pages) {
ios->numdevs = ios->layout->mirrors_p1;
return 0;
}
BUG_ON(length > si->length);
while (length) {
struct ore_per_dev_state *per_dev =
&ios->per_dev[dev - first_dev];
unsigned cur_len, page_off = 0;
if (!per_dev->length && !per_dev->offset) {
/* First time initialize the per_dev info. */
per_dev->dev = dev;
if (dev == si->dev) {
WARN_ON(dev == si->par_dev);
per_dev->offset = si->obj_offset;
cur_len = stripe_unit - si->unit_off;
page_off = si->unit_off & ~PAGE_MASK;
BUG_ON(page_off && (page_off != ios->pgbase));
} else {
per_dev->offset = si->obj_offset - si->unit_off;
cur_len = stripe_unit;
}
} else {
cur_len = stripe_unit;
}
if (cur_len >= length)
cur_len = length;
ret = _ore_add_stripe_unit(ios, &cur_pg, page_off, ios->pages,
per_dev, cur_len);
if (unlikely(ret))
goto out;
length -= cur_len;
dev = ((dev + mirrors_p1) % devs_in_group) + first_dev;
si->cur_comp = (si->cur_comp + 1) % group_width;
if (unlikely((dev == si->par_dev) || (!length && ios->sp2d))) {
if (!length && ios->sp2d) {
/* If we are writing and this is the very last
* stripe, then operate on the parity dev.
*/
dev = si->par_dev;
/* If last stripe operate on parity comp */
si->cur_comp = group_width - ios->layout->parity;
}
/* In writes cur_len just means if it's the
* last one. See _ore_add_parity_unit.
*/
ret = _add_parity_units(ios, si, dev, first_dev,
mirrors_p1, devs_in_group,
ios->sp2d ? length : cur_len);
if (unlikely(ret))
goto out;
/* Rotate next par_dev backwards with wrapping */
si->par_dev = (devs_in_group + si->par_dev -
ios->layout->parity * mirrors_p1) %
devs_in_group + first_dev;
/* Next stripe, start fresh */
si->cur_comp = 0;
si->cur_pg = 0;
si->obj_offset += cur_len;
si->unit_off = 0;
}
}
out:
ios->numdevs = devs_in_group;
ios->pages_consumed = cur_pg;
return ret;
}
int ore_create(struct ore_io_state *ios)
{
int i, ret;
for (i = 0; i < ios->oc->numdevs; i++) {
struct osd_request *or;
or = osd_start_request(_ios_od(ios, i));
if (unlikely(!or)) {
ORE_ERR("%s: osd_start_request failed\n", __func__);
ret = -ENOMEM;
goto out;
}
ios->per_dev[i].or = or;
ios->numdevs++;
osd_req_create_object(or, _ios_obj(ios, i));
}
ret = ore_io_execute(ios);
out:
return ret;
}
EXPORT_SYMBOL(ore_create);
int ore_remove(struct ore_io_state *ios)
{
int i, ret;
for (i = 0; i < ios->oc->numdevs; i++) {
struct osd_request *or;
or = osd_start_request(_ios_od(ios, i));
if (unlikely(!or)) {
ORE_ERR("%s: osd_start_request failed\n", __func__);
ret = -ENOMEM;
goto out;
}
ios->per_dev[i].or = or;
ios->numdevs++;
osd_req_remove_object(or, _ios_obj(ios, i));
}
ret = ore_io_execute(ios);
out:
return ret;
}
EXPORT_SYMBOL(ore_remove);
static int _write_mirror(struct ore_io_state *ios, int cur_comp)
{
struct ore_per_dev_state *master_dev = &ios->per_dev[cur_comp];
unsigned dev = ios->per_dev[cur_comp].dev;
unsigned last_comp = cur_comp + ios->layout->mirrors_p1;
int ret = 0;
if (ios->pages && !master_dev->length)
return 0; /* Just an empty slot */
for (; cur_comp < last_comp; ++cur_comp, ++dev) {
struct ore_per_dev_state *per_dev = &ios->per_dev[cur_comp];
struct osd_request *or;
or = osd_start_request(_ios_od(ios, dev));
if (unlikely(!or)) {
ORE_ERR("%s: osd_start_request failed\n", __func__);
ret = -ENOMEM;
goto out;
}
per_dev->or = or;
if (ios->pages) {
struct bio *bio;
if (per_dev != master_dev) {
bio = bio_clone_fast(master_dev->bio,
GFP_KERNEL, NULL);
if (unlikely(!bio)) {
ORE_DBGMSG(
"Failed to allocate BIO size=%u\n",
master_dev->bio->bi_max_vecs);
ret = -ENOMEM;
goto out;
}
bio->bi_disk = NULL;
bio->bi_next = NULL;
per_dev->offset = master_dev->offset;
per_dev->length = master_dev->length;
per_dev->bio = bio;
per_dev->dev = dev;
} else {
bio = master_dev->bio;
/* FIXME: bio_set_dir() */
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
}
osd_req_write(or, _ios_obj(ios, cur_comp),
per_dev->offset, bio, per_dev->length);
ORE_DBGMSG("write(0x%llx) offset=0x%llx "
"length=0x%llx dev=%d\n",
_LLU(_ios_obj(ios, cur_comp)->id),
_LLU(per_dev->offset),
_LLU(per_dev->length), dev);
} else if (ios->kern_buff) {
per_dev->offset = ios->si.obj_offset;
per_dev->dev = ios->si.dev + dev;
/* no cross device without page array */
BUG_ON((ios->layout->group_width > 1) &&
(ios->si.unit_off + ios->length >
ios->layout->stripe_unit));
ret = osd_req_write_kern(or, _ios_obj(ios, cur_comp),
per_dev->offset,
ios->kern_buff, ios->length);
if (unlikely(ret))
goto out;
ORE_DBGMSG2("write_kern(0x%llx) offset=0x%llx "
"length=0x%llx dev=%d\n",
_LLU(_ios_obj(ios, cur_comp)->id),
_LLU(per_dev->offset),
_LLU(ios->length), per_dev->dev);
} else {
osd_req_set_attributes(or, _ios_obj(ios, cur_comp));
ORE_DBGMSG2("obj(0x%llx) set_attributes=%d dev=%d\n",
_LLU(_ios_obj(ios, cur_comp)->id),
ios->out_attr_len, dev);
}
if (ios->out_attr)
osd_req_add_set_attr_list(or, ios->out_attr,
ios->out_attr_len);
if (ios->in_attr)
osd_req_add_get_attr_list(or, ios->in_attr,
ios->in_attr_len);
}
out:
return ret;
}
int ore_write(struct ore_io_state *ios)
{
int i;
int ret;
if (unlikely(ios->sp2d && !ios->r4w)) {
/* A library is attempting a RAID-write without providing
* a pages lock interface.
*/
WARN_ON_ONCE(1);
return -ENOTSUPP;
}
ret = _prepare_for_striping(ios);
if (unlikely(ret))
return ret;
for (i = 0; i < ios->numdevs; i += ios->layout->mirrors_p1) {
ret = _write_mirror(ios, i);
if (unlikely(ret))
return ret;
}
ret = ore_io_execute(ios);
return ret;
}
EXPORT_SYMBOL(ore_write);
int _ore_read_mirror(struct ore_io_state *ios, unsigned cur_comp)
{
struct osd_request *or;
struct ore_per_dev_state *per_dev = &ios->per_dev[cur_comp];
struct osd_obj_id *obj = _ios_obj(ios, cur_comp);
unsigned first_dev = (unsigned)obj->id;
if (ios->pages && !per_dev->length)
return 0; /* Just an empty slot */
first_dev = per_dev->dev + first_dev % ios->layout->mirrors_p1;
or = osd_start_request(_ios_od(ios, first_dev));
if (unlikely(!or)) {
ORE_ERR("%s: osd_start_request failed\n", __func__);
return -ENOMEM;
}
per_dev->or = or;
if (ios->pages) {
if (per_dev->cur_sg) {
/* finalize the last sg_entry */
_ore_add_sg_seg(per_dev, 0, false);
if (unlikely(!per_dev->cur_sg))
return 0; /* Skip parity only device */
osd_req_read_sg(or, obj, per_dev->bio,
per_dev->sglist, per_dev->cur_sg);
} else {
/* The no raid case */
osd_req_read(or, obj, per_dev->offset,
per_dev->bio, per_dev->length);
}
ORE_DBGMSG("read(0x%llx) offset=0x%llx length=0x%llx"
" dev=%d sg_len=%d\n", _LLU(obj->id),
_LLU(per_dev->offset), _LLU(per_dev->length),
first_dev, per_dev->cur_sg);
} else {
BUG_ON(ios->kern_buff);
osd_req_get_attributes(or, obj);
ORE_DBGMSG2("obj(0x%llx) get_attributes=%d dev=%d\n",
_LLU(obj->id),
ios->in_attr_len, first_dev);
}
if (ios->out_attr)
osd_req_add_set_attr_list(or, ios->out_attr, ios->out_attr_len);
if (ios->in_attr)
osd_req_add_get_attr_list(or, ios->in_attr, ios->in_attr_len);
return 0;
}
int ore_read(struct ore_io_state *ios)
{
int i;
int ret;
ret = _prepare_for_striping(ios);
if (unlikely(ret))
return ret;
for (i = 0; i < ios->numdevs; i += ios->layout->mirrors_p1) {
ret = _ore_read_mirror(ios, i);
if (unlikely(ret))
return ret;
}
ret = ore_io_execute(ios);
return ret;
}
EXPORT_SYMBOL(ore_read);
int extract_attr_from_ios(struct ore_io_state *ios, struct osd_attr *attr)
{
struct osd_attr cur_attr = {.attr_page = 0}; /* start with zeros */
void *iter = NULL;
int nelem;
do {
nelem = 1;
osd_req_decode_get_attr_list(ios->per_dev[0].or,
&cur_attr, &nelem, &iter);
if ((cur_attr.attr_page == attr->attr_page) &&
(cur_attr.attr_id == attr->attr_id)) {
attr->len = cur_attr.len;
attr->val_ptr = cur_attr.val_ptr;
return 0;
}
} while (iter);
return -EIO;
}
EXPORT_SYMBOL(extract_attr_from_ios);
static int _truncate_mirrors(struct ore_io_state *ios, unsigned cur_comp,
struct osd_attr *attr)
{
int last_comp = cur_comp + ios->layout->mirrors_p1;
for (; cur_comp < last_comp; ++cur_comp) {
struct ore_per_dev_state *per_dev = &ios->per_dev[cur_comp];
struct osd_request *or;
or = osd_start_request(_ios_od(ios, cur_comp));
if (unlikely(!or)) {
ORE_ERR("%s: osd_start_request failed\n", __func__);
return -ENOMEM;
}
per_dev->or = or;
osd_req_set_attributes(or, _ios_obj(ios, cur_comp));
osd_req_add_set_attr_list(or, attr, 1);
}
return 0;
}
struct _trunc_info {
struct ore_striping_info si;
u64 prev_group_obj_off;
u64 next_group_obj_off;
unsigned first_group_dev;
unsigned nex_group_dev;
};
static void _calc_trunk_info(struct ore_layout *layout, u64 file_offset,
struct _trunc_info *ti)
{
unsigned stripe_unit = layout->stripe_unit;
ore_calc_stripe_info(layout, file_offset, 0, &ti->si);
ti->prev_group_obj_off = ti->si.M * stripe_unit;
ti->next_group_obj_off = ti->si.M ? (ti->si.M - 1) * stripe_unit : 0;
ti->first_group_dev = ti->si.dev - (ti->si.dev % layout->group_width);
ti->nex_group_dev = ti->first_group_dev + layout->group_width;
}
int ore_truncate(struct ore_layout *layout, struct ore_components *oc,
u64 size)
{
struct ore_io_state *ios;
struct exofs_trunc_attr {
struct osd_attr attr;
__be64 newsize;
} *size_attrs;
struct _trunc_info ti;
int i, ret;
ret = ore_get_io_state(layout, oc, &ios);
if (unlikely(ret))
return ret;
_calc_trunk_info(ios->layout, size, &ti);
size_attrs = kcalloc(ios->oc->numdevs, sizeof(*size_attrs),
GFP_KERNEL);
if (unlikely(!size_attrs)) {
ret = -ENOMEM;
goto out;
}
ios->numdevs = ios->oc->numdevs;
for (i = 0; i < ios->numdevs; ++i) {
struct exofs_trunc_attr *size_attr = &size_attrs[i];
u64 obj_size;
if (i < ti.first_group_dev)
obj_size = ti.prev_group_obj_off;
else if (i >= ti.nex_group_dev)
obj_size = ti.next_group_obj_off;
else if (i < ti.si.dev) /* dev within this group */
obj_size = ti.si.obj_offset +
ios->layout->stripe_unit - ti.si.unit_off;
else if (i == ti.si.dev)
obj_size = ti.si.obj_offset;
else /* i > ti.dev */
obj_size = ti.si.obj_offset - ti.si.unit_off;
size_attr->newsize = cpu_to_be64(obj_size);
size_attr->attr = g_attr_logical_length;
size_attr->attr.val_ptr = &size_attr->newsize;
ORE_DBGMSG2("trunc(0x%llx) obj_offset=0x%llx dev=%d\n",
_LLU(oc->comps->obj.id), _LLU(obj_size), i);
ret = _truncate_mirrors(ios, i * ios->layout->mirrors_p1,
&size_attr->attr);
if (unlikely(ret))
goto out;
}
ret = ore_io_execute(ios);
out:
kfree(size_attrs);
ore_put_io_state(ios);
return ret;
}
EXPORT_SYMBOL(ore_truncate);
const struct osd_attr g_attr_logical_length = ATTR_DEF(
OSD_APAGE_OBJECT_INFORMATION, OSD_ATTR_OI_LOGICAL_LENGTH, 8);
EXPORT_SYMBOL(g_attr_logical_length);
/*
* Copyright (C) 2011
* Boaz Harrosh <ooo@electrozaur.com>
*
* This file is part of the objects raid engine (ore).
*
* It is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as published
* by the Free Software Foundation.
*
* You should have received a copy of the GNU General Public License
* along with "ore". If not, write to the Free Software Foundation, Inc:
* "Free Software Foundation <info@fsf.org>"
*/
#include <linux/gfp.h>
#include <linux/async_tx.h>
#include "ore_raid.h"
#undef ORE_DBGMSG2
#define ORE_DBGMSG2 ORE_DBGMSG
static struct page *_raid_page_alloc(void)
{
return alloc_page(GFP_KERNEL);
}
static void _raid_page_free(struct page *p)
{
__free_page(p);
}
/* This struct is forward declared in ore_io_state, but is private to here.
* It is put on ios->sp2d for RAID5/6 writes only. See _gen_xor_unit.
*
* __stripe_pages_2d is a 2d array of pages, and it is also a corner turn.
* Ascending page index access is sp2d(p-minor, c-major). But storage is
* sp2d[p-minor][c-major], so it can be properly presented to the async-xor
* API.
*/
struct __stripe_pages_2d {
/* Cache some hot path repeated calculations */
unsigned parity;
unsigned data_devs;
unsigned pages_in_unit;
bool needed;
/* Array size is pages_in_unit (layout->stripe_unit / PAGE_SIZE) */
struct __1_page_stripe {
bool alloc;
unsigned write_count;
struct async_submit_ctl submit;
struct dma_async_tx_descriptor *tx;
/* The size of this array is data_devs + parity */
struct page **pages;
struct page **scribble;
/* bool array, size of this array is data_devs */
char *page_is_read;
} _1p_stripes[];
};
/* This can get bigger than a page, so support multiple page allocations.
* _sp2d_free should be called even if _sp2d_alloc fails (by returning
* non-zero).
*/
static int _sp2d_alloc(unsigned pages_in_unit, unsigned group_width,
unsigned parity, struct __stripe_pages_2d **psp2d)
{
struct __stripe_pages_2d *sp2d;
unsigned data_devs = group_width - parity;
/*
* Desired allocation layout is, though when larger than PAGE_SIZE,
* each struct __alloc_1p_arrays is separately allocated:
struct _alloc_all_bytes {
struct __alloc_stripe_pages_2d {
struct __stripe_pages_2d sp2d;
struct __1_page_stripe _1p_stripes[pages_in_unit];
} __asp2d;
struct __alloc_1p_arrays {
struct page *pages[group_width];
struct page *scribble[group_width];
char page_is_read[data_devs];
} __a1pa[pages_in_unit];
} *_aab;
struct __alloc_1p_arrays *__a1pa;
struct __alloc_1p_arrays *__a1pa_end;
*/
char *__a1pa;
char *__a1pa_end;
const size_t sizeof_stripe_pages_2d =
sizeof(struct __stripe_pages_2d) +
sizeof(struct __1_page_stripe) * pages_in_unit;
const size_t sizeof__a1pa =
ALIGN(sizeof(struct page *) * (2 * group_width) + data_devs,
sizeof(void *));
const size_t sizeof__a1pa_arrays = sizeof__a1pa * pages_in_unit;
const size_t alloc_total = sizeof_stripe_pages_2d +
sizeof__a1pa_arrays;
unsigned num_a1pa, alloc_size, i;
/* FIXME: check these numbers in ore_verify_layout */
BUG_ON(sizeof_stripe_pages_2d > PAGE_SIZE);
BUG_ON(sizeof__a1pa > PAGE_SIZE);
/*
* If alloc_total would be larger than PAGE_SIZE, only allocate
* as many a1pa items as would fill the rest of the page, instead
* of the full pages_in_unit count.
*/
if (alloc_total > PAGE_SIZE) {
num_a1pa = (PAGE_SIZE - sizeof_stripe_pages_2d) / sizeof__a1pa;
alloc_size = sizeof_stripe_pages_2d + sizeof__a1pa * num_a1pa;
} else {
num_a1pa = pages_in_unit;
alloc_size = alloc_total;
}
*psp2d = sp2d = kzalloc(alloc_size, GFP_KERNEL);
if (unlikely(!sp2d)) {
ORE_DBGMSG("!! Failed to alloc sp2d size=%d\n", alloc_size);
return -ENOMEM;
}
/* From here just call _sp2d_free */
/* Find start of a1pa area. */
__a1pa = (char *)sp2d + sizeof_stripe_pages_2d;
/* Find end of the _allocated_ a1pa area. */
__a1pa_end = __a1pa + alloc_size;
/* Allocate additionally needed a1pa items in PAGE_SIZE chunks. */
for (i = 0; i < pages_in_unit; ++i) {
struct __1_page_stripe *stripe = &sp2d->_1p_stripes[i];
if (unlikely(__a1pa >= __a1pa_end)) {
num_a1pa = min_t(unsigned, PAGE_SIZE / sizeof__a1pa,
pages_in_unit - i);
alloc_size = sizeof__a1pa * num_a1pa;
__a1pa = kzalloc(alloc_size, GFP_KERNEL);
if (unlikely(!__a1pa)) {
ORE_DBGMSG("!! Failed to _alloc_1p_arrays=%d\n",
num_a1pa);
return -ENOMEM;
}
__a1pa_end = __a1pa + alloc_size;
/* First *pages is marked for kfree of the buffer */
stripe->alloc = true;
}
/*
* Attach all _1p_stripes pointers to the allocation for
* it which was either part of the original PAGE_SIZE
* allocation or the subsequent allocation in this loop.
*/
stripe->pages = (void *)__a1pa;
stripe->scribble = stripe->pages + group_width;
stripe->page_is_read = (char *)stripe->scribble + group_width;
__a1pa += sizeof__a1pa;
}
sp2d->parity = parity;
sp2d->data_devs = data_devs;
sp2d->pages_in_unit = pages_in_unit;
return 0;
}
static void _sp2d_reset(struct __stripe_pages_2d *sp2d,
const struct _ore_r4w_op *r4w, void *priv)
{
unsigned data_devs = sp2d->data_devs;
unsigned group_width = data_devs + sp2d->parity;
int p, c;
if (!sp2d->needed)
return;
for (c = data_devs - 1; c >= 0; --c)
for (p = sp2d->pages_in_unit - 1; p >= 0; --p) {
struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
if (_1ps->page_is_read[c]) {
struct page *page = _1ps->pages[c];
r4w->put_page(priv, page);
_1ps->page_is_read[c] = false;
}
}
for (p = 0; p < sp2d->pages_in_unit; p++) {
struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
memset(_1ps->pages, 0, group_width * sizeof(*_1ps->pages));
_1ps->write_count = 0;
_1ps->tx = NULL;
}
sp2d->needed = false;
}
static void _sp2d_free(struct __stripe_pages_2d *sp2d)
{
unsigned i;
if (!sp2d)
return;
for (i = 0; i < sp2d->pages_in_unit; ++i) {
if (sp2d->_1p_stripes[i].alloc)
kfree(sp2d->_1p_stripes[i].pages);
}
kfree(sp2d);
}
static unsigned _sp2d_min_pg(struct __stripe_pages_2d *sp2d)
{
unsigned p;
for (p = 0; p < sp2d->pages_in_unit; p++) {
struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
if (_1ps->write_count)
return p;
}
return ~0;
}
static unsigned _sp2d_max_pg(struct __stripe_pages_2d *sp2d)
{
int p;
for (p = sp2d->pages_in_unit - 1; p >= 0; --p) {
struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
if (_1ps->write_count)
return p;
}
return ~0;
}
static void _gen_xor_unit(struct __stripe_pages_2d *sp2d)
{
unsigned p;
unsigned tx_flags = ASYNC_TX_ACK;
if (sp2d->parity == 1)
tx_flags |= ASYNC_TX_XOR_ZERO_DST;
for (p = 0; p < sp2d->pages_in_unit; p++) {
struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
if (!_1ps->write_count)
continue;
init_async_submit(&_1ps->submit, tx_flags,
NULL, NULL, NULL, (addr_conv_t *)_1ps->scribble);
if (sp2d->parity == 1)
_1ps->tx = async_xor(_1ps->pages[sp2d->data_devs],
_1ps->pages, 0, sp2d->data_devs,
PAGE_SIZE, &_1ps->submit);
else /* parity == 2 */
_1ps->tx = async_gen_syndrome(_1ps->pages, 0,
sp2d->data_devs + sp2d->parity,
PAGE_SIZE, &_1ps->submit);
}
for (p = 0; p < sp2d->pages_in_unit; p++) {
struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
/* NOTE: We wait for HW synchronously (I don't have such HW
* to test with.) Is parallelism needed with today's multi
* cores?
*/
async_tx_issue_pending(_1ps->tx);
}
}
void _ore_add_stripe_page(struct __stripe_pages_2d *sp2d,
struct ore_striping_info *si, struct page *page)
{
struct __1_page_stripe *_1ps;
sp2d->needed = true;
_1ps = &sp2d->_1p_stripes[si->cur_pg];
_1ps->pages[si->cur_comp] = page;
++_1ps->write_count;
si->cur_pg = (si->cur_pg + 1) % sp2d->pages_in_unit;
/* si->cur_comp is advanced outside at main loop */
}
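The "corner turn" described for __stripe_pages_2d means pages arrive unit by unit (all pages of one component, then the next component) but are stored indexed by page first, so each row holds the pages that are xor-ed together. The sketch below illustrates this with integer page ids in place of struct page pointers. The sketch_* names and fixed sizes are assumptions, and advancing cur_comp on wrap-around is a simplification: in ORE the caller's main loop advances si->cur_comp after each stripe unit.

```c
#include <assert.h>

#define SKETCH_PAGES_IN_UNIT 2	/* stripe_unit / PAGE_SIZE, assumed */
#define SKETCH_DATA_DEVS 3	/* group_width - parity, assumed */

/* Hypothetical miniature of the sp2d grid: rows are pages, columns devices. */
struct sketch_sp2d {
	int page_id[SKETCH_PAGES_IN_UNIT][SKETCH_DATA_DEVS];
	unsigned write_count[SKETCH_PAGES_IN_UNIT];
	unsigned cur_pg;
	unsigned cur_comp;
};

static void sketch_add_page(struct sketch_sp2d *g, int page_id)
{
	assert(g->cur_pg < SKETCH_PAGES_IN_UNIT);
	assert(g->cur_comp < SKETCH_DATA_DEVS);

	/* store by [page][component] so one row feeds one xor operation */
	g->page_id[g->cur_pg][g->cur_comp] = page_id;
	++g->write_count[g->cur_pg];

	g->cur_pg = (g->cur_pg + 1) % SKETCH_PAGES_IN_UNIT;
	if (g->cur_pg == 0)	/* simplification: a full unit was consumed */
		g->cur_comp = (g->cur_comp + 1) % SKETCH_DATA_DEVS;
}
```

Adding six consecutive pages to a zeroed grid fills it column by column, leaving row 0 with pages 0, 2, 4 and row 1 with pages 1, 3, 5.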
void _ore_add_sg_seg(struct ore_per_dev_state *per_dev, unsigned cur_len,
bool not_last)
{
struct osd_sg_entry *sge;
ORE_DBGMSG("dev=%d cur_len=0x%x not_last=%d cur_sg=%d "
"offset=0x%llx length=0x%x last_sgs_total=0x%x\n",
per_dev->dev, cur_len, not_last, per_dev->cur_sg,
_LLU(per_dev->offset), per_dev->length,
per_dev->last_sgs_total);
if (!per_dev->cur_sg) {
sge = per_dev->sglist;
/* First time we prepare two entries */
if (per_dev->length) {
++per_dev->cur_sg;
sge->offset = per_dev->offset;
sge->len = per_dev->length;
} else {
/* Here the parity is the first unit of this object.
* This happens every time we reach a parity device on
* the same stripe as the per_dev->offset. We need to
* just skip this unit.
*/
per_dev->offset += cur_len;
return;
}
} else {
/* finalize the last one */
sge = &per_dev->sglist[per_dev->cur_sg - 1];
sge->len = per_dev->length - per_dev->last_sgs_total;
}
if (not_last) {
/* Partly prepare the next one */
struct osd_sg_entry *next_sge = sge + 1;
++per_dev->cur_sg;
next_sge->offset = sge->offset + sge->len + cur_len;
/* Save cur len so we know how much was added next time */
per_dev->last_sgs_total = per_dev->length;
next_sge->len = 0;
} else if (!sge->len) {
/* Optimize for when the last unit is a parity */
--per_dev->cur_sg;
}
}
static int _alloc_read_4_write(struct ore_io_state *ios)
{
struct ore_layout *layout = ios->layout;
int ret;
/* We want to only read those pages not in cache so worst case
* is a stripe populated with every other page
*/
unsigned sgs_per_dev = ios->sp2d->pages_in_unit + 2;
ret = _ore_get_io_state(layout, ios->oc,
layout->group_width * layout->mirrors_p1,
sgs_per_dev, 0, &ios->ios_read_4_write);
return ret;
}
/* @si contains info of the to-be-inserted page. Update of @si should be
* maintained by caller. Specifically si->dev, si->obj_offset, ...
*/
static int _add_to_r4w(struct ore_io_state *ios, struct ore_striping_info *si,
struct page *page, unsigned pg_len)
{
struct request_queue *q;
struct ore_per_dev_state *per_dev;
struct ore_io_state *read_ios;
unsigned first_dev = si->dev - (si->dev %
(ios->layout->group_width * ios->layout->mirrors_p1));
unsigned comp = si->dev - first_dev;
unsigned added_len;
if (!ios->ios_read_4_write) {
int ret = _alloc_read_4_write(ios);
if (unlikely(ret))
return ret;
}
read_ios = ios->ios_read_4_write;
read_ios->numdevs = ios->layout->group_width * ios->layout->mirrors_p1;
per_dev = &read_ios->per_dev[comp];
if (!per_dev->length) {
per_dev->bio = bio_kmalloc(GFP_KERNEL,
ios->sp2d->pages_in_unit);
if (unlikely(!per_dev->bio)) {
ORE_DBGMSG("Failed to allocate BIO size=%u\n",
ios->sp2d->pages_in_unit);
return -ENOMEM;
}
per_dev->offset = si->obj_offset;
per_dev->dev = si->dev;
} else if (si->obj_offset != (per_dev->offset + per_dev->length)) {
u64 gap = si->obj_offset - (per_dev->offset + per_dev->length);
_ore_add_sg_seg(per_dev, gap, true);
}
q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
added_len = bio_add_pc_page(q, per_dev->bio, page, pg_len,
si->obj_offset % PAGE_SIZE);
if (unlikely(added_len != pg_len)) {
ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
per_dev->bio->bi_vcnt);
return -ENOMEM;
}
per_dev->length += pg_len;
return 0;
}
/* read the beginning of an unaligned first page */
static int _add_to_r4w_first_page(struct ore_io_state *ios, struct page *page)
{
struct ore_striping_info si;
unsigned pg_len;
ore_calc_stripe_info(ios->layout, ios->offset, 0, &si);
pg_len = si.obj_offset % PAGE_SIZE;
si.obj_offset -= pg_len;
ORE_DBGMSG("offset=0x%llx len=0x%x index=0x%lx dev=%x\n",
_LLU(si.obj_offset), pg_len, page->index, si.dev);
return _add_to_r4w(ios, &si, page, pg_len);
}
/* read the end of an incomplete last page */
static int _add_to_r4w_last_page(struct ore_io_state *ios, u64 *offset)
{
struct ore_striping_info si;
struct page *page;
unsigned pg_len, p, c;
ore_calc_stripe_info(ios->layout, *offset, 0, &si);
p = si.cur_pg;
c = si.cur_comp;
page = ios->sp2d->_1p_stripes[p].pages[c];
pg_len = PAGE_SIZE - (si.unit_off % PAGE_SIZE);
*offset += pg_len;
ORE_DBGMSG("p=%d, c=%d next-offset=0x%llx len=0x%x dev=%x par_dev=%d\n",
p, c, _LLU(*offset), pg_len, si.dev, si.par_dev);
BUG_ON(!page);
return _add_to_r4w(ios, &si, page, pg_len);
}
static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
{
struct bio_vec *bv;
unsigned i, d;
/* loop on all devices all pages */
for (d = 0; d < ios->numdevs; d++) {
struct bio *bio = ios->per_dev[d].bio;
if (!bio)
continue;
bio_for_each_segment_all(bv, bio, i) {
struct page *page = bv->bv_page;
SetPageUptodate(page);
if (PageError(page))
ClearPageError(page);
}
}
}
/* read_4_write is hacked to read the start of the first stripe and/or
* the end of the last stripe, if needed with an sg-gap at each device/page.
* It is assumed to be called after the to_be_written pages of the first stripe
* have been populated into ios->sp2d[][]
*
* NOTE: We call ios->r4w->get_page for all pages needed for parity calculations.
* These pages are held at sp2d[p].pages[c] but with
* sp2d[p].page_is_read[c] = true. At _sp2d_reset these pages are
* released again. ios->r4w->get_page might signal that the page is
* @uptodate=true, so we don't need to read it, only release it after IO.
*
* TODO: read_4_write should calculate a need_to_read_pages_count; if it is
* bigger than the to-be-written count, we should consider the xor-in-place mode.
* need_to_read_pages_count is the actual number of pages not present in cache.
* Maybe "devs_in_group - ios->sp2d[p].write_count" is a good enough
* approximation? In this mode the read pages are put in the empty places of
* ios->sp2d[p][*], and xor is calculated the same way. These pages are
* allocated/freed and don't go through the cache.
*/
static int _read_4_write_first_stripe(struct ore_io_state *ios)
{
struct ore_striping_info read_si;
struct __stripe_pages_2d *sp2d = ios->sp2d;
u64 offset = ios->si.first_stripe_start;
unsigned c, p, min_p = sp2d->pages_in_unit, max_p = -1;
if (offset == ios->offset) /* Go to start collect $200 */
goto read_last_stripe;
min_p = _sp2d_min_pg(sp2d);
max_p = _sp2d_max_pg(sp2d);
ORE_DBGMSG("stripe_start=0x%llx ios->offset=0x%llx min_p=%d max_p=%d\n",
offset, ios->offset, min_p, max_p);
for (c = 0; ; c++) {
ore_calc_stripe_info(ios->layout, offset, 0, &read_si);
read_si.obj_offset += min_p * PAGE_SIZE;
offset += min_p * PAGE_SIZE;
for (p = min_p; p <= max_p; p++) {
struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
struct page **pp = &_1ps->pages[c];
bool uptodate;
if (*pp) {
if (ios->offset % PAGE_SIZE)
/* Read the remainder of the page */
_add_to_r4w_first_page(ios, *pp);
/* to-be-written pages start here */
goto read_last_stripe;
}
*pp = ios->r4w->get_page(ios->private, offset,
&uptodate);
if (unlikely(!*pp))
return -ENOMEM;
if (!uptodate)
_add_to_r4w(ios, &read_si, *pp, PAGE_SIZE);
/* Mark read-pages to be cache_released */
_1ps->page_is_read[c] = true;
read_si.obj_offset += PAGE_SIZE;
offset += PAGE_SIZE;
}
offset += (sp2d->pages_in_unit - p) * PAGE_SIZE;
}
read_last_stripe:
return 0;
}
static int _read_4_write_last_stripe(struct ore_io_state *ios)
{
struct ore_striping_info read_si;
struct __stripe_pages_2d *sp2d = ios->sp2d;
u64 offset;
u64 last_stripe_end;
unsigned bytes_in_stripe = ios->si.bytes_in_stripe;
unsigned c, p, min_p = sp2d->pages_in_unit, max_p = -1;
offset = ios->offset + ios->length;
if (offset % PAGE_SIZE)
_add_to_r4w_last_page(ios, &offset);
/* offset will be aligned to next page */
last_stripe_end = div_u64(offset + bytes_in_stripe - 1, bytes_in_stripe)
* bytes_in_stripe;
if (offset == last_stripe_end) /* Optimize for the aligned case */
goto read_it;
ore_calc_stripe_info(ios->layout, offset, 0, &read_si);
p = read_si.cur_pg;
c = read_si.cur_comp;
if (min_p == sp2d->pages_in_unit) {
/* Didn't do it yet */
min_p = _sp2d_min_pg(sp2d);
max_p = _sp2d_max_pg(sp2d);
}
ORE_DBGMSG("offset=0x%llx stripe_end=0x%llx min_p=%d max_p=%d\n",
offset, last_stripe_end, min_p, max_p);
while (offset < last_stripe_end) {
struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
if ((min_p <= p) && (p <= max_p)) {
struct page *page;
bool uptodate;
BUG_ON(_1ps->pages[c]);
page = ios->r4w->get_page(ios->private, offset,
&uptodate);
if (unlikely(!page))
return -ENOMEM;
_1ps->pages[c] = page;
/* Mark read-pages to be cache_released */
_1ps->page_is_read[c] = true;
if (!uptodate)
_add_to_r4w(ios, &read_si, page, PAGE_SIZE);
}
offset += PAGE_SIZE;
if (p == (sp2d->pages_in_unit - 1)) {
++c;
p = 0;
ore_calc_stripe_info(ios->layout, offset, 0, &read_si);
} else {
read_si.obj_offset += PAGE_SIZE;
++p;
}
}
read_it:
return 0;
}
static int _read_4_write_execute(struct ore_io_state *ios)
{
struct ore_io_state *ios_read;
unsigned i;
int ret;
ios_read = ios->ios_read_4_write;
if (!ios_read)
return 0;
/* FIXME: Ugly way to signal _ore_read_mirror that we have bio(s). Change
* to check for per_dev->bio
*/
ios_read->pages = ios->pages;
/* Now read these devices */
for (i = 0; i < ios_read->numdevs; i += ios_read->layout->mirrors_p1) {
ret = _ore_read_mirror(ios_read, i);
if (unlikely(ret))
return ret;
}
ret = ore_io_execute(ios_read); /* Synchronous execution */
if (unlikely(ret)) {
ORE_DBGMSG("!! ore_io_execute => %d\n", ret);
return ret;
}
_mark_read4write_pages_uptodate(ios_read, ret);
ore_put_io_state(ios_read);
ios->ios_read_4_write = NULL; /* Might need a reuse at last stripe */
return 0;
}
/* In writes @cur_len means length left, i.e. cur_len==0 is the last parity unit */
int _ore_add_parity_unit(struct ore_io_state *ios,
struct ore_striping_info *si,
struct ore_per_dev_state *per_dev,
unsigned cur_len, bool do_xor)
{
if (ios->reading) {
if (per_dev->cur_sg >= ios->sgs_per_dev) {
ORE_DBGMSG("cur_sg(%d) >= sgs_per_dev(%d)\n",
per_dev->cur_sg, ios->sgs_per_dev);
return -ENOMEM;
}
_ore_add_sg_seg(per_dev, cur_len, true);
} else {
struct __stripe_pages_2d *sp2d = ios->sp2d;
struct page **pages = ios->parity_pages + ios->cur_par_page;
unsigned num_pages;
unsigned array_start = 0;
unsigned i;
int ret;
si->cur_pg = _sp2d_min_pg(sp2d);
num_pages = _sp2d_max_pg(sp2d) + 1 - si->cur_pg;
if (!per_dev->length) {
per_dev->offset += si->cur_pg * PAGE_SIZE;
/* If first stripe, Read in all read4write pages
* (if needed) before we calculate the first parity.
*/
if (do_xor)
_read_4_write_first_stripe(ios);
}
if (!cur_len && do_xor)
/* If last stripe, read in the r4w pages of the last stripe */
_read_4_write_last_stripe(ios);
_read_4_write_execute(ios);
for (i = 0; i < num_pages; i++) {
pages[i] = _raid_page_alloc();
if (unlikely(!pages[i]))
return -ENOMEM;
++(ios->cur_par_page);
}
BUG_ON(si->cur_comp < sp2d->data_devs);
BUG_ON(si->cur_pg + num_pages > sp2d->pages_in_unit);
ret = _ore_add_stripe_unit(ios, &array_start, 0, pages,
per_dev, num_pages * PAGE_SIZE);
if (unlikely(ret))
return ret;
if (do_xor) {
_gen_xor_unit(sp2d);
_sp2d_reset(sp2d, ios->r4w, ios->private);
}
}
return 0;
}
int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
{
if (ios->parity_pages) {
struct ore_layout *layout = ios->layout;
unsigned pages_in_unit = layout->stripe_unit / PAGE_SIZE;
if (_sp2d_alloc(pages_in_unit, layout->group_width,
layout->parity, &ios->sp2d)) {
return -ENOMEM;
}
}
return 0;
}
void _ore_free_raid_stuff(struct ore_io_state *ios)
{
if (ios->sp2d) { /* writing and raid */
unsigned i;
for (i = 0; i < ios->cur_par_page; i++) {
struct page *page = ios->parity_pages[i];
if (page)
_raid_page_free(page);
}
if (ios->extra_part_alloc)
kfree(ios->parity_pages);
/* If IO returned an error pages might need unlocking */
_sp2d_reset(ios->sp2d, ios->r4w, ios->private);
_sp2d_free(ios->sp2d);
} else {
/* Will only be set if raid reading && sglist is big */
if (ios->extra_part_alloc)
kfree(ios->per_dev[0].sglist);
}
if (ios->ios_read_4_write)
ore_put_io_state(ios->ios_read_4_write);
}
/*
* Copyright (C) from 2011
* Boaz Harrosh <ooo@electrozaur.com>
*
* This file is part of the objects raid engine (ore).
*
* It is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as published
* by the Free Software Foundation.
*
* You should have received a copy of the GNU General Public License
* along with "ore". If not, write to the Free Software Foundation, Inc:
* "Free Software Foundation <info@fsf.org>"
*/
#include <scsi/osd_ore.h>
#define ORE_ERR(fmt, a...) printk(KERN_ERR "ore: " fmt, ##a)
#ifdef CONFIG_EXOFS_DEBUG
#define ORE_DBGMSG(fmt, a...) \
printk(KERN_NOTICE "ore @%s:%d: " fmt, __func__, __LINE__, ##a)
#else
#define ORE_DBGMSG(fmt, a...) \
do { if (0) printk(fmt, ##a); } while (0)
#endif
/* u64 has problems with printk; this casts it to unsigned long long */
#define _LLU(x) (unsigned long long)(x)
#define ORE_DBGMSG2(M...) do {} while (0)
/* #define ORE_DBGMSG2 ORE_DBGMSG */
/* ios_raid.c stuff needed by ios.c */
int _ore_post_alloc_raid_stuff(struct ore_io_state *ios);
void _ore_free_raid_stuff(struct ore_io_state *ios);
void _ore_add_sg_seg(struct ore_per_dev_state *per_dev, unsigned cur_len,
bool not_last);
int _ore_add_parity_unit(struct ore_io_state *ios, struct ore_striping_info *si,
struct ore_per_dev_state *per_dev, unsigned cur_len,
bool do_xor);
void _ore_add_stripe_page(struct __stripe_pages_2d *sp2d,
struct ore_striping_info *si, struct page *page);
static inline void _add_stripe_page(struct __stripe_pages_2d *sp2d,
struct ore_striping_info *si, struct page *page)
{
if (!sp2d) /* Inline the fast path */
return; /* Hey, no raid stuff */
_ore_add_stripe_page(sp2d, si, page);
}
/* ios.c stuff needed by ios_raid.c */
int _ore_get_io_state(struct ore_layout *layout,
struct ore_components *oc, unsigned numdevs,
unsigned sgs_per_dev, unsigned num_par_pages,
struct ore_io_state **pios);
int _ore_add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
unsigned pgbase, struct page **pages,
struct ore_per_dev_state *per_dev, int cur_len);
int _ore_read_mirror(struct ore_io_state *ios, unsigned cur_comp);
int ore_io_execute(struct ore_io_state *ios);
/*
* Copyright (C) 2005, 2006
* Avishay Traeger (avishay@gmail.com)
* Copyright (C) 2008, 2009
* Boaz Harrosh <ooo@electrozaur.com>
*
* Copyrights for code taken from ext2:
* Copyright (C) 1992, 1993, 1994, 1995
* Remy Card (card@masi.ibp.fr)
* Laboratoire MASI - Institut Blaise Pascal
* Universite Pierre et Marie Curie (Paris VI)
* from
* linux/fs/minix/inode.c
* Copyright (C) 1991, 1992 Linus Torvalds
*
* This file is part of exofs.
*
* exofs is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation. Since it is based on ext2, and the only
* valid version of GPL for the Linux kernel is version 2, the only valid
* version of GPL for exofs is version 2.
*
* exofs is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with exofs; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include <linux/string.h>
#include <linux/parser.h>
#include <linux/vfs.h>
#include <linux/random.h>
#include <linux/module.h>
#include <linux/exportfs.h>
#include <linux/slab.h>
#include <linux/iversion.h>
#include "exofs.h"
#define EXOFS_DBGMSG2(M...) do {} while (0)
/******************************************************************************
* MOUNT OPTIONS
*****************************************************************************/
/*
* struct to hold what we get from mount options
*/
struct exofs_mountopt {
bool is_osdname;
const char *dev_name;
uint64_t pid;
int timeout;
};
/*
* exofs-specific mount-time options.
*/
enum { Opt_name, Opt_pid, Opt_to, Opt_err };
/*
* Our mount-time options. These should ideally be 64-bit unsigned, but the
* kernel's parsing functions do not currently support that. 32-bit should be
* sufficient for most applications now.
*/
static match_table_t tokens = {
{Opt_name, "osdname=%s"},
{Opt_pid, "pid=%u"},
{Opt_to, "to=%u"},
{Opt_err, NULL}
};
/*
* The main option parsing method. Also makes sure that all of the mandatory
* mount options were set.
*/
static int parse_options(char *options, struct exofs_mountopt *opts)
{
char *p;
substring_t args[MAX_OPT_ARGS];
int option;
bool s_pid = false;
EXOFS_DBGMSG("parse_options %s\n", options);
/* defaults */
memset(opts, 0, sizeof(*opts));
opts->timeout = BLK_DEFAULT_SG_TIMEOUT;
while ((p = strsep(&options, ",")) != NULL) {
int token;
char str[32];
if (!*p)
continue;
token = match_token(p, tokens, args);
switch (token) {
case Opt_name:
kfree(opts->dev_name);
opts->dev_name = match_strdup(&args[0]);
if (unlikely(!opts->dev_name)) {
EXOFS_ERR("Error allocating dev_name");
return -ENOMEM;
}
opts->is_osdname = true;
break;
case Opt_pid:
if (0 == match_strlcpy(str, &args[0], sizeof(str)))
return -EINVAL;
opts->pid = simple_strtoull(str, NULL, 0);
if (opts->pid < EXOFS_MIN_PID) {
EXOFS_ERR("Partition ID must be >= %u",
EXOFS_MIN_PID);
return -EINVAL;
}
s_pid = true;
break;
case Opt_to:
if (match_int(&args[0], &option))
return -EINVAL;
if (option <= 0) {
EXOFS_ERR("Timeout must be > 0");
return -EINVAL;
}
opts->timeout = option * HZ;
break;
}
}
if (!s_pid) {
EXOFS_ERR("Need to specify the following options:\n");
EXOFS_ERR(" -o pid=pid_no_to_use\n");
return -EINVAL;
}
return 0;
}
/******************************************************************************
* INODE CACHE
*****************************************************************************/
/*
* Our inode cache. Isn't it pretty?
*/
static struct kmem_cache *exofs_inode_cachep;
/*
* Allocate an inode in the cache
*/
static struct inode *exofs_alloc_inode(struct super_block *sb)
{
struct exofs_i_info *oi;
oi = kmem_cache_alloc(exofs_inode_cachep, GFP_KERNEL);
if (!oi)
return NULL;
inode_set_iversion(&oi->vfs_inode, 1);
return &oi->vfs_inode;
}
static void exofs_i_callback(struct rcu_head *head)
{
struct inode *inode = container_of(head, struct inode, i_rcu);
kmem_cache_free(exofs_inode_cachep, exofs_i(inode));
}
/*
* Remove an inode from the cache
*/
static void exofs_destroy_inode(struct inode *inode)
{
call_rcu(&inode->i_rcu, exofs_i_callback);
}
/*
* Initialize the inode
*/
static void exofs_init_once(void *foo)
{
struct exofs_i_info *oi = foo;
inode_init_once(&oi->vfs_inode);
}
/*
* Create and initialize the inode cache
*/
static int init_inodecache(void)
{
exofs_inode_cachep = kmem_cache_create_usercopy("exofs_inode_cache",
sizeof(struct exofs_i_info), 0,
SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD |
SLAB_ACCOUNT,
offsetof(struct exofs_i_info, i_data),
sizeof_field(struct exofs_i_info, i_data),
exofs_init_once);
if (exofs_inode_cachep == NULL)
return -ENOMEM;
return 0;
}
/*
* Destroy the inode cache
*/
static void destroy_inodecache(void)
{
/*
* Make sure all delayed rcu free inodes are flushed before we
* destroy cache.
*/
rcu_barrier();
kmem_cache_destroy(exofs_inode_cachep);
}
/******************************************************************************
* Some osd helpers
*****************************************************************************/
void exofs_make_credential(u8 cred_a[OSD_CAP_LEN], const struct osd_obj_id *obj)
{
osd_sec_init_nosec_doall_caps(cred_a, obj, false, true);
}
static int exofs_read_kern(struct osd_dev *od, u8 *cred, struct osd_obj_id *obj,
u64 offset, void *p, unsigned length)
{
struct osd_request *or = osd_start_request(od);
/* struct osd_sense_info osi = {.key = 0};*/
int ret;
if (unlikely(!or)) {
EXOFS_DBGMSG("%s: osd_start_request failed.\n", __func__);
return -ENOMEM;
}
ret = osd_req_read_kern(or, obj, offset, p, length);
if (unlikely(ret)) {
EXOFS_DBGMSG("%s: osd_req_read_kern failed.\n", __func__);
goto out;
}
ret = osd_finalize_request(or, 0, cred, NULL);
if (unlikely(ret)) {
EXOFS_DBGMSG("Failed to osd_finalize_request() => %d\n", ret);
goto out;
}
ret = osd_execute_request(or);
if (unlikely(ret))
EXOFS_DBGMSG("osd_execute_request() => %d\n", ret);
/* osd_req_decode_sense(or, ret); */
out:
osd_end_request(or);
EXOFS_DBGMSG2("read_kern(0x%llx) offset=0x%llx "
"length=0x%llx dev=%p ret=>%d\n",
_LLU(obj->id), _LLU(offset), _LLU(length), od, ret);
return ret;
}
static const struct osd_attr g_attr_sb_stats = ATTR_DEF(
EXOFS_APAGE_SB_DATA,
EXOFS_ATTR_SB_STATS,
sizeof(struct exofs_sb_stats));
static int __sbi_read_stats(struct exofs_sb_info *sbi)
{
struct osd_attr attrs[] = {
[0] = g_attr_sb_stats,
};
struct ore_io_state *ios;
int ret;
ret = ore_get_io_state(&sbi->layout, &sbi->oc, &ios);
if (unlikely(ret)) {
EXOFS_ERR("%s: ore_get_io_state failed.\n", __func__);
return ret;
}
ios->in_attr = attrs;
ios->in_attr_len = ARRAY_SIZE(attrs);
ret = ore_read(ios);
if (unlikely(ret)) {
EXOFS_ERR("Error reading super_block stats => %d\n", ret);
goto out;
}
ret = extract_attr_from_ios(ios, &attrs[0]);
if (ret) {
EXOFS_ERR("%s: extract_attr of sb_stats failed\n", __func__);
goto out;
}
if (attrs[0].len) {
struct exofs_sb_stats *ess;
if (unlikely(attrs[0].len != sizeof(*ess))) {
EXOFS_ERR("%s: Wrong version of exofs_sb_stats "
"size(%d) != expected(%zd)\n",
__func__, attrs[0].len, sizeof(*ess));
goto out;
}
ess = attrs[0].val_ptr;
sbi->s_nextid = le64_to_cpu(ess->s_nextid);
sbi->s_numfiles = le32_to_cpu(ess->s_numfiles);
}
out:
ore_put_io_state(ios);
return ret;
}
static void stats_done(struct ore_io_state *ios, void *p)
{
ore_put_io_state(ios);
/* Good, thanks. Nothing more to do */
}
/* Asynchronously write the stats attribute */
int exofs_sbi_write_stats(struct exofs_sb_info *sbi)
{
struct osd_attr attrs[] = {
[0] = g_attr_sb_stats,
};
struct ore_io_state *ios;
int ret;
ret = ore_get_io_state(&sbi->layout, &sbi->oc, &ios);
if (unlikely(ret)) {
EXOFS_ERR("%s: ore_get_io_state failed.\n", __func__);
return ret;
}
sbi->s_ess.s_nextid = cpu_to_le64(sbi->s_nextid);
sbi->s_ess.s_numfiles = cpu_to_le64(sbi->s_numfiles);
attrs[0].val_ptr = &sbi->s_ess;
ios->done = stats_done;
ios->private = sbi;
ios->out_attr = attrs;
ios->out_attr_len = ARRAY_SIZE(attrs);
ret = ore_write(ios);
if (unlikely(ret)) {
EXOFS_ERR("%s: ore_write failed.\n", __func__);
ore_put_io_state(ios);
}
return ret;
}
/******************************************************************************
* SUPERBLOCK FUNCTIONS
*****************************************************************************/
static const struct super_operations exofs_sops;
static const struct export_operations exofs_export_ops;
/*
* Write the superblock to the OSD
*/
static int exofs_sync_fs(struct super_block *sb, int wait)
{
struct exofs_sb_info *sbi;
struct exofs_fscb *fscb;
struct ore_comp one_comp;
struct ore_components oc;
struct ore_io_state *ios;
int ret = -ENOMEM;
fscb = kmalloc(sizeof(*fscb), GFP_KERNEL);
if (unlikely(!fscb))
return -ENOMEM;
sbi = sb->s_fs_info;
/* NOTE: We no longer dirty the super_block anywhere in exofs. The
* reason we write the fscb here on unmount is so we can stay backwards
* compatible with fscb->s_version == 1. (What we are not compatible
* with is if a new version FS crashed and then we try to mount an old
* version). Otherwise the exofs_fscb is read-only from mkfs time. All
* the writeable info is set in exofs_sbi_write_stats() above.
*/
exofs_init_comps(&oc, &one_comp, sbi, EXOFS_SUPER_ID);
ret = ore_get_io_state(&sbi->layout, &oc, &ios);
if (unlikely(ret))
goto out;
ios->length = offsetof(struct exofs_fscb, s_dev_table_oid);
memset(fscb, 0, ios->length);
fscb->s_nextid = cpu_to_le64(sbi->s_nextid);
fscb->s_numfiles = cpu_to_le64(sbi->s_numfiles);
fscb->s_magic = cpu_to_le16(sb->s_magic);
fscb->s_newfs = 0;
fscb->s_version = EXOFS_FSCB_VER;
ios->offset = 0;
ios->kern_buff = fscb;
ret = ore_write(ios);
if (unlikely(ret))
EXOFS_ERR("%s: ore_write failed.\n", __func__);
out:
EXOFS_DBGMSG("s_nextid=0x%llx ret=%d\n", _LLU(sbi->s_nextid), ret);
ore_put_io_state(ios);
kfree(fscb);
return ret;
}
static void _exofs_print_device(const char *msg, const char *dev_path,
struct osd_dev *od, u64 pid)
{
const struct osd_dev_info *odi = osduld_device_info(od);
printk(KERN_NOTICE "exofs: %s %s osd_name-%s pid-0x%llx\n",
msg, dev_path ?: "", odi->osdname, _LLU(pid));
}
static void exofs_free_sbi(struct exofs_sb_info *sbi)
{
unsigned numdevs = sbi->oc.numdevs;
while (numdevs) {
unsigned i = --numdevs;
struct osd_dev *od = ore_comp_dev(&sbi->oc, i);
if (od) {
ore_comp_set_dev(&sbi->oc, i, NULL);
osduld_put_device(od);
}
}
kfree(sbi->oc.ods);
kfree(sbi);
}
/*
* This function is called when the vfs is freeing the superblock. We just
* need to free our own part.
*/
static void exofs_put_super(struct super_block *sb)
{
int num_pend;
struct exofs_sb_info *sbi = sb->s_fs_info;
/* make sure there are no pending commands */
for (num_pend = atomic_read(&sbi->s_curr_pending); num_pend > 0;
num_pend = atomic_read(&sbi->s_curr_pending)) {
wait_queue_head_t wq;
printk(KERN_NOTICE "%s: !!Pending operations in flight. "
"This is a BUG. please report to osd-dev@open-osd.org\n",
__func__);
init_waitqueue_head(&wq);
wait_event_timeout(wq,
(atomic_read(&sbi->s_curr_pending) == 0),
msecs_to_jiffies(100));
}
_exofs_print_device("Unmounting", NULL, ore_comp_dev(&sbi->oc, 0),
sbi->one_comp.obj.partition);
exofs_sysfs_sb_del(sbi);
exofs_free_sbi(sbi);
sb->s_fs_info = NULL;
}
static int _read_and_match_data_map(struct exofs_sb_info *sbi, unsigned numdevs,
struct exofs_device_table *dt)
{
int ret;
sbi->layout.stripe_unit =
le64_to_cpu(dt->dt_data_map.cb_stripe_unit);
sbi->layout.group_width =
le32_to_cpu(dt->dt_data_map.cb_group_width);
sbi->layout.group_depth =
le32_to_cpu(dt->dt_data_map.cb_group_depth);
sbi->layout.mirrors_p1 =
le32_to_cpu(dt->dt_data_map.cb_mirror_cnt) + 1;
sbi->layout.raid_algorithm =
le32_to_cpu(dt->dt_data_map.cb_raid_algorithm);
ret = ore_verify_layout(numdevs, &sbi->layout);
EXOFS_DBGMSG("exofs: layout: "
"num_comps=%u stripe_unit=0x%x group_width=%u "
"group_depth=0x%llx mirrors_p1=%u raid_algorithm=%u\n",
numdevs,
sbi->layout.stripe_unit,
sbi->layout.group_width,
_LLU(sbi->layout.group_depth),
sbi->layout.mirrors_p1,
sbi->layout.raid_algorithm);
return ret;
}
static unsigned __ra_pages(struct ore_layout *layout)
{
const unsigned _MIN_RA = 32; /* min 128K read-ahead */
unsigned ra_pages = layout->group_width * layout->stripe_unit /
PAGE_SIZE;
unsigned max_io_pages = exofs_max_io_pages(layout, ~0);
ra_pages *= 2; /* two stripes */
if (ra_pages < _MIN_RA)
ra_pages = roundup(_MIN_RA, ra_pages / 2);
if (ra_pages > max_io_pages)
ra_pages = max_io_pages;
return ra_pages;
}
/* @odi is valid only as long as @dt_dev is valid */
static int exofs_devs_2_odi(struct exofs_dt_device_info *dt_dev,
struct osd_dev_info *odi)
{
odi->systemid_len = le32_to_cpu(dt_dev->systemid_len);
if (likely(odi->systemid_len))
memcpy(odi->systemid, dt_dev->systemid, OSD_SYSTEMID_LEN);
odi->osdname_len = le32_to_cpu(dt_dev->osdname_len);
odi->osdname = dt_dev->osdname;
/* FIXME support long names. Will need a _put function */
if (dt_dev->long_name_offset)
return -EINVAL;
/* Make sure osdname is printable!
* mkexofs should give us space for a null-terminator else the
* device-table is invalid.
*/
if (unlikely(odi->osdname_len >= sizeof(dt_dev->osdname)))
odi->osdname_len = sizeof(dt_dev->osdname) - 1;
dt_dev->osdname[odi->osdname_len] = 0;
/* If it's all zeros, something is bad: we read past end-of-obj */
return !(odi->systemid_len || odi->osdname_len);
}
static int __alloc_dev_table(struct exofs_sb_info *sbi, unsigned numdevs,
struct exofs_dev **peds)
{
/* The table is twice as big: see exofs_init_comps() and the comment at
* exofs_read_lookup_dev_table()
*/
const size_t numores = numdevs * 2 - 1;
struct exofs_dev *eds;
unsigned i;
sbi->oc.ods = kzalloc(numores * sizeof(struct ore_dev *) +
numdevs * sizeof(struct exofs_dev), GFP_KERNEL);
if (unlikely(!sbi->oc.ods)) {
EXOFS_ERR("ERROR: failed allocating Device array[%d]\n",
numdevs);
return -ENOMEM;
}
/* Start of allocated struct exofs_dev entries */
*peds = eds = (void *)sbi->oc.ods[numores];
/* Initialize pointers into struct exofs_dev */
for (i = 0; i < numdevs; ++i)
sbi->oc.ods[i] = &eds[i].ored;
return 0;
}
static int exofs_read_lookup_dev_table(struct exofs_sb_info *sbi,
struct osd_dev *fscb_od,
unsigned table_count)
{
struct ore_comp comp;
struct exofs_device_table *dt;
struct exofs_dev *eds;
unsigned table_bytes = table_count * sizeof(dt->dt_dev_table[0]) +
sizeof(*dt);
unsigned numdevs, i;
int ret;
dt = kmalloc(table_bytes, GFP_KERNEL);
if (unlikely(!dt)) {
EXOFS_ERR("ERROR: allocating %x bytes for device table\n",
table_bytes);
return -ENOMEM;
}
sbi->oc.numdevs = 0;
comp.obj.partition = sbi->one_comp.obj.partition;
comp.obj.id = EXOFS_DEVTABLE_ID;
exofs_make_credential(comp.cred, &comp.obj);
ret = exofs_read_kern(fscb_od, comp.cred, &comp.obj, 0, dt,
table_bytes);
if (unlikely(ret)) {
EXOFS_ERR("ERROR: reading device table\n");
goto out;
}
numdevs = le64_to_cpu(dt->dt_num_devices);
if (unlikely(!numdevs)) {
ret = -EINVAL;
goto out;
}
WARN_ON(table_count != numdevs);
ret = _read_and_match_data_map(sbi, numdevs, dt);
if (unlikely(ret))
goto out;
ret = __alloc_dev_table(sbi, numdevs, &eds);
if (unlikely(ret))
goto out;
/* exofs round-robins the device table view according to inode
* number. We hold a table twice as big, so that inodes can point
* to any device and have a sequential view of the table
* starting at this device. See exofs_init_comps()
*/
memcpy(&sbi->oc.ods[numdevs], &sbi->oc.ods[0],
(numdevs - 1) * sizeof(sbi->oc.ods[0]));
/* Create the sysfs subdir under which we put the device table
* and cluster layout. A superblock is identified by the string:
* "dev[0].osdname"_"pid"
*/
exofs_sysfs_sb_add(sbi, &dt->dt_dev_table[0]);
for (i = 0; i < numdevs; i++) {
struct exofs_fscb fscb;
struct osd_dev_info odi;
struct osd_dev *od;
if (exofs_devs_2_odi(&dt->dt_dev_table[i], &odi)) {
EXOFS_ERR("ERROR: Read all-zeros device entry\n");
ret = -EINVAL;
goto out;
}
printk(KERN_NOTICE "Add device[%d]: osd_name-%s\n",
i, odi.osdname);
/* the exofs id is currently the table index */
eds[i].did = i;
/* On all devices the device table is identical. The user can
* specify any one of the participating devices on the command
* line. We always keep them in device-table order.
*/
if (fscb_od && osduld_device_same(fscb_od, &odi)) {
eds[i].ored.od = fscb_od;
++sbi->oc.numdevs;
fscb_od = NULL;
exofs_sysfs_odev_add(&eds[i], sbi);
continue;
}
od = osduld_info_lookup(&odi);
if (IS_ERR(od)) {
ret = PTR_ERR(od);
EXOFS_ERR("ERROR: device requested is not found "
"osd_name-%s =>%d\n", odi.osdname, ret);
goto out;
}
eds[i].ored.od = od;
++sbi->oc.numdevs;
/* Read the fscb of the other devices to make sure the FS
* partition is there.
*/
ret = exofs_read_kern(od, comp.cred, &comp.obj, 0, &fscb,
sizeof(fscb));
if (unlikely(ret)) {
EXOFS_ERR("ERROR: Malformed participating device "
"error reading fscb osd_name-%s\n",
odi.osdname);
goto out;
}
exofs_sysfs_odev_add(&eds[i], sbi);
/* TODO: verify other information is correct and FS-uuid
* matches. Benny what did you say about device table
* generation and old devices?
*/
}
out:
kfree(dt);
if (unlikely(fscb_od && !ret)) {
EXOFS_ERR("ERROR: Bad device-table container device not present\n");
osduld_put_device(fscb_od);
return -EINVAL;
}
return ret;
}
/*
* Read the superblock from the OSD and fill in the fields
*/
static int exofs_fill_super(struct super_block *sb,
struct exofs_mountopt *opts,
struct exofs_sb_info *sbi,
int silent)
{
struct inode *root;
struct osd_dev *od; /* Master device */
struct exofs_fscb fscb; /* on-disk superblock info */
struct ore_comp comp;
unsigned table_count;
int ret;
/* use mount options to fill superblock */
if (opts->is_osdname) {
struct osd_dev_info odi = {.systemid_len = 0};
odi.osdname_len = strlen(opts->dev_name);
odi.osdname = (u8 *)opts->dev_name;
od = osduld_info_lookup(&odi);
kfree(opts->dev_name);
opts->dev_name = NULL;
} else {
od = osduld_path_lookup(opts->dev_name);
}
if (IS_ERR(od)) {
ret = -EINVAL;
goto free_sbi;
}
/* Default layout in case we do not have a device-table */
sbi->layout.stripe_unit = PAGE_SIZE;
sbi->layout.mirrors_p1 = 1;
sbi->layout.group_width = 1;
sbi->layout.group_depth = -1;
sbi->layout.group_count = 1;
sbi->s_timeout = opts->timeout;
sbi->one_comp.obj.partition = opts->pid;
sbi->one_comp.obj.id = 0;
exofs_make_credential(sbi->one_comp.cred, &sbi->one_comp.obj);
sbi->oc.single_comp = EC_SINGLE_COMP;
sbi->oc.comps = &sbi->one_comp;
/* fill in some other data by hand */
memset(sb->s_id, 0, sizeof(sb->s_id));
strcpy(sb->s_id, "exofs");
sb->s_blocksize = EXOFS_BLKSIZE;
sb->s_blocksize_bits = EXOFS_BLKSHIFT;
sb->s_maxbytes = MAX_LFS_FILESIZE;
sb->s_max_links = EXOFS_LINK_MAX;
atomic_set(&sbi->s_curr_pending, 0);
sb->s_bdev = NULL;
sb->s_dev = 0;
comp.obj.partition = sbi->one_comp.obj.partition;
comp.obj.id = EXOFS_SUPER_ID;
exofs_make_credential(comp.cred, &comp.obj);
ret = exofs_read_kern(od, comp.cred, &comp.obj, 0, &fscb, sizeof(fscb));
if (unlikely(ret))
goto free_sbi;
sb->s_magic = le16_to_cpu(fscb.s_magic);
/* NOTE: we read below to be backward compatible with old versions */
sbi->s_nextid = le64_to_cpu(fscb.s_nextid);
sbi->s_numfiles = le32_to_cpu(fscb.s_numfiles);
/* make sure what we read from the object store is correct */
if (sb->s_magic != EXOFS_SUPER_MAGIC) {
if (!silent)
EXOFS_ERR("ERROR: Bad magic value\n");
ret = -EINVAL;
goto free_sbi;
}
if (le32_to_cpu(fscb.s_version) > EXOFS_FSCB_VER) {
EXOFS_ERR("ERROR: Bad FSCB version expected-%d got-%d\n",
EXOFS_FSCB_VER, le32_to_cpu(fscb.s_version));
ret = -EINVAL;
goto free_sbi;
}
/* start generation numbers from a random point */
get_random_bytes(&sbi->s_next_generation, sizeof(u32));
spin_lock_init(&sbi->s_next_gen_lock);
table_count = le64_to_cpu(fscb.s_dev_table_count);
if (table_count) {
ret = exofs_read_lookup_dev_table(sbi, od, table_count);
if (unlikely(ret))
goto free_sbi;
} else {
struct exofs_dev *eds;
ret = __alloc_dev_table(sbi, 1, &eds);
if (unlikely(ret))
goto free_sbi;
ore_comp_set_dev(&sbi->oc, 0, od);
sbi->oc.numdevs = 1;
}
__sbi_read_stats(sbi);
/* set up operation vectors */
ret = super_setup_bdi(sb);
if (ret) {
EXOFS_DBGMSG("Failed to super_setup_bdi\n");
goto free_sbi;
}
sb->s_bdi->ra_pages = __ra_pages(&sbi->layout);
sb->s_fs_info = sbi;
sb->s_op = &exofs_sops;
sb->s_export_op = &exofs_export_ops;
root = exofs_iget(sb, EXOFS_ROOT_ID - EXOFS_OBJ_OFF);
if (IS_ERR(root)) {
EXOFS_ERR("ERROR: exofs_iget failed\n");
ret = PTR_ERR(root);
goto free_sbi;
}
sb->s_root = d_make_root(root);
if (!sb->s_root) {
EXOFS_ERR("ERROR: get root inode failed\n");
ret = -ENOMEM;
goto free_sbi;
}
if (!S_ISDIR(root->i_mode)) {
dput(sb->s_root);
sb->s_root = NULL;
EXOFS_ERR("ERROR: corrupt root inode (mode = %hd)\n",
root->i_mode);
ret = -EINVAL;
goto free_sbi;
}
exofs_sysfs_dbg_print();
_exofs_print_device("Mounting", opts->dev_name,
ore_comp_dev(&sbi->oc, 0),
sbi->one_comp.obj.partition);
return 0;
free_sbi:
EXOFS_ERR("Unable to mount exofs on %s pid=0x%llx err=%d\n",
opts->dev_name, sbi->one_comp.obj.partition, ret);
exofs_free_sbi(sbi);
return ret;
}
/*
* Set up the superblock (calls exofs_fill_super eventually)
*/
static struct dentry *exofs_mount(struct file_system_type *type,
int flags, const char *dev_name,
void *data)
{
struct super_block *s;
struct exofs_mountopt opts;
struct exofs_sb_info *sbi;
int ret;
ret = parse_options(data, &opts);
if (ret) {
kfree(opts.dev_name);
return ERR_PTR(ret);
}
sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
if (!sbi) {
kfree(opts.dev_name);
return ERR_PTR(-ENOMEM);
}
s = sget(type, NULL, set_anon_super, flags, NULL);
if (IS_ERR(s)) {
kfree(opts.dev_name);
kfree(sbi);
return ERR_CAST(s);
}
if (!opts.dev_name)
opts.dev_name = dev_name;
ret = exofs_fill_super(s, &opts, sbi, flags & SB_SILENT ? 1 : 0);
if (ret) {
deactivate_locked_super(s);
return ERR_PTR(ret);
}
s->s_flags |= SB_ACTIVE;
return dget(s->s_root);
}
/*
* Return information about the file system state in the buffer. This is used
* by the 'df' command, for example.
*/
static int exofs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct super_block *sb = dentry->d_sb;
struct exofs_sb_info *sbi = sb->s_fs_info;
struct ore_io_state *ios;
struct osd_attr attrs[] = {
ATTR_DEF(OSD_APAGE_PARTITION_QUOTAS,
OSD_ATTR_PQ_CAPACITY_QUOTA, sizeof(__be64)),
ATTR_DEF(OSD_APAGE_PARTITION_INFORMATION,
OSD_ATTR_PI_USED_CAPACITY, sizeof(__be64)),
};
uint64_t capacity = ULLONG_MAX;
uint64_t used = ULLONG_MAX;
int ret;
ret = ore_get_io_state(&sbi->layout, &sbi->oc, &ios);
if (ret) {
EXOFS_DBGMSG("ore_get_io_state failed.\n");
return ret;
}
ios->in_attr = attrs;
ios->in_attr_len = ARRAY_SIZE(attrs);
ret = ore_read(ios);
if (unlikely(ret))
goto out;
ret = extract_attr_from_ios(ios, &attrs[0]);
if (likely(!ret)) {
capacity = get_unaligned_be64(attrs[0].val_ptr);
if (unlikely(!capacity))
capacity = ULLONG_MAX;
} else
EXOFS_DBGMSG("exofs_statfs: get capacity failed.\n");
ret = extract_attr_from_ios(ios, &attrs[1]);
if (likely(!ret))
used = get_unaligned_be64(attrs[1].val_ptr);
else
EXOFS_DBGMSG("exofs_statfs: get used-space failed.\n");
/* fill in the stats buffer */
buf->f_type = EXOFS_SUPER_MAGIC;
buf->f_bsize = EXOFS_BLKSIZE;
buf->f_blocks = capacity >> 9;
buf->f_bfree = (capacity - used) >> 9;
buf->f_bavail = buf->f_bfree;
buf->f_files = sbi->s_numfiles;
buf->f_ffree = EXOFS_MAX_ID - sbi->s_numfiles;
buf->f_namelen = EXOFS_NAME_LEN;
out:
ore_put_io_state(ios);
return ret;
}
static const struct super_operations exofs_sops = {
.alloc_inode = exofs_alloc_inode,
.destroy_inode = exofs_destroy_inode,
.write_inode = exofs_write_inode,
.evict_inode = exofs_evict_inode,
.put_super = exofs_put_super,
.sync_fs = exofs_sync_fs,
.statfs = exofs_statfs,
};
/******************************************************************************
* EXPORT OPERATIONS
*****************************************************************************/
static struct dentry *exofs_get_parent(struct dentry *child)
{
unsigned long ino = exofs_parent_ino(child);
if (!ino)
return ERR_PTR(-ESTALE);
return d_obtain_alias(exofs_iget(child->d_sb, ino));
}
static struct inode *exofs_nfs_get_inode(struct super_block *sb,
u64 ino, u32 generation)
{
struct inode *inode;
inode = exofs_iget(sb, ino);
if (IS_ERR(inode))
return ERR_CAST(inode);
if (generation && inode->i_generation != generation) {
/* we didn't find the right inode.. */
iput(inode);
return ERR_PTR(-ESTALE);
}
return inode;
}
static struct dentry *exofs_fh_to_dentry(struct super_block *sb,
struct fid *fid, int fh_len, int fh_type)
{
return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
exofs_nfs_get_inode);
}
static struct dentry *exofs_fh_to_parent(struct super_block *sb,
struct fid *fid, int fh_len, int fh_type)
{
return generic_fh_to_parent(sb, fid, fh_len, fh_type,
exofs_nfs_get_inode);
}
static const struct export_operations exofs_export_ops = {
.fh_to_dentry = exofs_fh_to_dentry,
.fh_to_parent = exofs_fh_to_parent,
.get_parent = exofs_get_parent,
};
/******************************************************************************
* INSMOD/RMMOD
*****************************************************************************/
/*
* struct that describes this file system
*/
static struct file_system_type exofs_type = {
.owner = THIS_MODULE,
.name = "exofs",
.mount = exofs_mount,
.kill_sb = generic_shutdown_super,
};
MODULE_ALIAS_FS("exofs");
static int __init init_exofs(void)
{
int err;
err = init_inodecache();
if (err)
goto out;
err = register_filesystem(&exofs_type);
if (err)
goto out_d;
/* We don't fail if sysfs creation failed */
exofs_sysfs_init();
return 0;
out_d:
destroy_inodecache();
out:
return err;
}
static void __exit exit_exofs(void)
{
exofs_sysfs_uninit();
unregister_filesystem(&exofs_type);
destroy_inodecache();
}
MODULE_AUTHOR("Avishay Traeger <avishay@gmail.com>");
MODULE_DESCRIPTION("exofs");
MODULE_LICENSE("GPL");
module_init(init_exofs)
module_exit(exit_exofs)
/*
* Copyright (C) 2012
* Sachin Bhamare <sbhamare@panasas.com>
* Boaz Harrosh <ooo@electrozaur.com>
*
* This file is part of exofs.
*
* exofs is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License 2 as published by
* the Free Software Foundation.
*
* exofs is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with exofs; if not, write to the:
* Free Software Foundation <licensing@fsf.org>
*/
#include <linux/kobject.h>
#include <linux/device.h>
#include "exofs.h"
struct odev_attr {
struct attribute attr;
ssize_t (*show)(struct exofs_dev *, char *);
ssize_t (*store)(struct exofs_dev *, const char *, size_t);
};
static ssize_t odev_attr_show(struct kobject *kobj, struct attribute *attr,
char *buf)
{
struct exofs_dev *edp = container_of(kobj, struct exofs_dev, ed_kobj);
struct odev_attr *a = container_of(attr, struct odev_attr, attr);
return a->show ? a->show(edp, buf) : 0;
}
static ssize_t odev_attr_store(struct kobject *kobj, struct attribute *attr,
const char *buf, size_t len)
{
struct exofs_dev *edp = container_of(kobj, struct exofs_dev, ed_kobj);
struct odev_attr *a = container_of(attr, struct odev_attr, attr);
return a->store ? a->store(edp, buf, len) : len;
}
static const struct sysfs_ops odev_attr_ops = {
.show = odev_attr_show,
.store = odev_attr_store,
};
static struct kset *exofs_kset;
static ssize_t osdname_show(struct exofs_dev *edp, char *buf)
{
struct osd_dev *odev = edp->ored.od;
const struct osd_dev_info *odi = osduld_device_info(odev);
return snprintf(buf, odi->osdname_len + 1, "%s", odi->osdname);
}
static ssize_t systemid_show(struct exofs_dev *edp, char *buf)
{
struct osd_dev *odev = edp->ored.od;
const struct osd_dev_info *odi = osduld_device_info(odev);
memcpy(buf, odi->systemid, odi->systemid_len);
return odi->systemid_len;
}
static ssize_t uri_show(struct exofs_dev *edp, char *buf)
{
return snprintf(buf, edp->urilen, "%s", edp->uri);
}
static ssize_t uri_store(struct exofs_dev *edp, const char *buf, size_t len)
{
uint8_t *new_uri;
edp->urilen = strlen(buf) + 1;
new_uri = krealloc(edp->uri, edp->urilen, GFP_KERNEL);
if (new_uri == NULL)
return -ENOMEM;
edp->uri = new_uri;
strncpy(edp->uri, buf, edp->urilen);
return edp->urilen;
}
#define OSD_ATTR(name, mode, show, store) \
static struct odev_attr odev_attr_##name = \
__ATTR(name, mode, show, store)
OSD_ATTR(osdname, S_IRUGO, osdname_show, NULL);
OSD_ATTR(systemid, S_IRUGO, systemid_show, NULL);
OSD_ATTR(uri, S_IRWXU, uri_show, uri_store);
static struct attribute *odev_attrs[] = {
&odev_attr_osdname.attr,
&odev_attr_systemid.attr,
&odev_attr_uri.attr,
NULL,
};
static struct kobj_type odev_ktype = {
.default_attrs = odev_attrs,
.sysfs_ops = &odev_attr_ops,
};
static struct kobj_type uuid_ktype = {
};
void exofs_sysfs_dbg_print(void)
{
#ifdef CONFIG_EXOFS_DEBUG
struct kobject *k_name, *k_tmp;
list_for_each_entry_safe(k_name, k_tmp, &exofs_kset->list, entry) {
printk(KERN_INFO "%s: name %s ref %d\n",
__func__, kobject_name(k_name),
(int)kref_read(&k_name->kref));
}
#endif
}
/*
* This function removes all kobjects under exofs_kset
* At the end of it, exofs_kset kobject will have a refcount
* of 1 which gets decremented only on exofs module unload
*/
void exofs_sysfs_sb_del(struct exofs_sb_info *sbi)
{
struct kobject *k_name, *k_tmp;
struct kobject *s_kobj = &sbi->s_kobj;
list_for_each_entry_safe(k_name, k_tmp, &exofs_kset->list, entry) {
/* Remove all that are children of this SBI */
if (k_name->parent == s_kobj)
kobject_put(k_name);
}
kobject_put(s_kobj);
}
/*
* This function creates sysfs entries to hold the current exofs cluster
* instance (uniquely identified by osdname,pid tuple).
* This function gets called once per exofs mount instance.
*/
int exofs_sysfs_sb_add(struct exofs_sb_info *sbi,
struct exofs_dt_device_info *dt_dev)
{
struct kobject *s_kobj;
int retval = 0;
uint64_t pid = sbi->one_comp.obj.partition;
/* allocate new uuid dirent */
s_kobj = &sbi->s_kobj;
s_kobj->kset = exofs_kset;
retval = kobject_init_and_add(s_kobj, &uuid_ktype,
&exofs_kset->kobj, "%s_%llx", dt_dev->osdname, pid);
if (retval) {
EXOFS_ERR("ERROR: Failed to create sysfs entry for "
"uuid-%s_%llx => %d\n", dt_dev->osdname, pid, retval);
return -ENOMEM;
}
return 0;
}
int exofs_sysfs_odev_add(struct exofs_dev *edev, struct exofs_sb_info *sbi)
{
struct kobject *d_kobj;
int retval = 0;
/* create osd device group which contains following attributes
* osdname, systemid & uri
*/
d_kobj = &edev->ed_kobj;
d_kobj->kset = exofs_kset;
retval = kobject_init_and_add(d_kobj, &odev_ktype,
&sbi->s_kobj, "dev%u", edev->did);
if (retval) {
EXOFS_ERR("ERROR: Failed to create sysfs entry for "
"device dev%u\n", edev->did);
return retval;
}
return 0;
}
int exofs_sysfs_init(void)
{
exofs_kset = kset_create_and_add("exofs", NULL, fs_kobj);
if (!exofs_kset) {
EXOFS_ERR("ERROR: kset_create_and_add exofs failed\n");
return -ENOMEM;
}
return 0;
}
void exofs_sysfs_uninit(void)
{
kset_unregister(exofs_kset);
}