> JuiceFS uses a local mapping of user names to UIDs. So you should [sync all the needed users and their UIDs](sync_accounts_between_multiple_hosts.md) across the whole Hadoop cluster to avoid permission errors.
## Hadoop Compatibility
...
...
```shell
$ cd sdk/java
$ make
```
> **Tip**: For users in China, it's recommended to set a local Maven mirror to speed up compilation, e.g. [Aliyun Maven Mirror](https://maven.aliyun.com).
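As a sketch, a mirror can be configured in Maven's `~/.m2/settings.xml`; the repository URL below is the Aliyun public mirror, and should be verified against the mirror's own documentation:

```xml
<settings>
  <mirrors>
    <!-- Route requests for Maven Central through the Aliyun mirror -->
    <mirror>
      <id>aliyun-public</id>
      <mirrorOf>central</mirrorOf>
      <url>https://maven.aliyun.com/repository/public</url>
    </mirror>
  </mirrors>
</settings>
```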
## Deploy JuiceFS Hadoop Java SDK
After compiling, you can find the JAR file in the `sdk/java/target` directory, e.g. `juicefs-hadoop-0.10.0.jar`. Note that the file with the `original-` prefix doesn't contain third-party dependencies; it's recommended to use the JAR file that bundles third-party dependencies.
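For example, deploying to a single node might look like this (the Hadoop library directory below is an assumption; use the actual classpath directory of your distribution):

```shell
# Assumed layout: adjust /opt/hadoop to your installation.
cp sdk/java/target/juicefs-hadoop-0.10.0.jar /opt/hadoop/share/hadoop/common/lib/
cp "$JAVA_HOME"/lib/tools.jar /opt/hadoop/share/hadoop/common/lib/
```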
...
...
Then put the JAR file and `$JAVA_HOME/lib/tools.jar` to the classpath of each Hadoop ecosystem component.

### Cache Configurations

| Configuration | Default Value | Description |
| ------------- | ------------- | ----------- |
| `juicefs.cache-dir` | | Directory paths of local cache. Use a colon to separate multiple paths. Wildcards in paths are also supported. **It's recommended to create these directories manually and set `0777` permission so that different applications can share the cache data.** |
| `juicefs.cache-size` | 0 | Maximum size of local cache in MiB. It's the total size across all cache directories when multiple are set. |
| `juicefs.cache-full-block` | `true` | Whether to cache every read block; `false` means only cache random/small read blocks. |
| `juicefs.free-space` | 0.2 | Min free space ratio of cache directory |
| `juicefs.discover-nodes-url` | | The URL to discover cluster nodes, refresh every 10 minutes.<br/><br/>YARN: `yarn`<br/>Spark Standalone: `http://spark-master:web-ui-port/json/`<br/>Spark ThriftServer: `http://thrift-server:4040/api/v1/applications/`<br/>Presto: `http://coordinator:discovery-uri-port/v1/service/presto/` |
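For instance, the local cache could be enabled in `core-site.xml` like this (the paths and size are illustrative values, not recommendations):

```xml
<property>
  <name>juicefs.cache-dir</name>
  <value>/data1/jfs-cache:/data2/jfs-cache</value>
</property>
<property>
  <name>juicefs.cache-size</name>
  <!-- 102400 MiB = 100 GiB in total across both directories -->
  <value>102400</value>
</property>
```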
### I/O Configurations
| Configuration | Default Value | Description |
| ------------- | ------------- | ----------- |
| `juicefs.max-uploads` | 50 | The max number of connections to upload |
| `juicefs.get-timeout` | 5 | The max number of seconds to download an object |
| `juicefs.put-timeout` | 60 | The max number of seconds to upload an object |
| `juicefs.memory-size` | 300 | Total read/write buffering in MiB |
| `juicefs.prefetch` | 3 | Prefetch N blocks in parallel |
| `juicefs.access-log` | | Access log path. Ensure Hadoop application has write permission, e.g. `/tmp/juicefs.access.log`. The log file will rotate automatically to keep at most 7 files. |
| `juicefs.superuser` | `hdfs` | The super user |
| `juicefs.push-gateway` | | [Prometheus Pushgateway](https://github.com/prometheus/pushgateway) address, format is `<host>:<port>`. |
| `juicefs.no-usage-report` | `false` | Whether to disable usage reporting. JuiceFS only collects anonymous usage data (e.g. version number); no user data or any sensitive data will be collected. |
When you use multiple JuiceFS file systems, all these configurations can be set for a specific file system alone. Put the file system name in the middle of the configuration key, for example (replace `{JFS_NAME}` with the appropriate value):
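For example, assuming a file system named `jfs1` (a hypothetical name), its cache size could be set independently of other file systems:

```xml
<property>
  <name>juicefs.jfs1.cache-size</name>
  <value>102400</value>
</property>
```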
...
...
```sql
CREATE TABLE IF NOT EXISTS person
(
  ...
)
LOCATION 'jfs://{JFS_NAME}/tmp/person';
```
## Metrics
JuiceFS Hadoop Java SDK supports reporting metrics to [Prometheus Pushgateway](https://github.com/prometheus/pushgateway), then you can use [Grafana](https://grafana.com) and [dashboard template](k8s_grafana_template.json) to visualize these metrics.
Enable metrics reporting through the following configuration:
```xml
<property>
<name>juicefs.push-gateway</name>
<value>host:port</value>
</property>
```
> **Note**: Each process using the JuiceFS Hadoop Java SDK reports its own set of metrics, and Pushgateway remembers all metrics it has ever collected. Metrics therefore accumulate continuously, taking up too much memory and also slowing down Prometheus scraping. It is recommended to regularly (e.g. once every hour) clean up the metrics whose `job` is `juicefs` on Pushgateway. A running Hadoop Java SDK will resume updating after the metrics are cleared, so this basically does not affect usage.
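As a sketch, Pushgateway's delete endpoint for a metric group can be called from cron; the host name is an assumption (`9091` is Pushgateway's default port), and the endpoint's behavior should be verified against your Pushgateway version's API documentation:

```shell
# Crontab entry: delete the metrics of the "juicefs" job group once per hour.
0 * * * * curl -s -X DELETE http://pushgateway-host:9091/metrics/job/juicefs
```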
## Sync Accounts between Multiple Hosts

JuiceFS supports POSIX-compatible ACLs to manage permissions at the granularity of a directory or file. The behavior is the same as a local file system.

In order to make the permission experience intuitive to users (e.g. files accessible by user A on host X should be accessible on host Y by the same user), a user who wants to access JuiceFS should have the same UID and GID on all hosts.
Here we provide a simple [Ansible](https://www.ansible.com/community) playbook to demonstrate how to ensure an account with the same UID and GID on multiple hosts.

## Install Ansible
Select a host as the [control node](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html#managed-node-requirements), which can access all hosts via `ssh` with the same privileged account, like `root` or another sudo account. Install Ansible on this host; read [Installing Ansible](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html#installing-ansible) for more installation details.
## Ensure the same account on all hosts
Create an empty directory named `account-sync`, then save the following content as `play.yaml` under this directory.
```yaml
---
- hosts: all
  tasks:
    - name: "Ensure group {{ group }} with gid {{ gid }} exists"
      group:
        name: "{{ group }}"
        gid: "{{ gid }}"
        state: present

    - name: "Ensure user {{ user }} with uid {{ uid }} exists"
      user:
        name: "{{ user }}"
        uid: "{{ uid }}"
        group: "{{ group }}"
        state: present
```
Create a file named `hosts` in this directory and put the IP addresses of all hosts that need the account in it, one IP per line.
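For two hosts, the `hosts` inventory would simply be two lines (addresses illustrative):

```
172.16.255.163
172.16.255.180
```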
Here we ensure an account `alice` with UID 1200 and group `staff` with GID 500 on two hosts:
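Under these assumptions, the playbook could be run from the `account-sync` directory as follows (a sketch; the SSH user and flags may need adjusting for your environment):

```shell
ansible-playbook -i hosts -u root \
  --extra-vars "user=alice uid=1200 group=staff gid=500" \
  play.yaml
```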
In the above example, the group ID 1000 has already been allocated to another group on host `172.16.255.180`. We should either **change the GID** or **delete the group with GID 1000** on host `172.16.255.180`, then run the playbook again.
> **CAUTION**
>
> If the user account already exists on the host and we change it to another UID or GID value, the user may lose permissions to the files and directories they previously had access to. For example:
>
> ```
> $ ls -l /tmp/hello.txt
> -rw-r--r-- 1 alice staff 6 Apr 26 21:43 /tmp/hello.txt
> $ rm /tmp/hello.txt
> rm: remove write-protected regular file '/tmp/hello.txt'? y
> rm: cannot remove '/tmp/hello.txt': Operation not permitted
> ```