Commits · 3f763763aaedb78c952d52f2bc58d0b25c6e7105 · kvdb / rocksdb

09 Aug, 2022 5 commits

Change `bottommost_temperature` to `last_level_temperature` (#10471) · 3f763763

Committed by Jay Zhuang on Aug 08, 2022

Summary:
Change the tiered compaction option from `bottommost_temperature` to
`last_level_temperature`. The old option is kept for migration purposes only;
it behaves the same as `last_level_temperature` and will be removed in
the next release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471

Test Plan: CI

Reviewed By: siying

Differential Revision: D38450621

Pulled By: jay-zhuang

fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527


Improve universal compaction picker for tiered compaction (#10467) · 37553475

Committed by Jay Zhuang on Aug 08, 2022

Summary:
The current universal compaction picker may trigger extra size amplification
compactions if there is more hot data on the penultimate level. Improve the
picker to skip the last level in the size amp calculation when tiered
compaction is enabled, which:
1. avoids unnecessary size amp compactions;
2. matches the intent, since the cold tier (the last level) is typically not
   size constrained.
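
To illustrate the effect of skipping the last level, here is a minimal C++ sketch. It assumes the usual universal compaction definition of size amplification (bytes in all newer sorted runs relative to the oldest run that is counted); `SizeAmpPercent` and its parameters are hypothetical names, not the picker's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not RocksDB's picker. `run_sizes` lists sorted-run
// sizes from newest to oldest, so the final element is the last level.
// Returns (bytes in newer runs) * 100 / (bytes in the oldest counted run).
inline uint64_t SizeAmpPercent(std::vector<uint64_t> run_sizes,
                               bool skip_last_level) {
  if (skip_last_level && !run_sizes.empty()) {
    run_sizes.pop_back();  // leave the cold tier out of the calculation
  }
  if (run_sizes.size() < 2 || run_sizes.back() == 0) return 0;
  uint64_t newer_bytes = 0;
  for (size_t i = 0; i + 1 < run_sizes.size(); ++i) {
    newer_bytes += run_sizes[i];
  }
  return newer_bytes * 100 / run_sizes.back();
}

// Hot penultimate level (100) that is larger than the cold last level (50).
inline void SizeAmpExample() {
  assert(SizeAmpPercent({10, 20, 100, 50}, /*skip_last_level=*/false) == 260);
  assert(SizeAmpPercent({10, 20, 100, 50}, /*skip_last_level=*/true) == 30);
}
```

With the last level included, the hot data alone pushes the ratio to 260%, which would exceed a threshold of, say, 200% and trigger a size amp compaction; with the last level skipped, the ratio is 30%.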

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10467

Test Plan: CI and added unittest

Reviewed By: siying

Differential Revision: D38391350

Pulled By: jay-zhuang

fbshipit-source-id: 103c0731c05e0a7e8f267e9e829d022328be25d2


Disable subcompactions for user_defined_timestamp (#10503) · 563f5743

Committed by Akanksha Mahajan on Aug 08, 2022

Summary:
user_defined_timestamp is currently failing in the stress test with
subcompactions, so disable subcompactions for it for now. They will be
re-enabled once the issue is fixed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10503

Test Plan: make crash_test_with_ts -j32

Reviewed By: riversand963

Differential Revision: D38510485

Pulled By: akankshamahajan15

fbshipit-source-id: 82fd0ec8cf86a96ff6653edd5bad7623cb9e0a15


Tiered storage stress test (#10493) · 1e86d424

Committed by Jay Zhuang on Aug 08, 2022

Summary:
Add Tiered storage stress test and db_bench option

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10493

Test Plan:
new crashtest:
https://app.circleci.com/pipelines/github/facebook/rocksdb/16905/workflows/68c2967c-9274-434f-8506-1403cf441ead

Reviewed By: ajkr

Differential Revision: D38481892

Pulled By: jay-zhuang

fbshipit-source-id: 217a0be4acb93d420222e6ede2a1290d9f464776


Avoid dynamic memory allocation on read path (#10453) · 0d885e80

Committed by Jay Zhuang on Aug 08, 2022

Summary:
A lambda function dynamically allocates memory from the heap if it needs to
capture multiple values, which can be expensive.
Switch to explicitly using a local functor on the stack.
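
The source does not say how the lambda was passed around. The sketch below assumes it went through a type-erased `std::function`, which may heap-allocate when the captured state does not fit its small internal buffer, and contrasts that with a plain functor passed as a template parameter. All names here are hypothetical and not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Type-erased callback: constructing the std::function from a lambda with
// several captures may allocate on the heap, depending on the library.
inline uint64_t SumWithStdFunction(
    const std::function<uint64_t(uint64_t)>& f, uint64_t n) {
  uint64_t total = 0;
  for (uint64_t i = 0; i < n; ++i) total += f(i);
  return total;
}

// Hypothetical local functor: all state lives inside the object, which the
// caller keeps on the stack.
struct ScaleAndOffset {
  uint64_t scale;
  uint64_t offset;
  uint64_t bias;
  uint64_t operator()(uint64_t x) const { return x * scale + offset + bias; }
};

// Taking the callable as a template parameter avoids type erasure.
template <typename Fn>
inline uint64_t SumWithFunctor(const Fn& f, uint64_t n) {
  uint64_t total = 0;
  for (uint64_t i = 0; i < n; ++i) total += f(i);
  return total;
}

inline void CallbackExample() {
  const uint64_t scale = 2, offset = 3, bias = 1;
  auto lambda = [scale, offset, bias](uint64_t x) {
    return x * scale + offset + bias;
  };
  // Both paths compute 4 + 6 + 8 + 10.
  assert(SumWithStdFunction(lambda, 4) == 28);
  assert(SumWithFunctor(ScaleAndOffset{scale, offset, bias}, 4) == 28);
}
```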

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10453

Test Plan:
CI
db_bench shows ~2-3% read improvement:
```
# before the change
TEST_TMPDIR=/tmp/dbbench4 ./db_bench_main --benchmarks=filluniquerandom,readrandom -compression_type=none -max_background_jobs=12 -num=10000000
readrandom   :       8.528 micros/op 117265 ops/sec 85.277 seconds 10000000 operations;   13.0 MB/s (10000000 of 10000000 found)
# after the change
TEST_TMPDIR=/tmp/dbbench5 ./db_bench_new --benchmarks=filluniquerandom,readrandom -compression_type=none -max_background_jobs=12 -num=10000000
readrandom   :       8.263 micros/op 121015 ops/sec 82.634 seconds 10000000 operations;   13.4 MB/s (10000000 of 10000000 found)
```
details: https://gist.github.com/jay-zhuang/5ac0628db8fc9cbcb499e056d4cb5918

Micro-benchmark shows a similar improvement ~1-2%:
before the change:
https://gist.github.com/jay-zhuang/9dc0ebf51bbfbf4af82f6193d43cf75b
after the change:
https://gist.github.com/jay-zhuang/fc061f1813cd8f441109ad0b0fe7c185

Reviewed By: ajkr

Differential Revision: D38345056

Pulled By: jay-zhuang

fbshipit-source-id: f3597aeeee338a804d37bf2e81386d5a100665e0


08 Aug, 2022 1 commit

Respect fill_cache when reading blobs in DBIter (#10492) · 0cc9e98b

Committed by Levi Tamasi on Aug 08, 2022

Summary:
Similarly to https://github.com/facebook/rocksdb/pull/10457, we now have
to explicitly set the `fill_cache` read option when reading blobs in
`DBIter` to prevent the cache from getting polluted by queries with
`fill_cache` set to false. (Before we added support for a blob cache,
the setting had not made any difference either way.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10492

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D38476121

Pulled By: ltamasi

fbshipit-source-id: ea5c5e252f83e4a4e2c74156b37d40308d7e0c80


06 Aug, 2022 6 commits

Remove local static string (#8103) · e446bc65

Committed by Burton Li on Aug 05, 2022

Summary:
A local static string is not friendly to a Jemalloc arena-aware implementation: it is allocated on the arena of the first caller, which causes a crash if that arena is refunded earlier.

P.S. In a Jemalloc arena-aware implementation, each RocksDB instance uses only certain Jemalloc arenas, and an arena is refunded after its associated DB instance is destroyed.
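
The pattern in question can be sketched as follows. The source does not show what replaced the local static string, so the literal-returning alternative below is only one plausible option, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Pattern being removed: the std::string is constructed on the first call,
// and its buffer may be heap-allocated from whichever arena that first
// caller happens to be using.
inline const std::string& NameViaLocalStatic() {
  static const std::string kName =
      "hypothetical_component_name_long_enough_to_allocate";
  return kName;
}

// Allocation-free alternative: a string literal has static storage duration
// and never touches the heap.
inline const char* NameViaLiteral() {
  return "hypothetical_component_name_long_enough_to_allocate";
}

inline void StaticStringExample() {
  assert(NameViaLocalStatic() == NameViaLiteral());
  assert(std::strlen(NameViaLiteral()) > 15);
}
```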

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8103

Reviewed By: ajkr

Differential Revision: D38477235

Pulled By: ltamasi

fbshipit-source-id: a58d32cb647ed64c144b4736fb2d5db27c2c28f9


Close the Logger before rolling to next one in AutoRollLogger (#10488) · ce370d6b

Committed by Akanksha Mahajan on Aug 05, 2022

Summary:
Close the existing logger first to release the existing
handle before renaming the file using the file system.
Since `AutoRollLogger::Flush` pins `logger_`, `logger_` can't be closed unless it's
the last reference; otherwise a seg fault occurs during Flush on a file
that has already been closed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10488

Test Plan: CircleCI jobs

Reviewed By: ajkr

Differential Revision: D38469249

Pulled By: akankshamahajan15

fbshipit-source-id: dfbdb89b4ac37639aefcc503526f24753445fd3f


Include some legal contents in website (#10491) · 2259bb9c

Committed by sdong on Aug 05, 2022

Summary:
We were asked to include the TOS, Privacy Policy and copyright on the website, so they are added.
Also changed the GitHub and Twitter links to RocksDB's rather than Facebook Open Source's, and linked to Meta Open Source's home page.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10491

Test Plan: Test the website locally.

Reviewed By: jay-zhuang

Differential Revision: D38475212

fbshipit-source-id: f73622f8f3d361b4586221ffb6deac4f4a11bb15


Re-enable SuggestCompactRangeTest and add Universal Compaction test (#10473) · edae671c

Committed by Jay Zhuang on Aug 05, 2022

Summary:
The feature `SuggestCompactRange()` is still experimental. Just
add the test back.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10473

Test Plan: CI

Reviewed By: akankshamahajan15

Differential Revision: D38427153

Pulled By: jay-zhuang

fbshipit-source-id: 0b4491c947cbce6c18ff147b167e3c678633129a


Deflake ChargeFileMetadataTestWithParam/ChargeFileMetadataTestWithParam.Basic/0 (#10481) · 56dbcb4f

Committed by Hui Xiao on Aug 05, 2022

Summary:
**Context/summary:**
`ChargeFileMetadataTestWithParam/ChargeFileMetadataTestWithParam.Basic/0` relies on `DBImpl::BackgroundCallCompaction:PurgedObsoleteFiles` happening before verifying `EXPECT_EQ(file_metadata_charge_only_cache->GetCacheCharge(), 1 * CacheReservationManagerImpl<CacheEntryRole::kFileMetadata>::GetDummyEntrySize());` or `EXPECT_EQ(file_metadata_charge_only_cache->GetCacheCharge(), 0);`, to ensure the appropriate cache reservation release is done before checking.

However, this might not be the case under some timing delay and spurious wake-up as coerced below.

```
 diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 4378f3212..3e4f60853 100644
 --- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -2989,6 +2989,8 @@ void DBImpl::BackgroundCallCompaction(PrepickedCompaction* prepicked_compaction,
     if (job_context.HaveSomethingToClean() ||
         job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
       mutex_.Unlock();
+      bg_cv_.SignalAll();
+      usleep(1000);
         // Have to flush the info logs before bg_compaction_scheduled_--
        // because if bg_flush_scheduled_ becomes 0 and the lock is
        // released, the deconstructor of DB can kick in and destroy all the
        // states of DB so info_log might not be available after that point.
        // It also applies to access other states that DB owns.
        log_buffer.FlushBufferToLog();
        if (job_context.HaveSomethingToDelete()) {
          PurgeObsoleteFiles(job_context);
          TEST_SYNC_POINT("DBImpl::BackgroundCallCompaction:PurgedObsoleteFiles");
        }
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10481

Test Plan:
The test of interest failed often at the above coercion:

After fix, the test of interest passed at the above coercion:

Reviewed By: jay-zhuang

Differential Revision: D38438256

Pulled By: hx235

fbshipit-source-id: de80ecdb250174f00e7c2f5e4d952695ed56f51e


Fragment memtable range tombstone in the write path (#10380) · 9d77bf8f

Committed by Changyu Bi on Aug 05, 2022

Summary:
- Right now each read fragments the memtable range tombstones (https://github.com/facebook/rocksdb/issues/4808). This PR explores the idea of fragmenting memtable range tombstones in the write path so that reads can use the cached fragmented tombstones without any fragmenting cost. This PR only does the caching for immutable memtables, and does so right before a memtable is added to the immutable memtable list. The fragmentation is done without holding the mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed if there are any.
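
For readers unfamiliar with the term, "fragmenting" turns possibly overlapping range tombstones into non-overlapping pieces. The following is a minimal, unoptimized sketch of that idea, not RocksDB's implementation; `RangeTombstone`, `Fragment` and `FragmentTombstones` are hypothetical names, and ranges are assumed to be half-open `[start, end)`. Building this list once when a memtable becomes immutable is what lets later reads skip the work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct RangeTombstone {
  std::string start;  // inclusive
  std::string end;    // exclusive
  uint64_t seq;       // sequence number of the deletion
};

struct Fragment {
  std::string start;
  std::string end;
  std::vector<uint64_t> seqs;  // covering tombstones, newest first
};

// Splits overlapping tombstones at every boundary key so that the resulting
// fragments never overlap each other.
inline std::vector<Fragment> FragmentTombstones(
    const std::vector<RangeTombstone>& in) {
  std::vector<std::string> bounds;
  for (const auto& t : in) {
    if (t.start < t.end) {  // ignore empty ranges
      bounds.push_back(t.start);
      bounds.push_back(t.end);
    }
  }
  std::sort(bounds.begin(), bounds.end());
  bounds.erase(std::unique(bounds.begin(), bounds.end()), bounds.end());

  std::vector<Fragment> out;
  for (size_t i = 0; i + 1 < bounds.size(); ++i) {
    Fragment f{bounds[i], bounds[i + 1], {}};
    for (const auto& t : in) {
      if (t.start <= f.start && f.end <= t.end) f.seqs.push_back(t.seq);
    }
    if (f.seqs.empty()) continue;  // gap between tombstones
    std::sort(f.seqs.begin(), f.seqs.end(), std::greater<uint64_t>());
    out.push_back(std::move(f));
  }
  return out;
}

inline void FragmentExample() {
  const std::vector<RangeTombstone> in = {{"a", "e", 5}, {"c", "g", 10}};
  const std::vector<Fragment> out = FragmentTombstones(in);
  // [a, c) -> {5}, [c, e) -> {10, 5}, [e, g) -> {10}
  assert(out.size() == 3);
  assert(out[1].start == "c" && out[1].end == "e");
  assert((out[1].seqs == std::vector<uint64_t>{10, 5}));
}
```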

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380

Test Plan:
- CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes the immutable memtable path, the number of writes in the benchmark is chosen such that an immutable memtable is created and range tombstones are in that memtable.

```
single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100

multi_thread
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```
Commit 99cdf16464a057ca44de2f747541dedf651bae9e is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
Results are averaged over 5 runs.

Single thread result:
| Max # tombstones | fillrandom micros/op (main) | fillrandom micros/op (99cdf16464a057ca44de2f747541dedf651bae9e) | fillrandom micros/op (Post PR) | readrandom micros/op (main) | readrandom micros/op (99cdf16464a057ca44de2f747541dedf651bae9e) | readrandom micros/op (Post PR) |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0    |6.68     |6.57     |6.72     |4.72     |4.79     |4.54     |
| 1    |6.67     |6.58     |6.62     |5.41     |4.74     |4.72     |
| 10   |6.59     |6.5      |6.56     |7.83     |4.69     |4.59     |
| 100  |6.62     |6.75     |6.58     |29.57    |5.04     |5.09     |
| 1000 |6.54     |6.82     |6.61     |320.33   |5.22     |5.21     |

32-thread result: note that "Max # tombstones" is per thread.
| Max # tombstones | fillrandom micros/op (main) | fillrandom micros/op (99cdf16464a057ca44de2f747541dedf651bae9e) | fillrandom micros/op (Post PR) | readrandom micros/op (main) | readrandom micros/op (99cdf16464a057ca44de2f747541dedf651bae9e) | readrandom micros/op (Post PR) |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0    |234.52   |260.25   |239.42   |5.06     |5.38     |5.09     |
| 1    |236.46   |262.0    |231.1    |19.57    |22.14    |5.45     |
| 10   |236.95   |263.84   |251.49   |151.73   |21.61    |5.73     |
| 100  |268.16   |296.8    |280.13   |2308.52  |22.27    |6.57     |

Reviewed By: ajkr

Differential Revision: D37916564

Pulled By: cbi42

fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e


05 Aug, 2022 3 commits

Fix data race reported on SetIsInSecondaryCache in LRUCache (#10472) · f28d0c20

Committed by Bo Wang on Aug 04, 2022

Summary:
Currently, `SetIsInSecondaryCache` is called after `Promote`. After `Promote`, a handle can be accessed and its flags can be set by other threads, which causes a data race.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10472

Test Plan:
unit tests
stress tests

Reviewed By: pdillinger

Differential Revision: D38403991

Pulled By: gitbw95

fbshipit-source-id: 0aaa2d2edeaf5bc799fcce605648fe49eb7119c2


Break TableReader MultiGet into filter and lookup stages (#10432) · bf4532eb

Committed by anand76 on Aug 04, 2022

Summary:
This PR is the first step in enhancing the coroutines MultiGet to be able to look up a batch in parallel across levels. By having a separate TableReader function for probing the bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level.
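
The two-stage shape can be sketched with a toy table. This is an illustration only: `ToyTable`, `FilterKeys`, `Lookup` and `MultiGetSketch` are hypothetical names, and the one-hash bitset stands in for a real Bloom filter.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one table file: sorted data plus a tiny filter.
class ToyTable {
 public:
  explicit ToyTable(std::map<std::string, std::string> data)
      : data_(std::move(data)) {
    for (const auto& kv : data_) filter_.set(Slot(kv.first));
  }

  // Stage 1: probe only the filter. `false` means "definitely absent";
  // `true` means "maybe present" (false positives are possible).
  std::vector<bool> FilterKeys(const std::vector<std::string>& keys) const {
    std::vector<bool> may_match;
    for (const auto& k : keys) may_match.push_back(filter_.test(Slot(k)));
    return may_match;
  }

  // Stage 2: the authoritative lookup; returns nullptr when absent.
  const std::string* Lookup(const std::string& key) const {
    auto it = data_.find(key);
    return it == data_.end() ? nullptr : &it->second;
  }

 private:
  static size_t Slot(const std::string& key) {
    return std::hash<std::string>()(key) % 64;
  }
  std::map<std::string, std::string> data_;
  std::bitset<64> filter_;
};

// Walks levels from newest to oldest; keys rejected by a level's filter skip
// that level's lookup and move straight on to the next level.
inline std::map<std::string, std::string> MultiGetSketch(
    const std::vector<ToyTable>& levels, std::vector<std::string> pending) {
  std::map<std::string, std::string> found;
  for (const ToyTable& table : levels) {
    const std::vector<bool> may_match = table.FilterKeys(pending);
    std::vector<std::string> still_pending;
    for (size_t i = 0; i < pending.size(); ++i) {
      const std::string* value =
          may_match[i] ? table.Lookup(pending[i]) : nullptr;
      if (value != nullptr) {
        found[pending[i]] = *value;
      } else {
        still_pending.push_back(pending[i]);
      }
    }
    pending.swap(still_pending);
  }
  return found;
}

inline void MultiGetExample() {
  std::vector<ToyTable> levels;
  levels.emplace_back(std::map<std::string, std::string>{{"a", "1"}});
  levels.emplace_back(std::map<std::string, std::string>{{"b", "2"}});
  const auto found = MultiGetSketch(levels, {"a", "b", "c"});
  assert(found.size() == 2);
  assert(found.at("a") == "1" && found.at("b") == "2");
}
```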

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432

Reviewed By: akankshamahajan15

Differential Revision: D38245910

Pulled By: anand1976

fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de


Deflake DBWALTest.RaceInstallFlushResultsWithWalObsoletion (#10456) · 538df26f

Committed by Yanqin Jin on Aug 04, 2022

Summary:
Existing DBWALTest.RaceInstallFlushResultsWithWalObsoletion test relies
on a specific interleaving of two background flush threads. We call them
bg1 and bg2, and assume bg1 starts to install flush results ahead of
bg2. After bg1 enters `ProcessManifestWrites`, bg1 waits for bg2 to also
enter `MemTableList::TryInstallMemtableFlushResults()` before bg1 can
proceed with MANIFEST write. However, if bg2 called `SyncClosedLogs()`
and needed to commit to the MANIFEST but falls behind bg1, then bg2
needs to wait for bg1 to finish writing to MANIFEST. This is a circular
dependency.

Fix this by allowing bg2 to start only after bg1 grabs the chance to
sync the WAL and commit to MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10456

Test Plan:
1. make check

2. export TEST_TMPDIR=/dev/shm && gtest-parallel -r 1000 -w 32 ./db_wal_test --gtest_filter=DBWALTest.RaceInstallFlushResultsWithWalObsoletion

Reviewed By: ltamasi

Differential Revision: D38391856

Pulled By: riversand963

fbshipit-source-id: 55f647d5b94e534c008a4dd2fb082675ddf58c96


04 Aug, 2022 2 commits

Avoid allocations/copies for large `GetMergeOperands()` results (#10458) · 504fe4de

Committed by Andrew Kryczka on Aug 04, 2022

Summary:
This PR avoids allocations and copies for the result of `GetMergeOperands()` when the average operand size is at least 256 bytes and the total operands size is at least 32KB. The `GetMergeOperands()` already included `PinnableSlice` but was calling `PinSelf()` (i.e., allocating and copying) for each operand. When this optimization takes effect, we instead call `PinSlice()` to skip that allocation and copy. Resources are pinned in order for the `PinnableSlice` to point to valid memory even after `GetMergeOperands()` returns.

The pinned resources include a referenced `SuperVersion`, a `MergingContext`, and a `PinnedIteratorsManager`. They are bundled into a `GetMergeOperandsState`. We use `SharedCleanablePtr` to share that bundle among all `PinnableSlice`s populated by `GetMergeOperands()`. That way, the last `PinnableSlice` to be `Reset()` will clean up the bundle, including unreferencing the `SuperVersion`.
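
The threshold check and the shared-ownership idea can be sketched as below. This is not the PR's code: `std::shared_ptr` stands in for `SharedCleanablePtr`, `PinnedState` stands in for the `GetMergeOperandsState` bundle, and the helper names are hypothetical. Only the two thresholds (256-byte average, 32KB total) come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

constexpr size_t kMinAvgOperandBytes = 256;   // from the PR description
constexpr size_t kMinTotalBytes = 32 * 1024;  // from the PR description

// Pin instead of copy only when operands are large on average and in total.
inline bool ShouldPinOperands(size_t num_operands, size_t total_bytes) {
  if (num_operands == 0) return false;
  return total_bytes >= num_operands * kMinAvgOperandBytes &&
         total_bytes >= kMinTotalBytes;
}

// Stand-in for the bundle of pinned resources.
struct PinnedState {
  std::vector<std::string> operands;
};

// A view that points into the bundle and keeps it alive.
struct OperandView {
  const char* data;
  size_t size;
  std::shared_ptr<const PinnedState> owner;
};

// Every view shares ownership, so the bundle is released only when the last
// view is destroyed or cleared.
inline std::vector<OperandView> PinAll(
    std::shared_ptr<const PinnedState> state) {
  std::vector<OperandView> views;
  for (const auto& op : state->operands) {
    views.push_back({op.data(), op.size(), state});
  }
  return views;
}

inline void PinExample() {
  assert(!ShouldPinOperands(16384, 16384 * 8));   // many tiny operands
  assert(ShouldPinOperands(4096, 4096 * 2048));   // large operands
  assert(!ShouldPinOperands(4, 4 * 1024));        // large average, small total
}
```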

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10458

Test Plan:
- new DB level test
- measured benefit/regression in a number of memtable scenarios

Setup command:
```
$ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -num=$num -writes=16384 -key_size=16 -value_size=$value_sz -compression_type=none -write_buffer_size=1048576000
```

Benchmark command:
```
./db_bench -threads=$threads -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=1048576000 -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -num=$num -duration=10
```

Worst regression is when a key has many tiny operands:

- Parameters: num=1 (implying 16384 operands per key), value_sz=8, threads=1
- `GetMergeOperands()` latency increases 682 micros -> 800 micros (+17%)

The regression disappears into the noise (<1% difference) if we remove the `Reset()` loop and the size counting loop. The former is arguably needed regardless of this PR as the convention in `Get()` and `MultiGet()` is to `Reset()` the input `PinnableSlice`s at the start. The latter could be optimized to count the size as we accumulate operands rather than after the fact.

Best improvement is when a key has large operands and high concurrency:

- Parameters: num=4 (implying 4096 operands per key), value_sz=2KB, threads=32
- `GetMergeOperands()` latency decreases 11492 micros -> 437 micros (-96%).

Reviewed By: cbi42

Differential Revision: D38336578

Pulled By: ajkr

fbshipit-source-id: 48146d127e04cb7f2d4d2939a2b9dff3aba18258


Fix the error path of PLUGIN_ROOT (#10446) · d23752f6

Committed by Qiaolin Yu on Aug 03, 2022

Summary:
When we try to use RocksDB with plugins as a third-party library for other databases, the plugin folder cannot be compiled correctly because of the wrong PLUGIN_ROOT variable. So we fix this error to ensure that it works perfectly when the directory of RocksDB is not the root directory.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10446

Reviewed By: jay-zhuang

Differential Revision: D38371321

Pulled By: ajkr

fbshipit-source-id: 0801b7b7dfa87751c8332fb52aac569dcdd72b5d
Co-authored-by: NSuperMT <supertempler@gmail.com>


03 Aug, 2022 5 commits

increase buffer size in PosixFileSystem::GetAbsolutePath to PATH_MAX (#10413) · 8d664ccb

Committed by Vladimir Kikhtenko on Aug 02, 2022

Summary:
RocksDB fails to open a database with a relative path when the length of cwd
is longer than 256 bytes. This happens due to ERANGE in the getcwd call.
Here we simply increase the buffer size to the most common PATH_MAX value.
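
The failure mode is easy to reproduce with POSIX `getcwd`, which fails with `ERANGE` when the buffer is too small for the path. The helper below is a hypothetical sketch, not `PosixFileSystem::GetAbsolutePath`; the fallback value for `PATH_MAX` is an assumption for platforms that do not define it.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>  // PATH_MAX on most POSIX systems
#include <cstddef>
#include <cstring>
#include <string>
#include <unistd.h>  // getcwd

#ifndef PATH_MAX
#define PATH_MAX 4096  // assumption: common Linux value
#endif

// Hypothetical helper. Returns 0 on success and stores the current working
// directory in *out; otherwise returns the errno value set by getcwd().
inline int GetCwd(std::string* out, size_t buf_size = PATH_MAX) {
  assert(out != nullptr && buf_size > 0);
  std::string buf(buf_size, '\0');
  if (getcwd(&buf[0], buf.size()) == nullptr) {
    return errno;  // ERANGE when the path does not fit in buf_size bytes
  }
  buf.resize(std::strlen(buf.c_str()));
  *out = buf;
  return 0;
}
```

A one-byte buffer can never hold a path plus its terminating NUL, so `GetCwd(&s, 1)` returns `ERANGE`, which mirrors what a 256-byte buffer does for a long enough working directory.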

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10413

Reviewed By: riversand963

Differential Revision: D38189254

Pulled By: ajkr

fbshipit-source-id: 8a0d3a78bbe87645499fbf29fb12bd3d04cd4657


Split cache to minimize internal fragmentation (#10287) · 87b82f28

Committed by Bo Wang on Aug 02, 2022

Summary:
To minimize the internal fragmentation caused by the variable size of the compressed blocks, the original block is split according to the jemalloc bin size in `Insert()` and then merged back in `Lookup()`.  Based on the analysis of the results of the following tests, from the overall internal fragmentation perspective, this PR does mitigate the internal fragmentation issue.
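
A rough sketch of the split-and-merge idea follows. It is not the PR's code: the size classes passed in are placeholders rather than jemalloc's real bin sizes, per-chunk header overhead is ignored, and `SplitIntoChunks` and `MergeChunks` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Cuts `block` into chunks whose sizes match allocator size classes, so every
// chunk except possibly the last one fills its bin exactly. `bins` must be
// ascending and positive.
inline std::vector<std::string> SplitIntoChunks(
    const std::string& block, const std::vector<size_t>& bins) {
  assert(!bins.empty() && bins.front() > 0 &&
         std::is_sorted(bins.begin(), bins.end()));
  std::vector<std::string> chunks;
  size_t pos = 0;
  while (pos < block.size()) {
    const size_t remaining = block.size() - pos;
    // First bin strictly larger than what is left; the one before it is the
    // largest bin that still fits.
    auto it = std::upper_bound(bins.begin(), bins.end(), remaining);
    const size_t len = (it == bins.begin()) ? remaining : *(it - 1);
    chunks.push_back(block.substr(pos, len));
    pos += len;
  }
  return chunks;
}

// Reassembles the original block, as a lookup would need to do.
inline std::string MergeChunks(const std::vector<std::string>& chunks) {
  std::string block;
  for (const auto& c : chunks) block += c;
  return block;
}
```

For example, with placeholder bins of 128 through 4096 bytes (powers of two), a 5000-byte block becomes chunks of 4096, 512, 256, 128 and 8 bytes, and merging them restores the original block.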

_Do more myshadow tests with the latest commit. I finished several myshadow AB Testing and the results are promising. For the config of 4GB primary cache and 3GB secondary cache, Jemalloc resident stats shows consistently ~0.15GB memory saving; the allocated and active stats show similar memory savings. The CPU usage is almost the same before and after this PR._

To evaluate the issue of memory fragmentation and the benefits of this PR, I conducted two sets of local tests as follows.

**T1**
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    90000000
RawSize:    9956.4 MB (estimated)
FileSize:   5664.8 MB (estimated)

| Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
| - | - | - |
| T1_3 | 4000 | 4000 |
| T1_4 | 2000 | 3000 |

Populate the DB:
./db_bench --benchmarks=fillrandom --num=90000000 -db=/mem_fragmentation/db_bench_1
Overwrite it to a stable state:
./db_bench --benchmarks=overwrite --num=90000000 -use_existing_db -db=/mem_fragmentation/db_bench_1

Run read tests with different cache settings:
T1_3:
MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_3_20220710 -duration=1800 &

T1_4:
MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_4_20220710 -duration=1800 &

For T1_3 and T1_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.

| Test Name | T1_3 | T1_3 after mem defrag | T1_4 | T1_4 after mem defrag |
| - | - | - | - | - |
| allocated (MB)  | 8728 | 8076 | 5518 | 5043 |
| available (MB)  | 8753 | 8092 | 5536 | 5051 |
| external fragmentation rate  | 0.003 | 0.002 | 0.003 | 0.0016 |
| resident (MB)  | 8956 | 8365 | 5655 | 5235 |

**T2**
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     256 bytes each (128 bytes after compression)
Entries:    40000000
RawSize:    10986.3 MB (estimated)
FileSize:   6103.5 MB (estimated)

| Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
| - | - | - |
| T2_3 | 4000 | 4000 |
| T2_4 | 2000 | 3000 |

Create DB (10GB):
./db_bench -benchmarks=fillrandom -use_direct_reads=true -num=40000000 -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
Overwrite it to a stable state:
./db_bench --benchmarks=overwrite --num=40000000 -use_existing_db -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2

Run read tests with different cache settings:
T2_3:
MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_3 -duration=1800  &

T2_4:
MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_4 -duration=1800  &

For T2_3 and T2_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.

| Test Name |  T2_3 | T2_3 after mem defrag | T2_4 | T2_4 after mem defrag |
| -  | - | - | - | - |
| allocated (MB)  | 8425 | 8093 | 5426 | 5149 |
| available (MB)  | 8489 | 8138 | 5435 | 5158 |
| external fragmentation rate  | 0.008 | 0.0055 | 0.0017 | 0.0017 |
| resident (MB)  | 8676 | 8392 | 5541 | 5321 |

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10287

Test Plan: Unit tests.

Reviewed By: anand1976

Differential Revision: D37743362

Pulled By: gitbw95

fbshipit-source-id: 0010c5af08addeacc5ebbc4ffe5be882fb1d38ad


Fix race in ExitAsBatchGroupLeader with pipelined writes (#9944) · bef3127b

Committed by mpoeter on Aug 02, 2022

Summary:
Resolves https://github.com/facebook/rocksdb/issues/9692

This PR adds a unit test that reproduces the race described in https://github.com/facebook/rocksdb/issues/9692 and a corresponding fix.

The unit test does not have any assertions, because I could not find a reliable and safe way to assert that the writers list does not form a cycle. So with the old (buggy) code, the test would simply hang, while with the fix the test passes successfully.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9944

Reviewed By: pdillinger

Differential Revision: D36134604

Pulled By: riversand963

fbshipit-source-id: ef636c5a79ddbef18658ab2f19ca9210a427324a


Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460) · 27f3af59

Committed by Peter Dillinger on Aug 02, 2022

Summary:
TL;DR: due to a recent change, if you drop a column family,
often that DB will no longer fsync after writing new SST files
to remaining or new column families, which could lead to data
loss on power loss.

More bug detail:
The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
DB::Close time rather than waiting for DB object destruction.
Unfortunately, it also closes shared FSDirectory objects on
DropColumnFamily (& destroy remaining handles), which can lead
to use-after-Close on FSDirectory shared with remaining column
families. Those "uses" are only Fsyncs (or redundant Closes). In
the default Posix filesystem, an Fsync on a closed FSDirectory is a
quiet no-op. Consequently (under most configurations), if you drop
a column family, that DB will no longer fsync after writing new SST
files to column families sharing the same directory (true under most
configurations).

More fix detail:
Basically, this removes unnecessary Close ops on destroying
ColumnFamilyData. We let `shared_ptr` take care of calling the
destructor at the right time. If the intent was to require Close be
called before destroying FSDirectory, that was not made clear by the
author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
timely destruction of FSDirectory to suffice as Close (in
CountedFileSystem). Added a TODO to revisit that.

Also in this PR:
* Added a TODO to share FSDirectory instances between DB and its column
families. (Already shared among column families.)
* Made DB::Close attempt to close all its open FSDirectory objects even
if there is a failure in closing one. Also code clean-up around this
logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460

Test Plan:
add an assert to check for use-after-Close. With that
existing tests can detect the misuse. With fix, tests pass (except noted
relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049)

Reviewed By: ajkr

Differential Revision: D38357922

Pulled By: pdillinger

fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137


regression_test.sh: kill very old db_bench (and more) (#10441) · 9da97a37

Committed by Peter Dillinger on Aug 02, 2022

Summary:
If a db_bench process gets hung or runaway on a machine, that
could prevent regression_test.sh from ever making progress. To fix that,
regression_test.sh will now kill any db_bench process that is >12 hours
old. Also made this more reliable by not using string matching (grep) to
get db_bench process IDs.

I also had to make some other updates to get local runs working
reliably:
* Fix some quoting hell and other dubious complexity with db_bench_cmd
* Only save a DB for re-use when building it passes
* Report failed command in more cases
* Add safeguards against "rm -rf ."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10441

Test Plan:
manual (local and remote), with temporary changes e.g. to have
a manageable age threshold etc.

Reviewed By: riversand963

Differential Revision: D38285537

Pulled By: pdillinger

fbshipit-source-id: 4d598876aedc38ac4bd9d8ddf32c5995d8e44db8


02 Aug, 2022 4 commits

Do not put blobs read during compaction into cache (#10457) · cc8ded61

Committed by Levi Tamasi on Aug 01, 2022

Summary:
During compaction, blobs are currently read using the default
`ReadOptions`, which has the `fill_cache` flag set to true. Earlier,
this didn't make any difference since we didn't have a blob cache;
however, now we have to explicitly set this flag to false to avoid
polluting the cache during compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D38333528

Pulled By: ltamasi

fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2


Remove unused fields from FileMetaData (temporarily) (#10443) · fbfcf5cb

Committed by Yanqin Jin on Aug 01, 2022

Summary:
FileMetaData::[min|max]_timestamp are not currently being used or
tracked by RocksDB, even when user-defined timestamp is enabled. Each of
them is a std::string which can occupy 32 bytes. Remove them for now.
They may be added back when we have a pressing need for them. When we do
add them back, consider storing them in a more compact way, e.g. one
boolean flag and a byte array of size 16.
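
One way to read the "boolean flag and a byte array of size 16" suggestion is sketched below. `CompactTimestamp` is a hypothetical type, not something in RocksDB, and it assumes timestamps of at most 16 bytes; shorter timestamps are zero-padded and their original length is not recorded.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical compact representation: a presence flag plus inline bytes,
// with no heap allocation and no std::string bookkeeping.
struct CompactTimestamp {
  bool has_value = false;
  std::array<unsigned char, 16> bytes{};

  // Stores `ts` if it fits; returns false (leaving the object unchanged)
  // when the timestamp is longer than 16 bytes.
  bool Set(const std::string& ts) {
    if (ts.size() > bytes.size()) return false;
    bytes.fill(0);
    if (!ts.empty()) std::memcpy(bytes.data(), ts.data(), ts.size());
    has_value = true;
    return true;
  }
};

static_assert(sizeof(CompactTimestamp) <= 24,
              "flag plus 16 inline bytes should stay small");

inline void CompactTimestampExample() {
  CompactTimestamp ts;
  assert(!ts.has_value);
  const bool stored = ts.Set(std::string(8, '\x01'));
  assert(stored && ts.has_value);
  assert(ts.bytes[7] == 1 && ts.bytes[8] == 0);
  (void)stored;
}
```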

Per file min/max timestamp bounds are available as table properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10443

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D38292275

Pulled By: riversand963

fbshipit-source-id: 841dc4e855ad8f8481c80cb020603de9607c9c94


Use EnvLogger instead of PosixLogger (#10436) · cc209980

Committed by sdong on Aug 01, 2022

Summary:
EnvLogger was built to replace PosixLogger and supports multiple Envs. Make FileSystem use EnvLogger by default, remove the Posix FS specific implementation, and remove the PosixLogger code.
Some hacky changes are made to make sure iostats are not polluted by logging, in order to pass existing unit tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10436

Test Plan: Run db_bench and watch info log files.

Reviewed By: anand1976

Differential Revision: D38259855

fbshipit-source-id: 67d65874bfba7a33535b6d0dd0ed92cbbc9888b8


Add CompressedSecondaryCache into stress test (#10442) · e1b176d2

Committed by gitbw95 on Aug 01, 2022

Summary:
The secondary cache is randomly disabled or enabled with CompressedSecondaryCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10442

Test Plan: To test that the CompressedSecondaryCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=--duration=960 blackbox_crash_test`

Reviewed By: anand1976

Differential Revision: D38290796

Pulled By: gitbw95

fbshipit-source-id: bb7027b39e0ed9c0c62835abe09e759898130ec8


01 Aug, 2022 1 commit

Provide support for subcompactions with user-defined timestamps (#10344) · 56463d44

Committed by Akanksha Mahajan on Jul 31, 2022

Summary:
The subcompaction logic currently picks file boundaries as subcompaction boundaries. This is not compatible with user-defined timestamps because of two issues.
Issue1: ReadOptions.iterate_lower_bound and ReadOptions.iterate_upper_bound contain timestamps, which results in an assertion failure because BlockBasedTableIterator expects bounds without timestamps. As a result of the wrong comparison, the end key is returned as user_key, triggering the assertion failure.
Issue2: Two keys that differ only by user timestamp might get processed by two different subcompactions (and thus two different CompactionIterator state machines), which in turn can cause data correctness issues.

This PR provides support to re-enable subcompactions with user-defined timestamps.
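To illustrate Issue2, the following is a minimal standalone sketch, not RocksDB's actual code. It assumes the timestamp is a fixed-size suffix of the user key; the helper names `StripTimestamp`, `SameUserKey`, and `MakeKey`, as well as the 8-byte big-endian encoding, are hypothetical and chosen only for illustration. Comparing keys with the timestamp stripped is one way to make sure keys that differ only by timestamp are never split across a boundary.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not RocksDB's API): drops the fixed-size timestamp
// suffix from a user key. Assumes the key is at least ts_size bytes long.
inline std::string_view StripTimestamp(std::string_view key_with_ts,
                                       std::size_t ts_size) {
  return key_with_ts.substr(0, key_with_ts.size() - ts_size);
}

// Two keys must land in the same subcompaction when they differ only in
// their timestamp suffix.
inline bool SameUserKey(std::string_view a, std::string_view b,
                        std::size_t ts_size) {
  return StripTimestamp(a, ts_size) == StripTimestamp(b, ts_size);
}

// Builds "user key + timestamp" for illustration. The 8-byte big-endian
// encoding is an assumption of this sketch.
inline std::string MakeKey(std::string_view user_key, unsigned long long ts) {
  std::string out(user_key);
  for (int shift = 56; shift >= 0; shift -= 8) {
    out.push_back(static_cast<char>((ts >> shift) & 0xff));
  }
  return out;
}
```

With these helpers, `SameUserKey(MakeKey("foo", 1), MakeKey("foo", 2), 8)` is true, so a boundary chosen on the stripped key cannot separate the two versions of `foo`.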

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10344

Test Plan:
Added new unit test
- Without fix for Issue1 unit test MultipleSubCompactions fails with error:
```
db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
Received signal 6 (Aborted)
#0   /usr/local/fbcode/platform009/lib/libc.so.6(gsignal+0x100) [0x7f8fbbbfe530] db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
Aborted (core dumped)
```
Ran stress test
`make crash_test_with_ts -j32`

Reviewed By: riversand963

Differential Revision: D38220841

Pulled By: akankshamahajan15

fbshipit-source-id: 5d5cae2bd37fcaeba1e77fce0a69070ad4158ccb

56463d44

30 Jul, 2022 4 commits

Fix cache metrics update when secondary cache is used (#10440) · 54aebb2c

Committed by anand76 on Jul 29, 2022

Summary:
If a secondary cache is configured, it's possible that a cache lookup will get a hit in the secondary cache. In that case, ```LRUCacheShard::Lookup``` doesn't immediately update the ```total_charge``` for the item handle if the ```wait``` parameter is false (i.e., the caller will check for completion later). However, ```BlockBasedTable::GetEntryFromCache``` assumes the handle is complete and calls ```UpdateCacheHitMetrics```, which checks the usage of the cache item and fails the assert in https://github.com/facebook/rocksdb/blob/main/cache/lru_cache.h#L237 (```assert(total_charge >= meta_charge)```).

To fix this, we call ```UpdateCacheHitMetrics``` later in ```MultiGet```, after waiting for all cache lookup completions.

Test plan -
Run crash test with changes from https://github.com/facebook/rocksdb/issues/10160

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10440

Reviewed By: gitbw95

Differential Revision: D38283968

Pulled By: anand1976

fbshipit-source-id: 31c54ef43517726c6e5fdda81899b364241dd7e1

54aebb2c

Update passing rate_limiter_priority for a PartitionedFilterBlockReader function to FS (#10438) · 1aab5b32

Committed by Bo Wang on Jul 29, 2022

Summary:
Add param rate_limiter_priority in PartitionedFilterBlockReader::GetFilterPartitionBlock.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10438

Test Plan: Unit Tests.

Reviewed By: anand1976

Differential Revision: D38266395

Pulled By: gitbw95

fbshipit-source-id: 3ed062a3b43d6df323371cb0d266f7fe869e9ad2

1aab5b32

db_bench -use_stderr_info_logger to print timestamp (#10435) · aec28eba

Committed by sdong on Jul 29, 2022

Summary:
Right now db_bench -use_stderr_info_logger would redirect RocksDB info logging to stderr but no timestamp is printed out. Add a timestamp to the output.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10435

Test Plan: Run "db_bench -use_stderr_info_logger"

Reviewed By: riversand963

Differential Revision: D38258699

fbshipit-source-id: 3fee6eb1205127b923bc6a660f86bd2742519aec

aec28eba

Fix regression_test.sh deleterandom duration (#10437) · 15da2252

Committed by Peter Dillinger on Jul 29, 2022

Summary:
deleterandom tests are too fast to get good signal, e.g.
--deletes=31250 in 0.170 seconds vs. --reads=1500000 in 288.491
seconds for readrandom. Removing the special handling (unknown
motivation in faa7eb3b) should suffice.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10437

Test Plan: watch continuous results

Reviewed By: ltamasi

Differential Revision: D38261185

Pulled By: pdillinger

fbshipit-source-id: 0f1b1b19efccda5689027d36cc2f01307f36031d

15da2252

29 Jul, 2022 2 commits

Revert "Add a blob-specific cache priority (#10309)" (#10434) · 65036e42

Committed by Peter Dillinger on Jul 29, 2022

Summary:
This reverts commit 8d178090
because of a clear performance regression seen in internal dashboard
https://fburl.com/unidash/tpz75iee

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10434

Reviewed By: ltamasi

Differential Revision: D38256373

Pulled By: pdillinger

fbshipit-source-id: 134aa00f50dd7b1bbe037c227884a351342ec44b

65036e42

Allow manual compactions to run in parallel by default (#10317) · c7ccbb33

Committed by Andrew Kryczka on Jul 28, 2022

Summary:
This PR changes the default value of
`CompactRangeOptions::exclusive_manual_compaction` from true to false so
manual `CompactRange()`s can run in parallel with other compactions. I
believe no artificial parallelism restriction is the intuitive behavior
so feel the old default value is a trap, which I have fallen into
several times, including yesterday.
`CompactRangeOptions::exclusive_manual_compaction == false` has been
used in both our correctness test and in production for years so should
be reasonably safe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10317

Reviewed By: jay-zhuang

Differential Revision: D37659392

Pulled By: ajkr

fbshipit-source-id: 504915e978bbe300b79483d064070c75e93d91e5

c7ccbb33

28 Jul, 2022 5 commits

Best efforts recovery to skip empty MANIFEST (#10416) · 87649d32

Committed by Jay Zhuang on Jul 27, 2022

Summary:
Skip empty MANIFEST file during best_efforts_recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10416

Test Plan: make failed db_stress test pass

Reviewed By: riversand963

Differential Revision: D38126273

Pulled By: jay-zhuang

fbshipit-source-id: 4498d322b09eaa194dd2cbf9c683d62ab54bfb01

87649d32

Add a blob-specific cache priority (#10309) · 8d178090

Committed by Gang Liao on Jul 27, 2022

Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
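The intended ordering can be pictured with a deliberately simplified model. This is a sketch under stated assumptions, not RocksDB's LRU cache: the class `TieredCacheModel`, the enum values, and the one-queue-per-priority layout are all hypothetical. It only shows the idea that entries at the bottom priority are evicted before low-priority ones, which in turn go before high-priority ones.

```cpp
#include <array>
#include <cstddef>
#include <deque>
#include <optional>
#include <string>

// Hypothetical priority levels for this sketch, highest first.
enum class Priority { kHigh = 0, kLow = 1, kBottom = 2 };

// Simplified model: one FIFO queue per priority level.
class TieredCacheModel {
 public:
  void Insert(const std::string& key, Priority pri) {
    queues_[static_cast<std::size_t>(pri)].push_back(key);
  }

  // Evicts the oldest entry from the lowest-priority non-empty queue.
  // Returns std::nullopt when the cache is empty.
  std::optional<std::string> EvictOne() {
    for (std::size_t i = queues_.size(); i-- > 0;) {
      if (!queues_[i].empty()) {
        std::string victim = queues_[i].front();
        queues_[i].pop_front();
        return victim;
      }
    }
    return std::nullopt;
  }

 private:
  std::array<std::deque<std::string>, 3> queues_;
};
```

In this model, inserting an index block at `kHigh`, a data block at `kLow`, and a blob at `kBottom` means the blob is evicted first, so data blocks are prioritized over blobs even when both share one backing cache.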

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309

Reviewed By: ltamasi

Differential Revision: D38211655

Pulled By: gangliao

fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5

8d178090

Fix assertion failure and memory leak in ClockCache. (#10430) · d976f689

Committed by Guido Tagliavini Ponce on Jul 27, 2022

Summary:
This fixes two issues:
- [T127355728](https://www.internalfb.com/intern/tasks/?t=127355728): In the stress tests, when the ClockCache is operating close to full capacity and a burst of inserts are concurrently executed, every slot in the hash table may become occupied. This contradicts an assertion in the code, which is no longer valid in the lock-free setting. We are removing that assertion and handling the case of an insertion into a full table.
- [T127427659](https://www.internalfb.com/intern/tasks/?t=127427659): There was a memory leak when an insertion is performed over capacity, but no handle is provided. In that case, a handle was dynamically allocated, but the pointer wasn't stored anywhere.
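The first issue can be sketched with a small single-threaded model. This is an illustration only, not the ClockCache implementation: the class `FixedTable` and its linear probing scheme are assumptions made for the example. The point it shows is that an insertion into a table whose slots are all occupied is reported to the caller as a failure rather than treated as impossible via an assertion.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical fixed-size open-addressing table with linear probing.
class FixedTable {
 public:
  explicit FixedTable(std::size_t slots) : slots_(slots) {}

  // Returns true if the key is stored (or was already present).
  // Returns false when every slot is occupied by another key, so the
  // caller can handle a full table instead of hitting an assertion.
  bool Insert(const std::string& key) {
    if (slots_.empty()) return false;
    const std::size_t start = std::hash<std::string>{}(key) % slots_.size();
    for (std::size_t probe = 0; probe < slots_.size(); ++probe) {
      auto& slot = slots_[(start + probe) % slots_.size()];
      if (!slot.has_value()) {
        slot = key;
        return true;
      }
      if (*slot == key) return true;  // already present
    }
    return false;  // table is full
  }

 private:
  std::vector<std::optional<std::string>> slots_;
};
```

With two slots, inserting `"a"` and `"b"` succeeds, and a third distinct key makes `Insert` return false. The real cache is lock-free and concurrent, which this model does not attempt to capture.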

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10430

Test Plan:
- ``make -j24 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``

Reviewed By: pdillinger

Differential Revision: D38226114

Pulled By: guidotag

fbshipit-source-id: 18f6ab7e6214e11e9721d5ff289db1bf795d0008

d976f689

Mention kRoundRobin in HISTORY.md (#10421) · 8b2d4292

Committed by Zichen Zhu on Jul 27, 2022

Summary:
Update HISTORY.md for CompactionPri::kRoundRobin. Detailed implementation can be found in [PR10107](https://github.com/facebook/rocksdb/pull/10107), [PR10227](https://github.com/facebook/rocksdb/pull/10227), [PR10250](https://github.com/facebook/rocksdb/pull/10250), [PR10278](https://github.com/facebook/rocksdb/pull/10278), [PR10316](https://github.com/facebook/rocksdb/pull/10316), and [PR10341](https://github.com/facebook/rocksdb/pull/10341)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10421

Reviewed By: ajkr

Differential Revision: D38194070

Pulled By: littlepig2013

fbshipit-source-id: 4ce153dc0bf22cd865d09c5429955023dbc90f37

8b2d4292

add trace tools flags in CMakeLists (#10404) · 8c0810de

Committed by BilyZ98 on Jul 27, 2022

Summary:
It seems there are no flags in CMakeLists.txt to control the generation of trace tools, including trace_analyzer and block_cache_trace_analyzer.

This PR adds them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10404

Reviewed By: ajkr

Differential Revision: D38077673

Pulled By: jay-zhuang

fbshipit-source-id: b4d83b3a3281edf34b2ef4a8715c2835e53ffc0f

8c0810de

27 Jul, 2022 2 commits

ldb to display public unique id and dump work with key range (#10417) · 6a0010eb

Committed by Jay Zhuang on Jul 26, 2022

Summary:
2 ldb command improvements:
1. `ldb manifest_dump --verbose` displays both the internal unique id and the public id, which is useful for manually checking sst_unique_id between the manifest and an SST;
2. `ldb dump` has `--from/to` options, but they were not working. Add support for them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10417

Test Plan:
run the command locally
```
$ ldb manifest_dump --path=MANIFEST-000026 --verbose
...
AddFile: 0 18 1023 'bar' seq:6, type:1 .. 'foo' seq:5, type:1 oldest_ancester_time:1658787615 file_creation_time:1658787615 file_checksum: file_checksum_func_name: Unknown unique_id(internal): {8800772265202404198,16149248642318466463} public_unique_id: F3E0A029B631D7D4-6E402DE08E771780
```
```
$ ldb dump --path=000036.sst --from=key000006 --to=key000009
Sst file format: block-based
'key000006' seq:2411, type:1 => value6
'key000007' seq:2412, type:1 => value7
'key000008' seq:2413, type:1 => value8
...
```
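The key-range behavior in the second command can be sketched as a filter over a sorted table. This is a minimal illustration, not ldb's code: the function `DumpRange` is hypothetical, and treating the range as half-open (`from` inclusive, `to` exclusive) is an assumption of this sketch.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the entries with from <= key < to, in key
// order. The half-open interval is an assumption of this sketch.
std::vector<std::pair<std::string, std::string>> DumpRange(
    const std::map<std::string, std::string>& table, const std::string& from,
    const std::string& to) {
  std::vector<std::pair<std::string, std::string>> out;
  // lower_bound seeks to the first key that is not less than `from`.
  for (auto it = table.lower_bound(from);
       it != table.end() && it->first < to; ++it) {
    out.emplace_back(it->first, it->second);
  }
  return out;
}
```

Called with `from = "key000006"` and `to = "key000009"` on a table holding `key000005` through `key000009`, this returns the three entries `key000006`, `key000007`, and `key000008`.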

Reviewed By: ajkr

Differential Revision: D38136140

Pulled By: jay-zhuang

fbshipit-source-id: 8be6eeaa07ff9f089e33011ebe90fd0b69d33bf3

6a0010eb

Allow sufficient subcompactions under round-robin compaction priority (#10422) · c945a9a6

Committed by Zichen Zhu on Jul 26, 2022

Summary:
Allow sufficient subcompactions to be used when the number of input files is less than `max_subcompactions` under round-robin compaction priority.

Test Case:
Add `RoundRobinWithoutAdditionalResources` into `db_compaction_test`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10422

Reviewed By: ajkr

Differential Revision: D38186545

Pulled By: littlepig2013

fbshipit-source-id: b8e5098306f1e5b9561dfafafc8300a38f7fe88e

c945a9a6

kvdb / rocksdb synced successfully 10 months ago