1. 09 Aug, 2022 5 commits
  2. 08 Aug, 2022 1 commit
  3. 06 Aug, 2022 6 commits
    • B
      Remove local static string (#8103) · e446bc65
      Burton Li committed
      Summary:
      A local static string is not friendly to a Jemalloc arena-aware implementation: it is allocated on the arena of the first caller, which causes a crash if that arena gets refunded earlier.
      
      P.S. In a Jemalloc arena-aware implementation, each RocksDB instance uses only certain Jemalloc arenas, and an arena is refunded after its associated DB instance is destroyed.
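
      As a rough illustration of the pattern (not the code touched by the PR), the sketch below contrasts a function-local static `std::string`, whose heap buffer is allocated by whichever caller reaches it first, with a `std::string_view` over a string literal, which needs no allocation at all. The function names and the string contents are hypothetical.

      ```cpp
      #include <cassert>
      #include <string>
      #include <string_view>

      // Pattern described above: the string's heap buffer is allocated lazily on
      // the first call, so it lands on whatever allocator arena that caller uses.
      // (The text is long on purpose so small-string optimization does not apply.)
      inline const std::string& NameFromLocalStatic() {
        static const std::string kName =
            "hypothetical.option.name.that.is.longer.than.any.sso.buffer";
        return kName;
      }

      // One allocation-free alternative: a view over a string literal, which has
      // static storage duration and never touches the allocator.
      constexpr std::string_view NameFromLiteral() {
        return "hypothetical.option.name.that.is.longer.than.any.sso.buffer";
      }
      ```

      Both functions yield the same text; only the first one ties the lifetime of a heap buffer to the arena of its first caller.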
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8103
      
      Reviewed By: ajkr
      
      Differential Revision: D38477235
      
      Pulled By: ltamasi
      
      fbshipit-source-id: a58d32cb647ed64c144b4736fb2d5db27c2c28f9
      e446bc65
    • A
      Close the Logger before rolling to next one in AutoRollLogger (#10488) · ce370d6b
      Akanksha Mahajan committed
      Summary:
      Close the existing logger first to release the existing
      handle before renaming the file using the file system.
      Since `AutoRollLogger::Flush` pins `logger_`, `logger_` can't be closed unless it's
      the last reference; otherwise, Flush would segfault on a file
      that has been closed.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10488
      
      Test Plan: CircleCI jobs
      
      Reviewed By: ajkr
      
      Differential Revision: D38469249
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: dfbdb89b4ac37639aefcc503526f24753445fd3f
      ce370d6b
    • S
      Include some legal contents in website (#10491) · 2259bb9c
      sdong committed
      Summary:
      We were asked to include the TOS, Privacy Policy and copyright notice on the website, so this adds them.
      Also changed the GitHub and Twitter links to RocksDB's rather than Facebook Open Source's, and now link to Meta Open Source's home page.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10491
      
      Test Plan: Test the website locally.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38475212
      
      fbshipit-source-id: f73622f8f3d361b4586221ffb6deac4f4a11bb15
      2259bb9c
    • J
      Re-enable SuggestCompactRangeTest and add Universal Compaction test (#10473) · edae671c
      Jay Zhuang committed
      Summary:
      The feature `SuggestCompactRange()` is still experimental, so this
      simply re-adds the test.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10473
      
      Test Plan: CI
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38427153
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 0b4491c947cbce6c18ff147b167e3c678633129a
      edae671c
    • H
      Deflake ChargeFileMetadataTestWithParam/ChargeFileMetadataTestWithParam.Basic/0 (#10481) · 56dbcb4f
      Hui Xiao committed
      Summary:
      **Context/summary:**
      `ChargeFileMetadataTestWithParam/ChargeFileMetadataTestWithParam.Basic/0` relies on `DBImpl::BackgroundCallCompaction:PurgedObsoleteFiles` happening before verifying `EXPECT_EQ(file_metadata_charge_only_cache->GetCacheCharge(),
                    1 * CacheReservationManagerImpl<
                            CacheEntryRole::kFileMetadata>::GetDummyEntrySize());` or `EXPECT_EQ(file_metadata_charge_only_cache->GetCacheCharge(), 0);`, so that the appropriate cache reservation release is done before checking.
      
      However, this might not be the case under some timing delay and spurious wake-up, as coerced below.
      
      ```
       diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
      index 4378f3212..3e4f60853 100644
       --- a/db/db_impl/db_impl_compaction_flush.cc
      +++ b/db/db_impl/db_impl_compaction_flush.cc
      @@ -2989,6 +2989,8 @@ void DBImpl::BackgroundCallCompaction(PrepickedCompaction* prepicked_compaction,
           if (job_context.HaveSomethingToClean() ||
               job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
             mutex_.Unlock();
      +      bg_cv_.SignalAll();
      +      usleep(1000);
               // Have to flush the info logs before bg_compaction_scheduled_--
              // because if bg_flush_scheduled_ becomes 0 and the lock is
              // released, the deconstructor of DB can kick in and destroy all the
              // states of DB so info_log might not be available after that point.
              // It also applies to access other states that DB owns.
              log_buffer.FlushBufferToLog();
              if (job_context.HaveSomethingToDelete()) {
                PurgeObsoleteFiles(job_context);
                TEST_SYNC_POINT("DBImpl::BackgroundCallCompaction:PurgedObsoleteFiles");
              }
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10481
      
      Test Plan:
      The test of interest failed often at the above coercion:
      
      After fix, the test of interest passed at the above coercion:
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38438256
      
      Pulled By: hx235
      
      fbshipit-source-id: de80ecdb250174f00e7c2f5e4d952695ed56f51e
      56dbcb4f
    • C
      Fragment memtable range tombstone in the write path (#10380) · 9d77bf8f
      Changyu Bi committed
      Summary:
      - Right now each read fragments the memtable range tombstones https://github.com/facebook/rocksdb/issues/4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact.
      - db_bench is updated to print out the number of range deletions executed if there is any.
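
      To make "fragmenting" concrete, here is a minimal standalone sketch, assuming tombstones are half-open ranges `[start, end)` tagged with a sequence number. It cuts overlapping tombstones at every endpoint into non-overlapping fragments. The types and the function name are hypothetical, and unlike a full fragmenter it keeps only the newest sequence number per fragment.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstdint>
      #include <iterator>
      #include <set>
      #include <string>
      #include <vector>

      // A range tombstone covering keys in [start, end) at sequence number seq.
      struct RangeTombstone {
        std::string start;
        std::string end;
        uint64_t seq;
      };

      // A non-overlapping piece of the tombstone key space.
      // Simplification: only the newest covering sequence number is kept.
      struct Fragment {
        std::string start;
        std::string end;
        uint64_t max_seq;
      };

      std::vector<Fragment> FragmentTombstones(
          const std::vector<RangeTombstone>& tombstones) {
        // Every tombstone endpoint is a potential fragment boundary.
        std::set<std::string> bounds;
        for (const auto& t : tombstones) {
          assert(t.start < t.end);
          bounds.insert(t.start);
          bounds.insert(t.end);
        }
        std::vector<Fragment> fragments;
        if (bounds.size() < 2) return fragments;
        // Walk adjacent boundary pairs; each pair is either fully covered by a
        // tombstone or not covered by it at all.
        for (auto lo = bounds.begin(), hi = std::next(lo); hi != bounds.end();
             ++lo, ++hi) {
          bool covered = false;
          uint64_t max_seq = 0;
          for (const auto& t : tombstones) {
            if (t.start <= *lo && *hi <= t.end) {
              covered = true;
              max_seq = std::max(max_seq, t.seq);
            }
          }
          if (covered) fragments.push_back({*lo, *hi, max_seq});
        }
        return fragments;
      }
      ```

      For example, tombstones `[a, e)` at sequence 5 and `[c, g)` at sequence 9 become the fragments `[a, c)`, `[c, e)` and `[e, g)`. Doing this once, before the memtable becomes immutable, is what lets reads skip the per-read fragmentation cost.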
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380
      
      Test Plan:
      - CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
      - Benchmark: as this PR only optimizes the immutable memtable path, the number of writes in the benchmark is chosen such that an immutable memtable is created and range tombstones are in that memtable.
      
      ```
      single thread:
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
      
      multi_thread
      ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
      ```
      Commit 99cdf16464a057ca44de2f747541dedf651bae9e is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
      Results are averaged over 5 runs.
      
      Single thread result:
      | Max # tombstones  | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op |  99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
      | ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
      | 0    |6.68     |6.57     |6.72     |4.72     |4.79     |4.54     |
      | 1    |6.67     |6.58     |6.62     |5.41     |4.74     |4.72     |
      | 10   |6.59     |6.5      |6.56     |7.83     |4.69     |4.59     |
      | 100  |6.62     |6.75     |6.58     |29.57    |5.04     |5.09     |
      | 1000 |6.54     |6.82     |6.61     |320.33   |5.22     |5.21     |
      
      32-thread result: note that "Max # tombstones" is per thread.
      | Max # tombstones  | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op |  99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
      | ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
      | 0    |234.52   |260.25   |239.42   |5.06     |5.38     |5.09     |
      | 1    |236.46   |262.0    |231.1    |19.57    |22.14    |5.45     |
      | 10   |236.95   |263.84   |251.49   |151.73   |21.61    |5.73     |
      | 100  |268.16   |296.8    |280.13   |2308.52  |22.27    |6.57     |
      
      Reviewed By: ajkr
      
      Differential Revision: D37916564
      
      Pulled By: cbi42
      
      fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
      9d77bf8f
  4. 05 Aug, 2022 3 commits
    • B
      Fix data race reported on SetIsInSecondaryCache in LRUCache (#10472) · f28d0c20
      Bo Wang committed
      Summary:
      Currently, `SetIsInSecondaryCache` is called after `Promote`. After `Promote`, a handle can be accessed and its flags can be set. This causes a data race.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10472
      
      Test Plan:
      unit tests
      stress tests
      
      Reviewed By: pdillinger
      
      Differential Revision: D38403991
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 0aaa2d2edeaf5bc799fcce605648fe49eb7119c2
      f28d0c20
    • A
      Break TableReader MultiGet into filter and lookup stages (#10432) · bf4532eb
      anand76 committed
      Summary:
      This PR is the first step in enhancing the coroutines MultiGet to be able to look up a batch in parallel across levels. By having a separate TableReader function for probing the bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level.
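
      The two-stage idea can be sketched with a toy table reader. This is only an illustration under stated assumptions: `ToyTableReader`, `FilterStage` and `LookupStage` are hypothetical names, and a hash-slot set stands in for a real bloom filter (no false negatives, false positives possible).

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <functional>
      #include <map>
      #include <string>
      #include <unordered_set>
      #include <vector>

      class ToyTableReader {
       public:
        void Add(const std::string& key, const std::string& value) {
          filter_.insert(Slot(key));
          data_[key] = value;
        }

        // Stage 1: cheaply drop keys that are definitely not in this file.
        std::vector<std::string> FilterStage(
            const std::vector<std::string>& batch) const {
          std::vector<std::string> maybe_present;
          for (const auto& key : batch) {
            if (filter_.count(Slot(key)) > 0) maybe_present.push_back(key);
          }
          return maybe_present;
        }

        // Stage 2: do the real lookup only for keys that survived stage 1.
        std::map<std::string, std::string> LookupStage(
            const std::vector<std::string>& keys) const {
          std::map<std::string, std::string> found;
          for (const auto& key : keys) {
            auto it = data_.find(key);
            if (it != data_.end()) found[key] = it->second;
          }
          return found;
        }

       private:
        // Stand-in for bloom filter hashing: one slot per key.
        static size_t Slot(const std::string& key) {
          return std::hash<std::string>{}(key) % 1024;
        }
        std::unordered_set<size_t> filter_;
        std::map<std::string, std::string> data_;
      };
      ```

      Keys rejected by `FilterStage` can move on to the next level immediately, without waiting for the lookups of the remaining keys in this file.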
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38245910
      
      Pulled By: anand1976
      
      fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de
      bf4532eb
    • Y
      Deflake DBWALTest.RaceInstallFlushResultsWithWalObsoletion (#10456) · 538df26f
      Yanqin Jin committed
      Summary:
      Existing DBWALTest.RaceInstallFlushResultsWithWalObsoletion test relies
      on a specific interleaving of two background flush threads. We call them
      bg1 and bg2, and assume bg1 starts to install flush results ahead of
      bg2. After bg1 enters `ProcessManifestWrites`, bg1 waits for bg2 to also
      enter `MemTableList::TryInstallMemtableFlushResults()` before bg1 can
      proceed with MANIFEST write. However, if bg2 called `SyncClosedLogs()`
      and needed to commit to the MANIFEST but falls behind bg1, then bg2
      needs to wait for bg1 to finish writing to MANIFEST. This is a circular
      dependency.
      
      Fix this by allowing bg2 to start only after bg1 grabs the chance to
      sync the WAL and commit to MANIFEST.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10456
      
      Test Plan:
      1. make check
      
      2. export TEST_TMPDIR=/dev/shm && gtest-parallel -r 1000 -w 32 ./db_wal_test --gtest_filter=DBWALTest.RaceInstallFlushResultsWithWalObsoletion
      
      Reviewed By: ltamasi
      
      Differential Revision: D38391856
      
      Pulled By: riversand963
      
      fbshipit-source-id: 55f647d5b94e534c008a4dd2fb082675ddf58c96
      538df26f
  5. 04 Aug, 2022 2 commits
    • A
      Avoid allocations/copies for large `GetMergeOperands()` results (#10458) · 504fe4de
      Andrew Kryczka committed
      Summary:
      This PR avoids allocations and copies for the result of `GetMergeOperands()` when the average operand size is at least 256 bytes and the total operands size is at least 32KB. The `GetMergeOperands()` already included `PinnableSlice` but was calling `PinSelf()` (i.e., allocating and copying) for each operand. When this optimization takes effect, we instead call `PinSlice()` to skip that allocation and copy. Resources are pinned in order for the `PinnableSlice` to point to valid memory even after `GetMergeOperands()` returns.
      
      The pinned resources include a referenced `SuperVersion`, a `MergingContext`, and a `PinnedIteratorsManager`. They are bundled into a `GetMergeOperandsState`. We use `SharedCleanablePtr` to share that bundle among all `PinnableSlice`s populated by `GetMergeOperands()`. That way, the last `PinnableSlice` to be `Reset()` will clean up the bundle, including unreferencing the `SuperVersion`.
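
      The two mechanisms described above can be sketched in isolation. The thresholds (average of at least 256 bytes, total of at least 32KB) come from the summary; everything else is an assumption for illustration: `ShouldPinOperands`, `ToyPinnedState` and `ToyPinnableSlice` are hypothetical names, and `std::shared_ptr` stands in for `SharedCleanablePtr`.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <memory>
      #include <numeric>
      #include <vector>

      constexpr size_t kMinAvgOperandBytes = 256;
      constexpr size_t kMinTotalOperandBytes = 32 * 1024;

      // Pin instead of copy only when operands are large on average AND in total.
      bool ShouldPinOperands(const std::vector<size_t>& operand_sizes) {
        if (operand_sizes.empty()) return false;
        const size_t total = std::accumulate(operand_sizes.begin(),
                                             operand_sizes.end(), size_t{0});
        // total / n >= 256 is written as total >= 256 * n to avoid rounding.
        return total >= kMinTotalOperandBytes &&
               total >= kMinAvgOperandBytes * operand_sizes.size();
      }

      // Stand-in for the bundle of pinned resources; its destructor plays the
      // role of the cleanup (e.g. unreferencing the SuperVersion).
      struct ToyPinnedState {
        explicit ToyPinnedState(int* release_count)
            : release_count_(release_count) {}
        ~ToyPinnedState() { ++*release_count_; }
        int* release_count_;
      };

      // Each slice shares ownership of the bundle; the last Reset() releases it.
      struct ToyPinnableSlice {
        std::shared_ptr<ToyPinnedState> pinned;
        void Reset() { pinned.reset(); }
      };
      ```

      With these thresholds, 16384 operands of 8 bytes each stay on the copying path (the average is too small), while 4096 operands of 2KB each qualify for pinning.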
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10458
      
      Test Plan:
      - new DB level test
      - measured benefit/regression in a number of memtable scenarios
      
      Setup command:
      ```
      $ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -num=$num -writes=16384 -key_size=16 -value_size=$value_sz -compression_type=none -write_buffer_size=1048576000
      ```
      
      Benchmark command:
      ```
      ./db_bench -threads=$threads -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=1048576000 -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -num=$num -duration=10
      ```
      
      Worst regression is when a key has many tiny operands:
      
      - Parameters: num=1 (implying 16384 operands per key), value_sz=8, threads=1
      - `GetMergeOperands()` latency increases 682 micros -> 800 micros (+17%)
      
      The regression disappears into the noise (<1% difference) if we remove the `Reset()` loop and the size counting loop. The former is arguably needed regardless of this PR as the convention in `Get()` and `MultiGet()` is to `Reset()` the input `PinnableSlice`s at the start. The latter could be optimized to count the size as we accumulate operands rather than after the fact.
      
      Best improvement is when a key has large operands and high concurrency:
      
      - Parameters: num=4 (implying 4096 operands per key), value_sz=2KB, threads=32
      - `GetMergeOperands()` latency decreases 11492 micros -> 437 micros (-96%).
      
      Reviewed By: cbi42
      
      Differential Revision: D38336578
      
      Pulled By: ajkr
      
      fbshipit-source-id: 48146d127e04cb7f2d4d2939a2b9dff3aba18258
      504fe4de
    • Q
      Fix the error path of PLUGIN_ROOT (#10446) · d23752f6
      Qiaolin Yu committed
      Summary:
      When we try to use RocksDB with plugins as a third-party library for other databases, the plugin folder cannot be compiled correctly because of the wrong PLUGIN_ROOT variable. So we fix this error to ensure that it works perfectly when the directory of RocksDB is not the root directory.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10446
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D38371321
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0801b7b7dfa87751c8332fb52aac569dcdd72b5d
      Co-authored-by: NSuperMT <supertempler@gmail.com>
      d23752f6
  6. 03 Aug, 2022 5 commits
    • V
      increase buffer size in PosixFileSystem::GetAbsolutePath to PATH_MAX (#10413) · 8d664ccb
      Vladimir Kikhtenko committed
      Summary:
      RocksDB fails to open a database with a relative path when the length of cwd
      is longer than 256 bytes. This happens due to ERANGE in the getcwd call.
      Here we simply increase the buffer size to the most common PATH_MAX value.
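
      A minimal POSIX sketch of the idea, assuming a platform where `<climits>` exposes `PATH_MAX` (as on Linux): `getcwd` fails with ERANGE when the buffer is too small, so the buffer is sized with `PATH_MAX` rather than a fixed 256 bytes. `ToAbsolutePath` is a hypothetical helper, not the signature of `PosixFileSystem::GetAbsolutePath`.

      ```cpp
      #include <cassert>
      #include <climits>   // PATH_MAX (assumed to be defined, as on Linux)
      #include <string>
      #include <unistd.h>  // getcwd

      // Returns db_path unchanged if it is already absolute; otherwise prefixes
      // it with the current working directory. Returns an empty string if
      // getcwd fails (for example with ERANGE when the buffer is too small).
      std::string ToAbsolutePath(const std::string& db_path) {
        if (!db_path.empty() && db_path[0] == '/') return db_path;
        char cwd[PATH_MAX];
        if (getcwd(cwd, sizeof(cwd)) == nullptr) return std::string();
        return std::string(cwd) + "/" + db_path;
      }
      ```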
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10413
      
      Reviewed By: riversand963
      
      Differential Revision: D38189254
      
      Pulled By: ajkr
      
      fbshipit-source-id: 8a0d3a78bbe87645499fbf29fb12bd3d04cd4657
      8d664ccb
    • B
      Split cache to minimize internal fragmentation (#10287) · 87b82f28
      Bo Wang committed
      Summary:
      ### **Summary:**
      To minimize the internal fragmentation caused by the variable size of the compressed blocks, the original block is split according to the jemalloc bin size in `Insert()` and then merged back in `Lookup()`.  Based on the analysis of the results of the following tests, from the overall internal fragmentation perspective, this PR does mitigate the internal fragmentation issue.
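
      The split-and-merge round trip can be sketched as follows. This is not the PR's implementation: the bin sizes are a hypothetical subset chosen for illustration, the greedy choice of the largest fitting bin is an assumption, and the function names are made up.

      ```cpp
      #include <array>
      #include <cassert>
      #include <cstddef>
      #include <string>
      #include <vector>

      // Hypothetical, ascending subset of allocator bin sizes in bytes.
      constexpr std::array<size_t, 6> kAssumedBinSizes = {128,  256,  512,
                                                          1024, 2048, 4096};

      // Insert() side: cut the block into chunks whose sizes match a bin
      // exactly, so each chunk's allocation wastes no space inside its bin.
      std::vector<std::string> SplitIntoBinSizedChunks(const std::string& block) {
        std::vector<std::string> chunks;
        size_t pos = 0;
        while (pos < block.size()) {
          const size_t remaining = block.size() - pos;
          size_t len = remaining;  // tail smaller than the smallest bin
          for (auto it = kAssumedBinSizes.rbegin(); it != kAssumedBinSizes.rend();
               ++it) {
            if (*it <= remaining) {
              len = *it;  // largest bin that still fits
              break;
            }
          }
          chunks.push_back(block.substr(pos, len));
          pos += len;
        }
        return chunks;
      }

      // Lookup() side: concatenate the chunks back into the original block.
      std::string MergeChunks(const std::vector<std::string>& chunks) {
        std::string block;
        for (const auto& chunk : chunks) block += chunk;
        return block;
      }
      ```

      Under these assumed bins, a 5000-byte block is stored as chunks of 4096, 512, 256, 128 and 8 bytes, and merging them restores the original bytes.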
      
      _Do more myshadow tests with the latest commit. I finished several myshadow AB Testing and the results are promising. For the config of 4GB primary cache and 3GB secondary cache, Jemalloc resident stats shows consistently ~0.15GB memory saving; the allocated and active stats show similar memory savings. The CPU usage is almost the same before and after this PR._
      
      To evaluate the issue of memory fragmentations and the benefits of this PR, I conducted two sets of local tests as follows.
      
      **T1**
      Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
      Values:     100 bytes each (50 bytes after compression)
      Entries:    90000000
      RawSize:    9956.4 MB (estimated)
      FileSize:   5664.8 MB (estimated)
      
      | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
      | - | - | - |
      | T1_3 | 4000 | 4000 |
      | T1_4 | 2000 | 3000 |
      
      Populate the DB:
      ./db_bench --benchmarks=fillrandom --num=90000000 -db=/mem_fragmentation/db_bench_1
      Overwrite it to a stable state:
      ./db_bench --benchmarks=overwrite --num=90000000 -use_existing_db -db=/mem_fragmentation/db_bench_1
      
      Run read tests with different cache settings:
      T1_3:
      MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_3_20220710 -duration=1800 &
      
      T1_4:
      MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_4_20220710 -duration=1800 &
      
      For T1_3 and T1_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.
      
      | Test Name | T1_3 | T1_3 after mem defrag | T1_4 | T1_4 after mem defrag |
      | - | - | - | - | - |
      | allocated (MB)  | 8728 | 8076 | 5518 | 5043 |
      | available (MB)  | 8753 | 8092 | 5536 | 5051 |
      | external fragmentation rate  | 0.003 | 0.002 | 0.003 | 0.0016 |
      | resident (MB)  | 8956 | 8365 | 5655 | 5235 |
      
      **T2**
      Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
      Values:     256 bytes each (128 bytes after compression)
      Entries:    40000000
      RawSize:    10986.3 MB (estimated)
      FileSize:   6103.5 MB (estimated)
      
      | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
      | - | - | - |
      | T2_3 | 4000 | 4000 |
      | T2_4 | 2000 | 3000 |
      
      Create DB (10GB):
      ./db_bench -benchmarks=fillrandom -use_direct_reads=true -num=40000000 -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
      Overwrite it to a stable state:
      ./db_bench --benchmarks=overwrite --num=40000000 -use_existing_db -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
      
      Run read tests with different cache settings:
      T2_3:
      MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_3 -duration=1800  &
      
      T2_4:
      MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_4 -duration=1800  &
      
      For T2_3 and T2_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats.
      
      | Test Name |  T2_3 | T2_3 after mem defrag | T2_4 | T2_4 after mem defrag |
      | -  | - | - | - | - |
      | allocated (MB)  | 8425 | 8093 | 5426 | 5149 |
      | available (MB)  | 8489 | 8138 | 5435 | 5158 |
      | external fragmentation rate  | 0.008 | 0.0055 | 0.0017 | 0.0017 |
      | resident (MB)  | 8676 | 8392 | 5541 | 5321 |
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10287
      
      Test Plan: Unit tests.
      
      Reviewed By: anand1976
      
      Differential Revision: D37743362
      
      Pulled By: gitbw95
      
      fbshipit-source-id: 0010c5af08addeacc5ebbc4ffe5be882fb1d38ad
      87b82f28
    • M
      Fix race in ExitAsBatchGroupLeader with pipelined writes (#9944) · bef3127b
      mpoeter committed
      Summary:
      Resolves https://github.com/facebook/rocksdb/issues/9692
      
      This PR adds a unit test that reproduces the race described in https://github.com/facebook/rocksdb/issues/9692 and a corresponding fix.
      
      The unit test does not have any assertions, because I could not find a reliable and safe way to assert that the writers list does not form a cycle. So with the old (buggy) code, the test would simply hang, while with the fix the test passes successfully.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9944
      
      Reviewed By: pdillinger
      
      Differential Revision: D36134604
      
      Pulled By: riversand963
      
      fbshipit-source-id: ef636c5a79ddbef18658ab2f19ca9210a427324a
      bef3127b
    • P
      Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460) · 27f3af59
      Peter Dillinger committed
      Summary:
      TL;DR: due to a recent change, if you drop a column family,
      often that DB will no longer fsync after writing new SST files
      to remaining or new column families, which could lead to data
      loss on power loss.
      
      More bug detail:
      The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
      DB::Close time rather than waiting for DB object destruction.
      Unfortunately, it also closes shared FSDirectory objects on
      DropColumnFamily (& destroy remaining handles), which can lead
      to use-after-Close on FSDirectory shared with remaining column
      families. Those "uses" are only Fsyncs (or redundant Closes). In
      the default Posix filesystem, an Fsync on a closed FSDirectory is a
      quiet no-op. Consequently (under most configurations), if you drop
      a column family, that DB will no longer fsync after writing new SST
      files to column families sharing the same directory (true under most
      configurations).
      
      More fix detail:
      Basically, this removes unnecessary Close ops on destroying
      ColumnFamilyData. We let `shared_ptr` take care of calling the
      destructor at the right time. If the intent was to require Close be
      called before destroying FSDirectory, that was not made clear by the
      author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
      could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
      not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
      timely destruction of FSDirectory to suffice as Close (in
      CountedFileSystem). Added a TODO to revisit that.
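
      The failure mode and the shape of the fix can be reproduced with a toy model. All names below are hypothetical; the only behavior taken from the text above is that an Fsync on a closed directory quietly does nothing and that the directory is shared between column families through a `shared_ptr`.

      ```cpp
      #include <cassert>
      #include <memory>

      class ToyDirectory {
       public:
        // As described above, an Fsync on a closed directory is a quiet no-op.
        // Returns true only if a sync was actually performed.
        bool Fsync() {
          if (closed_) return false;
          ++sync_count_;
          return true;
        }
        void Close() { closed_ = true; }
        int sync_count() const { return sync_count_; }

       private:
        bool closed_ = false;
        int sync_count_ = 0;
      };

      // Column families that live in the same directory share one object.
      struct ToyColumnFamily {
        std::shared_ptr<ToyDirectory> dir;
      };

      // Buggy shape: dropping one column family closes the shared directory.
      void DropWithExplicitClose(ToyColumnFamily& cf) {
        cf.dir->Close();
        cf.dir.reset();
      }

      // Fixed shape: only give up this reference; the directory goes away when
      // the last owner releases it.
      void DropRelyingOnSharedPtr(ToyColumnFamily& cf) { cf.dir.reset(); }
      ```

      After `DropWithExplicitClose` on one column family, `Fsync` through the surviving one silently stops syncing, which is the data-loss risk described in the TL;DR.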
      
      Also in this PR:
      * Added a TODO to share FSDirectory instances between DB and its column
      families. (Already shared among column families.)
      * Made DB::Close attempt to close all its open FSDirectory objects even
      if there is a failure in closing one. Also code clean-up around this
      logic.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460
      
      Test Plan:
      add an assert to check for use-after-Close. With that
      existing tests can detect the misuse. With fix, tests pass (except noted
      relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049)
      
      Reviewed By: ajkr
      
      Differential Revision: D38357922
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137
      27f3af59
    • P
      regression_test.sh: kill very old db_bench (and more) (#10441) · 9da97a37
      Peter Dillinger committed
      Summary:
      If a db_bench process gets hung or runaway on a machine, that
      could prevent regression_test.sh from ever making progress. To fix that,
      regression_test.sh will now kill any db_bench process that is >12 hours
      old. Also made this more reliable by not using string matching (grep) to
      get db_bench process IDs.
      
      I also had to make some other updates to get local runs working
      reliably:
      * Fix some quoting hell and other dubious complexity with db_bench_cmd
      * Only save a DB for re-use when building it passes
      * Report failed command in more cases
      * Add safeguards against "rm -rf ."
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10441
      
      Test Plan:
      manual (local and remote), with temporary changes e.g. to have
      a manageable age threshold etc.
      
      Reviewed By: riversand963
      
      Differential Revision: D38285537
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 4d598876aedc38ac4bd9d8ddf32c5995d8e44db8
      9da97a37
  7. 02 Aug, 2022 4 commits
    • L
      Do not put blobs read during compaction into cache (#10457) · cc8ded61
      Levi Tamasi committed
      Summary:
      During compaction, blobs are currently read using the default
      `ReadOptions`, which has the `fill_cache` flag set to true. Earlier,
      this didn't make any difference since we didn't have a blob cache;
      however, now we have to explicitly set this flag to false to avoid
      polluting the cache during compaction.
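
      The effect of the flag can be shown with a toy read path. `ToyReadOptions` and `ToyBlobSource` are hypothetical stand-ins; only the `fill_cache` field name and its default of true mirror the description above.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <map>
      #include <string>

      struct ToyReadOptions {
        bool fill_cache = true;  // same default as described above
      };

      class ToyBlobSource {
       public:
        void Store(const std::string& key, const std::string& blob) {
          storage_[key] = blob;
        }

        // Serves from the cache when possible; on a miss, reads from storage
        // and inserts into the cache only if the caller asked for it.
        std::string Get(const ToyReadOptions& options, const std::string& key) {
          auto cached = cache_.find(key);
          if (cached != cache_.end()) return cached->second;
          auto stored = storage_.find(key);
          if (stored == storage_.end()) return std::string();
          if (options.fill_cache) cache_[key] = stored->second;
          return stored->second;
        }

        size_t cached_entries() const { return cache_.size(); }

       private:
        std::map<std::string, std::string> storage_;
        std::map<std::string, std::string> cache_;
      };
      ```

      A compaction-style read with `fill_cache = false` returns the blob but leaves the cache untouched, so it cannot evict entries that user reads depend on.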
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D38333528
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2
      cc8ded61
    • Y
      Remove unused fields from FileMetaData (temporarily) (#10443) · fbfcf5cb
      Yanqin Jin committed
      Summary:
      FileMetaData::[min|max]_timestamp are not currently being used or
      tracked by RocksDB, even when user-defined timestamp is enabled. Each of
      them is a std::string which can occupy 32 bytes. Remove them for now.
      They may be added back when we have a pressing need for them. When we do
      add them back, consider storing them in a more compact way, e.g. one
      boolean flag and a byte array of size 16.
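
      One possible shape for the compact representation suggested above is sketched here. It is only an assumption about how such a field could look: `CompactTimestamp` is a hypothetical type, and it presumes fixed-width timestamps of at most 16 bytes, zero-padded, since the array does not record a length.

      ```cpp
      #include <array>
      #include <cassert>
      #include <cstddef>
      #include <cstring>
      #include <string>

      constexpr size_t kMaxTimestampBytes = 16;

      // One boolean flag plus a 16-byte array, as suggested above.
      struct CompactTimestamp {
        bool present = false;
        std::array<char, kMaxTimestampBytes> bytes{};

        // Returns false (and stores nothing) if the timestamp does not fit.
        bool Set(const std::string& ts) {
          if (ts.size() > kMaxTimestampBytes) return false;
          bytes.fill('\0');
          std::memcpy(bytes.data(), ts.data(), ts.size());
          present = true;
          return true;
        }
      };
      ```

      The struct holds its bytes inline and never allocates, whereas each `std::string` member carries its own pointer, size and capacity bookkeeping.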
      
      Per file min/max timestamp bounds are available as table properties.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10443
      
      Test Plan: make check
      
      Reviewed By: pdillinger
      
      Differential Revision: D38292275
      
      Pulled By: riversand963
      
      fbshipit-source-id: 841dc4e855ad8f8481c80cb020603de9607c9c94
      fbfcf5cb
    • S
      Use EnvLogger instead of PosixLogger (#10436) · cc209980
      sdong committed
      Summary:
      EnvLogger, which supports multiple Envs, was built to replace PosixLogger. Make FileSystem use EnvLogger by default, remove the Posix FS specific implementation and remove the PosixLogger code.
      Some hacky changes are made to make sure iostats are not polluted by logging, in order to pass existing unit tests.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10436
      
      Test Plan: Run db_bench and watch info log files.
      
      Reviewed By: anand1976
      
      Differential Revision: D38259855
      
      fbshipit-source-id: 67d65874bfba7a33535b6d0dd0ed92cbbc9888b8
      cc209980
    • G
      Add CompressedSecondaryCache into stress test (#10442) · e1b176d2
      gitbw95 committed
      Summary:
      The secondary cache is randomly disabled or enabled with CompressedSecondaryCache.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10442
      
      Test Plan: - To test that the CompressedSecondaryCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=--duration=960 blackbox_crash_test`
      
      Reviewed By: anand1976
      
      Differential Revision: D38290796
      
      Pulled By: gitbw95
      
      fbshipit-source-id: bb7027b39e0ed9c0c62835abe09e759898130ec8
      e1b176d2
  8. 01 Aug, 2022 1 commit
    • A
      Provide support for subcompactions with user-defined timestamps (#10344) · 56463d44
      Akanksha Mahajan committed
      Summary:
      The subcompaction logic currently picks file boundaries as subcompaction boundaries. This is not compatible with user-defined timestamps because of two issues.
      Issue1: ReadOptions.iterate_lower_bound and ReadOptions.iterate_upper_bound contain timestamps, whereas BlockBasedTableIterator expects bounds without timestamps. Because of the resulting wrong comparison, the end key is returned as user_key, which triggers an assertion failure.
      Issue2: Two keys that differ only by user timestamp might get processed by two different subcompactions (and thus two different CompactionIterator state machines), which in turn can cause data correctness issues.
      
      This PR provides support to re-enable subcompactions with user-defined timestamps.
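
      Why timestamp-free boundaries help can be sketched in a few lines. This is not the PR's code: it assumes keys carry a fixed-width 8-byte timestamp suffix, and `StripTimestamp` and `SubcompactionIndex` are hypothetical helpers. Once boundaries and lookups both use the bare user key, all versions of a key fall into the same subcompaction.

      ```cpp
      #include <algorithm>
      #include <cassert>
      #include <cstddef>
      #include <string>
      #include <vector>

      constexpr size_t kTimestampSize = 8;  // assumed fixed-width suffix

      // Returns the user key with the trailing timestamp removed.
      std::string StripTimestamp(const std::string& key_with_ts) {
        assert(key_with_ts.size() >= kTimestampSize);
        return key_with_ts.substr(0, key_with_ts.size() - kTimestampSize);
      }

      // Maps a key to a subcompaction using sorted, timestamp-free boundaries.
      // Subcompaction i covers user keys in [boundary[i-1], boundary[i]).
      size_t SubcompactionIndex(const std::string& key_with_ts,
                                const std::vector<std::string>& sorted_boundaries) {
        const std::string user_key = StripTimestamp(key_with_ts);
        return static_cast<size_t>(
            std::upper_bound(sorted_boundaries.begin(), sorted_boundaries.end(),
                             user_key) -
            sorted_boundaries.begin());
      }
      ```

      Because the timestamp never takes part in the boundary comparison, two versions of the same user key always land in the same subcompaction.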
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10344
      
      Test Plan:
      Added new unit test
      - Without fix for Issue1 unit test MultipleSubCompactions fails with error:
      ```
      db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
      Received signal 6 (Aborted)
      #0   /usr/local/fbcode/platform009/lib/libc.so.6(gsignal+0x100) [0x7f8fbbbfe530] db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
      Aborted (core dumped)
      ```
      Ran stress test
      `make crash_test_with_ts -j32`
      
      Reviewed By: riversand963
      
      Differential Revision: D38220841
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 5d5cae2bd37fcaeba1e77fce0a69070ad4158ccb
      56463d44
  9. 30 Jul, 2022 4 commits
  10. 29 Jul, 2022 2 commits
  11. 28 Jul, 2022 5 commits
  12. 27 Jul, 2022 2 commits
    • J
      ldb to display public unique id and dump work with key range (#10417) · 6a0010eb
      Jay Zhuang committed
      Summary:
      Two ldb command improvements:
      1. `ldb manifest_dump --verbose` now displays both the internal unique id and the public id, which is useful for manually checking sst_unique_id between the manifest and an SST file;
      2. `ldb dump` has `--from`/`--to` options, but they were not working. This adds support for them.
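
      The key-range restriction amounts to a bounded scan over sorted keys. The sketch below uses a `std::map` as a stand-in for an SST file and assumes that `from` is inclusive and `to` is exclusive; `KeysInRange` is a hypothetical helper, not part of ldb.

      ```cpp
      #include <cassert>
      #include <map>
      #include <string>
      #include <vector>

      // Returns the keys k of a sorted table with from <= k < to
      // (assumed semantics: inclusive lower bound, exclusive upper bound).
      std::vector<std::string> KeysInRange(
          const std::map<std::string, std::string>& table, const std::string& from,
          const std::string& to) {
        std::vector<std::string> keys;
        for (auto it = table.lower_bound(from);
             it != table.end() && it->first < to; ++it) {
          keys.push_back(it->first);
        }
        return keys;
      }
      ```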
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10417
      
      Test Plan:
      run the command locally
      ```
      $ ldb manifest_dump --path=MANIFEST-000026 --verbose
      ...
      AddFile: 0 18 1023 'bar' seq:6, type:1 .. 'foo' seq:5, type:1 oldest_ancester_time:1658787615 file_creation_time:1658787615 file_checksum: file_checksum_func_name: Unknown unique_id(internal): {8800772265202404198,16149248642318466463} public_unique_id: F3E0A029B631D7D4-6E402DE08E771780
      ```
      ```
      $ ldb dump --path=000036.sst --from=key000006 --to=key000009
      Sst file format: block-based
      'key000006' seq:2411, type:1 => value6
      'key000007' seq:2412, type:1 => value7
      'key000008' seq:2413, type:1 => value8
      ...
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D38136140
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 8be6eeaa07ff9f089e33011ebe90fd0b69d33bf3
      6a0010eb
    • Z
      Allow sufficient subcompactions under round-robin compaction priority (#10422) · c945a9a6
      Zichen Zhu committed
      Summary:
      Allow sufficient subcompactions to be used when the number of input files is less than `max_subcompactions` under round-robin compaction priority.
      
      Test Case:
      Add `RoundRobinWithoutAdditionalResources` into `db_compaction_test`
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10422
      
      Reviewed By: ajkr
      
      Differential Revision: D38186545
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: b8e5098306f1e5b9561dfafafc8300a38f7fe88e
      c945a9a6