Commit fdb6be4e, authored by Igor Canadi

Rewritten system for scheduling background work

Summary:
When scaling to a higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which looped over all column families while holding a mutex. This patch addresses the issue.

The approach is similar to our earlier efforts: instead of a pull model, where we do something for every column family, we use a push model. When we detect that a column family is ready to be flushed or compacted, we add it to flush_queue_ or compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction().
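The push model described above can be sketched as follows. This is a minimal illustration, not the code from this patch: `ColumnFamily`, `Scheduler`, `needs_flush`, and `QueueSize()` are hypothetical stand-ins, and only the flush side is shown. The `pending_flush` flag mirrors the invariant that a column family is in the queue if and only if its flag is set, which keeps it from being enqueued twice.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for ColumnFamilyData; only the flags matter here.
struct ColumnFamily {
  int id = 0;
  bool needs_flush = false;    // stand-in for cfd->imm()->IsFlushPending()
  bool pending_flush = false;  // true <=> currently present in the flush queue
};

// Hypothetical scheduler that keeps only the flush-side bookkeeping.
class Scheduler {
 public:
  // Push model: called when a single column family changes state.
  // The cost is O(1), independent of the total number of column families.
  void SchedulePendingFlush(ColumnFamily* cf) {
    if (!cf->pending_flush && cf->needs_flush) {
      cf->pending_flush = true;
      flush_queue_.push_back(cf);
    }
  }

  // Consumer side: returns nullptr when there is nothing to flush.
  ColumnFamily* PopFirstFromFlushQueue() {
    if (flush_queue_.empty()) {
      return nullptr;
    }
    ColumnFamily* cf = flush_queue_.front();
    flush_queue_.pop_front();
    assert(cf->pending_flush);
    cf->pending_flush = false;
    return cf;
  }

  std::size_t QueueSize() const { return flush_queue_.size(); }

 private:
  std::deque<ColumnFamily*> flush_queue_;
};
```

In this sketch, calling `SchedulePendingFlush()` twice on the same column family enqueues it once, and a column family that does not need a flush is never enqueued, so the consumer never has to scan all column families.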

Here are the performance results:

Command:

    ./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000  --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333

Before the patch:

     fillrandom   :      26.950 micros/op 37105 ops/sec;    4.1 MB/s

After the patch:

      fillrandom   :      17.404 micros/op 57456 ops/sec;    6.4 MB/s

The next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. A fix is coming in the next patch, but when I removed that code, here's what I got:

      fillrandom   :       7.590 micros/op 131758 ops/sec;   14.6 MB/s
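For reference, the relative gains implied by the three throughput figures above can be computed directly. The helper below is a hypothetical illustration (`Speedup` is not part of db_bench); it only divides the ops/sec values reported in this message.

```cpp
// Ratio of two throughput figures given in ops/sec.
constexpr double Speedup(double before_ops_per_sec, double after_ops_per_sec) {
  return after_ops_per_sec / before_ops_per_sec;
}

// Figures copied from the db_bench output above.
constexpr double kBeforePatch = 37105.0;
constexpr double kAfterPatch = 57456.0;
constexpr double kWithoutAddLiveFiles = 131758.0;

// Roughly 1.55x from this patch alone.
constexpr double kPatchSpeedup = Speedup(kBeforePatch, kAfterPatch);
// Roughly 3.55x once the AddLiveFiles code is also removed.
constexpr double kProjectedSpeedup = Speedup(kBeforePatch, kWithoutAddLiveFiles);
```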

Test Plan:
make check

Two stress tests:

Big number of compactions and flushes:

    ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000

max_background_flushes=0, to verify that this case also works correctly

    ./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000

Reviewers: ljin, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30123
Parent a3001b1d
......@@ -223,14 +223,11 @@ void SuperVersionUnrefHandle(void* ptr) {
}
} // anonymous namespace
ColumnFamilyData::ColumnFamilyData(uint32_t id, const std::string& name,
Version* _dummy_versions,
Cache* _table_cache,
WriteBuffer* write_buffer,
const ColumnFamilyOptions& cf_options,
const DBOptions* db_options,
const EnvOptions& env_options,
ColumnFamilySet* column_family_set)
ColumnFamilyData::ColumnFamilyData(
uint32_t id, const std::string& name, Version* _dummy_versions,
Cache* _table_cache, WriteBuffer* write_buffer,
const ColumnFamilyOptions& cf_options, const DBOptions* db_options,
const EnvOptions& env_options, ColumnFamilySet* column_family_set)
: id_(id),
name_(name),
dummy_versions_(_dummy_versions),
......@@ -250,7 +247,9 @@ ColumnFamilyData::ColumnFamilyData(uint32_t id, const std::string& name,
next_(nullptr),
prev_(nullptr),
log_number_(0),
column_family_set_(column_family_set) {
column_family_set_(column_family_set),
pending_flush_(false),
pending_compaction_(false) {
Ref();
// if _dummy_versions is nullptr, then this is a dummy column family.
......@@ -285,10 +284,14 @@ ColumnFamilyData::ColumnFamilyData(uint32_t id, const std::string& name,
new LevelCompactionPicker(ioptions_, &internal_comparator_));
}
Log(InfoLogLevel::INFO_LEVEL,
ioptions_.info_log, "Options for column family \"%s\":\n",
name.c_str());
options_.Dump(ioptions_.info_log);
if (column_family_set_->NumberOfColumnFamilies() < 10) {
Log(InfoLogLevel::INFO_LEVEL, ioptions_.info_log,
"--------------- Options for column family [%s]:\n", name.c_str());
options_.Dump(ioptions_.info_log);
} else {
Log(InfoLogLevel::INFO_LEVEL, ioptions_.info_log,
"\t(skipping printing options)\n");
}
}
RecalculateWriteStallConditions(mutable_cf_options_);
......@@ -313,6 +316,11 @@ ColumnFamilyData::~ColumnFamilyData() {
current_->Unref();
}
// It would be wrong if this ColumnFamilyData is in flush_queue_ or
// compaction_queue_ and we destroyed it
assert(!pending_flush_);
assert(!pending_compaction_);
if (super_version_ != nullptr) {
// Release SuperVersion reference kept in ThreadLocalPtr.
// This must be done outside of mutex_ since unref handler can lock mutex.
......@@ -434,6 +442,10 @@ void ColumnFamilyData::CreateNewMemtable(
mem_->Ref();
}
bool ColumnFamilyData::NeedsCompaction() const {
return compaction_picker_->NeedsCompaction(current_->storage_info());
}
Compaction* ColumnFamilyData::PickCompaction(
const MutableCFOptions& mutable_options, LogBuffer* log_buffer) {
auto* result = compaction_picker_->PickCompaction(
......
......@@ -210,8 +210,11 @@ class ColumnFamilyData {
// See documentation in compaction_picker.h
// REQUIRES: DB mutex held
bool NeedsCompaction() const;
// REQUIRES: DB mutex held
Compaction* PickCompaction(const MutableCFOptions& mutable_options,
LogBuffer* log_buffer);
// REQUIRES: DB mutex held
Compaction* CompactRange(
const MutableCFOptions& mutable_cf_options,
int input_level, int output_level, uint32_t output_path_id,
......@@ -248,6 +251,7 @@ class ColumnFamilyData {
// if its reference count is zero and needs deletion or nullptr if not
// As argument takes a pointer to allocated SuperVersion to enable
// the clients to allocate SuperVersion outside of mutex.
// IMPORTANT: Only call this from DBImpl::InstallSuperVersion()
SuperVersion* InstallSuperVersion(SuperVersion* new_superversion,
port::Mutex* db_mutex,
const MutableCFOptions& mutable_cf_options);
......@@ -261,6 +265,12 @@ class ColumnFamilyData {
bool triggered_flush_slowdown,
bool triggered_flush_stop);
// Protected by DB mutex
void set_pending_flush(bool value) { pending_flush_ = value; }
void set_pending_compaction(bool value) { pending_compaction_ = value; }
bool pending_flush() { return pending_flush_; }
bool pending_compaction() { return pending_compaction_; }
private:
friend class ColumnFamilySet;
ColumnFamilyData(uint32_t id, const std::string& name,
......@@ -328,6 +338,13 @@ class ColumnFamilyData {
ColumnFamilySet* column_family_set_;
std::unique_ptr<WriteControllerToken> write_controller_token_;
// If true --> this ColumnFamily is currently present in DBImpl::flush_queue_
bool pending_flush_;
// If true --> this ColumnFamily is currently present in
// DBImpl::compaction_queue_
bool pending_compaction_;
};
// ColumnFamilySet has interesting thread-safety requirements
......
......@@ -695,15 +695,6 @@ Compaction* LevelCompactionPicker::PickCompaction(
Compaction* c = nullptr;
int level = -1;
// Compute the compactions needed. It is better to do it here
// and also in LogAndApply(), otherwise the values could be stale.
std::vector<uint64_t> size_being_compacted(NumberLevels() - 1);
SizeBeingCompacted(size_being_compacted);
CompactionOptionsFIFO dummy_compaction_options_fifo;
vstorage->ComputeCompactionScore(
mutable_cf_options, dummy_compaction_options_fifo, size_being_compacted);
// We prefer compactions triggered by too much data in a level over
// the compactions triggered by seeks.
//
......@@ -766,6 +757,21 @@ Compaction* LevelCompactionPicker::PickCompaction(
compactions_in_progress_[level].insert(c);
c->mutable_cf_options_ = mutable_cf_options;
// Creating a compaction influences the compaction score because the score
// takes running compactions into account (by skipping files that are already
// being compacted). Since we just changed compaction score, we recalculate it
// here
{ // this piece of code recomputes compaction score
std::vector<uint64_t> size_being_compacted(NumberLevels() - 1);
SizeBeingCompacted(size_being_compacted);
CompactionOptionsFIFO dummy_compaction_options_fifo;
vstorage->ComputeCompactionScore(mutable_cf_options,
dummy_compaction_options_fifo,
size_being_compacted);
}
return c;
}
......
This diff is collapsed.
......@@ -362,6 +362,8 @@ class DBImpl : public DB {
ColumnFamilyData* GetColumnFamilyDataByName(const std::string& cf_name);
void MaybeScheduleFlushOrCompaction();
void SchedulePendingFlush(ColumnFamilyData* cfd);
void SchedulePendingCompaction(ColumnFamilyData* cfd);
static void BGWorkCompaction(void* db);
static void BGWorkFlush(void* db);
void BackgroundCallCompaction();
......@@ -393,6 +395,12 @@ class DBImpl : public DB {
// hold the data set.
Status ReFitLevel(ColumnFamilyData* cfd, int level, int target_level = -1);
// helper functions for adding and removing from flush & compaction queues
void AddToCompactionQueue(ColumnFamilyData* cfd);
ColumnFamilyData* PopFirstFromCompactionQueue();
void AddToFlushQueue(ColumnFamilyData* cfd);
ColumnFamilyData* PopFirstFromFlushQueue();
// table_cache_ provides its own synchronization
std::shared_ptr<Cache> table_cache_;
......@@ -460,9 +468,32 @@ class DBImpl : public DB {
// State is protected with db mutex.
std::list<uint64_t> pending_outputs_;
// At least one compaction or flush job is pending but not yet scheduled
// because of the max background thread limit.
bool bg_schedule_needed_;
// flush_queue_ and compaction_queue_ hold column families that we need to
// flush and compact, respectively.
// A column family is inserted into flush_queue_ when it satisfies condition
// cfd->imm()->IsFlushPending()
// A column family is inserted into compaction_queue_ when it satisfies
// condition cfd->NeedsCompaction()
// Column families in this list are all Ref()-erenced
// TODO(icanadi) Provide some kind of ReferencedColumnFamily class that will
// do RAII on ColumnFamilyData
// Column families are in this queue when they need to be flushed or
// compacted. Consumers of these queues are flush and compaction threads. When
// a column family is put on this queue, we increase unscheduled_flushes_ and
// unscheduled_compactions_. When these variables are bigger than zero, that
// means we need to schedule background threads for flush and compaction.
// Once the background threads are scheduled, we decrease unscheduled_flushes_
// and unscheduled_compactions_. That way we keep track of number of
// compaction and flush threads we need to schedule. This scheduling is done
// in MaybeScheduleFlushOrCompaction()
// invariant(column family present in flush_queue_ <==>
// ColumnFamilyData::pending_flush_ == true)
std::deque<ColumnFamilyData*> flush_queue_;
// invariant(column family present in compaction_queue_ <==>
// ColumnFamilyData::pending_compaction_ == true)
std::deque<ColumnFamilyData*> compaction_queue_;
int unscheduled_flushes_;
int unscheduled_compactions_;
// count how many background compactions are running or have been scheduled
int bg_compaction_scheduled_;
......@@ -553,9 +584,17 @@ class DBImpl : public DB {
ColumnFamilyData* cfd, JobContext* job_context,
const MutableCFOptions& mutable_cf_options);
SuperVersion* InstallSuperVersion(
ColumnFamilyData* cfd, SuperVersion* new_sv,
const MutableCFOptions& mutable_cf_options);
// All ColumnFamily state changes go through this function. Here we analyze
// the new state and we schedule background work if we detect that the new
// state needs flush or compaction.
// If dont_schedule_bg_work == true, then caller asks us to not schedule flush
// or compaction here, but it also promises to schedule needed background
// work. We use this to avoid scheduling background compactions while we are
// in the write thread, which is very performance critical. The caller
// schedules background work as soon as it exits the write thread.
SuperVersion* InstallSuperVersion(ColumnFamilyData* cfd, SuperVersion* new_sv,
const MutableCFOptions& mutable_cf_options,
bool dont_schedule_bg_work = false);
// Find Super version and reference it. Based on options, it might return
// the thread local cached one.
......
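The counter bookkeeping described in the db_impl.h comments above can be sketched as follows. This is an assumption-laden illustration, since the db_impl.cc diff is collapsed on this page: `BackgroundBudget` and `MaybeScheduleCompaction()` are hypothetical names, and the loop is one plausible way to honor the limit set by `max_background_compactions`. Only the compaction side is shown.

```cpp
// Hypothetical holder for the compaction-side counters named in db_impl.h.
struct BackgroundBudget {
  int unscheduled_compactions = 0;     // queued work with no thread assigned yet
  int bg_compaction_scheduled = 0;     // threads running or already scheduled
  int max_background_compactions = 0;  // upper bound on concurrent threads

  // Schedules as many background threads as the limit allows and returns
  // how many were started. The cost depends on the amount of pending work,
  // not on the total number of column families.
  int MaybeScheduleCompaction() {
    int started = 0;
    while (unscheduled_compactions > 0 &&
           bg_compaction_scheduled < max_background_compactions) {
      --unscheduled_compactions;
      ++bg_compaction_scheduled;
      ++started;  // real code would hand the job to a thread pool here
    }
    return started;
  }
};
```

Work that cannot be scheduled because the limit is reached stays counted in `unscheduled_compactions` and is picked up on a later call, once a running thread has finished and decremented `bg_compaction_scheduled`.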