Commit 5a92c415, author: Steven Li

Enhanced crash_gen tool to run clusters, with a new README file

Parent f7a0b6b8
<center><h1>User's Guide to the Crash_Gen Tool</h1></center>
# Introduction
To effectively test and debug our TDengine product, we have developed a simple tool that
exercises various functions of the system in a randomized fashion, without a pre-determined
scenario, in the hope of exposing as many problems as possible.
# Preparation
To run this tool, please ensure the following preparation work is done first.
1. Fetch a copy of the TDengine source code, and build it successfully in the `build/`
directory
1. Ensure that the system has Python 3.8 or above properly installed. We use
Ubuntu 20.04 LTS as our own development environment, and suggest you also use such
an environment if possible.
# Simple Execution
To run the tool with the simplest method, follow the steps below:
1. Open a terminal window, start the `taosd` service in the `build/` directory
(or however you prefer to start the `taosd` service)
1. Open another terminal window, go into the `tests/pytest/` directory, and
run `./crash_gen.sh -p -t 3 -s 10` (change the two parameters here as you wish)
1. Watch the output to the end and see if you get a `SUCCESS` or `FAILURE`
That's it!
# Running Clusters
This tool also makes it easy to test/verify the clustering capabilities of TDengine. You
can start a cluster quite easily with the following command:

    $ cd tests/pytest/
    $ ./crash_gen.sh -e -o 3

The `-e` option above tells the tool to start the service without running any tests, while
the `-o 3` option tells the tool to start 3 DNodes and join them together in a cluster.
You can adjust the number of DNodes as needed.
## Behind the Scenes
When the tool runs a cluster, it uses a number of directories, each holding the information
for a single DNode:

    $ ls build/cluster*
    cfg data log
    cfg data log
    cfg data log

Therefore, when something goes wrong and you want to reset everything in the cluster, simply
erase all the files:

    $ rm -rf build/cluster_dnode_*
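
The same reset can be scripted. Below is a minimal Python sketch, assuming the per-DNode directories match the `cluster_dnode_*` pattern used in the `rm` command; the helper name `reset_cluster_dirs` is hypothetical and not part of crash_gen.

```python
import glob
import os
import shutil


def reset_cluster_dirs(build_dir="build"):
    """Remove every per-DNode directory (cfg, data, log) under build_dir.

    Hypothetical helper that mirrors `rm -rf build/cluster_dnode_*`.
    Returns the sorted list of directories that were removed.
    """
    pattern = os.path.join(build_dir, "cluster_dnode_*")
    removed = []
    for path in sorted(glob.glob(pattern)):
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

Only directories matching the pattern are touched, so the rest of the build tree is left alone.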
## Addresses and Ports
The DNodes in the cluster all bind to the `` IP address (for now anyway). The first DNode
uses port 6030, the second uses port 6130, and so on.
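
The port numbering can be written as a one-line rule. The sketch below assumes that "and so on" means the step of 100 between the first two ports continues for later DNodes; the names `BASE_PORT`, `PORT_STEP`, and `dnode_port` are illustrative only.

```python
BASE_PORT = 6030  # port of the first DNode, as stated in the text
PORT_STEP = 100   # assumed: 6030 -> 6130 implies a step of 100


def dnode_port(index):
    """Return the port for the DNode at zero-based position `index`."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return BASE_PORT + PORT_STEP * index


if __name__ == "__main__":
    for i in range(3):
        print("DNode {} -> port {}".format(i, dnode_port(i)))
```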
## Testing Against a Cluster
In a separate terminal window, you can invoke the tool in client mode and test against
a cluster, such as:

    $ ./crash_gen.sh -p -t 10 -s 100 -i 3

Here the `-i` option tells the tool to always create tables with 3 replicas, and run
all tests against such tables.
# Additional Features
The full list of the tool's features is available through the `-h` option:

    $ ./crash_gen.sh -h
    usage: crash_gen_bootstrap.py [-h] [-a] [-b MAX_DBS] [-c CONNECTOR_TYPE] [-d] [-e] [-g IGNORE_ERRORS] [-i MAX_REPLICAS] [-l] [-n] [-o NUM_DNODES] [-p] [-r]
                                  [-s MAX_STEPS] [-t NUM_THREADS] [-v] [-x]

    TDengine Auto Crash Generator (PLEASE NOTICE the Prerequisites Below)
    1. You build TDengine in the top level ./build directory, as described in offical docs
    2. You run the server there before this script: ./build/bin/taosd -c test/cfg

    optional arguments:
      -h, --help            show this help message and exit
      -a, --auto-start-service
                            Automatically start/stop the TDengine service (default: false)
      -b MAX_DBS, --max-dbs MAX_DBS
                            Maximum number of DBs to keep, set to disable dropping DB. (default: 0)
                            Connector type to use: native, rest, or mixed (default: 10)
      -d, --debug           Turn on DEBUG mode for more logging (default: false)
      -e, --run-tdengine    Run TDengine service in foreground (default: false)
                            Ignore error codes, comma separated, 0x supported (default: None)
      -i MAX_REPLICAS, --max-replicas MAX_REPLICAS
                            Maximum number of replicas to use, when testing against clusters. (default: 1)
      -l, --larger-data     Write larger amount of data during write operations (default: false)
      -n, --dynamic-db-table-names
                            Use non-fixed names for dbs/tables, useful for multi-instance executions (default: false)
      -o NUM_DNODES, --num-dnodes NUM_DNODES
                            Number of Dnodes to initialize, used with -e option. (default: 1)
      -p, --per-thread-db-connection
                            Use a single shared db connection (default: false)
      -r, --record-ops      Use a pair of always-fsynced fils to record operations performing + performed, for power-off tests (default: false)
      -s MAX_STEPS, --max-steps MAX_STEPS
                            Maximum number of steps to run (default: 100)
      -t NUM_THREADS, --num-threads NUM_THREADS
                            Number of threads to run (default: 10)
      -v, --verify-data     Verify data written in a number of places by reading back (default: false)
      -x, --continue-on-exception
                            Continue execution after encountering unexpected/disallowed errors/exceptions (default: false)
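
To show how the options used in this guide map onto parsed values, here is a small `argparse` sketch covering only a subset of the flags listed above. The option names and defaults are copied from the help output; the function name `build_parser` and the exact `add_argument` settings are assumptions, not the tool's actual code.

```python
import argparse


def build_parser():
    """Sketch of a parser for a subset of the options listed by -h."""
    parser = argparse.ArgumentParser(
        description="TDengine Auto Crash Generator (subset sketch)")
    parser.add_argument("-e", "--run-tdengine", action="store_true",
                        help="Run TDengine service in foreground")
    parser.add_argument("-o", "--num-dnodes", type=int, default=1,
                        help="Number of Dnodes to initialize, used with -e")
    parser.add_argument("-i", "--max-replicas", type=int, default=1,
                        help="Maximum number of replicas to use")
    parser.add_argument("-s", "--max-steps", type=int, default=100,
                        help="Maximum number of steps to run")
    parser.add_argument("-t", "--num-threads", type=int, default=10,
                        help="Number of threads to run")
    return parser


# Equivalent of the arguments in `./crash_gen.sh -e -o 3`
config = build_parser().parse_args(["-e", "-o", "3"])
print(config.run_tdengine, config.num_dnodes)
```

The attribute names (`run_tdengine`, `num_dnodes`, `max_replicas`) follow from the long option names, which matches how the Python code in this commit reads them from `gConfig`.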
......@@ -18,6 +18,7 @@ from __future__ import annotations
from typing import Set
from typing import Dict
from typing import List
+ from typing import Optional # Type hinting, ref: https://stackoverflow.com/questions/19202633/python-3-type-hinting-for-none
import textwrap
import time
......@@ -62,9 +63,10 @@ gContainer: Container
class WorkerThread:
- def __init__(self, pool: ThreadPool, tid, tc: ThreadCoordinator,
- # te: TaskExecutor,
- ): # note: main thread context!
+ def __init__(self, pool: ThreadPool, tid, tc: ThreadCoordinator):
+ """Note: this runs in the main thread context"""
# self._curStep = -1
self._pool = pool
self._tid = tid
......@@ -1007,6 +1009,8 @@ class Database:
possibly in a cluster environment.
For now we use it to manage state transitions in that database
+ TODO: consider moving, but keep in mind it contains "StateMachine"
_clsLock = threading.Lock() # class wide lock
_lastInt = 101 # next one is initial integer
......@@ -1182,7 +1186,7 @@ class Task():
def __init__(self, execStats: ExecutionStats, db: Database):
self._workerThread = None
- self._err = None # type: Exception
+ self._err: Optional[Exception] = None
self._aborted = False
self._curStep = None
self._numRows = None # Number of rows affected
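
The change to `self._err` in this hunk swaps a type comment for a variable annotation. A minimal sketch of the idea follows; `TaskSketch` is a hypothetical stand-in for `Task`, and the bodies of `record` and `isSuccess` are assumptions, since only the call to `isSuccess()` appears in the diff.

```python
from typing import Optional


class TaskSketch:
    """Hypothetical stand-in for Task, reduced to the error field."""

    def __init__(self):
        # Optional[...] states that None is a valid value: no error yet.
        self._err: Optional[Exception] = None

    def record(self, err: Exception) -> None:
        """Remember the exception raised while executing the task."""
        self._err = err

    def isSuccess(self) -> bool:
        # Assumed rule: a task succeeded if no error was recorded.
        return self._err is None
```

Unlike `# type: Exception`, the annotation tells a type checker that the initial `None` is intentional.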
......@@ -1318,10 +1322,11 @@ class Task():
self._aborted = True
except BaseException: # TODO: what is this again??!!
- "[=] Unexpected exception, SQL: {}".format(
+ raise RuntimeError("Punt")
+ # self.logDebug(
+ # "[=] Unexpected exception, SQL: {}".format(
+ # wt.getDbConn().getLastSql()))
+ # raise
self._execStats.endTaskType(self.__class__.__name__, self.isSuccess())
self.logDebug("[X] task execution completed, {}, status: {}".format(
......@@ -1498,7 +1503,8 @@ class TaskCreateDb(StateTransitionTask):
# was: self.execWtSql(wt, "create database db")
repStr = ""
if gConfig.max_replicas != 1:
- numReplica = Dice.throw(gConfig.max_replicas) + 1 # 1,2 ... N
+ # numReplica = Dice.throw(gConfig.max_replicas) + 1 # 1,2 ... N
+ numReplica = gConfig.max_replicas # fixed, always
repStr = "replica {}".format(numReplica)
self.execWtSql(wt, "create database {} {}"
.format(self._db.getName(), repStr) )
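
The effect of this hunk on the generated SQL can be seen in isolation. The sketch below is a hypothetical free function (`build_create_db_sql`); the original logic lives inside `TaskCreateDb` and reads the replica count from `gConfig.max_replicas`.

```python
def build_create_db_sql(db_name, max_replicas):
    """Build the CREATE DATABASE statement the way the hunk above does.

    Hypothetical helper for illustration; not part of crash_gen.
    """
    rep_str = ""
    if max_replicas != 1:
        # Fixed replica count, no longer a random value in 1..N
        rep_str = "replica {}".format(max_replicas)
    return "create database {} {}".format(db_name, rep_str)


print(build_create_db_sql("db", 3))  # create database db replica 3
```

With `-i 3`, every database is therefore created with exactly 3 replicas rather than a random count between 1 and 3.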
......@@ -2050,7 +2056,7 @@ class ClientManager:
class MainExec:
def __init__(self):
self._clientMgr = None
- self._svcMgr = None
+ self._svcMgr = None # type: ServiceManager
signal.signal(signal.SIGTERM, self.sigIntHandler)
signal.signal(signal.SIGINT, self.sigIntHandler)
......@@ -2063,17 +2069,16 @@ class MainExec:
self._svcMgr.sigUsrHandler(signalNumber, frame)
def sigIntHandler(self, signalNumber, frame):
- if self._svcMgr:
+ if self._svcMgr:
self._svcMgr.sigIntHandler(signalNumber, frame)
- if self._clientMgr:
+ if self._clientMgr:
self._clientMgr.sigIntHandler(signalNumber, frame)
def runClient(self):
global gSvcMgr
if gConfig.auto_start_service:
- self._svcMgr = ServiceManager()
- gSvcMgr = self._svcMgr # hack alert
- self._svcMgr.startTaosService() # we start, don't run
+ gSvcMgr = self._svcMgr = ServiceManager() # hack alert
+ gSvcMgr.startTaosService() # we start, don't run
self._clientMgr = ClientManager()
ret = None
......@@ -2086,12 +2091,10 @@ class MainExec:
def runService(self):
global gSvcMgr
- self._svcMgr = ServiceManager()
- gSvcMgr = self._svcMgr # save it in a global variable TODO: hack alert
+ gSvcMgr = self._svcMgr = ServiceManager(gConfig.num_dnodes) # save it in a global variable TODO: hack alert
- self._svcMgr.run() # run to some end state
- self._svcMgr = None
- gSvcMgr = None
+ gSvcMgr.run() # run to some end state
+ gSvcMgr = self._svcMgr = None
def init(self): # TODO: refactor
global gContainer
......@@ -2165,6 +2168,13 @@ class MainExec:
help='Use non-fixed names for dbs/tables, useful for multi-instance executions (default: false)')
+ help='Number of Dnodes to initialize, used with -e option. (default: 1)')
......@@ -2209,7 +2219,12 @@ class MainExec:
def run(self):
if gConfig.run_tdengine: # run server
return 0 # success
except ConnectionError as err:
Logging.error("Failed to make DB connection, please check DB instance manually")
return -1 # failure
return self.runClient()
......@@ -12,7 +12,9 @@ from util.cases import *
from util.dnodes import *
from util.log import *
- from .misc import Logging, CrashGenError, Helper
+ from .misc import Logging, CrashGenError, Helper, Dice
+ import os
+ import datetime
# from .service_manager import TdeInstance
class DbConn:
......@@ -44,6 +46,9 @@ class DbConn:
self._lastSql = None
self._dbTarget = dbTarget
+ def __repr__(self):
+ return "[DbConn: type={}, target={}]".format(self._type, self._dbTarget)
def getLastSql(self):
return self._lastSql
......@@ -54,7 +59,7 @@ class DbConn:
# below implemented by child classes
- Logging.debug("[DB] data connection opened, type = {}".format(self._type))
+ Logging.debug("[DB] data connection opened: {}".format(self))
self.isOpen = True
def close(self):
......@@ -277,15 +282,18 @@ class DbTarget:
self.cfgPath = cfgPath
self.hostAddr = hostAddr
self.port = port
def __repr__(self):
return "[DbTarget: cfgPath={}, host={}:{}]".format(
- self.cfgPath, self.hostAddr, self.port)
+ Helper.getFriendlyPath(self.cfgPath), self.hostAddr, self.port)
+ def getEp(self):
+ return "{}:{}".format(self.hostAddr, self.port)
class DbConnNative(DbConn):
# Class variables
_lock = threading.Lock()
- _connInfoDisplayed = False
+ # _connInfoDisplayed = False # TODO: find another way to display this
totalConnections = 0 # Not private
def __init__(self, dbTarget):
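
The `DbTarget` additions in this hunk can be tried out on their own. The sketch below is a reduced, hypothetical copy (`DbTargetSketch`): it prints the raw `cfgPath` instead of calling `Helper.getFriendlyPath`, and the host name `localhost` is an assumption chosen only for the example.

```python
class DbTargetSketch:
    """Hypothetical reduced version of DbTarget from the hunk above."""

    def __init__(self, cfgPath, hostAddr, port):
        self.cfgPath = cfgPath
        self.hostAddr = hostAddr
        self.port = port

    def __repr__(self):
        # The real code shortens cfgPath with Helper.getFriendlyPath first.
        return "[DbTarget: cfgPath={}, host={}:{}]".format(
            self.cfgPath, self.hostAddr, self.port)

    def getEp(self):
        # The end point, formatted as host:port
        return "{}:{}".format(self.hostAddr, self.port)


target = DbTargetSketch("build/cluster_dnode_0/cfg", "localhost", 6030)
print(target.getEp())  # localhost:6030
```

Because `DbConn.__repr__` embeds the target, log lines such as "data connection opened" now identify which DNode a connection points to.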
......@@ -304,9 +312,9 @@ class DbConnNative(DbConn):
cls = self.__class__ # Get the class, to access class variables
with cls._lock: # force single threading for opening DB connections. # TODO: whaaat??!!!
dbTarget = self._dbTarget
- if not cls._connInfoDisplayed:
- cls._connInfoDisplayed = True # updating CLASS variable
- Logging.info("Initiating TAOS native connection to {}".format(dbTarget))
+ # if not cls._connInfoDisplayed:
+ # cls._connInfoDisplayed = True # updating CLASS variable
+ Logging.debug("Initiating TAOS native connection to {}".format(dbTarget))
# Make the connection
# self._conn = taos.connect(host=hostAddr, config=cfgPath) # TODO: make configurable
# self._cursor = self._conn.cursor()
......@@ -424,3 +432,4 @@ class DbManager():
def cleanUp(self):
import threading
import random
import logging
import os
class CrashGenError(Exception):
......@@ -26,7 +27,7 @@ class LoggingFilter(logging.Filter):
class MyLoggingAdapter(logging.LoggerAdapter):
def process(self, msg, kwargs):
- return "[{}]{}".format(threading.get_ident() % 10000, msg), kwargs
+ return "[{}] {}".format(threading.get_ident() % 10000, msg), kwargs
# return '[%s] %s' % (self.extra['connid'], msg), kwargs
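
The adapter in this hunk prefixes each log message with a short thread tag. A minimal, self-contained sketch of the same technique follows; the class name `ThreadTagAdapter` and the logger name are hypothetical.

```python
import logging
import threading


class ThreadTagAdapter(logging.LoggerAdapter):
    """Hypothetical adapter mirroring MyLoggingAdapter.process above."""

    def process(self, msg, kwargs):
        # Keep only the last four digits of the thread id as a short tag
        return "[{}] {}".format(threading.get_ident() % 10000, msg), kwargs


adapter = ThreadTagAdapter(logging.getLogger("crash_gen_sketch"), {})
text, _ = adapter.process("task started", {})
print(text)
```

The added space after the closing bracket is the only behavioral change in the hunk; it keeps the tag visually separate from the message.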
......@@ -71,12 +72,44 @@ class Logging:
def warning(cls, msg):
def error(cls, msg):
+ class Status:
+ def __init__(self, status):
+ def __repr__(self):
+ return "[Status: v={}]".format(self._status)
+ def set(self, status):
+ self._status = status
+ def get(self):
+ return self._status
+ def isStarting(self):
+ return self._status == Status.STATUS_STARTING
+ def isRunning(self):
+ # return self._thread and self._thread.is_alive()
+ return self._status == Status.STATUS_RUNNING
+ def isStopping(self):
+ return self._status == Status.STATUS_STOPPING
+ def isStopped(self):
+ return self._status == Status.STATUS_STOPPED
+ def isStable(self):
+ return self.isRunning() or self.isStopped()
# Deterministic random number generator
class Dice():
seeded = False # static, uninitialized
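
The `Status` class in the hunk above tracks a four-state lifecycle. The sketch below shows how it behaves; the numeric values of the `STATUS_*` constants and the body of `__init__` do not appear in the diff as shown here, so both are assumptions, and the class is renamed `StatusSketch` to make that clear.

```python
class StatusSketch:
    # Assumed values: the constants are referenced in the diff, but their
    # definitions are not shown.
    STATUS_STARTING = 1
    STATUS_RUNNING = 2
    STATUS_STOPPING = 3
    STATUS_STOPPED = 4

    def __init__(self, status):
        self.set(status)  # assumed body; not visible in the diff

    def set(self, status):
        self._status = status

    def isRunning(self):
        return self._status == StatusSketch.STATUS_RUNNING

    def isStopped(self):
        return self._status == StatusSketch.STATUS_STOPPED

    def isStable(self):
        # Stable means not in a transitional (starting/stopping) state
        return self.isRunning() or self.isStopped()


s = StatusSketch(StatusSketch.STATUS_STARTING)
print(s.isStable())  # False
s.set(StatusSketch.STATUS_RUNNING)
print(s.isStable())  # True
```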
......@@ -118,14 +151,23 @@ class Helper:
def convertErrno(cls, errno):
return errno if (errno > 0) else 0x80000000 + errno
+ def getFriendlyPath(cls, path): # returns .../xxx/yyy
+ ht1 = os.path.split(path)
+ ht2 = os.path.split(ht1[0])
+ return ".../" + ht2[1] + '/' + ht1[1]
class Progress:
tokens = {
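
The two `Helper` methods in this hunk are easy to check by hand. The sketch below rewrites them as plain functions with hypothetical snake_case names; the example path and error code are made up for illustration.

```python
import os


def get_friendly_path(path):
    """Return '.../<parent>/<leaf>' for a longer path."""
    head, leaf = os.path.split(path)
    parent = os.path.split(head)[1]
    return ".../" + parent + "/" + leaf


def convert_errno(errno):
    """Map a non-positive code to 0x80000000 + code; pass positives through."""
    return errno if errno > 0 else 0x80000000 + errno


print(get_friendly_path("/home/user/TDengine/build/cluster_dnode_0/cfg"))
# .../cluster_dnode_0/cfg
print(hex(convert_errno(-0x7FFFFE00)))
# 0x200
```

Keeping only the last two path components is what makes the `DbTarget` log output short enough to read while still naming the DNode directory.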