apache / DolphinScheduler
Commit 0cc0ee77 (unverified)
Authored May 17, 2022 by caishunfeng
Committed by GitHub on May 17, 2022
[Bug][Master] fix master task failover (#10065)
* fix master task failover
* ui
Parent: c1642402
Showing 3 changed files with 114 additions and 61 deletions
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/runner/task/TaskProcessorFactory.java  +9 -0
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/service/FailoverService.java  +38 -25
dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/service/FailoverServiceTest.java  +67 -36
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/runner/task/TaskProcessorFactory.java
@@ -52,4 +52,13 @@ public class TaskProcessorFactory {
         return iTaskProcessor.getClass().newInstance();
     }
+    /**
+     * if match master processor, then this task type is processed on the master
+     * @param type
+     * @return
+     */
+    public static boolean isMasterTask(String type) {
+        return PROCESS_MAP.containsKey(type);
+    }
 }
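The `isMasterTask` method added in this hunk is a key-presence check on `PROCESS_MAP`. The following is a minimal sketch of that idea only. The class name `MasterTaskRegistrySketch` and the registered type strings are hypothetical stand-ins, not the project's actual processor set or loading mechanism.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TaskProcessorFactory; the registered types are illustrative only.
class MasterTaskRegistrySketch {

    // Task type -> processor. Plain Objects are used here as placeholder processors.
    private static final Map<String, Object> PROCESS_MAP = new HashMap<>();

    static {
        PROCESS_MAP.put("SWITCH", new Object());
        PROCESS_MAP.put("DEPENDENT", new Object());
    }

    // A task type counts as a master task when a master-side processor is registered for it.
    static boolean isMasterTask(String type) {
        return PROCESS_MAP.containsKey(type);
    }

    public static void main(String[] args) {
        System.out.println(isMasterTask("SWITCH")); // registered in this sketch
        System.out.println(isMasterTask("SHELL"));  // not registered in this sketch
    }
}
```

Keeping the check next to the map means callers such as a failover routine do not need their own list of master-side task types.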
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/service/FailoverService.java
@@ -30,6 +30,7 @@ import org.apache.dolphinscheduler.plugin.task.api.enums.ExecutionStatus;
 import org.apache.dolphinscheduler.server.builder.TaskExecutionContextBuilder;
 import org.apache.dolphinscheduler.server.master.config.MasterConfig;
 import org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool;
+import org.apache.dolphinscheduler.server.master.runner.task.TaskProcessorFactory;
 import org.apache.dolphinscheduler.server.utils.ProcessUtils;
 import org.apache.dolphinscheduler.service.process.ProcessService;
 import org.apache.dolphinscheduler.service.registry.RegistryClient;
@@ -127,7 +128,11 @@ public class FailoverService {
         long startTime = System.currentTimeMillis();
         List<ProcessInstance> needFailoverProcessInstanceList = processService.queryNeedFailoverProcessInstances(masterHost);
         LOGGER.info("start master[{}] failover, process list size:{}", masterHost, needFailoverProcessInstanceList.size());
-        List<Server> workerServers = registryClient.getServerList(NodeType.WORKER);
+        // servers need to contains master hosts and worker hosts, otherwise the logic task will failover fail.
+        List<Server> servers = registryClient.getServerList(NodeType.WORKER);
+        servers.addAll(registryClient.getServerList(NodeType.MASTER));
         for (ProcessInstance processInstance : needFailoverProcessInstanceList) {
             if (Constants.NULL.equals(processInstance.getHost())) {
                 continue;
@@ -136,7 +141,7 @@ public class FailoverService {
             List<TaskInstance> validTaskInstanceList = processService.findValidTaskListByProcessId(processInstance.getId());
             for (TaskInstance taskInstance : validTaskInstanceList) {
                 LOGGER.info("failover task instance id: {}, process instance id: {}", taskInstance.getId(), taskInstance.getProcessInstanceId());
-                failoverTaskInstance(processInstance, taskInstance, workerServers);
+                failoverTaskInstance(processInstance, taskInstance, servers);
             }
             if (serverStartupTime != null && processInstance.getRestartTime() != null
@@ -198,29 +203,37 @@ public class FailoverService {
     /**
      * failover task instance
      * <p>
-     * 1. kill yarn job if there are yarn jobs in tasks.
+     * 1. kill yarn job if run on worker and there are yarn jobs in tasks.
      * 2. change task state from running to need failover.
      * 3. try to notify local master
+     * @param processInstance
+     * @param taskInstance
+     * @param servers if failover master, servers container master servers and worker servers; if failover worker, servers contain worker servers.
      */
-    private void failoverTaskInstance(ProcessInstance processInstance, TaskInstance taskInstance, List<Server> workerServers) {
+    private void failoverTaskInstance(ProcessInstance processInstance, TaskInstance taskInstance, List<Server> servers) {
         if (processInstance == null) {
             LOGGER.error("failover task instance error, processInstance {} of taskInstance {} is null", taskInstance.getProcessInstanceId(), taskInstance.getId());
             return;
         }
-        if (!checkTaskInstanceNeedFailover(workerServers, taskInstance)) {
+        if (!checkTaskInstanceNeedFailover(servers, taskInstance)) {
             return;
         }
+        boolean isMasterTask = TaskProcessorFactory.isMasterTask(taskInstance.getTaskType());
         taskInstance.setProcessInstance(processInstance);
-        TaskExecutionContext taskExecutionContext = TaskExecutionContextBuilder.get()
-            .buildTaskInstanceRelatedInfo(taskInstance)
-            .buildProcessInstanceRelatedInfo(processInstance)
-            .create();
-        if (masterConfig.isKillYarnJobWhenTaskFailover()) {
-            // only kill yarn job if exists , the local thread has exited
-            ProcessUtils.killYarnJob(taskExecutionContext);
+        if (!isMasterTask) {
+            TaskExecutionContext taskExecutionContext = TaskExecutionContextBuilder.get()
+                .buildTaskInstanceRelatedInfo(taskInstance)
+                .buildProcessInstanceRelatedInfo(processInstance)
+                .create();
+            if (masterConfig.isKillYarnJobWhenTaskFailover()) {
+                // only kill yarn job if exists , the local thread has exited
+                ProcessUtils.killYarnJob(taskExecutionContext);
+            }
         }
         taskInstance.setState(ExecutionStatus.NEED_FAULT_TOLERANCE);
@@ -256,13 +269,13 @@ public class FailoverService {
     }
     /**
-     * task needs failover if task start before worker starts
+     * task needs failover if task start before server starts
      *
-     * @param workerServers worker servers
+     * @param servers servers, can container master servers or worker servers
      * @param taskInstance task instance
      * @return true if task instance need fail over
      */
-    private boolean checkTaskInstanceNeedFailover(List<Server> workerServers, TaskInstance taskInstance) {
+    private boolean checkTaskInstanceNeedFailover(List<Server> servers, TaskInstance taskInstance) {
         boolean taskNeedFailover = true;
@@ -279,14 +292,13 @@ public class FailoverService {
             return false;
         }
         //now no host will execute this task instance,so no need to failover the task
         if (taskInstance.getHost() == null) {
             return false;
         }
-        //if task start after worker starts, there is no need to failover the task.
-        if (checkTaskAfterWorkerStart(workerServers, taskInstance)) {
+        //if task start after server starts, there is no need to failover the task.
+        if (checkTaskAfterServerStart(servers, taskInstance)) {
             taskNeedFailover = false;
         }
@@ -296,19 +308,20 @@ public class FailoverService {
     /**
      * check task start after the worker server starts.
      *
+     * @param servers servers, can contain master servers or worker servers
      * @param taskInstance task instance
-     * @return true if task instance start time after worker server start date
+     * @return true if task instance start time after server start date
      */
-    private boolean checkTaskAfterWorkerStart(List<Server> workerServers, TaskInstance taskInstance) {
+    private boolean checkTaskAfterServerStart(List<Server> servers, TaskInstance taskInstance) {
         if (StringUtils.isEmpty(taskInstance.getHost())) {
             return false;
         }
-        Date workerServerStartDate = getServerStartupTime(workerServers, taskInstance.getHost());
-        if (workerServerStartDate != null) {
+        Date serverStartDate = getServerStartupTime(servers, taskInstance.getHost());
+        if (serverStartDate != null) {
             if (taskInstance.getStartTime() == null) {
-                return taskInstance.getSubmitTime().after(workerServerStartDate);
+                return taskInstance.getSubmitTime().after(serverStartDate);
             } else {
-                return taskInstance.getStartTime().after(serverStartDate);
+                return taskInstance.getStartTime().after(serverStartDate);
             }
         }
         return false;
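The start-time check in the last hunk depends on which servers are in the list it is given. The sketch below illustrates that dependency with simplified, hypothetical types. The body of `getServerStartupTime` is not part of this diff, so the `host:port` string match in `startupTimeOf` is an assumption, as are the class name, addresses and timestamps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;

class ServerStartCheckSketch {

    // Hypothetical, simplified stand-in for a registry server entry.
    static final class Server {
        final String address;   // "host:port", assumed to match the task instance host
        final Date startupTime;

        Server(String address, Date startupTime) {
            this.address = address;
            this.startupTime = startupTime;
        }
    }

    // Assumed lookup: the first server whose address equals the task host, else null.
    static Date startupTimeOf(List<Server> servers, String taskHost) {
        for (Server server : servers) {
            if (server.address.equals(taskHost)) {
                return server.startupTime;
            }
        }
        return null;
    }

    // Same shape as checkTaskAfterServerStart: prefer the start time, fall back to the submit time.
    static boolean startedAfterServer(List<Server> servers, String taskHost, Date startTime, Date submitTime) {
        if (taskHost == null || taskHost.isEmpty()) {
            return false;
        }
        Date serverStart = startupTimeOf(servers, taskHost);
        if (serverStart == null) {
            return false; // host not in the list, so no comparison is possible
        }
        Date reference = (startTime != null) ? startTime : submitTime;
        // The null guard on reference is an addition of this sketch, not taken from the diff.
        return reference != null && reference.after(serverStart);
    }

    public static void main(String[] args) {
        Date boot = new Date(1_000L);
        Date taskStart = new Date(2_000L);
        List<Server> workersOnly = Arrays.asList(new Server("10.0.0.1:1234", boot));
        List<Server> all = new ArrayList<>(workersOnly);
        all.add(new Server("10.0.0.1:5678", boot));

        // A task hosted on the master address is invisible to a worker-only list.
        System.out.println(startedAfterServer(workersOnly, "10.0.0.1:5678", taskStart, null));
        System.out.println(startedAfterServer(all, "10.0.0.1:5678", taskStart, null));
    }
}
```

In this sketch the first call prints `false` because the master address is missing from the worker-only list, and the second prints `true` once the master entry is included.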
dolphinscheduler-master/src/test/java/org/apache/dolphinscheduler/server/master/service/FailoverServiceTest.java
@@ -17,6 +17,9 @@
 package org.apache.dolphinscheduler.server.master.service;
+import static org.apache.dolphinscheduler.common.Constants.COMMON_TASK_TYPE;
+import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.TASK_TYPE_DEPENDENT;
+import static org.apache.dolphinscheduler.plugin.task.api.TaskConstants.TASK_TYPE_SWITCH;
 import static org.mockito.BDDMockito.given;
 import static org.mockito.Mockito.doNothing;
@@ -30,9 +33,11 @@ import org.apache.dolphinscheduler.dao.entity.TaskInstance;
 import org.apache.dolphinscheduler.plugin.task.api.enums.ExecutionStatus;
 import org.apache.dolphinscheduler.server.master.config.MasterConfig;
 import org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool;
+import org.apache.dolphinscheduler.service.bean.SpringApplicationContext;
 import org.apache.dolphinscheduler.service.process.ProcessService;
 import org.apache.dolphinscheduler.service.registry.RegistryClient;
+import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Date;
@@ -46,6 +51,7 @@ import org.mockito.Mockito;
 import org.powermock.core.classloader.annotations.PowerMockIgnore;
 import org.powermock.core.classloader.annotations.PrepareForTest;
 import org.powermock.modules.junit4.PowerMockRunner;
+import org.springframework.context.ApplicationContext;
 import org.springframework.test.util.ReflectionTestUtils;
 import com.google.common.collect.Lists;
@@ -72,22 +78,34 @@ public class FailoverServiceTest {
     @Mock
     private WorkflowExecuteThreadPool workflowExecuteThreadPool;
-    private String testHost;
+    private static int masterPort = 5678;
+    private static int workerPort = 1234;
+    private String testMasterHost;
+    private String testWorkerHost;
     private ProcessInstance processInstance;
-    private TaskInstance taskInstance;
+    private TaskInstance masterTaskInstance;
+    private TaskInstance workerTaskInstance;
     @Before
     public void before() throws Exception {
-        given(masterConfig.getListenPort()).willReturn(8080);
+        // init spring context
+        ApplicationContext applicationContext = Mockito.mock(ApplicationContext.class);
+        SpringApplicationContext springApplicationContext = new SpringApplicationContext();
+        springApplicationContext.setApplicationContext(applicationContext);
+        given(masterConfig.getListenPort()).willReturn(masterPort);
-        testHost = failoverService.getLocalAddress();
-        String ip = testHost.split(":")[0];
-        int port = Integer.valueOf(testHost.split(":")[1]);
-        Assert.assertEquals(8080, port);
+        testMasterHost = failoverService.getLocalAddress();
+        String ip = testMasterHost.split(":")[0];
+        int port = Integer.valueOf(testMasterHost.split(":")[1]);
+        Assert.assertEquals(masterPort, port);
+        testWorkerHost = ip + ":" + workerPort;
         given(registryClient.getLock(Mockito.anyString())).willReturn(true);
         given(registryClient.releaseLock(Mockito.anyString())).willReturn(true);
-        given(registryClient.getHostByEventDataPath(Mockito.anyString())).willReturn(testHost);
+        given(registryClient.getHostByEventDataPath(Mockito.anyString())).willReturn(testMasterHost);
         given(registryClient.getStoppable()).willReturn(cause -> {
         });
         given(registryClient.checkNodeExists(Mockito.anyString(), Mockito.any())).willReturn(true);
@@ -95,30 +113,43 @@ public class FailoverServiceTest {
         processInstance = new ProcessInstance();
         processInstance.setId(1);
-        processInstance.setHost(testHost);
+        processInstance.setHost(testMasterHost);
         processInstance.setRestartTime(new Date());
         processInstance.setHistoryCmd("xxx");
         processInstance.setCommandType(CommandType.STOP);
-        taskInstance = new TaskInstance();
-        taskInstance.setId(1);
-        taskInstance.setStartTime(new Date());
-        taskInstance.setHost(testHost);
+        masterTaskInstance = new TaskInstance();
+        masterTaskInstance.setId(1);
+        masterTaskInstance.setStartTime(new Date());
+        masterTaskInstance.setHost(testMasterHost);
+        masterTaskInstance.setTaskType(TASK_TYPE_SWITCH);
+        workerTaskInstance = new TaskInstance();
+        workerTaskInstance.setId(2);
+        workerTaskInstance.setStartTime(new Date());
+        workerTaskInstance.setHost(testWorkerHost);
+        workerTaskInstance.setTaskType(COMMON_TASK_TYPE);
-        given(processService.queryNeedFailoverTaskInstances(Mockito.anyString())).willReturn(Arrays.asList(taskInstance));
-        given(processService.queryNeedFailoverProcessInstanceHost()).willReturn(Lists.newArrayList(testHost));
+        given(processService.queryNeedFailoverTaskInstances(Mockito.anyString())).willReturn(Arrays.asList(masterTaskInstance, workerTaskInstance));
+        given(processService.queryNeedFailoverProcessInstanceHost()).willReturn(Lists.newArrayList(testMasterHost));
         given(processService.queryNeedFailoverProcessInstances(Mockito.anyString())).willReturn(Arrays.asList(processInstance));
         doNothing().when(processService).processNeedFailoverProcessInstances(Mockito.any(ProcessInstance.class));
-        given(processService.findValidTaskListByProcessId(Mockito.anyInt())).willReturn(Lists.newArrayList(taskInstance));
+        given(processService.findValidTaskListByProcessId(Mockito.anyInt())).willReturn(Lists.newArrayList(masterTaskInstance, workerTaskInstance));
         given(processService.findProcessInstanceDetailById(Mockito.anyInt())).willReturn(processInstance);
         Thread.sleep(1000);
-        Server server = new Server();
-        server.setHost(ip);
-        server.setPort(port);
-        server.setCreateTime(new Date());
-        given(registryClient.getServerList(NodeType.WORKER)).willReturn(Arrays.asList(server));
-        given(registryClient.getServerList(NodeType.MASTER)).willReturn(Arrays.asList(server));
+        Server masterServer = new Server();
+        masterServer.setHost(ip);
+        masterServer.setPort(masterPort);
+        masterServer.setCreateTime(new Date());
+        Server workerServer = new Server();
+        workerServer.setHost(ip);
+        workerServer.setPort(workerPort);
+        workerServer.setCreateTime(new Date());
+        given(registryClient.getServerList(NodeType.WORKER)).willReturn(new ArrayList<>(Arrays.asList(workerServer)));
+        given(registryClient.getServerList(NodeType.MASTER)).willReturn(new ArrayList<>(Arrays.asList(masterServer)));
         ReflectionTestUtils.setField(failoverService, "registryClient", registryClient);
         doNothing().when(workflowExecuteThreadPool).submitStateEvent(Mockito.any(StateEvent.class));
@@ -132,26 +163,26 @@ public class FailoverServiceTest {
     @Test
     public void failoverMasterTest() {
         processInstance.setHost(Constants.NULL);
-        taskInstance.setState(ExecutionStatus.RUNNING_EXECUTION);
-        failoverService.failoverServerWhenDown(testHost, NodeType.MASTER);
-        Assert.assertNotEquals(taskInstance.getState(), ExecutionStatus.NEED_FAULT_TOLERANCE);
-        processInstance.setHost(testHost);
-        taskInstance.setState(ExecutionStatus.SUCCESS);
-        failoverService.failoverServerWhenDown(testHost, NodeType.MASTER);
-        Assert.assertNotEquals(taskInstance.getState(), ExecutionStatus.NEED_FAULT_TOLERANCE);
+        masterTaskInstance.setState(ExecutionStatus.RUNNING_EXECUTION);
+        failoverService.failoverServerWhenDown(testMasterHost, NodeType.MASTER);
+        Assert.assertNotEquals(masterTaskInstance.getState(), ExecutionStatus.NEED_FAULT_TOLERANCE);
+        processInstance.setHost(testMasterHost);
+        masterTaskInstance.setState(ExecutionStatus.SUCCESS);
+        failoverService.failoverServerWhenDown(testMasterHost, NodeType.MASTER);
+        Assert.assertNotEquals(masterTaskInstance.getState(), ExecutionStatus.NEED_FAULT_TOLERANCE);
         Assert.assertEquals(Constants.NULL, processInstance.getHost());
-        processInstance.setHost(testHost);
-        taskInstance.setState(ExecutionStatus.RUNNING_EXECUTION);
-        failoverService.failoverServerWhenDown(testHost, NodeType.MASTER);
-        Assert.assertEquals(taskInstance.getState(), ExecutionStatus.NEED_FAULT_TOLERANCE);
+        processInstance.setHost(testMasterHost);
+        masterTaskInstance.setState(ExecutionStatus.RUNNING_EXECUTION);
+        failoverService.failoverServerWhenDown(testMasterHost, NodeType.MASTER);
+        Assert.assertEquals(masterTaskInstance.getState(), ExecutionStatus.NEED_FAULT_TOLERANCE);
         Assert.assertEquals(Constants.NULL, processInstance.getHost());
     }
     @Test
     public void failoverWorkTest() {
-        failoverService.failoverServerWhenDown(testHost, NodeType.WORKER);
-        Assert.assertEquals(taskInstance.getState(), ExecutionStatus.NEED_FAULT_TOLERANCE);
+        failoverService.failoverServerWhenDown(testWorkerHost, NodeType.WORKER);
+        Assert.assertEquals(workerTaskInstance.getState(), ExecutionStatus.NEED_FAULT_TOLERANCE);
     }
 }
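The test setup now wraps its stubbed server lists in `new ArrayList<>(Arrays.asList(...))`. One plausible reason, not stated in the commit, is that `FailoverService` now calls `servers.addAll(...)` on the list returned for `NodeType.WORKER`, and `Arrays.asList` returns a fixed-size list that rejects additions. The sketch below demonstrates only that standard-library behavior. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FixedSizeListSketch {

    // Returns true if extra could be appended to target, false if target is fixed-size.
    static boolean canAppend(List<String> target, List<String> extra) {
        try {
            target.addAll(extra);
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> masters = Arrays.asList("master-1");

        // Arrays.asList is backed by a fixed-size array, so appending fails.
        System.out.println(canAppend(Arrays.asList("worker-1"), masters));

        // Copying into an ArrayList yields a growable list.
        System.out.println(canAppend(new ArrayList<>(Arrays.asList("worker-1")), masters));
    }
}
```

The first call prints `false` and the second prints `true`, so a mock that returns a bare `Arrays.asList(...)` would make a later `addAll` of a non-empty list throw.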
GitCode官方 (@csdn_codechina) mentioned in commit aa51c66d9196997bfdfd6cb42be07311652fa0e9 · May 29, 2022