之前有篇文章介绍nova新增的快速疏散接口的说明
https://www.backendcloud.cn/2017/06/08/force-down/
现在在开发的VM HA程序中加入快速疏散功能
- 修改配置文件,增加quick_evacuate_level选项
level 1: (fastest) 不关机,只force down nova-compute-service
level 2: (faster)(default) 先关机,关机介绍之后再force down nova-compute-service
level 3: (disable quick_evacuate) - change python-novaclient>=6.0.0 to support force-service-down
- 修改疏散流程处理逻辑,增加快速疏散
原来疏散的流程是发现故障计算节点后,先关机,然后等待故障节点因为心跳丢失而导致nova service state down,再进行疏散
现在改成:
level 1:(fastest) 不关机,只force down nova-compute-service
level 2:(faster)(default) 先关机,关机介绍之后再force down nova-compute-service
level 3: (disable quick_evacuate)什么也不做,和原来的流程一样
下面的代码def _ipmi_handle
,变更前实现的将故障节点执行关机操作,变更后,在level 1的时候,先进行force down nova service state并跳过关机流程。
def _ipmi_handle(self, node):
# cmd = 'ssh %s systemctl stop openstack-nova-compute' % node
# return os.system(cmd)
ret = 1
################ add ################
if CONF.quick_evacuate_level == 1:
try:
self.nova_client.services.force_down(host=node,
binary='nova-compute',
force_down=True,
)
service = self.nova_client.services.list(
host=node,
binary='nova-compute')
if service:
if service[0].state == 'down':
return ret
raise
except Exception as e:
content = node + ' nova-compute service force down failed!'
title = 'force down failed'
self.db.add_or_update_guardian_log(**{'title': title,
'detail': content,
'level': 'ERROR'})
LOG.error(content)
LOG.error(e)
################ add ################
LOG.info('IPMI enabled,need to check power status of %s' % node)
ipmi_info = self.db.get_ipmiInfo(node)
if ipmi_info is None:
LOG.warning(
'can not find ipmi info for %s, will ignore evacuate...'
% node)
return 0
LOG.debug('get ipmi info (ip %s user %s, password %s)' % (
ipmi_info.ip, ipmi_info.username, ipmi_info.password))
ipmi_util = IPMIUtil(ipmi_info.ip, ipmi_info.username,
ipmi_info.password)
power_status = ipmi_util.get_power_status()
LOG.info('power status of %s is %s' % (node, power_status))
if 'off' != power_status:
LOG.info('power status of %s is not off,power off it...' % node)
ipmi_util.do_power_off()
for i in range(1, CONF.ipmi_check_max_count + 1):
power_status = ipmi_util.get_power_status()
if 'off' != power_status:
LOG.info(
'power status of %s is not off,wait for %s seconds...'
% (node, CONF.ipmi_check_interval))
time.sleep(CONF.ipmi_check_interval)
else:
break
power_status = ipmi_util.get_power_status()
if 'off' != power_status:
LOG.info(
'after %s check,can not confirm power status of '
'%s is off,will ignore evacuate...'
% (node,
CONF.ipmi_check_max_count*CONF.ipmi_check_interval))
ret = 0
return ret
下面的代码在level 2时,不执行上面代码的跳过关机流程,在检查nova service state前force down实现快速疏散(不去检查心跳的流程)。level 3就是什么也不干,和原来一样。
################ add ################
if CONF.quick_evacuate_level == 2:
try:
self.nova_client.services.force_down(host=node,
binary='nova-compute',
force_down=True,
)
service = self.nova_client.services.list(
host=node,
binary='nova-compute')
if service:
if service[0].state != 'down':
raise
except Exception as e:
content = node + ' nova-compute service force down failed!'
title = 'force down failed'
self.db.add_or_update_guardian_log(**{'title': title,
'detail': content,
'level': 'ERROR'})
LOG.error(content)
LOG.error(e)
################ add ################
if self._wait_for_service_to_down(node):
for server in servers:
can_evacuate = self._evacuate_predict(server)
if can_evacuate:
try:
# just evacuate, let nova scheduler
# choose one host if not in drbd env
respone = server.evacuate(
host=evacuated_to,
on_shared_storage=CONF.on_share_storage)
LOG.info(
'evacuate vm info:[id:%s,name:%s,status:%s],'
'reponse is %s'
% (server.id, server.name, server.status,
respone))
content = '虚拟机被疏散:虚拟机ID: %s' % server.id
self.send_snmp(content=content)
except Exception as e:
content = ('evacuate vm %s failed' % server.id)
title = ('疏散虚拟机%s失败' % server.id)
self.db.add_or_update_guardian_log(
**{'title': title,
'detail': content,
'level': 'ERROR'})
content = '虚拟机疏散调用API失败:虚拟机ID: %s' % server.id
self.send_snmp(content=content)
LOG.error(content)
LOG.error(e)
# add to db
try:
self.db.add_evacuate_history(
instance_id=server.id,
instance_name=server.name,
host=getattr(server, 'OS-EXT-SRV-ATTR:host'))
content = (
'add_evacuate_history '
'[instance_id:%s,instance_name:%s,host:%s]'
% (server.id, server.name,
getattr(server, 'OS-EXT-SRV-ATTR:host')))
LOG.info(content)
title = ('增加疏散记录%s' % server.name)
self.db.add_or_update_guardian_log(
**{'title': title,
'detail': content,
'level': 'INFO'})
except Exception as e:
content = (
'add_evacuate_history for vm %s failed'
% server.id)
title = ('增加疏散记录%s失败' % server.id)
self.db.add_or_update_guardian_log(
**{'title': title,
'detail': content,
'level': 'ERROR'})
LOG.error(content)
LOG.error(e)
else:
content = '虚拟机没有被疏散:虚拟机ID: %s' % server.id
self.send_snmp(content=content)
LOG.info('server %s can not evacuate' % server.id)
# disable this service
try:
self.nova_client.services.disable(node, 'nova-compute')
except Exception as e:
content = (
'failed to disable compute service on node %s' % node)
title = ('disable节点%s服务失败' % node)
self.db.add_or_update_guardian_log(
**{'title': title,
'detail': content,
'level': 'ERROR'})
LOG.error(content)
LOG.error(e)
raise
heart_beat = self.db.get_heartbeat_by_name(node)
heart_beat.has_evacuated = 1
heart_beat.status = 'done'
heart_beat.save(self.db.session)
LOG.info('node %s update with [has_evacuated: 1]' % node)
转载请注明来源,欢迎指出任何有错误或不够清晰的表达。可以邮件至 backendcloud@gmail.com