Ma Ge 39_01: Linux Cluster Series, Part 11: High-Availability Clusters, CRM-Based Resource Management in heartbeat (very useful)

Friday, 2020-11-27 06:22, adminshiping1

Review:

RA classes:

OCF (providers such as pacemaker and linbit)

LSB

Legacy Heartbeat V1

STONITH

RA: Resource Agent

Manages a resource on the cluster's behalf

LRM: Local Resource Manager

TE: Transition Engine

PE: Policy Engine

CRM: Cluster Resource Manager

haresources (heartbeat v1)

crm, haresources (heartbeat v2)

pacemaker (heartbeat v3)

rgmanager (RHCS)

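The CRM provides a base platform for applications that are not HA-aware.

Many resources are managed by the CRM on their behalf: it monitors their running state and carries out actions such as starting and stopping them.

The CRM has to provide a powerful management framework so that administrators can configure how these non-HA-aware applications (httpd, for example) behave in a high availability (HA) setup.

The haresources file is the configuration interface for managing resources, but its functionality is limited, so haresources was superseded by crm.

In heartbeat 2, crm runs as its own process called crmd (cluster resource manager daemon). crmd runs on every node, and the instances communicate with each other, relying on the underlying messaging layer. crmd also listens on a socket, which is the interface (API) administrators use to manage every resource running on top of the crm.

With v2, crm gained many tools for resource management, and its functionality and extensibility improved greatly.

GUI: configure resources through a web page or a graphical window, configure nodes (apparently not possible; it seems only resources can be configured), move resources manually, and so on.

CLI: command-line interface for adding resources, merging resources into groups, and so on.

No extra command-line tools are provided; this may be unrelated to the command-line interface, but the CRM's internal information can still be managed conveniently on each node. (I did not follow this point.)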

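HA service:

All resources that belong to the same service must run on the same node at the same time. If no group is defined, resources are balanced, that is, spread evenly across the nodes. For example, if the three resources (web VIP, httpd, filesystem) are not merged into one group, they run on different nodes.

Resource types:

primitive (native): a primary resource (also called a basic or original resource)

group: a group resource; its members start and move together

clone: some resources must run on several nodes at the same time; the resource is cloned n times so that a copy can run on every node

For example, a STONITH device: the process that manages STONITH is itself configured as a resource of the HA service.

For example, the dlm (distributed lock manager) of a cluster filesystem. It has to be defined as an HA resource and run on every node, and the instances communicate with each other. With a cluster filesystem, several nodes can read and write the same shared storage at the same time. When a node reads or writes a file it takes a lock, and the lock information must be announced to the other nodes that can access the same file; the resources defined as distributed lock managers do this. The dlm ships with the cluster filesystem; it is an application that has to be configured as a resource.

master/slave: built on a primary resource; it can only be cloned into two copies, which run on two nodes in a master/slave relationship

For example, drbd (distributed replicated block device), which mirrors disk blocks across nodes. It was merged into the mainline kernel in 2.6.33. Red Hat 5.8 and 5.9 use 2.6.18, but a third-party kernel module is available, so there is no need to recompile the kernel; installing the module is enough.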

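Resource stickiness: resource preference

Whether a resource prefers to stay on the node it is currently running on; this is the most accurate definition.

Positive: prefers to stay. Positive infinity means: stay here whenever possible.

Negative: prefers to leave. Negative infinity means: leave whenever possible.

Example: node1.magedu.com: resource stickiness 100, location constraint 200

node2.magedu.com: resource stickiness 100, location constraint inf. If node2 fails, the resource stays on node1; once node2 recovers, the resource moves back to node2, because its location constraint there is inf (positive infinity).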

Resource constraints:

location: location constraint

colocation: colocation constraint

order: order constraint

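In a two-node HA cluster, neither side can hold more than half of the votes after a split, so a third device is used to decide whether this node or the other one has failed.

A ping node, a ping group, a quorum disk and similar devices help determine which half (which node) is healthy.

With more than two nodes, the total number of votes should be odd. However the cluster splits (into two parts or more), a sub-cluster whose votes exceed half of the total can consider itself healthy and the others unhealthy. To avoid contention for resources, the healthy nodes may fence any of the other nodes.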
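
The majority rule described above can be sketched in a few lines. This is only an illustration of the arithmetic, assuming one vote per node (and one for a ping node); the function name `has_quorum` is made up for this example and is not part of heartbeat.

```python
def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """Return True when a partition holds strictly more than half of all votes."""
    return 2 * partition_votes > total_votes


# Two nodes, one vote each: after a split neither side has a majority.
print(has_quorum(1, 2))
# A ping node adds a third vote: the side that still reaches it has 2 of 3.
print(has_quorum(2, 3))
# Five nodes split 3/2: only the three-node partition keeps quorum.
print(has_quorum(3, 5), has_quorum(2, 5))
```

The strict inequality is the point: with an even total, a 50/50 split leaves both halves without quorum, which is why a two-node cluster needs an extra tie-breaker.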

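A proper cluster should have a fence device (STONITH device).

Resource isolation comes at two levels:

Node level: STONITH

Resource level

heartbeat configuration:

1) authkeys

2) ha.cf, two main points:

node

bcast, mcast, ucast: using one of them is enough

3) haresources: provides the configuration for resources

HA prerequisites:

1) Time synchronization

Both nodes, 192.168.0.45 and 192.168.0.55, must run # ntpdate 192.168.0.75

2) Mutual SSH trust between the two hosts

3) The host name must match uname -n and be resolved through /etc/hosts; DNS resolution is not recommended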

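CIB: Cluster Information Base. When an HA cluster is managed through crm, this is the configuration that holds the resource definitions; all of the cluster's configuration may be stored in it. Like haresources, the CIB must be identical on every node. The CIB is in XML format.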
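
Because the CIB is plain XML, it can be inspected with any XML parser. The sketch below parses a small hand-written sample; the sample document, its ids and the helper name `summarize_cib` are assumptions made for illustration, and a real cib.xml contains more attributes and sections.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed CIB. The section names follow the object
# types that cibadmin lists: nodes, resources, constraints, crm_config, status.
SAMPLE_CIB = """
<cib>
  <configuration>
    <crm_config/>
    <nodes>
      <node id="n1" uname="node1.magedu.com" type="normal"/>
      <node id="n2" uname="node2.magedu.com" type="normal"/>
    </nodes>
    <resources>
      <group id="webservice">
        <primitive id="webip" class="ocf" provider="heartbeat" type="IPaddr"/>
        <primitive id="httpd" class="lsb" type="httpd"/>
      </group>
    </resources>
    <constraints/>
  </configuration>
  <status/>
</cib>
"""


def summarize_cib(xml_text):
    """Return (node names, [(resource id, class, type), ...]) from CIB XML."""
    root = ET.fromstring(xml_text.strip())
    nodes = [node.get("uname") for node in root.iter("node")]
    resources = [
        (prim.get("id"), prim.get("class"), prim.get("type"))
        for prim in root.iter("primitive")
    ]
    return nodes, resources


nodes, resources = summarize_cib(SAMPLE_CIB)
print(nodes)
print(resources)
```

On a live cluster the XML would come from `cibadmin -Q` rather than from a string in the script.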

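The files mentioned below are all in the /usr/lib/heartbeat/ directory.

Run the haresources2cib.py script and point it at the haresources file. When it finishes, it has converted haresources into the CIB XML format and saved the result under /var/lib/heartbeat/crm. Once the cluster service is started, these resources take effect.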
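
To make the conversion less abstract, here is a rough sketch of how one haresources line breaks down into a node and its resources. It is not the logic of haresources2cib.py itself; `parse_haresources_line` is a hypothetical helper that only splits the `name::arg::arg` syntax used in these notes.

```python
def parse_haresources_line(line):
    """Split a haresources line into (preferred node, [(agent, [args]), ...])."""
    node, *items = line.split()
    resources = []
    for item in items:
        # Arguments are separated by "::"; a single ":" (as in an NFS
        # export such as host:/path) stays inside the argument.
        agent, *args = item.split("::")
        resources.append((agent, args))
    return node, resources


line = ("node1.magedu.com IPaddr::192.168.0.50/24/eth0 "
        "Filesystem::192.168.0.75:/web/htdocs::/www/a.org::nfs httpd")
node, resources = parse_haresources_line(line)
print(node)
for agent, args in resources:
    print(agent, args)
```

Each (agent, args) pair corresponds to one primitive resource in the CIB, and the resources of one line behave like a group on the named node.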

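The configuration can of course also be done directly through the GUI or the CLI.

haresources2cib.py works together with ha_propagate (I am not sure which depends on which): after the cluster information base file has been edited, ha_propagate is called automatically and copies the configuration to the other nodes (over the mutual SSH trust).

send_arp: after a node takes over the VIP, it broadcasts the new MAC-to-VIP mapping (gratuitous ARP, a form of ARP spoofing) so that the ARP caches of routers and other devices are updated.

PE: the pengine file; TE: the tengine file; stonithd (shoots the other node in the head); quorumd (quorum votes); crmd (runs the crm program); ccm (noted in the lecture as "cluster config manager", a dedicated tool that helps crm manage the CIB and comes up again with RHCS; in Heartbeat the abbreviation stands for Consensus Cluster Membership); lrmd (local resource manager daemon)

After crm became pacemaker, the configuration commands became much more powerful.

Ma Ge reads the logs

He also pointed out that everyone builds their cluster in the same 172.16 network segment, so the heartbeat messages exchanged between each student's two nodes also reach everybody else, because bcast (broadcast) is used. The others only see "failed authentication", though.

Once everyone's cluster is running, the logs grow wildly, because every node keeps sending heartbeat messages into the segment.

Multicast can be used instead; it is advisable not to use the default multicast address but to change it slightly.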

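How it works

The destination address of a multicast packet is a class D IP address, in the range 224.0.0.0 to 239.255.255.255. A class D address must not appear in the source IP address field of an IP packet. In unicast transmission, a packet is routed from the source address to the destination address and travels through the IP network hop by hop.

In IP multicast, however, the destination of a packet is not a single address but a group, identified by a group address. All receivers join the group, and once a receiver has joined, data sent to the group address starts flowing to it; every member of the group receives the packets. Membership is dynamic: a host can join or leave a multicast group at any time.

Types of multicast groups

A multicast group can be permanent or temporary. Some multicast group addresses are officially assigned; these are called permanent multicast groups. What stays fixed for a permanent group is its IP address; its membership can change. A permanent group can have any number of members, even zero. IP multicast addresses that are not reserved for permanent groups can be used by temporary groups.

224.0.0.0~224.0.0.255: reserved multicast addresses (permanent group addresses). 224.0.0.0 itself is reserved and not assigned; the other addresses are used by routing protocols.

224.0.1.0~224.0.1.255: public multicast addresses, usable on the Internet.

224.0.2.0~238.255.255.255: user-available multicast addresses (temporary group addresses), valid network-wide.

239.0.0.0~239.255.255.255: administratively scoped multicast addresses, valid only within a specific local scope.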
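
When choosing an address for the mcast line in ha.cf, it helps to know which of the four ranges it falls into. The sketch below simply encodes the ranges listed above; the function name `classify_multicast` and the label strings are made up for this example.

```python
import ipaddress

# The four ranges from the notes, as (first address, last address, label).
RANGES = [
    ("224.0.0.0", "224.0.0.255", "reserved (permanent groups)"),
    ("224.0.1.0", "224.0.1.255", "public, usable on the Internet"),
    ("224.0.2.0", "238.255.255.255", "user-available (temporary groups)"),
    ("239.0.0.0", "239.255.255.255", "administratively scoped (local)"),
]


def classify_multicast(addr: str) -> str:
    """Return the label of the range that contains addr."""
    ip = ipaddress.IPv4Address(addr)
    for low, high, label in RANGES:
        if ipaddress.IPv4Address(low) <= ip <= ipaddress.IPv4Address(high):
            return label
    return "not a multicast address"


for addr in ("224.0.0.18", "225.0.100.19", "239.1.2.3", "192.168.0.45"):
    print(addr, "->", classify_multicast(addr))
```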

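Commonly used reserved multicast addresses

The list is as follows:

224.0.0.0 base address (reserved)

224.0.0.1 all hosts (including all routers)

224.0.0.2 all multicast routers

224.0.0.3 unassigned

224.0.0.4 DVMRP routers

224.0.0.5 OSPF routers

224.0.0.6 OSPF DR

224.0.0.7 ST routers

224.0.0.8 ST hosts

224.0.0.9 RIP-2 routers

224.0.0.10 EIGRP routers

224.0.0.11 mobile agents

224.0.0.12 DHCP server / relay agent

224.0.0.13 all PIM routers

224.0.0.14 RSVP encapsulation

224.0.0.15 all CBT routers

224.0.0.16 designated SBM

224.0.0.17 all SBMs

224.0.0.18 VRRP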

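When Ethernet carries a unicast IP packet, the destination MAC address is the receiver's MAC address. For a multicast packet the destination is no longer one specific receiver but a group with indeterminate membership, so a multicast MAC address is used. Multicast MAC addresses correspond to multicast IP addresses. IANA (Internet Assigned Numbers Authority) specifies that the high 24 bits of a multicast MAC address are 0x01005e, the next bit is 0, and the low 23 bits are the low 23 bits of the multicast IP address.

Only 23 of the low 28 bits of an IP multicast address are mapped into the MAC address, so 32 IP multicast addresses map to the same MAC address. (This puzzled me at first: the 5 bits that are not mapped can take 2^5 = 32 values.)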
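
The mapping is easier to see when worked through in code. The following sketch applies the rule stated above (prefix 0x01005e, one zero bit, low 23 bits of the IP address) and lists the 32 group addresses that share one MAC; the function names are chosen for this example only.

```python
import ipaddress


def multicast_mac(addr: str) -> str:
    """Map an IPv4 multicast address to its Ethernet multicast MAC address."""
    ip = ipaddress.IPv4Address(addr)
    if not ip.is_multicast:
        raise ValueError(f"{addr} is not a multicast address")
    low23 = int(ip) & 0x7FFFFF            # keep only the low 23 bits
    mac = (0x01005E << 24) | low23        # fixed 24-bit prefix, then a 0 bit
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


def colliding_addresses(addr: str) -> list:
    """Return every multicast IP that maps to the same MAC as addr."""
    low23 = int(ipaddress.IPv4Address(addr)) & 0x7FFFFF
    base = int(ipaddress.IPv4Address("224.0.0.0"))
    # The 5 bits between the 1110 class D prefix and the low 23 bits are lost.
    return [str(ipaddress.IPv4Address(base | (hi << 23) | low23)) for hi in range(32)]


print(multicast_mac("225.0.100.19"))       # 01:00:5e:00:64:13
same = colliding_addresses("225.0.100.19")
print(len(same), same[0], same[-1])
```

For the address used in ha.cf in these notes, 225.0.100.19, the group 224.0.100.19 and the group 239.128.100.19 are two of the 32 addresses that land on the same MAC, so the NIC cannot tell them apart and the IP layer has to filter.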

# On the first node (192.168.0.45)

[root@node1 ha.d]# ssh node2 '/etc/init.d/heartbeat stop' # stop the cluster service on the second node (192.168.0.55)

Stopping High-Availability services:

[OK]

[root@node1 ha.d]# /etc/init.d/heartbeat stop # stop the current node (192.168.0.45)

Stopping High-Availability services:

[OK]

[root@node1 ha.d]#

[root@node1 ha.d]# pwd

/etc/ha.d

[root@node1 ha.d]# vim ha.cf # change bcast to mcast

#bcast eth0 # Linux # broadcast commented out

#mcast eth0 225.0.0.1 694 1 0

mcast eth0 225.0.100.19 694 1 0 # multicast address 225.0.100.19; pick one that differs from the default

#ucast eth0 192.168.1.2 # with unicast the two nodes' configuration files differ; 192.168.1.2 is the other node's IP

crm respawn # use the crm mechanism to manage the cluster resources; the value is respawn or on. crm is not compatible with haresources, so the haresources file is no longer read, and everything configured there (node1.magedu.com IPaddr::192.168.0.50/24/eth0 Filesystem::192.168.0.75:/web/htdocs::/www/a.org::nfs httpd) stops taking effect

# After enabling crm in ha.cf, do not rush to start the service yet

[root@node1 ha.d]# cd /usr/lib/heartbeat

[root@node1 heartbeat]# ls

# Note haresources2cib.py in the listing: it relates to the CIB (cluster information base)

api_test crm_primitive.pyc hb_setsite ocf-shellfuncs

apphbd crm_primitive.pyo hb_setweight pengine

apphbtest crm_utils.py hb_standby pingd

atest crm_utils.pyc hb_takeover plugins

attrd crm_utils.pyo heartbeat quorumd

base64_md5_test cts ipctest quorumdtest

BasicSanityCheck dopd ipctransientclient ra-api-1.dtd

ccm drbd-peer-outdater ipctransientserver recoverymgrd

ccm_testclient findif ipfail req_resource

cib ha_config logtest ResourceManager

cibmon ha_logd lrmadmin send_arp

clmtest ha_logger lrmd stonithd

crm_commands.py ha_propagate lrmtest stonithdtest

crm_commands.pyc haresources2cib.py mach_down tengine

crm_commands.pyo haresources2cib.pyc mgmtd TestHeartbeatComm

crmd haresources2cib.pyo mgmtdtest transient-test.sh

crm.dtd hb_addnode mlock ttest

crm_primitive.py hb_delnode ocf-returncodes utillib.sh

[root@node1 heartbeat]#

[root@node1 ~]# /usr/lib/heartbeat/ha_propagate # copies ha.cf and authkeys straight to the other node, so scp is no longer needed

Propagating HA configuration files to node node2.magedu.com.

ha.cf 100% 10KB 10.5KB/s 00:00

authkeys 100% 691 0.7KB/s 00:00

Setting HA startup configuration on node node2.magedu.com.

This may be freely redistributed under the terms of the GNU Public License.

usage: chkconfig --list [name]

chkconfig --add <name>

chkconfig --del <name>

chkconfig [--level <levels>] <name> <on|off|reset|resetpriorities>

[root@node1 ~]#

# Start both nodes

On the first node, 192.168.0.45:

[root@node1 ~]# service heartbeat start

logd is already running

Starting High-Availability services:

2020/12/01_08:34:22 INFO: Resource is stopped

[OK]

[root@node1 ~]# ssh node2 'service heartbeat start'

Starting High-Availability services:

2020/12/01_08:34:39 INFO: Resource is stopped

[OK]

[root@node1 ~]#

Check the log

On the first node, 192.168.0.45:

[root@node1 ~]# tail -f /var/log/messages

Dec 1 08:34:53 node1 pengine: [4969]: info: determine_online_status: Node node2.magedu.com is online

Dec 1 08:34:53 node1 crmd: [4961]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]

Dec 1 08:34:53 node1 tengine: [4968]: info: process_te_message: Processing graph derived from /var/lib/heartbeat/pengine/pe-input-3.bz2

Dec 1 08:34:53 node1 tengine: [4968]: info: unpack_graph: Unpacked transition 3: 0 actions in 0 synapses

Dec 1 08:34:53 node1 tengine: [4968]: info: run_graph: Transition 3: (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0)

Dec 1 08:34:53 node1 tengine: [4968]: info: notify_crmd: Transition 3 status: te_complete - <null>

Dec 1 08:34:53 node1 crmd: [4961]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]

Dec 1 08:34:53 node1 pengine: [4969]: info: process_pe_message: Transition 3: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-3.bz2

Dec 1 08:35:59 node1 avahi-daemon[4901]: Invalid query packet.

Dec 1 08:36:39 node1 last message repeated 8 times

In production, multicast is recommended over broadcast; with only two nodes, unicast also works.

On the first node, 192.168.0.45:

[root@node1 ~]# netstat -tnlp

Active Internet connections (only servers)

Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name

tcp 0 0 127.0.0.1:2208 0.0.0.0:* LISTEN 3906/./hpiod

tcp 0 0 0.0.0.0:2049 0.0.0.0:* LISTEN -

tcp 0 0 0.0.0.0:706 0.0.0.0:* LISTEN 3492/rpc.statd

tcp 0 0 0.0.0.0:32803 0.0.0.0:* LISTEN -

tcp 0 0 0.0.0.0:901 0.0.0.0:* LISTEN 3972/xinetd

tcp 0 0 127.0.0.1:9000 0.0.0.0:* LISTEN 3919/php-fpm

tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 4495/mysqld

tcp 0 0 0.0.0.0:875 0.0.0.0:* LISTEN 4022/rpc.rquotad

tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 3440/portmap

tcp 0 0 0.0.0.0:21 0.0.0.0:* LISTEN 4161/vsftpd

tcp 0 0 192.168.0.45:53 0.0.0.0:* LISTEN 3846/named

tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN 3846/named

tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 3940/sshd

tcp 0 0 0.0.0.0:23 0.0.0.0:* LISTEN 3972/xinetd

tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN 3954/cupsd

tcp 0 0 0.0.0.0:5560 0.0.0.0:* LISTEN 4962/mgmtd (mgmtd is the management daemon, one of the crm-related processes)

tcp 0 0 0.0.0.0:25 0.0.0.0:* LISTEN 4729/master

tcp 0 0 127.0.0.1:953 0.0.0.0:* LISTEN 3846/named

tcp 0 0 0.0.0.0:892 0.0.0.0:* LISTEN 4064/rpc.mountd

tcp 0 0 127.0.0.1:2207 0.0.0.0:* LISTEN 3911/python

tcp 0 0 :::22 :::* LISTEN 3940/sshd

tcp 0 0 ::1:953 :::* LISTEN 3846/named

[root@node1 ~]#

On the second node, 192.168.0.55, the mgmtd process has also been started:

[root@node2 ~]# netstat -tnlp

Active Internet connections (only servers)

Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name

tcp 0 0 127.0.0.1:2208 0.0.0.0:* LISTEN 3883/./hpiod

tcp 0 0 0.0.0.0:2049 0.0.0.0:* LISTEN -

tcp 0 0 0.0.0.0:32803 0.0.0.0:* LISTEN -

tcp 0 0 0.0.0.0:901 0.0.0.0:* LISTEN 3949/xinetd

tcp 0 0 127.0.0.1:9000 0.0.0.0:* LISTEN 3896/php-fpm

tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 4451/mysqld

tcp 0 0 0.0.0.0:682 0.0.0.0:* LISTEN 3468/rpc.statd

tcp 0 0 0.0.0.0:875 0.0.0.0:* LISTEN 3999/rpc.rquotad

tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 3416/portmap

tcp 0 0 0.0.0.0:21 0.0.0.0:* LISTEN 4138/vsftpd

tcp 0 0 192.168.0.55:53 0.0.0.0:* LISTEN 3823/named

tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN 3823/named

tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 3917/sshd

tcp 0 0 0.0.0.0:23 0.0.0.0:* LISTEN 3949/xinetd

tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN 3931/cupsd

tcp 0 0 0.0.0.0:5560 0.0.0.0:* LISTEN 11235/mgmtd

tcp 0 0 0.0.0.0:25 0.0.0.0:* LISTEN 4727/master

tcp 0 0 127.0.0.1:953 0.0.0.0:* LISTEN 3823/named

tcp 0 0 0.0.0.0:892 0.0.0.0:* LISTEN 4041/rpc.mountd

tcp 0 0 127.0.0.1:2207 0.0.0.0:* LISTEN 3888/python

tcp 0 0 :::22 :::* LISTEN 3917/sshd

tcp 0 0 ::1:953 :::* LISTEN 3823/named

[root@node2 ~]#

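crmd has to be running on every node

crm provides many command-line tools; some of them start with cib

On the first node, 192.168.0.45:

[root@node1 ~]# pwd

/root

[root@node1 ~]# cib (press Tab)

cibadmin ciblint

[root@node1 ~]# cibadmin --help # tool for managing the CIB; it works on the XML of the information base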

usage: cibadmin [V?o:QDUCEX:t:Srwlsh:MmBfbdRx:pP5] command

where necessary, XML data will be obtained using -X, -x, or -p options

Options

--obj_type (-o) <type> object type being operated on

Valid values are: nodes, resources, constraints, crm_config, status

--verbose (-V) turn on debug info. additional instance increase verbosity

--help (-?) this help message

Commands

--cib_erase (-E) Erase the contents of the whole CIB # erase the information base

--cib_query (-Q) # query the information base

--cib_create (-C)

--md5-sum (-5) Calculate an XML file's digest. Requires either -X, -x or -p

--cib_replace (-R) Recursivly replace an object in the CIB

--cib_update (-U) Recursivly update an object in the CIB

--cib_modify (-M) Find the object somewhere in the CIB's XML tree and update is as --cib_update would

--cib_delete (-D)

Delete the first object matching the supplied criteria

Eg. <op id="rsc1_op1" name="monitor"/>

The tagname and all attributes must match in order for the element to be deleted

--cib_delete_alt (-d)

Delete the object at specified fully qualified location

Eg. <resource id="rsc1"><operations><op id="rsc1_op1"/>...

Requires -o

--cib_bump (-B)

--cib_ismaster (-m)

--cib_sync (-S)

XML data

--crm_xml (-X) <string> Retrieve XML from the supplied string

--xml-file (-x) <filename> Retrieve XML from the named file

--xml-pipe (-p) Retrieve XML from STDIN

Advanced Options

--host (-h) send command to specified host. Applies to cib_query and cib_sync commands only

--local (-l) command takes effect locally on the specified host

--no-bcast (-b) command will not be broadcast even if it altered the CIB

--sync-call (-s) wait for call to complete before returning

[root@node1 ~]#

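[root@node1 ~]# crm # (press Tab)

crmadmin crm_failcount (tracks a resource's failure count) crm_resource (configures resources) crm_uuid

crm_attribute (sets or gets attributes) crm_master crm_sh (the crm shell, a command-line tool for crm) crm_verify (checks the main configuration file cib.xml for syntax errors after crm has been configured)

crm_diff crm_mon (crm monitor, monitors the cluster) crm_standby (puts a node into standby)

[root@node1 ~]# crm_mon # refreshes the current cluster status every 15 seconds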

Refresh in 14s...

============

Last updated: Tue Dec 1 09:02:20 2020

Current DC: node1.magedu.com (9d79885a-9277-4672-9da6-914b79278104) # the current DC

2 Nodes configured. # two nodes

0 Resources configured.

============

Node: node1.magedu.com (9d79885a-9277-4672-9da6-914b79278104): online # node1 is online

Node: node2.magedu.com (9c0242c7-8660-450b-a1dc-63eb26ef1636): online # node2 is online

On the first node, 192.168.0.45:

[root@node1 ~]# crm_resource --help

usage: crm_resource [-?VS] -(L|Q|W|D|C|P|p) [options]

--help (-?) : this help message

--verbose (-V) : turn on debug info. additional instances increase verbosity

--quiet (-Q) : Print only the value on stdout (for use with -W)

Commands

--list (-L) : List all resources

--query-xml (-x) : Query a resource

Requires: -r

--locate (-W) : Locate a resource # find which node a resource is running on

Requires: -r

--migrate (-M) : Migrate a resource from it current location. Use -H to specify a destination # manually move a resource from one node to another

If -H is not specified, we will force the resource to move by creating a rule for the current location and a score of -INFINITY

NOTE: This will prevent the resource from running on this node until the constraint is removed with -U

Requires: -r, Optional: -H, -f, --lifetime

--un-migrate (-U) : Remove all constraints created by -M

Requires: -r

--delete (-D) : Delete a resource from the CIB # delete a resource

Requires: -r, -t

--cleanup (-C) : Delete a resource from the LRM # clear a resource's state information

Requires: -r. Optional: -H

--reprobe (-P) : Recheck for resources started outside of the CRM # re-probe

Optional: -H

--refresh (-R) : Refresh the CIB from the LRM # refresh

Optional: -H

--set-parameter (-p) <string> : Set the named parameter for a resource # set a parameter on a resource

Requires: -r, -v. Optional: -i, -s, --meta

--get-parameter (-g) <string> : Get the named parameter for a resource # get a parameter

Requires: -r. Optional: -i, -s, --meta

--delete-parameter (-d) <string>: Delete the named parameter for a resource # delete a parameter

Requires: -r. Optional: -i, --meta

Options

--resource (-r) <string> : Resource ID

--resource-type (-t) <string> : Resource type (primitive, clone, group, ...)

--property-value (-v) <string> : Property value

--host-uname (-H) <string> : Host name

--meta : Modify a resource's configuration option rather than one which is passed to the resource agent script.

For use with -p, -g, -d

--lifetime (-u) <string> : Lifespan of migration constraints

--force (-f) : Force the resource to move by creating a rule for the current location and a score of -INFINITY

This should be used if the resource's stickiness and constraint scores total more than INFINITY (Currently 100,000)

NOTE: This will prevent the resource from running on this node until the constraint is removed with -U or the --lifetime duration expires

-s <string> : (Advanced Use Only) ID of the instance_attributes object to change

-i <string> : (Advanced Use Only) ID of the nvpair object to change/delete

[root@node1 ~]#

[root@node1 ~]# crm_standby --help # puts a node into standby

usage: crm_standby [-?V] -(u|U) -(D|G|v) [-l]

Options

--help (-?) : this help message

--verbose (-V) : turn on debug info. additional instances increase verbosity

--quiet (-Q) : Print only the value on stdout (use with -G)

--get-value (-G) : Retrieve rather than set the preference to be promoted

--delete-attr (-D) : Delete rather than set the attribute

--attr-value (-v) <string> : Value to use (ignored with -G)

--attr-id (-i) <string> : The 'id' of the attribute. Advanced use only.

--node-uuid (-u) <node_uuid> : UUID of the node to change # select the node to put into standby by UUID

--node-uname (-U) <node_uname> : uname of the node to change # select the node to put into standby by name

--lifetime (-l) <string> : How long the preference lasts (reboot|forever)

If a forever value exists, it is ALWAYS used by the CRM

instead of any reboot value

[root@node1 ~]#

[root@node1 ~]# crm_attribute --help

usage: crm_attribute [-?V] -(D|G|v) [options]

Options

--help (-?) : this help message

--verbose (-V) : turn on debug info. additional instances increase verbosity

--quiet (-Q) : Print only the value on stdout (use with -G)

--get-value (-G) : Retrieve rather than set the preference to be promoted

--delete-attr (-D) : Delete rather than set the attribute

--attr-value (-v) <string> : Value to use (ignored with -G)

--attr-id (-i) <string> : The 'id' of the attribute. Advanced use only.

--node-uuid (-u) <node_uuid> : UUID of the node to change # select the node by UUID (to get its attributes?)

--node-uname (-U) <node_uname> : uname of the node to change # select the node by name (to get its attributes?)

--set-name (-s) <string> : Set of attributes in which to read/write the attribute

--attr-name (-n) <string> : Attribute to set

--type (-t) <string> : Which section of the CIB to set the attribute: (nodes|status|crm_config)

-t=nodes options: -(U|u) -n [-s]

-t=status options: -(U|u) -n [-s]

-t=crm_config options: -n [-s]

--inhibit-policy-engine (-!) : Make a change and prevent the TE/PE from seeing it straight away.

You may think you want this option but you don't. Advanced use only - you have been warned!

[root@node1 ~]#

On the first node, 192.168.0.45:

[root@node1 ~]# crm_sh # enter the crm shell

crm #

crm # help # help command

Usage: crm (nodes|config|resources) # manage nodes | configuration | resources

crm #

crm # nodes

crm nodes # # a sub-mode, similar to configuring a switch?

crm nodes # help

Usage: nodes (status|list)

crm nodes #

crm nodes # list # list the current nodes

crm nodes #

crm nodes # status # get the status of each node

<node_state id="9d79885a-9277-4672-9da6-914b79278104" uname="node1.magedu.com" crmd="online" crm-debug-origin="do_lrm_query" shutdown="0" in_ccm="true" ha="active" join="member" expected="member">

<node_state id="9c0242c7-8660-450b-a1dc-63eb26ef1636" uname="node2.magedu.com" ha="active" crm-debug-origin="do_lrm_query" crmd="online" shutdown="0" in_ccm="true" join="member" expected="member">

crm nodes #

crm nodes # exit # quit

ERROR: ** unknown exception encountered, details follow

Traceback (most recent call last):

File "/usr/sbin/crm_sh", line 338, in ?

rc = main_loop(args)

File "/usr/sbin/crm_sh", line 258, in main_loop

return d()

File "/usr/sbin/crm_sh", line 257, in <lambda>

d = lambda: func(*cmd_args, **cmd_options)

File "/usr/lib/heartbeat/crm_commands.py", line 73, in exit

sys.exit(0)

NameError: global name 'sys' is not defined

[root@node1 ~]#
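
The traceback above says that the `exit` function in crm_commands.py calls `sys.exit(0)` while the name `sys` is not defined in that module. The sketch below reproduces this kind of failure in isolation and shows that an `import sys` makes the call work; the two source strings are stand-ins written for this example, not the contents of the real file, and the message printed by a current Python 3 differs slightly from the Python 2 wording in the transcript.

```python
# Stand-in for the failing function: it uses sys without importing it.
BROKEN = "def exit():\n    sys.exit(0)\n"
# Same function with the missing import added in front.
FIXED = "import sys\n" + BROKEN


def run_exit(source: str) -> str:
    """Define exit() from source in a fresh namespace, call it, report what happens."""
    namespace = {}
    exec(source, namespace)
    try:
        namespace["exit"]()
    except NameError as err:
        return f"NameError: {err}"
    except SystemExit as done:
        return f"SystemExit({done.code})"
    return "returned normally"


print(run_exit(BROKEN))
print(run_exit(FIXED))
```

The error is harmless here because it only occurs while leaving the shell, but it explains why every `exit` in crm_sh ends with a traceback.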

On the first node, 192.168.0.45:

[root@node1 ~]# crm_sh

crm # help

Usage: crm (nodes|config|resources)

crm # resources

crm resources # help

Usage: resources (status|list)

crm resources #

crm resources # list # list all resources

NO resources configured

crm resources #

crm resources # exit

ERROR: ** unknown exception encountered, details follow

Traceback (most recent call last):

File "/usr/sbin/crm_sh", line 338, in ?

rc = main_loop(args)

File "/usr/sbin/crm_sh", line 258, in main_loop

return d()

File "/usr/sbin/crm_sh", line 257, in <lambda>

d = lambda: func(*cmd_args, **cmd_options)

File "/usr/lib/heartbeat/crm_commands.py", line 73, in exit

sys.exit(0)

NameError: global name 'sys' is not defined

[root@node1 ~]#

On the first node, 192.168.0.45:

[root@node1 i386]# pwd

/root/i386

[root@node1 i386]#

[root@node1 i386]# ls hear*

heartbeat-2.1.4-11.el5.i386.rpm heartbeat-pils-2.1.4-11.el5.i386.rpm

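heartbeat-gui-2.1.4-11.el5.i386.rpm (the graphical interface; it can be displayed on Windows through an X server, otherwise it can only be opened from the Linux desktop. heartbeat-gui asks for an account and password. The account is hacluster, which is created automatically and has no password by default; set a password on whichever node you want to log in to as hacluster) heartbeat-stonith-2.1.4-11.el5.i386.rpm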

[root@node1 i386]#

[root@node1 i386]# tail /etc/passwd

redis:x:2530:2530::/home/redis:/bin/bash

vuser:x:2531:2531::/var/ftproot:/sbin/nologin

vsftpd:x:2532:2532::/home/vsftpd:/sbin/nologin

nfstest:x:510:510::/home/nfstest:/bin/bash

eucalyptus:x:2533:2533::/home/eucalyptus:/bin/bash

fedora:x:2534:2534::/home/fedora:/bin/bash

redhat:x:2535:2536::/home/redhat:/bin/bash

test:x:2536:2537::/home/test:/bin/bash

ha:x:2537:2538::/home/ha:/bin/bash

hacluster:x:307:307:heartbeat user:/var/lib/heartbeat/cores/hacluster:/sbin/nologin

[root@node1 i386]#

[root@node1 i386]# passwd hacluster

Changing password for user hacluster.

New UNIX password:

Retype new UNIX password:

passwd: all authentication tokens updated successfully.

[root@node1 i386]#

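[root@node1 i386]# hb_gui & # hb_gui is the heartbeat GUI; the ampersand runs it in the background

The GUI needs an X display on the Windows side, which is why PuTTY reported an error above.

I use Xshell with Xmanager, which can open it, as shown in the figure below; it is really a Linux window displayed on Windows.

[root@node1 i386]# hb_gui &

The figure below shows port 5560 of mgmtd

You can connect as the hacluster user on any node, but the user must have a password

Ma Ge's screenshot

There is a ping node, so three nodes are shown

The DC is node2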

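Node Name: the name of the node

Online: whether the node is online

Is it DC: whether the node is the DC

Type: member

Standby: whether the node is a standby node; false means it is not

No Quorum Policy: what to do when quorum is lost: stop, freeze, or ignore

with quorum # quorum is satisfied, because with the ping node there are three votes

symmetric cluster:

Symmetric cluster (noted as "left symmetric"): resources may move to any node, and you block the nodes they must not move to; a blocking policy. Think of guarding an ancient city gate by stopping whoever matches a wanted poster.

Asymmetric cluster (noted as "right symmetric"): resources may not move anywhere by default, and you allow the nodes they may move to; a permitting policy. Think of guarding the gate by letting through only those who carry a pass.

Stonith Enabled: a STONITH device is required; this enables STONITH

Stonith Action: reboot or power off; reboot is usually the best policy

Default Resource Stickiness: 0 means the resource is indifferent to where it runs; like grass on a wall that bends with the wind, it stays wherever you put it. 100 means it prefers to stay on the current node. A positive value means it prefers to stay on the current node, a negative value means it prefers to leave. It is the location constraint that expresses which particular node a resource prefers.

Default Resource Failure Stickiness: Ma Ge did not explain this

Is Managed Default: whether resources are managed by default; covered later with crm

DC Deadtime: how long before the DC is considered dead

Cluster Recheck Interval: the interval at which the cluster is rechecked

Election Timeout: timeout for electing the DC (this also covers computing where the resources of a failed node finally run, crmd, and so on)

Resources can be added, but nodes cannot (nodes are configured in ha.cf)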

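native is primitive, a basic resource

group: a group resource

location constraint

order constraint

colocation constraint

clone and master/slave also exist, but they cannot be created directly: the resource is first a primitive or a group and is then turned into a clone or master/slave.

My screen looks much like Ma Ge's; only the language differs

Create the vip and httpd resources

Resources => Add => native (ordinary) resource => OK

Resource ID, and which group it belongs to (members should start and move together, otherwise they end up on different nodes)

To change the crm configuration (the CIB), a tool normally connects behind the scenes to the crmd on the DC. The crmd on a non-DC node does not accept modifications; after the change is made on the DC, the DC synchronizes it to every other node. The tools connect to the DC automatically, so it does not matter on which node you run them.

The same holds for front-end tools elsewhere, for example with corosync: crm can be run on any node, and the work is coordinated behind the scenes to run on the DC.

When # hb_gui & brings up the GUI,

it is best to connect to the IP address of the DC.

Connecting to any node seems to work, but Ma Ge ran into a problem doing so,

hence "best". The DC is not fixed, though, so first check which IP the DC has.

I) The next few figures create a vip resource

Check on node 192.168.0.45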

[1]+ Exit 1 hb_gui

[root@node1 i386]# ifconfig # the VIP has been configured

eth0 Link encap:Ethernet HWaddr 00:0C:29:3D:B0:3C

inet addr:192.168.0.45 Bcast:192.168.0.255 Mask:255.255.255.0

inet6 addr: fe80::20c:29ff:fe3d:b03c/64 Scope:Link

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

RX packets:155631 errors:0 dropped:0 overruns:0 frame:0

TX packets:115299 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:1000

RX bytes:26058200 (24.8 MiB) TX bytes:53989377 (51.4 MiB)

Interrupt:67 Base address:0x2000

eth0:0 Link encap:Ethernet HWaddr 00:0C:29:3D:B0:3C

inet addr:192.168.0.50 Bcast:192.168.0.255 Mask:255.255.255.0

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

Interrupt:67 Base address:0x2000

lo Link encap:Local Loopback

inet addr:127.0.0.1 Mask:255.0.0.0

inet6 addr: ::1/128 Scope:Host

UP LOOPBACK RUNNING MTU:16436 Metric:1

RX packets:52128 errors:0 dropped:0 overruns:0 frame:0

TX packets:52128 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:0

RX bytes:45911894 (43.7 MiB) TX bytes:45911894 (43.7 MiB)

[root@node1 i386]#

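II) The next few figures create an httpd resource

In the two figures below, either the OCF apache agent or the LSB httpd script can be used

The OCF apache agent accepts parameters and is more flexible

The LSB httpd script is simpler to configure and accepts no parameters (LSB scripts generally do not)

I use the LSB httpd here

By default resources are balanced

and run on different nodes

A colocation constraint can force the two to stay together,

or they can be added to one group

The next few figures define a group and add them to it

The group has to be defined first; only then can resources be added to it

First delete the webip and httpd resources. The order is: stop, clean up (so that no state is retained), then delete

Start over by adding the group

1) Add the webip resource to the group

2) Add the httpd resource to the group

Right-click the group and choose start to start all resources in it

In Ma Ge's session the resources in the group started

and the site was reachable

On my side the resources in the group did not run

According to the log, webip seems to be the problem: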

[root@node1 i386]# tail /var/log/messages

Dec 1 16:16:58 node1 tengine: [4968]: info: unpack_graph: Unpacked transition 67: 0 actions in 0 synapses

Dec 1 16:16:58 node1 tengine: [4968]: info: run_graph: Transition 67: (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0)

Dec 1 16:16:58 node1 pengine: [4969]: WARN: process_pe_message: Transition 67: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-58.bz2

Dec 1 16:16:58 node1 tengine: [4968]: info: notify_crmd: Transition 67 status: te_complete - <null>

Dec 1 16:16:58 node1 pengine: [4969]: info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.

Dec 1 16:16:58 node1 haclient: on_event:evt:cib_changed

Dec 1 16:16:58 node1 crmd: [4961]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]

Dec 1 16:16:58 node1 haclient: on_event:evt:cib_changed

Dec 1 16:16:59 node1 mgmtd: [4962]: ERROR: unpack_rsc_op: Hard error: webip_start_0 failed with rc=6.

Dec 1 16:16:59 node1 mgmtd: [4962]: ERROR: unpack_rsc_op: Preventing webip from re-starting anywhere in the cluster

[root@node1 i386]#

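First stop, clean up and delete the webip (vip) resource

Then add the webip (vip) resource to the webservice resource group again

Stop webservice and start the webservice group again; after that it shows up as running

Resources in a group start from top to bottom by default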

[root@node1 i386]# ifconfig

eth0 Link encap:Ethernet HWaddr 00:0C:29:3D:B0:3C

inet addr:192.168.0.45 Bcast:192.168.0.255 Mask:255.255.255.0

inet6 addr: fe80::20c:29ff:fe3d:b03c/64 Scope:Link

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

RX packets:227174 errors:0 dropped:0 overruns:0 frame:0

TX packets:188330 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:1000

RX bytes:34990572 (33.3 MiB) TX bytes:120763522 (115.1 MiB)

Interrupt:67 Base address:0x2000

eth0:0 Link encap:Ethernet HWaddr 00:0C:29:3D:B0:3C

inet addr:192.168.0.55 Bcast:192.168.0.255 Mask:255.255.255.0

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

Interrupt:67 Base address:0x2000

lo Link encap:Local Loopback

inet addr:127.0.0.1 Mask:255.0.0.0

inet6 addr: ::1/128 Scope:Host

UP LOOPBACK RUNNING MTU:16436 Metric:1

RX packets:109376 errors:0 dropped:0 overruns:0 frame:0

TX packets:109376 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:0

RX bytes:111751682 (106.5 MiB) TX bytes:111751682 (106.5 MiB)

[root@node1 i386]#

It turned out that webip was misconfigured again. How to troubleshoot? Look at /var/log/messages

[root@node1 i386]# ifconfig # the VIP (eth0:0) configured on the first node 192.168.0.45 is 192.168.0.55, but it should be 192.168.0.50

eth0 Link encap:Ethernet HWaddr 00:0C:29:3D:B0:3C

inet addr:192.168.0.45 Bcast:192.168.0.255 Mask:255.255.255.0

inet6 addr: fe80::20c:29ff:fe3d:b03c/64 Scope:Link

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

RX packets:233584 errors:0 dropped:0 overruns:0 frame:0

TX packets:195205 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:1000

RX bytes:35835795 (34.1 MiB) TX bytes:126868439 (120.9 MiB)

Interrupt:67 Base address:0x2000

eth0:0 Link encap:Ethernet HWaddr 00:0C:29:3D:B0:3C

inet addr:192.168.0.55 Bcast:192.168.0.255 Mask:255.255.255.0

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

Interrupt:67 Base address:0x2000

lo Link encap:Local Loopback

inet addr:127.0.0.1 Mask:255.0.0.0

inet6 addr: ::1/128 Scope:Host

UP LOOPBACK RUNNING MTU:16436 Metric:1

RX packets:115689 errors:0 dropped:0 overruns:0 frame:0

TX packets:115689 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:0

RX bytes:117881712 (112.4 MiB) TX bytes:117881712 (112.4 MiB)

[root@node1 i386]#

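So stop, clean up and delete the webip (vip) resource again

Then add the webip (vip) resource to the webservice resource group once more

As the figure below shows, everything now works

http://192.168.0.50/index.html is reachable

Put node1 into standby and see what happens

The resources now run on node2. Remember: if something goes wrong, clean up the resources (for example the webservice group, the webip resource and the httpd resource) before modifying and re-adding them

In ha.cf, auto_failback on is set, so by default resources fail back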

[root@node1 ~]# cd /etc/ha.d/

[root@node1 ha.d]# vim ha.cf

auto_failback on

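Resource stickiness: for a resource to stay where it is, its score for the current node must be greater than 0 and greater than its location score for the other nodes

III) The next few figures create an nfs resource

1) Stop the resource group first

Colocation: whether the resources must stay together: true

Ordered: start in the order in which the resources were added: true

I had to clean up webservice, then clean up and delete httpd, and add httpd again; that way httpd ends up at the bottom

It works on my side, but in Ma Ge's session webstore would not mount (because the NFS IP address was wrong)

http://192.168.0.50/index.html

As the figure below shows, Ma Ge stopped the group first and then corrected the NFS address here (that is, the OCF parameter)

After correcting the NFS IP there was still a problem; he read the logs to work out how to solve it

On the first node, 192.168.0.45: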

[root@node1 ha.d]# cd /var/lib/heartbeat/crm

[root@node1 crm]# ls

cib.xml cib.xml.last cib.xml.sig cib.xml.sig.last

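[root@node1 crm]# cat cib.xml # shows the crm configuration

Remember: whenever an operation fails, clean up the resource before trying again.

The reason is that the cluster decides how to carry out the next transition (any action) from the state recorded last time.

Ma Ge's NFS IP fix did not help at first; in the end he cleaned up the resources, and the problem was solved

Next, take heartbeat down on node1 and see what happens

On the first node, 192.168.0.45: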

[root@node1 crm]# /etc/init.d/heartbeat stop

Stopping High-Availability services:

[OK]

[root@node1 crm]#

On the second node, 192.168.0.55 (the first node, 192.168.0.45, can no longer run this because its heartbeat is stopped):

[root@node2 ~]# tail /var/log/messages

Dec 2 09:36:13 node2 crmd: [11234]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]

Dec 2 09:36:43 node2 heartbeat: [11219]: WARN: node node1.magedu.com: is dead

Dec 2 09:36:43 node2 heartbeat: [11219]: info: Link node1.magedu.com:eth0 dead.

Dec 2 09:36:43 node2 crmd: [11234]: notice: crmd_ha_status_callback: Status update: Node node1.magedu.com now has status [dead]

Dec 2 09:37:49 node2 named[3823]: listening on IPv4 interface eth0:3, 192.168.0.50#53

Dec 2 09:38:20 node2 cib: [11230]: info: cib_stats: Processed 12 operations (2500.00us average, 0% utilization) in the last 10min

Dec 2 09:40:21 node2 crm_mon: [14931]: info: G_main_add_SignalHandler: Added signal handler for signal 15

Dec 2 09:40:21 node2 crm_mon: [14931]: info: G_main_add_SignalHandler: Added signal handler for signal 2

Dec 2 09:41:38 node2 crm_mon: [14934]: info: G_main_add_SignalHandler: Added signal handler for signal 15

Dec 2 09:41:38 node2 crm_mon: [14934]: info: G_main_add_SignalHandler: Added signal handler for signal 2

[root@node2 ~]#

On the second node, 192.168.0.55 (the first node, 192.168.0.45, can no longer run this because its heartbeat is stopped):

[root@node2 crm]# crm_mon # (i.e. crm monitor)

Refresh in 14s...

============

Last updated: Wed Dec 2 09:41:38 2020

Current DC: node2.magedu.com (9c0242c7-8660-450b-a1dc-63eb26ef1636)

2 Nodes configured.

1 Resources configured.

============

Node: node1.magedu.com (9d79885a-9277-4672-9da6-914b79278104): OFFLINE

Node: node2.magedu.com (9c0242c7-8660-450b-a1dc-63eb26ef1636): online

Resource Group: webservice

webip (ocf::heartbeat:IPaddr): Started node2.magedu.com

webstore (ocf::heartbeat:Filesystem): Started node2.magedu.com

httpd (lsb:httpd): Started node2.magedu.com

http://192.168.0.50/index.html is reachable

On the first node, 192.168.0.45:

[root@node1 crm]# /etc/init.d/heartbeat start

Starting High-Availability services:

2020/12/02_09:44:58 INFO: Resource is stopped

[OK]

[root@node1 crm]#

[root@node1 crm]# tail /var/log/messages # the log suggests that node1 (the first node, 192.168.0.45) has failed back

Dec 2 09:45:16 node1 lrmd: [12075]: WARN: For LSB init script, no additional parameters are needed.

Dec 2 09:45:16 node1 lrmd: [11863]: info: RA output: (httpd:start:stdout) Starting httpd:

Dec 2 09:45:16 node1 lrmd: [11863]: info: RA output: (httpd:start:stdout) [

Dec 2 09:45:16 node1 lrmd: [11863]: info: RA output: (httpd:start:stdout) OK

Dec 2 09:45:16 node1 lrmd: [11863]: info: RA output: (httpd:start:stdout) ]

Dec 2 09:45:16 node1 lrmd: [11863]: info: RA output: (httpd:start:stdout)

Dec 2 09:45:16 node1 crmd: [11866]: info: process_lrm_event: LRM operation httpd_start_0 (call=7, rc=0) complete

Dec 2 09:45:16 node1 setroubleshoot: SELinux is preventing httpd from loading /usr/local/apache/modules/libphp5.so which requires text relocation. For complete SELinux messages. run sealert -l 80d61f8e-d344-4d6a-836a-65f9b96e22a6

Dec 2 09:45:17 node1 haclient: on_event: from message queue: evt:cib_changed

[root@node1 crm]#

[root@node1 crm]# crm_mon # the monitor shows the resources back on node1 (the first node, 192.168.0.45)

Refresh in 14s...

============

Last updated: Wed Dec 2 09:50:57 2020

Current DC: node2.magedu.com (9c0242c7-8660-450b-a1dc-63eb26ef1636)

2 Nodes configured.

1 Resources configured.

============

Node: node1.magedu.com (9d79885a-9277-4672-9da6-914b79278104): online

Node: node2.magedu.com (9c0242c7-8660-450b-a1dc-63eb26ef1636): online

Resource Group: webservice

webip (ocf::heartbeat:IPaddr): Started node1.magedu.com

webstore (ocf::heartbeat:Filesystem): Started node1.magedu.com

httpd (lsb:httpd): Started node1.magedu.com

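Next, instead of using a group, tie the three resources together with constraints

A start order is also needed: webip, webstore, httpd

They can be made to prefer running on node1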
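
An order constraint of the form "start B after A" can be read as an edge in a dependency graph, and the start order is then a topological sort of that graph. The sketch below illustrates the idea for the three resources used here; it is not how the policy engine is implemented, and `start_order` is a name invented for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def start_order(order_constraints):
    """order_constraints: list of (first, then) pairs meaning 'start first before then'."""
    graph = {}
    for first, then in order_constraints:
        graph.setdefault(then, set()).add(first)   # 'then' depends on 'first'
        graph.setdefault(first, set())
    return list(TopologicalSorter(graph).static_order())


constraints = [("webip", "webstore"), ("webstore", "httpd")]
print(start_order(constraints))   # ['webip', 'webstore', 'httpd']
```

Stopping happens in the reverse order, which is why httpd has to be stopped before the filesystem it serves from can be unmounted.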

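1) Stop webservice

2) Clean up the resources in the webservice group, then clean up the webservice group itself

3) Delete the webservice group

4) Add the three resources

a) Add the webip resource

b) Add webstore

c) Add httpd

Start the three resources; they end up on different nodes

5) Add colocation constraints

Add a colocation between httpd and webstore

Add a colocation between webip and httpd

webip and httpd normally have no required start order,

but if httpd listens on webip, webip should be started before httpd

It is better to set the colocation between webstore and webip instead

webstore depends on webip; once this is set, the relationships look as follows

6) Add order constraints

7) The relationship between resources and nodes

The figure below shows that all resource stickiness values are 0

Try changing the resource stickiness to 100

Yesterday, for some reason, none of the services would start, and neither /var/log/messages nor crm_mon showed why

Clicking Default in the lower right corner, as in the figure below, got them running

Resource stickiness: how strongly a resource prefers to stay on its current node

Location of a resource relative to the nodes: which node it prefers

Location constraint: for any resource, giving it a very large score makes it prefer a particular node,

so that the combined score of the three resources ends up larger than the stickiness values (I did not fully follow this)

After clicking Apply in the figure above, the resources switch straight to node2, wherever they were before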

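Look at the stickiness and location-constraint scores of all resources

Score for node1: webip stickiness 100 + webstore stickiness 100 + httpd stickiness 100 = 300

Score for node2: webip location constraint (infinity) + webstore stickiness 100 + httpd stickiness 100 = infinity

So all the resources (which run together) prefer to run on node2

Remember: location-constraint scores are added up and compared; where a resource has no location-constraint score, its stickiness value is used in the calculation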
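
The arithmetic in these notes can be written down directly. The sketch below follows the simplified rule stated above (per resource, use the location score for a node if one exists, otherwise the stickiness, then add everything up); it mirrors the notes, not the actual algorithm of the policy engine, and the names `node_score`, `stickiness` and `location` are chosen for this example.

```python
INFINITY = float("inf")


def node_score(node, stickiness, location):
    """Sum, over all resources, the location score for node or else the stickiness."""
    total = 0
    for resource, sticky in stickiness.items():
        total += location.get((resource, node), sticky)
    return total


stickiness = {"webip": 100, "webstore": 100, "httpd": 100}
location = {("webip", "node2"): INFINITY}   # webip prefers node2 with score inf

scores = {node: node_score(node, stickiness, location) for node in ("node1", "node2")}
print(scores)                               # {'node1': 300, 'node2': inf}
print(max(scores, key=scores.get))          # node2
```

Because the three resources are tied together by colocation constraints, one infinite location score on webip is enough to pull all of them to node2.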

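The resources now run on node2. Power off node2 and see whether they move

node2 powered off, as in the figure below

After a while

http://192.168.0.50/index.html opens

After node2 goes down, node1 keeps waiting for node2's heartbeat, by default for 30 seconds,

so when node2 is simply powered off, failover takes longer than when the heartbeat service is stopped.

When node2 stops the heartbeat service, it tells the other nodes that it is going down, so failover is fast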
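
The difference between the two cases comes down to whether the surviving node has to wait out the dead timer. A minimal sketch, assuming the 30-second wait mentioned above and treating a clean service stop as an immediate notification; the function and parameter names are invented for this example.

```python
def seconds_until_declared_dead(last_heartbeat, now, deadtime=30.0,
                                announced_shutdown=False):
    """Return how much longer the peer waits before treating the node as dead."""
    if announced_shutdown:
        # A clean 'heartbeat stop' tells the peers, so there is nothing to wait for.
        return 0.0
    # After a power-off the peer only notices that heartbeats stopped arriving.
    return max(0.0, deadtime - (now - last_heartbeat))


# Last heartbeat seen at t=100s, current time t=110s.
print(seconds_until_declared_dead(100.0, 110.0))
print(seconds_until_declared_dead(100.0, 110.0, announced_shutdown=True))
```

Lowering the dead timer in ha.cf shortens failover after a crash, at the price of more false alarms on a busy or lossy network.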

After node2 is down,

on node 1, 192.168.0.45:

# it takes roughly 30 seconds before the final state below appears

[root@node1 ~]# crm_mon

Refresh in 14s...

============

Last updated: Thu Dec 3 13:12:52 2020

Current DC: node1.magedu.com (9d79885a-9277-4672-9da6-914b79278104)

2 Nodes configured.

3 Resources configured.

============

Node: node1.magedu.com (9d79885a-9277-4672-9da6-914b79278104): online

Node: node2.magedu.com (9c0242c7-8660-450b-a1dc-63eb26ef1636): OFFLINE

webip (ocf::heartbeat:IPaddr): Started node1.magedu.com

webstore (ocf::heartbeat:Filesystem): Started node1.magedu.com

httpd (lsb:httpd): Started node1.magedu.com

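After node2 reboots, it rejoins the cluster (make sure the heartbeat service starts automatically)

The resources prefer node2, so they move back to node2

heartbeat is somewhat simpler than corosync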
