Case: the CDH CM node went down and the two NameNodes could not communicate with each other

Error

Error reported by CM for HDFS (because CM was down, this message could not be viewed in the admin UI; it was taken from the logs):

  • The health test result for HDFS_CANARY_HEALTH has become bad: Canary test failed to create parent directory for /opt/tmp/.cloudera_health_monitoring_canary_files.



Troubleshooting and resolution

(1) The CDH CM node is down

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server status

cloudera-scm-server dead but pid file exists


[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /usr/java/jdk1.8.0_111/bin/jps

20656 Main

20626 Main

25667 Jps

20630 EventCatcherService

20632 AlertPublisher

29995 Main

10619 -- process information unavailable


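# The jps output above is missing one Main process, and the ss output below shows nothing listening on port 7180, so CM did not start properly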

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# ss -nltup|grep 718*

tcp    LISTEN     0      50                     *:7184                  *:*      users:(("java",20630,233))

tcp    LISTEN     0      50                     *:7185                  *:*      users:(("java",20630,241))

tcp    LISTEN     0      5                      *:4433                  *:*      users:(("python2.6",17152,8))

tcp    LISTEN     0      5              127.0.0.1:7190                  *:*      users:(("python2.6",17152,11))

tcp    LISTEN     0      5                      *:7191                  *:*      users:(("python2.6",17152,7))
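
The `grep 718*` filter above is looser than it looks. Left unquoted, the shell may glob-expand it if a matching file name exists, and as a regular expression `718*` means "71" followed by zero or more 8s. That is why the lines for ports 4433, 7190 and 7191 also show up (the PID 17152 contains "71"). A minimal Python sketch of a stricter check, assuming `ss -nlt`-style lines as input; the function name `listening_ports` is hypothetical:

```python
import re

def listening_ports(ss_output):
    """Return the set of local TCP ports in LISTEN state from ss-style output."""
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: Netid State Recv-Q Send-Q Local:Port Peer:Port ...
        if len(fields) >= 5 and fields[0] == "tcp" and fields[1] == "LISTEN":
            match = re.search(r":(\d+)$", fields[4])
            if match:
                ports.add(int(match.group(1)))
    return ports

# Lines copied from the ss output above.
SAMPLE = """\
tcp    LISTEN     0      50                     *:7184                  *:*      users:(("java",20630,233))
tcp    LISTEN     0      50                     *:7185                  *:*      users:(("java",20630,241))
tcp    LISTEN     0      5                      *:4433                  *:*      users:(("python2.6",17152,8))
tcp    LISTEN     0      5              127.0.0.1:7190                  *:*      users:(("python2.6",17152,11))
tcp    LISTEN     0      5                      *:7191                  *:*      users:(("python2.6",17152,7))
"""

ports = listening_ports(SAMPLE)
cm_web_ui_up = 7180 in ports  # 7180 is the CM web UI port used in this write-up
print(sorted(ports), cm_web_ui_up)
```

Comparing parsed port numbers avoids the accidental matches that the loose pattern produces.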



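# Our CDH metadata is stored in a MySQL database. With CM down we could not view the other CDH components in the UI, so we queried the database to find out which nodes this CDH cluster contains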

mysql> select * from hosts;
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
| HOST_ID | OPTIMISTIC_LOCK_VERSION | HOST_IDENTIFIER                      | NAME                       | IP_ADDRESS     | RACK_ID  | STATUS | CONFIG_CONTAINER_ID | MAINTENANCE_COUNT | DECOMMISSION_COUNT | CLUSTER_ID | NUM_CORES | TOTAL_PHYS_MEM_BYTES | PUBLIC_NAME | PUBLIC_IP_ADDRESS | CLOUD_PROVIDER |
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
|       1 |                      11 | 264b10bb-b488-4ee7-8fcd-3c68f7a8860a | ec6s-logshedcl58manager-01 | 10.177.101.146 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       2 |                      17 | b584457b-705d-4b1f-8000-df0e6da1838d | ec6s-logshedcl58dn-03      | 10.177.102.38  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       3 |                      16 | e28dabc1-c105-464e-8bf6-0bd0435ace9a | ec6s-logshedcl58dn-02      | 10.177.102.193 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       4 |                      17 | 994cf04e-2510-426a-8336-6e2d28a3001d | ec6s-logshedcl58nn-02      | 10.177.102.218 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       5 |                      16 | a9cab0d5-5e48-49a7-8fb0-e57a0bac16db | ec6s-logshedcl58nn-01      | 10.177.101.60  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       6 |                      16 | 60bf1721-d6db-4d72-9164-41d89f81e789 | ec6s-logshedcl58dn-01      | 10.177.101.64  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
6 rows in set (0.00 sec)
mysql> select * from roles;
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
| ROLE_ID | NAME                                                     | HOST_ID | ROLE_TYPE          | CONFIGURED_STATUS | SERVICE_ID | MERGED_KEYTAB | MAINTENANCE_COUNT | DECOMMISSION_COUNT | OPTIMISTIC_LOCK_VERSION | ROLE_CONFIG_GROUP_ID | HAS_EVER_STARTED |
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
|      14 | mgmt-HOSTMONITOR-92f15c379891f3c8dbdbbcbe57db9067        |       1 | HOSTMONITOR        | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   25 |                1 |
|      15 | mgmt-EVENTSERVER-92f15c379891f3c8dbdbbcbe57db9067        |       1 | EVENTSERVER        | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   21 |                1 |
|      16 | mgmt-ACTIVITYMONITOR-92f15c379891f3c8dbdbbcbe57db9067    |       1 | ACTIVITYMONITOR    | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   22 |                1 |
|      17 | mgmt-SERVICEMONITOR-92f15c379891f3c8dbdbbcbe57db9067     |       1 | SERVICEMONITOR     | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   24 |                1 |
|      18 | mgmt-ALERTPUBLISHER-92f15c379891f3c8dbdbbcbe57db9067     |       1 | ALERTPUBLISHER     | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   20 |                1 |
|      19 | zookeeper-SERVER-5779e83332b2c66cc02029a8ab2c3628        |       3 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      20 | zookeeper-SERVER-c103ed4dcdd93fc8bbaf467aa1c6d927        |       2 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      21 | zookeeper-SERVER-dc971e0a60f4e798e85e2ab9bd57a041        |       6 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      23 | hdfs-NAMENODE-ed39ed17d751bee1bd6ad84c0db46ca1           |       5 | NAMENODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      22 |                   30 |                1 |
|      24 | hdfs-DATANODE-c103ed4dcdd93fc8bbaf467aa1c6d927           |       2 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      25 | hdfs-DATANODE-5779e83332b2c66cc02029a8ab2c3628           |       3 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      26 | hdfs-DATANODE-dc971e0a60f4e798e85e2ab9bd57a041           |       6 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      27 | hdfs-NAMENODE-16c21945a5f07e23a510dd5e32caa6dd           |       4 | NAMENODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                       6 |                   30 |                1 |
|      28 | hdfs-FAILOVERCONTROLLER-ed39ed17d751bee1bd6ad84c0db46ca1 |       5 | FAILOVERCONTROLLER | RUNNING           |          6 | NULL          |                 0 |                  0 |                       4 |                   29 |                1 |
|      29 | hdfs-FAILOVERCONTROLLER-16c21945a5f07e23a510dd5e32caa6dd |       4 | FAILOVERCONTROLLER | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   29 |                1 |
|      30 | hdfs-JOURNALNODE-c103ed4dcdd93fc8bbaf467aa1c6d927        |       2 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      31 | hdfs-JOURNALNODE-dc971e0a60f4e798e85e2ab9bd57a041        |       6 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      32 | hdfs-JOURNALNODE-5779e83332b2c66cc02029a8ab2c3628        |       3 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      36 | kafka-KAFKA_BROKER-c103ed4dcdd93fc8bbaf467aa1c6d927      |       2 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                       9 |                   40 |                1 |
|      37 | kafka-KAFKA_BROKER-ed39ed17d751bee1bd6ad84c0db46ca1      |       5 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                      10 |                   40 |                1 |
|      38 | kafka-KAFKA_BROKER-16c21945a5f07e23a510dd5e32caa6dd      |       4 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                      10 |                   40 |                1 |
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
21 rows in set (0.00 sec)
mysql> select * from services;
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
| SERVICE_ID | OPTIMISTIC_LOCK_VERSION | NAME      | SERVICE_TYPE | CLUSTER_ID | MAINTENANCE_COUNT | DISPLAY_NAME                | GENERATION |
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
|          4 |                      14 | mgmt      | MGMT         |       NULL |                 0 | Cloudera Management Service |          1 |
|          5 |                       7 | zookeeper | ZOOKEEPER    |          5 |                 0 | ZooKeeper                   |          1 |
|          6 |                      23 | hdfs      | HDFS         |          5 |                 0 | HDFS                        |          1 |
|          8 |                      15 | kafka     | KAFKA        |          5 |                 0 | Kafka                       |          1 |
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
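
To answer "which roles run on which host" in one step, the `hosts` and `roles` tables can be joined on `HOST_ID`. A minimal sketch, assuming only the columns shown above. It uses Python's built-in `sqlite3` with a few rows copied from the output as a stand-in for the real MySQL database, so the connection setup is illustrative only, while the `SELECT` is plain SQL that should also work at the `mysql>` prompt:

```python
import sqlite3

# In-memory stand-in for the CM database; only the columns needed for the join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hosts (HOST_ID INTEGER, NAME TEXT, IP_ADDRESS TEXT);
CREATE TABLE roles (ROLE_ID INTEGER, HOST_ID INTEGER, ROLE_TYPE TEXT);
""")
conn.executemany("INSERT INTO hosts VALUES (?, ?, ?)", [
    (4, "ec6s-logshedcl58nn-02", "10.177.102.218"),
    (5, "ec6s-logshedcl58nn-01", "10.177.101.60"),
    (6, "ec6s-logshedcl58dn-01", "10.177.101.64"),
])
conn.executemany("INSERT INTO roles VALUES (?, ?, ?)", [
    (21, 6, "SERVER"), (23, 5, "NAMENODE"), (26, 6, "DATANODE"),
    (27, 4, "NAMENODE"), (28, 5, "FAILOVERCONTROLLER"),
    (29, 4, "FAILOVERCONTROLLER"), (31, 6, "JOURNALNODE"),
])

QUERY = """
SELECT h.NAME, h.IP_ADDRESS, r.ROLE_TYPE
FROM roles r JOIN hosts h ON r.HOST_ID = h.HOST_ID
ORDER BY h.NAME, r.ROLE_TYPE
"""

# Group the role types by host name.
roles_by_host = {}
for name, ip, role_type in conn.execute(QUERY):
    roles_by_host.setdefault(name, []).append(role_type)

for name, role_types in roles_by_host.items():
    print(name, role_types)
```

The grouped result makes it easy to see, for example, that the two NameNode hosts each also run a failover controller.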


# Restart the cloudera-scm-server service

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server status

cloudera-scm-server dead but pid file exists


[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server stop 

cloudera-scm-server is already stopped


[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# cat /var/run/cloudera-scm-server.pid

10617

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# ps -ef|grep 10617

root     28331 27755  0 19:02 pts/3    00:00:00 grep 10617
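
The "dead but pid file exists" status means the PID file still names a process (10617) that is no longer running, which the `ps` check above confirms. A minimal Python sketch of the same stale-PID-file check, assuming a POSIX system; the helper names are hypothetical:

```python
import os

def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def pid_file_is_stale(path):
    """Return True if the PID file names a process that no longer exists."""
    with open(path) as f:
        pid = int(f.read().strip())
    return not pid_is_running(pid)
```

For the case here, `pid_file_is_stale("/var/run/cloudera-scm-server.pid")` would be expected to return `True` before the restart.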


[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20656

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20626

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20630

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 29995

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20632



[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server start

[root@ec6s-logshedcl58manager-01 ~]#  /etc/init.d/cloudera-scm-server status

cloudera-scm-server (pid  1378) is running...


# Started normally

[root@ec6s-logshedcl58manager-01 ~]# /usr/java/jdk1.8.0_111/bin/jps

1380 Main

2469 Main

2471 EventCatcherService

7272 Jps

2473 AlertPublisher

2475 Main

2462 Main



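(2) The two NameNodes could not communicate with each other, but neither had gone down

Once CM was back up, we could manage the NameNodes through the web UI. The UI showed that the NameNodes could not communicate with each other and could not write edit logs to the JournalNodes.

Error in the log:

Jul 18, 5:38:09.355 PM FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog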
Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.177.101.64:8485, 10.177.102.193:8485, 10.177.102.38:8485], stream=QuorumOutputStream starting at txid 1338050))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:651)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:585)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2752)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2624)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:599)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:112)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:401)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2141)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1783)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2135)
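
The first lines of this error carry the facts that matter for the fix: the three JournalNode addresses and the 20000 ms limit. A quorum write needs acknowledgements from a majority of the JournalNodes, so with three of them at least two must answer in time. A minimal Python sketch of pulling these values out of the two log lines above; the helper names are hypothetical:

```python
import re

LOG_LINE = (
    "Error: flush failed for required journal (JournalAndStream(mgr=QJM to "
    "[10.177.101.64:8485, 10.177.102.193:8485, 10.177.102.38:8485], "
    "stream=QuorumOutputStream starting at txid 1338050))"
)
TIMEOUT_LINE = "java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond."

def journal_nodes(line):
    """Extract the JournalNode host:port pairs from a 'QJM to [...]' message."""
    match = re.search(r"QJM to \[([^\]]*)\]", line)
    if not match:
        return []
    return [addr.strip() for addr in match.group(1).split(",") if addr.strip()]

def quorum_size(node_count):
    """Smallest majority of node_count JournalNodes."""
    return node_count // 2 + 1

def timeout_ms(line):
    """Extract the timeout from a 'Timed out waiting <N>ms' message, or None."""
    match = re.search(r"Timed out waiting (\d+)ms", line)
    return int(match.group(1)) if match else None

nodes = journal_nodes(LOG_LINE)
print(nodes, quorum_size(len(nodes)), timeout_ms(TIMEOUT_LINE))
```

The extracted addresses are the hosts whose network path to the NameNode is worth checking first.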



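The log shows that the NameNode failed to write to the journal: it timed out waiting for a quorum of JournalNodes to respond. Our company runs on AWS EC2, and network maintenance may have been under way at the time, making the instance network unstable. If this timeout occurs, the default of 20 s can be raised to 60 s, for example: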

#vim /etc/hadoop/conf/hdfs-site.xml 

<property>
        <name>dfs.qjournal.write-txns.timeout.ms</name>
        <value>60000</value>
</property>
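
After editing, it can be worth confirming that the file still parses and that the property holds the intended value. A minimal Python sketch, assuming the file uses the usual `<configuration>` root element (the wrapper in `SAMPLE` is an assumption, as is the helper name `get_property`):

```python
import xml.etree.ElementTree as ET

def get_property(xml_text, name):
    """Return the <value> of the named Hadoop <property>, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None

# Stand-in for the contents of hdfs-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>dfs.qjournal.write-txns.timeout.ms</name>
    <value>60000</value>
  </property>
</configuration>"""

timeout = get_property(SAMPLE, "dfs.qjournal.write-txns.timeout.ms")
print(timeout)
```

To check the real file, the same function could be given the text read from `/etc/hadoop/conf/hdfs-site.xml`.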


Then restart the two NameNodes one at a time through the CM admin console: http://10.177.101.146:7180
