
How to handle ZooKeeper session expired

2014-11-18 20:22

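What should you do when a ZK session expires?

The official documentation covers this at some length; see the English excerpts below. In short, it comes down to the following points: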

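  • When a zk client loses its connection to all servers (for whatever reason), it receives a disconnected event. Once the zk servers are back, the client reconnects automatically, but if the outage lasted longer than the session timeout, the session has already expired and the client receives a session expired event. Everything tied to the previous session (its ephemeral nodes and watches) is gone. What you do next depends on what your program uses ZK for:
    • If you only read and write, with no master/standby failover (one master and one standby, where ZK notifies the standby to take over when the master goes down), create a new session and rebuild the tree structure the old session had.
    • If you do master/standby failover, you cannot simply rebuild the tree, because you cannot tell whether the master really went down or its session merely expired. The only safe option is to handle it as a real master failure.
  • It is rare for every server in a ZK cluster to be unavailable, but session expired still deserves serious attention.
  • In general, you do not need to care about one or two machines in the cluster going down and coming back up; the Apache zk client handles that for you automatically.
  • disconf also takes this problem into account: https://github.com/knightliao/disconf/wiki/Zookeeper%E5%BC%82%E5%B8%B8%E8%80%83%E8%99%91
  • On a Mac, you can use IceFloor to simulate a zk session expired.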
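
The decision in the first bullet can be written down as a small policy function. The following is only a sketch: the enums `SessionState`, `Usage` and `Action` are hypothetical names made up for this example, not part of the ZooKeeper API (with the Apache Java client, the corresponding events arrive in a `Watcher` as `Watcher.Event.KeeperState.Disconnected` and `Watcher.Event.KeeperState.Expired`). It deliberately avoids the ZooKeeper jar so that it runs on its own.

```java
import java.util.Objects;

/** Hypothetical recovery policy; none of these names come from the ZooKeeper API. */
final class SessionRecoveryPolicy {

    /** Session states as seen by the client's watcher (illustrative names). */
    enum SessionState { CONNECTED, DISCONNECTED, EXPIRED }

    /** How the application uses ZooKeeper. */
    enum Usage { READ_WRITE_ONLY, MASTER_STANDBY }

    /** What the application should do next. */
    enum Action { CONTINUE, WAIT_FOR_RECONNECT, NEW_SESSION_AND_REBUILD_TREE, HANDLE_AS_MASTER_FAILURE }

    static Action decide(SessionState state, Usage usage) {
        Objects.requireNonNull(state, "state");
        Objects.requireNonNull(usage, "usage");
        switch (state) {
            case CONNECTED:
                return Action.CONTINUE;
            case DISCONNECTED:
                // The client reconnects by itself; the session may still be alive.
                return Action.WAIT_FOR_RECONNECT;
            case EXPIRED:
                // Plain readers/writers open a new session and rebuild their nodes.
                // With failover we cannot tell "master died" from "session expired",
                // so we must assume the master is really gone.
                return usage == Usage.READ_WRITE_ONLY
                        ? Action.NEW_SESSION_AND_REBUILD_TREE
                        : Action.HANDLE_AS_MASTER_FAILURE;
            default:
                throw new IllegalStateException("unknown state: " + state);
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(SessionState.EXPIRED, Usage.READ_WRITE_ONLY));
        System.out.println(decide(SessionState.EXPIRED, Usage.MASTER_STANDBY));
    }
}
```

In a real program the `EXPIRED` branch would also have to create a fresh `ZooKeeper` handle, since the old one is closed once the session has expired.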

http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A9

SESSION_EXPIRED automatically closes the ZooKeeper handle. In a correctly operating cluster, you should never see SESSION_EXPIRED. It means that the client was partitioned off from the ZooKeeper service for more than the session timeout and ZooKeeper decided that the client died. Because the ZooKeeper service is ground truth, the client should consider itself dead and go into recovery. If the client is only reading state from ZooKeeper, recovery means just reconnecting. In more complex applications, recovery means recreating ephemeral nodes, vying for leadership roles, and reconstructing published state.

Library writers should be conscious of the severity of the expired state and not try to recover from it. Instead, libraries should return a fatal error. Even if the library is simply reading from ZooKeeper, the user of the library may also be doing other things with ZooKeeper that require more complex recovery.

Session expiration is managed by the ZooKeeper cluster itself, not by the client. When the ZK client establishes a session with the cluster it provides a "timeout" value. This value is used by the cluster to determine when the client's session expires. Expiration happens when the cluster does not hear from the client within the specified session timeout period (i.e. no heartbeat). At session expiration the cluster will delete any/all ephemeral nodes owned by that session and immediately notify any/all connected clients of the change (anyone watching those znodes). At this point the client of the expired session is still disconnected from the cluster; it will not be notified of the session expiration until/unless it is able to re-establish a connection to the cluster. The client will stay in the disconnected state until the TCP connection is re-established with the cluster, at which point the watcher of the expired session will receive the "session expired" notification.
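
To make the cluster-side bookkeeping in the paragraph above concrete, here is a toy model in Java. It is a simplified sketch, not how the ZooKeeper server is implemented: `ExpirySimulation` and all of its methods are hypothetical, time is passed in by hand as milliseconds, and watch notifications are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Toy model of cluster-side session expiry; all names are hypothetical. */
final class ExpirySimulation {
    private final Map<Long, Long> lastHeard = new HashMap<>();        // session id -> last heartbeat (ms)
    private final Map<Long, Integer> timeouts = new HashMap<>();      // session id -> timeout (ms)
    private final Map<String, Long> ephemeralOwner = new HashMap<>(); // znode path -> owning session id

    /** The client supplies the timeout when the session is established. */
    void openSession(long id, int timeoutMs, long now) {
        lastHeard.put(id, now);
        timeouts.put(id, timeoutMs);
    }

    /** A heartbeat only counts while the session still exists. */
    void heartbeat(long id, long now) {
        if (lastHeard.containsKey(id)) {
            lastHeard.put(id, now);
        }
    }

    void createEphemeral(String path, long id) {
        ephemeralOwner.put(path, id);
    }

    boolean exists(String path) {
        return ephemeralOwner.containsKey(path);
    }

    /** Expires every session not heard from within its timeout and deletes its ephemeral nodes. */
    List<Long> expireSessions(long now) {
        List<Long> expired = new ArrayList<>();
        Iterator<Map.Entry<Long, Long>> it = lastHeard.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> entry = it.next();
            long id = entry.getKey();
            if (now - entry.getValue() > timeouts.get(id)) {
                it.remove();
                timeouts.remove(id);
                ephemeralOwner.values().removeIf(owner -> owner == id);
                expired.add(id);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        ExpirySimulation sim = new ExpirySimulation();
        sim.openSession(1L, 3000, 0L);
        sim.createEphemeral("/master", 1L);
        sim.heartbeat(1L, 2000L);
        System.out.println(sim.expireSessions(4000L)); // within the timeout, nothing expires
        System.out.println(sim.expireSessions(5001L)); // 3001 ms of silence exceeds the 3000 ms timeout
        System.out.println(sim.exists("/master"));
    }
}
```

The point of the model is that the expiry decision and the ephemeral node cleanup happen entirely on the cluster side; the client of session 1 plays no part in it and may not even know yet.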

Example state transitions for an expired session as seen by the expired session's watcher:

  • 'connected' : session is established and client is communicating with cluster (client/server communication is operating properly) .... client is partitioned from the cluster
  • 'disconnected' : client has lost connectivity with the cluster .... time elapses, after 'timeout' period the cluster expires the session, nothing is seen by client as it is disconnected from cluster .... time elapses, the client regains network level connectivity with the cluster
  • 'expired' : eventually the client reconnects to the cluster, it is then notified of the expiration
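
The transitions listed above can also be traced with a tiny client-side state machine. This is a sketch under assumptions: `ClientSessionView` is a hypothetical class, not the ZooKeeper client, and it approximates the cluster's decision by comparing the length of the outage with the session timeout (the real cluster measures from the last time it heard from the client).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of what an expired session's watcher observes. */
final class ClientSessionView {
    enum State { CONNECTED, DISCONNECTED, EXPIRED }

    private final int timeoutMs;
    private final List<State> history = new ArrayList<>();
    private State state = State.CONNECTED;
    private long disconnectedAt = -1L;

    ClientSessionView(int timeoutMs) {
        this.timeoutMs = timeoutMs;
        history.add(state);
    }

    /** The client is partitioned from the cluster. */
    void connectionLost(long now) {
        if (state == State.CONNECTED) {
            disconnectedAt = now;
            transition(State.DISCONNECTED);
        }
    }

    /**
     * Network connectivity returns. Only now does the client learn the outcome:
     * while disconnected it sees nothing, even if the session expired long ago.
     */
    void connectionRestored(long now) {
        if (state != State.DISCONNECTED) {
            return; // an expired session never comes back; a connected one has nothing to do
        }
        transition(now - disconnectedAt > timeoutMs ? State.EXPIRED : State.CONNECTED);
    }

    List<State> history() {
        return new ArrayList<>(history);
    }

    private void transition(State next) {
        state = next;
        history.add(next);
    }

    public static void main(String[] args) {
        ClientSessionView view = new ClientSessionView(3000);
        view.connectionLost(1000L);
        view.connectionRestored(10000L); // 9000 ms outage, longer than the 3000 ms timeout
        System.out.println(view.history()); // prints [CONNECTED, DISCONNECTED, EXPIRED]
    }
}
```

If the connection comes back within the timeout instead, the same model goes from `DISCONNECTED` straight back to `CONNECTED` and no expiration is reported.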

Other references