Quick question about clustered ejabberd config. I have two nodes in an HA configuration. After a patch/reboot cycle, the servers would not start up. Here are the errors:
Node 0:
<0.147.0> Mnesia(ejabberd@XMPPUSW00): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, ejabberd@XMPPUSW01}
Node 1:
** FATAL ** Failed to merge schema: Cannot merge definitions of table session. Local = {cstruct,session,set,[ejabberd@XMPPUSW00,ejabberd@XMPPUSW01],[],[],0,read_write,false,[],[],false,session,[sid,usr,us,priority,info],[],[],[],{{1520429458666496,-576460752303423335,1},ejabberd@XMPPUSW01},{{61,0},{ejabberd@XMPPUSW00,{1495,65102,60000}}}}, Remote = {cstruct,session,set,[ejabberd@XMPPUSW00,ejabberd@XMPPUSW01],[],[],0,read_write,false,[4],[],false,session,[sid,usr,us,priority,info],[],[],[],{{1520429458666496,-576460752303423335,1},ejabberd@XMPPUSW01},{{60,0},{ejabberd@XMPPUSW00,{1495,65102,29000}}}}
Poking around at it, I was able to fix the issue in two ways:
1. Copying the schema.dat from one node to the other.
2. Using ejabberdctl set_master to force one of the nodes as an mnesia master.
I don't know whether setting one of the nodes as mnesia master is a particularly robust approach for general HA; I think I'd prefer to be set up as master/master. When I dumped and diffed the mnesia tables between the nodes, I was unable to find any table named 'session', and there were no schema differences (a few row diffs, though). Any idea what I might have wrong in my base configuration?
Thanks!
Disclaimer: I'm not a clustering expert, so I'll just share what I know in case it gives you ideas for solving your problem.
If one node is defined as master and the other as slave, try stopping the slave first, then the master.
The "session" mnesia table keeps one entry per XMPP client session: erlang pid, user jid, session priority... So this table is only relevant while the server is online; once it stops, the table contents are useless. I guess that's why its contents aren't dumped. That table may be replicated across nodes to allow the nodes to act as a cluster.
If one node stops and the other keeps receiving new session connections, then its table contents diverge from the stopped node's. When both synchronise again, their contents don't match. But as I mentioned, once the cluster shuts down, that table content is useless.
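Regarding your error report: it complains that the "session" table definitions, not the contents, are different. Comparing the two, only a couple of values differ: one definition has [] where the other has [4], and the final version tuples differ ({61,0} with 60000 versus {60,0} with 29000). I don't know exactly what those values refer to, though.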
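In case it helps to spot such differences without reading the terms by eye, here is a rough Python sketch that splits the two cstruct tuples from your log into their top-level fields and lists the positions that differ. The helper names (split_top_level, diff_fields) are made up for this example, and the splitter is deliberately naive: it only counts brackets and braces, so it would break on quoted atoms or strings that contain them.

```python
# Naive field-by-field comparison of two Erlang tuples given as text.
# Assumption: no quoted atoms/strings containing brackets, braces or commas.

LOCAL = ("{cstruct,session,set,[ejabberd@XMPPUSW00,ejabberd@XMPPUSW01],[],[],0,"
         "read_write,false,[],[],false,session,[sid,usr,us,priority,info],[],[],[],"
         "{{1520429458666496,-576460752303423335,1},ejabberd@XMPPUSW01},"
         "{{61,0},{ejabberd@XMPPUSW00,{1495,65102,60000}}}}")

REMOTE = ("{cstruct,session,set,[ejabberd@XMPPUSW00,ejabberd@XMPPUSW01],[],[],0,"
          "read_write,false,[4],[],false,session,[sid,usr,us,priority,info],[],[],[],"
          "{{1520429458666496,-576460752303423335,1},ejabberd@XMPPUSW01},"
          "{{60,0},{ejabberd@XMPPUSW00,{1495,65102,29000}}}}")


def split_top_level(term):
    """Return the top-level fields of a tuple written as '{a,b,...}'."""
    term = term.strip()
    if not (term.startswith("{") and term.endswith("}")):
        raise ValueError("expected a tuple like {a,b,c}")
    fields, current, depth = [], [], 0
    for ch in term[1:-1]:
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        if ch == "," and depth == 0:
            # A comma outside any nested list/tuple ends the current field.
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    fields.append("".join(current).strip())
    return fields


def diff_fields(local, remote):
    """List (position, local_value, remote_value) for fields that differ.

    Positions are 1-based, counting the leading record name as position 1.
    """
    a, b = split_top_level(local), split_top_level(remote)
    if len(a) != len(b):
        raise ValueError("tuples have different sizes: %d vs %d" % (len(a), len(b)))
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


if __name__ == "__main__":
    for position, local_value, remote_value in diff_fields(LOCAL, REMOTE):
        print("field %d: local=%s remote=%s" % (position, local_value, remote_value))
```

If Mnesia's internal cstruct record is laid out the way I assume (this is my guess, the log does not confirm it), the first differing position would be the table's index list and the last one its version stamp, which would suggest one node had an index added to the table that the other never saw.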