I have set up an ejabberd cluster with 4 nodes (all on separate machines). Two of the nodes are configured to persist the passwd table (RAM and disc copy). I want those 2 nodes to act as the persistent storage for passwd because they will sit behind an additional firewall. I will refer to them as node1 and node2.
Next, node3 and node4 are configured to rely on node1 and node2 by passing
-mnesia extra_db_nodes ['ejabberd@sol101','ejabberd@sol102']
in the ejabberdctl script that starts those nodes.
I do not have these entries in node1 and node2.
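For reference, the flag and its runtime Mnesia equivalent look roughly like this (a sketch only, node names as above; mnesia:change_config/2 is the standard call behind the flag, not something from my exact script):

    %% Passed to erl by the ejabberdctl script on node3 and node4:
    %%   -mnesia extra_db_nodes "['ejabberd@sol101','ejabberd@sol102']"
    %% Runtime equivalent, e.g. from a shell attached with `ejabberdctl debug`:
    {ok, Reached} = mnesia:change_config(extra_db_nodes,
                                         ['ejabberd@sol101', 'ejabberd@sol102']).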
The cluster starts fine and functions well. Clients are configured to connect only to node3 and node4. I can stop/start node3 and node4 and the clients (which internally retry against each server) fail over from one to the other. Each of these nodes can be stopped/started just fine (as long as either node3 or node4 is running).
Now, when I try to fail over node1 and node2, the interesting part begins. Note that I installed node1 first. When I shut down node1, the system still works (messages are delivered), and I can stop/start node1 with everything continuing to function. The same holds for node2 while node1 is running: node2 can be stopped/started fine.
Here is the issue... When node1 is shut down and I then bring node2 down and try to restart it, node2 fails to open its ports: it waits for node1 to be in a running state. As soon as I start node1, node2 opens its ports (5222, 5280) and becomes functional. This implies a master/slave relationship, even though I have given node1 and node2 the same Mnesia DB settings (see below).
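For anyone reproducing this, the stuck state can be inspected from an Erlang shell attached to node2 (via ejabberdctl debug) with standard Mnesia introspection calls, e.g.:

    %% Run on node2 while it refuses to open its ports.
    mnesia:system_info(db_nodes).             %% all nodes known to the schema
    mnesia:system_info(running_db_nodes).     %% which of them are currently up
    mnesia:system_info(local_tables).         %% tables this node holds a copy of
    mnesia:table_info(passwd, where_to_read). %% where passwd reads would be served from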
It gets even more interesting when I use the web admin interface to change the Mnesia DB settings (I was doing this to test different settings). It turns out that when I change a setting on node2 and submit it via the web admin console, that action effectively makes node2 the "master". Now when I test stop/start failover between node1 and node2, the opposite effect happens: node1 will stop/start just fine as long as node2 is running, but as soon as node2 is down, node1 will not start ("not start" meaning the beam process runs but it does not open its ports).
This unexpected "master" relationship can be toggled depending on which node I last committed a change to via the web admin console.
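Mnesia does have an explicit master-node setting; whether the web admin touches it I do not know, but it can be checked with standard calls:

    %% Standard Mnesia checks for explicitly configured master nodes.
    mnesia:system_info(master_node_tables).  %% tables that have master nodes set
    mnesia:table_info(passwd, master_nodes). %% master nodes (if any) for passwd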
Am I missing something? Is this how ejabberd clustering should work?
What I want (and expected) is to be able to configure two nodes with "RAM and disc copy" so that either can be stopped/started with zero reliance on the other. Is this possible?
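If the layout below is right, I would expect a check like this (standard Mnesia call, run on either storage node) to list both node1 and node2 as disc copy holders for passwd:

    %% Should return both 'ejabberd@sol101' and 'ejabberd@sol102' if passwd is
    %% "RAM and disc copy" on node1 and node2 as configured below.
    mnesia:table_info(passwd, disc_copies).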
ejabberd version 2.1.18
node1 and node2 Database configuration:
acl: RAM and disc copy
captcha: RAM copy
config: RAM and disc copy
iq_response: RAM copy
local_config: RAM and disc copy
passwd: RAM and disc copy
reg_users_counter: RAM copy
route: RAM copy
s2s: RAM copy
schema: RAM and disc copy
session: RAM copy
session_counter: RAM copy
node3 and node4 Database configuration:
acl: RAM copy
captcha: RAM copy
config: RAM copy
iq_response: RAM copy
local_config: RAM copy
passwd: Remote copy
reg_users_counter: Remote copy
route: RAM copy
s2s: RAM copy
schema: RAM copy
session: RAM copy
session_counter: RAM copy
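For completeness, my understanding is that the passwd layout on node1 and node2 above corresponds roughly to the following Mnesia calls (a sketch only; the web admin presumably issues something equivalent, and the node names are the same sol101/sol102 from the extra_db_nodes setting):

    %% Sketch of expressing "RAM and disc copy" for passwd directly in Mnesia.
    mnesia:change_table_copy_type(passwd, 'ejabberd@sol101', disc_copies). %% existing copy -> RAM+disc
    mnesia:add_table_copy(passwd, 'ejabberd@sol102', disc_copies).         %% add a RAM+disc copy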