Hi,
I'm tearing my hair out trying to get eJabberd clustering working. I have two eJabberd servers running behind a load balancer; the publicly visible hostname is:
jabber.example.local
and the server names are:
node1.example.local
node2.example.local
eJabberd is installed as the user "ejabberd" on both servers, and running in /u01/ejabberd-2.1.13/
I configured the host on both servers as below:
{hosts, ["jabber.example.local"]}.
and I have tried configuring each node name as both the FQDN and the short name (node1.example.local and node1 respectively).
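For reference, the node name I'm talking about is the ERLANG_NODE setting in ejabberdctl.cfg (path assumed from the install prefix above, i.e. /u01/ejabberd-2.1.13/conf/ejabberdctl.cfg). The two variants I've tried on node1 look roughly like this, with the node2 equivalent on the other server:
ERLANG_NODE=ejabberd@node1
ERLANG_NODE=ejabberd@node1.example.local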
I can start the first server and it all works as expected. I've copied the .erlang.cookie file over to the home directory (/home/ejabberd) on the second server, and run the magic command:
/u01/ejabberd-2.1.13/bin/erl -sname ejabberd -mnesia dir '"/u01/ejabberd-2.1.13/database/ejabberd@node2"' -mnesia extra_db_nodes "['ejabberd@node1']" -s mnesia
I then run mnesia:info(). to validate that it's all working fine and I get...
(ejabberd@node2)1> mnesia:info().
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
schema : with 1 records occupying 383 words of mem
===> System info in version "4.5", debug level = none <===
opt_disc. Directory "/u01/ejabberd-2.1.13/database/ejabberd@node1" is NOT used.
use fallback at restart = false
running db nodes = [ejabberd@node2]
stopped db nodes = [ejabberd@node1]
master node tables = []
remote = []
ram_copies = [schema]
disc_copies = []
disc_only_copies = []
[{ejabberd@node2,ram_copies}] = [schema]
2 transactions committed, 0 aborted, 0 restarted, 0 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok
(ejabberd@node2)2>
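As a sanity check on the Erlang distribution itself (separate from mnesia), my understanding is that you can ping the other node from that same erl shell with the standard net_adm module:
net_adm:ping('ejabberd@node1').
pong means the two nodes can see each other; pang means they can't connect at all (wrong cookie, hostname resolution, firewall, and so on).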
When I run with the FQDN I get some errors, but it also shows the databases better:
[ejabberd@node2 ~]$ /u01/ejabberd/bin/erl -sname ejabberd -mnesia dir '"/u01/ejabberd-2.1.13/database/ejabberd@node2.example.local"' -mnesia extra_db_nodes "['ejabberd@node1.example.local']" -s mnesia
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [rq:1] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.8.5 (abort with ^G)
(ejabberd@node2)1>
=ERROR REPORT==== 19-Sep-2013::17:26:22 ===
** System NOT running to use fully qualified hostnames **
** Hostname node1.example.local is illegal **
=ERROR REPORT==== 19-Sep-2013::17:26:22 ===
** System NOT running to use fully qualified hostnames **
** Hostname node1.example.local is illegal **
=ERROR REPORT==== 19-Sep-2013::17:26:22 ===
** System NOT running to use fully qualified hostnames **
** Hostname node2.example.local is illegal **
=ERROR REPORT==== 19-Sep-2013::17:26:22 ===
** System NOT running to use fully qualified hostnames **
** Hostname node2.example.local is illegal **
(ejabberd@node2)1> mnesia:info().
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
schema : with 37 records occupying 4637 words of mem
===> System info in version "4.5", debug level = none <===
opt_disc. Directory "/u01/ejabberd-2.1.13/database/ejabberd@node2.example.local" is used.
use fallback at restart = false
running db nodes = [ejabberd@node2]
stopped db nodes = ['ejabberd@node2.example.local','ejabberd@node1.example.local']
master node tables = []
remote = [acl,caps_features,captcha,config,http_bind,iq_response,
last_activity,local_config,mod_register_ip,motd,
motd_users,muc_online_room,muc_registered,muc_room,
offline_msg,passwd,privacy,private_storage,pubsub_index,
pubsub_item,pubsub_last_item,pubsub_node,pubsub_state,
pubsub_subscription,reg_users_counter,roster,
roster_version,route,s2s,session,session_counter,
sr_group,sr_user,temporarily_blocked,vcard,vcard_search]
ram_copies = [schema]
disc_copies = []
disc_only_copies = []
[] = [pubsub_node,muc_online_room,muc_registered,temporarily_blocked,
iq_response,pubsub_state,muc_room,pubsub_item,private_storage,session,
motd_users,vcard_search,mod_register_ip,session_counter,sr_group,
caps_features,pubsub_index,captcha,vcard,s2s,acl,motd,route,offline_msg,
pubsub_last_item,roster_version,sr_user,last_activity,roster,passwd,
local_config,privacy,pubsub_subscription,reg_users_counter,http_bind,
config]
[{ejabberd@node2,ram_copies}] = [schema]
2 transactions committed, 0 aborted, 0 restarted, 0 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok
(ejabberd@node2)2>
I also tried the steps located here:
with no luck; it just clobbers my ejabberd databases on node2, and the commands to copy the tables over don't work.
I have validated that the cookie is the same using erlang:get_cookie(). on both instances (within the erl session on node2 and within an ejabberdctl debug session on node1). I have also validated connectivity on port 4369 using telnet. There doesn't seem to be any reason why the connection isn't working, but no matter what I try I can't get the first node to appear in the "running db nodes" list on the second server. Nothing shows up in ejabberd.log on node1. The traffic is definitely reaching node1, though, because I can see the comms on port 4369 when I start erlang. I had a look with Wireshark and all I saw was two conversations, which appeared to be:
EPMD_PORT2_REQ ejabberd
EPMD_PORT2_RESP OK ejabberd port=60636
So it's definitely hitting the right service.
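For anyone following along, you can also ask epmd on node1 directly what it has registered; epmd ships with the Erlang runtime, and its output looks something like this (the port is just the one from my Wireshark capture above):
epmd -names
epmd: up and running on port 4369 with data:
name ejabberd at port 60636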
Am I doing anything wrong here? It seems like I am doing everything right but I just can't get it working and it's driving me insane.
Any help would be very much appreciated.
Thanks!
Ramiro
I've discovered that I shouldn't be using the -sname option with an FQDN; I should be using the -name option instead. So at least I've got rid of the errors, and it looks a lot better. But node1.example.local still only ever shows up in "stopped db nodes", despite what the documentation tells me I should be seeing.
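For completeness, the command I'm running now is the same as before, just with -name in place of -sname:
/u01/ejabberd-2.1.13/bin/erl -name ejabberd -mnesia dir '"/u01/ejabberd-2.1.13/database/ejabberd@node2.example.local"' -mnesia extra_db_nodes "['ejabberd@node1.example.local']" -s mnesia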
(ejabberd@node2.example.local)1> mnesia:info().
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
mod_register_ip: with 0 records occupying 283 words of mem
local_config : with 0 records occupying 283 words of mem
caps_features : with 0 records occupying 5752 bytes on disc
config : with 0 records occupying 283 words of mem
http_bind : with 0 records occupying 283 words of mem
reg_users_counter: with 0 records occupying 283 words of mem
pubsub_subscription: with 0 records occupying 283 words of mem
privacy : with 0 records occupying 283 words of mem
passwd : with 0 records occupying 283 words of mem
roster : with 0 records occupying 283 words of mem
last_activity : with 0 records occupying 283 words of mem
sr_user : with 0 records occupying 283 words of mem
roster_version : with 0 records occupying 283 words of mem
pubsub_last_item: with 0 records occupying 283 words of mem
offline_msg : with 0 records occupying 5752 bytes on disc
route : with 0 records occupying 283 words of mem
motd : with 0 records occupying 283 words of mem
acl : with 0 records occupying 283 words of mem
s2s : with 0 records occupying 283 words of mem
vcard : with 0 records occupying 5752 bytes on disc
captcha : with 0 records occupying 283 words of mem
pubsub_index : with 0 records occupying 283 words of mem
sr_group : with 0 records occupying 283 words of mem
session_counter: with 0 records occupying 283 words of mem
vcard_search : with 0 records occupying 283 words of mem
motd_users : with 0 records occupying 283 words of mem
schema : with 37 records occupying 4637 words of mem
session : with 0 records occupying 283 words of mem
private_storage: with 0 records occupying 5752 bytes on disc
pubsub_item : with 0 records occupying 5752 bytes on disc
muc_room : with 0 records occupying 283 words of mem
pubsub_state : with 0 records occupying 283 words of mem
iq_response : with 0 records occupying 283 words of mem
temporarily_blocked: with 0 records occupying 283 words of mem
muc_registered : with 0 records occupying 283 words of mem
muc_online_room: with 0 records occupying 283 words of mem
pubsub_node : with 0 records occupying 283 words of mem
===> System info in version "4.5", debug level = none <===
opt_disc. Directory "/u01/ejabberd-2.1.13/database/ejabberd@node2.example.local" is used.
use fallback at restart = false
running db nodes = ['ejabberd@node2.example.local']
stopped db nodes = ['ejabberd@node1.example.local']
master node tables = []
remote = []
ram_copies = [captcha,http_bind,iq_response,mod_register_ip,
muc_online_room,pubsub_last_item,reg_users_counter,
route,s2s,session,session_counter,temporarily_blocked]
disc_copies = [acl,config,last_activity,local_config,motd,motd_users,
muc_registered,muc_room,passwd,privacy,pubsub_index,
pubsub_node,pubsub_state,pubsub_subscription,roster,
roster_version,schema,sr_group,sr_user,vcard_search]
disc_only_copies = [caps_features,offline_msg,private_storage,pubsub_item,
vcard]
[{'ejabberd@node2.example.local',disc_copies}] = [pubsub_node,
muc_registered,
pubsub_state,
muc_room,schema,
motd_users,
vcard_search,
sr_group,
pubsub_index,acl,
motd,roster_version,
sr_user,
last_activity,roster,
passwd,local_config,
privacy,
pubsub_subscription,
config]
[{'ejabberd@node2.example.local',disc_only_copies}] = [pubsub_item,
private_storage,
caps_features,
vcard,
offline_msg]
[{'ejabberd@node2.example.local',ram_copies}] = [muc_online_room,
temporarily_blocked,
iq_response,session,
mod_register_ip,
session_counter,
captcha,s2s,route,
pubsub_last_item,
reg_users_counter,
http_bind]
2 transactions committed, 0 aborted, 0 restarted, 0 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok
(ejabberd@node2.example.local)2>
OK, I found the solution myself.
It turns out that nowhere in the documentation is it mentioned that the database replication does not actually occur over port 4369; that is just the Erlang port mapper daemon (epmd), and a dynamically assigned port is used for the actual replication traffic. That port was being blocked by my firewall, which is what was breaking the whole thing.
I had to open a range in the firewall and set the FIREWALL_WINDOW variable in ejabberdctl.cfg to match, and database replication works perfectly now.
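In case it helps anyone else, this is roughly what I ended up with; the port range here is just an example, so pick one that suits your firewall. As I understand it, FIREWALL_WINDOW makes ejabberdctl pass inet_dist_listen_min/inet_dist_listen_max to the Erlang kernel, which pins the distribution (replication) port into that range instead of letting it be random.
In /u01/ejabberd-2.1.13/conf/ejabberdctl.cfg on both nodes:
FIREWALL_WINDOW=4370-4379
And the matching firewall rule on each node (iptables syntax, adjust to your setup), in addition to keeping 4369 open for epmd:
iptables -A INPUT -p tcp -s <other node's IP> --dport 4370:4379 -j ACCEPT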