Trouble with ejabberd clustering.

Hi,

I'm tearing my hair out trying to get ejabberd clustering working. I have two ejabberd servers running behind a load balancer; the publicly visible hostname is:
jabber.example.local

and the server names are:
node1.example.local
node2.example.local

ejabberd is installed as the user "ejabberd" on both servers, and runs from /u01/ejabberd-2.1.13/

I configured both servers with the same hosts option:

{hosts, ["jabber.example.local"]}.

and I have tried configuring each node name as both the FQDN and the short name (node1.example.local and node1 respectively).
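
(For reference, the node name I'm changing is the ERLANG_NODE variable in ejabberdctl.cfg on each server; the two values below are just my own attempts, not anything from the docs:)

ERLANG_NODE=ejabberd@node1                  # short-name attempt
ERLANG_NODE=ejabberd@node1.example.local    # FQDN attempt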

I can start the first server and it all works as expected. I've copied the .erlang.cookie file over to the home directory (/home/ejabberd) on the second server, and run the magic command:

/u01/ejabberd-2.1.13/bin/erl -sname ejabberd -mnesia dir '"/u01/ejabberd-2.1.13/database/ejabberd@node2"' -mnesia extra_db_nodes "['ejabberd@node1']" -s mnesia
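
(Side note for anyone following along: as I read the mnesia docs, the same join can also be attempted from an already-running shell with mnesia:change_config/2, and the return value should list the extra nodes that were actually reached, so {ok,[]} would mean node1 was never contacted:)

mnesia:change_config(extra_db_nodes, ['ejabberd@node1']).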

I then run mnesia:info(). to validate that it's all working fine and I get...

(ejabberd@node2)1> mnesia:info().
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
schema         : with 1        records occupying 383      words of mem
===> System info in version "4.5", debug level = none <===
opt_disc. Directory "/u01/ejabberd-2.1.13/database/ejabberd@node1" is NOT used.
use fallback at restart = false
running db nodes   = [ejabberd@node2]
stopped db nodes   = [ejabberd@node1]
master node tables = []
remote             = []
ram_copies         = [schema]
disc_copies        = []
disc_only_copies   = []
[{ejabberd@node2,ram_copies}] = [schema]
2 transactions committed, 0 aborted, 0 restarted, 0 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok
(ejabberd@node2)2>

When I run it with the FQDN I get some errors, but it also shows the tables better:

[ejabberd@node2 ~]$ /u01/ejabberd/bin/erl -sname ejabberd -mnesia dir '"/u01/ejabberd-2.1.13/database/ejabberd@node2.example.local"' -mnesia extra_db_nodes "['ejabberd@node1.example.local']" -s mnesia
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [rq:1] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.5  (abort with ^G)
(ejabberd@node2)1>
=ERROR REPORT==== 19-Sep-2013::17:26:22 ===
** System NOT running to use fully qualified hostnames **
** Hostname node1.example.local is illegal **

=ERROR REPORT==== 19-Sep-2013::17:26:22 ===
** System NOT running to use fully qualified hostnames **
** Hostname node1.example.local is illegal **

=ERROR REPORT==== 19-Sep-2013::17:26:22 ===
** System NOT running to use fully qualified hostnames **
** Hostname node2.example.local is illegal **

=ERROR REPORT==== 19-Sep-2013::17:26:22 ===
** System NOT running to use fully qualified hostnames **
** Hostname node2.example.local is illegal **

(ejabberd@node2)1> mnesia:info().
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
schema         : with 37       records occupying 4637     words of mem
===> System info in version "4.5", debug level = none <===
opt_disc. Directory "/u01/ejabberd-2.1.13/database/ejabberd@node2.example.local" is used.
use fallback at restart = false
running db nodes   = [ejabberd@node2]
stopped db nodes   = ['ejabberd@node2.example.local','ejabberd@node1.example.local']
master node tables = []
remote             = [acl,caps_features,captcha,config,http_bind,iq_response,
                      last_activity,local_config,mod_register_ip,motd,
                      motd_users,muc_online_room,muc_registered,muc_room,
                      offline_msg,passwd,privacy,private_storage,pubsub_index,
                      pubsub_item,pubsub_last_item,pubsub_node,pubsub_state,
                      pubsub_subscription,reg_users_counter,roster,
                      roster_version,route,s2s,session,session_counter,
                      sr_group,sr_user,temporarily_blocked,vcard,vcard_search]
ram_copies         = [schema]
disc_copies        = []
disc_only_copies   = []
[] = [pubsub_node,muc_online_room,muc_registered,temporarily_blocked,
      iq_response,pubsub_state,muc_room,pubsub_item,private_storage,session,
      motd_users,vcard_search,mod_register_ip,session_counter,sr_group,
      caps_features,pubsub_index,captcha,vcard,s2s,acl,motd,route,offline_msg,
      pubsub_last_item,roster_version,sr_user,last_activity,roster,passwd,
      local_config,privacy,pubsub_subscription,reg_users_counter,http_bind,
      config]
[{ejabberd@node2,ram_copies}] = [schema]
2 transactions committed, 0 aborted, 0 restarted, 0 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok
(ejabberd@node2)2>

I also tried the steps located here:
http://rfid-ale.blogspot.com.au/2009/10/how-to-make-ejabberd-cluster-set...
with no luck; it just clobbers the ejabberd databases on node2, and the commands to copy the tables over don't work.
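
(The table-copy commands in that post are, as far as I can tell, just the standard mnesia ones, something along these lines — passwd is only an example table:)

mnesia:change_table_copy_type(schema, node(), disc_copies).
mnesia:add_table_copy(passwd, node(), disc_copies).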

I have validated that the cookie is the same using erlang:get_cookie(). on both instances (within the erl session on node2 and within an ejabberdctl debug session on node1). I have validated connectivity on port 4369 using telnet. There doesn't seem to be any reason why the connection isn't working, but I can't get the first node into the "running db nodes" list on the second server no matter what I try. Nothing shows up in the ejabberd.log on node1. The traffic is definitely getting to node1, though, because I can see the comms on port 4369 when I start Erlang. I had a look with Wireshark and all I see are two exchanges, which appear to be:
EPMD_PORT2_REQ ejabberd
EPMD_PORT2_RESP OK ejabberd port=60636
So it's definitely hitting the right service.
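
(In case it helps anyone reproducing this: the most direct check of Erlang distribution itself, beyond epmd, should be net_adm:ping/1 from the erl shell — pong means the nodes can actually talk, pang means they can't:)

net_adm:ping('ejabberd@node1').                  %% short-name run
net_adm:ping('ejabberd@node1.example.local').    %% FQDN run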

Am I doing anything wrong here? It seems like I am doing everything right but I just can't get it working and it's driving me insane.

Any help would be very much appreciated.

Thanks!
Ramiro

I've discovered that I shouldn't be using the -sname option with an FQDN; I should be using the -name option instead. That at least got rid of the errors, and it looks a lot better. But node1.example.local still only ever shows up in the "stopped db nodes" list, despite what the documentation tells me I should be seeing.
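
For completeness, the command on node2 now looks like this (same as before, just -name and the FQDNs):

/u01/ejabberd-2.1.13/bin/erl -name ejabberd -mnesia dir '"/u01/ejabberd-2.1.13/database/ejabberd@node2.example.local"' -mnesia extra_db_nodes "['ejabberd@node1.example.local']" -s mnesia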

(ejabberd@node2.example.local)1> mnesia:info().
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
mod_register_ip: with 0        records occupying 283      words of mem
local_config   : with 0        records occupying 283      words of mem
caps_features  : with 0        records occupying 5752     bytes on disc
config         : with 0        records occupying 283      words of mem
http_bind      : with 0        records occupying 283      words of mem
reg_users_counter: with 0        records occupying 283      words of mem
pubsub_subscription: with 0        records occupying 283      words of mem
privacy        : with 0        records occupying 283      words of mem
passwd         : with 0        records occupying 283      words of mem
roster         : with 0        records occupying 283      words of mem
last_activity  : with 0        records occupying 283      words of mem
sr_user        : with 0        records occupying 283      words of mem
roster_version : with 0        records occupying 283      words of mem
pubsub_last_item: with 0        records occupying 283      words of mem
offline_msg    : with 0        records occupying 5752     bytes on disc
route          : with 0        records occupying 283      words of mem
motd           : with 0        records occupying 283      words of mem
acl            : with 0        records occupying 283      words of mem
s2s            : with 0        records occupying 283      words of mem
vcard          : with 0        records occupying 5752     bytes on disc
captcha        : with 0        records occupying 283      words of mem
pubsub_index   : with 0        records occupying 283      words of mem
sr_group       : with 0        records occupying 283      words of mem
session_counter: with 0        records occupying 283      words of mem
vcard_search   : with 0        records occupying 283      words of mem
motd_users     : with 0        records occupying 283      words of mem
schema         : with 37       records occupying 4637     words of mem
session        : with 0        records occupying 283      words of mem
private_storage: with 0        records occupying 5752     bytes on disc
pubsub_item    : with 0        records occupying 5752     bytes on disc
muc_room       : with 0        records occupying 283      words of mem
pubsub_state   : with 0        records occupying 283      words of mem
iq_response    : with 0        records occupying 283      words of mem
temporarily_blocked: with 0        records occupying 283      words of mem
muc_registered : with 0        records occupying 283      words of mem
muc_online_room: with 0        records occupying 283      words of mem
pubsub_node    : with 0        records occupying 283      words of mem
===> System info in version "4.5", debug level = none <===
opt_disc. Directory "/u01/ejabberd-2.1.13/database/ejabberd@node2.example.local" is used.
use fallback at restart = false
running db nodes   = ['ejabberd@node2.example.local']
stopped db nodes   = ['ejabberd@node1.example.local']
master node tables = []
remote             = []
ram_copies         = [captcha,http_bind,iq_response,mod_register_ip,
                      muc_online_room,pubsub_last_item,reg_users_counter,
                      route,s2s,session,session_counter,temporarily_blocked]
disc_copies        = [acl,config,last_activity,local_config,motd,motd_users,
                      muc_registered,muc_room,passwd,privacy,pubsub_index,
                      pubsub_node,pubsub_state,pubsub_subscription,roster,
                      roster_version,schema,sr_group,sr_user,vcard_search]
disc_only_copies   = [caps_features,offline_msg,private_storage,pubsub_item,
                      vcard]
[{'ejabberd@node2.example.local',disc_copies}] = [pubsub_node,
                                                         muc_registered,
                                                         pubsub_state,
                                                         muc_room,schema,
                                                         motd_users,
                                                         vcard_search,
                                                         sr_group,
                                                         pubsub_index,acl,
                                                         motd,roster_version,
                                                         sr_user,
                                                         last_activity,roster,
                                                         passwd,local_config,
                                                         privacy,
                                                         pubsub_subscription,
                                                         config]
[{'ejabberd@node2.example.local',disc_only_copies}] = [pubsub_item,
                                                              private_storage,
                                                              caps_features,
                                                              vcard,
                                                              offline_msg]
[{'ejabberd@node2.example.local',ram_copies}] = [muc_online_room,
                                                        temporarily_blocked,
                                                        iq_response,session,
                                                        mod_register_ip,
                                                        session_counter,
                                                        captcha,s2s,route,
                                                        pubsub_last_item,
                                                        reg_users_counter,
                                                        http_bind]
2 transactions committed, 0 aborted, 0 restarted, 0 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
ok
(ejabberd@node2.example.local)2>

OK I found the solution myself.

It turns out that nowhere in the documentation is it mentioned that the database replication does not actually happen over port 4369. That port is just the Erlang port mapper daemon (epmd); a randomly chosen port is used for the actual distribution/replication traffic. That port was being blocked by my firewall, and that was causing the whole thing to break.

I had to open a hole in the firewall and set the FIREWALL_WINDOW variable in ejabberdctl.cfg to match, and database replication works perfectly now.
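
For anyone else who hits this, the relevant bit looks roughly like the following (the port range is just what I picked; use whatever range your firewall rule allows). In ejabberdctl.cfg on both nodes:

FIREWALL_WINDOW=4200-4210

As far as I understand it, this just maps to the Erlang kernel parameters inet_dist_listen_min and inet_dist_listen_max, so the equivalent for the standalone erl test is adding:

-kernel inet_dist_listen_min 4200 -kernel inet_dist_listen_max 4210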
