That's great, Andy.
Just two questions, really. First, why didn't BIND complain? Surely it must know whether it has bound to the port. Now that we are aware of it, though, we know what to look for. Secondly, how did the previous process start? I am so used to running systemctl start xxx and immediately following it with journalctl -xe to check for problems.
It has to be down to something I did.
The first priority now, though, is to get all the zones loaded and tidy the conf files with rate limiting, proper ACLs, etc. Then I'll try to investigate through the SCREEN logfile.

On Wed, 24 Jul 2019 at 13:44, Andy Smith <andy@bitfolk.com> wrote:
On Wed, Jul 24, 2019 at 12:05:56PM +0000, Andy Smith wrote:
> All of those processes (16870, 11064, 27705, 20717) thought they were listening
> on the main IP port 53, even while systemd thinks there is no bind9 service running.

Here's where 27705 came into being:

Jul 23 07:23:25 westnorfolk named[26462]: shutting down
Jul 23 07:23:25 westnorfolk named[26462]: stopping command channel on 127.0.0.1#953
Jul 23 07:23:25 westnorfolk named[26462]: stopping command channel on ::1#953
Jul 23 07:23:25 westnorfolk named[26462]: no longer listening on ::#53
Jul 23 07:23:25 westnorfolk named[26462]: no longer listening on 127.0.0.1#53
Jul 23 07:23:25 westnorfolk named[26462]: no longer listening on 85.119.82.237#53
Jul 23 07:23:25 westnorfolk named[26462]: exiting
Jul 23 07:23:25 westnorfolk rndc[27676]: rndc: connect failed: 127.0.0.1#953: connection refused
Jul 23 07:23:25 westnorfolk systemd[1]: bind9.service: Control process exited, code=exited status=1
Jul 23 07:23:25 westnorfolk systemd[1]: bind9.service: Unit entered failed state.
Jul 23 07:23:25 westnorfolk systemd[1]: bind9.service: Failed with result 'exit-code'.
Jul 23 07:23:29 westnorfolk named[27705]: starting BIND 9.10.3-P4-Debian <id:ebd72b3> -c /etc/bind/named.conf

So since 07:23 yesterday this has been running and snarfing up all
the transfer requests, which is why none of the later configuration
changes seemed to make any difference.

The normal named command line when run under systemd looks like
this:

andy@westnorfolk:~$ ps awux | grep named
bind     10022  0.0  0.7 288940 23924 ?        Ssl  12:47   0:00 /usr/sbin/named -f -u bind

so I suspect that these other processes were started in some other
way - command line maybe? So that would be why systemd doesn't know
about them. I still think that bind should have complained about not
being able to bind to, e.g. 85.119.82.237#53 but perhaps it didn't
know it was unable to (silent failure)?
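
One rough way to spot named processes that systemd doesn't know about
is to compare the unit's main PID with every running named. This is
only a sketch: stray_pids is a made-up helper name, and it assumes the
unit is called bind9 and runs a single main process.

```shell
# stray_pids MAIN_PID PID...
# Print each PID that is not the unit's main PID (hypothetical helper).
stray_pids() {
    main_pid=$1
    shift
    for pid in "$@"; do
        if [ "$pid" != "$main_pid" ]; then
            echo "$pid"
        fi
    done
}

# On the live machine, assuming the unit is bind9 and the binary is named:
#   stray_pids "$(systemctl show -p MainPID --value bind9)" $(pgrep -x named)
```

Anything it prints would be a named that was started outside the
service, which is worth killing before restarting bind9.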

Turn off bind9 and run something that will hold the port (a socat
server copying everything it receives to the terminal):

andy@westnorfolk:~$ sudo systemctl stop bind9
andy@westnorfolk:~$ sudo socat -v tcp-l:53,fork -

It's definitely got the port:

andy@westnorfolk:~$ sudo lsof -p 10688
COMMAND   PID USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
socat   10688 root    5u  IPv4             642785      0t0     TCP *:domain (LISTEN)

And it sees traffic:

[another machine]
$ nc 85.119.82.237 53
hello
^C

[back on Keith's VM]
> 2019/07/24 13:35:41.546776  length=6 from=0 to=5
hello
hello

Start bind9 again while socat is holding the port:

andy@westnorfolk:~$ sudo systemctl start bind9
andy@westnorfolk:~$ sudo systemctl status bind9
● bind9.service - BIND Domain Name Server
   Loaded: loaded (/lib/systemd/system/bind9.service; enabled; vendor preset: enabled
   Active: active (running) since Wed 2019-07-24 13:39:02 BST; 5s ago
     Docs: man:named(8)
  Process: 10592 ExecStop=/usr/sbin/rndc stop (code=exited, status=0/SUCCESS)
 Main PID: 10780 (named)
    Tasks: 5 (limit: 4915)
   CGroup: /system.slice/bind9.service
           └─10780 /usr/sbin/named -f -u bind

andy@westnorfolk:~$ sudo grep 85.119.82.237#53 /var/log/syslog
Jul 24 13:39:02 westnorfolk named[10780]: listening on IPv4 interface eth0, 85.119.82.237#53

Why didn't it cry about not being able to bind 85.119.82.237:53?
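
The grep above only matches lines that contain the address, so any
complaint named logged without repeating the address would not show
up. It may be worth reading everything that PID logged instead. A
minimal sketch: named_log is a hypothetical helper, and the default
log path is assumed to be /var/log/syslog as used above.

```shell
# named_log PID [LOGFILE]
# Print every line the given named PID wrote to the log (hypothetical helper).
named_log() {
    grep -F "named[$1]:" "${2:-/var/log/syslog}"
}

# e.g. for the process started above:
#   named_log 10780
```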

So right now we have bind9 thinking it's running fine but it will
never see a zone transfer request because this socat process is
hogging port 53.

Is this normal? I am used to daemons giving up when they can't
exclusively bind.

Interestingly if I kill the socat, other servers now see
"connection refused", i.e. named hasn't tried to bind port 53 again
(or never did).
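
Given that, a quick check for whatever already holds port 53 before
starting bind9 might save some confusion. The sketch below just pulls
the PID and command out of lsof output; port_holders is a hypothetical
helper, and the lsof invocation in the comment is only one assumed way
of feeding it.

```shell
# port_holders: read lsof output on stdin, print unique "PID COMMAND" pairs
# (hypothetical helper; skips the header line).
port_holders() {
    awk 'NR > 1 { print $2, $1 }' | sort -u
}

# On the live machine, before "systemctl start bind9":
#   sudo lsof -nP -i :53 | port_holders
```

If that prints anything, something else already has the port and
named is unlikely to receive the traffic.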

Cheers,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting