Broken Email: A Case Study

(2020-04-02)

Summary: Some of the nameservers for wa-students.org intermittently return CNAMEs for the domain, which hides its MX, TXT, NS, and SOA records. The host that the returned name resolves to does not run a mail server. The sending mail server cannot find the mail exchanger, so it ultimately times out and bounces the email.

The email for wa-students.org has been intermittently broken for a few years. (Sometimes it consistently works, sometimes it consistently fails.) In August 2017, emails sent to two students' @wa-students.org addresses bounced; the sending servers had attempted delivery to an @edlio.map.fastly.net address instead. The following dig output from August 2017 shows the MX records and the authoritative answer ("aa") flag. (Emphasis added in bold.)

$ dig +nsid wa-students.org MX @ns1.hover.com

; <<>> DiG 9.10.2-P3 <<>> +nsid wa-students.org MX @ns1.hover.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11726
;; flags: qr aa rd; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 2800
; NSID: 68 6f 76 65 72 30 35 2e 64 6e 73 2e 62 72 61 2e 70 72 6f 64 2e 74 75 63 6f 77 73 2e 6e 65 74 ("hover05.dns.bra.prod.tucows.net")
;; QUESTION SECTION:
;wa-students.org.               IN      MX

;; ANSWER SECTION:
wa-students.org.        900     IN      MX      10 ASPMX3.GOOGLEMAIL.COM.
wa-students.org.        900     IN      MX      10 ASPMX2.GOOGLEMAIL.COM.
wa-students.org.        900     IN      MX      5 ALT1.ASPMX.L.GOOGLE.COM.
wa-students.org.        900     IN      MX      5 ALT2.ASPMX.L.GOOGLE.COM.
wa-students.org.        900     IN      MX      1 ASPMX.L.GOOGLE.COM.
wa-students.org.        900     IN      MX      10 mx.hover.com.cust.hostedemail.com.

;; Query time: 78 msec
;; SERVER: 216.40.47.26#53(216.40.47.26)
;; WHEN: Tue Aug 29 21:04:14 UTC 2017
;; MSG SIZE  rcvd: 261
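
The NSID bytes in the OPT pseudosection are plain ASCII printed as hex, which is how dig arrives at the quoted server name. As a small sketch (the function name decode_nsid is made up for this illustration), the decoding can be reproduced offline:

```python
def decode_nsid(hex_bytes: str) -> str:
    """Decode the space-separated hex bytes that dig prints after 'NSID:'."""
    # bytes.fromhex ignores the spaces between the byte pairs.
    return bytes.fromhex(hex_bytes).decode("ascii")


# Hex bytes copied from the dig output above.
nsid_hex = (
    "68 6f 76 65 72 30 35 2e 64 6e 73 2e 62 72 61 2e 70 72 6f 64 2e "
    "74 75 63 6f 77 73 2e 6e 65 74"
)
print(decode_nsid(nsid_hex))  # hover05.dns.bra.prod.tucows.net
```

Decoding the identifier this way makes it easy to log which member of the server farm produced each answer.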

The following output, also from August 2017, shows the errant CNAME response and the missing authoritative answer flag:

$ dig +nsid wa-students.org MX @ns1.hover.com

; <<>> DiG 9.10.2-P3 <<>> +nsid wa-students.org MX @ns1.hover.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44220
;; flags: qr rd; QUERY: 1, ANSWER: 2, AUTHORITY: 13, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 2800
; NSID: 68 6f 76 65 72 30 31 2e 64 6e 73 2e 62 72 61 2e 70 72 6f 64 2e 74 75 63 6f 77 73 2e 6e 65 74 ("hover01.dns.bra.prod.tucows.net")
;; QUESTION SECTION:
;wa-students.org.               IN      MX

;; ANSWER SECTION:
wa-students.org.        900     IN      CNAME   www.westlakeacademy.org.
www.westlakeacademy.org. 900    IN      CNAME   westlakeacademy.sites.edlio.com.

;; AUTHORITY SECTION:
.                       518400  IN      NS      a.root-servers.net.
.                       518400  IN      NS      b.root-servers.net.
.                       518400  IN      NS      c.root-servers.net.
.                       518400  IN      NS      d.root-servers.net.
.                       518400  IN      NS      e.root-servers.net.
.                       518400  IN      NS      f.root-servers.net.
.                       518400  IN      NS      g.root-servers.net.
.                       518400  IN      NS      h.root-servers.net.
.                       518400  IN      NS      i.root-servers.net.
.                       518400  IN      NS      j.root-servers.net.
.                       518400  IN      NS      k.root-servers.net.
.                       518400  IN      NS      l.root-servers.net.
.                       518400  IN      NS      m.root-servers.net.

;; Query time: 78 msec
;; SERVER: 216.40.47.26#53(216.40.47.26)
;; WHEN: Tue Aug 29 21:04:17 UTC 2017
;; MSG SIZE  rcvd: 369
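
The difference between the two responses is visible in the ";; flags:" header line alone. The following is a minimal sketch, assuming dig's text layout stays as shown above; parse_header_flags is a hypothetical helper, not part of dig or any library. It pulls out the flags and section counts so the two answers can be compared mechanically:

```python
import re


def parse_header_flags(dig_output: str) -> dict:
    """Extract the flag list and section counts from dig's ';; flags:' line."""
    match = re.search(
        r"^;; flags:([^;]*); QUERY: (\d+), ANSWER: (\d+), "
        r"AUTHORITY: (\d+), ADDITIONAL: (\d+)",
        dig_output,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no ';; flags:' header line found")
    flags = match.group(1).split()
    return {
        "flags": flags,
        "authoritative": "aa" in flags,
        "answer": int(match.group(3)),
        "authority": int(match.group(4)),
    }


# Header lines copied from the two dig outputs above.
good = ";; flags: qr aa rd; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1"
bad = ";; flags: qr rd; QUERY: 1, ANSWER: 2, AUTHORITY: 13, ADDITIONAL: 1"

for line in (good, bad):
    info = parse_header_flags(line)
    print(info["authoritative"], info["answer"], info["authority"])
```

The first line reports an authoritative answer with six records and an empty authority section; the second is not authoritative and carries thirteen root NS records in the authority section.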

This could be repeated multiple times from different networks, targeting either the ns1 or the ns2 Hover DNS server, both then and still today.

Here is the explanation:

1) wa-students.org DNS is hosted by ns1.hover.com and ns2.hover.com. (These are its authoritative DNS servers.)

2) Hover's servers use a farm of DNS servers hosted at Tucows behind the same two IP addresses. Two examples can be seen in the EDNS Name Server Identifier (NSID) output above. Other NSIDs were also seen. Note that even servers reporting the same NSID returned different information at nearly the same time.

3) The various name servers return different values. They are not synchronized. Some return authoritative answers ("aa"), as they should, since they are the delegated and responsible nameservers for that domain. Some don't return the "aa" flag and behave as if answering from a cache. Even the TTLs (times to live) are reduced on later lookups, indicating that the data comes from a cache rather than from authoritative data. Note that a non-recursive query does return the "aa" flag but still provides the problem CNAMEs.

4) Some of these name servers are broken. They mistakenly (even illegally per the specification) provide CNAME records to follow for the wa-students.org query. A CNAME record defines an alias. In DNS this basically means "go look at this target instead" (asking there for the desired record type). In this case, it says to look at www.westlakeacademy.org, which is itself a CNAME to westlakeacademy.sites.edlio.com (which is a CNAME to sites.edlio.com, which is a CNAME to edlio.map.fastly.net). So if something asks for an address for wa-students.org, it ends up asking for the address of the Fastly CDN mirror. (Presumably this is what they want for the "www" website, but not for the SMTP email service.)
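
The alias chain in point 4 can be traced with a few lines of code. This is only a sketch: the dictionary stands in for live DNS lookups, and follow_cnames and its max_hops limit are illustrative assumptions rather than any resolver's actual implementation.

```python
# CNAME chain as described in point 4; a static stand-in for real lookups.
CNAMES = {
    "wa-students.org.": "www.westlakeacademy.org.",
    "www.westlakeacademy.org.": "westlakeacademy.sites.edlio.com.",
    "westlakeacademy.sites.edlio.com.": "sites.edlio.com.",
    "sites.edlio.com.": "edlio.map.fastly.net.",
}


def follow_cnames(name: str, cnames: dict, max_hops: int = 8) -> list:
    """Return every name visited while chasing aliases, starting with `name`."""
    chain = [name]
    while chain[-1] in cnames:
        # Guard against loops and absurdly long chains (limit is arbitrary).
        if len(chain) > max_hops:
            raise RuntimeError("CNAME chain too long or looping")
        chain.append(cnames[chain[-1]])
    return chain


print(" -> ".join(follow_cnames("wa-students.org.", CNAMES)))
```

Whatever record type is requested for wa-students.org, a resolver that receives the first CNAME ends up asking its question at the last name in this chain.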

Per the DNS specification (RFC 1034, section 3.6.2), "If a CNAME RR is present at a node, no other data should be present". (Ignore the newer specification that allows a signature for a CNAME record.) This is where the problem is. Email servers (MTAs) first look for an MX record to see who handles the email for that DNS name. As you can see in the two dig outputs above, made within seconds of each other, one returns MX records and one does not. The one with the CNAME cannot provide the MX because you cannot have an MX record at the same place as a CNAME record. (For a CNAME, a DNS query asks the question again for the desired record type at the new name.)
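
The "no other data" rule is easy to check mechanically. The sketch below assumes a simple mapping of owner names to the record types seen for them across the two responses (plus the hidden record types named in the summary); cname_conflicts is a hypothetical helper. The DNSSEC types RRSIG and NSEC are exempted because later specifications allow them next to a CNAME.

```python
# Record types that DNSSEC permits alongside a CNAME at the same name.
ALLOWED_WITH_CNAME = {"RRSIG", "NSEC"}


def cname_conflicts(records_seen: dict) -> dict:
    """Map each owner name to the types that illegally share it with a CNAME."""
    conflicts = {}
    for owner, rrtypes in records_seen.items():
        others = sorted(set(rrtypes) - {"CNAME"} - ALLOWED_WITH_CNAME)
        if "CNAME" in rrtypes and others:
            conflicts[owner] = others
    return conflicts


# Record types observed (or expected) per name, combined across the server farm.
records_seen = {
    "wa-students.org.": ["SOA", "NS", "MX", "TXT", "CNAME"],
    "www.westlakeacademy.org.": ["CNAME"],
}
print(cname_conflicts(records_seen))
```

Only the domain apex is flagged: the "www" name carries nothing but its CNAME, which is legal.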

So if, by load-balancer chance, the MTA gets the CNAME, then per DNS it asks again for the MX at the CNAME target(s). When the MTA still gets no MX answer, it simply falls back to the address of the name. So the MTA attempts to connect to a mail service at edlio.map.fastly.net. (That is the standard behavior, but some email servers may behave differently.)
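
The delivery decision described above can be modelled offline. This is a simplified sketch of the standard behavior, not any particular MTA's code: mail_targets, the two "views", and the hop limit are assumptions for illustration, and real MTAs also randomize among MX hosts of equal preference.

```python
def mail_targets(domain, mx, cnames, max_hops=8):
    """Return the hosts a sending MTA would try, in order, for `domain`."""
    name = domain
    hops = 0
    # Chase aliases first: the MX question is re-asked at the CNAME target.
    while name in cnames:
        hops += 1
        if hops > max_hops:
            raise RuntimeError("CNAME chain too long or looping")
        name = cnames[name]
    records = mx.get(name)
    if records:
        # Lowest preference value is tried first.
        return [host for _, host in sorted(records)]
    # No MX anywhere: fall back to the address of the final name.
    return [name]


# View from a healthy server: MX records from the first dig output.
healthy_mx = {
    "wa-students.org.": [
        (10, "ASPMX3.GOOGLEMAIL.COM."),
        (10, "ASPMX2.GOOGLEMAIL.COM."),
        (5, "ALT1.ASPMX.L.GOOGLE.COM."),
        (5, "ALT2.ASPMX.L.GOOGLE.COM."),
        (1, "ASPMX.L.GOOGLE.COM."),
        (10, "mx.hover.com.cust.hostedemail.com."),
    ]
}
# View from a broken server: only the alias chain, no MX at any name.
broken_cnames = {
    "wa-students.org.": "www.westlakeacademy.org.",
    "www.westlakeacademy.org.": "westlakeacademy.sites.edlio.com.",
    "westlakeacademy.sites.edlio.com.": "sites.edlio.com.",
    "sites.edlio.com.": "edlio.map.fastly.net.",
}

print(mail_targets("wa-students.org.", healthy_mx, {})[0])
print(mail_targets("wa-students.org.", {}, broken_cnames))
```

With the healthy view the first target is Google's primary mail exchanger; with the broken view the only target is the Fastly CDN name, which does not accept mail.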

As an additional problem, the TXT record is also intermittently hidden. This means the domain's SPF rules are not available to receivers for email anti-fraud checks.
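
To illustrate the effect on SPF, here is a minimal sketch of how a receiver picks the SPF policy out of a domain's TXT records. The TXT value is a made-up example (the domain's real SPF record is not reproduced in this write-up), and find_spf is a hypothetical helper.

```python
def find_spf(txt_records):
    """Return the first TXT string that is an SPF version 1 policy, else None."""
    for txt in txt_records:
        lowered = txt.lower()
        if lowered == "v=spf1" or lowered.startswith("v=spf1 "):
            return txt
    return None


# Hypothetical TXT data for illustration only.
healthy_txt = ["v=spf1 include:_spf.google.com ~all"]
hidden_txt = []  # the CNAME response carries no TXT records for the domain

print(find_spf(healthy_txt))
print(find_spf(hidden_txt))
```

When the CNAME hides the TXT record, the lookup finds no policy at all, so the receiver cannot tell legitimate senders from forged ones.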

This issue was reported to Hover as reference number #1957071.

Our extensive DNS audit service fails for the domain. (Note that the failure for "Authoritative test query must return authoritative answer (AA)" is missing because the tests did not send recursion desired (RD) queries.)


As an unbiased view, several third-party DNS resolvers checked through whatsmydns.net returned different (TXT) answers, with many failures.

Resolutions

The delegated nameservers should return authoritative answers ("aa") always. They should not provide top-level CNAMEs -- they should provide the MX, TXT, NS, and SOA records there. All the nameservers in the load balancing farm should provide the same correct data.
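
The three requirements above can be expressed as a small audit over collected responses. This is a sketch under stated assumptions: the response dictionaries are hand-built from the two dig outputs (with the MX list abbreviated), and audit_responses is a hypothetical helper, not the audit service mentioned earlier.

```python
def audit_responses(domain, responses):
    """List problems: missing AA, CNAME at the apex, or differing answers."""
    problems = []
    for nsid, resp in responses.items():
        if "aa" not in resp["flags"]:
            problems.append(f"{nsid}: answer is not authoritative")
        if any(o == domain and t == "CNAME" for o, t, _ in resp["answers"]):
            problems.append(f"{nsid}: CNAME returned at the zone apex")
    # Every server in the farm should hand out the same record set.
    distinct = {frozenset(resp["answers"]) for resp in responses.values()}
    if len(distinct) > 1:
        problems.append("servers disagree on the answer set")
    return problems


# Answers as (owner, type, data); MX list abbreviated from the first output.
responses = {
    "hover05": {
        "flags": ["qr", "aa", "rd"],
        "answers": [
            ("wa-students.org.", "MX", "1 ASPMX.L.GOOGLE.COM."),
            ("wa-students.org.", "MX", "5 ALT1.ASPMX.L.GOOGLE.COM."),
        ],
    },
    "hover01": {
        "flags": ["qr", "rd"],
        "answers": [
            ("wa-students.org.", "CNAME", "www.westlakeacademy.org."),
            ("www.westlakeacademy.org.", "CNAME", "westlakeacademy.sites.edlio.com."),
        ],
    },
}

for problem in audit_responses("wa-students.org.", responses):
    print(problem)
```

A fully repaired farm would produce an empty problem list for every record type queried.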