SERVFAIL: The DNS Nightmare for NOCs
Published on March 12, 2025

When a website fails to load due to a DNS resolution issue, users typically blame their Internet Service Provider (ISP). However, many of these problems are not caused by the ISP but by misconfigurations in authoritative DNS servers. The most frustrating error for Network Operations Centers (NOCs) is SERVFAIL, a vague response indicating that a DNS resolver was unable to retrieve a valid answer.
Unlike NXDOMAIN, which clearly signals that a domain does not exist, SERVFAIL does not specify the cause of failure. It can be triggered by a wide range of issues, including misconfigured nameservers, DNSSEC failures, network filtering, outdated glue records, and even authoritative servers refusing queries.
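To make the difference concrete: both outcomes arrive as a 4-bit RCODE in the DNS message header, and that number is all the client gets. The following is a minimal sketch, using only the Python standard library, that reads the RCODE from a raw response. The two sample headers are hand-built for illustration and are not captured traffic.

```python
import struct

# RCODE values defined in RFC 1035 (low 4 bits of the header flags word).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}

def parse_rcode(message: bytes) -> str:
    """Return the RCODE name found in the header of a raw DNS message."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    _msg_id, flags = struct.unpack("!HH", message[:4])
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")

# Hypothetical headers: QR=1, RD=1, RA=1, differing only in the RCODE bits.
servfail = struct.pack("!HHHHHH", 0x1234, 0x8182, 1, 0, 0, 0)
nxdomain = struct.pack("!HHHHHH", 0x1234, 0x8183, 1, 0, 0, 0)
print(parse_rcode(servfail), parse_rcode(nxdomain))
```

Nothing else in the header says why the resolver gave up, which is why the rest of the diagnosis has to happen outside the response itself.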
What makes SERVFAIL particularly difficult to troubleshoot is that different resolvers handle failures in different ways. While some public DNS services like Google (8.8.8.8) or Cloudflare (1.1.1.1) attempt to work around certain misconfigurations, BIND9-based resolvers used by ISPs tend to fail strictly, leading to user complaints that are difficult to explain.
This article explores the root causes of SERVFAIL, why it is so challenging for NOCs, and how ISPs can better diagnose and mitigate the impact of this DNS nightmare.
🔍 The Many Causes of SERVFAIL
- Authoritative DNS Misconfigurations
✅ Lame Delegations
A lame delegation occurs when a domain’s NS records point to nameservers that do not properly answer queries for that domain.
🔴 Example of a lame delegation:
example.com. 3600 IN NS ns1.example.net.
example.com. 3600 IN NS ns2.example.net.
If ns1.example.net and ns2.example.net exist but are not configured to serve example.com, they will either time out or return REFUSED, leading to SERVFAIL on recursive resolvers.
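A common way to spot a lame server is to send it a non-recursive query for the zone and look at the reply flags: a server that really serves the zone sets the AA (Authoritative Answer) bit. The sketch below encodes that heuristic as a pure function. The function name and the convention of passing None for a timeout are assumptions made for illustration.

```python
AA_BIT = 0x0400      # Authoritative Answer flag in the DNS header flags word
RCODE_MASK = 0x000F  # low 4 bits carry the response code
NOERROR, NXDOMAIN = 0, 3

def looks_lame(flags):
    """Heuristic check on the flags word of a reply to a non-recursive query.

    `flags` is None when the server never answered (assumed convention).
    """
    if flags is None:
        return True                      # timed out: not serving the zone
    rcode = flags & RCODE_MASK
    if rcode not in (NOERROR, NXDOMAIN):
        return True                      # e.g. REFUSED (5) or SERVFAIL (2)
    return not (flags & AA_BIT)          # answered, but not authoritatively

print(looks_lame(0x8400))  # QR + AA, NOERROR
print(looks_lame(0x8005))  # QR, REFUSED
```

This is only a first-pass filter: a server can set AA and still serve stale data, so a match here narrows the search rather than ending it.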
✅ NS Records Pointing to CNAMEs (Illegal Configuration)
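According to RFC 1034, and more explicitly RFC 2181 (section 10.3), the hostname an NS record points to must not be an alias: it must own A or AAAA records itself, which resolvers obtain either by direct lookup or from glue records. If the NS target is instead a CNAME, recursive resolvers may refuse to follow it, resulting in SERVFAIL.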
🔴 Example of an invalid CNAME in NS records:
example.com. 3600 IN NS ns1.example.net.
ns1.example.net. 3600 IN CNAME some.otherdomain.com.
While Cloudflare (1.1.1.1) or Google (8.8.8.8) may attempt to resolve the CNAME and continue, BIND9-based resolvers used by ISPs will fail with SERVFAIL. This leads to confusion, as users report that “Google DNS works, but my ISP’s DNS is broken.”
This discrepancy often causes frustration for ISPs, as users blame them for what is actually a misconfiguration at the authoritative level.
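When a complaint like this comes in, it helps to check the collected records mechanically. The following sketch takes records as (owner, type, rdata) tuples and reports every NS target that is also the owner of a CNAME. The tuple layout and the A record for ns2 are hypothetical; how the records are gathered (dig output, zone transfer, API) is left open.

```python
def _norm(name):
    """Compare names case-insensitively and without the trailing dot."""
    return name.lower().rstrip(".")

def ns_targets_that_are_aliases(records):
    """Return NS targets that are themselves CNAME owners, sorted by name."""
    cname_owners = {_norm(owner) for owner, rtype, _ in records if rtype == "CNAME"}
    ns_targets = {_norm(rdata) for _, rtype, rdata in records if rtype == "NS"}
    return sorted(ns_targets & cname_owners)

records = [
    ("example.com.", "NS", "ns1.example.net."),
    ("example.com.", "NS", "ns2.example.net."),
    ("ns1.example.net.", "CNAME", "some.otherdomain.com."),
    ("ns2.example.net.", "A", "192.0.2.53"),  # hypothetical, for contrast
]
print(ns_targets_that_are_aliases(records))
```

A non-empty result gives the NOC something concrete to send to the domain owner instead of a generic "your DNS is broken".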
✅ Outdated or Differing Glue Records in the Registrar
Glue records are address records for a domain’s nameservers that are submitted through the registrar and published in the parent zone. If they differ from the addresses the authoritative zone publishes for the same nameserver names, resolvers may send queries to incorrect or unreachable servers, causing SERVFAIL.
🔴 Example of differing glue records:
At the registrar:
example.com. 3600 IN NS ns1.example.com.
ns1.example.com. 3600 IN A 192.0.2.1 ; Old IP
But in the zone file:
example.com. 3600 IN NS ns1.example.com.
ns1.example.com. 3600 IN A 203.0.113.1 ; New IP
Resolvers may cache outdated glue records, leading to intermittent failures that are extremely difficult to debug.
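Comparing the two views side by side makes this kind of drift visible. The sketch below assumes the glue addresses and the in-zone addresses have already been collected into dictionaries keyed by nameserver name; the function name and report layout are illustrative choices.

```python
import ipaddress

def _addr_set(addresses):
    """Normalise textual IPs so equivalent spellings compare equal."""
    return {str(ipaddress.ip_address(a)) for a in addresses}

def glue_mismatches(glue, zone):
    """Compare parent-side glue with the addresses served by the zone itself.

    Both arguments map a nameserver name to an iterable of IP strings.
    """
    report = {}
    for name in sorted(set(glue) | set(zone)):
        parent = _addr_set(glue.get(name, ()))
        child = _addr_set(zone.get(name, ()))
        if parent != child:
            report[name] = {
                "only_in_glue": sorted(parent - child),
                "only_in_zone": sorted(child - parent),
            }
    return report

glue = {"ns1.example.com": ["192.0.2.1"]}      # what the parent zone hands out
zone = {"ns1.example.com": ["203.0.113.1"]}    # what the zone file says
print(glue_mismatches(glue, zone))
```

An empty report means the two sides agree; anything else points at the registrar record that needs updating.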
- Recursive Resolver Failures
✅ Time-Outs and Unreachable Authoritative Servers
If all authoritative nameservers for a domain become unreachable, recursive resolvers will retry until timeouts occur, eventually returning SERVFAIL.
Causes include:
- DDoS attacks on authoritative nameservers.
- Firewall misconfigurations blocking queries.
- Servers being offline or misconfigured.
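To tell a slow server from a dead one, a NOC can probe each authoritative address directly with a short timeout. The following is a rough sketch using only the standard library: it hand-encodes a non-recursive SOA query and returns the RCODE, or None when nothing comes back. It handles IPv4 and UDP only, and the function names and the 2-second default are assumptions, not recommended values.

```python
import socket
import struct

def build_query(name: str, qtype: int = 6, msg_id: int = 0x1234) -> bytes:
    """Encode a minimal non-recursive DNS query (QTYPE 6 = SOA, class IN)."""
    header = struct.pack("!HHHHHH", msg_id, 0x0000, 1, 0, 0, 0)  # RD=0, 1 question
    labels = [label for label in name.rstrip(".").split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)

def probe(server_ip: str, name: str, timeout: float = 2.0):
    """Return the reply RCODE as an int, or None if no usable reply arrived."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable networks
            return None
    if len(reply) < 12:
        return None
    return struct.unpack("!HH", reply[:4])[1] & 0x000F

# Example call (performs real network I/O, so it is left commented out):
# print(probe("192.0.2.1", "example.com"))
```

If every listed server returns None from several vantage points, the problem is reachability rather than zone content.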
✅ Circular or Looping NS Delegations
If resolving a zone’s nameserver names ultimately requires resolving the zone itself, the resolver has no starting point unless glue records break the cycle.
🔴 Example of a circular delegation:
example.com. 3600 IN NS ns1.loop.net.
loop.net. 3600 IN NS example.com.
In this case, example.com depends on loop.net, which in turn depends on example.com. Resolvers limit how far they will chase such a dependency chain, so instead of looping forever they give up and return SERVFAIL.
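Dependencies like this can be checked offline once the delegation data is known. The sketch below models "zone X needs zone Y to resolve its nameserver names" as a directed graph and searches it depth-first for a cycle. The input format is an assumption, and glue records, which can break a cycle in practice, are deliberately ignored.

```python
def find_dependency_cycle(depends_on):
    """Return one dependency cycle as a list of zones, or None.

    `depends_on` maps a zone to the zones that host its nameserver names.
    """
    visiting, done = [], set()

    def visit(zone):
        if zone in done:
            return None
        if zone in visiting:                       # back edge: cycle found
            return visiting[visiting.index(zone):] + [zone]
        visiting.append(zone)
        for dep in sorted(depends_on.get(zone, ())):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(zone)
        return None

    for zone in sorted(depends_on):
        cycle = visit(zone)
        if cycle:
            return cycle
    return None

deps = {"example.com": {"loop.net"}, "loop.net": {"example.com"}}
print(find_dependency_cycle(deps))
```

The returned list names every zone in the loop, which is usually enough to identify the delegation that has to change.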
- DNSSEC Misconfigurations That Cause SERVFAIL
On validating resolvers, a DNSSEC validation failure results in SERVFAIL, because the resolver rejects any response that fails its cryptographic checks.
🔴 Common DNSSEC failures leading to SERVFAIL:
- Expired RRSIG signatures (not automatically renewed).
- Incorrect DS records at the registrar (forgot to update after a key rollover).
- Clock skew between the signing server and the resolver, so that otherwise valid signatures appear expired or not yet valid.
- Missing or corrupt Zone Signing Keys (ZSKs) in the authoritative zone.
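The first and third items both come down to the RRSIG validity window. As a minimal sketch, the function below checks a signature's inception and expiration timestamps, in the YYYYMMDDHHMMSS form that dig prints, against a given clock. The timestamps and the skew allowance are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

def parse_rrsig_time(stamp: str) -> datetime:
    """Parse the YYYYMMDDHHMMSS form used in RRSIG presentation format (UTC)."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def rrsig_status(inception: str, expiration: str, now: datetime,
                 skew: timedelta = timedelta(0)) -> str:
    """Classify a signature window as seen by a resolver whose clock reads `now`.

    Note that dig prints the expiration field before the inception field.
    """
    start = parse_rrsig_time(inception) - skew
    end = parse_rrsig_time(expiration) + skew
    if now < start:
        return "not-yet-valid"
    if now > end:
        return "expired"
    return "valid"

# Hypothetical signature window and resolver clock.
now = datetime(2025, 3, 12, tzinfo=timezone.utc)
print(rrsig_status("20250301000000", "20250311000000", now))
```

Running the same check with the resolver's own clock instead of a trusted one quickly shows whether skew, rather than a missed re-signing, is the culprit.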
- Why SERVFAIL is a Nightmare for NOCs
Diagnosing SERVFAIL is particularly difficult because:
- It does not specify the root cause.
- Some resolvers (Google, Cloudflare) try to “work around” misconfigurations, while others (BIND9, Unbound) follow stricter rules.
- The resulting inconsistency generates user complaints whenever an ISP’s DNS fails but Google’s DNS succeeds.
- Misconfigurations often lie at the authoritative level, meaning the ISP cannot directly fix the issue.
This results in endless troubleshooting for NOCs, often requiring:
- Querying multiple resolvers (8.8.8.8, 1.1.1.1, ISP’s resolver) to compare responses.
- Checking authoritative nameservers manually.
- Testing DNSSEC validation.
- Using tools like dig +trace to follow delegation paths.
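The first of these steps lends itself to a small triage helper. The sketch below assumes the per-resolver results have already been collected (for example with dig) into a dictionary of RCODE names, with None standing for a timeout. The category labels are invented for illustration.

```python
def triage(results):
    """Summarise which resolvers fail.

    `results` maps a resolver label to an RCODE name, or None for a timeout.
    Returns a (category, failing_resolvers) tuple.
    """
    failing = sorted(r for r, rc in results.items() if rc in ("SERVFAIL", None))
    if not failing:
        return ("none-failing", [])
    if len(failing) == len(results):
        # Everyone fails: suspect the authoritative servers or the DNSSEC chain.
        return ("all-failing", failing)
    # Only some fail: suspect a misconfiguration that lenient resolvers tolerate.
    return ("partial", failing)

observed = {"8.8.8.8": "NOERROR", "1.1.1.1": "NOERROR", "isp-resolver": "SERVFAIL"}
print(triage(observed))
```

A "partial" result matches the "Google works, my ISP does not" pattern and points toward the strictness differences described above.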
✅ Conclusion
SERVFAIL is one of the most challenging DNS issues to diagnose, often leading to frustration for ISPs and users alike. Causes include lame delegations, NS records that point to CNAMEs, outdated glue records, unreachable or circularly delegated nameservers, and DNSSEC failures.
Because different DNS resolvers handle errors differently, users often wrongly blame their ISP when Google DNS works but their provider’s resolver does not. The key to resolving SERVFAIL issues efficiently is thorough analysis, logging, and monitoring of DNS behavior.
For ISPs and enterprises, ensuring proper DNS configurations, proactive monitoring, and robust troubleshooting workflows is essential to minimize disruptions and ensure seamless connectivity.