How Wi-Fi engineers actually troubleshoot - not what the textbook describes
The methodology is correct. The field is messier. Here's what 15 years of 802.11 troubleshooting actually looks like - scope first, controller second, Wireshark last.
— Shankar K. · field notes, not textbook excerpts
CWNA teaches: identify → discover scope → define causes → narrow down → action plan → verify → document.
The field teaches: figure out if it's actually a Wi-Fi problem first.
The real first question
Before touching a controller, before opening Wireshark, before anything - ask one question: does the same problem happen on a wired connection?
If the user can plug in and reproduce the same problem, you just saved yourself an hour. DNS failure, firewall rule, VLAN misconfiguration, upstream outage - all of these look like "Wi-Fi problems" to users because Wi-Fi is the only connection they ever see. This is the fastest triage step in wireless troubleshooting and the one most engineers skip when they're eager to dive into the controller.
The scope question that cuts resolution time in half
Once you've confirmed it's a Wi-Fi issue, scope it before doing anything else. Scope tells you where to start - and more importantly, where not to waste time.
| SYMPTOM | LIKELY CAUSE | WHERE TO START |
|---|---|---|
| One user, one device | Client-side | Device, driver, saved profile |
| One user, all their devices | Credential or account issue | RADIUS logs, user account |
| Multiple users, one area | AP or RF in that zone | That specific AP's status |
| Multiple users, everywhere | Infrastructure failure | Controller, DHCP, RADIUS, upstream |
A single user with a single device failing is almost never an AP problem. Don't start with the AP.
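The scope table reduces to a small lookup. A minimal Python sketch, assuming the scope has already been boiled down to three yes/no answers; `Scope` and `where_to_start` are illustrative names, not part of any controller API:

```python
from dataclasses import dataclass


@dataclass
class Scope:
    multiple_users: bool    # more than one user reporting the problem
    all_user_devices: bool  # single user only: do all of their devices fail?
    everywhere: bool        # multiple users only: site-wide rather than one area?


def where_to_start(scope: Scope) -> tuple[str, str]:
    """Return (likely cause, where to start), mirroring the scope table."""
    if not scope.multiple_users:
        if scope.all_user_devices:
            return ("Credential or account issue", "RADIUS logs, user account")
        return ("Client-side", "Device, driver, saved profile")
    if scope.everywhere:
        return ("Infrastructure failure", "Controller, DHCP, RADIUS, upstream")
    return ("AP or RF in that zone", "That specific AP's status")


cause, start = where_to_start(
    Scope(multiple_users=False, all_user_devices=False, everywhere=False)
)
print(f"{cause}: start with {start}")
```

The point of writing it down is the same as the table's: the branch for one user with one device never reaches the AP.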
The 80% that never needs Wireshark
Most wireless tickets - roughly 80% in a well-run enterprise environment - resolve before a PCAP is ever opened.
An outdated password is the single most common cause of "I can't connect." If the network password changed recently and the user hasn't updated it, they'll hit a 4-way handshake timeout on every attempt. The fix is a 30-second conversation, not a packet capture.
Driver problems are responsible for a surprising percentage of intermittent disconnections, especially after OS updates. Windows Update frequently touches wireless drivers without the user noticing. Check Device Manager first.
A stale saved profile is next: the device remembers the network but has a corrupted or mismatched profile from a previous security configuration. Forget the network and reconnect fresh. This resolves in under two minutes.
LEDs don't lie. An AP cycling through boot states explains why an entire floor can't connect. Check the controller - if the AP isn't registering, check PoE, then uplink, then hardware.
DHCP pool exhaustion has a classic symptom: the client associates successfully and shows "connected," but has no internet. Check the DHCP server - if the pool is full, no new clients get addresses. Reduce the lease time or expand the pool.
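Why a shorter lease helps can be shown with rough arithmetic. The sketch below is a worst-case estimate, assuming every arrival is a new device and no client releases its address before the lease expires; the subnet, the reserved count, and the arrival rate are made-up illustration values:

```python
import ipaddress


def usable_addresses(cidr: str, reserved: int = 0) -> int:
    """Addresses available for leases: subnet size minus network, broadcast, and reserved."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - 2 - reserved, 0)


def leases_held(new_clients_per_hour: int, lease_hours: int) -> int:
    """Worst-case leases outstanding if no client releases its address early."""
    return new_clients_per_hour * lease_hours


# Hypothetical guest network: a /24 with 10 addresses reserved, 60 new devices per hour.
pool = usable_addresses("10.20.0.0/24", reserved=10)
for lease_hours in (8, 2):
    held = leases_held(60, lease_hours)
    status = "exhausted" if held > pool else "ok"
    print(f"lease={lease_hours}h held={held} pool={pool} -> {status}")
```

With these numbers an 8-hour lease needs 480 addresses against a pool of 244, while a 2-hour lease needs 120. Real networks see returning devices and early releases, so treat the result as an upper bound.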
The user moved to a different part of the building. The signal is marginal. Their device is sticky and won't roam. A quick RSSI check on the controller tells you this in under a minute.
What the controller dashboard actually tells you
The wireless controller is your first stop on almost every real ticket - not Wireshark. Here's what to look for and what it means.
Look up the client's MAC and read the connection event history. You'll see association attempts, authentication outcomes, deauths with reason codes, and roaming events. This answers 60% of remaining questions after basic triage - and most engineers don't use it enough.
High client count + high channel utilization = capacity problem, not coverage problem. These two are often confused. Coverage problems need more APs. Capacity problems need load balancing or fewer SSIDs.
Retry rates above 15–20% signal something wrong - high co-channel interference (CCI), marginal RSSI for some clients, or a specific problematic client dragging the AP down. The controller shows you this without a capture.
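If the controller only exposes raw counters, the retry rate is a single division. A small sketch using the 15–20% band from above; the counter names and the three labels are assumptions, since every vendor names these fields differently:

```python
def retry_rate(tx_frames: int, retried_frames: int) -> float:
    """Fraction of transmitted frames that were retries."""
    if tx_frames <= 0:
        raise ValueError("tx_frames must be positive")
    return retried_frames / tx_frames


def assess_retries(rate: float, warn: float = 0.15, bad: float = 0.20) -> str:
    """Bucket a retry rate using the 15-20% rule of thumb."""
    if rate >= bad:
        return "investigate"
    if rate >= warn:
        return "watch"
    return "normal"


# Hypothetical counters read from an AP radio: 1000 frames sent, 180 of them retries.
rate = retry_rate(tx_frames=1000, retried_frames=180)
print(f"{rate:.0%} -> {assess_retries(rate)}")
```

Comparing the per-client rate against the per-radio rate is one way to tell a single bad client from a channel-wide problem.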
For 802.1X networks, always check RADIUS logs before assuming the problem is on the wireless side. An Access-Reject from the RADIUS server looks identical to a Wi-Fi association failure from the user's perspective.
What "Wi-Fi is slow" actually means
"Slow" is the most common complaint and the least specific. Translate it before doing anything.
When connecting is slow, authentication or DHCP is taking too long - it's not a throughput problem. Look at EAPOL timing, RADIUS response time, DHCP server latency.
When everything is slow after connecting, actual throughput is degraded. Look at retry rates, MCS index, channel utilization, and whether the client is in a coverage hole.
When only one application is slow, it's probably not Wi-Fi at all. Suspect a QoS policy, firewall rule, DNS resolution time, or application server issue. Eliminate the network with ping/iperf first.
When it's slow only at certain times, it's almost always a capacity or channel utilization issue. Pull historical utilization data from the controller for the reported time window.
What "keeps disconnecting" actually means
"Disconnecting" has four completely different root causes depending on the pattern.
Likely a roaming issue. Sticky client drops below usable RSSI, deauths, reconnects. Check RSSI at time of deauth in the controller event log.
An idle timeout or session timeout is kicking the client. Check idle-timeout settings - especially for IoT and mobile devices.
Could be DFS radar causing a channel change, AP firmware bug, or a driver issue. Start with the AP event log - look for unexpected restarts or channel changes.
Roaming failure. Client transitioning between APs but handoff isn't clean - PMKID caching failure, FT misconfiguration, or sticky client waiting too long to roam.
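One quick way to separate a timer-driven kick from the other patterns is to check how regular the disconnects are. The sketch below assumes you have copied the disconnect timestamps (in seconds) out of the client event log, and it assumes that timeout-driven drops recur at near-constant intervals while roaming, DFS, and driver problems do not; the 5% tolerance is an arbitrary starting point:

```python
from statistics import mean, pstdev


def looks_periodic(disconnect_times: list[float], tolerance: float = 0.05) -> bool:
    """True if the gaps between disconnects are nearly constant.

    "Nearly constant" here means the standard deviation of the gaps
    is within `tolerance` of their mean.
    """
    if len(disconnect_times) < 3:
        return False  # need at least two gaps to compare
    ordered = sorted(disconnect_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average = mean(gaps)
    if average <= 0:
        return False
    return pstdev(gaps) <= tolerance * average


# Hypothetical timestamps: drops roughly every 30 minutes vs. scattered drops.
print(looks_periodic([0, 1800, 3601, 5400]))
print(looks_periodic([0, 300, 4000, 4100]))
```

A periodic result points at idle-timeout or session-timeout settings first; an irregular one sends you back to RSSI at the time of deauth and the AP event log.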
When to actually open Wireshark
The remaining 20% of tickets - the ones that don't resolve from basic triage and the controller dashboard - are where protocol analysis earns its value.
Open Wireshark when:
- The controller says connected, but the user says it's not working
- Authentication appears to succeed but the client has no IP
- The problem is intermittent and controller logs don't show a clear failure
- You need to prove a specific failure mode to a vendor or escalation team
- The reason code alone isn't enough to explain the failure
For what to look for once Wireshark is open → 802.11 Connection Failure PCAP Guide
The tools engineers actually use daily
Aruba Central, Cisco Catalyst Center, Juniper Mist, Meraki, Ruckus One - whichever platform you're on, this is where you spend the majority of troubleshooting time. Know where the client event logs live. Know how to filter by MAC address. Know how to pull AP event history.
Unglamorous. Indispensable. A ping to the default gateway confirms L3 reachability in five seconds. iperf between the client and a wired server separates a "slow Wi-Fi" complaint from a "slow application" complaint in under a minute.
A quick channel scan in the problem area tells you: what APs are visible, what RSSI the client sees, what channels are in use, whether there's CCI. Faster than pulling a survey tool for quick triage.
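netsh wlan show interfaces (Windows) and airport -I (macOS) show the current SSID, BSSID, signal strength, channel, and auth type. On recent macOS releases the airport utility is deprecated; sudo wdutil info reports similar details. The output often reveals a sticky client problem immediately.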
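When you collect this output from many users, a few lines of Python turn it into fields you can compare. The sketch below parses a pasted sample rather than running the command; the sample values are invented, the field names assume an English-language Windows install, and the percent-to-dBm conversion is the linear approximation Microsoft documents for signal quality (0% is about -100 dBm, 100% about -50 dBm), so treat the result as a rough figure:

```python
SAMPLE = """\
There is 1 interface on the system:

    Name                   : Wi-Fi
    State                  : connected
    SSID                   : CorpNet
    BSSID                  : 3c:28:6d:aa:bb:cc
    Authentication         : WPA2-Enterprise
    Channel                : 36
    Signal                 : 40%
"""


def parse_netsh(output: str) -> dict[str, str]:
    """Split 'Key : Value' lines into a dict, skipping headers and blank lines."""
    fields = {}
    for line in output.splitlines():
        # Partition on the first colon only, so the colons inside a BSSID survive.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def quality_to_dbm(quality_percent: int) -> float:
    """Approximate RSSI from Windows signal quality (linear, -100 to -50 dBm)."""
    return quality_percent / 2 - 100


info = parse_netsh(SAMPLE)
quality = int(info["Signal"].rstrip("%"))
print(info["BSSID"], info["Channel"], f"{quality_to_dbm(quality):.0f} dBm (approx.)")
```

A client holding a weak signal on a BSSID that belongs to an AP on another floor is the sticky-client signature to look for.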
Most engineers forget to check the RADIUS server logs. These live on the authentication server, not the wireless controller. Most 802.1X failures - certificate expiry, policy mismatch, wrong credentials - are visible here and invisible in the controller.
The unwritten rules
"It's been broken all day" often means "I noticed it two hours ago." The actual failure event might be a 3am AP reboot, a RADIUS cert that expired at midnight, or a DFS channel change at 6am. Always pull the controller event log for the 24 hours before the ticket.
Scale matters but it's not binary. Ten users affected in one room might be one failing AP. Ten users affected across three floors might be one RADIUS server returning errors. Let the scope table do the thinking, not the user count.
The controller dashboard and the PCAP disagree more often than they should. Controller dashboards count an eventual successful re-auth as a successful roam. The PCAP shows the 1.5 seconds of dropped voice call in between. When a user says "the call keeps cutting out for a second" - believe the PCAP over the dashboard.
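The gap the dashboard hides is easy to measure once packet arrival times are exported from a capture. A minimal sketch, assuming a voice stream that normally delivers a packet about every 20 ms and arrival times already converted to integer milliseconds; the 200 ms threshold and the sample values are illustrative:

```python
def find_gaps(arrival_ms: list[int], threshold_ms: int = 200) -> list[tuple[int, int]]:
    """Return (time of last packet before the gap, gap length) for each long silence."""
    gaps = []
    for previous, current in zip(arrival_ms, arrival_ms[1:]):
        silence = current - previous
        if silence > threshold_ms:
            gaps.append((previous, silence))
    return gaps


# Hypothetical arrivals: steady 20 ms spacing with one 1.5 s hole during a roam.
rtp_arrivals = [0, 20, 40, 1540, 1560, 1580]
print(find_gaps(rtp_arrivals))  # [(40, 1500)]
```

Lining up the gap start time against the roam event in the controller log shows whether the "successful roam" and the dropped audio are the same moment.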
"Intermittent" means you haven't found the pattern yet. Ask: same time of day? Same location? Same application? Same device type? A problem that "just happens randomly" almost always ties to a roaming event, a DHCP lease renewal, or a RADIUS session timeout.
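The four questions (time of day, location, application, device type) can be asked of a pile of reports at once. The sketch below assumes each report has been jotted down as a small dict; the field names and sample reports are hypothetical:

```python
from collections import Counter


def dominant_values(reports, fields=("hour", "location", "application", "device_type")):
    """For each field, return the most common value and the share of reports that have it."""
    summary = {}
    for field in fields:
        counts = Counter(report[field] for report in reports if field in report)
        if counts:
            value, hits = counts.most_common(1)[0]
            summary[field] = (value, hits / len(reports))
    return summary


# Hypothetical incident notes collected from four complaints.
reports = [
    {"hour": 14, "location": "3F east", "application": "Teams", "device_type": "iPhone"},
    {"hour": 14, "location": "3F east", "application": "Outlook", "device_type": "laptop"},
    {"hour": 14, "location": "2F lobby", "application": "Teams", "device_type": "iPhone"},
    {"hour": 9, "location": "3F east", "application": "Zoom", "device_type": "iPhone"},
]
for field, (value, share) in dominant_values(reports).items():
    print(f"{field}: {value} ({share:.0%} of reports)")
```

A field where one value covers most reports is the pattern worth chasing first; a field with no dominant value can usually be dropped from the investigation.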
Rebooting an AP clears transient state and often resolves the immediate complaint - but if the root cause is still there, the problem comes back. Document what you rebooted and why. Check whether it recurs within 24 hours.
When to escalate and to whom
| CONDITION | ESCALATE TO |
|---|---|
| AP won't register after reboot | Network ops / hardware swap |
| RADIUS Access-Reject with no obvious cause | Identity / security team |
| DHCP pool exhausted repeatedly | Server / infrastructure team |
| Firmware bug confirmed | Vendor TAC |
| Intermittent interference, no identifiable source | RF engineering / spectrum analysis |
| Problem persists across AP replacement | Controller config audit |
The certification vs. reality gap
CWNA gives you the vocabulary, the framework, and the protocol knowledge. That's real value - it makes you dangerous in the right way when you get to the deep end. But the cert teaches you the what of every failure mode. The field teaches you which ones you'll actually see, in what order to check them, and which three questions to ask the user before touching anything.
The gap closes with every ticket. The engineers who close it fastest are the ones who stay curious about what the frames are actually saying - even when the ticket resolves at the controller level. That habit is what separates someone who can troubleshoot from someone who can troubleshoot and explain exactly why the fix worked.
When basic triage and the controller don't give you the answer - the PCAP will.
— Shankar K., Wi-Fi engineer, Irving TX
Building WiFi Analyser V2 · CWNA-109 in progress · one post every two weeks