It's been more than two months since the ransomware attack on KQED in San Francisco on June 15. Since then, every workday has felt like trying to run underwater. That's because life without normal network services in a large modern broadcast operation (about 350 employees) is, to say the least, a major challenge.
Many of my daily activities are impossible, while others require reinventing the wheel or doing things the old way: BC, before computers.
After we consulted with infection experts and the FBI, KQED's initial decision not to pay the ransom and to proceed with a full recovery on our own was confirmed as the best approach. Without going into too many details of exactly how the attack was able to succeed, I will try to give you an idea of what was affected and how we got through the problems.
My first clue that something was wrong was a call from the Burk remote control informing me our Sacramento station had no audio. The KQEI(FM) site audio is fed over an Intraplex by an MPLS data line. We occasionally get short line drop-outs, so my first response was to call our master control to get them to connect via ISDN, if they had not already done so.
I also turned on my little radio to make sure our main San Francisco station had audio. It did. I could not get through on the MCR hot line and quickly discovered I could not call any phones at KQED. We have a VOIP phone system, and it was down hard.
OK, I thought, that all made sense: The MPLS and the VOIP phone system share IP services. I reasoned there was a network issue at the studio.
About this time, my colleague Steve Pinch, our FM engineering IT expert, called to let me know there was a virus attack on the network, and he was headed back to work. He wouldn't be able to go home for a full night's sleep for the next several days.
Not fully trusting VOIP, we have a second hot line into MCR that provides a dial tone from the telco central office. Once I got through, I found the ISDN was connected and audio had returned to our Sacramento station. I could now assist the announcer in getting traffic reports on the air by using our Telos phone system, which was connected to the telco central office with a PRI and was working normally.
Our Comrex BRIC-Link for getting traffic reports via AOIP was dead. The traffic reports would have to come from one of our talk show lines for the next three days. It didn't sound great, but it worked. Our multiple streaming audio feeds to various streaming providers were down. These would not be back until we cautiously began restoring the most critical network services, 12 hours after the shutdown.
Steve and the IT staff immediately threw themselves into the problem, but I didn't have to be at work until the next morning. That gave me time to ponder what the changes would be and what work-arounds we would need to keep going.
The next morning I was met with handwritten signs scattered throughout the building, warning people not to use their computers or phones, and saying that information updates could be found on the whiteboard in the central atrium. Except for the fact we were on the air with normal programming, it felt like we were back in the 19th century.
The IT staff, assisted by Steve from radio engineering, worked at a feverish pace to keep what wasn't infected safe and to restore services. Virtually all ports on the network switches were immediately shut off when the infection was detected. Until we knew exactly what happened and how it spread through the network, "disconnect everything" was the philosophy, in the hope the infection could be kept from doing any more damage.
This meant several things were turned off that weren't infected and couldn't be infected. But as we all know, it's better to be safe than sorry.
That is why we lost the traffic report BRIC-Link, program audio to the Sacramento transmitter and a lot of other things. The Intraplex uses IP for its connection, and the ports were shut off. That's also why ISDN was able to keep going for the next three days: it needs no IP. Higher priorities came first, and then we could start getting other things going again, port by port. The highest priority was staying on the air with normal programming. On air and news came first, and finding ways to keep them going was a challenge for both the engineers and the news people.
We hired outside consultants to assist us in protecting what was not infected, while at the same time getting services going again. The FBI was on site to collect evidence of a crime. They copied the entire contents of several PCs, including both those that had been left on and those that had been power cycled. Other PCs, like mine, had been "touched" earlier by the malware, but had been turned off before the actual attack. All of these details helped determine just how the bad guys worked and perhaps gave clues as to who the culprit was.
In many ways, we were lucky. We have a great IT department, good backups and the resources, both human and financial, to take the crisis head-on and keep going.
Our main bit of luck was that our on-air broadcast systems for both TV and radio were not hit by the attack. More about this later, but for now let's just say it could have been a lot worse, were it not for several good decisions made to keep critical systems isolated.
Dalet Galaxy, our news and production system, was not so lucky. It was hit, and in the end, the servers and clients needed to be completely reloaded. The Dalet database and file storage, including all audio, stories and metadata, were not hit, but they would not be accessible for four weeks.
We were also lucky that our Public Radio Satellite System equipment was unaffected and on the same network as our on-air Dalet system. We were still receiving both live and non-real-time programming. Also, our Telos phone system was still working for the two-hour daily talk show and news interviews. The assistant producer call screening program was not working. It needed to link to the Telos on a different VLAN, and that link was disconnected. Communication with the hosts became primitive, and the next-event countdown clock was gone.
Why were some computers affected and not others? The infection agent gleaned a system password and found its way into the Active Directory server. From there, it found and attacked every PC that was on Active Directory and was powered up. All of this we would figure out later.
How did our on-air Dalet keep going? Dalet Radio Suite, our on-air system, was not on the KQED Active Directory. By design, Radio Suite had a separate domain and its own network switch. The ports on the network switch for Dalet Radio Suite were never turned off, but its link to the main KQED network was removed. It truly became an island with no outside connectivity.
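The reasoning above can be put into a few lines of Python. This is only a toy model, not a description of KQED's actual inventory: the host names and the domain labels `corp` and `onair` are made up for illustration. It shows why a machine on a separate domain, or one that was powered off, stayed out of reach of an attack that spread through one directory.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    domain: str       # directory domain the host is joined to
    powered_on: bool  # state at the time of the attack


def exposed(hosts, compromised_domain):
    """Names of hosts reachable through the compromised directory:
    joined to that domain AND powered on."""
    return [
        h.name
        for h in hosts
        if h.domain == compromised_domain and h.powered_on
    ]


# Hypothetical inventory; names and domains are illustrative only.
HOSTS = [
    Host("galaxy-edit-01", "corp", True),
    Host("desk-engineer", "corp", False),      # switched off before the attack
    Host("radiosuite-air-01", "onair", True),  # separate domain: the island
]

print(exposed(HOSTS, "corp"))
```

Running the same check against your own inventory, one domain at a time, is a quick way to see how much would be at risk if a single directory account were compromised.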
Most production and news work was done with Dalet Galaxy, which was on the KQED Active Directory. On Thursday morning before the infection, we had about 50 working Galaxy servers and clients. By late Thursday afternoon, we had zero. However, each production room had one on-air Radio Suite computer, mostly used by producers as a utility PC and call screener. Despite their reduced capacity, these Radio Suite PCs became the new main production computers.
The Sadie computers we use for craft editing were not infected, but with the network disconnected, getting audio to and from the computers became a real challenge. Dalet Galaxy computers in the news edit rooms were replaced with Radio Suite PCs, and several were added in the newsroom for editors and other news use.
HERE'S WHAT ELSE WAS AFFECTED
The Nautel Importer; the Arctic Palm HD and RDS scrolling information manager; the main shared production utility computer; the on-air utility computer; the main VOIP phone system computer; the computer at the KQED transmitter; most desktop computers on the network; and the building security system. The security system is not a broadcast PC, but without it, access rights could not be changed to get people into areas they needed to reach in order to fix network problems.
There were other computers and services that were not infected, but since we could no longer connect them to the KQED network, we could no longer use them. Those included our new Telos VX phone system; the PC that received Associated Press and Bay City News wires and passed them on to Dalet; audio file converters and file transfer between systems; the Burk remote control and the Burk Autopilot application; the EAS CAP network connection; and the transmitter status and control via IP. Our NTP server couldn't be reached by 50+ devices. The Comrex and Tieline devices were off for several weeks until we turned their network ports back on.
We quickly rebuilt the Nautel Importer, since we are part of the Broadcaster Traffic Consortium. It was a bit of a challenge to get it going on 64-bit Windows 7, but after several days we got it going and back in use. In the meantime, we discovered we could feed the Harris Importer to the Nautel Exporter, and it worked fine.
THE SLOW ROAD TO THE NEW NORMAL
IT set up a full-time help desk in the main atrium, and this is where staff could go to get laptops connected to the internet-only Wi-Fi.
At first, smartphones were also connected to the Wi-Fi, but the Wi-Fi slowed down so much that they had to be removed. The LTE at our building is ultra-fast and a better choice for the phones.
There were also several printers set up, and requests for important files to be retrieved from the backups could be made there, as well. After four weeks, the network printers were added to the Wi-Fi, and people could print directly from their laptops.
Almost immediately after the attack, most people installed Slack, an instant messenger for business, on their smartphones, and that became the message service in place of email in the days after the attack. Although the main Exchange email server was down, we had a backup email service through Mimecast, which people were able to access in the first week.
Phones returned after two weeks.
After three weeks, most files could be retrieved from the network by request and placed in a Google Drive.
Before the attack, reporters could already get audio into Dalet from the field by using FTP. This became the main source for audio imports. Since the network connection was off for protection, a wireless dongle was added to the PC that runs the audio importing application so it could reach the FTP site over the internet. The Wi-Fi-to-internet link was shared with all of the KQED staff, and at times ran so slowly the app would need to be restarted. Even when it was running, the change from a direct gigabit connection to a Wi-Fi connection that could drop to 1 Mbps, combined with the huge increase in use, caused many extreme slowdowns.
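For anyone scripting their own transfers over a link this unreliable, automatic retries can save a lot of manual restarts. The sketch below is not how our importing application works; it is a generic retry-with-backoff pattern in Python, and `run_with_retries` and `flaky_transfer` are hypothetical names. A real FTP job built on Python's `ftplib` would likely want to catch `ftplib.all_errors` rather than only `OSError`.

```python
import time


def run_with_retries(task, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or the attempts run out.

    Waits base_delay, then 2x, then 4x ... between tries, so a
    congested link gets some breathing room before the next attempt.
    """
    for attempt in range(attempts):
        try:
            return task()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)


# Stand-in for a real transfer: stalls twice, then succeeds.
calls = {"n": 0}


def flaky_transfer():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("link stalled")  # TimeoutError is an OSError
    return "interview.wav"


delays = []  # record the waits instead of actually sleeping
result = run_with_retries(flaky_transfer, sleep=delays.append)
print(result, delays)
```

Passing `sleep=delays.append` is only there so the example runs instantly; in real use you would leave the default and let the script wait.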
One work-around was for the production people to copy files in real time over an audio cable between PCs in order to meet a deadline. At first, the use of USB drives to transfer files was prohibited. These can be used to spread viruses, and we weren't taking any chances. Later some USB drives were used, but these had to be scanned by the IT department.
NETWORK SERVICES RETURN
After patching network routers and switches, new filters could be created to limit what devices could be seen on the network. No PC needs to be able to see every other PC on the network, even if it is password protected. As in our case, network accounts can be compromised. To be sure nothing stopped working, the implementation of these filters had to be a careful process.
Needless to say, there were a few surprises when some device would quit working. When that happened, the filter was removed, the device was examined to see what it needed in order to work, and the filter was modified and reapplied.
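The idea behind these filters is default-deny: a device can talk to another device only if a rule says so. The following Python sketch is an illustration of that idea and of the "what would break?" check we went through by trial and error. It is not our switch configuration; the subnets, ports and function names are assumptions made up for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    """One permitted flow: source subnet -> destination subnet on one port."""
    src: str
    dst: str
    port: int


def is_allowed(rules, src_ip, dst_ip, port):
    """Default-deny: a flow passes only if some rule explicitly covers it."""
    s, d = ip_address(src_ip), ip_address(dst_ip)
    return any(
        s in ip_network(r.src) and d in ip_network(r.dst) and port == r.port
        for r in rules
    )


def blocked_flows(rules, observed_flows):
    """Observed flows the filter would block, to review before rollout."""
    return [f for f in observed_flows if not is_allowed(rules, *f)]


# Hypothetical addressing, for illustration only.
RULES = [
    Rule("10.1.20.0/24", "10.1.30.10/32", 21),  # newsroom PCs -> FTP import host
    Rule("10.1.40.0/24", "10.1.50.5/32", 123),  # devices -> NTP server
]

FLOWS = [
    ("10.1.20.15", "10.1.30.10", 21),
    ("10.1.40.7", "10.1.50.5", 123),
    ("10.1.20.15", "10.1.20.16", 445),  # PC-to-PC file sharing: stays blocked
]

blocked = blocked_flows(RULES, FLOWS)
print(blocked)
```

Each blocked flow is then a decision: either it is something a device genuinely needs, in which case a rule is added, or it is exactly the kind of PC-to-PC visibility the filter exists to stop.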
File importing and exporting was reconfigured and placed back on the high-speed network, and that made life easier for News and Production. The call screening software was re-enabled for our talk show. The rollout of rebuilt PCs started in early August, and people could once again log into a network. The completely rebuilt Dalet Galaxy system returned, and at that point everybody breathed a collective sigh of relief.
This story is by no means over. It will probably be several months before all the noncritical services and utilities return. (I hope to be able to provide an update soon to contrast what "normal" was before the attack and after.)
With the benefit of this experience, there are certain things I'd like you to know and consider:
- Takeaway #1. Have a reliable way of communicating with your MCR. It could be a landline. It could be every operator's cell phone number. It could be a good old two-way radio; we use them at KQED, and having one in our MCR is now a priority.
- Takeaway #2. Have a way to communicate with staff, such as a general voicemail box off site, or a business instant message program, like Slack. Make sure all staff members know how to access it. A cloud-based station wiki could contain lots of "what to do if" documents. Keep it updated.
- Takeaway #3. Have hard copies of critical documents on file, and a copy on an engineering laptop not normally on the network. My network documentation, including all IP addresses and passwords, was not available for three weeks, and neither were transmitter wiring, shipping forms, EAS log masters, time sheets, etc. Don't forget to keep the hard copies updated. I had a copy of all important network, engineering and transmitter files on my home PC, but they were nine years old.
- Takeaway #4. Have work-arounds in place for everything. How do your news people get interviews into their computers and edit them without a network? And then how do they get those files into your on-air system? If any equipment in your air chain requires a network connection to function, have a non-network-dependent device as a backup.
- Takeaway #5. Some staff will be more understanding and adaptable than others. Keeping people up to date about the crisis will go a long way. But there was also a large increase in sweet treats throughout the building in the weeks following the attack.
- Takeaway #6. Keep your critical systems in protected islands! Our radio listeners and TV viewers could never tell there was a problem, as our regular programming kept going.
- Takeaway #7. The staff working on the crisis will work longer and harder than anyone could expect. Keep them fed when they are here, because they won't take the time out to feed themselves, as they should. Don't get in their way, and make sure all staff requests go through managers. Recognize their efforts and thank them regularly.
While we don't know who the villains in this story are, we do know who the heroes were.
First, the IT department led by Michael Kadel. They were the real saviors of the day and the weeks that followed.
Steve Pinch in FM engineering is truly responsible for ensuring that KQED(FM) stayed on the air with normal programming. He worked almost around the clock for days to develop work-arounds for every problem and showed the production and news people how to make them work.
His counterparts in TV, Jay Strauss and Larry Bursten, kept TV on the air.
I also think the production people, the news people and all the staff at KQED deserve credit for their dedication and perseverance in difficult times. We often talk about how we would respond to a disaster like an earthquake. This was a disaster of a different kind and we got through it intact and still going strong. We are a news organization with a mission to serve our listeners, our viewers and our internet audience. We lived up to our mission, and we will apply what we learned to future disaster recovery planning.